Tratamento de sequências

De Lembiotech Wiki

O software MOTHUR, desenvolvido por Patrick Schloss, é composto vários programas agregados (e aprimorados) em um só pacote, de modo a tornar as análises um processo sequencial.

Um bom ponto de partida são os Analysis Examples na página inicial da wiki do Mothur.

Abaixo uma versão adaptada do Sogin data Analysis, juntamente com a descrição de cada comando.

Arquivos úteis

Alguns arquivos podem ajudá-lo na jornada a ser seguida :)

Análises Intra-amostrais

unique.seqs

The unique.seqs command returns only the unique sequences found in a fasta-formatted sequence file and a file that indicates those sequences that are identical to the reference sequence. Often times a collection of sequences will have a significant number of identical sequences. It sucks up considerable processing time to have to align, calculate distances, and cluster each of these sequences individually.

This will generate two files: amazon.names and amazon.unique.fasta. You can now align amazon.unique.fasta and generate a distance matrix. Then you can use that matrix with the newly generated amazon.names file with the names option for the read.dist command.

If you align your unique sequences, filter and screen them, you might be removing bases from the sequences that accounted for differences between the sequences. You can then rerun your sequences through unique.seqs by providing the name option.

unique.seqs(fasta=ARQUIVO.fasta)

align.seqs

The align.seqs command aligns a user-supplied fasta-formatted candidate sequence file to a user-supplied fasta-formatted template alignment. The general approach is to:
find the closest template for each candidate using kmer searching, blastn, or suffix tree searching;
to make a pairwise alignment between the candidate and de-gapped template sequences using the Needleman-Wunsch, Gotoh, or blastn algorithms;
to re-insert gaps to the candidate and template pairwise alignments using the NAST algorithm so that the candidate sequence alignment is compatible with the original template alignment

align.seqs(candidate=ARQUIVO.unique.fasta, template=silva.bacteria.fasta, ksize=9, align=needleman)

filter.seqs

filter.seqs removes columns from alignments based on a criteria defined by the user. For example, alignments generated against reference alignments (e.g. from RDP, SILVA, or greengenes) often have columns where every character is either a '.' or a '-'. These columns are not included in calculating distances because they have no information in them.

By removing these columns, the calculation of a large number of distances is accelerated.

filter.seqs(fasta=ARQUIVO.unique.align, vertical=T)

dist.seqs

The dist.seqs command will calculate uncorrected pairwise distances between aligned sequences. This approach is better than the commonly used DNADIST because the distances are not stored in RAM, rather they are printed directly to a file.

Furthermore, it is possible to ignore "large" distances that one might not be interested in. The command will generate a column-formatted distance matrix that is compatible with the column option in theread.dist command.

The command is also able to generate a phylip-formatted distance matrix.

There are several options for how to handle gap comparisons and terminal gaps.

dist.seqs(fasta=ARQUIVO.unique.filter.fasta, cutoff=0.10, processors=2)

cluster

Once a distance matrix gets read into mothur, the cluster command can be used to assign sequences to OTUs.
Missing distances
Perhaps the second most commonly asked question is why there isn't a line for distance 0.XX. If you notice the previous example the distances jump from 0.003 to 0.006. Where are 0.004 and 0.005?
Mothur only outputs data if the clustering has been updated for a distance.
So if you don't have data at your favorite distance, that means that nothing changed between the previous distance and the next one.
Therefore if you want OTU data for a distance of 0.005 in this case, you would use the data from 0.003.

cluster(name=ARQUIVO.names)

summary.single

The summary.single command will produce a summary file that has the calculator value for each line in the OTU data and for all possible comparisons between the different groups in the group file.

This can be useful if you aren't interested in generating collector's or rarefaction curves for your multi-sample data analysis.

Aqui encontra-se uma lista com parâmetros para diversos índices

summary.single(calc=sobs-chao-ace-jack-bootstrap-shannon-npshannon-simpson-invsimpson-coverage-boneh-logseries-geometric-bstick-nseqs)

rarefaction.single

The rarefaction.single command will generate intra-sample rarefaction curves using a re-sampling without replacement approach. Rarefaction curves provide a way of comparing the richness observed in different samples. Roughly speaking you get the number of OTUs, on average, that you would have been expected to have observed if you hadn't sampled as many individuals.

Although a formula exists to generate a rarefaction curve (see the example calculation), mothur uses a randomization procedure. It can also help you to assess your sampling intensity.

If a rarefaction curve becomes parallel to the x-axis, you can be reasonably confident that you have done a good job of sampling and can trust the observed level of richness. Otherwise, you need to keep sampling.

Rarefaction is actually a better measure of diversity than it is of richness.

Para poucas sequências (200 ou menos) sugiro um valor freq baixo (eg: 1). Para a ordem de 20k sequências um valor de freq=5000 é razoável.

rarefaction.single(freq=1)

Plotando gráficos de rarefação no R

No ambiente de trabalho do R, execute:

data<-read.table(file="c:\\blast\\mothur\\amazon.unique.an.rarefaction", header=T)
plot(x=data$numsampled, y=data$unique, xlab="Number of Tags Sampled",ylab="OTUs", type="l", col="black", font.lab=3)
points(x=data$numsampled, y=data$X0.03, type="l", col="blue")
points(x=data$numsampled, y=data$X0.05, type="l", col="red")
points(x=data$numsampled, y=data$X0.10, type="l", col="green")
legend(x=50000, y=1500, c("unique", "0.03", "0.05", "0.10"), c("black", "blue", "red", "green"))

Análises Interamostrais

libshuff

The libshuff command implements the libshuff method as previously implemented in the programs s-libshuff and libshuff. The libshuff method is a generic test that describes whether two or more communities have the same structure using the Cramer-von Mises test statistic. The significance of the test statistic indicates the probability that the communities have the same structure by chance. Because each pairwise comparison requires two significance tests, a correction for multiple comparisons (e.g. Bonferroni's correction) must be applied.

ATENÇÃO!
If one desires an experiment-wide false detection rate of 0.05, then these significance values need to be less than 0.025 to be considered statistically significant. If either of the significance values are statistically significant, one can safely state that the two communities are significantly different. Therefore, because 0.0081 is smaller than 0.025, the forest and pasture sequence collections are significantly different.

Para executar o comando libshuff são necessários um arquivo de distância em formato Phylip e um arquivo de grupos.

Gerando distâncias em formato phylip e um arquivo de grupos

Para criar um arquivo de grupos são necessários arquivos FASTA de dois ou mais locais\experimentos diferentes (de preferência em um mesmo diretório). No Mothur execute:

make.group(fasta=sample1.fasta-sample2.fasta-sample3.fasta, groups=A-B-C)

Um arquivo .group será criado dentro do diretório especificado

Para criar um arquivo de distância em formato Phylip com as sequências de todos os arquivos:

Disponha dos arquivos FASTA e execute o seguinte comando:

merge.files(input=fileA-fileB-fileC, output=fileABC)

De posse do arquivo unificado,

siga os passos descritos na ordem descrita acima até o passo de filtragem de sequências (unique.seqs, align.seqs e filter.seqs)

Então execute o comando dist.seqs acrescente o parâmentro output=square para gerar um arquivo de saída no formato Phylip (matriz quadrangular). Desta forma:

dist.seqs(fasta=ARQUIVO.unique.filter.fasta, cutoff=0.10, output=square)

Executando o libshuff

Por fim execute:

libshuff (phylip=arquivo.phylip, group=arquivo.groups.)

Venn

The venn command generates a Venn diagram from data provided in a *.shared file. The command can generate diagrams (as SVG files) to compare the richness shared among 2, 3, or 4 groups. The SVG file that can be further modified in a program like Gimp or Adobe Illustrator to scale the areas to be proportional to the richness represented by the region. Options are available to measure richness based on the observed richness or the estimated richness using the chao and sharedchao calculators.

Assim como no libshuff, também são necessários um arquivo de distância em formato Phylip e um arquivo de Grupos.

cluster(phylip=ARQUIVO.dist, cutoff=0.10)  
make.shared(list=ARQUIVO.fn.list, group=ARQUIVO.groups)
venn(shared=ARQUIVO.fn.shared)

tree.shared

The tree.shared command will generate a newick-formatted tree file that describes the dissimilarity (1-similarity) among multiple groups. Groups are clustered using the UPGMA algorithm using the distance between communities as calculated using any of the calculators describing the similarity in community membership or structure.

calc Using the calc option allows one to select any of the calculators of similarity of community membership and structure. The different calculators can be separated with hyphens (i.e. "-"). For example the following command will generate distance matrices for the Jaccard coefficient using richness estimators, the Yue & Clayton theta, and the Bray-Curtis index:
mothur > tree.shared(shared=abrecovery.fn.shared, calc=jest-thetayc-braycurtis)
Keep in mind that these are distances, which are calculated as one minus the similarity value.

Após gerar um arquivo 'shared' (como no passo anterior) é possível gerar uma árvore filogenética utilizando o seguinte comando:

tree.shared(shared=ARQUIVO.fn.shared)

Para vizualizar a árvore gerada é ncessário um programa externo ao mothur. Confira aqui uma lista de programas para vizualização de árvores filogenéticas.

Shared file

A shared file is analogous to an rabund file. The data in a shared file represent the number of times that an OTU is observed in multiple samples. It is easiest to generate this file using the make.shared command. The structure of a shared file is analogous to an rabund file. The first column contains the label for the comparison - this will be the value for the first column of each line from the original list file (e.g. patient70.fn.list). The second column contains the group name that designates where the data is coming from for that row. Next is the number of OTUs that were found between each of the groups and is the number of columns that follow. Finally, the remaining columns indicate the number of sequences that belonged to each OTU from that group. The list files that are generated from this command contain the sequence names in each OTU for each group and can be processed using commands such as collect.single() like any other list files.

exemplo de arquivo.shared

label	Group	numOtus	Otu001	Otu002	Otu003	Otu004	Otu005	Otu006	Otu007	Otu008	Otu009	Otu010
0.02	Site1	7	0	3	1	0	0	1	0	1	1	0
0.02	Site2	9	1	0	0	2	1	2	1	0	1	1
0.02	Site3	4	0	0	1	0	0	0	0	0	3	0

exemplo de dados abióticos

	Silte/Argila	MO	pH	Salinidade	Temperatura
Site1	1.5921212121	2.3	7.0	0.746634199	1.414973348
Site2	1.7821212121	1.8	7.4	1.615950052	1.531478917
Site3	1.2621212121	1.6	7.2	1.591064607	1.472756449

Utilizando um 'shared file' contendo a abundância relativa de cada OTU e um conjunto de dados abióticos dos locais de estudo é possível realizar uma séria de análises multivariadas (tais como RDAs, PCAs e CCAs)

Chimera check

Antes de proceder, são necessários arquivos de referência adequados (usualmente chamados GOLD), que são versões mais minuciosas do arquivo de alinhamento genérico.

Arquivos de alinhamento formatados para o Mothur podem ser encontrados aqui

Dentro do Mothur existem vários pacotes para a busca de quimeras, produzidos por diversos autores e cada um com um utilizando um algoritimo diferente.

Chimera.slayer

The chimera.slayer command reads a fasta file and reference file and outputs potentially chimeric sequences. The developers of the original algorithm suggest using a special template reference set (i.e. gold). We provide a silva-based alignment of this dataset with our silva reference files. You will need to have a copy of the megablast and formatdb executables in a folder called blast/bin, where the blast folder is next to the mothur executable. The version of megablast/formatdb that you need can be found in [1] or they are included with the executable versions of mothur.

Para executar o comando é necessário que o arquivo de interesse esteja previamente alinhado com o arquivo de referência adequado para a busca de quimeras. Lembrando que para alinhar, caso ainda não tenha feito isto, é interessante filtar as sequências unicas atraves do comando unique.seqs.

align.seqs(fasta=ARQUIVO.fasta, template=silva.gold.align)

De posse dos arquivos alinhados, execute (é possível indicar mais de um arquivo, separando-os por hífens):

chimera.slayer(fasta=ARQUIVO.align, reference=silva.gold.align)

É possível utilizar 'reference=self' para utilizar a diversidade do seu conjunto de dados como arquivo de referência na busca por quimeras. Nesse caso, um 'ARQUIVO.names' (gerado através do unique.seqs) deve ser indicado de forma a apontar a abundância de cada 'espécie'.

Chimera.uchime

The chimera.uchime command reads a fasta file and reference file and outputs potentially chimeric sequences. The original uchime program was written by Robert C. Edgar and donated to the public domain, http://drive5.com/uchime. If you would like to follow along, please download UchimeExample.zip.

Da mesma forma como no chimera.slayer, um arquivo já alinhado com o arquivo de referência é necessário.

chimera.uchime(fasta=ARQUIVO.align, reference=silva.gold.align)