Sequence processing

Revision as of 11:27, 2 September 2011

Using MOTHUR

The MOTHUR software, developed by Patrick Schloss, combines several programs (with improvements) into a single package, so that an analysis can be run as one sequential process.

A good starting point is the Analysis Examples section on the home page of the Mothur wiki.

Below is an adapted version of the Sogin data analysis (http://www.mothur.org/wiki/Sogin_data_analysis), together with a description of each command. In the commands, ARQUIVO ("file") is a placeholder for the base name of your input file.

unique.seqs

The unique.seqs command returns only the unique sequences found in a fasta-formatted sequence file, together with a file that lists which sequences are identical to each retained reference sequence. A collection of sequences often contains a significant number of identical sequences, and aligning, calculating distances for, and clustering each of them individually consumes considerable processing time.

For example, with an input file named amazon.fasta, the command generates two files: amazon.names and amazon.unique.fasta. You can now align amazon.unique.fasta and generate a distance matrix. You can then pass that matrix to the read.dist command, supplying the newly generated amazon.names file through the name option.

If you align, filter, and screen your unique sequences, you may remove the bases that accounted for the differences between sequences, leaving some of them identical. You can then rerun your sequences through unique.seqs by providing the name option.

unique.seqs(fasta=ARQUIVO.fasta)
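The idea behind this step can be illustrated with a minimal Python sketch. It is not mothur's implementation: the function names `parse_fasta` and `dereplicate` and the sample sequences are hypothetical. It only shows how identical sequences collapse into one representative plus a table of names (representative, then every name sharing that sequence).

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def dereplicate(records):
    """Group identical sequences; the first name seen represents each group."""
    groups = {}  # sequence -> names, in input order (dicts keep insertion order)
    for name, seq in records:
        groups.setdefault(seq, []).append(name)
    unique = [(names[0], seq) for seq, names in groups.items()]
    names_table = [(names[0], names) for names in groups.values()]
    return unique, names_table


# Hypothetical input: seqA and seqB are identical, seqC differs.
example = """>seqA
ACGT
>seqB
ACGT
>seqC
ACGA
"""

unique, names_table = dereplicate(parse_fasta(example))
for representative, members in names_table:
    print(representative + "\t" + ",".join(members))
```

Only the two sequences in `unique` would need to be aligned and compared; `names_table` keeps track of how many original reads each one stands for.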

align.seqs

The align.seqs command aligns a user-supplied fasta-formatted candidate sequence file to a user-supplied fasta-formatted template alignment. The general approach is to:

i) find the closest template for each candidate using kmer searching, blastn, or suffix tree searching;
ii) make a pairwise alignment between the candidate and de-gapped template sequences using the Needleman-Wunsch, Gotoh, or blastn algorithms;
iii) re-insert gaps into the candidate and template pairwise alignments using the NAST algorithm, so that the candidate sequence alignment is compatible with the original template alignment.

align.seqs(candidate=ARQUIVO.unique.fasta, template=core_set_aligned.imputed.fasta, ksize=9, align=needleman)
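Step i), the kmer search, can be pictured with the following Python sketch. It is an illustration under simplifying assumptions, not mothur's search code: `kmers` and `closest_template` are hypothetical helpers, the template sequences are made up, and "closest" is taken to mean "shares the largest number of distinct k-mers".

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def closest_template(candidate, templates, k=9):
    """Pick the template that shares the most k-mers with the candidate.

    templates maps name -> aligned sequence; gap characters ('-' and '.')
    are removed before the k-mers are extracted.
    """
    candidate_kmers = kmers(candidate, k)
    best_name, best_shared = None, -1
    for name, aligned in templates.items():
        degapped = aligned.replace("-", "").replace(".", "")
        shared = len(candidate_kmers & kmers(degapped, k))
        if shared > best_shared:
            best_name, best_shared = name, shared
    return best_name, best_shared


# Hypothetical two-sequence template alignment.
templates = {
    "t1": "ACGT--ACGTTTGA",
    "t2": "GGCC--TTAAGGCC",
}

# A short k is used here only because the toy sequences are short.
print(closest_template("ACGTACGTTTGA", templates, k=4))  # ('t1', 8)
```

The pairwise alignment (step ii) and the NAST gap re-insertion (step iii) would then be applied only to the candidate and the template selected here.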

filter.seqs

filter.seqs removes columns from alignments based on criteria defined by the user. For example, alignments generated against reference alignments (e.g. from RDP, SILVA, or greengenes) often have columns where every character is either a '.' or a '-'. These columns are not included in calculating distances because they carry no information. Removing them speeds up the calculation of a large number of distances.

filter.seqs(fasta=ARQUIVO.unique.align, vertical=T)
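The column removal described above (the case selected by vertical=T) can be sketched in a few lines of Python. This is a minimal illustration, assuming the alignment is already loaded as a name-to-sequence mapping; `vertical_filter` and the sample alignment are hypothetical.

```python
def vertical_filter(alignment):
    """Drop every column whose characters are all '-' or '.'.

    alignment maps sequence name -> aligned sequence; all sequences must
    have the same length.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    # Keep a column if at least one sequence has a real base in it.
    keep = [i for i in range(length) if any(s[i] not in "-." for s in seqs)]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical alignment: columns 0, 3 and 7 contain only gap characters.
aligned = {
    "seqA": ".AC-G-T.",
    "seqC": ".AC-GAT.",
}

filtered = vertical_filter(aligned)
print(filtered)  # {'seqA': 'ACG-T', 'seqC': 'ACGAT'}
```

Note that the gap in seqA at the sixth column is kept, because seqC has a base there and the column is therefore informative.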

dist.seqs

The dist.seqs command calculates uncorrected pairwise distances between aligned sequences. This approach is better than the commonly used DNADIST because the distances are not stored in RAM; instead, they are printed directly to a file. Furthermore, it is possible to ignore "large" distances that one might not be interested in. The command generates a column-formatted distance matrix that is compatible with the column option in the read.dist command. It can also generate a phylip-formatted distance matrix. There are several options for how to handle gap comparisons and terminal gaps.

dist.seqs(fasta=ARQUIVO.unique.filter.fasta, cutoff=0.10, processors=?)

It took ~15 min to align SOGIN_DATA using 4 processors (the Mothur wiki estimated 1.6 hours).
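As a rough picture of what dist.seqs computes, the Python sketch below calculates uncorrected distances (mismatches divided by compared positions) and keeps only the pairs at or below a cutoff, one pair per row. It makes assumptions that mothur does not necessarily share: columns where either sequence has a gap are simply skipped, and a distance equal to the cutoff is kept. The function names and sequences are hypothetical.

```python
from itertools import combinations

GAPS = "-."


def uncorrected_distance(a, b):
    """Fraction of differing positions among columns without gaps.

    Skipping gapped columns is a simplifying assumption of this sketch,
    not a description of mothur's gap-handling options.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x in GAPS or y in GAPS:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    # With nothing to compare, treat the pair as maximally distant.
    return mismatches / compared if compared else 1.0


def column_distances(alignment, cutoff):
    """Yield (name1, name2, distance) for every pair at or below the cutoff."""
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        d = uncorrected_distance(s1, s2)
        if d <= cutoff:
            yield n1, n2, d


# Hypothetical alignment: seqA and seqB differ at 1 of 10 positions.
alignment = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTAT",
    "seqC": "TTTTACGTAC",
}

for n1, n2, d in column_distances(alignment, cutoff=0.10):
    print(n1, n2, d)  # only the seqA/seqB pair is within the cutoff
```

Because the generator yields one pair at a time, the rows can be written straight to a file without holding the whole matrix in memory, which is the advantage the description above attributes to dist.seqs.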

cluster

Once a distance matrix gets read into mothur, the cluster command can be used to assign sequences to OTUs.

Missing distances

Perhaps the second most commonly asked question is why there isn't a line for distance 0.XX. If you look at the previous example, the distances jump from 0.003 to 0.006. Where are 0.004 and 0.005? Mothur only outputs data when the clustering has been updated at a distance. So if there is no data at your favorite distance, nothing changed between the previous reported distance and the next one. Therefore, if you want OTU data for a distance of 0.005 in this case, you would use the data from 0.003.

cluster(name=ARQUIVO.names)
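The rule for missing distances amounts to "use the largest reported distance that does not exceed the one you want". A small Python sketch of that lookup follows; `label_for` is a hypothetical helper, and the list of reported distances is assumed to have been read from the clustering output beforehand.

```python
from bisect import bisect_right


def label_for(requested, reported):
    """Return the largest reported distance that is <= requested.

    reported is the list of distances present in the clustering output;
    None is returned when the requested distance is below all of them.
    """
    labels = sorted(reported)
    i = bisect_right(labels, requested)
    return labels[i - 1] if i else None


# Distances taken from the text above: the output jumps from 0.003 to 0.006.
reported = [0.003, 0.006]

print(label_for(0.005, reported))  # 0.003
print(label_for(0.006, reported))  # 0.006
```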

