Sequence processing

Revision of 13:23, 2 September 2011

Using MOTHUR

MOTHUR, developed by Patrick Schloss, bundles several programs (with improvements) into a single package, so that an analysis can be run as a sequential process.

A good starting point is the Analysis Examples section on the home page of the Mothur wiki.

Below is an adapted version of the Sogin data analysis, with a description of each command. In the commands, ARQUIVO ("file") is a placeholder for the base name of your own file.

unique.seqs

The unique.seqs command returns only the unique sequences found in a fasta-formatted sequence file and a file that indicates those sequences that are identical to the reference sequence. A collection of sequences will often contain a significant number of identical sequences. Aligning, calculating distances for, and clustering each of these sequences individually consumes considerable processing time.

This will generate two files: amazon.names and amazon.unique.fasta. You can now align amazon.unique.fasta and generate a distance matrix. Then you can use that matrix with the newly generated amazon.names file with the names option for the read.dist command.

If you align your unique sequences, filter and screen them, you might be removing bases from the sequences that accounted for differences between the sequences. You can then rerun your sequences through unique.seqs by providing the name option.

unique.seqs(fasta=ARQUIVO.fasta)
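
As a rough illustration of what this step does conceptually, the Python sketch below collapses identical sequences and builds a name mapping in the two-column layout of a names file (representative name, then a comma-separated list of every sequence it stands for). The function names `unique_seqs` and `format_names` and the sample records are hypothetical; this is not mothur's implementation.

```python
def unique_seqs(records):
    """Collapse identical sequences.

    records: iterable of (name, sequence) pairs in file order.
    Returns (unique, names): `unique` lists one (representative, sequence)
    pair per distinct sequence; `names` maps each representative to all
    names that share its sequence (the representative included).
    """
    groups = {}  # sequence -> names, in first-seen order
    for name, seq in records:
        groups.setdefault(seq, []).append(name)
    unique = [(members[0], seq) for seq, members in groups.items()]
    names = {members[0]: members for members in groups.values()}
    return unique, names


def format_names(names):
    """Render the mapping as tab-separated lines: representative, members."""
    return "\n".join(f"{rep}\t{','.join(members)}" for rep, members in names.items())


# Hypothetical input: seqA and seqC are identical.
records = [("seqA", "ACGT"), ("seqB", "ACGA"), ("seqC", "ACGT")]
unique, names = unique_seqs(records)
print(unique)
print(format_names(names))
```

Only the two representatives need to be aligned and compared downstream; the mapping lets the redundant sequences be counted again at the clustering step.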

align.seqs

The align.seqs command aligns a user-supplied fasta-formatted candidate sequence file to a user-supplied fasta-formatted template alignment. The general approach is to: i) find the closest template for each candidate using kmer searching, blastn, or suffix tree searching; ii) make a pairwise alignment between the candidate and de-gapped template sequences using the Needleman-Wunsch, Gotoh, or blastn algorithms; iii) re-insert gaps into the candidate and template pairwise alignments using the NAST algorithm so that the candidate sequence alignment is compatible with the original template alignment.

align.seqs(candidate=ARQUIVO.unique.fasta, template=core_set_aligned.imputed.fasta, ksize=9, align=needleman)
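
Step i) can be pictured with a small Python sketch: de-gap each template, break it into overlapping k-mers, and pick the template that shares the most distinct k-mers with the candidate. This is only one plausible scoring rule, chosen for illustration; mothur's actual search may score matches differently. The names `kmers` and `closest_template` and the toy sequences are hypothetical.

```python
def kmers(seq, ksize):
    """Set of all overlapping substrings of length ksize."""
    return {seq[i:i + ksize] for i in range(len(seq) - ksize + 1)}


def closest_template(candidate, templates, ksize=9):
    """Return (name, score) of the template sharing the most k-mers.

    templates: dict mapping name -> aligned sequence, where '.' and '-'
    mark gaps. Gaps are stripped before the k-mers are extracted.
    """
    cand = kmers(candidate, ksize)
    best_name, best_score = None, -1
    for name, aligned in templates.items():
        degapped = aligned.replace("-", "").replace(".", "")
        score = len(cand & kmers(degapped, ksize))  # shared distinct k-mers
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Toy data with a short k-mer size so the example stays readable.
templates = {"t1": "ACGT--ACGTAC", "t2": "TTTT..GGGGCC"}
print(closest_template("ACGTACGT", templates, ksize=4))
```

The chosen template is then aligned pairwise to the candidate (step ii) before gaps are re-inserted (step iii).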

filter.seqs

filter.seqs removes columns from alignments based on user-defined criteria. For example, alignments generated against reference alignments (e.g. from RDP, SILVA, or greengenes) often have columns where every character is either a '.' or a '-'. These columns are not included in calculating distances because they have no information in them. By removing these columns, the calculation of a large number of distances is accelerated.

filter.seqs(fasta=ARQUIVO.unique.align, vertical=T)
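
The effect of removing all-gap columns can be sketched in a few lines of Python. The function `filter_vertical` and the two-sequence alignment are hypothetical and only illustrate the idea described above, not mothur's code.

```python
def filter_vertical(alignment):
    """Drop every column in which all sequences have '.' or '-'.

    alignment: dict mapping name -> aligned sequence (equal lengths).
    """
    if not alignment:
        return {}
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have the same length")
    # Keep a column as soon as one sequence has a real character there.
    keep = [i for i in range(length) if any(s[i] not in ".-" for s in seqs)]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


alignment = {"a": "..AC-GT-", "b": "..A--GTT"}
print(filter_vertical(alignment))
```

Columns 0, 1, and 4 hold only gap characters in both sequences, so they are removed; columns where at least one sequence has a base are kept.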

dist.seqs

The dist.seqs command will calculate uncorrected pairwise distances between aligned sequences. This approach is better than the commonly used DNADIST because the distances are not stored in RAM; instead, they are printed directly to a file. Furthermore, it is possible to ignore "large" distances that one might not be interested in. The command will generate a column-formatted distance matrix that is compatible with the column option in the read.dist command. The command is also able to generate a phylip-formatted distance matrix. There are several options for how to handle gap comparisons and terminal gaps.

dist.seqs(fasta=ARQUIVO.unique.filter.fasta, cutoff=0.10, processors=?)

It took ~15 min to align SOGIN_DATA using 4 processors (the Mothur wiki estimated 1.6 hours).
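
To make "uncorrected pairwise distance" and the cutoff concrete, here is a minimal Python sketch. It assumes the simplest gap rule, skipping any position where either sequence has a gap; as noted above, the real command offers several gap-handling options, so this is an assumption and not the command's default. The names `uncorrected_distance` and `column_distances` and the toy alignment are hypothetical.

```python
from itertools import combinations


def uncorrected_distance(a, b):
    """Fraction of mismatches over the positions compared.

    Assumption: positions where either sequence has '.' or '-' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x in ".-" or y in ".-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def column_distances(alignment, cutoff):
    """Yield (name1, name2, distance) for pairs at or below the cutoff."""
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        d = uncorrected_distance(s1, s2)
        if d <= cutoff:
            yield n1, n2, d


alignment = {"a": "ACGT-A", "b": "ACGA-A", "c": "TTTTTT"}
for row in column_distances(alignment, cutoff=0.25):
    print(*row, sep="\t")
```

Yielding one pair at a time mirrors the point made above: distances can be written straight to a file without holding the whole matrix in memory, and pairs above the cutoff are never stored.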

cluster

Once a distance matrix gets read into mothur, the cluster command can be used to assign sequences to OTUs.

Missing distances: Perhaps the second most commonly asked question is why there isn't a line for distance 0.XX. If you notice the previous example, the distances jump from 0.003 to 0.006. Where are 0.004 and 0.005? Mothur only outputs data if the clustering has been updated for a distance. So if you don't have data at your favorite distance, that means that nothing changed between the previous distance and the next one. Therefore, if you want OTU data for a distance of 0.005 in this case, you would use the data from 0.003.

cluster(name=ARQUIVO.names)
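
The rule for missing distances amounts to "use the largest reported label that does not exceed the distance you want". A small Python sketch of that lookup follows; it handles numeric labels only, and the function name `label_for` is hypothetical.

```python
from bisect import bisect_right


def label_for(desired, available):
    """Largest available distance label that is <= desired."""
    labels = sorted(available)
    idx = bisect_right(labels, desired)
    if idx == 0:
        raise ValueError("no label at or below the desired distance")
    return labels[idx - 1]


# Labels as in the text: the output jumps from 0.003 to 0.006.
print(label_for(0.005, [0.0, 0.003, 0.006]))
```

For a desired distance of 0.005 the lookup falls back to 0.003, matching the explanation above.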

summary.single

The summary.single command will produce a summary file that has the calculator value for each line in the OTU data and for all possible comparisons between the different groups in the group file. This can be useful if you aren't interested in generating collector's or rarefaction curves for your multi-sample data analysis.

summary.single(calc=sobs-chao-ace-jack-bootstrap-shannon-npshannon-simpson-invsimpson-coverage-boneh-logseries-geometric-bstick-nseqs)

rarefaction.single

The rarefaction.single command will generate intra-sample rarefaction curves using a re-sampling without replacement approach. Rarefaction curves provide a way of comparing the richness observed in different samples. Roughly speaking, you get the number of OTUs, on average, that you would have been expected to observe if you hadn't sampled as many individuals. Although a formula exists to generate a rarefaction curve (see the example calculation), mothur uses a randomization procedure. It can also help you to assess your sampling intensity. If a rarefaction curve becomes parallel to the x-axis, you can be reasonably confident that you have done a good job of sampling and can trust the observed level of richness. Otherwise, you need to keep sampling. Rarefaction is actually a better measure of diversity than it is of richness.

For a small number of sequences (200 or fewer) I suggest a low freq value (e.g. 1). For around 20k sequences, freq=5000 is reasonable.

rarefaction.single(freq=1)
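
The randomization idea and the role of freq can be sketched in Python: shuffle the individuals, count distinct OTUs as the sample grows, record the count every freq individuals, and average over many shuffles. The function `rarefaction_curve`, its `iters` and `seed` parameters, and the OTU counts are hypothetical; mothur's own procedure and output points may differ.

```python
import random


def rarefaction_curve(otu_counts, freq=1, iters=1000, seed=0):
    """Average number of OTUs observed at increasing sample sizes.

    otu_counts: number of sequences in each OTU.
    Returns a list of (sample_size, mean_otus_observed), reported every
    `freq` individuals and at the full sample size.
    """
    rng = random.Random(seed)
    individuals = [otu for otu, n in enumerate(otu_counts) for _ in range(n)]
    total = len(individuals)
    sizes = sorted(set(range(freq, total + 1, freq)) | {total})
    sums = {n: 0 for n in sizes}
    for _ in range(iters):
        rng.shuffle(individuals)  # each prefix is a sample without replacement
        seen = set()
        for i, otu in enumerate(individuals, start=1):
            seen.add(otu)
            if i in sums:
                sums[i] += len(seen)
    return [(n, sums[n] / iters) for n in sizes]


# Hypothetical community: 10 sequences spread over 4 OTUs.
for size, mean_otus in rarefaction_curve([5, 3, 1, 1], freq=2, iters=200):
    print(size, round(mean_otus, 2))
```

A small freq gives a finely resolved curve, which suits small data sets; with tens of thousands of sequences a larger freq keeps the output manageable.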

libshuff

The libshuff command implements the libshuff method as previously implemented in the programs s-libshuff and libshuff. The libshuff method is a generic test that describes whether two or more communities have the same structure using the Cramer-von Mises test statistic. The significance of the test statistic indicates the probability that the communities have the same structure by chance. Because each pairwise comparison requires two significance tests, a correction for multiple comparisons (e.g. Bonferroni's correction) must be applied.
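
The correction can be written out as a short Python sketch. For two communities there is one pairwise comparison with two significance tests, so an experiment-wide rate of 0.05 becomes a per-test threshold of 0.05 / 2 = 0.025. Extending the count to n_groups * (n_groups - 1) tests for more than two communities is an assumption made here for illustration; the function names are hypothetical.

```python
def libshuff_threshold(n_groups, alpha=0.05):
    """Bonferroni-corrected per-test threshold.

    Assumption: every pair of groups is compared, with two tests per pair,
    giving n_groups * (n_groups - 1) tests in total.
    """
    if n_groups < 2:
        raise ValueError("need at least two groups")
    n_tests = n_groups * (n_groups - 1)
    return alpha / n_tests


def communities_differ(p_xy, p_yx, threshold):
    """A pair differs if either of its two tests falls below the threshold."""
    return p_xy < threshold or p_yx < threshold


threshold = libshuff_threshold(2)
print(threshold)
# 0.0081 is the value quoted in the text; 0.5 is a made-up second p-value.
print(communities_differ(0.0081, 0.5, threshold))
```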

WARNING: If one desires an experiment-wide false detection rate of 0.05, then these significance values need to be less than 0.025 to be considered statistically significant. If either of the significance values is statistically significant, one can safely state that the two communities are significantly different. Therefore, because 0.0081 is smaller than 0.025, the forest and pasture sequence collections are significantly different.

To run the libshuff command you need a distance file in Phylip format and a group file.

To create a group file you need FASTA files from two or more different sites/experiments (preferably in the same directory). In Mothur, run:

make.group(fasta=sample1.fasta-sample2.fasta-sample3.fasta, groups=A-B-C)

A .group file will be created in the specified directory.

To create a distance file in Phylip format:

Take your FASTA files and run the following to merge them into a single file:

merge.files(input=fileA-fileB-fileC, output=fileABC)

With the merged file in hand, follow the steps above in the order given, up to the sequence filtering step (unique.seqs, align.seqs, filter.seqs).

When running the dist.seqs command, add the parameter output=square to produce an output file in Phylip format (square matrix). Like this:

dist.seqs(fasta=ARQUIVO.unique.filter.fasta, cutoff=0.10, output=square)

Finally, run:

libshuff(phylip=arquivo.phylip, group=arquivo.groups)

