- Tasks assigned and Questions
- De novo assemble pipeline
Original Publish Date: 12 August, 2019
paste(“Updated on:”, format(Sys.time(), ‘%d %B, %Y’))
This Tutorial mainly Originated form Huttenhower Lab Official Reference and Modified by Tank
Tasks assigned and Questions
1 gene catalogue pipeline in details
Reference-based pipeline 👌
De novo Assembly pipeline ❔
De novo assemble pipeline
khmer generation
time parallel -j 8 \
'interleave-reads.py temp/11qc/{1}_1_kneaddata_paired_1.fastq temp/11qc/{1}_1_kneaddata_paired_2.fastq | \
trim-low-abund.py -V -M ${khmer_memory} -Z ${khmer_high} -C ${khmer_low} - -o temp/21khmer/{1}.fq; \
split-paired-reads.py -f -0 temp/21khmer/{1}_0.fq -1 temp/21khmer/{1}_1.fq -2 temp/21khmer/{1}_2.fq temp/21khmer/{1}.fq' \
::: `tail -n+2 result/metadata.txt | cut -f 1`
Assemble
Problem of Seven-Brideg and Euler graph
K-mer and De Bruign graph
K-mer: k-mers are subsequences of length k contained within a biological sequence
$k-mers are subsequences of length \({k}\) contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides (i.e. A, T, G, and C), k-mers are capitalized upon to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence’s subsequences of length \({k}\) , such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length \({L}\) L will have \({L-k+1}\) k-mers and \({ n^{k}}\) total possible k-mers, where \({n}\) n is number of possible monomers (e.g. four in the case of DNA).
The frequency of a set of k-mers in a species’a genome, in a genomic region, or in a class of sequences can be used as a “signature” of the underlying sequence. Comparing these frequencies are computationally easier than sequence alignment, and is an important method in alignment-free sequence analysis. It can also be used as a first stage analysis before an alignment.
Assembly of Hamilton and Euler cycle
time parallel -j 8 \
'megahit -t 30 \
--continue\
--presets meta-large\
--k-list 21,29,39,59,79,99,119,141 \
-1 raw_data_example/{1}_1.fq -2 raw_data_example/{1}_2.fq \
-o assembly/megahit/ >> log/megahit/megahit.log 2>&1 ' \
::: `tail -n+2 metaDataMap/metadata.txt | cut -f 1`
tunable parameter1
input 3 mode2
good assembly standard3
Reference Required and other Materials
library(RefManageR)
# library(knitcitations)
# library(rcrossref)
bib <- ReadBib("metaAssembly.bib")
myopts <- BibOptions(bib.style = "authoryear", style="latex", first.inits=FALSE, max.names = 20)
Boisvert, Sébastien, Frédéric Raymond, Élénie Godzaridis, François Laviolette, and Jacques Corbeil (2012). ``Ray Meta: scalable de novo metagenome assembly and profiling’’. In: 13.12, p. R122.
Kultima, Jens Roat, Shinichi Sunagawa, Junhua Li, Weineng Chen, Hua Chen, Daniel R Mende, Manimozhiyan Arumugam, Qi Pan, Binghang Liu, Junjie Qin, and others (2012). ``MOCAT: a metagenomics assembly and gene prediction toolkit’’. In: 7.10, p. e47656.
Li, Dinghua, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam (2015). ``MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph’’. In: 31.10, pp. 1674–1676.
Nagarajan, Niranjan and Mihai Pop (2013). ``Sequence assembly demystified’’. In: 14.3, p. 157.
Namiki, Toshiaki, Tsuyoshi Hachiya, Hideaki Tanaka, and Yasubumi Sakakibara (2012). ``MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads’’. In: 40.20, pp. e155–e155.
Narasingarao, Priya, Sheila Podell, Juan A Ugalde, Céline Brochier-Armanet, Joanne B Emerson, Jochen J Brocks, Karla B Heidelberg, Jillian F Banfield, and Eric E Allen (2012). ``De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities’’. In: 6.1, p. 81.
Nielsen, H Bjørn, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen, Junhua Li, Shinichi Sunagawa, Damian R Plichta, Laurent Gautier, Anders G Pedersen, Emmanuelle Le Chatelier, and others (2014). ``Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes’’. In: 32.8, p. 822.
Pell, Jason, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M Tiedje, and C Titus Brown (2012). ``Scaling metagenome sequence assembly with probabilistic de Bruijn graphs’’. In: 109.33, pp. 13272–13277.
Thomas, Torsten, Jack Gilbert, and Folker Meyer (2012). ``Metagenomics-a guide from sampling to data analysis’’. In: 2.1, p. 3.
Treangen, Todd J, Sergey Koren, Daniel D Sommer, Bo Liu, Irina Astrovskaya, Brian Ondov, Aaron E Darling, Adam M Phillippy, and Mihai Pop (2013). ``MetAMOS: a modular and open source metagenomic assembly and analysis pipeline’’. In: 14.1, p. R2.
Binning (metagenomics)
Metagenome assemblies are highly fragmented, comprising thousands of contigs, and researchers do not know a priori which contig derives from which genome, or even know how many genomes are present. The aim of contig ‘binning’ is to group contigs into species. Supervised binning methods use databases of already sequenced genomes to label contigs into taxonomic classes. Unsupervised (clustering)methods look for natural groups in the data
binning is the process of grouping reads or contigs and assigning them to operational taxonomic units. Binning methods can be based on either compositional features or alignment (similarity), or both
binning techniques represent a “best effort” to identify reads or contigs with certain groups of organisms designated as operational taxonomic units (OTUs)
metagenomic assembly produces contigs of short to intermediate length (1–100s of kb). Assigning these contigs to particular microbial groups and/or organisms, a process known as metagenomic binning
There are two main approaches to binning.
- The first involves comparing the unknown gene/contig sequences to sequences of known phylogenetic affiliation.
- The second method is to assign contigs to taxonomic groups based on signatures of nucleotide composition such as GC content or tetranucleotide frequency.
advantage of using nucleotide composition for binning is that the signal is pervasive throughout the genome, and horizontally transferred genes quickly take on the signal of their host
Nucleotide composition-based approaches are particularly valuable for novel genes
Note that in a strict sense, “binning,” in which sequences are clustered into subgroups that are not necessarily identified taxonomically, should be distinguished from “classification” in which taxonomic labels have been applied
Genomic Signatures of Nucleotide Composition
microorganisms have characteristic patterns of nucleotide composition that are conserved within genomes and distinct between genomes. Such compositional biases were recognized at the very beginning of the age of genomics and metagenomics
The biases include features such as GC content ((G+C)/(A+T+C+G)), and the relative abundance of oligonucleotide sequences of a given length (di-, tri-, tetra-, etc. nucleotides
oligonucleotides of any length can be used for binning. What length is optimal? There is a trade-off between information content and length of contig required to provide sufficient data on frequency of specific oligo sequences. Longer oligo patterns contain more information and offer higher specificity (Bohlin et al. 2008). Yet the number of possible oligonucleotides increases exponentially with oligo length: di-, tri-, tetra-, and pentanucleotides have 16, 64, 256, and 1024 possible oligos, respectively (including nonunique reverse complements). Thus, in order to generate a statistically significant frequency histogram, more “samples” are required for longer oligos. In other words, longer contigs are required for longer oligos. Tetranucleotide frequencies typically provide acceptable results down to ~2.5 kb and are very robust above 5 kb, and thus find wide application in current metagenomics projects where contigs in this length range are abundant
Binning by nucleotide composition can be approached by either supervised or unsupervised methods. Supervised methods are trained on the signatures of reference genomes to construct a model that is then used to assign unknown contigs to the reference genomes
Unsupervised approaches do not rely on a priori information about the genome signatures of reference genomes. Rather, they group unknown sequences into clusters directly based on the similarity of their genome signatures
Supervised methods
One of the most widely used supervised methods of metagenomic binning is Phylopithia (McHardy et al. 2007). This method “learns” the characteristics of genome signatures of reference genomes through a support vector machine classifier and then phylogenetically classifies unknown sequences at multiple levels of taxonomy. In addition to nucleotide composition, it also leverages information from phylogenetic marker genes
Naive Bayesian classifiers have also been used to classify short sequence fragments (down to 400 bp) from genomes of pure cultures at an accuracy of 85% (Sandberg et al. 2001)
Interpolated Markov models to classify sequences down to 100 bp
Unsupervised method
- One particularly effective method of clustering and visualizing sequences is self-organizing maps (SOM)
- MaxBin (and MaxBin2) provides an automated pipeline that uses an Expectation-Maximization algorithm, which is reported to push the sequence length threshold down to 1000 bp
MetaBat also uses a combination of tetranucleotide frequency and abundance (coverage) information and is reported to be fully automated
Differential coverage can be used in combination with nucleotide composition to enhance the resolution of binning
Gene prediction and annotation
Prokka and Intro
Prokka: rapid prokaryotic genome annotation,Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.
Execute Command:
time parallel --gnu --xapply -j 30 \
"prokka assembly/megahit/final.contigs.fa --outdir genePre/prokka/ {}/\
-prefix mg --metagenome --force --cpus 48 \
--kingdom Archaea,Bacteria,Mitochondria,Viruses" \
>> log/prokka/prokka.log 2>&1 \
::: `tail -n+2 metaDataMap/metadata.txt | cut -f 1` ## 1h time-consuming
Output Files
Extension | Description |
---|---|
.gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
.gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
.fna | Nucleotide FASTA file of the input contig sequences. |
.faa | Protein FASTA file of the translated CDS sequences. |
.ffn | Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) |
.sqn | An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
.fsa | Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
.tbl | Feature Table file, used by “tbl2asn” to create the .sqn file. |
.err | Unacceptable annotations - the NCBI discrepancy report. |
.log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the –quiet option was enabled. |
.txt | Statistics relating to the annotated features found. |
.tsv | Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product |
Note
- Prokka is specifically designed for Bacteria, Archaea and Viruses. It can’t handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.
The hmmscan step seems to hang and do nothing?
The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the--gnu
option to behave in the ‘default’ way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and changeparallel
toparallel --gnu
Mandatory
BioPerl
Used for input/output of various file formats
Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.GNU Parallel
A shell tool for executing jobs in parallel using one or more computers
O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47.BLAST+
Used for similarity searching against protein sequence libraries
Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421.Prodigal
Finds protein-coding features (CDS)
Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119.TBL2ASN Prepare sequence records for Genbank submission Tbl2asn home page
Recommended
Aragorn
Finds transfer RNA features (tRNA)
Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6.Barrnap
Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.
Manuscript under preparation.HMMER3
Used for similarity searching against protein family profiles
Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.
MetaGeneMark for Eukaryotes
GeneMark is an abinitio gene prediction software suite that has modes for both metagenomes (MetaGeneMark, Zhu et al., 2010) and Eukaryotes (GeneMark-ES, Ter-hovhannisyan et al., 2008)
Non-Redundant Gene Set Establishment
CD-HIT and Intro
CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.
CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li, et al., 2001) and was then extended to support clustering nucleotide sequences and comparing two datasets
mainly command and meta info:
Command | Info |
---|---|
* cd-hit | Cluster peptide sequences |
cd-hit-est | Cluster nucleotide sequences |
* cd-hit-2d | Compare 2 peptide databases |
* cd-hit-est-2d | Compare 2 nucleotide databases |
* psi-cd-hit | Cluster proteins at <40% cutoff |
* cd-hit-lap | Identify overlapping reads |
* cd-hit-dup | Identify duplicates from single or paired Illumina reads |
* cd-hit-otu | Cluster rRNA tags |
* cd-hit-para | Cluster sequences in parallel on a computer cluster |
* scripts | Parse results and so on |
* h-cd-hit | Hierarchical clustering |
CD-HIT Algorithm
CD-HIT is a greedy incremental clustering approach. The basic CD-HIT algorithm sorts the input sequences from long to short, and processes them sequentially from the longest to the shortest. The first sequence is automatically classified as the first cluster representative sequence. Then each query sequence of the remaining sequences is compared to the representative sequences found before it, and is classified as redundant or representative based on whether it is similar to one of the existing representative sequences. In default manner (fast mode), a query is grouped into the first representative without comparing to other representatives. In accurate mode, a query is compared to all representatives and grouped to the most similar one
Based on this greedy method, we established several integrated heuristics that make CD-HIT very efficient.
Index Table vs Hash Table
Most sequence alignment tools use short words or k-mer (k is the length) to speed up computation. For example, BLASTN uses 11-mer and BLASTP uses 3-mer by default. An index table, as opposite to a hash table used in most fast alignment methods, uses a unique index for every unique k-mer, therefore an index table is much faster than a hash table. In CD-HIT, we use k=2~5 for proteins and k=8~12 for DNAs, because the all the k-mers can be indexed in computer memory.
Short word filter
Two sequences of a certain identity must share at least a specific number of identical k-mers. It is possible to tell that the identity between two sequences is below a cutoff by counting common k-mers. The filter checks the common k-mers and rejects unnecessary alignments.
Short word statistics
It is the key for the high efficiency of a short word filter (Li, et al., 2002). Through statistical analysis of real alignments, we identified the distribution of common k-mer at different sequence lengths and identities, and applied the results to short word filter.
Banded alignment
The short word filter not only filters out unnecessary alignment, when alignment is needed, it also identifies a narrow band for banded dynamic programming alignment, which is much faster than regular dynamic programming.
Reduced alphabet (to be implemented)
This is for protein clustering. In reduced alphabet, a group of exchangeable residues are reduced to a single residue (I/V/L==>I, S/T==>S, D/E==>D, K/R==>K, F/Y==>F), and then conservative mutations would appear as identities in sequence alignments. It improves the short word filter for clustering at low sequence identity below 50%.
Gapped word (to be implemented)
Short word filter using gapped word allows mismatch within a word such as “ACE” vs “AME”, “ACFE” vs “AMYE”, and “AACTT” vs “AAGTT”, which can be written as “101”, “1001” and “11011”. At low identity cutoff, a gapped word is more efficient than an ungapped word for filtering.
CD-HIT-EST
CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in fasta format and the output are two files: a fasta file of representative sequences and a text file of list of clusters. Since eukaryotic genes usually have long introns, which cause long gaps, it is difficult to make full-length alignments for these genes. So, CD-HIT-EST is good for non-intron containing sequences like EST.
Basic command:
cd-hit-est -i est_human -o est_human95 -c 0.95 -n 10 -d 0 -M 16000 -T 8 cd-hit-est -i R1.fa -j R2.fa -o R1.95.fa -op R2.95.fa -P 1 -c 0.95 -n 10 -d 0 -M 16000 -T 8
Key parameters Alignment coverage control:
See the figure below, the -aL, -AL, -aS and -AS options can be used to specify the alignment coverage on both the representative sequence and other sequences. -s and -S can control the length difference between the representative sequence and other sequences.
Exececute Command:
cd-hit-est \
-i ~/quantitativeProfiling/doughtStress/wetdry/genePre/prokka/D6303/mg.ffn \
-o ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr \
-aS 0.9 \
-c 0.95 \
-n 10 \
-G 0 \
-M 800000 \
-T 30 \
-d 0 \
-g 1 >>log/cdhit/cdhit.log 2>&1
cd-hit-para.pl \
-i all.genes.fa \
-o non-redundant.geneset.out \
--P cd-hit-est \
--Q 40 \
--T SGE \
--S 37 \
-P priority \
-q queue \
--l vf=10g,p=10 \
-G 0 \
-aS 0.9 \
-c 0.95 \
-M 0 \
-T 30 \
-d 0 \
-g 1 \
-l 10 \
--R restart
command parameters details4
So far the Gene Catalog Construction has been done
Functional Annotations (NR, MetaCyc,GO, UniProt KEGG,COG eggNOG, CAZyome & ARDB)
Difference between NCBI non-redundant and refseq database5
NR Database Intro
NR include non-curated and curated database, refseq only curated ### Non-curated databases (low quality):
- GenBank/GenPept - unreviewed sequences submitted from individual laboratories and large-scale sequencing projects. Since these sequence records are owned by the original submitters and can not be altered, GenBank might contain many low quality sequences.
- trEMBL - unreviewed section of UniProt. This section contains a computer-annotated supplement of SwissProt that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SwissProt
Curated databases (high quality):
- RefSeq - GenBank sequences that are manually curated by the NCBI staff. RefSeq records are owned by NCBI and can be updated as needed to maintain current annotation or to incorporate additional information.
- UniProtKB/SwissProt - manually annotated and reviewed protein sequences
- PIR - non-redundant annotated protein sequence database
- PDB - experimentally-determined structures of proteins, nucleic acids, and complex assemblies
Database Attributes
-
update_blastdb.pl --decompress nr [*] # This command will download the compressed nr BLAST database from NCBI to the current working directory and decompress it
-
email license required for academic usage
kegg kegg_dmnd=/db/kegg/kegg_76
fee license required for adademic usage
-
wget -c -r 1 -np -nd ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data
eggnog eggnog_db=/db/eggnog
wget -c -r 1 -np -nd http://eggnogdb.embl.de/download/
- cazy dbcaz2_dmnd=/db/protein/dbcan2/CAZyDB.07312018
dbcaz2_anno=/db/protein/dbcan2/fam_description.txt
nohup wget -c -nd -np -r http://bcb.unl.edu/dbCAN2/download/
[ardb]
nohup wget -c -nd -np -r https://card.mcmaster.ca/download/5/ontology-v3.0.4.tar.gz
- resfam resfams_dmnd=/mnt/zhou/yongxin/db/ResFams/Resfams-proteins
resfams_anno=/mnt/bai/yongxin/data/db/ResFams/Resfams-proteins_class.tsv
if the sequence-level, using diamond blastx, protein-level diamond blastp(using transeq to translate first)
transeq \
-sequence ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr \
-outseq ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr.fa \
-trim Y
-clean Y
-error Y
trim7
diamond blastp \
--thread 30 \
--db diamond.index \
--query non-redundant.geneset.pro.faa \
--out diamondout \
--outfmt 6 \
--log \
--max-target-seqs 50 \
--evalue 1e-5 \
--sensitive \
--block-size 6 \
--index-chunks 1
Blast Mode8
Contig quantification salmon
time parallel -j ${j} \
'salmon quant -i temp/22salmon_contig/index -l A -p ${p} --meta \
-1 temp/11qc/{1}_1_kneaddata_paired_1.fastq -2 temp/11qc/{1}_1_kneaddata_paired_2.fastq \
-o temp/22salmon_contig/{1}.quant' \
::: `tail -n+2 result/metadata.txt | cut -f 1`
Contig taxon annotation kraken2
kraken2 --db ${kraken2_db} temp/22megahit_all/final.contigs.fa \
--threads ${p} --use-names \
--output result/22megahit_all/kraken2.tax
Assembly-free metagenomic profiling
override a group of parameters; possible values: meta-sensitive: ‘–min-count 1 –k-list 21,29,39,49,…,129,141’ meta-large: ‘–k-min 27 –k-max 127 –k-step 10’ (large & complex metagenomes, like soil)↩
-1 -2 support list -r se -12 interleaved↩
for each sample, reads were assembled with a series of different k-mer size in parallel, and then were mapped back to each assembly result with SOAP2 , and the optimal k-mer size and assembly result were chosen depending on both contig N50 and mapping rate. De novo metagenome assembly was reperformed with IDBA-UD. Only those contigs with more than 500 bp were kept for further analysis↩
-G 0 # global sequence identity, default 1 if set to 0, then use local sequence identity, combination with alignment control such as -aS -aL -M 800000 # maximal memory -T 30 # thread -d 0 # length of description in .clstr file, default 20 if set to 0, it takes the fasta defline and stops at first space -g 1 # faster mode 0 or accurate mode 1↩
The best way to obtain BLAST databases is to download them from NCBI or the cloud (currently from Google Cloud Platform – an experimental feature - see details below). These are the same databases available via the public BLAST Web Service (https://blast.ncbi.nlm.nih.gov), are updated regularly, and contain taxonomic information built into them. These can also be a source of biological sequence data (see below).To download a preformatted NCBI BLAST database, run the update_blastdb.pl program followed by any relevant options and the name(s) of the BLAST databases to download. For example:↩
[N] This removes all ‘X’ and ‘’ characters from the right end of the translation. The trimming process starts at the end and continues until the next character is not a ’X’ or a ’’↩
BlastX requires as query a DNA sequence; BlastP requires as query a protein↩