
Rice metagenome gene catalogue establishment

Original Publish Date: 12 August, 2019


This tutorial mainly originated from the Huttenhower Lab official reference and was modified by Tank.

Tasks assigned and Questions

1. Gene catalogue pipeline in detail

  • Reference-based pipeline 👌

  • De novo Assembly pipeline ❔

De novo assembly pipeline

khmer k-mer abundance trimming

time parallel -j 8 \
            'interleave-reads.py temp/11qc/{1}_1_kneaddata_paired_1.fastq temp/11qc/{1}_1_kneaddata_paired_2.fastq | \
trim-low-abund.py -V -M ${khmer_memory} -Z ${khmer_high} -C ${khmer_low} - -o temp/21khmer/{1}.fq; \
split-paired-reads.py -f -0 temp/21khmer/{1}_0.fq -1 temp/21khmer/{1}_1.fq -2 temp/21khmer/{1}_2.fq temp/21khmer/{1}.fq' \
::: `tail -n+2 result/metadata.txt | cut -f 1`

Assembly

The Seven Bridges of Königsberg problem and Euler graphs

K-mers and the de Bruijn graph

K-mer: k-mers are subsequences of length k contained within a biological sequence

K-mers are subsequences of length \(k\) contained within a biological sequence. Primarily used in computational genomics and sequence analysis, where k-mers are composed of nucleotides (i.e. A, T, G, and C), they are used to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence’s subsequences of length \(k\), such that the sequence AGAT has four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length \(L\) has \(L-k+1\) k-mers and \(n^{k}\) total possible k-mers, where \(n\) is the number of possible monomers (e.g. four in the case of DNA).
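A quick way to see the \(L-k+1\) count in practice is to enumerate the k-mers of a toy sequence in the shell (illustrative snippet only, not part of the original pipeline):

# Enumerate all 3-mers of the toy sequence AGAT; a sequence of length 4 yields 4-3+1 = 2 of them
seq=AGAT; k=3
for ((i=0; i<=${#seq}-k; i++)); do echo "${seq:i:k}"; done
# AGA
# GAT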

The frequency of a set of k-mers in a species’ genome, in a genomic region, or in a class of sequences can be used as a “signature” of the underlying sequence. Comparing these frequencies is computationally easier than sequence alignment and is an important method in alignment-free sequence analysis. It can also be used as a first-stage analysis before an alignment.

Assembly via Hamiltonian and Eulerian cycles

# Each sample gets its own output directory and log so that parallel jobs do not
# overwrite each other's results. --presets meta-large already defines a k-mer list
# (see note 1); if MEGAHIT rejects the combination with --k-list, keep only one of them.
time parallel -j 8 \
     'megahit -t 30 \
              --continue \
              --presets meta-large \
              --k-list 21,29,39,59,79,99,119,141 \
              -1 raw_data_example/{1}_1.fq -2 raw_data_example/{1}_2.fq \
              -o assembly/megahit/{1} >> log/megahit/{1}.log 2>&1' \
              ::: `tail -n+2 metaDataMap/metadata.txt | cut -f 1`
              
              

  • Tunable parameters (see note 1)

  • Three supported input modes (see note 2)

  • Standards for a good assembly (see note 3, and the quick check below)
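As a quick way to inspect the criteria in note 3 (contig N50, length distribution) and apply the ≥500 bp filter it mentions, here is a minimal sketch, assuming seqkit is installed and a per-sample MEGAHIT output directory such as assembly/megahit/D6303 (sample name borrowed from the CD-HIT step later in this post):

# Hypothetical quick check: contig count, total length, N50 and other stats
seqkit stats -a assembly/megahit/D6303/final.contigs.fa
# Keep only contigs >= 500 bp, the threshold mentioned in note 3
seqkit seq -m 500 assembly/megahit/D6303/final.contigs.fa > assembly/megahit/D6303/final.contigs.500.fa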

Required references and other materials

library(RefManageR)
# library(knitcitations)
# library(rcrossref)
bib <- ReadBib("metaAssembly.bib")
myopts <- BibOptions(bib.style = "authoryear", style="latex", first.inits=FALSE, max.names = 20)

Boisvert, Sébastien, Frédéric Raymond, Élénie Godzaridis, François Laviolette, and Jacques Corbeil (2012). “Ray Meta: scalable de novo metagenome assembly and profiling”. In: Genome Biology 13.12, p. R122.

Kultima, Jens Roat, Shinichi Sunagawa, Junhua Li, Weineng Chen, Hua Chen, Daniel R Mende, Manimozhiyan Arumugam, Qi Pan, Binghang Liu, Junjie Qin, and others (2012). “MOCAT: a metagenomics assembly and gene prediction toolkit”. In: PLoS ONE 7.10, p. e47656.

Li, Dinghua, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam (2015). “MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph”. In: Bioinformatics 31.10, pp. 1674–1676.

Nagarajan, Niranjan and Mihai Pop (2013). “Sequence assembly demystified”. In: Nature Reviews Genetics 14.3, p. 157.

Namiki, Toshiaki, Tsuyoshi Hachiya, Hideaki Tanaka, and Yasubumi Sakakibara (2012). “MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads”. In: Nucleic Acids Research 40.20, pp. e155–e155.

Narasingarao, Priya, Sheila Podell, Juan A Ugalde, Céline Brochier-Armanet, Joanne B Emerson, Jochen J Brocks, Karla B Heidelberg, Jillian F Banfield, and Eric E Allen (2012). “De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities”. In: The ISME Journal 6.1, p. 81.

Nielsen, H Bjørn, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen, Junhua Li, Shinichi Sunagawa, Damian R Plichta, Laurent Gautier, Anders G Pedersen, Emmanuelle Le Chatelier, and others (2014). “Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes”. In: Nature Biotechnology 32.8, p. 822.

Pell, Jason, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M Tiedje, and C Titus Brown (2012). “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”. In: Proceedings of the National Academy of Sciences 109.33, pp. 13272–13277.

Thomas, Torsten, Jack Gilbert, and Folker Meyer (2012). “Metagenomics - a guide from sampling to data analysis”. In: Microbial Informatics and Experimentation 2.1, p. 3.

Treangen, Todd J, Sergey Koren, Daniel D Sommer, Bo Liu, Irina Astrovskaya, Brian Ondov, Aaron E Darling, Adam M Phillippy, and Mihai Pop (2013). “MetAMOS: a modular and open source metagenomic assembly and analysis pipeline”. In: Genome Biology 14.1, p. R2.


Binning (metagenomics)

Metagenome assemblies are highly fragmented, comprising thousands of contigs, and researchers do not know a priori which contig derives from which genome, or even how many genomes are present. The aim of contig “binning” is to group contigs into species. Supervised binning methods use databases of already sequenced genomes to label contigs with taxonomic classes. Unsupervised (clustering) methods look for natural groups in the data.

Binning is the process of grouping reads or contigs and assigning them to operational taxonomic units. Binning methods can be based on compositional features, on alignment (similarity), or on both.

Binning techniques represent a “best effort” to associate reads or contigs with particular groups of organisms designated as operational taxonomic units (OTUs).

Metagenomic assembly produces contigs of short to intermediate length (1–100s of kb). Assigning these contigs to particular microbial groups and/or organisms is a process known as metagenomic binning.

There are two main approaches to binning.

  • The first involves comparing the unknown gene/contig sequences to sequences of known phylogenetic affiliation.
  • The second method is to assign contigs to taxonomic groups based on signatures of nucleotide composition such as GC content or tetranucleotide frequency.

An advantage of using nucleotide composition for binning is that the signal is pervasive throughout the genome, and horizontally transferred genes quickly take on the signature of their host.

Nucleotide composition-based approaches are particularly valuable for novel genes.

Note that, strictly speaking, “binning”, in which sequences are clustered into subgroups that are not necessarily identified taxonomically, should be distinguished from “classification”, in which taxonomic labels are applied.

Genomic Signatures of Nucleotide Composition

Microorganisms have characteristic patterns of nucleotide composition that are conserved within genomes and distinct between genomes. Such compositional biases were recognized at the very beginning of the age of genomics and metagenomics.

The biases include features such as GC content ((G+C)/(A+T+C+G)) and the relative abundance of oligonucleotide sequences of a given length (di-, tri-, tetra-, etc. nucleotides).

oligonucleotides of any length can be used for binning. What length is optimal? There is a trade-off between information content and length of contig required to provide sufficient data on frequency of specific oligo sequences. Longer oligo patterns contain more information and offer higher specificity (Bohlin et al. 2008). Yet the number of possible oligonucleotides increases exponentially with oligo length: di-, tri-, tetra-, and pentanucleotides have 16, 64, 256, and 1024 possible oligos, respectively (including nonunique reverse complements). Thus, in order to generate a statistically significant frequency histogram, more “samples” are required for longer oligos. In other words, longer contigs are required for longer oligos. Tetranucleotide frequencies typically provide acceptable results down to ~2.5 kb and are very robust above 5 kb, and thus find wide application in current metagenomics projects where contigs in this length range are abundant
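As a small illustration of such a signature (not part of the original pipeline), the tetranucleotide profile of a single contig can be computed directly in the shell; the contig file name is hypothetical:

# Hypothetical example: tetranucleotide (4-mer) counts for one contig
seq=$(grep -v '^>' contig_example.fa | tr -d '\n' | tr 'acgt' 'ACGT')
for ((i=0; i<=${#seq}-4; i++)); do echo "${seq:i:4}"; done | sort | uniq -c | sort -rn | head
# Dividing each count by (length - 4 + 1) gives the frequency vector used as the genome signature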

Binning by nucleotide composition can be approached by either supervised or unsupervised methods. Supervised methods are trained on the signatures of reference genomes to construct a model that is then used to assign unknown contigs to those reference genomes.

Unsupervised approaches do not rely on a priori information about the genome signatures of reference genomes. Rather, they group unknown sequences into clusters directly, based on the similarity of their genome signatures.

Supervised methods

  • One of the most widely used supervised methods of metagenomic binning is PhyloPythia (McHardy et al. 2007). This method “learns” the characteristics of genome signatures of reference genomes through a support vector machine classifier and then phylogenetically classifies unknown sequences at multiple levels of taxonomy. In addition to nucleotide composition, it also leverages information from phylogenetic marker genes.

  • Naive Bayesian classifiers have also been used to classify short sequence fragments (down to 400 bp) from genomes of pure cultures at an accuracy of 85% (Sandberg et al. 2001).

  • Interpolated Markov models have been used to classify sequences down to 100 bp.

Unsupervised methods

  • One particularly effective method of clustering and visualizing sequences is the self-organizing map (SOM).
  • MaxBin (and MaxBin2) provides an automated pipeline that uses an Expectation-Maximization algorithm, which is reported to push the sequence length threshold down to 1000 bp.
  • MetaBAT also uses a combination of tetranucleotide frequency and abundance (coverage) information and is reported to be fully automated.

  • Differential coverage can be used in combination with nucleotide composition to enhance the resolution of binning (see the sketch below).
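As a concrete, hedged example of combining composition with differential coverage, MetaBAT 2 can be run on the assembled contigs. This is a minimal sketch, not part of the original tutorial; it assumes reads from each sample have already been mapped back to the contigs, producing sorted BAM files under mapping/:

# Hypothetical sketch: tetranucleotide frequency + differential coverage binning with MetaBAT 2
# 1) Summarize per-sample contig coverage from the sorted BAM files
jgi_summarize_bam_contig_depths --outputDepth contig_depth.txt mapping/*.sorted.bam
# 2) Bin the contigs using composition and the coverage table
metabat2 -i temp/22megahit_all/final.contigs.fa \
         -a contig_depth.txt \
         -o binning/metabat2/bin \
         -m 1500 -t 8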


Gene prediction and annotation

Prokka introduction

Prokka: rapid prokaryotic genome annotation. Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

Official Manual in Github


Execute Command:

time parallel --gnu --xapply -j 30 \
            "prokka assembly/megahit/{}/final.contigs.fa \
            --outdir genePre/prokka/{} \
            --prefix mg --metagenome --force --cpus 48 \
            --kingdom Bacteria" \
            >> log/prokka/prokka.log 2>&1 \
            ::: `tail -n+2 metaDataMap/metadata.txt | cut -f 1` ## takes about 1 h
# Notes: prokka's --kingdom takes a single value (Archaea, Bacteria, Mitochondria or Viruses),
# so run it once per kingdom if several are needed. The per-sample output directory matches
# the per-sample assemblies above, and -j 30 jobs x --cpus 48 threads will oversubscribe most
# machines, so scale these down to fit your hardware.

Output Files

Extension and description:

  • .gff: The master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
  • .gbk: A standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
  • .fna: Nucleotide FASTA file of the input contig sequences.
  • .faa: Protein FASTA file of the translated CDS sequences.
  • .ffn: Nucleotide FASTA file of all the predicted transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA).
  • .sqn: An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
  • .fsa: Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
  • .tbl: Feature Table file, used by “tbl2asn” to create the .sqn file.
  • .err: Unacceptable annotations - the NCBI discrepancy report.
  • .log: Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
  • .txt: Statistics relating to the annotated features found.
  • .tsv: Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product
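A quick, optional sanity check on the annotation output (paths follow the prokka command above, with D6303 as an example sample name):

# Count predicted features by type using the .tsv summary (column 2 = ftype)
cut -f2 genePre/prokka/D6303/mg.tsv | tail -n +2 | sort | uniq -c
# Number of nucleotide gene sequences that will feed the non-redundant gene set
grep -c '^>' genePre/prokka/D6303/mg.ffn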

Note

  • Prokka is specifically designed for Bacteria, Archaea and Viruses. It can’t handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.
  • The hmmscan step seems to hang and do nothing?
    The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the --gnu option to behave in the ‘default’ way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change parallel to parallel --gnu

Mandatory

  • BioPerl
    Used for input/output of various file formats
    Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.

  • GNU Parallel
    A shell tool for executing jobs in parallel using one or more computers
    O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47.

  • BLAST+
    Used for similarity searching against protein sequence libraries
    Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421.

  • Prodigal
    Finds protein-coding features (CDS)
    Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119.

  • TBL2ASN
    Prepares sequence records for Genbank submission
    Tbl2asn home page

MetaGeneMark and GeneMark-ES (for eukaryotes)

GeneMark is an ab initio gene prediction software suite that has modes for both metagenomes (MetaGeneMark, Zhu et al., 2010) and eukaryotes (GeneMark-ES, Ter-Hovhannisyan et al., 2008).
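For completeness, a minimal, hedged sketch of running MetaGeneMark on the assembled contigs; it assumes MetaGeneMark is installed and licensed, its model file MetaGeneMark_v1.mod is available, and the paths are illustrative (check the exact flags against the MetaGeneMark documentation for your version):

# Hypothetical: predict genes on per-sample contigs with MetaGeneMark
# -a / -d add protein and nucleotide sequences of predicted genes to the output, -f G writes GFF
gmhmmp -a -d -f G \
       -m MetaGeneMark_v1.mod \
       -o genePre/mgm/D6303.gff \
       assembly/megahit/D6303/final.contigs.fa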


Non-Redundant Gene Set Establishment

CD-HIT introduction

CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.

Official Manual in Github

CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li, et al., 2001) and was then extended to support clustering nucleotide sequences and comparing two datasets

Main commands and meta information:

Command and purpose:

  • cd-hit: Cluster peptide sequences
  • cd-hit-est: Cluster nucleotide sequences
  • cd-hit-2d: Compare 2 peptide databases
  • cd-hit-est-2d: Compare 2 nucleotide databases
  • psi-cd-hit: Cluster proteins at <40% cutoff
  • cd-hit-lap: Identify overlapping reads
  • cd-hit-dup: Identify duplicates from single or paired Illumina reads
  • cd-hit-otu: Cluster rRNA tags
  • cd-hit-para: Cluster sequences in parallel on a computer cluster
  • scripts: Parse results and so on
  • h-cd-hit: Hierarchical clustering

CD-HIT Algorithm

CD-HIT uses a greedy incremental clustering approach. The basic CD-HIT algorithm sorts the input sequences from long to short and processes them sequentially from the longest to the shortest. The first sequence is automatically classified as the first cluster's representative sequence. Each remaining query sequence is then compared to the representative sequences found before it and is classified as redundant or representative depending on whether it is similar to one of the existing representative sequences. In the default (fast) mode, a query is grouped into the first matching representative without being compared to the other representatives. In accurate mode, a query is compared to all representatives and grouped with the most similar one.

Based on this greedy method, the CD-HIT authors established several integrated heuristics that make the program very efficient.

Index Table vs Hash Table

Most sequence alignment tools use short words or k-mers (k is the length) to speed up computation. For example, BLASTN uses 11-mers and BLASTP uses 3-mers by default. An index table, as opposed to the hash table used in most fast alignment methods, uses a unique index for every unique k-mer, so lookups in an index table are much faster than in a hash table. CD-HIT uses k = 2–5 for proteins and k = 8–12 for DNA, because all such k-mers can be indexed directly in computer memory: for example, k = 5 for proteins gives 20^5 = 3.2 million possible words and k = 12 for DNA gives 4^12 ≈ 16.8 million, both small enough to enumerate as array indices.

Short word filter

Two sequences above a certain identity must share at least a specific number of identical k-mers, so it is possible to tell that the identity between two sequences is below a cutoff simply by counting common k-mers. For example, two 100-residue sequences at ≥95% identity differ in at most 5 positions, and each mismatch can destroy at most k k-mers, so with k = 5 they must still share at least (100 − 5 + 1) − 5 × 5 = 71 common 5-mers; any pair sharing fewer can be rejected without alignment. The filter checks the common k-mers and rejects unnecessary alignments.

Short word statistics

Short word statistics are the key to the high efficiency of the short word filter (Li, et al., 2002). Through statistical analysis of real alignments, the authors identified the distribution of common k-mers at different sequence lengths and identities, and applied the results to the short word filter.

Banded alignment

The short word filter not only filters out unnecessary alignments; when an alignment is needed, it also identifies a narrow band for banded dynamic programming alignment, which is much faster than regular dynamic programming.

Reduced alphabet (to be implemented)

This is for protein clustering. In a reduced alphabet, a group of exchangeable residues is reduced to a single residue (I/V/L ==> I, S/T ==> S, D/E ==> D, K/R ==> K, F/Y ==> F), so that conservative mutations appear as identities in sequence alignments. This improves the short word filter for clustering at low sequence identities, below 50%.

Gapped word (to be implemented)

Short word filter using gapped word allows mismatch within a word such as “ACE” vs “AME”, “ACFE” vs “AMYE”, and “AACTT” vs “AAGTT”, which can be written as “101”, “1001” and “11011”. At low identity cutoff, a gapped word is more efficient than an ungapped word for filtering.

CD-HIT-EST

CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters. Since eukaryotic genes usually have long introns, which cause long gaps, it is difficult to make full-length alignments for these genes; CD-HIT-EST is therefore best suited to non-intron-containing sequences such as ESTs.

Basic command:

cd-hit-est -i est_human -o est_human95 -c 0.95 -n 10 -d 0 -M 16000 -T 8

cd-hit-est -i R1.fa -j R2.fa -o R1.95.fa -op R2.95.fa -P 1 -c 0.95 -n 10 -d 0 -M 16000 -T 8

Key parameters for alignment coverage control:

The -aL/-AL and -aS/-AS options specify the required alignment coverage on the longer (representative) sequence and on the shorter sequence, respectively; -s and -S control the allowed length difference between the representative sequence and the other sequences.


Execute Command:

cd-hit-est \
    -i ~/quantitativeProfiling/doughtStress/wetdry/genePre/prokka/D6303/mg.ffn \
    -o ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr \
    -aS 0.9 \
    -c 0.95 \
    -n 10 \
    -G 0 \
    -M 800000 \
    -T 30 \
    -d 0 \
    -g 1 >>log/cdhit/cdhit.log 2>&1
    
    
    
cd-hit-para.pl \
    -i all.genes.fa \
    -o non-redundant.geneset.out \
    --P cd-hit-est \
    --Q 40 \
    --T SGE \
    --S 37 \
    -P priority \
    -q queue \
    --l vf=10g,p=10 \
    -G 0 \
    -aS 0.9 \
    -c 0.95 \
    -M 0 \
    -T 30 \
    -d 0 \
    -g 1 \
    -l 10 \
    --R restart

Command parameter details (see note 4)
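An optional sanity check before moving on (paths follow the cd-hit-est command above; the .clstr file is written automatically next to the representative FASTA):

# Number of non-redundant representative genes in the catalogue
grep -c '^>' ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr
# Number of clusters in the companion .clstr file (should match the count above)
grep -c '^>Cluster' ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr.clstr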

At this point, the gene catalogue construction is complete.


Functional Annotations (NR, MetaCyc, GO, UniProt, KEGG, COG, eggNOG, CAZyome & ARDB)

Difference between the NCBI non-redundant and RefSeq databases (see note 5)

NR Database Intro

NR includes both non-curated and curated databases, whereas RefSeq contains only curated records.

Non-curated databases (low quality):

  • GenBank/GenPept - unreviewed sequences submitted from individual laboratories and large-scale sequencing projects. Since these sequence records are owned by the original submitters and can not be altered, GenBank might contain many low quality sequences.
  • trEMBL - unreviewed section of UniProt. This section contains a computer-annotated supplement of SwissProt that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SwissProt

Curated databases (high quality):

  • RefSeq - GenBank sequences that are manually curated by the NCBI staff. RefSeq records are owned by NCBI and can be updated as needed to maintain current annotation or to incorporate additional information.
  • UniProtKB/SwissProt - manually annotated and reviewed protein sequences
  • PIR - non-redundant annotated protein sequence database
  • PDB - experimentally-determined structures of proteins, nucleic acids, and complex assemblies

Database Attributes

If annotating at the nucleotide level, use diamond blastx; at the protein level, use diamond blastp (first translating the nucleotide gene set with transeq).

transeq \
        -sequence ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr \
        -outseq ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr.fa \
        -trim Y \
        -clean Y \
        -error Y
        
        

The transeq -trim option (see note 7)
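The --db diamond.index used below must be built from the protein reference (e.g. the NCBI NR FASTA) beforehand; a minimal sketch, assuming the NR protein FASTA has been downloaded as nr.faa.gz (file name illustrative):

# Hypothetical: format the protein reference into the DIAMOND index referenced below as --db
diamond makedb --in nr.faa.gz --db diamond.index --threads 30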

diamond blastp \
        --threads 30 \
        --db diamond.index \
        --query non-redundant.geneset.pro.faa \
        --out diamondout \
        --outfmt 6 \
        --log \
        --max-target-seqs 50 \
        --evalue 1e-5 \
        --sensitive \
        --block-size 6 \
        --index-chunks 1

BLAST mode (see note 8)


Contig quantification with salmon

time parallel -j ${j} \
            'salmon quant -i temp/22salmon_contig/index -l A -p ${p} --meta \
            -1 temp/11qc/{1}_1_kneaddata_paired_1.fastq -2 temp/11qc/{1}_1_kneaddata_paired_2.fastq \
            -o temp/22salmon_contig/{1}.quant' \
            ::: `tail -n+2 result/metadata.txt | cut -f 1`
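The index referenced by -i must be built from the assembled contigs first; a minimal sketch, assuming the co-assembled contigs are the ones used for kraken2 below (temp/22megahit_all/final.contigs.fa):

# Hypothetical: build the salmon index used by the quantification step above
salmon index -t temp/22megahit_all/final.contigs.fa \
             -i temp/22salmon_contig/index \
             -p ${p}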
            

Contig taxonomy annotation with kraken2

kraken2 --db ${kraken2_db} temp/22megahit_all/final.contigs.fa \
    --threads ${p} --use-names \
    --output result/22megahit_all/kraken2.tax
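Optionally (not in the original command), kraken2 can also write a per-taxon summary via --report, which gives a quick overview of the contig classifications:

# Hypothetical variant: add a per-taxon summary report alongside the per-contig output
kraken2 --db ${kraken2_db} temp/22megahit_all/final.contigs.fa \
    --threads ${p} --use-names \
    --report result/22megahit_all/kraken2.report \
    --output result/22megahit_all/kraken2.tax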

Assembly-free metagenomic profiling


  1. Override a group of parameters; possible values: meta-sensitive: '--min-count 1 --k-list 21,29,39,49,…,129,141'; meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil).

  2. Input modes: -1/-2 for paired-end reads (comma-separated lists are supported), -r for single-end reads, and --12 for interleaved paired-end reads.

  3. For each sample, reads were assembled with a series of different k-mer sizes in parallel and then mapped back to each assembly with SOAP2; the optimal k-mer size and assembly were chosen based on both contig N50 and mapping rate. De novo metagenome assembly was then re-performed with IDBA-UD. Only contigs longer than 500 bp were kept for further analysis.

  4. -G 0: use local sequence identity (default is 1, global); when set to 0, combine with alignment coverage controls such as -aS/-aL. -M 800000: maximal memory (MB). -T 30: number of threads. -d 0: length of the description in the .clstr file (default 20); if set to 0, it takes the FASTA defline and stops at the first space. -g 1: 0 = faster mode, 1 = accurate mode.

  5. https://www.biostars.org/p/164641/

  6. The best way to obtain BLAST databases is to download them from NCBI or the cloud (currently from Google Cloud Platform, an experimental feature). These are the same databases available via the public BLAST Web Service (https://blast.ncbi.nlm.nih.gov), are updated regularly, and contain taxonomic information built into them. They can also be a source of biological sequence data. To download a preformatted NCBI BLAST database, run the update_blastdb.pl program followed by any relevant options and the name(s) of the BLAST databases to download.

  7. [N] This removes all 'X' and '*' characters from the right end of the translation. The trimming process starts at the end and continues until the next character is not an 'X' or a '*'.

  8. BLASTX takes a DNA sequence as its query; BLASTP takes a protein sequence as its query.

Tank (Xiao-Ning Zhang)
PhD Student @ Data Miner & Coder

I’m a PhD Student majoring in Bioinformatics and Biostatistics who loves computer programming such as C(++), Java, Python and R.
