Photo by Claudia Fernández Ortiz on Unsplash

Rice metagenome gene catalogue establishment

Last updated on Sep 12, 2019 18 min read 0 Comments metagenomics, rice gene catalogue

Photo by Claudia Fernández Ortiz on Unsplash

Rice metagenome gene catalogue establishment

Last updated on Sep 12, 2019 18 min read 0 Comments metagenomics, rice gene catalogue

Tasks assigned and Questions
De novo assemble pipeline

Original Publish Date: 12 August, 2019

paste(“Updated on:”, format(Sys.time(), ‘%d %B, %Y’))

This Tutorial mainly Originated form Huttenhower Lab Official Reference and Modified by Tank

Tasks assigned and Questions

1 gene catalogue pipeline in details

Reference-based pipeline 👌
De novo Assembly pipeline ❔

De novo assemble pipeline

khmer generation

time parallel -j 8 \
            'interleave-reads.py temp/11qc/{1}_1_kneaddata_paired_1.fastq temp/11qc/{1}_1_kneaddata_paired_2.fastq | \
trim-low-abund.py -V -M ${khmer_memory} -Z ${khmer_high} -C ${khmer_low} - -o temp/21khmer/{1}.fq; \
split-paired-reads.py -f -0 temp/21khmer/{1}_0.fq -1 temp/21khmer/{1}_1.fq -2 temp/21khmer/{1}_2.fq temp/21khmer/{1}.fq' \
::: `tail -n+2 result/metadata.txt | cut -f 1`

Assemble

Problem of Seven-Brideg and Euler graph

K-mer and De Bruign graph

K-mer: k-mers are subsequences of length k contained within a biological sequence

$k-mers are subsequences of length contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides (i.e. A, T, G, and C), k-mers are capitalized upon to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence’s subsequences of length , such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length L will have k-mers and total possible k-mers, where n is number of possible monomers (e.g. four in the case of DNA).

The frequency of a set of k-mers in a species’a genome, in a genomic region, or in a class of sequences can be used as a “signature” of the underlying sequence. Comparing these frequencies are computationally easier than sequence alignment, and is an important method in alignment-free sequence analysis. It can also be used as a first stage analysis before an alignment.

Assembly of Hamilton and Euler cycle

time parallel -j 8 \
     'megahit -t 30 \
              --continue\
              --presets meta-large\
              --k-list 21,29,39,59,79,99,119,141 \
              -1 raw_data_example/{1}_1.fq -2 raw_data_example/{1}_2.fq \
              -o assembly/megahit/ >> log/megahit/megahit.log 2>&1 ' \
              ::: `tail -n+2 metaDataMap/metadata.txt | cut -f 1`

tunable parameter¹

input 3 mode²

good assembly standard³

Reference Required and other Materials

library(RefManageR)
# library(knitcitations)
# library(rcrossref)
bib <- ReadBib("metaAssembly.bib")
myopts <- BibOptions(bib.style = "authoryear", style="latex", first.inits=FALSE, max.names = 20)

Boisvert, Sébastien, Frédéric Raymond, Élénie Godzaridis, François Laviolette, and Jacques Corbeil (2012). ``Ray Meta: scalable de novo metagenome assembly and profiling’’. In: 13.12, p. R122.

Kultima, Jens Roat, Shinichi Sunagawa, Junhua Li, Weineng Chen, Hua Chen, Daniel R Mende, Manimozhiyan Arumugam, Qi Pan, Binghang Liu, Junjie Qin, and others (2012). ``MOCAT: a metagenomics assembly and gene prediction toolkit’’. In: 7.10, p. e47656.

Li, Dinghua, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam (2015). ``MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph’’. In: 31.10, pp. 1674–1676.

Nagarajan, Niranjan and Mihai Pop (2013). ``Sequence assembly demystified’’. In: 14.3, p. 157.

Namiki, Toshiaki, Tsuyoshi Hachiya, Hideaki Tanaka, and Yasubumi Sakakibara (2012). ``MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads’’. In: 40.20, pp. e155–e155.

Narasingarao, Priya, Sheila Podell, Juan A Ugalde, Céline Brochier-Armanet, Joanne B Emerson, Jochen J Brocks, Karla B Heidelberg, Jillian F Banfield, and Eric E Allen (2012). ``De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities’’. In: 6.1, p. 81.

Nielsen, H Bjørn, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen, Junhua Li, Shinichi Sunagawa, Damian R Plichta, Laurent Gautier, Anders G Pedersen, Emmanuelle Le Chatelier, and others (2014). ``Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes’’. In: 32.8, p. 822.

Pell, Jason, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M Tiedje, and C Titus Brown (2012). ``Scaling metagenome sequence assembly with probabilistic de Bruijn graphs’’. In: 109.33, pp. 13272–13277.

Thomas, Torsten, Jack Gilbert, and Folker Meyer (2012). ``Metagenomics-a guide from sampling to data analysis’’. In: 2.1, p. 3.

Treangen, Todd J, Sergey Koren, Daniel D Sommer, Bo Liu, Irina Astrovskaya, Brian Ondov, Aaron E Darling, Adam M Phillippy, and Mihai Pop (2013). ``MetAMOS: a modular and open source metagenomic assembly and analysis pipeline’’. In: 14.1, p. R2.

Binning (metagenomics)

Metagenome assemblies are highly fragmented, comprising thousands of contigs, and researchers do not know a priori which contig derives from which genome, or even know how many genomes are present. The aim of contig ‘binning’ is to group contigs into species. Supervised binning methods use databases of already sequenced genomes to label contigs into taxonomic classes. Unsupervised (clustering)methods look for natural groups in the data

binning is the process of grouping reads or contigs and assigning them to operational taxonomic units. Binning methods can be based on either compositional features or alignment (similarity), or both

binning techniques represent a “best effort” to identify reads or contigs with certain groups of organisms designated as operational taxonomic units (OTUs)

metagenomic assembly produces contigs of short to intermediate length (1–100s of kb). Assigning these contigs to particular microbial groups and/or organisms, a process known as metagenomic binning

There are two main approaches to binning.

The first involves comparing the unknown gene/contig sequences to sequences of known phylogenetic affiliation.
The second method is to assign contigs to taxonomic groups based on signatures of nucleotide composition such as GC content or tetranucleotide frequency.

advantage of using nucleotide composition for binning is that the signal is pervasive throughout the genome, and horizontally transferred genes quickly take on the signal of their host

Nucleotide composition-based approaches are particularly valuable for novel genes

Note that in a strict sense, “binning,” in which sequences are clustered into subgroups that are not necessarily identified taxonomically, should be distinguished from “classification” in which taxonomic labels have been applied

Genomic Signatures of Nucleotide Composition

microorganisms have characteristic patterns of nucleotide composition that are conserved within genomes and distinct between genomes. Such compositional biases were recognized at the very beginning of the age of genomics and metagenomics

The biases include features such as GC content ((G+C)/(A+T+C+G)), and the relative abundance of oligonucleotide sequences of a given length (di-, tri-, tetra-, etc. nucleotides

oligonucleotides of any length can be used for binning. What length is optimal? There is a trade-off between information content and length of contig required to provide sufficient data on frequency of specific oligo sequences. Longer oligo patterns contain more information and offer higher specificity (Bohlin et al. 2008). Yet the number of possible oligonucleotides increases exponentially with oligo length: di-, tri-, tetra-, and pentanucleotides have 16, 64, 256, and 1024 possible oligos, respectively (including nonunique reverse complements). Thus, in order to generate a statistically significant frequency histogram, more “samples” are required for longer oligos. In other words, longer contigs are required for longer oligos. Tetranucleotide frequencies typically provide acceptable results down to ~2.5 kb and are very robust above 5 kb, and thus find wide application in current metagenomics projects where contigs in this length range are abundant

Binning by nucleotide composition can be approached by either supervised or unsupervised methods. Supervised methods are trained on the signatures of reference genomes to construct a model that is then used to assign unknown contigs to the reference genomes

Unsupervised approaches do not rely on a priori information about the genome signatures of reference genomes. Rather, they group unknown sequences into clusters directly based on the similarity of their genome signatures

Supervised methods

One of the most widely used supervised methods of metagenomic binning is Phylopithia (McHardy et al. 2007). This method “learns” the characteristics of genome signatures of reference genomes through a support vector machine classifier and then phylogenetically classifies unknown sequences at multiple levels of taxonomy. In addition to nucleotide composition, it also leverages information from phylogenetic marker genes
Naive Bayesian classifiers have also been used to classify short sequence fragments (down to 400 bp) from genomes of pure cultures at an accuracy of 85% (Sandberg et al. 2001)
Interpolated Markov models to classify sequences down to 100 bp

Unsupervised method

One particularly effective method of clustering and visualizing sequences is self-organizing maps (SOM)
MaxBin (and MaxBin2) provides an automated pipeline that uses an Expectation-Maximization algorithm, which is reported to push the sequence length threshold down to 1000 bp
MetaBat also uses a combination of tetranucleotide frequency and abundance (coverage) information and is reported to be fully automated
Differential coverage can be used in combination with nucleotide composition to enhance the resolution of binning

Gene prediction and annotation

Prokka and Intro

Prokka: rapid prokaryotic genome annotation,Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

Official Manual in Github

Execute Command:

time parallel --gnu --xapply -j 30 \
            "prokka assembly/megahit/final.contigs.fa --outdir genePre/prokka/ {}/\
            -prefix mg --metagenome --force --cpus 48 \
            --kingdom Archaea,Bacteria,Mitochondria,Viruses" \
            >> log/prokka/prokka.log 2>&1 \
            ::: `tail -n+2  metaDataMap/metadata.txt | cut -f 1` ## 1h time-consuming

Output Files

Extension	Description
.gff	This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk	This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna	Nucleotide FASTA file of the input contig sequences.
.faa	Protein FASTA file of the translated CDS sequences.
.ffn	Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqn	An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa	Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl	Feature Table file, used by “tbl2asn” to create the .sqn file.
.err	Unacceptable annotations - the NCBI discrepancy report.
.log	Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the –quiet option was enabled.
.txt	Statistics relating to the annotated features found.
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product

Note

Prokka is specifically designed for Bacteria, Archaea and Viruses. It can’t handle multi-exon gene models; I would recommend using MAKER 2 for that purpose.
The hmmscan step seems to hang and do nothing?
The problem here is GNU Parallel. It seems the Debian package for hmmer has modified it to require the --gnu option to behave in the ‘default’ way. There is no clear reason for this. The only way to restore normal behaviour is to edit the prokka script and change parallel to parallel --gnu

Mandatory

BioPerl
Used for input/output of various file formats
Stajich et al, The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.
GNU Parallel
A shell tool for executing jobs in parallel using one or more computers
O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, Feb 2011:42-47.
BLAST+
Used for similarity searching against protein sequence libraries
Camacho C et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421.
Prodigal
Finds protein-coding features (CDS)
Hyatt D et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119.
TBL2ASN Prepare sequence records for Genbank submission Tbl2asn home page

Aragorn
Finds transfer RNA features (tRNA)
Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004 Jan 2;32(1):11-6.
Barrnap
Used to predict ribosomal RNA features (rRNA). My licence-free replacement for RNAmmmer.
Manuscript under preparation.
HMMER3
Used for similarity searching against protein family profiles
Finn RD et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37.

MetaGeneMark for Eukaryotes

GeneMark is an abinitio gene prediction software suite that has modes for both metagenomes (MetaGeneMark, Zhu et al., 2010) and Eukaryotes (GeneMark-ES, Ter-hovhannisyan et al., 2008)

Non-Redundant Gene Set Establishment

CD-HIT and Intro

CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.

Official Manual in Github

CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li, et al., 2001) and was then extended to support clustering nucleotide sequences and comparing two datasets

mainly command and meta info:

Command	Info
* cd-hit	Cluster peptide sequences
cd-hit-est	Cluster nucleotide sequences
* cd-hit-2d	Compare 2 peptide databases
* cd-hit-est-2d	Compare 2 nucleotide databases
* psi-cd-hit	Cluster proteins at <40% cutoff
* cd-hit-lap	Identify overlapping reads
* cd-hit-dup	Identify duplicates from single or paired Illumina reads
* cd-hit-otu	Cluster rRNA tags
* cd-hit-para	Cluster sequences in parallel on a computer cluster
* scripts	Parse results and so on
* h-cd-hit	Hierarchical clustering

CD-HIT Algorithm

CD-HIT is a greedy incremental clustering approach. The basic CD-HIT algorithm sorts the input sequences from long to short, and processes them sequentially from the longest to the shortest. The first sequence is automatically classified as the first cluster representative sequence. Then each query sequence of the remaining sequences is compared to the representative sequences found before it, and is classified as redundant or representative based on whether it is similar to one of the existing representative sequences. In default manner (fast mode), a query is grouped into the first representative without comparing to other representatives. In accurate mode, a query is compared to all representatives and grouped to the most similar one

Based on this greedy method, we established several integrated heuristics that make CD-HIT very efficient.

Index Table vs Hash Table

Most sequence alignment tools use short words or k-mer (k is the length) to speed up computation. For example, BLASTN uses 11-mer and BLASTP uses 3-mer by default. An index table, as opposite to a hash table used in most fast alignment methods, uses a unique index for every unique k-mer, therefore an index table is much faster than a hash table. In CD-HIT, we use k=2~5 for proteins and k=8~12 for DNAs, because the all the k-mers can be indexed in computer memory.

Short word filter

Two sequences of a certain identity must share at least a specific number of identical k-mers. It is possible to tell that the identity between two sequences is below a cutoff by counting common k-mers. The filter checks the common k-mers and rejects unnecessary alignments.

Short word statistics

It is the key for the high efficiency of a short word filter (Li, et al., 2002). Through statistical analysis of real alignments, we identified the distribution of common k-mer at different sequence lengths and identities, and applied the results to short word filter.

Banded alignment

The short word filter not only filters out unnecessary alignment, when alignment is needed, it also identifies a narrow band for banded dynamic programming alignment, which is much faster than regular dynamic programming.

Reduced alphabet (to be implemented)

This is for protein clustering. In reduced alphabet, a group of exchangeable residues are reduced to a single residue (I/V/L==>I, S/T==>S, D/E==>D, K/R==>K, F/Y==>F), and then conservative mutations would appear as identities in sequence alignments. It improves the short word filter for clustering at low sequence identity below 50%.

Gapped word (to be implemented)

Short word filter using gapped word allows mismatch within a word such as “ACE” vs “AME”, “ACFE” vs “AMYE”, and “AACTT” vs “AAGTT”, which can be written as “101”, “1001” and “11011”. At low identity cutoff, a gapped word is more efficient than an ungapped word for filtering.

CD-HIT-EST

CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in fasta format and the output are two files: a fasta file of representative sequences and a text file of list of clusters. Since eukaryotic genes usually have long introns, which cause long gaps, it is difficult to make full-length alignments for these genes. So, CD-HIT-EST is good for non-intron containing sequences like EST.

Basic command:

cd-hit-est -i est_human -o est_human95 -c 0.95 -n 10 -d 0 -M 16000 -T 8 cd-hit-est -i R1.fa -j R2.fa -o R1.95.fa -op R2.95.fa -P 1 -c 0.95 -n 10 -d 0 -M 16000 -T 8

Key parameters Alignment coverage control:

See the figure below, the -aL, -AL, -aS and -AS options can be used to specify the alignment coverage on both the representative sequence and other sequences. -s and -S can control the length difference between the representative sequence and other sequences.

Exececute Command:

cd-hit-est \
    -i ~/quantitativeProfiling/doughtStress/wetdry/genePre/prokka/D6303/mg.ffn \
    -o ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr \
    -aS 0.9 \
    -c 0.95 \
    -n 10 \
    -G 0 \
    -M 800000 \
    -T 30 \
    -d 0 \
    -g 1 >>log/cdhit/cdhit.log 2>&1
    
    
    
cd-hit-para.pl \
    -i all.genes.fa \
    -o non-redundant.geneset.out \
    --P cd-hit-est \
    --Q 40 \
    --T SGE \
    --S 37 \
    -P priority \
    -q queue \
    --l vf=10g,p=10 \
    -G 0 \
    -aS 0.9 \
    -c 0.95 \
    -M 0 \
    -T 30 \
    -d 0 \
    -g 1 \
    -l 10 \
    --R restart

command parameters details⁴

So far the Gene Catalog Construction has been done

Functional Annotations (NR, MetaCyc,GO, UniProt KEGG,COG eggNOG, CAZyome & ARDB)

Difference between NCBI non-redundant and refseq database⁵

NR Database Intro

NR include non-curated and curated database, refseq only curated ### Non-curated databases (low quality):

GenBank/GenPept - unreviewed sequences submitted from individual laboratories and large-scale sequencing projects. Since these sequence records are owned by the original submitters and can not be altered, GenBank might contain many low quality sequences.
trEMBL - unreviewed section of UniProt. This section contains a computer-annotated supplement of SwissProt that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SwissProt

Curated databases (high quality):

RefSeq - GenBank sequences that are manually curated by the NCBI staff. RefSeq records are owned by NCBI and can be updated as needed to maintain current annotation or to incorporate additional information.
UniProtKB/SwissProt - manually annotated and reviewed protein sequences
PIR - non-redundant annotated protein sequence database
PDB - experimentally-determined structures of proteins, nucleic acids, and complex assemblies

Database Attributes

nr ⁶

update_blastdb.pl --decompress nr [*] # This command will download the compressed nr BLAST database from NCBI to the current working directory and decompress it

metacyc in biocyc

email license required for academic usage
kegg kegg_dmnd=/db/kegg/kegg_76

fee license required for adademic usage
cog

wget -c -r 1 -np -nd ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data
eggnog eggnog_db=/db/eggnog

wget -c -r 1 -np -nd http://eggnogdb.embl.de/download/
cazy dbcaz2_dmnd=/db/protein/dbcan2/CAZyDB.07312018
dbcaz2_anno=/db/protein/dbcan2/fam_description.txt

nohup wget -c -nd -np -r http://bcb.unl.edu/dbCAN2/download/
[ardb]

nohup wget -c -nd -np -r https://card.mcmaster.ca/download/5/ontology-v3.0.4.tar.gz
resfam resfams_dmnd=/mnt/zhou/yongxin/db/ResFams/Resfams-proteins
resfams_anno=/mnt/bai/yongxin/data/db/ResFams/Resfams-proteins_class.tsv

if the sequence-level, using diamond blastx, protein-level diamond blastp(using transeq to translate first)

transeq \
        -sequence ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr \
        -outseq ~/quantitativeProfiling/doughtStress/wetdry/nrCDHITEST/mg.ffn.nr.fa \
        -trim Y
        -clean Y
        -error Y

trim⁷

diamond blastp \
        --thread 30 \
        --db diamond.index \
        --query non-redundant.geneset.pro.faa \
        --out diamondout \
        --outfmt 6 \
        --log  \
        --max-target-seqs 50 \
        --evalue 1e-5 \
        --sensitive \
        --block-size 6 \
        --index-chunks 1

Blast Mode⁸

Contig quantification salmon

time parallel -j ${j} \
            'salmon quant -i temp/22salmon_contig/index -l A -p ${p} --meta \
            -1 temp/11qc/{1}_1_kneaddata_paired_1.fastq -2 temp/11qc/{1}_1_kneaddata_paired_2.fastq \
            -o temp/22salmon_contig/{1}.quant' \
            ::: `tail -n+2 result/metadata.txt | cut -f 1`

Contig taxon annotation kraken2

kraken2 --db ${kraken2_db} temp/22megahit_all/final.contigs.fa \
    --threads ${p} --use-names \
    --output result/22megahit_all/kraken2.tax

Assembly-free metagenomic profiling

override a group of parameters; possible values: meta-sensitive: ‘–min-count 1 –k-list 21,29,39,49,…,129,141’ meta-large: ‘–k-min 27 –k-max 127 –k-step 10’ (large & complex metagenomes, like soil)↩
-1 -2 support list -r se -12 interleaved↩
for each sample, reads were assembled with a series of different k-mer size in parallel, and then were mapped back to each assembly result with SOAP2 , and the optimal k-mer size and assembly result were chosen depending on both contig N50 and mapping rate. De novo metagenome assembly was reperformed with IDBA-UD. Only those contigs with more than 500 bp were kept for further analysis↩
-G 0 # global sequence identity, default 1 if set to 0, then use local sequence identity, combination with alignment control such as -aS -aL -M 800000 # maximal memory -T 30 # thread -d 0 # length of description in .clstr file, default 20 if set to 0, it takes the fasta defline and stops at first space -g 1 # faster mode 0 or accurate mode 1↩
https://www.biostars.org/p/164641/↩
The best way to obtain BLAST databases is to download them from NCBI or the cloud (currently from Google Cloud Platform – an experimental feature - see details below). These are the same databases available via the public BLAST Web Service (https://blast.ncbi.nlm.nih.gov), are updated regularly, and contain taxonomic information built into them. These can also be a source of biological sequence data (see below).To download a preformatted NCBI BLAST database, run the update_blastdb.pl program followed by any relevant options and the name(s) of the BLAST databases to download. For example:↩
[N] This removes all ‘X’ and ‘’ characters from the right end of the translation. The trimming process starts at the end and continues until the next character is not a ’X’ or a ’’↩
BlastX requires as query a DNA sequence; BlastP requires as query a protein↩

metagenomics

Tank (Xiao-Ning Zhang)

PhD Student @ Data Miner & Coder

I’m a PhD Student majoring in Bioinformatics and Biostatistics who loves computer programming such as C(++), Java, Python and R.

Rice metagenome gene catalogue establishment

Rice metagenome gene catalogue establishment

Tasks assigned and Questions

De novo assemble pipeline

khmer generation

Assemble

Problem of Seven-Brideg and Euler graph

K-mer and De Bruign graph

Assembly of Hamilton and Euler cycle

Reference Required and other Materials

Binning (metagenomics)

Genomic Signatures of Nucleotide Composition

Gene prediction and annotation

Prokka and Intro

Output Files

Note

Mandatory

Recommended

MetaGeneMark for Eukaryotes

Non-Redundant Gene Set Establishment

CD-HIT and Intro

CD-HIT Algorithm

Index Table vs Hash Table

Short word filter

Short word statistics

Banded alignment

Reduced alphabet (to be implemented)

Gapped word (to be implemented)

CD-HIT-EST

Key parameters Alignment coverage control:

Functional Annotations (NR, MetaCyc,GO, UniProt KEGG,COG eggNOG, CAZyome & ARDB)

Curated databases (high quality):

Database Attributes

Contig quantification salmon

Contig taxon annotation kraken2

Assembly-free metagenomic profiling

Tank (Xiao-Ning Zhang)

PhD Student @ Data Miner & Coder

Related