snpEff
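This software is quite important, especially for anyone studying genetic variation. After SNP calling, many people like to annotate with ANNOVAR, but that annotation is relatively simple: it only tells you which region of a gene the variant falls in and which gene it is. If you want more detail, you need a more function-oriented tool, and SnpEff is one of the best.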
The most widely used variant annotation programs are SnpEff, ANNOVAR, and ENSEMBL’s VEP.
ANN replaces EFF
The new ‘ANN’ field differs from the previous ‘EFF’ field in ways that break compatibility with earlier SnpEff versions. To keep using the old ‘EFF’ field, pass the -formatEff command line option.
Script facilitated
Look at the VCF header:
“Functional annotations: ‘Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO’”
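The header line above lists the ANN subfields in order, so the names can be read from the VCF itself rather than hard-coded. The following is a minimal sketch, assuming the `##INFO` line has the layout quoted above (field list between single quotes, separated by `|`); the helper name `ann_field_names` and the exact `HEADER` string are made up for illustration and may differ from what your SnpEff version writes.

```python
import re

# Example ##INFO header line (assumed layout; check your own VCF header).
HEADER = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: '
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | "
    "Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | "
    "cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | "
    "Distance | ERRORS / WARNINGS / INFO' \">"
)


def ann_field_names(header_line):
    """Return the ANN subfield names listed between single quotes in the header."""
    match = re.search(r"'([^']*)'", header_line)
    if match is None:
        raise ValueError("no quoted field list found in header line")
    # Subfields are separated by '|'; surrounding spaces are not part of the name.
    return [name.strip() for name in match.group(1).split("|")]


names = ann_field_names(HEADER)
print(len(names), "subfields")
print(names)
```

Reading the names from the header keeps downstream parsing in step with the file, which matters because the ‘ANN’ and ‘EFF’ layouts are not interchangeable.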
LOF = loss of function; NMD = nonsense-mediated decay
Classification of variants by potential impact on protein function
Using SnpEff and publicly available rice data sets, we predicted the effects of variants on protein function and categorized all of the line-specific variants into 23 effect types, which we then grouped into four larger categories (HIGH, MODERATE, LOW or MODIFIER) (snpEff manual: http://snpeff.sourceforge.net/SnpEff_manual.html).
The assignment criteria were pre-defined in the annotation program (SnpEff).
Most variants belonged to the MODIFIER category, which is inferred to have only a weak impact. In the HIGH category, 40 out of 47 variants were frame shifts and their numbers were similar among all lines. In the MODERATE category, there were 21 non-synonymous nucleotide changes in the coding regions (which change an amino acid) in line 50A, 15 in 51A, 21 in 55A, eight in WT1, and seven in WT2. In the LOW category, the number of synonymous changes (the main type in this category) was slightly higher in 50A than in the other four lines.
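A tally like the one described above can be sketched in a few lines of Python. This is only an illustration, not the cited study's pipeline: it assumes each variant's ANN value is a comma-separated list of `|`-delimited annotations with the impact in the third subfield, and it counts each variant once under its most severe impact. The ANN strings and gene names below are invented.

```python
from collections import Counter

# The four impact categories, ordered from most to least severe.
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]


def most_severe_impact(ann_value):
    """Return the most severe impact among all annotations in one ANN value."""
    impacts = {entry.split("|")[2] for entry in ann_value.split(",")}
    for impact in IMPACT_ORDER:
        if impact in impacts:
            return impact
    raise ValueError("no known impact category found")


# Invented ANN values, one per variant, for illustration only.
ann_values = [
    "A|frameshift_variant|HIGH|GeneA|GeneA|transcript|T1|protein_coding|2/5|||||||",
    "T|missense_variant|MODERATE|GeneB|GeneB|transcript|T2|protein_coding|1/3|||||||,"
    "T|upstream_gene_variant|MODIFIER|GeneC|GeneC|transcript|T3|protein_coding||||||||",
    "G|synonymous_variant|LOW|GeneD|GeneD|transcript|T4|protein_coding|4/4|||||||",
    "C|downstream_gene_variant|MODIFIER|GeneE|GeneE|transcript|T5|protein_coding||||||||",
]

counts = Counter(most_severe_impact(value) for value in ann_values)
for impact in IMPACT_ORDER:
    print(impact, counts[impact])
```

Counting by the most severe impact is one possible convention; counting every annotation separately would give larger numbers, especially for MODIFIER.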
snpEff ANN field description
- Allele (or ALT):
- Annotation (a.k.a. effect): Annotated using Sequence Ontology terms. Multiple effects can be concatenated using ‘&’
- Putative_impact: A simple estimation of putative impact or deleteriousness: {HIGH, MODERATE, LOW, MODIFIER}
- Gene Name: Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”
- Gene ID
- Feature type: Which type of feature is in the next field. It is preferred to use Sequence Ontology (SO) terms, but ‘custom’ (user-defined) types are allowed
- Feature ID: Depending on the annotation, this may be: Transcript ID (preferably using version number), Motif ID, miRNA, ChipSeq peak, Histone mark, etc. Note: Some features may not have an ID (e.g. histone marks from custom Chip-Seq experiments may not have a unique ID).
- Transcript biotype: The bare minimum is at least a description on whether the transcript is {“Coding”, “Noncoding”}. Whenever possible, use ENSEMBL biotypes
- Rank / total: Exon or Intron rank / total number of exons or introns
- HGVS.c: Variant using HGVS notation (DNA level) [http://www.hgvs.org/]. HGVS uses a prefix to indicate the reference sequence type:
  - “c.” for a coding DNA sequence (like c.76A>T)
  - “g.” for a genomic sequence (like g.476A>T)
  - “m.” for a mitochondrial sequence (like m.8993T>C, see Reference Sequence)
  - “n.” for a non-coding RNA reference sequence (gene producing an RNA transcript but not a protein, see Community consultation 002)
  - “r.” for an RNA sequence (like r.76a>u)
  - “p.” for a protein sequence (like p.Lys76Asn)
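To make the field list above concrete, here is a small sketch that turns one ANN annotation into a dictionary keyed by the subfield names from the header, splitting ‘&’-concatenated effects into a list. The function name `parse_annotation` and the example annotation (gene, transcript ID, positions) are invented; the sketch assumes the 16-subfield layout shown in the header and will reject anything else.

```python
# Subfield names in the order given by the ANN header line.
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos / cDNA.length", "CDS.pos / CDS.length",
    "AA.pos / AA.length", "Distance", "ERRORS / WARNINGS / INFO",
]


def parse_annotation(entry):
    """Parse a single '|'-delimited ANN annotation into a dict."""
    fields = entry.split("|")
    if len(fields) != len(ANN_FIELDS):
        raise ValueError(
            f"expected {len(ANN_FIELDS)} subfields, got {len(fields)}"
        )
    record = dict(zip(ANN_FIELDS, fields))
    # Several Sequence Ontology terms may be joined with '&'.
    record["Annotation"] = record["Annotation"].split("&")
    return record


# Invented annotation for illustration only.
entry = (
    "T|missense_variant&splice_region_variant|MODERATE|GeneB|GeneB|transcript|"
    "T2.1|protein_coding|3/7|c.78A>T|p.Lys26Asn|150/1200|78/900|26/299||"
)

record = parse_annotation(entry)
print(record["Annotation"])
print(record["Annotation_Impact"], record["HGVS.c"], record["HGVS.p"])
```

Empty subfields (here Distance and ERRORS / WARNINGS / INFO) come back as empty strings, so callers can distinguish "not applicable" from a missing column.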
Multiple annotations per VCF line
Usually there is more than one annotation reported in each ANN (or EFF) field.
There are several reasons for this:
- A variant can affect multiple genes. E.g., a variant can be DOWNSTREAM from one gene and UPSTREAM from another gene.
- In complex organisms, genes usually have multiple transcripts, so SnpEff reports the effect of a variant on each transcript.
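Both reasons show up directly in the data: one ANN value holds several comma-separated annotations, possibly for different genes and for several transcripts of the same gene. The sketch below groups them by gene name. It assumes the ANN layout from the header (effect in subfield 2, gene name in subfield 4, feature ID in subfield 7); the function name `effects_by_gene` and the example genes and transcripts are invented.

```python
from collections import defaultdict


def effects_by_gene(ann_value):
    """Map gene name -> list of (feature ID, annotation) pairs for one ANN value."""
    grouped = defaultdict(list)
    for entry in ann_value.split(","):
        fields = entry.split("|")
        annotation, gene_name, feature_id = fields[1], fields[3], fields[6]
        grouped[gene_name].append((feature_id, annotation))
    return dict(grouped)


# Invented ANN value: downstream of two GeneA transcripts, upstream of GeneB.
ann_value = ",".join([
    "A|downstream_gene_variant|MODIFIER|GeneA|GeneA|transcript|A.1|protein_coding||||||||",
    "A|downstream_gene_variant|MODIFIER|GeneA|GeneA|transcript|A.2|protein_coding||||||||",
    "A|upstream_gene_variant|MODIFIER|GeneB|GeneB|transcript|B.1|protein_coding||||||||",
])

for gene, effects in effects_by_gene(ann_value).items():
    print(gene, effects)
```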
Filter using SnpSift:
Get all entries having LOF with genes that have more than 50% of transcripts affected:
    cat file.eff.vcf | java -jar SnpSift.jar filter "LOF[*].PERC > 0.5"
Get all entries having NMD with genes that have more than 3 transcripts:
    cat file.eff.vcf | java -jar SnpSift.jar filter "NMD[*].NUMTR > 3"
Note: From version 4.0 onwards, the ‘-lof’ command line option is not required.
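For readers who want to see what the LOF filter above is checking, here is a rough Python equivalent. It is a sketch under an assumption, not a replacement for SnpSift: it assumes LOF values in the INFO column look like `(Gene_Name|Gene_ID|Number_of_transcripts|Percent_affected)`, which should be confirmed against the LOF line in your own VCF header. The function names and the example INFO strings are invented.

```python
import re


def parse_lof(info):
    """Return a list of dicts parsed from the LOF entry of a VCF INFO string."""
    match = re.search(r"(?:^|;)LOF=([^;]+)", info)
    if match is None:
        return []
    records = []
    for item in match.group(1).split(","):
        # Assumed layout: (gene name | gene ID | transcripts in gene | fraction affected)
        gene, gene_id, numtr, perc = item.strip("()").split("|")
        records.append({
            "GENE": gene,
            "GENEID": gene_id,
            "NUMTR": int(numtr),
            "PERC": float(perc),
        })
    return records


def passes_lof_filter(info, min_perc=0.5):
    """True if any LOF record has more than min_perc of transcripts affected."""
    return any(rec["PERC"] > min_perc for rec in parse_lof(info))


# Invented INFO strings for illustration only.
infos = [
    "DP=20;LOF=(GeneA|GeneA_id|4|0.75);NMD=(GeneA|GeneA_id|4|0.75)",
    "DP=15;LOF=(GeneB|GeneB_id|10|0.20)",
    "DP=30",
]

for info in infos:
    print(passes_lof_filter(info), info)
```

The `any(...)` mirrors the `[*]` in the SnpSift expression: a variant is kept if at least one of its LOF records satisfies the condition.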
SnpEff Summary
So, you have a huge file describing all the differences between your sample and the reference genome. But you want to know more about these variants than just their genetic coordinates. E.g.: Are they in a gene? In an exon? Do they change protein coding? Do they cause premature stop codons? SnpEff can help you answer all these questions. The process of adding this information about the variants is called “Annotation”. SnpEff provides several degrees of annotation, from simple (e.g. which gene is each variant affecting) to extremely complex (e.g. will this non-coding variant affect the expression of a gene?). It should be noted that the more complex the annotations, the more they rely on computational predictions. Such predictions can be incorrect, so results from SnpEff (or any prediction algorithm) cannot be trusted blindly; they must be analyzed and independently validated by corresponding wet-lab experiments.
Citing SnpEff:
“A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.”, Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672