[TOC]
Aim
We used a newly developed pool of 96 synthetic RNAs with various lengths, and GC content covering a \(2^{20}\) concentration range as spike-in controls to measure
- sensitivity
- accuracy
- biases in RNA-seq experiments as well as to
- derive standard curves for quantifying the abundance of transcripts
Linearity
to test linearity between read density and spike-in input over the entire detection range and excellent agreement between replicates, but we observed significantly larger imprecision than expected under pure Poisson sampling errors
Intro
High-throughput sequencing applications are revolutionizing genome- wide analysis
distinguish isoforms, calculate expression levels for transcripts, and uncover low abundance RNAs
While there are clear advantages to RNA-seq, it is less clear how well the procedure performs, as several studies have reported conflicting RNA-seq accuracy results
By using the control RNAs, we derive limits for the discovery and detection of rare transcripts
in RNA-seq experiments
meaning of spike
external RNA controls are a useful resource for evaluating sensitivity and accuracy of RNA-seq experiments for transcriptome discovery and quantification. These quality metrics facilitate comparable analysis across different samples, protocols, and platforms
- determining the accuracy,
- detection limits,
- reproducibility,
- dynamic range, and
- other performance measures of RNA-seq assays and establishing best practices are critical.
Standardized objective benchmarks provide quantitative measures of system performance and can be used routinely for quality control or for verification or optimization of system performance when changes are made in reagents or instrumentation
Result
The ERCC is working to develop and disseminate a standard set of exogenous RNA controls for use in gene expression assays
will support confidence in measurement results by enabling objective, quantitative assessment of assay performance
In this study, we used a Phase IV test set of ERCC RNAs in a combinatorial design, where some RNA concentrations were constant across pools and others vary in a Latin-square design
design
Stocks of individual RNAs were purified and quantified with UV absorption. The stocks were adjusted in concentration such that they each had the same molarity. These adjusted stocks were mixed into five subpools that each contained about 20 transcripts, covering a 106 dynamic range of abundance. This dynamic range was achieved by preparation of sub-subpools covering a range of 64-fold or 128-fold abundance, reasonable sets of dilutions using standard pipetting technique. These sub-subpools were mixed to create the subpools, which were then mixed in proportions to create pools 12-15 (see table S1), which were arranged such that they had a subpool in constant relative abundance, and four subpools that changed in abundance relative to each other in a Latin square design
Evaluation of spike-in sequence 证明spike-in 的可行性 通过举例子
The ERCC consortium synthesized control RNAs by in vitro transcription of synthetic DNA sequences or of DNA derived from the Bacillus subtilis or the deep-sea vent microbe Methanocaldococcus jannaschii genomes. They also contain a poly-A+ tail mimic in the DNA template. These diverse sequences show at least some of the properties of endogenous transcripts, such as diversity in the GC content and length
Importantly, ERCC RNAs show minimal sequence homology with endogenous transcripts from sequenced eukaryotes this minimizes confounding alignment of ERCC reads to the target genome.
Any spurious alignments of the ERCC reads to genes result in density spikes that are easily distinguished from reads derived from endogenous transcripts.We therefore concluded that ERCC RNAs are distinct from Homo sapiens and D. melanogaster transcripts and are unlikely to interfere with transcript discovery and quantification when used as spike-in controls in these genomes.
##Library QC
we observed by far the highest sequence error rate at the first 6 nucleotides 通过这个可以看一下哪些 碱基容易被测序错误 corresponding to the random hexamer priming site for the reverse transcriptase reaction during library preparation We do not see such an increased sequence error rate at the paired-end read not corresponding to the random priming site, and previous studies have not reported it for Illumina DNA sequencing controls
suggesting that these mismatches were due to imperfect hybridization between the primer and the RNA template. Error rates along the rest of the read increased with read length, as occurs for all Illumina sequencing runs due to the decline in the quality of sequencing chemistry over time
These measurements provide global false-positive rates
We measured the rate of this confounding effect directly by assessing the percentage of reads mapping in the antisense orientation to each ERCC RNA in these libraries (Fig. 1B).We found that 0.7% (60.6%) of the inserts map to ERCC RNAs on the wrong strand (with one outlier at 3%). These measurements provide global false-positive rates and threshold levels for distinguishing endogenous antisense transcripts levels for each library
Standard curves and detection limits
An understanding of the signal response in relation to input amount is critical for quantification, and spike-in controls are valuable for this, as they allowed us to determine the relationship between RNA-seq read counts and known inputs (Fig. 1 C,D). For detected ERCC RNAs, the relationship between RNA input abundance and read density output was constant over the six orders of magnitude in the 100% ERCC library and in libraries containing ERCC RNAs and either D. melanogaster or H. sapiens mRNAs (Pearson’s r > 0.96 on log transformed counts). Since log-log scales obscure nonlinear effects, we also determined the slope of the regression (0.95 6 0.03) and the correlation between input and read depth in the 100% ERCC library (library 6) by a test that transforms data to van der Waerden scores (Pearson’s r = 0.93) (Lehmann and D’Abrera 1988) and in linear space (Pearson’s r = 0.91). These results show linear quantification of the ERCC RNAs over six orders of magnitude. We found that ERCC read counts were highly correlated (Pearson’s r > 0.98) between the 100% ERCC library and libraries mixed with mRNA from either H. sapiens (data not shown) or D. melanogaster (Fig. 1E), indicating that RNA-seq
Scatter plots of read counts versus mass (concentration times length) per ERCC
ERCC-00073 showed aberrant abundance patterns in multiple RNA-seq experiments, as did ERCC-00144 in ERCC pool 14 描述结果和现象 同时给出可能的原因
They may have been inaccurately quantified in our ERCC test set due to errors during the complex mixing scheme used to generate the pools, as they are also suspect in RT-PCR and array experiments on these ERCC pools
quantification of ERCC RNAs is uninfluenced by the complexity of endogenous RNAs in different species, a critical requirement for effective spike-ins. In practical terms, these data indicate that one needs to only sacrifice around 2% of reads to ERCC RNAs in a RNA-seq experiment in order to obtain a standard curve for quantification. Random sampling of reads and overall library complexity always limit RNA-seq detection. Of the six ERCC RNAs that we failed to detect in the 100% ERCC RNA-seq experiment, five were among the least abundant, suggesting that failure to detect RNA was a consequence of low input abundance, random sampling, and sequencing depth. In this case, we loaded ;11 mL of a 108 nmol/mL solution during GAII clustering, corresponding to 107 molecules, which represents an upper boundary on the number of reads in this lane. The five least abundant molecules in the 100% ERCC library (library 6) were present between 0.6 and 2.5 molecules in 107. Even under ideal conditions, if library preparation and clustering followed an unbiased Poisson distribution, the detection probabilities for these least abundant RNAs were 0.3 < P < 0.9. The final undetected ERCC-00134 RNA in the 100% ERCC library was input at 8.3 molecules in 107 and hence should have been detected (P > 0.99). However, this is one of the shortest ERCC RNAs in the pool (274 nt) and showed a high probability of secondary structure (data not shown). Both of these features could have altered gel mobility during size selection (;200 bp) and resulted in exclusion during library construction. High transcript coverage is critical for building transcript models from RNA-seq data, since ideally the entire length of a transcript needs to be covered by reads. Based on simulations, we expected that 53sequencing coverage is required to cover 99% of a transcript with at least one read. In the real data,where reads are not perfectly distributed, at least 83 coverage of an ERCC was required to cover 99% of its primary sequence (Fig. 1F).We suggest that gene models derived from regions with greater than 83coverage should be considered as high-confidence annotations. Measuring for which ERCC spike-ins this coverage has been achieved provides a benchmark for the sensitivity of an RNA-seq transcript discovery experiment. Quantification and rare transcripts To estimate transcript abundances, we used the spike-in data to infer transcript copies per D. melanogaster S2 cell in three libraries, one (library 3) with 5% and-two libraries (libraries 1, 2) with 2.5% ERCCs added to S2 poly-A+ mRNA.We used Tophat (Trapnell et al. 2009) and Cufflinks (Trapnell et al. 2010) to align, assemble, and estimate the mRNA isoform and ERCC RNA abundance.We fit the S2 cell output to a regression of ERCC abundance input and output (for detected ERCCs only) to derive a standard curve with confidence intervals of quantification (Fig. 2A).We used this calibration to determine the concentration of S2 cell mRNAs in the RNA extract relative to the known concentrations of ERCC standards. Since we also determined the yield of RNA (nanograms per cell) extracted from S2 cells, we estimated the average recovered transcript number per cell. In these libraries, a yield of one copy/cell corresponded to 4.4 fragments per kilobase per million mappable fragments (FPKM) (95% confidence interval 3.3–5.7 FPKM). Of the 15,111 annotated transcripts (Tweedie et al. 2009) we detected, 6720 (44%) had an FPKM < 4.0 (Fig. 2B), strongly suggesting that a large portion of the transcriptome in a given cell type is rare in our preparations.However, these rare transcripts, estimated at less than one copy per cell, are nevertheless observed reproducibly between the two replicate RNA-seq libraries (Pearson r = 0.45, P < 2.231016).Howmany of these transcripts are biologically relevant remains an open question. The extended dynamic range of the transcriptome creates a familiar problem for discovering rare transcripts. The most abundant 1.5% of RNAs (more than 100 copies/cell) accounted for 43% of the mapped reads, while the least abundant 44% of RNAs accounted for just 1% of the reads. Only 52 out of the 551 mRNAs encoding transcription factors were present at over 10 copies per cell. To achieve 99% coverage of an mRNA, we estimated that at least an 83 sequencing depth is required. Achieving this standard for D. melanogaster S2 transcripts present at one copy per cell requires at least 68 million uniquely aligned single-end 36-bp reads (see Supplemental Methods). Additionally the underrepresentation of certain sequences and short transcripts in RNA-seq protocols means that significantly more overall reads and possibly different library preparation methods are needed to make up for biases and to cover most transcripts. RNA-seq quantification accuracy While there is clearly a linear relationship between RNA concentration and read density in the ERCC RNA collection over six orders of magnitude, there were significant deviations from a perfect Figure 2. Estimation of cellular transcript abundance in a S2 cell. (A) This plot shows results from a library (library 3) made of 100 ng S2 polyA+ RNA (mRNA yield for this extraction is 0.175 pg/cell) and 5 ng of pool 15 ERCC RNAs. A linear regression of abundance estimated from RNA-seq and the known input amounts. Dashed lines represent 95% confidence intervals for the regression fit. (B) Distribution of S2 transcript abundance estimated from RNA-seq. Spike-in controls for RNA-seq Genome