snp_calling

Genome-wide association studies of 14 agronomic traits in rice landraces

Uncovering the genetic basis of agronomic traits in crop landraces that have adapted to various agro-climatic conditions is important to world food security.

Here we have identified ~3.6 million SNPs by sequencing 517 rice landraces (local varieties) and constructed a high-density haplotype map of the rice genome using a novel data-imputation method. We performed genome-wide association studies (GWAS) for 14 agronomic traits in the population of Oryza sativa indica subspecies. The loci identified through GWAS explained ~36% of the phenotypic variance, on average.

The peak signals at six loci were tied closely to previously identified genes.

This study provides a fundamental resource for rice genetics research and breeding, and demonstrates that an approach integrating second-generation sequencing and GWAS can be used as a powerful complementary strategy to classical biparental cross-mapping for dissecting complex traits in rice.

Intro

Rice (Oryza sativa L.) is a staple (main, principal) food for more than half of the world population.

Rice landraces have evolved from their wild progenitor under natural and human selection, leading to the maintenance of high genetic diversity.

These cultivated varieties also have a high capacity to tolerate biotic and abiotic stress, resulting in highly stable yields and an intermediate yield under a low-input agricultural system. Identifying the genetic basis of these diverse varieties will provide important insights for breeding elite (the best; top-tier) varieties for sustainable agriculture.

GWAS have emerged as a powerful approach for identifying genes underlying complex diseases at an unprecedented [ʌnˈpresɪdentɪd] rate. However, despite their promise, GWAS have largely not been applied to the dissection (taking apart for analysis) of complex traits in crop plants.

This is due mainly to the lack of effective genotyping techniques for plants and the limited resources for developing high-density haplotype maps like those seen in other well-developed systems, such as the human genome HapMap project. Rice is an ideal candidate system for the application of GWAS because it

  • is self-fertilizing,
  • has a high-quality reference genome sequence,
  • and has phenotyping resources.

Such a system should permit the identification of high-quality haplotypes necessary to accurately associate molecular markers with phenotypes.

Here we have genotyped rice landraces through direct resequencing of their genomes by adopting sequencing-by-synthesis technology, which represents a step forward from the oligonucleotide array technology widely used for GWAS.

More than 500 diverse rice landraces, representing a large collection of rice accessions, were sequenced at approximately onefold genome coverage. The resulting data set captures more of the common sequence variation in cultivated rice than any other data set to date.

Using a highly accurate imputation method, we constructed a high-density rice haplotype map and performed GWAS for 14 agronomic traits to identify a substantial number of loci potentially important for rice production and improvement.

Some loci were mapped at close to gene resolution, indicating that GWAS of rice landraces could provide an effective approach for gene identification.

Result

Genome sequencing and SNP identification

From a collection of ~50,000 rice accessions originating in China, we have undertaken an effort to build a large sample of morphologically, genetically and geographically diverse landraces for genetic studies. In this study, a total of 517 landraces were selected and comprehensively phenotyped.

We genotyped these landraces with approximately onefold-coverage genome sequencing using a barcoded multiplex sequencing approach on the Illumina Genome Analyzer II. Three additional cultivars with accurate genome sequences were also sequenced as internal controls for evaluating sequence accuracy.

A total of 2.7 billion 73-bp paired-end reads were generated. In total, all sequences used for SNP calling comprised ~508-fold coverage of the rice genome.
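As a rough sanity check on these figures, sequencing depth is simply the total number of sequenced bases divided by the genome size. The sketch below assumes a genome size of about 389 Mb, which is not stated in these notes (it is picked to be consistent with the 3,625,200 SNPs at 9.3 SNPs per kb quoted later), and it assumes that the 2.7 billion figure counts individual reads.

```python
def fold_coverage(n_reads, read_length_bp, genome_size_bp):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp

# Assumption: genome size is not given in the notes; ~389 Mb is illustrative.
ASSUMED_GENOME_SIZE_BP = 389_000_000

total = fold_coverage(2.7e9, 73, ASSUMED_GENOME_SIZE_BP)
# 517 landraces plus the 3 control cultivars share the sequencing output.
per_accession = total / (517 + 3)
print(f"total: {total:.0f}-fold, per accession: {per_accession:.2f}-fold")
```

Under these assumptions the total comes out close to the ~508-fold quoted above, and the per-accession depth lands near onefold, which matches the "approximately onefold coverage" design.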

Sequence reads were aligned to the rice reference genome for SNP identification. We used the alignment of reads to build consensus genome sequences for each rice accession, with a series of filtering criteria that eliminated sequencing and mapping errors.

The resulting consensus sequence of each rice accession covered 27.4% of the reference genome on average (ranging from 12% to 47%). Comparisons of the consensus sequence against bacterial artificial chromosome (BAC) sequences and high-coverage Illumina data showed that the sequence specificity reached 99.9%.

The SNP calling procedure was then based on discrepancies between the consensus sequence and the reference genome. After exclusion of singleton SNPs, the SNP calling error rate was reduced to 2.7%.

A total of 3,625,200 (~3.6 million) nonredundant SNPs were identified, resulting in an average of 9.3 SNPs per kb, with 87.9% of the SNPs located within 0.2 kb of the nearest SNP. (?)
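The two summary statistics in this paragraph can be reproduced with a few lines. This is a minimal sketch, not the authors' pipeline: the ~389 Mb genome size is an assumption (it is not given in these notes), the function names are made up for illustration, and the spacing function expects SNP positions from a single chromosome.

```python
def snp_density_per_kb(n_snps, genome_size_bp):
    """Average number of SNPs per kilobase of genome."""
    return n_snps / (genome_size_bp / 1000)

def fraction_with_neighbor_within(positions, max_dist_bp):
    """Fraction of SNPs whose nearest neighboring SNP lies within max_dist_bp.

    positions: SNP coordinates on one chromosome (any order).
    """
    pos = sorted(positions)
    if len(pos) < 2:
        return 0.0
    hits = 0
    for i, p in enumerate(pos):
        left = p - pos[i - 1] if i > 0 else float("inf")
        right = pos[i + 1] - p if i < len(pos) - 1 else float("inf")
        if min(left, right) <= max_dist_bp:
            hits += 1
    return hits / len(pos)

# Assumed genome size of ~389 Mb (illustrative, not from the notes).
print(round(snp_density_per_kb(3_625_200, 389_000_000), 1))
# Toy positions: four of the five SNPs have a neighbor within 200 bp.
print(fraction_with_neighbor_within([100, 150, 1000, 1100, 5000], 200))
```

A high share of SNPs with a close neighbor alongside a modest average density suggests that SNPs are clustered rather than evenly spread, which may be what the "(?)" above is asking about.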

About 78% of all SNPs were found in intergenic regions; of the remaining SNPs, the largest number were in introns of annotated genes, followed by coding regions and untranslated regions of annotated genes.

The chromosomal distribution of the SNPs is shown

Despite the high density of our SNP map, however, the recall rate (the rate at which all actual SNPs are recalled) was 20%. This was probably due to uneven sampling of short reads from low-coverage sequencing and the complexity and repetitiveness of the rice genome.

To gain insights into potential functional effects of the detected SNPs, we further analyzed the SNPs in coding regions. A total of 167,514 SNPs were found in the coding regions of 25,409 annotated genes with transcript support (RAP2 database). There were 3,625 large-effect SNPs (mutations predicted to cause large effects).

Population structure and geographic differentiation

The phylogenetic relationships of the 517 selected Chinese rice landraces were determined using the genetic distances calculated from the SNPs. The resulting neighbor-joining tree showed two divergent

Population differentiation

From the SNP data, sequence diversity (π) was estimated at 0.0024 for all sampled landraces, and 0.0016 and 0.0006 for indica and japonica, respectively. These estimates suggest that the overall genetic variation of

The population-differentiation statistic (FST) between the indica and japonica landraces was estimated at 0.55, indicating a very strong population differentiation.
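For reference, both statistics in this section can be computed from allele counts at biallelic SNPs. The notes do not say which estimators the paper used, so the sketch below swaps in two common choices: the standard sample-corrected pairwise-difference estimator for π, and Hudson's FST computed as a ratio of averages. Treating each inbred accession as one haploid sample is an additional assumption.

```python
import numpy as np

def nucleotide_diversity(alt_counts, n_chrom, seq_length_bp):
    """Per-site pi from alternate-allele counts at biallelic SNPs.

    alt_counts: alternate-allele count at each SNP
    n_chrom: number of sampled chromosomes at each SNP
    seq_length_bp: number of sites surveyed (variant and invariant)
    """
    c = np.asarray(alt_counts, dtype=float)
    n = np.asarray(n_chrom, dtype=float)
    # Expected pairwise differences per SNP, with the n/(n-1) correction.
    per_snp = 2 * c * (n - c) / (n * (n - 1))
    return per_snp.sum() / seq_length_bp

def hudson_fst(p1, n1, p2, n2):
    """Hudson's FST (ratio of averages) between two populations.

    p1, p2: allele frequencies per SNP; n1, n2: sampled chromosomes.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num.sum() / den.sum()

# Toy data: one SNP separating two sequences over 1,000 surveyed sites.
print(nucleotide_diversity([1], [2], 1000))
# Toy data: two SNPs fixed for opposite alleles in the two populations.
print(hudson_fst([1.0, 1.0], 50, [0.0, 0.0], 50))
```

FST near 0 means the two groups share allele frequencies, and values approaching 1 mean nearly fixed differences, so 0.55 between indica and japonica sits well toward the differentiated end.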

Population structure

We then investigated the population structure within subspecies. According to the neighbor-joining tree as well as the principal-component analysis (PCA), both indica and japonica had three subgroups, designated 1, 2 and 3.
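A PCA of this kind can be sketched with a singular value decomposition of the centered genotype matrix. This is a generic illustration rather than the software used in the paper; it assumes a samples x SNPs matrix of 0/1 allele codes with no missing data, and the function name is hypothetical.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Project samples onto the leading principal components.

    genotypes: samples x SNPs array of 0/1 allele codes, no missing data.
    Returns a samples x n_components array of PC scores.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)          # center each SNP column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data: samples 0-1 and samples 2-3 differ at the first three SNPs.
toy = [[0, 0, 0, 1],
       [0, 0, 0, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 1]]
scores = genotype_pca(toy)
print(scores[:, 0])  # PC1 separates the two groups by sign
```

Subgroups then show up as clusters when the first two or three score columns are plotted against each other.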

It has previously been suggested that the photoperiod and temperature clines (geographic gradients) along latitudes may have been the primary factors driving differentiation of cultivated rice in China.

Whole-genome patterns of linkage disequilibrium (LD)

We then analyzed LD for indica and japonica landraces using the SNP data. The LD decay rate was measured as the chromosomal distance at which the average pairwise correlation coefficient (r²) dropped to half its maximum value. Genome-wide LD decay rates of indica and japonica were estimated at ~123 kb and ~167 kb, where the r² drops to 0.25 and 0.28, respectively. This is in agreement with the previous estimation that cultivated rice has a long-range LD from close to 100 kb to over 200 kb, which might be a result of self-fertilization coupled with a relatively small effective population size. (?)
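The definition of the LD decay distance used here can be written out directly: compute r² for SNP pairs, average it in distance bins, and report the first bin at or after the peak where the mean falls to half the maximum. The sketch below is a naive illustration under stated assumptions (0/1 allele codes, polymorphic SNPs on one chromosome, no missing data, at least one SNP pair within the maximum distance); it is quadratic in the number of SNPs, so it is only suitable for small regions, and the function names are hypothetical.

```python
import numpy as np

def r_squared(a, b):
    """Squared correlation between two polymorphic SNPs coded 0/1."""
    r = np.corrcoef(a, b)[0, 1]
    return r * r

def ld_decay_distance(positions, genotypes, bin_size_bp, max_dist_bp):
    """Distance (bin midpoint) where mean r^2 drops to half its maximum.

    positions: SNP coordinates on one chromosome
    genotypes: samples x SNPs array of 0/1 codes
    Returns None if r^2 never falls to half the maximum within max_dist_bp.
    """
    pos = np.asarray(positions)
    g = np.asarray(genotypes, dtype=float)
    n_bins = -(-max_dist_bp // bin_size_bp)  # ceiling division
    sums, counts = np.zeros(n_bins), np.zeros(n_bins)
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d = abs(int(pos[j]) - int(pos[i]))
            if d >= max_dist_bp:
                continue
            k = d // bin_size_bp
            sums[k] += r_squared(g[:, i], g[:, j])
            counts[k] += 1
    filled = counts > 0
    means = sums[filled] / counts[filled]
    mids = (np.nonzero(filled)[0] + 0.5) * bin_size_bp
    start = int(np.argmax(means))
    for k in range(start, len(means)):
        if means[k] <= means[start] / 2:
            return float(mids[k])
    return None

# Toy data: SNPs A and B are identical (r^2 = 1), SNP C is independent.
a = [0, 0, 0, 0, 1, 1, 1, 1]
c = [0, 1, 0, 1, 0, 1, 0, 1]
toy = np.array([a, a, c]).T
print(ld_decay_distance([0, 10, 250], toy, bin_size_bp=100, max_dist_bp=1000))
```

In the toy data the only close pair is in complete LD and the distant pairs are uncorrelated, so the reported decay distance is the midpoint of the bin holding the distant pairs.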

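LD decay is one of the most important and most common analyses in population genetics research, and it is an important measure of the level of LD. In self-pollinating crops in particular, LD decay not only reflects the history of crop domestication and breeding, but can also reveal gene flow, regions under selection, and so on.

The traditional LD software Haploview can compute LD, haplotypes and so on, but computing over millions of SNP markers consumes a great deal of resources, and local runs often crash because they run out of memory. PopLDdecay is a simple and effective tool for LD decay analysis that can both compute and plot, and it reads and writes compressed files; compared with other software, it is better suited to computing and analyzing large data sets.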

Construction of the rice haplotype map

Therefore, our approach, combining second-generation sequencing technology with an effective imputation procedure, permits the quick construction of a high-density haplotype map at a markedly lower cost than microarray-based genotyping.

We then examined the influence of various biological and experimental factors on the performance of the data imputation (Fig. 3). Notably, this method performed well even when LD decayed within 10 kb; with this LD decay, the missing-data rate was below 5%, with an accuracy above 95%. This suggests that our imputation method for low-coverage genome sequencing data is also applicable to other genomes with short-range LD.
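The two performance figures quoted here (accuracy and missing-data rate) are typically measured by hiding known genotypes, imputing them, and comparing against the truth. The sketch below shows that evaluation loop with a deliberately simple stand-in imputer that copies the call from the most similar other sample; it is not the imputation method of the paper, whose details are not covered in these notes. The `MISSING` code and all function names are hypothetical.

```python
import numpy as np

MISSING = -1  # hypothetical code for "no call"

def nearest_sample_impute(g):
    """Fill missing calls by copying from the most similar other sample.

    g: samples x SNPs integer array of 0/1 codes, MISSING where uncalled.
    A call stays MISSING if no other sample can supply it.
    """
    g = np.asarray(g)
    out = g.copy()
    for i in range(g.shape[0]):
        for s in np.nonzero(g[i] == MISSING)[0]:
            best, best_score = None, -1.0
            for j in range(g.shape[0]):
                if j == i or g[j, s] == MISSING:
                    continue
                shared = (g[i] != MISSING) & (g[j] != MISSING)
                if not shared.any():
                    continue
                score = np.mean(g[i, shared] == g[j, shared])
                if score > best_score:
                    best, best_score = j, score
            if best is not None:
                out[i, s] = g[best, s]
    return out

def evaluate_imputation(truth, masked):
    """Return (accuracy on hidden calls that were filled, remaining missing rate)."""
    truth, masked = np.asarray(truth), np.asarray(masked)
    imputed = nearest_sample_impute(masked)
    filled = (masked == MISSING) & (imputed != MISSING)
    accuracy = float(np.mean(imputed[filled] == truth[filled]))
    missing_rate = float(np.mean(imputed == MISSING))
    return accuracy, missing_rate

# Toy data: two haplotype groups, with two calls hidden.
truth = np.array([[0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1],
                  [1, 1, 1, 1, 1]])
masked = truth.copy()
masked[0, 2] = MISSING
masked[3, 1] = MISSING
accuracy, missing_rate = evaluate_imputation(truth, masked)
print(accuracy, missing_rate)
```

Repeating this with simulated data of different LD extents, coverage levels or sample sizes is one way to examine how such factors affect imputation performance.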

Genome-wide association studies for 14 agronomic traits

The high-density haplotype map enabled genome-wide association mapping in rice. The strong population structure, along with a slow LD decay rate, makes GWAS in this species not straightforward. To evaluate the performance of GWAS, we carried out GWAS on 14 agronomic traits, which can be divided into five categories: morphological characteristics (tiller number and leaf angle), yield components (grain width, grain length, grain weight and spikelet number), grain quality (gelatinization temperature and amylose content), coloration, and physiological features (heading date, drought tolerance and degree of seed shattering).

Given the strong population differentiation between the two subspecies of cultivated rice, we did not look for associations across both subspecies. We conducted GWAS for 373 indica lines. The sequencing-based (with a minor allele frequency of >0.05)
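The minor allele frequency cutoff mentioned here can be sketched as a simple column filter. This is an illustration only: it assumes a samples x SNPs matrix of 0/1 allele codes for inbred lines with no missing data, and the function name is hypothetical.

```python
import numpy as np

def maf_filter(genotypes, min_maf=0.05):
    """Keep SNPs whose minor allele frequency exceeds min_maf.

    genotypes: samples x SNPs array of 0/1 codes, no missing data.
    Returns the filtered matrix and the boolean mask of kept SNPs.
    """
    g = np.asarray(genotypes, dtype=float)
    alt_freq = g.mean(axis=0)
    maf = np.minimum(alt_freq, 1 - alt_freq)
    keep = maf > min_maf
    return g[:, keep], keep

# Toy data: 40 samples, alternate allele seen in 1, 20 and 39 of them.
toy = np.zeros((40, 3), dtype=int)
toy[:1, 0] = 1
toy[:20, 1] = 1
toy[:39, 2] = 1
filtered, keep = maf_filter(toy)
print(keep)
```

Only the middle SNP passes, since the other two have a minor allele present in a single sample (frequency 0.025).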

Both the simple model and the compressed mixed linear model (MLM; ref. 21) were used to identify association signals.

The compressed MLM approach, which took genome-wide patterns of genetic relatedness into account, greatly reduced false positives, as shown in quantile-quantile plots.
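A quantile-quantile plot compares the sorted observed P values with what a uniform null distribution would give; points rising above the diagonal early on indicate inflation from confounding such as population structure. The sketch below computes the plotted points and, as an extra summary that is not mentioned in these notes, the commonly used genomic inflation factor. It assumes all P values lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist, median

def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a Q-Q plot."""
    obs = sorted(p_values)
    n = len(obs)
    return [(-math.log10((i + 1) / (n + 1)), -math.log10(p))
            for i, p in enumerate(obs)]

def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic over its null median."""
    z = NormalDist().inv_cdf
    observed_median_chi2 = z(1 - median(p_values) / 2) ** 2
    null_median_chi2 = z(0.75) ** 2  # about 0.455
    return observed_median_chi2 / null_median_chi2

# Toy data: evenly spaced "null" P values versus an inflated version.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
inflated_p = [p ** 2 for p in null_p]
print(genomic_inflation(null_p), genomic_inflation(inflated_p))
```

A value close to 1 is what a well-controlled model should give, while values well above 1 suggest many false positives.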

We also identified strong association signals with P < 10⁻⁸ from the simple model, discarding all but the top five most significant signals for each trait if there was an excess of strong associations.
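The selection rule described here (threshold, then at most five signals per trait) can be sketched as follows. For simplicity each SNP is treated as its own signal, which is an assumption; a real analysis would first group nearby SNPs into loci. The input layout and function name are hypothetical.

```python
def strongest_signals(results, p_threshold=1e-8, top_n=5):
    """Pick the most significant signals per trait.

    results: iterable of (trait, snp_id, p_value) tuples.
    Returns {trait: [snp_id, ...]} ordered from most to least significant,
    keeping at most top_n signals with p_value < p_threshold.
    """
    by_trait = {}
    for trait, snp_id, p in results:
        if p < p_threshold:
            by_trait.setdefault(trait, []).append((p, snp_id))
    return {trait: [snp for _, snp in sorted(hits)[:top_n]]
            for trait, hits in by_trait.items()}

# Toy data: seven strong hits for one trait, one strong and one weak for another.
toy = [("grain width", f"snp{i}", 10.0 ** -(9 + i)) for i in range(7)]
toy += [("heading date", "snpA", 1e-9), ("heading date", "snpB", 1e-5)]
selected = strongest_signals(toy)
print(selected)
```

In the toy data the trait with seven strong hits is cut down to its five smallest P values, and the weak hit for the other trait is dropped by the threshold.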

Association signals for six traits were located close to known genes that have been identified previously using mutants or studies of recombinant populations.

Although the association resolution varied among loci, the peak signals of the GWAS loci notably often appeared near (but not within) the known genes.

We then screened the causal polymorphisms of three known genes by direct PCR amplification and sequencing, and found that all of them showed a slightly weaker association than peak signals nearby.

These results were consistent with similar findings in Arabidopsis thaliana (need to ?) and may result from multiple causal polymorphisms of a gene coupled with complex population structure. (?)

Together, the data show that the degree to which population stratification confounds associations varies markedly across traits.

Tank (Xiao-Ning Zhang)
PhD Student @ Data Miner & Coder

I’m a PhD Student majoring in Bioinformatics and Biostatistics who loves computer programming such as C(++), Java, Python and R.