Abstract
Here we report the results of a field experiment with the model plant Arabidopsis thaliana to identify the fungi and bacteria that colonize its leaves and the host loci that influence the microbe numbers. The composition of this community differs among accessions of A. thaliana. Genome-wide association studies (GWAS) suggest that plant loci responsible for defense and cell wall integrity affect variation in this community. Furthermore, species richness in the bacterial community is shaped by host genetic variation
The most heavily sequenced (that is, most ‘abundant’) fungal OTUs
After correcting for differences in sequencing among samples and adjusting for technical confounders, strong and significant species associations (Kendall’s Test of Concordance15, P¼0.001, 1,000 permutations) were observed within both the bacterial and fungal communities
suggesting that members of the microbial community interact or that portions of the microbial community respond to the same host factors. To take into account these correlations, we summarized each community using eigenvector techniques (Methods), including principal component analysis (PCA) and canonical correspondence analysis (CCA)
The leaf microbial community is shaped by host genetics
We found that genetic variation within A. thaliana clearly shapes the leaf bacterial community, but only when we focused on the most heavily sequenced OTUs
However, these 2,528 bacterial OTUs correspond to 99% of the sequencing reads, which suggests that rare species or sequencing artefacts may obscure evidence that hosts structure their microbial communities
Species of bacteria tend to be more prevalent (that is, common) across host samples than species of fungi, leading to higher estimates of turnover (b-diversity) in the fungal
It is unclear whether fungi disperse poorly compared with bacteria, or whether other factors (for example, host selection and/or interspecific competition) differentially shape these two communities. Nevertheless, both presence/absence and abundance data reveal clear evidence that host-genetic variation shapes the communities of fungi associated with the leaves of A. thaliana, but for only the most heavily sequenced taxa
We looked for further evidence that hosts shape their microbial communities by using genome-wide single-nucleotide polymorphism (SNP) data to estimate the relatedness among accessions, before asking whether more closely related individuals harbour more similar communities. This approach is likely to underestimate the heritability of traits influenced by non-additive effects, genetic heterogeneity, or by rare causal SNPs in incomplete linkage disequilibrium (LD) with genotyped SNPs; nevertheless, heritable eigenvectors were found in both communities, regardless of the ordination technique used (Methods; Supplementary Table 2). For example, SNPs explain 9% of the variance for PC1 (P¼0.003) and 8% of the variance for PC2 (P¼0.015) from PCA of the fungal community, as well as 11% of the variance for PC2 of the bacterial community
Samples with poor sequencing coverage were omitted from all analyses
To correct for differences in sequencing effort (coverage), each sample was resampled to either 800 reads (for bacteria) or 200 reads (for fungi).
However, all samples were resampled to contain 200 reads before making comparisons between the bacterial and fungal communities (for example, Supplementary Fig. 4). Because the samples were grown in different blocks (above), PCR-amplified in separate 96-well plates
and sequenced on separate picotiter (ptp) plates, we took into account these covariates in the analyses described below
Associations among microbes. To perform Kendall’s Test, we used the kendall.global function in the R-package51 vegan52
Ordination techniques. The function rda (scale¼T) in vegan was used to perform PCA, while cca was used for CCA. decorana was used to perform detrended correspondence analysis (DCA).
To test the hypothesis that accessions of A. thaliana differ with respect to the composition of their microbial communities, which we characterized with these ordination techniques, we used the functions envfit (for unconstrained ordination techniques) and anova.cca (for ordinations produced by CCA)
To investigate whether host genetic differences are easier to discern for well-sequenced taxa than rare taxa (that is, due to species turnover among rare species in the microbial community, sequencing artefacts or some other mechanism), we ordered the species matrix by total (maximum) sequencing coverage per OTU. We prefer to characterize well-sequenced taxa as ‘most heavily sequenced OTUs’ rather than ‘most abundant’ because of common technical artefacts (due to primer biases or RNA operon count differences, and so on
In brief, envfit identifies the direction in multi-dimensional ordination space that is maximally associated with an environmental variable (here, host genotype, or accession_id). The goodness-of-fit statistic is r2 that is equal to 1#(ssw/sst); ssw is the within-group sum of squares and sst¼the total sums of squares. To assess the significance of this association, we permuted the data 999 times and counted the number of times that these simulated r2 values matched or exceeded the observed r2 value (including the observed r2 value, which is assumed to be an observation from the null distribution). To determine whether ordinations produced with CCA are shaped by host-genotype, we used the function anova.cca. This function also relies on permutation tests, but does so to determine how often the observed constrained inertia (the constraint being host-genotype) is exceeded when the data are randomly permuted.
Genome-wide association studies
- mixed-model approach
- mixmogam
To estimate a genome-wide P value threshold, we performed permutations
re-ran association scans (genome-wide) using a linear transformation of the phenotype values
- controls for population structure
instead of simulating phenotypes under the null, we permute the transformed phenotype values. This allows us to also control for false positives that can arise if the residuals do not come from a Gaussian distribution
Cholesky decomposition of the inverse phenotypic covariance matrix
permutation
where n is the number of individuals, m is the number of genotype variants tested and k is the number of permutations. We performed k¼100 permutations for each trait
phenotypes
To correct for differences in sequencing effort among samples, all data were resampled to either 800 reads (bacterial community) or 200 reads (fungalcommunity) before conducting GWAS using common SNPs (minor allele frequency MAF>=5%)
We split the results from each analysis into 10-kb windows (yielding 11,614 windows). Then, to make the results comparable across GWAS, we ranked and calculated an empirical P value for each window. Next, we determined the amount of overlap in the top results (Pr0.001, empirical P value) from GWAS of individual OTUs in each community. To determine the significance of observed sharing, we used 100,000 simulations to construct a null distribution; each observation in the null was based on selecting 11 (Pr0.001) windows from each of 100 simulated GWAS results (that is, 100 OTUs). We then counted the number of times a window was shared ‘x’ or more times. To assess the probability of observing the same genomic region in the bacterial and fungal analyses, we sampled from 200 simulated GWAS results (that is, 100 OTUs from each community).