Why SV (CNV) is important
Differences in the presence of even a few genes between otherwise identical bacterial strains may result in critical phenotypic differences
Genes that are deleted or duplicated within different members of a species (also termed copy number variation; CNV) are common across all kingdoms of life
Even a small number of bacterial genes can underlie phenotypes such as virulence3, antibiotic resistance4, host metabolic disease5 and host longevity6, making genetic variation highly important to both the microbe and its host.
Previous Work and Background
A systematic characterization of intra-species CNVs across the human microbiome was recently performed and showed that they are highly prevalent.
This variability could be critical to human pathophysiology
Limitations
This and other studies analysing the genetic repertoire of the microbiome were potentially limited
- by the scope of the annotation databases used
- by ignoring the co-variation of genes from the same genomic region
Other functional characterization methods may also be limited with regard to within-species variation of genes.
What is co-variation of genes from the same genomic region
Such co-variation is important as it encodes information such as operon membership, gene regulation or susceptibility to horizontal transfer that is only evident when analysing genes in their neighbouring genomic context
study objective
SV term
The spectrum of human genetic variation ranges from the single base pair to large chromosomal events, but it has become apparent that human genomes differ more as a consequence of structural variation than of single-base-pair differences
Structural variation was originally defined as insertions, deletions and inversions greater than 1 kb in size. With the sequencing of human genomes now becoming routine, the operational spectrum of structural variants (SVs) and copy number variants (CNVs) has widened to include much smaller events (for example, those >50 bp in length).
The challenge now is to discover the full extent of structural variation and to be able to genotype it routinely in order to understand its effects on
- human disease,
- complex traits
- evolution
Two distinct models associating SVs with disease
Two distinct models have been proposed with respect to associations between disease and structural variation.
The first involves large variants (typically gains and losses several hundred kilobase pairs in length) that are individually rare in the population (<1%) but collectively account for a significant fraction of disease
The second includes multicopy gene families that are commonly copy number variable and contribute to disease susceptibility, as seen for traits related to immune gene functions
classes of SV
Structural variant (SV). Genomic rearrangements that affect >50 bp of sequence, including deletions, novel insertions, inversions, mobile-element transpositions, duplications and translocations
Copy number variant (CNV). Also defined as unbalanced structural variants; variants that change the number of base pairs in the genome.
Mobile elements. DNA sequences that move location within the genome. Active mobile elements (transposons) in the human genome include Alu, L1 and SVA sequences
Scientific Questions
The discovery and genotyping of structural variation has been central to understanding these disease associations
Systematic and comprehensive assessment of structural variation has been problematic owing to the complexity and multifaceted features of SVs
SV discovery and genotyping requires accurate prediction of three features: copy, content and structure
SVs tend to reside within repetitive DNA, which makes their characterization more difficult. SVs vary widely in size and there are numerous classes of structural variation: deletions, translocations, inversions, mobile elements, tandem duplications and novel insertions
Once a variant has been detected, validated and characterized at the sequence level (discovery), a different suite of methods may be applied to infer genotypes with relaxed thresholds
SV calling methods
array-based
sequencing-based: there are four general types of strategy, all of which focus on mapping sequence reads to the reference genome and subsequently identifying discordant signatures or patterns that are diagnostic of different classes of SV
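Read-pair analysis is one widely used strategy of this kind, and its "discordant signature" can be illustrated in a few lines. The sketch below is only an illustration under stated assumptions: the function name `classify_read_pair`, the insert-size mean and standard deviation, and the 3-SD cutoff are all hypothetical choices, not values from the notes above.

```python
def classify_read_pair(mapped_span, proper_orientation,
                       mean_insert=400.0, sd_insert=40.0, n_sd=3.0):
    """Label one read pair by how it maps to the reference.

    mapped_span: distance between the two mates on the reference (bp).
    proper_orientation: True if the mates face each other as expected.
    mean_insert, sd_insert, n_sd: assumed library parameters (hypothetical).
    """
    if not proper_orientation:
        # Mates in an unexpected orientation hint at an inversion.
        return "inversion-like"
    if mapped_span > mean_insert + n_sd * sd_insert:
        # Mates map too far apart: the sample lacks sequence present
        # in the reference between them.
        return "deletion-like"
    if mapped_span < mean_insert - n_sd * sd_insert:
        # Mates map too close: the sample carries extra sequence.
        return "insertion-like"
    return "concordant"


print(classify_read_pair(1000, True))   # deletion-like
print(classify_read_pair(410, True))    # concordant
```

In practice a single discordant pair is weak evidence; callers usually require a cluster of pairs supporting the same event.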
Importance of This work
To detect segments of varying lengths, potentially containing multiple genes, that are deleted from certain bacteria in some individuals or present in a variable number of copies in others
We identify microbial genomic structural variants (SVs) and find them to be prevalent in the human gut microbiome across phyla and to replicate in different cohorts. SVs are enriched for CRISPR-associated and antibiotic-producing functions and depleted from housekeeping genes, suggesting that they have a role in microbial adaptation. We find multiple associations between SVs and host disease risk factors, many of which replicate in an independent cohort. Exploring genes that are clustered in the same SV, we uncover several possible mechanistic links between the microbiome and its host, including a region
Our results uncover a nascent layer of variability in the microbiome that is associated with microbial adaptation and host health
Accurate metagenomic read assignment using ICRA
Problem: over 15% of the metagenomic reads were assigned ambiguously to multiple references upon mapping to a database of 3,953 bacterial genomes
Solution: To address this problem, we devised the ICRA algorithm which uses read assignments, read and mapping qualities, sequencing coverage depth along microbial entities (for example, bacterial genomes) and microbial relative abundances to reassign ambiguously mapped reads
ICRA introduces a demand for sufficient coverage over entities that are to be considered present in a sample, making it robust to genomic regions with extremely high or low coverage that may arise from misassemblies, homology to other microbes or phage activation
Such regions could otherwise bias the estimated relative abundances, potentially even assigning abundances to genomic entities that are absent from the sample.
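The reassignment idea can be illustrated with a toy iterative scheme. This is a minimal sketch and not the ICRA implementation: it ignores read and mapping qualities and per-position coverage depth, and the function name `reassign_reads`, the `min_reads` support cutoff and the fixed iteration count are assumptions made for illustration. Each ambiguous read is split across its candidate genomes in proportion to the current abundance estimates, abundances are re-estimated from those weights, and genomes with too little support are dropped.

```python
from collections import defaultdict


def reassign_reads(read_hits, n_iter=20, min_reads=1.0):
    """Toy iterative reassignment of ambiguously mapped reads.

    read_hits: dict read_id -> list of genome ids the read maps to.
    Returns dict read_id -> {genome: weight}; weights sum to 1, or the
    dict is empty if every candidate genome was dropped.
    """
    genomes = {g for hits in read_hits.values() for g in hits}
    abundance = {g: 1.0 for g in genomes}  # uniform starting estimate
    weights = {}
    for _ in range(n_iter):
        totals = defaultdict(float)
        for read, hits in read_hits.items():
            live = [g for g in hits if abundance.get(g, 0.0) > 0.0]
            if live:
                denom = sum(abundance[g] for g in live)
                weights[read] = {g: abundance[g] / denom for g in live}
            else:
                weights[read] = {}  # all candidate genomes were dropped
            for g, w in weights[read].items():
                totals[g] += w
        # Hypothetical presence rule: keep a genome only if it collects
        # at least `min_reads` worth of read weight.
        abundance = {g: (t if t >= min_reads else 0.0)
                     for g, t in totals.items()}
    return weights


hits = {"r1": ["A"], "r2": ["A"], "r3": ["A"], "r4": ["A", "B"], "r5": ["B"]}
print(reassign_reads(hits)["r4"])
```

In the usage example the shared read `r4` ends up weighted mostly towards genome `A`, which has more uniquely mapped reads.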
To test the performance of ICRA, we validated the two key components of the algorithm: its ability to resolve ambiguous read assignments and the accuracy of the species relative abundances that it infers
SGV-Finder: systematically characterizing structural variation
We applied SGV-Finder to ICRA-corrected read assignments of 887 metagenomic samples mapped to a reference database of 3,953 representative microbial genomes
Core of SGV-Finder:
SGV-Finder analyses coverage depth across all microbial genomes in all samples to characterize SVs with respect to the standardized coverage of a genome in a given sample
We differentiate between deletion SVs, that are deleted and not covered in 25–75% of samples, and variable SVs, that have highly variable coverage across samples. In both SV types, segments are united based on co-occurrence (deletion SVs) or correlation (variable SVs).
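The two SV types can be illustrated on a small coverage matrix for one genome. The sketch below is a simplification under stated assumptions, not SGV-Finder itself: it standardizes each sample by its median bin coverage, labels bins independently, and skips the step that unites segments by co-occurrence or correlation. The 25–75% rule comes from the notes above; the function name `classify_bins`, the `zero_eps` "not covered" threshold and the `var_cutoff` variance threshold are hypothetical.

```python
from statistics import median, pvariance


def classify_bins(coverage, zero_eps=0.05, var_cutoff=0.25):
    """Label each genomic bin as 'deletion', 'variable' or 'conserved'.

    coverage: list of per-sample lists, one coverage value per bin.
    Assumes every sample has non-zero median coverage (i.e. the genome
    passed a coverage threshold in that sample).
    """
    # Standardize each sample by its own median bin coverage.
    std = []
    for row in coverage:
        m = median(row)
        std.append([c / m for c in row])

    n_samples = len(std)
    labels = []
    for b in range(len(std[0])):
        col = [row[b] for row in std]
        frac_absent = sum(c <= zero_eps for c in col) / n_samples
        if 0.25 <= frac_absent <= 0.75:
            labels.append("deletion")   # not covered in 25-75% of samples
        elif pvariance(col) > var_cutoff:
            labels.append("variable")   # highly variable coverage
        else:
            labels.append("conserved")
    return labels


cov = [
    [10, 0, 10],
    [20, 0, 60],
    [30, 30, 30],
    [40, 40, 160],
]
print(classify_bins(cov))  # ['conserved', 'deletion', 'variable']
```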
An online metagenome explorer for all SVs and the genes they encompass is available at SV Finder
SV summarized information about bacterial metagenome (SVs are highly prevalent in the microbiome)
We detected 2,423 variable SVs and 5,056 deletion SVs in 56 bacteria that passed our coverage thresholds
SVs were detected in six bacterial phyla and one archaeal phylum, with 5–241 SVs per species and an average SV size of 1.4–18.6 kilobase pairs (kbp) per species
Variable and deletion SVs make up 0.3–8.4% and 5.0–26.9% of the microbial genome, respectively
Information to infer:
This apparent disparity in size may suggest inherent differences in the formation of the two types of SVs
We detected SVs in every subject and strain analysed, demonstrating the ubiquity of such variations.
SV is prevalent across distinct populations
To test the universality of these regions and reinforce their biological relevance, we applied ICRA and SGV-Finder independently in
study cohort (Israeli)
validation cohort (Dutch Lifelines)
SVs replicate across cohorts
more than 70% of the regions were replicated despite the different genetic background, lifestyle and dietary preferences of the distinct populations studied
Some bacteria, such as Ruminococcus bicirculans, showed very low concordance between the cohorts, suggesting geographically confined variability or strong population-specific environmental factors. Other bacteria, such as Parabacteroides merdae, showed high concordance
SVs are potentially involved in microbial adaptation
We characterized the function of SVs by searching for enriched or depleted genetic functions. We annotated gene functions across variable SVs, deletion SVs and ‘conserved’ regions
Using the KEGG BRITE hierarchy, we found that ‘housekeeping’ modules such as nucleotide and amino acid metabolism or carbohydrate and lipid metabolism were significantly depleted from SVs and significantly enriched in conserved regions
Conversely, modules classified as ABC-2 type- and other transport systems were significantly enriched in variable SVs
Bacterial conjugation systems, to which the T4SS is related, strongly associate with variability, further implicating SVs as tools of adaptation and speciation
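Enrichment or depletion of a functional module in SVs can be checked with a simple over-representation test. The notes do not say which test was used, so the one-sided hypergeometric test below is an assumption, and the function name and all gene counts in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_annotated, n_in_sv, n_annotated_in_sv):
    """One-sided hypergeometric p-value for enrichment.

    Probability of seeing at least `n_annotated_in_sv` module genes when
    `n_in_sv` genes are drawn without replacement from `n_total` genes,
    of which `n_annotated` belong to the module.
    """
    upper = min(n_annotated, n_in_sv)
    tail = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_in_sv - k)
        for k in range(n_annotated_in_sv, upper + 1)
    )
    return tail / comb(n_total, n_in_sv)


# Hypothetical counts: 2,000 genes, 100 in a transport module,
# 300 genes inside variable SVs, 30 of those in the module.
p = hypergeom_enrichment_p(2000, 100, 300, 30)
print(f"enrichment p-value: {p:.3g}")
```

Depletion can be tested the same way by summing the lower tail instead, and p-values from many modules would then need multiple-testing correction.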
To further characterize the potential contribution of SVs to microbial adaptation, we searched for SVs associated with the fitness of their harbouring microbe. As a proxy for fitness, we calculated bacterial growth rates of 21 strains with sufficient coverage and complete reference genomes, using a method that estimates growth from DNA copy number differences created during DNA replication
suggesting that certain SVs may be important for bacterial adaptation and fitness.
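The growth proxy rests on the fact that, in a replicating population, regions near the replication origin are present in more copies than regions near the terminus. A minimal sketch of such a coverage ratio follows; it assumes a circular genome split into equal bins, and the function name `peak_to_trough_ratio` and the median-smoothing window are illustrative choices rather than the published method's parameters.

```python
from statistics import median


def peak_to_trough_ratio(bin_coverage, window=5):
    """Ratio of highest to lowest smoothed coverage along a circular genome.

    bin_coverage: coverage per equal-sized bin, in genome order.
    window: odd number of bins used for circular median smoothing.
    Assumes smoothed coverage is never zero.
    """
    n = len(bin_coverage)
    half = window // 2
    smoothed = [
        median(bin_coverage[(i + d) % n] for d in range(-half, half + 1))
        for i in range(n)
    ]
    return max(smoothed) / min(smoothed)


# Toy genome: coverage falls linearly from 20 at the origin (bin 0)
# to 10 at the terminus (bin 10) and rises again.
cov = [20 - min(i, 20 - i) for i in range(20)]
print(peak_to_trough_ratio(cov, window=1))  # 2.0
```

A ratio close to 1 indicates little ongoing replication; SV bins themselves would need to be excluded before computing it, since their coverage varies for other reasons.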
To probe the mechanisms potentially underlying this adaptation, we systematically examined the genetic content of growth-associated deletion SVs and found similar functional profiles as in all SVs, with a
- depletion of housekeeping functions and
- enrichment for CRISPR-, transposon- and HGT-associated genes
SVs potentially play an important role in microbial adaptation
SVs associate with common mechanisms of conjugation, transposition and phage lysogeny, and may therefore be powerful tools of adaptation. Microbial evolution in densely populated ecosystems such as the human microbiome may thus be driven strongly by SVs, affecting both microbes and host.
SVs associate with risk factors across cohorts (SVs associate with disease risk, replicated in another cohort.)
We associated the abundance of variable SVs and the presence of deletion SVs with metrics of health and risk factors
We found 81 (Spearman’s correlation) and 43 (Mann–Whitney U-test; Fig. 3b) significant associations (FDR-corrected at 0.1) for variable and deletion SVs, respectively, demonstrating the potential importance of microbial SVs to the human host.
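With many SV–risk-factor pairs tested, the per-test p-values must be corrected before calling associations significant. The notes only state "FDR corrected at 0.1", so the Benjamini–Hochberg procedure below is an assumption about the specific correction, and the p-values in the usage example are made up.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return a list of booleans, True where the test is called significant.

    Implements the Benjamini-Hochberg step-up rule: find the largest
    rank k with p_(k) <= (k / m) * fdr and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from four SV / risk-factor tests.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
# [True, True, True, False]
```

The input p-values would come from a Spearman correlation for variable SVs (continuous coverage) or a Mann–Whitney U-test for deletion SVs (present versus absent).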
The association of specific SVs with risk factors allows us to pinpoint specific regions and mechanisms that may underlie the association.
These seemingly paradoxical associations between SVs and risk factors further suggest that SVs represent a different layer of information compared to the taxonomic level, one which may assist in obtaining mechanistic insights into host–microbe interactions
Validation of associations
To test the replicability of these associations, we ran ICRA on samples from the Lifelines cohort and calculated the coverage of the SVs defined from the 887-person cohort. We then calculated the association of these regions with host risk factors measured in the Lifelines cohort and compared those to the associations found in our cohort. Notably, despite presumed inter-cohort differences in genetics, dietary preferences and lifestyles, more than a third (40 out of 117) of the associations found in microbes present in both cohorts were replicated, whereas only 4 of the remaining 77 were significantly associated in the opposite direction with respect to the same association in the Israeli cohort
potentially pointing to a generalized mechanistic association between this SV and disease risk
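The replication figures quoted above can be checked with simple arithmetic. The snippet uses only the counts stated in the notes (40 of 117 replicated, 4 of the remaining 77 in the opposite direction); the variable names are illustrative.

```python
replicated, tested = 40, 117
remaining = tested - replicated        # associations that did not replicate
opposite = 4                           # significant in the opposite direction

frac_replicated = replicated / tested  # just over one third
frac_opposite = opposite / remaining

print(f"replicated: {frac_replicated:.1%}")        # replicated: 34.2%
print(f"opposite direction: {frac_opposite:.1%}")  # opposite direction: 5.2%
```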
Summary and Discussion
We systematically detect SVs across metagenomic samples and show that they are highly abundant in the human microbiome and largely conserved across different cohorts
We found that SVs harbour genes of distinct functions and are associated with bacterial growth rates, indicating a potential utility in bacterial adaptation
Finally, we found they are associated with multiple host disease risk factors, many of which replicated in an independent cohort, and that they facilitate exploration of genes varying together, exposing a new layer of putative mechanistic information regarding host-microbiome interactions
Based on functional analysis of genes in those regions, we hypothesize that the main forces driving SVs are mechanisms of HGT, as evident from the enrichment of genes performing these functions in SVs. Many genes found in SVs, such as antibiotic biosynthesis genes, can possibly be characterized as passengers to this process of transposition and may have important roles in the adaptation of microbes to their environments and in communication with the host
The associations described here between SVs and host health are not directional or causal and could also be confounded. Although further research is needed to fully understand the interactions between the host, its microbiome and disease, we demonstrate the wealth of mechanistic hypotheses obtained through examining genes with variable copy number along with neighbouring variable genes
Generality of this methodology
Our methodology is highly adaptable to any metagenomic scenario and could be used, for example, to detect SVs in the soil microbiome