sv_association_nature

Last updated on Aug 4, 2019

Why SV (CNV) is important

Differences in the presence of even a few genes between otherwise identical bacterial strains may result in critical phenotypic differences

Genes that are deleted or duplicated within different members of a species (also termed copy number variation; CNV), are common across all kingdoms of life

Even a small number of bacterial genes can underlie phenotypes such as virulence [3], antibiotic resistance [4], host metabolic disease [5] and host longevity [6], making genetic variation highly important to both the microbe and its host.

Previous Work and Background

A systematic characterization of intra-species CNVs across the human microbiome was recently performed and showed that they are highly prevalent.

This variability could be critical to human pathophysiology

gut microbes were found to be involved in multiple host processes ref
associated with multiple disorders ref

Limitations

this and other studies analysing the genetic repertoire of the microbiome were potentially limited

by the scope of the annotation databases used
by ignoring the co-variation of genes from the same genomic region
Other functional characterization methods may be limited with regards to within-species variation of genes.

What is co-variation of genes from the same genomic region

Such co-variation is important as it encodes information such as operon membership, gene regulation or susceptibility to horizontal transfer that is only evident when analysing genes in their neighbouring genomic context

study objective

SV term

The spectrum of human genetic variation ranges from the single base pair to large chromosomal events, but it has become apparent that human genomes differ more as a consequence of structural variation than of single-base-pair differences

Structural variation was originally defined as insertions, deletions and inversions greater than 1 kb in size. With the sequencing of human genomes now becoming routine, the operational spectrum of structural variants (SVs) and copy number variants (CNVs) has widened to include much smaller events (for example, those >50 bp in length).

The challenge now is to discover the full extent of structural variation and to be able to genotype it routinely in order to understand its effects on

human disease,
complex traits
evolution

Two distinct models to associate the sv and disease

two distinct models have been proposed with respect to associations between disease and structural variation.

The first involves large variants (typically gains and losses several hundred kilobase pairs in length) that are individually rare in the population (<1%) but collectively account for a significant fraction of disease
The second includes multicopy gene families that are commonly copy number variable and contribute to disease susceptibility, as seen for traits related to immune gene functions

classes of SV

Structural variant (SV). Genomic rearrangements that affect >50bp of sequence, including deletions, novel insertions, inversions, mobile-element transpositions, duplications and translocations
Copy number variant (CNV). Also defined as unbalanced structural variants; variants that change the number of base pairs in the genome.
Mobile elements. DNA sequences that move location within the genome. Active mobile elements (transposons) in the human genome include Alu, L1 and SVA sequences

Scientific Questions

The discovery and genotyping of structural variation has been central to understanding these disease associations

Systematic and comprehensive assessment of structural variation has been problematic owing to the complexity and multifaceted features of SVs
SV discovery and genotyping requires accurate prediction of three features: copy, content and structure
SVs tend to reside within repetitive DNA, which makes their characterization more difficult. SVs vary widely in size and there are numerous classes of structural variation: deletions, translocations, inversions, mobile elements, tandem duplications and novel insertions
Once a variant has been detected, validated and characterized at the sequence level (discovery), a different suite of methods may be applied to infer genotypes with relaxed thresholds

SV calling methods

array-based
sequencing-based. There are four general types of strategy, all of which focus on mapping sequence reads to the reference genome and subsequently identifying discordant signatures or patterns that are diagnostic of different classes of SV

Importance of This work

To detect segments of varying lengths, potentially containing multiple genes, that are deleted from certain bacteria in some individuals or present in a variable number of copies in others

We identify microbial genomic structural variants (SVs) and find them to be prevalent in the human gut microbiome across phyla and to replicate in different cohorts. SVs are enriched for CRISPR-associated and antibiotic-producing functions and depleted from housekeeping genes, suggesting that they have a role in microbial adaptation. We find multiple associations between SVs and host disease risk factors, many of which replicate in an independent cohort. Exploring genes that are clustered in the same SV, we uncover several possible mechanistic links between the microbiome and its host, including a region

our results uncover a nascent layer of variability in the microbiome that is associated with microbial adaptation and host health

Accurate metagenomic read assignment using ICRA

Problem: over 15% of the metagenomic reads were assigned ambiguously to multiple references upon mapping to a database of 3,953 bacterial genomes

Solution: To address this problem, we devised the ICRA algorithm which uses read assignments, read and mapping qualities, sequencing coverage depth along microbial entities (for example, bacterial genomes) and microbial relative abundances to reassign ambiguously mapped reads

ICRA introduces a demand for sufficient coverage over entities that are to be considered present in a sample, making it robust to genomic regions with extremely high or low coverage that may arise from misassemblies, homology to other microbes or phage activation

Such regions could otherwise bias the estimated relative abundances, potentially even assigning abundances to genomic entities that are absent from the sample.
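The reassignment idea described above can be sketched roughly as follows. This is a toy illustration, not the published ICRA implementation: ambiguous reads are repeatedly split across their candidate genomes in proportion to the current abundance estimates, and genomes with too little support are dropped. The function name `reassign_reads`, the `min_reads` cutoff and the input layout are all assumptions made for this example; the real algorithm additionally uses read and mapping qualities and the coverage depth along each genome.

```python
from collections import defaultdict

def reassign_reads(read_hits, n_iter=50, min_reads=1.0):
    """Toy EM-style reassignment of ambiguously mapped reads (hypothetical helper).

    read_hits: dict read_id -> list of candidate genome ids.
    Returns (abundance, weights); weights[read] maps genome -> share of that read.
    """
    if not read_hits:
        return {}, {}
    genomes = {g for hits in read_hits.values() for g in hits}
    abundance = {g: 1.0 / len(genomes) for g in genomes}
    weights = {}
    for _ in range(n_iter):
        # Split each read across its still-present candidates by abundance.
        counts = defaultdict(float)
        for read, hits in read_hits.items():
            live = [g for g in hits if abundance.get(g, 0.0) > 0.0]
            if not live:
                weights[read] = {}
                continue
            total = sum(abundance[g] for g in live)
            weights[read] = {g: abundance[g] / total for g in live}
            for g, w in weights[read].items():
                counts[g] += w
        # Coverage demand (assumed cutoff): drop weakly supported genomes,
        # unless that would remove every genome.
        kept = {g: c for g, c in counts.items() if c >= min_reads} or dict(counts)
        norm = sum(kept.values())
        abundance = {g: c / norm for g, c in kept.items()}
    return abundance, weights
```

In this sketch a genome that is only ever hit by a fraction of one shared read disappears after the first iteration, which mimics how a coverage requirement can stop abundance being assigned to entities that are absent from the sample.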

To test the performance of ICRA, we validated the two key components of the algorithm: its ability to resolve ambiguous read assignments and the accuracy of the species relative abundances that it infers

SGV-Finder to seek to systematically characterize structural variation

we applied SGV-Finder to ICRA-corrected read assignments of 887 metagenomic samples against a reference database of 3,953 representative microbial genomes

core of SGV:

SGV-Finder analyses coverage depth across all microbial genomes in all samples to characterize SVs with respect to the standardized coverage of a genome in a given sample

We differentiate between deletion SVs, that are deleted and not covered in 25–75% of samples, and variable SVs, that have highly variable coverage across samples. In both SV types, segments are united based on co-occurrence (deletion SVs) or correlation (variable SVs).
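As a rough sketch of the distinction between the two SV types, the snippet below labels genomic bins from a bins-by-samples coverage table. It is not SGV-Finder itself: the relative-coverage normalisation, the `absent_below` and `sd_cutoff` thresholds and the function name are assumptions, and the step that unites neighbouring segments by co-occurrence or correlation is left out. Only the 25–75% absence window comes from the text above.

```python
from statistics import pstdev

def classify_bins(coverage, absent_below=0.1, sd_cutoff=1.0):
    """Label each genomic bin as 'deletion', 'variable' or 'conserved'.

    coverage: list of rows, one per bin; each row holds the bin's coverage in
    every sample, relative to that genome's typical depth (assumed scaling).
    """
    labels = []
    for row in coverage:
        # Fraction of samples in which the bin is effectively not covered.
        absent_frac = sum(c < absent_below for c in row) / len(row)
        if 0.25 <= absent_frac <= 0.75:
            labels.append("deletion")
        elif pstdev(row) > sd_cutoff:  # hypothetical variability cutoff
            labels.append("variable")
        else:
            labels.append("conserved")
    return labels
```

For instance, a bin covered in only half of the samples would be labelled a deletion candidate, while a bin that is always present but at widely differing depth would be labelled variable.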

An online metagenome explorer for all SVs and the genes they encompass is available at SV Finder

Summary of SVs in the bacterial metagenome (SVs are highly prevalent in the microbiome)

detected 2,423 variable SVs and 5,056 deletion SVs in 56 bacteria that passed our coverage thresholds
SVs were detected in six bacterial phyla and one archaeal phylum, with 5–241 SVs per species in 1.4–18.6 kilobase pairs (kbp) average size per species
Variable and deletion SVs make up 0.3–8.4% and 5.0–26.9% of the microbial genome, respectively

Information to infer:

This apparent disparity in size may suggest inherent differences in the formation of the two types of SVs
We detected SVs in every subject and strain analysed, demonstrating the ubiquity of such variations.

SV is prevalent across distinct populations

To test the universality of these regions and reinforce their biological relevance, we applied ICRA and SGV-Finder independently in

study cohort (Israeli)
validation cohort (Dutch Lifelines)

SVs replicate across cohorts

more than 70% of the regions were replicated despite the different genetic background, lifestyle and dietary preferences of the distinct populations studied

Some bacteria, such as Ruminococcus bicirculans, showed very low concordance between the cohorts, suggesting geographically confined variability or strong population-specific environmental factors. Other bacteria, such as Parabacteroides merdae, showed high concordance

SVs are person-specific and shared with habitat

different individuals mostly have different SV profiles
In contrast, SVs were highly stable within the same individual, even over time periods exceeding one year
The same comparison was made for co-habiting individuals as well as for siblings and parent–child pairs who do not live together [22] (‘relatives’). Both groups share SVs to a significantly higher degree than two random subjects
relatives have significantly less similar microbiome SV profiles compared to co-habiting subjects
This result is conservative, as relatives who did not live together could still share environmental exposures affecting their microbiome, such as traditional food preferences or shared meals. These results further support our previous findings that the environment dominates over genetics in determining microbiome composition.

SVs are potentially involved in microbial adaptation

characterize the function of SVs by searching for enriched or depleted genetic functions. We annotated gene functions across variable SVs, deletion SVs and ‘conserved’ regions

Using the KEGG BRITE hierarchy, we found that ‘housekeeping’ modules such as nucleotide and amino acid metabolism or carbohydrate and lipid metabolism were significantly depleted from SVs and significantly enriched in conserved regions

Conversely, modules classified as ABC-2 type- and other transport systems were significantly enriched in variable SVs
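One common way to test whether a functional module is over-represented among SV genes is a one-sided hypergeometric test, sketched below. The notes do not say which test the study used, so treat this as an illustration of the general approach; the function and parameter names are made up for the example.

```python
from math import comb

def hypergeom_enrichment_p(k, n_sv, K, N):
    """One-sided enrichment p-value P(X >= k).

    N: total annotated genes, K: genes carrying the module,
    n_sv: genes located in SVs, k: SV genes carrying the module.
    """
    upper = min(n_sv, K)
    total = comb(N, n_sv)
    tail = sum(comb(K, i) * comb(N - K, n_sv - i) for i in range(k, upper + 1))
    return tail / total
```

A depletion test is the mirror image (the probability of seeing k or fewer), and in practice the resulting p-values would still need multiple-testing correction across modules.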

bacterial conjugation systems, to which the T4SS is related, strongly associate with variability, further implicating SVs as tools of adaptation and speciation

To further characterize the potential contribution of SVs to microbial adaptation, we searched for SVs associated with the fitness of their harbouring microbe. As a proxy for fitness, we calculated bacterial growth rates of 21 strains with sufficient coverage and complete reference genomes, using a method that estimates growth from DNA copy number differences created during DNA replication
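The intuition behind estimating growth from replication-induced copy number differences is that, in a dividing population, regions near the replication origin are covered more deeply than regions near the terminus. A minimal sketch of a peak-to-trough coverage ratio follows; the binning, the median smoothing window and the function name are assumptions for illustration, not the published method's parameters.

```python
from statistics import median

def peak_to_trough_ratio(coverage, window=3):
    """coverage: read depth per consecutive bin along a circular genome."""
    n = len(coverage)
    half = window // 2
    # Median-smooth with wrap-around, since the chromosome is circular.
    smoothed = [
        median(coverage[(i + d) % n] for d in range(-half, half + 1))
        for i in range(n)
    ]
    trough = min(smoothed)
    if trough <= 0:
        raise ValueError("coverage too low to estimate a ratio")
    return max(smoothed) / trough
```

A ratio close to 1 would suggest little active replication, while larger values suggest faster growth; this is also why the approach needs sufficient coverage and a complete reference genome.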

suggesting that certain SVs may be important for bacterial adaptation and fitness.

To probe the mechanisms potentially underlying this adaptation, we systematically examined the genetic content of growth-associated deletion SVs and found similar functional profiles as in all SVs, with a

depletion of housekeeping functions and
enrichment for CRISPR-, transposon- and HGT-associated genes

SVs potentially play an important role in microbial adaptation

SVs associate with common mechanisms of conjugation, transposition and phage lysogeny, and may therefore be powerful tools of adaptation. Microbial evolution in densely populated ecosystems such as the human microbiome may thus be driven strongly by SVs, affecting both microbes and host.

SVs associate with risk factors across cohorts (SVs associate with disease risk, replicated in another cohort.)

we associated the abundance of variable SVs and the presence of deletion SVs with metrics of health and risk factors

We found 81 (Spearman’s correlation) and 43 (Mann–Whitney U-test; Fig. 3b) significant associations, FDR corrected at 0.1, for variable and deletion SVs, respectively, demonstrating the potential importance of microbial SVs to the human host.
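To make the "FDR corrected at 0.1" step concrete, here is a small sketch of the Benjamini–Hochberg procedure applied to a list of p-values. The notes only state the 0.1 threshold, so the choice of Benjamini–Hochberg is an assumption, and the p-values in any usage would come from the Spearman or Mann–Whitney tests mentioned above.

```python
def benjamini_hochberg(pvalues, q=0.1):
    """Return a list of booleans marking which hypotheses pass at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            cutoff_rank = rank
    # Every hypothesis ranked at or below the cutoff is rejected.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

Note that a p-value can pass even if it exceeds its own rank threshold, as long as a larger p-value further down the ranking meets its threshold.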

the associations of specific SVs with risk factors allow us to pinpoint specific regions and mechanisms that may underlie the association.

These seemingly paradoxical associations between SVs and risk factors further suggest that SVs represent a different layer of information compared to the taxonomic level, one which may assist in obtaining mechanistic insights into host–microbe interactions

validation association

To test the replicability of these associations, we ran ICRA on samples from the Lifelines cohort and calculated the coverage of the SVs defined from the 887-person cohort. We then calculated the association of these regions with host risk factors measured in the Lifelines cohort and compared those to the associations found in our cohort. Notably, despite presumed inter-cohort differences in genetics, dietary preferences and lifestyles, more than a third (40 out of 117) of the associations found in microbes present in both cohorts were replicated, whereas only 4 of the remaining 77 were significantly associated in the opposite direction with respect to the same association in the Israeli cohort
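The bookkeeping behind such a replication check can be sketched as below: for associations testable in both cohorts, count those that are significant in the validation cohort in the same direction and those significant in the opposite direction. The data layout, the `alpha` cutoff and the function name are hypothetical; the counts quoted above (40 of 117, about 34%, replicated; 4 of the remaining 77, about 5%, in the opposite direction) come from the study, not from this code.

```python
def replication_counts(discovery, validation, alpha=0.05):
    """Count replicated and opposite-direction associations.

    discovery: dict name -> effect (its sign gives the direction).
    validation: dict name -> (effect, p_value).
    Returns (n_shared, n_replicated, n_opposite).
    """
    shared = [name for name in discovery if name in validation]
    replicated = opposite = 0
    for name in shared:
        effect, p_value = validation[name]
        if p_value >= alpha:  # assumed significance cutoff
            continue
        if (effect > 0) == (discovery[name] > 0):
            replicated += 1
        else:
            opposite += 1
    return len(shared), replicated, opposite
```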

potentially pointing to a generalized mechanistic association between this SV and disease risk

Summary and Discussion

we systematically detect SVs across metagenomic samples and show that they are highly abundant in the human microbiome and largely conserved across different cohorts
We found that SVs harbour genes of distinct functions and are associated with bacterial growth rates, indicating a potential utility in bacterial adaptation
Finally, we found they are associated with multiple host disease risk factors, many of which replicated in an independent cohort, and that they facilitate exploration of genes varying together, exposing a new layer of putative mechanistic information regarding host-microbiome interactions
Based on functional analysis of genes in those regions, we hypothesize that the main forces driving SVs are mechanisms of HGT, as evident from the enrichment of genes performing these functions in SVs. Many genes found in SVs, such as antibiotic biosynthesis genes, can possibly be characterized as passengers to this process of transposition and may have important roles in the adaptation of microbes to their environments and in communication with the host
The associations described here between SVs and host health are not directional or causal and could also be confounded. Although further research is needed to fully understand the interactions between the host, its microbiome and disease, we demonstrate the wealth of mechanistic hypotheses obtained through examining genes with variable copy number along with neighbouring variable genes

generality of this methodology

Our methodology is highly adaptable to any metagenomic scenario and could be used, for example, to detect SVs in the soil microbiome

Tank (Xiao-Ning Zhang)
