Keynote Talks

Inferring historical effective population size using segments of identity by descent

Speaker: Sharon Browning

Abstract: Patterns of identity by descent (IBD) sharing within and between populations are informative about the demographic histories of those populations, including population sizes, bottlenecks, growth rates, and migration rates. IBDNe uses the lengths of detected IBD segments from a sample of individuals to infer past effective sizes of the population. For admixed samples with local ancestry calls, the effective sizes of each of the contributing ancestral populations can be inferred. IBDNe can also be used to estimate rates of migration between populations. We present results from IBDNe from diverse African populations and Hispanic populations. We infer reductions in population sizes occurring at the time of European colonization, high rates of growth in the past few generations, and other population-specific features.

Genomic tales of archaic hominin admixture

Speaker: Josh Akey

Abstract: Genetic data has revealed that hybridization between anatomically modern humans and archaic hominins occurred multiple times and with multiple hominin lineages. We have developed a number of statistical methods to identify sequences inherited from archaic hominin ancestors that persist in the DNA of modern individuals, and applied them to whole-genome sequences from over 1,500 geographically diverse individuals. The catalog of surviving Neandertal and Denisovan sequences identified provides insight into admixture dynamics, selective pressures acting on introgressed sequences, and the functional and phenotypic consequences of hybridization. Moreover, new methodological advances reveal significantly more Neandertal ancestry among African individuals than previously appreciated, due to migration of European ancestors back to Africa. We show this observation has important implications for interpreting contemporary patterns of Neandertal sequences in both African and non-African populations. The continued excavation of archaic hominin lineages from the genomes of geographically diverse humans will clarify hominin evolutionary history and the genetic substrates of uniquely modern traits.

Contributed Talks

Convergence of healthcare records and genetic evidence shows the pleiotropic link between cardiovascular dysfunction and Alzheimer’s disease

Speaker: Hyojung Paik

Abstract: Sequential representations of individual health at scale are essential for unravelling long-progressing diseases and for tailoring care, for example for hyperlipidemia and its accompanying complications. However, the rare successes of big-data analytics in identifying novel disease associations have largely been based on epidemiological approaches without molecular evidence. Here, we created the first large-scale Korean longitudinal disease network of traced diagnosis pathways (i.e. disease trajectories), merging data from over 2.1 million patients from the entire country through the Health Insurance Review & Assessment Service (HIRA) between 2009 and 2011. Creating a temporal representation of disease progressions that maps 405 common disease trajectories revealed an unexpected association between Alzheimer’s disease (AD), a chronic neurodegenerative disorder, and heart failure (HF) (Relative Risk, RR=3.25). We validated this association using five years of electronic health records from over 830,000 patients at the University of California, San Francisco (UCSF) medical center (RR=5.63). A disease–single-nucleotide polymorphism association database (VARIMED) covering ~8000 studies suggests that 41 genetic variants of 92 genes, including APOE, are shared between the genetic variants relevant to AD and HF (p = 3.28e-15). Among those genes, whole-exome sequencing of 50 Alzheimer and non-Alzheimer individuals identified three novel, rare pathogenic genetic variants harbored in MTHFD1L, DPP10 and ADIPOQ. A pleiotropic impact of those genes was simultaneously observed using network-based approaches. Convergence of healthcare records and genetic evidence may help to dissect the molecular underpinnings of heart disease and associated Alzheimer’s disease.
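
For readers less familiar with the relative risk (RR) values quoted in this abstract, the following is a minimal sketch of the textbook calculation: the risk of the later diagnosis among patients with the earlier one, divided by the risk among patients without it. The function name and all counts are hypothetical and are not taken from the HIRA or UCSF cohorts; the abstract does not state how trajectories were counted or whether any adjustment was applied.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of the outcome among the exposed divided by the risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical counts for illustration only (not from the study).
rr = relative_risk(exposed_cases=30, exposed_total=1000,
                   unexposed_cases=10, unexposed_total=1000)
print(f"RR = {rr:.2f}")
```

An RR above 1 means the later diagnosis is more frequent in the exposed group; an RR of 1 means no difference in risk.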

A framework for the well-calibrated analysis of complex traits in admixed individuals

Speaker: Elizabeth Atkinson

Abstract: Currently, most medical genomics studies exclude ‘admixed’ individuals whose ancestry is not homogeneous. Admixed people are routinely removed due to a paucity of methodological approaches that account for their genomic complexity, such that population substructure can infiltrate analyses and bias results. Admixed populations (including African American and Latino individuals) make up more than a third of the US populace and have higher rates of some complex disorders including PTSD. Yet, these groups face severe disparities in medical research and treatment because they are so sorely underrepresented in genomic studies.

Here, we present a novel analytical framework, distributed as a software package nicknamed ‘Tractor,’ which precisely accounts for subtle differences in admixture at the genotype level, allowing admixed samples to be readily included alongside homogeneous ones in statistical genomics efforts. Our pipeline incorporates local ancestry in addition to global ancestry fractions, which takes into account subtle differences in individual-level admixture patterns that may differ among case and control cohorts even if their global ancestry fractions are the same. Tractor further leverages the additional information provided by ancestral chromosome painting to correct phase errors and recover long-range haplotypes in admixed individuals, which we find to be severely disrupted by statistical phasing algorithms.

We apply our framework to several admixed cohorts with high global diversity from the Psychiatric Genomics Consortium PTSD working group (PGC-PTSD), with a focus on admixed populations of the Americas. We demonstrate that we have high accuracy at calling local ancestry across demographic scenarios, and are able to significantly improve long-range phasing in admixed individuals. Compared to the traditional GWAS model, there is a significant gain in power using the Tractor framework, which allows for local ancestry-aware GWAS, with improvements in power across sample sizes and simulated disease prevalences. We further demonstrate that this framework gives increased fine-mapping precision by leveraging the disrupted linkage disequilibrium blocks visible with ancestral chromosome painting in recently admixed groups.

This framework could be applied to solve statistical issues related to admixture across many medical and population genetics activities, including association testing and evolutionary genome-wide selection scans. In sum, Tractor dramatically advances the existing methodologies for statistical genetic analysis of admixed individuals and allows for significantly better calibrated study of the genetics of complex disorders in underrepresented admixed populations.

Highlight Talks

Characterizing mutagenic effects of recombination through a sequence-level genetic map

Speaker: Bjarni V. Halldórsson

Abstract: Genetic diversity arises from recombination and de novo mutation (DNM). Using a combination of microarray genotype and whole-genome sequence data on parent-child pairs, we identified 4,531,535 crossover recombinations and 200,435 DNMs. The resulting genetic map has a resolution of 682 base pairs. Crossovers exhibit a mutagenic effect, with overrepresentation of DNMs within 1 kilobase of crossovers in males and females. In females, a higher mutation rate is observed up to 40 kilobases from crossovers, particularly for complex crossovers, which increase with maternal age. We identified 35 loci associated with the recombination rate or the location of crossovers, demonstrating extensive genetic control of meiotic recombination, and our results highlight genes linked to the formation of the synaptonemal complex as determinants of crossovers.

Evaluation of methods handling missing data in PCA: applications for ancient DNA

Speaker: Kristiina Ausmees

Abstract: Principal Component Analysis (PCA) is a method of projecting data onto a basis that maximizes its variance, possibly revealing previously unseen patterns or features. PCA can be used to reduce the dimensionality of multivariate data, and is widely applied in visualization of genetic information. In the field of ancient DNA, it is common to use PCA to show genetic affinities of ancient samples in the context of modern variation. Due to the low quality and sequence coverage often exhibited by ancient samples, such analysis is not straightforward, particularly when performing joint visualization of multiple individuals with non-overlapping sequence data. The PCA transform is based on variances of allele frequencies among pairs of individuals, and discrepancies in overlap may therefore have large effects on scores. As the relative distances between scores are used to infer genetic similarity, it is important to distinguish between the effects of the particular set of markers used and actual genetic affinities. This work assesses the problem of using an existing PCA model to estimate scores of new observations with missing data. We consider the particular application of visualizing genotype data, and evaluate several approaches commonly used in population genetic analyses as well as other methods from the literature. Using empirical ancient data, we illustrate the differences between the trimmed score (TRI) and projection to the model plane (PMP) methods, which correspond to two options for handling missing data in the software SMARTPCA. We also show that differences in the set of SNPs considered can have pronounced effects on estimated scores when performing PCA individually on samples and subsequently merging them using Procrustes transformation. Finally, we consider the two least-squares based methods trimmed score regression (TSR) and known data regression (KDR), and show that these exhibit similarly robust behaviour to differences in marker sets. 
Evaluation of the methods was also performed based on modern sample data with varying levels of simulated sparsity, and showed that the TSR and KDR methods were superior to the others with respect to estimation error.
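
The difference between the trimmed score (TRI) and projection to the model plane (PMP) approaches can be illustrated with a short sketch. It assumes the usual textbook definitions: TRI replaces missing centred values with zero and applies the loadings as usual, while PMP fits the scores by least squares using only the observed markers. The data are simulated, the function names are ours, and nothing here reproduces SMARTPCA or the evaluation in the abstract.

```python
import numpy as np


def fit_pca(X, k):
    """Fit a k-component PCA model on complete reference data (rows = individuals)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes; keep the first k as columns of P.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T


def trimmed_score(x, mean, P):
    """TRI: missing markers contribute zero after centring."""
    obs = ~np.isnan(x)
    return P[obs].T @ (x[obs] - mean[obs])


def projection_to_model_plane(x, mean, P):
    """PMP: least-squares fit of the scores using only the observed markers."""
    obs = ~np.isnan(x)
    scores, *_ = np.linalg.lstsq(P[obs], x[obs] - mean[obs], rcond=None)
    return scores


rng = np.random.default_rng(0)
# Simulated 0/1/2 genotypes: 50 reference individuals, 200 markers.
reference = rng.binomial(2, 0.3, size=(50, 200)).astype(float)
mean, P = fit_pca(reference, k=2)

# A new individual, scored first with complete data, then with markers hidden.
sample = rng.binomial(2, 0.3, size=200).astype(float)
full_scores = P.T @ (sample - mean)
sample[rng.random(sample.size) < 0.5] = np.nan  # hide roughly half of the markers

print("complete data:", full_scores)
print("TRI:          ", trimmed_score(sample, mean, P))
print("PMP:          ", projection_to_model_plane(sample, mean, P))
```

With no missing markers both functions return the ordinary PCA scores, so any difference between them comes solely from how the unobserved markers are handled.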

The role of splicing variation in hominin evolution

Speaker: Arta Seyedian

Abstract: Genetic variation influencing pre-mRNA splicing is ubiquitous in human populations and constitutes a primary link to phenotypic variation and disease. Despite striking anecdotal examples, the contribution of splice-altering mutations to the evolution of human complex traits remains poorly characterized. Seeking a genome-wide perspective on hominin splicing evolution, we leveraged multiple functional genomic datasets to examine the effects of divergent alleles between archaic and modern humans on patterns of splicing, as well as consequent effects on organismal phenotypes. Mutations disrupting essential splice sites (GT and AG dinucleotides) or occurring adjacent to these regions are relatively easily predicted, and hominin single nucleotide substitutions at 470 such sites were catalogued upon high-coverage sequencing of the Altai Neandertal and Denisovan genomes. In contrast, non-coding variants occurring outside of this context can also contribute to splicing disruption via a phenomenon termed “cryptic splicing”. Though historically challenging to predict, such splice-disrupting mutations are prevalent and can be strongly deleterious, implicated as a major cause of human genetic disorders. Seeking to assess the role of cryptic splicing in hominin divergence, we scored archaic and modern human-specific substitutions for cryptic splice predictions. We identified 279 single nucleotide changes that are predicted to disrupt splicing outside the context of annotated splice sites (SpliceAI Δ > 0.2). These include 153 derived alleles that are specific to the archaic lineages, as well as 126 derived alleles that are specific to the modern human lineage. Notable high-confidence examples include a fixed Neandertal and Denisovan cryptic splice acceptor gain (SpliceAI Δ = 0.89) in OPHN1, mutations in which cause intellectual disability and cerebellar hypoplasia. 
In addition to mutations that alter splice donor or acceptor sites, splicing effects may arise by mutations that alter the binding of cis-acting splicing enhancers or silencers by trans-acting splicing factors. We thus broadened our analysis by testing for associations between putative introgressed Neandertal sequences and patterns of splicing measured in population-scale RNA-seq data from the GTEx Consortium. Using whole blood tissue as an example, we identified 18 splicing quantitative trait loci (sQTL) where a Neandertal-introgressed variant was the top-scoring variant in cis (10% FDR). Top associations among this set included introgressed SNPs in SLC24A4 (rs61977313; 1.21×10⁻¹¹), a gene involved in hair, eye, and skin pigmentation, as well as AKAP13 (rs4843090; 2.04×10⁻¹⁰) and TLR1 (rs3924113; 1.68×10⁻⁹), genes with well-characterized roles in innate immune response. Moreover, TLR1 is a known candidate of adaptive introgression with demonstrated splicing effects in previous studies of human immunity, thereby providing a positive control in support of our genome-wide approach. Our study highlights the underappreciated role of splice-altering mutations in the functional genomic basis of hominin phenotypic divergence. Moreover, persisting archaic introgressed sequences contribute to both isoform diversity and quantitative variation in the splicing landscape of modern human genomes.

Accurate estimation of transcriptome-wide differential allelic expression

Speaker: Asia Mendelevich

Abstract: Understanding mechanisms that control transcriptional activity of genes is a fundamental goal of biology. Analysis of allele-specific expression aims to measure relative activity, or Allelic Imbalance (AI), of the maternal and paternal alleles in diploid cells and thus capture the integral output of the gene-regulatory systems. The two alleles are located in the same cell nucleus and thus share environmental inputs, but RNA abundance can be different due to genetic dissimilarity between alleles or distinct epigenetic states of the two alleles. Transcriptome-wide allele-specific expression can be measured by a variety of methods, with RNA sequencing being the most widely used. The ability to design such experiments properly is critical but the precise measurement of allelic imbalance has not yet been addressed [2,3]. Most of the existing approaches do not fully take into account the limits of applicability of the experimental and statistical methods and thus may be biased in their estimations. One efficient method to improve the accuracy of the allelic imbalance estimations is to use a set of technical replicates for each biological sample. Differences in AI estimations in technical replicates can be interpreted and include, but are not limited to, differences in sampling, coverage, and variability in allele-specific expression for a given gene and replicate. We found that for RNA-seq data coming from a set of technical replicates there exists an invariant that captures the main features of the experiment and data processing pipeline. Overall, using technical replicates under the assumption that each of them constitutes a truthful sample from the overall distribution, we can estimate the necessary corrections in AI measurements for the corresponding experiment and adjust confidence intervals. We thus propose a method taking all these differences into account and building a more accurate experiment design for allele-specific expression experiments. 
We applied this method to a set of experimental conditions affecting AI and showed that we are able to detect differential allelic expression more precisely than currently existing approaches.
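
As a rough illustration of the quantity discussed in this abstract, the sketch below computes a point estimate of allelic imbalance for one gene in several technical replicates. It assumes AI is defined as the maternal fraction of allele-resolved reads, which the abstract does not specify; the read counts are hypothetical, and the replicate-based correction and confidence-interval adjustment proposed by the authors are not reproduced here.

```python
def allelic_imbalance(maternal_reads, paternal_reads):
    """Assumed definition: maternal share of all allele-resolved reads for a gene."""
    total = maternal_reads + paternal_reads
    if total == 0:
        raise ValueError("no allele-resolved reads for this gene")
    return maternal_reads / total


# Hypothetical (maternal, paternal) read counts for one gene in three technical replicates.
replicates = [(60, 40), (130, 70), (55, 45)]

per_replicate = [allelic_imbalance(m, p) for m, p in replicates]
pooled = allelic_imbalance(sum(m for m, _ in replicates),
                           sum(p for _, p in replicates))
spread = max(per_replicate) - min(per_replicate)

print("per replicate:", [round(ai, 3) for ai in per_replicate])
print("pooled:", round(pooled, 4))
print("range across replicates:", round(spread, 3))
```

The spread between replicates of the same sample is the kind of technical variation that a replicate-aware analysis would need to account for before comparing AI between conditions.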

Integrating GWAS with the largest immune-transcriptome dataset reveals roles of distinct immune cell types in multiple diseases

Speaker: Abhinandan Devaprasad

Abstract: Genome-wide association studies (GWAS) identify genetic variants associated with the risk of developing a disease. It is increasingly recognized that the immune system plays a key role in the development and progression of multiple immune-mediated and non-immune-mediated diseases. Here, we integrated GWAS data with transcriptome data of 40 different immune cells to study their role in 97 distinct diseases belonging to 6 different disease categories. To do so, we constructed the largest GWAS-Immunome network and performed enrichment analysis to elucidate several known and novel disease-associated cell types.

From this network, we found that the largest overlap of disease-associated genes between disease categories is among the digestive, neurological and immune system mediated diseases. The connection between these three systems has been well described in the literature; however, it is still unclear in which system a perturbation has the greatest effect, especially perturbations leading to the first response. The gene ontology analysis of these intersecting genes revealed enrichment of pathways associated largely with cytokine and interleukin production, immune system regulation, and T- and B-cell activation. This re-emphasizes the need to look at disease from an immune system standpoint. Therefore, in this study we examined how disease-associated genes and mutations may specifically affect certain immune cell types. We find that the disease-associated genes of Alzheimer’s and Parkinson’s are largely expressed in macrophages, with genes such as SNCA and GPNMB being the top candidates affecting this cell type. The SNCA gene is known to play a role in macrophage-mediated inflammation and hence indicates a strong role for inflammation in Parkinson’s. However, the role of these genes and of macrophages in Alzheimer’s and Parkinson’s has been only weakly explored, and studying it may help in understanding the causative disease mechanisms.

Similarly, we found that genes associated with psoriasis mainly affect genes expressed at high levels in NK, CD4+ and CD8+ T cells. We found ETS1, a transcription factor, to be an important gene for psoriasis. ETS1 is known to play a role in the production of various cytokines that are increased in psoriasis. However, the exact disease mechanism of this process in psoriasis is poorly understood; examining NK, CD4+ and CD8+ T cells with respect to ETS1 may provide the necessary mechanistic insight.

We also explored pleiotropy within our dataset by using a combination of Fisher’s exact test and Jaccard’s index to calculate the similarity of diseases based on their gene variants. We found many pleiotropic associations between several dissimilar diseases such as Type 1 diabetes and Hypothyroidism, Myocardial Infarction and Cholesterol, Psoriasis and Hodgkin’s Lymphoma, and Crohn’s disease (CD) and rheumatoid arthritis (RA). Among these pleiotropic associations, CD and RA have genetic variants in genes that play a role in the Notch signalling pathway, with the key gene being NOTCH4. We find that plasmacytoid dendritic cells (pDCs) have the highest expression of NOTCH4. Investigating the role of NOTCH4 in pDCs for CD and RA may help uncover the common disease mechanisms. Thus, our study provides a resource with several such examples of disease-associated immune cell types and possible gene and cellular candidates that can seed further studies. We further developed a web tool where the user can find the disease-associated cell types and the most interesting gene candidate for all the diseases available in the EBI GWAS catalogue. This may help biologists, immunologists and clinicians alike to identify the most interesting immune cell type for the disease of interest, which could expedite the understanding of disease mechanisms and subsequently help in identifying targets for therapy and treatment.
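
The two similarity measures named in the last paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: each disease is represented as a set of associated genes, the one-sided Fisher's exact test is computed as a hypergeometric tail probability, and the gene identifiers and universe size are placeholders. How the abstract's authors combined the two measures is not described, so they are simply reported side by side.

```python
from math import comb


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fisher_exact_greater(a, b, universe_size):
    """One-sided Fisher's exact test: P(overlap >= observed) under the hypergeometric null."""
    a, b = set(a), set(b)
    observed = len(a & b)
    largest_possible = min(len(a), len(b))
    # Sum the hypergeometric probabilities of every overlap at least as large as observed.
    tail = sum(comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
               for i in range(observed, largest_possible + 1))
    return tail / comb(universe_size, len(b))


# Placeholder gene sets for two hypothetical diseases (not real associations).
disease_1 = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
disease_2 = {"GENE_C", "GENE_D", "GENE_E"}
universe_size = 20000  # assumed number of genes that could have been associated

print("Jaccard index:", jaccard_index(disease_1, disease_2))
print("Fisher p-value:", fisher_exact_greater(disease_1, disease_2, universe_size))
```

The Jaccard index describes how large the shared fraction is, while the Fisher p-value indicates whether an overlap of that size would be surprising by chance given the number of genes considered.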


In silico identification of novel genetic factors associated with longevity in Drosophila

Authors: Bethany Hall, Jonathan J Crofts, Yvonne A Barnett and Nadia Chuzhanova

Abstract: To determine the genetic factors causing variation in survival into old age, several genome-wide association studies (GWAS) have been carried out on panels of long-lived individuals. Most studies tend to have little impact due to small sample sizes. It is for this reason that model organisms such as Drosophila melanogaster have become increasingly important in identifying genetic factors underlying longevity.

First, a network approach has been applied to predict novel genes/genomic regions/SNPs playing a role in longevity, by integrating three-dimensional (3D) chromosome interaction data (Sexton et al. 2012) and two GWAS summary statistic datasets (Burke et al. 2013; Ivanov et al. 2015). We hypothesise that the 3D architecture of the Drosophila genome dictates the co-location of specific genes/genomic regions, both those known to be associated with longevity and novel regions that may be important in longevity. Networks were created using genes/genomic regions known to be associated with longevity as original nodes, with additional nodes (regions) later added to these networks if they strongly interacted (co-localised) with original nodes. Various network measures were calculated, identifying important previously unknown regions. These previously unknown regions were further explored and longevity-associated genes were found, with some regions observed to be common between both GWAS datasets. Sub-networks of these networks were also explored, and for both analyses enrichment in Gene Ontology terms identified genes/regions with no previous association with longevity, enriched in longevity-related terms.

Second, SNPs residing within transcription factor binding sites (TFBSs) were analysed. TFBSs are DNA motifs recognised by transcription factors (TFs) that play a crucial role in controlling many important processes in the genome. Each TF typically recognises a collection of often dissimilar DNA motifs. Here we hypothesised that TFs may recognise a certain structure, e.g. non-B DNA structures, rather than sequence motifs. Structures such as slipped, cruciform, triplex and tetraplex DNA, formed on direct, inverted and mirrored repeats and G-quartets, were considered. Comparison of the frequencies of these repeats in TFBSs and in matching control groups of sequences has shown a significant enrichment in selected TFBSs for specific non-B DNA-forming sequences.

Meta-analysis of Microarray Data Using a Pathway-based Approach to Identify an Expression Signature for Adrenoleukodystrophy Compared with Alzheimer’s Disease

Authors: Yu jeoung Shim, Daeun Min, Junghyun Jung, Wonhee Jang

Adrenoleukodystrophy (ALD) is classified as a rare disease characterized by axonopathy and demyelination in the central nervous system and by adrenal insufficiency. Alzheimer’s disease (AD) presents with dementia. ALD and AD have many features in common: both are metabolic disorders with a genetic component, and the mechanisms of pathogenesis of both are still unclear. ALD causes a gradual decline in cognitive function and in hearing, sight, and athletic ability.

Method: We gathered previously reported gene expression datasets from ArrayExpress. Normalization and initial pre-processing were performed using the statistical programming language R. Using human white matter and induced pluripotent stem cell (iPSC) datasets, fast Gene-Set Enrichment Analysis was conducted to identify gene sets related to ALD.

Results: Overall, there were four datasets for human white matter and iPSCs, which were obtained from 43 patients and 61 healthy controls in ALD and AD.

Conclusions: In this study, we found putative markers for ALD and AD.