Post on 09-Jun-2020
Genomic and epigenomic signatures for interpreting complex disease
Manolis Kellis
MIT Computer Science & Artificial Intelligence Laboratory Broad Institute of MIT and Harvard
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAACTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAATCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTGCACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCATGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATCCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAATAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCACCAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAGTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGGCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGAAATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGCGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGAT
CATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTACGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCATCCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATTAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACACAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCCACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT
Genes
Encode proteins
Regulatory motifs
Control gene expression
Building systems-level views of genomes and disease
Goal: A systems-level understanding of genomes and gene regulation:
• The regulators: Transcription factors, microRNAs, sequence specificities
• The regions: enhancers, promoters, and their tissue-specificity
• The targets: TFs → targets, regulators → enhancers, enhancers → genes
• The grammars: Interplay of multiple TFs → prediction of gene expression
The parts list = Building blocks of gene regulatory networks
Our tools: Comparative genomics & large-scale experimental datasets.
• Evolutionary signatures for coding/non-coding genes, microRNAs, motifs
• Chromatin signatures for regulatory regions and their tissue specificity
• Activity signatures for linking regulators → enhancers → target genes
• Predictive models for gene function, gene expression, chromatin state
Integrative models = Define roles in development, health, disease
Challenge: interpreting disease-associated variants
• GWAS, case-control,… reveal disease-associated variants Molecular mechanism, cell-type specificity, drug targets
• Challenges towards interpreting disease variants
– Find ‘true’ causative SNP among many candidates in LD
– Use ‘causal’ variant: predict function, pathway, drug targets
– Non-coding variant: type of function, cell type of activity
– Regulatory variant: upstream regulators, downstream targets
• This talk: genomics tools for addressing these challenges
CATGACTG CATGCCTG
Disease-associated variant (SNP/CNV/…)
Gene annotation (Coding, 5’/3’UTR, RNAs) Evolutionary signatures
Non-coding annotation Chromatin signatures
Roles in gene/chromatin regulation Activator/repressor signatures
Other evidence of function Signatures of selection (sp/pop)
Recombination breakpoints
Family Inheritance
Me vs. my brother
My dad Dad’s mom Mom’s dad
Human ancestry
Disease risk
Genomics: Regions → mechanisms → drugs
Systems: genes → combinations → pathways
Goal: Towards personal systems genomics
Systems-level views of disease epigenomics
• Evolutionary signatures for gene/genome annotation
– High-resolution annotation: genes, RNAs, motif instances
– Measuring selection within the human population
• Chromatin states for interpreting disease association
– Annotate dynamic regulatory elements in multiple cell types
– Activity-based linking of regulators → enhancers → targets
• Interpreting disease-associated sequence variants
– Mechanistic predictions for individual top-scoring SNPs
– Functional roles of 1000s of disease-associated SNPs
• Systematic manipulation of 2000+ human enhancers
– Test effect of single-motif and single-nucleotide disruptions
– Role of activator/repressor motifs, disease-associated SNPs
• Personal genomes/epigenomes in health and disease
– Allele-specific activity. Alzheimer’s brain methylation, SNP
– Global repression of distal enhancers. NRSF, ELK1, CTCF
Large-scale comparative genomics datasets 29 mammals 17 fungi 12 flies
8 Candida
9 Yeasts
[Figure: species trees; fungal clades labeled Post-duplication, Pre-dup, Diploid, Haploid; individual species marked P or N]
Kellis Nature 2003 Nature 2004; Stark Nature 2007; Clark Nature 2007; Butler Nature 2009; Lindblad-Toh Nature 2011
Comparative genomics and evolutionary signatures
• Comparative genomics can reveal functional elements – For example: exons are deeply conserved to mouse, chicken, fish – Many other elements are also strongly conserved: exons / regulatory?
• Can we also pinpoint specific functions of each region? Yes!
– Patterns of change distinguish different types of functional elements
– Specific function → Selective pressures → Patterns of mutation/insertion/deletion
• Develop evolutionary signatures characteristic of each function
Kellis Nature 2003, Nature 2004; Stark Nature 2007; Clark Nature 2007; Butler Nature 2009; Lindblad-Toh Nature 2011
Evolutionary signatures for diverse functions
Protein-coding genes - Codon Substitution Frequencies - Reading Frame Conservation
RNA structures - Compensatory changes - Silent G-U substitutions
microRNAs - Shape of conservation profile - Structural features: loops, pairs - Relationship with 3’UTR motifs
Regulatory motifs - Mutations preserve consensus - Increased Branch Length Score - Genome-wide conservation
Stark et al, Nature 2007
Implications for genome annotation / regulation
Novel protein-coding genes Revised gene annotations Unusual gene structures
Novel structural families Targeting, editing, stability Riboswitches in mammals
Novel/expanded miR families miR/miR* arm cooperation Sense/anti-sense miR switches
Novel regulatory motifs Regulatory motif instances TF/miRNA regulatory networks Single binding site resolution
Stark et al, Nature 2007
Translational read-through in human & fly
Protein-coding conservation
Continued protein-coding conservation
No more conservation
Stop codon read through
2nd stop codon
Jungreis, Genome Research 2011
Overlapping selection in human exons
Reveal splicing signals, RNA structures, enhancer motifs, dual-coding genes
Synonymous substitution rate
Lin, Genome Research 2011
RNA structure families: ortholog/paralog conservation
Ex: MAT2A S-adenosylmethionine level detection. Structure/loop sequence deep conservation
Parker Gen. Res. 2011
Regions of codon-level positive selection
Distributed vs. localized positive selection Immunity/taste vs. retinal/bone/secretion
distributed
localized
Lindblad-Toh Nature 2011
Measuring constraint at individual nucleotides
• Reveal individual transcription factor binding sites • Within motif instances reveal position-specific bias • More species: motif consensus directly revealed
NRSF motif
Detect SNPs that disrupt conserved regulatory motifs
• Functionally-associated SNPs enriched in states, constraint • Prioritize candidates, increase resolution, disrupted motifs
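As a rough illustration of how a SNP could be scored for motif disruption, the sketch below compares the log-odds score of the two alleles shown earlier in the talk (CATGACTG vs. CATGCCTG) under a position weight matrix. The matrix `TOY_PWM`, its 0.85/0.05 probabilities, and the uniform background are invented for illustration and do not represent any real motif or the scoring used in the talk.

```python
import math

# Hypothetical motif: consensus base gets probability 0.85, others 0.05 each.
# These values are illustrative assumptions, not a measured motif.
CONSENSUS = "CATGACTG"
TOY_PWM = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in CONSENSUS]
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds(seq, pwm=TOY_PWM):
    """Sum of per-position log2(motif probability / background probability)."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif width")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, seq))


def motif_disruption(ref, alt, pwm=TOY_PWM):
    """Positive values mean the alternate allele weakens the motif match."""
    return log_odds(ref, pwm) - log_odds(alt, pwm)


delta = motif_disruption("CATGACTG", "CATGCCTG")
```

A larger positive `delta` would suggest a stronger predicted disruption; in practice such a score would be combined with conservation and chromatin-state evidence, as the bullets above describe.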
Measuring selection within the human lineage
Human constraint outside conserved regions
• Non-conserved regions:
– ENCODE-active regions show reduced diversity
– Lineage-specific constraint in biochemically-active regions
• Conserved regions:
– Non-ENCODE regions show increased diversity
– Loss of constraint in human when biochemically-inactive
Average diversity (heterozygosity) Aggregate over the genome
Active regions
Strongest: motifs, short RNA, DNase, ChIP, lncRNA
• Significant derived allele depletion in active features
Bound motifs show increased human constraint
Position-specific reduction in bound motif heterozygosity Aggregate across thousands of CTCF motif instances
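The aggregation described above can be sketched as follows, assuming each motif instance is represented by a list of alternate-allele frequencies (one per motif position, 0.0 where no variant is observed). The input layout and function names are hypothetical; heterozygosity is taken as the standard 2p(1-p) for a biallelic site.

```python
def heterozygosity(alt_freq):
    """Expected heterozygosity 2p(1-p) of a biallelic site with allele frequency p."""
    return 2.0 * alt_freq * (1.0 - alt_freq)


def positional_heterozygosity(instances):
    """Average heterozygosity at each motif position across all instances.

    instances: list of equal-length lists of alternate-allele frequencies,
    one inner list per aligned motif instance (assumed input format).
    """
    if not instances:
        return []
    width = len(instances[0])
    if any(len(inst) != width for inst in instances):
        raise ValueError("all motif instances must have the same width")
    return [
        sum(heterozygosity(inst[pos]) for inst in instances) / len(instances)
        for pos in range(width)
    ]


# Toy example: position 0 is variable, position 1 is invariant.
profile = positional_heterozygosity([[0.5, 0.0], [0.5, 0.0]])
```

Positions with lower average heterozygosity than their flanks would be read as more constrained in the human population.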
Most constrained human-specific enhancer functions
Regulatory genes: Transcription, Chromatin, Signaling. Developmental enhancers: embryo, nerve growth
Transcription initiation from Pol2 promoter Transcription coactivator activity
Transcription factor binding Chromatin binding
Negative regulation of transcription, DNA-dependent Transcription factor complex
Protein complex Protein kinase activity
Nerve growth factor receptor signaling pathway Signal transducer activity
Protein serine/threonine kinase activity Negative regulation of transcription from Pol2 promoter
Protein tyrosine kinase activity In utero embryonic development
Chromatin signatures for genome annotation
Ernst et al Nature Biotech 2010
See also: Amos Tanay, Bill Noble.
2. Histone modifications
3. DNA accessibility
1. DNA methylation
Epigenomic maps
ENCODE: Study nine marks in nine human cell lines 9 human cell types 9 marks
H3K4me1
H3K4me2
H3K4me3
H3K27ac
H3K9ac
H3K27me3
H4K20me1
H3K36me3
CTCF
+WCE
+RNA
HUVEC Umbilical vein endothelial
NHEK Keratinocytes
GM12878 Lymphoblastoid
K562 Myelogenous leukemia
HepG2 Liver carcinoma
NHLF Normal human lung fibroblast
HMEC Mammary epithelial cell
HSMM Skeletal muscle myoblasts
H1 Embryonic
9 marks × 9 cell types = 81 Chromatin Mark Tracks (2^81 combinations)
Ernst et al, Nature 2011
• Learned jointly across cell types (virtual concatenation)
• State definitions are common
• State locations are dynamic
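A minimal sketch of the data layout implied by "virtual concatenation", assuming each cell type contributes a list of genomic bins and each bin is a tuple of 0/1 calls, one per mark. With 81 binary tracks there are 2^81 possible presence/absence patterns per position, which is why a small set of shared states is learned instead. The function name and input format are hypothetical, and the state-learning step itself is not shown.

```python
MARKS = ["H3K4me1", "H3K4me2", "H3K4me3", "H3K27ac", "H3K9ac",
         "H3K27me3", "H4K20me1", "H3K36me3", "CTCF"]
CELL_TYPES = ["HUVEC", "NHEK", "GM12878", "K562", "HepG2",
              "NHLF", "HMEC", "HSMM", "H1"]

N_TRACKS = len(MARKS) * len(CELL_TYPES)  # 9 marks x 9 cell types
N_COMBINATIONS = 2 ** N_TRACKS           # presence/absence patterns per position


def virtual_concatenation(binarized):
    """Stack the bins of every cell type end to end.

    binarized: dict mapping cell type -> list of per-bin tuples (one 0/1 per mark).
    Returns (rows, origin): rows share one mark alphabet, so a single model
    yields common state definitions; origin remembers (cell type, bin index),
    so state locations can still differ between cell types.
    """
    rows, origin = [], []
    for cell in CELL_TYPES:
        for bin_index, vector in enumerate(binarized.get(cell, [])):
            if len(vector) != len(MARKS):
                raise ValueError("each bin needs one value per mark")
            rows.append(tuple(vector))
            origin.append((cell, bin_index))
    return rows, origin


toy = {"GM12878": [(1,) * 9, (0,) * 9], "K562": [(0,) * 9, (1,) * 9]}
rows, origin = virtual_concatenation(toy)
```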
Brad Bernstein ENCODE Chromatin Group
Chromatin states dynamics across nine cell types
• Single annotation track for each cell type • Summarize cell-type activity at a glance • Can study 9-cell activity pattern across
Predicted linking
Correlated activity
Link enhancers to target genes
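One way to picture linking by correlated activity is a Pearson correlation between an enhancer's activity profile and a gene's expression profile across the nine cell types. This is a sketch under stated assumptions: the threshold `min_r`, the input dictionaries, and the omission of any distance restriction are illustrative choices, not the published procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def link_enhancers(enhancer_activity, gene_expression, min_r=0.8):
    """Return (enhancer, gene, r) for pairs with correlation >= min_r, best first."""
    links = []
    for enhancer, activity in enhancer_activity.items():
        for gene, expression in gene_expression.items():
            r = pearson(activity, expression)
            if r >= min_r:
                links.append((enhancer, gene, r))
    return sorted(links, key=lambda link: -link[2])


# Toy profiles over nine cell types (values are made up).
enhancers = {"enh1": [1, 0, 1, 0, 0, 0, 0, 0, 1]}
genes = {"geneA": [5, 0, 5, 0, 0, 0, 0, 0, 5],
         "geneB": [0, 3, 0, 3, 3, 3, 3, 3, 0]}
links = link_enhancers(enhancers, genes)
```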
Introducing multi-cell activity profiles
HUVEC NHEK GM12878 K562 HepG2 NHLF HMEC HSMM H1
Gene expression
Chromatin States
Active TF motif enrichment
ON OFF
Active enhancer Repressed
Motif enrichment Motif depletion
TF regulator expression
TF On TF Off
Dip-aligned motif biases
Motif aligned Flat profile
Enhancer-gene links supported by eQTL-gene links
-1.4 3.2 4.4 -1.8 1.1 3.1 -1.8 -1.5 -0.5
Indiv. 1 Indiv. 2 Indiv. 3 Indiv. 4 Indiv. 5 Indiv. 6 Indiv. 7 Indiv. 8 Indiv. 9 …
Sequence variant at distal position: A A A C A A A C C
Example: Lymphoblastoid (GM) cells study
• Expression/genotype across 60 individuals (Montgomery et al, Nature 2010)
• 120 eQTLs are eligible for enhancer-gene linking based on our datasets
• 51 actually linked (43%) using predictions → 4-fold enrichment (10% exp. by chance)
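The arithmetic behind these figures can be checked directly: 51 of 120 eligible eQTLs is about 43%, and against a 10% chance expectation that is roughly a 4-fold enrichment. The binomial tail probability below is an added illustration of how unlikely such an overlap would be by chance; the slides do not state which test, if any, was used.

```python
from math import comb


def fold_enrichment(n_linked, n_eligible, expected_fraction):
    """Observed fraction of linked eQTLs divided by the fraction expected by chance."""
    return (n_linked / n_eligible) / expected_fraction


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); an assumed null model, not from the talk."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


observed_fraction = 51 / 120              # about 0.43
fold = fold_enrichment(51, 120, 0.10)     # about 4.25
p_value = binomial_tail(51, 120, 0.10)    # chance of >= 51 links if each had a 10% chance
```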
Individuals
… …
Expression level of gene
15kb
• Independent validation of links.
• Relevance to disease datasets.
Validation rationale:
• Expression Quantitative Trait Loci (eQTLs) provide independent SNP-to-gene links
• Do they agree with activity-based links?
eQTL study
Visualizing 10,000s predicted enhancer-gene links
• Overlapping regulatory units, both few and many • Both upstream and downstream elements linked • Enhancers correlate with sequence constraint
Link TFs to target enhancers Predict activators vs. repressors
Ex1: Oct4 predicted activator of embryonic stem (ES) cells
Ex2: Gfi1 repressor of K562/GM cells
Coordinated activity reveals activators/repressors
• Enhancer networks: Regulator → enhancer → target gene
Activity signatures for each TF Enhancer activity
Causal motifs supported by dips & enhancer assays
Dip evidence of TF binding (nucleosome displacement)
Enhancer activity halved by single-motif disruption
Motifs bound by TF, contribute to enhancers
Tarjei Mikkelsen
Predicted causal HNF motifs (that also showed dips) in HepG2 enhancers
Genotype Disease GWAS
Interpret variants using Epigenomics - Chromatin states: Enhancers, promoters, motifs
- Enrichment in individual loci, across 1000s of SNPs in T1D
Interpreting disease-association signals
CATGACTG CATGCCTG
Epigenome changes in disease
• Disease-associated SNPs enriched for enhancers in relevant cell types • E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator
Revisiting disease-associated variants
Mechanistic predictions for top disease-associated SNPs
Lupus erythematosus in GM lymphoblastoid cells:
Disrupt activator Ets-1 motif → Loss of GM-specific activation → Loss of enhancer function → Loss of HLA-DRB1 expression
Erythrocyte phenotypes in K562 leukemia cells:
Creation of repressor Gfi1 motif → Gain of K562-specific repression → Loss of enhancer function → Loss of CCDC162 expression
Allele-specific chromatin marks: cis-vs-trans effects
• Maternal and paternal GM12878 genomes sequenced
• Map reads to phased genome, handle SNPs and indels
• Correlate activity changes with sequence differences
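Once reads are assigned to the maternal or paternal haplotype, allele-specific activity at a site can be assessed by asking whether the read counts depart from a 50/50 split. The sketch below uses a two-sided exact binomial test as one plausible choice; the test and the function name are assumptions for illustration, not the method used in the talk.

```python
from math import comb


def allelic_imbalance_pvalue(maternal_reads, paternal_reads):
    """Two-sided exact binomial test against an expected 50/50 allelic ratio."""
    n = maternal_reads + paternal_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    k = min(maternal_reads, paternal_reads)
    # Probability of seeing k or fewer reads on the minor allele when p = 0.5.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2.0 * lower_tail)


balanced = allelic_imbalance_pvalue(5, 5)
skewed = allelic_imbalance_pvalue(10, 0)
```

A real analysis would also need to correct for mapping bias toward the reference allele and for multiple testing across sites.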
HaploReg: systematic ENCODE mining of variants (compbio.mit.edu/HaploReg)
• Start with any list of SNPs or select a GWA study
– Mine publicly available ENCODE data for significant hits
– Hundreds of assays, dozens of cells, conservation, motifs
– Report significant overlaps and link to info/browser
Functional enrichment for 1000s of SNPs
Full T1D association spectrum 1000s of causal SNPs
GM12878 Lymphoblastoid
K562 Myelogenous leukemia
• Rank all SNPs by P-value
• Find chromatin states with enrichment in high ranks
• Signal spans 1000s of SNPs
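The ranking procedure in the bullets above can be sketched as a running enrichment curve: sort SNPs by association P-value and, at each rank, compare the fraction of SNPs so far that fall in a given chromatin state with the overall fraction. The input format (a list of `(p_value, in_state)` pairs) and the function name are hypothetical.

```python
def enrichment_curve(snps):
    """Fold enrichment of a chromatin state among the top-r SNPs, for every rank r.

    snps: list of (p_value, in_state) pairs, where in_state is True if the SNP
    overlaps the chromatin state of interest (assumed input format).
    """
    ranked = sorted(snps, key=lambda snp: snp[0])  # most significant first
    total = len(ranked)
    in_state_total = sum(1 for _, in_state in ranked if in_state)
    if total == 0 or in_state_total == 0:
        return []
    baseline = in_state_total / total
    curve, hits = [], 0
    for rank, (_, in_state) in enumerate(ranked, start=1):
        hits += 1 if in_state else 0
        curve.append((hits / rank) / baseline)
    return curve


# Toy data: the two most significant SNPs fall in the state, the others do not.
toy_snps = [(0.5, False), (1e-8, True), (0.01, False), (1e-6, True)]
curve = enrichment_curve(toy_snps)
```

A curve that stays above 1 over thousands of ranks would correspond to the "signal spans 1000s of SNPs" observation.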
GM12878 enhancer enrichment now seen
Cell type specific: GM and K562 enhancers Chromatin state specific: Enhancers/promoters
Could bias in array design contribute to these enrichments? Evaluate all 1000 genomes SNPs by imputing those in LD
Imputing SNPs in LD → stronger cell/state separation
• Excess of 30,000 SNPs → 2049 enhancers (excess 392)
• Mostly found in independent loci (1730 with R^2<0.2)
Systematically measure their regulatory contributions
Enhancers across cell types Chromatin states in GM12878
Enhancers: 2049 (excess 392) 1940 distinct loci (R^2<.8)
Promoters: 462 (excess 81)
Transcribed: 4740 (excess 522)
Repressed: 1351 (excess 76)
Insulator: 240 (excess 23)
Other: 21k (deplete 1093)
High-throughput experiments: 10,000s enhancers
• Experiment features:
– Multiplexed enhancer assays
– 10,000s of elements
– Each w/ unique barcode
– Multiple human cell types
– Repeat experiments on same array / diff barcodes
• Applied to:
– Test enhancer offsets
– Test causal motifs
• With: Tarjei Mikkelsen
– Broad Institute, ARRA funds
– See also: Barak Cohen, Jay Shendure, Eran Segal
Melnikov, Nature Biotech 2012
Systematic motif disruption for 5 activators and 2 repressors in 2 human cell lines
54000+ measurements (x2 cells, 2x repl)
Example activator: conserved HNF4 motif match
WT expression specific to HepG2
Non-disruptive changes maintain expression
Motif match disruptions reduce expression to background
Random changes depend on effect on motif match
Results hold across 2000+ enhancers
• Scramble abolishes reporter expression
• Neutral mutations show no change
• Mutations that increase the motif match show more expression
• However, only 40% show wild-type expression: context?
Features of functional wildtype enhancers
• Nucleosome exclusion, motif conservation, other TFs
• Each of these features is encoded in primary sequence
Repressors of HepG2 enhancer act in K562
Repressor disruption → aberrant expression in opposite cell types
Testing effect of SNP change in enhancer constructs
• SNPs in enhancer regions can lead to expression changes in downstream reporter genes
• Currently testing all T1D-associated enhancer SNPs
Genotype Disease GWAS
(1) Interpret variants using Epigenomics - Chromatin states: Enhancers, promoters, motifs
- Enrichment in individual loci, across 1000s of SNPs in T1D
Interpreting disease-association signals
CATGACTG CATGCCTG
(2) Epigenome changes in disease - Intermediate molecular phenotypes associated with disease
- Variation in brain methylomes of Alzheimer’s patients
mQTLs MWAS Epigenome
Phil de Jager: Methylation in 750 Alzheimer patients
500,000 methylation probes, 750 individuals
• Patients followed for 10+ years with cognitive evaluations
• Brain samples donated post-mortem → methylation/genotype
• Seek predictive features: SNPs, QTLs, mQTLs, regulation
Phil de Jager, Roadmap disease epigenomics
Brad Bernstein REMC mapping
[Diagram: Genome, Epigenome, Phenotype; steps 1 and 2: meQTL, MWAS, Classification]
2,500 mQTLs for neighboring SNPs at 10^-14
• Overlay Manhattan plots of 450,000 methylation probes
• Cutoff of 10^-14 (10^-8 after Bonferroni correction)
• Use to pinpoint disrupted motifs, predict epigenome
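For reference, a Bonferroni correction multiplies each raw P-value by the number of tests performed. Under that standard definition, a raw cutoff of 10^-14 that corresponds to 10^-8 after correction implies on the order of 10^6 tests. The sketch below only works through that arithmetic; the actual number of probe-SNP tests in the study is not given on the slide, so the implied count is an inference, not a reported figure.

```python
import math


def bonferroni_adjust(raw_p, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, raw_p * n_tests)


def implied_number_of_tests(raw_cutoff, corrected_cutoff):
    """Number of tests that would map raw_cutoff onto corrected_cutoff."""
    return corrected_cutoff / raw_cutoff


n_tests = implied_number_of_tests(1e-14, 1e-8)   # about one million
adjusted = bonferroni_adjust(1e-14, 10**6)       # about 1e-8
```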
[Figure axes: chromosome and genomic position; P-value exponent (-log10 P); distance from CpG (Mb), -1 to 1]
Focusing on 2831 most variable probes
[Figure: probe intensity distribution vs. inter-individual variability]
• Hemi-methylated probes are also the most variable
• Tiny fraction (0.6%) of all probes
• Promoters: Stable low (active)
• Gene bodies: Stable high (active)
• Enhancers/poised: Most variable
138,731
184 2,647
Multimodal probes (~3K)
SNP-associated probes (29% of all)
1 Active promoter
2 Promoter flanking
3 Active enhancer
4 Weak enhancer
5 Gene bodies
6 Active gene bodies
7 Repetitive
8 Heterochromatin
9 Low signal
% of CpG probes
Multimodal / SNP-associated / Promoter-depleted
• SNP-associated probes depleted in promoters (driven epigenetically>genetically, open chrom)
SNP-associated All probes
• 93.5% of multimodal probes are SNP-associated
• Importance of distinguishing contribution of genotype to disease associations
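The percentages on this slide can be reproduced from the three counts shown in the figure, assuming they are the regions of a two-set Venn diagram: 138,731 probes that are SNP-associated only, 184 that are multimodal only, and 2,647 that are both. That reading is an inference from the extracted labels, and the total of 480,000 probes is taken from the later Alzheimer's ranking slide.

```python
# Assumed Venn regions (inferred from the figure labels, not stated explicitly).
SNP_ONLY = 138_731
MULTIMODAL_ONLY = 184
BOTH = 2_647
TOTAL_PROBES = 480_000  # assumption: total quoted on the probe-ranking slide

multimodal = MULTIMODAL_ONLY + BOTH       # most variable probes
snp_associated = SNP_ONLY + BOTH

pct_multimodal_snp = 100 * BOTH / multimodal             # share of multimodal probes tied to a SNP
pct_multimodal_of_all = 100 * multimodal / TOTAL_PROBES  # "tiny fraction" of all probes
pct_snp_of_all = 100 * snp_associated / TOTAL_PROBES     # SNP-associated share of all probes
```

Under these assumptions the three values round to 93.5%, 0.6%, and 29%, matching the figures quoted on the slides.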
Global hyper-methylation trend in AD-associated probes
[Figure: 480,000 probes, ranked by Alzheimer’s association; tracks: P-value and methylation (Alzheimer’s vs. Normal); top 7000 probes highlighted]
Hypomethylated probes (active)
Hypermethylated probes (repressed)
Alzheimer’s-associated probes are hypermethylated
• Global effect across 1000s of probes – Rank all probes by Alzheimer’s association – Observe functional changes down ranklist – 7000 probes show shift in methylation
Complex disease: genome-wide effects, 1000s of loci
Chromatin state breakdown reveals ↓ activity
* => fisher exact test, p-value <= 0.001
[Figure: % probes per chromatin state: 1 Active promoter, 2 Promoter flanking, 3 Active enhancer, 4 Weak enhancer, 5 Gene bodies, 6 Active gene bodies, 7 Repetitive, 8 Heterochromatin, 9 Low signal]
Red: More methylated in Alzheimer’s
Blue: Less methylated in Alzheimer’s
Significant probes are in enhancers, not promoters
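The per-state significance stars in this breakdown come from a Fisher exact test. A minimal one-sided version for a 2×2 table (for example, significant vs. other probes, inside vs. outside a chromatin state) can be written with the hypergeometric distribution. The table layout and function name are illustrative assumptions; the slides do not say whether a one- or two-sided test was used.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(top-left cell >= a) with all row and column totals held fixed,
    i.e. the probability of at least this much over-representation.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, x) * comb(total - row1, col1 - x) for x in range(a, upper + 1)
    )
    return numerator / comb(total, col1)


# Toy table: 3 of 4 significant probes in the state vs. 1 of 4 other probes.
p_toy = fisher_exact_greater(3, 1, 1, 3)
```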
Alzheimer’s prediction vs. likely biological pathways
Predictive power: 6k probes + APOE
Regulatory motifs associated with Alzheimer-associated probes suggest potential pathways
CTCF
NRSF
ELK1
We have not solved Alzheimer’s, but new insights gained
All probes, ranked by AD assoc. P-value
Goal: A systems-level understanding of genomes and gene regulation:
• The regulators: Transcription factors, microRNAs, sequence specificities
• The regions: enhancers, promoters, and their tissue-specificity
• The targets: TFs → targets, regulators → enhancers, enhancers → genes
• The grammars: Interplay of multiple TFs → prediction of gene expression
The parts list = Building blocks of gene regulatory networks
CATGACTG CATGCCTG
Disease-associated variant (SNP/CNV/…)
Gene annotation (Coding, 5’/3’UTR, RNAs) Evolutionary signatures
Non-coding annotation Chromatin signatures
Roles in gene/chromatin regulation Activator/repressor signatures
Other evidence of function Signatures of selection (sp/pop)
Understanding human variation and human disease
• Challenge: from loci to mechanism, pathways, drug targets
Collaborators and Acknowledgements
• ENCODE
– Brad Bernstein, Tarjei Mikkelsen, Noam Shoresh, David Epstein
• Massively parallel enhancer reporter assays
– Tarjei Mikkelsen, Broad Institute
• Epigenome Roadmap
– Bing Ren, Brad Bernstein, John Stam, Joe Costello
• 2X mammals
– Kerstin Lindblad-Toh, Eric Lander, Manuel Garber, Or Zuk
• Funding
– NHGRI, NIH, NSF, Sloan Foundation
MIT Computational Biology group Compbio.mit.edu
Daniel Marbach, Mike Lin, Jason Ernst, Jessica Wu, Rachel Sealfon, Pouya Kheradpour, Manolis Kellis (#187), Chris Bristow, Loyal Goff, Irwin Jungreis, Sushmita Roy, Luke Ward (#331), Louisa DiStefano, Dave Hendrix, Angela Yen, Ben Holmes, Soheil Feizi, Mukul Bansal, Bob Altshuler (#19), Stefan Washietl, Matt Eaton
Stata4 Stata3