Investigating the source and function of most of the genome: endogenous retroelements and the proteins that bind them

Tim Hughes
University of Toronto

Banff International Research Station
"Statistical and Computational Challenges in Large Scale Molecular Biology"
March 27, 2017

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)

Genome: bought the book; hard to read
(Eric Lander's seven-word Nano lecture, 2003 Ig Nobel awards)

TTTTTAGTAGCAATTTGTACTGATGGTATGGGGCCAAGAGATATATCTTAGAGGGAGGGCTGAGGGTTTGAAGTCCAACTCCTAAGCCAGTGCCAGAA
GAGCCAAGGACAGGTACGGCTGTCATCACTTAGACCTCACCCTGTGGAGCCACACCCTAGGGTTGGCCAATCTACTCCCAGGAGCAGGGAGGGCAGGA
GCCAGGGCTGGGCATAAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGG
TGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCA
AGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTAT
TGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATG
GGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACT

Gene regulation occurs at many steps
Promoter definition
Enhancer definition
Chromatin remodelling and modification
DNA topology
PIC formation
Initiation
Capping
Elongation
Splicing
Cleavage and polyadenylation
Termination
Nuclear export
RNA localization
Translation
Degradation

Cells can "classify" elements based solely on sequence; we can too, right?
Many successes at individual problems

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important

Classifiers for promoter and polyadenylation site
Classifier outputs are emissions for HMM representing gene state
Model learns what constitutes "normal" genes, predicts "dark matter" transcripts, heterogeneous ends, etc.
Tested by MPRA and by "gene synthesis"
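
A minimal sketch of the idea that per-position classifier outputs can serve as emissions for an HMM over gene state, decoded with the Viterbi algorithm. Everything specific here is an assumption made for illustration and is not taken from the talk: the four state names, the transition probabilities in `TRANS`, and the way `emission` combines a promoter score and a polyadenylation-site score.

```python
import math

# Hypothetical gene states; the talk does not list the model's actual states.
STATES = ["intergenic", "tss", "gene_body", "cpa"]

# Hypothetical transition probabilities encoding the expected order
# intergenic -> TSS -> gene body -> cleavage/polyA site -> intergenic.
TRANS = {
    "intergenic": {"intergenic": 0.9, "tss": 0.1},
    "tss": {"gene_body": 1.0},
    "gene_body": {"gene_body": 0.9, "cpa": 0.1},
    "cpa": {"intergenic": 1.0},
}

NEG_INF = float("-inf")


def emission(state, p_promoter, p_polya, eps=1e-6):
    """Score the two classifier outputs at one position under a given state.

    Assumed form: a TSS state expects a high promoter score, a CpA state a
    high polyA score, and the other states expect both to be low.
    """
    p_promoter = min(max(p_promoter, eps), 1 - eps)  # keep logs finite
    p_polya = min(max(p_polya, eps), 1 - eps)
    if state == "tss":
        return p_promoter * (1 - p_polya)
    if state == "cpa":
        return p_polya * (1 - p_promoter)
    return (1 - p_promoter) * (1 - p_polya)


def viterbi(obs, start="intergenic"):
    """Return the most probable state path for a list of
    (promoter_score, polya_score) pairs, starting in `start`."""
    if not obs:
        return []
    score = {s: NEG_INF for s in STATES}
    score[start] = math.log(emission(start, *obs[0]))
    back = []
    for p_prom, p_pa in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev, best = None, NEG_INF
            for prev in STATES:
                t = TRANS[prev].get(s, 0.0)
                cand = score[prev] + (math.log(t) if t > 0 else NEG_INF)
                if cand > best:
                    best_prev, best = prev, cand
            new_score[s] = best + math.log(emission(s, p_prom, p_pa))
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy input: a strong promoter call at position 1, a strong polyA call at 4.
toy = [(0.01, 0.01), (0.99, 0.01), (0.01, 0.01),
       (0.01, 0.01), (0.01, 0.99), (0.01, 0.01)]
print(viterbi(toy))
```

The point of the design is that the transition structure carries the "relative positioning" constraint: a polyadenylation call only becomes a gene end if a promoter call precedes it in the path.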

Genomic transcript predictions
[Figure: strand-specific (+/-) tracks for Initiation, Elongation, Termination, Unified Model, TSS/Gene/CpA, RNA-seq, ORFs, Transcripts, and Other; labeled loci include a Ty element and gal1/gal10]

Growing evidence that human transcription is also sloppy

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important
3. Missing motifs for TFs

CIS-BP: Catalog of Inferred Sequence Binding Preferences
Tracks homology and predicts motifs across >300 genomes (Weirauch et al., Cell 2014 established thresholds for >40 DBD types)
Weirauch, Yang et al., Cell 2014
CisBP-RNA: Ray, Kazan, Cook, Weirauch, Najafabadi et al., Nature 2013

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins
A sampling of the ~710 human C2H2 proteins
KRAB domain (352):

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins
Increase of ~260 in the last 4 months!
Next ENCODE includes 159 C2H2 proteins
Yin et al. (Taipale lab, in press) has SELEX data for several dozen C2H2 proteins

A system for analysis of human C2H2 zinc finger DNA binding
Recognition code trained on B1H data
RCADE (Recognition Code-Assisted Discovery of regulatory Elements)
Najafabadi, Mnaimneh, Schmitges et al., Nature Biotech 2015
Najafabadi et al., Bioinformatics 2015
ChIP-seq with GFP-tagged inducible ORFs in HEK293 cells
RCADE motifs for 131 C2H2-ZF proteins

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important
3. Missing motifs for TFs
4. Mapping the domestication of retroelements and transposons, and the genesis and adaptive roles of the proteins that bind them

Human Endogenous Retroelement and Transposon catalog is incomplete
Proprietary "RepBase" is the gold standard
RepBase combines automated and manually defined models / consensus / who-knows-what
DFAM is an "open access" alternative, but only covers four species (Hs, Dr, Ce, Dm)
Many of the models are truncated or noncoding
Are there human TFs that evolved to silence elements that no longer exist in the human genome?

Most retroelements in genomes are truncated (figure from Imbeault et al., 2017)
Most LINE L1 models in DFAM are truncated
A "working" LINE L1 is 6-7 kb
[Figure: length of consensus (axis 0 to 6000) versus Kimura divergence (axis 0 to 70)]

Will ancestral genome reconstructions improve recovery of active "source" elements?
[Diagram labels: Original ERE; Individual copies in present day genomes]
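
For readers unfamiliar with the Kimura divergence named on the plot axis, here is a minimal sketch of the standard Kimura two-parameter (K2P) distance between a repeat copy and its consensus, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The function name and the toy sequences are hypothetical. Repeat-annotation pipelines often report a CpG-adjusted variant as a percentage; that adjustment is not shown here, and whether the plot uses it is not stated in the slides.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Positions where either sequence has a gap or a non-ACGT symbol are
    skipped. Returns substitutions per site (multiply by 100 for percent).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    term_a = 1 - 2 * p - q
    term_b = 1 - 2 * q
    if term_a <= 0 or term_b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term_a) - 0.25 * math.log(term_b)


# Toy example: one transition (A->G) in ten aligned positions.
consensus = "ACGTACGTAC"
copy = "GCGTACGTAC"
print(round(100 * kimura_2p(consensus, copy), 2))  # percent divergence
```

Older copies accumulate more substitutions relative to the consensus, so higher divergence is commonly read as a proxy for the age of a repeat family.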

Acknowledgements
Hughes lab: Hamed Najafabadi (McGill), Matt Weirauch (Cincinnati), Frank Schmitges, Marjan Barazandeh, Laura Campitelli, Ally Yang, Ernest Radovani, Mihai Albu, Hong Zheng, Debashish Ray, Sam Lambert, Tharsan Kanagalingam
Jack Greenblatt, Andrew Emili, Guoqing Zhong, Peter Young, Wei Feng Dai, Hua Tang, Hongbo Guo, Quaid Morris, Philip Kim
CIHR, NIH, CIFAR