Agilent bake-off: oligonucleotide layers per gene

Investigating the source and function of most of the genome: endogenous retroelements and the proteins that bind them

Tim Hughes, University of Toronto
Banff International Research Station, "Statistical and Computational Challenges in Large Scale Molecular Biology"
March 27, 2017

Challenges

1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)

Genome: bought the book; hard to read (Eric Lander's seven-word Nano lecture, 2003 Ig Nobel awards)

TTTTTAGTAGCAATTTGTACTGATGGTATGGGGCCAAGAGATATATCTTAGAGGGAGGGCTGAGGGTTTGAAGTCCAACTCCTAAGCCAGTGCCAGAA GAGCCAAGGACAGGTACGGCTGTCATCACTTAGACCTCACCCTGTGGAGCCACACCCTAGGGTTGGCCAATCTACTCCCAGGAGCAGGGAGGGCAGGA GCCAGGGCTGGGCATAAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGG TGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCA AGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTAT TGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATG GGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACT

GAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACGCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGT TAAGTTCATGTCATAGGAAGGGGATAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCTCAGGATCGTTTT AGTTTCTTTTATTTGCTGTTCATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCTTTTTTTTTCTTCTCCGCAATTTTTACTATTATACTTAATG CCTTAACATTGTGTATAACAAAAGGAAATATCTCTGAGATACATTAAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTGGA ATATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGT GTAATGTTTTAATATGTGTACACATATTGACCAAATCAGGGTAATTTTGCATTTGTAATTTTAAAAAATGCTTTCTTCTTTTAATATACTTTTTTGTT TATCTTATTTCTAATACTTTCCCTAATCTCTTTCTTTCAGGGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGAT AATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGCT ACAATCCAGCTACCATTCTGCTTTTATTTTATGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTC TTATCTTCCTCCCACAGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAG AAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAA GTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCAATGATGTATTTAAAT TATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCAAACCTTGGGAAAATACACTA TATCTTAAACTCCATGAAAGAAGGTGAGGCTGCAAACAGCTAATGCACATTGGCAACAGCCCCTGATGCATATGCCTTATTCATCCCTCAGAAAAGGA

Gene regulation occurs at many steps

  • Promoter definition
  • Enhancer definition
  • Chromatin remodelling and modification
  • DNA topology
  • PIC formation
  • Initiation
  • Capping
  • Elongation
  • Splicing
  • Cleavage and polyadenylation
  • Termination
  • Nuclear export
  • RNA localization
  • Translation
  • Degradation

Cells can "classify" elements based solely on sequence; we can too, right?

Many successes at individual problems

Challenges

1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important

  • Classifiers for promoter and polyadenylation site
  • Classifier outputs are emissions for HMM representing gene state
  • Model learns what constitutes "normal" genes, predicts "dark matter" transcripts, heterogeneous ends, etc.
  • Tested by MPRA and by "gene synthesis"
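The idea of using classifier outputs as HMM emissions can be sketched as follows. This is a minimal illustration, not the model from the talk: the four states, the transition probabilities, and the way the two classifier scores are combined into an emission likelihood are all assumptions made up for the example. The input is one (promoter score, polyadenylation score) pair per genomic position, and Viterbi decoding returns the most likely gene state at each position.

```python
import math

# Hypothetical gene states and transition probabilities (assumed, not from the talk).
STATES = ["intergenic", "tss", "gene_body", "cpa"]
START = "intergenic"
TRANS = {
    "intergenic": {"intergenic": 0.9, "tss": 0.1},
    "tss": {"gene_body": 1.0},
    "gene_body": {"gene_body": 0.9, "cpa": 0.1},
    "cpa": {"intergenic": 1.0},
}


def emission(state, p_tss, p_cpa):
    """Assumed emission model: treat the two classifier scores as independent."""
    if state == "tss":
        return p_tss * (1 - p_cpa)
    if state == "cpa":
        return p_cpa * (1 - p_tss)
    return (1 - p_tss) * (1 - p_cpa)


def _log(x):
    return math.log(x) if x > 0 else float("-inf")


def viterbi(scores):
    """Return the most likely state path for a list of (p_tss, p_cpa) pairs."""
    best = {
        s: _log(1.0 if s == START else 0.0) + _log(emission(s, *scores[0]))
        for s in STATES
    }
    back = []
    for p_tss, p_cpa in scores[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            score, prev = max(
                (best[p] + _log(TRANS[p].get(s, 0.0)), p) for p in STATES
            )
            new_best[s] = score + _log(emission(s, p_tss, p_cpa))
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy input: a promoter-like peak at position 1 and a poly(A)-like peak at position 4.
scores = [(0.02, 0.02), (0.95, 0.02), (0.02, 0.02),
          (0.02, 0.02), (0.02, 0.95), (0.02, 0.02)]
print(viterbi(scores))
```

Because the transition table only allows the order intergenic, TSS, gene body, CpA, the decoder enforces the relative positioning of elements rather than calling each site independently.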

[Figure: Genomic transcript predictions. Track labels: Initiation+, Elongation+, Termination+, Unified Model+, TSS/Gene/CpA, RNASeq+, ORFs+, Other+, Transcripts+, Transcripts, ORFs-. Annotated feature: Ty element.]

[Figure: gal1 / gal10. Track labels: RNASeq, Unified Model, TSS/Gene/CpA, Initiation, Elongation, Termination-.]

Growing evidence that human transcription is also sloppy

Challenges

1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important

3. Missing motifs for TFs

CIS-BP: Catalog of Inferred Sequence Binding Preferences

  • Tracks homology and predicts motifs across >300 genomes (Weirauch et al., Cell 2014 established thresholds for >40 DBD types)
  • Weirauch, Yang et al., Cell 2014
  • CisBP-RNA: Ray, Kazan, Cook, Weirauch, Najafabadi et al., Nature 2013

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins

A sampling of the ~710 human C2H2 proteins

  • KRAB domain (352): ZNF554, ZNF670, ZNF454, ZNF136, ZNF705A, ZNF460, ZNF667, ZNF514, ZNF45, ZNF528
  • SCAN domain (52): ZSCAN22, MZF1
  • BTB domain (50): ZBTB12, ZBTB48
  • SET domain (11): PRDM5
  • C2H2 only: ZNF271, CTCF, ZNF384, ZNF628, YY1

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins

  • Increase of ~260 in the last 4 months!
  • Next ENCODE includes 159 C2H2 proteins
  • Yin et al. (Taipale lab, in press) have SELEX data for several dozen C2H2 proteins

A system for analysis of human C2H2 zinc finger DNA binding

  • Recognition code trained on B1H data
  • RCADE (Recognition Code-Assisted Discovery of regulatory Elements)
  • Najafabadi, Mnaimneh, Schmitges et al., Nature Biotech 2015
  • Najafabadi et al., Bioinformatics 2015
  • ChIP-seq with GFP-tagged inducible ORFs in HEK293 cells
  • RCADE motifs for 131 C2H2-ZF proteins
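Once a motif is available for a TF, a common way to use it is as a position weight matrix (PWM) scanned along a sequence. The sketch below shows that generic step only; it is not RCADE and does not reproduce any motif from the talk. The count matrix `COUNTS`, the pseudocount, and the uniform background are hypothetical values chosen for illustration, and only the forward strand is scanned.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# Consensus is ACGT; this is NOT a real RCADE or CIS-BP motif.
COUNTS = [
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts to log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def best_hit(seq, pwm):
    """Return (start, score) of the highest-scoring forward-strand window, or None."""
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width].upper()
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


pwm = log_odds_matrix(COUNTS)
hit = best_hit("TTTTACGTTTTT", pwm)
print(hit)
```

A real analysis would also scan the reverse complement and calibrate a score threshold against background sequence.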

Challenges

1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important
3. Missing motifs for TFs
4. Mapping the domestication of retroelements and transposons, and the genesis and adaptive roles of the proteins that bind them

Human Endogenous Retroelement and Transposon catalog is incomplete

  • Proprietary "RepBase" is the gold standard
  • RepBase combines automated and manually defined models / consensus / who-knows-what
  • DFAM is an "open access" alternative, but only covers four species (Hs, Dr, Ce, Dm)
  • Many of the models are truncated or noncoding
  • Are there human TFs that evolved to silence elements that no longer exist in the human genome?

Most retroelements in genomes are truncated (figure from Imbeault et al., 2017)

Most LINE L1 models in DFAM are truncated

  • A "working" LINE L1 is 6-7 kb

[Figure: length of consensus (axis 0 to 6000) versus Kimura divergence (axis 0 to 70).]

Will ancestral genome reconstructions improve recovery of active "source" elements?

[Figure: original ERE and individual copies in present-day genomes.]

Acknowledgements

Hughes lab: Hamed Najafabadi (McGill) Matt Weirauch (Cincinnati) Frank Schmitges Marjan Barazandeh Laura Campitelli Ally Yang Ernest Radovani Mihai Albu Hong Zheng Debashish Ray Sam Lambert Tharsan Kanagalingam

Jack Greenblatt Andrew Emili Guoqing Zhong Peter Young Wei Feng Dai Hua Tang Hongbo Guo Quaid Morris Philip Kim CIHR NIH CIFAR
