Alignments, alignments, everywhere… how, why, which one to use


Comparative Sequence Analysis in Molecular Biology
Martin Tompa
Computer Science & Engineering, Genome Sciences
University of Washington, Seattle, Washington, U.S.A.

2 Outline
What genome data is available?
What is phylogenetic footprinting?
Phylogenetic footprinting by multiple sequence alignment
Which parts of multiple sequence alignments are trustworthy?
FootPrinter: phylogenetic footprinting without alignment

4 DNA: the cell's program
[Figure: a cell, its DNA, and a single nucleotide (A, C, G, or T)]

5 DNA, Genes, and Proteins
[Figure: a double-stranded DNA segment, a gene within it, and the protein it encodes]

DNA: program for cell processes
Proteins (and RNA): execute cell processes

6 How Much DNA in a Cell?
An organism's genome is the total DNA in one of its cells.
How many nucleotides in a genome?

M. tuberculosis    bacterium         4,000,000
D. melanogaster    fruit fly       200,000,000
H. sapiens         human         3,000,000,000
P. nudum           whisk fern  250,000,000,000

How can we understand the genome's program?
Lab benchwork is costly and time-consuming. We will return to this question.

7 How Many Genomes Are Available?
46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)
1025 bacterial genomes sequenced (as of 4/6/2010)
Insects, fungi, worms, plants
Many more will be finished very soon
Fertile ground for comparative genomics

8
1982-2003: number of nucleotides in GenBank doubled every 18 months
Since 2003: doubled every 3 years

10 Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA (regions under purifying constraint) evolve more slowly than nonfunctional ones.
1. Consider a set of corresponding DNA sequences from related species.
2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): motifs

13 How to Find Conserved Motifs
human  ACTAACCGGGAGATTTCAGA
chimp  AAGTTCCGGGAGATTTCCA
mouse  TAGTTATCCGGGAGATTAGA
rat    AAAACCGGTAGATTTCAGG

14 Multiple Sequence Alignment
human  AC--TAACCGGGAGATTTCAGA
chimp  AAGTT--CCGGGAGATTTCC-A
mouse  TAGTTATCCGGGAGATT--AGA
rat    AA---AACCGGTAGATTTCAGG
(Finding the optimal alignment is NP-complete.)

15 Phylogenetic Footprinting
1. Use a whole-genome multiple alignment such as the one provided by the UCSC Genome Browser.
2. Search for regions of well conserved alignment.
   Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]
   RNA elements [Pedersen; Washietl]
   General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]

17 Which Alignment Columns to Trust?
Vertebrate alignment has 3.8 billion columns
Automatically generated
Recent comparison (Margulies et al., 2007) of 4 whole-mammal alignment methods revealed widespread disagreement

18 Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)
Goal: label each alignment column with a confidence measure of alignment correctness
Identify sequences that do not belong
Users forewarned about regions of interest
Genome browser designers consider realigning

Alignment tool designers get feedback for possible improvements

19 Sample Suspicious Alignment
Human      -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Chimp      -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Rhesus     -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Mouse      -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Rat        -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Dog        -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Cow        -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Elephant   -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Tenrec     -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Opossum    -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC
Chicken    -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC
Zebrafish  GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG-ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC

20 Scoring Function
Pairwise: $\mathrm{score}(1, 2) = \log \dfrac{\Pr(1, 2)}{\Pr(1)\,\Pr(2)}$
Multiple: $\mathrm{sc}(12345 \mid T) = \log \dfrac{\Pr(12345 \mid T)}{\Pr(125 \mid T_{125})\,\Pr(34 \mid T_{34})}$
Here 1-5 are Human, Chimp, Mouse, Rat, and Chicken. The slide draws the conditioning trees as diagrams; T stands for the tree on all five species, and T_{125} and T_{34} for the two subtrees left after removing one branch.

21 Outline of Computation
Input: multiple sequence alignment A

For each branch k of the tree {
    Compute scoring function sc_k (Felsenstein)
    Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)
    Compute K, λ using sc_k (Karlin & Altschul)
    Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)
}
Output: discordance = max_k p_k

22 Suspicious Alignment Regions
Case study: human chromosome 1 alignment to 16 other vertebrates in UCSC Genome Browser
Identify suspicious alignment regions:
    Length ≥ 50 bp
    p-value ≥ 0.1 at each position, all with respect to the same branch k
    At most 50% gapped columns

23 Proposed Track on the UCSC Browser
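The suspicious-region criteria from slide 22 can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name, the input layout (one p-value and one gap flag per column, for a single fixed branch k), and the reading of the thresholds as "at least 50 columns" and "p-value at least 0.1" are assumptions. The sketch also simplifies by testing each maximal high-p-value run as a whole rather than searching its sub-runs.

```python
def suspicious_regions(p_values, gapped, min_len=50, p_min=0.1, max_gap_frac=0.5):
    """Return half-open (start, end) column ranges that look suspicious.

    p_values[i] is the p-value of column i with respect to one fixed branch,
    gapped[i] is True when column i contains a gap. All names and defaults
    are illustrative assumptions, not taken from the original software.
    """
    regions = []
    n = len(p_values)
    i = 0
    while i < n:
        if p_values[i] < p_min:
            i += 1
            continue
        # Extend a maximal run of columns whose p-value stays at or above p_min.
        j = i
        while j < n and p_values[j] >= p_min:
            j += 1
        length = j - i
        gap_frac = sum(gapped[i:j]) / length
        if length >= min_len and gap_frac <= max_gap_frac:
            regions.append((i, j))
        i = j
    return regions


# Toy input: 60 poorly supported columns, 10 well supported, 20 poorly supported.
p = [0.5] * 60 + [0.01] * 10 + [0.3] * 20
print(suspicious_regions(p, [False] * 90))  # [(0, 60)]
```

Only the first run is reported: the last run has high p-values but is shorter than 50 columns.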

24
[Figure: bar chart on a logarithmic axis from 1 to 1,000,000,000 comparing aligned residues, suspicious residues, and high-phastCons residues. Surviving data labels: 247,000,000; 9.7%, 1.3%, 3.3%, 2.3%, 15%, 26%, 24%, 29%; .004%. The bars they belong to are not recoverable.]

25 Genomic Locations of Suspicious Regions
6% of chromosome 1 alignments containing mouse are exonic
35% of chromosome 1 alignments containing zebrafish are exonic

27 DNA, Genes, and Proteins
[Figure: a double-stranded DNA segment, a gene within it, and the protein it encodes]
DNA: program for cell processes
Proteins: execute cell processes

28 Regulation of Genes
What turns genes on and off?
When is a gene turned on or off?
Where (in which cells) is a gene turned on?
How many copies of the gene product are produced?

29 Regulation of Genes
[Figure: a transcription factor bound to a regulatory element on the DNA, with RNA polymerase at the gene]

31 Goal
Identify regulatory elements in DNA sequences. These are:
    Binding sites for proteins
    Short subsequences (5-25 nucleotides)
    Up to 1000 nucleotides (or farther) from gene
    Inexactly repeating patterns (motifs)

32 CLUSTALW multiple sequence alignment (rbcS gene)

Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea        C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed   ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch      TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea        GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

33 Finding Short Motifs
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4

34 Most Parsimonious Solution
[Figure: phylogenetic tree whose leaves are the five sequences AGTCGTACGTGAC..., AGTAGACGTGCCG..., ACGTGAGATACGT..., GAACGGAGTACGT..., and TCGTGACGGTGAT...; the internal nodes are labeled ACGT, ACGT, ACGT, and ACGG.]
Parsimony score: 1 mutation
(Finding the most parsimonious motif is NP-complete.)

35 Substring Parsimony Problem
Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif, threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-hard.
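To make the problem statement concrete, here is a small Sankoff-style dynamic program over all 4^k candidate labels. It is a sketch under stated assumptions, not FootPrinter itself. The tree topology used in the example is invented for illustration (the slides show the tree only as a figure), the function names are hypothetical, and the code reports root labels with their best scores instead of reconstructing each leaf set S. It is the straightforward version that compares every pair of k-mers at every node, so it is only practical for small k.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def root_labels_within(tree, k, d):
    """Map each root label s to its best parsimony score, keeping scores <= d.

    `tree` is a nested tuple whose leaves are DNA strings (hypothetical layout).
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def table(node):
        if isinstance(node, str):
            # Leaf: a label costs 0 if it occurs in the sequence, else it is impossible.
            present = {node[i:i + k] for i in range(len(node) - k + 1)}
            return {s: (0 if s in present else float("inf")) for s in all_kmers}
        child_tables = [table(child) for child in node]
        # Internal node: each child independently picks its cheapest label t.
        return {
            s: sum(min(w[t] + hamming(s, t) for t in all_kmers) for w in child_tables)
            for s in all_kmers
        }

    return {s: score for s, score in table(tree).items() if score <= d}


# The five sequences from slide 33; the grouping below is an assumed topology.
tree = (
    ("AGTCGTACGTGAC", "AGTAGACGTGCCG"),                      # human, chimp
    ("ACGTGAGATACGT", ("GAACGGAGTACGT", "TCGTGACGGTGAT")),   # rabbit, (mouse, rat)
)
hits = root_labels_within(tree, k=4, d=1)
print(hits)
```

With these inputs the label ACGT reaches score 1, matching the "1 mutation" solution on slide 34: four sequences contain ACGT and the rat sequence contains ACGG.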

36 FootPrinter's Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
W_u[s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.
Each table W_u has 4^k entries.
[Figure: tree over the leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, and TCGTGACGGTG, with a table W_u at every node; the sample entries shown for the k-mers ACGG and ACGT take values such as 0, 1, 2, and +∞.]

37 Running Time
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \bigl( W_v[t] + d(s, t) \bigr)$
Total time: $O(n k (4^k + l))$, where n = number of species, k = motif length, l = average sequence length

38 Improvements
Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
Amenable to many useful extensions (e.g., allow insertions and deletions)

39 Application to β-actin Gene
Gilthead sea bream (678 bp)
Medaka fish (1016 bp)
Common carp (696 bp)
Grass carp (917 bp)
Chicken (871 bp)

Human (646 bp) Rabbit (636 bp) Rat (966 bp) Mouse (684 bp) Hamster (1107 bp) 40 Common carp TTGGCATGGCTTTTGTTATTTTTGGCGCTT ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACA GACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAG TAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTT TTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA CCTGTACACTGAC GACATTTGGTGGGGCCAA TAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTG TGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACT AGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC

Chicken ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGA TTGGCATGGCTTTATTTG TTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA TTTTTTCTTTTGGCGC GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAAT AGTCATTCCAAATATTGTTATAATGCATTGTT ACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTA TGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA CCTGTACACTGA TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTA CTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTC ATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAG CAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCT GGGCTCAGTGGGACTGCAGCTGTGCT Human TTGGCATGGCTTTATTTGTTT TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT

GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGA GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTAC AGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGC TTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC CCTGTACACTG CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTA ACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG 41 Motifs Absent from Some Species Find motifs with small parsimony score that span a large part of the tree Example: in tree of 10 species spanning 760 Myrs, find all motifs with

    score 0 spanning at least 250 Myrs
    score 1 spanning at least 350 Myrs
    score 2 spanning at least 450 Myrs
    score 3 spanning at least 550 Myrs

42 Application to c-fos Gene
[Figure: phylogenetic tree of puffer fish, chicken, pig, mouse, hamster, and human. Surviving branch-length labels: 10, 7, 2, 2, 1, 2, 2, 1, 0, 1; their assignment to branches is not recoverable.]
Asked for motifs of length 10, with
    0 mutations over tree of size 6
    1 mutation over tree of size 11
    2 mutations over tree of size 16
    3 mutations over tree of size 21
    4 mutations over tree of size 26
Found:
    0 mutations over tree of size 8
    1 mutation over tree of size 16
    3 mutations over tree of size 21
    4 mutations over tree of size 28

43 Application to c-fos Gene
Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals            yes

metK in B. subtilis

44 Microbial Footprinting

1105 prokaryotes with genomes completely sequenced (as of 4/6/2010)
For any prokaryotic gene of interest, plenty of close genes in other species available
Relatively simple genomes

MicroFootPrinter (with Shane Neph)
Designed specifically for phylogenetic footprinting in microbial genomes
Undergraduate Computational Biology Capstone project
User specifies species and gene of interest
Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

45 Demo
MicroFootPrinter home
Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
    chvI (two component response regulator)
    ropB (outer membrane protein)

46 Sample chvI motif
Parsimony score: 2
Span: 41.10
Significance score: 4.22

B. henselae       -151  GCTACAATTT
R. etli            -90  GCCACAATTT
R. leguminosarum  -106  GCCACAATTT
S. meliloti       -119  GCCACAATTT
S. medicae        -118  GCCACAATTT
A. tumefaciens    -105  GCCACAATTT
M. loti            -80  GCCACATTTT
M. sp.             -87  GCCACATTTT
O. anthropi       -158  GCCACATTTT
B. suis            -38  GCCACATTTT
B. melitensis     -156  GCCACATTTT
B. abortus        -156  GCCACATTTT
B. ovis           -156  GCCACATTTT
B. canis           -38  GCCACATTTT

47 Sample ropB motif
Parsimony score: 1
Span: 20.70
Significance score: 1.34

Jannaschia sp.    -151  CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG

48 Combined ChvI Motif
ropB:         CACATTTTGG
chvI:       GCCACAATTT
Atu1221:  TTGTCACAAT
ultimate:   GYCACAWTTTGG
Y={C,T} W={A,T}
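One way to see how the three motifs combine into GYCACAWTTTGG is to stack them at the offsets where they overlap and write each column with its IUPAC ambiguity code. The sketch below assumes the offsets (chvI at 0, ropB at +2, Atu1221 at -2) and assumes the combined motif covers columns 0 to 11; neither is stated on the slide, and the function name is hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases allowed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(placed, start, end):
    """IUPAC consensus of (sequence, offset) pairs over columns start..end-1."""
    out = []
    for col in range(start, end):
        # Collect the base each motif contributes to this column, if it covers it.
        letters = {seq[col - off] for seq, off in placed if 0 <= col - off < len(seq)}
        out.append(IUPAC[frozenset(letters)] if letters else "-")
    return "".join(out)


# Offsets are assumptions chosen so the shared core CACA lines up.
placed = [
    ("GCCACAATTT", 0),    # chvI
    ("CACATTTTGG", 2),    # ropB
    ("TTGTCACAAT", -2),   # Atu1221
]
combined = consensus(placed, 0, 12)
print(combined)  # GYCACAWTTTGG
```

Column 1 sees C (chvI) and T (Atu1221), giving Y; column 6 sees A (chvI, Atu1221) and T (ropB), giving W.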

49 References and Acknowledgments
Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.
Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002, 739-748.
Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.
All software available at bio.cs.washington.edu/software.html

50 Extra Material

51 BLASTX e-values for mouse alignments to human coding regions
[Figure: fraction of regions that have an intersection with some annotated human coding exon (vertical axis, 0 to 1) against -log10(E-value) (horizontal axis, binned from <1 up to 10), plotted for suspicious regions, low discordance regions, and random regions.]

52 Synthetic experiments
[Figure: four panels (a)-(d), each with curves for Cow, Pig, Cat, Dog, Mouse, and Rat and a vertical axis from 0 to 0.8. Surviving panel labels: "Any species" and "Same species". Specificity: fraction of suspicious regions against percentage of misaligned columns. Sensitivity: fraction of misalignment regions against percentage of columns with p-value at least 0.1. Horizontal axes are binned from 0-10 up to 100.]
