Predicting Genes - Iowa State University

Gene Prediction (formerly Gene Prediction - 3)
10/21/05
D Dobbs ISU - BCB 444/544X: Gene Prediction

Announcements
Exam 2 - next Friday
Posted online: Exam 2 Study Guide
544 Reading Assignment (2 papers)

Announcements
544 Semester Projects - Information needed:
Please send email to me (or David) [email protected]
Briefly describe:
  Your background & current grad research
  Is there a problem related to your research you would like to learn more about & develop as a project for this course?
  or What would your dream project be?

Announcements
2 Bioinformatics Seminars today (Fri Oct 21)
12:10 PM BCB Faculty Seminar in E164 Lagomarcino
  Protein Networks
  Bob Jernigan, BBMB & Director, Baker Center for Bioinformatics & Biological Statistics
  http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2021
4:10 PM GDCB Special Seminar in 1414 MBB
  Integrating the Unknown-eome with Abiotic Stress Response Networks in Arabidopsis
  Ron Mittler, Dept. of Biochem & Mol Biology

Gene Prediction & Regulation
Mon - Gene structure review: Eukaryotes vs prokaryotes
Wed - Regulatory regions: Promoters & enhancers
Fri - Predicting genes - Predicting regulatory regions (?)
Next week: Predicting RNA structure (miRNAs, too)

Reading Assignment
Mount Bioinformatics Chp 9 Gene Prediction & Regulation
  pp 361-385 Predicting Promoters
  Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
* Brown Genomes 2 (NCBI textbooks online)
  Sect 9 Overview: Assembly of Transcription Initiation Complex
  http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002
  Sect 9.1-9.3 DNA binding proteins, Transcription initiation
  http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016
* NOTE: Don't worry about the details!!

Optional Reading
Reviews:
1) Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
   http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
2) Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287
   http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

Review last lecture: Gene Regulation (formerly Gene Prediction-2)
cDNAs & ESTs
UniGene
Regulatory regions
Eukaryotes vs prokaryotes

DNA -> RNA -> protein -> Phenotype
cDNA
[1] Transcription
[2] RNA processing (splicing)
[3] RNA export
[4] RNA surveillance
Pevsner

UniGene: unique genes via ESTs
Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
UniGene clusters contain many ESTs
UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution
Pevsner p164

Today: Gene Prediction (formerly Gene Prediction - 3)
Predicting genes
Mon - Predicting regulatory regions
Focus on promoters
Introduction to RNA
Later: Genome browsers

Gene Prediction
Overview of steps & strategies
  What sequence signals can be used?
  What other types of information can be used?
Algorithms
  HMMs, discriminant functions, neural nets
Gene prediction software
  3 major types
  many, many programs!

Predicting Genes - Basic steps:
Obtain genomic sequence
Translate in all 6 reading frames
Compare with protein sequence database
Perform database similarity search with EST & cDNA databases, if available
Use gene prediction program to locate genes
Analyze gene regulatory sequences

Overview of gene prediction strategies
What sequence signals can be used?
  Transcription: TF binding sites, promoter, initiation site, terminator
  Processing signals: splice donor/acceptors, polyA signal
  Translation: start (AUG = Met) & stop (UGA, UAA, UAG)
  ORFs, codon usage
What other types of information can be used?
  cDNAs & ESTs (experimental data, pairwise alignment)
  homology (sequence comparison, BLAST)

Automated gene prediction strategies
1) Similarity-based or Comparative
   BLAST - Do other organisms have similar sequence? (Is sequence similar to known gene or protein?)
2) Ab initio = "from the beginning"
   Predict without explicit comparison with cDNA or proteins, via rule-based gene models - but rules are derived from statistical analysis of datasets
3) Combined "evidence-based"
   Combine gene models with alignment to known ESTs & protein sequences
BEST RESULTS? Combined

Examples of gene prediction software
1) Similarity-based or Comparative
   BLAST
   SGP2 (extension of GeneID)
2) Ab initio = "from the beginning"
   GeneID - (used in lab this week)
   GENSCAN - (used in lab this week)
   GeneMark.hmm - (should try this!)
3) Combined "evidence-based"
   GeneSeqer (Brendel et al., ISU)
BEST? GENSCAN, GeneMark.hmm, GeneSeqer - but depends on organism & specific task
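The "translate in all 6 reading frames" step and the start/stop codon signals listed in the slides above can be illustrated with a short Python sketch. This is only a minimal illustration, not how GeneID, GENSCAN or GeneSeqer work: it assumes the standard genetic code and DNA input (T rather than U), and the names `six_frames`, `find_orfs` and `min_codons` are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; incomplete trailing bases are ignored,
    codons with ambiguous letters become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return {frame: protein} for frames +1, +2, +3 and -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def find_orfs(protein, min_codons=3):
    """Return M...stop stretches from one translated frame (hypothetical helper).
    Only segments that end in a stop codon are considered."""
    orfs = []
    for segment in protein.split("*")[:-1]:  # the last piece has no stop
        start = segment.find("M")
        if start != -1 and len(segment) - start >= min_codons:
            orfs.append(segment[start:])
    return orfs


if __name__ == "__main__":
    for frame, protein in sorted(six_frames("CCATGGCTTGGTAAGG").items()):
        print(frame, protein, find_orfs(protein))
```

A real pipeline would then compare each translated frame against a protein database (for example with BLAST), which is the next step in the list; the ORF scan here is only the crudest possible signal and ignores introns entirely.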

Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
  Smaller genomes
  Simpler gene structures
  More sequenced genomes! (for comparative approaches)
Methods?
  Previously, mostly HMM-based
  Now: similarity-based methods, because so many genomes are available

GeneSeqer - Brendel et al.
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi

Thanks to Volker Brendel, ISU for the following Figs & Slides
Slightly modified from: BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel [email protected]

Signals: Pre-mRNA Splicing
[Figure: Genomic DNA (exons and introns, start codon, stop codon) - Transcription - pre-mRNA (Cap, Poly(A)) - Splicing - mRNA (Cap, Poly(A)) - Translation - Protein. Splice sites: donor site (GT) and acceptor site (AG) at the intron boundaries.]

Spliced Alignment I: Compare with cDNA or EST probes
[Figure: genomic DNA (start codon, stop codon) aligned with an mRNA (Cap, 5'-UTR, start codon, stop codon, 3'-UTR, Poly(A)).]

Spliced Alignment II: Compare with protein probes
[Figure: genomic DNA (start codon, stop codon) aligned with a protein.]

Spliced Alignment Algorithm
Perform pairwise alignment with large gaps in one sequence (introns)
  Align genomic DNA with cDNA, EST or protein
  Score semi-conserved sequences at splice junctions
  Score coding constraints in translated exons

Donor (GT) & Acceptor (AG) Sites Used for Model Training
Number of true splice sites, by phase

Species                Type   Phase 1   Phase 2   Phase 3
Homo sapiens           GT        6586      5277      3037
                       AG        6555      5194      2979
Mus musculus           GT        1212      1185       521
                       AG        1194      1139       504
Rattus norvegicus      GT         450       408       147
                       AG         442       386       140
Gallus gallus          GT         288       238       107
                       AG         284       228       103
Drosophila             GT         989       670       524
                       AG        1001       671       536
C. elegans             GT       37029     20500     20789
                       AG       36864     20325     20626
S. pombe               GT         170       118       119
                       AG         179       122       118
Aspergillus            GT         221       176       157
                       AG         217       172       163
Arabidopsis thaliana   GT       23019      9297      8653
                       AG       22929      9247      8611
Zea mays               GT         316       107        88
                       AG         311       104        83

Splice Site Detection
Information content $I_i$:
$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 (f_{iB})$$
Extent of splice signal window:
$$I_i \ge \bar{I} + 1.96\, \sigma_{\bar{I}}$$
$I_i$: information content at the $i$th position in the sequence
$\bar{I}$: average information content over all positions $i$ > 20 nt from the splice site
$\sigma_{\bar{I}}$: average standard deviation of $\bar{I}$

Results?
[Figure: eight panels, y-axis ticks 0.0 to 0.8, x-axis ticks -50 to 50. Panels: Human T2_GT, Human T2_AG, Human F1_AG, Human Fi_AG, A. thaliana T2_GT, A. thaliana T2_AG, A. thaliana F1_AG, A. thaliana Fi_AG.]
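The information content formula from the Splice Site Detection slide can be computed directly from a set of aligned splice-site sequences. The sketch below is a rough illustration under several assumptions that are not in the slides: the sites are given as equal-length strings, the background standard deviation is estimated as the sample standard deviation of the $I_i$ values more than `flank` nt from the site, and the window criterion is read as "greater than or equal to". The function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev

ALPHABET = "ACGT"  # the slide uses U, C, A, G; U is mapped to T here


def information_content(column):
    """I_i = 2 + sum over bases of f_iB * log2(f_iB), with 0 * log2(0) taken as 0."""
    counts = Counter(column.upper().replace("U", "T"))
    total = sum(counts[b] for b in ALPHABET)
    info = 2.0
    for base in ALPHABET:
        f = counts[base] / total
        if f > 0:
            info += f * math.log2(f)
    return info


def profile(aligned_sites):
    """Information content for every column of equal-length aligned sequences."""
    return [information_content("".join(col)) for col in zip(*aligned_sites)]


def signal_window(info, positions, flank=20):
    """Positions whose I_i reaches mean + 1.96 * sd of the background.
    Background = positions more than `flank` nt from the splice site (assumption:
    sd is the sample standard deviation of those background values)."""
    background = [v for v, p in zip(info, positions) if abs(p) > flank]
    threshold = mean(background) + 1.96 * stdev(background)
    return [p for v, p in zip(info, positions) if v >= threshold]


if __name__ == "__main__":
    # Toy donor-site alignment: the invariant GT columns carry the full 2 bits.
    sites = ["AGGTAAGT", "CGGTGAGT", "AAGTAAGA", "TGGTAAGT"]
    print([round(v, 2) for v in profile(sites)])
```

A fully conserved column gives 2 bits and a uniform column gives 0 bits, which is why the GT and AG dinucleotides dominate such profiles and the flanking positions decay toward the background.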

Bayesian Splice Site Prediction
Let $S = s_{-l}\, s_{-l+1}\, s_{-l+2} \cdots s_{-1}\, \mathrm{GT}\, s_1\, s_2\, s_3 \cdots s_r$
$$P\{H \mid S\} = \frac{P\{S \mid H\}\, P\{H\}}{\sum_{H'} P\{S \mid H'\}\, P\{H'\}}$$
$$P\{S\} = p\{-l\} \prod_{i=-l+1}^{r} p\{i \mid i-1\} = p\{-l\} \prod_{i=-l+1}^{r} \frac{f_{i,i-1}}{f_{i-1}}$$
where $H$ indexes the hypotheses of GT or AG at
- True site in reading phase 1, 2, or 0
- False within-exon site in reading phase 1, 2, or 0
- False within-intron site

Bayes Factor as Decision Criterion
$H_0$: $H = T$
- 2-class model:
$$BF = \frac{p\{S \mid T\}}{p\{S \mid F\}} = \frac{p\{T \mid S\}}{1 - p\{T \mid S\}} \Big/ \frac{p\{T\}}{1 - p\{T\}}$$
- 7-class model:
$$BF = \frac{\sum_{x=1,2,0} p\{S \mid T_x\}\, p\{T_x\} \Big/ \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\, p\{F_x\} \Big/ \sum_{x=1,2,0,i} p\{F_x\}}$$

Interpretation of Bayes Factor
in terms of Critical Value $c = 2 \ln BF$
Positive evidence for $H_0$ if $2 \le c \le 6$
Strong support for $H_0$ if $6 < c \le 10$
Very strong support for $H_0$ if $c > 10$

Evaluation of Splice Site Prediction

                   Actual True     Actual False
Predicted True     TP              FP              PP = TP + FP
Predicted False    FN              TN              PN = FN + TN
                   AP = TP + FN    AN = FP + TN

Sensitivity (= Coverage): $S_n = TP / AP = 1 - \alpha$
Specificity: $S_p = TP / PP = \frac{1 - \alpha}{1 - \alpha + r\beta}$, with $r = AN / AP$
Misclassification rates: $\alpha = FN / AP$, $\beta = FP / AN$
Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$

Species        Model   Site   Test set True   Test set False   Bayes Factor   Sn (%)   σ (%)   Sp (%)
Homo sapiens   2C      GT               921            44411              0     98.5    90.5     16.4
                                                                          3     91.7    96.3     34.8
                                                                          6     66.3    98.5     57.6
                       AG               920            65103              0     96.3    88.4      9.7
                                                                          3     90.3    92.9     15.7
                                                                          6     76.1    96.1     25.6
Drosophila     2C      GT               329            11501              0     95.4    94.8     34.1
                                                                          3     90.0    97.6     53.6
                                                                          6     83.9    99.1     75.0
                       AG               329            14920              0     95.7    94.8     28.7
                                                                          3     92.1    97.0     41.4
                                                                          6     85.1    98.5     59.4
C. elegans     7C      GT               400             7460              0     97.8    92.7     40.4
                                                                          3     94.2    97.1     64.3
                                                                          6     84.8    99.1     85.4
                       AG               400            10132              0     98.8    97.2     58.2
                                                                          3     96.2    98.8     76.9
                                                                          6     90.2    99.5     88.5
A. thaliana    7C      GT               613             9027              0     99.5    93.2     48.1
                                                                          3     95.6    97.6     73.2
                                                                          6     87.1    99.3     91.0
                       AG               614            10196              0     99.2    92.3     41.9
                                                                          3     96.4    96.4     62.0
                                                                          6     87.1    98.6     81.2
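The 2-class Bayes factor, the critical value $c = 2 \ln BF$, and the evaluation measures from the slides above can be tied together in a few lines of Python. This is a hedged sketch: the function names and the example counts are hypothetical, the boundary handling of the evidence categories (which side gets "equal to 6" or "equal to 10") is an assumption, and the last helper assumes the three percentage columns of the results table are, in order, sensitivity, normalized specificity and specificity.

```python
import math


def bayes_factor_2class(posterior_true, prior_true):
    """2-class Bayes factor: posterior odds of a true site over prior odds."""
    posterior_odds = posterior_true / (1.0 - posterior_true)
    prior_odds = prior_true / (1.0 - prior_true)
    return posterior_odds / prior_odds


def evidence_for_true_site(bf):
    """Return (c, label) with c = 2 ln BF and the slide's verbal category."""
    c = 2.0 * math.log(bf)
    if c > 10:
        return c, "very strong"
    if c > 6:
        return c, "strong"
    if c >= 2:
        return c, "positive"
    return c, "below positive"


def evaluate(tp, fp, fn, tn):
    """Sensitivity, specificity and normalized specificity from a confusion matrix."""
    ap, an, pp = tp + fn, fp + tn, tp + fp
    alpha = fn / ap  # fraction of true sites missed
    beta = fp / an   # fraction of false sites called true
    return {
        "Sn": tp / ap,
        "Sp": tp / pp,
        "sigma": (1.0 - alpha) / (1.0 - alpha + beta),
    }


def sigma_from_sn_sp(sn, sp, r):
    """Recover normalized specificity from Sn, Sp and r = AN / AP,
    using Sp = Sn / (Sn + r * beta)."""
    beta = sn * (1.0 / sp - 1.0) / r
    return sn / (sn + beta)


if __name__ == "__main__":
    print(evidence_for_true_site(bayes_factor_2class(0.9, 0.02)))
    print(evaluate(tp=90, fp=10, fn=10, tn=890))  # hypothetical counts
    # Human GT row at Bayes factor 0: Sn 98.5%, Sp 16.4%, 921 true / 44411 false sites
    print(round(100 * sigma_from_sn_sp(0.985, 0.164, 44411 / 921), 1))
```

The last line is a rough consistency check rather than a reproduction of the table: because the published percentages are rounded, the recovered value only approximately matches the tabulated 90.5%. It does show why normalized specificity is useful, since raw specificity is dragged down by the roughly 48-fold excess of false sites in the human test set.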

Performance?
[Figure: six panels of curves labelled Sn, y-axis ticks 0.00 to 1.00, x-axis ticks -10 to 20. Panels: Human GT site, Human AG site, C. elegans GT site, C. elegans AG site, A. thaliana GT site, A. thaliana AG site.]

Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1. Transition labels: PG, (1-PG)(1-PD(n+1)), (1-PG)PD(n+1), PA(n)PG, 1-PA(n).]

Performance vs other methods
Comparison with ab initio gene prediction programs?
Depends on:
  Availability of ESTs
  Availability of protein homologs

GeneSeqer vs NAP vs GENSCAN (Exon prediction)
[Figure: Exon (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP and GENSCAN.]
GENSCAN - Burge, MIT

GeneSeqer vs NAP vs GENSCAN (Intron prediction)
[Figure: Intron (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP and GENSCAN.]
GENSCAN - Burge, MIT

GeneSeqer
[Figure: pipeline diagram. Labels: Genomic Sequence; EST or protein database; Fast Search (Suffix Array / Suffix Tree); Spliced Alignment; Assembly; Output.]
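The exon-level score (Sn + Sp) / 2 used in the GeneSeqer vs NAP vs GENSCAN comparison can be computed from two lists of exon coordinates. The slides do not say how a "correct" exon was defined, so the sketch below assumes the common convention that a predicted exon counts only if both of its boundaries match an annotated exon exactly; the function name and the coordinates are hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp, (Sn + Sp) / 2) at the exon level.

    predicted, annotated: iterables of (start, end) tuples.
    Assumption: an exon is a true positive only if both ends match exactly.
    Sn = TP / number of annotated exons, Sp = TP / number of predicted exons.
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)
    sn = tp / len(annotated) if annotated else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 650)]
    predicted = [(100, 200), (300, 420), (500, 650), (700, 750)]
    sn, sp, score = exon_accuracy(predicted, annotated)
    print(f"Sn={sn:.2f} Sp={sp:.2f} (Sn+Sp)/2={score:.2f}")
```

The same function applies to introns if the tuples hold intron coordinates instead. Averaging Sn and Sp rewards a program only when it both finds the real exons and avoids predicting spurious ones.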

Gene Structure Annotation - Problems
False positive intergenic region: 2 annotated genes actually correspond to a single gene
False negative intergenic region: One annotated gene structure actually contains 2 genes
False negative gene prediction: Missing gene (no annotation)
Other:
  partially incorrect gene annotation
  missing annotation of alternative transcripts

Other Resources
Current Protocols in Bioinformatics
http://www.4ulr.com/products/currentprotocols/bioinformatics.html
Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
