Title goes here - teaching.bioinformatics.dtu.dk

Sequence alignment & Substitution matrices
By Thomas Nordahl

Sequence alignment
1. Sequence alignment is the most important technique used in bioinformatics
2. Infer properties from one protein to another
   1. Homologous sequences often have similar biological functions
3. Most information can be deduced from a sequence if the 3D structure is known
4. 3D-structure determination is very time consuming (X-ray, NMR)
   1. Substantial amounts of pure protein are required (> 100 mg)
   2. Make crystal, solve structure: 1-3 years
   3. Large facilities are needed to produce X-rays
      1. Rotating anode or synchrotron
5. Determining the primary sequence is fast and cheap
6. Structure is more conserved than sequence

[Figure slides: Growth of GenBank and WGS; Structures in PDB; Car parts analogy to protein folds; Protein class & folds]

Structures in SCOP database
The world seems to consist of approximately 1400 protein folds. As of 2014, no new folds had been observed.

What can we learn from sequence alignment
- Find a similar sequence from another organism
- Information from the known sequence can be inherited
- Layers of conserved information: Structure > function > sequence, where > means "more conserved"
- Structure (3D) is the most conserved feature
- Proteins with different functions may still share the same structure
- Proteins with different sequences may still share the same function
- Often the same function if sequence identity is 40-50%
- Often the same protein fold if sequence identity is above 30%

Sequence alignment
   M V S T A
   M A T S A

Identity matrix:
     M  V  S  T  A
  M  1  0  0  0  0
  V  0  1  0  0  0
  S  0  0  1  0  0
  T  0  0  0  1  0
  A  0  0  0  0  1

Number of identical amino acids, % identity?
Alignment score using the identity matrix?
Similar amino acids can be substituted for one another, so other types of substitution matrices are used.

Blosum matrices
- Blosum matrices are the most commonly used substitution matrices: Blosum50, Blosum62, Blosum80
- Symmetrical 20 x 20 matrix, where each element is a substitution score
- Positive scores: the amino acids are likely to be aligned in a sequence alignment; they share similar chemical characteristics
- Negative scores: less likely substitutions, which nevertheless still occur
- Zero scores: neutral; the substitution is observed about as often as expected by chance
Q) In an alignment, what is the most likely amino acid that Arg will align to besides itself?

Log-odds scores
Log-odds scores are given by log(Observed/Expected).
The log-odds score of matching amino acid j with amino acid i in an alignment is

   log( P_ij / (Q_i Q_j) )

where P_ij is the observed frequency of i aligned with j, and Q_i, Q_j are the frequencies of amino acids i and j in the data set.
The log-odds score is (in bit units)

   S_ij = 2 log2( P_ij / (Q_i Q_j) )

where log2(x) = ln(x)/ln(2). S has been normalized to half bits, hence the factor 2.

Example of a scoring matrix
BLOSUM80
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    7  -3  -3  -3  -1  -2  -2   0  -3  -3  -3  -1  -2  -4  -1   2   0  -5  -4  -1
R   -3   9  -1  -3  -6   1  -1  -4   0  -5  -4   3  -3  -5  -3  -2  -2  -5  -4  -4
N   -3  -1   9   2  -5   0  -1  -1   1  -6  -6   0  -4  -6  -4   1   0  -7  -4  -5
D   -3  -3   2  10  -7  -1   2  -3  -2  -7  -7  -2  -6  -6  -3  -1  -2  -8  -6  -6
C   -1  -6  -5  -7  13  -5  -7  -6  -7  -2  -3  -6  -3  -4  -6  -2  -2  -5  -5  -2
Q   -2   1   0  -1  -5   9   3  -4   1  -5  -4   2  -1  -5  -3  -1  -1  -4  -3  -4
E   -2  -1  -1   2  -7   3   8  -4   0  -6  -6   1  -4  -6  -2  -1  -2  -6  -5  -4
G    0  -4  -1  -3  -6  -4  -4   9  -4  -7  -7  -3  -5  -6  -5  -1  -3  -6  -6  -6
H   -3   0   1  -2  -7   1   0  -4  12  -6  -5  -1  -4  -2  -4  -2  -3  -4   3  -5
I   -3  -5  -6  -7  -2  -5  -6  -7  -6   7   2  -5   2  -1  -5  -4  -2  -5  -3   4
L   -3  -4  -6  -7  -3  -4  -6  -7  -5   2   6  -4   3   0  -5  -4  -3  -4  -2   1
K   -1   3   0  -2  -6   2   1  -3  -1  -5  -4   8  -3  -5  -2  -1  -1  -6  -4  -4
M   -2  -3  -4  -6  -3  -1  -4  -5  -4   2   3  -3   9   0  -4  -3  -1  -3  -3   1
F   -4  -5  -6  -6  -4  -5  -6  -6  -2  -1   0  -5   0  10  -6  -4  -4   0   4  -2
P   -1  -3  -4  -3  -6  -3  -2  -5  -4  -5  -5  -2  -4  -6  12  -2  -3  -7  -6  -4
S    2  -2   1  -1  -2  -1  -1  -1  -2  -4  -4  -1  -3  -4  -2   7   2  -6  -3  -3
T    0  -2   0  -2  -2  -1  -2  -3  -3  -2  -3  -1  -1  -4  -3   2   8  -5  -3   0
W   -5  -5  -7  -8  -5  -4  -6  -6  -4  -5  -4  -6  -3   0  -7  -6  -5  16   3  -5
Y   -4  -4  -4  -6  -5  -3  -5  -6   3  -3  -2  -4  -3   4  -6  -3  -3   3  11  -3
V   -1  -4  -5  -6  -2  -4  -4  -6  -5   4   1  -4   1  -2  -4  -3   0  -5  -3   7

S_ij = 2 log2( P_ij / (Q_i Q_j) )
Log-odds scores have been rounded off to integers.

An example
S_ij = 2 log2( P_ij / (Q_i Q_j) )
- P_ij can be calculated as N_ij / (sum over all i,j of N_ij), where N_ij is the number of times amino acid i is aligned to amino acid j
- The sum of N_ij is the total number of aligned pairs
- Q_i is the observed frequency of amino acid i in the alignment

MSA (Multiple Sequence Alignment)
  Column:  1 2 3 4
  seq1:    V V A D
  seq2:    A A A D
  seq3:    D V A D
  seq4:    D A A A

How to calculate N_AA
Count every ordered pair of residues taken from two different sequences within the same column. Column 2 (V, A, V, A) contributes 2 A-A pairs and column 3 (A, A, A, A) contributes 12, so N_AA = 14.

An example
  Pair   N_ij   P_ij = N_ij/48
  AA     14     14/48
  AD      5      5/48
  AV      5      5/48
  DA      5      5/48
  DD      8      8/48
  DV      2      2/48
  VA      5      5/48
  VD      2      2/48
  VV      2      2/48

  Q_A = 8/16   Q_D = 5/16   Q_V = 3/16

Example continued
  Pair   P_ij   Q_i Q_j
  AA     0.29   0.25
  AD     0.10   0.16
  AV     0.10   0.09
  DA     0.10   0.16
  DD     0.17   0.10
  DV     0.04   0.06
  VA     0.10   0.09
  VD     0.04   0.06
  VV     0.04   0.03

  Q_A = 0.50   Q_D = 0.31   Q_V = 0.19

So what does this mean?
BLOSUM is a log-likelihood matrix: S_ij = 2 log2( P_ij / (Q_i Q_j) )
  Pair   P_ij   Q_i Q_j   S_ij
  AA     0.29   0.25       0.44
  AD     0.10   0.16      -1.17
  AV     0.10   0.09       0.30
  DA     0.10   0.16      -1.17
  DD     0.17   0.10       1.54
  DV     0.04   0.06      -0.98
  VA     0.10   0.09       0.30
  VD     0.04   0.06      -0.98
  VV     0.04   0.03       0.49

The Scoring matrix
         A       D       V
  A    0.44   -1.17    0.30
  D   -1.17    1.54   -0.98
  V    0.30   -0.98    0.49

And what does the BLOSUMXX mean?
- High Blosum values mean high similarity between clusters: conservative substitutions allowed
- Low Blosum values mean low similarity between clusters: less conservative substitutions allowed

BLOSUM80
Same matrix as under "Example of a scoring matrix".

BLOSUM30
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4  -1   0   0  -3   1   0   0  -2   0  -1   0   1  -2  -1   1   1  -5  -4   1
R   -1   8  -2  -1  -2   3  -1  -2  -1  -3  -2   1   0  -1  -1  -1  -3   0   0  -1
N    0  -2   8   1  -1  -1  -1   0  -1   0  -2   0   0  -1  -3   0   1  -7  -4  -2
D    0  -1   1   9  -3  -1   1  -1  -2  -4  -1   0  -3  -5  -1   0  -1  -4  -1  -2
C   -3  -2  -1  -3  17  -2   1  -4  -5  -2   0  -3  -2  -3  -3  -2  -2  -2  -6  -2
Q    1   3  -1  -1  -2   8   2  -2   0  -2  -2   0  -1  -3   0  -1   0  -1  -1  -3
E    0  -1  -1   1   1   2   6  -2   0  -3  -1   2  -1  -4   1   0  -2  -1  -2  -3
G    0  -2   0  -1  -4  -2  -2   8  -3  -1  -2  -1  -2  -3  -1   0  -2   1  -3  -3
H   -2  -1  -1  -2  -5   0   0  -3  14  -2  -1  -2   2  -3   1  -1  -2  -5   0  -3
I    0  -3   0  -4  -2  -2  -3  -1  -2   6   2  -2   1   0  -3  -1   0  -3  -1   4
L   -1  -2  -2  -1   0  -2  -1  -2  -1   2   4  -2   2   2  -3  -2   0  -2   3   1
K    0   1   0   0  -3   0   2  -1  -2  -2  -2   4   2  -1   1   0  -1  -2  -1  -2
M    1   0   0  -3  -2  -1  -1  -2   2   1   2   2   6  -2  -4  -2   0  -3  -1   0
F   -2  -1  -1  -5  -3  -3  -4  -3  -3   0   2  -1  -2  10  -4  -1  -2   1   3   1
P   -1  -1  -3  -1  -3   0   1  -1   1  -3  -3   1  -4  -4  11  -1   0  -3  -2  -4
S    1  -1   0   0  -2  -1   0   0  -1  -1  -2   0  -2  -1  -1   4   2  -3  -2  -1
T    1  -3   1  -1  -2   0  -2  -2  -2   0   0  -1   0  -2   0   2   5  -5  -1   1
W   -5   0  -7  -4  -2  -1  -1   1  -5  -3  -2  -2  -3   1  -3  -3  -5  20   5  -3
Y   -4   0  -4  -1  -6  -1  -2  -3   0  -1   3  -1  -1   3  -2  -2  -1   5   9   1
V    1  -1  -2  -2  -2  -3  -3  -3  -3   4   1  -2   0   1  -4  -1   1  -3   1   5

Blosum30: mean diagonal score = 8.3, mean off-diagonal score = -1.16
Blosum80: mean diagonal score = 9.4, mean off-diagonal score = -2.9
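
The small A/D/V worked example can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: the function names (pair_counts, residue_frequencies, log_odds) are made up for illustration, and it assumes the counting rule implied by N_AA = 14, namely every ordered pair of residues from two different sequences in the same column. Real BLOSUM matrices are built from clustered blocks of many alignments, which this toy calculation does not attempt.

```python
from collections import Counter
from itertools import permutations
from math import log2

# The four-sequence toy MSA from the worked example, one string per sequence.
msa = ["VVAD", "AAAD", "DVAD", "DAAA"]


def pair_counts(msa):
    """Count ordered residue pairs N_ij between different sequences, column by column."""
    counts = Counter()
    for column in zip(*msa):
        # permutations works on positions, so identical residues in
        # different sequences are still counted as separate pairs.
        for a, b in permutations(column, 2):
            counts[a + b] += 1
    return counts


def residue_frequencies(msa):
    """Background frequencies Q_i: share of each residue among all MSA positions."""
    residues = "".join(msa)
    return {aa: n / len(residues) for aa, n in Counter(residues).items()}


def log_odds(msa):
    """Half-bit log-odds scores S_ij = 2 * log2(P_ij / (Q_i * Q_j))."""
    n = pair_counts(msa)
    q = residue_frequencies(msa)
    total = sum(n.values())
    return {
        pair: 2 * log2((count / total) / (q[pair[0]] * q[pair[1]]))
        for pair, count in n.items()
    }


N = pair_counts(msa)
Q = residue_frequencies(msa)
S = log_odds(msa)

for pair in sorted(S):
    print(pair, N[pair], round(S[pair], 2))
```

Rounded to two decimals, the printed scores should agree with the 3 x 3 scoring matrix in the example (for instance 0.44 for AA and -1.17 for AD).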

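The two questions on the sequence-alignment slide (% identity and identity-matrix score) can be checked with a short sketch. It assumes the two aligned sequences are MVSTA and MATSA with no gaps, which is one reading of the slide. The helper names and the small BLOSUM80_SUBSET dictionary are illustrative; the dictionary holds only the five BLOSUM80 entries this alignment needs, copied from the BLOSUM80 table.

```python
# Assumed ungapped alignment from the slide (position i of seq1 pairs with position i of seq2).
seq1 = "MVSTA"
seq2 = "MATSA"


def identity_score(a, b):
    """Identity matrix: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


# Only the BLOSUM80 entries required here, taken from the BLOSUM80 table.
BLOSUM80_SUBSET = {
    ("M", "M"): 9,
    ("V", "A"): -1,
    ("S", "T"): 2,
    ("T", "S"): 2,
    ("A", "A"): 7,
}


def blosum80_score(a, b):
    """Look up a substitution score in the illustrative subset."""
    return BLOSUM80_SUBSET[(a, b)]


def alignment_score(s1, s2, score):
    """Sum the per-column scores of an ungapped alignment."""
    if len(s1) != len(s2):
        raise ValueError("ungapped alignment requires equal-length sequences")
    return sum(score(a, b) for a, b in zip(s1, s2))


identical = alignment_score(seq1, seq2, identity_score)
percent_identity = 100 * identical / len(seq1)
blosum_total = alignment_score(seq1, seq2, blosum80_score)

print(identical, percent_identity, blosum_total)
```

With the identity matrix only the two exact matches (M and A) contribute, whereas BLOSUM80 also rewards the similar S/T pairs and penalizes V/A, which is the motivation for substitution matrices.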