Preparing data for GWAS analysis

Tommy Carstensen
September 2013, Entebbe, Uganda

[Figure labels: AA, AB, BB, ?; Quality Control (QC); GWAS; Manhattan Plot]

Bad QC > Bad Data > Bad Results

Genotype data

            SNP A   SNP B   SNP C   SNP D   SNP E
Female 1    00      AG      GG      GA      00
Male 1      00      GG      GG      AA      CC
Female 2    AC      00      GG      AA      CC
Female 3    AA      AG      GC      AA      CC
Male 2      AC      AA      00      AA      CA

00 = missing data

How did we get the genotype data?

Genotype Calling

[Figure labels: Good data (SNP1) with clusters AA, AB, BB; Bad data (SNP2): AA or AB? 00!]

Sample QC and SNP QC

Sample QC is applied per row (sample) of the genotype table; SNP QC is applied per column (SNP). As before, 00 = missing data.

Quality Control Steps

Sample QC
  • Sample Call Rate/Proportion
  • Autosomal Heterozygosity
  • Sex / Gender: X Chromosome Heterozygosity
  • Too Much Relatedness: Identity By Descent (IBD)
  • Too Little Relatedness / Confounding: Principal Component Analysis (PCA)

SNP QC
  • SNP Call Rate/Proportion
  • Hardy Weinberg Equilibrium (HWE)

Sample Call Rate/Proportion

           SNP1   SNP2   SNP3   SNP4   SNP5   Sample Call Rate
Sample1    00     AG     GG     GA     00     60%
Sample2    00     GG     GG     AA     CC     80%
Sample3    AC     00     GG     AA     CC     80%
Sample4    AA     AG     GC     AA     CC     100%
Sample5    AC     AA     00     AA     CA     80%
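
As a minimal sketch of the call rate calculation, the Python below recomputes the percentages for the five-sample, five-SNP example in the table above. The function names are illustrative assumptions and are not taken from PLINK or any other tool named in these slides. The same helper applied to columns instead of rows gives a per-SNP call rate.

```python
# Genotype table from the slide: rows are samples, columns are SNP1..SNP5.
# "00" marks a missing genotype call.
MISSING = "00"
GENOTYPES = {
    "Sample1": ["00", "AG", "GG", "GA", "00"],
    "Sample2": ["00", "GG", "GG", "AA", "CC"],
    "Sample3": ["AC", "00", "GG", "AA", "CC"],
    "Sample4": ["AA", "AG", "GC", "AA", "CC"],
    "Sample5": ["AC", "AA", "00", "AA", "CA"],
}


def call_rate(calls):
    """Fraction of genotype calls that are not missing."""
    called = sum(1 for g in calls if g != MISSING)
    return called / len(calls)


def sample_call_rates(genotypes):
    """Call rate per sample (one value per row of the table)."""
    return {sample: call_rate(calls) for sample, calls in genotypes.items()}


def snp_call_rates(genotypes):
    """Call rate per SNP (one value per column of the table)."""
    return [call_rate(list(column)) for column in zip(*genotypes.values())]


if __name__ == "__main__":
    for sample, rate in sample_call_rates(GENOTYPES).items():
        print(f"{sample}: {rate:.0%}")
```

In a real study the rates would be compared with a threshold chosen for the data set, and samples or SNPs below it would be excluded.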

Definition of Heterozygosity Rate

Two copies of chromosome 1:

        Copy 1   Copy 2
SNP1    A        A        homozygous
SNP2    C        C        homozygous
SNP3    A        T        heterozygous
SNP4    C        C        homozygous
SNP5    C        G        heterozygous
SNP6    A        A        homozygous
SNP7    A        A        homozygous
SNP8    T        T        homozygous

heterozygosity = 2/8 = 0.25

Heterozygosity

Remove samples deviating from average heterozygosity. Deviations could arise due to several reasons:
  • Contamination of samples (high heterozygosity): a mix of two homozygous samples (Sample 1, Sample 2) looks heterozygous
  • Inbreeding (low heterozygosity)
  • Ancestral differences between populations
  • Data quality / poor genotype calling: heterozygotes are more likely to be missing

Correlation between quality metrics

[Figure not recoverable]
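
The heterozygosity rate defined above, and the removal of samples that deviate from the average, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, missing calls are assumed to be coded as 00, and the default cut-off of 3 standard deviations is the "mean ± 3SD" value listed on the Common Thresholds slide.

```python
from statistics import mean, stdev


def heterozygosity_rate(genotypes, missing="00"):
    """Fraction of called genotypes whose two alleles differ."""
    called = [g for g in genotypes if g != missing]
    if not called:
        return float("nan")
    heterozygous = sum(1 for g in called if g[0] != g[1])
    return heterozygous / len(called)


def heterozygosity_outliers(rates, n_sd=3.0):
    """Names of samples whose rate is more than n_sd SDs away from the mean."""
    values = list(rates.values())
    mu = mean(values)
    sd = stdev(values)
    return sorted(name for name, rate in rates.items() if abs(rate - mu) > n_sd * sd)


# The eight SNPs from the slide: two heterozygous genotypes (AT, CG) out of eight.
slide_example = ["AA", "CC", "AT", "CC", "CG", "AA", "AA", "TT"]
print(heterozygosity_rate(slide_example))  # 0.25
```

Both high outliers (possible contamination) and low outliers (possible inbreeding) are flagged, because the comparison uses the absolute deviation from the mean.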

Sex check

Looking for mislabelled samples by X chromosome heterozygosity:
  • Male: 1 allele
  • Female: 2 alleles

[Figure labels: Females (Good, Bad); Males (Good, Bad)]

Crossover between the X and Y chromosomes happens in the pseudoautosomal regions (PARs). SNPs in PARs are thus excluded from analysis.

Relatedness

  • Relatedness is a problem because of overrepresentation of selected alleles, which will bias any multivariate analysis (correlated data!), e.g. PCA or multivariate regression.
  • Related samples need to be excluded or taken into account during subsequent analyses.
  • One metric of relatedness is Identity By Descent (IBD), which involves calculating the proportion of alleles two individuals have in common.
  • Prior to the calculation of IBD, SNPs with a low call rate are permanently excluded, and rare SNPs (MAF<5%) and SNPs in Linkage Disequilibrium (LD) are temporarily excluded.

Relatedness / IBD

Relationship category        Relatedness
Monozygotic twins            1
Parent-Offspring             1/2
Full siblings                1/2
Grandparent-grandchild       1/4
Uncle/Aunt-Nephew/Niece      1/4
First cousins                1/8
Unrelated                    0

[Figure legend: Completely identical, Half-identical, Not identical; examples: "Me and my mom", "Me and my sister"]

Relatedness / IBD: a Ugandan cohort study as an example

[Figure: histogram. x-axis: Maximum IBD for each sample; y-axis: Count of sample pairs. Annotations: threshold for temporary exclusion prior to HWE check; first cousins (12.5%) etc.; siblings, parent-child; uncle-niece, aunt-nephew etc.; duplicates, identical twins]
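
The idea of comparing the alleles two individuals have in common can be illustrated with a short Python sketch. Note the assumption loudly: this computes the proportion of alleles shared identical by state (IBS), which is a much simpler quantity than the IBD estimate a tool such as PLINK reports, and even unrelated individuals share many alleles by state. The function names are hypothetical.

```python
def shared_alleles(g1, g2):
    """Number of alleles (0, 1 or 2) that two unordered genotypes have in common."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)  # each allele of g2 can be matched only once
            shared += 1
    return shared


def ibs_proportion(sample_a, sample_b, missing="00"):
    """Mean proportion of shared alleles over the SNPs called in both samples."""
    pairs = [(a, b) for a, b in zip(sample_a, sample_b) if missing not in (a, b)]
    if not pairs:
        return float("nan")
    return sum(shared_alleles(a, b) for a, b in pairs) / (2 * len(pairs))


# A duplicate sample (or an identical twin) shares every allele.
print(ibs_proportion(["AA", "AG", "CC"], ["AA", "GA", "CC"]))  # 1.0
```

Turning such sharing counts into an IBD estimate additionally requires population allele frequencies, which is why the SNP filtering steps described for the IBD calculation matter.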

SNP Call Rate/Proportion

                 SNP1   SNP2   SNP3   SNP4   SNP5
Sample1          00     AG     GG     GA     00
Sample2          00     GG     GG     AA     CC
Sample3          AC     00     GG     AA     CC
Sample4          AA     AG     GC     AA     CC
Sample5          AC     AA     00     AA     CA
SNP Call Rate    60%    80%    80%    100%   80%

Hardy Weinberg Equilibrium

Under random mating, allele frequencies p (allele A) and q (allele C) give the genotype frequencies:

                  Females A (p)   Females C (q)
Males A (p)       AA (p²)         AC (pq)
Males C (q)       AC (pq)         CC (q²)

Example with four samples:

                                      SNP1           SNP2           SNP3
Genotypes (Sample1 to Sample4)        AC AA AC CC    AA AA CC CC    AC AC AC AC
Allele frequencies
  f(A) = p                            4/8            4/8            4/8
  f(C) = q = 1-p                      4/8            4/8            4/8
Expected genotype frequencies
  fe(AA) = p²                         1/4            1/4            1/4
  fe(AC) = 2pq                        2/4            2/4            2/4
  fe(CC) = q²                         1/4            1/4            1/4
Observed genotype frequencies
  fo(AA)                              1/4            2/4            0/4
  fo(AC)                              2/4            0/4            4/4
  fo(CC)                              1/4            2/4            0/4

When HWE does not apply
  • Non-random mating
  • Selection forces
  • Alleles in disease causing loci: apply HWE only to controls in a case-control study
  • Migration
  • Data quality
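
The allele, expected and observed genotype frequencies in the Hardy Weinberg example can be reproduced with the Python sketch below. The function names are hypothetical. The chi-square comparison (1 degree of freedom) is an assumption added for illustration, since the slides do not say which test is used; with only four samples it is a poor approximation, and QC tools generally rely on more careful tests.

```python
import math


def hwe_frequencies(genotypes):
    """Allele, expected and observed genotype frequencies for one A/C SNP."""
    n_aa = genotypes.count("AA")
    n_ac = genotypes.count("AC") + genotypes.count("CA")
    n_cc = genotypes.count("CC")
    n = n_aa + n_ac + n_cc  # missing calls ("00") are ignored
    p = (2 * n_aa + n_ac) / (2 * n)  # f(A)
    q = 1 - p                        # f(C)
    expected = {"AA": p * p, "AC": 2 * p * q, "CC": q * q}
    observed = {"AA": n_aa / n, "AC": n_ac / n, "CC": n_cc / n}
    return n, p, expected, observed


def hwe_chi_square(genotypes):
    """Chi-square statistic and p-value (1 df) for observed vs expected counts."""
    n, _, expected, observed = hwe_frequencies(genotypes)
    chi2 = sum(
        n * (observed[g] - expected[g]) ** 2 / expected[g]
        for g in expected
        if expected[g] > 0
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


for name, calls in [("SNP1", ["AC", "AA", "AC", "CC"]),
                    ("SNP2", ["AA", "AA", "CC", "CC"]),
                    ("SNP3", ["AC", "AC", "AC", "AC"])]:
    print(name, hwe_frequencies(calls)[3], hwe_chi_square(calls))
```

SNP1 matches its expected frequencies exactly, while SNP2 (no heterozygotes) and SNP3 (only heterozygotes) deviate in opposite directions.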

Principal Component Analysis

A statistical technique for summarizing many variables with minimal loss of information.

PCA requires clean non-correlated data.

PCA can reveal:
  • Population outliers
  • Population structure / Confounding

PCA and Population Outliers

[Figure labels: Africa, Europe, Asia, Study samples]
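
To make the idea of summarizing many variables concrete, here is a small pure-Python sketch that extracts the first principal component of a genotype matrix by power iteration. It is an illustration only and not how EIGENSTRAT works internally. The 0/1/2 allele-count coding, the function name and the toy data are all assumptions.

```python
import math


def first_principal_component(matrix, n_iter=200):
    """Unit-length PC1 scores, one per sample.

    matrix: list of rows (samples), each a list of allele counts (0, 1 or 2)
    per SNP. The sign of the returned vector is arbitrary.
    """
    n_samples = len(matrix)
    n_snps = len(matrix[0])
    # Centre each SNP (column) on its mean.
    means = [sum(row[j] for row in matrix) / n_samples for j in range(n_snps)]
    centred = [[row[j] - means[j] for j in range(n_snps)] for row in matrix]
    # Sample-by-sample similarity matrix K = X X^T.
    k = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred] for r1 in centred]
    # Power iteration converges to the leading eigenvector of K.
    v = [1.0 + i for i in range(n_samples)]  # deterministic, non-constant start
    for _ in range(n_iter):
        w = [sum(k[i][j] * v[j] for j in range(n_samples)) for i in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


# Toy data: two groups of three samples with different allele counts.
toy = [
    [0, 0, 0, 2, 2], [0, 0, 1, 2, 2], [0, 1, 0, 2, 2],
    [2, 2, 2, 0, 0], [2, 2, 1, 0, 0], [2, 1, 2, 0, 0],
]
print([round(score, 2) for score in first_principal_component(toy)])
```

On the toy data the two groups end up with PC1 scores of opposite sign, which is the kind of separation that shows up as population clusters or outliers in a PCA plot.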

Why population structure / confounding is a problem in a case-control study

[Figure labels: Causal allele, Non-causal allele. Source: Balding, Nature Reviews Genetics (2006)]

Confounding due to chip effects

[Figure labels: EAST QUAD, EAST OCTO]

Confounding due to plate effects

[Figure: PC 1 against Plate Number]

Population structure - Inflated QQ-plot

[Figure: QQ-plot. x-axis: Expected test statistic; y-axis: Observed test statistic. Annotation: a measure of the deviation from the diagonal]

Simple But Effective QC: Common Thresholds

Sample QC
  Sample Call Rate                      >97%
  Autosomal Heterozygosity              mean ± 3SD
  Sex / Gender                          PLINK default thresholds
  Identity By Descent (IBD)             <0.05
  Principal Component Analysis (PCA)    EIGENSTRAT, 6SD, PCs 1-10

SNP QC
  SNP Call Rate/Proportion              >97%
  Hardy Weinberg Equilibrium (HWE)      p > 10⁻⁴

Useful references

[Reference slides not recoverable]

Useful software
  • PLINK (QC): http://pngu.mgh.harvard.edu/~purcell/plink
  • EIGENSTRAT (PCA): http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html
  • GEMMA (Association): http://home.uchicago.edu/xz7/software
  • SNPTEST (Association): https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
  • shellfish (Parallel PCA and data processing for genome-wide SNP data): http://www.stats.ox.ac.uk/~davidson/software/shellfish/shellfish.php

Thank you for your attention!

Deepti Gurdasani
Liz Young
Katherine Ripullone
