Linkage and Association
John P. Rice, Ph.D.
Washington University School of Medicine

Outline
-Linkage
-Linkage Disequilibrium
-Haplotypes
-History of GWAS
-dbGaP
-Methods
-Genomic Inflation Factor
-False Discovery Rate
-Ethnic Stratification
-QQ-Plots

Definition of centimorgan (cM)
[Diagram: two loci on a chromosome pair; one chromosome carries A1 and B1, the other A2 and B2]
Gametes:
-A1 B2, A2 B1 are recombinants
-A1 B1, A2 B2 are non-recombinants
-$\theta$ = Prob(recombinant) = .01 means A and B are 1 cM apart

Genome Arithmetic

-Kb = 1,000 bases; Mb = 1,000 Kb
-3.3 billion base pairs; 3,300 cM in genome
-3,300,000,000 / 3,300 = 1 Mb/cM
-33,000 genes: 33,000 / 3,300 Mb = 10 genes/Mb
-Thus, a 20 cM region may have 200 genes to examine
-Erratum: closer to 20,000 genes in humans
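The arithmetic above can be checked with a short Python sketch. It assumes genes are spread evenly along the genome, which is only a rough approximation. The function name `genes_in_region` is hypothetical, and the inputs are the round numbers quoted on the slide.

```python
# Back-of-envelope genome arithmetic using the slide's round numbers.
GENOME_BP = 3_300_000_000   # base pairs in the genome
GENOME_CM = 3_300           # genetic length of the genome in centimorgans


def genes_in_region(region_cm, total_genes):
    """Expected gene count in a region, assuming genes are evenly spread."""
    mb_per_cm = GENOME_BP / GENOME_CM / 1_000_000   # 1 Mb per cM
    genome_mb = GENOME_BP / 1_000_000               # 3,300 Mb
    genes_per_mb = total_genes / genome_mb
    return region_cm * mb_per_cm * genes_per_mb


print(genes_in_region(20, 33_000))   # gene count used on the slide: 200 genes
print(genes_in_region(20, 20_000))   # erratum gene count: about 121 genes
```

With the erratum's 20,000 genes, a 20 cM linkage region still leaves on the order of a hundred genes to examine.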

Linkage Vs. Association
Linkage:
-Disease travels with marker within families
-No association within individuals
-Signals for complex traits are wide (20 Mb)
Association:
-Can use case/control or case/parents design
-Only works if association in the population
-Allelic heterogeneity (e.g., BRCA1) is a problem
Linkage: large scale; Association: fine scale (<200 kb)

LOD Score
-LOD score is log10(odds for linkage / odds for no linkage)
-Traditional (1955) cut-off is LOD = 3 (linkage 1000 times more likely)
-A LOD of 3 corresponds to a pointwise p-value of 0.0001
Lander and Kruglyak (1995)
-A LOD score cut-off of 3.6 for a genome screen using an infinitely dense map corresponds to a genome-wide significance of 0.05
-This is the criterion often cited today

Effective Number of Tests
For genome-wide p = .05

Marker Spacing   LOD    P-value   N effective
10 cM            2.88   .000135     370
5 cM             3.06   .000088     568
2 cM             3.24   .000057     877
1 cM             3.35   .000044   1,136
0.1 cM           3.63   .000022   2,273

Bipolar Disorder
-Lifetime prevalence of BP I 1%, BP II 0.5%
-Risk of suicide 10-15%
-Treatment not curative; treatments not completely effective in mitigating symptoms
-Heritability estimates 80%
-Linkage reports for the chromosomes, with a lack of replication
-Lack of power in original reports?

Significant and Suggestive Linkage
-Given density of markers, significant linkage is LOD > 3.03
-Suggestive linkage is LOD > 1.75
-These take into account that 2 genome screens were analyzed (narrow and broad)
-Significant: occurs once in twenty genome screens
-Suggestive: occurs once in a genome screen

Chromosome 6

Linkage Analysis (Summary)
-Approximately 2,000 independent tests with an infinitely dense genetic map (multiple testing is a much bigger problem in GWAS)
-Linkage studies have been unsuccessful for complex diseases
-May be useful as input into GWAS analysis?
-Today GWAS (using SNP chips) have taken over
-My opinion: pursue chromosomes 6 and 8, even if not genome-wide significant in GWAS

Genome-Wide Association Studies (GWAS)
-Chips by Illumina and Affymetrix genotype 1 million SNPs (Single Nucleotide Polymorphisms) as well as CNVs (Copy Number Variations)
-Affordable on a large scale
-Capitalize on Linkage Disequilibrium between the markers and variation at a susceptibility gene
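The LOD cut-offs, pointwise p-values and effective numbers of tests quoted in the linkage slides, and the heavier multiple-testing burden of a 1 million SNP chip, can be related with a small Python sketch. It assumes the usual large-sample approximation that 2 ln(10) x LOD behaves like a 1-df chi-square tested one-sidedly; the slides do not say how their table was computed, so this is an assumption, and the function names are hypothetical. Under that assumption the sketch roughly reproduces the table (small differences come from rounding).

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
_TWO_LN10 = 2.0 * math.log(10.0)


def lod_to_pvalue(lod):
    """Pointwise one-sided p-value for a LOD score (large-sample approximation)."""
    z = math.sqrt(_TWO_LN10 * lod)
    return 1.0 - _STD_NORMAL.cdf(z)


def pvalue_to_lod(p):
    """LOD score whose pointwise one-sided p-value is p (inverse of the above)."""
    z = _STD_NORMAL.inv_cdf(1.0 - p)
    return z * z / _TWO_LN10


def effective_tests(lod, genome_wide_alpha=0.05):
    """Number of independent tests for which this LOD gives the genome-wide alpha."""
    return genome_wide_alpha / lod_to_pvalue(lod)


# LOD values taken from the effective-number-of-tests table.
for lod in (2.88, 3.06, 3.24, 3.35, 3.63):
    print(lod, round(lod_to_pvalue(lod), 6), round(effective_tests(lod)))

# Bonferroni threshold for 1 million SNPs, and the LOD it would correspond to.
gwas_p = 0.05 / 1_000_000
print(gwas_p, round(pvalue_to_lod(gwas_p), 2))
```

Going from about 2,000 effective tests in a linkage screen to 1 million SNP tests moves the per-test threshold from roughly 2.5 x 10^-5 to 5 x 10^-8 under a plain Bonferroni correction.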

Disequilibrium
-Let $P(A_1) = p_1$
-Let $P(B_1) = q_1$
-Let $P(A_1 B_1) = h_{11}$
-No association if $h_{11} = p_1 q_1$
-$D = h_{11} - p_1 q_1$

Linkage Disequilibrium:
-Linkage
-Random Genetic Drift
-Founder Effect
-Mutation
-Selection
-Population admixture/stratification

Population Stratification

Population 1        Population 2
  1     9             25    25
  9    81             25    25
Odds ratio = 1      Odds ratio = 1

Combined Population
 26    34
 34   106
Odds ratio = 2.38

Linkage Disequilibrium
[Diagram: two loci on a chromosome pair; one chromosome carries A1 and B1, the other A2 and B2]
Gametes:
-A1 B2, A2 B1 are recombinants
-A1 B1, A2 B2 are non-recombinants
-$\theta$ = P(recombinant)
Consider haplotype $A_i B_j$ with frequency $h_{ij}^{0}$ in generation 0; what is the frequency in the next generation?
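One standard textbook answer to the question above, assuming random mating in a large population with no selection, mutation or drift, is $h_{ij}^{1} = (1 - \theta)\, h_{ij}^{0} + \theta\, p_i q_j$, so the disequilibrium $D$ shrinks by a factor $(1 - \theta)$ each generation. The Python sketch below illustrates that recursion; the function names are hypothetical and the numbers are made up for illustration.

```python
def next_generation(h, p, q, theta):
    """Haplotype frequency after one generation of random mating.

    h: current frequency of haplotype Ai Bj
    p, q: allele frequencies of Ai and Bj
    theta: recombination fraction between the two loci
    """
    return (1.0 - theta) * h + theta * p * q


def disequilibrium_after(d0, theta, generations):
    """Closed form: D is multiplied by (1 - theta) every generation."""
    return d0 * (1.0 - theta) ** generations


# Illustrative start: complete LD between two alleles of frequency 0.5,
# with the loci 1 cM apart (theta = 0.01).
p, q, theta = 0.5, 0.5, 0.01
h = 0.5
for generation in range(1, 4):
    h = next_generation(h, p, q, theta)
    print(generation, h, h - p * q)   # haplotype frequency and D = h - p*q
```

At 1 cM the decay is slow: D loses only 1% of its value per generation, which is why LD between close markers can persist for many generations.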

$D'$ and $r^2$
-$D$ tends to take on small values and depends on marginal gene frequencies
-$D' = D / \max(D)$
-$r^2 = D^2 / (p_1 p_2 q_1 q_2)$ = square of the usual correlation coefficient
-Note: $r^2 = 0 \iff D' = 0$
-$D' = 1$ if one cell is zero (e.g., no recombination)
-$r^2$ can be small even when $D' = 1$
-Prediction of one SNP by another depends on $r^2$
Examples: $D' = 1$, $r^2 = .1$; $D' = 1$, $r^2 = .01$

Haplotypes
-We measure genotypes
-A double heterozygote is ambiguous
-Must estimate haplotype frequencies from genotype frequencies; usually assume random mating and use the EM algorithm
-The program Haploview is commonly used to estimate and depict LD
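The EM step mentioned above can be sketched for two SNPs in Python. This is a minimal illustration under the random mating assumption, not the algorithm as implemented in Haploview; the function name and the layout of the `counts` table are hypothetical. Only the double heterozygote is ambiguous, so each iteration splits those individuals between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
def em_haplotype_freqs(counts, max_iter=1000, tol=1e-12):
    """Estimate (h11, h12, h21, h22) for haplotypes A1B1, A1B2, A2B1, A2B2.

    counts[a][b] = number of people carrying a copies of allele A1 and
    b copies of allele B1 (a and b run over 0, 1, 2).
    """
    n = sum(sum(row) for row in counts)
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Haplotype counts that are known for certain (all unambiguous genotypes).
    known = [
        2 * counts[2][2] + counts[2][1] + counts[1][2],  # A1B1
        2 * counts[2][0] + counts[2][1] + counts[1][0],  # A1B2
        2 * counts[0][2] + counts[1][2] + counts[0][1],  # A2B1
        2 * counts[0][0] + counts[1][0] + counts[0][1],  # A2B2
    ]
    double_het = counts[1][1]
    h = [0.25, 0.25, 0.25, 0.25]
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is A1B1/A2B2.
        cis = h[0] * h[3]
        trans = h[1] * h[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        expected = [
            known[0] + double_het * w,
            known[1] + double_het * (1.0 - w),
            known[2] + double_het * (1.0 - w),
            known[3] + double_het * w,
        ]
        new_h = [c / (2.0 * n) for c in expected]
        converged = max(abs(a - b) for a, b in zip(h, new_h)) < tol
        h = new_h
        if converged:
            break
    return tuple(h)


# Made-up data: 30 A2A2/B2B2, 40 double heterozygotes, 30 A1A1/B1B1.
example = [[30, 0, 0], [0, 40, 0], [0, 0, 30]]
print(em_haplotype_freqs(example))
```

In this made-up example every unambiguous person carries only A1B1 or A2B2, so EM assigns essentially all double heterozygotes to the A1B1/A2B2 pair.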

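Note:
[Diagram: Person 1 carries haplotypes A1 B1 and A2 B2; Person 2 carries haplotypes A1 B2 and A2 B1]
-Different haplotypes; same genotypes (A1 A2, B1 B2)
-Haplotypes: A1 B1, A2 B2; A1 B2, A2 B1

-Independence: $h_{ij} = p_i q_j$
-Positive Association: $h_{ij} > p_i q_j$
-Negative Association: $h_{ij} < p_i q_j$

Assume random mating but allow for disequilibrium

Genotype   $B_1B_1$          $B_1B_2$                          $B_2B_2$
$A_1A_1$   $h_{11}^2$        $2h_{11}h_{12}$                   $h_{12}^2$
$A_1A_2$   $2h_{11}h_{21}$   $2h_{11}h_{22} + 2h_{12}h_{21}$   $2h_{12}h_{22}$
$A_2A_2$   $h_{21}^2$        $2h_{21}h_{22}$                   $h_{22}^2$

Haplotype   $A_1B_1$   $A_1B_2$   $A_2B_1$   $A_2B_2$
Frequency   $h_{11}$   $h_{12}$   $h_{21}$   $h_{22}$

[Figure: $D'$ plot from Haploview]

Blocks and Bins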

-Predictability of one SNP by another is best described by $r^2$ (basic statistics)
-Block: set of SNPs with all pair-wise LD high (usually defined in terms of $D'$)
-If one uses $r^2$ and inserts a SNP with low frequency in between SNPs with frequencies close to 0.5, then the block breaks up!
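The point about a low-frequency SNP can be illustrated numerically. The Python sketch below computes $D$, $D'$ and $r^2$ from one haplotype frequency and the two allele frequencies, using the definitions given earlier; the function name is hypothetical and the frequencies are made up for illustration.

```python
def ld_measures(h11, p1, q1):
    """Return (D, D', r^2) for alleles A1 (frequency p1) and B1 (frequency q1).

    h11 is the frequency of haplotype A1B1; both SNPs must be polymorphic.
    """
    p2, q2 = 1.0 - p1, 1.0 - q1
    d = h11 - p1 * q1
    # Largest |D| the allele frequencies allow, given the sign of D.
    d_max = min(p1 * q2, p2 * q1) if d >= 0 else min(p1 * q1, p2 * q2)
    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p1 * p2 * q1 * q2)
    return d, d_prime, r_squared


# Two common SNPs (frequencies 0.5) in complete LD: D' = 1 and r^2 = 1.
print(ld_measures(0.5, 0.5, 0.5))

# A rare allele (frequency 0.05) that always sits on the B1 background:
# D' is still 1, but r^2 is only about 0.05.
print(ld_measures(0.05, 0.05, 0.5))
```

The rare SNP is never seen on a recombinant haplotype, so a $D'$-based block would keep it, while its low $r^2$ with the common SNPs breaks an $r^2$-based block.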

-Perlegen (Hinds et al, Science, 2005): use bins where a tag SNP has $r^2$ of 0.8 with all other SNPs. Bins may not be contiguous.

Summary (Blocks and Bins)
-Blocks using $D'$ may have a biological interpretation (long stretches with $|D'| = 1$ indicate no recombination)
-Selection of tag SNPs is a statistical issue: we want to predict untyped SNPs from those that are typed, and $r^2$ is the natural measure
-Most current WGA studies use bins based on $r^2$ (typically $r^2 > 0.8$)
-With reduced $r^2$, the sample size needed is $N / r^2$

Analysis
-Case/control studies are common. Use logistic regression with case/control status as the dependent variable.
-Use SNP genotype as an independent variable with other covariates and test one SNP at a time
-PLINK is my program of choice to do this
-Family based studies are also used. TDT (case and both parents) designs are used in GWAS but are less efficient

SNP Marker Coding:

Genotype   $X_1$
1/1        0
1/2        1
2/2        2

Testing Marker Effects
-$\log(\text{odds}) = \alpha + \beta_1 X_1$
-$\text{odds} = e^{\alpha} e^{\beta_1 X_1}$

Genotype   11             12                         22
Odds       $e^{\alpha}$   $e^{\alpha} e^{\beta_1}$   $e^{\alpha} e^{2\beta_1}$

-Test $\beta_1 = 0$: all odds $= e^{\alpha}$
-Note: No dominance effect

SNP Marker Coding:

Genotype   $X_1$   $X_2$
11         0       0
12         1       1
22         2       0

Testing Marker Effects
-$\log(\text{odds}) = \alpha + \beta_1 X_1 + \beta_2 X_2$
-$\text{odds} = e^{\alpha} e^{\beta_1 X_1} e^{\beta_2 X_2}$

Genotype   11             12                                     22
Odds       $e^{\alpha}$   $e^{\alpha} e^{\beta_1} e^{\beta_2}$   $e^{\alpha} e^{2\beta_1}$

-Test $\beta_1 = \beta_2 = 0$: all odds $= e^{\alpha}$
-If $\beta_2 = 0$, then have additive model

Haplotypes?
-We may wish to consider more than one SNP at a time in the regression.
-More information in a set of close SNPs

-May wish to study a set of SNPs to see if one explains the case/control difference, i.e., does the evidence for one SNP disappear when controlling for other SNPs.

Haplotype Trend Analysis
Zaykin et al (2002) Hum Hered 53:79-91
-Use haplotypes in logistic regression
-For a pair of SNPs, there are 4 haplotypes, so there will be 3 dummy variables
-Assume the pair of haplotypes in an individual are additive, so only need 3 regression coefficients
If haplotypes are known with certainty, then:

Haplotype   $X_1$   $X_2$   $X_3$
h1/h1       2       0       0
h1/h2       1       1       0
h1/h3       1       0       1
h1/h4       1       0       0
h2/h2       0       2       0
h2/h3       0       1       1
h2/h4       0       1       0
h3/h3       0       0       2
h3/h4       0       0       1
h4/h4       0       0       0

Estimated Haplotypes
-One can get estimates of the haplotype probabilities for each individual (LD between SNPs OK)
-Put the estimated probabilities into the logistic regression

GWAS Studies
How do we keep up?

A Catalog of Published GWAS
www.genome.gov/26525384
Number of Studies:

Year   Studies
2005    2   (includes Age-related Macular Degeneration)
2006    8
2007   87
2008   70   (through July 27)

-Bipolar Disorder: 3 studies (1 used pooled genotypes)
-No convincing signals

History of GWAS
-Early studies used pooled designs: too expensive to do individual genotypes
-Affymetrix and Illumina come out with affordable SNP chips
-First study to generate enthusiasm: Age-related macular degeneration (Klein, 2005) found a real signal
-Type II diabetes studies found real signals; linkage studies were problematic

Wellcome Trust (WTCCC) Initiative
-Common set of 3,000 controls
-Several disorders (including Bipolar) with 2,000 cases each
-Results in the public domain
-Published in Nature in 2007

Major U.S. GWAS Initiatives
-New NIH Policy: all NIH-funded GWAS studies must deposit individual genotypes and phenotypic data in dbGaP at NCBI
-GAIN and GEI RFAs funded studies with existing DNA, subjects consented to allow data to go to dbGaP, and genotyping done at associated genotyping centers
-New RFA from NIMH to collect very large (~10,000) samples

GAIN Proposals
-Genetic Association Information Network
-6 WGA projects were selected across NIH
Projects:
-Schizophrenia
-Bipolar Disorder
-Depression
-ADHD
-Psoriasis
-Type 1 Diabetes (nephropathy)
Data at dbGaP (1 year embargo on publication)
Note: 4/6 Mental Health related!!

Gene Environment Initiative (GEI)
-8 GWAS funded: oral cleft, addiction, coronary heart disease, lung cancer, type 2 diabetes, birth weight, dental caries, premature birth
-Required existing DNA and subjects consented to share
-Issued Supplement for replication samples
-Addiction (Bierut) samples genotyped first: we got genotypes from CIDR in May; once cleaned, they go to dbGaP

Good News for Analysts
-Cleaned data available: goes to investigators who collected data at the same time as everyone else
-It takes years to collect subjects
-Cleaning GWAS data is hard and time consuming
-Opportunity for combining data from multiple studies
-Is this fair?

dbGaP
-Genotype and Phenotype Database
-Data made available to investigators and others at the same time: 1 year publication embargo
-Request access using eRA Commons sign-on; requires Institutional sign-off
-Request must be approved by a DAC (data access committee)

Some statistical and data management issues
Genomic Inflation Factor
-We illustrate with admixed schizophrenia data (CATIE) where we don't control for ethnicity

Genomic inflation factor ($\lambda$)
-When testing 300K to 1M SNPs, most tests are under the null
-Median chi-square (1 df) should be .455
-$\lambda$ = median chi-sq / .455
-Can use $\lambda$ to correct chi-sqs for this inflation
-Better to look for the source (e.g., ethnic admixture) and correct for that

-Unzipped (binary) file is 185 MB
-495,163 SNPs analyzed
-Total time: 9 min!
-Terrible lambda
-Note: Mixture of EUs and AAs

Plink Output P-values
-Uncleaned, admixed data: small p-values are an artifact.
-Wellcome Trust used significance level of $5 \times 10^{-7}$ based on Bayesian arguments
-Bonferroni correction treats the tests as independent, so it is conservative for SNPs in LD
-PLINK also computes q-values based on FDR (false discovery rate)

False Discovery Rate (FDR)
-V = # true null hypotheses called significant
-S = # false null hypotheses called significant
-Q = V/(V + S) (false positives / all positives)
-FDR = E(Q)
Benjamini & Hochberg (1995)
-When testing m hypotheses $H_1, \dots, H_m$, order the p-values $p_{(1)} \le \dots \le p_{(m)}$ and let k be the largest i for which $p_{(i)} \le (i/m)\, q^*$
-Then reject $H_{(1)}, \dots, H_{(k)}$
-Theorem: the above controls FDR at $q^*$
-Computer program: QVALUE; also computed by PLINK

Interpretation of FDR
-If the q-value is 0.1, about 1 in 10 of the results called significant is expected to be a false positive.
-If we identify 10 SNPs and 9 are real and 1 is a false positive: major success.
-The usual experiment-wise error rate (Bonferroni correction) guards against even one false positive at the chosen significance level.
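Two of the quantities discussed in this section, the genomic inflation factor and the Benjamini-Hochberg step-up rule, can be sketched together in Python. This is an illustration of the definitions only, not the code PLINK or QVALUE actually runs; the function names are hypothetical and the p-values are made up. The median of a 1-df chi-square is computed as the square of the 75th percentile of a standard normal, which is about 0.455.

```python
from statistics import NormalDist, median

# Median of a 1-df chi-square: (75th percentile of a standard normal) squared.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation_factor(chi_squares):
    """Lambda = observed median 1-df chi-square / expected median under the null."""
    return median(chi_squares) / CHI2_1DF_MEDIAN


def benjamini_hochberg(p_values, q_star=0.05):
    """Indices of hypotheses rejected by the BH step-up rule at level q_star."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose ordered p-value is under its threshold.
        if p_values[idx] <= rank / m * q_star:
            k = rank
    return sorted(order[:k])


# Made-up p-values for ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q_star=0.05))
print(round(CHI2_1DF_MEDIAN, 3))
```

Because the rule is step-up, a hypothesis can be rejected even if its own p-value exceeds its threshold, as long as a larger ordered p-value passes.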

Some statistical and data management issues
Population stratification
-Perform principal components analysis (10,000 markers probably enough), and plot your samples along with HapMap samples
-Eigenstrat is commonly used
-We illustrate with NIMH repository control data from subjects who self-report as white

Problem Samples (to be removed)
-One subject clusters with Yoruba sample
-A handful of subjects trail off to Asian sample. Some reported American Indian ancestry
-In addition, several samples had phenotypic sex differ from genetic sex: probably sample swaps

Cleaning of GENEVA addiction GWAS data (SAGE)
-Illumina 1M (1 million SNP) chips were run at CIDR
-Data should be at dbGaP in a few weeks
-We just completed cleaning, but haven't received the final data

Study Design
-Case/Control (4,400 individuals)
-Samples come from 3 studies: Alcohol Dependence (COGA), Nicotine Dependence (COGEND), Cocaine Dependence (FSCD)
-Cases have a diagnosis of alcohol dependence
-Controls do not have a dx of alc, nic, or cocaine dependence; must have drunk alcohol
-Mixture of EUs, AAs and Hispanics

Primary Model
Dependent variable(s):
-Case control status (diagnosis of alcohol dependence): simple logistic model
Independent variables:
-Genotype (1 df trend test)
-EU vs AA vs Hispanic (Asians, Mixed, etc. excluded)
-Study (alc, cocaine, nicotine)
-Gender
Test each SNP with 1 df

Relatedness

-Identify unexpected relatedness, correct pedigrees and identify one representative from each family
-Use IBD (Identity by Descent)
-Two individuals can share 0, 1 or 2 alleles from a common ancestor
-MZ twins (or duplicates) always share 2 alleles IBD; parent-offspring pairs always share 1 allele IBD, etc.
-PLINK can estimate these probabilities from the SNP data (which is IBS data since parents are not genotyped)

Prob of IBD by Relationship

Relationship                             Z2     Z1     Z0     kinship
MZ twin (or duplicate)                   1      0      0      0.5
parent-offspring                         0      1      0      0.25
full siblings                            0.25   0.5    0.25   0.25
half siblings                            0      0.5    0.5    0.125
avuncular (uncle/aunt - niece/nephew)    0      0.5    0.5    0.125
grandparent - grandchild                 0      0.5    0.5    0.125
great grandparent - great grandchild     0      0.25   0.75   0.0625

We found unexpected relatedness
-Duplicates: 8 subjects were both in FSCD and COGA
-This will be documented by dbGaP

-Some full sibs were selected for SAGE and were known
-Others were identified in cleaning
-Other unexpected relatedness found
-Data from extra samples will be distributed by dbGaP

Aneuploidy
-Normal male XY; normal female XX
-Phenotypically male if at least one Y chromosome
-Found XXY (male who genotypes like a female), XYY, XO individuals, mosaics
-Most of this is due to DNA from cell lines
-Some detected by looking at intensity plots
[Figure: CIDR intensity plot. Legend: X0/XX = magenta, XYY = purple, XXY = skyblue, X0 = yellow, XXX = black, XY/XXY/XYY = green. Labeled clusters: XYY, XXY, XY/X0?, X0, XX/X0]

Population structure
-Assign samples to population groups for allele frequency estimation, HW testing, etc.
-Alternatively, produce quantitative covariates to control for population admixture
-Use the program Eigenstrat to perform Principal Component Analysis
[Figure labels: Asian, Admixture]
-First PC separates EUs and AAs
-Second PC separates Hispanics
-Some self-reported ethnicities were in error and turned out to be data entry mistakes
-One unexpected Asian was found

Hardy-Weinberg Equilibrium (HWE)
-Let a SNP have two alleles 1, 2 with frequencies p and q = 1 - p, respectively.
-The SNP is in HWE if the genotypic frequencies are $p^2$, $2pq$, and $q^2$ for genotypes 11, 12, 22.
-Hardy and Weinberg showed a population reaches HWE in a single generation of random mating.
-Usually see HWE for markers.

HWE
-Filter out SNPs with p < $10^{-6}$ when testing for HWE
-Note: test done separately within ethnic groups; mixing populations with different allele frequencies leads to non-HWE
-CNVs (copy number variations) can cause non-HWE
-Bottom line: always inspect intensity plots for signals of interest.

Intensity Plot: good SNP
[Figure: intensity plot]

Uniform Distribution
[Figure: density of p on the interval 0 to 1]
-If we perform N independent statistical tests for which all null hypotheses are true, we expect a uniform distribution of p-values.

QQ-plot of association test
-When we test 1 million SNPs, most are not truly associated.
-Plot -log(p) for observed tests against the values expected under a uniform distribution as a final check

Genomic inflation factor
-If using a chi-square test with 1 df, median value should be 0.455.
-$\lambda$ = observed median / .455.
-Usually correct chi-sq by dividing by $\lambda$
-Always best to control for pop admixture, eliminate CNVs, etc. first
-$\lambda$ = 1.045

GENEVA Acknowledgement

-U. Washington: Bruce Weir, Thomas Lumley, Ken Rice, Tushar Bhangale, Xiuwen Zheng, Ian Painter, Fred Boehm, Cathy Laurie
-CIDR: Kim Doheny, Elizabeth Pugh, Kurt Hetrick
-NCBI: Justin Pashall, Mike Feolo, Stephanie Pretel
-Washington U.: Laura Bierut, John Rice, Nancy Saccone, Sherri Fisher
-NHGRI: Emily Harris, Teri Manolio

Conclusions
-GWAS has already been successful for many complex traits; linkage has not been
-Many GWAS are in progress
-We use PLINK and SAS for data management, data cleaning and analysis
-The only way to learn this is to really be involved in one
-Availability at dbGaP is a major event: can't herd cats, but you can move their food

Final Words
-Current GWAS: Chi-Square on steroids
-Only pick low-hanging fruit: genome-wide significant; test one SNP at a time
-How to identify true signals mixed in with noise due to chance?
-How to identify gene-gene interactions and G x E interactions?
-Where is the heritability of 50-80%?