Next Generation Sequencing
Nadia Atallah

A Next Generation Sequencing (NGS) Refresher
  • Became commercially available in 2005
  • Construction of a sequencing library, then clonal amplification to generate sequencing features
  • High degree of parallelism
  • Uses micro- and nanotechnologies to reduce the size of sample components
      • Reduces reagent costs
      • Enables massively parallel sequencing reactions
  • Revolutionary: has brought high speed to genome sequencing
  • Changed the way we do research and medicine

RNA-Seq
  • High-throughput sequencing of RNA
  • Allows for quantification of gene expression and differential expression analyses
  • Characterization of alternative splicing
  • Annotation: the goal is to identify genes and gene architecture
  • De novo transcriptome assembly: no genome sequence necessary!

RNA-seq workflow
  1. Design Experiment: set up the experiment to address your specific biological questions; meet with your bioinformatician and sequencing center!!!
  2. RNA preparation: isolate RNA; purify RNA
  3. Prepare Libraries: convert the RNA to cDNA; add sequencing adapters
  4. Sequence: sequence the cDNA using a sequencing platform
  5. Analysis: quality control; align reads to the genome/assemble a transcriptome; downstream analysis based on your questions

Design Experiment: Replication
  • Number of replicates depends on various factors: cost, complexity of experimental design (how many factors are of interest), availability of samples
  • Biological replicates
      • Sequencing libraries from multiple independent biological samples
      • Very important in RNA-seq differential expression analysis studies
      • At least 3 biological replicates needed to calculate statistics such as p-values
  • Technical replication
      • Sequencing multiple libraries from the same biological sample
      • Allows estimation of non-biological variation
      • Not generally necessary in RNA-seq experiments
      • Technical variation is more of an issue only for lowly expressed transcripts
  [Figure labels: Mouse 1, Mouse 2, Mouse 3; Sample 1, Sample 2, Sample 3]

Design Experiment

Pooling Samples in RNA-seq
  • Can be beneficial if tissue is scarce or enough RNA is tough to obtain
  • Utilizes more samples; could increase power due to reduced biological variability
  • The danger is pooling bias (a difference between the value measured in the pool and the mean of the values measured in the corresponding individual replicates)
  • A positive result may be due to only one sample in the pool
  • Small alterations may disappear when only 1 sample has a different transcriptome profile than the others in the pool
  • Generally it is better to use one sample per biological replicate
  • If you must pool, try to use the same amount of material per sample in the pool
  • One evaluation of two pooling strategies (3 or 8 biological replicates per pool; two pools per group) found pooling bias and low positive predictive value of DE analysis in pooled samples

Design Experiment: Single-end versus paired-end
  • Reads = the sequenced portion of cDNA fragments
  • Single-end = cDNA fragments are sequenced from only one end (1x100)
  • Paired-end = cDNA fragments are sequenced from both ends (2x100)
  • Paired-end is important for de novo transcriptome assembly and for identifying transcriptional isoforms
  • Less important for differential gene expression if there is a good reference genome
  • Don't use paired-end reads for sequencing small RNAs
  • Note on read length: long reads are important for de novo transcript assembly and for identifying transcriptional isoforms; they are not required for differential gene expression if there is a good reference genome

Design Experiment: Sequencing Depth
  • How deep should I sequence?
  • Depth = (read length)(number of reads) / (haploid genome length)
  • Each library prep method suffers from specific biases and results in uneven coverage of individual transcripts; to get reads spanning the entire transcript, more reads (deeper sequencing) are required
  • Depends on experimental objectives
      • Differential gene expression? Get enough counts of each transcript such that accurate statistical inferences can be made

      • De novo transcriptome assembly? Maximize coverage of rare transcripts and transcriptional isoforms
      • Annotation?
      • Alternative splicing analysis?
  [Figure: axis label "Million Reads"]
  1) Liu Y., et al. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30(3):301-304 (2014)
  2) Liu Y., et al. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One 8(6):e66883 (2013)
  3) Bentley D. R., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53-59 (2008)
  4) Rozowsky J., et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotech. 27:65-75 (2009)

Design Experiment: Strand Specificity
  • Strand-specific = you know whether the read originated from the + or - strand
  • Important for de novo transcript assembly
  • Important for identifying true anti-sense transcripts
  • Less important for differential gene expression if there is a reference genome
  • Knowledge of strandedness may help assign reads to genes adjacent to one another but on opposite strands

Design Experiment: RNA-seq experimental design summary
  • Very important step: if done incorrectly, no amount of statistical expertise can glean information out of your data!!!
  • Biological replicates
      • For differential expression I generally recommend at least 3; this allows you to estimate variance and p-values
  • Technical replicates
      • Generally not necessary in RNA-seq experiments
  • Depth of sequencing
      • Depends on your experimental goals and organism!
  • Length of reads
      • Longer reads = better alignments
      • Longer reads = more expensive
  • Paired-end or single-end?
      • Paired-end = better alignment
      • Paired-end = more expensive
  • Pooling
      • Not ideal but sometimes necessary
  • Strand-specific?
      • Definitely for antisense transcript identification and de novo transcriptome assembly
      • Not necessary for differential gene expression on an organism with a well-characterized reference genome

Design Experiment: Experimental Design
Perfect World
  • Reads as long as possible
  • Paired-end
  • Sequence as deeply as possible to detect novel transcripts (100-200M)
  • As many replicates as possible
  • Preferably run a small pilot experiment first to see how many replicates are needed given the effect size
Real World

  • Determine what your goals are and what treatments you are interested in; plan accordingly
  • For a simple differential gene expression experiment on a human you could get away with single-end, 75-100 bp reads, with n=3 biological replicates, sequenced to ~30 million reads/sample (1 lane of sequencing for a simple control vs treatment 6-sample design)

Design Experiment: Microarray versus RNA-Seq
  RNA-seq | Microarray
  Counts (discrete data) | Continuous data
  Negative binomial distribution used in statistical analysis | Normal distribution used in statistical analysis
  No genome sequence needed | Uses DNA hybridization: sequence info needed
  Can be used to characterize novel transcripts/splice forms | Genome must be sequenced
  Metric: counts (quantitative) | Metric: relative intensities

Design Experiment: Do I use Microarray or Sequencing?
  • What expertise is available?
      • Is your lab already set up for microarrays?
      • Does your bioinformatician prefer to analyze next gen data?
      • What are people in your department familiar with?
      • Is there someone who can help you troubleshoot problems?
  • Cost: microarrays are cheaper

  • At what levels are the transcripts of interest likely to be expressed?
      • Microarrays indicate relative rather than absolute expression
      • This can be problematic for accurate estimation of expression levels of very highly or lowly expressed transcripts
  • Does your organism of interest have a well-characterized genome?
  • Data analysis: how confident are you in your ability to analyze the data?
      • Microarrays have been around for a lot longer, so microarray analysis has more user-friendly tools

Design Experiment: What should I tell the sequencing center I want?
  • Depth, number of lanes
  • Multiplexing
  • Paired-end or single-end?
  • Which RNA species am I interested in sequencing?
  • Strand-specific?
  • Length of reads
  • Poly A selection or ribodepletion
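To make "depth, number of lanes" concrete before talking to the sequencing center, the depth formula from the Sequencing Depth slide and the ~30 million reads/sample example can be turned into a small calculation. This is only a sketch: the human haploid genome size (about 3.1e9 bp) and the per-lane yield of 180 million reads are illustrative assumptions, the latter chosen so that six samples fit on one lane as in the example above. Ask your sequencing center for the real yield of their instrument.

```python
import math


def sequencing_depth(read_length_bp, n_reads, haploid_genome_bp):
    """Depth = (read length)(number of reads) / (haploid genome length)."""
    return read_length_bp * n_reads / haploid_genome_bp


def lanes_needed(n_samples, reads_per_sample, reads_per_lane):
    """Smallest whole number of lanes that yields the requested reads."""
    return math.ceil(n_samples * reads_per_sample / reads_per_lane)


# Assumed values, for illustration only.
HUMAN_HAPLOID_BP = 3.1e9  # approximate human haploid genome size
READS_PER_LANE = 180e6    # hypothetical lane yield; ask the sequencing center

# Control vs treatment, n=3 each, single-end 100 bp, ~30 million reads/sample.
depth = sequencing_depth(100, 30e6, HUMAN_HAPLOID_BP)
lanes = lanes_needed(6, 30e6, READS_PER_LANE)
print(f"depth per sample: {depth:.2f}x, lanes needed: {lanes}")
```

Changing the number of samples or the target reads per sample shows quickly when a design spills over into a second lane.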

RNA preparation: RNA extraction, purification, and quality assessment
  [Figure labels: 18S, 28S]
  • RIN = RNA integrity number
  • Generally, RIN scores >8 are good, depending on the organism
  • Important to use high RIN score samples, particularly when sequencing small RNAs, to be sure you aren't simply selecting degraded RNAs

Prepare Libraries: Target Enrichment
  • It is necessary to select which RNAs you sequence
      • Total RNA generally consists of >80% rRNA (Raz et al., 2011)
      • If rRNA is not removed, most reads would be from rRNA
  • Size selection: what size RNAs do you want to select? Small RNAs? mRNAs?
  • Poly A selection = method of isolating poly(A+) transcripts, usually using oligo-dT affinity
  • Ribodepletion = depletes ribosomal RNAs using sequence-specific biotin-labeled probes

Prepare Libraries: Library Prep
  • Before a sample can be sequenced, it must be prepared into a sample library from total RNA
  • A library is a collection of fragments that represent sample input
  • Different methods exist, each with different biases

Sequence: Next Generation Sequencing Platforms

  • 454 Sequencing / Roche
      • GS Junior System
      • GS FLX+ System
  • Illumina (Solexa)
      • HiSeq System
      • Genome Analyzer IIx
      • MiSeq
  • Applied Biosystems / Life Technologies
      • SOLiD 5500 System
      • SOLiD 5500xl System
  • Ion Torrent
      • Personal Genome Machine (PGM)
      • Proton

Sequence: Next Generation Sequencing Platforms
  Platform | Chemistry | Read Length | Run Time | Gb/Run | Advantage | Disadvantage
  454 GS Junior | Pyrosequencing | 500 | 8 hrs | 0.04 | Long read length | High error rate
  454 GS FLX+ | Pyrosequencing | 700 | 23 hrs | 0.7 | Long read length | High error rate
  HiSeq | Reversible terminator | 100 | 2 days (rapid mode) | 120 (rapid mode) | High throughput, low cost | Short reads, longer run time
  Ion Proton | Proton detection | 200 | 2 hrs | 100 | Short run times | New, less tested
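The Read Length and Gb/Run columns together imply roughly how many reads a run produces, which is the number that matters when planning reads per sample. The sketch below simply divides total bases by bases per read, using the values from the table above. It is a back-of-the-envelope estimate that assumes every read has the nominal length and ignores quality filtering.

```python
# Rows copied from the platform table: (platform, read length in bp, Gb per run).
PLATFORMS = [
    ("454 GS Junior", 500, 0.04),
    ("454 GS FLX+", 700, 0.7),
    ("HiSeq (rapid mode)", 100, 120),
    ("Ion Proton", 200, 100),
]


def approx_reads_per_run(read_length_bp, gb_per_run):
    """Rough read count: total bases per run divided by bases per read."""
    return gb_per_run * 1e9 / read_length_bp


for name, read_length, gb in PLATFORMS:
    reads = approx_reads_per_run(read_length, gb)
    print(f"{name:20s} ~{reads:,.0f} reads per run")
```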

Analysis: Standard Differential Expression Analysis
  1. Check data quality
  2. Trim & filter reads, remove adapters
  3. Check data quality
  4. Align reads to reference genome
  5. Count reads aligning to each gene
  6. Unsupervised clustering
  7. Differential expression analysis
  8. GO enrichment analysis
  9. Pathway analysis

Analysis: File formats - FASTQ files
  • What we get back from the sequencing center: this is usually the format your data is in when sequencing is complete
  • Text files
  • Contain both sequence and base quality information
  • Phred score: $Q = -10 \log_{10} P$

  • $P$ is the base-calling error probability
  • Integer scores are converted to ASCII characters
  • Example:
      @ILLUMINA:188:C03MYACXX:4:1101:3001:1999 1:N:0:CGATGT
      TACTTGTTACAGGCAATACGAGCAGCTTCCAAAGCTTCACTAGAGACATTTTCTTTCTCCCAACTCACAAGATGAACACAAAATGGAAACT
      +
      1=DDFFFHHHHHJJDGHHHIJIJIIJJIJIIIGIIGJIIIJCHEIIJGIJJIJIIJIJIFGGGGGIJIFFBEFDC>@@[email protected];@(553>@>C(59:?

Analysis: Data Cleaning: a Multistep Process
  1. Remove adapters: removes adapter sequences
  2. Remove contamination: remove contamination from fastq files (or GTF files)
  3. Trim reads: trim reads based on quality
  4. Separate reads: separate reads into paired and unpaired

Analysis: Quality Control
  [Figure: Per Base Sequence Quality]

Analysis: Quality Control
  [Figure: Per Sequence Quality Scores]

Analysis: Aligning Reads to a Reference
  [Figure: unique reads aligned to the reference genome, then counted per gene]
  Gene | Sample 1 | Sample 2 | ... | Sample N
  Gene 1 | 145 | 176 | ... | 189
  Gene 2 | 13 | 27 | ... | 19
  ... | ... | ... | ... | ...
  Gene G | 28 | 30 | ... | 20

Analysis: File formats: FASTA files
  • Text file with sequences (amino acid or nucleotides)
  • First line per sequence begins with > and information about the sequence
  • Example:
      >comp2_c0_seq1
      GCGAGATGATTCTCCGGTTGAATCAGATCCAGAGGCATGTATATATCGTCTGCAAAATGCTAGAAA
      CCCTCATGTGTGTAATGCAGTGCATTCATGAAAACCTTGTAAGCTCACGTGTCGCTGACTGTCTGA
      GAACCGACTCGCTAATGTTCCATGGAGTGGCTGCATACATCACAGATTGTGATTCCAG
      GTTGCGAGACTATTTGCAGGATGCATGCGAGCTGATTGCCTATTCCTTCTACTTCTTAAATAAAGT
      AAGAGC
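Both sequence formats described so far, FASTQ with its Phred-encoded quality line and FASTA with its > header lines, are simple enough to read by hand. The sketch below decodes a quality string into Phred scores, inverts Q = -10 log10(P) to recover the error probability, and splits FASTA text into records. The ASCII offset of 33 is an assumption (it is the Sanger / Illumina 1.8+ convention; older Illumina files used 64), the function names are made up for illustration, and the records are toy data rather than the slide examples.

```python
def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy records, not taken from the slides.
fastq_record = ["@read1", "ACGT", "+", "II5!"]
quals = phred_scores(fastq_record[3])
print(quals)  # [40, 40, 20, 0]
print(f"{error_probability(quals[2]):.2f}")  # 0.01

fasta_text = ">seq1 toy example\nACGT\nTTGA\n>seq2\nGGCC\n"
records = list(read_fasta(fasta_text.splitlines()))
print(records)
```

For real data an established parser is the safer choice; the point here is only to show what the quality characters and header lines encode.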

Analysis: File formats: BAM and SAM files
  • A SAM file is a tab-delimited text file that contains sequence alignment information
  • This is what you get after aligning reads to the genome
  • BAM files are simply the binary version (compressed and indexed) of SAM files; they are smaller
  • Example layout: header lines (begin with @), followed by the alignment section

Analysis: Terminology
  • Units
      • Counts = ($X_i$) the number of reads that align to a particular feature $i$ (gene, isoform, miRNA)
      • Library size = ($N$) number of reads sequenced
      • FPKM = fragments per kilobase of exon per million mapped reads
          • Takes length of gene ($l_i$) into account
          • $\mathrm{FPKM}_i = \frac{X_i}{l_i \cdot N} \times 10^{9}$
      • CPM = counts per million mapped reads
          • $\mathrm{CPM}_i = \frac{X_i}{N} \times 10^{6}$
  • FDR = False Discovery Rate (the expected proportion of false positives, i.e. Type I errors, among the features called significant); a 10% FDR means that about 10% of your differentially expressed genes are likely to be false positives
  • We must adjust for multiple testing in RNA-seq statistical analyses to control the FDR

Analysis: Caveats
  • If you have zero counts it does not necessarily mean that a gene is not expressed at all
      • Especially in single-cell RNA-seq
  • RNA and protein expression profiles do not always correlate well
      • Correlations vary wildly between RNA and protein expression
      • Depends on category of gene
      • Correlation coefficient distributions were found to be bimodal between gene expression and protein data (one group of gene products had a mean correlation of 0.71; the other had a mean correlation of 0.28) (Shankavaram et al., 2007)

Thank you! Any questions?
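As a worked companion to the Terminology slide, the CPM and FPKM definitions translate directly into code, with Xi the count for feature i, li its length in base pairs and N the library size. The slides say only that p-values must be adjusted for multiple testing; the Benjamini-Hochberg procedure used below is one common way to do that and is an assumption here, not necessarily the method the author had in mind. All numbers are toy values.

```python
def cpm(count, library_size):
    """CPM_i = X_i / N * 10^6."""
    return count / library_size * 1e6


def fpkm(count, gene_length_bp, library_size):
    """FPKM_i = X_i / (l_i * N) * 10^9, with l_i in base pairs."""
    return count / (gene_length_bp * library_size) * 1e9


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 145 reads on a 2,000 bp gene in a library of 20 million reads.
print(f"CPM  = {cpm(145, 20_000_000):.3f}")
print(f"FPKM = {fpkm(145, 2_000, 20_000_000):.3f}")

# Genes whose adjusted p-value is below 0.10 are called at a 10% FDR.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in adjusted])
```

Because FPKM divides by gene length, a long and a short gene with the same count get different values, whereas CPM depends only on library size.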
