Next Gen. Sequencing Files and pysam BCHB524 Lecture

Next Gen. Sequencing Files and pysam BCHB524 Lecture

Next Gen. Sequencing Files and pysam BCHB524 Lecture 13 BCHB524 - Edwards Next Gen. Sequencing BCHB524 - Edwards Wiki: Genomics 2 Next Gen. Sequencing

BCHB524 - Edwards Nature Biotechnology 29, 2426 (2011) 3 Python for NGS NGS Special purpose tools (tophat, cufflinks, samtools) for aligning Use

data is big! Python for: Clean up / filter reads Post-process tool output Visualization BCHB524 - Edwards 4 Count reads from FASTQ file # Import BioPython's SeqIO module import Bio.SeqIO # Import the sys module

import sys # Get first command-line argument inputfile = sys.argv[1] # Initialize counter count = 0 # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): # Increment count count += 1 # Output result print(count,"reads") BCHB524 - Edwards 5 Filter reads in FASTQ file

import Bio.SeqIO import sys # Get command-line arguments inputfile = sys.argv[1] minlength = int(sys.argv[2]) # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): # Check the length if len(read.seq) > minlength: # Output to standard-out print(read.format("fastq"),end="") BCHB524 - Edwards 6 Filter reads in FASTQ file

import Bio.SeqIO import sys # Get command-line arguments inputfile = sys.argv[1] thr = int(sys.argv[2]) # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): # Check the minimum phred score if min(read.letter_annotations["phred_quality"]) >= thr: # Output to standard-out print(read.format("fastq"),end="") BCHB524 - Edwards 7 Remove primer sequence

import Bio.SeqIO import sys # Get command-line arguments inputfile = sys.argv[1] # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): # if the primer sequence is present if read.seq.startswith('GATGACGGTGT'): # remove it and output as FASTA read = read[11:] print(read.format("fastq"),end="") BCHB524 - Edwards 8 Dump space-separated-values

import Bio.SeqIO import sys # Get command-line arguments inputfile = sys.argv[1] # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): # Output description, and read length print(read.description,len(read.seq)) BCHB524 - Edwards 9 Plot read lengths import Bio.SeqIO import sys from matplotlib.pyplot import *

# Get command-line arguments inputfile = sys.argv[1] lengths = [] # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): # Store read length lengths.append(len(read.seq)) # lengths.sort() plot(lengths,'.') show() # Save and use "shotwell readlengths.png" to view BCHB524 - Edwards # savefig('readlengths.png') 10 Histogram of read lengths

import Bio.SeqIO import sys from matplotlib.pyplot import * # Get command-line arguments inputfile = sys.argv[1] lengths = [] # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): # Store read length lengths.append(len(read.seq)) hist(lengths) show() # savefig('readlengthhist.png') BCHB524 - Edwards 11

Plot read lengths and quality import Bio.SeqIO import sys from matplotlib.pyplot import * # Get command-line arguments inputfile = sys.argv[1] lengths1 = [] lengths2 = [] # Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): phred_scores = read.letter_annotations["phred_quality"] l = 0 for phsc in phred_scores: if phsc < 30: break l += 1 lengths1.append(l)

lengths2.append(len(read.seq)) plot(lengths2,lengths1,'.') show() # savefig('readlengths.png') BCHB524 - Edwards 12 Plot read lengths and quality import Bio.SeqIO import sys from matplotlib.pyplot import * # Get command-line arguments inputfile = sys.argv[1] lengths1 = [] lengths2 = []

# Loop through all reads in inputfile for read in Bio.SeqIO.parse(inputfile, "fastq"): phred_scores = read.letter_annotations["phred_quality"] l = 0 for phsc in phred_scores: if phsc < 30: break l += 1 lengths1.append(l) lengths2.append(len(read.seq)) plot(sorted(lengths1),'.',sorted(lengths2),'.') show() # savefig('readlengths.png') BCHB524 - Edwards 13

Samtools using pysam Popular format for alignment records pysam is a lightweight wrapper around the samtools code Need to understand samtools alignment datastructures BAM indexes permit random access by locus Direct access to mate-pairs BCHB524 - Edwards

14 Integrated Genome Viewer BCHB524 - Edwards 15 Integrated Genome Viewer BCHB524 - Edwards 16 Reads overlapping a region # Import the PySam module import pysam

# Open the BAM file bf = pysam.Samfile('10_Normal_Chr21.bam') # Access the reads overlapping 21:9000000-10000000 for aligned_read in bf.fetch('21',9000000,10000000): # Dump the information about each read print(aligned_read.qname,\ aligned_read.seq,\ bf.getrname(aligned_read.tid),\ aligned_read.pos,\ aligned_read.qend) BCHB524 - Edwards 17 Determine coverage by locus import pysam # Open the BAM file

bf = pysam.Samfile('10_Normal_Chr21.bam') # Access the reads overlapping 21:9000000-10000000 for pileup in bf.pileup('21',9000000,10000000): # Dump the position and number of reads print(pileup.pos, pileup.n) # Plot? BCHB524 - Edwards 18 Look for SNPs import pysam bf = pysam.Samfile('10_Normal_Chr21.bam') # For every position in the reference for pileup in bf.pileup('21'): counts = {}

# ...examine every aligned read for pileupread in pileup.pileups: # ...and get the read-base if not pileupread.query_position: continue readbase = pileupread.alignment.seq[pileupread.query_position] # Count the number of each base if readbase not in counts: counts[readbase] = 0 counts[readbase] += 1 # If there is no variation, move on if len(counts) < 2: continue # Otherwise, output the position, coverage and base counts print(pileup.pos, pileup.n, end=" ") for base in sorted(counts): print(base,counts[base], end=" ")

BCHB524 - Edwards print() 19 Filter out bad/poor alignments # ...check the read and alignment if pileupread.indel: continue if pileupread.is_del: continue al = pileupread.alignment if al.is_unmapped: continue if al.is_secondary: continue if int(al.opt('NM')) > 1:

continue if int(al.opt('NH')) > 1: continue # ...and get the read-base if not pileupread.query_position: continue readbase = al.seq[pileupread.query_position] # if not enough observations of minor allele, move on if sorted(counts.values())[-2] < 10: continue BCHB524 - Edwards 20

Recently Viewed Presentations

  • Financial Accounting and Accounting Standards

    Financial Accounting and Accounting Standards

    International Financial Reporting Standards and FASB financial reporting standards. International Financial Reporting Standards, International Accounting Standards, and international accounting interpretations. ... IFRS stands for: International Federation of Reporting Services. Independent ...
  • Poetry - Humble Independent School District

    Poetry - Humble Independent School District

    He loved the sounds of nursery rhymes, foreshadowing his love for the rhythmic ballads of Hopkins, Yeats, and Poe. a neurotic, sickly child who shied away from school and preferred reading on his own. Fascinated by language, he excelled in...
  • WHY ATMs/PARTNERSHIPS? DARCA Workshop 2017 Colorado Water ...

    WHY ATMs/PARTNERSHIPS? DARCA Workshop 2017 Colorado Water ...

    Need a Seller Willing to Make Long-Term Commitments * ATM agreements require partnerships among farmers for scale Selling water rights is profitable and relied upon for retirement Water transfers run counter to farm operations and historical practices ATMs require motivation...
  • DHL PROVIEW proview.dhl.com DHL ProView ,                                         ! DHL

    DHL PROVIEW proview.dhl.com DHL ProView , ! DHL

    Электронные решения. DHL . ProView - этоонлайнсервис отслеживания Ваших грузов, с возможностью отображения статусаи получения уведомлений на различных стадиях доставки!
  • Diapositive 1

    Diapositive 1

    The detection instrument for the quantitative determinationvaryfrom a single Petridish to different dilutions and MPN system of diverse complexity. Intrinsicvariablity of microbiologicaldeterminations ... Every technical step of the method adds to the total variability of the measurement: Sub ...
  • On Reframing - UMKC

    On Reframing - UMKC

    Other ways of "knowing" broccoli More subjective. I like the way it tastes. ... Social Construction of Reality What we "feel" matters How we frame our world of understanding about people matters These "real conditions" of our understanding have meaning...
  • WELCOME!! [www.league.org]

    WELCOME!! [www.league.org]

    WELCOME!! What is your college currently doing to provide opportunities for adjunct faculty? ... 60 participants thus far in program . Continues to be supported by college leadership. Integrating improvements recommended by completers. ... Facebook Group. Webpage is available and...
  • 9th Grade Parent Counseling Presentation

    9th Grade Parent Counseling Presentation

    School events, like Community Days, are a chance for the Raven Community to get together and sustain a school culture that is welcoming and inclusive. CCA supports students' lives outside the classroom through school wide campaigns, such as Raven Wellness...