Issues with creating Genome Browsers for Whole Genome

Issues with creating Genome Browsers for Whole Genome Assemblies
G-OnRamp Beta Users Workshop
Wilson Leung
06/2017

Outline
  • Obtain genome assemblies from NCBI
  • Transfer large genomics datasets to Galaxy
  • Obtain RNA-Seq data from NCBI SRA
  • Identifying and masking repeats
  • Obtain protein sequences for tblastn searches
  • Obtain RNA GenBank files for translated BLAT searches

Types of evidence tracks on a Genome Browser
  • Protein alignments (SPALN)
  • Geneid
  • N-SCAN PASA-EST
  • Augustus (with RNA-Seq)
  • RNA PolII ChIP-Seq (MACS2)
  • RNA-Seq Coverage
  • TopHat junctions
  • StringTie + TransDecoder
  • RepeatMasker

Obtaining the genome assembly from NCBI
  • BioProject: https://www.ncbi.nlm.nih.gov/bioproject/

    • Entry point to all genomic datasets (e.g., genome assembly, transcriptome) that pertain to a study
    • Data from the 1000 Genomes Project are available through NCBI BioProject
  • SRA = Sequence Read Archive
    • Database of high-throughput sequencing data
    • Data available through NCBI, EBI, and DDBJ

Access genome assemblies from the NCBI Assembly database
  • https://www.ncbi.nlm.nih.gov/assembly
  • Download data files for GenBank and RefSeq whole genome assemblies

Types of genome assemblies
  • RefSeq categories:
    • Reference genome
      • High quality assembly
      • Standard for comparison
      • Example: D. melanogaster
    • Representative genome
      • Best genome assembly available within a clade
  • https://www.ncbi.nlm.nih.gov/assembly/help/

Obtain genome assemblies from the NCBI FTP site
  • Download genome sequence, predicted transcript and protein products
  • Consistent primary sequence IDs (accession.version) for both GFF and FASTA files
  • https://www.ncbi.nlm.nih.gov/books/NBK431016/#

Naming conventions for GenBank assemblies

  • File name pattern: [assembly accession.version]_[assembly name]_[content type].[optional format]

  Content type   Description
  genomic        Genome assembly (repeats identified by WindowMasker are in lower-case)
  rm             Transposons identified by RepeatMasker (eukaryotes only)

  • See the README.txt file within the directory for details

Common data formats used by GenBank assemblies

  Format   Description
  fna      Nucleotide sequence in FASTA format
  faa      Protein sequence in FASTA format
  gbff     GenBank flat file format
  gff      General Feature Format Version 3

  • Large data files are compressed by gzip
    • File suffix = .gz
    • Supported by Galaxy
    • Built-in support in macOS
    • Use 7-Zip on MS Windows: http://www.7-zip.org/
  • Genome assembly in FASTA format: file names ending in _genomic.fna.gz
    • Example: GCA_000269505.2_DroMir_2.2_genomic.fna.gz

DEMO: Access the D. miranda genome assembly from the NCBI FTP site

Benefits of using FTP to transfer large files to Galaxy
  • Problems with standard file upload

    • Most servers have a 2 GB file upload size limit
    • Cannot monitor progress of file upload
    • Cannot resume interrupted file upload
  • Galaxy Main and G-OnRamp support FTP file upload
    • Support transfer of large gzip, bzip2, and zip files
  • https://galaxyproject.org/ftp-upload/

Overview of the File Transfer Protocol (FTP)
  • Data transfer protocol between a client and a server
    • May allow anonymous access
    • Insecure connection
  • Partial built-in support in most operating systems
    • macOS: Go → Connect to Server
    • MS Windows: File Explorer
  • Other graphical clients
    • e.g., Cyberduck, FileZilla, Fugu

Use FTP to upload files to Galaxy

  • Use an FTP client to initiate an FTP connection to Galaxy
    • Galaxy Main FTP server: ftp://usegalaxy.org
    • Use your Galaxy account credentials to authenticate
  • Transfer files to the Galaxy FTP server
  • Use the Upload File tool to import contents of the FTP directory into Galaxy
    • Files available through the Choose FTP file button

Directly transfer files from the NCBI FTP site to Galaxy
  • Open Connection to Galaxy Main in Cyberduck
    • Server: usegalaxy.org
    • Enter the username and password for your Galaxy account
  • File → New Browser
  • Copy the FTP link to the GenBank assembly at NCBI
  • Paste link into the Quick Connect textbox and press Enter
  • Select and drag files from the NCBI connection window to the Galaxy connection window
  • Compatible with version 6.0.0 of

DEMO: Use FTP to upload the D. miranda genome assembly to Galaxy

Transfer high-throughput sequencing data from the SRA to Galaxy
  • Second and third generation sequencing data available through the Sequence Read Archive (SRA)

  • NCBI SRA stores sequencing data in sra format
    • Use the SRA Toolkit to convert files to fastq (fastq-dump)
    • Paired-end reads might split at the wrong position: https://www.biostars.org/p/12569/

Goals of repeat analysis
  • Improve G-OnRamp workflow:
    • Improve performance of tblastn and BLAT searches
    • Reduce number of false positives in gene predictions
  • Survey of the repetitive contents of a genome:
    • Estimate total repeat density
    • Types and distributions of transposons
  • Develop repeat pipeline to handle genome assemblies with different sizes and quality
    • Assembly sizes: 111 Mb - 2.8 Gb
    • Number of scaffolds: 54 - 402,501
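
As a rough illustration of the "estimate total repeat density" goal, the sketch below counts lower-case bases in a soft-masked FASTA file, on the assumption that lower-case marks repeats (as with the WindowMasker-masked genomic files from NCBI). The function names and the tiny demo sequences are made up for illustration; this is not part of the G-OnRamp pipeline.

```python
import gzip
import io


def open_fasta(path):
    """Open a FASTA file as text; .gz files are decompressed on the fly."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def soft_masked_fraction(lines):
    """Return (masked_bases, total_bases, scaffold_count) for FASTA lines.

    Assumption: lower-case letters mark soft-masked (repeat) bases.
    """
    masked = total = scaffolds = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            scaffolds += 1  # one header line per scaffold
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked, total, scaffolds


# Hypothetical two-scaffold assembly used only to exercise the function.
demo = io.StringIO(">scaffold_1\nACGTacgtNN\n>scaffold_2\nttttGGGG\n")
masked, total, count = soft_masked_fraction(demo)
print(f"{masked}/{total} bases soft-masked across {count} scaffolds")
```

For a real assembly, `soft_masked_fraction(open_fasta(path))` would give the same three numbers, which also cover the assembly size and scaffold count quoted on the slide.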

Strategies used to identify repeats in five genome assemblies
  • k-mer based: WindowMasker, Tallymer
  • tRNA derived SINEs: tRNAscan-SE
  • Structure based: LTRharvest + LTRdigest, TRF, TanTan
  • Conserved domains within transposons: TransposonPSI
  • Species-specific repeat library:
    • RepBase repeats from closely-related species (if available)
    • RepeatScout
    • MUMmer + PILER
    • RepeatModeler
  • Repeat classification: RepeatClassifier

Repeat tracks available on the G-OnRamp Assembly Hubs
  • WindowMasker
  • Tallymer
  • TRF
  • RepeatMasker
  • Nested repeats
  • LTRharvest
  • TransposonPSI
  • http://old-gep.wustl.edu/~wilson/gonramphubs/

Accurate repeat identification requires the use of multiple techniques
  [Figure: Repeat libraries, Arabidopsis thaliana repeatome]

  Maumus F, Quesneville H. PLoS One. 2014 Apr 7;9(4):e94101.

RepeatScout run time vs. genome size
  [Figure: RepeatScout run time (seconds) vs. genome size (Mb)]
  Schaeffer CE et al. Bioinformatics. 2016 Jun 15;32(12):i209-

High memory requirement of k-mer based repeat finders
  [Figure: RepeatScout memory required (Gb) vs. genome size (Mb)]
  Schaeffer CE et al. Bioinformatics. 2016 Jun 15;32(12):i209-

Partition genome assembly into smaller batches
  • Shuffle scaffolds in genome assembly
    • Scaffolds in the original assembly are often ordered by size
  • Batch size optimization criteria:
    • Avoids memory errors (i.e., segmentation faults)
    • Can be processed in a reasonable amount of time
  • Batch size for RepeatScout and PILER: 100 Mb per batch
    • Compare only within each batch
    • Random sample of 600 Mb for the X. laevis genome

Use tandem repeat masked genome assembly to improve performance
  • Some genomes (e.g., C. reinhardtii) contain high density of tandem repeats
    • Degrades performance of many repeat finding algorithms

    • Results in large number of spurious matches
  • RepeatModeler analysis of C. reinhardtii (111 Mb)
    • Requires ~130 hours to process unmasked genome
    • Requires ~90 hours to process tandem repeat masked genome
  • Requires ~30 hours to process A. vittata genome (1.2 Gb)
  • Use tandem repeat masked assembly in the RepeatModeler and PILER analyses

Recent changes to RepeatMasker and RepeatModeler
  • New Dfam_consensus database:
    • Creative Commons CC0 1.0 public domain license
    • http://www.dfam-consensus.org/
  • Support searches using profile Hidden Markov Models
    • HMMER + Dfam

Obtain protein sequences for tblastn searches
  • Species-specific databases
    • FlyBase: dmel-all-translation-r6.15.fasta.gz
    • http://flybase.org/static_pages/downloads/bulkdata7.html
  • Swiss-Prot
    • High quality, manually annotated section of UniProtKB
    • http://www.uniprot.org/downloads
  • NCBI RefSeq
    • Use only curated RefSeq records (accession prefix = NP_)

    • Protein sequences from RefSeq reference genomes
    • https://www.ncbi.nlm.nih.gov/books/NBK50679/

Misannotations in public databases
  [Figure: average % misannotation, grouped by # sequences in family]
  Schnoes AM, et al. PLoS Comput Biol. 2009 Dec;5(12):e1000605.

Obtain Swiss-Prot protein sequences
  • UniProt download page (http://www.uniprot.org/downloads)
    • Entire Swiss-Prot database
    • Swiss-Prot sequences separated by taxonomic divisions
      • e.g., human, invertebrates, mammals, plants, rodents, vertebrates
    • Download files with the uniprot_sprot prefix
    • Use the seqret EMBOSS tool in Galaxy to create FASTA file
  • Search for reviewed:yes entries in UniProtKB
    • http://www.uniprot.org/uniprot/?query=reviewed%3Ayes
    • Filter protein sequences by taxonomy, keywords, gene ontology, enzyme class or pathways
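
The `%3A` in the search link above is the URL-encoded form of the colon in `reviewed:yes`. The short sketch below shows how such a link can be assembled with the Python standard library. The base address is the one given on the slide and may no longer match the current UniProt website; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Address as it appears on the 2017 slide (assumption: it may have changed since).
UNIPROT_SEARCH = "http://www.uniprot.org/uniprot/"


def uniprot_query_url(query):
    """Build a UniProtKB search link, percent-encoding the query text."""
    return UNIPROT_SEARCH + "?" + urlencode({"query": query})


print(uniprot_query_url("reviewed:yes"))
```

Only the `reviewed:yes` term comes from the slides; any further filter terms would have to follow whatever query syntax the UniProt site documents.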

DEMO: Download Swiss-Prot protein sequences from UniProt

NCBI Reference Sequence database
  • More comprehensive than Swiss-Prot
  • Two major types of RefSeq records:
    • Known RefSeq: NP_
    • Model RefSeq: XP_
  • Model RefSeq records are based on results from computational pipelines
    • More likely to propagate annotation errors
  • https://www.ncbi.nlm.nih.gov/refseq/about/

Obtain protein sequences from the NCBI RefSeq database
  • Download from the NCBI Genome database
    • https://www.ncbi.nlm.nih.gov/genome/
  • Search the NCBI Protein database with the RefSeq and reviewed filters
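
When a downloaded RefSeq protein FASTA file mixes both record types, the advice to keep only known (NP_) records can be applied with a simple prefix filter. The sketch below assumes each FASTA header starts with the accession; the function names, accessions and sequences are invented for illustration.

```python
def split_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_known_refseq(records, prefix="NP_"):
    """Keep records whose header starts with the given accession prefix."""
    return [(head, seq) for head, seq in records if head.startswith(prefix)]


# Made-up records: one known (NP_) and one model (XP_) accession.
demo = (
    ">NP_000001.1 example curated protein\nMKTAYIAK\n"
    ">XP_000002.1 example predicted protein\nMSLLQR\n"
)
for head, seq in keep_known_refseq(split_fasta(demo)):
    print(head, len(seq))
```

Filtering on the prefix is a shortcut; the RefSeq and reviewed filters in the NCBI Protein database remain the route the slides describe.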

Obtain RNA GenBank files for translated BLAT searches
  • Available through the NCBI FTP server
  • File with the _rna.gbff.gz suffix

Obtain the RNA GenBank file for D. melanogaster

Summary
  • Obtain genome assemblies from NCBI
  • Use FTP to transfer large genome assemblies to Galaxy
  • Use EBI SRA to transfer fastq files to Galaxy
  • Use different approaches to identify repetitive sequences in a genome
  • Obtain transcript and protein sequences from NCBI and UniProtKB for sequence similarity searches

Questions?
  https://flic.kr/p/bhyT8B
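
For readers who prefer a script to a graphical client, the FTP transfer step in the summary can be sketched with Python's standard ftplib module. This is a minimal sketch, not the workshop's method: the function names are hypothetical, the host is the Galaxy Main address given in the slides, and some servers may require an encrypted connection (ftplib.FTP_TLS) instead of plain FTP. The callback reports progress, which the standard browser upload cannot do.

```python
import ftplib
import os


def upload_with_progress(ftp, local_path, report=print):
    """Send local_path over an open FTP connection, reporting bytes sent."""
    total = os.path.getsize(local_path)
    sent = 0

    def on_block(block):
        # Called by ftplib after each block has been transmitted.
        nonlocal sent
        sent += len(block)
        report(f"{sent}/{total} bytes")

    with open(local_path, "rb") as handle:
        ftp.storbinary(
            f"STOR {os.path.basename(local_path)}",
            handle,
            blocksize=1 << 20,
            callback=on_block,
        )
    return sent


def upload_to_galaxy(local_path, user, password, host="usegalaxy.org"):
    """Log in with Galaxy account credentials and upload one file.

    Assumption: the server accepts plain FTP on the default port.
    """
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        return upload_with_progress(ftp, local_path)
```

After the transfer, the file would still need to be imported into a history through the Upload File tool, as described in the FTP slides.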
