Biological Information

Where do we start?

- Start with the biological problem
- Translate that into what the data look like
- Think about appropriate distributions
- Navigate ways to analyze the data:
  - Using existing methods
  - Adapting existing methods
  - Using newer ideas

Why Genomics?

The processes and technology of genomics have changed over the last 20 years, but the motivation remains the same: if we can understand and relate genotypic variation (how we vary based on our genes) to phenotypic traits (height, weight, hair color, disease), then maybe we can understand, predict, and potentially INTERVENE in diseases and traits in plants etc.

As the human genome and other genomes are sequenced and understood, interest in the mechanisms of disease susceptibility increases. Can we segregate the cancer gene? If so, can we prevent diseases? Can we find ways to silence the cancer gene? The idea of personalized medicine is still a possibility.

Some Basic Biology

Let's get familiar with some terms:

- Cell
- Nucleus
- DNA
- Genome
- Gene
- Transcriptome
- SNPs
- NGS
- Central Dogma of Molecular Biology

DISCLAIMER: My biology is VERY rudimentary, so don't count on it TOO much. I can explain this from a very data-centric view.

Biological Hierarchies

- Cell
- Tissue
- Organ
- Organism

Our focus: the cell. More specifically: cell, nucleus, chromosome.

Cells contain chromosomes that are made up of DNA, which is broken down into functional units called genes. We focus on genes.

Example:

- Organism: human
- Organ: stomach
- Tissue
- Cell
- Organelles: say, the nucleus; chromosomes: 22 + XX (or XY)
- Macromolecules: DNA: 3 x 10^9 bp; RNA: ~2,000 molecules; proteins: ~30,000-50,000

Macromolecule

Very large molecules made up of thousands of atoms. Our interest is in the nucleic acids, and in proteins (but those fall under proteomics).

Nucleic Acids

DNA: a macromolecule composed of two chains (double helix structure). The two DNA strands are also known as polynucleotides because they are composed of nucleotides. Each nucleotide is composed of one of four nitrogen-containing bases (cytosine [C], guanine [G], adenine [A], or thymine [T]), a sugar, and a phosphate group. The nucleotides are joined by covalent bonds between the sugar of one nucleotide and the phosphate of the next, resulting in an alternating sugar-phosphate backbone. The nitrogenous bases of the two separate polynucleotide strands are bound together (A with T and C with G) to make double-stranded DNA.

RNA

Assembled as a chain of nucleotides, but unlike DNA it is more often found in nature as a single strand folded onto itself. Cellular organisms use messenger RNA (mRNA) to convey genetic information (using guanine, uracil, adenine, and cytosine, denoted by the letters G, U, A, and C) that directs synthesis of specific proteins.

More Details: DNA

Genetic material (DNA) is present in the nucleus as a DNA-protein complex called chromatin. The DNA is present as a number of discrete units known as chromosomes. Each DNA strand wraps around groups of small protein molecules called histones, forming a series of bead-like structures, called nucleosomes, connected by the DNA strand.

Genome

The sum of all information contained in the DNA for any living thing. The sequence of all the nucleotides in all the chromosomes of an organism.

Gene

A hereditary unit consisting of a sequence of DNA that occupies a specific location on a chromosome and determines a particular characteristic in an organism. The nucleus of each eukaryotic (nucleated) cell has a complete set of genes. Each gene provides a blueprint for the synthesis (via RNA) of enzymes and other proteins and specifies when these substances are to be made. Genes undergo mutation when their DNA sequence changes.
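
To make the base-pairing rule (A with T, C with G) and the gene-to-RNA-to-protein blueprint more concrete, here is a minimal Python sketch. The 12-base `toy_gene`, its mutated copy, and all function names are made up for illustration; only the codon table (the standard genetic code) is real. It ignores most of the actual biology (promoters, splicing, choice of reading frame), so treat it as a data-centric cartoon rather than a model of the cell.

```python
from itertools import product

# Base pairing between the two DNA strands: A with T, C with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def complement(strand):
    """Return the base-paired partner of each base on a DNA strand."""
    return "".join(PAIR[base] for base in strand)


def transcribe(coding_strand):
    """mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy sequences (not real genes): start codon, two codons, stop.
toy_gene = "ATGGCCTTTTAA"
mutated_gene = "ATGGCCTCTTAA"  # one base changed: codon TTT became TCT

print(complement(toy_gene))                 # TACCGGAAAATT
print(transcribe(toy_gene))                 # AUGGCCUUUUAA
print(translate(transcribe(toy_gene)))      # MAF
print(translate(transcribe(mutated_gene)))  # MAS
```

A single changed base is enough to swap one amino acid in the toy protein (F becomes S), which is the kind of change meant by a mutation in the DNA sequence.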

Gene: More Facts

Genes govern both the structure and the metabolic functions of cells, and thus of the entire organism; when located in reproductive cells, they pass their information to the next generation.

Chemically, each gene consists of a specific sequence of DNA building blocks called nucleotides. Genes may vary in their precise makeup from person to person: for example, some people have one nucleotide at a certain location while others have a different nucleotide at that location (single nucleotide polymorphism, SNP).

Geometrically, the gene is a double helix formed by the nucleotides.

Gene loci are often interspersed with segments of DNA that do not code for proteins; these segments are termed junk DNA or non-coding DNA. Within a gene, the coding portions are called exons and the intervening non-coding portions are called introns. Non-coding DNA makes up about 97% of the DNA in the human genome and, despite the "junk" name, at least part of it is necessary for the proper functioning of the genes.

Only a fraction of the genes are "expressed" (turned on) in a given cell, and these confer unique properties on each cell type. Scientists study the kinds and amounts of expressed genes in a cell, which in turn provides insight into how the cell responds to its changing needs (the idea behind microarrays).

Central dogma of molecular biology

Each gene is transcribed (at the appropriate time) from DNA into mRNA, which then leaves the nucleus and is translated into the required protein. Any gene that is active in this way at a particular time is said to be expressed. This is the idea behind MICROARRAYS and RNA-seq.

Breakthrough: Sequencing

DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology used to determine the order of the four bases (adenine, guanine, cytosine, and thymine) in a strand of DNA. When the genome was sequenced, there was a flurry of research in this area. Now we have NGS (next-generation sequencing), where we can look at millions of DNA molecules at the same time.

Some questions that are being asked:

- What genes contribute to cancer?
- What genes are involved in depression?
- What genes respond to cocaine?
- What genes are present in a particular cancer cell type and not in others?
- How does human thinking differ from that of monkeys (given 99.2% genome homology)?

How DO we answer the questions asked?

- What is the BEST way to study genes? How can we effectively answer questions related to genes?
- Should we focus on a FEW genes and look at them through time or across conditions, in a focused study?
- Should we look at many genes at once (sometimes the whole genome) and compare them all across conditions?
- Is it gene abundance that we look at? Or which genes are present? Or where the genes differ from person (unit) to person (unit)?

Forward and Reverse Genetics Approaches in Biology

[Diagram: the Forward Genetics Approach and the Reverse Genetics Approach connect the Biological System (Organism) and the Building Blocks (Genes/Molecules).]

- Experiments, e.g. the ras oncogene. Hypothesis: specific alterations in genes lead to cancer. What are these genes? (t = 10 years/lab)
- Bioinformatics: discover all genes that are different in cancer cells as compared to control (n = 300; t = 1 month).

Biological Information Processing

- Data: Genomics
- Storage and retrieval: Database
- Summary, analysis, and visualization: Statistics

Outline for this class

- What types of data are we interested in?
  - Microarrays
  - RNA-seq
  - GWAS
- What types of experiments do they come from?
- What are the similarities and differences?
- What statistical models and methods are used to understand the structure of the data?
- What statistical techniques are used to analyze these data?
- Assumptions about distributions
- Pitfalls of the three different data types

Microarray and RNA-seq: Types of Data

To decide whether to run a microarray or an RNA-seq experiment, the following have to be taken into account.

Potential deciding factors:

- What genome information do I have?
- How much money do I have?
- What statistical methods are we familiar with?

Potential goals are also important in the decision. The goal is:

- Differential expression
- Absolute quantification
- Discovering novel genes
- Low-level expression

GWAS: SNP?

GWAS (genome-wide association study): an observational study looking at potential allele-level genetic differences between known phenotypes.

SNPs are essentially spots on the genome where members of the same species can have different nucleotides. For example, at a specific base position in the human genome, some individuals may have A while others have G or T. These are all alleles for this position. A typical human genome has about 5 million SNPs.

GWAS

Compares two groups: one with the phenotype of interest (cases, e.g. disease) and one without that phenotype (controls). SNPs are compared between the two groups. If a particular allele is more frequent in the group with the disease than in the group without, that SNP is considered associated with the disease.

The odds ratio is calculated for each SNP, sometimes together with the p-value of a simple chi-squared test. Often, the p-values are plotted in Manhattan plots.

Objective: finding odds ratios that are significantly different from 1.

The first paper on the subject, in 2005, looked at macular degeneration in eyes.

Common to all platforms

- Data are more or less reproducible
- Random errors
- Correlation
- Biases
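
Returning to the GWAS case-control comparison described earlier: for a single SNP, the analysis reduces to a 2x2 table of allele counts (cases vs. controls, one allele vs. the other). The Python sketch below computes the odds ratio and a simple Pearson chi-squared test (1 degree of freedom, no continuity correction) for one such table. The allele counts are invented for illustration, the function names are hypothetical, and a real GWAS would repeat this over a very large number of SNPs and would need to account for multiple testing.

```python
import math


def odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds of carrying the 'alt' allele in cases divided by the odds in controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]].

    With 1 degree of freedom, the upper-tail probability of the chi-squared
    distribution equals erfc(sqrt(x / 2)), so no external library is needed.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts at one SNP (200 cases and 200 controls,
# two alleles per person). These numbers are made up.
case_alt, case_ref = 240, 160   # cases: copies of allele G, copies of allele A
ctrl_alt, ctrl_ref = 180, 220   # controls: copies of allele G, copies of allele A

or_value = odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref)
chi2, p_value = chi_squared_2x2(case_alt, case_ref, ctrl_alt, ctrl_ref)

print(f"odds ratio  = {or_value:.2f}")
print(f"chi-squared = {chi2:.2f}")
print(f"p-value     = {p_value:.1e}")
```

With these made-up counts the odds ratio is about 1.83 and the chi-squared statistic about 18.05, so the odds ratio would be judged different from 1 for this one SNP. In a Manhattan plot, each SNP's p-value is conventionally shown as -log10(p) against its genomic position.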
