Automated and Semi-automated Indexing - Indexing Initiative


The NLM Indexing Initiative: Current Status and Role in Improving Access to Biomedical Information
A Report to the Board of Scientific Counselors
April 5, 2012

Alan R. Aronson (Principal Investigator)
James G. Mork
François-Michel Lang
Willie J. Rogers
Antonio J. Jimeno-Yepes
J. Caitlin Sticco
U. S. National Library of Medicine

Outline
- Introduction [Lan]
- MetaMap [François]
- The NLM Medical Text Indexer (MTI) [Jim]
- Availability of Indexing Initiative Tools [Willie]
- Research and Outreach Efforts [Antonio, Caitlin, Lan]
- Summary and Future Plans [Lan]
- Questions

MEDLINE Citation Example

Introduction - Growth in MEDLINE
[Chart: Indexed MEDLINE Sizes* (2002 - 2012); vertical axis 0 to 20,000,000; data labels 11,289,156 and 17,674,830]
* MEDLINE Baseline less OLDMEDLINE and PubMed-not-MEDLINE

The NLM Indexing Initiative (II)
- The need for MEDLINE indexing support
  - Increasing demand/costs for indexing in light of
  - Flat budgets
- One solution: creation of the NLM Indexing Initiative in 1996, resulting in the NLM Medical Text Indexer (MTI)
- The Indexing Initiative today:
  - Identification of problems or needs followed by subsequent research
  - Production of MTI recommendations and other indexing
  - Opportunities for training and collaboration

Medical Informatics Training Program Fellows
- Antonio J. Jimeno-Yepes, Postdoctoral Fellow: 2010
- J. Caitlin Sticco, Library Associate Fellow: 2011
- Bridget T. McInnes, Postgraduate Fellow: 2008; PhD in 2009; current affiliation: Securboration
- Aurélie Névéol, Postdoctoral Fellow: 2006-2008; current affiliation: NCBI
- Marc Weeber, Postgraduate Fellow: 2000; PhD in 2001; current affiliation: Personalized Media

II Highlights from 2008
- Subheading attachment (Aurélie Névéol)
- Full text experiments (Cliff Gay)
- Initial Word Sense Disambiguation (WSD) method based on Journal Descriptor (JD) Indexing (Susanne Humphrey)
  - The Journal of Cardiac Surgery has JDs Cardiology and General Surgery

II Accomplishments since 2008
- The inauguration of MTI as a first-line indexer (MTIFL)
- Downloadable releases of MetaMap, most recently for Windows XP/7
- Significant improvement in MTI's performance due to
  - Technical improvements to MetaMap and MTI, but even more to
  - Close collaboration with LO Index Section
- More WSD methods with better performance
- The development of Gene Indexing Assistant (GIA)

MetaMap - Overview
- Purpose
- Foundations
- Complexity
- Processing Example
- Challenge of UMLS Metathesaurus Growth
- Significant New Features

MetaMap - Purpose
- Named-entity recognition
  - Identify UMLS Metathesaurus concepts in text
  - Important and difficult problem
- MetaMap's dual role:
  - Local: Critical component of NLM's Medical Text Indexer (MTI)
  - Global: Pre-eminent biomedical concept-identification application

MetaMap in PubMed Central
[Chart: MetaMap in PubMed Central by year, 1996 - 2011; vertical axis 0 to 60; data labels 53, 49, 40, 43]

MetaMap - Foundations
- Knowledge-intensive approach
- Natural Language Processing (NLP)
- Emphasize thoroughness over efficiency
- However, efficiency is still important!

Complexity of Language - Synonymy
- Heart Attack
- Myocardial infarction
- Attack coronary
- Heart infarction
- Myocardial necrosis
- Infarction of heart
- AMI
- MI
C0027051: Myocardial Infarction

Complexity of Language - Ambiguity
cold
- C0009264: Cold Temperature
- C0234192: Cold Sensation
- C0009443: Common Cold
Ambiguity resolved by Word Sense Disambiguation

MetaMap - Processing Example
Inferior vena caval stent filter (PMID 3490760)
Each candidate line shows: MetaMap Score (≤ 1000), Metathesaurus Concept Unique Identifier (CUI), Metathesaurus String, UMLS Semantic Type
Candidate Concepts:
  909  C0080306: Inferior Vena Cava Filter  [medd]
  804  C0180860: Filter (Filters)  [mnob]
  804  C0581406: Filter (Optical filter)  [medd]
  804  C1522664: Filter (filter information process)  [inpr]
  804  C1704449: Filter (Filter (function))  [cnce]
  804  C1704684: Filter (Filter Device Component)  [medd]
  804  C1875155: FILTER (Filter - medical device)  [medd]
  717  C0521360: Inferior vena caval  [blor]
  673  C0042460: Vena caval  [bpoc]
  637  C0038257: Stent (Stent, device)  [medd]
  637  C1705817: Stent (Stent Device Component)  [medd]
  637  C0447122: Vena  [bpoc]

MetaMap - Processing Example
Inferior vena caval stent filter
Final Mappings (subsets of candidate sets):
Meta Mapping (911):
  909  C0080306: Inferior Vena Cava Filter  [medd]
  637  C1705817: Stent  [medd]
Meta Mapping (911):
  909  C0080306: Inferior Vena Cava Filter  [medd]
  637  C0038257: Stent  [medd]
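The idea that final mappings are subsets of the candidate set can be illustrated with a toy search over the candidates listed for "Inferior vena caval stent filter". This is only a sketch: the word positions assigned to each candidate and the selection rule (cover the most words, then use the fewest concepts) are assumptions made for illustration, and it does not reproduce MetaMap's actual mapping score (911 on the slide).

```python
from itertools import combinations

# (score, CUI, matched string, word positions covered in the phrase)
# Phrase words: 0=inferior 1=vena 2=caval 3=stent 4=filter
# The covered positions are an assumption made for this illustration.
CANDIDATES = [
    (909, "C0080306", "Inferior Vena Cava Filter", {0, 1, 2, 4}),
    (804, "C0180860", "Filter", {4}),
    (804, "C0581406", "Filter", {4}),
    (804, "C1522664", "Filter", {4}),
    (804, "C1704449", "Filter", {4}),
    (804, "C1704684", "Filter", {4}),
    (804, "C1875155", "FILTER", {4}),
    (717, "C0521360", "Inferior vena caval", {0, 1, 2}),
    (673, "C0042460", "Vena caval", {1, 2}),
    (637, "C0038257", "Stent", {3}),
    (637, "C1705817", "Stent", {3}),
    (637, "C0447122", "Vena", {1}),
]


def best_mappings(candidates):
    """Return candidate subsets that cover the most words with the fewest concepts."""
    best_key, best = None, []
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            covered = set().union(*(c[3] for c in subset))
            # Skip subsets in which two candidates claim the same word.
            if len(covered) != sum(len(c[3]) for c in subset):
                continue
            key = (len(covered), -size)  # more coverage first, then fewer concepts
            if best_key is None or key > best_key:
                best_key, best = key, [subset]
            elif key == best_key:
                best.append(subset)
    return best


for mapping in best_mappings(CANDIDATES):
    print([f"{score} {cui}: {name}" for score, cui, name, _ in mapping])
```

With 12 candidates the search visits only 4,096 subsets; the same exhaustive approach becomes infeasible as the candidate count grows, which is the problem the pruning slides address.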

Metathesaurus String Growth 1990-2011
[Chart: Metathesaurus strings by year, 1990 - 2011; vertical axis in millions, 0 to 10; data labels 162K and 8.86M; annotations: "Growth: 54x", "> 80x", "Growth > 23%/year", "SNOMEDCT", "MEDCIN & FMA"]

An Especially Egregious Example
Phrase from PMID 10931555:
protein-4 FN3 fibronectin type III domain GSH glutathione GST glutathione S-transferase hIL-6 human interleukin-6 HSA human serum albumin IC(50) half-maximal inhibitory concentration Ig immunoglobulin IMAC immobilized metal affinity chromatography K(D) equilibrium constant

- Extreme, but not atypical
- MetaMap identifies 99 concepts
- Mappings are subsets of candidates: up to $2^{99}$ mappings
- Would require $10^{21}$ TB of memory!
- Algorithmic Solutions

Solution - Pruning the Candidate Set
Problems:
- MetaMap runs far too long and/or runs out of memory
- MTI's overnight processing did not complete
- Allow MetaMap to generate (perhaps suboptimal) results in a reasonable amount of time without exceeding memory limits
Solution:
- Prune out least useful candidates
- Default maximum # of candidates: 35/phrase (heuristic)
- Pruning under user control
- Fewer candidates: many fewer mappings

Results of Algorithmic Improvements
- 2010 MEDLINE baseline: 146 troublesome citations
  - Original runtime > 12 hours per citation
  - Improved runtime ~ 12.3 seconds per citation
  - 350,000% improvement for problematic citations
- Efficiency improvements across MEDLINE baseline:
  - 2004 MEDLINE Baseline (12.5M citations): 6 months
  - 2012 MEDLINE Baseline (20.5M citations): 8 days

Significant New MetaMap Features
Solutions for problems
- Default output difficult to post-process: XML output
- MetaMap originally developed for literature, not clinical:
  - Wendy Chapman's NegEx (negation detection)
  - User-Defined Acronyms
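The effect of candidate pruning on the size of the mapping search space, and the runtime figures quoted for the troublesome citations, can be checked with a few lines of arithmetic. This is a minimal sketch: ranking candidates by score is an assumption (the slides only say "least useful candidates" are pruned), and the `prune` and `max_mappings` helpers and the synthetic candidate list are hypothetical.

```python
def prune(candidates, max_candidates=35):
    """Keep the highest-scoring candidates (score-based ranking is an assumption)."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return ranked[:max_candidates]


def max_mappings(n_candidates):
    """Upper bound on mappings when every subset of candidates is a potential mapping."""
    return 2 ** n_candidates


# 99 hypothetical (score, name) candidates standing in for the PMID 10931555 phrase.
candidates = [(1000 - i, f"candidate-{i}") for i in range(99)]
kept = prune(candidates)

print(f"{max_mappings(len(candidates)):.3e}")  # subsets before pruning
print(f"{max_mappings(len(kept)):.3e}")        # subsets after pruning

# Runtime figures quoted on the slide: > 12 hours versus ~ 12.3 seconds per citation.
speedup = 12 * 3600 / 12.3
print(f"{speedup:.0f}x, about {speedup * 100:,.0f}%")
```

Dropping from 99 to 35 candidates shrinks the subset count by a factor of $2^{64}$, and the runtime ratio is consistent with the roughly 350,000% improvement reported for the problematic citations.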

Literature: Author-Defined Acronyms
Acronyms often defined by authors in literature:
- Trimethyl cetyl ammonium pentachlorphenate (TCAP) and fatty acids as antifungal agents.
- Reticulo-endothelial immune serum (REIS) in a globulin fraction
- The bacteriostatic action of isonicotinic acid hydrazid (INAH) on tubercle bacilli
- the interstitial latero-dorsal hypothalamic nucleus (ILDHN) of the female guinea pig
- The adrenocorticotropic hormone (ACTH) of the anterior pituitary.
MetaMap replaces acronyms' short form with their long form

Clinical Text: Undefined Acronyms
Acronyms rarely defined in clinical text:
- He underwent a CABG and PTCA in 2008.
- EKGs show a RBBB with LAFB with 1st AV block
- Sequential LIMA to the diagonal and LAD and sequential SVG to the PLB and PDA and SVG to IM grafts were placed
- status post CABG with a patent LIMA to LAD and SVG to D1 patent
- treatment for PTLD with Rituxan versus CHOP
  - PTLD: post-transplantation lymphoproliferative disorder
  - CHOP: cyclophosphamide, hydroxydaunomycin, Oncovin, and prednisone
MetaMap users can define undefined acronyms
Allows customizations tailored to specific needs

User-Defined Acronyms (UDAs)
Customize UDAs for radiology domain:
  CAT | Computerized Axial Tomography
  PET | Positron Emission Tomography
Otherwise:
  C0031268: Pet (Pet Animal) [Animal]
  C1456682: Pets (Pet Health) [Group Attribute]
  C0007450: Cat (Felis catus) [Mammal]
  C0325090: Cat (Felis silvestris) [Mammal]
  C0524517: Cat (Genus Felis) [Mammal]
  C0325089: cats (Family Felidae) [Mammal]
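The user-defined acronym (UDA) idea described in the acronym slides, a list of "ACRONYM | expansion" pairs applied to text before concept identification, can be sketched as follows. This is an illustration only: the `parse_udas` and `expand` helpers are hypothetical and are not MetaMap's implementation, and whole-word, case-sensitive matching is an assumption.

```python
import re

UDA_TEXT = """\
CABG | coronary artery bypass graft
PTCA | percutaneous transluminal coronary angioplasty
CAT | Computerized Axial Tomography
PET | Positron Emission Tomography
"""


def parse_udas(text):
    """Parse 'ACRONYM | expansion' lines into a dictionary."""
    udas = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines
        acronym, expansion = line.split("|", 1)
        udas[acronym.strip()] = expansion.strip()
    return udas


def expand(sentence, udas):
    """Replace each whole-word acronym with its long form (case-sensitive)."""
    if not udas:
        return sentence
    longest_first = sorted(udas, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, longest_first)) + r")\b")
    return pattern.sub(lambda m: udas[m.group(1)], sentence)


udas = parse_udas(UDA_TEXT)
print(expand("He underwent a CABG and PTCA in 2008.", udas))
```

Expanding "PET" and "CAT" before lookup is what keeps a radiology report from being mapped to the animal concepts shown on the UDA slide.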

The NLM Medical Text Indexer (MTI)
- Overview
- Uses
- MTI as First-Line Indexer (MTIFL)
- Performance

MTI - Overview
- Summarizes input text into an ordered list of MeSH Headings
- In use since mid-2002
- Developed with continued Index Section collaboration
- Uses article Title and Abstract
- Provides recommendations for 96% of indexed articles
- Indexer consulted for 50% of indexed articles

MTI Usage
[Chart: Average Daily Indexer Requests to MTI (2002 - 2011); vertical axis 0 to 3,000; data labels 1,294 and 2,413]

MTI - Uses
- Assisted indexing of Index Section journal articles
- Assisted indexing of Cataloging and History of Medicine Division records
- Automatic indexing of NLM Gateway meeting abstracts
- First-line indexing (MTIFL) since February 2011
- Also available to the Community
  - 45,000 requests (2011)

Data Creation and Management System

MTI processing flow
Input: Title + Abstract
1. MetaMap Indexing (actually found in text): UMLS concepts, then Restrict to MeSH (maps UMLS Concepts to MeSH)
2. PubMed Related Citations (not necessarily found in text): Related Citations, then Extract MeSH Descriptors
3. MeSH Main Headings
4. Clustering & Ranking
5. Ordered list of MeSH Main Headings
6. Apply Indexing Rules, CheckTag Expansion, Subheading Attachment
7. Final Ordered list of MeSH Headings

PubMed Query Example

MTI processing flow, annotated with indexer feedback (March 20, 2012)
- Received 2,330 Indexer Feedbacks
- Incorporated 40% into MTI
- Hibernation should only be indexed for animals, not for "stem cell hibernation"
- Clove (spice) should not be mapped to the verb "cleave"

MTI - Example

MTI as First-Line Indexer (MTIFL)
23 MEDLINE Journals

Normal MTI Processing
- MTI: Processes / Recommends MeSH
- Indexer: Reviews, Selects
- Reviser: Reviews, Selects, Adjusts, Approves
- Indexing Displays in PubMed as Usual

MTI as First-Line Indexer (MTIFL)
23 MEDLINE Journals
MTIFL MTI Processing
- MTI: Processes / Indexes MeSH
- Indexer: Reviews, Selects
- Reviser: Reviews, Selects, Adjusts, Approves
- Indexing Displays in PubMed as Usual
- Index Section Compares MTI and Reviser Indexing

MTIFL
- Experiments in 2010 led by Marina Rappaport
- Microbiology, Anatomy, Botany, and Medical Informatics journals
- Initial experiment involved both Indexers and MTI
- Provided baseline timings and performance

                          Indexer   MTIFL    Diff
  Number of Articles      609       668
  Average Total Minutes   12.05     14.37    +2.32
  Average MHs             11.12     24.75    +13.63

- Identified challenges (and opportunities), all manually added by indexer:
  - Publication Types
  - Chemical Flags
  - Functional annotation of genes

MTIFL
Follow-on experiments focused on reducing MTI revision time:
- Reduce the number of MTI indexing terms
- Focus on journals with few/no Gene Annotation or Chemical Flags

                          Indexer   Initial MTIFL   Final MTIFL   MTIFL Diff
  Average Total Minutes   12.05     14.37           10.01         -4.36
  Average MHs             11.12     24.75           8.58          -16.17

- MTI revision time 2.04 minutes faster than Indexer revised time (10.01 minutes vs 12.05 minutes)
- Pilot project started with 14 journals, expanded to 23 in 2011

MTI - How are we doing?
[Chart: Medical Text Indexer (MTI): Precision, Recall, and F1 Stats (2008 - Present); series: Precision, Recall, F1, MTIFL F1; years 2008 to 2012; vertical axis 0.2500 to 0.7500; annotations: "2011 Focus on Precision versus Recall", "Fruition of 2011 Focus Changes"]

Availability of Indexing Initiative Tools
- Remote Access
  - Web
  - API
- Local Installation
  - Linux
  - Mac OS/X
  - Windows XP/7

Remote Access
- Interactive: small input data (for testing, etc.), immediate results
- Batch: large input data processed using a large pool of computing resources

Local Installation of MetaMap

MetaMap as a UIMA Component
- Allows MetaMap to be used as a UIMA annotator component.
- UIMA (Unstructured Information Management Architecture): a component-based software architecture for the analysis of unstructured information.
- Typical pipeline: Input Text, Tokenizer, POS Tagger, Parser, Named Entity Recognizer, Relation Extractor, Relations
- With MetaMap: Input Text, MetaMap, Relation Extractor, Relations

UIMA-compliant NLP Toolkits
A number of NLP toolkits are UIMA compliant:
- OpenNLP
- clinical Text Analysis and Knowledge Extraction System (cTAKES)
- OpenPipeline

Data File Builder
Provides the ability to create specialized data models for MetaMap:
- UMLS augmented with user data
- UMLS subsets
- Independent knowledge sources (should have notion of concept, synonymy)
  - Ontologies
  - Local Thesauri
  - Other Knowledge Sources

Web Access Statistics (2011)
- Remote Access:
  - 7,500 unique visits from 124 different countries
  - 70,000 Interactive Requests
  - 87,000 Batch Requests
- MetaMap Downloads:
  - 1,050 for MetaMap program: 570 Linux, 200 Mac/OS, 280 Windows
  - 41 for Data File Builder

Enhancing MetaMap and MTI Performance
- MetaMap precision enhancement through knowledge-based Word Sense Disambiguation
- MTI enhancement based on Machine Learning

Word Sense Disambiguation (WSD)
Kids with colds may also have a sore throat, cough, headache, mild fever, fatigue, muscle aches, and loss of appetite.
Candidate MetaMap mappings for cold:
- C0234192: Cold (Cold sensation)
- C0009264: Cold (Cold temperature)
- C0009443: Cold (Common cold)

Knowledge-based WSD
- Compare UMLS candidate concept profile vectors to context of ambiguous word
- Concept profile vectors: words from definition, synonyms and related concepts

  Common cold          Cold temperature
  Weight  Word         Weight  Word
  265     infect       258     temperature
  126     disease      86      hypothermia
  41      fever        72      effect
  40      cough        48      hot

- Candidate concept with highest similarity is predicted

Automatically Extracted Corpus WSD
- MEDLINE contains numerous examples of ambiguous words' context, though not disambiguated
- For each candidate concept of "cold", unambiguous synonyms are turned into a PubMed query:

  Candidate concept   Unambiguous synonyms   Query
  CUI:C0009443        common cold            "common cold"[tiab] OR "acute nasopharyngitis"[tiab]
  CUI:C0009264        cold temperature       "cold temperature"[tiab] OR "low temperature"[tiab]

WSD Method Results
Corpus method has better accuracy than UMLS method:

             UMLS   Corpus
  NLM WSD    0.65   0.69
  MSH WSD    0.81   0.84

MSH WSD data set created using MeSH indexing:
- 203 ambiguous words
- 81 semantic types
- 37,888 ambiguity cases
Indirect evaluation with summarization and MTI correlates with direct evaluation
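The knowledge-based WSD step, comparing each candidate concept's profile vector with the context of the ambiguous word and predicting the most similar concept, can be sketched with the word weights shown on the slide. The slides do not name the similarity measure; cosine similarity is used here as an assumption, and the `cosine` and `disambiguate` helpers are hypothetical.

```python
import math
import re
from collections import Counter

# Word weights transcribed from the concept-profile slide.
PROFILES = {
    "Common cold": {"infect": 265, "disease": 126, "fever": 41, "cough": 40},
    "Cold temperature": {"temperature": 258, "hypothermia": 86, "effect": 72, "hot": 48},
}


def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context, profiles):
    """Return the concept whose profile is most similar to the context words."""
    # Plain lowercase tokens; no stemming, so "infection" would not match "infect".
    context_vector = Counter(re.findall(r"[a-z]+", context.lower()))
    return max(profiles, key=lambda concept: cosine(context_vector, profiles[concept]))


sentence = ("Kids with colds may also have a sore throat, cough, headache, "
            "mild fever, fatigue, muscle aches, and loss of appetite.")
print(disambiguate(sentence, PROFILES))
```

In this sentence only "cough" and "fever" overlap with any profile, and both belong to the Common cold profile, so the Cold temperature candidate scores zero.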

Citation indexed w/ Female, Humans and Male
TI - Documenting the symptom experience of cancer patients.
AB - Cancer patients experience symptoms associated with their disease, treatment, and comorbidities. Symptom experience is complicated, reflecting symptom prevalence, frequency, and severity. Symptom burden is associated with treatment tolerance as well as patients' quality of life (QOL). A convenience sample of patients with the five most common cancers at a comprehensive cancer center completed surveys assessing symptom experience (Memorial Symptom Assessment Survey) and QOL (Functional Assessment of Cancer Therapy). Patients completed surveys at baseline and at 3, 6, 9, and 12 months thereafter. Surveys were completed by 558 cancer patients with breast, colorectal, gynecologic, lung, or prostate cancer. Patients reported an average of 9.1 symptoms, with symptom experience varying by cancer type. The mean overall QOL for the total sample was 85.1, with results differing by cancer type. Prostate cancer patients reported the lowest symptom burden and the highest QOL. The symptom experience of cancer patients varies widely depending on cancer type. Nevertheless, most patients report symptoms, regardless of whether or not they are currently receiving treatment.

MTI enhancement with Machine Learning
- Large number of indexing examples available from MEDLINE
- Two approaches:
  - Semi-automatic generation of indexing rules
  - Indexing algorithm selection through meta-learning

Bottom-up Indexing Approach
- Automatic analysis of citations: selection of terms, production of candidate annotation rules
- Manual examination and processing
- Post-filtering based on machine learning
- Works well with some MeSH headings; e.g., Carbohydrate Sequence

MTI Meta-Learning
- No single method performs better than all evaluated indexing methods
- Manual selection of best performing indexing methods becomes tedious with a large number of MHs
- Select indexing methods automatically based on meta-learning

CheckTags Machine Learning Results
200k citations for training and 100k citations for testing

  CheckTag           F1 before ML   F1 with ML   Improvement
  Middle Aged        1.01%          59.50%       +58.49
  Aged               11.72%         54.67%       +42.95
  Child, Preschool   6.11%          45.40%       +39.29
  Adult              19.49%         56.84%       +37.35
  Male               38.47%         71.14%       +32.67
  Aged, 80 and over  1.50%          30.89%       +29.39
  Young Adult        2.83%          31.63%       +28.80
  Female             46.06%         73.84%       +27.78
  Adolescent         24.75%         42.36%       +17.61
  Humans             79.98%         91.33%       +11.35
  Infant             34.39%         44.69%       +10.30
  Swine              71.04%         74.75%       +3.71

Research - J. Caitlin Sticco
- Introduction to Gene Indexing
- The Gene Indexing Assistant

The Gene Indexing Assistant
An automated tool to assist the indexer in identifying and creating GeneRIFs
- Evaluate the article
- Identify genes
- Make links to Entrez Gene
- Suggest geneRIF annotation
Anticipated Benefits:
- Increase in speed
- Increase in comprehensiveness

Corpus Creation
- Gene mentions: tagged by manually correcting the automated program
- GeneRIF classes: Non-geneRIF, Structure, Function, Expression, Isolation, Reference, and Other
- Claims classes: Putative, Established, or Non-claim
- Discourse classes: Title, Background, Purpose, Methods, Results, Conclusions
- Alternate dataset of 600,000 structured abstracts with similar labels

Identify species

Software Origins
Integrated External Software:
- GNAT from Jörg Hakenberg (includes BANNER for gene identification)
- Linnaeus from Gerner, Nenadic, and Bergman
- Organism Tagger from Naderi et al.
Components Developed In-house:
- Framework
- Hand-curated dictionary
- In-house modules for human gene identification, normalization, and geneRIF extraction

Gene Mention Identification
Filamin a mediates HGF/c-MET signaling in tumor cell migration. Deregulated hepatocyte growth factor (HGF)/c-MET axis has been correlated with poor clinical outcome and drug resistance in many human cancers. In our study, we show that multiple human cancer tissues and cells express filamin A (FLNA), a large cytoskeletal actin-binding protein, and expression of c-MET is significantly reduced in human tumor cells deficient for FLNA.
Gene mentions identified: filamin a, flna, hepatocyte growth factor, c-met

Gene Mention Identification
In-House Components:
- Hand-curated dictionary
  - Derived from Entrez Gene
  - Filtering for problem synonyms
  - Variant creation (reductive tokenization?)
- Strict Dictionary Mapping
External Components:
- GNAT: Conditional Random Fields (CRF) from BANNER

Species Identification and Assignment
External Components
- Identification
  - Linnaeus: includes common names and maps stand-alone genera to most likely species
  - Organism Tagger: includes cell lines and microbial strains
- Assigning genes to species
  - GNAT: Proximity heuristic

Gene Mention Normalization
- c-met: ID: 4233, MET
- hepatocyte growth factor (context: cancer, tumor, cytokine, cell migration)
  - ID: 3082, HGF (Official Name): cell migration, cytokine, tumor
  - ID: 4233, MET (Synonym): oncogene, renal, cancer, tyrosine
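The normalization example above, where "hepatocyte growth factor" matches both HGF (as official name) and MET (as synonym), suggests choosing between dictionary candidates by comparing context words with words associated with each gene. The following is a sketch under stated assumptions: the gene IDs, symbols and associated words are taken from the slide, but the dictionary layout, the `normalize` helper and the overlap-count scoring are hypothetical and are not the Gene Indexing Assistant's actual method.

```python
# Mention -> list of (Entrez Gene ID, symbol, words associated with the gene).
# IDs, symbols and words are transcribed from the normalization slide;
# the layout of this structure is an assumption.
MET_WORDS = {"oncogene", "renal", "cancer", "tyrosine"}
HGF_WORDS = {"cell migration", "cytokine", "tumor"}

GENE_DICTIONARY = {
    "c-met": [(4233, "MET", MET_WORDS)],
    "hepatocyte growth factor": [
        (3082, "HGF", HGF_WORDS),   # official name
        (4233, "MET", MET_WORDS),   # synonym
    ],
}


def normalize(mention, context_words):
    """Strict dictionary lookup; ties between genes are broken by context overlap."""
    candidates = GENE_DICTIONARY.get(mention.lower(), [])
    if not candidates:
        return None  # strict mapping: no fuzzy matching is attempted
    gene_id, symbol, _ = max(candidates, key=lambda c: len(c[2] & context_words))
    return gene_id, symbol


context = {"cancer", "tumor", "cytokine", "cell migration"}
print(normalize("c-MET", context))
print(normalize("hepatocyte growth factor", context))
print(normalize("filamin a", context))
```

Here HGF shares three context words and MET only one, so the ambiguous mention resolves to HGF; a mention missing from the dictionary is left unnormalized rather than guessed.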

Gene Mention Normalization
Identification and Normalization Results

  Species   Recall   Precision   F1
  Human     83%      80%         81%

Identify species

Classifier Results

  Features                            Precision   Recall   F1
  Position (pos)                      72%         73%      72%
  Text (word features)                63%         64%      63%
  Gene Names                          55%         70%      62%
  Discourse (Structured Ab. Labels)   70%         80%      75%
  pos + discourse                     70%         86%      76.89%
  pos + discourse + GO                70%         86%      77.07%

Future Improvements and Research Areas
- Additional preprocessing
  - Expand certain anaphora
- Extracting interaction data
- Expanding the dictionaries
- Improved abbreviation resolution
- Additional training for low-performing species
- Integration of additional identification or normalization software

Research and Outreach Efforts (concl.)

- External Collaboration
  - IBM DeepQA group: applying Watson to health care
- Data Dissemination
  - MEDLINE Baseline Repository
  - WSD test collections
- Biomedical NLP/IR Challenges
  - Tomorrow: LHNCBC Participation in NLP/IR Challenges
  - Text Retrieval Conference (TREC)
    - Genomics track
    - Medical Records track
  - Informatics for Integrating Biology & the Bedside (i2b2) Medical NLP Challenge

Indexing Initiative Top 10 (1/2)
10. MTI Why explanation facility
9. Application of MTI to Cataloging and History of Medicine records
8. The MetaMap UIMA wrapper, increasing MetaMap's availability
7. Significant speedup of MetaMap
6. Collaboration with IBM DeepQA group applying Watson to health care

Indexing Initiative Top 10 (2/2)
5. The development of Gene Indexing Assistant (GIA)
4. More WSD methods with better results
3. Improvement in MTI's performance due to technical enhancements and close collaboration with Index Section
2. Downloadable releases of MetaMap, especially for Windows
1. Inauguration of MTI as a first-line indexer (MTIFL)!

Future Plans
- Continued collaboration with
  - The NLM Index Section
  - IBM and other external organizations
- Planned improvements to MetaMap and MTI such as
  - Expansion/improvement of MTIFL capability
  - Add species detection to MTI for disambiguation and for GIA
  - Further MTI research with Antonio Jimeno-Yepes and Caitlin Sticco
  - Possible high-level MetaMap modularization to facilitate plug and play strategies

Questions
Generated using Wordle (www.wordle.net)
Alan (Lan) R. Aronson, James G. Mork, François-Michel Lang, Willie J. Rogers, Antonio J. Jimeno-Yepes, J. Caitlin Sticco

Extra slides in case of questions

Candidate Pruning: Output Example

protein-4 FN3 fibronectin type III domain GSH glutathione GST glutathione S-transferase hIL-6 human interleukin-6 HSA human serum albumin IC(50) half-maximal inhibitory concentration Ig immunoglobulin IMAC immobilized metal affinity chromatography K(D) equilibrium constant

Candidate Pruning: Output Example
(Total=99; Excluded=13; Pruned=50; Remaining=36)
  783    equilibrium constant [npop]
  780 P  Equilibrium [orgf]
  780 P  Kind of quantity - Equilibrium [qnco]
  780 P  Constant (qualifier) [qlco]
  713    protein K [aapp]
  691    Protein concentration [lbpr]
  671    protein serum [aapp,bacs]
  671    Protein.serum [lbtr]
  656 P  serum K+ [lbpr]
  656    protein human [aapp,bacs]
  653    Human immunoglobulin [aapp,imft,phsu]

User-Defined Acronyms (UDAs)
Simply create a text file with UDA definitions:
  CABG | coronary artery bypass graft
  PTCA | percutaneous transluminal coronary angioplasty
  RBBB | right bundle branch block
  LAFB | left anterior fascicular block
  AV   | aortic valve
  PTLD | post-transplantation lymphoproliferative disorder
  CHOP | cyclophosphamide, hydroxydaunomycin, Oncovin, and prednisone
  LIMA | left internal mammary artery
  LAD  | left anterior descending coronary artery
  SVG  | saphenous vein graft
  PLB  | posterolateral bundle
  PDA  | posterior descending artery
  IM   | internal mammary

Complexity - Composite Phrases
- Pain on the left side of the chest
- Left sided chest pain (C0541828)
- Linguistic variants
- Syntactic processing
- Word order

$10^{21}$ Terabytes of Memory?!
- $10^{21} = 10^{10} \times 10^{11}$ = (10 billion) * (100 billion)
  - 10 billion: 150% of world population
  - 100 billion: required terabytes/person
- Oak Ridge National Labs Cray Jaguar: 300TB

Concepts with at least 300 Synonyms
- 349: C1163679|Water 1000 MG/ML Injectable Solution
- 327: C0874083|Triclosan 3 MG/ML Medicated Liquid Soap
- 312: C0980221|Sodium Chloride 0.154 MEQ/ML Injectable Solution

MSH WSD corpus
[Diagram: UMLS, MH, MEDLINE, Disambiguation corpus]

Meta-learning

ML: Human MeSH heading

  Method                   Average F-measure
  MTI                      0.72
  Naïve Bayes              0.85
  Support vector machine   0.88
  AdaBoostM1               0.92

Accuracy
Accuracy is how close a measured value is to the actual (true) value:
$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
Precision, proportion of relevant predictions:
$$\text{precision} = \frac{TP}{TP + FP}$$
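The precision definition above, together with the usual recall and F-measure formulas, can be checked against the per-heading counts reported on the "Micro/macro averaging" slide. This is a minimal sketch: the `prf` helper is hypothetical, and reading the "Positive" column as the number of gold-standard positives (TP + FN) is an assumption, though it is the reading under which the reported figures are reproduced.

```python
# (true positives, false positives, positives in the gold standard) per MeSH heading,
# transcribed from the micro/macro averaging slide.
COUNTS = {
    "Humans": (66429, 5985, 71484),
    "Male": (24664, 7107, 34463),
    "Female": (25824, 6718, 35501),
}


def prf(tp, fp, positives):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / positives
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


per_heading = {mh: prf(*c) for mh, c in COUNTS.items()}

# Macro: average the per-heading scores, so every heading counts equally.
macro = tuple(sum(s[i] for s in per_heading.values()) / len(per_heading) for i in range(3))

# Micro: pool the counts first, so frequent headings dominate.
micro = prf(*(sum(c[i] for c in COUNTS.values()) for i in range(3)))

print("macro", [round(x, 4) for x in macro])
print("micro", [round(x, 4) for x in micro])
```

Because Humans is both the most frequent and the best-scoring heading, pooling the counts (micro) gives higher figures than averaging the three headings equally (macro).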

Micro/macro averaging

Macro averaging takes into account the category (MH)
Micro averaging does not consider MH

| MH | True Pos | False Pos | Positive | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| Humans | 66,429 | 5,985 | 71,484 | 0.9174 | 0.9293 | 0.9233 |
| Male | 24,664 | 7,107 | 34,463 | 0.7763 | 0.7157 | 0.7448 |
| Female | 25,824 | 6,718 | 35,501 | 0.7936 | 0.7274 | 0.7590 |
| Macro | | | | 0.8291 | 0.7908 | 0.8090 |
| Micro | 116,917 | 19,810 | 141,448 | 0.8551 | 0.8266 | 0.8406 |

MetaMap Indexing (MMI)

Summarizes and scores what is found within a citation:
  Location - Title given more emphasis
  Frequency of occurrence
  Relevancy: MeSH Tree Depth
  MetaMap score
Provides a scored and ordered list of UMLS concepts describing the citation
Provides our best indicator of MeSH Headings

Restrict to MeSH

Allows us to map UMLS concepts to MeSH Headings
Maps nomenclature to MeSH

Encephalitis Virus, California
  ET: Jamestown Canyon virus
  ET: Tahyna virus
  Inkoo virus
  Jerry Slough virus
  Keystone virus
  Melao virus
  San Angelo virus
  Serra do Navio virus
  Snowshoe hare virus
  Trivittatus virus
  Lumbo virus
  South River virus
California Group Viruses
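Returning to the micro/macro averaging slide earlier in this section: its Macro and Micro rows can be reproduced from the per-heading counts. The sketch below uses the True Pos, False Pos, and Positive counts from that slide; the function names are illustrative. It averages the per-heading F-measures for the macro F-measure, since that is the choice that matches the 0.8090 on the slide.

```python
from typing import Dict, Tuple

# (true positives, false positives, positives in the reference indexing)
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "Humans": (66429, 5985, 71484),
    "Male": (24664, 7107, 34463),
    "Female": (25824, 6718, 35501),
}


def prf(tp: int, fp: int, positive: int) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure for one set of counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / positive if positive else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def macro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, ...]:
    """Score each heading separately, then take the unweighted mean."""
    per_heading = [prf(*c) for c in counts.values()]
    n = len(per_heading)
    return tuple(sum(scores[i] for scores in per_heading) / n for i in range(3))


def micro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Pool the counts across headings, then score once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    positive = sum(c[2] for c in counts.values())
    return prf(tp, fp, positive)


macro = macro_average(COUNTS)
micro = micro_average(COUNTS)
print("Macro:", [round(x, 4) for x in macro])
print("Micro:", [round(x, 4) for x in micro])
```

Because micro averaging pools the counts, the high-volume Humans heading dominates the result, while macro averaging gives Male and Female the same weight as Humans.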

PubMed Related Citations (PRC)

Uses PubMed pre-calculated related articles, same as DCMS Related Articles tab
Provides terms not available in title/abstract
Used to filter and support MeSH Headings identified by MetaMap Indexing
Only uses MeSH Headings: no CheckTags, no Subheadings, no Supplemental Concepts
Can provide non-related terms, so heavily filtered

MTI: Initial MTIFL Journals (Feb 18, 2011)

Added September 5, 2011
Added August 18, 2011
Added June 1, 2011

MTI: Added MTIFL Journals (17) (19)

Added October 5, 2011

MTI: Added MTIFL Journals (23)

MTIFL Journal Performance

Current MTIFL Statistics (2012) and Previous Results (2010, 2011)

| Journal | 2012 Articles | 2012 Recall | 2012 Precision | 2012 F1 | Diff 2011 | Diff 2010 | 2010 Articles | 2010 Recall | 2010 Precision | 2010 F1 | 2011 Articles | 2011 Recall | 2011 Precision | 2011 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arch Microbiol | 15 | 57.24% | 58.78% | 58.00% | -1.15% | 4.67% | 103 | 71.50% | 44.01% | 54.48% | 69 | 55.97% | 62.71% | 59.15% |
| Bioinformatics | 113 | 57.98% | 64.65% | 61.13% | 2.38% | 15.75% | 820 | 76.61% | 29.89% | 43.01% | 433 | 53.66% | 64.91% | 58.75% |
| BMC Bioinformatics | 126 | 63.84% | 70.80% | 67.14% | 5.33% | 19.69% | 851 | 77.83% | 28.87% | 42.12% | 403 | 57.13% | 67.33% | 61.81% |
| Can J Microbiol | 22 | 60.53% | 61.69% | 61.10% | -1.37% | 16.22% | 131 | 67.07% | 35.29% | 46.25% | 59 | 61.07% | 63.94% | 62.47% |
| Curr Opin Biotechnol | 29 | 73.81% | 79.15% | 76.39% | 7.99% | 38.45% | 99 | 53.86% | 20.73% | 29.94% | 25 | 59.73% | 80.00% | 68.39% |
| Curr Opin Cell Biol | 0 | 0.00% | 0.00% | 0.00% | 0.00% | 33.39% | 97 | 54.38% | 26.60% | 35.72% | 31 | 70.94% | 67.38% | 69.12% |
| Ecotoxicol Environ Saf | 42 | 69.91% | 79.74% | 74.50% | 2.81% | 27.96% | 122 | 68.92% | 32.03% | 43.73% | 199 | 65.42% | 79.30% | 71.69% |
| Environ Int | 11 | 68.21% | 77.44% | 72.54% | 7.47% | 22.33% | 92 | 55.94% | 34.57% | 42.73% | 54 | 57.20% | 75.44% | 65.06% |
| Environ Microbiol | 58 | 60.92% | 71.62% | 65.84% | 3.55% | 13.98% | 256 | 63.68% | 38.91% | 48.31% | 183 | 58.54% | 66.56% | 62.29% |
| Environ Toxicol | 15 | 75.26% | 76.88% | 76.06% | 5.98% | 25.44% | 49 | 68.25% | 33.17% | 44.65% | 24 | 63.73% | 77.85% | 70.08% |
| Environ Toxicol Chem | 54 | 68.00% | 72.27% | 70.07% | 1.87% | 22.42% | 287 | 66.24% | 34.98% | 45.78% | 111 | 62.44% | 75.13% | 68.20% |
| FEMS Microbiol Ecol | 0 | 0.00% | 0.00% | 0.00% | 0.00% | 8.60% | 178 | 68.62% | 44.11% | 53.70% | 157 | 58.32% | 66.87% | 62.30% |
| Genomics Proteomics Bioinformatics | 0 | 0.00% | 0.00% | 0.00% | 0.00% | 7.29% | 30 | 77.30% | 35.80% | 48.93% | 15 | 50.36% | 63.64% | 56.22% |
| Health Psychol | 20 | 80.36% | 74.75% | 77.45% | 8.06% | 30.29% | 93 | 45.06% | 34.54% | 39.11% | 18 | 67.08% | 71.88% | 69.40% |
| Int J Food Microbiol | 12 | 81.89% | 74.82% | 78.20% | 14.82% | 14.57% | 305 | 69.95% | 37.48% | 48.81% | 272 | 62.48% | 64.31% | 63.38% |
| ISME J | 34 | 64.02% | 62.69% | 63.35% | 1.80% | 15.78% | 122 | 65.03% | 35.31% | 45.77% | 120 | 58.00% | 65.56% | 61.55% |
| J Affect Disord | 130 | 82.60% | 91.44% | 86.80% | 50.47% | New | 338 | 45.32% | 30.32% | 36.33% | 0 | 0.00% | 0.00% | 0.00% |
| J Appl Microbiol | 49 | 59.33% | 65.36% | 62.20% | -0.16% | 16.19% | 562 | 71.73% | 34.04% | 46.17% | 489 | 60.38% | 64.48% | 62.36% |
| J Ind Microbiol Biotechnol | 26 | 71.21% | 81.31% | 75.93% | 10.35% | 19.66% | 107 | 66.90% | 34.95% | 45.92% | 82 | 64.23% | 66.98% | 65.58% |
| J Morphol | 30 | 76.34% | 62.31% | 68.61% | -1.22% | 28.91% | 131 | 65.02% | 29.85% | 40.92% | 64 | 76.85% | 63.98% | 69.83% |
| Lett Appl Microbiol | 60 | 64.14% | 69.27% | 66.61% | -0.06% | 15.00% | 188 | 71.46% | 40.46% | 51.67% | 116 | 65.13% | 68.28% | 66.67% |
| Nord J Psychiatry | 19 | 79.32% | 72.76% | 75.90% | -4.61% | 42.83% | 55 | 43.37% | 33.30% | 37.68% | 9 | 79.17% | 81.90% | 80.51% |
| Vet Microbiol | 25 | 79.69% | 72.73% | 76.05% | 10.01% | 18.82% | 285 | 71.54% | 35.24% | 47.22% | 278 | 64.54% | 67.61% | 66.04% |
| Totals | 890 | 69.99% | 74.85% | 72.34% | 8.35% | 19.67% | 5,301 | 66.64% | 33.19% | 44.31% | 3,211 | 60.74% | 67.60% | 63.99% |

Precision, Recall, F-Measure

10 Indexing, 15 MTI, 3 Matches

Recall: Matches / Indexing = 3/10 = 0.3
Precision: Matches / MTI = 3/15 = 0.2
F1-Measure: (2 * 0.2 * 0.3) / (0.2 + 0.3) = 0.24

MTI - MTIWhy

Received 2,330 Indexer Feedbacks
Incorporated 40% into MTI
March 20, 2012

Why did MTI pick up the term "Crow" in this health services article? This is definitely wrong and needs to be looked into.

Polypeptide aptamer should be indexed as Peptide aptamer (instead of Peptides and Oligonucleotides).

Questions

Alan (Lan) R. Aronson
James G. Mork
François-Michel Lang
Willie J. Rogers
Antonio J. Jimeno-Yepes
J. Caitlin Sticco
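The worked example on the Precision, Recall, F-Measure slide (10 indexer terms, 15 MTI terms, 3 matches) can be reproduced by comparing two sets of terms. This is a minimal sketch, not MTI's evaluation code: `score_terms` is a hypothetical helper and the term strings are placeholders chosen only to give the same counts as the slide.

```python
from typing import Set, Tuple


def score_terms(recommended: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of recommended terms against reference indexing."""
    matches = len(recommended & reference)
    precision = matches / len(recommended) if recommended else 0.0
    recall = matches / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Placeholder terms: 10 from the indexer, 15 from MTI, 3 shared.
indexer_terms = {f"indexer-term-{i}" for i in range(10)}
mti_terms = {"indexer-term-0", "indexer-term-1", "indexer-term-2"} | {
    f"mti-only-{i}" for i in range(12)
}

p, r, f1 = score_terms(mti_terms, indexer_terms)
print(f"Precision: {p:.2f}  Recall: {r:.2f}  F1: {f1:.2f}")
```

Treating the terms as sets means a heading recommended twice counts once, which fits a comparison of final indexing term lists.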
