Automatic Document Categorisation by User Profile in MEDLINE

Euripides G.M. Petrakis, Angelos Hliaoutakis
Intelligent Systems Laboratory (www.intelligence.tuc.gr)
Technical University of Crete (TUC), Chania, Crete, Greece
ISHIMR 2011, Zurich, Switzerland

Problem Definition
  Medical information systems are designed for experts!
    Domain-specific answers to experts
  Must also serve naive consumers
    Easy to read and comprehend information
  Investigate methods for the categorization of information by user profile
    Experts: use complex terms for their searches
    Consumers: do simple searches using natural language terms

Current Practices
  In MEDLINE of the U.S. NLM, documents are indexed by experts
    10-12 MeSH terms per document (pathology, disease, treatment, drugs, etc.)
    Over 15 million documents: slow!
    Automate this process
    No categorization
  MedScape, MedlinePlus, MedHunt rely on the manual categorization of information
    Slow, does not scale up for large collections

Objectives
  Investigate methods for automatic document indexing in MEDLINE
    The extracted index terms are subsequently used for filtering documents by user profile
  Main idea: categorization of terms into simple terms comprehensible by consumers or more involved terms suitable for experts

Resources
  Automatic indexing in MEDLINE
    MMTx [U.S. NLM]: focuses on UMLS rather than MeSH
    AMTEx [DKE, 2009]: MeSH terms, faster and more accurate than MMTx
  Dictionaries for biomedical and health-related concepts
    UMLS Metathesaurus, MeSH
  Dictionaries for general English words
    WordNet, Specialist

MMTx (MetaMap Transfer)
  Developed by the U.S. NLM
  Maps text to UMLS Metathesaurus concepts, but MEDLINE indexing is based on MeSH
    MeSH is a subset of the Metathesaurus
  Suffers from term overgeneration
    Unrelated terms are added to the final candidate list
    Topic drift

HIKM2006 AMTEx

The AMTEx method [DKE 2009]
  Main ideas:
    Initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value
    Extracts general single- and multi-word terms (noun phrases)
    Mainly multi-word terms: heart disease, coronary artery disease
    Extracted terms are validated against MeSH
    Faster than MMTx, with improved precision from only a fifth of MMTx's term output

Example
  Input: full-text article
  MEDLINE index terms: Aged; Data Collection; Humans; Knee; Middle Aged; Osteoarthritis, Knee/complications; Osteoarthritis, Knee/diagnosis; Pain/classification; Pain/etiology; Prospective Studies; Research Support, Non-U.S. Govt
  MMTx terms: osteoarthritis knee, retention, peat, rheumatology, acetylcholine, lysine acetate, potassium acetate, questionnaires, target population, population, selection bias, creativeness, reproduction, cohort studies, europe, couples, naloxone, sample size, arthritis, data collection, mail, health status, respondents, ontario, universities, dna, baseline survey, medical records, informatics, general practitioners, gender, beliefs, logistic regression, female, marital status, employment status, comprehension, surveys, age distribution, manual, occupations, manuals, persons, females, minor, minority groups, incentives, business, ability, comparative study, odds ratio, biomedical research, pubmed, copyright, coding, longitudinal studies, immunoelectrophoresis, skin diseases, government, norepinephrine, social sciences, survey methods, tyrosine, new zealand, azauridine, gold, nonrespondents, cycloheximide, rheum, jordan, cadmium, radiopharmaceuticals, community, disease progression, history
  AMTEx terms: health surveys, pain, review publication type, data collection, osteoarthritis knee, knee, science, health services needs and demand, population, research, questionnaires, informatics, health

Term & Document Categorization
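
The two AMTEx ideas described earlier (rank multi-word candidates with the C-value, then keep only candidates found in MeSH) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the commonly cited C-value formula in simplified form, omits the NC-value context step, and all function names, frequencies, and the tiny "MeSH" set are hypothetical.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_value(freq):
    """Simplified C-value for candidate terms given as {word tuple: frequency}.

    Non-nested term a:  log2(|a|) * f(a)
    Nested term a:      log2(|a|) * (f(a) - mean frequency of the longer
                        candidates that contain a)
    Single-word candidates score 0 because log2(1) = 0.
    """
    scores = {}
    for term, f in freq.items():
        nesting = [f_b for b, f_b in freq.items() if contains(b, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores

def validate_against_mesh(scores, mesh_terms):
    """Rank candidates by C-value and keep only those present in MeSH."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [" ".join(t) for t in ranked if " ".join(t) in mesh_terms]

# Hypothetical candidate frequencies and a toy stand-in for the MeSH term list.
candidates = {
    ("coronary", "artery", "disease"): 4,
    ("artery", "disease"): 6,
    ("heart", "disease"): 5,
    ("baseline", "survey"): 3,
}
toy_mesh = {"coronary artery disease", "heart disease"}

scores = c_value(candidates)
validated = validate_against_mesh(scores, toy_mesh)
print(validated)
```

In this toy run "artery disease" is penalised because four of its six occurrences are inside "coronary artery disease", and "baseline survey" is dropped because it is not in the toy MeSH set.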

New Vocabularies
  Vocabulary of General Terms (VGT): 105,675 general (WordNet) terms
    VGT = (WordNet) - (MeSH)
  Vocabulary of Consumer Terms (VCT): 7,165 consumer (MeSH) terms
    VCT = (WordNet) ∩ (MeSH)
  Vocabulary of Expert Terms (VET): 16,719 expert (MeSH) terms
    VET = (MeSH) - (WordNet)

Document Categorization
  Documents are represented by vectors of terms extracted by AMTEx or MMTx, or assigned by human experts
  The more VET (VCT) terms a document contains, the higher its probability of being suitable for experts (consumers)
    E.g., a document with VET% = 0.62 has a 62% probability of being suitable for experts

Evaluation
  Precision and recall measures: a good method has high values of both
  Datasets:
    OHSUMED: 348,566 MEDLINE abstracts that come with 64 queries and their relevant answers
  Ground truth: the set of MeSH index terms assigned to documents by experts

Categorization by User Profile
  How good is the method at retrieving answers for consumers and experts?
  We run retrievals for consumers and experts
    15 of the 64 queries contain no expert terms and are suitable for consumers
    The remaining queries are suitable for experts
  Documents are represented by document vectors of MeSH, MMTx, or AMTEx terms
  The retrieval method is the Vector Space Model (VSM)
  The VSM similarity score of a document is multiplied by its respective VET or VCT score

[Figure: Consumers Retrieval Task]

[Figure: Experts Retrieval Task]
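
The vocabulary definitions and the profile-weighted retrieval score described in the slides above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the word lists are tiny hypothetical stand-ins for WordNet and MeSH, VET% and VCT% are assumed to be the fraction of a document's terms found in each vocabulary (the slides do not give the exact formula), and the VSM is reduced to cosine similarity over raw term counts.

```python
import math
from collections import Counter

# Toy stand-ins for the real term lists (hypothetical contents).
wordnet = {"pain", "knee", "heart", "population", "gold"}
mesh = {"pain", "knee", "heart", "osteoarthritis knee", "naloxone", "neuralgia"}

VGT = wordnet - mesh   # general terms: in WordNet only
VCT = wordnet & mesh   # consumer terms: in both WordNet and MeSH
VET = mesh - wordnet   # expert terms: in MeSH only

def profile_scores(doc_terms):
    """Return (VET%, VCT%), assumed here to be fractions of the document's terms."""
    if not doc_terms:
        return 0.0, 0.0
    n = len(doc_terms)
    vet = sum(t in VET for t in doc_terms) / n
    vct = sum(t in VCT for t in doc_terms) / n
    return vet, vct

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def profile_weighted_score(query_terms, doc_terms, profile):
    """VSM similarity multiplied by the document's VET% (expert) or VCT% (consumer)."""
    vet, vct = profile_scores(doc_terms)
    weight = vet if profile == "expert" else vct
    return cosine(Counter(query_terms), Counter(doc_terms)) * weight

doc = ["pain", "osteoarthritis knee", "naloxone", "neuralgia"]
query = ["pain"]
expert_score = profile_weighted_score(query, doc, "expert")
consumer_score = profile_weighted_score(query, doc, "consumer")
print(profile_scores(doc), expert_score, consumer_score)
```

With three of its four terms in VET, the toy document scores higher for the expert profile than for the consumer profile on the same query, which is the filtering effect the weighting is meant to produce.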

Results
  Consumers retrieval task:
    Retrievals with the manually assigned MeSH terms perform better
    MMTx and AMTEx perform equally well
  Experts retrieval task:
    Retrievals with AMTEx perform better
  The results indicate
    A tendency of human experts to assign simple terms to documents, and
    A selective ability of AMTEx to extract complex terms suitable for experts

Conclusions & Future Work
  We investigate methods for:
    Automatic document indexing
    Categorization by user profile
  AMTEx is very well suited for both problems
  Future work: more elaborate document categorization methods (machine learning, fuzzy)
    More categories
      According to UMLS SN (pathology, treatment)
      User categories (e.g., specialty)

Questions and answers

Outline
  [Figure: AMTEx processing outline; box labels recovered from the slide]
  INPUT: Document Collection
  Multi-word Term Extraction (C/NC value)
  Term Ranking
  MeSH Term Validation (resource: MeSH Thesaurus)
  Single-word Term Extraction: non-MeSH multi-word terms are broken down and validated against MeSH
  Variant Generation
  Term Expansion (MeSH)
  OUTPUT: MeSH Term Lists

AMTEx vs MMTx
  AMTEx: faster, with improved precision from only a fifth of MMTx's term output
  [Table: columns are Data Set, Method, Number of Terms, Precision, Recall, Time (hours); data sets are OHSUMED and PMC; methods are AMTEx and MMTx. Only the values 8, 40, 0.125, and 0.089 survive, and their cell positions are not recoverable.]

MeSH: Medical Subject Headings
  The NLM thesaurus of medical and biological terms:
    Organized in IS-A hierarchies
      More than 15 taxonomies and more than 22,000 terms
      A term may appear in multiple taxonomies
    No PART-OF relationships
    Terms are organized into synonym sets called entry terms, including stemmed term forms

Fragment of the MeSH IS-A Hierarchy
  [Figure: tree diagram; node labels in slide order] Root, Nervous system diseases, Neurologic manifestations, Cranial nerve diseases, Pain, Headache, Neuralgia, Facial neuralgia
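
A small sketch of how an IS-A hierarchy in which a term may have several parents can be represented and traversed. The node names come from the fragment above, but the parent links are an assumed reading of the lost tree layout (in particular, that "facial neuralgia" sits under both "neuralgia" and "cranial nerve diseases"); `PARENTS`, `paths_to_root`, and `ancestors` are hypothetical names, not part of any MeSH tool.

```python
# child -> list of parents; the edges are an assumption, not taken from the slide.
PARENTS = {
    "nervous system diseases": ["root"],
    "neurologic manifestations": ["nervous system diseases"],
    "cranial nerve diseases": ["nervous system diseases"],
    "pain": ["neurologic manifestations"],
    "headache": ["pain"],
    "neuralgia": ["pain"],
    "facial neuralgia": ["neuralgia", "cranial nerve diseases"],
}

def paths_to_root(term):
    """All IS-A paths from `term` up to a node with no parents."""
    parents = PARENTS.get(term, [])
    if not parents:
        return [[term]]
    return [[term] + path for p in parents for path in paths_to_root(p)]

def ancestors(term):
    """Every node reachable from `term` by following IS-A links upward."""
    return {node for path in paths_to_root(term) for node in path[1:]}

for path in paths_to_root("facial neuralgia"):
    print(" -> ".join(path))
```

Storing parents as a list, rather than a single parent, is what lets one term appear in more than one branch, as the MeSH slide notes.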