MMI HLDA - University of Colorado Boulder

OntoNotes: A Unified Relational Semantic Representation
Sameer Pradhan, Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel
http://www.bbn.com/ontonotes

Outline
- Multiple layers of annotation and modeling capture useful elements of text meaning at 90% ITA
  - Syntax
  - Proposition
  - Word sense
  - Ontology
  - Coreference
  - Names
- An integrated relational database representation
  - Enforces consistency across the different annotations
  - Supports integrated models that can combine evidence from different layers
- Some practical issues
  - Sensitivity to changes in layers
  - Adding a new layer to the data
- Few lessons learned

Problems with Multiple Layers of Annotation
- Not previously available
  - A number of these layers have not been available in significant quantity before: Word Sense, Coreference
- Not previously integrated
  - Each layer encoded separately as individual files, requiring supporting documentation for interpretation
- Not previously completely consistent
  - Mismatches between Treebank and PropBank
- Not previously user friendly
  - Raw text format

Unified Representation
- Provide a bare-bones representation, independent of the individual layers' semantics, that can
  - Efficiently capture intra- and inter-layer semantics
  - Maintain component independence (facilitate collaboration)
  - Provide mechanism for flexible integration (for an application)
- Relational Database
  - Integrate information at the required level of granularity
  - Data storage as close as possible to an application backend
  - Adaptable in face of incremental representational changes
- Object Oriented API
  - API extremely accessible (don't need to be a hacker to use it)
  - Ability to easily perform cross-layer queries
  - Easily extensible
- Capable of maintaining version information
  - Ideally at different possible levels

Relational Representation
[Diagram: Corpus, Trees, Coreference, Senses, Propositions, Names]

Example: Database Representation of Syntax
- Treebank tokens (stored in the Token table) provide the common base
- The Tree table stores the recursive tree nodes, each with its span
- Subsidiary tables define the sets of function tags, phrase types, etc.

Object Oriented API
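
To make the syntax tables described above concrete, here is a minimal sketch in Python with SQLite. It is an illustration only: the table and column names (`token`, `phrase_type`, `tree`, `start_token`, `end_token`), the helper functions, and the toy phrase structure are assumptions made for this example, not the actual OntoNotes schema or API.

```python
import sqlite3

# Hypothetical, simplified schema. The real OntoNotes tables may differ.
SCHEMA = """
CREATE TABLE token (
    document_id TEXT NOT NULL,
    position    INTEGER NOT NULL,   -- 0-based token index in the document
    word        TEXT NOT NULL,
    PRIMARY KEY (document_id, position)
);
CREATE TABLE phrase_type (          -- subsidiary table of allowed node labels
    id TEXT PRIMARY KEY
);
CREATE TABLE tree (
    id          INTEGER PRIMARY KEY,
    document_id TEXT NOT NULL,
    parent_id   INTEGER REFERENCES tree(id),   -- NULL for the root node
    phrase_type TEXT NOT NULL REFERENCES phrase_type(id),
    start_token INTEGER NOT NULL,   -- first token covered (inclusive)
    end_token   INTEGER NOT NULL    -- last token covered (inclusive)
);
"""


def build_example():
    """Create an in-memory database holding one toy phrase."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    words = ["troops", "in", "central", "Europe"]
    conn.executemany(
        "INSERT INTO token (document_id, position, word) VALUES (?, ?, ?)",
        [("doc1", i, w) for i, w in enumerate(words)],
    )
    conn.executemany("INSERT INTO phrase_type (id) VALUES (?)", [("NP",), ("PP",)])
    # Illustrative structure: NP -> NP(troops) PP(in NP(central Europe))
    nodes = [
        (1, "doc1", None, "NP", 0, 3),
        (2, "doc1", 1, "NP", 0, 0),
        (3, "doc1", 1, "PP", 1, 3),
        (4, "doc1", 3, "NP", 2, 3),
    ]
    conn.executemany("INSERT INTO tree VALUES (?, ?, ?, ?, ?, ?)", nodes)
    return conn


def node_text(conn, node_id):
    """Recover the words under a tree node from its token span."""
    document_id, start, end = conn.execute(
        "SELECT document_id, start_token, end_token FROM tree WHERE id = ?",
        (node_id,),
    ).fetchone()
    rows = conn.execute(
        "SELECT word FROM token WHERE document_id = ? "
        "AND position BETWEEN ? AND ? ORDER BY position",
        (document_id, start, end),
    )
    return " ".join(row[0] for row in rows)
```

Because every node stores only a span over the shared token table, any other layer that also points at token spans or node ids can be joined against the trees without duplicating the text.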

Using the API
- Importing the modules
- Creating Skeleton Objects
- Creating Full-fledged Objects (I)
- Creating Full-fledged Objects (II)
- Writing to the database
- Reading from the Database

Data Loading Life-cycle
[Diagram: data loading life-cycle, ending in the Database]

OntoNotes Data: Current and Future
[Table: data sizes by language (Eng, Chi, Ara) and genre (NW, BN, BC) for OntoNotes 1.0, 2.0, and 3.0. The extracted values (100, 150, 200, 250, 300, 500) can no longer be matched to their cells.]

Advantages of an Integrated Representation
- Clean, consistent layers
  - Resolve the inconsistencies and problems that this reveals
- Well defined relationships
  - Database schema defines the merged structure efficiently
- Extract individual views
  - Treebank, PropBank, etc.
  - SQL queries can extract examples based on multiple layers or define new views
  - Python Object-oriented API allows for programmatic access to tables and queries

Example of Database Query Function
What is the distribution of named entities that are ARG0s of the predicate "say"?
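
The question above touches three layers at once (propositions, syntax nodes, and names), so it can also be posed as a single SQL join. The sketch below is a hedged illustration using an in-memory SQLite database. The `argument`, `argument_node`, and `name_entity` tables and their columns follow the names used in this deck's query code, while the `proposition` table, the toy rows, and the `ne_distribution` function are assumptions introduced for this example. It counts only entities aligned exactly with the argument node and does not descend into that node's subtrees.

```python
import sqlite3

# Toy tables. The proposition table and all rows are hypothetical.
SETUP = """
CREATE TABLE proposition   (id INTEGER PRIMARY KEY, lemma TEXT);
CREATE TABLE argument      (id INTEGER PRIMARY KEY, proposition_id INTEGER, type TEXT);
CREATE TABLE argument_node (argument_id INTEGER, node_id TEXT);
CREATE TABLE name_entity   (subtree_id TEXT, type TEXT);

INSERT INTO proposition VALUES (1, 'say'), (2, 'tell'), (3, 'say'), (4, 'say');
INSERT INTO argument VALUES
    (10, 1, 'ARG0'), (11, 1, 'ARG1'), (12, 2, 'ARG0'),
    (14, 3, 'ARG0'), (15, 4, 'ARG0');
INSERT INTO argument_node VALUES
    (10, 'n1'), (11, 'n2'), (12, 'n3'), (14, 'n4'), (15, 'n5');
INSERT INTO name_entity VALUES
    ('n1', 'PERSON'), ('n2', 'GPE'), ('n3', 'ORG'),
    ('n4', 'PERSON'), ('n5', 'ORG');
"""

# One statement joins the proposition, argument, and name layers.
QUERY = """
SELECT ne.type, COUNT(*) AS frequency
FROM proposition AS p
JOIN argument      AS a  ON a.proposition_id = p.id
JOIN argument_node AS an ON an.argument_id = a.id
JOIN name_entity   AS ne ON ne.subtree_id = an.node_id
WHERE p.lemma = ? AND a.type = ?
GROUP BY ne.type
ORDER BY frequency DESC, ne.type
"""


def make_toy_db():
    """Build the in-memory toy database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn


def ne_distribution(conn, lemma="say", arg_type="ARG0"):
    """Return (entity type, count) pairs for the given predicate and role."""
    return conn.execute(QUERY, (lemma, arg_type)).fetchall()
```

The join uses bound parameters (`?`) rather than string formatting, which avoids quoting problems when a lemma contains an apostrophe.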

    import collections

    # Frequency of each named-entity type; unseen types start at 0.
    ne_hash = collections.defaultdict(int)

    # a_proposition_bank, a_tree_document, a_tree_id and a_cursor (a database
    # cursor whose rows can be indexed by column name) are not defined on the
    # slide and must be set up by the surrounding code.
    for a_proposition in a_proposition_bank:
        # Step 1: keep only propositions whose lemma is "say".
        if a_proposition.lemma != "say":
            continue

        # Step 2: fetch the arguments of this proposition.
        arg_in_p_q = "select * from argument where proposition_id = '%s';" % (a_proposition.id)
        a_cursor.execute(arg_in_p_q)
        argument_rows = a_cursor.fetchall()

        for a_argument_row in argument_rows:
            a_argument_id = a_argument_row["id"]
            a_argument_type = a_argument_row["type"]

            # Step 3: keep only ARG0 arguments.
            if a_argument_type != "ARG0":
                continue

            n_in_arg_q = "select * from argument_node where argument_id = '%s';" % (a_argument_id)
            a_cursor.execute(n_in_arg_q)
            argument_node_rows = a_cursor.fetchall()

            for a_argument_node_row in argument_node_rows:
                a_node_id = a_argument_node_row["node_id"]

                # Count named entities aligned with the argument node itself.
                a_ne_node_query = "select * from name_entity where subtree_id = '%s';" % (a_node_id)
                a_cursor.execute(a_ne_node_query)
                ne_rows = a_cursor.fetchall()
                for a_ne_row in ne_rows:
                    a_ne_type = a_ne_row["type"]
                    ne_hash[a_ne_type] = ne_hash[a_ne_type] + 1

                # Step 4: count named entities aligned with any subtree of that node.
                a_tree = a_tree_document.get_tree(a_tree_id)
                a_node = a_tree.get_subtree(a_node_id)
                for a_child in a_node.subtrees():
                    a_ne_subtree_query = "select * from name_entity where subtree_id = '%s';" % (a_child.id)
                    a_cursor.execute(a_ne_subtree_query)
                    ne_subtree_rows = a_cursor.fetchall()
                    for a_ne_subtree_row in ne_subtree_rows:
                        a_subtree_ne_type = a_ne_subtree_row["type"]
                        ne_hash[a_subtree_ne_type] = ne_hash[a_subtree_ne_type] + 1

Result:

    Name Entity    Frequency
    Person         84
    GPE            34
    Organization   29
    NORP           15
    ...            ...

Reconciling Treebank and PropBank
- We found several mismatches between syntax and propositions

  - Sometimes PropBank was right
  - Sometimes Treebank was right
- Guidelines modified to bring the two in line
- Now each argument points to a single node in the tree
  - Secondary connections are made using Treebank trace chains
  - Almost no discontinuous arguments
  - Non-trace connections are explicitly identified
- This greater consistency will make it easier to train models that predict argument structure

Sensitivity to Changes: PropBank changes
[Tree diagram: parse of "... major reductions and realignments of troops in central Europe ..." (nodes S, NP, PP, JJ, NNS, CC, IN, NNP) with argument labels ARG2, ARG1, and ARGM-LOC attached to nodes]

Sensitivity to Changes: Treebank changes
- If a node is deleted, remove the associated annotation, if any
- If any node has a change in its children or parent node, update the associated annotation
- Print the new PropBank
[Tree diagram: revised parse of "... major reductions and realignments of troops in central Europe ..." (nodes S, NP, PP, JJ, NNS, CC, IN, NNP)]
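
The two Treebank-change rules above can be sketched as a small classification pass over the annotations that point into a tree. This is a minimal illustration, not the project's implementation: the `Node` class, the `reconcile` function, and the toy data are hypothetical, and the sketch assumes node ids stay stable between the old and new versions of a tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Structural context of a tree node: its parent and ordered children."""
    parent_id: Optional[str]
    child_ids: Tuple[str, ...]


def reconcile(
    old_tree: Dict[str, Node],
    new_tree: Dict[str, Node],
    annotations: Dict[str, str],
) -> Tuple[List[str], List[str], List[str]]:
    """Classify annotations (annotation id -> node id) after a tree revision.

    Returns (kept, to_update, removed):
      removed   - the node was deleted, so the annotation is dropped
      to_update - the node's parent or children changed
      kept      - the node is structurally unchanged
    """
    kept, to_update, removed = [], [], []
    for annotation_id, node_id in annotations.items():
        if node_id not in new_tree:
            removed.append(annotation_id)
        elif new_tree[node_id] != old_tree.get(node_id):
            to_update.append(annotation_id)
        else:
            kept.append(annotation_id)
    return kept, to_update, removed


# Toy revision: node n3 is deleted and its child n4 is re-attached to n1.
OLD_TREE = {
    "n1": Node(None, ("n2", "n3")),
    "n2": Node("n1", ()),
    "n3": Node("n1", ("n4",)),
    "n4": Node("n3", ()),
}
NEW_TREE = {
    "n1": Node(None, ("n2", "n4")),
    "n2": Node("n1", ()),
    "n4": Node("n1", ()),
}
ANNOTATIONS = {"a1": "n1", "a2": "n2", "a3": "n3", "a4": "n4"}
```

In this toy revision, `a3` is removed with its node, `a1` and `a4` are flagged for update because their nodes' children or parent changed, and `a2` is kept as is.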

Adding a new layer
1. What information do you want to capture?
2. Define relationship with the required layer
3. Design tables
4. Superimpose on existing machinery with respect to the anchor
5. Create a class in the corpora package
   a. Define a few specific functions
      - Create object from original annotation (Text Reader)
      - Write object to database (DB Writer)
      - Create object from database (DB Reader)
      - Write database to original format (Text Writer)
      - Pretty print function (Pretty Printer)
   b. Write at least one alignment function at the level where the enrichment is required, or even multiple levels
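
As a rough illustration of step 5, the sketch below shows what a class for a new layer might look like, with one method for each of the five functions listed above plus a simple alignment function. Everything here is hypothetical: the `ToyLayer` and `SpanAnnotation` classes, the `toy_layer` table, and the one-annotation-per-line text format ("start end label") are assumptions for this example and are not part of the OntoNotes corpora package.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SpanAnnotation:
    start: int   # first token covered (inclusive)
    end: int     # last token covered (inclusive)
    label: str


class ToyLayer:
    """Hypothetical annotation layer anchored on token spans."""

    TABLE_SQL = (
        "CREATE TABLE IF NOT EXISTS toy_layer "
        "(document_id TEXT, start_token INTEGER, end_token INTEGER, label TEXT)"
    )

    def __init__(self, document_id, annotations):
        self.document_id = document_id
        self.annotations = list(annotations)

    @classmethod
    def from_text(cls, document_id, text):
        """Text Reader: one 'start end label' annotation per line."""
        annotations = []
        for line in text.splitlines():
            if line.strip():
                start, end, label = line.split()
                annotations.append(SpanAnnotation(int(start), int(end), label))
        return cls(document_id, annotations)

    def write_to_db(self, conn):
        """DB Writer."""
        conn.execute(self.TABLE_SQL)
        conn.executemany(
            "INSERT INTO toy_layer VALUES (?, ?, ?, ?)",
            [(self.document_id, a.start, a.end, a.label) for a in self.annotations],
        )

    @classmethod
    def from_db(cls, conn, document_id):
        """DB Reader."""
        rows = conn.execute(
            "SELECT start_token, end_token, label FROM toy_layer "
            "WHERE document_id = ? ORDER BY start_token, end_token",
            (document_id,),
        ).fetchall()
        return cls(document_id, [SpanAnnotation(*row) for row in rows])

    def to_text(self):
        """Text Writer: emit the original one-line-per-annotation format."""
        return "\n".join(f"{a.start} {a.end} {a.label}" for a in self.annotations)

    def pretty(self, tokens):
        """Pretty Printer: show each label next to the words it covers."""
        return "\n".join(
            f"[{a.label}] {' '.join(tokens[a.start:a.end + 1])}"
            for a in self.annotations
        )

    def align(self, node_spans):
        """Alignment: id of the tree node with the same span, else None.

        node_spans maps node id -> (start_token, end_token).
        """
        by_span = {span: node_id for node_id, span in node_spans.items()}
        return [by_span.get((a.start, a.end)) for a in self.annotations]
```

A round trip through `from_text`, `write_to_db`, `from_db`, and `to_text` should reproduce the input file, which gives a cheap consistency check whenever a layer is loaded.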

[Slide label: Enrich Treebank/Document/]

Few Errors Found
- Missing co-indices in Trees (found during loading)
- Invalid sense numbers (while checking against repository)
- Multiple sense definitions (in the repository)
- Validation errors in schemas
- Dead pointers in ontology
- Multiple coreference chain memberships
- Missing/Invalid predicate/argument pointers
- Invalid PB/TB merges
- Filename/Content mismatches
- Pinyin/Unicode inconsistencies
- Varying sentence breaks
- SLINK Errors
- Inconsistent TB
- Empty specifications in the merge process
- Typos (found through Type Tables)
- ... and a few annotation errors

Some Interesting Problems Addressed
- Word sense annotation transferred from old Treebank to new Treebank
- Coreference annotation transferred to new Treebank
- Treebank/PropBank with or without NMLs reside in harmony
- Various levels of data quality identified in the database
- Varying styles of marking traces normalized
- Language specific idiosyncrasies in inventories and frames normalized
- Data generated for annotation
  - Eventive nouns
  - Coreference

Few Lessons Learned
- Each layer should
  - abide by a minimum dependency principle
  - adhere to a well defined schema
- Try to maintain consistency across representation of similar components
- Use a centralized, version controlled repository
- Need for single-point, push-button loading philosophy

Conclusion
- A lot of annotation layers available, integrated using a relational schema
- An extensible, relational/object oriented architecture available to the community
  - Easily accessible through Python API
  - SQL queries
- Unencumbered, open source!!
- OntoNotes Release 2.0 available from LDC

Backup

Syntax Layer
- Identifies meaningful phrases in the text
- Lays out the structure of how they are related

Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe also are being registered at the Pentagon .

[Tree diagram (SYNTAX): parse of "... major reductions and realignments of troops in central Europe ..." with nodes S, NP, PP, JJ, NNS, CC, IN, NNP]

Propositional Structure
- Tells who did what to whom
- For both verbs and nouns

[Tree diagram: the same parse of "... major reductions and realignments of troops in central Europe ..." with argument labels ARG2, ARG1, and ARGM-LOC attached to nodes]

Predicate Frames
- Predicate frames define the meanings of the numbered arguments

Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe also are being registered at the Pentagon .

reduction: reduce.01 (Make less)
  ARG0  Agent
  ARG1  Thing falling
  ARG2  Amount fallen
  ARG3  Starting point
  ARG4  Ending point
[Argument fillers shown on the slide: "the troops", "major", "-"]

Word Sense and Ontology
- Meanings of nouns and verbs are specified using a catalog of possible senses
- All the senses are annotatable at 90% ITA
- Ontology links (currently being added) capture similarities between related senses of different words

Word Sense
aim
  1. Point or direct object, weapon, at something ...
  2. Wish, purpose or intend to achieve something
register
  1. Enter into an official record
  2. Be aware of, enter into someone's consciousness
  3. Indicate a measurement
  4. Show in one's face

Coreference
- Identifies different mentions of the same entity within a document
  - especially links definite, referring noun phrases, and pronouns to their antecedents
- Two types tagged: Identity and Attributive

[Diagram: mentions in and around the example sentence (President Bush, He, conventional arms talk, the Vienna talks, the Pentagon, Pentagon) linked to entities e0, e1, e2]
