Language Identification and Part-of-Speech Tagging
Keren Solodkin
Based on a paper by Sarah Schulz and Mareike Keller
Digital Humanities Seminar 2016

Plan
- Introduction and Related Work
- Training Data
- Processing of Mixed Text
- Results
- Tools for Digital Humanities
- Conclusion and Future Work
Introduction: Code Switching
- Two or more linguistic varieties in a single conversation
- Highly frequent in spoken language and in social media
- Can also be observed in medieval writing
- Historical mixed text is an unused source of information

Introduction: The Project
- Automatic language identification (LID) and POS tagging
- Mixed Latin-Middle English text
- Make tools available to Humanities scholars
- Analysis of code-switching rules within nominal phrases
- Historical multilingualism research
- Computational linguistics

Related Work: LID
- Lyu and Lyu (2008): Mandarin-Taiwanese
- Solorio and Liu (2008): Spanish-English
- Yeong and Tan (2011): Malay-English
Related Work: POS Tagging
- Solorio and Liu (2008)
- Rodrigues and Kübler (2013)
- Jamatia et al. (2015)
Training Data
- Macaronic sermons (Horner, 2006)
- Mixed Latin-Middle English text
- Annotated with language and part-of-speech information
- The annotated corpus comprises about 3000 tokens in 159 sentences, with an average length of 19.4 tokens

Table 1: Labels annotated for LID, along with an explanation for each label and its occurrence in percent.

Table 2: Labels annotated for POS tagging, along with an explanation for each label and its occurrence in percent.
Processing of Mixed Text
- Two models:
  - POS tagging builds upon the results of the LID
  - POS tagging and LID do not inform each other
- LID is a first step toward any further processing of mixed text
- LID needs to be solved with high accuracy
Processing of Mixed Text: LID
- Solorio and Liu (2008)
- No lemmatizer is available for Middle English
- Include POS-informed word lists for both languages
  - Middle English: Penn Parsed Corpora of Historical English
  - Latin: the Universal Dependency treebank
- If a word is found in one of the lists, its POS is added

Processing of Mixed Text: CRF Classifiers
- Conditional Random Fields
- Take context into account
- Set of feature functions with weights
Processing of Mixed Text: LID
- CRF classifiers are known to be successful for sequence labeling tasks
- Latin is characterized by a relatively restricted suffix assignment
- A context window of 5 tokens was used on all features
Processing of Mixed Text: LID
Feature functions:
1. Surface form
2. POS tag (Latin)
3. POS tag (Middle English)
4. POS from Middle English word list
5. POS from Latin word list
6. Character-unigram prefix
7. Character-bigram prefix
8. Character-trigram prefix
9. Character-unigram suffix
10. Character-bigram suffix
11. Character-trigram suffix
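The surface and affix features in the list above (features 1 and 6 to 11), together with the 5-token context window, can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the feature-name scheme, and the assumption that the window is symmetric (two tokens on each side) are all hypothetical, and the POS-based features 2 to 5 are left out because they depend on external taggers and word lists.

```python
from typing import Dict, List


def affix_features(token: str) -> Dict[str, str]:
    """Surface form plus character n-gram prefixes and suffixes (n = 1..3)."""
    feats = {"surface": token}
    for n in (1, 2, 3):
        feats[f"prefix{n}"] = token[:n]    # first n characters
        feats[f"suffix{n}"] = token[-n:]   # last n characters
    return feats


def window_features(tokens: List[str], i: int, size: int = 5) -> Dict[str, str]:
    """Collect the features of every token in a window centred on position i.

    Assumption: the 5-token window is symmetric, i.e. two tokens to the
    left and two to the right of the current token.
    """
    half = size // 2
    feats: Dict[str, str] = {}
    for offset in range(-half, half + 1):
        j = i + offset
        if 0 <= j < len(tokens):  # skip positions outside the sentence
            for name, value in affix_features(tokens[j]).items():
                feats[f"{offset:+d}:{name}"] = value
    return feats
```

A dictionary like this, one per token, is the usual input shape for CRF toolkits; which toolkit was used for the systems described here is not stated on the slides.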
Processing of Mixed Text: POS Tagging
- For POS tagging, the same features are used
- Information generated by the LID system is added (feature 12a)
- The performance is also evaluated with the gold LID (feature 12b)
- Differences in the quality of LID influence the POS tagging quality

Processing of Mixed Text: POS Tagging
Features (continued):
12a. LID label predicted by the LID system
12b. Gold LID label manually annotated for our corpus
Results
- The evaluation was a 10-fold cross-validation
- 90% for training, 10% for testing
- The reported results are averaged over all test folds
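The 10-fold setup can be sketched as below. This is a minimal illustration under stated assumptions: the slides do not say whether folds were drawn over sentences or tokens, nor how items were assigned to folds, so the round-robin assignment and the function names here are hypothetical.

```python
from typing import List, Sequence, Tuple


def k_fold_splits(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs.

    Every item lands in exactly one test fold; with k = 10 each fold
    trains on roughly 90% of the items and tests on the remaining 10%.
    """
    indices = list(range(n_items))
    splits = []
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        splits.append((train, test))
    return splits


def average(scores: Sequence[float]) -> float:
    """Mean of the per-fold scores, as reported on the slide."""
    return sum(scores) / len(scores)
```

For example, `k_fold_splits(159)` would split 159 sentences into ten folds of 15 or 16 test sentences each, assuming the split is made at sentence level.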
Results: LID
- Majority baseline
  - The text is Latin featuring Middle English insertions
  - The baseline is a combination of Latin and perfect punctuation labeling
- Per-class precision, recall and F-score
- Macro-averages for the overall system
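The per-class scores and their macro-average can be computed along these lines. This is a sketch only: the label names in the usage example are made up (the actual LID label set is in Table 1, which is not reproduced here), and the macro F-score is taken as the mean of the per-class F-scores, which is an assumption about how the figures were obtained.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F-score)


def per_class_prf(gold: List[str], pred: List[str]) -> Dict[str, Scores]:
    """Precision, recall and F-score for every label seen in gold or pred."""
    scores: Dict[str, Scores] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        scores[label] = (precision, recall, f_score)
    return scores


def macro_average(scores: Dict[str, Scores]) -> Scores:
    """Unweighted mean over classes, so rare labels count as much as frequent ones."""
    n = len(scores)
    p, r, f = (sum(s[i] for s in scores.values()) / n for i in range(3))
    return (p, r, f)
```

Because every class has the same weight, a macro-average is sensitive to errors on infrequent labels, which matters when one language dominates the text.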
Results: LID

Table 3: Performance of the CRF system for language identification compared to the baseline. Precision, recall and F-score per class and macro-average of all classes.

Table 4: Percentage of incorrectly labeled tokens per class, along with the distribution of incorrect labels among the other labels.
Results: POS Tagging
- Majority baseline (BL1)
  - The majority of the output of the monolingual Latin tagger
- Confidence baseline (BL2)
  - Choose the POS label of the monolingual tagger with the higher level of confidence
  - If the label indicates that a word is a foreign word, choose the label from Middle English
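One possible reading of the confidence baseline is sketched below. Several details are assumptions rather than facts from the slides: the tag name `FW` for foreign words, the `(tag, confidence)` input shape, the rule that ties go to the Latin tagger, and the interpretation that the foreign-word check applies to whichever label was chosen.

```python
from typing import Tuple

FOREIGN_TAG = "FW"  # hypothetical tag name for "foreign word"


def confidence_baseline(latin: Tuple[str, float],
                        english: Tuple[str, float]) -> str:
    """Pick the POS label of the more confident monolingual tagger.

    Each argument is a (tag, confidence) pair from one monolingual tagger.
    If the chosen label marks the word as foreign, fall back to the label
    from the Middle English tagger.
    """
    latin_tag, latin_conf = latin
    english_tag, english_conf = english
    # Assumption: on a tie, prefer the Latin tagger (Latin is the majority language).
    chosen = latin_tag if latin_conf >= english_conf else english_tag
    if chosen == FOREIGN_TAG:
        return english_tag
    return chosen
```

For instance, `confidence_baseline(("FW", 0.8), ("ADJ", 0.3))` returns the Middle English label even though the Latin tagger was more confident, because a foreign-word tag carries no usable POS information.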
Table 5: Performance of the CRF system for POS tagging compared to the majority baseline (BL1) and the confidence baseline (BL2). CRFbase: system with 11 basic features; CRFpredLID: system with predicted LID as an additional feature; CRFgoldLID: system with gold-standard LID as an additional feature. Precision (P), recall (R) and F-score (F) per class and macro-average of all classes.

Results: POS Tagging
- The high average recall of almost 80 is important for the task
- Precision has lower priority
  - The extracted phrases are manually inspected afterwards
- The CRFpredLID system shows an increase in performance
- The CRFgoldLID system yields the best performance
- The differences are not statistically significant
Table 6: Percentage of incorrectly labeled tokens per class, along with the distribution of incorrect labels among the other labels, for the CRFpredLID system.

Results: POS Tagging
- Incorrectly tagged words appear in POS sequences which rarely appear in the training data
- Adding more training data will decrease errors of this kind

Results: Training Data Size
- Data sparsity is a general issue when dealing with historical text
- Investigate how different sizes of the training set influence the results
  - 800 tokens
  - 1600 tokens
  - 2400 tokens (the complete training set)

Table 7: Different portions of the training set, along with precision, recall and F-score for LID and POS tagging.

Tools for Digital Humanities
- The aim is not only to build a system
- Enable Humanities scholars to process their data easily
- A simple web service in Java
- The data is returned in an ICARUS format
  - Inspect the data
  - Pose complex search requests
  - Combine both language information and POS tags

Figure 1: Search interface of ICARUS returning results on a query for an English adjective followed by a Latin noun within the next 3 tokens.
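The kind of query shown in Figure 1 can be illustrated outside ICARUS with a few lines of Python. This is not ICARUS's query language or API, only a sketch of the logic; the token tuple layout, the label values `e`, `l`, `ADJ` and `NOUN`, and the function name are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, language label, POS tag)


def find_pairs(tokens: List[Token],
               first: Tuple[str, str] = ("e", "ADJ"),
               second: Tuple[str, str] = ("l", "NOUN"),
               max_gap: int = 3) -> List[Tuple[int, int]]:
    """Find positions (i, j) where a token matching `first` is followed
    by a token matching `second` within the next `max_gap` tokens."""
    hits = []
    for i, (_, lang, pos) in enumerate(tokens):
        if (lang, pos) != first:
            continue
        # Look only at the max_gap tokens directly after position i.
        for j in range(i + 1, min(i + 1 + max_gap, len(tokens))):
            if (tokens[j][1], tokens[j][2]) == second:
                hits.append((i, j))
    return hits
```

Combining the language label with the POS tag in one condition is what makes such a search useful for studying code-switching inside nominal phrases.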
Tools for Digital Humanities
- The method can easily be adapted to other languages
  - Fitting monolingual taggers (TreeTagger)
  - POS-related word lists (if available)
- The code is publicly available on GitHub
Conclusion
- We saw the implementation and application of two systems developed for a specific purpose
- We got reasonable results given the very small amount of training data
- We can extend the training data and correct some errors, for example by adding monolingual Middle English data

Future Work
- Jointly modeling LID and POS tagging
- Dependency parser for mixed text
- Gain insights into the constraints on intra-sentential code-switching
Conclusion and Future Work
- Collaboration between Humanities and Computer Science
- Task-oriented tool development
- Immediate feedback on the performance
- Systems are applied to real-world data
- A way to give Computer Science the chance to support other fields and find new and interesting challenges