Language Identification and Part-of-Speech Tagging
Keren Solodkin
Based on a paper by Sarah Schulz and Mareike Keller
Digital Humanities Seminar 2016

Plan
Introduction and Related Work
Training Data
Processing of Mixed Text
Results
Tools for Digital Humanities
Conclusion and Future Work

Introduction: Code Switching
Two or more linguistic varieties in a single conversation
Highly frequent in spoken language and in social media
Can also be observed in medieval writing
Historical mixed text is an unused source of information

Example

Introduction: The Project
Automatic language identification (LID) and POS tagging
Mixed Latin-Middle English text
Make tools available to Humanities scholars
Analysis of code-switching rules within nominal phrases
Historical multilingualism research
Computational linguistics

Related Work: LID
Lyu and Lyu (2008): Mandarin-Taiwanese
Solorio and Liu (2008): Spanish-English
Yeong and Tan (2011): Malay-English

Related Work: POS Tagging
Solorio and Liu (2008)
Rodrigues and Kübler (2013)
Jamatia et al. (2015)

Training Data
Macaronic sermons (Horner, 2006)
Mixed Latin-Middle English text
Annotated with language and part-of-speech information
The annotated corpus comprises about 3000 tokens in 159 sentences, with an average length of 19.4 tokens
Table 1: Labels annotated for LID, along with an explanation for each label and its occurrence in percent.
Table 2: Labels annotated for POS tagging, along with an explanation for each label and its occurrence in percent.

Processing of Mixed Text
Two models:
(1) POS tagging builds upon the results of the LID
(2) POS tagging and LID do not inform each other
LID is a preliminary step for any further processing of mixed text
LID needs to be solved with high accuracy

Processing of Mixed Text: LID
Solorio and Liu (2008)
No lemmatizer is available for Middle English
Include POS-informed word lists for both languages
Middle English: Penn Parsed Corpora of Historical English
Latin: the Universal Dependencies treebank
If a word is found in one of the lists, its POS is added

Processing of Mixed Text: CRF Classifiers
Conditional Random Fields
Take context into account
Set of feature functions with weights
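
The "set of feature functions with weights" can be made concrete with the standard textbook form of a linear-chain CRF. The slides do not give a formula, so the symbols below (token sequence x of length T, label sequence y, feature functions f_k, weights w_k) are introduced here only for illustration:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Each f_k may look at the whole input x around position t, which is how the model takes context into account; Z(x) normalizes over all candidate label sequences y'.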

Processing of Mixed Text: LID
CRF classifiers are known to be successful for sequence labeling tasks
Latin is characterized by a relatively restricted set of suffixes
A context window of 5 tokens was used on all features

Processing of Mixed Text: LID
Feature functions:
1. Surface form
2. POS tag Latin
3. POS tag Middle English
4. POS from Middle English word list
5. POS from Latin word list
6. Character-unigram prefix
7. Character-bigram prefix
8. Character-trigram prefix
9. Character-unigram suffix
10. Character-bigram suffix
11. Character-trigram suffix
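
As a rough illustration of how the eleven features could be assembled per token, here is a minimal Python sketch. It is not the authors' code: the function name, the feature keys, the "NONE" placeholder and the reading of "a context window of 5 tokens" as two tokens on either side of the current one are all assumptions made for this example.

```python
from typing import Dict, List


def token_features(
    tokens: List[str],
    i: int,
    latin_pos: List[str],
    me_pos: List[str],
    me_wordlist: Dict[str, str],
    latin_wordlist: Dict[str, str],
    window: int = 2,
) -> Dict[str, str]:
    """Collect the 11 features for every token in the window around position i.

    latin_pos / me_pos hold the output of the two monolingual taggers;
    the word lists map a word form to a POS tag (hypothetical format).
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0 or j >= len(tokens):
            continue  # window reaches past the sentence boundary
        word = tokens[j]
        local = {
            "form": word,                                     # feature 1
            "pos_latin": latin_pos[j],                        # feature 2
            "pos_me": me_pos[j],                              # feature 3
            "list_me": me_wordlist.get(word, "NONE"),         # feature 4
            "list_latin": latin_wordlist.get(word, "NONE"),   # feature 5
        }
        for n in (1, 2, 3):
            local[f"prefix{n}"] = word[:n]                    # features 6-8
            local[f"suffix{n}"] = word[-n:]                   # features 9-11
        for name, value in local.items():
            feats[f"{offset:+d}:{name}"] = value              # key carries the offset
    return feats
```

A dictionary per token in this shape is what common CRF toolkits accept as input; which toolkit was actually used is not stated on the slides.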

Processing of Mixed Text: POS Tagging
For POS tagging, the same features are used
Information generated by the LID system is added (feature 12a)
The performance is also evaluated with the gold LID (feature 12b)
Differences in the quality of LID influence the POS tagging quality

Processing of Mixed Text: POS Tagging
Features (continuation):
12a. LID label predicted by the LID system
12b. Gold LID label manually annotated for our corpus
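
The step from the LID system to the POS tagger amounts to appending one more feature per token. A minimal sketch, assuming per-token feature dictionaries; the function name and the "lid" key are hypothetical:

```python
from typing import Dict, List


def add_lid_feature(
    base_features: List[Dict[str, str]],
    lid_labels: List[str],
) -> List[Dict[str, str]]:
    """Return copies of the per-token feature dicts extended with an LID label."""
    if len(base_features) != len(lid_labels):
        raise ValueError("need exactly one LID label per token")
    return [
        {**feats, "lid": label}  # copy, so the base features stay untouched
        for feats, label in zip(base_features, lid_labels)
    ]
```

Passing the labels predicted by the LID system corresponds to feature 12a, and passing the manually annotated labels corresponds to feature 12b.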

Results
The evaluation was a 10-fold cross-validation
90% for training, 10% for testing
The reported results are averages over all ten folds
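
The evaluation scheme can be sketched in a few lines of Python. Whether the folds were drawn over sentences or tokens, and how items were assigned to folds, is not stated on the slide; the round-robin assignment and the function names below are assumptions.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; every item lands in exactly one test fold."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    for fold in range(k):
        test = [x for idx, x in enumerate(items) if idx % k == fold]
        train = [x for idx, x in enumerate(items) if idx % k != fold]
        yield train, test


def average(scores: Sequence[float]) -> float:
    """Mean of the per-fold scores."""
    return sum(scores) / len(scores)
```

With k = 10, each test fold holds roughly 10% of the items and the remaining 90% are used for training.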

Results: LID
Majority baseline: the text is Latin featuring Middle English insertions, so the baseline is a combination of Latin and perfect punctuation labeling
Per-class precision, recall and F-score
Macro-averages for the overall system
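
The per-class scores and their macro-average can be computed as in the sketch below. The label names are invented for the example, and the macro F-score is taken as the unweighted mean of the per-class F-scores, which is one common convention; the slides do not say which variant was used.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F-score)


def per_class_prf(gold: List[str], pred: List[str]) -> Dict[str, Scores]:
    """Precision, recall and F-score for every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores: Dict[str, Scores] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = (precision, recall, f_score)
    return scores


def macro_average(scores: Dict[str, Scores]) -> Scores:
    """Unweighted mean over classes, so rare labels count as much as Latin."""
    n = len(scores)
    p = sum(s[0] for s in scores.values()) / n
    r = sum(s[1] for s in scores.values()) / n
    f = sum(s[2] for s in scores.values()) / n
    return (p, r, f)


# A majority-style baseline on a toy sequence: everything is labeled Latin
# except punctuation, which is labeled perfectly (labels are hypothetical).
gold = ["la", "la", "en", "punct"]
baseline = ["la", "la", "la", "punct"]
baseline_scores = per_class_prf(gold, baseline)
```

Because the macro-average weights every class equally, a baseline that never predicts the minority language is penalized even when its token-level accuracy is high.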

Results: LID
Table 3: Performance of the CRF system for language identification compared to the baseline. Precision, recall and F-score per class and macro-average of all classes.
Table 4: Percentage of incorrectly labeled tokens per class along with the distribution of incorrect labels among the other labels.

Results: POS Tagging
Majority baseline: the output of the monolingual Latin tagger, Latin being the majority language
Confidence baseline: choose the POS label of the monolingual tagger with the higher level of confidence. If that label indicates that the word is a foreign word, choose the label from the Middle English tagger.
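
The confidence baseline is a small decision rule, sketched here in Python. The `Tagged` type, the `FOREIGN` tag name and the tie-breaking in favor of Latin are assumptions for this example, not details given on the slide.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    pos: str           # POS label proposed by a monolingual tagger
    confidence: float  # that tagger's confidence in the label


FOREIGN = "FW"  # hypothetical name for the "foreign word" label


def confidence_baseline(latin: Tagged, middle_english: Tagged) -> str:
    """Take the label of the more confident tagger (ties go to Latin).

    If the chosen label marks the word as foreign, fall back to the
    Middle English tagger's label instead.
    """
    chosen = latin if latin.confidence >= middle_english.confidence else middle_english
    if chosen.pos == FOREIGN:
        return middle_english.pos
    return chosen.pos
```

For example, `confidence_baseline(Tagged("FW", 0.9), Tagged("ADJ", 0.2))` returns the Middle English label even though the Latin tagger is more confident, because the Latin tagger only reports that the word is foreign.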

Results: POS Tagging
Table 5: Performance of the CRF system for POS tagging compared to the majority baseline (BL1) and the confidence baseline (BL2). CRFbase: system with 11 basic features; CRFpredLID: system with predicted LID as an additional feature; CRFgoldLID: system with gold-standard LID as an additional feature. Precision (P), recall (R) and F-score (F) per class and macro-average of all classes.

Results: POS Tagging
The high average recall of almost 80% is important for the task
Precision has lower priority: the extracted phrases are manually inspected afterwards
The CRFpredLID system shows an increase in performance
The CRFgoldLID system yields the best performance
The differences are not statistically significant

Results: POS Tagging
Table 6: Percentage of incorrectly labeled tokens per class along with the distribution of incorrect labels among the other labels for the CRFpredLID system.

Results: POS Tagging
Incorrectly tagged words appear in POS sequences which rarely appear in the training data
Adding more training data should reduce errors of this kind

Results: Training Data Size
Data sparsity is a general issue when dealing with historical text
Investigate how different sizes of the training set influence the results:
800 tokens
1600 tokens
2400 tokens (the complete training set)
Table 7: Different portions of the training set along with precision, recall and F-score for LID and POS tagging.

Tools for Digital Humanities
The aim is not only to build a system
Enable Humanities scholars to process their data easily
A simple web service in Java
The data is returned in ICARUS format, which makes it possible to:
Inspect the data
Pose complex search requests combining both language information and POS tags
Figure 1: Search interface of ICARUS returning results on a query for an English adjective followed by a Latin noun within the next 3 tokens.
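
To show what a query of this kind does, the search from Figure 1 can be restated in plain Python. This is not the ICARUS query language or API; the `Token` type, the language codes and the tag names are hypothetical, and the example tokens are invented.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str
    lang: str  # LID label, e.g. "en" or "la" (hypothetical codes)
    pos: str   # POS label, e.g. "ADJ" or "N" (hypothetical tags)


def find_pairs(
    tokens: List[Token],
    first: Tuple[str, str],
    second: Tuple[str, str],
    max_distance: int = 3,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with tokens[i] matching `first`, tokens[j]
    matching `second`, and 0 < j - i <= max_distance.

    `first` and `second` are (language, POS) pairs.
    """
    hits: List[Tuple[int, int]] = []
    for i, tok in enumerate(tokens):
        if (tok.lang, tok.pos) != first:
            continue
        for j in range(i + 1, min(i + 1 + max_distance, len(tokens))):
            if (tokens[j].lang, tokens[j].pos) == second:
                hits.append((i, j))
    return hits


# An English adjective followed by a Latin noun within the next 3 tokens.
sentence = [
    Token("gode", "en", "ADJ"),
    Token("et", "la", "CC"),
    Token("rex", "la", "N"),
    Token("est", "la", "V"),
    Token("homo", "la", "N"),
]
matches = find_pairs(sentence, ("en", "ADJ"), ("la", "N"))
```

Here only the noun at index 2 is matched; the noun at index 4 lies four tokens after the adjective and falls outside the window.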

Tools for Digital Humanities
The method can easily be adapted to other languages, given:
Fitting monolingual taggers (TreeTagger)
POS-related word lists (if available)
The code is publicly available on GitHub

Conclusion
We saw the implementation and application of two systems developed for a specific purpose
We got reasonable results given the very small amount of training data
We can extend the training data and correct some errors, for example by adding monolingual Middle English data

Future Work
Jointly modeling LID and POS tagging
Dependency parser for mixed text
Get insights into the constraints on intra-sentential code-switching

Conclusion and Future Work
Collaboration between Humanities and Computer Science
Task-oriented tool development
Immediate feedback on the performance
Systems are applied to real-world data
A way to give Computer Science the chance to support other fields and find new and interesting challenges