Web Taxonomy Integration through Co-Bootstrapping

Web Taxonomy Integration through Co-Bootstrapping

Web Taxonomy Integration through CoBootstrapping Dell Zhang, Wee Sun Lee SIGIR2004 Introduction Taxonomy A taxonomy, or directory or catalog, is a division of a set of objects (e.g. documents) into a set of categories. T C1 a C2 b c C3 d e

f Adopted from a slide of Srikant. 02/02/20 Dell Zhang 3 Introduction Taxonomy Integration Integrating objects from a source taxonomy N into a master taxonomy M. M C1 a N C2 b

c C3 d e S1 f x S2 y z w Adopted from a slide of Srikant. 02/02/20 Dell Zhang 4 Introduction

Taxonomy Integration Integrating objects from a source taxonomy N into a master taxonomy M. M C1 a b x C2 y z C3 c d e f

z w Adopted from a slide of Srikant. 02/02/20 Dell Zhang 5 Introduction Applications Today 02/02/20 Web Marketplaces (e.g. Amazon)

Web Portals (e.g., NCSTRL) Personal Bookmarks Organizational Resources Dell Zhang 6 Introduction Applications Future: Semantic Web Ontology Merging Ontology Mapping Content-based similarity between two concepts (categories) Doan, A., Madhavan, J., Domingos, P. and Halevy, A. Learning to Map between Ontologies on the Semantic Web. in Proceedings of WWW2002.

02/02/20 Jaccard -sim(C , S ) C S C S Need to do taxonomy integration first. Dell Zhang 7 Problem Statement Taxonomy Integration as A Classification Problem A master taxonomy M with a set of categories C1, C2, , Cm each containing a set of objects Classes: master categories C1, C2, , Cm Training examples: objects in M

A source taxonomy N with a set of categories S1, S2, , Sn each containing a set of objects 02/02/20 Test examples: objects in N Dell Zhang 9 Problem Statement Characteristics Its a multi-class, multi-label classification problem.

There are usually more than two possible classes. One object may belong to more than one class. The test examples are already known to the machine learning algorithm. The test examples are already labeled with a set of categories which are not identical with but relevant to the set of categories to be assigned. 02/02/20 Dell Zhang 10 State-of-the-Art Conventional Learning Algorithms Do not exploit information in the source taxonomy.

NB (Nave Bayes) SVM (Support Vector Machine) Enhanced Learning Algorithms Exploit information in the source taxonomy to build better classifiers for the master taxonomy. 02/02/20 Dell Zhang 11 State-of-the-Art Enhanced Learning Algorithms Agrawal, R. and Srikant, R. On Integrating Catalogs. in Proceedings of WWW2001.

02/02/20 ENB (Enhanced Nave Bayes) Dell Zhang 12 State-of-the-Art Enhanced Learning Algorithms Sarawagi, S., Chakrabarti, S. and Godbole, S. Cross-Training: Learning Probabilistic Mappings between Topics. in Proceedings of KDD2003. 02/02/20 EM2D (Expectation Maximization in 2Dimensions) CT-SVM (Cross-Training SVMs) Dell Zhang

13 State-of-the-Art Enhanced Learning Algorithms Zhang, D. and Lee, W.S. Web Taxonomy Integration using Support Vector Machines. in Proceedings of WWW2004. 02/02/20 CS-TSVM (Cluster Shrinkage + Transductive SVM) Dell Zhang 14 Our Approach Motivation

Possible useful semantic relationships between a master category C and a source category S include: 02/02/20 identical, mutual-exclusive, superset, subset, partially-overlapping. Dell Zhang 15 Our Approach Motivation

In addition, semantic relationships may involve multiple master and source categories. 02/02/20 For example, a master category C may be a subset of the union of two source categories Sa and Sb, so if an object does not belong to either Sa or Sb, it cannot belong to C. Dell Zhang 16 Our Approach Motivation The real-world semantic relationships are noisy and fuzzy, but they can still provide valuable information for classification. 02/02/20

For example, knowing that most (80%) objects in a source category S belong to one master category Ca and the rest (20%) examples belong to another master category Cb is obviously helpful. Dell Zhang 17 Our Approach Idea The difficulty is that the knowledge about those semantic relationships is not explicit but hidden in the data. If we have indicator functions for each category in N, we can imagine taking those indicator functions as features when we learn the classifier for M. This allows us to exploit the semantic relationship among the categories of M and N without explicitly figuring out what the

semantic relationships are. 02/02/20 Dell Zhang 18 Our Approach Idea Specifically, for each object in M, we augment the set of conventional termfeatures FT with a set of source categoryfeatures FN={fNi}. The j-th source categoryfeature fNi of a given object x is a binary feature indicating whether x belongs to the jth source category, i.e., Sj. In the same way, we can get a set of master category-features FM. 02/02/20 Dell Zhang 19

Our Approach Major Considerations (1/2) Q1: How can we train the classifier using two different kinds of features (termfeatures and category-features)? A1: Boosting. 02/02/20 Inspired by Cai, L. and Hofmann, T. Text Categorization by Boosting Automatically Extracted Concepts. in Proceedings of SIGIR2003. Dell Zhang 20 Our Approach

Boosting A boosting learning algorithm combines many weak hypotheses/learners (moderately accurate classification rules) into a highly accurate classifier. A boosting learning algorithm can utilize different kinds of weak hypotheses/learners for different kinds of features, and weight them optimally. 02/02/20 For example, boosting enables us to use the decision tree hypotheses/learners for the category-features and the Nave Bayes hypotheses/learners for the termfeatures. Dell Zhang 21 Our Approach Boosting

The most popular boosting algorithm is AdaBoost introduced in 1995 by Freund and Schapire. Our work is based on its multi-class multilabel version, AdaBoost.MH. The weak hypotheses we used in this paper are simple decision stumps each corresponds to a binary term-feature or category-feature. 02/02/20 Dell Zhang 22 Our Approach Two Major Considerations (2/2)

Q2: How can we train the classifier while the values of the source category-features FN of the training examples are unknown ? A2: Co-Bootstrapping. 02/02/20 Inspired by Blum, A. and Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training. in Proceedings of COLT1998. It is named Cross-Training by Chakrabarti etal. Dell Zhang 23 Our Approach Co-Bootstrapping

Train two classifiers symmetrically, one for the master categories and the other for the source categories. These two classifiers collaborate to mutually bootstrap themselves together. 02/02/20 Dell Zhang 24 Our Approach 02/02/20 Dell Zhang 25 Experiments: Datasets Taxonomies Google Yahoo

Book / Top/ Shopping/ Publications/ Books/ / Business_and_Economy/ Shopping_and_Services/ Books/ Bookstores/ Disease / Top/ Health/ Conditions_and_Diseases/ / Health/ Diseases_and_Conditions/ Movie / Top/ Arts/ Movies/ Genres/ / Entertainment/ Movies_and_Film/ Genres/ Music / Top/ Arts/ Music/ Styles/ / Entertainment/ Music/ Genres/ News

/ Top/ News/ By_Subject/ / News_and_Media/ 02/02/20 Dell Zhang 27 Experiments: Tasks Two symmetric dataset tasks for each GY (integrating objects from Yahoo into Google)

YG (integrating objects from Google into Yahoo) 02/02/20 Dell Zhang 32 Experiments: Tasks Test data: GY We do not need to manually label them because their categories in both taxonomies are known. Training data: subsets of G Y randomly

sampled For GY tasks, we randomly sample n objects from the set G Y as training examples, where n is the number of test examples. For YG tasks, For each task, we do such random sampling for 5 times, and report the averaged performance. 02/02/20 Dell Zhang 33 Experiments: Measures For one category F score (F1 measure), which is the harmonic average of precision (p) and recall (r). 02/02/20

F 2 pr ( p r ) Dell Zhang 36 Experiments: Measures For all categories Macro-averaged F score (maF) Compute the F scores for the binary decisions on each individual category first and then average them over categories. Micro-averaged F score (miF) 02/02/20 Compute the F scores globally over all the binary decisions.

Dell Zhang 37 Experiments: Settings NB and ENB We use Lidstones smoothing: parameter=0.1. We run ENB with a series of exponentially increasing values of the parameter omega: (0, 1, 3, 10, 30, 100, 300, 1000), and report the best experimental results. 02/02/20 Dell Zhang 39

Experiments: Settings AB and CB-AB BoosTexter 02/02/20 http://www.research.att.com/~schapire/ BoosTexter/ It implements AdaBoost on top of "decision stumps". Dell Zhang 40 Experiments: Results CB-AB iteratively improves AB maF(GY)

miF(GY) maF(YG) miF(YG) 0.8 0.7 F score 0.6 0.5 0.4 0.3 0.2 0.1 0 0 1 2 3 4

5 6 7 8 Co-Bootstrapping iteration 02/02/20 Dell Zhang 44 Experiments: Results CB-AB > ENB ENB CB-AB ENB GY

GY News Music Movie Disease Book News Music Movie Disease Book YG macro-averaged F scores 02/02/20

News Music Movie Disease Book News Music Movie Disease 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Book 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 CB-AB YG micro-averaged F scores Dell Zhang 45 Conclusion Main Contribution

The CB-AB approach to taxonomy integration 02/02/20 It achieves multi-class multi-label classification. It is a discriminative learning algorithm. It enhances the AdaBoost algorithm via exploiting information in the source taxonomy. It does not require a tune set (a set of objects whose source categories and master categories are all known). It enables usage of different weak hypotheses/learners for term-features and category-features. Dell Zhang 46 Conclusion

Comparison Co-Training Classes Features Co-Bootstrapping Two sets of classes: (1) one set of source categories ; (2) one set of master categories. One set of classes. Two sets of features: (1) conventional-features plus source Two disjoint sets of features: category-features; V1 and V2. (2) conventional-features plus master category-features; V1 and V2 are compatible Assumption and uncorrelated (conditionally independent). 02/02/20 The source and master taxonomies

have some semantic overlap, i.e., they are somewhat correlated. Dell Zhang 47 Questions, Comments, Suggestions, ? 02/02/20 Dell Zhang 50 Thank You 02/02/20 Dell Zhang 51

Recently Viewed Presentations

  • Distribution Committee - OWWA

    Distribution Committee - OWWA

    Chair's Report GOOD WORK EVERYONE!!!!! Source Water Protection Committee Committee Members Tim Lotimer Nick Benkovich Peter Busatto Eric Hodgins Mike Price Terry Spiers Mark Schiller Roland Welker Focus Issues related to the protection of ground and surface source water Provided...
  • Law & Motion for Self Help Centers - California Courts

    Law & Motion for Self Help Centers - California Courts

    Law & Motion for Self Help Centers Monica Mitchell Superior Court, County of San Bernardino -- Supervising Attorney, Self Help Services Jodi Prior Superior Court, County of Ventura -- Senior Court Attorney, Self-Help Legal Access Center Larry Meyer Law Library...
  • rrnicholson.files.wordpress.com


    Word. Shaman. Definition. A person regarded as having access to, and influence in, the world of good and evil spirits, esp. among some peoples of northern Asia and North America.
  • www.mcps.k12.md.us/.../interior%201 Colonial Trades Shared by Colonial Kids Introduction

    www.mcps.k12.md.us/.../interior%201 Colonial Trades Shared by Colonial Kids Introduction

    Define the information problem, the task. First----- The Task Working in a group of 3 to 4 students, you will research a trade that was done during the 1700s. Then, dressed in colonial clothes, each student will "play the part"...
  • Writing a Persuasive Essay - PC\|MAC

    Writing a Persuasive Essay - PC\|MAC

    Hidden audience for most of the prompts . Some prompts - audience is given ... Song lyrics work well on the SOL! A related anecdote (the beginning of your focus story) If you start with an anecdote, you should come...
  • New Jersey Board of Public Utilities and its

    New Jersey Board of Public Utilities and its

    Addresses comfort, indoor air quality, health & safety, and energy usage problems. The program uses specially trained/certified Building Performance Institute (BPI) GoldStar contractors. ... Individual HVAC, DHW per unit or building .
  • Oncology - Sheffield Peer Teaching Society

    Oncology - Sheffield Peer Teaching Society

    Bone pain, erectile dysfunction, lower back pain, lethargy, anorexia, wt loss, haematuria. LUTS - less commonly. Raised or rising PSA - not particularly sensitive or specific. Hard and nodular prostate on DRE
  • Object Category Recognition

    Object Category Recognition

    The constellation of parts for a given object is learned from training images * Components Model Generative Probabilistic Model including Location, Scale, and Appearance of Parts Learning Estimate Parameters Via EM Algorithm Recognition Evaluate Image Using Model and Threshold *...