Text Mining
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
[email protected]

Text Mining/Information Retrieval
  Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries.
  This assumption underlies the field of Information Retrieval.

[Figure: components of an IR system: Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, Evaluate. Callouts: "How is the query constructed?" and "How is the text processed?"]

Terminology
  Token: A natural language word, e.g. "Swim", "Simpson", "92513"
  Document: Usually a web page, but more generally any file.

Some IR History
  Roots in the scientific "Information Explosion" following WWII
  Interest in computer-based IR from the mid 1950s
    H.P. Luhn at IBM (1958)
    Probabilistic models at Rand (Maron & Kuhns) (1960)
    Boolean system development at Lockheed (60s)
    Vector Space Model (Salton at Cornell, 1965)
    Statistical weighting methods and theoretical advances (70s)
    Refinements and advances in application (80s)
    User interfaces, large-scale testing and application (90s)

Relevance
  In what ways can a document be relevant to a query?
    Answer precise question precisely. Who is Homer's boss? Montgomery Burns.
    Partially answer question. Where does Homer work? Power plant.
    Suggest a source for more information. What is Bart's middle name? Look in Issue 234 of Fanzine.
    Give background information.
    Remind the user of other knowledge.
    Others ...

The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form).

Document Processing Steps
  [Figure from Baeza-Yates & Ribeiro-Neto]

Stemming and Morphological Analysis
  Goal: "normalize" similar words
  Morphology ("form" of words)
    Inflectional Morphology
      E.g., inflect verb endings and noun number
      Never change grammatical class
        dog, dogs
        Bike, Biking
        Swim, Swimmer, Swimming
    What about build, building?

Examples of Stemming (using Porter's algorithm)
  Porter's algorithm is available in Java, C, Lisp, Perl, Python etc. from http://www.tartarus.org/~martin/PorterStemmer/
  Original Words: consign, consigned

Errors Generated by Porter Stemmer (Krovetz 93)

  Too Aggressive         Too Timid
  organization/organ     european/europe
  policy/police          cylinder/cylindrical
  execute/executive      create/creation
  arm/army               search/searcher

Statistical Properties of Text
  Token occurrences in text are not uniformly distributed
  They are also not normally distributed
  They do exhibit a Zipf distribution

Government documents, 157734 tokens, 32259 unique

  Most frequent tokens (count, token):
    8164 the     4771 of      4005 to        2834 a
    2827 and     2802 in      1592 The       1370 for
    1326 is      1324 s       1194 that       973 by
     969 on       915 FT       883 Mr         860 was
     855 be       849 Pounds   798 TEXT       798 PUB
     798 PROFILE  798 PAGE     798 HEADLINE   798 DOCNO

  Tokens that occur once (count, token):
    1 ABC    1 ABFT          1 ABOUT   1 ACFT   1 ACI       1 ACQUI
    1 ACQUISITIONS  1 ACSIS  1 ADFT    1 ADVISERS   1 AE

Plotting Word Frequency by Rank
  Main idea: count how many times tokens occur in the text
    Over all texts in the collection
  Now order the tokens by how often they occur. A token's position in this ordering is called its rank.

The Corresponding Zipf Curve
  [Figure: frequency of stemmed terms plotted by rank.
   Frequencies shown: 32, 24, 20, 18, 15, 15, 15, 13, 13, 11, 11, 10, 10, 10, 10, 10, 10, 9, 9.
   Terms shown: system, knowledg, base, problem, abstract, model, languag, implem, reason, inform, expert, analysi, rule, program, oper, evalu, comput, case, gener, form]

Zipf Distribution
  The Important Points:
    a few elements occur very frequently
    a medium number of elements have medium frequency
    many elements occur very infrequently

Zipf Distribution
  The product of the frequency of words (f) and their rank (r) is approximately constant
    Rank = order of words' frequency of occurrence

    $f = C \cdot \frac{1}{r}, \qquad C \approx N/10$

  Another way to state this is with an approximately correct rule of thumb:
    Say the most common term occurs C times
    The second most common occurs C/2 times
    The third most common occurs C/3 times

Zipf Distribution (linear and log scale)
  [Illustration by Jacob Nielsen]
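
As a rough illustration of the rank/frequency idea, the sketch below counts tokens, ranks them by frequency, and prints the product of rank and frequency that Zipf's rule of thumb says should stay roughly constant. The function name `rank_frequency` and the toy sentence are assumptions made for this example; a sentence this short cannot show the Zipf shape, it only shows the bookkeeping.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, token, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    # most_common() sorts by descending count; rank 1 is the most frequent token.
    for rank, (token, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, token, freq, rank * freq))
    return rows


# Hypothetical toy text, split on whitespace (no stemming or case folding).
text = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(text.split())
for rank, token, freq, product in rows:
    print(rank, token, freq, product)
```

On a real collection the last column is the quantity to inspect: it should hover around a constant C for the middle ranks.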

What Kinds of Data Exhibit a Zipf Distribution?
  Words in a text collection
    Virtually any language usage
  Library book checkout patterns
  Incoming Web Page Requests
  Outgoing Web Page Requests
  Document Size on Web
  City Sizes

Consequences of Zipf
  There are always a few very frequent tokens that are not good discriminators.
    Called "stop words" in IR
    English examples: to, from, on, and, the, ...
  There are always a large number of tokens that occur once and can mess up algorithms.
  Medium frequency words are the most descriptive

Word Frequency vs. Resolving Power (from van Rijsbergen 79)
  The most frequent words are not the most descriptive.

Statistical Independence
  Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together.

    $P(x)\,P(y) = P(x, y)$

Statistical Independence and Dependence
  What are examples of things that are statistically independent?
  What are examples of things that are statistically dependent?

Lexical Associations
  Subjects write first word that comes to mind
    doctor/nurse; black/white (Palermo & Jenkins 64)
  Text corpora yield similar associations
  One measure: Mutual Information (Church and Hanks 89)

    $I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$

  If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)

Statistical Independence
  Compute for a window of words
    $P(x)\,P(y) = P(x, y)$ if independent.
    $P(x) = f(x) / N$
  We'll approximate $P(x, y)$ as follows:

    $P(x, y) = \frac{1}{N} \sum_{i=1}^{N - |w|} w_i(x, y)$

    $|w|$ = length of window $w$ (say 5)
    $w_i$ = words within window starting at position $i$
    $w(x, y)$ = number of times $x$ and $y$ co-occur in $w$
    $N$ = number of words in collection

  [Figure: token sequence "abcdefghij klmnop" with windows w1, w11 and w21 marked]

Interesting Associations with "Doctor" (AP Corpus, N=15 million, Church & Hanks 89)

  I(x,y)  f(x,y)  f(x)    x         f(y)   y
  11.3    12      111     Honorary  621    Doctor
  11.3    8       1105    Doctors   ...    ...
  ...     ...     ...     ...       ...    Treat
  8.7     25      621     Doctor    1407   Bills

Un-Interesting Associations with "Doctor" (AP Corpus, N=15 million, Church & Hanks 89)

  I(x,y)  f(x,y)  f(x)    x         f(y)   y
  0.96    6       621     doctor    73785  with
  0.95    41      284690  a         1105   doctors
  0.93    12      84716   is        1105   doctors

  These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

Associations Are Important Because
  We may be able to discover phrases that should be treated as a single word, e.g. "data mining".
  We may be able to automatically discover synonyms, e.g. "Bike" and "Bicycle".

Content Analysis Summary
  Content Analysis: transforming raw text into more computationally useful forms
  Words in text collections exhibit interesting statistical properties
    Word frequencies have a Zipf distribution
    Word co-occurrences exhibit dependencies
  Text documents are transformed to vectors
    Pre-processing includes tokenization, stemming, collocations/phrases
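
The mutual information measure of Church and Hanks can be checked against the counts quoted in the association tables. The sketch below is only an illustration: the function name `mutual_information` is an assumption, each probability is estimated as a raw count divided by N, and N is taken as exactly 15 million even though the slides quote it only approximately, so the results should land close to, not exactly on, the tabulated scores.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """I(x, y) = log2(P(x, y) / (P(x) * P(y))), with each P estimated as count / n."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


N = 15_000_000  # assumption: the "15 million" quoted with the tables, taken as exact

# (f(x,y), f(x), f(y)) triples copied from the association tables.
honorary_doctor = mutual_information(12, 111, 621, N)
doctor_bills = mutual_information(25, 621, 1407, N)
doctor_with = mutual_information(6, 621, 73785, N)

print(honorary_doctor, doctor_bills, doctor_with)
```

The first two pairs score far higher than the third, which is the point of the two tables: a very common word such as "with" co-occurs with almost everything, so its score stays near the independence value of 0.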

The section that follows is about Index Construction.

Inverted Index
  This is the primary data structure for text indexes
  Main Idea: Invert documents into a big index
  Basic steps:
    Make a "dictionary" of all the tokens in the collection
    For each token, list all the docs it occurs in.
    Do a few things to reduce redundancy in the data structure

How Are Inverted Files Created
  Documents are parsed to extract tokens. These are saved with the Document ID.

  Doc 1: Now is the time for all good men to come to the aid of their country
  Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

  Term      Doc #      Term      Doc #
  now       1          it        2
  is        1          was       2
  the       1          a         2
  time      1          dark      2
  for       1          and       2
  all       1          stormy    2
  good      1          night     2
  men       1          in        2
  to        1          the       2
  come      1          country   2
  to        1          manor     2
  the       1          the       2
  aid       1          time      2
  of        1          was       2
  their     1          past      2
  country   1          midnight  2

How Inverted Files are Created
  After all documents have been parsed the inverted file is sorted alphabetically.

  Term      Doc #      Term      Doc #
  a         2          night     2
  aid       1          now       1
  all       1          of        1
  and       2          past      2
  come      1          stormy    2
  country   1          the       1
  country   2          the       1
  dark      2          the       2
  for       1          the       2
  good      1          their     1
  in        2          time      1
  is        1          time      2
  it        2          to        1
  manor     2          to        1
  men       1          was       2
  midnight  2          was       2

How Inverted Files are Created
  Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

  Term      Doc #  Freq      Term      Doc #  Freq
  a         2      1         men       1      1
  aid       1      1         midnight  2      1
  all       1      1         night     2      1
  and       2      1         now       1      1
  come      1      1         of        1      1
  country   1      1         past      2      1
  country   2      1         stormy    2      1
  dark      2      1         the       1      2
  for       1      1         the       2      2
  good      1      1         their     1      1
  in        2      1         time      1      1
  is        1      1         time      2      1
  it        2      1         to        1      2
  manor     2      1         was       2      2

How Inverted Files are Created
  Then the file can be split into
    A Dictionary file
    and
    A Postings file

  Dictionary

  Term      N docs  Tot Freq      Term      N docs  Tot Freq
  a         1       1             manor     1       1
  aid       1       1             men       1       1
  all       1       1             midnight  1       1
  and       1       1             night     1       1
  come      1       1             now       1       1
  country   2       2             of        1       1
  dark      1       1             past      1       1
  for       1       1             stormy    1       1
  good      1       1             the       2       4
  in        1       1             their     1       1
  is        1       1             time      2       2
  it        1       1             to        1       2
                                  was       1       2

  Postings (Doc #, Freq), in dictionary order:
    (2,1) (1,1) (1,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,1) (1,1)
    (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1)
    (2,1) (1,2) (2,2) (1,1) (1,1) (2,1) (1,2) (2,2)

Inverted Indexes
  Permit fast search for individual terms
  For each term, you get a list consisting of:
    document ID
    frequency of term in doc (optional)
    position of term in doc (optional)
  These lists can be used to solve Boolean queries:
    country -> d1, d2
    manor -> d2
    country AND manor -> d2
  Also used for statistical ranking algorithms

How Inverted Files are Used
  Query on "time" AND "dark"
    2 docs with "time" in dictionary -> IDs 1 and 2 from posting file
    1 doc with "dark" in dictionary -> ID 2 from posting file
    Therefore, only doc 2 satisfied the query.

The section that follows is about Querying (and ranking).

Simple query language: Boolean
  Terms + Connectors (or operators)
  terms
    words
    normalized (stemmed) words
    phrases
  connectors
    AND
    OR
    NOT
    NEAR (Pseudo Boolean)
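
The inverted file construction and the Boolean lookups described in this part of the slides can be sketched in a few lines. This is a minimal illustration rather than the structure of any particular IR system: the function names are assumptions, tokens are lower-cased and split on non-alphanumeric characters (also an assumption), and the dictionary and postings are kept together in one in-memory mapping instead of two files.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased token to {doc_id: within-document frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def docs_with(index, term):
    """Return the set of document IDs whose postings list contains the term."""
    return set(index.get(term, {}))


# The two example documents from the slides.
docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
index = build_inverted_index(docs)

# Boolean AND is set intersection, OR is set union.
print(sorted(docs_with(index, "country")))                          # [1, 2]
print(sorted(docs_with(index, "time") & docs_with(index, "dark")))  # [2]
print(sorted(docs_with(index, "aid") | docs_with(index, "manor")))  # [1, 2]
print(index["the"])                                                 # {1: 2, 2: 2}
```

Keeping the per-document frequency in the postings is what later allows the same structure to feed statistical ranking instead of only Boolean matching.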

Boolean Queries
  Cat
  Cat OR Dog
  Cat AND Dog
  (Cat AND Dog)
  (Cat AND Dog) OR Collar
  (Cat AND Dog) OR (Collar AND Leash)
  (Cat OR Dog) AND (Collar OR Leash)

Boolean Queries
  (Cat OR Dog) AND (Collar OR Leash)
    Each of the following combinations works:
    [Table: combinations of Cat, Dog, Collar, Leash (marked x) that satisfy the query]

Boolean Queries
  (Cat OR Dog) AND (Collar OR Leash)
    None of the following combinations work:
    [Table: combinations of Cat, Dog, Collar, Leash (marked x) that do not satisfy the query]

Boolean Searching
  "Measurement of the width of cracks in prestressed concrete beams"
  Formal Query:
    cracks AND beams AND Width_measurement AND Prestressed_concrete
  Relaxed Query:
    (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
  [Figure: diagram labelled Cracks, Width measurement, Beams, Prestressed concrete]

Ordering of Retrieved Documents
  Pure Boolean has no ordering
  In practice:
    order chronologically
    order by total number of "hits" on query terms
      What if one term has more hits than others?
      Is it better to have one of each term or many of one term?

Boolean Model
  Advantages
    simple queries are easy to understand
    relatively easy to implement
  Disadvantages
    difficult to specify what is wanted
    too much returned, or too little
    ordering not well determined
  Dominant language in commercial Information Retrieval systems until the WWW

Since the Boolean model is limited, let's consider a generalization.

Vector Model
  Documents are represented as "bags of words"
  Represented as vectors when used computationally
    A vector is like an array of floating point numbers
    Has direction and magnitude
    Each vector holds a place for every term in the collection
    Therefore, most vectors are sparse

  "Smithers secretly loves Monty Burns"
  "Monty Burns secretly loves Smithers"
  Both map to [Burns, loves, Monty, secretly, Smithers]

Document Vectors
  One location for each word
  [Table: document ids A to I (rows) against the terms nova, galaxy, heat, hwood, film, role, diet, fur (columns); non-empty cells hold weights between 1 and 10]

We Can Plot the Vectors
  [Figure: documents plotted on the axes Star and Diet, labelled "Doc about astronomy", "Doc about movie stars" and "Doc about mammal behavior"]

Documents in 3D Vector Space
  [Figure: documents D1 to D11 plotted on the axes t1, t2, t3. Illustration from Jurafsky & Martin]

Vector Space Model
  [Table: occurrence (*) of the terms Homer, Marge, Bart in documents D1 to D11 and in the query Q]
  Note that the query is projected into the same vector space as the documents.
  The query here is for "Marge".
  We can use a vector similarity model to determine the best match to our query (details in a few slides).
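
To make the bag-of-words idea concrete, the sketch below turns the two Smithers/Burns sentences into term-count vectors over a shared vocabulary and confirms that they coincide. The helper names `bag_of_words` and `to_vector` are assumptions for this example, and tokenization is a plain whitespace split.

```python
from collections import Counter


def bag_of_words(text):
    """Count each whitespace-separated token; word order is discarded."""
    return Counter(text.split())


def to_vector(bag, vocabulary):
    """Lay the counts out in vocabulary order, with 0 for absent terms."""
    return [bag.get(term, 0) for term in vocabulary]


s1 = "Smithers secretly loves Monty Burns"
s2 = "Monty Burns secretly loves Smithers"

# Case-insensitive alphabetical order reproduces the ordering used on the slide.
vocabulary = sorted(set(s1.split()) | set(s2.split()), key=str.lower)
v1 = to_vector(bag_of_words(s1), vocabulary)
v2 = to_vector(bag_of_words(s2), vocabulary)

print(vocabulary)  # ['Burns', 'loves', 'Monty', 'secretly', 'Smithers']
print(v1 == v2)    # True
```

The two sentences mean different things but are indistinguishable once word order is dropped, which is exactly the information the vector model gives up.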

But what weights should we use for the terms?

Assigning Weights to Terms
  Binary Weights
  Raw term frequency
  tf x idf
    Recall the Zipf distribution
    Want to weight terms highly if they are
      frequent in relevant documents ... BUT
      infrequent in the collection as a whole

Binary Weights
  Only the presence (1) or absence (0) of a term is included in the vector

  docs  D1  D2  D3  D4  D5  D6  D7  D8  D9  D10  D11
  t1    1   1   0   1   1   1   0   0   0   0    1
  t2    0   0   1   0   1   1   1   1   0   1    0
  t3    1   0   1   0   1   0   0   0   1   1    1

  We have already seen and discussed this model.

Raw Term Weights
  The frequency of occurrence for the term in each document is included in the vector
  Counts can be normalized by document lengths.

  docs  D1  D2  D3  D4  D5  D6  D7  D8  D9  D10  D11
  t1    2   1   0   3   1   3   0   0   0   0    4
  t2    0   0   4   0   6   5   8   10  0   3    0
  t3    3   0   7   0   3   0   0   0   1   5    1

  This model is open to exploitation by websites that repeat a popular term many times: sex sex sex sex sex ...

tf * idf Weights
  tf * idf measure:
    term frequency (tf)
    inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
  Goal: assign a tf * idf weight to each term in each document

tf * idf

    $w_{ik} = tf_{ik} \cdot \log(N / n_k)$

    $T_k$ = term $k$ in document $D_i$
    $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
    $idf_k$ = inverse document frequency of term $T_k$ in $C$
    $N$ = total number of documents in the collection $C$
    $n_k$ = the number of documents in $C$ that contain $T_k$

    $idf_k = \log\left(\frac{N}{n_k}\right)$

Inverse Document Frequency
  IDF provides high values for rare words and low values for common words

    $idf_k = \log\left(\frac{N}{n_k}\right)$

  For a collection ...

Problems with Vector Space
  There is no real theoretical basis for the assumption of a term space
    it is more for visualization than having any real basis
    most similarity measures work about the same regardless of model
  Terms are not really orthogonal dimensions
    Terms are not independent of all other terms
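
As a worked instance of the tf * idf formula from this section, the sketch below applies it to the raw term weights for t1, t2, t3 over D1 to D11. Two things are assumptions here: the logarithm base (base 10 is used, since the slides only write "log") and the helper names `idf` and `tf_idf`.

```python
import math

# Raw term frequencies from the Raw Term Weights slide: one list per term,
# one entry per document D1..D11.
tf = {
    "t1": [2, 1, 0, 3, 1, 3, 0, 0, 0, 0, 4],
    "t2": [0, 0, 4, 0, 6, 5, 8, 10, 0, 3, 0],
    "t3": [3, 0, 7, 0, 3, 0, 0, 0, 1, 5, 1],
}
N = 11  # total number of documents in this small collection


def idf(term):
    """idf_k = log(N / n_k), where n_k is the number of documents containing the term."""
    n_k = sum(1 for count in tf[term] if count > 0)
    return math.log10(N / n_k)  # assumption: base-10 logarithm


def tf_idf(term, doc_index):
    """w_ik = tf_ik * idf_k for the document at doc_index (0 means D1)."""
    return tf[term][doc_index] * idf(term)


for term in tf:
    print(term, [round(tf_idf(term, i), 2) for i in range(N)])
```

In this particular collection each of the three terms occurs in 6 of the 11 documents, so all three share the same idf and the weights are just rescaled counts; idf only separates terms when their document frequencies differ.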

Probabilistic Models
  Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
  Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
  Rely on accurate estimates of probabilities

Relevance Feedback
  Main Idea:
    Modify existing query based on relevance judgements
      Query Expansion: Extract terms from relevant documents and add them to the query
      Term Re-weighting: and/or re-weight the terms already in the query
  Two main approaches:
    Automatic (pseudo-relevance feedback)
    Users select relevant documents
      Users/system select terms from an automatically-generated list

  Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.

  Suppose you are interested in bovine agriculture on the banks of the river Jordan

    Term Vector   [Jordan, Bank, Bull, River]
    Term Weights  [  1   ,  1  ,  1  ,   1  ]

    Search -> Display Results -> Gather Feedback -> Update Weights

    Term Vector   [Jordan, Bank, Bull, River]
    Term Weights  [ 1.1  , 0.1 , 1.3 ,  1.2 ]

Rocchio Method

    $Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$

  where
    $Q_0$ = the vector for the initial query
    $R_i$ = the vector for the relevant document $i$
    $S_i$ = the vector for the non-relevant document $i$
    $n_1$ = the number of relevant documents chosen
    $n_2$ = the number of non-relevant documents chosen
  $\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)

Rocchio Illustration
  Although we usually work in vector space for text, it is easier to visualize Euclidean space
  [Figure panels: Original Query; Term Re-weighting (note that both the location of the center and the shape of the query have changed); Query Expansion]

Rocchio Method
  Rocchio automatically
    re-weights terms
    adds in new terms (from relevant docs)
      have to be careful when using negative terms
      Rocchio is not a machine learning algorithm
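
The Rocchio update can be written directly from its definition: add a weighted average of the relevant document vectors to the query and subtract a weighted average of the non-relevant ones. The sketch below is an illustration under stated assumptions: the two weights are called `beta` and `gamma` and default to the 0.75 and 0.25 quoted in the slides, vectors are plain Python lists over the term order [Jordan, Bank, Bull, River], and the document vectors are hypothetical values invented for the example.

```python
def rocchio(q0, relevant, non_relevant, beta=0.75, gamma=0.25):
    """Return Q1 = Q0 + beta * mean(relevant) - gamma * mean(non_relevant), term by term."""
    q1 = list(q0)
    for k in range(len(q0)):
        if relevant:
            q1[k] += beta * sum(doc[k] for doc in relevant) / len(relevant)
        if non_relevant:
            q1[k] -= gamma * sum(doc[k] for doc in non_relevant) / len(non_relevant)
    return q1


# Term order: [Jordan, Bank, Bull, River]
q0 = [1.0, 1.0, 1.0, 1.0]
# Hypothetical feedback: two documents judged relevant, one judged non-relevant.
relevant = [[1.0, 0.0, 2.0, 1.0], [1.0, 0.0, 1.0, 1.0]]
non_relevant = [[0.0, 4.0, 0.0, 0.0]]

q1 = rocchio(q0, relevant, non_relevant)
print(q1)  # [1.75, 0.0, 2.125, 1.75]
```

With this made-up feedback the weight on "Bank" drops to zero while "Bull" gains the most, mirroring the direction of the re-weighting in the Jordan example. A term that ends up with a negative weight is the case the slides warn about under "negative terms".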
  Most methods perform similarly
    results heavily dependent on test collection
  Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback
  Known to improve results
  People don't seem to like giving feedback!

Relevance Feedback for Time Series
  The original query, and the weight vector. Initially, all weights are the same.
  Note: In this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.
  The initial query is executed, and the five best matches are shown (in the dendrogram).
  One by one the 5 best matching sequences will appear, and the user will rank them from very bad (-3) to very good (+3).
  Based on the user feedback, both the shape and the weight vector of the query are changed.
  The new query can be executed. The hope is that the query shape and weights will converge to the optimal query.

  Two papers consider relevance feedback for time series.
    Query Expansion: L. Wu, C. Faloutsos, K. Sycara, T. Payne: FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306
    Term Re-weighting: Keogh, E. & Pazzani, M. Relevance feedback retrieval of time series data. In Proceedings of SIGIR 99

Document Space has High Dimensionality
  What happens beyond 2 or 3 dimensions?
  Similarity still has to do with how many tokens are shared in common.
  More terms -> harder to understand which subsets of words are shared among similar documents.
  One approach to handling high dimensionality: Clustering

Text Clustering
  Finds overall similarities among groups of documents.
  Finds overall similarities among groups of tokens.
  Picks out some themes, ignores others.

Scatter/Gather (Hearst & Pedersen 95)
  Cluster sets of documents into general "themes", like a table of contents (using K-means)
  Display the contents of the clusters by showing topical terms and typical titles
  User chooses subsets of the clusters and re-clusters the documents within
  Resulting new groups have different "themes"

S/G Example: query on "star"
  Encyclopedia text
    8 symbols
    68 film, tv (p)
    97 astrophysics
    67 astronomy (p)
    10 flora/fauna
    14 sports
    47 film, tv
    7 music
    12 stellar phenomena
    49 galaxies, stars
    29 constellations
    7 miscellaneous
  Clustering and re-clustering is entirely automated

Ego Surfing!
  http://vivisimo.com/

The section that follows is about Evaluation.

Evaluation
  Why Evaluate?
  What to Evaluate?
  How to Evaluate?

Why Evaluate?
  Determine if the system is desirable
  Make comparative assessments
  Others?

What to Evaluate?
  How much of the information need is satisfied.
  How much was learned about a topic.
  Incidental learning:
    How much was learned about the collection.
    How much was learned about other topics.
  How inviting the system is.

What to Evaluate?
  What can be measured that reflects users' ability to use system? (Cleverdon 66)
    Coverage of Information
    Form of Presentation
    Effort required/Ease of Use
    Time and Space Efficiency
    Effectiveness
      Recall: proportion of relevant material actually retrieved
      Precision: proportion of retrieved material actually relevant

Relevant vs. Retrieved
  [Figure: the sets "Retrieved" and "Relevant" within "All docs"]

Precision vs. Recall

    $\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|}$

    $\text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$

Why Precision and Recall?
  Intuition: Get as much good stuff while at the same time getting as little junk as possible.

Retrieved vs. Relevant Documents
  [Figures: the retrieved set drawn against the relevant set in four cases]
    Very high precision, very low recall
    Very low precision, very low recall (0 in fact)
    High recall, but low precision
    High precision, high recall (at last!)

Precision/Recall Curves
  There is a tradeoff between Precision and Recall
  So measure Precision at different levels of Recall
  Note: this is an AVERAGE over MANY queries
  [Figure: precision plotted against recall]

Precision/Recall Curves
  Difficult to determine which of these two hypothetical results is better:
  [Figure: two precision/recall curves]

Precision/Recall Curves
  [Figure: Recall under various retrieval assumptions (Perfect, Tangent Parabolic Recall, Parabolic Recall, random, Perverse); x-axis: proportion of documents retrieved; y-axis: recall; 1000 documents, 100 relevant]
  [Figure: Precision under various assumptions; x-axis: proportion of documents retrieved; y-axis: precision]

Document Cutoff Levels
  Another way to evaluate:
    Fix the number of documents retrieved at several levels:
      top 5
      top 10
      top 20
      top 50
      top 100
      top 500
    Measure precision at each of these levels
    Take (weighted) average over results
  This is a way to focus on how well the system ranks the first k documents.

Problems with Precision/Recall
  Can't know true recall value
    except in small collections
  Precision/Recall are related
    A combined measure sometimes more appropriate
  Assumes batch mode
    Interactive IR is important and has different criteria for successful searches
  Assumes a strict rank ordering matters.

Relation to Contingency Table

                          Doc is Relevant   Doc is NOT relevant
  Doc is retrieved        a                 b
  Doc is NOT retrieved    c                 d

  Accuracy: (a+d) / (a+b+c+d)
  Precision: a/(a+b)
  Recall: a/(a+c)
  Why don't we use Accuracy for IR?
    (Assuming a large collection)
    Most docs aren't relevant
    Most docs aren't retrieved
    Inflates the accuracy value

The E-Measure
  Combine Precision and Recall into one number (van Rijsbergen 79)

    $E = 1 - \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}$

    P = precision
    R = recall
    b = measure of relative importance of P or R
  For example, b = 0.5 means user is twice as interested in precision as recall

How to Evaluate?
  Test Collections

Test Collections
  Cranfield 2
    1400 Documents, 221 Queries
    200 Documents, 42 Queries
  INSPEC: 542 Documents, 97 Queries
  UKCIS: > 10000 Documents, multiple sets, 193 Queries
  ADI: 82 Documents, 35 Queries
  CACM: 3204 Documents, 50 Queries
  CISI: 1460 Documents, 35 Queries
  MEDLARS (Salton): 273 Documents, 18 Queries

TREC
  Text REtrieval Conference/Competition
  Run by NIST (National Institute of Standards & Technology)
  2002 (November) will be 11th year
  Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs
    Newswire & full text news (AP, WSJ, Ziff, FT)
    Government documents (federal register, Congressional Record)
    Radio Transcripts (FBIS)
    Web "subsets"

TREC (cont.)
  Queries + Relevance Judgments
    Queries devised and judged by "Information Specialists"
    Relevance judgments done only for those documents retrieved -- not entire collection!
  Competition
    Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
    Results judged on precision and recall, going up to a recall level of 1000 documents

TREC
  Benefits:
    made research systems scale to large collections (pre-WWW)
    allows for somewhat controlled comparisons
  Drawbacks:
    emphasis on high recall, which may be unrealistic for what most users want
    very long queries, also unrealistic
    comparisons still difficult to make, because systems are quite different on many dimensions
    focus on batch ranking rather than interaction
    no focus on the WWW

TREC is changing
  Emphasis on specialized "tracks"
    Interactive track
    Natural Language Processing (NLP) track
    Multilingual tracks (Chinese, Spanish)
    Filtering track
    High-Precision
    High-Performance
  http://trec.nist.gov/

What to Evaluate?
  Effectiveness
    Difficult to measure
    Recall and Precision are one way
    What might be others?
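
The evaluation measures defined from the contingency table (precision, recall, accuracy) and the E-measure can be computed together. The sketch below is illustrative only: the function name `evaluate` is an assumption, the E-measure weight is named `weight_b` to keep it apart from the table cell b, and the counts are hypothetical (1000 documents, 100 relevant, 50 retrieved, 20 of those relevant). It also assumes at least one relevant document is retrieved, since otherwise the divisions are undefined.

```python
def evaluate(a, b, c, d, weight_b=0.5):
    """Compute measures from contingency-table counts.

    a: retrieved and relevant        b: retrieved and not relevant
    c: not retrieved but relevant    d: not retrieved and not relevant
    """
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    # E = 1 - (1 + b^2) / (b^2 / R + 1 / P), with b the relative-importance weight.
    e_measure = 1 - (1 + weight_b ** 2) / (weight_b ** 2 / recall + 1 / precision)
    return precision, recall, accuracy, e_measure


# Hypothetical counts for one query.
p, r, acc, e = evaluate(a=20, b=30, c=80, d=870)
print(p, r, acc, e)
```

Here precision is 0.4 and recall only 0.2, yet accuracy is 0.89. A system that retrieved nothing at all would have a = b = 0 and score (0 + 900) / 1000 = 0.9 on accuracy, which is why accuracy is a poor measure for IR on large collections.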