Title

Title

Introduction to Information Retrieval Introduction to Information Retrieval cs160 Introduction David Kauchak adapted from: http://www.stanford.edu/class/cs276/handouts/lecture1-intro.ppt Introduction to Information Retrieval Introductions Name/nickname Dept., college and year One interesting thing about yourself Why are you taking this class? What topics/material would you like to see covered? Plans after graduation Introduction to Information Retrieval

Administrative Web page: www.cs.pomona.edu/classes/cs160/ Syllabus Administrative handout Class feedback In class participation Workload Homework 1 available soon Programming assignment 1 available soon Due time? Introduction to Information Retrieval Information retrieval (IR) What comes to mind when I say information retrieval? Where have you seen IR? What are some real-world examples/uses? Search engines File search (e.g. OS X Spotlight, Windows Instant Search, Google Desktop)

Databases? Catalog search (e.g. library) Intranet search (i.e. corporate networks) Introduction to Information Retrieval Information Retrieval Information Retrieval is finding material in text documents of an unstructured nature that satisfy an information need from within large collections of digitally stored content 5 Introduction to Information Retrieval Information Retrieval Information Retrieval is finding material in text documents of an unstructured nature that satisfy an information need from within large collections of digitally stored content ? 6 Introduction to Information Retrieval Information Retrieval Information Retrieval is finding material in text

documents of an unstructured nature that satisfy an information need from within large collections of digitally stored content Find all documents about computer science Find all course web pages at Pomona What is the cheapest flight from LA to NY? Who is was the 15th president? 7 Introduction to Information Retrieval Information Retrieval Information Retrieval is finding material in text documents of an unstructured nature that satisfy an information need from within large collections of digitally stored content What is the difference between an information need and a query? 8 Introduction to Information Retrieval Information Retrieval Information Retrieval is finding material in text documents of an unstructured nature that satisfy an information need from within large collections of

digitally stored content Information need Find all documents about computer science Find all course web pages at Pomona Who is was the 15th president? Query computer science Pomona AND college AND url-contains class WHO=president NUMBER=15 9 Introduction to Information Retrieval IR vs. databases Structured data tends to refer to information in tables Employee Manager Salary Smith

Jones 50000 Chang Smith 60000 Ivy Smith 50000 Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith. Introduction to Information Retrieval Unstructured (text) vs. structured (database) data in 1996 11 Introduction to Information Retrieval

Unstructured (text) vs. structured (database) data in 2006 12 Introduction to Information Retrieval Unstructured data in 1680 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeares plays for Brutus and Caesar, then strip out plays containing Calpurnia. Any problems with this? Slow (for large corpora) Other operations (e.g., find the word Romans near countrymen) not feasible Ranked retrieval (best documents to return) Later lectures 13 Introduction to Information Retrieval Unstructured data in 1680 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? How might we speed up this type of query? Indexing: for each word, keep track of which documents it occurs in

Introduction to Information Retrieval Term-document incidence matrix Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia

0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy

1 0 1 1 1 1 worser 1 0 1 1 1 0 1 if play contains word, 0 otherwise

Introduction to Information Retrieval Incidence vectors For each term, we have a 0/1 vector Caeser = 110111 Brutus = 110100 Calpurnia = 010000 Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0

0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1

1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0

0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0

Introduction to Information Retrieval Incidence vectors For each term, we have a 0/1 vector Caeser = 110111 Brutus = 110100 Calpurnia = 010000 How can we get the answer from these vectors Introduction to Information Retrieval Incidence vectors For each term, we have a 0/1 vector Caeser = 110111 Brutus = 110100 Calpurnia = 010000 Bitwise AND the vectors together using the complemented vector for all NOT queries Caeser AND Brutus AND COMPLEMENT(Calpurnia) 110111 & 110100 & ~010000 = 110111 & 110100 & 101111 = 100100 Introduction to Information Retrieval

Answers to query Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. 19 Introduction to Information Retrieval Incidence vectors For each term, we have a 0/1 vector Caeser = 110111 Brutus = 110100 Calpurnia = 010000 Bitwise AND the vectors together using the complemented vector for all NOT queries Any problem with this approach? Introduction to Information Retrieval Bigger collections Consider N = 1 million documents, each with about

1000 words Say there are M = 500K distinct terms among these. How big is the incidence matrix? The matrix is a 500K by 1 million matrix = half a trillion 0s and 1s Even for a moderate sized data set we cant store the matrix in memory Each vector has 1 million entries Bitwise operations become much more expensive Introduction to Information Retrieval What does the matrix look like? Consider N = 1 million documents, each with about 1000 words Extremely sparse! How many 1s does the matrix contain? no more than one billion Each of the 1 million documents has at most 1000 1s In practice, well see that the number of unique words in a document is much less than this Whats a better representation? Only record the 1 positions Introduction to Information Retrieval Inverted index

For each term, we store a list of all documents that contain it What data structures might we use for this? Brutus Brutus Brutus 2 4 8 16 32 64 128 array 2 4 8 16 linked list 32 64 4 128 32 64 2 8 16 hashtable docID 128 ? 23 Introduction to Information Retrieval

Inverted index representation Brutus 2 4 8 16 32 64 128 array Pros Simple to implement No extra pointers required for data structure Contiguous memory Cons How do we pick the size of the array? What if we want to add additional documents? 24 Introduction to Information Retrieval Inverted index representation Brutus 2 4 8 16 linked list 32

64 128 Pros Dynamic space allocation Insertion of new documents is straightforward Cons Memory overhead of pointers Noncontiguous memory access 25 Introduction to Information Retrieval Inverted index representation Brutus 4 128 32 64 2 8 16 hashtable Pros Search in constant time Contiguous memory Cons

How do we pick the size? What if we want to add additional documents? May have to rehash if we increase in size To get constant time operations, lots of unused slots/memory 26 Introduction to Information Retrieval Inverted index The most common approach is to use a linked list representation Posting Brutus 2 4 8 16 Calpurnia 1

2 3 5 Caesar 13 Dictionary 32 8 64 13 128 21 34 16 Postings lists 27

Introduction to Information Retrieval Inverted index construction Documents to be indexed Friends, Romans, countrymen. text preprocessing friend , roman , countrymen . indexer Inverted index friend 2 4 roman 1 2 countryman

13 16 Introduction to Information Retrieval Boolean retrieval In the boolean retrieval model we ask a query that is a boolean expression: A boolean query uses AND, OR and NOT to join query terms Caesar AND Brutus AND NOT Calpurnia Pomona AND College (Mike OR Michael) AND Jordan AND NOT(Nike OR Gatorade) Given only these operations, what types of questions cant we answer? Phrases, e.g. Pomona College Proximity, Michael within 2 words of Jordan 29 Introduction to Information Retrieval Boolean retrieval Primary commercial retrieval tool for 3 decades Professional searchers (e.g., lawyers) still like boolean queries Why? You know exactly what youre getting, a query either

matches or it doesnt Through trial and error, can frequently fine tune the query appropriately Dont have to worry about underlying heuristics (e.g. PageRank, term weightings, synonym, etc) 30 Introduction to Information Retrieval Example: WestLaw http://www.westlaw.com/ Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) Tens of terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM All words starting with LIMIT 31 Introduction to Information Retrieval Example: WestLaw

http://www.westlaw.com/ Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) Tens of terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM 32 Introduction to Information Retrieval Example: WestLaw http://www.westlaw.com/ Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) Tens of terabytes of data; 700,000 users Majority of users still use boolean queries Example query: What is the statute of limitations in cases involving the

federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM /3 = within 3 words, /S = in same sentence 33 Introduction to Information Retrieval Example: WestLaw http://www.westlaw.com/ Another example query: Requirements for disabled people to be able to access a workplace disabl! /p acces\s! /s work-site work-place (employment /3 place) Long, precise queries; proximity operators; incrementally developed; not like web search Professional searchers often like Boolean search: Precision, transparency and control But that doesnt mean they actually work better. Introduction to Information Retrieval Query processing: AND What needs to happen to process:

Brutus AND Caesar Locate Brutus and Caesar in the Dictionary; Retrieve postings lists Brutus 2 4 8 16 Caesar 1 2 3 5 32 8 Merge the two postings: Brutus AND Caesar 2

64 1 3 128 21 34 8 35 Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2 4 8 16 Caesar 1

2 3 5 32 8 64 1 3 128 21 34 Brutus AND Caesar 36 Introduction to Information Retrieval The merge Walk through the two postings simultaneously

Brutus 2 4 8 16 Caesar 1 2 3 5 32 8 64 1 3

128 21 34 Brutus AND Caesar 37 Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2 4 8 16 Caesar 1 2

3 5 32 8 64 1 3 128 21 34 Brutus AND Caesar 2 38 Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2

4 8 16 Caesar 1 2 3 5 32 8 64 1 3 128 21

34 Brutus AND Caesar 2 39 Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2 4 8 16 Caesar 1 2 3 5

32 8 64 1 3 128 21 34 Brutus AND Caesar 2 40 Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2 4

8 16 Caesar 1 2 3 5 32 8 64 1 3 128 21 34

Brutus AND Caesar 2 41 Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2 4 8 16 Caesar 1 2 3 5 32

8 1 3 Brutus AND Caesar 2 64 128 21 34 8 42 Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2 4

8 16 Caesar 1 2 3 5 32 8 64 1 3 128 21 34

What assumption are we making about the postings lists? For efficiency, when we construct the index, we ensure that the postings lists 43 are sorted Introduction to Information Retrieval The merge Walk through the two postings simultaneously Brutus 2 4 8 16 Caesar 1 2 3

5 32 8 64 1 3 128 21 34 What is the running time? O(length1 + length2) 44 Introduction to Information Retrieval Sec. 1.3 Boolean queries: More general merges Which of the following queries can we still do in

time O(length1+length2)? Brutus AND NOT Caesar Brutus OR NOT Caesar 45 Introduction to Information Retrieval Merging What about an arbitrary Boolean formula? (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) x = (Brutus OR Caesar) y = (Antony OR Cleopatra) x AND NOT y Is there an upper bound on the running time? O(total_terms * query_terms) What about Brutus AND Calpurnia AND Caesar? 46 Introduction to Information Retrieval Query optimization Query: Brutus AND Calpurnia AND Caesar Consider a query that is an AND of t terms. For each of the terms, get its postings, then AND

them together What is the best order for query processing? Brutus 2 4 8 16 Calpurnia 1 2 3 5 Caesar 13 16 32

8 64 13 128 21 34 47 Introduction to Information Retrieval Query optimization example Heuristic: Process in order of increasing freq: merge the two terms with the shortest postings list this creates a new AND query with one less term repeat Brutus 2 4 8 16

Calpurnia 1 2 3 5 Caesar 13 32 8 64 13 128 21 34 16 cute the query as (Caesar AND Brutus) AND Calpurnia 48

Introduction to Information Retrieval Query optimization Query: Brutus OR Calpurnia OR Caesar Consider a query that is an OR of t terms. What is the best order for query processing? Same: still want to merge the shortest postings lists first Brutus 2 4 8 16 Calpurnia 1 2 3 5

Caesar 13 16 32 8 64 13 128 21 34 49 Introduction to Information Retrieval Query optimization in general (madding OR crowd) AND (ignoble OR NOT strife) Need to evaluate OR statements first Which OR should we do first? Estimate the size of each OR by the sum of the posting list lengths NOT is just the number of documents minus the length Then, it looks like an AND query: x AND y

50 Introduction to Information Retrieval Exercise Recommend a query processing order for Term (tangerine OR NOT trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) eyes kaleidoscope marmalade skies tangerine trees Freq 213312 87009 107913 271658 46653 316812

51 Introduction to Information Retrieval Next steps Phrases Pomona College Proximity: Find Gates NEAR Microsoft. Need index to capture position information in docs. More later Zones in documents: Find documents with (author = Ullman) AND (text contains automata) Ranking search results include occurrence frequency weight different zones/features differently (e.g. title, header, link text, ) Incorporate link structure 52 Introduction to Information Retrieval Resources for todays lecture

Introduction to Information Retrieval, ch. 1 Managing Gigabytes, Chapter 3.2 Modern Information Retrieval, Chapter 8.2 Shakespeare: http://www.rhymezone. com/shakespeare/ Try the neat browse by keyword sequence feature! 53

Recently Viewed Presentations

  • Planets

    Planets

    Named after the . Roman. god of war! ... Titan- Saturn's largest moon is larger than Mercury! Uranus. 19.2 AU from the Sun. Ring system, many moons. (27 so far) Spins horizontal . 84 years revolution. Uranus. It is the...
  • Exploring Terrestrial & Aquatic Biomes Overview: Discovering Ecology

    Exploring Terrestrial & Aquatic Biomes Overview: Discovering Ecology

    Explain the flow of matter and energy through ecosystems by Arranging components of a food chain according to energy flow. Comparing the quantity of energy in the steps of an energy pyramid. Explaining the need for cycling of major nutrients...
  • St. Croix SCUBA Adventure Program Arrival and Registration

    St. Croix SCUBA Adventure Program Arrival and Registration

    Reef Fish and Invertebrate Identification Coral Identification. Certifications and Awards. PADI Project AWARE . S.C.E.N.E. Duty to God . Sleeping. You will be tent camping. Please bring your own bedding and sleeping pad. ... Andrea Watts ...
  • The Bohr-Rutherford Atom

    The Bohr-Rutherford Atom

    The Bohr-Rutherford Atom Nils Bohr Ernest Rutherford Physics 100 Chapt 23 1895 J.J. Thomson discovered electron Cathode rays have negative charge and very small mass Plum pudding?
  • Unit 3A: Neural and Hormonal Systems

    Unit 3A: Neural and Hormonal Systems

    Unit 3A: Neural and Hormonal Systems. ... The word "hormone" comes from a Greek word meaning "to stir up or excite." Hormones carry messages to vital body organs. The pituitary regulates metabolism by stimulating other glands.
  • Profesiones deportivas

    Profesiones deportivas

    Golf y pitch and putt: ResoluciĆ³n de 16 de septiembre de 2011, de la Presidencia del Consejo Superior de Deportes, por la que se publica el plan formativo de la modalidad deportiva de golf y pitch and putt. ... Por...
  • PUBLIC LAW 221-1999 SCHOOL IMPROVEMENT - Partners on the ...

    PUBLIC LAW 221-1999 SCHOOL IMPROVEMENT - Partners on the ...

    Public Law 221 is an accountability system. PL 221 accountability includes: National and Regional Accreditation Agencies School Improvement Planning Models Partners on the Journey/ Archdiocese of Indianapolis Mandatory Annual Assessment State Provided Tests Secondary Indicators Mobility - How is performance...
  • BSA STEM NOVA AWARDS Science Technology Engineering Math

    BSA STEM NOVA AWARDS Science Technology Engineering Math

    Complete one Adventure/Merit Badge/Exploration from a list. Different one for each Nova Award. Scientific Exploration. Activities and/or create something. Visit a place where the STEM area is in practice. Discuss with Counselor how the STEM area affects everyday life. 4/28/2019....