Introduction to Web Search and Advanced Internet Services

Introduction to Web Search and Advanced Internet Services

Introduction to Web Search and Advanced Internet Services Tao Yang UCSB CS290N, Spring 2014 Table of Content Search Engine Architecture and Process Web Content and Size

Users Behavior in Search Sponsored Search: Advertisement Impact to Business and Search Engine Optimization Search engine history and related fields Web search basics Sponsored Links CG Appliance Express Discount Appliances (650) 756-3931 Same Day Certified Installation San Francisco-Oakland-San Jose, CA

User Miele Vacuum Cleaners Miele Vacuums- Complete Selection Free Shipping! Miele Vacuum Cleaners Miele-Free Air shipping! All models. Helpful advice. Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)

Miele, Inc -- Anything else is a compromise Web spider At the heart of your home, Appliances by Miele. ... USA. to Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ... - 20k - Cached - Similar pages Miele Welcome to Miele, the home of the very best appliances and kitchens in the world. - 3k - Cached - Similar pages Miele - Deutscher Hersteller von Einbaugerten, Hausgerten ... - [ Translate this page ] Das Portal zum Thema Essen & Geniessen online unter Miele weltweit

...ein Leben lang. ... Whlen Sie die Miele Vertretung Ihres Landes. - 10k - Cached - Similar pages Herzlich willkommen bei Miele sterreich - [ Translate this page ] Herzlich willkommen bei Miele sterreich Wenn Sie nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERTE ... - 3k - Cached - Similar pages Search Indexer The Web Indexes

Ad indexes Search engine architecture: key pieces Spider (a.k.a. crawler/robot) builds corpus Collects web pages recursively For each known URL, fetch the page, parse it, and extract new URLs Repeat Additional pages from direct submissions & other sources Indexer creates inverted indexes so online system can search Online query process serves query results Front end query reformulation, word processing

Back end finds matching documents and ranks them Inverted index Linked lists generally preferred to arrays Dynamic space allocation Insertion of terms into documents easy Space overhead of pointers Santa 2 4 8

16 Barbara 1 2 3 5 UCSB

13 Dictionary 32 8 64 13 21 128 34

16 Postings Sorted by docID (more later on why).5 Indexing Process Indexing Process Text acquisition identifies and stores documents for indexing Text transformation transforms documents into index terms or features Index creation takes index terms and creates data structures (indexes) to support fast searching

Indexing and Mining at Internet Document Document Document respository respository respository Crawler Crawler Crawler Parsing

Parsing Parsing Content classification Spammer Duplicate removal removal Web documents Inverted index generation Link graph

generation Click data analysis Online Database Query Process Query Process User interaction supports creation and refinement of query, display of results Ranking uses query and indexes to generate ranked list of

documents Evaluation monitors and measures effectiveness and efficiency (primarily offline) Online Engine Architecture Traffic load balancer Client queries Frontend Frontend Frontend Frontend Hierarchical

Cache Cache Cache Cache Neptune Clustering Middleware Ranking Web page Ranking Ranking Ranking index Ranking Ranking

Classification Web page index Structured DB PageInfo Page Info Document Document Document Abstract

Document Abstract Abstract description User Interaction Query transformation Improves initial query, both before and after initial search Includes text transformation techniques used for documents Spell checking and query suggestion provide alternatives to original query Query expansion and relevance feedback modify the original query with additional terms

User Interaction Results output Constructs the display of ranked documents for a query Generates snippets to show how queries match documents Highlights important words and passages Retrieves appropriate advertising in many applications May provide clustering and other visualization tools Online System Support Performance optimization Designing ranking algorithms for efficient processing

Term-at-a time vs. document-at-a-time processing Safe vs. unsafe optimizations Distribution Processing queries in a distributed environment Query broker distributes queries and assembles results Caching is a form of distributed searching Evaluation Logging Logging user queries and interaction is crucial for improving search effectiveness and efficiency Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking,

advertising search, and other components Ranking analysis Measuring and tuning ranking effectiveness Performance analysis Measuring and tuning system efficiency General Search vs. Vertical Search General Search: identify relevant information with a horizontal/exhaustive view of the world. Vertical Search: Focus on specific segment of web content Integrate domain knowledge (e.g. taxonomies /ontology), & deep web Examples: travel in Expedia, products in Amazon.

Example of Vertical Search: Question Answering Table of Content Search Engine Architecture and Process Web Content and Size Users Behavior in Search Sponsored Search: Advertisement Impact to Business and Search Engine Optimization

Search Engine History/Related Fields Characteristics of Web Content No design/co-ordination Distributed content creation, linking Content includes truth, lies, obsolete information, contradictions Structured (databases), semistructured Scale -- huge Growth slowed down from initial volume doubling every few months Content can be dynamically generated The Web

Dynamic Web Content AA129 Application server Browser A page without a static html version E.g., current status of flight AA129 Current availability of rooms at a hotel Back-end databases Usually, assembled at the time of a request from a browser

Typically, URL has a ? character in it Most dynamic content is ignored by web spiders Many reasons including malicious spider traps Acquired for some content (e.g. news stores) Application-specific spidering The web: size What is being measured? Number of hosts Number of (static) html pages Volume of data Number of hosts netcraft survey Gives monthly report on how many web servers are out there

Number of pages numerous estimates More to follow later in this course For a Web engine: how big its index is The web: the number of hosts The web: web server vendors Static pages: rate of change Fetterly et al. study: several views of data, 150 million pages over 11 weekly crawls Bucketed into 85 groups by extent of change

Diversity Languages/Encodings Hundreds (thousands ?) of languages, W3C encodings: 55 (Jul01) [W3C01] Google (mid 2001): English: 53%, JGCFSKRIP: 30% Document & query topic Popular Query Topics (from 1 million Google queries, Apr 2000) Arts 14.6% Arts: Music 6.1%

Computers 13.8% Regional: North America 5.3% Regional 10.3% Adult: Image Galleries

4.4% Society 8.7% Computers: Software 3.4% Adult 8% Computers: Internet

3.2% Recreation 7.3% Business: Industries 2.3% Business 7.2%

Regional: Europe 1.8% Table of Content

Search Engine Architecture and Process Web Content and Size Users Behavior in Search Sponsored Search: Advertisement Impact to Business and Search Engine Optimization Search Engine History/Related Fields The user Diverse in access methodology Increasingly, high bandwidth connectivity

Growing segment of mobile users: limitations of form factor keyboard, display Diverse in search methodology Search, search + browse, filter by attribute Average query length ~ 2.5 terms Has to do with what theyre searching for Poor comprehension of syntax Early engines surfaced rich syntax Boolean, phrase, etc. Current engines hide these Web Search: How do users find content? Informational (~25%) want to learn about something

autism Navigational (~40%) want to go to that page United Airlines Transactional (~35%) want to do something (web-mediated) Access a service Downloads Shop Gray areas Santa barbara weather

Mars surface images Nikon D-SLR Find a good hub Exploratory search see whats there Car rental Finland 28 Broder 2002, A Taxomony of web search Users evaluation of engines Relevance and validity of results UI Simple, no clutter, error tolerant Trust Results are objective, the engine wants to

help me Pre/Post process tools provided Mitigate user errors (auto spell check) Explicit: Search within results, more like this, refine ... Anticipative: related searches Users evaluation Quality of pages varies widely Relevance is not enough Duplicate elimination Precision vs. recall On the web, recall seldom matters

What matters Precision at 1? Precision above the fold? Comprehensiveness must be able to deal with obscure queries Recall matters when the number of matches is very small User perceptions may be unscientific, but are significant over a large aggregate What about on Mobile Query characteristics: Best known studies by Kamvar and Baluja (2006 and 2007) and by Yi, Maghoul, and Pedersen (2008) Have a different distribution than the query distribution for PC users

Bias towards shorter queries Data contradicts that: 2.6 words per query, same # chars as PC Difficulty of query entry is a significant hurdle Much higher location-based activity More notification-driven tasks 31 Implications and Challenges Task-orientation Specialized content packaging Locality Inference from queries and from devices Minimize typing and round-trips: get

results, not just links Less room to display SERP + other accessories Use of mobile in social settings and leveraging notification abilities 32 Table of Content Search Engine Architecture and Process

Web Content and Size Users Behavior in Search Sponsored Search: Advertisement Impact to Business and Search Engine Optimization Search query Ad 34 Questions Do you think an average user, knows the difference between sponsored search links and

algorithmic search results? 35 How it works Advertiser I want to bid $5 on canon camera I want to bid $2 on cannon camera Ad Index Sponsored

search engine Engine decides when/where to show this ad. Landing page Engine decides how much to charge advertiser on a click. 36 Higher slots get more clicks

Three sub-problems 1. Match ads to query/context 2. Order the ads 3. Pricing on a click-through IR Econ Paid placement Gives consumer opportunity to click through to an advertiser Compensated by advertiser for click through Each consumer reveals clues about her information need at hand The keyword(s) he types (e.g., miele)

Keyword(s) in his email (gmail) Personal profile information (Yahoo! ) Complex logistical problems: selling contracts, scheduling ads supply chain optimization Table of Content Search Engine Architecture and Process Web Content and Size Users Behavior in Search

Sponsored Search: Advertisement Impact to Business and Search Engine Optimization Search Engine History/Related Fields Search Traffic is Important for Business: Example of Site Traffic Analysis The trouble with paid placement It costs money. Whats the alternative? Search Engine Optimization: Tuning your web page to rank highly in the search results for select keywords Alternative to paying for placement Thus, intrinsically a marketing function

Also known as Search Engine Marketing Performed by companies, webmasters and consultants (Search engine optimizers) for their clients Search engine optimization Motives Commercial, political, religious, lobbies Promotion funded by advertising budget Operators Contractors (Search Engine Optimizers) for lobbies, companies Web masters Hosting services

Forum Web master world ( ) Search engine specific tricks Discussions about academic papers More pointers in the Resources The spam industry Simplest forms Early engines relied on the density of terms The top-ranked pages for the query maui resort were the ones containing the most mauis and resorts SEOs responded with dense repetitions of chosen terms

e.g., maui resort maui resort maui resort Often, the repetitions would be in the same color as the background of the web page Repeated terms got indexed by crawlers But not visible to humans on browsers Cant trust the words on a web page, for ranking. Keyword stuffing Invisible text Pornographic Content

Cloaking: Link Farms Boost pagerank of a website Table of Content Search Engine Architecture and Process Web Content and Size

Users Behavior in Search Sponsored Search: Advertisement Impact to Business and Search Engine Optimization Search Engine History/Related Fields Information Retrieval (IR) System Document corpus Query String IR

System 1. Doc1 Ranked 2. Doc2 Documents 3. Doc3 . . 51 From Information Retrieval to Web Search Challenging due to Large-scale and noisy data. retrieving relevant documents to a query. retrieving from large sets of documents efficiently. Relevance is a subjective judgment and may include:

Simplest notion of relevance is that the query string appears verbatim in the document. More: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need). 52

Problems with Keywords May not retrieve relevant documents that include synonymous terms. restaurant vs. caf PRC vs. China May retrieve irrelevant documents that include ambiguous terms. bat (baseball vs. mammal) Apple (company vs. fruit) bit (unit of data vs. act of eating) 53 Search Intent Analysis Taking into account the meaning of the words

used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source. 54 Topics: Text mining Text mining is a cover-all marketing term A lot of what weve already talked about is actually the bread and butter of text mining: Text classification, clustering, and retrieval But we will focus in on some of the higher-level text

applications: Extracting document metadata Topic tracking and new story detection Cross document entity and event coreference Text summarization Question answering Topics: Information extraction Getting semantic information out of textual data Filling the fields of a database record E.g., looking at an event web page: What is the name of the event? What date/time is it? How much does it cost to attend Other applications: resumes, health data,

A limited but practical form of natural language understanding Topics: Recommendation systems Using statistics about the past actions of a group to give advice to an individual E.g., Amazon book suggestions or NetFlix movie suggestions A matrix problem: but now instead of words and documents, its users and documents History of IR and Web Search 1960-70s: Initial exploration of text retrieval systems for small

corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vector-space models of retrieval. 1980s: Larger document database systems, many run by companies: Lexis-Nexis Dialog MEDLINE 58 From IR to Web Search 1990s: Organized Competitions

NIST TREC Searching FTPable documents on the Internet Archie WAIS Searching the World Wide Web Lycos Yahoo Altavista 59 IR/Web Search History in 2000s 2000s

Link analysis for Web Search Google Inktomi Teoma Feedback based engine: DirectHit ( Automated Information Extraction Whizbang Fetch Burning Glass Question Answering TREC Q/A track Jeeves 60 IR/Web Search Activities in 2000s 2000s continued: Multimedia IR Image Video Audio music

Cross-Language IR Document Summarization Mobile search 61 Related Areas Information Management and Data Analysis Information Science &CHI Machine Learning and data mining Natural Language Processing Large-scale systems Database/data stores Operating systems/networking support

Web language analysis Compression/fast algorithms. Fault tolerance/paralle+distributed systems 62

Recently Viewed Presentations

  • Cultivating Knowledge Relationships Research Library ...

    Cultivating Knowledge Relationships Research Library ...

    Martin Halbert, Dean of Libraries. University of North Texas. CNI Spring 2013 Meeting, San Antonio, TX. ... Kathleen Fitzpatrick (Director of Scholarly Communication at the MLA) and other researchers will discuss new models for publishing, including OA.
  • TruServe: Advanced Reporting - NOSORH

    TruServe: Advanced Reporting - NOSORH

    TruServe: Advanced Reporting. Region D SORH Regional Partnership Meeting. May 20, 2015. Kelly Quigley, TruServe Coordinator
  • AQA Science Tr u s t ~ C

    AQA Science Tr u s t ~ C

    Links to specification, specimen papers and markschemes can be found on Firefly. ... Questions? [email protected] 11/21/2017. How to maximise the learning from revision in Science. Student 2Does ten hours' revision. two hours making mind-maps. two hours creating flash cards of...
  • Erin Kirkbride (High School Athletes) XPLOSIVE SPORTS ACADEMY

    Erin Kirkbride (High School Athletes) XPLOSIVE SPORTS ACADEMY

    University of Virginia Graduate. Sports Medicine. Certified Functional Movement Screen Specialist. Nike Sparq trainer . Combine360 trainer. Guest speaker at Philadelphia Eagles Sports Medicine Symposium

    Scope The portion of a program within which you can refer to a variable Comes into scope Variable comes into existence Goes out of scope Variable ceases to exist * Java Programming, Seventh Edition Understanding Blocks and Scope (cont'd.) Redeclare...
  • History and Overview of Portico A New Electronic

    History and Overview of Portico A New Electronic

    History and Overview of Portico A New Electronic Archiving Service Portico's Mission To preserve scholarly literature published in electronic form and to ensure that these mater
  • Paragraphing


    Paragraph Unity - the entire paragraph should revolve around a single idea or focal point. If it begins with one idea, it should not end with another. ... Problem to Solution Pattern - Introduce a problem, explain it briefly, and...
  • Measurement: Reliability and Validity

    Measurement: Reliability and Validity

    Measurement: Reliability and Validity For a measure to be useful, it must be both reliable and valid Reliable = consistent in producing the same results every time the measure is used