Solr 3.1 and Beyond
Yonik Seeley, Lucid Imagination
October 8, 2010

Agenda

Goal: Introduce new features you can try & use now in Solr development versions 3.1 or 4.0
- Relevancy (Extended Dismax Parser)
- Spatial/Geo Search
- Search Result Grouping / Field Collapsing
- Faceting (Pivot, Range, Per-segment)
- Scalability (Solr Cloud)
- Odds & Ends
- Q&A

Solr 3.1? What happened to 1.5?
- Lucene/Solr merged (March 2010)
  - Single set of committers
  - Single dev mailing list
  - Single shared subversion trunk
  - Keep separate downloads, user mailing lists
  - Other former lucene subprojects spun off (Nutch, Tika, Mahout, etc)
- Development
  - trunk is now always next major release (currently 4.0)
  - branch_3x will be base for all 3.x releases
  - Branch together, Release together, Share version numbers

RELEVANCE

Extended Dismax Parser
- Superset of dismax
    &defType=edismax&q=foo&qf=body
- Fixes edge cases where dismax could still throw exceptions
- Full lucene syntax support (AND, OR, NOT, -)
  - Tries lucene syntax first
  - Smart escaping is done if there are syntax errors
- Optionally supports treating and/or as AND/OR in lucene syntax
- Fielded queries (e.g. myfield:foo) even in degraded mode
  - uf parameter controls what field names may be directly specified in q
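As a rough illustration of the request parameters named so far, the sketch below builds the query string for an edismax request. The helper name `edismax_params` is hypothetical, and the space-separated format used for `uf` is an assumption rather than something the slides state.

```python
from urllib.parse import parse_qs, urlencode


def edismax_params(user_query, query_fields, user_fields=None):
    """Build request parameters that select the extended dismax parser."""
    params = {
        "defType": "edismax",          # select the edismax parser
        "q": user_query,               # raw user input
        "qf": " ".join(query_fields),  # fields searched for unfielded terms
    }
    if user_fields:
        # Assumption: uf takes a space-separated list of field names.
        params["uf"] = " ".join(user_fields)
    return params


# The example from the slide: &defType=edismax&q=foo&qf=body
simple = urlencode(edismax_params("foo", ["body"]))
print(simple)

# A fielded query, with the field explicitly allowed through uf.
fielded = urlencode(edismax_params("myfield:foo", ["body"], user_fields=["myfield"]))
print(parse_qs(fielded)["q"])
```

Building the parameters as a dict and encoding them in one step keeps characters such as `:` safely escaped in the final URL.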

- boost parameter for multiplicative boost-by-function
- Pure negative query clauses
    Example: solr OR (-solr)
- Enhanced term proximity boosting
  - pf2=myfield results in term bigrams in sloppy phrase queries
      myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
- Enhanced stopword handling
  - stopwords omitted in main query, but added in optional proximity boosting part
      Example: q=solr is awesome & qf=myfield & pf2=myfield ->
      +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  - Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer

SPATIAL SEARCH

Spatial Search
- Step 1: Index some locations!
    The Alpine Shop    44.013617,-73.168264
- Step 2: Decide where you are
    &pt=44.0153371,-73.16734
    &d=1
    &sfield=store
- Step 3: Profit!
  - Spatial Filter: &fq={!geofilt}
  - Bounding Box: &fq={!bbox}
  - Distance Function: &sort=geodist() asc
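The three spatial steps can be sketched as follows. This is only an illustration: the helper names are hypothetical, `q=*:*` is an assumed match-all query, `d` is assumed to be in kilometers, and the haversine formula on a spherical Earth stands in for whatever `geodist()` computes internally.

```python
import math
from urllib.parse import urlencode

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geofilt_params(lat, lon, distance, field):
    """Request parameters for a spatial filter sorted by distance."""
    return {
        "q": "*:*",              # assumption: match everything, filter spatially
        "fq": "{!geofilt}",      # spatial filter from the slide
        "pt": f"{lat},{lon}",    # where you are
        "d": distance,           # search radius
        "sfield": field,         # field holding the indexed location
        "sort": "geodist() asc", # nearest first
    }


STORE = (44.013617, -73.168264)  # The Alpine Shop, from step 1
HERE = (44.0153371, -73.16734)   # the point from step 2

params = geofilt_params(*HERE, 1, "store")
distance_km = haversine_km(*HERE, *STORE)

print(urlencode(params))
print(f"store is within d=1: {distance_km <= 1}")
```

Under these assumptions the store lies well inside the 1 km radius, so it would pass the `{!geofilt}` filter.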

RESULT GROUPING / FIELD COLLAPSING

Field Collapsing Definition
- Field collapsing
  - Limit the number of results per category
  - category normally defined by unique values in a field
- Uses
  - Web Search: collapse by web site
  - Email threads: collapse by thread id
  - Ecommerce/retail: show the top 5 items for each store category (music, movies, etc)

[Figures: Field Collapsing by Site; Result Grouping by Category; Field Collapse on Product Type]

Group by Field

    http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

    "grouped":{
      "manu_exact":{
        "matches":3,
        "groups":[{
            "groupValue":"Belkin",
            "doclist":{"numFound":2,"start":0,"docs":[
                {
                  "id":"IW-02",
                  "name":"iPod & iPod Mini USB 2.0 Cable"}]
            }},
          {
            "groupValue":"Apple Computer Inc.",
            "doclist":{"numFound":1,"start":0,"docs":[
                {

Group by Query

    http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

    "grouped":{
      "price:[0 TO 99.99]":{
        "matches":3,
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"},
            {
              "id":"F8V7067-APL-KIT",
              "name":"Belkin Mobile Power Cord for iPod"}]
        }},
      "price:[100 TO *]":{
        "matches":3,
        "doclist":{"numFound":1,"start":0,"docs":[

Grouping Params

    parameter         meaning                                            default
    group.field=      Like facet.field: group by unique field values
    group.query=      Like facet.query: top docs that also match
    group.function=   Group by unique values of the function query
    group.limit=      How many docs per group                            1
    group.sort=       How to sort documents within a group               Same as sort param
    rows=             How many groups to return                          10
    sort=             How to sort the groups relative to each other
                      (based on top doc)
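A client has to walk one extra level of nesting to read a grouped response. The sketch below flattens the "Group by Field" response shape into one summary row per group. The helper name `summarize_groups` is hypothetical, and because the Apple document is cut off in the slide, the sample uses a clearly labeled placeholder for it.

```python
import json

# Sample modeled on the "Group by Field" response. The second document is a
# hypothetical placeholder because the slide truncates it.
SAMPLE_RESPONSE = json.loads("""
{"grouped": {"manu_exact": {"matches": 3, "groups": [
  {"groupValue": "Belkin",
   "doclist": {"numFound": 2, "start": 0, "docs": [
     {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
  {"groupValue": "Apple Computer Inc.",
   "doclist": {"numFound": 1, "start": 0, "docs": [
     {"id": "HYPOTHETICAL-1", "name": "placeholder document"}]}}
]}}}
""")


def summarize_groups(response, field):
    """Return (group value, total docs in group, ids actually returned)."""
    summary = []
    for group in response["grouped"][field]["groups"]:
        doclist = group["doclist"]
        ids = [doc["id"] for doc in doclist["docs"]]
        summary.append((group["groupValue"], doclist["numFound"], ids))
    return summary


for value, total, shown in summarize_groups(SAMPLE_RESPONSE, "manu_exact"):
    print(f"{value}: showing {len(shown)} of {total} -> {shown}")
```

Note that `numFound` counts every document in the group, while `docs` holds only as many as `group.limit` allows, which is why the Belkin group reports 2 but lists 1.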

FACETING

Pivot Faceting
- Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting
- Syntax: facet.pivot=field1,field2,field3,...
- Example: facet.pivot=cat,inStock

                         #docs    #docs w/ inStock:true    #docs w/ inStock:false
    cat:electronics      14       10                       4
    cat:memory           3        3                        0
    cat:connector        2        0                        2
    cat:graphics card    2        0                        2
    cat:hard drive       2        2                        0
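The table above is a cross-product count: for every value of the first field, count the documents, then break that count down by the second field. A minimal sketch of the idea, using hypothetical in-memory documents with single-valued fields (this is not how Solr computes it internally):

```python
from collections import Counter, defaultdict


def pivot_counts(docs, outer, inner):
    """Count docs per outer value, then per inner value within each outer value."""
    totals = Counter()
    nested = defaultdict(Counter)
    for doc in docs:
        outer_value = doc[outer]
        totals[outer_value] += 1
        nested[outer_value][doc[inner]] += 1
    return [
        {
            "field": outer,
            "value": value,
            "count": count,
            "pivot": [
                {"field": inner, "value": inner_value, "count": inner_count}
                for inner_value, inner_count in nested[value].most_common()
            ],
        }
        for value, count in totals.most_common()
    ]


# Hypothetical documents, chosen only to illustrate the counting.
DOCS = (
    [{"cat": "electronics", "inStock": True}] * 3
    + [{"cat": "electronics", "inStock": False}]
    + [{"cat": "memory", "inStock": True}] * 2
)

result = pivot_counts(DOCS, "cat", "inStock")
for row in result:
    print(row)
```

The nested list-of-dicts output deliberately mirrors the `field` / `value` / `count` / `pivot` layout of a pivot facet response.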

Pivot Faceting

    http://...&facet=true&facet.pivot=cat,popularity

    "facet_counts":{
      "facet_pivot":{
        "cat,popularity":[{
            "field":"cat",
            "value":"electronics",
            "count":14,                  <- 14 docs w/ cat==electronics
            "pivot":[{
                "field":"popularity",
                "value":"6",
                "count":5},              <- 5 docs w/ cat==electronics && popularity==6
              { "field":"popularity", "value":"7", "count":4},
              { "field":"popularity", "value":"1", "count":2}]},
          { "field":"cat", "value":"memory", "count":3, "pivot":[]},
          [...]

Range Faceting
- Like Date faceting, but more generic

    http://...&facet=true
      &facet.range=price
      &facet.range.start=0
      &facet.range.end=500
      &

    "facet_counts":{
      "facet_ranges":{
        "price":{
          "counts":{
            "0.0":5,
            "50.0":2,
            "100.0":0,
            "150.0":2,
            "200.0":0,
            "250.0":1,
            "300.0":2,
            "350.0":2,
            "400.0":0,
            "450.0":1},
          "gap":50.0,
          "start":0.0,
          "end":500.0}}}}
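To turn a range facet response into labeled buckets, a client can combine each key in `counts` with the reported `gap`. The sketch below does this for the response shown above, assuming each key is the lower bound of a bucket that is `gap` wide; the helper name `range_buckets` is hypothetical.

```python
import json

# The range facet response from the slide.
RESPONSE = json.loads("""
{"facet_counts": {"facet_ranges": {"price": {
  "counts": {"0.0": 5, "50.0": 2, "100.0": 0, "150.0": 2, "200.0": 0,
             "250.0": 1, "300.0": 2, "350.0": 2, "400.0": 0, "450.0": 1},
  "gap": 50.0, "start": 0.0, "end": 500.0}}}}
""")


def range_buckets(response, field):
    """Return (lower bound, upper bound, count) for each bucket of a range facet."""
    info = response["facet_counts"]["facet_ranges"][field]
    gap = info["gap"]
    # Assumption: each key is the lower bound of a bucket of width `gap`.
    return [
        (float(lower), float(lower) + gap, count)
        for lower, count in info["counts"].items()
    ]


buckets = range_buckets(RESPONSE, "price")
for lower, upper, count in buckets:
    print(f"{lower:6.1f} to {upper:6.1f}: {count}")
```

With start=0, end=500 and a gap of 50, this yields ten buckets whose last upper bound equals the reported `end`.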

Existing single-valued faceting algorithm

    q=Juggernaut&facet=true&facet.field=hero

[Figure: for each document matching the base query (docs 0, 2, 7), the document's entry in the order array selects an accumulator slot to increment; the accumulator totals then feed a priority queue (e.g. flash, 5; Batman, 3).]

Lucene FieldCache Entry (StringIndex) for the hero field
- order: for each doc, an index into the lookup array
    5 3 5 1 4 5 2 1
- lookup: the string values
    (null) batman flash spiderman superman wolverine
- accumulator after counting docs 0, 2, 7
    0 1 0 0 0 2

Per-segment single-valued algorithm

[Figure: each segment (Segment1 to Segment4) has its own FieldCache entry and its own accumulator, filled by its own thread (thread1 to thread4) from the base DocSet; a FieldCache + accumulator merger (priority queue) combines them into the final priority queue (e.g. flash, 5; Batman, 3).]

Per-segment faceting
- Enable with facet.method=fcs
- Controllable multi-threading
    facet.field={!threads=4}myfield
- Disadvantages
  - Larger memory use (FieldCaches + accumulators)

  - Slower (extra FieldCache merge step needed)
- Advantages
  - Rebuilds FieldCache entries only for new segments (NRT friendly)
  - Multi-threaded

Per-segment faceting performance comparison

Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

    Time for request*         facet.method=fc    facet.method=fcs
    static index              3 ms               244 ms
    quickly changing index    1388 ms            267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

    Time for request*         facet.method=fc    facet.method=fcs
    static index              26 ms              34 ms
    quickly changing index    741 ms             94 ms

*complete request time, measured externally

Faceting Performance Improvements
- For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
- Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
- Optimized deep facet paging: up to 10x faster with really large facet.offsets
- Less memory consumed by field cache entries
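The single-valued counting scheme and its per-segment variant, described earlier in this section, can be sketched in a few lines. This is an illustration rather than Solr's implementation: the first segment reuses the order and lookup arrays from the hero example, the second segment is hypothetical, and the merge is done sequentially here instead of with one thread per segment.

```python
import heapq
from collections import Counter


def count_segment(order, lookup, base_docs):
    """Count facet values for one FieldCache-style entry.

    order[doc] is an index into lookup; lookup[0] is None (doc has no value).
    """
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    return {
        lookup[i]: n
        for i, n in enumerate(accumulator)
        if n and lookup[i] is not None
    }


def merge_segments(per_segment_counts, limit):
    """Merge per-segment counts by term and keep the top `limit` terms."""
    merged = Counter()
    for counts in per_segment_counts:
        merged.update(counts)
    return heapq.nlargest(limit, merged.items(), key=lambda item: item[1])


# Segment 1: the arrays from the hero example, base docs 0, 2, 7.
LOOKUP_1 = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
ORDER_1 = [5, 3, 5, 1, 4, 5, 2, 1]
segment_1 = count_segment(ORDER_1, LOOKUP_1, [0, 2, 7])

# Segment 2: hypothetical data with its own, smaller lookup array.
LOOKUP_2 = [None, "flash", "wolverine"]
ORDER_2 = [1, 2, 1]
segment_2 = count_segment(ORDER_2, LOOKUP_2, [0, 1, 2])

top = merge_segments([segment_1, segment_2], limit=2)
print(segment_1)
print(top)
```

Because each segment has its own lookup array, ordinals cannot be compared across segments and the merge has to go through the string values. That extra step is the merge cost listed under the disadvantages.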

SCALABILITY

SolrCloud
- First steps toward simplifying cluster management
- Integrates ZooKeeper
  - Central configuration (schema.xml, solrconfig.xml, etc)
  - Tracks live nodes + shards of collections
- Removes need for external load balancers
    shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- Can specify logical shard ids
    shards=NY_shard,NJ_shard
- Clients don't need to know shards at all:
    http://localhost:8983/solr/collection1/select?distrib=true

SolrCloud: The Future
- Eliminate all single points of failure
- Remove Master/Searcher distinction
  - Enables near real-time search in a highly scalable environment
- High Availability for Writes
  - Eventual consistency model (like Amazon Dynamo, Cassandra)
- Elastic
  - Simply add/subtract servers, cluster will rebalance automatically
  - By default, Solr will handle document partitioning

ODDS & ENDS

Auto-Suggest
- Many people currently use terms component
  - Can be slow for a large corpus
- New auto-suggest builds off SpellCheck component
  - Compact memory based trie for really fast completions
  - Based on a field in the main index, or on a dictionary file

    http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

    "spellcheck":{
      "suggestions":[
        "ult",{
          "numFound":1,
          "startOffset":0,
          "endOffset":3,
          "suggestion":["ultrasharp"]},
        "collation","ultrasharp"]}}
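In the response above, `suggestions` is a flat list that alternates names and values, so a client needs to pair the entries up before reading completions. A minimal sketch, assuming the response shape shown in the slide (the helper name `completions` is hypothetical):

```python
import json

# The suggest response from the slide, wrapped in an outer object.
RESPONSE = json.loads("""
{"spellcheck": {"suggestions": [
  "ult", {"numFound": 1, "startOffset": 0, "endOffset": 3,
          "suggestion": ["ultrasharp"]},
  "collation", "ultrasharp"]}}
""")


def completions(response):
    """Map each input token to its list of suggested completions."""
    flat = response["spellcheck"]["suggestions"]
    result = {}
    # Entries alternate: name, value, name, value, ...
    for name, value in zip(flat[0::2], flat[1::2]):
        if isinstance(value, dict):  # skip entries such as "collation"
            result[name] = value["suggestion"]
    return result


suggested = completions(RESPONSE)
print(suggested)
```

The `isinstance` check separates per-token entries, whose value is an object, from the `collation` entry, whose value is a plain string.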

Index with JSON

    $ URL=http://localhost:8983/solr/update/json
    $ curl $URL -H 'Content-type:application/json' -d '
    {
      "add": {
        "doc": {
          "id" : "978-0641723445",
          "cat" : ["book","hardcover"],
          "title" : "The Lightning Thief",
          "author" : "Rick Riordan",
          "series_t" : "Percy Jackson and the Olympians",
          "sequence_i" : 1,
          "genre_s" : "fantasy",
          "inStock" : true,
          "price" : 12.50,
          "pages_i" : 384
        }
      }
    }'

Query Results in CSV

    http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

    name,price,cat,popularity
    iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
    Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
    Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

- Can handle multi-valued fields (see cat field in example)
- Completely compatible with the CSV update handler (can round-trip)
- Results are streamed: good for dumping entire parts of the index

http://localhost:8983/solr/browse

Q&A
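The CSV output from the "Query Results in CSV" slide can be read back with a standard CSV parser. The sketch below assumes multi-valued fields arrive as one quoted, comma-separated cell, as the cat column does in that example; the helper name `parse_results` is hypothetical.

```python
import csv
import io

# The CSV response from the slide.
CSV_TEXT = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""


def parse_results(text, multi_valued=("cat",)):
    """Parse CSV results, splitting multi-valued cells into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            # Assumption: values inside the quoted cell are comma-separated.
            row[field] = row[field].split(",")
        row["price"] = float(row["price"])
        row["popularity"] = int(row["popularity"])
        rows.append(row)
    return rows


rows = parse_results(CSV_TEXT)
for row in rows:
    print(row["name"], row["price"], row["cat"], row["popularity"])
```

A plain split on commas would break if a value itself contained a comma, so a real client would need to know how such values are escaped.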
