What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC

What’s new in Lucene and Solr?
Grant Ingersoll
CTO, LucidWorks
Lucene/Solr Committer

Search is good for…
• Traditional: Fast, fuzzy text matching across a large document
collection
• De-normalized data
– “light” relational
• Top N problems
– Key-value (top 1)
– Recommendations, “Good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL

What’s New?
• Community
• Lucene
• Solr

Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search
engine usages
– Object stores, Record linkage, Social, mobile -> web
• “The Apache Way”
– Meritocracy – Those who do, decide!
• Always Be Testing
– Randomized system tests are all the rage
– http://vimeo.com/32087114
• Patches Welcome!

Coming Soon: Lucene and Solr 4.8
Java 1.7

Lucene: Speed and Memory
• Native Near Real Time (NRT) support
– Per segment
– FieldCache can be controlled to only load new segments
– Soft commit -- faster without fsync, allows quicker update visibility
• DWPT (Document Writer per Thread)
– Faster more consistent index speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
– Much improved data structure
– … means less memory and less garbage collection effort

Lucene: Flexibility
• Flexible Index Formats
– New posting list codecs: Block, Simple Text, HDFS, etc.
– Pulsing codec: improves performance of primary key searches, inlining
docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
– Decoupled from TF/IDF
– Built in alternatives include BM25 & DFR, and others
• http://en.wikipedia.org/wiki/Okapi_BM25
• http://terrier.org/docs/v3.5/dfr_description.html
– Add your own

FS(A|T)
• Keys:
– byte[] – write-once
– Linear time build of min. automata
– Compression, Reverse lookups
– Weights (used for auto-suggest)
– Pluggable Algebra
• Uses:
– Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
– FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
– http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
– “Smaller Representation of Finite State Automata”
• Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807,
2011, pp. 118—192.

Grab Bag
• Lots of new suggesters
– Available in Solr
• Doc Values
– Column oriented store
– Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
– Now look a lot like Terms

Solr 4: New Features
• Search/Faceting/Relevance
– New Relevance Function Queries (tf, df, others)
– Pivot Faceting
– Pseudo-join
– Improved Spatial (more later)
– Full support for Lucene Codecs, pluggable scoring
• Indexing
– New Update Processors, including scripting option
– Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI

Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle
• Indexing:
– "geo”:”43.17614,-90.57341”
– “geo”:”Circle(4.56,1.23 d=0.0710)”
– “geo”:”POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))”
• Searching:
– fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
– fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10
30)))”

Scaling Solr
• Distributed/sharded indexing & search
– Auto distributes updates and queries to appropriate shards
– Near Real Time (NRT) indexing capable
– Document routing extensions
• Dynamically scalable
– New SolrCloud instances add indexing and query capacity
– Supports re-balancing (shard-splitting)
• Reliable
– No single point of failure
– Transactions logged
– Robust, automatic recover
• http://wiki.apache.org/solr/SolrCloud

Solr as NoSQL
• Non-traditional data stores
• Not designed for SQL type queries
• Distributed fault tolerant architecture
• Document oriented, data format agnostic (JSON, XML, CSV, binary)

APIs
• New APIs for Schema and Solr Config
– XML becoming more of an implementation detail
• Managed Schema mode
• Data-driven schema (aka schemaless)
• Synonyms, stopwords, request handlers

Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring:
http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/
Lucene and Solr, different file formats, pipelines, Logstash

Summary
• Lucene/Solr 4.x:
– Faster
– More Flexible
– Easier than ever scaling
– More reliable than ever
• Go forth and rank!

Resources
• Me
– grant@lucidworks.com
– @gsingers on Twitter
• LucidWorks
– http://www.lucidworks.com
– http://www.lucidworks.com/support-services/ask-the-experts/

What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC

Ähnlich wie What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC (20)

Mehr von Lucidworks (Archived)

Mehr von Lucidworks (Archived) (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC