3. Search is good for…
• Traditional: Fast, fuzzy text matching across a large document collection
• De-normalized data
– “light” relational
• Top N problems
– Key-value (top 1)
– Recommendations, “Good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL
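The "Top N" idea above can be sketched in a few lines: score every candidate, then keep only the N best with a bounded heap. The `score` function (a simple term-overlap count) and the sample documents are assumptions made for illustration; they are not how Lucene scores.

```python
import heapq

def score(query_terms, doc_text):
    """Toy relevance: how many query terms occur in the document."""
    doc_terms = set(doc_text.lower().split())
    return sum(1 for term in query_terms if term in doc_terms)

def top_n(query, docs, n):
    """Return the n best (score, doc_id) pairs, best first."""
    query_terms = query.lower().split()
    scored = ((score(query_terms, text), doc_id) for doc_id, text in docs.items())
    # nlargest keeps a heap of size n instead of sorting every candidate.
    return heapq.nlargest(n, (pair for pair in scored if pair[0] > 0))

# Hypothetical sample collection.
docs = {
    "d1": "fast fuzzy text matching",
    "d2": "spatial search and faceting",
    "d3": "fast spatial text search",
}
results = top_n("fast text search", docs, 2)
best = top_n("fast text search", docs, 1)  # the key-value style "top 1" case
```

With N = 1 the same routine behaves like a lookup that returns the single best match.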
5. Relax, You’re Among Friends
• Large, diverse search community with many non-traditional uses of search engines
– Object stores, Record linkage, Social, mobile -> web
• “The Apache Way”
– Meritocracy – Those who do, decide!
• Always Be Testing
– Randomized system tests are all the rage
– http://vimeo.com/32087114
• Patches Welcome!
9. Lucene: Speed and Memory
• Native Near Real Time (NRT) support
– Per segment
– FieldCache can be controlled to only load new segments
– Soft commit: faster because it skips fsync, so updates become visible sooner
• DWPT (DocumentsWriterPerThread)
– Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
– Much improved data structure
– … means less memory and less garbage collection effort
10. Lucene: Flexibility
• Flexible Index Formats
– New posting list codecs: Block, Simple Text, HDFS, etc.
– Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads into the terms dictionary, which saves disk seeks
• Pluggable Scoring
– Decoupled from TF/IDF
– Built-in alternatives include BM25, DFR, and others
• http://en.wikipedia.org/wiki/Okapi_BM25
• http://terrier.org/docs/v3.5/dfr_description.html
– Add your own
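To make the BM25 alternative concrete, here is a minimal sketch of the Okapi BM25 formula in plain Python. It assumes whitespace tokenization, a non-empty collection, and the commonly used parameters k1 = 1.2 and b = 0.75; the function name `bm25_scores` is hypothetical, and this is not Lucene's implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n_docs  # average doc length

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        total = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            total += idf * tf[term] * (k1 + 1) / norm
        scores.append(total)
    return scores

# Hypothetical sample collection.
docs = [
    "lucene search library",
    "solr search server built on lucene",
    "cooking recipes",
]
scores = bm25_scores("lucene", docs)
```

Both matching documents contain "lucene" once, so the shorter one scores higher; that length normalization is what the b parameter controls.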
11. FS(A|T)
• Keys:
– byte[] – write-once
– Linear-time construction of minimal automata
– Compression, Reverse lookups
– Weights (used for auto-suggest)
– Pluggable Algebra
• Uses:
– Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
– FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
– http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
– “Smaller Representation of Finite State Automata”
• Proc. of the 16th Int. Conf. on Implementation and Application of Automata (CIAA 2011), vol. 6807, 2011, pp. 118-192.
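The weighted-key idea behind auto-suggest can be illustrated with a plain trie. This is a deliberate simplification: a trie shares prefixes only, whereas a minimal automaton also shares suffixes, so it is not Lucene's FST. The class name `WeightedTrie` and the sample keys and weights are assumptions.

```python
class WeightedTrie:
    """Prefix tree mapping keys to weights; a stand-in for a real FST."""

    def __init__(self):
        self.root = {}

    def add(self, key, weight):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = weight  # None marks "a key ends here" and holds its weight

    def suggest(self, prefix, n=3):
        """Return up to n keys starting with prefix, heaviest first."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        # Walk the subtree under the prefix and collect completed keys.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            for ch, child in current.items():
                if ch is None:
                    found.append((child, text))
                else:
                    stack.append((child, text + ch))
        # Sort by descending weight, then alphabetically for stable ties.
        found.sort(key=lambda pair: (-pair[0], pair[1]))
        return [text for _, text in found[:n]]

trie = WeightedTrie()
for key, weight in [("solr", 10), ("solar", 3), ("solid", 7), ("lucene", 9)]:
    trie.add(key, weight)
suggestions = trie.suggest("sol", n=2)
```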
12. Grab Bag
• Lots of new suggesters
– Available in Solr
• Doc Values
– Column oriented store
– Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
– Now look a lot like Terms
14. Solr 4: New Features
• Search/Faceting/Relevance
– New Relevance Function Queries (tf, df, others)
– Pivot Faceting
– Pseudo-join
– Improved Spatial (more later)
– Full support for Lucene Codecs, pluggable scoring
• Indexing
– New Update Processors, including scripting option
– Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI
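Cursor-based deep paging can be sketched as a loop over the `cursorMark` request parameter and the `nextCursorMark` response field: start at `*`, sort on a field that includes the unique key, and stop when the cursor stops changing. In this sketch `fetch` is a hypothetical callable that sends the parameters to Solr and returns the parsed JSON response; `make_fake_fetch` is an in-memory stand-in so the loop runs without a server.

```python
def iter_all_docs(fetch, query="*:*", rows=2, sort="id asc"):
    """Yield every matching document by following Solr-style cursors."""
    cursor = "*"  # "*" asks for the first page
    while True:
        params = {"q": query, "rows": rows, "sort": sort, "cursorMark": cursor}
        response = fetch(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor: nothing left to read
            return
        cursor = next_cursor

def make_fake_fetch(ids):
    """In-memory stand-in for a Solr request (assumption, not a real client)."""
    ids = sorted(ids)

    def fetch(params):
        mark = params["cursorMark"]
        remaining = ids if mark == "*" else [i for i in ids if i > mark]
        page = remaining[: params["rows"]]
        # The fake cursor is simply the last id returned so far.
        next_mark = page[-1] if page else mark
        return {
            "response": {"docs": [{"id": i} for i in page]},
            "nextCursorMark": next_mark,
        }

    return fetch

all_ids = [doc["id"] for doc in iter_all_docs(make_fake_fetch(["c", "a", "e", "b", "d"]))]
```

Unlike paging with a growing `start` offset, each request only needs to resume from the last sort position, which is why cursors suit deep result sets.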
15. Geospatial improvements
• Index shapes other than points (circles, polygons, etc.)
• More complex interactions than point in a circle
• Indexing:
– "geo":"43.17614,-90.57341"
– "geo":"Circle(4.56,1.23 d=0.0710)"
– "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
– fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
– fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
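The filter queries above can be assembled programmatically. The sketch below only builds the strings shown on this slide; the helper names are hypothetical, and the bounding-box argument order (min x, min y, max x, max y) is an assumption read off the slide's example rather than a statement about every Solr version.

```python
from urllib.parse import urlencode

def intersects_bbox(field, min_x, min_y, max_x, max_y):
    """Build an Intersects filter for a rectangle (order assumed from the slide)."""
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'

def intersects_polygon(field, points):
    """Build an Intersects filter for a polygon given (x, y) pairs."""
    ring = list(points)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # polygon rings must end where they start
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f'{field}:"Intersects(POLYGON(({coords})))"'

bbox_fq = intersects_bbox("geo", -74.093, 41.042, -69.347, 44.558)
poly_fq = intersects_polygon(
    "geo", [(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0)]
)
# URL-encode the filter so quotes and parentheses survive the request.
query_string = urlencode({"q": "*:*", "fq": bbox_fq})
```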
16. Scaling Solr
• Distributed/sharded indexing & search
– Automatically distributes updates and queries to the appropriate shards
– Near Real Time (NRT) indexing capable
– Document routing extensions
• Dynamically scalable
– New SolrCloud instances add indexing and query capacity
– Supports re-balancing (shard-splitting)
• Reliable
– No single point of failure
– Transactions logged
– Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
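The routing idea can be sketched as hashing each document id into a 32-bit space that is divided into one contiguous range per shard. This is an illustration under stated assumptions: `zlib.crc32` stands in for the hash function SolrCloud actually uses, and the shard names and helper functions are hypothetical.

```python
import zlib

HASH_SPACE = 2**32

def make_shards(count):
    """Divide the 32-bit hash space into `count` contiguous ranges."""
    size = HASH_SPACE // count
    shards = []
    for i in range(count):
        low = i * size
        # The last shard absorbs any remainder so the ranges cover everything.
        high = HASH_SPACE - 1 if i == count - 1 else (i + 1) * size - 1
        shards.append((f"shard{i + 1}", low, high))
    return shards

def route(doc_id, shards):
    """Return the name of the shard whose range contains the id's hash."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, 0 <= h < 2**32
    for name, low, high in shards:
        if low <= h <= high:
            return name
    raise ValueError("hash outside all shard ranges")

shards = make_shards(4)
target = route("doc-1", shards)
```

Because each shard owns a range rather than a modulus, splitting a shard only means dividing one range in two; documents in the other shards stay where they are.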
17. Solr as NoSQL
• Non-traditional data stores
• Not designed for SQL type queries
• Distributed fault tolerant architecture
• Document oriented, data format agnostic (JSON, XML, CSV, binary)
19. APIs
• New APIs for Schema and Solr Config
– XML becoming more of an implementation detail
• Managed Schema mode
• Data-driven schema (aka schemaless)
• Synonyms, stopwords, request handlers
20. Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support with Lucene and Solr, different file formats, pipelines, Logstash
21. Summary
• Lucene/Solr 4.x:
– Faster
– More Flexible
– Easier to scale than ever
– More reliable than ever
• Go forth and rank!
22. Resources
• Me
– grant@lucidworks.com
– @gsingers on Twitter
• LucidWorks
– http://www.lucidworks.com
– http://www.lucidworks.com/support-services/ask-the-experts/