1. What's New in Solr 3.x/4.0
Charlottesville Lucene/Solr Meetup
August 15, 2011
Erik Hatcher
Lucid Imagination
2. What is Solr?
⢠Solr is the popular, blazing fast open source
enterprise search platform from the Apache Lucene
project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g.,
Word, PDF) handling, and geospatial search. Solr is
highly scalable, providing distributed search and
index replication, and it powers the search and
navigation features of many of the world's largest
internet sites.
3. What is Lucene?
⢠Apache Lucene is a high-performance, full-
featured text search engine library written
entirely in Java. It is a technology suitable
for nearly any application that requires full-
text search, especially cross-platform.
4. Solr History
⢠November 2009: Solr 1.4 (Lucene 2.9.1)
⢠June 2010: Solr 1.4.1 (Lucene 2.9.3)
⢠2011
⢠March - Solr 3.1
⢠May - Solr 3.2
⢠July - Solr 3.3
5. Solr 3.1
⢠Improved geospatial support ⢠New autosuggest component
⢠Sorting by function queries ⢠Distributed support for more
components
⢠Range faceting on all numeric ďŹelds
⢠JSON document indexing and CSV
response format
⢠Example Velocity driven search UI at
http://localhost:8983/solr/browse
⢠Apache UIMA integration for metadata
extraction
⢠A new termvector-based highlighter
⢠Improved spellchecking capabilities
⢠Many other BugďŹxes, improvements and
optimizations
⢠Improved integration with Apache
Lucene
6. Major components
⢠Apache Lucene 3.1.0
⢠Apache Tika 0.8
⢠Carrot2 3.4.2
⢠Velocity 1.6.1 and Velocity Tools 2.0-beta3
⢠Apache UIMA 2.3.1-SNAPSHOT
7. Schema / Config
• SOLR-1131: FieldTypes can now output multiple Fields per Type and still be searched. This can be handy for hiding the details of a particular implementation such as in the spatial case.
• SOLR-1379: Add RAMDirectoryFactory for non-persistent in memory index storage.
• SOLR-2059: Add "types" attribute to WordDelimiterFilterFactory, which allows you to customize how WordDelimiterFilter tokenizes text with a configuration file.
9. Geospatial
⢠SOLR-1302: Added several new distance based functions,
including Great Circle (haversine), Manhattan, Euclidean
and String (using the StringDistance methods in the Lucene
spellchecker). Also added geohash(), deg() and rad()
convenience functions. See http://wiki.apache.org/solr/
FunctionQuery
⢠SOLR-1568: Added "native" ďŹltering support for PointType,
GeohashField. Added LatLonType with ďŹltering support
too. See http://wiki.apache.org/solr/SpatialSearch and the
example. Refactored some items in Lucene spatial.
Removed SpatialTileField as the underlying CartesianTier is
broken beyond repair and is going to be moved.
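As a rough illustration of the distance measures named in SOLR-1302, the sketch below computes great-circle (haversine), Manhattan and Euclidean distances in plain Python. It is not Solr's implementation; the function names and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan(p, q):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

Haversine treats the inputs as latitude/longitude on a sphere, while the other two treat them as plain vector components, which is why the slide lists them as separate functions.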
10. Query Parsing
⢠SOLR-1553: New dismax parser implementation (accessible as "edismax") that supports full
lucene syntax, improved reserved char escaping, ďŹelded queries, improved proximity
boosting, and improved stopword handling. Note: status is experimental for now.
⢠SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
autoGeneratePhraseQueries="true" (the default) causes the query parser to generate
phrase queries if multiple tokens are generated from a single non-quoted analysis string.
For example WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate
text:"pdp 11" rather than (text:PDP OR text:11). Note that
autoGeneratePhraseQueries="true" tends to not work well for non whitespace delimited
languages.
⢠SOLR-2128: Full parameter substitution for function queries. Example: q=add($v1,$v2)
&v1=mul(popularity,5)&v2=20.0
⢠SOLR-2133: Function query parser can now parse multiple comma separated value sources.
It also now fails if there is extra unexpected text after parsing the functions, instead of
silently ignoring it. This allows expressions like q=dist(2,vector(1,2),$pt)&pt=3,4
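The SOLR-2128 example can be sketched from the client side as follows. The first part URL-encodes the request parameters from the slide; the second part is a toy `expand` helper (hypothetical, not part of Solr) that shows the effect of `$name` substitution on the expression. Solr resolves the references itself during parsing, so this only illustrates the idea.

```python
import re
from urllib.parse import parse_qs, urlencode

# Request parameters taken from the slide's example.
params = {
    "q": "add($v1,$v2)",
    "v1": "mul(popularity,5)",
    "v2": "20.0",
}

# Characters such as '$', '(' and ',' must be percent-encoded in a URL.
query_string = urlencode(params)
decoded = parse_qs(query_string)


def expand(expression, variables):
    """Toy model of $name substitution: replace each reference by its parameter value."""
    return re.sub(r"\$(\w+)", lambda m: variables[m.group(1)], expression)


expanded = expand(params["q"], params)
```

Keeping the sub-expressions in separate parameters lets a client change `v1` or `v2` without rewriting the main function.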
11. Functions
⢠SOLR-1574: Add many new functions from
java Math (e.g. sin, cos)
⢠SOLR-1569: Allow functions to take in
literal strings by modifying the
FunctionQParser and adding
LiteralValueSource
⢠SOLR-1297: Add sort by Function capability
12. Analysis
⢠SOLR-1923: PhoneticFilterFactory now has support for the Caverphone
algorithm.
⢠SOLR-1571: Added unicode collation support though Lucene's
CollationKeyFilter
⢠SOLR-1653: Add PatternReplaceCharFilter
⢠SOLR-1677: Add support for choosing the Lucene Version for Lucene
components within Solr.
⢠SOLR-1984: Add HyphenationCompoundWordTokenFilterFactory.
⢠SOLR-2188: Added "maxTokenLength" argument to the factories for
ClassicTokenizer, StandardTokenizer, and UAX29URLEmailTokenizer.
⢠ICU integration
13. Analysis (cont.)
⢠SOLR-1857: Synced Solr analysis with ⢠SOLR-1740: ShingleFilterFactory supports
Lucene 3.1. Added the "minShingleSize" and "tokenSeparator"
KeywordMarkerFilterFactory and parameters for controlling the minimum
StemmerOverrideFilterFactory, which can shingle size produced by the ďŹlter, and the
be used to tune stemming algorithms. separator string that it uses, respectively.
⢠Added factories for Bulgarian, Czech, Hindi, ⢠SOLR-744: ShingleFilterFactory supports
Turkish, and Wikipedia analysis. Improved the "outputUnigramsIfNoShingles"
the performance of parameter, to output unigrams if the
SnowballPorterFilterFactory. number of input tokens is fewer than
minShingleSize, and no shingles can be
generated.
⢠SOLR-1657: Converted remaining
TokenStreams to the Attributes-based API.
All Solr TokenFilters now support custom ⢠SOLR-1974: Add
Attributes, and some have improved LimitTokenCountFilterFactory.
performance: especially
WordDelimiterFilter and
CommonGramsFilter. ⢠SOLR-1057: Add
PathHierarchyTokenizerFactory.
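To make the shingle parameters concrete, here is a simplified Python model of shingling. It is only a sketch of the behaviour described in SOLR-1740 and SOLR-744, not the Lucene filter: the keyword names mirror the slide's parameters, while `max_size` and `output_unigrams` are assumed companions, and the order in which tokens are emitted is a simplification.

```python
def shingles(tokens, min_size=2, max_size=2, separator=" ",
             output_unigrams=True, output_unigrams_if_no_shingles=False):
    """Return word n-grams (shingles) of min_size..max_size tokens."""
    out = []
    produced_shingle = False
    for i, token in enumerate(tokens):
        if output_unigrams:
            out.append(token)
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
                produced_shingle = True
    # Fallback modelled on outputUnigramsIfNoShingles: too few tokens to
    # build any shingle, so emit the single tokens instead of nothing.
    if not produced_shingle and not output_unigrams and output_unigrams_if_no_shingles:
        out = list(tokens)
    return out


bigrams = shingles(["please", "divide", "this"])
```

With fewer input tokens than `min_size`, the fallback flag is what keeps a short field from producing no terms at all.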
14. Faceting
⢠SOLR-1240: "Range Faceting" has been added. This is a generalization
of the existing "Date Faceting" logic so that it now supports any all
stock numeric ďŹeld types that support range queries in addition to
dates. facet.date is now deprecated in favor of this generalized
mechanism.
⢠SOLR-397: Date Faceting now supports a "facet.date.include" param
for specifying when the upper & lower end points of computed date
ranges should be included in the range. Legal values are: "all", "lower",
"upper", "edge", and "outer". For backwards compatibility the default
value is the set: [lower,upper,edge], so that all ranges between start
and end are inclusive of their endpoints, but the "before" and "after"
ranges are not.
⢠SOLR-2325: Allow tagging and exclusion of main query for faceting.
15. SolrJ
⢠SOLR-1139: Add TermsComponent Query
and Response Support in SolrJ
⢠SOLR-1815: SolrJ now preserves the order
of facet queries.
16. Solr Components
⢠SOLR-1316: Create autosuggest component
⢠SOLR-2010: Added ability to verify that spell checking collations have
actual results in the index.
⢠SOLR-2157: Suggester should return alpha-sorted results when
onlyMorePopular=false
⢠SOLR-1625: Add regexp support for TermsComponent
⢠SOLR-1556: TermVectorComponent now supports per ďŹeld overrides.
Also, it now throws an error if passed in ďŹelds do not exist and warnings
if ďŹelds that do not have term vector options (termVectors, offsets,
positions) that align with the schema declaration.
⢠SOLR-860: Add debug output for MoreLikeThis.
17. Highlighting
⢠SOLR-1268: Incorporate FastVectorHighlighter
⢠SOLR-2021: Add SolrEncoder plugin to Highlighter.
⢠SOLR-2030: Make FastVectorHighlighter use of
SolrEncoder.
⢠SOLR-2053: Add support for custom comparators
in Solr spellchecker, per LUCENE-2479
⢠SOLR-2049: Add hl.multiValuedSeparatorChar for
FastVectorHighlighter, per LUCENE-2603.
19. Misc.
⢠SOLR-1957: The VelocityResponseWriter contrib moved to core. Example search UI now
available at http://localhost:8983/solr/browse
⢠SOLR-1966: QueryElevationComponent can now return just the included results in the
elevation ďŹle
⢠SOLR-1925: Add CSVResponseWriter (use wt=csv) that returns the list of documents in
CSV format.
⢠SOLR-2263: Add ability for RawResponseWriter to stream binary ďŹles as well as text ďŹles.
⢠SOLR-1750: SolrInfoMBeanHandler added for simpler programmatic access to info
currently available from registry.jsp and stats.jsp
⢠SOLR-2099: Add ability to throttle rsync based replication using rsync option --bwlimit.
21. Optimizations
⢠SOLR-1679: Don't build up string messages in SolrCore.execute unless they
are necessary for the current log level.
⢠SOLR-1874: Optimize PatternReplaceFilter for better performance.
⢠SOLR-1968: speed up initial ďŹlter cache population for facet.method=enum
and also big terms for multi-valued facet.method=fc. The resulting speedup
for the ďŹrst facet request is anywhere from 30% to 32x, depending on how
many terms are in the ďŹeld and how many documents match per term.
⢠SOLR-2089: Speed up UnInvertedField faceting (facet.method=fc for multi-
valued ďŹelds) when facet.limit is both high, and a high enough percentage of
the number of unique terms in the ďŹeld. Extreme cases yield speedups over
3x.
⢠SOLR-2046: add common functions to scripts-util.
22. Solr 3.2
⢠Ability to specify overwrite and commitWithin as request
parameters when using the JSON update format
⢠TermQParserPlugin, useful when generating ďŹlter queries from
terms returned from ďŹeld faceting or the terms component.
⢠DebugComponent now supports using a NamedList to model
Explanation objects in it's responses instead of
Explanation.toString
⢠Improvements to the UIMA and Carrot2 integrations
⢠BugďŹxes and improvements from Apache Lucene 3.2
23. Other 3.2 goodies
⢠SOLR-2061: Pull base tests out into a new
Solr Test Framework module, and publish
binary, javadoc, and source test-framework
jars.
⢠Dependency update: Carrot2 3.5.0
24. Solr 3.3
⢠Grouping / Field Collapsing
⢠A new, automaton-based suggest/autocomplete implementation offering
an order of magnitude smaller RAM consumption.
⢠KStemFilterFactory, an optimized implementation of a less aggressive
stemmer for English.
⢠Solr defaults to a new, more efďŹcient merge policy (TieredMergePolicy).
See http://s.apache.org/merging for more information.
⢠Important bugďŹxes, including extremely high RAM usage in spellchecking.
⢠BugďŹxes and improvements from Apache Lucene 3.3
25. Solr 3.3 details
⢠SOLR-2378: A new, automaton-based, implementation of suggest (autocomplete)
component, offering an order of magnitude smaller memory consumption
compared to ternary trees and jaspell and very fast lookups at runtime.
⢠SOLR-2400: Field- and DocumentAnalysisRequestHandler now provide a position
history for each token, so you can follow the token through all analysis stages. The
output contains a separate int[] attribute containing all positions from previous
Tokenizers/TokenFilters (called "positionHistory").
⢠SOLR-2524: (SOLR-236, SOLR-237, SOLR-1773, SOLR-1311) Grouping / Field
collapsing using the Lucene grouping contrib. The search result can be grouped by
ďŹeld and query.
⢠SOLR-1331: Added a srcCore parameter to CoreAdminHandler's mergeindexes
action to merge one or more cores' indexes to a target core.
⢠SOLR-2610 -- Add an option to delete index through CoreAdmin UNLOAD action
26. Solr 4.0
⢠aka "trunk" at the moment
⢠major changes! (for the better!) at both
Lucene and Solr levels
27. Lucene 4.0
⢠The postings APIs have been removed in favor of the
new ďŹexible indexing (ďŹex) APIs.
⢠With ďŹexible indexing it is now possible for an
application to create its own postings codec, to alter
how ďŹelds, terms, docs and positions are encoded into
the index.
⢠String -> BytesRef
⢠Per-segment everything
28. 4.0 details
⢠Directory.copy/Directory.copyTo now copies all ďŹles (not just
index ďŹles), since what is and isn't and index ďŹle is now
dependent on the codecs used.
⢠String to BytesRef
⢠FuzzyQuery and WildcardQuery now operate on Unicode
codepoints, not unicode code units.
⢠WildcardQuery and QueryParser now allows escaping with
the '' character.
⢠Similarity can now be conďŹgured on a per-ďŹeld basis
31. More Lucene 4.0 features
• Added RegexpQuery support to QueryParser.
• Adds AutomatonQuery, a MultiTermQuery that matches terms against a finite-state machine. Implement WildcardQuery and FuzzyQuery with finite-state methods. Adds RegexpQuery.
• The QueryParser now accepts mixed inclusive and exclusive bounds for range queries. Example: "{3 TO 5]"
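The mixed-bound syntax can be read as: a curly brace excludes its end point and a square bracket includes it. The sketch below checks numeric membership for such an expression. It is a toy parser for illustration only (numeric bounds, no open-ended `*`), not the Lucene QueryParser, and the function names are hypothetical.

```python
import re

# Opening bracket, lower bound, "TO", upper bound, closing bracket.
RANGE_RE = re.compile(r"^([\[{])\s*(\S+)\s+TO\s+(\S+?)\s*([\]}])$")


def parse_range(expr):
    """Return (low, high, low_inclusive, high_inclusive) for a range expression."""
    match = RANGE_RE.match(expr.strip())
    if not match:
        raise ValueError("not a range expression: %r" % expr)
    open_bracket, low, high, close_bracket = match.groups()
    return float(low), float(high), open_bracket == "[", close_bracket == "]"


def in_range(value, expr):
    """True if value falls inside the range, honouring each bound's inclusivity."""
    low, high, incl_low, incl_high = parse_range(expr)
    above = value >= low if incl_low else value > low
    below = value <= high if incl_high else value < high
    return above and below
```

For the slide's example `{3 TO 5]`, the value 3 is excluded and 5 is included.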
32. Solr 4.0
⢠Pivot faceting
⢠Direct Solr spell checker
⢠Increased response writing ďŹexibility (e.g. function query results)
⢠Distributed date/numeric range faceting
⢠"join" query parser
⢠NRT:You may now specify a 'soft' commit when committing. This
will use Lucene's NRT feature to avoid guaranteeing documents
are on stable storage in exchange for faster reopen times. There
is also a new 'soft' autocommit tracker that can be conďŹgured.
33. About Lucid...
⢠Lucid Imagination provides commercial-grade
support, training, high-level consulting and value-
added software for Lucene and Solr.
⢠We make Lucene âenterprise-readyâ by offering:
⢠Free, certiďŹed, distributions and downloads.
⢠Support, training, and consulting.
⢠LucidWorks Enterprise, a commercial search
platform built on top of Solr.
⢠http://www.lucidimagination.com