What's New in Solr 3.x / 4.0

Charlottesville Lucene/Solr Meetup
August 15, 2011

Erik Hatcher
Lucid Imagination

What is Solr?
• Solr is the popular, blazing fast open source
enterprise search platform from the Apache Lucene
project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g.,
Word, PDF) handling, and geospatial search. Solr is
highly scalable, providing distributed search and
index replication, and it powers the search and
navigation features of many of the world's largest
internet sites.

What is Lucene?

• Apache Lucene is a high-performance, full-
featured text search engine library written
entirely in Java. It is a technology suitable
for nearly any application that requires full-
text search, especially cross-platform.

Solr History
• November 2009: Solr 1.4 (Lucene 2.9.1)
• June 2010: Solr 1.4.1 (Lucene 2.9.3)
• 2011
• March - Solr 3.1
• May - Solr 3.2
• July - Solr 3.3

Solr 3.1
• Improved geospatial support
• New autosuggest component
• Sorting by function queries
• Distributed support for more components
• Range faceting on all numeric fields
• JSON document indexing and CSV response format
• Example Velocity driven search UI at
http://localhost:8983/solr/browse
• Apache UIMA integration for metadata extraction
• A new termvector-based highlighter
• Improved spellchecking capabilities
• Improved integration with Apache Lucene
• Many other bugfixes, improvements and optimizations

Major components

• Apache Lucene 3.1.0
• Apache Tika 0.8
• Carrot2 3.4.2
• Velocity 1.6.1 and Velocity Tools 2.0-beta3
• Apache UIMA 2.3.1-SNAPSHOT

Schema / Config
• SOLR-1131: FieldTypes can now output multiple
Fields per Type and still be searched. This can be
handy for hiding the details of a particular
implementation such as in the spatial case.

• SOLR-1379: Add RAMDirectoryFactory for non-
persistent in memory index storage.

• SOLR-2059: Add "types" attribute to
WordDelimiterFilterFactory, which allows you to
customize how WordDelimiterFilter tokenizes text
with a configuration file.

Indexing

• SOLR-945: JSON update handler that
accepts add, delete, commit commands in
JSON format.
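
As a rough illustration of the JSON update format, the sketch below builds a request body containing one add, one delete and one commit command. The document fields and ids are hypothetical, and the handler path (assumed here to be /solr/update/json, as in the example configuration) depends on how the handler is registered in solrconfig.xml.

```python
import json

# Hypothetical document and ids, for illustration only.
commands = {
    "add": {"doc": {"id": "doc1", "title": "What's New in Solr"}},
    "delete": {"id": "doc0"},
    "commit": {},
}

# This string would be POSTed with Content-Type: application/json
# to the JSON update handler (path assumed: /solr/update/json).
body = json.dumps(commands)
print(body)
```

A Python dict cannot hold two "add" keys, so this sketch sends a single document per request.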

Geospatial
• SOLR-1302: Added several new distance based functions,
including Great Circle (haversine), Manhattan, Euclidean
and String (using the StringDistance methods in the Lucene
spellchecker). Also added geohash(), deg() and rad()
convenience functions. See
http://wiki.apache.org/solr/FunctionQuery
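
For reference, the three numeric distance measures named above can be sketched in a few lines of Python. This is not Solr's implementation; the function names and the mean Earth radius of 6371 km are assumptions made for the example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan([0, 0], [3, 4]), euclidean([0, 0], [3, 4]))
```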

• SOLR-1568: Added "native" filtering support for PointType,
GeohashField. Added LatLonType with filtering support
too. See http://wiki.apache.org/solr/SpatialSearch and the
example. Refactored some items in Lucene spatial.
Removed SpatialTileField as the underlying CartesianTier is
broken beyond repair and is going to be moved.
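
A spatial filter request might look roughly like the following, using the geofilt parser with the sfield, pt and d parameters described on the SpatialSearch wiki page. The field name "store" and the coordinates are placeholders; the field is assumed to be a LatLonType field, and d is a distance in kilometers.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder field name and coordinates, for illustration only.
params = {
    "q": "*:*",
    "fq": "{!geofilt}",     # filter to documents within d km of pt
    "sfield": "store",      # spatial field to filter on
    "pt": "45.15,-93.85",   # center point as "lat,lon"
    "d": "5",               # distance in kilometers
}

query_string = urlencode(params)
print(query_string)
```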

Query Parsing
• SOLR-1553: New dismax parser implementation (accessible as "edismax") that supports full
lucene syntax, improved reserved char escaping, fielded queries, improved proximity
boosting, and improved stopword handling. Note: status is experimental for now.

• SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
autoGeneratePhraseQueries="true" (the default) causes the query parser to generate
phrase queries if multiple tokens are generated from a single non-quoted analysis string.
For example WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate
text:"pdp 11" rather than (text:PDP OR text:11). Note that
autoGeneratePhraseQueries="true" tends to not work well for non whitespace delimited
languages.

• SOLR-2128: Full parameter substitution for function queries. Example: q=add($v1,$v2)
&v1=mul(popularity,5)&v2=20.0

• SOLR-2133: Function query parser can now parse multiple comma separated value sources.
It also now fails if there is extra unexpected text after parsing the functions, instead of
silently ignoring it. This allows expressions like q=dist(2,vector(1,2),$pt)&pt=3,4
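
To see what the two function query examples above amount to, the sketch below expands $name references textually. This only illustrates the effect of parameter substitution; Solr resolves references while parsing rather than by string replacement, and the helper name dereference is made up for this example.

```python
import re

def dereference(expr, params, max_depth=10):
    """Expand $name references in expr using values from params."""
    pattern = re.compile(r"\$(\w+)")
    for _ in range(max_depth):
        expanded = pattern.sub(lambda m: params[m.group(1)], expr)
        if expanded == expr:
            return expanded
        expr = expanded
    raise ValueError("parameter references nested too deeply")

# SOLR-2128 example from the slide.
p1 = {"q": "add($v1,$v2)", "v1": "mul(popularity,5)", "v2": "20.0"}
print(dereference(p1["q"], p1))   # add(mul(popularity,5),20.0)

# SOLR-2133 example: $pt expands to two comma separated values.
p2 = {"q": "dist(2,vector(1,2),$pt)", "pt": "3,4"}
print(dereference(p2["q"], p2))   # dist(2,vector(1,2),3,4)
```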

Functions
• SOLR-1574: Add many new functions from
java Math (e.g. sin, cos)
• SOLR-1569: Allow functions to take in
literal strings by modifying the
FunctionQParser and adding
LiteralValueSource
• SOLR-1297: Add sort by Function capability

Analysis
• SOLR-1923: PhoneticFilterFactory now has support for the Caverphone
algorithm.

• SOLR-1571: Added Unicode collation support through Lucene's
CollationKeyFilter

• SOLR-1653: Add PatternReplaceCharFilter

• SOLR-1677: Add support for choosing the Lucene Version for Lucene
components within Solr.

• SOLR-1984: Add HyphenationCompoundWordTokenFilterFactory.

• SOLR-2188: Added "maxTokenLength" argument to the factories for
ClassicTokenizer, StandardTokenizer, and UAX29URLEmailTokenizer.

• ICU integration

Analysis (cont.)
• SOLR-1857: Synced Solr analysis with Lucene 3.1. Added
KeywordMarkerFilterFactory and StemmerOverrideFilterFactory, which can
be used to tune stemming algorithms.

• Added factories for Bulgarian, Czech, Hindi, Turkish, and Wikipedia
analysis. Improved the performance of SnowballPorterFilterFactory.

• SOLR-1657: Converted remaining TokenStreams to the Attributes-based API.
All Solr TokenFilters now support custom Attributes, and some have improved
performance: especially WordDelimiterFilter and CommonGramsFilter.

• SOLR-1740: ShingleFilterFactory supports the "minShingleSize" and
"tokenSeparator" parameters for controlling the minimum shingle size
produced by the filter, and the separator string that it uses, respectively.

• SOLR-744: ShingleFilterFactory supports the "outputUnigramsIfNoShingles"
parameter, to output unigrams if the number of input tokens is fewer than
minShingleSize, and no shingles can be generated.

• SOLR-1974: Add LimitTokenCountFilterFactory.

• SOLR-1057: Add PathHierarchyTokenizerFactory.
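
The shingle parameters are easier to picture with a toy model. The function below is a simplified sketch, not the Lucene ShingleFilter: it ignores position increments and the regular unigram output, and its name and argument names are invented for the example. It shows only how a minimum size, a separator and the "unigrams if no shingles" fallback interact.

```python
def shingles(tokens, min_size=2, max_size=2, separator=" ",
             unigrams_if_none=False):
    """Return word n-grams of min_size..max_size tokens."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(separator.join(tokens[start:start + size]))
    # Fallback: too few tokens to build any shingle.
    if not out and unigrams_if_none:
        return list(tokens)
    return out

print(shingles(["please", "divide", "this", "sentence"], 2, 3, "_"))
print(shingles(["solr"], unigrams_if_none=True))
```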

Faceting
• SOLR-1240: "Range Faceting" has been added. This is a generalization
of the existing "Date Faceting" logic so that it now supports all
stock numeric field types that support range queries, in addition to
dates. facet.date is now deprecated in favor of this generalized
mechanism.

• SOLR-397: Date Faceting now supports a "facet.date.include" param
for specifying when the upper & lower end points of computed date
ranges should be included in the range. Legal values are: "all", "lower",
"upper", "edge", and "outer". For backwards compatibility the default
value is the set: [lower,upper,edge], so that all ranges between start
and end are inclusive of their endpoints, but the "before" and "after"
ranges are not.

• SOLR-2325: Allow tagging and exclusion of main query for faceting.
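
A range facet request can be sketched as follows. The facet.range parameters are the generalized counterparts of the facet.date ones; the field name "price" and the numeric values are placeholders. The helper range_buckets is an illustration of how start, end and gap define the buckets, assuming the gap divides the range evenly.

```python
from urllib.parse import urlencode, parse_qs

def range_buckets(start, end, gap):
    """List the (lower, upper) bounds implied by start, end and gap."""
    buckets = []
    lower = start
    while lower < end:
        buckets.append((lower, lower + gap))
        lower += gap
    return buckets

# Placeholder field name and values, for illustration only.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.range", "price"),
    ("facet.range.start", "0"),
    ("facet.range.end", "100"),
    ("facet.range.gap", "25"),
    ("facet.range.include", "lower"),  # each bucket includes its lower bound
]

query_string = urlencode(params)
print(query_string)
print(range_buckets(0, 100, 25))
```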

SolrJ

• SOLR-1139: Add TermsComponent Query
and Response Support in SolrJ
• SOLR-1815: SolrJ now preserves the order
of facet queries.

Solr Components
• SOLR-1316: Create autosuggest component

• SOLR-2010: Added ability to verify that spell checking collations have
actual results in the index.

• SOLR-2157: Suggester should return alpha-sorted results when
onlyMorePopular=false

• SOLR-1625: Add regexp support for TermsComponent

• SOLR-1556: TermVectorComponent now supports per-field overrides.
It also throws an error if a requested field does not exist, and emits
warnings for fields whose requested term vector options (termVectors,
offsets, positions) do not align with the schema declaration.

• SOLR-860: Add debug output for MoreLikeThis.

Highlighting
• SOLR-1268: Incorporate FastVectorHighlighter

• SOLR-2021: Add SolrEncoder plugin to Highlighter.

• SOLR-2030: Make FastVectorHighlighter use
SolrEncoder.

• SOLR-2053: Add support for custom comparators
in Solr spellchecker, per LUCENE-2479

• SOLR-2049: Add hl.multiValuedSeparatorChar for
FastVectorHighlighter, per LUCENE-2603.

Distributed

• SOLR-785: Distributed Search support for
SpellCheckComponent
• SOLR-1177: Distributed Search support for
TermsComponent

Misc.

• SOLR-1957: The VelocityResponseWriter contrib moved to core. Example search UI now
available at http://localhost:8983/solr/browse

• SOLR-1966: QueryElevationComponent can now return just the included results in the
elevation file

• SOLR-1925: Add CSVResponseWriter (use wt=csv) that returns the list of documents in
CSV format.

• SOLR-2263: Add ability for RawResponseWriter to stream binary files as well as text files.

• SOLR-1750: SolrInfoMBeanHandler added for simpler programmatic access to info
currently available from registry.jsp and stats.jsp

• SOLR-2099: Add ability to throttle rsync based replication using rsync option --bwlimit.

UIMA
• UIMA - Unstructured Information Management
Architecture - http://uima.apache.org/

• Enables UIMA components to augment
documents

• Entity extraction, automated categorization,
language detection, etc

• "contrib" plugin - SOLR-2129

• http://wiki.apache.org/solr/SolrUIMA

Optimizations
• SOLR-1679: Don't build up string messages in SolrCore.execute unless they
are necessary for the current log level.

• SOLR-1874: Optimize PatternReplaceFilter for better performance.

• SOLR-1968: speed up initial filter cache population for facet.method=enum
and also big terms for multi-valued facet.method=fc. The resulting speedup
for the first facet request is anywhere from 30% to 32x, depending on how
many terms are in the field and how many documents match per term.

• SOLR-2089: Speed up UnInvertedField faceting (facet.method=fc for multi-
valued fields) when facet.limit is both high, and a high enough percentage of
the number of unique terms in the field. Extreme cases yield speedups over
3x.

• SOLR-2046: add common functions to scripts-util.

Solr 3.2
• Ability to specify overwrite and commitWithin as request
parameters when using the JSON update format

• TermQParserPlugin, useful when generating filter queries from
terms returned from field faceting or the terms component.

• DebugComponent now supports using a NamedList to model
Explanation objects in its responses instead of
Explanation.toString

• Improvements to the UIMA and Carrot2 integrations

• Bugfixes and improvements from Apache Lucene 3.2
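
As a sketch of why the term query parser helps with facet drill-down: a raw term value coming back from faceting can be dropped into a filter query without escaping query syntax characters. The helper name and the field and value below are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

def term_filter(field, value):
    """Build an fq that matches the raw indexed term in field."""
    return "{!term f=%s}%s" % (field, value)

# Hypothetical facet value containing characters that would
# otherwise need escaping in the standard query syntax.
fq = term_filter("category", "AT&T (wireless)")
query_string = urlencode({"q": "*:*", "fq": fq})
print(fq)
print(query_string)
```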

Other 3.2 goodies

• SOLR-2061: Pull base tests out into a new
Solr Test Framework module, and publish
binary, javadoc, and source test-framework
jars.
• Dependency update: Carrot2 3.5.0

Solr 3.3
• Grouping / Field Collapsing

• A new, automaton-based suggest/autocomplete implementation offering
an order of magnitude smaller RAM consumption.

• KStemFilterFactory, an optimized implementation of a less aggressive
stemmer for English.

• Solr defaults to a new, more efficient merge policy (TieredMergePolicy).
See http://s.apache.org/merging for more information.

• Important bugfixes, including extremely high RAM usage in spellchecking.

• Bugfixes and improvements from Apache Lucene 3.3

Solr 3.3 details
• SOLR-2378: A new, automaton-based, implementation of suggest (autocomplete)
component, offering an order of magnitude smaller memory consumption
compared to ternary trees and jaspell and very fast lookups at runtime.

• SOLR-2400: Field- and DocumentAnalysisRequestHandler now provide a position
history for each token, so you can follow the token through all analysis stages. The
output contains a separate int[] attribute containing all positions from previous
Tokenizers/TokenFilters (called "positionHistory").

• SOLR-2524: (SOLR-236, SOLR-237, SOLR-1773, SOLR-1311) Grouping / Field
collapsing using the Lucene grouping contrib. The search result can be grouped by
field and query.

• SOLR-1331: Added a srcCore parameter to CoreAdminHandler's mergeindexes
action to merge one or more cores' indexes to a target core.

• SOLR-2610: Add an option to delete index through CoreAdmin UNLOAD action
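
Grouping by field can be pictured with a small model. The function below is not how Solr implements it; it assumes the documents are already sorted by relevance and keeps the first few per group, similar in spirit to a request with group=true, group.field=manu and group.limit=1. The documents and the field name "manu" are made up.

```python
def group_by_field(docs, field, limit=1):
    """Keep the first `limit` docs for each distinct value of field."""
    groups = {}
    for doc in docs:
        bucket = groups.setdefault(doc.get(field), [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups

# Made-up documents, assumed to be in relevance order.
docs = [
    {"id": 1, "manu": "apple"},
    {"id": 2, "manu": "belkin"},
    {"id": 3, "manu": "apple"},
]

print(group_by_field(docs, "manu", limit=1))
```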

Solr 4.0

• aka "trunk" at the moment
• major changes! (for the better!) at both
Lucene and Solr levels

Lucene 4.0
• The postings APIs have been removed in favor of the
new flexible indexing (flex) APIs.

• With flexible indexing it is now possible for an
application to create its own postings codec, to alter
how fields, terms, docs and positions are encoded into
the index.

• String -> BytesRef

• Per-segment everything

4.0 details
• Directory.copy/Directory.copyTo now copies all files (not just
index files), since what is and isn't an index file is now
dependent on the codecs used.

• String to BytesRef

• FuzzyQuery and WildcardQuery now operate on Unicode
code points, not Unicode code units.

• WildcardQuery and QueryParser now allow escaping with
the '\' character.

• Similarity can now be configured on a per-field basis
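
The difference between code points and code units matters for characters outside the Basic Multilingual Plane. The Python snippet below is only an illustration, not Lucene code: the musical G clef is one code point but two UTF-16 code units, so a single-character wildcard should match it as one character.

```python
import fnmatch

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

code_points = len(clef)                       # Python strings count code points
code_units = len(clef.encode("utf-16-le")) // 2  # UTF-16 code units, as in a Java String

print(code_points, code_units)

# With code point semantics, "?" matches the whole character.
print(fnmatch.fnmatchcase("a" + clef + "b", "a?b"))
```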

Relevancy

• more flexible scoring

NRT

• per-segment
• IndexWriter#commit now doesn't block
concurrent indexing while flushing all
'currently' RAM resident documents to
disk.

More Lucene 4.0 features
• Added RegexpQuery support to QueryParser.

• Adds AutomatonQuery, a MultiTermQuery that
matches terms against a finite-state machine.
Implement WildcardQuery and FuzzyQuery with
finite-state methods. Adds RegexpQuery.

• The QueryParser now accepts mixed inclusive and
exclusive bounds for range queries. Example: "{3 TO
5]"
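
The bracket semantics can be sketched with a tiny numeric checker: a curly brace excludes the bound and a square bracket includes it, so "{3 TO 5]" means greater than 3 and at most 5. This is an illustration only; it handles plain numbers and is not the Lucene QueryParser, which compares terms.

```python
import re

RANGE = re.compile(r"^([\[{])\s*(\S+)\s+TO\s+(\S+?)\s*([\]}])$")

def parse_range(text):
    """Return (low, high, low_inclusive, high_inclusive)."""
    m = RANGE.match(text.strip())
    if not m:
        raise ValueError("not a range expression: %r" % text)
    open_b, low, high, close_b = m.groups()
    return float(low), float(high), open_b == "[", close_b == "]"

def in_range(text, value):
    """Check a numeric value against a range expression."""
    low, high, incl_low, incl_high = parse_range(text)
    above = value >= low if incl_low else value > low
    below = value <= high if incl_high else value < high
    return above and below

print(in_range("{3 TO 5]", 3), in_range("{3 TO 5]", 5))
```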

Solr 4.0
• Pivot faceting

• Direct Solr spell checker

• Increased response writing flexibility (e.g. function query results)

• Distributed date/numeric range faceting

• "join" query parser

• NRT: You may now specify a 'soft' commit when committing. This
will use Lucene's NRT feature to avoid guaranteeing documents
are on stable storage in exchange for faster reopen times. There
is also a new 'soft' autocommit tracker that can be configured.
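
Requests exercising three of these features might look roughly like the sketch below. Since 4.0 is still trunk, parameter names could change; the ones used here (the join parser's from/to, facet.pivot, softCommit) are assumptions based on the current trunk and wiki, and all field names and values are placeholders.

```python
from urllib.parse import urlencode, parse_qs

# "join" query parser: find docs whose id matches the manu_id_s
# of docs matching the inner query (placeholder fields and query).
join_query = urlencode({"q": "{!join from=manu_id_s to=id}ipod"})

# Pivot faceting: facet on cat, then on popularity within each cat.
pivot_query = urlencode([
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.pivot", "cat,popularity"),
])

# Soft commit, sent as a parameter to the update handler.
soft_commit = urlencode({"softCommit": "true"})

print(join_query)
print(pivot_query)
print(soft_commit)
```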

About Lucid...

• Lucid Imagination provides commercial-grade
support, training, high-level consulting and value-
added software for Lucene and Solr.

• We make Lucene ‘enterprise-ready’ by offering:

• Free, certified distributions and downloads.

• Support, training, and consulting.

• LucidWorks Enterprise, a commercial search
platform built on top of Solr.

• http://www.lucidimagination.com

LucidFind

http://www.lucidimagination.com/search/?q=charlottesville
