What's New in Solr, June 2014
1. What’s New in Solr
Solr 4.7 & 4.8
June 12, 2014
Search | Discover | Analyze
2. Speaker: Steve Rowe
• Software Engineer at LucidWorks
• Lucene/Solr committer and PMC member
• Previously worked on search and NLP at the Center for Natural Language Processing at Syracuse University’s iSchool
• Twitter: @steven_a_rowe
3. Agenda
• A short history of Solr 4
• Solr 4.7 and 4.8: new features
• Solr 4.9 and beyond
5. A short history of Solr 4
• SolrCloud
– Distributed indexing and searching, NRT and NoSQL
features, e.g. realtime-get, optimistic concurrency and
durable updates
– Sharding, replication, ZooKeeper ensemble
– High availability with no single points of failure
• Real-time Get: Access latest document version, no
commit or new searcher open required
• Atomic updates: incremental field
add/update/increment via stored fields
• NRT: “soft” commits
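The atomic update bullet above can be sketched as a JSON payload builder. This is a minimal sketch, assuming the uniqueKey field is called id and using the set/add/inc modifiers of the JSON update syntax; the field names price and popularity are hypothetical.

```python
import json

def atomic_update(doc_id, set_fields=None, add_fields=None, inc_fields=None):
    """Build one atomic-update document (assumes the uniqueKey field is "id")."""
    doc = {"id": doc_id}
    modifiers = (("set", set_fields), ("add", add_fields), ("inc", inc_fields))
    for modifier, fields in modifiers:
        for name, value in (fields or {}).items():
            # Each updated field maps to a single {modifier: value} object.
            doc[name] = {modifier: value}
    return doc

# Hypothetical fields: replace "price" and increment "popularity" by one.
payload = json.dumps([
    atomic_update("doc1", set_fields={"price": 99}, inc_fields={"popularity": 1})
])
```

The resulting payload would be posted to the collection's update handler; fields not mentioned in the payload are rebuilt from their stored values, which is why atomic updates rely on stored fields.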
6. A short history of Solr 4
• Solr Reference Guide now released with each
feature release:
– Live (targeting next Solr release):
http://s.apache.org/SolrReferenceGuide
– Most recent released PDF:
http://s.apache.org/Solr-Ref-Guide-PDF
– Previous release PDFs:
http://s.apache.org/Older-Solr-Ref-Guide-PDFs
7. A short history of Solr 4
• Flexible indexing
– Solr core = Lucene index
• Lucene index = 1 or more segments
– Codec: per-segment suite of formats
• Flexible scoring
– You can specify similarity implementation per fieldType in
your schema.xml if you use SchemaSimilarityFactory
– Built-in Similarities (other than the default TF-IDF):
• Okapi BM25
• Divergence from Randomness
• Information-Based
• Language Models (with two smoothing implementations)
• SweetSpot
8. A short history of Solr 4
• DocValues: typed column stride fields
– Document-to-value mapping built at index time
– Reduced memory usage compared to field cache
– Good for faceting and sorting
– Missing values now supported as of Solr 4.5
• Pseudo-fields
– Field aliasing, e.g. &fl=result:indexed
– Function queries, aliasable too, e.g. &fl=price:sum(a,b)
– Document transformers
• Standard: [explain], [value], [shard], [docid]
• Pseudo-joins, e.g. ?q={!join+from=manu+to=id}ipod
• Pivot faceting: automatic drill-down (no distributed support)
9. A short history of Solr 4
• Schema API
• GET /collection/schema/fields/fieldname
• PUT /collection/schema/fields/name
• JSON body: { "type":"text_general",
"stored":true,
"indexed":true }
• Schemaless mode
• a.k.a. data-driven schema or field guessing
• Class guessed based on field values, then class(es)
mapped to a fieldType; first gets added to the schema
• Supported value classes: Boolean, Integer, Long, Float,
Double, and Date
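The Schema API PUT request listed on this slide could be prepared from a client roughly as follows. This is only a sketch: the base URL, the collection name, and the helper name add_field_request are assumptions, and the request object is built but not sent.

```python
import json
import urllib.request

def add_field_request(base_url, collection, field_name, field_type,
                      stored=True, indexed=True):
    """Prepare (but do not send) a PUT to /collection/schema/fields/name."""
    body = json.dumps(
        {"type": field_type, "stored": stored, "indexed": indexed}
    ).encode("utf-8")
    url = f"{base_url}/{collection}/schema/fields/{field_name}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Hypothetical host, collection, and field name.
request = add_field_request(
    "http://localhost:8983/solr", "collection1", "summary", "text_general"
)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```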
10. A short history of Solr 4
• Document routing
– CompositeId router, e.g. id=tenant!docid
• Used by default when numShards specified when
creating a collection.
• Restrict queries to shard(s): &_route_=tenant!
– Implicit router
• Online shard splitting
– Allows collections to scale, rather than having to
decide on how much to overshard up front.
– Split in two; with custom hash ranges; or using
split.key param to split to a dedicated shard
11. A short history of Solr 4
• Nested documents, a.k.a. Block Join
– Nested doc to be added:
<add>
<doc>
<field name="id">1</field>
<field name="title">Solr adds block join support</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">2</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
</add>
– Queries:
• Child query parser, e.g.
q={!child of="content_type:parentDocument"}title:Solr
• Parent query parser, e.g.
q={!parent which="content_type:parentDocument"}comments:SolrCloud
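The nested document shown above can also be generated programmatically. The sketch below rebuilds the same XML with Python's standard library; the helper name doc_element is hypothetical, and the field names are the ones from the slide.

```python
import xml.etree.ElementTree as ET

def doc_element(fields, children=()):
    """Build a <doc> element; child <doc> elements are nested inside it."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    for child in children:
        doc.append(child)
    return doc

child = doc_element({"id": 2, "comments": "SolrCloud supports it too!"})
parent = doc_element(
    {
        "id": 1,
        "title": "Solr adds block join support",
        "content_type": "parentDocument",
    },
    children=[child],
)

add = ET.Element("add")
add.append(parent)
xml_body = ET.tostring(add, encoding="unicode")
```

Parent and children are sent together in one add command, so they are indexed as a single block, which is what the child and parent query parsers rely on.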
12. A short history of Solr 4
• solr.xml legacy & discovery modes
– Legacy mode (cores listed in solr.xml) is
deprecated; support will be removed in Solr 5.
– Discovery mode (new as of Solr 4.3):
• No cores are listed in solr.xml
• Cores are discovered by a recursive walk of the solr
home directory, marked by core.properties files
• Nested core directories are not allowed
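The discovery walk described above can be approximated in a few lines. This is an illustrative sketch of the idea only, not Solr's actual implementation; the function name discover_cores is hypothetical.

```python
import os

def discover_cores(solr_home):
    """Return directories under solr_home that contain a core.properties file."""
    cores = []
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" in filenames:
            cores.append(dirpath)
            # Nested cores are not allowed, so stop descending below a core.
            dirnames[:] = []
    return sorted(cores)
```

Clearing dirnames in place tells os.walk to skip everything beneath a discovered core directory, mirroring the rule that nested core directories are not allowed.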
13. A short history of Solr 4
• New web admin UI with SolrCloud support
14. Solr 4.7 and 4.8: new features
• As of Solr 4.8, Java 7 is the minimum supported
JVM version. Recommended: Oracle 1.7.0_60
• <fields> and <types> tags are no longer necessary in
schema.xml
• Collections API improvements
– Working toward “ZooKeeper = Truth” mode
• legacyCloud=false cluster property
– New actions:
• CLUSTERSTATUS, LIST, ADDROLE, DELETEROLE,
ADDREPLICA, DELETEREPLICA, OVERSEERSTATUS,
MIGRATE, CLUSTERPROP
– Core properties can be specified with CREATE and
SPLITSHARD actions
15. Solr 4.7 and 4.8: new features
• Asynchronous execution of long-running
actions
– SolrCloud Collections API:
• CREATE, SPLITSHARD, MIGRATE
– CoreAdminHandler:
• CREATE, RENAME, UNLOAD, SWAP, MERGEINDEXES,
SPLIT
– Tracking request ID supplied via async param
– Track status via the new REQUESTSTATUS action,
using the tracking request ID
• Possible states: running, complete, failed, notfound
– Clear stored statuses with special request ID -1
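The submit-then-poll pattern above might look like this from a client. It is a sketch under assumptions: the base URL is hypothetical, the REQUESTSTATUS parameter name requestid follows the Reference Guide as I recall it, and fetch_state stands in for whatever HTTP call and response parsing the client actually uses. The state string running is the one listed on this slide.

```python
import time
from urllib.parse import urlencode

def collections_url(base_url, **params):
    """Build a Collections API URL such as .../admin/collections?action=..."""
    return f"{base_url}/admin/collections?{urlencode(params)}"

def wait_for(request_id, fetch_state, poll_seconds=1.0, max_polls=60):
    """Poll until the tracked request leaves the "running" state.

    fetch_state is a caller-supplied function that issues REQUESTSTATUS
    for request_id and returns the state string from the response.
    """
    for _ in range(max_polls):
        state = fetch_state(request_id)
        if state != "running":
            return state
        time.sleep(poll_seconds)
    return "running"

# Hypothetical collection and request ID; the async param carries the tracking ID.
submit = collections_url(
    "http://localhost:8983/solr",
    action="SPLITSHARD", collection="colln", shard="shard1", **{"async": "1000"}
)
status = collections_url(
    "http://localhost:8983/solr", action="REQUESTSTATUS", requestid="1000"
)
```

Passing fetch_state in as a function keeps the polling logic independent of the HTTP library and makes it easy to exercise without a running cluster.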
16. Solr 4.7 and 4.8: new features
• Cursors: Efficient Deep Paging
– Request must include a sort, which must include
the uniqueKey, which must be defined
– First page: ?q=…&sort=id+asc&rows=N&cursorMark=*
• Response contains "nextCursorMark":"<base64encoded>"
– Following pages:
?q=…&sort=id+asc&rows=N&cursorMark=<from response>
– Repeat; when nextCursorMark=cursorMark from the
request, there are no more results
– No server-side state
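The cursor protocol above amounts to a simple client-side loop. The sketch below assumes a caller-supplied fetch_page function (hypothetical) that sends the given parameters to Solr and returns the parsed JSON response, with documents under response.docs and the next cursor under nextCursorMark.

```python
def iter_cursor(fetch_page, query, sort="id asc", rows=100):
    """Yield every matching document by following nextCursorMark.

    The sort must include the uniqueKey field (assumed here to be "id").
    """
    cursor_mark = "*"  # the first page always starts from "*"
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor_mark}
        )
        yield from response["response"]["docs"]
        next_mark = response["nextCursorMark"]
        if next_mark == cursor_mark:
            return  # the cursor did not advance: no more results
        cursor_mark = next_mark
```

Because the server keeps no state, the only thing the client carries between requests is the last nextCursorMark value.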
18. Solr 4.7 and 4.8: new features
• Document expiration and Time To Live (TTL)
– Auto-delete expired documents
• DocExpirationUpdateProcessorFactory can periodically
wake up and delete expired documents
– Compute expiration date from TTL
• Update request _ttl_ param, or
• Document _ttl_ field
• Both names are configurable, defaulting to _ttl_.
• _ttl_ values are interpreted as Date Math Expressions
relative to NOW, e.g. “+1YEAR”.
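The two ways of supplying a TTL can be sketched as follows, assuming the default _ttl_ name and a hypothetical base URL and collection. One practical detail when using the request parameter: the leading "+" of a date math expression has to be percent-encoded in a URL.

```python
from urllib.parse import urlencode

def update_url_with_ttl(base_url, collection, ttl):
    """Build an update URL carrying the TTL as a request parameter."""
    # urlencode turns the leading "+" into %2B; a raw "+" in a query
    # string would otherwise be decoded as a space.
    return f"{base_url}/{collection}/update?{urlencode({'_ttl_': ttl})}"

def doc_with_ttl(doc_id, ttl, ttl_field="_ttl_"):
    """Build a document carrying its own TTL (assumes uniqueKey "id")."""
    return {"id": doc_id, ttl_field: ttl}

url = update_url_with_ttl("http://localhost:8983/solr", "colln", "+1YEAR")
doc = doc_with_ttl("session-42", "+30MINUTES")
```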
19. Solr 4.7 and 4.8: new features
• Dynamic synonyms and stopwords
– “Managed” resources: configuration and content for
synonyms and stopwords, persistence managed by Solr
– Specified as ManagedSynonymFilterFactory and
ManagedStopFilterFactory on analyzers in schema.xml
– CRUD operations are enabled via a REST endpoint per
managed resource.
– The “managed” attribute names the REST endpoint, e.g.
<filter class="solr.ManagedStopFilterFactory"
managed="french" />
– E.g. to delete stopword “le” from the “french” managed
stoplist:
curl -X DELETE "…/solr/colln/schema/analysis/stopwords/french/le"
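The same DELETE call can be prepared from Python. This sketch mirrors the curl example above; the base URL is an assumption (the slide elides it), and the request is built but not sent.

```python
import urllib.request
from urllib.parse import quote

def delete_stopword_request(base_url, collection, managed_name, word):
    """Prepare a DELETE for one word of a managed stopword list."""
    url = (
        f"{base_url}/{collection}/schema/analysis/stopwords/"
        f"{quote(managed_name, safe='')}/{quote(word, safe='')}"
    )
    return urllib.request.Request(url, method="DELETE")

# Hypothetical host; "colln", "french" and "le" are taken from the slide.
request = delete_stopword_request(
    "http://localhost:8983/solr", "colln", "french", "le"
)
```

Percent-encoding the managed resource name and the word keeps the URL valid for stopwords that contain non-ASCII or reserved characters.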
20. Solr 4.7 and 4.8: new features
• SSL support in SolrCloud
– URL scheme stored in ZooKeeper
– SSL certificates are specifiable via system properties, to
enable authentication
• Nested documents may be specified in JSON format
• Tri-level compositeId routing
– E.g. “tenant!group!docid”, 8/8/16 hash bits per component
• Build Solr indexes with Hadoop’s MapReduce
– Mark Miller’s blog: http://bit.ly/1oh0fWq
• GitHub solr-map-reduce-example: http://bit.ly/1pnDAao
• Named config sets in non-SolrCloud mode
– Default base directory is SOLR_HOME/configsets/
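The 8/8/16 split for tri-level compositeId routing mentioned above can be illustrated by composing one 32-bit routing hash from the three id components. This is a sketch of the idea under assumptions: Solr hashes with MurmurHash3, while the code below substitutes zlib.crc32 as a stand-in, and the masking scheme reflects my understanding of the router rather than its verified source.

```python
import zlib

def stand_in_hash32(text):
    """Stand-in 32-bit hash (Solr itself uses MurmurHash3, not CRC32)."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

def tri_level_hash(composite_id, bits=(8, 8, 16), hash32=stand_in_hash32):
    """Combine per-component hashes: top 8 bits, next 8 bits, low 16 bits."""
    parts = composite_id.split("!")
    if len(parts) != len(bits) or sum(bits) != 32:
        raise ValueError("expected one id component per bit width, totalling 32 bits")
    combined = 0
    shift = 32
    for part, nbits in zip(parts, bits):
        shift -= nbits
        mask = ((1 << nbits) - 1) << shift
        # Each component contributes only the bits in its own slice.
        combined |= hash32(part) & mask
    return combined

a = tri_level_hash("tenant!group!doc1")
b = tri_level_hash("tenant!group!doc2")
```

Documents that share a tenant agree on the top 8 bits, and documents that also share a group agree on the top 16 bits, so they fall into the same or neighbouring hash ranges.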
21. Solr 4.7 and 4.8: new features
• Suggester v2
– Added BlendedInfixSuggester
– Added FreeTextSuggester
– Queries can use multiple suggesters
• New query parsing features
– SimpleQParserPlugin: parser for human entered
queries with selectable operators.
– ComplexPhraseQParserPlugin: wildcards, ORs, etc.
inside Phrase Queries
• E.g. {!complexphrase inOrder=true}name:"Jo* Smith"
22. Solr 4.7 and 4.8: new features
• CollapsingQParserPlugin
– Performant alternative grouping/field collapsing
implementation, for high distinct group cardinality.
• ExpandComponent
– Expands collapsed groups
– Can also expand nested documents
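A request that collapses on a field and then expands the collapsed groups could be parameterised as below. This is a sketch, assuming the collapse filter and expand parameters as documented on the Collapse & Expand page of the Reference Guide; the field name group_id and the query are hypothetical.

```python
from urllib.parse import urlencode

def collapse_expand_params(query, collapse_field, expand_rows=5):
    """Collapse results to one document per group, then expand the groups."""
    return {
        "q": query,
        # The collapse is applied as a filter query using local params.
        "fq": "{!collapse field=%s}" % collapse_field,
        "expand": "true",
        "expand.rows": expand_rows,
    }

params = collapse_expand_params("ipod", "group_id")
query_string = urlencode(params)
```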
23. Solr 4.9 and beyond
• ZooKeeper = Truth / legacyCloud=false
• MODIFYCOLLECTION collections API
– Modify maxShardsPerNode, replicationFactor for the
entire collection
• Incremental Field Updates on numeric
DocValues
– Binary DocValues IFUs also coming
• Multi-valued DocValues sort fields
• Legacy numeric/date field types deprecated,
removed in Solr 5 in favor of Trie field types
24. Solr 4.9 and beyond
• In Solr 5, the .war will no longer be shipped
• Index integrity: checksums
• Integrity check on merge off by default
• solrconfig.xml option <indexConfig><checkIntegrityAtMerge>
• New update query param min_rf will allow clients
to set the minimum successful replicas for the
request
• Return Block Join child documents when parents
match, via a new DocTransformer
[child parentFilter="field:value"]
25. Solr 4.9 and beyond
• AnalyticsQuery: support pluggable, pipeline-able
analytics, orderable via the “cost” parameter, like
PostFilters.
• ReRankingQParserPlugin
• Re-rank the top n results
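Re-ranking the top n results could be requested with parameters along these lines. This is a sketch, assuming the rq syntax described in the re-ranking blog post linked in the notes; the parameter name rqq and the example queries are hypothetical.

```python
def rerank_params(query, rerank_query, rerank_docs=200, rerank_weight=2.0):
    """Build params that re-rank the top rerank_docs hits of the main query."""
    return {
        "q": query,
        # $rqq dereferences the separate rqq request parameter below.
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=%d reRankWeight=%s}"
              % (rerank_docs, rerank_weight),
        "rqq": rerank_query,
    }

params = rerank_params("greetings", "(hi hello hey)", rerank_docs=1000, rerank_weight=3)
```

Only the top reRankDocs documents are rescored with the more expensive query, which keeps its cost bounded regardless of how many documents the main query matches.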
26. Platform
LucidWorks Open Source
• Effortless AWS deployment and monitoring:
http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr:
https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way
support w/ Lucene and Solr, different file formats, pipelines,
Logstash
27. Links
Solr website: http://lucene.apache.org/solr
Solr Reference Guide:
• Live (targeting next Solr release):
http://s.apache.org/SolrReferenceGuide
• Most recent released PDF: http://s.apache.org/Solr-Ref-Guide-PDF
• Previous release PDFs: http://s.apache.org/Older-Solr-Ref-Guide-PDFs
Lucene/Solr Revolution: http://www.LuceneRevolution.org
Q & A
Editor’s Notes
Asynchronous collection API calls in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-AsynchronousCalls
REQUESTSTATUS action in the Solr Reference Guide: http://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-RequestStatus
See Pagination of Results in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
Chris Hostetter’s scripts to produce the graph: https://github.com/LucidWorks/blog-deep-paging-perf
Date Math Expressions in Solr Javadocs: https://lucene.apache.org/solr/4_8_1/solr-core/org/apache/solr/util/DateMathParser.html
See Chris Hostetter’s blog post “New in Solr 4.8: Document Expiration”: http://searchhub.org/2014/05/07/document-expiration/
See the “Managed Resources” page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Managed+Resources
See also Tim Potter’s blog “Using Solr’s REST APIs to manage stop words and synonyms”: http://searchhub.org/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/
For info on Tri-level compositeId routing, see Anshum Gupta’s blog “Multi level composite-id routing in SolrCloud”: http://searchhub.org/2014/01/06/10590/
See the Config Sets page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Config+Sets
Suggester v2 JIRA issue: https://issues.apache.org/jira/browse/SOLR-5378
Simple Query Parser in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-SimpleQueryParser
Complex Phrase Query Parser in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
See the Collapse & Expand page in the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Collapse+%26+Expand
See also Joel Bernstein’s blog post “The CollapsingQParserPlugin: Solr’s New High Performance Field Collapsing PostFilter”: http://heliosearch.org/the-collapsingqparserplugin-solrs-new-high-performance-field-collapsing-postfilter/
See also Joel Bernstein’s blog post “Solr’s New Expand Component”: http://heliosearch.org/solrs-new-expand-component/
See also Joel Bernstein’s blog post “Using the ExpandComponent to expand a Solr Block Join”: http://heliosearch.org/expand-block-join/
See Joel Bernstein’s blog post “Solr’s New AnalyticsQuery API”: http://heliosearch.org/solrs-new-analyticsquery-api/
See Joel Bernstein’s blog post “New in Solr 4.9: Query Re-Ranking”: http://heliosearch.org/solrs-new-re-ranking-feature/