Apache Solr
Advanced seminar (Oberseminar), 12 June 2015
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Péter Király, pkiraly@gwdg.de
What is Apache Solr?
Solr is the popular, blazing-fast, open source enterprise
search platform built on Apache Lucene
History in one minute
● 1999: Doug Cutting published Lucene
● 2004: Yonik Seeley published Solr
● 2006: Apache project (2007: TLP)
● 2009: LucidWorks company
● 2010: Merge of Lucene and Solr
● 2011: 3.1
● 2012: 4.0
● 2015: 5.0
“Sister” projects
● Nutch: web scale search engine
● Tika: document parser
● Hadoop: distributed storage and data
processing
● Elasticsearch: alternative to Solr
● forks/ports of Lucene
● client libraries and tools (Luke index viewer)
Main features I
● Faceted navigation
● Hit highlighting
● Query language
● Schema-less mode and Schema REST API
● JSON, XML, PHP, Ruby, Python, XSLT,
Velocity and custom Java binary outputs
● HTML administration interface
Main features II
● Replication to other Solr servers
● Distributed search through sharding
● Search results clustering based on Carrot2
● Extensible through plugins
● Relevance boosting via functions
● Caching - queries, filters, and documents
● Embeddable in a Java Application
Main features III
● Geo-spatial search, including multiple
points per document and polygons
● Automated management of large clusters
through ZooKeeper
● Function queries
● Field Collapsing and grouping
● Auto-suggest
Inverted index
Original documents:
Doc #  Content field
1      A Fun Guide to Cooking
2      Decorating Your Home
3      How to Raise a Child
4      Buying a New Car
Inverted index
Index structure
Term        Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7
a           0    1    1    1    0    0    0
becoming    0    0    0    0    1    0    0
beginner’s  0    0    0    0    0    1    0
buy         0    0    1    0    0    0    0
● stored as a bit vector
● stored as reference to a tree structure
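A minimal sketch of how such an index could be built, assuming lowercased whitespace tokens and only the four titles from the "Original documents" slide (so the rows will not match the seven-document table above). The names `build_inverted_index` and `bit_vector` are illustrative and not part of Lucene or Solr.

```python
from collections import defaultdict

# The four example titles from the "Original documents" slide.
DOCS = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def bit_vector(index, term, num_docs):
    """Return the posting list of a term as a 0/1 row, one position per document."""
    ids = set(index.get(term, []))
    return [1 if doc_id in ids else 0 for doc_id in range(1, num_docs + 1)]


index = build_inverted_index(DOCS)
print(index["a"])                 # ids of the documents whose title contains "a"
print(bit_vector(index, "a", 4))  # the same posting list as a bit vector
```

Looking up a term is then a dictionary access instead of a scan over all documents, which is the point of inverting the index.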
Indexing
Document ~ RDBMS record
Fields (key-value structure):
● types (text, numeric, date, point, custom)
● indexed, stored, multivalued, required
● field name patterns (prefixes, suffixes, such
as *_tx)
● special fields (identifier, _version_)
Indexing
formats: JSON, XML, binary, RDBMS, ...
connections: file, Data Import Handler, API
sharding (separating documents into multiple
parts)
denormalized documents - (almost) no JOIN ;-(
copy field
catch-all field (contains everything)
A document example (XML)
<doc>
  <field name="id">F8V7067-APL-KIT</field> <!-- string -->
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field> <!-- text -->
  <field name="cat">electronics</field>
  <field name="cat">connector</field> <!-- multivalued -->
  <field name="price">19.95</field> <!-- float -->
  <field name="inStock">false</field> <!-- boolean -->
  <field name="store">45.18014,-93.87741</field> <!-- geo point -->
  <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field> <!-- date -->
</doc>
A document example (JSON)
{
"id": "F8V7067-APL-KIT",
"name": "Belkin Mobile Power Cord for iPod w/ Dock",
"cat": ["electronics", "connector"],
"price":19.95,
"inStock":false,
"store": "45.18014,-93.87741",
"manufacturedate_dt": "2005-08-01T16:30:25Z"
}
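A document example (SolrJ library)
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// SolrJ 4.x API; the server URL is elided on the slide
SolrServer solr = new HttpSolrServer("http://…");
SolrInputDocument doc = new SolrInputDocument();
doc.setField("id", "F8V7067-APL-KIT");
doc.setField("name", "Belkin Mobile Power Cord for iPod w/ Dock");
...
solr.add(doc);
solr.commit(true, true); // waitFlush, waitSearcher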
Text analysis chain
1) character filters: preprocess text
pattern replace, ASCII folding, HTML stripping
2) tokenizers: split text into smaller units
whitespace, lowercase, word delimiter, standard
3) token filters: examine/modify/eliminate tokens
stemming, lowercase, stop words
Text analysis chain
<fieldType name="my-text-type" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-FoldToASCII.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
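The order of the chain matters. The sketch below imitates the configuration above in plain Python: a character filter, a whitespace tokenizer, then a stop filter followed by a lowercase filter. It is an illustration only; the stop word list is made up, and Unicode decomposition is a rough stand-in for the mapping-FoldToASCII.txt mapping file.

```python
import unicodedata

# Hypothetical stop word list; in Solr it would be read from stopwords.txt.
STOP_WORDS = {"a", "at", "in", "the"}


def fold_to_ascii(text):
    """Character filter: strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Run the chain in the configured order: char filter, tokenizer, token filters."""
    folded = fold_to_ascii(text)                         # charFilter
    tokens = folded.split()                              # whitespace tokenizer
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop filter (case-sensitive)
    return [t.lower() for t in tokens]                   # lowercase filter


print(analyze("Drinking a latte at Caffé Grecco"))
print(analyze("A latte"))
```

Because the stop filter runs before lowercasing in this order, a capitalized "A" is not removed; swapping the two filters would change the result.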
Text analysis result
Input:
#Yummm :) Drinking a latte at Caffé Grecco in
SF’s historic North Beach… Learning text
analysis
Output tokens:
“#yumm”, “drink”, “latte”, “caffe”, “grecco”,
“sf”/“san francisco”, “historic”, “north”, “beach”,
“learn”, “text”, “analysis”
Performing queries
1) user enters a query (and specifies other
components)
2) query handler
3) analysis (similar to the chain used at indexing)
4) run search
5) add results of additional components
6) serialization (XML, JSON etc.)
Lucene query language
● *:* (→ everything)
● gwdg
● name:gwdg
● name:admin*
● h?ld (→ hold, held)
● name:administrator~ (fuzzy → administrator, administration)
● name:Gesellschaft~0.6 (similarity measure)
Lucene query language
● name:Max AND name:Planck
● name:Max OR name:Planck
● name:Max NOT name:Planck
● name:"Max Planck"
● name:("Max Planck" OR Gesellschaft)
● "Max Planck"~3 (within 3 words)
→ also matches "Planck Max", "Max Ludwig Planck"
Lucene query language
● max planck^10 (weighting)
● price:[10 TO 20] (→ 10..20, bounds included)
● price:{10 TO 20} (→ 11..19 for integers, bounds excluded)
● born:[1900-01-01T00:00:00Z TO 1949-12-31T23:59:59Z] (date range)
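Queries like these contain characters (quotes, brackets, colons, spaces) that have to be percent-encoded when sent over HTTP. A small sketch using only the Python standard library; the host, port and core name in `SELECT_URL` are placeholders, and `select_url` is a hypothetical helper, not a Solr API.

```python
from urllib.parse import urlencode

# Assumed endpoint; host, port and core name are placeholders.
SELECT_URL = "http://localhost:8983/solr/collection1/select"


def select_url(query, **params):
    """Build a /select URL, percent-encoding the query and any extra parameters."""
    all_params = {"q": query, "wt": "json"}
    all_params.update(params)
    return SELECT_URL + "?" + urlencode(all_params)


print(select_url('name:"Max Planck"~3'))
print(select_url("price:[10 TO 20]", rows=5))
```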
Date mathematics
indexing with hour granularity:
"born": "2012-05-22T09:30:22Z/HOUR"
search by relative time range, e.g. the last month:
born:[NOW/DAY-1MONTH TO NOW/DAY]
keywords:
SECOND, MINUTE, HOUR, DAY, MONTH, YEAR
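To make the "/UNIT" rounding concrete, here is a sketch that imitates it in Python for a few units. It only illustrates the rounding-down semantics; Solr evaluates date math on the server, and `round_down` is a made-up helper name.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Fields reset to their minimum when rounding down to each unit.
_ROUNDING = {
    "YEAR": dict(month=1, day=1, hour=0, minute=0, second=0),
    "MONTH": dict(day=1, hour=0, minute=0, second=0),
    "DAY": dict(hour=0, minute=0, second=0),
    "HOUR": dict(minute=0, second=0),
    "MINUTE": dict(second=0),
}


def round_down(timestamp, unit):
    """Imitate the '/UNIT' rounding of date math for a UTC timestamp string."""
    parsed = datetime.strptime(timestamp, ISO_FORMAT)
    return parsed.replace(**_ROUNDING[unit]).strftime(ISO_FORMAT)


print(round_down("2012-05-22T09:30:22Z", "HOUR"))  # 2012-05-22T09:00:00Z
print(round_down("2012-05-22T09:30:22Z", "DAY"))   # 2012-05-22T00:00:00Z
```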
Faceted search
Facets give users an overview of the
content and help them browse without entering
search terms (search theorists: browsing and
searching are equally important).
● term/field facet: list terms and counts
● query facet: run queries, return counts
● range facet: split range into pieces
Term facets
&facet=true
&facet.field=TYPE
"facet_fields":{
  "TYPE":[
    "IMAGE", 25334764,
    "TEXT", 16990647,
    "VIDEO", 702787,
    "SOUND", 558825,
    "3D", 21303
  ]
http://europeana.eu - Europeana portal
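In this response format terms and counts alternate in one flat list, so a client has to pair them up before display. A short sketch with the values from the response above; `pair_up` is an illustrative helper name.

```python
# The "TYPE" facet from the response above: terms and counts alternate in one flat list.
facet_type = ["IMAGE", 25334764, "TEXT", 16990647, "VIDEO", 702787,
              "SOUND", 558825, "3D", 21303]


def pair_up(flat):
    """Turn [term1, count1, term2, count2, ...] into [(term1, count1), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


counts = pair_up(facet_type)
total = sum(count for _, count in counts)
for term, count in counts:
    # Print term, absolute count, and share of all counted documents.
    print(f"{term:6} {count:>10} {count / total:6.1%}")
```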
Term facet
Additional parameters:
● limit, offset → for pagination
● sort (by index or count) → alphabetical or by frequency
● mincount → filter out less frequent terms
● missing → number of documents without a value in this field
● prefix → such as “http” to display URLs only
● f.[field name].facet.[parameter] → overrides the general setting for one field
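A sketch of how these parameters could be assembled into a request, including a per-field override. The helper `facet_params` and its defaults are assumptions for illustration; the parameter names follow the facet.* and f.[field name].facet.[parameter] patterns listed above.

```python
from urllib.parse import urlencode


def facet_params(field, limit=10, offset=0, mincount=1, sort="count", overrides=None):
    """Return facet request parameters as a list of (name, value) pairs."""
    params = [
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),
        ("facet.offset", offset),
        ("facet.mincount", mincount),
        ("facet.sort", sort),
    ]
    # Per-field overrides use the f.<field>.facet.<parameter> pattern.
    for name, value in (overrides or {}).items():
        params.append((f"f.{field}.facet.{name}", value))
    return params


params = facet_params("TYPE", limit=5, overrides={"sort": "index"})
print(urlencode(params))
```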
Query facets
&facet=true&
facet.query=price:[* TO 5}&
facet.query=price:[5 TO 10}&
facet.query=price:[10 TO 20}&
facet.query=price:[20 TO 50}&
facet.query=price:[50 TO *]
"facet_counts":{
  "facet_queries":{
    "price:[* TO 5}":6,
    "price:[5 TO 10}":5,
    "price:[10 TO 20}":3,
    "price:[20 TO 50}":6,
    "price:[50 TO *]":0
  },
Query facets (zooming)
From centuries to years
http://pcu.bage.es/ Catálogo Colectivo de las Bibliotecas de la Administración General del Estado
Range facet
&facet=true&
facet.range=price&
facet.range.start=0&
facet.range.end=50&
facet.range.gap=5
"facet_ranges":{
  "price":{
    "counts":[
      "0.0", 6, "5.0", 5,
      "10.0", 0, "15.0", 3,
      "20.0", 2, "25.0", 2,
      "30.0", 1, "35.0", 0,
      "40.0", 0, "45.0", 1
    ],
    "gap":5.0,"start":0.0,"end":50.0
}}}}
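Range facets and query facets describe the same data at different granularity. Assuming both example responses come from the same index, the sketch below regroups the gap-5 buckets from the response above and arrives at the counts reported by the query facets for price:[10 TO 20} and price:[20 TO 50}. The helper names are illustrative.

```python
# "counts" from the range facet response above: bucket start and count alternate.
range_counts = ["0.0", 6, "5.0", 5, "10.0", 0, "15.0", 3, "20.0", 2,
                "25.0", 2, "30.0", 1, "35.0", 0, "40.0", 0, "45.0", 1]
GAP = 5.0


def buckets(flat, gap):
    """Convert the flat list into (start, end, count) triples; each range is [start, end)."""
    starts = [float(s) for s in flat[0::2]]
    return [(start, start + gap, count) for start, count in zip(starts, flat[1::2])]


def count_between(triples, low, high):
    """Sum the counts of all buckets lying fully inside [low, high)."""
    return sum(count for start, end, count in triples if low <= start and end <= high)


triples = buckets(range_counts, GAP)
print(count_between(triples, 10, 20))  # 3, as for facet.query=price:[10 TO 20}
print(count_between(triples, 20, 50))  # 6, as for facet.query=price:[20 TO 50}
```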
Hit highlighting
?...&hl=true
&hl.fl=name
&hl.simple.pre=<em>
&hl.simple.post=</em>
"highlighting": {
  "SP2514N": {    ← ID
    "name": [
      "<em>SpinPoint P120</em> SP2514N - hard drive - 250 GB - ATA-133"]}
More like this… (similar documents)
mlt (more like this) handler:
● doc ID
● fields
● boost
● limit
● minimum word length and frequency
http://catalog.lib.kyushu-u.ac.jp/en/ - Kyushu University library catalog
More like this (alternative solution)
(DATA_PROVIDER:("NIOD")^0.2 OR
 what:("IMAGE" OR "Amerikaanse Strijdkrachten" OR "Luchtmacht" OR
       "Steden - Zie ook: Ruimtelijke ordening, Wederopbouw, Dorpen")^0.8)
NOT europeana_id:"/2021622/11607
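A query of this shape can be generated from the metadata of the record being viewed. The following is a sketch under assumptions: the function names are invented, the record id is hypothetical, and the boosts simply reuse the 0.2 and 0.8 seen above.

```python
def quote(value):
    """Wrap a value in double quotes, escaping embedded backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def similar_query(doc_id, weighted_fields, id_field="europeana_id"):
    """Build a boosted OR query over a record's field values, excluding the record itself.

    weighted_fields maps a field name to a (list of values, boost) pair.
    """
    clauses = []
    for field, (values, boost) in weighted_fields.items():
        alternatives = " OR ".join(quote(v) for v in values)
        clauses.append(f"{field}:({alternatives})^{boost}")
    return f"({' OR '.join(clauses)}) NOT {id_field}:{quote(doc_id)}"


# Hypothetical record id; the field values are taken from the query above.
print(similar_query(
    "/0000000/example",
    {
        "DATA_PROVIDER": (["NIOD"], 0.2),
        "what": (["IMAGE", "Luchtmacht"], 0.8),
    },
))
```

Compared with the MoreLikeThis handler, this approach keeps full control over which fields count as evidence of similarity and how strongly.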
Multilingual search
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
<charFilter class="solr.PersianCharFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_stop.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" />
<filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
Multilingual search strategies
● Separate fields by language
→ title_en:horse OR title_de:horse OR title_hu:horse
● Separate collections (core, shard) per language
each core has its own language settings but the same field names
→ /select?shards=.../english,.../spanish,.../french
&q=title:horse
● All languages in one field (from Solr 5.0)
→ title:(es|escuela OR en,es,de|school OR school)
Multilingual search
query → translation API → rewritten query
horse → (Hauspferd OR Ló OR Paard OR …)
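Both rewriting strategies come down to simple string expansion on the client side. A sketch with invented helper names; the translations are passed in as a plain list, since the translation API itself is not specified here.

```python
def per_language_query(base_field, term, languages):
    """Search the same term in one field per language, e.g. title_en, title_de."""
    return " OR ".join(f"{base_field}_{lang}:{term}" for lang in languages)


def translated_query(field, translations):
    """OR together the translations of a term, quoting multi-word translations."""
    quoted = [f'"{t}"' if " " in t else t for t in translations]
    return f"{field}:({' OR '.join(quoted)})"


print(per_language_query("title", "horse", ["en", "de", "hu"]))
print(translated_query("title", ["Hauspferd", "Ló", "Paard"]))
```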
Relevancy
The most important concepts:
● Term frequency (tf) - how often a particular term appears in a matching
document
● Inverse document frequency (idf) - how “rare” a search term is, inverse
of the document frequency (how many total documents the search term
appears within)
● field normalization factor (field norm) - a combination of factors
describing the importance of a particular field on a per-document basis
Relevancy
score(q,d) = Σ (tf(t in d) × idf(t)^2 × t.getBoost() ×
norm(t,d)) × coord(q,d) × queryNorm(q)
where
t = term; d = document; q = query; f = field
tf(t in d) = (number of term occurrences in document)^(1/2)
norm(t,d) = d.getBoost() × lengthNorm(f) × f.getBoost()
idf(t) = 1 + log(numDocs / (docFreq + 1))
coord(q,d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / sumOfSquaredWeights^(1/2)
sumOfSquaredWeights = q.getBoost()^2 × Σ (idf(t) × t.getBoost())^2
see: Solr in Action, p. 67
Debug
?...&debug=true
...
"debug":{
"rawquerystring":"hard drive",
"querystring":"hard drive",
"parsedquery":"text:hard text:drive",
"parsedquery_toString":"text:hard text:drive",
Debug
"explain":{
  "6H500F0":"
1.209934 = (MATCH) sum of:
  0.6588537 = (MATCH) weight(text:hard in 2) [DefaultSimilarity], result of:
    0.6588537 = score(doc=2,freq=2.0), product of:
      0.73792744 = queryWeight, product of:
        3.3671236 = idf(docFreq=2, maxDocs=32)
        0.21915662 = queryNorm
      0.8928435 = fieldWeight in 2, product of:
        1.4142135 = tf(freq=2.0), with freq of:
          2.0 = termFreq=2.0
        3.3671236 = idf(docFreq=2, maxDocs=32)
        ...
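The numbers in the explain output can be recomputed by hand from the tf and idf definitions on the Relevancy slide. The sketch below does this for the term "hard". One value is an assumption: the explain output is cut off before the field norm, so 0.1875 is used here because it is the value that reproduces the reported fieldWeight.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf(freq):
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


# Read from the explain output above.
QUERY_NORM = 0.21915662
# Assumption: the elided line is fieldNorm = 0.1875 (not shown in the output).
FIELD_NORM = 0.1875

term_idf = idf(doc_freq=2, num_docs=32)         # explain reports 3.3671236
query_weight = term_idf * QUERY_NORM            # explain reports 0.73792744
field_weight = tf(2.0) * term_idf * FIELD_NORM  # explain reports 0.8928435
score = query_weight * field_weight             # explain reports 0.6588537

print(round(term_idf, 5), round(query_weight, 5), round(field_weight, 5), round(score, 5))
```

The results agree with the explain output up to rounding; small differences in the last digits are expected because Lucene computes with single-precision floats.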
References
● http://lucene.apache.org/solr/
● Grainger & Potter: Solr in Action
● https://lucidworks.com/blog/
● http://blog.sematext.com/
● http://solr.pl/
● https://www.packtpub.com/all?search=solr
● http://www.slideshare.net/treygrainger
Happy searching!
 
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)
 
Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
 
Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)
 
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
 

Apache Solr

  • 1. Apache Solr Oberseminar, 12.06.2015 Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen Péter Király, pkiraly@gwdg.de
  • 2. What is Apache Solr? Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene
  • 3. History in one minute ● 1999: Doug Cutting published Lucene ● 2004: Yonik Seeley published Solr ● 2006: Apache project (2007: TLP) ● 2009: LucidWorks company ● 2010: Merge of Lucene and Solr ● 2011: 3.1 ● 2012: 4.0 ● 2015: 5.0
  • 4. “Sister” projects ● Nutch: web scale search engine ● Tika: document parser ● Hadoop: distributed storage and data processing ● Elasticsearch: alternative to Solr ● forks/ports of Lucene ● client libraries and tools (Luke index viewer)
  • 5. Main features I ● Faceted navigation ● Hit highlighting ● Query language ● Schema-less mode and Schema REST API ● JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary outputs ● HTML administration interface
  • 6. Main features II ● Replication to other Solr servers ● Distributed search through sharding ● Search results clustering based on Carrot2 ● Extensible through plugins ● Relevance boosting via functions ● Caching: queries, filters, and documents ● Embeddable in a Java application
  • 7. Main features III ● Geo-spatial search, including multiple points per document and polygons ● Automated management of large clusters through ZooKeeper ● Function queries ● Field collapsing and grouping ● Auto-suggest
  • 8. Inverted index. Original documents (Doc #: content field): 1: A Fun Guide to Cooking; 2: Decorating Your Home; 3: How to Raise a Child; 4: Buying a New Car
  • 9. Inverted index: index structure
       Term         Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7
       a              0    1    1    1    0    0    0
       becoming       0    0    0    0    1    0    0
       beginner’s     0    0    0    0    0    1    0
       buy            0    0    1    0    0    0    0
       (slide annotations: stored as a bit vector; stored as reference to a tree structure)
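The idea behind the two inverted-index slides can be sketched in a few lines of Python. This is only an illustration, not how Lucene stores its index: it uses the four documents listed on slide 8, splits on whitespace and lowercases, so the resulting vectors differ from the seven-document table above. The function names `build_inverted_index` and `bit_vector` are made up for this sketch.

```python
from collections import defaultdict

# Documents from the "Original documents" slide (doc number -> content field).
DOCS = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
}


def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the sorted ids of the docs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def bit_vector(index, term, num_docs):
    """Return the postings of a term as a list of 0/1 flags, one flag per document."""
    ids = set(index.get(term, []))
    return [1 if doc_id in ids else 0 for doc_id in range(1, num_docs + 1)]


index = build_inverted_index(DOCS)
print(index["a"])                          # doc ids containing the term "a"
print(bit_vector(index, "to", len(DOCS)))  # one flag per document for "to"
```

A lookup is then a dictionary access on the term instead of a scan over all documents, which is the point of inverting the document-to-terms relation.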
  • 10. Indexing. Document ~ RDBMS record. Fields (key-value structure): ● types (text, numeric, date, point, custom) ● indexed, stored, multiple, required ● field name patterns (prefixes, suffixes, such as *_tx) ● special fields (identifier, _version_)
  • 11. Indexing ● formats: JSON, XML, binary, RDBMS, ... ● connections: file, Data Import Handler, API ● sharding (separating documents into multiple parts) ● denormalized documents: (almost) no JOIN ;-( ● copy field ● catch-all field (contains everything)
  • 12. A document example (XML), with the field type noted after each line
       <doc>
         <field name="id">F8V7067-APL-KIT</field>                           (string)
         <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field> (text)
         <field name="cat">electronics</field>
         <field name="cat">connector</field>                                (multivalue)
         <field name="price">19.95</field>                                  (float)
         <field name="inStock">false</field>                                (boolean)
         <field name="store">45.18014,-93.87741</field>                     (geo point)
         <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field>      (date)
       </doc>
  • 13. A document example (JSON)
       {
         "id": "F8V7067-APL-KIT",
         "name": "Belkin Mobile Power Cord for iPod w/ Dock",
         "cat": ["electronics", "connector"],
         "price": 19.95,
         "inStock": false,
         "store": "45.18014,-93.87741",
         "manufacturedate_dt": "2005-08-01T16:30:25Z"
       }
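As a rough sketch of how the JSON document above could be sent to Solr over HTTP, the following Python snippet builds (but does not send) a POST request to the update handler. The host, port and core name in `SOLR_UPDATE_URL` are assumptions for illustration, and `build_update_request` is a hypothetical helper, not part of any Solr client library.

```python
import json
import urllib.request

# Hypothetical Solr location and core name; adjust to the actual installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/example_core/update?commit=true"

doc = {
    "id": "F8V7067-APL-KIT",
    "name": "Belkin Mobile Power Cord for iPod w/ Dock",
    "cat": ["electronics", "connector"],
    "price": 19.95,
    "inStock": False,
    "store": "45.18014,-93.87741",
    "manufacturedate_dt": "2005-08-01T16:30:25Z",
}


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build a POST request whose body is a JSON array of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request([doc])
# urllib.request.urlopen(request) would send it; it is not called here
# because the sketch does not assume a running Solr server.
print(request.get_method(), request.full_url)
```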
  • 14. A document example (SolrJ library)
       SolrServer solr = new HttpSolrServer("http://…");
       SolrInputDocument doc = new SolrInputDocument();
       doc.setField("id", "F8V7067-APL-KIT");
       doc.setField("name", "Belkin Mobile Power Cord for iPod w/ Dock");
       ...
       solr.add(doc);
       solr.commit(true, true);
  • 15. Text analysis chain 1) character filters: preprocess text (pattern replace, ASCII folding, HTML stripping) 2) tokenizers: split text into smaller units (whitespace, lowercase, word delimiter, standard) 3) token filters: examine, modify or eliminate tokens (stemming, lowercase, stop words)
  • 16. Text analysis chain
       <fieldType name="my-text-type" class="solr.TextField">
         <analyzer type="index">
           <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt" />
           <tokenizer class="solr.WhitespaceTokenizerFactory" />
           <filter class="solr.StopFilterFactory" words="stopwords.txt" />
           <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>
       </fieldType>
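To make the order of the three stages concrete, here is a small Python imitation of the chain configured above: ASCII folding as the character filter, whitespace tokenization, then a stop filter followed by a lowercase filter. It is a sketch of the concept, not Solr's implementation; the contents of `STOP_WORDS` are an assumed stand-in for `stopwords.txt`, and all function names are invented for the example.

```python
import unicodedata

# Assumed sample stop words; the real list would come from stopwords.txt.
STOP_WORDS = {"a", "at", "in"}


def fold_to_ascii(text):
    """Character filter: decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def whitespace_tokenize(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Token filter: drop tokens that exactly match a stop word (case-sensitive)."""
    return [token for token in tokens if token not in stop_words]


def lowercase_filter(tokens):
    """Token filter: lowercase every remaining token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the stages in the same order as the fieldType definition."""
    tokens = whitespace_tokenize(fold_to_ascii(text))
    return lowercase_filter(stop_filter(tokens))


print(analyze("Drinking a latte at Caff\u00e9 Grecco"))
print(analyze("A latte"))
```

Because the stop filter runs before the lowercase filter and compares case-sensitively in this sketch, a capitalized "A" survives in the second call. Swapping the two filters would remove it, which illustrates why filter order matters.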
  • 17. Text analysis result
       Input: #Yummm :) Drinking a latte at Caffé Grecco in SF’s historic North Beach…Learning text analysis
       Output: "#yumm", "drink", "latte", "caffe", "grecco", "sf"/"san francisco", "historic", "north", "beach", "learn", "text", "analysis"
  • 18. Performing queries 1) user enters a query (and specifies other components) 2) query handler 3) analysis (similar to the analysis used for indexing) 4) run search 5) adding components 6) serialization (XML, JSON etc.)
  • 19. Lucene query language ● *:* (→ everything) ● gwdg ● name:gwdg ● name:admin* ● h?ld (→ hold, held) ● name:administrator~ (→ -tor, -tion) ● name:Gesellschaft~0.6 (similarity measure)
  • 20. Lucene query language ● name:Max AND name:Planck ● name:Max OR name:Planck ● name:Max NOT name:Planck ● name:"Max Planck" ● name:("Max Planck" OR Gesellschaft) ● "Max Planck"~3 (within 3 words) → so "Planck Max", "Max Ludwig Planck"
  • 21. Lucene query language ● max planck^10 (weighting) ● price:[10 TO 20] (→ 10..20) ● price:{10 TO 20} (→ 11..19) ● born:[1900-01-01T00:00:00Z TO 1949-12-31T23:59:59Z] (date range)
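Queries like the ones above contain characters (quotes, colons, brackets, spaces) that have to be URL-encoded before they are sent to Solr. A minimal sketch of this, assuming a hypothetical host and core name in `SOLR_SELECT_URL` and a made-up helper `build_select_url`:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr location and core name; adjust to the actual installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/example_core/select"


def build_select_url(query, rows=10, base_url=SOLR_SELECT_URL):
    """Return a select URL with the Lucene query safely URL-encoded."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url + "?" + urlencode(params)


url = build_select_url('name:"Max Planck"~3')
print(url)
# Decoding the query string again gives back the original query.
print(parse_qs(urlsplit(url).query)["q"][0])
```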
  • 22. Date mathematics
       ● indexing with hour granularity: "born": "2012-05-22T09:30:22Z/HOUR"
       ● search by relative time range, e.g. last month: born:[NOW/DAY-1MONTH TO NOW/DAY]
       ● keywords: MINUTE, HOUR, DAY, WEEK, MONTH, YEAR
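The `/HOUR` suffix rounds a timestamp down to the start of its hour. The following Python function imitates only that one rounding step, as an illustration of what the indexed value looks like; `round_down_to_hour` is a name chosen for this sketch and is not Solr code.

```python
from datetime import datetime, timezone


def round_down_to_hour(timestamp):
    """Round an ISO 8601 UTC timestamp (as used by Solr) down to the start of its hour."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    rounded = parsed.replace(minute=0, second=0, microsecond=0)
    return rounded.strftime("%Y-%m-%dT%H:%M:%SZ")


print(round_down_to_hour("2012-05-22T09:30:22Z"))  # 2012-05-22T09:00:00Z
```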
  • 23. Faceted search. Facets give users an overview of the content and help them browse without entering search terms (search theorists: browsing and searching are equally important). ● term/field facet: list terms and counts ● query facet: run queries, return counts ● range facet: split range into pieces
  • 24. Term facets (example from http://europeana.eu, the Europeana portal)
       &facet=true&facet.field=TYPE
       "facet_fields":{ "TYPE":[ "IMAGE", 25334764, "TEXT", 16990647, "VIDEO", 702787, "SOUND", 558825, "3D", 21303 ]
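In the response above, terms and counts alternate in one flat list. A client usually wants them as pairs; a short Python sketch of that step, with `pair_facet_counts` as an illustrative helper name:

```python
# Flat list as returned in "facet_fields" on the slide: term, count, term, count, ...
TYPE_FACET = ["IMAGE", 25334764, "TEXT", 16990647, "VIDEO", 702787, "SOUND", 558825, "3D", 21303]


def pair_facet_counts(flat):
    """Turn [term, count, term, count, ...] into a list of (term, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))


for term, count in pair_facet_counts(TYPE_FACET):
    print(f"{term}: {count}")
```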
  • 25. Term facet. Additional parameters: ● limit, offset → for pagination ● sort (by index or count) → alphabetically or by frequency ● mincount → filters out less frequent terms ● missing → number of documents missing this field ● prefix → such as “http” to display URLs only ● f.[field name].facet.[parameter] → overrides the general setting for that field
  • 26. Query facets
       &facet=true&
       facet.query=price:[* TO 5}&
       facet.query=price:[5 TO 10}&
       facet.query=price:[10 TO 20}&
       facet.query=price:[20 TO 50}&
       facet.query=price:[50 TO *]
       "facet_counts":{ "facet_queries":{ "price:[* TO 5}":6, "price:[5 TO 10}":5, "price:[10 TO 20}":3, "price:[20 TO 50}":6, "price:[50 TO *]":0 },
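The facet queries above use half-open ranges (`[` inclusive, `}` exclusive) so that no price is counted twice. Such a list can be generated from the boundaries instead of being written by hand. A sketch, where `range_facet_queries` is a hypothetical helper and not a Solr API:

```python
def range_facet_queries(bounds, field="price"):
    """Build non-overlapping facet queries: [lower TO upper} for all but the last, [lower TO *] for the last."""
    edges = ["*"] + [str(bound) for bound in bounds] + ["*"]
    queries = []
    for position in range(len(edges) - 1):
        is_last = position == len(edges) - 2
        closing = "]" if is_last else "}"
        queries.append(f"{field}:[{edges[position]} TO {edges[position + 1]}{closing}")
    return queries


for query in range_facet_queries([5, 10, 20, 50]):
    print("facet.query=" + query)
```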
  • 27. Query facets (zooming): from centuries to years. Example: http://pcu.bage.es/ (Catálogo Colectivo de las Bibliotecas de la Administración General del Estado)
  • 28. Range facet
       &facet=true&facet.range=price&facet.range.start=0&facet.range.end=50&facet.range.gap=5
       "facet_ranges":{ "price":{ "counts":[ "0.0", 6, "5.0", 5, "10.0", 0, "15.0", 3, "20.0", 2, "25.0", 2, "30.0", 1, "35.0", 0, "40.0", 0, "45.0", 1 ], "gap":5.0, "start":0.0, "end":50.0 }}}}
  • 29. Hit highlighting
       ?...&hl=true&hl.fl=name&hl.simple.pre=<em>&hl.simple.post=</em>
       "highlighting": { "SP2514N": { (← ID) "name": [ "<em>SpinPoint P120 </em> SP2514N - hard drive - 250 GB - ATA-133"]}
  • 30. More like this… (similar documents). mlt (more like this) handler: ● doc ID ● fields ● boost ● limit ● min length and freq. Example: http://catalog.lib.kyushu-u.ac.jp/en/ (Kyushu University library catalog)
  • 31. More like this (alternative solution) (DATA_PROVIDER:("NIOD")^0.2 OR what:("IMAGE" OR "Amerikaanse Strijdkrachten" OR "Luchtmacht" OR "Steden - Zie ook: Ruimtelijke ordening, Wederopbouw, Dorpen")^0.8) NOT europeana_id:"/2021622/11607
  • 32. Multilingual search
       <filter class="solr.EnglishPossessiveFilterFactory"/>
       <filter class="solr.ArabicNormalizationFilterFactory"/>
       <filter class="solr.ArabicStemFilterFactory"/>
       <filter class="solr.PersianCharFilterFactory"/>
       <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_stop.txt"/>
       <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" />
       <filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
  • 33. Multilingual search strategies
       ● Separate fields by language → title_en:horse OR title_de:horse OR title_hu:horse
       ● Separate collections (core, shard) per language: each core has its own language settings and the same field names → /select?shards=.../english,.../spanish,.../french&q=title:horse
       ● All languages in one field (from Solr 5.0) → title:(es|escuela OR en,es,de|school OR school)
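For the first strategy (one field per language), the client has to expand a search term into one clause per language field. A minimal Python sketch, assuming the `title_<lang>` naming pattern from the slide; `per_language_query` is an illustrative helper, and the term is inserted as-is without escaping special query characters:

```python
def per_language_query(term, languages, field="title"):
    """Expand a term into an OR query over the language-specific fields, e.g. title_en, title_de."""
    return " OR ".join(f"{field}_{language}:{term}" for language in languages)


print(per_language_query("horse", ["en", "de", "hu"]))
```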
  • 34. Multilingual search: query → translation API → rewritten query, e.g. horse → (Hauspferd OR Ló OR Paard OR …)
  • 35. Relevancy. The most important concepts: ● Term frequency (tf): how often a particular term appears in a matching document ● Inverse document frequency (idf): how “rare” a search term is, the inverse of the document frequency (the total number of documents the search term appears in) ● Field normalization factor (field norm): a combination of factors describing the importance of a particular field on a per-document basis
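  • 36. Relevancy
       \[ \mathrm{score}(q,d) = \sum_{t \in q} \bigl( \mathrm{tf}(t \in d) \times \mathrm{idf}(t)^{2} \times t.\mathrm{getBoost}() \times \mathrm{norm}(t,d) \bigr) \times \mathrm{coord}(q,d) \times \mathrm{queryNorm}(q) \]
       where t = term; d = document; q = query; f = field
       \[ \mathrm{tf}(t \in d) = (\text{number of term occurrences in document})^{1/2} \]
       \[ \mathrm{norm}(t,d) = d.\mathrm{getBoost}() \times \mathrm{lengthNorm}(f) \times f.\mathrm{getBoost}() \]
       \[ \mathrm{idf}(t) = 1 + \log\bigl(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\bigr) \]
       \[ \mathrm{coord}(q,d) = \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery} \]
       \[ \mathrm{queryNorm}(q) = 1 / \mathrm{sumOfSquaredWeights}^{1/2} \]
       \[ \mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^{2} \times \sum_{t \in q} \bigl( \mathrm{idf}(t) \times t.\mathrm{getBoost}() \bigr)^{2} \]
       see: Solr in Action, p. 67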
  • 38. debug
       "explain":{
         "6H500F0":"
       1.209934 = (MATCH) sum of:
         0.6588537 = (MATCH) weight(text:hard in 2) [DefaultSimilarity], result of:
           0.6588537 = score(doc=2,freq=2.0), product of:
             0.73792744 = queryWeight, product of:
               3.3671236 = idf(docFreq=2, maxDocs=32)
               0.21915662 = queryNorm
             0.8928435 = fieldWeight in 2, product of:
               1.4142135 = tf(freq=2.0), with freq of:
                 2.0 = termFreq=2.0
               3.3671236 = idf(docFreq=2, maxDocs=32)
               ...
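The numbers in the explain output of slide 38 can be reproduced from the tf and idf definitions on the relevancy slide (square root of the term frequency, and 1 + natural log of numDocs / (docFreq + 1)). The sketch below recomputes tf, idf, queryWeight and the term score; queryNorm and fieldWeight are copied from the explain output rather than derived, because the field norm is elided there.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the number of occurrences."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Values copied from the explain output (not derived here).
QUERY_NORM = 0.21915662
FIELD_WEIGHT = 0.8928435

query_weight = idf(doc_freq=2, num_docs=32) * QUERY_NORM
term_score = query_weight * FIELD_WEIGHT

print(round(tf(2.0), 7))                 # compare with tf(freq=2.0) in the explain output
print(round(idf(2, 32), 7))              # compare with idf(docFreq=2, maxDocs=32)
print(round(query_weight, 7), round(term_score, 7))
```

Small differences in the last digits are expected, since Lucene computes these values in single precision.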
  • 39. References ● http://lucene.apache.org/solr/ ● Grainger & Potter: Solr in Action ● https://lucidworks.com/blog/ ● http://blog.sematext.com/ ● http://solr.pl/ ● https://www.packtpub.com/all?search=solr ● http://www.slideshare.net/treygrainger