Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr.
3. Who am I?
• Principal of OpenSource Connections
- Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
5. Telling some war stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
6. Not an intro to SolrCloud/ElasticSearch!
• Great round table discussion yesterday led by Mark Miller
• SolrCloud 4 Architecture talk in this room NEXT!
• Solr4 vs ElasticSearch at 4:45 PM TODAY!
7. Background for Client X's Project
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• 100's of millions of documents to search
• Aggressive timeline.
• All the data must be searched per query.
• Limited selection of tools available.
• On the Solr 3.x line
8. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
9. Boy meets Girl Story
[Diagram: content files and metadata flow through an ingest pipeline into multiple Solr cores]
17. Make it easy to change sharding
public void run(Map options, List<SolrInputDocument> docs)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {
  IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
      "com.o19s.solr.ModShardIndexStrategy").newInstance();
  indexStrategy.configure(options);
  for (SolrInputDocument doc : docs) {
    indexStrategy.addDocument(doc);
  }
}
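The slide only shows the strategy being loaded reflectively, not the strategy itself. A minimal sketch of what a mod-based strategy might look like, assuming an interface with `configure`/`addDocument` as implied by the call site; a plain `Map` stands in for `SolrInputDocument` to keep the sketch self-contained:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a mod-shard routing strategy. The class name matches
// the reflection call above; the method shapes are assumptions.
public class ModShardIndexStrategy {
    private List<String> shardUrls;

    public void configure(Map<String, String> options) {
        // e.g. options.get("shards") = "http://search1:8983/solr,http://search2:8983/solr"
        shardUrls = List.of(options.get("shards").split(","));
    }

    // Route deterministically: the same id always lands on the same shard,
    // so re-indexing a document overwrites the old copy instead of duplicating it.
    public String shardFor(String docId) {
        return shardUrls.get(Math.floorMod(docId.hashCode(), shardUrls.size()));
    }

    public void addDocument(Map<String, Object> doc) {
        String target = shardFor((String) doc.get("id"));
        // A real implementation would buffer docs and POST batches to `target`.
        System.out.println(doc.get("id") + " -> " + target);
    }
}
```

Swapping in a different class name (shard by month, week, or hour instead of mod) is then a one-line change, which is the point of the reflective loading above.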
18. Separate JVM from Solr Cores
• Step 1: Fire up empty Solrs on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create cores (curl "http://search1.o19s.com:8983/solr/admin?action=create&name=run2").
• Step 4: Create an "aggregator" core, passing in the URLs of the other cores (&property.shards=).
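Step 3 can be scripted rather than hand-typed. A small sketch that generates the same admin "create" URL from the slide for each server; the second host name is illustrative, and opening the URLs (curl, xargs, or `HttpURLConnection`) is left out:

```java
import java.util.List;

// Build the core-creation admin URL from the slide for every search server.
public class CoreCreator {
    static String createUrl(String host, String coreName) {
        return "http://" + host + ":8983/solr/admin?action=create&name=" + coreName;
    }

    public static void main(String[] args) {
        for (String host : List.of("search1.o19s.com", "search2.o19s.com")) {
            // Pipe this output into `xargs curl` to create the cores in one go.
            System.out.println(createUrl(host, "run2"));
        }
    }
}
```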
23. Don’t Move Files
• SCPing files across machines is slow and error prone.
• An NFS share is a single point of failure.
• Clustered file systems like GFS (Global File System) can have "fencing" issues.
• HDFS shines here.
• ZooKeeper shines here.
28. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
29. Using Solr as a key/value store
[Diagram: the ingest pipeline reads metadata and content files, with a separate Solr instance acting as a key/value cache alongside the sharded Solr cores]
30. Using Solr as a key/value store
• Thousands of queries per second without real time get:
http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
• How fast with real time get?
http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
31. Push schema definition to the application
• Not “schema less”
• Just different owner of schema!
• Schema may have common set of fields like
id, type, timestamp, version
• Nothing required.
q=intensity_i:[0 TO 70]&fq=TYPE:streetlamp_monitor
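The query above relies on Solr's stock dynamic fields: `*_i` maps to an int, `*_s` to a string, and so on, so the application invents field names at index time and only a handful of common fields stay fixed. A sketch, with illustrative field names, of the document an application might build:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// The application owns the schema: it picks dynamic-field suffixes at index
// time instead of editing schema.xml per document type.
public class StreetlampDoc {
    static Map<String, Object> build(String id, int intensity) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);                     // one of the few common fields
        doc.put("TYPE", "streetlamp_monitor"); // discriminator used in the fq
        doc.put("intensity_i", intensity);     // dynamic int field, range-queryable
        return doc;
    }
}
```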
32. Don't do expensive things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
37. Beware JavaBin
[Diagram: the ingest pipeline with a Solr 3.4 key/value cache and Solr 4 search cores; which SolrJ version do I use?]
38. No JavaBin
Give me /update/avro!
• Avoid Jarmaggeddon
• Reflection? Ugh.
39. Avro!
• Supports serialization of data readable from multiple languages
• It's smart XML, w/o the XML!
• Handles forward and reverse versions of an object
• Compact and fast to read.
41. Tika as a pipeline?
• Auto-detects content type
• Metadata structure has all the key/value pairs needed for Solr
• Allows us to scale up with the Behemoth project.
42. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
43. Upgrade Lucene Indexes Easily
• Don't reindex!
• Try out new versions of Lucene-based search engines.
David Lyle
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
55. Building a Patents Index
[Chart: machine count vs. indexing time; 1 machine: 5 days, 5 machines: 3 days, ~300 machines: 30 minutes]
What happens when we want to index 2 million patents in 30 minutes?
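The jump from 5 days to 30 minutes is mostly arithmetic. A back-of-the-envelope sketch; the per-machine rate is derived from the "1 machine, 5 days" data point, so treat it as an assumption (it lands near, not exactly at, the ~300 machines on the chart, the gap being coordination overhead and headroom):

```java
// Rough scaling math behind the chart, assuming perfectly linear scale-out.
public class ScaleMath {
    public static void main(String[] args) {
        long docs = 2_000_000L;
        double oneMachineSecs = 5 * 24 * 3600.0;             // 1 machine took 5 days
        double perMachineRate = docs / oneMachineSecs;       // ~4.6 docs/sec/machine
        double targetSecs = 30 * 60.0;                       // the 30-minute goal
        double machinesNeeded = oneMachineSecs / targetSecs; // doc count cancels out: 240
        System.out.printf("~%.1f docs/s per machine, ~%.0f machines for 30 minutes%n",
                perMachineRate, machinesNeeded);
    }
}
```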
56. Amazon AWS is Good but...
• EC2 is costly
• Issues of access to internal data
• Firewall and security
57. Can we Cycle Scavenge?
• Data center is heavily used 9 to 5 EST.
• Lesser, but still significant, load 8 to 10 PM EST.
• Minimal CPU load at night.
• Amazon Spot Pricing for EC2
• SETI@home
• JavaGenes - genetics processing
• Condor Platform (http://research.cs.wisc.edu/condor/)
58. Balancing Load
[Chart: production load peaks during business hours; batch jobs fill the idle capacity overnight and after ~9 PM]
59. Do I need Failover?
• Can I build quickly?
• Do I have a reliable cluster of servers?
• Am I spread across data centers?
• Failover is sooo 90's....
60. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
64. Thank you!
Questions?
Nervous about speaking up? Ask me about it later!
• epugh@o19s.com
• @dep4b
• www.opensourceconnections.com
Editor's Notes
Search was the original big data problem. Now search is back, but with a new, cooler name, "Big Data", and search is the dominant metaphor for exposing big data sets to business users to make actual decisions. Big Data is rapidly changing fields such as healthcare, and I maintain that the next revolution in healthcare won't be via a doctor wielding a scalpel, but via a doctor wielding a mouse.
SOLR-284, back in July 07, was a first cut at a content extraction library before Tika came along.
And I love agile development processes. I think of agile as business -> requirements -> development -> testing -> systems administration.
And I don't mean this as a shot against Hadoop, but with the right hardware you can get a lot done in bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today building large scaled-out ingestors.
Notice our property style? Made it easy to read in properties in both Bash and Java!
Try sharding at different sizes using mod. Try sharding by month, or week, or hour, depending on your volume of data.
We had huge left-over "enterprise" boxes with ginormous amounts of RAM and CPU. We were IO bound.
The -verbose:gc and -XX:+PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
Again, horse racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don't forget to store everything!)
You have many fewer Solrs than you do indexer processes.
Jukka did a great presentation yesterday.
Dollar Tree makes crap; stores are always empty or missing items. You don't want your indexing to be like that. The Space Shuttle cost $500 million to launch every time. You don't want your indexing process to be like launching the Space Shuttle.
Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving.
HAL 9000 misbehaved. This matters especially if you are on a cloud platform: they implement their servers on the cheapest commodity hardware.
Kaa the snake from The Jungle Book hypnotizing Mowgli. Danah Boyd, among others, has said that Big Data sometimes throws out thousands of years