Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain
2. Building a Large Scale SEO/SEM
Application with Apache Solr
Rahul Jain
Freelance Big-data/Search Consultant
@rahuldausa
dynamicrahul2020@gmail.com
3. About Me…
• Freelance Big-data/Search Consultant based out of Hyderabad, India
• Provides consulting services and solutions for Solr, Elasticsearch, and other big-data
technologies (Apache Hadoop and Spark)
• Organizer of two Meetup groups in Hyderabad
• Hyderabad Apache Solr/Lucene
• Big Data Hyderabad
4. What I am going to talk about
Sharing our experience of working on search in this application…
• The issues we faced and the lessons learned
• How we do database import and batch indexing
• Techniques to scale and improve search latency
• The system architecture
• Some tips for tuning Solr
• Q&A
5. What does the application do?
§ Keyword research and competitor analysis tool for SEO (Search Engine Optimization) and SEM
(Search Engine Marketing) professionals.
§ End users search for a keyword or a domain and get all the insights about it.
§ Aggregates data for the top 50 results of Google and Bing across 3 countries, for 80 million+ keywords.
§ Provides key metrics like keywords, CPM (cost per mille), CPC (cost per click), competitors' details, etc.
6. (Diagram) Data pipeline: web crawling, ad network APIs, and databases feed Data Collection,
followed by Data Processing & Aggregation.
*All trademarks and logos belong to their respective owners.
7. High level Architecture
(Diagram) Internet → Load Balancer (HAProxy) → Web Server Farm: PHP app (Nginx) → App Server
(Tomcat), backed by a Managed Cache Cluster (Redis), a Database (MySQL), and a Cluster Manager
(custom, built on ZooKeeper). Search requests go from the app server to a Search Head (Solr),
which fans them out to the Apache Solr servers. Numbered flow in the diagram: the app server does
an IDs lookup in the cache and fetches the cluster mapping to determine which month's cluster to
query, then sends the search to the search head.
Search Head:
• Is a Solr server which does not contain any data.
• Makes a distributed search request and aggregates the search results.
• Also works as a load balancer for search queries.
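Since the search head holds no data of its own, it relies on Solr's standard distributed search: the "shards" request parameter lists the data-bearing cores, and the receiving Solr node merges their partial results. A minimal SolrJ sketch of such a request (host names, core names, and the query are illustrative, not from the talk):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SearchHeadClient {
      public static void main(String[] args) throws Exception {
          // The search head is a plain Solr core with no documents of its own.
          HttpSolrClient searchHead = new HttpSolrClient.Builder(
                  "http://searchhead-us:8983/solr/keywords").build();

          SolrQuery q = new SolrQuery("keyword:insurance");
          // "shards" tells Solr which data-bearing cores to fan the query out to;
          // the search head merges and aggregates their partial results.
          q.set("shards",
                "solr1:8983/solr/keywords_shard1,"
              + "solr2:8983/solr/keywords_shard2,"
              + "solr3:8983/solr/keywords_shard3");
          q.setRows(50);

          QueryResponse rsp = searchHead.query(q);
          System.out.println("hits: " + rsp.getResults().getNumFound());
          searchHead.close();
      }
  }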
8. Search - Key challenges
§ After processing, we have ~40 billion records every month in the MySQL database, including:
§ 80+ million keywords
§ 110+ million domains
§ 1 billion+ URLs
§ Multiple keywords for a single URL, and vice versa
§ Multiple tables, varying in size from 50 million to 12 billion records
§ Search is a core functionality, so all records (selected fields) must be indexed in Solr
§ Page load time (including all 24 widgets, max) < 1.5 sec (critical)
§ But… we need to load this data only once every month for all countries, so we can
do batch indexing, and as this data never changes, we can apply caching.
10. Data Import from MySQL to Solr
• Solr's DataImportHandler is awesome but quickly becomes pretty slow for large volumes.
• We wrote a custom data importer that reads (pulls) documents from the database and pushes them
asynchronously into Solr.
(Diagram) The custom Data Importer pulls from the database and pushes to multiple Solr servers.

Database table layout (ID is a primary/unique key with an index):
  ID        Columns
  1         Record 1
  2         Record 2
  …         …
  5000      Record 5000
  *6000     Record 6000    (*non-sequential ID)
  …         …
  n         Record n

Rather than using the database's "limit" function, the importer queries by range of IDs (primary key):
  select * from table t where ID >= 1 and ID <= 2000
  select * from table t where ID >= 2001 and ID <= 4000

The importer fetches 10 such batches at a time from the MySQL database, each having 2k records; each
call is stateless. It then combines these database batches into a bigger batch (10k documents) and
flushes it to the selected Solr servers asynchronously, in a round-robin fashion.

Downsides:
• The table must have a primary key, and creating one in the database can be slow.
• This approach requires a larger number of calls if the IDs are not sequential.
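A hedged, single-threaded sketch of such a range-based importer, assuming JDBC for the reads and SolrJ's ConcurrentUpdateSolrClient for the asynchronous round-robin pushes. Connection strings, table/column names, and sizes are illustrative; the real importer additionally fetches 10 ranges concurrently and regroups them into 10k-document batches:

  import java.sql.*;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class RangeImporter {
      public static void main(String[] args) throws Exception {
          Connection db = DriverManager.getConnection(
                  "jdbc:mysql://db-host/seo", "user", "password");
          // One async client per Solr server; documents are buffered and
          // flushed in the background, so add() rarely blocks.
          List<ConcurrentUpdateSolrClient> solrServers = List.of(
              new ConcurrentUpdateSolrClient.Builder("http://solr1:8983/solr/keywords").build(),
              new ConcurrentUpdateSolrClient.Builder("http://solr2:8983/solr/keywords").build());

          long batchSize = 2000;
          long maxId = 12_000L;   // in practice: select max(id) from the table
          int target = 0;
          for (long lo = 1; lo <= maxId; lo += batchSize) {
              long hi = lo + batchSize - 1;
              // Stateless range query on the primary key instead of LIMIT/OFFSET.
              try (PreparedStatement ps = db.prepareStatement(
                       "select id, keyword, cpc from keywords where id >= ? and id <= ?")) {
                  ps.setLong(1, lo);
                  ps.setLong(2, hi);
                  List<SolrInputDocument> docs = new ArrayList<>();
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) {
                          SolrInputDocument doc = new SolrInputDocument();
                          doc.addField("id", rs.getLong("id"));
                          doc.addField("keyword", rs.getString("keyword"));
                          doc.addField("cpc", rs.getDouble("cpc"));
                          docs.add(doc);
                      }
                  }
                  // Round-robin each batch across the Solr servers.
                  if (!docs.isEmpty()) {
                      solrServers.get(target).add(docs);
                      target = (target + 1) % solrServers.size();
                  }
              }
          }
          for (ConcurrentUpdateSolrClient c : solrServers) { c.commit(); c.close(); }
          db.close();
      }
  }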
12. Indexing: all tables into a single big index
• All tables in the same index, distributed over multiple Solr cores and Solr servers (Java processes)
• Commit on every 120 million records or every 15 minutes, whichever comes first
• Disabled soft-commit and updates (overwrite=false), as each call to addDocument calls
updateDocument under the hood
• But still… indexing was slow (due to being sequential for all tables) and we had to stop it
after 2 days
• Search was also awfully slow (on the order of minutes), even from cache after warm-up, across
the resulting bunch of ~100 shards
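A minimal sketch of disabling overwrites during batch loading. overwrite is a standard Solr update parameter: with overwrite=false, Solr appends the document without first deleting any existing document with the same ID, so uniqueness becomes the loader's responsibility. Client and core names are illustrative:

  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.request.UpdateRequest;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkAdd {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr = new HttpSolrClient.Builder(
                  "http://solr1:8983/solr/bigindex").build();

          UpdateRequest req = new UpdateRequest();
          // overwrite=false skips the delete-by-id step of updateDocument,
          // which is safe for immutable, load-once data and noticeably
          // faster for bulk indexing.
          req.setParam("overwrite", "false");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "url-12345");
          doc.addField("url", "http://example.com/");
          req.add(doc);

          req.process(solr);   // no soft-commit; commits are issued explicitly,
          solr.commit();       // e.g. every 120M records or every 15 minutes
          solr.close();
      }
  }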
13. Creating a Distributed Index for each table
How many shards?
• Each table has a varying number of records, from 50 million to 12 billion.
• If we choose 100 million per shard (core), then for 12 billion records we need to query
120 shards, which is awfully slow.
• On the other side, if we choose 500 million/shard, a table with 500 million records will have
only 1 shard: high latency, high memory usage (heap), and no distributed search*.
• Hybrid approach: determine the number of shards based on the max number of records in the table.
• We benchmarked to find the sweet spot for max documents (records) per shard with the most
optimal search latency.
• Commit at the end of each table.
Records vs. max shards per table:
  Max number of records in table    Max number of shards (cores) allowed
  < 100 million                     1
  100-300 million                   2
  < 500 million                     4
  < 1 billion                       6
  1-5 billion                       8
  > 5 billion                       16
* Distributed search improves latency but may not always be faster, as search latency is limited
by the time taken by the last shard to respond.
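The hybrid approach boils down to a lookup from table size to shard count. A minimal sketch of that policy, using the thresholds from the table above:

  public class ShardPolicy {
      /** Max number of shards (cores) allowed for a table, per the table above. */
      static int shardsFor(long records) {
          final long M = 1_000_000L, B = 1_000_000_000L;
          if (records < 100 * M) return 1;
          if (records < 300 * M) return 2;
          if (records < 500 * M) return 4;
          if (records < 1 * B)   return 6;
          if (records <= 5 * B)  return 8;
          return 16;
      }

      public static void main(String[] args) {
          // A 12-billion-record table gets 16 shards, i.e. ~750M docs per shard.
          System.out.println(shardsFor(12_000_000_000L));
      }
  }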
14. It worked fine but one day suddenly….
java.lang.OutOfMemoryError: Java heap space
• All Solr servers crashed.
• We restarted them, but they kept crashing randomly every other day.
• Took a heap dump and realized that it was due to the field cache.
• Found a solution: doc values, and we have never had this issue again to date.
15. Doc Values (Life Saver)
• Disk-based field data, a.k.a. doc values
• Document-to-value mapping built at index time
• Stores field values on disk (or in memory) in a column-stride fashion
• Better compression than the field cache
• A replacement for the field cache (though not completely)
• Quite suitable for custom scoring, sorting, and faceting
References:
• Old article (but Good): http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
• https://cwiki.apache.org/confluence/display/solr/DocValues
• http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/doc-values.html
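Doc values are switched on per field in the Solr schema. A minimal sketch of what that looks like in schema.xml (field names and types are illustrative; fields must be reindexed after the change):

  <!-- schema.xml: enable docValues on fields used for sorting/faceting -->
  <field name="cpc"     type="tfloat" indexed="true" stored="true" docValues="true"/>
  <field name="country" type="string" indexed="true" stored="true" docValues="true"/>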
17. Partitioning
• 3-level partitioning: by month, country, and table name.
• Each month has its own cluster and a cluster manager.
• Latency and throughput are a tradeoff; you can't have both at the same time.

(Diagram) Internet → Load Balancer → Web Server Farm → App Servers → per-country Search Heads
(US, UK, AU) → per-month Solr clusters (e.g. Cluster 1 for a month such as January, Cluster 2 for
another month such as February), each consisting of multiple nodes running several Solr instances.
The app server fetches the cluster mapping from the Master Cluster Manager and makes a request to
the search head with the respective Solr cores for that country, month, and table. Each cluster's
cluster manager runs as an active/passive pair (A: active, P: passive) with real-time sync.

Load profile: 1 user → 24 UI widgets → 24 Ajax requests → 41 search requests.
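The talk does not show the cluster manager's internals, so here is a hypothetical sketch of the lookup it performs: resolving (month, country, table) to the list of Solr cores a search head should query. Core names and the nesting are illustrative; in production the mapping lives in ZooKeeper:

  import java.util.List;
  import java.util.Map;

  public class ClusterMapping {
      // month -> country -> table -> list of core URLs; served by the
      // cluster manager and backed by ZooKeeper in the real system.
      private final Map<String, Map<String, Map<String, List<String>>>> mapping;

      public ClusterMapping(Map<String, Map<String, Map<String, List<String>>>> m) {
          this.mapping = m;
      }

      /** Cores the search head should fan a query out to. */
      public List<String> coresFor(String month, String country, String table) {
          return mapping.get(month).get(country).get(table);
      }

      public static void main(String[] args) {
          ClusterMapping cm = new ClusterMapping(Map.of(
              "jan", Map.of("us", Map.of("keywords", List.of(
                  "solr1:8983/solr/jan_us_keywords_s1",
                  "solr2:8983/solr/jan_us_keywords_s2")))));
          // Joined with commas, this becomes the "shards" param of the query.
          System.out.println(String.join(",", cm.coresFor("jan", "us", "keywords")));
      }
  }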
18. Index Optimization Strategy
• Running optimization on ~200+ Solr cores is very, very time consuming.
• Solr cores with a bigger index size (~70GB) have 2500+ segments, due to the higher merge factor
used while indexing.
• Optimization can't be run in parallel on all cores of a single machine, as it is heavily
dependent on CPU and disk I/O.
• Merging large segments down to a very small number is very time consuming and can take up to
3x the index size on disk.
• On the other side, a small number of segments improves performance drastically, so we need to
strike a balance.

(Diagram) The Solr Optimizer fetches the cluster mapping (the list of all cores) from the Staging
Cluster Manager. Optimizing a Solr core into a very small number of segments takes a huge amount
of time, so we do it iteratively.

Algo: choose at most 3 cores per machine to optimize in parallel, starting with the smallest
index. From the index size and number of documents, determine the max number of segments allowed,
then reduce the current segment count to 0.90x in each run until the target is reached.

Once optimization and cache warm-up are done, the optimizer pushes the cluster mapping to the
Production Cluster Manager, making all indices live.

*As per our observation for our data, the optimization process takes ~42-46 seconds per GB, and we
need to do it for 4.6TB (including all boxes), the total Solr index size for a single month.
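A hedged sketch of the iterative optimize loop with SolrJ. The 0.90 reduction factor is from the slide; the core URL and target segment count are illustrative:

  import org.apache.solr.client.solrj.impl.HttpSolrClient;

  public class IterativeOptimizer {
      /**
       * Shrink a core's segment count stepwise: each optimize() call asks for
       * 90% of the current target, instead of forcing the final count in one shot.
       */
      static void optimizeIteratively(String coreUrl, int currentSegments,
                                      int maxSegmentsAllowed) throws Exception {
          try (HttpSolrClient solr = new HttpSolrClient.Builder(coreUrl).build()) {
              int target = currentSegments;
              while (target > maxSegmentsAllowed) {
                  target = Math.max(maxSegmentsAllowed, (int) (target * 0.90));
                  // optimize(waitFlush, waitSearcher, maxSegments)
                  solr.optimize(true, true, target);
              }
          }
      }

      public static void main(String[] args) throws Exception {
          optimizeIteratively("http://solr1:8983/solr/jan_us_keywords_s1", 2500, 32);
      }
  }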
20. External Caching
• In distributed search, for a repeated query request, all Solr servers need to be hit, even
though the result is served from Solr's cache. This increases search latency while lowering
throughput.
• Solution: cache the most frequently accessed query results in the app layer (LRU-based eviction).
• We use Redis for caching.
• All complex aggregation queries' results, once fetched from multiple Solr servers, are served
from the cache on subsequent requests.
Why Redis…
• Advanced in-memory key-value store
• Insanely fast: response time on the order of 5-10 ms
• Provides cache behavior (set, get) with advanced data structures like hashes, lists, sets,
sorted sets, bitmaps, etc.
• http://redis.io/
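A minimal cache-aside sketch of this app-layer cache using the Jedis client. The key scheme, TTL, and fetch function are illustrative; the talk specifies LRU eviction rather than a fixed TTL:

  import java.util.function.Supplier;
  import redis.clients.jedis.Jedis;

  public class QueryCache {
      private final Jedis redis = new Jedis("cache-host", 6379);

      /** Serve a serialized result from Redis, or compute it from Solr and cache it. */
      String getOrCompute(String cacheKey, Supplier<String> solrFetch) {
          String cached = redis.get(cacheKey);
          if (cached != null) {
              return cached;                          // hit: no Solr servers touched
          }
          String result = solrFetch.get();            // miss: run the distributed query
          redis.setex(cacheKey, 30 * 24 * 3600, result); // data changes only monthly
          return result;
      }
  }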
21. Hardware
• We use bare-metal, dedicated servers for Solr, for the reasons below:
1. Performance gain (with virtual servers, performance dropped by ~18-20%)
2. Better value of computing power per $ spent
• 2.6GHz, 32 cores (4x8 cores), 384GB RAM, 6TB SAS 15k (RAID 10)
• 2.6GHz, 16 cores (2x8 cores), 192GB RAM, 4TB SAS 15k (RAID 10)
• Since the index size is 4.6TB/month, we want to cache more data in the disk cache, hence the
bigger RAM.
SSD vs SAS
1. SSD: peak indexing rate (MySQL to Solr): 330k docs/sec (each doc ~100-125 bytes)
2. SAS 15k: 182k docs/sec (a drop of ~45%)
3. SAS 15k is considerably cheaper than SSD for bigger hard disks.
4. We are using SAS 15k as it is cost effective, but we have plans to move to SSD in the future.
22. Conclusion : Key takeaways
General:
• Understand the characteristics of the data and partition it well.
Cache:
§ Spend time analyzing cache usage and tune the caches; a cache hit is 10x-50x faster.
§ Always use a filter query (fq) wherever possible, as it improves performance via the filter
cache (see the sketch after this list).
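For instance, moving an exact-match constraint out of q and into fq lets Solr cache it in the filter cache and reuse it across queries (a sketch; field names are illustrative):

  import org.apache.solr.client.solrj.SolrQuery;

  public class FilterQueryExample {
      public static void main(String[] args) {
          SolrQuery q = new SolrQuery("keyword:insurance");
          // fq results are cached in the filter cache and reused by any query
          // carrying the same filter, unlike constraints folded into q.
          q.addFilterQuery("country:us");
          q.addFilterQuery("month:jan");
          System.out.println(q);
      }
  }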
GC:
§ Keep your JVM heap size low (proportional to the machine's RAM), leaving enough RAM for the
kernel, as a bigger heap leads to longer GC pauses. 4GB to 8GB is quite a good range for heap
allocation, but we use 12GB/16GB.
§ Keep an eye on garbage collection (GC) logs, especially full GCs.
Tuning Params:
§ Don't use soft commit if you don't need it, especially during batch loading.
§ Always explore tuning Solr for high performance, e.g. ramBufferSizeMB, mergeFactor, and
HttpShardHandler's various configurations.
§ Use hashes in Redis to minimize memory usage (see the sketch below).
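On that last point, a hedged sketch of the Redis hash trick: grouping many small values under one hash key, which Redis stores in a compact encoding while the hash stays small, so it uses far less memory than one top-level key per value (key scheme is illustrative):

  import redis.clients.jedis.Jedis;

  public class RedisHashExample {
      public static void main(String[] args) {
          try (Jedis redis = new Jedis("cache-host", 6379)) {
              // One hash per keyword instead of one top-level key per metric:
              // small hashes get a compact internal encoding in Redis.
              redis.hset("kw:12345", "cpc", "1.25");
              redis.hset("kw:12345", "cpm", "4.10");
              System.out.println(redis.hget("kw:12345", "cpc"));
          }
      }
  }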
Read the whole experience for more detail:
http://rahuldausa.wordpress.com/2014/05/16/real-time-search-on-40-billion-records-month-with-solr/