In this talk, I introduce Alluxio, the fastest-growing open source project in the big data ecosystem, and show how to leverage it to optimize Solr performance. I'll begin with a brief introduction to how Alluxio works and why it's interesting for the Solr community. Next, I describe how to run Solr on Alluxio and cover basic integration scenarios. Lastly, I provide some performance comparisons between running Solr on Alluxio vs. a local FS and HDFS. Attendees will come away with a new toolset to help them use Solr to tackle a wide array of big data problems.
Running Solr in the Cloud at Memory Speed with Alluxio
1. Running Solr at Memory Speed with Alluxio
Timothy Potter
Lucidworks
2. Agenda
⢠Overview of Alluxio
⢠Running Solr on Alluxio
⢠Interesting Use Cases
⢠Futures
⢠Questions?
3. Cool things I've learned about Alluxio …
• Fastest growing open source project in the big data space
• Baidu reported having an Alluxio cluster with 1,000 workers and 50 TB of RAM … in Feb 2016!
• Brings cloud storage into the compute layer; data access at memory speed
• No need to move / migrate data into Alluxio; just mount the under storage!
• Apache 2.0 licensed, but also has a commercial offering with support if needed
4. Alluxio Basics
• A memory-centric virtual distributed storage system
• Hadoop FileSystem API: alluxio://…
• Supports everything from a single node up to massive clusters
• Uses ZK for HA; master/worker model
• Supports many popular under storage systems: HDFS, S3, Azure Blob store, GCS, GlusterFS …
• Alluxio FUSE to mount it as a FS on Linux
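Because Alluxio implements the Hadoop FileSystem API, any HDFS-compatible client can address it by URI. As a quick sanity check, something like the following should work (a sketch: it assumes the Alluxio client JAR is on the Hadoop classpath and fs.alluxio.impl=alluxio.hadoop.FileSystem is set in core-site.xml; master:19998 is the default master address used throughout these slides):

# list the Solr root directory in Alluxio through the standard Hadoop CLI
hadoop fs -ls alluxio://master:19998/solr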
5. Configure Solr to use Alluxio
• mkdir or mount the Solr root dir in Alluxio:
bin/alluxio fs mkdir /solr
• Set start-up options in bin/solr.in.sh:
solr.directoryFactory=HdfsDirectoryFactory
solr.lock.type=hdfs
solr.hdfs.home=alluxio://master:19998/solr
solr.hdfs.confdir=/path/hadoop-conf
• Add a core-site.xml to set:
fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem
fs.alluxio.impl.disable.cache=true
alluxio.user.file.writetype.default=CACHE_THROUGH
• Add the Alluxio client JAR to the Solr classpath:
copy alluxio-core-client-runtime-1.5.0-jar-with-dependencies.jar to server/solr-webapp/webapp/WEB-INF/lib/
• Upconfig the alluxio configset to ZK:
bin/solr zk upconfig -n alluxio -d server/solr/configsets/alluxio/conf
see: http://bit.ly/2y33wQs
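Pulling those pieces together, a minimal sketch of what the two config files might look like (the host master, port 19998, and /path/hadoop-conf are the placeholders from this slide):

# bin/solr.in.sh -- pass the settings above to Solr as system properties
SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=alluxio://master:19998/solr \
  -Dsolr.hdfs.confdir=/path/hadoop-conf"

<!-- /path/hadoop-conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.AbstractFileSystem.alluxio.impl</name>
    <value>alluxio.hadoop.AlluxioFileSystem</value>
  </property>
  <property>
    <name>fs.alluxio.impl.disable.cache</name>
    <value>true</value>
  </property>
  <property>
    <name>alluxio.user.file.writetype.default</name>
    <value>CACHE_THROUGH</value>
  </property>
</configuration>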
6. Solr on Alluxio Tips & Tricks
• Run an Alluxio worker on each Solr node
• Write mode should be CACHE_THROUGH to ensure Solr files get persisted to the under storage, e.g. S3
• Admin can "pin" an index directory to ensure it stays cached in memory
• Set a TTL on index directories that can be freed from memory after a given timeframe
• The load command moves data from the under storage into Alluxio, such as after restoring an index from backup (example commands below)
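For reference, a sketch of how those last three tips map onto the Alluxio fs shell (the /solr/... paths are hypothetical; in Alluxio 1.x, setTtl takes milliseconds):

# pin a hot index directory so its blocks stay cached in memory
bin/alluxio fs pin /solr/collection1/core_node1/data/index
# free an aged-out partition's index from memory after 7 days (TTL in ms)
bin/alluxio fs setTtl /solr/signals_2017_09/data/index 604800000
# pull a restored index from the under storage back into Alluxio
bin/alluxio fs load /solr/restored_core/data/index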
7. Use Case 1: Replace the OS cache with Local under FS
• Indexing performance: ~5M docs, ~4K docs/sec, <1% diff vs. local FS, 8GB index on disk
• Query performance (9GB index, 5M docs, r4.xlarge):
* NOTE: ymmv! Utterly unscientific experiments to get a feel for the technology
Metric         Alluxio    MMap/SSD   HDFS
QPS            36         42         20
Max QTime      2212 ms    1789 ms    5612 ms
Stddev QTime   335 ms     353 ms     609 ms
Median QTime   70 ms      9 ms       187 ms
75% QTime      372 ms     383 ms     754 ms
95% QTime      972 ms     996 ms     1723 ms
99% QTime      1426 ms    1349 ms    2599 ms
8. Use Case 2: Use cloud storage as the under FS (S3, GCS, Azure)
• Indexing rate: ~3,650 docs/sec to S3 vs. ~4,000 on local
• As expected, query performance metrics are nearly identical
• Mount the cloud storage system to a directory in Alluxio:
bin/alluxio fs mount alluxio://ec2-34-196-176-70.compute-1.amazonaws.com:19998/s3 s3a://sstk-dev/alluxio
• Deploy cloud instances with lots of memory, e.g. r4's in EC2
• Use tiered storage to take advantage of the ephemeral disks (fast SSDs); see the sample properties after the diagram
• "pin" specific indexes for better performance guarantees
[Diagram: Solr reads from Alluxio (memory) at 10 to 100 Gbps, while Alluxio reaches the S3 or GCS under storage at 100 Mbps to 10 Gbps]
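A sketch of a two-tier worker layout in conf/alluxio-site.properties (Alluxio 1.x property names; the mount points and quotas are hypothetical values for an r4-class instance with ephemeral SSD):

# tier 0: RAM disk (hot); tier 1: ephemeral SSD (warm)
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=24GB
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
alluxio.worker.tieredstore.level1.dirs.quota=200GB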
9. Use Case 3: Time-based Partitioning
• Fits nicely with write-once indexes: signals, logs
• Use Alluxio's TTL feature to "free" indexes on aged-out partitions
• Tiered storage also allows you to have hot (memory), warm (SSD), cool (HDD), and cold (S3) partitions
• Allocators and evictors re-arrange blocks between tiers; it's easy to plug in advanced strategies (see the properties after the diagram)
[Diagram: recent Solr partitions (9-15, 9-14) live in Alluxio (memory), partition 9-13 sits on the Alluxio (SSD) tier, and cold partitions rest in S3 or GCS]
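For example, the built-in strategies are pluggable via worker properties along these lines (Alluxio 1.x class names; a custom strategy would implement the same Allocator/Evictor interfaces):

# conf/alluxio-site.properties
alluxio.worker.allocator.class=alluxio.worker.block.allocator.MaxFreeAllocator
alluxio.worker.evictor.class=alluxio.worker.block.evictor.LRUEvictor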
10. Use Case 4: Cloud-based Recovery
• Solr auto-add replica (you have to use the HdfsUpdateLog):
<updateLog class="solr.HdfsUpdateLog"> …
• Alluxio will pull the files from memory on another worker if they're available, or go back to the under FS storage
• Wise to have some auto-warming queries / caches configured so that replicas don't get marked as active in the cluster until they are warmed up (sketched below) … thanks Shalin! SOLR-6086
[Diagram: after the replica on Node 1 (us-east-1d) is lost, the Solr overseer adds a replica on Node 2 (us-east-1c); each node runs Alluxio (memory) backed by S3 or GCS]
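A sketch of the relevant solrconfig.xml pieces, assuming a stock configset (the warming query shown is purely illustrative; tune it to your own schema):

<!-- keep the transaction log in the shared FS so a new replica can recover it -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.HdfsUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>

<!-- inside the <query> section: warm each new searcher before it serves traffic -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="rows">10</str></lst>
  </arr>
</listener>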
11. Synergy with Analytics & Machine Learning
• Solr streaming expressions power analytics jobs that may require massive result sets at once
• Hybrid solutions that mix Solr with compute frameworks like Spark and Flink
• Alluxio speeds up SparkSQL and ML jobs
• Fusion SQL ~ keeping expensive views in Alluxio for analytics dashboards (complex queries against data loaded from Solr)
12. Work in progress …
• ALLUXIO-2995: perf issue (fixed in 1.6.0); the work-around is:
alluxio.user.file.cache.partially.read.block=false
• An orphaned write.lock prevents core initialization after a crash: SOLR-8335 and SOLR-8169
bin/alluxio fs rm /solr/alluxio1/core_node1/data/index/write.lock
• SOLR-11335: closing the FileSystem object retrieved from get()
fs.alluxio.impl.disable.cache=true (in core-site.xml)
• SOLR-6237: shared replicas
• SOLR-9515: couldn't get Solr running with s3a w/o Alluxio; classpath issues
• Test the ASYNC_THROUGH write mode with Solr
13. FAQ
• Does Alluxio support running in HA mode?
• How does data locality work with Solr & Alluxio?
• What block size do you recommend for Solr?
• What's the overhead of CACHE_THROUGH during indexing?
• What about Solr's block cache?
• Does Alluxio work with Solr 7?
Speaker notes:
Apache Zeppelin interpreter to execute FS shell commands, e.g. ls /mnt/solr
Another benefit is you can try this out quickly on EC2
Query test setup: r4.xlarge with 4 CPUs, 5M docs, 10K random queries, 16 concurrent users (JMeter)
Still might be useful to "pin" specific indexes to help ensure performance
Overall, using Alluxio was slower for queries, which is expected, as MMap is faster than reading from Alluxio even though the files are in memory
However, Alluxio beat HDFS. Probably could have done some BlockCache tuning, but that seems complicated
Accelerate remote storage I/O
Since indexes are in S3, you could run Spark jobs that read the full index w/o impacting search performance
Avoid cloud vendor lock-in, as Solr doesn't know anything about the underlying cloud FS
Important: could not get Solr to work against S3 w/o Alluxio due to Hadoop classpath issues and an issue with HttpClient 4.3; this is documented at: https://community.plm.automation.siemens.com/t5/The-Big-Data-Blog/Running-Solr-on-S3/ba-p/388004
However, this is another example of using Alluxio to hide under-FS issues from Solr!
What happens when an old partition is queried? Does Alluxio pull it into cache and evict other data? How do you control this?
Solr on s3a w/o Alluxio issues: https://community.plm.automation.siemens.com/t5/The-Big-Data-Blog/Running-Solr-on-S3/ba-p/388004
Data locality: you'll want an Alluxio worker on every node where you plan to run Solr replicas
Be careful with smaller block sizes and merging / optimize
CACHE_THROUGH didn't show much overhead: <1% diff