


RubiX: A caching framework for big data engines in the cloud. It provides data caching to engines such as Presto, Spark, and Hadoop transparently, without user intervention.

Published in: Engineering


  1. 1. RubiX A caching framework for big data engines in the cloud Strata + Hadoop World March 2017 Shubham Tagra (stagra@qubole.com)
  2. 2. Agenda ● Intro ● Why Caching? ● Path to Rubix ● Rubix Architecture ● Future of Rubix ● QnA
  3. 3. Built for Anyone who Uses Data ● Analysts | Data Scientists | Data Engineers | Data Admins ● Optimize performance, cost, and scale through automation, control and orchestration of big data workloads ● A Single Platform for Any Use Case: ETL & Reporting | Ad Hoc Queries | Machine Learning | Streaming | Vertical Apps ● Open Source Engines, Optimized for the Cloud ● Native Integration with multiple cloud providers
  4. 4. Qubole operates at Cloud Scale ● 500 PB of data processed in the cloud monthly (chart values: 6 PB, 80 PB, 150 PB, 500 PB) ● 500 nodes: largest Spark cluster in the cloud ● 2000 clusters started per month
  5. 5. Why Caching ● Popularity of Cloud Stores like S3 + Near-infinite capacity + Inexpensive + Ease of use - Network Latencies - Back-offs
  6. 6. Rubix ancestors ● File cache
  7. 7. Rubix ancestors ● File cache ○ Benefits: as much as 10x performance improvement ○ Problems ■ Huge warm-ups ■ Cache size ■ Tied to Presto ■ Required Presto scheduler changes
  8. 8. Requirements for new cache ● Improve performance ● Abstracted from user ○ Ease of use ● Support columnar formats ○ Improves speed ● Work well with autoscaling ○ Saves cost ● Ease of extension to clouds and engines
  9. 9. Alternatives Considered: FUSE FileSystem ● Mount S3 paths on EC2 ● OS handles page caching, read-ahead, etc. ● Problems ○ Requires exclusive control over the bucket ○ Data corruption on external updates ○ Not production ready
  10. 10. Alternatives Considered: HTTP Caching
  11. 11. Alternatives Considered: HTTP Caching ● Worked fine with TXT data ● Problems ○ Columnar formats and Byte-Range based Varnish Keys ■ Poor hit ratio ■ Redundant copies
  12. 12. Tachyon/Alluxio ● More than just a caching system ● We required a lightweight system ● SQL first
  13. 13. Rubix ● Extensible to many engines ● Columnar format friendly ● Works well with autoscaling ● Shareable across engines/instances
  14. 14. Architecture ● Split ownership assignment system ● Data Caching System ● Plugins
  15. 15. Architecture ● Split ownership assignment system ○ Used in the master node during split computation ○ Calculates which node owns a particular split of a file ○ Uses consistent hashing to work well with autoscaling
  16. 16. Architecture ● Data Caching System ○ Used in worker nodes when data is read ○ Reads from local disk or from the remote store, as per the metadata ○ Metadata is stored in units of blocks (1 MB each) ○ BookKeeper provides the metadata for each block ○ Metadata is also checkpointed to local disk
  17. 17. Architecture ● Plugin ○ Provides two types of information ■ How to get the list of nodes in the system ■ FileSystem for remote reads ○ E.g. Presto plugin, Hadoop 1 plugin, Hadoop 2 plugin
  18. 18. Plugins: Presto ● Presto provides tight control over scheduling local splits ● This ensures that splits are always scheduled locally ● Worked well for our customers
  19. 19. Plugins: Hadoop ● Strict local scheduling was not possible with Hadoop ● This meant a lot of warm-ups and redundant copies of data ● Options: ○ Read directly from remote for non-local reads ○ Figure out the correct owner and read from it ○ Implement non-local reads for Hadoop support ● Learnings ○ 100% strict location-based scheduling is not possible in Hadoop 2
  20. 20. Using Rubix with Presto ● Configure disk mount point ○ Assumes disks mounted on /media/ephemeral0, /media/ephemeral1, etc by default ● Start BookKeeper ● Place rubix jars in hive-hadoop2 plugin of Presto ● Configure Presto to use Rubix FileSystem for the cloud store
  21. 21. Using Rubix with Presto in Qubole
  22. 22. Using Rubix with Hadoop ● Configure disk mount point ○ Assumes disks mounted on /media/ephemeral0, /media/ephemeral1, etc by default ● Start BookKeeper ● Place rubix jars with hadoop libraries ● Configure Hadoop to use Rubix FileSystem for the cloud store
  23. 23. Extending to other Engines and Clouds
  24. 24. Performance gains
  25. 25. Future Work ● Extend to other clouds and engines ● Table aware objects in Rubix ● Caching policies for Hive Partitions ● Subquery caching
  26. 26. Questions?
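
Note on slide 15 (split ownership): the slide says only that consistent hashing is used so that ownership works well with autoscaling. The following is a minimal Python sketch of that idea, not RubiX's actual implementation. The class name `SplitOwnerRing`, the virtual-node count, the MD5-based hash, and the `path:offset` key format are all assumptions made for illustration.

```python
import bisect
import hashlib


def _ring_position(key: str) -> int:
    """Map a string to a 64-bit position on the ring (hash choice is an assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class SplitOwnerRing:
    """Hypothetical consistent-hash ring that maps file splits to owner nodes."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes   # virtual nodes per physical node, smooths the distribution
        self._points = []       # sorted ring positions
        self._owner = {}        # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self._vnodes):
            point = _ring_position(f"{node}#{i}")
            if point in self._owner:
                continue        # skip the (very unlikely) position collision
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def owner_of(self, path, split_start):
        """Return the node owning the split that starts at split_start in path."""
        if not self._points:
            raise ValueError("no nodes in the ring")
        point = _ring_position(f"{path}:{split_start}")
        # First ring position clockwise from the split's position, wrapping around.
        idx = bisect.bisect_right(self._points, point) % len(self._points)
        return self._owner[self._points[idx]]
```

The property that matters for autoscaling is that when a node leaves, only the splits it owned are reassigned; splits owned by the remaining nodes keep their owner, so their cached data stays useful.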
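
Note on slide 16 (data caching system): the slide states that metadata is tracked per 1 MB block and that reads go to local disk or to the remote store depending on that metadata. The sketch below shows one way a byte-range read could be split along block boundaries. It is an illustration under assumptions: the function names are hypothetical, and the in-memory set `cached_blocks` merely stands in for whatever the BookKeeper would report.

```python
BLOCK_SIZE = 1024 * 1024  # 1 MB per block, as stated on the slide


def blocks_for_read(offset, length, block_size=BLOCK_SIZE):
    """Return the range of block indices touched by reading `length` bytes at `offset`."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if length == 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


def plan_read(offset, length, cached_blocks, block_size=BLOCK_SIZE):
    """Group the touched blocks into runs of (source, start_block, end_block_exclusive).

    `source` is "local" when the block is in `cached_blocks`, otherwise "remote".
    Adjacent blocks with the same source are merged into a single run.
    """
    plan = []
    for block in blocks_for_read(offset, length, block_size):
        source = "local" if block in cached_blocks else "remote"
        if plan and plan[-1][0] == source and plan[-1][2] == block:
            plan[-1] = (source, plan[-1][1], block + 1)  # extend the current run
        else:
            plan.append((source, block, block + 1))
    return plan
```

For example, a 3 MB read starting at offset 512 KB touches blocks 0 through 3; with blocks 0 and 1 cached, `plan_read(512 * 1024, 3 * BLOCK_SIZE, {0, 1})` returns `[("local", 0, 2), ("remote", 2, 4)]`, so only the last two blocks need a remote fetch.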