Hybrid collaborative tiered storage with Alluxio
Thai Bui
Data Engineer @ Bazaarvoice
Bazaarvoice
● Founded in 2005 in Austin, TX
● Digital marketing SaaS platforms for ratings and reviews
○ Display & syndicate reviews from brands to retailer websites
○ Reporting & analytics on consumers, reviews, products, etc.
● 2,600 client websites
● 5.4 billion product page views each month
● 900 million unique shoppers each month
Reporting & analytics on S3
When you have 100s of TB of data on S3
● Just listing the files is slow
● Download speed in EC2 is limited (50-150Mb/s per node)
● No concept of cache
● No concept of data locality
AWS S3 : The Need For Speed
● Add tiered storage to S3
○ Hot, warm, cold storage (fastest, fast, and not so fast)
○ Metadata cache
○ Data cache
● Keep data local
○ In the same machine, not via the Ethernet cable
● Compatible with existing services
○ Hadoop, Spark, Hive, Presto, etc.
● Adaptive & highly configurable
○ Symlink for S3
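The hot/warm/cold idea above can be sketched as a small read-through cache. This is only an illustration under stated assumptions, not the setup described in this talk: `TieredStore`, `fetch_cold`, and `hot_capacity` are hypothetical names, the hot tier is a plain LRU, and the warm tier is an unbounded dict standing in for local SSD.

```python
from collections import OrderedDict
from typing import Callable, Dict


class TieredStore:
    """Toy read-through cache: hot (RAM) -> warm (local SSD) -> cold (S3)."""

    def __init__(self, fetch_cold: Callable[[str], bytes], hot_capacity: int = 2) -> None:
        self.fetch_cold = fetch_cold          # hypothetical stand-in for an S3 GET
        self.hot: "OrderedDict[str, bytes]" = OrderedDict()  # kept in LRU order
        self.warm: Dict[str, bytes] = {}      # hypothetical stand-in for local SSD
        self.hot_capacity = hot_capacity
        self.cold_reads = 0                   # counts trips to cold storage

    def read(self, key: str) -> bytes:
        if key in self.hot:                   # fastest path: already in the hot tier
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.warm:                  # fast path: promote from warm to hot
            data = self.warm.pop(key)
        else:                                 # slow path: fetch from cold storage
            data = self.fetch_cold(key)
            self.cold_reads += 1
        self._put_hot(key, data)
        return data

    def _put_hot(self, key: str, data: bytes) -> None:
        self.hot[key] = data
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_data = self.hot.popitem(last=False)
            self.warm[old_key] = old_data     # demote to warm instead of dropping
```

Repeated reads of the same key never go back to cold storage in this sketch, which is the property the tiered design is after.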
Overview
[Diagram: App1, App2, Spark, Metastore, Alluxio; ZFS (hot & warm); S3 (cold)]
● Alluxio
○ Distributed data storage
○ Hadoop compatible
○ By AMPLab
● ZFS
○ OS-level file system
○ Volume manager
○ By Sun Microsystems
● Both are open-source
Alluxio : The tiered-storage layer
● Support for native filesystem and Hadoop filesystem
● Distributed and can be installed on every node
○ Provides data locality
● Mount S3, HDFS, etc. to Alluxio
○ Think symlink. No data movement.
● Use Hive metastore to partition data into hot/warm and cold region
○ Acts as a remote tiered-storage layer
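The "think symlink" point can be illustrated with a toy mount table that only translates paths and never copies data. This is a minimal sketch, not Alluxio's implementation: `MountTable`, its methods, and the bucket and path names are hypothetical.

```python
from typing import Dict, Optional


class MountTable:
    """Maps paths in one namespace onto under-storage URIs (no data is moved)."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, mount_point: str, ufs_uri: str) -> None:
        # Normalize trailing slashes; "/" stays "/".
        self._mounts[mount_point.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path: str) -> Optional[str]:
        """Return the under-storage URI for `path`, using the longest matching mount."""
        best: Optional[str] = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            return None
        suffix = path if best == "/" else path[len(best):]
        return self._mounts[best] + suffix


table = MountTable()
table.mount("/reviews", "s3://example-bucket/reviews")     # hypothetical bucket
table.mount("/reviews/archive", "hdfs://namenode/archive")  # hypothetical HDFS path
```

Resolving `/reviews/2018/part-0.orc` in this sketch yields a path under the S3 mount, while anything under `/reviews/archive` resolves to HDFS because the longer mount point wins.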
ZFS : The acceleration layer
● Both a filesystem & a volume manager
○ Mirror write to 2 SSDs -> 2x read speed
● Runs in Linux kernel space
○ Works with RAM to accelerate read/write
○ Auto promote/demote blocks from RAM to other storage
○ Used with local NVMe SSD if data is not in RAM
○ Acts as a local tiered-storage layer
● Extremely reliable
○ Automatic block checksum & repair
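The "automatic block checksum & repair" behavior on a mirror can be sketched in a few lines. This is a simplified illustration, not ZFS code: `MirroredBlock` is a hypothetical class, replicas are in-memory byte arrays rather than disks, and SHA-256 is used only for convenience (the checksum algorithm in ZFS is configurable).

```python
import hashlib
from typing import List


class MirroredBlock:
    """Toy mirrored block: one stored checksum, several replicas."""

    def __init__(self, data: bytes, copies: int = 2) -> None:
        self.checksum = hashlib.sha256(data).hexdigest()
        self.replicas: List[bytearray] = [bytearray(data) for _ in range(copies)]

    def _is_valid(self, replica: bytearray) -> bool:
        return hashlib.sha256(bytes(replica)).hexdigest() == self.checksum

    def read(self) -> bytes:
        # Find the first replica whose contents still match the checksum.
        good = next((r for r in self.replicas if self._is_valid(r)), None)
        if good is None:
            raise IOError("all replicas failed checksum verification")
        # Overwrite any corrupted replica with the verified copy.
        for i, replica in enumerate(self.replicas):
            if not self._is_valid(replica):
                self.replicas[i] = bytearray(good)
        return bytes(good)
```

A read therefore both returns verified data and heals a damaged copy, as long as at least one replica is intact.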
ZFS + NVMe: Micro benchmark
i3.4xlarge, up to 10 Gbit network, 2 x 1.9 TB NVMe SSD
● Baseline w/ EBS
○ 135 MB/s write (dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync)
○ 157 MB/s read (dd if=/tmp/test1.img of=/dev/zero bs=8k)
● ZFS + 2 mirrored NVMe SSD
○ 820 MB/s write (dd if=/dev/zero of=/alluxio/fs/test1.img bs=1G count=1)
○ 1.7 GB/s read (dd if=/alluxio/fs/test1.img of=/dev/zero bs=1G count=1)
● 4x write, 10x read compared to EBS
● 10-15x compared to S3
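For readers who want to repeat a similar micro benchmark without `dd`, the sketch below times a sequential write (with `fsync`) and a sequential read of one file. It is not the procedure used for the numbers above: `measure_throughput` and its parameters are hypothetical, `size_mb` is assumed to be a multiple of `chunk_mb`, and the read will usually be served from the OS page cache unless caches are dropped first.

```python
import os
import tempfile
import time
from typing import Optional, Tuple


def measure_throughput(directory: Optional[str] = None,
                       size_mb: int = 64,
                       chunk_mb: int = 8) -> Tuple[float, float]:
    """Return (write MB/s, read MB/s) for a sequential file in `directory`."""
    directory = directory or tempfile.gettempdir()
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    path = os.path.join(directory, "throughput_test.img")
    try:
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())              # force the data to the device
        write_seconds = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(chunk)):         # read until EOF, discard the data
                pass
        read_seconds = time.perf_counter() - start
    finally:
        if os.path.exists(path):
            os.remove(path)                   # clean up the test file
    return size_mb / write_seconds, size_mb / read_seconds
```

Pointing `directory` at an EBS-backed path and then at a ZFS mount would give a rough side-by-side comparison, with the page-cache caveat above in mind.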
With ZFS
[Diagram: user space: Alluxio (native/Hadoop filesystem API); kernel space: ZFS, promoting and demoting blocks between RAM (hot) and NVMe SSD (warm)]
With Hive
[Diagram: Hive Metastore; last 30 days in Alluxio (hot & warm); > 30 days in S3 (cold)]
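The 30-day split can be expressed as a small rule that picks a storage location for each date partition. This is a sketch under assumptions, not the configuration from this talk: the constants, URIs, and the `dt=` partition layout are hypothetical. In Hive, the resulting location could then be applied per partition with `ALTER TABLE ... PARTITION (...) SET LOCATION`.

```python
from datetime import date

HOT_WINDOW_DAYS = 30                                      # from the slide: last 30 days
ALLUXIO_ROOT = "alluxio://alluxio-master:19998/reviews"   # hypothetical URI
S3_ROOT = "s3a://example-bucket/reviews"                  # hypothetical URI


def partition_location(partition_day: date, today: date) -> str:
    """Recent partitions point at Alluxio (hot & warm); older ones at S3 (cold)."""
    age_days = (today - partition_day).days
    root = ALLUXIO_ROOT if age_days <= HOT_WINDOW_DAYS else S3_ROOT
    return f"{root}/dt={partition_day.isoformat()}"
```

Running this rule on a schedule would move the boundary forward each day, so a partition's metastore location flips from Alluxio to S3 once it ages out of the window.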
CPU/IO Monitoring
Tiered storage Monitoring
Alluxio Monitoring
Hive Monitoring & Performance
● Scanning 200G of data in tiered storage, 500M rows, select *
● Scanning 5G of data in tiered storage, 350M rows, fewer projections
● Scanning 35G of data in S3, 1.6B rows, count distinct
● Metadata/split calculation ops: 60s, majority of the time spent on scanning S3
Result
● 5-10X read improvement in Hive
○ Worker can short-circuit and read directly from ZFS instead of S3
○ Move compute to the data
● Easy to debug, with feedback loop, collaborative
○ Data publishers + data analysts/scientists
● Good for iterating over the same data set multiple times
○ Machine learning
○ Exploratory analysis
● Gives us control over S3
○ More recent data should be faster to access
Questions?
