DATA ORCHESTRATION SUMMIT 2020
SuperDB
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty | Director Engineering
Topics
• Brief About Rakuten
• SuperDB Journey
• Our Data Landscape
• Challenges
• Approach
• Journey with Alluxio
Japan’s Largest e-commerce company
70+ Services
• Internet Services
• Fintech Services
• Communications & Contents
SuperDB: Centralized Data Platform for Rakuten Ecosystem
• 43 Services (*1)
• 700+ TeraBytes of Normalized Data sets
• 6,500+ Users (*2) from 40+ Businesses
• 70+ Services in Ecosystem
• PetaBytes of Data
*1 Excluding small services and common services
*2 Number of weekly active users
Our Journey
2007 - 2012: Traditional EDW
• Teradata, On-Premise
• BI Reporting, Ad-Hoc Analysis
• Services: 1 (2007) to 5 (2012)
2013 - 2018: Teradata + Hadoop Big Data Stack
• Presto, Mesos DC/OS
• On-Premise & GCP: Hadoop Cluster, GCS
• Click Stream Data
• Recommendation, Personalization, Support ML
• Services: 8 (2013) to 25+ (2018)
2019 - 2020: Multi-Cloud (GCP + Azure)
• Teradata + Hadoop Big Data Stack
• Presto + Alluxio (POC); Starburst Presto, Alluxio (Prod.)
• Mesos DC/OS, Kubernetes
• Cloud Storage, Hybrid Compute, Hadoop Clusters, Object Storage
• Optimize Analytics, Optimize AI / ML
• Services: 30+ (2019) to 40+ (2020)
Our Data Landscape
Data challenges: Diverse data from diverse sources, growing rapidly
Easier Data Management: Based on Personas, Gives Transparency & Better Cost Control, Standardized and Automated
Faster & Better Insight & AI: Start Analysis by ability to connect and Run anywhere
SINGLE VERSION OF TRUTH!
Business Generated (Data Producers)
• Web / RAT, User Transaction, IoT / Device, Apps
• Real-Time & Batch (Containers)
• Common Schema
• Business DWH / DL
SuperDB (Enterprise Repository)
• Super DB on-premise (JP) and SuperDB Cloud (US, Japan), with Auto-Sync
• Normalized, Transaction / aggregated, and Click Stream data
• On-premise, AWS, Azure, GCP; Cloud Bursting (Containers)
• On-Demand Scale (Cloud + Containers)
• Common Schema
• Query Layer
• Virtual Data Mart (On-Prem / GCP Cloud)
• Data Projections and Feature Sets
• Granular Access control, Data encryption (End to End), Multi-Factor Authentication
Insights & Data Science (Data Consumers)
• Faster Business Insight
• Faster time to Analysis
• Quick Experimentation
Cloud Native & Hybrid Architecture
Support for Different Personas
Business Analysts
• Ad hoc Query Capacity
• Discover, Fast and Easy Access, OLAP & Low Latency
• BI Support and Reporting
Data Scientists
• Ad hoc Query Capacity (OLAP, low latency)
• Run workloads on large-scale computing
• Data Science Platform and tools for ML / AI workloads
Applications
• Ability to Integrate with APIs
• Support for Data Sync to different clusters
Data Engineers
• Query, Data Ingestion and Transformation
• Scalable processing, long-running jobs
• Real-time and Batch Support
• Data QC Support
Governance, Audit & Security
• Secured Access Layer
• Ability to create Audit Reports
• Data Lineage and traceability
System Admin & Operators
• Maintaining the data system infra
• Workload Tuning
• Data Pipeline maintenance
• Data QC
Sand-Box Users
• Creates, Joins, Ad-hoc Reports, KPIs
• Experiments & Quick Analysis
• Support for various Marketing activities
Our Challenges
System Scalability
• Compute elasticity for experiments.
• Adding capacity was a time-consuming process.
Data Availability
• Unable to address / optimize for different Personas.
• Legacy code and limited processing power resulted in job delays.
Data freshness
• Too many data copy pipelines needed to be built, delaying access to data.
• Managing data copy pipelines to different clusters became an operational overhead.
Analytics Agility
• Data had to be moved before any analysis could be done; not all data is present in the DWH for analysis.
• Quick analysis cannot be done across the data silos of different businesses.
Our Approach
Challenges
System Scalability
• Compute Elasticity for Experiments
• Adding capacity was a time-consuming process
Data Availability
• Unable to address / optimize for different Personas
• Legacy code and limited processing power resulting in job delays
Data freshness
• Data sync cannot be done between different clusters in the DCs
• Too many Data Copies
Analytics Agility
• Cannot join Transaction & behavior data
• Needs a lot of Data Movement
• Quick Analysis cannot be done
Approach
Hybrid & Cloud-native architecture
• On-Demand Compute with Public Cloud
• Separate Storage and Processing
• Containerization and Cloud Native
Data Sync & Orchestration (Alluxio)
• Data Sync across DCs and Cloud
• Data Processing Cache Layer
Query Layer (Starburst Presto)
• Start analytics by connecting to different stores on multi-cloud and on-prem before any data movement
• Common security layer with Ranger
One Major Challenge
• Data Sources
• PwC (Legacy), ODIN, Python (Legacy), Spark Pipeline (New)
• Teradata, Legacy HDFS, New HDFS, GCS
• Copy steps between stores
• Pipeline X, Pipeline Y, Pipeline Z
❖ ODIN is a homebrewed data ingestion system
❖ Legacy HDFS and New HDFS are in different data centers, so downstream migration is not straightforward due to computing resource constraints
Data Sync
• Source Data
• Alluxio Ingest
• Rakuten DC1: Alluxio X, HDFS Cluster
• Rakuten DC2: Alluxio Y, HDFS Cluster
• GCP: Alluxio Z, GCS
❖ Alluxio Ingest Cluster: data persists to multiple destinations via Under Store Replication.
❖ Consumption tools cache data from different DCs to improve performance and enable DR.
Released in Production
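The fan-out idea behind the ingest cluster can be illustrated with a toy model. The sketch below is plain Python and does not use Alluxio's API: `UnderStore`, `IngestCluster`, the store names, and the example path are all hypothetical stand-ins, and the only behavior modeled is that one logical write ends up in every configured destination.

```python
from dataclasses import dataclass, field


@dataclass
class UnderStore:
    """Hypothetical stand-in for one storage destination (an HDFS cluster, GCS, ...)."""
    name: str
    objects: dict = field(default_factory=dict)

    def persist(self, path: str, data: bytes) -> None:
        self.objects[path] = data


@dataclass
class IngestCluster:
    """Toy ingest layer: one logical write fans out to every under store."""
    under_stores: list

    def write(self, path: str, data: bytes) -> list:
        for store in self.under_stores:
            store.persist(path, data)
        # Report which stores now hold the object.
        return [s.name for s in self.under_stores if path in s.objects]


# Assumed layout: two on-prem HDFS clusters and one cloud bucket.
dc1 = UnderStore("hdfs-dc1")
dc2 = UnderStore("hdfs-dc2")
gcs = UnderStore("gcs")
ingest = IngestCluster([dc1, dc2, gcs])

replicated_to = ingest.write("/landing/events/part-0000", b"example payload")
print(replicated_to)
```

In this model the producer writes once and never needs to know how many destinations exist, which is the property that replaces the separate copy pipelines.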
Data Caching for Consumption
GKE (GCP) & AKS (Azure), 2020 Production
• Alluxio master with three Alluxio workers
• Presto Coordinator with three Presto workers
• Mem Cache (four instances)
On-Prem Bare Metal (POC)
• Three physical boxes
• HDFS: DC local, DC remote 1, DC remote 2
• Alluxio master with three Alluxio workers
• Presto Coordinator with three Presto workers
Consumption in Production Today
On-Prem Bare Metal (2019 - Early 2020)
• Three physical boxes
• HDFS DC1, HDFS DC 2
• Alluxio master with three Alluxio workers
• Presto Coordinator with three Presto workers
Bare Metal (K8 Cluster): Present Production
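The caching role that the Alluxio workers play next to the Presto workers can be sketched as a read-through cache. This is a toy illustration in plain Python, not Alluxio code: the class name, the paths, and the least-recently-used eviction policy are assumptions made for the example.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy LRU read-through cache in front of a slower backing store."""

    def __init__(self, backing_store: dict, capacity: int):
        self.backing_store = backing_store
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self.cache:
            self.hits += 1
            self.cache.move_to_end(path)  # mark as most recently used
            return self.cache[path]
        self.misses += 1
        data = self.backing_store[path]  # simulated read from a remote store
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return data


# Hypothetical remote HDFS contents, modeled as a dict.
remote_hdfs = {"/sales/2020/01.parquet": b"jan", "/sales/2020/02.parquet": b"feb"}
worker_cache = ReadThroughCache(remote_hdfs, capacity=1)

worker_cache.read("/sales/2020/01.parquet")  # miss: fetched from the remote store
worker_cache.read("/sales/2020/01.parquet")  # hit: served from memory
worker_cache.read("/sales/2020/02.parquet")  # miss: evicts the January file
print(worker_cache.hits, worker_cache.misses)  # prints: 1 2
```

Repeated queries over the same files are served from memory, so only the first read pays the cost of reaching a remote data center.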
Distributed Cache (Presently under POC)
• TensorFlow / Caffe, Spark Compute (Transformation), Spark Compute (Aggregations)
• Libfuse, AlluxioFUSE, Alluxio JVM
• Distributed Cache
• Kubernetes, Kubeflow, Linux
• Rakuten OneCloud, Bare Metal, GPU, CPU
• HDFS, Object Store, NAS
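The FUSE layer in this stack matters because training code can then read cached data with ordinary file I/O. The sketch below assumes nothing about the actual deployment: the helper names and the `/mnt/alluxio-fuse` mount point are hypothetical, and a temporary directory stands in for the mount so the example runs anywhere.

```python
import os
import tempfile
from pathlib import Path


def list_training_files(mount_point: str, suffix: str = ".csv") -> list:
    """Return sorted paths of files with the given suffix under the mount point."""
    return sorted(str(p) for p in Path(mount_point).rglob(f"*{suffix}"))


def read_first_line(path: str) -> str:
    # Ordinary file I/O; on a FUSE mount the kernel forwards these calls
    # to the FUSE daemon instead of a local disk.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")


# A temporary directory stands in for a mount such as a hypothetical /mnt/alluxio-fuse.
with tempfile.TemporaryDirectory() as mount_point:
    sample = os.path.join(mount_point, "features.csv")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("user_id,score\n1,0.5\n")
    files = list_training_files(mount_point)
    header = read_first_line(files[0])

print(header)
```

Because only standard file calls appear, the same code would work unchanged whether the path is a local directory or a mounted distributed cache.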
Our Journey with Alluxio
• 2017: Started using Presto Open source (On-Prem)
• 2018: Started using Presto Open source (GCP)
• 2019: POC with Presto + Alluxio (GCP)
• 2020: Presto + Alluxio (GCP, Azure); Data Sync with Alluxio (On-Prem); POC: Distributed Cache with Alluxio for ML & Data Pipeline Jobs
• 2021: Planned: Distributed Cache with Alluxio for ML & Data Pipeline Jobs
Overview: Wrap-up
• RDB, NoSQL, Files, events
• Pipeline Service, Hadoop
• Landing zone, Common Schema mapping, Common Marts
• Transformations, Changelogs
• Data Orchestration Layer
• Discovery Service, Consumption Service
• Presto, Spark
• BI tools, AI / ML, Data Exploring, Downstream pipelines
• Schema management, Data ACL, Classification, Auditing
• Cloud
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey

  • 1. DATA ORCHESTRATION SUMMIT 2020 SuperDB Modernizing Global Shared Data Analytics Platform and our Alluxio Journey Sandipan Chakraborty | Director Engineering
  • 2. 2 Topics • Brief About Rakuten • SuperDB Journey • Our Data Landscape • Challenges • Approach • Journey with Alluxio
  • 3. 3 70 + Service s Japan’s Largest e-commerce company Internet Services Fintech Services Communication s & Contents
  • 4. 4 SuperDB: Centralized Data Platform for Rakuten Ecosystem 43Services*1 700+ TeraBytes Normalized Data sets 6,500+ Users*2 from 40+ Businesses 70+Services in Ecosystem *1 Excluding small services and common services *2 # of weekly active users PetaBytes of Data
  • 5. 5 Our Journey 201 3 201 8 8 25+ Teradata + Hadoop Big Data Stack 2013 - 2018 Presto Mesos DC/OS On-Premise & GCP Hadoop Cluster GCS Click Stream Data Recommendation, PersonalizationSupport ML, 1 5 200 7 201 2 Traditional EDW Teradata 2007 – 2012 On-Premise BI Reporting, Ad-Hoc Analysis 30+ services 2019 2019 - 2020 40+ services 202 0 Multi-Cloud (GCP + Azure) Presto + Alluxio (POC) Mesos DC/OS Kubernetes Starburst Presto Alluxio (Prod.) Cloud Storage Hybrid Compute Hadoop ClustersObject Storage Optimize Analytics Optimize AI / ML Teradata + Hadoop Big Data Stack
  • 6. 6 Our Data Landscape Web / RAT User Transaction IoT / Device Apps Real-Time & Batch (Containers) Common Schema Business Generated (Data Producers) Business DWH / DL SuperDB (Enterprise Repository) Data challenges Diverse data from diverse sources, growing rapidly Easier Data Management Based on Personas, Gives Transparency & Better Cost Control, Standardized and Automated Faster & Better Insight & AI Start Analysis by ability to connect and Run anywhere Insights & Data Science (Data Consumers) SINGLE VERSION OF TRUTH! Data Projections and Feature Sets Virtual Data Mart (On-Prem / GCP Cloud) Super DB on-premise (JP) SuperDB Cloud (US, Japan) Auto-Sync On-Demand Scale (Cloud + Containers) Common Schema Cloud Bursting (Containers) AWS Azure GCP Click Stream On-premise Faster Business Insight Faster time to Analysis Quick Experimentation Cloud Native & Hybrid Architecture Granular Access control Data encryption (End to End) Multi-Factor Authenticatio n Query Layer Normalized Transaction / aggregated Transaction / aggregated Transaction / aggregated Auto-Sync
  • 7. Support for Different Personas
      • Business Analysts: ad-hoc query capacity; discovery, fast and easy access, OLAP and low latency; BI support and reporting.
      • Data Scientists: ad-hoc query capacity (OLAP, low latency); running workloads on large-scale computing; data science platform and tools for ML / AI workloads.
      • Applications: ability to integrate with APIs; support for data sync to different clusters.
      • Data Engineers: query, data ingestion, and transformation; scalable processing, long-running jobs; real-time and batch support; data QC support.
      • Governance, Audit & Security: secured access layer; ability to create audit reports; data lineage and traceability.
      • System Admins & Operators: maintaining the data system infrastructure; workload tuning; data pipeline maintenance; data QC.
      • Sandbox Users: creates, joins, ad-hoc reports, KPIs; experiments and quick analysis; support for various marketing activities.
  • 8. Our Challenges
      • System Scalability: compute elasticity for experiments; adding capacity was a time-consuming process.
      • Data Availability: unable to address or optimize for different personas; legacy code and limited processing power resulted in job delays.
      • Data Freshness: too many data copy pipelines had to be built, delaying access to data; managing data copy pipelines to different clusters became an operational overhead.
      • Analytics Agility: data had to be moved before any analysis could be done, and not all of it was present in the DWH; quick analysis could not be done across the data silos of different businesses.
  • 9. Our Approach
      • Challenges
          • System Scalability: compute elasticity for experiments; adding capacity was a time-consuming process.
          • Data Availability: unable to address or optimize for different personas; legacy code and limited processing power resulted in job delays.
          • Data Freshness: data sync could not be done between different clusters across DCs; too many data copies.
          • Analytics Agility: transaction and behavior data could not be joined; a lot of data movement was needed; quick analysis could not be done.
      • Approach
          • Hybrid & cloud-native architecture: on-demand compute with public cloud; separate storage and processing; containerization and cloud native.
          • Data Sync & Orchestration (Alluxio): data sync across DCs and cloud; data processing cache layer.
          • Query Layer (Starburst Presto): start analytics by connecting to different stores on multi-cloud and on-prem before any data movement; common security layer with Ranger.
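The query-layer idea on this slide, joining transaction and behavior data where they already live instead of copying them first, can be illustrated with a small sketch. This is not Presto or Starburst code: `scan_transactions`, `scan_behavior`, and `hash_join` are hypothetical names, and the rows are made-up sample data standing in for an on-prem warehouse and a cloud object store.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def scan_transactions() -> Iterator[Row]:
    """Hypothetical connector: yields rows as they sit in an on-prem store."""
    yield {"user_id": 1, "amount": 1200}
    yield {"user_id": 2, "amount": 300}
    yield {"user_id": 1, "amount": 50}


def scan_behavior() -> Iterator[Row]:
    """Hypothetical connector: yields click-stream rows from a cloud store."""
    yield {"user_id": 1, "page_views": 14}
    yield {"user_id": 3, "page_views": 2}


def hash_join(left: Iterable[Row], right: Iterable[Row], key: str) -> List[Row]:
    """Inner-join two row streams on `key` without materializing a copy of either source."""
    index: Dict[object, List[Row]] = {}
    for row in right:  # build side: index the right-hand rows by join key
        index.setdefault(row[key], []).append(row)
    joined: List[Row] = []
    for row in left:  # probe side: stream the left-hand rows through the index
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


result = hash_join(scan_transactions(), scan_behavior(), "user_id")
for row in result:
    print(row)
```

In this toy version each connector is scanned in place and only the join result is materialized, which is the property the slide attributes to the query layer.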
  • 10. One Major Challenge
      • Diagram labels: Data Sources; Teradata; Legacy HDFS; New HDFS; GCS; PwC; Legacy: ODIN, Python; copy Pipeline X, copy Pipeline Y, copy Pipeline Z; New: Spark copy pipeline.
      ❖ ODIN is a homebrewed data ingestion system.
      ❖ Legacy HDFS and New HDFS are in different data centers, so downstream migration is not straightforward due to computing resource constraints.
  • 11. Data Sync (released in production)
      • Diagram labels: Source Data; Alluxio Ingest; Alluxio X; Alluxio Y; Alluxio Z; HDFS Cluster; HDFS Cluster; GCS; Rakuten DC1; Rakuten DC2; GCP.
      ❖ Alluxio Ingest Cluster: data persists to multiple destinations via Under Store Replication.
      ❖ Consumption tools cache data from different DCs to improve performance and enable DR.
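The "write once, persist to multiple destinations" pattern on this slide can be sketched as a simple fan-out. This is a toy model under stated assumptions, not Alluxio's API: `UnderStore` and `IngestLayer` are hypothetical classes, and in-memory dictionaries stand in for the two HDFS clusters and the GCS bucket.

```python
from typing import Dict, List


class UnderStore:
    """Hypothetical stand-in for a storage system (an HDFS cluster, a GCS bucket)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.files: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self.files[path] = data


class IngestLayer:
    """Toy ingest layer: a single write is persisted to every registered under store."""

    def __init__(self, stores: List[UnderStore]) -> None:
        self.stores = stores

    def write(self, path: str, data: bytes) -> List[str]:
        persisted = []
        for store in self.stores:
            store.put(path, data)  # same path, same bytes, in each destination
            persisted.append(store.name)
        return persisted


dc1 = UnderStore("hdfs-dc1")
dc2 = UnderStore("hdfs-dc2")
gcs = UnderStore("gcs")

ingest = IngestLayer([dc1, dc2, gcs])
targets = ingest.write("/landing/orders/part-0", b"order-rows")
print(targets)
```

The producer issues one write and never needs to know how many destinations exist, which is what removes the per-destination copy pipelines shown on the previous slide. A real deployment would also have to handle partial failures and retries, which this sketch ignores.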
  • 12. Data Caching for Consumption
      • GKE (GCP) & AKS (Azure), 2020 production: Alluxio master; 3 Alluxio workers; Presto coordinator; 3 Presto workers; Mem Cache (×4).
      • On-prem bare metal (POC): 3 physical boxes; HDFS: DC local; HDFS: DC remote 1; HDFS: DC remote 2; Alluxio master; 3 Alluxio workers; Presto coordinator; 3 Presto workers.
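The benefit of placing a memory cache between the query workers and remote HDFS clusters can be shown with a minimal read-through cache. This is an illustrative sketch only, assuming LRU eviction and whole-file granularity; `ReadThroughCache` and `fetch_from_remote_hdfs` are hypothetical names and do not reflect how Alluxio workers are actually implemented or configured.

```python
from collections import OrderedDict
from typing import Callable, List


class ReadThroughCache:
    """Toy read-through cache with least-recently-used eviction."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity: int) -> None:
        self.fetch_remote = fetch_remote
        self.capacity = capacity
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(path)  # mark as most recently used
            return self.blocks[path]
        self.misses += 1
        data = self.fetch_remote(path)  # slow path: cross-DC read
        self.blocks[path] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used entry
        return data


remote_reads: List[str] = []


def fetch_from_remote_hdfs(path: str) -> bytes:
    """Hypothetical remote read; records each call so remote traffic can be counted."""
    remote_reads.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fetch_from_remote_hdfs, capacity=2)
cache.read("/dc-remote-1/sales/part-0")   # miss: fetched from the remote DC
cache.read("/dc-remote-1/sales/part-0")   # hit: served from memory
cache.read("/dc-remote-2/clicks/part-0")  # miss: a different remote DC
print(cache.hits, cache.misses, len(remote_reads))
```

Repeated reads of the same remote path are served locally after the first fetch, so only distinct paths generate cross-DC traffic.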
  • 13. Consumption in Production Today
      • On-prem bare metal (2019 to early 2020): 3 physical boxes; HDFS DC1; HDFS DC2; Alluxio master; 3 Alluxio workers; Presto coordinator; 3 Presto workers.
      • Bare metal (K8s cluster): present production.
  • 14. Distributed Cache (presently under POC)
      • Diagram labels: TensorFlow / Caffe; Spark Compute (Transformation); Spark Compute (Aggregations); Distributed Cache; Kubernetes, Kubeflow; Linux; Rakuten OneCloud; Bare Metal; GPU; CPU; HDFS; Object Store; NAS; Libfuse; AlluxioFUSE; Alluxio JVM.
  • 15. Our Journey with Alluxio
      • 2017: Started using open-source Presto (on-prem).
      • 2018: Started using open-source Presto (GCP).
      • 2019: POC with Presto + Alluxio (GCP).
      • 2020: Presto + Alluxio (GCP, Azure); POC: distributed cache with Alluxio for ML and data pipeline jobs; data sync with Alluxio (on-prem).
      • 2021: Planned: distributed cache with Alluxio for ML and data pipeline jobs.
  • 16. Overview: Wrap-up
      • Sources: RDB; NoSQL; Files; Events.
      • Pipeline Service: Hadoop; Transformations; Landing zone; Common Schema mapping; Common Marts.
      • Discovery Service: Schema management; Data ACL; Classification; Auditing; Changelogs (×2).
      • Consumption Service: Data Orchestration Layer; Presto; Spark; BI tools; AI / ML; Data Exploring; Downstream pipelines; Cloud.