Cerebro: Bringing together data scientists and BI users on a common analytics platform in the cloud
https://conferences.oreilly.com/strata/strata-eu-2019/public/schedule/detail/77861
Cerebro: Bringing together data scientists and BI users - Royal Caribbean - Strata - London 2019
1.
2. Royal Caribbean Cruises, Ltd.
• Founded in 1968
• Six companies employing over 65,000
people from 120 countries who have
served over 50 million guests
• Fleet of over 55 ships and growing
• Countless industry “firsts” - such as rock
climbing wall, ice skating, and surfing at
sea
• Each brand delivering a unique Guest
experience
• www.rclcorporate.com
12. What is Cerebro™?
Cerebro™ is a project under Excalibur’s data program focused on delivering a next-generation data management platform.
Design Drivers and Architecture Principles
13. Cerebro™ is Cloud Native
Cloud-native data lake architecture leveraging vendor-managed services
[Diagram: managed services (Azure Data Lake Store, Azure Data Factory) alongside container-based components]
14. Cerebro™ Leverages Different Storage Engines
Why there is a need for a Heterogeneous Data Lake

| Storage Type | Object Store | Document Store | Graph Store |
| --- | --- | --- | --- |
| Which Data? | Sensor data; financial data | Reference data; dynamic schema | Relationships |
| Which Queries? | Data science; BI; large analytical jobs | Single record; small batches; mutations | Relationship analysis; mutations |
| Key Considerations | Parquet and Arrow accelerate queries | Ability to handle streaming workloads | Flexibility and ability to handle complexity |

(In Cerebro, the object store is Azure Data Lake Store (ADLS).)
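The table above is effectively a decision rule mapping workload shape to engine. As an illustration only (the function name and rule keys are hypothetical, not part of Cerebro), the routing could be sketched as:

```python
# Hypothetical sketch: route a workload to one of the three storage engines
# from the table above. Rule names are illustrative, not Cerebro's API.

def choose_store(workload: str) -> str:
    """Pick a storage engine for a given workload type."""
    rules = {
        "large_analytical": "object_store",    # Parquet + Arrow accelerate scans
        "single_record":    "document_store",  # low-latency point reads/mutations
        "relationship":     "graph_store",     # pattern and traversal queries
    }
    if workload not in rules:
        raise ValueError(f"unknown workload: {workload}")
    return rules[workload]

print(choose_store("large_analytical"))  # object_store
```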
15. Cerebro™ Leverages In-Memory Architecture
• Scalability via a distributed in-memory compute layer and object storage
• Dremio and Spark anchor the in-memory compute layer
• Parquet on an object store (ADLS) for the storage layer, plus MongoDB and Neo4j
• Dremio and Arrow Flight further accelerate access and in-memory processing
[Diagram: compute layer over storage layer, shown today and in the future (with Arrow Flight)]
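The reason Arrow accelerates the queries above is its columnar in-memory layout: an analytical query that touches one field scans a single contiguous column instead of walking every row. A minimal pure-Python illustration of the idea (this mimics the layout, not Arrow's actual API; the data is invented):

```python
# Row layout: each record is a dict; an aggregate over one field must
# still visit every record and every field.
rows = [
    {"guest_id": 1, "spend": 120.0, "ship": "Oasis"},
    {"guest_id": 2, "spend": 75.5,  "ship": "Quantum"},
    {"guest_id": 3, "spend": 200.0, "ship": "Oasis"},
]

# Columnar layout: one list per field, which is how Arrow holds data in memory.
cols = {
    "guest_id": [1, 2, 3],
    "spend":    [120.0, 75.5, 200.0],
    "ship":     ["Oasis", "Quantum", "Oasis"],
}

# The aggregate reads only the column it needs; the other fields are never touched.
total_spend = sum(cols["spend"])
print(total_spend)  # 395.5
```

The same pruning applies on disk, which is why Parquet (columnar files) pairs naturally with Arrow (columnar memory).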
16. Cerebro™ - Phase 1
• Initial release focused on ingestion of sources spanning current data silos
• Establishment of a Raw Zone with Landing and Staging Areas
• Physical storage is file based (CSV, Parquet) on Azure Data Lake Store (ADLS) to support the variety and variability of the data
• The Staging Area requires users to be familiar with low-level data structures in order to execute queries joining disparate source systems (e.g. multiple PMS and Casino sources)
[Diagram: Ingest (Batch, CDC, SFTP) of file and RDBMS sources (Reservations, Customer Master, Property Management, Casino, Clickstream, Marketing) into the Raw Zone (Landing and Staging Areas); Transform into Standardized and Enriched Zones over a cloud object store, document store, and graph store; Consume via data virtualization, self-service BI, and advanced analytics by Data Engineers (operational analytics), BI Analysts (self-service dashboards), Data Scientists (advanced analytics), and Data Stewards (compliance analytics); metadata management, data catalog, data ingestion, and data integration span the platform]
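To make the Landing/Staging split concrete, here is one possible ADLS path convention for the Raw Zone. The layout, source names, and `adl://` root shown are assumptions for illustration; the talk does not document Cerebro's actual paths.

```python
from datetime import date

def raw_zone_path(area: str, source: str, dataset: str, run_date: date) -> str:
    """Build a hypothetical Raw Zone path: Landing holds CSV as delivered,
    Staging holds the Parquet conversion, partitioned by ingestion date."""
    assert area in ("landing", "staging"), "Raw Zone has two areas"
    ext = "csv" if area == "landing" else "parquet"
    return (f"adl://cerebro/raw/{area}/{source}/{dataset}/"
            f"dt={run_date:%Y-%m-%d}/{dataset}.{ext}")

print(raw_zone_path("landing", "casino", "player_ratings", date(2019, 5, 1)))
# adl://cerebro/raw/landing/casino/player_ratings/dt=2019-05-01/player_ratings.csv
```

A date-partitioned convention like this is what lets downstream Spark jobs reprocess a single day without rescanning the whole source.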
17. Data Pipeline – Phase 1
• Talend utilized to ingest data from a number of sources (RDBMS, file-based, API) into CSV files stored in the Landing Area (ADLS)
• Talend / Spark leveraged to create Parquet files in the Staging Area (ADLS)
• In-memory columnar processing (Arrow) via Dremio accelerates SQL-based query access for data engineering and data science use cases
• Leverages data virtualization within Dremio to support simple ad hoc integration and agile exploration
• Supports data science and advanced analytics (AI/ML) via Azure Databricks (Python, Scala, Java, R)
[Diagram: roles (Data Engineers, Data Scientists); Ingest: Talend, Azure HDInsight; Persist: Azure Data Lake Store; Explore: Dremio, Azure Data Catalog; Model/Predict: Azure Databricks (Python, Scala, Java, R)]
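The payoff of the virtualization layer is ad hoc SQL joins across sources that were never co-located. The sketch below uses sqlite3 purely as a stand-in to show the query shape; in Cerebro the same SQL would run in Dremio, which federates the sources rather than copying them, and the tables and figures are invented.

```python
import sqlite3

# Two "sources" that in Cerebro would live in different systems
# (e.g. a PMS table and a Casino table), joined with plain SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pms_guests (guest_id INTEGER, cabin TEXT)")
con.execute("CREATE TABLE casino_play (guest_id INTEGER, spend REAL)")
con.executemany("INSERT INTO pms_guests VALUES (?, ?)",
                [(1, "A101"), (2, "B203")])
con.executemany("INSERT INTO casino_play VALUES (?, ?)",
                [(1, 250.0), (1, 100.0), (2, 40.0)])

# Total casino spend per cabin: one query spanning both "systems".
rows = con.execute("""
    SELECT g.cabin, SUM(c.spend) AS total_spend
    FROM pms_guests g JOIN casino_play c ON g.guest_id = c.guest_id
    GROUP BY g.cabin ORDER BY g.cabin
""").fetchall()
print(rows)  # [('A101', 350.0), ('B203', 40.0)]
```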
18. Cerebro™ - Phase 2
• Implementation of a Standardized Zone based on a semantic view of entities that will be easier to query for casual users
• Introduction of MongoDB (document store) will allow the platform to support low-latency ingestion and consumption of the customer data required by downstream applications (e.g. Call Center)
• Dremio is still leveraged to support analytical use cases involving customer data stored in MongoDB (e.g. Marketing)
• Introduction of Neo4j (graph store) will increase overall agility (relationships) as well as provide insights by leveraging advanced functionality (patterns, recommendations)
[Diagram: same zone architecture as Phase 1 (Ingest into the Raw Zone; Transform into Standardized and Enriched Zones; Consume via virtualization, self-service BI, and advanced analytics), with Developers building Downstream Applications added as a consumer]
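The graph functionality mentioned (patterns, recommendations) comes down to traversing relationships. In Neo4j this would be a Cypher `MATCH` over relationship patterns; the idea can be sketched in plain Python with an adjacency map (the guests and excursions below are invented, not Cerebro's model):

```python
# "Guests who booked an excursion you booked also booked..." -- the kind of
# relationship query Neo4j answers natively, sketched on an adjacency map.
booked = {
    "alice": {"snorkeling", "zipline"},
    "bob":   {"snorkeling"},
    "carol": {"zipline", "cooking"},
}

def recommend(guest):
    """Excursions booked by guests who share at least one excursion with `guest`,
    minus what `guest` has already booked."""
    mine = booked[guest]
    pool = set()
    for other, theirs in booked.items():
        if other != guest and mine & theirs:  # shares an excursion with us
            pool |= theirs
    return pool - mine                        # only new suggestions

print(sorted(recommend("bob")))  # ['zipline']
```

A graph store earns its keep when these traversals go several hops deep, where equivalent relational joins become unwieldy.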
19. Data Pipeline – Phase 2
• Talend used to develop pipelines that process (cleanse, integrate, harmonize) data sourced from the Raw Zone
• Data resulting from pipeline executions is persisted in the appropriate store(s) (ADLS, Neo4j, and MongoDB) to support both analytical and operational requirements
• Develop services to be consumed by customer-facing applications and other downstream processes via managed APIs
[Diagram: roles (Data Engineers, Data Scientists, BI Analysts, Data Stewards); Ingest/Process: Talend, Azure HDInsight, Azure Databricks, Azure Data Factory; Persist: Azure Data Lake Store, MongoDB Atlas, Neo4j; Explore/Visualize: Dremio, Azure Data Catalog, Power BI; Model/Predict: Azure Databricks (Python, Scala, Java, R); Services: Azure Functions, Apigee, Azure Kubernetes Service]
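The "services via managed APIs" bullet amounts to thin request handlers over the operational stores. A sketch of the handler shape (the function, data, and response format are hypothetical; in Phase 2 such a handler would run on Azure Functions behind Apigee and look the customer up in MongoDB rather than a dict):

```python
import json

# Stand-in for the document store: in production this lookup would hit MongoDB.
_CUSTOMERS = {"C-1001": {"name": "A. Guest", "loyalty_tier": "Diamond"}}

def get_customer(customer_id):
    """HTTP-style handler: return (status_code, JSON body) for a customer
    lookup, the shape of service exposed to downstream apps like Call Center."""
    record = _CUSTOMERS.get(customer_id)
    if record is None:
        return 404, json.dumps({"error": "customer not found"})
    return 200, json.dumps({"customer_id": customer_id, **record})

status, body = get_customer("C-1001")
print(status, body)
```

Keeping the handler free of framework imports, as here, is what makes the same logic portable between Functions and a container on AKS.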
20. Modern Analytics on a Modern Data Platform
[Diagram: Data Sources → Ingest → Process → User Experience. On-premises sources (Property Management, Customer Master, Reservations, Casino) and external sources (Clickstream, Customer Feedback, Campaign Management) feed batch integration (Talend Big Data) and streaming integration (Kafka on HDInsight, Azure Event Hubs); processing via Spark on HDInsight and Azure Data Factory; storage in Azure Data Lake Store, MongoDB Atlas, and a Neo4j Causal Cluster; consumers (Business Analysts, Data Scientists) reach the platform through self-service data analytics (DBeaver EE, Azure Data Catalog), advanced analytics, and data services (Azure Functions, Azure Kubernetes Service)]