Dynamic DDL
Adding Structure to Streaming Data on the Fly
OUR SPEAKERS
Hao Zou
Software Engineer
Data Science & Engineering
GoPro
David Winters
Big Data Architect
Data Science & Engineering
GoPro
TOPICS TO COVER
• Background and Business
• GoPro Data Platform Architecture
• Old File-based Pipeline Architecture
• New Dynamic DDL Architecture
• Dynamic DDL Deep Dive
• Using Cloud-Based Services (Optional)
• Questions
Background and Business
WHEN WE GOT HERE…
DATA ANALYTICS WAS BASED ON WORD OF MOUTH (& THIS GUY)
TODAY’S BUSINESS
GoPro Data Analytics Platform (architecture diagram)
• Data sources: Consumer Devices, GoPro Apps & Cloud, E-Commerce, Social Media & OTT, CRM, ERP, Web, Mobile
• Outputs: Product Insight, User CRM/Marketing & Personalization
DATA CHALLENGES AT GOPRO
• Variety of data - Hardware and Software products
• Software - Mobile and Desktop Apps
• Hardware - Cameras, Drones, Controllers, Accessories, etc.
• External - CRM, ERP, OTT, E-Commerce, Web, Social, etc.
• Variety of data ingestion mechanisms - Lambda Architecture
• Real-time streaming pipeline - GoPro products
• Batch pipeline - External 3rd party systems
• Complex Transformations
• Data often stored in binary to conserve space in cameras
• Heterogeneous data formats (JSON, XML, and packed binary)
• Seamless Data Aggregations
• Blend data between different sources, hardware, and software
GoPro Data Platform Architecture
OLD FILE-BASED PIPELINE ARCHITECTURE
ETL Cluster
• Aggregations and Joins
• Hive and Spark jobs
• Map/Reduce
• Airflow
Secure Data Mart Cluster
• End User Query
• Impala / Sentry
• Parquet
• Kerberos & LDAP
Analytics Apps
• Hue
• Tableau
• Plotly
• Python
• R
Real Time Cluster
• Log file streaming
• RESTful service
• Kafka
• Spark Streaming
• HBase
Batch Induction Framework
• Batch files
• Scheduled downloads
• Pre-processing
• Java App
• Airflow
(Diagram flow labels: streaming and batch-download sources (Rest API, FTP downloads, S3 sync) feed the Real Time Cluster and the Batch Induction Framework, which emit JSON to the ETL Cluster; the ETL Cluster delivers Parquet + DDL to the Secure Data Mart Cluster)
STREAMING ENDPOINT
ELB / HTTP
Pipeline for processing of streaming logs
(Diagram: incoming events and state flow through the endpoint to the ETL Cluster)
SPARK STREAMING PIPELINE
(Diagram: Spark Streaming jobs write events and state to HDFS paths /path1/… through /path4/… and hand the data off to the ETL Cluster)
ETL PIPELINE
(Diagram: data from the Realtime Cluster and the Batch Induction Framework lands in HDFS and is registered in the Hive Metastore; state snapshots are built and the results flow on to the SDM Cluster)
DATA DELIVERY!
(Diagram: Parquet from the ETL Cluster lands in the SDM Cluster's HDFS and Hive Metastore; applications query through a Thrift/ODBC server; user schemas shown include Studio, Studio - Staging, and GDA Report)
PROS AND CONS OF OLD SYSTEM
Pros:
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
Cons:
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
NEW DYNAMIC DDL ARCHITECTURE
(Architecture diagram)
• Amazon S3 Bucket as the single store: Parquet + DDL, Aggregates, Events + State
• Real Time Cluster and Batch Induction Framework ingest from streaming and batch-download sources (Rest API, FTP downloads, S3 sync)
• Centralized Hive Metastore
• Ephemeral ETL Cluster
• Ephemeral Data Mart Clusters #1 through #N, serving Notebooks, Tableau, Plotly, Python, and R
• Parquet + DDL = Dynamic DDL!
Improvements
• Single copy of data
• Separate storage from compute
• Elastic clusters
• Single long running cluster to maintain
Dynamic DDL Deep Dive
NEW DYNAMIC DDL ARCHITECTURE
Streaming Cluster
(Diagram: HTTP traffic through an ELB into the pipeline for processing of streaming logs, with a transition to S3 and a centralized Hive Metastore)
For each topic, dynamically add the table structure and create the table, or insert the data into the table if it already exists.
DYNAMIC DDL
• What is Dynamic DDL?
• Dynamic DDL is adding structure (schema) to the data on the fly whenever the providers of the data change their structure.
• Why is Dynamic DDL needed?
• Providers of data change their structure constantly. Without Dynamic DDL, the table schema is hard coded and has to be manually updated based on the changes in the incoming data.
• All of the aggregation SQL would have to be manually updated due to the schema change.
• Faster turnaround for data ingestion. Data can be ingested and made available within minutes (sometimes seconds).
• How did we do this?
• Using Spark SQL/DataFrame
• See example
DYNAMIC DDL
• Example:
{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}}
Flatten the data first into a fixed schema (record_key, record_value, id, log_ts):
{"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"}
Then pivot into the dynamically generated schema:
SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE null END) AS data_record_state,
MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE null END) As data_record_last_name,
MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE null END) As data_record_first_name,
MAX(CASE WHEN record_key = 'city' THEN record_value ELSE null END) AS data_record_city,
id as data_record_id, log_ts as data_log_ts from test group by id, log_ts
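The flattening step itself appeared only as a diagram in the deck. Below is a minimal sketch in Spark/Scala of how it might be done, assuming Spark 2.x; the input path and object name are hypothetical, and the pivot reuses the SQL shown above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object FlattenExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-ddl-flatten")
      .enableHiveSupport()
      .getOrCreate()

    // Raw events look like: {"_data":{"record":{...},"log_ts":"..."}}
    val raw = spark.read.json("s3a://bucket/raw/events/")            // hypothetical path

    // Pull the nested record struct up and keep log_ts alongside it.
    val record = raw
      .select(col("_data.record").as("record"), col("_data.log_ts").as("log_ts"))
      .select("record.*", "log_ts")

    // Turn every attribute except the keys into (record_key, record_value) rows.
    val keyCols = record.columns.filterNot(Seq("id", "log_ts").contains)
    val flattened = keyCols.map { k =>
      record.select(
        lit(k).as("record_key"),
        col(k).cast("string").as("record_value"),
        col("id"),
        col("log_ts"))
    }.reduce(_ union _)

    // Pivot back to one row per (id, log_ts) with the SQL from the slide.
    flattened.createOrReplaceTempView("test")
    spark.sql("""
      SELECT MAX(CASE WHEN record_key = 'state'      THEN record_value END) AS data_record_state,
             MAX(CASE WHEN record_key = 'last_name'  THEN record_value END) AS data_record_last_name,
             MAX(CASE WHEN record_key = 'first_name' THEN record_value END) AS data_record_first_name,
             MAX(CASE WHEN record_key = 'city'       THEN record_value END) AS data_record_city,
             id AS data_record_id, log_ts AS data_log_ts
      FROM test GROUP BY id, log_ts""").show()
  }
}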
DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Code snippet of Dynamic DDL transforming new JSON attributes into relational columns
• Add the partition columns
• Manually create the table due to a bug in Spark (SPARK-9761)
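The code itself was shown as a screenshot; here is a hedged sketch of what that step might look like in Spark/Scala. The database, table, and partition column names are assumptions, and `df` stands for the incoming flattened DataFrame.

// Sketch only: names are hypothetical.
val table = "analytics.events_flat"
val partitionCols = Seq("year", "month", "day")

if (!spark.catalog.tableExists("analytics", "events_flat")) {
  // Create the table manually instead of letting Spark create it from the
  // DataFrame, because Spark's metadata can get out of sync with Hive's once
  // the table is altered later (SPARK-9761, cited in the speaker notes).
  val dataCols = df.schema.fields
    .filterNot(f => partitionCols.contains(f.name))
    .map(f => s"`${f.name}` ${f.dataType.simpleString}")
    .mkString(",\n  ")
  val partSpec = partitionCols.map(c => s"`$c` string").mkString(", ")

  spark.sql(
    s"""CREATE TABLE $table (
       |  $dataCols
       |)
       |PARTITIONED BY ($partSpec)
       |STORED AS PARQUET""".stripMargin)
}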
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new columns that exist in the incoming data frame but do not exist yet in the destination table.
This syntax no longer works after upgrading to Spark 2.x.
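A sketch of that column-diff step under the same assumptions (`df` is the incoming DataFrame, `table` the destination); the ALTER TABLE statement at the end is the syntax that early Spark 2.x releases stopped accepting.

// Find attributes present in the incoming batch but missing from the table.
val existingCols = spark.table(table).columns.map(_.toLowerCase).toSet
val newFields = df.schema.fields
  .filterNot(f => existingCols.contains(f.name.toLowerCase))

if (newFields.nonEmpty) {
  val colSpec = newFields
    .map(f => s"`${f.name}` ${f.dataType.simpleString}")
    .mkString(", ")
  // Accepted by Spark 1.6; rejected by the Spark SQL parser in early 2.x
  // releases, which is what the workarounds on the next slide address.
  spark.sql(s"ALTER TABLE $table ADD COLUMNS ($colSpec)")
}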
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Three temporary ways to solve the problem in Spark 2.x:
• Launch a HiveServer2 service, then use JDBC to call Hive to alter the table
• Use Spark to connect directly to the Hive Metastore, then update the metadata
• Update the Spark source code to support the ALTER TABLE syntax and repackage it
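A hedged sketch of the first workaround: hand the ALTER TABLE statement to a HiveServer2 instance over JDBC so that Hive, not Spark, parses it. The connection URL, database, and credentials are placeholders.

import java.sql.DriverManager

def alterViaHiveServer2(alterStatement: String): Unit = {
  // Register the Hive JDBC driver and open a HiveServer2 connection
  // (host, port, database, and user here are hypothetical).
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(
    "jdbc:hive2://hive-server:10000/analytics", "etl_user", "")
  try {
    val stmt = conn.createStatement()
    try stmt.execute(alterStatement)
    finally stmt.close()
  } finally {
    conn.close()
  }
}

// e.g. alterViaHiveServer2(s"ALTER TABLE $table ADD COLUMNS (`new_col` string)")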
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Project all columns from the table
Append the data into the destination table
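A minimal sketch of that projection-and-append step, again assuming `df` and `table` from the earlier snippets: select the incoming data in the destination table's column order, filling columns the batch does not carry with nulls, then append.

import org.apache.spark.sql.functions.{col, lit}

val target = spark.table(table)
val incomingCols = df.columns.map(_.toLowerCase).toSet

// Project every destination column, in order; missing ones become NULL.
val projected = df.select(target.schema.fields.map { f =>
  if (incomingCols.contains(f.name.toLowerCase))
    col(f.name).cast(f.dataType).as(f.name)
  else
    lit(null).cast(f.dataType).as(f.name)
}: _*)

projected.write.mode("append").insertInto(table)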
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new partition key
• Reprocessing the DDL table with a new partition key (tuning tips; see the sketch below)
• Choose the partition key wisely
• Use coalesce if there are too many partitions
• Use coalesce to control the number of job tasks
• Use a filter if the data is still too large
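A sketch of such a reprocessing run under assumed names: the table, the `event_date` partition key, the filter column, and the output path are all hypothetical, but the shape of the tuning follows the tips above.

import org.apache.spark.sql.functions.col

val source = spark.table("analytics.events_flat")

source
  .filter(col("data_log_ts") >= "2016-07-01")   // filter if the data is still too large
  .coalesce(200)                                // cap the number of tasks / output files
  .write
  .mode("overwrite")
  .partitionBy("event_date")                    // pick a low-cardinality partition key
  .parquet("s3a://bucket/warehouse/events_flat_by_date/")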
Using Cloud-Based Services
USING S3: WHAT IS S3?
• S3 is not a file system.
• S3 is an object store. Similar to a key-value store.
• S3 objects are presented in a hierarchical view but are not stored in that manner.
• S3 objects are stored with a key derived from a “path”.
• The key is used to fan out the objects across shards.
• The path is for display purposes only. Only the first 3 to 4 characters are used for sharding.
• S3 does not have strong transactional semantics but instead has eventual consistency.
• S3 is not appropriate for realtime updates.
• S3 is suited for longer term storage.
USING S3: BEHAVIORS
• S3 has similar behaviors to HDFS but even more extreme.
• Larger latencies
• Larger files/writes – Think GBs
• Write and read latencies are larger but the bandwidth is much larger with S3.
• Thus throughput can be increased with parallel writers (same latency, but more throughput through parallel operations).
• Partition your RDDs/DataFrames and increase your workers/executors to optimize the parallelism (see the sketch after this list).
• Each write/read has more overhead due to the web service calls.
• So use larger buffers.
• Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS.
• Collect data for longer durations before writing large buffers in parallel to S3.
• Retry logic – Writes to S3 can and will fail.
• Cannot stream to S3 – Complete files must be uploaded.
• Technically, you can simulate streaming with multipart upload.
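As a concrete illustration of the parallel-writer point above, a sketch with assumed paths and partition counts: repartition so each task produces one large object, and let many executors upload in parallel.

// Favor a modest number of large objects over many small ones: each of the
// 64 partitions becomes one Parquet file, uploaded in parallel by the executors.
val hourly = spark.read.json("s3a://bucket/raw/2016/07/20/00/")   // hypothetical input

hourly
  .repartition(64)   // ~64 parallel writers; size so each file is hundreds of MB
  .write
  .mode("append")
  .parquet("s3a://bucket/warehouse/events/")                      // hypothetical output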
USING S3: TIPS
• Tips for using S3 with HDFS
• Use the s3a scheme.
• Many optimizations including buffering options (disk-based, on-heap, or off-heap) and incremental parallel uploads (S3A Fast Upload); see the configuration sketch after this list.
• More here: http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
• Don’t use rename/move.
• Moves are great for HDFS to support better transactional semantics when streaming files.
• For S3, moves/renames are copy and delete operations, which can be very slow, especially due to the eventual consistency.
• Other advanced S3 techniques:
• Hash object names to better shard the objects in a bucket.
• Use multiple buckets to increase bandwidth.
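The s3a options above can be set from inside a Spark job; a sketch follows. Property names are from the hadoop-aws S3A documentation linked above, but the values shown are only illustrative and depend on the Hadoop version in use.

val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.fast.upload", "true")            // S3A Fast Upload: incremental parallel uploads
hc.set("fs.s3a.fast.upload.buffer", "disk")     // buffer to disk, on-heap ("array"), or off-heap ("bytebuffer")
hc.set("fs.s3a.multipart.size", "134217728")    // 128 MB parts to match large HDFS-sized buffers
hc.set("fs.s3a.connection.maximum", "100")      // allow more concurrent uploads from parallel writers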
QUESTIONS?
Q & A


Speaker Notes

  1. GoPro is the ultimate accessory for active people with smartphones. Products and services include: core products (cameras); advanced solutions (Karma, stabilization, and VR); accessories and mounts; and a software suite that ties it all together. GoPro has become the ultimate, end-to-end storytelling solution.
  2. This is what we wanted to build.
  3. These were the challenges to solve.
  4. High-level architecture of the Data Platform: isolation of workloads (3 clusters: ingest, ETL, delivery), Lambda architecture, input and output data formats, cadence of clusters.
     A word about data sources: IoT data is logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.; some raw and gzipped, some binary and JSON; some streaming and some batch. Batch data covers web marketing, campaigns, social media, ERP, and CRM.
     Lambda architecture: both batch and stream processing. Basic needs/workloads in a data platform: high-throughput ingestion; transformations (joins, aggregations, etc.); fast queries.
     Today we have 3 clusters to isolate these workloads. We started with one cluster (ETL) where everything ran: ingest (Flume), batch (Framework), ETL (Hive), analytical (Impala). That caused lots of resource contention (I/O, memory, cores), so we opted for 3 clusters to isolate the workloads:
     - Ingest cluster for near real-time streaming: Kafka, Spark Streaming (Cloudera Parcels); input: logs, output: JSON; minutes cadence, moving towards more real-time in seconds; plus an induction framework for scheduled batch ingestion.
     - ETL cluster for heavy-duty aggregation: input: JSON flat files, output: aggregated Parquet files; Hive (Map/Reduce); hourly cadence.
     - Secure Data Mart: Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers); input: compressed Parquet files; analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio).
     With all that said, we will examine newer technologies that will enable us to simplify our architecture and merge clusters in the future. Kudu is one possible technology that could help us consolidate some of the clusters.
  5. Let’s take a deeper dive into our streaming ingestion… Logs are streamed from devices and software applications (desktop and mobile) to web service endpoint Endpoint is an elastic pool of Tomcat servers sitting behind ELB in AWS Custom servlet pushes logs into Kafka topics by environment A series of Spark streaming jobs process the logs from Kafka Landing place in ingestion cluster is HDFS with JSON flat files Rationalization of tech stacks… Why Kafka? Unrivaled write throughput for a queue Traditional queue throughput: 100K writes/sec on the biggest box you can buy Kafka throughput: 1M writes/sec on 3-4 commodity servers Strong ordering policy of messages Distributed Fault-tolerant through replication Support synchronous and asynchronous writes Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks) Why Spark Streaming? Strong transactional semantics - "exactly once" processing Leverage Spark technology for both data ingest and analytics Horizontally scalable - High throughput for micro-batching Large open source community
  6. Keyword: impedance mismatch. As previously stated, logs are streamed from devices and software applications (desktop and mobile) to the web service endpoint. Logs are diverse (gzipped, raw, binary, JSON, batched events, streamed single events) and vary significantly in size from < 1 KB to > 1 MB. Logs are redirected based on data category and routed to the appropriate Kafka topic and its Spark Streaming job. Logs move from Kafka topic to Kafka topic, with each topic having a Spark Streaming job that consumes, processes, and writes the log to another topic; this forms a tree-like structure of jobs with more generic logic towards the root and more specialized logic towards the leaf nodes. There are generic jobs/services and specialized jobs/services: generic services include PII removal and hashing, IP-to-Geo lookups, and batched writing to HDFS (we perform batched HDFS writing since Kafka likes small messages, 1 KB ideal, and HDFS likes large files, 100+ MB); specialized services contain business logic. Finally, the logs are written into HDFS as JSON flat files (sometimes compressed depending on the type of data), and scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further, heavier aggregations.
  7. A few things: two flows of data (streaming and batch); join data sources; aggregate data sources; convert to compressed columnar format (gzipped Parquet files). The ETL cluster is where we do our heavy lifting: almost entirely Hive MapReduce jobs, with some Impala to make the really big, gnarly aggregations more performant. Previously we had a custom Java MapReduce job for sessionization of events; this has been replaced with a Spark Streaming job on the ingestion cluster, and in the future we want to push as much of the ETL processing back into the ingestion cluster for more real-time processing. We also have a custom Java induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.). The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore; the Parquet files are then copied via distcp to the Secure Data Mart.
  8. Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart. The Secure Data Mart is protected with Apache Sentry. Kerberos is used for authentication (corporate standard). Active Directory stores the groups (corporate standard). Access control is role based and the roles are assigned with Sentry; Hue has a Sentry UI app to manage authorization.
  9. Store data in one place: data (S3) + structure (Hive Metastore). Separate compute nodes from storage nodes. Elasticity in both the size of clusters and the number of clusters. Lower operational overhead of maintaining HDFS storage nodes. Redirect batch ingest into stream ingest (pump batch data into Kafka); result: one codebase for both stream and batch ingestion.
  10. Even though the fixed schema resolves the problem of data providers changing the data structure frequently, it becomes really difficult for analysts and data scientists to analyze the data: (1) how would they know that two rows are coming from one event? (2) they have to use the id to associate those rows.
  11. Code comment from the snippet: /* The table does not exist, so create it. We create it manually due to a bug in Spark where its metadata gets out of sync with Hive's metadata if we let Spark automatically create the table and we later alter the table and add columns to it. See https://issues.apache.org/jira/browse/SPARK-9761 for more details. */