SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Dynamic DDL
Adding Structure to Streaming Data on the Fly
OUR SPEAKERS
Hao Zou
Software Engineer
Data Science &Engineering
GoPro
David Winters
Big Data Architect
Data Science &Engineering
GoPro
TOPICS TO COVER
• Background and Business
• GoPro Data Platform Architecture
• Old File-based Pipeline Architecture
• New Dynamic DDLArchitecture
• Dynamic DDL Deep Dive
• Using Cloud-Based Services (Optional)
• Questions
Background and Business
WHEN WE GOT HERE…
DATA ANALYTICS WAS BASED ON WORD OF MOUTH (& THIS GUY)
TODAY’S BUSINESS
GoPro Data
Analytics
Platform
Consumer Devices GoPro Apps & Cloud
E-Commerce
Social Media
& OTT
CRM
Product Insight
User Segmentation
CRM/Marketing &
Personalization
ERP
Web
Mobile
DATA CHALLENGES AT GOPRO
• Variety of data - Hardware and Software products
• Software - Mobile and Desktop Apps
• Hardware - Cameras, Drones, Controllers,Accessories, etc.
• External - CRM, ERP, OTT, E-Commerce, Web, Social,etc.
• Variety of data ingestion mechanisms - LambdaArchitecture
• Real-time streaming pipeline - GoPro products
• Batch pipeline - External 3rd party systems
• Complex Transformations
• Data often stored in binary to conserve space in cameras
• Heterogeneous data formats (JSON, XML, and packed binary)
• Seamless Data Aggregations
• Blend data between different sources, hardware, and software
• Build structures which reflect state vs. event-based
• Handle Privacy & Anonymization
Data Platform Architecture
OLD FILE-BASED PIPELINE ARCHITECTURE
ETL Cluster
•Aggregations and
Joins
•Hive and Spark jobs
•Map/Reduce
•Airflow
Secure Data Mart
Cluster
• End User Query
• Impala / Sentry
• Parquet
• Kerberos & LDAP
Analytics Apps
•Hue
•Tableau
•Plotly
•Python
•R
Real Time Cluster
•Log file streaming
•RESTful service
•Kafka
•Spark Streaming
•HBase
Batch Induction
Framework
•Batch files
•Scheduled downloads
•Pre-processing
•Java App
•Airflow
JSON
JSON
Parquet
DDL
• Rest API
• FTP downloads
• S3 sync
Streaming
Batch
Download
STREAMING ENDPOINT
ELBHTTP
Pipeline for processing of streaming logs
To ETL Cluster
events
events
state
SPARK STREAMING PIPELINE
/path1/…
/path2/…
/path3/…
ToETL
Cluster
/path4/…
events
state
events
events
events
state
state
state
ETL PIPELINE
HDFS
Hive Metastore
To SDM Cluster
From Realtime Cluster
Batch
Induction
Framework
state
snapshot
DATA DELIVERY!
HDFS
Hive Metastore
Applications
Thrift
ODBC
Server
User
Studio
Studio -Staging
GDA
Report
SDM Cluster
From ETL Cluster
PROS AND CONS OF OLD SYSTEM
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
NEW DYNAMIC DDL ARCHITECTURE
Amazon S3
Bucket
Real Time Cluster
Batch
Induction
Framework
Hive
Metastore
Ephemeral
ETL
Cluster
Parquet
+
DDL
Aggregates
Events
+
State
Ephemeral
Data Mart
Cluster #1
Ephemeral
Data Mart
Cluster #2
Ephemeral
Data Mart
Cluster #N
• Rest API
• FTP downloads
• S3 sync
Streaming
Batch
Download
•Notebooks
•Tableau
•Plotly
•Python
•R
Improvements
Single copy of data
Separate storage from compute
Elastic clusters
Single long running cluster to maintain
Parquet
+
DDL
Dynamic
DDL!
Dynamic DDL Deep Dive
NEW DYNAMIC DDL ARCHITECTURE
Streaming Cluster
ELBHTTP
Pipeline for processing of streaming logs
S3
HIVE
METASTORE
transition
Centralized Hive MetaStore
For each topic, dynamically add thetable
structure and create the table or insert data
into the table if already exists
DYNAMIC DDL
• What is Dynamic DDL?
• Dynamic DDL is adding structure (schema)to the data on the fly wheneverthe providersofthe data are changing their
structure.
• Why is Dynamic DDL needed?
• Providersofdata are changingtheirstructure constantly.Without Dynamic DDL,the table schema ishard coded and hasto be
manually updatedbased on the changesofthe incoming data.
• All of the aggregation SQL would have to be manuallyupdated due to the schema change.
• Faster turnaroundforthe data ingestion. Data can be ingested and made available within minutes(sometimesseconds).
• How we did this?
• Using Spark SQL/Dataframe
• See example
DYNAMIC DDL
• Example:
{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-
20T00:06:01Z"}}
Fixed schema
Dynamically generated schema
{"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"}
Flatten thedata first
SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE null END) AS data_record_state,
MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE null END) As data_record_last_name,
MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE null END) As data_record_first_name,
MAX(CASE WHEN record_key = 'city' THEN record_value ELSE null END) AS data_record_city,
id as data_record_id, log_ts as data_log_ts from test group by id, log_ts
DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Code snippet of Dynamic DDL transforming new JSON attributes into relational columns
Add the partition columns
Manually create the table due to a bug in spark
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new columns that exist in the incoming data
frame but do not exist yet in the destination table
This syntax is not working anymore after upgrading to spark 2.x
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Three temporary way to solve the problem in spark 2.x:
• Launch a hiveserver2 service, then use jdbc call hive to alter the table
• Use spark to directly connect to hivemetastore, then update the metadata
• Update spark source code to supportAlter table syntax and repackage it
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Project all columns from the table
Append the data into the destination table
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new partition key
• Reprocessing the DDL Table with new partition Key (Tuning tips)
Choose the partition keywisely
Use coalesce ifthere too many partitions
Use Coalesce to control the job tasksUse filterif Data still too large
Using Cloud-based Services
USING S3: WHAT IS S3?
• S3 is not a file system.
• S3 is an object store. Similar to a key-value store.
• S3 objects are presented in a hierarchical view but are not stored in that manner.
• S3 objects are stored with a key derived from a “path”.
• The key is used to fan out the objects across shards.
• The path is for display purposes only. Only the first 3 to 4 characters are used
for sharding.
• S3 does not have strong transactional semantics but instead has eventual consistency.
• S3 is not appropriate for realtime updates.
• S3 is suited for longer term storage.
USING S3: BEHAVIORS
• S3 has similar behaviors to HDFS but even more extreme.
• Larger latencies
• Larger files/writes – Think GBs
• Write and read latencies are larger but the bandwidth is much larger with S3.
• Thus throughput can be increased with parallel writers (same latency but more
throughputthrough parallel operations)
• Partition your RDDs/DataFrames and increase your workers/executors to
optimize the parallelism.
• Each write/read has more overhead due to the web service calls.
• So use larger buffers.
• Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS.
• Collect data for longer durations before writing large buffers in parallel to S3.
• Retry logic – Writes to S3 can and will fail.
• Cannot stream to S3 – Complete files must be uploaded.
• Technically, you can simulate streaming with multipart upload.
USING S3: TIPS
• Tips for using S3 with HDFS
• Use the s3a scheme.
• Many optimizations including buffering options (disk-based, on-heap, or off-
heap) and incremental parallel uploads (S3A Fast Upload).
• More here: http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-
aws/index.html#S3A
• Don’t use rename/move.
• Moves are great for HDFS to support better transactional semantics when
streaming files.
• For S3, moves/renames are copy and delete operations which can be very slow
especially due to the eventual consistency.
• Other advanced S3 techniques:
• Hash object names to better shard the objects in a bucket.
• Use multiple buckets to increase bandwidth.
QUESTIONS?
Q & A

Weitere ähnliche Inhalte

Was ist angesagt?

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 

Was ist angesagt? (20)

Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analysts
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 

Ähnlich wie Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters and Hao Zou

Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...
DataWorks Summit
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 

Ähnlich wie Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters and Hao Zou (20)

Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
 
Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
Real-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQLReal-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQL
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 

Mehr von Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 

Kürzlich hochgeladen (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 

Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters and Hao Zou

  • 1. Dynamic DDL Adding Structure to Streaming Data on the Fly
  • 2. OUR SPEAKERS Hao Zou Software Engineer Data Science &Engineering GoPro David Winters Big Data Architect Data Science &Engineering GoPro
  • 3. TOPICS TO COVER • Background and Business • GoPro Data Platform Architecture • Old File-based Pipeline Architecture • New Dynamic DDLArchitecture • Dynamic DDL Deep Dive • Using Cloud-Based Services (Optional) • Questions
  • 5. WHEN WE GOT HERE… DATA ANALYTICS WAS BASED ON WORD OF MOUTH (& THIS GUY)
  • 7. GoPro Data Analytics Platform Consumer Devices GoPro Apps & Cloud E-Commerce Social Media & OTT CRM Product Insight User Segmentation CRM/Marketing & Personalization ERP Web Mobile
  • 8. DATA CHALLENGES AT GOPRO • Variety of data - Hardware and Software products • Software - Mobile and Desktop Apps • Hardware - Cameras, Drones, Controllers,Accessories, etc. • External - CRM, ERP, OTT, E-Commerce, Web, Social,etc. • Variety of data ingestion mechanisms - LambdaArchitecture • Real-time streaming pipeline - GoPro products • Batch pipeline - External 3rd party systems • Complex Transformations • Data often stored in binary to conserve space in cameras • Heterogeneous data formats (JSON, XML, and packed binary) • Seamless Data Aggregations • Blend data between different sources, hardware, and software • Build structures which reflect state vs. event-based • Handle Privacy & Anonymization
  • 10. OLD FILE-BASED PIPELINE ARCHITECTURE ETL Cluster •Aggregations and Joins •Hive and Spark jobs •Map/Reduce •Airflow Secure Data Mart Cluster • End User Query • Impala / Sentry • Parquet • Kerberos & LDAP Analytics Apps •Hue •Tableau •Plotly •Python •R Real Time Cluster •Log file streaming •RESTful service •Kafka •Spark Streaming •HBase Batch Induction Framework •Batch files •Scheduled downloads •Pre-processing •Java App •Airflow JSON JSON Parquet DDL • Rest API • FTP downloads • S3 sync Streaming Batch Download
  • 11. STREAMING ENDPOINT ELBHTTP Pipeline for processing of streaming logs To ETL Cluster events events state
  • 13. ETL PIPELINE HDFS Hive Metastore To SDM Cluster From Realtime Cluster Batch Induction Framework state snapshot
  • 15. PROS AND CONS OF OLD SYSTEM • Isolation of workloads • Fast ingest • Secure • Fast delivery/queries • Loosely coupled clusters • Multiple copies of data • Tightly coupled storage and compute • Lack of elasticity • Operational overhead of multiple clusters
  • 16. NEW DYNAMIC DDL ARCHITECTURE Amazon S3 Bucket Real Time Cluster Batch Induction Framework Hive Metastore Ephemeral ETL Cluster Parquet + DDL Aggregates Events + State Ephemeral Data Mart Cluster #1 Ephemeral Data Mart Cluster #2 Ephemeral Data Mart Cluster #N • Rest API • FTP downloads • S3 sync Streaming Batch Download •Notebooks •Tableau •Plotly •Python •R Improvements Single copy of data Separate storage from compute Elastic clusters Single long running cluster to maintain Parquet + DDL Dynamic DDL!
  • 18. NEW DYNAMIC DDL ARCHITECTURE Streaming Cluster ELBHTTP Pipeline for processing of streaming logs S3 HIVE METASTORE transition Centralized Hive MetaStore For each topic, dynamically add thetable structure and create the table or insert data into the table if already exists
  • 19. DYNAMIC DDL • What is Dynamic DDL? • Dynamic DDL is adding structure (schema)to the data on the fly wheneverthe providersofthe data are changing their structure. • Why is Dynamic DDL needed? • Providersofdata are changingtheirstructure constantly.Without Dynamic DDL,the table schema ishard coded and hasto be manually updatedbased on the changesofthe incoming data. • All of the aggregation SQL would have to be manuallyupdated due to the schema change. • Faster turnaroundforthe data ingestion. Data can be ingested and made available within minutes(sometimesseconds). • How we did this? • Using Spark SQL/Dataframe • See example
  • 20. DYNAMIC DDL • Example: {"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07- 20T00:06:01Z"}} Fixed schema Dynamically generated schema {"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"} {"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"} {"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"} {"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"} Flatten thedata first SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE null END) AS data_record_state, MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE null END) As data_record_last_name, MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE null END) As data_record_first_name, MAX(CASE WHEN record_key = 'city' THEN record_value ELSE null END) AS data_record_city, id as data_record_id, log_ts as data_log_ts from test group by id, log_ts
  • 21. DYNAMIC DDL USING SPARK SQL/DATAFRAME • Code snippet of Dynamic DDL transforming new JSON attributes into relational columns Add the partition columns Manually create the table due to a bug in spark
  • 22. DYNAMIC DDL USING SPARK SQL/DATAFRAME Add the new columns that exist in the incoming data frame but do not exist yet in the destination table This syntax is not working anymore after upgrading to spark 2.x
  • 23. DYNAMIC DDL USING SPARK SQL/DATAFRAME Three temporary way to solve the problem in spark 2.x: • Launch a hiveserver2 service, then use jdbc call hive to alter the table • Use spark to directly connect to hivemetastore, then update the metadata • Update spark source code to supportAlter table syntax and repackage it
  • 24. DYNAMIC DDL USING SPARK SQL/DATAFRAME Project all columns from the table Append the data into the destination table
  • 25. DYNAMIC DDL USING SPARK SQL/DATAFRAME Add the new partition key • Reprocessing the DDL Table with new partition Key (Tuning tips) Choose the partition keywisely Use coalesce ifthere too many partitions Use Coalesce to control the job tasksUse filterif Data still too large
  • 27. USING S3: WHAT IS S3? • S3 is not a file system. • S3 is an object store. Similar to a key-value store. • S3 objects are presented in a hierarchical view but are not stored in that manner. • S3 objects are stored with a key derived from a “path”. • The key is used to fan out the objects across shards. • The path is for display purposes only. Only the first 3 to 4 characters are used for sharding. • S3 does not have strong transactional semantics but instead has eventual consistency. • S3 is not appropriate for realtime updates. • S3 is suited for longer term storage.
  • 28. USING S3: BEHAVIORS • S3 has similar behaviors to HDFS but even more extreme. • Larger latencies • Larger files/writes – Think GBs • Write and read latencies are larger but the bandwidth is much larger with S3. • Thus throughput can be increased with parallel writers (same latency but more throughputthrough parallel operations) • Partition your RDDs/DataFrames and increase your workers/executors to optimize the parallelism. • Each write/read has more overhead due to the web service calls. • So use larger buffers. • Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS. • Collect data for longer durations before writing large buffers in parallel to S3. • Retry logic – Writes to S3 can and will fail. • Cannot stream to S3 – Complete files must be uploaded. • Technically, you can simulate streaming with multipart upload.
  • 29. USING S3: TIPS • Tips for using S3 with HDFS • Use the s3a scheme. • Many optimizations including buffering options (disk-based, on-heap, or off- heap) and incremental parallel uploads (S3A Fast Upload). • More here: http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop- aws/index.html#S3A • Don’t use rename/move. • Moves are great for HDFS to support better transactional semantics when streaming files. • For S3, moves/renames are copy and delete operations which can be very slow especially due to the eventual consistency. • Other advanced S3 techniques: • Hash object names to better shard the objects in a bucket. • Use multiple buckets to increase bandwidth.