SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Downloaden Sie, um offline zu lesen
2017 Software Architecture Summit
Scaling100PBDataWarehousein Cloud
Changshu LIU
Software Engineer @ Pinterest
10/2017
Pinterest :
The World’s Catalog
of Ideas
2017 Software Architecture Summit
Scale@Pinterest
Biz Scale
‱ 200M+ MAUs
‱ 120B+ Pins
‱ 3B+ Boards
Data Scale
‱ 140+ PB @ S3
‱ 6000+ Hive/Hadoop nodes
‱ 400+ Presto nodes
2017 Software Architecture Summit
Data@Pinterest
Agenda
1 Data Ingestion
2 Streaming
3 Hive
4 Presto
5 Workflow
2017 Software Architecture Summit
Data Ingestion - Overview
2017 Software Architecture Summit
Data Ingestion - Logging
Singer
- Logging Agent for Kafka
- Decouple logging and kafka
- Failure Isolation
- Buffering, Pooling
Merced
- Log Persistence and Transportation Service
- Kafka -> Storage Destinations
- S3, HBase and SQS
- Auto rebalancing, failure handling
https://hortonworks.com/apache/kafka/
2017 Software Architecture Summit
Data Ingestion - Full DB Dump
Tracker
● Daily db dump: Hadoop Streaming with mysqldump
● Hadoop node <-> MySQL node traffic pains
○ Balancing & Throttling
● Performance pains
○ Remote mysqldump
○ Python streaming
2017 Software Architecture Summit
Data Ingestion - Inc DB Dump
WaterMill
● Maxwell tails MySQL bin logs and writes to Kafka
○ Tail bin logs and write to kafka as json
● Mini-batch job to apply updates on base version
○ Distribute records by db shard id
○ Small (update) table join Big (base) table
● Alias table points to the latest snapshot
2017 Software Architecture Summit
Data Ingestion - Stats
Kafka Stats
● ~150B Messages/Day
● ~400T Bytes/Day
● ~1000 Kafka Topics
● ~1400 Kafka Brokers
Doctor Kafka https://github.com/pinterest/doctorkafka
DB Dump Stats
● ~80 Table/Day
● ~100T Bytes/Day
2017 Software Architecture Summit
Data Processing Overview
Agenda
1 Data Ingestion
2 Streaming
3 Hive
4 Presto
5 Workflow
2017 Software Architecture Summit
Streaming - Overview
2017 Software Architecture Summit
Streaming - Voracity
● Schema Validation
○ Filter out malformed records
● Late Arrival Event Handling
○ Move late event to proper bucket
● Parquet Conversion
○ Json/Thrift -> Parquet
Agenda
1 Data Ingestion
2 Streaming
3 Hive
4 Presto
5 Workflow
2017 Software Architecture Summit
Hive - Overview
AdHoc Query
Scheduled Query
2017 Software Architecture Summit
Hive - Scheduled Query
INSERT OVERWRITE DIRECTORY 's3://hive_query_output/cluster_name/20171019/application_1508263025827_2301/0'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '001'
COLLECTION ITEMS TERMINATED BY '002'
MAP KEYS TERMINATED BY '003'
LINES TERMINATED BY 'n'
STORED AS TEXTFILE
SELECT country, count(*)
FROM db_users
GROUP BY country
SELECT country, count(*)
FROM db_users
GROUP BY country
Rewrite
● Query is submitted by System
● Based on Hive CLI
● Talking to MetaStore (MySQL)
● YARN REST API
○ DistShell: AppMaster + Worker
■ Hive CLI as pure shell command
○ Get stdout/stderr from NM REST API
○ Rewrite select query to insert to s3
■ Container log is not stable
2017 Software Architecture Summit
Hive - AdHoc Query
● Query is Submitted by Human
○ Through beeline/tabeau etc
● HiveServer2 is the Query Submission Service
○ via JDBC/ODBC
○ Zookeeper based HA
● Talking to Hive MetaStore Server
○ HiveServer2 is a shared resource
● LDAP for Authentication
○ Impersonation for 3rd party tool
● Hive hook to Audit all queries
2017 Software Architecture Summit
Hive - Cloud Integration
FileSinkOperator - Where S3 doesn't fit
● Kind of “2-phase commit” protocol
● Phase 1 - each task attempt will write to a unique tmp output path
hdfs://${staging_dir}/${query_id}/
_task_tmp.-ext-${stage_id}/_tmp.${task_id}_${attpt_id}
_tmp.-ext-${stage_id}/${task_id}_${attempt_id}
Other Hive Operator
FileSinkOperator
process()
closeOp()



...
MapRed Task
Rename
Write
2017 Software Architecture Summit
Hive - Cloud Integration
FileSinkOperator - Where S3 doesn't fit
● Kind of “2-phase commit” protocol
● Phase 2 - stage controller (HiveDriver) rename query temp to final path
hdfs://${staging_dir}_${query_id}/
_task_tmp.-ext-${stage_id}/_tmp.${task_id}_${attpt_id}
_tmp.-ext-${stage_id}/
${task_id}_0
${task_id}_1
${task_id}_2
${task_id}
hdfs://${staging_dir}_${query_id}/_tmp.-ext-${stage_id}
hdfs://hive_db_name/tab_name/dt=20171026
Other Hive Stage
Output Stage
jobCloseOp()
mvFileToFinalPath



...
HiveDriver
Invoke
2017 Software Architecture Summit
Hive - Cloud Integration
FileSinkOperator - Where S3 doesn't fit
● Fast & Atomic file/dir rename operation is the key
● Unfortunately, rename on s3 is copy & delete
2017 Software Architecture Summit
Hive - Direct Committer @ S3
Direct Output Committer
● Each task attempt writes to final output directly
○ {target folder}/{task_id}
● Consequences
○ Query Level Failure -> partial output in target folder
■ Cleanup target folder in the very beginning of each query
■ Disable INSERT INTO (using INSERT OVERWRITE Q1 UNION ALL Q2)
○ Task Level Failure -> FileAlreadyExistsException
■ File open mode CREATE -> OVERWRITE (ORC & Parquet)
○ Speculative Execution -> corrupted task output
■ Cause: outWriter.close() in killed task attempt
● S3A write to local disk first and then upload to s3 in close()
■ Solution: Disable speculative execution
2017 Software Architecture Summit
Hive - Speculative Exec @ S3
Direct Committer with Speculative Execution
● S3A Staging Committer (originated from Netflix)
○ Based on S3 multipart uploading 1
○ Decouple upload & commit, simulate write to tmp & rename to final
○ Being actively worked in Apache: HADOOP-13786
● S3A Hive Committer (originated from Pinterest, on-going)
○ Ignore the MapReduce commit protocol
○ Allow different task attempts to commit at the same time
○ Stage controller (HiveDriver) dedup multiple attempts for the same task
2017 Software Architecture Summit
Hive - Dynamic Partition @ S3
New Challenges
● Target folder path is only known at runtime
● Race condition among multiple FileSinkOperators
● Discover new dynamic partitions to update Metastore
Solutions
● Encode the ts into output file name: {target_part_folder}/{dpCtx.queryStartTs}_{task_id}
● New dynamic partition: folders that
○ Modification time > query start time
○ Contains files starting with query_start_ts
Agenda
1 Data Ingestion
2 Streaming
3 Hive
4 Presto
5 Workflow
2017 Software Architecture Summit
Presto - Overview
2017 Software Architecture Summit
Presto - Cluster & Stats
Cluster Management
● Instance type: r3.8xl
● Computing only, no local data
● Using AWS ASG
● Managed by Terraform
Presto Stats
● ~800 users
● ~10k queries daily
● 5X ~ 50X faster than Hive
● AdHoc (250 node): P90: 4min, P99: 30min
● App (150 node): P90: 2min, P99: 20min
2017 Software Architecture Summit
Presto - Feature Challenges
Concurrency Bug in ObjectInspector library
● Hive runtime is for single thread env
● Presto is highly concurrent and shared
Thrift Table Support
● Presto Hive MetaStore client doesn’t support “dynamic” table
● Deeply nested table will hang in planning phase
Schema Evolution Support
● Historical partition schema mismatch
● Small int -> big int
● Parquet table column rename
2017 Software Architecture Summit
Presto - Security Enhancement
Access Control
● Check LDAP group for proxy user for access rights
● Data Level Access is controlled by IAM role
Security Features
● Authentication
○ Integrated middle service with Impersonation
● Audit
○ Every Query is Logged
○ Plugin based on EventListener
2017 Software Architecture Summit
Presto - Cluster Operations
Separate AdHoc & Application Cluster
● Adhoc cluster is generally less stable
● App cluster only run well known queries
Bad Query Monitor
● Detected by bot based on runtime and splits #
● Kill the query and send slack message to end user
Bad Node Monitor
● Detected by bot based on query failure due to node issue stats
○ Bad node - 3 query failed on it in last 5 min
● Specially important when the cluster is scaled up & down on a daily basis
Agenda
1 Data Ingestion
2 Streaming
3 Hive
4 Presto
5 Workflow
2017 Software Architecture Summit
Workflow - Programming
Workflow Prgm Framework
● Skyline
● DataJob
● Pinflow
2017 Software Architecture Summit
Data Architecture - Manager
Workflow Manager
● Pinball
● https://github.com/pinterest/pinball
Workflow Stats
● 2000+ Workflows
● 60k+ Jobs
● ~80% Hive
● ~15% Cascading/Spark
● ~5% Python
2017 Software Architecture Summit
The World’s Catalog of Ideas
csliu@pinterest.com
Discover
Ideas
Singer Architecture
https://www.slideshare.net/DiscoverPinterest/singer-pinterests-logging-infrastructure
Merced Architecture
https://github.com/pinterest/secor
Pinball Architecture
https://github.com/pinterest/pinball

Weitere Àhnliche Inhalte

Was ist angesagt?

Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...
Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...
Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...
InfluxData
 
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
confluent
 
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward
 

Was ist angesagt? (20)

cLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHousecLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHouse
 
Norikra Recent Updates
Norikra Recent UpdatesNorikra Recent Updates
Norikra Recent Updates
 
Streaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQLStreaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQL
 
20180503 kube con eu kubernetes metrics deep dive
20180503 kube con eu   kubernetes metrics deep dive20180503 kube con eu   kubernetes metrics deep dive
20180503 kube con eu kubernetes metrics deep dive
 
Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...
Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...
Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Spac...
 
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
 
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Unify Enterprise Data Processing System Platform Level Integration of Flink a...Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheus
 
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne...
 
Sprint 38 review
Sprint 38 reviewSprint 38 review
Sprint 38 review
 
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
 
Maximilian Michels - Flink and Beam
Maximilian Michels - Flink and BeamMaximilian Michels - Flink and Beam
Maximilian Michels - Flink and Beam
 
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
 
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
 
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
 
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
 

Ähnlich wie Scaling 100PB Data Warehouse in Cloud

A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 

Ähnlich wie Scaling 100PB Data Warehouse in Cloud (20)

Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
 
Sprint 78
Sprint 78Sprint 78
Sprint 78
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Sprint 17
Sprint 17Sprint 17
Sprint 17
 
Massively Scaled High Performance Web Services with PHP
Massively Scaled High Performance Web Services with PHPMassively Scaled High Performance Web Services with PHP
Massively Scaled High Performance Web Services with PHP
 
KubernetesäžŠă§ć‹•äœœă™ă‚‹æ©Ÿæą°ć­Šçż’ăƒąă‚žăƒ„ăƒŒăƒ«ăźé…äżĄïŒ†çźĄç†ćŸș盀Rekcurd に぀いお
KubernetesäžŠă§ć‹•äœœă™ă‚‹æ©Ÿæą°ć­Šçż’ăƒąă‚žăƒ„ăƒŒăƒ«ăźé…äżĄïŒ†çźĄç†ćŸș盀Rekcurd に぀いおKubernetesäžŠă§ć‹•äœœă™ă‚‹æ©Ÿæą°ć­Šçż’ăƒąă‚žăƒ„ăƒŒăƒ«ăźé…äżĄïŒ†çźĄç†ćŸș盀Rekcurd に぀いお
KubernetesäžŠă§ć‹•äœœă™ă‚‹æ©Ÿæą°ć­Šçż’ăƒąă‚žăƒ„ăƒŒăƒ«ăźé…äżĄïŒ†çźĄç†ćŸș盀Rekcurd に぀いお
 
Heroku to Kubernetes & Gihub to Gitlab success story
Heroku to Kubernetes & Gihub to Gitlab success storyHeroku to Kubernetes & Gihub to Gitlab success story
Heroku to Kubernetes & Gihub to Gitlab success story
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applications
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
How Docker Accelerates Continuous Development at ironSource: Containers #101 ...
How Docker Accelerates Continuous Development at ironSource: Containers #101 ...How Docker Accelerates Continuous Development at ironSource: Containers #101 ...
How Docker Accelerates Continuous Development at ironSource: Containers #101 ...
 
Sprint 62
Sprint 62Sprint 62
Sprint 62
 
#RADC4L16: An API-First Archives Approach at NPR
#RADC4L16: An API-First Archives Approach at NPR#RADC4L16: An API-First Archives Approach at NPR
#RADC4L16: An API-First Archives Approach at NPR
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applications
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Sprint 66
Sprint 66Sprint 66
Sprint 66
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
 

KĂŒrzlich hochgeladen

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

KĂŒrzlich hochgeladen (20)

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Scaling 100PB Data Warehouse in Cloud

  • 1. 2017 Software Architecture Summit Scaling100PBDataWarehousein Cloud Changshu LIU Software Engineer @ Pinterest 10/2017
  • 2. Pinterest : The World’s Catalog of Ideas
  • 3. 2017 Software Architecture Summit Scale@Pinterest Biz Scale ‱ 200M+ MAUs ‱ 120B+ Pins ‱ 3B+ Boards Data Scale ‱ 140+ PB @ S3 ‱ 6000+ Hive/Hadoop nodes ‱ 400+ Presto nodes
  • 4. 2017 Software Architecture Summit Data@Pinterest
  • 5. Agenda 1 Data Ingestion 2 Streaming 3 Hive 4 Presto 5 Workflow
  • 6. 2017 Software Architecture Summit Data Ingestion - Overview
  • 7. 2017 Software Architecture Summit Data Ingestion - Logging Singer - Logging Agent for Kafka - Decouple logging and kafka - Failure Isolation - Buffering, Pooling Merced - Log Persistence and Transportation Service - Kafka -> Storage Destinations - S3, HBase and SQS - Auto rebalancing, failure handling https://hortonworks.com/apache/kafka/
  • 8. 2017 Software Architecture Summit Data Ingestion - Full DB Dump Tracker ● Daily db dump: Hadoop Streaming with mysqldump ● Hadoop node <-> MySQL node traffic pains ○ Balancing & Throttling ● Performance pains ○ Remote mysqldump ○ Python streaming
  • 9. 2017 Software Architecture Summit Data Ingestion - Inc DB Dump WaterMill ● Maxwell tails MySQL bin logs and writes to Kafka ○ Tail bin logs and write to kafka as json ● Mini-batch job to apply updates on base version ○ Distribute records by db shard id ○ Small (update) table join Big (base) table ● Alias table points to the latest snapshot
  • 10. 2017 Software Architecture Summit Data Ingestion - Stats Kafka Stats ● ~150B Messages/Day ● ~400T Bytes/Day ● ~1000 Kafka Topics ● ~1400 Kafka Brokers Doctor Kafka https://github.com/pinterest/doctorkafka DB Dump Stats ● ~80 Table/Day ● ~100T Bytes/Day
  • 11. 2017 Software Architecture Summit Data Processing Overview
  • 12. Agenda 1 Data Ingestion 2 Streaming 3 Hive 4 Presto 5 Workflow
  • 13. 2017 Software Architecture Summit Streaming - Overview
  • 14. 2017 Software Architecture Summit Streaming - Voracity ● Schema Validation ○ Filter out malformed records ● Late Arrival Event Handling ○ Move late event to proper bucket ● Parquet Conversion ○ Json/Thrift -> Parquet
  • 15. Agenda 1 Data Ingestion 2 Streaming 3 Hive 4 Presto 5 Workflow
  • 16. 2017 Software Architecture Summit Hive - Overview AdHoc Query Scheduled Query
  • 17. 2017 Software Architecture Summit Hive - Scheduled Query INSERT OVERWRITE DIRECTORY 's3://hive_query_output/cluster_name/20171019/application_1508263025827_2301/0' ROW FORMAT DELIMITED FIELDS TERMINATED BY '001' COLLECTION ITEMS TERMINATED BY '002' MAP KEYS TERMINATED BY '003' LINES TERMINATED BY 'n' STORED AS TEXTFILE SELECT country, count(*) FROM db_users GROUP BY country SELECT country, count(*) FROM db_users GROUP BY country Rewrite ● Query is submitted by System ● Based on Hive CLI ● Talking to MetaStore (MySQL) ● YARN REST API ○ DistShell: AppMaster + Worker ■ Hive CLI as pure shell command ○ Get stdout/stderr from NM REST API ○ Rewrite select query to insert to s3 ■ Container log is not stable
  • 18. 2017 Software Architecture Summit Hive - AdHoc Query ● Query is Submitted by Human ○ Through beeline/tabeau etc ● HiveServer2 is the Query Submission Service ○ via JDBC/ODBC ○ Zookeeper based HA ● Talking to Hive MetaStore Server ○ HiveServer2 is a shared resource ● LDAP for Authentication ○ Impersonation for 3rd party tool ● Hive hook to Audit all queries
  • 19. 2017 Software Architecture Summit Hive - Cloud Integration FileSinkOperator - Where S3 doesn't fit ● Kind of “2-phase commit” protocol ● Phase 1 - each task attempt will write to a unique tmp output path hdfs://${staging_dir}/${query_id}/ _task_tmp.-ext-${stage_id}/_tmp.${task_id}_${attpt_id} _tmp.-ext-${stage_id}/${task_id}_${attempt_id} Other Hive Operator FileSinkOperator process() closeOp() 


... MapRed Task Rename Write
  • 20. 2017 Software Architecture Summit Hive - Cloud Integration FileSinkOperator - Where S3 doesn't fit ● Kind of “2-phase commit” protocol ● Phase 2 - stage controller (HiveDriver) rename query temp to final path hdfs://${staging_dir}_${query_id}/ _task_tmp.-ext-${stage_id}/_tmp.${task_id}_${attpt_id} _tmp.-ext-${stage_id}/ ${task_id}_0 ${task_id}_1 ${task_id}_2 ${task_id} hdfs://${staging_dir}_${query_id}/_tmp.-ext-${stage_id} hdfs://hive_db_name/tab_name/dt=20171026 Other Hive Stage Output Stage jobCloseOp() mvFileToFinalPath 


... HiveDriver Invoke
  • 21. 2017 Software Architecture Summit Hive - Cloud Integration FileSinkOperator - Where S3 doesn't fit ● Fast & Atomic file/dir rename operation is the key ● Unfortunately, rename on s3 is copy & delete
  • 22. 2017 Software Architecture Summit Hive - Direct Committer @ S3 Direct Output Committer ● Each task attempt writes to final output directly ○ {target folder}/{task_id} ● Consequences ○ Query Level Failure -> partial output in target folder ■ Cleanup target folder in the very beginning of each query ■ Disable INSERT INTO (using INSERT OVERWRITE Q1 UNION ALL Q2) ○ Task Level Failure -> FileAlreadyExistsException ■ File open mode CREATE -> OVERWRITE (ORC & Parquet) ○ Speculative Execution -> corrupted task output ■ Cause: outWriter.close() in killed task attempt ● S3A write to local disk first and then upload to s3 in close() ■ Solution: Disable speculative execution
  • 23. 2017 Software Architecture Summit Hive - Speculative Exec @ S3 Direct Committer with Speculative Execution ● S3A Staging Committer (originated from Netflix) ○ Based on S3 multipart uploading 1 ○ Decouple upload & commit, simulate write to tmp & rename to final ○ Being actively worked in Apache: HADOOP-13786 ● S3A Hive Committer (originated from Pinterest, on-going) ○ Ignore the MapReduce commit protocol ○ Allow different task attempts to commit at the same time ○ Stage controller (HiveDriver) dedup multiple attempts for the same task
  • 24. 2017 Software Architecture Summit Hive - Dynamic Partition @ S3 New Challenges ● Target folder path is only known at runtime ● Race condition among multiple FileSinkOperators ● Discover new dynamic partitions to update Metastore Solutions ● Encode the ts into output file name: {target_part_folder}/{dpCtx.queryStartTs}_{task_id} ● New dynamic partition: folders that ○ Modification time > query start time ○ Contains files starting with query_start_ts
  • 25. Agenda 1 Data Ingestion 2 Streaming 3 Hive 4 Presto 5 Workflow
  • 26. 2017 Software Architecture Summit Presto - Overview
  • 27. 2017 Software Architecture Summit Presto - Cluster & Stats Cluster Management ● Instance type: r3.8xl ● Computing only, no local data ● Using AWS ASG ● Managed by Terraform Presto Stats ● ~800 users ● ~10k queries daily ● 5X ~ 50X faster than Hive ● AdHoc (250 node): P90: 4min, P99: 30min ● App (150 node): P90: 2min, P99: 20min
  • 28. 2017 Software Architecture Summit Presto - Feature Challenges Concurrency Bug in ObjectInspector library ● Hive runtime is for single thread env ● Presto is highly concurrent and shared Thrift Table Support ● Presto Hive MetaStore client doesn’t support “dynamic” table ● Deeply nested table will hang in planning phase Schema Evolution Support ● Historical partition schema mismatch ● Small int -> big int ● Parquet table column rename
  • 29. 2017 Software Architecture Summit Presto - Security Enhancement Access Control ● Check LDAP group for proxy user for access rights ● Data Level Access is controlled by IAM role Security Features ● Authentication ○ Integrated middle service with Impersonation ● Audit ○ Every Query is Logged ○ Plugin based on EventListener
  • 30. 2017 Software Architecture Summit Presto - Cluster Operations Separate AdHoc & Application Cluster ● Adhoc cluster is generally less stable ● App cluster only run well known queries Bad Query Monitor ● Detected by bot based on runtime and splits # ● Kill the query and send slack message to end user Bad Node Monitor ● Detected by bot based on query failure due to node issue stats ○ Bad node - 3 query failed on it in last 5 min ● Specially important when the cluster is scaled up & down on a daily basis
  • 31. Agenda 1 Data Ingestion 2 Streaming 3 Hive 4 Presto 5 Workflow
  • 32. 2017 Software Architecture Summit Workflow - Programming Workflow Prgm Framework ● Skyline ● DataJob ● Pinflow
  • 33. 2017 Software Architecture Summit Data Architecture - Manager Workflow Manager ● Pinball ● https://github.com/pinterest/pinball Workflow Stats ● 2000+ Workflows ● 60k+ Jobs ● ~80% Hive ● ~15% Cascading/Spark ● ~5% Python
  • 34. 2017 Software Architecture Summit The World’s Catalog of Ideas csliu@pinterest.com