Scalable data pipeline

Be ready for big data challenges. This material is based on a talk by Leonid Sokolov, Big Data Architect at GreenM. Full article: https://medium.com/greenm/scalable-data-pipeline-f5d3c8f7a6d9

Scalable Data Pipeline
Be ready for big data challenges
by Leonid Sokolov
Big Data Architect at greenm.io

Data Pipeline Challenges
• Complex workflow management
• AWS Athena doesn’t scale well

Basic Data Actions
• Extract
• Transform
• Load

[Diagram: each of the Extract, Transform, and Load stages is broken down into three steps: Input, Process, Output.]

Extract
Database: Partition by Integer Key

Database: Partition by Varchar Key
[Diagram: with an integer key, the next range boundary is the current key plus a fixed step (12569933 + 1000000 = 13569933). With a varchar key, the step and the next boundary are unknown (+ ??? = ???). Key values shown: 13680800900000083645ec0c06727066d249cfd01, ffffffcb586df761f63f561d946ac7c5.]

Database: Varchar key
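
The contrast between the two key types can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, the varchar keys are assumed to be hexadecimal strings (as the sample value ffffffcb586df761f63f561d946ac7c5 suggests), and bucketing by leading characters is one common workaround, not necessarily the approach used in the talk.

```python
from typing import List, Tuple


def integer_ranges(min_key: int, max_key: int, step: int) -> List[Tuple[int, int]]:
    """Split [min_key, max_key] into half-open ranges [lo, hi) of width `step`."""
    ranges = []
    lo = min_key
    while lo <= max_key:
        hi = lo + step
        ranges.append((lo, hi))
        lo = hi
    return ranges


def hex_prefix_buckets(prefix_len: int = 1) -> List[str]:
    """Enumerate every hex prefix of the given length, one bucket per prefix."""
    return [format(i, "0{}x".format(prefix_len)) for i in range(16 ** prefix_len)]


def bucket_for(key: str, prefix_len: int = 1) -> str:
    """Return the bucket (leading hex characters) that a varchar key falls into."""
    return key[:prefix_len].lower()


# Integer key: boundaries follow from simple arithmetic.
int_parts = integer_ranges(12569933, 14569932, 1000000)

# Varchar key: there is no "+ step", so each prefix becomes one partition predicate.
hex_parts = hex_prefix_buckets(1)
```

Each prefix can then drive one parallel read (for example a `LIKE 'f%'` predicate), which is the design choice assumed here: it trades perfectly even ranges for boundaries that can be computed without scanning the table.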

Files
• Use scalable storage: HDFS, S3
• Use multiple files
• Use splittable file formats and compression

Format        Splittable
CSV           Yes*
JSON          Yes**
Parquet       Yes

Compression   Splittable
gzip          No
bzip2         Yes
Snappy        No

* CSV is splittable when it is a raw, uncompressed file or uses a splittable compression format such as bzip2.
** JSON follows the same splittability rules as CSV when compressed, with one extra difference: when the “wholeFile” option is set to true (see SPARK-18352), JSON is NOT splittable.
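
The two tables and their footnotes can be combined into one lookup. The sketch below is a hypothetical helper (`is_splittable` and its parameters are not part of Spark or Hadoop); it also encodes the assumption that Parquet stays splittable with any codec because it compresses data inside the file rather than the file as a whole.

```python
from typing import Optional

# Codecs that keep a whole text file splittable, per the compression table.
SPLITTABLE_CODECS = {"bzip2"}
# Formats that compress internally, so the file itself remains splittable.
CONTAINER_FORMATS = {"parquet"}
# Line-oriented text formats, which depend on the file-level codec.
TEXT_FORMATS = {"csv", "json"}


def is_splittable(fmt: str, compression: Optional[str] = None,
                  whole_file_json: bool = False) -> bool:
    """Return True if a file of this format and codec can be read in parallel splits."""
    fmt = fmt.lower()
    codec = compression.lower() if compression else None
    if fmt in CONTAINER_FORMATS:
        return True
    if fmt not in TEXT_FORMATS:
        raise ValueError("unknown format: {}".format(fmt))
    if fmt == "json" and whole_file_json:
        # One JSON document per file cannot be cut at line boundaries.
        return False
    return codec is None or codec in SPLITTABLE_CODECS
```

For example, `is_splittable("csv", "gzip")` is false while `is_splittable("parquet", "snappy")` is true, which is why Parquet with Snappy is still a reasonable output choice.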

• Use scalable storage: HDFS, S3
• Spark on S3: mapreduce.fileoutputcommitter.algorithm.version = 2
• EMR 5.20.0 or later
• Use multiple files (ideally the same number as in the input)
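
One way to apply the committer setting is to pass it on the spark-submit command line. The helper below is a hypothetical sketch that only builds the argument list; it relies on the fact that Spark forwards properties prefixed with `spark.hadoop.` to the Hadoop configuration.

```python
from typing import Dict, List

# Hadoop property from the slide, prefixed so that Spark passes it to Hadoop.
S3_OUTPUT_SETTINGS = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def spark_submit_conf(settings: Dict[str, str]) -> List[str]:
    """Turn a settings dictionary into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key in sorted(settings):
        args.extend(["--conf", "{}={}".format(key, settings[key])])
    return args


conf_args = spark_submit_conf(S3_OUTPUT_SETTINGS)
```

The resulting list can be appended to whatever command launches the job, for example an EMR step definition.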

Extract Results
[Chart: Extract Time (minutes), Before vs. After; axis from 0 to 80]
• EMR 5.20.0 with 10 instances (c4.4xlarge)
• Input: 3 databases (MS SQL), ~400 GB (raw data)
• Output: Parquet (Snappy), ~100 GB

Transform
[Diagram: Data Skew, shown as two charts of Volume of Data per Partitions]
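
Data skew can be made concrete by counting rows per partition. The following is a minimal sketch with hypothetical names: it assumes hash partitioning on a string key and uses CRC32 only as a deterministic stand-in for the engine's real hash function.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def partition_sizes(keys: Iterable[str], num_partitions: int) -> List[int]:
    """Count rows per partition when rows are assigned by hash of the key."""
    counts = Counter(zlib.crc32(k.encode("utf-8")) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes: List[int]) -> float:
    """Largest partition divided by the mean partition size (1.0 means balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical data: one "hot" key accounts for 90 of 100 rows.
keys = ["hot"] * 90 + ["k{}".format(i) for i in range(10)]
sizes = partition_sizes(keys, 4)
ratio = skew_ratio(sizes)
```

Because every row with the hot key lands in the same partition, one task ends up with at least 90 rows while the mean is 25, so the slowest task dominates the stage time.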

Shuffle Join
SELECT * FROM Encounters e JOIN Providers p ON e.ProviderId = p.ProviderId
[Diagram stages: Map, Shuffle, Reduce]

Broadcast Join
SELECT * FROM Encounters e JOIN Providers p ON e.ProviderId = p.ProviderId
[Diagram labels: Map, Reduce, Broadcast, Collect]
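
The difference between the two join strategies can be simulated on in-memory partitions. This is a simplified sketch, not Spark's implementation: the function names are hypothetical, rows are (ProviderId, payload) tuples, and "moved" counts one copy of the small table per big-table partition, whereas a real cluster broadcasts once per executor.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]            # (ProviderId, payload)
Joined = Tuple[int, str, str]    # (ProviderId, encounter payload, provider payload)


def shuffle_join(big: List[List[Row]], small: List[List[Row]],
                 num_partitions: int) -> Tuple[List[Joined], int]:
    """Repartition both sides by key, then join inside each partition."""
    moved = 0
    big_parts: Dict[int, List[Row]] = defaultdict(list)
    small_parts: Dict[int, List[Row]] = defaultdict(list)
    for parts, target in ((big, big_parts), (small, small_parts)):
        for partition in parts:
            for row in partition:
                target[row[0] % num_partitions].append(row)
                moved += 1  # every row of both tables crosses the shuffle
    result: List[Joined] = []
    for p in range(num_partitions):
        lookup: Dict[int, List[str]] = defaultdict(list)
        for key, payload in small_parts.get(p, []):
            lookup[key].append(payload)
        for key, payload in big_parts.get(p, []):
            for other in lookup.get(key, []):
                result.append((key, payload, other))
    return result, moved


def broadcast_join(big: List[List[Row]],
                   small: List[List[Row]]) -> Tuple[List[Joined], int]:
    """Collect the small table, copy it to every big partition, and join in place."""
    lookup: Dict[int, List[str]] = defaultdict(list)
    small_rows = 0
    for partition in small:
        for key, payload in partition:
            lookup[key].append(payload)
            small_rows += 1
    moved = small_rows * len(big)  # only the small table is copied
    result: List[Joined] = []
    for partition in big:
        for key, payload in partition:
            for other in lookup.get(key, []):
                result.append((key, payload, other))
    return result, moved


encounters = [[(1, "e1"), (2, "e2")], [(1, "e3"), (3, "e4")]]
providers = [[(1, "p1")], [(2, "p2")]]
shuffled, shuffle_moved = shuffle_join(encounters, providers, 2)
broadcasted, broadcast_moved = broadcast_join(encounters, providers)
```

Both functions return the same joined rows; the broadcast variant never moves the big table, which is the reason it pays off when one side is small.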

Reduce Shuffles
• Use data partitioning, bucketing, and sorting
• Broadcast small tables when joining them to a big table
• spark.sql.autoBroadcastJoinThreshold = 10485760 (10 MB, default)
• Use COUNT(key) instead of COUNT(DISTINCT key) if possible
• Drop unused data
• Filter/reduce before joining
• Cache Datasets that are used multiple times
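
The role of the broadcast threshold can be illustrated with a simplified decision function. This is a hypothetical model of the planner's choice, not Spark code: it assumes the decision depends only on the estimated size of the smaller side, and that a negative threshold disables broadcasting.

```python
# Default quoted in the list above: 10485760 bytes (10 MB).
DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD = 10485760


def choose_join_strategy(estimated_small_side_bytes: int,
                         threshold: int = DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD) -> str:
    """Return "broadcast" if the smaller table fits under the threshold, else "shuffle"."""
    if threshold >= 0 and 0 <= estimated_small_side_bytes <= threshold:
        return "broadcast"
    return "shuffle"
```

Filtering rows and dropping unused columns before the join shrinks the estimated size, so a table that would otherwise be shuffled may fall under the threshold.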

Transform Results
[Chart: Transform Time (minutes), Before vs. After; axis from 0 to 60]
• EMR 5.20.0 with 10 instances (c4.8xlarge)
• Input: Parquet (Snappy), 50 GB
• Output: ORC (ZLib), 19 GB

Load
COPY dm1.FactTable
(
    Column1,
    Column2,
    DateColumnTZ FILLER TIMESTAMPTZ,
    DateColumn AS DateColumnTZ AT TIME ZONE 'UTC',
    Column4,
    ...
)
FROM 's3://bucket/prod/datamarts/dm1/FactTable/snapshots/snapshotid=20190418/part-*.orc'
ORC
DIRECT
ABORT ON ERROR;
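
Since the Load stage is listed as Python + SQL, the statement above could be generated per snapshot. The helper below is a hypothetical sketch that only builds the SQL text; the function name, its parameters, and the digits-only check on the snapshot id are assumptions, and executing the statement against Vertica is left out.

```python
from typing import Sequence


def build_copy_statement(table: str, column_exprs: Sequence[str],
                         snapshot_prefix: str, snapshot_id: str) -> str:
    """Build a Vertica COPY statement that loads one ORC snapshot from S3."""
    if not snapshot_id.isdigit():
        # Guard against accidental injection through the path component.
        raise ValueError("snapshot_id must contain digits only")
    columns = ",\n    ".join(column_exprs)
    source = "{}/snapshotid={}/part-*.orc".format(snapshot_prefix.rstrip("/"), snapshot_id)
    template = (
        "COPY {table}\n(\n    {columns}\n)\n"
        "FROM '{source}'\nORC\nDIRECT\nABORT ON ERROR;"
    )
    return template.format(table=table, columns=columns, source=source)


sql = build_copy_statement(
    "dm1.FactTable",
    ["Column1", "Column2"],
    "s3://bucket/prod/datamarts/dm1/FactTable/snapshots",
    "20190418",
)
```

Keeping the snapshot id as the only variable part makes each load repeatable: rerunning the job for the same id produces the same statement.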

Load Results
[Chart: Load Time (minutes), Before vs. After; axis from 0 to 100]
• Input: ORC (ZLib), 19 GB
• Output: Vertica DB (7 nodes)

Summary
• Build architecture for scale
• Consider tomorrow’s data volume
• Build with failure in mind
• Understand the risks and be ready to respond

Technologies
                Extract            Transform          Load
Environment     AWS, EMR 5.20.0    AWS, EMR 5.20.0    AWS Batch
Technology      Spark 2.4          Spark 2.4          Vertica
Languages       Scala + SQL        Scala + SQL        Python + SQL
Input           MS SQL, MySQL      S3                 S3
  Format        Tables             Parquet            ORC, Parquet
  Compression   -                  Snappy             Zlib, Snappy
Output          S3                 S3                 Vertica
  Format        Parquet            ORC, Parquet       Tables
  Compression   Snappy             Zlib, Snappy       Native

Scalable data pipeline

2. Agenda
• Data Lake Monsters Recap
• Scale Distributed Data Processing
• Technologies

3. Data Lake Monsters Recap

8. Scale Distributed Data Processing

Editor's notes

Scalable Data Pipeline. We have split this topic into two parts: the first talk is devoted to working directly with the data, and the second talk will be given by ... He will tell you more about managing and administering the Data Pipeline. The first part is mostly about the architecture and algorithms of the programs and processes themselves, and only a little about technologies. I put the main point of this part into the talk description: be ready for the challenges that come with data, and specifically for scaling the processes that work with this data. Everyone loves to store data, everyone works with it, and the amount of data grows every day. At some point there is so much of it that we can no longer process it within the allotted time window, and that is when, as always, we start thinking about scaling.
Everything I will tell you today is the story of the product we are working on at the company. We told you the beginning of this story exactly a year ago, in a talk given by Anton about the data lake. We will spend some time on it here to remind you what it was about. We will discuss which problems we had not solved at that point and which new problems we ran into. Then we will talk about scaling itself. In that part I will give real examples of problems we faced and how we solved them. At the end we will briefly summarize and talk about the technologies we used.
In the previous talk Anton spoke about the ideal and the real worlds: most often it is not possible to build an ideal architecture, for reasons beyond our control.
Many processes were born on the fly without regard for a company-wide architecture. All of this led to complex processes and a large set of disparate information stored in various sources.
One solution to this problem is to create a data lake and fill it with data from all possible sources.
When all the data is in a single store, we can build a flexible architecture for the downstream processes that work with this data.
Extract, Transform, and Load can be implemented as a single process: for example, a program that reads data from a source, transforms it, and then saves it to another data store.
To avoid confusion over the names and to be clear about which process we mean, we will rename the lower-level processes Input, Process, and Output. From an architectural point of view, this implementation, a single process that does not save intermediate results, has both advantages and disadvantages. Now let's move on to the most important part of this talk: scaling.
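
The Input, Process, Output naming from the notes can be sketched as one generic stage runner. This is only an illustration: `run_stage`, the in-memory source, and the sink are hypothetical stand-ins for a real database reader and S3 writer.

```python
from typing import Callable, List


def run_stage(read_input: Callable[[], List[dict]],
              process: Callable[[List[dict]], List[dict]],
              write_output: Callable[[List[dict]], None]) -> int:
    """Run one stage as Input -> Process -> Output and return the row count written."""
    rows = read_input()
    result = process(rows)
    write_output(result)
    return len(result)


# Hypothetical in-memory source and sink standing in for a database and S3.
source = [{"ProviderId": 1, "name": " alice "}, {"ProviderId": 2, "name": "bob"}]
sink: List[dict] = []

written = run_stage(
    read_input=lambda: list(source),
    process=lambda rows: [dict(r, name=r["name"].strip()) for r in rows],
    write_output=sink.extend,
)
```

Each of Extract, Transform, and Load can reuse the same three-step shape, with the output of one stage persisted so that it becomes the input of the next.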