Data Management
Compression for Data Lakes
Jonathan WINANDY 
primatice.com
About ME 
• I am (Not Only) a Data Engineer
Contents 
• Data Lake: What is a Data Lake?
• Compute: What can we do with the gathered data?
• Conclusion!
Jobs 
OLAP 
DB Backend App 
Caches 
DB 
Cluster 
DB 
Cluster 
App V1 
Jobs 
Specific 
DB 
Specific 
DB 
Monitoring 
Authentication 
HDFS / HADOOP 
Deploy ? 
App V2 
Infrastructure > The Data Lake* 
(*) you actually need 
more than just Hadoop 
to make a data lake. 
Data Lake
Data Lake 
How do we move the data between the different stores?
We use HDFS to store a copy of all the data.
And we use HADOOP to create views on this Data. 
HDFS / HADOOP 
DB 
Cluster DB 
Cluster 
OLAP 
DB 
Specific 
DB 
Specific 
DB
Database snapshots 
DB 
Cluster 
DB 
Cluster 
Offloading to HDFS 
Offloading data just means making a copy of 
the databases' data onto HDFS, as a directory 
containing a couple of files (parts). 
HDFS / HADOOP 
From: sql://$schema/$table 
To: hdfs://staging/fromschema=$schema/ 
fromtable=$table/importdatetime=$time 
Data Lake 
(HDFS is ‘just’ a Distributed File System)
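That staging layout is easy to generate programmatically. A minimal sketch in Python, following the path scheme above (`staging_path` is a hypothetical helper, not part of any tool mentioned in these slides):

```python
from datetime import datetime

def staging_path(schema: str, table: str, imported_at: datetime) -> str:
    """Build the HDFS staging path for one offloaded table snapshot:
    hdfs://staging/fromschema=$schema/fromtable=$table/importdatetime=$time
    """
    # Colons are not filesystem-friendly, so timestamps use '_'
    # (e.g. 2014-09-22T18_01_10), as in the examples on the next slide.
    ts = imported_at.strftime("%Y-%m-%dT%H_%M_%S")
    return (f"hdfs://staging/fromschema={schema}"
            f"/fromtable={table}/importdatetime={ts}")
```

One new directory per import keeps every snapshot immutable and addressable by schema, table, and import time.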
Database snapshots 
For relational databases, we can use Apache Sqoop to 
copy, for example, the table payments from the schema sales 
into hdfs://staging/sales/payments/2014-09-22T18_01_10 
into hdfs://staging/sales/payments/2014-09-21T17_50_25 
into hdfs://staging/sales/payments/2014-09-20T18_32_43 
DB 
Cluster 
DB 
Cluster 
HDFS / HADOOP 
From: sql://$schema/$table 
To: hdfs://staging/fromschema=$schema/ 
fromtable=$table/importdatetime=$time 
Data Lake 
(HDFS is ‘just’ a Distributed File System)
Data Lake 
A light capacity planning for your potential cluster: 
• If you have 100 GB of binary compressed data*, 
• with a storage cost of around $0.03 per GB per 
month, 
• a daily offload would cost $90 per month of 
kept data, 
• and over six months, this offloading would cost about 
$1,900 and weigh 18 TB (<- this is Big Data). 
* obviously not plain text/JSON
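The figures above can be checked with quick arithmetic (assuming 30-day months):

```python
GB_PER_DAY = 100          # one daily binary-compressed offload
COST_PER_GB_MONTH = 0.03  # $ per GB per month
DAYS_PER_MONTH = 30

# One month of daily offloads keeps 30 snapshots of 100 GB each:
monthly_volume_gb = GB_PER_DAY * DAYS_PER_MONTH        # 3000 GB = 3 TB
monthly_cost = monthly_volume_gb * COST_PER_GB_MONTH   # ~$90 per month of kept data

# The volume accumulates linearly, so the bill grows like 90 * (1 + 2 + ... + 6):
total_cost = sum(monthly_cost * m for m in range(1, 7))  # ~$1,890
total_volume_tb = 6 * monthly_volume_gb / 1000           # 18 TB
```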
Data Lake 
But for these 18 TB, you get quite a lot of features: 
• you can track and understand bugs and data 
corruption, 
• analyse the business without harming your 
production DBs, 
• bootstrap new products based on your existing 
data, 
• and you now have a real excuse to learn 
Hadoop, or to make some use of your existing 
Hadoop bare-metal clusters! 
Snapshot analysis 
Compute 
Having a couple of snapshots in HDFS, 
we can use the tremendous power of MapReduce 
to join over the snapshots, and compute what 
changed in the database during a day. 
Snapshots of SQL tables are extracted onto HDFS, then 
compared with the previous days' snapshots (J = day): 
delta = f( J , J - 1 ) 
merge = f( J , J - 1 , J - 2 , J - … , …)
Snapshot analysis 
Compute 
The Δ function takes 2 (or more) snapshots and outputs a merged 
view of those snapshots with an extra column ‘dm’. 
Inputs / Output: 

Input 1:       Input 2: 
id  name       id  name 
1   Jon        1   John 
2   Bob        3   James 
3   James      4   Jim 

Δ( Input 1 , Input 2 ) = 

id  name   dm 
1   John   01 
1   Jon    10 
3   James  11 
4   Jim    01 
2   Bob    10 
https://github.com/viadeo/viadeo-avro-utils
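The actual viadeo-avro-utils implementation runs as MapReduce over Avro files; as a rough in-memory sketch of what Δ computes for two snapshots (`delta2` is a hypothetical name):

```python
def delta2(before, after):
    """Merge two snapshots into one view with a 2-character 'dm' column:
    first bit = row present in 'before', second bit = present in 'after'."""
    rows = before | after  # every distinct row seen in either snapshot
    return {row: ("1" if row in before else "0") +
                 ("1" if row in after else "0")
            for row in rows}

# The two snapshots from the slide, as sets of (id, name) rows:
day1 = {(1, "Jon"), (2, "Bob"), (3, "James")}
day2 = {(1, "John"), (3, "James"), (4, "Jim")}
merged = delta2(day1, day2)
# (3, "James") -> "11": unchanged, stored only once
# (1, "Jon") -> "10" and (1, "John") -> "01": a rename shows up as two rows
```

Unchanged rows collapse into a single record, which is where the compression comes from.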
Snapshot analysis 
The Δ function takes 2 (or more) snapshots and outputs a merged 
view of those snapshots with an extra column ‘dm’. 
Inputs / Output (generalised): 

Input 1: id ∈ {1, 2, 3} 
Input 2: id ∈ {1, 3, 4} 
Input 3: id ∈ {1, 4, 5} 
Input 4: id ∈ {1, 3, 5} 

Δ( Input 1 , Input 2 , Input 3 , Input 4 ) = 

id  dm 
1   1111 
2   1000 
3   1101 
4   0110 
5   0011 
https://github.com/viadeo/viadeo-avro-utils 
Compute
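The generalised case works the same way, with one presence bit per input snapshot. A hedged in-memory sketch (the real tool operates on Avro files, not Python sets) that reproduces the slide's table:

```python
def delta(*snapshots):
    """N-ary generalisation: bit i of 'dm' is '1' iff the row is present
    in snapshot i."""
    rows = set().union(*snapshots)
    return {r: "".join("1" if r in s else "0" for s in snapshots)
            for r in rows}

# The four id-only snapshots from the slide:
dm = delta({1, 2, 3}, {1, 3, 4}, {1, 4, 5}, {1, 3, 5})
# dm == {1: "1111", 2: "1000", 3: "1101", 4: "0110", 5: "0011"}
```

Each row is stored once, however many snapshots it appears in; the dm bitmask is enough to reconstruct any individual snapshot.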
Snapshot efficiency 
Analysing the compression: 
merging 50 snapshots 
Compute 
Ratio: size of merge output / size of largest input 
Less is better; ≤ 1 is perfect 
Snapshot efficiency 
[Chart: merge ratios per dataset — small datasets get a bit 
of overhead; larger ones are OK at just ×2, or outright wins] 
Conclusion 
• So now the offloading might cost just $36 of 
storage for six months ($6 per month). 
• Welcome back to small data! 
• But what does this mean for production data? 
Did you really need to store it in a mutable 
database in the first place? 
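The $36 figure follows from the previous slides' numbers, assuming the merged view stays around twice the size of one 100 GB snapshot (the ~×2 ratio observed when merging 50 snapshots):

```python
SNAPSHOT_GB = 100         # one daily binary-compressed offload
COST_PER_GB_MONTH = 0.03  # $ per GB per month
MERGE_RATIO = 2.0         # merged view ~ 2x the largest input

merged_gb = SNAPSHOT_GB * MERGE_RATIO          # ~200 GB instead of 18 TB
monthly_cost = merged_gb * COST_PER_GB_MONTH   # ~$6 per month
six_month_cost = 6 * monthly_cost              # ~$36
```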
Thank you! 
Questions?
