Data Management
Compression for Data Lakes
Jonathan WINANDY 
primatice.com
About ME 
• I am (Not Only) a Data Engineer
Contents 
• Data Lake: What is a Data Lake?
• Compute: What can we do with the gathered data?
• Conclusion!
Jobs 
OLAP 
DB Backend App 
Caches 
DB 
Cluster 
DB 
Cluster 
App V1 
Jobs 
Specific 
DB 
Specific 
DB 
Monitoring 
Authentication 
HDFS / HADOOP 
Deploy ? 
App V2 
Infrastructure > The Data Lake* 
(*) you actually need 
more than just Hadoop 
to make a data lake. 
Data Lake
Data Lake 
How do we move the data between the different stores?
We use HDFS to store a copy of all the data.
And we use HADOOP to create views on this Data. 
HDFS / HADOOP 
DB 
Cluster DB 
Cluster 
OLAP 
DB 
Specific 
DB 
Specific 
DB
Database snapshots 
DB 
Cluster 
DB 
Cluster 
Offloading to HDFS 
Offloading data just means making a copy of 
the databases' data onto HDFS, as a directory 
containing a couple of files (parts). 
HDFS / HADOOP 
From: sql://$schema/$table 
To: hdfs://staging/fromschema=$schema/ 
fromtable=$table/importdatetime=$time 
Data Lake 
(HDFS is ‘just’ a Distributed File System)
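That staging layout is easy to generate programmatically. A minimal sketch in Python, following the path scheme above (`staging_path` is a hypothetical helper, not part of any tool mentioned in these slides):

```python
from datetime import datetime

def staging_path(schema: str, table: str, imported_at: datetime) -> str:
    """Build the HDFS staging path for one offloaded table snapshot:
    hdfs://staging/fromschema=$schema/fromtable=$table/importdatetime=$time
    """
    # Colons are not filesystem-friendly, so timestamps use '_'
    # (e.g. 2014-09-22T18_01_10), as in the examples on the next slide.
    ts = imported_at.strftime("%Y-%m-%dT%H_%M_%S")
    return (f"hdfs://staging/fromschema={schema}"
            f"/fromtable={table}/importdatetime={ts}")
```

One new directory per import keeps every snapshot immutable and addressable by schema, table, and import time.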
Database snapshots 
For relational databases, we can use Apache Sqoop to 
copy, for example, the table payments from the schema sales 
into hdfs://staging/sales/payments/2014-09-22T18_01_10 
into hdfs://staging/sales/payments/2014-09-21T17_50_25 
into hdfs://staging/sales/payments/2014-09-20T18_32_43 
DB 
Cluster 
DB 
Cluster 
HDFS / HADOOP 
From: sql://$schema/$table 
To: hdfs://staging/fromschema=$schema/ 
fromtable=$table/importdatetime=$time 
Data Lake 
(HDFS is ‘just’ a Distributed File System)
Data Lake 
A light capacity planning for your potential cluster: 
• If you have 100 GB of binary compressed data*, 
• with a storage cost of around $0.03 per GB per 
month, 
• a daily offload would cost $90 per month of 
kept data, 
• and over six months, this offloading would cost about 
$1,900 and weigh 18 TB (<- this is Big Data). 
* obviously not plain text/JSON
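The figures above can be checked with quick arithmetic (assuming 30-day months):

```python
GB_PER_DAY = 100          # one daily binary-compressed offload
COST_PER_GB_MONTH = 0.03  # $ per GB per month
DAYS_PER_MONTH = 30

# One month of daily offloads keeps 30 snapshots of 100 GB each:
monthly_volume_gb = GB_PER_DAY * DAYS_PER_MONTH        # 3000 GB = 3 TB
monthly_cost = monthly_volume_gb * COST_PER_GB_MONTH   # ~$90 per month of kept data

# The volume accumulates linearly, so the bill grows like 90 * (1 + 2 + ... + 6):
total_cost = sum(monthly_cost * m for m in range(1, 7))  # ~$1,890
total_volume_tb = 6 * monthly_volume_gb / 1000           # 18 TB
```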
Data Lake 
But for these 18 TB, you get quite a lot of features: 
• you can track and understand bugs and data 
corruption, 
• analyse the business without harming your 
production DBs, 
• bootstrap new products based on your existing 
data, 
• and you now have a real excuse to learn 
Hadoop, or to make some use of your existing 
Hadoop bare-metal clusters! 
Snapshot analysis 
Compute 
Having a couple of snapshots in HDFS, 
we can use the tremendous power of MapReduce 
to join over the snapshots, and compute what 
changed in the database during a day. 
Snapshots of SQL tables are extracted onto HDFS, then 
compared with the previous days' snapshots (J = day): 
delta = f( J , J - 1 ) 
merge = f( J , J - 1 , J - 2 , J - … , …)
Snapshot analysis 
Compute 
The Δ function takes 2 (or more) snapshots and outputs a merged 
view of those snapshots with an extra column ‘dm’. 
Inputs / Output: 

Input 1:       Input 2: 
id  name       id  name 
1   Jon        1   John 
2   Bob        3   James 
3   James      4   Jim 

Δ( Input 1 , Input 2 ) = 

id  name   dm 
1   John   01 
1   Jon    10 
3   James  11 
4   Jim    01 
2   Bob    10 
https://github.com/viadeo/viadeo-avro-utils
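The actual viadeo-avro-utils implementation runs as MapReduce over Avro files; as a rough in-memory sketch of what Δ computes for two snapshots (`delta2` is a hypothetical name):

```python
def delta2(before, after):
    """Merge two snapshots into one view with a 2-character 'dm' column:
    first bit = row present in 'before', second bit = present in 'after'."""
    rows = before | after  # every distinct row seen in either snapshot
    return {row: ("1" if row in before else "0") +
                 ("1" if row in after else "0")
            for row in rows}

# The two snapshots from the slide, as sets of (id, name) rows:
day1 = {(1, "Jon"), (2, "Bob"), (3, "James")}
day2 = {(1, "John"), (3, "James"), (4, "Jim")}
merged = delta2(day1, day2)
# (3, "James") -> "11": unchanged, stored only once
# (1, "Jon") -> "10" and (1, "John") -> "01": a rename shows up as two rows
```

Unchanged rows collapse into a single record, which is where the compression comes from.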
Snapshot analysis 
The Δ function takes 2 (or more) snapshots and outputs a merged 
view of those snapshots with an extra column ‘dm’. 
Inputs / Output (generalised): 

Input 1: id ∈ {1, 2, 3} 
Input 2: id ∈ {1, 3, 4} 
Input 3: id ∈ {1, 4, 5} 
Input 4: id ∈ {1, 3, 5} 

Δ( Input 1 , Input 2 , Input 3 , Input 4 ) = 

id  dm 
1   1111 
2   1000 
3   1101 
4   0110 
5   0011 
https://github.com/viadeo/viadeo-avro-utils 
Compute
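The generalised case works the same way, with one presence bit per input snapshot. A hedged in-memory sketch (the real tool operates on Avro files, not Python sets) that reproduces the slide's table:

```python
def delta(*snapshots):
    """N-ary generalisation: bit i of 'dm' is '1' iff the row is present
    in snapshot i."""
    rows = set().union(*snapshots)
    return {r: "".join("1" if r in s else "0" for s in snapshots)
            for r in rows}

# The four id-only snapshots from the slide:
dm = delta({1, 2, 3}, {1, 3, 4}, {1, 4, 5}, {1, 3, 5})
# dm == {1: "1111", 2: "1000", 3: "1101", 4: "0110", 5: "0011"}
```

Each row is stored once, however many snapshots it appears in; the dm bitmask is enough to reconstruct any individual snapshot.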
Snapshot efficiency 
Analysing the compression: 
merging 50 snapshots 
Compute 
Ratio: size of merge output / size of largest input 
Less is better; ≤ 1 is perfect 
Snapshot efficiency 
[Chart: merge ratios per dataset — small datasets get a bit 
of overhead; larger ones are OK at just ×2, or outright wins] 
Conclusion 
• So now the offloading might cost just $36 of 
storage for six months ($6 per month). 
• Welcome back to small data! 
• But what does this mean for production data? 
Did you really need to store it in a mutable 
database in the first place? 
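The $36 figure follows from the previous slides' numbers, assuming the merged view stays around twice the size of one 100 GB snapshot (the ~×2 ratio observed when merging 50 snapshots):

```python
SNAPSHOT_GB = 100         # one daily binary-compressed offload
COST_PER_GB_MONTH = 0.03  # $ per GB per month
MERGE_RATIO = 2.0         # merged view ~ 2x the largest input

merged_gb = SNAPSHOT_GB * MERGE_RATIO          # ~200 GB instead of 18 TB
monthly_cost = merged_gb * COST_PER_GB_MONTH   # ~$6 per month
six_month_cost = 6 * monthly_cost              # ~$36
```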
Thank you! 
Questions?
