Building an Open Data Platform with Apache Iceberg

Alluxio, Inc.

Alluxio Day VIII December 14, 2021 https://www.alluxio.io/alluxio-day/ Speaker: Ryan Blue, Apache Iceberg

Building an Open Data Platform with Apache Iceberg
Ryan Blue
Alluxio Day 8, December 2021

Current data architecture
● Multi-engine
○ Spark for ETL, ML
○ Trino for ad-hoc, ETL
○ Flink for streaming
○ Druid for aggregates
● In the cloud (or moving)
● Hive Metastore
○ No metastore?
● Investing in data
○ In people
○ In tools
○ In infrastructure

But the pieces don’t fit together quite right

What is Iceberg?
● A table format
○ Akin to columnar file formats
○ Transactional guarantees
○ Performance enhancements
● A standard for analytic tables
○ Open source spec and library
○ Integrated into query engines

The gap
[Diagram labels: object storage; data & metadata; compute; Apache Spark; catalog; ???]

Shared storage requirements
Technical:
● Must handle concurrent writes
● Must be scalable, performant
● Must be cloud native
Practical:
● Must be open source
● Must be neutral
● Must address productivity

Iceberg’s goals
● Add reliable transactions
● Unlock performance
● Fix usability
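
One way to picture "reliable transactions" under concurrent writers is optimistic concurrency: each writer prepares new table metadata and commits it with an atomic swap of a single pointer, retrying if another writer got there first. The sketch below is a toy model under that assumption, not Iceberg's actual code. `ToyCatalog`, `CommitConflict`, and `append_files` are hypothetical names invented for this illustration.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer read it."""


class ToyCatalog:
    """Hypothetical catalog holding one pointer to the current table metadata."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = {"version": 0, "files": ()}

    def load(self):
        return self._current

    def swap(self, expected, new):
        # Atomic compare-and-swap: succeeds only if nobody committed in between.
        with self._lock:
            if self._current is not expected:
                raise CommitConflict("table metadata changed; retry")
            self._current = new


def append_files(catalog, new_files, max_retries=5):
    """Append files by building new metadata and swapping the pointer."""
    for _ in range(max_retries):
        base = catalog.load()
        candidate = {
            "version": base["version"] + 1,
            "files": base["files"] + tuple(new_files),
        }
        try:
            catalog.swap(base, candidate)
            return candidate
        except CommitConflict:
            continue  # re-read the latest metadata and try again
    raise CommitConflict("gave up after %d attempts" % max_retries)


# Four writers commit concurrently; each one lands exactly once.
catalog = ToyCatalog()
writers = [
    threading.Thread(target=append_files, args=(catalog, ["f%d.parquet" % i]))
    for i in range(4)
]
for w in writers:
    w.start()
for w in writers:
    w.join()
final = catalog.load()
```

Readers always see either the old metadata or the new metadata, never a half-applied write, because the only shared mutable state is the pointer.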

Open data platform
[Diagram labels: object storage; data & metadata; compute; Apache Spark; catalog; data services; vertical solutions; open data stack]

Lessons learned
● Avoid unpleasant surprises
○ Principle of least surprise
● Don’t steal attention
○ Reduce context switching

Usability improvements
● Schema evolution
○ Instantaneous – no rewrites
○ Safe – no undead columns 🧟
○ Saves days of headache
ALTER TABLE db.tab
RENAME COLUMN
id TO customer_id
● Layout evolution
○ Lazy – only rewrite if needed
○ Partitioning mistakes are okay
○ Changes with your data
○ Saves a month of headache
ALTER TABLE db.tab
ADD PARTITION FIELD
bucket(256, id)
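
The "instantaneous" and "no undead columns" points follow from tracking columns by a unique ID instead of by name, so a rename touches only metadata and a re-added name cannot pick up old data. The following is a toy model of that idea, not Iceberg's implementation. `ToyTable` and its methods are hypothetical names used only for illustration.

```python
class ToyTable:
    """Hypothetical table whose stored rows are keyed by column ID."""

    def __init__(self):
        self.schema = {}    # column name -> column ID (metadata only)
        self.rows = []      # each row: {column ID: value}; never rewritten
        self._next_id = 1

    def add_column(self, name):
        self.schema[name] = self._next_id
        self._next_id += 1  # IDs are never reused

    def rename_column(self, old, new):
        # Metadata-only change: the ID, and so the stored data, is untouched.
        self.schema[new] = self.schema.pop(old)

    def drop_column(self, name):
        del self.schema[name]

    def append(self, **values):
        self.rows.append({self.schema[k]: v for k, v in values.items()})

    def scan(self):
        # Resolve current names to IDs; IDs missing from a row read as None.
        return [
            {name: row.get(col_id) for name, col_id in self.schema.items()}
            for row in self.rows
        ]


table = ToyTable()
table.add_column("id")
table.add_column("status")
table.append(id=1, status="old")

table.rename_column("id", "customer_id")  # no data rewritten
after_rename = table.scan()

table.drop_column("status")
table.add_column("status")                # same name, new ID
after_readd = table.scan()                # old "status" values stay dead
```

In a name-based format, re-adding `status` would resurrect the old values; here the new column gets a fresh ID, so existing rows read it as missing.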

Practical improvements
● Hidden partitioning
○ No silent correctness bugs
○ No conversion mistakes
○ Query without understanding a table’s physical layout
● Reliable updates
○ Stop manual cleanup
○ Use any query engine
○ Automate maintenance
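
Hidden partitioning can be sketched as the table, not the user, applying a transform to a source column: writers supply only the source value, and readers filter on the source value while the planner works out which partitions to read. This is a toy model under that assumption. `ToyPartitionedTable` and `day` are hypothetical names, and the pruning shown relies on the transform preserving order, which holds for a day transform.

```python
from collections import defaultdict
from datetime import datetime


def day(ts):
    """Partition transform: timestamp -> calendar day (order-preserving)."""
    return ts.date()


class ToyPartitionedTable:
    """Hypothetical table that derives partition values from a source column."""

    def __init__(self, source_column, transform):
        self.source_column = source_column
        self.transform = transform
        self.partitions = defaultdict(list)  # partition value -> rows

    def append(self, row):
        # The writer never computes a partition column; the table derives it.
        key = self.transform(row[self.source_column])
        self.partitions[key].append(row)

    def scan(self, lower, upper):
        """Return rows with lower <= source value < upper and the partitions read."""
        lo, hi = self.transform(lower), self.transform(upper)
        touched = [p for p in self.partitions if lo <= p <= hi]
        rows = [
            r
            for p in touched
            for r in self.partitions[p]
            if lower <= r[self.source_column] < upper
        ]
        return rows, sorted(touched)


table = ToyPartitionedTable("ts", day)
for d in (1, 2, 3):
    for hour in (1, 13):
        table.append({"ts": datetime(2021, 12, d, hour), "n": d * 100 + hour})

# The query filters on ts only; it never mentions a partition column.
rows, touched = table.scan(datetime(2021, 12, 2, 0), datetime(2021, 12, 2, 12))
```

Because the filter is expressed on `ts`, there is no separate date column for a query author to forget or convert incorrectly.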

Performance improvements
● Indexed metadata
○ Fast job planning
○ Fast query execution
○ Faster iteration
● Table configuration
○ Tune tables, not jobs
○ Automate table tuning
○ Cluster and sort from config
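
A rough picture of how indexed metadata speeds up planning: if the table metadata records per-file lower and upper bounds for a column, the planner can skip files that cannot match a filter without opening them. The sketch below assumes that layout. `FileEntry` and `plan_scan` are hypothetical names, and the file paths and bounds are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    lower: int  # minimum value of the filtered column in this file
    upper: int  # maximum value of the filtered column in this file


def plan_scan(manifest, lo, hi):
    """Return paths of files whose [lower, upper] range overlaps [lo, hi]."""
    return [f.path for f in manifest if f.upper >= lo and f.lower <= hi]


manifest = [
    FileEntry("a.parquet", 0, 99),
    FileEntry("b.parquet", 100, 199),
    FileEntry("c.parquet", 200, 299),
]

# Planning reads only the metadata; two of the three files are never opened.
selected = plan_scan(manifest, 150, 180)
```

Sorting or clustering data on the filtered column makes these ranges narrower and less overlapping, which is one reason layout settings kept in table configuration can help every job that reads the table.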

4. Where are we going?

9. And how does that help?

14. Thank you!

15. Iceberg exists to fix productivity

17. We try to make Iceberg invisible
