How to Use: First Steps
© 2022 Cloudera, Inc. All rights reserved. 2
Recommended Iceberg Workflow
Phases: CREATE, INGEST / PREP, SERVE, OPERATION / MAINTENANCE, GOVERN
1. Create Iceberg tables (CDW: Hive; CDE: Spark SQL)
   a. Bring your own datasets by converting your Hive external tables, OR
   b. Use the sample airline datasets
2. Batch insert data (CDE: Spark SQL)
   To prepare the Time Travel scenario, insert more data into Iceberg tables with Hive or Spark.
3. Create security policy (SDX: Ranger)
   Create a Ranger policy to mask a column for Fine Grained Access Control (FGAC).
4. Build BI query (CDW: Impala SQL)
   Create SQL queries for standard operational reporting.
5. Build visualizations (CDV: create data set from query & build visuals)
   Create data sets and visuals from the query.
6. Perform Time Travel (CDW: Hive/Impala SQL; CDE: Spark Scala API)
   Create Time Travel queries and execute them to audit what has changed.
7. Partition Evolution (CDW: Hive/Impala SQL; CDE: Spark SQL)
   Optimize the partition schema to improve query performance.
8. Table Maintenance (CDE: Spark SQL)
   Manage / expire snapshots.
SQL Commands (Hive, Spark, Impala)
SQL Commands
[Diagram: Iceberg Tables with Table Conversion, Time Travel, DDL, Query, and DML]
Ease of use through consistent SQL syntax across compute engines
A rich set of SQL commands has been developed for Hive, Impala and Spark to:
• Create and manipulate database objects
• Run Queries
• Load data into tables
• Modify data in tables
• Perform Time Travel operations
• Convert to Iceberg tables
Snapshot of Iceberg SQL Commands
Command | Hive | Impala | Spark
Select | ⬤ | ⬤ | ⬤
DML (INSERT INTO, INSERT OVERWRITE) | ⬤ | ⬤ | ⬤
Create Table | ⬤ | ⬤ | ⬤
Alter Table | ⬤ | ⬤ | ⬤
Drop Table | ⬤ | ⬤ | ⬤
Truncate Table | ⬤ | ⬤ | NA
Create-Table-As-Select | ⬤ | ⬤ | ⬤
Replace-Table-As-Select | NA | NA | ⬤
Partition Evolution | ⬤ | ⬤ | ⬤
Partition Transformation | ⬤ | ⬤ | ⬤
Schema Evolution | ⬤ | ⬤ | ⬤
Table Metadata (DESCRIBE TABLE, SHOW CREATE TABLE) | ⬤ | ⬤ | ⬤
Time Travel | ⬤ | ⬤ | Scala API now, SQL is planned
Table Migration | ⬤ | NA | ⬤
Table Maintenance | NA | NA | ⬤
Legend: ⬤ = General Availability or Tech Preview (the slide distinguishes the two by marker color)
Compute Engines Interoperability & Fine Grained Access Control
Compute Engine Interoperability & FGAC
❏ Consistent Iceberg table access and processing with SQL using Hive, Spark and Impala (reads and writes)
    ❏ No partial reads
    ❏ No adapters needed
❏ Iceberg FGAC support through Ranger integration with Hive / Impala
    ❏ Spark is planned
❏ Compatible with existing workflows
❏ Optimized for performance, cost and developer efficiency
[Diagram labels: Iceberg Tables, Apache Impala]
Table Conversion SQL commands / Utility [Tech Preview]
Table Conversion from Hive External to Iceberg Tables
1. Hive table migration:
ALTER TABLE tbl SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
2. Spark 3:
a. Import a Hive table into a new Iceberg table (the source table is left unchanged)
spark.sql("CALL <catalog>.system.snapshot('<src>', '<dest>')")
b. Migrate a Hive table to an Iceberg table in place
spark.sql("CALL <catalog>.system.migrate('<src>')")
Time Travel Operations
Time Travel
Time Travel is the ability to make a query reproducible at a given snapshot and/or time.
Time Travel operations:
● SELECT … AS OF …
Standard SQL operations:
● Queries
● DDL
● DML
[Diagram: timeline t starting at T0, with snapshots from Snapshot A to Snapshot Z; engine label: Apache Impala]
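As a rough illustration of what an AS OF query resolves to, the sketch below picks the snapshot that was current at a given time from a snapshot log. The snapshot IDs, commit times and the `snapshot_as_of` helper are made up for this example; it is a toy model of the idea, not Iceberg's implementation.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical snapshot log: (commit time, snapshot ID), oldest first.
SNAPSHOT_LOG = [
    (datetime(2021, 8, 9, 10, 0, 0, tzinfo=timezone.utc), 1001),
    (datetime(2021, 8, 9, 10, 30, 0, tzinfo=timezone.utc), 1002),
    (datetime(2021, 8, 9, 11, 0, 0, tzinfo=timezone.utc), 1003),
]


def snapshot_as_of(log, as_of):
    """Return the ID of the latest snapshot committed at or before `as_of`."""
    commit_times = [ts for ts, _ in log]
    # Index of the first snapshot committed strictly after `as_of`.
    idx = bisect_right(commit_times, as_of)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return log[idx - 1][1]


if __name__ == "__main__":
    query_time = datetime(2021, 8, 9, 10, 35, 57, tzinfo=timezone.utc)
    print(snapshot_as_of(SNAPSHOT_LOG, query_time))  # prints 1002
```

Because the lookup depends only on the snapshot log and the requested time, repeating the query later returns the same snapshot, which is what makes the query reproducible.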
Time Travel Operations
Time Travel Ops: SQL Examples
Hive / Impala query:
SELECT * FROM table FOR SYSTEM_TIME AS OF '2021-08-09 10:35:57';
SELECT * FROM table FOR SYSTEM_VERSION AS OF 1234567;
Spark Scala API:
// time travel to snapshot with ID 10963874102873L
spark.read
    .option("snapshot-id", 10963874102873L)
    .format("iceberg")
    .load("path/to/table")
// time travel to October 26, 1985 at 08:21:00 UTC (as-of-timestamp is in epoch milliseconds)
spark.read
    .option("as-of-timestamp", "499162860000")
    .format("iceberg")
    .load("path/to/table")
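The `as-of-timestamp` option takes milliseconds since the Unix epoch, so a wall-clock time has to be converted first. A minimal sketch of that conversion in Python follows; the helper name `to_epoch_millis` is an assumption for illustration and not part of any Spark or Iceberg API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    moment = datetime(1985, 10, 26, 8, 21, 0, tzinfo=timezone.utc)
    print(to_epoch_millis(moment))  # prints 499162860000
```

The printed value can be passed as a string to `.option("as-of-timestamp", ...)`.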
Partition Evolution
In-place Partition Evolution
❏ Existing big data solutions don't support in-place partition evolution: the entire table must be rewritten with the new partition column.
❏ With Iceberg's hidden partitioning, which separates the physical layout from the logical table, users are not required to maintain partition columns.
❏ Iceberg tables can evolve partition schemas over time as data volume changes.
❏ Benefits:
    ❏ No costly table rewrites or table migration
    ❏ No query rewrites
    ❏ Reduced downtime and improved SLAs
[Figure: SALES_ORDER data on a timeline t. Data before 2022-01-01 is partitioned by month(date) (2021-10-01, 2021-11-01, 2021-12-01); data from 2022-01-01 onward is partitioned by day(date) (days 1 to 31). The partitions included in the query plan are highlighted, with split plan 1 covering the monthly layout and split plan 2 the daily layout.]
SELECT * FROM SALES_ORDER
WHERE
DATE > '2021-11-23' AND
DATE < '2022-01-19'
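To make the split planning concrete, here is a toy sketch of which partitions a date-range filter touches when a table switched from monthly to daily partitioning on 2022-01-01. The function names and the `SPEC_CHANGE` constant are assumptions made for this example; real engines plan against Iceberg metadata, not with code like this.

```python
from datetime import date, timedelta

# Assumed from the slide: monthly partitions before this date, daily from it on.
SPEC_CHANGE = date(2022, 1, 1)


def month_partitions(start, end):
    """First-of-month dates for every month overlapping [start, end]."""
    parts = []
    current = start.replace(day=1)
    while current <= end:
        parts.append(current)
        # Jump past the end of the month, then snap to the 1st of the next one.
        current = (current.replace(day=28) + timedelta(days=4)).replace(day=1)
    return parts


def day_partitions(start, end):
    """Every calendar day in [start, end]."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def partitions_to_scan(lower_exclusive, upper_exclusive):
    """Return (monthly, daily) partitions for: lower < date < upper."""
    first = lower_exclusive + timedelta(days=1)
    last = upper_exclusive - timedelta(days=1)
    if first > last:
        return [], []
    monthly, daily = [], []
    if first < SPEC_CHANGE:
        monthly = month_partitions(first, min(last, SPEC_CHANGE - timedelta(days=1)))
    if last >= SPEC_CHANGE:
        daily = day_partitions(max(first, SPEC_CHANGE), last)
    return monthly, daily


if __name__ == "__main__":
    monthly, daily = partitions_to_scan(date(2021, 11, 23), date(2022, 1, 19))
    print([d.isoformat() for d in monthly])  # ['2021-11-01', '2021-12-01']
    print(len(daily), daily[0], daily[-1])   # 18 2022-01-01 2022-01-18
```

The same query text covers both layouts: the older data is pruned at month granularity and the newer data at day granularity, without rewriting either the table or the query.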
Partition Evolution SQL examples
Engine: SQL Examples
Hive / Impala:
-- Partition evolution to hour
ALTER TABLE t SET PARTITION SPEC (hour(ts));
Spark SQL:
-- Partition evolution to hour
ALTER TABLE t ADD PARTITION FIELD (hour(ts));
Table Maintenance [ Tech Preview ]
Table Maintenance [ Tech Preview ]
Table Maintenance Ops: Examples
Hive / Impala query:
-- Tentative, proposed syntax, not in GA
-- Expires snapshots that are older than 7 days.
ALTER TABLE test_table EXECUTE expire_snapshots_lt (now() - interval 7 days);
Spark Scala API:
// Not in GA
// Expires snapshots that are older than 7 days
Table test_table = …
long tsToExpire = System.currentTimeMillis() - (1000*60*60*24*7);
test_table.expireSnapshots()
    .expireOlderThan(tsToExpire)
    .commit();
Expiring old snapshots removes them from metadata, so they are no longer available for time travel operations. Data files are
not deleted until they are no longer referenced by a snapshot that may be used for time travel. Regularly expiring snapshots
deletes unused data files.
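The relationship between expired snapshots and deletable data files can be sketched with a small in-memory model. Everything here (the `expire_snapshots` function, the snapshot IDs and the file names) is hypothetical and only mirrors the behavior described above; it is not the Iceberg API.

```python
from datetime import datetime, timedelta, timezone


def expire_snapshots(snapshots, older_than):
    """Toy model of snapshot expiration.

    snapshots: dict of snapshot_id -> (commit_time, set of data file names)
    Returns (retained snapshots, data files that are safe to delete).
    """
    # Assumption in this sketch: the most recent snapshot is always kept.
    latest_id = max(snapshots, key=lambda sid: snapshots[sid][0])
    retained = {
        sid: snap for sid, snap in snapshots.items()
        if snap[0] >= older_than or sid == latest_id
    }
    expired = {sid: snap for sid, snap in snapshots.items() if sid not in retained}
    still_referenced = set().union(*(files for _, files in retained.values()))
    candidates = set().union(*(files for _, files in expired.values()))
    # A file is deleted only if no retained snapshot references it.
    return retained, candidates - still_referenced


if __name__ == "__main__":
    now = datetime(2022, 1, 31, tzinfo=timezone.utc)
    snapshots = {
        1: (datetime(2022, 1, 1, tzinfo=timezone.utc), {"a.parquet", "b.parquet"}),
        2: (datetime(2022, 1, 20, tzinfo=timezone.utc), {"b.parquet", "c.parquet"}),
        3: (datetime(2022, 1, 30, tzinfo=timezone.utc), {"b.parquet", "c.parquet", "d.parquet"}),
    }
    retained, deletable = expire_snapshots(snapshots, now - timedelta(days=7))
    print(sorted(retained))   # [3]
    print(sorted(deletable))  # ['a.parquet']
```

Snapshots 1 and 2 are expired, but only `a.parquet` is removed because `b.parquet` and `c.parquet` are still referenced by snapshot 3.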

Weitere ähnliche Inhalte

Was ist angesagt?

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...HostedbyConfluent
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergWalaa Eldin Moustafa
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 

Was ist angesagt? (20)

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 

Ähnlich wie Some Iceberg Basics for Beginners (CDP).pdf

Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Using SQL Plan Management (SPM) to balance Plan Flexibility and Plan Stability
Using SQL Plan Management (SPM) to balance Plan Flexibility and Plan StabilityUsing SQL Plan Management (SPM) to balance Plan Flexibility and Plan Stability
Using SQL Plan Management (SPM) to balance Plan Flexibility and Plan StabilityCarlos Sierra
 
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdfZesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdfEran Levy
 
Sprint 186
Sprint 186Sprint 186
Sprint 186ManageIQ
 
Windows on AWS
Windows on AWSWindows on AWS
Windows on AWSDatavail
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDatabricks
 
Introducing the eDB360 Tool
Introducing the eDB360 ToolIntroducing the eDB360 Tool
Introducing the eDB360 ToolCarlos Sierra
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierKellyn Pot'Vin-Gorman
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresJitendra Singh
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Replicating in Real-time from MySQL to Amazon Redshift
Replicating in Real-time from MySQL to Amazon RedshiftReplicating in Real-time from MySQL to Amazon Redshift
Replicating in Real-time from MySQL to Amazon RedshiftContinuent
 
Sprint 170
Sprint 170Sprint 170
Sprint 170ManageIQ
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik
 
PgConf US 2015 - ALTER DATABASE ADD more SANITY
PgConf US 2015  - ALTER DATABASE ADD more SANITYPgConf US 2015  - ALTER DATABASE ADD more SANITY
PgConf US 2015 - ALTER DATABASE ADD more SANITYOleksii Kliukin
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfIlham31574
 
Sprint 168
Sprint 168Sprint 168
Sprint 168ManageIQ
 
Sprint 185
Sprint 185Sprint 185
Sprint 185ManageIQ
 

Ähnlich wie Some Iceberg Basics for Beginners (CDP).pdf (20)

Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Using SQL Plan Management (SPM) to balance Plan Flexibility and Plan Stability
Using SQL Plan Management (SPM) to balance Plan Flexibility and Plan StabilityUsing SQL Plan Management (SPM) to balance Plan Flexibility and Plan Stability
Using SQL Plan Management (SPM) to balance Plan Flexibility and Plan Stability
 
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdfZesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
Zesty journey to adopt apache iceberg-AWS-Floor28_Sep-23.pdf
 
Sprint 186
Sprint 186Sprint 186
Sprint 186
 
Windows on AWS
Windows on AWSWindows on AWS
Windows on AWS
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Copy Data Management for the DBA
Copy Data Management for the DBACopy Data Management for the DBA
Copy Data Management for the DBA
 
Introducing the eDB360 Tool
Introducing the eDB360 ToolIntroducing the eDB360 Tool
Introducing the eDB360 Tool
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next Frontier
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Replicating in Real-time from MySQL to Amazon Redshift
Replicating in Real-time from MySQL to Amazon RedshiftReplicating in Real-time from MySQL to Amazon Redshift
Replicating in Real-time from MySQL to Amazon Redshift
 
Sprint 170
Sprint 170Sprint 170
Sprint 170
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
 
PgConf US 2015 - ALTER DATABASE ADD more SANITY
PgConf US 2015  - ALTER DATABASE ADD more SANITYPgConf US 2015  - ALTER DATABASE ADD more SANITY
PgConf US 2015 - ALTER DATABASE ADD more SANITY
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
Sprint 168
Sprint 168Sprint 168
Sprint 168
 
Sprint 185
Sprint 185Sprint 185
Sprint 185
 

Kürzlich hochgeladen

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Kürzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Some Iceberg Basics for Beginners (CDP).pdf

  • 1. HOW to use First steps
  • 2. © 2022 Cloudera, Inc. All rights reserved. 2 Recommended Iceberg Workflow Create Iceberg tables a. Bring your own datasets by converting your Hive external tables OR b. Use the sample airline datasets CDW: Hive CDE: Spark SQL 1 Batch Insert data To prepare Time Travel scenario: Insert more data into Iceberg tables with Hive or Spark CDE: Spark SQL 2 Create Security Policy Create a Ranger policy to mask a column for Fine Grained Access Control (FGAC) SDX: Ranger 3 Build BI Query Create SQL Queries for standard ops. reporting CDW: Impala SQL 4 Build Visualizations Create data sets & Visuals from Query CDV: Create data set from query & Build Visuals 5 Perform Time Travel Create Time Travel Queries and Execute them to audit what has changed CDW: Hive/Impala SQL CDE: Spark Scala API 6 Partition Evolution Optimize partition schema to improve query performance CDW: Hive/Impala SQL CDE: Spark SQL 7 Table Maintenance Manage / Expire Snapshots CDE: Spark SQL 8 CREATE INGEST/ PREP SERVE OPERATION / MAINTENANCE GOVERN
  • 3. SQL Commands (Hive, Spark, Impala)
  • 4. SQL Commands
       Iceberg table command areas: Table Conversion, Time Travel, DDL, Query, DML
       Ease of use through consistent SQL syntax across compute engines.
       A rich set of SQL commands has been developed for Hive, Impala and Spark to:
       • Create and manipulate database objects
       • Run queries
       • Load data into tables
       • Modify data in tables
       • Perform Time Travel operations
       • Convert to Iceberg tables
  • 5. Snapshot of Iceberg SQL Commands
       Command                                            | Hive | Impala | Spark
       Select                                             | ⬤    | ⬤      | ⬤
       DML (INSERT INTO, INSERT OVERWRITE)                | ⬤    | ⬤      | ⬤
       Create Table                                       | ⬤    | ⬤      | ⬤
       Alter Table                                        | ⬤    | ⬤      | ⬤
       Drop Table                                         | ⬤    | ⬤      | ⬤
       Truncate Table                                     | ⬤    | ⬤      | NA
       Create-Table-As-Select                             | ⬤    | ⬤      | ⬤
       Replace-Table-As-Select                            | NA   | NA     | ⬤
       Partition Evolution                                | ⬤    | ⬤      | ⬤
       Partition Transformation                           | ⬤    | ⬤      | ⬤
       Schema Evolution                                   | ⬤    | ⬤      | ⬤
       Table Metadata (DESCRIBE TABLE, SHOW CREATE TABLE) | ⬤    | ⬤      | ⬤
       Time Travel                                        | ⬤    | ⬤      | Scala API now, SQL is planned
       Table Migration                                    | ⬤    | NA     | ⬤
       Table Maintenance                                  | NA   | NA     | ⬤
       Legend: ⬤ General Availability / ⬤ Tech Preview (distinguished by marker color on the slide)
  • 6. Compute Engines Interoperability & Fine Grained Access Control
  • 7. Compute Engine Interoperability & FGAC
       ❏ Consistent Iceberg table access and processing with SQL using Hive, Spark and Impala (reads and writes)
       ❏ No partial reads
       ❏ No adapters needed
       ❏ Iceberg FGAC support through Ranger integration with Hive / Impala
           ❏ Spark is planned
       ❏ Compatible with existing workflows
       ❏ Optimized for performance, cost and developer efficiency
       [Diagram: Iceberg Tables shared across compute engines, including Apache Impala]
  • 8. Table Conversion SQL commands / Utility [Tech Preview]
  • 9. Table Conversion from Hive External to Iceberg Tables
       1. Hive table migration:
            ALTER TABLE tbl SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler')
       2. Spark 3:
            a. Import Hive tables into Iceberg:
                 spark.sql("CALL <catalog>.system.snapshot('<src>', '<dest>')")
            b. Migrate Hive tables to Iceberg tables:
                 spark.sql("CALL <catalog>.system.migrate('<src>')")
  • 11. Time Travel
        Time Travel is the ability to make a query reproducible at a given snapshot and/or time.
        Time Travel operations: SELECT … AS OF …
        Standard SQL operations: queries, DDL, DML
        [Diagram: timeline t starting at T0, running from Snapshot A to Snapshot Z; engine shown: Apache Impala]
  • 12. Time Travel Operations
        Hive / Impala query:
            SELECT * FROM table FOR SYSTEM_TIME AS OF '2021-08-09 10:35:57';
            SELECT * FROM table FOR SYSTEM_VERSION AS OF 1234567;
        Spark Scala API:
            // time travel to snapshot with ID 10963874102873L
            spark.read
              .option("snapshot-id", 10963874102873L)
              .format("iceberg")
              .load("path/to/table")
            // time travel to 1985-10-26 08:21:00 UTC (as-of-timestamp is given in epoch milliseconds)
            spark.read
              .option("as-of-timestamp", "499162860000")
              .format("iceberg")
              .load("path/to/table")
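
The `as-of-timestamp` option above takes milliseconds since the Unix epoch, which is easy to get wrong by hand. The following is a minimal sketch in plain Python (standard library only) of converting in both directions; the helper names `to_epoch_millis` and `from_epoch_millis` are hypothetical and not part of Iceberg or Spark. Decoding the value used on the slide gives 1985-10-26 08:21:00 UTC.

```python
from datetime import datetime, timedelta, timezone

# Reference point for epoch arithmetic (timezone-aware, UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(ts: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    if ts.tzinfo is None:
        # Naive datetimes are ambiguous (local time vs UTC), so reject them.
        raise ValueError("pass a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (ts - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(ms: int) -> datetime:
    """Convert epoch milliseconds back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# Decode the value used in the slide's Spark example.
slide_value = 499162860000
print(from_epoch_millis(slide_value).isoformat())  # 1985-10-26T08:21:00+00:00

# Encode a timestamp of your own for use as an as-of-timestamp option value.
cutoff = datetime(2021, 8, 9, 10, 35, 57, tzinfo=timezone.utc)
print(str(to_epoch_millis(cutoff)))
```

The result of `to_epoch_millis` can be passed as a string to `.option("as-of-timestamp", ...)`, mirroring the slide's example.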
  • 14. In-place Partition Evolution
        ❏ Existing big data solutions do not support in-place partition evolution; the entire table must be completely rewritten with the new partition column.
        ❏ With Iceberg's hidden partitioning, which separates the physical layout from the logical table, users are not required to maintain partition columns.
        ❏ Iceberg tables can evolve partition schemas over time as data volume changes.
        ❏ Benefits:
            ❏ No costly table rewrites or table migration
            ❏ No query rewrites
            ❏ Reduced downtime and improved SLA
        [Figure: timeline t with the partition spec changing at 2022-01-01. Earlier data is Partitioned by Month(date) (2021-10-01, 2021-11-01, 2021-12-01); later data is Partitioned by Day(date) (days 1 to 31 of 2022-01…). The partitions included in the query plan are highlighted and covered by Split plan 1 and Split plan 2.]
        Example query: SELECT * FROM SALES_ORDER WHERE DATE > '2021-11-23' AND DATE < '2022-01-19'
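
To make the split-planning idea on this slide concrete, here is a toy model in plain Python. It is not Iceberg code: it only assumes, as the figure suggests, that data before 2022-01-01 is partitioned by month and data from that date on is partitioned by day, and it lists which partitions a date-range filter would touch. The function name `partitions_in_plan` and the constant `EVOLVED_ON` are hypothetical.

```python
from datetime import date, timedelta

# Assumption taken from the figure: the partition spec changed on this date,
# from month(date) before it to day(date) from it onwards.
EVOLVED_ON = date(2022, 1, 1)


def partitions_in_plan(lo: date, hi: date):
    """Return (month_partitions, day_partitions) touched by lo < d < hi."""
    months, days = [], []
    d = lo + timedelta(days=1)  # bounds are exclusive, as in the slide's query
    while d < hi:
        if d < EVOLVED_ON:
            key = d.strftime("%Y-%m")  # old spec: one partition per month
            if key not in months:
                months.append(key)
        else:
            days.append(d.isoformat())  # new spec: one partition per day
        d += timedelta(days=1)
    return months, days


# The filter from the slide: DATE > 2021-11-23 AND DATE < 2022-01-19
months, days = partitions_in_plan(date(2021, 11, 23), date(2022, 1, 19))
print(months)             # ['2021-11', '2021-12']
print(days[0], days[-1])  # 2022-01-01 2022-01-18
print(len(days))          # 18
```

In this model the month partitions and the day partitions correspond to the two split plans in the figure: one query spans both layouts without the table or the query being rewritten.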
  • 15. Partition Evolution SQL examples
        Hive / Impala:
            -- Partition evolution to hour
            ALTER TABLE t SET PARTITION SPEC (hour(ts))
        Spark SQL:
            -- Partition evolution to hour
            ALTER TABLE t ADD PARTITION FIELD hour(ts)
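
As an illustration of what an `hour(ts)` partition field groups together: the Iceberg table specification defines the hour transform as the number of hours since 1970-01-01 00:00:00. The sketch below reproduces that arithmetic in plain Python; `hour_partition` is a hypothetical helper name, and the engines compute this value internally rather than exposing such a function.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_partition(ts: datetime) -> int:
    """Whole hours elapsed since the Unix epoch (floored) for an aware datetime."""
    return (ts - EPOCH) // timedelta(hours=1)


a = datetime(2021, 8, 9, 10, 5, 0, tzinfo=timezone.utc)
b = datetime(2021, 8, 9, 10, 35, 57, tzinfo=timezone.utc)
c = datetime(2021, 8, 9, 11, 0, 0, tzinfo=timezone.utc)

# Rows within the same clock hour share a partition value; the next hour is +1.
print(hour_partition(a) == hour_partition(b))     # True
print(hour_partition(c) - hour_partition(a))      # 1
print(hour_partition(datetime(1970, 1, 1, 1, 30, tzinfo=timezone.utc)))  # 1
```

Because the partition value is derived from `ts`, queries filter on `ts` directly and do not need a separate partition column.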
  • 16. Table Maintenance [ Tech Preview ]
  • 17. Table Maintenance [ Tech Preview ]
        Hive / Impala query:
            -- Tentative, proposed syntax, not in GA
            -- Expires snapshots that are older than 7 days.
            ALTER TABLE test_table EXECUTE expire_snapshots_lt (now() - interval 7 days);
        Spark Scala API:
            // Not in GA
            // Expires snapshots that are older than 7 days
            Table test_table = …
            long tsToExpire = System.currentTimeMillis() - (1000*60*60*24*7);
            test_table.expireSnapshots()
              .expireOlderThan(tsToExpire)
              .commit();
        Expiring old snapshots removes them from metadata, so they are no longer available for time travel operations. Data files are not deleted until they are no longer referenced by a snapshot that may be used for time travel. Regularly expiring snapshots deletes unused data files.
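
The cutoff arithmetic in the example above can be sketched as a toy model in plain Python. This is not the Iceberg API: snapshots are represented as hypothetical `(snapshot_id, committed_at_ms)` tuples, and the model assumes the most recent snapshot is always kept so the table stays readable.

```python
SEVEN_DAYS_MS = 1000 * 60 * 60 * 24 * 7  # same constant as in the slide's example


def split_snapshots(snapshots, now_ms, max_age_ms=SEVEN_DAYS_MS):
    """Partition snapshots into (kept, expired) using an age cutoff.

    snapshots: list of (snapshot_id, committed_at_ms) tuples (hypothetical shape).
    Assumption: the latest snapshot is never expired, whatever its age.
    """
    cutoff = now_ms - max_age_ms
    ordered = sorted(snapshots, key=lambda s: s[1])
    kept, expired = [], []
    for i, snap in enumerate(ordered):
        is_current = i == len(ordered) - 1
        if snap[1] < cutoff and not is_current:
            expired.append(snap)  # no longer available for time travel
        else:
            kept.append(snap)
    return kept, expired


DAY_MS = 1000 * 60 * 60 * 24
now_ms = 30 * DAY_MS  # illustrative clock value, not a real timestamp
snaps = [(101, 10 * DAY_MS), (102, 20 * DAY_MS), (103, 25 * DAY_MS), (104, 29 * DAY_MS)]

kept, expired = split_snapshots(snaps, now_ms)
print([s[0] for s in expired])  # [101, 102]
print([s[0] for s in kept])     # [103, 104]
```

Time travel to snapshots 101 and 102 would no longer be possible after expiry in this model, which is the trade-off the slide describes.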