Some Iceberg Basics for Beginners (CDP).pdf
Michael Kogan
Iceberg for Beginners
1.
HOW to use
First steps
2.
© 2022 Cloudera, Inc. All rights reserved.

Recommended Iceberg Workflow

Workflow phases: CREATE, INGEST/PREP, SERVE, OPERATION/MAINTENANCE, GOVERN

(1) Create Iceberg tables
    a. Bring your own datasets by converting your Hive external tables, OR
    b. Use the sample airline datasets
    Engines: CDW: Hive; CDE: Spark SQL
(2) Batch insert data
    To prepare the Time Travel scenario: insert more data into Iceberg tables with Hive or Spark
    Engines: CDE: Spark SQL
(3) Create security policy
    Create a Ranger policy to mask a column for Fine Grained Access Control (FGAC)
    Engines: SDX: Ranger
(4) Build BI query
    Create SQL queries for standard ops. reporting
    Engines: CDW: Impala SQL
(5) Build visualizations
    Create data sets and visuals from the query
    Engines: CDV: create data set from query and build visuals
(6) Perform Time Travel
    Create Time Travel queries and execute them to audit what has changed
    Engines: CDW: Hive/Impala SQL; CDE: Spark Scala API
(7) Partition evolution
    Optimize the partition schema to improve query performance
    Engines: CDW: Hive/Impala SQL; CDE: Spark SQL
(8) Table maintenance
    Manage / expire snapshots
    Engines: CDE: Spark SQL
3.
SQL Commands (Hive, Spark, Impala)
4.
SQL Commands

Ease of use through consistent SQL syntax across compute engines

A rich set of SQL commands has been developed for Hive, Impala and Spark to:
• Create and manipulate database objects
• Run queries
• Load data into tables
• Modify data in tables
• Perform Time Travel operations
• Convert to Iceberg tables

[Figure: Iceberg Tables surrounded by DDL, Query, DML, Time Travel and Table Conversion]
5.
Snapshot of Iceberg SQL Commands

Command                                              Hive   Impala   Spark
Select                                               ⬤      ⬤        ⬤
DML (INSERT INTO, INSERT OVERWRITE)                  ⬤      ⬤        ⬤
Create Table                                         ⬤      ⬤        ⬤
Alter Table                                          ⬤      ⬤        ⬤
Drop Table                                           ⬤      ⬤        ⬤
Truncate Table                                       ⬤      ⬤        NA
Create-Table-As-Select                               ⬤      ⬤        ⬤
Replace-Table-As-Select                              NA     NA       ⬤
Partition Evolution                                  ⬤      ⬤        ⬤
Partition Transformation                             ⬤      ⬤        ⬤
Schema Evolution                                     ⬤      ⬤        ⬤
Table Metadata (DESCRIBE TABLE, SHOW CREATE TABLE)   ⬤      ⬤        ⬤
Time Travel                                          ⬤      ⬤        Scala API now, SQL is planned
Table Migration                                      ⬤      NA       ⬤
Table Maintenance                                    NA     NA       ⬤

Legend: ⬤ General Availability, ⬤ Tech Preview (color-coded on the slide)
6.
Compute Engines Interoperability & Fine Grained Access Control
7.
Compute Engine Interoperability & FGAC

❏ Consistent Iceberg table access and processing with SQL using Hive, Spark and Impala (reads and writes)
❏ No partial reads
❏ No adapters needed
❏ Iceberg FGAC support through Ranger integration with Hive / Impala
    ❏ Spark is planned
❏ Compatible with existing workflows
❏ Optimized for performance, cost and developer efficiency

[Figure: Iceberg Tables, Apache Impala]
8.
Table Conversion SQL commands / Utility [Tech Preview]
9.
Table Conversion from Hive External to Iceberg Tables

1. Hive table migration:

    ALTER TABLE tbl SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler')

2. Spark 3:
   a. Import Hive tables into Iceberg:

    spark.sql("CALL <catalog>.system.snapshot('<src>', '<dest>')")

   b. Migrate Hive tables to Iceberg tables:

    spark.sql("CALL <catalog>.system.migrate('<src>')")
10.
Time Travel Operations
11.
Time Travel

Time Travel is the ability to make a query reproducible at a given snapshot and/or time.

Time Travel operations:
● SELECT … AS OF …

Standard SQL operations:
● Queries
● DDL
● DML

[Figure: timeline t starting at T0, with Snapshot A through Snapshot Z; Apache Impala]
12.
Time Travel Operations

Hive / Impala query:

    SELECT * FROM table FOR SYSTEM_TIME AS OF '2021-08-09 10:35:57';
    SELECT * FROM table FOR SYSTEM_VERSION AS OF 1234567;

Spark Scala API:

    // time travel to snapshot with ID 10963874102873L
    spark.read
      .option("snapshot-id", 10963874102873L)
      .format("iceberg")
      .load("path/to/table")

    // time travel to a timestamp given in epoch milliseconds
    // (499162860000 is October 26, 1985 at 08:21:00 UTC)
    spark.read
      .option("as-of-timestamp", "499162860000")
      .format("iceberg")
      .load("path/to/table")
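The as-of-timestamp option takes milliseconds since the Unix epoch, which is easy to get wrong by hand. The following is a minimal Python sketch, not part of the original deck, for converting between a timezone-aware datetime and that value; the helper names to_epoch_millis and from_epoch_millis are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Reference point for epoch arithmetic (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Decode the value used in the Spark example on this slide.
print(from_epoch_millis(499162860000).isoformat())
```

Decoding a literal this way before pasting it into a query is a cheap check that the timestamp points at the moment you intend.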
13.
Partition Evolution
14.
In-place Partition Evolution

❏ Existing big data solutions don't support in-place partition evolution. The entire table must be completely rewritten with the new partition column.
❏ With Iceberg's hidden partitioning, which separates the physical layout from the logical one, users are not required to maintain partition columns.
❏ Iceberg tables can evolve partition schemas over time as data volume changes.
❏ Benefits:
    ❏ No costly table rewrites or table migration
    ❏ No query rewrites
    ❏ Reduced downtime and improved SLA

[Figure: timeline t. Older data (2021-10-01, 2021-11-01, 2021-12-01) is partitioned by Month(date); data from 2022-01-01 on is partitioned by Day(date), days 1 to 31. The partitions included in the query plan are marked, and the query below is planned as Split plan 1 and Split plan 2.]

    SELECT * FROM SALES_ORDER WHERE DATE > '2021-11-23' AND DATE < '2022-01-19'
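To make the split plan idea concrete, here is a small Python sketch, not taken from the deck, that lists which partitions a date range touches when a table switched from monthly to daily partitioning at a cutover date. It only illustrates the concept; it is not how Iceberg plans queries internally, and the function name split_plan and the 2022-01-01 cutover (read off the slide's figure) are assumptions.

```python
from datetime import date, timedelta


def split_plan(first_day: date, last_day: date, cutover: date):
    """Return (month partitions, day partitions) touched by an inclusive date range.

    Assumption: data before `cutover` is partitioned by month, data on or
    after `cutover` is partitioned by day.
    """
    months, days = [], []
    current = first_day
    while current <= last_day:
        if current < cutover:
            label = current.strftime("%Y-%m")
            # Record each month only once.
            if not months or months[-1] != label:
                months.append(label)
        else:
            days.append(current.isoformat())
        current += timedelta(days=1)
    return months, days


# The slide's query uses exclusive bounds: DATE > 2021-11-23 AND DATE < 2022-01-19.
months, days = split_plan(date(2021, 11, 24), date(2022, 1, 18), cutover=date(2022, 1, 1))
print("month partitions:", months)
print("day partitions:", len(days))
```

The point of the sketch is that one unchanged query spans both layouts: the older part of the range resolves to whole months, the newer part to individual days.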
15.
Partition Evolution SQL examples

Hive / Impala:

    -- Partition evolution to hour
    ALTER TABLE t SET PARTITION SPEC (hour(ts))

Spark SQL:

    -- Partition evolution to hour
    ALTER TABLE t ADD PARTITION FIELD (hour(ts))
16.
Table Maintenance [Tech Preview]
17.
Table Maintenance [Tech Preview]

Hive / Impala query:

    -- Tentative, proposed syntax, not in GA
    -- Expires snapshots that are older than 7 days.
    ALTER TABLE test_table EXECUTE expire_snapshots_lt (now() - interval 7 days);

Spark Scala API:

    // Not in GA
    // Expires snapshots that are older than 7 days
    Table test_table = …
    long tsToExpire = System.currentTimeMillis() - (1000*60*60*24*7);
    test_table.expireSnapshots()
      .expireOlderThan(tsToExpire)
      .commit();

Expiring old snapshots removes them from metadata, so they are no longer available for time travel operations. Data files are not deleted until they are no longer referenced by a snapshot that may be used for time travel. Regularly expiring snapshots deletes unused data files.
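The cutoff arithmetic in the snippet above (current time minus seven days, in epoch milliseconds) can be sketched in plain Python. This is an illustration only and does not call Iceberg; the function names and the sample snapshot timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expiry_cutoff_millis(now: datetime, keep_days: int = 7) -> int:
    """Epoch-millisecond cutoff: snapshots older than this would be expired."""
    return (now - timedelta(days=keep_days) - EPOCH) // timedelta(milliseconds=1)


def split_snapshots(snapshot_millis, cutoff):
    """Partition snapshot timestamps into (expired, retained) around the cutoff."""
    expired = [ts for ts in snapshot_millis if ts < cutoff]
    retained = [ts for ts in snapshot_millis if ts >= cutoff]
    return expired, retained


# Hypothetical snapshot commit times: 10 days, 7 days and 1 day before "now".
now = datetime(2022, 1, 11, tzinfo=timezone.utc)
snapshots = [expiry_cutoff_millis(now, keep_days=d) for d in (10, 7, 1)]
expired, retained = split_snapshots(snapshots, expiry_cutoff_millis(now))
print("expired:", expired)
print("retained:", retained)
```

Retained snapshots remain available for time travel; only the expired ones, and data files no snapshot references any longer, are candidates for cleanup.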