2. A lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode; feel free to move out of the
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during
the session.
3. Our Agenda
01 Why Delta Lake?
02 Data Warehouse
03 Data Lake
04 Possible Solution
05 Delta Lake
06 Demo
4. Why Delta Lake?
Streaming Systems
Data sources come through systems
like Apache Kafka or Amazon Kinesis.
Data Lakes
Data is stored for long periods of time in
the data lake, where it is optimized for
large scale and low cost.
Data Warehouse
Valuable data is stored, which is then again
optimized for high concurrency & reliability.
The modern data architecture uses a
blend of at least these three different
types of systems.
Data Architecture
5. Data Warehouse
● A data management system that stores current and
historical data from multiple sources in a business-
friendly manner for easier insights and reporting.
● Data warehouses are typically used for business
intelligence (BI), reporting, and data analysis.
Limitations
➔No support for video, audio, or text
➔No support for data science / ML
➔Limited support for streaming
➔Closed & proprietary formats
ETL
(Extract, Transform, Load)
Data Source
6. Data Lake
● A central location that holds a large amount of data in its
native, raw format.
● It holds unstructured and semi-structured data like photos,
video, audio, and documents, which is essential for today’s
machine learning and advanced analytics use cases.
Limitations
➔Poor BI support
➔Complex to set up
➔Poor performance
➔Lack of security features
➔Reliability issues
7. What’s the Solution?
A combination of DW & DL
Structured &
Unstructured Data
Data Lake
ETL
Metadata, Caching &
Indexing Layer
Data Validation
Data Warehousing
Reports, BI & Data
Science
8. Data Lakehouse
A system that merges the flexibility, low cost, and scale of
a data lake with the data management and ACID
transactions of data warehouses, addressing the limitations
of both.
Benefits
➔No need to keep one copy of the data in a data lake and
another copy in a data warehouse
➔Cost savings in infrastructure, staff, and
consulting overhead
➔Scalability through the underlying cloud storage
➔Reliability through ACID transactions
9. What is Delta Lake?
● Delta Lake is a file-based, open-source metadata layer
that enables building a Lakehouse architecture on top of
data lakes.
● It can run on existing data lakes and is fully compatible
with processing engines like Apache Spark.
With Delta Lake -
➔Scalable metadata handling
➔ACID Transactions
➔Streaming and Batch unification
➔Time Travel (query an older snapshot of a Delta table)
➔Schema Enforcement
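The scalable metadata handling and time travel listed above both come from Delta Lake's ordered transaction log. Below is a minimal pure-Python sketch of that idea only; the class and method names are invented for illustration and are not the real Delta Lake implementation, which stores JSON commit files alongside Parquet data.

```python
# Conceptual sketch of Delta Lake's transaction-log idea (NOT the real
# implementation): each commit atomically records which data files were
# added or removed. Replaying the log up to version N reconstructs that
# snapshot, which is what makes time travel possible.

class TinyDeltaLog:
    def __init__(self):
        self.log = []  # ordered list of commits; the index is the version

    def commit(self, add=(), remove=()):
        """Record one atomic commit; returns the new table version."""
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1

    def snapshot(self, version=None):
        """Set of live data files at the given version (default: latest)."""
        if version is None:
            version = len(self.log) - 1
        files = set()
        for entry in self.log[: version + 1]:
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return files

log = TinyDeltaLog()
log.commit(add=["part-0.parquet"])                             # version 0
log.commit(add=["part-1.parquet"])                             # version 1
log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])  # version 2

print(log.snapshot())           # latest snapshot
print(log.snapshot(version=0))  # "time travel" to version 0
```

Because readers only ever see a fully committed log entry, concurrent readers get an atomic, consistent view, which is the same property the ACID bullet above describes.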
10. The Medallion Architecture
Ingestion Tables → Refined Tables → Feature/Agg Data Store
● No business rules or
transformations of any kind
● Should be fast and easy to
get new data to this layer
● Prioritize speed to market
and write performance; just
enough transformations
● Quality data expected
● Prioritize business use
cases and user experience
● Precalculated, business-specific transformations
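The three layers above can be sketched in a few lines of plain Python. This is only a conceptual illustration of the medallion flow, not a Delta Lake API; the variable names and the sample data are invented.

```python
# Conceptual sketch of the medallion architecture: bronze keeps raw
# records untouched, silver applies validation/cleanup, gold holds
# precalculated business-level aggregates.

raw_events = [
    {"user": "a", "amount": "10"},
    {"user": "b", "amount": "oops"},  # malformed record
    {"user": "a", "amount": "5"},
]

# Bronze (ingestion): land data as-is, no business rules --
# prioritize write speed and ease of ingestion.
bronze = list(raw_events)

# Silver (refined): quality data expected -- enforce that
# amount is numeric, dropping records that fail validation.
silver = [
    {"user": r["user"], "amount": int(r["amount"])}
    for r in bronze
    if r["amount"].isdigit()
]

# Gold (feature/agg store): business-specific aggregate,
# here total spend per user.
gold = {}
for r in silver:
    gold[r["user"]] = gold.get(r["user"], 0) + r["amount"]

print(gold)  # {'a': 15}
```

Note that the bad record survives in bronze, so the silver rules can be changed later and replayed over the raw data.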
11. Features of Delta Lake
ACID Transactions
Data lake transactions done using a processing
engine are committed for durability and
exposed to other readers in an atomic fashion.
Audit History
Transaction logs enable a full audit trail
of any changes made to the data
Schema
Enforcement
Automatically enforces the schema
when writing and reading data
from the lake
Unification of batch and
streaming
A table in Delta Lake is a batch table as well
as a streaming source and sink
Full DML Support
DML operations like deletes and updates,
but also complex data merge or upsert
scenarios
Metadata Support
& Scaling
Leverages Spark’s distributed processing
power to handle all the metadata for
petabyte-scale tables with billions of files
with ease
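The "Full DML Support" feature above centers on merge/upsert. In Delta Lake itself this is expressed as `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...`; the pure-Python sketch below only illustrates those semantics with invented helper and variable names.

```python
# Conceptual sketch of MERGE/upsert semantics (not the real Delta Lake
# MERGE INTO): rows matched on a key are updated in place, and unmatched
# rows are inserted.

def upsert(target, updates, key="id"):
    """Merge `updates` into `target`: update on key match, else insert."""
    merged = {row[key]: row for row in target}  # index target rows by key
    for row in updates:
        merged[row[key]] = row                  # overwrite or add
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "val": "old"}, {"id": 2, "val": "keep"}]
updates = [{"id": 1, "val": "new"}, {"id": 3, "val": "inserted"}]
result = upsert(target, updates)
print(result)
```

In Delta Lake the same operation is transactional: readers see either the table before the merge or after it, never a half-applied mix.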
12. Getting Started
1. Delta Lake with Spark-Shell
2. Delta Lake in PySpark
3. Delta Lake on Databricks
4. Hello Delta Lake
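For the Spark-Shell/PySpark routes, Delta Lake is typically enabled by passing its package and two session configs at launch. The coordinates below are a config sketch; the package version shown is illustrative and must match your Spark version (the artifact is `delta-core_2.12` for Delta releases before 3.0 and `delta-spark_2.12` from 3.0 onward).

```shell
# Launch an interactive PySpark shell with Delta Lake enabled.
# Adjust the package version to match your Spark installation.
pyspark --packages io.delta:delta-core_2.12:2.4.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```

The same `--packages` and `--conf` flags work for `spark-shell` and `spark-submit`; on Databricks no setup is needed, as Delta is the default table format.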
14. Delta Lake
Best Practices
Choose the right partition column:
If the cardinality of a column will be very high, do
not use that column for partitioning.
Keep the amount of data in each partition under 1 GB.
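The cardinality rule above can be checked mechanically. The helper below is purely illustrative (it is not part of Delta Lake, and the threshold and sample data are invented): it rejects columns with too many distinct values, which would otherwise fragment the table into a huge number of tiny partitions.

```python
# Illustrative helper: screen candidate partition columns by cardinality.
# High-cardinality columns (e.g. a user ID) make poor partition keys.

def good_partition_columns(rows, max_cardinality=100):
    """Return the columns whose distinct-value count stays low."""
    columns = rows[0].keys()
    return [
        col for col in columns
        if len({r[col] for r in rows}) <= max_cardinality
    ]

rows = [
    {"user_id": i, "country": "US" if i % 2 else "DE"}
    for i in range(1000)
]
print(good_partition_columns(rows))  # only low-cardinality 'country' survives
```

In practice you would also weigh the expected data volume per partition (the < 1 GB guideline above), not cardinality alone.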
Improve performance for Delta Lake merge
Compact Files
A large number of small files should be rewritten
into a smaller number of larger files on a regular
basis. This is known as compaction.
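Compaction can be pictured as bin-packing small files into larger ones. The sketch below is a conceptual illustration with invented names and sizes, not Delta's actual `OPTIMIZE` command (or the repartition-and-rewrite approach used on open-source Delta).

```python
# Conceptual sketch of compaction: greedily pack many small files into
# fewer, larger files so queries open far fewer file handles.
# Sizes are in arbitrary units.

def compact(file_sizes, target_size):
    """Greedily bin-pack file sizes into chunks of up to target_size."""
    compacted, current = [], 0
    for size in file_sizes:
        if current + size > target_size and current > 0:
            compacted.append(current)  # close the current output file
            current = 0
        current += size
    if current:
        compacted.append(current)      # flush the last partial file
    return compacted

small_files = [10, 20, 15, 30, 25, 5]   # six small input files
print(compact(small_files, target_size=60))  # two larger files
```

Because Delta's rewrite happens inside a transaction, readers never see a state where both the small files and their compacted replacement are live at once.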
Enhanced checkpoints for low-latency queries
Replace the content or schema of a table
Sometimes you may want to replace a Delta table.
Spark Caching
Difference between Delta Lake and
Parquet on Apache Spark