Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment, and they are best built by leveraging the unique services of the cloud provider to reduce operational complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure for building a data lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
2. Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
3. Agenda
● Data Lake Defined
● Querying the Data Lake
● Reference Architecture
○ Spark with Databricks
○ Data Lake Storage
○ Event Hubs for Kafka
4. Data Lake Defined
Big Data Capable: store first, evaluate and model later
Data Zones:
1. Raw
2. Enriched
3. Certified / Curated
Ready for Analysts: query layer and other analytic tools access the data
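A minimal sketch of how the zones might map to folders in one ADLS Gen2 account; the account and container names are hypothetical, not a required convention.

# Hypothetical zone layout; names are illustrative only.
RAW = "abfss://raw@mydatalake.dfs.core.windows.net/"            # data exactly as it landed
ENRICHED = "abfss://enriched@mydatalake.dfs.core.windows.net/"  # cleaned and conformed
CURATED = "abfss://curated@mydatalake.dfs.core.windows.net/"    # modeled, ready for analysts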
5. Why Data Lakes? Reason #1: Store Everything
CSV, JSON, Parquet, Avro, Text
No schema on write
Cheaper storage
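A rough sketch of what "no schema on write" means in practice; the storage account and path are assumptions, and in a Databricks notebook `spark` is already defined.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Nothing validated these JSON files when they landed in the raw zone;
# Spark infers a schema at read time instead (schema-on-read).
trips_raw = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/trips/")
trips_raw.printSchema()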
6. Why Data Lakes? Reason #2: Massive Scale (Big Data)
Scale up easily
Span hot and cold storage
Pay only for what you need
7. Why Data Lakes? Reason #3: Separate Storage and Compute
Multiple analytics tools / same data
Cost savings
8. Data Warehouse Defined
Structured Data: processed and modeled for analytics use
Interactive Query: analysts can get answers to questions quickly
BI Tool Support: reporting tools can query efficiently
17. Why Spark?
Big data and the cloud changed our mindset: we want tools that scale easily as data size grows.
Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier to use than MapReduce.
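A minimal sketch of that scaling story: the same few lines of PySpark run unchanged on one machine or on hundreds of executors. The path and column names are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Spark distributes the scan and aggregation across the cluster;
# growing the data means adding executors, not rewriting the job.
trips = spark.read.parquet("abfss://enriched@mydatalake.dfs.core.windows.net/trips/")
daily = (trips
    .groupBy(F.to_date("pickup_time").alias("trip_date"))
    .agg(F.count("*").alias("trip_count")))
daily.show()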
25. Why Event Hubs?
A reliable place to stream events, decoupled from the destination.
Event Hubs is a scalable message broker that keeps up with producers and persists data for all consumers.
26. Hub for streaming data
[Architecture diagram: Trip Data and Vendor Data stream into Azure Event Hubs, which feeds the Data Lake, a User Dashboard, and a real-time report.]
27. Event Hubs key concepts
● Namespace = container to hold multiple Event Hubs
● Event Hub = Topic
● Partitions and Consumer Groups
○ Same concepts as Kafka
○ Minor differences in implementation
● Throughput Units define the level of scalability
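A sketch of reading an Event Hub through its Kafka endpoint with Spark Structured Streaming. The namespace, hub name ("trip-data"), and secret scope are assumptions; `spark` and `dbutils` are predefined in Databricks notebooks.

# Event Hubs exposes a Kafka-compatible endpoint on port 9093.
bootstrap = "mynamespace.servicebus.windows.net:9093"
conn_str = dbutils.secrets.get(scope="demo", key="eventhubs-conn")

# Kafka clients authenticate with SASL PLAIN, using the literal username
# "$ConnectionString" and the Event Hubs connection string as the password.
jaas = (
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{conn_str}";'
)

raw_stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas)
    .option("subscribe", "trip-data")   # the Event Hub name acts as the Kafka topic
    .option("startingOffsets", "earliest")
    .load())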
31. Azure Data Lake Storage Gen2: Storage Best Practices
Partition folders
Parquet or Delta format (not CSV)
Use splittable compression
Small files (< 128 MB) are a problem
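A sketch applying those practices, assuming a `trips` DataFrame with a `trip_date` column and a hypothetical enriched-zone path.

# Delta format with one folder per trip_date value (partition folders).
(trips.write
    .format("delta")
    .partitionBy("trip_date")
    .mode("append")
    .save("abfss://enriched@mydatalake.dfs.core.windows.net/delta/trips"))

# On Databricks, OPTIMIZE compacts small files into larger ones for faster reads.
spark.sql(
    "OPTIMIZE delta.`abfss://enriched@mydatalake.dfs.core.windows.net/delta/trips`")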
32. Spark is powerful, but...
● Not ACID compliant – too easy to get corrupted data
● Schema mismatches – no validation on write
● Small files written, not efficient for reading
● Reads too much data (no indexes, only partitions)
34. Delta Log
“The transaction log is the mechanism through which Delta Lake is able to offer the guarantee of atomicity.”
Reference: Databricks Blog: Unpacking the Transaction Log
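A small sketch of what that log enables, assuming the Delta table path from the earlier write.

path = "abfss://enriched@mydatalake.dfs.core.windows.net/delta/trips"

# Every commit is recorded under _delta_log; inspect the history:
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)

# Time travel: read the table exactly as it was at an earlier version.
trips_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)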
36. References
Notebooks from demos: https://github.com/datakickstart/databricks-notebooks
Pluralsight Databricks + PySpark training: https://app.pluralsight.com/channels/details/0418df96-d33b-43bc-8a77-1d437d3c53e2?s=1
LinkedIn Learning: https://www.linkedin.com/learning/apache-spark-essential-training
Delta Lake: https://www.youtube.com/watch?v=F91G4RoA8is
My YouTube data lake and Spark intros: https://youtu.be/YOu2OZ2Y2mI and https://youtu.be/Ud6luYCkkMk
More links at the bottom of this blog post: https://dustinvannoy.com/2020/04/26/journey-of-a-data-engineer-part-2/