This document provides an overview of the Databricks platform. It discusses how Databricks combines features of data warehouses and data lakes into a "data lakehouse" that supports both business intelligence/reporting and data science/machine learning use cases. Key components of the Databricks platform include Apache Spark, Delta Lake, MLflow, Jupyter notebooks, and Delta Live Tables. The platform aims to unify data engineering, data warehousing, streaming, and data science tasks on a single platform built on open source and open standards.
2. inteligencija.com
We are a Data & Analytics consulting company committed to delivering great solutions and products that enable our clients to unlock hidden opportunities within data, become data-driven and make better business decisions
Our goal is to enable data-driven business decisions
• Offices in UK, Sweden, Austria, Slovenia and Croatia
• 200+ employees
• 20 years in Data & Analytics
• 250+ projects
• 100+ clients on 5 continents
We deliver E2E Cloud Data & Analytics solutions
• Data Strategy & Governance: Discover opportunities for data monetization, assess organizational maturity, evaluate architectural options, define a cloud migration strategy, plan and prioritize projects, and estimate costs. Implement practices, concepts and processes dedicated to leveraging data as a valuable asset.
• Data Management: Design data models, improve data quality and master data, protect data, manage the whole data supply chain and make data available for any relevant business need.
• Data Science & Analytics: Utilize data and answer business questions through reporting, self-service BI and data visualization. Use machine learning algorithms to uncover unseen patterns, insights and trends in data and derive meaningful information.
• Performance Management: Automate budgeting and forecasting, financial consolidation and performance management reporting.
• Data Engineering: Collect and store data at scale, from multiple sources and formats, and make them reliable and consistent for analysis.
The story about Databricks
• The team who built Apache Spark founded Databricks in 2013
• They started several OSS projects:
• Apache Spark
• Delta Lake
• MLflow
• Invented the Data Lakehouse concept
• Named leader by Gartner in both
• Database Management Systems
Data Lakehouse Concept
• Marries Data Warehouses and Data Lakes
• Data Warehouses
• Built for efficient BI and reporting
• But:
• Poor support for unstructured data, data science and streaming
• Closed formats
• Expensive to scale
• Data Lakes
• Store any kind of data
• Cheap storage
• Allow for exploratory data analysis and streaming use cases
• However:
• Complex to set up
• Poor BI performance
• Often devolve into data swamps
Gartner insights
• 85% of Big Data and Data Science projects fail
• $3.9T of business value created by AI in 2022 (by the 15%?)
• Why do Data Science projects fail?
• Recent MIT Technology Review survey of 600 C-level executives:
“72% of the technology executives we surveyed for this study say that, should their companies fail to achieve their AI goals, data issues are more likely than not to be the reason. Improving processing speeds, governance, and quality of data, as well as its sufficiency for models, are the main data imperatives to ensure AI can be scaled, say the survey respondents.”
The usual problems
• Ill-defined use cases
• Data warehouses and data lakes in separate silos:
• Data often duplicated and/or difficult to access (formats, interfaces)
• Difficult to consolidate security models
• Difficult to apply governance
Databricks Lakehouse Platform - benefits
• Unifies Data Warehouse and AI use cases on a single platform
• Built on open source and open standards
• Consistent across cloud providers (Azure, AWS, GCP)
• Provides ACID transactions
• Schema enforcement capabilities
• In one platform:
• Data Warehousing
• Data Engineering
• Data Streaming
• Data Science and ML
• Data Governance
Computing resources
• Clusters
• One or more VM instances running Spark components: Driver and Executors
• Required for running notebooks, jobs, pipelines, …
• All-purpose clusters and job clusters
• SQL Warehouses (formerly “SQL Endpoints”)
• Optimized for BI workloads
• Required for running anything in SQL Workspace
• For exploring data, running queries, alerts, …
Apache Spark
• General-purpose, distributed data processing engine
• Efficient and fast
• Spark SQL, Spark Streaming, Spark ML
• APIs in Java, Scala, Python, R
• Widely used today – ubiquitous
• Databricks provides the Photon execution engine on top
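To make "distributed data processing" concrete, here is a toy sketch in plain Python, not Spark code. It assumes a driver that splits the input into partitions, workers that each process one partition, and a final merge of the partial results. The function names and the use of threads in place of executors on separate VMs are illustrative assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Work done by one 'executor' on its own partition of the data."""
    return Counter(word for line in partition for word in line.lower().split())


def distributed_word_count(lines, num_partitions=3):
    # The 'driver' splits the input into partitions...
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    # ...hands each partition to a worker (threads stand in for executors)...
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, partitions))
    # ...and merges the partial results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


counts = distributed_word_count(
    ["spark is fast", "Spark is general purpose", "delta lake"]
)
print(counts.most_common(2))
```

In Spark itself the partitioning, scheduling and shuffling are handled by the engine; the sketch only shows the split, process and merge shape of the computation.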
Jupyter notebooks
• Web-based, interactive and collaborative
• Databricks supports Python, SQL, R and Scala
• Can also serve as documentation (can be exported to HTML, PDF, etc.)
• Can be executed as jobs in Databricks and organized in Pipelines
• In Databricks, notebooks are attached to clusters
Delta Lake
• Data storage framework built on top of Parquet
• Provides ACID transactions; upserts (MERGE statements) and deletes
• Schema enforcement
• Time travel
• Scalable metadata handling
• Unifies streaming and batch processing
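The features above can be illustrated with a toy in-memory table. This is a conceptual sketch only and does not use the Delta Lake API. It assumes that every commit stores a full snapshot, whereas Delta Lake keeps a transaction log of actions over Parquet files. The class name `ToyDeltaTable` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """In-memory stand-in for a versioned table; NOT the Delta Lake API."""
    schema: dict                                  # column name -> Python type
    key: str                                      # column used to match rows in merge()
    log: list = field(default_factory=list)       # one snapshot per committed version

    def _validate(self, row):
        # Schema enforcement: reject rows with unexpected columns or types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise ValueError(f"{col!r} must be {typ.__name__}")

    def _commit(self, rows):
        # Each commit appends an immutable snapshot; older versions stay readable.
        self.log.append(tuple(dict(r) for r in rows))
        return len(self.log) - 1                  # version number of the new commit

    def merge(self, updates):
        """Upsert: replace rows whose key matches, insert the rest."""
        for row in updates:
            self._validate(row)                   # all-or-nothing: validate before changing
        current = {r[self.key]: r for r in self.read()}
        for row in updates:
            current[row[self.key]] = row
        return self._commit(current.values())

    def delete(self, predicate):
        """Commit a new version without the rows matching `predicate`."""
        return self._commit(r for r in self.read() if not predicate(r))

    def read(self, version=None):
        """Time travel: read the latest snapshot or any earlier version."""
        if not self.log:
            return []
        snapshot = self.log[-1] if version is None else self.log[version]
        return [dict(r) for r in snapshot]


table = ToyDeltaTable(schema={"id": int, "city": str}, key="id")
table.merge([{"id": 1, "city": "Zagreb"}, {"id": 2, "city": "Vienna"}])     # version 0
table.merge([{"id": 2, "city": "Ljubljana"}, {"id": 3, "city": "London"}])  # version 1
table.delete(lambda r: r["id"] == 1)                                        # version 2
print(table.read())            # latest state
print(table.read(version=0))   # state as of the first commit
```

A batch that violates the schema raises before anything is committed, so readers never see a half-applied write. The sketch imitates only that part of the ACID guarantees.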
Delta Live Tables
• Framework for building data processing pipelines
• You define transformations and DLT manages:
• Orchestration
• Cluster management
• Monitoring
• Data quality (Expectations)
• Error handling
• Can perform CDC with APPLY CHANGES INTO .. FROM ..
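The sketch below illustrates two of the ideas on this slide in plain Python: data quality expectations and change data capture (CDC). It is not DLT code. The helper names, the `op` and `seq` columns, and the rule that drops failing rows are assumptions made for the example. DLT handles these declaratively.

```python
def apply_expectations(rows, expectations):
    """Keep rows that satisfy every rule; count failures per rule name."""
    kept = []
    failures = {name: 0 for name in expectations}
    for row in rows:
        failed = [name for name, rule in expectations.items() if not rule(row)]
        for name in failed:
            failures[name] += 1
        if not failed:
            kept.append(row)
    return kept, failures


def apply_changes(target, changes, key, sequence_by, is_delete, drop=("op", "seq")):
    """Apply CDC events to `target` (a dict keyed by `key`) in sequence order."""
    for event in sorted(changes, key=lambda e: e[sequence_by]):
        if is_delete(event):
            target.pop(event[key], None)          # delete if present
        else:
            # Upsert the row, leaving out the CDC bookkeeping columns.
            target[event[key]] = {k: v for k, v in event.items() if k not in drop}
    return target


# Change events arrive out of order; `seq` says which one is newest.
events = [
    {"id": 1, "name": "Ana", "op": "UPSERT", "seq": 1},
    {"id": 2, "name": None, "op": "UPSERT", "seq": 2},
    {"id": 1, "name": "Ana K.", "op": "UPSERT", "seq": 4},
    {"id": 3, "name": "Ivo", "op": "UPSERT", "seq": 3},
    {"id": 3, "name": "Ivo", "op": "DELETE", "seq": 5},
]

valid, failures = apply_expectations(
    events,
    {"name_present": lambda r: r["op"] == "DELETE" or r["name"] is not None},
)
customers = apply_changes(
    {}, valid, key="id", sequence_by="seq",
    is_delete=lambda e: e["op"] == "DELETE",
)
print(failures)
print(customers)
```

Ordering by a sequence column is what lets late or out-of-order events resolve to the correct final row.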
MLflow
• Framework for managing machine learning lifecycles
• MLflow Tracking – tracks experiments and runs, parameters, metrics
• MLflow Models – storage format for describing models of different “flavors” (e.g. sklearn, keras, xgboost)
• MLflow Projects – package code in a format to reproduce runs on different platforms
• Model Registry – manage models in a central repository
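As a rough illustration of how tracking and a registry fit together, here is a toy sketch in plain Python. It does not call MLflow. `ToyTracker` and its methods are hypothetical stand-ins that loosely mirror what `mlflow.start_run`, `mlflow.log_param`, `mlflow.log_metric` and `mlflow.register_model` do in the real library.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: int
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class ToyTracker:
    """In-memory stand-in for experiment tracking plus a model registry."""

    def __init__(self):
        self.runs = []
        self.registry = {}     # model name -> list of run ids; position + 1 = version

    def start_run(self, **params):
        # Record the hyperparameters used for this run.
        run = Run(run_id=len(self.runs), params=dict(params))
        self.runs.append(run)
        return run

    def log_metric(self, run, name, value):
        run.metrics[name] = value

    def best_run(self, metric, higher_is_better=True):
        # Compare only the runs that actually logged the metric.
        scored = [r for r in self.runs if metric in r.metrics]
        pick = max if higher_is_better else min
        return pick(scored, key=lambda r: r.metrics[metric])

    def register(self, name, run):
        # Registering the same name again creates the next version.
        versions = self.registry.setdefault(name, [])
        versions.append(run.run_id)
        return len(versions)


tracker = ToyTracker()
for depth, accuracy in [(3, 0.81), (5, 0.86), (8, 0.84)]:   # made-up values
    run = tracker.start_run(max_depth=depth)
    tracker.log_metric(run, "accuracy", accuracy)

best = tracker.best_run("accuracy")
version = tracker.register("churn_model", best)
print(best.params, version)
```

The parameter values and metric scores are invented for the example. The point is the workflow: log every run, compare on a metric, then promote the winner under a stable model name.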