4. Extract/Transform/Load (ETL) is an integration approach that pulls
information from remote sources, transforms it into defined formats and styles,
then loads it into databases, data sources, or Data Warehouses.
Extract/Load/Transform (ELT) similarly extracts data from one or multiple
remote sources, but then loads it into the target Data Lake without any other formatting.
In an ELT process, the transformation of data happens within the target database. ELT
asks less of remote sources, requiring only their raw and unprepared data.
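The contrast between the two can be sketched in plain Python with an in-memory SQLite database standing in for the target warehouse. The table names and sample rows are made up for illustration; the point is only where the transformation step runs.

```python
import sqlite3

# Toy rows standing in for a remote source system (hypothetical data).
source_rows = [("alice", "2024-01-05", "42.50"), ("BOB", "2024-01-06", "17.00")]

# --- ETL: transform in the pipeline, then load the cleaned rows ---
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE sales (customer TEXT, day TEXT, amount REAL)")
cleaned = [(name.title(), day, float(amount)) for name, day, amount in source_rows]
etl_db.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)

# --- ELT: load the raw text as-is, then transform inside the target database ---
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_sales (customer TEXT, day TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", source_rows)
elt_db.execute(
    "CREATE TABLE sales AS "
    "SELECT upper(substr(customer,1,1)) || lower(substr(customer,2)) AS customer, "
    "day, CAST(amount AS REAL) AS amount FROM raw_sales"
)

print(etl_db.execute("SELECT * FROM sales").fetchall())
print(elt_db.execute("SELECT * FROM sales").fetchall())
```

Both paths end with the same cleaned table; ELT simply keeps the raw copy in the target and pushes the transformation into SQL there.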
5. Big Data Lambda Architecture
Big data solutions typically involve one or more of the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
Consider big data architectures when you need to:
• Store and process data in volumes too large for a traditional database.
• Transform unstructured data for analysis and reporting.
• Capture, process, and analyze unbounded streams of data in real time.
6. Big Data Batch Processing
A common big data scenario is batch processing of data at rest. In this scenario, the source data is loaded into data storage, either
by the source application itself or by an orchestration workflow. The data is then processed in place by a parallelized job, which
can also be initiated by the orchestration workflow. The processing may include multiple iterative steps before the transformed
results are loaded into an analytical data store, which can be queried by analytics and reporting components.
A batch layer (cold path) stores all of the incoming data in its raw form and performs
batch processing on the data. The result of this processing is stored as a batch view.
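The defining trait of the cold path is that the batch view is recomputed from the full, immutable raw data rather than updated incrementally. A minimal sketch in plain Python (the sensor events are invented for illustration; a real cold path would run this as a parallelized Spark or MapReduce job):

```python
# Raw immutable events (cold path input); the batch view is rebuilt from scratch.
raw_events = [
    {"sensor": "a", "count": 3},
    {"sensor": "b", "count": 1},
    {"sensor": "a", "count": 2},
]

def build_batch_view(events):
    """Recompute the batch view over the entire data set, as the cold path does."""
    view = {}
    for e in events:
        view[e["sensor"]] = view.get(e["sensor"], 0) + e["count"]
    return view

print(build_batch_view(raw_events))  # {'a': 5, 'b': 1}
```

Because the raw events are never mutated, the view can always be rebuilt, which is what makes the batch layer's results authoritative.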
7. Big Data Real-Time Processing
Real-time processing deals with streams of data that are captured in real time and processed with minimal
latency to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic
monitoring solution might use sensor data to detect high traffic volumes. This data could be used to dynamically
update a map to show congestion, or to automatically initiate high-occupancy lanes or other traffic management
systems.
A speed layer (hot path) analyzes data in real time. This layer is
designed for low latency, at the expense of accuracy.
10. Streaming Data
Bounded Data
Bounded data is finite and unchanging data, where everything is known about the set of data. Typically, bounded data has a
known ending point and is relatively fixed. An easy example is last year's sales numbers for the Tesla Model S.
Since we are looking into the past, we have a perfect timebox with a fixed number of results (number of sales).
Unbounded Data
Unbounded data is unpredictable, infinite, and not always sequential. The data creation is a never-ending cycle, similar
to Bill Murray in Groundhog Day: it just keeps going and going. For example, data generated on a web-scale
enterprise network is unbounded. The network traffic messages and logs are constantly being generated, external
traffic can scale up and generate more messages, remote systems with latency can report non-sequential logs, and so on.
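Python generators give a compact way to see the difference: a bounded source can be consumed to completion, while an unbounded source can only ever be sampled through a window. The sales figures below are made-up placeholders, not real numbers.

```python
import itertools

def bounded_sales():
    """Bounded: a finite, fixed data set -- e.g. last year's sales numbers."""
    yield from [120, 90, 143]

def unbounded_logs():
    """Unbounded: a never-ending stream, like network logs; there is no final element."""
    for i in itertools.count():
        yield {"seq": i, "msg": f"log line {i}"}

total = sum(bounded_sales())                              # completes: known ending point
first_five = list(itertools.islice(unbounded_logs(), 5))  # can only take a window
print(total, len(first_five))
```

Calling `sum()` on the unbounded generator would never return, which is exactly why stream processors work in windows rather than over "all" the data.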
11. Streaming Data Kappa Architecture
A drawback to the lambda architecture is its complexity. Processing logic appears in two different places, the cold
and hot paths, using different frameworks. This leads to duplicate computation logic and the complexity of
managing the architecture for both paths.
The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same basic
goals as the lambda architecture, but with an important distinction: all data flows through a single path, using a
stream processing system.
Apache Flink
Apache Beam
Apache Spark
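The "single path" idea can be sketched in a few lines of plain Python: one processing function handles both live events and replays of the retained event log, so there is no separate batch codebase to keep in sync. The event shape here is hypothetical.

```python
def process(event, state):
    """The single stream-processing function, used for live events and replay alike."""
    key = event["user"]
    state[key] = state.get(key, 0) + 1
    return state

# In a kappa architecture there is no separate batch path: to "reprocess",
# you replay the retained event log through the same code.
event_log = [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]

state = {}
for e in event_log:                       # replay of retained history
    state = process(e, state)
state = process({"user": "u2"}, state)    # a live event takes the same path
print(state)  # {'u1': 2, 'u2': 2}
```

Frameworks such as Flink, Beam, and Spark make this practical by letting the same dataflow program run over both historical and live streams.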
12. Apache Kafka is an open source distributed
streaming platform capable of handling trillions of
events a day. Initially conceived as a messaging queue,
Kafka is based on an abstraction of a distributed
commit log. Since being created and open sourced by
LinkedIn in 2011, Kafka has quickly evolved from
messaging queue to a full-fledged streaming platform.
Founded by the original developers of Apache Kafka,
Confluent delivers the most complete distribution of
Kafka with Confluent Platform. Confluent Platform
improves Kafka with additional open source and
commercial features designed to enhance the streaming
experience of both operators and developers in
production, at massive scale.
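The commit-log abstraction the slide mentions can be modeled in a few lines, without a broker. This toy class is not Kafka's API; it only shows the two properties that make the abstraction powerful: records are appended in order and addressed by offset, and independent consumers read from whatever offset they choose.

```python
class CommitLog:
    """Toy model of Kafka's core abstraction: an append-only, offset-addressed log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to the new record

    def read(self, offset):
        """Consumers read from any offset, independently of one another."""
        return self._records[offset:]

log = CommitLog()
for msg in ["page_view", "click", "purchase"]:
    log.append(msg)

consumer_a = log.read(0)  # replays the full history
consumer_b = log.read(2)  # joins near the tail
print(consumer_a, consumer_b)
```

Because the log is never mutated in place, slow or restarted consumers simply resume from their last committed offset, which is also what enables the replay behavior the kappa architecture relies on.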
13. Databricks
The Unified Analytics Platform
Unifying Data Science, Engineering and Business
Accelerate performance with an optimized Spark platform
Increase productivity through interactive data science
Streamline processes from ETL to production
Reduce cost and complexity with a fully managed, cloud-native platform
14. Databricks I/O
The Databricks I/O module (DBIO) takes processing speeds to the next level with an
optimized AWS S3 access layer — significantly improving the performance of Apache
Spark™ in the cloud.
• Higher S3 Throughput
• More Efficient Decoding
• Data Skipping
• Transactional Writes on S3
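Of these, data skipping is the easiest to see in miniature: keep min/max statistics per file, and prune any file whose value range cannot match the query predicate. The file names and ranges below are invented; this is a sketch of the technique, not of DBIO's implementation.

```python
# Per-file min/max statistics, as a data-skipping index might maintain (toy model).
files = [
    {"path": "part-0.parquet", "min_id": 0,   "max_id": 99},
    {"path": "part-1.parquet", "min_id": 100, "max_id": 199},
    {"path": "part-2.parquet", "min_id": 200, "max_id": 299},
]

def files_to_scan(lo, hi):
    """Skip any file whose [min, max] range cannot contain matching rows."""
    return [f["path"] for f in files if f["max_id"] >= lo and f["min_id"] <= hi]

print(files_to_scan(150, 210))  # ['part-1.parquet', 'part-2.parquet']
```

For a selective predicate this prunes most files before any bytes are read from S3, which is where the claimed throughput gains come from.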
15. Databricks Serverless
Databricks’ serverless and highly elastic cloud service is designed to remove operational
complexity while ensuring reliability and cost efficiency at scale.
• Shared Pools
• Auto-Scaling
• Auto-Configuration
• Reliable Fine-Grained Sharing
16. For Data Engineering
Build Fast and Reliable Data Pipelines
• Cross-Team Collaboration – Sharing Insights in a Real-Time,
Interactive Workspace.
• Unifying ALL Analytics – Batch, Ad-Hoc, Machine Learning, Deep
Learning, Streaming and Graph Processing.
• Production Workflows – Unified Platform Streamlines End-to-End
Data Ingest and ETL.
• Robust Integrations – Connectors for AWS Tools and Data Stores,
and Integration with CI/CD.
For Data Science
Unleash the Power of Machine Learning and AI
• Automated Cluster Management – Launch Expertly Tuned Spark Clusters.
• Optimized for Performance – Databricks Runtime Optimizes Spark by 10-40x.
• Support for Multiple Programming Languages – Interactively Query Large-Scale
Data Sets in R, Python, Scala or SQL.
• Highly Extensible – Use of Popular ML Libraries within Notebooks – scikit-learn,
NLTK, pandas.
17. Microsoft Azure Databricks
Fast, Easy & Collaborative Apache Spark Analytics Platform
Scale Without Limits
Integrate Effortlessly with
- Azure SQL DW
- Azure Cosmos DB
- Azure Data Lake Store
- Azure Blob Storage
- Azure Event Hubs
- Azure IoT Hub
- Azure Data Factory
- Power BI
18. Azure SQL Data Warehouse [PaaS] – Platform-as-a-Service for DW
Analytics Platform for Enterprises
Cloud Data Warehouse [DW] with MPP [Massively Parallel Processing]
Elastic Compute & Storage, Fully Managed Infrastructure with Ecosystem Integration
SQL Server Foundation Stack – Support for SQL & PolyBase Language
19. Azure SQL Data Warehouse
PolyBase for Querying Big Data Stores [T-SQL]
Columnar Data Storage
[MPP] Massively Parallel Processing
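Why columnar storage matters for an analytics warehouse can be shown with two layouts of the same toy table (the rows are invented for illustration): an aggregate over one column touches only that column's array in the columnar layout, while the row layout forces reading every field of every record.

```python
# Row-oriented: each record is stored together.
rows = [(1, "widget", 9.99), (2, "gadget", 24.50), (3, "gizmo", 3.25)]

# Column-oriented: each column is stored contiguously.
columns = {
    "id":    [1, 2, 3],
    "name":  ["widget", "gadget", "gizmo"],
    "price": [9.99, 24.50, 3.25],
}

# SUM(price) on columnar storage scans only the price array...
colsum = sum(columns["price"])
# ...while the row layout reads every field of every record.
rowsum = sum(price for _, _, price in rows)
print(colsum == rowsum, colsum)
```

Contiguous same-typed values also compress far better, which is the other half of the columnar advantage for warehouse workloads.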
20. Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark
and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that:
• Stores data as Parquet files in DBFS.
• Maintains a transaction log that efficiently tracks changes to the table.
You read and write data stored in the delta format using the same familiar Apache Spark SQL batch and streaming APIs that
you use to work with Hive tables and DBFS directories. With the addition of the transaction log and other enhancements,
Databricks Delta offers significant benefits:
ACID transactions
• Multiple writers can simultaneously modify a dataset and see consistent views.
• Writers can modify a dataset without interfering with jobs reading the dataset.
Fast read access
• Automatic file management organizes data into large files that can be read efficiently.
• Statistics enable speeding up reads by 10-100x, and data skipping avoids reading irrelevant information.
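The "data files plus ordered transaction log" design can be modeled in miniature. This toy class is not Delta's actual protocol; it only illustrates the shape: the table's state is whatever the committed log implies, so writers commit whole batches of file add/remove actions and readers never see a half-done write.

```python
class DeltaTable:
    """Toy model of a Delta-style table: data files plus an ordered transaction log."""

    def __init__(self):
        self.files = {}  # filename -> list of rows (stands in for Parquet in DBFS)
        self.log = []    # ordered record of committed actions

    def commit(self, actions):
        """Apply a batch of add/remove actions as one unit (toy ACID commit)."""
        for op, name, rows in actions:
            if op == "add":
                self.files[name] = rows
            elif op == "remove":
                self.files.pop(name, None)
        self.log.append(actions)

    def snapshot(self):
        """Readers see the state implied by committed actions, nothing in between."""
        return [row for rows in self.files.values() for row in rows]

t = DeltaTable()
t.commit([("add", "part-0", [1, 2])])
# Automatic file management as one commit: swap small files for one large file.
t.commit([("remove", "part-0", None), ("add", "part-0-compacted", [1, 2])])
print(t.snapshot(), len(t.log))
```

Because compaction is a single commit, a concurrent reader sees either the small files or the compacted file, never an empty in-between state, which is the essence of the ACID guarantees listed above.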
21. Data Warehouse [DW] – (1st Gen)
ETL for Data Centralization & BI Analysis
Not Future-Proof – Missing Predictions, Real-Time, Scale
• Pristine
• Fast Queries
• Transactional
• Expensive for Scale, Not Elastic
• Requires ETL, Stale Data, No Real-Time
• No Predictions, No ML
• Closed Formats [Lock-In]
Hadoop Data Lake – (2nd Gen)
ETL ALL Data, Scalable, Open Lake for ALL Use Cases
Became a Cheap, Messy Data Store with Poor Performance
• Massive Scale
• Inexpensive Storage
• Open Formats [Parquet, ORC]
• Promise of ML & Real-Time Streaming
• Inconsistent Data
• Unreliable for Analytics
• Lack of Schema
• Poor Performance
Databricks
The Unified Analytics Platform
Databricks Delta – (3rd Gen)
A Unified Data Management System for Real-Time Big Data
Powerful Transactional Storage Layer
• The Good of DWs
• The Good of Data Lakes
• Decoupled Compute & Storage
• ACID Transactions & Data Validation
• Data Indexing & Caching [10x-100x]
• Real-Time Streaming Ingest