ANAC – Agência Nacional de Aviação Civil
Optimizing Big Data, BI, and Artificial Intelligence Projects with
Agenda
Big Data
ETL vs ELT
Arquitetura Lambda
Streaming
Arquitetura Kappa
Azure Databricks
Azure SQL Dw
Azure Databricks Delta
Big Data
Extract/Transform/Load (ETL) is an integration approach that pulls
information from remote sources, transforms it into defined formats and styles,
then loads it into databases, data sources, or data warehouses.
Extract/Load/Transform (ELT) similarly extracts data from one or more
remote sources, but then loads it into the target data lake without any other formatting.
In an ELT process, the transformation of data happens within the target database. ELT
asks less of remote sources, requiring only their raw and unprepared data.
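The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes an in-memory SQLite database stands in for the target store, and the `RAW_ROWS` data, table names, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical raw records from a remote source (illustrative data only).
RAW_ROWS = [("  alice ", "10.50"), ("BOB", "3"), ("carol", "7.25")]


def run_etl(rows):
    """ETL: transform in the pipeline, then load only the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    # Transformation happens here, before anything reaches the target.
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return conn.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()


def run_elt(rows):
    """ELT: load raw rows untouched, then transform inside the target with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    # Transformation happens here, using the target's own engine.
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT lower(trim(customer)) AS customer, "
        "CAST(amount AS REAL) AS amount FROM raw_sales"
    )
    return conn.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()
```

Both functions end with the same cleaned table. What differs is where the cleaning runs and whether the raw data is kept in the target for later reprocessing.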
Big Data
Lambda Architecture
Big data solutions typically involve one or more of the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
Consider big data architectures when you need to:
• Store and process data in volumes too large for a traditional database.
• Transform unstructured data for analysis and reporting.
• Capture, process, and analyze unbounded streams of data in real time.
Big Data
Batch Processing
A common big data scenario is batch processing of data at rest. In this scenario, the source data is loaded into data storage, either
by the source application itself or by an orchestration workflow. The data is then processed in place by a parallelized job, which
can also be initiated by the orchestration workflow. The processing may include multiple iterative steps before the transformed
results are loaded into an analytical data store, which can be queried by analytics and reporting components.
A batch layer (cold path) stores all of the incoming data in its raw form and performs
batch processing on the data. The result of this processing is stored as a batch view.
Big Data
Real-Time Processing
Real-time processing deals with streams of data that are captured in real time and processed with minimal
latency to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic
monitoring solution might use sensor data to detect high traffic volumes. This data could be used to dynamically
update a map to show congestion, or automatically initiate high-occupancy lanes or other traffic management
systems.
A speed layer (hot path) analyzes data in real time. This layer is
designed for low latency, at the expense of accuracy.
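The way the batch layer and speed layer combine can be sketched as a toy, single-process model. The class and method names below are hypothetical. A real lambda deployment would use separate batch and stream engines, and the speed layer may trade accuracy for latency, which this sketch does not model.

```python
from collections import Counter


class LambdaPipeline:
    """Toy lambda architecture that counts events per key."""

    def __init__(self):
        self.master = []                # cold path: all raw events, append-only
        self.batch_view = Counter()     # result of the most recent batch run
        self.realtime_view = Counter()  # hot path: events since the last batch run

    def ingest(self, key):
        # Every event goes to both paths.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # Recompute the batch view from the full raw dataset, then reset the
        # speed layer, because its events are now covered by the batch view.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        # The serving side merges the batch view with the real-time view.
        return self.batch_view[key] + self.realtime_view[key]
```

A query issued between batch runs still sees recent events, because the real-time view covers whatever the batch view has not yet absorbed.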
Streaming Data
Continuously Generated – [In-Motion]
Different Sources & Types of Data
Processed Incrementally
Data is Sent in Small Sizes [KB]
Streaming Data
Advantages of Streaming Analytics
- Analyze Data as Fast as Possible
- Provides Deeper Insights Through Data Visualization
- Offers Insight into Customer Behavior
- Remain Competitive
Streaming Data
Bounded Data
Bounded data is finite and unchanging data, where everything is known about the set of data. Typically, bounded data has a
known ending point and is relatively fixed. An easy example is last year's sales numbers for the Tesla Model S.
Since we are looking into the past, we have a perfect timebox with a fixed number of results (number of sales).
Unbounded Data
Unbounded data is unpredictable, infinite, and not always sequential. Data creation is a never-ending cycle, similar
to Bill Murray in Groundhog Day. It just keeps going and going. For example, data generated on a web-scale
enterprise network is unbounded. Network traffic messages and logs are constantly being generated, external
traffic can scale up and generate more messages, remote systems with latency can report logs out of sequence, and so on.
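In code, the distinction shows up in how an aggregate is computed. The sketch below uses hypothetical function names and assumes a plain Python generator stands in for an unbounded source. A bounded dataset can be summed once, while an unbounded one only ever yields a running result.

```python
import itertools


def total_bounded(sales):
    """Bounded data: the whole set is known, so one pass gives a final answer."""
    return sum(sales)


def running_totals(stream):
    """Unbounded data: no final answer exists, so emit an updated total per event."""
    total = 0
    for value in stream:
        total += value
        yield total


def endless_events():
    """Hypothetical unbounded source: yields 1, 2, 3, ... and never terminates."""
    yield from itertools.count(1)
```

A consumer of `running_totals(endless_events())` has to decide when to look at the result, for example by taking the first few values with `itertools.islice`, because the stream itself never signals completion.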
Streaming Data
Kappa Architecture
A drawback of the lambda architecture is its complexity. Processing logic appears in two different places (the cold
and hot paths) using different frameworks. This leads to duplicate computation logic and the complexity of
managing the architecture for both paths.
The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same basic
goals as the lambda architecture, but with an important distinction: all data flows through a single path, using a
stream processing system.
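A minimal sketch of the single-path idea follows, assuming an in-memory list stands in for the durable event log; `EventLog` and `build_view` are hypothetical names. Because there is no separate batch path, changing the processing logic means replaying the same log through the new logic.

```python
class EventLog:
    """Append-only log of (key, value) events; the only source of truth."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self, from_offset=0):
        # Reprocessing is just reading the log again from an earlier offset.
        return iter(self._events[from_offset:])


def build_view(log, transform):
    """Single stream path: fold every event in the log into a per-key total."""
    view = {}
    for key, value in log.replay():
        view[key] = view.get(key, 0) + transform(value)
    return view
```

Calling `build_view` a second time with a different `transform` rebuilds the view from the same events, so one code path covers both live processing and recomputation.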
Apache Flink
Apache Beam
Apache Spark
Apache Kafka is an open source distributed
streaming platform capable of handling trillions of
events a day. Initially conceived as a messaging queue,
Kafka is based on an abstraction of a distributed
commit log. Since being created and open sourced by
LinkedIn in 2011, Kafka has quickly evolved from a
messaging queue to a full-fledged streaming platform.
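The commit log abstraction can be illustrated without any Kafka client. The sketch below is a single-process toy with hypothetical `CommitLog` and `Consumer` classes. It does not use or imitate the real Kafka API, and it leaves out partitioning, replication, and persistence. It shows only the core idea: records are appended at increasing offsets, and each consumer tracks its own position.

```python
class CommitLog:
    """Append-only sequence of records addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to this record

    def read(self, offset, max_records=10):
        # Reading never removes records, so any number of readers can coexist.
        return self._records[offset:offset + max_records]


class Consumer:
    """Reader that remembers how far into the log it has progressed."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because reads do not delete anything, a slow consumer and a fast consumer can work through the same log independently. This is the property that separates a log from a traditional queue.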
Founded by the original developers of Apache Kafka,
Confluent delivers the most complete distribution of
Kafka with Confluent Platform. Confluent Platform
improves Kafka with additional open source and
commercial features designed to enhance the streaming
experience of both operators and developers in
production, at massive scale.
Databricks
The Unified Analytics Platform
Unifying Data Science, Engineering and Business
Accelerate performance
with an optimized Spark platform
Increase productivity
through interactive data science
Streamline processes
from ETL to production
Reduce cost and complexity
with a fully managed, cloud-native platform
Databricks I/O
The Databricks I/O module (DBIO) takes processing speeds to the next level with an
optimized AWS S3 access layer, significantly improving the performance of Apache
Spark™ in the cloud.
• Higher S3 Throughput
• More Efficient Decoding
• Data Skipping
• Transactional Writes on S3
Databricks Serverless
Databricks’ serverless and highly elastic cloud service is designed to remove operational
complexity while ensuring reliability and cost efficiency at scale.
• Shared Pools
• Auto-Scaling
• Auto-Configuration
• Reliable Fine-Grained Sharing
For Data Engineering
Build Fast and Reliable Data Pipelines
• Cross-Team Collaboration – Sharing Insights in Real-Time,
Interactive Workspace.
• Unifying ALL Analytics – Batch, Ad-Hoc, Machine Learning, Deep
Learning, Streaming and Graph Processing.
• Production Workflows – Unified Platform Streamlines End-to-End
Data Ingest and ETL.
• Robust Integrations – AWS Tools and Data Stores Connectors and
Integrate with CI/CD.
For Data Science
Unleash the Power of Machine Learning and AI
• Automated Cluster Management – Launch Expertly-Tuned Spark Clusters.
• Optimized for Performance – Databricks Runtime Optimizes Spark – 10-40x.
• Support for Multiple Programming Languages – Interactive Query Large-Scale
Datasets in R, Python, Scala or SQL.
• Highly Extensible – Use of Popular Libraries within Notebook - scikit-learn,
nltk ML, pandas.
Microsoft Azure Databricks
Fast, Easy & Collaborative Apache Spark Analytics Platform
Scale Without Limits
Integrate Effortlessly with
- Azure SQL Dw
- Azure Cosmos DB
- Azure Data Lake Store
- Azure Blob Storage
- Azure Event Hubs
- Azure IoT Hub
- Azure Data Factory
- Power BI
Azure SQL Data Warehouse [PaaS] – Platform-as-a-Service for Dw
Analytics Platform for Enterprises
Cloud Data Warehouse [Dw] with MPP [Massively Parallel Processing]
Elastic Compute & Storage, Fully Managed Infrastructure with Ecosystem Integration
SQL Server Foundation Stack – Support for SQL & PolyBase Language
Azure SQL Data Warehouse
PolyBase for Querying Big Data Stores – [T-SQL]
Columnar Data Storage Type
[MPP] Massively Parallel Processing
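The MPP idea behind a warehouse of this kind can be sketched as hash distribution followed by local aggregation. This is a toy model under stated assumptions, not how Azure SQL Data Warehouse is implemented: threads stand in for compute nodes, and the CRC32 hash and all function names are illustrative choices.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_nodes):
    """Assign each (key, amount) row to a node by hashing its key."""
    shards = [[] for _ in range(num_nodes)]
    for key, amount in rows:
        shards[zlib.crc32(key.encode()) % num_nodes].append((key, amount))
    return shards


def local_aggregate(shard):
    """Work done on one node: sum amounts per key for its own rows only."""
    totals = Counter()
    for key, amount in shard:
        totals[key] += amount
    return totals


def mpp_sum(rows, num_nodes=4):
    """Run the per-node aggregation in parallel, then combine the partial results."""
    shards = distribute(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, shards))
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds values key by key
    return dict(combined)
```

Hashing on the grouping key places all rows for a given key on the same node, so each partial result is already complete for its keys and the final combine step is cheap.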
Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark
and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that
• Stores data as Parquet files in DBFS.
• Maintains a transaction log that efficiently tracks changes to the table.
You read and write data stored in the delta format using the same familiar Apache Spark SQL batch and streaming APIs that
you use to work with Hive tables and DBFS directories. With the addition of the transaction log and other enhancements,
Databricks Delta offers significant benefits:
ACID transactions
• Multiple writers can simultaneously modify a dataset and see consistent views.
• Writers can modify a dataset without interfering with jobs reading the dataset.
Fast read access
• Automatic file management organizes data into large files that can be read efficiently.
• Statistics enable speeding up reads by 10-100x, and data skipping avoids reading irrelevant information.
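How a transaction log and per-file statistics can give consistent snapshots and data skipping is sketched below. This is a simplified illustration, not the actual Databricks Delta log format or API. The `AddFile`, `RemoveFile`, and `ToyTable` names are hypothetical, and the statistics are reduced to one min/max pair per file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddFile:
    path: str
    min_value: int  # smallest value of the filter column in this file
    max_value: int  # largest value of the filter column in this file


@dataclass(frozen=True)
class RemoveFile:
    path: str


class ToyTable:
    """Table state is defined entirely by an ordered log of commits."""

    def __init__(self):
        self._log = []  # each entry is one atomic commit: a tuple of actions

    def commit(self, *actions):
        # All actions in a commit become visible together, or not at all.
        self._log.append(tuple(actions))
        return len(self._log) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Replay the log up to a version to find the live data files."""
        if version is None:
            version = len(self._log) - 1
        live = {}
        for actions in self._log[:version + 1]:
            for action in actions:
                if isinstance(action, AddFile):
                    live[action.path] = action
                else:
                    live.pop(action.path, None)
        return live

    def files_to_scan(self, lo, hi, version=None):
        """Data skipping: keep only files whose [min, max] overlaps [lo, hi]."""
        return sorted(
            f.path
            for f in self.snapshot(version).values()
            if f.max_value >= lo and f.min_value <= hi
        )
```

A reader pinned to an earlier version keeps seeing the files that were live at that version, even after a later commit replaces them. In this sketch, that is how writers avoid interfering with running readers.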
Data Warehouse [Dw] – (1st Gen)
ETL for Data Centralization & BI Analysis
Not Future-Proof – Missing Predictions, Real-Time, Scale
• Pristine
• Fast Queries
• Transactional
• Expensive for Scale, Not Elastic
• Requires ETL, Stale Data, No Real-Time
• No Predictions, No ML
• Closed Formats [Lock In]
Hadoop Data Lake – (2nd Gen)
ETL ALL Data, Scalable, Open Lake for ALL Use Cases
Becomes a Cheap, Messy Data Store with Poor Performance
• Massive Scale
• Inexpensive Storage
• Open-Formats [Parquet, ORC]
• Promise of ML & Real-Time Streaming
• Inconsistent Data
• Unreliable for Analytics
• Lack of Schema
• Poor Performance
Databricks
The Unified Analytics Platform
Databricks Delta – (3rd Gen)
A Unified Data Management System for Real-Time Big Data
Powerful Transactional Storage Layer
• The Good of Dw
• The Good of Data Lakes
• Decoupled Compute & Storage
• ACID Transactions & Data Validation
• Data Indexing & Caching [10x-100x]
• Real-Time Streaming Ingest

The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Optimizing Big Data, Dw, and AI Projects on Microsoft Azure

  • 1. ANAC – Agência Nacional de Aviação Civil: Optimizing Big Data, BI, and Artificial Intelligence Projects with
  • 2. Agenda: Big Data • ETL vs ELT • Lambda Architecture • Streaming • Kappa Architecture • Azure Databricks • Azure SQL Dw • Azure Databricks Delta
  • 4. Extract/Transform/Load (ETL) is an integration approach that pulls information from remote sources, transforms it into defined formats and styles, then loads it into databases, data sources, or Data Warehouses. Extract/Load/Transform (ELT) similarly extracts data from one or multiple remote sources, but then loads it into the target Data Lake without any other formatting. In an ELT process, the transformation of data happens within the target database. ELT asks less of remote sources, requiring only their raw and unprepared data.
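The difference between the two approaches comes down to where the transformation runs. The following is a minimal sketch, not taken from the slides: it assumes an in-memory SQLite database as a stand-in for the target store, and the rows and the table names (`sales_etl`, `sales_raw`, `sales_elt`) are hypothetical.

```python
import sqlite3

# Hypothetical raw records from a remote source (illustrative data only).
RAW_ROWS = [("  Alice ", "10.5"), ("BOB", "3"), ("carol", "7.25")]


def run_etl(conn):
    """ETL: transform inside the pipeline, then load the cleaned rows."""
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in RAW_ROWS]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def run_elt(conn):
    """ELT: load the raw rows untouched, then transform inside the target."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT lower(trim(customer)) AS customer, "
        "CAST(amount AS REAL) AS amount "
        "FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)

query = "SELECT customer, amount FROM {} ORDER BY customer"
etl_rows = conn.execute(query.format("sales_etl")).fetchall()
elt_rows = conn.execute(query.format("sales_elt")).fetchall()
```

Both paths end with the same cleaned table. In the ELT path the raw copy (`sales_raw`) also stays in the target, so the transformation can be rerun or changed later without going back to the source.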
  • 5. Big Data: Lambda Architecture. Big data solutions typically involve one or more of the following types of workload: • Batch processing of big data sources at rest. • Real-time processing of big data in motion. • Interactive exploration of big data. • Predictive analytics and machine learning. Consider big data architectures when you need to: • Store and process data in volumes too large for a traditional database. • Transform unstructured data for analysis and reporting. • Capture, process, and analyze unbounded streams of data in real time.
  • 6. Big Data: Batch Processing. A common big data scenario is batch processing of data at rest. In this scenario, the source data is loaded into data storage, either by the source application itself or by an orchestration workflow. The data is then processed in place by a parallelized job, which can also be initiated by the orchestration workflow. The processing may include multiple iterative steps before the transformed results are loaded into an analytical data store, which can be queried by analytics and reporting components. A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.
  • 7. Big Data: Real-Time Processing. Real-time processing deals with streams of data that are captured in real time and processed with minimal latency to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic monitoring solution might use sensor data to detect high traffic volumes. This data could be used to dynamically update a map to show congestion, or automatically initiate high-occupancy lanes or other traffic management systems. A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy.
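The batch layer (cold path) and the speed layer (hot path) each produce their own view of the data. A minimal sketch of one common way to combine them at query time follows. The slides do not describe this merge step, so treat it as an assumption; the sensor events and the names `master_dataset`, `recent_events`, and `query` are hypothetical.

```python
from collections import Counter

# Hypothetical traffic events: (sensor_id, vehicle_count). Illustrative only.
master_dataset = [("s1", 4), ("s2", 1), ("s1", 6)]  # covered by the last batch run
recent_events = [("s2", 3), ("s1", 2)]              # arrived after the last batch run


def build_view(events):
    """Aggregate vehicle counts per sensor."""
    view = Counter()
    for sensor_id, count in events:
        view[sensor_id] += count
    return view


batch_view = build_view(master_dataset)    # cold path: complete but stale
realtime_view = build_view(recent_events)  # hot path: fresh but partial


def query(sensor_id):
    """Answer a query by adding the batch view and the real-time view."""
    return batch_view[sensor_id] + realtime_view[sensor_id]


s1_total = query("s1")
s2_total = query("s2")
```

Note that `build_view` is shared here only to keep the sketch short. In a real lambda deployment the two paths usually run on different frameworks, so the same logic is implemented twice.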
  • 8. Streaming Data: • Continuously Generated [In-Motion] • Different Sources & Types of Data • Processed Incrementally • Data is Sent in Small Sizes [KB]
  • 9. Streaming Data: Advantages of Streaming Analytics - Analyze Data as Fast as Possible - Provides Deeper Insights Through Data Visualization - Offers Insight into Customer Behavior - Remain Competitive
  • 10. Streaming Data. Bounded Data: Bounded data is finite and unchanging data, where everything is known about the data set. Typically, bounded data has a known ending point and is relatively fixed. An easy example is last year's sales numbers for the Tesla Model S. Since we are looking into the past, we have a perfect timebox with a fixed number of results (number of sales). Unbounded Data: Unbounded data is unpredictable, infinite, and not always sequential. The data creation is a never-ending cycle, similar to Bill Murray in Groundhog Day. It just keeps going and going. For example, data generated on a web-scale enterprise network is unbounded. The network traffic messages and logs are constantly being generated, external traffic can scale up and generate more messages, remote systems with latency could report non-sequential logs, and so on.
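The bounded/unbounded distinction shows up directly in how the data can be consumed. As a minimal sketch, a bounded set can be modeled as a finite list and an unbounded one as a generator that never ends. The sales figures and log messages below are made up for illustration.

```python
import itertools


def bounded_sales():
    """Bounded: a fixed, fully known set (hypothetical quarterly figures)."""
    return [120, 95, 143, 110]


def unbounded_logs():
    """Unbounded: a generator that never terminates on its own."""
    for seq in itertools.count():
        yield {"seq": seq, "msg": f"log line {seq}"}


# A bounded set can be aggregated in one pass because it has an end.
total_sales = sum(bounded_sales())

# An unbounded stream can only be processed incrementally. Here we take a
# window of the first five events and keep a running count; calling
# sum() or list() on the full stream would never return.
events_seen = 0
last_seq = None
for event in itertools.islice(unbounded_logs(), 5):
    events_seen += 1
    last_seq = event["seq"]
```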
  • 11. Streaming Data: Kappa Architecture. A drawback to the lambda architecture is its complexity. Processing logic appears in two different places, the cold and hot paths, using different frameworks. This leads to duplicate computation logic and the complexity of managing the architecture for both paths. The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same basic goals as the lambda architecture, but with an important distinction: all data flows through a single path, using a stream processing system. • Apache Flink • Apache Beam • Apache Spark
  • 12. Apache Kafka is an open source distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged streaming platform. Founded by the original developers of Apache Kafka, Confluent delivers the most complete distribution of Kafka with Confluent Platform. Confluent Platform improves Kafka with additional open source and commercial features designed to enhance the streaming experience of both operators and developers in production, at massive scale.
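The commit log abstraction can be illustrated with a toy, single-process version: records are only ever appended, each one gets a sequential offset, and every consumer reads from an offset of its own choosing. This is a sketch of the idea only. The `CommitLog` class is hypothetical and is not Kafka's client API, and it ignores partitioning, replication, and persistence.

```python
class CommitLog:
    """A toy append-only log. Hypothetical; not Kafka's API."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return every record from the given offset onward."""
        return list(self._records[offset:])


log = CommitLog()
offsets = [log.append(event) for event in ("login", "click", "logout")]

# Reading does not remove anything, so consumers are independent:
# one replays the full history, another starts near the end.
replay_from_start = log.read(0)
tail_only = log.read(2)
```

Because records stay in the log after they are read, a consumer can replay history from an earlier offset. That replay ability is what lets a single stream-processing path, as in the kappa architecture, also handle reprocessing.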
  • 13. Databricks, The Unified Analytics Platform: Unifying Data Science, Engineering and Business • Accelerate performance with an optimized Spark platform • Increase productivity through interactive data science • Streamline processes from ETL to production • Reduce cost and complexity with a fully managed, cloud-native platform
  • 14. Databricks I/O: The Databricks I/O module (DBIO) takes processing speeds to the next level with an optimized AWS S3 access layer, significantly improving the performance of Apache Spark™ in the cloud. • Higher S3 Throughput • More Efficient Decoding • Data Skipping • Transactional Writes on S3
  • 15. Databricks Serverless: Databricks' serverless and highly elastic cloud service is designed to remove operational complexity while ensuring reliability and cost efficiency at scale. • Shared Pools • Auto-Scaling • Auto-Configuration • Reliable Fine-Grained Sharing
  • 16. For Data Engineering: Build Fast and Reliable Data Pipelines • Cross-Team Collaboration – Sharing Insights in a Real-Time, Interactive Workspace. • Unifying ALL Analytics – Batch, Ad-Hoc, Machine Learning, Deep Learning, Streaming and Graph Processing. • Production Workflows – Unified Platform Streamlines End-to-End Data Ingest and ETL. • Robust Integrations – Connectors for AWS Tools and Data Stores, and Integration with CI/CD. For Data Science: Unleash the Power of Machine Learning and AI • Automated Cluster Management – Launch Expertly-Tuned Spark Clusters. • Optimized for Performance – Databricks Runtime Optimizes Spark – 10-40x. • Support for Multiple Programming Languages – Interactively Query Large-Scale Data Sets in R, Python, Scala or SQL. • Highly Extensible – Use of Popular Libraries within Notebooks - scikit-learn, nltk ML, pandas.
  • 17. Microsoft Azure Databricks: Fast, Easy & Collaborative Apache Spark Analytics Platform • Scale Without Limits • Integrate Effortlessly with - Azure SQL Dw - Azure Cosmos DB - Azure Data Lake Store - Azure Blob Storage - Azure Event Hubs - Azure IoT Hub - Azure Data Factory - Power BI
  • 18. Azure SQL Data Warehouse [PaaS] – Platform-as-a-Service for Dw • Analytics Platform for Enterprises • Cloud Data Warehouse [Dw] with MPP [Massively Parallel Processing] • Elastic Compute & Storage, Fully Managed Infrastructure with Ecosystem Integration • SQL Server Foundation Stack – Support for SQL & PolyBase
  • 19. Azure SQL Data Warehouse: • PolyBase for Querying Big Data Stores – [T-SQL] • Columnar Data Storage Type • [MPP] Massively Parallel Processing
  • 20. Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that • Stores data as Parquet files in DBFS. • Maintains a transaction log that efficiently tracks changes to the table. You read and write data stored in the delta format using the same familiar Apache Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories. With the addition of the transaction log and other enhancements, Databricks Delta offers significant benefits: ACID transactions • Multiple writers can simultaneously modify a dataset and see consistent views. • Writers can modify a dataset without interfering with jobs reading the dataset. Fast read access • Automatic file management organizes data into large files that can be read efficiently. • Statistics speed up reads by 10-100x, and data skipping avoids reading irrelevant information.
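The two ideas on this slide, a transaction log that tracks changes and statistics that enable data skipping, can be sketched with a toy model. This is an illustration under stated assumptions, not Databricks Delta's actual log format or API: the action dictionaries, the file names, and the `min_id`/`max_id` statistics are all hypothetical.

```python
# Toy transaction log: an ordered list of committed actions.
# Each "add" records per-file min/max statistics for an "id" column.
transaction_log = [
    {"op": "add", "file": "part-0.parquet", "min_id": 0, "max_id": 99},
    {"op": "add", "file": "part-1.parquet", "min_id": 100, "max_id": 199},
    {"op": "add", "file": "part-2.parquet", "min_id": 200, "max_id": 299},
    # A later commit rewrites part-1: remove the old file, add the new one.
    {"op": "remove", "file": "part-1.parquet"},
    {"op": "add", "file": "part-3.parquet", "min_id": 100, "max_id": 199},
]


def snapshot(log):
    """Replay the log in order to find the files in the current table."""
    live = {}
    for action in log:
        if action["op"] == "add":
            live[action["file"]] = (action["min_id"], action["max_id"])
        else:
            live.pop(action["file"], None)
    return live


def files_to_scan(live, wanted_id):
    """Data skipping: keep only files whose id range can contain the value."""
    return sorted(
        name for name, (low, high) in live.items() if low <= wanted_id <= high
    )


live_files = snapshot(transaction_log)
to_scan = files_to_scan(live_files, 150)
```

Readers that replay the log see either the old file or the new one, never a half-written mix, which is the intuition behind consistent views. The lookup for `id = 150` touches one file out of three because the statistics rule out the others.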
  • 21. Data Warehouse [Dw] (1st Gen): ETL for Data Centralization & BI Analysis. Not Future-Proof: Missing Predictions, Real-Time, Scale • Pristine • Fast Queries • Transactional • Expensive for Scale, Not Elastic • Requires ETL, Stale Data, No Real-Time • No Predictions, No ML • Closed Formats [Lock-In]. Hadoop Data Lake (2nd Gen): ETL ALL Data, Scalable, Open Lake for ALL Use Cases. Becomes a Cheap, Messy Data Store with Poor Performance • Massive Scale • Inexpensive Storage • Open Formats [Parquet, ORC] • Promise of ML & Real-Time Streaming • Inconsistent Data • Unreliable for Analytics • Lack of Schema • Poor Performance. Databricks, The Unified Analytics Platform: Databricks Delta (3rd Gen): A Unified Data Management System for Real-Time Big Data. Powerful Transactional Storage Layer • The Good of Dw • The Good of Data Lakes • Decoupled Compute & Storage • ACID Transactions & Data Validation • Data Indexing & Caching [10x ~ 100x] • Real-Time Streaming Ingest