
Optimizing Big Data, DW, and AI Projects on Microsoft Azure


  1. ANAC – Agência Nacional de Aviação Civil. Optimizing Big Data, BI, and Artificial Intelligence Projects with Microsoft Azure
  2. Agenda: Big Data, ETL vs. ELT, Lambda Architecture, Streaming, Kappa Architecture, Azure Databricks, Azure SQL DW, Azure Databricks Delta
  3. Big Data
  4. Extract/Transform/Load (ETL) is an integration approach that pulls information from remote sources, transforms it into defined formats and styles, then loads it into databases, data sources, or data warehouses. Extract/Load/Transform (ELT) similarly extracts data from one or multiple remote sources, but then loads it into the target data lake without any other formatting. The transformation of the data, in an ELT process, happens within the target database. ELT asks less of remote sources, requiring only their raw and unprepared data.
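A minimal PySpark sketch of the two approaches, under assumed paths and column names (everything here is illustrative, not the presenter's code):

```python
# Sketch contrasting ETL and ELT. Paths, schema, and columns are
# hypothetical; adjust to your own sources and storage.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()

# ETL: transform *before* loading into the warehouse-facing store.
raw = spark.read.json("/landing/orders/")            # extract
cleaned = (raw
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("amount") > 0))                    # transform
cleaned.write.mode("append").parquet("/dw/orders/")  # load

# ELT: load the raw data as-is into the lake; transform later, in place.
raw.write.mode("append").json("/lake/raw/orders/")   # load raw
later = spark.read.json("/lake/raw/orders/")         # transform on demand
```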
  5. Big Data: Lambda Architecture. Big data solutions typically involve one or more of the following types of workload:
   • Batch processing of big data sources at rest.
   • Real-time processing of big data in motion.
   • Interactive exploration of big data.
   • Predictive analytics and machine learning.
   Consider big data architectures when you need to:
   • Store and process data in volumes too large for a traditional database.
   • Transform unstructured data for analysis and reporting.
   • Capture, process, and analyze unbounded streams of data in real time.
  6. Big Data: Batch Processing. A common big data scenario is batch processing of data at rest. In this scenario, the source data is loaded into data storage, either by the source application itself or by an orchestration workflow. The data is then processed in place by a parallelized job, which can also be initiated by the orchestration workflow. The processing may include multiple iterative steps before the transformed results are loaded into an analytical data store, which can be queried by analytics and reporting components. A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.
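A hedged sketch of such a cold-path job in PySpark: read raw events at rest, aggregate, and persist the result as a queryable batch view. The paths and columns are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-layer").getOrCreate()

# Raw data at rest, previously landed by an app or orchestration workflow.
raw_events = spark.read.parquet("/lake/raw/events/")

# Parallelized in-place processing: hourly counts per device
# (assumes an 'event_ts' timestamp column and a 'device_id' column).
batch_view = (raw_events
    .groupBy(F.window("event_ts", "1 hour"), "device_id")
    .agg(F.count("*").alias("event_count")))

# The batch view is what analytics and reporting components query.
batch_view.write.mode("overwrite").parquet("/lake/views/events_hourly/")
```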
  7. Big Data: Real-Time Processing. Real-time processing deals with streams of data that are captured in real time and processed with minimal latency to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic monitoring solution might use sensor data to detect high traffic volumes. This data could be used to dynamically update a map to show congestion, or to automatically initiate high-occupancy lanes or other traffic management systems. A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy.
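A hot-path sketch using Spark Structured Streaming: continuously updated counts over a stream, favoring latency over completeness. The source path and schema are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

# Unbounded source: new JSON files keep arriving in this directory.
sensor_stream = (spark.readStream
    .format("json")
    .schema("sensor_id STRING, reading DOUBLE, event_ts TIMESTAMP")
    .load("/lake/streaming/sensors/"))

# Running counts per sensor, refreshed continuously with low latency.
hot_counts = sensor_stream.groupBy("sensor_id").count()

query = (hot_counts.writeStream
    .outputMode("complete")
    .format("memory")        # in-memory sink for quick interactive reads
    .queryName("hot_view")
    .start())
```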
  8. Streaming Data
   • Continuously Generated [In Motion]
   • Different Sources and Types of Data
   • Processed Incrementally
   • Data Is Sent in Small Sizes [KB]
  9. Streaming Data: Advantages of Streaming Analytics
   • Analyze Data as Fast as Possible
   • Provides Deeper Insights Through Data Visualization
   • Offers Insight into Customer Behavior
   • Helps You Remain Competitive
  10. Streaming Data: Bounded Data. Bounded data is finite and unchanging data, where everything is known about the data set. Typically, bounded data has a known ending point and is relatively fixed. An easy example is last year's sales numbers for the Tesla Model S: since we are looking into the past, we have a perfect time box with a fixed number of results (number of sales). Unbounded Data. Unbounded data is unpredictable, infinite, and not always sequential. The data creation is a never-ending cycle, similar to Bill Murray in Groundhog Day: it just keeps going and going. For example, data generated on a web-scale enterprise network is unbounded. The network traffic messages and logs are constantly being generated, external traffic can scale up and generate more messages, remote systems with latency can report non-sequential logs, etc.
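The distinction maps directly onto Spark's two read paths; a small sketch with an illustrative path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bounded-vs-unbounded").getOrCreate()

# Bounded: a finite snapshot, fully known at read time.
bounded = spark.read.parquet("/lake/logs/2018/")
print(bounded.count())          # a fixed answer, like last year's sales

# Unbounded: files keep arriving; the query never "finishes".
unbounded = (spark.readStream
    .schema(bounded.schema)
    .parquet("/lake/logs/incoming/"))
print(unbounded.isStreaming)    # True
```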
  11. Streaming Data: Kappa Architecture. A drawback of the lambda architecture is its complexity. Processing logic appears in two different places (the cold and hot paths) using different frameworks. This leads to duplicated computation logic and the complexity of managing the architecture for both paths. The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same basic goals as the lambda architecture, but with an important distinction: all data flows through a single path, using a stream processing system. Examples: Apache Flink, Apache Beam, Apache Spark.
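A kappa-style single-path sketch with Spark Structured Streaming reading from Kafka (requires the spark-sql-kafka package; broker address, topic, and paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-single-path").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load())

# One processing path serves both real-time and historical needs:
# "reprocessing" is just re-running the stream from the earliest offset.
parsed = events.selectExpr("CAST(value AS STRING) AS body", "timestamp")

(parsed.writeStream
    .format("parquet")
    .option("path", "/lake/views/events/")
    .option("checkpointLocation", "/lake/checkpoints/events/")
    .start())
```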
  12. Apache Kafka is an open-source distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue to a full-fledged streaming platform. Founded by the original developers of Apache Kafka, Confluent delivers the most complete distribution of Kafka with Confluent Platform. Confluent Platform improves Kafka with additional open source and commercial features designed to enhance the streaming experience of both operators and developers in production, at massive scale.
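For context, a minimal producer sketch using Confluent's Python client (confluent-kafka); the broker address and topic are placeholders for your own cluster:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})

def on_delivery(err, msg):
    # Called once the broker acknowledges (or rejects) the message.
    if err is not None:
        print(f"delivery failed: {err}")

event = {"sensor_id": "s-42", "reading": 17.3}
producer.produce("events",
                 value=json.dumps(event).encode("utf-8"),
                 callback=on_delivery)
producer.flush()  # block until outstanding messages are delivered
```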
  13. Databricks: The Unified Analytics Platform. Unifying Data Science, Engineering, and Business.
   • Accelerate performance with an optimized Spark platform
   • Increase productivity through interactive data science
   • Streamline processes from ETL to production
   • Reduce cost and complexity with a fully managed, cloud-native platform
  14. Databricks I/O. The Databricks I/O module (DBIO) takes processing speeds to the next level with an optimized AWS S3 access layer that significantly improves the performance of Apache Spark™ in the cloud.
   • Higher S3 Throughput
   • More Efficient Decoding
   • Data Skipping
   • Transactional Writes on S3
  15. Databricks Serverless. Databricks' serverless and highly elastic cloud service is designed to remove operational complexity while ensuring reliability and cost efficiency at scale.
   • Shared Pools
   • Auto-Scaling
   • Auto-Configuration
   • Reliable Fine-Grained Sharing
  16. For Data Engineering: Build Fast and Reliable Data Pipelines.
   • Cross-Team Collaboration: share insights in a real-time, interactive workspace.
   • Unifying ALL Analytics: batch, ad hoc, machine learning, deep learning, streaming, and graph processing.
   • Production Workflows: a unified platform streamlines end-to-end data ingest and ETL.
   • Robust Integrations: connectors for AWS tools and data stores, plus integration with CI/CD.
   For Data Science: Unleash the Power of Machine Learning and AI.
   • Automated Cluster Management: launch expertly tuned Spark clusters.
   • Optimized for Performance: the Databricks Runtime optimizes Spark by 10-40x.
   • Support for Multiple Programming Languages: interactively query large-scale data sets in R, Python, Scala, or SQL.
   • Highly Extensible: use popular libraries within notebooks, such as scikit-learn, NLTK, and pandas.
  17. Microsoft Azure Databricks: a fast, easy, and collaborative Apache Spark analytics platform. Scale without limits. Integrates effortlessly with:
   • Azure SQL DW
   • Azure Cosmos DB
   • Azure Data Lake Store
   • Azure Blob Storage
   • Azure Event Hubs
   • Azure IoT Hub
   • Azure Data Factory
   • Power BI
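As one example of these integrations, a hedged sketch of reading Azure Blob Storage from an Azure Databricks notebook (where `spark` and `dbutils` are predefined); account, container, and secret names are placeholders:

```python
# Configure access to the storage account, pulling the key from a
# Databricks secret scope rather than hard-coding it.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    dbutils.secrets.get(scope="keys", key="storage-account-key"))

# Read Parquet data directly out of a blob container.
df = spark.read.parquet(
    "wasbs://data@mystorageacct.blob.core.windows.net/events/")
df.show(5)
```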
  18. Azure SQL Data Warehouse. [PaaS] Platform-as-a-Service for DW; an analytics platform for enterprises. Cloud data warehouse (DW) with MPP (massively parallel processing). Elastic compute and storage, fully managed infrastructure with an ecosystem of integrations. Built on the SQL Server foundation stack, with support for SQL and PolyBase.
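A hedged sketch of loading a DataFrame from Azure Databricks into Azure SQL DW with the built-in SQL DW connector, which stages data in Blob Storage and loads it via PolyBase; the server, database, credentials, table, and staging path are all placeholders:

```python
# Assumes an Azure Databricks notebook ('spark' predefined) and a
# storage key already set in the Spark conf, as in the earlier sketch.
df = spark.range(10).withColumnRenamed("id", "event_id")

(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;"
                   "database=mydw;user=loader;password=...")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.events")
    .option("tempDir", "wasbs://tmp@mystorageacct.blob.core.windows.net/stage")
    .save())
```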
  19. Azure SQL Data Warehouse
   • PolyBase for Querying Big Data Stores [T-SQL]
   • Columnar Data Storage
   • [MPP] Massively Parallel Processing
  20. Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that:
   • Stores data as Parquet files in DBFS.
   • Maintains a transaction log that efficiently tracks changes to the table.
   You read and write data stored in the delta format using the same familiar Apache Spark SQL batch and streaming APIs that you use to work with Hive tables and DBFS directories. With the addition of the transaction log and other enhancements, Databricks Delta offers significant benefits:
   ACID Transactions
   • Multiple writers can simultaneously modify a dataset and see consistent views.
   • Writers can modify a dataset without interfering with jobs reading the dataset.
   Fast Read Access
   • Automatic file management organizes data into large files that can be read efficiently.
   • Statistics speed up reads by 10-100x, and data skipping avoids reading irrelevant information.
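A minimal sketch of those batch and streaming APIs against one Delta table, with an illustrative table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Batch write: Parquet files plus a transaction log under the table path.
df = spark.range(1000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("append").save("/delta/events")

# Batch read and streaming read target the same table; the transaction
# log keeps readers consistent while writers append.
snapshot = spark.read.format("delta").load("/delta/events")
stream = spark.readStream.format("delta").load("/delta/events")
```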
  21. Data Warehouse [DW] (1st gen): ETL for data centralization and BI analysis. Not future-proof: missing predictions, real time, and scale.
   • Pristine
   • Fast Queries
   • Transactional
   • Expensive to Scale, Not Elastic
   • Requires ETL; Stale Data; No Real Time
   • No Predictions, No ML
   • Closed Formats [Lock-In]
   Hadoop Data Lake (2nd gen): ETL ALL data into a scalable, open lake for ALL use cases. Became a cheap, messy data store with poor performance.
   • Massive Scale
   • Inexpensive Storage
   • Open Formats [Parquet, ORC]
   • Promise of ML and Real-Time Streaming
   • Inconsistent Data
   • Unreliable for Analytics
   • Lack of Schema
   • Poor Performance
   Databricks Delta (3rd gen), on the Databricks Unified Analytics Platform: a unified data management system for real-time big data, with a powerful transactional storage layer.
   • The Good of DW
   • The Good of Data Lakes
   • Decoupled Compute and Storage
   • ACID Transactions and Data Validation
   • Data Indexing and Caching [10x~100x]
   • Real-Time Streaming Ingest
