Big Data on Azure
@davidgiard
David Giard
Microsoft Technical Evangelist
dgiard@Microsoft.com
Davidgiard.com
@davidgiard
Cloud Computing
Host some or all of your data or application
on a third-party server
in a highly-scalable, highly-reliable way
Advantages of Cloud Computing
• Lower capital costs
• Flexible operating cost (Rent vs Buy)
• Platform as a Service
• Freedom from infrastructure / hardware
• Redundancy
• Automatic monitoring and failover
Demand and Capacity
[Charts: demand versus capacity plotted by month (Jan to Dec), by day of the week, and by hour of the day]
Big Data Demand
[Chart: big data demand plotted by hour of the day]
Cost Factors
• Service
• VM Size
• Number of VMs
• Time
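These factors multiply together. A rough sketch, assuming a flat hourly rate per VM; the function name and the rate used below are made-up placeholders for illustration, not Azure prices:

```python
def estimate_cluster_cost(hourly_rate_per_vm: float, vm_count: int, hours: float) -> float:
    """Estimate cost as hourly rate x number of VMs x running time.

    hourly_rate_per_vm stands in for the "Service" and "VM Size" factors:
    in practice the rate depends on which service and VM size are chosen.
    """
    if hourly_rate_per_vm < 0 or vm_count < 0 or hours < 0:
        raise ValueError("rate, VM count, and hours must be non-negative")
    return hourly_rate_per_vm * vm_count * hours


# Hypothetical figures for illustration only (not real Azure pricing).
always_on = estimate_cluster_cost(hourly_rate_per_vm=0.50, vm_count=4, hours=24 * 30)
on_demand = estimate_cluster_cost(hourly_rate_per_vm=0.50, vm_count=4, hours=8)
print(f"Always on for 30 days: ${always_on:.2f}")
print(f"Eight-hour job:        ${on_demand:.2f}")
```

Under these assumptions, running the cluster only while a job is active costs a small fraction of leaving it on all month.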
HDInsight
Azure HDInsight
• Microsoft Azure’s big-data solution using Hadoop
• Open-source framework for storing and analyzing massive amounts of data
on clusters built from commodity hardware
• Uses Hadoop Distributed File System (HDFS) for storage
• Employs the open-source Hortonworks Data Platform implementation
of Hadoop
• Includes HBase, Hive, Pig, Storm, Spark, and more
• Integrates with popular BI tools
• Includes Power BI, Excel, SSAS, SSRS, Tableau
Apache Hadoop on Azure
• Automatic cluster provisioning and configuration
• Bypasses an otherwise manually intensive process
• Cluster scaling
• Change number of nodes without deleting/re-creating the cluster
• High availability/reliability
• Managed solution - 99.9% SLA
• HDInsight includes a secondary head node
• Reliable and economical storage
• HDFS mapped over Azure Blob Storage
• Accessed through “wasb://” protocol prefix
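A small sketch of how a "wasb://" path is put together, following the form wasb://<container>@<account>.blob.core.windows.net/<path>. The helper name and the container, account, and file names are hypothetical:

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a WASB URI for a blob in an Azure Storage container.

    secure=True switches to the "wasbs" scheme, the TLS-encrypted variant.
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical container, account, and file names.
print(wasb_uri("mycontainer", "mystorageaccount", "example/data/sample.log"))
# prints wasb://mycontainer@mystorageaccount.blob.core.windows.net/example/data/sample.log
```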
Lambda Architecture
• Batch Layer
• Speed Layer
• Serving Layer
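A toy sketch of how the three layers relate, using page-view counts held in memory. The function names and sample data are illustrative assumptions; in a real system each layer would be a separate distributed service:

```python
from collections import Counter
from typing import Iterable


def batch_view(master_dataset: Iterable[str]) -> Counter:
    """Batch layer: recompute page-view counts from the full historical dataset."""
    return Counter(master_dataset)


def speed_view(recent_events: Iterable[str]) -> Counter:
    """Speed layer: count only the events that arrived since the last batch run."""
    return Counter(recent_events)


def serve(page: str, batch: Counter, speed: Counter) -> int:
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch[page] + speed[page]


historical = ["/", "/about", "/", "/contact"]  # already processed by the batch layer
recent = ["/", "/about"]                       # not yet absorbed into a batch run
batch, speed = batch_view(historical), speed_view(recent)
print(serve("/", batch, speed))  # prints 3
```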
Clusters
[Diagram: clusters and Blob Storage]
HDInsight Cluster Types
• Hadoop: Query workloads
• Reliable data storage, simple MapReduce
• HBase: NoSQL workloads
• Distributed database offering random access to large amounts of
data
• Apache Storm: Stream workloads
• Real-time analysis of moving data streams
• Apache Spark: High-performance workloads
• In-memory parallel processing
Cluster Creation
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "clusterName": {
      "type": "string",
      "metadata": {
        "description": "The name of the HDInsight cluster to create."
      }
    },
    "clusterLoginUserName": {
      "type": "string",
      "defaultValue": "admin",
      "metadata": {
        "description": "These credentials can be used to submit jobs to the cluster and to log into cluster dashboards."
      }
    },
    "clusterLoginPassword": {
      "type": "securestring",
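The template is cut off on the slide, so only three of its parameters are visible. A sketch of generating a matching parameters file for those three, assuming the standard Azure Resource Manager deployment-parameters layout. The function name and the sample values are hypothetical, and the full template may require more parameters than are shown:

```python
import json


def build_parameters(cluster_name: str, login_user: str, login_password: str) -> dict:
    """Build a deployment-parameters document for the three visible template parameters."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "clusterName": {"value": cluster_name},
            "clusterLoginUserName": {"value": login_user},
            "clusterLoginPassword": {"value": login_password},
        },
    }


# Placeholder values; a real password should be supplied at deploy time, not stored in source.
params = build_parameters("my-hdinsight-cluster", "admin", "<set-at-deploy-time>")
print(json.dumps(params, indent=2))
```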
Demo
Storm
• Apache Storm is a distributed, fault-tolerant, open-source computation
system that lets you process data in real time with Hadoop.
• Apache Storm on HDInsight lets you create distributed, real-time
analytics solutions in the Azure environment by using Apache Hadoop.
• Storm solutions can also provide guaranteed processing of data, with the
ability to replay data that was not successfully processed the first time.
• Storm components can be written in C#, Java, and Python.
• The cluster can be scaled up or down without affecting running Storm
topologies.
• Easy to provision and use in the Azure portal.
• Visual Studio project templates for Storm apps.
Storm
• Apache Storm apps are submitted as topologies.
• A topology is a graph of computation that processes streams.
• Stream: An unbounded collection of tuples. Streams are produced by spouts
and bolts, and they are consumed by bolts.
• Tuple: A named list of dynamically typed values.
• Spout: Consumes data from a data source and emits one or more streams.
• Bolt: Consumes streams, performs processing on tuples, and may emit
streams. Bolts are also responsible for writing data to external storage,
such as a queue, HDInsight, HBase, a blob, or other data store.
• Nimbus: Plays a role similar to the JobTracker in Hadoop. It distributes
work across the cluster and monitors for failures.
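The spout-and-bolt vocabulary can be illustrated with plain Python generators. This is a single-process toy, not the Storm API: the function names are made up, and a real topology runs spouts and bolts in parallel across a cluster.

```python
from typing import Any, Dict, Iterable, Iterator, List

StormTuple = Dict[str, Any]  # a tuple here is a named set of dynamically typed values


def sensor_spout(readings: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Spout: read from a data source and emit a stream of tuples."""
    for reading in readings:
        yield dict(reading)


def parse_bolt(stream: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Bolt: consume tuples, convert the measurement to a number, and emit them."""
    for tup in stream:
        yield {**tup, "measurement": float(tup["measurement"])}


def threshold_bolt(stream: Iterable[StormTuple], limit: float) -> Iterator[StormTuple]:
    """Bolt: emit only the tuples whose measurement exceeds the limit."""
    for tup in stream:
        if tup["measurement"] > limit:
            yield tup


def run_topology(readings: Iterable[StormTuple], limit: float) -> List[StormTuple]:
    """Wire spout -> parse bolt -> threshold bolt and collect the output."""
    return list(threshold_bolt(parse_bolt(sensor_spout(readings)), limit))


events = [
    {"timestamp": 1234567890, "measurement": "123", "location": "ABC123"},
    {"timestamp": 1234567891, "measurement": "42", "location": "ABC123"},
]
alerts = run_topology(events, limit=100.0)
print(alerts)
```

Only the first event passes the threshold, so the final bolt emits a single tuple.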
Apache Storm Topology
[Diagram: an event source feeds a spout; tuples flow as streams from the spout to bolts and from bolts to downstream bolts]
Example tuple from the event source:
{
  "timestamp": 1234567890,
  "measurement": "123",
  "location": "ABC123"
}
Example tuples passed between bolts:
{ key1: "value1", key2: "value2", key3: "value3" }
{ key1: "value1", key4: "value4" }
Demo
HBase
• Apache HBase is an open-source NoSQL database that is built on Hadoop
and modeled after Google Bigtable.
• HBase provides random access and strong consistency for large amounts of
unstructured and semi-structured data in a schemaless database organized
by column families.
• Data is stored in the rows of a table, and data within a row is grouped by
column family.
• The open-source code scales linearly to handle petabytes of data on
thousands of nodes. It can rely on data redundancy, batch processing, and
other features that are provided by distributed applications in the Hadoop
ecosystem.
HBase
• HBase commands:
• create: equivalent to CREATE TABLE in T-SQL
• get: equivalent to a SELECT statement in T-SQL
• put: equivalent to INSERT and UPDATE statements in T-SQL
• scan: equivalent to SELECT with no WHERE condition in T-SQL
• delete/deleteall: equivalent to DELETE in T-SQL
• The HBase shell is the query tool for running CRUD commands against an HBase cluster.
• Data can also be managed using the HBase C# API, which provides a client library on top
of the HBase REST API.
• An HBase database can also be queried by using Hive.
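To make the command semantics concrete, here is an in-memory toy that mimics put, get, scan, delete, and deleteall on rows keyed by "family:qualifier" columns. The ToyTable class is a hypothetical illustration and is not the HBase shell or any HBase client API:

```python
from collections import defaultdict
from typing import Dict, Iterator, Tuple


class ToyTable:
    """A tiny stand-in for an HBase table with a fixed set of column families."""

    def __init__(self, *column_families: str) -> None:
        self.column_families = set(column_families)
        self.rows: Dict[str, Dict[str, str]] = defaultdict(dict)

    def put(self, row_key: str, column: str, value: str) -> None:
        """Insert or overwrite one cell; the column is written as 'family:qualifier'."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key: str) -> Dict[str, str]:
        """Return all cells of one row (empty if the row does not exist)."""
        return dict(self.rows.get(row_key, {}))

    def scan(self) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield every row in row-key order."""
        for row_key in sorted(self.rows):
            yield row_key, dict(self.rows[row_key])

    def delete(self, row_key: str, column: str) -> None:
        """Remove a single cell."""
        self.rows.get(row_key, {}).pop(column, None)

    def deleteall(self, row_key: str) -> None:
        """Remove an entire row."""
        self.rows.pop(row_key, None)


table = ToyTable("a", "b", "c")          # like: create with families a, b, c
table.put("982069", "a:1", "10")
table.put("982069", "b:1", "5")
table.put("254114", "a:1", "11")
print(table.get("982069"))               # prints {'a:1': '10', 'b:1': '5'}
table.delete("982069", "b:1")
print([key for key, _ in table.scan()])  # prints ['254114', '982069']
```

Note that put covers both insert and update: writing to an existing cell simply replaces its value.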
HBase
RowKey a:1 a:2 a:3 a:4 b:1 b:2 c:numA
982069 10 20 30 40 5 7 4
926025 9 11 21 4 9 3
254114 11 15 22 35 7 11 4
881514 8 14 2 3 2
Column family “a”
Column family “b”
Column family “c”
Hive
• Apache Hive is a data warehouse system for Hadoop, which enables data
summarization, querying, and analysis of data by using HiveQL (a query
language similar to SQL).
• Hive understands how to work with structured and semi-structured data,
such as text files where the fields are delimited by specific characters.
• Hive also supports custom serializer/deserializers for complex or
irregularly structured data.
• Hive can also be extended through user-defined functions (UDF).
• A UDF allows you to implement functionality or logic that isn't easily
modeled in HiveQL.
HiveQL
-- Number of Records
SELECT COUNT(1) FROM www_access;
-- Number of Unique IPs
SELECT COUNT(1) FROM (
  SELECT DISTINCT ip FROM www_access
) t;
-- Number of Unique IPs that Accessed the Top Page
SELECT COUNT(DISTINCT ip) FROM www_access
WHERE url = '/';
-- Number of Accesses per Unique IP
SELECT ip, COUNT(1) FROM www_access
GROUP BY ip LIMIT 30;
-- Unique IPs Sorted by Number of Accesses
SELECT ip, COUNT(1) AS cnt FROM www_access
GROUP BY ip
ORDER BY cnt DESC LIMIT 30;
-- Number of Accesses After a Certain Time
-- (TD_TIME_RANGE is a Treasure Data UDF, not a built-in Hive function)
SELECT COUNT(1) FROM www_access
WHERE TD_TIME_RANGE(time, "2011-08-19", NULL, "PDT");
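Apart from the TD_TIME_RANGE query, these are ordinary SQL aggregates, so their behavior can be tried locally without a cluster. A sketch using Python's built-in sqlite3 module with a few made-up rows; SQLite is only a stand-in here and is not Hive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE www_access (ip TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO www_access VALUES (?, ?)",
    [  # hypothetical sample rows
        ("10.0.0.1", "/"),
        ("10.0.0.1", "/about"),
        ("10.0.0.2", "/"),
        ("10.0.0.3", "/contact"),
    ],
)

# Number of records
total = conn.execute("SELECT COUNT(1) FROM www_access").fetchone()[0]

# Number of unique IPs
unique_ips = conn.execute(
    "SELECT COUNT(1) FROM (SELECT DISTINCT ip FROM www_access) t"
).fetchone()[0]

# Number of unique IPs that accessed the top page
top_page_ips = conn.execute(
    "SELECT COUNT(DISTINCT ip) FROM www_access WHERE url = '/'"
).fetchone()[0]

# Unique IPs sorted by number of accesses
by_ip = conn.execute(
    "SELECT ip, COUNT(1) AS cnt FROM www_access GROUP BY ip ORDER BY cnt DESC LIMIT 30"
).fetchall()

print(total, unique_ips, top_page_ips, by_ip[0])
```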
Apache Spark
• Interactive manipulation and visualization of data
• Scala, Python, and R interactive shells
• Jupyter Notebook with PySpark (Python) and Spark (Scala) kernels provides
in-browser interaction
• Unified platform for processing multiple workloads
• Real-time processing, machine learning, stream analytics, interactive
querying, graph processing
• Leverages in-memory processing for very large datasets
• Resilient distributed datasets (RDDs)
• APIs for processing large datasets
• Up to 100x faster than MapReduce for in-memory workloads
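The RDD APIs are built around chaining transformations such as map and reduce-by-key. The following word count imitates that style in plain Python. It is not PySpark: the reduce_by_key helper is a made-up stand-in, and real RDD operations are evaluated lazily and distributed across the cluster.

```python
from collections import defaultdict
from functools import reduce
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def reduce_by_key(
    pairs: Iterable[Tuple[Hashable, int]],
    combine: Callable[[int, int], int],
) -> Dict[Hashable, int]:
    """Group values by key, then fold each group with the combine function."""
    grouped: Dict[Hashable, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(combine, values) for key, values in grouped.items()}


lines = ["big data on azure", "spark on azure"]
words = [word for line in lines for word in line.split()]  # flat-map: lines -> words
pairs = [(word, 1) for word in words]                      # map: word -> (word, 1)
counts = reduce_by_key(pairs, lambda a, b: a + b)          # reduce-by-key: sum the ones
print(counts["azure"], counts["on"], counts["spark"])      # prints 2 2 1
```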
Spark Components on HDInsight
• Spark Core
• Includes Spark SQL, Spark Streaming,
GraphX, and MLlib
• Anaconda
• Livy
• Jupyter Notebooks
• ODBC Driver for connecting from BI tools
(Power BI, Tableau)
Jupyter Notebooks on HDInsight
• Browser-based interface for working with text, code, equations, plots,
graphics, and interactive controls in a single document.
• Includes preset Spark and Hive contexts (sc and sqlContext)
Demo
Items of Note About HDInsight
• There is no “suspend” on HDInsight clusters
• Provision the cluster, do work, then delete the cluster to avoid unnecessary
charges
• Storage can be decoupled from the cluster and reused across deployments
• Can deploy from the portal, but often scripted in practice
• Easier/repeatable creation and deletion
Get Started
azure.com
Thank you!
Links
www.slideshare.net/dgiard/big-data-on-azure-70554456
github.com/MSFTImagine/computerscience/tree/master/Workshop/7.%20HDInsight

Own your own career   advice from a veteran consultantOwn your own career   advice from a veteran consultant
Own your own career advice from a veteran consultant
 
You and Your Tech Community
You and Your Tech CommunityYou and Your Tech Community
You and Your Tech Community
 
Microsoft Azure IoT overview
Microsoft Azure IoT overviewMicrosoft Azure IoT overview
Microsoft Azure IoT overview
 
Cloud and azure and rock and roll
Cloud and azure and rock and rollCloud and azure and rock and roll
Cloud and azure and rock and roll
 
Azure mobile apps
Azure mobile appsAzure mobile apps
Azure mobile apps
 
Building a TV show with Angular, Bootstrap, and Web Services
Building a TV show with Angular, Bootstrap, and Web ServicesBuilding a TV show with Angular, Bootstrap, and Web Services
Building a TV show with Angular, Bootstrap, and Web Services
 
Angular2 and TypeScript
Angular2 and TypeScriptAngular2 and TypeScript
Angular2 and TypeScript
 
Containers
ContainersContainers
Containers
 
Cloud and azure and rock and roll
Cloud and azure and rock and rollCloud and azure and rock and roll
Cloud and azure and rock and roll
 
How I Learned to Stop Worrying and Love jQuery (Jan 2013)
How I Learned to Stop Worrying and Love jQuery (Jan 2013)How I Learned to Stop Worrying and Love jQuery (Jan 2013)
How I Learned to Stop Worrying and Love jQuery (Jan 2013)
 

Big Data on Azure

  • 1. Big Data on Azure
  • 2. David Giard, Microsoft Technical Evangelist, dgiard@Microsoft.com, Davidgiard.com, @davidgiard
  • 3. Cloud Computing: Host some or all of your data or application on a third-party server in a highly scalable, highly reliable way
  • 4. Advantages of Cloud Computing
      • Lower capital costs
      • Flexible operating cost (rent vs. buy)
      • Platform as a Service
      • Freedom from infrastructure / hardware
      • Redundancy
      • Automatic monitoring and failover
  • 5-8. Demand and Capacity (charts; x-axis: months, Jan to Dec; y-axis: 0 to 9)
  • 9. Demand and Capacity (chart; x-axis: days of the week, Mon to Fri of the following week; y-axis: 0 to 9)
  • 10. Demand and Capacity (chart; x-axis: hours, 1:00 to 12:00; y-axis: 0 to 9)
  • 11. Big Data Demand (chart; x-axis: hours, 1:00 to 12:00; y-axis: 0 to 9)
  • 12. Cost Factors
      • Service
      • VM size
      • Number of VMs
      • Time
  • 14. Azure HDInsight
      • Microsoft Azure’s big-data solution using Hadoop
          • Open-source framework for storing and analyzing massive amounts of data on clusters built from commodity hardware
          • Uses Hadoop Distributed File System (HDFS) for storage
      • Employs the open-source Hortonworks Data Platform implementation of Hadoop
          • Includes HBase, Hive, Pig, Storm, Spark, and more
      • Integrates with popular BI tools
          • Includes Power BI, Excel, SSAS, SSRS, Tableau
  • 15. Apache Hadoop on Azure
      • Automatic cluster provisioning and configuration
          • Bypass an otherwise manual-intensive process
      • Cluster scaling
          • Change number of nodes without deleting/re-creating the cluster
      • High availability/reliability
          • Managed solution with a 99.9% SLA
          • HDInsight includes a secondary head node
      • Reliable and economical storage
          • HDFS mapped over Azure Blob Storage
          • Accessed through the “wasb://” protocol prefix
  • 16. Lambda Architecture
      • Batch Layer
      • Speed Layer
      • Serving Layer
  • 19. HDInsight Cluster Types
      • Hadoop: Query workloads
          • Reliable data storage, simple MapReduce
      • HBase: NoSQL workloads
          • Distributed database offering random access to large amounts of data
      • Apache Storm: Stream workloads
          • Real-time analysis of moving data streams
      • Apache Spark: High-performance workloads
          • In-memory parallel processing
  • 21. Cluster Creation
      {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
          "clusterName": {
            "type": "string",
            "metadata": {
              "description": "The name of the HDInsight cluster to create."
            }
          },
          "clusterLoginUserName": {
            "type": "string",
            "defaultValue": "admin",
            "metadata": {
              "description": "These credentials can be used to submit jobs to the cluster and to log into cluster dashboards."
            }
          },
          "clusterLoginPassword": {
            "type": "securestring",
  • 24. Storm
      • Apache Storm is a distributed, fault-tolerant, open-source computation system that allows you to process data in real time with Hadoop.
      • Apache Storm on HDInsight allows you to create distributed, real-time analytics solutions in the Azure environment by using Apache Hadoop.
      • Storm solutions can also provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time.
      • Storm components can be written in C#, Java, and Python.
      • The cluster can be scaled up or down in Azure without affecting running Storm topologies.
      • Easy to provision and use in the Azure portal.
      • Visual Studio project templates for Storm apps.
  • 25. Storm
      • Apache Storm apps are submitted as topologies.
      • A topology is a graph of computation that processes streams.
      • Stream: An unbounded collection of tuples. Streams are produced by spouts and bolts, and they are consumed by bolts.
      • Tuple: A named list of dynamically typed values.
      • Spout: Consumes data from a data source and emits one or more streams.
      • Bolt: Consumes streams, performs processing on tuples, and may emit streams. Bolts are also responsible for writing data to external storage, such as a queue, HDInsight, HBase, a blob, or other data store.
      • Nimbus: Comparable to the JobTracker in Hadoop; it distributes jobs and monitors failures.
  • 26. Apache Storm Topology (diagram)
      • Diagram labels: Event Source, Spout, Bolt (four bolts), Stream, Tuple
      • Tuple from the event source: { "timestamp": 1234567890, "measurement": "123", "location": "ABC123" }
      • Tuples passed between components: { key1: "value1", key2: "value2", key3: "value3" }
      • Tuple emitted by a bolt: { key1: "value1", key4: "value4" }
  • 29. HBase
      • Apache HBase is an open-source NoSQL database that is built on Hadoop and modeled after Google Bigtable.
      • HBase provides random access and strong consistency for large amounts of unstructured and semi-structured data in a schemaless database organized by column families.
      • Data is stored in the rows of a table, and data within a row is grouped by column family.
      • The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
  • 30. HBase
      • HBase commands:
          • create: Equivalent to CREATE TABLE in T-SQL
          • get: Equivalent to SELECT statements in T-SQL
          • put: Equivalent to UPDATE and INSERT statements in T-SQL
          • scan: Equivalent to SELECT (with no WHERE condition) in T-SQL
          • delete/deleteall: Equivalent to DELETE in T-SQL
      • The HBase shell is the query tool for executing CRUD commands against an HBase cluster.
      • Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.
      • An HBase database can also be queried by using Hive.
  • 31. HBase (example table)
      • Columns: RowKey; a:1, a:2, a:3, a:4 (column family “a”); b:1, b:2 (column family “b”); c:numA (column family “c”)
      • Row 982069: 10 20 30 40 5 7 4
      • Row 926025: 9 11 21 4 9 3
      • Row 254114: 11 15 22 35 7 11 4
      • Row 881514: 8 14 2 3 2
  • 33. Hive
      • Apache Hive is a data warehouse system for Hadoop, which enables data summarization, querying, and analysis of data by using HiveQL (a query language similar to SQL).
      • Hive understands how to work with structured and semi-structured data, such as text files where the fields are delimited by specific characters.
      • Hive also supports custom serializers/deserializers for complex or irregularly structured data.
      • Hive can also be extended through user-defined functions (UDFs).
      • A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL.
  • 34. HiveQL
      -- Number of records
      SELECT COUNT(1) FROM www_access;
      -- Number of unique IPs
      SELECT COUNT(1) FROM ( SELECT DISTINCT ip FROM www_access ) t;
      -- Number of unique IPs that accessed the top page
      SELECT COUNT(DISTINCT ip) FROM www_access WHERE url='/';
      -- Number of accesses per unique IP
      SELECT ip, COUNT(1) FROM www_access GROUP BY ip LIMIT 30;
      -- Unique IPs sorted by number of accesses
      SELECT ip, COUNT(1) AS cnt FROM www_access GROUP BY ip ORDER BY cnt DESC LIMIT 30;
      -- Number of accesses after a certain time
      SELECT COUNT(1) FROM www_access WHERE TD_TIME_RANGE(time, "2011-08-19", NULL, "PDT");
  • 36. Apache Spark
      • Interactive manipulation and visualization of data
          • Scala, Python, and R interactive shells
          • Jupyter Notebook with PySpark (Python) and Spark (Scala) kernels provides in-browser interaction
      • Unified platform for processing multiple workloads
          • Real-time processing, machine learning, stream analytics, interactive querying, graphing
      • Leverages in-memory processing for really big data
          • Resilient distributed datasets (RDDs)
          • APIs for processing large datasets
          • Up to 100x faster than MapReduce
  • 37. Spark Components on HDInsight
      • Spark Core
          • Includes Spark SQL, Spark Streaming, GraphX, and MLlib
      • Anaconda
      • Livy
      • Jupyter Notebooks
      • ODBC driver for connecting from BI tools (Power BI, Tableau)
  • 38. Jupyter Notebooks on HDInsight
      • Browser-based interface for working with text, code, equations, plots, graphics, and interactive controls in a single document.
      • Includes preset Spark and Hive contexts (sc and sqlContext)
  • 40. Items of Note About HDInsight
      • There is no “suspend” on HDInsight clusters
          • Provision the cluster, do work, then delete the cluster to avoid unnecessary charges
          • Storage can be decoupled from the cluster and reused across deployments
      • Can deploy from the portal, but often scripted in practice
          • Easier/repeatable creation and deletion
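
Slides 25 and 26 describe a Storm topology as a spout feeding bolts through streams of tuples. As a rough illustration of that dataflow only, the sketch below chains plain Python generators. It does not use the Storm API; the names `sensor_spout`, `parse_bolt`, `sum_bolt` and `run_topology` are made up for this example, and all sample events except the first are invented.

```python
from collections import Counter


def sensor_spout(events):
    """Spout stand-in: reads raw events from a source and emits them as tuples."""
    for timestamp, measurement, location in events:
        yield {"timestamp": timestamp, "measurement": measurement, "location": location}


def parse_bolt(stream):
    """Bolt stand-in: consumes tuples, converts the measurement, emits a new stream."""
    for tup in stream:
        yield {"location": tup["location"], "value": int(tup["measurement"])}


def sum_bolt(stream):
    """Terminal bolt stand-in: aggregates per location.

    A real bolt would write to external storage; here the totals are returned.
    """
    totals = Counter()
    for tup in stream:
        totals[tup["location"]] += tup["value"]
    return dict(totals)


def run_topology(events):
    """Wire spout -> bolt -> bolt, mirroring the graph on slide 26."""
    return sum_bolt(parse_bolt(sensor_spout(events)))


if __name__ == "__main__":
    sample_events = [
        (1234567890, "123", "ABC123"),  # tuple values taken from slide 26
        (1234567891, "7", "ABC123"),    # hypothetical
        (1234567892, "5", "XYZ789"),    # hypothetical
    ]
    print(run_topology(sample_events))
```

Because each stage is a generator, tuples flow through one at a time, which loosely mirrors how a bolt processes a stream without waiting for it to end. Distribution, acknowledgement, and replay are outside the scope of this sketch.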
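
Slides 29 to 31 describe HBase rows whose cells are grouped by column family and addressed as `family:qualifier`. The toy in-memory model below is a sketch of that layout and of the `put`, `get`, `scan` and `delete` commands from slide 30. `ToyTable` is a hypothetical class written for this example, not the HBase client API, and it ignores versioning, regions, and persistence.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (illustrative only)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> {"family:qualifier": value}; rows may be sparse.
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Insert or overwrite one cell, like `put` in the HBase shell."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key, family=None):
        """Return the cells of one row, optionally limited to one family."""
        cells = self.rows.get(row_key, {})
        if family is None:
            return dict(cells)
        prefix = family + ":"
        return {col: val for col, val in cells.items() if col.startswith(prefix)}

    def scan(self):
        """Yield every row in row-key order, like an unfiltered `scan`."""
        for row_key in sorted(self.rows):
            yield row_key, dict(self.rows[row_key])

    def delete(self, row_key, column):
        """Remove a single cell if it exists."""
        self.rows.get(row_key, {}).pop(column, None)


if __name__ == "__main__":
    table = ToyTable(["a", "b", "c"])
    # Values of row 982069 from slide 31.
    for column, value in [("a:1", 10), ("a:2", 20), ("a:3", 30), ("a:4", 40),
                          ("b:1", 5), ("b:2", 7), ("c:numA", 4)]:
        table.put("982069", column, value)
    print(table.get("982069", family="b"))
```

Storing each row as its own dictionary means a row only pays for the cells it has, which is the property that makes sparse rows like those on slide 31 cheap.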

Editor's notes

  1. Lambda Architecture layers
      • Batch Layer: pre-computes results; some latency
      • Speed Layer: real-time view of most recent data, e.g. Apache Storm
      • Serving Layer: ad-hoc queries; pre-computed views, e.g. Apache HBase
  2. https://en.wikipedia.org/wiki/Lambda_architecture
  3. No schema; no referential integrity; forward-only, read-only database; perfect for timed data; very fast; highly scalable; often look only at the latest entry in each family
  4. https://github.com/MSFTImagine/computerscience/tree/master/Workshop/7.%20HDInsight
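
Note 1 above splits the Lambda Architecture into a batch layer with some latency, a speed layer over the most recent data, and a serving layer that answers queries. The sketch below shows one common way these pieces fit together, assuming a simple per-location event count. The function and class names (`batch_view`, `SpeedLayer`, `serve`) and the sample events are hypothetical and not taken from the slides.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full dataset (slow but complete)."""
    return Counter(event["location"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally counts events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["location"]] += 1


def serve(batch, realtime, key):
    """Serving layer: answer a query by merging the pre-computed and real-time views."""
    return batch[key] + realtime[key]


if __name__ == "__main__":
    # Events already covered by the last batch run (hypothetical data).
    master = [{"location": "ABC123"}, {"location": "ABC123"}, {"location": "XYZ789"}]
    batch = batch_view(master)

    # Events that arrived since then are seen only by the speed layer.
    speed = SpeedLayer()
    speed.on_event({"location": "ABC123"})

    print(serve(batch, speed.realtime_view, "ABC123"))
```

In a fuller setup the real-time view would be reset each time a new batch view is published, so that events are not counted twice; that step is left out here to keep the sketch short.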