| © Copyright 2015 Hitachi Consulting1
Introducing Big Data
with Microsoft Azure
Khalid M. Salama
Microsoft Business Intelligence
Hitachi Consulting UK
We Make it Happen. Better.
Outline
 What is Big Data?
 Why Big Data Platforms?
 Fundamentals of a Big Data Platform
 Distributed Processing & CAP Theorem
 Big Data Solutions vs. Traditional RDBMS
 Where Big Data Fits in Enterprise Data Platforms?
 Hadoop Ecosystem: Apache Tools for Big Data
 Big Data on Microsoft Azure
 How to Get Started with Big Data?
Basic Concepts
What is Big Data?
“Data that is too complex for processing using traditional relational
databases efficiently and cost-effectively.”
Tell me more…
Complex (3 V’s)
• Volume – Huge amounts of data to process
• Variety – A mixture of structured and unstructured data
• Velocity – High-frequency or (near) real-time data processing
Processing
• Stream (operational)
• Batch (analytical)
Efficiently
• Availability/Scalability
• Performance/Throughput
Cost-Effectively
• Acquiring
• Scaling up/down
What is Big Data?
Common examples and applications
Clickstream
• User Experience Improvement
• Recommendation & Targeted Advertising
Sensors/Devices
• Predictive Maintenance
• Energy Efficiency – Smart City
Social Media
• Sentiment Analysis
• Crisis Management
Spatial & GPS
• Push Notifications
• Process Optimisation
Images/Audio/Video
• Proactive Security
Free Text
• Analysis of customer reviews/feedback/complaints
• Automatic news summarisation/analysis
Why Big Data Platforms?
Traditional Data Platforms
Breaking points of traditional Data Platforms – Volume
Breaking points of traditional Data Platforms – Variety
Breaking points of traditional Data Platforms – Velocity
Enterprise-wide data scale
[Figure: enterprise data volumes on a scale from gigabytes to terabytes, split into transactional and non-transactional data]
Addressing Big Data Challenges
Addressing the three “V”s…
Challenges
• Volume
• Variety
• Velocity
Solutions
• Distributed Computing
• Batch Processing
• Stream Processing
• In-Memory Processing
• NoSQL
• Consistency/Availability/Fault Tolerance Trade-off (CAP)
Addressing Big Data Challenges
Tell me more…
Distributed Computing
• Cluster of many data/compute nodes (commodity hardware)
• Data partitioning (sharding)
• Data partitions are processed in parallel
• Easy/cheap to scale out
Batch Processing
• Processes massive amounts of data
• Write once / read many
• High latency
In-Memory Processing
• Iterative processing of the same data in memory
• Data size that fits into memory
• Low latency
Stream Processing
• Processes a continuous stream of data
• Small data chunks
• Low latency
NoSQL
• Key-value stores
• Column family stores
• Document stores
• Graph stores
• Random read/write access
• Supports batch & stream
Properties the slide repeats across these technologies
• Distributed
• Available / fault tolerant
• Eventually consistent
• High throughput
Fundamental Components
Fundamentals of a Big Data Platform
Basic Architectural Components
Distributed File System
• Data files are stored in raw form (no schema)
• Partitioned across data nodes (disks)
• Each partition is replicated to M nodes
• Fault tolerance
Compute Cluster
• Head node, plus an extra failover head node (availability)
• Compute nodes 1 … N
• Resource Manager: manages and executes jobs using a distributed execution model
Applications
• Batch, In-Memory, Stream, SQL, NoSQL
• Acquisition
• Support batch/speed workloads
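The partitioning and replication described for the distributed file system can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the CRC32 hash, and the round-robin replica placement are hypothetical choices, not how HDFS or any particular platform places data.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition using a stable hash (CRC32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replica_nodes(partition, num_nodes, replication_factor):
    """Place a partition on `replication_factor` consecutive nodes (hypothetical policy)."""
    return [(partition + i) % num_nodes for i in range(replication_factor)]


def surviving_partitions(num_partitions, num_nodes, replication_factor, failed_nodes):
    """Return the partitions that still have at least one replica on a live node."""
    alive = []
    for partition in range(num_partitions):
        nodes = replica_nodes(partition, num_nodes, replication_factor)
        if any(node not in failed_nodes for node in nodes):
            alive.append(partition)
    return alive


if __name__ == "__main__":
    NUM_NODES, NUM_PARTITIONS, M = 4, 8, 3  # M replicas per partition
    for key in ["order-1", "order-2", "order-3"]:
        p = partition_for(key, NUM_PARTITIONS)
        print(key, "-> partition", p, "on nodes", replica_nodes(p, NUM_NODES, M))
    # With M = 3 replicas, losing any two nodes still leaves every partition readable.
    print(surviving_partitions(NUM_PARTITIONS, NUM_NODES, M, failed_nodes={0, 1}))
```

With three replicas per partition, data only becomes unavailable once all three nodes holding the same partition fail, which is the fault-tolerance property the slide refers to.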
Fundamentals of a Big Data Platform
Lambda Architecture
• Data is dispatched to both the batch layer and the speed layer for processing.
• The batch layer manages the master dataset (write once, read many) and pre-computes the batch views. It handles large data volumes with high latency.
• The speed layer compensates for the high latency of the batch layer by producing real-time views. It deals with a recent, limited window of data only.
• The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way, and answers any incoming query by merging results from the batch views and the real-time views.
Hot Path (speed layer)
Cold Path (batch layer)
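The interplay of the three layers can be sketched as follows. This is a minimal, single-process illustration: the page-view events, `batch_view`, `SpeedLayer`, and `serve` are hypothetical names invented for the example, and a real platform would run each layer on separate distributed services.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts over the whole master dataset."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally count events not yet covered by a batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def process(self, event):
        self.realtime_view[event["page"]] += 1

    def reset(self):
        """Discard the real-time view once a new batch view covers the same events."""
        self.realtime_view.clear()


def serve(page, batch, realtime):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch.get(page, 0) + realtime.get(page, 0)


if __name__ == "__main__":
    master = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
    view = batch_view(master)          # cold path: high latency, complete
    speed = SpeedLayer()

    for event in [{"page": "/home"}, {"page": "/checkout"}]:
        master.append(event)           # every event is appended to the master dataset
        speed.process(event)           # hot path: low latency, recent data only

    print(serve("/home", view, speed.realtime_view))   # 3 (2 from batch + 1 real-time)

    view = batch_view(master)          # next batch run absorbs the recent events
    speed.reset()
    print(serve("/home", view, speed.realtime_view))   # 3 (all from batch)
```

The query result is the same before and after the batch run; only the layer that supplies the most recent counts changes.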
CAP Theorem
Distributed Processing & CAP Theorem
The trade-off…
• To process large volumes of data efficiently, we need to scale out, i.e., partition the data and distribute the computation.
• We then face a trade-off between Consistency, Availability, and Partition Tolerance.
• Consistency: Data is in a consistent state across all the nodes. That is, every read returns the same, most recent write.
• Availability: Every request to the system gets a response, whether it succeeds or fails. That is, system responsiveness (latency).
• Partition Tolerance: The system continues to work despite message loss or partition (node) failure. That is, the system can sustain partial network failures.
• CAP Theorem: only two of the three properties can be satisfied at the same time in a distributed data system. In fact, the choice is consistency vs. availability, given that partition tolerance is required!
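The consistency vs. availability choice can be made concrete with a toy two-replica store. This is a sketch under stated assumptions: `TwoNodeStore`, its "CP"/"AP" modes, and the last-write-wins reconciliation are hypothetical simplifications, not the behaviour of any specific product.

```python
class Unavailable(Exception):
    """Raised when the consistency-first store refuses a request during a partition."""


class TwoNodeStore:
    """Two replicas of a key-value store that may be separated by a network partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]      # each maps key -> (logical timestamp, value)
        self.partitioned = False
        self.clock = 0

    def write(self, node, key, value):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse the write because it cannot reach both replicas.
            raise Unavailable("cannot replicate during a partition")
        self.clock += 1
        targets = [self.replicas[node]] if self.partitioned else self.replicas
        for replica in targets:
            replica[key] = (self.clock, value)

    def read(self, node, key):
        entry = self.replicas[node].get(key)
        return None if entry is None else entry[1]

    def heal(self):
        """End the partition and reconcile replicas with last-write-wins."""
        self.partitioned = False
        merged = {}
        for replica in self.replicas:
            for key, entry in replica.items():
                if key not in merged or entry[0] > merged[key][0]:
                    merged[key] = entry
        self.replicas = [dict(merged), dict(merged)]


if __name__ == "__main__":
    ap = TwoNodeStore("AP")
    ap.write(0, "x", 1)
    ap.partitioned = True
    ap.write(0, "x", 2)               # accepted: the system stays available
    print(ap.read(1, "x"))            # 1 (stale read on the other replica)
    ap.heal()
    print(ap.read(1, "x"))            # 2 (replicas converge: eventual consistency)

    cp = TwoNodeStore("CP")
    cp.write(0, "x", 1)
    cp.partitioned = True
    try:
        cp.write(0, "x", 2)
    except Unavailable as error:
        print("rejected:", error)     # consistent, but not available
```

In "AP" mode the second replica briefly serves a stale value; in "CP" mode no stale value is ever visible, at the price of rejected writes while the partition lasts.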
Distributed Processing & CAP Theorem
The trade-off…
[Diagram: CAP triangle with vertices C (Consistency), A (Availability), and P (Partition Tolerance)]
Partition Tolerance
• Continues working if a partition is not reachable by the system
Transactional RDBMS
• ACID mode – strong consistency
• Commits are atomic across the entire system
• Not partition tolerant, i.e., sacrifices availability
Big Data Systems
• BASE mode – eventual consistency
• Remains available (operational & responsive)
• Partition tolerant, i.e., sacrifices consistency
ACID
• Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
• Consistent: A transaction cannot leave the database in an inconsistent state.
• Isolated: Transactions cannot interfere with each other.
• Durable: Completed transactions persist, even when servers restart, etc.
BASE
• Basic availability
• Soft state
• Eventual consistency
NoSQL: Strong vs. Eventual Consistency
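The "atomic" and "consistent" parts of ACID can be demonstrated with Python's built-in `sqlite3` module. This is a small sketch: the `accounts` table, the names, and the amounts are hypothetical, and an in-memory SQLite database stands in for a transactional RDBMS.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a constraint that forbids negative balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
        )
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


if __name__ == "__main__":
    conn = make_db()
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    # The credit to bob succeeds first, then the debit violates the CHECK constraint;
    # the rollback undoes the credit as well, so the balances are unchanged.
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails halfway through, yet no partial update is visible afterwards, which is the guarantee that BASE systems relax in exchange for availability.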
Big Data Solutions vs. Traditional RDBMS
The face-off…
Feature | RDBMS | Big Data (Batch) | Big Data (Stream & NoSQL)
Data integrity | Strong consistency – ACID transactions | Eventual consistency – BASE model | Depends on the technology (strong vs. eventual consistency)
Schema | Static – required on write | Dynamic – schema on read | Flexible – extensible
Data types and formats | Structured | Structured, semi-structured, and unstructured | Semi-structured
Read and write pattern | Fully repeatable read/write | Write once, repeatable read | Fully repeatable read/write
Storage volume | Gigabytes to terabytes | Terabytes, petabytes, and beyond | Terabytes, petabytes, and beyond (small data chunks for stream processing)
Scalability | Scale up with more powerful hardware | Scale out with additional servers | Scale out with additional servers
Data processing distribution | Limited or none | Distributed across the cluster | Distributed across the cluster
Economics | Expensive hardware and software | Commodity hardware and open source software | Commodity hardware and open source software
Source: Microsoft Patterns & Practices
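The "schema on write" vs. "schema on read" row of the comparison can be illustrated with a short sketch. The JSON click records, the field names, and both helper functions are hypothetical; the point is only where the schema is enforced.

```python
import json

# Hypothetical raw click records, stored as-is in a file system
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart"}',
    '{"user": "u3", "page": "/home", "ms": "95", "referrer": "ad"}',
]


def validate_on_write(record):
    """Schema on write: reject any record that does not match the fixed schema."""
    expected = {"user": str, "page": str, "ms": int}
    if set(record) != set(expected):
        raise ValueError("unexpected or missing columns")
    for name, expected_type in expected.items():
        if not isinstance(record[name], expected_type):
            raise ValueError("wrong type for " + name)
    return record


def read_with_schema(line):
    """Schema on read: keep the raw text, then project and cast fields at query time."""
    raw = json.loads(line)
    ms = raw.get("ms")
    return {"page": raw["page"], "ms": int(ms) if ms is not None else None}


if __name__ == "__main__":
    accepted = []
    for line in RAW_LINES:
        try:
            accepted.append(validate_on_write(json.loads(line)))
        except ValueError as error:
            print("rejected on write:", error)
    print(len(accepted), "of", len(RAW_LINES), "records fit the static schema")
    print([read_with_schema(line) for line in RAW_LINES])
```

Only the first record passes the static schema, whereas the schema-on-read path keeps all three and decides per query how to interpret missing or loosely typed fields.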
Enterprise Big Data Platform
Big Data Fit in Enterprise Data Platform
Enterprise Data Platform
Use Case 1: Data Exploration/Experiments Platform
Use Case 2: Data Processing (ETL)
Use Case 3: Data Warehouse
Use Case 4: Full Data/BI Integration
1 – ETL Level Integration
2 – DW Level Integration
3 – BI Level Integration
• Corporate Data Model
• Reports/Dashboard (Mashup)
[Diagram labels: MPP, Operational Apps]
Source: Microsoft Patterns & Practices
Big Data with Hadoop
Introducing Hadoop
Apache Hadoop Ecosystem - “A” Big Data Platform
Applications: Batch, In-Memory, Stream, SQL (Spark-SQL), NoSQL, Machine Learning, Search, Orchestration, Management, Acquisition, …
Yet Another Resource Negotiator (YARN)
Hadoop Distributed File System (HDFS): NameNode, DataNode 1, DataNode 2, DataNode 3, … DataNode N
Introducing Hadoop
Apache Hadoop Ecosystem - “A” Big Data Platform
• A programming model for distributed processing of large data on a cluster
• A scripting platform for processing and analysing large data sets
• The de facto standard for SQL queries in Hadoop
• Efficiently transfers bulk data between Apache Hadoop and relational data stores
• An algorithm library for scalable machine learning on Hadoop
• Provides workflow scheduling services to manage Hadoop jobs
• A system for processing streaming data in real time
• A fast, scalable, fault-tolerant messaging system
• In-memory compute for ETL, machine learning, SQL, and streaming
• A distributed key-value store with cell-based access control
• CouchDB: JSON document-oriented data store
• Provides random read/write access to a distributed, fault-tolerant NoSQL data store
Introducing Hadoop
History
[Figure: timeline of Hadoop milestones from Oct-03 to Jun-15]
Introducing Hadoop
MapReduce - Distributed Programming Model
Map
• Read lines from file
• Convert each line to key-value pair(s)
• Filter (by key/value)
• Combine values with similar keys
• Shuffle data across nodes for the reducers, by key
Reduce
• Sort by key
• Aggregate (reduce)
• Filter (based on aggregated value)
• Write results to file
Data flow: Input → Mappers → Hash shuffling → Reducers → Output
Mapper output: (Key1, Value1), (Key2, Value2), (Key1, Value3)
Reducer input: Key1: {Value1, Value3}; Key2: {Value2}
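The map, hash-shuffle, and reduce stages above can be mimicked in a single Python process. This is a teaching sketch, not Hadoop's API: `map_reduce`, `shuffle`, and the word-count mapper and reducer are hypothetical names, and CRC32 is an assumed stand-in for the platform's partitioning hash.

```python
import zlib
from collections import defaultdict


def run_map(lines, mapper):
    """Apply the mapper to every line of one input split and collect (key, value) pairs."""
    pairs = []
    for line in lines:
        pairs.extend(mapper(line))
    return pairs


def shuffle(pairs, num_reducers):
    """Hash shuffling: route all pairs that share a key to the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index][key].append(value)
    return buckets


def run_reduce(buckets, reducer):
    """Each reducer sorts its keys and aggregates the grouped values."""
    output = {}
    for bucket in buckets:
        for key in sorted(bucket):
            output[key] = reducer(key, bucket[key])
    return output


def map_reduce(splits, mapper, reducer, num_reducers=2):
    pairs = []
    for split in splits:              # conceptually, one mapper per input split
        pairs.extend(run_map(split, mapper))
    return run_reduce(shuffle(pairs, num_reducers), reducer)


if __name__ == "__main__":
    splits = [["a b a"], ["b c"]]

    def word_mapper(line):
        return [(word, 1) for word in line.split()]

    def count_reducer(key, values):
        return sum(values)

    print(map_reduce(splits, word_mapper, count_reducer))
```

Because the shuffle sends every occurrence of a key to one reducer, each reducer can aggregate its keys independently of the others, which is what makes the model easy to scale out.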
Introducing Hadoop
MapReduce - Example
SELECT Month, City, SUM(SalesValue) FROM Sales WHERE Product = 'Bike' GROUP BY Month, City HAVING SUM(SalesValue) > 50000
Map
• Read lines from file
• Convert each line to a (Month-City, Value) pair
• Discard lines where Product is not 'Bike'
• Combine values with similar keys
• Shuffle data across nodes for the reducers, by key
Reduce
• Sort by key
• Sum all the values for a given key
• Discard records where the sum <= 50,000
• Write results to file
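The steps of this example translate into a mapper and a reducer roughly as follows. The sample sales rows, the comma-separated layout, and the function names are assumptions made for illustration; grouping happens in one process here, whereas Hadoop would shuffle the pairs across nodes.

```python
from collections import defaultdict

# Hypothetical input lines in the form Month,City,Product,SalesValue
SALES = [
    "Jan,London,Bike,30000",
    "Jan,London,Bike,25000",
    "Jan,London,Helmet,90000",
    "Jan,Leeds,Bike,40000",
    "Feb,London,Bike,60000",
]


def mapper(line):
    """Emit ((Month, City), SalesValue) for bikes; discard every other product."""
    month, city, product, value = line.split(",")
    if product != "Bike":
        return []
    return [((month, city), int(value))]


def reducer(key, values):
    """Sum the values of one key and keep the key only if the total exceeds 50,000."""
    total = sum(values)
    return [(key, total)] if total > 50000 else []


def run(lines):
    groups = defaultdict(list)
    for line in lines:                       # map phase
        for key, value in mapper(line):
            groups[key].append(value)        # stands in for combine + shuffle
    results = []
    for key in sorted(groups):               # reduce phase, keys sorted
        results.extend(reducer(key, groups[key]))
    return results


if __name__ == "__main__":
    for (month, city), total in run(SALES):
        print(month, city, total)
```

The `WHERE` clause becomes the filter in the mapper, `GROUP BY` becomes the key, and `HAVING` becomes the filter applied after aggregation in the reducer.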
Big Data with Microsoft Azure
Big Data on Microsoft Azure
Virtual Machines (IaaS)
Azure Services (Data Acquisition, Stream Processing, Machine Learning, NoSQL)
Azure HDInsight (IaaS+)
Azure Data Lake (PaaS)
Big Data on Microsoft Azure
Hortonworks/Cloudera/MapR Virtual Machines
Infrastructure as a Service (IaaS).
Different distributions of Hadoop, still 100% Hadoop (plus distribution-specific extra tools).
You are responsible for provisioning, configuring, managing, and updating the cluster with new tools.
The Distributed File System is part of the compute cluster; that is, killing the cluster means losing the data.
Big Data on Microsoft Azure
Azure HDInsight
Infrastructure as a Service+ (IaaS+).
Hortonworks distribution of Hadoop.
You pay for the cluster (infrastructure) and the Blob Storage, rather than for the jobs.
However, you are NOT responsible for configuring, managing, and updating the cluster with new tools (managed by Microsoft).
On-demand provisioning and shutdown.
The cluster is independent of the Distributed File System (Azure Blob Storage); that is, killing the cluster will not cause data loss.
Data can be shared by multiple clusters.
Big Data on Microsoft Azure
Azure HDInsight
Applications (by cluster type): Hadoop, Spark, Storm, HBase, …
Yet Another Resource Negotiator (YARN)
Windows Azure Blob Storage (WABS) Distributed File System
Acquisition
• Azure Data Factory
Stream Processing
• Stream Analytics
• Event Hub
Machine Learning
• Azure Machine Learning
NoSQL
• Table Storage
• DocumentDB
Big Data on Microsoft Azure
The PaaS zoo on the cloud…
Data Factory - Defines and automates the movement, processing, and transformation of data through data flow pipelines.
Stream Analytics - Real-time event processing engine for real-time analytic computations on data streams.
Event Hub - Highly scalable data ingress (message queuing) service that can ingest millions of events per second for downstream processing.
Machine Learning - Cloud-based predictive analytics service for rapid creation and deployment of predictive models as analytics solutions.
Table Storage - Key/attribute NoSQL data store for structured data in the cloud.
DocumentDB - Fully managed NoSQL JSON database service for high performance, high availability, automatic scaling, and ease of development.
Big Data on Microsoft Azure
Azure Data Lake
Data Lake Analytics: U-SQL, …
Yet Another Resource Negotiator (YARN)
Data Lake Storage
Acquisition
• Azure Data Factory
Stream Processing
• Stream Analytics
• Event Hub
Machine Learning
• Azure Machine Learning
NoSQL
• Table Storage
• DocumentDB
Big Data on Microsoft Azure
Azure Data Lake
Platform as a Service (PaaS).
Microsoft’s own implementation of a Big Data platform, like Google (GCP) and Amazon (AWS), rather than a distribution of Hadoop.
U-SQL for batch data processing.
You pay for the jobs and the Data Lake storage.
Distributed File System (Data Lake) optimized for analytical workloads.
Big Data on Microsoft Azure
Microsoft Azure Big Data Analytics Options
Microsoft Advanced Analytics laboratory
Big Data on Microsoft Azure
Microsoft Azure – Cortana Analytics Suite
Microsoft
How to Get Started with Big Data?
 Read these slides!
 Coursera – Big Data Specialization
https://www.coursera.org/specializations/big-data
 Azure Documentation – HDInsight Emulator
https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started
 MVA – Big Data Analytics
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-8255?l=ogCizYKy_9604984382
 MVA – Big Data Analytics with HDInsight: Hadoop on Azure
https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551
 MVA – Implementing Big Data Analysis
https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382
 Azure Documentation – Getting Started with HDInsight
https://azure.microsoft.com/en-gb/documentation/services/hdinsight/
 Microsoft Patterns & Practices – Developing big data solutions on Microsoft Azure HDInsight
https://msdn.microsoft.com/en-gb/library/dn749874.aspx
 Azure Documentation – Data Lake
https://azure.microsoft.com/en-gb/documentation/services/data-lake-analytics/
 Apache Hadoop
http://hadoop.apache.org/
 O’Reilly Books – Hadoop: The Definitive Guide, 4th Edition
Useful Hadoop Commands
• To list the contents of a directory: hadoop fs -ls /<DirectoryPath>
• To see the contents of a file: hadoop fs -cat /<FilePath>
• To create a directory in HDFS: hadoop fs -mkdir /<DirectoryPath>
• To upload files from the local file system to HDFS: hadoop fs -put <localSrcPath> /<hdfsDstPath>
• To download files from HDFS to the local file system: hadoop fs -get /<FilePath>
• To copy a file from source to destination: hadoop fs -cp /<SrcFilePath> /<DstFilePath>
• To copy a file from the local file system to HDFS: hadoop fs -copyFromLocal <LocalSrcPath> /<hdfsDstPath>
• To copy a file from HDFS to the local file system: hadoop fs -copyToLocal /<hdfsSrcFilePath> /<DstFilePath>
• To remove a file from HDFS: hadoop fs -rm /<FilePath>
• To remove a directory from HDFS: hadoop fs -rm -r /<DirectoryPath>
Coming soon…
 Introduction to Azure Data Factory, and Data Lake Analytics with U-SQL
 Introduction to Hive on HDInsight
 Event & Stream Processing on Microsoft Azure
 NoSQL on Microsoft Azure
 Introduction to Spark on HDInsight
 Introduction to Azure Batch
Stay tuned
Acknowledgement
Thanks to Paul Lineham for answering
all my stupid big data questions, patiently…
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014MapR Technologies
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Hadoop World 2011: I Want to Be BIG - Lessons Learned at Scale - David "Sunny...
Hadoop World 2011: I Want to Be BIG - Lessons Learned at Scale - David "Sunny...Hadoop World 2011: I Want to Be BIG - Lessons Learned at Scale - David "Sunny...
Hadoop World 2011: I Want to Be BIG - Lessons Learned at Scale - David "Sunny...Cloudera, Inc.
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?Slim Baltagi
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...Denodo
 
Where does Fast Data Strategy Fit within IT Projects
Where does Fast Data Strategy Fit within IT ProjectsWhere does Fast Data Strategy Fit within IT Projects
Where does Fast Data Strategy Fit within IT ProjectsDenodo
 
Data Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data LakesData Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data LakesDenodo
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureLorenzo Nicora
 
2012 10 bigdata_overview
2012 10 bigdata_overview2012 10 bigdata_overview
2012 10 bigdata_overviewjdijcks
 
Cloud Storage Spring Cleaning: A Treasure Hunt
Cloud Storage Spring Cleaning: A Treasure HuntCloud Storage Spring Cleaning: A Treasure Hunt
Cloud Storage Spring Cleaning: A Treasure HuntSteven Moy
 
Analytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopAnalytics in a Day Ft. Synapse Virtual Workshop
Analytics in a Day Ft. Synapse Virtual WorkshopCCG
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 


Introducing Big Data and Microsoft Azure

  • 1. © Copyright 2015 Hitachi Consulting. Introducing Big Data with Microsoft Azure. Khalid M. Salama, Microsoft Business Intelligence, Hitachi Consulting UK. We Make it Happen. Better.
  • 2. Outline: What is Big Data?; Why Big Data Platforms?; Fundamentals of a Big Data Platform; Distributed Processing & CAP Theorem; Big Data Solutions vs. Traditional RDBMS; Where Big Data Fits in Enterprise Data Platforms?; Hadoop Ecosystem: Apache Tools for Big Data; Big Data on Microsoft Azure; How to Get Started with Big Data?
  • 3. Basic Concepts
  • 4. What is Big Data? In a nutshell… “Data that is too complex for processing using traditional relational databases efficiently and cost-effectively.”
  • 5. What is Big Data? Big Data attributes… “Data that is too complex for processing using traditional relational databases efficiently and cost-effectively.”
      Complex (3 V’s): Volume: huge amounts of data to process; Variety: a mixture of structured and unstructured data; Velocity: high-frequency or (near) real-time data processing.
  • 6. What is Big Data? Tell me more… “Data that is too complex for processing using traditional relational databases efficiently and cost-effectively.”
      Complex (3 V’s): Volume: huge amounts of data to process; Variety: a mixture of structured and unstructured data; Velocity: high-frequency or (near) real-time data processing.
      Processing: Stream (operational); Batch (analytical).
      Efficiently: Availability/Scalability; Performance/Throughput.
      Cost-Effectively: Acquiring; Scaling up/down.
  • 7. What is Big Data? Common examples and applications
      Clickstream: User Experience Improvement; Recommendation & Targeted Advertising.
      Sensor/Devices: Predictive Maintenance; Energy Efficiency (Smart City).
      Social Media: Sentiment Analysis; Crisis Management.
      Spatial & GPS: Push Notifications; Process Optimisation.
      Images/Audio/Video: Proactive Security.
      Free Text: Analysis of customer reviews/feedback/complaints; Automatic news summarization/analysis.
  • 8. Why Big Data Platforms? Traditional Data Platforms
  • 9. Why Big Data Platforms? Breaking points of traditional Data Platforms: Volume
  • 10. Why Big Data Platforms? Breaking points of traditional Data Platforms: Variety
  • 11. Why Big Data Platforms? Breaking points of traditional Data Platforms: Velocity
  • 12. Enterprise-wide data scale: Terabytes, Gigabytes
  • 13. Enterprise-wide data scale: Terabytes, Gigabytes
  • 14. Enterprise-wide data scale: Terabytes, Gigabytes
  • 15. Enterprise-wide data scale: Terabytes, Gigabytes; Non-Transactional, Transactional
  • 16. | © Copyright 2015 Hitachi Consulting16 Addressing Big Data Challenges
  • 17. | © Copyright 2015 Hitachi Consulting17 Addressing Big Data Challenges Addressing the three “V”s… Volume Variety Velocity Challenges
  • 18. | © Copyright 2015 Hitachi Consulting18 Addressing Big Data Challenges Addressing the three “V”s… Volume Variety Velocity Distributed Computing Challenges Solutions
  • 19. | © Copyright 2015 Hitachi Consulting19 Addressing Big Data Challenges Addressing the three “V”s… Volume Variety Velocity Distributed Computing Batch Processing Challenges Solutions
  • 20. | © Copyright 2015 Hitachi Consulting20 Addressing Big Data Challenges Addressing the three “V”s… Volume Variety Velocity Distributed Computing Stream Processing Batch Processing Challenges Solutions
  • 21. | © Copyright 2015 Hitachi Consulting21 Addressing Big Data Challenges Addressing the three “V”s… Volume Variety Velocity Distributed Computing NoSQL Stream Processing Batch Processing Challenges Solutions
  • 22. | © Copyright 2015 Hitachi Consulting22 Addressing Big Data Challenges Addressing the three “V”s… Volume Variety Velocity Distributed Computing NoSQL Stream Processing In-Memory Processing Batch Processing Challenges Solutions
  • 23. Addressing Big Data Challenges: addressing the three “V”s…
      Challenges: Volume, Variety, Velocity
      Solutions: Distributed Computing, Batch Processing, NoSQL, Stream Processing, In-Memory Processing
      Consistency / Availability / Fault Tolerance trade-off (CAP)
  • 24. Addressing Big Data Challenges: tell me more…
      Distributed Computing
        - Cluster of many data/compute nodes (commodity hardware)
        - Data partitioning (sharding)
        - Data partitions are processed in parallel
        - Easy and cheap to scale out
      Batch Processing
        - Processes massive amounts of data
        - Write once / read many
        - High latency
      In-Memory Processing
        - Iterative processing of the same data in memory
        - Data size that fits into memory
        - Low latency
      Stream Processing
        - Processes a continuous stream of data
        - Small data chunks
        - Low latency
      NoSQL
        - Key-value stores
        - Column family stores
        - Document stores
        - Graph stores
      Properties (four boxes shown alongside the technologies)
        - Distributed; available / fault tolerant; eventually consistent; high throughput
        - Distributed; available / fault tolerant; eventually consistent
        - Distributed; available / fault tolerant; eventually consistent
        - Distributed; available / fault tolerant; random read/write access; supports batch and stream
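The distributed-computing idea above (partition the data, process the partitions in parallel, combine the partial results) can be sketched in a few lines. This is only an illustration under simplifying assumptions: threads on one machine stand in for compute nodes, and the names `partition`, `process_partition` and `parallel_sum` are hypothetical, not part of any Hadoop or Azure API.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions roughly equal slices."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def process_partition(part):
    """Work done locally on one 'node': here, a partial sum."""
    return sum(part)


def parallel_sum(records, num_partitions=4):
    """Process every partition in parallel, then combine the partial results."""
    parts = partition(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process_partition, parts))
    return sum(partials)


readings = list(range(1, 1001))   # made-up sample data
total = parallel_sum(readings)    # same answer as sum(readings)
```

Scaling out corresponds to raising `num_partitions` and adding workers; the combine step stays the same.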
  • 25. Fundamental Components
  • 31. Fundamentals of a Big Data Platform: Basic Architectural Components
      Distributed File System
        - Data files are stored in raw form (no schema)
        - Partitioned across data nodes (disks)
        - Each partition is replicated to M nodes, for fault tolerance
      Compute Cluster: Head, Compute 1, Compute 2, …, Compute N
        - Plus an extra failover head node, for availability
      Resource Manager
        - Manages and executes jobs
        - Distributed execution model
      Applications: Batch, In-Memory, Stream, SQL, NoSQL
        - Support batch/speed workloads
      Acquisition
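To illustrate why replicating each partition to M nodes gives fault tolerance, here is a minimal sketch. It assumes a simple round-robin placement, which is a deliberate simplification (real distributed file systems use more elaborate, for example rack-aware, policies); `place_replicas` and `readable_partitions` are hypothetical helper names.

```python
def place_replicas(num_partitions, nodes, m):
    """Assign each partition to m distinct nodes, round-robin (an assumption)."""
    if m > len(nodes):
        raise ValueError("replication factor exceeds node count")
    return {
        p: [nodes[(p + r) % len(nodes)] for r in range(m)]
        for p in range(num_partitions)
    }


def readable_partitions(placement, failed):
    """Partitions that still have at least one replica on a live node."""
    return {
        p for p, holders in placement.items()
        if any(node not in failed for node in holders)
    }


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(num_partitions=8, nodes=nodes, m=3)

# With M = 3, any two node failures leave every partition readable.
after_two_failures = readable_partitions(placement, failed={"node1", "node2"})

# A third failure can take out all replicas of some partitions.
after_three_failures = readable_partitions(
    placement, failed={"node1", "node2", "node3"}
)
```

With M replicas on distinct nodes, data survives any M - 1 simultaneous node failures; beyond that, some partitions may become unreadable.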
  • 32. Fundamentals of a Big Data Platform: Lambda Architecture
      - Data is dispatched to both the batch layer (cold path) and the speed layer (hot path) for processing.
      - The batch layer manages the master dataset (write once, read many) and pre-computes the batch views. It handles large data volumes with high latency.
      - The speed layer produces real-time views with low latency. It deals with a recent, limited window of data only.
      - The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way, and answers any incoming query by merging results from the batch views and the real-time views.
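The flow of data through the Lambda layers can be sketched with a toy event counter. This is an in-memory illustration only; the class `LambdaCounter` and its method names are hypothetical, and a real deployment would use separate batch, stream and serving systems.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda pipeline that counts events per key."""

    def __init__(self):
        self.master = []                 # batch layer: append-only master dataset
        self.batch_view = Counter()      # pre-computed batch view
        self.realtime_view = Counter()   # speed layer: recent events only

    def ingest(self, key):
        # Every event is dispatched to both the cold path and the hot path.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # Recompute the batch view from the full master dataset, then reset
        # the real-time view because the batch view now covers those events.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        # Serving: merge the batch result with the real-time result.
        return self.batch_view[key] + self.realtime_view[key]


pipeline = LambdaCounter()
for page in ["home", "cart", "home"]:
    pipeline.ingest(page)
pipeline.run_batch()       # slow, complete recomputation
pipeline.ingest("home")    # arrives after the last batch run
home_count = pipeline.query("home")
```

The query sees the late event immediately through the real-time view, even though the next batch run has not happened yet.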
  • 33. CAP Theorem
  • 39. Distributed Processing & CAP Theorem: the trade-off…
      - To process large volumes of data efficiently, we need to scale out, i.e. partition the data and distribute the computation.
      - We then face a trade-off between Consistency, Availability, and Partition Tolerance.
      - Consistency: data is in a consistent state across all the nodes. That is, every read returns the same, most recent write.
      - Availability: every request to the system gets a response (i.e., is executed), whether it succeeds or fails. That is, the system stays responsive (latency).
      - Partition Tolerance: the system continues to work despite message loss or partition (node) failure. That is, the system can sustain partial network failures.
      - CAP Theorem: only two of the three properties can be satisfied at once in a distributed data system. In fact, the choice is consistency vs. availability in the presence of a partition.
  • 44. Distributed Processing & CAP Theorem: the trade-off…
      [Diagram: C, A, P. P: the system continues working if a partition is not reachable.]
      Transactional RDBMS
        - ACID model: strong consistency
        - Commits are atomic across the entire system
        - Not partition tolerant, i.e., sacrifices availability
      Big Data Systems
        - BASE model: eventual consistency
        - Remains available (operational and responsive)
        - Partition tolerant, i.e., sacrifices consistency
      ACID
        - Atomic: everything in a transaction succeeds or the entire transaction is rolled back.
        - Consistent: a transaction cannot leave the database in an inconsistent state.
        - Isolated: transactions cannot interfere with each other.
        - Durable: completed transactions persist, even when servers restart, etc.
      BASE
        - Basic availability
        - Soft state
        - Eventual consistency
      NoSQL: strong vs. eventual consistency
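The difference between strong and eventual consistency can be made concrete with a small simulation. The sketch below assumes a quorum-style replicated store, one common way NoSQL systems expose this choice; the slides do not name a specific mechanism, and `ReplicatedRegister` is a hypothetical class. A write reaches only `w` of `n` replicas synchronously; a read asks `r` replicas and keeps the newest version. When `r + w > n` the read set always overlaps the write set, so the read is up to date; otherwise it may be stale until background synchronisation catches up.

```python
class ReplicatedRegister:
    """One value stored on n replicas; writes reach some replicas lazily."""

    def __init__(self, n):
        self.replicas = [(0, None)] * n   # (version, value) per replica
        self.version = 0

    def write(self, value, w):
        # Synchronously update the first w replicas; the rest lag behind.
        self.version += 1
        for i in range(w):
            self.replicas[i] = (self.version, value)

    def read(self, r, start=0):
        # Ask r replicas (starting at index `start`) and keep the newest answer.
        n = len(self.replicas)
        asked = [self.replicas[(start + i) % n] for i in range(r)]
        return max(asked, key=lambda pair: pair[0])[1]

    def anti_entropy(self):
        # Background sync: every replica converges to the newest version.
        newest = max(self.replicas, key=lambda pair: pair[0])
        self.replicas = [newest] * len(self.replicas)


reg = ReplicatedRegister(n=3)
reg.write("v1", w=1)

stale = reg.read(r=1, start=2)       # r + w = 2 <= n: may miss the write
strong = reg.read(r=3, start=2)      # r + w = 4 > n: always sees the write

reg.anti_entropy()
converged = reg.read(r=1, start=2)   # eventually consistent after sync
```

Here `stale` is `None` because the single replica asked has not received the write yet, while `strong` and `converged` both return `"v1"`.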
  • 45. Big Data Solutions vs. Traditional RDBMS: the face-off… (source: Microsoft Patterns & Practices)

      | Feature | RDBMS | Big Data (Batch) | Big Data (Stream & NoSQL) |
      |---|---|---|---|
      | Data integrity | Strong consistency, ACID transactions | Eventual consistency, BASE model | Depends on the technology (strong vs. eventual consistency) |
      | Schema | Static, required on write | Dynamic, schema on read | Flexible, extensible |
      | Data types and formats | Structured | Structured, semi-structured, and unstructured | Semi-structured |
      | Read and write pattern | Fully repeatable read/write | Write once, repeatable read | Fully repeatable read/write |
      | Storage volume | Gigabytes to terabytes | Terabytes, petabytes, and beyond | Terabytes, petabytes, and beyond (small data chunks for stream processing) |
      | Scalability | Scale up with more powerful hardware | Scale out with additional servers | Scale out with additional servers |
      | Data processing distribution | Limited or none | Distributed across the cluster | Distributed across the cluster |
      | Economics | Expensive hardware and software | Commodity hardware and open source software | Commodity hardware and open source software |
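The "schema on read" entry in the comparison can be illustrated with a short sketch: raw lines are stored as they arrive, and each reader applies its own schema when it reads them. The sample lines, the field names and the helper `read_with_schema` are all made up for illustration.

```python
import json

# Raw store: anything can be written, no schema is enforced on write.
RAW_LINES = [
    '{"city": "London", "sales": "120.5", "channel": "web"}',
    '{"city": "Leeds", "sales": 80}',
    'not json at all',
]


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> converter) while reading raw lines.

    Lines that cannot be parsed, or that lack a required field, are skipped.
    """
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: convert(record[field])
                   for field, convert in schema.items()}
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue


# Two readers project different schemas onto the same raw data.
sales_rows = list(read_with_schema(RAW_LINES, {"city": str, "sales": float}))
channel_rows = list(read_with_schema(RAW_LINES, {"channel": str}))
```

A schema-on-write system would instead reject the second and third lines (or the whole load) at insert time, before any reader saw them.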
  • 46. Enterprise Big Data Platform
  • 47. Big Data Fit in Enterprise Data Platform: Enterprise Data Platform [diagram]
  • 48. Big Data Fit in Enterprise Data Platform: Use Case 1: Data Exploration/Experiments Platform [diagram; source: Microsoft Patterns & Practices]
  • 49. Big Data Fit in Enterprise Data Platform: Use Case 2: Data Processing (ETL) [diagram, includes MPP; source: Microsoft Patterns & Practices]
  • 50. Big Data Fit in Enterprise Data Platform: Use Case 3: Data Warehouse [diagram; source: Microsoft Patterns & Practices]
  • 51-52. Big Data Fit in Enterprise Data Platform: Use Case 4: Full Data/BI Integration [diagram, includes MPP and Operational Apps; source: Microsoft Patterns & Practices]
      1. ETL-level integration
      2. DW-level integration
      3. BI-level integration
        - Corporate data model
        - Reports/dashboards (mashup)
  • 53. Big Data with Hadoop
  • 54. Introducing Hadoop: Apache Hadoop Ecosystem - “A” Big Data Platform
      Applications: Batch, In-Memory, Stream, SQL (Spark SQL), NoSQL, Machine Learning, Search, Orchestration, Management, Acquisition, …
      Yet Another Resource Negotiator (YARN)
      Hadoop Distributed File System (HDFS): NameNode, DataNode 1, DataNode 2, DataNode 3, …, DataNode N
  • 55. Introducing Hadoop: Apache Hadoop Ecosystem - “A” Big Data Platform [tool logos not captured in the transcript; descriptions only]
      - A programming model for distributed processing of large data on a cluster
      - A scripting platform for processing and analysing large data sets
      - The de facto standard for SQL queries in Hadoop
      - Efficiently transfers bulk data between Apache Hadoop and relational data stores
      - An algorithm library for scalable machine learning on Hadoop
      - Provides workflow scheduling services to manage Hadoop jobs
      - A system for processing streaming data in real time
      - A fast, scalable, fault-tolerant messaging system
      - In-memory compute for ETL, machine learning, SQL, and streaming
      - A distributed key-value store with cell-based access control
      - CouchDB: JSON document-oriented data store
      - Provides random read/write access to a distributed, fault-tolerant NoSQL data store
  • 56. Introducing Hadoop: History [timeline chart; date labels: Oct-03, Dec-04, Jan-06, Feb-06, Apr-06, May-06, Apr-07, Jun-07, Oct-07, Jan-08, Feb-08, Jul-08, Oct-08, Nov-08, Mar-09, Apr-09, May-10, Jun-10, Sep-10, Jan-11, Mar-11, Jun-11, Jan-12, Nov-12, Feb-14, Jun-15]
  • 62. Introducing Hadoop: MapReduce - Distributed Programming Model
      Map
        - Read lines from file
        - Convert each line to key-value pair(s)
        - Filter (by key/value)
        - Combine values with the same key
        - Shuffle data across nodes to the reducers, by key
      Reduce
        - Sort by key
        - Aggregate (reduce)
        - Filter (based on aggregated value)
        - Write results to file
      Data flow: Input, Mappers, hash shuffling, Reducers, Output
        - Mapper output: (Key1, Value1), (Key2, Value2), (Key1, Value3)
        - Reducer input: Key1: {Value1, Value3}; Key2: {Value2}
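The hash-shuffling step between mappers and reducers can be sketched as follows. This is a single-process illustration, not Hadoop's actual partitioner: `reducer_for` and `shuffle` are hypothetical names, and CRC32 is used only as a convenient stable hash so that every mapper routes a given key to the same reducer.

```python
import zlib
from collections import defaultdict


def reducer_for(key, num_reducers):
    """Stable hash: the same key always lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route each (key, value) pair to a reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            reducers[reducer_for(key, num_reducers)][key].append(value)
    # Each reducer receives its keys in sorted order.
    return [dict(sorted(groups.items())) for groups in reducers]


# One list of pairs per mapper, mirroring the slide's example.
mapper_outputs = [
    [("Key1", "Value1")],
    [("Key2", "Value2")],
    [("Key1", "Value3")],
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
```

Whichever reducer a key hashes to, all of its values arrive together, so `Key1` is seen with both `Value1` and `Value3` by exactly one reducer.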
  • 63. Introducing Hadoop: MapReduce - Example
      Query: SELECT Month, City, SUM(SalesValue) FROM Sales WHERE Product = 'Bike' GROUP BY Month, City HAVING SUM(SalesValue) > 50000
      Map
        - Read lines from file
        - Convert each line to a (Month-City, Value) pair
        - Discard lines where Product is not 'Bike'
        - Combine values with the same key
        - Shuffle data across nodes to the reducers, by key
      Reduce
        - Sort by key
        - Sum all the values for a given key
        - Discard records where the sum <= 50,000
        - Write results to file
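The same steps can be traced in plain Python. The sketch below runs in a single process and keys each record by (Month, City), as the steps above do; the sample rows and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are made up for illustration and are not Hadoop APIs.

```python
from collections import defaultdict

# (month, city, product, sales_value): made-up sample rows
SALES = [
    ("Jan", "London", "Bike", 30000),
    ("Jan", "London", "Bike", 25000),
    ("Jan", "London", "Helmet", 90000),
    ("Jan", "Leeds", "Bike", 40000),
    ("Feb", "Leeds", "Bike", 60000),
]


def map_phase(rows):
    """Emit ((month, city), value) pairs for bike sales only (the WHERE clause)."""
    for month, city, product, value in rows:
        if product == "Bike":
            yield (month, city), value


def shuffle_phase(pairs):
    """Group values by key and sort by key, as reducers would receive them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, threshold=50000):
    """Sum per key (GROUP BY + SUM), keep totals above the threshold (HAVING)."""
    for key, values in groups:
        total = sum(values)
        if total > threshold:
            yield key, total


result = list(reduce_phase(shuffle_phase(map_phase(SALES))))
```

On the sample rows, only (Feb, Leeds) with 60000 and (Jan, London) with 55000 pass the threshold; the helmet sale is filtered out in the map phase and (Jan, Leeds) is dropped in the reduce phase.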
  • 65. Big Data with Microsoft Azure
  • 66. Big Data on Microsoft Azure
      - Virtual Machines (IaaS)
      - Azure HDInsight (IaaS+)
      - Azure Data Lake (PaaS)
      - Azure Services (Data Acquisition, Stream Processing, Machine Learning, NoSQL)
  • 67. Big Data on Microsoft Azure: Virtual Machines (Hortonworks/Cloudera/MapR)
      - Infrastructure as a Service (IaaS).
      - Different distributions of Hadoop, still 100% Hadoop (plus distribution-specific extra tools).
      - You are responsible for provisioning, configuring, managing, and updating the cluster with new tools.
      - The Distributed File System is part of the compute cluster; that is, killing the cluster means losing the data.
  • 68. Big Data on Microsoft Azure: Azure HDInsight
      - Infrastructure as a Service+ (IaaS+).
      - Hortonworks distribution of Hadoop.
      - You pay for the cluster (infrastructure) and the Blob Storage, rather than for the jobs.
      - Yet you are NOT responsible for configuring, managing, and updating the cluster with new tools (managed by Microsoft).
      - On-demand provisioning and shutdown.
      - The cluster is independent of the Distributed File System (Azure Blob Storage); that is, killing the cluster does not cause data loss.
      - Data can be shared by multiple clusters.
  • 70. Big Data on Microsoft Azure: Azure HDInsight
      Applications (by cluster type): Hadoop, Spark, Storm, HBase, …
      Yet Another Resource Negotiator (YARN)
      Distributed File System: Windows Azure Blob Storage (WABS)
      Azure Services
        - Acquisition: Azure Data Factory
        - Stream Processing: Stream Analytics, Event Hub
        - Machine Learning: Azure Machine Learning
        - NoSQL: Table Storage, DocumentDB
  • 71. Big Data on Microsoft Azure: the PaaS zoo on the cloud…
      - Data Factory: defines and automates the movement, processing, and transformation of data through data flow pipelines.
      - Stream Analytics: real-time event processing engine for real-time analytic computations on data streams.
      - Event Hub: highly scalable data ingress (message queuing) service that can ingest millions of events per second for downstream processing.
      - Machine Learning: cloud-based predictive analytics service for rapid creation and deployment of predictive models as analytics solutions.
      - Table Storage: a key/attribute NoSQL data store for structured data in the cloud.
      - DocumentDB: fully managed NoSQL JSON database service for high performance, high availability, automatic scaling, and ease of development.
  • 72. Big Data on Microsoft Azure: Azure Data Lake
      Data Lake Analytics: U-SQL, …
      Yet Another Resource Negotiator (YARN)
      Data Lake Storage
      Azure Services
        - Acquisition: Azure Data Factory
        - Stream Processing: Stream Analytics, Event Hub
        - Machine Learning: Azure Machine Learning
        - NoSQL: Table Storage, DocumentDB
  • 73. Big Data on Microsoft Azure: Azure Data Lake
      - Platform as a Service (PaaS).
      - Microsoft’s own implementation of a Big Data platform, like those of Google (GCP) and Amazon (AWS), rather than a distribution of Hadoop.
      - U-SQL for batch data processing.
      - You pay for the jobs and the Data Lake storage.
      - Optimized Distributed File System (Data Lake) for analytical workloads.
  • 74. Big Data on Microsoft Azure: Microsoft Azure Big Data Analytics Options [diagram; Microsoft Advanced Analytics laboratory]
  • 75. Big Data on Microsoft Azure: Microsoft Azure, Cortana Analytics Suite [diagram; Microsoft]
  • 76. How to Get Started with Big Data?
      - Read these slides!
      - Coursera: Big Data Specialization
        https://www.coursera.org/specializations/big-data
      - Azure Documentation: HDInsight Emulator
        https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started
      - MVA: Big Data Analytics
        https://mva.microsoft.com/en-US/training-courses/big-data-analytics-8255?l=ogCizYKy_9604984382
      - MVA: Big Data Analytics with HDInsight: Hadoop on Azure
        https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight-hadoop-on-azure-10551
      - MVA: Implementing Big Data Analysis
        https://mva.microsoft.com/en-US/training-courses/implementing-big-data-analysis-8311?l=44REr2Yy_5404984382
      - Azure Documentation: Getting Started with HDInsight
        https://azure.microsoft.com/en-gb/documentation/services/hdinsight/
      - Microsoft Patterns & Practices: Developing big data solutions on Microsoft Azure HDInsight
        https://msdn.microsoft.com/en-gb/library/dn749874.aspx
      - Azure Documentation: Data Lake
        https://azure.microsoft.com/en-gb/documentation/services/data-lake-analytics/
      - Apache Hadoop
        http://hadoop.apache.org/
      - O’Reilly Books: Hadoop: The Definitive Guide, 4th Edition
  • 77. Useful Hadoop Commands
      - To list the contents of a directory:
          hadoop fs -ls /<DirectoryPath>
      - To see the contents of a file:
          hadoop fs -cat /<FilePath>
      - To create a directory in HDFS:
          hadoop fs -mkdir /<DirectoryPath>
      - To upload files from the local file system to HDFS:
          hadoop fs -put <localSrcPath> /<hdfsDstPath>
      - To download files from HDFS to the local file system:
          hadoop fs -get /<FilePath>
      - To copy a file from source to destination within HDFS:
          hadoop fs -cp /<SrcFilePath> /<DstFilePath>
      - To copy a file from the local file system to HDFS:
          hadoop fs -copyFromLocal <LocalSrcPath> /<hdfsDstPath>
      - To copy a file from HDFS to the local file system:
          hadoop fs -copyToLocal /<hdfsSrcFilePath> <LocalDstPath>
      - To remove a file from HDFS:
          hadoop fs -rm /<FilePath>
      - To remove a directory from HDFS:
          hadoop fs -rm -r /<DirectoryPath>
  • 78. Coming soon… Stay tuned
      - Introduction to Azure Data Factory, and Data Lake Analytics with U-SQL
      - Introduction to Hive on HDInsight
      - Event & Stream Processing on Microsoft Azure
      - NoSQL on Microsoft Azure
      - Introduction to Spark on HDInsight
      - Introduction to Azure Batch
  • 79. Acknowledgement: Thanks to Paul Lineham for patiently answering all my stupid big data questions…
  • 80. Thank you!