1. Modernizing Your Data Warehouse using APS
Big data. Small data. All data.
Stéphane Fréchette - SQL Server MVP - @sfrechette
Database / Business Intelligence Solution Architect
3. Data sources
1 Increasing data volumes
2 Real-time data
3 New data sources and types
4 Cloud-born data
4. The modern data warehouse
[Diagram: data sources, including non-relational data]
5. Insights from all your data
Enrich and optimize your data from non-traditional sources
6. Roadblocks to a modern data warehouse
Keep legacy investment: limited scalability and ability to handle new data types
Buy new tier-one hardware appliance: high acquisition and migration costs
Acquire Big Data solution: significant training and data silos
Acquire business intelligence: complex with low adoption
7. Introducing the Microsoft Analytics Platform System
The turnkey modern data warehouse appliance
• Relational and non-relational data in a single appliance
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and PDW using T-SQL
• Direct integration with Microsoft BI tools such as Microsoft Excel
• Near real-time performance with In-Memory Columnstore
• Ability to scale out to accommodate growing data
• Removal of data warehouse bottlenecks with MPP SQL Server
• Concurrency that fuels rapid adoption
• Industry’s lowest data warehouse appliance price per terabyte
• Value through a single appliance solution
• Value with flexible hardware options using commodity hardware
9. Evolution in the nature and use of data in the enterprise
[Chart: data complexity (variety and velocity, up to petabytes) plotted against value to the business, with stages labeled historical analysis, insight analysis, predictive analytics, and predictive forecasting]
10. What is Hadoop?
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
[Diagram: Hadoop stack grouped into core services, data services, and operational services. Components shown: HDFS, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon, Flume, Sqoop, NFS, WebHDFS (load & extract), Ambari, Oozie. The Hadoop cluster is drawn as a series of nodes, each combining compute and storage.]
11. Manageable, secured, and highly available Hadoop integrated into the appliance
High performance and tuned within the appliance
End-user authentication with Active Directory
Accessible insights for everyone with Microsoft BI tools
Managed and monitored using System Center
100-percent Apache Hadoop
[Diagram: SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight]
12. [Diagram: an appliance made up of hardware and fabric layers, hosting a Parallel Data Warehouse workload and an HDInsight workload]
A region is a logical container within an appliance
Each workload contains the following boundaries:
• Security
• Metering
• Servicing
13. Bringing Hadoop point solutions and the data warehouse together for users and IT
Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
Uses the power of MPP to enhance query execution performance
Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
[Diagram: a SELECT sent to SQL Server Parallel Data Warehouse returns a result set; PolyBase connects PDW to Microsoft Azure HDInsight, Microsoft HDInsight, Hortonworks for Windows and Linux, and Cloudera]
14. Direct and parallelized HDFS access
Enhancing the Data Movement Service (DMS) of APS to allow direct communication between HDFS data nodes and PDW compute nodes
[Diagram: non-relational data (social apps, sensor and RFID, mobile apps, web apps) in Hadoop; relational data (traditional schema-based data warehouse applications) in PDW. Regular T-SQL runs through the enhanced PDW query engine, which uses an external table, external data source, and external file format to reach Hadoop over the HDFS bridge and return results.]
15. [Diagram: source systems feed a Hadoop / data lake (Cloudera, Hortonworks, HDInsight) and APS (SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight) with day / hour / minute refresh. Data is processed with MapReduce and T-SQL and served through SQL Server data marts, SQL Server Reporting Services, and SQL Server Analysis Services for analytics, ad hoc queries, and visualization.]
16. HDFS File / Directory
//hdfs/social_media/twitter
//hdfs/social_media/twitter/Daily.log
1
0
Hadoop
Dynamic binding
Column filtering
Row filtering
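Twitter_Table columns: User, Location, Product, Sentiment, Rtwt, Hour, Date
User: Sean, Audie, Suz, Tom, Sanjay, Roger, Steve
Location: CA, CO, WA, IL, MN, TX, AL
Product: xbox, excel, xbox, sqls, wp8, ssas, ssrs
Sentiment, Rtwt, Hour (values in slide order): -1 0 1 1 1 1 5 0 8 0 0 0 8 8 2 2 1 23 23
Date: 5-15-14, 5-15-14, 5-15-14, 5-13-14, 5-14-14, 5-14-14, 5-13-14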
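SELECT [User], Product, Sentiment
FROM Twitter_Table
WHERE [Hour] = DATEPART(hour, GETDATE()) - 1  -- previous hour (assumes Hour is an integer column)
  AND [Date] = CAST(GETDATE() AS date)        -- today (assumes Date is a date column)
  AND Sentiment >= 0;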
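The three labels on this slide (dynamic binding, column filtering, row filtering) can be illustrated with a small Python sketch. It is not how PolyBase is implemented; it only shows the idea of binding a schema to raw delimited lines at read time, then dropping unwanted rows and columns. The sample lines, the tab delimiter, and the names `RAW`, `SCHEMA`, `scan`, and `query` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical tab-delimited lines standing in for an HDFS file such as Daily.log.
RAW = "Sean\tCA\txbox\t-1\nAudie\tCO\texcel\t0\nSuz\tWA\txbox\t1\n"

# Schema bound to the file at query time (assumed column names and types).
SCHEMA = [("User", str), ("Location", str), ("Product", str), ("Sentiment", int)]


def scan(raw, schema):
    """Dynamic binding: apply the schema to each raw line as it is read."""
    for fields in csv.reader(io.StringIO(raw), delimiter="\t"):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def query(raw, schema, columns, predicate):
    """Row filtering drops records; column filtering keeps only requested fields."""
    return [
        {column: row[column] for column in columns}
        for row in scan(raw, schema)
        if predicate(row)
    ]


result = query(
    RAW,
    SCHEMA,
    columns=["User", "Product", "Sentiment"],
    predicate=lambda row: row["Sentiment"] >= 0,
)
print(result)
```

Only the rows that pass the predicate survive, and each surviving row carries only the three requested columns, which mirrors the SELECT list and the Sentiment condition of the query above.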
17. Improve APS operations by extending PolyBase
HDFS file formats: Textfile and RCFile support
Supported Hadoop sources:
• Microsoft Azure HDInsight
• HDInsight on APS
• Hortonworks Data Platform 1.3 and 2.0 (Linux/Windows Server)
• Cloudera CDH 4.3 (Linux)
Security and permission model
External table source and file format syntax
Microsoft Azure Storage Blobs
[Diagram: AU1, PolyBase v2, Analytics Platform System (powered by PolyBase)]
18. Big Data insights for anyone
New insights with familiar tools through native Microsoft BI integration
Minimizes IT intervention for discovering data with tools such as Microsoft Excel
Enables DBA and power users to join relational and Hadoop data with T-SQL
Takes advantage of high adoption of Excel, Power View, PowerPivot, and SQL Server Analysis Services
Offers Hadoop tools like MapReduce, Hive, and Pig for data scientists
[Diagram: three audiences: data scientists, power users, and everyone else using Microsoft BI tools]
20. CREATE EXTERNAL DATA SOURCE datasource_name
WITH (
    TYPE = <data_source>,
    LOCATION = '<location>'
    [, JOB_TRACKER_LOCATION = '<jb_location>']
);
1 Type of external data source
2 Location of external data source
3 Enabling or disabling of MapReduce job generation
21. CREATE EXTERNAL FILE FORMAT fileformat_name
WITH (
    FORMAT_TYPE = <type>
    [, SERDE_METHOD = '<serde_method>']
    [, DATA_COMPRESSION = '<compr_method>']
    [, FORMAT_OPTIONS (<format_options>)]
);
1 Type of file format
2 (De)Serialization method [Hive RCFile]
3 Compression method
4 (Optional) Format Options [Text Files]
22. <format_options> ::=
    [FIELD_TERMINATOR = 'value']
    [, STRING_DELIMITER = 'value']
    [, DATE_FORMAT = 'value']
    [, USE_TYPE_DEFAULT = 'value']
1 Column delimiter
2 Delimiter for string data types
3 To specify a particular date format
4 How missing entries are handled
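To make the four format options concrete, here is a minimal Python sketch of how a single delimited text line might be interpreted under them. It is an illustration only, not PolyBase code. The function `parse_line`, its parameter names, and the `TYPE_DEFAULTS` table are hypothetical. The default values reflect an assumption about USE_TYPE_DEFAULT (0 for numbers, an empty string for strings, 1900-01-01 for dates, and NULL when the option is off), and the Python `strptime` pattern merely stands in for the DATE_FORMAT value, which uses its own syntax.

```python
from datetime import date, datetime

# Assumed substitutes for missing values when USE_TYPE_DEFAULT is enabled.
TYPE_DEFAULTS = {int: 0, str: "", date: date(1900, 1, 1)}


def parse_line(line, column_types, field_terminator="|", string_delimiter='"',
               date_format="%m-%d-%y", use_type_default=False):
    """Split one text line into typed values according to the format options."""
    values = []
    for raw, column_type in zip(line.split(field_terminator), column_types):
        if raw == "":
            # Missing entry: use the type default, or None to represent NULL.
            values.append(TYPE_DEFAULTS[column_type] if use_type_default else None)
        elif column_type is str:
            values.append(raw.strip(string_delimiter))
        elif column_type is date:
            values.append(datetime.strptime(raw, date_format).date())
        else:
            values.append(column_type(raw))
    return values


types = [str, str, int, date]
with_defaults = parse_line('"Sean"|CA||5-15-14', types, use_type_default=True)
with_nulls = parse_line('"Sean"|CA||5-15-14', types, use_type_default=False)
print(with_defaults)
print(with_nulls)
```

The sketch splits naively on the field terminator, so it does not handle a terminator that appears inside a delimited string; a real reader would need to.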
23. Bringing islands of Hadoop data together
Running high performance queries against Hadoop data
Archiving data warehouse data to Hadoop (move)
Exporting relational data to Hadoop (copy)
Importing Hadoop data into a data warehouse (copy)
25. Scale up and rowstore
Scale up: diminishing scale as requirements grow (forklift upgrades)
Rowstore: querying data by row, with sub-optimal performance for many data warehouse queries
Data stored by row across pages (Page 1, Page 2, Page 3):
C1 C2 C3 C4
R1 R1 R1 R1
R2 R2 R2 R2
R3 R3 R3 R3
R4 R4 R4 R4
R5 R5 R5 R5
R6 R6 R6 R6
26. Scaling out your data to petabytes
Scale-out technologies in the Analytics Platform System
Scale out:
Multiple nodes with dedicated CPU, memory, and storage
Ability to incrementally add hardware for near-linear scale to multiple petabytes
Ability to handle query complexity and concurrency at scale
No “forklift” of prior warehouse to increase capacity
Ability to scale out HDInsight and PDW
[Diagram: PDW and PDW / HDInsight units added incrementally along a scale from 0 terabytes to 6 petabytes]
27. Blazing-fast performance
MPP and In-Memory Columnstore for next-generation performance
Updateable clustered columnstore vs. table with customary indexing:
Up to 100x faster queries
Up to 15x more compression
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed
• Updateable and clustered for real-time trickle loading
[Diagram: columnstore index representation; parallel query execution from query to results]
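A toy Python sketch can show why columnar storage compresses well and why a query that reads one column does not need the others. It is not the In-Memory Columnstore implementation, which is not described in this deck; run-length encoding is used here only as one simple, well-known compression technique, and the sample rows and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical fact rows: (location, product, retweets).
ROWS = [
    ("CA", "xbox", 5),
    ("CA", "xbox", 0),
    ("CA", "excel", 8),
    ("WA", "excel", 0),
]


def to_columns(rows):
    """Pivot row-oriented tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(column):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


locations, products, retweets = to_columns(ROWS)

# Repeated values sit next to each other in a column, so they compress well.
encoded_locations = run_length_encode(locations)
encoded_products = run_length_encode(products)

# An aggregate over one column reads only that column's values.
total_retweets = sum(retweets)

print(encoded_locations, encoded_products, total_retweets)
```

In a row-oriented layout the same aggregate would have to read every full row, including the location and product values it never uses.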
28. Why is a clustered columnstore index important?
• Saves space
• Provides easier management by eliminating maintenance of secondary indexes
• Supports all PDW data types, including high-precision decimal data types and more
[Chart: space used in GB for a table with 101 million rows, across six configurations (1 to 6), where space used = table space + index space; axis from 0.0 to 20.0 GB; 91% savings]
In-Memory Columnstore is featured in the storage engine in PDW AU1
29. Relational query execution processing
1 SQL queries sent to control node
2 Control node creates query execution plan
3 Query plan creates distributed queries to run on each compute node
4 Distributed queries sent to compute nodes (all running in parallel)
5 Control node collects query results and returns them to user
[Diagram: the client sends a user query to the control node (appliance management), which creates the query plan; the compute nodes process query plan operations in parallel; the control node aggregates the query results and returns them]
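The five steps above can be sketched in Python as a scatter-gather aggregation. This is a simplified model under stated assumptions, not the PDW engine: threads stand in for compute nodes, rows are hash-distributed at query time (a real appliance distributes data when it is loaded), and the names `distribute`, `compute_node_query`, and `control_node_query` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def distribute(rows, node_count):
    """Assign each (key, amount) row to a compute node by hashing its key."""
    nodes = [[] for _ in range(node_count)]
    for key, amount in rows:
        nodes[crc32(key.encode()) % node_count].append((key, amount))
    return nodes


def compute_node_query(local_rows):
    """Distributed query: each node sums amounts per key over its own rows."""
    totals = Counter()
    for key, amount in local_rows:
        totals[key] += amount
    return totals


def control_node_query(rows, node_count=4):
    """Plan, fan out to compute nodes in parallel, then merge partial results."""
    nodes = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_query, nodes))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds the per-key sums together
    return dict(merged)


totals = control_node_query([("xbox", 5), ("excel", 2), ("xbox", 8)])
print(totals)
```

Because rows are distributed on the grouping key, each key lives on exactly one node and the merge step is a simple union; with a different distribution key the same merge would add up partial sums from several nodes.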
30. Great performance with mixed workloads
[Diagram labels:
Sources: ERP, CRM, LOB apps, Hadoop / Big Data
Loading: ETL/ELT with SSIS, DQS, MDS; ETL/ELT with DWLoader
Platforms: SQL Server SMP; Analytics Platform System (PDW, PolyBase, HDInsight)
Consumers: reporting and cubes, BI tools, ad hoc queries
Latency: intra-day, near real-time, fast ad hoc, real-time
Technologies: Columnstore, PolyBase, CRTAS, Link Table, ROLAP / MOLAP, DirectQuery, SNAC]
32. High performance using commodity hardware
Price per terabyte for leading vendors
Significantly lower price per terabyte than the closest competitor
Lower storage costs with Windows Server 2012 Storage Spaces
[Chart: price per terabyte for user-available storage (compressed), in thousands of dollars ($0 to $30), for Oracle, EMC, IBM, Teradata, and Microsoft. NOTE: Orange line indicates average price per terabyte.]
33. Hardware and software engineered together
The ease of an appliance
Co-engineered with HP, Dell, and Quanta best practices
Leading performance with commodity hardware
Integrated support plan with a single Microsoft PDW contact
Pre-configured, built, and tuned software and hardware
[Diagram: PolyBase, HDInsight]
34. Hardware architecture
[Diagram: two racks, each with redundant InfiniBand and Ethernet networking. Rack #1 contains a PDW region and an HDInsight region, with a control node, a master node, failover nodes, and compute nodes with economical disk storage. Rack #2 contains a failover node, compute nodes with economical disk storage, HDI extension base units, and HDI active scale units. Host labels shown: HST-01, HST-02, HSA-01, connected by IB and Ethernet to economical disk storage.]
Active Unit: addition of two or three compute nodes, depending on OEM hardware configuration, and related storage
Passive Unit: host for non-worker HDInsight nodes
Failover Node: high availability for the rack
35. • PDW engine
• DMS Manager
• SQL Server 2012 Enterprise Edition (PDW build)
[Diagram: base unit with Host 1 to Host 4 running the CTL, MAD, AD, and VMM virtual machines plus Compute 1 and Compute 2, connected by IB and Ethernet, with direct attached SAS to economical disk storage]
Software details
• All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
• Fabric or workload in Hyper-V Virtual Machines
• Fabric virtual machine, management server (MAD01), and control server (CTL) share one server
• PDW agent that runs on all hosts and all virtual machines
• DWConfig and Admin Console
• Windows Storage Spaces and Azure Storage blobs
36. [Diagram: base unit (Host 1 to Host 4) plus a passive unit (Host 5), connected by IB and Ethernet with direct attached SAS to economical disk storage. The CTL, MAD, AD, FAB, and VMM virtual machines and the Compute 1 and Compute 2 nodes are shown before and after a host failure, with CTL and Compute 1 relocated.]
Virtual machine migration can be used to move workload nodes to new hosts after hardware failure
Cluster Shared Volumes
• Enable all nodes to access logical unit numbers (LUNs) on economical disk storage
• Use Server Message Block (SMB3) protocol
Failover capabilities
• Uses one cluster across the whole appliance
• Automatically migrates virtual machines on host failure
• Enforces rules with affinity and anti-affinity maps
• Uses Windows Failover Cluster Manager