Choosing the Right Solution to Build a Modern Cloud Data Warehouse

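Data sources for application systems
[Figure: timeline of data sources, 1985 to 2020; series: Internet-connected, digital, analog]

$1.6 trillion: the additional business value that companies leading in the use of data assets will create.
Source: IDC, 2014
10% of companies are expected to have a highly profitable business unit that monetizes data assets by 2020.
Source: Gartner, 2016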

Trends and challenges in using data

The many kinds of data organizations need to handle

Traditional business analytics process
[Diagram labels: LOB applications, ETL pipeline, dedicated ETL tools (e.g. SSIS), relational store, defined schema, queries, results]
1. Start with end-user requirements to identify the desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema ("schema-on-write")
5. Create reports and analyze the data
All data not immediately required is discarded or archived

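The new big data mindset: all data has value
• All data has potential value
• Data should be kept rather than discarded
• No predefined schema; data is stored in its native format
• Schema is assigned, and the data transformed, at query time (schema-on-read)
• Applications and users decide how to interpret the data
[Diagram labels: gather data from all sources, store indefinitely, analyze, see results, iterate]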
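The schema-on-read idea on this slide can be illustrated with a minimal sketch. The raw records, field names, and the `read_with_schema` helper below are all hypothetical; the point is only that the raw data is stored untouched and each consumer applies its own schema when it reads.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50", "city": "Taipei"}',
    '{"user": "jim", "amount": "7", "device": "mobile"}',
    '{"user": "ann", "amount": "3.25"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick fields and cast them, None if absent."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# One consumer interprets the data as (user, amount) for a spend report.
spend_schema = {"user": str, "amount": float}
total_by_user = {}
for row in read_with_schema(RAW_EVENTS, spend_schema):
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]

# Another consumer reads the same raw data with a different schema.
device_schema = {"user": str, "device": str}
devices = [row["device"] for row in read_with_schema(RAW_EVENTS, device_schema)]
```

Under schema-on-write, the `device` and `city` fields would have been dropped at load time unless someone had planned for them; here they remain available to any later reader.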

Challenges brought by big data
• Building new skills and capabilities
• Figuring out how to get value
• Integrating with existing IT investments
*Gartner: Survey Analysis – Hadoop Adoption Drivers and Challenges (Stamford, CT.: Gartner, 2015)

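[Diagram: end-to-end data platform layers]
Data sources: apps; sensors and devices; data; IoT Hub
Information management: Event Hubs, Data Catalog, Data Factory
Big data stores: SQL Data Warehouse, Data Lake Store
Machine learning and analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
Intelligence: Cortana, Bot Framework, Cognitive Services
Dashboards and data visualization: Power BI
Action: people; automated systems; apps (web, mobile, bots)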

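The Hadoop platform comprises many different projects
Hadoop = HDFS + MapReduce + YARN + an ecosystem of tools and frameworks
[Diagram: ecosystem projects grouped into data services and operational services]

Microsoft contributions to Hadoop projects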

Types of data commonly processed with Hadoop
1. Sentiment
Understand how your customers feel about your brand
2. Clickstream
Capture and analyze website visitors' data trails and optimize your website
3. Sensor / machine
Discover patterns in data streaming automatically from remote sensors and machines
4. Geographic
Analyze location-based data to manage operations where they occur
5. Server logs
Research logs to diagnose process failures and prevent security breaches
6. Unstructured data (text, video, pictures, etc.)
Understand patterns in files across millions of web pages, emails, and documents

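Introduction to Azure HDInsight
Hadoop Meets the Cloud: a Hadoop service managed by Microsoft
Uses 100% open-source Apache Hadoop
Compatible with .NET and Java tools
Hadoop versions can be upgraded automatically
Set up and running within minutes, with no hardware to purchase
Runs on Windows or Linux
Provision and configure the service, use it, then delete it; the data can be kept
Technical support provided by Microsoft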

A Hadoop distribution comprises many different projects

[Diagram: cluster layout. Head roles: Name Node, Job Tracker, HMaster, coordination. Four workers, each with a Data Node, Task Tracker, and Region Server]

[Diagram labels: stream processing, search and query, data analytics (Excel), web/thick-client dashboards, devices to take action, RabbitMQ / ActiveMQ]

[Diagram labels: Azure HDInsight, in-memory Spark]

Other Hadoop components and tools
Ambari: Cluster provisioning, management, and monitoring
Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment
MapReduce and YARN: Distributed processing and resource management
Oozie: Workflow management
Phoenix: Relational database layer over HBase
Pig: Simpler scripting for MapReduce transformations
Sqoop: Data import and export
Tez: Allows data-intensive processes to run efficiently at scale
ZooKeeper: Coordination of processes in distributed systems

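A managed Hadoop service
Operating system updates and security patches are applied automatically
Hadoop versions evolve rapidly every year
Staying on the latest Hadoop version is easy

Combining Hadoop for advanced analytics
Cloud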

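HDInsight advantages
Automated provisioning of Hadoop clusters
Uses the latest stable Hadoop components
High availability and reliability for clusters
Economical, efficient storage through Azure Blob storage
Integrates with other Azure services, including Web Apps and SQL Database
Low cost of entry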

Cluster deployment
be removed January 1, 2017
https://portal.azure.com
https://azure.microsoft.com/en-us/documentation/templates/?term=hdinsight

The first cloud Hadoop solution to onboard LLAP (Live Long and Process) from the Stinger.next initiative, which promises sub-second querying on big data, 25x faster than existing Hive.

Apache Spark – A Unified Framework
A unified, open source, parallel data processing framework for big data analytics
[Diagram: Spark stack]
Libraries: Spark SQL (interactive queries), Spark Streaming (stream processing), Spark MLlib (machine learning), GraphX (graph computation)
Engine: Spark Core Engine
Cluster managers: YARN, Mesos, Standalone Scheduler

What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
• In-memory computing primitives
• General computation graphs
Up to 100× faster
Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Often 2-10× less code
Spark was started by Matei Zaharia at UC Berkeley AMPLab in 2009, open sourced in 2010, and donated to Apache in 2013

Spark for Azure HDInsight
[Diagram: decision makers (clients) connect to a Spark cluster of five Spark nodes running over a storage layer]

Spark Notebooks
Using the Spark shell to run interactive queries
Using the Spark shell to run Spark SQL queries
Using a standalone Scala program

Apache Spark benefits
Unified engine, ecosystem, developer productivity, performance

Advantages of a unified platform
Spark Streaming, machine learning, Spark SQL

Faster data, faster results
Spark is the 2014 Sort Benchmark winner, 3x faster than the 2013 winner (Hadoop). tinyurl.com/spark-sort
| Metric | 2013 Record (Hadoop) | Spark 100 TB |
| Data size (TB) | 102.5 | 100 |
| Time (min) | 72 | 23 |
| Nodes | 2100 | 206 |
| Cores | 50400 | 6592 |
[Chart: logistic regression on a 100-node cluster with 100 GB of data, Hadoop vs. Spark 0.9; y-axis from 0 to 140]

What makes Spark fast?
[Diagram: two pipelines. In the first, Step 1 and Step 2 each read from HDFS and write to HDFS. In the second, only Step 1 reads and writes from HDFS.]

Spark cluster architecture
[Diagram: driver program (SparkContext), cluster manager, four worker nodes; the worker nodes read from HDFS]

Developing Spark apps with notebooks
Jupyter and Zeppelin are two notebooks that work with Apache Spark

Jupyter
Language agnostic
Supports a rich Read-Evaluate-Print-Loop (REPL) protocol
Includes:
• Jupyter interactive web-based notebook
• Jupyter Qt console
• Jupyter terminal console
• Notebook viewer (nbviewer)
Supported languages (kernels)

Zeppelin architecture
[Diagram: browser client, connected over HTTP REST and WebSocket to the Zeppelin server; class loaders load interpreter groups; interpreters shown: Dep, Spark, Spark SQL, …; Maven]
Apache Spark is supported in Zeppelin with the Spark interpreter group, which consists of four interpreters.
| Name | Class | Description |
| %spark | SparkInterpreter | Creates SparkContext and provides a Scala environment |
| %pyspark | PySparkInterpreter | Provides a Python environment |
| %sql | SparkSQLInterpreter | Provides a SQL environment |
| %dep | DepInterpreter | Dependency loader |

Spark SQL overview
Create an Azure storage account: HDInsight uses an Azure Blob storage account for storing data.
HDInsight makes Apache Spark available as a service in the cloud.
Run Spark SQL statements using notebooks: you run interactive Spark SQL statements using notebooks.

Spark SQL overview
Built-in External
And more…


Big data processing technology remains a challenge for many organizations
What investment is your company making in big data?
| Response | Share |
| Fully deployed | 5% |
| Have a pilot in place | 11% |
| Currently investigating | 29% |
| Interested, but haven't investigated yet | 41% |
| Have investigated and decided not to pursue | 5% |
| Not being considered | 9% |
Interest in big data: 70%
Invested in big data: 16%
91% Hadoop usage concerns
71% Hadoop/BI tool inexperience

Microsoft data platform
Relational and beyond-relational, on-premises and cloud
Comprehensive, connected, choice
Azure offerings: SQL Server in an Azure VM, Azure SQL DB, Azure SQL DW, Azure Data Lake Analytics, Azure Data Lake Store, Azure HDInsight
Other offerings: Fast Track for SQL Server, Analytics Platform System, SQL Server 2016 + Superdome X, Hadoop
Also shown: Federated Query, Power BI, Azure Machine Learning, Azure Data Factory

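Azure SQL Data Warehouse service
A relational data warehouse service, fully managed and operated by Microsoft.
The industry's first elastic cloud data warehouse with SQL Server capabilities
Suits data storage needs from small to large
Elastic scale
Petabyte scale
MPP: massively parallel processing
Compute and storage are billed separately
Compute can be paused dynamically (dynamic pause)
[Diagram labels: SaaS (Office 365), Azure, public cloud]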

Azure SQL Data Warehouse service architecture
[Diagram: an application or user connection reaches the control node; four compute nodes, each with a SQL DB; DMS on every node; Blob storage [WASB(S)] underneath; HDInsight alongside]
Massively Parallel Processing (MPP) engine on Azure infrastructure and storage
Compute
Scale compute up or down when required (SLA <= 60 seconds).
Pause, Resume, Stop, Start.
100 DWU <-> 2000 DWU
Storage
Add/load data to WASB(S) without incurring compute costs
Storage and compute are decoupled, which gives a flexible service architecture and billing model (storage and compute resources are billed separately)
Data loading (SSIS, REST, OLE, ADO, ODBC, WebHDFS, AzCopy, PS)
DMS (Data Movement Service) runs on every database node

Azure SQL Data Warehouse service – Control node
[Diagram: the service architecture, with the control node highlighted]
Control node (SQL DB)
• Endpoint for connections
• Regular SQL endpoint (TCP 1433)
• Persists no user data (metadata only)
• Coordinates compute activity using MPP

Azure SQL Data Warehouse service – Compute nodes
[Diagram: the service architecture, with the compute nodes highlighted]
Compute node(s): Azure SQL Database (SQL DB)
An increase in DWU will increase the number of compute nodes

Azure SQL Data Warehouse service – Blob storage
[Diagram: the service architecture, with Blob storage highlighted]
• RA-GRS storage
• PB+ of storage
• Ingest data without incurring compute costs

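CREATE TABLE [Products] ( … )
WITH
(
DISTRIBUTION = HASH(<COLUMN>)
);
• Distributed tables spread data across all storage to improve performance
• Round-robin or hash-distributed
• Each compute node processes only its local data
• With column-based storage, SQL Data Warehouse can improve compression by up to 5x on average and query performance by 10x or more.
• Rows with the same hash value are assigned to the same node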

CREATE TABLE [build].[FactOnlineSales]
(
[OnlineSalesKey] int NOT NULL
, [DateKey] datetime NOT NULL
, [StoreKey] int NOT NULL
, [ProductKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [CurrencyKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [SalesOrderLineNumber] int NULL
, [SalesQuantity] int NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
)
;
CREATE TABLE [build].[FactOnlineSales]
(
[OnlineSalesKey] int NOT NULL
, [DateKey] datetime NOT NULL
, [StoreKey] int NOT NULL
, [ProductKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [CurrencyKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [SalesOrderLineNumber] int NULL
, [SalesQuantity] int NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
)
;
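
The difference between the two CREATE TABLE statements above (ROUND_ROBIN versus HASH on a column) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service's real hash function is not given in the deck, so `zlib.crc32` stands in for it, and the sample rows are made up.

```python
import zlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # the deck's DWU table lists 60 distributions in total


def hash_distribution(key):
    """Map a distribution-column value to a distribution (stand-in hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_DISTRIBUTIONS


def round_robin_distribution(row_index):
    """Assign rows to distributions in turn, ignoring their content."""
    return row_index % NUM_DISTRIBUTIONS


# Hypothetical fact rows: (OnlineSalesKey, ProductKey) with only 7 product keys.
rows = [(i, i % 7) for i in range(600)]

hash_placement = Counter(hash_distribution(product_key) for _, product_key in rows)
rr_placement = Counter(round_robin_distribution(i) for i, _ in enumerate(rows))
```

Round-robin spreads the 600 rows evenly, 10 per distribution, while hashing keeps every row with the same `ProductKey` together. With only seven distinct keys, the hashed rows land in at most seven distributions, which shows why a low-cardinality distribution column can skew the load.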

[Diagram: rows placed into the 60 numbered distributions (01 to 60) by HASH()]

Querying unstructured data with PolyBase
[Diagram: a single T-SQL query spans SQL Server and Hadoop]
Hadoop: taxi transactions (raw records), e.g. $658.39
SQL Server:
| Name | Date of birth | State |
| Jim Gray | 11/13/58 | WA |
| Ann Smith | 04/29/76 | ME |

Loading data with PolyBase
[Diagram: Azure Storage Blob(s), PolyBase, Azure SQL Data Warehouse (Engine; Worker1 to Worker6)]

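Elastic scale
Add compute for heavy, complex workloads, then scale back to everyday levels when they finish
Handle ad hoc, complex big data computations at any time
Mix and match compute and storage freely according to demand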

[Diagram: the 60 distributions (D1 to D60) in Azure Storage Blob(s), served first by the Engine and Worker1 alone, then by the Engine and Worker1 to Worker6]

Pause
Data is retained; there is no need to reload or restore it
While paused, you pay only for cloud storage, which greatly reduces cost
Can be automated through PowerShell or the REST API

[Diagram: Azure SQL Data Warehouse with its 60 distributions (D1 to D60)]

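Adjust through PowerShell, T-SQL, or the Azure portal
Scale
Move the DWU level to match peak and off-peak periods
Before a large data load or transformation, increase DWU accordingly so the data becomes available sooner
Pause
Compute is released; CPU and memory return to the pool of available resources
Only storage is billed (no compute charges)
Pausing cancels all in-flight queries. Transactional queries (those that modify your data or schema) may not stop quickly.
Saving cost by pausing
Pause on weekends = 28%
Pause at night = 35%
40 working hours only = 75%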
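
The slide does not say how the 28%, 35%, and 75% figures were derived. A minimal sketch, assuming savings are simply the share of the 168 hours in a week during which compute is paused, and assuming "at night" means 12 hours on each of five weeknights, comes close to those rounded numbers:

```python
HOURS_PER_WEEK = 7 * 24  # 168


def compute_savings(paused_hours_per_week):
    """Fraction of weekly compute hours not billed while the warehouse is paused."""
    if not 0 <= paused_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("paused hours must be between 0 and 168")
    return paused_hours_per_week / HOURS_PER_WEEK


weekend = compute_savings(2 * 24)                         # paused all weekend
weeknights = compute_savings(5 * 12)                      # assumption: 12 h x 5 nights
office_hours_only = compute_savings(HOURS_PER_WEEK - 40)  # running 40 h per week

for label, value in [("weekend", weekend),
                     ("weeknights", weeknights),
                     ("40 working hours only", office_hours_only)]:
    print(f"{label}: {value:.1%}")
```

Storage charges continue while paused, so these fractions apply to the compute portion of the bill only.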

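[Diagram labels: import, streaming ingest, store, query, export]

[Diagram labels: individual customer accounts, clickstream, enterprise accounts; DWU 200, DWU 600, DWU 1200]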

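Unit of compute scale: Data Warehouse Unit (DWU)
| Engine nodes | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Worker nodes | 1 | 2 | 3 | 4 | 5 | 6 | 10 | 12 | 15 | 20 | 30 | 60 |
| Total # of distributions | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 |
| # of distributions per node | 60 | 30 | 20 | 15 | 12 | 10 | 6 | 5 | 4 | 3 | 2 | 1 |
| Concurrency slots | 4 | 8 | 12 | 16 | 20 | 24 | 32 | 32 | 32 | 32 | 32 | 32 |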
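
The "distributions per node" row follows from the other two: the total stays fixed at 60 and is divided evenly among the worker nodes. A small sketch that reproduces the row from the worker-node counts in the table (the function name is illustrative):

```python
TOTAL_DISTRIBUTIONS = 60
WORKER_NODE_COUNTS = [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]


def distributions_per_node(worker_nodes):
    """Distributions each worker serves when 60 are split evenly across workers."""
    if TOTAL_DISTRIBUTIONS % worker_nodes != 0:
        raise ValueError("worker count must divide 60 evenly")
    return TOTAL_DISTRIBUTIONS // worker_nodes


per_node = [distributions_per_node(n) for n in WORKER_NODE_COUNTS]
print(per_node)
```

Every worker count in the table is a divisor of 60, which is what lets each scale level split the distributions without leaving any node with extra work.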

Power of integration
End-to-end platform built for the cloud
[Diagram labels: App Service, intelligent app, Hadoop, Azure Machine Learning, Power BI, Azure SQL Database, Azure SQL Data Warehouse]

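Summary
With next-generation technology, Microsoft Azure SQL Data Warehouse helps customers tackle modern big data challenges using familiar technologies and platforms, and gives them more choice in flexibility, performance, and price.
Azure SQL Data Warehouse is not only an enterprise-grade cloud data warehouse; it can also scale compute up or down within seconds and offers a pause feature that reduces cost.
Through integration with many analytics tools (Power BI, Azure Machine Learning), organizations large and small can use Azure SQL Data Warehouse to analyze and govern their data and uncover the value inside big data.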

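Azure Data Lake Store
A hyperscale data repository designed for big data analytics
A Hadoop file system (HDFS) offered in the cloud
No limit on data volume
Stores any data in its native format
Enterprise-grade access control and encryption
Performance optimized for analytics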

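What is Azure Data Lake Store?
A highly scalable, distributed cloud file system that supports parallel processing
Supports a variety of analytics frameworks
[Diagram: sources (LOB applications, social, devices, clickstream, sensors, video, web, relational); ADL Store; consumers (HDInsight, ADL Analytics, Machine Learning, Spark, R)]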

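ADL Store: architecture for unlimited scale
Files in ADL Store are split into blocks
Blocks are spread across different data nodes in the back-end storage system
Given enough data nodes, a file of any size can be stored
Conceptually, the back-end storage system in the Azure cloud has unlimited resources
Each file's metadata is stored by the same system
[Diagram: an Azure Data Lake Store file is split into Block 1, Block 2, …; the blocks are placed on data nodes in the back-end storage system]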

ADL Store delivers high throughput
ADL Store achieves high throughput through parallel reads
Each read operation is carried out in parallel across the data nodes
[Diagram: a read operation fetches the blocks of an Azure Data Lake Store file from several data nodes in the back-end storage system at once]
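
The block layout and parallel read described on these two slides can be sketched as follows. Everything here is a toy model: the 4-byte block size, the three data nodes, and the round-robin placement are assumptions chosen for readability, not properties of the service.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4   # bytes; deliberately tiny, hypothetical value
DATA_NODES = 3   # hypothetical number of data nodes


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, node_count=DATA_NODES):
    """Return {node_id: {block_index: bytes}} using round-robin placement."""
    nodes = {n: {} for n in range(node_count)}
    for index, block in enumerate(blocks):
        nodes[index % node_count][index] = block
    return nodes


def parallel_read(nodes, block_count):
    """Fetch all blocks concurrently and reassemble them in order."""
    def fetch(index):
        return nodes[index % len(nodes)][index]

    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # pool.map yields results in input order, so the file is rebuilt correctly.
        return b"".join(pool.map(fetch, range(block_count)))


payload = b"Azure Data Lake Store"
blocks = split_into_blocks(payload)
nodes = place_blocks(blocks)
restored = parallel_read(nodes, len(blocks))
```

Because consecutive blocks sit on different nodes, a reader can pull several blocks at the same time instead of waiting on a single disk, which is the throughput argument the slide makes.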

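ADL Store data security: role-based access control
Every file and directory is assigned an owner and a group
Files and directories can carry different permissions (read (r), write (w), execute (x)) for the owner, the group, and other users
Fine-grained access control lists (ACLs) can be assigned to specific users and groups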
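
The owner / group / other permission model on this slide works like POSIX file modes. A minimal sketch of such a check, with a hypothetical `Entry` type and `is_allowed` helper; the finer-grained ACLs mentioned on the slide are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A file or directory with POSIX-style ownership and mode bits."""
    owner: str
    group: str
    mode: str  # nine characters such as "rwxr-x---": owner, group, other


def is_allowed(entry, user, user_groups, permission):
    """Check one of 'r', 'w', 'x' against the triplet that applies to the user."""
    if permission not in ("r", "w", "x"):
        raise ValueError("permission must be 'r', 'w', or 'x'")
    offset = "rwx".index(permission)
    if user == entry.owner:
        triplet = entry.mode[0:3]
    elif entry.group in user_groups:
        triplet = entry.mode[3:6]
    else:
        triplet = entry.mode[6:9]
    return triplet[offset] == permission


report = Entry(owner="ann", group="analysts", mode="rwxr-x---")
print(is_allowed(report, "ann", set(), "w"))         # the owner may write
print(is_allowed(report, "jim", {"analysts"}, "w"))  # group members may not
```

Only one triplet is ever consulted: the owner's if the user owns the entry, otherwise the group's if the user belongs to it, otherwise the "other" triplet.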

ADL Store is an HDFS-compatible file system
Through its WebHDFS endpoint, Azure Data Lake Store is a Hadoop-compatible file system that integrates seamlessly with Azure HDInsight
[Diagram: in Azure HDInsight, MapReduce, HBase transactions, Hive queries, and any HDFS application use the Hadoop WebHDFS client, which calls the WebHDFS REST API on the WebHDFS endpoint of Azure Data Lake Store to reach ADL Store files]

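ADL Store: high availability and reliability
• Within each region, Azure keeps three copies of every data object, placed in different fault and upgrade domains
• Every operation is replicated to the other two copies and is committed only after replication completes
• Reads can be served from any replica
Data is never lost or unavailable even under failures
[Diagram: a write goes to Replica 1, Replica 2, and Replica 3 in separate fault/upgrade domains, then commits]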
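
The "commit only after all copies are written, read from any copy" rule can be sketched with a toy in-memory model. The `Replica` class and both functions are hypothetical, and the model simply refuses a write when a replica is down; how the real service repairs or replaces a failed replica is not covered in the deck.

```python
class Replica:
    """One in-memory copy of the data, which can be marked unavailable."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.available = True


def replicated_write(replicas, key, value):
    """Commit only if every replica can accept the write; otherwise change nothing."""
    if not all(replica.available for replica in replicas):
        return False
    for replica in replicas:
        replica.data[key] = value
    return True


def read_any(replicas, key):
    """Serve a read from the first available replica that holds the key."""
    for replica in replicas:
        if replica.available and key in replica.data:
            return replica.data[key]
    raise KeyError(key)


replicas = [Replica(name) for name in ("replica-1", "replica-2", "replica-3")]
committed = replicated_write(replicas, "block-7", b"payload")

replicas[0].available = False                 # simulate a fault-domain failure
value_after_failure = read_any(replicas, "block-7")
rejected = replicated_write(replicas, "block-8", b"more")
```

After the simulated failure the committed block is still readable from the surviving copies, while the uncommitted write leaves no partial data on any replica.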

ADL Store: Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources
Sources: server logs, Azure Storage Blobs, custom programs, Azure SQL DB, Azure SQL DW, Azure Tables (Table Storage), on-premises databases
Ingestion paths: Azure Event Hub, Apache Flume, ADLS built-in copy service, .NET SDK, JavaScript CLI, Azure Portal, Azure PowerShell, Azure Data Factory, Apache Sqoop

ADL Store: Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks
Targets: Azure SQL DB, Azure SQL DW, Azure Tables (Table Storage), on-premises databases, Azure Storage Blobs, custom programs
Export paths: Azure Data Factory, Apache Sqoop, ADLS built-in copy service, .NET SDK, JavaScript CLI, Azure Portal, Azure PowerShell

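Data Lake Store: technical specifications
Security: data access must support authorization management
Native format: stores data in its original format so lineage and provenance can be traced
Low latency: supports high-frequency data operations
Multiple analytics frameworks: supports batch, real-time, streaming, ML, etc.; no single framework can handle all data and every kind of analysis
Detail: can record data at full detail
Throughput: sustains the data access demands of parallel processing architectures such as Hadoop and Spark
Reliability: high availability and reliability
Scalability: accommodates rapidly growing data
Multiple data sources: ingests data from many kinds of sources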

Azure Data Lake Analytics
Enterprise-grade security
Highly scalable; compute can be adjusted at any time
Ready to use immediately, with nothing to provision first
Easy to use and highly customizable
Handles all types of data

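Azure Data Lake Analytics
A new distributed analytics service
Built on Apache YARN
Each query can specify its own execution scale, so users focus on business needs rather than hardware
Includes U-SQL, a language that mixes SQL query syntax with C# code
Integrates with Visual Studio to develop, debug, and tune code faster
Federated query across multiple Azure data sources
Enterprise-grade role-based access control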

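ADL Analytics features
• Designed for big data workloads
• Supports many data sources
• Simplifies management and lowers maintenance cost
• Processes massive data with the new U-SQL language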

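ADLA queries data where it lives
• No data movement: query tasks are dispatched to the data sources and executed there
• Avoids moving large volumes of data stored in different places across the network before querying
• Provides a single view of the data, wherever it is physically stored
• Reduces the data proliferation caused by multiple copies
• One query language for all data
• Each data source keeps its own management mechanisms
• Pushes SQL expressions down to run directly on remote SQL sources
  • Filters
  • Joins
[Diagram: a U-SQL query in Azure Data Lake Analytics sends queries to Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage]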

Work across all cloud data
Azure Data Lake
Analytics
Azure SQL DW Azure SQL DB
Azure
Storage Blobs
Azure
Data Lake Store
SQL Server in an
Azure VM

U-SQL syntax
Declarative SQL queries
• Uses SQL syntax: SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytic functions
• Easy to optimize
Handles structured and unstructured data
• Schema is determined when the file is read
• Supports relational metadata objects (e.g. database, table)
Highly extensible
• Based on the C# type system
• C# expression language
• User-defined functions (U-SQL and C#)
• User-defined aggregators (C#)
• User-defined operators (UDO) (C#)
Easily extended parallel processing and scale-out architecture
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Queries can be sent to different data sources for execution
// Reference the registered assembly that holds the custom C# code.
REFERENCE ASSEMBLY MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
// Apply a schema to the raw files as they are read (schema-on-read).
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
// Aggregate orders per customer; C# expressions appear inline in WHERE.
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
, AGG<MyAgg.MySum>(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
// Write the result with a custom outputter, then load it into the table.
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;

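Visual Studio integration
Integrates U-SQL, Hive, and Storm
Easy for beginners to get started
Rich tooling for experts
Visualizes job execution, and can replay it to find performance bottlenecks and optimize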

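Logical -> Physical Plan
Visualizes the execution structure and status
Each box is a vertex, representing one part of the overall job
The vertices in each SuperVertex (aka "stage") all perform the same operation on the same data set
Vertices in later stages may depend on vertices in an earlier stage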

Aggregating a 1.87 GB JSON file with a parallelism of 10
- Compile time: 28 seconds
- Execution time: 2 minutes

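Simplified management and maintenance
• Web-based management interface
• Automated scheduling through PowerShell
• Azure AD integration with role-based access control
• Monitoring of service operations and execution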

U-SQL
GitHub
Microsoft.Analytics.Samples.Formats/

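Azure offers a wide choice of big data architectures
A complete set of solutions helps enterprises innovate faster
| | IaaS Hadoop | Managed Hadoop | Big data as a service |
| Description | Any Hadoop technology | Optimized, managed Hadoop clusters | An analytics service designed for big data preparation |
| Offering | HDP, CDH, MapR (Azure Marketplace) | HDInsight | Azure Data Lake Analytics |
Storage: Data Lake Store, Azure Storage
[Axis labels: control, ease of use; user adoption]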

Microsoft Azure Data Lake
YARN
U-SQL
Analytics Service HDInsight
Store
HDFS

| | Azure SQL DW | HDInsight Hive | HDInsight Spark | Azure Data Lake | SQL Server (IaaS) |
| Volume | Petabytes | Petabytes | Petabytes | Petabytes | Terabytes |
| Security | Encryption, TD, Audit | ADLS / Apache Ranger | ADLS | AAD Security Groups (data) | Encryption, TD, Audit |
| Languages | T-SQL | HiveQL | SparkSQL, HiveQL, Scala, Java, Python, R | U-SQL | T-SQL |
| Extensibility | No | Yes, .NET/SerDe | Yes, Packages | Yes, .NET | Yes, .NET CLR |
| External file types | ORC, TXT, Parquet, RCFile | ORC, CSV, Parquet + others | Parquet, JSON, Hive + others | Many | ORC, TXT, Parquet, RCFile |
| Admin | Low-Medium | Medium-High | Medium-High | Low | High |
| Cost model | DWU | Nodes & VM | Nodes & VM | Units/Jobs | VM |
| Schema definition | Schema on Write / PolyBase | Schema on Read | Schema on Read | Schema on Read | Schema on Write / PolyBase |

The “Clusters” Big Data Approach
Hardware
Purchase
Maintaining
Hardware
Cluster Time
Nodes
Time
Wasted compute time vs. Productive
compute time

The “Clusterless” Big Data Approach
Intelligently managing the
cluster lifetime and scale
compute time
compute time with clusters
compute time with Azure Data Lake
Analytics
A clusterless approach
doesn’t have unused
compute time

Enabling Further Cost Optimizations
Productive compute time with Azure Data Lake
Analytics
Productive compute time vs Optimized compute time with
Azure Data Lake Analytics

| Category | Offering | Description | Audience |
| Finished apps & solutions | Preconfigured solutions / apps / solution templates | Ready-to-consume apps and solutions for solving specific business scenarios | BDM/TDM |
| Cloud analytics | Azure Machine Learning | Easy drag/drop UX with single-click operationalization | Citizen data scientist |
| R-based analytics | Microsoft R | Enterprise grade, write once deploy anywhere | Advanced data scientist |
| Analytics APIs | Cognitive Services | Ready-to-consume APIs for vision, speech, language, knowledge | Developer |
| Big data platform | HDInsight/Spark | Run large massively parallel compute and data jobs | Data engineer / data scientist |

Microsoft works with the Open Source community
[Diagram: MapReduce & Tez, U-SQL, and Spark (batch, interactive, streaming, ML) run on YARN, over Data Lake Store through WebHDFS]
FEDERATION to enable very large (100K+) YARN clusters, cross-DC, BCDR
REEF – "libc for Big Data"
AMOEBA – work-preserving pre-emption
RAYON – capacity reservation
MERCURY & YAQ – optimistic allocation + YARN conservatism to improve performance
OAuth support

Big Data Pipeline and Workflow

Big Data Pipeline and Data Flow in Azure
HDInsight
(Hadoop and
Spark)
Stream Analytics
Data Lake
Analytics
Machine
Learning

Real World Scenario with Azure Data Lake
On premises: massive archive (on-prem HDFS); active incoming data
Cloud: "landing zone" in Data Lake Store (moved to the cloud via AzCopy); Data Lake Store; Data Lake Analytics; Azure DW
Consumption:
• Machine learning at scale (customer segmentation and fraud detection)
• Experimentation at scale: drive changes based on customer behavior
• Web portals, mobile apps, Power BI, Jupyter data science notebooks

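On demand in the cloud; all kinds of data; services online quickly; data sharing and collaboration; open; supports the complete analytics workflow

Focus on solving data problems rather than standing up complex system environments
Handles the ingestion of large-volume, continuous, bursty data of all kinds from around the world

Structured data
Unstructured data
From a few MB to hundreds of PB in size

A Hadoop distributed file system in the cloud
Built on a native-like HDFS service
Accessible from all projects that support HDFS (Spark, Storm, Flume, Sqoop, Kafka, R, etc.)
Integrates with big data analytics stacks such as HDInsight, Hortonworks, and Cloudera
All data has potential value. Data Lake provides a single storage environment where enterprises can keep large volumes of raw data of every kind and process it in parallel, ready for future intelligent analytics and presentation.

Use the same tools from machine learning experiments through to operationalized predictive analytics APIs
Quickly move, train on, and score data for machine learning

Extracting value from data takes commitment from the whole company
Connects the different data ecosystems across the organization
Makes it easy to share what has been learned
Addresses the difficulty of obtaining data across departments and of developing and training data scientists

Embraces the open-source ecosystem
Combines a broad ecosystem for greater flexibility
Lets technical staff of all kinds use the tools they already know

The only vendor offering a complete solution from data ingestion through to action and data presentation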

Cortana Analytics Suite
Turning data into intelligent decisions and action through advanced analytics
Data sources: apps; sensors and devices
Data generation: IoT Hub, DocumentDB
Information management: Event Hubs, Data Catalog, Data Factory
Big data stores: SQL Data Warehouse, Data Lake Store
Machine learning and analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
Intelligence: Cortana, Bot Framework, Cognitive Services, Power BI
Decisions and action: people; automated systems; apps (web, mobile, bots)
