2. Challenges in Big Data & AI
Technology is becoming siloed
[Diagram: data platforms are "great for data, but not AI"; AI platforms are "great for AI, but not for data". Example data types: customer data, emails / web pages, sensor data (IoT), video / speech, click streams, …]
18. Unified Analytics Platform
[Architecture diagram: the Databricks Workspace (collaborative notebooks, production jobs) runs on the Databricks Runtime and the Databricks Cloud Service (transactions, indexing, ML frameworks), backed by Blob Storage and Data Lake Storage. Azure data sources: Event Hub, IoT Hub, Synapse Analytics, Cosmos DB, Azure Data Factory.]
19. Unified Analytics Platform
[Same architecture diagram as slide 18; this slide focuses on Cosmos DB as the data source.]
# Read configuration for the azure-cosmosdb-spark connector
readConfig = {
    "Endpoint": "https://doctorwho.documents.azure.com:443/",
    "Masterkey": "YOUR-KEY-HERE",
    "Database": "DepartureDelays",
    "Collection": "flights_pcoll",
    # Optional: push the filter down to Cosmos DB instead of filtering in Spark
    "query_custom": "SELECT c.date, c.delay, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
}

# Connect via azure-cosmosdb-spark to create a Spark DataFrame
flights = spark.read.format(
    "com.microsoft.azure.cosmosdb.spark").options(**readConfig).load()
flights.count()
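Hard-coding the master key in a notebook is risky. A minimal sketch of assembling the same configuration from environment variables instead (the variable names `COSMOS_ENDPOINT` and `COSMOS_MASTER_KEY` are assumptions for illustration, not part of the connector):

```python
import os

def build_read_config(database, collection, query=None):
    """Assemble the azure-cosmosdb-spark read configuration,
    pulling secrets from the environment rather than the notebook text.
    The environment variable names are illustrative, not a connector API."""
    config = {
        "Endpoint": os.environ["COSMOS_ENDPOINT"],
        "Masterkey": os.environ["COSMOS_MASTER_KEY"],
        "Database": database,
        "Collection": collection,
    }
    if query is not None:
        # Optional server-side filter, as in the slide's readConfig
        config["query_custom"] = query
    return config
```

On Azure Databricks, a secret scope read via `dbutils.secrets.get` is the more idiomatic home for the key, but the environment-variable version keeps this sketch self-contained.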
20. Unified Analytics Platform
[Same architecture diagram as slide 18; this slide focuses on Synapse Analytics as the data source.]
# Set up the Blob Storage account access key in the notebook session conf.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")

# Load data from a Synapse Analytics query.
df = (spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("query", "select x, count(*) as cnt from my_table_in_dw group by x")
    .load())
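The `tempDir` option must be a well-formed `wasbs://` URI pointing at the Blob Storage staging area. A minimal sketch of assembling it (the helper name and parameters are illustrative, not part of the connector API):

```python
def wasbs_temp_dir(container, account, directory):
    """Build the wasbs:// staging URI expected by the tempDir option
    of the com.databricks.spark.sqldw connector.
    Hypothetical helper; the URI layout follows the slide's example."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{directory}"
```

Keeping the URI construction in one place avoids the easy-to-miss mismatch between the account name used here and the one configured in `fs.azure.account.key.<account>.blob.core.windows.net` above.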
31. Delta Lake – Reliability
Key features:
● ACID transactions
● Schema enforcement
● Unified batch & streaming
● Time travel / snapshots
High-quality, highly reliable data, ready for analytics at any time.
[Diagram: streaming, batch, and update/delete workloads all write through the transaction log onto Parquet files.]

Time travel examples – regenerating past data, or rolling back an erroneous write:

SELECT count(*) FROM events TIMESTAMP AS OF timestamp

SELECT count(*) FROM events VERSION AS OF version

spark.read.format("delta").option("timestampAsOf",
    timestamp_string).load("/events/")

INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
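The `timestamp_string` passed to `timestampAsOf` must be something Spark can cast to a timestamp, such as a `yyyy-MM-dd HH:mm:ss` string. A minimal sketch of producing one for "one day ago", mirroring the SQL example's `date_sub(current_date(), 1)` (a plain `datetime` helper, not a Delta Lake API):

```python
from datetime import datetime, timedelta

def timestamp_one_day_ago(now=None):
    """Format a timestamp string one day in the past, suitable for
    the timestampAsOf read option. `now` is injectable for testing."""
    now = now or datetime.now()
    return (now - timedelta(days=1)).strftime("%Y-%m-%d %H:%M:%S")
```

The resulting string can then be passed directly as `timestamp_string` in the `spark.read` example above.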
37. Analytics & AI is the #1 investment for business leaders, yet they struggle to maximize ROI
[Statistics cited from the study: 80% / 55%]
From: "Understanding Why Analytics Strategies Fall Short for Some, but Not Others"
https://azure.microsoft.com/en-us/resources/why-analytics-strategies-fall-short-for-some-but-not-others/
38. Products and services that embed Apache Spark
• Azure Synapse Analytics
• Azure Data Factory - Mapping Data Flow *
• Azure Data Factory - Wrangling Data Flow
• SQL Server 2019 Big Data Cluster
* Uses Azure Databricks
39. Mapping Data Flow
• Resilient data transformation flows
• Transform at scale
• Code-free
• Operationalized with Data Factory
40. Products and services that embed Apache Spark
• Azure Cosmos DB
• Azure Synapse Analytics
• Azure Data Factory - Mapping Data Flow *
• Azure Data Factory - Wrangling Data Flow
• SQL Server 2019 Big Data Cluster
* Uses Azure Databricks