Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
2. Automate pipelines: Tools for building ETL pipelines
Mark Kromer
Sr. Program Manager, Information Management, Microsoft
@kromerbigdata
3. ETL Pipeline Objectives
• Consume hybrid, disparate data (Extract)
• On-premises + cloud
• Structured, unstructured, semi-structured
• Calculate and format data for analytics (Transform)
• Transform, aggregate, join, normalize
• Address large-scale Big Data requirements (Scale / Load)
• Scale up or scale out data movement and transformation
• Operationalize (Automate)
• Create pipeline orchestrations for different org requirements
• Manage & monitor multiple pipelines
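The four objectives above can be sketched as a toy pipeline. Everything here (function names, record shapes, the in-memory "warehouse") is illustrative, not any Microsoft API; a real pipeline would be built in SSIS or ADF, with the orchestrator handling scheduling and monitoring:

```python
# Minimal, illustrative Extract / Transform / Load sketch.
# All names and data shapes here are hypothetical.

def extract(sources):
    """Consume records from several disparate (hypothetical) sources."""
    rows = []
    for source in sources:
        rows.extend(source)  # in practice: API calls, file reads, DB queries
    return rows

def transform(rows):
    """Aggregate and normalize: count comments per blog post."""
    counts = {}
    for row in rows:
        counts[row["post_id"]] = counts.get(row["post_id"], 0) + 1
    return counts

def load(counts, destination):
    """Append results to a (hypothetical) destination store."""
    destination.update(counts)
    return destination

# One pipeline run; scheduling and monitoring would be handled by the
# orchestrator (ADF or the SSIS runtime), not by this script.
warehouse = load(transform(extract([
    [{"post_id": "a"}, {"post_id": "a"}],   # on-premises source
    [{"post_id": "b"}],                     # cloud source
])), {})
```

The "Automate" objective is exactly what the script does not do by itself, which is why the rest of the deck is about orchestration.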
4. Hybrid Data Integration Pattern 1: Analyze Blog Comments
[Architecture diagram] Blog comments are captured via API and dropped into Blob Storage; Azure Data Factory (PaaS) copies and looks up the data, with a Data Management Gateway required for ADF to reach on-premises SQL Server; self-hosted SSIS transforms via Data Flow (ETL) while Azure SQL Database transforms via SPROC (ELT); results are visualized and analyzed in a Power BI dashboard.
5. Hybrid Data Integration Pattern 2: Sentiment Analysis with Machine Learning
[Architecture diagram] Components: Azure Data Factory, Blob Storage, Azure Functions, Power BI.
6. Hybrid Data Integration Pattern 3: Modern Data Warehouse
[Architecture diagram] Daily flat files, an OLTP database, customer call details, and social media data (un/semi-structured) are ingested by Azure Data Factory (PaaS) and self-hosted SSIS into tables and analytical schemas, feeding an AML churn model.
7. SQL Server Integration Services (SSIS)
SSIS is a platform for building enterprise-grade data integration solutions
• User-friendly, code-free authoring and management client tools: SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS)
• Wealth of connectors and rich transformations to Extract, Transform, and Load (ETL) data between various sources and destinations, on premises and in the cloud
• Low Total Cost of Ownership (TCO)
19. Azure Feature Pack
Provides Azure connectivity components for SSIS:
1. Move / transfer hybrid data between various sources and destinations, on premises and in Azure
2. Develop ELT workflows with Big Data transformations / processing in Azure
3. Combine traditional ETL and modern ELT workflows spanning on-premises and Azure
22. SQL DW + SSIS – Quick Intro
SQL DW is Microsoft’s scale-out database in the cloud, built on a Massively Parallel Processing (MPP) architecture and capable of processing huge volumes of relational and non-relational data. It divides data and processing capability across multiple nodes: a Control Node receives, optimizes, and distributes requests to Compute Nodes that work in parallel.
There are two ways to load data into SQL DW using SSIS:
• Front-loading through the Control Node with data flows
• Back-loading through the Compute Nodes with PolyBase
23. Azure SQL DW Upload Task – Typical Scenario
[Diagram] On-premises: SQL Server → flat file → SSIS machine; Cloud: Azure Blob Storage → SQL DW
0. Export to a flat file
1. Upload to Azure Blob
2. Create an external table
3. Trigger PolyBase to load data
The Azure SQL DW Upload Task automates steps 1 – 2 – 3 above.
24. Azure SQL DW Upload Task
In the Azure SQL DW Upload Task Editor, you can:
1. Name and describe a create / insert table task
2. Select and configure UTF-8-encoded text file(s) as your data source
3. Select and configure the Azure Storage Connection Manager + a new / existing blob container as your data staging area
4. Select and configure the ADO.NET Connection Manager for SQL DW + a new / existing table as your data destination
5. Map source and destination columns for the create / insert table task
6. Define metadata / data types for source columns
25. Azure SQL DW Upload Task
Once you finish configuration in the Azure SQL DW Upload Task Editor, the T-SQL script that triggers PolyBase to load data from your Azure Blob Storage into SQL DW is generated automatically. You can still manually edit this auto-generated script to customize it for your particular needs.
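As a rough illustration of what the generated load script involves, here is a hedged Python sketch that assembles external-table and PolyBase-load T-SQL as strings. The object names, the `AzureBlobStage` data source, and the `Utf8TextFormat` file format are assumptions for illustration, not the exact statements the Upload Task emits:

```python
# Hedged sketch: building the kind of T-SQL the Azure SQL DW Upload
# Task auto-generates for step 3 (PolyBase load). All identifiers
# below (external table prefix, data source, file format) are
# hypothetical, not the task's actual output.

def build_polybase_load(container, table, columns):
    """Return T-SQL that defines an external table over blob-staged
    text files and loads it into a SQL DW table via PolyBase."""
    col_defs = ", ".join(f"[{name}] {sqltype}" for name, sqltype in columns)
    col_list = ", ".join(f"[{name}]" for name, _ in columns)
    return (
        f"CREATE EXTERNAL TABLE [ext_{table}] ({col_defs})\n"
        f"WITH (LOCATION = '/{container}/', DATA_SOURCE = AzureBlobStage,\n"
        f"      FILE_FORMAT = Utf8TextFormat);\n"
        f"INSERT INTO [{table}] ({col_list})\n"
        f"SELECT {col_list} FROM [ext_{table}];"
    )

script = build_polybase_load("staging", "FactSales",
                             [("SaleId", "INT"), ("Amount", "DECIMAL(18,2)")])
```

Generating the script as text like this is also what makes the "manually edit to customize" step natural: the output is plain T-SQL you can tweak before execution.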
27. Azure Data Factory: Hybrid Data Integration at Scale
[Overview graphic] Data movement takes raw data and data chaos to refined, organized data; data transformations deliver data clarity and better decisions through business intelligence. Example workloads: customer profitability, sentiment analysis, and market basket analysis, built on machine learning, Big Data analytics, SQL, NoSQL, data warehouse, and data lake.
28. ADF: Orchestrate data services at scale with a fully managed Data Integration cloud service
[Diagram: data sources → ingest → prepare → transform & analyze]
• Create, schedule, orchestrate, and manage data pipelines
• Visualize data lineage
• Connect to on-premises and cloud data sources
• Monitor data pipeline health
• Automate cloud resource management
• Move relational data for Hadoop processing
• Transform with Hive, Pig, PySpark, SQL SPROC or custom code
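To make the orchestration idea concrete, here is a minimal sketch of a pipeline described as JSON, in the spirit of ADF's activity model. The property names and activity shapes are simplified assumptions for illustration, not the exact ADF schema, and the stored procedure name is hypothetical:

```python
import json

# Illustrative only: a copy -> stored-procedure -> copy pipeline
# expressed as plain JSON, roughly in the spirit of ADF activities.
pipeline = {
    "name": "BlogCommentsPipeline",
    "activities": [
        {"name": "IngestComments", "type": "Copy",
         "inputs": ["BlobComments"], "outputs": ["SqlStaging"]},
        {"name": "TransformComments", "type": "StoredProcedure",
         "storedProcedureName": "usp_AggregateComments"},  # hypothetical SPROC
        {"name": "PublishForBI", "type": "Copy",
         "inputs": ["SqlStaging"], "outputs": ["SqlMart"]},
    ],
}

definition = json.dumps(pipeline, indent=2)
```

Declaring the pipeline as data rather than code is the design point: the service can schedule, retry, and monitor each activity without knowing anything about the transformation logic inside it.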
31. ELT with Apache Spark Activity from an ADF Pipeline
1. Create a new pipeline and an HDI cluster Linked Service for Spark from the Azure Portal
2. Invoke a Python script from the Spark activity to transform data at scale
3. Schedule, monitor, and manage the pipeline from ADF
4. Verify results and perform analytics from Jupyter notebooks / Power BI
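The "transform data at scale" step (step 2) might look like the following plain-Python stand-in; a real Spark job would express the same aggregation over pyspark DataFrames, and the comment data and field names here are hypothetical:

```python
# Plain-Python stand-in for the transformation a Spark activity's
# Python script might run at scale. Data and field names are
# hypothetical; a real job would use pyspark DataFrames.

comments = [
    {"post": "launch", "sentiment": 0.9},
    {"post": "launch", "sentiment": 0.5},
    {"post": "outage", "sentiment": 0.1},
]

# Aggregate mean sentiment per post: accumulate (sum, count) pairs,
# then divide -- the same shape as a groupBy().agg() in Spark.
totals = {}
for c in comments:
    s, n = totals.get(c["post"], (0.0, 0))
    totals[c["post"]] = (s + c["sentiment"], n + 1)

mean_sentiment = {post: s / n for post, (s, n) in totals.items()}
```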
33. ELT with Azure Data Lake from an ADF Pipeline
1. Create a new pipeline and an Azure Data Lake Analytics Linked Service from the Azure Portal
2. Perform data transformations at scale with a U-SQL script
3. Schedule, monitor, and manage the pipeline from ADF
4. Verify results and perform analytics from ADLA
52. Microsoft Data Integration Roadmap
• SSIS
• SQL Server 2017
• SSIS on Linux
• Scale-out
• ADF
• SSIS as a Cloud “Integration Runtime”
• Code-free web-based user experience
• Control Flow orchestration + Data Flow steps
• On-Demand Spark Cluster