Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platform, and intelligent (SQLBits 2018)

Data sources
Non-relational data
DESIGNED FOR THE
QUESTIONS YOU KNOW!

The Data Lake Approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis: Hadoop, Spark, R, Azure Data Lake Analytics (ADLA)
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices

HDFS Compatible REST API
ADL Store
.NET, SQL, Python, R scaled out by U-SQL
ADL Analytics
Open Source Apache Hadoop ADL Client
Azure Databricks
HDInsight
Hive
• Performance at scale
• Optimized for analytics
• Multiple analytics engines
• Single repository sharing

Governance and Master Data Management
Azure SQL Data Warehouse
Data Quality and Lineage
ERP, CRM, and other LOB Data
OLTP and other RDBMS
Clickstream Logs and Events
Sensors, Social, Weather, other unstructured data
ETL
Azure Data Lake Analytics (U-SQL)
Azure Data Lake Store
Spark on HDI, Databricks
BI Models
Reports and Dashboards
Apache Hadoop on HDInsight
Polybase
Analyst
Power User
Data Engineer
Data Scientist
Big Data Warehouse

Streaming Layer
Clean, Curate, Aggregate
Combine reference data
Perform Scoring from ML models
IoT Sensors and/or User activity streams
Social, Trends, Weather etc.
Clickstream, Batch Files, server logs, Images, videos, and other unstructured data
Event Broker (Event Hubs, Apache Kafka)
Azure Data Lake Analytics (U-SQL)
Spark on HDI, Databricks
Event Broker
Realtime Dashboards
Apache Hadoop on HDInsight
Analyst
Data Engineer
Data Scientist
Trained Machine Learning Models
Reference Data
Realtime Processing with Lambda Architecture
Automated Systems

Data Lake Analytics Workloads
With BATCH workload, Data Lake Analytics is ideal for
• The transformation and preparation of data for use in other systems
• Analytics on VERY LARGE amounts of data
• Massively Parallel programs written in .NET, Python and R, scaled out with U-SQL
• Performing Cognition at Scale on large collections

Introducing U-SQL
A framework for Big Data
Scales out your custom code in .NET, Python, R over your Data Lake
Familiar syntax to millions of SQL & .NET developers
Unifies
• Declarative nature of SQL with the imperative power of your programming language
• Processing of structured, semi-structured and unstructured data
• Querying multiple Azure Data Sources (Federated Query)

U-SQL can query data from multiple sources in Azure. Where possible, data transformation is pushed close to the remote query engine to minimize data transfer and maximize performance.
Easily query data in multiple Azure data stores without moving it to a single store
Data sources shown: Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, Azure Data Lake Storage (queried from Azure Data Lake Analytics with U-SQL)

Embedded Artificial Intelligence
Host Deep Neural Networks (DNNs)
6 Built-in Cognitive Functions
• Face API
• Image Tagging
• Emotion analysis
• OCR
• Text Key Phrase Extraction
• Text Sentiment Analysis

Compose, orchestrate & monitor data services at scale
• Fully managed service to support orchestration of data movement and transformation
• Connect to relational or non-relational data that is on-premises or in the cloud
• Single pane of glass to monitor and manage data processing pipelines
• Globally deployed service infrastructure
• Cost effective
Stored Procedures
Hadoop on Azure
Trusted data
BI & analytics
Data Lake Analytics
Custom Code
Machine Learning

Data stores supported as source and/or sink, by category
Azure: Azure Blob storage, Azure SQL Database, Azure SQL Data Warehouse, Azure Table storage, Azure DocumentDB
Databases: SQL Server*, Oracle*, MySQL*, DB2*, Teradata*, PostgreSQL*, Sybase*, Cassandra*, MongoDB*, Amazon Redshift
File: File System*, HDFS*, Amazon S3
Others: Salesforce, Generic ODBC*, Generic OData, Web Table (table from HTML), GE Historian*
Connects ADL Store out-of-the-box to all your stores

{
  "name": "ComputeEventsByRegionPipeline",
  "properties": {
    "description": "This is a U-SQL pipeline.",
    "activities": [ {
      "type": "DataLakeAnalyticsU-SQL",
      "typeProperties": {
        "scriptPath": "scripts\\kona\\SearchLogProcessing.txt",
        "scriptLinkedService": "StorageLinkedService",
        "degreeOfParallelism": 3,
        "priority": 100,
        "parameters": {
          "in": "/datalake/input/SearchLog.tsv",
          "out": "/datalake/output/Result.tsv" } },
      "inputs": [ { "name": "DataLakeTable" } ],
      "outputs": [ { "name": "EventsByRegionTable" } ],
      "policy": {
        "timeout": "06:00:00",
        "concurrency": 1,
        "executionPriorityOrder": "NewestFirst",
        "retry": 1 },
      "scheduler": { "frequency": "Day", "interval": 1 },
      "name": "EventsByRegion",
      "linkedServiceName": "AzureDataLakeAnalyticsLinkedService" } ],
    "start": "2015-08-08T00:00:00Z",
    "end": "2015-08-08T01:00:00Z",
    "isPaused": false } }
https://docs.microsoft.com/en-us/azure/data-factory/v1/data-factory-usql-activity
https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-data-lake-analytics

Original Jobs View and New Pipeline Jobs View
Original
• List jobs submitted in the last 30 days
• Aggregate trends of jobs over 30 days
• Order and filter list of jobs
New
• Superset of original jobs view
• Adds grouping of jobs by pipelines & recurrences
• Jobs and consumption trends per pipeline
• Quickly identify pipelines and jobs to troubleshoot
• Quickly compare failed jobs with “last known good” instance
• Manage pipeline cost, improve efficiency and predict future cost
How to use
• Create ADF v2 pipelines containing ADLA U-SQL activities
• Pipelines and Recurrences automatically appear in ADLA portal
• Submit and monitor pipeline/recurring jobs using Azure PowerShell, ADLA SDK and REST APIs

• Automatic "in-lining", optimized out-of-the-box
• Per-job parallelization, visibility into execution
• Heatmap to identify bottlenecks

EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text with lots of options, Parquet
• Custom Extractors: e.g., JSON, XML, etc. (see http://usql.io)
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text, Parquet
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
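The three path forms above can be told apart mechanically. The following is a minimal Python sketch of that distinction, not part of the U-SQL toolchain: the helper name `classify_path` is hypothetical, and it relies only on the standard library's `urllib.parse`.

```python
from urllib.parse import urlsplit

def classify_path(path):
    """Classify a U-SQL style file path as relative, ADLS, or WASB (hypothetical helper)."""
    parts = urlsplit(path)
    if parts.scheme == "adl":
        return {"kind": "ADLS", "account": parts.hostname, "path": parts.path}
    if parts.scheme == "wasb":
        # wasb://container@account/...: the container sits in the user-info position
        return {"kind": "WASB", "container": parts.username,
                "account": parts.hostname, "path": parts.path}
    # No scheme: resolved against the default ADL Storage account
    return {"kind": "relative", "path": path}

for example in ("filepath/file.csv",
                "adl://account.azuredatalakestore.net/filepath/file.csv",
                "wasb://container@account/filepath/file.csv"):
    print(classify_path(example))
```

The sketch only inspects the string; it does not check that the account or file exists.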

Simple pattern language on filename and path
DECLARE @pattern string =
"/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Limits on number of files and file sizes can be improved with
SET @@FeaturePreviews = "FileSetV2Dot5:on,InputFileGrouping:on,AsyncCompilerStoreAccess:on";
(Will become default between now and middle of year)
Virtual columns
EXTRACT name string
, suffix string // virtual column
, date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in predicates to get partition elimination
• Warning gets raised if no partition elimination was found
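To make the idea of virtual columns and partition elimination concrete, here is a small Python sketch. It is an illustration only, not how the U-SQL compiler works internally: the regular expression is a hand-written approximation of the slide's pattern, and the helper names and sample paths are hypothetical.

```python
import re
from datetime import date

# Hand-written approximation of the slide's pattern
# "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"
PATTERN = re.compile(
    r"^/input/(?P<yyyy>\d{4})/(?P<MM>\d{2})/(?P<dd>\d{2})/"
    r"(?P<name>[^/]*)\.(?P<suffix>[^/.]+)$"
)

def virtual_columns(path):
    """Return the (date, suffix) virtual columns for a path, or None if it does not match."""
    m = PATTERN.match(path)
    if m is None:
        return None
    return date(int(m["yyyy"]), int(m["MM"]), int(m["dd"])), m["suffix"]

def files_to_read(paths, wanted_date):
    """Keep only files whose date virtual column equals wanted_date.

    Files for other dates are skipped without being opened, which is the
    effect partition elimination has on a predicate over a virtual column.
    """
    selected = []
    for p in paths:
        cols = virtual_columns(p)
        if cols is not None and cols[0] == wanted_date:
            selected.append(p)
    return selected

# Made-up sample paths
paths = [
    "/input/2018/02/21/a.csv",
    "/input/2018/02/22/b.csv",
    "/input/2018/02/22/c.tsv",
    "/other/2018/02/22/d.csv",
]
print(files_to_read(paths, date(2018, 2, 22)))
```

A predicate that does not mention a virtual column would force every matching file to be read, which is the situation the warning on the slide points at.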

U-SQL catalog object model (diagram)
ADLA Account/Catalog > Database [1,n] > Schema [1,n] > objects [0,n]
Objects shown: tables (clustered index, partitions, statistics), external tables, views, TVFs, procedures, table types, packages, credentials, data sources, user objects
C# objects shown: assemblies, functions, UDAggs, UDTs, extractors, processors, appliers, reducers, combiners, outputters
Legend: refers to; contains; implemented and named by; metadata (MD) name; C# name

• Naming
• Discovery
• Sharing
• Securing
U-SQL Catalog
Naming
• Default Database and Schema context: master.dbo
• Quote identifiers with []: [my table]
• Stores data in ADL Storage /catalog folder
Discovery
• Visual Studio Server Explorer
• Azure Data Lake Analytics Portal
• SDKs and Azure PowerShell commands
• Catalog Views: usql.databases, usql.tables etc.
Sharing
• Within an Azure Data Lake Analytics account
• Across ADLA accounts that share same Azure Active Directory:
• Referencing Assemblies
• Calling TVFs, Procedures and referencing tables and views
• Inserting into tables
Securing
• Secured with AAD principals at catalog and Database level

Views
CREATE VIEW V AS EXTRACT…
CREATE VIEW V AS SELECT …
• Cannot contain user-defined objects (e.g. UDF or UDOs)!
• Will be inlined
Table-Valued Functions (TVFs)
CREATE FUNCTION F (@arg string = "default")
RETURNS @res [TABLE ( … )]
AS BEGIN … @res = … END;
• Provides parameterization
• One or more results
• Can contain multiple statements
• Can contain user-code (needs assembly reference)
• Will always be inlined
• Infers schema or checks against specified return schema

CREATE PROCEDURE P (@arg string = "default") AS
BEGIN
…;
OUTPUT @res TO …;
INSERT INTO T …;
END;
• Provides parameterization
• No result but writes into file or table
• Can contain multiple statements
• Can contain user-code (needs assembly reference)
• Will always be inlined
• Can contain DDL (but no CREATE, DROP FUNCTION/PROCEDURE)

DECLARE @variable SqlArray<int> =
new SqlArray<int>{1,2};
DECLARE @variable = new SqlArray<int>{1,2};
• Provides named and typed scalar expressions
• Option to infer the type of the scalar variable
DECLARE EXTERNAL @parameter = "string value";
• Provides an overridable default for a scalar variable
• Allows external parameter models (e.g., Azure Data Factory)
DECLARE CONST @const_expression = "my "+@parameter;
• Checks and guarantees that the expression is evaluated at compile time, otherwise raises an error.

CREATE TABLE T (col1 int
, col2 string
, col3 SQL.MAP<string,string>
, INDEX idx CLUSTERED (col2 ASC)
PARTITIONED BY (col1)
DISTRIBUTED BY HASH (col2)
);
• Structured Data, built-in Data types only (no UDTs)
• Clustered Index (needs to be specified): row-oriented
• Fine-grained distribution (needs to be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN
• Addressable Partitions (optional)
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infer the schema from the query
• Still requires index and distribution (does not support partitioning)

Benefits of Table clustering and distribution
• Faster lookup of data provided by distribution and clustering when right
distribution/cluster is chosen
• Data distribution provides better localized scale out
• Used for filters, joins and grouping
Benefits of Table partitioning
• Provides data life cycle management (“expire” old partitions)
• Partial re-computation of data at partition level
• Query predicates can provide partition elimination
Do not use when…
• No filters, joins and grouping
• No reuse of the data for future queries
If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
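The benefit of hash distribution for filters can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not the ADLA storage engine: `zlib.crc32` stands in for whatever hash function the service uses, and the function names and sample rows are hypothetical.

```python
import zlib

def bucket_of(key, buckets):
    """Deterministic hash bucket for a string key (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % buckets

def distribute(rows, key, buckets):
    """Split rows into `buckets` lists by the hash of rows[key]."""
    out = [[] for _ in range(buckets)]
    for row in rows:
        out[bucket_of(row[key], buckets)].append(row)
    return out

def lookup(distributed, key, value):
    """Equality filter that scans only the one bucket that can hold `value`."""
    bucket = distributed[bucket_of(value, len(distributed))]
    return [row for row in bucket if row[key] == value]

# Made-up sample rows
rows = [{"col1": i, "col2": name}
        for i, name in enumerate(["a", "b", "c", "a", "d", "b", "a"])]
table = distribute(rows, "col2", 4)
print(lookup(table, "col2", "a"))
```

Rows with equal keys always land in the same bucket, so an equality filter, join or grouping on the distribution key can work bucket by bucket. A query that never filters, joins or groups on that key gains nothing from it, which matches the "Do not use when…" advice above.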

ALTER TABLE T ADD COLUMN eventName string;
ALTER TABLE T DROP COLUMN col3;
ALTER TABLE T ADD COLUMN result string, clientId string,
payload int?;
ALTER TABLE T DROP COLUMN clientId, result;
• Meta-data only operation
• Existing rows will get
• Non-nullable types: C# data type default value (e.g., int will be 0)
• Nullable types: null
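Why adding a column can be a metadata-only operation is easiest to see in a small model. The Python sketch below is a simplified illustration, not the actual storage format: the schema is a list of (name, type) pairs, stored rows are never rewritten, and missing values are filled in at read time. All names are hypothetical, and only a few C# default values are listed.

```python
# C# default values for a few non-nullable types
DEFAULTS = {"int": 0, "bool": False, "double": 0.0}

def add_column(schema, name, type_name):
    """Metadata-only change: return a new schema with the column appended."""
    return schema + [(name, type_name)]

def read_row(stored, schema):
    """Project a stored row onto the current schema, filling columns added later."""
    row = {}
    for name, type_name in schema:
        if name in stored:
            row[name] = stored[name]
        elif type_name.endswith("?") or type_name == "string":
            row[name] = None  # nullable type (string is nullable) -> null
        else:
            row[name] = DEFAULTS[type_name]  # non-nullable -> type default
    return row

schema = [("col1", "int"), ("col2", "string")]
old_row = {"col1": 7, "col2": "x"}  # written before the ALTER TABLE

schema = add_column(schema, "eventName", "string")
schema = add_column(schema, "payload", "int?")
schema = add_column(schema, "count", "int")
print(read_row(old_row, schema))
```

The stored row is untouched by `add_column`; only the schema list changes.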

U-SQL extensibility
Extend U-SQL with C#/.NET/Python/R
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs) in .NET/Python/R

What are UDOs?
Custom Operator Extensions in language of your choice
Scaled out by U-SQL
• PROCESS
• COMBINE
• REDUCE
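As a rough mental model, a PROCESS operator is called once per row and a REDUCE operator once per group of rows that share a key. The Python sketch below illustrates only that calling contract; it is not the U-SQL UDO interface, and the function names and sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def process(rows, processor):
    """PROCESS-style operator: call `processor` once per row; it returns a row or None."""
    out = []
    for row in rows:
        result = processor(row)
        if result is not None:
            out.append(result)
    return out

def reduce_on(rows, key, reducer):
    """REDUCE-style operator: call `reducer` once per group of rows sharing `key`."""
    keyfunc = itemgetter(key)
    out = []
    for value, group in groupby(sorted(rows, key=keyfunc), key=keyfunc):
        out.extend(reducer(value, list(group)))
    return out

# Made-up sample rows
rows = [
    {"region": "en-us", "duration": 10},
    {"region": "en-gb", "duration": 5},
    {"region": "en-us", "duration": 7},
]

def to_upper(row):
    return {**row, "region": row["region"].upper()}

def total_duration(region, group):
    return [{"region": region, "total": sum(r["duration"] for r in group)}]

print(reduce_on(process(rows, to_upper), "region", total_duration))
```

Because each call sees only one row or one group, the engine is free to run many such calls in parallel, which is what "scaled out by U-SQL" refers to.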

JSON Processing
https://github.com/Azure/usql/tree/master/Examples/DataFormats
https://github.com/Azure/usql/tree/master/Examples/JSONExamples

Microsoft.Analytics.Samples.Formats
Newtonsoft.Json, Apache Avro, System.Xml

@json =
EXTRACT personid int,
name string,
addresses string
FROM @input
USING new Json.JsonExtractor("[*].person");
@person =
SELECT personid,
name,
Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
FROM @json;
@addresses = SELECT personid, name, Json.JsonFunctions.JsonTuple(address) AS address
FROM @person
CROSS APPLY
EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);
@result =
SELECT personid,
name,
address["addressid"] AS addressid,
address["street"] AS street,
address["postcode"] AS postcode,
address["city"] AS city
FROM @addresses;
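The same flattening can be written in plain Python to show what the script produces: one output row per (person, address) pair. This is a sketch under an assumed input shape, inferred from the `[*].person` path and the column names in the script; the sample document is invented for illustration.

```python
import json

def flatten_addresses(text):
    """Return one flat dict per (person, address) pair."""
    rows = []
    for item in json.loads(text):          # corresponds to the "[*]" path step
        person = item["person"]            # corresponds to ".person"
        for address in person["addresses"]["address"]:  # the exploded array
            rows.append({
                "personid": person["personid"],
                "name": person["name"],
                "addressid": address["addressid"],
                "street": address["street"],
                "postcode": address["postcode"],
                "city": address["city"],
            })
    return rows

# Invented sample document in the assumed shape
sample = json.dumps([
    {"person": {"personid": 1, "name": "Ann", "addresses": {"address": [
        {"addressid": 10, "street": "Main St", "postcode": "A1", "city": "Oslo"},
        {"addressid": 11, "street": "High St", "postcode": "B2", "city": "Bergen"},
    ]}}},
    {"person": {"personid": 2, "name": "Bob", "addresses": {"address": [
        {"addressid": 20, "street": "Low Rd", "postcode": "C3", "city": "Leeds"},
    ]}}},
])

for row in flatten_addresses(sample):
    print(row)
```

The inner loop plays the role of CROSS APPLY EXPLODE: a person with two addresses yields two rows, each repeating `personid` and `name`.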

https://github.com/Azure/usql/tree/master/Examples/ImageApp
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive
Example image tags: Car, Green, Parked, Outdoor, Racing

Additional Resources
http://usql.io
http://blogs.msdn.microsoft.com/azuredatalake/
http://blogs.msdn.microsoft.com/mrys/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
http://aka.ms/usql_reference
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
https://docs.microsoft.com/en-us/azure/data-lake-analytics/
https://msdn.microsoft.com/en-us/magazine/mt614251
https://msdn.microsoft.com/magazine/mt790200
http://www.slideshare.net/MichaelRys
Getting Started with R in U-SQL
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql
http://aka.ms/adlfeedback
Continue your education at Microsoft Virtual Academy online.

Learn more about Azure Data Lake
usql@microsoft.com, @MikeDoesBigData
Thank You!
http://aka.ms/azuredatalake
