Using existing language skillsets to create large-scale, cloud-based analytics
1.
2.
3. Model & ServePrep & Train
Data Lake Analytics
Store
Data Lake Store
Ingest
Data Factory
SQL Data
Warehouse
Databricks SPARK
HDInsight SPARK
SQL DB
(reference data)
• Landing zone structure
• File size distribution
• File types and formats
• Languages & Existing Libraries
• Data “Cooking”: Normalization and Enrichment
• Partition & Structure data for performance or for
serving
• Interactively analyze data
• Notebooks
• PySpark
• SQL
HDInsight Hive LLAP
5. Overall U-SQL Batch Job Execution Lifetime (1)
[Diagram: Author → Front-End Service → Compiler & Optimizer (consulting the U-SQL Catalog) → plan → optimized plan with stage codegen (C++/C# compilation) → Job Scheduler & Queue → Job Manager (vertex scheduling on containers) → vertexes running in YARN containers with the U-SQL Runtime, reading and writing Local Storage and Data Lake Store → Consume]
6. Overall U-SQL Batch Job Execution Lifetime (2)
[Same diagram, annotated with the phases of a job: Preparation Phase (authoring, compilation, optimization, and vertex codegen as C++/C# compilation) → Queueing → Execution Phase (vertex scheduling and execution) → Finalization Phase]
7. Vertex Execution View
• Open a job and click Vertex Execution View (this will require the job profile to load).
• Filter which vertices get shown.
• Visualization of vertex execution.
• Vertex details.
8. Vertex Execution View
• A selected-vertex indicator marks the row of information for the currently selected vertex.
• Colors indicate what is happening with the vertex:
• (blue) CREATING – the vertex is being set up on a container (e.g., user code resources are being copied to the container).
• (orange) QUEUING – the vertex is waiting to start on the container. Other vertexes may be using AUs, so this vertex is waiting for an AU to become available.
• (green) RUNNING – the vertex is actually doing work.
• In this case, notice that the vertex creation time is much larger than the time the vertex took to do its work.
9. Big Data is made of MANY, SMALL files
[Histogram: count of files by file size, from MB through GB and TB to PB]
• Most files are way less than a TB.
• Very long tail: some files are HUGE.
10. Recap: EXTRACT from FileSet
Without FileSets – explicitly list the input files:

@rows =
    EXTRACT name string, id int
    FROM "/file1.tsv",
         "/file2.tsv",
         "/file3.tsv"
    USING Extractors.Tsv();

With a FileSet – EXTRACT every file in a folder. The value for the column named "suffix" comes from the {suffix} token in the path pattern (it is the matching part of the filename); see the sketch below.
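A minimal sketch of the FileSet form (the /input/ folder and the schema are assumptions for illustration):

@rows =
    EXTRACT name string,
            id int,
            suffix string    // virtual column: filled with the part of the filename matched by {suffix}
    FROM "/input/{suffix}"
    USING Extractors.Tsv();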
11. Recap: EXTRACT from FileSet
FileSet – EXTRACT with a pattern: path tokens such as {date:yyyy}/{date:MM}/{date:dd}/{suffix} fill the virtual columns date and suffix.
FileSet – EXTRACT with a pattern and partition elimination: a predicate on the virtual column restricts which files are read at all:

WHERE date >= System.DateTime.Parse("2016/1/1") AND
      date < System.DateTime.Parse("2016/2/1");
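A minimal end-to-end sketch of the pattern form (the /input/ folder layout and the schema are assumptions for illustration):

@rows =
    EXTRACT name string,
            id int,
            date DateTime,   // virtual column assembled from the {date:...} tokens
            suffix string    // virtual column from the {suffix} token
    FROM "/input/{date:yyyy}/{date:MM}/{date:dd}/{suffix}"
    USING Extractors.Tsv();

@january =
    SELECT name, id
    FROM @rows
    WHERE date >= System.DateTime.Parse("2016/1/1") AND
          date < System.DateTime.Parse("2016/2/1");   // partition elimination: only January 2016 files are read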
12. Working with MANY input files
• A U-SQL script has an upper bound on the number of input files it can work on.
• Yesterday's limit was a few thousand files.
• The new limit is in the 100,000s of input files – with no syntax change from before.
13. Working with MANY, SMALL files
TODAY
• Every file requires a separate EXTRACT vertex.
• Lots of small files = lots of EXTRACT vertices.
• Vertices have a startup & shutdown cost that may be much larger than the time required to read a small file.
[Diagram: files f1–f6, each feeding its own EXTRACT vertex]
TOMORROW
• When possible, the same EXTRACT vertex will be used for multiple small files.
• Up to 200 files or 1 GB of data (whichever is reached first).
• In PUBLIC PREVIEW now! GA Summer 2018.

SET @@FeaturePreviews = "InputFileGrouping:on";

[Diagram: files f1–f6 grouped onto two EXTRACT vertices; a full sketch follows below]
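A minimal sketch enabling the preview for a FileSet extract (the path and schema are assumptions for illustration):

SET @@FeaturePreviews = "InputFileGrouping:on";

@rows =
    EXTRACT name string,
            id int,
            suffix string
    FROM "/smallfiles/{suffix}"
    USING Extractors.Tsv();   // small input files are now grouped onto shared EXTRACT vertices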
15. Input File Grouping (for 15 small files)
Before – 15 vertices: ~5 second vertex creation time (blue), ~1 second vertex execution time (green) each.
After – 2 vertices: ~5 second vertex creation time (blue), ~5 second vertex execution time (green) each.
17. Built-in Format Support: CSV/TSV & friends
• Extractors.Csv()|Tsv()|Text(delimiter: )
• Outputters.Csv()|Tsv()|Text(delimiter: )
Major options:
• encoding: UTF-8 (default), UTF-16, ASCII, Windows-125x
• skipFirstNRows: skip header lines (extractor only)
• outputHeader: true or false (outputter only)
• quoting: true or false
  • Handles "..."-quoted fields to guard delimiters in text.
  • DOES NOT guard the end-of-line delimiter!
• silent: true or false (extractor only)
  • Allows skipping rows with a mismatched number of columns.
  • Casts invalid values to NULL if the target type is nullable.
  • DOES NOT handle wrong encodings, rows that are too long, etc.!
• nullEscape: character representation of a null value in the input
• escapeChar: character to escape delimiter characters
Will execute in parallel:
• Based on the file set definition.
• Every 1 GB will be a separate vertex.
• Every vertex will execute 4 extractor instances in parallel on 250 MB + 4 MB each.
As a consequence:
• Large CSV files will get parallelized.
• An extractor instance will only see its own data and nothing else.
Supports column pruning:
• Only columns needed in the script are fully extracted.
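A minimal sketch combining several of these options (the path and schema are assumptions for illustration):

@rows =
    EXTRACT name string,
            amount double?
    FROM "/input/data.csv"
    USING Extractors.Csv(
        skipFirstNRows: 1,   // skip the header line
        quoting: true,       // honor "..."-quoted fields
        silent: true);       // skip rows with a mismatched column count

OUTPUT @rows
TO "/output/data.tsv"
USING Outputters.Tsv(outputHeader: true);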
18. Built-in Format Support: Parquet (Preview)
Extractors.Parquet(), Outputters.Parquet()
SET @@FeaturePreviews = "EnableParquetUdos:on";
Major options (on the outputter only):
• rowGroupSize: size of a row group in MB
• rowGroupRows: size of a row group in rows
• columnOptions:
  ColOptions := ColOption [',' ColOption].
  ColOption := columnindex ( [':'decimalPrecision]['.'decimalScale] | [':'DateTimePrecision] ) ['~'Compression].
  DateTimePrecision := 'micro' | 'milli' | 'days'.
  Compression := 'uncompressed' | 'snappy' | 'brotli' | 'gzip'.
Will execute in parallel only if:
• A file set definition is used on input and output.
As a consequence:
• Aim to generate and read Parquet files of 300 MB to 3 GB in size.
Supports column pruning:
• Only columns needed in the script are fully extracted.
PRO TIP:
OUTPUT @data
TO "/data/data_{*}.parquet"
USING Outputters.Parquet()
Futures:
• Managed U-SQL tables: PREVIEW in 2018 H2
• GA 2018 H2
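A minimal round-trip sketch (paths and schema are assumptions for illustration; the columnOptions string follows the grammar above):

SET @@FeaturePreviews = "EnableParquetUdos:on";

@data =
    EXTRACT id long,
            amount decimal,
            suffix string
    FROM "/data/input_{suffix}.parquet"
    USING Extractors.Parquet();

OUTPUT @data
TO "/out/data_{*}.parquet"
USING Outputters.Parquet(
    rowGroupSize: 256,                 // row groups of ~256 MB
    columnOptions: "1:18.2~snappy");   // column 1 as decimal(18,2), snappy-compressed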
19. Other Format Support
ORC
• Native EXTRACT/OUTPUT: PRIVATE PREVIEW now, PUBLIC PREVIEW in Summer 2018, GA 2018 H2.
• Managed U-SQL tables: PREVIEW in 2018 H2.
JSON, XML & AVRO
• Custom UDO lib on GitHub: https://github.com/Azure/usql/tree/master/Examples/DataFormats
• Built-in support on the roadmap. No ETA.
Images and text docs
• Cognitive Services (installable via the Portal): https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/cognitive-capabilities-in-u-sql
Other custom formats (PDF, Excel, etc.)
• Community-provided custom libs:
• https://devblog.xyz/simple-pdf-text-extractor-adla/
• https://github.com/Azure/AzureDataLake/tree/master/Samples/ExcelExtractor
22. U-SQL scales your code
Scales out your custom imperative code (written in .NET, Python, R, Java) in a declarative SQL-based framework.
[Diagram: .NET, Python, R, and Java code hosted inside the U-SQL framework]
23. What are UDOs?
Custom operator extensions in the language of your choice, scaled out by U-SQL.
• User-Defined Extractors: convert files into a rowset.
• User-Defined Outputters: convert a rowset into files.
• User-Defined Processors: take one row and produce one row (pass-through versus transforming).
• User-Defined Appliers: take one row and produce 0 to n rows; used with OUTER/CROSS APPLY.
• User-Defined Combiners: combine rowsets (like a user-defined join).
• User-Defined Reducers: take n rows and produce m rows (normally m < n).
UDOs are scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, CROSS APPLY, PROCESS, COMBINE, REDUCE. A minimal processor invocation is sketched below.
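A minimal sketch of such an invocation (MyNamespace.MyProcessor is a hypothetical .NET processor UDO registered via an assembly):

@result =
    PROCESS @input
    PRODUCE id int,
            name string
    USING new MyNamespace.MyProcessor();   // hypothetical UDO from a referenced assembly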
25. Managing U-SQL Assemblies
• Create assemblies for reuse – .NET, or JVM!
  • CREATE [JVM] ASSEMBLY db.assembly FROM @path;
  • CREATE [JVM] ASSEMBLY db.assembly FROM byte[];
  • Can also include additional resource files.
  • Visual Studio makes registration easy!
• Reference assemblies: REFERENCE ASSEMBLY db.assembly;
  • Referencing .NET Framework assemblies:
    • Always-accessible system namespaces: U-SQL specific (e.g., for SQL.MAP), and everything provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq).
    • Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerate assemblies: PowerShell command, U-SQL Studio Server Explorer, and the Azure Portal.
• Drop assemblies: DROP ASSEMBLY db.assembly;
A minimal lifecycle sketch follows below.
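A minimal lifecycle sketch (the database, path, and assembly names are assumptions for illustration):

USE DATABASE MyDB;

CREATE ASSEMBLY IF NOT EXISTS [MyUdoLib] FROM "/assemblies/MyUdoLib.dll";

REFERENCE ASSEMBLY MyDB.[MyUdoLib];
REFERENCE SYSTEM ASSEMBLY [System.Xml];   // any additional .NET Framework assembly

// ... script body using UDOs from MyUdoLib ...

// Later, to remove the assembly again:
// DROP ASSEMBLY MyDB.[MyUdoLib];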
26. DEPLOY RESOURCE
Syntax:
'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }.
Example:
DEPLOY RESOURCE "/config/configfile.xml", "package.zip";
Use cases:
• Script-specific configuration files (not stored with the assembly)
• Script-specific models
• Any other file you want to access from user code on all vertices (see the sketch below)
Semantics:
• Files have to be in ADLS or WASB.
• Files are deployed to the vertex and are accessible from any custom code.
Limits:
• The single-resource file limit is 400 MB.
• The overall limit for deployed resource files is 3 GB.
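A minimal sketch of reading a deployed resource from inline C# (assuming, per the semantics above, that the file is addressable by its bare name in the vertex's working directory):

DEPLOY RESOURCE "/config/configfile.xml";

@config =
    SELECT System.IO.File.ReadAllText("configfile.xml") AS configText
    FROM (VALUES(1)) AS T(x);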
28. Python with Azure Data Lake Today & Tomorrow
Management & ops with Python – for automating or operating Azure Data Lake:
• Python SDKs
• Python-based Azure CLI
Doing analytics with Python today:
• Run Python via the extension library's Reducer UDO on vertices.
• Only runs in a reducer context.
Doing analytics with Python tomorrow:
• Run Python natively on vertices.
• Build UDOs in Python: extractors, processors, outputters, reducers, appliers, combiners!
29. REFERENCE ASSEMBLY [ExtPython];
DECLARE @myScript = @"
def mult10(v):
return v*10.0
def usqlml_main(df):
df['amount10'] = df.amount.apply(mult10)
del df['amount']
return df
";
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
@b =
REDUCE @a ON customer
PRODUCE customer string, amount10 double
USING new Extension.Python.Reducer( pyScript:@myScript );
Today: transforming data with Python
• REFERENCE ASSEMBLY [ExtPython] enables the Python extensions.
• usqlml_main accepts a pandas DataFrame as input and returns a DataFrame as output; here it creates a column from another column using a Python function, then deletes the source column.
• Use a REDUCE statement to partition on a key.
• Specify the output schema; the Python code MUST output this schema in the DataFrame.
• Use Extension.Python.Reducer and pass in the script text.
31. Write an Extractor in pure Python

class OrdersExtractor:
    def __init__(self):
        pass
    def Extract(self, rawInput, output_row):
        buf = bytearray(4 * 1024 * 1024)
        output_schema = output_row.Schema
        # rawInput.Split is an overscan-aware row splitter
        for line in rawInput.Split('\n'):
            num_bytes = line.readinto(buf)
            cols = buf[:num_bytes].decode('utf-8').split('|')
            for i in range(len(cols)):
                # access the EXTRACT schema; set columns by position (or name)
                col_type = output_schema.GetColumn(i).Type
                output_row.Set(i, col_type(cols[i]))
            # accumulate the row into the rowset
            yield output_row

• Write the extractor class and provide an initializer.
• Implement the Extract method: rawInput is the input data stream; output_row is the row that gets accumulated into the resulting rowset.
32. DEPLOY RESOURCE @"/Build2018Demo/NativePython/testudo.py";
@orders =
EXTRACT
O_ORDERKEY long,
O_CUSTKEY long,
O_ORDERSTATUS string,
O_TOTALPRICE double,
O_ORDERDATE string,
O_ORDERPRIORITY int,
O_CLERK string,
O_SHIPPRIORITY int,
O_COMMENT string
FROM @"/Build2018Demo/NativePython/orders_sample.tbl"
USING Extractors.Python(
prologue: "import testudo",
expression: "testudo.OrdersExtractor()");
Using an Extractor written in Python
• DEPLOY RESOURCE deploys the Python code to each vertex.
• Extractors.Python invokes the Python extractor:
  • prologue: sets up the runtime Python context (imports, object definitions).
  • expression: invokes the extractor.
33. Writing a native Python outputter

class OrdersOutputterWithFinishMethod:
    def __init__(self):
        pass
    def Output(self, row, output):
        stream = output.GetBaseStream
        schema = row.Schema
        if len(row) != len(schema):
            raise RuntimeError("Length of values is not same as schema length")
        for columnIndex in range(len(schema)):
            # write each column value, '|'-delimited, as UTF-8 bytes
            stream.write(bytes(str(row[columnIndex]), 'utf8'))
            if columnIndex < len(schema) - 1:
                stream.write(b'|')
        stream.write(b'\n')
    def Finish(self, output):
        # optional finisher to write a footer to the file
        output.GetBaseStream.write(b"End of Rowset")
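Invoking it might look like the following minimal sketch; the Outputters.Python entry point is assumed by analogy with the Extractors.Python call shown earlier, and the paths are illustrative:

DEPLOY RESOURCE @"/Build2018Demo/NativePython/testoutputter.py";   // hypothetical file holding the class

OUTPUT @orders
TO @"/Build2018Demo/NativePython/orders_out.tbl"
USING Outputters.Python(                    // assumed entry point (analogy to Extractors.Python)
    prologue: "import testoutputter",
    expression: "testoutputter.OrdersOutputterWithFinishMethod()");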
34. import sys
sys.path.insert(0, 'methods.zip')
import double
import divide
class ZipModuleMethodsReducer:
def __init__(self):
pass
def Reduce(self, inputRowset, outputRow):
for row in inputRowset:
# external method provided by double.py within 'methods.zip'
val = double.Double(row["O_TOTALPRICE"])
# external method provided by divide.py within 'methods.zip'
outputRow["O_TOTALPRICE"] = divide.Devide(val, 2)
yield outputRow
Using custom Python modules in a UDO:
1. Use DEPLOY RESOURCE methods.zip in the U-SQL script (see the sketch below).
2. Use the Python zip-import feature in the Python UDO script.
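The U-SQL side might look like the following minimal sketch; the Reducers.Python entry point is assumed by analogy with the Extractors.Python call shown earlier, and the file names are illustrative:

DEPLOY RESOURCE @"/demo/methods.zip";
DEPLOY RESOURCE @"/demo/zipreducer.py";    // hypothetical file holding ZipModuleMethodsReducer

@result =
    REDUCE @orders ON O_ORDERKEY
    PRODUCE O_ORDERKEY long,
            O_TOTALPRICE double
    USING Reducers.Python(                  // assumed entry point (analogy to Extractors.Python)
        prologue: "import zipreducer",
        expression: "zipreducer.ZipModuleMethodsReducer()");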
35. U-SQL Vertex Code (Python)
[Diagram: compilation and optimization, using the U-SQL Metadata Service, produce the compilation output in the job folder (managed DLL, native DLL, algebra, plus the additional Python libs and script). The job folder contents, the ADLS files named in DEPLOY RESOURCE (Script.py, OtherLibs.zip), and the system files (built-in runtimes including the Python engine & libs, core DLLs, OS) are deployed to the vertices.]
37. Java with Azure Data Lake
Management & ops with Java – for automating or operating Azure Data Lake:
• Java SDKs
Doing analytics with Java:
• Run Java natively on vertices.
• Build UDOs in Java: extractors, processors, outputters, reducers, appliers, combiners!
38. package microsoft.analytics.samples;
import microsoft.analytics.interfaces.*;
public class ColumnProcessor extends Processor
{
public ColumnProcessor(){}
@Override
public Row process(Row input, UpdatableRow output) throws Throwable
{
Schema inSchema = input.getSchema();
Schema outSchema = output.getSchema();
for (int i = 0; i < outSchema.getCount(); i++)
{
String colName = outSchema.getColumn(i).getName();
int colIndex = inSchema.indexOf(colName);
if ((colIndex < 0) || (colIndex >= inSchema.getCount()))
{ throw new java.lang.IllegalArgumentException("Schema mismatch"); }
Object value = input.getColumnValue(colIndex);
output.setColumnValue(i, value);
}
return output;
}
}
Writing a native Java processor
• Import the Microsoft UDO interfaces and extend the base Processor class.
• The initializer can be used for parameters.
• Override process and return the output row: the input is the input row, the output is the output row.
• Access the schema of the input and output rows.
• Set the output value by position (or name).
39. Registering a native Java processor

SET @@InternalDebug = "EnableJava:on";
CREATE JVM ASSEMBLY jvmAsm FROM @"Jars\microsoft.analytics.samples.jar";
40. REFERENCE ASSEMBLY jvmAsm;
...
@result1 =
PROCESS @result
PRODUCE col1, col2, col3
USING Processors.Java("new microsoft.analytics.samples.ColumnProcessor()");
...
Calling a native Java processor
• REFERENCE ASSEMBLY imports the UDO written in Java (note that it looks the same regardless of implementation language).
• Processors.Java generates an instance of the processor and calls it from U-SQL.
41. Calling an existing Hive SerDe with U-SQL EXTRACT

SET @@InternalDebug = "EnableJava:on";
REFERENCE ASSEMBLY jvmHiveSerDeAsm;
@result =
    EXTRACT Band string,
            Name string,
            Male bool?,
            Instrument string,
            Born int?,
            Children long?,
            NetWorth double?
    FROM @"Input\BandsDataJson.txt"
    USING Extractors.Hive("new org.openx.data.jsonserde.JsonSerDe()", true);

• Import the Hive SerDe that was registered as a U-SQL assembly.
• Call the Hive SerDe with the built-in Extractors.Hive.
43. U-SQL Vertex Code (Java)
[Diagram: compilation and optimization, using the U-SQL Metadata Service, produce the compilation output in the job folder (managed DLL, native DLL, algebra). The job folder contents, the JVM libs referenced via REFERENCE ASSEMBLY (JVM), and the system files (built-in runtimes including the Java JVM and libs, core DLLs, OS) are deployed from ADLS to the vertices.]
44. Python and Java Execution Paradigm
[Diagram: on each vertex the U-SQL runtime exchanges rows with the hosted Python/JVM system, with type mapping at the boundary in both directions.]
46. Scenario: Split a rowset into multiple files
Requirement: create one file per unique customer Id.

Input rowset:
Id   | Amt
1024 | 100
4578 | 200
2309 | 300
8713 | 400
4578 | 500
8713 | 600
1024 | 700
2309 | 800

Split by Id:
1024_data.csv: (1024, 100), (1024, 700)
4578_data.csv: (4578, 200), (4578, 500)
2309_data.csv: (2309, 300), (2309, 800)
8713_data.csv: (8713, 400), (8713, 600)
47. OUTPUT to FileSet
TODAY
• The OUTPUT filenames must be known at compile time; the number of output files is static – they are explicitly listed in the script:

@rows_1 =
    SELECT * FROM @rows
    WHERE id == 1340;
@rows_2 =
    SELECT * FROM @rows
    WHERE id == 7890;
OUTPUT @rows_1 TO "/user_1340.tsv" USING Outputters.Tsv();
OUTPUT @rows_2 TO "/user_7890.tsv" USING Outputters.Tsv();

TOMORROW
• The OUTPUT filenames can be inferred from the data; the number of output files is dynamic!
• PRIVATE PREVIEW now; GA in 2018 H2.
• Will support Parquet and custom outputters.

OUTPUT @rows TO "/user_{id}.tsv" USING Outputters.Tsv();
51. Stage Details – The Operator Graph
• Right-click to see the operator graph.
• It shows details about the UDO – the exact class name is given.
53. Using execution logs from a previously run job, we simulate the vertex scheduler algorithm for a given number of AUs.

AUs | Time (s) | AU-seconds
  1 |   739    |   739
  2 |   390    |   780
  3 |   281    |   843
  4 |   223    |   892
  5 |   189    |   945
  6 |   166    |   996
  7 |   155    |  1085
  8 |   146    |  1168
  9 |   137    |  1233
 10 |   133    |  1330
 11 |   131    |  1441
 12 |   126    |  1512
 13 |   123    |  1599
 14 |   123    |  1722
 15 |   122    |  1830
 16 |   112    |  1792
 17 |   114    |  1938
 18 |   115    |  2070
 19 |   115    |  2185
 20 |   114    |  2280
 21 |   113    |  2373
 22 |   113    |  2486
23–41 | 112    |  2576–4592 (time stays flat at 112 s while AU-seconds grow linearly)

NOTICE:
#1 – Most of the improvement happens with only a few AUs (yellow).
#2 – Very little improvement after 22 AUs.
55. BALANCED recommendation – N AUs, where assigning N+1 AUs would result in a percentage increase in cost (AU-hours) that is greater than the percentage decrease in running time (seconds).
FAST recommendation – N AUs, where assigning N+1 AUs would result in a percentage increase in cost (AU-hours) that is more than 2x the percentage decrease in running time (seconds).
PEAK – the maximum number of AUs that could theoretically be used by this job.
For example, if going from 10 to 11 AUs raises cost by 8% but cuts running time by only 2%, then 10 AUs is already at or past the balanced point.
[Chart: running time vs. AU-hours (cost), from 1 AU and 2 AUs up to the PEAK of 300 AUs; in this example BALANCED is 10 AUs and FAST is 20 AUs.]
68. Resources
• Blogs and community page:
  • http://usql.io (U-SQL GitHub)
  • http://blogs.msdn.microsoft.com/azuredatalake/
  • http://blogs.msdn.microsoft.com/mrys/
  • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, presentations and articles:
  • http://aka.ms/usql_reference
  • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
  • https://docs.microsoft.com/en-us/azure/data-lake-analytics/
  • https://msdn.microsoft.com/en-us/magazine/mt614251
  • https://msdn.microsoft.com/magazine/mt790200
  • http://www.slideshare.net/MichaelRys
• Getting Started with R in U-SQL:
  • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
• ADL forums and feedback:
  • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
  • http://stackoverflow.com/questions/tagged/u-sql
  • http://aka.ms/adlfeedback
Continue your education at Microsoft Virtual Academy online.
72. Abstract
Data scientists and data wranglers often have existing code that they would like to use at scale over large data sets. In this presentation we show how to meet your customers where they are, allowing them to take their existing Python, R, and Java code and libraries, and existing formats – for example, Parquet – and apply them at scale to schematize unstructured data and process large amounts of data in Azure Data Lake with U-SQL. We also show how large customers meet the challenge of producing multiple cubes with data subsets to secure data for specific audiences, using U-SQL partitioned output to make it easy to dynamically partition data for processing from Azure Data Lake.