Azure analytics pipeline (diagram): Ingest (Data Factory) → Store (Data Lake Store) → Prep & Train → Model & Serve, using Data Lake Analytics, Databricks Spark, HDInsight Spark, HDInsight Hive LLAP, SQL Data Warehouse, and SQL DB (reference data)
• Landing zone structure
• File size distribution
• File types and formats
• Languages & Existing Libraries
• Data “Cooking”: Normalization and Enrichment
• Partition & Structure data for performance or for serving
• Interactively analyze data
• Notebooks
• PySpark
• SQL
Overall U-SQL Batch Job Execution Lifetime (1)
Diagram of the job flow: Author → Front-End Service (Compiler & Optimizer producing the optimized plan, stage codegen with C++/C# compilation, U-SQL Catalog) → Job Scheduler & Queue → Job Manager (vertex scheduling on containers) → vertexes running in YARN containers with the U-SQL runtime → Consume (Local Storage, Data Lake Store).
Overall U-SQL Batch Job Execution Lifetime (2)
The same diagram annotated with the job phases: Preparation Phase, Queueing, Execution Phase, and Finalization Phase, with vertex codegen (C++/C# compilation).
Vertex Execution View
• Open a job and click Vertex Execution View (will require the job profile to load)
• Filter which vertices get shown
• Visualization of vertex execution
• Vertex details
Vertex Execution View
• Selected vertex indicator
• Row of vertex information (currently selected)
• Colors indicate what is happening with the vertex:
• (blue) CREATING – the vertex is being set up on a container (e.g. user code resources being copied to the container)
• (orange) QUEUING – the vertex is waiting to start on the container; other vertexes may be using AUs, so this vertex is waiting for an AU to become available
• (green) RUNNING – the vertex is actually doing work
• In this case, notice that the vertex creation time is much larger than the time the vertex took to do its work.
Big Data is made of MANY, SMALL files
Chart of count of files vs. file size (MB → GB → TB → PB): most files are way less than a TB; very long tail; some files are HUGE.
Recap: EXTRACT from FileSet
Without FileSets: explicit list of input files
@rows =
    EXTRACT name string, id int
    FROM "/file1.tsv",
         "/file2.tsv",
         "/file3.tsv"
    USING Extractors.Csv();
With a FileSet: EXTRACT every file in a folder using a {suffix} pattern; the value for the column named "suffix" comes from the matched part of the path (it is the filename). See the sketch below.
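A minimal sketch of such a FileSet EXTRACT, assuming an illustrative /input/ folder and the name/id schema from the slide:

// Every file matching /input/{suffix}.tsv is read; the matched part of the
// path is surfaced as an extra string column called "suffix".
@rows =
    EXTRACT name string,
            id int,
            suffix string
    FROM "/input/{suffix}.tsv"
    USING Extractors.Tsv();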
Recap: EXTRACT from FileSet
FileSet: EXTRACT a pattern - the path pattern {date:yyyy}/{date:MM}/{date:dd}/{suffix} exposes virtual columns "date" and "suffix".
FileSet: EXTRACT with pattern and partition elimination - the same pattern combined with a predicate such as
WHERE date >= System.DateTime.Parse("2016/1/1") AND
      date < System.DateTime.Parse("2016/2/1");
so that only the matching folders are read (see the sketch below).
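A minimal sketch combining the date-pattern FileSet with the predicate above, assuming an illustrative /input/ folder layout:

// "date" and "suffix" are virtual columns computed from the path, so the
// predicate below lets the compiler skip folders outside January 2016.
@rows =
    EXTRACT name string,
            id int,
            date DateTime,
            suffix string
    FROM "/input/{date:yyyy}/{date:MM}/{date:dd}/{suffix}.tsv"
    USING Extractors.Tsv();

@january =
    SELECT *
    FROM @rows
    WHERE date >= System.DateTime.Parse("2016/1/1") AND
          date < System.DateTime.Parse("2016/2/1");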
Working with MANY input Files
A U-SQL script has an upper bound on the number of input files it can work on.
Yesterday's limit was a few 1000s of files.
The new limit is 100,000s of input files - no syntax change from before.
Working with MANY, SMALL files
TODAY
• Every file requires a separate EXTRACT vertex.
• Lots of small files = lots of EXTRACT vertices.
• Vertices have a startup & shutdown cost that may be much larger than the time required to read a small file.
(diagram: files f1..f6, one EXTRACT vertex per file)
TOMORROW
• When possible the same EXTRACT vertex will be used for multiple small files.
• Up to 200 files or 1 GB of data (whichever is reached first)
• In PUBLIC PREVIEW now!
• GA Summer 2018
SET @@FeaturePreviews = "InputFileGrouping:on";
(diagram: files f1..f6 shared across two EXTRACT vertices)
Input File Grouping (for 15 small files)
Before: 15 vertices, each with ~5 second vertex creation time (blue) and ~1 second vertex execution time (green).
After: 2 vertices, each with ~5 second vertex creation time (blue) and ~5 second vertex execution time (green).
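A minimal sketch of a job using this preview flag, assuming an illustrative /smallfiles/ folder and schema:

SET @@FeaturePreviews = "InputFileGrouping:on";

// Up to 200 files (or 1 GB) now share one EXTRACT vertex instead of one vertex per file.
@rows =
    EXTRACT name string,
            id int,
            suffix string
    FROM "/smallfiles/{suffix}.csv"
    USING Extractors.Csv();

OUTPUT @rows
TO "/output/combined.csv"
USING Outputters.Csv();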
Built-in Format Support: CSV/TSV & friends
• Extractors.Csv()|Tsv()|Text(delimiter: )
• Outputters.Csv()|Tsv()|Text(delimiter: )
Major Options:
• encoding: UTF-8 (default), UTF-16, ASCII, Windows-125x
• skipFirstNRows: Skip header lines (extractor only)
• outputHeader: true or false (outputter only)
• quoting: true or false
• Will handle "…"-quoted fields to guard delimiters inside text.
• DOES NOT guard the end-of-line delimiter!
• silent: true or false (extractor only)
• Allows skipping rows with a misaligned number of columns
• Casts invalid values to NULL if the target type is nullable
• DOES NOT handle wrong encodings, rows that are too long, etc.!
• nullEscape: character representation of null value in input
• escapeChar: character to escape delimiter characters
Will Execute in parallel:
• Based on file set definition
• Every 1GB will be a separate vertex
• Every 1 vertex will execute 4 extractor
instances in parallel on 250MB + 4MB
As consequence:
• Large CSV files will get parallelized
• An extractor instance will only see its data
and nothing else.
Supports column pruning:
• only columns needed in script are fully
extracted
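A minimal sketch exercising several of these options, assuming an illustrative file path and schema:

// "amount" is declared nullable, so with silent:true invalid values become NULL instead of failing.
@rows =
    EXTRACT name string,
            amount double?
    FROM "/data/sales.csv"
    USING Extractors.Csv(encoding : Encoding.UTF8, skipFirstNRows : 1, quoting : true, silent : true);

OUTPUT @rows
TO "/output/sales_clean.csv"
USING Outputters.Csv(outputHeader : true);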
Built-in Format Support: Parquet (Preview)
Extractors.Parquet(), Outputters.Parquet()
SET @@FeaturePreviews = "EnableParquetUdos:on";
Major Options (on Outputter only):
• rowGroupSize: size of a row group in MB
• rowGroupRows: size of a row group in rows
• columnOptions:
ColOptions := ColOption [',' ColOption].
ColOption :=
columnindex
( [':'decimalPrecision]['.'decimalScale] |
[':'DateTimePrecision] )
['~'Compression].
DateTimePrecision := 'micro' | 'milli' | 'days'.
Compression :=
'uncompressed' | 'snappy' | 'brotli' | 'gzip'.
Will Execute in parallel only if:
• file set definition is used on input and
output
As consequence:
• Aim to generate and read Parquet files of
300MB to 3GB in size.
Supports column pruning:
• only columns needed in script are fully
extracted
PRO TIP:
OUTPUT @data
TO "/data/data_{*}.parquet"
USING Outputters.Parquet()
Futures:
Managed U-SQL tables PREVIEW in 2018 H2
GA 2018 H2
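A minimal sketch of reading and writing Parquet with the preview flag from this slide, assuming illustrative paths and schema:

SET @@FeaturePreviews = "EnableParquetUdos:on";

// Reading via a file set pattern so the input can be parallelized.
@rows =
    EXTRACT name string,
            amount double?,
            suffix string
    FROM "/data/parquet/{suffix}.parquet"
    USING Extractors.Parquet();

// Writing with the {*} pattern (the pro tip above) so the output is split across files.
OUTPUT @rows
TO "/data/out/data_{*}.parquet"
USING Outputters.Parquet();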
Other Format Support
ORC
Native EXTRACT/OUTPUT PRIVATE PREVIEW now
Managed U-SQL tables PREVIEW in 2018 H2
PUBLIC PREVIEW in Summer 2018
GA 2018 H2
JSON, XML & AVRO
Custom UDO lib on GitHub
https://github.com/Azure/usql/tree/master/Examples/DataFormats
Built-in Support on Roadmap. No ETA
Images and Text docs
Cognitive Services (installable via Portal)
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/cognitive-capabilities-in-u-sql
Other custom formats
PDF, Excel etc
Community provided custom libs:
https://devblog.xyz/simple-pdf-text-extractor-adla/
https://github.com/Azure/AzureDataLake/tree/master/Samples/ExcelExtractor
U-SQL scales your code
Scales out your custom imperative code (written in .NET, Python, R, Java) in a declarative SQL-based framework.
Diagram: .NET, Python, R, and Java layers on top of the U-SQL Framework.
What are UDOs?
Custom operator extensions in the language of your choice, scaled out by U-SQL:
• User-Defined Extractors
• Convert files into a rowset
• User-Defined Outputters
• Convert a rowset into files
• User-Defined Processors
• Take one row and produce one row
• Pass-through versus transforming
• User-Defined Appliers
• Take one row and produce 0 to n rows
• Used with OUTER/CROSS APPLY
• User-Defined Combiners
• Combine rowsets (like a user-defined join)
• User-Defined Reducers
• Take n rows and produce m rows (normally m < n)
• Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, CROSS APPLY, PROCESS, COMBINE, REDUCE
U-SQL Example
Diagram: an Extract → Process → Output pipeline in which the declarative framework orchestrates user-code extensions; a .NET UDO is used within the Extract stage.
Managing U-SQL
Assemblies
• Create assemblies for reuse
• .Net, or JVM!
• Reference assemblies
• Enumerate assemblies
• Drop assemblies
• Visual Studio makes registration easy!
• CREATE [JVM] ASSEMBLY db.assembly FROM @path;
• CREATE [JVM] ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .Net Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
• Add all other .Net Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• PowerShell command
• U-SQL Studio Server Explorer and Azure Portal
• DROP ASSEMBLY db.assembly;
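A minimal sketch of this lifecycle (registration, reference, and cleanup would normally be separate scripts), assuming an illustrative assembly name and DLL path in the default master database:

// Register a .NET assembly from a DLL previously uploaded to the store.
CREATE ASSEMBLY IF NOT EXISTS master.MyUdos FROM @"/Assemblies/MyUdos.dll";

// Reference it (plus a .NET Framework assembly) in the script that uses the UDOs.
REFERENCE ASSEMBLY master.MyUdos;
REFERENCE SYSTEM ASSEMBLY [System.XML];

// ... script body that uses the registered UDOs ...

// Remove the assembly when it is no longer needed.
DROP ASSEMBLY IF EXISTS master.MyUdos;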
DEPLOY RESOURCE Syntax:
'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }.
Example:
DEPLOY RESOURCE "/config/configfile.xml", "package.zip";
Use Cases:
• Script specific configuration files (not stored with Asm)
• Script specific models
• Any other file you want to access from user code on all
vertices
Semantics:
• Files have to be in ADLS or WASB
• Files are deployed to vertex and are accessible from any
custom code
Limits:
• Single resource file limit is 400MB
• Overall limit for deployed resource files is 3GB
Python with Azure Data Lake Today & Tomorrow
Management & Ops
with Python
For automating or
operating Azure Data
Lake
• Python SDKs
• Python-based Azure
CLI
Doing Analytics with
Python today
• Run Python via
Extension library
Reducer UDO on
vertices.
• Only runs in a Reducer
context
Doing Analytics with
Python tomorrow
• Run Python natively
on vertices.
• Build UDOs in Python:
• Extractors, Processors,
Outputters, Reducers,
Appliers, Combiners!
REFERENCE ASSEMBLY [ExtPython];
DECLARE @myScript = @"
def mult10(v):
return v*10.0
def usqlml_main(df):
df['amount10'] = df.amount.apply(mult10)
del df['amount']
return df
";
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
@b =
REDUCE @a ON customer
PRODUCE customer string, amount10 double
USING new Extension.Python.Reducer( pyScript:@myScript );
Today:
Transforming data
with Python
Create column based on
data from another column
using a python function
Delete a column
USE REFERENCE
ASSEMBLY to enable the
Python Extensions
usqlml_main accepts a pandas DataFrame as input and returns a DataFrame as output
Use a REDUCE statement
to partition on a key
Specify output Schema.
Python code MUST output
this schema in the
DataFrame
Use Extension.Python.Reducer and pass in
the script text.
class OrdersExtractor:
    def __init__(self):
        pass

    def Extract(self, rawInput, output_row):
        buf = bytearray(4 * 1024 * 1024)
        output_schema = output_row.Schema
        for line in rawInput.Split('\n'):
            num_bytes = line.readinto(buf)
            cols = buf[:num_bytes].decode('utf-8').split('|')
            for i in range(len(cols)):
                col_type = output_schema.GetColumn(i).Type
                output_row.Set(i, col_type(cols[i]))
            yield output_row
Write an Extractor in pure Python
Write Extractor class
Provide initializer
Implement Extract method:
rawInput is input data stream
output_row is resulting row that gets
accumulated into rowset
Access to EXTRACT schema
Overscan aware row splitter
Setting column by position or name
Accumulate row into rowset
DEPLOY RESOURCE @"/Build2018Demo/NativePython/testudo.py";
@orders =
EXTRACT
O_ORDERKEY long,
O_CUSTKEY long,
O_ORDERSTATUS string,
O_TOTALPRICE double,
O_ORDERDATE string,
O_ORDERPRIORITY int,
O_CLERK string,
O_SHIPPRIORITY int,
O_COMMENT string
FROM @"/Build2018Demo/NativePython/orders_sample.tbl"
USING Extractors.Python(
prologue: "import testudo",
expression: "testudo.OrdersExtractor()");
Using an Extractor written in Python
Deploy Python code to Vertex
Invoke Python Extractor
prologue:
sets up the runtime python context
(imports, object definitions)
expression:
invokes the Extractor
class OrdersOutputterWithFinishMethod:
    def __init__(self):
        pass

    def Output(self, row, output):
        stream = output.GetBaseStream
        schema = row.Schema
        if len(row) != len(schema):
            raise RuntimeError("Length of values is not same as schema length")
        for columnIndex in range(len(schema)):
            stream.write(bytes(str(row[columnIndex]), 'utf8'))
            if columnIndex < len(schema) - 1:
                stream.write('|')
        stream.write('\n')

    def Finish(self, output):
        output.GetBaseStream.write("End of Rowset")
Writing a native Python outputter
Optional finisher to write a footer to file
import sys
sys.path.insert(0, 'methods.zip')
import double
import divide

class ZipModuleMethodsReducer:
    def __init__(self):
        pass

    def Reduce(self, inputRowset, outputRow):
        for row in inputRowset:
            # external method provided by double.py within 'methods.zip'
            val = double.Double(row["O_TOTALPRICE"])
            # external method provided by divide.py within 'methods.zip'
            outputRow["O_TOTALPRICE"] = divide.Devide(val, 2)
            yield outputRow
Using custom Python modules in UDO
1. Use DEPLOY RESOURCE methods.zip in U-SQL script
2. Use Python ZIP import feature in python UDO script
U-SQL Vertex Code (Python)
Diagram: compilation and optimization produce the compilation output in the job folder (C#, C++, algebra, managed dll, native dll) plus the additional Python libs and script; together with the U-SQL Metadata Service and the system files (built-in runtimes, core DLLs, OS, Python engine & libs) these are deployed to the vertices; Script.py and OtherLibs.zip come from ADLS via DEPLOY RESOURCE.
Java with Azure Data Lake
Management & Ops
with Java
For automating or
operating Azure Data
Lake
• Java SDKs
Doing Analytics with
Java
• Run Java natively on
vertices.
• Build UDOs in Java:
• Extractors, Processors,
Outputters, Reducers,
Appliers, Combiners!
package microsoft.analytics.samples;
import microsoft.analytics.interfaces.*;
public class ColumnProcessor extends Processor
{
public ColumnProcessor(){}
@Override
public Row process(Row input, UpdatableRow output) throws Throwable
{
Schema inSchema = input.getSchema();
Schema outSchema = output.getSchema();
for (int i = 0; i < outSchema.getCount(); i++)
{
String colName = outSchema.getColumn(i).getName();
int colIndex = inSchema.indexOf(colName);
if ((colIndex < 0) || (colIndex >= inSchema.getCount()))
{ throw new java.lang.IllegalArgumentException("Schema mismatch"); }
Object value = input.getColumnValue(colIndex);
output.setColumnValue(i, value);
}
return output;
}
}
Writing a native Java processor
Microsoft UDO interfaces
Extend base Processor
Initializer (can be used for parameters)
Override process and return the row
Input is input row
Output is output row
Accessing the Schema of the input and
output
Setting output value by position (or
name)
SET @@InternalDebug = "EnableJava:on";
CREATE JVM ASSEMBLY jvmAsm FROM @"Jars\microsoft.analytics.samples.jar";
Registering native Java processor
REFERENCE ASSEMBLY jvmAsm;
...
@result1 =
PROCESS @result
PRODUCE col1, col2, col3
USING Processors.Java("new microsoft.analytics.samples.ColumnProcessor()");
...
Calling a native Java processor
Import UDO written in Java (note it
looks the same regardless of
implementation language)
Generating an instance of processor and
call it from U-SQL
SET @@InternalDebug = "EnableJava:on";
REFERENCE ASSEMBLY jvmHiveSerDeAsm;
@result = EXTRACT
Band string,
Name string,
Male bool?,
Instrument string,
Born int?,
Children long?,
NetWorth double?
FROM @"InputBandsDataJson.txt"
USING Extractors.Hive("new org.openx.data.jsonserde.JsonSerDe()", true);
Calling an existing Hive SerDe with U-SQL EXTRACT
Import Hive SerDe that was registered
as U-SQL Assembly
Call Hive SerDe with built-in
Extractors.Hive
SET @@InternalDebug = "EnableJava:on";
CREATE JVM ASSEMBLY jvmHiveSerDeAsm FROM @"Jars\json-serde-1.3.8.jar"
WITH ADDITIONAL FILES =
(
DEPLOY @"Jars\commons-logging-1.2.jar",
DEPLOY @"Jars\hadoop-core-1.2.1.jar",
DEPLOY @"Jars\hive-common-1.2.1.jar",
DEPLOY @"Jars\hive-serde-1.2.1.jar",
DEPLOY @"Jars\hive-hcatalog-core-2.3.2.jar"
);
Registering native Java Hive SerDe
U-SQL Vertex Code (Java)
Diagram: compilation and optimization produce the compilation output in the job folder (C#, C++, algebra, managed dll, native dll) plus the referenced JVM libs; together with the U-SQL Metadata Service and the system files (built-in runtimes, core DLLs, OS, Java JVM and libs) these are deployed to the vertices; JVM assemblies come from ADLS via REFERENCE ASSEMBLY (JVM).
Python and Java Execution Paradigm
Diagram: the Python/JVM system handles type mapping between U-SQL rowsets and the Python or JVM runtime on both the input and the output side.
Scenario: Split a rowset into multiple files
Requirement: Create one file per unique customer id
Input rowset:
Id    Amt
1024  100
4578  200
2309  300
8713  400
4578  500
8713  600
1024  700
2309  800
Split by Id:
1024_data.csv: (1024, 100), (1024, 700)
4578_data.csv: (4578, 200), (4578, 500)
2309_data.csv: (2309, 300), (2309, 800)
8713_data.csv: (8713, 400), (8713, 600)
OUTPUT To FileSet
TODAY
• The OUTPUT filenames must be known at
compile time. The number of output files is
Static – they are explicitly listed in the
script.
TOMORROW
• The OUTPUT filenames can be inferred from the data.
• The Number of output files is Dynamic!
• PRIVATE PREVIEW now
• GA in 2018 H2
• Will support Parquet
• Will support Custom Outputters
@rows_1 = SELECT * FROM @rows WHERE id = 1340;
@rows_2 = SELECT * FROM @rows WHERE id = 7890;
OUTPUT @rows_1 TO "/user_1340.tsv" USING Outputters.Tsv();
OUTPUT @rows_2 TO "/user_7890.tsv" USING Outputters.Tsv();
OUTPUT @rows TO "/user_{id}.tsv" USING Outputters.Tsv();
Stage Details – The Operator Graph
Right-click to see the operator graph
Details about
the UDO – the
exact class
name is given
AUs Time AUSec
1 739 739
2 390 780
3 281 843
4 223 892
5 189 945
6 166 996
7 155 1085
8 146 1168
9 137 1233
10 133 1330
11 131 1441
12 126 1512
13 123 1599
14 123 1722
15 122 1830
16 112 1792
17 114 1938
18 115 2070
19 115 2185
20 114 2280
21 113 2373
22 113 2486
Using execution logs from a
previously run job. We simulate
the vertex scheduler algorithm
for a given number of AUs.
NOTICE:
#1 - Most of the improvement happens with only a few AUs (YELLOW)
#2 - Very little improvement after 22 AUs
AUs Time AUSec
23 112 2576
24 112 2688
25 112 2800
26 112 2912
27 112 3024
28 112 3136
29 112 3248
30 112 3360
31 112 3472
32 112 3584
33 112 3696
34 112 3808
35 112 3920
36 112 4032
37 112 4144
38 112 4256
39 112 4368
40 112 4480
41 112 4592
Chart: Execution time (sec) vs. AU seconds (cost), with points marked for 1 AU, 2 AUs, 3 AUs, ... up to 41 AUs.
BALANCED recommendation
N AUs, where: assigning N+1 AUs would result in a percentage increase in cost (AU-hours) that is greater than the percentage decrease in running time (seconds).
FAST recommendation
N AUs, where: assigning N+1 AUs would result in a percentage increase in cost (AU-hours) that is more than 2x the percentage decrease in running time (seconds).
Chart: running time vs. AU hours (cost), annotated with 1 AU, 2 AUs, PEAK (300 AUs), BALANCED (10 AUs), and FAST (20 AUs).
PEAK: the max AUs that could theoretically be used by this job.
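A worked reading of these rules, using the simulated table above: going from 2 AUs (390 s, 780 AU-seconds) to 3 AUs (281 s, 843 AU-seconds) cuts running time by roughly 28% while raising cost by roughly 8%, so adding the AU is clearly worthwhile under either rule; going from 13 AUs (123 s, 1599 AU-seconds) to 14 AUs (123 s, 1722 AU-seconds) raises cost by about 8% with essentially no time saved, so both recommendations stop well before that point.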
In Visual Studio
In Portal
AU / Analysis
Showing ACTUAL (black)
AU / Analysis
Showing ACTUAL (black)
Versus
Balanced (blue)
AU / Analysis
Showing ACTUAL (black)
Versus
Fast (green)
AU / Analysis
Showing ACTUAL (black)
Versus
Custom (purple)
Daily Cooking Pipeline
Activity 1 Activity 2 Activity 3
ADL U-SQL Job
Submitter: mrys
AUs: 50
PipelineId: ID from Cooking Pipeline
RecurrenceId: ID from Activity 1
Script: …
The Pipeline
Recurring job within the pipeline
Compare
Duration (green)
Input Size (blue)
Output size (purple)
Resources
• Blogs and community page:
• http://usql.io (U-SQL Github)
• http://blogs.msdn.microsoft.com/azuredatalake/
• http://blogs.msdn.microsoft.com/mrys/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, presentations and articles:
• http://aka.ms/usql_reference
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• https://msdn.microsoft.com/magazine/mt790200
• http://www.slideshare.net/MichaelRys
• Getting Started with R in U-SQL
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
• ADL forums and feedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
• http://aka.ms/adlfeedback
Continue your education at
Microsoft Virtual Academy
online.
Abstract
Data Scientists and Data Wranglers often have existing code that they
would like to use at scale over large data sets. In this presentation we
show how to meet your customers where they are, allowing them to take
their existing Python, R, and Java code and libraries and existing formats (for example Parquet) and apply them at scale to schematize unstructured
data and process large amounts of data in Azure Data Lake with U-SQL.
We will show how large customers meet the challenges of processing
multiple cubes with data subsets to secure data for specific audiences
using U-SQL partitioned output, making it easy to dynamically partition
data for processing from Azure Data Lake.
