DataMass Summit - Machine Learning for Big Data in SQL Server

Machine Learning
for Big Data in SQL Server
Łukasz Grala
lukasz@tidk.pl | lukasz.grala@cs.put.poznan.pl

Łukasz Grala
• Senior architekt rozwiązań Platformy Danych & Business Intelligence & Zaawansowanej Analityki w TIDK
• Twórca „Data Scientist as as Service”
• Certyfikowany trener Microsoft i wykładowca na wyższych uczelniach
• Autor zaawansowanych szkoleń i warsztatów, oraz licznych publikacji i webcastów
• Od 2010 roku wyróżniany nagrodą Microsoft Data Platform MVP
• Doktorant Politechnika Poznańska – Wydział Informatyki (obszar bazy danych, eksploracja danych, uczenie maszynowe)
• Prelegent na licznych konferencjach w kraju i na świecie
• Posiada liczne certyfikaty (MCT, MCSE, MCSA, MCITP,…)
• Członek zarządu Polskiego Towarzystwa Informatycznego Oddział Wielkopolski
• Członek i lider Polish SQL Server User Group (PLSSUG)
• Pasjonat analizy, przechowywania i przetwarzania danych, miłośnik Jazzu
email lukasz@tidk.pl - lukasz.grala@cs.put.poznan.pl blog: grala.it

Gartner – Magic Quadrant
Business Intelligence and Analytics PlatformsData Management Solutions for Analytics

[
{
"Number":"SO43659",
"Date":"2011-05-31T00:00:00"
"AccountNumber":"AW29825",
"Price":59.99,
"Quantity":1
},
{
"Number":"SO43661",
"Date":"2011-06-01T00:00:00“
"AccountNumber":"AW73565“,
"Price":24.99,
"Quantity":3
}
]
Number Date Customer Price Quantity
SO43659 2011-05-31T00:00:00 AW29825 59.99 1
SO43661 2011-06-01T00:00:00 AW73565 24.99 3
SELECT * FROM myTable
FOR JSON AUTO
SELECT * FROM
OPENJSON(@json)
Data exchange with JSON

CREATE TABLE SalesOrderRecord (
Id int PRIMARY KEY IDENTITY,
OrderNumber NVARCHAR(25) NOT NULL,
OrderDate DATETIME NOT NULL,
JSalesOrderDetails NVARCHAR(4000)
CONSTRAINT SalesOrderDetails_IS_JSON
CHECK ( ISJSON(JSalesOrderDetails)>0 ),
Quantity AS
CAST(JSON_VALUE(JSalesOrderDetails, '$.Order.Qty') AS int)
)
GO
CREATE INDEX idxJson
ON SalesOrderRecord(Quantity)
INCLUDE (Price);
JSON and relational
JSON is plain text
ISJSON guarantees consistency
Optimize further with
computed column and INDEX

SQL Server and Azure DocumentDB
Schema-free NoSQL
document store
Scalable transactional processing
for rapidly changing apps
Premium relational
DB capable to
exchange data with
modern apps & services
Derives unified insights from
structured/unstructured data
JSON
JS
JS
JSON

Query relational
and non-relational
data, on-premises
and in Azure
Apps
T-SQL query
SQL Server Hadoop
PolyBase
Query relational and non-relational data with T-SQL

Benchmark CNTK
https://github.com/Alexey-Kamenev/Benchmarks
https://github.com/Microsoft/CNTK

Use Microsoft R Open with…
Microsoft R Server Big-data analytics and distributed computing on Linux, Hadoop and Teradata
SQL Server 2016 | 2017 Big-data analytics integrated with SQL Server database
Azure SQL Database Database as a Service - cloud
Azure Data Lake Big-data analytics Solutions - cloud
PowerBI Computations and charts from R scripts in dashboards
Azure ML Studio R Scripts in cloud-based Experiment workflows
Visual Studio R Tools for Visual Studio: integrated development environment for R
HDInsights R integrated with cloud-based Hadoop clusters

Example Solutions
Fraud detection
Salesforecasting
Warehouse efficiency
Predictivemaintenance
Extensibility
Microsoft Azure
Machine Learning Marketplace
New R scripts
010010
100100
010101
010010
100100
010101
010010
100100
010101
010010
100100
010101
Built-in advanced analytics
In-database analytics
Built-in to SQL Server
Analytic Library
T-SQL Interface
Relational Data
010010
100100
010101
010010
100100
010101
Data Scientist
Interact directly
with data
Data Developer/DBA
Manage data and
analytics together
?
R
RIntegration

Scalable in-database analytics
Data Scientist
Interacts directly with data
Creates models
and experiments
Data Analyst/DBA
Manages data and
analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
010010
100100
010101
Relational Data
Extensibility
?
R
R Integration
Analytic Library
Open Source R
Revolution PEMA
T-SQL Interface
How is it Integrated?
• T-SQL calls a Stored Procedure
• Script is run in SQL through
extensibility model
• Result sets sent through Web API
to database or applications
Benefits
• Faster deployment of ML models
• Less data movement, faster
insights
• Work with large datasets: mitigate
R memory and scalability
limitations

Microsoft R Server
• Microsoft R Server for Redhat Linux
• Microsoft R Server for SUSE Linux
• Microsoft R Server for Teradata DB
• Microsoft R Server for Hadoop on Redhat
Microsoft R Server

Easy to use: add 2 lines to the top of each script
library(checkpoint)
checkpoint("2014-09-17")
For the package author:
Use package versions available on the chosen date
Installs packages local to this project
Allows different package versions to be used simultaneously
For a script collaborator:
Automatically installs required packages
Detects required packages (no need to manually install!)
Uses same package versions as script author to ensure reproducibility
Using checkpoint
16

ScaleR – Parallel + “Big Data”
Stream data in to RAM in blocks. “Big Data” can be any data size. We
handle Megabytes to Gigabytes to Terabytes…
Our ScaleR algorithms work
inside multiple cores / nodes in
parallel at high speed
Interim results are collected and
combined analytically to produce
the output on the entire data setXDF file format is optimised to work with the ScaleR library and
significantly speeds up iterative algorithm processing.

DistributedR - In-Hadoop
• Uses Hadoop nodes for R
computations
• Eliminate data movement
latency on very large data
• Remove data duplication
• Faster model development
• No MapReduce R coding
• Develop better models using
all the data
= Microsoft R Server

MRS and Hadoop Architecture options
R R R R R
R R R R R
ScaleR Production
RStudio Server Pro
Microsoft R Server
1. Copy
2. Stream
3. Send

Parallelized Algorithms
• Data import – Delimited, Fixed, SAS, SPSS, OBDC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
• Subsample (observations & variables)
• Random Sampling
Data Step Statistical Tests
Sampling
Descriptive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family
distributions: binomial, Gaussian, inverse Gaussian,
Poisson, Tweedie. Standard link functions: cauchit,
identity, log, logit, probit. User defined distributions & link
functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Predictive Models
• K-Means
• Decision Trees
• Decision Forests
• Stochastic Gradient Boosted Decision Trees
Cluster Analysis
Classification
Simulation
Variable Selection
• Stepwise Regression Linear, Logistic
and GLM
• Monte Carlo
• Parallel Random Number Generation
Combination
• Using Revolution rxDataStep and rxExec functions
to combine open source R with Revolution R
• PEMA API

Learners
Algorithms Strengths
rxFastLinear Fast, accurate linear learner with auto L1 & L2
rxLogisticRegression Logistic Regression with L1 & L2
rxFastTree
Boosted Decision tree from Bing. Competitive wth
XGBoost. Most accurate learner for most cases
rxFastForest Random Forest
rxNeuralNet GPU accelereted Net# DNNs with Convolutions
rxOneClassSvm Anomaly or unbalanced binary classification

Learners - Scalability
• Streaming (not RAM bound)
• Billions of features
• Multi-proc
• GPU acceleration for DNNs
• Distributed on Hadoop/Spark via Ensambling

Text Analytics Scenarios
Text Classification
• Email or support ticket routing
• Customer call triage
• Detect illegal trading activity
Sentiment Analysis
• Social media monitoring
• Call center support

Sentiment Analysis
• Pre-trained model
• Cognitive Service Parity
• Uses DNN Embedding
• Domain Adaptation

Image Featurization
• Image similarity search:
• Product Catalog search
• Classification
• Plankton Monitoring
• Galaxy Classification
• Retina Pathology detection
• Anomaly Detection
• Defects detection in manufacturing

Image Featurization
Convolutional DNNs with GPU
Pre-trained Models
• ResNet18
• ResNet 50
• ResNet 101
• AlexNet

Machine Learning Server in SQL Server 2017
•R Server 9.2
•Python (Anaconda)

Question?
lukasz@tidk.pl lukasz.grala@cs.put.poznan.pl
tidk.pl DSaaS.co facebook.com/TIDKpl grala.it

DataMass Summit - Machine Learning for Big Data in SQL Server

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie DataMass Summit - Machine Learning for Big Data in SQL Server

Ähnlich wie DataMass Summit - Machine Learning for Big Data in SQL Server (20)

Mehr von Łukasz Grala

Mehr von Łukasz Grala (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

DataMass Summit - Machine Learning for Big Data in SQL Server

Hinweis der Redaktion