Sesja pokazująca zarówno Machine Learning Server (czyli algorytmy uczenia maszynowego w językach R i Python), ale także możliwość korzystania z danych JSON w SQL Server, czy też łączenia się do danych znajdujących się na HDFS, HADOOP, czy Spark poprzez Polybase w SQL Server, by te dane wykorzystywać do analizy, predykcji poprzez modele w językach R lub Python.
DataMass Summit - Machine Learning for Big Data in SQL Server
1. Machine Learning
for Big Data in SQL Server
Łukasz Grala
lukasz@tidk.pl | lukasz.grala@cs.put.poznan.pl
2. Łukasz Grala
• Senior architekt rozwiązań Platformy Danych & Business Intelligence & Zaawansowanej Analityki w TIDK
• Twórca „Data Scientist as as Service”
• Certyfikowany trener Microsoft i wykładowca na wyższych uczelniach
• Autor zaawansowanych szkoleń i warsztatów, oraz licznych publikacji i webcastów
• Od 2010 roku wyróżniany nagrodą Microsoft Data Platform MVP
• Doktorant Politechnika Poznańska – Wydział Informatyki (obszar bazy danych, eksploracja danych, uczenie maszynowe)
• Prelegent na licznych konferencjach w kraju i na świecie
• Posiada liczne certyfikaty (MCT, MCSE, MCSA, MCITP,…)
• Członek zarządu Polskiego Towarzystwa Informatycznego Oddział Wielkopolski
• Członek i lider Polish SQL Server User Group (PLSSUG)
• Pasjonat analizy, przechowywania i przetwarzania danych, miłośnik Jazzu
email lukasz@tidk.pl - lukasz.grala@cs.put.poznan.pl blog: grala.it
3. Gartner – Magic Quadrant
Business Intelligence and Analytics PlatformsData Management Solutions for Analytics
6. CREATE TABLE SalesOrderRecord (
Id int PRIMARY KEY IDENTITY,
OrderNumber NVARCHAR(25) NOT NULL,
OrderDate DATETIME NOT NULL,
JSalesOrderDetails NVARCHAR(4000)
CONSTRAINT SalesOrderDetails_IS_JSON
CHECK ( ISJSON(JSalesOrderDetails)>0 ),
Quantity AS
CAST(JSON_VALUE(JSalesOrderDetails, '$.Order.Qty') AS int)
)
GO
CREATE INDEX idxJson
ON SalesOrderRecord(Quantity)
INCLUDE (Price);
JSON and relational
JSON is plain text
ISJSON guarantees consistency
Optimize further with
computed column and INDEX
7. SQL Server and Azure DocumentDB
Schema-free NoSQL
document store
Scalable transactional processing
for rapidly changing apps
Premium relational
DB capable to
exchange data with
modern apps & services
Derives unified insights from
structured/unstructured data
JSON
JS
JS
JSON
11. Use Microsoft R Open with…
Microsoft R Server Big-data analytics and distributed computing on Linux, Hadoop and Teradata
SQL Server 2016 | 2017 Big-data analytics integrated with SQL Server database
Azure SQL Database Database as a Service - cloud
Azure Data Lake Big-data analytics Solutions - cloud
PowerBI Computations and charts from R scripts in dashboards
Azure ML Studio R Scripts in cloud-based Experiment workflows
Visual Studio R Tools for Visual Studio: integrated development environment for R
HDInsights R integrated with cloud-based Hadoop clusters
12. Example Solutions
Fraud detection
Salesforecasting
Warehouse efficiency
Predictivemaintenance
Extensibility
Microsoft Azure
Machine Learning Marketplace
New R scripts
010010
100100
010101
010010
100100
010101
010010
100100
010101
010010
100100
010101
Built-in advanced analytics
In-database analytics
Built-in to SQL Server
Analytic Library
T-SQL Interface
Relational Data
010010
100100
010101
010010
100100
010101
Data Scientist
Interact directly
with data
Data Developer/DBA
Manage data and
analytics together
?
R
RIntegration
13. Scalable in-database analytics
Data Scientist
Interacts directly with data
Creates models
and experiments
Data Analyst/DBA
Manages data and
analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
010010
100100
010101
Relational Data
Extensibility
?
R
R Integration
Analytic Library
Open Source R
Revolution PEMA
T-SQL Interface
How is it Integrated?
• T-SQL calls a Stored Procedure
• Script is run in SQL through
extensibility model
• Result sets sent through Web API
to database or applications
Benefits
• Faster deployment of ML models
• Less data movement, faster
insights
• Work with large datasets: mitigate
R memory and scalability
limitations
14. Microsoft R Server
• Microsoft R Server for Redhat Linux
• Microsoft R Server for SUSE Linux
• Microsoft R Server for Teradata DB
• Microsoft R Server for Hadoop on Redhat
Microsoft R Server
16. Easy to use: add 2 lines to the top of each script
library(checkpoint)
checkpoint("2014-09-17")
For the package author:
Use package versions available on the chosen date
Installs packages local to this project
Allows different package versions to be used simultaneously
For a script collaborator:
Automatically installs required packages
Detects required packages (no need to manually install!)
Uses same package versions as script author to ensure reproducibility
Using checkpoint
16
17. ScaleR – Parallel + “Big Data”
Stream data in to RAM in blocks. “Big Data” can be any data size. We
handle Megabytes to Gigabytes to Terabytes…
Our ScaleR algorithms work
inside multiple cores / nodes in
parallel at high speed
Interim results are collected and
combined analytically to produce
the output on the entire data setXDF file format is optimised to work with the ScaleR library and
significantly speeds up iterative algorithm processing.
18. DistributedR - In-Hadoop
• Uses Hadoop nodes for R
computations
• Eliminate data movement
latency on very large data
• Remove data duplication
• Faster model development
• No MapReduce R coding
• Develop better models using
all the data
= Microsoft R Server
19. MRS and Hadoop Architecture options
R R R R R
R R R R R
ScaleR Production
RStudio Server Pro
Microsoft R Server
1. Copy
2. Stream
3. Send
20. Parallelized Algorithms
• Data import – Delimited, Fixed, SAS, SPSS, OBDC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
• Subsample (observations & variables)
• Random Sampling
Data Step Statistical Tests
Sampling
Descriptive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family
distributions: binomial, Gaussian, inverse Gaussian,
Poisson, Tweedie. Standard link functions: cauchit,
identity, log, logit, probit. User defined distributions & link
functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Predictive Models
• K-Means
• Decision Trees
• Decision Forests
• Stochastic Gradient Boosted Decision Trees
Cluster Analysis
Classification
Simulation
Variable Selection
• Stepwise Regression Linear, Logistic
and GLM
• Monte Carlo
• Parallel Random Number Generation
Combination
• Using Revolution rxDataStep and rxExec functions
to combine open source R with Revolution R
• PEMA API
21. Learners
Algorithms Strengths
rxFastLinear Fast, accurate linear learner with auto L1 & L2
rxLogisticRegression Logistic Regression with L1 & L2
rxFastTree
Boosted Decision tree from Bing. Competitive wth
XGBoost. Most accurate learner for most cases
rxFastForest Random Forest
rxNeuralNet GPU accelereted Net# DNNs with Convolutions
rxOneClassSvm Anomaly or unbalanced binary classification
22. Learners - Scalability
• Streaming (not RAM bound)
• Billions of features
• Multi-proc
• GPU acceleration for DNNs
• Distributed on Hadoop/Spark via Ensambling
23. Text Analytics Scenarios
Text Classification
• Email or support ticket routing
• Customer call triage
• Detect illegal trading activity
Sentiment Analysis
• Social media monitoring
• Call center support
Source: https://msdn.microsoft.com/en-us/library/dn921882(v=sql.130).aspx
Format query results as JSON by adding the FOR JSON clause to a SELECT statement.
Use the FOR JSON clause to delegate the formatting of JSON output from your client applications to SQL Server. For more info, see Use JSON output in SQL Server and in client apps (SQL Server).
See TechNet Virtual Lab “Exploring SQL Server 2016 support for JSON data” - http://go.microsoft.com/?linkid=9898458
When it comes to key BI investments we are making it much easier to manage relational and non-relational data with Polybase technology that allows you to query Hadoop data and SQL Server relational data through single T-SQL query. One of the challenges we see with Hadoop is there are not enough people out there with Hadoop and Map Reduce skillset and this technology simplifies the skillset needed to manage Hadoop data. This can also work across your on-premises environment or SQL Server running in Azure.
AI:
ML (Uczenie Bayesowskie, Drzew decyzyjnych, Zbioru reguł,
Sieci ekspertowe
Sieci neuronowe
Dowodzenie twierdzeń
Podejmowanie decyzji przy braku pełnych danych
Rozumowanie logiczne/racjonalne
Logika rozmyta
Algorytmy ewolucyjne