Big Data Science in Scala V2

Big Data-Science
in Scala
Anastasia Lieva
Data Scientist
@lievAnastazia

Agenda
1. Big Data as motivation for Scala
2. Overview of data-science libraries in Scala
3. Demonstration of some libraries on a real dataset
4. Your choice in the pocket?

KDnuggets polls 2014, 2015, 2016: most popular tools in data science
1. R
2. Python
3. SQL

Context: Real Time Bidding
Raw requests: 200 000 requests per second
8 terabytes per day

R
Python
SQL
Scala
Spark ML/DATAFRAME/SQL
SMILE
Saddle
Breeze

Components that we need to solve the problem
Learning/optimisation algorithm
Mathematical analysis
Tuning/optimisation of the algorithm
Preprocessing
Evaluation
...
Visualisation

Frame your search
Which library to pick?
Scala libraries: Spark, SparkTS, Smile, Breeze, Saddle
Capabilities: learning algorithms, mathematical analysis, algorithms tuning, preprocessing, evaluation, visualisation

Frame your search
Which library to pick?
Scala deep learning libraries: DeepLearning.scala (ThoughtWorks), Neuron, DeepLearning4j

Problem:
Optimize the click rate of delivered ads
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information

Frame the problem!
Descriptive statistics
Time series analysis
Clustering
Classification
Regression
...

Visualisation
Preprocessing: features engineering, features selection, features extraction
Machine Learning: hyperparameter tuning, algorithm optimization, algorithm
Evaluation: evaluation strategies, evaluation metrics, visualisation

Algorithm:
Random Forest
Averaging the decisions from all the trees
[Figure: two example decision trees with Yes/No leaves. Tree 1 splits on os (Android / iOS), Categorie (Games / Music) and City (Paris / Nantes). Tree 2 splits on adType (Banner / Video), adSize (320x50 / 480x320) and weekDay (Saturday / Monday).]
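
The averaging step can be sketched in a few lines of plain Scala. This is only an illustration, not a training algorithm and not the code from the talk: the names (Request, Tree, ForestSketch), the branch layout of the two stub trees and their leaf probabilities are all assumptions loosely modelled on the figure.

```scala
// Sketch of "averaging the decisions from all the trees".
// Request, Tree and the leaf values below are hypothetical.
object ForestSketch {
  final case class Request(os: String, category: String, city: String,
                           adType: String, adSize: String, weekDay: String)

  // A "tree" is reduced to a function returning a click probability in [0, 1].
  type Tree = Request => Double

  // Hand-written stubs standing in for trained trees; the numbers are made up.
  val tree1: Tree = r =>
    if (r.os == "Android") { if (r.category == "Games") 0.8 else 0.2 }
    else { if (r.city == "Paris") 0.6 else 0.1 }

  val tree2: Tree = r =>
    if (r.adType == "Banner") { if (r.adSize == "320x50") 0.3 else 0.7 }
    else { if (r.weekDay == "Saturday") 0.9 else 0.4 }

  // The forest's estimate is the mean of the per-tree estimates.
  def predict(trees: Seq[Tree], r: Request): Double =
    trees.map(_(r)).sum / trees.size
}
```

A real forest would learn many such trees from bootstrapped samples; only the final aggregation is shown here.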

Raw data
{
"id":"951cb9f5-2bab-46ce-b759-8245cffxxxxx",
"time":"2016-06-09T0:25:28Z",
"bidfloor":2.88,
"appOrSite":"app",
"adType":"banner",
"categories":"games,news,football",
"publisherId":"11e281c1123139xxxxx",
"carrier":"208-10",
"os":"iOS",
"connectionType":3,
"coords":[48.929256439208984, 2.4255824089050293],
"adSize":[320, 50],
"exchange":"xxxxx",
[...],
"clicked":true
}
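
One way to carry such a record through Scala code is a case class. The sketch below is an assumption about how the data could be typed, not the representation used in the talk: BidRequest and its helpers are hypothetical names, only a subset of the visible fields is modelled, and the fields elided by "[...]" are left out.

```scala
object RawRecordSketch {
  // Field names mirror JSON keys from the raw record; elided fields are not modelled.
  final case class BidRequest(
    id: String, time: String, bidfloor: Double, appOrSite: String, adType: String,
    categories: String, os: String, connectionType: Int,
    coords: (Double, Double), adSize: (Int, Int), clicked: Boolean)

  // "games,news,football" -> List("games", "news", "football")
  def categoryList(r: BidRequest): List[String] =
    r.categories.split(",").map(_.trim).filter(_.nonEmpty).toList

  // (320, 50) -> "320x50", a categorical value a tree can split on
  def adSizeLabel(r: BidRequest): String = s"${r.adSize._1}x${r.adSize._2}"

  // Target for a binary classifier: 1 = clicked, 0 = not clicked.
  def label(r: BidRequest): Int = if (r.clicked) 1 else 0
}
```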

Os            MaxPrice  Time                  Click
Android       7.3       2016-06-09T0:25:28Z   False
iOS           4.55      2016-05-09T14:23:12Z  True
WindowsPhone  2.89      2016-06-09T11:35:11Z  False

Numeric encoding:
Os   MaxPrice  Time
3.0  6.0       1.0
5.0  3.0       5.0
1.0  2.0       3.0

Preprocessing: Spark ml
● Extraction: Extracting features from “raw” data
● Transformation: Scaling, converting, or modifying features
● Selection: Selecting a subset from a larger set of features
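
To make "transformation" and "selection" concrete, here is a minimal sketch in plain Scala collections. It is not Spark ml code: PreprocessSketch and its functions are hypothetical names, and min-max scaling and dropping constant columns are just two simple representatives of those categories. A non-empty, rectangular input is assumed.

```scala
object PreprocessSketch {
  type Row = Array[Double]

  // Transformation: rescale every column to [0, 1]; constant columns map to 0.0.
  def minMaxScale(rows: Array[Row]): Array[Row] = {
    val nCols = rows.head.length
    val mins = Array.tabulate(nCols)(j => rows.map(_(j)).min)
    val maxs = Array.tabulate(nCols)(j => rows.map(_(j)).max)
    rows.map(_.zipWithIndex.map { case (v, j) =>
      val range = maxs(j) - mins(j)
      if (range == 0.0) 0.0 else (v - mins(j)) / range
    })
  }

  // Selection: indices of the columns that take more than one distinct value.
  def nonConstantColumns(rows: Array[Row]): Seq[Int] =
    rows.head.indices.filter(j => rows.map(_(j)).distinct.length > 1)
}
```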

Preprocessing: Saddle
Array-backed, specialized data structures
Pandas-like operations:
● dealing with missing values
● index transformation tools
● extracting, slicing, mapping row-wise/column-wise
● groupBy/join/concat
● sorting/pivoting
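
For readers who have not used such operations, the sketch below shows two of them (filling missing values and a groupBy aggregation) with plain Scala collections. It deliberately does not use the Saddle API; FrameOpsSketch, Row and the mean-imputation rule are assumptions made for illustration.

```scala
object FrameOpsSketch {
  // A missing price is modelled as None.
  final case class Row(os: String, maxPrice: Option[Double])

  // Missing values: replace None with the mean of the observed prices.
  def fillWithMean(rows: Seq[Row]): Seq[(String, Double)] = {
    val seen = rows.flatMap(_.maxPrice)
    val mean = if (seen.isEmpty) 0.0 else seen.sum / seen.size
    rows.map(r => (r.os, r.maxPrice.getOrElse(mean)))
  }

  // groupBy: mean price per operating system.
  def meanByOs(filled: Seq[(String, Double)]): Map[String, Double] =
    filled.groupBy(_._1).map { case (os, xs) => os -> xs.map(_._2).sum / xs.size }
}
```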

Learning: Spark ml
Dataframe-based API
Pipeline interface
● Classification
● Regression
● Linear Methods
● Decision Trees
● Tree ensembles
TF-IDF -> String Indexer -> Assembler -> Random Forest -> Evaluation
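
A pipeline of this shape could be assembled as follows. This is a sketch, not the code from the talk: it needs spark-mllib on the classpath, the column names ("os", "bidfloor", "label") and hyperparameter values are assumptions, and the TF-IDF and evaluation stages are omitted to keep it short.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object PipelineSketch {
  // Column names are assumptions about the input DataFrame.
  val osIndexer = new StringIndexer().setInputCol("os").setOutputCol("osIdx")

  // Collects the numeric columns into a single vector column.
  val assembler = new VectorAssembler()
    .setInputCols(Array("osIdx", "bidfloor"))
    .setOutputCol("features")

  // Placeholder hyperparameters; "label" is assumed to be a numeric 0/1 column.
  val forest = new RandomForestClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumTrees(50)

  val pipeline = new Pipeline()
    .setStages(Array[PipelineStage](osIndexer, assembler, forest))

  // Typical use, given DataFrames `train` and `test`:
  //   val model  = pipeline.fit(train)
  //   val scored = model.transform(test)
}
```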

Learning: Smile
Array-backed API
● Classification
● Regression
● Linear Methods
● Decision Trees
● Tree ensembles
★ Visualisation
★ Missing Values Imputation
★ Association Rule Mining
★ Manifold learning
★ Multi-dimensional scaling
★ Feature selection and dimensionality reduction

Saddle
Preprocessing: features engineering, features selection, features extraction

Saddle
Create the dataframe
Balance the data
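
Clicks are rare compared with non-clicks, so the classes are usually balanced before training. A minimal sketch, assuming random undersampling of the larger class (the talk's actual balancing method is not shown in the transcript; BalanceSketch and undersample are hypothetical names, and plain Scala collections are used instead of a Saddle frame):

```scala
import scala.util.Random

object BalanceSketch {
  // Randomly drop rows of the larger class until both classes have the same size.
  def undersample[A](rows: Seq[A], seed: Long)(clicked: A => Boolean): Seq[A] = {
    val rng = new Random(seed)
    val (pos, neg) = rows.partition(clicked)
    val n = math.min(pos.size, neg.size)
    rng.shuffle(pos).take(n) ++ rng.shuffle(neg).take(n)
  }
}
```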

Preprocessing: Saddle
Split randomly into train and test sets
and convert to the input type required by the Smile Random Forest implementation
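
The two steps can be sketched without any library. The assumption here, in line with Smile's array-backed API, is that the classifier expects one Array[Double] of features per row plus an Int label per row; SplitSketch, Example and the 0/1 label encoding are hypothetical.

```scala
import scala.util.Random

object SplitSketch {
  final case class Example(features: Array[Double], clicked: Boolean)

  // Shuffle once, then cut: the first trainFraction of the rows form the training set.
  def trainTestSplit(rows: Seq[Example], trainFraction: Double, seed: Long): (Seq[Example], Seq[Example]) = {
    val shuffled = new Random(seed).shuffle(rows)
    shuffled.splitAt((rows.size * trainFraction).toInt)
  }

  // Array-based layout: feature matrix plus a label vector (1 = clicked, 0 = not clicked).
  def toArrays(rows: Seq[Example]): (Array[Array[Double]], Array[Int]) =
    (rows.map(_.features).toArray, rows.map(r => if (r.clicked) 1 else 0).toArray)
}
```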

Saddle
1. Easy-to-use structures out of the box: frame, matrix, series, vectors
2. Not actively developed
3. Dataframes are not type-safe

Spark
Preprocessing: features engineering, features selection, features extraction

Databricks Notebook
Display and download options

balance the data

Index categorical data
timestamp   os             osIdx
1465037789  iOS            1
1464983457  Windows Phone  2
1465019529  Android        0
1464974567  iOS            1
1465018552  Android        0
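
The osIdx values above are consistent with frequency-based indexing, where the most frequent label gets index 0. A plain-Scala sketch of that rule follows; IndexSketch is a hypothetical name, and breaking ties alphabetically is an assumption that happens to reproduce the table.

```scala
object IndexSketch {
  // Most frequent label gets index 0; ties are broken alphabetically (assumption).
  def fitIndex(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, xs) => (label, xs.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap
}
```

In Spark ml the same job is done by the String Indexer stage of the pipeline.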

Conversion and sampling

Spark
1. Spark SQL optimized methods
2. MLlib out-of-the-box features engineering / features selection
3. Dataset performance & type safety

Native Scala library
1. Type-safe and very performant
2. You have to implement all preprocessing stages and methods yourself
Execution time for preprocessing 0.3 GB: 1.2 seconds
Execution time for preprocessing 13 GB: 22 seconds

Visualisation
Preprocessing: features engineering, features selection, features extraction
Random Forest
[Figure: the two example decision trees from the "Algorithm: Random Forest" slide]

Smile
Machine Learning: hyperparameter tuning, algorithm optimization, algorithm

Learning: Smile
Construct Classifier and set
hyperparameters
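
The classifier code itself is not part of this transcript. As a library-independent illustration of choosing hyperparameters, the sketch below enumerates a small grid and keeps the best-scoring combination. RfParams, its two fields and the scoring function are hypothetical; in practice the score would come from training and evaluating a model for each candidate.

```scala
object GridSketch {
  // Two typical random forest hyperparameters (illustrative choice).
  final case class RfParams(numTrees: Int, maxDepth: Int)

  // Cartesian product of the candidate values.
  def grid(numTrees: Seq[Int], maxDepths: Seq[Int]): Seq[RfParams] =
    for (n <- numTrees; d <- maxDepths) yield RfParams(n, d)

  // Keep the candidate with the highest score (e.g. a validation metric).
  def best(candidates: Seq[RfParams])(score: RfParams => Double): RfParams =
    candidates.maxBy(score)
}
```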

Learning: Smile
Train the model and predict on the test dataframe
0.17041644829479835,0.0,0.24611540915530505,1.1389295846602683,
0.07655364222388063,0.0,0.0,0.009896625232551026,4.57453119760533,
0.36047880690737855,1.2020833333333334,0.007662298205433167,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Spark
Machine Learning: hyperparameter tuning, algorithm optimization, algorithm

Learning: Spark ml
Construct the classifier and set hyperparameters

Pipeline interface
String Indexer, Tokenizer, Bucketizer, PCA, Assembler

Evaluators
Spark: Regression, Binary Classification, Multiclass Classification
Smile: Regression, Classification

Compare Spark and Smile Random Forest
Classification metrics
[Chart with two panels: "The higher the better" and "The lower the better"]
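
Which metrics the chart showed is not preserved in this transcript. As a reminder of how common "higher is better" classification metrics are computed, here is a small sketch in plain Scala; MetricsSketch and the choice of accuracy, precision, recall and F1 are assumptions, not the talk's evaluation code.

```scala
object MetricsSketch {
  final case class Metrics(accuracy: Double, precision: Double, recall: Double, f1: Double)

  // Labels: 1 = clicked, 0 = not clicked. A ratio with a zero denominator is reported as 0.0.
  def evaluate(actual: Seq[Int], predicted: Seq[Int]): Metrics = {
    require(actual.size == predicted.size && actual.nonEmpty)
    val pairs = actual.zip(predicted)
    val tp = pairs.count { case (a, p) => a == 1 && p == 1 }.toDouble
    val fp = pairs.count { case (a, p) => a == 0 && p == 1 }.toDouble
    val fn = pairs.count { case (a, p) => a == 1 && p == 0 }.toDouble
    val correct = pairs.count { case (a, p) => a == p }.toDouble
    def ratio(n: Double, d: Double): Double = if (d == 0.0) 0.0 else n / d
    val precision = ratio(tp, tp + fp)
    val recall = ratio(tp, tp + fn)
    Metrics(correct / pairs.size, precision, recall,
      ratio(2 * precision * recall, precision + recall))
  }
}
```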

Compare Spark and Smile Random Forest
[Chart: running time on 13 GB, in minutes]

Compare preprocessing:
Spark vs Saddle

My List[tools] for THIS project:
Preprocessing: Spark
Machine Learning (Random Forest): Smile

Your Option[tools] for YOUR project:
Spark
Spark TS
SMILE
Breeze
Saddle

Thank you for your attention!
Now go do data science to save the world
@lievAnastazia
