1. The document discusses big data and data science libraries in Scala for tasks like preprocessing, machine learning, and evaluation.
2. It demonstrates using Spark and Smile libraries on a real dataset to optimize click-through rates by analyzing features like OS, categories, and time.
3. The document compares the performance of Spark and Smile for random forest classification and regression on a 13GB dataset.
2. Agenda
1. Big Data as motivation for Scala
2. Overview of data-science libraries in Scala
3. Demonstration of some libraries on a real dataset
4. Your choice in the pocket?
3. KDnuggets Polls: most popular tools in data science (2014, 2015, 2016)
1. R
2. Python
3. SQL
4. Context: Real Time Bidding
Raw requests: 200 000 requests per second
8 terabytes per day
7. Components that we need to solve the problem
Learning/optimisation algorithm
Mathematical analysis
Tuning/optimisation of the algorithm
Preprocessing
Evaluation
...
Visualisation
8. Frame your search: which library to pick?
Scala libraries: Spark, SparkTS, Smile, Breeze, Saddle
Criteria: learning algorithms, mathematical analysis, algorithm tuning, preprocessing, evaluation, visualisation
9. Frame your search: which library to pick?
Scala deep learning libraries: DeepLearning.scala (ThoughtWorks), Neuron, DeepLearning4j
10. Problem: optimize the click rate of delivered ads
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information
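13. Algorithm: Random Forest
Averaging the decisions from all the trees
[Figure: two example decision trees with Yes/No leaves. Split features in the first tree: os (Android, iOS), category (Games, Music), city (Paris, Nantes). Split features in the second tree: adType (Banner, Video), adSize (320x50, 480x320), weekDay (Saturday, Monday).]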
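As a minimal sketch of "averaging the decisions from all the trees", the snippet below models each tree as a plain function that votes 1.0 (Yes) or 0.0 (No) and takes the mean of the votes. The two trees, their split order, and the names `osTree`, `adTree` and `predict` are hypothetical illustrations built from the feature names on the slide; a real random forest learns its splits from data.

```scala
object ForestAveraging {
  // A request is a bag of categorical features, e.g. "os" -> "Android".
  type Request = Map[String, String]
  // A tree maps a request to a click vote: 1.0 for "Yes", 0.0 for "No".
  type Tree = Request => Double

  // Hypothetical hand-written tree splitting on os, then category or city.
  val osTree: Tree = r =>
    if (r.get("os").contains("Android")) {
      if (r.get("category").contains("Games")) 1.0 else 0.0
    } else {
      if (r.get("city").contains("Paris")) 1.0 else 0.0
    }

  // Hypothetical hand-written tree splitting on adType, then adSize or weekDay.
  val adTree: Tree = r =>
    if (r.get("adType").contains("Banner")) {
      if (r.get("adSize").contains("320x50")) 1.0 else 0.0
    } else {
      if (r.get("weekDay").contains("Saturday")) 1.0 else 0.0
    }

  // The forest's estimate is the mean of the individual tree votes.
  def predict(forest: Seq[Tree], r: Request): Double =
    forest.map(tree => tree(r)).sum / forest.size
}
```

With only two trees the estimate can be 0.0, 0.5 or 1.0; with hundreds of trees the mean behaves like a click probability.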
15. Os MaxPrice Time
Android 7.3 2016-06-09T0:25:28Z
iOS 4.55 2016-05-09T14:23:12Z
WindowsPhone 2.89 2016-06-09T11:35:11Z
19. Os MaxPrice Time Click
Android 7.3 2016-06-09T0:25:28Z False
iOS 4.55 2016-05-09T14:23:12Z True
WindowsPhone 2.89 2016-06-09T11:35:11Z False
20. Os MaxPrice Time Click
Android 7.3 2016-06-09T0:25:28Z False
iOS 4.55 2016-05-09T14:23:12Z True
WindowsPhone 2.89 2016-06-09T11:35:11Z False
Encoded as numeric features:
Os MaxPrice Time
3.0 6.0 1.0
5.0 3.0 5.0
1.0 2.0 3.0
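The step from raw rows to numeric features can be sketched in plain Scala as follows. This is an assumption about one possible encoding (category index by first appearance, hour of day for the timestamp), not the encoding that produced the numbers on the slide. The names `Raw`, `indexCategories`, `hourOfDay` and `encode` are hypothetical, and the timestamps are expected in zero-padded ISO-8601 form so that `java.time` can parse them.

```scala
import java.time.{Instant, ZoneOffset}

object EncodeFeatures {
  // One raw row, mirroring the Os / MaxPrice / Time / Click columns.
  final case class Raw(os: String, maxPrice: Double, time: String, click: Boolean)

  // Assign each distinct category a numeric index in order of first appearance.
  def indexCategories(values: Seq[String]): Map[String, Double] =
    values.distinct.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap

  // Reduce a zero-padded ISO-8601 UTC timestamp to its hour of day.
  def hourOfDay(iso: String): Double =
    Instant.parse(iso).atZone(ZoneOffset.UTC).getHour.toDouble

  // Turn each raw row into (feature vector, label), with label 1.0 for a click.
  def encode(rows: Seq[Raw]): Seq[(Array[Double], Double)] = {
    val osIndex = indexCategories(rows.map(_.os))
    rows.map { r =>
      val features = Array(osIndex(r.os), r.maxPrice, hourOfDay(r.time))
      val label = if (r.click) 1.0 else 0.0
      (features, label)
    }
  }
}
```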
21. Preprocessing: Spark ml
● Extraction: Extracting features from “raw” data
● Transformation: Scaling, converting, or modifying features
● Selection: Selecting a subset from a larger set of features
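To make the transformation and selection stages concrete, here is a small sketch in plain Scala rather than the Spark ml API. It assumes min-max scaling as the example transformation and positional column picking as the example selection; `minMaxScale` and `select` are hypothetical helper names.

```scala
object ScaleAndSelect {
  // Transformation: min-max scale one column into the range [0, 1].
  def minMaxScale(column: Seq[Double]): Seq[Double] = {
    val lo = column.min
    val hi = column.max
    // A constant column carries no information; map it to 0.0 to avoid dividing by zero.
    if (hi == lo) column.map(_ => 0.0)
    else column.map(x => (x - lo) / (hi - lo))
  }

  // Selection: keep only the columns at the given positions, in that order.
  def select(row: Seq[Double], keep: Seq[Int]): Seq[Double] =
    keep.map(i => row(i))
}
```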
22. Preprocessing: Saddle
Array-backed, specialized data structures
Pandas-like operations:
● dealing with missing values
● index transformation tools
● extracting, slicing, mapping row/column wise
● groupBy/join/concat
● sorting/pivoting
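Two of the operations listed above, handling missing values and groupBy, can be illustrated with plain Scala collections. This sketch deliberately does not use the Saddle API; `Bid`, `fillMissing` and `meanByOs` are hypothetical names that only show the kind of work such a library takes over.

```scala
object FrameOps {
  // A row whose price may be missing.
  final case class Bid(os: String, maxPrice: Option[Double])

  // Missing values: replace absent prices with a default.
  def fillMissing(rows: Seq[Bid], default: Double): Seq[(String, Double)] =
    rows.map(b => b.os -> b.maxPrice.getOrElse(default))

  // groupBy: mean price per OS.
  def meanByOs(rows: Seq[(String, Double)]): Map[String, Double] =
    rows.groupBy(_._1).map { case (os, group) =>
      os -> (group.map(_._2).sum / group.size)
    }
}
```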
24. Learning: Spark ml
Dataframe-based API
Pipeline interface
● Classification
● Regression
● Linear Methods
● Decision Trees
● Tree ensembles
TF-IDF String Indexer Assembler Random Forest Evaluation
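The pipeline idea, a fixed sequence of stages where each stage consumes the output of the previous one, can be sketched without Spark as function composition. This is an illustration of the concept only, not the Spark ml `Pipeline` class; `Row`, `Stage`, `addColumn`, `keepColumns` and `pipeline` are hypothetical names.

```scala
object MiniPipeline {
  // A row is a set of named numeric columns; a stage rewrites a whole dataset.
  type Row = Map[String, Double]
  type Stage = Seq[Row] => Seq[Row]

  // Hypothetical stage: add a column derived from the existing ones.
  def addColumn(name: String)(f: Row => Double): Stage =
    rows => rows.map(r => r + (name -> f(r)))

  // Hypothetical stage: keep only the listed columns.
  def keepColumns(names: Set[String]): Stage =
    rows => rows.map(r => r.filter { case (k, _) => names(k) })

  // A pipeline applies its stages from left to right.
  def pipeline(stages: Stage*): Stage =
    rows => stages.foldLeft(rows)((acc, stage) => stage(acc))
}
```

Because every stage has the same type, stages can be reordered, reused, or swapped without touching the rest of the chain, which is the main appeal of a pipeline interface.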
43. Spark
1. Spark SQL optimized methods
2. MLlib out-of-the-box feature engineering / feature selection
3. Dataset performance & type safety
44. Native Scala library
1. Type-safe & very performant
2. You have to implement all preprocessing stages and methods yourself
Execution time for 0.3 GB preprocessing: 1.2 seconds
Execution time for 13 GB preprocessing: 22 seconds
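The two timings can be turned into throughput figures with simple arithmetic. The sketch below only divides the sizes by the times reported on the slide; `gbPerSecond` is a hypothetical helper, and the results say nothing about hardware or setup, which the slide does not state.

```scala
object Throughput {
  // Throughput in gigabytes per second.
  def gbPerSecond(gigabytes: Double, seconds: Double): Double =
    gigabytes / seconds

  // 0.3 GB in 1.2 s, roughly 0.25 GB/s.
  val small: Double = gbPerSecond(0.3, 1.2)
  // 13 GB in 22 s, roughly 0.59 GB/s.
  val large: Double = gbPerSecond(13.0, 22.0)
}
```

Roughly 43 times more data takes about 18 times longer, so the measured throughput is higher on the larger input.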
48. Learning: Smile
Train model and predict on test dataframe
0.17041644829479835,0.0,0.24611540915530505,1.1389295846602683,
0.07655364222388063,0.0,0.0,0.009896625232551026,4.57453119760533,
0.36047880690737855,1.2020833333333334,0.007662298205433167,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0