This talk shows how three Scala libraries (Smile, Saddle and Spark ML) meet the requirements of new Big Data Science projects. We illustrate this with an example: click-through rate prediction.
7. Problem:
Optimize the click rate of delivered ads
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information
10. Os            MaxPrice  Time
    Android       7.3       2016-06-09T0:25:28Z
    iOS           4.55      2016-05-09T14:23:12Z
    WindowsPhone  2.89      2016-06-09T11:35:11Z
15. Os            MaxPrice  Time                  Click
    Android       7.3       2016-06-09T0:25:28Z   False
    iOS           4.55      2016-05-09T14:23:12Z  True
    WindowsPhone  2.89      2016-06-09T11:35:11Z  False

    Encoded as numbers:
    Os   MaxPrice  Time
    3.0  6.0       1.0
    5.0  3.0       5.0
    1.0  2.0       3.0
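The step from the raw rows to a numeric table can be sketched in plain Scala. This is only an illustration under assumptions, not the encoding used on the slide: the `Impression` case class and the `encode` function are hypothetical names, the OS is indexed by order of first appearance, the timestamp is reduced to the hour of day in UTC, and the first timestamp is written as `T00:25:28Z` so that it parses as strict ISO 8601. The numbers on the slide appear to be illustrative and are not reproduced here.

```scala
import java.time.{Instant, ZoneOffset}

object EncodingSketch {
  // One raw impression row, mirroring the columns shown on the slide.
  final case class Impression(os: String, maxPrice: Double, time: String, click: Boolean)

  // Hypothetical encoding (an assumption, not the slide's scheme):
  //   Os       -> index in order of first appearance
  //   MaxPrice -> kept as-is
  //   Time     -> hour of day in UTC
  //   Click    -> 0.0 / 1.0 label
  def encode(rows: Seq[Impression]): Seq[(Array[Double], Double)] = {
    val osIndex: Map[String, Double] =
      rows.map(_.os).distinct.zipWithIndex.map { case (os, i) => os -> i.toDouble }.toMap

    rows.map { r =>
      val hour = Instant.parse(r.time).atZone(ZoneOffset.UTC).getHour.toDouble
      val label = if (r.click) 1.0 else 0.0
      (Array(osIndex(r.os), r.maxPrice, hour), label)
    }
  }

  // The three rows from the slide (first timestamp normalized to "T00:").
  val rows: Seq[Impression] = Seq(
    Impression("Android", 7.3, "2016-06-09T00:25:28Z", click = false),
    Impression("iOS", 4.55, "2016-05-09T14:23:12Z", click = true),
    Impression("WindowsPhone", 2.89, "2016-06-09T11:35:11Z", click = false)
  )

  val encoded: Seq[(Array[Double], Double)] = encode(rows)
}
```

Each element of `EncodingSketch.encoded` pairs a feature array with its label, which is the shape most learning libraries expect.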
17. Preprocessing: Spark ML
Extraction: extracting features from “raw” data
● TF-IDF, Spark SQL
Transformation: scaling, converting, or modifying features
● Bucketizer, StringIndexer, IndexToString, VectorAssembler
Selection: selecting a subset from a larger set of features
● ChiSqSelector
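A minimal sketch of the transformation step, assuming Spark 2.x or later with `spark-mllib` on the classpath and a local master. The column names (`os`, `maxPrice`, `osIndex`, `priceBucket`, `features`) and the bucket boundaries are assumptions chosen for illustration; the three rows are taken from the example table.

```scala
import org.apache.spark.ml.feature.{Bucketizer, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PreprocessingSketch {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("ctr-preprocessing-sketch")
    .getOrCreate()
  import spark.implicits._

  // Raw rows from the example table (Time column left out for brevity).
  val raw: DataFrame = Seq(
    ("Android", 7.3),
    ("iOS", 4.55),
    ("WindowsPhone", 2.89)
  ).toDF("os", "maxPrice")

  // StringIndexer: maps each distinct OS string to a numeric index.
  val osIndexer = new StringIndexer()
    .setInputCol("os")
    .setOutputCol("osIndex")

  // Bucketizer: maps a continuous price to a bucket id.
  // The split points 3.0 and 6.0 are arbitrary, illustrative values.
  val priceBucketizer = new Bucketizer()
    .setInputCol("maxPrice")
    .setOutputCol("priceBucket")
    .setSplits(Array(Double.NegativeInfinity, 3.0, 6.0, Double.PositiveInfinity))

  // VectorAssembler: packs the numeric columns into one feature vector.
  val assembler = new VectorAssembler()
    .setInputCols(Array("osIndex", "priceBucket"))
    .setOutputCol("features")

  val indexed: DataFrame = osIndexer.fit(raw).transform(raw)
  val bucketed: DataFrame = priceBucketizer.transform(indexed)
  val features: DataFrame = assembler.transform(bucketed)
}
```

`StringIndexer` is an estimator and has to be fitted first, while `Bucketizer` and `VectorAssembler` are plain transformers, which is why only the first one calls `fit`.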
18. Preprocessing: Saddle
Array-backed, specialized data structures
Pandas-like operations:
● dealing with missing values
● index transformation tools
● extracting, slicing, mapping
● row/column-wise
● groupBy/join/concat
● sorting/pivoting
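A small sketch of the missing-value handling listed above, assuming Saddle 1.3.x (`org.saddle._`) is on the classpath; later versions may differ. The missing iOS price and the fill value `0.0` are made up for illustration, and `Double.NaN` is used here to represent a missing entry in a `Double` series.

```scala
import org.saddle._

object SaddleSketch {
  private val oses = Index("Android", "iOS", "WindowsPhone")

  // A Series is a Vec of values plus an Index of keys.
  // The iOS price is deliberately missing (illustrative assumption).
  val maxPrice: Series[String, Double] =
    Series(Vec(7.3, Double.NaN, 2.89), oses)

  // Option 1: drop the entries whose value is missing.
  val observed: Series[String, Double] = maxPrice.dropNA

  // Option 2: replace missing values with a default (0.0 is arbitrary).
  val filled: Series[String, Double] = maxPrice.fillNA(_ => 0.0)

  // Click labels encoded as 0.0 / 1.0, keyed by the same index.
  val click: Series[String, Double] =
    Series(Vec(0.0, 1.0, 0.0), oses)

  // A Frame combines several Series that share a row index.
  val frame: Frame[String, String, Double] =
    Frame("MaxPrice" -> filled, "Click" -> click)
}
```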
20. Learning: Spark ML
DataFrame-based API
Pipeline interface
Classification
Regression
Linear Methods
Decision Trees
Tree ensembles
Example pipeline: TF-IDF -> StringIndexer -> VectorAssembler -> Random Forest -> Evaluation
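The Pipeline interface can be sketched as follows, assuming Spark 2.x or later with `spark-mllib` and a local master. This is not the pipeline from the talk: the TF-IDF stage is left out because the example table has no text column, the column names are assumptions, and the rows beyond the first three are invented toy data so that the forest has something to fit. The model is evaluated on its own training data only to keep the sketch short; a real project would hold out a test set.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("ctr-pipeline-sketch")
    .getOrCreate()
  import spark.implicits._

  // First three rows follow the example table; the rest are made-up toy rows.
  val data: DataFrame = Seq(
    ("Android", 7.3, 0.0),
    ("iOS", 4.55, 1.0),
    ("WindowsPhone", 2.89, 0.0),
    ("iOS", 5.1, 1.0),
    ("Android", 6.8, 0.0),
    ("iOS", 3.9, 1.0)
  ).toDF("os", "maxPrice", "label")

  val osIndexer = new StringIndexer()
    .setInputCol("os")
    .setOutputCol("osIndex")

  val assembler = new VectorAssembler()
    .setInputCols(Array("osIndex", "maxPrice"))
    .setOutputCol("features")

  val forest = new RandomForestClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumTrees(10)
    .setSeed(42L)

  // The pipeline chains the stages; fit() runs them in order.
  val pipeline: Pipeline = new Pipeline()
    .setStages(Array[PipelineStage](osIndexer, assembler, forest))

  val model = pipeline.fit(data)
  val predictions: DataFrame = model.transform(data)

  // Area under the ROC curve, computed on the training data (illustration only).
  val auc: Double = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")
    .setMetricName("areaUnderROC")
    .evaluate(predictions)
}
```

Because every stage reads and writes DataFrame columns, swapping the random forest for another classifier only changes one stage of the pipeline.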