No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin Kulka and Michał Kaczmarczyk

Building accurate machine learning models has long been the art of data scientists: algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through these “black arts” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB of memory and 17TB of SSD storage into a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from the reliability and stability standpoints. This talk covers the presentation already shown at Spark Summit SF’17 (#SFds5), but from a more technical perspective.
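
The search described above (algorithm, parameters and features explored without manual work) can be pictured as enumerating a grid of candidate configurations and keeping the one with the best validation score. The following is only a minimal sketch under stated assumptions: the search-space entries, the function names and the toy scoring function are hypothetical placeholders, not the system presented in the talk, and a real run would train and validate a model for each candidate.

```python
from itertools import product

# Hypothetical search space; the names and values are illustrative only.
search_space = {
    "preprocessing": ["standardize", "binarize"],
    "algorithm": ["logistic_regression", "decision_tree"],
    "regularization": [0.1, 1.0, 10.0],
}


def enumerate_candidates(space):
    """Return every combination of the search-space values as a dict."""
    keys = sorted(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def select_best(candidates, validation_score):
    """Pick the candidate with the highest validation score."""
    return max(candidates, key=validation_score)


def toy_score(candidate):
    """Stand-in for 'train on training data, score on validation data'."""
    score = 1.0 if candidate["algorithm"] == "decision_tree" else 0.0
    score += 0.5 if candidate["preprocessing"] == "standardize" else 0.0
    score -= abs(candidate["regularization"] - 1.0)
    return score


candidates = enumerate_candidates(search_space)
best = select_best(candidates, toy_score)
print(len(candidates), best)
```

Even this tiny grid yields 2 × 2 × 3 = 12 candidates, which illustrates why the number of models grows quickly once feature subsets are added.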

Published in: Data & Analytics

  1. No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark. Marcin Kulka and Michał Kaczmarczyk, 9LivesData. Oct/26/2017. #EUai9
  2. Who we are • Marcin Kulka – Senior Software Engineer • Michał Kaczmarczyk (Ph.D.) – Software Architect, Team Leader and Project Manager
  3. Who we are • Advanced software R&D company (Warsaw, Poland) • 75+ scientists and software engineers • Specializing in scalable storage, distributed and big data systems • Cooperating with partners all around the world
  4. (No text on this slide.)
  5. • Masato Asahara (Ph.D.) – Researcher, NEC Data Science Research Laboratory • Ryohei Fujimaki (Ph.D.) – Research Fellow, NEC Data Science Research Laboratory
  6. Agenda • Typical use case for predictive modeling problem • Our technology: Automatic Predictive Modeling • Design challenges • Evaluation results • Our observations
  7. Motivation
  8. Predictive analysis in industry and business • Driver risk assessment • Inventory optimization • Churn retention • Predictive maintenance • Product price optimization • Sales optimization • Energy/water operation management
  9. ... but predictive modeling • Takes a long time • Requires high skills
  10. Typical predictive modeling use case • Training data • Validation data • Test data • Highly accurate prediction results
  11. Typical predictive modeling use case • Predictive models • Training data • Validation data • Test data • Highly accurate prediction results
  12. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box)
  13. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance
  14. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
  15. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • A lot of effort, many models…
  16. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • A lot of effort, many models… • Many iterations, weeks...
  17. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • A lot of effort, many models… • Many iterations, weeks... • Sophisticated knowledge...
  18. Automatic predictive modeling • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
  19. Automatic predictive modeling • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • Highly accurate results in a short time!
  20. Our technology
  21. Exploring massive modeling possibilities • Data preprocessing strategies
  22. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms
  23. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms • Feature selection!
  24. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms • Feature selection! • Hyperparameter tuning
  25. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms • Feature selection! • Hyperparameter tuning • 1000s of models!
  27. Automating and accelerating with Spark • Data preprocessing strategies • Algorithms • Feature selection! • Hyperparameter tuning • Complete in hours!
  28. Modeling flow = training + validation • Training data • Validation data • Validation criteria
  29. Modeling flow = training + validation • Training data → Training models → Models • Models + Validation data + Validation criteria → Validating models → Best model • Test data
  30. Modeling and prediction flow • Training data → Training models → Models • Models + Validation data + Validation criteria → Validating models → Best model • Best model + Test data → Prediction → Best prediction
  31. Design challenges and solutions
  32. Challenges to achieve high execution performance • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing • Icon labels: θ1, θ2, θ3
  34. Using native ML engines in Spark: why?
  35. Comparison of Spark (+ Spark ML) and native ML engines
  36. Comparison of Spark (+ Spark ML) and native ML engines • Scalability: Spark yes; native ML engines no (or very limited)
  37. Comparison of Spark (+ Spark ML) and native ML engines • Scalability: Spark yes; native ML engines no (or very limited) • Choice of algorithms: Spark some; native ML engines many (+ possibly some custom, very efficient) • Annotation: Accuracy
  38. Comparison of Spark (+ Spark ML) and native ML engines • Scalability: Spark yes; native ML engines no (or very limited) • Choice of algorithms: Spark some; native ML engines many (+ possibly some custom, very efficient) • Performance: Spark medium; native ML engines extremely high • Annotations: Distributed nature, synchronization overhead; Accuracy; If data fits a single server
  40. Comparison of Spark (+ Spark ML) and native ML engines • Scalability: Spark yes; native ML engines no (or very limited) • Choice of algorithms: Spark some; native ML engines many (+ possibly some custom, very efficient) • Performance: Spark medium; native ML engines extremely high • We would like to combine Spark and ML engines
  41. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Models
  42. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Models
  43. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Machine Learning (map operation) → Models
  44. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Machine Learning (map operation) → Models • ’Single ML engine’ on a single executor
  45. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Machine Learning (map operation) → Models • ’Single ML engine’ on a single executor • Input requirements: size & format
  46. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Machine Learning (map operation) → Models
  47. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix) → Machine Learning (map operation) → Models
  48. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix) → Machine Learning (map operation) → Models • RDD of huge, efficiently stored objects optimized for ML computations!
  49. Combining Spark and ML engines for training • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix) → Machine Learning (map operation) → 1000s of models on HDFS • RDD of huge, efficiently stored objects optimized for ML computations!
  50. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS
  51. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS → Data preprocessing (MapReduce)
  52. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix)
  53. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix) → Prediction (map operation) • Computing validation results for many models
  54. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix) → Prediction (map operation) → Validation (MapReduce) • Computing validation scores
  55. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix) → Prediction (map operation) → Validation (MapReduce) → Best model on HDFS
  56. Combining Spark and ML engines for prediction • Test data (parquet) on HDFS → Data preprocessing (MapReduce) → Convert to RDD[Matrix] (Matrix, Matrix, Matrix) → Predict (map operation) → Prediction results (parquet) on HDFS • Computations only for selected models
  57. Design challenges • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing • Icon labels: θ1, θ2, θ3
  58. Many models to schedule • Matrix X1, Matrix X2, Matrix X3
  59. Many models to schedule • Matrix X1, Matrix X2, Matrix X3 • Parameters θ1, θ2, θ3, ...: algorithms, hyperparameters, data preprocessing strategies
  60. Many models to schedule • Matrix X1, Matrix X2, Matrix X3 • Parameters θ1, θ2, θ3, ...: algorithms, hyperparameters, data preprocessing strategies • Machine Learning
  61. Naive scheduling • Load & Convert: (Parameter θ1, Matrix X1), (Parameter θ1, Matrix X2), (Parameter θ1, Matrix X3) • Waste of memory • Frequent data loading from other servers • Frequent data-to-matrix conversion
  62. Parameter-aware scheduling • Parameter θ1, Parameter θ2, Parameter θ3 on Matrix X1 • Efficient memory usage • Infrequent data loading from other servers • Infrequent data-to-matrix conversion
  63. Design challenges • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing • Icon labels: θ1, θ2, θ3
  64. Machine learning: the most work-intensive and time-consuming part • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Convert to matrix (Matrix, Matrix, Matrix) → Machine Learning (map operation) → 1000s of models on HDFS • We must ensure a good balance of parallelized work
  65. Naive balancing of models to compute • Complicated model: 5 min, 5 min
  66. Naive balancing of models to compute • Complicated model: 5 min, 5 min • Decision tree model: 1 min, 1 min, then wait 8 min…
  67. Predictive balancing • Balancing complex and simple models (based on previous estimation) • Complex models first • 5 min, 1 min; 5 min, 1 min
  68. Evaluation
  69. Evaluation – targeting Top-10% • Prediction problem – Comparing Top-10% precision of targeting potential positive samples • Comparing with manual predictive modeling – Done with scikit-learn v0.18.1 – Selected algorithms (Logistic Regression, SVM, Random Forests) – Selected preprocessing strategies – All algorithm parameters set to default values, except Random Forest (n_estimators = 200)
  70. Evaluation – data sets • KDDCUP 2014 competition data – 557K records for training and validation data – 62K records for test data – Features: 500 • KDDCUP 2015 competition data – 108K records for training and validation data – 12K records for test data – Features: 500 • IJCAI 2015 competition data – 87K records for training, validation and test data – Features: 500
  71. Evaluation – cluster specification • Scalable Modular Server (DX2000) • Size: 3U • Server modules: 34 • CPU: 272 cores (Intel Xeon D 2.1GHz) – 128 cores used in the evaluation • RAM: 2TB • Storage: 34TB SSD • Internal network: 10GbE • Spark v1.6.0, Hadoop v2.7.3
  72. Evaluation results and conclusions. Top-10% precision results:
      Data        | Our technology | Logistic regression | SVM   | Random Forests
      KDDCUP 2014 | 15.6%          | 13.5%               | 12.0% | 14.8%
      KDDCUP 2015 | 97.1%          | 95.5%               | 93.1% | 97.2%
      IJCAI 2015  | 8.2%           | 8.3%                | 8.1%  | 8.2%
  73. Evaluation results and conclusions • Competitive results with good accuracy (Top-10% precision table repeated from slide 72)
  74. Evaluation results and conclusions • Short execution time • Full automation of the whole process • Handling data of any size. Execution time:
      Data        | Our technology
      KDDCUP 2014 | 172 minutes
      KDDCUP 2015 | 45 minutes
      IJCAI 2015  | 36 minutes
  75. Our observations
  76. Our observations • Using RDD of huge but compact objects optimized for ML computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  78. Converting to RDD[Matrix] • Training data (parquet) on HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] (Matrix, Matrix, Matrix) → Machine Learning (map operation) → 1000s of models on HDFS • RDD[DenseMatrix]
  79. RDD[DenseMatrix] • Spark used for parallelization • All the necessary data for a single execution kept without memory overhead • Performance-critical operations executed: – On objects with optimized linear algebra operations – By fast native ML algorithms
  80. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  81. Limiting execution overhead in tests • Submitting a Spark application takes time • Spark submit → Test → Spark submit → Test → Spark submit → Test
  82. Limiting execution overhead in tests • We submit only once • Spark submit → Test → Test → Test
  83. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  84. Stable execution on YARN • Default configuration sometimes fails with not enough memory • Spark Web UI (screenshot) • Giving much memory to Spark, but the application still fails • Known problem in Spark
  85. Stable execution on YARN • JVM system memory suddenly spikes over the YARN limit (*) • Chart: memory (GB) over time; YARN limitation (6GB); spike of JVM system memory usage • (*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit”, Spark Summit 2016.
  86. Stable execution on YARN • Tip: spark.yarn.executor.memoryOverhead must be carefully configured • Recommended overhead: 6-10% • 15% overhead required in our case • Must be thoroughly investigated (http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
  87. Summary
  88. Summary • Predictive modeling problem – Requires sophisticated knowledge – Takes a long time • Our technology: Automatic Predictive Modeling – Combines Spark with native ML engines – Fully automates the whole process – Provides highly accurate results – Takes at most hours – Handles data of any size
  89. Future work • Extending to other models (e.g. deep learning) • Speeding up by GPU • Reducing YARN memory overhead
  90. Thank you!
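
The predictive balancing idea from slides 65 to 67 (estimate each model's training time, then schedule complex models first) resembles the classic greedy "longest job first" heuristic. The sketch below is an illustration under that assumption only; the function name, the model names and the use of a heap are hypothetical, and the talk does not state how its scheduler is implemented.

```python
import heapq


def balance_models(estimated_minutes, num_workers):
    """Assign models to workers, longest estimated job first (greedy heuristic)."""
    # Min-heap of (minutes already assigned, worker index).
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(num_workers)}
    # Complex (expensive) models are placed first, as on the slide.
    ordered = sorted(estimated_minutes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, minutes in ordered:
        load, worker = heapq.heappop(heap)  # least-loaded worker so far
        assignment[worker].append(name)
        heapq.heappush(heap, (load + minutes, worker))
    loads = {w: sum(estimated_minutes[m] for m in ms) for w, ms in assignment.items()}
    return assignment, loads


# Estimates taken from the slide: two complicated models (5 min each)
# and two decision tree models (1 min each), on two workers.
estimates = {"complex_a": 5, "complex_b": 5, "tree_a": 1, "tree_b": 1}
assignment, loads = balance_models(estimates, num_workers=2)
print(assignment, loads)
```

With the naive split from slide 66, one worker is busy for 10 minutes while the other finishes after 2 and waits 8; here both workers finish after 6 minutes.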

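The evaluation on slide 69 compares Top-10% precision. The talk does not spell out the formula, so the sketch below assumes the common definition: rank samples by predicted score, keep the top 10%, and report the fraction of them that are truly positive. The function name and the toy data are hypothetical.

```python
def top_percent_precision(scores, labels, percent=10):
    """Fraction of true positives among the top `percent`% highest-scored samples."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    # Integer arithmetic avoids floating-point surprises; keep at least one sample.
    k = max(1, len(scores) * percent // 100)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy data: 20 samples with descending scores; labels are 1 for positives.
scores = [x / 20 for x in range(20, 0, -1)]
labels = [1, 0] + [0] * 17 + [1]
precision = top_percent_precision(scores, labels)
print(precision)
```

Here the top 10% is two samples, one of which is positive, so the precision is 0.5.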
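Slide 86 reports that a 15% memory overhead was required instead of the recommended 6-10%. As a rough worked example, the helper below turns an executor memory size and an overhead percentage into a `spark.yarn.executor.memoryOverhead` value, which the Spark 1.6 and 2.1 documentation specify in megabytes. The 8192 MB executor size and the helper names are assumptions for illustration; the talk does not state the executor memory used.

```python
def overhead_mb(executor_memory_mb, overhead_percent):
    """Overhead in whole megabytes, rounded up."""
    return (executor_memory_mb * overhead_percent + 99) // 100


def spark_submit_flags(executor_memory_mb, overhead_percent):
    """Build the relevant spark-submit arguments for a YARN executor."""
    overhead = overhead_mb(executor_memory_mb, overhead_percent)
    return [
        "--executor-memory", "{}m".format(executor_memory_mb),
        "--conf", "spark.yarn.executor.memoryOverhead={}".format(overhead),
    ]


# Hypothetical 8192 MB executor with the 15% overhead mentioned on the slide.
flags = spark_submit_flags(8192, 15)
container_mb = 8192 + overhead_mb(8192, 15)  # memory YARN must grant per executor
print(" ".join(flags), container_mb)
```

The container request grows from 8192 MB to 9421 MB in this example, which is worth checking against the YARN container size limits of the cluster.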