3. How Can Intel Help Machine Learning?
PC World, May 2015
4. How Can We Tell What Should Be Improved?
Many algorithms, many data types, constantly evolving…
5. A Machine Learning Benchmark is Obviously a Must
… but how can it incorporate the diversity of this domain and the
ongoing and future changes?
6. Our Basic Approach: Cover the Building Blocks
We observed that the various Machine Learning algorithms are composed of several types of building blocks; it is these building blocks that should be handled well
7. The Machine Learning Building Block Types
• ML basic building blocks
1. Linear Algebra
2. Measures
3. Special Functions
4. Mathematical Optimization
5. Data Characteristics
6. Data-dependent Compute
7. Memory Access
8. Very large models
9. Hybrid Methods
• ML Meta building blocks
1. Learning Protocols
2. Learning Phases
3. Algorithmic Flow and Structure
Grouped by: Compute, Data, Compute-Data Interplay, Process
8. Machine Learning Building Blocks: Example
Linear Algebra
• GEMM
• XᵀX, XXᵀ, (XᵀX)⁻¹Xᵀ
• Quadratic Form – vᵀAv
• Commonly used Algorithms
• Inversion
• Matrix Factorization
• Eigendecomposition
• Singular Value Decomposition (SVD)
• Need to support both Dense and Sparse
• Special Matrices of interest
• Symmetric – Covariance, Kernel
• Stochastic – Row elements sum to 1
• Boolean
• Diagonal
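These kernels can be sketched in a few lines of NumPy (shapes and data below are illustrative only; the sparse counterparts would come from scipy.sparse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # samples x features
v = rng.standard_normal(5)

gram = X.T @ X                               # X^T X (symmetric, covariance-like)
outer = X @ X.T                              # X X^T (Gram matrix on samples)
proj = np.linalg.inv(gram) @ X.T             # (X^T X)^-1 X^T, normal-equations solve
quad = v @ gram @ v                          # quadratic form v^T A v

evals, evecs = np.linalg.eigh(gram)          # eigendecomposition (symmetric case)
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # singular value decomposition
```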
9. Machine Learning Building Blocks: Example
Data Characteristics
• Type and Format
• Numeric/Categorical – 16b, 32b, 64b
• Sparse and Dense
• Typical sizes, Sparsity structure
• Distribution
• Univariate, Dependency Structure, Mixture, Even/Biased Class, Separability
[Table: usage domains (Advertising, SNA, Clinical, Genomics, Telco, IoT) characterized by # of features (Small/Mid/Large), feature types (Categorical/Numeric/Time Series), and Sparse/Dense layout]
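As an illustration of this building block, a minimal sketch of summarizing a dataset's characteristics (the `characterize` helper and its sparsity threshold are our own, not part of the benchmark):

```python
import numpy as np

def characterize(X, categorical_cols=()):
    """Summarize the data characteristics the benchmark cares about."""
    n, d = X.shape
    sparsity = float(np.mean(X == 0))          # fraction of zero entries
    return {
        "samples": n,
        "features": d,
        "dtype_bits": X.dtype.itemsize * 8,    # 16b / 32b / 64b
        "sparsity": sparsity,
        "layout": "sparse" if sparsity > 0.5 else "dense",
        "categorical_features": len(categorical_cols),
    }

X = np.zeros((1000, 20), dtype=np.float32)
X[::10, :5] = 1.0                               # mostly zeros -> sparse
summary = characterize(X, categorical_cols=(0, 1))
print(summary)
```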
10. Example: Mapping Algorithms to Building Blocks
Algorithms: PCA, Decision Tree, Deep Learning (CNN), Apriori, Adaboost
• Linear Algebra – PCA: GEMM; CNN: Convolution, GEMM
• Measures – Decision Tree: Infotheo, Gini; CNN: Infotheo, Euclidean, Softmax
• Special Functions – Decision Tree: log; CNN: sigmoid, tanh, ReLU; Adaboost: exp
• Mathematical Optimization – CNN: Non-convex
• Data Characteristics – Categorical: Decision Tree, Apriori, Adaboost; Numeric: PCA, Decision Tree, CNN, Adaboost
• Data-dep. Compute – Decision Tree: Sorting, Bucketing, Data-dep. Branches; Apriori: Counting, Bucketing, Data-dep. Branches
• Memory Access – Blocks: PCA, CNN; Columns: Decision Tree; Other: Apriori (Predicate-based, Associative), Adaboost (Weighted Sampling)
• Very Large Models – CNN
• Hybrid Method – Adaboost: Committee
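The Measures row for Decision Tree can be made concrete; a minimal sketch of the two split criteria (base-2 entropy for the information-theoretic measure; function names are our own):

```python
import numpy as np

def entropy(labels):
    """Information-theoretic impurity (base-2 entropy) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity, the other split criterion in the Measures row."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

y = np.array([0, 0, 1, 1])
print(entropy(y), gini(y))   # 1.0 and 0.5 for a balanced binary split
```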
11. Application: A Machine Learning Workloads Suite
• Building Blocks Coverage (partial list …)
• Linear Algebra – GEMM, Inv., Factorization, …
• Measures – Euclidean, InfoGain, RBF, …
• Special Functions – log, exp, …
• Math. Optimization – QP, EM, L-BFGS, SGD, …
• Data – Num., Cat., Dense, Sparse, Feat. Dep., …
• Data-dep. Compute – Sort, Bucket, KD Tree, …
• Memory Access – Seq, Indexed, Pred, Rnd, …
• Very large models – CNN, KNN, K-SVM, …
Our approach enables selecting representatives of the major building blocks
• Tasks Coverage:
Classification, Clustering, Recommendation,
Dimensionality Reduction, Rule Induction,
Community Detection
Building Block Type – Algorithms – Data sets:
• Dense Linear Algebra – K-Means, SVM, PCA, GMM, Logistic Reg. – Clustered
• Sparse Linear Algebra – K-Means, SVM, PCA, Logistic Reg., ALS – Graphs, Text
• Data Dependency – Apriori, Decision Tree, Naïve Bayes, KNN, LDA, Walktrap – Clustered, Graphs, Bio Informatics, Text, Manufacturing
• Large Models – CNN – Images
12. Which Datasets to Use?
There are publicly available datasets, but they may not cover all relevant sizes
and characteristics. We complement them by simulating data.
• Power Law graph
• Small World graph (regular/random)
• SBM (few/many blocks)
(varied across Density × Size)
http://snap.stanford.edu/data/
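For instance, a stochastic block model generator takes only a few lines; this is a numpy-only sketch with illustrative parameters, not the benchmark's actual generator:

```python
import numpy as np

def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected SBM adjacency matrix: edge probability
    p_in within a block, p_out across blocks."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    n = labels.size
    probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # upper triangle only
    return (upper | upper.T).astype(np.int8), labels   # symmetrize

A, labels = sample_sbm([50, 50], p_in=0.3, p_out=0.02)
# within-block density should be much higher than cross-block density
print(A[:50, :50].mean(), A[:50, 50:].mean())
```

Power-law and small-world graphs can be generated analogously (e.g. by preferential attachment and ring-lattice rewiring), sweeping density and size as on the slide.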
13. Which Datasets to Use: Another Example
• Simulated Dense Clustered Datasets
vary by
• Number of dimensions
• Number of samples
• Number of clusters
• Mixing proportion
• Uniform, Power-law
• Dependency structure
• Cluster separation
• c-separation*
• Alignment in space
• Scattered, Line, Sphere
* Dasgupta, S., Schulman, L., A Probabilistic Analysis of EM for Mixtures of
Separated, Spherical Gaussians. JMLR, 8 (2007)
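A minimal sketch of simulating such a mixture under the c-separation condition for spherical Gaussians (the rejection-sampling scheme for placing means is our own illustration):

```python
import numpy as np

def sample_c_separated_mixture(k, d, n_per, c=2.0, sigma=1.0, seed=0):
    """Spherical Gaussian mixture whose means are >= c * sqrt(d) * sigma apart,
    following the c-separation condition of Dasgupta & Schulman."""
    rng = np.random.default_rng(seed)
    min_dist = c * np.sqrt(d) * sigma      # sqrt(trace(sigma^2 I_d)) = sqrt(d)*sigma
    means = []
    while len(means) < k:                  # rejection-sample well-separated means
        m = rng.uniform(-10 * min_dist, 10 * min_dist, size=d)
        if all(np.linalg.norm(m - prev) >= min_dist for prev in means):
            means.append(m)
    X = np.vstack([m + sigma * rng.standard_normal((n_per, d)) for m in means])
    y = np.repeat(np.arange(k), n_per)
    return X, y, np.array(means)

X, y, means = sample_c_separated_mixture(k=3, d=5, n_per=100, c=2.0)
```

The other slide parameters (mixing proportion, dependency structure, alignment in space) would be further knobs on the same generator.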
15. Which Parameters / Configurations to Use?
We should run each algorithm with the implementations, configurations and parameters that exercise all of its building blocks
16. Isn’t It Too Big for a Benchmark?
The benchmark should be concise – it should not contain tens of thousands of separate executions (one per combination of algorithm, dataset and configuration)
17. Reducing the Number of Workloads
We developed the WOrkload Optimization Framework (WOOF), which enables running many executions and clustering them by their hardware or software profiles.
We then select one representative for each bottleneck behavior.
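The reduction step can be sketched as follows, assuming each execution is summarized as a numeric profile vector (the tiny k-means and farthest-first init are illustrative stand-ins for WOOF's actual clustering):

```python
import numpy as np

def farthest_first_init(X, k):
    """Deterministic, well-spread initial centers (farthest-first traversal)."""
    centers = [X[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])
    return np.array(centers)

def cluster_profiles(X, k, iters=50):
    """Cluster execution profiles (rows of X) into k behavior groups."""
    centers = farthest_first_init(X, k)
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign, centers

def representatives(X, assign, centers):
    """Pick, per cluster, the execution whose profile is nearest the centroid."""
    reps = []
    for j in range(len(centers)):
        idx = np.flatnonzero(assign == j)
        reps.append(int(idx[np.linalg.norm(X[idx] - centers[j], axis=1).argmin()]))
    return reps

# Toy example: 20 "executions" with two clearly distinct profile behaviors
X = np.vstack([np.zeros((10, 4)), 5.0 * np.ones((10, 4))])
assign, centers = cluster_profiles(X, k=2)
reps = representatives(X, assign, centers)
```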
18. Software Profiling
Software behavior is evaluated using the Linux perf tool. Thousands of executions
are reduced to a few representatives of the behaviors encountered.
• Radial SVM:
• Linear SVM:
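A sketch of how two software profiles might be compared, assuming each is a per-function cycle breakdown as perf report would attribute it (the function names and percentages below are hypothetical):

```python
import numpy as np

# Hypothetical hotspot breakdowns for the two SVM variants above
# (illustrative numbers, not measured data).
radial_svm = {"kernel_rbf": 0.62, "dot": 0.20, "exp": 0.10, "other": 0.08}
linear_svm = {"dot": 0.70, "axpy": 0.18, "other": 0.12}

def to_vector(profile, functions):
    """Align a {function: fraction} profile onto a common function axis."""
    return np.array([profile.get(f, 0.0) for f in functions])

functions = sorted(set(radial_svm) | set(linear_svm))
a = to_vector(radial_svm, functions)
b = to_vector(linear_svm, functions)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine)   # low similarity -> keep both executions as representatives
```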
19. Hardware Profiling
Hardware behavior is evaluated using Yasin’s Top-Down methodology(*), identifying
the percentage of time spent on each of the processor hotspots
(*) A. Yasin – Top Down Analysis: never lost with Xeon perf counters – CERN workshop (2013)
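The level-1 breakdown can be sketched from the core pipeline counters, following the formulas in the Top-Down paper (counter values below are illustrative; exact event names vary by microarchitecture):

```python
def top_down_level1(clk, idq_uops_not_delivered, uops_issued,
                    uops_retired, recovery_cycles, width=4):
    """Level-1 Top-Down breakdown over the machine's issue slots
    (width * cycles), per Yasin's methodology."""
    slots = width * clk
    frontend_bound = idq_uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired + width * recovery_cycles) / slots
    retiring = uops_retired / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"Frontend Bound": frontend_bound, "Bad Speculation": bad_speculation,
            "Retiring": retiring, "Backend Bound": backend_bound}

breakdown = top_down_level1(clk=1_000_000, idq_uops_not_delivered=400_000,
                            uops_issued=2_600_000, uops_retired=2_400_000,
                            recovery_cycles=25_000)
print(breakdown)
```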
20. Hardware Profiling: Community Detection Example
Multiple executions of community detection algorithms are reduced to five
representatives with distinct hardware behaviors
(*) L1, L2 and L3 are the three levels of caches of the processor
22. Hardware Profiling: Illustrating the Effect of Data Selection
Alternating Least Squares (ALS)
Different data characteristics cause different hardware profiles, and simulated data
may introduce additional behaviors (projected on two dimensions using PCA)
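The two-dimensional projection can be sketched with an SVD-based PCA (the profile matrix here is random, purely for illustration):

```python
import numpy as np

def pca_2d(profiles):
    """Project profile vectors (rows) onto their first two principal components."""
    Xc = profiles - profiles.mean(axis=0)            # center the features
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                             # scores on PC1 and PC2

rng = np.random.default_rng(0)
# Illustrative profile matrix: 30 executions x 6 Top-Down-style metrics
profiles = rng.random((30, 6))
coords = pca_2d(profiles)
print(coords.shape)
```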
23. Benchmark Building Process
Algorithm Selection → Defining Parameter Sets → Defining Datasets → Reducing to Representatives → Results Analysis/Validation
24. Current Status and Next Steps
• Based on the above process, we analyzed 18 algorithms representing the main
building blocks, with multiple datasets and configurations, and built a suite of 50
machine-learning workloads
• Both software developers and hardware architects inside Intel have started using
it and have gained interesting insights
• Work on completing the benchmark is currently in progress
25. Acknowledgments
• This project was initiated and guided by Dr. Shai Fine, Advanced Analytics, Intel
• Many of the presented results and analyses are due to the intensive work of the
Advanced Analytics WOOF team:
Chen Admati, Omer Barak, Omer Ben-porat, Roy Ben-shimol, Amir Chanovsky, Nufar Gaspar, Dima
Hanukaev, Tom Hope, Litan Ilany, Nitzan Kalvari, Oren David Kimhi, Hagar Loeub, Michal Moran,
Jacob Neiman, Yevgeni Nous, Yevgeni Reif, Yahav Shadmiy, Gilad Wallach
• Additional valuable contributions were made by:
Assaf Araki, Ehud Cohen, Jason Dai, Boris Ginzburg, Sergey Goffman, Paul Kandel, Sergey
Maidanov, Debbie Marr, Andrey Nikolaev, Gilad Olswang, Nir Peled, Ananth Sankaranarayanan,
Nadathur Rajagopalan Satish, Ganesh Venkatesh, Brian D Womack