1. Practical Aspects of
Machine Learning on Big-Data platforms
(Big Learning)
Mohit Garg
Research Engineer
TXII, Airbus Group Innovations
25-05-2015
2. Motivation & Agenda
Motivation: Answer the questions
– Why are there multiple ecosystems for scalable
ML?
– Is there a unified approach to a big-
data platform for ML?
– How much is there to catch up on?
– How are industry leaders doing it?
– Put things into perspective!
Agenda: To present
– Quick brief on the practical ML process
– Current landscape of open-source tools
– Evolutionary drivers, with examples
– Case studies
– The Twitter experience
– *Share the journey and observations
(an ongoing process)
(Diagram: scalability sits at the intersection of ML (optimization, linear algebra, statistics) and Big Data (schema, workflow, architecture).)
5. Quick brief - Process
(Diagram: iterate Train (training data) → Tune parameters α, β (cross-validation data) → Measure (test data).)
• Not applicable to all ML modeling techniques. Biologically-inspired algorithms are more of a paradigm
(a set of guidelines) than a single algorithm, and require an algorithm definition under those guidelines (GA, ACO, ANN).
• Graph Source: http://www.astroml.org/sklearn_tutorial/practical.html
(Figure panels: bias vs. variance and the learning curve. Pipeline: Data Sampling → Algorithm → Model/Hypothesis → Model Evaluation.)
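A minimal sketch of this train/tune/measure loop using scikit-learn; the dataset, model, split sizes and the sklearn.model_selection module path are illustrative assumptions for recent scikit-learn versions:

# Train on the training split, tune a hyperparameter on the cross-validation
# split, and measure only once on the held-out test split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set for the final "measure" step.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into train and cross-validation ("tune") sets.
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_model, best_score = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):                 # tune the regularization strength
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_cv, y_cv)              # evaluate on the cross-validation data
    if score > best_score:
        best_model, best_score = model, score

print("test accuracy:", best_model.score(X_test, y_test))   # final measurement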
8. Quick brief – WF breakdown k-means
(Workflow diagram: the input data (points) feeds a statement block that assigns points to clusters, inside while (!termination_condition); each pass emits updates and a meta input (new cluster centres); the loop terminates when the centres no longer change.)
9. Quick brief – WF breakdown k-means
(Same workflow as above, highlighting that the new cluster centres are produced only after a full iteration over the data is complete, and are then fed back as the meta input of the next pass.)
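A single-machine sketch of the loop structure shown in this workflow, in plain NumPy; the initialization, data and k are illustrative assumptions:

import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.RandomState(seed)
    centres = points[rng.choice(len(points), k, replace=False)]   # meta input: initial centres
    while True:                                                   # while (!termination_condition)
        # statement block: assign each point to its nearest centre
        dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        # only after the iteration is over: recompute the new cluster centres
        new_centres = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):                     # termination: no change
            return centres, labels
        centres = new_centres                                     # updates feed the next pass

# Usage (toy data):
# pts = np.random.RandomState(1).rand(100, 2)
# centres, labels = kmeans(pts, k=3)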
10. Quick brief – Pieces
• An ML algorithm is ultimately a computer algorithm
• A complex design built from blocks:
– Blocks feeding each other: output becoming input
– Iterations over the entire dataset (gradient descent for linear and logistic
regression, k-means, etc.) – memory limitations
– Algorithms as non-linear workflows
• Principles when operating on large datasets
– Minimize I/O – don't read/write the disk again and again
– Minimize network transfer – localize logic (where it is not simply additive) to
the data
– Use online-trainable algorithms – optimized parallel algorithms
– Ease of use – abstraction – package well for the end user
11. Quick brief – then and now
• Small data
– Static data
• Big data
– Static data
– But can't run on a single machine
• Online big data
– Integrated with ETL
– Prediction and learning together
– Twitter case study
(Diagram: the train/tune/measure loop with parameters α, β repeated for each setting; in the online case, velocity is added to the picture.)
20. Quick Review: MapReduce + Hadoop
• Bigger focus on
– Large-scale handling of data
– Scheduling and concurrency control
– Load balancing
– Fault tolerance
– Basically, saving the tonnes of user effort
required in older frameworks like MPI
– The map and reduce can be 'executables' in
virtually any language (Streaming)
– *Mappers (and reducers) don't interact!
• MapReduce exploits massively
parallelizable problems – what about the rest of
them?
– Simple case: try finding the median of the integers
1-40 using MR (see the sketch below)
• Can we expect any algorithm to execute
under an MR implementation with the same time
bounds?
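The median question can be made concrete with a toy simulation in plain Python (no Hadoop): additive statistics such as sums combine cleanly across mappers, but the median of per-split medians is generally not the global median. The split sizes below are deliberately uneven and purely illustrative:

import statistics

data = list(range(1, 41))
splits = [data[:5], data[5:15], data[15:]]        # three "mappers" with arbitrary split sizes

# Additive statistic: per-split partial sums combine exactly in a single reducer.
print(sum(sum(s) for s in splits) == sum(data))   # True

# Non-additive statistic: the median of the per-split medians is not the global median.
local = [statistics.median(s) for s in splits]
print(statistics.median(local), "vs", statistics.median(data))   # 10.5 vs 20.5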
21. Loose Integration
• A set of components/APIs
– exposing existing tools to MapReduce frameworks
– to be compiled, optimized and deployed
– in streaming or pipe mode with the frameworks
• Hadoop/MapReduce bindings available for
– R
– Python (NumPy, scikit-learn)
• Focus on
– Accommodating the existing user base so it can leverage Hadoop data storage
– Easy and neat APIs for native users
– No real effort on 'bridging the gap'
23. Loose Integration – Pydoop Example
• Uses Hadoop Pipes as the underlying framework
• Based on CPython, so it allows the use of scikit-learn, NumPy, etc.
• Lets you define the map and reduce logic (see the sketch below)
• But it does not provide better representations of ML algorithms
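A minimal sketch of the Pydoop pattern, modelled on the classic Pydoop word-count example; the module paths (pydoop.mapreduce.api, pydoop.mapreduce.pipes) are assumptions that may differ across Pydoop versions:

# You only define map and reduce logic; Pydoop runs it over Hadoop Pipes.
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class Mapper(api.Mapper):
    def map(self, context):
        # emit (word, 1) for every word in the input line
        for word in context.value.split():
            context.emit(word, 1)

class Reducer(api.Reducer):
    def reduce(self, context):
        # sum the counts for each word
        context.emit(context.key, sum(context.values))

def __main__():
    pipes.run_task(pipes.Factory(Mapper, Reducer))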
26. Scientific - Efforts
• Efforts come in waves, with breakthroughs
• Efforts on
– Accuracy bounds and convergence
– Execution-time bounds
• Recent efforts in tandem with Big Data
– Distributable algorithms – central limit theorem (local
logic with simple aggregations)
– Batch-to-online models – 'one-pass mode' (avoid
iterations)
• Examples
– Distributable algorithms – ensemble classification (e.g.
random forest), k-means++ / k-means||
– Batch-to-online – stochastic gradient descent (SGD)
• Note – the power 'inherently' lies in Big Data
– A simple algorithm with a larger dataset often outperforms a complex
algorithm with a smaller dataset
Image-2 Source: Andrew Ng – Coursera ML-08
(Diagram: ensemble outputs O1, O2, …, ON combined by an aggregation step.)
27. Logistic Classification
• Sample: yes/no kind of answers
– Is this tweet spam?
– Will this tweeter log back in within 48 hours?
(Data table: features X1 … XN per record, with a binary label Y per row, for M records X11 … XMN.)
(Diagram: inputs x1, x2, …, xN weighted by θ1, θ2, …, θN feed the hypothesis h_θ(x).)
Hypothesis: h_θ(x) = 1 / (1 + e^(−θᵀx))
Cost(x): the per-record error between h_θ(x) and y
J = Cost(X), the cost aggregated over all records
• θ is the unknown variable
• Let's start with a random value of θ
• The aim is to change θ so as to minimize J
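In NumPy the hypothesis and a cost to minimize look as follows. The slide writes the cost informally as the gap between h_θ(x) and y; the sketch below uses the standard logistic log-loss, whose gradient contains exactly that per-record error term. X, y and theta are placeholders:

import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = 1 / (1 + e^(-theta^T x)), applied row-wise to X
    return 1.0 / (1.0 + np.exp(-X.dot(theta)))

def cost_J(theta, X, y):
    # standard logistic log-loss over all records; its gradient is X^T (h - y) / m
    p = np.clip(hypothesis(theta, X), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))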
29. Gradient Descent
• The cost function requires all records.
While (W keeps changing)
{
// Load data
// Find local losses
// Aggregate local losses
// Find gradient
// Update W
// Save W to disk
}
/* Multiple passes */
J = Cost(X, θ)
(Diagram: mappers M1 … MN load the data; a single reducer R calculates the gradient, updates W and saves the intermediate W; the user code drives the loop.)
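A single-machine sketch of the multi-pass loop above in NumPy; the learning rate, tolerance and the logistic gradient Xᵀ(h − y)/m are assumptions. In the MapReduce version, each pass is a separate job and W is written to disk between passes:

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, tol=1e-6, max_passes=1000):
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):                  # each pass reads the whole dataset
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))      # "local losses" over all records
        grad = X.T.dot(p - y) / len(y)           # aggregated into one global gradient
        w_new = w - lr * grad                    # update W
        if np.max(np.abs(w_new - w)) < tol:      # stop once W stops changing
            return w_new
        w = w_new                                # (in MR, W is saved to disk here)
    return w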
30. Stochastic Gradient Descent (SGD)
• No need to compute the full cost function for the gradient calculation
• Each iteration uses one data point x_i
• The gradient is calculated using only x_i
• Only as good as performing SGD on a single machine – the single reducer is a serious
bottleneck
(Diagram: mappers M1 … MN load the data; a single reducer R calculates the gradients, updates W and saves the final W.)
// Load data
While (samples remain)
{
// Find gradient using x_i
// Update W
}
// Save W
/* Single pass */
User code
Ref: Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks
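A single-machine sketch of the one-pass loop above in NumPy; the learning rate and the shuffling policy are assumptions:

import numpy as np

def sgd_single_pass(X, y, lr=0.01, seed=0):
    w = np.zeros(X.shape[1])
    for i in np.random.RandomState(seed).permutation(len(y)):   # one pass over shuffled samples
        p = 1.0 / (1.0 + np.exp(-X[i].dot(w)))
        w -= lr * (p - y[i]) * X[i]                              # gradient from x_i only; update W
    return w                                                     # save W (final)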
31. SGD - Distributed
• Similar to SGD, but with multiple reducers
• The data is thoroughly randomized
• Multiple classifiers are learned together – an ensemble of classifiers
• The bottleneck of a single reducer (network data) is resolved
• Testing uses a standard aggregation over the predictors' results (see the sketch at the end of this slide)
(Diagram: mappers M1 … MN load the data; reducers R1, R2, … each calculate gradients and update their own weights W1, W2, ….)
// Pre-process – randomize
// Load data
While (samples remain)
{
// Find gradient using x_i
// Update W_j
}
/* Single pass and
distributed */
User code
Ref: L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.
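A sketch of the distributed variant: randomize, run one SGD pass per partition (one "reducer" each), then aggregate the resulting models W1 … Wk. It reuses the sgd_single_pass sketch from the previous slide; averaging the weights is one simple aggregation choice (an assumption), and the ensemble could instead vote over the individual predictors' outputs:

import numpy as np

def distributed_sgd(X, y, n_partitions=4, lr=0.01, seed=0):
    order = np.random.RandomState(seed).permutation(len(y))     # thorough randomization
    parts = np.array_split(order, n_partitions)                 # one chunk per "reducer"
    models = [sgd_single_pass(X[idx], y[idx], lr, seed) for idx in parts]   # W1..Wk, learned independently
    return np.mean(models, axis=0)                              # aggregate over the predictors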
33. Architectural – Forces
Moore's Law vs. Kryder's Law (timeline: 1971 → now → 2020)
Source: collective information from Wikipedia and its references
“if hard drives were to continue to progress
at their then-current pace of about 40%
per year, then in 2020 a two-platter, 2.5-
inch disk drive would store approximately
40 TB and cost about $40” - Kryder
Moore's second law: “As the cost of computer
power to the consumer falls, the cost for
producers to fulfill Moore's law follows an
opposite trend: R&D, manufacturing, and
test costs have increased steadily with each
new generation of chips”
GAP
- Individual processors' power is growing at a slower rate
- Data storage is becoming easier and cheaper
- MORE data, FEWER processors – and the gap is
widening!
- Computer hardware architecture works at its own pace to
provide faster buses, RAM and augmented GPUs
34. Architectural – Forces
(Chart: data volume in exabytes growing from roughly 6,000 EB in 2012 ("we are here") toward about 15,000 EB by 2017, with a rising percentage of uncertain data; sources include sensors & devices, VoIP, enterprise data and social media. Dimensions: Volume, Variety, Veracity.)
Source: IBM – Challenges and Opportunities with Big Data – Dr Hammou Messatfa
35. Mahout with MapReduce
• Key feature: somewhat loose and somewhat tight integration
– Among the earliest libraries to offer both batch-style scalable components and online
learning algorithms
– Some algorithms were re-engineered for MapReduce, some were not
– Performance hit for iterative algorithms: huge I/O overhead
– Each (global) iteration means a new MapReduce job
– Integration of new scalable learners is less active
• Industry acceptance
– Accepted for scalable recommender systems
• Future
– Mahout Samsara for scalable low-level linear algebra, as Scala and Spark bindings
36. Cascading
• Key feature: abstraction and packaging
– Lets you think of workflows as chains of MR jobs
– Pre-existing methods for reading and storing data
– Provides checkpoints in the workflow to save its state
– JUnit support for test-case-driven software development
• Industry acceptance
– Scalding is Twitter's Scala binding for Cascading
– Used by Bixo for anti-spam classification
– Used for data loading by Elasticsearch and Cassandra
– eBay leverages the Scalding design for distributed computing
38. Pig – Quick Summary
High-level dataflow language (Pig Latin)
Much simpler than Java
Simplifies data processing
Puts the operations at the appropriate phases
Chains multiple MR jobs
Appropriate for ML workflows
No need to take care of intermediate outputs
Supports user-defined functions (UDFs) in Java,
integrable with Pig
39. Pig – Quick Summary
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
(Diagram: the logical plan — two LOADs, a FILTER, then JOIN, GROUP, FOREACH, STORE — is compiled into two chained MapReduce jobs, with FILTER and LOCAL REARRANGE operators in the map phases and PACKAGE and FOREACH operators in the reduce phases.)
40. ML-lib
Part of the Apache Spark framework
Data can come from HDFS, S3 or local files
Encapsulates run-time data as Resilient Distributed Datasets (RDDs)
RDDs are in-memory data pieces
Fault tolerant – an RDD knows how to recreate itself if its resident node goes down
No distinction between map and reduce, just tasks
Bindings for R too – SparkR
Real ingenuity in implementing new-generation algorithms (online and
distributed)
Example: three versions of k-means – Lloyd, k-means++, k-means|| (see the sketch below)
Key feature
Shared objects – tasks belonging to one node can share objects
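A minimal PySpark sketch of the k-means support mentioned above, using the RDD-based pyspark.mllib API; the input path and the parameter values are assumptions, and API details vary across Spark versions:

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-sketch")
# Each line of the (hypothetical) input file holds space-separated feature values.
points = sc.textFile("hdfs:///path/to/points.txt") \
           .map(lambda line: [float(v) for v in line.split()])
# "k-means||" selects the scalable k-means++ initialization.
model = KMeans.train(points, k=3, maxIterations=20, initializationMode="k-means||")
print(model.clusterCenters)
sc.stop()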
42. Tez
An Apache-incubated project
Fundamentally similar design principles to Spark
Encapsulates run-time data as nodes, much like RDDs
Key features
In-memory data
Shared objects – tasks belonging to one node can
share objects
Very few comparative studies are available
Not many contributions from the open community
44. Distributed R
An open-source project led by HP
Similar to RHadoop, but with some genuinely new
features, such as
User-defined array partitioning
Local transformations/functions
Master-worker synchronization
Not the same ingenuity yet as seen in MLlib
Only fundamentally scalable algorithms (online
and distributable) scale linearly
Tall claims of 50-100x time efficiency when
used with the HP Vertica database
45. Sibyl
Not open source yet, but some rumours!
Claims to provide a GFS-based, highly scalable and
flexible infrastructure for embedding the ML
process in ETL
Designed for supervised learning
Focused on learning user behaviour
YouTube video recommendations
Spam filters
Major design principle – columnar data
Suitable for sparse datasets (new columns?)
Compression techniques for columnar data are
much more efficient (structural similarity)
46. Columnar data- LZO Compression
• Idea 1
– Compression should be 'splittable'
– A large file can be compressed and
split into chunks the size of an HDFS block
– Each block should hold its own
'decompression key'
Compression | Size (GB) | Compression time (s) | Decompression time (s)
None        | 8.0       | -                    | -
Gzip        | 1.3       | 241                  | 72
LZO         | 2.0       | 55                   | 35
• Idea 2
– Compress data on Hadoop (save about 3/4 of the
space)
– Save roughly 75% of the I/O time!
– Worthwhile as long as decompression takes less than the I/O
time saved (see the check below)
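A back-of-the-envelope check of Idea 2 using the table's LZO numbers; the sustained disk-read throughput of 100 MB/s is an assumption:

raw_gb, lzo_gb, lzo_decompress_s = 8.0, 2.0, 35.0    # from the table above
read_mb_per_s = 100.0                                # assumed disk throughput

raw_read_s = raw_gb * 1024 / read_mb_per_s           # ~82 s to scan the uncompressed file
lzo_read_s = lzo_gb * 1024 / read_mb_per_s           # ~20 s to scan the LZO file (75% less I/O)
print(raw_read_s, lzo_read_s + lzo_decompress_s)     # ~82 s vs ~55 s: compression wins while
                                                     # decompression stays below the I/O time saved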
47. Conclusion
• Big Data has resurrected interest in ML algorithms.
• A two-way push is leading the confluence – online and distributed
learning (scientific) and flexible workflows (architectural) to
accommodate them.
• Facilitated by compression, serialization, in-memory computing, DAG
representations, columnar databases, etc.
• The majority of engineering man-hours goes into building the pipelines.
• Industry is aiming to provide high-level abstractions over standard ML
algorithms, hiding the gory details.
48. Learning Resources
• MOOCs (Coursera)
– Machine Learning (Stanford)
– Design & Analysis of Algorithms (Stanford)
– R Programming Language (Johns Hopkins)
– Exploratory Data Analysis (Johns Hopkins)
• Online competitions
– Kaggle data-science platform
• Software resources
– Matlab
– R
– scikit-learn (Python)
– APIs – ANN, JGAP
• 2009 - "Subspace extracting Adaptive Cellular Network for layered Architectures
with circular boundaries.“ Paper on IEEE.
• 2006-07. 1st prize – IBM’s Great mind challenge – “Transport Management System”
Multi –TSP implementation using Genetic Algorithm .