RV College of
Engineering
Go, change the
world
Improving Efficiency of Machine Learning Algorithms
Using HPCC Systems Platform
Dr. G. Shobha
Professor, CSE Department
RV College of Engineering, Bengaluru - 59
PRESENTATION CONTENTS
Introduction and Motivation
HPCC Systems Architecture
Parallel DBSCAN Algorithm
Experimental Results & Conclusions
Introduction and Motivation
Key Factors of Machine Learning
1. Large Data Sets
• Millions of labelled images, thousands of hours of speech
2. Improved Models and Algorithms
• Deep Neural Networks: hundreds of layers, millions of parameters
3. Efficient Computation for Machine Learning
• Computational power for ML has increased by ~100x since 2010
• Gains (GPU, CPU) are almost stagnant in the latest generations
• Computation times remain extremely long (days to weeks to months)
Go-to Solution: Distribute Machine Learning Applications to Multiple Processors and Nodes
Parallel Processing Architectures for Distributed Machine Learning
1. MapReduce
Ex: Hadoop, Spark, DataTorrent
2. Data Flow
Ex: HPCC Systems
Limitations of Hadoop
Go-to Solution: HPCC Systems Architecture by LexisNexis Risk Solutions
HPCC Systems Architecture
THOR:
• Data refinery engine
• Gives the user control over data transformations
• Facilitates optimal operational capacity on mixed-schema data
ROXIE:
• Search engine
• Speeds up real-time queries through interfaces such as REST, SOAP and XML
• Reduces the latency associated with querying
ECL (Enterprise Control Language):
• High-level language for parallel data processing
• Dataflow architecture
• Implicitly parallel and declarative in nature
• Provides several constructs to simplify parallel compute operations
Advantages of HPCC Systems Architecture for Distributed Machine Learning
• Highly integrated system environment
- Capabilities from raw data processing to high-performance queries and data analysis using a common language
• Optimized cluster approach
- Provides high performance at a much lower system cost than other system alternatives
• Stable and reliable processing environment proven in production applications for varied organizations over a 15-year period
• Innovative data-centric programming language (ECL)
• High level of fault resilience and capabilities
• Suitable for a wide range of data-intensive applications
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Clusters are dense regions in the data space, separated by regions of lower object density
• A cluster is defined as a maximal set of density-connected points
Major features:
• Discovers clusters of arbitrary shape
• Handles noise
• One scan of the data
• Needs density parameters as a termination condition
Two parameters:
• Eps: maximum radius of the neighborhood
• MinPts: minimum number of points in the Eps-neighborhood of a point
Eps-neighborhood: N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if
1) p ∈ N_Eps(q)
2) |N_Eps(q)| ≥ MinPts (core point condition)
DBSCAN is computationally inefficient when applied to large amounts of data, especially on big data platforms.
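The definitions above can be made concrete with a minimal Python sketch. The function names (eps_neighborhood, is_core, directly_density_reachable) and the sample points are illustrative assumptions, not part of the presented implementation; Euclidean distance is assumed for dist.

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices q with dist(points[p], points[q]) <= eps (p itself is included)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is a core point."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical sample: three nearby points and one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
eps, min_pts = 0.2, 3

near = directly_density_reachable(points, 1, 0, eps, min_pts)  # point 1 from core point 0
far = directly_density_reachable(points, 0, 3, eps, min_pts)   # point 3 is not a core point
```

Note that the relation is not symmetric in general: a border point can be directly density-reachable from a core point without the reverse being true.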
DBSCAN
Sequential DBSCAN Algorithm
Drawback: computationally inefficient when applied to large amounts of data, especially on big data platforms
Go-to Solution: Parallel DBSCAN Algorithm on the HPCC Systems Big Data Platform
Processor Specification for Each Node
Specification     Value
Architecture      x86_64
CPU op-mode(s)    32-bit, 64-bit
Byte Order        Little Endian
Model Name        Intel Xeon
CPU GHz           2.4
Core(s)           6
RAM               6 GB
Hard Disk         128 GB
Data Set: Frogs, MFCC
Dimension: 20
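For reference, the sequential algorithm named on this slide can be sketched in a few lines of Python. This is a textbook-style sketch with a brute-force neighborhood search, not the implementation that was benchmarked; the names region_query and dbscan and the sample points are assumptions. Because every region query scans all points, the total work grows roughly quadratically with the data size, which is one way to read the drawback stated above.

```python
import math

NOISE = -1  # label for points that belong to no cluster

def region_query(points, i, eps):
    """Brute-force Eps-neighborhood: scans every point, so O(n) per query."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ... or NOISE)."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # may later become a border point
            continue
        labels[i] = cluster               # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster += 1
    return labels

# Hypothetical sample: two dense groups and one isolated point.
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 10.0)]
sample_labels = dbscan(sample, eps=0.2, min_pts=3)
```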
Parallel DBSCAN Algorithm on the HPCC Systems Platform
1. Spraying the Data
• The Thor engine assigns each data point a globally unique ID and distributes the points evenly across the nodes in the cluster
• Each local node then sorts its data points by their unique IDs
• The data is sent to the local clustering stage
2. Local Clustering
• The DBSCAN algorithm is executed on each local node in the HPCC cluster, using two operations:
• Union: merges trees so that the final cluster is represented by its highest core point
• Find: identifies the parent, i.e., the highest core point, for each point (node) in the tree
3. Global Merge
• Trees are merged together to form global clusters, since a point can belong to more than one tree on different nodes
• The final clusters are obtained, each represented by its highest core point across all nodes
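The three stages above can be sketched in single-process Python to show how Union and Find tie the local and global steps together. This is only an illustration under stated assumptions, not the ECL implementation in the repository: round-robin sharding stands in for the Thor distribution step, every simulated node is assumed to be able to read all points for neighborhood queries, "highest core point" is interpreted as the core point with the largest ID, and all function names are hypothetical.

```python
import math

def find(parent, x):
    """Find: follow parent links to the representative of x's tree."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(parent, a, b):
    """Union: merge two trees; the larger ID becomes the representative."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[min(ra, rb)] = max(ra, rb)

def neighbors(points, p, eps):
    """IDs of all points within eps of point p (brute force)."""
    return [q for q in points if math.dist(points[p], points[q]) <= eps]

def local_clustering(points, owned_ids, eps, min_pts):
    """Stage 2: one node builds trees for the core points it owns."""
    parent, borders = {}, []
    for p in owned_ids:
        n_p = neighbors(points, p, eps)
        if len(n_p) < min_pts:
            continue                      # p is not a core point
        find(parent, p)                   # register p even if it has no core neighbors
        for q in n_p:
            if len(neighbors(points, q, eps)) >= min_pts:
                union(parent, p, q)       # core-core link
            else:
                borders.append((q, p))    # border point q reachable from core p
    return parent, borders

def global_merge(local_results):
    """Stage 3: merge the per-node trees, then attach border points."""
    merged = {}
    for parent, _ in local_results:
        for x, px in parent.items():
            union(merged, x, px)
    labels = {x: find(merged, x) for x in list(merged)}
    for _, borders in local_results:
        for q, p in borders:
            labels.setdefault(q, find(merged, p))  # first claim wins
    return labels

def parallel_dbscan(points, eps, min_pts, n_nodes):
    """Stage 1 (simulated): shard sorted IDs round-robin, then run stages 2 and 3."""
    ids = sorted(points)
    shards = [ids[i::n_nodes] for i in range(n_nodes)]
    return global_merge([local_clustering(points, s, eps, min_pts) for s in shards])

# Hypothetical sample: point 3 is a border point, point 7 is noise.
pts = {0: (0.0, 0.0), 1: (0.1, 0.0), 2: (0.0, 0.1), 3: (0.25, 0.0),
       4: (5.0, 5.0), 5: (5.1, 5.0), 6: (5.0, 5.1), 7: (10.0, 10.0)}
clusters = parallel_dbscan(pts, eps=0.2, min_pts=3, n_nodes=3)
```

In this sketch each point maps to the ID of its cluster representative, and noise points are simply absent from the result. The same clustering comes out regardless of n_nodes, because the global merge only unions links that the local stages discovered independently.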
Parallel DBSCAN Algorithm on the HPCC Systems Big Data Platform
Source code: https://github.com/hpcc-systems/dbscan
Contributors: Yathish & Team
Experimental Results & Conclusions
Size     Eps distance   MinPts in a cluster   Time on single node (s)   Time on two nodes (s)   Time on three nodes (s)
4800     0.2            2                     16.35                     14.5                    15.86
6000     0.3            9                     35.24                     22.246                  23.471
7200     0.3            10                    53.48                     44.426                  45.63
9000     0.35           10                    112.80                    50.57                   53.642
14300    0.4            20                    535.74                    213.922                 203.184
30000    0.4            20                    3924.7                    964.616                 727.33
50000    0.5            30                    24948.6                   5124.3                  3266.462
Figure: Serial vs Parallel Execution Time (x-axis: Size; y-axis: Execution Time in seconds; series: Serial, Parallel (2 Nodes), Parallel (3 Nodes))
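The gap between the serial and parallel runs is easier to compare as a speedup ratio (single-node time divided by multi-node time). The short Python sketch below derives it from the timings in the results table; the variable and function names are illustrative, and the numbers are copied from the table rather than re-measured.

```python
# (size, single-node s, two-node s, three-node s), copied from the results table
results = [
    (4800, 16.35, 14.5, 15.86),
    (6000, 35.24, 22.246, 23.471),
    (7200, 53.48, 44.426, 45.63),
    (9000, 112.80, 50.57, 53.642),
    (14300, 535.74, 213.922, 203.184),
    (30000, 3924.7, 964.616, 727.33),
    (50000, 24948.6, 5124.3, 3266.462),
]

def speedup(serial_s, parallel_s):
    """Speedup = serial execution time / parallel execution time."""
    return serial_s / parallel_s

# size -> (speedup on two nodes, speedup on three nodes)
speedups = {size: (speedup(t1, t2), speedup(t1, t3))
            for size, t1, t2, t3 in results}

for size, (s2, s3) in speedups.items():
    print(f"{size:>6}  2 nodes: {s2:5.2f}x   3 nodes: {s3:5.2f}x")
```

A ratio above 1 means the multi-node run was faster than the single-node run for that data size.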
Conclusions
• The multi-node setup outperforms the single-node setup in all cases tested
• As the number of data points increases, the parallel algorithm increasingly outperforms its serial counterpart
• The HPCC Systems platform supports cross-platform development in languages such as C++ and Python, which makes it possible to develop applications at a faster pace
• The Thor and Roxie components of the platform enable faster data ingestion and data querying across multiple nodes, which makes the platform efficient for implementing machine learning algorithms
• The platform parallelizes sequential algorithms across multiple nodes efficiently
References
• https://researchcollaborations.elsevier.com/en/organisations/httpswwwrvceeduin
• MQTT protocol support for ROXIE, https://github.com/hpcc-systems/mqtt-for-roxie
• Automated Data Skew Profiler, https://github.com/notharsh/DataSkewProfiler
• Extending current ML library with LexisNexis HPCC Systems, https://github.com/lilyclemson/DBSCAN/tree/project
• Image Processing Library in HPCC, https://github.com/TanmayH/HPCC-OPENCV
• Fraud detection in value based cards, https://github.com/aksharprasad/HPCC
• Evaluation of machine learning algorithms, https://github.com/suryanarayanan21/ML_Core
• Interfacing Octave with ECL, https://github.com/Sathvik10/Octave-Plugin
• Continuous integration of Roxie query / data deployments using Jenkins, https://github.com/JUJayashree/jenkin_JOB_xml
Acknowledgements
Prof. Jyothi, Asst. Prof. CSE Dept., RVCE
Vasanth, Instructor, CSE Dept., RVCE
Students of RVCE
1. Jayant Suresh
2. Harsh Mishra
3. Amogh Vardhan Kashi
4. Manjunath Jakkaraddi
5. Shubham Phal
6. Tanmay Hukkeri
7. Yathish H R
8. Akshar Prasad
9. Sathvik K R
10. A Suryanarayanan
Students currently working on projects
1. Varsha R Jenni
2. Akhil Dua
3. Atreya Bain
4. Anurag Singh Bhadauria
5. Ambu Karthik
6. Rohit Sachin