
Improving Efficiency of Machine Learning Algorithms using HPCC Systems

HPCC Systems
30 Jun 2020

  1. Improving Efficiency of Machine Learning Algorithms Using HPCC Systems Platform
  Dr. G. Shobha, Professor, CSE Department, RV College of Engineering, Bengaluru - 59
  2. PRESENTATION CONTENTS
  • Introduction and Motivation
  • HPCC Systems Architecture
  • Parallel DBSCAN Algorithm
  • Experimental Results & Conclusions
  3. Introduction and Motivation
  Key Factors of Machine Learning
  1. Large data sets: millions of labelled images, thousands of hours of speech
  2. Improved models and algorithms: deep neural networks with hundreds of layers and millions of parameters
  3. Efficient computation for machine learning:
  • Computational power for ML has increased by ~100x since 2010
  • GPU and CPU gains are almost stagnant in the latest generations
  • Even so, computation times remain extremely large (days to weeks to months)
  Go-to solution: distribute machine learning applications across multiple processors and nodes
  4. Introduction and Motivation: Machine Learning in One Node (figure)
  5. Introduction and Motivation: Distributed Machine Learning (figure)
  6. Introduction and Motivation
  Parallel Processing Architectures for Distributed Machine Learning
  1. MapReduce, e.g. Hadoop, Spark, DataTorrent
  2. Data flow, e.g. HPCC Systems
  The limitations of Hadoop lead to the go-to solution: the HPCC Systems architecture by LexisNexis Risk Solutions
  7. HPCC Systems Architecture
  THOR:
  • Data refinery engine
  • Gives the user control over data transformations
  • Facilitates optimal operational capacity on mixed-schema data
  ROXIE:
  • Search engine
  • Speeds up real-time queries through interfaces such as REST, SOAP and XML
  • Reduces the latency associated with querying
  ECL (Enterprise Control Language):
  • High-level language for parallel data processing
  • Dataflow architecture: implicitly parallel and declarative in nature
  • Provides several constructs to simplify parallel compute operations (see the ECL sketch below)
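A minimal ECL sketch of this dataflow style, assuming a hypothetical record layout and logical file name (PointRec, '~demo::frogs::points') rather than anything from the published code; the single declarative PROJECT is implicitly parallelized by Thor across all nodes:

    // Hypothetical record layout and logical file name, for illustration only
    PointRec := RECORD
        UNSIGNED4 id;
        REAL8     x;
        REAL8     y;
    END;

    points := DATASET('~demo::frogs::points', PointRec, THOR);

    // One declarative, implicitly parallel transformation: Thor runs it on every node
    scaled := PROJECT(points,
                      TRANSFORM(PointRec,
                                SELF.x := LEFT.x * 2.0,
                                SELF.y := LEFT.y * 2.0,
                                SELF   := LEFT));

    OUTPUT(scaled);

The same declarative style carries over to the clustering sketches further below, which reuse PointRec and points.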
  8. HPCC Systems Architecture
  Advantages of the HPCC Systems Architecture for Distributed Machine Learning
  • Highly integrated system environment: capabilities from raw data processing to high-performance queries and data analysis using a common language
  • Optimized cluster approach: provides high performance at a much lower system cost than other alternatives
  • Stable and reliable processing environment, proven in production applications for varied organizations over a 15-year period
  • Innovative data-centric programming language (ECL)
  • High level of fault resilience and capabilities
  • Suitable for a wide range of data-intensive applications
  9. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • Clusters are dense regions in the data space, separated by regions of lower object density
  • A cluster is defined as a maximal set of density-connected points
  Major features:
  • Discovers clusters of arbitrary shape
  • Handles noise
  • Requires one scan of the data
  • Needs density parameters as a termination condition
  10. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  Two parameters:
  • Eps: maximum radius of the neighborhood
  • MinPts: minimum number of points in an Eps-neighborhood of that point
  NEps(p) = { q in D | dist(p, q) <= Eps }
  Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
  1) p belongs to NEps(q)
  2) core point condition: |NEps(q)| >= MinPts
  (A short ECL illustration of these definitions follows this slide.)
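As a hedged illustration of these definitions (not the published implementation), the ECL below builds the Eps-neighborhoods with an all-pairs self-JOIN and applies the core point condition; it reuses the hypothetical points dataset from the sketch above, and the Eps and MinPts values are arbitrary examples:

    // Example parameter values only
    Eps    := 0.3;
    MinPts := 9;

    PairRec := RECORD
        UNSIGNED4 p_id;
        UNSIGNED4 q_id;
    END;

    // NEps(p) = { q in D | dist(p, q) <= Eps }, built with an all-pairs self-join
    neighbours := JOIN(points, points,
                       SQRT((LEFT.x - RIGHT.x) * (LEFT.x - RIGHT.x) +
                            (LEFT.y - RIGHT.y) * (LEFT.y - RIGHT.y)) <= Eps,
                       TRANSFORM(PairRec,
                                 SELF.p_id := LEFT.id,
                                 SELF.q_id := RIGHT.id),
                       ALL);

    // Core point condition: |NEps(p)| >= MinPts
    counts     := TABLE(neighbours, {p_id, UNSIGNED4 cnt := COUNT(GROUP)}, p_id);
    corePoints := counts(cnt >= MinPts);

    OUTPUT(corePoints);

The ALL option is needed because the join condition contains no equality part; on large data this all-pairs comparison is exactly the cost that the parallel version aims to spread across nodes.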
  11. DBSCAN
  DBSCAN becomes a computationally inefficient task when applied to large amounts of data, especially on big data platforms.
  12. Sequential DBSCAN Algorithm
  Drawback: computationally inefficient when applied to large amounts of data, especially on big data platforms
  Go-to solution: Parallel DBSCAN algorithm on the HPCC Systems big data platform
  Processor specification for each node:
  • Architecture: x86_64
  • CPU op-mode(s): 32-bit, 64-bit
  • Byte order: Little Endian
  • Model name: Intel Xeon CPU
  • Clock speed: 2.4 GHz
  • Core(s): 6
  • RAM: 6 GB
  • Hard disk: 128 GB
  Data set: Frogs (MFCC), dimension: 20
  13. Parallel DBSCAN Algorithm on the HPCC Systems Platform
  1. Spraying the data
  • The Thor engine distributes the data points, assigned globally unique ids, evenly across the nodes in the cluster
  • Each local node then sorts its data points by their unique ids
  • The data is passed on to the local clustering stage (see the ECL sketch after this slide)
  2. Local clustering
  The DBSCAN algorithm is executed on each local node in the HPCC cluster, using two operations:
  • Union: the final cluster is represented by its highest core point
  • Find: used to identify the parent, i.e. the highest core point, for each point (node) in the tree
  3. Global merge
  • Trees are merged to form global clusters when a point belongs to more than one tree on different nodes
  • The final clusters are obtained, each represented by its highest core point across all nodes
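A rough ECL sketch of step 1 only, not the published hpcc-systems/dbscan code; the record and attribute names are assumptions carried over from the earlier sketches:

    // Step 1: assign ids, spread points evenly across the Thor nodes, sort locally
    IdRec := RECORD
        UNSIGNED8 uid;
        PointRec;
    END;

    // Assign a unique id to every point (illustrative; the published code may differ)
    withIds := PROJECT(points,
                       TRANSFORM(IdRec,
                                 SELF.uid := COUNTER,
                                 SELF     := LEFT));

    // Distribute the points across the cluster, then sort locally by id
    spread      := DISTRIBUTE(withIds, HASH32(uid));
    localSorted := SORT(spread, uid, LOCAL);

    // localSorted is what each node hands to the local clustering (union/find) stage
    OUTPUT(localSorted);

After the DISTRIBUTE, the LOCAL sort and the local DBSCAN pass run independently on each node; cross-node communication is deferred to the global merge.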
  14. Parallel DBSCAN Algorithm on the HPCC big data platform
  Source code: https://github.com/hpcc-systems/dbscan (contributors: Yathish & team)
  15. Experimental Results & Conclusions
  Size    Eps distance   MinPts in a cluster   Time on single node (s)   Time on two nodes (s)   Time on three nodes (s)
  4800    0.2            2                     16.35                     14.5                    15.86
  6000    0.3            9                     35.24                     22.246                  23.471
  7200    0.3            10                    53.48                     44.426                  45.63
  9000    0.35           10                    112.80                    50.57                   53.642
  14300   0.4            20                    535.74                    213.922                 203.184
  30000   0.4            20                    3924.7                    964.616                 727.33
  50000   0.5            30                    24948.6                   5124.3                  3266.462
  (Chart: Serial vs Parallel Execution Time; x-axis: Size, y-axis: Execution Time in seconds; series: Serial, Parallel (2 Nodes), Parallel (3 Nodes))
  16. Conclusions
  • The multi-node setup outperforms the single-node setup in all cases
  • As the number of data points increases, the parallel algorithm performs increasingly better than its serial counterpart
  • The HPCC platform supports cross-platform development in languages like C++, Python, etc., which enables applications to be developed at a faster pace
  • The Thor and Roxie components of the HPCC platform enable faster data ingestion and data querying across multiple nodes, making the platform efficient for implementing machine learning algorithms
  • The platform parallelizes sequential algorithms across multiple nodes efficiently
  17. References
  • https://researchcollaborations.elsevier.com/en/organisations/httpswwwrvceeduin
  • MQTT protocol support for ROXIE, https://github.com/hpcc-systems/mqtt-for-roxie
  • Automated Data Skew Profiler, https://github.com/notharsh/DataSkewProfiler
  • Extending current ML library with LexisNexis HPCC Systems, https://github.com/lilyclemson/DBSCAN/tree/project
  • Image Processing Library in HPCC, https://github.com/TanmayH/HPCC-OPENCV
  • Fraud detection in value based cards, https://github.com/aksharprasad/HPCC
  • Evaluation of machine learning algorithms, https://github.com/suryanarayanan21/ML_Core
  • Interfacing Octave with ECL, https://github.com/Sathvik10/Octave-Plugin
  • Continuous integration of Roxie query / data deployments using Jenkins, https://github.com/JUJayashree/jenkin_JOB_xml
  18. Acknowledgements
  Prof. Jyothi, Asst. Prof., CSE Dept., RVCE
  Vasanth, Instructor, CSE Dept., RVCE
  Students of RVCE:
  1. Jayant Suresh
  2. Harsh Mishra
  3. Amogh Vardhan Kashi
  4. Manjunath Jakkaraddi
  5. Shubham Phal
  6. Tanmay Hukkeri
  7. Yathish H R
  8. Akshar Prasad
  9. Sathvik K R
  10. A Suryanarayanan
  Currently working students:
  1. Varsha R Jenni
  2. Akhil Dua
  3. Atreya Bain
  4. Anurag Singh Bhadauria
  5. Ambu Karthik
  6. Rohit Sachin