GOAI: GPU-Accelerated Data Science DataSciCon 2017

The GPU Open Analytics Initiative (GOAI) is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning) and how to get started.

  1. GOAI: GPU-ACCELERATED DATA SCIENCE | Joshua Patterson (@datametrician), Director of Applied Solutions Engineering | DataSciCon 2017
  2. SPARK ECOSYSTEM: The Glue of Big Data • Spark has almost become synonymous with Hadoop and Big Data • It is the interface/API for app-to-app communication in big data • It is the processing layer for big data and the leading ML framework
  3. DATA PROCESSING EVOLUTION: Faster Data Access, Less Data Movement • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
  4. DATA PROCESSING EVOLUTION: Faster Data Access, Less Data Movement • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
  5. SPARK ECOSYSTEM: Lacks Full GPU Integration • 4 core parts: SQL, Streaming (Spark functions, micro-batched), Machine Learning, and Graph • Spark is currently optimizing its existing code base and adding usability; it has no GPU support yet
  6. DATA PROCESSING EVOLUTION: Faster Data Access, Less Data Movement • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory) • GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
  7. PRE-GPU DATA FRAME: Too Much Glue Code & Lack of Standards [Diagram, labeled CURRENT: H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, MapD, CPU, APP A, APP B, with a "Copy & Convert" step on each link] • For GPU applications to talk to each other, data must be copied and converted up to three times • Each company has to build and maintain connectors to copy and convert • Some products wanted direct connectors to other products, which reduced hops but left them more to maintain and develop • A standard was needed • ISVs were always starting from scratch • Barrier to entry and integration
  8. DATA MOVEMENT KILLS PERFORMANCE [Chart labels: Volume of data, Number of data handoffs, Handoff, Pre-GPU Data Frame, GPU Data Frame]
  9. GPU-ACCELERATED ARCHITECTURE, THEN: Too much data movement and too many different data formats [Diagram labels: CPU, GPU, APP A, APP B, Read Data, Load Data, Copy & Convert (x3), APP A GPU Data, APP B GPU Data; vendors: H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, MapD]
  10. GPU-ACCELERATED ARCHITECTURE, THEN: Too much data movement and too many different data formats [Diagram labels: CPU, GPU, APP A, APP B, Read Data, Load Data, Copy & Convert (x3), APP A GPU Data, APP B GPU Data; vendors: H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, MapD]
  11. INTEROPERABILITY IN BIG DATA: Lessons Learned From Apache Arrow & Parquet • Both Apache Arrow and Apache Parquet are columnar formats • Arrow resides in memory, whereas Parquet is compressed storage that resides on disk • There is a major push in the big data world to remove the bottleneck of copying and converting data between systems, which was also a major issue in the GPU world
  12. GPU-ACCELERATED ARCHITECTURE, NOW: Single data format and shared access to data on GPU [Diagram labels: CPU, GPU, GPU MEM, Read Data, Load Data, GPU Data Frame (based on Apache Arrow); vendors: H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, MapD]
  13. GPU OPEN ANALYTICS INITIATIVE (github.com/gpuopenanalytics, @gpuoai) • GPU Data Frame (GDF) • Apache Arrow • Pipeline: Ingest/Parse → Exploratory Analysis → Feature Engineering → ML/DL Algorithms → Grid Search → Scoring → Model Export
  14. EASY TO USE (@gpuoai)
  15. EASY TO USE (@gpuoai)
  16. USE GPUS IN PYTHON (@gpuoai)
  17. GROWING COMMUNITY SUPPORT: Apache Arrow, Apache Parquet
  18. GPU ACCELERATION ACROSS THE ECOSYSTEM
  19. DATA PROCESSING EVOLUTION: Faster Data Access, Less Data Movement • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory) • GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU) • End-to-end GPU processing (GOAI): Arrow Read → Query → ETL → ML Train (25-100x improvement, same code, language flexible, primarily on GPU)
  20. EXPAND GPU USAGE: More Data, Less Hardware • Scaling up and out with GPU co-processors [Chart: peak double-precision TFLOPS (0.0 to 8.0), NVIDIA GPU vs. x86 CPU, 2008 to 2017]
  21. ANACONDA: Python ETL for GPU • Numba: a Python open-source just-in-time optimizing compiler that uses LLVM to produce native machine instructions. Primary contributor to PyGDF. • Dask: a flexible parallel computing library for analytic computing with dynamic task scheduling and big data collections. Primary contributor to Dask_GDF. • Tweet from Jeremy Howard (deep learning researcher & educator; founder: fast.ai; faculty: USF & Singularity University; previously CEO: Enlitic; President: Kaggle; CEO: Fastmail): "Rewrote @scikit_learn PolynomialFeatures in @ContinuumIO Numba. Got a 40x speedup (would be bigger with more data!) 12 lines of code"
  22. BLAZINGDB: Scale-out Data Warehousing
  23. GRAPHISTRY: Graph Visualization • Optimized Networking • GPU Analysis and ML • GPU Rendering • Hunting: Daily Anomalies • SecOps: Shadow IT Use • IR: Killchain Analysis • Fraud: Tracking Embezzlers • Threat Intel: Botnet Analysis
  24. GUNROCK • GPU-accelerated graph analytics library • Multi-GPU optimized algorithms • Reduced cost and increased performance • Performance constantly improving
  25. H2O.AI: H2O4GPU, GPU Machine Learning Library
  26. (no text)
  27. [Chart values: 87, 51, 171; annotation: "with latest solver"]
  28. (no text)
  29. MAPD (MapD Core, MapD Immerse) • LLVM Backend: LLVM creates one custom function that runs at speeds approaching hand-written functions. LLVM enables generic targeting of different architectures and running simultaneously on CPU/GPU. • Streaming: Speed eliminates the need to pre-index or aggregate data. Compute resides on GPUs, freeing CPUs to parse and ingest. Finally, the newest data can be combined with billions of rows of “near historical” data. • Rendering: Data goes from the compute (CUDA) pipeline to the graphics (OpenGL) pipeline without a copy and comes back as a compressed PNG (~100 KB) rather than raw data (> 1 GB).
  30. MAPD ARCHITECTURE (Open Source, Commercial) • Visualization Libraries: JavaScript libraries, based on DC.js, that allow users to build custom web-based visualization apps powered by a MapD Core database. • LLVM: MapD Core SQL queries are compiled with a just-in-time (JIT) LLVM-based compiler and run as NVIDIA GPU machine code. • Distributed Scale-out: MapD Core has native distributed scale-out capabilities. MapD Core users can query and visualize larger datasets with much smaller cluster sizes than traditional solutions. • High Availability: MapD Core has high availability functionality that provides durability and redundancy. Ingest and queries are load balanced across servers for additional throughput.
  31. CYBER SECURITY: An Ideal Use Case for GPU Acceleration
  32. FIRST PRINCIPLES OF CYBER SECURITY: Where the industry must go • 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient. • 2. Event management is an accelerated analytics problem: the volume and velocity of data from devices require a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data. • 3. Visualization will be a key part of daily operations, which will allow analysts to label and train deep learning models faster and validate machine learning predictions.
  33. RULES & PEOPLE DON’T SCALE: Current methods are too slow • Right now, financial services reports it takes an average of 98 days to detect an Advanced Threat, but retailers say it can be about seven months. • Once the security community moves beyond the mantras “encrypt everything” and “secure the perimeter,” it can begin developing intelligent prioritization and response plans to various kinds of breaches, with a strong focus on integrity. • The challenge lies in efficiently scaling these technologies for practical deployment and making them reliable for large networks. This is where the security community should focus its efforts. • Source: http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
  34. ATTACKS ARE MORE SOPHISTICATED • “How Hackers Hijacked a Bank’s Entire Online Operation” https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
  35. FIRST PRINCIPLES OF CYBER SECURITY: Where the industry must go • 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
  36. MULTI MODEL APPROACH: No Silver Bullet in Cyber Security • nvGRAPH • https://github.com/h2oai/h2o4gpu • # edges = E * 2^S ~34M
  37. FIRST PRINCIPLES OF CYBER SECURITY: Where the industry must go • 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient. • 2. Event management is an accelerated analytics problem: the volume and velocity of data from devices require a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data.
  38. GPU ACCELERATION: Accelerate the Pipeline, Not Just Deep Learning • GPUs for deep learning = proven • Where else and how else can we use GPU acceleration? Dashboards; accelerating the data pipeline; stream processing; building better models faster • First: GPU databases • Pipeline stages: Data Ingestion, Data Processing, Visualization, Model Training, Inferencing
  39. MOVING TO BIG DATA IS A START: Spark outperforms traditional SIEM • SIEM: production SIEM of a Fortune 500 enterprise • Big data solution: 10-node cluster, ~$60k in hardware • Data: 450+ columns, ~250 million events per day • Source: Spark vs SIEM benchmarks from Accenture Labs (Strata NY, BSides LV)
  40. MOVING TO BIG DATA IS A START: Spark outperforms traditional SIEM

      | # | Typical Scenario | Time Period | SIEM | Big Data | Speed Up |
      |---|---|---|---|---|---|
      | 1 | Show all network communication from one host (IP) to multiple hosts (IPs) | 1 Day | 3h 20m 13s | 1m 44s | 114 Times Faster |
      |   |   | 1 Week | Not Feasible* | 4m 05s |   |
      | 2 | Retrieve failed logon attempts in Active Directory | 1 Day | 18m 26s | 1m 37s | 10 Times Faster |
      |   |   | 1 Week | 2h 13m 45s | 3m 10s | 41 Times Faster |
      | 3 | Search for Malware (exe) in Symantec logs | 1 Day | 3h 24m 36s | 1m 37s | 125 Times Faster |
      |   |   | 1 Week | Not Feasible* | 3m 22s |   |
      | 4 | View all proxy logs for a specific domain | 1 Day | 4h 30m 13s | 2m 54s | 92 Times Faster |
      |   |   | 1 Week | Not Feasible* | 1m 09s** |   |

      Source: Spark vs SIEM benchmarks from Accenture Labs (Strata NY, BSides LV)
  41. GPU DATABASES ARE EVEN FASTER: 1.1 Billion Taxi Ride Benchmarks (time in milliseconds)

      | Query | MapD DGX-1 | MapD 4 x P100 | Redshift 6-node | Spark 11-node |
      |---|---|---|---|---|
      | Query 1 | 21 | 30 | 1560 | 10190 |
      | Query 2 | 80 | 99 | 1250 | 8134 |
      | Query 3 | 150 | 269 | 2250 | 19624 |
      | Query 4 | 372 | 696 | 2970 | 85942 |

      Source: MapD benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs (@marklit82): Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS
  42. FIRST PRINCIPLES OF CYBER SECURITY: Where the industry must go • 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient. • 2. Event management is an accelerated analytics problem: the volume and velocity of data from devices require a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data. • 3. Visualization will be a key part of daily operations, which will allow analysts to label and train deep learning models faster and validate machine learning predictions.
  43. (no text)
  44. DATA PLATFORM-AS-A-SERVICE • SCALE: handles 1M events/second; auto-scales the cluster • HIGH AVAILABILITY: offers HA with no data loss; always-on architecture; data replication • SECURITY: data platform security has been implemented with VPCs in AWS; dashboard access using NVIDIA LDAP • SELF SERVICE: log-to-analytics; Kibana and JDBC access; accessing data using BI tools
  45. ARCHITECTURE V1
  46. ARCHITECTURE V2 (with MapD)
  47. MAPD VS KIBANA: Dashboards Comparison + Performance Test Method
  48. DASHBOARD PERFORMANCE: MapD Immerse vs Elastic Kibana [Chart: time to fully load (seconds, 0 to 300) vs. days of data (1 to 31); series: MapD Immerse (DGX), MapD Immerse (P2), Elastic Kibana; annotations: "< 9s", "< 12s"]
  49. VISUALIZATION WITH GPU: Less hardware, more performance, more scale
  50. VISUALIZATION WITH GPU: Less hardware, more performance, more scale • 1/10th the hardware • 1-2 orders of magnitude more performance
  51. VISUALIZATION WITH GPU: Less hardware, more performance, more scale • 1/10th the hardware • 1-2 orders of magnitude more performance • Real-time visualization of 100K+ nodes and 1M+ edges • 50-100x faster clustering than other solutions
  52. LISTS DO NOT VISUALLY SCALE • Text search is a great starting point! • Does not scale: you do not see the 30K+ events, the IPs, the users, or how they relate…
  53. BAR CHARTS HIDE RELATIONSHIPS • Good for summaries! • But not: individual items • But not: behaviors, relationships, patterns, outliers, …
  54. GRAPHS: A KEY MISSING VIEW • Unified Model: shows entities, events, and relationships; multipurpose: connect, see, interact • Visual: inspect individual items; see behavior, patterns, and outliers • Scale to enterprise workloads
  55. DIFFERENT GRAPHS, DIFFERENT QUESTIONS • Uni, ex: network mapping, “Is it safe to reboot this?” • Hyper, ex: incident response, “Did this escalate?” • Multi, ex: SSH trails, “Is a user crossing zones?” [Diagram node labels: ip, user, event]
  56. CURRENT WORK
  57. CYBERWORKS SIEM SDK: Purpose-Built SDK for SIEM Analytics • Purpose: a platform that lets analysts hunt and analyze data faster and at larger scale than traditional big data allows, to find unknown and zero-day threats. It will accelerate the threat detection ecosystem and harden cyber defense utilizing GPU ISVs and deep learning frameworks. • Goals: open source ecosystem & select ISVs; integration points with leading security vendors (FireEye, Splunk, Palo Alto Networks)
  58. CYBERWORKS ACTIVITIES: Continuous Improvement • Use GPU-accelerated databases to analyze data to improve hunting today, as well as to enrich and label data for deep learning. • Connect accelerated DBs to Splunk for event management, hunting, and exploration. Use Graphistry and MapD to visualize the data for anomaly and threat detection in new ways. The goal is to GPU-accelerate parts of Splunk through partnership and to connect/bolt on GPU DBs and Graphistry. • Use ML and graph analytics for feature extraction and behavioral analytics, an ensemble approach to detection. • Expand deep learning training as more data is labeled/classified and threats are caught faster, building on DL techniques used in GFN, other groups, and external ISVs. • Generalize deep learning for supervised and unsupervised anomaly and threat detection (insider, APT, DDoS, etc.) while building our own cyber security deep learning accelerator. Use best practices from DriveWorks and other accelerators and SDKs as a reference architecture. Leverage DL from other parts of the firm to accelerate development as well. • While using Splunk Cloud to protect NVIDIA, we create a redundant path of data to enable R&D. • nvGRAPH
  59. CYBERWORKS ARCHITECTURE [Diagram labels: SecOps Data Sources, Ingest, Storage, Stream Processing, Batch Processing, Serving Layer, Notebook, Visualization, Graph Processing, cuSTINGER, Gunrock, Graph Visualization, Deep Learning, Machine Learning, Interactivity, Query Speed]
  60. CYBERWORKS HARDWARE: Accelerating your SIEM [Diagram labels: Scale-out Cluster, DGX Cluster, NAS, SIEM, Notebooks, End User, 3rd Party Apps, Messaging Queue]
  61. JOIN THE REVOLUTION: Everyone Can Help! • Apache Arrow: https://arrow.apache.org/ (@ApacheArrow) • Apache Parquet: https://parquet.apache.org/ (@ApacheParquet) • GPU Open Analytics Initiative: http://gpuopenanalytics.com/ (@Gpuoai) • Integrations, feedback, documentation support, pull requests, new issues, and donations are welcome!
  62. QUESTIONS? Joshua Patterson (@datametrician)
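
The "Speed Up" figures in the Spark vs SIEM table on slide 40 can be checked with a little arithmetic. The sketch below is an illustration, not something from the deck: the helper names `to_seconds` and `times_faster` are hypothetical, and the short scenario labels are abbreviations. The formula `(slow - fast) / fast`, rounded down, is an inference that happens to reproduce the five quoted figures; the slides do not state how the numbers were derived.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration from the table into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def times_faster(slow, fast):
    """Time saved relative to the faster run: (slow - fast) / fast.

    Assumption: this is how the slide's "Times Faster" column was computed.
    """
    return (slow - fast) / fast


# (scenario, period, SIEM seconds, big-data seconds, speed-up quoted on the slide)
rows = [
    ("Host-to-hosts traffic", "1 Day", to_seconds(3, 20, 13), to_seconds(0, 1, 44), 114),
    ("Failed AD logons", "1 Day", to_seconds(0, 18, 26), to_seconds(0, 1, 37), 10),
    ("Failed AD logons", "1 Week", to_seconds(2, 13, 45), to_seconds(0, 3, 10), 41),
    ("Malware in Symantec logs", "1 Day", to_seconds(3, 24, 36), to_seconds(0, 1, 37), 125),
    ("Proxy logs for a domain", "1 Day", to_seconds(4, 30, 13), to_seconds(0, 2, 54), 92),
]

for scenario, period, siem, big_data, quoted in rows:
    ratio = siem / big_data  # plain ratio of the two run times
    print(f"{scenario} ({period}): ratio {ratio:.1f}x, "
          f"times faster {times_faster(siem, big_data):.1f}, slide says {quoted}")
```

Note that the plain ratio of run times is one higher than the quoted figure in each row, so "114 Times Faster" corresponds to a run that takes roughly 1/115 of the SIEM time.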

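The contrast between the "then" architecture (slides 9 and 10, every application copies and converts) and the "now" architecture (slide 12, one shared columnar format) can be sketched on the CPU with the Python standard library alone. This is only an analogy under stated assumptions: it uses `array` and `memoryview` rather than Apache Arrow or GPU memory, and the column `fares` and the functions `copy_and_convert` and `share` are hypothetical names made up for the illustration.

```python
from array import array

# Hypothetical column of fares, stored contiguously (a columnar layout).
fares = array("d", [4.5, 12.0, 7.25, 30.0])


def copy_and_convert(column):
    """'Then': each consumer builds a private copy in its own format."""
    return [float(value) for value in column]


def share(column):
    """'Now': each consumer gets a view over the same underlying buffer."""
    return memoryview(column)


private_copy = copy_and_convert(fares)  # duplicates all the data
view_a = share(fares)                   # no data is copied
view_b = share(fares)                   # no data is copied

# An update by the producer is visible through every shared view...
fares[0] = 5.0
print(view_a[0], view_b[0])

# ...but the private copy is stale and would have to be rebuilt.
print(private_copy[0])

# Both views refer to the same buffer object and add no extra column storage.
print(view_a.obj is fares, view_a.nbytes)
```

The design point is the one the slides make: with a single agreed layout, handing data to another application costs a reference rather than a full copy plus a format conversion.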