
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids


  1. Hybrid Transactional/Analytics Processing with Spark and In-Memory Data Grids. Ali Hodroj, VP, Products and Strategy, @ahodroj. Copyright © GigaSpaces 2017. All rights reserved.
  2. GigaSpaces: Ultra-Low Latency / High Throughput Middleware. Direct customers: 500+. Headquarters: New York, NY. Established: 2001.
  3. How we got here
  4. We’re seeing more in our customer base
  5. …a shift towards real-time BI: Big Data, Fast Data
  6. Sample Customer Use Cases. Internet of Things: Edge Analytics, Operational Intelligence, Predictive Maintenance, Spatial Analytics. Omni-Channel: Personalization, Recommendation. Operational Intelligence: Operational Analytics, Predictive Analytics, Fraud Detection, Supply chain optimization.
  7. In-Memory Computing (not a new thing). A rapid decline in RAM prices has driven advanced data processing innovations. • Transactional (2001-present): In-Memory Databases, In-Memory Data Grids • Analytics (2012-present): In-Memory Data Processing Frameworks (Spark), In-Memory File Systems (Tachyon)
  8. In-Memory Data Processing: Apache Spark
  9. In-Memory Data Grid: Online Transaction Processing at Low Latency and High Throughput. A data grid is a cluster of machines that work together to create a resilient shared data fabric for low-latency data access and extreme transaction processing. http://xap.github.io
  10. In-Memory Data Grid 101. [Diagram labels: Feeder; three Virtual Machines; Partitioned Data]
  11. In-Memory Data Grid 101: Execution Models. Write; Event-Driven / Reactive; RPC / Master-Worker
  12. In-Memory Data Grid 101: Execution Models. Write; Event-Driven / Reactive; RPC / Master-Worker
  13. In-Memory Data Grid 101: Typical Deployment. [Diagram labels: HTML, HTTP/S and REST clients; hardware load balancer; HTTPD load balancers with LB agents; GSA on each machine; Processing units; Primary Sets 1-6; Backup Sets 1-6; Mirror Service (async) to DB; Private or Public Cloud]
  14. Host: Cisco UCS server. CPU: Intel, 16 cores, 2.9 GHz. Concurrent threads: 2. Throughput: 200, 400, 800 ops/sec.
  15.
  16. Hybrid Transactional/Analytics Processing at Scale. Requirements: (1) Provide a closed-loop analytics pipeline: data, insight, to action at sub-second latency. (2) IoT and Omni-channel require the convergence of many different data types. (3) Blend of both real-time and historical data. Challenges: (1) Bi-directional integration between transactional and analytical data stores. (2) Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API. (3) Unified and scale-out real-time and historical data store.
  17. Our road towards HTAP: Spark + Microservices
  18. What’s needed: • Large-scale distributed analytics framework • Unified, scale-out, low-latency data store • Transactional capabilities: ACID, Event-Driven, Rich Data modeling • Microservices
  19. Our approach to HTAP: Low-latency Scale-Out In-Memory Data Grid + Large-scale distributed analytics framework
  20. Why Spark? • Unified & Concise API • Highly Flexible Data Store Integration • Massive Community and Adoption
  21. #1: Bi-directional integration between transactional and analytical data stores. Provide a closed-loop analytics pipeline: data, insight, to action at scale (sub-second latency).
  22.
  23. [Architecture diagram] Transactional Tier (ACID-compliant, Strong Consistency): In-Memory Data Grid with In-Memory Store (RAM) and Flash, SSD, Off-Heap Store. Analytics Tier: Spark, Spark SQL, Spark Streaming, Machine Learning. Across both tiers: High availability, Security & Management.
  24. Build a connector: Spark to IMDG. • Get Partitions: an array of partitions that a dataset is divided into. In the IMDG: a distributed query to get partitions and their hosts. • Compute: a compute function that performs a computation on a partition. In the IMDG: an iterator over that portion of the data. • Get Preferred Location: optional preferred locations, i.e. hosts for a partition where the data will be loaded. In the IMDG: the hosts returned by the distributed query.
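
The three hooks on slide 24 can be sketched in plain Python, without Spark, to show how each one maps onto the grid. This is only an illustration: `ToyGrid`, `GridPartition`, `GridBackedDataset`, and `distributed_query` are hypothetical names invented here, not the API of Spark, XAP, or InsightEdge.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class GridPartition:
    """One grid partition and the host that holds it (hypothetical type)."""
    index: int
    host: str


class ToyGrid:
    """In-process stand-in for a data grid; assumes one partition per host."""

    def __init__(self, rows_by_host: Dict[str, List[dict]]):
        self._rows_by_host = rows_by_host

    def distributed_query(self) -> List[GridPartition]:
        # Ask the grid which partitions exist and where they live.
        hosts = sorted(self._rows_by_host)
        return [GridPartition(i, host) for i, host in enumerate(hosts)]

    def read(self, partition: GridPartition) -> Iterator[dict]:
        # Iterate over only the rows held by this partition.
        return iter(self._rows_by_host[partition.host])


class GridBackedDataset:
    """Mirrors the three hooks named on the slide."""

    def __init__(self, grid: ToyGrid):
        self._grid = grid

    def get_partitions(self) -> List[GridPartition]:
        return self._grid.distributed_query()

    def compute(self, partition: GridPartition) -> Iterator[dict]:
        return self._grid.read(partition)

    def get_preferred_locations(self, partition: GridPartition) -> List[str]:
        # Hint to the scheduler: run the task on the host holding the data.
        return [partition.host]
```

A scheduler that honours `get_preferred_locations` would place each task on the machine that already holds the partition, which is the machine-level data locality described on the next slide.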
  25. Pattern #1: Data Locality (machine-level). [Diagram: node 1: Spark master, Grid master; node 2: Spark worker, Grid Partition; node 3: Spark worker, Grid Partition; NoSQL Storage]
  26. Pattern #2: Pushdown Predicates (Grid-side processing). Example: SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012. Aggregation happens in Spark; filtering and column pruning happen in the Data Grid. Spark SQL architecture: implementing the DataSource API • Pushing down predicates to the Data Grid • Leveraging indexes • Transparent to the user • Enabling support for other languages (Python/R)
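
The division of labour on slide 26 can be illustrated with a small pure-Python sketch: the grid side filters rows and prunes columns, and the analytics side only aggregates what comes back. The function names and the row layout are assumptions made for this example; a real connector would express this through Spark's DataSource API rather than plain functions.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Sequence

Row = Dict[str, Any]


def grid_scan(
    partition: Iterable[Row],
    predicate: Callable[[Row], bool],
    columns: Sequence[str],
) -> Iterator[Row]:
    """Grid side: apply the pushed-down filter, keep only requested columns."""
    for row in partition:
        if predicate(row):
            yield {name: row[name] for name in columns}


def is_ny_after_2012(row: Row) -> bool:
    """The WHERE clause from the slide: city = 'NY' AND year > 2012."""
    return row["city"] == "NY" and row["year"] > 2012


def sum_amount(partitions: List[List[Row]]) -> float:
    """Analytics side: SUM(amount) over rows that survived the grid scan."""
    return sum(
        row["amount"]
        for partition in partitions
        for row in grid_scan(partition, is_ny_after_2012, ["amount"])
    )
```

Only the `amount` column of matching rows leaves the grid, so the aggregation step never sees rows or columns it does not need.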
  27. Pattern #3: Decouple Data Processing from Data Storage. [Diagram: node 1: Spark master, Grid master; node 2: Spark worker, Grid Partition; node 3: Spark worker, Grid Partition; NoSQL Storage] Spark workers: lightweight, small JVMs. Grid partitions: large JVMs, fast indexing.
  28. Push-down Predicates performance. Traditional Spark filtering of 7 million records: 31 sec. Grid-side + Spark filtering of 7 million records: 800 ms.
  29. #2: Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API. IoT and Omni-channel require the convergence of many different data types.
  30. In-Memory Data Grid + Spark Convergence: Geo-Spatial, Full Text
  31. Simple K/V to RDD Mapping
  32. POJO Domain Model to Spark
  33. POJO Domain Model to Spark (Event-Driven)
  34. JSON Domain Model to Spark
  35. Geo-Spatial Data Frames
  36. Full Text Indexes + Lucene Analyzers
  37. #3: Unified and scale-out real-time and historical data store. Blend of both real-time and historical data.
  38. In-Memory Data Grid Partitioning: hash(key) % #nodes
  39. In-Memory Data Grid Partitioning with HA: hash(key) % #nodes
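
The routing rule on slides 38 and 39, hash(key) % #nodes, can be sketched as follows. Two things here are assumptions rather than anything the slides state: CRC32 is used only because it gives a stable hash across Python processes, and the backup is placed on the next node purely for illustration, since the slides do not say how backups are assigned.

```python
import zlib


def primary_node(key: str, num_nodes: int) -> int:
    """Route a key to its primary partition: hash(key) % #nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def backup_node(key: str, num_nodes: int) -> int:
    """Illustrative HA placement: keep the backup copy on the next node."""
    if num_nodes < 2:
        raise ValueError("a backup needs at least 2 nodes")
    return (primary_node(key, num_nodes) + 1) % num_nodes
```

Because the primary and backup always land on different nodes, losing one machine never removes both copies of a key in this scheme.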
  40. Spark to Data Grid Partition Cardinality. Direct connection: node 1: Spark executor, Spark Partition #1, Grid Partition #1; node 2: Spark executor, Spark Partition #2, Grid Partition #2; node 3: Spark executor, Spark Partition #3, Grid Partition #3. Simple, but not enough parallelism for Spark.
  41. Pattern #4: Grid bucketing for higher throughput. [Diagram: node 1: Spark Executor, Grid Primary #1 divided into buckets 0 to 1023, read by Spark Partition #1 and Spark Partition #2] 1 Spark partition = M grid buckets. 1 Grid partition = N Spark partitions.
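
One way to read the bucketing on slide 41 is that each grid partition's 1,024 buckets (0 to 1023) are divided into N contiguous ranges, one per Spark partition, so that each Spark partition scans M ≈ 1024 / N buckets. The sketch below assumes contiguous ranges and an even split; the slide does not specify how buckets are actually assigned, and `bucket_ranges` is a hypothetical helper.

```python
from typing import List


def bucket_ranges(spark_partitions: int, num_buckets: int = 1024) -> List[range]:
    """Split one grid partition's buckets into N contiguous ranges.

    Each returned range is the set of buckets one Spark partition reads.
    """
    if not 1 <= spark_partitions <= num_buckets:
        raise ValueError("need between 1 and num_buckets Spark partitions")
    base, extra = divmod(num_buckets, spark_partitions)
    ranges = []
    start = 0
    for i in range(spark_partitions):
        # The first `extra` ranges take one more bucket to absorb the remainder.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

With N = 4, each Spark partition covers 256 buckets, so four Spark tasks can read a single grid partition in parallel instead of one.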
  42. Eventually, we productized this as an open source Spark distribution
  43. @InsightEdgeIO. http://insightedge.io. Apache 2 License. http://insightedge.io/docs http://insightedge.io/blog http://github.com/InsightEdge
  44. GigaSpaces InsightEdge: High Performance Spark with OLTP Capabilities. http://insightedge.io
  45. Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers). [Diagram: Application; Processing; Primary instances; Backup instances; Sync Replication; Storage Arrays; In-Memory Data Grid; Spark workers] • Significant RAM TCO reduction in Spark clusters • Direct RDD/DataFrame read/write from SSD/Flash device • Avoid filesystem hops and write amplification
  46. Reference Architectures
  47. In-Process HTAP
  48. Point-of-Decision HTAP. XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication. [Diagram: Transactions and Analytics In-Memory Data Grids linked by Realtime Replication; • Scoring models • Trigger actions • Events]
  49. Transportation / IoT: Connected Cars / Fleet Geo-Analytics. Challenge: • Stream data from 1,000s of taxis • Actively monitor and generate real-time notifications • Real-time route optimization and geo-fencing. Solution: • Leverage a unified in-memory data fabric as middleware for geo-spatial analytics • Elastically scale stream processing and transactional apps together • Location-based tracking, geo-fencing. [Diagram labels: Edge components; Data Sources]
  50. Thank you! Questions?
