
Streaming Architecture including Rendezvous for Machine Learning


This is a talk I gave at CMU about the rationale for streaming microservices, using the rendezvous architecture as an example.

Streaming Architecture including Rendezvous for Machine Learning

  1. Why Stream? and Machine Learning Logistics
  2. Contact Information – Ted Dunning, PhD. Chief Application Architect, MapR Technologies. Committer, PMC member, board member, ASF. O’Reilly author. Email: tdunning@mapr.com, tdunning@apache.org. Twitter: @Ted_Dunning
  3. Traditional Solution – Use a Profile Database (diagram: POS 1..n sends transactions to a fraud detector backed by a “last card use” profile database)
  4. What Happens as You Scale Up? (diagram: many POS 1..n feeds and many fraud detectors, all reading the same “last card use” database)
  5. Shared Database Can Be A Problem – a shared database causes problems; the big problem is disagreement about schema and indexing (diagram: several fraud detectors contending for one “last card use” database)
  6. Alternative: Use a Stream to Isolate Services (diagram: POS 1..n publishes to a card-activity stream; an updater consumes it to maintain “last card use” for the fraud detector)
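A minimal sketch of the isolation idea on this slide, assuming a Kafka-style client (kafka-python here) as a stand-in for MapR Streams; the topic name and record fields are illustrative, not from the talk:

```python
# Hypothetical sketch: the POS side publishes card activity to a stream, and an
# independent updater service consumes it to maintain "last card use".
# Neither side knows about the other's database or implementation.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_card_use(card_id, merchant, amount):
    # The point-of-sale service only writes to the stream.
    producer.send("card-activity", {"card": card_id, "merchant": merchant, "amount": amount})

def run_last_use_updater(last_use_table):
    # The updater owns its own "last card use" view; a fraud detector or a new
    # card-location-history service would attach with a different group_id.
    consumer = KafkaConsumer(
        "card-activity",
        bootstrap_servers="broker:9092",
        group_id="last-use-updater",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for msg in consumer:
        last_use_table[msg.value["card"]] = msg.value
```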
  7. Add New Services via the Stream (diagram: a new card-location-history service and other consumers attach to the same card-activity stream without touching existing services)
  8. Changing Implementation Through Isolation (diagram: the last-card-use updater and the fraud detector are swapped for new implementations behind the stream; producers are unaffected)
  9. Changing Implementation Through Isolation (diagram repeated as an animation step)
  10. With MapR, Geo-Distributed Data Appears Local (diagram: data source → stream → consumer)
  11. With MapR, Geo-Distributed Data Appears Local (diagram: data source → stream, replicated to a second stream → consumer)
  12. With MapR, Geo-Distributed Data Appears Local (diagram: a stream in a regional data center is replicated to a global data center, where the consumer reads it as if it were local)
  13. Use Case: Telecommunications (diagram: callers → towers → CDR data stream)
  14. Streaming in Telecom • Data collection & handling happens at different levels (tower, local data center, central data center) • Batch: can take 30 minutes per level • Streaming: latency drops to seconds or sub-seconds per level • Ability to respond as events occur • MapR Streams enables stream replication with offsets across data centers
  15. Unique to MapR: Manage Topics at Stream Level • Many more topics on a MapR cluster • Topics are grouped together in a stream (different from Kafka) • Policies such as time-to-live and ACEs are set at the stream level (access control at this level differs from Kafka) • Geo-distributed stream replication (different from Kafka) (diagram: one stream containing Topics 1–3; image © 2016 Ted Dunning & Ellen Friedman, from Chapter 5 of the O’Reilly book Streaming Architecture, used with permission)
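A rough illustration of the stream/topic grouping above, under the assumption that on MapR a stream is an object in the cluster namespace and topics are addressed as stream-path:topic through Kafka-compatible clients. A generic Kafka-style producer stands in for the MapR client; paths and payloads are made up:

```python
# Sketch only: topic names of the form "/path/to/stream:topic" are assumed.
# Policies such as time-to-live and ACEs would be set once on the stream
# object (here /apps/pumps) by an administrator, not per topic.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Three topics grouped inside one stream object
producer.send("/apps/pumps:p1", b"pressure=72")
producer.send("/apps/pumps:p2", b"pressure=68")
producer.send("/apps/pumps:p3", b"temperature=41")
producer.flush()
```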
  16. Use Case: Each Pump Has Many Sensors (diagram: pump data flows into per-pump topics p1–p5; consumers such as a dashboard subscribe to individual topics)
  17. Use topics as an organizing principle
  18. Example (diagram: files, tables, streams, and directories all living together in the cluster namespace under a volume mount point)
  19. (diagram: cluster volume mount point, continued)
  20. Streams should be integrated tightly into normal persistence
  21. Stream vs Database • Streams can be better for flexibility and multi-tenancy • Streams can be 50–100x faster than a database (no mutation) • Faster means fewer arguments about performance optimization • Operations are simpler, so sharing data works better • You don’t have to commit to one type of database: push updates through a stream and let each group use the database they want (see the sketch below)
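A hedged sketch of that last bullet: each team attaches its own consumer group to the shared stream and materializes the updates into whatever store it prefers. SQLite stands in for “the db they want”; the topic and field names are carried over from the earlier sketch and are illustrative:

```python
# Hypothetical materializer: team A builds its own queryable view from the
# shared card-activity stream; another team would simply use a different
# group_id and a different target store.
import json
import sqlite3
from kafka import KafkaConsumer

db = sqlite3.connect("team_a_view.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS card_use "
    "(card TEXT PRIMARY KEY, last_merchant TEXT, last_amount REAL)"
)

consumer = KafkaConsumer(
    "card-activity",
    bootstrap_servers="broker:9092",
    group_id="team-a-materializer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for msg in consumer:
    row = msg.value
    db.execute(
        "INSERT OR REPLACE INTO card_use VALUES (?, ?, ?)",
        (row["card"], row["merchant"], row["amount"]),
    )
    db.commit()
```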
  22. Collect Data (diagram: web servers in a data center write web-server logs; log-stash instances forward them to a log consolidator feeding a log_events stream)
  23. And Transport to Global Analytics (diagram: the data-center log_events stream is replicated to GHQ, where events are elaborated (log-stash), aggregated, and passed to signal detection)
  24. With Many Sources (diagram: one data center feeding the GHQ pipeline)
  25. With Many Sources (diagram: a second data center, with its own web servers and log_events stream, also replicates into GHQ)
  26. With Many Sources (diagram: three data centers, each with its own web servers, log-stash, and log_events stream, all replicating into GHQ)
  27. Analytics Doesn’t Care About Location (diagram: the GHQ pipeline of elaboration, aggregation, and signal detection is unchanged no matter where the data originates)
  28. Analytics Doesn’t Care About Location – topics are named data-center.machine.sensor
  29. Analytics Doesn’t Care About Location – the pattern data-center.*.sensor selects one sensor from every machine in a data center
  30. Analytics Doesn’t Care About Location – the pattern data-center.machine.* selects every sensor on one machine
  31. Analytics Doesn’t Care About Location – the pattern *.*.sensor selects one sensor type across all data centers and machines
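A small sketch of how a consumer might exploit that naming scheme (the concrete topic names and patterns are assumptions, not from the talk): with hierarchical topic names, the GHQ analytics job subscribes by pattern and never needs to know which data center or machine produced the data.

```python
# Sketch: regex subscription over topics named data-center.machine.sensor.
from kafka import KafkaConsumer

def handle(message):
    # Stand-in for the aggregation / signal-detection work on the slides.
    print(message.topic, message.value)

consumer = KafkaConsumer(bootstrap_servers="broker:9092", group_id="ghq-analytics")

# One sensor type from every machine in every data center (*.*.sensor)
consumer.subscribe(pattern=r"^[^.]+\.[^.]+\.pressure$")

# The other selections from the slides are just different patterns, e.g.
#   r"^dc-east\.[^.]+\.pressure$"   -> one data center, any machine, one sensor
#   r"^dc-east\.pump-17\..*$"       -> every sensor on one machine

for message in consumer:
    handle(message)
```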
  32. Act locally, learn globally
  33. Machine Learning Logistics
  34. Traditional View
  35. Traditional View: This isn’t the whole story
  36. 90% of the effort in successful machine learning isn’t in the training or model development… It’s the logistics
  37. Why? • Just getting the training data is hard – Which data? How to make it accessible? Multiple sources! – New kinds of observations force restarts – Requires a ton of domain knowledge • The myth of the unitary model – You can’t train just one – You will have dozens of models, likely hundreds or more – Handoff to new versions is tricky – You need run-time comparisons to be sure which model is better
  38. What Machine Learning Tool is Best? • Most successful groups keep several “favorite” machine learning tools at hand – No single tool is best in every situation • The most important tool is a platform that supports logistics well – You don’t have to do everything at the application level – Lots of what matters can be handled at the platform level • A good design for the logistics can make a big difference
  39. Some Gotchas • Ops-oriented people will not “get it” regarding modeling subtleties • Data scientists will not “get it” regarding operational realities • Therefore, modelers have to deliver self-contained models • And ops has to provide pre-wired structure
  40. Rendezvous Architecture (diagram: requests enter an Input stream feeding Models 1–3; their outputs go to a Scores stream; the Rendezvous server writes the chosen response to a Results stream)
  41. Rendezvous to the Rescue: Better ML Logistics • A stream-first architecture is a powerful approach with surprisingly widespread advantages – Innovative technologies are emerging for streaming data • A microservices approach provides flexibility – Streaming supports microservices (if done right) • Containers remove surprises – Predictable environment for running models
  42. Rendezvous: Mainly for Decisioning Engines • Decisioning models – Looking for a “right answer” – Simpler than reinforcement learning • Examples include: – Fraud detection – Predictive analytics / market prediction – Churn prediction (as in telecommunications) – Yield optimization – Deep learning in the form of speech or image recognition, in some cases
  43. What We Ultimately Want (diagram: request → model → response)
  44. But This Isn’t The Answer (diagram: a load balancer spraying requests across Models 1–3)
  45. First Try with Streams (diagram: an Input stream feeds Models 1–3, but it is unclear how responses get back to the caller)
  46. First Rendezvous (diagram: Input stream → Models 1–3 → Scores stream → Rendezvous → Results stream → response)
  47. Some Key Points • Note that all models see identical inputs • All models run in the production setting • All models send scores to the same stream • The rendezvous server decides which scores to ignore • Roll forward, roll back, and correlated comparison are all now trivial
  48. Reality Check: Injecting External State (diagram: requests from the world enter a Raw stream; an enrichment step adds external data from a database before the Input stream feeds Models 1–3)
  49. Recording Raw Data (as it really was) (diagram: a Decoy model sits alongside Models 2 and 3 on the Input stream and writes everything it sees to an Archive)
  50. Quality & Reproducibility of Input Data is Important! • Recording raw-ish data is really a big deal – Data as seen by a model is worth gold – Data reconstructed later often has time-machine leaks – Databases were made for updates; streams are safer • Raw data is useful for non-ML cases as well (think flexibility) • The decoy model records training data as seen by the models under development & evaluation
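A minimal sketch of a decoy “model” (topic and file names are illustrative): it subscribes to the same input stream as the real models, does no scoring, and simply archives exactly what it sees, so later training data matches what the models actually saw.

```python
# Hypothetical decoy: archive every input record exactly as the models see it.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "rendezvous-input",                      # assumed input topic name
    bootstrap_servers="broker:9092",
    group_id="decoy-archiver",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
with open("archive.jsonl", "a") as archive:
    for msg in consumer:
        archive.write(json.dumps({"offset": msg.offset, "input": msg.value}) + "\n")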
  51. Canary for Comparison (diagram: the Input stream feeds the real model, a canary model, and the decoy/archive; the real model’s result is compared (∆) against the canary’s)
  52. What Does the Canary Do? • The canary is a real model, but is very rarely updated • The canary results are almost never used for decisioning • The virtue of the canary is stability • Comparing to the canary results gives insight into new models
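One way to use canary output, sketched under the assumption that scores are floats keyed by request id: a drifting mean difference or a changed spread between a candidate model and the stable canary is an early warning.

```python
# Sketch: summarize how a candidate's scores differ from the canary's on the
# same requests. Input dicts map request id -> score.
def compare_to_canary(canary_scores, candidate_scores):
    common = canary_scores.keys() & candidate_scores.keys()
    if not common:
        return {"n": 0}
    deltas = [candidate_scores[k] - canary_scores[k] for k in common]
    n = len(deltas)
    mean = sum(deltas) / n
    spread = (sum((d - mean) ** 2 for d in deltas) / n) ** 0.5
    return {"n": n, "mean_delta": mean, "delta_stddev": spread}

# Example: a healthy candidate tracks the canary closely.
print(compare_to_canary({"a": 0.90, "b": 0.10}, {"a": 0.88, "b": 0.12}))
```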
  53. Isolated Development With Stream Replication (diagram: in production, requests flow through Raw and Input streams, external-data enrichment, and Models 1–3 with their internal streams; those streams are replicated into a development environment where Model 4 runs against the same inputs with new external data)
  54. A Quick Review (diagram: Proxy → Input stream → Models 1–3 → Scores stream → Rendezvous → Results stream → Proxy → response)
  55. The Proxy Talks to the Outside World (same diagram, with the proxy highlighted)
  56. The Input Stream Feeds All Models Identically (same diagram, with the input stream highlighted)
  57. The Scores Stream Contains All Results (same diagram, with the scores stream highlighted)
  58. The Rendezvous Picks A Result (same diagram, with the rendezvous server highlighted)
  59. Results Return Via a Stream and a Return Address (same diagram, with the results stream back to the proxy highlighted)
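A sketch of the return-address mechanics from the proxy’s point of view (topic names, field names, and the per-request polling are simplifications, not from the talk): the proxy tags each request with an id and a results topic, and the rendezvous server writes its chosen score back to that address.

```python
# Hypothetical proxy: publish the request with a correlation id and a return
# topic, then wait (with a deadline) for the rendezvous result carrying the
# same id. A real proxy would keep one long-lived consumer and a map of
# pending requests instead of scanning per request.
import json
import uuid
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
results = KafkaConsumer(
    "results-proxy-1",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=50,          # overall wait budget for this sketch
)

def score_request(features):
    request_id = str(uuid.uuid4())
    producer.send("rendezvous-input", {
        "id": request_id,
        "return_topic": "results-proxy-1",
        "features": features,
    })
    for msg in results:
        if msg.value["id"] == request_id:
            return msg.value["score"]
    return None                      # deadline passed with no usable answer
```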
  60. Models in production live in the real world: conditions may (will) change
  61. Rendezvous Schedules • The key idea of rendezvous schedules is to define the trade-off of latency versus model priority – At short delays, we want the best – At moderate delays, we will compromise a bit – Near the deadline, we will take any answer at all • Normally the same rendezvous schedules apply to all transactions – Overriding the default schedule has bona fide uses
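A sketch of what such a schedule could look like in code (model names and times are made up): early on, only the preferred model’s answer is accepted; lower-priority answers become acceptable as time passes; and anything goes once the deadline is near.

```python
# Illustrative schedule: (model name, earliest elapsed time in seconds at
# which that model's answer becomes acceptable), plus an overall deadline.
import time

SCHEDULE = [("champion", 0.000), ("challenger", 0.010), ("fallback", 0.025)]
DEADLINE = 0.040

def pick_result(scores_so_far, started_at):
    """scores_so_far maps model name -> score, filled in as score messages arrive."""
    elapsed = time.monotonic() - started_at
    for model, earliest in SCHEDULE:
        # Prefer higher-priority models, but only once their window has opened.
        if elapsed >= earliest and model in scores_so_far:
            return scores_so_far[model]
    if elapsed >= DEADLINE and scores_so_far:
        return next(iter(scores_so_far.values()))   # past the deadline: take anything
    return None                                     # keep waiting
```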
  62. Rendezvous Overrides • An incoming transaction can carry an overriding schedule – This is great for QA, to see output from a specific model – Overriding the default schedule is also good for systemic A/B tests • Overrides should be unusual
  63. Scaling Up • More kinds of models – multiple rendezvous frameworks for different tasks • More throughput – Fast default models – Partition the input stream to allow parallel model evaluation – Input batching • Extreme volumes require extreme measures – Cannibalize fancy models to run more fast/simple models – Speed before beauty
  64. Faster Throughput Through Failure • Suppose we have one model that can handle 10,000 t/s at 2 ms – But this isn’t the most accurate model; not bad, but not best • And our champion model can handle 1,000 t/s at 10 ms • Then imagine a burst of 2,000 t/s for several minutes • The champion can only evaluate half of all requests – It should skip to keep up – The fast model will cover for the champion
  65. (diagram: Input stream → Models 1–3 → Scores stream)
  66. (diagram repeated, animation step)
  67. (diagram repeated, animation step)
  68. Always have a default or fallback model. Models that fall behind should discard requests to catch up.
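A sketch of the “discard to catch up” rule (the topic names, latency budget, and run_model stub are illustrative): a model runner checks how old each request already is and skips anything that can no longer make the rendezvous deadline, trusting the fast default model to cover those requests.

```python
import json
import time
from kafka import KafkaConsumer, KafkaProducer

BUDGET_MS = 25        # assumed end-to-end latency budget for this model

def run_model(features):
    # Stand-in for real model evaluation.
    return 0.5

consumer = KafkaConsumer(
    "rendezvous-input",
    bootstrap_servers="broker:9092",
    group_id="champion-model",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for msg in consumer:
    age_ms = time.time() * 1000 - msg.timestamp   # record timestamp set on produce
    if age_ms > BUDGET_MS:
        continue                                  # already too late: skip and catch up
    score = run_model(msg.value["features"])
    producer.send("scores", {"id": msg.value["id"], "model": "champion", "score": score})
```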
  69. Limitations of Rendezvous • 100% speculative execution can be expensive – It can be mitigated by partial speculation – Or it may just be too expensive • Minimum Viable Products should be minimal – You may not require zero downtime… be realistic • Context may be too large • Latency limits may be too stringent
  70. Ad Targeting Example (diagram: 1) the proxy uses the user profile and context for a rough-cut pre-selection of ads; 2) roughly 1,000 candidate ads go to sharded ad scoring; 3) each is scored in detail for p(click))
  71. Why Not Full Rendezvous? • Thousands of ads per second x 1,000 candidates = 1M scores per second – AKA “a lot” • Scoring a single model is expensive • Sharding and replication provide a form of failure tolerance • Full speculative execution across several options is prohibitive • Latency guarantees can be very short (10 ms)
  72. Rendezvous-lite Options • We have some options • We can allow selective speculation on marked requests – If only 1% of ads run speculative execution, we can pack 10x more shards per node and use 10x fewer nodes – Selective speculation doesn’t give redundancy • We can release results if >80% of the shards reply • Temporary speculation during hand-offs is useful
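A sketch of the “>80% of shards” rule (the quorum fraction, deadline, and reply source are illustrative): collect per-shard scores until either the quorum is met or the deadline expires, then release whatever has arrived.

```python
import time

def gather_shard_scores(replies, num_shards, quorum=0.8, deadline_s=0.010):
    """replies yields (shard_id, score) pairs as shards respond."""
    start = time.monotonic()
    scores = {}
    for shard_id, score in replies:
        scores[shard_id] = score
        if len(scores) >= quorum * num_shards:
            break                                  # enough shards have answered
        if time.monotonic() - start > deadline_s:
            break                                  # deadline: release what we have
    return scores

# Example: 4 of 5 shards answering satisfies an 80% quorum.
print(gather_shard_scores(iter([(1, 0.2), (2, 0.9), (3, 0.4), (4, 0.7)]), num_shards=5))
```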
  73. Let’s Review
  74. A Quick Review (diagram: Proxy → Input stream → Models 1–3 → Scores stream → Rendezvous → Results stream → Proxy → response, repeated from slide 54)
  75. The Proxy Talks to the Outside World (same diagram, with the proxy highlighted)
  76. The Input Stream Feeds All Models Identically (same diagram, with the input stream highlighted)
  77. The Scores Stream Contains All Results (same diagram, with the scores stream highlighted)
  78. The Rendezvous Picks A Result (same diagram, with the rendezvous server highlighted)
  79. Results Return Via a Stream and a Return Address (same diagram, with the results stream back to the proxy highlighted)
  80. Not Such Bad Ideas • Keep models running “in the wings” – Don’t wait until conditions change to start building the next model – Keep new short-history models ready to roll, and some graybeards as well • Hot hand-off – With rendezvous: just stop ignoring the new best model • Deploy a canary server – Keep an old model active as a reference – If it was 90% correct, the difference from any better model should be small – The score distribution should be roughly constant
  81. New book on how to manage machine learning models: download the free PDF or read it free online via @MapR at https://mapr.com/ebook/machine-learning-logistics/ Also: “Rendezvous Architecture” by Ted Dunning & Ellen Friedman, in Encyclopedia of Big Data Technologies, Sherif Sakr and Albert Zomaya, editors, Springer International Publishing, in press 2018.
  82. Contact Information – Ted Dunning, PhD. Chief Application Architect, MapR Technologies. Committer, PMC member, board member, ASF. O’Reilly author. Email: tdunning@mapr.com, tdunning@apache.org. Twitter: @Ted_Dunning
  83. Q&A – Engage with us: @mapr, tdunning@mapr.com, @Ted_Dunning
