A Spark-stack for Automating Life-cycle of Prediction Models
© 2017 24/7 Customer, Inc. All rights reserved.
Spark stack for Model life-cycle management

  1. A Spark-stack for Automating Life-cycle of Prediction Models
     Samik Raychaudhuri, Ph.D., Director, Data Science Group, [24]7.ai Innovation Labs
     Bangalore Apache Spark Meetup, Monday, November 13, 2017
     © 2017 24/7 Customer, Inc. All rights reserved.
  2. Agenda
     • Introduction
     • Use cases for ML models at [24]7
     • Model management at [24]7
     • The Spark Stack
     • Conclusion
  3. Prediction Models in [24]7.ai
  4. About [24]7
     • [24]7 is a software company based in the Bay Area, US, and Bangalore, India, delivering customer support solutions enhanced by predictive technologies
     • Using predictive models to drive enhanced customer experience is an emerging and niche area of application of analytics and big data
     • Our machine learning models on big data predict customer intent across various touchpoints in real time, helping us provide an intuitive experience when the customers (of our clients) contact us
  5. [24]7 - World’s Largest Self-Service Network
     • 2.5B digital interactions/year
     • 4.5TB interaction data/week
     • 90%+ CSAT across channels
     • 100M visitors/year
     • 1st true multi-modal solution
     • 1st omni-channel solution
     We deliver a cloud-based software platform that uses predictive analytics and big data to make company-to-consumer connections intuitive.
  6. [24]7 iLabs: A Quick Snapshot
     • Self service products
       • Social: social sharing
       • Mobile: mobile self-service
       • Vivid Speech: mobile for IVR
       • Speech: speech self-service IVR
     • Assisted service products
       • Assist (for Chat): smart chat platform for online and mobile engagement
       • Assist (for IVR): call deflection to mobile web chat for higher NPS and ROI
       • Assist (for Voice): smart voice agent platform for multi-modal engagement of voice callers
     • Solutions
       • Predictive Sales: drive higher incremental revenue and customer acquisition
       • Predictive Service: reduce customer effort to increase CSAT and NPS in customer service
     • Services
       • Chat Agents: chat agent services that engage customers and help reduce costs, generate revenue, and improve CSAT
       • Voice Agents: voice agent services that engage customers and help reduce costs, generate revenue, and improve CSAT
  7. Data Science – What it means for [24]7
     Machine learning at scale, creating personalized, intuitive consumer experiences
     • Customer Intent Engine (intent models)
       • Input: fn (customer type, location, identity, interaction context, journey, behavior, …)
       • Intent: purchase; issue with product or service; …
     • Engagement Engine
       • Input: fn (identity, intent type, history, channel affinity, customer value, …)
       • Channels: guided self-service, chat, phone
       • Measure: usage, containment, repeat, …
     • Outcomes
       • Sales, resolution, experience, retention
       • Metrics: conversion rate, revenue, CSAT, …
  8. Big Data in [24]7
     [Figure: data sources and technologies]
  9. Use case of intent prediction: Web visits
     • For our clients in the retail vertical, we provide chat agents who are experienced in providing differentiated support
     • The differentiation is based on:
       • Current phase of the journey
       • Specific persona of the visitor
     • We use ML models to compute probabilities of various intents, and use them to provide customized intervention for sales and service journeys
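
The last bullet of the slide above (intent probabilities driving a customized intervention) can be sketched roughly as follows. This is only an illustration: the intent names, the intervention names, and the 0.6 threshold are hypothetical and do not come from the slides.

```python
from typing import Dict, Optional

# Hypothetical mapping from a predicted intent to an intervention.
# Neither the intent names nor the interventions come from the slides.
INTERVENTIONS = {
    "purchase": "offer_sales_chat",
    "service_issue": "offer_service_chat",
}


def choose_intervention(
    intent_probs: Dict[str, float], threshold: float = 0.6
) -> Optional[str]:
    """Pick the intervention for the most likely intent.

    Returns None when there are no scores, when the top probability is
    below the (assumed) threshold, or when the top intent has no mapping.
    """
    if not intent_probs:
        return None
    intent, prob = max(intent_probs.items(), key=lambda kv: kv[1])
    if prob < threshold:
        return None
    return INTERVENTIONS.get(intent)
```

For example, `choose_intervention({"purchase": 0.7, "browse": 0.3})` selects the sales chat offer, while an evenly split score returns `None` and no intervention is triggered.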
  10. Use case of intent prediction: IVR Calls
     • For our clients in banking, our IVR platform provides self-service options for service journeys
     • The challenge is to resolve the issues faced by the customer within the IVR platform itself
     • One of our flagship offerings is our natural language understanding engine, which works on free-flowing responses
     • Again, we use ML models to compute probabilities of various intents from the response, and use them to provide specific service or transfer to a voice agent along with context
  11. Use case of intent prediction: within Chat
     • An emerging use case is deploying AI-assisted Virtual Agents (chatbots) for various enterprise use cases
     • The challenges here are:
       • To detect intent from natural language texts, and then provide a natural language response, essentially continuing a natural conversation
       • To be able to bring in human agents when the conversation goes out of scope for the VA
     • We are using ML models to detect intent and state from the conversation and take appropriate action
  12. Technology and Model Management at [24]7
  13. High Level Architecture
     [Diagram labels: Events, Real Time Platform, Batch Data Platform, Events, Reporting and BI, Predictions, Models]
  14. [24]7 Big Data Platform: Technologies
     • We use multiple open-source technologies to power our platform. Some of the technologies in use:
     • Real Time Platform
       • Apache Cassandra ring [http://cassandra.apache.org/]
       • Jetty server for execution [http://www.eclipse.org/jetty/]
     • Batch Data Platform
       • Apache Hadoop [http://hadoop.apache.org/]
       • Apache Hive [http://hive.apache.org/]
       • Apache Spark [http://spark.apache.org/] [Upcoming]
     • Others
       • Apache Kafka [http://kafka.apache.org/]
       • Apache Avro [http://avro.apache.org/]
       • HP Vertica database [http://www.vertica.com/]
       • Apache Pig [http://pig.apache.org/]
       • Apache Druid [https://druid.apache.org/]
  15. Architecture for model building
     [Diagram labels: Events, Batch Data Platform, HDFS, Nightly MR Jobs, Structured Datamart, Regular Model Building, Model Management Platform, Analytics & Monitoring, Retraining, R&D Model Building, Deploy Trained Model]
  16. Model building workflow
     Sign Contract → Data Requirement Gathering → Data Capture → Exploratory Data Analysis → Model Building → Simulation → Model Deployment → Monitoring and Retraining
  17. Platform for Model Management – Why?
     • Prediction models are one of the key pieces for achieving the targets set in the contract; however, they are part of a larger workflow that needs standardization
     • Standard transformations: We now support a set of standard transformations, coded in the same standard way in any model
     • Standard libraries: Different libraries in different software ecosystems (e.g., R, Python, Spark ML) produce slightly different results. With this platform, we can compare models, or select one runtime to deploy models
     • Skill can become an issue when working on prediction models for various clients. The platform takes skill out of the equation by providing templates that encode best practices
  18. Spark Stack for Model Management
  19. Early Iteration for Model Management Platform
     • The model management platform was originally built on top of Vertica
     • Vertica from HP (now MicroFocus) is a columnar database with strong analytical query capabilities
     • We loaded the output of MR jobs into Vertica, which acted as our datamart
     • The model training workflow was managed by Oozie
     • The actual job of training models was performed in the Vertica cluster using Vertica UDFs written in C++ and R
  20. Early Iteration for Model Management Platform
     [Diagram labels: Events, Batch Data Platform, HDFS, Nightly MR Jobs, Structured Datamart: Vertica, Regular Model Building, Model Management Platform: Vertica UDFs + Oozie workflows, Analytics & Monitoring, Retraining, R&D Model Building, Deploy Trained Model]
  21. Pros and Cons of using Vertica
     • Pros
       • All the EDA and computations happened in-database, so there was no substantial data movement for model building
       • Vertica supports SQL and R, resulting in easy onboarding for analysts and data scientists
       • Custom code for feature engineering from existing columns
     • Cons
       • Speed of computation was limited by the size of the Vertica cluster
       • R UDFs cannot be parallelized, limiting the amount of distributed computation that can be done while training complex models
       • In some cases, hard to maintain or find R libraries compatible with Vertica
       • Compatibility issues in general
       • Small community of developers
       • Cumbersome model deployment
       • License requirement vs. the existing Spark cluster
  22. Moving to Spark
     • Spark is a strong distributed computation engine with a huge community supporting it
     • It is general purpose, helping to deploy scripts and code for data preparation as well as monitoring
     • SparkML has matured with lots of features, quick bug fixes and (again) an active community
     • We wanted to expand model building to more use cases, and the required data were already available in HDFS
     • Spark models can be directly deployed on our production JVM stack
     • We already had a Spark cluster which was being used for ad-hoc queries
     • Eliminates the need for specific feature engineering by using hashing tricks
  23. Model Management Platform with Spark
     [Diagram labels: Events, Batch Data Platform, HDFS, Nightly Jobs (MR+Spark), Structured Datamart: HDFS/Vertica, Regular Model Building, Model Management Platform: Spark Cluster, Analytics & Monitoring, Retraining, R&D Model Building, Deploy Trained Model]
  24. Developing the Framework
     • The framework is a wrapper around Spark libraries, developed in-house in Scala
     • Has specialized modules to manage:
       • Provision for config reading and validation
       • Provision for reading data from HDFS (through Hive) and Vertica
       • Provision for output (models) to be available as both bytecode and as other (legacy) formats
       • Provision for supporting custom model training workflows including post-processing
       • API for accessing individual functionality
     • Needed around 8-9 man-months to complete the project
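
The slides do not show what the framework's config looks like, so the following is only a guess at what a "config reading and validation" step might check. The keys `source`, `outputs` and `num_features` are hypothetical; only the allowed source and output kinds echo the bullets above. It is written in Python for brevity, whereas the actual framework is in Scala.

```python
from typing import Any, Dict, List

ALLOWED_SOURCES = {"hive", "vertica"}     # data sources named on the slide
ALLOWED_OUTPUTS = {"bytecode", "legacy"}  # output kinds named on the slide


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid.

    The key names are assumptions made for this sketch.
    """
    errors: List[str] = []

    source = cfg.get("source")
    if source not in ALLOWED_SOURCES:
        errors.append(f"source must be one of {sorted(ALLOWED_SOURCES)}, got {source!r}")

    outputs = cfg.get("outputs")
    if not isinstance(outputs, list) or not outputs:
        errors.append("outputs must be a non-empty list")
    else:
        for out in outputs:
            if out not in ALLOWED_OUTPUTS:
                errors.append(f"unknown output format {out!r}")

    num_features = cfg.get("num_features")
    if isinstance(num_features, bool) or not isinstance(num_features, int) or num_features <= 0:
        errors.append("num_features must be a positive integer")

    return errors
```

Collecting all problems in a list, rather than failing on the first one, lets a workflow runner report every issue in a config at once before any training job is submitted.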
  25. HashingTF in SparkML
     • HashingTF is a way of automated feature engineering from textual data using the hashing trick
     • Essentially, using this method, one can project text to a large multidimensional space, thereby capturing nuanced features
     [Diagram labels: UTF-8 Encoding, hashBytes, Byte to Int conversion, Multiply/Rotate/Add/Shift/XOR, Mixing Constants, Hashed Value, Index Scaling, Number of Features, TF Computation, HashingTF vector, Array of Features]
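
The pipeline named in the diagram (UTF-8 encoding, a mixing hash, index scaling by the number of features, term-frequency counting) can be sketched in plain Python as below. The hash follows the public MurmurHash3 x86 32-bit algorithm; the seed, the default feature count, and the unsigned modulo are assumptions of this sketch, so the indices are not expected to match what Spark's `HashingTF` produces.

```python
from typing import Dict, Iterable

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593  # mixing constants
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 0-3 bytes.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length, then avalanche with shifts and XORs.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hashing_tf(tokens: Iterable[str], num_features: int = 1 << 10, seed: int = 42) -> Dict[int, float]:
    """Sparse term-frequency vector: hashed index -> count."""
    vec: Dict[int, float] = {}
    for tok in tokens:
        idx = murmur3_32(tok.encode("utf-8"), seed) % num_features  # index scaling
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec
```

Calling `hashing_tf("where is my order".split())` returns a sparse mapping from bucket index to count. Distinct tokens can collide in the same bucket, which is the usual trade-off of the hashing trick and the reason the feature space is typically chosen to be large.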
  26. Using HashingTF
     • Using HashingTF can replace multiple preprocessing steps for ML model training:
       • Dealing with categorical variables
       • Custom feature extraction (e.g., using regular expressions) from text data
       • Example: categorizing URLs
     • In our comparison experiments, we have noticed similar or better results from models using HashingTF vs. models developed the traditional way
     • The effect was more prominent when the original model included multiple custom-created features from a large amount of text
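
To make the URL example concrete, here is one way hashed features could stand in for hand-written URL categories. This is a sketch under assumptions: the tokenization rule, the 32-bucket vector, and the use of `zlib.crc32` as a stand-in hash are illustrative choices, not what the platform or Spark's `HashingTF` actually does.

```python
import re
import zlib
from typing import List
from urllib.parse import urlparse


def url_tokens(url: str) -> List[str]:
    """Split a URL into the host plus lower-cased path and query tokens."""
    parsed = urlparse(url)
    pieces = re.split(r"[/\-_.?=&]+", parsed.path + "?" + parsed.query)
    return [p.lower() for p in [parsed.netloc] + pieces if p]


def hashed_url_features(url: str, num_features: int = 32) -> List[float]:
    """Dense count vector; each token is hashed into one of num_features buckets.

    zlib.crc32 is only a stand-in hash to keep the sketch dependency-free.
    """
    vec = [0.0] * num_features
    for tok in url_tokens(url):
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vec
```

With this approach, a URL such as `https://shop.example.com/cart/checkout?step=2` contributes the tokens `shop.example.com`, `cart`, `checkout`, `step` and `2` without anyone writing a "checkout page" regular expression; the model learns which buckets matter.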
  27. Other Benefits of using Spark
     • Model training is much faster compared to the legacy method
       • We are able to use distributed computation among the nodes
       • For a model trained on 1M rows, we see 2x-5x improvement
     • Innovative deployment of production models
       • Uses a mix of JavaScript code and Java byte-serialized code for a DAG of models
       • Complex models in Spark format (byte-serialized) run faster
     • Faster cycle of model training, testing and deployment, as the same underlying infrastructure is used
  28. Future work on the platform
     • We are exploring training of other complex models on the Spark platform
       • Deep learning models for chatbot conversations using MXNet
     • We have worked on some innovations in sampling, solving optimization problems and training SVM models in the Spark library
       • Would like to share those with the Spark community
  29. Questions
