Productive Machine Learning and Deep Learning Projects
Machine Learning (ML) and Deep Learning (DL), known collectively as Artificial Intelligence, are no longer luxuries but necessities for companies that want to remain relevant in today's market. Data-driven organizations that invest in ML and DL projects can create and deploy models that generate predictions in real time. Even more exciting, these real-time predictions allow organizations to trigger actions automatically, which ultimately improves the bottom line. However, many organizations struggle to turn ML and DL projects into models that improve performance. This talk focuses on how companies can enable data science platforms so that data engineers, data scientists, and business analysts can quickly explore data, create and test ML and DL models, and deploy them to staging and production environments regardless of the language or framework used by the team and organization.
Greg Werner, CEO & Founder, 3Blades.io at MLconf ATL 2017
1. Data Science with Teams
Improve the efficiency of your data science teams
with platforms that enhance collaboration and
flexibility
2. Agenda
● Some Background
● Goals
● Data Science Project Teams
● Challenges
● Some Solutions
● Conclusions
3. Background
Integration experience with Oil & Gas, Financial, Insurance and Retail industries in
multiple geographies
What did these customers have in common? All had data science teams that
worked in silos
Difficulties when taking a data science course
8. Data Science Teams - The New Way
Data Scientist
Finance Manager
Accountant
Tax and Compliance
Treasury
Data Gurus:
- Analytics
- Data Engineers
- Business Intelligence
- Compliance
IT Manager
10. I Want GPUs - And I Just Want Them to Work
Work-arounds for the NVIDIA Docker wrapper:
- nvidia-docker run -d -p 8888:8888 tensorflow/tensorflow:latest-gpu
OR
- docker run -ti --rm `curl -s http://localhost:3476/docker/cli` tensorflow/tensorflow:latest-gpu
OR
- docker run -ti --rm --volume-driver=nvidia-docker --volume=nvidia_driver_375.82:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 nvidia/cuda nvidia-smi
11. The Need for DevOps Chops
The old way: a Docker container on an EC2 instance behind a reverse proxy with a static upstream location name ($$$)
The new way: a Docker container on an EC2 instance, registered by Registrator, behind a reverse proxy kept up to date with consul-template ($)
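The "new way" can be sketched with a minimal consul-template fragment (the service name `notebook` and the file names are hypothetical, not from the talk): Registrator publishes each container into Consul, and consul-template regenerates the proxy's upstream list whenever membership changes.

```text
# notebooks.ctmpl — illustrative template for an nginx upstream block
upstream notebooks {
{{ range service "notebook" }}
  server {{ .Address }}:{{ .Port }};
{{ end }}
}

# Illustrative invocation: render the template and reload nginx on change
# consul-template -template "notebooks.ctmpl:/etc/nginx/conf.d/notebooks.conf:nginx -s reload"
```

This removes the need to hand-edit a static upstream list every time a container is launched or retired.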
15. Solutions
● Provide flexibility with the tools that data scientists use for exploratory data analysis and visualizations
● One central source for project files with support for version control
● Share visualizations from EDA
● Train and save Machine Learning and Deep Learning models with multiple frameworks, from within the same project
● Streamline deployment pipelines
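As an illustrative sketch of the "train and save" step (not the platform's actual API, and the toy linear model stands in for any framework-specific model object): train a model, serialize it, and restore it as a staging or production service would.

```python
import pickle

# Toy "model": one-feature linear regression fit in closed form.
# Stands in for any framework-specific model object (sketch only).
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return {"slope": slope, "intercept": my - slope * mx}

def predict(model, x):
    return model["intercept"] + model["slope"] * x

model = fit([1, 2, 3, 4], [2, 4, 6, 8])   # data follows y = 2x
blob = pickle.dumps(model)                # serialize for a shared model store
restored = pickle.loads(blob)             # e.g. loaded by the staging service
print(predict(restored, 5))               # → 10.0
```

The point is the round trip: the same serialized artifact moves unchanged from the project workspace to staging and production.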
Talking points:
Siloed data initiatives are a common denominator
Data scientists were segregated from the rest of the organization
Tooling was disparate
Initially, the need to streamline Jupyter Notebook deployments for a class of students came up after many students complained about the time and effort required to install specific dependencies for their tasks. Package managers were not enough: users also needed an integrated, consistent, and reliable environment in which to complete their assignments using Jupyter Notebooks. We also noticed that companies, in general, did not provide a homogeneous environment for their data science teams. This led to many headaches but was considered business as usual.
Talking points:
Issues encountered in the education vertical were common across industry, i.e. too much time spent on configuration
Basic ROI calculations justified the implementation and support of a data science hub
The data science platform "a-ha" moment came when pitching a solution to consolidate project workspace environments for different people across different organizations, in particular for Exploratory Data Analysis (EDA). Educational institutions are usually constrained by budget; however, after we provided ROI numbers on how much time and effort Teaching Assistants (TAs) spent providing technical support to their users, the decision to implement a data science platform was a no-brainer. Nevertheless, we suspected that enterprises (SMBs and large companies alike) were encountering the same challenges, exacerbated by the fact that more personas were involved within data science and analytics teams.
Talking points:
Disparate teams
Data scientists siloed from the rest of the organization
Ultimate goal is to automate certain processes within the organization
Automation helps improve the top and bottom lines, improves competitiveness
Organizations struggle to become ‘data driven’. What does that mean? Data driven organizations are those that wish to use the data they have available to improve insights and allow their business to become more competitive. Assuming the organization has successfully consolidated their data into central data warehouses or data lakes, and assuming this data is defined with standard schemas, data science and data analytics teams have the power to analyze the data, obtain valuable insights and start improving the agility of their organizations with ‘prescriptive analytics’ and ‘predictive analytics’. Prescriptive analytics involves creating machine learning and deep learning models that automate certain business processes, such as:
Automatically tagging images with classification labels (cat or not a cat)
Automatically scoring a customer with the probability that they will churn
Recommending value-added products to increase the checkout dollar amount at an e-commerce site
Flagging email as spam or not spam
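The churn example above can be sketched in a few lines. This is a minimal illustration, not a trained model: the feature names and weights are made up, and a real system would learn them from historical data.

```python
import math

# Hypothetical churn scorer: feature names and weights are invented
# for illustration, not learned from real customer data.
WEIGHTS = {"months_inactive": 0.8, "support_tickets": 0.4, "tenure_years": -0.6}
BIAS = -1.0

def churn_probability(customer):
    # Weighted sum of features, squashed to [0, 1] by the logistic function
    z = BIAS + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

p = churn_probability(
    {"months_inactive": 3, "support_tickets": 2, "tenure_years": 4}
)
print(round(p, 3))
```

A downstream process can then trigger an action (for example, a retention offer) whenever the probability crosses a threshold, which is exactly the "predictions trigger actions" loop described above.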
However, organizations have struggled to integrate data scientists into their businesses. Data science teams that just "do the math" and create visualizations on an organization's data sets do not provide much value in and of themselves. Likewise, a machine learning model that automatically recommends a product that is not strategic to the organization does not provide much value.
Talking points:
Dashboards democratize data so that team members can quickly absorb meaningful insights and key performance indicators.
Exploratory data analysis (EDA) and model creation/deployment not really a part of the picture.
Traditional Business Intelligence tools have been around for years. Some tools offer specific integrations with a variety of data sources and allow users to quickly create rich, interactive visualizations of their data. SQL, a language made popular by relational databases, remains a very popular language for analytics. New developments help accelerate the time from data source to dashboard with in-memory calculations, GPU-powered databases, and more.
Big data tools, such as Hadoop and Spark, allow users to create dashboards from large data sets. However, BI tools traditionally rely on structured data, and traditional dashboards don't account for how machine learning and deep learning models are created and deployed.
Just a review of a Data Scientist’s skill set.
Talking points:
Organizations realize they need to automate their processes and that automation must come from real time analysis of data points
The deliverable is not just a BI dashboard anymore, the deliverable is a deployable machine learning and deep learning model
Embedding a data science team member into the group increases value
As mentioned, historically data science teams have been isolated from the rest of the organization.
Successful data-driven organizations embed their data scientists into various business groups. For example, data extraction and loading into a warehouse table are done by engineering teams; however, a data science liaison embedded within a department or relevant company-wide project can help data engineers improve the schema definition for the data being exposed, which can save valuable time during the exploratory data analysis phase. Data engineers can use their favorite Extract, Transform, and Load (ETL) tools to create tables that remove not-a-number (NaN) rows, drop irrelevant columns such as database primary and foreign keys (PK/FKs), and so on.
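That kind of ETL cleanup can be sketched with pandas (the column names and data here are hypothetical): drop the surrogate-key column and the rows containing NaN values before exposing a "clean" table.

```python
import pandas as pd
import numpy as np

# Hypothetical raw extract: "id" is a database surrogate key,
# irrelevant for exploratory data analysis.
raw = pd.DataFrame({
    "id": [101, 102, 103],
    "spend": [250.0, np.nan, 90.0],
    "region": ["ATL", "NYC", None],
})

clean = (
    raw.drop(columns=["id"])   # drop PK/FK-style columns
       .dropna()               # drop rows with NaN/missing values
       .reset_index(drop=True)
)
print(len(clean))  # only the fully populated row survives
```

Delivering `clean` rather than `raw` is what lets data scientists work self-service without re-doing the same hygiene in every notebook.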
Inversely, the data scientist can help the person telling the data story (who could be anyone in the group, including herself) understand which features are relevant and how certain normalizations were performed, without delving into the technical details: "This was the only customer that bought a widget in Atlanta, so the attributes for this person were adjusted so as not to skew the dataset in their favor."
Talking points:
Move from predictive to prescriptive analytics
Deliver a machine learning or deep learning model that will allow organizations to automate processes
Visualizations are still important, but used for telling a data story for EDA and also for visualizing how models are behaving in real time
Predictive analytics looks at the historical trends in data to provide insights. Organization members are then tasked to optimize processes to improve organizational results based on trends. However, companies need to automate tasks (remove the human from the actual task execution) based on certain indicators. In this case, visualizations are used in EDA to better understand the data with the goal of creating and deploying machine learning and deep learning models that can automate certain organization processes.
Talking points:
Support data source imports from multiple sources
EDA needed as first step to build and deliver artifacts to automate business processes. Artifacts in this context are machine learning and deep learning models.
Data engineers and DevOps need access to data science hub to streamline their own processes
Traditional teams use Excel spreadsheets, among other tools, and fly files back and forth via email, chat applications, or external project management solutions. Even when all users work within shared environments such as Google Docs or Office 365, teams have no way of sharing all files and tools within one common environment, particularly for exploratory data analysis, since viewing and editing files within these environments are constrained to a certain set of file formats. Moreover, certain organizations and individuals prefer one language over another. For example, a data science team supporting the Finance department may lean toward the R programming language, while the data science team supporting the marketing department may lean toward the Python ecosystem. In both cases, users may use multiple tools for one language: some individuals may prefer RStudio for R, and others may prefer using R with Jupyter Notebooks. Server management is important to optimize compute resources.
A central source for project files eases compliance requirements. Usually, data engineers (whether due to security requirements or simply because they don't want to surface multiple schemas/formats to different users) would rather deliver the data product to a "clean" table so that data scientists can do their work using self-service approaches. Having a centrally managed set of files for specific projects also helps keep things organized when different users are accessing project files, so version control becomes important as well.