Title
DataOps, the secret weapon for delivering AI, data science, and business intelligence value at speed.
Synopsis
● According to recent research, just 7.3% of organisations say the state of their data and analytics is excellent, and only 22% of companies are currently seeing a significant return from data science expenditure.
● Poor returns on data & analytics investment are often the result of applying 20th-century thinking to 21st-century challenges and opportunities.
● Modern data science and analytics require secure, efficient processes to turn raw data from multiple sources and in numerous formats into useful inputs to a data product.
● Developing, orchestrating and iterating modern data pipelines is an extremely complex process requiring multiple technologies and skills.
● Other domains have successfully overcome the challenge of delivering high-quality products at speed in complex environments. DataOps applies proven agile principles, lean thinking and DevOps practices to the development of data products.
● A DataOps approach aligns data producers, analytical data consumers, processes and technology with the rest of the organisation and its goals.
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
1. DataOps, the secret weapon for
delivering AI, data science, and business
intelligence value at speed
Harvinder Atwal
2. // Harvinder Atwal
{"about" : "me"}
{"current" : "Interim Chief Data Officer"} // MoneySuperMarket
{"previous" : "Insight Director, Tesco Clubcard"} // dunnhumby
{"previous" : "Senior Manager, Customer Strategy and Insight"} // Lloyds Banking Group
{"previous" : "Senior Operational Research Analyst"} // British Airways
// Web: @harvindersatwal, @gmail.com
3. MoneySuperMarket
1993 — We started life as Mortgage 2000
£2bn — 2019 estimated total of household savings
80% of UK online adults visit one of our websites each year
13.1 million — Active users (2019)
$2 billion — Market cap (2020)
989 — Product providers
4. 3 major ways Data Science can help the organisation:
Product Creation
Customer Experience
Business Efficiency
5. Applications aren’t in short supply
Demand Forecasting
Capacity Forecasting
Marketing automation
Supply chain management, automatic ordering
Automatic scaling of infrastructure
Document Classification
Image Annotation
Customer Service
Machine Translation
Anomaly Detection
Product Recommendation
Fraud Detection
Image Selection
Text Generation
Predictive Maintenance
Automated Pricing
Automated routing
Medical diagnosis
8. Just 7.3% of organisations say the state of
their data and analytics is excellent*
*New Vantage Partners Big Data and AI Executive Survey 2020
9. Only 22% of companies are currently
seeing a significant return from data
science expenditures*
*Obligatory conference presentation quote from Gartner/Forrester/McKinsey Consulting. Sorry.
12. Technology is less important than you
think, because the data says so
Respondents citing "Principal Challenge to Becoming Data-Driven is Technology" fell from 19.1% in 2018 to 9.1% in 2020*
*New Vantage Partners Big Data and AI Executive Survey 2020
15. model.fit(X_train, y_train)
is actually the easiest part
Data Governance
Data Quality
Data Security
Test Data Management
Version Control
Access Control
Team Organisation
Stakeholder Buy-in
Outcome Measurement
16. No one wants to talk about lack of value,
dirty data, people, processes and culture
19. Gartner has predicted that, “through 2022, only 20%
of analytic insights will deliver business outcomes.”
20. Step #1 Focus on Organisational Objectives
and Outcomes, not Data Outputs
Resources Activities Outputs Outcomes Impact
The Program Logic Model
Success does not start with data, data scientists, models, insight, or
technology; it literally ends with them.
Success starts with the impacts and outcomes you want and works back
from there to make them happen.
21. We need a completely different approach
to delivering outcomes from data
25. Data has little/no value once the process is complete
Data sharing limited to applications
requiring it for specific business processes
26. Big Design Up Front (BDUF) Data
Warehouses to meet specific
requirements
27. We can no longer apply steam-age thinking to data
Storage and Compute are cheap
Multiple sources of data: Network, Mobile data, Messages, Clickstream, Social Media, Billing, Call Details, CRM, Viewing, Inventory, Human Resources
Multiple data formats: Structured, Semi-Structured, Unstructured
Multiple data silos: Data Warehouses, Cubes and Marts, Operational Data Stores, Transactional Sources, File Systems, Big Data
28. Data needs to be shared and combined across many systems to
support multiple and sometimes complex analyses
Data Sources → Analytics Tools → Example Analytics Outputs
Example Analytics Outputs: Customer Lifetime Value Modelling, Churn Modelling, Financial Forecasting, Fraud Detection, Regulatory Feeds, Offer Prioritisation, Next Best Action, Sentiment Analysis, Segmentation, Product Affinity, Cross-sell Modelling, Financial Modelling, Cohort Analysis, Product Forecasting, Strategy Planning, Marketing Effectiveness, AB Testing, Reporting, Cubes
30. Data is no longer an application by-product,
it is a Product
31. DATA PRODUCTS
“A PRODUCT THAT FACILITATES AN END GOAL
THROUGH THE USE OF DATA”
- DJ PATIL, FORMER US CHIEF DATA SCIENTIST
#2 Think Product not Project
32. Data Analytics is complex manufacturing
Pipeline stages: Data Ingestion → Data Transformation → Data Analytics → Data Products → Use Cases
Data storage and Databases: Cloud file storage, NoSQL DB, Distributed file system, RDBMS, Analytical DB
Compute infrastructure and Query execution engines: VMs, Container services, Distributed compute frameworks, Distributed SQL execution engines
Data integration and Data processing pipelines: ETL/ELT tools, Stream processing
Data management: MDM, Data unification, Data preparation
Development tools, workspaces and software libraries
Data Analytics: Data exploration, Data visualization, Data analysis, Data science, Machine learning, Deep learning
Reproducibility, Deployment, Orchestration and Monitoring
Data Products: Output files, BI Tools, Interactive dashboards, Web Apps, APIs
Use Cases: Product creation, Customer experience and Business efficiency
All of this spans both the Product Development System and the Production System
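The manufacturing analogy can be sketched as a pipeline of composable stages mirroring the slide (ingestion → transformation → analytics → product). This is a minimal illustration only; all function names, fields and data are hypothetical:

```python
# Minimal sketch of a data pipeline as composable stages.
# Stage names mirror the slide; the data and fields are illustrative only.

def ingest():
    # In practice: pull raw records from files, streams, or databases.
    return [{"user": "a", "spend": "10.5"}, {"user": "b", "spend": "7.0"}]

def transform(rows):
    # Clean and type-cast raw records into an analysis-ready form.
    return [{"user": r["user"], "spend": float(r["spend"])} for r in rows]

def analyse(rows):
    # A trivial "analytics" step: total spend across all users.
    return sum(r["spend"] for r in rows)

def publish(result):
    # The data product: here, just a formatted report line.
    return f"total_spend={result:.2f}"

report = publish(analyse(transform(ingest())))
print(report)
```

Keeping each stage a pure function of its input is what makes the later slides on testing, version control and continuous integration tractable: every stage can be tested and replaced in isolation.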
34. #3 Apply Lean thinking
Eliminate waste, improve quality
The Optimist: the glass is half full.
The Pessimist: the glass is half empty.
The Lean Thinker: why is the glass twice as big as it should be?
38. Data pipelines will break the second you
put them into production
Often there is more complexity in the data
than in the code
39. Monitoring and testing is needed to trust
pipelines and keep them healthy
Integrity checks
Data Completeness Check
Data Versioning
Data Classification
Data Lineage Tracking
Data Cleansing
Watermarking
Quality Checks
File validation
Data Correctness Check
Data Accuracy Check
Data Consistency Check
Data Uniformity Check
ETL Performance Testing
End User Testing
Regression Testing
Metadata Testing
Transformation Testing
Integration Testing
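A completeness or correctness check from the list above can be as simple as a guard that runs before data enters the next pipeline stage. A minimal sketch, assuming a hypothetical record schema and domain rule:

```python
# Minimal data-quality gate: completeness and correctness checks
# that run before data enters the next pipeline stage.
# The schema (user, spend) and the domain rule are hypothetical.

def check_completeness(rows, required_fields):
    # Every record must carry every required field, non-null.
    return all(r.get(f) is not None for r in rows for f in required_fields)

def check_correctness(rows):
    # Domain rule: spend must be a non-negative number.
    return all(isinstance(r["spend"], (int, float)) and r["spend"] >= 0
               for r in rows)

rows = [{"user": "a", "spend": 10.5}, {"user": "b", "spend": 0.0}]

assert check_completeness(rows, ["user", "spend"]), "incomplete data"
assert check_correctness(rows), "invalid spend values"
print("quality gate passed")
```

In production these gates would typically also emit metrics so the monitoring on slide 57 can alert when pass rates degrade.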
41. Trust people with data
Identity and Access Management, Custom role permissions, Audit trail logs, Data Loss Prevention, Encryption of Data at Rest, Encryption of Data in Motion, Resource Monitoring, Firewall rules, Resource and Object Isolation, Penetration Testing, Code Encryption and Backup, Segregation of Duties, Authorisation protocols, Data Access and Privacy Policy, Metadata Management, Data Cataloging, Data Stewards and Owners
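Controls such as custom role permissions and segregation of duties are ultimately enforced in code. A minimal role-based access sketch; the roles and permission strings here are hypothetical, not from any particular platform:

```python
# Minimal role-based access control sketch.
# Roles and their permission strings are illustrative only.

ROLE_PERMISSIONS = {
    "data_scientist": {"read:curated"},
    "data_engineer": {"read:raw", "write:curated"},
    "data_steward": {"read:raw", "read:curated", "grant:access"},
}

def is_allowed(role, action):
    # Deny by default: unknown roles or actions get no access.
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("data_scientist", "read:curated"))  # True
print(is_allowed("data_scientist", "read:raw"))      # False
```

Deny-by-default plus an audit log of every `is_allowed` decision is what turns "trust people with data" from a slogan into something reviewable.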
47. Continuous Integration: Commit Code Regularly
Machine Learning Pipeline: Data Cleaning (Master / Dev Branch) → Feature Extraction (Master / Dev Branch) → Model Train (Master / Dev Branch)
Feeds into Product Development (e.g. App, Website, Marketing system, Operational System, Dashboard, etc.)
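Committing to master frequently only works if every commit is verified automatically. A minimal sketch of the kind of unit test a CI job could run against a pipeline step; `extract_features` and its schema are hypothetical stand-ins for the feature-extraction stage above:

```python
# A tiny pipeline step plus the test a CI job would run on every commit.
# extract_features, its thresholds, and its schema are hypothetical.

def extract_features(record):
    # Feature extraction step: derive model inputs from a raw record.
    return {
        "spend_bucket": "high" if record["spend"] > 100 else "low",
        "is_new_user": record["visits"] <= 1,
    }

def test_extract_features():
    features = extract_features({"spend": 150.0, "visits": 1})
    assert features["spend_bucket"] == "high"
    assert features["is_new_user"] is True

test_extract_features()
print("feature extraction tests passed")
```

In practice a CI server (e.g. via a test runner like pytest) would execute such tests on both the dev branch and master, so a broken feature-extraction change never reaches the model-training stage.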
49. #6 Organise for success –
Conway's Law isn't academic
Microsoft's research found organisational structure predicted code quality better than
other measurable factors such as Code Churn, Code Complexity, Dependencies, Code
Coverage or Pre-Release Bugs
50. Nearly 60 percent of breakaway organizations use
cross-functional teams, versus less than a third
of the remaining respondents.
51. Core Personas: Team Lead, Data Engineer, Data Scientist, Data Analyst, ML Engineer, Data Platform Administration
Supporting Personas: Solutions Architect, DBA, Security Expert, Specialist Tester, Technical Lead, Designer
55. Breadth vs depth of knowledge (Poor → Better → Good → Best):
I-Shaped (Specialist): expert at one thing
Dash-Shaped (Generalist): capable in a lot of things but not expert in any
T-Shaped (Generalising Specialist): capable in a lot of things and expert in one
Pi-Shaped (Multi-skilled) and M-Shaped (Poly-skilled)
No I (or -) in teams
56. Domain-orientated Teams (optimised for speed)
Cross-functional domain teams (Data engineers, Data scientists, Data analysts, Stakeholders, etc.), each serving its own domain use cases
Supported by: Analytics Specialists and Centre of Excellence; Source data system owners; Data Management and Platform teams (Databases, data storage, compute infrastructure, analytical tools, data governance and security, master data management, operations, etc.)
Self-service access; support teams help productionise
57. #7 Measure and act on feedback
Viewpoints and concerns:
Data Product — Is our product healthy? (Monitoring)
Customer — Is the product meeting our objectives? (Benefit Measurement)
Team — Is our team and its processes healthy? (Retrospectives)
Service — Is our internal service delivery fit for purpose? (Service Delivery Review)
Source: Matt Philips
58. Just as DevOps is more than Chef, Puppet
and Ansible
DataOps is more than tools
59. DataOps can't be delivered by a monolithic
solution, it requires multiple technologies
63. // Harvinder Atwal // Web
var current = {
  companyName : "MoneySuperMarket",
  position : "Interim Chief Data Officer"
};
var previous1 = {
  companyName : "Dunnhumby",
  position : "Insight Director, Tesco Clubcard"
};
var previous2 = {
  companyName : "Lloyds Banking Group",
  position : "Senior Manager"
};
var previous3 = {
  companyName : "British Airways",
  position : "Senior Operational Research Analyst"
};
{"about" : "me"}
var username = "harvindersatwal";
var linkedIn = "/in/" + username;
var twitter = "@" + username;
var email = username + "@gmail.com";
Editor's notes
Technology has actually more than kept up with data challenges.
But that's where the money and attention is going.
The problem in Data Science is an overemphasis on machine learning and, especially among junior data scientists, the belief that accuracy scores on a test dataset are the definition of success. This seems a very strange definition of success to me.
A perfect model that never goes into production is no better than a model that doesn't exist.
Don't get me wrong there are domains where model accuracy is extremely important like healthcare, fraud detection and adTech but these are minority compared to applications where doing anything is a big step up from doing nothing.
Edison’s “jumbo dynamo” at the world’s first power station in Lower Manhattan.
Methodologies that were invented for Software Development and Product Development apply to Data too.
This is sometimes really hard for Data Scientists who experiment with data on laptops to accept.
If you want data analytics to go faster, we need to accept that data analytics pipelines need brakes, in the form of rules and constraints, to build trust.
This may seem obvious if you're a developer but most data scientists and analysts are not trained in development and devops best practices.
However, adopting these approaches will lead to a step change in productivity.
Reproducibility is a critical requirement for DataOps and version control is the foundation upon which a lot of the delivery is built.
Version control makes it possible to maintain an archived version of the code used to produce a particular result. The most common software is Git, used via services like GitHub, GitLab or Bitbucket.
At a minimum, reviewers of a publication and future researchers should be able to:
1) Download all data and software used to generate the results.
2) Run tests and review source code to verify correctness.
3) Run a build process to execute the computation.
Version control makes it possible to maintain an archived version of the code used to produce a particular result. Examples include Git and Subversion.
Automated build systems document the high-level structure of a computation: which programs process which data, what outputs they produce, etc. Examples include Make and Ant.
It's not just code that needs to be reproducible but also the analytical environment, including packages and libraries, versions of languages, and system-level software. Popular solutions include package and environment managers like Conda, container solutions like Docker, or virtual machines containing an entire operating system plus a specific environment.
Configuration management tools document the details of the computational environment where the result was produced, including the programming languages, libraries, and system-level software the results depend on.
Examples include package managers like Conda that document a set of packages,
containers like Docker that also document system software,
and virtual machines that actually contain the entire environment needed to run a computation.
Data pipelines can be broken into multiple steps.
Here’s an example of machine learning pipeline.
This means you can work faster because different people can work on different parts of a pipeline in parallel.
But this can cause problems during integration, so the solution is to continuously integrate code, as often as you can, into the master branch.
In an enterprise setting where multiple data scientists could be working on a single project, the first step to doing data science work that scales is implementing version control, whether that’s GitHub, GitLab, Bitbucket, or another solution. Once your team has the ability to track code changes, the next step is to create a process in which they regularly commit their code to the master branch of your repository.
“organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”
— M. Conway
Personas are not people or job titles
Not all roles fit into cross functional teams, e.g. architect.
Work comes to the teams
Matt Philips
Which brings me on to tools
Just as chemistry is not about the tubes but the process of experimentation. DataOps is not tied to a particular technology, architecture, tool, language or framework.
However, some tools are better at supporting DataOps collaboration, orchestration, agility, quality, security, access and ease of use.
Long gone are the days when monolithic solutions worked
Previous stack, one vendor for data formats, data storage, query interface, language, and functions. Low interoperability. E.g. SAS
Now many technologies in an ecosystem.
Modern data science looks much more like software development than the data warehousing or BI of old.