As we move from the Data Warehouse to the Data Supply Chain, we open our perspective to include the full life cycle of data, from raw material to data product.
To produce the most valuable data products efficiently and cost-effectively, quality control processes must be put in place at each link in the chain, driven by the requirements of data scientists. With such processes in place, the burden on data scientists of cleansing data (typically 80% of their effort) can be greatly reduced.
Data Models – including schema, metadata, rules, and provenance – play a crucial role in ensuring an effective Data Supply Chain.
Each link in the Data Supply Chain must have firm boundaries and clear lines of team responsibility, with Data Models providing the natural borders.
In this talk we will discuss the processes that must be put in place at each link in the Data Supply Chain, including perspectives on:
* The definition of Data Supply Chain vs. Data Warehouse
* Tools to create, manage, utilize, and share Data Models
* Tracking Data Provenance
* ETL processes, driven by Data Models
* Collaborative processes across Data Science teams
* Visualization of Data and Data Flow across the Data Supply Chain
* Apache Hadoop and Apache Spark as enabling technologies
* Data Science
* Cross-Organizational Collaboration
* Security
1. Optimizing the Data Supply Chain for Data Science
61 Broadway Suite 1105
New York, NY 10006
info@vital.ai
http://www.vital.ai
Marc Hadfield
CEO, Vital A.I.
2. about: vital ai
Software Applications:
Artificial Intelligence,
Machine Learning,
Data Science.
Software Vendor & Consulting Services
3. agenda
• Data Models
• How A.I., Data Science, & Data Governance relate
• Data Supply Chain & the Data Product
• Problem: the “Telephone Game” across the DSC
• Architecture Transition from Data Warehouse to DSC
• Data Models and DSC; a Framework for Solutions
• Examples
• Collaboration & Visualization
note: general methodology, with some specific examples from Vital AI implementations.
4. takeaways:
• The Data Supply Chain is a supply chain to deliver Data Products
• Data Models can capture the implicit meaning of data (and that is the goal!)
• Data Models can help negotiate the implicit differences across the DSC
• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners
5. about data models:
Semantic Models
6. big data:
volume, velocity, variety, veracity
variety: data models
“Product”: different meaning in Manufacturing vs Retail context
Healthcare, same entity: “Patient”, “InsuredPerson”, “BillableEntity”
7. example:
Class: Person
Property: birthday
Standardized Unique Global Identifier (URI)
data type: date
relationship with property: age
allowed range of values (can’t be born in the future)
typical (average/expected) value…
(Birthdays in Wikipedia vs Customer Database)
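The constraints listed on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not VitalSigns output: the `Person` class, the `BIRTHDAY_URI` value, and the `MAX_PLAUSIBLE_AGE` bound are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical global identifier for the property (illustrative only).
BIRTHDAY_URI = "http://example.org/ontology#birthday"
# Assumed upper bound on a plausible age; a real model would tune this per dataset.
MAX_PLAUSIBLE_AGE = 120


@dataclass
class Person:
    uri: str
    birthday: Optional[date] = None  # data type: date


def age_on(birthday: date, today: date) -> int:
    """Relationship between the birthday and age properties."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def validate_birthday(person: Person, today: date) -> List[str]:
    """Return a list of rule violations (empty when the value looks fine)."""
    problems: List[str] = []
    if person.birthday is None:
        return problems
    if person.birthday > today:
        # allowed range of values: can't be born in the future
        problems.append("birthday is in the future")
    elif age_on(person.birthday, today) > MAX_PLAUSIBLE_AGE:
        # typical/expected value: flag values far outside it
        problems.append("age exceeds the expected range")
    return problems
```

The expected range is the part most likely to differ by source: birthdays in Wikipedia span centuries, while a customer database should not, so a bound like `MAX_PLAUSIBLE_AGE` would belong to the model for a specific dataset.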
8. about: vital ai tech
Vital AI Development Kit (VDK)
VitalSigns: Data Modeling & Code Generation
VitalService: Common API for Databases, Machine Learning, Apache Spark, Data Transforms
9. about: vital ai tech
VitalService
Query
Executable
Query
Query Generator
Common Query API:
• Relational DB (SQL)
• Graph DB (SPARQL)
• Key/Value Store
• NoSQL DB
• Document DB
• Apache Spark
• Hive (Hadoop)
• Predictive Models (a query for an unknown value)
Goal: Build A.I. applications across a variety of infrastructure with a consistent API & Models.
10. example data:
Message | hasSender | Person:Sender
Message | hasRecipient | Person:Recipient
11. example “MetaQL” query:
GRAPH {
    value segments: ["mydata"]
    ARC {
        node_constraint { Message.class }
        constraint { "?person1 != ?person2" }
        ARC_AND {
            ARC {
                edge_constraint { Edge_hasSender.class }
                node_constraint {
                    Person.props().emailAddress.equalTo("john@example.org")
                }
                node_constraint { Person.class }
                node_provides { "person1 = URI" }
            }
            ARC {
                edge_constraint { Edge_hasRecipient.class }
                node_constraint { Person.class }
                node_provides { "person2 = URI" }
            }
        }
    }
}
“Person” may have subtypes, like Student or Employee.
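To make the intent of the query concrete, here is a rough Python sketch that evaluates the same pattern over a small in-memory graph: find each Message whose sender has a given email address and whose recipient is a different Person, with subtypes counting as Person. It does not use or reproduce the MetaQL or VitalService API; `SUBCLASS_OF`, `is_a`, `find_messages`, and the sample data are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical class hierarchy: Student and Employee are subtypes of Person.
SUBCLASS_OF: Dict[str, str] = {"Student": "Person", "Employee": "Person"}

Node = Tuple[str, Dict[str, str]]   # (class name, properties)
Edge = Tuple[str, str, str]         # (source URI, edge type, destination URI)


def is_a(cls: str, target: str) -> bool:
    """True if cls is target or a (transitive) subtype of it."""
    while cls != target:
        if cls not in SUBCLASS_OF:
            return False
        cls = SUBCLASS_OF[cls]
    return True


def find_messages(nodes: Dict[str, Node], edges: List[Edge],
                  sender_email: str) -> List[Tuple[str, str, str]]:
    """Return (message, person1, person2) with person1 != person2."""
    results = []
    for msg, (cls, _props) in nodes.items():
        if not is_a(cls, "Message"):
            continue
        senders = [d for (s, t, d) in edges
                   if s == msg and t == "hasSender"
                   and is_a(nodes[d][0], "Person")
                   and nodes[d][1].get("emailAddress") == sender_email]
        recipients = [d for (s, t, d) in edges
                      if s == msg and t == "hasRecipient"
                      and is_a(nodes[d][0], "Person")]
        results.extend((msg, p1, p2)
                       for p1 in senders for p2 in recipients if p1 != p2)
    return results


# Sample data (made up for the example).
nodes = {
    "person-john": ("Employee", {"emailAddress": "john@example.org"}),
    "person-mary": ("Student", {"emailAddress": "mary@example.org"}),
    "msg-1": ("Message", {}),
    "msg-2": ("Message", {}),
}
edges = [
    ("msg-1", "hasSender", "person-john"),
    ("msg-1", "hasRecipient", "person-mary"),
    ("msg-2", "hasSender", "person-john"),
    ("msg-2", "hasRecipient", "person-john"),  # sent to self: excluded
]
matches = find_messages(nodes, edges, "john@example.org")
```

Because the class check goes through `is_a`, a sender stored as an Employee and a recipient stored as a Student both satisfy the Person constraint, which is the point of the subtype remark on the slide.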
12. a.i. and data quality
13. data models & machine learning:
using the meaning of classes and properties, automatically generate predictive models.
predictive model features: birthday, zip code, …
14. data governance =
defining the meaning of data =
feature (pre)engineering
critical aspect of data science
15. Progression of Analytics:
16. Progression of Analytics:
where a.i. happens
17. Garbage In = Garbage Out = Bad A.I.
data governance required for Good A.I.
18. one more point on data governance…
think outside the box (data warehouse)
19. data governance: data in motion
inside data warehouse vs. outside data warehouse
22. data supply chain:
data product
23. data product:
Retail Recommendations…
Shipping/Logistics Optimization…
Compliance, Auditing, Security, Fraud Detection…
24. why data supply chain?
Partner DW Your DW
"No matter who you are, most of the smartest people work for someone else." (Bill Joy)
25. why data supply chain?
Partner DW Your DW
"No matter who you are, most of the smartest data works for someone else." (Bill Joy, revised)
26. data supply chain
Partner DW
Your DW
why not ETL?
28. Extract…
not quite as expected…
29. Transform…
a bit extreme…
30. Load…
a bit messy…
31. Clean…
a lot of manual effort…
32. … your imported data
33. Your DW
Partner DW
Why?
34. what goes wrong?
telephone game…
35. You
Partner
Model “A”
Model “B”
Implicit Model
36. Resolution:
Make explicit the implicit.
Align Data Models.
Reason:
Implicit assumptions in the data.
ETL can’t see the forest for the trees (or it’s very difficult with missing assumptions).
37. Example: Internet of Things
Predictive Analytics
“Nest for Office Buildings”
Office Tower with Building Management System (BMS) containing 100,000 monitored points (temperature, energy usage of chiller, fan speed, etc.) with significant missing data, errors, and noise.
Reconciliation of data to produce predictive models to minimize energy usage.
Rules for data correctness.
38. Sensor Data Validation:
Source data had temperature values of “0” (zero), which meant either that the temperature was 0 degrees or that the sensor had an error.
Data Model “knows” that it’s rarely 0 degrees in July (many standard deviations from the expected value), and that a 0 reading on a day in December can be compared to weather data for reasonableness.
If Data Model also knows the maintenance schedule for the sensors, then it “knows” when to expect 0 error values and can exclude them.
Missing Maintenance Assumptions.
Fill in secondary (weather) data for validation.
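The validation logic described on this slide might look roughly like the following Python sketch. The `Reading` type, the `classify_zero_reading` function, the status labels, and the 5-degree tolerance are all assumptions chosen for illustration; the original system's rules are not shown in the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Set


@dataclass
class Reading:
    sensor_id: str
    day: date
    temperature: float


def classify_zero_reading(reading: Reading,
                          maintenance_days: Dict[str, Set[date]],
                          weather: Dict[date, float],
                          tolerance: float = 5.0) -> str:
    """Decide whether a 0 value is a real temperature or a sensor error.

    tolerance is an assumed threshold (in degrees) for agreeing with
    the secondary weather data.
    """
    if reading.temperature != 0.0:
        return "ok"
    # Known maintenance window: a 0 here is an expected error value.
    if reading.day in maintenance_days.get(reading.sensor_id, set()):
        return "maintenance"
    expected = weather.get(reading.day)
    if expected is None:
        return "unverified"  # no secondary data to compare against
    # A 0 near the recorded weather is plausible (e.g. December).
    if abs(expected - reading.temperature) <= tolerance:
        return "plausible"
    return "suspect"         # e.g. a 0 in July


# Sample inputs (made up for the example).
maintenance_days = {"s1": {date(2015, 7, 10)}}
weather = {date(2015, 7, 11): 28.0, date(2015, 12, 20): 1.5}
```

Readings classified as "maintenance" or "suspect" would be excluded before model training, while "unverified" ones signal that secondary data still needs to be filled in.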
39. how did we get here?
Architecture Review:
a quick step back…
What is a Data Supply Chain architecture?
40. “traditional” data warehouse:
ETL within the organization.
Data Governance across the organization.
DW
41. tech co. “agile” data warehouse:
storage
compute
HDFS
Spark
DataSets
Jobs
Batch/Streaming
Build Predictive Models
Realtime: Spark/Storm
hadoop cluster
42. enterprise: data lake
storage
compute
HDFS
Spark
X(save $)
“Data Swamp”
43. aside: Data Lake
better analogy: Scriptorium (library, manuscript copying, & book distribution)
but not as pithy as “Lake”…
44. tech co. microservices (micro-SOA):
storage
compute
service
“Composed” App
external: social data, weather API
independent clusters, local data expertise
optimize development processes, scale up.
45. microservices example:
Amazon: product search uses 170 independent microservices, including services for predicting customer characteristics, getting product images, etc.
http://www.infoworld.com/article/2903144/application-development/how-to-succeed-with-microservices-architecture.html
Netflix uses a similar architecture.
46. Data Supply Chain:
storage
compute
service
Data Product
“ETL”
Owner “A” Owner “B”
optimize development processes, scale up.
independent clusters, local data, ownership
47. Interaction Points:
Data Product
service
compute
ETL
Owner “A” Owner “B”
48. Data Lineage: Cloudera Navigator
…within a Data Warehouse
trace back jobs that produced every data field.
49. Data Supply Chain with Provenance:
include provenance data directly in imported dataset.
use in rules to interpret the data.
entity-123 | hasSource | datasource-A
entity-123 | name | “John Doe”
Data Warehouse B
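As a small illustration of using provenance in rules, the Python sketch below stores the two statements from this slide as triples and applies a rule that only accepts entities whose every recorded source is trusted. The `TRUSTED_SOURCES` set, the function names, and the second entity are hypothetical; the slides do not specify which rules were used.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)

# The first two triples come from the slide; the third is made up.
triples: List[Triple] = [
    ("entity-123", "hasSource", "datasource-A"),
    ("entity-123", "name", "John Doe"),
    ("entity-456", "name", "Jane Roe"),  # imported without provenance
]

# Hypothetical governance decision about which sources are trusted.
TRUSTED_SOURCES: Set[str] = {"datasource-A"}


def sources_of(entity: str, data: List[Triple]) -> Set[str]:
    """Collect the provenance recorded alongside the entity itself."""
    return {v for (s, p, v) in data if s == entity and p == "hasSource"}


def accept_entity(entity: str, data: List[Triple]) -> bool:
    """Rule: accept only entities with provenance, all of it trusted."""
    sources = sources_of(entity, data)
    return bool(sources) and sources <= TRUSTED_SOURCES
```

Because the provenance travels inside the dataset, the receiving warehouse can apply such a rule without calling back to the system that produced the data.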
50. Interaction Points: Data Models
Data Product
service
compute
ETL
Data Models: Gatekeepers & Transform
Owner “A” Owner “B”
51. Data Supply Chain using Models:
storage
compute
service
Data Product
ETL
Owner “A” Owner “B”
Model Server
Data Models: focus of data governance
52. Semantic Data Models:
Make explicit the meaning of data
Transformation and Validation Rules leverage the Model and Meaning.
Such Rules may be packaged with the Model and managed together.
Protect against implicit assumptions
53. Example: Financial Services
A B C
Service Provider
Reconciliation of Corporate Structure across 1,000s of organizations.
Compliance Rules barring communication between “researchers” and “traders”.
Rules to infer if “Mary” is a “researcher” or “trader”.
Conflicting concepts of “Branch-Office”, “Direct-Report”, etc. across the Globe.
54. Example: Hospital Group
A B C
Data Analytics
Reconciliation across Patient Records, Insurance, & Billing for Patient Predictive Analytics.
Rules for identity: “same person”
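A "same person" identity rule could be sketched as below. The matching criteria (normalized name plus birth date), the record fields, and the sample records are assumptions chosen to keep the example short; real identity resolution across patient, insurance, and billing systems would need considerably more evidence than this.

```python
from typing import Dict


def normalize_name(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def same_person(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """Hypothetical identity rule: same normalized name and birth date."""
    return (normalize_name(a["name"]) == normalize_name(b["name"])
            and a["birth_date"] == b["birth_date"])


# Records for one individual as seen by three systems (made-up data).
patient = {"type": "Patient", "name": "John  Doe", "birth_date": "1980-06-15"}
insured = {"type": "InsuredPerson", "name": "john doe", "birth_date": "1980-06-15"}
billable = {"type": "BillableEntity", "name": "John Doe", "birth_date": "1979-01-02"}
```

Keeping the rule next to the model means each system can keep its own record type while agreeing on when two records refer to one individual.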
55. Data Models:
OWL: Semantic Ontology Model
(W3C Standard, Various Standards for Rules)
VitalSigns: Generate Code
validation, transformation, …
VitalSigns: Versioning, Dependencies, Exchange,
Storage, Change Management (Semantic “Diff”)
56. Example: Personally Identifiable Information
Data Governance determines that “Profession” and “ZipCode” cannot be used together.
(Maybe a single “Dentist” in a small town…)
Within a single Data Warehouse we can bar these data elements from being combined.
But:
Microservice A provides value of “Profession”
Microservice B provides value of “ZipCode”
How to enforce that these two microservices cannot be combined?
57. Example: Personally Identifiable Information
Validation code enforcing data usage:
Person person123 = get_person_details("entity-123")
// this call works:
person123.profession = get_profession(person123)
// this call is blocked by data model validation:
// person123 already has the "profession" property
person123.zipcode = get_zipcode(person123)
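One way to get the blocking behaviour shown above is to let the model object itself refuse forbidden property combinations. The Python sketch below is an assumption about how such a check could work, not the VitalSigns implementation; `DataUsageError`, `FORBIDDEN_COMBINATIONS`, and this `Person` class are invented for the example.

```python
from typing import Any, FrozenSet, List

# Hypothetical governance rule: these properties may not coexist on one object.
FORBIDDEN_COMBINATIONS: List[FrozenSet[str]] = [
    frozenset({"profession", "zipcode"}),
]


class DataUsageError(ValueError):
    """Raised when an assignment would violate a data usage rule."""


class Person:
    def __init__(self, uri: str) -> None:
        # Bypass the validating __setattr__ for internal bookkeeping.
        object.__setattr__(self, "uri", uri)
        object.__setattr__(self, "_props", {})

    def __setattr__(self, name: str, value: Any) -> None:
        present = set(self._props) | {name}
        for combo in FORBIDDEN_COMBINATIONS:
            if combo <= present:
                raise DataUsageError(
                    f"cannot combine {sorted(combo)} on {self.uri}")
        self._props[name] = value

    def __getattr__(self, name: str) -> Any:
        # Only called when normal attribute lookup fails.
        props = self.__dict__.get("_props", {})
        if name in props:
            return props[name]
        raise AttributeError(name)


person123 = Person("entity-123")
person123.profession = "Dentist"      # this assignment works
try:
    person123.zipcode = "10006"       # blocked: profession is already set
    blocked = False
except DataUsageError:
    blocked = True
```

Because the check lives in the model object rather than in either microservice, it applies no matter which service supplied each value.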
58. Semantic Data Models:
Gatekeepers
Externally Managed.
Active not Passive, more like “code”.
Defining what should exist, not cataloguing what exists.
Can decide when to be tolerant or strict.
59. Collaborative Conversations:
Infrastructure
DevOps
Data Scientists
Business +
Domain Experts
Developers
Semantic
Model
60. Collaborative Conversations:
Business +
Domain Experts
Semantic
Model
Business +
Domain Experts
Semantic
Model
Partner A Partner B
Model Alignment
What Concepts to combine, not what Tables to combine (that comes later).
61. Authoring Tool: OWL IDE Protégé
64. in conclusion, takeaways:
• The Data Supply Chain is a supply chain to deliver Data Products
• Data Models can capture the implicit meaning of data (and that is the goal!)
• Data Models can help negotiate the implicit differences across the DSC
• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners
65. Questions?
Marc Hadfield
CEO, Vital A.I.
marc@vital.ai