Tamr Field Engineer Timothy Danford, Ph.D., discusses how Data Variety -- the natural, siloed nature of data as it’s created -- is creating a bottleneck to biomedical data analytics. Rule-based, deterministic data unification approaches are “too brittle” to scale to the hundreds or thousands of different data formats, sources, and silos within the enterprise. Danford submits, instead, that Tamr’s bottom-up, probabilistic approach with “active learning” is proving successful at unifying heterogeneous data at scale.
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
1. COMBINING HUMAN & MACHINE INTELLIGENCE TO
SUCCESSFULLY INTEGRATE BIOMEDICAL DATA
TIMOTHY DANFORD | TAMR, INC.
2. THE DATA INTEGRATION PROBLEM
● flat files: every file has its own columns
● bioinformatics: every tool has its own file format
● graph data: RDF, OWL, “knowledge graphs”
● proprietary / legacy formats: SAS, DBF
● relational databases: inconsistent data models
Biomedical Data Integration is a Constantly Moving Target
3. THE DATA INTEGRATION PROBLEM
● One solution: hire or train data curators who understand the subject area
● Benefits: accuracy
● Problems
o Low bandwidth
o Difficult to scale to larger problems
o Recording decisions
o Consistency between curators
Data Curation Teams Do Not Scale
4. THE DATA INTEGRATION PROBLEM
● Build an automated or rules-based system to perform data integration
● Benefits: scale
● Problems
o Accuracy, edge-cases
o Programmers do not scale
o Out-of-band communication
o Expensive to maintain
o Brittle in the face of new data
Rule-based Integration Is Brittle
5. TAMR AUTOMATES DATA INTEGRATION
● Solution: combine learning rules with asking experts
● Modern machine learning techniques
o semi-supervised learning
o active learning
● Benefits
o speed of an automated system
o accuracy of human experts
o auditability
o responds well to changing requirements
Use Probabilistic Rules with Active Learning
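To make the “probabilistic rules with active learning” idea concrete, here is a minimal Python sketch of uncertainty sampling for attribute matching. It illustrates the general technique, not Tamr’s implementation; the featurization and the oracle (a human expert answering yes/no on a candidate pair) are assumed.

# Minimal uncertainty-sampling sketch (NOT Tamr's implementation).
# Assumes each candidate attribute pair is already featurized (e.g. by
# name similarity and value overlap), that oracle(i) asks an expert
# whether pair i is a true match, and that the seed contains both
# matches and non-matches.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(features, oracle, budget=50, seed_size=10):
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(features), size=seed_size, replace=False))
    labels = {i: oracle(i) for i in labeled}          # seed with expert answers
    for _ in range(budget):
        model = LogisticRegression().fit(features[labeled],
                                         [labels[i] for i in labeled])
        probs = model.predict_proba(features)[:, 1]   # P(match) for every pair
        unlabeled = [i for i in range(len(features)) if i not in labels]
        # Query the pair the model is least certain about (P closest to 0.5):
        query = min(unlabeled, key=lambda i: abs(probs[i] - 0.5))
        labels[query] = oracle(query)                 # route question to an expert
        labeled.append(query)
    return model                                      # trained matcher

Each round spends the expert’s limited attention on the single question the model finds hardest, which is what lets a small amount of human feedback steer a largely automated system.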
6. TAMR AUTOMATES DATA INTEGRATION
● Build a unified schema and link it to source attributes
● Engage subject matter experts to answer questions
● Automate data transformation
● Eliminate redundant records with de-duplication
Tamr Combines Machine Learning and Expert Feedback
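As a toy illustration of the de-duplication step (again a sketch, not Tamr’s algorithm): score candidate pairs of records by string similarity and surface high-scoring pairs for expert confirmation. The record values below are invented for the example.

# Toy de-duplication pass: flag likely duplicate records for review.
# Real systems use learned similarity models and blocking keys to avoid
# comparing all pairs; this brute-force version is illustrative only.
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name):
    return name.lower().replace("-", "").strip()

def find_duplicates(records, threshold=0.9):
    candidates = []
    for i, j in combinations(range(len(records)), 2):
        sim = SequenceMatcher(None, normalize(records[i]),
                              normalize(records[j])).ratio()
        if sim >= threshold:
            candidates.append((i, j, sim))   # pair for expert confirmation
    return candidates

records = ["Acetylsalicylic Acid", "acetyl-salicylic acid", "Ibuprofen"]
print(find_duplicates(records))              # [(0, 1, 1.0)]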
7. CASE STUDY: CLINICAL STUDY DATA
● Clinical study data integration is motivated by a single schema: CDISC
o mandated by FDA for data submission
o common schema for clinical data warehouses
● Mostly performed by SAS scripting today
● Tamr learns attribute mapping and transformations using human feedback
An Example: Clinical Study Data Integration
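Below is a sketch of what a learned attribute mapping might look like once materialized. The target names (USUBJID, VSORRES, VSTESTCD) are real CDISC SDTM variables, but the source column names and the mapping table itself are invented; in practice the mapping is what gets learned from expert feedback rather than hand-written.

# Illustrative attribute mapping onto CDISC SDTM-style variables.
# Source columns and the mapping table are invented for this example.
import pandas as pd

attribute_map = {             # source attribute -> unified attribute
    "subj_id": "USUBJID",
    "sys_bp":  "VSORRES",
}

source = pd.DataFrame({"subj_id": ["001", "002"],
                       "sys_bp":  [120, 135]})
unified = source.rename(columns=attribute_map)
unified["VSTESTCD"] = "SYSBP"   # transformation rule: tag the measurement type
print(unified)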
9. THE BIOMEDICAL DATA INTEGRATION PROBLEM
Fundamentally, many scientific analyses are tabular
● rows are ‘entities’
● columns are ‘attributes’
● graphs (paths) and hierarchies (part/whole) are other shapes
● tables emphasize independence of entities and attributes
Tabular Datasets are a Core Data Shape
10. THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Column-oriented: Find the matching attributes
● Row-oriented: Discover duplicate entities
Data Integration Proceeds In Two Directions
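A concrete, if simplistic, signal for the column-oriented direction: two attributes from different sources that share many distinct values are probably the same attribute. This is a sketch with hypothetical column names, not a production matcher.

# Column-oriented matching signal: Jaccard overlap of distinct values.
import pandas as pd

def value_overlap(a: pd.Series, b: pd.Series) -> float:
    """Jaccard similarity of the distinct values in two columns."""
    sa, sb = set(a.dropna()), set(b.dropna())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

lab_a = pd.DataFrame({"gene_symbol": ["TP53", "BRCA1", "EGFR"]})
lab_b = pd.DataFrame({"Gene": ["EGFR", "TP53", "KRAS"]})
print(value_overlap(lab_a["gene_symbol"], lab_b["Gene"]))  # 0.5 -> likely the same attribute

The row-oriented direction uses analogous similarity scores over records instead of columns, as in the de-duplication sketch above.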
11.
12. ● 80% of clinical data today goes unused
● Clinical Data Warehouses capture legacy data
● Improved analytics = better trials, less $$
Advanced Analytics, Better Clinical Trials
TAMR BUILDS LASTING VALUE
(diagram: SAS → Faster Regulatory Filings, Better Clinical Analytics, Data Mining for New Indications)
13. Dynamic, Integrated View of 15k Existing and New Sources: Biopharma
Challenges
• $2B in research and silos of experimental results
• 15,000 sources of experimental results
• Hundreds of decentralized labs
• 1M+ rows with >100k attribute names
• Non-standardized attribute names & measurement units
• Manual curation prohibitively time- & cost-intensive
Solution
• Integrate data to find similar experiments
• Scale data curation to incorporate all sources at reasonable cost
• Engage owners of data sources in improving quality of data
Result
• Replaced 10+ man-years of human curation effort with Tamr
• Engaged 600 scientists in data quality ownership
Tamr Output
15k sources integrated into one view
14. TACKLING THE ENTERPRISE DATA SILO PROBLEM
Next-gen challenges:
● Democratized visualization and modeling - radical consumption heterogeneity
● SemanticWeb/LinkedData - radical source heterogeneity
● Provenance for data to improve reliability
● Rapid iteration/change requires reproducibility from source
● Desire for longitudinal data across many entities
● Need for automated data quality / assurance
Traditional approaches... all are necessary but not sufficient to truly address the next-gen challenges:
● Standardization - worth trying
● Aggregation - yes - but actually makes the problem worse
● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data
Editor’s Notes
Key Messages:
Today I’ll be speaking about how data variety, the natural, siloed nature of data as it’s created, is creating a bottleneck to analytics, and how deterministic data unification approaches aren’t alone sufficient to scale to the variety of hundreds or thousands of data silos found within the enterprise.
What we won’t worry about today:
incremental updates,
data velocity
scale
graph data: rows are nodes, columns are nodes or edges.
genomics - rows: genes, variants, ‘features’, and columns: position
or: rows are people and columns are variants
or: rows are people and columns are phenotypes
or: rows are phenotypes and columns are variants (sort of a pivot version)
clinical study data: rows are people, or visits, or measurements, and columns are dates, observation codes, categories, names.
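The “pivot version” mentioned above can be made concrete with a toy long-format table (all data invented): the same observations reshape into a rows-are-people view or a rows-are-phenotypes view.

# Toy illustration of the pivoted views described above (invented data).
import pandas as pd

long = pd.DataFrame({"person":    ["p1", "p1", "p2"],
                     "variant":   ["rs1", "rs2", "rs1"],
                     "phenotype": ["height", "height", "bmi"]})
print(pd.crosstab(long["person"], long["variant"]))     # rows: people
print(pd.crosstab(long["phenotype"], long["variant"]))  # rows: phenotypes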
Sometimes the data just *is* in spreadsheets! (At a large Swiss pharmaceutical company, every screening experiment was captured in a separate spreadsheet. “Which experiments were even run?”)
A single insight that crosses data silos
Discovery that doesn’t “double count” evidence
Matching for causal inference
No single method can solve this problem!
We need an iterative approach, that automates integration but is guided and corrected by human feedback.
The goal was an integrated view: previous manual efforts could not be redone, so an automated system was needed to work with humans to create a catalogue.
Mapping to 80% accuracy
Opened up discussion across departments
This slide has animation. You need to click once.
Traditional approaches, while necessary, are not alone sufficient to truly address next-gen data challenges
Democratized visualization and modeling - radical consumption heterogeneity
New visualization and modeling tools have helped democratize analytics, changing the ways in which business users across the enterprise want to consume data. Today, more users require access to high-quality data for varying analytics projects. How do rule-based approaches scale with more users consuming data in different ways?
SemanticWeb/LinkedData - radical source heterogeneity
Extensions for structuring and understanding data on the web have introduced a radical new source of heterogeneous data, presenting challenges to traditional top down data-integration approaches. If we already struggle with scale of our own internal enterprise data, how do you leverage a source with the scale and variety of the web?
Provenance for data to improve reliability
To be able to reproduce results and ensure data quality, you need to be able to understand how the data has been used and transformed over time. Understanding the inputs, entities, systems, and processes that influence data of interest in an automated, programmatic way can improve reliability.
Rapid iteration/change requires reproducibility from source
Can you reproduce the same analysis and transformations from the source data, over time?
Desire for longitudinal data across many entities
For many organizations, it’s important to understand how the relationships between a given set of entities have changed over time. For instance, understanding the relationships between a part, supplier, and product can lead to buying the highest quality part at the cheapest price, from the most reliable manufacturer.