COMBINING HUMAN & MACHINE INTELLIGENCE TO
SUCCESSFULLY INTEGRATE BIOMEDICAL DATA
TIMOTHY DANFORD | TAMR, INC.
THE DATA INTEGRATION PROBLEM
● flat files: every file has its own columns
● bioinformatics: every tool has its own file format
● graph data: RDF, OWL, “knowledge graphs”
● proprietary / legacy formats: SAS, DBF
● relational databases: inconsistent data models
Biomedical Data Integration is a Constantly Moving Target
THE DATA INTEGRATION PROBLEM
● One solution: hire or train data curators who understand the subject area
● Benefits: accuracy
● Problems
o Low bandwidth
o Difficult to scale to larger problems
o Recording decisions
o Consistency between curators
Data Curation Teams Do Not Scale
THE DATA INTEGRATION PROBLEM
● Build an automated or rules-based system to perform data integration
● Benefits: scale
● Problems
o Accuracy, edge-cases
o Programmers do not scale
o Out-of-band communication
o Expensive to maintain
o Brittle in the face of new data
Rule-based Integration Is Brittle
TAMR AUTOMATES DATA INTEGRATION
● Solution: combine learning rules with asking experts
● Modern machine learning techniques
o semi-supervised learning
o active learning
● Benefits
o speed of an automated system
o accuracy of human experts
o auditability
o responds well to changing requirements
Use Probabilistic Rules with Active Learning
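The question-selection step of active learning can be sketched as uncertainty sampling: the expert is asked only about the candidate matches the model is least sure of. This is a minimal illustration, not Tamr's implementation; the attribute names, the scores, and the functions `pick_questions` and `apply_answers` are all hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source attribute, unified attribute); names are hypothetical


def pick_questions(scores: Dict[Pair, float], budget: int) -> List[Pair]:
    """Uncertainty sampling: choose the pairs whose match probability is closest to 0.5."""
    return sorted(scores, key=lambda pair: abs(scores[pair] - 0.5))[:budget]


def apply_answers(scores: Dict[Pair, float], answers: Dict[Pair, bool]) -> Dict[Pair, float]:
    """Record expert answers as certain labels; a real system would also retrain on them."""
    updated = dict(scores)
    for pair, is_match in answers.items():
        updated[pair] = 1.0 if is_match else 0.0
    return updated


# Made-up model scores for four candidate attribute matches.
scores = {
    ("pt_sex", "sex"): 0.97,
    ("dob", "birth_date"): 0.55,
    ("visit_dt", "age"): 0.04,
    ("gender_cd", "sex"): 0.48,
}

# Ask the expert about the two most uncertain pairs only.
questions = pick_questions(scores, budget=2)
scores = apply_answers(scores, {questions[0]: True, questions[1]: True})
```

The confident scores (0.97 and 0.04) are never shown to the expert, which is where the speed benefit comes from; the recorded answers give the audit trail.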
TAMR AUTOMATES DATA INTEGRATION
● Build a unified schema and link it to source attributes
● Engage subject matter experts to answer questions
● Automate data transformation
● Eliminate redundant records with de-duplication
Tamr Combines Machine Learning and Expert Feedback
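Linking source attributes to a unified schema can be sketched as a similarity score with a threshold: confident links are accepted automatically and the rest become questions for a subject matter expert. This sketch assumes token-overlap (Jaccard) similarity on attribute names, which stands in for whatever learned model a real system would use; the schema, the source names, and the 0.6 threshold are illustrative assumptions.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple


def tokens(name: str) -> Set[str]:
    """Split an attribute name into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token overlap between two names, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_mapping(
    source_attrs: Iterable[str],
    unified_attrs: List[str],
    auto_accept: float = 0.6,  # assumed threshold, not taken from the slides
) -> Dict[str, Tuple[str, float, str]]:
    """Map each source attribute to its best unified attribute and a status."""
    mapping = {}
    for src in source_attrs:
        best = max(unified_attrs, key=lambda u: jaccard(tokens(src), tokens(u)))
        score = jaccard(tokens(src), tokens(best))
        status = "auto" if score >= auto_accept else "ask_expert"
        mapping[src] = (best, score, status)
    return mapping


unified = ["patient_id", "birth_date", "body_weight_kg"]  # hypothetical unified schema
mapping = suggest_mapping(["Patient ID", "date of birth", "weight"], unified)
for src, (target, score, status) in mapping.items():
    print(f"{src!r} -> {target!r} ({score:.2f}, {status})")
```

Here `weight` shares only one of three tokens with `body_weight_kg`, so it falls below the threshold and is routed to an expert instead of being linked silently.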
CASE STUDY: CLINICAL STUDY DATA
● Clinical study data integration is motivated by a single schema: CDISC
o mandated by FDA for data submission
o common schema for clinical data warehouses
● Mostly performed by SAS scripting today
● Tamr learns attribute mapping and transformations using human feedback
An Example: Clinical Study Data Integration
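An attribute mapping plus per-attribute transformations might look like the sketch below once it has been learned and confirmed. The target names `USUBJID`, `SEX`, and `BRTHDTC` are CDISC SDTM demographics variables and the date is written in ISO 8601 form; everything else (the source column names, the MM/DD/YYYY source format, the lookup tables, and the direct copy of a subject number into `USUBJID`) is an assumption made for illustration.

```python
from datetime import datetime
from typing import Callable, Dict

# Hypothetical confirmed mapping from source columns to target variables.
ATTRIBUTE_MAP: Dict[str, str] = {
    "subj": "USUBJID",
    "gender": "SEX",
    "birth_date": "BRTHDTC",
}

SEX_CODES = {"female": "F", "f": "F", "male": "M", "m": "M"}


def to_iso_date(text: str, source_format: str = "%m/%d/%Y") -> str:
    """Convert a date string to ISO 8601; the source format is an assumption."""
    return datetime.strptime(text, source_format).date().isoformat()


# Per-target value transformations; targets without an entry are copied as text.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "SEX": lambda value: SEX_CODES.get(value.strip().lower(), "U"),
    "BRTHDTC": to_iso_date,
}


def transform(record: Dict[str, str]) -> Dict[str, str]:
    """Rename mapped attributes and normalize their values; skip unmapped ones."""
    out = {}
    for source_name, value in record.items():
        target = ATTRIBUTE_MAP.get(source_name)
        if target is None:
            continue  # unmapped attributes would be queued for expert review
        out[target] = TRANSFORMS.get(target, str)(value)
    return out


row = transform(
    {"subj": "001", "gender": "Female", "birth_date": "03/15/1980", "notes": "n/a"}
)
print(row)
```

Keeping the mapping and the transformations as data rather than as script logic is what lets expert feedback change them without rewriting code.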
Thank You
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Fundamentally, many scientific analyses are tabular
o rows are ‘entities’
o columns are ‘attributes’
● graphs (paths) and hierarchies (part/whole) are other shapes
● tables emphasize independence of entities and attributes
Tabular Datasets are a Core Data Shape
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Column-oriented: Find the matching attributes
● Row-oriented: Discover duplicate entities
Data Integration Proceeds In Two Directions
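The row-oriented direction can be sketched as pairwise comparison of normalized values, flagging pairs above a similarity threshold as likely duplicates. This example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the records, the `compound` field, and the 0.85 threshold are hypothetical, and comparing every pair would not scale to large tables without some form of blocking.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def normalize(value: str) -> str:
    """Lowercase, drop periods and commas, and collapse whitespace."""
    return " ".join(value.lower().replace(".", " ").replace(",", " ").split())


def similarity(a: str, b: str) -> float:
    """String similarity between 0.0 and 1.0 on the normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(
    records: List[Dict[str, str]], field: str, threshold: float = 0.85
) -> List[Tuple[int, int]]:
    """Return index pairs of records whose `field` values look like the same entity."""
    pairs = []
    for (i, left), (j, right) in combinations(enumerate(records), 2):
        if similarity(left[field], right[field]) >= threshold:
            pairs.append((i, j))
    return pairs


records = [
    {"compound": "Acetylsalicylic Acid"},
    {"compound": "acetylsalicylic acid."},
    {"compound": "Ibuprofen"},
]
duplicates = find_duplicates(records, "compound")
print(duplicates)
```

The column-oriented direction has the same shape, with attribute names or value profiles compared instead of record values.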
● 80% of clinical data today goes unused
● Clinical Data Warehouses capture legacy data
● Improved analytics = better trials, less $$
Advanced Analytics, Better Clinical Trials
TAMR BUILDS LASTING VALUE
SAS
Faster Regulatory Filings
Better Clinical Analytics
Data Mining for New Indications
Dynamic, Integrated View of 15k Existing and New Sources: Biopharma
Result
• Replaced 10+ man-years of human curation effort with Tamr
• Engaged 600 scientists in data quality ownership
Challenges
• $2B in research and silos of experimental results
• 15,000 sources of experimental results
• Hundreds of decentralized labs
• 1M+ rows with >100k attribute names
• Non-standardized attribute names & measurement units
• Manual curation prohibitively time & cost intensive
Solution
• Integrate data to find similar experiments
• Scale data curation to incorporate all sources at reasonable cost
• Engage owners of data sources in improving quality of data
15k sources integrated into one view
Tamr Output
TACKLING THE ENTERPRISE DATA SILO PROBLEM
All are necessary but not sufficient to truly address next-gen challenges
● Democratized visualization and modeling - radical consumption heterogeneity
● Semantic Web / Linked Data - radical source heterogeneity
● Provenance for data to improve reliability
● Rapid iteration/change requires reproducibility from source
● Desire for longitudinal data across many entities
● Need for automated data quality / assurance
Traditional approaches...
● Standardization - worth trying
● Aggregation - yes - but actually makes the problem worse
● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data

Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data

  • 1. COMBINING HUMAN & MACHINE INTELLIGENCE TO SUCCESSFULLY INTEGRATE BIOMEDICAL DATA TIMOTHY DANFORD | TAMR, INC.
  • 2. THE DATA INTEGRATION PROBLEM ● flat files: every file has its own columns ● bioinformatics: every tool has its own file format ● graph data: RDF, OWL, “knowledge graphs” ● proprietary / legacy formats: SAS, DBF ● relational databases: inconsistent data models Biomedical Data Integration is a Constantly Moving Target
  • 3. THE DATA INTEGRATION PROBLEM ● One solution: hire or train data curators who understand the subject area ● Benefits: accuracy ● Problems o Low bandwidth o Difficult to scale to larger problems o Recording decisions o Consistency between curators Data Curation Teams Do Not Scale
  • 4. THE DATA INTEGRATION PROBLEM ● Build an automated or rules-based system to perform data integration ● Benefits: scale ● Problems o Accuracy, edge-cases o Programmers do not scale o Out-of-band communication o Expensive to maintain o Brittle in the face of new data Rule-based Integration Is Brittle
  • 5. TAMR AUTOMATES DATA INTEGRATION ● Solution: combine learning rules with asking experts ● Modern machine learning techniques o semi-supervised learning o active learning ● Benefits o speed of an automated system o accuracy of human experts o auditability o responds well to changing requirements Use Probabilistic Rules with Active Learning
  • 6. TAMR AUTOMATES DATA INTEGRATION ● Build a unified schema and link it to source attributes ● Engage subject matter experts to answer questions ● Automate data transformation ● Eliminate redundant records with de- duplication Tamr Combines Machine Learning and Expert Feedback
  • 7. CASE STUDY: CLINICAL STUDY DATA ● Clinical study data integration is motivated by a single schema: CDISC o mandated by FDA for data submission o common schema for clinical data warehouses ● Mostly performed by SAS scripting today ● Tamr learns attribute mapping and transformations using human feedback An Example: Clinical Study Data Integration
  • 9. THE BIOMEDICAL DATA INTEGRATION PROBLEM Fundamentally, many scientific analyses are tabular rows are ‘entities’ columns are ‘attributes’ graphs (paths) and hierarchies (part/whole) are other shapes tables emphasize independence of entities and attributes Tabular Datasets are a Core Data Shape
  • 10. THE BIOMEDICAL DATA INTEGRATION PROBLEM ● Column-oriented: Find the matching attributes ● Row-oriented: Discover duplicate entities Data Integration Proceeds In Two Directions
  • 11.
  • 12. ● 80% of clinical data today goes unused ● Clinical Data Warehouses capture legacy data ● Improved analytics = better trials, less $$ Advanced Analytics, Better Clinical Trials TAMR BUILDS LASTING VALUE SAS Faster Regulatory Filings Better Clinical Analytics Data Mining for New Indications
  • 13. Dynamic, Integrated View of 15k Existing and New Sources: Biopharma Result • Replaced 10+ man years of human curation effort with Tamr • Engage 600 Scientists in data quality ownership Challenges • $2B in research and silos of experimental results • 15,000 sources of experimental results • Hundreds of decentralized labs • 1M+ rows with >100k attribute names • Non-standardized attribute names & measurement units • Manual curation prohibitively time & cost intensive Solution • Integrate data to find similar experiments • Scaling data curation to incorporate all sources at reasonable cost • Engage owners of data sources in improving quality of data 15k sources integrated into one view Tamr Output
  • 14. TACKLING THE ENTERPRISE DATA SILO PROBLEM Traditional approaches... ● Standardization - worth trying ● Aggregation - yes, but actually makes the problem worse ● Top-down modeling (MDM/ETL) - OK for app-specific or well-defined data. All are necessary but not sufficient to truly address next-gen challenges: ● Democratized visualization and modeling - radical consumption heterogeneity ● SemanticWeb/LinkedData - radical source heterogeneity ● Provenance for data to improve reliability ● Rapid iteration/change requires reproducibility from source ● Desire for longitudinal data across many entities ● Need for automated data quality / assurance
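
Slide 10 splits integration into a column-oriented direction (find the matching attributes) and a row-oriented direction (discover duplicate entities). The following is a minimal Python sketch of both, not a description of how Tamr implements them. The column names, the sample rows, and the choice of string similarity from the standard library `difflib` module are all assumptions made for illustration; a production system would learn these matches from expert feedback rather than rely on name similarity alone.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a value and drop every non-alphanumeric character."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_attributes(source_cols, unified_cols):
    """Column-oriented step: pick the most similar unified attribute per source column."""
    matches = {}
    for col in source_cols:
        scored = [
            (SequenceMatcher(None, normalize(col), normalize(target)).ratio(), target)
            for target in unified_cols
        ]
        score, best = max(scored)
        matches[col] = (best, round(score, 2))
    return matches


def find_duplicates(rows, key_fields):
    """Row-oriented step: group row indexes whose normalized key fields agree."""
    groups = {}
    for index, row in enumerate(rows):
        key = tuple(normalize(str(row[field])) for field in key_fields)
        groups.setdefault(key, []).append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


# Hypothetical source columns and unified schema.
source_cols = ["Subject_ID", "visit date", "SEX"]
unified_cols = ["subject_id", "visit_date", "sex", "age"]
matches = match_attributes(source_cols, unified_cols)

# Hypothetical rows: the first two describe the same subject and visit.
rows = [
    {"subject_id": "S-001", "visit_date": "2014-01-05"},
    {"subject_id": "s001", "visit_date": "2014-01-05"},
    {"subject_id": "S-002", "visit_date": "2014-01-05"},
]
duplicates = find_duplicates(rows, ["subject_id", "visit_date"])
```

Exact matching on normalized keys only catches formatting differences; typos or conflicting values would need a similarity score per row pair, which is where the expert questions described on slide 6 come in.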

Editor's notes

  1. Key Messages: Today I’ll be speaking about how data variety, the natural, siloed nature of data as it’s created, is creating a bottleneck to analytics, and how deterministic data unification approaches alone are not sufficient to scale to the variety of hundreds or thousands of data silos found within the enterprise.
  2. What we won’t worry about today: incremental updates, data velocity scale
  3. What we won’t worry about today: incremental updates, data velocity scale
  4. What we won’t worry about today: incremental updates, data velocity scale
  5. What we won’t worry about today: incremental updates, data velocity scale
  6. Graph data: rows are nodes, columns are nodes or edges. Genomics: rows are genes, variants, or ‘features’ and columns are positions; or rows are people and columns are variants; or rows are people and columns are phenotypes; or rows are phenotypes and columns are variants (sort of a pivot version). Clinical study data: rows are people, or visits, or measurements, and columns are dates, observation codes, categories, names. Sometimes the data just *is* in spreadsheets! (At a large Swiss pharmaceutical company, every screening experiment was captured in a separate spreadsheet. “Which experiments were even run?”) A single insight that crosses data silos. Discovery that doesn’t “double count” evidence. Matching for causal inference.
  7. No single method can solve this problem! We need an iterative approach that automates integration but is guided and corrected by human feedback.
  8. They were looking to get an integrated view. Previously this was done with manual effort and could not be redone, so they needed an automated system that works with humans to create a catalogue. Mapping to 80% accuracy. Opened discussion up across departments.
  9. This slide has animation. You need to click once. Traditional approaches, while necessary, are not sufficient on their own to truly address next-gen data challenges. Democratized visualization and modeling (radical consumption heterogeneity): New visualization and modeling tools have helped democratize analytics, changing the ways in which business users across the enterprise want to consume data. Today, more users require access to high-quality data for varying analytics projects. How do rule-based approaches scale with more users consuming data in different ways? SemanticWeb/LinkedData (radical source heterogeneity): Extensions for structuring and understanding data on the web have introduced a radical new source of heterogeneous data, presenting challenges to traditional top-down data integration approaches. If we already struggle with the scale of our own internal enterprise data, how do we leverage a source with the scale and variety of the web? Provenance for data to improve reliability: To be able to reproduce results and ensure data quality, you need to be able to understand how the data has been used and transformed over time. Understanding the inputs, entities, systems, and processes that influence data of interest in an automated, programmatic way can improve reliability. Rapid iteration/change requires reproducibility from source: Can you reproduce the same analysis and transformations from the source data, over time? Desire for longitudinal data across many entities: For many organizations, it’s important to understand how the relationships between a given set of entities have changed over time. For instance, understanding the relationships between a part, supplier, and product can lead to buying the highest quality part at the cheapest price, from the most reliable manufacturer.
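
Note 7 calls for an iterative approach that automates integration but is guided and corrected by human feedback. One common way to organize such a loop, sketched below in Python under stated assumptions, is to accept or reject candidate mappings the model is confident about and queue the uncertain ones for an expert, most uncertain first. The thresholds, the scores, and the attribute names are hypothetical, and this is not presented as Tamr's actual algorithm.

```python
def route_candidates(candidates, accept_at=0.9, reject_at=0.1):
    """Split scored candidate mappings into accepted, rejected, and expert queues."""
    accepted, rejected, ask_expert = [], [], []
    for pair, score in candidates:
        if score >= accept_at:
            accepted.append(pair)
        elif score <= reject_at:
            rejected.append(pair)
        else:
            ask_expert.append((pair, score))
    # Ask about the most uncertain pairs first (score closest to 0.5).
    ask_expert.sort(key=lambda item: abs(item[1] - 0.5))
    return accepted, rejected, [pair for pair, _ in ask_expert]


def apply_feedback(accepted, rejected, answers):
    """Fold expert yes/no answers back into the accepted and rejected lists."""
    for pair, is_match in answers.items():
        (accepted if is_match else rejected).append(pair)
    return accepted, rejected


# Hypothetical (source attribute, unified attribute) pairs with model scores.
candidates = [
    (("SUBJID", "subject_id"), 0.97),
    (("VISITDT", "visit_date"), 0.55),
    (("SEXCD", "age"), 0.03),
    (("DOB", "age"), 0.30),
]
accepted, rejected, questions = route_candidates(candidates)

# An expert confirms the first (most uncertain) question.
accepted, rejected = apply_feedback(accepted, rejected, {questions[0]: True})
```

In a full loop the expert answers would also be used to retrain the scoring model, so that fewer pairs fall into the uncertain band on the next pass.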