APP
NGS Applications
J. S. Freitas¹, M. P. Caraciolo¹, V. M. Diniz¹, R. B. de Alexandre¹, J. B. Oliveira¹
¹ Genomika Diagnósticos
API-Centric Data Integration for Human Genomics Reference
Databases: Achievements, Lessons Learned and Challenges
MOTIVATION
Data integration is a major challenge in clinical genetics, where multiple
heterogeneous databases span several domains and are published in inconsistent
formats without clear, common standards. In variant analysis for molecular
diagnostics, one central task is to connect biological information to clinical
data so that specialists can determine the potential impact of a variant
associated with a disease [1, 2].
This task requires the flexible assembly of tailored, continuously curated data
sets, so that biologists and geneticists do not waste time searching several
databases individually online and then parsing, cleaning and integrating the
data in complex spreadsheets.
We are building a platform that leverages Linked Data to provide integrated
access to bioinformatics databases such as OMIM and ClinVar through a common,
well-defined interface.
Our assumption is that exposing those datasets via Application Programming
Interfaces (APIs) facilitates data access from several sources into a big data
infrastructure, which in turn provides integrated access to information
covering biological data, carrier testing, variant analysis and literature
mining.
bioinfo@genomika.com.br | genomika.com.br
Rua Senador José Henrique, 224, Alfred Nobel, Sala 1301 | Recife, PE | Brazil
OUR COLLABORATION
DATA INFRASTRUCTURE
Lessons Learned
REFERENCES
[1] Anguita, A., et al. "A review of methods and tools for database integration in biomedicine." Current Bioinformatics 5 (2010): 253-269.
[2] Peterson, Thomas A., Emily Doughty, and Maricel G. Kann. "Towards precision medicine: advances in computational approaches for the analysis of human variants." Journal of Molecular Biology 425.21 (2013): 4047-4063.
[3] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
[4] Apache Spark. "Lightning-fast cluster computing." (2015): 345-353.
[5] Stockinger, Heinz, et al. "Experience using web services for biological sequence analysis." Briefings in Bioinformatics 9.6 (2008): 493-505.
[Figure panel titles: DISTRIBUTED AGGREGATION; NEW SOURCE CONSUMPTION]
The number of databases keeps growing, and so does the variability of their
schemata. To tackle this, we designed a global schema that uses
meta-modeling concepts to abstract the data fields and values.
Facets must be aggregated by the same key, which calls for novel approaches.
Good solutions: NoSQL databases (Cassandra) and large-scale data
processing engines built on MapReduce concepts (Spark) [3, 4].
Loading several databases and their versions requires a
replication/distribution policy for the database engine. Some
data engines achieve good results here by partitioning data
with a distributed strategy.
RESTful APIs work well for exposing data. They support several formats (XML,
JSON), and frameworks are available that work out of the box [5].
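The global schema and the aggregation of facets by key can be illustrated with a small, plain-Python sketch. It is not the platform's implementation: the function names `to_facets` and `aggregate_by_key` and the two toy sources are assumptions made for illustration, and the row, field and facet labels simply reuse those shown on the poster.

```python
from collections import defaultdict


def to_facets(row_id, record):
    """Flatten one source record into (row_id, (data_field, facet_value)) pairs."""
    return [(row_id, (field, value)) for field, value in record.items()]


def aggregate_by_key(pairs):
    """Group facet pairs by row key, as a reduce-by-key step would."""
    grouped = defaultdict(list)
    for row_id, facet in pairs:
        grouped[row_id].append(facet)
    return dict(grouped)


# Two hypothetical sources describing the same rows with different fields.
source_a = {"rowA": {"DFA": "FV1"}, "rowB": {"DFA": "FV2"}}
source_b = {
    "rowA": {"DFB": "FV3", "DFC": "FV4"},
    "rowB": {"DFB": "FV5", "DFC": "FV6"},
}

pairs = []
for source in (source_a, source_b):
    for row_id, record in source.items():
        pairs.extend(to_facets(row_id, record))

merged = aggregate_by_key(pairs)
print(merged["rowA"])
```

Because every source is reduced to the same pair shape, adding a source does not change the aggregation step. On a cluster, the same grouping would presumably be expressed with a key-based operation of the processing engine rather than an in-memory dictionary.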
Challenges
The underlying datasets can change their
schema, so developing fixes for the source data
consumption code is intellectually complex.
Building new versions is limited:
the whole process requires bandwidth and
substantial computing power, so how can we
cope with the number of fetching jobs running
simultaneously?
How should we deal with semantic mappings between
datasets or depositories? What should the
single integrated vocabulary be in order to
identify possible relationships?
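One possible way to approach the question about simultaneous fetching jobs is to cap how many run at once. The sketch below is only an illustration under that assumption, not something the poster describes: `run_fetch_jobs`, `make_job` and the `max_parallel` parameter are hypothetical names, and the jobs are stand-ins that perform no download.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fetch_jobs(jobs, max_parallel=2):
    """Run zero-argument fetch callables, at most max_parallel at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(job) for job in jobs]
        # Collect results in submission order, not completion order.
        return [future.result() for future in futures]


def make_job(name):
    # Stand-in for a real download; returns the source name it "fetched".
    return lambda: f"fetched {name}"


results = run_fetch_jobs([make_job(n) for n in ("OMIM", "ClinVar", "dbSNP")])
print(results)
```

A fixed worker count bounds bandwidth and CPU use at the cost of longer total build time; whether that trade-off suits a given deployment remains an open question.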
[Figure: Distributed aggregation. Labels: sample, genomic position, Sequencing Machine, Annotator. Each source emits (rowID, (DataField, facetValue)) pairs, for example (rowA, (DFA, FV1)) and (rowB, (DFA, FV2)). The pairs are grouped by row key into one list per row:
(rowA, [(DFA, FV1), (DFB, FV3), (DFC, FV4), (DFD, FV7), (DFE, FV8), (DFF, FV9)])
(rowB, [(DFA, FV2), (DFB, FV5), (DFC, FV6), (DFD, FV10), (DFE, FV11), (DFF, FV12)])]
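[Figure: Variants observed (150,000,000) versus variants we understand, 2003 / 2007 / 2015. Labels: ClinGen Tool, Patient Data.]
[Figure: Genotype Annotator drawing on ClinVar, dbSNP, UniProt, OMIM, NCBI Gene, 1,000 Genomes, ..., Depository N.]
DATA EXPOSURE
[Figure: Each depository (ClinVar, OMIM, ...) is stored per version (1.0.0, 2.0.0, ...), and each version holds datasets (Genes, Phenotypes, ..., Dataset N).]

Source table:
omim_id | Gene Symbol | ... | Datafield N
100650  | ALDH        | ... | Facet #1
104760  | APP         | ... | Facet #n

Global schema (one row per data field):
rowID | DataField   | ... | DataFacet
1     | Gene_Symbol | ... | ALDH
1     | OMIM_ID     | ... | 100650
2     | Gene_Symbol | ... | APP
2     | OMIM_ID     | ... | 104760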
curl "https://$GENDB_API_KEY@api.gendb.com/v1/datasets/OMIM/3.5.0/Genes/data" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      ["gene_symbol", "BRCA1"]
    ]
  }'

Response:
{
  "dataset": "OMIM/3.5.0/Genes",
  "dataset_id": 65,
  "genome_build": "GRCh37",
  "limit": 100,
  "total": 111425,
  "took": 5,
  "results": [ "..." ]
}
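The same query can be prepared from Python with the standard library. This is a hedged sketch rather than an official client: `build_query` and `API_ROOT` are hypothetical names, the fallback key `demo-key` is a placeholder, and the Basic-auth header is an assumption based on how curl treats the `$GENDB_API_KEY@host` form (key as user name, empty password). The request is built but not sent.

```python
import base64
import json
import os
import urllib.request

API_ROOT = "https://api.gendb.com/v1"  # host taken from the curl example


def build_query(dataset_path, filters, api_key):
    """Build (without sending) a POST request mirroring the curl example."""
    url = f"{API_ROOT}/datasets/{dataset_path}/data"
    body = json.dumps({"filters": filters}).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    # Assumption: the key is sent as the Basic-auth user name, empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


request = build_query(
    "OMIM/3.5.0/Genes",
    [["gene_symbol", "BRCA1"]],
    os.environ.get("GENDB_API_KEY", "demo-key"),  # placeholder fallback key
)
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and decoding the JSON body, whose fields (`dataset`, `limit`, `total`, `results`, and so on) are shown in the response above.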
As the number of human variant resources
used in variant analysis increases, and the
number of reported variants grows faster every
year, there is only initial work on
understanding all this information and on how
we can extract and link those variant sources.
[Figure: New source consumption. The GENDB API fetches data from sequencer data and from reference sources: OMIM, 1000 Genomes, Entrez Gene, dbSNP, dbNSFP, COSMIC, ClinVar and other sources.]

Source
Abstract class for any data sources that we'll import and process. Each of the
subclasses will fetch() the data, scrub() it as necessary, then parse() it into
a database.
  + name
  + output_dir
  - fetch(is_dl_forced=False)
  - parse()
  - prepare_new_dataset(name, version)
  - update_new_version(version_name)
  - check_if_remote_newer(remote, local)
  - get_files(is_dl_forced)
  - fetch_from_url(remote_file, local_file)
  - fetch_from_db(query, conn, limit, is_dl_forced)
  - fetch_from_source(...)

OMIM (extends Source)
  + name: OMIM
  + output_dir: "./raw/omim"
  - _get_omim_ids()
  - _process_all()
  - _process_morbidmap()
  - _process_phenotypicseries()

Source N (extends Source)
  + name: Source N
  + output_dir: "output/dir"
  - local functions()
  - inherited_functions()
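The Source hierarchy in the class diagram could look roughly like the following Python sketch. Only the class names, `name`, `output_dir`, `fetch`, `parse` and `check_if_remote_newer` come from the poster; the method bodies, the in-memory sample lines and the comparison rule for versions are assumptions made so the example runs without network access.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Source(ABC):
    """Base class for a data source that is fetched and then parsed."""

    def __init__(self, name, output_dir):
        self.name = name
        self.output_dir = Path(output_dir)

    @abstractmethod
    def fetch(self, is_dl_forced=False):
        """Download raw files into output_dir."""

    @abstractmethod
    def parse(self):
        """Turn the raw files into records ready for loading."""

    def check_if_remote_newer(self, remote, local):
        # Assumption: versions are comparable values such as timestamps.
        return local is None or remote > local


class OMIM(Source):
    def __init__(self):
        super().__init__("OMIM", "./raw/omim")
        self._raw = []

    def fetch(self, is_dl_forced=False):
        # Placeholder: a real subclass would download the OMIM files here.
        self._raw = ["100650\tALDH", "104760\tAPP"]

    def parse(self):
        columns = ("omim_id", "gene_symbol")
        return [dict(zip(columns, line.split("\t"))) for line in self._raw]


omim = OMIM()
omim.fetch()
records = omim.parse()
print(records)
```

Keeping download and parsing behind the same two methods means a new source only has to implement `fetch()` and `parse()`, which is the point the diagram makes with "Source N".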

