SlideShare ist ein Scribd-Unternehmen logo
1 von 22
VCF and RDF
              Jeremy J Carroll
               April 3rd, 2013
Prepared for W3C Clinical Genomics Task Force
                  www.Syapse.com
                  jjc@Syapse.com



              Copyright 2013 Syapse Inc.
Contents
• Background
   – Business Background
   – Goal
   – VCF and RDF
• This specific work (exploration)
   – Stats
   – VCF example
   – Lots of example transforms: VCF and RDF
• Discussion: might the examples meet the
  business goals?
Background
• Syapse Discovery provides an end-to-end
  solution for […] laboratories deploying next-
  generation sequencing-based diagnostics, […]
  configurable semantic data structure enables
  users to bring omics data together with
  traditional medical information […]
• This work: initial exploration of VCF
  import, what useful information is in a VCF
  etc.
Success = Useful Advanced Query
        Over Genomic Information




Patient         Report           Tissue
Disease:        Gene:            Biopsy Site:
Breast Cancer   V-ERB-B2 AVIAN   Lobe of Liver
VCF, a Web Based Knowledge Format
• Variant Call Format
• Used for exchange from one system to
  another
• Used for exchange from one lab to another
• Used for exchange between universities and
  the wider community
RDF, a Web Based Knowledge Format
• Resource Description Framework
• Used for exchange from one system to
  another
• Used for exchange from one lab to another
• Used for exchange between universities and
  the wider community
VCF vs RDF
• VCF: efficient representation of huge
  quantities of data

• RDF: flexible representation with excellent
  interoperability and query
Materialized or Virtual?
• Materialized = Transform data:
  Extract Transform Load into RDF

• Virtual = Transform query:
  Runtime mapping

• Or some combination: transform VCF into
  some internal format, transform SPARQL
  queries into a query over internal format
This Work
• Initial mapping from VCF to RDF
• Materialized for simplicity

• Key questions:
  – Is this useful?
     • Would query answer interesting questions?
  – Scale?
     • Can we materialize? Should we go virtual?
Known limitations
• Lots not addressed ….
• Too many literals not enough URIs
• Slow

• Hoped to be fit for purpose, i.e. answering
  questions, not production code. (scale and
  utility, see previous slide)
Some Stats
• Working with Chromosome 7 from 1000G
    ftp://ftp-
    trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz



• 704,243 Variants, 1,092 samples
     – 689,034,445 ref calls, 79,998,911 variant calls
•   H/W: new mac book pro 2.3 GHz Intel Core i7 8 GB
•   Gunzip: 1m37s
•   PyVCF: 2h41m
•   VCF2TTL: 37h57m
•   serdi TTL 2 Ntriples: 13h57m
•   triples: 14,780,932,912 (1.2 trillion bytes)
VCF Overview
                                                       Metadata
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36                                                                      T-Box:
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
                                                                                                         Properties, Classe
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
                                                                                                         s, Domains, Rang
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">                                       es
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS           ID        REF        ALT      QUAL       FILTER INFO                                FORMAT        NA00001          NA00002
20         14370     rs6054257 G          A        29         PASS   NS=3;DP=14;AF=0.5;DB;H2             GT:GQ:DP:HQ   0|0:48:1:51,51   1|0:48:8:51,51
20         17330     .         T          A        3.0        q10    NS=3;DP=11;AF=0.017                 GT:GQ:DP:HQ   0|0:49:3:58,50   0|1:3:5:65,3
20         1110696 rs6040355 A            G,T      1e+03      PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ   1|2:21:6:23,27   2|1:2:0:18,2
20         1230237 .           T          .        47         PASS   NS=3;DP=13;AA=T                     GT:GQ:DP:HQ   0|0:54:7:56,60   0|0:48:4:51,51
20         1234567 microsat1 GTCT         G,GTACT .           PASS   NS=3;DP=9;AA=G                      GT:GQ:DP      ./.:35:4         0/2:17:2



                                                        Per
     Per Variant                                    Alternative                                                   Per
     Information                                       Allele                                                     Sample, Variant
                                                   Information                                                    Call
Key Classes
• Variant A position in the reference
  genome, where mutation may happen, and
  related information.
• Allele A possible sequence of bases at some
  variant, and related information.
• Sample A related set of Variant Calls
• Variant Call A selection of two Alleles (diploid)
  at a Variant (can be phased or unphased)
A Variant




#CHROM   POS     ID        REF   ALT   QUAL
20       1110696 rs6040355 A     G,T   1e+03
INFO about a Variant

                                Same node as on previous
                                slide, selecting different triples for
                                display.

                                DB property should probably be a
                                class (dbSNP membership); AF (allele
                                frequency) is not a property of the
                                variant but the allele.




#CHROM   POS     INFO
20       1110696 NS=2;DP=10;AF=0.333,0.667;AA=T;DB
INFO about Alleles


                                               Same node as on
                                               previous slide, selecting
                                               different triples for
                                               display.




#CHROM   POS     REF ALT  INFO
20       1110696 A G,TAF=0.333,0.667;AA=T;DB
Detail about a Variant Call


                                               Same variant node as on
                                               previous slide, HQ is
                                               mismodeled.




#CHROM    POS     REF ALT   FORMAT         NA00001
20        1110696 A G,T GT:GQ:DP:HQ   1|2:21:6:23,27
A Phased Diploid Genotype Call: rdf:Seq
                                         The | indicates
                                         phased, consistent
                                         ordering between VCF
                                         rows. rdf:Seq indicates
                                         order is significant.
                                         Heterozygous

                                         VCF concept of phase set
                                         not addressed.




#CHROM   POS     REF ALT  FORMAT         NA00001
20       1110696 A G,TGT:GQ:DP:HQ   1|2:21:6:23,27
Unphased Diploid Genotype Call: rdf:Bag
                                       The / indicates unphased
                                       genotype. rdf:Bag
                                       indicates order is not
                                       significant. rdf:Bag also
                                       used for homozygous
                                       genotypes. The 0/2
                                       indicates the use of the
                                       reference and the 2nd
                                       alternative.

                                       This is a different variant




#CHROM POS     REF ALT      FORMAT     NA00002
20     1234567 GTCT G,GTACT GT:GQ:DP   0/2:17:2
Genotype Likelihoods 1000G




#CHROM POS REF ALT FORMAT NA11829
7      16161 A G GT:DS:GL 0|1:1.300:-0.36,-0.43,-0.71
Notes: _1 and _2 hide one another, id_2397 is variant of id_23830
Discussion Points
• How much of this is actually useful?
  – Maybe just artifacts of the history of the call?
  – Maybe just joins that we could/should recompute on
    the fly
  – Maybe much of the data is fundamentally
    uninteresting

• Do we know the queries? If so can we represent
  in a more compact way and optimize for those
  queries?
Further observations
• Metadata is a mess, not actually reusable by
  machine, e.g. why file date but not sample date …
  provokes the desire to show rationale for file, but
  not the performance
• Extensibility of Tbox allows English text to
  introduce new syntax (e.g. enumerations, per alt
  allele and reference). This is not a good idea.
• The per genotype call info is only for unphased
  data, yet the call can be phased

Weitere ähnliche Inhalte

Andere mochten auch

(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...Amazon Web Services
 
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision MedicineElectronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision MedicineKent State University
 
Potential partner pitch
Potential partner pitchPotential partner pitch
Potential partner pitchJan Bendtsen
 
Partnership Proposal
Partnership ProposalPartnership Proposal
Partnership ProposalArvind Jha
 
Partnership presentation
Partnership presentationPartnership presentation
Partnership presentationemail2neev
 
Dwolla Startup Pitch Deck
Dwolla Startup Pitch DeckDwolla Startup Pitch Deck
Dwolla Startup Pitch DeckJoseph Hsieh
 
Linkedin Series B Pitch Deck
Linkedin Series B Pitch DeckLinkedin Series B Pitch Deck
Linkedin Series B Pitch DeckJoseph Hsieh
 
Mixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65MMixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65MSuhail Doshi
 
The slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsThe slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsBuffer
 

Andere mochten auch (10)

(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
(HLS305) Transforming Cancer Treatment: Integrating Data to Deliver on the Pr...
 
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision MedicineElectronic Medical Records: From Clinical Decision Support to Precision Medicine
Electronic Medical Records: From Clinical Decision Support to Precision Medicine
 
Precision Medicine World Conference 2017
Precision Medicine World Conference 2017Precision Medicine World Conference 2017
Precision Medicine World Conference 2017
 
Potential partner pitch
Potential partner pitchPotential partner pitch
Potential partner pitch
 
Partnership Proposal
Partnership ProposalPartnership Proposal
Partnership Proposal
 
Partnership presentation
Partnership presentationPartnership presentation
Partnership presentation
 
Dwolla Startup Pitch Deck
Dwolla Startup Pitch DeckDwolla Startup Pitch Deck
Dwolla Startup Pitch Deck
 
Linkedin Series B Pitch Deck
Linkedin Series B Pitch DeckLinkedin Series B Pitch Deck
Linkedin Series B Pitch Deck
 
Mixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65MMixpanel - Our pitch deck that we used to raise $65M
Mixpanel - Our pitch deck that we used to raise $65M
 
The slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollarsThe slide deck we used to raise half a million dollars
The slide deck we used to raise half a million dollars
 

Ähnlich wie VCF and RDF

Compilers Are Databases
Compilers Are DatabasesCompilers Are Databases
Compilers Are DatabasesMartin Odersky
 
The Little Register Allocator
The Little Register AllocatorThe Little Register Allocator
The Little Register AllocatorIan Wang
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...Scality
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotLi Shen
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSHAMNAHAMNA8
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsDataWorks Summit
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIstvan Rath
 
Oracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questionsOracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questionsArunkumar Gurav
 
Deep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression MatchingDeep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression MatchingEditor IJCATR
 
Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12Tim Bunce
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
 
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual ModelsXtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual ModelsSubhabrata Mukherjee
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra SigmodJeff Hammerbacher
 

Ähnlich wie VCF and RDF (20)

Compilers Are Databases
Compilers Are DatabasesCompilers Are Databases
Compilers Are Databases
 
The Little Register Allocator
The Little Register AllocatorThe Little Register Allocator
The Little Register Allocator
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...QuadIron An open source library for number theoretic transform-based erasure ...
QuadIron An open source library for number theoretic transform-based erasure ...
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
8086 instruction set
8086 instruction set8086 instruction set
8086 instruction set
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
 
Oracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questionsOracle plsql and d2 k interview questions
Oracle plsql and d2 k interview questions
 
Compiler unit 5
Compiler  unit 5Compiler  unit 5
Compiler unit 5
 
Deep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression MatchingDeep Packet Inspection with Regular Expression Matching
Deep Packet Inspection with Regular Expression Matching
 
Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual ModelsXtremeDistil: Multi-stage Distillation for Massive Multilingual Models
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
 
Sept2016 sv nist_intro
Sept2016 sv nist_introSept2016 sv nist_intro
Sept2016 sv nist_intro
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 

Kürzlich hochgeladen

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

VCF and RDF

  • 1. VCF and RDF Jeremy J Carroll April 3rd, 2013 Prepared for W3C Clinical Genomics Task Force www.Syapse.com jjc@Syapse.com Copyright 2013 Syapse Inc.
  • 2. Contents • Background – Business Background – Goal – VCF and RDF • This specific work (exploration) – Stats – VCF example – Lots of example transforms: VCF and RDF • Discussion: might the examples meet the business goals?
  • 3. Background • Syapse Discovery provides an end-to-end solution for […] laboratories deploying next- generation sequencing-based diagnostics, […] configurable semantic data structure enables users to bring omics data together with traditional medical information […] • This work: initial exploration of VCF import, what useful information is in a VCF etc.
  • 4. Success = Useful Advanced Query Over Genomic Information Patient Report Tissue Disease: Gene: Biopsy Site: Breast Cancer V-ERB-B2 AVIAN Lobe of Liver
  • 5. VCF, a Web Based Knowledge Format • Variant Call Format • Used for exchange from one system to another • Used for exchange from one lab to another • Used for exchange between universities and the wider community
  • 6. RDF, a Web Based Knowledge Format • Resource Description Framework • Used for exchange from one system to another • Used for exchange from one lab to another • Used for exchange between universities and the wider community
  • 7. VCF vs RDF • VCF: efficient representation of huge quantities of data • RDF: flexible representation with excellent interoperability and query
  • 8. Materialized or Virtual? • Materialized = Transform data: Extract Transform Load into RDF • Virtual = Transform query: Runtime mapping • Or some combination: transform VCF into some internal format, transform SPARQL queries into a query over internal format
  • 9. This Work • Initial mapping from VCF to RDF • Materialized for simplicity • Key questions: – Is this useful? • Would query answer interesting questions? – Scale? • Can we materialize? Should we go virtual?
  • 10. Known limitations • Lots not addressed …. • Too many literals not enough URIs • Slow • Hoped to be fit for purpose, i.e. answering questions, not production code. (scale and utility, see previous slide)
  • 11. Some Stats • Working with Chromosome 7 from 1000G ftp://ftp- trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz • 704,243 Variants, 1,092 samples – 689,034,445 ref calls, 79,998,911 variant calls • H/W: new mac book pro 2.3 GHz Intel Core i7 8 GB • Gunzip: 1m37s • PyVCF: 2h41m • VCF2TTL: 37h57m • serdi TTL 2 Ntriples: 13h57m • triples: 14,780,932,912 (1.2 trillion bytes)
  • 12. VCF Overview Metadata ##fileformat=VCFv4.0 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=1000GenomesPilot-NCBI36 T-Box: ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> Properties, Classe ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> s, Domains, Rang ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> es ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 20 17330 . T A 3.0 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 20 1110696 rs6040355 A G,T 1e+03 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 20 1234567 microsat1 GTCT G,GTACT . PASS NS=3;DP=9;AA=G GT:GQ:DP ./.:35:4 0/2:17:2 Per Per Variant Alternative Per Information Allele Sample, Variant Information Call
  • 13. Key Classes • Variant A position in the reference genome, where mutation may happen, and related information. • Allele A possible sequence of bases at some variant, and related information. • Sample A related set of Variant Calls • Variant Call A selection of two Alleles (diploid) at a Variant (can be phased or unphased)
  • 14. A Variant #CHROM POS ID REF ALT QUAL 20 1110696 rs6040355 A G,T 1e+03
  • 15. INFO about a Variant Same node as on previous slide, selecting different triples for display. DB property should probably be a class (dbSNP membership); AF (allele frequency) is not a property of the variant but the allele. #CHROM POS INFO 20 1110696 NS=2;DP=10;AF=0.333,0.667;AA=T;DB
  • 16. INFO about Alleles Same node as on previous slide, selecting different triples for display. #CHROM POS REF ALT INFO 20 1110696 A G,TAF=0.333,0.667;AA=T;DB
  • 17. Detail about a Variant Call Same variant node as on previous slide, HQ is mismodeled. #CHROM POS REF ALT FORMAT NA00001 20 1110696 A G,T GT:GQ:DP:HQ 1|2:21:6:23,27
  • 18. A Phased Diploid Genotype Call: rdf:Seq The | indicates phased, consistent ordering between VCF rows. rdf:Seq indicates order is significant. Heterozygous VCF concept of phase set not addressed. #CHROM POS REF ALT FORMAT NA00001 20 1110696 A G,TGT:GQ:DP:HQ 1|2:21:6:23,27
  • 19. Unphased Diploid Genotype Call: rdf:Bag The / indicates unphased genotype. rdf:Bag indicates order is not significant. rdf:Bag also used for homozygous genotypes. The 0/2 indicates the use of the reference and the 2nd alternative. This is a different variant #CHROM POS REF ALT FORMAT NA00002 20 1234567 GTCT G,GTACT GT:GQ:DP 0/2:17:2
  • 20. Genotype Likelihoods 1000G #CHROM POS REF ALT FORMAT NA11829 7 16161 A G GT:DS:GL 0|1:1.300:-0.36,-0.43,-0.71 Notes: _1 and _2 hide one another, id_2397 is variant of id_23830
  • 21. Discussion Points • How much of this is actually useful? – Maybe just artifacts of the history of the call? – Maybe just joins that we could/should recompute on the fly – Maybe much of the data is fundamentally uninteresting • Do we know the queries? If so can we represent in a more compact way and optimize for those queries?
  • 22. Further observations • Metadata is a mess, not actually reusable by machine, e.g. why file date but not sample date … provokes the desire to show rationale for file, but not the performance • Extensibility of Tbox allows English text to introduce new syntax (e.g. enumerations, per alt allele and reference). This is not a good idea. • The per genotype call info is only for unphased data, yet the call can be phased

Hinweis der Redaktion

  1. VCF2TTLreal 2278m26.935suser 2269m23.479ssys 8m2.282stime serdi -i turtle -o ntriples /tmp/pipe | time wc 14780932912 59123731783 1,211334,403999 136699.18 real 4809.46 user 628.36 sysreal 2278m19.206suser 807m12.920ssys 29m30.467s