SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
MMTF-Spark: Interactive, Scalable, and
Reproducible Datamining of
3D Macromolecular Structures
NIH/NCI U01CA198942
Peter W. Rose
Director, Structural Bioinformatics Laboratory
San Diego Supercomputer Center
UC San Diego
PDB – A Billion Atom Archive
> 1 billion atoms in the asymmetric units
~140,000
Structures
in May 2018
Growing Structure Size and Complexity
Largest asymmetric structure in PDB Largest symmetric structure in PDB
HIV-1 capsid: PDB ID 3J3Q
~2.4M unique atoms
Faustovirus major capsid: PDB ID 5J7V
~40M overall atoms
à Scalability Issues
•  Interactive visualization
•  slow network transfer
•  slow parsing
•  slow rendering
•  Mobile visualization
•  limited bandwidth
•  limited memory
•  Large-scale structural analysis
•  slow repeated I/O
•  slow repeated parsing
Compressive Structural Bioinformatics
Efficiently store, transmit, and visualize 3D structures of biological macromolecules
Perform large-scale structural calculations such as geometric queries or structural
comparisons over the entire PDB archive held in memory
Macromolecular 3D Structure
Biological macromolecules are polymers constructed
by linking monomers by covalent bonds
Biological macromolecules: proteins, nucleic acids
MMTF
•  MacroMolecular Transmission Format (mmtf.rcsb.org)
•  Compact
•  fast network transfer
•  fast I/O
•  Fast to parse
•  binary, no string parsing
•  Contains information for structural analysis and visualization
•  covalent bonds and bond orders
•  consistently calculated secondary structure
Lossless vs. Lossy Compression
File Size Comparison
(PDB archive ~127,000 entries)
All file sizes are for the gzip compressed files
File Parsing Speed
(PDB archive ~127,000 entries)
Comparison made with BioJava and mmtf-java API (single core)
Download + Parsing time
MMTF vs. mmCIF
Bethesda,	MD	
		85		MMTF	
2418		mmCIF	
Switzerland	
1589 		MMTF	
4431		mmCIF	
Russia	
			557		MMTF	
failed		mmCIF	
Japan	
			79		MMTF	
	2838		mmCIF	
San	Diego,	CA	
	36		MMTF	
840		mmCIF	
Time (seconds) to download* 100 large PDB structures from UCSD
and parse with JavaScript decoder in Chrome browser
*Note: download times are highly variable and not representative
Demo
Who Uses MMTF?
MMTF-Spark
mmtf-spark (Java)
•  High performance processing
•  Suitable for large-scale calculations
•  Integration with other libraries, e.g.,
BioJava
mmtf-pyspark (Python)
•  Interactive scripting
•  2D and 3D visualization
•  ML/DL tool ecosystem
•  Sharable data analysis in Jupyter
Notebooks
Example of a Spark Workflow
MMTF-PySpark in Jupyter Notebook
Interactive and reproducible dataming of the Protein Data Bank
Plans for a Scalable Infrastructure
Expand structural
coverage
•  PDB
•  PDB-redo
•  Swiss-Model
•  Rosetta-Models
•  …
Host in cloud-based infrastructure
Integrate with
other biological
resources
•  Drug info
•  Proteomic
•  Genomic
•  …
Summary
•  MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org)
•  Compressed, binary, efficient representation of 3D structures
•  Lossless representation (~4x compression over gzip)
•  Lossy, reduced representation (~37x compression over gzip)
•  Compressive Structural Bioinformatics
•  Algorithms, application, and workflows using MMTF
•  10 to 100+ fold speedup
Structure Visualization Large Scale PDB Mining
Web-based molecular graphics for large complexes (2016)
Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
Resources•  MMTF Website
•  http://mmtf.rcsb.org
•  MMTF Specification
•  https://github.com/rcsb/mmtf
•  MMTF Libraries
•  https://github.com/rcsb/mmtf-java
•  https://github.com/rcsb/mmtf-javascript
•  https://github.com/rcsb/mmtf-python
•  https://github.com/rcsb/mmtf-c
•  https://github.com/rcsb/mmtf-cpp
•  MMTF-Spark/PySpark
•  https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017
•  https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018
•  Publications
•  Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and
analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575.
https://doi.org/10.1371/journal.pcbi.1005575
•  Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular
structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846
•  Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the
21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA,
185-186. https://doi.org/10.1145/2945292.2945324
Funding
This workshop was supported by the National Cancer Institute
of the National Institutes of Health under Award Number
U01CA198942. The content is solely the responsibility of the
authors and does not necessarily represent the official views of
the National Institutes of Health.
PDBx/mmCIF
Flexible, extensible, and verbose format
with rich metadata, well suited for archival
purposes (mmcif.wwpdb.org)
repetitive information
redundant annotations
inefficient representation
MMTF Compression Pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive
indexing
extract structural
data
calculate bonds,
SSE
Binary, extensible container format of MMTF
It's like JSON.
but fast and small.
Columnar Encoding of Arrays
Custom Encoding Examples
Run-length encoding (e.g., occupancy)
Delta and run-length encoding (e.g., serial numbers)
Integer, delta, and recursive index encoding
(e.g., xyz-coordinates as 16-bit integers)
Delta encoding reduced the dynamic range of numbers which makes them more compressible
Data In MMTF File
MMTF contains a subset of data used by
visualization and structural analysis programs
Dictionary Encoding
Unique groups (residues) are stored in a “dictionary”
Unique entities are stored in a “dictionary”
(e.g., polymer type, polymer sequences)
Community Engagement
•  Open source specification
•  Open source libraries

Weitere ähnliche Inhalte

Was ist angesagt?

Biological databases
Biological databasesBiological databases
Biological databasesAfra Fathima
 
Bioinformatics biological databases
Bioinformatics biological databasesBioinformatics biological databases
Bioinformatics biological databasesSangeeta Das
 
Biological databases
Biological databasesBiological databases
Biological databasesAshfaq Ahmad
 
Publicly available tools and open resources in Bioinformatics
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in BioinformaticsArindam Ghosh
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahuKAUSHAL SAHU
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)ZoufishanY
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303Bruno Mmassy
 
Protein information resource (PIR)
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)ShivaniShewale2
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text Paul Groth
 
Data retreival system
Data retreival systemData retreival system
Data retreival systemShikha Thakur
 

Was ist angesagt? (20)

Composite and Specialized databases
Composite and Specialized databasesComposite and Specialized databases
Composite and Specialized databases
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Bioinformatics biological databases
Bioinformatics biological databasesBioinformatics biological databases
Bioinformatics biological databases
 
Databases
DatabasesDatabases
Databases
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Data retrieval tools
Data retrieval toolsData retrieval tools
Data retrieval tools
 
Publicly available tools and open resources in Bioinformatics
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in Bioinformatics
 
Datasets with bioschemas
Datasets with bioschemasDatasets with bioschemas
Datasets with bioschemas
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
 
Introduction to Biological databases
Introduction to Biological databasesIntroduction to Biological databases
Introduction to Biological databases
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303
 
Protein Databases
Protein DatabasesProtein Databases
Protein Databases
 
Protein information resource (PIR)
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)
 
Data Retrieval Systems
Data Retrieval SystemsData Retrieval Systems
Data Retrieval Systems
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
Data retreival system
Data retreival systemData retreival system
Data retreival system
 
Entrez databases
Entrez databasesEntrez databases
Entrez databases
 

Ähnlich wie MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Peter Rose
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...CSCJournals
 
The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects Carole Goble
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018Alexander Pico
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...Jason Riedy
 
Machine Learning for Molecules
Machine Learning for MoleculesMachine Learning for Molecules
Machine Learning for MoleculesIchigaku Takigawa
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBioinformaticsCentre
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
Streaming HYpothesis REasoning
Streaming HYpothesis REasoningStreaming HYpothesis REasoning
Streaming HYpothesis REasoningWilliam Smith
 
Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Oladokun Sulaiman
 
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM ProcessingHow to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processinginside-BigData.com
 
Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020Bilingual Publishing Group
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaGezim Sejdiu
 
VIZBI 2013 - Overview
VIZBI 2013 - OverviewVIZBI 2013 - Overview
VIZBI 2013 - OverviewDerek Wright
 

Ähnlich wie MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures (20)

Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 
CADD meeting 08-30-2016
CADD meeting 08-30-2016CADD meeting 08-30-2016
CADD meeting 08-30-2016
 
D1803012022
D1803012022D1803012022
D1803012022
 
NRNB EAC Report 2011
NRNB EAC Report 2011NRNB EAC Report 2011
NRNB EAC Report 2011
 
The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
 
Machine Learning for Molecules
Machine Learning for MoleculesMachine Learning for Molecules
Machine Learning for Molecules
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdf
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Streaming HYpothesis REasoning
Streaming HYpothesis REasoningStreaming HYpothesis REasoning
Streaming HYpothesis REasoning
 
Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010
 
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM ProcessingHow to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
 
Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
 
VIZBI 2013 - Overview
VIZBI 2013 - OverviewVIZBI 2013 - Overview
VIZBI 2013 - Overview
 

Kürzlich hochgeladen

Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Silpa
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxANSARKHAN96
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Silpa
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....muralinath2
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Silpa
 
Genome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptxGenome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptxSilpa
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Silpa
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxSilpa
 

Kürzlich hochgeladen (20)

Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
Genome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptxGenome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptx
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 

MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

  • 1. MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures NIH/NCI U01CA198942 Peter W. Rose Director, Structural Bioinformatics Laboratory San Diego Supercomputer Center UC San Diego
  • 2. PDB – A Billion Atom Archive > 1 billion atoms in the asymmetric units ~140,000 Structures in May 2018
  • 3. Growing Structure Size and Complexity Largest asymmetric structure in PDB Largest symmetric structure in PDB HIV-1 capsid: PDB ID 3J3Q ~2.4M unique atoms Faustovirus major capsid: PDB ID 5J7V ~40M overall atoms
  • 4. à Scalability Issues •  Interactive visualization •  slow network transfer •  slow parsing •  slow rendering •  Mobile visualization •  limited bandwidth •  limited memory •  Large-scale structural analysis •  slow repeated I/O •  slow repeated parsing
  • 5. Compressive Structural Bioinformatics Efficiently store, transmit, and visualize 3D structures of biological macromolecules Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  • 6. Macromolecular 3D Structure Biological macromolecules are polymers constructed by linking monomers by covalent bonds Biological macromolecules: proteins, nucleic acids
  • 7. MMTF •  MacroMolecular Transmission Format (mmtf.rcsb.org) •  Compact •  fast network transfer •  fast I/O •  Fast to parse •  binary, no string parsing •  Contains information for structural analysis and visualization •  covalent bonds and bond orders •  consistently calculated secondary structure
  • 8. Lossless vs. Lossy Compression
  • 9. File Size Comparison (PDB archive ~127,000 entries) All file sizes are for the gzip compressed files
  • 10. File Parsing Speed (PDB archive ~127,000 entries) Comparison made with BioJava and mmtf-java API (single core)
  • 11. Download + Parsing time MMTF vs. mmCIF Bethesda, MD 85 MMTF 2418 mmCIF Switzerland 1589  MMTF 4431 mmCIF Russia 557 MMTF failed mmCIF Japan 79 MMTF 2838 mmCIF San Diego, CA 36 MMTF 840 mmCIF Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser *Note: download times are highly variable and not representative Demo
  • 13. MMTF-Spark mmtf-spark (Java) •  High performance processing •  Suitable for large-scale calculations •  Integration with other libraries, e.g., BioJava mmtf-pyspark (Python) •  Interactive scripting •  2D and 3D visualization •  ML/DL tool ecosystem •  Sharable data analysis in Jupyter Notebooks
  • 14. Example of a Spark Workflow
  • 15. MMTF-PySpark in Jupyter Notebook Interactive and reproducible dataming of the Protein Data Bank
  • 16. Plans for a Scalable Infrastructure Expand structural coverage •  PDB •  PDB-redo •  Swiss-Model •  Rosetta-Models •  … Host in cloud-based infrastructure Integrate with other biological resources •  Drug info •  Proteomic •  Genomic •  …
  • 17. Summary •  MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org) •  Compressed, binary, efficient representation of 3D structures •  Lossless representation (~4x compression over gzip) •  Lossy, reduced representation (~37x compression over gzip) •  Compressive Structural Bioinformatics •  Algorithms, application, and workflows using MMTF •  10 to 100+ fold speedup Structure Visualization Large Scale PDB Mining Web-based molecular graphics for large complexes (2016) Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
  • 18. Resources•  MMTF Website •  http://mmtf.rcsb.org •  MMTF Specification •  https://github.com/rcsb/mmtf •  MMTF Libraries •  https://github.com/rcsb/mmtf-java •  https://github.com/rcsb/mmtf-javascript •  https://github.com/rcsb/mmtf-python •  https://github.com/rcsb/mmtf-c •  https://github.com/rcsb/mmtf-cpp •  MMTF-Spark/PySpark •  https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017 •  https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018 •  Publications •  Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575. https://doi.org/10.1371/journal.pcbi.1005575 •  Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846 •  Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA, 185-186. https://doi.org/10.1145/2945292.2945324
  • 19. Funding This workshop was supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
  • 20. PDBx/mmCIF Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org) repetitive information redundant annotations inefficient representation
  • 21. MMTF Compression Pipeline integer encoding dictionary encoding run-length encoding delta encoding GZIP recursive indexing extract structural data calculate bonds, SSE Binary, extensible container format of MMTF It's like JSON. but fast and small.
  • 23. Custom Encoding Examples Run-length encoding (e.g., occupancy) Delta and run-length encoding (e.g., serial numbers) Integer, delta, and recursive index encoding (e.g., xyz-coordinates as 16-bit integers) Delta encoding reduced the dynamic range of numbers which makes them more compressible
  • 24. Data In MMTF File MMTF contains a subset of data used by visualization and structural analysis programs
  • 25. Dictionary Encoding Unique groups (residues) are stored in a “dictionary” Unique entities are stored in a “dictionary” (e.g., polymer type, polymer sequences)
  • 26. Community Engagement •  Open source specification •  Open source libraries