Presented at the NIH/NCI Informatics Technology for Cancer Research (ITCR) 2018 meeting (https://itcr.cancer.gov/).
Advances in Structural Bioinformatics are driven by the fast growth in experimental 3D structures and integration with even larger sets of sequence and protein function data. At the same time, the field of Data Science has created new technologies for re-engineering legacy software pipelines to make them scalable, easy to use, reproducible, reusable, and sharable.
Here, we describe the MMTF-Spark/PySpark [1] project that combines three key components to create such an infrastructure: 1. Interactive Jupyter notebooks to run ad hoc analyses, data mining, machine learning, and visualization of 3D structure and sequence datasets, 2. A scalable compute infrastructure to run these analyses interactively across large datasets, e.g., the entire PDB, using previously developed efficient data representations [2, 3] and the Apache Spark framework for distributed parallel computing, 3. A library of methods for data mining and analysis of 3D structure and sequence data, capitalizing on the rich data analytics, visualization, and machine/deep learning tools available in the Python ecosystem.
Scientists face a number of complex and time consuming barriers when applying structural bioinformatics analysis, including complex software setups, non-interoperable data formats and software applications, lack of documentation, simple examples, and tutorials. Given the large datasets, biologists routinely apply computational tools and automation pipelines in their research. However, there is a long tail of ad-hoc, one-off, questions that biologists ask that cannot be answered using available web resources or workflow systems that focus on common tasks. In this project, we provide a self-contained programming environment that caters to scientists with varying computational skills and needs, ranging from biologists with basic programming skills, to structural and computational biologists who want to share their work, to data scientists who seek access to bioinformatics datasets to benchmark new machine learning methods. A key advantage of this environment is interactivity, which enables iterative exploration. By combining documentation, data sets, analysis code, results, and interactive visualizations in Jupyter notebooks, the steps of an interactive session can be captured, reproduced, and shared.
Acknowledgements
This project was supported by the NCI of the NIH under award number U01 CA198942.
References
1. https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-pyspark
2. Bradley AR, et al. (2017) MMTF - an efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575.
3. Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846.
MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures
1. MMTF-Spark: Interactive, Scalable, and
Reproducible Datamining of
3D Macromolecular Structures
NIH/NCI U01CA198942
Peter W. Rose
Director, Structural Bioinformatics Laboratory
San Diego Supercomputer Center
UC San Diego
2. PDB – A Billion Atom Archive
> 1 billion atoms in the asymmetric units
~140,000
Structures
in May 2018
3. Growing Structure Size and Complexity
Largest asymmetric structure in PDB Largest symmetric structure in PDB
HIV-1 capsid: PDB ID 3J3Q
~2.4M unique atoms
Faustovirus major capsid: PDB ID 5J7V
~40M overall atoms
5. Compressive Structural Bioinformatics
Efficiently store, transmit, and visualize 3D structures of biological macromolecules
Perform large-scale structural calculations such as geometric queries or structural
comparisons over the entire PDB archive held in memory
6. Macromolecular 3D Structure
Biological macromolecules are polymers constructed
by linking monomers by covalent bonds
Biological macromolecules: proteins, nucleic acids
7. MMTF
• MacroMolecular Transmission Format (mmtf.rcsb.org)
• Compact
• fast network transfer
• fast I/O
• Fast to parse
• binary, no string parsing
• Contains information for structural analysis and visualization
• covalent bonds and bond orders
• consistently calculated secondary structure
10. File Parsing Speed
(PDB archive ~127,000 entries)
Comparison made with BioJava and mmtf-java API (single core)
11. Download + Parsing time
MMTF vs. mmCIF
Bethesda, MD
85 MMTF
2418 mmCIF
Switzerland
1589 MMTF
4431 mmCIF
Russia
557 MMTF
failed mmCIF
Japan
79 MMTF
2838 mmCIF
San Diego, CA
36 MMTF
840 mmCIF
Time (seconds) to download* 100 large PDB structures from UCSD
and parse with JavaScript decoder in Chrome browser
*Note: download times are highly variable and not representative
Demo
15. MMTF-PySpark in Jupyter Notebook
Interactive and reproducible dataming of the Protein Data Bank
16. Plans for a Scalable Infrastructure
Expand structural
coverage
• PDB
• PDB-redo
• Swiss-Model
• Rosetta-Models
• …
Host in cloud-based infrastructure
Integrate with
other biological
resources
• Drug info
• Proteomic
• Genomic
• …
17. Summary
• MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org)
• Compressed, binary, efficient representation of 3D structures
• Lossless representation (~4x compression over gzip)
• Lossy, reduced representation (~37x compression over gzip)
• Compressive Structural Bioinformatics
• Algorithms, application, and workflows using MMTF
• 10 to 100+ fold speedup
Structure Visualization Large Scale PDB Mining
Web-based molecular graphics for large complexes (2016)
Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
18. Resources• MMTF Website
• http://mmtf.rcsb.org
• MMTF Specification
• https://github.com/rcsb/mmtf
• MMTF Libraries
• https://github.com/rcsb/mmtf-java
• https://github.com/rcsb/mmtf-javascript
• https://github.com/rcsb/mmtf-python
• https://github.com/rcsb/mmtf-c
• https://github.com/rcsb/mmtf-cpp
• MMTF-Spark/PySpark
• https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017
• https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018
• Publications
• Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and
analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575.
https://doi.org/10.1371/journal.pcbi.1005575
• Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular
structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846
• Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the
21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA,
185-186. https://doi.org/10.1145/2945292.2945324
19. Funding
This workshop was supported by the National Cancer Institute
of the National Institutes of Health under Award Number
U01CA198942. The content is solely the responsibility of the
authors and does not necessarily represent the official views of
the National Institutes of Health.
20. PDBx/mmCIF
Flexible, extensible, and verbose format
with rich metadata, well suited for archival
purposes (mmcif.wwpdb.org)
repetitive information
redundant annotations
inefficient representation
21. MMTF Compression Pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive
indexing
extract structural
data
calculate bonds,
SSE
Binary, extensible container format of MMTF
It's like JSON.
but fast and small.
23. Custom Encoding Examples
Run-length encoding (e.g., occupancy)
Delta and run-length encoding (e.g., serial numbers)
Integer, delta, and recursive index encoding
(e.g., xyz-coordinates as 16-bit integers)
Delta encoding reduced the dynamic range of numbers which makes them more compressible
24. Data In MMTF File
MMTF contains a subset of data used by
visualization and structural analysis programs
25. Dictionary Encoding
Unique groups (residues) are stored in a “dictionary”
Unique entities are stored in a “dictionary”
(e.g., polymer type, polymer sequences)