Presentation on InterMine and its adoption by the AIP and MTGD projects, given at the Informatics Research WIPS meeting on 03 November 2014 at the J. Craig Venter Institute, Rockville, MD.
Presented by Vivek Krishnakumar
Quick Intro to InterMine within AIP and MTGD - JCVI Research Works-in-Progress Meeting
1. InterMine
Integrated Data Warehouse
Use Cases: Arabidopsis & Medicago Genome Projects
Vivek Krishnakumar
Plant Genomics Group (EUK)
IFX Research WIPS Meeting, 03 October 2014
2. Overview
• Introduction
• InterMine
Integrated data warehouse, Extensible data model,
Flexible query system
Web and Programmatic Interface
Other InterMine instances
• Use cases
Arabidopsis Information Portal (AIP)
Medicago truncatula Genome Database (MTGD)
• Summary
Advantages
Caveats
3. Introduction
For genome projects that wish to expose their
data via the web (query, visualize, warehouse)
to foster scientific collaboration, there are
several technologies available:
• JCVI developed software
Manatee (backed by an RDBMS)
• Externally developed software
BioMart (federated from various databases)
Tripal (powered by Drupal, backed by a CHADO database)
InterMine
4. InterMine
• Functions as a data warehouse for the integration of complex
biological data. Integration across data types occurs based on
a common identifier (e.g. gene primary ID)
• Uses a flexible and extensible data model, controlled by XML
files, driven by ontologies (Sequence [SO], Gene [GO], etc.)
Genomics, Proteomics, Interactions, Homology,
Expression, Pathways (and more data types)
Parsers for commonly used biological data formats
Provides framework for adding your own data
• Offers a flexible query system, optimized via precomputed
tables (no need for schema denormalization)
Smith, RN. et al. InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data
Bioinformatics (2012) 28 (23): 3163-3165
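The identifier-based integration described on this slide can be illustrated with a minimal Python sketch. This is not InterMine's actual integration code. The function name `integrate`, the `primary_id` field, and all record values are hypothetical and chosen only to show how records from separate sources collapse into one entry per shared gene identifier.

```python
from collections import defaultdict


def integrate(sources, key="primary_id"):
    """Merge records from several sources into one record per identifier."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            identifier = record[key]
            # Earlier sources take priority: a field that is already set
            # is kept, and later sources only contribute new fields.
            for field, value in record.items():
                merged[identifier].setdefault(field, value)
    return dict(merged)


# Hypothetical records from two loaders (values are illustrative only).
genes = [{"primary_id": "GENE0001", "symbol": "abc1", "chromosome": "Chr1"}]
expression = [
    {"primary_id": "GENE0001", "tissue": "leaf", "level": 12.5},
    {"primary_id": "GENE0002", "tissue": "root", "level": 3.0},
]

warehouse = integrate([genes, expression])
print(warehouse["GENE0001"])
```

In this sketch the order of `sources` acts as a simple priority rule when two sources disagree on a field. A real warehouse needs an explicit policy for such conflicts.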
5. InterMine (contd.)
• Provides a user-friendly web interface exposing
powerful features:
Analysis of lists (facilitate enrichment studies)
Full-featured report pages (one-stop shop)
Interactive result tables (sort, filter, summarize)
Visual query builder (no need to write SQL!)
Quick search and Region-based search
• Fosters development of external applications
using data hosted within InterMine via Application
Programming Interfaces (API):
RESTful
Perl, Python, Ruby, Java, JavaScript
Kalderimis, A. et al. InterMine: extensive web services for modern biology
Nucl. Acids Res. (1 July 2014) 42 (W1): W468-W472
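As a rough illustration of the RESTful interface, the sketch below uses only the Python standard library to assemble a path-query XML document and the URL that would carry it. It assumes the usual InterMine conventions: a `/service/query/results` endpoint that takes `query` and `format` parameters, and a `genomic` model with `Gene.primaryIdentifier` and `Gene.symbol` fields. The helper names `build_query_xml` and `build_results_url` are hypothetical, and the gene ID is only an example value. The request is built but not sent.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET


def build_query_xml(views, constraints, model="genomic"):
    """Serialize output columns and constraints as a path-query XML string."""
    query = ET.Element("query", {"model": model, "view": " ".join(views)})
    for path, op, value in constraints:
        ET.SubElement(query, "constraint",
                      {"path": path, "op": op, "value": value})
    return ET.tostring(query, encoding="unicode")


def build_results_url(base_url, query_xml, fmt="json"):
    """Attach the URL-encoded query to the (assumed) results endpoint."""
    params = urlencode({"query": query_xml, "format": fmt})
    return f"{base_url.rstrip('/')}/service/query/results?{params}"


xml = build_query_xml(
    views=["Gene.primaryIdentifier", "Gene.symbol"],
    constraints=[("Gene.primaryIdentifier", "=", "AT1G01010")],
)
url = build_results_url("https://apps.araport.org/thalemine", xml)
print(url)
```

The language-specific client libraries listed above wrap this kind of request, so in practice a query is usually written through one of them rather than assembled by hand.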
6. Public “Mines”
• InterMine supports querying across mines
for cross-database integration
• Vast number of warehouses powered by
InterMine already exist
7. Arabidopsis Information Portal (AIP)
• AIP origins
Funded by NSF in response to community needs, following
termination of funding to TAIR
• AIP objectives
Develop a community web resource that…
– is sustainable, fundable, and community-extensible
– hosts analysis & visualization tools, user data spaces
Federation: integrate diverse data sets from distributed data
sources; foster development of tools for and by the community
Maintenance of the Col-0 gold standard annotation
• AIP methods
Assimilate TAIR data
Host an InterMine instance devoted to Arabidopsis (thale cress)
Offer and consume RESTful web services
Integrate and utilize iPlant resources
8. ThaleMine
https://apps.araport.org/thalemine
• An InterMine interface
to Arabidopsis genomic
data
• Integrates a wide
variety of data types
(A-E, H), some of
which are warehoused
and others are
federated via web
services
• Embedded elements
visualizing gene
structure (JBrowse, not
shown), interaction
networks (F),
expression patterns (G)
9. Visual Query Builder
Image created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)
10. Interactive Result Tables and Region-based Search
Images created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)
11. MedicMine
http://medicmine.jcvi.org
• NSF-funded project to
assist with the curation
of the Medicago
truncatula Genome
Assembly and
Annotation (funding
ended August 2014)
• To warehouse the project
data and keep it available
after funding ended, an
InterMine interface for
Medicago was implemented
(backed by a CHADO
database)
• Provides functionality
similar to that available via
ThaleMine
12. Summary
• Advantages
InterMine is a powerful biological data warehouse
Performs complex data integration
Allows fast and flexible querying
Well-documented programmatic interface
Cookie-cutter, user-friendly web interface
Facilitates cross-talk between “mines”
• Caveats
Adding more data requires a full database rebuild (incremental loading
is not possible) because of the integration step
• About InterMine:
Developed by the Micklem Lab at the University of Cambridge, UK
Written in Java, backed by PostgreSQL, deployed under Tomcat.
Documentation and downloads available at http://www.intermine.org
13. Chris Town, PI
Chris Nelson, PM
Lisa McDonald, Education and Outreach Coordinator
Jason Miller, Co-PI, Technical Lead
Erik Ferlanti, SE
Vivek Krishnakumar, BE
Svetlana Karamycheva, BE
Maria Kim, BE
Gos Micklem, co-PI
Sergio Contrino, Software Engineer
Eva Huala, Project lead, TAIR
Bob Muller, Technical lead, TAIR
Matt Vaughn, co-PI
Steve Mock, Advanced Computing Interfaces
Rion Dooley, Web and Cloud Services
Matt Hanlon, Web and Mobile Applications
Ben Rosen, BA