1. Building genomic data cyberinfrastructure with the online database software Tripal and analysis workflows driven by Galaxy
Meg Staton
University of Tennessee, Knoxville
mstaton1@utk.edu
@hardwoodgenomics
2. Cyberinfrastructure
Need to connect people to
• Computing systems
• Data storage systems
• Advanced instruments
• Data repositories
• Visualization environments
• Sensors
All distributed across the world
4. FAIR data principles
Findable
• Unique and persistent identifiers
Accessible
• Open and free method for retrieval
Interoperable
• Data are properly associated with other datasets
Re-usable
• Rich metadata (attributes for who, what, when, where, how)
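One way to make the "rich metadata" point concrete is a check that every dataset record carries an identifier plus the who/what/when/where/how attributes. This is only a minimal sketch: the field names and the sample record are hypothetical and are not prescribed by the FAIR principles themselves.

```python
# Hypothetical field names chosen for illustration; FAIR does not prescribe them.
REQUIRED_FIELDS = ("identifier", "who", "what", "when", "where", "how")


def missing_metadata(record):
    """Return the required fields that are absent or empty in a dataset record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


record = {
    "identifier": "example-dataset-0001",  # placeholder, not a real accession
    "who": "Example Lab",
    "what": "RNA-Seq expression values",
    "when": "2016-11-14",
    "where": "",  # left empty on purpose
    "how": "Illumina sequencing",
}

print(missing_metadata(record))
```

A record that fails this kind of check can be sent back to the submitter before it is curated, which is cheaper than repairing metadata later.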
5. The community (genome) database
Mission
• Collect data
• Curate data
• Integrate data
• Provide access to data
6. Difference from primary repositories
Why do we need community databases?
The “Community” Part
• Understand what is important for your users
• Respond to questions
• Attend community meetings
• Participate in grants
• Take data that doesn’t have a home anywhere else
• Manual curation
7. Challenges
• 2007, Clemson University
• We were writing all the database and web code from scratch
• Starting to accumulate multiple databases
• Wanted to focus on biological visualization, but were instead cobbling together code modules to handle:
• Usernames/passwords/permissions
• Front page news items
• Calendar of meetings
• There has to be an easier way!
Dorrie Main, Stephen Ficklin
8. A web framework for genetic and genomic data
Goals:
• Simplify construction of community genomics websites
• Encourage high-quality, standards-based websites for data sharing and collaboration
• Expand and reuse code
http://tripal.info
9. Content Management System
Website construction toolkit
Open source
Globally utilized and supported
Manages users
Module-based design
[Diagram labels: My Drupal Web Site; Calendar Module; Views; XML Sitemap]
10. My Drupal Web Site
[Diagram labels: Calendar Module; Views; Organism; Sequence Feature; Genotype; Drupal Database]
11. Why use Tripal?
Goals:
• Simpler construction
• Encourage high-quality, standards-based websites for data sharing and collaboration
• Expand and reuse code
Open source
Friendly developers
Responsive mailing list
20. What problem is being solved?
• Drupal internal search
• Easy to set up and customize (for normal Drupal data types)
• Slow to index, slow to return results
• Need a solution that will:
• Access Chado database
• Provide flexible and customizable indexing – index only what is needed, not everything
• Scale to very large biological data sets
21. Elasticsearch Software
Distributed, open source search and analytics engine
• Massively distributed – can scale horizontally
• Multitenancy – a search cluster can manage many individual indices that can be queried individually or as a group
• Feature-rich – autocomplete, fuzzy searching, “did you mean” suggestions
• Open source
• Widely adopted
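The multitenancy and fuzzy-searching features listed above can be illustrated by the request a client would send. The sketch below only builds the URL and JSON body with the standard library. The index names, the field name, and the host are hypothetical, and this is not the Tripal Elasticsearch module's own code.

```python
import json


def build_fuzzy_search(indices, field, text, host="http://localhost:9200"):
    """Build the URL and JSON body for a fuzzy match query across several indices."""
    # Elasticsearch accepts a comma-separated list of indices in the search path,
    # so several indices can be queried as a group.
    url = f"{host}/{','.join(indices)}/_search"
    body = {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" lets Elasticsearch choose an edit distance from term length.
                    "fuzziness": "AUTO",
                }
            }
        }
    }
    return url, json.dumps(body)


# Hypothetical index and field names.
url, body = build_fuzzy_search(["genes", "organisms"], "name", "quercus rubra")
print(url)
print(body)
```

Querying a single index uses the same function with a one-element list, which mirrors the "individually or as a group" behaviour described on the slide.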
24. After indexing, build search block
The block is a normal Tripal block that can be placed on any or all pages.
Blocks can also be deleted from the admin back end.
28. Elasticsearch Module
Faster indexing (if only due to multicore usage)
Faster searching
Future Development
• Multisite installs on a single web server – currently working
• Port to Tripal 3.0
• Compare to new internal searching
30. What problem is being solved?
[Diagram labels: Biological Samples; RNA Libraries; Gene Expression Levels]
Need a better way to store and visualize RNA-Seq differential gene expression experiments.
31. Expression Module – Content Types
• Biomaterial
• Similar to NCBI BioSample and SRA
• We currently do not differentiate between samples and libraries
• Expression Analysis
• User specifies protocol and array design if a microarray was used
• Upload and display of gene expression values
32. Loading Data
• Import biomaterial
• BioSample data downloaded from NCBI (XML)
• Flat file format (based on NCBI biomaterial bulk load form)
• Can associate ontology terms through flat file
• Create a new expression analysis
• Import expression values as text files
• (assumed to be normalized, features must already exist)
• Individual file per sample
• Tab-delimited file with gene rows, sample columns
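The tab-delimited layout with gene rows and sample columns can be read with a few lines of standard-library code. This is a minimal sketch of one way to parse such a file, not the Expression module's loader; the gene and sample names in the example are made up.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-delimited table with gene rows and sample columns.

    Returns (samples, values), where values[gene][sample] is a float.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the gene identifier
    values = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, cells = row[0], row[1:]
        if len(cells) != len(samples):
            raise ValueError(
                f"{gene}: expected {len(samples)} values, got {len(cells)}"
            )
        values[gene] = {s: float(c) for s, c in zip(samples, cells)}
    return samples, values


# Hypothetical example data.
example = io.StringIO(
    "gene\tleaf\troot\n"
    "gene0001\t12.5\t3.0\n"
    "gene0002\t0.0\t8.25\n"
)
samples, values = read_expression_matrix(example)
print(samples)
print(values["gene0002"]["root"])
```

Because the slide states that features must already exist, a real loader would also need to look each gene identifier up in the database before storing its values; that step is omitted here.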
35. Visualization – Gene Expression
Hover over a library name for a description
Some options to alter the graphic
36. Expression Visualization Tool
• Paste in a list of genes to get a full heatmap across all libraries.
• Plotly allows you to zoom, download, etc.
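Turning a pasted gene list into a heatmap amounts to building a genes-by-libraries matrix. The sketch below produces a figure description in the JSON shape Plotly uses for heatmap traces (a trace of type "heatmap" with x, y, and z). The function, the expression values, and the names are hypothetical, and this is not the module's actual code.

```python
import json


def heatmap_figure(pasted_genes, expression, libraries):
    """Build a Plotly-style heatmap description for a pasted list of genes.

    expression maps gene -> {library: value}. Genes that are not found are
    returned separately so the user can be told about them.
    """
    # Accept genes separated by newlines, commas, or spaces; drop duplicates
    # while keeping the order in which they were pasted.
    requested = list(dict.fromkeys(pasted_genes.replace(",", " ").split()))
    found = [g for g in requested if g in expression]
    missing = [g for g in requested if g not in expression]
    # One row per gene, one column per library; None marks a missing measurement.
    z = [[expression[g].get(lib) for lib in libraries] for g in found]
    figure = {"data": [{"type": "heatmap", "x": libraries, "y": found, "z": z}]}
    return figure, missing


# Hypothetical expression values.
expression = {
    "gene0001": {"leaf": 12.5, "root": 3.0},
    "gene0002": {"leaf": 0.0, "root": 8.25},
}
figure, missing = heatmap_figure(
    "gene0002\ngene0001, gene0003", expression, ["leaf", "root"]
)
print(json.dumps(figure))
print(missing)
```

Reporting unknown identifiers instead of silently dropping them helps users notice typos in a pasted list.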
37. Future Work on Expression Module
• Transfer the list of all features from search results to expression visualization tool
• Add significance/p-values from differential gene expression test results
• Aid searching – e.g., limit results to genes that respond to cold stress
• Interactive data filtering
• Tie into analysis engine
• Tie into a publication module
39. Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.
No need to use the command line to run NGS pipelines.
Use a website to upload data, build an analysis pipeline and run it.
41. Tripal Galaxy Module
• Currently under development
• https://github.com/tripal/tripal_galaxy
• Tripal sites can provide Galaxy workflows to their users
• Ensures reproducibility of data analysis steps
• Decreases curator effort/time
• Provides the workflow within the look-and-feel of the site
• Can be installed by any Tripal site once completed.
43. Galaxy Workflows
Testing on Galaxy instances at Washington State University, University of Connecticut, and University of Tennessee
DNA Sequence Data
• Re-sequencing alignment
• Variant discovery (against the reference)
• Variant discovery (between samples)
• Prediction of functional genetic variants
https://github.com/statonlab/dibbs
44. Tripal Galaxy
• Expected release in April 2016 for first workflow on HWG (Hardwood Genomics)
• Galaxy backend will be running at WSU
• Need to continue work on:
• Selecting and filtering data to input to a workflow
• Monitoring workflow status
• Receiving meaningful error messages if problems occur
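Monitoring workflow status and surfacing meaningful errors largely means collapsing per-step job states into one message for the user. The sketch below assumes each step reports a state string such as "queued", "running", "ok", or "error", which are state names Galaxy uses for jobs. The function and its input structure are hypothetical and are not part of the Tripal Galaxy module.

```python
def summarize_workflow(step_states):
    """Collapse per-step job states into an overall status and a readable message.

    step_states maps a step label to a state string.
    """
    # Any failed step makes the whole run an error; name the steps for the user.
    failed = sorted(step for step, state in step_states.items() if state == "error")
    if failed:
        return "error", "Failed steps: " + ", ".join(failed)
    done = sum(1 for state in step_states.values() if state == "ok")
    total = len(step_states)
    if total and done == total:
        return "ok", "All steps finished"
    return "running", f"{done} of {total} steps finished"


# Hypothetical step labels.
print(summarize_workflow({"alignment": "ok", "variant discovery": "running"}))
print(summarize_workflow({"alignment": "error", "variant discovery": "queued"}))
```

Naming the failed step, rather than only reporting that the run failed, is one simple way to make an error message more meaningful to a site user.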
46. Users produce messy data
Day          Collector  Color       Diseased?
11-14-16     Evan       Red         0
11-14-16     Evan       Pink        0
11-14-16     Evan       White       1
Nov 14 2016  Becky      Fuschia     True
Nov 14 2016  Becky      White       False
16-11-14     Miriam     Vermillion  Yes
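The table above mixes three date styles and three ways of writing a yes/no value. A minimal sketch of how such entries could be normalized after the fact is shown below. The list of accepted formats and its order are assumptions, and "no" is included as an assumed counterpart to "Yes" even though it does not appear in the table.

```python
from datetime import datetime

# Formats seen in the table, tried in order. The order is an assumption:
# a value such as "11-12-10" is genuinely ambiguous and would match the first format.
DATE_FORMATS = ("%m-%d-%y", "%b %d %Y", "%y-%m-%d")
TRUE_VALUES = {"1", "true", "yes"}
FALSE_VALUES = {"0", "false", "no"}


def normalize_date(text):
    """Return an ISO 8601 date string, or raise ValueError if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def normalize_flag(text):
    """Map the different spellings of yes/no onto a boolean."""
    value = text.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognized yes/no value: {text!r}")


for day, diseased in [("11-14-16", "0"), ("Nov 14 2016", "True"), ("16-11-14", "Yes")]:
    print(normalize_date(day), normalize_flag(diseased))
```

Cleanup like this can only guess at ambiguous entries, which is the argument for standardizing collection at the point of entry instead.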
47. Standardize Collection
• Create forms for data collection
• Serve through a flexible mobile app
• Currently prototyping as a citizen science app
49. Mobile App
• Timeline
• Citizen Science app released by July 2017
• Prototype of full phenotyping app by Jan 2018
• Testing in multiple systems
51. Abdullah Almsaeed (Research Associate), Bradford Condon (Postdoc), Miriam Paya Milans (Postdoc)
Ming Chen (Graduate Student), Fang Lui (Graduate Student)
Stephen Ficklin
Dorrie Main
Jill Wegrzyn
Bert Abbott
Dana Nelson
Ellen Crocker