Biogrid is a unified interface for running popular bioinformatics applications on distributed computational resources using versioned and cached databases. It provides efficient access to applications like BLAST, ClustalW, and HMMER through a standardized job submission process. The system leverages grid infrastructure like the Nordic Data Grid Facility to enable burst computing and high availability. It allows researchers to run jobs on multiple shared clusters with a single command.
Biogrid - Distributed Bioinformatics for the Grid
1. Biogrid – Bioinformatics for the grid
Joel Hedlund <yohell@ifm.liu.se>
Biogrid User and Developer
Linköping University, Sweden
Birds-of-a-feather session tonight: see me after this talk!
2. Outline
• What is it?
• What is it good for?
• Does it really work?
• Gory details.
• Why did we do this?
• Profit!
4. What is it?
• Unified interface
...to popular bioinformatic applications
...on shared, distributed computational resources
...using versioned and cached databases
5. What is it good for?
• Burst computing
– High demand for short periods of time
• high during development / production
• low during analysis / writing papers
– Share resources to enable more efficient use
• Database accessibility
• Availability
• Unified interface
7. What is NDGF?
• Nordic Data Grid Facility
• A WLCG Tier1 facility
– Worldwide LHC Computing Grid
– Stores and processes data from LHC at CERN
• peak rate ≈ 1.6Gb/s, when the accelerator is running
(and that’s after most of the data have been filtered away)
11. ”Does it really work, this distributed thingie?”
Why yes, very well, thank you!
12. NDGF
• 96% availability
(highest of all Tier1 facilities)
• Third largest Tier1 facility in the world
• Lowest ratio of failed ATLAS jobs
• Production goals met, and beyond
– Goal: 8% of all ATLAS resources (10.5% provided)
– Goal: 9% of all ALICE resources (12% provided)
* Data graciously stolen from Leif Nixon's NorduNet 2008 talk. Thank you Leif :-)
23. Unified Interface
• XRSL job description
– Standard in the ARC Grid middleware
• Well-defined runtime environments
– $HMMERDIR: node-local (fast) scratch dir containing db files
– prepare_db: downloads and unpacks db files on the fly from the front node to $HMMERDIR
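A minimal xRSL job description for such a run might look like the sketch below. The file names, the wrapper script, the runtime environment string, and the cpuTime value are all illustrative assumptions, not the actual Biogrid RE names:

```
&(executable="run_hmmsearch.sh")
 (arguments="query.hmm")
 (inputFiles=("query.hmm" "") ("run_hmmsearch.sh" ""))
 (outputFiles=("results.out" ""))
 (jobName="refinehmm-family023")
 (runtimeEnvironment="APPS/BIO/HMMER")
 (cpuTime="120")
```

The runtimeEnvironment line is what ties the job to a resource where the Biogrid RE (and its cached database) is installed; the broker will only match clusters that advertise it.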
26. Unified Interface
• Run on any resource I can access:
$ ngsub myjob.xrsl
• ...or run on my buddy’s cluster:
$ ngsub -c kiniini.csc.fi myjob.xrsl
• Check jobs:
$ ngstat refinehmm-family023
(or use Grid Monitor web interface at www.nordugrid.org)
• Fetch results:
$ ngget refinehmm-family*
(Diagram: jobs dispatched via the data grid; results returned to the user.)
27. What do I need?
1. A resource with ARC and Biogrid REs
2. An ARC client
3. A Grid Certificate
(available from a number of global certificate authorities)
4. Time allowance on the resource
(5. Biogrid VO membership: not really necessary, but it will get you 1 & 4)
28. What do I need?
...or you can just grab the RE scripts off the biogrid website,
and your db of choice from the biogrid dCache.
29. Why did we do this?
Bioinformatic applications...
– CPU intensive
– Small input and output files
– ”Large” databases can be cached
...are very well suited for distributed computing.
31. Subclassification of the MDR superfamily
• 15000 members, from all kingdoms of life
• 500 families (25% sequence identity)
• 40 human members
• Different substrate specificities
• Different subunit & cofactor counts
• 2 HMMs available for superfamily detection
• None for any of the individual families
32. Subclassification of the MDR superfamily
• We made HMMs for all MDR (sub)families
with 20+ members.
• 86 families
• 34 subfamilies detected within 14 of these
• 11579 / 15000 sequences classified
• ≈5000 hmmsearch runs vs UniProtKB
Manuscript in preparation
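Runs at that scale are easy to script on top of the unified interface. A hedged sketch, assuming family HMMs named family001.hmm, family002.hmm, … and an illustrative wrapper script and RE string (not the real Biogrid names): it writes one xRSL description per HMM, ready for ngsub.

```shell
#!/bin/sh
# Sketch: generate one xRSL job description per family HMM.
# File names, run_hmmsearch.sh and the RE string are illustrative.
for hmm in family001.hmm family002.hmm; do
  job="${hmm%.hmm}"
  cat > "${job}.xrsl" <<EOF
&(executable="run_hmmsearch.sh")
 (arguments="${hmm}")
 (inputFiles=("${hmm}" "") ("run_hmmsearch.sh" ""))
 (outputFiles=("${job}.out" ""))
 (jobName="refinehmm-${job}")
 (runtimeEnvironment="APPS/BIO/HMMER")
EOF
  # ngsub "${job}.xrsl"   # actual submission needs an ARC client + grid certificate
done
```

With the jobs named refinehmm-family*, the whole batch can later be polled with ngstat and harvested with a single ngget wildcard, as on the previous slide.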
33. refinehmm
• Algorithm for automated HMM refinement
• Produces stable and reliable HMMs
• Developed using Biogrid REs and resources
Will also be open source software once the paper is out.
34. Acknowledgements
• Olli Tourunen (Biogrid developer)
• Bengt Persson (Biogrid PI)
• NDGF: Michael Grønager, Josva Kleist
• Biogrid co-applicants: Ann-Charlotte Berglund Sonnhammer, Erik Sonnhammer, Inge Jonassen
• Supercomputing centers:
– NSC: Jens Larsson, Leif Nixon
– HPC2N: Åke Sandgren
– Others: C3SE, CSC, Uppmax, Lunarc, PDC, Aalborg University, Oslo University

Joel Hedlund <yohell@ifm.liu.se>
Biogrid User and Developer
Linköping University, Sweden
Birds-of-a-feather session tonight: see me after the talk!