Tips & Tricks for
Software Engineering in
     Bioinformatics
        Presented by:
         Joel Dudley
Who is this guy?
[Figure: Avg. time spent programming (hours) vs. Age (years), ages 5 to 32]
http://www.megasoftware.net
Kumar S. and Dudley J. “Bioinformatics software for biologists in the genomics era.”
Bioinformatics (2007) vol. 23 (14) pp. 1713-7
Bioinformatics Philosophy
Build Your Toolbox
Learn UNIX!
Be a jack of all trades, but master of one.




    http://oreilly.com/news/graphics/prog_lang_poster.pdf
R, C/C++, PHP, VB, PERL, Python, Ruby, Java, LISP
Java is not just for Java




                           http://jruby.codehaus.org
http://www.jython.org
Simplified Wrapper and Interface Generator (SWIG)


              Greasy-fast C library




                Doughy-soft
             scripting language

                http://www.swig.org/
Frameworks are Friends




        BioBike
Stand on the slumped, dandruff-covered shoulders of
            millions of computer nerds.
Don’t trust yourself (or your hard disk).
Don’t be afraid to use more than three letters
             to define a variable!

#!/usr/bin/perl
# 472-byte qrpff, Keith Winstein and Marc Horowitz <sipb-iap-dvd@mit.edu>
# MPEG 2 PS VOB file -> descrambled output on stdout.
# usage: perl -I <k1>:<k2>:<k3>:<k4>:<k5> qrpff
# where k1..k5 are the title key bytes in least to most-significant order

s''$/=2048;while(<>){G=29;R=142;if((@a=unqT="C*",_)[20]&48){D=89;_=unqb24,qT,@
b=map{ord qB8,unqb8,qT,_^$a[--D]}@INC;s/...$/1$&/;Q=unqV,qb25,_;H=73;O=$b[4]<<9
|256|$b[3];Q=Q>>8^(P=(E=255)&(Q>>12^Q>>4^Q/8^Q))<<17,O=O>>8^(E&(F=(S=O>>14&7^O)
^S*8^S<<6))<<9,_=(map{U=_%16orE^=R^=110&(S=(unqT,"xbntdxbzx14d")[_/16%8]);E
^=(72,@z=(64,72,G^=12*(U-2?0:S&17)),H^=_%64?12:0,@z)[_%8]}(16..271))[_]^((D>>=8
)+=P+(~F&E))for@a[128..$#a]}print+qT,@a}';s/[D-HO-U_]/$$&/g;s/q/pack+/g;eval
Object-Oriented Software Design Decisions



[Figure labels: Accomplishment, Architecture]
module GraphBuilder
  LINE_TYPES = [:solid,:dashed,:dotted]
  module Nodes
    SHAPE_TYPES = [:rectangle,:roundrectangle,:ellipse,:parallelogram,:hexagon,:octagon,:diamond,:triangle,:trapezoid,:trapezoid2,:rectangle3d]
    class BaseNode
      attr_accessor :label,:geometry,:fill_colors,:outline,:degree,:data
      def initialize(opts={})
        @opts = {
          :form=>:ellipse,
          :height=>50.0,
          :width=>50.0,
          :label=>"GraphNode#{self.object_id}",
          :line_type=>:solid,
          :fill_color => {:R=>255,:G=>204,:B=>0,:A=>255},
          :fill_color2 => nil,
          :data => {},
          :outline_color=>{:R=>0,:G=>0,:B=>0,:A=>255}, # Set to nil or {:R=>0,:G=>0,:B=>0,:A=>0} for no outline
        }.merge(opts)
        @data = @opts[:data] # for storing application-specific data
        @label = Labels::NodeLabel.new(@opts[:label])
        @geometry = {:pos_x=>0.0,:pos_y=>0.0,:width=>1.0,:height=>1.0}
        @fill_colors = [@opts[:fill_color],nil]
        @outline = {:line_type=>@opts[:line_type],:color=>@opts[:outline_color]}
        @degree = {:in=>0,:out=>0}
      end

      def clone_params
        {
          :label=>text,
          :fill_color=>@fill_colors.first,
          :form=>@form,
          :height=>@geometry[:height],
          :width=>@geometry[:width]
        }
      end
    end

    class ShapeNode < BaseNode
      attr_accessor :form
      def initialize(opts={})
        super
        @form = @opts[:form]
        @geometry[:height] = @opts[:height]
        @geometry[:width] = @opts[:width]
      end
    end
  end
end
To Subclass or not to subclass? Use mixins!
   class Array
     def arithmetic_mean
       self.inject(0.0) { |sum,x| x = x.real if x.is_a?(Complex); sum + x.to_f } / self.length.to_f
     end

     def geometric_mean
       begin
         Math.exp(self.select { |x| x > 0.0 }.collect { |x| Math.log(x) }.arithmetic_mean)
       rescue Errno::ERANGE
         Math.exp(self.select { |x| x > 0.0 }.collect { |x| BigMath.log(x,50) }.arithmetic_mean)
       end
     end

     def median
       sorted = self.sort # the median is only meaningful on ordered values
       if sorted.length.odd?
         sorted[sorted.length / 2]
       else
         upper_median = sorted[sorted.length / 2]
         lower_median = sorted[(sorted.length / 2) - 1]
         [upper_median,lower_median].arithmetic_mean
       end
     end

     def standard_deviation
       mean = self.arithmetic_mean
       deviations = self.map { |x| x - mean }
       sqr_deviations = deviations.map { |x| x**2 }
       sum_sqr_deviations = sqr_deviations.inject(0.0) { |sum,x| sum + x }
       Math.sqrt(sum_sqr_deviations/(self.length - 1).to_f)
     end
     alias_method :sd, :standard_deviation

     def shuffle
       sort_by { rand }
     end

     def shuffle!
       self.replace shuffle
     end
   end
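The slide title recommends mixins, while the listing above reopens Array directly. A minimal sketch of the mixin alternative follows; the module name DescriptiveStats and the Measurements class are hypothetical and not from the slides. The module only assumes the host object responds to to_a.

```ruby
# Hypothetical mixin: DescriptiveStats and Measurements are illustrative
# names, not part of the original slides.
module DescriptiveStats
  def arithmetic_mean
    values = to_a
    values.sum(0.0) / values.length
  end

  def median
    sorted = to_a.sort
    mid = sorted.length / 2
    sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end

  # Sample standard deviation (divides by n - 1).
  def standard_deviation
    values = to_a
    mean = arithmetic_mean
    Math.sqrt(values.sum(0.0) { |x| (x - mean)**2 } / (values.length - 1))
  end
end

# Any class that responds to to_a can opt in without subclassing Array.
class Measurements
  include Enumerable
  include DescriptiveStats

  def initialize(*values)
    @values = values
  end

  def each(&block)
    @values.each(&block)
  end
end

expression_levels = Measurements.new(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
puts expression_levels.arithmetic_mean
puts expression_levels.median
```

A single array can also opt in with `[1, 2, 3].extend(DescriptiveStats)`, which leaves every other Array in the program untouched.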
Documenting code sucks! Automate it.

• Come up with a convention for your
  “headers”
• Use automated documentation generation
  tools
    • Javadoc
    • RDoc
    • Pydoc / Epydoc
• Save code snippets in a searchable
  repository
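As one possible header convention, the sketch below uses RDoc's labeled-list markup (`name::`) for parameters. The function gc_content is a hypothetical example, not code from the talk.

```ruby
# Computes the GC content of a nucleotide sequence.
#
# sequence:: DNA or RNA sequence as a String (case-insensitive)
#
# Returns the fraction of G and C bases as a Float between 0.0 and 1.0,
# or 0.0 for an empty sequence.
def gc_content(sequence)
  return 0.0 if sequence.empty?
  sequence.upcase.count("GC") / sequence.length.to_f
end

puts gc_content("atgc")
```

Running the `rdoc` command on a file written this way should produce browsable HTML without any extra hand-written documentation.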
A little performance optimization goes a long way

     • General tools
      • DTrace
      • strace
      • gdb
     • Language specific
      • Ruby-prof
      • Psyco/Pyrex
      • JBoss Profiler/JIT
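Before reaching for a full profiler, a quick timing comparison often shows where the time goes. The sketch below uses Ruby's standard Benchmark module rather than any of the tools listed above; the k-mer counting workload and both function names are made up for illustration.

```ruby
require 'benchmark'

# Hypothetical workload: count k-mers two ways and compare wall-clock time.

# Rescans the full k-mer list once per distinct k-mer.
def count_kmers_rescan(sequence, k)
  kmers = (0..sequence.length - k).map { |i| sequence[i, k] }
  kmers.uniq.to_h { |kmer| [kmer, kmers.count(kmer)] }
end

# Single pass with a counting hash.
def count_kmers_tally(sequence, k)
  counts = Hash.new(0)
  (0..sequence.length - k).each { |i| counts[sequence[i, k]] += 1 }
  counts
end

sequence = Array.new(5_000) { %w[A C G T].sample }.join

Benchmark.bm(8) do |bm|
  bm.report("rescan:") { count_kmers_rescan(sequence, 4) }
  bm.report("tally:")  { count_kmers_tally(sequence, 4) }
end
```

Both functions return the same counts, so the timing difference comes only from how often the data is traversed.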
Working with data
# Copyright © 1996-2007 SRI International, Marine Biological Laboratory, DoubleTwist Inc.,
# The Institute for Genomic Research, J. Craig Venter Institute, University of California at San Diego,
# and UNAM. All Rights Reserved.
#
#
# Please see the license agreement regarding the use of and distribution of this file.
# The format of this file is defined at http://bioinformatics.ai.sri.com/ptools/flatfile-format.html .
#
# Species: E. coli K-12
# Database: EcoCyc
# Version: 11.5
# File Name: dnabindsites.dat
# Date and time generated: August 6, 2007, 17:32:33
#
# Attributes:
#    UNIQUE-ID
#    TYPES
#    COMMON-NAME
#    ABS-CENTER-POS
#    APPEARS-IN-BINDING-REACTIONS
#    CITATIONS
#    COMMENT
#    COMPONENT-OF
#    COMPONENTS
#    CREDITS
#    DATA-SOURCE
#    DBLINKS
#    INSTANCE-NAME-TEMPLATE
#    INVOLVED-IN-REGULATION
#    LEFT-END-POSITION
#    REGULATED-PROMOTER
#    RELATIVE-CENTER-DISTANCE
#    RIGHT-END-POSITION
#    SYNONYMS
#
UNIQUE-ID - BS86
TYPES - DNA-Binding-Sites
ABS-CENTER-POS - 4098761
CITATIONS - 94018613
CITATIONS - 94018613:EV-EXP-IDA-BINDING-OF-CELLULAR-EXTRACTS:3310246267:martin
CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
COMPONENT-OF - TU00064
INVOLVED-IN-REGULATION - REG0-5521
TYPE-OF-EVIDENCE - :BINDING-OF-CELLULAR-EXTRACTS
//
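To show what working with a file like this involves, here is a minimal parsing sketch based only on what the excerpt shows: comment lines start with `#`, each data line is `ATTRIBUTE - value`, attributes may repeat, and `//` ends a record. The function name parse_flatfile is hypothetical, and the sketch ignores any other syntax the full format may define.

```ruby
# Parses attribute-value records like the dnabindsites.dat excerpt.
# Returns an Array of Hashes mapping each attribute name to an Array of
# values, because attributes such as CITATIONS can repeat.
def parse_flatfile(text)
  records = []
  current = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    line = line.chomp
    next if line.start_with?("#") || line.strip.empty?
    if line == "//" # record terminator
      records << current unless current.empty?
      current = Hash.new { |hash, key| hash[key] = [] }
    else
      key, value = line.split(" - ", 2)
      current[key] << value if value # skip lines without the " - " separator
    end
  end
  records << current unless current.empty?
  records
end

sample = <<~DATA
  # File Name: dnabindsites.dat
  UNIQUE-ID - BS86
  TYPES - DNA-Binding-Sites
  ABS-CENTER-POS - 4098761
  CITATIONS - 94018613
  CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
  COMPONENT-OF - TU00064
  //
DATA

sites = parse_flatfile(sample)
puts sites.first["UNIQUE-ID"].first
puts sites.first["CITATIONS"].length
```

Every script that needs this data has to repeat a step like this, which is the cost the following slides argue a database removes.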
If you can represent most of your data as key/value
    pairs, then at the very least use a BerkeleyDB




  http://www.oracle.com/technology/products/berkeley-db/index.html
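The key/value idea can be sketched without installing anything. The example below uses PStore from Ruby's standard library purely as a stand-in for BerkeleyDB, whose Ruby bindings are a separate third-party install; the file name and the stored record are illustrative.

```ruby
require 'pstore'
require 'tmpdir'

# PStore is a stand-in here, not BerkeleyDB itself.
store_path = File.join(Dir.mktmpdir, "binding_sites.pstore")
store = PStore.new(store_path)

# Writes are grouped in a transaction and committed together when the
# block finishes.
store.transaction do
  store["BS86"] = { "TYPES" => "DNA-Binding-Sites", "ABS-CENTER-POS" => 4098761 }
end

# Read-only transaction: look a record up by key instead of re-parsing
# the flat file.
site = store.transaction(true) { store["BS86"] }
puts site["ABS-CENTER-POS"]
```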
In most cases a relational database is an
    appropriate choice for bioinformatics data
• Clean and consolidated (vs. a rat’s nest of files and folders)
• Improved performance (memory usage and file I/O)
• Data consistency through constraints and transactions
• Easily portable (SQL-92 standard)
• Querying (asking questions about data) vs. parsing (reading and loading data)
• Commonly used data processing functions can be implemented as stored procedures
“But I’m a scientist, not a DBA! Harrumph!”


                              http://www.sqlite.org
“...SQLite is a software library that implements a self-contained, serverless,
         zero-configuration, transactional SQL database engine...”
But seriously, don’t write any SQL (What?)
               Relational Database
          (MySQL, PostgreSQL, Oracle, etc)




          Object Relational Mapper (ORM)




Model


                                             Instance
Beyond the RDBMS




http://strokedb.com/       http://incubator.apache.org/couchdb




                 http://www.hypertable.org
Thinking in Parallel
Loosely Coupled
• Each task is independent
• No synchronous inter-task communication
• Example: Computing a Maximum Likelihood Phylogeny for every gene family in the Panther Database
• Software: OpenPBS, SGE, Xgrid, Platform LSF

Tightly Coupled
• Tasks are interdependent
• Synchronous inter-task communication via messaging interface
• Example: Monte Carlo simulation of 3D protein interactions in cytoplasm
• Software: OpenMPI, MPICH, PVM
Use your idle CPU cores!
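For loosely coupled tasks, one way to use several cores from a single script is to fork one worker process per task. The sketch below assumes a Unix-like system (it relies on fork) and results that Marshal can serialize; parallel_map is a hypothetical helper, not a library call.

```ruby
# Runs the block for each item in its own child process and returns the
# results in input order. Forks one process per item, so keep the item
# count close to the number of available cores.
def parallel_map(items)
  jobs = items.map do |item|
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode
    pid = fork do
      reader.close
      writer.write(Marshal.dump(yield(item)))
      writer.close
      exit!(0) # leave the child immediately, skipping at_exit handlers
    end
    writer.close # the parent only reads
    [pid, reader]
  end

  jobs.map do |pid, reader|
    result = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
    result
  end
end

gc_fractions = parallel_map(%w[ATGC GGCC ATAT]) do |sequence|
  sequence.count("GC") / sequence.length.to_f
end
p gc_fractions
```

Because the tasks never talk to each other, no locking or messaging layer is needed; the only communication is each child sending its result back once.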
Start thinking in terms of MapReduce
   (old hat for Lisp programmers!)




Image source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
map(String key, String value):
// key: document name
// value: document contents
for each word w in value:
  EmitIntermediate(w, "1");

reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
  result += ParseInt(v);
Emit(AsString(result));     [1]
map(String key, String value):
// key: Sequence alignment file name
// value: multiple alignment
for each exon w in value:
  EmitIntermediate(w, CpGIndex);

reduce(String key, Iterator values):
// key: an exon
// values: a list of CpG Index Values
float result = 0;
for each v in values:
  result += ParseFloat(v);
Emit(AsString(result / length(values)));   [1]
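The two pseudocode listings above can be imitated in a single process to see how the phases fit together. This is a sketch only: map_reduce is a hypothetical helper that runs everything in memory, and the CpG index values in the second call are invented numbers for illustration.

```ruby
# Single-process imitation of the map, shuffle, and reduce phases.
# map_fn takes (key, value) and returns [intermediate_key, value] pairs;
# reduce_fn takes (intermediate_key, values) and returns one result.
def map_reduce(inputs, map_fn, reduce_fn)
  intermediate = inputs.flat_map { |key, value| map_fn.call(key, value) }
  grouped = intermediate.group_by(&:first) # shuffle: group pairs by key
  grouped.to_h do |key, pairs|
    [key, reduce_fn.call(key, pairs.map(&:last))]
  end
end

# Word count, as in the first listing.
word_map    = ->(_name, contents) { contents.downcase.scan(/\w+/).map { |w| [w, 1] } }
word_reduce = ->(_word, counts) { counts.sum }
documents   = { "a.txt" => "the cat sat", "b.txt" => "the cat ran" }
word_counts = map_reduce(documents, word_map, word_reduce)
p word_counts

# Mean per exon, as in the second listing (values are made up).
alignments  = { "aln1" => [["exon1", 0.25], ["exon2", 0.5]], "aln2" => [["exon1", 0.75]] }
exon_map    = ->(_file, exon_scores) { exon_scores }
mean_reduce = ->(_exon, values) { values.sum(0.0) / values.length }
exon_means  = map_reduce(alignments, exon_map, mean_reduce)
p exon_means
```

Only the two small functions change between the problems; the grouping step in the middle stays the same, and that step is what frameworks such as those listed on the following slides distribute across machines.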
http://sourceforge.net/projects/cloudburst-bio/
MapReduce Implementations



http://hadoop.apache.org/core/
                                               http://skynet.rubyforge.org/




   http://discoproject.org/




                                 http://labs.trolltech.com/page/Projects/Threads/QtConcurrent
Embracing Hardware
Single Instruction, Multiple Data (SIMD)
Graphics Processing Unit (GPU):
    Not just fun and games
GPU Programming is Getting Easier




 Compute Unified
                                             OpenCL
Device Architecture
  http://www.nvidia.com/cuda   http://s08.idav.ucdavis.edu/munshi-opencl.pdf
Field Programmable Gate Arrays (FPGA)
Playing nice with others
Data Interchange Formats


• JSON
• YAML
• XML
 • Microformats
 • RDF
person = {
       "name": "Joel Dudley",
       "age": 32,
       "height": 1.83,
       "urls": [
         "http://www.joeldudley.com/",
         "http://www.linkedin.com/in/joeldudley"
       ]
     }



                        VS.

<person>
  <name>Joel Dudley</name>
  <age>32</age>
  <height>1.83</height>
  <urls>
    <url>http://www.joeldudley.com/</url>
    <url>http://www.linkedin.com/in/joeldudley</url>
  </urls>
</person>
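Part of JSON's appeal is how directly it maps onto native data structures. The sketch below parses the same person record with Ruby's standard json library; the variable names are illustrative.

```ruby
require 'json'

json_text = <<~JSON
  {
    "name": "Joel Dudley",
    "age": 32,
    "height": 1.83,
    "urls": [
      "http://www.joeldudley.com/",
      "http://www.linkedin.com/in/joeldudley"
    ]
  }
JSON

person = JSON.parse(json_text) # a plain Hash with String keys
puts person["name"]
puts person["urls"].length

# Serializing the Hash again gives text any other language can read.
round_trip = JSON.parse(JSON.generate(person))
puts round_trip == person
```

Numbers, strings, lists, and nested objects arrive already typed, with no schema or tree-walking code required.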
Web Services



• Remote Procedure Call (RPC)
• Representational State Transfer (ReST)
• SOAP
• ActiveResource Pattern
class Video < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"

  ## To search by categories and tags
  def self.search_by_tags(*options)
    from_urls = []
    if options.last.is_a? Hash
      excludes = options.slice!(options.length-1)
      if excludes[:exclude].kind_of? Array
        from_urls << excludes[:exclude].map{|keyword| "-"+keyword}.join("/")
      else
        from_urls << "-"+excludes[:exclude]
      end
    end
    from_urls << options.find_all{|keyword| keyword =~ /^[a-z]/}.join("/")
    from_urls << options.find_all{|category| category =~ /^[A-Z]/}.join("%7C")
    from_urls.delete_if {|x| x.empty?}
    self.find(:all,:from=>"/feeds/api/videos/-/"+from_urls.reverse.join("/"))
  end
end

class User < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Standardfeed < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Playlist < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end
search = Video.find(:first, :params => {:vq => 'ruby', :"max-results" => '5'})
puts search.entry.length

## video information of id = ZTUVgYoeN_o
vid = Video.find("ZTUVgYoeN_o")
puts vid.group.content[0].url

## video comments
comments = Video.find_custom("ZTUVgYoeN_o").get(:comments)
puts comments.entry[0].link[2].href

## searching with category/tags
results = Video.search_by_tags("Comedy")
puts results[0].entry[0].title
# more examples:
# Video.search_by_tags("Comedy", "dog")
# Video.search_by_tags("News","Sports","football", :exclude=>"soccer")
Teamwork
Be Agile
      Manifesto for Agile Software Development

          We are uncovering better ways of developing
          software by doing it and helping others do it.
           Through this work we have come to value:

       • Individuals and interactions over processes and tools
       • Working software over comprehensive documentation
       • Customer collaboration over contract negotiation
       • Responding to change over following a plan
That is, while there is value in the items on the right, we value the
                       items on the left more.
                      http://agilemanifesto.org/
Be Agile

As a [role], I want to [goal], so I can [reason].


[Figure labels: Storyboard, Unit Testing, Acceptance Testing, Feedback, Iterate!]
Automate Development



http://nant.sourceforge.net/     http://www.scons.org/




  http://www.capify.org/       http://nant.sourceforge.net/
Lightweight Tools for Project Management
Closing Remarks

• Focus on the goal (Biology/Medicine)
• Don’t be clever (you’ll trick yourself)
• Value your time
• Outsource everything but genius
• Use the tools available to you
• Have fun!

More Related Content

Similar to Tips for software engineering in bioinformatics

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?Jeremy Schneider
 
Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987乐群 陈
 
Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Vitaly Baum
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Machine vision and device integration with the Ruby programming language (2008)
Machine vision and device integration with the Ruby programming language (2008)Machine vision and device integration with the Ruby programming language (2008)
Machine vision and device integration with the Ruby programming language (2008)Jan Wedekind
 
Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Guillaume Laforge
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkIvan Morozov
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnelukdpe
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ssSatoshi Kume
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
 
Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Guillaume Laforge
 
2 Years of Real World FP at REA
2 Years of Real World FP at REA2 Years of Real World FP at REA
2 Years of Real World FP at REAkenbot
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesEelco Visser
 

Similar to Tips for software engineering in bioinformatics (20)

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?
 
Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987Compiler2016 by abcdabcd987
Compiler2016 by abcdabcd987
 
Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Machine vision and device integration with the Ruby programming language (2008)
Machine vision and device integration with the Ruby programming language (2008)Machine vision and device integration with the Ruby programming language (2008)
Machine vision and device integration with the Ruby programming language (2008)
 
Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007
 
myHadoop 0.30
myHadoop 0.30myHadoop 0.30
myHadoop 0.30
 
Evolution of Spark APIs
Evolution of Spark APIsEvolution of Spark APIs
Evolution of Spark APIs
 
Effective Object Oriented Design in Cpp
Effective Object Oriented Design in CppEffective Object Oriented Design in Cpp
Effective Object Oriented Design in Cpp
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 
Having Fun with Play
Having Fun with PlayHaving Fun with Play
Having Fun with Play
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ss
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
 
Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008Groovy Introduction - JAX Germany - 2008
Groovy Introduction - JAX Germany - 2008
 
2 Years of Real World FP at REA
2 Years of Real World FP at REA2 Years of Real World FP at REA
2 Years of Real World FP at REA
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific Languages
 

Recently uploaded

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Recently uploaded (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

Tips for software engineering in bioinformatics

  • 1. Tips & Tricks for Software Engineering in Bioinformatics Presented by: Joel Dudley
  • 2. Who is this guy?
  • 3. Avg. time spent programming (hours) 10.0 7.5 5.0 2.5 0 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 25 26 27 28 29 30 31 32 Age (years)
  • 5. Kumar S. and Dudley J. “Bioinformatics software for biologists in the genomics era.” Bioinformatics (2007) vol. 23 (14) pp. 1713-7
  • 9. Be a jack of all trades, but master of one. http://oreilly.com/news/graphics/prog_lang_poster.pdf
  • 10. R C/C++ PHP VB PERL Python Ruby Java LISP
  • 11. Java is not just for Java http://jruby.codehaus.org http://www.jython.org
  • 12. Simplified Wrapper and Interface Generator (SWIG) Greasy-fast C library Doughy-soft scripting language http://www.swig.org/
  • 14. Stand on the slumped, dandruff-covered shoulders of millions of computer nerds.
  • 16. Don’t trust yourself (or your hard disk).
  • 17. Don’t be afraid to use more than three letters to define a variable!
      #!/usr/bin/perl
      # 472-byte qrpff, Keith Winstein and Marc Horowitz <sipb-iap-dvd@mit.edu>
      # MPEG 2 PS VOB file -> descrambled output on stdout.
      # usage: perl -I <k1>:<k2>:<k3>:<k4>:<k5> qrpff
      # where k1..k5 are the title key bytes in least to most-significant order
      s''$/=2048;while(<>){G=29;R=142;if((@a=unqT="C*",_)[20]&48){D=89;_=unqb24,qT,@
      b=map{ord qB8,unqb8,qT,_^$a[--D]}@INC;s/...$/1$&/;Q=unqV,qb25,_;H=73;O=$b[4]<<9
      |256|$b[3];Q=Q>>8^(P=(E=255)&(Q>>12^Q>>4^Q/8^Q))<<17,O=O>>8^(E&(F=(S=O>>14&7^O)
      ^S*8^S<<6))<<9,_=(map{U=_%16orE^=R^=110&(S=(unqT,"xbntdxbzx14d")[_/16%8]);E
      ^=(72,@z=(64,72,G^=12*(U-2?0:S&17)),H^=_%64?12:0,@z)[_%8]}(16..271))[_]^((D>>=8
      )+=P+(~F&E))for@a[128..$#a]}print+qT,@a}';s/[D-HO-U_]/$$&/g;s/q/pack+/g;eval
  • 18. Object-Oriented Software Design Decisions [Figure labels: Accomplishment, Architecture]
  • 19.
      module GraphBuilder
        LINE_TYPES = [:solid,:dashed,:dotted]
        module Nodes
          SHAPE_TYPES = [:rectangle,:roundrectangle,:ellipse,:parallelogram,:hexagon,:octagon,:diamond,:triangle,:trapezoid,:trapezoid2,:rectangle3d]
          class BaseNode
            attr_accessor :label,:geometry,:fill_colors,:outline,:degree,:data
            def initialize(opts={})
              @opts = {
                :form=>:ellipse,
                :height=>50.0,
                :width=>50.0,
                :label=>"GraphNode#{self.object_id}",
                :line_type=>:solid,
                :fill_color => {:R=>255,:G=>204,:B=>0,:A=>255},
                :fill_color2 => nil,
                :data => {},
                :outline_color=>{:R=>0,:G=>0,:B=>0,:A=>255}, # Set to nil or {:R=>0,:G=>0,:B=>0,:A=>0} for no outline
              }.merge(opts)
              @data = @opts[:data] # for storing application-specific data
              @label = Labels::NodeLabel.new(@opts[:label])
              @geometry = {:pos_x=>0.0,:pos_y=>0.0,:width=>1.0,:height=>1.0}
              @fill_colors = [@opts[:fill_color],nil]
              @outline = {:line_type=>@opts[:line_type],:color=>@opts[:outline_color]}
              @degree = {:in=>0,:out=>0}
            end
            def clone_params
              {
                :label=>text,
                :fill_color=>@fill_colors.first,
                :form=>@form,
                :height=>@geometry[:height],
                :width=>@geometry[:width]
              }
            end
          end
          class ShapeNode < BaseNode
            attr_accessor :form
            def initialize(opts={})
              super
              @form = @opts[:form]
              @geometry[:height] = @opts[:height]
              @geometry[:width] = @opts[:width]
            end
  • 20. To subclass or not to subclass? Use mixins!
      require 'bigdecimal/math' # provides BigMath.log for the fallback below

      class Array
        # Mean of the elements; complex values contribute their real part.
        def arithmetic_mean
          self.inject(0.0) { |sum,x| x = x.real if x.is_a?(Complex); sum + x.to_f } / self.length.to_f
        end
        # Geometric mean of the positive elements, computed via logarithms.
        def geometric_mean
          begin
            Math.exp(self.select { |x| x > 0.0 }.collect { |x| Math.log(x) }.arithmetic_mean)
          rescue Errno::ERANGE
            Math.exp(self.select { |x| x > 0.0 }.collect { |x| BigMath.log(x,50) }.arithmetic_mean)
          end
        end
        # Median of the elements; sorts a copy first so unsorted input is handled.
        def median
          sorted = self.sort
          if sorted.length.odd?
            sorted[sorted.length / 2]
          else
            upper_median = sorted[sorted.length / 2]
            lower_median = sorted[(sorted.length / 2) - 1]
            [upper_median,lower_median].arithmetic_mean
          end
        end
        # Sample standard deviation (divides by n - 1).
        def standard_deviation
          mean = self.arithmetic_mean
          deviations = self.map { |x| x - mean }
          sqr_deviations = deviations.map { |x| x**2 }
          sum_sqr_deviations = sqr_deviations.inject(0.0) { |sum,x| sum + x }
          Math.sqrt(sum_sqr_deviations/(self.length - 1).to_f)
        end
        alias_method :sd, :standard_deviation
        def shuffle
          sort_by { rand }
        end
        def shuffle!
          self.replace shuffle
        end
      end
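The slide above reopens `Array` directly. A mixin in the stricter sense packages the same kind of methods in a module that any enumerable class can include. The following is a minimal sketch only; the module name `DescriptiveStats` and the class `ExpressionProfile` are hypothetical and do not come from the talk.

```ruby
# Hypothetical mixin: works with any class that includes Enumerable.
module DescriptiveStats
  def arithmetic_mean
    values = to_a
    values.sum(0.0) / values.length
  end

  def median
    sorted = to_a.sort
    mid = sorted.length / 2
    sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

# Hypothetical domain class that gains the statistics without subclassing Array.
class ExpressionProfile
  include Enumerable
  include DescriptiveStats

  def initialize(levels)
    @levels = levels
  end

  def each(&block)
    @levels.each(&block)
  end
end

profile = ExpressionProfile.new([2.0, 8.0, 4.0, 6.0])
puts profile.arithmetic_mean
puts profile.median
```

The same module could also be mixed into `Array` itself with `Array.include(DescriptiveStats)`, which keeps the extension in one named, reusable place.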
  • 21. Documenting code sucks! Automate it.
      • Come up with a convention for your “headers”
      • Use automated documentation generation tools
          • JavaDoc
          • RDoc
          • Pydoc / Epydoc
      • Save code snippets in a searchable repository
  • 22. A little performance optimization goes a long way
      • General tools
          • DTrace
          • strace
          • gdb
      • Language specific
          • Ruby-prof
          • Psyco/Pyrex
          • JBoss Profiler/JIT
  • 24.
      # Copyright © 1996-2007 SRI International, Marine Biological Laboratory, DoubleTwist Inc.,
      # The Institute for Genomic Research, J. Craig Venter Institute, University of California at San Diego, and UNAM. All Rights Reserved.
      #
      #
      # Please see the license agreement regarding the use of and distribution of this file.
      # The format of this file is defined at http://bioinformatics.ai.sri.com/ptools/flatfile-format.html .
      #
      # Species: E. coli K-12
      # Database: EcoCyc
      # Version: 11.5
      # File Name: dnabindsites.dat
      # Date and time generated: August 6, 2007, 17:32:33
      #
      # Attributes:
      #    UNIQUE-ID
      #    TYPES
      #    COMMON-NAME
      #    ABS-CENTER-POS
      #    APPEARS-IN-BINDING-REACTIONS
      #    CITATIONS
      #    COMMENT
      #    COMPONENT-OF
      #    COMPONENTS
      #    CREDITS
      #    DATA-SOURCE
      #    DBLINKS
      #    INSTANCE-NAME-TEMPLATE
      #    INVOLVED-IN-REGULATION
      #    LEFT-END-POSITION
      #    REGULATED-PROMOTER
      #    RELATIVE-CENTER-DISTANCE
      #    RIGHT-END-POSITION
      #    SYNONYMS
      #
      UNIQUE-ID - BS86
      TYPES - DNA-Binding-Sites
      ABS-CENTER-POS - 4098761
      CITATIONS - 94018613
      CITATIONS - 94018613:EV-EXP-IDA-BINDING-OF-CELLULAR-EXTRACTS:3310246267:martin
      CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
      COMPONENT-OF - TU00064
      INVOLVED-IN-REGULATION - REG0-5521
      TYPE-OF-EVIDENCE - :BINDING-OF-CELLULAR-EXTRACTS
      //
  • 25. If you can represent most of your data as key/value pairs, then at the very least use a BerkeleyDB http://www.oracle.com/technology/products/berkeley-db/index.html
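To make the key/value idea concrete, here is a rough sketch that turns attribute-value records like the EcoCyc excerpt on the previous slide into a lookup keyed by `UNIQUE-ID`. The parsing rules (`KEY - value` lines, `#` comments, `//` as record terminator) are inferred from that excerpt alone and are an assumption, not the full file-format specification. The helper name `parse_flatfile` is hypothetical, and a plain in-memory `Hash` stands in for BerkeleyDB.

```ruby
# Parse attribute-value records ("KEY - value", terminated by "//") into a
# Hash keyed by UNIQUE-ID. Repeated attributes are collected into arrays.
# Assumption: the rules are inferred from the slide excerpt only.
def parse_flatfile(text)
  records = {}
  current = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")
    if line == "//"
      id = current["UNIQUE-ID"].first
      records[id] = current unless id.nil?
      current = Hash.new { |hash, key| hash[key] = [] }
    else
      key, value = line.split(" - ", 2)
      next if value.nil? # skip lines that are not KEY - value pairs
      current[key] << value
    end
  end
  records
end

sample = <<~DATA
  # Species: E. coli K-12
  UNIQUE-ID - BS86
  TYPES - DNA-Binding-Sites
  ABS-CENTER-POS - 4098761
  CITATIONS - 94018613
  CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
  COMPONENT-OF - TU00064
  //
DATA

sites = parse_flatfile(sample)
puts sites["BS86"]["ABS-CENTER-POS"].first
puts sites["BS86"]["CITATIONS"].length
```

Once records are addressable by key, swapping the `Hash` for a persistent key/value store is mostly a matter of serializing each record's value.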
  • 26. In most cases a relational database is an appropriate choice for bioinformatics data
      • Clean and consolidated (vs. a rat's nest of files and folders)
      • Improved performance (memory usage and file I/O)
      • Data consistency through constraints and transactions
      • Easily portable (SQL92 standard)
      • Querying (asking questions about data) vs. parsing (reading and loading data)
      • Commonly used data processing functions can be implemented as stored procedures
  • 27. “But I’m a scientist, not a DBA! Harrumph!” http://www.sqlite.org “...SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine...”
  • 28. But seriously, don’t write any SQL (What?) [Diagram labels: Relational Database (MySQL, PostgreSQL, Oracle, etc.); Object Relational Mapper (ORM); Model Instance]
  • 29. Beyond the RDBMS http://strokedb.com/ http://incubator.apache.org/couchdb http://www.hypertable.org
  • 31. Loosely Coupled vs. Tightly Coupled
      • Loosely Coupled
          • Each task is independent
          • No synchronous inter-task communication
          • Example: Computing a Maximum Likelihood Phylogeny for every gene family in the Panther Database
          • Software: OpenPBS, SGE, Xgrid, PlatformLSF
      • Tightly Coupled
          • Tasks are interdependent
          • Synchronous inter-task communication via messaging interface
          • Example: Monte Carlo simulation of 3D protein interactions in cytoplasm
          • Software: OpenMPI, MPICH, PVM
  • 32. Use your idle CPU cores!
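For loosely coupled work, one simple way to occupy idle cores from a script is to fork worker processes and collect their results over pipes. The sketch below is illustrative only: the helper `parallel_map` is hypothetical, it assumes a Unix-like system where `fork` is available, and the GC-fraction calculation is just a stand-in for a real per-item task.

```ruby
# Hypothetical helper: split the work into slices, run each slice in a forked
# child process, and send results back to the parent through a pipe.
# Assumes a Unix-like OS (fork is unavailable on Windows).
def parallel_map(items, workers: 2)
  return [] if items.empty?

  slice_size = (items.size / workers.to_f).ceil
  children = items.each_slice(slice_size).map do |slice|
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode
    pid = fork do
      reader.close
      writer.write(Marshal.dump(slice.map { |item| yield item }))
      writer.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end
    writer.close
    [pid, reader]
  end

  # Read each child's results in slice order, then reap the child.
  children.flat_map do |pid, reader|
    data = reader.read
    reader.close
    Process.wait(pid)
    Marshal.load(data)
  end
end

sequences = %w[ATGCGC GGGCCC ATATAT GCGCAT]
gc_fractions = parallel_map(sequences, workers: 2) do |seq|
  seq.count("GC") / seq.length.to_f # fraction of G or C bases
end
puts gc_fractions.inspect
```

Processes rather than threads are used here because each forked child can run on its own core regardless of interpreter-level locking; the cost is that results must be serialized back to the parent.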
  • 33. Start thinking in terms of MapReduce (old hat for Lisp programmers!) Image source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
  • 34.
      map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, "1");

      reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
      [1]
  • 35.
      map(String key, String value):
        // key: Sequence alignment file name
        // value: multiple alignment
        for each exon w in value:
          EmitIntermediate(w, CpGIndex);

      reduce(String key, Iterator values):
        // key: an exon
        // values: a list of CpG Index Values
        int result = 0;
        for each i in values:
          result += ParseInt(i);
        Emit(AsString(result/length(values)));
      [1]
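The two pseudocode slides share one shape: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step folds each group. A single-process Ruby sketch of that flow follows. The helper `map_reduce`, the lambda names, and all sample data are hypothetical; a real framework would distribute the map and reduce calls across machines.

```ruby
# Single-process illustration of the map -> group -> reduce flow.
# map_fn returns an array of [key, value] pairs for one input;
# reduce_fn folds all values that share a key.
def map_reduce(inputs, map_fn, reduce_fn)
  intermediate = inputs.flat_map { |key, value| map_fn.call(key, value) }
  grouped = intermediate.group_by(&:first)
  grouped.map { |key, pairs| [key, reduce_fn.call(key, pairs.map(&:last))] }.to_h
end

# Word count, as on the first pseudocode slide.
word_count_map = ->(_name, contents) { contents.split.map { |word| [word, 1] } }
word_count_reduce = ->(_word, counts) { counts.sum }
documents = { "doc1" => "gene exon gene", "doc2" => "exon intron" }
counts = map_reduce(documents, word_count_map, word_count_reduce)
puts counts.inspect

# Mean per key, mirroring the exon example; the index values are made up.
cpg_map = ->(_file, exon_values) { exon_values.to_a }
mean_reduce = ->(_exon, values) { values.sum / values.length.to_f }
alignments = {
  "aln1" => { "exon1" => 0.5, "exon2" => 1.0 },
  "aln2" => { "exon1" => 1.5 }
}
means = map_reduce(alignments, cpg_map, mean_reduce)
puts means.inspect
```

Only the two lambdas change between the examples, which is the point of the pattern: the grouping and distribution logic is written once and reused.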
  • 37. MapReduce Implementations http://hadoop.apache.org/core/ http://skynet.rubyforge.org/ http://discoproject.org/ http://labs.trolltech.com/page/Projects/Threads/QtConcurrent
  • 40. Graphics Processing Unit (GPU): Not just fun and games
  • 42. GPU Programming is Getting Easier
      • Compute Unified Device Architecture (CUDA): http://www.nvidia.com/cuda
      • OpenCL: http://s08.idav.ucdavis.edu/munshi-opencl.pdf
  • 44. Field Programmable Gate Arrays (FPGA)
  • 45. Field Programmable Gate Arrays (FPGA)
  • 47. Data Interchange Formats
      • JSON
      • YAML
      • XML
      • Microformats
      • RDF
  • 48.
      person = {
        "name": "Joel Dudley",
        "age": 32,
        "height": 1.83,
        "urls": [
          "http://www.joeldudley.com/",
          "http://www.linkedin.com/in/joeldudley"
        ]
      }

      VS.

      <person>
        <name>Joel Dudley</name>
        <age>32</age>
        <height>1.83</height>
        <urls>
          <url>http://www.joeldudley.com/</url>
          <url>http://www.linkedin.com/in/joeldudley</url>
        </urls>
      </person>
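One practical consequence of the comparison above is that JSON maps directly onto a language's native hashes, arrays, numbers, and strings. A small round-trip sketch using Ruby's standard `json` library, with the record from the slide as sample data:

```ruby
require "json"

person = {
  "name" => "Joel Dudley",
  "age" => 32,
  "height" => 1.83,
  "urls" => [
    "http://www.joeldudley.com/",
    "http://www.linkedin.com/in/joeldudley"
  ]
}

encoded = JSON.generate(person) # Hash -> JSON text
decoded = JSON.parse(encoded)   # JSON text -> Hash with native types

puts encoded
puts decoded["age"] + 1         # numbers come back as numbers, not strings
puts decoded["urls"].length
```

With XML, the equivalent round trip typically needs an explicit step to convert element text such as `32` back into a number.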
  • 49. Web Services
      • Remote Procedure Call (RPC)
      • Representational State Transfer (ReST)
      • SOAP
      • ActiveResource Pattern
  • 50.
      class Video < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"

        ## To search by categories and tags
        def self.search_by_tags(*options)
          from_urls = []
          if options.last.is_a? Hash
            excludes = options.slice!(options.length-1)
            if excludes[:exclude].kind_of? Array
              from_urls << excludes[:exclude].map{|keyword| "-"+keyword}.join("/")
            else
              from_urls << "-"+excludes[:exclude]
            end
          end
          from_urls << options.find_all{|keyword| keyword =~ /^[a-z]/}.join("/")
          from_urls << options.find_all{|category| category =~ /^[A-Z]/}.join("%7C")
          from_urls.delete_if {|x| x.empty?}
          self.find(:all,:from=>"/feeds/api/videos/-/"+from_urls.reverse.join("/"))
        end
      end

      class User < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"
      end

      class Standardfeed < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"
      end

      class Playlist < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"
      end
  • 51.
      search = Video.find(:first, :params => {:vq => 'ruby', :"max-results" => '5'})
      puts search.entry.length

      ## video information of id = ZTUVgYoeN_o
      vid = Video.find("ZTUVgYoeN_o")
      puts vid.group.content[0].url

      ## video comments
      comments = Video.find_custom("ZTUVgYoeN_o").get(:comments)
      puts comments.entry[0].link[2].href

      ## searching with category/tags
      results = Video.search_by_tags("Comedy")
      puts results[0].entry[0].title

      # more examples:
      # Video.search_by_tags("Comedy", "dog")
      # Video.search_by_tags("News","Sports","football", :exclude=>"soccer")
  • 53. Be Agile
      Manifesto for Agile Software Development
      We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:
      • Individuals and interactions over processes and tools
      • Working software over comprehensive documentation
      • Customer collaboration over contract negotiation
      • Responding to change over following a plan
      That is, while there is value in the items on the right, we value the items on the left more.
      http://agilemanifesto.org/
  • 54. Be Agile
      As a [role], I want to [goal], so I can [reason].
      [Diagram labels: Storyboard; Iterate!; Feedback; Acceptance Testing; Unit Testing]
  • 55. Automate Development http://nant.sourceforge.net/ http://www.scons.org/ http://www.capify.org/
  • 56. Lightweight Tools for Project Management
  • 57. Closing Remarks
      • Focus on the goal (Biology/Medicine)
      • Don’t be clever (you’ll trick yourself)
      • Value your time
      • Outsource everything but genius
      • Use the tools available to you
      • Have fun!