30th Annual International IEEE EMBS Conference
Vancouver, British Columbia, Canada, August 20-24, 2008




Intelligible machine learning with malibu

Robert E. Langlois and Hui Lu

R. E. Langlois is with the Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA (ezra@uic.edu). H. Lu is with the Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA (huilu@uic.edu). This work is partially supported by the NIH.

Abstract— malibu is an open-source machine learning workbench developed in C/C++ for high-performance real-world applications, namely bioinformatics and medical informatics. It leverages third-party machine learning implementations for more robust, bug-free software. This workbench handles several well-studied supervised machine learning problems including classification, regression, importance-weighted classification and multiple-instance learning. The malibu interface was designed to create reproducible experiments ideally run in a remote and/or command-line environment. The software can be found at: http://proteomics.bioengr.uic.edu/malibu/index.html.

I. INTRODUCTION

Recently, open source software has matured into a solution able to handle complex real-world applications. One such application entails developing implementations for the substantial number of machine learning algorithms available for the numerous problem domains, e.g. bioinformatics problems including protein annotation, microarray analysis and others. Indeed, many such algorithms remain unused due to unavailable or poor implementations. Moreover, many researchers in the scientific community recognize the need for peer-reviewed open-source machine learning software[1]. The advantages of open-source machine learning tools include:
   1) Reproducibility and transparency
   2) Uncovering problems in current algorithms
   3) Building on existing resources
   4) Access to machine learning tools

One of the fundamental benefits of open-source machine learning software is that it facilitates the reproducibility of experimental results. While such experiments are relatively easy to reproduce compared to those in other fields, they often are not reproduced. At the same time, the pressure to publish remains constant, leading to unintentional (or intentional) “cheating”. Access to an implementation of a machine learning algorithm (and ideally the associated dataset, e.g. the UCI repository[2]) could potentially eliminate such cheating or, at a minimum, its ill effects. Likewise, tendering the source of an algorithm allows the community at large to more quickly discover problems in current algorithms on both the conceptual and concrete levels. Similarly, it enables others to build increasingly intricate systems on top of available source code. This could simply mean a better user interface on an existing project or a more powerful protocol to solve a problem in a specific domain. Finally, open-source tools give the wider scientific community full access to machine learning tools that have found application in a wide range of fields. Such access permits both use of the tool and the ability to extend the tool to fit a specific need.

A machine learning workbench provides at a minimum five services beyond a standard machine learning tool:
   1) Learning algorithms
   2) Learning evaluation
   3) Dataset preprocessing
   4) Interface unification
   5) Extensible bindings

In essence, a machine learning workbench provides a unified interface to a number of learning algorithms and, ideally, handles more than one type of machine learning problem. For example, a supervised machine learning workbench should support a number of classifiers, including decision trees and support vector machines, as well as algorithms for other problems such as calibration methods for probabilistic regression. Likewise, a workbench should provide stock tools to handle common tasks such as dataset preprocessing or learning evaluation. For a supervised machine learning workbench, these stock tools should include metrics to measure performance, algorithms to perform cross-validation and tools to perform discretization or normalization. Finally, a workbench should be extensible, providing the ability to add or support new algorithms as well as tying into some scripting language.

A number of machine learning workbenches have been developed in programming languages including C/C++, Java, Python and Matlab. Indeed, the programming language characterizes fundamental properties of the corresponding software. That is, tools based in C/C++ are more difficult to develop yet utilize computing resources more efficiently. Java-, Matlab- and Python-based tools are easier to develop and deploy but require an interpreter and garbage collector. Java- and Delphi-based tools support a rich set of libraries enabling complex graphical user interfaces. Finally, Matlab-based tools enjoy a rich set of statistical and optimization routines, providing the ability to quickly prototype many learning algorithms. Using these languages, a number of workbenches have been developed. One of the most popular is WEKA[3], a Java-based workbench that supports a large number of supervised and unsupervised machine learning algorithms. This workbench has been extended and modified by a number of projects, most notably by RapidMiner[4], a workbench focused on fast prototyping and data visualization. Likewise, a number of workbenches have been developed in C/C++, including Shogun[5], Elefant[6], MLC++[7], Orange[8] and Torch[9].

Shogun, Orange and Elefant support Python bindings enabling efficient machine learning workflows. Matlab has also proven an excellent platform for machine learning with its own considerable statistical and machine learning libraries; it has been further extended by Spider[10] to better handle a large number of machine learning problems including supervised, unsupervised and semi-supervised learning. There has been considerable effort in developing additional open-source machine learning software. To this end, most available workbenches can be found in a peer-reviewed machine learning software repository (http://mloss.org).

The applications of such machine learning software range from facial recognition to medical diagnosis. One clinical application of machine learning is the identification of cancerous tumors using data collected by some imaging modality, e.g. microscopic analysis of cells[11]. Specifically, a machine learning algorithm can segment an image into regions where one may contain a cancerous tumor. A later algorithm can learn features within these regions (e.g. shape of a possible tumor, texture of its edges, level of contrast) to distinguish benign and cancerous tissue. In more recent work, machine learning has found great success in the arena of brain-computer interfaces[12]. Such devices have a number of applications ranging from clinical monitoring of arousal to investigating the workings of the human brain.

In this work, we introduce a new machine learning workbench for bioinformatics tasks. This workbench has been applied to a number of problems ranging from function prediction, e.g. prediction of DNA-binding residues[13], DNA-binding proteins[14], [15] and membrane-binding proteins[16], [15], to structure prediction, e.g. protein folds[17].

II. LEARNING WITH malibu

malibu is an open-source machine learning workbench written in C/C++ and is geared toward supervised learning. The basic design of malibu comprises a hierarchy of C++ template classes that both wrap and extend a core set of classification algorithms. By utilizing proven C++ template meta-programming techniques used in the Boost Libraries (http://www.boost.org) and the Matrix Template Library (http://www.osl.iu.edu/research/mtl/), malibu provides an efficient yet extensible library of algorithms. The core classifiers comprise both third-party tools, e.g. LIBSVM[18], and native implementations[19].

A. Learning algorithms

The malibu workbench currently supports a number of supervised learning problems including classification, meta-classification, importance-weighted classification, regression and multiple-instance learning. A supervised learning problem comprises a set of labeled training examples with the goal of predicting the label of an unseen (and possibly unlabeled) example.

In (binary) classification, the algorithm learns a model from labeled training examples where the label belongs to one of two discrete classes. malibu incorporates a number of third-party and built-in algorithms to handle classification. The third-party classifiers include LIBSVM[18], Cover Tree kNN[20], INDTree[21] and C4.5[22]. The built-in classifiers include the Willow[19] decision tree and ADTree[23]. malibu also supports a number of (binary) meta-classifiers that construct ensembles of classifiers to improve performance, including Bagging[24], Subagging[25], AdaBoost[26], Confidence-rated AdaBoost[26], Gentle AdaBoost[27] and, for the tree-based classifiers, Random Forests[28].

In importance-weighted classification, the algorithm learns a model from training examples labeled by their relative importance, such that a prediction will be biased toward more important training examples. One popular variant is called cost-sensitive classification, where examples are weighted based on their class label. malibu supports both implicit and explicit weighting: implicit weighting is supported by LIBSVM, kNN and Willow, while an explicit method utilizes the Costing wrapper[29] to make any classifier importance-weighted; a sketch of the underlying sampling scheme follows.

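The Costing wrapper rests on cost-proportionate rejection sampling: each example survives with probability proportional to its weight, so an ordinary (unweighted) classifier trained on the resampled data behaves as an importance-weighted one. The following is a minimal sketch of that sampling step, assuming a hypothetical Example record; it is not malibu's actual interface.

    // Minimal sketch of cost-proportionate rejection sampling, the idea
    // behind the Costing wrapper [29]. The Example record is a hypothetical
    // stand-in, not malibu's actual data structure.
    #include <algorithm>
    #include <random>
    #include <vector>

    struct Example { std::vector<double> features; int label; double weight; };

    // Keep each example with probability weight / max_weight; the surviving
    // (now unweighted) sample follows the importance-weighted distribution,
    // so any ordinary classifier trained on it acts cost-sensitively.
    std::vector<Example> rejectionSample(const std::vector<Example>& data,
                                         std::mt19937& rng) {
        double maxW = 0.0;
        for (const Example& e : data) maxW = std::max(maxW, e.weight);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<Example> sample;
        for (const Example& e : data)
            if (maxW > 0.0 && u(rng) <= e.weight / maxW) sample.push_back(e);
        return sample;
    }

Costing repeats this draw several times, trains one classifier per sample and averages their predictions[29].
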
In regression, the algorithm learns a real-valued output from training examples labeled with a real value. One special case of regression is probabilistic regression, where the learning algorithm assigns a probability that an example belongs to a particular class. Similar to importance-weighted classification, malibu supports both implicit and explicit regression: learning algorithms such as LIBSVM, kNN and Willow support regression directly, while for binary classifiers malibu also includes explicit wrappers to extend a classifier to handle probabilistic regression. These wrappers include sigmoid correction[30], isotonic regression[31] and probing[32].

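As a concrete illustration of one such wrapper, sigmoid correction (Platt scaling[30]) maps a raw classifier score f(x) to a probability 1/(1 + exp(A·f(x) + B)), with A and B fit on held-out (score, label) pairs. The sketch below uses plain gradient descent in place of Platt's Newton-style fitting routine, and the names are illustrative rather than malibu's API.

    // Sketch of sigmoid correction (Platt scaling [30]): map a raw score
    // f(x) to P(y=1|x) = 1 / (1 + exp(A*f(x) + B)). A and B are fit on
    // held-out (score, label) pairs by minimizing negative log-likelihood.
    #include <cmath>
    #include <utility>
    #include <vector>

    double sigmoidProb(double score, double A, double B) {
        return 1.0 / (1.0 + std::exp(A * score + B));
    }

    // calib holds (classifier score, label in {0,1}) pairs
    void fitSigmoid(const std::vector<std::pair<double, int>>& calib,
                    double& A, double& B) {
        A = -1.0;            // conventional starting point
        B = 0.0;
        const double lr = 0.01;
        for (int iter = 0; iter < 5000; ++iter) {
            double gA = 0.0, gB = 0.0;
            for (const auto& [f, y] : calib) {
                double p = sigmoidProb(f, A, B);
                gA += (y - p) * f;   // d(-logL)/dA
                gB += (y - p);       // d(-logL)/dB
            }
            A -= lr * gA;
            B -= lr * gB;
        }
    }
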
In multiple-instance learning (MIL), examples are grouped into bags, where the bag, not an individual example, has a label. A bag is positive if at least one instance in the bag is positive; otherwise the bag is negative. In malibu, any binary classifier can be extended to multiple-instance learning by viewing this problem as binary classification with positive class noise; all parameters are selected by estimating bag-level (not instance-level) performance, as in the sketch below. malibu also supports extending a weak classifier to a multiple-instance learner through the AdaBoost.C2MIL wrapper[19].

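The bag-level rule and the bag-level error used for parameter selection reduce to a few lines; here is a minimal sketch with a hypothetical Bag record and classifier callback (not malibu's types).

    // Sketch of the bag-level view in multiple-instance learning: a bag is
    // predicted positive if any of its instances is predicted positive, and
    // parameters are scored on bag-level (not instance-level) error.
    #include <functional>
    #include <vector>

    using Instance = std::vector<double>;
    struct Bag { std::vector<Instance> instances; int label; };  // label in {0,1}

    int predictBag(const Bag& bag,
                   const std::function<int(const Instance&)>& predict) {
        for (const Instance& inst : bag.instances)
            if (predict(inst) == 1) return 1;  // one positive instance suffices
        return 0;                              // otherwise the bag is negative
    }

    double bagLevelError(const std::vector<Bag>& bags,
                         const std::function<int(const Instance&)>& predict) {
        int wrong = 0;
        for (const Bag& b : bags)
            if (predictBag(b, predict) != b.label) ++wrong;
        return bags.empty() ? 0.0 : static_cast<double>(wrong) / bags.size();
    }
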
B. Learning evaluation

Evaluating the performance of a learning algorithm is important both to select the best model and to estimate the performance on an unseen testing dataset. The performance of an algorithm is measured as follows:

   for each partition do
      Train algorithm on one partition
      Evaluate on the other partition
   end for

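Made concrete as k-fold cross-validation, the loop above might look like the following generic sketch, where the training and scoring callbacks stand in for whatever learner is being evaluated (they are hypothetical, not malibu's API).

    // The partition loop above, made concrete as k-fold cross-validation.
    // Only the fold bookkeeping is the point; train/score are stand-ins.
    #include <cstddef>
    #include <vector>

    template <typename Example, typename TrainFn, typename ScoreFn>
    double crossValidate(const std::vector<Example>& data, std::size_t k,
                         TrainFn train, ScoreFn score) {
        double total = 0.0;
        for (std::size_t fold = 0; fold < k; ++fold) {
            std::vector<Example> trainSet, testSet;
            for (std::size_t i = 0; i < data.size(); ++i)
                (i % k == fold ? testSet : trainSet).push_back(data[i]);
            auto model = train(trainSet);    // train on the other partitions
            total += score(model, testSet);  // evaluate on the held-out fold
        }
        return total / static_cast<double>(k);  // average held-out score
    }
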
Learning algorithm performance is usually measured by metrics and/or graphs. A single metric reflects some question about the performance of a learning algorithm, whereas a graph reflects a series of questions.

malibu supports a number of threshold metrics from a tabulated contingency table, which estimate the performance for every problem except regression; it also supports a number of regression and ranking metrics. Likewise, malibu supports a number of graphs including the receiver operating characteristic curve, the cost curve[33], the precision/recall curve, lift curves and reliability diagrams.

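Threshold metrics of this kind all derive from the four cells of a 2x2 contingency (confusion) table; a minimal sketch of a few common ones follows, assuming non-degenerate counts. It is illustrative only, not malibu's API.

    // Threshold metrics derived from a tabulated 2x2 contingency table.
    struct ContingencyTable {
        double tp = 0, fp = 0, tn = 0, fn = 0;

        double accuracy()  const { return (tp + tn) / (tp + fp + tn + fn); }
        double precision() const { return tp / (tp + fp); }
        double recall()    const { return tp / (tp + fn); }  // true-positive rate
        double f1() const {
            double p = precision(), r = recall();
            return 2.0 * p * r / (p + r);
        }
    };
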
Note that malibu provides automated model selection for every learning algorithm using the previously described evaluation metrics and the dataset partitioning algorithms introduced in the next section.

C. Dataset preprocessing

Preprocessing a dataset is a critical step for many machine learning algorithms, e.g. normalization of attributes for distance-based methods such as SVM. Moreover, preprocessing also includes algorithms that partition the dataset for model evaluation. malibu comprises a number of algorithms to transform a dataset into an appropriate format, such as normalization for distance-based methods, nominal-to-binary conversion for distance-based methods, and discretization to speed up sorting-based methods. Likewise, malibu includes partitioning methods such as cross-validation, bootstrapping, holdout and progressive validation[34]. Each of these methods has various advantages and disadvantages. Holdout requires a large amount of data but is the best understood. For smaller datasets, cross-validation, progressive validation and bootstrapping are more appropriate, with cross-validation being the most widely used method.

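As one concrete preprocessing step, min-max normalization rescales each attribute column to [0,1] so that no single attribute dominates a distance computation. A minimal sketch of the assumed behavior (not malibu's implementation):

    // Min-max normalization of one attribute column to [0,1].
    #include <algorithm>
    #include <vector>

    void minMaxNormalize(std::vector<double>& column) {
        if (column.empty()) return;
        auto [lo, hi] = std::minmax_element(column.begin(), column.end());
        double range = *hi - *lo;
        if (range == 0.0) return;  // constant attribute: leave unchanged
        for (double& v : column) v = (v - *lo) / range;
    }
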
D. Interface unification

The interface to a machine learning algorithm includes setting parameters, reading datasets, outputting results and writing models. Setting parameters in malibu can be accomplished using either command-line arguments or a configuration file, where a subgroup of arguments can be written to and read from a file. The parameter system also supports implicit configuration files depending on the name of the learning algorithm, where command-line parameters override configuration files which, in turn, override implicit configuration files. The dataset format supported by malibu is a standard tab/comma/space-delimited file in which every example is delimited by line separators. Indeed, the format allows changes in class position, existence of a header, index of the bag label or number of prefixing labels; a sketch of parsing such a line appears below.

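To illustrate the flexibility described above, the following sketch parses one delimited line with a configurable label column; the record type and function are hypothetical, not malibu's actual parser.

    // Read one line of a tab/comma/space-delimited example with a
    // configurable label column.
    #include <cstddef>
    #include <sstream>
    #include <string>
    #include <vector>

    struct ParsedExample { std::string label; std::vector<double> features; };

    ParsedExample parseLine(const std::string& line, std::size_t labelColumn) {
        std::string normalized = line;
        for (char& c : normalized)
            if (c == ',' || c == '\t') c = ' ';  // unify the three delimiters
        std::istringstream in(normalized);
        ParsedExample ex;
        std::string field;
        for (std::size_t i = 0; in >> field; ++i) {
            if (i == labelColumn) ex.label = field;
            else ex.features.push_back(std::stod(field));
        }
        return ex;
    }
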
When a model is applied to a test set, a malibu learning algorithm writes predictions to the standard output. It also outputs statistics describing a training and/or testing set as well as a copy of the configuration file. Finally, malibu supports writing out the models of learning algorithms in an ASCII format.

E. Extensible bindings

A workbench may interface with (or bind) another software tool through two mechanisms: tight-binding and loose-binding. In tight-binding, the workbench makes a function call to some library; in loose-binding, the workbench writes out a file in a format supported by another tool. Currently malibu supports loose-binding to web browsers, LaTeX, GNUPLOT (http://www.gnuplot.info/) and Graphviz (http://www.graphviz.org/). That is, the metrics describing the performance of a learning algorithm can be written out in both the HTML and LaTeX formats. Similarly, the performance can also be written out as a plot in the GNUPLOT format. Finally, the models describing the tree-based learning algorithms can be written out as graphs in the Graphviz DOT format.

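Writing a tree model as a DOT graph is a small amount of code; the sketch below uses a hypothetical Node struct rather than malibu's tree representation, and emits output the dot tool can render.

    // Loose-binding via Graphviz: dump a binary decision tree as DOT.
    #include <initializer_list>
    #include <ostream>
    #include <string>

    struct Node {
        std::string label;  // e.g. a split test "x3 <= 0.5" or a leaf class
        const Node* left = nullptr;
        const Node* right = nullptr;
    };

    void writeNode(const Node* n, std::ostream& out, int& id) {
        int self = id++;
        out << "  n" << self << " [label=\"" << n->label << "\"];\n";
        for (const Node* child : {n->left, n->right}) {
            if (!child) continue;
            int childId = id;  // the id the child is about to receive
            writeNode(child, out, id);
            out << "  n" << self << " -> n" << childId << ";\n";
        }
    }

    void exportTree(const Node& root, std::ostream& out) {
        out << "digraph tree {\n";
        int id = 0;
        writeNode(&root, out, id);
        out << "}\n";
    }
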
III. CONCLUSIONS AND FUTURE WORKS

A. Conclusions

The maturity of open source software, in conjunction with the present need for robust implementations of machine learning algorithms, has given rise to significant efforts in developing large-scale workbenches. However, no single workbench is comprehensive in its coverage of machine learning algorithms, nor does every workbench provide an optimal set of features. malibu is a high-performance machine learning workbench developed to extend classifiers to handle classification as well as other problem domains, namely regression, importance-weighted classification and multiple-instance learning. It also satisfies the basic criteria of a workbench by providing a unified user interface, dataset preprocessing algorithms, learning algorithms and bindings to other tools to facilitate learning.

The primary contribution of the malibu workbench is improved usability for a more computer-scientist-oriented user group. That is, malibu is written in ANSI C++ and has been extensively tested in Windows and Unix-like environments. By downloading binary files rather than interpreted code, malibu does not require the user to learn how to use a Java interpreter (e.g. how to increase available memory) or a Matlab interpreter (e.g. how to program in Matlab). It supports a number of dataset formats, removing from the user the burden of creating scripts to format a dataset. Similarly, it provides a number of standard model selection and evaluation algorithms often missing from third-party code (e.g. Cover Tree). malibu also provides a configuration file, which allows users to modify arguments in an environment that provides additional information about each command. Finally, malibu provides bindings for third-party tools to generate graphs and plots. Another contribution includes the implementation of new algorithms (e.g. AdaBoost.C2MIL) as well as the extension of any algorithm to new problem domains (e.g. classifiers to multiple-instance learning).

B. Future Works

At the same time, malibu (like most available software) is a work in progress. One direction of development is to scale the workbench up to distributed computing: model selection and validation can be distributed via the Message Passing Interface (MPI) to multiple CPUs and machines, as sketched below. Another direction will focus on developing stronger bindings between key software packages. A scripting language such as Python is better suited to selecting objects, extracting features and tying in other applications. A final direction will be to assemble more classifiers, including Naïve Bayes and logistic regression, as well as more learning strategies such as multi-class classification, multi-part learning and structured prediction.

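A minimal sketch of that MPI direction: distribute the folds round-robin across ranks and reduce the scores on rank 0. The evaluateFold() placeholder stands in for training and scoring one partition; nothing here is existing malibu code.

    // Distribute cross-validation folds over MPI ranks (proposed direction).
    #include <cstdio>
    #include <mpi.h>

    double evaluateFold(int fold) {
        (void)fold;
        return 0.0;  // placeholder: train on k-1 partitions, score the k-th
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int k = 10;
        double local = 0.0;
        for (int fold = rank; fold < k; fold += size)  // this rank's folds
            local += evaluateFold(fold);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("mean held-out score: %f\n", total / k);

        MPI_Finalize();
        return 0;
    }
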
IV. ACKNOWLEDGMENTS

This work is partially supported by NIH P01 AI060915 (H.L.). R.E.L. acknowledges the support from NIH training grant T32 HL 07692: Cellular Signaling in Cardiovascular System (P.I. John Solaro).

REFERENCES

[1] S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Muller, F. Pereira, C. E. Rasmussen, G. Ratsch, B. Scholkopf, A. Smola, P. Vincent, J. Weston, and R. Williamson, “The need for open source software in machine learning,” Journal of Machine Learning Research, vol. 8, pp. 2443–2466, Oct 2007.
[2] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005, http://www.cs.waikato.ac.nz/ml/weka/.
[4] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid prototyping for complex data mining tasks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 12, Philadelphia, USA, 2006.
[5] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. 7, pp. 1531–1565, July 2006, http://www.shogun-toolbox.org/.
[6] K. Gawande, C. Webers, A. J. Smola, and S. Vishwanathan, “Elefant: A python machine learning toolbox,” in SciPy Conference, 2007.
[7] R. Kohavi, D. Sommerfield, and J. Dougherty, “Data mining using MLC++, a machine learning library in C++,” in International Conference on Tools with Artificial Intelligence, vol. 8. Toulouse, France: IEEE Computer Society, 1996, p. 234, http://www.sgi.com/tech/mlc/.
[8] J. Demšar, B. Zupan, G. Leban, and T. Curk, “Orange: From experimental machine learning to interactive data mining,” in Knowledge Discovery in Databases: PKDD 2004, ser. Lecture Notes in Computer Science. Berlin/Heidelberg: Springer, 2004, vol. 3202, pp. 537–539.
[9] R. Collobert, S. Bengio, and J. Mariethoz, “Torch: A modular machine learning software library,” IDIAP Research Institute, Tech. Rep. IDIAP-RR 02-46, 2002, http://www.torch.ch/.
[10] J. Weston, A. Elisseeff, G. BakIr, and F. Sinz, “SPIDER: Object oriented machine learning library,” 2003, http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html.
[11] J. Mohr and K. Obermayer, “A topographic support vector machine: Classification using local label configurations,” in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, pp. 929–936.
[12] K.-R. Müller, M. Tangermann, G. Dornhege, M. Krauledat, G. Curio, and B. Blankertz, “Machine learning for real-time single-trial EEG-analysis: From brain-computer interfacing to mental state monitoring,” J. Neurosci. Methods, vol. 167, no. 1, pp. 82–90, 2008.
[13] N. Bhardwaj and H. Lu, “Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions,” FEBS Letters, vol. 581, no. 5, pp. 1058–1066, 2007.
[14] N. Bhardwaj, R. E. Langlois, G. Zhao, and H. Lu, “Kernel-based machine learning protocol for predicting DNA-binding proteins,” Nucleic Acids Research, vol. 33, no. 20, pp. 6486–6493, 2005.
[15] R. Langlois, M. Carson, N. Bhardwaj, and H. Lu, “Learning to translate sequence and structure to function: Identifying DNA binding and membrane binding proteins,” Annals of Biomedical Engineering, vol. 35, no. 6, pp. 1043–1052, 2007.
[16] N. Bhardwaj, R. V. Stahelin, R. E. Langlois, W. Cho, and H. Lu, “Structural bioinformatics prediction of membrane-binding proteins,” Journal of Molecular Biology, vol. 359, no. 2, pp. 486–495, 2006.
[17] R. E. Langlois, A. Diec, O. Perisic, Y. Dai, and H. Lu, “Improved protein fold assignment using support vector machines,” International Journal of Bioinformatics Research and Applications, vol. 1, no. 3, pp. 319–334, 2006.
[18] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[19] R. E. Langlois, “Machine learning in bioinformatics: Algorithms, implementations and applications,” Ph.D. Thesis, University of Illinois at Chicago, Chicago, IL, USA, 2008.
[20] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in International Conference on Machine Learning, vol. 148. Pittsburgh, Pennsylvania: ACM, 2006, pp. 97–104.
[21] W. Buntine, “Learning classification trees,” Statistics and Computing, vol. 2, no. 2, pp. 63–73, 1992.
[22] J. R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77–90, 1996.
[23] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” in International Conference on Machine Learning, vol. 16, Bled, Slovenia, 1999.
[24] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[25] P. Buhlmann, “Bagging, subagging and bragging for improving some prediction algorithms,” in Recent Advances and Trends in Nonparametric Statistics, M. G. Akritas and D. N. Politis, Eds. North Holland: Elsevier, 2003, pp. 19–34.
[26] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[27] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[28] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[29] B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in IEEE International Conference on Data Mining, vol. 3, Melbourne, Florida, 2003, p. 435.
[30] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers, P. J. Bartlett, B. Scholkopf, D. Schuurmans, and A. J. Smola, Eds. Boston: MIT Press, 1999, pp. 61–74.
[31] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate multiclass probability estimates,” in Special Interest Group on Knowledge Discovery and Data Mining, vol. 8. Edmonton, Alberta, Canada: ACM Press, 2002, pp. 694–699.
[32] J. Langford and B. Zadrozny, “Estimating class membership probabilities using classifier learners,” in International Workshop on Artificial Intelligence and Statistics, vol. 10, Barbados, 2005.
[33] C. Drummond and R. C. Holte, “Cost curves: An improved method for visualizing classifier performance,” Machine Learning, vol. 65, no. 1, pp. 95–130, 2006.
[34] A. Blum, A. Kalai, and J. Langford, “Beating the hold-out: Bounds for k-fold and progressive cross-validation,” in COLT: Computational Learning Theory, vol. 12. Santa Cruz, California: ACM, 1999, pp. 203–208.

Weitere ähnliche Inhalte

Ähnlich wie Intelligible Machine Learning with Malibu for bioinformatics ...

Large-Scale Machine Learning at Twitter
Large-Scale Machine Learning at TwitterLarge-Scale Machine Learning at Twitter
Large-Scale Machine Learning at Twitternep_test_account
 
Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)Logilab
 
5212303961620480 1585670953 joanna_stachera_proposal_g_soc2020
5212303961620480 1585670953 joanna_stachera_proposal_g_soc20205212303961620480 1585670953 joanna_stachera_proposal_g_soc2020
5212303961620480 1585670953 joanna_stachera_proposal_g_soc2020JoannaStachera1
 
CE-LEARNING-CTS2016_paper_5
CE-LEARNING-CTS2016_paper_5CE-LEARNING-CTS2016_paper_5
CE-LEARNING-CTS2016_paper_5Manoj Kumar
 
Deep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoDeep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoVincenzo Lomonaco
 
MultiObjective(11) - Copy
MultiObjective(11) - CopyMultiObjective(11) - Copy
MultiObjective(11) - CopyAMIT KUMAR
 
Final Total Preliminary Report
Final Total Preliminary ReportFinal Total Preliminary Report
Final Total Preliminary ReportMrugen Deshmukh
 
MACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXMACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXmlaij
 
Applying Machine Learning to Software Clustering
Applying Machine Learning to Software ClusteringApplying Machine Learning to Software Clustering
Applying Machine Learning to Software Clusteringbutest
 
USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...
USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...
USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...ijseajournal
 
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...Kim Daniels
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)Tao Xie
 
Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...IJECEIAES
 
A Few Useful Things to Know about Machine Learning
A Few Useful Things to Know about Machine LearningA Few Useful Things to Know about Machine Learning
A Few Useful Things to Know about Machine Learningnep_test_account
 
An Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersAn Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersIJAEMSJORNAL
 
A Development Shell For Cooperative Problem-Solving Environments
A Development Shell For Cooperative Problem-Solving EnvironmentsA Development Shell For Cooperative Problem-Solving Environments
A Development Shell For Cooperative Problem-Solving EnvironmentsJody Sullivan
 
A Computational Framework for Multi-dimensional Context-aware Adaptation
A Computational Framework for Multi-dimensional Context-aware AdaptationA Computational Framework for Multi-dimensional Context-aware Adaptation
A Computational Framework for Multi-dimensional Context-aware AdaptationSerenoa Project
 

Ähnlich wie Intelligible Machine Learning with Malibu for bioinformatics ... (20)

Large-Scale Machine Learning at Twitter
Large-Scale Machine Learning at TwitterLarge-Scale Machine Learning at Twitter
Large-Scale Machine Learning at Twitter
 
Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)
 
5212303961620480 1585670953 joanna_stachera_proposal_g_soc2020
5212303961620480 1585670953 joanna_stachera_proposal_g_soc20205212303961620480 1585670953 joanna_stachera_proposal_g_soc2020
5212303961620480 1585670953 joanna_stachera_proposal_g_soc2020
 
CE-LEARNING-CTS2016_paper_5
CE-LEARNING-CTS2016_paper_5CE-LEARNING-CTS2016_paper_5
CE-LEARNING-CTS2016_paper_5
 
Deep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoDeep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with Theano
 
Machine learning
Machine learningMachine learning
Machine learning
 
MultiObjective(11) - Copy
MultiObjective(11) - CopyMultiObjective(11) - Copy
MultiObjective(11) - Copy
 
C2-4-Putchala
C2-4-PutchalaC2-4-Putchala
C2-4-Putchala
 
Final Total Preliminary Report
Final Total Preliminary ReportFinal Total Preliminary Report
Final Total Preliminary Report
 
MACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXMACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOX
 
Applying Machine Learning to Software Clustering
Applying Machine Learning to Software ClusteringApplying Machine Learning to Software Clustering
Applying Machine Learning to Software Clustering
 
Data mining weka
Data mining wekaData mining weka
Data mining weka
 
USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...
USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...
USABILITY EVALUATION OF A CONTROL AND PROGRAMMING ENVIRONMENT FOR PROGRAMMING...
 
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
A Web Services Based Framework For Uniform Integration Of Command-Line Bioinf...
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)
 
Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...
 
A Few Useful Things to Know about Machine Learning
A Few Useful Things to Know about Machine LearningA Few Useful Things to Know about Machine Learning
A Few Useful Things to Know about Machine Learning
 
An Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their ClassifiersAn Overview of Supervised Machine Learning Paradigms and their Classifiers
An Overview of Supervised Machine Learning Paradigms and their Classifiers
 
A Development Shell For Cooperative Problem-Solving Environments
A Development Shell For Cooperative Problem-Solving EnvironmentsA Development Shell For Cooperative Problem-Solving Environments
A Development Shell For Cooperative Problem-Solving Environments
 
A Computational Framework for Multi-dimensional Context-aware Adaptation
A Computational Framework for Multi-dimensional Context-aware AdaptationA Computational Framework for Multi-dimensional Context-aware Adaptation
A Computational Framework for Multi-dimensional Context-aware Adaptation
 

Mehr von butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mehr von butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Intelligible Machine Learning with Malibu for bioinformatics ...

  • 1. 30th Annual International IEEE EMBS Conference Vancouver, British Columbia, Canada, August 20-24, 2008 Intelligible machine learning with malibu Robert E. Langlois and Hui Lu Abstract— malibu is an open-source machine learning work- project or a more powerful protocol to solve a problem in bench developed in C/C++ for high-performance real-world a specific domain. Finally, open-source tools give the wider applications, namely bioinformatics and medical informatics. scientific community full access to machine learning tools It leverages third-party machine learning implementations for more robust bug free software. This workbench handles that have found application in a wide range of fields. Such several well-studied supervised machine learning problems access permits both use of the tool and the ability to extend including classification, regression, importance-weighted clas- the tool to fit a specific need. sification and multiple-instance learning. The malibu inter- A machine learning workbench provides at a minimum face was designed to create reproducible experiments ideally run in a remote and/or command line environment. The five services beyond the standard machine learning tool: software can be found at: http://proteomics.bioengr. 1) Learning algorithms uic.edu/malibu/index.html 2) Learning evaluation . 3) Dataset preprocessing I. INTRODUCTION 4) Interface unification 5) Extensible bindings Recently open source software has matured into a solution able to handle complex real world applications. One such ap- In essence, a machine learning workbench provides a unified plication entails developing implementations for the substan- interface to a number of learning algorithms and, ideally, tial number machine learning algorithms available for the nu- handles more than one type of machine learning problem. For merous problem domains e.g. bioinformatics type problems example, a supervised machine learning workbench should including protein annotation, microarray analysis as well as support a number of classifiers including decision trees and others. Indeed, many such algorithms remain unused due to support vector machines as well as algorithms for other prob- unavailable or poor implementations. Moreover, many re- lems such as calibration methods for probabilistic regression. searchers recognize the need for peer-reviewed open-source Likewise, a workbench should provide stock tools to handle machine learning software by the scientific community[1]. common tasks such as dataset preprocessing or learning eval- The advantages of open-source machine learning tools in- uation. For a supervised machine learning workbench, these clude: stock tools should include metrics to measure performance, 1) Reproducibility and transparency algorithms to perform cross-validation and tools to perform 2) Uncovering problems in current algorithms discreetization or normalization. Finally, a workbench should 3) Building on existing resources be extensible providing the ability to add or support new 4) Access to machine learning tools algorithms as well as tying into some scripting language. One of the fundamental benefits of open-source machine A number of machine learning workbenches have been learning software is that it facilitates the reproducibility of developed in programming languages including C/C++, Java, experimental results. While such experiments are relatively Python and Matlab. Indeed, the programming language easy to reproduce compared to other fields, they often are not. 
characterizes fundamental properties in the corresponding At the same time, the pressure to publish remains constant software. That is, tools based in C/C++ are more difficult leading to unintentional (or intentional) “cheating”. Having to develop yet utilize computing resources more efficiently. access to an implementation of a machine learning algorithm Java-, Matlab- and Python-based tools are easier to develop (and ideally the associated dataset e.g. UCI repository[2]) and deploy but require an interpreter and garbage collector. could potentially eliminate such cheating or, at a minimum, Java- and Delphi-based tools support a rich set of libraries its ill effects. Likewise, tendering the source of an algorithm enabling complex graphical user interfaces. Finally, Matlab- allows the community at large to more quickly discover based tools enjoy a rich set of statistical and optimiza- problems in current algorithms on both the conceptual and tion routines providing the ability to quick-prototype many concrete levels. Similarly, it enables others to build increas- learning algorithms. Using these languages a number of ingly more intricate systems on top of available source code. workbenches haven been developed. One of the most popular This could simply mean a better user interface on an existing is WEKA[3], a Java-based workbench that supports a large number of supervised and unsupervised machine learning This work is partially supported by the NIH. algorithms. This workbench has been extended and modified R. E. Langlois is with Department of Bioengineering, University of by a number of projects, most notably by RapidMiner[4], a Illinois at Chicago, Chicago, IL 60607, USA ezra@uic.edu H. Lu is with Faculty of Department of Bioengineering, University of workbench focused on fast-prototyping and data visualiza- Illinois at Chicago, Chicago, IL 60607, USA huilu@uic.edu tion. Likewise, a number of workbenches have been devel- 978-1-4244-1815-2/08/$25.00 ©2008 IEEE. 3795
  • 2. oped in C/C++ including Shogun[5], Elefant[6], MLC++[7], In (binary) classification, the algorithm learns a model Orange[8] and Torch[9]. Shogun, Orange and Elefant support from labeled training examples where the label belongs to python bindings enabling efficient machine learning work one of two discreet classes. malibu incorporates a number of flows. Matlab has also proven an excellent platform for third-party and built-in algorithms to handle classification. machine learning with its own considerable statistical and The third-party classifiers include LIBSVM[18], Cover Tree machine learning libraries; it has been further extended kNN[20], INDTree[21] and C4.5[22]. The built-in classifiers by Spider[10] better handle a large number of machine include the Willow[19] decision tree and ADTree[23]. malibu learning problems including supervised, unsupervised and also supports a number of (binary) meta-classifiers that semi-supervised learning. There has been considerable effort construct ensembles of classifiers to improve performance, in developing additional open-source machine learning soft- which includes Bagging[24], Subagging[25], AdaBoost[26], ware. To this end, most available workbenches can be found Confidence-ratedAdaBoost[26], Gentle AdaBoost[27] and in a peer-reviewed machine learning software repository1 . for the tree-based classifiers Random Forests[28]. The applications of such machine learning software ranges In importance-weighted classification, the algorithm learns from facial recognition to medical diagnosis. One clini- a model from training examples labeled by their relative cal application of machine learning is the identification of importance such that a prediction will be biased toward more cancerous tumors using data collected by some imaging important training examples. One popular variant is called modality, e.g. microscopic analysis of cells [11]. Specifically, cost-sensitive classification where examples are weighted a machine learning algorithm can segment an image into based on their class label. malibu supports both implicit and regions where one may contain a cancerous tumor. A later explicit weighting for each algorithm where implicit weight- algorithm can learn features within these regions (i.e. shape ing is supported by LIBSVM, kNN and Willow. Furthermore, of possible tumor, texture of its edges, level of contrast) to an explicit method utilizes the Costing wrapper[29] to make distinguish benign and cancerous tissue. In more recent work, any classifier importance-weighted. machine learning has found great success in the arena of In regression, the algorithm learns a real-valued output brain-computer interfaces [12]. Such devices have a number from training examples labeled with a real label. One of applications ranging from clinical monitoring of arousal special case of regression is probabilistic regression where to investigating the working of the human brain. the learning algorithm assigns a probability to an example In this work, we introduce a new machine learning work- as belonging to a particular class. Similar to importance- bench for bioinformatics tasks. This workbench has been weighted classification malibu supports both implicit and applied to a number of problems ranging from function pre- explicit regression. That is, learning algorithms such as LIB- diction, e.g. prediction of DNA-binding residues[13], DNA- SVM, kNN and Willow, which support regression. 
For binary binding proteins[14], [15], membrane-binding proteins[16], classifiers, malibu also includes explicit wrappers to extend a [15], to structure prediction e.g. protein folds[17]. classifier to handle probabilistic regression. These wrappers include sigmoid correction[30], isotonic regression[31] and II. LEARNING WITH malibu probing[32]. malibu is an open-source machine learning workbench In multiple-instance learning (MIL), examples are grouped written in C/C++ and is geared toward supervised learning. into bags where the bag not an individual example has a The basic design of malibu comprises a hierarchy of C++ label. A bag is positive if at least one instance in the bag is template classes that both wrap and extend a core set of positive otherwise the bag is negative. In malibu any binary classification algorithms. By utilizing proven C++ template classifier can be extended to multiple-instance learning by meta-programming techniques used in the Boost Libraries2 viewing this problem as binary classification with positive and the matrix template library3 , malibu provides an efficient class noise; all parameters are selected by estimating bag- yet extensible library of algorithms. The core classifiers level (not instance-level) performance. malibu also supports comprise both third-party tools, e.g. LIBSVM[18], and native extending a weak classifier to a multiple-instance learner implementations[19]. through the AdaBoost.C2MIL wrapper[19]. B. Learning evaluation A. Learning algorithms Evaluating the performance of a learning algorithm is The malibu workbench currently supports a number of important to both select the best model and estimate the per- supervised learning problems including classification, meta- formance on unseen testing dataset. The performance of an classification, importance-weighted classification, regression algorithm is measured as follows: and multiple-instance learning. A supervised learning prob- for each partition do lem comprises a set of labeled training examples with the Train algorithm on one partition goal of predicting the label on an unseen (and possibly Evaluate on other partition unlabeled) example. end for 1 http://mloss.org Learning algorithm performance is usually measured by 2 http://www.boost.org metrics and/or graphs. A single metric reflects some question 3 http://www.osl.iu.edu/research/mtl/ about the performance of a learning algorithm whereas 3796
  • 3. a graph reflects a series of questions. malibu supports a library; in loose-binding, the workbench writes out a file in number of threshold metrics from a tabulated contingency a format supported by another tool. Currently malibu sup- table, which estimate the performance for every problem ports soft-binding to web-browsers, LTEX, GNUPLOT4 and A 5 except regression; it also supports a number of regression Graphviz . That is, the metrics describing the performance of and ranking metrics. Likewise, malibu supports a number of a learning algorithm can be written out in both the HTML graphs including the receiver operating characteristics curve, and latex formats. Similarly, the performance can also be the cost curve[33], the precision/recall curve, lift curves and written out as a plot in the GNUPLOT format. Finally, the reliability diagrams. models describing the tree-based learning algorithms can be Note that malibu provides automated model selection written out as graphs in the Graphviz DOT format. for every learning algorithm using the previously described evaluation metrics and the dataset partitioning algorithms III. CONCLUSIONS AND FUTURE WORKS introduced in the next section. A. Conclusions C. Dataset preprocessing The maturity of open source software in conjunction with the present need for robust implementations of machine Preprocessing a dataset is a critical step for many ma- learning algorithms has given rise to significant efforts in chine learning algorithms e.g. normalization of attributes for developing large-scale workbenches. However, no single distance-based methods such as SVM. Moreover, preprocess- workbench is comprehensive in its coverage of machine ing also includes algorithms that partition the dataset for learning algorithms nor does every workbench provide an model evaluation. malibu comprises a number of algorithms optimal set of features. malibu is a high-performance ma- to transform a dataset into an appropriate format such as chine learning workbench developed to extend classifiers normalization for distance-based methods, nominal-to-binary to handle classification as well as other problem domains for distance-based methods, and discreetization to speed up namely regression, importance-weighted classification and sorting-based methods. Likewise, malibu includes partition- multiple-instance learning. It also satisfies the basic criterion ing methods such as cross-validation, bootstrapping, holdout of a workbench by providing a unified user interface, dataset and progressive validation[34]. Each of these methods has preprocessing algorithms, learning algorithms and binding to various advantages and disadvantages. Holdout requires a other tools to facilitate learning. large amount of dataset but its the best understood. For The primary contribution of the malibu workbench is im- smaller datasets, cross-validation, progressive validation and proved usability for a more computer-scientist oriented user bootstrapping are more appropriate where cross-validation is group. That is, malibu is written in ANSI C++ and has been the most widely used method. extensively tested in Windows and Unix-like environments. D. Interface unification By downloading binary files rather than interpreted code, malibu does not require the user to learn how to use a The interface to a machine learning algorithm includes Java (e.g. how to increase available memory) or Matlab setting parameters, reading datasets, outputting results and interpreter (e.g. how to program in Matlab). 
   In multiple-instance learning (MIL), examples are grouped into bags where the bag, not an individual example, has a label. A bag is positive if at least one instance in the bag is positive; otherwise the bag is negative. In malibu, any binary classifier can be extended to multiple-instance learning by viewing this problem as binary classification with positive class noise; all parameters are selected by estimating bag-level (not instance-level) performance. malibu also supports extending a weak classifier to a multiple-instance learner through the AdaBoost.C2MIL wrapper[19].
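   The bag-label semantics take only a few lines to state; predictBag and the Classifier/Instance types below are illustrative assumptions rather than the malibu interface:

   #include <cstddef>
   #include <vector>

   // A bag is predicted positive when any of its instances is predicted
   // positive; Classifier is any type whose predict() returns +1 or -1.
   template <class Classifier, class Instance>
   int predictBag(const Classifier& model, const std::vector<Instance>& bag) {
       for (std::size_t i = 0; i < bag.size(); ++i)
           if (model.predict(bag[i]) > 0)
               return +1;   // one positive instance makes the bag positive
       return -1;           // no positive instance: the bag is negative
   }

Scoring such bag predictions against the bag labels yields the bag-level performance estimate used for parameter selection.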
B. Learning evaluation

   Evaluating the performance of a learning algorithm is important both to select the best model and to estimate the performance on an unseen testing dataset. The performance of an algorithm is measured as follows:

   for each partition do
      Train the algorithm on one partition
      Evaluate on the other partition
   end for
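   One concrete instantiation of this loop is k-fold cross-validation. The sketch below reuses the illustrative Example type from the earlier sketch and reports accuracy, one of the threshold metrics discussed below; it is a sketch under those assumptions, not the malibu evaluation driver:

   #include <cstddef>
   #include <vector>

   // k-fold cross-validation (k >= 2): each example is held out exactly
   // once, and accuracy is tabulated over all held-out predictions.
   template <class Learner>
   double crossValidate(const std::vector<Example>& data, std::size_t k) {
       std::size_t correct = 0;
       for (std::size_t fold = 0; fold < k; ++fold) {
           std::vector<Example> train, test;
           for (std::size_t i = 0; i < data.size(); ++i)
               (i % k == fold ? test : train).push_back(data[i]);
           Learner model;
           model.train(train);                    // fit on the other k-1 partitions
           for (std::size_t i = 0; i < test.size(); ++i)
               if (model.predict(test[i]) == test[i].label)
                   ++correct;                     // diagonal of the contingency table
       }
       return static_cast<double>(correct) / data.size();
   }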
   Learning algorithm performance is usually measured by metrics and/or graphs. A single metric reflects some question about the performance of a learning algorithm whereas a graph reflects a series of questions. malibu supports a number of threshold metrics computed from a tabulated contingency table, which estimate the performance for every problem except regression; it also supports a number of regression and ranking metrics. Likewise, malibu supports a number of graphs including the receiver operating characteristic curve, the cost curve[33], the precision/recall curve, lift curves and reliability diagrams.
   Note that malibu provides automated model selection for every learning algorithm using the previously described evaluation metrics and the dataset partitioning algorithms introduced in the next section.

C. Dataset preprocessing

   Preprocessing a dataset is a critical step for many machine learning algorithms, e.g. normalization of attributes for distance-based methods such as SVM. Moreover, preprocessing also includes algorithms that partition the dataset for model evaluation. malibu comprises a number of algorithms to transform a dataset into an appropriate format, such as normalization and nominal-to-binary conversion for distance-based methods and discretization to speed up sorting-based methods. Likewise, malibu includes partitioning methods such as cross-validation, bootstrapping, holdout and progressive validation[34]. Each of these methods has its advantages and disadvantages. Holdout requires a large amount of data but is the best understood. For smaller datasets, cross-validation, progressive validation and bootstrapping are more appropriate, with cross-validation being the most widely used method.
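   For example, the normalization transform for distance-based methods can be sketched as a per-attribute min-max rescaling; the code below is illustrative, not malibu's implementation:

   #include <cstddef>
   #include <vector>

   // Rescale one attribute (stored as a column of values) to [0, 1].
   void normalizeAttribute(std::vector<double>& column) {
       if (column.empty()) return;
       double lo = column[0], hi = column[0];
       for (std::size_t i = 1; i < column.size(); ++i) {
           if (column[i] < lo) lo = column[i];
           if (column[i] > hi) hi = column[i];
       }
       if (hi == lo) return;                 // constant attribute: leave as-is
       for (std::size_t i = 0; i < column.size(); ++i)
           column[i] = (column[i] - lo) / (hi - lo);
   }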
D. Interface unification

   The interface to a machine learning algorithm includes setting parameters, reading datasets, outputting results and writing models. Setting parameters in malibu can be accomplished using either command-line arguments or a configuration file, where a subgroup of arguments can be written to and read from a file. The parameter system also supports implicit configuration files that depend on the name of the learning algorithm; command-line parameters override configuration files which, in turn, override implicit configuration files. The dataset format supported by malibu is a standard tab/comma/space delimited file in which every example is delimited by line separators. Indeed, the format allows changes in class position, the existence of a header, the index of the bag label or the number of prefixing labels.
   When a model is applied to a test set, a malibu learning algorithm writes predictions to the standard output. It also outputs statistics describing a training and/or testing set as well as a copy of the configuration file. Finally, malibu supports writing out the models of learning algorithms in ASCII format.

E. Extensible bindings

   A workbench may interface with (or bind) another software tool through two mechanisms: tight-binding and loose-binding. In tight-binding, the workbench makes a function call to some library; in loose-binding, the workbench writes out a file in a format supported by another tool. Currently malibu supports loose-binding to web browsers, LaTeX, GNUPLOT4 and Graphviz5. That is, the metrics describing the performance of a learning algorithm can be written out in both HTML and LaTeX formats. Similarly, the performance can also be written out as a plot in the GNUPLOT format. Finally, the models describing the tree-based learning algorithms can be written out as graphs in the Graphviz DOT format.

4 http://www.gnuplot.info/
5 http://www.graphviz.org/

III. CONCLUSIONS AND FUTURE WORKS

A. Conclusions

   The maturity of open-source software, in conjunction with the present need for robust implementations of machine learning algorithms, has given rise to significant efforts in developing large-scale workbenches. However, no single workbench is comprehensive in its coverage of machine learning algorithms, nor does every workbench provide an optimal set of features. malibu is a high-performance machine learning workbench developed to extend classifiers to handle classification as well as other problem domains, namely regression, importance-weighted classification and multiple-instance learning. It also satisfies the basic criteria of a workbench by providing a unified user interface, dataset preprocessing algorithms, learning algorithms and bindings to other tools to facilitate learning.
   The primary contribution of the malibu workbench is improved usability for a more computer-scientist-oriented user group. That is, malibu is written in ANSI C++ and has been extensively tested in Windows and Unix-like environments. Because users download binary files rather than interpreted code, malibu does not require the user to learn how to use a Java interpreter (e.g. how to increase available memory) or a Matlab interpreter (e.g. how to program in Matlab). It supports a number of dataset formats, removing from the user the burden of writing scripts to reformat a dataset. Similarly, it provides a number of standard model selection and evaluation algorithms often missing from third-party code (e.g. CoverTree[20]). malibu also provides a configuration file, which allows users to modify arguments in an environment that provides additional information about each command. Finally, malibu provides bindings for third-party tools to generate graphs and plots. Another contribution includes the implementation of new algorithms (e.g. AdaBoost.C2MIL) as well as the extension of existing algorithms to new problem domains (e.g. classifiers to multiple-instance learning).

B. Future Works

   At the same time, malibu (like most available software) is a work in progress. One direction of development is to scale the workbench up to distributed computing. That is, model selection and validation can be distributed via the Message Passing Interface (MPI) to multiple CPUs and machines. Another direction will focus on developing stronger bindings between key software packages. A scripting language such as Python is better suited to selecting objects, extracting features and tying in other applications. A final direction will be to assemble more classifiers, including Naïve Bayes and logistic regression, as well as more learning strategies such as multi-class classification, multi-part learning and structured prediction.

IV. ACKNOWLEDGMENTS

   This work is partially supported by NIH P01 AI060915 (H.L.). R.E.L. acknowledges support from NIH training grant T32 HL 07692: Cellular Signaling in Cardiovascular System (P.I. John Solaro).

REFERENCES

[1] S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Müller, F. Pereira, C. E. Rasmussen, G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston, and R. Williamson, “The need for open source software in machine learning,” Journal of Machine Learning Research, vol. 8, pp. 2443–2466, Oct 2007.
[2] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005, http://www.cs.waikato.ac.nz/ml/weka/.
[4] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid prototyping for complex data mining tasks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 12, Philadelphia, USA, 2006.
[5] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. 7, pp. 1531–1565, July 2006, http://www.shogun-toolbox.org/.
[6] K. Gawande, C. Webers, A. J. Smola, and S. Vishwanathan, “Elefant: A python machine learning toolbox,” in SciPy Conference, 2007.
[7] R. Kohavi, D. Sommerfield, and J. Dougherty, “Data mining using MLC++, a machine learning library in C++,” in International Conference on Tools with Artificial Intelligence, vol. 8. Toulouse, France: IEEE Computer Society, 1996, p. 234, http://www.sgi.com/tech/mlc/.
[8] J. Demšar, B. Zupan, G. Leban, and T. Curk, “Orange: From experimental machine learning to interactive data mining,” in Knowledge Discovery in Databases: PKDD 2004, ser. Lecture Notes in Computer Science. Berlin/Heidelberg: Springer, 2004, vol. 3202, pp. 537–539.
[9] R. Collobert, S. Bengio, and J. Mariethoz, “Torch: A modular machine learning software library,” IDIAP Research Institute, Tech. Rep. IDIAP-RR 02-46, 2002, http://www.torch.ch/.
[10] J. Weston, A. Elisseeff, G. BakIr, and F. Sinz, “SPIDER: Object oriented machine learning library,” 2003, http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html.
[11] J. Mohr and K. Obermayer, “A topographic support vector machine: Classification using local label configurations,” in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, pp. 929–936.
[12] K.-R. Müller, M. Tangermann, G. Dornhege, M. Krauledat, G. Curio, and B. Blankertz, “Machine learning for real-time single-trial EEG-analysis: From brain-computer interfacing to mental state monitoring,” J. Neurosci. Methods, vol. 167, no. 1, pp. 82–90, 2008.
[13] N. Bhardwaj and H. Lu, “Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions,” FEBS Letters, vol. 581, no. 5, pp. 1058–1066, 2007.
[14] N. Bhardwaj, R. E. Langlois, G. Zhao, and H. Lu, “Kernel-based machine learning protocol for predicting DNA-binding proteins,” Nucleic Acids Research, vol. 33, no. 20, pp. 6486–6493, 2005.
[15] R. Langlois, M. Carson, N. Bhardwaj, and H. Lu, “Learning to translate sequence and structure to function: Identifying DNA binding and membrane binding proteins,” Annals of Biomedical Engineering, vol. 35, no. 6, pp. 1043–1052, 2007.
[16] N. Bhardwaj, R. V. Stahelin, R. E. Langlois, W. Cho, and H. Lu, “Structural bioinformatics prediction of membrane-binding proteins,” Journal of Molecular Biology, vol. 359, no. 2, pp. 486–495, 2006.
[17] R. E. Langlois, A. Diec, O. Perisic, Y. Dai, and H. Lu, “Improved protein fold assignment using support vector machines,” International Journal of Bioinformatics Research and Applications, vol. 1, no. 3, pp. 319–334, 2006.
[18] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[19] R. E. Langlois, “Machine learning in bioinformatics: Algorithms, implementations and applications,” Ph.D. Thesis, University of Illinois at Chicago, Chicago, IL, USA, 2008.
[20] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in International Conference on Machine Learning, vol. 148. Pittsburgh, Pennsylvania: ACM, 2006, pp. 97–104.
[21] W. Buntine, “Learning classification trees,” Statistics and Computing, vol. 2, no. 2, pp. 63–73, 1992.
[22] J. R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77–90, 1996.
[23] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” in International Conference on Machine Learning, vol. 16, Bled, Slovenia, 1999.
[24] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[25] P. Bühlmann, “Bagging, subagging and bragging for improving some prediction algorithms,” in Recent Advances and Trends in Nonparametric Statistics, M. G. Akritas and D. N. Politis, Eds. North Holland: Elsevier, 2003, pp. 19–34.
[26] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[27] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[28] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[29] B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in IEEE International Conference on Data Mining, vol. 3, Melbourne, Florida, 2003, p. 435.
[30] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers, P. J. Bartlett, B. Schölkopf, D. Schuurmans, and A. J. Smola, Eds. Boston: MIT Press, 1999, pp. 61–74.
[31] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate multiclass probability estimates,” in Special Interest Group on Knowledge Discovery and Data Mining, vol. 8. Edmonton, Alberta, Canada: ACM Press, 2002, pp. 694–699.
[32] J. Langford and B. Zadrozny, “Estimating class membership probabilities using classifier learners,” in International Workshop on Artificial Intelligence and Statistics, vol. 10, Barbados, 2005.
[33] C. Drummond and R. C. Holte, “Cost curves: An improved method for visualizing classifier performance,” Machine Learning, vol. 65, no. 1, pp. 95–130, 2006.
[34] A. Blum, A. Kalai, and J. Langford, “Beating the hold-out: Bounds for k-fold and progressive cross-validation,” in COLT: Computational Learning Theory, vol. 12. Santa Cruz, California: ACM, 1999, pp. 203–208.