Intelligible Machine Learning with Malibu for bioinformatics ...

30th Annual International IEEE EMBS Conference
Vancouver, British Columbia, Canada, August 20-24, 2008

Intelligible machine learning with malibu
Robert E. Langlois and Hui Lu

This work is partially supported by the NIH.
R. E. Langlois is with the Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA ezra@uic.edu
H. Lu is with the Faculty of the Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA huilu@uic.edu
978-1-4244-1815-2/08/$25.00 ©2008 IEEE.

Abstract— malibu is an open-source machine learning workbench developed in C/C++ for high-performance real-world applications, namely bioinformatics and medical informatics. It leverages third-party machine learning implementations for more robust, bug-free software. This workbench handles several well-studied supervised machine learning problems including classification, regression, importance-weighted classification and multiple-instance learning. The malibu interface was designed to create reproducible experiments ideally run in a remote and/or command-line environment. The software can be found at: http://proteomics.bioengr.uic.edu/malibu/index.html.

I. INTRODUCTION

Recently, open-source software has matured into a solution able to handle complex real-world applications. One such application entails developing implementations for the substantial number of machine learning algorithms available for numerous problem domains, e.g. bioinformatics problems including protein annotation and microarray analysis. Indeed, many such algorithms remain unused due to unavailable or poor implementations. Moreover, many researchers recognize the need for peer-reviewed open-source machine learning software in the scientific community[1]. The advantages of open-source machine learning tools include:
1) Reproducibility and transparency
2) Uncovering problems in current algorithms
3) Building on existing resources
4) Access to machine learning tools

One of the fundamental benefits of open-source machine learning software is that it facilitates the reproducibility of experimental results. While such experiments are relatively easy to reproduce compared to those in other fields, in practice they often are not. At the same time, the pressure to publish remains constant, leading to unintentional (or intentional) “cheating”. Having access to an implementation of a machine learning algorithm (and ideally the associated dataset, e.g. from the UCI repository[2]) could potentially eliminate such cheating or, at a minimum, its ill effects. Likewise, tendering the source of an algorithm allows the community at large to more quickly discover problems in current algorithms on both the conceptual and concrete levels. Similarly, it enables others to build increasingly intricate systems on top of available source code. This could simply mean a better user interface on an existing project or a more powerful protocol to solve a problem in a specific domain. Finally, open-source tools give the wider scientific community full access to machine learning tools that have found application in a wide range of fields. Such access permits both use of the tool and the ability to extend the tool to fit a specific need.

A machine learning workbench provides at a minimum five services beyond the standard machine learning tool:
1) Learning algorithms
2) Learning evaluation
3) Dataset preprocessing
4) Interface unification
5) Extensible bindings

In essence, a machine learning workbench provides a unified interface to a number of learning algorithms and, ideally, handles more than one type of machine learning problem. For example, a supervised machine learning workbench should support a number of classifiers including decision trees and support vector machines as well as algorithms for other problems such as calibration methods for probabilistic regression. Likewise, a workbench should provide stock tools to handle common tasks such as dataset preprocessing or learning evaluation. For a supervised machine learning workbench, these stock tools should include metrics to measure performance, algorithms to perform cross-validation and tools to perform discretization or normalization. Finally, a workbench should be extensible, providing the ability to add or support new algorithms as well as to tie into some scripting language.

A number of machine learning workbenches have been developed in programming languages including C/C++, Java, Python and Matlab. Indeed, the programming language determines fundamental properties of the corresponding software. That is, tools based in C/C++ are more difficult to develop yet utilize computing resources more efficiently. Java-, Matlab- and Python-based tools are easier to develop and deploy but require an interpreter and garbage collector. Java- and Delphi-based tools support a rich set of libraries enabling complex graphical user interfaces. Finally, Matlab-based tools enjoy a rich set of statistical and optimization routines providing the ability to quickly prototype many learning algorithms. Using these languages, a number of workbenches have been developed. One of the most popular is WEKA[3], a Java-based workbench that supports a large number of supervised and unsupervised machine learning algorithms. This workbench has been extended and modified by a number of projects, most notably by RapidMiner[4], a workbench focused on fast prototyping and data visualization. Likewise, a number of workbenches have been developed in C/C++ including Shogun[5], Elefant[6], MLC++[7], Orange[8] and Torch[9]. Shogun, Orange and Elefant support Python bindings enabling efficient machine learning workflows. Matlab has also proven an excellent platform for machine learning with its own considerable statistical and machine learning libraries; it has been further extended by Spider[10] to better handle a large number of machine learning problems including supervised, unsupervised and semi-supervised learning. There has been considerable effort in developing additional open-source machine learning software. To this end, most available workbenches can be found in a peer-reviewed machine learning software repository¹.

The applications of such machine learning software range from facial recognition to medical diagnosis. One clinical application of machine learning is the identification of cancerous tumors using data collected by some imaging modality, e.g. microscopic analysis of cells[11]. Specifically, a machine learning algorithm can segment an image into regions, one of which may contain a cancerous tumor. A later algorithm can learn features within these regions (e.g. shape of the possible tumor, texture of its edges, level of contrast) to distinguish benign from cancerous tissue. In more recent work, machine learning has found great success in the arena of brain-computer interfaces[12]. Such devices have a number of applications ranging from clinical monitoring of arousal to investigating the workings of the human brain.

In this work, we introduce a new machine learning workbench for bioinformatics tasks. This workbench has been applied to a number of problems ranging from function prediction, e.g. prediction of DNA-binding residues[13], DNA-binding proteins[14], [15] and membrane-binding proteins[16], [15], to structure prediction, e.g. protein folds[17].

II. LEARNING WITH malibu

malibu is an open-source machine learning workbench written in C/C++ and is geared toward supervised learning. The basic design of malibu comprises a hierarchy of C++ template classes that both wrap and extend a core set of classification algorithms. By utilizing proven C++ template meta-programming techniques used in the Boost Libraries² and the matrix template library³, malibu provides an efficient yet extensible library of algorithms. The core classifiers comprise both third-party tools, e.g. LIBSVM[18], and native implementations[19].

A. Learning algorithms

The malibu workbench currently supports a number of supervised learning problems including classification, meta-classification, importance-weighted classification, regression and multiple-instance learning. A supervised learning problem comprises a set of labeled training examples with the goal of predicting the label on an unseen (and possibly unlabeled) example.

In (binary) classification, the algorithm learns a model from labeled training examples where the label belongs to one of two discrete classes. malibu incorporates a number of third-party and built-in algorithms to handle classification. The third-party classifiers include LIBSVM[18], Cover Tree kNN[20], INDTree[21] and C4.5[22]. The built-in classifiers include the Willow[19] decision tree and ADTree[23]. malibu also supports a number of (binary) meta-classifiers that construct ensembles of classifiers to improve performance; these include Bagging[24], Subagging[25], AdaBoost[26], Confidence-rated AdaBoost[26], Gentle AdaBoost[27] and, for the tree-based classifiers, Random Forests[28].

In importance-weighted classification, the algorithm learns a model from training examples labeled by their relative importance such that a prediction will be biased toward more important training examples. One popular variant is called cost-sensitive classification, where examples are weighted based on their class label. malibu supports both implicit and explicit weighting for each algorithm, where implicit weighting is supported by LIBSVM, kNN and Willow. Furthermore, an explicit method utilizes the Costing wrapper[29] to make any classifier importance-weighted.

In regression, the algorithm learns a real-valued output from training examples labeled with a real value. One special case of regression is probabilistic regression, where the learning algorithm assigns to an example a probability of belonging to a particular class. Similar to importance-weighted classification, malibu supports both implicit and explicit regression. That is, learning algorithms such as LIBSVM, kNN and Willow support regression implicitly. For binary classifiers, malibu also includes explicit wrappers to extend a classifier to handle probabilistic regression. These wrappers include sigmoid correction[30], isotonic regression[31] and probing[32].

In multiple-instance learning (MIL), examples are grouped into bags where the bag, not an individual example, has a label. A bag is positive if at least one instance in the bag is positive; otherwise the bag is negative. In malibu any binary classifier can be extended to multiple-instance learning by viewing this problem as binary classification with positive class noise; all parameters are selected by estimating bag-level (not instance-level) performance. malibu also supports extending a weak classifier to a multiple-instance learner through the AdaBoost.C2MIL wrapper[19].

B. Learning evaluation

Evaluating the performance of a learning algorithm is important both to select the best model and to estimate the performance on an unseen testing dataset. The performance of an algorithm is measured as follows:

for each partition do
    Train algorithm on one partition
    Evaluate on other partition
end for
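The evaluation loop above can be sketched in Python, one of the scripting languages the paper mentions. This is a minimal illustration and not malibu's implementation: it assumes k-fold cross-validation as the partitioning scheme and accuracy as the metric, and the names `k_fold_partitions`, `cross_validate` and `train_majority` are hypothetical helpers introduced only for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]      # (attribute vector, class label)
Model = Callable[[Sequence[float]], int]   # maps attributes to a predicted label


def k_fold_partitions(n_examples: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the example indices and deal them into k disjoint folds."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must satisfy 2 <= k <= n_examples")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples: Sequence[Example],
                   train: Callable[[List[Example]], Model],
                   k: int = 5, seed: int = 0) -> float:
    """Return the mean held-out accuracy over k train/evaluate rounds."""
    accuracies = []
    for held_out in k_fold_partitions(len(examples), k, seed):
        held = set(held_out)
        # Train on every example outside the held-out fold ...
        training = [ex for i, ex in enumerate(examples) if i not in held]
        model = train(training)
        # ... and evaluate on the held-out fold only.
        testing = [examples[i] for i in held_out]
        correct = sum(1 for x, y in testing if model(x) == y)
        accuracies.append(correct / len(testing))
    return sum(accuracies) / len(accuracies)


def train_majority(training: List[Example]) -> Model:
    """Toy learner (assumption for illustration): predict the most common label."""
    labels = [y for _, y in training]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


if __name__ == "__main__":
    toy = [([float(i)], 1) for i in range(8)] + [([float(i)], 0) for i in range(2)]
    print("mean held-out accuracy:", cross_validate(toy, train_majority, k=5))
```

Because every example is held out exactly once, the averaged score estimates performance on unseen data; the same loop could serve model selection by repeating it for each candidate parameter setting and keeping the best score.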
¹ http://mloss.org
² http://www.boost.org
³ http://www.osl.iu.edu/research/mtl/

Learning algorithm performance is usually measured by metrics and/or graphs. A single metric reflects some question about the performance of a learning algorithm whereas
a graph reflects a series of questions. malibu supports a number of threshold metrics computed from a tabulated contingency table, which estimate the performance for every problem except regression; it also supports a number of regression and ranking metrics. Likewise, malibu supports a number of graphs including the receiver operating characteristic curve, the cost curve[33], the precision/recall curve, lift curves and reliability diagrams.

Note that malibu provides automated model selection for every learning algorithm using the previously described evaluation metrics and the dataset partitioning algorithms introduced in the next section.

C. Dataset preprocessing

Preprocessing a dataset is a critical step for many machine learning algorithms, e.g. normalization of attributes for distance-based methods such as SVM. Moreover, preprocessing also includes algorithms that partition the dataset for model evaluation. malibu comprises a number of algorithms to transform a dataset into an appropriate format, such as normalization for distance-based methods, nominal-to-binary conversion for distance-based methods, and discretization to speed up sorting-based methods. Likewise, malibu includes partitioning methods such as cross-validation, bootstrapping, holdout and progressive validation[34]. Each of these methods has various advantages and disadvantages. Holdout requires a large amount of data but it is the best understood. For smaller datasets, cross-validation, progressive validation and bootstrapping are more appropriate; of these, cross-validation is the most widely used method.

D. Interface unification

The interface to a machine learning algorithm includes setting parameters, reading datasets, outputting results and writing models. Setting parameters in malibu can be accomplished using either command-line arguments or a configuration file, where a subgroup of arguments can be written to and read from a file. The parameter system also supports implicit configuration files that depend on the name of the learning algorithm; command-line parameters override configuration files which, in turn, override implicit configuration files. The dataset format supported by malibu is a standard tab-, comma- or space-delimited file in which every example is delimited by a line separator. Indeed, the format allows changes in the class position, the existence of a header, the index of a bag label or the number of prefixing labels.

When a model is applied to a test set, a malibu learning algorithm writes predictions to the standard output. It also outputs statistics describing a training and/or testing set as well as a copy of the configuration file. Finally, malibu supports writing out the models of learning algorithms in the ASCII format.

E. Extensible bindings

A workbench may interface with (or bind to) another software tool through two mechanisms: tight-binding and loose-binding. In tight-binding, the workbench makes a function call to some library; in loose-binding, the workbench writes out a file in a format supported by another tool. Currently malibu supports loose-binding to web browsers, LaTeX, GNUPLOT⁴ and Graphviz⁵. That is, the metrics describing the performance of a learning algorithm can be written out in both the HTML and LaTeX formats. Similarly, the performance can also be written out as a plot in the GNUPLOT format. Finally, the models describing the tree-based learning algorithms can be written out as graphs in the Graphviz DOT format.

⁴ http://www.gnuplot.info/
⁵ http://www.graphviz.org/

III. CONCLUSIONS AND FUTURE WORKS

A. Conclusions

The maturity of open-source software in conjunction with the present need for robust implementations of machine learning algorithms has given rise to significant efforts in developing large-scale workbenches. However, no single workbench is comprehensive in its coverage of machine learning algorithms, nor does every workbench provide an optimal set of features. malibu is a high-performance machine learning workbench developed to extend classifiers to handle classification as well as other problem domains, namely regression, importance-weighted classification and multiple-instance learning. It also satisfies the basic criteria of a workbench by providing a unified user interface, dataset preprocessing algorithms, learning algorithms and bindings to other tools to facilitate learning.

The primary contribution of the malibu workbench is improved usability for a more computer-scientist-oriented user group. That is, malibu is written in ANSI C++ and has been extensively tested in Windows and Unix-like environments. Because malibu is distributed as binary files rather than interpreted code, the user need not learn how to use a Java interpreter (e.g. how to increase available memory) or a Matlab interpreter (e.g. how to program in Matlab). It supports a number of dataset formats, removing from the user the burden of creating scripts to format a dataset. Similarly, it provides a number of standard model selection and evaluation algorithms often missing from third-party code (e.g. CoverTree). malibu also provides a configuration file, which allows users to modify arguments in an environment that provides additional information about each command. Finally, malibu provides bindings for third-party tools to generate graphs and plots. Another contribution includes the implementation of new algorithms (e.g. AdaBoost.C2MIL) as well as the extension of any algorithm to new problem domains (e.g. classifiers to multiple-instance learning).

B. Future Works

At the same time, malibu (like most available software) is a work in progress. One direction of development is to scale the workbench up to distributed computing. That is, model selection and validation can be distributed via the message passing interface (MPI) to multiple CPUs and machines. Another direction will focus on developing stronger bindings
between key software packages. A scripting language such as Python is better suited to selecting objects, extracting features and tying in other applications. A final direction will be to assemble more classifiers including Naïve Bayes and logistic regression as well as more learning strategies such as multi-class classification, multi-part learning and structured prediction.

IV. ACKNOWLEDGMENTS

This work is partially supported by NIH P01 AI060915 (H.L.). R.E.L. acknowledges the support from NIH training grant T32 HL 07692: Cellular Signaling in Cardiovascular System (P.I. John Solaro).

REFERENCES

[1] S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Müller, F. Pereira, C. E. Rasmussen, G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston, and R. Williamson, “The need for open source software in machine learning,” Journal of Machine Learning Research, vol. 8, pp. 2443–2466, Oct 2007.
[2] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005, http://www.cs.waikato.ac.nz/ml/weka/.
[4] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid prototyping for complex data mining tasks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 12, Philadelphia, USA, 2006.
[5] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. 7, pp. 1531–1565, July 2006, http://www.shogun-toolbox.org/.
[6] K. Gawande, C. Webers, A. J. Smola, and S. Vishwanathan, “Elefant: A python machine learning toolbox,” in SciPy Conference, 2007.
[7] R. Kohavi, D. Sommerfield, and J. Dougherty, “Data mining using MLC++, a machine learning library in C++,” in International Conference on Tools with Artificial Intelligence, vol. 8. Toulouse, France: IEEE Computer Society, 1996, p. 234, http://www.sgi.com/tech/mlc/.
[8] J. Demšar, B. Zupan, G. Leban, and T. Curk, “Orange: From experimental machine learning to interactive data mining,” in Knowledge Discovery in Databases: PKDD 2004, ser. Lecture Notes in Computer Science. Berlin/Heidelberg: Springer, 2004, vol. 3202, pp. 537–539.
[9] R. Collobert, S. Bengio, and J. Mariethoz, “Torch: A modular machine learning software library,” IDIAP Research Institute, Tech. Rep. IDIAP-RR 02-46, 2002, http://www.torch.ch/.
[10] J. Weston, A. Elisseeff, G. Bakır, and F. Sinz, “SPIDER: Object oriented machine learning library,” 2003, http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html.
[11] J. Mohr and K. Obermayer, “A topographic support vector machine: Classification using local label configurations,” in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, pp. 929–936.
[12] K.-R. Müller, M. Tangermann, G. Dornhege, M. Krauledat, G. Curio, and B. Blankertz, “Machine learning for real-time single-trial EEG-analysis: From brain-computer interfacing to mental state monitoring,” J. Neurosci. Methods, vol. 167, no. 1, pp. 82–90, 2008.
[13] N. Bhardwaj and H. Lu, “Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions,” FEBS Letters, vol. 581, no. 5, pp. 1058–1066, 2007.
[14] N. Bhardwaj, R. E. Langlois, G. Zhao, and H. Lu, “Kernel-based machine learning protocol for predicting DNA-binding proteins,” Nucleic Acids Research, vol. 33, no. 20, pp. 6486–6493, 2005.
[15] R. Langlois, M. Carson, N. Bhardwaj, and H. Lu, “Learning to translate sequence and structure to function: Identifying DNA binding and membrane binding proteins,” Annals of Biomedical Engineering, vol. 35, no. 6, pp. 1043–1052, 2007.
[16] N. Bhardwaj, R. V. Stahelin, R. E. Langlois, W. Cho, and H. Lu, “Structural bioinformatics prediction of membrane-binding proteins,” Journal of Molecular Biology, vol. 359, no. 2, pp. 486–495, 2006.
[17] R. E. Langlois, A. Diec, O. Perisic, Y. Dai, and H. Lu, “Improved protein fold assignment using support vector machines,” International Journal of Bioinformatics Research and Applications, vol. 1, no. 3, pp. 319–334, 2006.
[18] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[19] R. E. Langlois, “Machine learning in bioinformatics: Algorithms, implementations and applications,” Ph.D. Thesis, University of Illinois at Chicago, Chicago, IL, USA, 2008.
[20] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in International Conference on Machine Learning, vol. 148. Pittsburgh, Pennsylvania: ACM, 2006, pp. 97–104.
[21] W. Buntine, “Learning classification trees,” Statistics and Computing, vol. 2, no. 2, pp. 63–73, 1992.
[22] J. R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77–90, 1996.
[23] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” in International Conference on Machine Learning, vol. 16, Bled, Slovenia, 1999.
[24] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[25] P. Bühlmann, “Bagging, subagging and bragging for improving some prediction algorithms,” in Recent Advances and Trends in Nonparametric Statistics, M. G. Akritas and D. N. Politis, Eds. North Holland: Elsevier, 2003, pp. 19–34.
[26] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[27] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[28] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[29] B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in IEEE International Conference on Data Mining, vol. 3, Melbourne, Florida, 2003, p. 435.
[30] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers, P. J. Bartlett, B. Schölkopf, D. Schuurmans, and A. J. Smola, Eds. Boston: MIT Press, 1999, pp. 61–74.
[31] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate multiclass probability estimates,” in Special Interest Group on Knowledge Discovery and Data Mining, vol. 8. Edmonton, Alberta, Canada: ACM Press, 2002, pp. 694–699.
[32] J. Langford and B. Zadrozny, “Estimating class membership probabilities using classifier learners,” in International Workshop on Artificial Intelligence and Statistics, vol. 10, Barbados, 2005.
[33] C. Drummond and R. C. Holte, “Cost curves: An improved method for visualizing classifier performance,” Machine Learning, vol. 65, no. 1, pp. 95–130, 2006.
[34] A. Blum, A. Kalai, and J. Langford, “Beating the hold-out: Bounds for k-fold and progressive cross-validation,” in COLT: Computational Learning Theory, vol. 12. Santa Cruz, California: ACM, 1999, pp. 203–208.
