WEKA is a popular open source machine learning toolkit written in Java. It contains algorithms for data pre-processing, classification, regression, clustering, association rules, and visualization. WEKA has graphical user interfaces for exploring data and evaluating models, as well as tools for performing experiments and comparing machine learning algorithms. It supports common data formats and can operate on datasets stored in files or databases. WEKA is widely used for research and applications involving machine learning and data mining.
1. WEKA: A Machine
Machine Learning with Learning Toolkit
WEKA The Explorer
• Classification and
Regression
• Clustering
Eibe Frank • Association Rules
• Attribute Selection
Department of Computer Science,
University of Waikato, New Zealand • Data Visualization
The Experimenter
The Knowledge
Flow GUI
Conclusions
WEKA: the bird
Copyright: Martin Kramer (mkramer@wxs.nl)
2/4/2004 University of Waikato 2
Machine Learning for Data Mining 1
2. WEKA: the software
Machine learning/data mining software written in
Java (distributed under the GNU Public License)
Used for research, education, and applications
Complements “Data Mining” by Witten & Frank
Main features:
Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
Graphical user interfaces (incl. data visualization)
Environment for comparing learning algorithms
2/4/2004 University of Waikato 3
WEKA: versions
There are several versions of WEKA:
WEKA 3.0: “book version” compatible with
description in data mining book
WEKA 3.2: “GUI version” adds graphical user
interfaces (book version is command-line only)
WEKA 3.3: “development version” with lots of
improvements
This talk is based on the latest snapshot of WEKA
3.3 (soon to be WEKA 3.4)
2/4/2004 University of Waikato 4
Machine Learning for Data Mining 2
3. WEKA only deals with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
2/4/2004 University of Waikato 5
WEKA only deals with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
2/4/2004 University of Waikato 6
Machine Learning for Data Mining 3
4. 2/4/2004 University of Waikato 7
2/4/2004 University of Waikato 8
Machine Learning for Data Mining 4
5. 2/4/2004 University of Waikato 9
Explorer: pre-processing the data
Data can be imported from a file in various
formats: ARFF, CSV, C4.5, binary
Data can also be read from a URL or from an SQL
database (using JDBC)
Pre-processing tools in WEKA are called “filters”
WEKA contains filters for:
Discretization, normalization, resampling, attribute
selection, transforming and combining attributes, …
2/4/2004 University of Waikato 10
Machine Learning for Data Mining 5
6. 2/4/2004 University of Waikato 11
2/4/2004 University of Waikato 12
Machine Learning for Data Mining 6
7. 2/4/2004 University of Waikato 13
2/4/2004 University of Waikato 14
Machine Learning for Data Mining 7
8. 2/4/2004 University of Waikato 15
2/4/2004 University of Waikato 16
Machine Learning for Data Mining 8
9. 2/4/2004 University of Waikato 17
2/4/2004 University of Waikato 18
Machine Learning for Data Mining 9
10. 2/4/2004 University of Waikato 19
2/4/2004 University of Waikato 20
Machine Learning for Data Mining 10
11. 2/4/2004 University of Waikato 21
2/4/2004 University of Waikato 22
Machine Learning for Data Mining 11
12. 2/4/2004 University of Waikato 23
2/4/2004 University of Waikato 24
Machine Learning for Data Mining 12
13. 2/4/2004 University of Waikato 25
2/4/2004 University of Waikato 26
Machine Learning for Data Mining 13
14. 2/4/2004 University of Waikato 27
2/4/2004 University of Waikato 28
Machine Learning for Data Mining 14
15. 2/4/2004 University of Waikato 29
2/4/2004 University of Waikato 30
Machine Learning for Data Mining 15
16. 2/4/2004 University of Waikato 31
Explorer: building “classifiers”
Classifiers in WEKA are models for predicting
nominal or numeric quantities
Implemented learning schemes include:
Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …
“Meta”-classifiers include:
Bagging, boosting, stacking, error-correcting output
codes, locally weighted learning, …
2/4/2004 University of Waikato 32
Machine Learning for Data Mining 16
17. 2/4/2004 University of Waikato 33
2/4/2004 University of Waikato 34
Machine Learning for Data Mining 17
18. 2/4/2004 University of Waikato 35
2/4/2004 University of Waikato 36
Machine Learning for Data Mining 18
19. 2/4/2004 University of Waikato 37
2/4/2004 University of Waikato 38
Machine Learning for Data Mining 19
20. 2/4/2004 University of Waikato 53
Explorer: clustering data
WEKA contains “clusterers” for finding groups of
similar instances in a dataset
Implemented schemes are:
k-Means, EM, Cobweb, X-means, FarthestFirst
Clusters can be visualized and compared to “true”
clusters (if given)
Evaluation based on loglikelihood if clustering
scheme produces a probability distribution
2/4/2004 University of Waikato 92
Machine Learning for Data Mining 20
21. Explorer: finding associations
WEKA contains an implementation of the Apriori
algorithm for learning association rules
Works only with discrete data
Can identify statistical dependencies between
groups of attributes:
milk, butter ⇒ bread, eggs (with confidence 0.9 and
support 2000)
Apriori can compute all rules that have a given
minimum support and exceed a given confidence
2/4/2004 University of Waikato 108
2/4/2004 University of Waikato 109
Machine Learning for Data Mining 21
22. 2/4/2004 University of Waikato 110
2/4/2004 University of Waikato 111
Machine Learning for Data Mining 22
23. 2/4/2004 University of Waikato 112
2/4/2004 University of Waikato 113
Machine Learning for Data Mining 23
24. 2/4/2004 University of Waikato 114
2/4/2004 University of Waikato 115
Machine Learning for Data Mining 24
25. Explorer: attribute selection
Panel that can be used to investigate which
(subsets of) attributes are the most predictive ones
Attribute selection methods contain two parts:
A search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
Very flexible: WEKA allows (almost) arbitrary
combinations of these two
2/4/2004 University of Waikato 116
2/4/2004 University of Waikato 117
Machine Learning for Data Mining 25
26. 2/4/2004 University of Waikato 118
2/4/2004 University of Waikato 119
Machine Learning for Data Mining 26
27. 2/4/2004 University of Waikato 120
2/4/2004 University of Waikato 121
Machine Learning for Data Mining 27
28. 2/4/2004 University of Waikato 122
2/4/2004 University of Waikato 123
Machine Learning for Data Mining 28
29. 2/4/2004 University of Waikato 124
Explorer: data visualization
Visualization very useful in practice: e.g. helps to
determine difficulty of the learning problem
WEKA can visualize single attributes (1-d) and
pairs of attributes (2-d)
To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes (and
to detect “hidden” data points)
“Zoom-in” function
2/4/2004 University of Waikato 125
Machine Learning for Data Mining 29
30. 2/4/2004 University of Waikato 126
2/4/2004 University of Waikato 127
Machine Learning for Data Mining 30
31. 2/4/2004 University of Waikato 128
2/4/2004 University of Waikato 129
Machine Learning for Data Mining 31
32. 2/4/2004 University of Waikato 130
2/4/2004 University of Waikato 131
Machine Learning for Data Mining 32
33. 2/4/2004 University of Waikato 132
2/4/2004 University of Waikato 133
Machine Learning for Data Mining 33
34. 2/4/2004 University of Waikato 134
2/4/2004 University of Waikato 135
Machine Learning for Data Mining 34
35. 2/4/2004 University of Waikato 136
2/4/2004 University of Waikato 137
Machine Learning for Data Mining 35
36. Performing experiments
Experimenter makes it easy to compare the
performance of different learning schemes
For classification and regression problems
Results can be written into file or database
Evaluation options: cross-validation, learning
curve, hold-out
Can also iterate over different parameter settings
Significance-testing built in!
2/4/2004 University of Waikato 138
The Knowledge Flow GUI
New graphical user interface for WEKA
Java-Beans-based interface for setting up and
running machine learning experiments
Data sources, classifiers, etc. are beans and can
be connected graphically
Data “flows” through components: e.g.,
“data source” -> “filter” -> “classifier” -> “evaluator”
Layouts can be saved and loaded again later
2/4/2004 University of Waikato 152
Machine Learning for Data Mining 36
37. 2/4/2004 University of Waikato 153
2/4/2004 University of Waikato 154
Machine Learning for Data Mining 37
38. 2/4/2004 University of Waikato 155
2/4/2004 University of Waikato 156
Machine Learning for Data Mining 38
39. 2/4/2004 University of Waikato 157
2/4/2004 University of Waikato 158
Machine Learning for Data Mining 39
40. 2/4/2004 University of Waikato 159
2/4/2004 University of Waikato 160
Machine Learning for Data Mining 40
41. 2/4/2004 University of Waikato 161
2/4/2004 University of Waikato 162
Machine Learning for Data Mining 41
42. 2/4/2004 University of Waikato 163
2/4/2004 University of Waikato 164
Machine Learning for Data Mining 42
43. 2/4/2004 University of Waikato 165
2/4/2004 University of Waikato 166
Machine Learning for Data Mining 43
44. 2/4/2004 University of Waikato 167
2/4/2004 University of Waikato 168
Machine Learning for Data Mining 44
45. 2/4/2004 University of Waikato 169
2/4/2004 University of Waikato 170
Machine Learning for Data Mining 45
46. 2/4/2004 University of Waikato 171
2/4/2004 University of Waikato 172
Machine Learning for Data Mining 46
47. Conclusion: try it yourself!
WEKA is available at
http://www.cs.waikato.ac.nz/ml/weka
Also has a list of projects based on WEKA
WEKA contributors:
Abdelaziz Mahoui, Alexander K. Seewald, Ashraf M. Kibriya, Bernhard
Pfahringer , Brent Martin, Peter Flach, Eibe Frank ,Gabi Schmidberger
,Ian H. Witten , J. Lindgren, Janice Boughton, Jason Wells, Len Trigg,
Lucio de Souza Coelho, Malcolm Ware, Mark Hall ,Remco Bouckaert ,
Richard Kirkby, Shane Butler, Shane Legg, Stuart Inglis, Sylvain Roy,
Tony Voyle, Xin Xu, Yong Wang, Zhihai Wang
2/4/2004 University of Waikato 173
Machine Learning for Data Mining 47