2. Weka: An Introduction 2
WEKA
Waikato Environment for Knowledge Analysis (WEKA)
Developed by the Department of Computer Science, University
of Waikato, New Zealand
Machine learning/data mining software written in Java
(distributed under the GNU General Public License)
Used for research, education, and applications
http://www.cs.waikato.ac.nz/ml/weka/
Weka Interfaces
Explorer
Preprocessing, attribute selection, learning, visualization
Knowledge Flow
Visual design of the KDD (knowledge discovery in databases) process
Experimenter
Testing and evaluating machine learning algorithms
Command-line
Data Formats
Uses flat text files to describe the data
Native format is ARFF (“.arff” files)
Data can also be imported from other file formats, including CSV and C4.5
Data Formats (Contd.)
ARFF (Attribute-Relation File Format)
@relation person
@attribute age numeric
@attribute name string
@attribute education {College, Masters, Doctorate}
@attribute class {>50K,<=50K}
@data
Supported Data types
Numeric
String
Nominal
Date
Relational
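To make the header layout above concrete, here is a minimal sketch in plain Java (it does not use the Weka library) that reads the @attribute declarations of an ARFF header and reports each attribute's declared type. The class and method names (ArffHeaderSketch, parseAttributes, isNominal) are made up for illustration, and the parser assumes unquoted attribute names without spaces; Weka's own ARFF loader handles many more cases.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Weka: a simplified ARFF header reader.
class ArffHeaderSketch {

    /** Returns attribute name -> declared type, in declaration order. */
    static Map<String, String> parseAttributes(String arff) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String rawLine : arff.split("\\R")) {
            String line = rawLine.trim();
            // Skip blank lines and '%' comment lines.
            if (line.isEmpty() || line.startsWith("%")) {
                continue;
            }
            String lower = line.toLowerCase();
            if (lower.startsWith("@data")) {
                break; // the header ends where the data section begins
            }
            if (lower.startsWith("@attribute")) {
                // Split into keyword, name, and the remainder (the type).
                String[] parts = line.split("\\s+", 3);
                if (parts.length == 3) {
                    attributes.put(parts[1], parts[2]);
                }
            }
        }
        return attributes;
    }

    /** Nominal attributes list their allowed values in braces. */
    static boolean isNominal(String type) {
        return type.startsWith("{");
    }

    public static void main(String[] args) {
        String header = String.join("\n",
            "@relation person",
            "@attribute age numeric",
            "@attribute name string",
            "@attribute education {College, Masters, Doctorate}",
            "@attribute class {>50K,<=50K}",
            "@data");
        parseAttributes(header).forEach((name, type) ->
            System.out.println(name + " -> " + type
                + (isNominal(type) ? " (nominal)" : "")));
    }
}
```

Run on the header from the slide, this lists age, name, education, and class with their types, flagging the last two as nominal.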
Explorer
Supports Exploratory Data Analysis
Preprocess: Choose and modify the data being acted on.
Classify: Train and test learning schemes that classify or
perform regression.
Cluster: Learn clusters for the data.
Associate: Learn association rules for the data.
Select attributes: Select the most relevant attributes in the
data.
Visualize: View an interactive 2D plot of the data.
Explorer - Preprocessing
Loading Data
Open file
Open URL
Open DB
Generate (artificial data)
Native format: ARFF
Supports conversion from other file formats
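As a rough illustration of what a file conversion involves, the following plain-Java sketch turns a small CSV text into ARFF text. It is not Weka's converter: the class name CsvToArffSketch is hypothetical, the sample rows are invented, and the code assumes a header row, comma-separated unquoted fields, and no missing values. A column is declared numeric if every cell parses as a number, and nominal otherwise.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Weka: a simplified CSV-to-ARFF conversion.
class CsvToArffSketch {

    static String convert(String relation, String csv) {
        String[] lines = csv.trim().split("\\R");
        String[] names = lines[0].split(",");
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            rows.add(lines[i].split(",", -1));
        }

        StringBuilder arff = new StringBuilder();
        arff.append("@relation ").append(relation).append('\n');
        for (int col = 0; col < names.length; col++) {
            boolean numeric = true;
            Set<String> values = new LinkedHashSet<>(); // distinct values, first-seen order
            for (String[] row : rows) {
                String cell = row[col].trim();
                values.add(cell);
                try {
                    Double.parseDouble(cell);
                } catch (NumberFormatException e) {
                    numeric = false;
                }
            }
            String type = numeric ? "numeric" : "{" + String.join(",", values) + "}";
            arff.append("@attribute ").append(names[col].trim())
                .append(' ').append(type).append('\n');
        }
        arff.append("@data\n");
        for (String[] row : rows) {
            arff.append(String.join(",", row)).append('\n');
        }
        return arff.toString();
    }

    public static void main(String[] args) {
        // Invented sample rows, for illustration only.
        String csv = "age,education\n25,College\n40,Masters\n";
        System.out.print(convert("person", csv));
    }
}
```

The output starts with the @relation and @attribute lines, followed by @data and the original rows unchanged.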
Explorer – Applying Filters
Supervised vs. unsupervised filters
Attribute vs. instance filters
Unsupervised attribute filters
Add: adds a new attribute
Normalize: scales all numeric values into a common range ([0, 1] by default)
Remove: removes attributes (see also RemoveType, RemoveUseless)
Unsupervised instance filters
Randomize: randomizes the order of the instances in a dataset
RemoveWithValues: filters out instances with certain attribute values
Supervised attribute filters
AttributeSelection: applies attribute selection methods
Discretize: converts numeric attributes to nominal ones
Supervised instance filters
Resample: produces a random subsample of a dataset
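To show what scaling numeric values means in practice, here is a minimal plain-Java sketch of min-max scaling to the range [0, 1], the kind of transformation the Normalize filter applies to each numeric attribute. The class name NormalizeSketch and the sample ages are invented for illustration; this is not Weka's implementation, and mapping a constant column to 0 is an assumption made here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Weka: min-max scaling of one numeric column.
class NormalizeSketch {

    /** Rescales values linearly so the minimum maps to 0 and the maximum to 1. */
    static double[] minMax(double[] values) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Assumption: a constant column (range 0) is mapped to 0.
            scaled[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] ages = {20, 30, 40, 60}; // invented sample values
        System.out.println(Arrays.toString(minMax(ages)));
        // prints [0.0, 0.25, 0.5, 1.0]
    }
}
```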
Knowledge Flow Interface
Data-flow inspired interface to WEKA
Processes data in batches or incrementally
Processes multiple batches or streams in parallel (each separate flow executes in its own thread)
Chains filters together
Visualizes the performance of incremental classifiers during processing
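One common way to track the performance of an incremental classifier during processing is test-then-train evaluation: each incoming instance is first used to test the current model and only then used to update it. The plain-Java sketch below illustrates that idea with a toy majority-class model. The names PrequentialSketch and MajorityClass and the sample label stream are invented, and nothing here is taken from the Knowledge Flow code itself.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not part of Weka: test-then-train evaluation.
class PrequentialSketch {

    /** Toy incremental model: predicts the label seen most often so far. */
    static final class MajorityClass {
        private final Map<String, Integer> counts = new LinkedHashMap<>();

        /** Returns null before any instance has been seen. */
        String predict() {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) { // ties go to the label seen first
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }

        void update(String label) {
            counts.merge(label, 1, Integer::sum);
        }
    }

    /** Running accuracy after each instance: test first, then train. */
    static double[] runningAccuracy(String[] labels) {
        MajorityClass model = new MajorityClass();
        double[] accuracy = new double[labels.length];
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].equals(model.predict())) {
                correct++;
            }
            model.update(labels[i]); // the model learns only after being tested
            accuracy[i] = (double) correct / (i + 1);
        }
        return accuracy;
    }

    public static void main(String[] args) {
        String[] stream = {"<=50K", "<=50K", ">50K", "<=50K"}; // invented label stream
        System.out.println(Arrays.toString(runningAccuracy(stream)));
    }
}
```

Plotting the returned values against the instance index gives the kind of learning curve that an incremental evaluation is meant to show.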
Experimenter Interface
Enables the user to create, run, modify, and analyse
experiments more conveniently than is possible when
processing the learning schemes individually
Modes of Operation
Simple
Advanced
Local / Remote Experiments are supported
Command Line Interface
Plain-text panel in which commands can be entered
java <classname> [<args>]: invokes a Java class with the given arguments (if any)
break: stops the current thread, e.g., a running classifier, in a friendly manner
kill: stops the current thread in an unfriendly fashion
cls: clears the output area
exit: exits the Simple CLI
help [<command>]: lists the available commands, or gives help on the specified command
Weka Operation
The operating system's command-line interface can also be
used once the CLASSPATH is set to include weka.jar.
All of Weka's functionality can also be invoked from one's
own Java source code.
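As a small illustration of the command-line route, the sketch below assembles the command that would run a Weka classifier from the operating system's shell. Passing -cp on the java command has the same effect as setting CLASSPATH, and -t names the training file. The class WekaCommandSketch, the jar location weka.jar, and the file name data.arff are assumptions for illustration; the launch itself is left commented out because it requires a local Weka installation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds (but does not run) a Weka command line.
class WekaCommandSketch {

    static List<String> buildCommand(String wekaJar, String classifier, String trainingFile) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-cp");       // class path, equivalent to setting CLASSPATH
        command.add(wekaJar);
        command.add(classifier);  // fully qualified class name of the scheme
        command.add("-t");        // training file option
        command.add(trainingFile);
        return command;
    }

    public static void main(String[] args) {
        // weka.jar and data.arff are placeholder paths.
        List<String> command =
            buildCommand("weka.jar", "weka.classifiers.trees.J48", "data.arff");
        System.out.println(String.join(" ", command));
        // prints: java -cp weka.jar weka.classifiers.trees.J48 -t data.arff

        // To launch it for real (needs Weka installed; start() and waitFor()
        // throw checked exceptions that main would have to declare):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```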
Weka Extensions
BioWeka - Extension library for knowledge discovery in
biology
WekaMetal - Meta learning extension to WEKA
Weka-Parallel - Parallel processing for WEKA
Grid Weka - Grid computing using WEKA