Mahout part2

Mahout in Action
Part 2

Yasmine M. Gaber
4 April 2013

Agenda

Part 2: Clustering

Part 3: Classification

Clustering

An algorithm


A notion of both similarity and dissimilarity


A stopping condition

Measuring the similarity of items

Euclidean Distance

Creating the input

Preprocess the data

Use that data to create vectors

Save the vectors in SequenceFile format as input for the
algorithm

Using Mahout clustering

The SequenceFile containing the input
vectors.

The SequenceFile containing the initial cluster
centers.

The similarity measure to be used.

The convergenceThreshold.

The number of iterations to be done.

The Vector implementation used in the input
files.
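
The inputs listed above map onto the arguments of a basic k-means loop. The following is a minimal pure-Python sketch of that loop, not Mahout's implementation; the function name `kmeans` and its parameter names are hypothetical, chosen to mirror the slide's list (input vectors, initial centers, distance measure, convergence threshold, iteration limit).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, centers, distance, convergence_threshold, max_iterations):
    """Illustrative k-means loop (hypothetical helper, not a Mahout API)."""
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest center.
        groups = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(v, centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, groups):
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(center)  # keep a center that lost all members
        # Stop once no center moved farther than the threshold.
        shift = max(distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_threshold:
            break
    return centers

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
print(kmeans(points, [[0.0, 0.0], [10.0, 0.0]], euclidean, 0.01, 10))
# [[0.0, 1.0], [10.0, 1.0]]
```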

Distance measures

Euclidean distance measure


Squared Euclidean distance measure


Manhattan distance measure
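
As a rough illustration of the three measures above, here is a small pure-Python sketch using the textbook definitions. The function names are hypothetical and this is not Mahout's Java code.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
```

The squared variant skips the square root, so it is cheaper to compute and penalizes distant points more heavily.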

Distance measures

Cosine distance measure


Tanimoto distance measure
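
The two measures above can be sketched from their usual textbook definitions: cosine distance as one minus the cosine of the angle between the vectors, and Tanimoto distance as one minus the ratio of the dot product to the sum of the squared lengths minus the dot product. The function names are hypothetical, and the sketch assumes non-zero input vectors.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle); depends on direction only, not on vector length."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto_distance(a, b):
    """1 - a.b / (|a|^2 + |b|^2 - a.b); sensitive to both angle and length."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

# Orthogonal vectors are maximally distant under both measures.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # 1.0
print(tanimoto_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

For two vectors pointing the same way but with different lengths, such as `[1, 1]` and `[2, 2]`, the cosine distance is approximately zero while the Tanimoto distance is not, which is the practical difference between the two.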

Representing text documents as
vectors

Vector Space Model (VSM)

TF-IDF


N-gram collocations
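
To make TF-IDF weighting and n-grams concrete, here is a minimal sketch. It assumes the plain textbook weight tf * log(N / df); the exact formula Mahout applies may differ, and the function names `ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Consecutive n-word sequences, e.g. bigrams for n=2."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Weight each term by term frequency * log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["bank", "dollar", "bank"], ["bank", "oil"]]
for doc_weights in tfidf(docs):
    print(doc_weights)
print(ngrams(["coca", "cola", "company"], 2))  # ['coca cola', 'cola company']
```

A term that occurs in every document ("bank" here) gets weight zero, which is how TF-IDF suppresses words that do not help distinguish documents.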

Generating vectors from documents

$ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles

$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow

Improving quality of vectors using
normalization

P-norm


$ bin/mahout seq2sparse -i reuters-seqfiles/ \
    -o reuters-normalized-bigram -ow \
    -a org.apache.lucene.analysis.WhitespaceAnalyzer \
    -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 \
    -ml 50 -seq -n 2
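
P-norm normalization divides each vector by its p-norm so that long and short documents become comparable. The sketch below is a minimal illustration with hypothetical function names; it assumes the `-n 2` option in the command above selects the 2-norm.

```python
def p_norm(v, p):
    """(sum |x|^p)^(1/p); p=1 is the Manhattan norm, p=2 the Euclidean norm."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def normalize(v, p=2):
    """Scale v so that its p-norm is 1; a zero vector is returned unchanged."""
    norm = p_norm(v, p)
    return list(v) if norm == 0 else [x / norm for x in v]

print(normalize([3.0, 4.0], 2))
print(normalize([1.0, 3.0], 1))  # [0.25, 0.75]
```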

Clustering Categories

Exclusive clustering

Overlapping clustering

Hierarchical clustering

Probabilistic clustering

Clustering Approaches


Fixed number of centers


Bottom-up approach


Top-down approach

Clustering algorithms

K-means clustering


Fuzzy k-means clustering


Dirichlet clustering

Running k-means clustering

$ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ \
    -c reuters-initial-clusters -o reuters-kmeans-clusters \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -cd 1.0 -k 20 -x 20 -cl

$ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ \
    -c reuters-initial-clusters -o reuters-kmeans-clusters \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -k 20 -x 20 -cl

$ bin/mahout clusterdump -dt sequencefile -d

Fuzzy k-means clustering

Instead of the exclusive clustering in k-means,
fuzzy k-means tries to generate overlapping
clusters from the data set.


Also known as fuzzy c-means algorithm.

Running fuzzy k-means clustering

$ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ \
    -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters \
    -cd 1.0 -k 21 -m 2 -ow -x 10 \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

Fuzziness factor
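
The fuzziness factor (the `-m` option above) controls how much clusters overlap. A minimal sketch of the standard fuzzy c-means membership formula, u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)), shows the effect; the function name `memberships` is hypothetical and this is not Mahout's code.

```python
def memberships(distances, m=2.0):
    """Degree of membership of one point in each cluster.

    distances: distance from the point to each cluster center.
    m: fuzziness factor, must be greater than 1.
    """
    if m <= 1.0:
        raise ValueError("fuzziness factor m must be greater than 1")
    # A point sitting exactly on a center belongs fully to that center.
    zeros = [d == 0 for d in distances]
    if any(zeros):
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** exponent for dj in distances)
            for di in distances]

print(memberships([1.0, 3.0], m=2.0))  # roughly [0.9, 0.1]
print(memberships([1.0, 3.0], m=5.0))  # closer to an even split
```

Values of m near 1 push memberships toward 0 or 1, so the result behaves like ordinary k-means; larger values spread each point more evenly across clusters.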

Dirichlet clustering

A model-based clustering algorithm

Running Dirichlet clustering

$ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors \
    -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 \
    -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution \
    -mp org.apache.mahout.math.SequentialAccessSparseVector

Evaluating and improving clustering
quality

Inspecting clustering output

Evaluating the quality of clustering

Improving clustering quality

Inspecting clustering output

$ bin/mahout clusterdump -s kmeans-output/clusters-19/ \
    -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10


Top Terms:
    said   => 11.60126582278481
    bank   => 5.943037974683544
    dollar =>
Analyzing clustering output

Distance measure and feature selection

Inter-cluster and intra-cluster distances

Mixed and overlapping clusters
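
One common way to quantify the inter-cluster and intra-cluster distances mentioned above is to average pairwise distances: between centroids for the former, between members of one cluster for the latter. The sketch below assumes Euclidean distance and uses hypothetical function names; it is an illustration, not Mahout's evaluation code.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def intra_cluster_distance(points):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def inter_cluster_distance(clusters):
    """Average distance between all pairs of cluster centroids."""
    pairs = list(combinations([centroid(c) for c in clusters], 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
print(intra_cluster_distance(clusters[0]))  # 2.0
print(inter_cluster_distance(clusters))     # 10.0
```

A good clustering tends to show small intra-cluster distances together with large inter-cluster distances.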

Improving clustering quality

Improving document vector generation

Writing a custom distance measure

Real-world applications of clustering

Clustering like-minded people on Twitter


Suggesting tags for an artist on Last.fm using
clustering


Creating a related-posts feature for a website

Classification

Classification is a process of using specific
information (input) to choose a single selection
(output) from a short list of predetermined
potential responses.

Applications of classification, e.g. spam
filtering

Why use Mahout for classification?

Classification

Training versus test versus production

Predictor variables versus target variable

Records, fields, and values

Types of values for predictor
variables

Continuous

Categorical

Word-like

Text-like

Classification workflow

Training the model


Evaluating the model


Using the model in production

Stage 1: training the classification model

Stage 2: evaluating the classification model

Stage 3: using the model in production

Stage 1: training the classification
model

Define Categories for the Target Variable

Collect Historical Data

Define Predictor Variables

Select a Learning Algorithm to Train the Model

Use Learning Algorithm to Train the Model

Extracting features to build a
Mahout classifier

Preprocessing raw data into
classifiable data

Converting classifiable data into
vectors

Use one Vector cell per word, category, or
continuous value

Represent Vectors implicitly as bags of words

Use feature hashing
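
Feature hashing maps each word to a vector cell by hashing it, so no dictionary has to be built or stored. The sketch below is a minimal illustration; the function name `hashed_vector`, the vector size, and the choice of MD5 as the hash are all assumptions made for the example, not what Mahout's encoders do.

```python
import hashlib

def hashed_vector(tokens, size=16):
    """Encode tokens into a fixed-size count vector via hashing."""
    vec = [0.0] * size
    for token in tokens:
        # A stable hash is used because Python's built-in hash() of strings
        # changes between interpreter runs.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vec[index] += 1.0  # different tokens may collide in the same cell
    return vec

print(hashed_vector(["free", "money", "free"]))
```

The vector length is fixed up front, so previously unseen words need no special handling; the cost is that unrelated words can share a cell.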

Classifying the 20 newsgroups data
set

The classifier evaluation API

Percent correct

Confusion matrix

Entropy matrix

AUC

Log likelihood
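
Several of the metrics above can be computed by hand, which helps when interpreting what an evaluation API reports. The sketch below uses the usual textbook definitions with hypothetical function names; it does not reproduce Mahout's evaluation classes.

```python
import math
from collections import Counter

def percent_correct(actual, predicted):
    """Share of records whose predicted label matches the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

def confusion_matrix(actual, predicted):
    """Counts keyed by (actual label, predicted label)."""
    return Counter(zip(actual, predicted))

def average_log_likelihood(actual, probabilities):
    """Mean log of the probability the model gave to the correct label."""
    return sum(math.log(p[a]) for a, p in zip(actual, probabilities)) / len(actual)

def auc(positive_scores, negative_scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))

actual = ["spam", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "ham"]
print(percent_correct(actual, predicted))                    # 75.0
print(confusion_matrix(actual, predicted)[("spam", "ham")])  # 1
print(auc([0.9, 0.8], [0.1, 0.85]))                          # 0.75
```

Percent correct is easy to read but hides which classes are confused with which; the confusion matrix exposes that, while AUC and log likelihood also take the model's scores into account.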

When classifiers go bad

Target leaks

Broken feature extraction

Tuning the problem

Remove Fluff Variables

Add New Variables, Interactions, and Derived
Values

Tuning the classifier

Try Alternative Algorithms

Tune the Learning Algorithm

Thank You

Contact at:
Email: Yasmine.Gaber@espace.com.eg
Twitter: Twitter.com/yasmine_mohamed
