5. Creating the input
Preprocess the data
Use that data to create vectors
Save the vectors in SequenceFile format as input for the
algorithm
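The first two steps above can be sketched in plain Python. This is only an illustration, not Mahout's own tooling: the tokenizer, the vocabulary layout, and the names `preprocess`, `build_vocabulary`, and `to_vector` are assumptions made for the example. The third step, writing a Hadoop SequenceFile, is left out because it needs the Hadoop libraries.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(token_lists):
    """Give each distinct token a fixed vector index (sorted for determinism)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def to_vector(tokens, vocab):
    """Turn a token list into a dense term-frequency vector."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokens).items():
        vec[vocab[tok]] = float(count)
    return vec

# Hypothetical two-document corpus, used only for illustration.
docs = ["Red apple, red wine", "Green apple"]
token_lists = [preprocess(d) for d in docs]
vocab = build_vocabulary(token_lists)
vectors = [to_vector(t, vocab) for t in token_lists]
```

Each document ends up as one vector with one cell per vocabulary word, which is the shape a clustering algorithm expects as input.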
6. Using Mahout clustering
The SequenceFile containing the input
vectors.
The SequenceFile containing the initial cluster
centers.
The similarity measure to be used.
The convergenceThreshold.
The maximum number of iterations to run.
The Vector implementation used in the input
files.
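How these parameters interact can be sketched with a small k-means loop in plain Python. This is not Mahout's implementation or API: the function name `kmeans`, the argument names, and the default values are assumptions for illustration, and plain lists stand in for a Vector implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance, used here as the (dis)similarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, centers, distance=euclidean,
           convergence_threshold=0.001, max_iterations=10):
    """Refine the initial centers until they stop moving or iterations run out."""
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        # Assign every input vector to its nearest center.
        groups = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(v, centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its members (keep it if it has none).
        new_centers = []
        for old, members in zip(centers, groups):
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)
        # Stop once no center moved farther than the convergence threshold.
        shift = max(distance(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_threshold:
            break
    return centers

# Hypothetical input vectors and initial centers.
points = [[1, 1], [1, 2], [8, 8], [9, 8]]
final_centers = kmeans(points, [[0, 0], [10, 10]])
```

The convergence threshold and the iteration limit are two separate stopping rules: the loop ends on whichever is reached first.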
21. Fuzzy k-means clustering
Instead of the exclusive clustering of k-means,
fuzzy k-means tries to generate overlapping
clusters from the data set: a point can belong to
more than one cluster, each with a degree of membership.
Also known as the fuzzy c-means algorithm.
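The overlap can be made concrete with the standard fuzzy c-means membership formula, sketched below in plain Python. The function name `memberships` and the fuzziness exponent `m` (not mentioned on the slide) are assumptions for illustration, and this is not Mahout's code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def memberships(point, centers, m=2.0):
    """Degree of membership of one point in every cluster; the degrees sum to 1."""
    dists = [euclidean(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to it (split evenly on ties).
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    # Standard fuzzy c-means rule: closer centers get larger degrees.
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]

# Hypothetical centers and point: the point is 1 unit from the first
# center and 3 units from the second.
degrees = memberships([1.0, 0.0], [[0.0, 0.0], [4.0, 0.0]])
```

Where k-means would assign this point to the first cluster only, the fuzzy variant keeps a small degree of membership in the second cluster as well.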
30. Real-world applications of clustering
Clustering like-minded people on Twitter
Suggesting tags for an artist on Last.fm using
clustering
Creating a related-posts feature for a website
31. Classification
Classification is a process of using specific
information (input) to choose a single selection
(output) from a short list of predetermined
potential responses.
Applications of classification, e.g. spam
filtering
37. Stage 1: training the classification
model
Stage 2: evaluating the classification
model
Stage 3: using the model in production
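The three stages can be sketched end to end with a toy spam filter in plain Python. The slides do not name a learning algorithm here, so this example swaps in a small naive Bayes word-count model with add-one smoothing; the function names and the tiny data set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Stage 1: count words per category from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(model, text):
    """Stage 3: pick the category with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def evaluate(model, held_out):
    """Stage 2: fraction of held-out examples classified correctly."""
    correct = sum(classify(model, text) == label for text, label in held_out)
    return correct / len(held_out)

# Hypothetical historical data: training examples and a held-out set.
training = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
held_out = [("cash offer now", "spam"), ("agenda for lunch", "ham")]

model = train(training)              # Stage 1
accuracy = evaluate(model, held_out)  # Stage 2
prediction = classify(model, "lunch meeting agenda")  # Stage 3
```

Evaluation uses examples the model never saw during training, so the accuracy figure says something about how the model would behave in production.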
38. Stage 1: training the classification
model
Define categories for the target variable
Collect historical data
Define predictor variables
Select a learning algorithm to train the model
Use the learning algorithm to train the model
41. Converting classifiable data into
vectors
Use one Vector cell per word, category, or
continuous value
Represent Vectors implicitly as bags of words
Use feature hashing
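The difference between a bag of words and feature hashing can be sketched in plain Python. This is a simplified illustration rather than Mahout's encoder classes: the function names, the default of 16 cells, and the use of CRC32 as the hash function are all assumptions.

```python
import zlib
from collections import Counter

def bag_of_words(tokens):
    """Implicit representation: only the words that occur, with their counts."""
    return dict(Counter(tokens))

def hash_features(tokens, num_cells=16):
    """Feature hashing: map each word to a cell by hashing it, with no vocabulary."""
    vec = [0.0] * num_cells
    for tok in tokens:
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(tok.encode("utf-8")) % num_cells
        vec[index] += 1.0
    return vec

# Hypothetical token list.
tokens = ["spam", "offer", "spam"]
bag = bag_of_words(tokens)
hashed = hash_features(tokens, num_cells=8)
```

The hashed vector has a fixed size no matter how many distinct words appear, at the cost that two different words may collide in the same cell.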