Mahout in Action
         Part 2


    Yasmine M. Gaber
       4 April 2013
Agenda

    Part 2: Clustering

    Part 3: Classification
Clustering

    An algorithm


    A notion of both similarity and dissimilarity


    A stopping condition
Measuring the similarity of items

    Euclidean Distance
Creating the input

    Preprocess the data

    Use that data to create vectors

    Save the vectors in SequenceFile format as input for the
    algorithm
Using Mahout clustering

    The SequenceFile containing the input
    vectors.

    The SequenceFile containing the initial cluster
    centers.

    The similarity measure to be used.

    The convergenceThreshold.

    The number of iterations to be done.

    The Vector implementation used in the input
    files.
Distance measures

    Euclidean distance measure




    Squared Euclidean distance measure


    Manhattan distance measure
Distance measures

    Cosine distance measure




    Tanimoto distance measure
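
The five measures listed above can be sketched in plain Python. This is an illustrative sketch using the textbook definitions, not Mahout's own classes; the function names are hypothetical, and the cosine and Tanimoto versions assume neither vector is all zeros.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same ordering as Euclidean distance, without the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 minus dot / (|a|^2 + |b|^2 - dot); sensitive to angle and length.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Note that cosine distance treats (1, 2) and (2, 4) as identical because they point the same way, while Tanimoto distance still separates them because their lengths differ.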
Playing Around
Representing data
Representing text documents as vectors

    Vector Space Model (VSM)

    TF-IDF




    N-gram collocations
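
As a rough illustration of TF-IDF weighting and bigram generation, here is a small Python sketch. It assumes the plain formula tf * log(N / df); the exact weighting Mahout applies may differ in detail, and the function names are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def bigrams(tokens):
    # Adjacent word pairs, the simplest form of n-gram (n = 2).
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]
```

A term that occurs in every document gets weight 0 under this formula, which is how TF-IDF suppresses words that carry no discriminating information.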
Generating vectors from documents

    $ bin/mahout seqdirectory -c UTF-8 \
        -i examples/reuters-extracted/ -o reuters-seqfiles

    $ bin/mahout seq2sparse -i reuters-seqfiles/ \
        -o reuters-vectors -ow
Improving quality of vectors using normalization

    P-norm




    $ bin/mahout seq2sparse -i reuters-seqfiles/ \
        -o reuters-normalized-bigram -ow \
        -a org.apache.lucene.analysis.WhitespaceAnalyzer \
        -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 \
        -ml 50 -seq -n 2
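
To make the p-norm concrete, the following Python sketch divides a vector by its p-norm so that documents of different lengths become comparable. It is an illustration only; the helper names are hypothetical and it assumes a finite p of at least 1.

```python
def p_norm(vector, p):
    # (sum of |x|^p) raised to the power 1/p.
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)

def normalize(vector, p=2):
    # Scale the vector so that its p-norm becomes 1.
    norm = p_norm(vector, p)
    if norm == 0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / norm for x in vector]
```

With p = 2 the vector (3, 4) becomes (0.6, 0.8); with p = 1 the vector (1, 3) becomes (0.25, 0.75).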
Clustering Categories

    Exclusive clustering

    Overlapping clustering

    Hierarchical clustering

    Probabilistic clustering
Clustering Approaches


    Fixed number of centers


    Bottom-up approach


    Top-down approach
Clustering algorithms

    K-means clustering


    Fuzzy k-means clustering


    Dirichlet clustering
k-means clustering algorithm
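
As a rough illustration, the standard k-means loop (assign each point to its nearest center, recompute the centers, stop on convergence or after a maximum number of iterations) might look like this in Python. This is a single-machine sketch, not Mahout's implementation; all function and parameter names are hypothetical.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def nearest(point, centers):
    # Index of the center closest to the point.
    return min(range(len(centers)),
               key=lambda i: squared_euclidean(point, centers[i]))

def kmeans(points, centers, max_iterations=20, convergence_threshold=1e-9):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its points.
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        shift = max(squared_euclidean(old, new)
                    for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_threshold:
            break  # centers have stopped moving
    assignments = [nearest(p, centers) for p in points]
    return centers, assignments
```

The initial centers, the distance measure, the convergence threshold, and the iteration limit in this sketch correspond to the inputs listed on the "Using Mahout clustering" slide.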
Running k-means clustering

    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ \
        -c reuters-initial-clusters -o reuters-kmeans-clusters \
        -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
        -cd 1.0 -k 20 -x 20 -cl

    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ \
        -c reuters-initial-clusters -o reuters-kmeans-clusters \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
        -cd 0.1 -k 20 -x 20 -cl

    $ bin/mahout clusterdump -dt sequencefile -d
Fuzzy k-means clustering

    Instead of the exclusive clustering in k-means,
    fuzzy k-means tries to generate overlapping
    clusters from the data set.


    Also known as fuzzy c-means algorithm.
Running fuzzy k-means clustering

    $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ \
        -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters \
        -cd 1.0 -k 21 -m 2 -ow -x 10 \
        -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

    Fuzziness factor
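
To show what the fuzziness factor does, here is a small Python sketch of the standard fuzzy c-means membership formula, in which a point's degree of membership in each cluster depends on its relative distance to every center. It is an illustration rather than Mahout's code; the function name is hypothetical, m must be greater than 1, and `math.dist` requires Python 3.8 or later.

```python
import math

def memberships(point, centers, m=2.0):
    """Degree of membership of one point in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that center.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]
```

As m approaches 1 the memberships approach the hard 0/1 assignment of k-means; larger values of m spread each point more evenly across clusters.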
Dirichlet clustering

    A model-based clustering algorithm
Running Dirichlet clustering

    $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors \
        -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 \
        -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution \
        -mp org.apache.mahout.math.SequentialAccessSparseVector
Evaluating and improving clustering quality

    Inspecting clustering output

    Evaluating the quality of clustering

    Improving clustering quality
Inspecting clustering output

    $ bin/mahout clusterdump -s kmeans-output/clusters-19/ \
        -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10


    Top Terms:
        said    => 11.60126582278481
        bank    => 5.943037974683544
        dollar  =>
Analyzing clustering output

    Distance measure and feature selection

    Inter-cluster and intra-cluster distances

    Mixed and overlapping clusters
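
One way to put numbers on inter-cluster and intra-cluster distances is sketched below in Python. It assumes Euclidean distance, takes intra-cluster distance as the average pairwise distance between members of one cluster, and takes inter-cluster distance as the average pairwise distance between cluster centroids. Other definitions exist; the function names are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    # Coordinate-wise mean of the cluster members.
    return tuple(sum(col) / len(points) for col in zip(*points))

def intra_cluster_distance(points):
    # Average distance between all pairs of points in one cluster.
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def inter_cluster_distance(clusters):
    # Average distance between all pairs of cluster centroids.
    pairs = list(combinations([centroid(c) for c in clusters], 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

A good clustering generally shows small intra-cluster distances relative to the inter-cluster distance.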
Improving clustering quality

    Improving document vector generation

    Writing a custom distance measure
Real-world applications of clustering

    Clustering like-minded people on Twitter


    Suggesting tags for an artist on Last.fm using
    clustering


    Creating a related-posts feature for a website
Classification

    Classification is a process of using specific
    information (input) to choose a single selection
    (output) from a short list of predetermined
    potential responses.

    Applications of classification, e.g. spam
    filtering
Why use Mahout for classification?
How classification works
Classification

    Training versus test versus production

    Predictor variables versus target variable

    Records, fields, and values
Types of values for predictor variables

    Continuous

    Categorical

    Word-like

    Text-like
Classification workflow

    Training the model


    Evaluating the model


    Using the model in production
Stage 1: training the classification model
Stage 2: evaluating the classification model
Stage 3: using the model in production
Stage 1: training the classification model

    Define Categories for the Target Variable

    Collect Historical Data

    Define Predictor Variables

    Select a Learning Algorithm to Train the Model

    Use Learning Algorithm to Train the Model
Extracting features to build a Mahout classifier
Preprocessing raw data into classifiable data
Converting classifiable data into vectors

    Use one Vector cell per word, category, or
    continuous value

    Represent Vectors implicitly as bags of words

    Use feature hashing
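
Feature hashing can be illustrated with a few lines of Python: each token is hashed to an index in a fixed-size vector, so no dictionary of terms has to be built or stored. This is a minimal sketch, not Mahout's encoder classes; the function name, the default vector size, and the choice of CRC32 as the hash are assumptions made for the example.

```python
import zlib

def hash_features(tokens, num_features=16):
    """Map a bag of words onto a fixed-length count vector."""
    vector = [0.0] * num_features
    for token in tokens:
        # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector
```

Different tokens can collide on the same index; a larger `num_features` makes collisions rarer at the cost of a longer vector.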
Classifying the 20 newsgroups data set
Choosing an algorithm
The classifier evaluation API

    Percent correct

    Confusion matrix

    Entropy matrix

    AUC

    Log likelihood
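
Several of these measures are simple to compute by hand. The Python sketch below covers percent correct, a confusion matrix, AUC, and average log likelihood for a binary classifier. It is an illustration and not Mahout's evaluation API; the function names are hypothetical, labels are assumed to be 0 or 1 for the AUC and log-likelihood functions, and predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math
from collections import Counter

def percent_correct(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

def confusion_matrix(actual, predicted):
    # Counts keyed by (actual label, predicted label).
    return Counter(zip(actual, predicted))

def auc(labels, scores):
    # Probability that a random positive scores above a random negative;
    # ties count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_log_likelihood(labels, probabilities):
    # probabilities[i] is the predicted probability that labels[i] == 1.
    total = sum(math.log(p if l == 1 else 1.0 - p)
                for l, p in zip(labels, probabilities))
    return total / len(labels)
```

Percent correct only looks at the final decision, whereas AUC and log likelihood also reward a classifier for how well its scores rank or calibrate the examples.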
When classifiers go bad

    Target leaks

    Broken feature extraction
Tuning the problem

    Remove Fluff Variables

    Add New Variables, Interactions, and Derived
    Values
Tuning the classifier

    Try Alternative Algorithms

    Tune the Learning Algorithm
Thank You



               Contact at:
Email: Yasmine.Gaber@espace.com.eg
Twitter: Twitter.com/yasmine_mohamed

  • 8. Distance measures  Euclidean distance measure  Squared Euclidean distance measure  Manhattan distance measure
  • 9. Distance measures  Cosine distance measure  Tanimoto distance measure
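The five distance measures named on slides 8 and 9 can be sketched in plain Python. This is an illustration using the standard textbook formulas, not Mahout's own Java implementation; the function names are made up for this example, and the cosine and Tanimoto versions assume non-zero input vectors.

```python
import math


def squared_euclidean(a, b):
    # Sum of squared coordinate differences (no square root).
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    # Assumes neither vector is all zeros.
    return 1.0 - _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)))


def tanimoto_distance(a, b):
    # 1 minus (a.b) / (|a|^2 + |b|^2 - a.b).
    # Assumes the two vectors are not both all zeros.
    ab = _dot(a, b)
    return 1.0 - ab / (_dot(a, a) + _dot(b, b) - ab)
```

Cosine distance ignores vector length and only looks at direction, which is why it is often tried first for text vectors; Tanimoto takes both angle and length into account.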
  • 12. Representing text documents as vectors  Vector Space Model (VSM)  TF-IDF  N-gram collocations
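As a rough illustration of the TF-IDF weighting and bigram generation listed on slide 12, here is a small Python sketch. It uses the common textbook weight tf * log(N / df), which is an assumption; Mahout's vectorizer may use a different variant of the formula. The function names are hypothetical, and `bigrams` only generates candidate pairs without any statistical filtering of collocations.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def bigrams(tokens):
    """Adjacent word pairs, the raw candidates for 2-gram collocations."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]
```

A term that appears in every document gets weight 0 under this formula, which is the point of the IDF factor: words that occur everywhere carry little information for clustering.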
  • 13. Generating vectors from documents  $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles  $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
  • 14. Improving quality of vectors using normalization  P-norm  $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
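P-norm normalization divides every component of a vector by the vector's p-norm, so that long and short documents become comparable. A minimal Python sketch follows; the function name is hypothetical, and returning an all-zero vector unchanged is a choice made for this example.

```python
def p_normalize(vec, p=2.0):
    """Divide each component by the p-norm (sum of |x|^p) ** (1/p)."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    if norm == 0:
        # An all-zero vector has no direction; return it unchanged.
        return list(vec)
    return [x / norm for x in vec]
```

With p = 2 the result has unit Euclidean length, which pairs naturally with Euclidean and cosine distance; with p = 1 the components sum to 1 in absolute value, which suits Manhattan distance.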
  • 15. Clustering Categories  Exclusive clustering  Overlapping clustering  Hierarchical clustering  Probabilistic clustering
  • 16. Clustering Approaches  Fixed number of centers  Bottom-up approach  Top-down approach
  • 17. Clustering algorithms  K-means clustering  Fuzzy k-means clustering  Dirichlet clustering
  • 20. Running k-means clustering  $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl  $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl  $ bin/mahout clusterdump -dt sequencefile -d
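To make the roles of the convergence threshold and the iteration limit concrete, here is a single-machine Python sketch of the basic k-means loop. It is not Mahout's MapReduce implementation; the function and parameter names are invented for this example, and squared Euclidean distance is hard-coded.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, convergence_delta=0.001, max_iterations=20):
    """Returns (centers, clusters); clusters[i] holds the points last assigned to center i."""
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        moved = max(squared_euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        # Stopping condition: no center moved farther than the threshold.
        if moved <= convergence_delta:
            break
    return centers, clusters
```

The loop stops either when the centers have stabilized or when the iteration budget runs out, whichever comes first.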
  • 21. Fuzzy k-means clustering  Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set.  Also known as fuzzy c-means algorithm.
  • 22. Running fuzzy k-means clustering
  • 23. Running fuzzy k-means clustering  $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure  Fuzziness factor
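The fuzziness factor controls how strongly a point is shared between clusters. The following Python sketch computes the degree of membership of one point in each cluster using the standard fuzzy c-means formula u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)). The function names are hypothetical, Euclidean distance is assumed, and m must be greater than 1.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(point, centers, m=2.0):
    """Degree of membership of `point` in each cluster; the values sum to 1."""
    dists = [euclidean(point, c) for c in centers]
    zeros = [d == 0 for d in dists]
    if any(zeros):
        # A point lying exactly on a center belongs entirely to that center
        # (shared equally if several centers coincide with it).
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** exponent for dj in dists) for di in dists]
```

As m approaches 1 the memberships approach the hard assignments of ordinary k-means; larger values of m spread each point more evenly across clusters.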
  • 24. Dirichlet clustering  model-based clustering algorithm
  • 25. Running Dirichlet clustering  $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector
  • 26. Evaluating and improving clustering quality  Inspecting clustering output  Evaluating the quality of clustering  Improving clustering quality
  • 27. Inspecting clustering output  $ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10  Top Terms: said => 11.60126582278481 bank => 5.943037974683544 dollar =>
  • 28. Analyzing clustering output  Distance measure and feature selection  Inter-cluster and intra-cluster distances  Mixed and overlapping clusters
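One way to put numbers on inter-cluster and intra-cluster distances is sketched below in Python. The definitions used here are assumptions made for illustration: inter-cluster distance is the average pairwise distance between centroids, and intra-cluster distance is the average pairwise distance between the members of one cluster. The function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_pairwise_distance(vectors):
    """Mean distance over all unordered pairs; 0.0 if fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def inter_cluster_distance(centroids):
    # How far apart the clusters are from each other.
    return average_pairwise_distance(centroids)


def intra_cluster_distance(members):
    # How tightly packed a single cluster is.
    return average_pairwise_distance(members)
```

A good clustering tends to show large inter-cluster distances together with small intra-cluster distances.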
  • 29. Improving clustering quality  Improving document vector generation  Writing a custom distance measure
  • 30. Real-world applications of clustering  Clustering like-minded people on Twitter  Suggesting tags for an artist on Last.fm using clustering  Creating a related-posts feature for a website
  • 31. Classification  Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses.  Applications of classification, e.g. spam filtering
  • 32. Why use Mahout for classification?
  • 34. Classification  Training versus test versus production  Predictor variables versus target variable  Records, fields, and values
  • 35. Types of values for predictor variables  Continuous  Categorical  Word-like  Text-like
  • 36. Classification workflow  Training the model  Evaluating the model  Using the model in production
  • 37. Stage 1: training the classification model Stage 2: evaluating the classification model Stage 3: using the model in production
  • 38. Stage 1: training the classification model  Define Categories for the Target Variable  Collect Historical Data  Define Predictor Variables  Select a Learning Algorithm to Train the Model  Use Learning Algorithm to Train the Model
  • 39. Extracting features to build a Mahout classifier
  • 40. Preprocessing raw data into classifiable data
  • 41. Converting classifiable data into vectors  Use one Vector cell per word, category, or continuous value  Represent Vectors implicitly as bags of words  Use feature hashing
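Feature hashing avoids building a dictionary by hashing each feature name straight to a vector index. A minimal Python sketch of the idea follows; it is not Mahout's encoder implementation, and the function name, the choice of CRC32 as the hash, and the default vector size are all assumptions made for this example.

```python
import zlib


def hash_features(tokens, num_features=16):
    """Map a bag of words to a fixed-size count vector without a dictionary."""
    vec = [0.0] * num_features
    for token in tokens:
        # A stable hash of the token picks the cell; different tokens may collide.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vec[index] += 1.0
    return vec
```

The vector size is fixed up front, so previously unseen words need no dictionary update; the price is that unrelated words can share a cell, which a larger vector makes less likely.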
  • 42. Classifying the 20 newsgroups data set
  • 44. The classifier evaluation API  Percent correct  Confusion matrix  Entropy matrix  AUC  Log likelihood
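Several of the metrics on slide 44 are simple to compute by hand. The Python sketch below covers percent correct, the confusion matrix, AUC, and average log likelihood for a binary classifier; it illustrates the definitions and is not Mahout's evaluation API. All function names are hypothetical, labels are assumed to be 1 (positive) or 0 (negative) for the last two functions, and probabilities are assumed to lie strictly between 0 and 1.

```python
import math
from collections import Counter


def percent_correct(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


def confusion_matrix(actual, predicted):
    # Counts keyed by (actual label, predicted label).
    return Counter(zip(actual, predicted))


def auc(labels, scores):
    # Probability that a random positive scores higher than a random negative;
    # ties count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_log_likelihood(labels, probs):
    # Mean log of the probability assigned to the true label.
    # probs[i] is the predicted probability that example i is positive.
    return sum(math.log(p if l == 1 else 1.0 - p)
               for l, p in zip(labels, probs)) / len(labels)
```

Percent correct only looks at the final decision, while AUC and log likelihood also reward a classifier for the quality of its scores: AUC for ranking positives above negatives, log likelihood for assigning high probability to the true label.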
  • 45. When classifiers go bad  Target leaks  Broken feature extraction
  • 46. Tuning the problem  Remove Fluff Variables  Add New Variables, Interactions, and Derived Values
  • 47. Tuning the classifier  Try Alternative Algorithms  Tune the Learning Algorithm
  • 48. Thank You Contact at: Email: Yasmine.Gaber@espace.com.eg Twitter: Twitter.com/yasmine_mohamed