Tutorial on MALLET
     Shatakirti
     MT2011096
MALLET


Contents
1 Introduction to MALLET

2 Where do we use MALLET?

3 Getting Started
  3.1 Installing MALLET
  3.2 Using the Script

4 Importing Data files

5 Natural Language Processing

6 Document classification

7 Sequence Tagging

8 Topic Models

References


List of Figures
   1    Natural Language Processing using MALLET
   2    Document classification
   3    Sequence Tagging




1     Introduction to MALLET
MALLET is a Java-based package for statistical natural language processing,
document classification, clustering, topic modeling, information extraction,
and other machine learning applications to text.



2     Where do we use MALLET?
MALLET is typically used for tasks such as the following:

    1. Historical Topics and Trends
      Our aim here is to automatically discover the general topics that appear
      in a large newspaper corpus. MALLET is run over the documents from a
      period of interest to find the top general topic groups. For example, if
      we wish to know the top ten topic groups between the years 1965-1901,
      MALLET is run on the data from that period. In addition, we can find the
      topics most strongly associated with a keyword, say "iron": we extract 5
      lines on each side of every line containing "iron" and run MALLET again
      on the extracted text to find the top general topic groups.

    2. Detect spam mails
      We can use the document classification capabilities of MALLET to
      detect spam mails. A simple example of this would be a spam classifier
      like you’d find in your email inbox. Since we know what good mail
      looks like, and since we know what spam typically looks like, we can
      craft a Naive Bayes classifier to make a statistical approximation as to
      whether or not a new message is spam.

    3. Extract important information
      We can use the sequence tagging functionality that MALLET provides
      to extract important information from data. By employing named-entity
      recognition techniques, we can figure out exactly what a document
      is talking about without having to read through the entire text
      ourselves. Imagine someone hands you a book and asks you for all the
      characters and locations featured throughout the text. Using named-entity
      recognition, a computer can accomplish that task in mere seconds
      as compared to the hours it would take a human.
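
The keyword-context step from item 1 above (taking 5 lines on each side of every line that contains "iron") can be sketched in Python. This is only an illustration of the idea, not part of MALLET; the function name context_windows and the toy corpus are hypothetical.

```python
def context_windows(lines, keyword, radius=5):
    """Return the lines within `radius` lines of any line containing `keyword`.

    Overlapping windows are merged so that no line is returned twice.
    """
    keep = set()
    for i, line in enumerate(lines):
        if keyword.lower() in line.lower():
            start = max(0, i - radius)
            end = min(len(lines), i + radius + 1)
            keep.update(range(start, end))
    return [lines[i] for i in sorted(keep)]


# Hypothetical toy corpus: 20 numbered lines, one of which mentions "iron".
corpus = [f"line {n}" for n in range(20)]
corpus[10] = "the iron works reopened"
subset = context_windows(corpus, "iron", radius=5)
print(len(subset))  # 11 lines: 5 before, the match, 5 after
```

The extracted lines could then be written to a file and imported into MALLET like any other document.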




3     Getting Started
3.1     Installing MALLET
    1. Download the latest version of MALLET from
       http://mallet.cs.umass.edu/download.php

    2. To build MALLET 2.0, you must have Apache Ant. You can download
       it from http://ant.apache.org/

    3. Set the environment variables pointing to Java Home, Ant Home
       and Mallet Home (the MALLET directory).

    4. Change to the MALLET directory and type:
       ant
       Example: C:\Users\VAIO\WebIR\mallet-2.0.7>ant

      If ant finishes with "BUILD SUCCESSFUL", MALLET is now ready
      to use.
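
Before running ant, it can help to confirm that the variables from step 3 are set. The following is a minimal Python sketch. The variable names JAVA_HOME, ANT_HOME and MALLET_HOME are assumptions, since the steps above only speak of "Java Home, Ant Home and Mallet Home".

```python
import os

# Assumed variable names; the tutorial only says "Java Home, Ant Home and Mallet Home".
REQUIRED_VARS = ("JAVA_HOME", "ANT_HOME", "MALLET_HOME")


def missing_variables(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All three variables are set.")
```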



3.2     Using the Script
Now, if you installed MALLET in the directory WebIR\mallet-2.0.7,
the mallet script will be present in WebIR\mallet-2.0.7\bin. If the current
working directory is the MALLET directory, you can use this script in
this pattern:

bin\mallet [command] --option value --option value ...

Type bin\mallet to get a list of commands. Use the option --help with any
command to get a description of its valid options.
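
When MALLET is driven from another program, the same pattern can be assembled as an argument list. The Python sketch below only builds the list and does not run MALLET. The helper name mallet_command and its keyword convention are hypothetical.

```python
def mallet_command(command, script="bin\\mallet", **options):
    """Build an argument list of the form: script command --option value ...

    Keyword names use underscores in place of hyphens
    (output_classifier -> --output-classifier). A value of True emits the
    bare flag with no value.
    """
    args = [script, command]
    for name, value in options.items():
        args.append("--" + name.replace("_", "-"))
        if value is not True:
            args.append(str(value))
    return args


cmd = mallet_command("import-file", input="hill.txt", output="output.mallet",
                     remove_stopwords=True)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run from the MALLET directory.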


4     Importing Data files
To import a data file, use the command:

bin\mallet import-file --input [filename] --output [output filename] [options]



Similarly, to import an entire directory use:

bin\mallet import-dir --input [dir path] --output [output filename] [options]

For example:

bin\mallet import-file --input sample-data\web\en\hill.txt --output output.mallet

In the above example, the input data is hill.txt and the imported data is
written to the output.mallet file. Stopwords are removed only if the
--remove-stopwords option is added.

bin\mallet import-dir --input sample-data\web\* --output output.mallet

In the above example, the input data is the set of folders inside the web
folder, and the imported data is written to the output.mallet file. Again,
stopwords are removed only if --remove-stopwords is added.

For more options, use the help by typing:

bin\mallet import-file --help
or
bin\mallet import-dir --help
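
The import-file command treats each line of the input file as one instance. As a rough illustration, the Python sketch below splits such a line into its parts. It assumes the whitespace-separated "name label text" layout described in the MALLET import documentation; this tutorial does not spell the format out, so treat the layout as an assumption. The sample lines are hypothetical.

```python
def parse_instance_line(line):
    """Split one line into (name, label, text).

    Assumes the whitespace-separated layout "name label text ..."; lines
    with fewer than three fields are padded with empty strings.
    """
    parts = line.strip().split(None, 2)
    parts += [""] * (3 - len(parts))
    return tuple(parts)


# Hypothetical two-instance input file, one instance per line.
sample = [
    "doc1 en The hill was covered in snow",
    "doc2 de Der Berg war mit Schnee bedeckt",
]
for line in sample:
    print(parse_instance_line(line))
```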


5    Natural Language Processing
MALLET includes routines for transforming text documents into numerical
representations that can then be processed efficiently. This process is
implemented through a flexible system called "pipes", which handle distinct
tasks such as tokenizing strings, removing stopwords, and converting sequences
into count vectors. MALLET reads Unicode files, so we can use files in various
languages and provide MALLET with rules for processing the data. We can use
regular expressions to tokenize word segments in any language. For example,
if we type in

bin\mallet import-file --input sample-data\web\en\hill.txt --output output.mallet --print-output --remove-stopwords

In the above example, MALLET removes the stopwords, prints the output, and
also writes it to the output.mallet file. A sample output with and without
removing stopwords is shown below:

       (a) Without removing stopwords           (b) Removing stopwords

          Figure 1: Natural Language Processing using MALLET

    The above figure shows MALLET's support for the English language.
In the above snapshot, a simple text file, "hill.txt", written in English
is imported. The words are numbered and the number of occurrences of each is
also shown. The stopwords are recognized by MALLET and can be kept in or
removed from the output file as per the user's requirements. Currently,
MALLET does not support Chinese and Japanese text.
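
The idea of chaining pipes can be sketched in a few lines of Python. This is an illustration of the concept only, not MALLET's Java implementation: the function names, the simplified token pattern, and the tiny stopword list are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; MALLET ships its own, much longer list.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "on", "in", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def to_counts(tokens):
    """Convert a token sequence into a word -> count mapping."""
    return Counter(tokens)


def run_pipes(text, pipes):
    """Feed the output of each pipe into the next one, in order."""
    data = text
    for pipe in pipes:
        data = pipe(data)
    return data


text = "The hill is a hill on the edge of the town"
print(run_pipes(text, [tokenize, to_counts]))
print(run_pipes(text, [tokenize, remove_stopwords, to_counts]))
```

The two calls mirror panels (a) and (b) of Figure 1: the same text counted with and without its stopwords.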


6    Document classification
A classifier is an algorithm that distinguishes between a fixed set of classes,
such as "spam" vs. "non-spam", based on some previous training (note that
MALLET is also a machine learning tool). MALLET includes implementations
of several classification algorithms, including Naive Bayes,
Maximum Entropy, and Decision Trees.
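
To make "previous training" concrete, here is a minimal Python sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is a textbook version written for illustration and is not MALLET's implementation; the toy training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Train on (tokens, label) pairs; return the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-probability for `tokens`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Log prior plus add-one (Laplace) smoothed log likelihoods.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting agenda for monday".split(), "non-spam"),
    ("lunch on monday".split(), "non-spam"),
]
model = train_naive_bayes(training)
print(predict(model, "free money offer".split()))
print(predict(model, "monday meeting".split()))
```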
    To get started with the document classifier, first load the data into MALLET
format. Then follow these steps:



 1. Train the classifier:
    Suppose you have a MALLET data file called train.mallet. Use the
    command:

    bin\mallet train-classifier --input train.mallet --output-classifier my.classifier


 2. Choose the algorithm:
    The default classification algorithm is Naive Bayes. To select
    a different algorithm, use the --trainer option. For example, to use
    the MaxEnt algorithm, use the following command:

    bin\mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt

    You can also try NaiveBayes, C45, and DecisionTree.
    To compare multiple training algorithms, use the following command:

    bin\mallet train-classifier --input labeled.mallet --training-portion 0.9 --trainer MaxEnt --trainer NaiveBayes

    This command will compare the MaxEnt and the NaiveBayes algorithms
    on the portion of the data held out from training.


 3. Evaluation:
    To know whether the classifier produces good results on data
    not used in training, we can split a single set of instances into
    training and testing lists. For this purpose, you can use a command like:

    bin\mallet train-classifier --input labeled.mallet --training-portion 0.9

    This command will randomly split the data into 90% training instances,
    which will be used to train the classifier, and the remaining 10% testing
    instances. MALLET will use the classifier to predict the class labels
    of the testing instances, compare those to the true labels, and report
    results. You can also try the various training options listed in
    MALLET's help.

   For example, you can try the following command:

   bin\mallet train-classifier --input web.mallet --trainer MaxEnt --trainer NaiveBayes --training-portion 0.9 --num-trials 10

   This command will run 10 trials, in each of which the input data is randomly
   split into 90% training instances and 10% testing instances. For each
   trial, MALLET trains a MaxEnt classifier and a Naive Bayes classifier
   on the training instances, then prints accuracy results and a confusion
   matrix of true and predicted labels for each classifier. An illustration
   is shown in Figure 2.
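
The evaluation procedure in step 3 (random split, train, predict, compare, repeat) can be sketched in Python as follows. This shows the mechanics behind --training-portion and --num-trials only; it is not MALLET code. A trivial majority-label baseline stands in for a real trainer, and the data set and all function names are hypothetical.

```python
import random
from collections import Counter


def split_instances(instances, training_portion, rng):
    """Shuffle a copy of `instances` and split it into (train, test)."""
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]


def train_majority(train):
    """Baseline 'classifier': always predict the most common training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common


def evaluate(classifier, test):
    """Return (accuracy, confusion), where confusion[(true, predicted)] is a count."""
    confusion = Counter((label, classifier(features)) for features, label in test)
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    return correct / len(test), confusion


def run_trials(instances, training_portion=0.9, num_trials=10, seed=0):
    """Repeat the random split / train / evaluate cycle and collect accuracies."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(num_trials):
        train, test = split_instances(instances, training_portion, rng)
        accuracy, _ = evaluate(train_majority(train), test)
        accuracies.append(accuracy)
    return accuracies


# Hypothetical data set: 70 "spam" and 30 "non-spam" instances.
data = [(i, "spam") for i in range(70)] + [(i, "non-spam") for i in range(30)]
accuracies = run_trials(data)
print(len(accuracies), sum(accuracies) / len(accuracies))
```

Averaging over several random splits gives a steadier accuracy estimate than a single split, which is the point of running more than one trial.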




         (a)                                        (b)

          Figure 2: Document classification


7    Sequence Tagging
Sometimes we have a very large database of sequences in which each element
must be labeled, for example a large gene database. MALLET includes
implementations of widely used sequence algorithms, including hidden Markov
models (HMMs) and linear-chain conditional random fields (CRFs). These
algorithms support applications such as gene finding and named-entity
recognition.

Simple Tagger

SimpleTagger is a command-line interface to the MALLET CRF class. To
use it, each line in the input file should represent one token. The required
format is:

feature1 feature2 ... featuren label

For example, write the following in a file named "sample" and put it in
the MALLET directory.

Kirti CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
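
To check that a training file follows this layout, each line can be split into its features and its final label. The Python sketch below does this for the three sample lines; the helper name parse_tagged_line is hypothetical and the code is independent of MALLET.

```python
def parse_tagged_line(line):
    """Split 'feature1 feature2 ... label' into (features, label)."""
    fields = line.split()
    return fields[:-1], fields[-1]


sample = """\
Kirti CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
"""

sequence = [parse_tagged_line(line) for line in sample.splitlines() if line.strip()]
for features, label in sequence:
    print(label, features)
```

Note that the word itself ("Kirti", "slept", "here") is treated as just another feature of the token.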

To train the CRF, use the following command while in the MALLET directory:

java -cp class;lib\mallet-deps.jar cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample

This command will train the CRF. The --train true option specifies
that this is a training run. Here the CRF model file, nouncrf, is created in
the MALLET directory itself. We can, however, specify another location as
convenient.




                    (a)                     (b)

         Figure 3: Sequence Tagging


   Now that we have trained the CRF, we can put it to the test by creating a
new file called "test". Inside this file, we write:

CAPITAL Al
slept
here

Now we need the file to be labelled, so we apply the CRF stored in nouncrf by
typing:

java -cp class;lib\mallet-deps.jar cc.mallet.fst.SimpleTagger --model-file nouncrf test

which produces the following output:

Number of predicates: 5
noun CAPITAL Al
non-noun slept
non-noun here




8    Topic Models
Topic models provide a simple way to analyze large volumes of unlabeled text.
A "topic" consists of a cluster of words that frequently occur together. Using
contextual clues, topic models can connect words with similar
meanings and distinguish between uses of words with multiple meanings.
   The first step in building a topic model is to import a set of documents.
Suppose we want to import the files in the folder "en". Type the
command:

bin\mallet import-dir --input sample-data\web\en --output output.mallet --keep-sequence --remove-stopwords

This command will remove all the stopwords, keep the original word sequence
of each document, and write the output to a file named "output.mallet" in the
MALLET directory.
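
The effect of --keep-sequence can be pictured as the difference between storing one word id per token, in order, and storing only a count per word. The Python sketch below illustrates that difference. The class and function names echo MALLET concepts (alphabet, feature sequence, feature vector), but the code is an illustration written for this tutorial, not MALLET's implementation.

```python
from collections import Counter


class Alphabet:
    """Assign each distinct word a stable integer id, in order of first use."""

    def __init__(self):
        self.ids = {}

    def lookup(self, word):
        return self.ids.setdefault(word, len(self.ids))


def to_feature_sequence(tokens, alphabet):
    """Keep word order: one id per token (what topic models need)."""
    return [alphabet.lookup(t) for t in tokens]


def to_feature_vector(tokens, alphabet):
    """Discard word order: id -> count (enough for document classifiers)."""
    return dict(Counter(alphabet.lookup(t) for t in tokens))


tokens = "hill town hill river town hill".split()
alphabet = Alphabet()
print(to_feature_sequence(tokens, alphabet))  # [0, 1, 0, 2, 1, 0]
print(to_feature_vector(tokens, alphabet))    # {0: 3, 1: 2, 2: 1}
```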



   Now, type in the command:

bin\mallet train-topics --input output.mallet --num-topics 100 --output-state topic-state.gz

Here --num-topics [NUMBER] sets the number of topics to use; the larger the
number, the more fine-grained the resulting topics. The --output-state option
writes a compressed text file containing the words in the corpus with their
topic assignments. This file format can easily be parsed and used by non-Java
software. Note that the state file will be gzipped, so it is helpful
to provide a filename that ends in .gz.
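
As an example of parsing the state file outside Java, the Python sketch below lists the most frequent words assigned to each topic. It assumes that header lines start with "#" and that every other line ends with the word followed by its topic number (the layout suggested by the file's own "#doc source pos typeindex type topic" header). Check this against your own state file before relying on it. The function name is hypothetical.

```python
import gzip
import os
from collections import Counter, defaultdict


def top_words_per_topic(state_path, n=5):
    """Return {topic: [most frequent words]} from a gzipped state file.

    Assumed layout: lines starting with '#' are headers; on every other
    line the last two whitespace-separated fields are the word and the
    topic number assigned to it.
    """
    counts = defaultdict(Counter)
    with gzip.open(state_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()
            word, topic = fields[-2], int(fields[-1])
            counts[topic][word] += 1
    return {
        topic: [word for word, _ in counter.most_common(n)]
        for topic, counter in sorted(counts.items())
    }


# Only runs if the state file from the command above is present.
if os.path.exists("topic-state.gz"):
    for topic, words in top_words_per_topic("topic-state.gz").items():
        print(topic, " ".join(words))
```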




References
[1] http://mallet.cs.umass.edu

[2] http://www.fieldstone-software.com/mallet/



