1. Getting Started with NLTK
An Introduction to NLTK
Sreejith S
srssreejith@gmail.com
@tweet2sree
FOSSMeet 2011, NIC Calicut
06 February 2011
Sreejith S Getting Started with NLTK
2. Just a word about me
Working in Natural Language Processing (NLP), Machine Learning,
Text Mining
Active member of ilugcbe, http://ilugcbe.techstud.org
Works for 365Media Pvt. Ltd., Coimbatore, India
@tweet2sree, srssreejith@gmail.com
12. Introduction - NLP
Natural Language Processing
NLP is an inter-disciplinary subject
Computer Science
Linguistics
Statistics etc...
NLP is a subfield of Artificial Intelligence
NLP: any kind of computer manipulation of natural language
It is a rapidly developing field of study
Everyday applications of NLP
Handwriting recognition, machine translation, question-answering
systems, spell checkers, grammar checkers, etc.
21. Natural Language Toolkit (NLTK)
A collection of Python programs, modules, data sets and tutorials to
support research and development in Natural Language Processing
(NLP)
Written by Steven Bird, Edward Loper and Ewan Klein
NLTK is
Free and Open source
Easy to use
Modular
Well documented
Simple and extensible
http://www.nltk.org
25. What You Will Learn
How simple programs can help you manipulate and analyze language
data, and how to write these programs
How key concepts from NLP and linguistics are used to describe and
analyze language
How data structures and algorithms are used in NLP
How language data is stored in standard formats, and how data can
be used to evaluate the performance of NLP techniques
35. Installation of NLTK
Make sure that Python 2.4, 2.5 or 2.6 is available on your system
Install the Python Tkinter package
Install NumPy, Matplotlib, Prover9, MaltParser and MegaM
Download NLTK and install it
If you are installing NLTK from source, download
http://nltk.googlecode.com/files/nltk-2.0b9.zip
Unzip it; this creates the folder nltk-2.0b9
Open a terminal, cd into this folder and, as superuser, run: python
setup.py install
To install the data
Start the Python interpreter
>>> import nltk
>>> nltk.download()
Now you are ready to play with NLTK!
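One way to confirm that the installation worked is to ask Python whether the nltk package can be found. The sketch below is an illustration, not part of the original slides: the helper name nltk_is_installed is hypothetical, and it assumes a Python 3 interpreter rather than the Python 2.x versions listed above.

```python
import importlib.util
import sys


def nltk_is_installed():
    """Return True if the nltk package can be located by this interpreter."""
    return importlib.util.find_spec("nltk") is not None


print("Python version:", sys.version.split()[0])
print("NLTK installed:", nltk_is_installed())
```

If this prints False, the setup.py step most likely ran under a different Python installation than the one you are using now.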
43. NLTK Modules
NLTK Modules                  Functionality
nltk.corpus                   Corpus
nltk.tokenize, nltk.stem      Tokenizers, stemmers
nltk.collocations             t-test, chi-squared, mutual information
nltk.tag                      n-gram, backoff, Brill, HMM, TnT
nltk.classify, nltk.cluster   Decision tree, Naive Bayes, K-means
nltk.chunk                    Regex, n-gram, named entity
nltk.parse                    Parsing
56. Let us start the game
To access the data needed for working through the examples in the book
Start the Python interpreter
Some basic exercises from the book
Concordance
>>> from nltk.book import *
>>> text1.concordance("monstrous")
Similar
>>> text1.similar("monstrous")
Dispersion plot - positional information
>>> text4.dispersion_plot(["citizens",
...     "democracy", "freedom", "duties", "America"])
>>> text4.dispersion_plot(["and",
...     "to", "of", "with", "the"])
What does each plot show, and why?
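To make the ideas behind concordance and dispersion concrete, here is a minimal sketch in plain Python 3. It is not NLTK's implementation: the functions concordance and offsets and the sample sentence are made up for illustration. A concordance lists every occurrence of a word with its context; a dispersion plot draws the positions (offsets) at which a word occurs. Function words such as "the" occur throughout a text, while content words tend to cluster, which is presumably the contrast the two plots are meant to show.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


def offsets(tokens, target):
    """Return the token positions of target; a dispersion plot draws these."""
    return [i for i, token in enumerate(tokens) if token.lower() == target.lower()]


# Hypothetical sample text, already split into tokens.
tokens = "the whale is a monstrous fish and a most monstrous size".split()

for line in concordance(tokens, "monstrous"):
    print(line)
print(offsets(tokens, "monstrous"))
```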
65. Continued...
Some basic exercises from the book
Generate
>>> text3.generate()
Counting vocabulary
>>> len(text3)
List of distinct words, sorted in dictionary order
>>> sorted(set(text3))
Count occurrences of a particular word in a text
>>> text3.count("and")
What percentage of the text is taken up by a specific word
>>> 100.0 * text3.count("and") / len(text3)
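The same counts can be tried on any list of tokens without loading the book data. The following is a small sketch in Python 3 using a made-up sentence, so the numbers are not those of text3. Note that in Python 2, dividing two integers with / truncates the result, which is why a float is needed there; in Python 3 the division below is exact.

```python
# Hypothetical sample text, split into tokens on whitespace.
tokens = "in the beginning god created the heaven and the earth".split()

total = len(tokens)                    # number of tokens
vocabulary = sorted(set(tokens))       # distinct words in dictionary order
count_the = tokens.count("the")        # occurrences of one word
percentage = 100 * count_the / total   # true division in Python 3

print(total, len(vocabulary), count_the, percentage)
print(vocabulary)
```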
71. Collocation & Bigram
Collocation
A collocation is a sequence of words that occur together unusually often
e.g. "red wine", "strong tea"
But "strong computer" is not a collocation
>>> text4.collocations()
Bigrams
List of word pairs
>>> text = "sreejith is talking about NLTK"
>>> wordlist = text.split()
>>> bigrams(wordlist)
What will happen if I do this instead?
>>> bigrams(text)
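The answer follows from how bigrams are built: each item of a sequence is paired with its successor. A list of words yields word pairs, but a plain string is a sequence of characters, so it yields character pairs. The sketch below is a stand-in written in plain Python 3 to show this; it is not NLTK's own bigrams function, although it is meant to behave the same way on these inputs.

```python
def bigrams(sequence):
    """Pair each item with its successor, returned as a list of 2-tuples."""
    items = list(sequence)
    return list(zip(items, items[1:]))


text = "sreejith is talking about NLTK"

print(bigrams(text.split()))   # word pairs
print(bigrams(text)[:3])       # character pairs, because a string iterates by character
```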
75. Work with our own data
Load our own corpus with NLTK and analyse it
>>> from nltk.corpus import PlaintextCorpusReader as ptr
>>> corpus = '/home/developer/Desktop/Sreejith'
>>> wordlist = ptr(corpus, '.*')
>>> wordlist.fileids()
Let us find out how to count the number of characters, words
and sentences in each file of the corpus
>>> for fid in wordlist.fileids():
...     print(len(wordlist.raw(fid)))
>>> for fid in wordlist.fileids():
...     print(len(wordlist.words(fid)))
>>> for fid in wordlist.fileids():
...     print(len(wordlist.sents(fid)))
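To see what these three counts mean without NLTK, here is a rough sketch in Python 3 that walks a directory of text files and counts characters, whitespace-separated words and naively split sentences. The function name corpus_counts, the file name a.txt and the splitting rules are assumptions made for illustration. NLTK's readers tokenize more carefully (punctuation becomes separate tokens, for example), so their numbers will differ.

```python
import re
import tempfile
from pathlib import Path


def corpus_counts(directory):
    """Map each file name to (characters, words, sentences) using naive splitting."""
    counts = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        raw = path.read_text(encoding="utf-8")
        words = raw.split()
        # Naive rule: a sentence ends at one or more of '.', '!' or '?'.
        sentences = [s for s in re.split(r"[.!?]+", raw) if s.strip()]
        counts[path.name] = (len(raw), len(words), len(sentences))
    return counts


# Build a throwaway one-file corpus so the example is self-contained.
with tempfile.TemporaryDirectory() as corpus_dir:
    Path(corpus_dir, "a.txt").write_text("NLTK is fun. Try it!", encoding="utf-8")
    result = corpus_counts(corpus_dir)

print(result)
```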
81. Continued...
Plotting a conditional frequency distribution
>>> text = "sreejith is talking about NLTK"
>>> words = text.split()
>>> big = bigrams(words)
>>> gd = nltk.ConditionalFreqDist(big)
>>> gd.plot()
Tabulate CFD
>>> gd.tabulate()
Plot frequency distribution
>>> fdist = FreqDist(text1)
>>> fdist.plot(50,cumulative=True)
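A conditional frequency distribution built from bigrams answers the question "given this word, how often does each word follow it?". The sketch below reproduces that idea with the standard library in Python 3; the function conditional_freq_dist and the sample sentence are made up for illustration and are not NLTK's implementation. A sentence with repeated words is used because in the slide's five-word example every bigram occurs exactly once.

```python
from collections import Counter, defaultdict


def conditional_freq_dist(pairs):
    """Count, for each condition (first item), how often each event (second item) follows."""
    cfd = defaultdict(Counter)
    for condition, event in pairs:
        cfd[condition][event] += 1
    return cfd


# Hypothetical sample sentence with repeated words.
words = "the cat saw the dog and the cat ran".split()
pairs = list(zip(words, words[1:]))    # bigrams of the word list

cfd = conditional_freq_dist(pairs)
print(cfd["the"].most_common())        # words following "the", most frequent first
```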
84. Normalizing Text
Stemming
Stemming is the process of reducing inflected (or sometimes derived)
words to their stem, base or root form, generally a written word form
>>> porter = nltk.PorterStemmer()
>>> word = 'running'
>>> porter.stem(word)
>>> lancaster = nltk.LancasterStemmer()
>>> lancaster.stem(word)
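The core idea of a stemmer is rule-based suffix stripping. The toy function below, written for Python 3, shows that idea only; it is far simpler than the Porter or Lancaster algorithms, and the suffix list and the name toy_stem are assumptions made for this illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")   # tried in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" becomes "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "talked", "cats", "is"]:
    print(w, "->", toy_stem(w))
```

Because the rules know nothing about the language, a stemmer like this can produce forms that are not real words, which is the gap lemmatization tries to close.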
87. Normalizing Text
Lemmatization
Stemming + make sure that the resulting form is a known word in a
dictionary
>>> wnl = nltk.WordNetLemmatizer()
>>> wnl.lemmatize(word)
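The "known word in a dictionary" check can be sketched as follows in Python 3. This is a toy, not how WordNetLemmatizer works internally: the word list KNOWN_WORDS, the table IRREGULAR and the function toy_lemmatize are all hypothetical. As with the WordNet lemmatizer, a word that cannot be reduced to a known form is returned unchanged. Note also that, as far as I know, wnl.lemmatize treats its input as a noun unless a part of speech is passed, so verb forms may come back unchanged.

```python
# Hypothetical mini dictionary and irregular forms, for illustration only.
KNOWN_WORDS = {"run", "talk", "woman", "corpus"}
IRREGULAR = {"women": "woman", "corpora": "corpus"}


def toy_lemmatize(word):
    """Strip a suffix only if the result is a known word; otherwise return word unchanged."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            # Also try dropping one letter to undo consonant doubling ("runn" -> "run").
            for form in (candidate, candidate[:-1]):
                if form in KNOWN_WORDS:
                    return form
    return word


for w in ["running", "talks", "women", "lying"]:
    print(w, "->", toy_lemmatize(w))
```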
90. POS Tagging
POS Tagging
The process of classifying words into their parts of speech and labeling
them accordingly is known as part-of-speech tagging, or POS tagging
>>> text = nltk.word_tokenize("we are attending FOSS meet at NIC calicut")
>>> nltk.pos_tag(text)
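To give a feel for what a tagger does, here is a deliberately crude sketch in Python 3: each token is matched against a few regular expressions, and anything unmatched gets a default tag. The patterns, the default tag and the name toy_tag are assumptions for illustration. The tag names follow the Penn Treebank style, but this is much less accurate than nltk.pos_tag, which uses a trained model.

```python
import re

# Hypothetical patterns, tried in order.
PATTERNS = [
    (r".*ing$", "VBG"),    # gerunds
    (r".*ed$", "VBD"),     # simple past
    (r"^[A-Z]+$", "NNP"),  # all-caps tokens treated as proper nouns
    (r".*s$", "NNS"),      # plural nouns
]


def toy_tag(tokens, default="NN"):
    """Tag each token with the first matching pattern, else the default tag."""
    tagged = []
    for token in tokens:
        for pattern, tag in PATTERNS:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
        else:
            tagged.append((token, default))
    return tagged


print(toy_tag("we are attending FOSS meet".split()))
```

The output tags "we" and "are" as nouns, which is wrong and shows why real taggers also rely on context and training data.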
97. Machine Translation
Babelizer Shell
Translating a sentence from its source language to a specified language.
NLTK provides the babelize shell
>>> babelize_shell()
Babel> hello how are you?
Babel> german
Babel> run
You can also try Google Translate or Yahoo! Babel Fish
98. What you can do
Contribute to NLTK
GSoC
NLP Training
Real time research
99. Reference
Steven Bird, Edward Loper and Ewan Klein
Natural Language Processing with Python
Jacob Perkins
Python Text Processing with NLTK 2.0 Cookbook
http://www.nltk.org
100. Questions
101. And finally...
Sreejith.S