SlideShare a Scribd company logo
1 of 50
Download to read offline
Natural Language Toolkit [NLTK]
                                                  Prakash B Pimpale
                                                  pbpimpale@gmail.com

                                  @
    FOSS(From the Open Source Shelf) 
  An open source softwares seminar series

                         (CC) KBCS CDAC MUMBAI
Let's start
➢   What is it about?
    ➢   NLTK – installing and using it for some 'basic NLP' 
        and Text Processing tasks
    ➢   Some NLP and Some python 
➢   How will we proceed ?
    ➢   NLTK introduction
    ➢   NLP introduction
    ➢   NLTK installation
    ➢   Playing with text using NLTK 
    ➢   Conclusion          (CC) KBCS CDAC MUMBAI
NLTK

➢   Open source python modules and linguistic 
    data for Natural Langauge Processing 
    application 
➢   Dveloped by group of people, project leaders: 
    Steven Bird, Edward Loper, Ewan Klein 
➢   Python: simple, free and open source, portable, 
    object oriented, interpreted 


                     (CC) KBCS CDAC MUMBAI
NLP
➢   Natural Language Processing: field of computer science 
    and linguistics woks out interactions between computers 
    and human (natural) languages.
➢   some brand ambassadors: machine translation, 
    automatic summarization, information extraction, 
    transliteration, question answering, opinion mining  
➢   Basic tasks: stemming, POS tagging, chunking, parsing, 
    etc.
➢   Involves simple frequency count to undertanding and 
    generating complex text 

                        (CC) KBCS CDAC MUMBAI
Terms/Steps in NLP

➢   Tokenization: getting words and punctuations 
    out from text


➢   Stemming: getting the (inflectional) root of word; 
    plays, playing, played : play  




                     (CC) KBCS CDAC MUMBAI
cont..

POS(Part of Speech) tagging: 
     Ram  NNP  
     killed  VBD 
     Ravana NNP 


POS tags: NNP­ proper noun
                 VBD  verb, past tense
(Penn tree bank tagset )
                   (CC) KBCS CDAC MUMBAI
cont..

➢   Chunking: groups the similar POS tags (liberal 
    def)




                    (CC) KBCS CDAC MUMBAI
cont..
Parsing: Identifying gramatical structures in a 
sentence:  Mary saw a dog
 *(S 
    (NP Mary)
    **(VP 
       (V saw)
       ***(NP 
           (Det a)
           (N dog)   )***   )**   )*
                           (CC) KBCS CDAC MUMBAI
Applications

➢   Machine Translation
    MaTra, Google translate
➢   Transliteration 
    Xlit, Google Indic
➢   Text classification
    Spam filters
    many more.......!
                        (CC) KBCS CDAC MUMBAI
Challenges
➢   Word sense disambiguation
     Bank(financial/river)
     Plant(industrial/natural)
➢   Name entity recognition
    CDAC is in Mumbai
➢   Clause Boundary detection
    Ram who was the king of Ayodhya killed 
    Ravana 
    and many more.....!
                    (CC) KBCS CDAC MUMBAI
Installing NLTK
    Windows
➢   Install Python: Version 2.4.*, 2.5.*, or 2.6.* and 
    not 3.0
    available http://www.nltk.org/download 
    [great instructions!]
➢   Install PyYAML:
➢   Install Numpy
➢   Install Matplotlib
➢   Install NLTK         (CC) KBCS CDAC MUMBAI
cont..

    Linux/Unix
➢   Most likely python will be there (if not­> 
    package manager)
➢   Get python­nympy and python­scipy packages 
    (scintific library for multidimentional array and 
    linear algebra), use package manager to install 
    or 
    'sudo apt­get install python­numpy python­
    scipy'
                      (CC) KBCS CDAC MUMBAI
cont..

    NLTK Source Installation
➢   Download NLTK source 
    ( http://nltk.googlecode.com/files/nltk­2.0b8.zip)
➢   Unzip it
➢   Go to the new unziped folder
➢   Just do it! 
    'sudo python setup.py install'
    Done :­) !
                     (CC) KBCS CDAC MUMBAI
Installing Data

➢   We need data to experiment with
    It's too simple (same for all platform)...
      $python            #(IDLE) enter to python prompt 
      >>> import nltk
      >>> nltk.download()
    select any one option that you like/need



                        (CC) KBCS CDAC MUMBAI
cont..




         (CC) KBCS CDAC MUMBAI
cont..

 Test installation
 ( set path if needed: as 
 export NLTK_DATA = '/home/....../nltk_data')
       >>> from nltk.corpus import brown
       >>> brown.words()[0:20]
        ['The', 'Fulton', 'County', 'Grand', 'Jury']
                     All set!


                          (CC) KBCS CDAC MUMBAI
Playing with text

 Feeling python 
 $python
 >>>print 'Hello world'
 >>>2+3+4
     
 Okay !


                   (CC) KBCS CDAC MUMBAI
cont..

 Data from book:
   >>>from nltk.book import *
 see what is it?
    >>>text1
 Search the word:
   >>> text1.concordance("monstrous")
 Find similar words 
   >>> text1.similar("monstrous")
                   (CC) KBCS CDAC MUMBAI
cont..
Find common context
  >>>text2.common_contexts(["monstrous", "very"])
Find position of the word in entire text
  >>>text4.dispersion_plot(["citizens", "democracy", 
  "freedom", "duties", "America"])
Counting vocab
   >>>len(text4)  #gives tokens(words and 
  punctuations)


                    (CC) KBCS CDAC MUMBAI
cont..
 Vocabulary(unique words) used by author
   >>>len(set(text1))
 See sorted vocabulary(upper case first!)
   >>>sorted(set(text1))
 Measuring richness of text(average repetition)
   >>>len(text1)/len(set(text1))
 (before division import for floating point division
 from __future__ import division)
                    (CC) KBCS CDAC MUMBAI
cont..

 Counting occurances 
      >>>text5.count("lol")
 % of the text taken
      >>> 100 * text5.count('a') / len(text5)
 let's write function for it in python 
      >>>def  prcntg_oftext(word,text):
                   return100*.count(wrd)/len(txt)
          
                        (CC) KBCS CDAC MUMBAI
cont..
 Writing our own text
   >>>textown =['a','text','is','a','list','of','words']
 adding texts(lists, can contain anything)
   >>>sent = ['pqr']
   >>>sent2 = ['xyz']
   >>>sent+sent2
 append to text
   >>>sent.append('abc')

                        (CC) KBCS CDAC MUMBAI
cont..

 Indexing text
   >>> text4[173]
 or 
   >>>text4.index('awaken')
 slicing
 text1[3:4], text[:5], text[10:] try it! (notice ending 
 index one less than specified)

                    (CC) KBCS CDAC MUMBAI
cont..

 Strings in python
   >>>name = 'astring'
   >>>name[0]
 what's more
   >>>name*2
   >>>'SPACE '.join(['let us','join'])




                     (CC) KBCS CDAC MUMBAI
cont..

 Statistics to text
 Frequency Distribution
    >>> fdist1 = FreqDist(text1)
    >>> fdist1
    >>> vocabulary1 = fdist1.keys()
    >>>vocabulary1[:25]
 let's plot it...
    >>>fdist1.plot(50, cumulative=True)
                      (CC) KBCS CDAC MUMBAI
cont..

 *what the text1 is about?
 *are all the words meaningful?
 Let's find long words i.e.
 Words from text1 with length more than  10
 in python
    >>>longw=[word for word in text1 if len(word)>10]
 let's see it... 
    >>>sorted(longw)
                    (CC) KBCS CDAC MUMBAI
cont..

 *most long words have small freq and are not 
 major topic of text many times
 *words with long length and certain freq may 
 help to identify text topic
   >>>sorted([w for w in set(text5) if len(w) > 7 and 
   fdist5[w] > 7])




                    (CC) KBCS CDAC MUMBAI
cont..
 Bigrams and collocations:
 *Bigrams
   >>>sent=['near','river', 'bank']
   >>>bigrams(sent)
 basic to many NLP apps; transliteration, 
 translation
 *collocations: frequent bigrams of rare words
   >>>text4.collocations()
 basic to some WSD approaches
                     (CC) KBCS CDAC MUMBAI
cont..
 FreDist functions:




                  (CC) KBCS CDAC MUMBAI
cont..
 Python Frequntly needed 'word' functions




                 (CC) KBCS CDAC MUMBAI
cont

 Bigger probs:
 WSD, Anaphora resolution, Machine translation
 try and laugh:
   >>>babelize_shell()
   >>>german
   >>>run 
 or try it here: 
 http://translationparty.com/#6917644
                  (CC) KBCS CDAC MUMBAI
cont..
  Playing with text corpora
  Available corporas: Gutenberg, Web and chat 
  text, Brown, Reuters, inaugural address, etc.
    >>> from nltk.corpus import inaugural
    >>> inaugural.fileids()
    >>>fileid[:4] for fileid in 
    inaugural.fileids()]


                   (CC) KBCS CDAC MUMBAI
cont..

   >>> cfd = nltk.ConditionalFreqDist(
               (target, file[:4])
               for fileid in inaugural.fileids()
               for w in inaugural.words(fileid)
               for target in ['america', 'citizen']
               if w.lower().startswith(target))
   >>> cfd.plot()
   Conditional Freq distribution of words 'america' and 
   'citizen' with different years
                         (CC) KBCS CDAC MUMBAI
cont..
 Generating text (applying bigrams)
   >>> def generate_model(cfdist, word, num=15):
       for i in range(num):
           print word,
           word = cfdist[word].max()
   ­­­
   >>>text = nltk.corpus.genesis.words('english­kjv.txt')
   >>>bigrams = nltk.bigrams(text)
   >>>cfd = nltk.ConditionalFreqDist(bigrams)
    >>print cfd['living']
    >>generate_model(cfd, 'living')
                   (CC) KBCS CDAC MUMBAI
cont..
  WordList corpora
           >>>def unusual_words(text):
                text_vocab = set(w.lower() for w in text if  
              w.isalpha())
               english_vocab = set(w.lower() for w in 
              nltk.corpus.words.words())
               unusual = text_vocab.difference(english_vocab)
               return sorted(unusual)
  ­­
           >>>unusual_words(nltk.corpus.gutenberg.words('aust
            en­sense.txt'))
  Are the unusual or misspelled words.
                      (CC) KBCS CDAC MUMBAI
cont..
 Stop words:
   >>> from nltk.corpus import stopwords
   >>> stopwords.words('english')
   ­­ Percentage of content
   >>> def content_fraction(text):
          stopwords = nltk.corpus.stopwords.words('english')
          content = [w for w in text if w.lower() not in stopwords]
          return len(content) / len(text)
   >>> content_fraction(nltk.corpus.reuters.words())
   is the useful fraction of useful words! 
                          (CC) KBCS CDAC MUMBAI
cont..

 WordNet: structured, semanticaly orinted 
 dictionary
   >>> from nltk.corpus import wordnet as wn
 get synset/collection of synonyms
   >>> wn.synsets('motorcar')
 now get synonyms(lemmas)
   >>> wn.synset('car.n.01').lemma_names


                  (CC) KBCS CDAC MUMBAI
cont..

 What does car.n.01 actually mean?
   >>> wn.synset('car.n.01').definition
 An example:
   >>> wn.synset('car.n.01').examples
 Also
   >>>wn.lemma('supply.n.02.supply').antonyms()
 Useful data in many applications like WSD, 
 understanding text, question answering, etc. 
                   (CC) KBCS CDAC MUMBAI
cont..

 Text from web:
   >>> from urllib import urlopen
   >>> url = 
   "http://www.gutenberg.org/files/2554/2554.txt"
   >>> raw = urlopen(url).read()
   >>> type(raw)
   >>> len(raw)
 is in characters...!

                   (CC) KBCS CDAC MUMBAI
cont..
 Tokenization:
   >>> tokens = nltk.word_tokenize(raw)
   >>> type(tokens)
   >>> len(tokens)
   >>> tokens[:15]
 convert to Text
   >>> text = nltk.Text(tokens)
   >>> type(text)
 Now you can do all above things with this!
                     (CC) KBCS CDAC MUMBAI
cont..
What about HTML?
  >>> url = 
  "http://news.bbc.co.uk/2/hi/health/2284783.stm"
  >>> html = urlopen(url).read()
  >>> html[:60]
It's HTML; Clean the tags...
  >>> raw = nltk.clean_html(html)
  >>> tokens = nltk.word_tokenize(raw)
  >>> tokens[:60]
  >>> text = nltk.Text(tokens)   
                     (CC) KBCS CDAC MUMBAI
cont..
 Try it yourself 
   Processing RSS feeds using universal feed parser (
   http://feedparser.org/)


 Opening text from local machine:
    >>> f = open('document.txt')
    >>> raw = f.read()
 (make sure you are in current directory)
 then tokenize and bring to Text and play with it
                    (CC) KBCS CDAC MUMBAI
cont..
 Capture user input:
   >>> s = raw_input("Enter some text: ")
   >>> print "You typed", len(nltk.word_tokenize(s)), 
   "words."
 Writing O/p to files
   >>> output_file = open('output.txt', 'w')
   >>> words = set(nltk.corpus.genesis.words('english­
   kjv.txt'))
   >>> for word in sorted(words):     
   output_file.write(word + "n")
                      (CC) KBCS CDAC MUMBAI
cont..
 Stemming : NLTK has inbuilt stemmers 
   >>>text =['apples','called','indians','applying']
   >>> porter = nltk.PorterStemmer()
   >>> lancaster = nltk.LancasterStemmer()
   >>>[porter.stem(p) for p in text]
   >>>[lancaster.stem(p) for p in text]
 Lemmatization: WordNet lemmetizer, removes 
 affixes only if the new word exist in dictionary
   >>> wnl = nltk.WordNetLemmatizer()
   >>> [wnl.lemmatize(t) for t in text]
                   (CC) KBCS CDAC MUMBAI
cont..

 Using a POS tagger:
   >>> text = nltk.word_tokenize("And now for 
   something completely different")


 Query the POS tag documentation as:
   >>>nltk.help.upenn_tagset('NNP')
 Parsing with NLTK
             an example  
                     (CC) KBCS CDAC MUMBAI
cont..
 Classifying with NLTK:
 Problem: Gender identification from name
 Feature last letter 
             >>> def gender_features(word):
                        return {'last_letter': word[­1]}
 Get data
   >>> from nltk.corpus import names
   >>> import random



                       (CC) KBCS CDAC MUMBAI
cont..
    >>> names = ([(name, 'male') for name in 
    names.words('male.txt')] +
              [(name, 'female') for name in 
    names.words('female.txt')])
    >>> random.shuffle(names)
 Get Feature set
     >>> featuresets = [(gender_features(n), g) for (n,g) in 
    names]
    >>> train_set, test_set = featuresets[500:], featuresets[:500]
 Train
    >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
 Test                    (CC) KBCS CDAC MUMBAI

    >>> classifier.classify(gender_features('Neo'))
Conclusion

 NLTK is rich resource for starting with Natural 
 Language Processing and it's being open 
 source and based on python adds feathers to 
 it's cap!




                  (CC) KBCS CDAC MUMBAI
References



➢   http://www.nltk.org/
➢   http://www.python.org/
➢   http://wikipedia.org/ 
     (all accessed latest on 19th March 2010)
➢   Natural Language Processing with Python, Steven Bird, Ewan Klein, and 
    Edward Loper, O'REILLY


                              (CC) KBCS CDAC MUMBAI
                 Thank you 



          #More on NLP @ http://matra2.blogspot.com/
           
           #Slides were  prepared for 'supporting' the talk and are distributed as it is under CC 3.0
                               (CC) KBCS CDAC MUMBAI

More Related Content

What's hot

Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity RecognitionTomer Lieber
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Edureka!
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)WingChan46
 
Natural language processing
Natural language processingNatural language processing
Natural language processingKarenVacca
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Pythonshanbady
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AISaurav Shrestha
 

What's hot (20)

Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
NLP
NLPNLP
NLP
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Word embedding
Word embedding Word embedding
Word embedding
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Nlp
NlpNlp
Nlp
 
NLTK in 20 minutes
NLTK in 20 minutesNLTK in 20 minutes
NLTK in 20 minutes
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AI
 

Viewers also liked

Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language ProcessingJaganadh Gopinadhan
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHugJimmy Lai
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Fasihul Kabir
 
Natural language processing
Natural language processingNatural language processing
Natural language processingprashantdahake
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingrohitnayak
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 
Information retrieval using yelp dataset
Information retrieval using yelp datasetInformation retrieval using yelp dataset
Information retrieval using yelp datasetMaiyaporn Phanich
 
Nd4 j slides.pptx
Nd4 j slides.pptxNd4 j slides.pptx
Nd4 j slides.pptxAdam Gibson
 
TheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyTheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyMorgan Elizabeth Romano
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKJacob Perkins
 
Semantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event NotificationSemantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event Notificationokazaki117
 
semantic social network analysis
semantic social network analysissemantic social network analysis
semantic social network analysisguillaume ereteo
 
IEEE 2016-2017 SOFTWARE TITLE
IEEE 2016-2017  SOFTWARE TITLE IEEE 2016-2017  SOFTWARE TITLE
IEEE 2016-2017 SOFTWARE TITLE FOCUSLOGICPROJECTS
 
Ijcai ip-2015 cyberbullying-final
Ijcai ip-2015 cyberbullying-finalIjcai ip-2015 cyberbullying-final
Ijcai ip-2015 cyberbullying-finalMichal Ptaszynski
 
Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analyticsshanbady
 

Viewers also liked (20)

Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Information retrieval using yelp dataset
Information retrieval using yelp datasetInformation retrieval using yelp dataset
Information retrieval using yelp dataset
 
Nd4 j slides.pptx
Nd4 j slides.pptxNd4 j slides.pptx
Nd4 j slides.pptx
 
TheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyTheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkey
 
Corpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTKCorpus Bootstrapping with NLTK
Corpus Bootstrapping with NLTK
 
Semantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event NotificationSemantic Twitter Analyzing Tweets For Real Time Event Notification
Semantic Twitter Analyzing Tweets For Real Time Event Notification
 
semantic social network analysis
semantic social network analysissemantic social network analysis
semantic social network analysis
 
IEEE 2016-2017 SOFTWARE TITLE
IEEE 2016-2017  SOFTWARE TITLE IEEE 2016-2017  SOFTWARE TITLE
IEEE 2016-2017 SOFTWARE TITLE
 
Nltk
NltkNltk
Nltk
 
Ijcai ip-2015 cyberbullying-final
Ijcai ip-2015 cyberbullying-finalIjcai ip-2015 cyberbullying-final
Ijcai ip-2015 cyberbullying-final
 
Nltk - Boston Text Analytics
Nltk - Boston Text AnalyticsNltk - Boston Text Analytics
Nltk - Boston Text Analytics
 
Data mining on yelp dataset
Data mining on yelp datasetData mining on yelp dataset
Data mining on yelp dataset
 
Aisb cyberbullying
Aisb cyberbullyingAisb cyberbullying
Aisb cyberbullying
 

Similar to Natural Language Toolkit (NLTK), Basics

Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Rebecca Bilbro
 
Compiler Construction | Lecture 13 | Code Generation
Compiler Construction | Lecture 13 | Code GenerationCompiler Construction | Lecture 13 | Code Generation
Compiler Construction | Lecture 13 | Code GenerationEelco Visser
 
Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Vitaly Baum
 
NativeBoost
NativeBoostNativeBoost
NativeBoostESUG
 
Static PIE, How and Why - Metasploit's new POSIX payload: Mettle
Static PIE, How and Why - Metasploit's new POSIX payload: MettleStatic PIE, How and Why - Metasploit's new POSIX payload: Mettle
Static PIE, How and Why - Metasploit's new POSIX payload: MettleBrent Cook
 
Spoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in EclipseSpoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in EclipseDevnology
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable CodeBaidu, Inc.
 
Introduction to-linux
Introduction to-linuxIntroduction to-linux
Introduction to-linuxkishore1986
 
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun..."ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...Julia Cherniak
 
The Ring programming language version 1.5.4 book - Part 180 of 185
The Ring programming language version 1.5.4 book - Part 180 of 185The Ring programming language version 1.5.4 book - Part 180 of 185
The Ring programming language version 1.5.4 book - Part 180 of 185Mahmoud Samir Fayed
 
The Ring programming language version 1.5.2 book - Part 176 of 181
The Ring programming language version 1.5.2 book - Part 176 of 181The Ring programming language version 1.5.2 book - Part 176 of 181
The Ring programming language version 1.5.2 book - Part 176 of 181Mahmoud Samir Fayed
 
cs262_intro_slides.pptx
cs262_intro_slides.pptxcs262_intro_slides.pptx
cs262_intro_slides.pptxAnaLoreto13
 
Presentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStackPresentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStackDavid Sanchez
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 

Similar to Natural Language Toolkit (NLTK), Basics (20)

NLTK introduction
NLTK introductionNLTK introduction
NLTK introduction
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Modern C++
Modern C++Modern C++
Modern C++
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)
 
Compiler Construction | Lecture 13 | Code Generation
Compiler Construction | Lecture 13 | Code GenerationCompiler Construction | Lecture 13 | Code Generation
Compiler Construction | Lecture 13 | Code Generation
 
Parm
ParmParm
Parm
 
Parm
ParmParm
Parm
 
Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)Building DSLs On CLR and DLR (Microsoft.NET)
Building DSLs On CLR and DLR (Microsoft.NET)
 
Return of c++
Return of c++Return of c++
Return of c++
 
NativeBoost
NativeBoostNativeBoost
NativeBoost
 
Static PIE, How and Why - Metasploit's new POSIX payload: Mettle
Static PIE, How and Why - Metasploit's new POSIX payload: MettleStatic PIE, How and Why - Metasploit's new POSIX payload: Mettle
Static PIE, How and Why - Metasploit's new POSIX payload: Mettle
 
Spoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in EclipseSpoofax: ontwikkeling van domeinspecifieke talen in Eclipse
Spoofax: ontwikkeling van domeinspecifieke talen in Eclipse
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable Code
 
Introduction to-linux
Introduction to-linuxIntroduction to-linux
Introduction to-linux
 
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun..."ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
"ClojureScript journey: from little script, to CLI program, to AWS Lambda fun...
 
The Ring programming language version 1.5.4 book - Part 180 of 185
The Ring programming language version 1.5.4 book - Part 180 of 185The Ring programming language version 1.5.4 book - Part 180 of 185
The Ring programming language version 1.5.4 book - Part 180 of 185
 
The Ring programming language version 1.5.2 book - Part 176 of 181
The Ring programming language version 1.5.2 book - Part 176 of 181The Ring programming language version 1.5.2 book - Part 176 of 181
The Ring programming language version 1.5.2 book - Part 176 of 181
 
cs262_intro_slides.pptx
cs262_intro_slides.pptxcs262_intro_slides.pptx
cs262_intro_slides.pptx
 
Presentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStackPresentation of Python, Django, DockerStack
Presentation of Python, Django, DockerStack
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 

More from Prakash Pimpale

Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 
Technology Entrepreneurship Venture Lab 2012 - Trunk monkey
Technology Entrepreneurship Venture Lab 2012 - Trunk monkeyTechnology Entrepreneurship Venture Lab 2012 - Trunk monkey
Technology Entrepreneurship Venture Lab 2012 - Trunk monkeyPrakash Pimpale
 
Entrepreneurship, Entrepreneurs and Startups
Entrepreneurship, Entrepreneurs and Startups Entrepreneurship, Entrepreneurs and Startups
Entrepreneurship, Entrepreneurs and Startups Prakash Pimpale
 
Collaboration tools in education
Collaboration tools in educationCollaboration tools in education
Collaboration tools in educationPrakash Pimpale
 
Genetic Algorithms Made Easy
Genetic Algorithms Made EasyGenetic Algorithms Made Easy
Genetic Algorithms Made EasyPrakash Pimpale
 

More from Prakash Pimpale (6)

Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Patent Search
Patent SearchPatent Search
Patent Search
 
Technology Entrepreneurship Venture Lab 2012 - Trunk monkey
Technology Entrepreneurship Venture Lab 2012 - Trunk monkeyTechnology Entrepreneurship Venture Lab 2012 - Trunk monkey
Technology Entrepreneurship Venture Lab 2012 - Trunk monkey
 
Entrepreneurship, Entrepreneurs and Startups
Entrepreneurship, Entrepreneurs and Startups Entrepreneurship, Entrepreneurs and Startups
Entrepreneurship, Entrepreneurs and Startups
 
Collaboration tools in education
Collaboration tools in educationCollaboration tools in education
Collaboration tools in education
 
Genetic Algorithms Made Easy
Genetic Algorithms Made EasyGenetic Algorithms Made Easy
Genetic Algorithms Made Easy
 

Recently uploaded

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Natural Language Toolkit (NLTK), Basics

  • 1. Natural Language Toolkit [NLTK]                                           Prakash B Pimpale                                                 pbpimpale@gmail.com @ FOSS(From the Open Source Shelf)  An open source softwares seminar series (CC) KBCS CDAC MUMBAI
  • 2. Let's start ➢ What is it about? ➢ NLTK – installing and using it for some 'basic NLP'  and Text Processing tasks ➢ Some NLP and Some python  ➢ How will we proceed ? ➢ NLTK introduction ➢ NLP introduction ➢ NLTK installation ➢ Playing with text using NLTK  ➢ Conclusion (CC) KBCS CDAC MUMBAI
  • 3. NLTK ➢ Open source python modules and linguistic  data for Natural Langauge Processing  application  ➢ Dveloped by group of people, project leaders:  Steven Bird, Edward Loper, Ewan Klein  ➢ Python: simple, free and open source, portable,  object oriented, interpreted  (CC) KBCS CDAC MUMBAI
  • 4. NLP ➢ Natural Language Processing: field of computer science  and linguistics woks out interactions between computers  and human (natural) languages. ➢ some brand ambassadors: machine translation,  automatic summarization, information extraction,  transliteration, question answering, opinion mining   ➢ Basic tasks: stemming, POS tagging, chunking, parsing,  etc. ➢ Involves simple frequency count to undertanding and  generating complex text  (CC) KBCS CDAC MUMBAI
  • 5. Terms/Steps in NLP ➢ Tokenization: getting words and punctuations  out from text ➢ Stemming: getting the (inflectional) root of word;  plays, playing, played : play   (CC) KBCS CDAC MUMBAI
  • 7. cont.. ➢ Chunking: groups the similar POS tags (liberal  def) (CC) KBCS CDAC MUMBAI
  • 9. Applications ➢ Machine Translation MaTra, Google translate ➢ Transliteration  Xlit, Google Indic ➢ Text classification Spam filters many more.......! (CC) KBCS CDAC MUMBAI
  • 10. Challenges ➢ Word sense disambiguation  Bank(financial/river)  Plant(industrial/natural) ➢ Name entity recognition CDAC is in Mumbai ➢ Clause Boundary detection Ram who was the king of Ayodhya killed  Ravana  and many more.....! (CC) KBCS CDAC MUMBAI
  • 11. Installing NLTK Windows ➢ Install Python: Version 2.4.*, 2.5.*, or 2.6.* and  not 3.0 available http://www.nltk.org/download  [great instructions!] ➢ Install PyYAML: ➢ Install Numpy ➢ Install Matplotlib ➢ Install NLTK (CC) KBCS CDAC MUMBAI
  • 12. cont.. Linux/Unix ➢ Most likely python will be there (if not­>  package manager) ➢ Get python­nympy and python­scipy packages  (scintific library for multidimentional array and  linear algebra), use package manager to install  or  'sudo apt­get install python­numpy python­ scipy' (CC) KBCS CDAC MUMBAI
  • 13. cont.. NLTK Source Installation ➢ Download NLTK source  ( http://nltk.googlecode.com/files/nltk­2.0b8.zip) ➢ Unzip it ➢ Go to the new unziped folder ➢ Just do it!  'sudo python setup.py install' Done :­) ! (CC) KBCS CDAC MUMBAI
  • 14. Installing Data ➢ We need data to experiment with It's too simple (same for all platform)... $python            #(IDLE) enter to python prompt  >>> import nltk >>> nltk.download() select any one option that you like/need (CC) KBCS CDAC MUMBAI
  • 15. cont.. (CC) KBCS CDAC MUMBAI
  • 16. cont.. Test installation ( set path if needed: as  export NLTK_DATA = '/home/....../nltk_data')    >>> from nltk.corpus import brown    >>> brown.words()[0:20]     ['The', 'Fulton', 'County', 'Grand', 'Jury']                     All set! (CC) KBCS CDAC MUMBAI
  • 17. Playing with text Feeling python  $python >>>print 'Hello world' >>>2+3+4      Okay ! (CC) KBCS CDAC MUMBAI
  • 18. cont.. Data from book: >>>from nltk.book import * see what is it?  >>>text1 Search the word: >>> text1.concordance("monstrous") Find similar words  >>> text1.similar("monstrous") (CC) KBCS CDAC MUMBAI
  • 19. cont.. Find common context >>>text2.common_contexts(["monstrous", "very"]) Find position of the word in entire text >>>text4.dispersion_plot(["citizens", "democracy",  "freedom", "duties", "America"]) Counting vocab  >>>len(text4)  #gives tokens(words and  punctuations) (CC) KBCS CDAC MUMBAI
  • 20. cont.. Vocabulary(unique words) used by author >>>len(set(text1)) See sorted vocabulary(upper case first!) >>>sorted(set(text1)) Measuring richness of text(average repetition) >>>len(text1)/len(set(text1)) (before division import for floating point division from __future__ import division) (CC) KBCS CDAC MUMBAI
  • 21. cont.. Counting occurances  >>>text5.count("lol") % of the text taken >>> 100 * text5.count('a') / len(text5) let's write function for it in python  >>>def  prcntg_oftext(word,text):              return100*.count(wrd)/len(txt)              (CC) KBCS CDAC MUMBAI
  • 22. cont.. Writing our own text >>>textown =['a','text','is','a','list','of','words'] adding texts(lists, can contain anything) >>>sent = ['pqr'] >>>sent2 = ['xyz'] >>>sent+sent2 append to text >>>sent.append('abc') (CC) KBCS CDAC MUMBAI
  • 23. cont.. Indexing text >>> text4[173] or  >>>text4.index('awaken') slicing text1[3:4], text[:5], text[10:] try it! (notice ending  index one less than specified) (CC) KBCS CDAC MUMBAI
  • 24. cont.. Strings in python >>>name = 'astring' >>>name[0] what's more >>>name*2 >>>'SPACE '.join(['let us','join']) (CC) KBCS CDAC MUMBAI
  • 25. cont.. Statistics to text Frequency Distribution >>> fdist1 = FreqDist(text1) >>> fdist1 >>> vocabulary1 = fdist1.keys() >>>vocabulary1[:25] let's plot it... >>>fdist1.plot(50, cumulative=True) (CC) KBCS CDAC MUMBAI
  • 26. cont.. *what the text1 is about? *are all the words meaningful? Let's find long words i.e. Words from text1 with length more than  10 in python >>>longw=[word for word in text1 if len(word)>10] let's see it...  >>>sorted(longw) (CC) KBCS CDAC MUMBAI
  • 27. cont.. *most long words have small freq and are not  major topic of text many times *words with long length and certain freq may  help to identify text topic >>>sorted([w for w in set(text5) if len(w) > 7 and  fdist5[w] > 7]) (CC) KBCS CDAC MUMBAI
  • 28. cont.. Bigrams and collocations: *Bigrams >>>sent=['near','river', 'bank'] >>>bigrams(sent) basic to many NLP apps; transliteration,  translation *collocations: frequent bigrams of rare words >>>text4.collocations() basic to some WSD approaches (CC) KBCS CDAC MUMBAI
  • 29. cont.. FreDist functions: (CC) KBCS CDAC MUMBAI
  • 31. cont Bigger probs: WSD, Anaphora resolution, Machine translation try and laugh: >>>babelize_shell() >>>german >>>run  or try it here:  http://translationparty.com/#6917644 (CC) KBCS CDAC MUMBAI
  • 32. cont.. Playing with text corpora Available corporas: Gutenberg, Web and chat  text, Brown, Reuters, inaugural address, etc. >>> from nltk.corpus import inaugural >>> inaugural.fileids() >>>fileid[:4] for fileid in  inaugural.fileids()] (CC) KBCS CDAC MUMBAI
  • 33. cont.. >>> cfd = nltk.ConditionalFreqDist(             (target, file[:4])             for fileid in inaugural.fileids()             for w in inaugural.words(fileid)             for target in ['america', 'citizen']             if w.lower().startswith(target)) >>> cfd.plot() Conditional Freq distribution of words 'america' and  'citizen' with different years (CC) KBCS CDAC MUMBAI
  • 34. cont.. Generating text (applying bigrams) >>> def generate_model(cfdist, word, num=15):     for i in range(num):         print word,         word = cfdist[word].max() ­­­ >>>text = nltk.corpus.genesis.words('english­kjv.txt') >>>bigrams = nltk.bigrams(text) >>>cfd = nltk.ConditionalFreqDist(bigrams)  >>print cfd['living']  >>generate_model(cfd, 'living') (CC) KBCS CDAC MUMBAI
  • 35. cont.. WordList corpora >>>def unusual_words(text):      text_vocab = set(w.lower() for w in text if   w.isalpha())     english_vocab = set(w.lower() for w in  nltk.corpus.words.words())     unusual = text_vocab.difference(english_vocab)     return sorted(unusual) ­­ >>>unusual_words(nltk.corpus.gutenberg.words('aust en­sense.txt')) Are the unusual or misspelled words. (CC) KBCS CDAC MUMBAI
  • 36. cont.. Stop words: >>> from nltk.corpus import stopwords >>> stopwords.words('english') ­­ Percentage of content >>> def content_fraction(text):        stopwords = nltk.corpus.stopwords.words('english')        content = [w for w in text if w.lower() not in stopwords]        return len(content) / len(text) >>> content_fraction(nltk.corpus.reuters.words()) is the useful fraction of useful words!  (CC) KBCS CDAC MUMBAI
  • 37. cont.. WordNet: structured, semanticaly orinted  dictionary >>> from nltk.corpus import wordnet as wn get synset/collection of synonyms >>> wn.synsets('motorcar') now get synonyms(lemmas) >>> wn.synset('car.n.01').lemma_names (CC) KBCS CDAC MUMBAI
  • 38. cont.. What does car.n.01 actually mean? >>> wn.synset('car.n.01').definition An example: >>> wn.synset('car.n.01').examples Also >>>wn.lemma('supply.n.02.supply').antonyms() Useful data in many applications like WSD,  understanding text, question answering, etc.  (CC) KBCS CDAC MUMBAI
  • 39. cont.. Text from web: >>> from urllib import urlopen >>> url =  "http://www.gutenberg.org/files/2554/2554.txt" >>> raw = urlopen(url).read() >>> type(raw) >>> len(raw) is in characters...! (CC) KBCS CDAC MUMBAI
  • 40. cont.. Tokenization: >>> tokens = nltk.word_tokenize(raw) >>> type(tokens) >>> len(tokens) >>> tokens[:15] convert to Text >>> text = nltk.Text(tokens) >>> type(text) Now you can do all above things with this! (CC) KBCS CDAC MUMBAI
  • 41. cont.. What about HTML? >>> url =  "http://news.bbc.co.uk/2/hi/health/2284783.stm" >>> html = urlopen(url).read() >>> html[:60] It's HTML; Clean the tags... >>> raw = nltk.clean_html(html) >>> tokens = nltk.word_tokenize(raw) >>> tokens[:60] >>> text = nltk.Text(tokens)    (CC) KBCS CDAC MUMBAI
  • 42. cont.. Try it yourself  Processing RSS feeds using universal feed parser ( http://feedparser.org/) Opening text from local machine:  >>> f = open('document.txt')  >>> raw = f.read() (make sure you are in current directory) then tokenize and bring to Text and play with it (CC) KBCS CDAC MUMBAI
  • 43. cont.. Capture user input: >>> s = raw_input("Enter some text: ") >>> print "You typed", len(nltk.word_tokenize(s)),  "words." Writing O/p to files >>> output_file = open('output.txt', 'w') >>> words = set(nltk.corpus.genesis.words('english­ kjv.txt')) >>> for word in sorted(words):      output_file.write(word + "n") (CC) KBCS CDAC MUMBAI
  • 44. cont.. Stemming : NLTK has inbuilt stemmers  >>>text =['apples','called','indians','applying'] >>> porter = nltk.PorterStemmer() >>> lancaster = nltk.LancasterStemmer() >>>[porter.stem(p) for p in text] >>>[lancaster.stem(p) for p in text] Lemmatization: WordNet lemmetizer, removes  affixes only if the new word exist in dictionary >>> wnl = nltk.WordNetLemmatizer() >>> [wnl.lemmatize(t) for t in text] (CC) KBCS CDAC MUMBAI
  • 45. cont.. Using a POS tagger: >>> text = nltk.word_tokenize("And now for  something completely different") Query the POS tag documentation as: >>>nltk.help.upenn_tagset('NNP') Parsing with NLTK           an example   (CC) KBCS CDAC MUMBAI
  • 46. cont.. Classifying with NLTK: Problem: Gender identification from name Feature last letter  >>> def gender_features(word):            return {'last_letter': word[­1]} Get data >>> from nltk.corpus import names >>> import random (CC) KBCS CDAC MUMBAI
  • 47. cont.. >>> names = ([(name, 'male') for name in  names.words('male.txt')] +           [(name, 'female') for name in  names.words('female.txt')]) >>> random.shuffle(names) Get Feature set  >>> featuresets = [(gender_features(n), g) for (n,g) in  names] >>> train_set, test_set = featuresets[500:], featuresets[:500] Train >>> classifier = nltk.NaiveBayesClassifier.train(train_set) Test (CC) KBCS CDAC MUMBAI >>> classifier.classify(gender_features('Neo'))
  • 48. Conclusion NLTK is rich resource for starting with Natural  Language Processing and it's being open  source and based on python adds feathers to  it's cap! (CC) KBCS CDAC MUMBAI
  • 49. References ➢ http://www.nltk.org/ ➢ http://www.python.org/ ➢ http://wikipedia.org/   (all accessed latest on 19th March 2010) ➢ Natural Language Processing with Python, Steven Bird, Ewan Klein, and  Edward Loper, O'REILLY (CC) KBCS CDAC MUMBAI
  • 50.                  Thank you      #More on NLP @ http://matra2.blogspot.com/                #Slides were  prepared for 'supporting' the talk and are distributed as it is under CC 3.0 (CC) KBCS CDAC MUMBAI