Natural Language Processing using JavaScript "Natural" Library. This deck covers Natural Language Understanding using JavaScript "Natural" library in detail
NLP Basics with Natural JavaScript Library
1. Basic Natural Language Processing using Natural (JavaScript/Node) Library
Aniruddha Chakrabarti
AVP and Chief Architect, Digital, Mphasis
@anchakra | Linkedin.com/in/aniruddhac | slideshare.net/aniruddha.chakrabarti/
2. Agenda
• Emergence of Artificial Intelligence, AI First
• What is Natural Language Processing (NLP)
• Natural JavaScript/Node NLP Library
• Tokenization - Word Tokenizer
• Stemming and Lemmatization
• String Distance
• Inflectors
• Phonetics
• N-Grams
• Classifier
• tf-idf
• POS Tagger
• Spell Check
3. Third Era of Computing * - AI First/AI Everywhere (Cognitive Systems)
* From “The Computing Universe” by Tony Hey and Gyuri Pápay
Tabulating Machines (1960 – 1980)
→ Turing Machine
→ Automating manual processes, tabulating data
→ Reducing manual effort and time
→ IBM System/360 (S/360), Mainframes, AS/400
Programmable Systems (1980 - 2010)
→ Computing Power (Moore’s Law)
→ Systems need to be explicitly programmed using explicit logic and rules (pre-programmed)
→ Personal Computers (PCs), Communication (Networked PCs, Client/Server, Internet, WWW)
→ Automating business processes
→ Mostly structured data
AI First/AI Everywhere (Cognitive Systems) (2010 - Current)
→ Systems that learn from historical data and can make predictions; not rule-based systems
→ Uses Machine Learning and NLP to analyze unstructured data (text, image, audio, video)
→ Predictive Analytics, Deep Learning, Neural Nets
→ OCR, Speech recognition, Text to speech, Face recognition, Video analysis, …
→ Cognitive Services (pay as you go model) – IBM Watson, Microsoft Cognitive Services, …
→ Robotics, Internet of Things, Conversational Systems, Wearables, Blur of physical & virtual
→ Still mostly Weak AI / Narrow AI
Real AI ? (period: ?)
→ Strong AI / Full AI
→ Artificial General Intelligence (AGI)
Timeline labels: AI Winter, AI Summer
• Artificial Intelligence has emerged as the third era of computing, after tabulating machines and
programmable systems.
4. Gartner Hype Cycle … 2017
• AI technologies like Cognitive Computing, Virtual
Assistants/Chatbot, Conversational AI, Machine
Learning, Deep Learning and Autonomous Vehicles
appear at the peak in Gartner Hype Cycle of Emerging
Technologies, 2017.
• Reinforcement Learning and Artificial General
Intelligence (AGI) have appeared at the starting point of
the hype cycle – they are expected to peak in the coming years.
5. Emergence of “AI Everywhere”
Gartner reckons AI is one of the
three mega trends. AI
technologies like
Conversational UI, Machine
Learning, Deep Learning and
Cognitive Computing
constitute “AI Everywhere”
6. What is Natural Language Processing?
• Field of computer science, artificial intelligence and computational linguistics concerned
with the interactions between computers and human (natural) languages, and, in particular,
concerned with programming computers to fruitfully process large natural language corpora –
Wikipedia
• Broadly categorized into two areas -
▪ Natural Language Understanding (NLU)
▪ Natural Language Generation (NLG)
7. Some applications of NLP
• Spell correction (MS Word/ any other editor)
• Search engines (Google, Bing, Yahoo, Wolfram Alpha)
• Speech engines (Siri, Google Voice, Cortana)
• Personal Voice Assistants (Amazon Alexa, Google Home, …)
• Spam classifiers (All e-mail services)
• News feeds (Google, Yahoo!, and so on)
• Machine translation (Google Translate, and so on)
• Chatbots, Intelligent Virtual Agent/IVA
• IBM Watson, Microsoft LUIS, Amazon Lex/Alexa
8. NLP Tools & Libraries
• GATE
• Mallet (Java)
• Open NLP – Apache (Java)
• UIMA
• CoreNLP - Stanford CoreNLP toolkit (Java)
• Gensim (Python)
• Natural Language Toolkit / NLTK (Python) – by far the most popular NLP library & tool
• spaCy (Python)
• TextBlob (Python) – built on top of NLTK
• Natural Library (JavaScript/Node)
9. What is Natural
• "Natural" is a general natural language processing library for nodejs.
• Supports basic NLP tasks like tokenizing, stemming, classification, phonetics, tf-idf, WordNet,
string similarity, inflections
• At the moment, most of the algorithms are English-specific
• Created by Chris Umbel
• Loosely based on NLTK (Python) NLP Library
• https://github.com/NaturalNode/natural
• http://www.chrisumbel.com/article/node_js_natural_language_porter_stemmer_lancaster_bayes_naive_metaphone_soundex
10. Natural library install and setup
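• Install using npm (package manager for Node). A local install in the project folder is what require
resolves by default; the -g switch (global installation) places the package outside that search path
• Include the Natural package through require
npm install natural
// include the natural library
let Natural = require('natural');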
11. Tokenization
• A word (Token) is the minimal unit that a machine can understand and process.
• Tokenization is the process of splitting the raw string into meaningful tokens
• Raw text cannot be further processed without going through tokenization.
• Complexity of tokenization varies according to the need of the NLP application, and the
complexity of the language itself.
▪ In English it can be as simple as choosing only words and numbers through a regular
expression. But for Chinese and Japanese, it will be a very complex task.
• Two primary types of tokenizers:
▪ Word Tokenizer: Tokenizes raw text to words
▪ Sentence Tokenizer: Tokenizes raw text to sentences
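As a rough illustration of the two tokenizer types above, the sketch below uses plain regular expressions. The function names wordTokenize and sentenceTokenize are hypothetical and are not part of the Natural API; real tokenizers handle many more cases (abbreviations, contractions, non-English scripts).

```javascript
// Word tokenizer sketch: keep runs of letters and digits, drop punctuation.
function wordTokenize(text) {
  return text.match(/[A-Za-z0-9]+/g) || [];
}

// Sentence tokenizer sketch: cut after '.', '!' or '?'.
// Naive on purpose: it will wrongly split on abbreviations such as "Dr.".
function sentenceTokenize(text) {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

const sample = "Hello, how are you? I don't know you!";
console.log(wordTokenize(sample));
// prints [ 'Hello', 'how', 'are', 'you', 'I', 'don', 't', 'know', 'you' ]
console.log(sentenceTokenize(sample));
// prints [ 'Hello, how are you?', "I don't know you!" ]
```

Note how the apostrophe in "don't" splits the word in two; this is the price of the simple "letters and digits only" rule.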
12. Word Tokenizer
• A word (Token) is the minimal unit that a machine can understand & process
• Tokenization is the process of splitting the raw string into meaningful tokens – Tokenizer
tokenizes or splits raw text into words
• Natural comes with multiple tokenizers -
▪ Word Tokenizer: a tokenizer that divides a text into sequences of alphabetic and
numeric characters. (Ignores punctuation)
▪ Word Punct Tokenizer: Word + punctuation tokenizer. A tokenizer that divides a text into
sequences of alphabetic and non-alphabetic characters.
▪ Treebank Word Tokenizer: uses regular expressions to tokenize text as in Penn
Treebank
▪ Regexp Tokenizer: Tokenizes text using regular expression patterns.
▪ Aggressive Tokenizer:
13. Word Tokenizer (Cont’d)
var sentence = "Hello, how are you? I don't know you!";
// WordTokenizer: alphabetic and numeric sequences only, punctuation is dropped
var wordTokenizer = new Natural.WordTokenizer();
console.log(wordTokenizer.tokenize(sentence));
// prints [ 'Hello', 'how', 'are', 'you', 'I', 'don', 't', 'know', 'you' ]
// WordPunctTokenizer: punctuation is kept as separate tokens
var wordPunctTokenizer = new Natural.WordPunctTokenizer();
console.log(wordPunctTokenizer.tokenize(sentence));
// prints [ 'Hello', ', ', 'how', 'are', 'you', '? ', 'I', 'don', "'",
// 't', 'know', 'you', '!' ]
// TreebankWordTokenizer: follows Penn Treebank conventions, so punctuation is kept
// and contractions are split ("don't" becomes 'do' and "n't")
var treebankTokenizer = new Natural.TreebankWordTokenizer();
console.log(treebankTokenizer.tokenize(sentence));
// AggressiveTokenizer (note the spelling: two g's)
console.log(new Natural.AggressiveTokenizer().tokenize(sentence));
// prints [ 'Hello', 'how', 'are', 'you', 'I', 'don', 't', 'know', 'you' ]
14. Stemming
• Process of reducing inflected or derived words to their word stem, base or root form.
• Similar to cutting down the branches of a tree to its stem
• A crude, rule-based process that clubs together different variations of
the same token
• Removes suffixes such as -s/-es, -ing or -ed
eating, eats, eaten, eat -> eat
stopping, stopped, stops, stop -> stop
ate -> ate (wrong; it should be eat)
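The suffix-stripping idea above can be sketched in a few lines. This is a deliberately crude toy, not the Porter or Lancaster algorithm and not Natural's code; the function name crudeStem and its rule list are assumptions made for illustration.

```javascript
// Crude suffix-stripping stemmer (toy example).
// Strips -ing, -ed, -es or -s, then undoes consonant doubling ("stopp" -> "stop").
function crudeStem(word) {
  const w = word.toLowerCase();
  for (const suffix of ["ing", "ed", "es", "s"]) {
    // Require at least three letters to remain so short words stay intact.
    if (w.endsWith(suffix) && w.length > suffix.length + 2) {
      let stem = w.slice(0, -suffix.length);
      const doubled = /([^aeiou])\1$/.test(stem);
      if (doubled && (suffix === "ing" || suffix === "ed")) {
        stem = stem.slice(0, -1);
      }
      return stem;
    }
  }
  return w;
}

console.log(["eating", "eats", "stopping", "stopped", "stops"].map(crudeStem));
// prints [ 'eat', 'eat', 'stop', 'stop', 'stop' ]
console.log(crudeStem("ate")); // prints ate (irregular forms are not handled)
```

The last line shows the limitation named on the slide: no suffix rule can map an irregular form such as "ate" back to "eat".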
15. Stemming (Cont’d)
• Different stemming algorithms -
▪ Lovins Stemmer - First published stemmer was written by Julie Beth Lovins in 1968.
Lovins Stemmer is not used currently.
▪ Porter Stemmer - Written by Martin Porter and published in July 1980. Very widely used; it
became the de facto standard algorithm for English stemming.
▪ Lancaster Stemmer - Paice/Husk stemmer developed at Lancaster University. The
stemmer, although remaining efficient and easily implemented, is known to be very
strong and aggressive. The stemmer utilizes a single table of rules, each of which may
specify the removal or replacement of an ending.
▪ Snowball Stemmer – Also called Porter2 stemmer, since this is an updated version of
original Porter Stemmer. Natural does not support Snowball Stemmer
• Lemmatization is a more robust and methodical way of combining grammatical variations to
the root of a word.
▪ Natural does not support any Lemmatization algorithm.
▪ NLTK and other mature NLP libraries support Lemmatization
16. Stemming – Porter Stemmer and Lancaster Stemmer
var porterStemmer = Natural.PorterStemmer;
console.log(porterStemmer.stem("ate")); // prints at
console.log(porterStemmer.stem("eating")); // prints eat
console.log(porterStemmer.stem("eats")); // prints eat
console.log(porterStemmer.stem("eat")); // prints eat
console.log(porterStemmer.stem("agreement")); // prints agreement
var lancasterStemmer = Natural.LancasterStemmer;
console.log(lancasterStemmer.stem("ate")); // prints at
console.log(lancasterStemmer.stem("eating")); // prints eat
console.log(lancasterStemmer.stem("eats")); // prints eat
console.log(lancasterStemmer.stem("eat")); // prints eat
console.log(lancasterStemmer.stem("agreement")); // prints agr
• Natural supports Porter Stemmer and Lancaster Stemmer only. It does not support Snowball
Stemmer.
• Both the stemmers provide a stem method
17. Stemming – Porter Stemmer (Non English languages)
• Natural supports Porter Stemmer in Non English languages also
• Following languages are supported -
▪ Farsi - PorterStemmerFa
▪ French - PorterStemmerFr
▪ Russian - PorterStemmerRu
▪ Spanish - PorterStemmerEs
▪ Italian - PorterStemmerIt
▪ Norwegian - PorterStemmerNo
▪ Swedish - PorterStemmerSv
▪ Portuguese - PorterStemmerPt
18. Lemmatization
• A more methodical way of converting all the grammatical/inflected forms of a word to its
root.
• Uses context and part of speech to determine the inflected form of the word and applies
different normalization rules for each part of speech to get the root word (lemma)
• Natural NLP library does not support Lemmatization.
19. Inflector
• Inflectors are used to pluralize or singularize words
• There are different types of Inflectors available in Natural Library
▪ Noun Inflector: pluralize or singularize nouns only
▪ Verb Inflector: Verbs can be pluralized/singularized with a Verb Inflector. Natural
provides an inflector called PresentVerbInflector, which works on present tense verbs
only
▪ Both the noun and verb inflectors provide singularize and pluralize methods
▪ Number or Count Inflector: forms ordinal numbers from ordinary numbers. It provides a
single method called nth, which returns the ordinal form of any number
passed
20. Inflector (Cont’d)
// pluralize or singularize nouns only
var nounInflector = new Natural.NounInflector();
console.log(nounInflector.pluralize("Book")); // prints Books
console.log(nounInflector.pluralize("radius")); // prints radii
console.log(nounInflector.singularize("flies")); // prints fly
console.log(nounInflector.singularize("men")); // prints man
var countInflector = Natural.CountInflector;
console.log(countInflector.nth("1")); // prints 1st
console.log(countInflector.nth("2")); // prints 2nd
console.log(countInflector.nth("3")); // prints 3rd
console.log(countInflector.nth("4")); // prints 4th
console.log(countInflector.nth("10")); // prints 10th
var verbInflector = new Natural.PresentVerbInflector();
console.log(verbInflector.singularize("go")); // prints goes
console.log(verbInflector.singularize("run")); // prints runs
console.log(verbInflector.pluralize("becomes")); // prints become
console.log(verbInflector.pluralize("presents")); // prints present
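For readers curious about what an ordinal (count) inflector has to handle, here is a minimal standalone sketch. The function name ordinal is hypothetical and this is not Natural's implementation; it only shows the usual English rule, including the 11th/12th/13th exception.

```javascript
// Return the English ordinal form of a whole number, e.g. 1 -> "1st".
function ordinal(value) {
  const n = Number(value);
  const lastTwo = Math.abs(n) % 100;
  // 11, 12 and 13 always take "th", even though they end in 1, 2 and 3.
  if (lastTwo >= 11 && lastTwo <= 13) return n + "th";
  switch (Math.abs(n) % 10) {
    case 1: return n + "st";
    case 2: return n + "nd";
    case 3: return n + "rd";
    default: return n + "th";
  }
}

console.log([1, 2, 3, 4, 10, 11, 23, 112].map(ordinal));
// prints [ '1st', '2nd', '3rd', '4th', '10th', '11th', '23rd', '112th' ]
```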
21. N-Grams
• An n-gram is a contiguous sequence of n items from a given sample of text or speech.
• The items can be phonemes, syllables, letters, words or base pairs according to the
application. The n-grams typically are collected from a text or speech corpus.
• When the items are words, n-grams may also be called shingles
• An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram".
• Larger sizes are sometimes referred to by the value of n, e.g., "four-gram",
"five-gram", and so on.
Example for the text "Hello how are you":
unigram: Hello | how | are | you
bigram: Hello how | how are | are you
trigram: Hello how are | how are you
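The sliding-window idea behind n-grams fits in a short function. The sketch below is plain JavaScript with a hypothetical helper name (ngrams); it is meant to show the mechanics, not to reproduce any library's implementation.

```javascript
// Return all contiguous runs of n tokens.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

const words = ["Hello", "how", "are", "you"];
console.log(ngrams(words, 2));
// prints [ [ 'Hello', 'how' ], [ 'how', 'are' ], [ 'are', 'you' ] ]
console.log(ngrams(words, 3));
// prints [ [ 'Hello', 'how', 'are' ], [ 'how', 'are', 'you' ] ]
```

A text of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.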
23. Phonetics
• A phonetic algorithm is an algorithm for indexing of words by their pronunciation.
• A phonetic matching algorithm is an algorithm that matches words by their pronunciation rather
than their spelling.
• Most phonetic algorithms were developed for use with the English language. Consequently,
applying the rules to words in other languages might not give a meaningful result.
• Some of the well known phonetics algorithms are –
▪ Soundex - Developed to encode surnames for use in censuses. Soundex codes are four-
character strings composed of a single letter followed by three numbers.
▪ Daitch–Mokotoff Soundex - Refinement of Soundex designed to better match surnames of
Slavic & Germanic origin. Daitch–Mokotoff Soundex codes are strings composed of six
numeric digits.
▪ Cologne phonetics - Similar to Soundex, but more suitable for German words.
▪ Metaphone, Double Metaphone, and Metaphone 3 - Suitable for use with most English
words, not just names. Metaphone algorithms are the basis for many popular spell checkers.
▪ New York State Identification and Intelligence System (NYSIIS) - Maps similar phonemes to
the same letter. The result is a string that can be pronounced by the reader without decoding.
▪ Match Rating Approach developed by Western Airlines in 1977 - this algorithm has an
encoding and range comparison technique.
▪ Caverphone, created to assist in data matching between late 19th century and early 20th
century electoral rolls, optimized for accents present in parts of New Zealand.
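To make the "single letter followed by three numbers" description of Soundex concrete, here is a minimal sketch of the classic American Soundex encoding. It is written from the textbook rules, not taken from Natural's source; the names SOUNDEX_DIGITS and soundexCode are assumptions for this example.

```javascript
// Consonant groups that share a Soundex digit; vowels, h, w and y have no digit.
const SOUNDEX_DIGITS = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6,
};

function soundexCode(word) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";
  let code = letters[0].toUpperCase();
  let previous = SOUNDEX_DIGITS[letters[0]] || 0;
  for (let i = 1; i < letters.length && code.length < 4; i++) {
    const ch = letters[i];
    const digit = SOUNDEX_DIGITS[ch] || 0;
    // Adjacent letters with the same digit are coded only once.
    if (digit !== 0 && digit !== previous) code += digit;
    // h and w do not break a run of same-digit letters; vowels do.
    if (ch !== "h" && ch !== "w") previous = digit;
  }
  return code.padEnd(4, "0"); // pad short codes with zeros
}

console.log(soundexCode("Robert"), soundexCode("Rupert")); // prints R163 R163
console.log(soundexCode("nuremberg"), soundexCode("nuremburg")); // prints N651 N651
```

Two words "sound alike" under Soundex when their codes are equal, which is the kind of comparison a phonetic matcher performs.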
24. Phonetics Matching (Cont’d)
• Natural supports Phonetic Matching using three algorithms –
▪ SoundEx
▪ Metaphone
▪ DoubleMetaphone
var metaphone = Natural.Metaphone;
var soundex = Natural.SoundEx;
var doubleMetaphone = Natural.DoubleMetaphone;
// using SoundEx for phonetic matching
console.log(soundex.compare("nuremberg", "nuremburg")); // returns true
console.log(soundex.compare("Paris", "Pari")); // returns false
// using Metaphone for phonetic matching
console.log(metaphone.compare("Fool", "Full")); // returns true
console.log(metaphone.compare("Fool", "Failed")); // returns false
// using Double Metaphone for phonetic matching
console.log(doubleMetaphone.compare("Bangalore", "Bengaluru")); // returns true
console.log(doubleMetaphone.compare("Mumbai", "Bombay")); // returns false
25. String Distance
• String Distance measures how closely two strings match.
• Natural provides JaroWinkler Distance and Levenshtein Distance algorithms for String
Distance match
JaroWinkler Distance
• The Jaro measure scores two words by the number of characters they share (within a limited
distance of each other) and by how many of those shared characters are out of order (transpositions).
• JaroWinkler is a variant proposed in 1990 by William E. Winkler of the Jaro metric (1989,
Matthew A. Jaro); it gives extra weight to a common prefix.
• Despite the name, the function returns a similarity score between 0 and 1 that tells how closely
the strings match (0 = no match, 1 = exact match)
// Using JaroWinkler Distance algorithm
console.log(Natural.JaroWinklerDistance("Hello", "Hello")); // returns 1: exact match
console.log(Natural.JaroWinklerDistance("Me", "You")); // returns 0: no match
console.log(Natural.JaroWinklerDistance("Bangalore", "Bengaluru")); // returns 0.72: partial match
console.log(Natural.JaroWinklerDistance("Mumbai", "Bombay")); // returns 0.66: partial match
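The following standalone sketch shows one common formulation of the Jaro and Jaro-Winkler scores. It is not Natural's source code; the function names jaro and jaroWinkler, the prefix weight of 0.1 and the prefix cap of 4 characters are the usual textbook choices and are assumptions here.

```javascript
// Jaro similarity: based on matching characters (m) and transpositions (t).
function jaro(s1, s2) {
  if (s1 === s2) return 1;
  if (s1.length === 0 || s2.length === 0) return 0;
  // Characters match only if they are at most `range` positions apart.
  const range = Math.max(0, Math.floor(Math.max(s1.length, s2.length) / 2) - 1);
  const used1 = new Array(s1.length).fill(false);
  const used2 = new Array(s2.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s1.length; i++) {
    const from = Math.max(0, i - range);
    const to = Math.min(s2.length - 1, i + range);
    for (let j = from; j <= to; j++) {
      if (!used2[j] && s1[i] === s2[j]) {
        used1[i] = used2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that appear in a different order.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < s1.length; i++) {
    if (!used1[i]) continue;
    while (!used2[k]) k++;
    if (s1[i] !== s2[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2;
  return (matches / s1.length + matches / s2.length + (matches - t) / matches) / 3;
}

// Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 characters.
function jaroWinkler(s1, s2, prefixWeight = 0.1) {
  const base = jaro(s1, s2);
  let prefix = 0;
  while (prefix < 4 && prefix < s1.length && prefix < s2.length && s1[prefix] === s2[prefix]) {
    prefix++;
  }
  return base + prefix * prefixWeight * (1 - base);
}

console.log(jaroWinkler("Hello", "Hello")); // prints 1
console.log(jaroWinkler("Me", "You")); // prints 0
console.log(jaroWinkler("Bangalore", "Bengaluru").toFixed(3)); // prints 0.725
```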
26. String Distance - Levenshtein Distance
• Levenshtein Distance between two words is the minimum number of single-character edits
(insertions, deletions or substitutions) required to change one word into the other.
• Named after the Soviet mathematician Vladimir Levenshtein, who considered this distance
in 1965
• Also referred to as edit distance
// Using Levenshtein Distance algorithm
console.log(Natural.LevenshteinDistance("Hello", "Hello")); // 0
console.log(Natural.LevenshteinDistance("Bangalore", "Bengaluru")); // 3
console.log(Natural.LevenshteinDistance("Mumbai", "Bombay")); // 3
console.log(Natural.LevenshteinDistance("Chennai", "Madras")); // 6
console.log(Natural.LevenshteinDistance("Nuremberg", "Nuremburg")); // 1
Bangalore -> Bengaluru: 3 character changes
Nuremberg -> Nuremburg: 1 character change
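The edit distance described above is usually computed with dynamic programming. The sketch below is a standalone illustration (the function name editDistance is hypothetical and this is not Natural's implementation); it keeps only two rows of the table in memory.

```javascript
// Levenshtein distance with unit cost for insert, delete and substitute.
function editDistance(a, b) {
  // previousRow[j] = distance between the first i-1 chars of a and first j chars of b
  let previousRow = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const currentRow = [i]; // turning i characters into "" takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const substitution = a[i - 1] === b[j - 1] ? 0 : 1;
      currentRow[j] = Math.min(
        previousRow[j] + 1, // delete from a
        currentRow[j - 1] + 1, // insert into a
        previousRow[j - 1] + substitution // substitute or keep
      );
    }
    previousRow = currentRow;
  }
  return previousRow[b.length];
}

console.log(editDistance("Hello", "Hello")); // prints 0
console.log(editDistance("Bangalore", "Bengaluru")); // prints 3
console.log(editDistance("Nuremberg", "Nuremburg")); // prints 1
```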
27. tf-idf
• tf–idf or TFIDF is short for term frequency - inverse document frequency
• tf-idf determines how important a word (or words) is to a document relative to a corpus.
• Often used as weighting factor in searches of information retrieval, text mining & user modeling.
• The tf-idf value increases proportionally to the number of times a word appears in the
document and is offset by the frequency of the word in the corpus, which helps to adjust for
the fact that some words appear more frequently in general.
• tfidf method returns the measure of importance of a word
var tfidf = new Natural.TfIdf();
// Documents could be added to tf-idf. Here only a single doc is added, but more could be added
tfidf.addDocument("this document is about node. Its also about NLP. Node is used for it");
// Find out the tf-idf of different words in the document
console.log(tfidf.tfidf("node", 0)); // prints 0.61 as node appears multiple times in the doc
console.log(tfidf.tfidf("NLP", 0)); // prints 0.30 as NLP appears only single time
console.log(tfidf.tfidf("ruby", 0)); // prints 0 as ruby does not appear in the doc
console.log(tfidf.listTerms(0));
// prints
// [ { term: 'node', tfidf: 0.6137056388801094 },
// { term: 'document', tfidf: 0.3068528194400547 },
// { term: 'nlp', tfidf: 0.3068528194400547 },
// { term: 'used', tfidf: 0.3068528194400547 } ]
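To show where numbers like 0.61 and 0.30 can come from, here is a small standalone tf-idf sketch. It assumes tf is the raw count of the term in the document and idf = 1 + ln(N / (1 + df)), where N is the number of documents and df the number of documents containing the term. That idf variant is an assumption chosen because it reproduces the values printed above; other tf-idf definitions exist. The helper names are hypothetical, and unlike Natural this sketch does no stop-word removal.

```javascript
// Lowercase the text and split it into alphanumeric words.
function toWords(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// tf-idf of `term` in docs[docIndex], relative to the whole `docs` corpus.
function tfidfScore(term, docIndex, docs) {
  const wanted = term.toLowerCase();
  const tokenized = docs.map(toWords);
  const tf = tokenized[docIndex].filter((w) => w === wanted).length;
  const df = tokenized.filter((words) => words.includes(wanted)).length;
  const idf = 1 + Math.log(docs.length / (1 + df)); // assumed idf variant
  return tf * idf;
}

const corpus = ["this document is about node. Its also about NLP. Node is used for it"];
console.log(tfidfScore("node", 0, corpus).toFixed(2)); // prints 0.61
console.log(tfidfScore("NLP", 0, corpus).toFixed(2)); // prints 0.31
console.log(tfidfScore("ruby", 0, corpus)); // prints 0
```

With a single document, idf is 1 + ln(1/2), roughly 0.307, so a word that appears twice scores about 0.61 and a word that appears once about 0.31 (shown truncated as 0.30 above).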
28. tf-idf (cont’d)
• Files on disk can also be added to tf-idf
• Multiple documents can be added to tf-idf; together they form the corpus
// Adding a file from disk to its own tf-idf instance
var fileTfidf = new Natural.TfIdf();
fileTfidf.addFileSync("C:/Data/Profile.txt");
console.log(fileTfidf.listTerms(0));
// Multiple documents added to tf-idf form the entire corpus.
// A fresh instance is used so that the document indexes below are 0, 1 and 2.
var tfidf = new Natural.TfIdf();
tfidf.addDocument('this document is about node. Its also about NLP. Node is used for it');
tfidf.addDocument('this document is about ruby.');
tfidf.addDocument('this document is about ruby and node.');
console.log(tfidf.tfidf("node", 0)); // prints 2
console.log(tfidf.tfidf("NLP", 0)); // prints 1.40
console.log(tfidf.tfidf("ruby", 0)); // prints 0
console.log(tfidf.tfidf("node", 1)); // prints 0 as node does not appear in 2nd doc
console.log(tfidf.tfidf("ruby", 1)); // prints 1 as ruby appears in 2nd doc
console.log(tfidf.tfidf("node", 2)); // prints 1 as node appears in 3rd doc
console.log(tfidf.tfidf("ruby", 2)); // prints 1 as ruby appears in 3rd doc
29. tf-idf (cont’d)
• tfidfs method returns the measure of importance of a word in each of the documents
• tfidfs method accepts the word and a callback
// Multiple documents added to tf-idf form the entire corpus
var tfidf = new Natural.TfIdf();
tfidf.addDocument('this document is about node. Its also about NLP. Node is used for it');
tfidf.addDocument('this document is about ruby.');
tfidf.addDocument('this document is about ruby and node.');
// tfidfs method is used to find the importance of the word across multiple documents
tfidf.tfidfs('node', function(ctr, measure) {
  console.log('tf-idf of node in document #' + ctr + ' is ' + measure);
});
30. POS (Part of Speech) Tagging
• Process of marking up a word in a text (corpus) as corresponding to a particular part of
speech, based on both its definition and its context—i.e., its relationship with adjacent and
related words in a phrase, sentence, or paragraph.
• Also called grammatical tagging or word-category disambiguation.
31. POS (Part of Speech) Tagging
• Current state of the art POS tagging algorithms can predict the POS of a given word with
a high degree of precision (approximately 97%), but POS tagging is still an active
research area.
No Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
32. POS Tagging – Brill POS Tagger
• Natural supports POS tagging through Brill POS Tagger that implements Eric Brill's
transformational algorithm (transformation rules are specified in external files).
• E. Brill's tagger, one of the most widely used English POS taggers, employs rule-based algorithms.
// path module is needed to build the file locations below
var path = require("path");
// Path where natural library is located
var baseFolder = path.join(path.dirname(require.resolve("natural")), "brill_pos_tagger");
// Rules file located in /data/<language> sub folder under natural library
var rulesFilename = baseFolder + "/data/English/tr_from_posjs.txt";
// Lexicon file located in /data/<language> sub folder under natural library
var lexiconFilename = baseFolder + "/data/English/lexicon_from_posjs.json";
// Category assigned to words that are not found in the lexicon
var defaultCategory = 'N';
var lexicon = new Natural.Lexicon(lexiconFilename, defaultCategory);
var rules = new Natural.RuleSet(rulesFilename);
// Any tagger needs lexicon and rules for successful POS tagging of words
// Brill POS Tagger object is created passing the lexicon and rule set objects
var tagger = new Natural.BrillPOSTagger(lexicon, rules);
var sentence = "I see the man with the telescope";
var tokenizer = new Natural.WordTokenizer();
// tokenize the sentence to tokens
var tokens = tokenizer.tokenize(sentence);
console.log(tagger.tag(tokens));
// prints
// [ [ 'I', 'NN' ],
// [ 'see', 'VB' ],
// [ 'the', 'DT' ],
// [ 'man', 'NN' ],
// [ 'with', 'IN' ],
// [ 'the', 'DT' ],
// [ 'telescope', 'NN' ] ]
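The two ingredients mentioned above, a lexicon and transformation rules, can be illustrated with a toy tagger. Everything in this sketch is hypothetical (the tiny lexicon, the single rule and the function name tagTokens); it mimics the general idea of transformation-based tagging and is not the Brill tagger shipped with Natural.

```javascript
// Toy lexicon: each known word maps to its most likely tag.
const toyLexicon = { the: "DT", i: "PRP", can: "MD", see: "VB", rusts: "VBZ" };

// Toy transformation rule: a modal right after a determiner is really a noun
// ("the can"), so retag MD as NN when the previous tag is DT.
const toyRules = [{ from: "MD", to: "NN", previousTag: "DT" }];

function tagTokens(tokens, defaultTag = "NN") {
  // Step 1: initial tags from the lexicon; unknown words get the default tag.
  const tagged = tokens.map((token) => [token, toyLexicon[token.toLowerCase()] || defaultTag]);
  // Step 2: apply each contextual rule from left to right.
  for (const rule of toyRules) {
    for (let i = 1; i < tagged.length; i++) {
      if (tagged[i][1] === rule.from && tagged[i - 1][1] === rule.previousTag) {
        tagged[i][1] = rule.to;
      }
    }
  }
  return tagged;
}

console.log(tagTokens(["The", "can", "rusts"]));
// prints [ [ 'The', 'DT' ], [ 'can', 'NN' ], [ 'rusts', 'VBZ' ] ]
console.log(tagTokens(["I", "can", "see"]));
// prints [ [ 'I', 'PRP' ], [ 'can', 'MD' ], [ 'see', 'VB' ] ]
```

The same word "can" receives different tags depending on its neighbour, which is the kind of context-driven correction that transformation rules provide.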