Tricks in natural language processing
AI GEEKS
 Fetch from API - https://github.com/bear/python-twitter
 Crawl Web-sites - https://github.com/scrapy/scrapy
 Use Browser Hack - https://github.com/Jefferson-Henrique/GetOldTweets-python
 If the data is in CSV:
import pandas as pd
df = pd.read_csv(file_name, index_col=None, header=0,
                 usecols=["field_name1", "field_name2", ...])
 If the data is stored as JSON files:
def load_json(path_to_json):
    import os
    import json
    import pandas as pd
    list_files = [pos_json for pos_json in os.listdir(path_to_json)
                  if pos_json.endswith('.json')]
    # set up an empty dictionary keyed by file name
    resultdict = {}
    for f in list_files:
        with open(os.path.join(path_to_json, f), "r") as inputjson:
            resultdict[f] = json.load(inputjson)
    df = pd.DataFrame(resultdict)
    df2 = df.T  # transpose so that each file becomes one row
    return df2
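For example (a minimal usage sketch; the './data/' folder is a placeholder):
df2 = load_json('./data/')
print(df2.head())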
 Try converting the JSON or CSV data into a dict (see the sketch below).
 Why?
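One possible answer, sketched below (the file name 'data.csv' and the 'id' column are placeholders): a dict gives O(1) key-based lookup, whereas finding a row in a DataFrame or CSV means scanning or filtering.
import pandas as pd

df = pd.read_csv('data.csv', header=0)
records = df.to_dict(orient='records')  # list of dicts: one per row, keys are the headers
by_id = {str(row['id']): row for row in records}  # fast lookup of any row by its key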
Tokenization: breaking text into the tokens you want to feed into the NLP algorithm
from nltk.tokenize import RegexpTokenizer
#to separate words without punctuation
tokenizer = RegexpTokenizer(r'\w+')
#convert into lower case to avoid duplication of the same word
raw = text.lower()
tokens = tokenizer.tokenize(raw)
 Stop words are commonly occurring words which don't contribute to topic
modelling.
 the, and, or
 However, removing stop words can sometimes hurt topic modelling.
 For example, "Thor The Ragnarok" is a single topic, but if we apply the stop-word
mechanism, "The" will be removed.
Commonly occurring words which don't provide context
from stop_words import get_stop_words
from stop_words import LANGUAGE_MAPPING
from stop_words import AVAILABLE_LANGUAGES
 # create English stop words list
 english_stop_words = get_stop_words('en')
 # remove stop words from tokens
 stopped_tokens = [i for i in tokens if i not in english_stop_words]
 Why can't we load our own stop words in a list and filter out the tokens with stop
words? (See the sketch below.)
 Can we use the stop-words repository for other purposes?
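A minimal sketch of the custom-list idea (the extra words are illustrative):
my_stop_words = ['the', 'and', 'or', 'rt', 'via']  # extend with domain-specific noise
stopped_tokens = [t for t in tokens if t not in my_stop_words]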
 Manually add the Malay language to the stop-words corpus
 Build a language-detection mechanism using stop words, as sketched below
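A sketch of the detection mechanism, using only the imports shown earlier: score each available language by how many of the text's tokens appear in its stop-word list, and pick the best-scoring one.
from stop_words import get_stop_words, AVAILABLE_LANGUAGES

def detect_language(tokens):
    scores = {}
    for lang in AVAILABLE_LANGUAGES:
        stop_set = set(get_stop_words(lang))
        # score = how many of the text's tokens are stop words in this language
        scores[lang] = sum(1 for t in tokens if t in stop_set)
    return max(scores, key=scores.get)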
 A common NLP technique to reduce topically similar words to their root: e.g.,
"stemming," "stemmer," "stemmed" all have similar meanings; stemming reduces
those terms to "stem."
 Important for topic modeling, which would otherwise view those terms as separate
entities and reduce their importance in the model.
 It's a bunch of rules for reducing a word:
 sses -> es
 ies -> i
 ational -> ate
 tional -> tion
 s -> ∅
 when rules conflict, the longest rule wins
 Bad idea unless you customize it.
Arabic Stemming Process
Simple Stemming Process
 Code
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 'owned',
           'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer', 'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
 Output:
caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer
colon plot
 Code
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
print(stemmer.stem("having"))
print(stemmer2.stem("having"))
 Output:
have
having
 Find out whether the Snowball stemmer supports the Malay language. If not,
find out how you can implement your own.
 It goes one step further than stemming.
 It obtains grammatically correct words and distinguishes words by their word
sense with the use of a vocabulary (e.g., type can mean write or category).
 It is a much more difficult and expensive process than stemming.
 Code
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

class MyWordNet:
    def __init__(self, wn):
        self._wordnet = wn

run = MyWordNet(wn)
lemma = WordNetLemmatizer()
lem = map(lemma.lemmatize, stopped_tokens)
itemarr = []
for item in lem:
    itemarr.append(item)
print(itemarr)
 from nltk import FreqDist
 fdist = FreqDist(itemarr)
 fdist2= FreqDist(stopped_tokens)
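FreqDist works like a counter, so the most frequent terms can be compared before and after lemmatization, for example:
print(fdist.most_common(10))   # top 10 lemmatized tokens
print(fdist2.most_common(10))  # top 10 tokens before lemmatization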
 from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'All my cats in a row',
'When my cat sits down, she looks like a Furby toy!',
'The cat from outer space',
'Sunshine loves to sit like this for some reason.'
]
vectorizer = CountVectorizer()
print( vectorizer.fit_transform(corpus).todense() )
print( vectorizer.vocabulary_ )
https://pythonprogramminglanguage.com/bag-of-words/
 LDA
 LSI
 TFIDF
 Doc2Vec
 LDA stands for Latent Dirichlet Allocation.
 It models the distribution of words in each topic k (let's say 50 topics) together
with the probability of topic k occurring in each document d (let's say 5,000
documents).
 Mechanism: it uses a special kind of distribution called the Dirichlet
distribution, which is the multivariate generalization of the Beta probability
density function.
David Blei, Andrew Ng, Michael I. Jordan
Sentence 1: I spend the evening watching football.
Sentence 2: I ate nachos and guacamole.
Sentence 3: I spend the evening watching football while eating nachos and guacamole.
LDA might say something like:
Sentence 1 is 100% Topic 1
Sentence 2 is 100% Topic 2
Sentence 3 is 65% Topic 1, 35% Topic 2
But it also tells us that
Topic 1 is about football (50%) and evening (50%),
Topic 2 is about nachos (50%) and guacamole (50%)
https://ai.stanford.edu/~ang/papers/nips01-lda.pdf
 LDA works on a set of documents, so each document needs to be uniquely identified.
How can the tokenized and lemmatized documents be stored?
Options:
1. JSON
2. DICT
3. PANDAS DATAFRAME
4. CSV
 Remember, we loaded the data into a DataFrame
 Can we iterate over the DataFrame and store the rows in a dict?
 Code
local_dict = {}
for index, row in df2.iterrows():
    # preprocess the text by applying stop words and/or stemming/lemmatization,
    # producing itemarr as on the lemmatization slide
    local_dict[str(row['unique_identifier_of_record_in_Dataframe'])] = itemarr
 What are the benefits of storing the processed data in DICT instead of DF?
 Save the dictionary as a pickle for later use by various models. How to do it?
 Code:
from six.moves import cPickle as pickle
with open(dict_file_name, 'wb') as f:
    pickle.dump(local_dict, f)
Once saved, how do we reload it?
 Code
with open(dict_file_name, 'rb') as f:
    reload_dict = pickle.load(f)
Code:
# turn our tokenized documents into an id <-> term dictionary
import numpy as np
import gensim
from gensim import corpora, models

dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix:
# our dictionary must be converted into a bag-of-words
random_seed = 69
state = np.random.RandomState(random_seed)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('./corpus/' + product_name + '.corpus.mm', corpus)
# generate the LDA model
ldaModel = gensim.models.ldamodel.LdaModel(corpus,
                                           num_topics=n_topics,
                                           id2word=dictionary,
                                           random_state=state)
 Try printing the dictionary and the corpus to understand their structure (see the sketch below)
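One way to inspect them (a sketch; token2id, the id lookup, and the doc2bow output are standard gensim structures):
print(dictionary.token2id)  # mapping: word -> integer id
print(dictionary[0])        # reverse lookup: integer id -> word
print(corpus[0])            # first document as a list of (word_id, count) pairs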
#generate Matrix similarities to be used
from gensim.similarities import MatrixSimilarity
print(ldaModel[corpus])
index = MatrixSimilarity(ldaModel[corpus])
# generate the list of unique document identifiers
DocIdList = list(reload_dict.keys())
# we've already stored the list of words against each DocID in reload_dict,
# so we can generate a bag-of-words vector using only those words which are
# present in the given DocID
# let's suppose the DocID is 101; then reload_dict['101'] gives the list of
# processed tokens (the keys were stored as strings)
vec_bow = dictionary.doc2bow(reload_dict['101'])
# as in the previous example, we get the LDA vector of the bag of words of the
# given document
vec_lda = ldaModel[vec_bow]
# now, using the similarity index, we can find out which documents are similar
sims = index[vec_lda]
print(sims)
# however, it is not sorted
 Can you sort the similarity matrix?
 Can you find the top 5 document IDs which are similar and store them in a JSON
file? (See the sketch below.)
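A sketch following the hint in the editor's notes at the end of the deck; the output file name is a placeholder:
import json

sorted_sims = sorted(enumerate(sims), key=lambda item: -item[1])  # descending by score
# position 0 is usually the query document itself, so keep the next five
top5 = {DocIdList[idx]: float(score) for idx, score in sorted_sims[1:6]}
# float() because gensim scores are numpy float32, which json can't serialize
with open('top5_similar.json', 'w') as f:
    json.dump(top5, f)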
 based on the principle that words that are used in the same contexts tend to have
similar meanings
 identify patterns in the relationships between the terms and concepts contained in
an unstructured collection of text
 uses a mathematical technique called singular value decomposition
 https://en.wikipedia.org/wiki/Latent_semantic_analysis#Latent_semantic_indexing
https://en.wikipedia.org/wiki/Singular-value_decomposition
 Generate a model using gensim, similar to what was done for LDA
 Find similar documents using the LSI model, as sketched below
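A minimal sketch, reusing the dictionary, corpus, n_topics, and vec_bow built on the LDA slides:
from gensim import models
from gensim.similarities import MatrixSimilarity

lsiModel = models.LsiModel(corpus, id2word=dictionary, num_topics=n_topics)
lsi_index = MatrixSimilarity(lsiModel[corpus])
sims_lsi = lsi_index[lsiModel[vec_bow]]  # query exactly as with the LDA model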
TF-IDF is a numerical statistic that is intended to reflect how important a word is
to a document in a collection or corpus.
 Term frequency of t in document d = the number of times that term t occurs in
document d
 Term frequency adjusted for document length = the number of times that
term t occurs in document d / the number of words in d
 Inverse document frequency measures how much information the word provides,
that is, whether the term is common or rare across all documents:
IDF(t) = log( total number of documents / number of documents containing the term t )
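A quick worked example of the formula:
import math

idf_rare = math.log(1000 / 10)      # term in 10 of 1,000 docs: log(100) ≈ 4.61
idf_common = math.log(1000 / 1000)  # term in every doc: log(1) = 0, no information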
 Generate a TF-IDF model using gensim, similar to what was done for LDA
 Find similar documents using the TF-IDF model, as sketched below
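A minimal sketch using gensim's TfidfModel on the same corpus (num_features tells the index the vocabulary size):
from gensim import models
from gensim.similarities import MatrixSimilarity

tfidfModel = models.TfidfModel(corpus)  # learns IDF weights from the corpus
tfidf_index = MatrixSimilarity(tfidfModel[corpus], num_features=len(dictionary))
sims_tfidf = tfidf_index[tfidfModel[vec_bow]]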
https://arxiv.org/pdf/1301.3781.pdf
https://arxiv.org/pdf/1605.02019.pdf
The lda2vec model adds in skip-grams:
a word predicts another word in the same window,
as in word2vec, but there is also a context vector
which only changes at the document level, as in LDA.
 Source: https://github.com/TropComplique/lda2vec-pytorch
 Go to 20newsgroups/.
 Run get_windows.ipynb to prepare data.
 Run python train.py for training.
 Run explore_trained_model.ipynb.
 To use this on your data you need to edit get_windows.ipynb. Also there are
hyperparameters in 20newsgroups/train.py, utils/training.py, utils/lda2vec_loss.py.
Editor's notes

1. Hint: check out the structure of the JSON using json.loads. For CSV, it's better to store the data as a list of dictionaries, where each row is a dictionary and the keys are the headers. Search is faster because a dict looks values up by key.
2. Go to languages.json and add malay.txt against 'ms' as the language code. For language detection, check the text against the stop words of all available languages in languages.json, and use the language with the highest number of matching words.
3. Visit the Snowball website.
4. For sorting: sims = sorted(enumerate(sims), key=lambda item: -item[1]). For finding the top documents, try: for index, similarity in sims[1:30]: json.dumps(df[df['DocID'] == DocIdList[index]].to_dict(orient='index'))