SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Text Analytics with
and
(w/ examples from Tobacco Control)
@BenHealey
The Process
Look intenselyFrequencies
Classification
Bright Idea Gather Clean Standardise
De-dup and select
http://scrapy.org
Spiders  Items  Pipelines
- readLines, XML / Rcurl / scrapeR packages
- tm package (factiva plugin), twitteR
- Beautiful Soup
- Pandas (eg, financial data)
http://blog.siliconstraits.vn/building-web-crawler-scrapy/
• Translating text to consistent form
– Scrapy returns unicode strings
– Māori  Maori
• SWAPSET =
[[ u"Ā", "A"], [ u"ā", "a"], [ u"ä", "a"]]
• translation_table =
dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
• cleaned_content =
html_content.translate(translation_table)
– Or…
• test=u’Māori’ (you already have unicode)
• Unidecode(test) (returns ‘Maori’)
• Dealing with non-Unicode
– http://nedbatchelder.com/text/unipain.html
– Some scraped html will be in latin1 (mismatch UTF8)
– Have your datastore default to UTF-8
– Learn to love whack-a-mole
• Dealing with too many spaces:
– newstring = ' '.join(mystring.split())
– Or… use re
• Don’t forget the metadata!
– Define a common data structure early if you have
multiple sources
Text Standardisation
• Stopwords
– "a, about, above, across, ... yourself, yourselves, you've, z”
• Stemmers
– "some sample stemmed words"  "some sampl stem word“
• Tokenisers (eg, for bigrams)
– BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
– tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
– ‘and said’, ‘and security’
Natural Language Toolkittm package
Text Standardisation
libs = c("RODBC", "RWeka“, "Snowball","wordcloud", "tm" ,"topicmodels")
…
cleanCorpus = function(corpus) {
corpus.tmp = tm_map(corpus, tolower) # ??? Not sure.
corpus.tmp = tm_map(corpus.tmp, removePunctuation)
corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
return(corpus.tmp)
}
posts.corpus = cleanCorpus(posts.corpus)
posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
Text Standardisation
• Using dictionaries for stem completion
politi.tdm <- TermDocumentMatrix(politi.corpus)
politi.tdm = removeSparseTerms(politi.tdm, 0.99)
politi.tdm = as.matrix(politi.tdm)
# get word counts in decreasing order, put these into a plain text doc.
word_freqs = sort(rowSums(politi.tdm), decreasing=TRUE)
length(word_freqs)
smalldict = PlainTextDocument(names(word_freqs))
politi.corpus_final = tm_map(politi.corpus_stemmed,
stemCompletion, dictionary=smalldict, type="first")
Deduplication
• Python sets
– shingles1 = set(get_shingles(record1['standardised_content']))
• Shingling and Jaccard similarity
– (a,rose,is,a,rose,is,a,rose)
– {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
• {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
–
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf  a free text
http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
Frequency Analysis
• Document-Term Matrix
– politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed,
control = list(wordLengths=c(4,Inf)))
• Frequent and co-occurring terms
– findFreqTerms(politi.dtm, 5000)
[1] "2011" "also" "announc" "area" "around"
[6] "auckland" "better" "bill" "build" "busi"
– findAssocs(politi.dtm, "smoke", 0.5)
smoke tobacco quit smokefre smoker 2025 cigarett
1.00 0.74 0.68 0.62 0.62 0.58 0.57
Mentions of the 2025 goal
Mentions of the 2025 goal
Top 100 terms: Tariana Turia
Note: Documents from Aug 2011 – July 2012 Wordcloud package
Top 100 terms: Tony Ryall
Note: Documents from Aug 2011 – July 2012
• Exploration and feature extraction
– Metadata gathered at time of collection (eg, Scrapy)
– RODBC or MySQLdb with plain ol’ SQL
– Native or package functions for length of strings, sna, etc.
• Unsupervised
– nltk.cluster
– tm, topicmodels, as.matrix(dtm)  kmeans, etc.
• Supervised
– First hurdle: Training set 
– nltk.classify
– tm, e1071, others…
Classification
2 posts or fewer more than 750 posts
846 1,157 23 45,499
41.0% 1.3% 1.1% 50.1%
Cohort: New users (posters) in Q1 2012
• LDA (topicmodels)
– New users
– Highly active users
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
good smoke just smoke feel
day time day quit day
thank week get can dont
well patch realli one like
will start think will still
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
quit good day like feel
smoke one well day thing
can take great your just
will stay done now get
luck strong awesom get time
• LDA (topicmodels)
– Highly active users (HAU)
– HAU1 (F, 38, PI)
– HAU2 (F, 33, NZE)
– HAU3 (M, 48, NZE)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
quit good day like feel
smoke one well day thing
can take great your just
will stay done now get
luck strong awesom get time
18% 14% 40% 8% 20%
31% 21% 27% 6% 16%
16% 9% 21% 49% 5%
Recap
• Your text will probably be messy
– Python, R-based tools reduce the pain
• Simple analyses can generate useful insight
• Combine with data of other types for context
– source, quantities, dates, network position, history
• May surface useful features for classification
Slides, Code: message2ben@gmail.com

Weitere ähnliche Inhalte

Was ist angesagt?

Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Johan Blomme
 
Text Mining Using R
Text Mining Using RText Mining Using R
Text Mining Using RKnoldus Inc.
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupDan Sullivan, Ph.D.
 
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...Shuyo Nakatani
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevDatabricks
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionDeeksha thakur
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...Shuyo Nakatani
 
Classification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF MetricClassification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF MetricMarie Vans
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information RetrievalShadi Saleh
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slideMohd Iqbal Al-farabi
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
DCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and HyperlinkingDCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and Hyperlinkingmultimediaeval
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubMehwish Alam
 
Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.Mehwish Alam
 
Slides
SlidesSlides
Slidesbutest
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 

Was ist angesagt? (20)

Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1
 
Text Mining Using R
Text Mining Using RText Mining Using R
Text Mining Using R
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniques
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
 
Classification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF MetricClassification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF Metric
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information Retrieval
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slide
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 
07 04-06
07 04-0607 04-06
07 04-06
 
DCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and HyperlinkingDCU Search Runs at MediaEval 2014 Search and Hyperlinking
DCU Search Runs at MediaEval 2014 Search and Hyperlinking
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
 
Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.
 
Slides
SlidesSlides
Slides
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 

Andere mochten auch

Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesMatt Harrison
 
Textkernel - Semantic Recruiting Technology
Textkernel - Semantic Recruiting TechnologyTextkernel - Semantic Recruiting Technology
Textkernel - Semantic Recruiting TechnologyTextkernel
 
Making the invisible visible through SNA
Making the invisible visible through SNAMaking the invisible visible through SNA
Making the invisible visible through SNAMYRA School of Business
 
2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media sna2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media snaMarc Smith
 
Social Network Analysis for Competitive Intelligence
Social Network Analysis for Competitive IntelligenceSocial Network Analysis for Competitive Intelligence
Social Network Analysis for Competitive IntelligenceAugust Jackson
 
Chapter 9 the progressive era
Chapter 9 the progressive eraChapter 9 the progressive era
Chapter 9 the progressive eradrs412
 
Space decoder-v3 preassessment game
Space decoder-v3 preassessment gameSpace decoder-v3 preassessment game
Space decoder-v3 preassessment gamedrs412
 
The revolution begins
The revolution beginsThe revolution begins
The revolution beginsdrs412
 
Договор для юридических лиц
Договор для юридических лицДоговор для юридических лиц
Договор для юридических лицbiletprofi
 
Presentation at D 3170 Pre-Pets
Presentation at D 3170 Pre-PetsPresentation at D 3170 Pre-Pets
Presentation at D 3170 Pre-PetsPrakash Saraswat
 
Timelin present day timeline ppt dr. carr
Timelin present day timeline ppt  dr. carrTimelin present day timeline ppt  dr. carr
Timelin present day timeline ppt dr. carrdrs412
 
Matching Grants - A tool to strengthen fellowship &amp; International Goodwill
Matching Grants - A tool to strengthen fellowship &amp; International GoodwillMatching Grants - A tool to strengthen fellowship &amp; International Goodwill
Matching Grants - A tool to strengthen fellowship &amp; International GoodwillPrakash Saraswat
 
Declaring independence
Declaring independenceDeclaring independence
Declaring independencedrs412
 
New south
New southNew south
New southdrs412
 

Andere mochten auch (20)

Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
 
Python for R users
Python for R usersPython for R users
Python for R users
 
Textkernel - Semantic Recruiting Technology
Textkernel - Semantic Recruiting TechnologyTextkernel - Semantic Recruiting Technology
Textkernel - Semantic Recruiting Technology
 
Social Networks at Scale
Social Networks at ScaleSocial Networks at Scale
Social Networks at Scale
 
Making the invisible visible through SNA
Making the invisible visible through SNAMaking the invisible visible through SNA
Making the invisible visible through SNA
 
2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media sna2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media sna
 
Social Network Analysis for Competitive Intelligence
Social Network Analysis for Competitive IntelligenceSocial Network Analysis for Competitive Intelligence
Social Network Analysis for Competitive Intelligence
 
Chapter 9 the progressive era
Chapter 9 the progressive eraChapter 9 the progressive era
Chapter 9 the progressive era
 
Space decoder-v3 preassessment game
Space decoder-v3 preassessment gameSpace decoder-v3 preassessment game
Space decoder-v3 preassessment game
 
The revolution begins
The revolution beginsThe revolution begins
The revolution begins
 
RC Vasco da gama - TRF
RC Vasco da gama - TRFRC Vasco da gama - TRF
RC Vasco da gama - TRF
 
Договор для юридических лиц
Договор для юридических лицДоговор для юридических лиц
Договор для юридических лиц
 
Presentation at D 3170 Pre-Pets
Presentation at D 3170 Pre-PetsPresentation at D 3170 Pre-Pets
Presentation at D 3170 Pre-Pets
 
Timelin present day timeline ppt dr. carr
Timelin present day timeline ppt  dr. carrTimelin present day timeline ppt  dr. carr
Timelin present day timeline ppt dr. carr
 
Matching Grants - A tool to strengthen fellowship &amp; International Goodwill
Matching Grants - A tool to strengthen fellowship &amp; International GoodwillMatching Grants - A tool to strengthen fellowship &amp; International Goodwill
Matching Grants - A tool to strengthen fellowship &amp; International Goodwill
 
Declaring independence
Declaring independenceDeclaring independence
Declaring independence
 
Prac 15
Prac 15Prac 15
Prac 15
 
New south
New southNew south
New south
 

Ähnlich wie Text analytics in Python and R with examples from Tobacco Control

Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon UniversityText Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon UniversityNodejsFoundation
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining Rupak Roy
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019
Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019
Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019Mail.ru Group
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talkrtelmore
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimizationg3_nittala
 
Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns
Fast Exact String Pattern-Matching Algorithm for Fixed Length PatternsFast Exact String Pattern-Matching Algorithm for Fixed Length Patterns
Fast Exact String Pattern-Matching Algorithm for Fixed Length Patternskvaderlipa
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachReza Rahimi
 
Textrank algorithm
Textrank algorithmTextrank algorithm
Textrank algorithmAndrew Koo
 
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiTajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiData Con LA
 
functional groovy
functional groovyfunctional groovy
functional groovyPaul King
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLPRupak Roy
 
The Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJS
The Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJSThe Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJS
The Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJSmfyleman
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)Erik Hatcher
 
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...ICS User Group
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programaciónSoftware Guru
 

Ähnlich wie Text analytics in Python and R with examples from Tobacco Control (20)

Prolog 7-Languages
Prolog 7-LanguagesProlog 7-Languages
Prolog 7-Languages
 
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon UniversityText Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019
Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019
Оформление пайплайна в NLP проекте​, Виталий Радченко. 22 июня, 2019
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns
Fast Exact String Pattern-Matching Algorithm for Fixed Length PatternsFast Exact String Pattern-Matching Algorithm for Fixed Length Patterns
Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
 
Textrank algorithm
Textrank algorithmTextrank algorithm
Textrank algorithm
 
DA_02_algorithms.pptx
DA_02_algorithms.pptxDA_02_algorithms.pptx
DA_02_algorithms.pptx
 
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choiTajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi
 
functional groovy
functional groovyfunctional groovy
functional groovy
 
Meow Hagedorn
Meow HagedornMeow Hagedorn
Meow Hagedorn
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
The Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJS
The Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJSThe Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJS
The Road To Damascus - A Conversion Experience: LotusScript and @Formula to SSJS
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 

Kürzlich hochgeladen

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 

Kürzlich hochgeladen (20)

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 

Text analytics in Python and R with examples from Tobacco Control

  • 1. Text Analytics with and (w/ examples from Tobacco Control) @BenHealey
  • 2. The Process Look intenselyFrequencies Classification Bright Idea Gather Clean Standardise De-dup and select
  • 3. http://scrapy.org Spiders  Items  Pipelines - readLines, XML / Rcurl / scrapeR packages - tm package (factiva plugin), twitteR - Beautiful Soup - Pandas (eg, financial data) http://blog.siliconstraits.vn/building-web-crawler-scrapy/
  • 4.
  • 5.
  • 6.
  • 7. • Translating text to consistent form – Scrapy returns unicode strings – Māori  Maori • SWAPSET = [[ u"Ā", "A"], [ u"ā", "a"], [ u"ä", "a"]] • translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET]) • cleaned_content = html_content.translate(translation_table) – Or… • test=u’Māori’ (you already have unicode) • Unidecode(test) (returns ‘Maori’)
  • 8. • Dealing with non-Unicode – http://nedbatchelder.com/text/unipain.html – Some scraped html will be in latin1 (mismatch UTF8) – Have your datastore default to UTF-8 – Learn to love whack-a-mole • Dealing with too many spaces: – newstring = ' '.join(mystring.split()) – Or… use re • Don’t forget the metadata! – Define a common data structure early if you have multiple sources
  • 9. Text Standardisation • Stopwords – "a, about, above, across, ... yourself, yourselves, you've, z” • Stemmers – "some sample stemmed words"  "some sampl stem word“ • Tokenisers (eg, for bigrams) – BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) – tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer)) – ‘and said’, ‘and security’ Natural Language Toolkittm package
  • 10. Text Standardisation libs = c("RODBC", "RWeka“, "Snowball","wordcloud", "tm" ,"topicmodels") … cleanCorpus = function(corpus) { corpus.tmp = tm_map(corpus, tolower) # ??? Not sure. corpus.tmp = tm_map(corpus.tmp, removePunctuation) corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english")) corpus.tmp = tm_map(corpus.tmp, stripWhitespace) return(corpus.tmp) } posts.corpus = cleanCorpus(posts.corpus) posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
  • 11. Text Standardisation • Using dictionaries for stem completion politi.tdm <- TermDocumentMatrix(politi.corpus) politi.tdm = removeSparseTerms(politi.tdm, 0.99) politi.tdm = as.matrix(politi.tdm) # get word counts in decreasing order, put these into a plain text doc. word_freqs = sort(rowSums(politi.tdm), decreasing=TRUE) length(word_freqs) smalldict = PlainTextDocument(names(word_freqs)) politi.corpus_final = tm_map(politi.corpus_stemmed, stemCompletion, dictionary=smalldict, type="first")
  • 12. Deduplication • Python sets – shingles1 = set(get_shingles(record1['standardised_content'])) • Shingling and Jaccard similarity – (a,rose,is,a,rose,is,a,rose) – {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)} • {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)} – http://infolab.stanford.edu/~ullman/mmds/ch3.pdf  a free text http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
  • 13. Frequency Analysis • Document-Term Matrix – politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed, control = list(wordLengths=c(4,Inf))) • Frequent and co-occurring terms – findFreqTerms(politi.dtm, 5000) [1] "2011" "also" "announc" "area" "around" [6] "auckland" "better" "bill" "build" "busi" – findAssocs(politi.dtm, "smoke", 0.5) smoke tobacco quit smokefre smoker 2025 cigarett 1.00 0.74 0.68 0.62 0.62 0.58 0.57
  • 14.
  • 15. Mentions of the 2025 goal
  • 16. Mentions of the 2025 goal
  • 17. Top 100 terms: Tariana Turia Note: Documents from Aug 2011 – July 2012 Wordcloud package
  • 18. Top 100 terms: Tony Ryall Note: Documents from Aug 2011 – July 2012
  • 19. • Exploration and feature extraction – Metadata gathered at time of collection (eg, Scrapy) – RODBC or MySQLdb with plain ol’ SQL – Native or package functions for length of strings, sna, etc. • Unsupervised – nltk.cluster – tm, topicmodels, as.matrix(dtm)  kmeans, etc. • Supervised – First hurdle: Training set  – nltk.classify – tm, e1071, others… Classification
  • 20. 2 posts or fewer more than 750 posts 846 1,157 23 45,499 41.0% 1.3% 1.1% 50.1%
  • 21. Cohort: New users (posters) in Q1 2012
  • 22. • LDA (topicmodels) – New users – Highly active users Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 good smoke just smoke feel day time day quit day thank week get can dont well patch realli one like will start think will still Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 quit good day like feel smoke one well day thing can take great your just will stay done now get luck strong awesom get time
  • 23. • LDA (topicmodels) – Highly active users (HAU) – HAU1 (F, 38, PI) – HAU2 (F, 33, NZE) – HAU3 (M, 48, NZE) Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 quit good day like feel smoke one well day thing can take great your just will stay done now get luck strong awesom get time 18% 14% 40% 8% 20% 31% 21% 27% 6% 16% 16% 9% 21% 49% 5%
  • 24. Recap • Your text will probably be messy – Python, R-based tools reduce the pain • Simple analyses can generate useful insight • Combine with data of other types for context – source, quantities, dates, network position, history • May surface useful features for classification Slides, Code: message2ben@gmail.com

Hinweis der Redaktion

  1. Gather stage.
  2. Gather stage.
  3. Clean stage
  4. Clean stage
  5. Clean stage
  6. Standardise stage
  7. Standardise stage
  8. Standardise stage0.99 is generous. Lower would remove more terms.A term-document matrix where those terms from x are removed which have at least asparse percentage of empty (i.e., terms occurring 0 times in a document) elements. I.e., the resulting matrix contains only terms with a sparse factor of less than sparse.TermDocumentMatrix (terms along side (rows), docs along top (columns))
  9. Dedup and select stage
  10. Analysis stage
  11. Analysis stage
  12. Analysis stage
  13. Analysis stage
  14. Analysis stage
  15. Analysis stage
  16. Analysis stageDragonfly talk by Marcus Frean on LatentDirichletAllocation
  17. Analysis stage (exploratory)
  18. Analysis stage (Exploratory)
  19. Analysis stage (Unsupervised classification)
  20. Analysis stage (Unsupervised classification)