HANDS ON:
TEXT MINING WITH R
Jahnab Kumar Deka
Introduction
• Goal: learn from collections of text documents such as books, newspapers and emails.
Important Terms:
• Tokenization (splitting text into words or other units)
• Tagging (noun/verb/…)
• Chunking (noun phrases)
• Stemming (stripping suffixes such as -ing/-s/-ed)
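As a minimal sketch of two of these terms, the snippet below tokenizes a sentence with base R and stems the tokens with wordStem() from SnowballC. The sentence and the naive split-on-non-letters rule are assumptions made for illustration; real tokenizers handle many more cases.

```r
library(SnowballC)  # provides wordStem()

# Toy sentence, made up for illustration.
sentence <- "Mining texts and finding patterns"

# Naive tokenization: lower-case, then split on runs of non-letters.
tokens <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Stemming strips suffixes such as -ing, -s and -ed.
stems <- wordStem(tokens, language = "english")
print(data.frame(token = tokens, stem = stems))
```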
Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palette of colours for plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.
Corpus
• A corpus is a collection of texts.
• A corpus usually holds separate articles, stories or volumes, each treated as a separate entity or record.
• Most file formats can be converted to plain text for use in a corpus, e.g.:
• PDF to text file (needs the pdftotext command-line tool)
• system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk \"$f\"; done")
• Word document to text file (needs the antiword command-line tool, which writes to standard output)
• system("for f in *.doc; do antiword \"$f\" > \"${f%.doc}.txt\"; done")
Corpus
• Assume the text files are in the folder corpus/txt
• List some of the file names
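The code for this slide did not survive in the text, so the following is only a sketch of how the folder could be listed with base R. The variable name cname matches the one used on the next slide; the relative path is an assumption.

```r
# Path to the folder of plain-text files (assumed to be relative to the working directory).
cname <- file.path(".", "corpus", "txt")

# dir() returns a character vector of file names (empty if the folder does not exist).
files <- dir(cname)
length(files)   # number of documents available
head(files)     # first few file names
```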
Loading Corpus
• DirSource() creates the source object, which is passed to Corpus() to load the documents.
• In case of PDF documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
** The xpdf application needs to be installed for readPDF()
• In case of Word documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC(AntiwordOptions="-r -s")))
** -r is an antiword option: include text removed by revisions
** -s is an antiword option: include text hidden by Word
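For plain-text files no special reader is needed. A minimal sketch, assuming two tiny files written to a temporary folder stand in for the real corpus/txt directory:

```r
library(tm)

# Assumption for this sketch: a temporary folder with two made-up text files
# replaces the real corpus/txt directory.
cname <- file.path(tempdir(), "corpus_txt")
dir.create(cname, showWarnings = FALSE)
writeLines("Text mining with R is fun.", file.path(cname, "doc1.txt"))
writeLines("R has many packages for mining text.", file.path(cname, "doc2.txt"))

# DirSource() describes where the files live; Corpus() reads them in.
docs <- Corpus(DirSource(cname))
length(docs)   # one corpus entry per file
```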
Exploration of Corpus
• inspect() displays the documents in a corpus
• Preparing the corpus
• Transformation types (listed by getTransformations())
• tm_map() applies one of these transformations to every document
• Other transformations can be written as ordinary R functions and wrapped in content_transformer()
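The calls named above can be tried on a small in-memory corpus. This is a sketch only; the two documents are made up and VectorSource() is used here instead of DirSource() so that the example is self-contained.

```r
library(tm)

# Toy corpus built from a character vector (made-up documents).
docs <- VCorpus(VectorSource(c("Text Mining with R!", "R has 100 packages.")))

inspect(docs[1])                 # summary of the first document
print(as.character(docs[[1]]))   # its raw text

# Names of the transformations that ship with tm and can be passed to tm_map().
print(getTransformations())
```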
Transformation Example
• replace “/”, “@” and “|” with a space
• Alternate method
• Conversion to lower case
• Remove Numbers
• Remove Punctuation
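The slide code for these steps is missing from the text. A sketch of how they are commonly written with tm follows; the helper name toSpace and the sample document are assumptions made for illustration.

```r
library(tm)

# Toy document, made up for illustration.
docs <- VCorpus(VectorSource("Mail me@example.org or see docs/R|Python: 3 Options!"))

# Custom transformer: replace every match of 'pattern' with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")   # "|" is a regex metacharacter, so escape it
# Alternate method: a single call with one combined pattern.
# docs <- tm_map(docs, toSpace, "/|@|\\|")

docs <- tm_map(docs, content_transformer(tolower))  # convert to lower case
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)

out <- as.character(docs[[1]])
print(out)
```

tolower() is a base R function rather than a tm transformation, which is why it is wrapped in content_transformer().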
Contd...
• Remove English Stop Words
• Remove your own stop words
• Strip Whitespace
• Specific Transformations
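A possible version of these four steps, again as a sketch: the custom stop words, the helper name replaceString and the phrase being replaced are all hypothetical.

```r
library(tm)

# Toy document with extra spaces, made up for illustration.
docs <- VCorpus(VectorSource(
  "the   department of   energy emailed   the australian national university"))

docs <- tm_map(docs, removeWords, stopwords("english"))         # English stop words
docs <- tm_map(docs, removeWords, c("department", "emailed"))   # own stop words (hypothetical)
docs <- tm_map(docs, stripWhitespace)                           # collapse runs of spaces

# Specific transformation: replace one string with another (hypothetical helper).
replaceString <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, replaceString, "australian national university", "ANU")

out <- as.character(docs[[1]])
print(out)
```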
Contd...
• Stemming
• Creating a Document Term Matrix
A matrix with documents as the rows, terms as the columns, and word counts as the cell values.
• Term frequency
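Stemming, the document term matrix and the term frequencies can be sketched on a three-document toy corpus (the documents are made up; the variable names dtm and freq match those used on the following slides):

```r
library(tm)
library(SnowballC)  # stemDocument() relies on the Snowball stemmers

# Toy corpus, made up for illustration.
docs <- VCorpus(VectorSource(c("mining text data", "mining data", "text analysis")))

docs <- tm_map(docs, stemDocument)   # reduce words to their stems

dtm <- DocumentTermMatrix(docs)      # rows = documents, columns = terms
print(dim(dtm))

# Term frequency: total count of each term across all documents.
freq <- colSums(as.matrix(dtm))
print(freq)
```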
Contd...
• Order terms by frequency
• ord <- order(freq)
• Least frequent terms
• freq[head(ord)]
• Most frequent terms
• freq[tail(ord)]
• Document Term matrix to CSV
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")
Contd...
• Removing sparse terms
• dtms <- removeSparseTerms(dtm, 0.1) # 0.1 = maximum allowed sparsity
• The resulting matrix keeps only terms whose sparsity (the share of documents in which the term does not occur) is below that value.
• Frequent terms and associations
** findFreqTerms() with lowfreq=1000 returns the terms that occur at least 1000 times
• Association of a word with other words, subject to a correlation limit
• findAssocs() lists, for example, the terms associated with “data”
• If two words always appear together, the correlation is 1.0
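A small worked example of the three functions, as a sketch. The four documents and the thresholds (0.6, 2 and 0.5) are chosen for this toy corpus and are not the values from the slides.

```r
library(tm)

# Toy corpus, made up for illustration.
docs <- VCorpus(VectorSource(c("data mining", "data mining tools",
                               "data science", "text tools")))
dtm <- DocumentTermMatrix(docs)

# Keep terms that are absent from fewer than 60% of the documents,
# i.e. terms that occur in at least 2 of the 4 documents.
dtms <- removeSparseTerms(dtm, 0.6)
print(Terms(dtms))

# Terms that occur at least twice in the whole corpus.
frequent <- findFreqTerms(dtm, lowfreq = 2)
print(frequent)

# Terms whose correlation with "data" is at least 0.5.
print(findAssocs(dtm, "data", corlimit = 0.5))
```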
Correlation
• Plot 50 of the more frequent words (those occurring at least 100 times)
• with a minimum correlation of 0.5
• By default the plot uses
• 20 random terms
• with a minimum correlation of 0.7
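The plotting call itself is not in the text. A sketch under assumptions: tm provides a plot method for term matrices that needs the Rgraphviz package, so the call is guarded, and the thresholds are scaled down to fit a made-up four-document corpus.

```r
library(tm)

# Toy corpus, made up for illustration.
docs <- VCorpus(VectorSource(c("data mining", "data mining tools",
                               "data science", "text tools")))
dtm <- DocumentTermMatrix(docs)

# Up to 50 of the more frequent terms (here: terms occurring at least twice).
terms <- head(findFreqTerms(dtm, lowfreq = 2), 50)

# The plot method draws an edge between terms whose correlation reaches corThreshold.
# It depends on Rgraphviz, so skip the plot when that package is missing.
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  plot(dtm, terms = terms, corThreshold = 0.5)
}
```

The terms are passed explicitly because the default of 20 random terms only works when the matrix has at least 20 terms.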
Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• # plot the words that occur at least 500 times in the corpus
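The plotting code is missing from the text. One way to draw the bar chart with ggplot2 is sketched below; the counts are made up so that the example runs without a corpus.

```r
library(ggplot2)

# Made-up term counts standing in for colSums(as.matrix(dtm)).
freq <- sort(c(data = 900, mining = 650, text = 520, corpus = 300), decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)

# Keep only the words that occur at least 500 times.
wf500 <- subset(wf, freq >= 500)

p <- ggplot(wf500, aes(x = word, y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # slanted labels stay readable
print(p)
```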
Word cloud
• wordcloud() comes from the wordcloud package, which must be loaded with library(wordcloud)
Size of Word & Frequency
• Limit the number of words
• wordcloud(names(freq), freq, max.words=100)
• Limit by term frequency
• wordcloud(names(freq), freq, min.freq=100)
• Adding colour
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter than 20 characters
• Generate frequencies and percentages
Contd...
• Word length counts
** vertical line = mean word length
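The slide code is not in the text. The sketch below reproduces the same steps with base R and ggplot2 instead of qdap, on a made-up list of terms; with a real matrix the terms would come from colnames(as.matrix(dtm)).

```r
library(ggplot2)

# Made-up terms; the last one is 25 characters long and gets filtered out.
terms <- c("data", "mining", "text", "analysis", "corpus", "r",
           paste(rep("x", 25), collapse = ""))

# Retain terms shorter than 20 characters.
words <- terms[nchar(terms) < 20]
lens <- nchar(words)

# Frequencies and percentages of each word length.
tab <- table(lens)
dist <- data.frame(length  = as.integer(names(tab)),
                   freq    = as.vector(tab),
                   percent = round(100 * as.vector(tab) / sum(tab), 2))
print(dist)

# Histogram of word lengths; the vertical line marks the mean length.
p <- ggplot(data.frame(nletters = lens), aes(x = nletters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean(lens), colour = "green")
print(p)
```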
Letter and Position Heatmap
