Statistical text mining using R
Tom Liptrot
The Christie Hospital
Motivation
Example 1: Dickens to matrix
Example 2: Electronic patient records
Dickens to Matrix: a bag of words
IT WAS the best of times, it was the worst of times,
it was the age of wisdom, it was the age of
foolishness, it was the epoch of belief, it was the
epoch of incredulity, it was the season of Light, it
was the season of Darkness, it was the spring of
hope, it was the winter of despair, we had
everything before us, we had nothing before us, we
were all going direct to Heaven, we were all going
direct the other way- in short, the period was so far
like the present period, that some of its noisiest
authorities insisted on its being received, for good or
for evil, in the superlative degree of comparison
only.
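The bag-of-words idea can be seen in base R before bringing in any packages. A minimal sketch, assuming the passage above is stored in a character vector txt:
words <- tolower(unlist(strsplit(txt, "[^[:alpha:]]+"))) # split on anything that is not a letter
words <- words[words != ""] # drop empty tokens
head(sort(table(words), decreasing = TRUE)) # most frequent words first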
Dickens to Matrix: a matrix
(rows = words, columns = documents)
# Example matrix syntax
A <- matrix(c(1, rep(0, 6), 2), nrow = 4) # dense: every cell stored, zeros included
library(slam)
S <- simple_triplet_matrix(c(1, 4), c(1, 2), c(1, 2)) # sparse triplets: (row, col, value)
library(Matrix)
M <- sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2)) # sparse column-compressed dgCMatrix
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$
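All three objects above encode the same 4 x 2 matrix (a 1 in cell (1,1), a 2 in cell (4,2), zeros elsewhere); only the storage differs. A quick check of that equivalence:
all(A == as.matrix(M)) # TRUE: dense and dgCMatrix versions hold identical values
all(A == as.matrix(S)) # TRUE: slam converts back to a dense matrix too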
Dickens to Matrix: tm package
library(tm) #load the tm package
corpus_1 <- Corpus(VectorSource(txt)) # creates a 'corpus' from a character vector
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
it was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was
the epoch of incredulity, it was the season of light, it was the season of
darkness, it was the spring of hope, it was the winter of despair, we
had everything before us, we had nothing before us, we were all going
direct to heaven, we were all going direct the other way- in short, the
period was so far like the present period, that some of its noisiest
authorities insisted on its being received, for good or for evil, in the
superlative degree of comparison only.
Dickens to Matrix: stopwords
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best times, worst times, age wisdom, age foolishness, epoch
belief, epoch incredulity, season light, season darkness,
spring hope, winter despair, everything us, nothing us, going
direct heaven, going direct way- short, period far like present
period, noisiest authorities insisted received, good evil,
superlative degree comparison .
Dickens to Matrix: punctuation
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best times worst times age wisdom age foolishness epoch
belief epoch incredulity season light season darkness spring
hope winter despair everything us nothing us going direct
heaven going direct way short period far like present period
noisiest authorities insisted received good evil superlative degree
comparison
Dickens to Matrix: stemming
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best time worst time age wisdom age foolish epoch
belief epoch incredul season light season dark spring hope
winter despair everyth us noth us go direct heaven go direct
way short period far like present period noisiest author insist
receiv good evil superl degre comparison
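stemDocument uses the Porter stemmer from the SnowballC package, so the truncations above can be reproduced directly. A quick check, assuming SnowballC is installed:
library(SnowballC)
wordStem(c("times", "foolishness", "incredulity", "everything"), language = "english")
# "time" "foolish" "incredul" "everyth"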
Dickens to Matrix: cleanup
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best time worst time age wisdom age foolish epoch belief epoch
incredul season light season dark spring hope winter despair everyth
us noth us go direct heaven go direct way short period far like present
period noisiest author insist receiv good evil superl degre comparison
Dickens to Matrix: Term Document Matrix
tdm <- TermDocumentMatrix(corpus_1)
<<TermDocumentMatrix (terms: 35, documents: 1)>>
Non-/sparse entries: 35/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
class(tdm)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
dim(tdm)
[1] 35 1
age 2 epoch 2 insist 1 short 1
author 1 everyth 1 light 1 spring 1
belief 1 evil 1 like 1 superl 1
best 1 far 1 noisiest 1 time 2
comparison 1 foolish 1 noth 1 way 1
dark 1 good 1 period 2 winter 1
degre 1 heaven 1 present 1 wisdom 1
despair 1 hope 1 receiv 1 worst 1
direct 2 incredul 1 season 2
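Once the term-document matrix exists, tm's helpers can query it. For example, terms occurring at least twice match the counts above:
findFreqTerms(tdm, 2)
# "age" "direct" "epoch" "period" "season" "time"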
Dickens to Matrix: Ngrams
library(RWeka)
four_gram_tokeniser <- function(x) {
NGramTokenizer(x, Weka_control(min = 1, max = 4)) # all 1- to 4-grams
}
tdm_4gram <- TermDocumentMatrix(corpus_1,
control = list(tokenize = four_gram_tokeniser))
dim(tdm_4gram)
[1] 163 1
age 2 author insist receiv good 1 dark 1
age foolish 1 belief 1 dark spring 1
age foolish epoch 1 belief epoch 1 dark spring hope 1
age foolish epoch belief 1 belief epoch incredul 1 dark spring hope winter 1
age wisdom 1 belief epoch incredul season 1 degre 1
age wisdom age 1 best 1 degre comparison 1
age wisdom age foolish 1 best time 1 despair 1
author 1 best time worst 1 despair everyth 1
author insist 1 best time worst time 1 despair everyth us 1
author insist receiv 1 comparison 1 despair everyth us noth 1
Electronic patient records: Gathering structured medical data
Doctor enters structured data directly
Trained staff extract structured data from typed notes
Electronic patient records: example text
Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0
History: X year old lady who presented with progressive dysphagia since
X and was known at X Hospital. She underwent an endoscopy which
found a tumour which was biopsied and is a squamous cell carcinoma.
A staging CT scan picked up a left upper lobe nodule. She then went on
to have an EUS at X this was performed by Dr X and showed an early
T3 tumour at 35-40cm of 4 small 4-6mm para-oesophageal nodes,
between 35-40cm. There was a further 7.8mm node in the AP window at
27cm, the carina was measured at 28cm and aortic arch at 24cm, the
conclusion T3 N2 M0. A subsequent PET CT scan was arranged-see
below. She can manage a soft diet such as Weetabix, soft toast, mashed
potato and gets occasional food stuck. Has lost half a stone in weight
and is supplementing with 3 Fresubin supplements per day.
Performance score is 1.
Electronic patient records: targets
Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0
(The same example note as on the previous slide, repeated with the structured target fields, e.g. the diagnosis line and the performance score, highlighted in the original deck.)
Electronic patient records: steps
1. Identify patients where we have both structured data and notes (c.20k)
2. Extract notes and structured data from SQL database
3. Make term document matrix (as shown previously) (60m x 20k)
4. Split data into training and development set (a sketch follows this list)
5. Train classification model using training set
6. Assess performance and tune model using development set
7. Evaluate system performance on independent dataset
8. Use system to extract structured data where we have none
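A minimal sketch of the split in step 4, assuming the documents are rows of tdm and the labels live in a vector disease_site (the 80/20 proportion here is illustrative):
set.seed(1) # reproducible split
n <- nrow(tdm) # one row per document
train <- sample(n, size = round(0.8 * n))
x_train <- tdm[train, ]; y_train <- disease_site[train]
x_dev <- tdm[-train, ]; y_dev <- disease_site[-train]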
Electronic patient records: predicting disease site using the elastic net
# fits an elastic net model, classifying notes into oesophagus or not,
# selecting lambda through cross-validation
library(glmnet)
dim(tdm) # 22,843 documents, 677,017 Ngrams
# note: tdm must be either a matrix or a sparse dgCMatrix,
# NOT a simple_triplet_matrix (a conversion sketch follows below)
mod_oeso <- cv.glmnet(x = tdm,
y = disease_site == 'Oesophagus',
family = "binomial")
$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \lVert y - X\beta \rVert^2 + \lambda_2 \lVert \beta \rVert^2 + \lambda_1 \lVert \beta \rVert_1$$
(OLS + ridge + lasso)
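One way to do the conversion the code comment above asks for, using the triplet slots of the tm object (a sketch; assumes documents are rows, as in the dim() output):
library(Matrix)
tdm <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
dims = c(tdm$nrow, tdm$ncol),
dimnames = dimnames(tdm)) # now a dgCMatrix that cv.glmnet accepts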
Electronic patient records: The Elastic Net
# plots non-zero coefficients from the elastic net model
coefs <- coef(mod_oeso, s = mod_oeso$lambda.1se)[, 1]
coefs <- coefs[coefs != 0] # keep only the terms the penalty retained
coefs <- coefs[order(abs(coefs), decreasing = TRUE)]
barplot(coefs[-1], horiz = TRUE, col = 2) # [-1] drops the intercept
P(site = 'Oesophagus') = 0.03
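The probability shown above comes from scoring a new note with the fitted model. A sketch, where new_doc is a hypothetical 1-row sparse matrix with the same columns as the training data:
predict(mod_oeso, newx = new_doc, s = "lambda.1se", type = "response")
# e.g. 0.03, the predicted probability that the note describes an oesophageal primary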
Electronic patient records: classification performance, primary disease site
Training set = 20,000 patients
Test set = 4,000 patients
80% of patients can be classified with 95% accuracy (the remaining 20% can be handled by human abstractors)
Next step is a full formal evaluation on an independent dataset
Working in combination with a rules-based approach from Manchester University
AUC = 90% (a computation sketch follows)
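Both the AUC and the "80% at 95% accuracy" figure follow from the predicted probabilities. A sketch using the pROC package, where p_dev and y_dev are hypothetical vectors of development-set predicted probabilities and true labels (the cut-offs are illustrative, not the ones used in the talk):
library(pROC)
roc_dev <- roc(y_dev, p_dev) # ROC curve on the development set
auc(roc_dev) # area under the curve, ~0.90 reported above
confident <- p_dev > 0.9 | p_dev < 0.1 # auto-classify only confident predictions
mean(confident) # proportion that can be auto-classified
mean((p_dev[confident] > 0.5) == y_dev[confident]) # accuracy on that subset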
Electronic patient records: Possible extensions
• Classification (hierarchical)
• Cluster analysis (KNN)
• Time
• Survival
• Drug toxicity
• Quality of life
Thanks
Tom.liptrot@christie.nhs.uk
Books example
library(RCurl) # getURL
library(XML) # htmlParse, xpathSApply, xmlGetAttr
library(plyr) # alply, llply, aaply
get_links <- function(address, link_prefix = '', link_suffix = ''){
page <- getURL(address)
# Convert to R
tree <- htmlParse(page)
## Get All link elements
links <- xpathSApply(tree, path = "//*/a",
fun = xmlGetAttr, name = "href")
## Convert to vector
links <- unlist(links)
## add prefix and suffix
paste0(link_prefix, links, link_suffix)
}
links_authors <- get_links("http://textfiles.com/etext/AUTHORS/",
link_prefix = 'http://textfiles.com/etext/AUTHORS/',
link_suffix = '/')
links_text <- alply(links_authors, 1,function(.x){
get_links(.x, link_prefix =.x , link_suffix = '')
})
books <- llply(links_text, function(.x){
aaply(.x, 1, getURL)
})
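From here the downloaded texts can feed the same pipeline as before. A minimal sketch, assuming books is the list built above:
library(tm)
corpus_books <- Corpus(VectorSource(unlist(books))) # one document per downloaded file
tdm_books <- TermDocumentMatrix(corpus_books)
# convert to a dgCMatrix (as sketched earlier) before running the PCA below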
Principal components analysis
## Code to get the first n principal components from a large sparse
## term-document matrix of class dgCMatrix
library(irlba)
n <- 5 # number of components to calculate
m <- nrow(tdm) # 110703 terms in the tdm matrix
xt.x <- crossprod(tdm) # t(tdm) %*% tdm, one row/column per document
x.means <- colMeans(tdm) # per-document means
xt.x <- (xt.x - m * tcrossprod(x.means)) / (m - 1) # covariance matrix of the documents
svd <- irlba(xt.x, nu = 0, nv = n, tol = 1e-10) # truncated SVD: top n components
[PCA plot: one point per book on PC2 (x axis) vs PC3 (y axis), log scales, coloured by author: ARISTOTLE, BURROUGHS, DICKENS, KANT, PLATO, SHAKESPEARE]
plot(svd$v[, c(2, 3)] + 1, # PC2 and PC3 scores, offset by 1 so the log axes work
col = books_df$author,
log = 'xy',
xlab = 'PC2',
ylab = 'PC3')
Editor's Notes
Aims: make people want to try this themselves; get feedback or suggestions from those who have done this already.