SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
Hadley Wickham
Stat405String processing
Thursday, 4 November 2010
Next project
Poster presentation: last week of class or
finals week?
Nov 23: Tracy Volz will be presenting on
how to do a good poster
You need to pick your own dataset. Ideas
here: http://www.delicious.com/hadley/
data. Start thinking now!
Thursday, 4 November 2010
Motivation
Want to try and classify spam vs. ham
(non-spam) email.
Need to process character vectors
containing complete contents of email.
Eventually, will create variables that will be
useful for classification.
Same techniques very useful for data
cleaning.
Thursday, 4 November 2010
Getting started
load("email.rdata")
str(contents) # second 100 are spam
install.packages("stringr")
# all functions starting with str_
# come from this package
help(package = "stringr")
apropos("str_")
Thursday, 4 November 2010
Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com
with Microsoft SMTPSVC(5.0.2195.2966);
Mon, 1 Oct 2001 14:39:38 -0500
x-mimeole: Produced By Microsoft Exchange V6.0.4712.0
content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
Content-Transfer-Encoding: binary
Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
Date: Mon, 1 Oct 2001 14:39:38 -0500
Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
X-MS-Has-Attach:
X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q==
X-Priority: 1
Priority: Urgent
Importance: high
From: "Landau, Georgi" <Georgi.Landau@ENRON.com>
To: "Bailey, Susan" <Susan.Bailey@ENRON.com>,
"Boyd, Samantha" <Samantha.Boyd@ENRON.com>,
"Heard, Marie" <Marie.Heard@ENRON.com>,
"Jones, Tana" <Tana.Jones@ENRON.com>,
"Panus, Stephanie" <Stephanie.Panus@ENRON.com>
Return-Path: Georgi.Landau@ENRON.com
Please check your records and let me know if you have any kind of documentation evidencing a merger
indicating that Mercator Energy Incorporated merged with and into Energy West Incorporated?
I am unable to find anything to substantiate this information.
Thanks for your help.
Georgi Landau
Enron Net Works
Thursday, 4 November 2010
...
Date: Sun, 11 Nov 2001 11:55:05 -0500
Message-Id: <200503101247.j2ACloAq014654@host.high-host.com>
To: benrobinson13@shaw.ca
Subject: LETS DO THIS TOGTHER
From: ben1 <ben_1_wills@yahoo.com.au>
X-Priority: 3 (Normal)
CC:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: RLSP Mailer
Dear Friend,
Firstly, not to caurse you embarrassment, I am Barrister Benson Wills, a Solicitor at law and the
personal attorney to late Mr. Mark Michelle a National of France, who used to be a private
contractor with the Shell Petroleum Development Company in Saudi Arabia, herein after shall be
referred to as my client. On the 21st of April 2001, him and his wife with their three children were
involved in an auto crash, all occupants of the vehicle unfortunately lost their lives.
Since then, I have made several enquiries with his country's embassies to locate any of my clients
extended relatives, this has also proved unsuccessful. After these several unsuccessful attempts, I
decided to contact you with this business partnership proposal. I have contacted
you to assist in repatriating a huge amount of money left behind by my client before they get
confiscated or declared unserviceable by the Finance/Security Company where these huge deposit was
lodged.
The deceased had a deposit valued presently at £30milion Pounds Sterling and the Company has issued
me a notice to provide his next of kin or Beneficiary by Will otherwise have the account confiscated
within the next thirty official working days.
Since I have been unsuccessful in locating the relatives for over two years now, I seek your consent
to present you as the next of kin/Will Beneficiary to the deceased so that the proceeds of this
Thursday, 4 November 2010
Structure of an email
Headers give metadata
Empty line
Body of email
Other major complication is attachments
and alternative content, but we’ll ignore
those for class
Thursday, 4 November 2010
Tasks
• Split into header and contents
• Split header into fields
• Generate useful variables for
distinguishing between spam and non-
spam
Thursday, 4 November 2010
String
basics
Thursday, 4 November 2010
Thursday, 4 November 2010
http://www.youtube.com/
watch?v=ejweI0EQpX8
Thursday, 4 November 2010
# Special characters
a <- ""
b <- """
c <- "anbnc"
a
cat(a, "n")
b
cat(b, "n")
c
cat(c, "n")
Thursday, 4 November 2010
Special characters
• Use  to “escape” special characters
• " = "
• n = new line
•  = 
• t = tab
• ?Quotes for more
Thursday, 4 November 2010
Displaying strings
print() will display quoted form.
cat() will display actual contents (but
need to add a newline to the end).
(But generally better to use message() if
you want to send a message to the user
of your function)
Thursday, 4 November 2010
Your turn
Create a string for each of the following
strings:
:-
(^_^")
@_'-'
m/
Create a multiline string.
Compare the output from print and cat
Thursday, 4 November 2010
a <- ":-"
b <- "(^_^")"
c <- "@_'-'"
d <- "m/"
e <- "This stringngoes overnmultiple lines"
a; b; c; d; e
cat(str_join(a, b, c, d, e, "n", sep = "n"))
Thursday, 4 November 2010
Back to the
problem
Thursday, 4 November 2010
Header vs. content
Need to split the string into two pieces,
based on the the location of double line
break:
str_locate(string, pattern)
Splitting = creating two substrings, one to
the right, one to the left:
str_sub(string, start, end)
Thursday, 4 November 2010
str_locate("great")
str_locate("fantastic", "a")
str_locate("super", "a")
superlatives <- c("great", "fantastic", "super")
res <- str_locate(superlatives, "a")
str(res)
str(str_locate_all(superlatives, "a"))
str_sub("testing", 1, 3)
str_sub("testing", start = 4)
str_sub("testing", end = 4)
input <- c("abc", "defg")
str_sub(input, c(2, 3))
Thursday, 4 November 2010
Your turn
Use sub_locate() to identify the location
where the double break is (make sure to
check a few!)
Split the emails into header and content
with str_sub()
Thursday, 4 November 2010
breaks <- str_locate(contents, "nn")
# Remove invalid emails
valid <- !is.na(breaks[, "start"])
contents <- contents[valid]
breaks <- breaks[valid, ]
# Extract headers and bodies
header <- str_sub(contents, end = breaks[, 1])
body <- str_sub(contents, start = breaks[, 2])
Thursday, 4 November 2010
Headers
• Each header starts at the beginning of
a new line
• Each header is composed of a name
and contents, separated by a colon
Thursday, 4 November 2010
h <- header[2]
# Does this work?
str_split(h, "n")[[1]]
# Why / why not?
# How could you fix the problem?
Thursday, 4 November 2010
# Split & patch up
lines <- str_split(h, "n")
continued <- str_sub(lines, 1, 1) %in% c(" ", "t")
# This is a neat trick!
groups <- cumsum(!continued)
fields <- rep(NA, max(groups))
for (i in seq_along(fields)) {
fields[i] <- str_join(lines[groups == i],
collapse = "n")
}
Thursday, 4 November 2010
Your turn
Write a small function that given a single
header field splits it into name and
contents.
Do you want to use str_split(), or
str_locate() & str_sub()?
Remember to get the algorithm working
before you write the function
Thursday, 4 November 2010
test1 <- "Sender: <Lighthouse@independent.org>"
test2 <- "Subject: Alice: Where is my coffee?"
f1 <- function(input) {
str_split(input, ": ")[[1]]
}
f2 <- function(input) {
colon <- str_locate(input, ": ")
c(
str_sub(input, end = colon[, 1] - 1),
str_sub(input, start = colon[, 2] + 1)
)
}
f3 <- function(input) {
str_split_fixed(input, ": ", 2)[1, ]
}
Thursday, 4 November 2010
We split the content into header and
body. And split up the header into fields.
Both of these tasks used fixed strings.
What if the pattern we need to match is
more complicated?
Next time
Thursday, 4 November 2010

Weitere ähnliche Inhalte

Andere mochten auch

Model Visualisation (with ggplot2)
Model Visualisation (with ggplot2)Model Visualisation (with ggplot2)
Model Visualisation (with ggplot2)
Hadley Wickham
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang
 

Andere mochten auch (20)

04 Wrapup
04 Wrapup04 Wrapup
04 Wrapup
 
Correlations, Trends, and Outliers in ggplot2
Correlations, Trends, and Outliers in ggplot2Correlations, Trends, and Outliers in ggplot2
Correlations, Trends, and Outliers in ggplot2
 
16 Sequences
16 Sequences16 Sequences
16 Sequences
 
20 date-times
20 date-times20 date-times
20 date-times
 
24 modelling
24 modelling24 modelling
24 modelling
 
Model Visualisation (with ggplot2)
Model Visualisation (with ggplot2)Model Visualisation (with ggplot2)
Model Visualisation (with ggplot2)
 
03 Conditional
03 Conditional03 Conditional
03 Conditional
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
R workshop iii -- 3 hours to learn ggplot2 series
R workshop iii -- 3 hours to learn ggplot2 seriesR workshop iii -- 3 hours to learn ggplot2 series
R workshop iii -- 3 hours to learn ggplot2 series
 
03 Modelling
03 Modelling03 Modelling
03 Modelling
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
02 Ddply
02 Ddply02 Ddply
02 Ddply
 
01 Intro
01 Intro01 Intro
01 Intro
 
Reshaping Data in R
Reshaping Data in RReshaping Data in R
Reshaping Data in R
 
Machine learning in R
Machine learning in RMachine learning in R
Machine learning in R
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyr
 
Data Manipulation Using R (& dplyr)
Data Manipulation Using R (& dplyr)Data Manipulation Using R (& dplyr)
Data Manipulation Using R (& dplyr)
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 

Ähnlich wie 21 spam

Rhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docx
Rhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docxRhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docx
Rhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docx
joellemurphey
 
ITECH 1006 - Database Management Systems CRICOS Provide.docx
ITECH 1006 - Database Management Systems CRICOS Provide.docxITECH 1006 - Database Management Systems CRICOS Provide.docx
ITECH 1006 - Database Management Systems CRICOS Provide.docx
christiandean12115
 
DataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docx
DataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docxDataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docx
DataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docx
edwardmarivel
 
Web-IT Support and Consulting - dBase exports
Web-IT Support and Consulting - dBase exportsWeb-IT Support and Consulting - dBase exports
Web-IT Support and Consulting - dBase exports
Dirk Cludts
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
amit kuraria
 
Sustainable TDD
Sustainable TDDSustainable TDD
Sustainable TDD
Steven Mak
 

Ähnlich wie 21 spam (20)

Rhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docx
Rhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docxRhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docx
Rhetorical Strategies and Fallacies WorksheetPHL320 Version 3.docx
 
22 Spam
22 Spam22 Spam
22 Spam
 
22 spam
22 spam22 spam
22 spam
 
ITECH 1006 - Database Management Systems CRICOS Provide.docx
ITECH 1006 - Database Management Systems CRICOS Provide.docxITECH 1006 - Database Management Systems CRICOS Provide.docx
ITECH 1006 - Database Management Systems CRICOS Provide.docx
 
DataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docx
DataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docxDataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docx
DataStore & Region Sales DatabaseIDStore NoSales RegionItem NoItem.docx
 
Web-IT Support and Consulting - dBase exports
Web-IT Support and Consulting - dBase exportsWeb-IT Support and Consulting - dBase exports
Web-IT Support and Consulting - dBase exports
 
Web-IT Support and Consulting - bulk dBase (DBF) exports via Microsoft Excel ...
Web-IT Support and Consulting - bulk dBase (DBF) exports via Microsoft Excel ...Web-IT Support and Consulting - bulk dBase (DBF) exports via Microsoft Excel ...
Web-IT Support and Consulting - bulk dBase (DBF) exports via Microsoft Excel ...
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache Solr
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
 
Writing Toefl Essays - Academiccalendar.Web.Fc2
Writing Toefl Essays - Academiccalendar.Web.Fc2Writing Toefl Essays - Academiccalendar.Web.Fc2
Writing Toefl Essays - Academiccalendar.Web.Fc2
 
Enroller Colloquium: Sulman Sarwar
Enroller Colloquium: Sulman SarwarEnroller Colloquium: Sulman Sarwar
Enroller Colloquium: Sulman Sarwar
 
Beginner-friendly Guide to ML-enabled Automation in Organic Marketing, Lazari...
Beginner-friendly Guide to ML-enabled Automation in Organic Marketing, Lazari...Beginner-friendly Guide to ML-enabled Automation in Organic Marketing, Lazari...
Beginner-friendly Guide to ML-enabled Automation in Organic Marketing, Lazari...
 
Domino Administration Wizardry - Dark Arts Edition
Domino Administration Wizardry - Dark Arts EditionDomino Administration Wizardry - Dark Arts Edition
Domino Administration Wizardry - Dark Arts Edition
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysis
 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and Retrieval
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache Solr
 
Language Sleuthing HOWTO with NLTK
Language Sleuthing HOWTO with NLTKLanguage Sleuthing HOWTO with NLTK
Language Sleuthing HOWTO with NLTK
 
Sustainable TDD
Sustainable TDDSustainable TDD
Sustainable TDD
 

Mehr von Hadley Wickham (20)

27 development
27 development27 development
27 development
 
27 development
27 development27 development
27 development
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
08 functions
08 functions08 functions
08 functions
 
07 problem-solving
07 problem-solving07 problem-solving
07 problem-solving
 
06 data
06 data06 data
06 data
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
04 reports
04 reports04 reports
04 reports
 

21 spam

  • 2. Next project Poster presentation: last week of class or finals week? Nov 23: Tracy Volz will be presenting on how to do a good poster You need to pick your own dataset. Ideas here: http://www.delicious.com/hadley/ data. Start thinking now! Thursday, 4 November 2010
  • 3. Motivation Want to try and classify spam vs. ham (non-spam) email. Need to process character vectors containing complete contents of email. Eventually, will create variables that will be useful for classification. Same techniques very useful for data cleaning. Thursday, 4 November 2010
  • 4. Getting started load("email.rdata") str(contents) # second 100 are spam install.packages("stringr") # all functions starting with str_ # come from this package help(package = "stringr") apropos("str_") Thursday, 4 November 2010
  • 5. Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966); Mon, 1 Oct 2001 14:39:38 -0500 x-mimeole: Produced By Microsoft Exchange V6.0.4712.0 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; Content-Transfer-Encoding: binary Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Date: Mon, 1 Oct 2001 14:39:38 -0500 Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com> Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q== X-Priority: 1 Priority: Urgent Importance: high From: "Landau, Georgi" <Georgi.Landau@ENRON.com> To: "Bailey, Susan" <Susan.Bailey@ENRON.com>, "Boyd, Samantha" <Samantha.Boyd@ENRON.com>, "Heard, Marie" <Marie.Heard@ENRON.com>, "Jones, Tana" <Tana.Jones@ENRON.com>, "Panus, Stephanie" <Stephanie.Panus@ENRON.com> Return-Path: Georgi.Landau@ENRON.com Please check your records and let me know if you have any kind of documentation evidencing a merger indicating that Mercator Energy Incorporated merged with and into Energy West Incorporated? I am unable to find anything to substantiate this information. Thanks for your help. Georgi Landau Enron Net Works Thursday, 4 November 2010
  • 6. ... Date: Sun, 11 Nov 2001 11:55:05 -0500 Message-Id: <200503101247.j2ACloAq014654@host.high-host.com> To: benrobinson13@shaw.ca Subject: LETS DO THIS TOGTHER From: ben1 <ben_1_wills@yahoo.com.au> X-Priority: 3 (Normal) CC: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Mailer: RLSP Mailer Dear Friend, Firstly, not to caurse you embarrassment, I am Barrister Benson Wills, a Solicitor at law and the personal attorney to late Mr. Mark Michelle a National of France, who used to be a private contractor with the Shell Petroleum Development Company in Saudi Arabia, herein after shall be referred to as my client. On the 21st of April 2001, him and his wife with their three children were involved in an auto crash, all occupants of the vehicle unfortunately lost their lives. Since then, I have made several enquiries with his country's embassies to locate any of my clients extended relatives, this has also proved unsuccessful. After these several unsuccessful attempts, I decided to contact you with this business partnership proposal. I have contacted you to assist in repatriating a huge amount of money left behind by my client before they get confiscated or declared unserviceable by the Finance/Security Company where these huge deposit was lodged. The deceased had a deposit valued presently at £30milion Pounds Sterling and the Company has issued me a notice to provide his next of kin or Beneficiary by Will otherwise have the account confiscated within the next thirty official working days. Since I have been unsuccessful in locating the relatives for over two years now, I seek your consent to present you as the next of kin/Will Beneficiary to the deceased so that the proceeds of this Thursday, 4 November 2010
  • 7. Structure of an email Headers give metadata Empty line Body of email Other major complication is attachments and alternative content, but we’ll ignore those for class Thursday, 4 November 2010
  • 8. Tasks • Split into header and contents • Split header into fields • Generate useful variables for distinguishing between spam and non- spam Thursday, 4 November 2010
  • 12. # Special characters a <- "" b <- """ c <- "anbnc" a cat(a, "n") b cat(b, "n") c cat(c, "n") Thursday, 4 November 2010
  • 13. Special characters • Use to “escape” special characters • " = " • n = new line • = • t = tab • ?Quotes for more Thursday, 4 November 2010
  • 14. Displaying strings print() will display quoted form. cat() will display actual contents (but need to add a newline to the end). (But generally better to use message() if you want to send a message to the user of your function) Thursday, 4 November 2010
  • 15. Your turn Create a string for each of the following strings: :- (^_^") @_'-' m/ Create a multiline string. Compare the output from print and cat Thursday, 4 November 2010
  • 16. a <- ":-" b <- "(^_^")" c <- "@_'-'" d <- "m/" e <- "This stringngoes overnmultiple lines" a; b; c; d; e cat(str_join(a, b, c, d, e, "n", sep = "n")) Thursday, 4 November 2010
  • 17. Back to the problem Thursday, 4 November 2010
  • 18. Header vs. content Need to split the string into two pieces, based on the the location of double line break: str_locate(string, pattern) Splitting = creating two substrings, one to the right, one to the left: str_sub(string, start, end) Thursday, 4 November 2010
  • 19. str_locate("great") str_locate("fantastic", "a") str_locate("super", "a") superlatives <- c("great", "fantastic", "super") res <- str_locate(superlatives, "a") str(res) str(str_locate_all(superlatives, "a")) str_sub("testing", 1, 3) str_sub("testing", start = 4) str_sub("testing", end = 4) input <- c("abc", "defg") str_sub(input, c(2, 3)) Thursday, 4 November 2010
  • 20. Your turn Use sub_locate() to identify the location where the double break is (make sure to check a few!) Split the emails into header and content with str_sub() Thursday, 4 November 2010
  • 21. breaks <- str_locate(contents, "nn") # Remove invalid emails valid <- !is.na(breaks[, "start"]) contents <- contents[valid] breaks <- breaks[valid, ] # Extract headers and bodies header <- str_sub(contents, end = breaks[, 1]) body <- str_sub(contents, start = breaks[, 2]) Thursday, 4 November 2010
  • 22. Headers • Each header starts at the beginning of a new line • Each header is composed of a name and contents, separated by a colon Thursday, 4 November 2010
  • 23. h <- header[2] # Does this work? str_split(h, "n")[[1]] # Why / why not? # How could you fix the problem? Thursday, 4 November 2010
  • 24. # Split & patch up lines <- str_split(h, "n") continued <- str_sub(lines, 1, 1) %in% c(" ", "t") # This is a neat trick! groups <- cumsum(!continued) fields <- rep(NA, max(groups)) for (i in seq_along(fields)) { fields[i] <- str_join(lines[groups == i], collapse = "n") } Thursday, 4 November 2010
  • 25. Your turn Write a small function that given a single header field splits it into name and contents. Do you want to use str_split(), or str_locate() & str_sub()? Remember to get the algorithm working before you write the function Thursday, 4 November 2010
  • 26. test1 <- "Sender: <Lighthouse@independent.org>" test2 <- "Subject: Alice: Where is my coffee?" f1 <- function(input) { str_split(input, ": ")[[1]] } f2 <- function(input) { colon <- str_locate(input, ": ") c( str_sub(input, end = colon[, 1] - 1), str_sub(input, start = colon[, 2] + 1) ) } f3 <- function(input) { str_split_fixed(input, ": ", 2)[1, ] } Thursday, 4 November 2010
  • 27. We split the content into header and body. And split up the header into fields. Both of these tasks used fixed strings. What if the pattern we need to match is more complicated? Next time Thursday, 4 November 2010