Big Data and Automated Content Analysis
Week 5 – Wednesday
»Regular expressions and NLP«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
29 April 2014
Today
1 ACA using regular expressions
What is a regexp?
Using a regexp in Python
2 Some more Natural Language Processing
Stemming
Parsing sentences
3 Take-home message & next steps
Automated content analysis using regular expressions
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
• Think of wildcards like * or operators like OR, AND or NOT in search strings: a regexp does the same, but is much more powerful
• You can use them in many editors (!), in the Terminal, in Stata . . . and in Python
An example
From last week’s task
• We wanted to remove everything but words from a tweet
• We did so by calling the .replace() method
• We could do this with a regular expression as well: [^a-zA-Z] would match anything that is not a letter (see the sketch below)
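A minimal sketch of that approach (the tweet text is invented for illustration; adding a space to the character class keeps the word boundaries intact):

import re
tweet = "RT @damian0604: Check https://example.com #BigData!!"
# replace every character that is NOT a letter or a space with nothing
cleaned = re.sub("[^a-zA-Z ]", "", tweet)
print(cleaned)
# RT damian Check httpsexamplecom BigData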
Basic regexp elements
Alternatives
[TtFf] matches either T or t or F or f
Twitter|Facebook matches either Twitter or Facebook
. matches any character
Repetition
* the expression before occurs 0 or more times
+ the expression before occurs 1 or more times
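A quick sketch of these elements with re.findall() (the example strings are invented):

import re
print(re.findall("[TtFf]", "Tea for two"))                       # ['T', 'f', 't']
print(re.findall("Twitter|Facebook", "Twitter beats Facebook"))  # ['Twitter', 'Facebook']
print(re.findall("s.ng", "sing a song"))                         # ['sing', 'song']
print(re.findall("go*gle", "ggle gogle google"))                 # ['ggle', 'gogle', 'google']
print(re.findall("go+gle", "ggle gogle google"))                 # ['gogle', 'google']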
regexp quiz
Which words would be matched?
1 [Pp]ython
2 [A-Z]+
3 RT :* @[a-zA-Z0-9]*
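To check your answers, you could run the three patterns on some invented test strings:

import re
print(re.findall("[Pp]ython", "Monty Python loves python"))               # ['Python', 'python']
print(re.findall("[A-Z]+", "RT by NASA about Python"))                    # ['RT', 'NASA', 'P']
print(re.findall("RT :* @[a-zA-Z0-9]*", "RT : @damian0604 nice slides"))  # ['RT : @damian0604']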
What else is possible?
If you google regexp or regular expression, you’ll get a bunch of useful overviews. The Wikipedia page is not too bad, either.
Using a regexp in Python
How to use regular expressions in Python
The module re
re.findall("[Tt]witter|[Ff]acebook",testo) returns a list with all occurrences of Twitter or Facebook in the string called testo
re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all words that start with one or more numbers followed by one or more letters in the string called testo
re.sub("[Tt]witter|[Ff]acebook","a social medium",testo) returns a string in which all occurrences of Twitter or Facebook are replaced by "a social medium"
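A minimal sketch of what these calls return (testo is an invented example string):

import re
testo = "I saw it on Twitter, then on facebook, in a 24hrs live blog."
print(re.findall("[Tt]witter|[Ff]acebook", testo))
# ['Twitter', 'facebook']
print(re.findall("[0-9]+[a-zA-Z]+", testo))
# ['24hrs']
print(re.sub("[Tt]witter|[Ff]acebook", "a social medium", testo))
# I saw it on a social medium, then on a social medium, in a 24hrs live blog.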
How to use regular expressions in Python
The module re
re.match(" +([0-9]+) of ([0-9]+) points",line) returns None unless the pattern matches at the beginning of the string line. If it does, you can access the parts between () with the .group() method.
Example:

line=" 2 of 25 points"
result=re.match(" +([0-9]+) of ([0-9]+) points",line)
if result:
    print ("Your points:",result.group(1))
    print ("Maximum points:",result.group(2))

Your points: 2
Maximum points: 25
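A small sketch of what "at the beginning" means in practice (re.search(), by contrast, scans the whole string):

import re
print(re.match("[0-9]+", "points: 25"))   # None, because the string does not start with a digit
print(re.match("[0-9]+", "25 points"))    # a match object covering '25'
print(re.search("[0-9]+", "points: 25"))  # a match object covering '25', found later in the string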
Possible applications
Data preprocessing
• Remove unwanted characters, words, . . .
• Identify meaningful bits of text: usernames, headlines, where an article starts, . . . (see the sketch below)
• Filter (distinguish relevant from irrelevant cases)
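For instance, a minimal sketch of identifying usernames in a tweet (the tweet and the second handle are invented; Twitter handles may also contain underscores, hence the _ in the class):

import re
tweet = "RT @damian0604: great course by @uva_ascor"
usernames = re.findall("@[a-zA-Z0-9_]+", tweet)
print(usernames)
# ['@damian0604', '@uva_ascor']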
Possible applications
Data analysis: Automated coding
• Actors
• Brands
• Links or other markers that follow a regular pattern
• Numbers (!)
Example 1: Counting actors

import re, csv
from os import listdir, path

mypath = "/home/damian/artikelen"
filename_list = []
matchcount54_list = []
matchcount10_list = []

# list all files in the directory with the articles
onlyfiles = [f for f in listdir(mypath) if path.isfile(path.join(mypath, f))]

for f in onlyfiles:
    matchcount54 = 0
    matchcount10 = 0
    with open(path.join(mypath, f), mode="r", encoding="utf-8") as fi:
        artikel = fi.readlines()
        for line in artikel:
            # pattern 54: Israeli ministers, politicians, authorities
            matches54 = re.findall('Israel.*(minister|politician.*|[Aa]uthorit)', line)
            # pattern 10: Palestinians
            matches10 = re.findall('[Pp]alest', line)
            matchcount54 += len(matches54)
            matchcount10 += len(matches10)
    filename_list.append(f)
    matchcount54_list.append(matchcount54)
    matchcount10_list.append(matchcount10)

# one row per article: filename, count for pattern 10, count for pattern 54
output = zip(filename_list, matchcount10_list, matchcount54_list)
with open("overzichtstabel.csv", mode='w', encoding="utf-8") as fo:
    writer = csv.writer(fo)
    writer.writerows(output)
Example 2: Which number does this LexisNexis article have?

The raw LexisNexis output looks like this (excerpt):

All Rights Reserved

     2 of 200 DOCUMENTS

De Telegraaf

21 maart 2014 vrijdag

Brussel bereikt akkoord aanpak probleembanken;
ECB krijgt meer in melk te brokkelen

SECTION: Finance; Blz. 24
LENGTH: 660 woorden

BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
over een saneringsfonds voor banken. Daarmee staat de laatste

We can retrieve the article number and the total number of articles like this:

for line in tekst:
    matchObj = re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS", line)
    if matchObj:
        numberofarticle = int(matchObj.group(1))
        totalnumberofarticles = int(matchObj.group(2))
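Building on this, a sketch of how one could use the same pattern to note where each article begins when walking through a whole LexisNexis file (tekst is assumed, as above, to be a list of lines read from such a file):

import re

# collect the index of every line at which a new article starts
articlestarts = []
for i, line in enumerate(tekst):
    if re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS", line):
        articlestarts.append(i)
print("Found", len(articlestarts), "articles in this file")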
Practice yourself!
http://www.pyregex.com/
Some more Natural Language Processing
Some more NLP: What and why?
What can we do?
• remove stopwords (last week)
• stemming
• parse sentences (advanced)
Stemming
NLP: What and why?
Why do stemming?
• Because we do not want to distinguish between smoke, smoked, smoking, . . .
• Typical preprocessing step (like stopword removal)
Stemming
(with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. Sebastopol, CA: O’Reilly.)

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
frase = "I am running while generously greeting my neighbors"
frasenuevo = ""
for palabra in frase.split():
    frasenuevo = frasenuevo + stemmer.stem(palabra) + " "

If we now did print(frasenuevo), it would return:

i am run while generous greet my neighbor
Stemming and stopword removal - let’s combine them!

from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
stemmer = SnowballStemmer("english")
stopwords = stopwords.words("english")
frase = "I am running while generously greeting my neighbors"
frasenuevo = ""
for palabra in frase.lower().split():
    if palabra not in stopwords:
        frasenuevo = frasenuevo + stemmer.stem(palabra) + " "

Now, print(frasenuevo) returns:

run generous greet neighbor

Perfect!

In order to use nltk.corpus.stopwords, you have to download that corpus once. You can do so by typing the following in the Python console and selecting the appropriate package from the menu that pops up:

import nltk
nltk.download()

NB: Don’t download everything, that’s several GB.
Parsing sentences
NLP: What and why?
Why parse sentences?
• To find out what grammatical function words have
• and to get closer to the meaning.
Parsing a sentence

import nltk
sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
tokens = nltk.word_tokenize(sentence)
print (tokens)

nltk.word_tokenize(sentence) is similar to sentence.split(), but compare the handling of punctuation and of didn't in the output:

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
Parsing a sentence
Now, as the next step, you can “tag” the tokenized sentence:

tagged = nltk.pos_tag(tokens)
print (tagged[0:6])

gives you the following:

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]

And you could get the word type of "morning" with tagged[5][1]!
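Once the sentence is tagged, you can filter on the tags, for instance to keep only the nouns. A minimal sketch (in the Penn Treebank tagset used by nltk.pos_tag, noun tags start with NN):

nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
# e.g. ['Thursday', 'morning', 'Arthur']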
More NLP
Look at http://nltk.org
Take-home message
Take-home exam
Next meetings
Take-home messages
What you should be familiar with:
• Steps to preprocessing the data
• Regular expressions
• Word counts and co-occurrences
Take-home exam
You will receive detailed instructions NOW.
Next meetings: Week 6
No meeting on Monday (http://4en5mei.nl)
Wednesday
Scraping and parsing