SlideShare ist ein Scribd-Unternehmen logo
1 von 40
Downloaden Sie, um offline zu lesen
Big Data and Automated Content Analysis
Week 5 – Monday
»Automated content analysis with NLP and
regular expressions«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
25 April 2014
Regular expressions Some more Natural Language Processing Take-home message & next steps
Today
1 ACA using regular expressions
What is a regexp?
Using a regexp in Python
2 Some more Natural Language Processing
Stemming
Parsing sentences
3 Take-home message & next steps
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Automated content analysis using regular expressions
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
• Think of wildcards like * or operators like OR, AND or NOT in
search strings: a regexp does the same, but is much more
powerful
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
• Think of wildcards like * or operators like OR, AND or NOT in
search strings: a regexp does the same, but is much more
powerful
• You can use them in many editors (!), in the Terminal, in
STATA . . . and in Python
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
An example
From last week’s task
• We wanted to remove everything but words from a tweet
• We did so by calling the .replace() method
• We could do this with a regular expression as well:
[ˆa-zA-Z] would match anything that is not a letter
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Basic regexp elements
Alternatives
[TtFf] matches either T or t or F or f
Twitter|Facebook matches either Twitter or Facebook
. matches any character
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Basic regexp elements
Alternatives
[TtFf] matches either T or t or F or f
Twitter|Facebook matches either Twitter or Facebook
. matches any character
Repetition
* the expression before occurs 0 or more times
+ the expression before occurs 1 or more times
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
regexp quizz
Which words would be matched?
1 [Pp]ython
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
regexp quizz
Which words would be matched?
1 [Pp]ython
2 [A-Z]+
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
regexp quizz
Which words would be matched?
1 [Pp]ython
2 [A-Z]+
3 RT :* @[a-zA-Z0-9]*
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
What else is possible?
If you google regexp or regular expression, you’ll get a bunch
of useful overviews. The wikipedia page is not too bad, either.
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
How to use regular expressions in Python
The module re
re.findall("[Tt]witter|[Ff]acebook",testo) returns a list
with all occurances of Twitter or Facebook in the
string called testo
re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all
words that start with one or more numbers followed
by one or more letters in the string called testo
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
How to use regular expressions in Python
The module re
re.findall("[Tt]witter|[Ff]acebook",testo) returns a list
with all occurances of Twitter or Facebook in the
string called testo
re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all
words that start with one or more numbers followed
by one or more letters in the string called testo
re.sub("[Tt]witter|[Ff]acebook","a social medium",testo)
returns a string in which all all occurances of Twitter
or Facebook are replaced by "a social medium"
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
How to use regular expressions in Python
The module re
re.match(" +([0-9]+) of ([0-9]+) points",line) returns
None unless it exactly matches the string line. If it
does, you can access the part between () with the
.group() method.
Example:
1 line=" 2 of 25 points"
2 result=re.match(" +([0-9]+) of ([0-9]+) points",line)
3 if result:
4 print ("Your points:",result.group(1))
5 print ("Maximum points:",result.group(2))
Your points: 2
Maximum points: 25
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Possible applications
Data preprocessing
• Remove unwanted characters, words, . . .
• Identify meaningful bits of text: usernames, headlines, where
an article starts, . . .
• filter (distinguish relevant from irrelevant cases)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Possible applications
Data analysis: Automated coding
• Actors
• Brands
• links or other markers that follow a regular pattern
• Numbers (!)
Big Data and Automated Content Analysis Damian Trilling
Example 1: Counting actors
1 import re, csv
2 from os import listdir, path
3 mypath ="/home/damian/artikelen"
4 filename_list=[]
5 matchcount54_list=[]
6 matchcount10_list=[]
7 onlyfiles = [f for f in listdir(mypath) if path.isfile(path.join(mypath,
f))]
8 for f in onlyfiles:
9 matchcount54=0
10 matchcount10=0
11 with open(path.join(mypath,f),mode="r",encoding="utf-8") as fi:
12 artikel=fi.readlines()
13 for line in artikel:
14 matches54 = re.findall(’Israel.*(minister|politician.*|[Aa]
uthorit)’,line)
15 matches10 = re.findall(’[Pp]alest’,line)
16 matchcount54+=len(matches54)
17 matchcount10+=len(matches10)
18 filename_list.append(f)
19 matchcount54_list.append(matchcount54)
20 matchcount10_list.append(matchcount10)
21 output=zip(filename_list,matchcount10_list,matchcount54_list)
22 with open("overzichtstabel.csv", mode=’w’,encoding="utf-8") as fo:
23 writer = csv.writer(fo)
24 writer.writerows(output)
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Example 2: Which number has this Lexis Nexis article?
1 All Rights Reserved
2
3 2 of 200 DOCUMENTS
4
5 De Telegraaf
6
7 21 maart 2014 vrijdag
8
9 Brussel bereikt akkoord aanpak probleembanken;
10 ECB krijgt meer in melk te brokkelen
11
12 SECTION: Finance; Blz. 24
13 LENGTH: 660 woorden
14
15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
16 over een saneringsfonds voor banken. Daarmee staat de laatste
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Example 2: Check the number of a lexis nexis article
1 All Rights Reserved
2
3 2 of 200 DOCUMENTS
4
5 De Telegraaf
6
7 21 maart 2014 vrijdag
8
9 Brussel bereikt akkoord aanpak probleembanken;
10 ECB krijgt meer in melk te brokkelen
11
12 SECTION: Finance; Blz. 24
13 LENGTH: 660 woorden
14
15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
16 over een saneringsfonds voor banken. Daarmee staat de laatste
1 for line in tekst:
2 matchObj=re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS",line)
3 if matchObj:
4 numberofarticle= int(matchObj.group(1))
5 totalnumberofarticles= int(matchObj.group(2))
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Practice yourself!
http://www.pyregex.com/
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more Natural Language Processing
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more NLP: What and why?
What can we do?
• remove stopwords (last week)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more NLP: What and why?
What can we do?
• remove stopwords (last week)
• stemming
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more NLP: What and why?
What can we do?
• remove stopwords (last week)
• stemming
• Parse sentences (advanced)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
NLP: What and why?
Why do stemming?
• Because we do not want to distinguish between smoke,
smoked, smoking, . . .
• Typical preprocessing step (like stopword removal)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
Stemming
(with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing
with Python. Sebastopol, CA: O’Reilly.)
1 from nltk.stem.snowball import SnowballStemmer
2 stemmer=SnowballStemmer("english")
3 frase="I am running while generously greeting my neighbors"
4 frasenuevo=""
5 for palabra in frase.split():
6 frasenuevo=frasenuevo + stemmer.stem(palabra) + " "
If we now did print(frasenuevo), it would return:
1 i am run while generous greet my neighbor
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
Stemming and stopword removal - let’s combine them!
1 from nltk.stem.snowball import SnowballStemmer
2 from nltk.corpus import stopwords
3 stemmer=SnowballStemmer("english")
4 stopwords = stopwords.words("english")
5 frase="I am running while generously greeting my neighbors"
6 frasenuevo=""
7 for palabra in frase.lower().split():
8 if palabra not in stopwords:
9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " "
Now, print(frasenuevo) returns:
1 run generous greet neighbor
Perfect!
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
Stemming and stopword removal - let’s combine them!
1 from nltk.stem.snowball import SnowballStemmer
2 from nltk.corpus import stopwords
3 stemmer=SnowballStemmer("english")
4 stopwords = stopwords.words("english")
5 frase="I am running while generously greeting my neighbors"
6 frasenuevo=""
7 for palabra in frase.lower().split():
8 if palabra not in stopwords:
9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " "
Now, print(frasenuevo) returns:
1 run generous greet neighbor
Perfect!
In order to use nltk.corpus.stopwords, you have to download that module once. You can do so by typing the
following in the Python console and selecting the appropriate package from the menu that pops up:
import nltk
nltk.download()
NB: Don’t download everything, that’s several GB.
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
NLP: What and why?
Why parse sentences?
• To find out what grammatical function words have
• and to get closer to the meaning.
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
Parsing a sentence
1 import nltk
2 sentence = "At eight o’clock on Thursday morning, Arthur didn’t feel
very good."
3 tokens = nltk.word_tokenize(sentence)
4 print (tokens)
nltk.word_tokenize(sentence) is similar to sentence.split(),
but compare handling of punctuation and the didn’t in the
output:
1 [’At’, ’eight’, "o’clock", ’on’, ’Thursday’, ’morning’,’Arthur’, ’did’,
"n’t", ’feel’, ’very’, ’good’, ’.’]
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
Parsing a sentence
Now, as the next step, you can “tag” the tokenized sentence:
1 tagged = nltk.pos_tag(tokens)
2 print (tagged[0:6])
gives you the following:
1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’),
2 (’Thursday’, ’NNP’), (’morning’, ’NN’)]
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
Parsing a sentence
Now, as the next step, you can “tag” the tokenized sentence:
1 tagged = nltk.pos_tag(tokens)
2 print (tagged[0:6])
gives you the following:
1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’),
2 (’Thursday’, ’NNP’), (’morning’, ’NN’)]
And you could get the word type of "morning" with
tagged[5][1]!
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
More NLP
Look at http://nltk.org
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Take-home message
Take-home exam
Next meetings
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Take-home messages
What you should be familiar with:
• Steps to preprocessing the data
• Regular expressions
• Word counts and co-occurrences
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Next meetings: Week 6
This Wednesday is Koningsdag; no meeting. But take some time
this week for recapping today’s topics (= Chapter 6)
Monday
Scraping and parsing (lecture)
Wednesday
Scraping and parsing (exercises)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Take-home exam
You will receive detailed instructions NOW.
Grading criteria
• Task 1: correctness, comprehensiveness, depth, argumentation
• Task 2: comprehensiveness, feasibility, usefulness,
argumentation, creativity, detail/reproducibility
• Task 3: does it run?; does it make sense?; advancedness of
analys[ie]s
Big Data and Automated Content Analysis Damian Trilling

Weitere ähnliche Inhalte

Was ist angesagt?

From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?Constantin Orasan
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?Frank van Harmelen
 
Text Analysis: Latent Topics and Annotated Documents
Text Analysis: Latent Topics and Annotated DocumentsText Analysis: Latent Topics and Annotated Documents
Text Analysis: Latent Topics and Annotated DocumentsNelson Auner
 
Informatics is a natural science
Informatics is a natural scienceInformatics is a natural science
Informatics is a natural scienceFrank van Harmelen
 

Was ist angesagt? (20)

BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BD-ACA week3a
BD-ACA week3aBD-ACA week3a
BD-ACA week3a
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
Basics of IR: Web Information Systems class
Basics of IR: Web Information Systems class Basics of IR: Web Information Systems class
Basics of IR: Web Information Systems class
 
From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?From TREC to Watson: is open domain question answering a solved problem?
From TREC to Watson: is open domain question answering a solved problem?
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 
Data Visualization at Twitter
Data Visualization at TwitterData Visualization at Twitter
Data Visualization at Twitter
 
Text Analysis: Latent Topics and Annotated Documents
Text Analysis: Latent Topics and Annotated DocumentsText Analysis: Latent Topics and Annotated Documents
Text Analysis: Latent Topics and Annotated Documents
 
Informatics is a natural science
Informatics is a natural scienceInformatics is a natural science
Informatics is a natural science
 

Ähnlich wie BDACA1516s2 - Lecture5

Introduction to Python
Introduction to Python Introduction to Python
Introduction to Python amiable_indian
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalA. LE
 
TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)Mike Felch
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
First Steps in Python Programming
First Steps in Python ProgrammingFirst Steps in Python Programming
First Steps in Python ProgrammingDozie Agbo
 
Utilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchUtilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchErudite
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
NLP - Prédictions de tags sur les questions Stackoverflow
NLP - Prédictions de tags sur les questions StackoverflowNLP - Prédictions de tags sur les questions Stackoverflow
NLP - Prédictions de tags sur les questions StackoverflowFUMERY Michael
 
Automation Testing theory notes.pptx
Automation Testing theory notes.pptxAutomation Testing theory notes.pptx
Automation Testing theory notes.pptxNileshBorkar12
 
GDG Helwan Introduction to python
GDG Helwan Introduction to pythonGDG Helwan Introduction to python
GDG Helwan Introduction to pythonMohamed Hegazy
 
Clean code, Better coding practices
Clean code, Better coding practicesClean code, Better coding practices
Clean code, Better coding practicesParamvir Singh
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?lichtkind
 

Ähnlich wie BDACA1516s2 - Lecture5 (20)

BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BD-ACA week7a
BD-ACA week7aBD-ACA week7a
BD-ACA week7a
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
Introduction to Python
Introduction to Python Introduction to Python
Introduction to Python
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
 
TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)TeelTech - Advancing Mobile Device Forensics (online version)
TeelTech - Advancing Mobile Device Forensics (online version)
 
Python cheat-sheet
Python cheat-sheetPython cheat-sheet
Python cheat-sheet
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
First Steps in Python Programming
First Steps in Python ProgrammingFirst Steps in Python Programming
First Steps in Python Programming
 
BD-ACA week2
BD-ACA week2BD-ACA week2
BD-ACA week2
 
Utilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchUtilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword research
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
Pa1 session 2
Pa1 session 2 Pa1 session 2
Pa1 session 2
 
NLP - Prédictions de tags sur les questions Stackoverflow
NLP - Prédictions de tags sur les questions StackoverflowNLP - Prédictions de tags sur les questions Stackoverflow
NLP - Prédictions de tags sur les questions Stackoverflow
 
Automation Testing theory notes.pptx
Automation Testing theory notes.pptxAutomation Testing theory notes.pptx
Automation Testing theory notes.pptx
 
GDG Helwan Introduction to python
GDG Helwan Introduction to pythonGDG Helwan Introduction to python
GDG Helwan Introduction to python
 
Clean code, Better coding practices
Clean code, Better coding practicesClean code, Better coding practices
Clean code, Better coding practices
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
Programming with Python
Programming with PythonProgramming with Python
Programming with Python
 

Mehr von Department of Communication Science, University of Amsterdam (9)

BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Kürzlich hochgeladen

Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 

Kürzlich hochgeladen (20)

Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 

BDACA1516s2 - Lecture5

  • 1. Big Data and Automated Content Analysis Week 5 – Monday »Automated content analysis with NLP and regular expressions« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 25 April 2014
  • 2. Regular expressions Some more Natural Language Processing Take-home message & next steps Today 1 ACA using regular expressions What is a regexp? Using a regexp in Python 2 Some more Natural Language Processing Stemming Parsing sentences 3 Take-home message & next steps Big Data and Automated Content Analysis Damian Trilling
  • 3. Regular expressions Some more Natural Language Processing Take-home message & next steps Automated content analysis using regular expressions Big Data and Automated Content Analysis Damian Trilling
  • 4. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Regular Expressions: What and why? What is a regexp? • a very widespread way to describe patterns in strings Big Data and Automated Content Analysis Damian Trilling
  • 5. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Regular Expressions: What and why? What is a regexp? • a very widespread way to describe patterns in strings • Think of wildcards like * or operators like OR, AND or NOT in search strings: a regexp does the same, but is much more powerful Big Data and Automated Content Analysis Damian Trilling
  • 6. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Regular Expressions: What and why? What is a regexp? • a very widespread way to describe patterns in strings • Think of wildcards like * or operators like OR, AND or NOT in search strings: a regexp does the same, but is much more powerful • You can use them in many editors (!), in the Terminal, in STATA . . . and in Python Big Data and Automated Content Analysis Damian Trilling
  • 7. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? An example From last week’s task • We wanted to remove everything but words from a tweet • We did so by calling the .replace() method • We could do this with a regular expression as well: [ˆa-zA-Z] would match anything that is not a letter Big Data and Automated Content Analysis Damian Trilling
  • 8. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Basic regexp elements Alternatives [TtFf] matches either T or t or F or f Twitter|Facebook matches either Twitter or Facebook . matches any character Big Data and Automated Content Analysis Damian Trilling
  • 9. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Basic regexp elements Alternatives [TtFf] matches either T or t or F or f Twitter|Facebook matches either Twitter or Facebook . matches any character Repetition * the expression before occurs 0 or more times + the expression before occurs 1 or more times Big Data and Automated Content Analysis Damian Trilling
  • 10. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? regexp quizz Which words would be matched? 1 [Pp]ython Big Data and Automated Content Analysis Damian Trilling
  • 11. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? regexp quizz Which words would be matched? 1 [Pp]ython 2 [A-Z]+ Big Data and Automated Content Analysis Damian Trilling
  • 12. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? regexp quizz Which words would be matched? 1 [Pp]ython 2 [A-Z]+ 3 RT :* @[a-zA-Z0-9]* Big Data and Automated Content Analysis Damian Trilling
  • 13. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? What else is possible? If you google regexp or regular expression, you’ll get a bunch of useful overviews. The wikipedia page is not too bad, either. Big Data and Automated Content Analysis Damian Trilling
  • 14. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python How to use regular expressions in Python The module re re.findall("[Tt]witter|[Ff]acebook",testo) returns a list with all occurances of Twitter or Facebook in the string called testo re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all words that start with one or more numbers followed by one or more letters in the string called testo Big Data and Automated Content Analysis Damian Trilling
  • 15. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python How to use regular expressions in Python The module re re.findall("[Tt]witter|[Ff]acebook",testo) returns a list with all occurances of Twitter or Facebook in the string called testo re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all words that start with one or more numbers followed by one or more letters in the string called testo re.sub("[Tt]witter|[Ff]acebook","a social medium",testo) returns a string in which all all occurances of Twitter or Facebook are replaced by "a social medium" Big Data and Automated Content Analysis Damian Trilling
  • 16. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python How to use regular expressions in Python The module re re.match(" +([0-9]+) of ([0-9]+) points",line) returns None unless it exactly matches the string line. If it does, you can access the part between () with the .group() method. Example: 1 line=" 2 of 25 points" 2 result=re.match(" +([0-9]+) of ([0-9]+) points",line) 3 if result: 4 print ("Your points:",result.group(1)) 5 print ("Maximum points:",result.group(2)) Your points: 2 Maximum points: 25 Big Data and Automated Content Analysis Damian Trilling
  • 17. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Possible applications Data preprocessing • Remove unwanted characters, words, . . . • Identify meaningful bits of text: usernames, headlines, where an article starts, . . . • filter (distinguish relevant from irrelevant cases) Big Data and Automated Content Analysis Damian Trilling
  • 18. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Possible applications Data analysis: Automated coding • Actors • Brands • links or other markers that follow a regular pattern • Numbers (!) Big Data and Automated Content Analysis Damian Trilling
  • 19. Example 1: Counting actors 1 import re, csv 2 from os import listdir, path 3 mypath ="/home/damian/artikelen" 4 filename_list=[] 5 matchcount54_list=[] 6 matchcount10_list=[] 7 onlyfiles = [f for f in listdir(mypath) if path.isfile(path.join(mypath, f))] 8 for f in onlyfiles: 9 matchcount54=0 10 matchcount10=0 11 with open(path.join(mypath,f),mode="r",encoding="utf-8") as fi: 12 artikel=fi.readlines() 13 for line in artikel: 14 matches54 = re.findall(’Israel.*(minister|politician.*|[Aa] uthorit)’,line) 15 matches10 = re.findall(’[Pp]alest’,line) 16 matchcount54+=len(matches54) 17 matchcount10+=len(matches10) 18 filename_list.append(f) 19 matchcount54_list.append(matchcount54) 20 matchcount10_list.append(matchcount10) 21 output=zip(filename_list,matchcount10_list,matchcount54_list) 22 with open("overzichtstabel.csv", mode=’w’,encoding="utf-8") as fo: 23 writer = csv.writer(fo) 24 writer.writerows(output)
  • 20. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Example 2: Which number has this Lexis Nexis article? 1 All Rights Reserved 2 3 2 of 200 DOCUMENTS 4 5 De Telegraaf 6 7 21 maart 2014 vrijdag 8 9 Brussel bereikt akkoord aanpak probleembanken; 10 ECB krijgt meer in melk te brokkelen 11 12 SECTION: Finance; Blz. 24 13 LENGTH: 660 woorden 14 15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 16 over een saneringsfonds voor banken. Daarmee staat de laatste Big Data and Automated Content Analysis Damian Trilling
  • 21. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Example 2: Check the number of a lexis nexis article 1 All Rights Reserved 2 3 2 of 200 DOCUMENTS 4 5 De Telegraaf 6 7 21 maart 2014 vrijdag 8 9 Brussel bereikt akkoord aanpak probleembanken; 10 ECB krijgt meer in melk te brokkelen 11 12 SECTION: Finance; Blz. 24 13 LENGTH: 660 woorden 14 15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 16 over een saneringsfonds voor banken. Daarmee staat de laatste 1 for line in tekst: 2 matchObj=re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS",line) 3 if matchObj: 4 numberofarticle= int(matchObj.group(1)) 5 totalnumberofarticles= int(matchObj.group(2)) Big Data and Automated Content Analysis Damian Trilling
  • 22. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Practice yourself! http://www.pyregex.com/ Big Data and Automated Content Analysis Damian Trilling
  • 23. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more Natural Language Processing Big Data and Automated Content Analysis Damian Trilling
  • 24. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more NLP: What and why? What can we do? • remove stopwords (last week) Big Data and Automated Content Analysis Damian Trilling
  • 25. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more NLP: What and why? What can we do? • remove stopwords (last week) • stemming Big Data and Automated Content Analysis Damian Trilling
  • 26. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more NLP: What and why? What can we do? • remove stopwords (last week) • stemming • Parse sentences (advanced) Big Data and Automated Content Analysis Damian Trilling
  • 27. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming NLP: What and why? Why do stemming? • Because we do not want to distinguish between smoke, smoked, smoking, . . . • Typical preprocessing step (like stopword removal) Big Data and Automated Content Analysis Damian Trilling
  • 28. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming Stemming (with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. Sebastopol, CA: O’Reilly.) 1 from nltk.stem.snowball import SnowballStemmer 2 stemmer=SnowballStemmer("english") 3 frase="I am running while generously greeting my neighbors" 4 frasenuevo="" 5 for palabra in frase.split(): 6 frasenuevo=frasenuevo + stemmer.stem(palabra) + " " If we now did print(frasenuevo), it would return: 1 i am run while generous greet my neighbor Big Data and Automated Content Analysis Damian Trilling
  • 29. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming Stemming and stopword removal - let’s combine them! 1 from nltk.stem.snowball import SnowballStemmer 2 from nltk.corpus import stopwords 3 stemmer=SnowballStemmer("english") 4 stopwords = stopwords.words("english") 5 frase="I am running while generously greeting my neighbors" 6 frasenuevo="" 7 for palabra in frase.lower().split(): 8 if palabra not in stopwords: 9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " " Now, print(frasenuevo) returns: 1 run generous greet neighbor Perfect! Big Data and Automated Content Analysis Damian Trilling
  • 30. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming Stemming and stopword removal - let’s combine them! 1 from nltk.stem.snowball import SnowballStemmer 2 from nltk.corpus import stopwords 3 stemmer=SnowballStemmer("english") 4 stopwords = stopwords.words("english") 5 frase="I am running while generously greeting my neighbors" 6 frasenuevo="" 7 for palabra in frase.lower().split(): 8 if palabra not in stopwords: 9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " " Now, print(frasenuevo) returns: 1 run generous greet neighbor Perfect! In order to use nltk.corpus.stopwords, you have to download that module once. You can do so by typing the following in the Python console and selecting the appropriate package from the menu that pops up: import nltk nltk.download() NB: Don’t download everything, that’s several GB. Big Data and Automated Content Analysis Damian Trilling
  • 31.
  • 32. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences NLP: What and why? Why parse sentences? • To find out what grammatical function words have • and to get closer to the meaning. Big Data and Automated Content Analysis Damian Trilling
  • 33. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences Parsing a sentence 1 import nltk 2 sentence = "At eight o’clock on Thursday morning, Arthur didn’t feel very good." 3 tokens = nltk.word_tokenize(sentence) 4 print (tokens) nltk.word_tokenize(sentence) is similar to sentence.split(), but compare handling of punctuation and the didn’t in the output: 1 [’At’, ’eight’, "o’clock", ’on’, ’Thursday’, ’morning’,’Arthur’, ’did’, "n’t", ’feel’, ’very’, ’good’, ’.’] Big Data and Automated Content Analysis Damian Trilling
  • 34. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences Parsing a sentence Now, as the next step, you can “tag” the tokenized sentence: 1 tagged = nltk.pos_tag(tokens) 2 print (tagged[0:6]) gives you the following: 1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’), 2 (’Thursday’, ’NNP’), (’morning’, ’NN’)] Big Data and Automated Content Analysis Damian Trilling
  • 35. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences Parsing a sentence Now, as the next step, you can “tag” the tokenized sentence: 1 tagged = nltk.pos_tag(tokens) 2 print (tagged[0:6]) gives you the following: 1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’), 2 (’Thursday’, ’NNP’), (’morning’, ’NN’)] And you could get the word type of "morning" with tagged[5][1]! Big Data and Automated Content Analysis Damian Trilling
  • 36. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences More NLP Look at http://nltk.org Big Data and Automated Content Analysis Damian Trilling
  • 37. Regular expressions Some more Natural Language Processing Take-home message & next steps Take-home message Take-home exam Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 38. Regular expressions Some more Natural Language Processing Take-home message & next steps Take-home messages What you should be familiar with: • Steps to preprocessing the data • Regular expressions • Word counts and co-occurrences Big Data and Automated Content Analysis Damian Trilling
  • 39. Regular expressions Some more Natural Language Processing Take-home message & next steps Next meetings: Week 6 This Wednesday is Koningsdag; no meeting. But take some time this week for recapping today’s topics (= Chapter 6) Monday Scraping and parsing (lecture) Wednesday Scraping and parsing (exercises) Big Data and Automated Content Analysis Damian Trilling
  • 40. Regular expressions Some more Natural Language Processing Take-home message & next steps Take-home exam You will receive detailed instructions NOW. Grading criteria • Task 1: correctness, comprehensiveness, depth, argumentation • Task 2: comprehensiveness, feasibility, usefulness, argumentation, creativity, detail/reproducibility • Task 3: does it run?; does it make sense?; advancedness of analys[ie]s Big Data and Automated Content Analysis Damian Trilling