Good morning!

Enjoy your coffee and install
Putty and NotepadPlus via "Software Maintenance/Application
Catalogue". Please also install the Pattern package (see my e-mail). Thanks.
Big Data? What are we talking about?

The process: collect, store, analyze

Python

Questions?

Hands-on Workshop
Big (Twitter) Data
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam

30 January 2014
9.30
#bigdata

The next one and a half days
You’ll hear about
• Collecting social media data via APIs, RSS and scraping (and the tools for it)
• Technical infrastructure (via SurfSARA)
• Python
• Sentiment analysis
• Automated coding
• Frequencies and other statistics
• Social network analysis with Gephi
• ...

In this session (1/4):
1 Big Data? What are we talking about?
    Exploring the field
    Some examples
2 The process: collect, store, analyze
    A scheme
    Our implementation
3 Python
    What it is
    When to use it
    When not to use it
4 Questions?

Exploring the field

What’s big data?
What are we talking about?

Today, it’s a hands-on workshop, so let’s keep this important (!)
discussion for later.

So, no definition, but some brief thoughts
• Existing data (≠ experiments or surveys)
• Too big to code manually
• Too big to handle with normal tools
• New research questions
• Call to revisit the relationship between theory and empirical research

Today, ...
• we are not going to talk about REALLY BIG data,
• but we will have some exercises on datasets a normal computer can handle

Tomorrow, ...
• we will also learn about scaling up these techniques
• SurfSARA provides infrastructure for this

Some sources
• Social Network Sites
• RSS-feeds
• Databases
• Scraping text from the web
• ...

It’s out there!
You only have to collect it.

But why should we care?
We can answer new questions
• Find needles in haystacks
• Identify networks, co-word analysis, linguistic analysis, ...
• Verify our theories in larger datasets

It makes sense
• There are things that computers are simply better at than humans, e.g. counting things
• Having human coders look for words in texts is like calculating a regression analysis by hand

Some examples

A recent master thesis

The needle in the haystack
Imagine you want to analyze some very rare content.
Normal sampling won’t work, that’s for sure.

So you’d better collect everything first

Getting all news coverage from Dutch news sites
1 Collect all articles from nine news sites during a period of two months, resulting in a database with 74,000 articles.
2 Filter articles containing specific keywords.
3 The 292 articles that remained were then manually coded.

Pöll, B. (2013). Social media: new sources, new profession? A content analysis of the use of social media as a
source for journalists in online news articles. Master Thesis, Universiteit van Amsterdam.
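
The filtering step (step 2 above) can be sketched in a few lines of Python. This is only an illustration under assumptions: the keyword list and the three sample articles are made up, and the thesis' actual search terms and storage format are not given here.

```python
import re

# Hypothetical keywords; the search terms used in the thesis are not listed in these slides
KEYWORDS = ["twitter", "facebook", "tweet"]

# One pattern that matches any keyword as a whole word, ignoring case
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def filter_articles(articles):
    """Return only the articles whose text mentions at least one keyword."""
    return {name: text for name, text in articles.items() if PATTERN.search(text)}


# Made-up sample data standing in for the collected articles
sample = {
    "a1.txt": "The minister announced the plan on Twitter this morning.",
    "a2.txt": "Heavy rain is expected in the north of the country.",
    "a3.txt": "A Facebook post by the mayor caused a stir.",
}
selected = filter_articles(sample)
print(sorted(selected))
```

Only the articles that survive this filter would then go to the human coders.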

It’s just one line of code!

urls.txt
http://www.gmx.at/themen/wissen/mensch/108g5xi-baeuerlich-schiefe-zaehne
http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/408g740-fuermannbittet-um-verzeihung
http://www.gmx.at/themen/nachrichten/aufruhr-arabien/268g70u-regierungwill-zuruecktreten
http://www.gmx.at/themen/nachrichten/panorama/828g54y-neues-zur-klagegegen-republik
http://www.gmx.at/themen/nachrichten/panorama/968g72s-millionstrafewegen-oelpest
http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/368g6yc-keinbabybauch-nur-fast-food
...
...
...

wget command
wget -i urls.txt
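
For comparison, roughly the same job can be done from Python with the standard library. This is a minimal sketch, not a replacement for wget: the helper names are made up for illustration, error handling is left out, and the `fetch` parameter exists only so the download step can be swapped out when there is no network.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def read_urls(path):
    """Read one URL per line, skipping empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def local_name(url):
    """Use the last part of the URL path as file name, or index.html if it is empty."""
    name = os.path.basename(urlparse(url).path)
    return name or "index.html"


def download_all(urls, folder=".", fetch=None):
    """Download every URL into folder and return the list of saved file paths."""
    if fetch is None:
        # Default: really fetch the page over the network
        fetch = lambda url: urlopen(url).read()
    saved = []
    for url in urls:
        target = os.path.join(folder, local_name(url))
        with open(target, "wb") as out:
            out.write(fetch(url))
        saved.append(target)
    return saved


# Usage (needs network access):
# download_all(read_urls("urls.txt"))
```

The wget one-liner is still the simpler choice; a script like this only pays off once you want to do something with each page right after downloading it.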

A recent bachelor thesis

Tone in tweets
Imagine you want to know something about someone’s behavior on
Twitter. Or how a specific topic is discussed on Twitter.
Do you really want to go through thousands of tweets by hand?

So you’d better think about automating your coding
Finding out how negative or positive politicians are towards
their opponents
The student took lists of positive and negative words and made
additional lists of each politician’s opponents.
She used a Python script to check which type of words was used to
refer to opponents.
For further analysis, the results were imported into SPSS.

Schut, L. (2013). Verenigde Staten vs. Verenigd Koningrijk: Een automatische inhoudsanalyse naar verklarende
factoren voor het gebruik van positive campaigning en negative campaigning door vooraanstaande politici en
politieke partijen op Twitter. Bachelor Thesis, Universiteit van Amsterdam.
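
The word-list approach described above might look like this in Python. It is a sketch under assumptions, not the student's script: the word lists and opponent names are invented, and counting words anywhere in a tweet that mentions an opponent is a simplification.

```python
import re

# Made-up word lists; the thesis used its own (much longer) lists
POSITIVE = {"good", "great", "strong"}
NEGATIVE = {"bad", "weak", "wrong"}
OPPONENTS = {"smith", "jones"}  # hypothetical opponent names


def tone_towards_opponents(tweet):
    """Count positive and negative words in a tweet that mentions an opponent.

    Returns a (positive, negative) tuple, or None if no opponent is mentioned.
    """
    words = re.findall(r"\w+", tweet.lower())
    if not OPPONENTS.intersection(words):
        return None
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive, negative


print(tone_towards_opponents("Smith is wrong and weak on jobs"))
print(tone_towards_opponents("Great day in Amsterdam"))
```

The resulting counts per tweet can be written to a CSV file and opened in SPSS for the statistical analysis.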

Frame adoption on Twitter

Which phrases used by Merkel and Steinbrück on TV make it
to the #tvduell discussion on Twitter?
Identify frequently used words in the transcript of the debate and
in tweets.
Find co-occurrences.
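
A minimal sketch of these two steps in Python, assuming a made-up transcript, made-up tweets, and a tiny stopword list (none of this is the data or code of the actual study):

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one
STOPWORDS = {"the", "a", "and", "we", "is", "to", "of", "in"}


def tokens(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequent_words(text, n):
    """Return the n most frequent non-stopword tokens of a text."""
    counts = Counter(w for w in tokens(text) if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


# Made-up stand-ins for the debate transcript and the collected tweets
transcript = "We need growth. Growth and jobs. Jobs and growth in Europe."
tweets = [
    "Growth again? #tvduell",
    "She keeps saying jobs #tvduell",
    "Growth, growth, growth #tvduell",
    "Nice weather today #tvduell",
]

debate_words = frequent_words(transcript, n=2)
# For each frequent debate word, count the tweets in which it also occurs
adoption = {w: sum(w in tokens(t) for t in tweets) for w in debate_words}
print(adoption)
```

Words from the debate that show up in many tweets are candidates for phrases that made it into the Twitter discussion.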

The process: collect, store, analyze
A scheme

Our implementation

datacollection.followthenews-uva.cloudlet.sara.nl
yourTwapperkeeper
Continuously calls the Twitter API and saves all
tweets containing specific hashtags to a
MySQL database.

rsshond
Calls the RSS feeds of news sites 1x/hour,
saves title, time, header, and teaser of all new
articles into a CSV table, follows the link to
the full text of each article and downloads it.

snapshot
Visits some URLs 4x/day and downloads
them.
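
To give an idea of what an RSS collector does, here is a minimal sketch using only the Python standard library. It is not the code of rsshond: the sample feed, the function name, and the column order are assumptions for illustration, and the hourly scheduling and the full-text download are left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed standing in for a news site's feed
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example news</title>
<item><title>First story</title><link>http://example.org/1</link>
<pubDate>Thu, 30 Jan 2014 09:30:00 +0100</pubDate>
<description>Teaser one</description></item>
<item><title>Second story</title><link>http://example.org/2</link>
<pubDate>Thu, 30 Jan 2014 10:30:00 +0100</pubDate>
<description>Teaser two</description></item>
</channel></rss>"""


def new_items(feed_xml, seen_links):
    """Return (title, time, teaser, link) rows for items not seen before."""
    rows = []
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link", default="")
        if link in seen_links:
            continue  # already stored during an earlier run
        seen_links.add(link)
        rows.append((
            item.findtext("title", default=""),
            item.findtext("pubDate", default=""),
            item.findtext("description", default=""),
            link,
        ))
    return rows


seen = {"http://example.org/1"}  # pretend the first story was collected an hour ago
rows = new_items(SAMPLE_FEED, seen)

# Write the new rows as CSV (to an in-memory buffer here instead of a file)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Keeping track of the links that were already stored is what makes it safe to call the feed again and again without producing duplicates.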

How to access the collected data?
Apache web server
Download the data from
http://datacollection.followthenews-uva.cloudlet.sara.nl.

SSH (scp)
Transfer data directly to your computer or
another server (like
speeltuin.followthenews-uva.cloudlet.sara.nl)

Beehub
Connect the server to Beehub, which can be
mounted like the "P drive" or accessed online.

Python

One tool to rule them all?

Of course there are ready-made tools for some of the questions we
want to answer. But for many, there aren’t. Python offers us the
possibility to build exactly the tool we need.

And it’s fun!

What is Python?
It is a programming language
• It is flexible. You can use it for (in principle) any kind of data
• There are virtually no limits regarding the amount of data to process
• You can run it on every platform
• And yet it is easy to learn!

It is widely used for content analysis
• Many online resources and toolkits
• Books about NLP and Web Scraping with Python

You do not have to become a
programmer. If you know how to
write SPSS or STATA syntax, you
will understand Python.
(But if you have ever had contact with any programming language,
it helps.)

It’s enough if you can read and
modify the code.

Think of the following task

RQ: What are the differences in terms of actors mentioned
between Israeli and Palestinian news coverage?
1 The data structure: You have a folder with articles
2 The desired output: You want a table with the file names and a column per actor, counting how often they are mentioned
3 A typical task for a short Python script!

You need something like this:
for every file in folder:
    read the file
    count actors
    add new row to table with filename and actor counts
save table
(such a notation is called pseudo-code)

import csv
import re
from os import listdir
from os.path import isfile, join

# Folder containing one text file per article
mypath = r"C:\Users\Ricarda\Documents\Artikelen"
# "Israel", followed later on the same line by one of the actor terms
regex54 = re.compile(r"Israel.*(?:minister|politician.*|[Aa]uthorit)")
filename_list = []
matchcount54_list = []
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for f in onlyfiles:
    matchcount54 = 0
    # Count the matches line by line within this article
    with open(join(mypath, f), "r") as artikel:
        for line in artikel:
            matchcount54 += len(regex54.findall(line))
    filename_list.append(f)
    matchcount54_list.append(matchcount54)
# One row per file: file name and number of matches
output = zip(filename_list, matchcount54_list)
with open("overzichtstabel.csv", "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(output)

This is not too different from a script Jelle uses for his dissertation.
The main difference: He doesn’t code regular expressions, but
calculates document similarity.
slides-jelle.pdf

When to use Python

1st group of tasks

Highly repetitive tasks
Simple tasks (counting things, comparing texts, . . . ) that can be
described in a formalized way. Saves time even with few cases, but
there is virtually no size limit.
Example: Retweets start with RT, optionally followed by a space,
and some letters. So it is very easy to identify them automatically.
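
The retweet rule can be written as a regular expression. This is a minimal sketch with made-up tweets; requiring an @ before the user name is an extra assumption added here so that words like "RTL" are not mistaken for retweets.

```python
import re

# "RT" at the start, an optional space, then @username
# (the @ is an assumption that goes beyond the rule on the slide)
RETWEET = re.compile(r"^RT ?@\w+")


def is_retweet(tweet):
    """Return True if the tweet text looks like a classic 'RT @user' retweet."""
    return RETWEET.match(tweet) is not None


# Made-up example tweets
tweets = [
    "RT @someuser: Hands-on workshop today #bigdata",
    "RT@someuser hello",
    "RTL Nieuws is on air",
    "Just my own tweet",
]
flags = [is_retweet(t) for t in tweets]
print(flags)
```

Once a rule like this works on a handful of tweets, it works just as well on a million of them.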

2nd group of tasks

Tasks for which specific Python modules exist
There are thousands of modules suitable for text analysis. You
basically only have to write code for data input and output.
Example: Sentiment analysis

3rd group of tasks

APIs, RSS, web scraping ...
You can use Python if you want to collect and store information.
Example: Collecting bios of Twitter users, scraping the web (data
journalism!), downloading Facebook data

When not to use Python

Maybe you do not need to write a Python script ...

... when there are already suitable tools available.
Sometimes, the perfect ready-made tool already exists.
But still, sometimes it is more efficient to write something that does exactly
what you want.
Example: Axel Bruns’ awk-scripts for Twitter analysis
(www.mappingonlinepublics.net). If I had to write such a tool, I’d do it in
Python, but hey, he did it already with awk and it works.

And, let’s face it, ...

... we are not programmers.
So maybe, some tasks are too complex for us to program ourselves.
But there is a huge online community that can help you.

Recap
1 Big Data? What are we talking about?
    Exploring the field
    Some examples
2 The process: collect, store, analyze
    A scheme
    Our implementation
3 Python
    What it is
    When to use it
    When not to use it
4 Questions?

After the break

Hands-on! Exploring a basic Python script

Questions or comments?

Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
#bigdata

Damian Trilling

Weitere ähnliche Inhalte

Was ist angesagt?

Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Informatics is a natural science
Informatics is a natural scienceInformatics is a natural science
Informatics is a natural scienceFrank van Harmelen
 
Python and BIG Data analytics | Python Fundamentals | Python Architecture
Python and BIG Data analytics | Python Fundamentals | Python ArchitecturePython and BIG Data analytics | Python Fundamentals | Python Architecture
Python and BIG Data analytics | Python Fundamentals | Python ArchitectureSkillspeed
 
Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisJonathan Stray
 

Was ist angesagt? (20)

BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Informatics is a natural science
Informatics is a natural scienceInformatics is a natural science
Informatics is a natural science
 
Python and BIG Data analytics | Python Fundamentals | Python Architecture
Python and BIG Data analytics | Python Fundamentals | Python ArchitecturePython and BIG Data analytics | Python Fundamentals | Python Architecture
Python and BIG Data analytics | Python Fundamentals | Python Architecture
 
Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 

Ähnlich wie Analyzing social media with Python and other tools (1/4)

Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
OpenFest 2012 : Leveraging the public internet
OpenFest 2012 : Leveraging the public internetOpenFest 2012 : Leveraging the public internet
OpenFest 2012 : Leveraging the public internettkisason
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Python PPT
Python PPTPython PPT
Python PPTEdureka!
 
SuanIct-Bigdata desktop-final
SuanIct-Bigdata desktop-finalSuanIct-Bigdata desktop-final





Analyzing social media with Python and other tools (1/4)

  • 1. Good morning! Enjoy your coffee and install Putty and NotepadPlus via "Software Maintenance/Application Catalogue", as well as the Pattern package (see my e-mail). Thanks.
  • 2. Big Data? What are we talking about? The process: collect, store, analyze Python Questions? Hands-on-Workshop Big (Twitter) Data Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 30 January 2014 9.30 #bigdata Damian Trilling
  • 3.
  • 4. The next one and a half days: You’ll hear about • Collecting social media data via APIs, RSS and scraping (and the tools for it) • Technical infrastructure (via SURFsara) • Python • Sentiment analysis • Automated coding • Frequencies and other statistics • Social network analysis with Gephi • ...
  • 5. In this session (1/4): 1. Big Data? What are we talking about? (Exploring the field; Some examples) 2. The process: collect, store, analyze (A scheme; Our implementation) 3. Python (What it is; When to use it; When not to use it) 4. Questions?
  • 6. Exploring the field: What’s big data? What are we talking about?
  • 7. Today, it’s a hands-on workshop, so let’s keep this important (!) discussion for later.
  • 8. So, no definition, but some brief thoughts: • Existing data (≠ experiments or surveys) • Too big to code manually • Too big to handle with normal tools • New research questions • Call to revisit the relationship between theory and empirical research
  • 9. Today, ... • we are not going to talk about REALLY BIG data, • but we will do some exercises on datasets a normal computer can handle
  • 10. Tomorrow, ... • we will also learn about scaling up these techniques • SURFsara provides infrastructure for this
  • 11. Some sources: • Social network sites • RSS feeds • Databases • Scraping text from the web • ...
  • 12. It’s out there! You only have to collect it.
  • 13. But why should we care? We can answer new questions: • Find needles in haystacks • Identify networks, co-word analysis, linguistic analysis, ... • Verify our theories in larger datasets
  • 14. It makes sense: • There are things that computers are simply better at than humans, e.g. counting things • Having human coders look for words in texts is like calculating a regression analysis by hand
  • 15.
  • 16.
  • 17. Some examples
  • 18. A recent master thesis: The needle in the haystack
  • 19. Imagine you want to analyze some very rare content.
  • 20. Normal sampling won’t work, that’s for sure.
  • 21. So you’d better collect everything first: Getting all news coverage from Dutch news sites. Pöll, B. (2013). Social media: new sources, new profession? A content analysis of the use of social media as a source for journalists in online news articles. Master Thesis, Universiteit van Amsterdam.
  • 22. 1 Collect all articles from nine news sites during a period of two months, resulting in a database with 74,000 articles.
  • 23. 2 Filter articles containing specific keywords.
  • 24. 3 The resulting 292 articles were then manually coded.
  • 25.
  • 26. It’s just one line of code! urls.txt: http://www.gmx.at/themen/wissen/mensch/108g5xi-baeuerlich-schiefe-zaehne http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/408g740-fuermannbittet-um-verzeihung http://www.gmx.at/themen/nachrichten/aufruhr-arabien/268g70u-regierungwill-zuruecktreten http://www.gmx.at/themen/nachrichten/panorama/828g54y-neues-zur-klagegegen-republik http://www.gmx.at/themen/nachrichten/panorama/968g72s-millionstrafewegen-oelpest http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/368g6yc-keinbabybauch-nur-fast-food ... wget command: wget -i urls.txt
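The filter step of the master thesis example (step 2) could be sketched roughly as follows. This is only an illustration: the keywords, the function names and the three toy articles are invented here and are not the search terms or data used in the thesis.

```python
import re

# Hypothetical keywords; the thesis's actual search terms are not on the slides.
KEYWORDS = ["twitter", "facebook", "tweet"]


def build_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def filter_articles(articles, keywords):
    """Return the (name, text) pairs whose text mentions at least one keyword."""
    pattern = build_pattern(keywords)
    return [(name, text) for name, text in articles if pattern.search(text)]


# Invented toy data standing in for the database of collected articles
articles = [
    ("a1.txt", "The minister announced the plan on Twitter."),
    ("a2.txt", "Heavy rain is expected in the north."),
    ("a3.txt", "According to a tweet by the mayor, the bridge is closed."),
]

selected = filter_articles(articles, KEYWORDS)
print([name for name, _ in selected])
```

Only the articles that pass this filter would then go to the human coders, which is what makes a rare-content study of this kind feasible.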
  • 27. A recent bachelor thesis: Tone in tweets
  • 28. Imagine you want to know something about someone’s behavior on Twitter. Or how a specific topic is discussed on Twitter.
  • 29. Do you really want to go through thousands of tweets by hand?
  • 30. So you’d better think about automating your coding: Finding out how negative or positive politicians are towards their opponents. Schut, L. (2013). Verenigde Staten vs. Verenigd Koningrijk: Een automatische inhoudsanalyse naar verklarende factoren voor het gebruik van positive campaigning en negative campaigning door vooraanstaande politici en politieke partijen op Twitter. Bachelor Thesis, Universiteit van Amsterdam.
  • 31. The student took lists with positive and negative words and made additional ones with a politician’s opponents.
  • 32. She used a Python script to check which type of words was used to refer to opponents.
  • 33. For further analysis, the results were imported into SPSS.
  • 34.
  • 35.
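A word-list approach like the one described for the bachelor thesis might look roughly like this. It is a minimal sketch, not the student's script: the word lists, the opponent names and the two example tweets are all invented for illustration.

```python
import re

# Invented word lists; the thesis's actual lists are not shown on the slides.
POSITIVE = {"good", "great", "strong"}
NEGATIVE = {"bad", "weak", "failed"}
OPPONENTS = {"smith", "jones"}  # hypothetical opponent names


def tokenize(text):
    """Lowercase a tweet and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def code_tweet(text):
    """Return (mentions_opponent, n_positive, n_negative) for one tweet."""
    tokens = tokenize(text)
    mentions_opponent = any(t in OPPONENTS for t in tokens)
    n_positive = sum(t in POSITIVE for t in tokens)
    n_negative = sum(t in NEGATIVE for t in tokens)
    return mentions_opponent, n_positive, n_negative


# Invented example tweets
tweets = [
    "Smith failed again. Weak plan, bad numbers.",
    "Great turnout tonight, thank you all!",
]

for tweet in tweets:
    print(code_tweet(tweet))
```

Writing one such tuple per tweet to a CSV file would give a table that can be opened in SPSS for further analysis.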
  • 36. Frame adoption on Twitter: Which phrases used by Merkel and Steinbrück on TV make it to the #tvduell discussion on Twitter? Identify frequently used words in the transcript of the debate and in tweets. Find co-occurrences.
  • 37. Frame adoption on Twitter
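The core idea of the frame adoption example, finding words that are frequent in both the debate transcript and the tweets, can be sketched as below. The stop word list, the threshold `min_count` and the two short texts are assumptions made up for this illustration, not the data or settings of the actual study.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the study's list)
STOPWORDS = {"the", "a", "and", "of", "to", "is", "we", "in"}


def frequent_words(text, min_count=2):
    """Return the set of non-stop-words that occur at least min_count times."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


# Invented stand-ins for the debate transcript and the collected tweets
transcript = "We need fair taxes. Fair taxes mean fair chances."
tweets = "Fair taxes? Taxes again. #tvduell #tvduell"

shared = frequent_words(transcript) & frequent_words(tweets)
print(sorted(shared))
```

Words in `shared` are candidates for phrases that made it from the TV debate into the Twitter discussion; a real analysis would need proper stop word handling and much larger texts.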
  • 38. The process: collect, store, analyze. A scheme
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45. Our implementation: datacollection.followthenews-uva.cloudlet.sara.nl
  • 46. yourTwapperkeeper: Continuously calls the Twitter API and saves all tweets containing specific hashtags to a MySQL database.
  • 47. rsshond: Calls the RSS feeds of news sites 1x/hour, saves title, time, header, and teaser of all new articles into a CSV table, follows the link to the full text and downloads it.
  • 48. snapshot: Visits some URLs 4x/day and downloads them.
  • 49. How to access the collected data?
  • 50. Apache webserver: Download the data from http://datacollection.followthenews-uva.cloudlet.sara.nl.
  • 51. SSH (scp): Transfer data directly to your computer or another server (like speeltuin.followthenews-uva.cloudlet.sara.nl)
  • 52. Beehub: Connect the server to Beehub, which can be mounted like the "p-schijf" (P drive) or accessed online.
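To make the rsshond description more concrete, here is a rough sketch of the "save all new articles into a CSV table" step. It is not the rsshond code: the function name `new_items`, the sample feed and the column order are assumptions, and the feed is parsed from a string so that the example runs without network access.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample feed; a real collector would download this from a news site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example news</title>
  <item><title>First story</title><link>http://example.com/1</link>
    <pubDate>Thu, 30 Jan 2014 09:30:00 GMT</pubDate>
    <description>Teaser one</description></item>
  <item><title>Second story</title><link>http://example.com/2</link>
    <pubDate>Thu, 30 Jan 2014 10:30:00 GMT</pubDate>
    <description>Teaser two</description></item>
</channel></rss>"""


def new_items(feed_xml, seen_links):
    """Return one row per item whose link has not been seen yet."""
    rows = []
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link", default="")
        if link in seen_links:
            continue  # already stored during an earlier call
        seen_links.add(link)
        rows.append([
            item.findtext("title", default=""),
            item.findtext("pubDate", default=""),
            item.findtext("description", default=""),
            link,
        ])
    return rows


seen = set()
rows = new_items(SAMPLE_FEED, seen)

# Write the rows as CSV (here into memory instead of a file)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Calling `new_items` again with the same `seen` set returns no rows, which mirrors the idea of polling a feed every hour and storing only articles that are new.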
  • 53. Python: What it is
  • 54. One tool to rule them all?
  • 55. Of course there are ready-made tools for some of the questions we want to answer. But for many, there aren’t. Python offers us the possibility to build exactly the tool we need.
  • 56. And it’s fun!
  • 57. What is Python? It is a programming language: • It is flexible. You can use it for (in principle) any kind of data • There are virtually no limits regarding the amount of data to process • You can run it on every platform
  • 58. • And yet it is easy to learn!
  • 59. It is widely used for content analysis: • Many online resources and toolkits • Books about NLP and web scraping with Python
  • 60. You do not have to become a programmer.
  • 61. If you know how to write SPSS or STATA syntax, you will understand Python.
  • 62. (But if you have ever had contact with any programming language, it helps.)
  • 63. It’s enough if you can read and modify the code.
  • 67. What it is: Think of the following task. RQ: What are the differences in terms of actors mentioned between Israeli and Palestinian news coverage?
      1. The data structure: You have a folder with articles.
      2. The desired output: You want a table with the file names and a column per actor, counting how often they are mentioned.
      3. A typical task for a short Python script!
  • 68. What it is: You need something like this:
        for every file in folder:
            read the file
            count actors
            add new row to table with filename and actor counts
        save table
    (Such a notation is called pseudo-code.)
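  • 69. What it is:
        import csv
        import re
        from os import listdir
        from os.path import isfile, join

        # Folder with one article per file (raw string keeps the backslashes)
        mypath = r"C:\Users\Ricarda\Documents\Artikelen"
        # "Israel" followed later on the same line by an actor word.
        # The alternatives are grouped with (?:...); the slide's unclosed
        # square bracket would have been read as a character class.
        regex54 = re.compile(r"Israel.*(?:minister|politician.*|[Aa]uthorit)")
        filename_list = []
        matchcount54_list = []
        onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
        for f in onlyfiles:
            matchcount54 = 0
            artikel = open(join(mypath, f), "r")
            for line in artikel:
                matches54 = regex54.findall(line)
                for word in matches54:
                    matchcount54 = matchcount54 + 1
            filename_list.append(f)
            matchcount54_list.append(matchcount54)
            artikel.close()
        # One row per file: file name and number of matches
        output = zip(filename_list, matchcount54_list)
        # Python 3: open in text mode with newline="" (the slide used 'wb')
        with open("overzichtstabel.csv", "w", newline="") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows(output)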
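The desired output described earlier is a table with one column per actor, whereas the script above counts a single pattern. A minimal sketch of the multi-actor version could look like the following. The actor names and regular expressions in `ACTOR_PATTERNS`, and the function names `count_actors` and `build_table`, are hypothetical and only illustrate the idea; they are not part of the workshop material.

```python
import csv
import re
from pathlib import Path

# Hypothetical actor patterns; replace them with your own coding scheme.
ACTOR_PATTERNS = {
    "israeli_government": re.compile(r"Israeli (?:minister|government|authorit)"),
    "palestinian_government": re.compile(r"Palestinian (?:minister|government|authorit)"),
}


def count_actors(text):
    """Return a dict mapping each actor name to its number of mentions."""
    return {name: len(rx.findall(text)) for name, rx in ACTOR_PATTERNS.items()}


def build_table(folder, outfile):
    """Write one row per file in `folder`: file name plus one count per actor.

    `outfile` should live outside `folder`, otherwise the table itself
    would be read as if it were an article.
    """
    actors = list(ACTOR_PATTERNS)
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"] + actors)
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                counts = count_actors(path.read_text(encoding="utf-8"))
                writer.writerow([path.name] + [counts[a] for a in actors])
```

Calling `build_table("Artikelen", "overzichtstabel.csv")` would then produce the table; adding an actor only means adding one entry to the dictionary.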
  • 70. What it is: This is not too different from a script Jelle uses for his dissertation. The main difference: He doesn't code regular expressions, but calculates document similarity (slides-jelle.pdf).
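To give an impression of what "calculating document similarity" can mean, here is a minimal sketch using cosine similarity on raw word counts. This is an assumption chosen for illustration; the slides do not say which similarity measure Jelle's script uses, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased word tokens in a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts based on word counts (0.0 to 1.0)."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words that occur in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Two texts with the same word frequencies score close to 1, texts without any shared word score 0. The loop over files stays the same as in the regular-expression script; only the "count" step is swapped for a comparison.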
  • 71. When to use Python
  • 72. When to use it: 1st group of tasks: highly repetitive tasks. Simple tasks (counting things, comparing texts, ...) that can be described in a formalized way. Saves time even with few cases, but there is virtually no size limit. Example: Retweets start with RT, optionally followed by a space, and some letters. So it is very easy to identify them automatically.
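The retweet rule can be written as a short regular expression. The sketch below assumes that the "letters" after RT are a user name introduced by `@`; read literally, the rule would also match a tweet starting with a word such as "RTL". The pattern and the function names are illustrative assumptions.

```python
import re

# "RT" at the start of the tweet, an optional space, then "@username".
RETWEET_PATTERN = re.compile(r"^RT ?@\w+")


def is_retweet(tweet):
    """Return True if the tweet text looks like a retweet."""
    return RETWEET_PATTERN.match(tweet) is not None


def count_retweets(tweets):
    """Count how many tweets in a list look like retweets."""
    return sum(is_retweet(t) for t in tweets)
```

Because the rule is fully formalized, the same two functions work for ten tweets or ten million.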
  • 73. When to use it: 2nd group of tasks: tasks for which specific Python modules exist. There are thousands of modules suitable for text analysis. You basically only have to write code for data input and output. Example: Sentiment analysis.
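The division of labour in such a script can be sketched as follows. Note that `toy_sentiment` is only a stand-in with a made-up three-word lexicon so that the example runs without extra installs; in practice, the scoring function of a real sentiment module would take its place, and only the input and output code around it would be your own.

```python
import csv
import io

# Made-up mini lexicon, purely for illustration.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def toy_sentiment(text):
    """Stand-in scorer: positive words minus negative words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def score_texts(texts, out):
    """Input/output part: score each text and write the results as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["text", "sentiment"])
    for text in texts:
        writer.writerow([text, toy_sentiment(text)])
```

`out` can be any open text file or an `io.StringIO` buffer, for example `score_texts(tweets, open("sentiment.csv", "w", newline=""))`.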
  • 74. When to use it: 3rd group of tasks: APIs, RSS, web scraping, ... You can use Python if you want to collect and store information. Example: Collecting bios of Twitter users, scraping the web (data journalism!), downloading Facebook data.
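As a small illustration of the collecting step, the sketch below extracts headlines and links from an RSS feed using only Python's standard library. The feed content in `SAMPLE_RSS` is invented, and the function name `parse_rss` is hypothetical; in a real project, the XML text would be downloaded from the feed's URL first.

```python
import xml.etree.ElementTree as ET

# Invented example feed; a real one would be downloaded from a news site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example feed</title>
<item><title>First headline</title><link>http://example.com/1</link></item>
<item><title>Second headline</title><link>http://example.com/2</link></item>
</channel></rss>"""


def parse_rss(xml_text):
    """Return a list of dicts with the title and link of every feed item."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]
```

The resulting list of dictionaries can then be stored, for instance as rows in a CSV file, before any analysis starts.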
  • 75. When not to use Python
  • 77. When not to use it: Maybe you do not need to write a Python script when there are already suitable tools available. Sometimes, the perfect ready-made tool already exists. Example: Axel Bruns' awk scripts for Twitter analysis (www.mappingonlinepublics.net). If I had to write such a tool, I'd do it in Python, but hey, he did it already with awk and it works. But still, sometimes it is more efficient to write something that does exactly what you want.
  • 79. When not to use it: And, let's face it, we are not programmers. So maybe some tasks are too complex for us to program ourselves. But there is a huge online community that helps you.
  • 80. Recap
      1. Big Data? What are we talking about? (Exploring the field; Some examples)
      2. The process: collect, store, analyze (A scheme; Our implementation)
      3. Python (What it is; When to use it; When not to use it)
      4. Questions?
  • 81. After the break: Hands-on! Exploring a basic Python script.
  • 82. Questions or comments? Damian Trilling, d.c.trilling@uva.nl, @damian0604, www.damiantrilling.net