Social media presentation held at RC33 conference, Sydney, Australia
1. Analyzing Twitter data
Issues, Challenges and Opportunities
RC33 Conference, Sydney Australia,
9-13 July 2012
Maurice Vergeer
m.vergeer@maw.ru.nl / www.mauricevergeer.nl / blog.mauricevergeer.nl
Radboud University Nijmegen, the Netherlands
2. Many platforms
- Facebook
- Twitter
- LinkedIn
- Hyves
- RenRen
- Cyworld
- Orkut
- YouTube
- Flickr
- Plurk
- Sina Weibo
- Etc
Empty platform / infrastructure
- Facility
User generated content
- Text
- Audio
- Video
- Pictures
Social media
3. Number of articles on politics, Internet and social media
[Line chart: number of articles per year, 1995 to 2012 (y-axis 0 to 180), for three queries: Internet and politics (query 1); Social media and politics (query 2); Internet, social media and politics (query 3)]
Source: Vergeer (in press / 2012) in New Media & Society
7. Opportunities
◦ Methodological/technical
Timeseries analysis
Network analysis
◦ Actors
◦ Content
◦ Diffusion of information through online social networks
◦ Social media activities
Limitations
◦ Twitter
Reliability of Twitter API
Outline
8. • Within Twitter (using the API)
• Username
• Account creation date
• # of followers
• And the actual usernames of these followers
• # of accounts followed
• And the actual usernames of those being followed
• Tweet text
• And many more (see dev.twitter.com)
Data sources
9. Tweet
◦ Tweet text
◦ Whether or not it was a reply to another tweet
To whom it was a reply (username/screenname and numerical userid)
◦ Whether or not it was a retweet (according to Twitter)
Which tweet was retweeted (numerical tweetid)
10. Message of tweet
Whether or not it was a directed tweet (sent to someone in particular)
◦ Identified by an @-sign
Whether or not it was a retweet
◦ Identified by RT
Type of content
11. Undirected tweet
◦ RCMP Commissioner appearing before Public Safety Cmte now.
What a popular guy - he has his own paparazzi!
Directed tweet
◦ Fantastic blog by my good friend @GlenPearson -
http://bit.ly/hlAKXp #lpc
Directed tweet to two usernames
◦ @miken32 @CBCEdmonton probably because that is NOT what I
said--more commercially viable is different than not needed.
Retweet
◦ RT @liberal_party: Think Durham deserves better than Bev Oda?
Join @BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham
#cdnpoli #lpc
Tweet examples
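The tweet types on this slide can be told apart from the tweet text alone. The following is a minimal sketch in Python, not the author's own procedure: the two regular expressions and the function name `classify_tweet` are assumptions made for illustration. It treats a tweet that starts with "RT @user" (upper or lower case) as a retweet, any other tweet containing an @-address as directed, and everything else as undirected.

```python
import re

# Assumed patterns (not taken from the slides):
# a retweet starts with "RT @user"; a directed tweet contains an @-address.
RETWEET = re.compile(r"^\s*RT\s+@\w+", re.IGNORECASE)
MENTION = re.compile(r"(?<!\w)@\w+")  # the lookbehind skips e-mail addresses


def classify_tweet(text):
    """Return 'retweet', 'directed' or 'undirected' based on the text only."""
    if RETWEET.search(text):
        return "retweet"
    if MENTION.search(text):
        return "directed"
    return "undirected"


# The four example tweets from the slide.
examples = [
    "RCMP Commissioner appearing before Public Safety Cmte now. "
    "What a popular guy - he has his own paparazzi!",
    "Fantastic blog by my good friend @GlenPearson - http://bit.ly/hlAKXp #lpc",
    "@miken32 @CBCEdmonton probably because that is NOT what I said--more "
    "commercially viable is different than not needed.",
    "RT @liberal_party: Think Durham deserves better than Bev Oda? Join "
    "@BobRaeMP for a rally tomorrow at 1pm http://lpc.ca/durham #cdnpoli #lpc",
]

for tweet in examples:
    print(classify_tweet(tweet), "|", tweet[:45])
```

Checking for the retweet pattern first matters: a retweet also contains @-addresses and would otherwise be counted as a directed tweet.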
13. Traditional material
◦ Produced by professional actors
◦ Newspapers
◦ Public administration documents
Social media
◦ Produced by
professional actors
general public
Content analysis of tweets
14. Large quantities of data
Word frequencies
◦ Identifying the most important words in the corpus
◦ Code these words into more general categories
Switch to SPSS (or other type of data management tool)
◦ Search for the words in the actual tweets
◦ Assign tweet to a specific code
Improvements in SPSS
◦ Compute command facilitates many new text operators
◦ Char.index, Char.substr, etc
Alternative
◦ Regular expressions
◦ complex
Data extraction
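The same two steps (word frequencies first, then assigning tweets to codes by searching for words) can also be sketched outside SPSS. The example below is an illustration in Python under stated assumptions: the three tweets, the stop word list and the codebook are made up, and the names `tokenize`, `word_frequencies` and `code_tweet` are hypothetical helpers, not part of any package.

```python
import re
from collections import Counter

# Hypothetical mini corpus; in practice this would be the collected tweets.
tweets = [
    "Join us for a rally tomorrow, every vote counts!",
    "New jobs plan: lower tax for small business",
    "Great weather in Ottawa today",
]

# Assumed stop word list and codebook, for illustration only.
STOPWORDS = {"a", "for", "in", "us", "the", "every"}
CODEBOOK = {
    "campaign": {"rally", "vote", "election"},
    "economy": {"jobs", "tax", "business"},
}


def tokenize(text):
    """Lower-case the text, drop URLs, and split into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[#@]?\w+", text)


def word_frequencies(corpus):
    """Step 1: count how often each non-stop word occurs in the corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    return counts


def code_tweet(text, codebook):
    """Step 2: return the codes whose keywords occur in the tweet."""
    tokens = set(tokenize(text))
    return sorted(code for code, words in codebook.items() if tokens & words)


freq = word_frequencies(tweets)
print(freq.most_common(5))
for text in tweets:
    print(code_tweet(text, CODEBOOK), "|", text)
```

Matching on whole tokens rather than substrings avoids false hits such as "tax" inside "taxi"; a substring search, as with Char.index, would need extra care there.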
15. Publicly available data sources on
parliament, election council
Time series
◦ Identifying relevant societal/political events
relevant for the study at hand
Ex. 1: temporary shutdown of the election campaign due to the crash of a passenger plane carrying many Dutch passengers in Libya, May 2010
Ex. 2: deregistration of the People's Political Power Party of Canada
External data sources
16. [Bar chart, y-axis 0 to 900; categories: newspaper, broadcasting, radio, news agency, magazine, online only, local; series: institutional Twitter account, personal Twitter account]
17. Source: Vergeer & Hermans (forthcoming / 2013)
in Journal of Computer-Mediated Communication
20. Date and time
For longitudinal analysis and cross-national comparisons
◦ take note of the time differences and correct if necessary.
Time zones
Daylight saving time
What to do with countries having multiple time zones?
◦ Depends on RQs
Communication patterns: keep a single time zone
Focus on individual daily patterns: adjust for time zones
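Adjusting for time zones and daylight saving time can be sketched as follows. This is an illustration in Python, assuming the tweet timestamps are stored in UTC as "YYYY-MM-DD HH:MM:SS" strings (an assumption; the slides do not specify the storage format). It uses the standard-library `zoneinfo` module (Python 3.9+), which relies on a time zone database being available on the system; the helper names `parse_utc` and `to_local` are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def parse_utc(timestamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string that is assumed to be in UTC."""
    naive = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


def to_local(dt_utc, tz_name):
    """Convert a UTC datetime to local time, including daylight saving."""
    return dt_utc.astimezone(ZoneInfo(tz_name))


tweet_time = parse_utc("2012-07-09 22:30:00")

# Individual daily patterns: express each tweet in the sender's local time.
print(to_local(tweet_time, "Australia/Sydney"))
print(to_local(tweet_time, "Europe/Amsterdam"))

# Communication patterns: keep every tweet on a single clock (here UTC).
print(tweet_time)
```

For a country with multiple time zones, the second argument would have to come from per-user location data, which is a separate problem not covered here.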
21. Total tweets by candidates, followers and followed:
◦ 4,536,854 tweets
Breakdown
◦ Tweets among candidates: approx. 2%
◦ Tweets to inner circles (followers or being followed): approx. 18%
◦ Tweets to outer circle: approx. 33%
◦ Tweets not directed to anyone in particular: approx. 49%
◦ Extracting users from tweets (@-addresses)
Communication network analysis
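Extracting users from tweets yields the edge list for a communication network. The sketch below is one possible way to do this in Python, not the procedure used for the figures in this presentation: the account names and tweets are made up, and `mention_edges` and `restrict_to` are hypothetical helpers. Each edge runs from the sender to an addressed user, weighted by the number of tweets; `restrict_to` keeps only edges among a given set of actors, for example candidates only.

```python
import re
from collections import Counter

MENTION = re.compile(r"(?<!\w)@(\w+)")  # captures the name after the @-sign


def mention_edges(tweets):
    """Count directed edges (sender, addressed user) over all tweets."""
    edges = Counter()
    for sender, text in tweets:
        sender = sender.lower()
        for target in MENTION.findall(text):
            target = target.lower()
            if target != sender:  # ignore self-mentions
                edges[(sender, target)] += 1
    return edges


def restrict_to(edges, actors):
    """Keep only edges whose sender and receiver are both in `actors`."""
    actors = {a.lower() for a in actors}
    return Counter({
        (s, t): n for (s, t), n in edges.items()
        if s in actors and t in actors
    })


# Hypothetical data: (sender, tweet text).
tweets = [
    ("cand_a", "@cand_b @voter1 thanks for the questions"),
    ("cand_b", "Great debate with @cand_a tonight"),
    ("cand_a", "@cand_b see you tomorrow"),
]

all_edges = mention_edges(tweets)
candidate_edges = restrict_to(all_edges, {"cand_a", "cand_b"})
print(all_edges)
print(candidate_edges)
```

The resulting weighted edge list can be exported to whatever network analysis tool is preferred.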
22. Communication network based on
candidates identified in tweets
Excluding the general public
Communication network analysis
24. See http://tinyurl.com/blzajsl for animated version.
25. Retrospective
◦ Only 3,200 tweets back in time per user
Cost / technical
◦ Access to the firehose for real-time data
Limitations in data collection
26. Date of tweet
◦ A minute fraction of tweets is time-stamped with the wrong date
Solution
◦ Estimate date and time using the tweetid
Status of tweet as retweet
◦ RT
Solution:
Use text search operators to identify real retweets (“RT ”, “rt ”)
Also see http://tinyurl.com/bohhjzn
Reply to tweets
◦ Only the first address is identified
Solution
◦ Search for multiple @-addresses using text extraction methods
Reliability of data as provided by
the API
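One possible way to estimate date and time from the tweetid is sketched below in Python. The slides do not say which method the author used, so this is an assumption: it relies on Twitter's Snowflake ID scheme (in use since November 2010), in which the bits above the lowest 22 hold the number of milliseconds since a fixed Twitter epoch. It does not apply to older, sequential tweet IDs, and the function name `tweet_id_to_datetime` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Snowflake epoch: 2010-11-04 01:42:54.657 UTC, in ms since the Unix epoch.
TWITTER_EPOCH_MS = 1288834974657
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tweet_id_to_datetime(tweet_id):
    """Estimate the creation time (UTC) encoded in a Snowflake tweet ID.

    Only valid for Snowflake-era IDs; the lowest 22 bits (machine and
    sequence numbers) are discarded.
    """
    ms_since_unix_epoch = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return UNIX_EPOCH + timedelta(milliseconds=ms_since_unix_epoch)


# Round-trip check with a constructed ID (not a real tweet).
moment = datetime(2012, 7, 9, 12, 0, 0, tzinfo=timezone.utc)
ms = int((moment - UNIX_EPOCH).total_seconds() * 1000)
fake_id = (ms - TWITTER_EPOCH_MS) << 22
print(tweet_id_to_datetime(fake_id))
```

Comparing this estimate with the date supplied by the API is one way to flag the small share of tweets with a wrong time stamp.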
28. Not gigabytes, not terabytes,
but petabytes and exabytes of data
29. Only for the few
Specific hardware requirements
◦ Computing power
◦ Data storage
The data presented in this presentation
◦ Approx. 4.5 million records equal approx. 1 gigabyte: not that Big
31. Political communication
◦ politicians, candidates in elections
Fan studies
◦ celebrities
◦ cast of popular soap operas
Journalism studies
◦ journalists and newspapers
Focus on specific cases
32. actor information
information on societal events
accumulate data over time using the
same data structure
◦ Prolonged analysis
◦ Multiple case studies, cross-national comparative analysis
Enrich existing Twitter data with
external data
33. Traditional process (textbook approach)
◦ RQ -> research design
Practice, particularly with secondary (i.e. third party) data
◦ Data -> RQ -> research design
◦ Data -> research design -> RQ
Twitter
Content analysis
Longitudinal analysis
Network analysis
Different research designs require different techniques
Collaborate
Look at the data from different
angles, i.e. research designs