Twitter analytics: some thoughts on sampling, tools, data, ethics and user requirements

Farida Vis, Information School
University of Sheffield
@flygirltwo
Keynote SRA Social Media in Social Research conference, London, 24 June 2013.

READING THE RIOTS ON TWITTER
Rob Procter (University of Manchester)
Farida Vis (University of Leicester)
Alexander Voss (University of St Andrews)
[Funded by JISC]
#readingtheriots

What role did social media play?
2.6 million riot tweets (donated by Twitter)
– 700,000 individual accounts
Initially:
o Role of Rumours
o Did incitement take place? [no - #riotcleanup]
o What is the role of different actors on Twitter?

Guardian Interactive Team (Alastair Dant)
http://www.guardian.co.uk/uk/interactive/2011/dec/07/london-riots-twitter
Data Journalism Award (sponsored by Google)

• Lots of questions about methods
• Lots of questions about our tools
• Lots of questions about donated data
• Lots of questions about ethics

Actor Types – top 1000 mentions
Typical long tail distribution
Twitter researchers tend to focus on the head
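
A minimal sketch of how the head of a mention distribution could be separated from its long tail. The sample tweets and account names are invented for illustration, and `count_mentions` and `head_share` are hypothetical helpers, not part of the Reading the Riots tooling.

```python
import re
from collections import Counter

# Matches @username; a simplification of Twitter's real mention rules.
MENTION = re.compile(r"@(\w+)")

def count_mentions(tweets):
    """Count how often each account is @mentioned (case-insensitive)."""
    counts = Counter()
    for text in tweets:
        counts.update(name.lower() for name in MENTION.findall(text))
    return counts

def head_share(counts, top_n):
    """Share of all mentions received by the top_n most-mentioned accounts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head = sum(n for _, n in counts.most_common(top_n))
    return head / total

# Invented sample data, for illustration only.
sample_tweets = [
    "RT @newsdesk: police lines forming",
    "@newsdesk is this confirmed?",
    "thanks @newsdesk and @localreporter",
    "@a_neighbour stay safe",
    "cc @another_user",
]
mention_counts = count_mentions(sample_tweets)
print(head_share(mention_counts, 1))
```

Comparing `head_share` for several values of `top_n` gives a quick sense of how much of the activity sits in the tail that a head-only analysis would miss.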

Actor Types
Mainstream media
Only online media (news)
Non-(news) mainstream media
Journalists (mainstream media)
Journalists (online media)
Non-(news) media organisations
Bloggers
Activists
UK Twitterati
Political actors
Police/emergency services
Riot accounts
Celebrities
Researchers
Members of the public
Bots
Unclear
Account closed down
Fake/spoof account
Other
http://researchingsocialmedia.org/2012/01/24/reading-the-riots-on-twitter-who-tweeted-the-riots/

Who tweeted the riots? - categories
mainstream media
journalists
riot accounts

You know you’re dealing with Twitter data when…
Number 13, 6697 mentions

Individual accounts with > 3K mentions

30031 mentions, 441 tweets sent over 4 days: top UK listed journalist (2)
3484 mentions, 290 tweets sent over 4 days: top non UK listed journalist (34)

Image sharing practices during crises

400 million tweets/day (March 2013)
40 million Instagram images/day (January 2013)
Percentages posted to Twitter / Facebook
-> 59% posted to Twitter
-> 98% posted to Facebook

Where do images fit in the era of ‘Big Data’?

Big Data – text + number driven
Images: undervalued, underexplored
Not by the users

Deleted content
http://twitpic.com/62m6nx

#FakeSandy pics
250,000 tweets (4hrs)
1 weekend
http://istwitterwrong.tumblr.com/
Jean Burgess
Farida Vis
Axel Bruns

‘fakes’
http://www.guardian.co.uk/news/datablog/2012/nov/06/fake-sandy-pictures-social-media

Twitter handles
MPSBarkDag
MPSBarnet
MPSBexley
MPSBrent
MPSBromley
MPSCamden
metpoliceuk
MPSWestminster
MPSCroydon
EalingMPS
MPSEnfield
MPSGreenwich
MPSHackney
MPSHammFul
MPSHaringey
MPSHarrow
MPSHavering
MPSHillingdon
MPSHounslow
MPSIslington
MPSKenChel
MPSKingston
LambethMPS
MPSLewisham
MPSMerton
MPSNewham
MPSRedbridge
MPSRichmond
MPSSouthwark
MPSSutton
MPSTowerHam
MPSWForest
MPSWandsworth
Plus:
@MetPoliceEvents (Updates from the Met Police regarding demonstrations & events in London)
@MPSOnTheStreet (An official MPS account giving an officer on the ground's view of events, operations and other policing activities in London)
@MPSDoI (Updates from the Metropolitan Police Service, Directorate of Information)
Police tweets

Collecting the data
Scraper by Jacopo Ottaviani
URL for the scraper: https://scraperwiki.com/scrapers/police_and_the_olympics_2012/
ScraperWiki is a key data-driven journalism (DDJ) site

Datajournalismhandbook.org
Reference point 1

Data challenges
• Collecting Twitter data in (real) time (APIs)
• Methods for building a reliable corpus
• Problems with language bias
• Problems with hashtag/keyword bias
• API bias
• Demographics of Twitter users – who are they?
• Problems with escalating volume
• Mapping explosion of new tools: are they any good?
• Off the shelf tools (growing divide in research capacity in this area)
• Limitations of the tools
• Problems with data sharing / replicating studies + findings

Data challenge 1: Know your API

See: https://dev.twitter.com/start

1% random sample of the firehose
If a collection is not rate limited, all matching data may be collected
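
A rough sketch of how one might check whether a captured stream was rate limited. It assumes the capture is stored as one JSON message per line and that rate-limit notices have the shape `{"limit": {"track": N}}`, with N a running total of undelivered tweets, as the v1.1 streaming documentation described. The `summarise_stream` helper and the sample lines are hypothetical.

```python
import json

def summarise_stream(lines):
    """Return (tweets_received, tweets_withheld) for a captured stream."""
    received = 0
    withheld = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive lines carry no data
        message = json.loads(line)
        if "limit" in message:
            # Assumed notice shape: {"limit": {"track": N}}, where N is a
            # running total, so keep the largest value seen.
            withheld = max(withheld, message["limit"].get("track", 0))
        elif "text" in message:
            received += 1
    return received, withheld

# Invented capture, for illustration only.
captured = [
    '{"text": "first tweet"}',
    "",
    '{"limit": {"track": 5}}',
    '{"text": "second tweet"}',
    '{"limit": {"track": 12}}',
]
received, withheld = summarise_stream(captured)
coverage = received / (received + withheld)
print(received, withheld, round(coverage, 3))
```

If no limit notices appear, the capture may well contain every matching tweet; if they do, `coverage` gives a first estimate of how much is missing.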

We collect and analyse messages exchanged in Twitter using two of
the platform's publicly available APIs (the search and stream
specifications). We assess the differences between the two samples,
and compare the networks of communication reconstructed from them.
The empirical context is given by political protests taking place in May
2012: we track online communication around these protests for the
period of one month, and reconstruct the network of mentions and re-
tweets according to the two samples. We find that the search API over-
represents the more central users and does not offer an accurate
picture of peripheral activity; we also find that the bias is greater for the
network of mentions. We discuss the implications of this bias for the
study of diffusion dynamics and collective action in the digital era, and
advocate the need for more uniform sampling procedures in the study
of online communication.
(González-Bailón et al., 2012)

Data challenge 3: rate limiting + 1%

Random sampling with the streaming API: the 1%
‘If we estimate a daily tweet volume of 450 million tweets (Farber), this
would mean that, in terms of standard sampling theory, the 1%
endpoint would provide a representative and high resolution sample
with a maximum margin of error of 0.06 at a confidence level of 99%,
making the study of even relatively small subpopulations within that
sample a realistic option.’
(Gerlitz and Rieder, 2013)
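
The quoted margin of error can be approximated with the textbook formula for a proportion under simple random sampling. This is a sketch under stated assumptions: the worst case p = 0.5, a finite population correction, and the 450 million daily tweets from the quote. Whether Gerlitz and Rieder used exactly this formula is an assumption, and `margin_of_error` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sample_size, population_size, confidence=0.99, p=0.5):
    """Margin of error for a proportion under simple random sampling."""
    # Two-sided critical value, about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p * (1 - p) / sample_size)
    # Finite population correction; close to 1 when the sample is small
    # relative to the population.
    fpc = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

daily_tweets = 450_000_000      # estimate quoted above
sample = daily_tweets // 100    # the 1% endpoint
moe = margin_of_error(sample, daily_tweets)
print(f"{moe * 100:.2f} percentage points")  # prints "0.06 percentage points"
```

Under these assumptions the result is about 0.06 percentage points, which matches the quoted figure if that figure is read as a percentage. The calculation only holds if the 1% endpoint really is a uniform random sample.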

Data challenge 4: relation to firehose?

‘The essential drawback of the Twitter API is the lack of documentation
concerning what and how much data users get. This leads researchers
to question whether the sampled data is a valid representation of the
overall activity on Twitter. In this work we embark on answering this
question by comparing data collected using Twitter’s sampled API
service with data collected using the full, albeit costly, Firehose stream
that includes every single published tweet.’
(Morstatter et al, 2013)

Data challenge 5: relation to ‘general public’?

Data challenge 6: what data to collect?

For hashtag datasets: contributions made by specific users and
groups of users; overall patterns of activity over time;
combinations to examine contributions by specific users and
groups over time. (Bruns and Stieglitz, 2013)
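
A small sketch of the three kinds of metric listed above: activity per user, activity over time, and the two combined. The records are invented, and the function names are hypothetical rather than taken from Bruns and Stieglitz.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Invented (user, ISO timestamp) records, for illustration only.
tweets = [
    ("alice", "2013-02-10T09:15:00"),
    ("bob", "2013-02-10T09:40:00"),
    ("alice", "2013-02-10T10:05:00"),
    ("alice", "2013-02-11T08:00:00"),
    ("carol", "2013-02-11T08:30:00"),
]

def day_of(timestamp):
    """Reduce an ISO timestamp to its calendar day (YYYY-MM-DD)."""
    return datetime.fromisoformat(timestamp).date().isoformat()

def per_user(records):
    """Contributions made by each user."""
    return Counter(user for user, _ in records)

def per_day(records):
    """Overall pattern of activity over time, bucketed by day."""
    return Counter(day_of(ts) for _, ts in records)

def per_user_per_day(records):
    """Contributions by each user within each day."""
    table = defaultdict(Counter)
    for user, ts in records:
        table[day_of(ts)][user] += 1
    return table

print(per_user(tweets).most_common(1))
print(sorted(per_day(tweets).items()))
```

Grouping users into categories (for example by actor type) before counting would give the per-group versions of the same metrics.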

Data challenge 6: how to collect the data?

Recent explosion in Twitter tools
• Twitonomy
• ScraperWiki
• TAGS
• DMI Twitter Capture and Analysis Toolset
• Mozdeh (and Webometric Analyst)
• NVivo 10
• YourTwapperKeeper

#horsemeat still producing data in June!

Collects up to 8000 tweets based on hashtags/keywords/users

DMI Twitter Capture and Analysis Toolset

DMI tools for extracting links (all the URLs)
Most URLs are shortened, mainly using t.co (Twitter). Unpack them using:
This didn’t always work, so some manual unpacking and note taking was needed (plus you still
have the shortened URL in case you want to retrace it).
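
A minimal sketch of unpacking shortened links with the Python standard library. It is an illustration, not the tool used in the talk: `extract_urls` and `expand_url` are hypothetical helpers, and the example text is invented. Keeping the short URL next to the resolved one preserves the ability to retrace it.

```python
import re
import urllib.request

URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(text):
    """Pull URLs out of tweet text, trimming trailing punctuation."""
    return [u.rstrip(".,;:!?)\"'") for u in URL_PATTERN.findall(text)]

def expand_url(short_url, timeout=10):
    """Follow redirects and return (short_url, final_url).

    final_url is None when the link cannot be resolved, so the failure
    can be noted and the link unpacked by hand later.
    """
    try:
        # urlopen follows HTTP redirects; the response body is not read.
        with urllib.request.urlopen(short_url, timeout=timeout) as response:
            return short_url, response.geturl()
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return short_url, None

example = "see http://t.co/abc123, and https://t.co/xyz."
print(extract_urls(example))
```

Calling `expand_url` on each extracted link needs network access and will return `None` for dead or deleted targets, which is one reason automated unpacking does not always work.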

MOZDEH (and Webometric Analyst)

Data challenge 7: how to analyse the data?

What to do about all those bots?

Data collected + methods used produce a specific research object

Data challenge 8: representing your data?

Data visualisations: what are they and what do they want?

Data challenge 9: how to deal with ethics?

Data challenge 10: user requirements?

What do we want from these APIs, the data,
the tools, and Twitter researchers so that we
can develop more robust social scientific
research on Twitter?

References
• Bruns, A., and Stieglitz, S. 2013. Towards More Systematic Twitter Analysis: Metrics
for Tweeting Activities. International Journal of Social Research Methodology.
DOI:10.1080/13645579.2013.770300 Available from:
http://snurb.info/files/2013/Towards%20More%20Systematic%20Twitter%20Analysis%20(final).pdf
• Gerlitz, C. & Rieder, B. 2013. Mining One Percent of Twitter: Collections, Baselines,
Sampling. M/C Journal, Vol. 16, No 2. Available from:
http://journal.media-culture.org.au/index.php/mcjournal/article/viewArticle/620
• González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. 2012.
Assessing the Bias in Communication Networks Sampled from Twitter. Available
from: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2185134
• Morstatter, F., Pfeffer, J., Liu, H., & Carley, K.M. 2013. Is the Sample Good Enough?
Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. Association for
the Advancement of Artificial Intelligence. Available from:
http://www.public.asu.edu/~fmorstat/paperpdfs/icwsm2013.pdf
• Vis, F. 2012. Twitter as a reporting tool for breaking news: journalists tweeting the
2011 UK riots, Digital Journalism 1(1). Available from:
http://www.tandfonline.com/doi/full/10.1080/21670811.2012.741316#.UcwBZ-CPDao
• Vis, F., Faulkner, S., Parry, K., Manyukhina, Y., and Evans, L. (in press), Twitpic-ing
the riots: analysing images shared on Twitter during the 2011 UK riots, in Twitter and
Society, Weller, K., Bruns, A., Burgess, J., Mahrt, M., and Puschmann, C. (eds.), New
York: Peter Lang.

Links to all mentioned tools
• Twitonomy - http://www.twitonomy.com/
• ScraperWiki - https://beta.scraperwiki.com/
• TAGS - http://mashe.hawksey.info/2013/02/twitter-archive-tagsv5/
• DMI Twitter Capture and Analysis Toolset - https://wiki.digitalmethods.net/Dmi/ToolDmiTcat
• Mozdeh (and Webometric Analyst) - http://mozdeh.wlv.ac.uk/ + http://lexiurl.wlv.ac.uk/
• NVivo 10 - http://www.qsrinternational.com/products_nvivo.aspx
• YourTwapperKeeper - https://github.com/540co/yourTwapperKeeper
See also: http://mappingonlinepublics.net/tag/yourtwapperkeeper/
