Invited talk given at the QUEST (Qualitative Expertise at Southampton, http://www.quest.soton.ac.uk/) group event (http://www.quest.soton.ac.uk/training/) on Qualitative Methods and Big Data.
The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity
1. The Web Science MacroScope: Mixed-methods Approach for
Understanding Web Activity
Markus Luczak-Roesch (some slides based on work of Ramine Tinati)
University of Southampton (UK), Web and Internet Science Group
@mluczak | http://markus-luczak.de
Image source: https://en.wikipedia.org/wiki/File:Compound_Microscope_(cropped).JPG, CC BY-SA 4.0
3. Essential part of the data science story of the
WWW: Web Observatories
Data Sources
Challenges:
- Who are the providers?
- Is the service reliable/stable?
Data Collection
Challenges:
- API Limitations/Restrictions
- Data Schemas/Consistency
- Does it change over time?
Data Storage
Challenges:
- Storage approaches (relational, flat, linked?)
Data Analysis and Modelling
Challenges:
- What methods/models?
- How is the data sampled?
Data Visualisation
Challenges:
- Misrepresentation of data? e.g. visualise “filtered” data
Data Querying and Transformation
Statistical and computational analysis
Methods
Data Interpretation
Challenges:
- Are the questions being asked relevant to the data?
- Are insights being fed back into the analysis?
Feedback loops:
- Add or update initial stored data
- Update current harvesting strategy (req. for real-time analysis)
Image source: https://en.wikipedia.org/wiki/File:Sphinx_Observatory.jpg, CC BY-SA 2.0
4. What to observe? Social Machines!
“Real life is and must be full of all kinds of social constraint – the very processes from which society arises. Computers can help if we use them to create abstract social machines on the Web: processes in which the people do the creative work and the machine does the administration.”
Berners-Lee, Tim; Mark Fischetti (1999). Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its inventor. Britain: Orion Business. ISBN 0-7528-2090-7.
5. Topic outbreaks across systems
Peak in tweets containing topic ‘x’
Peak in Wikipedia views of articles ‘x’
‘Lag diffusion’
Axis: time
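The lag between the two peaks can be made concrete with a small sketch. It assumes each system's activity is available as a mapping from timestamp to count; the hourly counts and the function names `peak_time` and `lag_between_peaks` are invented for illustration and are not taken from the talk.

```python
from datetime import datetime, timedelta


def peak_time(series):
    """Return the timestamp with the highest count in a {timestamp: count} mapping."""
    return max(series, key=series.get)


def lag_between_peaks(leading, following):
    """Time from the peak of `leading` to the peak of `following`."""
    return peak_time(following) - peak_time(leading)


# Hypothetical hourly counts for one topic 'x' (made-up numbers).
start = datetime(2015, 1, 1)
tweets = {start + timedelta(hours=h): c
          for h, c in enumerate([2, 5, 40, 12, 6, 3])}
views = {start + timedelta(hours=h): c
         for h, c in enumerate([10, 11, 15, 30, 90, 35])}

lag = lag_between_peaks(tweets, views)
print(lag)  # 2:00:00
```

Comparing single peaks is the simplest possible reading of the diagram; a cross-correlation over the whole series would be a more robust alternative when the signals are noisy.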
7. Participation in Citizen Science projects by
communication patterns
[Excerpt from the paper cited below, cropped at the left margin on the slide. Legible fragments mention 835,732 chat messages, 10,714 ’active’ players, and statistics described in Table 2.]
Figure 4: Distribution of games, chat messages, and account
durations (games and chat) for all EyeWire players.
Figure 5: Timeline of chat and gaming activity for the EyeWire
platform.
5.1.1 Player Cohorts
As shown in Figure 4, the analysis of chat and gaming account
duration reveals that for gaming activity, there are many players
Stage                 Criteria
Before Game (Q0)      30s < Game Start
Start of Game (Q1)    Game Start < x < 1st Quartile Game Duration
During Game (Q2-3)    1st Quartile Game Duration < x < 3rd Quartile Game Duration
End of Game (Q4)      3rd Quartile Game Duration < x < Game End
After Game (Q5)       30s < Game End
Table 1: Chat Message Stages: Boundary Conditions
Tinati, R., Luczak-Rösch, M., Simperl, E., Hall, W., & Shadbolt, N. (2015, May). ‘/Command’ and conquer: analysing discussion in a citizen science game. In ACM Web Science Conference 2015.
[Excerpt from the same paper describing the EyeWire interface, cropped at the left margin on the slide. Figure caption: Main Interface in EyeWire. Legible fragments mention a real-time chat, game commands issued with a slash (‘/’) such as ‘/silence’ and the group message command (‘/gm’), competitions between teams, a project blog, and a forum for asynchronous discussion including error reports.]
METHODS
Figure 2: Embedded Chat Interface in EyeWire
given time frame. The cohort analysis examines monthly cohorts of players based on their first chat and game entry, and provides a measure of sustained activity. Based on the monthly player retention values, we are able to differentiate between different sets of users, as described in the following section.
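A monthly cohort analysis of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes activity is given as (player ID, date) pairs, groups players by the month of their first entry, and reports the fraction of each cohort active in each later month. The function names and sample events are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_index(d):
    """Map a date to a running month number so months can be subtracted."""
    return d.year * 12 + d.month


def monthly_retention(events):
    """events: iterable of (player_id, date) activity records.

    Returns {(year, month) of first activity: {months_since_first: fraction}}.
    """
    first = {}                  # player -> month index of first activity
    active = defaultdict(set)   # player -> set of month indices with activity
    for player, day in events:
        m = month_index(day)
        active[player].add(m)
        if player not in first or m < first[player]:
            first[player] = m

    cohort_size = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for player, m0 in first.items():
        cohort_size[m0] += 1
        for m in active[player]:
            counts[m0][m - m0] += 1

    result = {}
    for m0, offsets in counts.items():
        year, month0 = divmod(m0 - 1, 12)
        result[(year, month0 + 1)] = {
            off: n / cohort_size[m0] for off, n in sorted(offsets.items())
        }
    return result


# Hypothetical activity log (made-up players and dates).
events = [
    ("a", date(2014, 1, 5)), ("a", date(2014, 2, 10)), ("a", date(2014, 3, 1)),
    ("b", date(2014, 1, 20)),
    ("c", date(2014, 2, 2)), ("c", date(2014, 3, 15)),
]
retention = monthly_retention(events)
print(retention[(2014, 1)])  # {0: 1.0, 1: 0.5, 2: 0.5}
```

Running the same function once on chat entries and once on game entries would give the two retention views the excerpt refers to.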
To examine the context and discourse within the chat messages, we perform text analysis to extract the use of EyeWire game commands, and also perform topic modelling on the content of the chat messages. To achieve this we use LDA [5] to derive topic models which contain common vocabulary used by players. We combine this with the different categories of chat messages in order to determine the context of chat during different stages of completing a game.
As we are interested in the relationship between a player's gaming session and use of chat, we construct a model of player chat messages which classifies chat activity at different stages of a game being played. As described in Table 1 and illustrated in Figure 3, we categorise the chat messages into 5 stages around the process of gaming. Stages Q1 to Q4 are relative to the time it took for the game to be completed. For example, if a game was completed in 10 seconds, then Q1 would represent 0-2 seconds,
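The stage boundaries of Table 1 can be sketched in code under a few assumptions: times are in seconds, the quartile boundaries are taken as exactly 25% and 75% of the game duration (the excerpt's own example uses 0-2 seconds for a 10-second game), Q0 and Q5 are read as 30-second windows before the start and after the end, and the handling of boundary values is an arbitrary choice. The function name `chat_stage` is hypothetical.

```python
def chat_stage(msg_time, game_start, game_end, window=30.0):
    """Classify a chat message timestamp relative to one game.

    Returns "Q0", "Q1", "Q2-3", "Q4", "Q5", or None if the message falls
    outside the game and its surrounding windows. All times are in seconds.
    """
    duration = game_end - game_start
    q1 = game_start + 0.25 * duration   # assumed end of the first quartile
    q3 = game_start + 0.75 * duration   # assumed end of the third quartile

    if game_start - window <= msg_time < game_start:
        return "Q0"      # before game
    if game_start <= msg_time < q1:
        return "Q1"      # start of game
    if q1 <= msg_time < q3:
        return "Q2-3"    # during game
    if q3 <= msg_time <= game_end:
        return "Q4"      # end of game
    if game_end < msg_time <= game_end + window:
        return "Q5"      # after game
    return None


# A hypothetical game running from t=100s to t=110s.
stages = [chat_stage(t, 100.0, 110.0) for t in (80, 101, 105, 109, 120, 200)]
print(stages)  # ['Q0', 'Q1', 'Q2-3', 'Q4', 'Q5', None]
```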
8. Participation in Citizen Science projects by
communication patterns
Luczak-Roesch, M., Tinati, R., Simperl, E., Van Kleek, M., Shadbolt, N., & Simpson, R. (2014). Why won't aliens talk to us? Content and community dynamics in online citizen science. Proceedings of the Eighth AAAI Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014.
Image source: David Miller, https://daily.zooniverse.org/2013/11/21/an-ever-expanding-zooniverse/
10. Temporal networks of information co-occurrence
for system-agnostic exploratory data analysis
Markus Luczak-Roesch, Ramine Tinati, Max van Kleek, and Nigel Shadbolt. 2015. From coincidence to
purposeful flow? Properties of transcendental information cascades. In IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, FR.
11. Where is the MacroScope?
[Web Observatory pipeline diagram repeated from slide 3: Data Sources, Data Collection, Data Storage, Data Analysis and Modelling, Data Visualisation, Data Interpretation, with feedback loops into storage and harvesting.]
Image sources: https://en.wikipedia.org/wiki/File:Compound_Microscope_(cropped).JPG, CC
BY-SA 4.0 & https://en.wikipedia.org/wiki/File:Sphinx_Observatory.jpg, CC BY-SA 2.0
13. Other data visualization capacities
Image source: screenshot from https://www.imperial.ac.uk/data-science/kpmg-data-observatory-/technical-specifications/
14. Other data visualization capacities
Image source: screenshot from http://approach.rpi.edu/2015/11/18/immersive-experience-the-campfire/
15. What is the MacroScope?
“Wow, they don’t even know
that this is happening!”
16. Do we really think this is an event to be
addressed in a purely quantitative fashion?
Source: United Nations Development Programme, https://goo.gl/Z1uXdV, CC BY-NC-ND 2.0
17. A qualitative investigation of crowdsourced
disaster response
• Haiti (Ushahidi, N=298)
  – requests for help from identified local source
• Congo (Ushahidi, N=102)
  – information about the situation but not who is responsible for this information
  – more non-local sources
• Ebola (Twitter, N=298)
  – comments
    • tasteless jokes
    • racist comments
    • concern that the crisis could spread and call to governments to close the borders
Joint project with Silke Roth
18. Boundaries of crowdsourced disaster response
• Wrong things go viral
• Crowdsourcing informativeness of social media information not synchronized with crises
Chart legend: negative, neutral, positive
“When you tell a […] kid that is has got Ebola”
19. Serendipitous discoveries in Citizen Science
• Hanny’s Voorwerp: Galaxy Zoo [2007]
• Green Pea Galaxies: Galaxy Zoo [2007]
• Yellow Balls: Milky Way Project [2009]
• Circumbinary Planet PH1b: Planet Hunters [2012]
• Convict Worm: Seafloor Explorer [2012]
• Spanish Flu: Operation War Diary [2014]
20. From information co-occurrence to the discovery
of hidden structure in Wikipedia
Figure 1: Wikipedia edits in a three-dimensional space. The dimensions are (1) time; (2) information diversity as the chronologi-
Tinati, R., Luczak-Rösch, M., & Hall, W. (to appear). Finding Structure in Wikipedia Edit Activity: An Information Cascade Approach. In WikiWorkshop 2016, co-located with WWW 2016.
Events detected:
• Edward Snowden speech at SXSW conference
• US Supreme Court case on same-sex marriage
(a) Cascade Article Network (CAN): Nodes represent unique Wikipedia articles; edges are shared edits, based on a matched shared identifier. A force-directed layout has been applied, with edge path lengths determined by edge weight. The strongly connected component (A) contains articles associated with South Korean media; (B) and (C) contain articles related to the USA.
(b) Cascade-to-Cascade path network graph: Nodes are cascades; edges are the shared articles between cascades. The central strongly connected component is established by the identifiers shown in Table 3. A force-directed layout has been applied, with edge path lengths determined by edge weight.
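One possible reading of the article network in caption (a) is sketched below. It is an illustration under stated assumptions, not the authors' method: each edit is assumed to arrive as (timestamp, article, set of identifiers), identifier extraction is taken as already done, and two articles are linked whenever an edit to one shares an identifier with the most recent earlier edit, on the other article, that carried it. The function name and the sample edits are hypothetical.

```python
from collections import Counter


def cascade_article_network(edits):
    """Build weighted, undirected article-to-article edges from edits.

    edits: iterable of (timestamp, article, set_of_identifiers).
    Returns a Counter mapping a sorted (article, article) pair to its weight.
    """
    last_seen = {}       # identifier -> article of the latest edit containing it
    weights = Counter()
    for _, article, identifiers in sorted(edits, key=lambda e: e[0]):
        for ident in identifiers:
            prev = last_seen.get(ident)
            if prev is not None and prev != article:
                weights[tuple(sorted((prev, article)))] += 1
            last_seen[ident] = article
    return weights


# Made-up edits sharing the identifiers "x" and "y".
edits = [
    (1, "Article A", {"x"}),
    (2, "Article B", {"x", "y"}),
    (3, "Article C", {"y"}),
    (4, "Article A", {"x"}),
]
network = cascade_article_network(edits)
print(dict(network))
```

The resulting weights could then drive the edge lengths of a force-directed layout, as the caption describes.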
21. The MacroScope is technology
External APIs
• Twitter
• Wikipedia
• Instagram
• Google Trends
• Yahoo Trends
Pre-processing Stage:
1. Enrich streams
2. Unify feeds into WO JSON Format
Streaming Stage:
1. Post incoming stream to RabbitMQ exchange (each source has its own exchange)
Hadoop Storage Stage:
1. Apache Flume for each stream (HDFS)
HTTP Streaming Stage:
1. Send stream to Web Observatory Server
Daily Storage Stage:
1. MapReduce daily results (MongoDB)
Diagram labels: Unstructured Web Streams or Web Scraped Pages; Web Observatory JSON Data Schema; RabbitMQ JSON Stream; Socket.IO; MacroScope
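The "unify feeds" step of the pre-processing stage can be illustrated with a small sketch. The actual Web Observatory JSON schema is not shown on the slide, so every field name below, the raw record layouts, and the `wo.<source>` exchange naming are assumptions made for illustration; publishing to RabbitMQ itself is left out.

```python
import json
from datetime import datetime, timezone


def to_wo_record(source, raw):
    """Map a source-specific record onto one common (hypothetical) JSON shape."""
    if source == "twitter":
        text, ts = raw["text"], raw["created_at"]      # assumed raw fields
    elif source == "wikipedia":
        text, ts = raw["title"], raw["timestamp"]      # assumed raw fields
    else:
        raise ValueError(f"unsupported source: {source}")
    return {
        "source": source,
        # One exchange per source, mirroring the streaming stage; name is made up.
        "exchange": f"wo.{source}",
        "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(),
        "text": text,
    }


# Made-up records with epoch-second timestamps.
tweet = to_wo_record("twitter", {"text": "hello", "created_at": 0})
edit = to_wo_record("wikipedia", {"title": "Web science", "timestamp": 0})
print(json.dumps(tweet))
```

Once every source is reduced to the same shape, the later stages (storage, daily aggregation, streaming to the MacroScope) can treat all feeds uniformly.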
22. There is more than one MacroScope
• six screens in WAIS labs
• as part of presentations
• as a mobile exhibit
• as a Web application
24. Engagement with the general public
Scholars and people from the general public
• demonstrating the power and the danger of individuals sharing information online
• developing a new “situational ethics of data”
27. A mantra for the MacroScope:
“Overview first, zoom and filter, then details-on-demand”* and capture engagement.
* Shneiderman, B. (1996, September). The eyes have it: A task by data type taxonomy for information visualizations. In Visual Languages, 1996. Proceedings., IEEE Symposium on (pp. 336-343). IEEE.
Image source: screenshot taken from http://data.shopsavvy.mobi/globe
28. The Web Science MacroScope:
Mixed-methods Approach for
Understanding Web Activity
Markus Luczak-Roesch
@mluczak | http://markus-luczak.de
Image source: https://en.wikipedia.org/wiki/File:Compound_Microscope_(cropped).JPG, CC BY-SA 4.0
Discover
Describe
Directly engage