DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
1. Identifying Semantic Concepts
Selection of CNs
Data Collection
Preprocessing
Extraction of CN, LN, IWL, GN, page-size and access log data
Calculation of REPk, REPv, REPt and REL
Analysis of data
Eric Tessenow,1 Mirko Kämpf,2 and Jan W. Kantelhardt 2
Abstract
Since the numbers of hypertext pages and hyperlinks in the WWW have
been continuously growing for more than 20 years, the problem of
finding relevant content has become increasingly important. We have
developed and evaluated techniques for a time-dependent characterization
of the global and local relevance of WWW pages based on
document length, number of links, and cross-correlations in user-access
time series. We focus on content and user activity in selected groups of
Wikipedia articles as a first application mainly because of data availability.
Our goal is the assignment of ranking values to a hypertext page
(node). The values shall cover static properties of the node and its
neighbourhood (context) as well as dynamic properties derived from its
page-view rates that depend on underlying communication processes.
We show in several examples how this goal can be achieved.
1 Institute of Communications Studies, University of Leeds, LS2 9JT, Leeds, United Kingdom
2 Institut für Physik, Martin-Luther-Universität Halle-Wittenberg, 06099 Halle (Saale), Germany
Motivation
Since many aspects have to be taken into account in the analysis of
global social networks, it is challenging to compare data collections and
obtain results from their analysis. We therefore require a framework that
is robust and at the same time flexible, and that enables interdisciplinary
research as scientists from different fields look at different parts of a data
set. Our work suggests a methodology for comparable measurements of
a node's relevance in local graphs defined by the node's local
neighbourhood, while considering local link structure, text volume, user
access activity and editorial activity.
This enables a qualitative and also an efficient quantitative analysis of
parts of a global social network without having to explore and analyze
the whole graph.
In order to identify and compare different communication processes
on multiple channels, one has to quantify the influence of
the environment in which an individual process is embedded,
e.g. for different topics and different regions on Earth. We study
usage patterns and the embedding of content in one of the largest
public and open content networks, Wikipedia.
Information Flow in Correlation Networks
Outlook
Local Representation Indexes: REPk,v and REPa,e(t)
Data Sets & Processing
SOE 6.1
References
[1] Kämpf M., Tessenow E., Kantelhardt J.W., Context Sensitive and Time Resolved Relevance in Complex Networks. Unpublished (in preparation, 2014).
[2] Kämpf M., Tismer S., Kantelhardt J.W., Muchnik L., Fluctuations in Wikipedia access-rate and edit-event data. Physica A 391: 6101-6111 (2012).
[3] Kämpf M., Kantelhardt J.W., Muchnik L., From time series to co-evolving functional networks: dynamics of the complex system "Wikipedia". Proc. Europ. Conf. Complex Syst. (2012).
[4] Schreck B., Kämpf M., Kantelhardt J.W., Motzkau H., Comparing the usage of global and local Wikipedias with focus on Swedish Wikipedia. arXiv:1308.1776 (2013).
[5] Kämpf M., Kantelhardt J.W., Hadoop.TS: large-scale time-series processing. International Journal of Computer Applications (IJCA) 74: 17 (2013), DOI: 10.5120/12974-0233.
[6] Segev E., Mapping the international: Global and local salience and news-links between countries in popular news sites worldwide. Int. Journal for Internet Science 5: 48-71 (2010).
Contact
We compare different media types, in
particular channels that push information to
consumers (TV news, radio news, Twitter and
Facebook communication), as opposed to pull media
like Wikipedia, forums, or blackboard websites, from
which consumers pull data on demand.
We evaluate how properties of different
network types, e.g. social, content, and
communication networks, influence each
other and whether such couplings depend more on the
content or more on the way information is offered and
spread.
Finally, we are interested in the question:
To what extent and how can automated tools
influence the communication processes?
Relevance Indexes: RELv and RELa(t)
We measure characteristic static and dynamic properties of a Wikipedia page based on I.) node degree k,
II.) average text volume v, and III.) its access-rate or edit-rate time series (a(t), e(t)) in order to
determine and quantify the level of representation in a semantic or lingual context.
I.) Node degree
II.) Average text volume
III.) Time-dependent access rate a(t)
We measure the time-dependent or temporal
relevance of a Wikipedia page during a
time period for access rates (a,b) of the
central node CN (black), the group IWL
(green), the local neighbourhood (LN, blue)
and the global neighbourhood (GN, red).
a) Relevance Index: shows the level of
attraction of a topic, e.g. for a Wikipedia
page in one selected language.
It compares the user interest in the page in
the selected language with the average values
for pages with the same content in all other
languages.
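The comparison described above can be sketched as a ratio per time step. This is only an illustration: the poster does not state the exact normalisation, so the ratio form, the function name `relevance_index`, and the sample numbers are assumptions.

```python
def relevance_index(selected, others):
    """Access rate in the selected language divided by the mean access
    rate of the same topic in all other languages, per time step.

    selected -- list of access counts a(t) for the selected language
    others   -- list of lists, one access-count series per other language
    NOTE: the ratio form is an assumption made for this sketch.
    """
    result = []
    for t, a_selected in enumerate(selected):
        mean_other = sum(series[t] for series in others) / len(others)
        # Avoid division by zero when no other language was accessed.
        result.append(a_selected / mean_other if mean_other > 0 else float("nan"))
    return result


# Hypothetical hourly access counts for one topic.
rel = relevance_index([120, 300, 90], [[60, 50, 45], [60, 50, 45]])
```

Values above 1 would indicate that the page attracts more interest in the selected language than the same topic does on average elsewhere.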
Fig. 1: Definition of partial
data sets (local networks)
Fig. 2: Comparison of local network
structures with identical nodes based
on (a) direct links and (b) functional
link strengths derived from access
activity.
We calculate the time-dependent link strengths correlation by:
Fig. 3: Comparison of static representation indexes for two semantic concepts (data sets 1 and 2).
Fig. 4: Comparison of two local page networks with an assumed
higher global relevance (left) and with higher local relevance (right).
Fig. 5: Distribution of dynamic link strengths
for statically linked pages (a,d), for pages within
groups LN (blue) and GN (red) (b,e), and for
pages in different groups (c,f). Lines show
results for real data and are compared with
results from randomly shuffled data series
(filled areas).
Average values and maximum values of the
distribution function vary over time. Hence, we
cannot define a simple threshold to identify
relevant links. However, the distributions differ
significantly for real data and surrogate data in
(a,d,e).
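The comparison with randomly shuffled series can be sketched as follows. This is a minimal example that assumes a plain Pearson coefficient as link strength; the helper names `pearson` and `surrogate_strengths`, the number of surrogates, and the seed are illustrative choices, not taken from the poster.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def surrogate_strengths(x, y, n_surrogates=200, seed=0):
    """Link strengths after destroying temporal order by shuffling y."""
    rng = random.Random(seed)
    strengths = []
    for _ in range(n_surrogates):
        shuffled = list(y)
        rng.shuffle(shuffled)
        strengths.append(pearson(x, shuffled))
    return strengths


# Two hypothetical, perfectly co-moving access-rate series.
x = [float(t) for t in range(20)]
y = list(x)
real = pearson(x, y)
surrogates = surrogate_strengths(x, y)
```

A link strength from real data that lies well outside the range of the surrogate values would suggest that the correlation is not a chance effect.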
In the presence of extreme events in access
time series (bottom row) we find a significant
increase in cross-correlation based link
strengths for page pairs in the local and global
neighbourhoods.
RAW data set
large scale data management
Partial data set
preparation
Result data set
Communication Process
Modelling and Analysis of Complex Systems
Definition of data sets (Fig. 1)
a) Central node (CN), all directly linked nodes in the same language (local neighbourhood,
LN), all nodes regarding the same topic in other languages (linked by inter-wiki links, IWL),
and all nodes linked to nodes in the IWL group (global neighbourhood, GN).
b) The CN group and the IWL group are the core of the local network for one topic. Both
neighbourhoods, local (LN) and global (GN), form the hull of the local network.
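The group definitions above can be sketched in a few lines. The example assumes that page links and inter-wiki links are available as plain dictionaries and that "directly linked" means outgoing links; the function name `build_local_network` and the sample pages are hypothetical.

```python
def build_local_network(cn, links, interwiki):
    """Return the groups CN, LN, IWL and GN around a central node.

    cn        -- (language, title) of the central node
    links     -- dict: node -> set of nodes it links to in its own language
    interwiki -- dict: node -> set of nodes on the same topic in other languages
    """
    ln = set(links.get(cn, set())) - {cn}
    iwl = set(interwiki.get(cn, set())) - {cn}
    gn = set()
    for node in iwl:
        gn |= links.get(node, set())
    gn -= iwl | {cn}  # keep the hull disjoint from the core
    return {"CN": {cn}, "LN": ln, "IWL": iwl, "GN": gn}


# Hypothetical link data for one topic in two languages.
links = {
    ("de", "Berlin"): {("de", "Spree"), ("de", "Brandenburg")},
    ("en", "Berlin"): {("en", "Spree"), ("en", "Germany")},
}
interwiki = {("de", "Berlin"): {("en", "Berlin")}}

groups = build_local_network(("de", "Berlin"), links, interwiki)
core = groups["CN"] | groups["IWL"]
hull = groups["LN"] | groups["GN"]
```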
Data sets for preliminary results and method tests
We address three data sets with differently chosen CNs (Wikipedia pages):
(1) Four German cities (Berlin, Heidelberg, Bad Harzburg, Sulingen) and two British cities
(Oxford, Birmingham);
(2) "United States of America", "Germany", the "President of the United States Barack
Obama", and the "Federal Chancellor Angela Merkel" in German and English language;
(3) Selected CNs with predominantly local and global relevance: Erfurt rampage and
Illuminati book (both already used in a previous study of the fluctuations in Wikipedia
access-rate time series [2]) and four times three pairs of CNs within the categories:
minorities, cities, politicians, and meals.
Comparison of static link network and dynamic correlation networks (Fig. 2)
a) Direct Wikipedia links between all nodes in the groups CN, LN, IWL, and GN.
b) Functional link strengths calculated from user access-rate time series.
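A functional network of this kind can be sketched by correlating every pair of access-rate series. The example assumes the zero-lag Pearson coefficient as link strength, which may differ from the measure used for Fig. 2; `pearson`, `functional_links`, and the sample counts are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def functional_links(series):
    """Map each unordered page pair to its correlation-based link strength.

    series -- dict: page name -> list of access counts per time step
    """
    names = sorted(series)
    return {
        (p, q): pearson(series[p], series[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


# Hypothetical access counts for three pages.
strengths = functional_links({
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [4, 3, 2, 1],
})
```

Unlike the static link network, every page pair receives a strength here, including pairs without a direct Wikipedia link.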
Illuminati (book) Erfurt rampage
eric.tessenow@gmail.com, mirko.kaempf@gmail.com, jan.kantelhardt@physik.uni-halle.de
This work was supported by:
Acknowledgement
Global and local relevance seem to be characteristic properties of a
page. In (c) the local relevance decreases (blue dashed line) and in
(d) it is similar for all languages. To compare L.REL and G.REL we
show the cross-correlation for sliding windows of different sizes in (e)
and (f).
CN: Erfurt rampage (language: de)
Jan-Feb 2009
Mar-Apr 2009
Fig. 6: Time resolved average link strength for local
functional networks around two selected CNs (see Fig. 4).
Fig. 6a) shows a significant change in the average cross-
correlation for pages in group GN (area A). At the same time the
correlation in group LN drops significantly. In Fig. 6b) one can see
that a decreasing local correlation is not necessarily related to a
change in global correlations. This way one might be able to
distinguish between local and global relevance as well.
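A time-resolved link strength of this kind can be sketched with a sliding window. The example assumes a Pearson coefficient per window; the window length, the step size, and the names `pearson` and `sliding_correlation` are illustrative choices, not taken from the poster.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def sliding_correlation(x, y, window, step=1):
    """Correlation of x and y inside each window of the given length."""
    return [
        pearson(x[start:start + window], y[start:start + window])
        for start in range(0, len(x) - window + 1, step)
    ]


# Hypothetical series: co-moving in the first half, opposed in the second.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 3, 4, 4, 3, 2, 1]
windows = sliding_correlation(x, y, window=4, step=4)
```

Averaging such window values over all page pairs of a group (LN or GN) would give one curve per group, which is the kind of quantity compared in Fig. 6.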
We visualize the relevance of semantic concepts for specific regions, while we take the natural density of
speakers and topic-specific relevance of languages within a specific region into account.
Such a language-dependent visualization will help to distinctly identify global and local trends for a semantic concept
in a specific continent, country, or region based on public data sources and social communication and content networks
like Wikipedia. Facebook, Google+, Twitter, or even internal systems used in global enterprises can be analyzed
this way as well, even in a multilingual environment.
Fig. 7: Collaboration networks for pages regarding
the same topic in different languages (central large
violet nodes) show inhomogeneous structure with
clusters of multiple sizes. Connections between
editor-clusters are "automatic editing tools (robots)".
Do such robots influence the spread of information?
Context Sensitive and Time Resolved Relevance
of Wikipedia Articles