From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
Disclaimer: All original texts and images belong to their rightful owners.
Chapter 8 of the book "Bibliometrics and Citation Analysis" by Nicola de Bellis.
The web exhibits a citation structure, links
between web pages being similar to
bibliographic citations.
Thanks to markup languages, the information units composing a text can be marked and made recognizable by labels that facilitate their automatic connection with the full text of the cited document.
Disciplinary databases:
Chemical Abstract Service (CAS)
SAO/NASA Astrophysics Data System (ADS)
SPIRES HEP database
MathSciNet
CiteSeer
IEEE Xplore
Citebase
Citations in Economics
Multidisciplinary databases:
Web of Science
Google Scholar
Scopus
The relevance of a webpage to a user query
can be estimated by looking at the link rates
and topology of the other pages pointing to
it.
Pagerank:
Google’s ranking algorithm.
It assigns different “prestige” scores to individual
pages according to their position in the overall
network.
More weight is assigned to the pages receiving
more links.
An “authority” is a page that receives many
links from quality “hubs” (like a citation
classic).
A quality “hub” is a page providing many
links to “authorities” (like a good review
paper).
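A minimal sketch of both ranking ideas on a made-up link graph, assuming the networkx library (an illustration of the concepts, not Google's production algorithm):

    import networkx as nx

    # Toy directed link graph: an arrow means "this page links to that page".
    G = nx.DiGraph([
        ("hub1", "classic"), ("hub2", "classic"), ("hub3", "classic"),
        ("hub1", "paper_a"), ("hub2", "paper_a"),
        ("classic", "paper_b"),
    ])

    # PageRank-style scoring: "prestige" flows along links, so well-linked
    # pages such as "classic" end up with the highest scores.
    print("PageRank:", nx.pagerank(G, alpha=0.85))

    # Kleinberg-style hubs and authorities: "classic" emerges as an authority
    # (like a citation classic), the "hub*" pages as hubs (like review papers).
    hubs, authorities = nx.hits(G)
    print("hubs:       ", hubs)
    print("authorities:", authorities)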
Advantages:
The immediacy of access to scientific literature amounted to an information revolution.
The web significantly helps to increase citation impact, and local online usage has become one of the best predictors of future citations.
Less gate-keeping.
Disadvantages:
Citations concentrate on fewer distinct articles.
Citations tend to concentrate on more recent
publications.
How to quantify the Web-wide cognitive and
social life of scientific literature?
The impact of a set of documents outside the ISI
circuit can be estimated by:
Counting, by means of usage mining techniques, the
number of document views or downloads over a
certain period of time
Interviewing a significant sample of readers
Counting, by means of search engines’ facilities, the
number of links to the website hosting the documents
Identifying and counting, as ISI indexes do, the
bibliographic citations to those documents from non-
ISI sources.
Standards and protocols have been
developed in the context of national and
international projects to make uniform the
recording and reporting of online usage
statistics:
COUNTER (Counting Online Usage of Networked
Electronic Resources)
SUSHI (Standardized Usage Statistics Harvesting Initiative)
MESUR (Metrics from Scholarly Usage of
Resources)
Peer-reviewed open access journals appeared in the 1980's, for example New Horizons in Adult Education, Psycoloquy, Postmodern Culture and Surfaces.
In the 1990's, RePEc (Research Papers in Economics), Medline/PubMed Central and CogPrints were started or opened to the public.
In 1991, Ginsparg set up arXiv, a preprint and postprint central repository initially only for high-energy physics.
Under the slogan “Public access
to publicly funded research”,
the Open Access movement has
published theoretical and
business models along with
technical infrastructure, to
support the free online
dissemination of peer-reviewed
scientific literature since the
late 1990’s.
There are two options for authors following
this way of publication:
Submit a paper directly to an OA journal.
It peer-reviews and makes freely available all of its contents for all users, while shifting editorial costs onto the author or the funding institution.
There are over 3,200 OA journals in the Directory of
Open Access Journals (www.doaj.org)
Keep publishing in traditional journals, but
archive a peer-reviewed version of the same
content in an openly accessible repository.
A goal of the OA movements has been to
demonstrate that open access substantially
increases research impact:
In 2001, Lawrence provided evidence that
citation rates in a sample of computer science
conference articles appeared significantly
correlated with their level of accessibility.
In 2007, Harnad and Brody's team detected an OA citation advantage across all disciplines in a twelve-year sample of ISI articles (1992-2003). The citation impact was 25 to 250% higher for OA papers.
Counter-arguments:
Subjectivity factor in the selection of postable
items
Increased visibility
Readership
Shelf-exposition
Best authors tend to be overrepresented
Self-selection bias postulate
In 2007, a paper by Moed performed a citation analysis of papers posted to arXiv's condensed matter section before being published in scientific journals, and compared the results with those of a parallel citation analysis of unposted articles published in the same journals.
Articles posted to the preprint server are
actually more cited than unposted ones, but the
effect varies with the papers’ age.
The citation advantage of many OA papers largely reduces to the individual performance and publishing strategies of the authors themselves.
Two studies on the citation impact of OA journals indexed in the Web of Science appeared in 2004. The impact factor of ISI OA journals was lower than that of non-OA journals.
Despite the evidence, there are important
reasons to support OA journals:
Shortening the paths between invisible colleges and turning them into real-time collaboration networks will increase the speed and effectiveness of scientific communication.
In research areas outside big science, it increases the opportunity to pursue research goals.
It will allow the shaping of new ideas in constant
interplay with other scientists with similar interests.
Harnad has proposed a multidimensional, field-sensitive, and carefully validated open access scientometrics that takes advantage of open access materials. The key is…
Metadata: set of encoded data attached to
information units processed by the automatic
indexing system to help identify, retrieve,
and manage them in an effective fashion.
But there needs to be a metadata standard.
To date, indexing algorithms have failed.
www.citebase.org is an indexing system for OA repositories. It was developed by Brody's team at the University of Southampton in 2001. It uses the OAI Protocol for Metadata Harvesting (OAI-PMH).
The Citebase software parses the bibliographic
references of the fulltext papers hosted by the
servers and, every time a reference matches the
full text of another paper in the same repository,
it creates a link.
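A minimal sketch of that matching step, using hypothetical records and a naive title-containment match rather than Citebase's actual reference parser:

    import re

    # Hypothetical repository metadata: paper id -> title.
    repository = {
        "arXiv:0001": "Small-world networks",
        "arXiv:0002": "Scale-free topology of the Web",
    }

    # References extracted from the full text of one hosted paper.
    citing_id = "arXiv:0003"        # hypothetical citing paper
    references = [
        "Watts & Strogatz, Small-world networks, Nature 1998",
        "An external monograph not held in the repository, 2001",
    ]

    def normalize(text: str) -> str:
        # Crude matching key: lowercase, non-letters collapsed to spaces.
        return re.sub(r"[^a-z]+", " ", text.lower()).strip()

    # Create a citation link whenever a reference contains a hosted paper's title.
    links = [(citing_id, paper_id)
             for ref in references
             for paper_id, title in repository.items()
             if normalize(title) in normalize(ref)]

    print(links)    # [('arXiv:0003', 'arXiv:0001')]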
A usage/Citation Impact Correlator produces a
correlation table comparing the number of times
an article has been cited with the approximate
number of times it has been downloaded.
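A minimal sketch of the kind of comparison such a correlator performs, with made-up download and citation counts and scipy assumed for Spearman's rank correlation (not Citebase's actual implementation):

    from scipy.stats import spearmanr

    # Hypothetical per-article counts gathered from server logs and reference links.
    downloads = [500, 120, 80, 300, 45, 900, 60]
    citations = [ 30,   5,  2,  12,  1,  55,  4]

    rho, p_value = spearmanr(downloads, citations)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
    # A strong positive rho suggests early downloads anticipate later citations.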
CiteSeer, formerly ResearchIndex (citeseer.ist.psu.edu), is a digital library search and management system developed in the US.
It gathers together research article preprints
and postprints from several distributed nodes
of the open access Web through web
crawling techniques.
It extracts the context surrounding the
citation in the body of the paper.
The new Web Citation Index, based on CiteSeer
technology, was launched officially in 2005.
It covers materials from OA repositories that
meet quality criteria, such as:
arXiv.
The Caltech Collection of Open Digital Archives.
The Australian National University Eprints Repository.
The NASA Langley Technical Library Digital Repository.
The open access content in Digital Commons, an OAI-compliant institutional repository service.
Citebase and CiteSeer are not advisable tools for bibliometric evaluation for now. They are still pilot projects.
The probability that a webpage is included in a search engine's database increases as the web crawler fetches other pages linking to it.
But!
Links do not acknowledge intellectual debts.
They lack peer review.
Links are not indelible footprints in the
landscape of recorded scholarly activity.
The study of links is divided into:
1. Complex network analysis, which investigates
the topological properties of the Internet and the
Web as particular cases of an evolving complex
network.
2. Hyperlink network analysis, which interprets
the connections between websites as
technological symbols of social ties among
individuals, groups, organizations and nations.
3. Webometrics, which extends to the web space
concepts and methods originally developed in the
field of bibliometrics.
The web's topological structure, i.e. the number and distribution of links between the nodes, initially played a crucial role in understanding a wide range of issues:
The way users surf the Web.
The ease with which they gather information.
The formation of Web communities as clusters of
highly interacting nodes.
The spread of ideas, innovations, hacking
attacks, and computer viruses.
Theoretical physicists have recently shifted their attention to the dynamics of this structure, driven by the progressive addition or removal of nodes and links.
The key element in the modeling exercise is the graph:
What kind of graph is the Web?
What pattern, if any, is revealed by the hyperlink
distribution among the nodes?
Do the links tend to be evenly distributed?
If not, why not?
In the late 1950s, when Erdős and Rényi supplied graph theory with a coherent probabilistic foundation, the conviction gained ground that complex social and natural systems could be represented, in mathematical terms, by random graphs.
Each node of a random graph has an equal
probability of acquiring a link, and the
frequency distribution of links among nodes
is conveniently described by a probability
distribution (Poisson).
In random graphs, there is a dominant average
number of links per node called the network’s
“scale”. It is an upper threshold that prevents
the system from having nodes with a
disproportionately higher number of links.
Nodes are not clustered and display statistically
short distances between each other.
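A minimal sketch of this behaviour, assuming the networkx library and illustrative parameters: in an Erdős–Rényi random graph, node degrees cluster around the mean (the network's "scale") and no node accumulates a disproportionate number of links.

    import networkx as nx
    from collections import Counter

    n, p = 10_000, 0.0008                        # illustrative size and link probability
    G = nx.fast_gnp_random_graph(n, p, seed=42)  # Erdős–Rényi random graph

    degrees = [d for _, d in G.degree()]
    print("mean degree (the 'scale'):", sum(degrees) / n)
    print("max degree:", max(degrees))           # stays close to the mean, no big hubs

    # Degree frequencies pile up around the mean, approximating a Poisson distribution.
    print(sorted(Counter(degrees).items())[:15])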
Empirical evidence seemed to contradict this
model because the structure of complex
networks was somewhere between a totally
regular graph and a random graph.
In 1998, Watts and Strogatz proposed a model of complex networks based on the small-world concept.
A small world is said to exist whenever
members of any large group are connected to
each other through short chains of
intermediate acquaintances.
The path to small worlds:
Pool and Kochen made mathematical descriptions
of social contact based on statistical mechanics
methods, encompassing graph-theoretic models
and Monte Carlo simulations in the 1950’s
In 1967, Milgram initiated a series of experiments to test the small-world conjecture in real social networks. He found that, on average, the acquaintance chain required to connect two random individuals is composed of about six links.
In 1998, Watts and Strogatz showed that a complex network can be a small world, displaying both the highly clustered sets of nodes typical of regular graphs and the small path lengths between any two nodes typical of random graphs.
They computed the clustering coefficient and recognized the importance of short cuts (see the sketch below).
Further experiments confirmed that documents on the Web are, on average, about nineteen clicks away from each other.
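A minimal sketch of the two quantities at stake, assuming the networkx library and illustrative parameters: in a Watts-Strogatz graph, the clustering coefficient stays high (as in regular lattices) while the average path length stays short (as in random graphs) once a few random short cuts are added.

    import networkx as nx

    # Ring lattice with a small fraction of randomly rewired "short cut" links.
    G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.05, seed=1)

    print("clustering coefficient:", round(nx.average_clustering(G), 3))            # high, lattice-like
    print("average path length:   ", round(nx.average_shortest_path_length(G), 2))  # short, random-like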
In 1999, Albert and Barabási proposed an alternative class of models for the large-scale properties of complex networks.
Networks grow by the addition of new nodes linking to
already existing ones.
This addition follows a mechanism of preferential
attachment that replicates the Matthew Effect.
This means that new nodes have a higher probability of linking to highly connected nodes than to poorly connected or isolated ones.
P(n) is the probability of finding a node with n links.
An experiment in 1999 confirmed that the World Wide Web is a scale-free network whose link distribution is governed by a power law:
    P(n) ∝ 1 / n^a
where a is the power-law exponent.
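A minimal sketch of preferential attachment in plain Python (illustrative parameters, not Barabási and Albert's original code): each new node links to m existing nodes with probability proportional to their current number of links, producing a few heavily linked hubs and a long tail of sparsely linked nodes, i.e. the scale-free pattern above.

    import random
    from collections import Counter

    random.seed(0)
    m = 2                       # links added by each new node (assumed)
    targets = [0, 1]            # node ids repeated once per link they hold
    degree = Counter({0: 1, 1: 1})

    for new_node in range(2, 10_000):
        # Sampling from 'targets' picks a node with probability proportional to its degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(random.choice(targets))
        for t in chosen:
            degree[new_node] += 1
            degree[t] += 1
            targets.extend([new_node, t])

    print("top hubs (node, links):", degree.most_common(5))           # a few nodes hoard most links
    print("median links per node:", sorted(degree.values())[len(degree) // 2])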
Nowadays, the network has come increasingly to represent not simply a communication facility, but a tool for building online collaboration platforms where new knowledge can be created, modified, and negotiated, in a sort of virtual laboratory without walls.
Sociologists have been applying Social Network Analysis (SNA) to the World Wide Web's hyperlink texture since 1997. This is called Hyperlink Network Analysis (HNA).
Objectives:
Check whether the hyperlink network is
organized around central websites which play the
role of hubs.
Centrality measures are carried out by counting
the number of ingoing and outgoing links for a
given website (indegree and outdegree
centrality).
Centrality has an aspect of “closeness”, intended
to single out the website with the shortest path
to all others.
Betweenness estimates the frequency with which a website falls on the paths connecting other sites (see the sketch below).
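A minimal sketch of these three centrality notions on a made-up hyperlink network, assuming the networkx library:

    import networkx as nx

    # Hypothetical hyperlink network between five websites.
    G = nx.DiGraph([
        ("uni-a.edu", "hub.org"), ("uni-b.edu", "hub.org"),
        ("hub.org", "uni-c.edu"), ("hub.org", "uni-a.edu"),
        ("uni-c.edu", "uni-b.edu"),
    ])

    print("indegree:   ", nx.in_degree_centrality(G))    # links received
    print("outdegree:  ", nx.out_degree_centrality(G))   # links given
    print("closeness:  ", nx.closeness_centrality(G))    # based on shortest path distances
    print("betweenness:", nx.betweenness_centrality(G))  # how often a site lies between others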
HNA techniques have been promisingly applied in case studies dealing with topics such as e-commerce; social movements; and interpersonal, interorganizational, and international communication.
But, can links be used as proxies for
scientific communication flows and as
building blocks of new, web-inclusive
scientometric indicators of research
prominence?
In 1995, Bossy suggested that the digital
network layer offered an unprecedented
source of information on the scholarly
sociocognitive activities that predate
publication output.
It meant moving from bibliographic citations to webpages, websites, and links from the webpages of universities, departments, research institutes, and individual scientists.
At first, AltaVista was used.
In 1995, an algorithm of co-word mapping by Prabowo and Thelwall was used by Leydesdorff and Curran to identify the connectivity patterns of the Triple Helix.
The Web Impact Factor (WIF) of a site or area of the Web, introduced by Ingwersen in 1998, may be defined as a measure of the frequency with which the average webpage of the site has been linked to at a certain point in time.
S is the site.
I is the total number of pages linking to S (including self-links).
P is the number of webpages published in S that are indexed by the search engine.
Example: WIF(S) = I / P = 100 / 50 = 2
But where do link data come from? How
reliable and valid are the tools for gathering
them?
Commercial search engines don't return a reliable and consistent picture of global and local connectivity rates over time because:
Search engines crawl and index only a small
portion of the World Wide Web. There is an
“invisible web”.
Different search engines use distinct crawling
algorithms.
Overlapping between competing search engines’
databases is small.
The WIF is also not a very good bibliometric
measure, due to content variability and
structural instability:
The link counts can be spuriously inflated by a huge number of unlinkable files, and a document may appear as a single webpage or be split across several.
Webpages also lack coding standardization, and their half-life is variable.
For longitudinal studies, www.archive.org
can be used.
Since 2000 the Academic Web Link Database
Project has been collecting link data relative
to the academic web spaces of New Zealand,
Australia, UK, Spain, China and Taiwan.
Mike Thelwall's Alternative Document Models (ADMs) allow modulating link analysis by truncating the linking URLs at a level higher than that of the individual web page (see the sketch after this list):
Directory
Domain
Site
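A minimal sketch of such URL truncation, with a hypothetical helper and example URL rather than Thelwall's actual tooling:

    from urllib.parse import urlsplit

    def truncate(url: str, level: str) -> str:
        """Collapse a linking URL to the chosen Alternative Document Model level."""
        parts = urlsplit(url)
        if level == "page":
            return url
        if level == "directory":
            return parts.netloc + parts.path.rsplit("/", 1)[0] + "/"
        if level == "domain":
            return parts.netloc
        if level == "site":
            return ".".join(parts.netloc.split(".")[-2:])   # crude heuristic
        raise ValueError(level)

    url = "http://www.example.edu/physics/staff/page.html"   # hypothetical URL
    for level in ("page", "directory", "domain", "site"):
        print(level, "->", truncate(url, level))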
The Webometrics Ranking of World Universities (www.webometrics.info) was launched in 2004 in Spain.
It ranks web domains of academic and
research organizations according to volume,
visibility and impact of their content.
It applies the WIF to capture the ratio between visibility, measured by the inlink rates returned by commercial search engines, and size, measured by the number of hosted web pages.
Two additional measures, dubbed Rich file
and Scholar Indexes, capture the volume of
potentially relevant academic output in
standard formats:
Adobe Portable Document Format .pdf
Adobe PostScript .ps
Microsoft Word Document .doc
Microsoft Powerpoint .ppt
And the number of papers and citations for
each academic domain in Google Scholar.
Thelwall and colleagues’ methodology of link
analysis also investigates the patterns of
connections between groups of academic
sites at the national level.
University websites have been found to be
relatively more stable than other cyber-
traces in longitudinal studies.
But we have to remember that web visibility
and academic performance are different
affairs.
Bibliometricians usually resort to direct surveys of webmasters' reasons to link, or to hyperlink context and content analysis, in order to investigate the psychological side of the link generation process.
Links usually are meant to facilitate navigation
toward quarters of loosely structured and
generically useful information, or to suggest
related resources.
But links alone are not sufficient to pin down communication patterns on the Web, and their statistical analysis will probably follow the same path as citation analysis.