2010 CRC PhD Student Conference
Supporting the Exploration of Research Spaces
Chwhynny Overbeeke
c.overbeeke@open.ac.uk
Supervisors: Enrico Motta, Tom Heath, Paul Mulholland
Department: Knowledge Media Institute
Status: Full-time
Probation viva: Before
Starting date: December 2009
1 Introduction
It is often hard to make sense of what exactly is going on in the research community. What topics
or researchers are new and emerging, gaining popularity, or disappearing? How does this happen
and why? What are the key publications or events in a particular area? How can we understand
whether geographical shifts are occurring in a research area? There are several tools available that
allow users to explore different elements of a research area. However, making sense of the dynamics
of a research area is still a very challenging task. This leads to my research question:
How can we improve the level of support for people to explore the dynamics of a research
community?
2 Framework and Background
In order to answer this question we first need to identify the different elements, relations and
dimensions that define a research area and put them into a framework. We then need to find
existing tools that address these elements, and categorize them according to our framework in
order to identify gaps in the current level of support. Some elements we already identified are:
people, institutions and organizations, events, activity, popularity, publications, citations, time,
geography, keywords, studentships, funding, impact, and technologies.
The people element is about the researchers that are or were present in the research community,
whilst the institutions and organizations element refers to the research groups, institutions, and
organizations that are active within an area of research, and the affiliations the people within the
community have with them. Events can be workshops, conferences, seminars, competitions, or any
other kind of research-related happening. EventSeer1 is a service that aggregates all the calls for
papers and event announcements that float around the web into one common, searchable tool. It
keeps track of events, people, topics and organizations, and lists the most popular people, topics,
and organizations per week.
1 http://www.eventseer.net
The activity element refers to how active the researchers, institutions, and organizations are within
the field, for instance event attendance or organization, or the number and frequency of publications
and events. A tool that can be used to explore this is Faceted DBLP2, an interface to the
DBLP server3, which provides bibliographic information on major computer science journals and
proceedings [Ley 2002]. Faceted DBLP starts from a keyword query and shows the result set along
with a set of facets, e.g. distinguishing publication years, authors, venues, and publication types.
The user can characterize the result set in terms of main research topics and filter it according to
certain subtopics. GrowBag graphs are available for keywords (number of hits/coverage).
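The faceted view described above can be illustrated with a minimal Python sketch. It is not Faceted DBLP's implementation: the records, the field names, and the helpers `facet_counts` and `filter_by` are all hypothetical, chosen only to show how a result set can be summarized per facet and then narrowed down.

```python
from collections import Counter

# Hypothetical result set for one keyword query; all records are made up.
results = [
    {"title": "Paper A", "year": 2007, "venue": "JCDL", "authors": ["Ann", "Bob"]},
    {"title": "Paper B", "year": 2007, "venue": "VLDB", "authors": ["Ann"]},
    {"title": "Paper C", "year": 2008, "venue": "JCDL", "authors": ["Cy"]},
]

def _values(record, facet):
    """Return the facet values of a record as a list (authors are multi-valued)."""
    value = record[facet]
    return value if isinstance(value, list) else [value]

def facet_counts(records, facet):
    """Count how many records fall under each value of the given facet."""
    counts = Counter()
    for record in records:
        counts.update(_values(record, facet))
    return counts

def filter_by(records, facet, wanted):
    """Keep only the records whose facet contains the wanted value."""
    return [r for r in records if wanted in _values(r, facet)]

year_facet = facet_counts(results, "year")
author_facet = facet_counts(results, "authors")
jcdl_only = filter_by(results, "venue", "JCDL")
```

Each call to `filter_by` returns a smaller result set on which the facet counts can be recomputed, which is the basic interaction loop of faceted browsing.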
Popularity is about the interest that is displayed in a person, institution or organization,
publication, topic, technology, or event. WikiCFP4 is a service that helps organize and share academic
information. Users can browse and add calls for papers per subject category, and can add calls
for papers to their own personal user list. Each call for papers has information on the event name,
date, location, and deadline. WikiCFP also provides hourly updated lists of the most popular
categories, calls for papers, and user lists.
One indicator of topic popularity is the number of publications on a topic. There are many tools
that show the number of publications per topic per year. PubSearch is a fully automatic web mining
approach for the identification of research trends that searches and downloads scientific publications
from web sites that typically include academic web pages [Tho et al. 2003]. It extracts citations
which are stored in the tool’s Web Citation Database which is used to generate temporal document
clusters and journal clusters. These clusters are then mined to find their interrelationships, which
are used to detect trends and emerging trends for a specified research area.
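As a rough illustration of the publications-per-topic-per-year indicator, the following Python sketch builds a topic by year table and flags topics whose counts rise every year. It deliberately does not reproduce PubSearch's citation-based clustering; the data, the function names, and the naive "strictly rising" rule are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical (topic, year) pairs, one per publication; made up for illustration.
publications = [
    ("linked data", 2007), ("linked data", 2008), ("linked data", 2008),
    ("linked data", 2009), ("linked data", 2009), ("linked data", 2009),
    ("expert systems", 2007), ("expert systems", 2007), ("expert systems", 2008),
]

def counts_per_year(pubs):
    """Build a nested table: topic -> year -> number of publications."""
    table = defaultdict(lambda: defaultdict(int))
    for topic, year in pubs:
        table[topic][year] += 1
    return table

def is_rising(yearly_counts, first_year, last_year):
    """Naive trend flag: strictly more publications each year in the window."""
    series = [yearly_counts.get(y, 0) for y in range(first_year, last_year + 1)]
    return all(a < b for a, b in zip(series, series[1:]))

table = counts_per_year(publications)
rising = {topic: is_rising(years, 2007, 2009) for topic, years in table.items()}
```

A real trend detector would need to normalize for the overall growth in publications and handle noisy topic labels; this sketch only shows the counting step.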
Another indicator of popularity is how often a publication or researcher is cited. Citations can
also help identify relations between researchers through analysis of who is citing whom and when,
and what their affiliations are. Publish Or Perish is a piece of software that retrieves and analyzes
academic citations [Harzing and Van der Wal 2008]. It uses Google Scholar5 to obtain raw citations
and presents a wide range of citation metrics, such as the total number of papers and citations,
the average number of citations per paper and per author, the average number of papers per
author and per year, and an analysis of the number of authors per paper.
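A few of the citation metrics listed above can be sketched in Python as follows. The formulas are straightforward averages chosen for illustration and are not claimed to match the exact definitions used by Publish Or Perish; the paper records and the field names (`citations`, `n_authors`, `year`) are hypothetical.

```python
# Hypothetical publication records for one researcher; values are made up.
papers = [
    {"citations": 10, "n_authors": 2, "year": 2005},
    {"citations": 4, "n_authors": 1, "year": 2006},
    {"citations": 0, "n_authors": 3, "year": 2006},
    {"citations": 6, "n_authors": 2, "year": 2008},
]

def citation_metrics(records):
    """Compute simple aggregate citation metrics over a non-empty list of papers."""
    total_papers = len(records)
    total_citations = sum(r["citations"] for r in records)
    years = [r["year"] for r in records]
    # Assumption: the active period runs from the first to the last publication year.
    active_years = max(years) - min(years) + 1
    return {
        "papers": total_papers,
        "citations": total_citations,
        "citations_per_paper": total_citations / total_papers,
        "papers_per_year": total_papers / active_years,
        "authors_per_paper": sum(r["n_authors"] for r in records) / total_papers,
    }

metrics = citation_metrics(papers)
```

Because the raw citation counts come from a search index, any such metric inherits the coverage and duplication issues of that index.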
Topics, interests, and people evolve over time, and the makeup of the research community changes
when people and organizations enter or leave certain research areas or change their direction.
Some topics appear to be more established or densely represented in certain geographical areas,
for instance because a prolific institution is located there and has attracted several experts on a
particular topic, or because many events on a topic are held in that area. AuthorMapper6 is an
online tool for visualizing scientific research. It searches journal articles from SpringerLink7
and allows users to explore the database by plotting the location of authors, research topics and
institutions on a world map. It also allows users to identify research trends through timeline graphs,
statistics and regions.
Keywords are an important indicator of a research area because they are the labels that have been
put on publications or events by the people and organizations within that research area. Google
Scholar is a subset of the Google search index consisting of full-text journal articles, technical
reports, preprints, theses, books, and web sites that are deemed ‘scholarly’ [Noruzi 2005, Harzing and
Van der Wal 2008]. Google Scholar has crawling and indexing agreements with several publishers.
The system is based on keyword search only and its results are organized by a closely guarded
relevance algorithm. The ‘cited-by-x’ feature allows users to see by whom a publication was cited,
and where.
2 http://dblp.l3s.de/
3 http://dblp.uni-trier.de/
4 http://www.wikicfp.com/
5 http://scholar.google.com/
6 http://www.authormapper.com/
7 http://www.springerlink.com/
The availability of new studentships indicates that a research area is trying to attract new people.
This may mean that the area is hoping to expand, change direction, or become more established.
The availability of funding within a research area or topic is an indicator of the interest that
is displayed in it, or the level of importance it is deemed to have at a particular time. The
Postgraduate Studentships web site8 offers a search engine as well as a browsable list of study or
funding opportunities organized by subjects, masters, PhD/doctoral and professional doctorates
and a browsable list of general funders, funding universities and featured departments. The site
also lists open days and fairs.
The level of impact of the research carried out by a research group, institution, organization or
individual researcher leads to their establishment in the research community, which in turn could
lead to more citations and event attendance. The technologies element refers to the technologies
that are developed within an area of research, and their impact, popularity and establishment.
Research impact is addressed on a small scale by Scopus (http://www.scopus.com/), currently
a preview-only tool which, amongst other things, identifies and matches an organization with all
its research output, tracks how primary research is practically applied in patents, and tracks the
influence of peer-reviewed research on web literature. It covers nearly 18,000 titles from over 5,000
publishers, 40,000,000 records, scientific web pages, and articles-in-press. A tool that ranks
publications is DBPubs, a system for analyzing and exploring the content of database publications by
combining keyword search with OLAP-style aggregations, navigation, and reporting [Baid et al.
2008]. It performs keyword search over the content of publications. The metadata (title, author,
venue, year et cetera) provide static OLAP dimensions, which are combined with dynamic
dimensions discovered from the content of the publications in the search result, such as frequent phrases,
relevant phrases and topics. Based on the link structure between documents (i.e. citations),
publication ranks are computed, which are aggregated to find seminal papers, discover trends, and rank
authors.
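The idea of ranking publications from the citation link structure and then aggregating the ranks per author can be sketched as below. The text does not say which ranking algorithm DBPubs uses, so a PageRank-style power iteration is swapped in here purely as an assumption; the citation graph, the author assignments, and the function names are hypothetical.

```python
# Hypothetical citation graph: each paper maps to the papers it cites.
citations = {
    "p1": ["p3"],
    "p2": ["p3"],
    "p3": [],
    "p4": ["p1", "p3"],
}
# Hypothetical authorship; made up for illustration.
authors = {"p1": ["Ann"], "p2": ["Bob"], "p3": ["Ann", "Cy"], "p4": ["Bob"]}

def pagerank(graph, damping=0.85, iterations=50):
    """PageRank-style score: a paper is important if important papers cite it."""
    papers = set(graph) | {c for cited in graph.values() for c in cited}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = graph.get(p, [])
            if cited:
                share = damping * rank[p] / len(cited)
                for c in cited:
                    new_rank[c] += share
            else:
                # A paper citing nothing spreads its rank evenly over all papers.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank

def author_ranks(paper_rank, authorship):
    """Aggregate paper ranks per author by summing over their papers."""
    totals = {}
    for paper, names in authorship.items():
        for name in names:
            totals[name] = totals.get(name, 0.0) + paper_rank[paper]
    return totals

paper_rank = pagerank(citations)
top_paper = max(paper_rank, key=paper_rank.get)
top_author = max(author_ranks(paper_rank, authors), key=author_ranks(paper_rank, authors).get)
```

Summing ranks favours prolific authors; averaging or taking the maximum per author would be equally defensible aggregation choices.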
Finally, we would like to discuss a more generic tool, DBLife9 [DeRose et al. 2007, Goldberg and
Andrzejewski 2007, Doan et al. 2006], which is a prototype of a dynamic portal of current
information for the database research community. It automatically discovers and revisits web pages and
resources for the community, extracts information from them, and integrates it to present a unified
view of people, organizations, papers, talks, et cetera. For example, it provides a chronological
summary, has a browsable list of organizations and conferences, and it summarizes interesting new
facts for the day such as new publications, events, or projects. It also provides community statistics
including top cited people, top h-indexed people, and top cited publications. DBLife is currently
unfinished and does not have full functionality, but from the prototype alone one can conclude it
will most likely address quite a few elements from our framework.
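For readers unfamiliar with the h-index mentioned among the community statistics: a researcher has index h if h of their papers have at least h citations each. The short Python sketch below computes it from a list of citation counts; the function name and the sample counts are illustrative and say nothing about how DBLife itself derives the figure.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h

# Hypothetical citation counts for one researcher's papers.
example = h_index([25, 8, 5, 3, 3])
```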
8 http://www.postgraduatestudentships.co.uk/
9 http://dblife.cs.wisc.edu/
3 Methodology
In order to find out which key problems people encounter when trying to make sense of the
dynamics of a research area, we will carry out an empirical study consisting of a task and a
short questionnaire.
The 30 to 40 minute task is to be carried out by around 10 to 12 subjects who will be asked to
investigate a research area that is fairly new to them and write a short report on their findings.
The subjects’ actions will be recorded using screen capture software and the subjects themselves
will be videoed for the duration of the task so that the entire exploration process is documented.
The screen capture will show the actions the subjects take and the tools they use to reach their
goal. The video data will show any reactions the subjects may display during their exploration
process, for example confusion or frustration with a tool they are trying to use. The questionnaire
will be filled out by as many subjects as possible, who will be asked to identify the key elements
of a research area which they would take into account when planning PhD research. In the
questionnaire people will be made aware of the framework we created, but we will allow for open
answers and additions to the existing framework.
The technical study will consist of an overview, comparison, critical review, and gap analysis of
existing tools that support the exploration of the research community. It will link those tools to
our framework in order to find out to what extent the several elements are covered by the existing
tools.
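The mapping of tools to framework elements can be pictured as a simple coverage table, sketched below in Python. The element list is taken from Section 2; the tool-to-element assignments are placeholders loosely based on where Section 2 introduces each tool, not findings of the study, and the function names are hypothetical.

```python
# Framework elements as listed in Section 2.
ELEMENTS = [
    "people", "institutions and organizations", "events", "activity",
    "popularity", "publications", "citations", "time", "geography",
    "keywords", "studentships", "funding", "impact", "technologies",
]

# Placeholder coverage for illustration only; NOT results of the gap analysis.
coverage = {
    "EventSeer": {"events", "people", "institutions and organizations"},
    "Faceted DBLP": {"activity", "publications", "time"},
    "WikiCFP": {"popularity", "events"},
    "Publish Or Perish": {"citations", "publications"},
    "AuthorMapper": {"geography", "time"},
}

def uncovered(elements, tool_coverage):
    """Return the framework elements that no tool in the table addresses."""
    covered = set().union(*tool_coverage.values())
    return [e for e in elements if e not in covered]

def tools_per_element(elements, tool_coverage):
    """For each element, list the tools that address it, in alphabetical order."""
    return {
        e: sorted(t for t, covered in tool_coverage.items() if e in covered)
        for e in elements
    }

gaps = uncovered(ELEMENTS, coverage)
by_element = tools_per_element(ELEMENTS, coverage)
```

Elements with an empty tool list are candidate gaps, while elements covered only by separate tools point at the lack of integration hypothesized in this section.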
At this stage we will have highlighted the key elements that define a research area, identified gaps
in the existing support for the exploration of the research community, and gathered evidence to
support this by mapping existing tools to our framework, carrying out a practical task, and sending
out a questionnaire. We will then aim to improve support for people to explore the dynamics of
the research community by implementing novel tools, addressing the gaps that have emerged from
these studies. Our hypothesis is that at least some of these gaps are due to the lack of integration
between different types of data covering different elements of a research area.
References
Baid, A., Balmin, A., Hwang, H., Nijkamp, E., Rao, J., Reinwald, B., Simitsis, A., Sismanis, Y.,
and Van Ham, F. (2008). DBPubs: Multidimensional Exploration of Database Publications.
Proceedings of the VLDB Endowment, 1(2):1456–1459.
DeRose, P., Shen, W., Chen, F., Lee, Y., Burdick, D., Doan, A., and Ramakrishnan, R. (2007).
DBLife: A Community Information Management Platform for the Database Research
Community. In Weikum, G., Hellerstein, J., and Stonebraker, M., editors, Proceedings of the 3rd Biennial
Conference on Innovative Data Systems Research (CIDR 2007), Asilomar, California, USA.
Diederich, J. and Balke, W. (2008). FacetedDBLP - Navigational Access for Digital Libraries.
Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL), 4(1).
Diederich, J., Balke, W., and Thaden, U. (2007). Demonstrating the Semantic GrowBag:
Automatically Creating Topic Facets for FacetedDBLP. In Proceedings of the ACM IEEE Joint
Conference on Digital Libraries (JCDL 2007), Vancouver, British Columbia, Canada.
Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M., and Shen,
W. (2006). Community Information Management. IEEE Data Engineering Bulletin, Special Issue
on Probabilistic Databases, 29.
Goldberg, A. and Andrzejewski, D. (2007). Automatic Research Summaries in DBLife. CS 764:
Topics in Database Management Systems.
Harzing, A. and Van der Wal, R. (2008). Google Scholar as a New Source for Citation Analysis.
Ethics in Science and Environmental Politics, 8:61–73.
Ley, M. (2002). The DBLP Computer Science Bibliography: Evolution, Research Issues,
Perspectives. In Proceedings of the 9th International Symposium (SPIRE 2002), pages 481–486, Lisbon,
Portugal.
Noruzi, A. (2005). Google Scholar: The New Generation of Citation Indexes. Libri, 55:170–180.
Tho, Q., Hui, S., and Fong, A. (2003). Web Mining for Identifying Research Trends. In Sembok,
T., Badioze Zaman, H., Chen, H., Urs, S., and Myaeng, S., editors, Proceedings of the 6th
International Conference on Asian Digital Libraries (ICADL 2003), pages 290–301, Kuala Lumpur,
Malaysia. Springer.