1st meeting of PG PUSHPIN

Project group

PUSHPIN
Supporting Scholarly Awareness
in Publications and Social Networks

University of Paderborn
Computer Science Education Group
Wolfgang Reinhardt

CLASSIC RESEARCH
+
WEB 2.0 / SEMANTIC WEB / SOCIAL NETWORKS
+
NEW METHODS AND METHODOLOGIES
=
RESEARCH 2.0 & PG PUSHPIN

Wolfgang Reinhardt - wolle@upb.de - Universität Paderborn

GOALS OF THE PROJECT GROUP
• Data Mining in scientific publications

• Who’s writing about what? Who’s writing with whom?

• Clustering & similarity measures, Recommendations, Experts

• Connections to Social Networking sites (ginkgo)

• visual analytics, visualizations

• Extension of the knowAAN architecture & analysis of large
data sets

RESULTS OF PG KNOWAAN

• Java-based backend that allows automatic analysis of
publications (metadata extraction, text analysis, relations
between publications, and much more)

• Clustering and similarity detection

• currently first tests with Hadoop & Mahout

• Rails-, JavaScript-, CSS-based frontend for navigation

• Examples:

CO-AUTHOR NETWORKS


LOCATION OF AUTHORS


WORD CLOUDS


BIBLIOMETRIC NETWORKS


GINKGO
• conference management tool + social network

• Goal:

• check submitted publications for plagiarized content, topical
and social connections

• Recommendations (users, events, publications)

http://ginkgosem.com

PEOPLE

• Prof. Johannes Magenheim

• Wolfgang Reinhardt

• Tobias Varlemann

GOALS OF A PROJECT GROUP
• self-organization to the greatest extent

• systematic assignment of roles and responsibilities

• finding and fostering special talents

• process-oriented personnel placement, as in industry

• regular presentations of work progress

• creation of interim and final reports

• working at the cutting edge of science

TIMEFRAME

• 18.10.2011 - 31.10.2012 (54 weeks)

• 30 ECTS = 900 hours of work (approx. 17h / week)

• Seminar phase until January 2012

• Creativity workshops in January

• Core implementation phase from February 2012 onwards

• agile development (4 milestones, 4 iterations per milestone)

REQUIREMENTS

• active participation

• check UPB mails at least daily

• good communication skills

• team work

• creativity in design and implementation

• testing ;)

TOOLS

• SVN and Trac

• Hashtag: #pgpushpin

• Blog

• Twitter (if you like)

• Mendeley for exchange of research papers

• Delicious for social bookmarks

SEMINAR PHASE

• each one of you works on one topic

• theoretical framework, applications, prototypes

• regular meetings with supervisors

• regular blogging at http://pgpushpin.wordpress.com

• presentation in mid-January 2012 (25 minutes plus discussion)

• article due at the end of January 2012 (approx. 16-24 pages)

1. HTML5 and JavaScript Frameworks
2. Visual Analytics
3. Agile Software Development in Small Teams
4. Trend detection and visualization
5. Text processing
6. Metadata extraction from research papers
7. Text similarities
8. Distributed computing with Hadoop 1
9. Distributed computing with Hadoop 2
10. Developing Multitouch Table Applications
11. Clustering of text documents
12. Plagiarism detection
13. Social Network Analysis
14. Faceted Search User Interfaces
15. Browser-based visualization of large networks
16. Scientific recommender systems

ALL TOPICS ARE FOCUSED ON SCHOLARLY OUTPUT

E.G. SCIENTIFIC PAPERS, RESEARCHER COLLABORATION

HTML5 AND JAVASCRIPT FRAMEWORKS
• development of sustainable web applications (responsive
design)

• current and coming standards

• web workers, local storage, WebGL, server-side JS, web
sockets

• Visualizations, Word Clouds, developments over time

• JavaScript frameworks for visualizations, graphs etc.

VISUAL ANALYTICS
• information / scientific visualizations that allow reasoning

• visual analytics and their application to research

• cartography / geovisualization

• flow visualization

• diagrammatic reasoning

• state of the art and mockups for new developments

• tools/frameworks for realization (browser-based)

AGILE SOFTWARE DEVELOPMENT IN SMALL TEAMS
• agile software development and project management in small
teams

• application to the project group (roles and requirements)

• TDD, BDD, FDD

• Scrum, eXtreme programming, Kanban

• Pair Programming

TREND DETECTION AND VISUALIZATION & SEARCH
• trend spotting and visualization & forecasting

• which topics are gaining ground and which are on the decline

• which networks are expanding, which are saturated

• ThemeRiver - StreamGraph visualizations

• Custom Search Applications (Solr and its extensions)

• semantic search, linked data approaches
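
One simple way to approach the question of which topics are gaining ground and which are on the decline is to fit a straight line to the number of publications per year and look at the sign of its slope. The following is only a minimal sketch under that assumption; the function names, the threshold and the yearly counts are hypothetical and not taken from the project.

```python
def trend_slope(counts_by_year):
    """Least-squares slope of publication counts over the years."""
    years = sorted(counts_by_year)
    n = len(years)
    if n < 2:
        return 0.0
    mean_x = sum(years) / n
    mean_y = sum(counts_by_year[y] for y in years) / n
    cov = sum((y - mean_x) * (counts_by_year[y] - mean_y) for y in years)
    var = sum((y - mean_x) ** 2 for y in years)
    return cov / var


def classify_trend(counts_by_year, threshold=0.5):
    """Label a topic as rising, declining or stable (threshold is an assumption)."""
    slope = trend_slope(counts_by_year)
    if slope > threshold:
        return "rising"
    if slope < -threshold:
        return "declining"
    return "stable"


# Hypothetical yearly paper counts for two topics.
topics = {
    "topic A": {2008: 3, 2009: 7, 2010: 12, 2011: 20},
    "topic B": {2008: 15, 2009: 11, 2010: 8, 2011: 4},
}
for name, counts in topics.items():
    print(name, classify_trend(counts))
```

A linear fit ignores seasonality and bursts, so it is a baseline to compare real trend-detection methods against, not a replacement for them.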

TEXT PROCESSING

• PDF text extraction (get rid of headers and footers)

• Part-of-speech detection, lemmatizing text, stemming

• classification, topic extraction and knowledge discovery
(untrained)

• LDA from Mahout

• usage of Apache OpenNLP & Apache Mahout for
prototypes

METADATA EXTRACTION FROM RESEARCH PAPERS
• How to best extract metadata from research papers?

• ParsCit and others (?)

• Conditional Random Fields (CRF++)

• Support Vector Machines

• good mathematical knowledge needed

• only selected information is relevant

• extract geo locations from papers

TEXT SIMILARITIES

• Vector Space Model & Term Document Matrix

• LSA / LSI with SVD

• methods for calculating text-based similarities

• possibility for live calculations

• temporary files

• usage of Apache Mahout for prototypes
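
To make the vector space model concrete: each document becomes a vector of term counts, and the similarity of two documents is the cosine of the angle between their vectors. The sketch below uses plain Python and raw term frequencies; it is an illustration only, not the Mahout-based prototype, and the sample titles are invented.

```python
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector from lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented paper titles, used only as sample input.
titles = [
    "awareness in research networks",
    "awareness support in social networks",
    "multitouch table applications",
]
vectors = [term_vector(t) for t in titles]
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 3))
```

A real prototype would normally add stop-word removal, stemming and TF-IDF weighting before comparing vectors.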

DISTRIBUTED COMPUTING WITH HADOOP 1
• MapReduce

• Hadoop

• HBase

• HDFS
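
The MapReduce idea can be tried out without a cluster. The sketch below simulates the three phases (map, shuffle, reduce) for a word count in a single Python process; it does not use the Hadoop API, and all function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a dict of documents."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count({"d1": "hadoop stores data", "d2": "hadoop processes data"}))
```

On a real cluster the map and reduce calls run in parallel on different nodes, and the input and output live in HDFS rather than in memory.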


DISTRIBUTED COMPUTING WITH HADOOP 2
• MapReduce

• Hadoop

• Hive Data Warehousing

• Job Orchestration (e.g. with Zookeeper)

• Pig Data Flow


DEVELOPING MULTITOUCH TABLE APPLICATIONS
• http://www.youtube.com/watch?v=f1X5ffRrde8

• C# and .NET 4.0, Visual Studio 2010

• WPF and Surface SDK

• Fiducials

• build simulation, mockups of possible applications,
state-of-the-art presentation

• http://www.microsoft.com/silverlight/pivotviewer/

CLUSTERING OF TEXT DOCUMENTS
• Methods for analyzing large collections of texts

• k-means, single-link, complete-link, canopy

• visualization opportunities

• how to add documents to a large clustering
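
As a starting point for k-means, the sketch below clusters small numeric vectors in plain Python. It assumes the documents have already been turned into vectors and that the initial centroids are passed in explicitly, which keeps the result reproducible; the sample points are made up. The `assign` helper shows one simple answer to adding a document to an existing clustering: put it into the cluster with the nearest centroid.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[assign(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Made-up 2D "document vectors" and initial centroids.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(assign((9, 9), centroids))  # cluster index for a newly added document
```

Assigning new documents without re-running the clustering is cheap, but the centroids drift out of date as the collection grows, so periodic re-clustering is still needed.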


PLAGIARISM DETECTION

• How to detect potentially plagiarized content?

• Ethical discussion on (self-)plagiarism

• breaking text down into elements (sections, paragraphs, sentences)

• n-grams

• internal and external plagiarism detection
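
A common first step for external plagiarism detection is to compare the word n-grams of two texts. The sketch below computes the Jaccard overlap of word trigrams; the choice of n = 3, the whitespace tokenization and the sample sentences are assumptions made for illustration.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams of a text (lowercased, whitespace tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(text_a, text_b, n=3):
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Invented sentences: the second one reuses most of the first.
source = "social network analysis offers measures for research networks"
suspect = "social network analysis offers measures for co-author graphs"
print(round(ngram_overlap(source, suspect), 3))
```

A high overlap only flags a passage for manual inspection; quotations and common phrases also produce shared n-grams.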

SOCIAL NETWORK ANALYSIS

• Social Network Theory

• measures from SNA

• existing examples of research applications

• bibliometrics and scientometrics

• take real conference series as example
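
As an example of an SNA measure applied to publications, the sketch below builds a co-author network from author lists and computes the degree centrality of each author (number of distinct co-authors divided by the number of other authors in the network). The author names and papers are placeholders, not real data.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_graph(papers):
    """Undirected graph: authors are nodes, joint papers create edges."""
    graph = defaultdict(set)
    for authors in papers:
        for author in authors:
            graph[author]  # make sure single-author papers still add a node
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Share of the other nodes each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1)
            for node, neighbours in graph.items()}


# Placeholder author lists, one list per paper.
papers = [
    ["Author A", "Author B"],
    ["Author B", "Author C"],
    ["Author D"],
]
for author, score in sorted(degree_centrality(coauthor_graph(papers)).items()):
    print(author, round(score, 2))
```

Other measures such as betweenness or closeness centrality need shortest paths and are better taken from a graph library.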

FACETED SEARCH & INTERFACE EVALUATION
• Best practices and design recommendations

• frameworks for development

• encapsulation / APIs

• only work on JSON data & no direct DB access

• Java / ASP.NET / Seam ...

• own prototype

BROWSER-BASED VISUALIZATION OF LARGE NETWORKS
• level of detail

• WebGL, web workers

• Gephi

• visualize properties

• allow faceted search

• should work on tablets

SCIENTIFIC RECOMMENDER SYSTEMS
• state of the art

• item-based and collaborative filtering / hybrid
recommenders

• algorithms, visualizations

• existing applications in research
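
To illustrate item-based collaborative filtering: two papers are considered similar if the same users rated them similarly, and a user is recommended unseen papers that are similar to the ones they already rated. The sketch below ranks candidates by the sum of similarity times rating, which is one simple scoring choice among several; the users, papers and ratings are invented.

```python
import math
from collections import defaultdict


def item_vectors(ratings):
    """Invert user -> {item: rating} into item -> {user: rating}."""
    items = defaultdict(dict)
    for user, user_ratings in ratings.items():
        for item, score in user_ratings.items():
            items[item][user] = score
    return items


def cosine(a, b):
    """Cosine similarity of two sparse rating vectors."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(ratings, user, top_n=3):
    """Rank unseen items by similarity-weighted ratings of the user's items."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, vector in items.items():
        if candidate in seen:
            continue
        score = sum(cosine(vector, items[item]) * rating
                    for item, rating in seen.items())
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented ratings of papers by users.
ratings = {
    "user1": {"paper1": 5},
    "user2": {"paper1": 5, "paper2": 4},
    "user3": {"paper3": 3},
}
print(recommend(ratings, "user1"))
```

A hybrid recommender would combine such a score with content-based signals, for example the text similarity of the papers.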


NEXT STEPS
• vote for three topics by Wednesday, 8 pm

• mail with favorite topic, 2nd and 3rd place

• decision on Friday

• create WordPress, Delicious and Mendeley accounts

• final presentation of PG knowAAN this Thursday, 4:45 pm
in F0.231

• first meetings with supervisors in the next two weeks

wolfgang reinhardt university of paderborn

social media sna
twitter recommendations
awareness
research networks
bibliometrics
artefact-actor-networks
ginkgo
research 2.0
www.isitjustme.de www.ginkgosem.com
@wollepb @wolfgang.reinhardt
