Slides from the first meeting of the project group PUSHPIN at the University of Paderborn. I cover the overall aims of the project group and the topics for the seminar phase.
1. Project group
PUSHPIN
Supporting Scholarly Awareness
in Publications and Social Networks
University of Paderborn
Computer Science Education Group
Wolfgang Reinhardt
2. CLASSIC RESEARCH
+
WEB 2.0 / SEMANTIC WEB / SOCIAL NETWORKS
+
NEW METHODS AND METHODOLOGIES
=
RESEARCH 2.0 & PG PUSHPIN
Wolfgang Reinhardt - wolle@upb.de - Universität Paderborn
3. GOALS OF THE PROJECT GROUP
• Data Mining in scientific publications
• Who’s writing about what? Who’s writing with whom?
• Clustering & similarity measures, Recommendations, Experts
• Connections to Social Networking sites (ginkgo)
• visual analytics, visualizations
• Extension of the knowAAN architecture & analysis of large data sets
4. RESULTS OF PG KNOWAAN
• Java-based backend that allows automatic analysis of publications (metadata extraction, text analysis, relations between publications, and many more)
• Clustering and similarity detection
• first tests with Hadoop & Mahout are currently under way
• Rails-, JavaScript-, CSS-based frontend for navigation
• Examples:
5. CO-AUTHOR NETWORKS
6. LOCATION OF AUTHORS
13. GOALS OF A PROJECT GROUP
• self-organization to the greatest extent
• systematic assignment of roles and responsibilities
• finding and fostering special talents
• process-oriented staffing, as in industry
• regular presentations of work progress
• creation of interim and final reports
• working at the cutting edge of science
14. TIMEFRAME
• 18.10.2011 - 31.10.2012 (54 weeks)
• 30 ECTS = 900 hours of work (approx. 17h / week)
• Seminar phase until January 2012
• Creativity workshops in January
• Core implementation phase from February 2012 onwards
• agile development (4 milestones, 4 iterations per milestone)
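The workload figure on this slide can be checked with a few lines of Python. This is only a sanity check: it assumes 30 hours per ECTS point (which is what "30 ECTS = 900 hours" implies) and counts plain calendar weeks without holidays.

```python
from datetime import date

# Dates and credit points as given on the slide.
start = date(2011, 10, 18)
end = date(2012, 10, 31)
ects = 30
hours_per_ects = 30  # assumption implied by "30 ECTS = 900 hours"

total_hours = ects * hours_per_ects
weeks = (end - start).days / 7
hours_per_week = total_hours / weeks

print(f"{weeks:.1f} weeks, about {hours_per_week:.1f} h per week")
```

The result is slightly below 17 hours per week, which matches the rounded figure on the slide.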
15. REQUIREMENTS
• active participation
• check UPB mails at least daily
• good communication skills
• team work
• creativity in design and implementation
• testing ;)
16. TOOLS
• SVN and Trac
• Blog
• Twitter (if you like), hashtag #pgpushpin
• Mendeley for exchange of research papers
• Delicious for social bookmarks
18. SEMINAR PHASE
• each one of you works on one topic
• theoretical framework, applications, prototypes
• regular meetings with supervisors
• regular blogging at http://pgpushpin.wordpress.com
• presentation in mid January 2012 (25 minutes plus discussion)
• article due at the end of January 2012 (approx. 16-24 pages)
20. 1. HTML5 and JavaScript Frameworks
2. Visual Analytics
3. Agile Software Development in Small Teams
4. Trend detection and visualization
5. Text processing
6. Metadata extraction from research papers
7. Text similarities
8. Distributed computing with Hadoop 1
9. Distributed computing with Hadoop 2
10. Developing Multitouch Table Applications
11. Clustering of text documents
12. Plagiarism detection
13. Social Network Analysis
14. Faceted Search User Interfaces
15. Browser-based visualization of large networks
16. Scientific recommender systems
21. ALL TOPICS ARE FOCUSED ON SCHOLARLY OUTPUT, E.G. SCIENTIFIC PAPERS, RESEARCHER COLLABORATION
22. HTML5 AND JAVASCRIPT FRAMEWORKS
• development of sustainable web applications (responsive design)
• current and coming standards
• web workers, local storage, WebGL, server-side JS, web sockets
• visualizations, word clouds, developments over time
• JavaScript frameworks for visualizations, graphs etc.
23. VISUAL ANALYTICS
• information / scientific visualization that allows reasoning
• visual analytics and their application to research
• cartography / geovisualization
• flow visualization
• diagrammatic reasoning
• state of the art and mockups for new developments
• tools/frameworks for realization (browser-based)
24. AGILE SOFTWARE DEVELOPMENT IN SMALL TEAMS
• agile software development and project management in small teams
• application to the project group (roles and requirements)
• TDD, BDD, FDD
• Scrum, eXtreme programming, Kanban
• Pair Programming
25. TREND DETECTION AND VISUALIZATION & SEARCH
• trend spotting and visualization & forecasting
• which topics are gaining ground and which are on the decline
• which networks are expanding, which are saturated
• ThemeRiver - StreamGraph visualizations
• Custom Search Applications (Solr and its extensions)
• semantic search, linked data approaches
26. TEXT PROCESSING
• PDF text extraction (get rid of headers and footers)
• Part-of-speech detection, lemmatizing text, stemming
• classification, topic extraction and knowledge discovery (untrained)
• LDA from Mahout
• usage of Apache OpenNLP & Apache Mahout for prototypes
27. METADATA EXTRACTION FROM RESEARCH PAPERS
• How to best extract metadata from research papers?
• ParsCit and others (?)
• Conditional Random Fields (CRF++)
• Support Vector Machines
• only selected information is relevant
• good mathematical knowledge needed
• extract geo locations from papers
28. TEXT SIMILARITIES
• Vector Space Model & Term Document Matrix
• LSA / LSI with SVD
• methods for calculating text-based similarities
• possibility for live calculations
• temporary files
• usage of Apache Mahout for prototypes
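As a rough illustration of the Vector Space Model bullet, the sketch below turns each document into a term-frequency vector and compares two vectors by cosine similarity. It is a minimal sketch in plain Python, not the Mahout implementation; the tokenizer (lowercasing, whitespace split) and the sample texts are assumptions made for the example.

```python
import math
from collections import Counter

def term_vector(text):
    """Term-frequency vector of one document (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical titles standing in for full paper texts.
p1 = term_vector("clustering of research papers")
p2 = term_vector("clustering of text documents")
p3 = term_vector("multitouch table applications")

print(cosine_similarity(p1, p2))  # shares two of four terms
print(cosine_similarity(p1, p3))  # shares no terms
```

The rows of a term-document matrix are exactly these vectors written over a common vocabulary; LSA / LSI would additionally apply an SVD to that matrix before comparing documents.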
29. DISTRIBUTED COMPUTING WITH HADOOP 1
• MapReduce
• Hadoop
• HBase
• HDFS
• usage of Apache Mahout for prototypes
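To make the MapReduce bullet concrete, here is a word count written as separate map, shuffle and reduce steps. This is a single-process simulation in plain Python, not Hadoop code: the function names are made up for the example, and a real Hadoop job would distribute the map and reduce calls across a cluster and read its input from HDFS.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (word, 1) pair per word occurrence."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())

# Hypothetical two-document corpus.
print(word_count({"a": "hadoop mahout hadoop", "b": "mahout"}))
```

Because each map call sees only one document and each reduce call sees only one key, both phases can run in parallel, which is the property Hadoop exploits.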
30. DISTRIBUTED COMPUTING WITH HADOOP 2
• MapReduce
• Hadoop
• Hive Data Warehousing
• Job Orchestration (e.g. with Zookeeper)
• Pig Data Flow
• usage of Apache Mahout for prototypes
31. DEVELOPING MULTITOUCH TABLE APPLICATIONS
• http://www.youtube.com/watch?v=f1X5ffRrde8
• C# and .Net 4.0, Visual Studio 2010
• WPF and Surface SDK
• Fiducials
• build simulation, mockups of possible applications, state-of-the-art presentation
• http://www.microsoft.com/silverlight/pivotviewer/
32. CLUSTERING OF TEXT DOCUMENTS
• Methods for analyzing large collections of texts
• k-means, single-link, complete-link, canopy
• visualization opportunities
• how to add documents to a large clustering
• usage of Apache Mahout for prototypes
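Of the methods listed, k-means is the easiest to sketch. The version below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. It is a toy version under stated assumptions: the points stand in for document vectors, the first k points are used as initial centroids (real implementations pick them more carefully), and the number of iterations is fixed.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain k-means; returns (clusters, centroids)."""
    # Assumption: deterministic initialisation with the first k points.
    centroids = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return clusters, centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters, centroids = kmeans(points, k=2)
print(clusters)
print(centroids)
```

Adding a document to an existing clustering, as asked in the list above, can be approximated by assigning the new vector to its nearest centroid without rerunning all iterations.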
33. PLAGIARISM DETECTION
• How to detect potentially plagiarized content?
• Ethical discussion on (self-)plagiarism
• text breakdown in elements (sections, paragraphs, sentences)
• n-grams
• internal and external plagiarism detection
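One simple way to use n-grams for external plagiarism detection is to compare the sets of word n-grams of a suspicious text and a candidate source. The sketch below uses word trigrams and the Jaccard coefficient; both choices, the naive tokenizer and the sample sentences are assumptions for illustration, and a high score only flags a passage for manual inspection.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams of a text (naive whitespace tokenizer)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(suspicious, source, n=3):
    """Jaccard coefficient of the two n-gram sets, between 0 and 1."""
    a = word_ngrams(suspicious, n)
    b = word_ngrams(source, n)
    if not a or not b:
        return 0.0  # texts shorter than n words cannot be compared
    return len(a & b) / len(a | b)

# Hypothetical sentences: the middle part is copied verbatim.
source = "the quick brown fox jumps over the lazy dog"
suspicious = "a quick brown fox jumps over a sleeping cat"
print(round(ngram_overlap(suspicious, source), 3))
```

Applying the same function to sections, paragraphs or sentences instead of whole texts gives the text breakdown mentioned in the list above.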
34. SOCIAL NETWORK ANALYSIS
• Social Network Theory
• measures from SNA
• existing examples of research applications
• bibliometrics and scientometrics
• take real conference series as example
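A co-author network and one basic SNA measure can be sketched in a few lines. The example builds an undirected graph from author lists and computes degree centrality, i.e. the share of all other authors a person has published with. The author names and papers are invented, and degree centrality is just one of many possible measures.

```python
from itertools import combinations

def coauthor_graph(papers):
    """Undirected co-author graph as a dict: author -> set of co-authors."""
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set())  # keep single authors as nodes
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degree_centrality(graph):
    """Number of co-authors divided by the number of other authors."""
    n = len(graph)
    if n < 2:
        return {author: 0.0 for author in graph}
    return {author: len(neighbours) / (n - 1)
            for author, neighbours in graph.items()}

# Hypothetical author lists of three papers.
papers = [["Ada", "Ben", "Cleo"], ["Ada", "Dan"], ["Ben", "Cleo"]]
graph = coauthor_graph(papers)
centrality = degree_centrality(graph)
for author in sorted(centrality, key=centrality.get, reverse=True):
    print(author, round(centrality[author], 2))
```

For a real conference series the author lists would come from the extracted publication metadata instead of a hand-written list.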
35. FACETED SEARCH & INTERFACE EVALUATION
• Best practices and design recommendations
• frameworks for development
• encapsulation / APIs
• only work on JSON data & no direct DB access
• Java / ASP.NET / Seam ...
• own prototype
36. BROWSER-BASED VISUALIZATION OF LARGE NETWORKS
• level of detail
• WebGL, web workers
• Gephi
• visualize properties
• allow faceted search
• should work on tablets
37. SCIENTIFIC RECOMMENDER SYSTEMS
• state of the art
• item-based and collaborative filtering / hybrid recommenders
• algorithms, visualizations
• existing applications in research
• usage of Apache Mahout for prototypes
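Item-based collaborative filtering can be illustrated with reading lists: two papers are similar if the same users hold both, and a user is recommended papers similar to those already in the library. The sketch below is a toy version in plain Python, not Mahout's recommender API; the similarity (cosine on binary data), the scoring by summed similarities and the sample libraries are assumptions.

```python
import math
from collections import defaultdict

def item_similarities(libraries):
    """Cosine similarity between items, based on shared readers."""
    readers = defaultdict(set)
    for user, items in libraries.items():
        for item in items:
            readers[item].add(user)
    sims = defaultdict(dict)
    for a in readers:
        for b in readers:
            common = len(readers[a] & readers[b])
            if a != b and common:
                sims[a][b] = common / math.sqrt(len(readers[a]) * len(readers[b]))
    return sims

def recommend(user, libraries, top_n=3):
    """Rank unseen items by summed similarity to the user's own items."""
    sims = item_similarities(libraries)
    owned = libraries[user]
    scores = defaultdict(float)
    for item in owned:
        for other, similarity in sims[item].items():
            if other not in owned:
                scores[other] += similarity
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical user libraries (user -> set of paper ids).
libraries = {
    "u1": {"p1", "p2"},
    "u2": {"p1", "p2", "p3"},
    "u3": {"p2", "p4"},
}
print(recommend("u1", libraries))
```

A hybrid recommender would combine such a score with content-based signals, for example text similarity between papers.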
39. NEXT STEPS
• vote for three topics by Wednesday, 8pm
• mail with favorite topic, 2nd and 3rd place
• decision on Friday
• create WordPress, Delicious and Mendeley accounts
• final presentation of PG knowAAN this Thursday, 4.45pm in F0.231
• first meetings with supervisors within the next two weeks
40. Wolfgang Reinhardt, University of Paderborn
social media, SNA, Twitter, recommendations, awareness, research networks, bibliometrics, artefact-actor-networks, ginkgo, research 2.0
www.isitjustme.de www.ginkgosem.com
@wollepb @wolfgang.reinhardt