A profile of Applied Data Analysis Lab (ADA Lab) at ICM, University of Warsaw. It contains a brief overview of our research interests, which are located at or near the intersection of text and data mining and open science. This is a version from December 2014.
A profile of Applied Data Analysis Lab (ADA Lab)
1. Applied Data Analysis Lab – a profile
Dr. Łukasz Bolikowski
ICM, University of Warsaw
December 2014
2. ADA Lab ICM UW
University of Warsaw (UW) is one of the top Polish higher education establishments.
Interdisciplinary Centre for Mathematical and Computational Modelling (ICM)
is a supercomputing and research data centre within the University of Warsaw.
Applied Data Analysis Lab (ADA Lab) is a research group within the ICM.
3. ADA Lab’s Scope of Interest
Scalable Text and Data Mining; Informatics for Open Science
Legal Text Mining
Business Data Mining
Training and Outreach
Scholarly PDF Mining
Map of Science
Persistent IDs
Data Anonymization
4. Legal Text Mining
Building a judgment analysis system for Poland.
Integrating data from common courts, the
Supreme Administrative Court, the Supreme
Court, and the Constitutional Tribunal.
Planning a larger European project with similar
goals (Horizon 2020; currently building a consortium
and defining the scope).
5. Business Data Mining
Leveraging high demand for data science skills.
For-profit projects with business partners.
Usually can’t discuss details due to NDAs.
Our favourite toolset:
R for data understanding and modelling
Apache Spark for analysing larger data sets
D3 for information visualization
CRISP-DM for managing our projects
(Cross-Industry Standard Process for Data Mining)
6. Training and Outreach
“Web-Scale Data Mining and Processing”
(Course at Polish Academy of Sciences)
“Introduction to Text Mining”
(Course at Warsaw School of Data Analysis organised by ICM)
Internal training on Hadoop and Spark
Presentations at Big Data conferences
(Target audience: business partners)
Workshops and internships for talented youth
(In collaboration with Polish Children’s Fund)
7. Scholarly PDF Mining
Extracting metadata, bibliographic references, and full text
from scholarly PDFs. Research direction: semantic annotation
of paragraphs, sentences, and phrases.
CERMINE is open-source software (AGPL license) with users
worldwide, including OpenAIRE.eu, Paperity.org, and the Public
Knowledge Project.
Interfaces for humans and for machines (RESTful API).
Try CERMINE at: http://cermine.ceon.pl/
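A minimal sketch of how a machine client might call the RESTful interface mentioned above. The endpoint path (`extract.do`) and the `application/binary` content type are assumptions taken from the CERMINE project README, not from this presentation, and the helper names `build_extract_request` and `extract_metadata` are hypothetical. The service may have changed or moved since 2014, so treat this as an illustration only.

```python
import urllib.request

# Assumed endpoint (from the CERMINE README, not from this deck).
CERMINE_EXTRACT_URL = "http://cermine.ceon.pl/extract.do"


def build_extract_request(pdf_bytes: bytes,
                          url: str = CERMINE_EXTRACT_URL) -> urllib.request.Request:
    """Prepare, but do not send, a POST request carrying the raw PDF bytes."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def extract_metadata(pdf_path: str) -> str:
    """Send a PDF to the service and return the response body as text."""
    with open(pdf_path, "rb") as handle:
        request = build_extract_request(handle.read())
    # Network call: requires the service to be reachable.
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")
```

Calling `extract_metadata("article.pdf")` would return whatever the service responds with (expected to be an XML record of the extracted metadata and references). Building the request separately from sending it keeps the sketch easy to inspect offline.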
8. Map of Science
A comprehensive map of academia. Mining available
documents and data sets in order to reconstruct the
graph of relations between people, documents, institutions,
topics, and funding sources.
Final result: a publicly available data set.
Why? Better understanding of science. Cool features
in digital libraries and research information systems.
Elements of the map currently developed in OpenAIRE
and OCEAN projects.
9. Persistent IDs
To achieve long-term preservation of research artifacts,
we need an identifier minting and management
scheme that can outlive the organization managing
the scheme.
We are developing a distributed scheme based on
public-key cryptography and P2P networking (a lot
in common with Bitcoin).
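The slide does not describe the scheme in detail, so the following is only a sketch of the general idea of self-certifying identifiers: the identifier is a hash of a public key, and a record is trusted only if it is signed by the matching private key, so no central registry has to vouch for it. To stay within the Python standard library, the example uses Lamport one-time signatures as a stand-in for whatever public-key scheme the actual design uses; each key pair here must sign one record only. The `pid:` prefix and all function names are hypothetical, and the P2P distribution layer is not shown.

```python
import hashlib
import secrets

HASH_BITS = 256  # SHA-256 digest length in bits


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def generate_keypair():
    """Lamport one-time key pair: 256 pairs of secrets and their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32))
               for _ in range(HASH_BITS)]
    public = [(_h(zero), _h(one)) for zero, one in private]
    return private, public


def identifier_for(public) -> str:
    """Derive the identifier from the public key itself (hypothetical format)."""
    flat = b"".join(zero + one for zero, one in public)
    return "pid:" + _h(flat).hex()


def _bits(message: bytes):
    """The 256 bits of the message digest, most significant bit first."""
    digest = _h(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(HASH_BITS)]


def sign(private, message: bytes):
    """Reveal one secret per digest bit. Use each key pair only once."""
    return [private[i][bit] for i, bit in enumerate(_bits(message))]


def verify(public, message: bytes, signature) -> bool:
    if len(signature) != HASH_BITS:
        return False
    return all(_h(signature[i]) == public[i][bit]
               for i, bit in enumerate(_bits(message)))


def resolve(identifier: str, public, record: bytes, signature) -> bool:
    """Accept a record only if the key matches the identifier and signed it."""
    return identifier_for(public) == identifier and verify(public, record, signature)


private_key, public_key = generate_keypair()
pid = identifier_for(public_key)
record = b"location=https://example.org/dataset/1"  # hypothetical record
signature = sign(private_key, record)
print(pid, resolve(pid, public_key, record, signature))
```

Because validity depends only on the key pair, any peer holding the public key and signature can check a record, which is the property that lets identifiers outlive the organization that minted them.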
10. Data Anonymization
Privacy-preserving research data publication is a
cross-cutting issue that applies to various types of
data analysed at ICM: legal judgments, medical
records, social network activity.
11. Thank you for your attention. Let’s stay in touch!
adalab.icm.edu.pl/blog
twitter.com/adalab_icm
linkedin.com/in/bolikowski
twitter.com/bolikowski
lukasz.bolikowski@icm.edu.pl
12. License
© 2014 ICM, University of Warsaw. Some rights reserved. This presentation is available under a CC BY 3.0 license. Materials from the following
sources were used:
https://www.flickr.com/photos/86530412@N02/8213432552 (p. 4, CC BY 2.0)
https://www.flickr.com/photos/124247024@N07/13903385550 (p. 5, CC BY-SA 2.0)
https://www.flickr.com/photos/genista/228006200 (p. 6, CC BY-SA 2.0)
https://www.flickr.com/photos/bohman/210977249 (p. 9, CC BY 2.0)
https://www.flickr.com/photos/hyku/368912557 (p. 10, CC BY 2.0)