2. The global research community generates over 1.5 million new
scholarly articles per annum.
The STM report (2009)
Lokman I. Meho, The rise and rise of
citation analysis, 2007
… some 90% of papers … are never cited.
… 50% of papers are never read by anyone other than their
authors, referees and journal editors
…about scientific literature?
… one paper published every 30 seconds
… 70,000 papers published on a single protein, the tumor
suppressor p53
Spangler et al, Automated Hypothesis
Generation based on Mining Scientific
Literature, 2014
3. Emerging solution(s)
Machine reading
process textual sources, organise and classify in various
dimensions, extract main (indexical) information items,
… and understanding
identify and extract entities and relations between entities, facilitate
the transformation of unstructured textual sources into structured
data
… and predicting
enable the multidimensional analysis of structured data to extract
meaningful insights and improve the ability to predict
4. Structuring and mining
textual data
many examples from medical research
An example from social sciences:
study social confrontation in the Greek
society with a focus on the years of the crisis
based on newspaper corpora
what the social agents (parties, unions, professional
associations, etc.) claimed, against which government/state
bodies, which instruments they used, and how the events were
reported in different newspapers
5. Study social confrontation
example
Κατάληψη στα Υποθηκοφυλακεία Πειραιώς και
Σαλαμίνας αποφάσισε ο Δικηγορικός Σύλλογος
Πειραιώς (ΔΣΠ), στις 26 και 27 Απριλίου 2011,
διαμαρτυρόμενος για τα σοβαρότατα
προβλήματα λειτουργίας που παρουσιάζουν.
The Piraeus Bar Association (DSP) decided to
occupy the land registries of Piraeus and
Salamina on 26 and 27 April 2011,
protesting against their very serious
operational problems.
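As a rough illustration of how such a sentence could be turned into structured data, the sketch below applies a hand-written regular expression to a cleaned-up English rendering of the example and pulls out the actor, action, target, dates and reason. The `ProtestEvent` record, the field names and the pattern are hypothetical and fit only this one sentence; a real TDM workflow would more likely rely on trained named-entity and relation extraction components.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProtestEvent:
    """Hypothetical record for one reported protest event."""
    actor: str
    action: str
    target: str
    dates: str
    reason: str


# Hand-written pattern tuned to the example sentence only (an assumption
# made for illustration, not the project's extraction method).
PATTERN = re.compile(
    r"^The (?P<actor>.+?) decided to (?P<action>occupy|blockade) "
    r"(?P<target>.+?) on (?P<dates>.+?), "
    r"protesting against (?P<reason>.+?)\.$"
)


def extract_event(sentence: str) -> Optional[ProtestEvent]:
    """Return a structured event, or None if the pattern does not match."""
    match = PATTERN.match(sentence)
    if match is None:
        return None
    return ProtestEvent(**match.groupdict())


SENTENCE = (
    "The Piraeus Bar Association (DSP) decided to occupy the land registries "
    "of Piraeus and Salamina on 26 and 27 April 2011, protesting against "
    "their very serious operational problems."
)

event = extract_event(SENTENCE)
print(event)
```

Records of this shape could then be aggregated across newspapers, for example by counting events per actor, target or month.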
9. Main objective
Establish an open and sustainable Text and
Data Mining (TDM) platform and infrastructure
where researchers can collaboratively create,
discover, share and re-use knowledge from a
wide range of text based scientific and
scholarly related sources.
10. infrastructure - focus on
interoperability
build on existing TDM tools - no new
algorithms
service oriented - discovery, re-use of
content & tools
community driven - user centric
requirements
open science - openness at all levels
Key aspects
11. The landscape
Text Mining Researchers
Content Providers
End Users
Computing Infrastructures
12. the project
• Started: June 2015
• Duration: 3 years
• Total budget: 6,068,074 Euros
16 Partners
• 6 mining research groups
• 3 content providers
• 1 data center
• 1 library association
• 2 legal experts
• 6 community related partners
• 2 SMEs
Partners
Athena RIC
Univ. of Manchester (NaCTeM)
Univ. of Darmstadt
INRA
EMBL-EBI
Agro-Know
LIBER
Univ. of Amsterdam
Open University UK
EPFL
CNIO
Univ. of Sheffield (GATE)
GESIS
GRNET
Frontiers
Univ. of Stirling
13. the challenges
Content
Barriers and obstacles due to non-availability, technical restrictions,
copyright law or licensing issues.
No uniform way to search for, retrieve and access content for TDM.
Services
How to identify the most fitting one? Do I have permission to use it?
How to combine with other services I have access to or I need? How
to use them on my content?
Processing
Where to deploy? Are my machines powerful enough? How can I
get access to powerful machines? Where to store intermediate and
final results? How to ensure persistence of storage?
Bring all stakeholders together!
15. accessible content
Metadata and transfer protocols
•Document literature content, language resources, data categories
taxonomies, provenance information
•Generic and domain-specific metadata descriptions
•Identify standards for metadata harvesting and federated search in
distributed repositories
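One widely used standard for metadata harvesting is OAI-PMH; the slides do not name a specific protocol, so its use here is an assumption. The sketch below builds a `ListRecords` request URL and reads the identifier, Dublin Core title and rights statements from a response. The endpoint `https://repository.example.org/oai` and the sample XML are hypothetical placeholders, and no network call is made.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text: str) -> List[Dict]:
    """Extract identifier, title and rights from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "rights": [r.text for r in record.iterfind(".//dc:rights", NS)],
        })
    return records


# Hypothetical response, shortened to a single record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample open access article</dc:title>
          <dc:rights>https://creativecommons.org/licenses/by/4.0/</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")
records = parse_records(SAMPLE_RESPONSE)
print(url)
print(records)
```

Keeping the `dc:rights` values alongside each record is one way to carry licensing information forward to the point where a mining service decides whether it may process the content.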
IPR and licensing
•Study IPR restrictions for reuse of sources
• Exceptions?
• What about non-commercial research?
•Translate the legal & policy aspects into authentication and
authorization specifications (GÉANT’s eduGAIN, …)
• User-to-service and service-to-service interactions
Starting with repositories and OA
publishers
via OpenAIRE and CORE
In close collaboration with the
FUTURETDM project
http://project.futuretdm.eu/
17. Use cases (1)
Scholarly communication analytics
OpenAIRE, CORE, Frontiers
•Semantic search and discovery of open scientific outcomes
•Map of academia – scholarly communication network
•Research monitoring and analytics
Life sciences
EMBL-EBI, Human Brain Project
•Assisted curation of the EMBL-EBI chemical databases for
metabolomics
•Curation of the neuroscience resources KnowledgeBase and
NeuroLex
18. Use cases (2)
Agriculture and biodiversity
INRA, AGRO-KNOW, EFSA
•Enrich agricultural databases to assist food- and water-borne
disease outbreak alerts and product recalls
•Image, figure and dataset discovery in the AGRIS FAO online
service
Social sciences
GESIS
•Develop and evaluate methods for the automatic detection and
linking of named entities, citation traces and intentions in social
science scientific publications
19. Expectations from today’s WS
•Establish contact and dialogue with content providers,
especially OA content providers
•Understand current practices, problems and limitations
•Look into the emerging requirements
•Explore the technical, legal, policy and organisational
challenges content providers face in making their data
open for text and data mining
•Develop a common vision and strategy