Text mining and machine learning

Extract – Analyse – Search – Visualise
Text mining and machine learning for Research Data Management
Dr Tom Parsons and Mitchell Murphy
28/06/2017

About Spotlight Data
Rapid development of innovative products
OUR AGILE CROSS FUNCTIONAL TEAM
DR. TOM PARSONS: Co-founder, RDM, knowledge management
TIM VENISON: Co-founder, software delivery
WILL EVANS: React.js panel and Node.js
DR. STUART BOWE & MITCH MURPHY: Python/R data scientist, machine learning and computer vision
BARNABY KEENE: Python, architecture, processing pipeline
POOL OF ASSOCIATES AND PLACEMENTS: Developers, architects and researchers

What we do
Data science
Dark Data: Gathering and making sense of unstructured data captured from a variety of sources
Data Mining: Mining data from archives, websites, social media and API sources
Data Visualisation: We use charting, network graphs, maps and other techniques for data investigation
Artificial Intelligence: We utilise machine learning techniques to extract and investigate data.
Analysis Tools: From simple interfaces and powerful searches to end-to-end large-scale processing systems

Spotlight Data
Projects
• Large project with the UK Government and Durham University:
• Applying text mining and machine learning to large data sets
and document corpora
• Twitter and social media mining for ESRC Climate Change project
• Sensor data analysis and machine learning

The Nanowire system
Cloud or on premise
Microservice containerised architecture
Ingest → Process → Discover
[Architecture diagram: user panels, pre-process, MQ, workers, storage]
Data Processing – Natural Language Processing, text mining, classifiers, pattern recognition
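The queue-and-workers layout in the diagram can be illustrated with a minimal in-process sketch. This is not the Nanowire implementation: Python's standard `queue` and `threading` modules stand in for a real message broker and containerised workers, and the names `preprocess`, `worker` and `run_pipeline` are hypothetical.

```python
import queue
import threading


def preprocess(raw_text):
    """Pre-process step (assumed): normalise whitespace before queueing."""
    return " ".join(raw_text.split())


def worker(tasks, storage, lock):
    """Take documents off the queue, process them and store the result."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel value: no more work for this worker
            tasks.task_done()
            break
        doc_id, text = item
        result = {"id": doc_id, "word_count": len(text.split())}
        with lock:  # storage is shared between workers
            storage[doc_id] = result
        tasks.task_done()


def run_pipeline(documents, worker_count=2):
    """Ingest documents, fan them out to workers, return the stored results."""
    tasks = queue.Queue()  # stands in for the MQ
    storage = {}           # stands in for the storage layer
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, storage, lock))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for doc_id, raw_text in documents.items():
        tasks.put((doc_id, preprocess(raw_text)))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return storage
```

Calling `run_pipeline({"a": "  hello   world "})` returns a dictionary keyed by document id. In a deployed system each worker would presumably be a separate container reading from a shared broker.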

Nanowire goals
Development targets
DATA PROCESSING CAPABILITY: Ability to process structured and unstructured data
ADAPTABILITY: Built to adapt to use cases that constantly evolve through a microservice architecture
USER EXPERIENCE: Design for all levels of users with continual improvement
SCALING: Cloud and infrastructure agnostic with the ability to scale from 100s to millions of files
FAST DEPLOYMENT: The ability to quickly change releases on a fast and robust deployment system
TESTED: All components to be tested prior to release in a continuous integration and deployment cycle
OPEN SOURCE: Utilising open source libraries with a permissive licence.
CONTAINERISED: All services to be provided as Docker containers by default, with no external dependencies

Text mining
“The discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking ... of the extracted information ... to form new facts or new hypotheses to be explored further” (Hearst, 2003)
“An estimated 2.4 million scientific articles published every year” (Research Consulting TDM report)
What to do with this information:
• Mine information for research?
• Develop new products and drive innovation?
• Allow reuse of research data?

Text mining
Extracting information
Choose sources → Extract text → Clean text → Analysis → Clustering → Results
SOURCES: DATABASES, FILES, FOLDERS, OFFICE 365
CLEAN TEXT: STOP WORD REMOVAL, TOKENISATION
ANALYSIS: NATURAL LANGUAGE PROCESSING – ENTITIES, CONCEPTS, TOPICS, KEYWORDS, SENTIMENT
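The cleaning and keyword steps on this slide can be sketched in a few lines. This is an illustrative sketch only, using the Python standard library; the stop-word list is a tiny assumed sample, and the function names are hypothetical rather than taken from the presenters' system.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "with", "on"}


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]


def top_keywords(text, limit=3):
    """Return the most frequent remaining tokens with their counts."""
    counts = Counter(remove_stop_words(tokenise(text)))
    return counts.most_common(limit)
```

For example, `top_keywords("The data and the research data of the project", limit=1)` gives `[("data", 2)]`. Entity, concept and sentiment extraction would need a dedicated NLP library and is not shown here.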

Results
Visualising data

Clusters
Graph databases

Enhanced data storage
JSON Linked Data format
{
  "@context": "http://schema.org",
  "@type": "DigitalDocument",
  "mentions": [
    {
      "@type": "Person",
      "email": "tom.parsons@nottingham.ac.uk"
    },
    {
      "@type": "Thing",
      "url": "http://admire.jiscinvolve.org/wp/"
    }
  ],
  "spatialCoverage": [
    {
      "@type": "Place",
      "name": "Manchester"
    },
    {
      "@type": "Place",
      "name": "British Library"
    },
    {
      "@type": "Place",
      "name": "Nottingham"
    }
  ],
  "keywords": "rdm,project,nottingham,support,research data",
  "inLanguage": {
    "@type": "Language",
    "name": "English"
  },
  "typicalAgeRange": ">=18"
}
ANALYSIS RESULTS VALIDATED JSON-LD
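A record of this shape could be assembled from analysis results along the following lines. This is a hedged sketch: the function `build_json_ld` and its parameters are hypothetical, only a subset of the fields above is produced, and no JSON-LD validation is performed.

```python
import json


def build_json_ld(people_emails, places, keywords, language="English"):
    """Assemble a schema.org DigitalDocument record from extracted values."""
    return {
        "@context": "http://schema.org",
        "@type": "DigitalDocument",
        "mentions": [
            {"@type": "Person", "email": email} for email in people_emails
        ],
        "spatialCoverage": [
            {"@type": "Place", "name": place} for place in places
        ],
        "keywords": ",".join(keywords),
        "inLanguage": {"@type": "Language", "name": language},
    }


# Values taken from the example record on this slide.
document = build_json_ld(
    people_emails=["tom.parsons@nottingham.ac.uk"],
    places=["Manchester", "British Library", "Nottingham"],
    keywords=["rdm", "project", "nottingham", "support", "research data"],
)
serialised = json.dumps(document, indent=2)
```

The `serialised` string can then be stored alongside the source file or passed to a JSON-LD validator.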

Linking text to data
Relationships between data, articles and people
RESEARCH OUTPUTS
AUTHORS, ACADEMICS, PI/CO-I
UNIVERSITIES, LOCATIONS

Typical metadata

Data tables
Data set: https://www.repository.cam.ac.uk/handle/1810/32806

Automated relationships between data, articles and people
RESEARCH OUTPUTS
AUTHORS, ACADEMICS, PI/CO-I
UNIVERSITIES, LOCATIONS
COMPACT SILTY-LOAM SOIL 2
COURTYARD DEPOSIT BY 2
DEPOSIT BY OVEN 2
DEPOSIT WHITE THIN 2
FI9710 ASHY COURTYARD 2
IIID 5705 FI9710 2
LAYER OF PHYTOLITHS 9
RESIDUE FROM POT 2
RM 4 RESIDUE 2
RM 97 BURNT 2
THIN LAYER OF 2
WHITE LAYER OF 7
WHITE THIN LAYER 2
Citation: Madella, M. (2004). Kilise Tepe Monograph Section F2 Phytolith Data Table 1
Madella, M.
URL: https://www.repository.cam.ac.uk/handle/1810/32806
Places: Europe, Turkey
Organisations: University of Cambridge
Densham, M.
URL: https://www.repository.cam.ac.uk/handle/1810/33130
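The phrase list on this slide reads like a count of repeated three-word phrases (trigrams). A minimal sketch of how such counts could be produced follows; this is an assumption about the method rather than a description of the presenters' code, and the sample sentence is invented for illustration.

```python
from collections import Counter


def trigram_counts(text):
    """Count every run of three consecutive words, upper-cased."""
    words = text.upper().split()
    trigrams = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
    return Counter(trigrams)


def repeated_phrases(text, minimum=2):
    """Keep only the trigrams that occur at least `minimum` times."""
    return {
        phrase: count
        for phrase, count in trigram_counts(text).items()
        if count >= minimum
    }


# Hypothetical sample text, not taken from the data set.
sample = "white thin layer of phytoliths above a white thin layer of ash"
phrases = repeated_phrases(sample)
```

Here `phrases` contains `WHITE THIN LAYER` and `THIN LAYER OF`, each with a count of 2. Repeated phrases of this kind can then be attached to the data set as candidate keywords.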

Search and discovery
Graph databases
RESEARCH OUTPUTS RELATED TO PHYTOLITHS
AUTHORS CONNECTED TO MULTIPLE KILISE TEPE TOPICS
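The two queries captioned on this slide can be sketched against a small in-memory graph. A real graph database would use its own query language; the plain dictionary below only illustrates the idea, and the records, author names and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (author, output, topic) records invented for illustration.
RECORDS = [
    ("Author A", "Output 1", "phytoliths"),
    ("Author A", "Output 2", "pottery"),
    ("Author B", "Output 3", "phytoliths"),
]


def build_graph(records):
    """Link authors to outputs and outputs to topics as typed nodes."""
    graph = defaultdict(set)
    for author, output, topic in records:
        graph[("author", author)].add(("output", output))
        graph[("output", output)].add(("topic", topic))
    return graph


def outputs_for_topic(graph, topic):
    """Research outputs that are linked to the given topic."""
    return sorted(
        name
        for (kind, name), neighbours in graph.items()
        if kind == "output" and ("topic", topic) in neighbours
    )


def authors_with_multiple_topics(graph):
    """Authors whose outputs together cover more than one topic."""
    matches = []
    for (kind, name), outputs in list(graph.items()):
        if kind != "author":
            continue
        topics = {t for output in outputs for t in graph.get(output, set())}
        if len(topics) > 1:
            matches.append(name)
    return sorted(matches)
```

With the sample records, `outputs_for_topic(build_graph(RECORDS), "phytoliths")` returns `["Output 1", "Output 3"]`.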

Results
Visualising data

Discussion
Text mining
• Discuss in groups for 10 minutes:
• Sources of text and data (files, images, video etc.)
• How could text mining be used for RDM?
• What do you struggle with?
• What are the top three priorities?

Introduction
Machine learning and text

Overview
Machine Learning
• What is it?
• Why is it needed?
• Why is it useful for research data management?
• How does it work?
• Demo

What Is It?
Machine Learning
• How does an athlete learn to become good at their sport?
• How does a machine learn how to predict outcomes?
• So what is a machine learning algorithm?

Why Is It Needed?
Machine Learning

Why Is It Useful For RDM?
Machine Learning
FORMS

How Does It Work?
Machine Learning
• Finding the topic of a file using linear regression
20/06/17
Words (x) Topics (y)
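One way to read "words (x) to topics (y)" is as a linear model fitted by least squares: each file becomes a vector of word counts, each topic becomes a 0/1 target, and the topic with the highest predicted score wins. The sketch below follows that reading in pure Python with gradient descent. It is not the presenters' demo, and the training texts and topic names are hypothetical.

```python
from collections import Counter

# Hypothetical toy training set invented for illustration.
TRAINING = [
    ("soil deposit layer phytolith", "archaeology"),
    ("pottery residue deposit", "archaeology"),
    ("twitter climate sentiment", "social media"),
    ("tweet climate policy", "social media"),
]


def vectorise(text, vocabulary):
    """Word counts (x) in vocabulary order, plus a constant bias term."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary] + [1.0]


def train(training, epochs=500, learning_rate=0.05):
    """Fit one linear regression per topic by batch gradient descent."""
    vocabulary = sorted({w for text, _ in training for w in text.lower().split()})
    topics = sorted({topic for _, topic in training})
    rows = [vectorise(text, vocabulary) for text, _ in training]
    weights = {topic: [0.0] * (len(vocabulary) + 1) for topic in topics}
    for _ in range(epochs):
        for topic in topics:
            w = weights[topic]
            gradient = [0.0] * len(w)
            for x, (_, label) in zip(rows, training):
                target = 1.0 if label == topic else 0.0  # topic (y)
                error = sum(wi * xi for wi, xi in zip(w, x)) - target
                for i, xi in enumerate(x):
                    gradient[i] += error * xi
            for i in range(len(w)):
                w[i] -= learning_rate * gradient[i]
    return vocabulary, weights


def predict(text, vocabulary, weights):
    """Return the topic whose linear model gives the highest score."""
    x = vectorise(text, vocabulary)
    scores = {
        topic: sum(wi * xi for wi, xi in zip(w, x))
        for topic, w in weights.items()
    }
    return max(scores, key=scores.get)
```

After `vocabulary, weights = train(TRAINING)`, calling `predict("deposit layer", vocabulary, weights)` returns `"archaeology"`. Words outside the training vocabulary are ignored, which is one reason a real system needs far more training text.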

Demo
Machine Learning

Introduction
Machine learning and images

Facial recognition
Machine learning across document content
Original image → Convert to grayscale → Extract face → Find possible matches
TRAINING THE DATA: Evaluation of algorithms LBPH, Eigenfaces, Fisherfaces
TRAINING THE MODEL THEN TESTING
MATCHING FACES IN THE TRAINED MODEL
FUTURE: Allow a user to search for faces within a document corpus or train the system to recognise individuals
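LBPH (local binary pattern histograms), one of the algorithms named on this slide, compares faces by the texture patterns around each pixel. The sketch below shows the grayscale conversion and a basic 3x3 local binary pattern histogram in pure Python. It is a simplified illustration under stated assumptions: images are nested lists of pixel values, face extraction is not shown, and the chi-square comparison is one common choice rather than the presenters' confirmed method.

```python
# Offsets of the eight neighbours, clockwise from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to luminance values (0-255)."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
        for row in rgb_image
    ]


def lbp_code(gray, row, col):
    """8-bit code: one bit per neighbour that is >= the centre pixel."""
    centre = gray[row][col]
    code = 0
    for d_row, d_col in NEIGHBOURS:
        bit = 1 if gray[row + d_row][col + d_col] >= centre else 0
        code = (code << 1) | bit
    return code


def lbp_histogram(gray):
    """Histogram of LBP codes over all interior pixels."""
    histogram = [0] * 256
    for row in range(1, len(gray) - 1):
        for col in range(1, len(gray[0]) - 1):
            histogram[lbp_code(gray, row, col)] += 1
    return histogram


def histogram_distance(first, second):
    """Chi-square distance between histograms; smaller means more similar."""
    return sum(
        (a - b) ** 2 / (a + b) for a, b in zip(first, second) if a + b > 0
    )
```

Matching a new face then amounts to computing its histogram and picking the trained histogram with the smallest distance. Full LBPH implementations also split the face into a grid of cells and concatenate the per-cell histograms.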

Facial recognition
Sometimes makes mistakes…

Image classifiers
TensorFlow machine learning
["submarine, pigboat, sub, U-boat", "0.989818"],
["indri, indris, Indri indri, Indri brevicaudatus", "0.00165158"],
["killer whale, killer, orca, grampus, sea wolf, Orcinus orca", "8.52245e-05"],
["steam locomotive", "8.31971e-05"]]},
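Classifier output of this kind is a list of label/score pairs with the scores held as strings. A short sketch of turning it into a single tag follows; the function names and the 0.5 confidence threshold are assumptions for illustration, not part of the presenters' system.

```python
# Label/score pairs copied from the classifier output on this slide.
RAW_PREDICTIONS = [
    ["submarine, pigboat, sub, U-boat", "0.989818"],
    ["indri, indris, Indri indri, Indri brevicaudatus", "0.00165158"],
    ["killer whale, killer, orca, grampus, sea wolf, Orcinus orca", "8.52245e-05"],
    ["steam locomotive", "8.31971e-05"],
]


def parse_predictions(raw):
    """Keep the first synonym of each label and convert scores to floats."""
    parsed = [(labels.split(", ")[0], float(score)) for labels, score in raw]
    return sorted(parsed, key=lambda pair: pair[1], reverse=True)


def confident_label(raw, threshold=0.5):
    """Return the top label, or None if its score is below the threshold."""
    label, score = parse_predictions(raw)[0]
    return label if score >= threshold else None
```

For the output above, `confident_label(RAW_PREDICTIONS)` returns `"submarine"`. Returning `None` for low scores leaves uncertain images untagged for manual review.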

Review
Machine Learning
• What is it?
• Why is it needed?
• Why is it useful for research data management?
• How does it work?

Machine learning exercise
Discussion
Discuss in groups (10 mins):
• How could machine learning be used for RDM?
• Improving RDM:
• What are the ‘painful’ manual tasks?
• What could be improved?
• What are the top three priorities?

Beyond an RDM repository
The future?

Spotlight Data
The future
• Deploy text mining/machine learning system within the UK
Government
• Develop the ‘next generation’ of data repository
• Mining data repositories and OA outputs
• Office365 mining and optimisation
• Analysis of the data

EMAIL
tom@spotlightdata.co.uk
mitch@spotlightdata.co.uk
REGISTERED OFFICE
The Ingenuity Centre,
University of Nottingham Innovation Park,
Triumph Road, Nottingham,
NG7 2TU.
Strategic KM Ltd is a Company Registered in England and Wales,
Reg No. 06433359
