This document describes Spotlight Data, a company that uses text mining, machine learning, and data visualization to help with research data management. It introduces key members of Spotlight Data's team and describes some of their current projects, including work with the UK government and Durham University applying text mining and machine learning to large datasets. It also provides an overview of Spotlight Data's Nanowire system for ingesting, processing, and analyzing both structured and unstructured data at scale using a microservices architecture.
Text mining and machine learning
1. Extract – Analyse – Search – Visualise
Text mining and machine learning for Research Data Management
Dr Tom Parsons and Mitchell Murphy
28/06/2017
2. 2
Co-founder, RDM, Knowledge Management
DR. TOM PARSONS
React.js panel and Node.js
WILL EVANS
Python/R data scientists, machine learning and computer vision
DR. STUART BOWE & MITCH MURPHY
Co-founder, Software delivery
TIM VENISON
Python, architecture, processing pipeline
BARNABY KEENE
About Spotlight Data
Rapid development of innovative products
OUR AGILE CROSS FUNCTIONAL TEAM
Developers, architects and researchers
POOL OF ASSOCIATES AND PLACEMENTS
3. 3
Gathering and making sense of unstructured data captured from a variety of sources
We use charting, network graphs, maps and other techniques for data investigation
Mining data from archives, websites, social media and API sources
Analysis Tools
From simple interfaces and powerful searches to end-to-end large-scale processing systems
We utilise machine learning techniques to extract and investigate data.
What we do
Data science
Dark Data, Data Mining, Data Visualisation, Artificial Intelligence
4. 4
Spotlight Data
Projects
• Large project with the UK Government and Durham University:
• Applying text mining and machine learning to large data sets and document corpora
• Twitter and social media mining for ESRC Climate Change project
• Sensor data analysis and machine learning
5. 5
The Nanowire system
Cloud or on premise
Microservice containerised architecture
Ingest Discover Process
Workers
User panel User panel
Data Processing – Natural Language Processing, text mining, classifiers, pattern recognition
MQ
Pre-process
Storage
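The slide names a message queue (MQ), workers, a pre-process step and storage, but not how they are wired together. The following is a minimal single-process sketch of that pattern, assuming an in-memory queue and a dictionary as stand-ins for the real MQ and storage services. The function names (`preprocess`, `process`, `run_pipeline`) are hypothetical and are not taken from Nanowire.

```python
import queue
import threading


def preprocess(raw_text):
    """Pre-process step (assumed): collapse whitespace and lower-case."""
    return " ".join(raw_text.split()).lower()


def process(clean_text):
    """Processing step (assumed): a trivial word count stands in for NLP."""
    return {"text": clean_text, "word_count": len(clean_text.split())}


def worker(mq, storage, lock):
    """Take documents off the queue until a None sentinel arrives."""
    while True:
        item = mq.get()
        if item is None:
            break
        doc_id, raw_text = item
        result = process(preprocess(raw_text))
        with lock:
            storage[doc_id] = result


def run_pipeline(documents, n_workers=2):
    """Ingest documents, let the workers process them, return the storage."""
    mq = queue.Queue()   # stand-in for the message queue
    storage = {}         # stand-in for the storage service
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(mq, storage, lock))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for doc_id, raw_text in documents.items():
        mq.put((doc_id, raw_text))
    for _ in threads:
        mq.put(None)     # one sentinel per worker
    for thread in threads:
        thread.join()
    return storage


if __name__ == "__main__":
    print(run_pipeline({"a": "Hello   World", "b": "One"}))
```

In a containerised deployment each worker would presumably run as its own service and the queue would be an external broker. The sketch only shows why a queue lets workers be added or removed without changing the ingest side.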
6. 6
Ability to process structured and unstructured data
DATA PROCESSING CAPABILITY
Built on a microservice architecture to adapt to use cases that constantly evolve
ADAPTABILITY
Designed for all levels of users, with continual improvement
USER EXPERIENCE
Cloud and infrastructure agnostic, with the ability to scale from 100s to millions of files
SCALING
The ability to quickly change releases on a fast and robust deployment system
FAST DEPLOYMENT
All components to be tested prior to release in a continuous integration and deployment cycle
TESTED
Nanowire goals
Development targets
Utilising open source libraries with a permissive licence
OPEN SOURCE
All services to be provided as Docker containers by default, with no external dependencies
CONTAINERISED
8. 8
Text mining
What to do with this information:
• Mine information for research?
• Develop new products and drive innovation?
• Allow reuse of research data?
“The discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking ... of the extracted information ... to form new facts or new hypotheses to be explored further” (Hearst, 2003)
“An estimated 2.4 million scientific articles published every year” (Research Consulting TDM report)
9. 9
Text mining
Extracting information
Choose sources → Extract text → Clean text → Analysis → Clustering → Results
DATABASES, FILES, FOLDERS, OFFICE 365
NATURAL LANGUAGE PROCESSING – ENTITIES, CONCEPTS, TOPICS, KEYWORDS, SENTIMENT
STOP WORD REMOVAL, TOKENISATION
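The cleaning and analysis stages named on this slide can be sketched in a few lines. This is an illustration only, assuming plain-text input, a tiny hand-written stop-word list and raw keyword frequency as the "analysis". The names `STOP_WORDS`, `clean_text` and `top_keywords` are hypothetical, and the slides do not say which libraries Spotlight Data uses.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "by", "from", "in", "of", "the", "to"}


def clean_text(text):
    """Tokenise on letters and digits, lower-case, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def top_keywords(tokens, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


if __name__ == "__main__":
    tokens = clean_text("The thin layer of phytoliths by the oven")
    print(tokens)
    print(top_keywords(tokens, n=2))
```

Entity, concept, topic and sentiment extraction would replace `top_keywords` in a fuller pipeline. Those steps need trained models and are outside the scope of this sketch.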
13. 13
Linking text to data
Relationships between data, articles and people
RESEARCH OUTPUTS
AUTHORS, ACADEMICS, PI/CO-I
UNIVERSITIES, LOCATIONS
15. 15
Linking text to data
Data tables
Data set: https://www.repository.cam.ac.uk/handle/1810/32806
16. 16
Linking text to data
Automated relationships between data, articles and people
RESEARCH OUTPUTS
AUTHORS, ACADEMICS, PI/CO-I
UNIVERSITIES, LOCATIONS
COMPACT SILTY-LOAM SOIL 2
COURTYARD DEPOSIT BY 2
DEPOSIT BY OVEN 2
DEPOSIT WHITE THIN 2
FI9710 ASHY COURTYARD 2
IIID 5705 FI9710 2
LAYER OF PHYTOLITHS 9
RESIDUE FROM POT 2
RM 4 RESIDUE 2
RM 97 BURNT 2
THIN LAYER OF 2
WHITE LAYER OF 7
WHITE THIN LAYER 2
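The list above reads like counts of three-word phrases (trigrams) taken from the text cells of the data table. The slides do not say how the counts were produced, so this is only a sketch of one way to get counts of that shape. The sample cell descriptions are invented for illustration and are not taken from the Kilise Tepe data set.

```python
from collections import Counter


def trigram_counts(texts):
    """Count every run of three consecutive words across a list of strings."""
    counts = Counter()
    for text in texts:
        words = text.upper().split()
        for i in range(len(words) - 2):
            counts[" ".join(words[i:i + 3])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical cell descriptions, not real records.
    cells = ["white thin layer of phytoliths", "thin layer of ash"]
    for phrase, count in trigram_counts(cells).most_common(3):
        print(phrase, count)
```

Trigrams are counted within each cell only, so a phrase never spans two unrelated descriptions.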
Citation: Madella, M. (2004). Kilise Tepe Monograph Section F2 Phytolith Data Table 1
Madella, M.
URL: https://www.repository.cam.ac.uk/handle/1810/32806
Places: Europe, Turkey
Organisations: University of Cambridge
Densham, M.
URL: https://www.repository.cam.ac.uk/handle/1810/33130
17. 17
Search and discovery
Graph databases
RESEARCH OUTPUTS RELATED TO PHYTOLITHS
AUTHORS CONNECTED TO MULTIPLE KILISE TEPE TOPICS
19. 19
Discussion
Text mining
• Discuss in groups for 10 minutes:
• Sources of text and data (files, images, video etc.)
• How could text mining be used for RDM?
• What do you struggle with?
• What are the top three priorities?
21. 21
Overview
• What is it?
• Why is it needed?
• Why is it useful for research data management?
• How does it work?
• Demo
Machine Learning
22. 22
What Is It?
Machine Learning
• How does an athlete learn to become good at their sport?
• How does a machine learn how to predict outcomes?
• So what is a machine learning algorithm?
23. 23
Why Is It Needed?
Machine Learning
24. 24
Why Is It Useful For RDM?
Machine Learning
FORMS
25. 25
How Does It Work?
Machine Learning
• Finding the topic of a file using linear regression
20/06/17
Words (x) Topics (y)
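As a toy version of the words (x) to topics (y) idea, the sketch below fits a straight line by ordinary least squares. Here x is how often one keyword appears in a file and y is 1 if the file is on the topic, 0 otherwise. The training counts, the keyword and the 0.5 decision threshold are all assumptions made for illustration. Thresholding a linear regression is a simplification, and classifiers such as logistic regression are more commonly used for this kind of task.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def is_on_topic(text, slope, intercept, word="phytoliths", threshold=0.5):
    """Predict the topic score from the keyword count and apply a threshold."""
    count = text.lower().split().count(word)
    return intercept + slope * count > threshold


# Hypothetical training data: keyword counts and topic labels (1 = on topic).
WORD_COUNTS = [0, 1, 4, 5]
TOPIC_LABELS = [0, 0, 1, 1]

if __name__ == "__main__":
    slope, intercept = fit_line(WORD_COUNTS, TOPIC_LABELS)
    print(is_on_topic("phytoliths phytoliths phytoliths", slope, intercept))
    print(is_on_topic("one mention of phytoliths", slope, intercept))
```

With many words rather than one, x becomes a vector of counts per file and the same fit is done with one coefficient per word.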
28. 28
Facial recognition
Machine learning across document content
Original image → Convert to grayscale → Extract face → Find possible matches
Evaluation of algorithms: LBPH, Eigenfaces, Fisherfaces
TRAINING THE DATA
Allow a user to search for faces within a document corpus or train the system to recognise individuals
FUTURE
MATCHING FACES IN THE TRAINED MODEL
TRAINING THE MODEL THEN TESTING
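To make the grayscale, train and match steps concrete, here is a heavily simplified pure-Python sketch in the spirit of LBPH (local binary pattern histograms). It is not the implementation evaluated in the slides. It assumes the face has already been extracted, uses one histogram for the whole image rather than a grid of cells, and compares histograms with a chi-square distance. All function names are hypothetical.

```python
# Neighbour offsets (dy, dx), clockwise from the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of integer luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]


def lbp_histogram(gray):
    """Histogram of 8-bit local binary pattern codes over interior pixels."""
    hist = [0] * 256
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            centre = gray[y][x]
            code = 0
            for dy, dx in OFFSETS:
                bit = 1 if gray[y + dy][x + dx] >= centre else 0
                code = (code << 1) | bit
            hist[code] += 1
    return hist


def chi_square(h1, h2):
    """Chi-square distance between two histograms (0 means identical)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def best_match(query_hist, trained):
    """Name of the trained histogram closest to the query."""
    return min(trained, key=lambda name: chi_square(query_hist, trained[name]))


if __name__ == "__main__":
    # Synthetic 5x5 "faces": a horizontal and a vertical brightness gradient.
    horizontal = [[(x * 10,) * 3 for x in range(5)] for _ in range(5)]
    vertical = [[(y * 10,) * 3 for _ in range(5)] for y in range(5)]
    trained = {
        "person_a": lbp_histogram(to_grayscale(horizontal)),
        "person_b": lbp_histogram(to_grayscale(vertical)),
    }
    # A brighter copy of the horizontal gradient should still match person_a.
    query = [[(x * 10 + 5,) * 3 for x in range(5)] for _ in range(5)]
    print(best_match(lbp_histogram(to_grayscale(query)), trained))
```

Because each code only records whether a neighbour is at least as bright as the centre pixel, a uniform brightness change leaves the histogram unchanged. That is one reason this family of methods is used for matching faces under varying lighting.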
32. 32
Machine learning exercise
Discussion
Discuss in groups (10 mins):
• How could machine learning be used for RDM?
• Improving RDM:
• What are the ‘painful’ manual tasks?
• What could be improved?
• What are the top three priorities?
34. 34
Spotlight Data
The future
• Deploy a text mining/machine learning system within the UK Government
• Develop the ‘next generation’ of data repositories
• Mining data repositories and OA outputs
• Office365 mining and optimisation
• Analysis of the data