@BL_Labs @BrisUniRIT @JGIBristol @Cudigitalnet @BL_DigiSchol
labs@bl.uk https://goo.gl/
http://www.bl.uk/projects/british-library-labs
Funded by the Andrew W. Mellon Foundation
Mahendra Mahey
Experiment with our
Digital Collections
Mahendra Mahey
Manager of BL Labs
Running since March 2013
Core Team
• Adam Farquhar (PI)
• Mahendra Mahey
• Ben O’Steen
• Eleanor Cooper (0.5)
What is British Library Labs?
12:50 to 13:20, Thursday 12th April 2018
BL Labs Roadshow 2018
University of Bristol
UK.
The British Library
Inside the British Library
Space for 1200 readers, around 500,000 visitors per year
Building 37 uses low oxygen and robots
Reading room and delivery to London
Many items stored at Document Supply and Storage centre 48 hours away
Stockton-on-Tees
Author right to payment each time their books
are borrowed from public libraries.
St Pancras, London, UK
Many books are stored 4 storeys below the building
UK Legal Deposit Library – Reference only
Founded in 1973 though origins stem back to British Museum Library 1753
Boston-Spa
Collections – not just books!
> 180*million items
> 0.8* m serial titles
> 8* m stamps
> 14* m books
> 6* m sound recordings
> 4* m maps
> 1.6* m musical scores
> 0.3* m manuscripts
> 60* m patents
King’s Library *Estimates
Living Knowledge Vision (2015 – 2023)
Custodianship Research Business
Culture Learning International
To make our intellectual heritage accessible to everyone,
for research, inspiration and enjoyment and be the most open, creative
and innovative institution of its kind by 2023 (50 year anniversary).
Document: http://goo.gl/h41wW7 Speech: https://goo.gl/Py9uHK
Roly Keating (Chief Executive Officer of the British Library)
Digital research methods
Digital Scholarship
Visualisations
Application Programming Interfaces (APIs)
for datasets e.g. Metadata, Images, etc
Transcribing
Annotation
Location based searching & Geo-tagging
Corpus analysis, Text Mining &
Natural Language Processing
Crowdsourcing
Human Computation
In 20 years’ time?
Knowledge Quarter London
80 knowledge organisations (as of 14/04/18) within 1 mile radius of
Kings Cross, http://www.knowledgequarter.london
http://www.turing.ac.uk (Headquartered at the British Library)
UK Web Archive and e-legal deposit (2013)
http://www.webarchive.org.uk/ukwa/
Born digital
Playbills, Books, Newspapers
(includes Optical Character Recognition (OCR))
Digital collections and Datasets
British National
Bibliography
http://bnb.data.bl.uk
http://sounds.bl.uk
http://dml.city.ac.uk/
Music (Recordings & Sheet) & Sounds
http://goo.gl/frSMJt
Broadcast News (TV and Radio)
http://goo.gl/cwThHw
http://goo.gl/pBkisZ
http://goo.gl/E8aRyQ
Usage data
EThOS
Web Archive
Images, Manuscripts & Maps
http://www.qdl.qa/
Qatar Digital Library
http://idp.bl.uk/
International
Dunhuang
Project
Maps
http://www.bl.uk/maps/
Hebrew Manuscripts
http://goo.gl/4sbCp9
Flickr &
Wikimedia Commons
https://goo.gl/LZRmaZ
Finding Open Cultural Heritage Datasets
Collection Guides (199 as of 12/04/2018)
https://www.bl.uk/collection-guides/
Datasets about our collections
Bibliographic datasets relating to our published and
archival holdings
Datasets for content mining
Content suitable for use in text and data mining
research
Datasets for image analysis
Image collections suitable for large-scale image-
analysis-based research
Datasets from UK Web Archive
Data and API services available for accessing UK Web
Archive
Digital mapping
Geospatial data, cartographic applications, digital aerial
photography and scanned historic map materials
https://data.bl.uk
Download collections as zips, no API
Each dataset has a Digital Object Identifier (DOI)
can be referenced for research
Not all discoverable via
search engines!
Competition
Awards
Projects
Tell us your ideas of what to do
with our digital content (2013-16)
Show us what you have already done with
our digital content in research, artistic,
commercial and learning and teaching
categories
Talk to us about working on
collaborative projects
Tell us your ideas of what to do
with our digital content
Engagement
• Roadshows
• Events
• Meetings
• Conversations
New!
Digital Research
Support
Digital Research Support process
Online Query
(baseline)
Response
online
Requires discussion
Online or
Face to face
(intermediate)
labs@bl.uk
@BL_Labs
Other…
Explore data first
>=1 project chosen
& supported per month
Submit Project Proposal
(advanced)
Open Onsite
Data.bl.uk
Onsite only datasets
Labs website
(entry)
http://labs.bl.uk & http://data.bl.uk
Digital Research Support
Application Process
• Complete online form - https://goo.gl/Kgaq8d
• Entries reviewed and selected at the beginning of the month
• Up to 5 days support provided
• Technical, curatorial and legal advice
• Scope, Costs, Time, Risks
• Any other relevant issues?
• The Library has to go out to meet researchers, regularly and
cyclically to tell them what we have and learn what they
want to do
• Debunk ‘myths’ about the Library
• Show / tell researchers about the reality of our data
• Researchers’ ideas always change once they explore the
data!
https://goo.gl/esqpRb
Lots of two-way communication!
BL Labs runs annual ‘Roadshows’
around the UK
Have you got X?
https://upload.wikimedia.org/wikipedia/commons/5/50/Real_wuerzburg.jpg
Looking for Physical Content in the British Library
Have you got X digitised / in digital form?
http://www.yorkmix.com/wp-content/uploads/2014/04/mr-simms-sweet-shoppe-york.jpg
Looking for Digitised / Digital Content in the BL
Openly Licensed Digital Content?
15% Openly
Licensed
Around 80%*
available online
Working through to make more open…
Though some collections will always only be available onsite due to
various reasons including legal, ethical etc
Breakdown by collection*
Manuscripts 59%
Books 9%
Maps and Views 7%
Newspapers 3%
Archives and Records 3%
Paintings, Prints and Drawings 2%
*Based on number of digitisation projects (702 as of 12/04/18)
Largest proportion of funding
Public / Private Partnership
15 %* Openly Licensed – most online
85 %* Available onsite only at the moment
*Estimates
The Story of the Digital Collection…
Digital
Collection
Curator
Who paid for the digitisation?
Who did the digitisation?
Technology used
Born digital?
Published
Unpublished
Where is it?
Can it still be accessed?
Generates income
Reputational risk in using?
Legalities
Politics when digitised
Personalities involved
Surprises (e.g. gaps)
Descriptive information
Old format not supported
What media was the
digitisation done from?
Is there any background documentation?
No Descriptive information
Inconsistent descriptive information
Still there?
Good to know the background ‘story’ of a Digital Collection
if you want to use it for research and draw conclusions…
How do we give access to
onsite-only
Digital Collections
(85% of our Digital Collections)?
OPEN
£
• Have to be ‘onsite’ (interpretations vary)
• Need to be ‘security cleared’ ‘trusted’ for some collections
– Hence ‘Researcher in Residence Model’, trialling onsite ‘Digital Research
Suite’ in reading room
• Further permission may be required (depending on ‘story’ of
collection)
• Content could be on various media formats (not always online)
• 5 - 20 % re-use of material for non commercial research for some
collections, depends on agreements in place
• We are learning ‘pathways’ so that this becomes ‘everyday’ to
provide onsite access to some digital collections in the future
Accessing digital collections onsite
https://goo.gl/qpCLlk
https://goo.gl/wMTS3Z
• Dialogue typically:
– you are ‘lucky’ & we have the digital content
/ data relevant to your research
– we don’t have exactly what you’re looking for,
but is there anything of interest? Let’s talk…
– engagement can be hard work and it’s
constantly required to maintain interest in our
digital collections!
• We also tend to attract researchers with ‘fuzzier’
research boundaries and possibly open to more
interdisciplinary / collaborative research
• Artists find this dialogue easier…
What engagement does the BL have with
researchers wanting to use our digital content?
Phase 1: Exploration
Allows a researcher to:
– Understand the data in open-ended fashion.
– Discover potential tools to work with the data.
– Gain awareness of their capabilities and limitations.
– Develop a firmer research query.
– Gauge the costs, risks and time needed.
• Outputs of the exploration are not intended to be shareable,
beyond personal experience and key features (data size, formats, tool
successes, etc.).
Phase 2: Query-Focussed
• A firmer and more informed query by the researcher where:
– Suitable datasets already lined up
– There is a good idea of the initial toolset and capabilities (human
and computer) required
– The project output is outlined, and relevant reuse applications are
begun.
– Clear agreements on what happens at the end of the project – data
deletion, virtual machine deletion/archiving/etc.
– Project may iterate on initial ideas, depending on researcher’s
cost/risk appetite
Submit idea
for support
Phase 3: Wrap-up
• Wrap-up
– Work (code, notes) exported and given to researcher
– All derivative data is licensed or retained based on reuse
agreements (Access & Reuse board, etc.)
– Provisions made for the project are wound-down, as agreed
(derivative data deleted after a grace period, etc.)
Why are we doing this? (1)
We support research: it’s our job!
We want to work closely with, and
listen to, those who want to use
our digital collections and data
for their work!
https://goo.gl/esqpRb
We can learn how we are and should be supporting you and
this therefore shapes the problems we work on, such as:
https://goo.gl/esqpRb
Why are we doing this? (2)
• Access to digital collections / data?
• Advice, guidance, technical
support, training
• Services, Tools and Processes?
• Many more reasons…
Where are the gaps between what you want & what we can
give?
How do we build the bridges to overcome the gaps?
Why are we doing this? (3)
https://goo.gl/6CwCeE
How do we help researchers ‘navigate’ their way through the
‘maze’ (sometimes) of the
Library to what they want to do?
Sometimes requires understanding the culture of the organisation
https://goo.gl/62JnQT
Why are we doing this? (4)
What did people
actually do?
Examples from Text and Images
Over 200 examples (including sound, video) from
Competition and Awards:
http://labs.bl.uk/Ideas+for+Labs
http://labs.bl.uk/Other+Uses+of+Collections
Example Pattern of Research
1, 2, 3
1. Find / identify new things in messy stuff
2. Unlock hidden history / data
3. Celebrate new discoveries
Finding / identifying invisible / well hidden
things in ‘messy’ historical data
https://goo.gl/mcpa8B
Not the British Library!
Example Pattern of Research 1
Messiness in historical data
• 'Begun in Kiryu, Japan, finished in France'
• 'Bali? Java? Mexico?'
• Variations on USA:
– U.S.
– U.S.A
– U.S.A.
– USA
– United States of America
– USA ?
– United States (case)
• Inconsistency in uncertainty
– U.S.A. or England
– U.S.A./England ?
– England & U.S.A.
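The variants above can be collapsed with a small lookup, sketched here in Python. This is only an illustration: the `CANONICAL` table and the `normalise_place` function are hypothetical names, and the table covers just the spellings listed on this slide. Question marks are kept as an "uncertain" flag rather than discarded, and compound values such as "U.S.A. or England" are passed through untouched for a human to review.

```python
import re

# Hypothetical lookup table: cleaned-up key -> canonical label.
CANONICAL = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "england": "England",
}


def normalise_place(raw):
    """Return (label, uncertain) for one place string.

    A question mark is treated as the cataloguer's uncertainty marker and
    is reported as a flag instead of being thrown away.
    """
    uncertain = "?" in raw
    # Drop dots and question marks, collapse whitespace, lower-case.
    key = re.sub(r"[.?]", "", raw)
    key = re.sub(r"\s+", " ", key).strip().lower()
    # Unknown or compound values fall through unchanged for human review.
    return CANONICAL.get(key, raw.strip()), uncertain


for value in ["U.S.", "U.S.A", "USA ?", "united states", "U.S.A. or England"]:
    print(repr(value), "->", normalise_place(value))
```

Whatever rules are chosen, it is worth writing them down alongside the cleaned data so the decisions can be revisited later.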
• Cultural heritage records contain uncertainty and fuzziness (e.g. date ranges, multiple
values, uncertain or unavailable information). Curators and staff at institutions often
have unique expertise in deciphering these anomalies, so ask them! ([1960] vs. 1960 can
have a big impact depending on what you’re doing.)
• Optical Character Recognition in particular is an imperfect art: you need to consider how
bad it is, how this might affect your findings, and what needs doing to mitigate it.
• Keeping data clean, organised, open and well described will not only make your life
easier, but also enable its widespread re-use beyond your project and increase its future
impact. (Datasets you’ve created in the course of your research projects could even be used
to enhance national collections!)
• Decisions always need to be made while normalising information for visualisation.
Documenting them is important for your research but also for future re-use!
• Is your aim enquiry or presentation? This will have an impact on the tools and
data cleaning choices you make.
Things to consider: Data + Tools
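To illustrate why [1960] and 1960 should not be treated as the same thing, here is a minimal Python sketch that pulls a year out of a catalogue date string while keeping a note of whether it was marked as supplied or uncertain. The function name `parse_year` and the list of uncertainty markers (square brackets, question marks, "probably", "circa", "c.") are assumptions made for this example, not a cataloguing standard. Date ranges would need extra handling: this sketch simply takes the first four-digit year it finds.

```python
import re

# Assumed uncertainty markers for this sketch only.
UNCERTAIN = re.compile(r"[\[\]?]|probably|circa|\bc\.", re.IGNORECASE)


def parse_year(raw):
    """Return (year, inferred) for a catalogue date string.

    year is an int, or None when no four-digit year is present.
    inferred is True when the value is bracketed, questioned, hedged,
    or has no usable year at all.
    """
    text = raw.strip()
    match = re.search(r"\b(\d{4})\b", text)
    if match is None:
        # e.g. 'early 18th century': leave for a curator to interpret.
        return None, True
    return int(match.group(1)), bool(UNCERTAIN.search(text))


for value in ["1960", "[1960]", "1836 (probably)", "early 18th century"]:
    print(repr(value), "->", parse_year(value))
```

Keeping the flag in its own column means a visualisation can include or exclude uncertain dates as a documented choice.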
#digitalhumanities
dancohen/lists/digitalhumanities
@ProfHacker
@Dhnow
@BL_DigiSchol
And more links to resources here: http://scottbot.net/teaching-yourself-to-code-in-dh/
Unearthing / unlocking
hidden histories & data
to stimulate new research
https://goo.gl/vJ291F
It’s an
18th Century Poem!
Example Pattern of Research 2
Celebrating hidden histories / data
creatively through events, art &
performance
https://goo.gl/Ql0Bwz
Re-enacting, re-discovering history
Example Pattern of Research 3
https://goo.gl/oUNj5N
https://goo.gl/ImAUv4
Finding things in ‘messy’
Optical Character Recognised (OCR) text
Mrs Folly
• Clean up some manually
• Get human ‘ground truth’
• Write computer code (sometimes
it’s machine learning) to find
things reliably in it ‘automatically’
• Try code on messy content
• Tweak if necessary
• Digital ‘lasso’ around content
• Human sift through
Mrs Folly
An example pattern of research
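The ‘digital lasso’ step can be sketched in a few lines of Python. This is not the code used in any Labs project: it is a simple illustration using approximate string matching from the standard library (`difflib.SequenceMatcher`), where the function name `lasso`, the 0.8 threshold and the sample lines are all invented for the example. It pulls out lines that nearly match a target phrase, so a human can sift through the candidates afterwards.

```python
from difflib import SequenceMatcher


def lasso(ocr_lines, target, threshold=0.8):
    """Return (line_number, line, score) for lines with a near match to target."""
    target = target.lower()
    size = len(target.split())
    hits = []
    for number, line in enumerate(ocr_lines, start=1):
        words = line.lower().split()
        best = 0.0
        # Slide a window with as many words as the target across the line.
        for start in range(max(len(words) - size + 1, 1)):
            window = " ".join(words[start:start + size])
            best = max(best, SequenceMatcher(None, target, window).ratio())
        if best >= threshold:
            hits.append((number, line, round(best, 2)))
    return hits


# Invented sample lines with typical OCR damage.
sample = [
    "The benefit of Mrs. Fol1y will take place",
    "Doors open at six o'clock",
    "MRS FOLLY as Lady Macbeth",
]
for hit in lasso(sample, "Mrs Folly"):
    print(hit)
```

Lowering the threshold catches more damaged text but gives the human sifter more false positives to wade through, which is where the ‘ground truth’ sample helps in choosing a value.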
Legalities of Machine Learning /
Text and Data mining
https://goo.gl/toq4Bo
Legalities of Machine Learning / Text and Data
mining still up for discussion…Often misunderstood
Is it the same as humans reading and looking for
patterns…just a bit quicker?
Victorian Meme Machine (2014)
https://goo.gl/HMqDt3
Bob Nicholson
http://victorianhumour.tumblr.com/
Bob Nicholson interviewed on
BBC Radio 4 Making History Programme:
http://goo.gl/fmV9ep
And telling jokes to the public:
http://goo.gl/xIDRhz
Bob obtained further funding from his university
Looking for more collaborations
https://www.youtube.com/watch?v=-GRgj7Q5OM0
Rob Walker, Victorian Mother-in-law Jokes
Victorian Comedy Night, 7 Nov 2016
Learnt about access paths
to digital collections
Katrina Navickas (2015)
Political Meetings Mapper
http://politicalmeetingsmapper.co.uk
https://goo.gl/Qq78Oa
Labs Symposium 2015
https://goo.gl/BSA3be
Interview 2015
The Chartist Newspaper
http://goo.gl/vOLSnH
Chartist Monster Meeting
Chartists Walking Tour and
Re-enactment London
Learnt that domain knowledge
reduces noise
Black Abolitionist Performances & their
Presence in Britain (2016) – Hannah-Rose Murray
Frederick
Douglass
Ellen
Craft
Josiah
Henson
Ida B
Wells
A Performance by
Joe Williams &
Martelle Edinborough
http://frederickdouglassinbritain.com/
Started to implement
Machine Learning Techniques
Data-mining verse in 18th Century newspapers
BL Labs Project 16-17, Jennifer Batt
https://goo.gl/5Akthd
Slides courtesy Jennifer Batt
Started to refine
Machine Learning Techniques
Psychiatrist’s Journey
into 19th Century Newspapers (2016)
• Dr Surendra P Singh, Consultant Psychiatrist
• To identify weekly, monthly, yearly and
longitudinal trends in suicide reporting in
terms of gender, status, sites, locations and
health in OCR text of 19th Century
Newspapers
• Used ‘R’ Open Source Stats
Package to collect ‘Suicide’ corpus
• Looking for collaborators to work on this
dataset
Use off-the-shelf tools
and remote access pathways
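As a rough idea of what collecting trends from OCR text involves, here is a small Python sketch. The project itself used R; Python is used here only for illustration, and the function name `trend`, the input shape (pairs of publication date and article text) and the sample articles are assumptions made for this example.

```python
from collections import Counter
from datetime import date


def trend(articles, term, period="year"):
    """Count articles mentioning term, grouped by year, month or ISO week.

    articles is an iterable of (publication_date, ocr_text) pairs.
    """
    counts = Counter()
    for published, text in articles:
        if term.lower() not in text.lower():
            continue
        if period == "year":
            key = f"{published.year}"
        elif period == "month":
            key = f"{published.year}-{published.month:02d}"
        elif period == "week":
            iso = published.isocalendar()
            key = f"{iso[0]}-W{iso[1]:02d}"
        else:
            raise ValueError("period must be 'year', 'month' or 'week'")
        counts[key] += 1
    return dict(sorted(counts.items()))


# Invented sample data, not taken from the newspaper collection.
sample = [
    (date(1850, 1, 5), "A sad case of SUICIDE was reported at the inquest"),
    (date(1850, 3, 1), "Market prices for the week"),
    (date(1851, 6, 2), "Suspected suicide of a labourer"),
]
print(trend(sample, "suicide"))
print(trend(sample, "suicide", period="month"))
```

A plain substring search like this will miss OCR-damaged spellings, so the counts are best read as a lower bound.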
Virtual Infrastructure for OCR text
OCR text ‘scraped’ from
digitised newspapers
and put in internal cloud
Jupyter notebook
Write python code and results
in web browser
http://jupyter.org
Access available for researchers ‘in residence’
65,000 digitised 19th Century books
Image: Artwork by Alicia Martin 2007 / 2008
Paid for by:
For a full list:
https://goo.gl/HqPQMS
Subjects include:
Philosophy
Poetry
History
Literature
1789 - 1876
OCR XML generated by ABBYY FineReader
Optical Character Recognition
We did some of our own
experiments…do as we tell others!
Experiment with our
Digital Collections
@BL_Labs
Ben O’Steen of @BL_Labs after Hack Event, August 2013
One major problem!
•We know about the books these images come
from but we know nothing about the actual
images!
•How will we identify them?
•How will we find them later?
•How can we do that with 1 million images?
•Try a few experiments!
Running face recognition
on the images
Face Recognition Algorithm
Trained on Photographs
Late August 2013
Face Recognition
Algorithms worked
better for female
faces than men’s
The Mechanical Curator
Snipped image posted
almost randomly
every hour…
on a Tumblr blog
One of our early followers was…
Ben O’Steen, 30 September 2013
Has a slight ‘mood’…
once image published,
tries to find 8 similar images
e.g. ‘slanty’, ‘circular’ etc.
& then gets ‘bored’
follow…
@MechCuratorBot
British Library Flickr Commons
Why Flickr Commons?
• Free!
• Each image has its own unique web address, easy to share
• Can Tag images
• Has Application Programming Interface (API)
Late August 2013
Worked better for female faces than men’s
Press
http://mechanicalcurator.tumblr.com
Posts image every 30 minutes
http://www.flickr.com/photos/britishlibrary/
1,020,418 images
need tagging!
Creative uses of images
Face recognition
Algorithms based on photos
Mechanical Curator
with an algorithmic brain
(Circles, Squares and Slanty etc)
http://goo.gl/qPPgxX
Wikimedia
Flickr Commons
Individual URL & API
Snipping out images
from 65,000 Digitised Books*
>800,000,000* views
>17,000,000* tags
https://goo.gl/FgZ4HM
Work @ BL by Ben O’Steen, Labs
and Digital Research Team
*Matt Prior - http://goo.gl/j29Tnx
Since Dec 2013
Tumblr
*Estimates
>More demand to see
physical items
Tagging a million images
Iterative Crowdsourcing
http://goo.gl/j6fxac
Cardiff University’s
Lost Visions Project
http://www.metadatagames.org/
Metadata Games
James Heald
Mario Klingemann
Chico 45
Use computational methods
Human Tagger
Top British Library Flickr Commons Taggers
18 hard core taggers
How to reward this ‘small group’ and keep it motivated?
Average for ‘crowd’ is 1 tag per person
What kind of ‘task’ can this ‘crowd’ do?
Mobile games for ‘Ships’, ‘Covers’ and ‘Portraits’ Interface for tagging
Adam Crymble (2015)
Crowdsource Arcade
http://goo.gl/LBfJ4W
http://goo.gl/OH9pOZ
https://goo.gl/7z0j8p
30 mins talk
Labs Symposium (2015)
https://goo.gl/SSRsdd
5 min interview (2015)
http://goo.gl/0APpE8
Game Jam
Using Arcade Games
to help Tag images
‘Art Treachery’ and ‘Tag Attack’
Special Jury’s Prize (2015)
James Heald – Wikimedia and Map work
https://goo.gl/WYZCB2
http://goo.gl/HNQq5e
https://goo.gl/VPgffL
https://commons.wikimedia.org/
https://goo.gl/djtm1b
Labs Symposium (2015)
Geotagging maps
50,000 Maps
Found in Flickr 1 million
Human & Computational Tagging
& Community engagement
Geo-referencing work
https://www.bl.uk/georeferencer
SherlockNet: Competition Winner 2016
Karen Wang, Luda Zhao and Brian Do
Using Convolutional Neural Networks to Automatically Tag and Caption
the British Library Flickr Commons 1 million Image Collection
12 categories
>15.5 million tags added
>100,000 captions
bit.ly/sherlocknet
Pooled surrounding
OCR text on page
from similar images
Used Microsoft COCO (photographs) &
British Museum Prints and Drawings
collections as training sets.
Tags Captions
http://goo.gl/dM8ieA
Mario Klingemann (2015)
Code Artist / Curator
http://goo.gl/bNxGZZ
Kris Hoffman (2016)
Animation for Fashion Week 2016
https://goo.gl/QilqqT
Jiayi Chong 2016 - Animation tool
https://www.facebook.com/RealmlandStory/
Paul Rand Pierce 2016
Graphic Novel on Facebook
Tragic Looking Women
44 Men who Look 44
(Notice the direction of the faces)
A Hat on the Ground
Spells trouble
Artistic / Creative Works
https://www.youtube.com/watch?v=Q3SBxO34Zlc
David Normal 2014 and 2015
Collages/Paintings & Lightboxes
Imaginary Cities – BL Labs Project /
Exhibition 16-18 (Michael Takeo Magruder)
An artistic exploration seeking to create provocative fictional cityscapes for the Information Age
from the British Library’s digital collection of historic urban maps
Alanna Hilton
British Fashion Colleges Council and
Teatum Jones
It all starts from a conversation!
• Start with a conversation, our data isn’t highly visible on search engines
(yet!) & not easy to find. Need to create and embrace serendipity &
opportunities for use by talking!
• Need to have several conversations with several stakeholders & tap into
their tacit knowledge that isn’t always written down sometimes to progress
ideas.
• Often misunderstandings because of jargon & different meaning of words.
https://goo.gl/XaHYT9
?
Audience
research &
Digital
interests
Digital
collections
we have
This is where Labs works
Many researchers have the domain knowledge but lack
technical / digital skills to use Digital Research
methods.
Should they be teamed up with those that want to solve
problems or get trained? (Will look at in the afternoon)
Digital skills training needed for Humanities
researchers…
https://goo.gl/i5GVfI
https://goo.gl/kwcK8J
Labs mindset…
1. Start a conversation, generate positive energy,
be nice, have fun and try to support ideas.
2. Start with small experiments, but think big!
3. Fail faster (don’t be afraid) and persevere.
4. Reject perfectionism! Good enough is
sometimes…good enough!
5. Celebrate the uses of digital collections, tell
the world!
https://goo.gl/noASfl
Explore or Imagine Our Data!
• CSV of Metadata
https://data.bl.uk/digbks/dig19cbooks-mdata-csv.csv
• 19th Century Books - Book Metadata - 01/09/2013.
https://data.bl.uk/digbks/db21.html
• Digitised Books - Flickr Tag History - Dec 2013 to March 2016.
TSV
https://data.bl.uk/digbks/db15.html
• Digitised Hebrew Manuscripts - Metadata
https://data.bl.uk/hebrewmanuscripts/heb1.html
• Digitised Hebrew Manuscripts: Or 2210 - Or 2364
https://data.bl.uk/hebrewmanuscripts/heb8.html
• Theatrical playbills from Britain and Ireland (OCR text only)
https://data.bl.uk/playbills/pb2.html
• Portraits of actors, views of theatres and playbills (covering
1750 - 1821 in a single volume)
https://data.bl.uk/singlesheet/por1.html
• Volumes of Lysons Collectanea (Amusements), comprising
broadsides, cuttings and advertisements on amusements, 1660-1840.
https://data.bl.uk/singlesheet/ad1.html
https://data.bl.uk
• Have a look at the data.
• Data Quality
• Issues
Or an idea you have thought of
what to do with the data!
http://labs.bl.uk/Ideas+for+Labs
Smaller datasets
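A first look at data quality in one of the metadata CSVs could be as simple as counting empty cells per column, sketched below in Python. The function name `missing_values` and the sample rows are made up for this example: the column names (`identifier`, `title`, `date`) are assumptions and may not match the real files on data.bl.uk.

```python
import csv
import io


def missing_values(csv_file):
    """Return (row_count, {column: number_of_empty_cells}) for a CSV file object."""
    reader = csv.DictReader(csv_file)
    empty = {name: 0 for name in reader.fieldnames or []}
    rows = 0
    for row in reader:
        rows += 1
        for name in empty:
            # Treat missing, blank and whitespace-only cells as empty.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return rows, empty


# Invented sample with hypothetical column names.
SAMPLE_TEXT = (
    "identifier,title,date\n"
    "001,A Tour in Wales,1810\n"
    "002,,[1850?]\n"
    "003,Poems,\n"
)
print(missing_values(io.StringIO(SAMPLE_TEXT)))
```

With a downloaded file, the same function can be given `open(path, newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting for a particular dataset.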
Editor's notes
140 seconds
The British Library is the national library of the UK and one of the largest research libraries in the world. The Library moved to a new purpose-built building in 1997 <click>, the largest of its kind built in the UK in the 20th century. Many frequently used items are stored 5 storeys below the main building at St Pancras in London, and many might not know that part of the building is meant to look like a ship on a journey to discovery! <click> <click to switch off>
The building can seat 1,200 researchers at any one time across 5 reading rooms.
<click>Medium and long term requested items are held at Boston Spa in Yorkshire in a low-oxygen warehouse, using robots to retrieve items. In total, the library has 625 km of shelving, growing by 12 km every year.
Whilst we acquire items through purchase or gifts, much of the collection has been built up through legal deposit. That is, by law, a copy of every UK and Ireland print publication must be given to the British Library by its publishers. Around 3 million items are added per year. In 2013, legal deposit was extended to cover non-print material, so by law we now take in digitally published items as well. This means regular mass crawls of the entire UK web domain, as well as ebooks, ejournals etc.
85 seconds
The picture you can see is inside the main building in London, it’s the King’s Library – King George the Third’s personal library! Sometimes known as the ‘stack’, I walk past this every day and I sometimes forget that the collections the British Library has are truly staggering! We currently estimate them to exceed <click>150 million items, representing every age of written civilisation and every known language. Our archives now contain the earliest surviving printed book in the world, the Diamond Sutra, written in Chinese and dating from 868 AD….
So some big numbers…
Over …<click>14 million books
<click>60 million patents
<click>8 million stamps
<click>4 million maps
<click>3 million sound recordings
<click>1.6 million music scores
<click>over .3 million manuscripts
<click>0.8 million serials titles (which are of course made up of many many volumes/editions), this is where a lot of our content is, just in case you thought the numbers didn’t add up!
Get clearer annotation image and transcription (perhaps TILT)
6 Seconds (20 Words)
So <Click> ‘how’ do we try and engage those who might be interested in the BL’s digital collections and data? <Click>
17 Seconds (53 Words)
<Click>The British Library is one of the largest libraries in the world <Click> with an estimated 180 million physical items, only a small proportion of which have been digitised. <Click>We estimate this is around 1-2%, but no one really knows exactly how much. However, more and more items are being stored as ‘born digital’, such as the UK Web Archive<Click>
Have balance of Multimedia
Broadcast news and radio, sounds, Save our Sounds
Books and newspapers
Images
BNB
Qatar Digital library
Hebrew manuscripts
<click>The British Library faces many challenges of access to our Digital collections!
<click> Sometimes digital content is only available onsite due to license restrictions,
<click>or even only on a specific computer in a reading room! Technically there are very few reasons why digital content can’t be online
<click> though it might be too big or hasn’t been transferred from other digital storage media.
<click>Sometimes access is through a paywall. Finally,
<click>some content is in the happy sunny place, online, open and freely available.
The real reasons why there are challenges to accessing digital content are of course human. They require different approaches from the Library and may often involve an honest, open dialogue and negotiation with the publishers.
The Labs project has tried to address this problem by creating a ‘residency model’ for researchers to work intensively with a digital collection on-site, so as not to infringe access conditions. I will say more about this later.
Examples from the Cooper Hewitt collection. I spent 3/5 of my time at the Cooper Hewitt just trying to get the data clean enough to vaguely represent the collection. The problem is that computers think U.S., U. S. , U.S.A., U. S. A. , United States, United States of America are six different places.
Fields also contain things like internal notes about potential duplicates, unexpected extra information - notes on what type of location, etc. Lots of inconsistencies - uncertainty and date ranges expressed in different ways.
More common GLAM issues - What year is 'early 18th century'? What do you do with '1836 (probably)'?
Open Refine is an amazing tool, and I wouldn't have gotten anywhere at Cooper Hewitt without it. It will suggest ways to make the data more consistent. You can then export the data and keep working on it in other tools, or put it into Open Refine. Because Refine runs locally it can be used for sensitive data you mightn't put online.
One issue is that GLAMs tend to use question marks to record uncertainty in attribution, but Refine strips out all punctuation, so you have to be careful about preserving it (if that's what you want).
Takes in TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents.
http://freeyourmetadata.org/cleanup/ useful advice
21 Seconds (65 Words)
Katrina Navickas was particularly interested in the <Click>Chartist Movement, a group campaigning for the vote for working people. <Click>They were the biggest popular movement for democracy in 19th century British history, as this early picture of a huge monster meeting at Kennington Common shows. <Click>She wanted to use a combination of manual and computational methods to explore our Digitised Newspapers to find out when and where they met and plot the meetings on a map, <Click>hopefully unearthing new history.
Watch out for the gunner and skunk as they will make an appearance again!
Posts small illustrations taken almost at random from the digitised book corpus to a Tumblr blog.
This experiment with undirected engagement was a by-product of work to uncover the hidden wealth of illustrations within the digitised pages.
27 Seconds (82 Words)
Adam Crymble <Click>wanted to harness the fun of playing games on arcade machines to help with crowdsourcing the tagging of un-described images. He particularly wanted to engage a younger audience in crowdsourcing. <Click>On the right you can see a replica 1980s arcade machine we built <Click>and on the bottom left some tagging games that were developed for the machine through a ‘Game Jam’. <Click>Let’s take a closer look at two of the games…<Click>
18 Seconds (56 Words)
Indexing the BL 1 million & Mapping the Maps was led by James Heald in collaboration with others. <Click>They produced an index of the 1 million 'Mechanical Curator collection' images on <Click>Wikimedia Commons from a collection of largely un-described images. <Click>This led to 50,000 maps being found within the collection, partly through a map tag-a-thon. <Click>These are now being geo-referenced. <Click>