"Mechanical Curator"
(The technical story)
It began with dogfood...
• "Given access to a filesystem of media
with an easily learned layout
convention, can a researcher use their
own tools?"
• So we contrived a research question:
"Can we find the faces in the
19th C scanned book
collection?"
Outcome:
• Majority of tools and libraries expect local
filesystem or in-memory access; no
network/API knowledge needed by
researcher.
• While lookup by layout is awkward, it is a
pragmatic approach when distributing
content by sneakernet. It could be paired
with a lightweight online search engine and
a documentation wiki for best practices.
'Project' success?
• Computer Vision algorithms are
predominantly based on photographic
input. Room for improvement.
• Catch-22 with respect to training sets.
• But... applying Haar cascade profiles,
based on a photo training set, had some
reasonable success!
19C depictions of faces
• Likelihood of detection:
• Female faces > Male
• Why women?
• Drawn more symmetrically - male faces were
more likely to be exaggerated.
• Depiction is typically 'clean' and posed
• Fashion: beards, spectacles and hats - very
different to the training sets
An Interesting By-product emerged
• The ALTO XML, created by Microsoft as part of
the digitisation process, was found to have
'GraphicalIllustration' elements.
– polygonal boundaries for areas where it
detected contiguous content but where OCR
didn't work.
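As a rough illustration of how such elements could be located, the sketch below parses a small, made-up ALTO-like fragment with Python's standard library. The element name `GraphicalIllustration` follows the slide; the attribute names (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`), the sample values, and the use of a simple bounding box instead of a polygon are assumptions for illustration, not details taken from the collection.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration only; real files are larger and
# normally declare an XML namespace.
SAMPLE_ALTO = """
<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="T1" HPOS="100" VPOS="100" WIDTH="800" HEIGHT="200"/>
        <GraphicalIllustration ID="G1" HPOS="120" VPOS="400" WIDTH="640" HEIGHT="480"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}Name' to 'Name'."""
    return tag.rsplit("}", 1)[-1]


def illustration_boxes(xml_text, element_name="GraphicalIllustration"):
    """Return (id, x, y, width, height) for every matching element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) != element_name:
            continue
        boxes.append((
            elem.get("ID"),
            int(float(elem.get("HPOS", 0))),
            int(float(elem.get("VPOS", 0))),
            int(float(elem.get("WIDTH", 0))),
            int(float(elem.get("HEIGHT", 0))),
        ))
    return boxes
```

Matching on the local tag name keeps the sketch independent of whichever namespace a given file declares.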
A map to all* the images?
* Unlikely to be comprehensive
The 'Mechanical Curator' found:
– Maps
– Portraits
– Marginalia
– Covers
– Charts and diagrams
– Decorations
Microsoft Books
• Context:
– 47k 'works' digitised, 68k volumes
– 15.3 TB images, 1.3 TB ALTO XML
– circa 22+ million JPEG 2000 images, 150-200 DPI
(unconfirmed), a zipfile ('store') per volume
– 360 pages per volume on average
– No explicit subjects in metadata, but heavy on
travel, geography, ethnology, (English)
literature and plenty of 'misc'
Accessible?
• In theory, the books were accessible
online.
• In practice, it was a real challenge to find
anything viewable.
Image extraction process
• Worker-based, using a message queue to
coordinate.
• Thread-unsafe (due to zips) so limited to
one worker per zip.
– Local network storage was nearly full
– Limited by hardware too (4 months to get
RAM upgrade)
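The one-worker-per-zip rule can be sketched as a claim/release pair. This is a minimal sketch, not the project's code: the `lock:` key scheme, the function names, and the timeout are hypothetical, and `InMemoryStore` is a stand-in that mimics only the `set(..., nx=True)` and `delete` calls of a Redis client so the example runs without a server.

```python
class InMemoryStore:
    """Tiny stand-in for a Redis client (assumption: only set/delete are needed)."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, nx=False, ex=None):
        # With nx=True the key is written only if it does not exist yet.
        # The expiry argument is accepted but ignored by this stand-in.
        if nx and name in self._data:
            return None
        self._data[name] = value
        return True

    def delete(self, name):
        return 1 if self._data.pop(name, None) is not None else 0


def try_claim_zip(store, zip_path, worker_id, ttl_seconds=3600):
    """Return True only for the first worker that claims this zip."""
    return bool(store.set("lock:" + zip_path, worker_id, nx=True, ex=ttl_seconds))


def release_zip(store, zip_path):
    """Give the zip back so another worker may claim it."""
    store.delete("lock:" + zip_path)
```

An expiry on the lock is one way to recover from a worker that dies while holding it; whether the original setup did this is not stated in the slides.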
Tech used:
• VirtualBox
• Redis (msg queue, semaphore, metadata
cache)
• Python
– OpenCV main library used:
• Opens JPEG 2000 with colour profiles
• Quick to work with image regions
• Also saved region as JPG (92%) for reuse
Filter first!
• Only ALTO files containing an Illustration
element are of interest.
• Grep quickly identified the 1 million XML
files of interest (only 4-5% of the total)
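The same filter-first idea can be sketched in Python for cases where grep is not at hand. The function name, the search string, and the `.xml` suffix are assumptions; the slides only say that grep was used.

```python
import os


def files_mentioning(root_dir, needle=b"Illustration", suffix=".xml"):
    """Yield paths under root_dir whose raw bytes contain needle."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # A plain byte search avoids parsing XML for the ~95% of
            # files that will be discarded anyway.
            with open(path, "rb") as handle:
                if needle in handle.read():
                    yield path
```

Only the files that survive this cheap pass need to be handed to a full XML parser.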
Resilience
• Never trust a process
– Did it fail?
– Did it fail silently?
– Does the expected JPG exist on disc? Is it
non-zero in length?
– Did IT services hard-reboot the desktop
machine hosting your VMs during the
night?
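The "does the JPG exist and is it non-zero" check is easy to automate after each batch. A minimal sketch, assuming the list of expected output paths is already known; the function name is hypothetical.

```python
import os


def missing_or_empty(expected_paths):
    """Return the paths that do not exist or have zero length."""
    failed = []
    for path in expected_paths:
        try:
            if os.path.getsize(path) == 0:
                failed.append(path)
        except OSError:
            # Raised when the file is missing or unreadable.
            failed.append(path)
    return failed
```

Anything returned here can be pushed back onto the work queue instead of trusting the worker's own exit status.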
Overview:
• Started with one desktop VM, and a
connection to a local NAS
• Ended having used multiple VMs on Azure
as well, after piping content to their store.
– Redis replicated natively w/ SSH tunnel to
write node
Identifiers...
• Little help available from overstretched IT
architecture team.
• Naive filename syntax to begin with:
– SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
– Stored by publication year.
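A filename convention like this can be built and parsed with a few lines of Python. The sketch below is illustrative only: the sample values, the absence of zero-padding rules, and the assumption that the parent directory name is the publication year are all guesses, not details from the slides.

```python
import os


def build_name(sysnum, vol, page, img_idx, human_text):
    """Assemble SYSNUM_VOL_PG_IMGIDX_humantxt.jpg from its parts."""
    return "{}_{}_{}_{}_{}.jpg".format(sysnum, vol, page, img_idx, human_text)


def parse_path(path):
    """Recover the metadata fields from a stored image path."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    # maxsplit=4 keeps any underscores inside the human-readable part.
    sysnum, vol, page, img_idx, human_text = stem.split("_", 4)
    return {
        "year": os.path.basename(os.path.dirname(path)),  # assumed layout
        "sysnum": sysnum,
        "vol": vol,
        "page": page,
        "img_idx": img_idx,
        "human_text": human_text,
    }
```

Because every field can be read back from the path, a job on the queue needs to carry nothing more than the path itself.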
We have images!
• 580 GB of JPGs
• From dogfooding, hybrid approach
seemed necessary:
• Online, shareable, linkable, easy-to-find
presence, with a unique ID per image.
• Easy mapping between local image and
online image.
Images already available
• ... in theory.
• We needed something else in the short
term.
Options
• Wikimedia Commons: we know about the
books, but have no idea about the actual
content! WC wouldn't be able to handle
1 million images in one go.
• Er... Flickr?
Upload by worker
• Again, similar structure - job was simply a
filepath (metadata deducible)
• Ran approximately 16-18 workers for 9
days to upload images.
• High 90s upload success rate (time of day
dependent)
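With an upload success rate in the high 90s, failed jobs have to go back on the queue. The loop below is a minimal sketch using the standard library's `queue` module in place of Redis; `upload` is a hypothetical callable supplied by the caller, and the retry limit is an assumption.

```python
import queue


def run_upload_worker(jobs, upload, max_attempts=3):
    """Drain a queue of (path, attempt) jobs, requeueing failures."""
    done, gave_up = [], []
    while True:
        try:
            path, attempt = jobs.get_nowait()
        except queue.Empty:
            break
        try:
            upload(path)
            done.append(path)
        except Exception:
            # Retry later rather than immediately, so a transient
            # outage does not burn all attempts at once.
            if attempt + 1 < max_attempts:
                jobs.put((path, attempt + 1))
            else:
                gave_up.append(path)
    return done, gave_up
```

Paths that end up in `gave_up` are candidates for a later run at a quieter time of day.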
Outcome
• Launched 13 December on Flickr
Commons
• Spike: 55 million image views in 5 days
• By March 2014, 70k+ tags added by
community -
map, portrait, cover, childrensbook, and so
on.
Keeping track
• Many bad/misleading API calls
• (people.photos.)recentlyUpdated seems to
mostly work
Current scheme
• Every morning, call recentlyUpdated for
list of images that had some change
• Re-scan images and deduce changes in
tags, comments, views and favourites.
– (Same pattern, rescan jobs taken by
get_activity workers. Running 4 is enough
outside of spike times)
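Deducing what changed comes down to comparing the previous snapshot of an image with the freshly scanned one. A minimal sketch, assuming each snapshot is a dictionary with a `tags` list and integer counters; this data shape and the function name are hypothetical, not the Flickr API's response format.

```python
def diff_snapshot(old, new):
    """Compare two activity snapshots of the same image."""
    changes = {
        "tags_added": sorted(set(new["tags"]) - set(old["tags"])),
        "tags_removed": sorted(set(old["tags"]) - set(new["tags"])),
    }
    for counter in ("comments", "views", "favourites"):
        changes[counter + "_delta"] = new[counter] - old[counter]
    return changes
```

Storing only the latest snapshot per image is enough for this comparison, since each rescan replaces the previous one.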
Caching
• Redis sets:
– PeopleID links to set of FlickrID+tagadded
– FlickrID links to set of user tags
– Sorted sets for 'high score' lists:
contributors, favourites, tags
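The set and sorted-set layout can be sketched as follows. The key names (`user:`, `photo:`, `top:`) and `record_tag` are hypothetical, and `InMemoryStore` is a stand-in that imitates the `sadd`, `smembers`, `zincrby` and `zrevrange` calls of a Redis client so the example runs without a server; a real client would return scores as floats.

```python
from collections import defaultdict


class InMemoryStore:
    """Stand-in for the few Redis set and sorted-set calls used below."""

    def __init__(self):
        self.sets = defaultdict(set)
        self.scores = defaultdict(dict)

    def sadd(self, name, *values):
        before = len(self.sets[name])
        self.sets[name].update(values)
        return len(self.sets[name]) - before  # number of new members

    def smembers(self, name):
        return set(self.sets[name])

    def zincrby(self, name, amount, value):
        self.scores[name][value] = self.scores[name].get(value, 0) + amount
        return self.scores[name][value]

    def zrevrange(self, name, start, end, withscores=False):
        # Highest score first; only end >= 0 or end == -1 is handled here.
        ranked = sorted(self.scores[name].items(),
                        key=lambda kv: (kv[1], kv[0]), reverse=True)
        stop = None if end == -1 else end + 1
        chunk = ranked[start:stop]
        return chunk if withscores else [member for member, _ in chunk]


def record_tag(store, people_id, flickr_id, tag):
    """Record one user tag and update the 'high score' lists once."""
    is_new = store.sadd("user:" + people_id, flickr_id + ":" + tag)
    if is_new:
        store.sadd("photo:" + flickr_id, tag)
        store.zincrby("top:contributors", 1, people_id)
        store.zincrby("top:tags", 1, tag)
    return bool(is_new)
```

Checking the return value of `sadd` before incrementing keeps a repeated rescan from counting the same tag twice.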
Summary
• Workers to spin up when required
• Variety of workers, variety of queues
• Never trust a worker or process
• Never trust an API
• Sample where you can't test.
