3. It began with dogfood...
• "Given access to a filesystem of media with an easily learned layout convention, can a researcher use their own tools?"
• So we contrived a research question:
4. "Can we find the faces in the 19th C scanned book collection?"
6. Outcome:
• Majority of tools and libraries expect local filesystem or in-memory access; no network/API knowledge needed by the researcher.
• While lookup by layout is awkward, it is a pragmatic approach when distributing content by sneakernet. Might be paired with a light online search engine and documentation/wiki for best practices.
8. 'Project' success?
• Computer Vision algorithms are predominantly based on photographic input. Room for improvement.
• Catch-22 with respect to training sets.
• But... applying Haar cascade profiles, based on a photo training set, had some reasonable success!
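As a sketch of that step: OpenCV ships pretrained frontal-face Haar cascades that can be run over a greyscale page scan. The cascade path and tuning parameters below are illustrative, and the `detection_rate` helper is a made-up metric for summarising a run, not part of the original pipeline.

```python
# Sketch: run a Haar cascade over a scanned page (assumes opencv-python).
# Parameters are illustrative; engravings need fairly forgiving settings.

def detect_faces(gray_page, cascade_path=None):
    """Return a list of (x, y, w, h) face boxes found on a greyscale page."""
    import cv2  # deferred so the helper below works without OpenCV installed
    cascade_path = cascade_path or (
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cascade = cv2.CascadeClassifier(cascade_path)
    return list(cascade.detectMultiScale(gray_page,
                                         scaleFactor=1.1, minNeighbors=4))

def detection_rate(pages_with_hits, total_pages):
    """Crude success metric: fraction of pages with at least one detection."""
    return 0.0 if total_pages == 0 else pages_with_hits / total_pages
```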
10. 19C depictions of faces
• Likelihood of detection:
• Female faces > Male
• Why women?
• Drawn more symmetrically - male faces were more likely to be exaggerated.
• Depiction is typically 'clean' and posed
• Fashion: beards, spectacles and hats - very different from the training sets
12. An Interesting By-product emerged
• The ALTO XML, created by MS as part of the digitisation process, was found to have 'GraphicalIllustration' elements.
– polygonal boundaries for areas where it detected contiguous content but where OCR didn't work.
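Those elements can be pulled out with a few lines of standard-library XML parsing. A minimal sketch, assuming the element is named `GraphicalIllustration` as on the slide and carries ALTO's usual HPOS/VPOS/WIDTH/HEIGHT positional attributes (real files are namespaced, so we match on the local name):

```python
# Sketch: extract 'GraphicalIllustration' regions from one ALTO file.
import xml.etree.ElementTree as ET

def illustration_boxes(alto_xml):
    """Yield (HPOS, VPOS, WIDTH, HEIGHT) for each GraphicalIllustration."""
    root = ET.fromstring(alto_xml)
    for el in root.iter():
        # strip any '{namespace}' prefix before comparing the tag name
        if el.tag.rsplit("}", 1)[-1] == "GraphicalIllustration":
            yield tuple(int(el.attrib[k])
                        for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
```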
13. A map to all* the images?
* Unlikely to be comprehensive
14. A map to all* the images?
The 'Mechanical Curator' found:
– Maps
– Portraits
– Marginalia
– Covers
– Charts and diagrams
– Decorations
17. Microsoft Books
• Context:
– 47k 'works' digitised, 68k volumes
– 15.3 TB of images, 1.3 TB of ALTO XML
– circa 22+ million JPEG 2000 images, 150-200 DPI (unconfirmed), a zipfile ('store') per volume
– 360 pages per volume on average
– No explicit subjects in metadata, but heavy on travel, geography, ethnology, (English) literature and plenty of 'misc'
18. Accessible?
• In theory, the books were accessible online.
• In practice, it was a real challenge to find anything viewable.
19. Image extraction process
• Worker-based, using a message queue to coordinate.
• Thread-unsafe (due to zips), so limited to one worker per zip.
– Local network storage was nearly full
– Limited by hardware too (4 months to get a RAM upgrade)
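The worker/queue shape above can be sketched as a plain loop: in the real pipeline the queue was a Redis list and the one-worker-per-zip rule a Redis semaphore, but the control flow is the same. All names here are illustrative:

```python
# Sketch of the worker loop: each job is a zip filepath; a zip may only be
# claimed by one worker at a time. In production `queue.pop(0)` would be a
# Redis BLPOP and the lock a SETNX semaphore.

def run_worker(queue, locks, process):
    """Drain `queue`, processing each claimed zip and returning what was done."""
    done = []
    while queue:
        job = queue.pop(0)        # BLPOP against a Redis list in production
        if job in locks:          # another worker holds this zip: skip it
            continue
        locks.add(job)            # acquire (SETNX semaphore in production)
        try:
            process(job)
            done.append(job)
        finally:
            locks.discard(job)    # release, even if processing raised
    return done
```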
20. Tech used:
• VirtualBox
• Redis (msg queue, semaphore, metadata cache)
• Python
– OpenCV the main library used:
• Opens JPEG 2000 with colour profiles
• Quick to work with image regions
• Also saved each region as JPEG (quality 92) for reuse
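Region extraction then reduces to array slicing on the decoded page plus a JPEG write at quality 92. A sketch, with the OpenCV import deferred so the slicing can be checked on its own (function names are illustrative):

```python
# Sketch: crop an illustration box from a decoded page (a NumPy array, as
# OpenCV returns) and save it as JPEG at quality 92, per the slides.

def crop_region(page, x, y, w, h):
    """Return the (h, w) sub-array for one illustration box."""
    return page[y:y + h, x:x + w]

def save_region(page, box, out_path):
    """Write one cropped region to disk as a quality-92 JPEG."""
    import cv2  # deferred so crop_region is usable without OpenCV
    x, y, w, h = box
    cv2.imwrite(out_path, crop_region(page, x, y, w, h),
                [cv2.IMWRITE_JPEG_QUALITY, 92])
```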
21. Filter first!
• Only ALTO files containing an Illustration element are of concern.
• Grep quickly identified the 1 million XML files of interest (only 4-5% of the total)
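That filter is reproducible with plain grep; the demo below runs against a scratch directory rather than the real NAS tree:

```shell
# Filter first: keep only ALTO files that mention an Illustration element.
# Demo directory and files stand in for the real ALTO tree.
mkdir -p alto_demo
printf '<alto><GraphicalIllustration/></alto>' > alto_demo/a.xml
printf '<alto><TextBlock/></alto>' > alto_demo/b.xml
# -r recurse, -l print matching filenames only
grep -rl 'GraphicalIllustration' alto_demo > files_of_interest.txt
```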
25. Resilience
• Never trust a process
– Did it fail?
– Did it fail silently?
– Does the expected JPG exist on disc? Is it non-zero in length?
– Did IT services hard-reboot the desktop machine hosting your VMs during the night?
26. Overview:
• Started with one desktop VM and a connection to a local NAS
• Ended having used multiple VMs on Azure as well, after piping content to their store.
– Redis replicated natively, with an SSH tunnel to the write node
27. Identifiers...
• Little help available from overstretched IT architecture team.
• Naive filename syntax to begin with:
– SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
– Stored by publication year.
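That syntax makes the metadata deducible from a bare filepath. A sketch of building and parsing it (the field values below are invented examples):

```python
# Sketch: build and parse the naive SYSNUM_VOL_PG_IMGIDX_humantxt.jpg
# filename syntax from the slides. Field widths/values are assumptions.

def make_name(sysnum, vol, page, imgidx, humantxt):
    return f"{sysnum}_{vol}_{page}_{imgidx}_{humantxt}.jpg"

def parse_name(name):
    """Recover the metadata a worker can deduce from a bare filename."""
    stem = name.rsplit(".", 1)[0]
    # maxsplit=4 keeps any underscores inside the human-readable part intact
    sysnum, vol, page, imgidx, humantxt = stem.split("_", 4)
    return {"sysnum": sysnum, "vol": vol, "page": page,
            "imgidx": imgidx, "humantxt": humantxt}
```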
28. We have images!
• 580 GB of JPGs
• From dogfooding, a hybrid approach seemed necessary:
• An online, sharable, linkable, easy-to-find presence, with a unique ID per image.
• Easy mapping between local image and online image.
30. Options
• Wikimedia Commons: we know about the books, but have no idea about the actual content! WC wouldn't be able to handle 1 million images in one go.
• Er... Flickr?
31. Upload by worker
• Again, a similar structure - a job was simply a filepath (metadata deducible)
• Ran approximately 16-18 workers for 9 days to upload the images.
• High-90s percentage upload success rate (time-of-day dependent)
32. Outcome
• Launched 13 December on Flickr Commons
• Spike: 55 million image views in 5 days
• By March 2014, 70k+ tags added by the community - map, portrait, cover, childrensbook, and so on.
34. Keeping track
• Many bad/misleading API calls
• (people.photos.)recentlyUpdated seems to mostly work
35. Current scheme
• Every morning, call recentlyUpdated for a list of images that have had some change
• Re-scan those images and deduce changes in tags, comments, views and favourites.
– (Same pattern: rescan jobs taken by get_activity workers. Running 4 is enough outside of spike times)
36. Caching
• Redis sets:
– PeopleID links to a set of FlickrID+tagadded
– FlickrID links to a set of user tags
– Sorted sets for 'high score' lists: contributors, favourites, tags
37. Summary
• Workers to spin up when required
• Variety of workers, variety of queues
• Never trust a worker or process
• Never trust an API
• Sample where you can't test.