5. Solution: Use Neural Networks to Generate
Descriptive Tags For Every Image
building
architecture
temple
pyramid
stone
etc...
6. Roadmap
- What are Neural Networks?
- The First Pass: Category Classification
- Lots of Tags: OCR Tagging Using Related Images
- Image Captioning
- Presenting: SherlockNet Interface
12. We trained a CNN to classify all 1M images into
one of 12 categories
people: 0.80
architecture: 0.12
diagrams: 0.05
object: 0.02
decoration: 0.01
(confidence scores)
13. We trained a CNN to classify all 1M images into
one of 12 categories
81% top-1 accuracy
97% top-3 accuracy
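Top-1 / top-3 accuracy figures like the ones above can be computed with a small helper. This is a generic sketch with made-up toy scores, not the project's actual evaluation code:

```python
def topk_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for s, y in zip(scores, labels):
        topk = sorted(range(len(s)), key=lambda c: s[c], reverse=True)[:k]
        hits += y in topk
    return hits / len(labels)

# Toy example: 2 images, 3 categories
scores = [[0.80, 0.12, 0.08],   # true label 0 -> a top-1 hit
          [0.20, 0.50, 0.30]]   # true label 2 -> only a top-2 hit
labels = [0, 2]
print(topk_accuracy(scores, labels, 1))  # 0.5
print(topk_accuracy(scores, labels, 3))  # 1.0
```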
20. We “vectorize” images and minimize Euclidean
distance to obtain related images
CNN
<3,4,-1,-3,4>
21. We “vectorize” images and minimize Euclidean
distance to obtain related images
CNN
<3,4,-1,-3,4>
<2,4,-1,-5,3>
<-3,2,5,3,-3>
<-1,0,0,5,1>
<3,3,0,-3,5>
D = 6
D = 161
D = 106
D = 3
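The D values on these slides are consistent with *squared* Euclidean distance between the five-dimensional feature vectors. A minimal sketch of the nearest-neighbour lookup, using the vectors from the slide:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [3, 4, -1, -3, 4]
candidates = [
    [2, 4, -1, -5, 3],   # D = 6
    [-3, 2, 5, 3, -3],   # D = 161
    [-1, 0, 0, 5, 1],    # D = 106
    [3, 3, 0, -3, 5],    # D = 3  <- most related image
]
dists = [sq_dist(query, c) for c in candidates]
print(dists)  # [6, 161, 106, 3]
nearest = candidates[dists.index(min(dists))]
```

In practice the real feature vectors come from a CNN layer and have thousands of dimensions, but the lookup logic is the same.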
25. We then had similar images “vote” on tags
bird
tree
london
park
stick
plant
wing
claws
beak
nuts
wing
pacific
species
people
rainbow
pair
bird
park
wing
description
species
tree
beak
london
perch
eye
bird
bench
26. We pooled surrounding text from similar images
bird
park
wing
species
beak
This makes the tags for each image much cleaner
and more refined
32. Motivation
- Most natural way of showcasing images
- Opportunities to provide contextual information, e.g. “the man next to the woman”
- From an AI research standpoint: interesting theoretical challenges
33. Background
- Combining two distinct neural networks (CNNs and RNNs) to do end-to-end processing
- A very active area of research
34. Challenges
- High-quality photographs vs. low-res, black & white illustrations
- Ambiguity in detail levels
- Difficulty in obtaining ground truth data (“machines can’t learn without prior knowledge”)
41. A New Dataset
- British Museum Prints and Drawings Collection
- ~200,000 images available through the public interface
- Many have good, human-annotated captions
- Potential for machine learning research?
(From www.britishmuseum.org Online catalogue)
46. SherlockNet will one day provide multiple levels of
high-quality text annotation for every image
Tags: Architecture, landscape, river, trees, boat
Caption: A boat on a tree-lined river in front of a building
47. Acknowledgements
The British Library
Mahendra Mahey
Adam Farquhar
Hana Lewis
Adrian Edwards
Elliot Crowley
Mario Klingemann
Ben O'Steen
Stanford University
Andrej Karpathy
Justin Johnson
Stefano Ermon
The British Museum
51. Neural networks reveal image features that become
more or less frequent over time
Feature #541 is highly activated in modern decorations compared to antique decorations
52. Images with high score for Feature #541 vs. images with low score for Feature #541
Feature #541 is highly activated in modern decorations compared to antique decorations
53. Feature #541 probably indicates the presence of lines
delineating the top and bottom of the decoration.
54. Process + Results
- Decorations: 64% accuracy; Maps: 52% accuracy (compared to 16% and 20% random-chance accuracy, respectively)
- Pretty good results given the inherent limitations!
Talk about the British Library’s Flickr Commons collection.
It contains more than a million images from the British Library’s digitized collection of over 65,000 books, spanning the 15th to 19th centuries.
Subjects: literature, science, anthropology, and many more.
Put online by the British Library for researchers and the public to use in novel and interesting ways.
The current tags (date, volume, page) are not very useful.
Talk about neural networks: convolutional neural networks, or CNNs for short.
This bleeding-edge computer vision technology has been used in the past couple of years to perform image recognition with extremely successful results...
...even outperforming humans!
Final project goal: use neural networks to generate descriptive tags for every image in the British Library Flickr collection.
In our project we used CNNs to:
- classify each image into a category
- find related images
- generate captions
CNNs are very, very good at the above tasks. We’ll talk briefly about why this is, and how they work.
At a high level, a neural network takes an input and, for each possible category, it computes a score. A higher score means the input is more likely to be in that category.
In the process of computing the scores, the input is passed through multiple layers. At each layer, the neural network is “activated” by features of increasing levels of complexity. These activations are determined by parameters that the neural network learns over time.
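Per-category confidences like those on slide 12 (people: 0.80, architecture: 0.12, ...) are typically obtained by passing the raw scores through a softmax, which turns them into probabilities that sum to 1. A minimal sketch (the raw scores here are made up, and the project's exact output layer is an assumption):

```python
import math

def softmax(scores):
    """Turn raw per-category scores into probabilities that sum to 1."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for 5 categories
probs = softmax([4.0, 2.1, 1.2, 0.3, -0.4])
# The largest raw score gets the largest probability, and ordering is preserved
```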
The concept is analogous to the activation of biological neurons that form the communication network of our brain and spinal cord.
The multiple layers of a neural network allow it to recognize complex patterns: the more neurons (computational power), the more complex the patterns it can recognize, and thus the better the classification results.
Convolutional neural networks are specialized for images as input. Because images have width, height, and depth, CNNs optimize for this structure by computing activations only over small regions of the input. For example, a layer can look for localized features like an edge or a blotch of color.
That is, a CNN can recognize a visual feature that appears in multiple places, facing different ways, at different angles and sizes, etc.
The name, convolution, comes from the mathematical operation that is performed between the input and the neural network’s parameters at each layer.
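The operation is, up to a flip of the kernel, a sliding dot product between a small filter and each patch of the image (deep-learning libraries usually implement the unflipped version, cross-correlation). A minimal sketch with a hypothetical vertical-edge-detecting kernel:

```python
def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel over the image,
    taking a dot product with each patch."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A tiny image with a dark-to-bright edge down the middle,
# and a kernel that responds to exactly that kind of edge
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
print(conv2d(image, kernel))  # [[0, 2, 0], [0, 2, 0]] -- activates only at the edge
```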
Tie this into Brian’s 12 categories.
Explain SherlockNet Labs.
Signpost what we’re going to do:
We are calling the next two sections “SherlockNet Labs”: tasks that we think are a little beyond what’s currently feasible with neural nets. This project allowed us the opportunity to explore them a bit, and our hope is that we can inspire further research in these topics.
Why do we need/want captions?
Captions are the most natural way of showcasing images.
Around 5% to 10% of the dataset
Two years ago I studied abroad at Oxford. There I did a tutorial on the history of British architecture.
One thing I found really challenging was trying to find records of architecture in books. I spent hours sitting in the Bodleian hunting down books.
Oxford, architecture
To recap, we have tagged over 1 million images with convolutional neural networks and generated hundreds of thousands of human-readable captions for them, while providing an interface for people to explore them easily. Our ultimate hope for this project is that it serves as a prototype for any digital collection: making content more discoverable, dramatically cutting down on years of manual labelling, and providing tools to discover deeper insights into the rich, rich materials that history has left us.
We want to thank all of our mentors and collaborators, especially British Library Labs and Mahendra Mahey, for working with us through multiple time zones to make this project happen. It’s been great fun. Thank you!