2. Outline
• Personal introduction
• What is Colourbox?
• Why is Colourbox interesting?
• Similar images
• Search result ranking
• Recommendations
• Why Colourbox?
• Open position
• Questions
Martin R. Ehmsen
martin@colourbox.com
www.colourbox.com
3. Who am I?
Why am I here?
• Me
• Graduated from IMADA, 2010
• Ph.D. in Computer Science
• Online Algorithms
• Technical Project Manager
& System Architect
• Why this talk?
• Promote Colourbox
• There are interesting jobs on Funen
4. Colourbox
• Microstock photography company
• Resell images, vector graphics, videos
• March 2006
• 3 employees, 50 users, 50,000 images,
150 new images daily
• November 2011
• 21 employees, 65,000 users, 2,000,000 images,
5,000 new images daily
• Currently in top 10 of all stock sites, aiming at #1
5. Colourbox
• Only stock site that offers a flat rate
• Download all you want for €249,- per month
• Search, find, download
• Browse, get inspired, download
6. The Tech
• Built using open source software
• HTML(5), CSS(3), and JavaScript (jQuery) front-end
• Varnish, Lighttpd, and Memcached
• MySQL (Percona) database
• PHP backend
• PHP, Python, and C++ scripts
• Self-developed search engine (Colourit)
• Using Python and C
• Cloud based on Amazon EC2 and S3
7. The Setup
8. The Geek Side
• Techniques from mathematics and computer science
• Distributed/parallel computing
• Vector mathematics
• Various tree structures
• Set intersection
• Cache oblivious algorithms
• Clustering algorithms
• Ranking algorithms
• Markov chains
• etc...
9. Similar images
• Given an image, what other images look similar to it?
• Inspire
• Browse
• All images have keywords
• The keyword-to-image association is weighted
• How pronounced is the keyword for the image?
• Calculated automatically (more later)
10. Similar images
11. Similar images
• Each keyword is a dimension in keyword vector space
• Each image is then represented as a vector in this space
• The projection onto each dimension is the weight of
the corresponding keyword
• Example
• (goat, 96), (white, 94), (outside, 50)
• Vector (x, y, z, w) = (0.96, 0.94, 0.5, 0)
• (goat, 47), (white, 81), (day, 19)
• Vector (x, y, z, w) = (0.47, 0.81, 0, 0.19)
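The example above can be sketched in Python. The name `to_vector`, the fixed dimension order, and the division by 100 are assumptions read off the example, not Colourbox's actual code.

```python
# Hypothetical sketch: turn weighted keywords into a keyword-space vector.
# The dimension order (goat, white, outside, day) follows the example above.
DIMENSIONS = ["goat", "white", "outside", "day"]


def to_vector(keywords, dimensions=DIMENSIONS):
    """Map {keyword: weight in 0..100} to a list of floats in 0..1.

    Keywords the image does not have get weight 0.
    """
    return [keywords.get(name, 0) / 100 for name in dimensions]


image_a = to_vector({"goat": 96, "white": 94, "outside": 50})
image_b = to_vector({"goat": 47, "white": 81, "day": 19})
print(image_a)
print(image_b)
```

With one dimension per keyword in the whole catalogue, most entries are 0, so a real system would more likely keep the sparse keyword-to-weight mapping than a dense list.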
12. Similar images
• Similarity is then the angle between two vectors
• Easily calculated using high school math
u · v = cos(θ) |u| |v|
• Result between 0 and 90 degrees
• Example (cont.)
• (0.96, 0.94, 0.5, 0) and (0.47, 0.81, 0, 0.19)
• Approx 27.73 degrees
• Do two images with similarity of 27.73 degrees look similar?
• Experiments determined the cut-off
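A minimal sketch of the angle calculation, using the two example vectors. The function name `angle_degrees` is hypothetical, and the code assumes every image has at least one keyword with non-zero weight (otherwise the norm is zero).

```python
import math


def angle_degrees(u, v):
    """Angle between two keyword vectors, from u . v = cos(theta) |u| |v|."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_theta = dot / (norm_u * norm_v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


u = (0.96, 0.94, 0.5, 0.0)
v = (0.47, 0.81, 0.0, 0.19)
print(angle_degrees(u, v))  # should agree with the figure quoted above
```

Because all weights are non-negative, the dot product is never negative, which is why the result stays between 0 and 90 degrees.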
13. Similar images
14. Similar images
• 2,000,000 images yield roughly 2,000,000,000,000 pairwise comparisons
• No job dependencies
• No data modifications
• Relatively small data size
• Each keyword is identified by a number
• Very easy to do in parallel and distribute
• Speed up using a trick from cache oblivious algorithms
• This is not a one-time thing
• Keywords and weights change
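One way to picture the job split is to cut the image list into blocks and treat each pair of blocks as an independent job. This is only a sketch under assumptions: the slides do not say which cache-oblivious trick is used, and blocking is just one common cache-friendly technique. All names (`pair_count`, `block_jobs`, `compare_block`) are hypothetical.

```python
def pair_count(n):
    """Number of unordered image pairs among n images."""
    return n * (n - 1) // 2


def block_jobs(n, size):
    """Yield independent jobs: pairs of index ranges (a, b) with a before or equal to b."""
    ranges = [(start, min(start + size, n)) for start in range(0, n, size)]
    for i, a in enumerate(ranges):
        for b in ranges[i:]:
            yield a, b


def compare_block(vectors, a, b, compare):
    """Compare each image in block a with each later image in block b.

    Only the vectors of two blocks are touched, so the working set stays small.
    """
    results = []
    for i in range(a[0], a[1]):
        for j in range(max(b[0], i + 1), b[1]):
            results.append((i, j, compare(vectors[i], vectors[j])))
    return results


def dot(u, v):
    """Stand-in comparison; the angle calculation would go here."""
    return sum(x * y for x, y in zip(u, v))


vectors = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0), (0.2, 0.9), (0.7, 0.1)]
all_results = []
for a, b in block_jobs(len(vectors), size=2):
    all_results.extend(compare_block(vectors, a, b, dot))

print(len(all_results), pair_count(len(vectors)))
print(pair_count(2_000_000))
```

Since the jobs share no state and modify no data, each `(a, b)` pair could be handed to a separate worker process or machine.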
15. Ranking of results
• How to rank search results?
• Want the “best” results first
• First solution: Use number of downloads as parameter
• Problems
• Old good images rank over new excellent images
• Wrong keywords distort the results
16. Ranking of results
• Harvest information from the users
• A clicked/downloaded image
• Matched the search string well
• Is a “good” image
• A shown-but-not-clicked image either
• Does not match the search string well, or
• Is a “bad” image
17. Ranking of results
• The keyword-to-image association is weighted
• Keyword weights are updated when
• a keyworder assigns a keyword (high weight)
• a supplier assigns a keyword (high weight)
• a user clicks on a photo presented by a search
• a user does NOT click on a photo presented
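The click-driven updates could look something like the sketch below. The slides do not give the actual update formula, so the rule here (move the weight a fixed fraction toward 1 on a click and toward 0 otherwise) is purely an assumption, as are the names `update_weight` and `record_search` and the `rate` parameter.

```python
def update_weight(weight, clicked, rate=0.1):
    """Hypothetical rule: nudge the weight toward 1.0 on a click, toward 0.0 otherwise."""
    target = 1.0 if clicked else 0.0
    return weight + rate * (target - weight)


def record_search(weights, query_keywords, shown, clicked_image, rate=0.1):
    """Update the searched keywords of every image shown for one search.

    weights maps image id -> {keyword: weight in 0..1}.
    Keywords that were not part of the search are left untouched.
    """
    for image in shown:
        for keyword in query_keywords:
            if keyword in weights[image]:
                weights[image][keyword] = update_weight(
                    weights[image][keyword], image == clicked_image, rate
                )


weights = {
    "first": {"lemon": 0.9, "summer": 0.8, "apple": 0.1},
    "second": {"lemon": 0.7, "summer": 0.9, "apple": 0.0},
}
record_search(weights, ["summer", "lemon"], ["first", "second"], "first")
print(weights)
```

A rule of this shape keeps every weight between 0 and 1, so a single click or miss shifts the ranking only slightly and repeated searches can correct earlier noise.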
18. Ranking of results
• Search “Summer Lemon”
• User clicks first result
• Pros
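• Second image ranked lower for “Lemon”
• Cons
• “Summer” ranked lower on second image
• Fixed by subsequent searches
• Keyword weights before the click
• First image: Lemon (0.9), Summer (0.8), Apple (0.1)
• Second image: Lemon (0.7), Summer (0.9), Apple (0.0)
• Keyword weights after the click
• First image: Lemon (0.95), Summer (0.86), Apple (0.1)
• Second image: Lemon (0.65), Summer (0.8), Apple (0.0)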
19. Ranking of results
• Images with
• Wrong keywords are ranked very low over time
• Good keywords are ranked higher
• Great images are ranked higher overall
• New excellent images can rank over old mediocre images
20. Recommendations
• “You are currently looking at image X,
and you might be interested in image Y, Z, and W”
21. Recommendations
• What images are connected?
• Let’s track our users to find out
22. Recommendations
#2364906 #2964241 #2684393
23. Recommendations
24. Recommendations
25. Recommendations
• Enter Markov chains
• Using a Markov chain of order 1, the probability of going from media X to media Y is
• The number of times the path X → Y was followed, divided by
• The total number of times any path out of X was followed
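The transition probability can be estimated directly from tracked user paths. This is a sketch only: the session data is made up, and the names `count_transitions`, `transition_probability`, and `recommend` are hypothetical.

```python
from collections import Counter, defaultdict


def count_transitions(sessions):
    """Count how often users went directly from one image to the next."""
    counts = defaultdict(Counter)
    for path in sessions:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return counts


def transition_probability(counts, x, y):
    """Times X -> Y was followed, divided by all transitions out of X."""
    outgoing = counts.get(x, Counter())
    total = sum(outgoing.values())
    return outgoing[y] / total if total else 0.0


def recommend(counts, x, k=3):
    """The k images most often visited right after image x."""
    return [image for image, _ in counts.get(x, Counter()).most_common(k)]


# Made-up browsing sessions; each list is the order in which images were viewed.
sessions = [["X", "Y", "Z"], ["X", "Y"], ["X", "W"], ["Y", "Z"]]
counts = count_transitions(sessions)
print(transition_probability(counts, "X", "Y"))
print(recommend(counts, "X"))
```

An order-1 chain looks only at the current image, so the whole model is a table of pair counts that can be updated incrementally as new paths are tracked.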
26. Why Colourbox?
• We are
• small - 15 people no more than 15 steps apart
• flat - no long chains of command
• flexible - we can act on a good idea immediately
• a 2011 Gazelle - we are still hiring while others are still firing
• We have
• Relaxed atmosphere
• Flexible work hours
• Candy cabinet, world class coffee machine, and stunning view :-)
• etc...
27. Why Colourbox?
• You get
• to work on fun problems
• great colleagues
• an international outlook
• to serve customers who are excited about us
• to be part of a company which aims to be #1
• New projects
• SkyFish - Company Colourbox
• Zulubox - to articles what Colourbox is to images
28. We are hiring!
• Software Developer – front-end systems
• Focus on HTML5, JS, PHP, SQL, etc.
• Can implement a pixel-perfect design from a PSD
• Can implement scalable code that also performs
well when it is executed 50 times per second
• You know your way around Linux
• Start August 1st
• We are constructing a new office building
• Unsolicited applications are always welcome