Synchronized video and slides, plus MP3 and slide downloads, are available at http://bit.ly/1rXaVvn.
Aaron Gardner pulls back the covers on the Etsy Search ecosystem and how they got here -- the good, the bad, and the funky. Filmed at qconsf.com.
Aaron Gardner is Engineering Director for Buyer Experience at Etsy, a marketplace where people around the world connect to buy and sell unique goods. Previous gigs include Meetup, Comcast, and some startups.
Etsy Search: How We Index and Query 26 Million One-of-a-kind Items
1. Etsy Search Architecture
How we Query & Index 26 Million One-of-a-Kind Items
Aaron Gardner
Engineering Director, Etsy
agardner@etsy.com / @aargard
QCon SF 2014
November 3rd, 2014
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/etsy-search-ecosystem
3. Purpose of QCon
Presented at QCon San Francisco
www.qconsf.com
6. Etsy by the numbers
• 40 million members
• 1+ million sellers
• 26 million listed items
• 25%+ of orders cross a border
• $1.35 billion in sales in 2013
30. Thrift: Cross-Lang service & code-gen
• Define API service & data types
• Generates PHP & Java client/server code
• Pros: Fast, easy to extend at low level
• Cons: Not HTTP
31. Life of a query
Find relevant items with standard Information Retrieval (IR) techniques
32. Life of a query
Ranking uses scores published from Big Data ML stack
34. Return only PKs from Search
• Keep search index small, better for…
• Replication
• Caches
• JVM GC happiness
• Less data to send back to web servers
• Web stack is great at loading content by PKs
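The split above (search returns only primary keys, the web stack loads content) can be sketched roughly as follows. This is a minimal illustration, not Etsy's code: the in-memory `SEARCH_INDEX` and `LISTINGS_BY_PK` dictionaries and the `search_ids` and `hydrate` functions are hypothetical stand-ins for the search tier and the site databases.

```python
from typing import Dict, List

# Hypothetical stand-in for the search index: term -> ranked listing IDs (PKs only).
SEARCH_INDEX: Dict[str, List[int]] = {
    "mug": [101, 205, 309],
    "scarf": [412],
}

# Hypothetical stand-in for the site database, keyed by primary key.
# Listing 309 is absent, simulating an item removed after it was indexed.
LISTINGS_BY_PK: Dict[int, dict] = {
    101: {"title": "Hand-thrown mug"},
    205: {"title": "Enamel camping mug"},
    412: {"title": "Knitted scarf"},
}


def search_ids(term: str) -> List[int]:
    """Search tier: return ranked primary keys only, never full documents."""
    return list(SEARCH_INDEX.get(term, []))


def hydrate(ids: List[int]) -> List[dict]:
    """Web tier: load full records by PK, keeping search order.

    IDs that no longer exist in the primary store are skipped.
    """
    return [dict(LISTINGS_BY_PK[pk], id=pk) for pk in ids if pk in LISTINGS_BY_PK]


results = hydrate(search_ids("mug"))
```

Because the index holds only IDs, stale display data never has to be corrected in the index itself; the lookup by PK always reads the current record.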
39. The Banner Protocol Dance
• Client picks search host
• Send Thrift search query
• Wait for 4-byte “OK” up to 10ms
• If timeout: try another host
• else “OK”: wait for real results
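The steps above can be sketched as a small client loop. This is an assumption-laden illustration: `FakeHost` and its methods (`send_query`, `wait_banner`, `read_results`) are hypothetical stand-ins for a real Thrift connection, the exact banner bytes are assumed to be `b"OK"`, and the timeout is simulated rather than measured on a socket.

```python
import random
from typing import List, Optional, Sequence

BANNER_TIMEOUT_S = 0.010  # the 10 ms banner budget from the slide


class FakeHost:
    """Hypothetical stand-in for a connection to one search host."""

    def __init__(self, name: str, responsive: bool, results: Optional[List[int]] = None):
        self.name = name
        self.responsive = responsive
        self.results = results or []
        self.query: Optional[str] = None

    def send_query(self, query: str) -> None:
        self.query = query

    def wait_banner(self, timeout: float) -> bytes:
        # A real client would block on the socket for up to `timeout` seconds.
        if not self.responsive:
            raise TimeoutError(f"{self.name}: no banner within {timeout}s")
        return b"OK"

    def read_results(self) -> List[int]:
        return self.results


def banner_query(hosts: Sequence[FakeHost], query: str, rng=random) -> List[int]:
    """Try hosts in random order; commit to the first one that sends the banner."""
    candidates = list(hosts)
    rng.shuffle(candidates)
    for host in candidates:
        host.send_query(query)
        try:
            banner = host.wait_banner(BANNER_TIMEOUT_S)
        except TimeoutError:
            continue  # no quick acknowledgement: try another host
        if banner == b"OK":
            return host.read_results()  # host accepted the query: wait for real results
    raise RuntimeError("no search host acknowledged the query")
```

The point of the short banner wait is that a slow or overloaded host is detected within milliseconds, before the client has committed to waiting for the full result set.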
47. Denorm DB = Speed & Sanity
• We reindex a lot: for features, index cleanup
• Tune denorm DB for speed
• Don’t hurt user-serving DBs with indexers
• 1 denorm store = 1 dumb indexing lib
• Easier to maintain, debug, & add new content
48. Denorm & Indexing
• Sellers & buyers & crons do stuff
• Web stack changes data in site DBs
49. Denorm & Indexing
• Denorm Cron (PHP) finds IDs touched since a cursor
• Spawns async PHP jobs to denorm & store blobs
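A rough sketch of the cursor-based denorm pass described above. The slides say this runs as a PHP cron that spawns async jobs; here it is shown in Python and run synchronously for brevity. The row layout (`updated_at` timestamp plus data), the JSON blob format, and the function names are all assumptions made for illustration.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical site DB rows: primary key -> (updated_at timestamp, row data).
Rows = Dict[int, Tuple[int, dict]]


def find_touched(rows: Rows, cursor: int) -> List[int]:
    """Return the IDs of rows modified after the cursor."""
    return sorted(pk for pk, (updated_at, _) in rows.items() if updated_at > cursor)


def denorm_job(pk: int, rows: Rows) -> str:
    """Build one denormalized blob for a row (a real job would join several tables)."""
    _, data = rows[pk]
    return json.dumps({"id": pk, **data}, sort_keys=True)


def run_denorm_cron(rows: Rows, denorm_store: Dict[int, str], cursor: int) -> int:
    """Denormalize everything touched since `cursor` and return the advanced cursor."""
    touched = find_touched(rows, cursor)
    for pk in touched:
        # The real system spawns an async job per ID; this sketch runs inline.
        denorm_store[pk] = denorm_job(pk, rows)
    return max((rows[pk][0] for pk in touched), default=cursor)
```

Indexers can then read blobs from the denorm store alone, which is what keeps them away from the user-serving databases.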
59. Magic Config File
<?php
/**
* Which search cluster is live.
* ------------------------------------------
* THIS IS A GENERATED FILE. DO NOT EDIT.
* ------------------------------------------
* Generated on Sat, 01 Nov 14 06:35:03 +0000
*/
$CONFIG['search_mtime'] = '1414823703';
$CONFIG['search_live'] = 'flop';
$CONFIG['search_dark'] = 'flip';
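The config above names two clusters, `flip` and `flop`, with one live and one dark. The slide does not show how the file is generated, so the following is only a guess at the swap step, written in Python rather than PHP: `swap_clusters` is a hypothetical helper that exchanges the live and dark names and stamps a new modification time.

```python
import time
from typing import Dict, Optional


def swap_clusters(config: Dict[str, str], now: Optional[int] = None) -> Dict[str, str]:
    """Return a new config with the live and dark clusters exchanged.

    `now` is a Unix timestamp; it defaults to the current time.
    """
    mtime = int(time.time()) if now is None else now
    return {
        "search_mtime": str(mtime),
        "search_live": config["search_dark"],
        "search_dark": config["search_live"],
    }


current = {
    "search_mtime": "1414823703",
    "search_live": "flop",
    "search_dark": "flip",
}
swapped = swap_clusters(current, now=1414900000)
```

Returning a new dictionary instead of mutating the old one makes it easy to write the generated file in one step and to keep the previous values for a rollback.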