July 27, 2011 Bay Area Search Presentation
Brian Johnson, Engineering Director, Query Services @ eBay
Query expansion is an important part of search recall for all search engines. In this talk I'll discuss some of the general trends driving Hadoop adoption within the Search Query Services team at eBay, and the types of algorithms/techniques we've moved to Hadoop at eBay. Over time we've moved from smaller, editorial data sets to large machine generated data sets mined from behavior log data, items/listings, catalogs, etc. One common workflow is to mine large candidate rewrite/expansion data sets from multiple data sources, use crowd sourced human judgment to classify a subset of the candidates (true positive, false positive), use machine learning techniques to discard false positives, run automated validation on the final data set, and automatically push to production.
Ravi Jammalamadaka, Senior Applied Researcher, Query Services @ eBay
Ravi is a real engineer. Not a pointy haired manager like the previous speaker. Expect some real engineering:-) He'll be doing a literature review for acronym mining and discussing a real world implementation.
Title: Mining Acronyms From Raw Text
Abstract: A significant number of eBay products are known by their acronyms. The eBay query expansion service expands user queries with their acronym equivalents to increase recall. The challenge is to mine acronyms from either seller data (e.g., item descriptions, titles) or buyer data (e.g., queries).
Ravi will present the state of the art algorithms from recent conferences that mine acronyms from raw text and present their limitations. He will present a new acronym mining algorithm that seeks to address the limitations identified with previous algorithms. He will present a machine learning classifier that seeks to remove the false positives generated from the acronym mining algorithm.
2. Agenda
• 6:30 Eat & Greet - Free Food & Beer
• 7:00 Speaker #1 – Brian Johnson
• 7:45 Speaker #2 – Ravi Jammalamadaka
• Plan on 2 fabulous 45 minute presentations by excellent local search experts. Please suggest speakers or topics you would like to hear.
• Great speakers, good food, fine beer, and everyone's favorite search term - Free, Free, Free:-)
• Event will be held at the eBay campus just off 17/880 @ Hamilton in the main Community building. Look for lobby/flagpole.
• 4th Wednesday of every month
• http://www.meetup.com/Bay-Area-Search/
3. How Can I Help?
Ÿ Speakers
Ÿ Feedback
Ÿ Organizers
Ÿ Videographers
4. Brian Johnson
• Brian is the Director of Engineering for Query Services at eBay. He has held this role since January of 2011. Prior to that he managed the engineering teams for Query Understanding (metrics and crowdsourced human judgment), classification, data publishing, and browsing. Brian has been at eBay since 2002.
• Prior to eBay Brian was at (http://www.linkedin.com/in/brianscottjohnson)
– Handspring - Managed the team working on email/IM/web browsing for one of the first smartphones (Treo)
– Excite@Home - Director of Engineering for the Excite homepage
– Synopsys - Engineer for chip design visualization
– AT&T Bell Labs - Data visualization research
• Brian received his PhD in Computer Science from the University of Maryland in 1993. His papers regarding visualizing hierarchical and categorical data with Treemaps have been cited hundreds of times.
• Brian is a pleasure to listen to and I'm sure you'll appreciate his insights from the trenches regarding search query rewrite research and practice at eBay.
5. Ravi Jammalamadaka
Ÿ Ravi works in the query services team at eBay
looking at ways to rewrite user queries to improve
both precision and recall.
Ÿ Received his PhD from University of California,
Irvine.
– Research on Data Security, Databases
Ÿ Ravi published 10 research papers in the areas of
databases, data security and data mining.
Ÿ Ravi was invited to be a Program committee member
for IEEE ISI 2010, 2011 and ICDE 2010 (demo
track).
6. Query Rewrites
Brian Johnson
Bay Area Search
July 27, 2011
8. What Is A Query?
Ÿ Queries are more than a text box
Ÿ Keywords=Red Size 7 Shoes
Ÿ Keywords=Red, Category=Shoes
Ÿ Keywords=Red, Category=Shoes, Size = 7
• Many filter variables affect recall
Ÿ Query, category, attributes current context dimension targets
Ÿ Format, condition, location/distance, shipping, seller, price
9. Questions About Queries
Ÿ Popularity/Rank
Ÿ Supply
Ÿ Demand
Ÿ Click Through Rate (CTR)
Ÿ Conversion
Ÿ Rewrites/Expansions
Ÿ Related Searches with CTR & Conversion
Ÿ Category Supply/Demand/CTR/Sales
Ÿ Product Supply/Demand/CTR/Sales
Ÿ Top Products
Ÿ Items (recalled, view, bin, bid, offer, watch, ask, purchase)
Ÿ Autocompletes
Ÿ Classification (broad, narrow, ambiguous, help, navigational)
Ÿ Purchase Site
Ÿ Frequency by day, day of week, time of day
Ÿ Cross Border
Ÿ Sales
Ÿ Position distribution in user sessions
Ÿ Result set size
Ÿ Exit Rate
Ÿ Exit Destination
13. Example Query Services/Rewrites
• Related Search
canon sd1300is, canon sd1400 is, canon sd4000, canon sd1400is, canon sd, canon sd1300 is waterproof,
canon sd 1300, canon
• Stemming (ipod or ipods)
• Spelling (cannon or canon)
• Condition (new or condition=new)
• Synonyms (boat carpet or marine carpet)
• Space Synonyms (MarioKart > Mario Kart)
• Item Specifics (blue or color=blue)
• Acronyms (os = one size in CSA | Operating Systems in Electronics)
• Category (shoes or category=63850)
• Cross Border (site=0 and category =123) or (site=3 and category=456)
• Fitment (fits model=X)
• Term Removal (Harry Potter and the Order of the Phoenix (daily deal))
14. Context & Specificity
Ÿ Beyond decontextualized single entities
Ÿ Examples
– Stemming failures
○ (cowboy v cowboys) and (hat v hats)
○ Doesn’t work for cowboy hats & dallas cowboy caps/hats
– hp printer > (hp v “hewlett packard”) printer
– 15 hp pump > 15 (hp v horsepower) pump
– motor bike > motor (bike v cycle)
– audi b6 > (audi v make=audi) & (b6 v platform=b6) v (product=789)
– the who != who the
– Time
○ Today: latest generation > latest generation v (generation=4)
○ Tomorrow: latest generation > latest generation v (generation=5)
17. Better, Faster, Cheaper
Better
• Better recall
• Awesome related search suggestions
• Mind reading spell corrections
Faster
• <3 milliseconds per query
• 1.2 billion queries per day
• 1,000’s of queries per second on a single machine
Cheaper
• Hadoop offline
• Caching online
18. Metrics/Evaluation
Ÿ Revenue (A/B Test)
Ÿ Relevance (Recall, Precision, DCG, etc.)
Ÿ Result Count
Ÿ Result Set Overlap
Ÿ Click Through Rate
Ÿ Feedback (site links)
Ÿ Human Judgment
Ÿ Competitive/Benchmark data
Ÿ “Gold” test sets
19. Thinking about rewrites
• Query length
• Intent identification
• Autocomplete, autosuggest
• Summarization
• Inference (ex: movie 9)
• Stemming
• Synonyms
• Spell checking
• Stopwords, noise words
• Abbreviations, acronyms
• Units, brands, sizes, dimensions
• Language detection
• Concept vs instantiation (ex: car vs honda)
• Phrases
• Bracketing
• Normalization
• Key term extraction
• Term relaxation / constraining
• Session context
• Trend detection
• Online feedback
• Temporal queries, recency
• Buzz
21. Synonym Candidates
Synonyms derived from top changes in successive queries
– frame / frames
– lamp / lamps
– case / cases
– grill / grille
– shoe / shoes
Synonyms derived from top queries in item query clusters
– texas instruments ba ii plus / ti ba ii plus
– brighton handbag / brighton purse
– lenovo x200 / thinkpad x200
– king bedspread / king coverlet
– rockabilly dress / swing dress
– 1963 ford falcon / 63 falcon
– jessica simpson hair extensions / jessica simpson hairdo
Abbreviations/acronyms derived from query transitions
– stanford ky / stanford kentucky
– dc sub / dc subwoofer
– meridian ms / meridian mississippi
– front royal va / front royal virginia
– baseball pin / baseball pinback
– snowboard helmet l / snowboard helmet large
– motorcycle cam / motorcycle camera
– diamond amp / diamond amplifier
– active sub / active subwoofer
– shapleigh me / shapleigh maine
23. Spell Check – Offline
• Successive queries qi and qi' are candidates for spell correction analysis if the edit distance is within 40% of the average query length.
– qi and qi' may have tokens in common, called anchors.
– Use transitivity to remove intermediate queries.
• Create a bipartite graph for spell correction candidates.
• The same query can exist on both the source and sink sides of the graph.
• Compute the input and output degrees of each sink node, indicating how information flows into and out of a query.
• A correct spelling candidate is a sink node with far more flow into it than out of it.
[Diagram: bipartite graph with source nodes q1 to q6 and sink nodes q1' and q2']
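The offline steps above can be sketched roughly as follows. This is a minimal illustration, not eBay's implementation: the function names, the use of transition counts as "flow", and the `min_ratio` cutoff are all assumptions, and the anchor and transitivity steps are left out.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_candidate(q1, q2, ratio=0.4):
    """Successive queries qualify if edit distance <= 40% of average length."""
    avg_len = (len(q1) + len(q2)) / 2
    return 0 < edit_distance(q1, q2) <= ratio * avg_len


def likely_corrections(transitions, min_ratio=3.0):
    """transitions: iterable of (source_query, sink_query, count).

    Returns sink queries whose inflow is at least min_ratio times their
    outflow (min_ratio is a hypothetical cutoff, not from the talk).
    """
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for source, sink, count in transitions:
        if is_candidate(source, sink):
            outflow[source] += count
            inflow[sink] += count
    return {q for q in inflow if inflow[q] >= min_ratio * max(outflow[q], 1)}


# Made-up transition counts for illustration only.
sample = [("jetsky", "jetski", 40), ("jet sky", "jetski", 25), ("jetski", "jetsky", 2)]
corrections = likely_corrections(sample)
```

With these made-up counts, "jetski" receives far more flow than it sends out, so it is the only query kept as a correct-spelling candidate.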
24. Spell Check – Online
[Flowchart. Recoverable box labels: query; Tokenize to tokens; In the white list?; (wi-2, wi-1, wi); Search in dictionary; Found a match; N-Gram Index; Edit distance, phonetics; A list of candidates; Calculate contextual possibility; Obtain entropy; Obtain cosine similarity; Priority Queue; Last? (No, go to next / Yes, get the best); Result]
26. Acronyms
Ÿ Expand User Queries
– Increase recall without sacrificing precision
– Better deals for buyers
Ÿ Examples
BAPE 2,540 results
OR(Bathing Ape, Bape) 2987 results
Rescue Project 26
27. Mining Acronyms From Query Reformulations
Ÿ Learn from user behavioral data
Ÿ Example
– UCB Sweatshirt [category: CSA] > University of California Berkeley Sweatshirt [category: CSA]
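One way to picture mining from query reformulations is the sketch below. It assumes a pair of successive queries, strips the tokens they share at either end, and checks whether the one remaining token spells the initials of the remaining span. The function name and the stop-word list are hypothetical, and the talk does not state that the production system works this way.

```python
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def acronym_from_reformulation(before, after):
    """Return (acronym, expansion) if `after` expands one token of `before`."""
    a, b = before.lower().split(), after.lower().split()
    # Drop shared leading and trailing tokens (the unchanged context).
    while a and b and a[0] == b[0]:
        a, b = a[1:], b[1:]
    while a and b and a[-1] == b[-1]:
        a, b = a[:-1], b[:-1]
    if len(a) == 1 and len(b) > 1:
        initials = "".join(w[0] for w in b if w not in STOP_WORDS)
        if initials == a[0]:
            return a[0], " ".join(b)
    return None


pair = acronym_from_reformulation(
    "UCB Sweatshirt", "University of California Berkeley Sweatshirt")
```

Abbreviations that are not initialisms, such as "uke" for "ukulele", are not caught by this check and would need a different matcher.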
28. Acronym Context & Specificity
Ÿ Need to express context sensitive expansions
– Categorical
○ ATC > Armored Troop Carrier in Toys and Hobbies
○ ATC > Artist trading card in ART
○ ATC > Automatic Tool Change in Business and Industrial
– Directional
○ Old > Antique
○ Yoga towels/mats > Yogitoes
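A small sketch of how category-scoped and directional expansions could be stored and applied. The table layout, the function names, and the `OR(...)` output string (modeled on the query notation used elsewhere in these slides) are assumptions for illustration only.

```python
# (term, category) -> expansion; applies only inside the named category.
CATEGORY_EXPANSIONS = {
    ("atc", "Toys and Hobbies"): "armored troop carrier",
    ("atc", "Art"): "artist trading card",
    ("atc", "Business and Industrial"): "automatic tool change",
}

# Directional: "old" expands to "antique", but "antique" is never expanded to "old".
DIRECTIONAL_EXPANSIONS = {"old": "antique"}


def expand_token(token, category):
    """Return the token, or an OR(...) group if an expansion applies."""
    t = token.lower()
    alt = CATEGORY_EXPANSIONS.get((t, category)) or DIRECTIONAL_EXPANSIONS.get(t)
    return f'OR({t}, "{alt}")' if alt else t


def rewrite(query, category):
    return " ".join(expand_token(tok, category) for tok in query.split())


art_query = rewrite("ATC binder", "Art")
```

Keying the table on the (term, category) pair keeps "atc" from expanding at all in a category where no expansion has been validated.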
29. Acronym/Abbreviation Mining | Category Based Expansions
Acronym/Abbreviation Mining
• Acronyms/abbreviations mined from raw text and query logs
• Look for patterns of text
– long form (short form)
– short form (long form)
• Employ intelligent matching algorithms to mine candidates
• Example title: new cheap Playstation portable (PSP)
• Acronym discovered: PSP -> PlayStation Portable
• Candidates mined are fed through a machine learning classifier to remove the false positives
Category Based Expansions
• hp > Hewlett Packard in Electronics
• hp > horsepower in Cars and Trucks
System allows
• Category based expansions
• Directional expansions
• Positive and negative expansions
32. Talk Overview
Ÿ Motivation
– Introduction of the Acronym mining problem.
Ÿ Related Work
– Algorithm overview.
Ÿ eBay Acronym Mining algorithm.
– Architecture.
– Algorithm overview.
Ÿ Results.
Ÿ Conclusions.
33. Motivation
• User queries are an incomplete representation of users' information needs
– Spelling mistakes
○ Jetsky instead of Jetski
– Synonyms are not considered
○ PS3 and PlayStation 3 ( Acronym, topic of talk)
○ JetSki and Personal Watercraft
– Users are not experts in search engine technology
○ Example: Anniversary gifts for men
eBay, Inc. 33
34. Need for Query Rewrites
JetSky 2 results
Spelling Correction
JetSki 23782 results
Synonym Expansion
OR( Jetski, Personal WaterCraft) 24151 results
36. Where can we find Acronyms?
From Item Titles/Descriptions
– Grand Theft Auto III (GTA 3) (PlayStation 2, 2001)
– Grand Theft Auto IV (GTA 4) PS3 mint condition
– Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!
– COLD LASER. Low Level Laser Therapy(LLLT) + Acupuncture
From Query Reformulations, i.e. how users change their queries
– New Uke > New Ukulele
38. Schwartz et al: Greedy Match Algorithm
Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!
39. Identifying Abbreviation Definitions in Biomedical Text.
Ÿ Mining for patterns
– long form ( short form)
– short form ( long form)
– The long form is no more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form.
– Roche et al. propose that the number be less than |A| * 3.
Ÿ The characters in the short form should match the long
form in the same order and the first character in the
short form should be at the beginning of a word.
Ÿ Example:
– PS3 -> PlayStation 3
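The matching rule above can be sketched in Python as follows. `find_best_long_form` follows the right-to-left matching described by Schwartz and Hearst. `extract_pairs`, its regular expression, and the handling of only the "long form (short form)" pattern are simplifying assumptions made for this illustration.

```python
import re


def find_best_long_form(short_form, long_form):
    """Match short-form characters right to left inside the candidate text.

    The first short-form character must sit at the start of a word.
    Returns the matched long form, or None if no match exists.
    """
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        c = short_form[s_index].lower()
        if not c.isalnum():
            continue
        while (l_index >= 0 and long_form[l_index].lower() != c) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Back up to the start of the word that holds the first matched character.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


def extract_pairs(text):
    """Find 'long form (SHORT)' patterns and return (short, long) pairs."""
    pairs = []
    for m in re.finditer(r"\(([A-Za-z0-9]{2,10})\)", text):
        short = m.group(1)
        max_words = min(len(short) + 5, len(short) * 2)
        window = " ".join(text[:m.start()].split()[-max_words:])
        long_form = find_best_long_form(short, window)
        if long_form:
            pairs.append((short, long_form))
    return pairs


ps3 = extract_pairs("Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!")
aaa = extract_pairs("new American Automobile Association (AAA) map of mexico")
```

On the first title this sketch returns "PlayStation 3" for PS3. On the second it returns only "Automobile Association" for AAA, because the greedy match stops as soon as every character has been consumed.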
40. Schwartz et al
Ÿ Pros:
– Finds almost all abbreviations and acronyms
Ÿ Cons:
– High False positive rate.
○ Foot Massage Diabetes Treatment (FEET)
– Suffers from the truncated long form problem.
○ Example: American Automobile Association (AAA)
41. Acronym-Expansion Recognition and Ranking on the Web
Ÿ First few characters match
Ÿ Ignore Stop words
Ÿ Example:
– COOL -> Cooperation in Ontology and Linguistics.
Alpa Jain, Silviu Cucerzan, Saliha Azzam. Acronym-Expansion
Recognition and Ranking on the Web.
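The two rules on this slide can be read as the check sketched below: after dropping stop words, every remaining word must contribute its first one to three characters, in order, and together these must spell the acronym. This is one plausible reading for illustration only, not the published algorithm. The function name, the stop-word list, and the `max_prefix` limit are assumptions.

```python
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def first_chars_match(acronym, expansion, max_prefix=3):
    """True if the acronym is built from word prefixes of the expansion."""
    acr = acronym.lower()
    words = [w.lower() for w in expansion.split() if w.lower() not in STOP_WORDS]

    def match(i, j):
        # i: acronym characters consumed so far, j: words consumed so far.
        if j == len(words):
            return i == len(acr)
        for k in range(1, max_prefix + 1):
            prefix = acr[i:i + k]
            if len(prefix) == k and words[j].startswith(prefix) and match(i + k, j + 1):
                return True
        return False

    return bool(words) and match(0, 0)


cool_ok = first_chars_match("COOL", "Cooperation in Ontology and Linguistics")
ps3_ok = first_chars_match("PS3", "PlayStation 3")
```

Under this reading COOL is accepted (CO + O + L), while PS3 is rejected because the "S" falls in the middle of "PlayStation".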
42. Jain et al
Ÿ Pros:
– Low false positive rate
Ÿ Cons:
– Does not do a good job at identifying abbreviations
– Misses out on a lot of actual acronyms
○ Will not find PlayStation 3 and PS3 association.
43. eBay Acronym Mining Architecture
[Architecture diagram. Recoverable component labels: User Data, Candidate Generator, Feature Extractor, Classifier, Dictionary, Human Judgment, A/B Test, Live on Site]
44. eBay Acronym/Abbreviation Mining Algorithm
Ÿ Desirable Properties
– Find all abbreviations and acronyms, like the greedy match
– Reduce the amount of false positives
– Solve the truncated long form problem.
Ÿ What makes a good acronym – expansion pair?
– Characters in the acronym are found at the beginning of the words.
– Expansions generally do not have words that are skipped or not
represented in the acronym.
– Can a cost metric capture the intuition ?
45. Cost Based Approach for Mining Abbreviations
CIM ------- Computer Interface Module
Total Cost: Low cost
PVC ------- PolyVinyl Chloride
Total Cost: medium cost
HSF –-- Heat shock transcription factor
Total Cost: High Cost
46. Cost Based Recursive Algorithm
Title: new American Automobile Association (AAA) map of
mexico
Objective: Find the longest form with the lowest cost
American Automobile Association (AAA)
Min ( American Automobile Associ (AA) , American Automobile Associ (AAA) )
+
Cost so far
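The slides do not give the actual cost function, so the following is only a sketch of how a cost-based recursive match might look. It assumes a penalty of 1 for each acronym character matched inside a word and 2 for each word that contributes no character. Both weights, and all function names, are hypothetical.

```python
import math
from functools import lru_cache

INNER_COST = 1  # assumed penalty: acronym character matched inside a word
SKIP_COST = 2   # assumed penalty: word not represented in the acronym


def is_subsequence(chars, text):
    it = iter(text)
    return all(ch in it for ch in chars)


def word_cost(chars, word):
    """Cost of matching `chars` inside one word, or None if impossible."""
    if not chars:
        return SKIP_COST
    if chars[0] == word[0] and is_subsequence(chars[1:], word[1:]):
        return INNER_COST * (len(chars) - 1)
    if is_subsequence(chars, word[1:]):
        return INNER_COST * len(chars)
    return None


def alignment_cost(acronym, words):
    """Lowest cost of spreading the acronym's characters over the words."""
    acr = acronym.lower()
    ws = tuple(w.lower() for w in words)
    if not acr or not ws or ws[0][0] != acr[0]:
        return math.inf  # the long form must start with the first character

    @lru_cache(maxsize=None)
    def best(i, w):
        if w == len(ws):
            return 0 if i == len(acr) else math.inf
        lowest = math.inf
        first_k = 1 if w == 0 else 0  # the first word cannot be skipped
        for k in range(first_k, len(acr) - i + 1):
            cost = word_cost(acr[i:i + k], ws[w])
            if cost is not None:
                lowest = min(lowest, cost + best(i + k, w + 1))
        return lowest

    return best(0, 0)


def mine_long_form(acronym, preceding_words, threshold):
    """Longest candidate with the lowest cost, or None if above threshold."""
    best = None
    for start in range(len(preceding_words)):  # longest candidate first
        candidate = preceding_words[start:]
        cost = alignment_cost(acronym, candidate)
        if cost <= threshold and (best is None or cost < best[0]):
            best = (cost, " ".join(candidate))
    return best


aaa = mine_long_form("AAA", "new American Automobile Association".split(), 1)
```

Under these assumed weights the three examples from the previous slide come out in the same order as on the slide: CIM costs 0, PVC costs 1, and HSF costs 2. The AAA title yields the full long form rather than the truncated one.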
47. Salient Properties of the new algorithm
• If Cost > Threshold, then the long form is a false positive.
• As the cost threshold increases
– False positives increase
– The chance that a real acronym is not identified decreases
• As the cost threshold decreases
– False positives decrease
– The chance that a real acronym is not identified increases
• At low thresholds, the algorithm behaves like the first few characters match.
• At high thresholds, the algorithm behaves like the greedy match algorithm.
48. Experiments
Sample Dataset: 2.5 million item titles
Algorithm                    Total Candidates   False Positive Rate   Yield
Greedy Match                 2548               39%                   1554
First Few Characters Match   759                4%                    728
Cost Based Match, k1         1223               14%                   1051
Cost Based Match, k2         1604               16%                   1284
Cost Based Match, k3         2023               20%                   1554
49. Removing false positives
• Goal
– Develop a classification algorithm that will classify whether a candidate is an acronym or not.
• Classification algorithm
– Decision trees
○ TreeNet data mining tool.
• Candidates are tagged with many features.
• The classifier learns on the tagged golden set.
• New candidates are then run through the classifier.
50. Example of a Decision Tree
Training Data
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes: NO
Refund = No:
  MarSt = Married: NO
  MarSt = Single, Divorced:
    TaxInc < 80K: NO
    TaxInc > 80K: YES
Acknowledgements: George Kollios, gkollios@cs.bu.edu
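As a quick check, the tree on this slide can be written out as nested conditions and run over the ten training rows. The function and variable names are made up for this sketch, incomes are in thousands, and the "No" label for row 3 is an assumption because it is not clearly attached to that row in the extracted slide.

```python
# (tid, refund, marital_status, taxable_income_in_thousands, cheat)
TRAINING_DATA = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),   # label assumed; unclear in the extracted slide
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def predict_cheat(refund, marital_status, taxable_income_k):
    """Walk the tree: Refund, then marital status, then taxable income."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if taxable_income_k > 80 else "No"


misclassified = [tid for tid, refund, status, income, label in TRAINING_DATA
                 if predict_cheat(refund, status, income) != label]
```

With the labels as listed here, the tree reproduces every training row. In the acronym setting the same kind of tree would split on candidate features instead of Refund, MarSt, and TaxInc.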
51. Features: Neighborhood Similarity
• Rationale: Two synonym candidates A and B will tend to have similar neighbors (viz. keywords) surrounding them.
Neighborhood similarity(A, B) = |Neighbours(A) ∩ Neighbours(B)| / min(|Neighbours(A)|, |Neighbours(B)|)
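A direct transcription of this ratio into Python, treating each neighborhood as a set of keywords. The function name and the sample keywords are made up for illustration.

```python
def neighborhood_similarity(neighbors_a, neighbors_b):
    """Shared neighbors divided by the size of the smaller neighborhood."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a or not b:
        return 0.0  # no evidence when either candidate has no neighbors
    return len(a & b) / min(len(a), len(b))


similarity = neighborhood_similarity(
    {"case", "charger", "nano", "8gb"},
    {"case", "charger", "touch"},
)
```

Dividing by the smaller set keeps a rare candidate from being penalized just because it appears in fewer contexts than a popular one.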
52. Features: Mutual Information
• Rationale: The goal of this metric is to determine whether the candidates co-occur in descriptions significantly more often than they would by random chance.
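The formula from this slide did not survive extraction. The sketch below therefore assumes pointwise mutual information over description counts, which is one common way to express this idea and may differ from what was actually used. All names and counts are hypothetical.

```python
import math


def pointwise_mutual_information(count_a, count_b, count_ab, total_docs):
    """log2 of observed co-occurrence over the rate expected by chance.

    Counts are numbers of descriptions containing A, B, and both.
    """
    if count_ab == 0:
        return float("-inf")
    p_a = count_a / total_docs
    p_b = count_b / total_docs
    p_ab = count_ab / total_docs
    return math.log2(p_ab / (p_a * p_b))


# Made-up counts: both terms appear in 100 of 10,000 descriptions, 50 together.
score = pointwise_mutual_information(100, 100, 50, 10_000)
```

A score well above zero means the pair co-occurs more often than independent terms would, and a score near zero means the co-occurrence is about what chance predicts.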
53. Features: KL divergence
Ÿ Rationale: Two synonym candidates will have similar
category distributions of their inventory.
54. KL distance: Example
Similar category distributions
– Ipods: Electronics (50), Clothing Shoes and Accessories (1)
– Ipod: Electronics (100), Clothing Shoes and Accessories (3)
– KL divergence: 0.83
Dissimilar category distributions
– Ipod: Electronics (100), Clothing Shoes and Accessories (3)
– T-shirt: Clothing Shoes and Accessories (1000), Uniforms (50)
– KL divergence: 128592.74
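A sketch of the textbook KL divergence over per-category inventory counts, with additive smoothing so that categories missing on one side do not cause a division by zero. It does not try to reproduce the two values on the slide, which depend on a formula and smoothing that are not given there. The function name and `epsilon` are assumptions.

```python
import math


def kl_divergence(counts_p, counts_q, epsilon=1e-6):
    """KL(P || Q) between two category-count dictionaries."""
    categories = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + epsilon * len(categories)
    total_q = sum(counts_q.values()) + epsilon * len(categories)
    divergence = 0.0
    for category in categories:
        p = (counts_p.get(category, 0) + epsilon) / total_p
        q = (counts_q.get(category, 0) + epsilon) / total_q
        divergence += p * math.log(p / q)
    return divergence


ipod = {"Electronics": 100, "Clothing Shoes and Accessories": 3}
ipods = {"Electronics": 50, "Clothing Shoes and Accessories": 1}
tshirt = {"Clothing Shoes and Accessories": 1000, "Uniforms": 50}

near = kl_divergence(ipod, ipods)
far = kl_divergence(ipod, tshirt)
```

The ordering matches the slide's point: the Ipod and Ipods distributions are close, so their divergence is much smaller than the divergence between Ipod and T-shirt.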
56. Classifier Results
• The false positive rate at the candidate generation stage is 20%.
• The false positive rate after going through the classifier is 5.5%.
• The remaining false positives are removed by human judges.
57. Conclusions
Ÿ We presented the state of the art algorithms for acronym
mining and their limitations.
Ÿ We presented a new cost based algorithm for mining
acronyms from raw text that seeks to address the limitations
of the previous algorithms.
Ÿ We presented a classifier approach to remove false
positives.
• We experimentally validated our approach and showed that it is a viable approach for mining acronyms.
58. References
• [1] Ariel S. Schwartz, Marti A. Hearst. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text.
• [2] Youngja Park, Roy J. Byrd. Hybrid Text Mining for Finding Abbreviations and Their Definitions.
• [3] Mathieu Roche, Violaine Prince. Managing the Acronym/Expansion Identification Process for Text-Mining Applications.
59. References(2)
• [4] Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee. Efficient Web-Based Linkage of Short to Long Forms.
• [5] Alpa Jain, Silviu Cucerzan, Saliha Azzam. Acronym-Expansion Recognition and Ranking on the Web.
• [6] Xiaonan Ji, Gu Xu, James Bailey, Hang Li. Mining, Ranking and Using Acronym Patterns.