4. StubHub is about…
World's largest ticketing marketplace
10M active listings
We enable “access” to events
We want to be more!
5. Some Fun Facts about StubHub!
Ø An eBay-owned company
Ø Over 25 million users and growing
Ø We sell one ticket per second
Ø ~8.5 million page views a day, on average
Ø ~3 million additional page views per day on mobile devices
Ø ~10M tickets for sale in sports, concerts, and other categories
Ø ~1 TB of data processed monthly by the analytics infrastructure. This number will go up significantly as we bring in data from many of the unstructured data sources
Ø ~300 Million SQL executions/day
7. Agenda
Ø Use case
Ø Challenges
Ø Legacy solution
Ø Our approach
Ø Results
8. Use Case : Content Ingestion
Ø Sources: Feed-1, Feed-2, Feed-3, Feed-n, Form
Ø Input record
Ø Pre-deduplication
² Normalize
² Filtering
² Classification
² Geocode
Ø Deduplication
Ø Post-deduplication
² Review
² Insert
² Update
² Discard
Ø Event DB
9. Challenges : Deduplication
Ø Problem space
² Event catalog
Ø Performance considerations
² Real-time processing
² Batch processing
Ø Speed and data quality
10. Legacy Solution : Deduplication Flow
Ø Clients: Feed Ingestor, Batch Job, UGC
Ø Participants: Client, DeduplicationModule, Event DB
Ø Sequence
² 1: getDuplicates()
² 2: getSubsetByLocation()
² 3: loop (for each document, for each field): Normalize, Filter, Compute Score
² 4: DuplicateList
² 5: upsert()
11. Approach : Problem Model
Ø Milpitas Library vs Milpitas Public Library
Ø 1601 E 7th St vs 1601 E. Seventh St.
Ø Pick the right algorithm: edit distance, Jaccard
Ø Venue deduplication
² Subset - Geospatial distance (e.g. Milpitas Library: 160 N. Main St; 40 N. Milpitas Blvd.; distance: ~0.5 mi)
² Subset - Text similarity on categories (Library, Restaurant, etc.)
² Dup detection - Name, address, etc.
² Boost - e.g. venue name, street number
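The two similarity measures named on this slide can be sketched in plain Java. This is a minimal illustration, not the code behind the talk: the class name TextSimilarity, the tokenization rule (lowercase, strip punctuation, split on whitespace), and the choice of token-level Jaccard are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; names and tokenization are assumptions for illustration.
final class TextSimilarity {

    /** Lowercases, replaces punctuation with spaces, and splits on whitespace. */
    static Set<String> tokens(String s) {
        String cleaned = s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ").trim();
        if (cleaned.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    /** Jaccard similarity over token sets: |A ∩ B| / |A ∪ B|. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(jaccard("Milpitas Library", "Milpitas Public Library"));
        System.out.println(jaccard("1601 E 7th St", "1601 E. Seventh St."));
        System.out.println(editDistance("Theatre", "Theater"));
    }
}
```

Under these assumptions, token-level Jaccard tolerates an extra word such as "Public", while edit distance is better suited to spelling variants such as "Theatre" vs "Theater". Neither handles "7th" vs "Seventh" without a separate normalization step.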
13. Approach : Deduplication Service
public interface DeduplicationService<T> {
    /**
     * Checks for duplicates of an entity and returns a DeduplicationResponse containing
     * information about the duplicates found. For each possible duplicate, there is a
     * justification as to why it is a duplicate.
     *
     * @param t entity for which duplicates need to be found.
     * @param options options used to find and filter the results.
     * @return a non-null instance of DeduplicationResponse.
     * @throws DeduplicationConnectivityException if there was an issue connecting to the
     *         dedupe data store.
     */
    DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}
14. Approach : Deduplication Service
@Component(value = "VenueDeduplicationService")
public class VenueDeduplicationService
        implements DeduplicationService<Venue> {
    @Override
    public DeduplicationResponse<Venue> findDuplicates(Venue venue, DedupeOptions options)
            throws DeduplicationConnectivityException {
    }
}
@Component(value = "EventDeduplicationService")
public class EventDeduplicationService
        implements DeduplicationService<Event> {
    @Override
    public DeduplicationResponse<Event> findDuplicates(Event event, DedupeOptions options)
            throws DeduplicationConnectivityException {
    }
}
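The method bodies are left empty on the slide. To show how the contract might be exercised, here is a minimal, self-contained sketch with a toy in-memory implementation. Everything other than the interface signature is an assumption: the members of DedupeOptions, DeduplicationResponse, and the Duplicate class are hypothetical stand-ins, and the real services query a data store instead of a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins: the slides name these types but do not show their members.
class DeduplicationConnectivityException extends Exception {
    DeduplicationConnectivityException(String message) { super(message); }
}

class DedupeOptions {
    final int maxResults; // assumed option, not taken from the slides
    DedupeOptions(int maxResults) { this.maxResults = maxResults; }
}

class Duplicate<T> {
    final T entity;
    final String justification; // why this entity is considered a duplicate
    Duplicate(T entity, String justification) {
        this.entity = entity;
        this.justification = justification;
    }
}

class DeduplicationResponse<T> {
    final List<Duplicate<T>> duplicates = new ArrayList<>();
}

interface DeduplicationService<T> {
    DeduplicationResponse<T> findDuplicates(T t, DedupeOptions options)
            throws DeduplicationConnectivityException;
}

/** Toy service: two names are duplicates when their normalized forms are equal. */
class InMemoryNameDeduplicationService implements DeduplicationService<String> {
    private final List<String> known;

    InMemoryNameDeduplicationService(List<String> known) { this.known = known; }

    /** Lowercases, replaces punctuation with spaces, and collapses whitespace. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // No 'throws' clause: an in-memory list cannot have connectivity problems.
    @Override
    public DeduplicationResponse<String> findDuplicates(String name, DedupeOptions options) {
        DeduplicationResponse<String> response = new DeduplicationResponse<>();
        String key = normalize(name);
        for (String candidate : known) {
            if (response.duplicates.size() >= options.maxResults) {
                break;
            }
            if (normalize(candidate).equals(key)) {
                response.duplicates.add(new Duplicate<>(candidate, "normalized names are equal"));
            }
        }
        return response; // never null, matching the documented contract
    }
}
```

A caller would always receive a non-null response and can inspect each entry's justification before deciding whether to insert, update, or discard the incoming record.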
15. Approach : Optimizations
Ø How to keep the score consistent?
² <similarity class="TfSimilarity"/>
Ø Auto commit settings
² <autoSoftCommit><maxTime>5</maxTime></autoSoftCommit>
Ø Custom PostFilter
² <queryParser name="fdist" class="DistanceQParserPlugin"/>
Ø Custom update handler
² <processor class="VenueUpdateProcessorFactory"></processor>
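The slides do not show the body of the custom post-filter. Judging by its name, it presumably drops text-matched candidates that lie too far from the input venue. The plain-Java sketch below illustrates that idea with the haversine formula. It is not Solr plugin code, and the class DistancePostFilter, the Candidate type, and the kilometre threshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a distance-based post-filter; not the Solr plugin itself.
final class DistancePostFilter {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    /** A text-matched candidate with its coordinates and relevance score. */
    static final class Candidate {
        final String name;
        final double lat;
        final double lon;
        final double score;

        Candidate(String name, double lat, double lon, double score) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
            this.score = score;
        }
    }

    /** Great-circle distance in kilometres between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Keeps only the candidates within maxKm of the given point, preserving order. */
    static List<Candidate> withinKm(List<Candidate> candidates, double lat, double lon, double maxKm) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (haversineKm(lat, lon, c.lat, c.lon) <= maxKm) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

Running the cheap text match first and the distance check afterwards means the geometry is computed only for the few documents that already matched on name or address.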
16. Results : Sample Output
Input Venue | Matched Venue | Score | Distance
Jillian's Billiards Club, 101 Fourth St. | Jillian's, 175 4th St. | 1.5573 | 5.6352
Lush Lounge, 1092 Post St. | Lush Lounge, 1221 Polk St. | 12.9836 | 16.6501
Mountain Theatre, 10 Panoramic Hwy. | Mountain Theater, Nearby E Ridgecrest Boulevard and Pantoll Road | 3.2509 | 5.8913
17. Results : Sample Output
Input Venue | Matched Venue | Score | Distance
The Hedley Club at Hotel DeAnza, 233 W. Santa Clara St. | Hedley Club, 233 W. Santa Clara St. | 5.0805 | 0.0000
Sonya Paz Fine Art Gallery, 1793 Lafayette St. | Sonya Paz Gallery and Studio, 1793 Lafayette St. Suite 110 | 6.6764 | 0.0069
Pearl Avenue Library Community Room, 4270 Pearl Ave. | Pearl Avenue Branch Library, 4270 Pearl Ave. | 5.7024 | 0.0000
Milpitas Library, 160 N. Main St. | Milpitas Library, 40 N. Milpitas Blvd. | 16.4318 | 0.7284
18. Summary
Ø Use case
² Content ingestion
Ø Challenges
² Deduplication
Ø Legacy solution
Ø Our approach
² Used SOLR for text similarity
² Extended default behavior
² REST endpoint over SOLR interface
Ø Next steps
² Big data
² Performer matching
² I18n
Ø Results