This document discusses how Bundle.com uses text analytics on their "big legacy data" of billions of credit card transactions to associate transactions with specific merchants. It outlines their challenges of dealing with high volumes of variable text data and no centralized merchant identifiers. Their solution was to parallelize the process by geographic neighborhoods to reduce the scale, use text clustering and de-duplication to generate preliminary merchant listings, and then reconcile those with a "merchant source of truth" database. This cascade of scale reductions and computational efficiencies improved their processing time from eons to just minutes.
New insights from big legacy data at bundle (Presented at Text Analytics World 2011)
1. New Insights from ‘Big Legacy Data’:
The Role of Text Analytics at Bundle.com
Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
Alex Hasha, Chief Data Scientist, Bundle.com
October 2011
Architects of Fact-Based Decisions™
2. Agenda for Today’s Talk
1. Introduction to the Business Model
2. The Role of Text Analytics
3. A Key Challenge and How we Overcame It
4. Takeaways
5. Q&A
3. Introduction
Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)
Responsible for:
• Transforming data into value for clients
• Creating meaningful careers for employees
At a company that:
• Helps clients convert Data to Dollars™
• Brings a strategic perspective to improve ROI on investments in technology, data, people, and processes
Also working on:
• A movement to Democratize Analytics by Reducing the “Barrier to Benefit” for non-profits, social entrepreneurs, and gov’t

Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)
Responsible for:
• Leading development of data products
• Designing statistical methods / algorithms that transform data into insights for consumers
At a company that:
• Uses data to help consumers make better decisions with their money
• Bends valuable legacy data to new purposes
• Is growing and hiring!
Also working on:
• Learning about and implementing best practices for managing complex data pipelines
4. For Example, We Help You Decide Where to Spend…
5. We Do This with Billions of Real Spending Records
Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
[Figure: Example: Credit Card Statement Data]
Key Issues with this Data:
1. Credit card data lacks a merchant identifier
2. So we rely on text analytics to associate transactions with merchants
6. A Business Model “Built Out Of Data”
Old Data:
• Card Transaction Data
• Merchant Listings (e.g., Address, Phone Number, Business Type)
• Other Data: Census, Bureau of Labor Statistics, User Feedback
Transformed in New Ways:
• Normalization
• Clustering
• Linking
• Aggregation
To Create New Features Such As…
• People Who Shop Here Also Like…
• The Bundle Loyalty Score
• Data-Driven Reviews From an Array of Customer Segments
7. The Benefit is to Provide More Accurate, Less Biased Content
8. Before the Fun Stuff Happens…
Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list.
Credit Card Transactions (Billions, ~10^9):
• Highly variable text descriptions
• Noisy geographic info
• Noisy merchant category info
Text Matching links these to a Comprehensive Listing of US Merchants (Tens of Millions, ~10^7).
Two main problems:
1. Accurate Fuzzy Matching is Difficult
2. Scale of Data is Enormous
This case focuses on the second problem.
Naïve item-by-item search takes O(10^16) expensive string comparisons: Too Slow!
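The naive cost quoted above is simply the product of the two data-set sizes. A minimal sketch of that arithmetic follows; the throughput figure `comparisons_per_second` is a hypothetical value chosen only to give a sense of scale and does not come from the talk.

```python
# Order-of-magnitude figures quoted on the slide.
n_transactions = 10**9   # billions of credit card transactions
n_merchants = 10**7      # tens of millions of US merchant listings

# Naive matching compares every transaction against every merchant.
naive_comparisons = n_transactions * n_merchants
print(f"{naive_comparisons:.0e}")  # 1e+16

# Hypothetical throughput (an assumption, not a number from the talk).
comparisons_per_second = 10**6
seconds_per_year = 365 * 24 * 3600
years = naive_comparisons / comparisons_per_second / seconds_per_year
print(round(years))  # about 317 years on a single worker at that rate
```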
9. A “Brute Force” Approach Would Never Work…
[Chart: Processing Time / Workload (0 to 1) vs. # of Merchants in Comparison Set. Neighborhood: Hundreds; City: Hundreds of Thousands; Nation: Tens of Millions]
1. Matching within Hundreds of Millions of Merchants would require massive processing… Fortunately, we don’t need to match at this level.
2. Batching at the local-area level lets us process orders of magnitude faster.
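Why batching helps can be shown with a rough count of pairwise comparisons. The sketch below assumes transactions and merchants divide evenly across batches, which real neighborhoods will not do; the batch count of 10,000 is the figure that appears in the pipeline diagram elsewhere in this deck, and the function name is hypothetical.

```python
def batched_comparisons(n_transactions: int, n_merchants: int, n_batches: int) -> int:
    """Pairwise comparisons when both sides are split evenly into batches
    and each transaction is compared only with merchants in its own batch."""
    per_batch = (n_transactions // n_batches) * (n_merchants // n_batches)
    return per_batch * n_batches

nation = batched_comparisons(10**9, 10**7, 1)        # one national comparison set
local = batched_comparisons(10**9, 10**7, 10_000)    # 10,000 local comparison sets
speedup = nation // local

print(nation, local, speedup)
```

Under the even-split assumption, the total work falls by a factor equal to the number of batches, and each batch can also run independently of the others.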
10. Solution to Scaling Problem
This is a “Cascade of Scale Reductions”, Parallelizing by Location
Pipeline:
Credit Card Transactions (Billions, ~10^9)
-> Batch Transactions by Geographic Neighborhood (batches 1, 2, …, 10000)
-> Dedupe Description Strings
-> Text Clustering (Not Matching): Consolidate Strings Belonging to Same Merchant
-> Preliminary Merchant Listing Generated Directly from Transactions (Tens of Millions, ~10^7)
-> Secondary Fuzzy Matching Process Reconciles Preliminary Listings with Merchant “Source of Truth”
-> Final Merged Transaction Data Set
Keys to solving the scaling problem:
1. Scale Reduction / Parallelized Text Clustering
2. Free Open Source Software
Computational Efficiency Increased by a Factor of 10^8! Eons -> Days -> Minutes
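The stages on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Bundle's implementation: the talk does not name its clustering algorithm or tools, so the greedy `difflib`-based clustering, the crude `normalize` rule, the zip-code-style neighborhood keys, the similarity thresholds, and the sample data are all hypothetical stand-ins.

```python
from collections import defaultdict
from difflib import SequenceMatcher, get_close_matches

def normalize(description: str) -> str:
    """Lowercase and keep only letters and spaces (a deliberately crude rule)."""
    kept = "".join(ch if ch.isalpha() or ch.isspace() else " "
                   for ch in description.lower())
    return " ".join(kept.split())

def cluster(strings, threshold=0.8):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first element is the representative
    for s in sorted(strings):
        for members in clusters:
            if SequenceMatcher(None, s, members[0]).ratio() >= threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def preliminary_listings(transactions):
    """transactions: iterable of (neighborhood, description) pairs."""
    by_area = defaultdict(set)  # batching and de-duplication in one step
    for area, description in transactions:
        by_area[area].add(normalize(description))
    # Each area is independent, so this loop could be run in parallel.
    return {area: cluster(strings) for area, strings in by_area.items()}

def reconcile(representative, master_names, cutoff=0.6):
    """Match one cluster representative against an area's master merchant names."""
    matches = get_close_matches(representative, master_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Made-up sample data for illustration only.
transactions = [
    ("10001", "JOES PIZZA #123 NEW YORK NY"),
    ("10001", "JOES PIZZA #124 NEW YORK NY"),
    ("10001", "JOES PIZZA NEW YORK"),
    ("10001", "ACME HARDWARE 0042"),
    ("94103", "BLUE BOTTLE COFFEE SF"),
]
listings = preliminary_listings(transactions)
master = ["joes pizza", "acme hardware store"]  # hypothetical "source of truth"
for members in listings["10001"]:
    print(members, "->", reconcile(members[0], master))
```

The point of the ordering is that the expensive fuzzy match against the master list runs once per cluster within one neighborhood, not once per transaction against the whole country.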
11. Takeaways
1. Tame your data before perfecting your methods. Efficiency enables experimentation, iteration, improvement.
2. Design your process to minimize unnecessary complexity (e.g. Parallel Processing at Scale, Normalization, Pre-Filtering)
3. Tools: Take advantage of powerful (and inexpensive) open-source tools that enable your process...