From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year
Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
Alex Hasha, Chief Data Scientist, Bundle.com
May 1, 2012
Architects of Fact-Based Decisions™
Jaime intro.
Alex intro: Thanks, Jaime. Since Jaime has already introduced me, I'll introduce Bundle. Bundle is a company that uses data to help consumers make better decisions with their money. We do this on the one hand by providing free tools for managing personal financial data. But more to the point of today's talk, we are also mining mountains of credit card transaction data to extract actionable insights for consumers based on the spending behavior of their peers.
First to provide local merchant profiles for consumers that are deeply data-driven.
- Local search business (Yelp, Citysearch, FourSquare, Google, Bing)
- The percentage of local searches on mobile devices is growing very fast
- Fast-growing sector in data-driven startups
- Example: Ted's Montana Grill
- Bundle addresses issues with other sites:
  - Selection bias (strong opinions are over-represented)
  - System gaming (just like SEO; interesting story about "reputation management" companies!)
  - Explicit rankings (rank by the actual metrics!)
Alex: So where does text analytics come into this? As you might imagine, bending old data to a new purpose is fraught with difficulties, because the dataset was designed with different applications in mind. A key problem we faced with our credit card transaction database was that the transaction records lack a merchant identifier. The database's primary purpose is interacting with card holders and generating statements, and not surprisingly it's formatted very much like an enormous credit card statement. The merchant name is embedded in a text field, which also contains other information. It's semi-structured, but lacks a consistent format.

Clearly, to unlock insights about merchants from this data, we have to associate the transactions with merchants using this text field, so text analytics is absolutely crucial to our business.

AH: Just some background here: in the credit card industry there are "acquiring banks," which deal with merchants and process their credit card transactions over various payment networks, and "issuing banks," which issue cards to consumers and manage the generation of statements and billing of individuals. Since the interactions with merchants and consumers are split between two entities, you end up with data sets that are either consumer-focused or merchant-focused. We get our data from an "issuing" bank, so they don't have detailed merchant info beyond what they need to generate statements for cardholders. That is the root of our problem.
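To make the parsing problem concrete, here is a minimal sketch in Python. The descriptor strings and the regex heuristics are hypothetical illustrations of the kind of semi-structured text involved, not Bundle's actual data or extraction logic:

```python
import re

# Hypothetical examples of the kind of semi-structured descriptors an
# issuing bank might store; real descriptors vary by acquirer and terminal.
descriptors = [
    "TEDS MONT GRILL #0042 NEW YORK NY",
    "TED'S MONTANA GRILL NYC 212-5550123",
    "SQ *TEDS MONTANA GR New York NY",
]

def guess_merchant_name(descriptor: str) -> str:
    """Naive heuristic: strip processor prefixes, store numbers, phone
    numbers, and trailing city/state tokens. It fails often -- which is
    exactly why simple parsing is not enough and fuzzy matching is needed."""
    s = re.sub(r"^SQ \*", "", descriptor)           # processor prefix
    s = re.sub(r"#\d+", "", s)                      # store number
    s = re.sub(r"\d{3}-?\d{7}", "", s)              # phone number
    s = re.sub(r"\s+(NEW YORK|NYC)?\s*NY$", "", s)  # trailing geo tokens
    return s.strip()

for d in descriptors:
    print(repr(guess_merchant_name(d)))
# Three descriptors for the same restaurant yield three different strings.
```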
Alex: This is a screen shot of our core offering, the Bundle Merchant Recommender, which aims to help consumers with their most frequent money decision: where to spend. Visually, I'm sure you're reminded of user review sites like Yelp or Citysearch, and the purpose, to help you discover great merchants, is similar. Our content, though, is very different, because it's generated directly from the credit card transactions of over 20 million US households.
Alex: (Review features left to right.) I just wanted to return to this screen shot to highlight the features that are made possible by transforming credit card data in this way.

- Loyalty score: Unlike other sites, our star ratings are data-driven. We assign each merchant what we call the "Bundle Loyalty Score," which is calculated from the share of wallet a merchant's customers devote to the business and how frequently they return.
- Coverage: Because we capture transactions from a broad cross-section of the population, we have data on many small local merchants, not just the popular ones that attract a lot of reviews.
- Segments and silent majority: We can break a merchant's customers down into demographic and behavioral segments, to show how well it serves different groups and which groups it is most popular with. We're capturing information about the silent majority of shoppers, who shop without writing about it online, and we also avoid the common bias on review sites toward extremely positive or extremely negative reviews.
- Real price levels: We have rich data about the real range of prices visitors to this merchant are paying, based on real transactions.
- Web of merchants: Another unique feature on Bundle is that we can show you what other merchants are popular with customers of this merchant. We're all familiar with "People who bought this also bought" on Amazon and other online marketplaces, but I believe we're the first to take this to the offline marketplace on a massive scale.
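The talk names the two ingredients of the Loyalty Score (share of wallet and return frequency) but not the formula, so the equal weighting and visit cap in this sketch are purely illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class CustomerStats:
    spend_at_merchant: float   # customer's spend at this merchant
    total_spend: float         # customer's total card spend in the period
    visits: int                # number of visits in the period

def loyalty_score(customers: list[CustomerStats], max_visits: int = 12) -> float:
    """Toy loyalty score: average of mean share-of-wallet and a capped
    visit-frequency term, scaled to 0-100. Bundle's real formula is not
    described in the talk; this is an assumed stand-in."""
    if not customers:
        return 0.0
    share = sum(c.spend_at_merchant / c.total_spend for c in customers) / len(customers)
    freq = sum(min(c.visits, max_visits) / max_visits for c in customers) / len(customers)
    return 100 * (share + freq) / 2

print(loyalty_score([CustomerStats(120.0, 2400.0, 6), CustomerStats(80.0, 1000.0, 3)]))
```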
(Top 10 possible matches, like Google search)
Jaime: Take it back to the audience. A common theme in converting data to dollars is extracting new value from old data by MATCHING it with other preexisting data. No need to dwell on the particulars of Bundle's data on this slide, except as an instance of a more general pattern.
JF provides framing: This is a universal problem for companies seeking to convert data to dollars; repurposing old data sets often requires matching with other data sets without a common key.

AH: It should be clear now how a robust, accurate algorithm for matching text descriptions to merchant listings is a prerequisite for our entire user experience. There are two aspects of this problem that created significant challenges for us.

First, there's the basic issue that accurate fuzzy string matching is hard. Our inputs are highly variable transaction descriptions (sometimes dozens or hundreds per merchant), inconsistent coding, error-prone geographic indicators, and noisy merchant category indicators. These give us a lot to go on, but treating any of them as a source of truth gets you in trouble. We're at a Text Analytics conference, so I don't have to tell you that accurate fuzzy string matching can be hard, especially if supporting data like merchant category and geo information are not 100% reliable.

But before we could even begin to attack that problem, we had to do something about the sheer size of our data set. We receive about 1 billion credit card transactions per year, each of which must be associated with one of tens of millions of merchants in a comprehensive listing. Not that anyone would try this, but a brute-force attempt to take each transaction description and scan through the merchant listing item by item looking for a match would require on the order of 10^16 fuzzy string comparisons. To put that in perspective, if each comparison took about a millisecond, the match would take over 300,000 years to run.

Clearly something needs to be done to reduce the scale of the input AND the matching search space. Broadly speaking, we accomplished this by breaking the matching process into two phases, using text clustering in the first phase to dramatically decrease the size of the data set, and then proceeding to a fuzzy match.
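The brute-force arithmetic above is easy to verify. A quick sketch (the one-millisecond-per-comparison figure is the talk's own assumption, and SequenceMatcher is just a generic stand-in for whatever fuzzy comparison one might use):

```python
from difflib import SequenceMatcher

# Back-of-envelope check of the brute-force cost quoted in the talk.
transactions_per_year = 1e9   # ~1 billion transactions
merchants = 1e7               # tens of millions of merchants
comparisons = transactions_per_year * merchants  # ~1e16

seconds = comparisons * 1e-3  # assume ~1 ms per fuzzy comparison
years = seconds / (3600 * 24 * 365)
print(f"{comparisons:.0e} comparisons ~ {years:,.0f} years")  # ~317,000 years

# One generic fuzzy comparison, for scale intuition:
print(SequenceMatcher(None, "TEDS MONT GRILL", "TED'S MONTANA GRILL").ratio())
```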
This isn't rocket science; there are a handful of obvious places to start simplifying the problem. One key lever is location: if you have a transaction that occurred in New Mexico, it doesn't make sense to include merchants in New York in your search. There are tens of millions of merchants nationally, but only hundreds of thousands in each city, and maybe a thousand max in each neighborhood. If you can identify the neighborhood of a transaction and only search the merchants in that neighborhood, the efficiency payoff is huge.

This wasn't a completely obvious step for us, though, because as I mentioned before, the geographic fields in our transaction data were not 100% reliable. We could identify the city with no problem, but at the neighborhood level there is a significant error rate. We eventually realized we had to ignore all the little complications and, at all costs, reduce the size of our data so we could work with it efficiently. It's worth creating an intermediate data set that's still pretty messy if you can now load it into R on your laptop and try out a few fuzzy matching experiments in an afternoon.
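Here is a minimal sketch of that geographic "blocking" idea, assuming each transaction and merchant record carries city and neighborhood fields; the record shapes and field names are hypothetical:

```python
from collections import defaultdict

def block_by_neighborhood(transactions, merchants):
    """Group both sides by a (city, neighborhood) key so fuzzy comparisons
    only happen within a block. With roughly a thousand merchants per
    neighborhood instead of tens of millions nationally, the search space
    per transaction shrinks by about four orders of magnitude."""
    merchants_by_hood = defaultdict(list)
    for m in merchants:
        merchants_by_hood[(m["city"], m["neighborhood"])].append(m)

    for t in transactions:
        key = (t["city"], t["neighborhood"])  # neighborhood field may be noisy!
        candidates = merchants_by_hood.get(key, [])
        yield t, candidates  # the downstream fuzzy matcher sees only these
```

The noisy neighborhood field means some transactions will land in the wrong block; the talk's point is that accepting that error in order to shrink the data is worth it.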
This slide gives a high-level overview of how we achieved a cascade of scale reductions by batching transactions by neighborhood. Considering each neighborhood in isolation, we dedupe and then cluster transaction strings which are highly likely to be generated by the same merchant. Each of these clusters is assigned a preliminary merchant ID. At this point we have a preliminary merchant listing which still suffers from some of the quality issues of the original data set, but which can provide aggregated transaction data views to inform subsequent matching, and which is on a much more manageable scale. The output of the clustering algorithm feeds into a more resource-intensive fuzzy matching algorithm, which becomes feasible at this scale.

Taking this approach on a single machine, we were able to get our processing time down to about a week. However, in startup time a week is not much better than 300,000 years. Thanks to the revolution in open source parallel computing, we were able to quickly set up a small Hadoop cluster which parallelizes the text clustering operations so all the neighborhoods run at the same time. This brought our processing down to about 20 minutes. While this isn't a complete solution to the initial problem, it vastly increases our capability to experiment with new methods and tweaks to the existing process.

So that's a quick and dirty introduction to a part of our technology stack, and now I'll turn it over to Jaime to convert my case study into some high-level takeaways.
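For illustration, here is a toy version of the per-neighborhood dedupe-and-cluster step. The talk doesn't specify Bundle's clustering algorithm or similarity threshold, so the greedy single pass and the 0.85 cutoff below are assumptions:

```python
from difflib import SequenceMatcher

def cluster_descriptors(descriptors, threshold=0.85):
    """Greedy single-pass clustering of deduped transaction strings within
    one neighborhood. Each resulting cluster gets a preliminary merchant ID."""
    clusters = []  # list of (representative_string, member_set)
    for d in sorted(set(descriptors)):  # dedupe first
        for rep, members in clusters:
            if SequenceMatcher(None, d, rep).ratio() >= threshold:
                members.add(d)
                break
        else:
            clusters.append((d, {d}))
    return {f"prelim-{i}": members for i, (_rep, members) in enumerate(clusters)}
```

Because each neighborhood is processed independently, a function like this can run as one map task per neighborhood (for example, under Hadoop Streaming), which matches the talk's description of getting from about a week on one machine to about 20 minutes on a small cluster.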
Comments:
- Consider trade-offs between false positives and false negatives.
Related hot/emerging best practices we can mention to frame this:
- Metrics-Driven Development
- Beginning with the End in Mind / Causal Clarity
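One way to make the false-positive/false-negative trade-off concrete is to sweep the match-score threshold on a labeled sample and track precision and recall. A generic sketch with made-up scored pairs (not Bundle's data):

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs: (match_score, is_true_match) tuples from a labeled
    sample. Raising the threshold trades false positives for false
    negatives; pick it based on which error hurts the product more."""
    tp = sum(1 for s, y in scored_pairs if s >= threshold and y)
    fp = sum(1 for s, y in scored_pairs if s >= threshold and not y)
    fn = sum(1 for s, y in scored_pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

sample = [(0.95, True), (0.90, True), (0.88, False), (0.70, True), (0.60, False)]
for th in (0.65, 0.80, 0.92):
    print(th, precision_recall(sample, th))
# Precision rises and recall falls as the threshold increases.
```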