From Big Legacy Data to Insight: Lessons Learned Creating
New Value from a Billion Low Quality Records

Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
Alex Hasha, Chief Data Scientist, Bundle.com

May 1, 2012


                                             Architects of Fact-Based Decisions™
Agenda for Today’s Talk




        1. The Business Model
        2. The Text Analytics Challenge
        3. How We Overcame the Challenge
        4. Key Takeaways
        5. Q&A




From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records © 2012 Fitzgerald Analytics, Inc. All Rights Reserved   2
Introduction

        Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)

            Responsible for…
                • Transforming data into value for clients
                • Creating meaningful careers for employees
            At a company that
                • Helps clients convert Data to Dollars™
                • Brings a strategic perspective to improve ROI on investments in
                  technology, data, people, and processes
            Also working on
                • Democratizing Analytics by Reducing the “Barrier to Benefit” for
                  non-profits, social entrepreneurs, and gov’t

        Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)

            Responsible for…
                • Leading development of data products
                • Designing statistical methods / algorithms that transform data into
                  insights for consumers
            At a company that
                • Uses data to help consumers make better decisions with their money
                • Bends valuable legacy data to new purposes
                • Is growing and hiring!
            Also working on
                • Learning about and implementing best practices for managing complex
                  data pipelines



The Local Search Business




Gaps in Local Search Offerings


        Paid Advertisement Not Trusted

        User-Reviews Can be Biased
            • Selection Bias
            • Can be Gamed
            • Not Personalized (to you)


Bundle’s Unique Contribution
        Unlike other merchant listing sites, our content is based on real credit card
        spending by 20 million households

        Example: Credit Card Statement Data




A Screen Shot From our Site




We Do This with Billions of Real Spending Records
        Unlike other merchant listing sites, our content is based on real credit card
        spending by 20 million households
        Example: Credit Card Statement Data

        Key Issues with this Data:
        1. Credit card data lacks merchant identifier
        2. So we rely on text analytics to associate transactions with merchants




Building our “Version of the Truth” from 3 sources


                Our Transaction Data     Localeze              Factual

        Pros    Proprietary              High Quality          Crowd Sourced
                Differentiated           Clean / Verified      Up to the Minute
                Special Sauce

        Cons    Semi-Structured          Incomplete            More variability
                                         Lag / Recency         in quality


Data: Not Useful Until Refined.




Key Steps in “Refinement” (Transformation)

        Old Data:
            • Card Transaction Data
            • Merchant Listings (e.g., Address, Phone Number, Business Type)
            • Other Data: Census, Bureau of Labor Statistics, User Feedback

        Transformed in New Ways:
            • Normalization
            • Clustering
            • Linking
            • Aggregation

        To Create New Features Such As…
            • People Who Shop Here Also Like…
            • The Bundle Loyalty Score
            • Data-Driven Reviews From an Array of Customer Segments



Before the Fun Stuff Happens…
        Before we can generate insights about merchants for our users, we must associate
        each transaction in our database with a specific merchant from a master list….



        Credit Card Transactions (Billions – 10^9)
            • Highly variable text descriptions
            • Noisy geographic info
            • Noisy merchant category info

        Text Matching

        Comprehensive Listing of US Merchants (Tens of Millions – 10^7)

        Two main problems:
        1. Accurate Fuzzy Matching is Difficult
        2. Scale of Data is Enormous

        Naïve item by item search takes O(10^16)
        expensive string comparisons: Too Slow!
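
The naive comparison count quoted above is the product of the two data set sizes: roughly 10^9 transactions times 10^7 merchants, or about 10^16 string comparisons. The sketch below reproduces that arithmetic and converts it into compute time. The throughput of one million comparisons per second is an assumed figure chosen purely for illustration; it does not come from the talk.

```python
# Back-of-the-envelope cost of naive item-by-item matching.
# The two counts come from the slide; the throughput is an ASSUMPTION.

N_TRANSACTIONS = 10**9          # billions of credit card transactions
N_MERCHANTS = 10**7             # tens of millions of merchants in the master list
COMPARISONS_PER_SECOND = 10**6  # assumed single-worker throughput (hypothetical)

SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def naive_comparisons(n_transactions: int, n_merchants: int) -> int:
    """Every transaction is compared against every merchant."""
    return n_transactions * n_merchants


def years_needed(comparisons: int, per_second: int) -> float:
    """Convert a comparison count into years of single-worker compute."""
    return comparisons / per_second / SECONDS_PER_YEAR


total = naive_comparisons(N_TRANSACTIONS, N_MERCHANTS)
print(f"comparisons: {total:.0e}")
print(f"years at assumed rate: {years_needed(total, COMPARISONS_PER_SECOND):.0f}")
```

Under that assumed rate a single worker would need centuries, which is why the rest of the talk focuses on shrinking the comparison set rather than on speeding up individual comparisons.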

A “Brute Force” Approach Would Never Work…


        1. Matching w/in Hundreds of Millions of Merchants would require massive
           processing… Fortunately we don’t need to match at this level
        2. Batching at the local area level processes orders of magnitude faster.

        [Chart: Processing Time / Workload (0 to 1) vs. # of Merchants in Comparison Set]
            Neighborhood: Hundreds
            City: Hundreds of Thousands
            Nation: Tens of Millions
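
A small sketch of why batching by local area shrinks the workload: if a transaction is only compared with merchants that share its location key, the total work is the sum of per-location products instead of one global product. The ZIP-code-style keys and the toy counts below are assumptions for illustration; the talk does not say which geographic key was used.

```python
from collections import Counter
from typing import Iterable


def naive_comparisons(n_transactions: int, n_merchants: int) -> int:
    """Compare every transaction with every merchant, nationwide."""
    return n_transactions * n_merchants


def blocked_comparisons(tx_locations: Iterable[str],
                        merchant_locations: Iterable[str]) -> int:
    """Compare each transaction only with merchants in the same location."""
    tx_counts = Counter(tx_locations)
    merchant_counts = Counter(merchant_locations)
    # A location with no merchants contributes zero comparisons.
    return sum(n_tx * merchant_counts[loc] for loc, n_tx in tx_counts.items())


# Toy data: location keys are hypothetical ZIP codes.
tx_locations = ["10001"] * 4 + ["11201"] * 3 + ["94110"] * 3
merchant_locations = ["10001"] * 2 + ["11201"] * 2 + ["94110"]

naive = naive_comparisons(len(tx_locations), len(merchant_locations))
blocked = blocked_comparisons(tx_locations, merchant_locations)
print(naive, blocked)  # 50 17
```

The saving grows with the number of locations, because each batch sees only a small slice of the merchant list.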

Solution to Scaling Problem
        This is a “Cascade of Scale Reductions”, Parallelizing by Location

        Keys to solving the scaling problem:
        1. Scale Reduction / Parallelized Text Clustering
        2. Free Open Source Software

        Process flow:
        1. Credit Card Transactions (Billions – 10^9)
        2. Batch Transactions by Geographic Neighborhood (batches 1, 2, ... 10000)
        3. Dedupe Description Strings
        4. Text Clustering (Not Matching): Consolidate Strings Belonging to Same Merchant
        5. Preliminary Merchant Listing Generated Directly from Transactions
           (Tens of Millions – 10^7)
        6. Secondary Fuzzy Matching Process Reconciles Preliminary Listings with
           Merchant “Source of Truth”
        7. Final Merged Transaction Data Set

        Computational Efficiency Increased by a Factor of 10^8!
        Eons -> Days -> Minutes
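
The first stages of this cascade (batch by neighborhood, dedupe description strings, cluster the survivors) can be sketched as follows. The talk does not name the clustering algorithm or the tools used, so the greedy single-pass clustering, the `difflib` similarity measure, and the 0.8 threshold are all stand-in assumptions, as are the sample records.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when two description strings are close enough to share a cluster."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def cluster_strings(strings, threshold: float = 0.8):
    """Greedy clustering: each string joins the first cluster whose
    representative (first member) it resembles, else starts a new cluster."""
    clusters = []
    for s in strings:
        for cluster in clusters:
            if similar(s, cluster[0], threshold):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def preliminary_listing(transactions, threshold: float = 0.8):
    """transactions: iterable of (neighborhood, description) pairs."""
    batches = defaultdict(set)
    for neighborhood, description in transactions:
        # The set dedupes identical description strings within a batch.
        batches[neighborhood].add(description.upper().strip())
    # Each batch is independent, so batches could be processed in parallel.
    return {hood: cluster_strings(sorted(descs), threshold)
            for hood, descs in batches.items()}


# Hypothetical sample records.
transactions = [
    ("brooklyn", "ANTHONYS RESTAURANT #123"),
    ("brooklyn", "ANTHONYS RESTAURANT #123"),
    ("brooklyn", "ANTHONYS RESTAURNT 123"),
    ("brooklyn", "JOES HARDWARE"),
    ("queens", "JOES HARDWARE"),
]
listing = preliminary_listing(transactions)
print(listing)
```

Because no batch depends on another, the per-neighborhood work is a natural unit for parallel execution, which is the point of parallelizing by location.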

Data Preparation: Phase 1


        DAMA Lens                                          Machine Learning Lens

        • Matching (Strings)       Deduping X 10,          • Unsupervised Learning
                                   Cleansing               • Text Clustering
                                                           • Pattern Discovery

        Example:
            Anthonys Restaurant #123 Brkly NY  ->  Anthony’s Restaurant
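
As a rough illustration of the cleansing side of this phase, the sketch below reduces a raw description and a clean merchant name to the same comparison key. It is rule-based rather than learned, and the list of trailing location tokens is a hypothetical stand-in for whatever geographic reference data a real pipeline would use.

```python
import re

# Hypothetical location abbreviations to strip from the end of a description.
LOCATION_TOKENS = {"brkly", "ny"}


def normalize(description: str) -> str:
    """Reduce a description to a comparison key (assumed rules, not the talk's)."""
    text = description.lower()
    # Drop apostrophes so "anthony's" and "anthonys" produce the same key.
    text = text.replace("’", "").replace("'", "")
    # Drop store numbers such as "#123".
    text = re.sub(r"#\s*\d+", " ", text)
    # Turn any remaining punctuation into spaces.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    tokens = text.split()
    # Strip location tokens only from the end, leaving names like "NY Pizza" intact.
    while tokens and tokens[-1] in LOCATION_TOKENS:
        tokens.pop()
    return " ".join(tokens)


raw = normalize("Anthonys Restaurant #123 Brkly NY")
clean = normalize("Anthony’s Restaurant")
print(raw == clean, repr(raw))
```

Keys like these make exact deduplication cheap before any fuzzy comparison is attempted.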




Data Preparation: Phase 2


        DAMA Lens                                          Machine Learning Lens

        • Record Linkage           • Deduping + 30%        • Information Retrieval
        • Data Quality             • More Cleansing        • Supervised Classifier
          Enhancement              • Data Enrichment

        Search Retrieves Top 10 Possible Matches
        Classifier applied to each, returns confidence score
        If Confidence = High, Records are linked
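
The retrieve-then-classify flow described here can be sketched in a few lines. Everything concrete below is an assumption: the `Merchant` record, the `difflib`-based search, the hand-weighted `confidence` function (a stand-in for the trained supervised classifier, which the talk does not describe), and the 0.85 threshold for "high" confidence.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from heapq import nlargest
from typing import List, Optional


@dataclass(frozen=True)
class Merchant:
    merchant_id: str
    name: str
    zip_code: str


def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_candidates(query_name: str, merchants: List[Merchant],
                        k: int = 10) -> List[Merchant]:
    """Search step: keep the k merchants whose names best match the query."""
    return nlargest(k, merchants, key=lambda m: name_similarity(query_name, m.name))


def confidence(query_name: str, query_zip: str, merchant: Merchant) -> float:
    """Stand-in for the supervised classifier; the weights are made up."""
    score = 0.7 * name_similarity(query_name, merchant.name)
    if query_zip == merchant.zip_code:
        score += 0.3
    return score


def link(query_name: str, query_zip: str, merchants: List[Merchant],
         threshold: float = 0.85) -> Optional[Merchant]:
    """Link to the best candidate only when its confidence is high enough."""
    candidates = retrieve_candidates(query_name, merchants)
    if not candidates:
        return None
    best = max(candidates, key=lambda m: confidence(query_name, query_zip, m))
    return best if confidence(query_name, query_zip, best) >= threshold else None


# Hypothetical master list.
MERCHANTS = [
    Merchant("m1", "Anthony's Restaurant", "11201"),
    Merchant("m2", "Antonio's Pizza", "11201"),
    Merchant("m3", "Joe's Hardware", "10001"),
]
print(link("Anthonys Restaurant", "11201", MERCHANTS))
print(link("Zed's Bakery", "94110", MERCHANTS))
```

Splitting the work this way keeps the expensive scoring step confined to a handful of candidates per record, while the threshold leaves uncertain records unlinked instead of forcing a match.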




Takeaways



           1. Tame your data before perfecting your methods.
           Efficiency enables experimentation, iteration, improvement.



           2. Design your process to minimize unnecessary complexity
           (e.g. Parallel Processing at Scale, Normalization, Pre-Filtering)



            3. Tools: Take advantage of powerful (and inexpensive) open-
            source tools that enable your process...



More Related Content

What's hot

Data Discovery for Big Big Insights - Tableau Webinar Slides
Data Discovery for Big Big Insights - Tableau Webinar SlidesData Discovery for Big Big Insights - Tableau Webinar Slides
Data Discovery for Big Big Insights - Tableau Webinar SlidesFitzgerald Analytics, Inc.
 
Big Data and Analytics
Big Data and AnalyticsBig Data and Analytics
Big Data and Analyticsdmurph4
 
The Comprehensive Approach: A Unified Information Architecture
The Comprehensive Approach: A Unified Information ArchitectureThe Comprehensive Approach: A Unified Information Architecture
The Comprehensive Approach: A Unified Information ArchitectureInside Analysis
 
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinseySales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinseyLattice Engines
 
Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012nickychu
 
Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...MarketBridge
 
B2Bdatapartners Capabilities
B2Bdatapartners CapabilitiesB2Bdatapartners Capabilities
B2Bdatapartners CapabilitiesB2Bdatapartners
 
Analytical Revolution
Analytical RevolutionAnalytical Revolution
Analytical RevolutionNedODoherty
 
Knowledgelevers expanded
Knowledgelevers expandedKnowledgelevers expanded
Knowledgelevers expandedKnowledgelevers
 
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM USSmarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM USIBM Danmark
 
Demystifying BI For Mid-Market Enterprises
Demystifying BI For Mid-Market EnterprisesDemystifying BI For Mid-Market Enterprises
Demystifying BI For Mid-Market EnterprisesJamal_Shah
 
Intel Cloud Summit: Big Data
Intel Cloud Summit: Big DataIntel Cloud Summit: Big Data
Intel Cloud Summit: Big DataIntelAPAC
 
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
Monetizing data  - An Evening with Eight of Chicago's Data Product Management...Monetizing data  - An Evening with Eight of Chicago's Data Product Management...
Monetizing data - An Evening with Eight of Chicago's Data Product Management...Randy Horton
 
Zy Vision Solutions Overview
Zy Vision Solutions OverviewZy Vision Solutions Overview
Zy Vision Solutions Overviewtresag71
 
From Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedFrom Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedTeradata Aster
 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overviewbgoverstreet
 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overviewcfsanders
 
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...IT Network marcus evans
 
Intel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntelAPAC
 

What's hot (20)

Data Discovery for Big Big Insights - Tableau Webinar Slides
Data Discovery for Big Big Insights - Tableau Webinar SlidesData Discovery for Big Big Insights - Tableau Webinar Slides
Data Discovery for Big Big Insights - Tableau Webinar Slides
 
Big Data and Analytics
Big Data and AnalyticsBig Data and Analytics
Big Data and Analytics
 
The Comprehensive Approach: A Unified Information Architecture
The Comprehensive Approach: A Unified Information ArchitectureThe Comprehensive Approach: A Unified Information Architecture
The Comprehensive Approach: A Unified Information Architecture
 
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinseySales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
 
Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012
 
Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
Crunching “Big Data” to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
 
B2Bdatapartners Capabilities
B2Bdatapartners CapabilitiesB2Bdatapartners Capabilities
B2Bdatapartners Capabilities
 
Analytical Revolution
Analytical RevolutionAnalytical Revolution
Analytical Revolution
 
Knowledgelevers expanded
Knowledgelevers expandedKnowledgelevers expanded
Knowledgelevers expanded
 
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM USSmarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
 
Demystifying BI For Mid-Market Enterprises
Demystifying BI For Mid-Market EnterprisesDemystifying BI For Mid-Market Enterprises
Demystifying BI For Mid-Market Enterprises
 
Intel Cloud Summit: Big Data
Intel Cloud Summit: Big DataIntel Cloud Summit: Big Data
Intel Cloud Summit: Big Data
 
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
Monetizing data  - An Evening with Eight of Chicago's Data Product Management...Monetizing data  - An Evening with Eight of Chicago's Data Product Management...
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
 
Zy Vision Solutions Overview
Zy Vision Solutions OverviewZy Vision Solutions Overview
Zy Vision Solutions Overview
 
From Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedFrom Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics Applied
 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overview
 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overview
 
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
 
Intel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick Knupffer
 
Aod Narrative
Aod NarrativeAod Narrative
Aod Narrative
 

Similar to From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year

Governing the Data to Dollars Value Chain™ - Sept 2012 NYC Data Governance Co...
Governing the Data to Dollars Value Chain™ - Sept 2012 NYC Data Governance Co...Governing the Data to Dollars Value Chain™ - Sept 2012 NYC Data Governance Co...
Governing the Data to Dollars Value Chain™ - Sept 2012 NYC Data Governance Co...Fitzgerald Analytics, Inc.
 
Big Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability AnalyticsBig Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability AnalyticsDATAVERSITY
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
 
Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...Perficient, Inc.
 
Data Activation For (Not So Much) Dummies
Data Activation For (Not So Much) DummiesData Activation For (Not So Much) Dummies
Data Activation For (Not So Much) DummiesCory Treffiletti
 
EDF2012 Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
EDF2012   Wolfgang Nimfuehr - Bringing Big Data to the EnterpriseEDF2012   Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
EDF2012 Wolfgang Nimfuehr - Bringing Big Data to the EnterpriseEuropean Data Forum
 
Big Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability AnalyticsBig Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability AnalyticsFitzgerald Analytics, Inc.
 
The Big Deal About Big Data For Customer Engagement
The Big Deal About Big Data For Customer EngagementThe Big Deal About Big Data For Customer Engagement
The Big Deal About Big Data For Customer EngagementIBM India Smarter Computing
 
Hadoop: What It Is and What It's Not
Hadoop: What It Is and What It's NotHadoop: What It Is and What It's Not
Hadoop: What It Is and What It's NotInside Analysis
 
Robert LeBlanc - Why Big Data? Why Now?
Robert LeBlanc - Why Big Data? Why Now?Robert LeBlanc - Why Big Data? Why Now?
Robert LeBlanc - Why Big Data? Why Now?Mauricio Godoy
 
Enfathom service overview
Enfathom service overviewEnfathom service overview
Enfathom service overviewchooylee
 
Day 2 aziz apj aziz_big_datakeynote_press
Day 2 aziz apj aziz_big_datakeynote_pressDay 2 aziz apj aziz_big_datakeynote_press
Day 2 aziz apj aziz_big_datakeynote_pressIntelAPAC
 
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)Mark Heid
 
Valtech - Big Data for marketing (EN)
Valtech - Big Data for marketing (EN)Valtech - Big Data for marketing (EN)
Valtech - Big Data for marketing (EN)Valtech
 
Scenari evolutivi nello snellimento dei sistemi informativi
Scenari evolutivi nello snellimento dei sistemi informativiScenari evolutivi nello snellimento dei sistemi informativi
Scenari evolutivi nello snellimento dei sistemi informativiFondazione CUOA
 
01 im overview high level
01 im overview high level01 im overview high level
01 im overview high levelJames Findlay
 
OSC2012: Big Data Using Open Source: Netapp Project - Technical
OSC2012: Big Data Using Open Source: Netapp Project - TechnicalOSC2012: Big Data Using Open Source: Netapp Project - Technical
OSC2012: Big Data Using Open Source: Netapp Project - TechnicalAccenture the Netherlands
 
Building A Bi Strategy
Building A Bi StrategyBuilding A Bi Strategy
Building A Bi Strategylarryzagata
 

From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year

  • 1. From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records
      Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
      Alex Hasha, Chief Data Scientist, Bundle.com
      May 1, 2012
      Architects of Fact-Based Decisions™
  • 2. Agenda for Today’s Talk
      1. The Business Model
      2. The Text Analytics Challenge
      3. How We Overcame the Challenge
      4. Key Takeaways
      5. Q&A
      From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records © 2012 Fitzgerald Analytics, Inc. All Rights Reserved
  • 3. Introduction
      Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)
        Responsible for:
          • Transforming data into value for clients
          • Creating meaningful careers for employees
        At a company that:
          • Helps clients convert Data to Dollars™
          • Brings a strategic perspective to improve ROI on investments in technology, data, people, and processes
        Also working on:
          • Working to Democratize Analytics by Reducing the “Barrier to Benefit” for non-profits, social entrepreneurs, and gov’t
      Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)
        Responsible for:
          • Leading development of data products
          • Designing statistical methods / algorithms that transform data into insights for consumers
        At a company that:
          • Uses data to help consumers make better decisions with their money
          • Bends valuable legacy data to new purposes
          • Is growing and hiring!
        Also working on:
          • Learning about and implementing best practices for managing complex data pipelines
  • 4. The Local Search Business
  • 5. Gaps in Local Search Offerings
      Paid Advertisement:
          • Not Trusted
      User Reviews:
          • Can be Biased (Selection Bias)
          • Can be Gamed
          • Not Personalized (to you)
  • 6. Bundle’s Unique Contribution
      Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
      Example: Credit Card Statement Data
  • 7. A Screen Shot From our Site
  • 8. A Screen Shot From our Site
  • 9. A Screen Shot From our Site
  • 10. We Do This with Billions of Real Spending Records
      Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
      Example: Credit Card Statement Data
      Key Issues with this Data:
      1. Credit card data lacks merchant identifier
      2. So we rely on text analytics to associate transactions with merchants
  • 11. Building our “Version of the Truth” from 3 sources
      Our Transaction Data
        Pros: Proprietary; Differentiated; Special Sauce
        Cons: Incomplete; Semi-Structured
      Localeze
        Pros: High Quality; Clean / Verified
        Cons: Lag / Recency
      Factual
        Pros: Crowd Sourced; Up to the Minute
        Cons: More variability in quality
  • 12. Data: Not Useful Until Refined.
  • 13. Key Steps in “Refinement” (Transformation)
      Old Data:
          • Card Transaction Data
          • Merchant Listings (e.g., Address, Phone Number, Business Type)
          • Other Data: Census, Bureau of Labor Statistics, User Feedback
      Transformed in New Ways:
          • Normalization
          • Clustering
          • Linking
          • Aggregation
      To Create New Features Such As…
          • People Who Shop Here Also Like…
          • The Bundle Loyalty Score
          • Data-Driven Reviews From an Array of Customer Segments
  • 14. Before the Fun Stuff Happens…
      Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list.
      Text Matching: Credit Card Transactions (Billions – 10^9) against a Comprehensive Listing of US Merchants (Tens of Millions – 10^7)
      Two main problems:
      1. Accurate Fuzzy Matching is Difficult
          • Highly variable text descriptions
          • Noisy geographic info
          • Noisy merchant category info
      2. Scale of Data is Enormous
      Naïve item by item search takes O(10^16) expensive string comparisons: Too Slow!
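The scale estimate on this slide can be checked with a little arithmetic. The sketch below multiplies the slide's two order-of-magnitude figures; the throughput of one million comparisons per second is purely an illustrative assumption, not a number from the talk.

```python
# Order-of-magnitude figures taken from the slide.
transactions = 10**9   # credit card transactions (billions)
merchants = 10**7      # merchant listings (tens of millions)

# Naive matching compares every transaction with every merchant.
naive_comparisons = transactions * merchants

# Assumed throughput, for illustration only: one million string
# comparisons per second on a single machine.
comparisons_per_second = 10**6
seconds_per_year = 60 * 60 * 24 * 365

naive_years = naive_comparisons / (comparisons_per_second * seconds_per_year)

print(f"naive comparisons: 10^{len(str(naive_comparisons)) - 1}")
print(f"single-machine time at the assumed rate: about {naive_years:.0f} years")
```

Under that assumed rate the naive approach runs for centuries, which is the point of the slide: the comparison set has to shrink before any matching method is practical.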
  • 15. A “Brute Force” Approach Would Never Work…
      1. Matching w/in Hundreds of Millions of Merchants would require massive processing… Fortunately we don’t need to match at this level
      2. Batching at local area, process orders of magnitude faster.
      [Chart: Processing Time / Workload (0 to 1) against # of Merchants in Comparison Set. Neighborhood: Hundreds; City: Hundreds of Thousands; Nation: Tens of Millions]
  • 16. Solution to Scaling Problem
      This is a “Cascade of Scale Reductions”, Parallelizing by Location
      Keys to solving the scaling problem:
      1. Scale Reduction / Parallelized Text Clustering
      2. Free Open Source Software
      [Diagram, in order:]
          • Credit Card Transactions (Billions – 10^9)
          • Batch Transactions by Geographic Neighborhood (batches 1, 2, … 10000)
          • Dedupe Description Strings
          • Text Clustering (Not Matching): Consolidate Strings Belonging to Same Merchant
          • Preliminary Merchant Listing Generated Directly from Transactions (Tens of Millions – 10^7)
          • Secondary Fuzzy Matching Process Reconciles Preliminary Listings with Merchant “Source of Truth”
          • Final Merged Transaction Data Set
      Computational Efficiency Increased by a Factor of 10^8! Eons -> Days -> Minutes
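A minimal sketch of the first stages of this cascade: batch by neighborhood, dedupe the description strings, then cluster within each batch. The records, the ZIP-code batch key, the 0.75 threshold, and the greedy clustering rule are all illustrative assumptions; the slides do not say which clustering method or software Bundle used, and `difflib` stands in for it here.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical records: (neighborhood key, raw description string).
transactions = [
    ("11201", "ANTHONYS RESTAURANT #123 BRKLY NY"),
    ("11201", "ANTHONYS RESTAURANT #123 BRKLY NY"),
    ("11201", "ANTHONY'S RESTAURANT BROOKLYN NY"),
    ("11201", "JOES PIZZA BROOKLYN NY"),
    ("94110", "JOES PIZZA SAN FRANCISCO CA"),
]

def batch_and_dedupe(records):
    """Group description strings by neighborhood, dropping exact duplicates."""
    batches = defaultdict(set)
    for neighborhood, description in records:
        batches[neighborhood].add(description)
    return batches

def cluster(strings, threshold=0.75):
    """Greedy single-pass clustering: join the first cluster whose
    representative is similar enough, otherwise start a new cluster."""
    clusters = []
    for s in sorted(strings):
        for members in clusters:
            if SequenceMatcher(None, s, members[0]).ratio() >= threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# Each batch is independent, so batches could be processed in parallel.
preliminary_listing = {
    neighborhood: cluster(strings)
    for neighborhood, strings in batch_and_dedupe(transactions).items()
}
```

Because strings are only compared within a neighborhood, the two "JOES PIZZA" entries in different cities are never compared with each other, which is where the scale reduction comes from.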
  • 17. Data Preparation: Phase 1
      DAMA Lens:
          • Deduping
          • Matching (Strings)
          • Cleansing
      Machine Learning Lens:
          • Unsupervised Learning
          • Text Clustering
          • Pattern Discovery
      Example: “Anthonys Restaurant #123 Brkly NY” X 10, -> “Anthony’s Restaurant”
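The cleansing and deduping in this phase can be illustrated with a small normalization sketch. The first raw string is the slide's example; the other two variants and the normalization rules (lowercasing, dropping apostrophes and "#123"-style store numbers) are assumptions made for illustration, not Bundle's actual rules.

```python
import re
from collections import defaultdict

def normalize(description):
    """Lowercase, drop apostrophes and '#123'-style store numbers, collapse spaces."""
    text = description.lower().replace("'", "").replace("’", "")
    text = re.sub(r"#\d+", " ", text)         # store or terminal numbers
    text = re.sub(r"[^a-z0-9 ]", " ", text)   # any remaining punctuation
    return " ".join(text.split())

raw = [
    "Anthonys Restaurant #123 Brkly NY",    # example from the slide
    "ANTHONY'S RESTAURANT #123 BRKLY NY",   # hypothetical variant
    "Anthony’s  Restaurant #45 Brkly NY",   # hypothetical variant
]

# Strings that normalize to the same key are treated as one merchant.
groups = defaultdict(list)
for description in raw:
    groups[normalize(description)].append(description)
```

Exact-key grouping like this only catches variants that differ in case, punctuation, or store number; abbreviations such as "Brkly" versus "Brooklyn" still need the fuzzy clustering step.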
  • 18. Data Preparation: Phase 2
      DAMA Lens:
          • Deduping
          • Record Linkage
          • More Cleansing
          • Data Quality Enhancement
          • Data Enrichment
      Machine Learning Lens:
          • Information Retrieval
          • Supervised Classifier
      Search Retrieves Top 10 Possible Matches
      Classifier applied to each, returns confidence score
      If Confidence = High, Records are linked
      + 30%
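The retrieve-then-classify flow on this slide can be sketched as follows. The master listing, the token-overlap search, and the 0.85 threshold are hypothetical. The `confidence` function is a plain string-similarity score standing in for the supervised classifier, which the slides do not describe.

```python
from difflib import SequenceMatcher

# Hypothetical master listing ("source of truth"): id -> merchant name.
master_listing = {
    "m1": "Anthony's Restaurant",
    "m2": "Anthony's Pizzeria",
    "m3": "Joe's Pizza",
}

def tokens(text):
    """Lowercase alphanumeric tokens, with punctuation dropped."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return set(cleaned.split())

def retrieve_candidates(query, listing, k=10):
    """Search step: rank listing entries by token overlap, keep the top k."""
    q = tokens(query)
    scored = []
    for merchant_id, name in listing.items():
        overlap = len(q & tokens(name))
        if overlap:
            scored.append((overlap, merchant_id))
    scored.sort(reverse=True)
    return [merchant_id for _, merchant_id in scored[:k]]

def confidence(query, name):
    """Stand-in for the trained classifier: a similarity score in [0, 1]."""
    return SequenceMatcher(None, query.lower(), name.lower()).ratio()

def link(query, listing, threshold=0.85):
    """Return the best candidate's id if its score clears the threshold, else None."""
    candidates = retrieve_candidates(query, listing)
    if not candidates:
        return None
    best = max(candidates, key=lambda m: confidence(query, listing[m]))
    return best if confidence(query, listing[best]) >= threshold else None
```

For example, `link("Anthonys Restaurant", master_listing)` links to `"m1"`, while a query with no overlapping tokens, or one whose best score falls below the threshold, is left unlinked rather than forced onto a merchant.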
  • 19. Takeaways
      1. Tame your data before perfecting your methods. Efficiency enables experimentation, iteration, improvement.
      2. Design your process to minimize unnecessary complexity (e.g. Parallel Processing at Scale, Normalization, Pre-Filtering)
      3. Tools: Take advantage of powerful (and inexpensive) open-source tools that enable your process...

Editor's Notes

  1. Jaime intro.
     Alex intro: Thanks, Jaime. Since Jaime has already introduced me, I’ll introduce Bundle. Bundle is a company that uses data to help consumers make better decisions with their money. We do this on the one hand by providing free tools for managing personal financial data. But more to the point of today’s talk, we are also mining mountains of credit card transaction data to extract actionable insights for consumers based on the spending behavior of their peers.
  2. First to provide local merchant profiles for consumers that is deeply data-driven
       • Local Search Business (Yelp, Citysearch, Foursquare, Google, Bing)
       • % of local searches on mobile devices is growing very fast
       • Fast-growing sector in data-driven startups
       • Example: Ted’s Montana Grill
     Bundle addresses issues with other sites:
       • Selection Bias (strong opinions over-represented)
       • System Gaming (just like SEO; interesting story: “reputation mgt” companies!)
       • Explicit rankings (rank by the actual metrics!)
  3. Alex: So where does Text Analytics come into this? As you might imagine, bending old data to a new purpose is fraught with difficulties, because the dataset was designed with different applications in mind. A key problem we faced with our credit card transaction database was that the transaction records lack a merchant identifier. Its primary purpose is interacting with card holders and generating statements, and not surprisingly it’s formatted very much like an enormous credit card statement. The merchant name is embedded in a text field, which also contains other information. It’s semi-structured, but lacks a consistent format.
     Clearly, to unlock insights about merchants from this data, we have to associate the transactions with merchants using this text field, so text analytics is absolutely crucial to our business.
     AH: Just some background here: In the credit card industry there are “acquiring banks”, which deal with merchants and process their credit card transactions over various payment networks, and “issuing banks”, which issue cards to consumers and manage the generation of statements and billing of individuals. Since the interactions with merchants and consumers are split between two entities, you end up with data sets that are either consumer or merchant focused. We get our data from an “issuing” bank, so they don’t have detailed merchant info beyond what they need to generate statements for cardholders. That is the root of our problem.
  4. Alex: This is a screen shot of our core offering, the Bundle Merchant recommender, which aims to help consumers with their most frequent money decision: where to spend it. Visually, I’m sure you’re reminded of user review sites like Yelp or Citysearch, and the purpose, to help you discover great merchants, is similar. Our content, though, is very different because it’s generated directly from the credit card transactions of over 20 million US households.
  5. Alex: (Review features left to right.) I just wanted to return to this screen shot to highlight the features that are made possible by transforming credit card data in this way.
       • (Loyalty score) Unlike other sites, our star ratings are data driven: we assign each merchant what we call the “Bundle Loyalty Score”, which is calculated from the share of wallet a merchant’s customers devote to the business and how frequently they return.
       • (Coverage) Because we capture transactions from a broad cross-section of the population, we have data on many small local merchants, not just the popular ones that attract a lot of reviews.
       • (Segments and silent majority) We can break a merchant’s customers down into demographic and behavioral segments, to show how well it serves different groups, and which groups it is most popular with. We’re capturing information about the silent majority of shoppers, who shop without writing about it online, and we also avoid the common bias on review sites towards extremely positive or extremely negative reviews.
       • (Real price levels) We have rich data about the real range of prices visitors to this merchant are paying, based on real transactions.
       • (Web of merchants) Another unique feature on Bundle is that we can show you what other merchants are popular with customers of this merchant. We’re all familiar with “People who bought this also bought” on Amazon and other online marketplaces, but I believe we’re the first to take this to the offline marketplace on a massive scale.
  8. Top 10 Possible Matches, Like Google Search
  9. Jaime: Take it back to the audience. A common theme in converting data to dollars is to extract new value from old data by MATCHING it with other preexisting data. No need to dwell on the particulars of Bundle data on this slide, except as an instance of a more general pattern.
  10. JF Provides Framing: This is a universal problem for companies seeking to convert Data to Dollars: repurposing old data sets often requires matching them with other data sets without a common key. AH: It should be clear now how a robust, accurate algorithm for matching text descriptions to merchant listings is a prerequisite for our entire user experience. There are two aspects of this problem that created significant challenges for us. First, there’s the basic issue that accurate fuzzy string matching is hard. Our inputs are highly variable transaction descriptions, sometimes dozens or hundreds per merchant, with inconsistent coding, error-prone geographic indicators, and noisy merchant category indicators. These give us a lot to go on, but treating any of them as a source of truth gets you in trouble. We’re at a Text Analytics conference, so I don’t have to tell you how hard this is when supporting data like merchant category and geo information are not 100% reliable. Second, before we could even begin to attack that problem, we had to do something about the sheer size of our data set. We receive about 1 billion credit card transactions per year, each of which must be associated with one of tens of millions of merchants in a comprehensive listing. Not that anyone would try this, but a brute force attempt to take each transaction description and scan through the merchant listing item by item looking for a match would require on the order of 10^16 fuzzy string comparisons. To put that in perspective, if each comparison took about a millisecond, the match would take over 300,000 years to run. Clearly something needs to be done to reduce the scale of the input AND the matching search space. Broadly speaking, we accomplished this by breaking the matching process into two phases, using text clustering in the first phase to dramatically decrease the size of the data set, and then proceeding to a fuzzy match.
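The brute-force estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the merchant count of 10^7 is an assumed lower bound for "tens of millions", and the function name is made up for illustration.

```python
# Back-of-the-envelope cost of the brute-force match described in the notes.
TRANSACTIONS_PER_YEAR = 10**9     # "about 1 billion" transactions (from the talk)
MERCHANTS = 10**7                 # assumption: lower bound for "tens of millions"
SECONDS_PER_COMPARISON = 1e-3     # "about a millisecond" per comparison (from the talk)
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def brute_force_years(n_transactions, n_merchants, seconds_per_comparison):
    """Years needed to compare every transaction against every merchant."""
    comparisons = n_transactions * n_merchants
    return comparisons * seconds_per_comparison / SECONDS_PER_YEAR


comparisons = TRANSACTIONS_PER_YEAR * MERCHANTS
years = brute_force_years(TRANSACTIONS_PER_YEAR, MERCHANTS, SECONDS_PER_COMPARISON)
print(f"{comparisons:.0e} comparisons, roughly {years:,.0f} years")
```

With these inputs the result lands a little above 300,000 years, consistent with the figure quoted in the talk.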
  11. This isn’t rocket science: there are a handful of obvious places to start simplifying the problem. One key lever is location: if you have a transaction that occurred in New Mexico, it doesn’t make sense to include merchants in New York in your search. There are tens of millions of merchants nationally, but only hundreds of thousands in each city, and maybe a thousand at most in each neighborhood. If you can identify the neighborhood of a transaction, and only search the merchants in that neighborhood, the efficiency payoff is huge. This wasn’t a completely obvious step for us, though, because as I mentioned before, the geographic fields in our transaction data were not 100% reliable. We could identify the city with no problem, but at the neighborhood level there is a significant error rate. But we eventually realized we had to ignore all the little complications and, at all costs, reduce the size of our data so we could work with it efficiently. It’s worth creating an intermediate data set that’s still pretty messy if you can then load it into R on your laptop and try out a few fuzzy matching experiments in an afternoon.
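A minimal sketch of the location lever described above: index merchants by neighborhood, then run the fuzzy comparison only against merchants in the transaction's neighborhood. Everything here is an assumption for illustration, not Bundle's implementation: the record layout, the neighborhood labels, the 0.6 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for a real fuzzy matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_index(merchants):
    """Group merchant records by neighborhood so each search stays local."""
    index = defaultdict(list)
    for merchant in merchants:
        index[merchant["neighborhood"]].append(merchant)
    return index


def best_match(description, neighborhood, index, threshold=0.6):
    """Return the most similar merchant in the neighborhood, or None."""
    best, best_score = None, threshold
    for merchant in index.get(neighborhood, []):
        score = similarity(description, merchant["name"])
        if score >= best_score:
            best, best_score = merchant, score
    return best


# Hypothetical records: the same name appears in two different neighborhoods.
merchants = [
    {"id": 1, "name": "Joe's Pizza", "neighborhood": "nm-santa-fe-plaza"},
    {"id": 2, "name": "Joe's Pizza", "neighborhood": "ny-west-village"},
    {"id": 3, "name": "Plaza Hardware", "neighborhood": "nm-santa-fe-plaza"},
]
index = build_index(merchants)
match = best_match("JOES PIZZA", "nm-santa-fe-plaza", index)
print(match)
```

Because the search space per transaction shrinks from the national listing to one neighborhood, the number of comparisons drops by several orders of magnitude. The cost is that a wrong neighborhood label makes the true merchant unreachable, which is the error-rate trade-off mentioned in the notes.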
  12. This slide gives a high level overview of how we achieved a cascade of scale reductions by batching transactions by neighborhood. Considering each neighborhood in isolation, we dedupe and then cluster transaction strings which are highly likely to be generated by the same merchant. Each of these clusters is assigned a preliminary merchant ID. At this point we have a preliminary merchant listing which still suffers from some of the quality issues of the original data set, but which can provide aggregated views of the transaction data to inform subsequent matching, and which is on a much more manageable scale. The output of the clustering algorithm feeds into a more resource intensive fuzzy matching algorithm, which becomes feasible at this scale. Taking this approach on a single machine, we were able to get our processing time down to about a week. However, in startup time a week is not much better than 300K years. Thanks to the revolution in open source parallel computing, we were able to quickly set up a small Hadoop cluster which parallelizes the text clustering operations so all the neighborhoods run at the same time. This brought our processing time down to about 20 minutes. While this isn’t a complete solution to the initial problem, it vastly increases our capability to experiment with new methods and tweaks to the existing process. So that’s a quick and dirty introduction to a part of our technology stack, and now I’ll turn it over to Jaime to convert my case study into some high level takeaways.
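The first phase described above (batch by neighborhood, dedupe, cluster, assign preliminary merchant IDs) might be sketched as follows. The talk does not say which clustering algorithm was used, so this swaps in a simple greedy clustering on `difflib.SequenceMatcher` similarity; the sample strings, the 0.8 threshold, and the ID format are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(description):
    """Lowercase and collapse whitespace so trivial variants dedupe together."""
    return " ".join(description.lower().split())


def cluster_strings(strings, threshold=0.8):
    """Greedy clustering: each string joins the first cluster whose
    representative (first member) is similar enough, else starts a new one."""
    clusters = []
    for s in strings:
        for cluster in clusters:
            if SequenceMatcher(None, s, cluster[0]).ratio() >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def preliminary_merchants(transactions, threshold=0.8):
    """Map a preliminary merchant ID to its cluster of transaction strings."""
    by_neighborhood = defaultdict(set)
    for neighborhood, description in transactions:
        by_neighborhood[neighborhood].add(normalize(description))  # dedupe step
    result = {}
    # Each neighborhood is independent of the others, which is what makes it
    # possible to hand the batches to parallel workers.
    for neighborhood, strings in by_neighborhood.items():
        clusters = cluster_strings(sorted(strings), threshold)
        for i, cluster in enumerate(clusters):
            result[f"{neighborhood}:{i}"] = cluster
    return result


# Hypothetical (neighborhood, description) pairs.
transactions = [
    ("soho", "JOES PIZZA #12 NEW YORK NY"),
    ("soho", "JOES PIZZA #12  NEW YORK NY"),
    ("soho", "JOES PIZZA 12 NEW YORK NY"),
    ("soho", "CITY HARDWARE NEW YORK NY"),
    ("harlem", "JOES PIZZA #7 NEW YORK NY"),
]
prelim = preliminary_merchants(transactions)
for merchant_id, cluster in prelim.items():
    print(merchant_id, cluster)
```

In this toy input, five transactions collapse to four distinct strings after dedupe and to three preliminary merchants after clustering. The resulting listing is what the more expensive fuzzy match against the merchant directory would consume.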
  13. Robin custbehavior PayComplainPay....then....ST vs LT RecAdvLoyalty
  15. Comments: Consider trade-offs between false positives and false negatives. Related hot/emerging best practices we can mention to frame this: Metrics-Driven Development; Beginning with the End in Mind / Causal Clarity.