BETTER ALGORITHMS FROM BIGGER DATA
Chris Bingham, CTO, Crimson Hexagon
April 26th, 2012
INTRODUCTION
Crimson Hexagon and me
ABOUT CRIMSON HEXAGON

• Founded 4 years ago; now 40+ employees in Boston

• Help companies make actionable business decisions

• Based on unique analysis of social media and internal data

• Customers include Fortune 100 companies, agencies, and the UN

• Tech stack:
   •   Java, with R for algorithms
   •   Massive Lucene infrastructure with custom shard management
   •   Distributed computing framework for analysis
   •   Hadoop increasingly used
BIG DATA, BETTER DATA, BETTER ALGORITHMS

• World’s largest searchable social media archive

• >200 billion posts in 2012

• Adding 1 billion every 2-3 days

• Twitter, Facebook, blogs, forums, comments, news, etc.
BIG DATA, BETTER DATA, BETTER ALGORITHMS

• Who’s talking and listening?
   • Demographics
   • Interests
   • Relationships

• Trends and comparisons
   • Compared to yourself, over time
   • Compared to industry, competitors, etc.

• Human input
   • Define specific business question and possible answers
   • Provides focus and context
BIG DATA, BETTER DATA, BETTER ALGORITHMS

• Based on work by co-founder Gary King at Harvard

• Takes all those billions of posts, plus the human input

• Scales the human judgment up to massive data volumes

• Quantitative answers to specific business questions

• Accurate in any language
ALGORITHMS AND BIG DATA
The problem of leverage
MACHINE LEARNING

Let’s consider a typical data-analysis problem using machine learning.

How does having more data help (or hurt) us?
DEFINE CATEGORIES

Some set of user-defined categories (AKA topics, classes, etc.): A, B, C, D
PROVIDE TRAINING

Training examples to map features to categories A, B, C, D
LEARN A MODEL

Algorithm learns to classify items into categories A, B, C, D based on training data
CLASSIFY ITEMS

Incoming unknown items to be classified: w, x, y, z

Categories: A, B, C, D
OBTAIN RESULTS

Result: Items are classified, hopefully correctly!

• A: y
• B: w
• C: x, z
• D: (none)
DID IT WORK?

Compare algorithm to human(s) to measure accuracy. Here “z” was incorrectly classified.

• Human: A: y; B: w; C: x; D: z
• Algorithm: A: y; B: w; C: x, z; D: (none)
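The comparison on this slide can be sketched in a few lines of Python. This is only an illustration: the dictionaries `human` and `algorithm` and the helper names `accuracy` and `misclassified` are invented here, and the label assignments are read off the slide (the human put z in D, the algorithm put it in C).

```python
# Labels per item, as read from the slide (assumed mapping: item -> category).
human = {"w": "B", "x": "C", "y": "A", "z": "D"}
algorithm = {"w": "B", "x": "C", "y": "A", "z": "C"}


def accuracy(truth, predicted):
    """Fraction of items whose predicted category matches the human label."""
    matches = sum(1 for item, label in truth.items() if predicted.get(item) == label)
    return matches / len(truth)


def misclassified(truth, predicted):
    """Items where the algorithm disagrees with the human, in sorted order."""
    return sorted(item for item, label in truth.items() if predicted.get(item) != label)


print(accuracy(human, algorithm))       # 0.75
print(misclassified(human, algorithm))  # ['z']
```

Treating the human labels as ground truth is itself an assumption; with several human coders, their disagreement would need to be handled first.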
ERROR RATE

We were wrong 25% of the time. What happens when we add more data?

• 75% correct
• 25% wrong
SCALE TO BIG DATA

We just make the same mistakes on a larger scale.

• Small data set: 75% correct, 25% wrong
• Large data set: 75% correct, 25% wrong
CAN MORE DATA HELP?

Can bigger data help us? In some ways.

• It can enable more types of analysis
• It can enable analysis of more categories (e.g. A, B, C, D, E, F)
• It can provide more raw material for training and validation

What about accuracy?
HUMAN SCALE

More training usually improves accuracy, but for that we need not just more data, but more humans.

Humans don’t scale.
FEEDBACK

For some applications, users can implicitly provide feedback through their use, e.g. ad placement and spam detection.

But this isn’t possible in all cases, and you can’t be too wrong to begin with.
BOOTSTRAPPING

We can also feed the classified items back into the training set (no human intervention).

Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.
BOOTSTRAPPING RESULT

The more data you have, the more you can classify.

The more you classify, the more training data you obtain.

The more training data, the more accurate the results.

And we didn’t have to scale the human involvement.
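A minimal sketch of such a bootstrapping loop, under several assumptions that are not in the slides: items are single numbers, the classifier is a nearest-centroid rule, and only items within a hypothetical `max_distance` of a centroid are fed back into training. The function names and the toy data are invented for illustration.

```python
from statistics import mean


def centroids(training):
    """Average the feature values of the training examples in each category."""
    by_label = {}
    for value, label in training:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def classify(value, cents):
    """Return (label, distance) for the nearest category centroid."""
    label = min(cents, key=lambda name: abs(value - cents[name]))
    return label, abs(value - cents[label])


def bootstrap(training, unlabeled, max_distance=2.0, rounds=5):
    """Repeatedly move confidently classified items into the training set."""
    training = list(training)
    remaining = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(training)
        confident, uncertain = [], []
        for value in remaining:
            label, dist = classify(value, cents)
            (confident if dist <= max_distance else uncertain).append((value, label))
        if not confident:
            break  # nothing new was learned this round
        training.extend(confident)
        remaining = [value for value, _ in uncertain]
    return training, remaining


# Two human-labeled seeds (hypothetical data), six unlabeled items.
seed = [(1.0, "A"), (10.0, "B")]
unlabeled = [2.0, 3.0, 4.0, 9.0, 8.0, 7.0]
training, remaining = bootstrap(seed, unlabeled)
print(len(training), remaining)
```

With the seeds alone, 4.0 and 7.0 are too far from either centroid to be accepted. They are only picked up in the second round, after the first batch of machine-labeled items has moved the centroids, which is the effect the slide describes. A wrong label accepted early would shift the centroids too, so the threshold matters.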
INDIVIDUAL VS. AGGREGATE

So far we’ve considered classification of individual items. This is the conventional machine-learning approach.

Items w, x, y, z are each assigned to a category: A: y; B: w; C: x, z; D: (none)
INDIVIDUAL VS. AGGREGATE

What if we want to know the size of each category, rather than which items are in which category? e.g. epidemiology, polls, market research

• A: 25%
• B: 25%
• C: 50%
• D: 0%
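The category sizes on this slide follow directly from the individual assignments shown earlier (A: y; B: w; C: x, z). A small sketch, with the function name `category_proportions` invented for illustration:

```python
from collections import Counter


def category_proportions(assignments, categories):
    """Share of items in each category, including categories with no items."""
    counts = Counter(assignments.values())
    total = len(assignments)
    return {category: counts[category] / total for category in categories}


# Item -> category, as classified on the earlier slides.
assignments = {"w": "B", "x": "C", "y": "A", "z": "C"}
print(category_proportions(assignments, ["A", "B", "C", "D"]))
# {'A': 0.25, 'B': 0.25, 'C': 0.5, 'D': 0.0}
```

Counting individual labels like this inherits every individual classification error, which is the limitation the following slides address.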
INDIVIDUAL VS. AGGREGATE

When considered individually, there’s a limited amount of information we have about each item.

As a result, there will be limited correlation with the training data, and therefore poor accuracy.

Each of w, x, y, z must be assigned separately (A? B? C? D?): 75% correct, 25% wrong
INDIVIDUAL VS. AGGREGATE

When considered in the aggregate, there’s much more data correlating with the training data for each category.

As a result, we can make more accurate estimates of the category proportions.

W+X+Y+Z = %A, %B, %C, %D: 85% correct, 15% wrong
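The slides do not say how the aggregate estimate is computed. One well-known technique for this kind of problem, swapped in here purely for illustration and not necessarily what Crimson Hexagon uses, is to correct a classifier's raw category share using its error rates measured on the human-labeled data (often called adjusted classify-and-count). The sketch below covers the two-category case; all numbers are hypothetical.

```python
def adjusted_proportion(observed_rate, true_positive_rate, false_positive_rate):
    """Estimate a category's true share from the share a classifier reports.

    Solves observed = tpr * p + fpr * (1 - p) for the true share p.
    """
    spread = true_positive_rate - false_positive_rate
    if spread == 0:
        raise ValueError("classifier carries no signal: the two rates are equal")
    estimate = (observed_rate - false_positive_rate) / spread
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


# Hypothetical setup: the category truly makes up 40% of all items, and the
# classifier is right 75% of the time on both members and non-members.
true_share = 0.40
tpr, fpr = 0.75, 0.25
observed = tpr * true_share + fpr * (1 - true_share)  # about 0.45
print(round(observed, 4), round(adjusted_proportion(observed, tpr, fpr), 4))
```

In this toy case the raw count overstates the category by about five points, while the corrected estimate recovers the true share even though one item in four is still individually misclassified.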
INDIVIDUAL VS. AGGREGATE

Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data.

S+T+U+V+W+X+Y+Z = %A, %B, %C, %D: 95% correct, 5% wrong
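One way to see why more items can tighten an aggregate estimate is a back-of-the-envelope standard error. The sketch assumes a two-category estimate obtained by correcting a classifier's observed share with its true-positive and false-positive rates, treats those rates as known exactly, and treats items as independent. None of this comes from the slides, and it does not reproduce their 85% and 95% figures.

```python
import math


def aggregate_standard_error(n_items, observed_rate, true_positive_rate, false_positive_rate):
    """Approximate standard error of a corrected category share.

    The observed share of n independent items has binomial sampling error
    sqrt(q * (1 - q) / n); dividing by (tpr - fpr) carries it through the correction.
    """
    spread = true_positive_rate - false_positive_rate
    if spread == 0:
        raise ValueError("classifier carries no signal: the two rates are equal")
    sampling_error = math.sqrt(observed_rate * (1.0 - observed_rate) / n_items)
    return sampling_error / abs(spread)


# Hypothetical classifier (tpr 0.75, fpr 0.25) reporting a 45% share.
for n in (4, 8, 1_000, 1_000_000):
    print(n, round(aggregate_standard_error(n, 0.45, 0.75, 0.25), 4))
```

The error shrinks with the square root of the number of items while the human-labeled data stays fixed. In practice the error rates are themselves estimated from that labeled data, so their uncertainty sets a floor that more unlabeled items cannot remove.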
CONCLUSION

• Bigger data is important

• Better data is important

• Better algorithms are important

• The sweet spot is when one leverages the other

Bigger data can lead to better algorithms.
QUESTIONS?
