SlideShare ist ein Scribd-Unternehmen logo
1 von 10
Acml 2012 contest
Hierarchical Committee Machines to Detect
Frauds in Mobile Advertising

Description about the contest and the dataset provided can
be found here http://palanteer.sis.smu.edu.sg/fdma2012/

Evaluation metric: MAP
                            S. Shivashankar and P. Manoj
                                  Ericsson Research India
Feature engineering/ derived
attributes (1)
S.NO       Feature                         Description                                                                   Comments

     1     Number of unique ip             No of unique ip's per pid                                                     Helped
     2     Number of unique cid            No of unique cid's per pid                                                    Helped
     3     Number of unique cntr           No of unique cntr's per pid                                                   Helped
     4     Number of unique category       No of unique categories per pid                                               Helped

     5     Total clicks                    Total clicks per pid                                                          Helped
     6     Category                        Category name for which clicks exist per pid                                  Helped

     7     Country feature vector          Country wise clicks per pid. 1 X C vector, where C is the total number of     Did not help
                                             countries                                                                      much
     8     Category feature vector         Category wise clicks per pid. 1 X N, where N is the number of categories      Did not help
                                                                                                                            much
     9     Clicks per category             No of clicks per category per pid                                             Helped

    10     Countries with highest number   Countries sorted according to number of clicks and K countries with highest   Did not help
             of clicks                       clicks per cid are appended.                                                   much




Ericsson Internal | 2012-10-19 | Page 2
Feature engineering/ derived
attributes (2)
S.NO      Feature                                           Description                                                              Comments

   11     Bank account given or not                         Boolean attribute : 0 if the bank account for pid is not given, else     Did not help
                                                              1                                                                         much
   12     Address given or not                              Boolean attribute : 0 if the address for pid is not given, else 1        Did not help
                                                                                                                                        much
   13     Top country                                       Country with highest clicks per pid                                      Did not help
                                                                                                                                        much
   14     Cluster id                                        Cluster clicks data into predefined number of clusters, say 5, add       Did not help
                                                               the distribution of clicks within the clusters as a feature vector.      much
   15     No of referrers                                   No of unique referrers per pid                                           Helped

   16     Number of days                                    Number of days pid is active                                             Helped
   17     Clicks per day                                    Number of clicks per day per pid                                         Helped (A)

   18     Sum of difference in time                         Sum of time difference between each click for a pid                      Helped

   19     Average of difference in time                     Average over difference in time between each click for a pid             Helped

   20     Standard deviation of sum of difference in time   SD of time difference between each click for a pid                       Did not help
                                                                                                                                        much




Ericsson Internal | 2012-10-19 | Page 3
Feature engineering/ derived
attributes (3)
S.NO         Feature                                    Description                                          Comments

       21    Clicks per category                        Total number of clicks per category per pid          Helped (B)

       22    Average clicks - day                       Average of clicks per day per pid                    Did not help much

       23    Average clicks - referrer                  Average of clicks per referrer per pid               Did not help much

       24    No of agents                               No of unique agents per pid                          Helped (C)


       25    Sum of difference of clicks – ip and cid   Sum of difference of clicks per ip per cid per pid   Did not help much

       26    Sum of clicks                              Duplicate clicks sum                                 Did not help much

       27    Average clicks – agent                     Average clicks per agent per pid                     Helped – LAD

       28    Average clicks - ip                        Average of clicks per ip per pid                     Helped – LAD

       29    Average clicks - cid                       Average of clicks per cid per pid                    Helped – LAD

       30    Average clicks - cntr                      Average of clicks per cntr per pid                   Helped – LAD (D)




Ericsson Internal | 2012-10-19 | Page 4
Methods used
› We posed this problem as a two class problem, rather than 3 class, since there are efficient methods
  for binary class classification.
       – Fraud and Observation are grouped together
       – Observation and OK are grouped together

› Fraud and Observation grouped together helped better than 3 class and other 2 class setups.

› Datasets
       – First 10 attributes that helped were grouped into dataset A, 13 attributes into dataset B, 14 into dataset C and 18
         into dataset D. A, B, C, D are marked in the previous slides.

› Algorithms
       – J48, REP tree, LAD tree, AODE
       – Note that dataset D performs well with LAD tree only

› Approaches for class imbalance
       – Cost sensitive classification
       – Ensemble learning




Ericsson Internal | 2012-10-19 | Page 5
observations
        Method                                                 Dataset A           Dataset B    Dataset C     Dataset D

        Decorate with j48                                                  38.54        41.99         43.19           43.29

        Bagging with REP tree                                              32.99        39.64         41.64           40.99

        Bagging with Cost sensitive classifier with LAD tree               38.57        43.06         46.28           47.57

        Kstar                                                              17.87        27.54         29.87   -

        AODE                                                               19.01        38.75   -                     41.27




                Note that results using classifiers that performed well and
                   that were giving diverse results (to help in ensemble
                 learning) are given here. Not all classifiers we tried are
                                      presented here.



Ericsson Internal | 2012-10-19 | Page 6
Hierarchical Committee
machines

                                          Datasets with different groupings of
                                                       attributes


                          CMA                    CMB                 CMC               CMD


                                                       Combined
                                                         CM

                                                                           Score on the validation set – 51.49
                                             x          p(fraud|x)          Score on the test set – 38.0744




Ericsson Internal | 2012-10-19 | Page 7
Discussions (1)
› Typical methods such as over-sampling, under-sampling, SMOTE, HDDT did not help
       – Sampling methods might have to be investigated carefully to see how they can be useful, since it
         is a widely accepted method for scenarios with class imbalance.

› Cost-sensitive classification helped with few classifiers like LAD Tree

› Random Forest did not perform better than other tree counterparts with ensemble
  learner. And was not diverse to help in the final committee machine

› Bayesian based ranking methods such as AODE performs well with more attributes

› With more attributes LAD tree performs well individually, but does not produce so diverse
  results on dataset C and D.

› Memory based methods such as kStar do not perform well individually, but helped as part
  of the committee.




Ericsson Internal | 2012-10-19 | Page 8
Discussions (2)
› Most of the fraud clicks belonged to publishers whose category was ‘AD’ and
  ‘MC’

› Common intuition to use duplicate ip per publisher did not help much

› Country information did not help much

› Surprisingly phone agent (model) information of the users helped

› Time information was critically important for good performance
    – Might need further investigations/refinements to improve results




Ericsson Internal | 2012-10-19 | Page 9
Acml kites

Weitere ähnliche Inhalte

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Empfohlen

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Acml kites

  • 1. Acml 2012 contest Hierarchical Committee Machines to Detect Frauds in Mobile Advertising Description about the contest and the dataset provided can be found here http://palanteer.sis.smu.edu.sg/fdma2012/ Evaluation metric: MAP S. Shivashankar and P. Manoj Ericsson Research India
  • 2. Feature engineering/ derived attributes (1) S.NO Feature Description Comments 1 Number of unique ip No of unique ip's per pid Helped 2 Number of unique cid No of unique cid's per pid Helped 3 Number of unique cntr No of unique cntr's per pid Helped 4 Number of unique category No of unique categories per pid Helped 5 Total clicks Total clicks per pid Helped 6 Category Category name for which clicks exist per pid Helped 7 Country feature vector Country wise clicks per pid. 1 X C vector, where C is the total number of Did not help countries much 8 Category feature vector Category wise clicks per pid. 1 X N, where N is the number of categories Did not help much 9 Clicks per category No of clicks per category per pid Helped 10 Countries with highest number Countries sorted according to number of clicks and K countries with highest Did not help of clicks clicks per cid are appended. much Ericsson Internal | 2012-10-19 | Page 2
  • 3. Feature engineering/ derived attributes (2) S.NO Feature Description Comments 11 Bank account given or not Boolean attribute : 0 if the bank account for pid is not given, else Did not help 1 much 12 Address given or not Boolean attribute : 0 if the address for pid is not given, else 1 Did not help much 13 Top country Country with highest clicks per pid Did not help much 14 Cluster id Cluster clicks data into predefined number of clusters, say 5, add Did not help the distribution of clicks within the clusters as a feature vector. much 15 No of referrers No of unique referrers per pid Helped 16 Number of days Number of days pid is active Helped 17 Clicks per day Number of clicks per day per pid Helped (A) 18 Sum of difference in time Sum of time difference between each click for a pid Helped 19 Average of difference in time Average over difference in time between each click for a pid Helped 20 Standard deviation of sum of difference in time SD of time difference between each click for a pid Did not help much Ericsson Internal | 2012-10-19 | Page 3
  • 4. Feature engineering/ derived attributes (3) S.NO Feature Description Comments 21 Clicks per category Total number of clicks per category per pid Helped (B) 22 Average clicks - day Average of clicks per day per pid Did not help much 23 Average clicks - referrer Average of clicks per referrer per pid Did not help much 24 No of agents No of unique agents per pid Helped (C) 25 Sum of difference of clicks – ip and cid Sum of difference of clicks per ip per cid per pid Did not help much 26 Sum of clicks Duplicate clicks sum Did not help much 27 Average clicks – agent Average clicks per agent per pid Helped – LAD 28 Average clicks - ip Average of clicks per ip per pid Helped – LAD 29 Average clicks - cid Average of clicks per cid per pid Helped – LAD 30 Average clicks - cntr Average of clicks per cntr per pid Helped – LAD (D) Ericsson Internal | 2012-10-19 | Page 4
  • 5. Methods used › We posed this problem as a two class problem, rather than 3 class, since there are efficient methods for binary class classification. – Fraud and Observation are grouped together – Observation and OK are grouped together › Fraud and Observation grouped together helped better than 3 class and other 2 class setups. › Datasets – First 10 attributes that helped were grouped into dataset A, 13 attributes into dataset B, 14 into dataset C and 18 into dataset D. A, B, C, D are marked in the previous slides. › Algorithms – J48, REP tree, LAD tree, AODE – Note that dataset D performs well with LAD tree only › Approaches for class imbalance – Cost sensitive classification – Ensemble learning Ericsson Internal | 2012-10-19 | Page 5
  • 6. observations Method Dataset A Dataset B Dataset C Dataset D Decorate with j48 38.54 41.99 43.19 43.29 Bagging with REP tree 32.99 39.64 41.64 40.99 Bagging with Cost sensitive classifier with LAD tree 38.57 43.06 46.28 47.57 Kstar 17.87 27.54 29.87 - AODE 19.01 38.75 - 41.27 Note that results using classifiers that performed well and that were giving diverse results (to help in ensemble learning) are given here. Not all classifiers we tried are presented here. Ericsson Internal | 2012-10-19 | Page 6
  • 7. Hierarchical Committee machines Datasets with different groupings of attributes CMA CMB CMC CMD Combined CM Score on the validation set – 51.49 x p(fraud|x) Score on the test set – 38.0744 Ericsson Internal | 2012-10-19 | Page 7
  • 8. Discussions (1) › Typical methods such as over-sampling, under-sampling, SMOTE, HDDT did not help – Sampling methods might have to be investigated carefully to see how they can be useful, since it is a widely accepted method for scenarios with class imbalance. › Cost-sensitive classification helped with few classifiers like LAD Tree › Random Forest did not perform better than other tree counterparts with ensemble learner. And was not diverse to help in the final committee machine › Bayesian based ranking methods such as AODE performs well with more attributes › With more attributes LAD tree performs well individually, but does not produce so diverse results on dataset C and D. › Memory based methods such as kStar do not perform well individually, but helped as part of the committee. Ericsson Internal | 2012-10-19 | Page 8
  • 9. Discussions (2) › Most of the fraud clicks belonged to publishers whose category was ‘AD’ and ‘MC’ › Common intuition to use duplicate ip per publisher did not help much › Country information did not help much › Surprisingly phone agent (model) information of the users helped › Time information was critically important for good performance – Might need further investigations/refinements to improve results Ericsson Internal | 2012-10-19 | Page 9

Hinweis der Redaktion

  1. 2011-10-19