Embedded Automatic Model
       Training and Forecasting
in an Enterprise Software Application
(… or how to embed a data mining consultant in a box)

    Presented to the SF Bay ACM Data Mining SIG
              March 11, 2009 by Greg Makowski
           Principal Consultant, Golden Data Mining
Outline
  Challenge:
    How to automate not only forecasting, but model
    training?

  Solution:
    Focus on a vertical market application
    Deeply investigate the business & technical issues

  Result:
    An enterprise application
    Up to a 30% reduction in $ lost to over and under
    stock
                                                         1
Challenge: Business Pain Point
 JDA Software (who owns the IP) has dozens of
 enterprise retail supply chain applications

 The Replenishment software does a very good
 job keeping store shelves stocked at the right
 level when sales are steady
   Moves product from warehouse to DC to store
   Sales are NOT STEADY during sales events!

 PAIN POINT: The event planner has to estimate
 the lift in sales for every store-item combination:
   (6k stores) * (1k to 4k items) = up to 24 million store-item lift estimates.
                                                                     2
Retail
                                 (context)

Challenge: 16 Page Newspaper Insert
Can vary by region or ZIP
Event Lift Forecasting (ELF)

   Lift is a multiplier for the increase in sales over
   normal
     “Prod X in Store Y will sell 6.8 times more than normal”
   Normal sales are around the event, for the same:
     time period (e.g. Thu – Sun), a week before and after
     (non-overlapping)
     Store – product                (SKU is a key for product)



            Event                     Lift

                                                                 4
Retail


Challenge: Appropriate for Business User
  A retail event planner
    Has revenue goals and a “budget” of discount $
    Has to get through a lot of detail quickly
    Does not typically create mathematical forecasts

  Uses an enterprise application to lay out the
  event flyer about 3 weeks in advance
    Decides for the event:
       departments / items / pricing / photos / language
    Uses the software to specify SKUs, images and
    lay out the flyer

                                                               5
Product Mgmt
                                                      Software Arch

Challenge: How to Productize (Agile)?
 This is not a one-off consulting project, but software
 Software engineering needs (get in the ballpark)
   right starting position, metrics, use cases, data flow
   Support good Agile development process
 Goals
   At least 90% software and 10% configuration,
   not repeated consulting projects,
      Control the Total Cost of Ownership for the product
   RELIABLE when used by the business user,
      working at the level of detail that the user cares about
                                                                  6
Product Mgmt


Challenge: Details we Have vs. Need to Start
Outline
  Challenge: How to automate not only
  forecasting, but model training?

  Solution:
    Focus on a vertical market application
    Deeply investigate the business & technical issues

  Result:
    An enterprise application
    Up to a 30% reduction in $ lost to over and under
    stock
                                                        8
Product Mgmt
                                                    Data Mining

Path to Solution
  Customer led, product driven – design a general solution

  Can’t data mine – without data
    Start data request process with several clients
    Jumpstart efforts with Monte Carlo
      Combine Census fields with noise to create a target
      The models and forecast matter less – the process MORE
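
A minimal sketch of the Monte Carlo jumpstart described above, assuming two hypothetical Census-style fields (`median_income`, `pct_children`) and an arbitrary made-up formula for the target. None of these names or coefficients come from the talk; the point is only to have data that can exercise the process end to end before client data arrives.

```python
import random


def make_synthetic_records(n, seed=0):
    """Build n fake store records whose target is Census-like signal plus noise."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # Hypothetical Census fields (assumed names and ranges).
        median_income = rng.uniform(20_000, 120_000)
        pct_children = rng.uniform(0.1, 0.6)
        # Arbitrary assumed relationship between the fields and lift.
        signal = 1.0 + 2.0 * pct_children + median_income / 100_000
        noise = rng.gauss(0.0, 0.25)
        records.append({
            "median_income": median_income,
            "pct_children": pct_children,
            "lift": max(0.1, signal + noise),  # keep the fake target positive
        })
    return records


sample = make_synthetic_records(1000, seed=42)
```

Because the target is invented, any model fit to it says nothing about real lift; it only lets the preprocessing, training and evaluation steps be wired together early.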

  Ask for business interviews
  Understand users, metrics, past challenges
    What is the BATNA?
    Best Alternative, To A New Alternative (system)?
                                                               9
Data Mining


Data Sources
 Event Attributes (for planned in 3 weeks & past)
   Pricing, placement (page #, on a page)
   Products, departments, layout
   Store features, demographics of population in
   area,
 Past events
   Flyers may have 1, 8, 12, 16, 20, 64 pages
   Same week last year may have a different prod mix
   Calculate Lift for all store-items for all past events
      Normal sales (not during an event) near in time
      Event sales; Lift = (event sales) / (non-event sales)
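
The lift calculation can be sketched as follows, assuming "normal" sales are the average of the same-length windows a week before and a week after the event. The function and parameter names are illustrative, not from the application.

```python
def lift(event_units, before_units, after_units):
    """Lift = event sales / normal (non-event) sales for one store-item.

    before_units and after_units are sales in the same days of the week,
    one week before and one week after the event (non-overlapping).
    Returns None when there were no normal sales to compare against.
    """
    normal_units = (before_units + after_units) / 2.0
    if normal_units == 0:
        return None  # lift is undefined without a non-event baseline
    return event_units / normal_units


# 68 units during a Thu-Sun event vs. 8 and 12 units in the same
# four days of the surrounding weeks gives a lift of about 6.8.
example = lift(68, 8, 12)
```

Store-items with no baseline sales need a separate rule (for example, borrowing the average from a higher level of the product hierarchy); that choice is left open here.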
                                                                  10
Data Mining


Iterative KDD Process
Knowledge Discovery in Databases (KDD)

   1. Select Data for Analysis (from prior event app)

   2. Exploratory Data Analysis (EDA)

   3. Preprocessing (manipulating fields)

   4. Model Building (Training DM algorithms)

   5. Model Evaluation (apply to hold out data)

   6. Post-process score to business value

   7. Feed the next application (Lift / store-item)
                                                             11
Data Mining
                                      Product Mgmt

 Easiest to Automate From the Core

Go through full process, automating
  model building / evaluation
  EDA & Preprocessing
  Select past marketing campaigns




                                                 12
Data Mining
Hypothesis to Select Past Campaigns:
1) Most Similar Past Events
                      Attention: your expertise will be quizzed!
  Hypothesis: a close fit to the new event is better

  Compare high level event attributes
    Number of pages of the flyer
    Discount (average, max)
    “Primary” departments, sub-dep, catg, sub-category
    … and so on

  Use “fuzzy” Euclidean distance to match past
  events to the planned event in 3 weeks
    Select the 1-10 most similar events in the last year
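
A sketch of the close-fit hypothesis. The talk does not define "fuzzy", so this assumes a plain Euclidean distance where each event attribute is divided by a scale so that no single attribute dominates. The attribute names and scale values are hypothetical.

```python
import math


def event_distance(a, b, scales):
    """Scaled Euclidean distance between two events' high-level attributes."""
    total = 0.0
    for key, scale in scales.items():
        diff = (a[key] - b[key]) / scale
        total += diff * diff
    return math.sqrt(total)


def most_similar_events(planned, past_events, scales, k=10):
    """Return the k past events closest to the planned event."""
    ranked = sorted(past_events, key=lambda e: event_distance(planned, e, scales))
    return ranked[:k]


# Assumed scales: the attribute's typical range.
scales = {"pages": 64.0, "avg_discount": 1.0}
planned = {"pages": 16, "avg_discount": 0.20}
past = [
    {"id": "a", "pages": 16, "avg_discount": 0.25},
    {"id": "b", "pages": 64, "avg_discount": 0.20},
    {"id": "c", "pages": 12, "avg_discount": 0.20},
]
closest = most_similar_events(planned, past, scales, k=2)
```

Categorical attributes such as the primary department would need their own distance term (for example 0 when equal, 1 otherwise); they are omitted to keep the sketch short.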
                                                                  13
Data Mining
Hypothesis to Select Past Campaigns:
2) Select Broadly

 Hypothesis: more training records provide a
 wide variety of behavior, and better generalization

 Exclude past marketing events that are quite
 different (but be broadly inclusive)
    If the planned event is 10-18 pages, exclude 1-2 and
    64 page events
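
One way to express "exclude only the quite different events" is a loose band around the planned page count. The factor of 3 below is an assumption chosen so that the rule reproduces the example above; it is not a value from the talk.

```python
def select_broadly(planned_min_pages, planned_max_pages, past_events, factor=3.0):
    """Keep past events whose page count is within a loose band of the plan."""
    low = planned_min_pages / factor
    high = planned_max_pages * factor
    return [e for e in past_events if low <= e["pages"] <= high]


past_events = [{"pages": p} for p in (1, 2, 8, 12, 16, 20, 64)]
kept = select_broadly(10, 18, past_events)
# For a planned 10-18 page flyer the band is roughly 3.3 to 54 pages,
# so the 1, 2 and 64 page events are dropped and the rest are kept.
```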

 Audience Quiz: VOTE for what you expect
    1) Close fit,    2) Broad fit ?

                                                               14
Data Mining


Select Past Campaigns: Results & Why
 Answer from testing:
   BROADLY selecting past marketing
   events to train for the planned
   event works much better

 Why:    Breadth → Robust Generalization
   Same sale last year was different in many ways
   Broad variety of price points / item or department
   Variety of items on cover
   Variation over geography
                                                            15
Data Mining


Exploratory Data Analysis (EDA)

    Front cover items had a lift 5.1 times higher
    than the average elsewhere!

    Lift as high as 130 – after Halloween candy
    sale

    The top 5% of the records had 90% of the lift
    (over all store-item combinations)
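
A concentration figure like the one above can be checked with a few lines. This sketch assumes the statistic is the share of summed lift held by the highest-lift records; the data is made up.

```python
def top_share(values, top_fraction=0.05):
    """Fraction of the total held by the largest top_fraction of records."""
    ordered = sorted(values, reverse=True)
    n_top = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:n_top]) / total


# Toy data: 5 very high-lift records among 100.
lifts = [100.0] * 5 + [1.0] * 95
share = top_share(lifts)  # 500 / 595, about 0.84
```

A skew this strong is one reason to test record weights and a log-transformed target, since a few records dominate any squared-error fit.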


                                                            16
Data Mining
                                                                                                                                   Retail

   Exploratory Data Analysis (EDA)
   The Cash Flow is Very Concentrated

   [Figure: two bar charts of Lift (Target) against bins of an equal number of records (0 to 20), each with a Lift series and a Baseline series.
    Left: Range of Lift Values (The Top 5% Provides 88% of the Lift), vertical axis 0 to 140.
    Right: Range of Lift Values (Omitting the Largest), vertical axis 0 to 7.]

   Test weight and target variations, lift and lift_log                                                            17
Data Mining++


Preprocessing - Categorical
  Average past Lift per category
      Percent off bin (i.e. 0%, 5%, 10%, 15% … 80%)
      Price Savings Bin (i.e. $2, $4, $6 …)
      Store hierarchy
      Product hierarchy (50k to 100k SKUs, 4-6 levels)
         Department, Sub-department, Category, Sub-Category
      Seasonality, time, month, week
      Reason codes (the event is a circular, clearance)
      Location on the page in the flyer (top right, top left..)
    Multivariate combinations – powerful & scalable
      (price bin) + (page loc bin) + (sub-cat)
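
The "average past lift per category" preprocessing can be sketched as a lookup table keyed by one or more categorical fields. Field names and records here are hypothetical.

```python
from collections import defaultdict


def average_lift_by_key(records, key_fields):
    """Average past lift for each combination of the given categorical fields."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = tuple(r[f] for f in key_fields)
        sums[key] += r["lift"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


history = [
    {"price_bin": "10%", "page_loc": "top right", "sub_cat": "candy", "lift": 4.0},
    {"price_bin": "10%", "page_loc": "top right", "sub_cat": "candy", "lift": 6.0},
    {"price_bin": "20%", "page_loc": "top left", "sub_cat": "soap", "lift": 2.0},
]
# Single-field average, and the multivariate combination from the slide.
by_price = average_lift_by_key(history, ["price_bin"])
by_combo = average_lift_by_key(history, ["price_bin", "page_loc", "sub_cat"])
```

To avoid leaking the target, the averages should be computed only from events earlier than the records they are attached to.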
                                                                   18
Data Mining++


Preprocessing – Interactions




                                          19
Data Mining


Design Of Experiments (DOE)
  Model Notebook (pictured in next slide)
    One row per model trained
    input columns: data version, model parameters
    output columns: training time, results in-sample,
    out of sample, gap (bigger is worse), and gap
    penalized results
  Sections per data mining algorithm, e.g.
    Stepwise Regression, Naïve Bayes
    Cubist (tree w/ regression in leaves)
    Neural Net
    TreeNet (from Salford Systems)
                                                           20
Data Mining++
                                                                              Instead of Occam’s Razor

Model Notebook Tracks DOE
Generalization Error = abs( in-sample result - out-of-sample result )
Conservative Result = worst( in-sample, out-of-sample ) + Generalization Error

MODEL RESULTS: Mean Abs Err (lower is better)

N in                                                              In     Out of  Gen:    Out +
ser   Eng   Analysis engine settings        Comment               Samp   Samp    In-Out  Gen
1     regr  Try target: LIFT_LOG            58 vars selected      1.184  1.264   0.08    1.34
2     regr  Try target: LIFT_LOG,           limit to 15           1.21   1.289   0.08    1.37
            limit to 15 vars
3     regr  Try target: LIFT                65 vars selected      1.732  2.654   0.92    3.58
4     regr  Try target: LIFT,               limit to 15           1.714  1.837   0.12    1.96
            limit to 15 vars
5     regr  Start with unv4_trn, and set    60 vars selected      1.20   1.42    0.22    1.63
            larger wgt's for larger lift
            values: wgt_2=1;
            IF(2<lift) wgt_2 = 2;
            IF(5<lift) wgt_2 = 3;
                                                                                                           21
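
The two notebook formulas and the weight rule from row 5 can be sketched as below. Since the metric is an error, "worst" is taken to mean the larger of the two values; that reading is an assumption, though it reproduces the Out + Gen column for rows 1 to 4.

```python
def generalization_error(in_sample, out_sample):
    """Gap between in-sample and out-of-sample error (bigger is worse)."""
    return abs(in_sample - out_sample)


def conservative_result(in_sample, out_sample):
    """Worst of the two errors, penalized by the generalization gap."""
    return max(in_sample, out_sample) + generalization_error(in_sample, out_sample)


def record_weight(lift):
    """Row 5 weighting: wgt_2=1; IF(2<lift) wgt_2=2; IF(5<lift) wgt_2=3."""
    if lift > 5:
        return 3
    if lift > 2:
        return 2
    return 1


# (series, in-sample MAE, out-of-sample MAE) for rows 1 to 4 of the notebook.
rows = [(1, 1.184, 1.264), (2, 1.21, 1.289), (3, 1.732, 2.654), (4, 1.714, 1.837)]
ranked = sorted(rows, key=lambda r: conservative_result(r[1], r[2]))
best_series = ranked[0][0]
```

Ranking by the penalized value favors model 1 here and pushes model 3 to the bottom, even though its in-sample error alone is not much worse than model 4's.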
Data Mining++


Data Mining Algorithm Improvements
  Cubist   http://www.rulequest.com/cubist-info.html
    Ross Quinlan uses a “greedy algorithm” to select
    regression fields for each leaf
    Tested and changed to “stepwise regression” for
    each leaf

                          Split 1
                 Split 2           Split 3
             Leaf 1  Leaf 2    Leaf 3  Leaf 4


                                                                    22
Data Mining
                                                                                                                                                                                                                               Retail

Training Priority – a Complex Surface
[Figure: 3-D surface chart.
 Horizontal axes: Lift bins (lift to .55, 1.0, 1.4, 2.1, 4.1) and Cash Flow bins
 (cash to $6, $8.81, $12, $17.08, $22.89, $32.36, $48.03, $79.38, $182, $7,647).
 Vertical axis, $0 to $180,000: Event Lift * Non-Event Cash Flow * Num Store-Items.]

Cash Flow = Non-Event Units/day * Price
                                                                 23
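
The two quantities labeled on the chart can be written out directly. This is only the arithmetic from the axis labels; how the application uses the resulting priority during training is not stated in the slides.

```python
def cash_flow(non_event_units_per_day, price):
    """Cash Flow = Non-Event Units/day * Price."""
    return non_event_units_per_day * price


def training_priority(event_lift, non_event_cash_flow, num_store_items):
    """Chart height: Event Lift * Non-Event Cash Flow * Num Store-Items."""
    return event_lift * non_event_cash_flow * num_store_items


# Illustrative numbers only: 2 units/day at $5, lift of 3, 4 store-items.
priority = training_priority(3.0, cash_flow(2.0, 5.0), 4)
```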
Data Mining
                                                                              Retail

    Model Notebook: Example of Describing Models
[Figure: bar chart of relative variable importance; bar lengths as extracted]
||||||||||||||||||||||||||||||||||   Top 1/6 of most expensive items, $5.30+
|||||||||||                          Past lift by store, sub-dept, dept, front page
||||||||||                           Average daily sales per item over prior events
|||                                  Average price
|                                    Item is located on the front page of the flyer
                                     Number of Saturday & Sundays in the event
                                     Item comes from the Health and Beauty dept
                                     Item in the Stationery department
                                     Avg # items sold / day




                                                                                            24
Data Mining
                                 Retail

Calculate $ of “Business Pain”

[Figure: forecast error axis with zero error at the center, Under Stock to the left and Over Stock to the right]

                                               25
Data Mining
                                                  Retail

Calculate $ of “Business Pain”

[Figure: forecast error axis with zero error at the center.
 Under Stock (left): 15% business pain $.
 Over Stock (right): 1% business pain $, with a "?" further to the right.]

  Equal mistakes, unequal PAIN in $
                                                                26
Data Mining++
                                                         Retail

Calculate $ of “Business Pain”

  No way – that could get you fired!
    New progress in getting feedback

[Figure: forecast error axis with zero error at the center.
 Under Stock (left): 15% business pain $.
 Over Stock (right): 1% business pain $, rising to 30% business pain $
 beyond a 4 week supply of the SKU (30% off sale).]

  Equal mistakes, unequal PAIN in $
                                                                      27
Data Mining


Best Models by Lift Correlation <> Best by $

  The order of “best” models ranked by
    technical metrics (correlation, MAD) vs.
    business pain metric didn’t match
    A HUGE mismatch!

  Change error function of data mining algs
    “$ over stock and under stock”



                                                             28
Data Mining++


Change Data Mining Algorithm Error Func

 Error function depends on
 knowing the threshold per SKU
   “4 weeks of normal sales volume for the SKU”
 Neural Net   (proprietary, from missile targeting)
   After epoch, i.e. forward pass of 1000 records,
   calculate this error to minimize
 Stepwise Regression & Cubist Leaf Regr.
   Change optimization problem from an RMSE of
   the target to RMSE of this error function & target
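
A sketch of an asymmetric "$ over stock and under stock" error, using the percentages from the Business Pain slides (15% under stock, 1% over stock, 30% beyond the 4 week threshold). Applying the 30% rate only to the units above the threshold is an assumption; the slides do not say how the rates combine.

```python
def business_pain(forecast_units, actual_units, price, four_week_units,
                  under_rate=0.15, over_rate=0.01, high_over_rate=0.30):
    """Dollar pain of one store-item forecast error.

    four_week_units is the per-SKU threshold: 4 weeks of normal sales volume.
    """
    error = forecast_units - actual_units
    if error < 0:
        # Under stock: lost sales on the units that were not on the shelf.
        return -error * price * under_rate
    # Over stock: cheap up to the threshold, expensive beyond it.
    normal_over = min(error, four_week_units)
    high_over = max(0, error - four_week_units)
    return price * (normal_over * over_rate + high_over * high_over_rate)


under = business_pain(90, 100, price=10.0, four_week_units=40)      # 10 units short
over = business_pain(110, 100, price=10.0, four_week_units=40)      # 10 units extra
high_over = business_pain(150, 100, price=10.0, four_week_units=40) # 50 units extra
```

With these rates, being 10 units short costs about 15 times as much as being 10 units over, which is why ranking models by a symmetric metric such as MAD can disagree with ranking by dollars.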
                                                                 29
Product Mgmt
                            Retail

Worry About Response Time




                                       30
Product Mgmt
                                                          Data Mining

User Interface: 5 Levels of Complexity
  Needs to be reliable for the simplest step
    Source data fields: use what is available & populated
    Ensure the minimum data enables a reliable system
    Use metadata to select fields (e.g. exclude low corr, empty)
  Level 1:
    Train 6 models each for 3 fast engines, or with fast settings
    (e.g. more shallow trees)
    (~30 seconds)
  Later Levels:
    Add more extensive search per engine of model parameters
    more models in DOE, use slower engines, stay time sensitive
    (~30 minutes to 2 hours)
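
Metadata-driven field selection might look like the sketch below. The `FieldMeta` structure, tag values and thresholds are all hypothetical; the slides only say that fields are tagged and that empty or low-correlation fields are excluded.

```python
from dataclasses import dataclass


@dataclass
class FieldMeta:
    name: str
    semantic_tag: str            # e.g. "spending" (assumed tag vocabulary)
    analytic_tag: str            # e.g. "categorical" or "numeric"
    fill_rate: float             # fraction of records where the field is populated
    abs_corr_with_target: float  # absolute correlation with lift


def select_fields(fields, min_fill_rate=0.5, min_abs_corr=0.02):
    """Keep fields that are populated and at least weakly related to the target."""
    return [
        f.name
        for f in fields
        if f.fill_rate >= min_fill_rate and f.abs_corr_with_target >= min_abs_corr
    ]


catalog = [
    FieldMeta("pct_off", "spending", "numeric", 0.99, 0.40),
    FieldMeta("custom_17", "unknown", "categorical", 0.03, 0.30),  # mostly empty
    FieldMeta("store_paint_color", "store", "categorical", 0.95, 0.001),  # low corr
]
usable = select_fields(catalog)
```

Selecting on tags and statistics rather than field names is what lets the same install work across clients with different custom fields.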

                                                                     31
Product Mgmt
                                                              Data Mining

How is ELF Software and Not Consulting?
 Software install and configuration process
   Connect to Event Planning, Connect to Replenishment
   Use metadata tags on custom fields
      Not dependent on field names
      Semantic (i.e. spending) and analytic tags (categorical, source)
   Preprocessing executes if supporting data is available
   Installer validates by using ELF to create test models
 End users create production models


            Event                         Lift

                                                                         32
Outline
  Challenge: How to automate not only
  forecasting, but model training?

  Solution:
    Focus on a vertical market application
    Deeply investigate the business & technical issues

  Result:
    An enterprise application
    Up to a 30% reduction in $ lost to over and under
    stock
                                                        33
Retail
                                                                    Data Mining

Result: Reduction in Business Pain
8 to 30% Reduction in Business Pain $
ELF, Model 117

   ELF      ELF over   $ over   ELF HIGH     $ High Over   ELF under   $ under
 stocking    stock     stock    Over Stock   Stock         stock       stock
   181                                                        87        $87
   190         31       $31
   183         46       $46
   115                                                        77        $233
   179        105       $105
   191                                                       109        $109
   252        101       $101
   176                                                        40        $40
   122                                                        37        $111
   169                                                         6        $6
   183        122       $122
   119                                                        37        $112
   287        130       $477
   412        141       $281
                                                                                  34
Product Mgmt
                                                    Software Dev

Result: Start Agile Process After…
  Product Requirements Document (PRD)

  Technical Specifications:
    data flow diagrams, use cases, business metric
  Working Prototype, support for testing

  Go through Agile & Scrum efforts w/ the
  software engineering group
    Review, revise, evaluate vs. business metrics


                                                               35
Product Mgmt
                                                 Data Mining

Result: Patent Application Process
  Provisional Patent     http://www.uspto.gov/
    Re-write with help of patent attorney, very formal
    Application will not be published for 18 months
  Ordinary Skill in the Art     Written by…
    Jeffrey D Ullman, Stanford Computer Science
    http://infolab.stanford.edu/~ullman/pub/focs00.html
    The idea must be “novel,” “non obvious” & useful
    Novel – does not appear in previous literature
    Non obvious – would not be discovered by one of
    “ordinary skill in the art” when the idea is needed
       How obvious is “obvious?” To how many of 100?
Data Mining


To What other Verticals Could This Apply?
 It can apply where past examples, in volume,
 relate to future examples
 Marketing / Advertising: (media independent)
   Finding new customers, clickers, buyers, spending
   Cross sell, up sell
   Customer Attrition (most likely to cancel)
 Mortgage Bond pricing       (help US out of this mess)
   rating mortgages inside,
   forecasting prepayment & default rates
 Many other verticals
Summary
 How to automate?          From the center out (i.e., like an onion)
   Narrow vertical application, known data source & feeds
 How to select training data?           Broadly
 Best improvement?
   Optimize by what gets people promoted or fired
   Change the data mining algorithm to optimize the business metric
 How to make robust?       Support, but not require, fields
   Heavy Research and Prototyping (R&P) before starting Agile
 How to succeed in business software?
   Support end users at the level of complexity they want
   Help them succeed consistently and reliably
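
 A minimal sketch of the "optimize the business metric" point, not the
 ELF implementation. It assumes a hypothetical asymmetric dollar cost for
 over and under stock: the function name, the three rates, and the
 assignment of each rate to a side are illustrative placeholders, and the
 "4 weeks of normal sales" threshold is passed in as a plain number. It
 shows how the model that wins on mean absolute error can lose on dollars.

```python
def business_pain(forecast, actual, price, four_week_units,
                  under_rate=0.15, over_rate=0.01, high_over_rate=0.30):
    """Dollar cost of one store-item forecast error.

    All rates are hypothetical placeholders, not values from the product.
    """
    error = forecast - actual
    if error < 0:
        # Under stock: units that could have sold but were not on the shelf.
        return -error * price * under_rate
    # Over stock: units up to the 4-week threshold cost little to carry;
    # units beyond it are charged at the higher rate.
    normal_over = min(error, four_week_units)
    high_over = max(error - four_week_units, 0)
    return price * (normal_over * over_rate + high_over * high_over_rate)


# Toy data (made up): two store-items, same price and threshold.
ACTUALS = [100, 100]
PRICE, FOUR_WEEK_UNITS = 10.0, 400
model_a = [90, 90]      # forecasts 10 units too low on each item
model_b = [112, 112]    # forecasts 12 units too high on each item


def mae(forecasts):
    """Mean absolute error in units (the technical metric)."""
    return sum(abs(f - a) for f, a in zip(forecasts, ACTUALS)) / len(ACTUALS)


def pain(forecasts):
    """Total business pain in dollars (the business metric)."""
    return sum(business_pain(f, a, PRICE, FOUR_WEEK_UNITS)
               for f, a in zip(forecasts, ACTUALS))


mae_a, mae_b = mae(model_a), mae(model_b)
pain_a, pain_b = pain(model_a), pain(model_b)

# Model A has the lower MAE, but model B has the lower dollar pain.
print("MAE  A vs B:", mae_a, mae_b)
print("Pain A vs B:", pain_a, pain_b)
```

 With these made-up numbers the two rankings disagree, which is the reason
 for scoring candidate models (or training them) on the dollar metric
 rather than on a symmetric error.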
Questions & Answers?
  Greg_Makowski@Yahoo.com
     (408)781-6808 cell

  This PPT will be posted on SF Bay ACM and LinkedIn, below
     http://sfbayacm.org/events/2009-03-11.php
     http://www.LinkedIn.com/in/GregMakowski
     http://fora.tv/ (Video company)

  Future talks for ACM and ACM DM SIG
     http://www.sfbayacm.org/dmsig.php

  Other talks
     http://www.meetup.com/Bay-Area-Collective-Intelligence/
     http://www.sdforum.org (business intelligence & other sigs)

Embedded Automatic Model Training And Forc In An Enterprise Sw Applic

  • 1. Embedded Automatic Model Training and Forecasting in an Enterprise Software Application (… or how to embed a data mining consultant in a box) Presented to the SF Bay ACM Data Mining SIG March 11, 2009 by Greg Makowski Principal Consultant, Golden Data Mining p , g
  • 2. Outline Challenge: How to automate not only forecasting, but model training? Solution: Focus on a vertical market application Deeply investigate the business & technical issues Result: An enterprise application Up to a 30% reduction in $ lost to over and under stock 1
  • 3. Challenge: Business Pain Point JDA Software ( (who owns the IP) has dozens of ) enterprise retail supply chain applications The R l i h Th Replenishment software does a very good t ft d gd job keeping store shelves stocked at the right level when sales are steadyy Moves product from warehouse to DC to store Sales are NOT STEADY during sales events! PAIN POINT: The event planner has to estimate the lift in sales for every store-item combination, store item (6k stores) * (1k to 4k item’s) 24 mm store-item lift estmts. 2
  • 4. Retail (context) Challenge: 16 Page Newspaper Insert Can vary by region or ZIP
  • 5. Event Lift Forecasting (ELF) Lift is a multiplier for the increase in sales over normal “Prod X in Store Y will sell 6.8 times more than normal” Normal sales are around the event, for the same: time period (i.e. Thr – Sun), a week before and after (non-overlapping) Store – product (SKU is a key for product) Event E t Lift 4
  • 6. Retail Challenge: Appropriate for Business User A retail event planner Has revenue goals and a “budget” of discount $ Has to get through a lot of detail quickly Does not typically create mathematical forecasts Uses an enterprise application to layout the event flyer about 3 weeks in advance Decides for the event: departments / items / pricing / photos / language Uses the software to specify SKU’s, images and l layout th fl t the flyer 5
  • 7. Product Mgmt Software Arch Challenge: How to Productize (Agile)? This is not a one-off consulting project, but SW Software engineering needs (get in the ballpark) right starting p g g position, metrics, use cases, data flow , , , Support good Agile development process Goals At least 90% software and 10% configuration, not repeated consulting projects projects, Control the Total Cost of Ownership for the product RELIABLE when used by the business user user, working at the level of detail that the user cares about 6
  • 8. Product Mgmt Challenge: Details we Have vs. Need to Start
  • 9. Outline g Challenge: How to automate not only y forecasting, but model training? Solution: Focus on a vertical market application Deeply investigate the business & technical i D li i hb i h i l issues Result: An enterprise application Up to a 30% reduction in $ lost to over and under 30 educt o ost o e a d u de stock 8
  • 10. Product Mgmt Data Mining Path to Solution Customer lead, product driven – design general Can’t data mine – without data Start data request process with several clients Jumpstart efforts with Monte Carlo Combine Census fields with noise to create a target The models and forecast matter less – the process MORE Ask for business interviews Understand users, metrics, past challenges What is the BATNA? Best Alternative, To A New Alternative (system)? 9
  • 11. Data Mining Data Sources Event Attributes (for planned in 3 weeks & past) Pricing, placement (page #, on a page) Products, departments, layout Store f S features, d demographics of population in hi f li i area, Past events Flyers may have 1, 8, 12, 16, 20, 64 pages Same week last year may have a different prod mix Calculate Lift for all store-items for all past events Normal sales (not during an event) near in time Event sales; Lift = (event sales) / (non-event sales) 10
  • 12. Data Mining Iterative KDD Process Knowledge Discovery in Databases (KDD) Select Data for Analysis (from prior event app) 1. Exploratory Data Analysis (EDA) 2. Preprocessing (manipulating fields) p g( p g ) 3. Model Building (Training DM algorithms) 4. Model Evaluation (appl to hold o t data) (apply out 5. 5 Post-process score to business value 6. Feed the next application (Lift / store-item) 7. 11
  • 13. Data Mining Product Mgmt Easiest to Automate From the Core Go through full process, automating model building / evaluation EDA & Preprocessing Select past marketing campaigns 12
  • 14. Data Mining Hypothesis to Select Past Campaigns: 1) Most Similar Past Events Attention: your expertise will be quizzed! Hypothesis: a close fit to the new event is better Compare high level event attributes Number of pages of the flyer Discount (average, max) “Primary” departments, sub-dep, catg, sub-category … and so on Use “fuzzy” Euclidian distance to match past events to the planned event in 3 weeks Select the 1-10 most similar events in the last year 13
  • 15. Data Mining Hypothesis to Select Past Campaigns: 2) Select Broadly Hypothesis: more training records p yp g provides a wide variety of behavior, and better generalization Exclude past marketing events that are quite different (but be broadly inclusive) If the planned event is 10-18 pages, exclude 1-2 and 64 page events Audience Quiz: VOTE for what you expect 1) Close fit, fit 2) Broad fit ? 14
  • 16. Data Mining Select Past Campaigns: Results & Why g Answer from testing: BROADLY selecting past marketing events to train for the planned event works much better Why: Breadth Robust G Generalization Same sale last year was different in many ways Broad variety of price points / item or department Variety of items on cover Variation V i ti over geography h 15
  • 17. Data Mining Exploratory Data Analysis (EDA) Front cover items had a lift 5.1 times higher than the average elsewhere! Lift as high as 130 – after Halloween candy sale l The top 5% of the records had 90% of the lift (over all store-item combinations) 16
  • 18. Data Mining Retail Exploratory Data Analysis (EDA) The Cash Flow is Very Concentrated Range of Lift Values Range of Lift Values (Omitting the Largest) (The Top 5% Provides 88% of the Lift) 7 140 6 120 5 100 t) Lift (Target Lift (Target) 4 80 3 60 2 ? 40 1 20 0 0 012 3456 7 8 9 10 11 12 13 14 15 16 17 18 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Bins of an Equal num ber of Records Bins of an Equal num ber of Records Lift Baseline Lift Baseline Test weight and target variations, lift and lift_log 17
  • 19. Data Mining++ Preprocessing - Categorical Average past Lift per category Percent off bin (i.e. 0%, 5%, 10%, 15% … 80%) Price Savings Bin (i.e. $2, $4, $6 …) Store hi S hierarchy h Product hierarchy (50k to 100k SKUs, 4-6 levels) Department, Sub department, Category Sub-Catgegory Department Sub-department Category, Sub Catgegory Seasonality, time, month, week Reason codes (the event is a circular, clearance) Location on the page in the flyer (top right, top left..) Multivariate combinations – powerful & scalable (price bin) + (page loc bin) + (sub-cat) 18
  • 21. Data Mining Design Of Experiments (DOE) Model Notebook (pictured in next slide) One row per model trained input columns: data version, model parameters output columns: training time, results in-sample, out of sample, gap (bigger is worse), and gap penalized results Sections per data mining algorithm, i.e. Stepwise Regression Naïve Bayes Regression, Cubist (tree w/ regression in leaves) Neural Net TreeNet (from Salford Systms) 20
  • 22. Data Mining++ Instead of Occam’s Razor Model Notebook Tracks DOE Generalization Error = abs( in sample res – out of sample res ) Conservative Result = worst( in, out samp ) + Generalization Err (, p MODEL RESULTS ANALYSIS ENGINE SETTINGS Mean Abs Err (-good) 1 2 3 4 N in In Out of Gen: Out + Eng parameter 1 parameter 2 parameter 3 comment ser Samp Samp In-Out Gen 1 regr Try target: LIFT LOG LIFT_LOG 58 vars selected 1.184 1.264 1 184 1 264 0.08 0 08 1.34 1 34 limit to 15 2 regr Try target: LIFT_LOG limit to 15 1.21 1.289 0.08 1.37 vars 3 regr Try target: LIFT 65 vars selected 1.732 2.654 0.92 3.58 limit to 15 4 regr Try target: LIFT limit to 15 1.714 1.837 1 714 1 837 0.12 0 12 1.96 1 96 vars Start with unv4_trn, and set larger wgt's 5 regr for larger lift values wgt_2=1; 60 vars selected 1.20 1.42 0.22 1.63 21 IF(2<lift) wgt_2 = 2; IF(5<lift) wgt_2 = 3;
  • 23. Data Mining++ Data Mining Algorithm Improvements Cubist http://www.rulequest.com/cubist-info.html Ross Quinlan uses a “greedy algorithm” to select regression fields for each leaf Tested and changed to “stepwise regression” for stepwise regression each leaf Split 1 Split 2 p Split 3 p Leaf 1 Leaf 2 Leaf 3 Leaf 4 22
  • 24. Data Mining Retail Training Priority – a Complex Surface $180,000 $160,000 $140,000 $120,000 on-Event Cash Flow w e-Items * $100,000 Event Lift * $80,000 C Num Store $60,000 $60 000 $40,000 $20,000 N lift to 4.1 $0 No cash to $7,647 lift to 2.1 cash to $182 cash to $79.38 lift to 1.4 ca to $48.03 cash to $32.36 cash to $22.89 lift to 1.0 Lift cash to $17.08 2.54 ash Cash Flow = lift to .55 55 o h 1 cash to $8.81 $ cash to $12 cash to $6 Non-Event Units/day * Price 23
  • 25. Data Mining Retail Model Notebook: Example of Describing Models Top 1/6 of most expensive items, $5.30+ |||||||||||||||||||||||||||||||||| Past lift by store, sub dept, dept, front page store sub-dept dept ||||||||||| Average daily sales per item over prior events |||||||||| Average price ||| Item is located on the front page of the flyer | Number of Saturday & Sundays in the event Item comes from the Health and Beauty dept Item in the Stationary department Avg # items sold / day 24
  • 26. Data Mining Retail Calculate $ of “Business Pain” Business Pain zero error Over Under Stock Sk Stock 25
  • 27. Data Mining Retail Calculate $ of “Business Pain” Business Pain zero error ? 15% business pain $ 1% bus Over Under pain $ Stock Sk Stock Equal mistakes q Unequal PAIN in $ 26
  • 28. Data Mining++ Retail Calculate $ of “Business Pain” Business Pain No way – that could get you fired! New progress in getting feedback zero 30% bus error 15% business pain $ pain $ 1% bus Over Under pain $ Stock Sk Stock 4 week supply Equal mistakes q of SKU Unequal PAIN in $ 30% off sale 27
  • 29. Data Mining Best Models by Lift Correlation <> Best by $ The order of “best” models ranked by best technical metrics (correlation, MAD) vs. business pain metric did ’t match bi i t i didn’t th A HUGE mismatch! Change error function of data mining algs “$ over stock and under stock” 28
  • 30. Data Mining++ Change Data Mining Algorithm Error Func Error function depends on knowing the threshold per SKU “4 weeks of normal sales volume for the SKU 4 SKU” Neural Net (proprietary, from missile targeting) After epoch, i.e. forward pass of 1000 records, calculate this error to minimize Stepwise Regression & Cubist Leaf Regr. Change optimization problem from an RMSE of the target to RMSE of this error function & target 29
  • 31. Product Mgmt Retail Worry About Response Time 30
  • 32. Product Mgmt Data Mining User Interface: 5 Levels of Complexity Needs to make reliable for simplest step Source data fields: use what is available & populated Insure the minimum data enables a reliable system Use metadata to select fields (i e exclude low corr, empty) (i.e. corr Level 1: Train 6 models each for 3 fast engines, or with fast settings g g (i.e. more shallow trees) (~30 seconds) Later Levels: Add more extensive search per engine of model parameters more models in DOE, use slower engines, stay time sensitive (~30 minutes to 2 hours) 31
  • 33. Product Mgmt Data Mining How is ELF Software and Not Consulting? Software install and configuration process Connect to Event Planning, Connect to Replenishment Use metadata tags on custom fields Not dependent on field names Semantic (i.e. spending) and analytic tags (categorical, source) Preprocessing executes if supporting data is available Installer validates by using ELF to create test models End users create production models Event E t Lift 4 32
  • 34. Outline g Challenge: How to automate not only y forecasting, but model training? Solution: Focus on a vertical market application Deeply investigate the business & technical i D li i hb i h i l issues Result: An enterprise application Up to a 30% reduction in $ lost to over and under 30 educt o ost o e a d u de stock 33
  • 35. Retail Data Mining Result: Reduction in Business Pain 8 to 30% Reduction in Business Pain $ ELF, Model 117 ELF ELF ELF over $ over ELF HIGH $ High Over $ under under stocking stock stock Over Stock Stock stock stock 181 87 $ 87 190 31 $ 31 183 46 $ 46 115 77 $ 233 179 105 $ 105 191 109 $ 109 252 101 $ 101 176 40 $ 40 122 37 $ 111 169 6$ 6 183 122 $ 122 119 37 $ 112 287 130 $ 477 34 412 141 $ 281
  • 36. Product Mgmt / Software Dev: Result: Start Agile Process After… Product Requirements Document (PRD). Technical Specifications: data flow diagrams, use cases, business metric. Working Prototype, support for testing. Go through Agile & Scrum efforts with the software engineering group. Review, revise, evaluate vs. business metrics.
  • 37. Product Mgmt / Data Mining: Result: Patent Application Process. Provisional Patent: http://www.uspto.gov/ Re-write with help of patent attorney, very formal. Application will not be published for 18 months. “Ordinary Skill in the Art,” written by Jeffrey D. Ullman, Stanford Computer Science: http://infolab.stanford.edu/~ullman/pub/focs00.html The idea must be “novel,” “non obvious” & useful. Novel: does not appear in previous literature. Non obvious: would not be discovered by one of “ordinary skill in the art” when the idea is needed. How obvious is “obvious?” To how many of 100?
  • 38. Data Mining: To What Other Verticals Could This Apply? It can apply where past examples, in volume, relate to future examples. Marketing / Advertising (media independent): finding new customers, clickers, buyers, spending; cross sell, up sell; customer attrition (most likely to cancel). Mortgage Bond pricing (help US out of this mess): rating mortgages inside, forecasting prepayment & default rates. Many other verticals.
  • 39. Summary. How to automate? From the center out (i.e. onion). Narrow vertical application, known data source & feeds. How to select training data? Broadly. Best improvement? Optimize by what gets people promoted or fired. Change the DM algorithm to optimize the business metric. How to make robust? Support, but do not require, fields. Heavy Research and Prototyping (R&P) before starting Agile. How to succeed in business software? Support end users at the level of complexity they want. Help them succeed consistently and reliably.
  • 40. Questions & Answers? Greg_Makowski@Yahoo.com (408)781-6808 cell. This PPT will be posted on SF Bay ACM and LinkedIn, below: http://sfbayacm.org/events/2009-03-11.php http://www.LinkedIn.com/in/GregMakowski http://fora.tv/ (Video company). Future talks for ACM and ACM DM SIG: http://www.sfbayacm.org/dmsig.php Other talks: http://www.meetup.com/Bay-Area-Collective-Intelligence/ http://www.sdforum.org (business intelligence & other sigs)