Learning Linear Models with Hadoop
Ulrich Rückert




                             © 2012 Datameer, Inc. All rights reserved.


Thursday, March 28, 2013
Agenda

• What are linear models anyway?
• How to learn linear models with Hadoop
• Demo
• Tips, tricks and caveats
• Conclusion
Predictive Analytics
Example learning task
• Ad on bookseller's web page
• Will a customer buy this book?
• Training set: observations on previous customers
• Test set: new customers
• Attributes: Age, Income; target attribute: BuysBook

Let's learn a linear model!

Training Data
   Age    Income    BuysBook
   24     60000     yes
   65     80000     no
   60     95000     no
   35     52000     yes
   20     45000     yes
   43     75000     yes
   26     51000     yes
   52     47000     no
   47     38000     no
   25     22000     no
   33     47000     yes

Test Data
   Age    Income    BuysBook
   22     67000     ?
   39     41000     ?

Training Data → Model → Prediction
   Age    Income    BuysBook
   22     67000     yes
   39     41000     no
Linear Models
What's in the black box?
• Let's pretend all attributes are expert ratings
• Large positive value means yes
• Small value means no
• Intermediate value: don't know

Let the experts vote
• Sum over ratings for each row
• Larger than threshold: predict yes
• Smaller: predict no

   Expert 1    Expert 2    Prediction
   24          60          ?
   65          80          ?
   60          95          ?
Linear Models
Threshold: 97

   Expert 1        Expert 2         Sum     > threshold?
   24          +   60          =    84      no
   65          +   80          =    145     yes
   60          +   95          =    155     yes
Linear Models
Assign a weight to each expert
• Expert is mostly correct: large weight
• Expert is uninformative: zero
• Expert is consistently wrong: negative weight

Learning models
• A linear model contains weights and threshold
• Learn by finding weights with lowest error on training data

Weight 1: 0.75    Weight 2: 0.25    Threshold: 48

   Expert 1          Expert 2           Sum    > threshold?
   0.75 • 24     +   0.25 • 60     =    33     no
   0.75 • 64     +   0.25 • 80     =    68     yes
   0.75 • 60     +   0.25 • 96     =    69     yes
Linear Models
Weight 1: 0    Weight 2: 0.25    Threshold: 18

   Expert 1       Expert 2           Sum    > threshold?
   0 • 24     +   0.25 • 60     =    15     no
   0 • 64     +   0.25 • 80     =    20     yes
   0 • 60     +   0.25 • 96     =    24     yes
Linear Models
Weight 1: -0.5    Weight 2: 0.25    Threshold: -8

   Expert 1          Expert 2           Sum    > threshold?
   -0.5 • 24     +   0.25 • 60     =    3      yes
   -0.5 • 64     +   0.25 • 80     =    -12    no
   -0.5 • 60     +   0.25 • 96     =    -6     yes
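The decision rule on these slides can be sketched in a few lines of Python. The function name `predict` is my own; the weights, thresholds, and rows are the illustrative values from the slides.

```python
def predict(weights, threshold, row):
    """Predict 'yes' when the weighted sum of the attributes exceeds the threshold."""
    score = sum(w * x for w, x in zip(weights, row))
    return "yes" if score > threshold else "no"

# Weights 0.75 / 0.25 with threshold 48, as on the slide:
print(predict([0.75, 0.25], 48, [24, 60]))  # score 33 -> no
print(predict([0.75, 0.25], 48, [64, 80]))  # score 68 -> yes
```

Swapping in the other weight/threshold combinations from the slides reproduces their prediction columns as well.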
Learning Linear Models
Stochastic Gradient Descent (SGD)
• Main idea: start with default weights
• For each row, check if the current weights predict correctly
• If misclassified: adjust the weights

How to adjust weights?
• If positive class: add the row
• If negative class: subtract the row

Flow: start with default weights → read next training row → do the weights
predict the correct label? If yes, continue with the next row; if no, adjust
the weights first.
Learning Linear Models
repeat
   row = readNextRow();
   if(predict(weights, row.attributes) != row.class)
      weights += row.class * row.attributes;
      threshold += -row.class;
   endif
end

Weight 1: 1    Weight 2: -1    Threshold: 0

   Age           Income           Sum    Prediction
   1 • ?     +   -1 • ?      =    ?      ?

   Current row:  Age 24, Income 60, BuysBook +1
Learning Linear Models
Weight 1: 1    Weight 2: -1    Threshold: 0

   Age           Income            Sum     Prediction
   1 • 24    +   -1 • 60      =    -36     -1

   Current row:  Age 24, Income 60, BuysBook +1  →  misclassified
Learning Linear Models
After the weight update (weights += row.class * row.attributes):

Weight 1: 25    Weight 2: 59    Threshold: 0

   Age            Income            Sum      Prediction
   25 • 24    +   59 • 60      =    4140     +1

   Current row:  Age 24, Income 60, BuysBook +1
Learning Linear Models
After the threshold update (threshold += -row.class):

Weight 1: 25    Weight 2: 59    Threshold: -1

   Age            Income            Sum      Prediction
   25 • 24    +   59 • 60      =    4140     +1

   Current row:  Age 24, Income 60, BuysBook +1  →  now classified correctly
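The update just shown can be run as a single step of Python. The variable names are mine; the correction rule is exactly the pseudocode's.

```python
# Starting state from the slides: weights (1, -1), threshold 0,
# and the row (Age 24, Income 60) with BuysBook = yes encoded as +1.
weights, threshold = [1, -1], 0
row, label = [24, 60], +1

score = sum(w * x for w, x in zip(weights, row))   # 1*24 + (-1)*60 = -36
if (1 if score > threshold else -1) != label:      # -36 is not > 0: misclassified
    weights = [w + label * x for w, x in zip(weights, row)]
    threshold += -label

print(weights, threshold)   # [25, 59] -1, matching the slides
```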
Learning Linear Models
Next row:

Weight 1: 25    Weight 2: 59    Threshold: -1

   Age           Income           Sum    Prediction
   25 • ?    +   59 • ?      =    ?      ?

   Current row:  Age 30, Income 30, BuysBook -1
Learning Linear Models
Weight 1: 25    Weight 2: 59    Threshold: -1

   Age            Income            Sum      Prediction
   25 • 30    +   59 • 30      =    2520     +1

   Current row:  Age 30, Income 30, BuysBook -1  →  misclassified
Learning Linear Models
After the weight update (the row is subtracted, since the class is -1):

Weight 1: -5    Weight 2: 29    Threshold: -1

   Age            Income            Sum     Prediction
   -5 • 30    +   29 • 30      =    720     +1

   Current row:  Age 30, Income 30, BuysBook -1
Learning Linear Models
After the threshold update:

Weight 1: -5    Weight 2: 29    Threshold: 0

   Age            Income            Sum     Prediction
   -5 • 30    +   29 • 30      =    720     +1

   Current row:  Age 30, Income 30, BuysBook -1  →  still misclassified;
   training continues with further passes over the data
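The whole loop above is easy to make runnable. Replacing the unbounded `repeat` with a fixed number of passes is my simplification, as is every name below; the +1/-1 class encoding follows the slides.

```python
def train(rows, labels, epochs=10):
    """Perceptron-style training: adjust weights and threshold on each mistake."""
    weights = [0.0] * len(rows[0])
    threshold = 0.0
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            score = sum(w * x for w, x in zip(weights, row))
            if (1 if score > threshold else -1) != label:
                weights = [w + label * x for w, x in zip(weights, row)]
                threshold += -label
    return weights, threshold

# The two rows from the worked example: (24, 60) -> +1 and (30, 30) -> -1.
w, t = train([[24, 60], [30, 30]], [+1, -1])
```

After a few passes the returned weights classify both training rows correctly.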
Learning - Convergence
repeat
   row = readNextRow();
   if(predict(weights, row.attributes) != row.class)
      weights += row.class * row.attributes;
      threshold += -row.class;
   endif
end
Learning - Convergence
Shrink each update with a fixed learning rate:

repeat
   row = readNextRow();
   if(predict(weights, row.attributes) != row.class)
      weights += 0.001 * row.class * row.attributes;
      threshold += -row.class;
   endif
end
Learning - Convergence
Decay the learning rate over time:

for i = 1 to ∞
   row = readNextRow();
   if(predict(weights, row.attributes) != row.class)
      weights += (1/i) * row.class * row.attributes;
      threshold += -row.class;
   endif
end
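The decaying-rate variant above can be sketched as follows: the weight update on step i is scaled by 1/i, so early mistakes move the weights a lot and later ones barely at all. All names are mine; the threshold update is left unscaled, matching the pseudocode.

```python
def train_decaying(stream):
    """One pass over a stream of (row, label) pairs with a 1/i learning rate."""
    weights, threshold = None, 0.0
    for i, (row, label) in enumerate(stream, start=1):
        if weights is None:
            weights = [0.0] * len(row)
        score = sum(w * x for w, x in zip(weights, row))
        if (1 if score > threshold else -1) != label:
            weights = [w + (1.0 / i) * label * x
                       for w, x in zip(weights, row)]
            threshold += -label
    return weights, threshold

w, t = train_decaying([([24, 60], +1), ([30, 30], -1)])
print(w, t)   # the second correction only moves the weights by half a row
```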
Learning - Margin
for i = 1 to ∞
   row = readNextRow();
   if(margin(weights, row.attributes, threshold) <= 1)
      weights += (1/n) * row.class * row.attributes;
      threshold += -row.class;
   endif
end

Weight 1: 0.5    Weight 2: 0.25    Threshold: 26.5

   Age             Income             Margin    Prediction
   0.5 • 24    +   0.25 • 60     =    27        +1

   Current row:  Age 24, Income 60, BuysBook +1  →  the prediction is already
   correct, but the score 27 clears the threshold 26.5 by only 0.5 ≤ 1, so the
   margin condition still triggers an update
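The margin check can be written down under one common reading of the pseudocode: the margin is label · (score − threshold), positive only for correctly classified rows and small when the row sits close to the decision boundary. The `margin` helper and its signature are my own sketch.

```python
def margin(weights, threshold, row, label):
    """Signed distance-like quantity: positive iff the row is classified correctly."""
    score = sum(w * x for w, x in zip(weights, row))
    return label * (score - threshold)

# The slide's row: score 27 vs threshold 26.5. The prediction is correct,
# but the margin is only 0.5 <= 1, so the update rule still fires.
m = margin([0.5, 0.25], 26.5, [24, 60], +1)
print(m)   # 0.5
```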
Learning - Regularization
Attributes are often correlated
• Contributions cancel out
• This leads to unreasonably large weights...
• ... and models which are not robust to noise

Regularization
• Make sure weights don't get too large
• L2 regularization: weights are proportional to attribute quality

for i = 1 to ∞
   row = readNextRow();
   if(margin(weights, row.attributes, threshold) <= 1)
      weights += (1/n) * row.class * row.attributes;
      threshold += -row.class;
   endif
end

Weight 1: 0.5    Weight 2: 0.5    Threshold: 30

   Age             Income            Sum    Prediction
   0.5 • 24    +   0.5 • 60     =    42     +1

   Current row:  Age 24, Income 60, BuysBook +1
Learning - Regularization

            for i = 1 to ∞
               row = readNextRow();
               if(margin(weights, row.attributes, threshold) <= 1)
                  weights += (1/i) * row.class * row.attributes;
                  threshold += -row.class;
               endif
            end

            Attributes are often correlated
           •    Contributions cancel out
           •    This leads to unreasonably large weights...
           •    ... and models which are not robust to noise

            Regularization
           •    Make sure weights donʼt get too large
           •    L2 regularization: weights are proportional to attribute quality

                Example with runaway weights (same score, far less robust):
                Weight 1 = 1000, Weight 2 = -399.3, Threshold = 30
                Row: Age = 24, Income = 60, BuysBook = +1
                1000 • 24 + (-399.3) • 60 = 42 > 30  →  predict +1
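The cancellation is easy to check numerically: both weight settings above score the example row identically, so training error alone cannot tell the small weights from the runaway ones. A quick sketch (the `score` helper is mine, not from the talk):

```python
# Both weight settings from the slides give the same score on the example
# row (Age = 24, Income = 60), so the data alone cannot rule out the
# huge, mutually cancelling weights.
def score(weights, attributes):
    # weighted vote: sum of weight * attribute value
    return sum(w * x for w, x in zip(weights, attributes))

row = [24, 60]
print(score([0.5, 0.5], row))      # 42.0
print(score([1000, -399.3], row))  # ~42.0 as well, up to rounding
```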
Learning - Regularization

            for i = 1 to ∞
               row = readNextRow();
               if(margin(weights, row.attributes, threshold) <= 1)
                  weights += (1/i) * row.class * row.attributes;
                  threshold += -row.class;
               endif
            end
            weights = i/(i+r) * weights;

            Attributes are often correlated
           •    Contributions cancel out
           •    This leads to unreasonably large weights...
           •    ... and models which are not robust to noise

            Regularization
           •    Make sure weights donʼt get too large
           •    L2 regularization: shrink the weights by i/(i+r) so they stay proportional to attribute quality

                Weight 1 = 1000, Weight 2 = -399.3, Threshold = 30
                Row: Age = 24, Income = 60, BuysBook = +1
                1000 • 24 + (-399.3) • 60 = 42 > 30  →  predict +1
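Put together, the loop above fits in a few lines. A minimal in-memory sketch (the function names, the epoch loop, and the `r=1.0` default are my own choices; the slides stream rows indefinitely instead):

```python
def margin(weights, attributes, threshold, label):
    # signed distance of the row from the decision boundary
    score = sum(w * x for w, x in zip(weights, attributes))
    return label * (score - threshold)

def sgd(rows, epochs=10, r=1.0):
    # rows: list of (attributes, label) pairs with label in {+1, -1}
    weights = [0.0] * len(rows[0][0])
    threshold = 0.0
    i = 0
    for _ in range(epochs):
        for attributes, label in rows:
            i += 1
            if margin(weights, attributes, threshold, label) <= 1:
                step = 1.0 / i                       # decreasing step size
                for j, x in enumerate(attributes):
                    weights[j] += step * label * x
                threshold += -label
    # final i/(i+r) shrinking step from the slide
    weights = [i / (i + r) * w for w in weights]
    return weights, threshold

w, t = sgd([([1.0], 1), ([-1.0], -1)], epochs=50)
```

On the toy rows above the learned model separates the two classes: `w[0] * 1.0 - t` comes out positive and `w[0] * -1.0 - t` negative.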
Implementation on Hadoop

       Map-Reduce
        •    Input data must be in random order
        •    Mapper: send data to the reducer in random order
        •    Reducer: run the actual Stochastic Gradient Descent

       Evaluation and Parameter Selection
        •    Perform several runs with varying parameters
        •    Learn on the training set, evaluate on the test set
        •    Many runs with partial data are often better than one run with all data
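A rough sketch of that split, written as plain functions rather than real Hadoop jobs (the function names and row format are mine; in practice the mapper and reducer would be separate tasks, e.g. under Hadoop Streaming):

```python
import random

def mapper(row):
    # emit (random_key, row): sorting by key during the shuffle
    # delivers rows to the reducer in random order
    return (random.random(), row)

def reducer(shuffled_rows):
    # one sequential margin-based SGD pass over the shuffled stream;
    # each row is (label, attributes) with label in {+1, -1}
    weights, threshold = None, 0.0
    for i, (label, attrs) in enumerate(shuffled_rows, start=1):
        if weights is None:
            weights = [0.0] * len(attrs)
        score = sum(w * x for w, x in zip(weights, attrs))
        if label * (score - threshold) <= 1:     # margin check
            step = 1.0 / i                       # decreasing step size
            for j, x in enumerate(attrs):
                weights[j] += step * label * x
            threshold += -label
    return weights, threshold

# simulate the map and shuffle phases, then run the reducer
rows = [(1, [1.0]), (-1, [-1.0])] * 50
shuffled = [row for _, row in sorted(mapper(r) for r in rows)]
weights, threshold = reducer(shuffled)
```

Sorting the mapper output by its random key plays the role of the shuffle phase here; on a cluster that randomization is exactly what routes rows to the single reducer in arbitrary order.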
Demo
Learning Linear Models

       Stochastic Gradient Descent: Pros and Cons
        •    One sweep over the data: easy to implement on top of Hadoop
        •    Flexible: support vector machines, logistic regression, etc.
        •    Provides a good enough estimate instead of the optimum
        •    Parameter selection and evaluation are crucial

       Alternative: convex optimization
        •    Formulate learning as a numerical optimization problem
        •    On Hadoop: usually L-BFGS
        •    See Vowpal Wabbit for a large-scale implementation
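The flexibility point deserves a concrete example: moving between a support vector machine and logistic regression only swaps the per-row update rule. An illustrative sketch (my own naming; threshold updates omitted for brevity):

```python
import math

def hinge_update(w, x, y, step):
    # SVM-style rule: update only when the margin y * (w . x) is <= 1
    if y * sum(wi * xi for wi, xi in zip(w, x)) <= 1:
        return [wi + step * y * xi for wi, xi in zip(w, x)]
    return w

def logistic_update(w, x, y, step):
    # logistic regression rule: every row contributes, scaled by
    # how much probability the model fails to give the true label
    score = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-y * score))    # P(true label | x)
    return [wi + step * y * (1.0 - p) * xi for wi, xi in zip(w, x)]
```

With `w = [0.0]`, `x = [1.0]`, `y = 1`, `step = 1.0`, the hinge rule returns `[1.0]` and the logistic rule `[0.5]`; with `w = [2.0]` the hinge rule leaves the weights untouched because the margin exceeds 1.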
Conclusion

       Linear Models
        •    Prediction based on a weighted vote and a threshold

       Stochastic Gradient Descent
        •    Adjust the weight vector iteratively for each misclassified row
        •    Decreasing step size to ensure convergence
        •    Margins and regularization for robustness

       Implementation
        •    Mapper provides random order, reducer performs SGD
        •    Evaluation and parameter selection are crucial
Thanks
                 urueckert@datameer.com
