Statistically adaptive learning for a general class of cost functions (SA L-BFGS)*

Stephen Purpura†§ (spurpura@contextrelevant.com), Dustin Hillard§ (dhillard@contextrelevant.com), Mark Hubenthal‡§ (mhubenthal@contextrelevant.com), Jim Walsh§ (jwalsh@contextrelevant.com), Scott Golder§ (sgolder@contextrelevant.com), Scott Smith§ (ssmith@contextrelevant.com)

arXiv:1209.0029v3 [cs.LG] 5 Sep 2012

* The material of this work is patent pending.
† In absentia at Department of Information Science, Cornell University, Ithaca, NY, 14850.
‡ Recent graduate of Department of Mathematics, University of Washington, Seattle, WA, 98195.
§ Context Relevant, Inc., Seattle, WA, 98121.

Abstract

We present a system that enables rapid model experimentation for tera-scale machine learning with trillions of non-zero features, billions of training examples, and millions of parameters. Our contribution to the literature is a new method (SA L-BFGS) for changing batch L-BFGS to perform in near real-time by using statistical tools to balance the contributions of previous weights, old training examples, and new training examples to achieve fast convergence with few iterations. The result is, to our knowledge, the most scalable and flexible linear learning system reported in the literature, beating standard practice with the current best system (Vowpal Wabbit and AllReduce). Using the KDD Cup 2012 data set from Tencent, Inc., we provide experimental results to verify the performance of this method.

1 Introduction

The demand for analysis and predictive modeling derived from very large data sets has grown immensely in recent years. One of the big problems in meeting this demand is the fact that data has grown faster than the availability of raw computational speed. As such, it has been necessary to use intelligent and efficient approaches when tackling the data training process. Specifically, there has been much focus on problems of the form

    min_{θ∈R^l} Σ_{i=1}^{m} l(θ^T x^(i); y^(i)) + λS(θ),    (1)

where x^(i) ∈ R^l is the feature vector of the ith example, y^(i) ∈ {0, 1} is the label, θ ∈ R^l is the vector of fitting parameters, l is a smooth convex loss function, and S is a regularizer. Some of the more popular methods for determining θ include linear and logistic regression, respectively. The optimal such θ corresponds to a linear predictor function p_θ(x) = θ^T x that best fits the data in some appropriate sense, depending on l and S. We remark that cost functions of the form (1) have a structure which is naturally decomposable over the given training examples, so that all computations can potentially be run in parallel over a distributed environment.
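To make the decomposable structure of (1) concrete, the sketch below evaluates a cost of this form with the per-example terms kept separate, so each term (and its gradient) could be computed on a different worker and summed. The choice of logistic loss with an L2 penalty, and the function names themselves, are illustrative assumptions rather than the only instance of l and S the system supports.

```python
import numpy as np

def logistic_loss(margin, y):
    """Smooth convex loss l(theta^T x; y) for a label y in {0, 1}."""
    # Numerically stable log(1 + exp(-z)), with z = margin for y = 1 and z = -margin for y = 0.
    z = margin if y == 1 else -margin
    return np.log1p(np.exp(-abs(z))) + max(-z, 0.0)

def cost(theta, X, Y, lam):
    """Cost of the form (1): a sum of independent per-example losses plus a regularizer."""
    data_term = sum(logistic_loss(x @ theta, y) for x, y in zip(X, Y))
    return data_term + lam * 0.5 * float(theta @ theta)  # S(theta) = ||theta||^2 / 2, as one example
```

Because the data term is a plain sum over examples, partial sums over disjoint shards of the data can be computed independently and combined, which is exactly the property exploited later by the distributed implementations discussed in Section 3.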
However, in practice it is often the case that the model must be updated as new data is acquired. That is, we want to answer the question of how θ changes in the presence of new training examples. One naive approach would be to completely redo the entire data analysis process from scratch on the larger data set. The current fastest method in such a case utilizes the L-BFGS quasi-Newton minimization algorithm with AllReduce along a distributed cluster (Agarwal et al., 2012). The other extreme is to apply the method of online learning, which considers one data point at a time and updates the parameters θ according to some form of gradient descent; see (Langford et al., 2009) and (Duchi et al., 2010). However, on the tera-scale, neither approach is as appealing or as fast as what we can achieve with our method. We describe these recent approaches to solving (1) in a bit more detail in Section 3. For completeness, we also refer the reader to recent work on large-scale optimization in (Schraudolph et al., 2007) and (Bottou, 2010).

Our approach in simple terms lies somewhere between pure L-BFGS minimization (widely accepted as the fastest brute-force optimization algorithm whenever the function is convex and smooth) and online learning. While L-BFGS offers accuracy and robustness with a relatively small number of iterations, it fails to take direct advantage of situations where the new data is not
very different from that acquired previously, or situations where the new data is extremely different from the old data. Certainly, one can initiate a new optimization job on the larger data set with the parameter θ initialized to the previous result. But we are then left with the problem of optimizing over increasingly larger training sets at one time. Similarly, online learning methods only consider one data point at a time and cannot reasonably change the parameter θ by too much at any given step without risk of severely increasing the regret. Nor can they typically reach as small an error count as that of a global gradient descent approach. On the other hand, we will show that it is possible to combine the advantages of both methods: in particular, the small number of iterations and speed of L-BFGS when applied to reasonably sized batches, and the ability of online learning to "forget" previous data when the new data has changed significantly.

The outline of the paper is as follows. In Section 2 we describe the general problem of interest. In Section 3 we briefly mention current widely used methods of solving (1). In Section 4 we outline the statistically adaptive learning algorithm. Finally, in Section 5 we benchmark the performance of our two related methods (Context Relevant FAST L-BFGS and SA L-BFGS, respectively) against Vowpal Wabbit, one of the fastest currently available routines, which incorporates the work of (Agarwal et al., 2012). We also include the associated Area Under Curve (AUC) rating, which, roughly speaking, is a number in [0, 1], where a value of 1 indicates perfect prediction by the model.

2 Background and Problem Setup

In this paper the underlying problem is as follows. Suppose we have a sequence of time-indexed data sets {X_t, Y_t}, where X_t = {x_t^(i)}_{i=1}^{m_t}, Y_t = {y_t^(i)}_{i=1}^{m_t}, t = 0, 1, . . . , t_f is the time index, and m_t ∈ N is the batch size (typically independent of t). Such data is given sequentially as it is acquired (e.g. t could represent days), so that at t = 0 one only has possession of {X_0, Y_0}. Alternatively, if we are given a large data set all at once, we could divide it into batches indexed sequentially by t. In general, we use the notation x_t with subscript t to denote a time-dependent vector at time t, and we write x_{t,j} to denote the jth component of x_t. For each t = 0, . . . , t_f we define

    f_t(θ) = Σ_{i=1}^{m_t} l(θ^T x_t^(i); y_t^(i)),
    φ_t(θ) = f_t(θ) + λS(θ),    (2)

where, as before, l is a given smooth convex loss function and S is a regularizer. Also let θ_t be the parameter vector obtained at time t, which in practice will approximately minimize λS(θ) + Σ_{s=0}^{t} f_s(θ). We define the regret with respect to a fixed (optimal) parameter θ* (in practice we do not know the true value of θ*) by

    R_φ(t) = r_φ(θ*, t) := Σ_{s=0}^{t} [φ_s(θ_s) − φ_s(θ*)]    (3)
           = Σ_{s=0}^{t} [f_s(θ_s) + λS(θ_s) − f_s(θ*) − λS(θ*)].

An effective algorithm is then one in which the sequence {θ_t}_{t=0}^{t_f} suffers sub-linear regret, i.e., R_φ(t) = o(t).

As mentioned earlier, there has been much work done regarding how to solve (1) with a variety of methods. Before proceeding with a basic overview of the two most popular approaches to large-scale machine learning in Section 3, it is important to understand the underlying assumptions and implications of the previous body of work. In particular, we mention the work of Léon Bottou in (Bottou, 2010) regarding large-scale optimization with stochastic gradient descent, and (Schraudolph et al., 2007) regarding stochastic online L-BFGS (oL-BFGS) optimization. Such works and others have demonstrated that with a lot of randomly shuffled data, a variety of methods (oL-BFGS, 2nd order stochastic gradient descent, and averaged stochastic gradient descent) can work in fewer iterations than L-BFGS because:

 (a) Small data learning problems are fundamentally different from large data learning problems;

 (b) The cost functions as framed in the literature have well suited curvature near the global minimum.

We remark that the key problem for all quasi-Newton based optimization methods (including L-BFGS) has been that noise associated with the approximation process – with specific properties dependent on each learning problem – causes adverse conditions which can make L-BFGS (and its variants) fail. However, the problem of noise leading to non-positive curvature near the minimum can be averted if the data is appropriately shaped (i.e. feature selection plus proper data transformations). For now though, we ignore the issue and assume we already have a methodology for "feature shaping" that assures under operational conditions that the curvature of the resulting learning problem is well-suited to the algorithm that we describe.

3 Previous Work

3.1 Online Updates

In online learning, the problem of storage is completely averted as each data point is discarded once it is read. We remark that one can essentially view this approach as a special case of the statistically adaptive method described in Section 4 with a batch size of 1. Such algorithms iteratively make a prediction θ_t ∈ R^l and then
receive a convex loss function φ_t as in (2). Typically, φ_t(θ) = l(θ^T x_t; y_t) + λS(θ), where (x_t, y_t) is the data point read at time t. We then make an update to obtain θ_{t+1} using a rule that is typically based on the gradient of l(θ^T x_t; y_t) in θ. Indeed, the simplest approach (with no regularization term) would be the update rule θ_{t+1} = θ_t − η_t ∇_θ l(θ_t^T x_t; y_t).
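As a concrete illustration of such a rule, the sketch below performs one online update for a logistic loss; the decaying step size η_t = η_0/√(t+1) and the function name are arbitrary choices for the example, not the schedules analyzed in the cited work.

```python
import numpy as np

def online_sgd_step(theta, x_t, y_t, t, eta0=0.1):
    """One online update: theta_{t+1} = theta_t - eta_t * grad_theta l(theta_t^T x_t; y_t)."""
    p = 1.0 / (1.0 + np.exp(-(x_t @ theta)))  # logistic prediction, used here as the example loss
    grad = (p - y_t) * x_t                    # gradient of the logistic loss with respect to theta
    eta_t = eta0 / np.sqrt(t + 1)             # illustrative decaying learning rate
    return theta - eta_t * grad
```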
However, there currently exist more sophisticated update schemes which can achieve better regret bounds for (3). In particular, the method of Duchi, Hazan, and Singer is a type of subgradient method with adaptive proximal functions. It is proven that their ADAGRAD algorithm can achieve theoretical regret bounds of the form

    R_φ(t) = O(‖θ*‖_2 tr(G_t^{1/2}))  and    (4)
    R_φ(t) = O(max_{s≤t} ‖θ_s − θ*‖_2 tr(G_t^{1/2})),

where in general G_t = Σ_{s=0}^{t} g_s g_s^T is an outer product matrix generated by the sequence of gradients g_s = ∇_θ f_s(θ_s) (Duchi et al., 2010). We remark that since the loss function gradients converge to zero under ideal conditions, the estimate (4) is indeed sublinear, because the decay of the gradients, however slow, counters the linear growth in the size of G_t^{1/2}.
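For intuition, the widely used diagonal variant of ADAGRAD keeps only the diagonal of G_t and scales each coordinate of the gradient by the history of its own squared gradients. The sketch below shows that simplified form only as an illustration; it omits the proximal-function machinery and is not necessarily the exact algorithm analyzed in (Duchi et al., 2010).

```python
import numpy as np

def adagrad_step(theta, grad, G_diag, eta=0.1, eps=1e-8):
    """Diagonal ADAGRAD update: per-coordinate step sizes from accumulated squared gradients."""
    G_diag = G_diag + grad ** 2                           # running diagonal of G_t
    theta = theta - eta * grad / (np.sqrt(G_diag) + eps)  # coordinates with large gradient history move less
    return theta, G_diag
```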
3.2 Vowpal Wabbit with Gradient Descent

Vowpal Wabbit is a freely available software package which implements the method described briefly in (Agarwal et al., 2012). In particular, it combines online learning and brute-force gradient descent optimization in a slightly different way. First, one does a single pass over the whole data set using online learning to obtain a rough choice of parameter θ. Then, L-BFGS optimization of the cost function is initiated with the data split across a cluster. The cost function and its gradient are computed locally, and AllReduce is utilized to collect the global function values and gradients in order to update θ. The main improvement of this algorithm over previous methods is the use of AllReduce with the Hadoop file structure, which significantly cuts down on communication time compared with MapReduce. Moreover, the baseline online learning step is done with a learning rate chosen in an optimal manner as discussed in (Karampatziakis and Langford, 2011).

4 Our Approach

4.1 Least Squares Digression

Before we describe the statistically adaptive approach for minimizing a generic cost function, consider the following simpler scenario in the context of least squares regression. Given data {X, Y} with X ∈ R^{m×l} and Y ∈ R^m, we want to choose θ that solves min_{θ∈R^l} ‖Xθ − Y‖_2^2. Assuming invertibility of X^T X, it is well known that the solution is given by

    θ = (X^T X)^{-1} X^T Y.    (5)

Now, suppose that we have time-indexed data {X_s, Y_s}_{s=0}^{T} with X_s ∈ R^{m_s×l} and Y_s ∈ R^{m_s}. In order to update θ_t given {X_{t+1}, Y_{t+1}}, first we must check how well θ_t fits the newly augmented data set. We do this by evaluating

    Σ_{s=0}^{t+1} ‖X_s θ − Y_s‖_2^2    (6)

with θ = θ_t. Depending on the result, we choose a parameter λ ∈ [0, 1] that determines how much weight to give the previous data when computing θ_{t+1}; equivalently, 1 − λ represents how much we would like to "forget" the previous data (or emphasize the new data). A value of λ = 0 indicates that all previous data has been thrown out, the case λ = 1/2 corresponds to the case when past and present are weighed equally, and the case λ = 1 corresponds to the case when θ_t fits the new data perfectly (i.e. (6) is equal to zero).

Let X_{[0,t]} be the (Σ_{s=0}^{t} m_s) × l matrix [X_0^T, X_1^T, . . . , X_t^T]^T obtained by concatenation, and similarly define the length-(Σ_{s=0}^{t} m_s) vector Y_{[0,t]} := [Y_0^T, Y_1^T, . . . , Y_t^T]^T. Then (6) is equivalent to ‖X_{[0,t+1]} θ − Y_{[0,t+1]}‖_2^2. Now, when using a particular second order Newton method for minimizing a smooth convex function, the computation of the inverse Hessian matrix is analogous to computing (X_{[0,t]}^T X_{[0,t]})^{-1} above. As t grows large, the cumulative normal matrix X_{[0,t]}^T X_{[0,t]} becomes increasingly costly to compute from scratch, as does its inverse. Fortunately, we observe that

    X_{[0,t+1]}^T X_{[0,t+1]} = X_{[0,t]}^T X_{[0,t]} + X_{t+1}^T X_{t+1}.    (7)

However, if we want to incorporate the flexibility to weigh current data differently relative to previous data, we need to abandon the exact computation of (7). Instead, letting A_t denote the approximate analogue of X_{[0,t]}^T X_{[0,t]}, we introduce the update

    A_{t+1} ← (2 / (1 + µ^2)) (µ^2 A_t + X_{t+1}^T X_{t+1}),    (8)

where µ satisfies λ = µ^2 / (1 + µ^2).

The actual update of θ_t is performed as follows. Define

    Ỹ_{[0,t]} := X_{[0,t]}^T Y_{[0,t]} = [X_0^T, . . . , X_t^T] [Y_0^T, Y_1^T, . . . , Y_t^T]^T = Σ_{s=0}^{t} X_s^T Y_s.
Up to time t, the standard solution to the least squares problem on the data {X_{[0,t]}, Y_{[0,t]}} is then

    θ = (X_{[0,t]}^T X_{[0,t]})^{-1} Ỹ_{[0,t]}.    (9)

Now let B_t be an approximation to Ỹ_{[0,t]}. We define B_{t+1} by the update

    B_{t+1} = (2 / (1 + µ^2)) (µ^2 B_t + X_{t+1}^T Y_{t+1}).

Finally, we set

    θ_{t+1} := A_{t+1}^{-1} B_{t+1}.    (10)

It is easily verified that (10) coincides with the standard update (9) when µ = 1.
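A minimal NumPy sketch of the weighted updates (8)–(10) is given below. The explicit solve stands in for A_{t+1}^{-1} B_{t+1}; in practice one would maintain a factorization rather than re-solving from scratch, and the default µ value is purely illustrative.

```python
import numpy as np

def decayed_ls_update(A, B, X_new, Y_new, mu=1.0):
    """One step of (8)-(10): A tracks X_[0,t]^T X_[0,t], B tracks X_[0,t]^T Y_[0,t].

    mu = 1 recovers the exact recursive normal-equation update; smaller mu
    down-weights the contribution of the previously accumulated data.
    """
    w = 2.0 / (1.0 + mu ** 2)
    A_next = w * (mu ** 2 * A + X_new.T @ X_new)   # update (8)
    B_next = w * (mu ** 2 * B + X_new.T @ Y_new)   # analogous update for B_t
    theta_next = np.linalg.solve(A_next, B_next)   # update (10): theta_{t+1} = A_{t+1}^{-1} B_{t+1}
    return A_next, B_next, theta_next
```

With mu = 1 both the decay factor µ^2 and the normalization 2/(1 + µ^2) equal 1, so A and B accumulate new batches exactly as in (7) and (9).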
4.2 Statistically Adaptive Learning

Returning to our original problem, we start with the parameter θ_0 obtained from some initial pass through of {X_0, Y_0}, typically using a particular gradient descent algorithm. In what follows, we will need to define an easily evaluated error function to be applied at each iteration, mildly related to the cumulative regret (3):

    I(t, θ) := (Σ_{s=0}^{t} Σ_{i=1}^{m_s} |p_θ(x_s^(i)) − y_s^(i)|) / (Σ_{s=0}^{t} |X_s|).    (11)

We remark that I(t, θ) represents the relative number of incorrect predictions associated with the parameter θ over all data points from time s = 0 to s = t. Moreover, because p_θ is a linear function of x, I is very fast to evaluate (essentially O(m), where m = Σ_{s=0}^{t} m_s).

Given θ_t, we compute I(t + 1, θ_t). There are two extremal possibilities:

 1. I(t + 1, θ_t) is significantly larger than I(t, θ_t). More precisely, we mean that I(t + 1, θ_t) − I(t, θ_t) > σ(t), where σ(t) is the standard deviation of {I(s, θ_s)}_{s=0}^{t}. In this case the data has significantly changed, and so θ must be modified.

 2. Otherwise, there is no need to change θ and we set θ_{t+1} = θ_t.

In the former case, we use the magnitude of I(t + 1, θ_t) − I(t, θ_t) to determine a subsample of the old and new data with M_old and M_new points chosen, respectively (see Figure 1). Roughly speaking, the larger the difference, the more weight will be given to the most recent data points. The sampling of previous data points serves to anchor the model so that the parameters do not overfit to the new batch at the expense of significantly increasing the global regret. This is a generalization of online learning methods, where only the most recent single data point is used to update θ. From the subsample chosen, we then apply a gradient descent optimization routine where the initialization of the associated starting parameters is generated from those stored from the previous iteration. In the case of L-BFGS, the rank 1 matrices used to approximate the inverse Hessian stored from the previous iteration are used to initialize the new descent routine. We summarize the process in Algorithm 1.

[Figure 1: Subsampling of the partitioned data stream at time t and times 0, 1, . . . , t − 1, respectively. Data points are drawn both from the previous batches 1, 2, . . . , t − 1 and from the current batch t.]

Algorithm 1 Statistically Adaptive Learning Method (SA L-BFGS)
Require: Error checking function I(t, θ)
  Given data {X_s, Y_s}_{s=0}^{t_f}
  Run gradient descent optimization on {X_0, Y_0} to compute θ_0
  for t = 1 to t_f do
    if I(t + 1, θ_t) − I(t, θ_t) > σ(t) then
      Choose M_old and M_new
      Subsample data
      Run L-BFGS with initial parameter θ_t to obtain θ_{t+1}
    else
      θ_{t+1} ← θ_t
    end if
  end for
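The following is a compact Python sketch of the driver loop in Algorithm 1. It uses scipy's L-BFGS-B routine as a stand-in for the production solver; the uniform subsample helper, the error-function interface, and all constants are illustrative assumptions rather than the system's actual components.

```python
import numpy as np
from scipy.optimize import minimize

def subsample(batches, k, rng):
    """Uniformly sample up to k examples from the concatenation of the given (X, Y) batches."""
    X = np.vstack([b[0] for b in batches])
    Y = np.concatenate([b[1] for b in batches])
    idx = rng.choice(len(Y), size=min(k, len(Y)), replace=False)
    return X[idx], Y[idx]

def sa_lbfgs(batches, objective, error_fn, theta0, m_old=1000, m_new=100_000, seed=0):
    """Schematic SA L-BFGS loop.

    objective(theta, X, Y) returns (cost, gradient); error_fn(theta, batches) plays the role of I(t, theta).
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    history = [error_fn(theta, batches[:1])]        # I(0, theta_0)
    for t in range(1, len(batches)):
        err_new = error_fn(theta, batches[:t + 1])  # I(t+1, theta_t) over all data seen so far
        sigma = np.std(history)                     # sigma(t) estimated from past error values
        if err_new - history[-1] > sigma:
            # The data has changed significantly: subsample old and new points, then re-optimize.
            X_old, Y_old = subsample(batches[:t], m_old, rng)
            X_new, Y_new = subsample(batches[t:t + 1], m_new, rng)
            X = np.vstack([X_old, X_new])
            Y = np.concatenate([Y_old, Y_new])
            # Note: scipy does not expose L-BFGS curvature-pair memory, so only theta is warm-started here.
            res = minimize(objective, theta, args=(X, Y), method="L-BFGS-B", jac=True)
            theta = res.x
        history.append(error_fn(theta, batches[:t + 1]))
    return theta
```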
As a typical example, at some time t we might have M_old = 1000, M_new = 100,000, and Σ_{t=0}^{t_f} m_t = 1·10^9. This would be indicative of a batch {X_{t+1}, Y_{t+1}} with significantly higher error under the current parameter θ_t than for previously analyzed batches.

We remark that when learning on each new batch of data, there are two main aspects that can be parallelized. First, the batch itself can be partitioned and distributed among nodes in a cluster via AllReduce to significantly speed up the evaluation of the cost function and its gradient, as is done in (Agarwal et al., 2012). Furthermore, one can run multiple independent optimization routines in parallel where the distribution used to subsample from X_{t+1} and ∪_{s=0}^{t} X_s is varied. The resulting parameters θ obtained from each separate instance can then be
statistically compared so as to make sure that the model is not overly sensitive to the choice of sampling distribution. Otherwise, having θ be too highly dependent on the choice of subsample would invalidate using a stochastic gradient descent-based approach. A by-product of this ability to simultaneously experiment with different samplings is that it provides a quick means to check the consistency of the data.
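One simple form such a comparison could take is sketched below: given the parameter vectors produced by the independent subsampled runs, flag coordinates whose spread across runs is large relative to their typical magnitude. The particular statistic, the threshold, and the function name are illustrative assumptions, not the test used in the production system.

```python
import numpy as np

def flag_sampling_sensitive_coords(thetas, tol=0.05):
    """thetas: array of shape (n_runs, n_features), one fitted parameter vector per subsampling run.

    Returns indices of coordinates whose variation across runs is large relative to their
    average magnitude, i.e. coordinates that look sensitive to the choice of subsample.
    """
    thetas = np.asarray(thetas, dtype=float)
    spread = thetas.std(axis=0)                  # per-coordinate variation across runs
    scale = np.abs(thetas).mean(axis=0) + 1e-12  # typical magnitude, guarded against zeros
    return np.where(spread / scale > tol)[0]
```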
Finally, we remark that the SA L-BFGS method can be reasonably adapted to account for changes in the selected features as new data is acquired. Indeed, it is very appealing within the industry to be able to experiment with different choices of features in order to find those that matter most, while still being able to use the previously computed parameters θ_t to speed up the optimization on the new data. Of course, it is possible to directly apply an online learning approach in this situation, since previous data points have already been discarded. But typical gradient descent algorithms do not a priori have the flexibility to be directly applied in such cases, and they typically perform worse than batch methods such as L-BFGS (Agarwal et al., 2012).

5 Experiments

5.1 Description of Dataset and Features

We consider data used to predict the click-through-rate (pCTR) of online ads. An accurate model is necessary in the search advertising market in order to appropriately rank ads and price clicks. The data contains 11 variables and 1 output, corresponding to the number of times a given ad was clicked by the user among the number of times it was displayed. In order to reduce the data size, instances with the same user id, ad id, query, and setting are combined, so that the output may take on any positive integer value. For each instance (training example), the input variables serve to classify various properties of the ad displayed, in addition to the specific search query entered. This data was acquired from sessions of the Tencent proprietary search engine and was posted publicly on www.kddcup2012.org (Tencent, 2012).

For these experiments we build a basic model that learns from the identifiers provided in the training set. These include unique identifiers for each query, ad, keyword, advertiser, title, description, display url, user, ad position, and ad depth (further details available in the KDD documentation). We compute a position- and depth-normalized click-through rate for each identifier, as well as for combinations (conjunctions) of these identifiers. Then at training and testing time we annotate each example with these normalized click-through rates. Additionally, before running the optimization, it is necessary to build appropriate feature vectors (i.e. shape the data). We will not go into detail regarding how this is done, except to mention that the number of features generated is on the order of 1000.

5.2 Model 1 Results

For our first set of experiments, we compare the performance of Vowpal Wabbit (VW) using its L-BFGS implementation and the Context Relevant Flexible Analytics and Statistics Technology™ L-BFGS implementation running on 10 Amazon m1.xlarge instances. The time measured (in seconds) is only the time required to train the models. The features are generated and cached for each implementation prior to training.

Performance was measured using the Area Under Curve (AUC) metric because this was the methodology used in (Tencent, 2012). In short, the AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. More precisely, it is computed via Algorithm 3 in (Fawcett, 2004). We compute our AUC results over a portion of the public section of the test set that has about 2 million examples.
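Equivalently, the AUC can be computed from the rank-sum (Mann–Whitney) statistic of the scores, as in the short sketch below. This is an illustrative implementation of the probabilistic definition above, not the Algorithm 3 of (Fawcett, 2004) used in our experiments, and it ignores tied scores for brevity.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive outranks a randomly chosen negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)            # 1-based ranks of the scores (ties not averaged)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2  # Mann-Whitney U statistic
    return u / (n_pos * n_neg)
```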
Model 1 includes only the basic id features, with no conjunction features, and achieves an AUC of 0.748, as shown in Table 1. A simple baseline, which can be generated by predicting the mean CTR for each ad id, would perform at approximately an AUC of 0.71. The winner of (Tencent, 2012) performed at an AUC of 0.80. However, the winning model was substantially more complicated and used many additional features that were excluded from this simple demonstration. In future work, we will explore more sophisticated feature sets.

The Context Relevant and VW models both achieve the same AUC on the test set, which validates that the basic gradient descent and L-BFGS implementations are functionally equivalent. The Context Relevant model completes learning between four and five times more quickly. Our implementation is heavily optimized to reduce computation time as well as memory footprint. In addition, our implementation utilizes an underlying MapReduce implementation that provides robustness to job and node failures.[1]

[1] Context Relevant had to re-write the AllReduce network implementation to add error checking so that the AllReduce system was robust to errors that were encountered during normal execution of experiments on Amazon's EC2 systems. Without these changes, we could not keep AllReduce from hanging during the experiments. There is no graceful recovery from the loss of a single node.

Table 1: Model 1 Results For Different Learning Mechanisms (VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS)

              VW L-BFGS    CR L-BFGS
    seconds         490          114
    AUC            .748         .748

5.3 Model 2 Results

Context Relevant's implementation of SA L-BFGS is designed to accelerate and simplify learning iterative changes to models. Using information gleaned from the initial L-BFGS pass, SA L-BFGS develops a sampling
strategy to minimize sampling-induced noise when learning new models that are derived from previous models. The larger the divergence in the models, the less speed-up is likely. For this set of experiments, we add a conjunction feature that captures the interaction between a query id and an ad id, which has frequently been an important feature in well known advertising systems. We then compare the speed and accuracy of Vowpal Wabbit (VW) using its L-BFGS implementation, the Context Relevant Flexible Analytics and Statistics Technology™ L-BFGS implementation, and the Context Relevant Flexible Analytics and Statistics Technology™ SA L-BFGS implementation (SA) running on 10 Amazon X1.Large instances. Here the baseline L-BFGS models are trained with the standard practice for adding a new feature: the models are retrained on the entire dataset. The time measured (in seconds) is only the time required to train the models. The features are generated and cached for each implementation prior to training.

Again, performance was measured using the Area Under Curve (AUC) metric. Table 2 lists the results for each algorithm, and shows that AUC improves in comparison to Model 1 when the new feature is added. As with Model 1, the VW and CR models achieve similar AUC, and the basic L-BFGS CR learning time is significantly faster. Furthermore, we show that SA L-BFGS also achieves similar AUC, but in less than one tenth of the time (which likewise implies one tenth of the compute cost required). This speed-up can enable a large increase in the number of experiments, without requiring additional compute or time. It is important to note that the speed of L-BFGS and SA L-BFGS is essentially tied to the number of features and the number of examples for each iteration. The primary performance gains that can be found are: (a) reducing the number of iterations; (b) reducing the number of examples; or (c) reducing the number of features with non-zero weights. SA L-BFGS adopts the former two strategies. A reduction in the number of features with non-zero weights can be forced through aggressive regularization, but at the expense of specificity.

Table 2: Model 2 Results For Different Learning Mechanisms (VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS; SA = Context Relevant FAST SA L-BFGS)

              VW L-BFGS    CR L-BFGS    CR SA L-BFGS
    seconds         515          115               9
    AUC            .750         .752            .751

6 Conclusions

We have presented a new tera-scale machine learning system that enables rapid model experimentation. The system uses a new version of L-BFGS to combine the robustness and accuracy of second order gradient descent optimization methods with the memory advantages of online learning. This provides a model building environment that significantly lowers the time and compute cost of asking new questions. The ability to quickly ask and answer experimental questions vastly expands the set of questions that can be asked, and therefore the space of models that can be explored to discover the optimal solution.

SA L-BFGS is also well suited to environments where the underlying distribution of the data provided to a learning algorithm is shifting. Whether the shift is caused by changes in user behavior, changes in market pricing, or changes in term usage, SA L-BFGS can be empirically tuned to dynamically adjust to the changing conditions. One can utilize the parallelized approach in (Agarwal et al., 2012) on each batch of data in the time variable, with the additional freedom to choose the batch size. Furthermore, the statistical aspects of the algorithm provide a useful way to check the consistency of the data in real time. However, like all L-BFGS implementations that rely on small, reduced, or sampled data sets, increased sampling noise from the L-BFGS estimation process affects the quality of the resulting learning algorithm. Users of this new algorithm must take care to provide smooth convex loss functions for optimization. The Context Relevant Flexible Analytics and Statistics Technology™ is designed to provide such functions for optimization.

7 About Context Relevant and the Authors

Context Relevant was founded in March 2012 by Stephen Purpura and Christian Metcalfe. The company was initially funded by friends, family, Seattle angel investors, and Madrona Venture Group.

Stephen Purpura – CEO & Co-Founder – Stephen works as the CEO and CTO of the company. He has more than 20 years of experience, is listed as an inventor of five issued United States patents, and served as program manager of Microsoft Windows and the Chief Security Officer of MSFDC, one of the first Internet bill payment systems. Stephen is recognized as a leading expert in the
fields of machine learning, political micro-targeting and predictive analytics.

Stephen received a bachelor's degree from the University of Washington, a master's degree from Harvard, where he was part of the Program on Networked Governance at the John F. Kennedy School of Government, and he is set to be granted a PhD in information science from Cornell.

Jim Walsh – VP Engineering – Jim brings 23 years of experience in technology innovation and engineering management to Context Relevant. He founded, built and led the Cosmos team – Microsoft's massive scale distributed data storage and analysis environment, underpinning virtually all Microsoft products including Bing – and founded the Bing multimedia search team. In his final position at Microsoft, Jim served as the principal architect for Microsoft advertising platforms.

Prior to joining Microsoft, Jim created two software startups, wrote the second-ever Windows application and created custom computer animation software for the television broadcast industry. He is recognized as one of the world's leading network performance experts and has been granted twenty-two technology patents that range from performance to user interface design. Jim earned a bachelor of science degree in computing science from the University of Alberta.

Dustin Rigg Hillard – Director of Engineering and Data Scientist – Dustin is a recognized data science and machine learning expert who has published more than 30 papers in these areas. Previously at Microsoft and Yahoo!, he spent the last decade building systems that significantly improve large-scale processing and machine learning for advertising, natural language and speech.

Prior to joining Context Relevant, Dustin Hillard worked for Microsoft, where he worked to improve speech understanding for mobile applications and XBox Kinect. Before that he was at Yahoo! for three years, where he focused on improving ad relevance. His research in graduate school focused on automatic speech recognition and statistical translation. Dustin incorporates approaches from these and other fields to learn from massive amounts of data with supervised, semi-supervised, and unsupervised machine learning approaches.

Dustin holds a bachelor's and master's degree as well as a PhD in Electrical Engineering from the University of Washington.

Scott Golder – Data Scientist & Staff Sociologist – Scott mines social networking data to investigate broad questions such as when people are happiest (mornings and weekends) and how Twitter users form new social ties. His work has been published in the journal Science as well as top computer science conferences by the ACM and IEEE, and has been covered in media outlets such as MSNBC, The New York Times, The Washington Post and National Public Radio. He has also been profiled by LiveScience's "ScienceLives".

He has worked as a research scientist in the Social Computing Lab at HP Labs and has interned at Google, IBM and Microsoft. Scott holds a master's degree from MIT, where he worked with the Media Laboratory's Sociable Media Group, and graduated from Harvard University, where he studied Linguistics and Computer Science. Scott is currently on leave from the PhD program in Sociology at Cornell University.

Mark Hubenthal – Member of the Technical Staff – Mark recently received his PhD in Mathematics from the University of Washington. He works on inverse problems applicable to medical imaging and geophysics.

Scott Smith – Principal Engineer and Architect – Scott has experience with distributed computing, compiler design, and performance optimization. At Akamai, he helped design and implement the load balancing and failover logic for the first CDN. At Clustrix, he built the SQL optimizer and compiler for a distributed RDBMS. He holds a master's degree in computer science from MIT.

References

[Agarwal et al.2012] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. 2012. A reliable effective terascale linear learning system. February. arXiv:1110.4198v2.

[Bottou2010] L. Bottou. 2010. Large-scale machine learning with stochastic gradient descent. pages 177–187.

[Duchi et al.2010] J. Duchi, E. Hazan, and Y. Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization.

[Fawcett2004] T. Fawcett. 2004. ROC graphs: Notes and practical considerations for researchers.

[Karampatziakis and Langford2011] N. Karampatziakis and J. Langford. 2011. Online importance weight aware updates.

[Langford et al.2009] J. Langford, L. Li, and T. Zhang. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801.

[Schraudolph et al.2007] N. Schraudolph, J. Yu, and S. Günter. 2007. A stochastic quasi-Newton method for online convex optimization.

[Tencent2012] Tencent. 2012. KDD Cup 2012, Track 2 Data. http://www.kddcup2012.org/c/kddcup2012-track2/dat
Performance problems on ethernet networks when the e0m management interface i...Accenture
 
NetApp system installation workbook Spokane
NetApp system installation workbook SpokaneNetApp system installation workbook Spokane
NetApp system installation workbook SpokaneAccenture
 
Migrate volume in akfiler7
Migrate volume in akfiler7Migrate volume in akfiler7
Migrate volume in akfiler7Accenture
 
Migrate vol in akfiler7
Migrate vol in akfiler7Migrate vol in akfiler7
Migrate vol in akfiler7Accenture
 
Data storage requirements AK
Data storage requirements AKData storage requirements AK
Data storage requirements AKAccenture
 
C mode class
C mode classC mode class
C mode classAccenture
 
Akfiler upgrades providence july 2012
Akfiler upgrades providence july 2012Akfiler upgrades providence july 2012
Akfiler upgrades providence july 2012Accenture
 
Reporting demo
Reporting demoReporting demo
Reporting demoAccenture
 
Net app virtualization preso
Net app virtualization presoNet app virtualization preso
Net app virtualization presoAccenture
 
Providence net app upgrade plan PPMC
Providence net app upgrade plan PPMCProvidence net app upgrade plan PPMC
Providence net app upgrade plan PPMCAccenture
 
WSC Net App storage for windows challenges and solutions
WSC Net App storage for windows challenges and solutionsWSC Net App storage for windows challenges and solutions
WSC Net App storage for windows challenges and solutionsAccenture
 
50,000-seat_VMware_view_deployment
50,000-seat_VMware_view_deployment50,000-seat_VMware_view_deployment
50,000-seat_VMware_view_deploymentAccenture
 
Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...
Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...
Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...Accenture
 
Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11
Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11
Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11Accenture
 
Snap mirror source to tape to destination scenario
Snap mirror source to tape to destination scenarioSnap mirror source to tape to destination scenario
Snap mirror source to tape to destination scenarioAccenture
 

Mehr von Accenture (20)

Certify 2014trends-report
Certify 2014trends-reportCertify 2014trends-report
Certify 2014trends-report
 
Calabrio analyze
Calabrio analyzeCalabrio analyze
Calabrio analyze
 
Tier 2 net app baseline design standard revised nov 2011
Tier 2 net app baseline design standard   revised nov 2011Tier 2 net app baseline design standard   revised nov 2011
Tier 2 net app baseline design standard revised nov 2011
 
Perf stat windows
Perf stat windowsPerf stat windows
Perf stat windows
 
Performance problems on ethernet networks when the e0m management interface i...
Performance problems on ethernet networks when the e0m management interface i...Performance problems on ethernet networks when the e0m management interface i...
Performance problems on ethernet networks when the e0m management interface i...
 
NetApp system installation workbook Spokane
NetApp system installation workbook SpokaneNetApp system installation workbook Spokane
NetApp system installation workbook Spokane
 
Migrate volume in akfiler7
Migrate volume in akfiler7Migrate volume in akfiler7
Migrate volume in akfiler7
 
Migrate vol in akfiler7
Migrate vol in akfiler7Migrate vol in akfiler7
Migrate vol in akfiler7
 
Data storage requirements AK
Data storage requirements AKData storage requirements AK
Data storage requirements AK
 
C mode class
C mode classC mode class
C mode class
 
Akfiler upgrades providence july 2012
Akfiler upgrades providence july 2012Akfiler upgrades providence july 2012
Akfiler upgrades providence july 2012
 
NA notes
NA notesNA notes
NA notes
 
Reporting demo
Reporting demoReporting demo
Reporting demo
 
Net app virtualization preso
Net app virtualization presoNet app virtualization preso
Net app virtualization preso
 
Providence net app upgrade plan PPMC
Providence net app upgrade plan PPMCProvidence net app upgrade plan PPMC
Providence net app upgrade plan PPMC
 
WSC Net App storage for windows challenges and solutions
WSC Net App storage for windows challenges and solutionsWSC Net App storage for windows challenges and solutions
WSC Net App storage for windows challenges and solutions
 
50,000-seat_VMware_view_deployment
50,000-seat_VMware_view_deployment50,000-seat_VMware_view_deployment
50,000-seat_VMware_view_deployment
 
Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...
Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...
Tr 3998 -deployment_guide_for_hosted_shared_desktops_and_on-demand_applicatio...
 
Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11
Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11
Tr 3749 -net_app_storage_best_practices_for_v_mware_vsphere,_dec_11
 
Snap mirror source to tape to destination scenario
Snap mirror source to tape to destination scenarioSnap mirror source to tape to destination scenario
Snap mirror source to tape to destination scenario
 

Statistically adaptive learning for a general class of..

Statistically adaptive learning for a general class of cost functions (SA L-BFGS)*

Stephen Purpura†§ (spurpura@contextrelevant.com), Dustin Hillard§ (dhillard@contextrelevant.com), Mark Hubenthal‡§ (mhubenthal@contextrelevant.com), Jim Walsh§ (jwalsh@contextrelevant.com), Scott Golder§ (sgolder@contextrelevant.com), Scott Smith§ (ssmith@contextrelevant.com)

arXiv:1209.0029v3 [cs.LG] 5 Sep 2012

* The material of this work is patent pending.
† in absentia at Department of Information Science, Cornell University, Ithaca, NY, 14850
‡ recent graduate of Department of Mathematics, University of Washington, Seattle, WA, 98195
§ Context Relevant, Inc., Seattle, WA, 98121

Abstract

We present a system that enables rapid model experimentation for tera-scale machine learning with trillions of non-zero features, billions of training examples, and millions of parameters. Our contribution to the literature is a new method (SA L-BFGS) for changing batch L-BFGS to perform in near real-time by using statistical tools to balance the contributions of previous weights, old training examples, and new training examples to achieve fast convergence with few iterations. The result is, to our knowledge, the most scalable and flexible linear learning system reported in the literature, beating standard practice with the current best system (Vowpal Wabbit and AllReduce). Using the KDD Cup 2012 data set from Tencent, Inc. we provide experimental results to verify the performance of this method.

1 Introduction

The demand for analysis and predictive modeling derived from very large data sets has grown immensely in recent years. One of the big problems in meeting this demand is the fact that data has grown faster than the availability of raw computational speed. As such, it has been necessary to use intelligent and efficient approaches when tackling the data training process. Specifically, there has been much focus on problems of the form

    min_{θ ∈ R^l} Σ_{i=1}^{m} l(θ^T x^{(i)}; y^{(i)}) + λ S(θ),    (1)

where x^{(i)} ∈ R^l is the feature vector of the ith example, y^{(i)} ∈ {0, 1} is the label, θ ∈ R^l is the vector of fitting parameters, l is a smooth convex loss function and S a regularizer. Some of the more popular methods for determining θ include linear and logistic regression, respectively. The optimal such θ corresponds to a linear predictor function p_θ(x) = θ^T x that best fits the data in some appropriate sense, depending on l and S. We remark that such cost functions in (1) have a structure which is naturally decomposable over the given training examples, so that all computations can potentially be run in parallel over a distributed environment.
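To make the decomposable structure of (1) concrete, the following minimal Python sketch assumes the logistic loss for l and an L2 regularizer for S (one particular choice; the system described here is not tied to it). The function names and the use of NumPy are illustrative.

```python
import numpy as np

def logistic_loss(z, y):
    # l(z; y) for y in {0, 1}: -y*log(sigmoid(z)) - (1 - y)*log(1 - sigmoid(z)),
    # written with logaddexp for numerical stability.
    return y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z)

def cost(theta, X, Y, lam):
    # Cost of the form (1): sum_i l(theta^T x_i; y_i) + lam * S(theta),
    # with S(theta) = 0.5 * ||theta||_2^2 as the regularizer.
    z = X @ theta                      # linear predictor p_theta(x) = theta^T x
    return logistic_loss(z, Y).sum() + lam * 0.5 * np.dot(theta, theta)

# Because the sum splits over examples, partial sums computed on disjoint
# shards of (X, Y) can simply be added together, e.g. via AllReduce.
```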
However, in practice it is often the case that the model must be updated accordingly as new data is acquired. That is, we want to answer the question of how θ changes in the presence of new training examples. One naive approach would be to completely redo the entire data analysis process from scratch on the larger data set. The current fastest method in such a case utilizes the L-BFGS quasi-Newton minimization algorithm with AllReduce along a distributed cluster (Agarwal et al., 2012). The other extreme is to apply the method of online learning, which considers one data point at a time and updates the parameters θ according to some form of gradient descent; see (Langford et al., 2009), (Duchi et al., 2010). However, on the tera-scale, neither approach is as appealing or as fast as we can achieve with our method. We describe these recent approaches to solving (1) in a bit more detail in Section 3. For completeness, we also refer the reader to recent work relating to large-scale optimization contained in (Schraudolph et al., 2007) and (Bottou, 2010).

Our approach in simple terms lies somewhere between pure L-BFGS minimization (widely accepted as the fastest brute force optimization algorithm whenever the function is convex and smooth) and online learning. While L-BFGS offers accuracy and robustness with a relatively small number of iterations, it fails to take direct advantage of situations where the new data is not very different from that acquired previously, or situations where the new data is extremely different than the old data. Certainly, one can initiate a new optimization job on the larger data set with the parameter θ initialized to the previous result. But we are left with the problem of optimizing over increasingly larger training sets at one time. Similarly, online learning methods only consider one data point at a time and cannot reasonably change the parameter θ by too much at any given step without risk of severely increasing the regret. It also cannot typically reach as small of an error count as that of a global gradient descent approach. On the other hand, we will show that it is possible to combine the advantages of both methods: in particular the small number of iterations and speed of L-BFGS when applied to reasonably sized batches, and the ability of online learning to "forget" previous data when the new data has changed significantly.

The outline of the paper is as follows. In Section 2 we describe the general problem of interest. In Section 3 we briefly mention current widely used methods of solving (1). In Section 4 we outline the statistically adaptive learning algorithm. Finally, in Section 5 we benchmark the performance of our two related methods (Context Relevant FAST L-BFGS and SA L-BFGS, respectively) against Vowpal Wabbit, one of the fastest currently available routines, which incorporates the work of (Agarwal et al., 2012). We also include the associated Area Under Curve (AUC) rating, which roughly speaking is a number in [0, 1], where a value of 1 indicates perfect prediction by the model.

2 Background and Problem Setup

In this paper the underlying problem is as follows. Suppose we have a sequence of time-indexed data sets {X_t, Y_t}, where X_t = {x_t^{(i)}}_{i=1}^{m_t}, Y_t = {y_t^{(i)}}_{i=1}^{m_t}, t = 0, 1, . . . , t_f is the time index, and m_t ∈ N is the batch size (typically independent of t). Such data is given sequentially as it is acquired (e.g. t could represent days), so that at t = 0 one only has possession of {X_0, Y_0}. Alternatively, if we are given a large data set all at once, we could divide it into batches indexed sequentially by t. In general, we use the notation x_t with subscript t to denote a time-dependent vector at time t, and we write x_{t,j} to denote the jth component of x_t. For each t = 0, . . . , t_f we define

    f_t(θ) = Σ_{i=1}^{m_t} l(θ^T x_t^{(i)}; y_t^{(i)}),
    φ_t(θ) = f_t(θ) + λ S(θ),    (2)

where as before, l is a given smooth convex loss function and S is a regularizer. Also let θ_t be the parameter vector obtained at time t, which in practice will approximately minimize λ S(θ) + Σ_{s=0}^{t} f_s(θ). We define the regret with respect to a fixed (optimal) parameter θ* (in practice we don't know the true value of θ*) by

    R_φ(t) = r_φ(θ*, t) := Σ_{s=0}^{t} [φ_s(θ_s) − φ_s(θ*)]    (3)
           = Σ_{s=0}^{t} [f_s(θ_s) + λ S(θ_s) − f_s(θ*) − λ S(θ*)].

An effective algorithm is then one in which the sequence {θ_t}_{t=0}^{t_f} suffers sub-linear regret, i.e., R_φ(t) = o(t).
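As a concrete reading of (2) and (3), the sketch below evaluates the per-batch objective and accumulates regret against a reference parameter. The logistic loss and L2 regularizer are again assumed choices, and theta_star merely stands in for the unknown optimal θ*.

```python
import numpy as np

def batch_cost(theta, X_t, Y_t, lam):
    # phi_t(theta) = f_t(theta) + lam * S(theta), as in (2), with logistic loss
    # for l and S(theta) = 0.5 * ||theta||_2^2.
    z = X_t @ theta
    f_t = (Y_t * np.logaddexp(0.0, -z) + (1.0 - Y_t) * np.logaddexp(0.0, z)).sum()
    return f_t + lam * 0.5 * np.dot(theta, theta)

def cumulative_regret(thetas, theta_star, batches, lam):
    # R_phi(t) = sum_{s=0}^{t} [phi_s(theta_s) - phi_s(theta_star)], as in (3).
    # theta_star is a reference parameter; the true optimum is unknown in practice.
    return sum(batch_cost(th, X, Y, lam) - batch_cost(theta_star, X, Y, lam)
               for th, (X, Y) in zip(thetas, batches))
```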
As mentioned earlier, there has been much work done regarding how to solve (1) with a variety of methods. Before proceeding with a basic overview of the two most popular approaches to large-scale machine learning in Section 3, it is important to understand the underlying assumptions and implications of the previous body of work. In particular, we mention the work of Léon Bottou in (Bottou, 2010) regarding large-scale optimization with stochastic gradient descent, and (Schraudolph et al., 2007) regarding stochastic online L-BFGS (oL-BFGS) optimization. Such works and others have demonstrated that with a lot of randomly shuffled data, a variety of methods (oL-BFGS, 2nd order stochastic gradient descent and averaged stochastic gradient descent) can work in fewer iterations than L-BFGS because:

(a) Small data learning problems are fundamentally different from large data learning problems;
(b) The cost functions as framed in the literature have well suited curvature near the global minimum.

We remark that the key problem for all quasi-Newton based optimization methods (including L-BFGS) has been that noise associated with the approximation process – with specific properties dependent on each learning problem – causes adverse conditions which can make L-BFGS (and its variants) fail. However, the problem of noise leading to non-positive curvature near the minimum can be averted if the data is appropriately shaped (i.e. feature selection plus proper data transformations). For now though, we ignore the issue and assume we already have a methodology for "feature shaping" that assures under operational conditions that the curvature of the resulting learning problem is well-suited to the algorithm that we describe.

3 Previous Work

3.1 Online Updates

In online learning, the problem of storage is completely averted as each data point is discarded once it is read. We remark that one can essentially view this approach as a special case of the statistically adaptive method described in Section 4 with a batch size of 1. Such algorithms iteratively make a prediction θ_t ∈ R^l and then receive a convex loss function φ_t as in (2). Typically, φ_t(θ) = l(θ^T x_t; y_t) + λ R(θ), where (x_t, y_t) is the data point read at time t. We then make an update to obtain θ_{t+1} using a rule that is typically based on the gradient of l(θ^T x_t, y_t) in θ. Indeed, the simplest approach (with no regularization term) would be the update rule

    θ_{t+1} = θ_t − η_t ∇_θ l(θ_t^T x_t, y_t).

However, there currently exist more sophisticated update schemes which can achieve better regret bounds for (3). In particular, the work of Duchi, Hazan, and Singer is a type of subgradient method with adaptive proximal functions. It is proven that their ADAGRAD algorithm can achieve theoretical regret bounds of the form

    R_φ(t) = O(||θ*||_2 tr(G_t^{1/2}))  and
    R_φ(t) = O(max_{s ≤ t} ||θ_s − θ*||_2 tr(G_t^{1/2})),    (4)

where in general, G_t = Σ_{s=0}^{t} g_s g_s^T is an outer product matrix generated by the sequence of gradients g_s = ∇_θ f_s(θ_s) (Duchi et al., 2010). We remark that since the loss function gradients converge to zero under ideal conditions, the estimate (4) is indeed sublinear, because the decay of the gradients, however slow, counters the linear growth in the size of G_t^{1/2}.

3.2 Vowpal Wabbit with Gradient Descent

Vowpal Wabbit is a freely available software package which implements the method described briefly in (Agarwal et al., 2012). In particular, it combines online learning and brute force gradient-descent optimization in a slightly different way. First, one does a single pass over the whole data set using online learning to obtain a rough choice of parameter θ. Then, L-BFGS optimization of the cost function is initiated with the data split across a cluster. The cost function and its gradient are computed locally and AllReduce is utilized to collect the global function values and gradients in order to update θ. The main improvement of this algorithm over previous methods is the use of AllReduce with the Hadoop file structure, which significantly cuts down on communication time as is the case with MapReduce. Moreover, the baseline online learning step is done with a learning rate chosen in an optimal manner as discussed in (Karampatziakis and Langford, 2011).

4 Our Approach

4.1 Least Squares Digression

Before we describe the statistically adaptive approach for minimizing a generic cost function, consider the following simpler scenario in the context of least squares regression. Given data {X, Y} with X ∈ R^{m×l} and Y ∈ R^m, we want to choose θ that solves min_{θ ∈ R^l} ||Xθ − Y||_2^2. Assuming invertibility of X^T X, it is well known that the solution is given by

    θ = (X^T X)^{−1} X^T Y.    (5)

Now, suppose that we have time indexed data {X_s, Y_s}_{s=0}^{T} with X_s ∈ R^{m_s × l} and Y_s ∈ R^{m_s}. In order to update θ_t given {X_{t+1}, Y_{t+1}}, first we must check how well θ_t fits the newly augmented data set. We do this by evaluating

    Σ_{s=0}^{t+1} ||X_s θ − Y_s||_2^2    (6)

with θ = θ_t. Depending on the result, we choose a parameter λ ∈ [0, 1] that determines how much weight to give the previous data when computing θ_{t+1}. That is, λ represents how much we would like to "forget" the previous data (or emphasize the new data), with a value of λ = 1 indicating that all previous data has been thrown out. Similarly, the case λ = 1/2 corresponds to the case when past and present are weighed equally, and the case λ = 0 corresponds to the case when θ_t fits the new data perfectly (i.e. (6) is equal to zero).

Let X_{[0,t]} be the (Σ_{s=0}^{t} m_s) × l matrix [X_0^T, X_1^T, . . . , X_t^T]^T obtained by concatenation, and similarly define the length-(Σ_{s=0}^{t} m_s) vector Y_{[0,t]} := [Y_0^T, Y_1^T, . . . , Y_t^T]^T. Then (6) is equivalent to ||X_{[0,t+1]} θ − Y_{[0,t+1]}||_2^2. Now, when using a particular second order Newton method for minimizing a smooth convex function, the computation of the inverse Hessian matrix is analogous to computing (X_{[0,t]}^T X_{[0,t]})^{−1} above. As t grows large, the cumulative normal matrix X_{[0,t]}^T X_{[0,t]} becomes increasingly costly to compute from scratch, as does its inverse. Fortunately, we observe that

    X_{[0,t+1]}^T X_{[0,t+1]} = X_{[0,t]}^T X_{[0,t]} + X_{t+1}^T X_{t+1}.    (7)

However, if we want to incorporate the flexibility to weigh current data differently relative to previous data, we need to abandon the exact computation of (7). Instead, letting A_t denote the approximate analogue of X_{[0,t]}^T X_{[0,t]}, we introduce the update

    A_{t+1} ← (2 / (1 + µ^2)) (µ^2 A_t + X_{t+1}^T X_{t+1}),    (8)

where µ satisfies λ = µ^2 / (1 + µ^2). The actual update of θ_t is performed as follows. Define

    Ỹ_{[0,t]} := X_{[0,t]}^T Y_{[0,t]} = [X_0^T, . . . , X_t^T] [Y_0; Y_1; . . . ; Y_t] = Σ_{s=0}^{t} X_s^T Y_s.

Up to time t, the standard solution to the least squares problem on the data {X_{[0,t]}, Y_{[0,t]}} is then

    θ = (X_{[0,t]}^T X_{[0,t]})^{−1} Ỹ_{[0,t]}.    (9)

Now let B_t be an approximation to Ỹ_{[0,t]}. We define B_{t+1} by the update

    B_{t+1} = (2 / (1 + µ^2)) (µ^2 B_t + X_{t+1}^T Y_{t+1}).

Finally, we set

    θ_{t+1} := A_{t+1}^{−1} B_{t+1}.    (10)

It is easily verified that (10) coincides with the standard update (9) when µ = 1.
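The digression can be condensed into a small recursive update. The sketch below follows the reconstruction of (7)-(10) given above; in particular the 2/(1 + µ^2) scaling is taken from the recovered form of (8), so treat the exact weighting as an assumption rather than a verbatim transcription of the original equations.

```python
import numpy as np

class WeightedLeastSquares:
    """A_t approximates X_[0,t]^T X_[0,t], B_t approximates X_[0,t]^T Y_[0,t],
    and theta_t = A_t^{-1} B_t as in (9)-(10). The 2/(1 + mu^2) scaling follows
    the reconstruction of (8) above; with mu = 1 the update reduces to the
    exact accumulation (7)."""

    def __init__(self, n_features):
        self.A = np.zeros((n_features, n_features))
        self.B = np.zeros(n_features)

    def update(self, X_new, Y_new, mu=1.0):
        scale = 2.0 / (1.0 + mu ** 2)
        self.A = scale * (mu ** 2 * self.A + X_new.T @ X_new)
        self.B = scale * (mu ** 2 * self.B + X_new.T @ Y_new)
        # theta_{t+1} = A_{t+1}^{-1} B_{t+1}; assumes A_{t+1} is invertible.
        return np.linalg.solve(self.A, self.B)
```

Setting mu = 1 reproduces the exact accumulation (7) and hence the standard solution (9), while mu < 1 downweights the accumulated history so that the newest batch dominates.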
4.2 Statistically Adaptive Learning

Returning to our original problem, we start with the parameter θ_0 obtained from some initial pass through of {X_0, Y_0}, typically using a particular gradient descent algorithm. In what follows, we will need to define an easily evaluated error function to be applied at each iteration, mildly related to the cumulative regret (3):

    I(t, θ) := (Σ_{s=0}^{t} Σ_{i=1}^{m_s} |p_θ(x_s^{(i)}) − y_s^{(i)}|) / (Σ_{s=0}^{t} |X_s|).    (11)

We remark that I(t, θ) represents the relative number of incorrect predictions associated with the parameter θ over all data points from time s = 0 to s = t. Moreover, because p_θ is a linear function of x, I is very fast to evaluate (essentially O(m) where m = Σ_{s=0}^{t} m_s).

Given θ_t, we compute I(t + 1, θ_t). There are two extremal possibilities:

1. I(t + 1, θ_t) is significantly larger than I(t, θ_t). More precisely, we mean that I(t + 1, θ_t) − I(t, θ_t) > σ(t), where σ(t) is the standard deviation of {I(s, θ_s)}_{s=0}^{t}. In this case the data has significantly changed, and so θ must be modified.

2. Otherwise, there is no need to change θ and we set θ_{t+1} = θ_t.

In the former case, we use the magnitude of I(t + 1, θ_t) − I(t, θ_t) to determine a subsample of the old and new data with M_old and M_new points chosen, respectively (see Figure 1). Roughly speaking, the larger the difference the more weight will be given to the most recent data points. The sampling of previous data points serves to anchor the model so that the parameters do not over fit to the new batch at the expense of significantly increasing the global regret. This is a generalization of online learning methods where only the most recent single data point is used to update θ. From the subsample chosen, we then apply a gradient descent optimization routine where the initialization of the associated starting parameters is generated from those stored from the previous iteration. In the case of L-BFGS, the rank 1 matrices used to approximate the inverse Hessian stored from the previous iteration are used to initialize the new descent routine. We summarize the process in Algorithm 1.

Algorithm 1 Statistically Adaptive Learning Method (SA L-BFGS)
  Require: Error checking function I(t, θ)
  Given data {X_s, Y_s}_{s=0}^{t_f}
  Run gradient descent optimization on {X_0, Y_0} to compute θ_0
  for t = 1 to t_f do
    if I(t + 1, θ_t) − I(t, θ_t) > σ(t) then
      Choose M_old and M_new
      Subsample data
      Run L-BFGS with initial parameter θ_t to obtain θ_{t+1}
    else
      θ_{t+1} ← θ_t
    end if
  end for

[Figure 1: Subsampling of the partitioned data stream at time t and times 0, 1, . . . , t − 1, respectively. The two panels show data points selected from the previous batches 1, 2, . . . , t − 1 and data points selected from the current batch t.]

As a typical example, at some time t we might have M_old = 1000, M_new = 100,000, and Σ_{t=0}^{t_f} m_t = 1·10^9. This would be indicative of a batch {X_{t+1}, Y_{t+1}} with significantly higher error using the current parameter θ_t than for previously analyzed batches.
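A compact way to read Algorithm 1 is the loop sketched below. The threshold test, the use of SciPy's L-BFGS-B as the inner solver, the zero threshold inside the error function, and the uniform subsampling helper are all illustrative choices; the production Context Relevant FAST SA L-BFGS implementation described in Section 5 is not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

def error_rate(theta, batches):
    # I(t, theta) from (11): fraction of incorrect predictions over batches 0..t,
    # here with the linear predictor p_theta(x) = theta^T x thresholded at 0.
    wrong = sum(np.sum((X @ theta > 0).astype(int) != Y) for X, Y in batches)
    total = sum(len(Y) for _, Y in batches)
    return wrong / total

def subsample(batches, m, rng):
    # Draw m points uniformly from the concatenation of the given batches.
    Xs = np.vstack([X for X, _ in batches])
    Ys = np.concatenate([Y for _, Y in batches])
    idx = rng.choice(len(Ys), size=min(m, len(Ys)), replace=False)
    return [(Xs[idx], Ys[idx])]

def sa_lbfgs(batches, cost, grad, lam, m_old=1000, m_new=100_000, seed=0):
    # `cost(theta, batches, lam)` and `grad(theta, batches, lam)` evaluate the
    # objective and its gradient over a list of (X, Y) batches.
    rng = np.random.default_rng(seed)
    X0, _ = batches[0]
    theta = minimize(cost, np.zeros(X0.shape[1]), jac=grad,
                     args=(batches[:1], lam), method="L-BFGS-B").x
    history = [error_rate(theta, batches[:1])]          # {I(s, theta_s)}
    for t in range(1, len(batches)):
        i_new = error_rate(theta, batches[:t + 1])
        if i_new - history[-1] > np.std(history):        # data has shifted
            data = (subsample(batches[:t], m_old, rng) +   # anchor on old batches
                    subsample([batches[t]], m_new, rng))   # emphasize the new batch
            theta = minimize(cost, theta, jac=grad,        # warm start at theta_t
                             args=(data, lam), method="L-BFGS-B").x
        history.append(error_rate(theta, batches[:t + 1]))
    return theta
```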
We remark that when learning on each new batch of data, there are two main aspects that can be parallelized. First, the batch itself can be partitioned and distributed among nodes in a cluster via AllReduce to significantly speed up the evaluation of the cost function and its gradient as is done in (Agarwal et al., 2012). Furthermore, one can run multiple independent optimization routines in parallel where the distribution used to subsample from X_{t+1} and ∪_{s=0}^{t} X_s is varied. The resulting parameters θ obtained from each separate instance can then be statistically compared so as to make sure that the model is not overly sensitive to the choice of sampling distribution. Otherwise, having θ be too highly dependent on the choice of subsample would invalidate using a stochastic gradient descent-based approach. A by-product of this ability to simultaneously experiment with different samplings is that it provides a quick means to check the consistency of the data.

Finally, we remark that the SA L-BFGS method can be reasonably adapted to account for changes in the selected features as new data is acquired. Indeed, it is very appealing within the industry to be able to experiment with different choices of features in order to find those that matter most, while still being able to use the previously computed parameters θ_t to speed up the optimization on the new data. Of course, it is possible to directly apply an online learning approach in this situation, since previous data points have already been discarded. But typical gradient descent algorithms do not a priori have the flexibility to be directly applied in such cases and they typically perform worse than batch methods such as L-BFGS (Agarwal et al., 2012).

5 Experiments

5.1 Description of Dataset and Features

We consider data used to predict the click-through-rate (pCTR) of online ads. An accurate model is necessary in the search advertising market in order to appropriately rank ads and price clicks. The data contains 11 variables and 1 output, corresponding to the number of times a given ad was clicked by the user among the number of times it was displayed. In order to reduce the data size, instances with the same user id, ad id, query, and setting are combined, so that the output may take on any positive integer value. For each instance (training example), the input variables serve to classify various properties of the ad displayed, in addition to the specific search query entered. This data was acquired from sessions of the Tencent proprietary search engine and was posted publicly on www.kddcup2012.org (Tencent, 2012).

For these experiments we build a basic model that learns from the identifiers provided in the training set. These include unique identifiers for each query, ad, keyword, advertiser, title, description, display url, user, ad position, and ad depth (further details available in the KDD documentation). We compute a position and depth normalized click through rate for each identifier, as well as combinations (conjunctions) of these identifiers. Then at training and testing time we annotate each example with these normalized click through rates. Additionally, before running the optimization, it is necessary to build appropriate feature vectors (i.e. shape the data). We will not go into detail regarding how this is done, except to mention that the number of features generated is on the order of 1000.

5.2 Model 1 Results

For our first set of experiments, we compare the performance of Vowpal Wabbit (VW) using its L-BFGS implementation and the Context Relevant Flexible Analytics and Statistics Technology™ L-BFGS implementation running on 10 Amazon m1.xlarge instances. The time measured (in seconds) is only the time required to train the models. The features are generated and cached for each implementation prior to training.

Performance was measured using the Area Under Curve (AUC) metric because this was the methodology used in (Tencent, 2012). In short, the AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. More precisely, it is computed via Algorithm 3 in (Fawcett, 2004). We compute our AUC results over a portion of the public section of the test set that has about 2 million examples.
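For reference, the AUC used throughout this section can be computed directly from the probabilistic definition quoted above; the rank-comparison sketch below agrees with the ROC-based procedure of (Fawcett, 2004) and is only meant to make the metric concrete.

```python
import numpy as np

def auc(scores, labels):
    # AUC as the probability that a randomly chosen positive example is scored
    # above a randomly chosen negative one (ties counted as 1/2).
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Pairwise comparison; O(n_pos * n_neg) memory, which is fine for a sketch.
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```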
Model 1 includes only the basic id features, with no conjunction features, and achieves an AUC of 0.748 as shown in Table 1. A simple baseline performance, which can be generated by predicting the mean ctr for each ad id, would perform at approximately an AUC of 0.71. The winner of (Tencent, 2012) performed at an AUC of 0.80. However, the winning model was substantially more complicated and used many additional features that were excluded from this simple demonstration. In future work, we will explore more sophisticated feature sets. The Context Relevant and VW models both achieve the same AUC on the test set, which validates that the basic gradient descent and L-BFGS implementations are functionally equivalent. The Context Relevant model completes learning between four and five times more quickly. Our implementation is heavily optimized to reduce computation time as well as memory footprint. In addition, our implementation utilizes an underlying MapReduce implementation that provides robustness to job and node failures [1].

[1] Context Relevant had to re-write the AllReduce network implementation to add error checking so that the AllReduce system was robust to errors that were encountered during normal execution of experiments on Amazon's EC2 systems. Without these changes, we could not keep AllReduce from hanging during the experiments. There is no graceful recovery from the loss of a single node.

Table 1: Model 1 Results For Different Learning Mechanisms (VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS)

             VW L-BFGS    CR L-BFGS
  seconds    490          114
  AUC        .748         .748

5.3 Model 2 Results

Context Relevant's implementation of SA L-BFGS is designed to accelerate and simplify learning iterative changes to models. Using information gleaned from the initial L-BFGS pass, SA L-BFGS develops a sampling strategy to minimize sampling induced noise when learning new models that are derived from previous models. The larger the divergence in the models, the less speed-up is likely. For this set of experiments, we add a conjunction feature that captures the interaction between a query id and an ad id, which has frequently been an important feature in well known advertising systems. We then compare the speed and accuracy of Vowpal Wabbit (VW) using its L-BFGS implementation; the Context Relevant Flexible Analytics and Statistics Technology™ L-BFGS implementation; and the Context Relevant Flexible Analytics and Statistics Technology™ SA L-BFGS implementation (SA), running on 10 Amazon X1.Large instances. Here the baseline L-BFGS models are trained with the standard practice for adding a new feature: the models are retrained on the entire dataset. The time measured (in seconds) is only the time required to train the models. The features are generated and cached for each implementation prior to training.

Again, performance was measured using the Area Under Curve (AUC) metric. Table 2 lists the results for each algorithm, and shows that AUC improves in comparison to Model 1 when the new feature is added. As with Model 1, the VW and CR models achieve similar AUC and the basic L-BFGS CR learning time is significantly faster. Furthermore, we show that SA L-BFGS also achieves similar AUC, but in less than one tenth of the time (which likewise implies one tenth of the compute cost required). This speed up can enable a large increase in the number of experiments, without requiring additional compute or time. It is important to note that the speed of L-BFGS and SA L-BFGS is essentially tied to the number of features and the number of examples for each iteration. The primary performance gains that can be found are: (a) reducing the number of iterations; (b) reducing the number of examples; or (c) reducing the number of features with non-zero weights. SA L-BFGS adopts the former two strategies. A reduction in the number of features with non-zero weights can be forced through aggressive regularization, but at the expense of specificity.

Table 2: Model 2 Results For Different Learning Mechanisms (VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS; SA = Context Relevant FAST SA L-BFGS)

             VW L-BFGS    CR L-BFGS    CR SA L-BFGS
  seconds    515          115          9
  AUC        .750         .752         .751

6 Conclusions

We have presented a new tera-scale machine learning system that enables rapid model experimentation. The system uses a new version of L-BFGS to combine the robustness and accuracy of second order gradient descent optimization methods with the memory advantages of online learning. This provides a model building environment that significantly lowers the time and compute cost of asking new questions. The ability to quickly ask and answer experimental questions vastly expands the set of questions that can be asked, and therefore the space of models that can be explored to discover the optimal solution.

SA L-BFGS is also well suited to environments where the underlying distribution of the data provided to a learning algorithm is shifting. Whether the shift is caused by changes in user behavior, changes in market pricing, or changes in term usage, SA L-BFGS can be empirically tuned to dynamically adjust to the changing conditions. One can utilize the parallelized approach in (Agarwal et al., 2012) on each batch of data in the time variable, with the additional freedom to choose the batch size. Furthermore, the statistical aspects of the algorithm provide a useful way to check the consistency of the data in real time. However, like all L-BFGS implementations that rely on small, reduced, or sampled data sets, increased sampling noise from the L-BFGS estimation process affects the quality of the resulting learning algorithm. Users of this new algorithm must take care to provide smooth convex loss functions for optimization. The Context Relevant Flexible Analytics and Statistics Technology™ is designed to provide such functions for optimization.
7 About Context Relevant and the Authors

Context Relevant was founded in March 2012 by Stephen Purpura and Christian Metcalfe. The company was initially funded by friends, family, Seattle angel investors, and Madrona Venture Group.

Stephen Purpura – CEO & Co-Founder – Stephen works as the CEO and CTO of the company. He has more than 20 years of experience, is listed as an inventor of five issued United States patents and served as program manager of Microsoft Windows and the Chief Security Officer of MSFDC, one of the first Internet bill payment systems. Stephen is recognized as a leading expert in the fields of machine learning, political micro-targeting and predictive analytics.

Stephen received a bachelor's degree from the University of Washington, a master's degree from Harvard, where he was part of the Program on Networked Governance at the John F. Kennedy School of Government, and he is set to be granted a PhD in information science from Cornell.

Jim Walsh – VP Engineering – Jim brings 23 years of experience in technology innovation and engineering management to Context Relevant. He founded, built and led the Cosmos team – Microsoft's massive scale distributed data storage and analysis environment, underpinning virtually all Microsoft products including Bing – and founded the Bing multimedia search team. In his final position at Microsoft, Jim served as the principal architect for Microsoft advertising platforms.

Prior to joining Microsoft, Jim created two software startups, wrote the second-ever Windows application and created custom computer animation software for the television broadcast industry. He is recognized as one of the world's leading network performance experts and has been granted twenty two technology patents that range from performance to user interface design. Jim earned a bachelor of science degree in computing science from the University of Alberta.

Dustin Rigg Hillard – Director of Engineering and Data Scientist – Dustin is a recognized data science and machine learning expert who has published more than 30 papers in these areas. Previously at Microsoft and Yahoo!, he spent the last decade building systems that significantly improve large-scale processing and machine learning for advertising, natural language and speech.

Prior to joining Context Relevant, Dustin Hillard worked for Microsoft, where he worked to improve speech understanding for mobile applications and XBox Kinect. Before that he was at Yahoo! for three years, where he focused on improving ad relevance. His research in graduate school focused on automatic speech recognition and statistical translation. Dustin incorporates approaches from these and other fields to learn from massive amounts of data with supervised, semi-supervised, and unsupervised machine learning approaches.

Dustin holds a bachelor's and master's degree as well as a PhD in Electrical Engineering from the University of Washington.

Scott Golder – Data Scientist & Staff Sociologist – Scott mines social networking data to investigate broad questions such as when people are happiest (mornings and weekends) and how Twitter users form new social ties. His work has been published in the journal Science as well as top computer science conferences by the ACM and IEEE, and has been covered in media outlets such as MSNBC, The New York Times, The Washington Post and National Public Radio. He has also been profiled by LiveScience's "ScienceLives".

He has worked as a research scientist in the Social Computing Lab at HP Labs and has interned at Google, IBM and Microsoft. Scott holds a master's degree from MIT, where he worked with the Media Laboratory's Sociable Media Group, and graduated from Harvard University, where he studied Linguistics and Computer Science. Scott is currently on leave from the PhD program in Sociology at Cornell University.

Mark Hubenthal – Member of the Technical Staff – Mark recently received his PhD in Mathematics from the University of Washington. He works on inverse problems applicable to medical imaging and geophysics.

Scott Smith – Principal Engineer and Architect – Scott has experience with distributed computing, compiler design, and performance optimization. At Akamai, he helped design and implement the load balancing and failover logic for the first CDN. At Clustrix, he built the SQL optimizer and compiler for a distributed RDBMS. He holds a master's degree in computer science from MIT.

References

[Agarwal et al. 2012] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. 2012. A reliable effective terascale linear learning system. Feb. arXiv:1110.4198v2.

[Bottou 2010] L. Bottou. 2010. Large-scale machine learning with stochastic gradient descent. pages 177–187.

[Duchi et al. 2010] J. Duchi, E. Hazan, and Y. Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization.

[Fawcett 2004] T. Fawcett. 2004. ROC graphs: Notes and practical considerations for researchers.

[Karampatziakis and Langford 2011] N. Karampatziakis and J. Langford. 2011. Online importance weight aware updates.

[Langford et al. 2009] J. Langford, L. Li, and T. Zhang. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801.

[Schraudolph et al. 2007] N. Schraudolph, J. Yu, and S. Günter. 2007. A stochastic quasi-newton method for online convex optimization.

[Tencent 2012] Tencent. 2012. KDD Cup 2012, Track 2 Data. http://www.kddcup2012.org/c/kddcup2012-track2/dat