Statistically adaptive learning for a general class of cost functions (SA L-BFGS)∗

Stephen Purpura †§ (spurpura@contextrelevant.com)
Dustin Hillard § (dhillard@contextrelevant.com)
Mark Hubenthal ‡§ (mhubenthal@contextrelevant.com)
Jim Walsh § (jwalsh@contextrelevant.com)
Scott Golder § (sgolder@contextrelevant.com)
Scott Smith § (ssmith@contextrelevant.com)
arXiv:1209.0029v3 [cs.LG] 5 Sep 2012
∗ The material of this work is patent pending.
† In absentia at Department of Information Science, Cornell University, Ithaca, NY, 14850.
‡ Recent graduate of Department of Mathematics, University of Washington, Seattle, WA, 98195.
§ Context Relevant, Inc., Seattle, WA, 98121.

Abstract

We present a system that enables rapid model experimentation for tera-scale machine learning with trillions of non-zero features, billions of training examples, and millions of parameters. Our contribution to the literature is a new method (SA L-BFGS) for changing batch L-BFGS to perform in near real-time by using statistical tools to balance the contributions of previous weights, old training examples, and new training examples to achieve fast convergence with few iterations. The result is, to our knowledge, the most scalable and flexible linear learning system reported in the literature, beating standard practice with the current best system (Vowpal Wabbit and AllReduce). Using the KDD Cup 2012 data set from Tencent, Inc., we provide experimental results to verify the performance of this method.

1 Introduction

The demand for analysis and predictive modeling derived from very large data sets has grown immensely in recent years. One of the big problems in meeting this demand is the fact that data has grown faster than the availability of raw computational speed. As such, it has been necessary to use intelligent and efficient approaches when tackling the data training process. Specifically, there has been much focus on problems of the form

\[ \min_{\theta \in \mathbb{R}^l} \sum_{i=1}^{m} l(\theta^T x^{(i)}; y^{(i)}) + \lambda S(\theta), \tag{1} \]

where $x^{(i)} \in \mathbb{R}^l$ is the feature vector of the $i$th example, $y^{(i)} \in \{0, 1\}$ is the label, $\theta \in \mathbb{R}^l$ is the vector of fitting parameters, $l$ is a smooth convex loss function and $S$ a regularizer. Popular instances of this problem include linear and logistic regression. The optimal such $\theta$ corresponds to a linear predictor function $p_\theta(x) = \theta^T x$ that best fits the data in some appropriate sense, depending on $l$ and $S$. We remark that cost functions of the form (1) have a structure which is naturally decomposable over the given training examples, so that all computations can potentially be run in parallel over a distributed environment.

However, in practice it is often the case that the model must be updated as new data is acquired. That is, we want to answer the question of how $\theta$ changes in the presence of new training examples. One naive approach would be to completely redo the entire data analysis process from scratch on the larger data set. The current fastest method in such a case utilizes the L-BFGS quasi-Newton minimization algorithm with AllReduce across a distributed cluster (Agarwal et al., 2012). The other extreme is to apply the method of online learning, which considers one data point at a time and updates the parameters $\theta$ according to some form of gradient descent; see (Langford et al., 2009), (Duchi et al., 2010). However, on the tera-scale, neither approach is as appealing or as fast as what we can achieve with our method. We describe these recent approaches to solving (1) in a bit more detail in Section 3. For completeness, we also refer the reader to recent work relating to large-scale optimization contained in (Schraudolph et al., 2007) and (Bottou, 2010).

Our approach in simple terms lies somewhere between pure L-BFGS minimization (widely accepted as the fastest brute force optimization algorithm whenever the function is convex and smooth) and online learning. While L-BFGS offers accuracy and robustness with a relatively small number of iterations, it fails to take direct advantage of situations where the new data is not
very different from that acquired previously or situations where the new data is extremely different from the old data. Certainly, one can initiate a new optimization job on the larger data set with the parameter $\theta$ initialized to the previous result. But we are left with the problem of optimizing over increasingly larger training sets at one time. Similarly, online learning methods only consider one data point at a time and cannot reasonably change the parameter $\theta$ by too much at any given step without risk of severely increasing the regret. Such methods also typically cannot reach as small an error count as a global gradient descent approach. On the other hand, we will show that it is possible to combine the advantages of both methods: in particular the small number of iterations and speed of L-BFGS when applied to reasonably sized batches, and the ability of online learning to "forget" previous data when the new data has changed significantly.

The outline of the paper is as follows. In Section 2 we describe the general problem of interest. In Section 3 we briefly mention current widely used methods of solving (1). In Section 4 we outline the statistically adaptive learning algorithm. Finally, in Section 5 we benchmark the performance of our two related methods (Context Relevant FAST L-BFGS and SA L-BFGS, respectively) against Vowpal Wabbit, one of the fastest currently available routines, which incorporates the work of (Agarwal et al., 2012). We also include the associated Area Under Curve (AUC) rating, which, roughly speaking, is a number in $[0, 1]$, where a value of 1 indicates perfect prediction by the model.

2 Background and Problem Setup

In this paper the underlying problem is as follows. Suppose we have a sequence of time-indexed data sets $\{X_t, Y_t\}$ where $X_t = \{x_t^{(i)}\}_{i=1}^{m_t}$, $Y_t = \{y_t^{(i)}\}_{i=1}^{m_t}$, $t = 0, 1, \ldots, t_f$ is the time index, and $m_t \in \mathbb{N}$ is the batch size (typically independent of $t$). Such data is given sequentially as it is acquired (e.g. $t$ could represent days), so that at $t = 0$ one only has possession of $\{X_0, Y_0\}$. Alternatively, if we are given a large data set all at once, we could divide it into batches indexed sequentially by $t$. In general, we use the notation $x_t$ with subscript $t$ to denote a time-dependent vector at time $t$, and we write $x_{t,j}$ to denote the $j$th component of $x_t$. For each $t = 0, \ldots, t_f$ we define

\[ f_t(\theta) = \sum_{i=1}^{m_t} l(\theta^T x_t^{(i)}; y_t^{(i)}), \]
\[ \phi_t(\theta) = f_t(\theta) + \lambda S(\theta), \tag{2} \]

where as before, $l$ is a given smooth convex loss function and $S$ is a regularizer. Also let $\theta_t$ be the parameter vector obtained at time $t$, which in practice will approximately minimize $\lambda S(\theta) + \sum_{s=0}^{t} f_s(\theta)$. We define the regret with respect to a fixed (optimal) parameter $\theta^*$ (in practice we don't know the true value of $\theta^*$) by

\[ R_\phi(t) = r_\phi(\theta^*, t) := \sum_{s=0}^{t} [\phi_s(\theta_s) - \phi_s(\theta^*)] \tag{3} \]
\[ \phantom{R_\phi(t)} = \sum_{s=0}^{t} [f_s(\theta_s) + \lambda S(\theta_s) - f_s(\theta^*) - \lambda S(\theta^*)]. \]

An effective algorithm is then one in which the sequence $\{\theta_t\}_{t=0}^{t_f}$ suffers sub-linear regret, i.e., $R_\phi(t) = o(t)$.

As mentioned earlier, there has been much work done regarding how to solve (1) with a variety of methods. Before proceeding with a basic overview of the two most popular approaches to large-scale machine learning in Section 3, it is important to understand the underlying assumptions and implications of the previous body of work. In particular, we mention the work of Léon Bottou in (Bottou, 2010) regarding large-scale optimization with stochastic gradient descent, and (Schraudolph et al., 2007) regarding stochastic online L-BFGS (oL-BFGS) optimization. Such works and others have demonstrated that with a lot of randomly shuffled data, a variety of methods (oL-BFGS, 2nd order stochastic gradient descent and averaged stochastic gradient descent) can work in fewer iterations than L-BFGS because:

(a) Small data learning problems are fundamentally different from large data learning problems;
(b) The cost functions as framed in the literature have well-suited curvature near the global minimum.

We remark that the key problem for all quasi-Newton based optimization methods (including L-BFGS) has been that noise associated with the approximation process, with specific properties dependent on each learning problem, causes adverse conditions which can make L-BFGS (and its variants) fail. However, the problem of noise leading to non-positive curvature near the minimum can be averted if the data is appropriately shaped (i.e. feature selection plus proper data transformations). For now though, we ignore the issue and assume we already have a methodology for "feature shaping" that assures under operational conditions that the curvature of the resulting learning problem is well-suited to the algorithm that we describe.

3 Previous Work

3.1 Online Updates

In online learning, the problem of storage is completely averted as each data point is discarded once it is read. We remark that one can essentially view this approach as a special case of the statistically adaptive method described in Section 4 with a batch size of 1. Such algorithms iteratively make a prediction $\theta_t \in \mathbb{R}^l$ and then
receive a convex loss function $\phi_t$ as in (2). Typically, $\phi_t(\theta) = l(\theta^T x_t; y_t) + \lambda S(\theta)$, where $(x_t, y_t)$ is the data point read at time $t$. We then make an update to obtain $\theta_{t+1}$ using a rule that is typically based on the gradient of $l(\theta^T x_t; y_t)$ in $\theta$. Indeed, the simplest approach (with no regularization term) would be the update rule

\[ \theta_{t+1} = \theta_t - \eta_t \nabla_\theta l(\theta_t^T x_t; y_t). \]

However, there currently exist more sophisticated update schemes which can achieve better regret bounds for (3). In particular, the work of Duchi, Hazan, and Singer is a type of subgradient method with adaptive proximal functions. It is proven that their ADAGRAD algorithm can achieve theoretical regret bounds of the form

\[ R_\phi(t) = O\big(\|\theta^*\|_2 \, \mathrm{tr}(G_t^{1/2})\big) \quad \text{and} \tag{4} \]
\[ R_\phi(t) = O\Big(\max_{s \le t} \|\theta_s - \theta^*\|_2 \, \mathrm{tr}(G_t^{1/2})\Big), \]

where in general, $G_t = \sum_{s=0}^{t} g_s g_s^T$ is an outer product matrix generated by the sequence of gradients $g_s = \nabla_\theta f_s(\theta_s)$ (Duchi et al., 2010). We remark that since the loss function gradients converge to zero under ideal conditions, the estimate (4) is indeed sublinear, because the decay of the gradients, however slow, counters the linear growth in the size of $G_t^{1/2}$.

3.2 Vowpal Wabbit with Gradient Descent

Vowpal Wabbit is a freely available software package which implements the method described briefly in (Agarwal et al., 2012). In particular, it combines online learning and brute force gradient-descent optimization in a slightly different way. First, one does a single pass over the whole data set using online learning to obtain a rough choice of parameter $\theta$. Then, L-BFGS optimization of the cost function is initiated with the data split across a cluster. The cost function and its gradient are computed locally and AllReduce is utilized to collect the global function values and gradients in order to update $\theta$. The main improvement of this algorithm over previous methods is the use of AllReduce with the Hadoop file structure, which significantly cuts down on communication time compared with MapReduce. Moreover, the baseline online learning step is done with a learning rate chosen in an optimal manner as discussed in (Karampatziakis and Langford, 2011).

4 Our Approach

4.1 Least Squares Digression

Before we describe the statistically adaptive approach for minimizing a generic cost function, consider the following simpler scenario in the context of least squares regression. Given data $\{X, Y\}$ with $X \in \mathbb{R}^{m \times l}$ and $Y \in \mathbb{R}^m$, we want to choose $\theta$ that solves $\min_{\theta \in \mathbb{R}^l} \|X\theta - Y\|_2^2$. Assuming invertibility of $X^T X$, it is well known that the solution is given by

\[ \theta = (X^T X)^{-1} X^T Y. \tag{5} \]

Now, suppose that we have time indexed data $\{X_s, Y_s\}_{s=0}^{T}$ with $X_s \in \mathbb{R}^{m_s \times l}$ and $Y_s \in \mathbb{R}^{m_s}$. In order to update $\theta_t$ given $\{X_{t+1}, Y_{t+1}\}$, first we must check how well $\theta_t$ fits the newly augmented data set. We do this by evaluating

\[ \sum_{s=0}^{t+1} \|X_s \theta - Y_s\|_2^2 \tag{6} \]

with $\theta = \theta_t$. Depending on the result, we choose a parameter $\lambda \in [0, 1]$ that determines how much weight to give the previous data when computing $\theta_{t+1}$. That is, $\lambda$ controls how much of the previous data we retain rather than "forget" in favor of the new data, with a value of $\lambda = 0$ indicating that all previous data has been thrown out. Similarly, the case $\lambda = 1/2$ corresponds to the case when past and present are weighed equally, and the case $\lambda = 1$ corresponds to the case when $\theta_t$ fits the new data perfectly (i.e. (6) is equal to zero).

Let $X_{[0,t]}$ be the $\big(\sum_{s=0}^{t} m_s\big) \times l$ matrix $[X_0^T, X_1^T, \ldots, X_t^T]^T$ obtained by concatenation, and similarly define the length-$\sum_{s=0}^{t} m_s$ vector $Y_{[0,t]} := [Y_0^T, Y_1^T, \ldots, Y_t^T]^T$. Then (6) is equivalent to $\|X_{[0,t+1]} \theta - Y_{[0,t+1]}\|_2^2$. Now, when using a particular second order Newton method for minimizing a smooth convex function, the computation of the inverse Hessian matrix is analogous to computing $(X_{[0,t]}^T X_{[0,t]})^{-1}$ above. As $t$ grows large, the cumulative normal matrix $X_{[0,t]}^T X_{[0,t]}$ becomes increasingly costly to compute from scratch, as does its inverse. Fortunately, we observe that

\[ X_{[0,t+1]}^T X_{[0,t+1]} = X_{[0,t]}^T X_{[0,t]} + X_{t+1}^T X_{t+1}. \tag{7} \]

However, if we want to incorporate the flexibility to weigh current data differently relative to previous data, we need to abandon the exact computation of (7). Instead, letting $A_t$ denote the approximate analogue of $X_{[0,t]}^T X_{[0,t]}$, we introduce the update

\[ A_{t+1} \leftarrow \frac{2}{1 + \mu^2} \big( \mu^2 A_t + X_{t+1}^T X_{t+1} \big) \tag{8} \]

where $\mu$ satisfies $\lambda = \frac{\mu^2}{1 + \mu^2}$.

The actual update of $\theta_t$ is performed as follows. Define

\[ \tilde{Y}_{[0,t]} := X_{[0,t]}^T Y_{[0,t]} = [X_0^T, \ldots, X_t^T] \begin{bmatrix} Y_0 \\ Y_1 \\ \vdots \\ Y_t \end{bmatrix} = \sum_{s=0}^{t} X_s^T Y_s. \]
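As a small numerical check of (7), (8), and the sum form of $\tilde{Y}_{[0,t]}$, the following is a minimal sketch in plain Python. The helper names (`gram`, `moment`, `weighted_update`, `solve_2x2`) and the two toy batches are hypothetical and chosen only for illustration; they are not part of the system described in this paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gram(X: Matrix) -> Matrix:
    """Return X^T X for a batch X with one example per row."""
    l = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(l)] for i in range(l)]


def moment(X: Matrix, Y: Vector) -> Vector:
    """Return X^T Y for a batch (X, Y)."""
    l = len(X[0])
    return [sum(row[i] * y for row, y in zip(X, Y)) for i in range(l)]


def weighted_update(A: Matrix, X_new: Matrix, mu: float) -> Matrix:
    """Update (8): A <- 2 / (1 + mu^2) * (mu^2 * A + X_new^T X_new)."""
    G = gram(X_new)
    c = 2.0 / (1.0 + mu * mu)
    l = len(A)
    return [[c * (mu * mu * A[i][j] + G[i][j]) for j in range(l)] for i in range(l)]


def solve_2x2(A: Matrix, b: Vector) -> Vector:
    """Solve A theta = b for a 2x2 system by Cramer's rule (toy sizes only)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [
        (b[0] * A[1][1] - A[0][1] * b[1]) / det,
        (A[0][0] * b[1] - A[1][0] * b[0]) / det,
    ]


# Two toy batches generated by y = 2*x1 + 3*x2 (illustrative data only).
X0, Y0 = [[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0]
X1, Y1 = [[1.0, 1.0], [2.0, 1.0]], [5.0, 7.0]

A0 = gram(X0)
# With mu = 1 the weighted update (8) reduces to the exact recursion (7).
A1 = weighted_update(A0, X1, mu=1.0)
# Sum form of Y-tilde: accumulate X_s^T Y_s batch by batch.
Y_tilde = [a + b for a, b in zip(moment(X0, Y0), moment(X1, Y1))]
# Least squares solution (5) on the concatenated data, from accumulated terms.
theta = solve_2x2(A1, Y_tilde)
# With mu = 0 the old normal matrix A0 drops out of (8) entirely.
A1_forget = weighted_update(A0, X1, mu=0.0)
```

The point of the sketch is that only the $l \times l$ matrix and the length-$l$ vector need to be carried from one batch to the next; the earlier batches themselves are never revisited.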
Up to time $t$, the standard solution to the least squares problem on the data $\{X_{[0,t]}, Y_{[0,t]}\}$ is then

\[ \theta = (X_{[0,t]}^T X_{[0,t]})^{-1} \tilde{Y}_{[0,t]}. \tag{9} \]

Now let $B_t$ be an approximation to $\tilde{Y}_{[0,t]}$. We define $B_{t+1}$ by the update

\[ B_{t+1} = \frac{2}{1 + \mu^2} \big( \mu^2 B_t + X_{t+1}^T Y_{t+1} \big). \]

Finally, we set

\[ \theta_{t+1} := A_{t+1}^{-1} B_{t+1}. \tag{10} \]

It is easily verified that (10) coincides with the standard update (9) when $\mu = 1$.

4.2 Statistically Adaptive Learning

Returning to our original problem, we start with the parameter $\theta_0$ obtained from some initial pass through $\{X_0, Y_0\}$, typically using a particular gradient descent algorithm. In what follows, we will need to define an easily evaluated error function to be applied at each iteration, mildly related to the cumulative regret (3):

\[ I(t, \theta) := \frac{\sum_{s=0}^{t} \sum_{i=1}^{m_s} |p_\theta(x_s^{(i)}) - y_s^{(i)}|}{\sum_{s=0}^{t} |X_s|} \tag{11} \]

We remark that $I(t, \theta)$ represents the relative number of incorrect predictions associated with the parameter $\theta$ over all data points from time $s = 0$ to $s = t$. Moreover, because $p_\theta$ is a linear function of $x$, $I$ is very fast to evaluate (essentially $O(m)$ where $m = \sum_{s=0}^{t} m_s$).

Given $\theta_t$, we compute $I(t + 1, \theta_t)$. There are two extremal possibilities:

1. $I(t + 1, \theta_t)$ is significantly larger than $I(t, \theta_t)$. More precisely, we mean that $I(t + 1, \theta_t) - I(t, \theta_t) > \sigma(t)$, where $\sigma(t)$ is the standard deviation of $\{I(s, \theta_s)\}_{s=0}^{t}$. In this case the data has significantly changed, and so $\theta$ must be modified.

2. Otherwise, there is no need to change $\theta$ and we set $\theta_{t+1} = \theta_t$.

In the former case, we use the magnitude of $I(t + 1, \theta_t) - I(t, \theta_t)$ to determine a subsample of the old and new data with $M_{old}$ and $M_{new}$ points chosen, respectively (see Figure 1). Roughly speaking, the larger the difference, the more weight will be given to the most recent data points. The sampling of previous data points serves to anchor the model so that the parameters do not overfit to the new batch at the expense of significantly increasing the global regret. This is a generalization of online learning methods where only the most recent single data point is used to update $\theta$. From the subsample chosen, we then apply a gradient descent optimization routine where the initialization of the associated starting parameters is generated from those stored from the previous iteration. In the case of L-BFGS, the rank 1 matrices used to approximate the inverse Hessian stored from the previous iteration are used to initialize the new descent routine. We summarize the process in Algorithm 1.

[Figure 1 diagram: batches 1, 2, ..., t-1 with data points selected from previous batches, and batch t with data points selected from the current batch.]
Figure 1: Subsampling of the partitioned data stream at time $t$ and times $0, 1, \ldots, t - 1$, respectively.

Algorithm 1 Statistically Adaptive Learning Method (SA L-BFGS)
Require: Error checking function $I(t, \theta)$
  Given data $\{X_s, Y_s\}_{s=0}^{t_f}$
  Run gradient descent optimization on $\{X_0, Y_0\}$ to compute $\theta_0$
  for $t = 0$ to $t_f - 1$ do
    if $I(t + 1, \theta_t) - I(t, \theta_t) > \sigma(t)$ then
      Choose $M_{old}$ and $M_{new}$
      Subsample data
      Run L-BFGS with initial parameter $\theta_t$ to obtain $\theta_{t+1}$
    else
      $\theta_{t+1} \leftarrow \theta_t$
    end if
  end for

As a typical example, at some time $t$ we might have $M_{old} = 1000$, $M_{new} = 100{,}000$, and $\sum_{t=0}^{t_f} m_t = 1 \cdot 10^9$. This would be indicative of a batch $\{X_{t+1}, Y_{t+1}\}$ with significantly higher error using the current parameter $\theta_t$ than for previously analyzed batches.

We remark that when learning on each new batch of data, there are two main aspects that can be parallelized. First, the batch itself can be partitioned and distributed among nodes in a cluster via AllReduce to significantly speed up the evaluation of the cost function and its gradient, as is done in (Agarwal et al., 2012). Furthermore, one can run multiple independent optimization routines in parallel where the distribution used to subsample from $X_{t+1}$ and $\cup_{s=0}^{t} X_s$ is varied. The resulting parameters $\theta$ obtained from each separate instance can then be statistically compared so as to make sure that the model is not overly sensitive to the choice of sampling distribution. Otherwise, having $\theta$ be too highly dependent on the choice of subsample would invalidate using a stochastic gradient descent-based approach. A by-product of this ability to simultaneously experiment with different samplings is that it provides a quick means to check the consistency of the data.

Finally, we remark that the SA L-BFGS method can be reasonably adapted to account for changes in the selected features as new data is acquired. Indeed, it is very appealing within the industry to be able to experiment with different choices of features in order to find those that matter most, while still being able to use the previously computed parameters $\theta_t$ to speed up the optimization on the new data. Of course, it is possible to directly apply an online learning approach in this situation, since previous data points have already been discarded. But typical gradient descent algorithms do not a priori have the flexibility to be directly applied in such cases and they typically perform worse than batch methods such as L-BFGS (Agarwal et al., 2012).

5 Experiments

5.1 Description of Dataset and Features

We consider data used to predict the click-through rate (pCTR) of online ads. An accurate model is necessary in the search advertising market in order to appropriately rank ads and price clicks. The data contains 11 variables and 1 output, corresponding to the number of times a given ad was clicked by the user among the number of times it was displayed. In order to reduce the data size, instances with the same user id, ad id, query, and setting are combined, so that the output may take on any positive integer value. For each instance (training example), the input variables serve to classify various properties of the ad displayed, in addition to the specific search query entered. This data was acquired from sessions of the Tencent proprietary search engine and was posted publicly on www.kddcup.2012.org (Tencent, 2012).

For these experiments we build a basic model that learns from the identifiers provided in the training set. These include unique identifiers for each query, ad, keyword, advertiser, title, description, display url, user, ad position, and ad depth (further details available in the KDD documentation). We compute a position and depth normalized click-through rate for each identifier, as well as combinations (conjunctions) of these identifiers. Then at training and testing time we annotate each example with these normalized click-through rates. Additionally, before running the optimization, it is necessary to build appropriate feature vectors (i.e. shape the data). We will not go into detail regarding how this is done, except to mention that the number of features generated is on the order of 1000.

5.2 Model 1 Results

For our first set of experiments, we compare the performance of Vowpal Wabbit (VW) using its L-BFGS implementation and the Context Relevant Flexible Analytics and Statistics Technology™ L-BFGS implementation running on 10 Amazon m1.xlarge instances. The time measured (in seconds) is only the time required to train the models. The features are generated and cached for each implementation prior to training.

Performance was measured using the Area Under Curve (AUC) metric because this was the methodology used in (Tencent, 2012). In short, the AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. More precisely, it is computed via Algorithm 3 in (Fawcett, 2004). We compute our AUC results over a portion of the public section of the test set that has about 2 million examples.

Model 1 includes only the basic id features, with no conjunction features, and achieves an AUC of 0.748 as shown in Table 1. A simple baseline, which can be generated by predicting the mean CTR for each ad id, would perform at approximately an AUC of 0.71. The winner of (Tencent, 2012) performed at an AUC of 0.80. However, the winning model was substantially more complicated and used many additional features that were excluded from this simple demonstration. In future work, we will explore more sophisticated feature sets.

The Context Relevant and VW models both achieve the same AUC on the test set, which validates that the basic gradient descent and L-BFGS implementations are functionally equivalent. The Context Relevant model completes learning between four and five times more quickly. Our implementation is heavily optimized to reduce computation time as well as memory footprint. In addition, our implementation utilizes an underlying MapReduce implementation that provides robustness to job and node failures [1].

[1] Context Relevant had to re-write the AllReduce network implementation to add error checking so that the AllReduce system was robust to errors that were encountered during normal execution of experiments on Amazon's EC2 systems. Without these changes, we could not keep AllReduce from hanging during the experiments. There is no graceful recovery from the loss of a single node.

5.3 Model 2 Results

Context Relevant's implementation of SA L-BFGS is designed to accelerate and simplify learning iterative changes to models. Using information gleaned from the initial L-BFGS pass, SA L-BFGS develops a sampling
strategy to minimize sampling induced noise when learning new models that are derived from previous models. The larger the divergence in the models, the less speed-up is likely. For this set of experiments, we add a conjunction feature that captures the interaction between a query id and an ad id, which has frequently been an important feature in well known advertising systems. We then compare the speed and accuracy of Vowpal Wabbit (VW) using its L-BFGS implementation, the Context Relevant Flexible Analytics and Statistics Technology™ L-BFGS implementation, and the Context Relevant Flexible Analytics and Statistics Technology™ SA L-BFGS implementation (SA) running on 10 Amazon m1.xlarge instances. Here the baseline L-BFGS models are trained with the standard practice for adding a new feature: the models are retrained on the entire dataset. The time measured (in seconds) is only the time required to train the models. The features are generated and cached for each implementation prior to training.

Again, performance was measured using the Area Under Curve (AUC) metric. Table 2 lists the results for each algorithm, and shows that AUC improves in comparison to Model 1 when the new feature is added. As with Model 1, the VW and CR models achieve similar AUC and the basic L-BFGS CR learning time is significantly faster. Furthermore, we show that SA L-BFGS also achieves similar AUC, but in less than one tenth of the time (which likewise implies one tenth of the compute cost required). This speed-up can enable a large increase in the number of experiments, without requiring additional compute or time. It is important to note that the speed of L-BFGS and SA L-BFGS is essentially tied to the number of features and the number of examples for each iteration. The primary performance gains that can be found are: (a) reducing the number of iterations; (b) reducing the number of examples; or (c) reducing the number of features with non-zero weights. SA L-BFGS adopts the first two strategies. A reduction in the number of features with non-zero weights can be forced through aggressive regularization, but at the expense of specificity.

Table 1: Model 1 Results For Different Learning Mechanisms (VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS)

            VW L-BFGS    CR L-BFGS
seconds     490          114
AUC         .748         .748

Table 2: Model 2 Results For Different Learning Mechanisms (VW = Vowpal Wabbit; CR = Context Relevant FAST L-BFGS; SA = Context Relevant FAST SA L-BFGS)

            VW L-BFGS    CR L-BFGS    CR SA L-BFGS
seconds     515          115          9
AUC         .750         .752         .751

6 Conclusions

We have presented a new tera-scale machine learning system that enables rapid model experimentation. The system uses a new version of L-BFGS to combine the robustness and accuracy of second order gradient descent optimization methods with the memory advantages of online learning. This provides a model building environment that significantly lowers the time and compute cost of asking new questions. The ability to quickly ask and answer experimental questions vastly expands the set of questions that can be asked, and therefore the space of models that can be explored to discover the optimal solution.

SA L-BFGS is also well suited to environments where the underlying distribution of the data provided to a learning algorithm is shifting. Whether the shift is caused by changes in user behavior, changes in market pricing, or changes in term usage, SA L-BFGS can be empirically tuned to dynamically adjust to the changing conditions. One can utilize the parallelized approach in (Agarwal et al., 2012) on each batch of data in the time variable, with the additional freedom to choose the batch size. Furthermore, the statistical aspects of the algorithm provide a useful way to check the consistency of the data in real time. However, like all L-BFGS implementations that rely on small, reduced, or sampled data sets, increased sampling noise from the L-BFGS estimation process affects the quality of the resulting learning algorithm. Users of this new algorithm must take care to provide smooth convex loss functions for optimization. The Context Relevant Flexible Analytics and Statistics Technology™ is designed to provide such functions for optimization.

7 About Context Relevant and the Authors

Context Relevant was founded in March 2012 by Stephen Purpura and Christian Metcalfe. The company was initially funded by friends, family, Seattle angel investors, and Madrona Venture Group.

Stephen Purpura - CEO & Co-Founder - Stephen works as the CEO and CTO of the company. He has more than 20 years of experience, is listed as an inventor of five issued United States patents and served as program manager of Microsoft Windows and the Chief Security Officer of MSFDC, one of the first Internet bill payment systems. Stephen is recognized as a leading expert in the
fields of machine learning, political micro-targeting and predictive analytics.

Stephen received a bachelor's degree from the University of Washington, a master's degree from Harvard, where he was part of the Program on Networked Governance at the John F. Kennedy School of Government, and he is set to be granted a PhD in information science from Cornell.

Jim Walsh - VP Engineering - Jim brings 23 years of experience in technology innovation and engineering management to Context Relevant. He founded, built and led the Cosmos team (Microsoft's massive scale distributed data storage and analysis environment, underpinning virtually all Microsoft products including Bing) and founded the Bing multimedia search team. In his final position at Microsoft, Jim served as the principal architect for Microsoft advertising platforms.

Prior to joining Microsoft, Jim created two software startups, wrote the second-ever Windows application and created custom computer animation software for the television broadcast industry. He is recognized as one of the world's leading network performance experts and has been granted twenty-two technology patents that range from performance to user interface design. Jim earned a bachelor of science degree in computing science from the University of Alberta.

Dustin Rigg Hillard - Director of Engineering and Data Scientist - Dustin is a recognized data science and machine learning expert who has published more than 30 papers in these areas. Previously at Microsoft and Yahoo!, he spent the last decade building systems that significantly improve large-scale processing and machine learning for advertising, natural language and speech.

Prior to joining Context Relevant, Dustin Hillard worked for Microsoft, where he worked to improve speech understanding for mobile applications and XBox Kinect. Before that he was at Yahoo! for three years, where he focused on improving ad relevance. His research in graduate school focused on automatic speech recognition and statistical translation. Dustin incorporates approaches from these and other fields to learn from massive amounts of data with supervised, semi-supervised, and unsupervised machine learning approaches.

Dustin holds a bachelor's and master's degree as well as a PhD in Electrical Engineering from the University of Washington.

Scott Golder - Data Scientist & Staff Sociologist - Scott mines social networking data to investigate broad questions such as when people are happiest (mornings and weekends) and how Twitter users form new social ties. His work has been published in the journal Science as well as top computer science conferences by the ACM and IEEE, and has been covered in media outlets such as MSNBC, The New York Times, The Washington Post and National Public Radio. He has also been profiled by LiveScience's "ScienceLives".

He has worked as a research scientist in the Social Computing Lab at HP Labs and has interned at Google, IBM and Microsoft. Scott holds a master's degree from MIT, where he worked with the Media Laboratory's Sociable Media Group, and graduated from Harvard University, where he studied Linguistics and Computer Science. Scott is currently on leave from the PhD program in Sociology at Cornell University.

Mark Hubenthal - Member of the Technical Staff - Mark recently received his PhD in Mathematics from the University of Washington. He works on inverse problems applicable to medical imaging and geophysics.

Scott Smith - Principal Engineer and Architect - Scott has experience with distributed computing, compiler design, and performance optimization. At Akamai, he helped design and implement the load balancing and failover logic for the first CDN. At Clustrix, he built the SQL optimizer and compiler for a distributed RDBMS. He holds a master's degree in computer science from MIT.

References

[Agarwal et al., 2012] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. 2012. A reliable effective terascale linear learning system. Feb. arXiv:1110.4198v2.

[Bottou, 2010] L. Bottou. 2010. Large-scale machine learning with stochastic gradient descent. pages 177–187.

[Duchi et al., 2010] J. Duchi, E. Hazan, and Y. Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization.

[Fawcett, 2004] T. Fawcett. 2004. ROC graphs: Notes and practical considerations for researchers.

[Karampatziakis and Langford, 2011] N. Karampatziakis and J. Langford. 2011. Online importance weight aware updates.

[Langford et al., 2009] J. Langford, L. Li, and T. Zhang. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801.

[Schraudolph et al., 2007] N. Schraudolph, J. Yu, and S. Günter. 2007. A stochastic quasi-newton method for online convex optimization.

[Tencent, 2012] Tencent. 2012. KDD Cup 2012, Track 2 Data. http://www.kddcup2012.org/c/kddcup2012-track2/dat
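
Supplementary sketch. As a companion to the update test in Algorithm 1 (Section 4.2), the following is a minimal illustration in plain Python of the error function (11) and the comparison against $\sigma(t)$. The function names (`predict`, `error_rate`, `should_refit`), the toy batches, and the use of the population standard deviation for $\sigma(t)$ are assumptions made for illustration only; the paper does not specify these details, and this is not the Context Relevant implementation.

```python
import statistics
from typing import List, Sequence, Tuple

# A batch is (X_s, Y_s) with one example per row of X_s.
Batch = Tuple[List[List[float]], List[float]]


def predict(theta: Sequence[float], x: Sequence[float]) -> float:
    """Linear predictor p_theta(x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def error_rate(batches: Sequence[Batch], theta: Sequence[float]) -> float:
    """Error function (11): summed absolute error over all batches,
    divided by the total number of examples."""
    total = 0.0
    count = 0
    for X, Y in batches:
        for x, y in zip(X, Y):
            total += abs(predict(theta, x) - y)
            count += 1
    return total / count


def should_refit(err_with_new: float, err_before: float,
                 history: Sequence[float]) -> bool:
    """Return True when I(t+1, theta_t) - I(t, theta_t) > sigma(t).

    sigma(t) is taken here as the population standard deviation of the
    past error values (an assumption), and as 0 when only one value exists.
    """
    sigma = statistics.pstdev(history) if len(history) > 1 else 0.0
    return err_with_new - err_before > sigma


theta = [1.0]
old_batch: Batch = ([[0.0], [1.0]], [0.0, 1.0])      # fitted exactly by theta
drifted_batch: Batch = ([[1.0], [1.0]], [0.0, 0.0])  # labels no longer match
stable_batch: Batch = ([[1.0], [0.0]], [1.0, 0.0])   # same pattern as before

err_old = error_rate([old_batch], theta)
err_drift = error_rate([old_batch, drifted_batch], theta)
err_stable = error_rate([old_batch, stable_batch], theta)

refit_on_drift = should_refit(err_drift, err_old, [err_old])
refit_on_stable = should_refit(err_stable, err_old, [err_old])
```

In this toy run the drifted batch raises the error and triggers a refit, while the stable batch leaves the parameter unchanged; choosing $M_{old}$ and $M_{new}$ and running L-BFGS are deliberately left out of the sketch.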