We have plenty of other engines out there, and we have SQL / Java language trends and a thirst for answers.
Example: each task takes a certain amount of time, but we want to buffer that time and leave room for Murphy's law (something will go wrong or take a bit longer), so we add 20% more time. Each unit is independent.
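As a purely illustrative sketch (the task names and times below are made up), the 20% buffer is just a multiplier applied to each independent estimate:

```python
# Hypothetical task estimates in minutes; each unit is independent.
estimates = {"ingest": 30, "transform": 45, "load": 15}

BUFFER = 1.20  # leave 20% extra room for Murphy's law

# Pad every estimate, then sum for the total time budget.
buffered = {task: minutes * BUFFER for task, minutes in estimates.items()}
print(buffered)
print(sum(buffered.values()))
```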
Concise and declarative, but it also gives the CPU a fuller description of how to handle the problem.
Anywhere is the real answer. Here are a few examples that have been bubbling up among our customers.
Speaker: pick one of these and briefly describe an example that you know about, in 4-5 sentences. One ‘close’ to the audience is of course best. There’s a separate use-case deck by industry on the workshop wiki page that you can use for ideas.
Example from movie: “IDENTITY THEFT”
Some common applications of outlier detection include:
Fraud detection: The purchasing behavior of a credit card owner usually changes when the card is stolen, and the abnormal buying patterns can indicate fraud.
Medicine: Unusual test results may indicate an underlying health issue
Sports: Exceptional players may appear as outliers in particular parameters and can be placed in positions where the team benefits most
Gas, convenience, retail
Before we can detect an outlier, we have to define it. The most intuitive definition I’ve seen is from a 1980 book, Identification of Outliers (Hawkins):
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”
Speaker: It’s worth repeating that all of these topics are deep enough to fill entire university library shelves with books. We’re just skimming the surface.
Anomaly detection is related to clustering, but almost its inverse.
If points that are similar to each other cluster together (representing “normal” behavior or patterns), we want to find the points that are NOT in any cluster.
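A minimal sketch of that "points outside every cluster" framing, assuming scikit-learn is available (DBSCAN labels un-clustered points as -1):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Three "normal" clusters plus a couple of deliberately odd points.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = np.vstack([X, [[10, 10], [-10, 8]]])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# DBSCAN marks points that belong to no cluster with the label -1;
# under this framing, those are our anomaly candidates.
outliers = X[labels == -1]
print(len(outliers), "points fall outside every cluster")
```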
What: Identify which group a new observation belongs in.
This is a simplified visual example, taken from Andrew Ng’s ML class. Plotted are only 2 dimensions (features) from the full feature vector – age and tumor size.
The instances provided are “labeled” – in this case marked in red/x vs. blue/circle.
Note that some red instances are inside the blue cluster, and similarly some blue points sit inside the red cluster. This reflects the fact that training sets sometimes contain noise, making the learning task non-trivial.
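As a rough sketch of this kind of supervised classification (the data below is synthetic with deliberately flipped labels, not the actual data from Ng's class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(20, 80, n),   # feature 1: age
                     rng.uniform(0, 10, n)])   # feature 2: tumor size
y = (X[:, 1] > 5).astype(int)                  # crude "malignant" rule

flip = rng.random(n) < 0.05                    # 5% label noise: the
y[flip] = 1 - y[flip]                          # red-inside-blue points

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[55, 7.2]]))  # which group does a new observation belong in?
```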
Regression is supervised learning where instead of predicting a category (like malignant or benign from the previous examples) we predict a “value” – a number.
In this example (again from Andrew Ng’s class) we are trying to predict the “price of a house” given a single variable: size in square feet.
Clearly more complex models in multiple dimensions would be better; for example, we can use other features like “age of house”, “number of previous owners”, “geographic location”, or “score for closest public school”.
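A minimal regression sketch along these lines, with invented numbers rather than the course's data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[850], [1200], [1500], [2100], [2600]])  # square feet
price = np.array([150, 200, 240, 330, 400])               # $ thousands

model = LinearRegression().fit(size, price)
print(model.predict([[1800]]))  # predicted price for an unseen size

# Extending to multiple dimensions is just more columns in the feature
# matrix, e.g. [size, age_of_house, n_previous_owners, school_score].
```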
With unsupervised learning, we again have as input a feature matrix with rows as instances and columns as variables, but NO LABELS.
Now the goal is to find a label (cluster number) for each instance, but we are not learning a given function to match; rather, we are trying to figure out the natural way instances may be grouped together.
Note that we are usually NOT given the desired number of clusters (often called “K”), and may need to determine this on our own.
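A small sketch of that situation, assuming scikit-learn: we cluster with no labels in hand, and since K isn't given, one common heuristic (among several) is to scan a few values and compare silhouette scores:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data; pretend we don't know it was generated with 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```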
Over-fitting means the model performs very well on the training set but does not generalize well, so results on unseen data are poor.
As shown in the diagram, this means the model learned the specific granular details of the training set and not the generic function it was meant to learn.
This is why we “evaluate” on the validation set (and not the training set), because if we measured error on the training set we may get a false sense of performance if the model is over-fitting.
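A small sketch of that evaluation logic, with arbitrary degrees and noise level: a high-degree polynomial can look nearly perfect on the training set while doing much worse on held-out data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 40)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, model.score(X_tr, y_tr), model.score(X_val, y_val))
# The degree-15 model scores high on training but lower on validation:
# the training error alone would give a false sense of performance.
```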
Under-fitting means the model doesn’t have enough degrees of freedom to learn the needed model, and usually has high bias.
Underfitting is often a result of an excessively simple model. In practice you won’t encounter underfitting very often. Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few. Nonetheless, when building any predictive model, you should use validation or cross-validation to assess predictive accuracy and avoid these problems. Here we may have many observations, but too few features (matrix is tall and narrow).
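A minimal cross-validation sketch along those lines, with arbitrary stand-in models and synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Compare models by cross-validated score, not training-set fit.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold R^2 scores
    print(type(model).__name__, scores.mean().round(3))
```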