- 1. Copyright © 2013 Geoffrey I Webb
Fundamental and Advanced Machine Learning Methods for
Big Data Applications
Geoffrey I Webb,
Ana Martinez, Nayyar Zaidi, Shenglei Chen
Monash University
http://www.csse.monash.edu.au/~webb
Overview
• Big data
• Classification learning
• Sampling
• Dimensionality reduction
• Scaling-up existing algorithms
• Stream learning
• Bias and variance and big data
• Selective KDB
• Incremental Bayesian Network Classifiers
Big data
• Can mean many things
– Complex integration of many heterogeneous data sources
– Very large/streaming data
Name (SI decimal prefix)   Value (bytes)   Binary usage (bytes)
kilobyte (kB)              10^3            2^10
megabyte (MB)              10^6            2^20
gigabyte (GB)              10^9            2^30
terabyte (TB)              10^12           2^40
petabyte (PB)              10^15           2^50
exabyte (EB)               10^18           2^60
zettabyte (ZB)             10^21           2^70
yottabyte (YB)             10^24           2^80
Big data | Class learning | Sampling | Dimensionality red’n | Scaling-up | Streams | Bias/variance | Selective KDB | Incremental BNC
What is ‘big’?
• Number of
– instances
– dimensions
– classes
• Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process
the data within a tolerable elapsed time. Big data sizes are a
constantly moving target, as of 2012 ranging from a few dozen
terabytes to many petabytes of data in a single data set.
– Wikipedia
• Machine learning research usually treats more than 1 million examples
as very large.
Examples
• Spelling correction
• Translation
• Farecast
• Recommender systems
• Electoral outcomes
Whitelaw, C, B Hutchinson, GY Chung, & G Ellis. "Using the web for language independent spellchecking and autocorrection." In Proceedings of the 2009 Conference
on Empirical Methods in Natural Language Processing: Volume 2, pp. 890-899. Association for Computational Linguistics, 2009.
Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. Penguin Press, 2012.
Not a universal panacea
• Jeopardy but not chess
• Spelling correction and translation but not comprehension
http://www.engadget.com/2011/02/15/watson-soundly-beats-the-humans-in-first-round-of-jeopardy/
Classification learning
Evolving distributions
• Key issue
– Is the distribution from which the data are drawn static or
dynamic?
– Concept drift
• class membership changes, e.g. who counts as rich
– Concept evolution
• new classes emerge
– Distribution drift
• probabilities change
Dimension of change
• Normally time, but may be another dimension such as location
• Classifier can only take dimension of change into account if
data to be classified will fall within current scope or if it is
possible to extrapolate
[Figure: two plots, each showing Training and Testing data]
Loss functions
Imbalanced classes
• Many big datasets have a rare class of interest and a
majority class from which we seek to distinguish it.
– Ad click-through
– Conversions
– Disease
– Fraud
– Homeland security
Loss functions for imbalanced classes
                Predicted Pos   Predicted Neg
Actual Pos      TP              FN
Actual Neg      FP              TN
• Area under the ROC curve
– True Positive Rate (TPR) = TP / (TP + FN)
– False Positive Rate (FPR) = FP / (FP + TN)
Prof. William H. Press, “Unit 17: Classifier Performance: ROC, Precision-Recall, and All That.”
http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf
• Area under the Precision Recall Curve
– Recall = True Positive Rate (TPR) = TP / (TP + FN)
– Precision = TP / (TP + FP)
Prof. William H. Press, “Unit 17: Classifier Performance: ROC, Precision-Recall, and All That.”
http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf
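As an illustration of the rates named on these slides, the following minimal sketch computes them from the four confusion-matrix cells. The function names and the toy labels are assumptions made for this example, not part of the original deck.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def rates(tp, fn, fp, tn):
    """Return (TPR or recall, FPR, precision); undefined ratios become None."""
    tpr = tp / (tp + fn) if tp + fn else None
    fpr = fp / (fp + tn) if fp + tn else None
    precision = tp / (tp + fp) if tp + fp else None
    return tpr, fpr, precision


# Toy imbalanced data: 2 positives among 10 instances.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(actual, predicted)
tpr, fpr, precision = rates(tp, fn, fp, tn)
```

On imbalanced data such as this, the false positive rate stays small because TN dominates its denominator, while precision reacts to each false positive.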
Mutual information
Learning curves
[Figure: learning curve for KDB k=2; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.7)]
Sampling
• Select s instances from a dataset of size n
• Important that sample be selected randomly
• Make sure you use a robust random number generator
Ideal Sampling
• Select data quantity at which learning curve approaches asymptotic error and
learn from sample
[Figure: learning curve for KDB k=2; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.7)]
Finding asymptotic error
• Progressive sampling
[Figure: learning curve for KDB k=2; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.7)]
Provost, F, D Jensen, T Oates. “Efficient progressive sampling.” In Proc 5th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 23-32. ACM, 1999.
Hoeffding's bound
[Formula not recovered; its annotated terms are the sample mean, the population mean, the sample size, and the error margin]
Hulten, G, and P Domingos. "Mining complex models from arbitrarily large databases in constant time." In Proceedings 8th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 525-531. ACM, 2002.
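One common one-sided statement of Hoeffding's bound says that for n independent observations of a variable with range R, the population mean exceeds the sample mean by more than ε with probability at most exp(-2nε²/R²). The sketch below solves that expression for the error margin and for the sample size. The function and parameter names are assumptions for illustration, and this form may differ from the one shown on the original slide.

```python
import math


def hoeffding_margin(value_range, n, delta):
    """Error margin eps such that, with probability at least 1 - delta,
    the population mean is no more than eps above the sample mean of n
    independent observations whose values span `value_range`."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def sample_size_for(value_range, eps, delta):
    """Smallest n for which the margin above is at most eps."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))


margin = hoeffding_margin(1.0, 1000, 0.05)
needed = sample_size_for(1.0, 0.01, 0.05)
```

The margin shrinks with the square root of the sample size, so halving the margin requires about four times as many instances.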
Maximum sample
• Take the largest sample that available capacity can handle
[Figure: learning curve for KDB k=2; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.7)]
• Saves overheads of repeated sampling and risk of terminating
too soon
• Has risk that asymptotic error may not be reached
– but alternative techniques wouldn’t be able to handle a
larger sample anyway!
Sampling with and without replacement
• Sampling involves deciding how many times Ki each element i
of a collection should occur in the sample
• Sampling without replacement restricts Ki to 0 or 1
Uniform fixed-sized sampling with replacement for fixed n
selected ← 0
while selected < s
    add a randomly selected instance to the sample
    increment selected
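The loop above could be written in Python roughly as follows. This is a minimal sketch: the function name and the optional `rng` parameter are assumptions for illustration, and the data are assumed to fit in an indexable sequence.

```python
import random


def sample_with_replacement(data, s, rng=None):
    """Draw s instances uniformly at random from data, with replacement."""
    rng = rng or random.Random()
    sample = []
    selected = 0
    while selected < s:
        # Every instance is equally likely on every draw, so repeats can occur.
        sample.append(data[rng.randrange(len(data))])
        selected += 1
    return sample


example = sample_with_replacement(list(range(10)), 25, random.Random(0))
```

Because draws are independent, s may exceed the number of instances, as in the example.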
Uniform sequential variable-sized sampling without replacement
i ← 1
while i ≤ n
    with fixed probability do
        add instance i to the sample
    increment i
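A minimal Python sketch of this scheme, in which every instance is kept independently with the same probability, so the sample size varies from run to run. The function name and the parameter `p` are assumptions for illustration.

```python
import random


def bernoulli_sample(stream, p, rng=None):
    """Keep each instance of the stream independently with probability p."""
    rng = rng or random.Random()
    sample = []
    for instance in stream:
        # rng.random() is uniform on [0, 1), so this test succeeds with probability p.
        if rng.random() < p:
            sample.append(instance)
    return sample


kept = bernoulli_sample(range(1000), 0.1, random.Random(1))
```

The expected sample size is p times the stream length, and only one sequential pass is needed.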
Uniform sequential fixed-sized sampling without replacement for known n
selected ← 0
i ← 1
while selected < s
    with probability (s - selected)/(n - i + 1) do
        add instance i to the sample
        increment selected
    increment i
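The selection probability above can be sketched in Python as follows. The function name is an assumption for illustration, and the data are assumed to have a known length.

```python
import random


def selection_sample(data, s, rng=None):
    """Select exactly s of the n instances in one sequential pass."""
    rng = rng or random.Random()
    n = len(data)
    if s > n:
        raise ValueError("sample size exceeds data size")
    sample = []
    for i, instance in enumerate(data, start=1):
        if len(sample) == s:
            break
        # Probability = (instances still needed) / (instances still available).
        if rng.random() < (s - len(sample)) / (n - i + 1):
            sample.append(instance)
    return sample


chosen = selection_sample(list(range(100)), 10, random.Random(2))
```

When the number of instances remaining equals the number still needed, the probability becomes 1, which is why the sample always reaches size s.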
Uniform sequential fixed-sized sampling without replacement for unknown n
count ← 0
while count < s and more instances remain
    add the next instance to the sample
    increment count
while more instances remain
    increment count
    with probability s/count do
        add the next instance to the sample, replacing an existing instance selected at random
    else
        discard the next instance
Tille, Yves. Sampling algorithms. Springer, 2006.
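This procedure is commonly known as reservoir sampling. A minimal Python sketch follows; the function name is an assumption for illustration, and the stream may be any iterable of unknown length.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Maintain a uniform sample of size s over a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    count = 0
    for instance in stream:
        count += 1
        if count <= s:
            # Fill the reservoir with the first s instances.
            sample.append(instance)
        elif rng.random() < s / count:
            # Replace a uniformly chosen resident with the new instance.
            sample[rng.randrange(s)] = instance
        # Otherwise the instance is discarded.
    return sample


reservoir = reservoir_sample(range(1000), 10, random.Random(4))
```

If the stream holds fewer than s instances, the sketch simply returns all of them.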
Dimensionality reduction
• Many learning algorithms are super-linear with respect to
dimensionality
• Dimensionality can be reduced by
– feature selection
– feature projection
Feature selection
• Most powerful techniques are too computationally intensive for big
data
– Eg wrapper techniques
– Best approach varies depending on base learner
• Techniques that consider only the relationship between an attribute
and the class are efficient
– Eg top-k mutual information
– However, overlook complex interactions between attributes
• May be most effective to use powerful technique on a sample
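Top-k mutual information selection can be sketched as below for discrete attributes. The function names and the toy data are assumptions for illustration. Mutual information is estimated from empirical frequencies, and, as noted above, scoring each attribute separately ignores interactions between attributes.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def top_k_attributes(rows, labels, k):
    """Indices of the k attributes with the highest MI with the class."""
    n_attrs = len(rows[0])
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(n_attrs)
    ]
    return sorted(range(n_attrs), key=lambda j: scores[j], reverse=True)[:k]


rows = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
labels = [0, 0, 1, 1]
best = top_k_attributes(rows, labels, 1)
```

In the toy data only the first attribute carries information about the class, so it is the one selected.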
Feature Projection
• Project feature space onto lower dimensional space
• Principal Components Analysis
• First principal component is the linear projection that maximises
variance (= minimises RMSE with respect to original)
• Subsequent principal components are those that maximise variance (=
minimise RMSE) while being uncorrelated with prior components
• First few principal components often capture most of the variation (=
information) in the data
• Generalisations including principal curves and manifolds project onto
manifolds instead of planes
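As a rough illustration of the first principal component, the sketch below estimates it by power iteration on the sample covariance matrix, using only the standard library. The function names are assumptions for illustration. Power iteration is one of several ways to find the leading direction, and it can fail if the starting vector happens to be orthogonal to that direction.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (means, direction) for the direction of maximum variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # Repeated multiplication by cov pulls v towards the leading eigenvector.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v


def project(row, means, direction):
    """Coordinate of one instance along the given direction."""
    return sum((x - m) * u for x, m, u in zip(row, means, direction))


points = [(float(x), 2.0 * x) for x in range(10)]
means, direction = first_principal_component(points)
```

For these points, which lie on a line of slope 2, the recovered direction is proportional to (1, 2).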
Scaling-up existing algorithms
• Distributed cloud/cluster computing
• Hadoop
– Commodity clusters
– Map Reduce
• Map problem onto sub-problems and distribute these
• Assemble solution from solutions to sub-problems
White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012
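The map and reduce steps can be illustrated in-process with plain Python. This sketch does not use the Hadoop API. It only shows the pattern on a task relevant to later slides, counting how often each attribute value co-occurs with each class. The function names and the toy partitions are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def map_partition(partition):
    """Map step: count (attribute index, value, class) triples in one partition."""
    counts = Counter()
    for attributes, label in partition:
        for j, value in enumerate(attributes):
            counts[(j, value, label)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge the counts from two sub-problems."""
    return left + right


# Each partition stands in for the data held by one worker.
partitions = [
    [((0, 1), "a"), ((0, 0), "b")],
    [((0, 1), "a")],
]
total = reduce(reduce_counts, map(map_partition, partitions), Counter())
```

Because counts add, the partitions can be processed in any order or in parallel and still give the same totals.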
Streaming algorithms
• Handle data that are too large to retain
– computer network/phone traffic, financial transactions, web
searches, sensor data
• May be difficult to get labelled data
• Strong memory and running time constraints
– learning rate must be greater than the data rate
– only limited data can be retained
• Real-time accuracy evaluation and formalisation are needed, mainly so
that parameters can be adjusted accordingly.
Online and incremental learning
• Online learning
– Data arrives as input stream
– Classifier makes prediction
– Then correct classification is revealed and classifier updated
– Examples
• Ad placement, online conversions
• Incremental learning
– Classifier is updated as input arrives
– Resulting classifier is identical to the batch classifier learned from the same data
Auer, Peter. “Online Learning.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 736-743.
Streaming Strategies
• Retain samples of data and learn from these
– Continually assess current model against incoming data and
when models lose accuracy take new samples and relearn
• Continually update a model using current data
– Refine using new data
– Prune elements that decline in accuracy
• Create ensemble of classifiers each learned from successive time
periods
– Retire older classifiers as newer ones are created
Weighted majority algorithm
• Each classifier E has a weight w_t(E)
• Classification by weighted majority vote
• All incorrect classifiers have their weights reduced: w_{t+1}(E) = β w_t(E), 0 < β < 1
• Error is bounded to roughly twice the error of the best classifier
(plus a term logarithmic in the number of classifiers)
Littlestone, N, and MK Warmuth. "The weighted majority algorithm." In 30th Annual Symposium on Foundation of Computer Science, pp. 256-261. IEEE, 1989.
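A minimal sketch of the weighted majority scheme for binary predictions. The function name, the tie-breaking rule (ties predict 1) and the default β = 0.5 are assumptions for illustration.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run weighted majority voting over a sequence of rounds.

    expert_predictions holds one list of 0/1 predictions per round,
    with one entry per classifier; outcomes holds the true labels.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for predictions, outcome in zip(expert_predictions, outcomes):
        vote_one = sum(w for w, p in zip(weights, predictions) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, predictions) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        if guess != outcome:
            mistakes += 1
        # Every classifier that was wrong has its weight multiplied by beta.
        weights = [
            w * beta if p != outcome else w
            for w, p in zip(weights, predictions)
        ]
    return weights, mistakes


# Classifier 0 is always right; classifiers 1 and 2 are always wrong.
outcomes = [1, 0, 1, 0]
rounds = [[o, 1 - o, 1 - o] for o in outcomes]
final_weights, ensemble_mistakes = weighted_majority(rounds, outcomes)
```

In the example the ensemble is misled twice before the weight of the two poor classifiers falls below that of the good one.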
Winnow
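• Predict 1 if Σ_i w_i x_i > θ
– θ: threshold
– x_i: binary attributes
– w_i: non-negative real valued weights
• Weight updates after a mistake:
Prediction   Correct   x_i = 0     x_i = 1
1            0         unchanged   demoted
0            1         unchanged   promoted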
Littlestone, N. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm." Machine Learning 2(4)(1988): 285-318.
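A minimal sketch of a Winnow-style learner. The slide's exact update formulas were not recovered, so this sketch assumes the multiplicative variant in which promotion multiplies a weight by α and demotion divides it by α. The function name, α = 2 and the default threshold of half the number of attributes are assumptions for illustration.

```python
def winnow_train(examples, n_attributes, alpha=2.0, threshold=None):
    """Train on (binary attribute vector, 0/1 label) pairs in one pass."""
    theta = threshold if threshold is not None else n_attributes / 2
    weights = [1.0] * n_attributes
    mistakes = 0
    for x, label in examples:
        total = sum(w * xi for w, xi in zip(weights, x))
        prediction = 1 if total > theta else 0
        if prediction == label:
            continue  # weights change only after a mistake
        mistakes += 1
        # Promote on a missed positive, demote on a false positive.
        factor = alpha if label == 1 else 1.0 / alpha
        # Only weights of attributes with x_i = 1 are updated.
        weights = [w * factor if xi == 1 else w for w, xi in zip(weights, x)]
    return weights, theta, mistakes


examples = [((1, 0), 1), ((0, 1), 0), ((1, 1), 0)]
weights, theta, mistakes = winnow_train(examples, 2, alpha=2.0, threshold=1.0)
```

Weights stay non-negative because they start at 1 and are only ever multiplied or divided by α.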
Stochastic gradient descent
• Many classifiers have parameters that are learned by optimisation
e.g. logistic regression and SVM
– usually requires many passes through the data
• For linear classifiers stochastic gradient descent often converges before
a single pass is completed.
– global gradient approximated by the gradient at each example
– performs sequential updates
– good step size is essential
• learn from an initial sample
– must take examples in random order
Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." In Proceedings 21st International Conference on Machine learning, p. 116. ACM, 2004.
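As an illustration, the sketch below applies stochastic gradient descent to logistic regression, updating the parameters after every example. The function name, the fixed step size and the toy data are assumptions for illustration. In practice the step size would be tuned, for example on an initial sample as suggested above.

```python
import math
import random


def sgd_logistic(examples, n_features, step=0.1, epochs=1, rng=None):
    """Fit logistic regression weights with per-example gradient updates."""
    rng = rng or random.Random()
    weights = [0.0] * n_features
    bias = 0.0
    order = list(examples)
    for _ in range(epochs):
        rng.shuffle(order)  # examples are taken in random order
        for x, y in order:
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # gradient of the log loss with respect to z
            # The gradient at this single example stands in for the global gradient.
            weights = [w - step * error * xi for w, xi in zip(weights, x)]
            bias -= step * error
    return weights, bias


examples = [((x / 10.0,), 1 if x > 0 else 0) for x in range(-10, 11) if x != 0]
w, b = sgd_logistic(examples, 1, step=0.5, epochs=5, rng=random.Random(0))
```

Each update costs time proportional to the number of features, independent of the number of examples.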
Bias and variance
• Learning curves are not all equal
[Figure: learning curves for KDB k=2 and KDB k=5; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.8)]
• A major factor in the difference between learning curves
• Decomposition of 0-1 loss
• Bias and variance relate to the performance of the learner
given different training sets
“Bias Variance Decomposition.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 100-101.
[Figure: several training sets are each passed to the learner, and the resulting classifiers' predictions for the same test instances are compared]
Variance ≈ (lower limit on) error due to variability in response to sampling
Bias ≈ error due to central tendency of the learner
Bias = error - variance
[Figure: four panels showing high bias/high variance, low bias/high variance, high bias/low variance, and low bias/low variance]
Image from Bias Variance Decomposition, in Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 100-101.
Intrinsic error
• Many bias/variance analyses also include intrinsic error
• For our purposes this is included in bias
Bias/variance and big data
• As data quantity increases, variance should decrease
• Low variance important for small data
• Low bias important for big data
Low bias important for big data
• Low bias requires capacity to describe wide variety of
multivariate distributions
• Big datasets contain fine detail needed to precisely delineate
complex multivariate distributions
Bias/variance and big data
[Figure: learning curves for Naïve Bayes, KDB k=2 and KDB k=5; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.8)]
Most machine learning research has used small data
[Figure: learning curves for Naïve Bayes, KDB k=2 and KDB k=5; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.8)]
Computational tractability
• Error will be minimised by low bias algorithms
• Big data require efficient computation
– Linear wrt size
– Learn in a limited number of passes
• Most low-bias learners are compute intensive
– super-linear with respect to data quantity
– Kernel SVM and Random Forests
k-dependence Bayesian classifier (KDB)
• Bayesian network classifier proposed by Sahami (1995).
• KDB
– the probability of each attribute value is conditioned by the class
and at most k other attributes.
– Extends TAN to multiple parents.
[Figure: Bayesian network with class node C and attribute nodes A1, A2, A3, A4]
• k=0 is Naïve Bayes
• Increasing k increases variance and decreases bias
• High k with low bias should have low error for big data.
[Figure: Bayesian network with class node C and attribute nodes A1, A2, A3, A4]
[Figure: learning curves for Naïve Bayes, KDB k=2 and KDB k=5; x-axis: data quantity (0 to 1,000,000), y-axis: root mean squared error (0 to 0.8)]
KDB algorithm
1st pass:
• Order attributes according to mutual information (MI) with the class.
2nd pass:
• Assign k parents to each attribute according to MI conditioned on the
class.
• Add the class as parent of all attributes
[Figure: Bayesian network with class node C and attribute nodes A1, A2, A3, A4]
Two pass learning
[Complexity formulas not recovered; their annotated terms are the number of instances, number of attributes, number of classes, and average number of values per attribute]
Selective KDB - Motivation
• KDB is efficient and effective for large data.
• Irrelevant attributes can increase error.
• Cannot predetermine the best k for a given data quantity.
• Want an efficient way to select attributes and best k.
Selective KDB
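[Figure: attributes are ordered by MI(Ai;C) and parents assigned by MI(Ai;Aj|C); nested networks over the leading attributes (A1 to A4) yield loss values LF1, LF2, LF3, LF4, and the best is selected]
• Leave-one-out cv (Pazzani’s trick)
• Attributes ordered by MI
• Each alternative model tested is a minor addition to the previous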
• Loss function can be RMSE, 0-1 loss, Matthews Correlation
Coefficient (for unbalanced datasets), etc.
• Still the value of k has to be tuned.
– Solution: Selective2 KDB: matrix of loss function results, k × n.
a1   a2   a3   a4   a5   a6
     p1   p1   p1   p1   p1
          p2   p2   p2   p2
               p3   p3   p3
[Table: training time and test time for KDB, Selective KDB and Selective2 KDB; cell contents not recovered]
Selective KDB – Results (RMSE)
• Competitive with KDB in 16 very large datasets (165K-54.6M examples):
                   KDB
selective KDB      8-8-0    5-11-0   5-11-0   6-10-0   6-9-1
k-selective KDB    5-11-0
• Mean best k = 4.11
• Mean % attributes selected = 82.66 ± 26.72
• Comparison with Random Forest.
                   RF (5EF)                   RF (Num)
                   Trees = 10   Trees = 100   Trees = 10   Trees = 100
k-selective KDB    6-1-6        4-1-7         5-0-8        4-0-8
• Need to sample in 3/4 (out of 16) datasets to get RF 10/100 results.
                             Mnist             MITC              Satellite         Splice
                             (250K/8.1M)       (600K/839K)       (2M/8.7M)         (10M/54.6M)
RF (100)          Sample     0.2958 ± 0.0017   0.0518 ± 0.0007   0.4568 ± 0.0006   0.0530 ± 0.0005
k-selective KDB   Sample     0.2324 ± 0.0029   0.0455 ± 0.0019   0.4531 ± 0.0011   0.0521 ± 0.0006
k-selective KDB   All data   0.1449 ± 0.0007   0.0446 ± 0.0020   0.4448 ± 0.0004   0.0523 ± 0.0002
Selective KDB – Results (MCC)
• Unbalanced datasets: use MCC as loss function.
• Splice dataset: 0.32% of positive classes.
KDB      selective KDB
0.1768   0.1918
0.1855   0.1984
0.1932   0.2043
0.1986   0.2105
0.2061   0.2148
• Comparison with Random Forest.
                             MITC                   Splice
                             (numeric attributes)   (discrete attributes)
                             (600K/839K)            (10M/54.6M)
RF (100)          Sample     0.9989                 0.0950
k-selective KDB   Sample     0.9954                 0.1963
k-selective KDB   All data   0.9956                 0.2148
Incremental Bayesian Network Classifiers
[Figure: Bayesian network with class node y and attribute nodes x1, x2, x3, …, xn]
Incremental naïve Bayes
• Probability estimates are based on counts of the frequency of
each attribute value co-occurring with the class
• These can be updated incrementally
• Can these desirable features be generalised to more
sophisticated learners?
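The counting scheme described above can be sketched as a small class whose counts are updated one instance at a time. The class name, the add-one (Laplace) smoothing and the toy data are assumptions for illustration. Attributes are assumed to be discrete.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over discrete attributes, maintained as running counts."""

    def __init__(self):
        self.n = 0
        self.class_counts = defaultdict(int)  # class -> count
        self.value_counts = defaultdict(int)  # (attribute index, value, class) -> count
        self.values_seen = defaultdict(set)   # attribute index -> distinct values

    def update(self, attributes, label):
        """Incorporate one labelled instance by incrementing counts."""
        self.n += 1
        self.class_counts[label] += 1
        for j, value in enumerate(attributes):
            self.value_counts[(j, value, label)] += 1
            self.values_seen[j].add(value)

    def predict(self, attributes):
        """Return the class with the highest smoothed log-probability."""
        best_label, best_score = None, float("-inf")
        n_classes = len(self.class_counts)
        for label, c in self.class_counts.items():
            score = math.log((c + 1) / (self.n + n_classes))
            for j, value in enumerate(attributes):
                count = self.value_counts.get((j, value, label), 0)
                n_values = len(self.values_seen.get(j, ()))
                # Add-one smoothing (an assumption of this sketch).
                score += math.log((count + 1) / (c + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = IncrementalNaiveBayes()
for attributes, label in [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"),
    (("rain", "cool"), "yes"),
]:
    nb.update(attributes, label)
```

Because the model consists only of counts, the order in which instances arrive does not affect the result, and predictions can be made at any point in the stream.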
Adding edges reduces bias
• With additional edges it is possible to exactly represent all
naïve Bayes distributions and more
– Lower bias
– Higher variance
– Should be more accurate for bigger data
– But which edges should we add?
[Figure: Bayesian network with class node y and attribute nodes x1, x2, x3, …, xn]
Averaged n-Dependence Estimators
• Develop all of a family of classifiers that each add edges to
naïve Bayes
• Select order of dependence, n
• Each model selects n attributes
– All other attributes are independent given these attributes
and the class
– Each model has lower bias but higher variance than NB
– Ensembling reduces the variance
Webb, GI, JR Boughton, F Zheng, KM Ting, H Salem. "Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification." Machine Learning 86(2) (2012): 233-272
[Figure: all subsets of n attributes]
• Incremental learning in a single pass through the data
• Training time complexity O(m a^{n+1})
• Classification time complexity O(a^{n+1} k)
• Space complexity O(a^{n+1} v^{n+1} k)
– m: number of training examples
– a: number of attributes
– k: number of classes
– v: average number of values per attribute
• As n increases bias decreases
– Good for big data
[Figure: learning curves for Naïve Bayes, A1DE, A2DE and A3DE; x-axis: 0 to 1,000,000, y-axis: root mean squared error (0 to 0.7)]
Subsumption resolution
• If P(x1 | x2) = 1.0 then P(y | x1,x2) = P(y | x2)
– Eg P(oedema | female, pregnant) =
P(oedema | pregnant)
• Subsumption resolution looks for subsuming attributes at classification
time and ignores them
– Simple correction for extreme form of violation of attribute
independence assumption
– Very effective in practice – reduces bias at small cost in variance –
though not always applicable
– For AnDE with n≥1 uses statistics collected already – no learning
overhead – often reduces classification time
Zheng, F, GI Webb, P Suraweera, L Zhu. "Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning." Machine Learning 87(1)(2012): 93-125.
Weighting
Jiang, Liangxiao, and Harry Zhang. "Weightily averaged one-dependence estimators." In PRICAI 2006, pp. 970-974. Springer Berlin, 2006.
- 74. Copyright © 2013 Geoffrey I Webb
Weighting
• Weighting also reduces bias at the cost of a small increase
in variance
[Figure: Root Mean Squared Error (y-axis, 0 to 0.6) against data quantity (x-axis, 0 to 1,000,000) for A3DE and A3DE W]
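The slides do not spell out the weighting scheme, so the following is only a sketch of one possibility: it assumes each super-parent's estimate is weighted by the empirical mutual information between that attribute and the class, a scheme used for weighted one-dependence estimators. The function names `mutual_information` and `weighted_average` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (cx[x] * cy[y]))
        for (x, y), c in cxy.items()
    )


def weighted_average(per_parent_estimates, weights):
    """Combine per-super-parent class scores using weights, then normalise.

    per_parent_estimates: one dict {class: score} per super-parent.
    weights: one non-negative weight per super-parent, not all zero.
    """
    classes = per_parent_estimates[0].keys()
    combined = {
        y: sum(w * est[y] for w, est in zip(weights, per_parent_estimates))
        for y in classes
    }
    z = sum(combined.values())
    return {y: p / z for y, p in combined.items()}
```

With equal weights this reduces to the plain average used by AnDE. Unequal weights let informative super-parents dominate, which is where the reduction in bias would come from, while the extra estimated quantities account for the small increase in variance.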
- 75. Copyright © 2013 Geoffrey I Webb
Weighting and subsumption resolution are complementary
• When subsumption resolution (SR) is applicable, weighting and SR in combination have lower bias but slightly higher variance than either technique alone
RMSE
Group  Dataset        Size       A2DE   A2DE-SR  A2DE-W  A2DE-WSR
small  cleveland      303        0.359  0.360    0.361   0.361
small  balance-scale  625        0.430  0.430    0.430   0.430
small  anneal         898        0.118  0.098    0.116   0.096
large  adult          48,842     0.313  0.306    0.308   0.303
large  localization   164,860    0.499  0.499    0.498   0.498
large  covtype        581,102    0.371  0.349    0.350   0.335
large  poker-hand     1,025,010  0.496  0.496    0.420   0.420
large  kddcup         5,209,460  0.044  0.040    0.043   0.039
- 77. Copyright © 2013 Geoffrey I Webb
References
Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. Penguin Press, 2012.
Whitelaw, Casey, Ben Hutchinson, Grace Y. Chung, and Gerard Ellis. "Using the web for language independent spellchecking and autocorrection." In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pp. 890-899. Association for Computational Linguistics, 2009.
Press, William H. "Unit 17: Classifier Performance: ROC, Precision-Recall, and All That." http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf
Provost, Foster, David Jensen, and Tim Oates. "Efficient progressive sampling." In Proceedings 5th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 23-32. ACM, 1999.
Hulten, Geoff, and Pedro Domingos. "Mining complex models from arbitrarily large databases in constant time." In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 525-531. ACM, 2002.
Tille, Yves. Sampling algorithms. Springer, 2006.
White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.
Auer, Peter. "Online Learning." In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors, pp. 736-743. Springer: New York, 2010.
Littlestone, Nick, and Manfred K. Warmuth. "The weighted majority algorithm." In 30th Annual Symposium on Foundations of Computer Science, pp. 256-261. IEEE, 1989.
Littlestone, Nick. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm." Machine Learning 2, no. 4 (1988): 285-318.
Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." In Proceedings 21st International Conference on Machine learning, p. 116. ACM, 2004.
"Bias Variance Decomposition." In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors, pp. 100-101. Springer: New York, 2010.
Sahami, Mehran. "Learning limited dependence Bayesian classifiers." In KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 335-338. 1996.
Webb, Geoffrey I., Janice R. Boughton, Fei Zheng, Kai Ming Ting, and Houssam Salem. "Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification." Machine Learning 86, no. 2 (2012): 233-272.
Zheng, Fei, Geoffrey I. Webb, Pramuditha Suraweera, and Liguang Zhu. "Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning." Machine Learning 87, no. 1 (2012): 93-125.
Jiang, Liangxiao, and Harry Zhang. "Weightily averaged one-dependence estimators." In PRICAI 2006: trends in artificial intelligence, pp. 970-974. Springer Berlin Heidelberg, 2006.