Active learning lecture
1. Active Learning
COMS 6998-4:
Learning and Empirical Inference
Irina Rish
IBM T.J. Watson Research Center
2. Outline
Motivation
Active learning approaches
Membership queries
Uncertainty Sampling
Information-based loss functions
Uncertainty-Region Sampling
Query by committee
Applications
Active Collaborative Prediction
Active Bayes net learning
3. Standard supervised learning model
Given m labeled points, we want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞.
VC theory: m needs to be roughly d/ε in the realizable case.
4. Active learning
In many situations – like speech recognition and document
retrieval – unlabeled data is easy to come by, but there is a
charge for each label.
What is the minimum number of labels needed to achieve the target error rate?
6. What is Active Learning?
Unlabeled data are readily available; labels are expensive.
We want to use adaptive decisions to choose which labels to acquire for a given dataset.
Goal: an accurate classifier with minimal cost.
7. Active learning warning
The choice of data is only as good as the model itself.
If we assume a linear model, then two data points are sufficient.
What happens when the data are not linear?
12. Problem
Many results in this framework, even for complicated
hypothesis classes.
[Baum and Lang, 1991] tried fitting a neural net to handwritten
characters.
Synthetic instances created were incomprehensible to humans!
[Lewis and Gale, 1992] tried training text classifiers.
“an artificial text created by a learning algorithm is unlikely to
be a legitimate natural language expression, and probably would
be uninterpretable by a human teacher.”
16. Uncertainty Sampling
[Lewis & Gale, 1994]
Query the event that the current classifier is most
uncertain about
If uncertainty is measured in Euclidean distance:
[figure: points near the current decision boundary are the most uncertain]
Used trivially in SVMs, graphical models, etc.
17. Information-based Loss Function
[MacKay, 1992]
Maximize KL-divergence between posterior and prior
Maximize reduction in model entropy between posterior
and prior
Minimize cross-entropy between posterior and prior
All of these are notions of information gain
18. Query by Committee
[Seung et al. 1992, Freund et al. 1997]
Prior distribution over hypotheses
Samples a set of classifiers from distribution
Queries an example based on the degree of
disagreement between committee of classifiers
[figure: data points with committee classifiers A, B, C; the committee disagrees most on points between them]
21. Dataset (D) Example
t   Sex  Age    Test A  Test B  Test C  Disease
0   M    40-50  0       1       1       ?
1   F    50-60  0       1       0       ?
2   F    30-40  0       0       0       ?
3   F    60+    1       1       1       ?
4   M    10-20  0       1       0       ?
5   M    40-50  0       0       1       ?
6   F    0-10   0       0       0       ?
7   M    30-40  1       1       0       ?
8   M    20-30  0       0       1       ?
26. Model Space
Model: St → Ot
Model parameters: P(St), P(Ot | St)
Generative model: must be able to compute P(St = i, Ot = ot | w)
27. Model Parameter Space (W)
• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models: P(W | D) ∝ P(D | W) P(W) ∝ [∏t=1..T P(St, Ot | W)] P(W)
28. Notation
We Have:
• Dataset, D
• Model parameter space, W
• Query algorithm, q
q(W,D) returns t*, the next sample to label
29. Game
while NotDone
• Learn P(W | D)
• q chooses next example to label
• Expert adds label to D
34. Score Function
score_uncert(St) = uncertainty(P(St | Ot))
               = H(St)
               = −Σi P(St = i) log P(St = i)
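As an aside (not from the original slides), a minimal Python sketch of this score: compute H(St) from the current model's predicted class probabilities and query the argmax. The probabilities below are made up for illustration, and natural log is used (the base only rescales the ranking).

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i log p_i (natural log; the base only rescales)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # treat 0 log 0 as 0
    return -np.sum(p * np.log(p))

def most_uncertain(probs):
    """probs[t] = P(S_t | O_t) under the current model; returns the
    index t* of the sample with the highest entropy H(S_t)."""
    return int(np.argmax([entropy(row) for row in probs]))

probs = np.array([[0.98, 0.02], [0.88, 0.12], [0.96, 0.04]])
print(most_uncertain(probs))   # -> 1 (the least confident prediction)
```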
35. Uncertainty Sampling Example
t   Sex  Age    Test A  Test B  Test C  St  P(St)  H(St)
1   M    20-30  0       1       1       ?   0.02   0.043
2   F    20-30  0       1       0       ?   0.01   0.024
3   F    30-40  1       0       0       ?   0.05   0.086
4   F    60+    1       1       0       ?   0.12   0.159
5   M    10-20  0       1       0       ?   0.01   0.024
6   M    20-30  1       1       1       ?   0.96   0.073
Row 4 has the highest entropy H(St), so it is queried; the expert labels it FALSE.
36. Uncertainty Sampling Example
t   Sex  Age    Test A  Test B  Test C  St     P(St)  H(St)
1   M    20-30  0       1       1       ?      0.01   0.024
2   F    20-30  0       1       0       ?      0.02   0.043
3   F    30-40  1       0       0       ?      0.04   0.073
4   F    60+    1       1       0       FALSE  0.00   0.00
5   M    10-20  0       1       0       ?      0.06   0.112
6   M    20-30  1       1       1       ?      0.97   0.059
With row 4 labeled FALSE (zero entropy), row 5 now has the highest entropy and is queried next; the expert labels it TRUE.
37. Uncertainty Sampling
GOOD: couldn’t be easier
GOOD: often performs pretty well
BAD: H(St) measures information gain about the
samples, not the model
Sensitive to noisy samples
38. Can we do better than uncertainty sampling?
41. Information-Gain
• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) − H(W | St)
  (H(W) = current model-space entropy; H(W | St) = expected model-space entropy if we learn St)
43. We usually can't just sum over all models to get H(W)
H(W) = −∫w P(w) log P(w) dw
…but we can sample a committee C of models from P(W | D):
H(W) ≈ H(C) = −Σc∈C P(c) log P(c)
44. Conditional Model Entropy
H(W) = −∫w P(w) log P(w) dw
H(W | St = i) = −∫w P(w | St = i) log P(w | St = i) dw
H(W | St) = Σi P(St = i) H(W | St = i)
          = −Σi P(St = i) ∫w P(w | St = i) log P(w | St = i) dw
46. Information Gain Example
t   Sex  Age    Test A  Test B  Test C  St  P(St)  Score = H(C) − H(C | St)
1   M    20-30  0       1       1       ?   0.02   0.53
2   F    20-30  0       1       0       ?   0.01   0.58
3   F    30-40  1       0       0       ?   0.05   0.40
4   F    60+    1       1       1       ?   0.12   0.49
5   M    10-20  0       1       0       ?   0.01   0.57
6   M    20-30  0       0       1       ?   0.02   0.52
The sample with the highest expected information gain (row 2) is not the one with the highest uncertainty (row 4).
47. Score Function
score_IG(St) = H(C) − H(C | St)
            = H(St) − H(St | C)
Familiar?
48. Uncertainty Sampling & Information Gain
score_Uncertain(St) = H(St)
score_InfoGain(St) = H(St) − H(St | C)
Information gain adds the −H(St | C) term: it discounts samples that remain uncertain under every model (e.g., inherently noisy ones).
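A hedged sketch of the difference, estimating both scores from a finite committee C sampled from P(W | D) as on slide 43; the uniform committee weighting and the toy probabilities are my own illustrative assumptions.

```python
import numpy as np

def H(p):
    """Entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def scores(committee_probs):
    """committee_probs[c] = P(S_t | O_t, model c), one row per committee
    member sampled from P(W | D) (weighted uniformly here).
    Returns (score_Uncertain, score_InfoGain) for this sample."""
    marginal = committee_probs.mean(axis=0)               # P(S_t)
    h_marginal = H(marginal)                              # H(S_t)
    h_given_c = np.mean([H(p) for p in committee_probs])  # H(S_t | C)
    return h_marginal, h_marginal - h_given_c

# A sample every model finds ambiguous: high uncertainty, near-zero info gain.
print(scores(np.array([[0.5, 0.5], [0.5, 0.5]])))      # (0.69, 0.0)
# A sample the models disagree on: both scores are high.
print(scores(np.array([[0.9, 0.1], [0.1, 0.9]])))      # (0.69, 0.37)
```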
50. If our objective is to reduce the prediction error, then "the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries".
51. Strategy #2:
Query by Committee
Temporary assumptions (for this strategy):
  sequential rather than pool-based queries
  a version space in place of P(W | D)
  noiseless rather than probabilistic labels
QBC attacks the size of the "version space".
54. [figure: a chain of hidden states S1…S7 with observations O1…O7; two models sampled from the posterior disagree on a prediction: Model #1 says FALSE, Model #2 says TRUE]
One of them is definitely wrong. Ooh, now we're going to learn something for sure!
55. The Original QBC Algorithm
As each example arrives…
• Choose a committee, C (usually of size 2), randomly from the version space
• Have each member of C classify it
• If the committee disagrees, select the example for labeling
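A minimal sketch of this loop, assuming a finite version space of hypothesis callables and a labeling `oracle`; both stand-ins are mine, not from the original papers.

```python
import random

def qbc_stream(stream, version_space, oracle, committee_size=2):
    """Filter a stream of unlabeled examples with the QBC rule.
    version_space: finite list of hypotheses (callables x -> label)
    consistent with all labels so far; oracle(x) buys the true label."""
    labeled = []
    for x in stream:
        if len(version_space) < committee_size:
            break
        committee = random.sample(version_space, committee_size)
        if len({h(x) for h in committee}) > 1:   # committee disagrees: query
            y = oracle(x)
            labeled.append((x, y))
            # keep only hypotheses consistent with the new label
            version_space = [h for h in version_space if h(x) == y]
        # if the committee agrees, x is skipped and no label is bought
    return labeled, version_space
```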
57. Infogain vs Query by Committee
[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]
First idea: try to rapidly reduce the volume of the version space?
Problem: doesn't take the data distribution into account.
[figure: the hypothesis class H; which pair of hypotheses is closest depends on the data distribution P]
Distance measure on H: d(h, h′) = P(h(x) ≠ h′(x))
58. Query-by-committee
First idea: try to rapidly reduce the volume of the version space?
Problem: doesn't take the data distribution into account.
To keep things simple, say d(h, h′) = Euclidean distance.
[figure: the version space H shrinks in volume, yet the error is likely to remain large!]
59. Query-by-committee
Elegant scheme which decreases volume in a manner which is
sensitive to the data distribution.
Bayesian setting: given a prior π on H
H1 = H
For t = 1, 2, …
  receive an unlabeled point xt drawn from P
  choose two hypotheses h, h′ randomly from (π, Ht)
  if h(xt) ≠ h′(xt): ask for xt's label
  set Ht+1 to the hypotheses in Ht consistent with the new label
  [informally: query xt only when there is a lot of disagreement about xt in Ht]
Problem: how to implement it efficiently?
60. Query-by-committee
For t = 1, 2, …
  receive an unlabeled point xt drawn from P
  choose two hypotheses h, h′ randomly from (π, Ht)
  if h(xt) ≠ h′(xt): ask for xt's label
  set Ht+1 to the hypotheses in Ht consistent with the new label
Observation: the probability of getting pair (h, h′) in the inner loop (when a query is made) is proportional to π(h) π(h′) d(h, h′).
62. Query-by-committee
Label bound: for H = {linear separators in Rd} and P = uniform distribution, just d log(1/ε) labels suffice to reach a hypothesis with error < ε.
Implementation: need to randomly pick h according to (π, Ht); e.g., H = {linear separators in Rd}, π = uniform distribution.
[figure: the version space Ht as a convex body] How do you pick a random point from a convex body?
63. Sampling from convex bodies
By random walk! For example:
• Ball walk
• Hit-and-run
[Gilad-Bachrach, Navot, Tishby 2005] studies random walks and also ways to kernelize QBC.
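For concreteness, a rough hit-and-run sketch (my own simplification, not the kernelized algorithm from the paper) for the version space of homogeneous linear separators intersected with the unit ball; the strictly feasible start `w0`, e.g. a perceptron solution, is an assumed input.

```python
import numpy as np

def hit_and_run(X, y, w0, n_steps=1000, rng=None):
    """Hit-and-run over the version space of homogeneous linear
    separators intersected with the unit ball:
        {w : y_i <w, x_i> >= 0 for all i, ||w|| <= 1}.
    Assumes w0 strictly satisfies the linear constraints."""
    rng = rng or np.random.default_rng()
    w = w0 / (2 * np.linalg.norm(w0))        # pull the start inside the ball
    signed = X * y[:, None]                  # rows are y_i * x_i
    for _ in range(n_steps):
        u = rng.standard_normal(len(w))
        u /= np.linalg.norm(u)               # random direction
        # chord allowed by the linear constraints: a_i + t * b_i >= 0
        a, b = signed @ w, signed @ u
        lo = max((-ai / bi for ai, bi in zip(a, b) if bi > 0), default=-np.inf)
        hi = min((-ai / bi for ai, bi in zip(a, b) if bi < 0), default=np.inf)
        # intersect with the unit ball: ||w + t*u||^2 <= 1
        c = w @ u
        r = np.sqrt(c * c + 1.0 - w @ w)
        lo, hi = max(lo, -c - r), min(hi, -c + r)
        w = w + rng.uniform(lo, hi) * u      # uniform point on the chord
    return w
```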
65. Some challenges
[1] For linear separators, analyze the label complexity for
some distribution other than uniform!
[2] How to handle nonseparable data?
Need a robust base learner
[figure: a nonseparable dataset with + and − points on both sides of the true boundary]
67. Approach: Collaborative Prediction (CP)
[two example matrices, both sparsely observed: a client × server QoS matrix (e.g., bandwidth) and a user × movie matrix of ratings; e.g. Alina's row: Matrix = ?, Geisha = 1, Shrek = 3]
Given previously observed ratings R(x, y), where x is a "user" and y is a "product", predict unobserved ratings:
- will Alina like "The Matrix"? (unlikely)
- will Client 86 have fast download from Server 39?
- will member X of a funding committee approve our project Y?
68. Collaborative Prediction = Matrix Approximation
[figure: a 100 clients × 100 servers matrix]
• Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes
• Approaches: mainly factorized models assuming hidden "factors" that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …)
69. [figure: a row of ratings (2 4 5 1 4 2) decomposed into the user's "weights" associated with "factors" and the factors themselves]
Assumptions:
- there are a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties
- movies have intrinsic values associated with such factors
- users have intrinsic weights on such factors
- user ratings are weighted (linear) combinations of the movies' factor values
72. [figure: Y, a partially observed ratings matrix, approximated by X = UV′ with rank(X) = k]
Objective: find a factorizable X = UV′ that approximates Y:
  X = argmin_{X′} Loss(X′, Y)
and satisfies some "regularization" constraints (e.g., rank(X) ≤ k).
Loss function: depends on the nature of your problem.
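A toy sketch of this objective under sum-squared loss on the observed entries (rank bounded by k, plus a small L2 penalty); the function name and hyperparameters are illustrative, and real systems would use ALS/SGD or the MMMF formulation discussed next.

```python
import numpy as np

def factorize(Y, mask, k=2, steps=2000, lr=0.02, reg=0.1, rng=None):
    """Fit X = U @ V.T to the observed entries of Y (where mask == 1),
    minimizing squared loss plus an L2 penalty, by gradient descent."""
    rng = rng or np.random.default_rng(0)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(steps):
        E = mask * (U @ V.T - Y)      # residuals on observed entries only
        U -= lr * (E @ V + reg * U)
        V -= lr * (E.T @ U + reg * V)
    return U @ V.T

# Zeros mark unobserved entries in this toy example.
Y = np.array([[5., 4., 0.], [4., 0., 1.], [0., 2., 5.]])
print(np.round(factorize(Y, (Y != 0).astype(float)), 1))
```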
73. Matrix Factorization Approaches
Singular value decomposition (SVD) – low-rank approximation
  Assumes a fully observed Y and sum-squared loss
In collaborative prediction, Y is only partially observed
  Low-rank approximation becomes a non-convex problem with many local minima
Furthermore, we may not want sum-squared loss, but instead
  accurate predictions (0/1 loss, approximated by hinge loss)
  cost-sensitive predictions (missing a good server vs suggesting a bad one)
  ranking cost (e.g., suggest the k "best" movies for a user)
NON-CONVEX PROBLEMS!
Use instead the state-of-the-art Max-Margin Matrix Factorization (MMMF) [Srebro 05]:
  replaces the bounded-rank constraint with a bounded norm of the U, V′ vectors
  a convex optimization problem! – can be solved exactly by semi-definite programming
  strongly relates to learning max-margin classifiers (SVMs)
Exploit MMMF's properties to augment it with active sampling!
74. Key Idea of MMMF
Rows = feature vectors; columns = linear classifiers (weight vectors).
The "margin" here = dist(sample, line), the distance from a sample to a classifier's separating line.
[figure: feature vectors (e.g. f1) and classifier weight vectors (e.g. v2) with their margins]
Xij = signij × marginij
Predictorij = signij: if signij > 0, classify as +1; otherwise classify as −1.
76. Active Learning with MMMF
- We extend MMMF to Active-MMMF using margin-based active sampling
- We investigate the exploitation vs exploration trade-offs imposed by different heuristics
[figure: the predicted matrix of margins, with candidate entries to query highlighted]
Margin-based heuristics:
  min-margin (most uncertain)
  min-margin positive ("good" uncertain)
  max-margin ("safe" choice, but no info)
77. Active Max-Margin Matrix Factorization
A-MMMF(M, s):
1. Given a sparse matrix Y, learn the approximation X = MMMF(Y)
2. Using the current predictions, actively select the "best" s samples and request their labels (e.g., test a client/server pair via an "enforced" download); see the sketch below
3. Add the new samples to Y
4. Repeat 1–3
Issues:
  Beyond simple greedy margin-based heuristics?
  Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions?)
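A sketch of steps 1–2 under the margin heuristics from the previous slide, with a generic `fit` callback standing in for MMMF (treated here as a black box); all names are mine.

```python
import numpy as np

def active_mf_round(Y, mask, fit, s=10, heuristic="min_margin"):
    """One round of the A-MMMF loop (steps 1-2), with a generic
    factorizer fit(Y, mask) -> X standing in for MMMF."""
    X = fit(Y, mask)                              # step 1: learn approximation
    margin = np.abs(X)                            # |X_ij| acts as the margin
    if heuristic == "min_margin":                 # most uncertain
        scores = margin
    elif heuristic == "min_margin_positive":      # uncertain but predicted 'good'
        scores = np.where(X > 0, margin, np.inf)
    else:                                         # max_margin: 'safe', uninformative
        scores = -margin
    scores = np.where(mask == 1, np.inf, scores)  # never re-query observed entries
    best = np.argsort(scores, axis=None)[:s]      # step 2: the s 'best' entries
    return np.unravel_index(best, Y.shape)        # indices to query next
```

The caller would then fetch the requested labels (e.g., run the enforced downloads), write them into Y and the mask, and repeat (steps 3–4).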
79. Empirical Results: Latency Prediction
[figures: prediction-error curves on P2Psim data and NLANR-AMP data]
Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides a consistent improvement over random and least-uncertain sampling.
82. Introducing Cost: Exploration vs Exploitation
[figures: DownloadGrid bandwidth prediction; PlanetLab latency prediction]
Active sampling → lower prediction errors at lower cost (saves 100s of samples)
(better prediction → better server-assignment decisions → faster downloads)
Active sampling achieves a good exploration vs exploitation trade-off: reduced decision cost AND information gain.
83. Conclusions
Common challenge in many applications:
need for cost-efficient sampling
This talk: linear hidden factor models with active sampling
Active sampling improves predictive accuracy while keeping
sampling complexity low in a wide variety of applications
Future work:
Better active sampling heuristics?
Theoretical analysis of active sampling performance?
Dynamic Matrix Factorizations: tracking time-varying matrices
Incremental MMMF? (solving from scratch every time is too costly)
84. References
Some of the most influential papers:
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
• David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, (4):129–145, 1996.
• David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590–604, 1992.
85. NIPS papers
• Francis Bach. Active Learning for Misspecified Generalized Linear Models. NIPS-06
• Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
• Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05
• Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
• Sanjoy Dasgupta. Coarse Sample Complexity Bounds for Active Learning. NIPS-05
• Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
• Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
• Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
• Sanjoy Dasgupta. Analysis of a Greedy Active Learning Strategy. NIPS-04
• T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01
• M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
• Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
• Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
• Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97
• K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
• Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94
• Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
• David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94
• Sebastian B. Thrun, Knut Möller. Active Exploration in Dynamic Environments. NIPS-91
86. ICML papers
• Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
• Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06
• Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06
• Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
• Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments with Application to Gene Expression Analysis. ICML-05
• Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
• Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
• Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
• Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00
COLT papers
• S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of Perceptron-Based Active Learning. COLT-05
• H. S. Seung, M. Opper, H. Sompolinsky. Query by Committee. COLT-92, pp. 287–294, 1992.
87. Journal Papers
• Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), vol. 6, pp. 1579–1619, 2005.
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
• David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, (4):129–145, 1996.
• David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
• D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, vol. 4, no. 4, pp. 590–604, 1992.
• D. Haussler, M. Kearns, R. E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83–113, 1994.
• V. V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.
• M. Saar-Tsechansky, F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153–178, 2004.
91. Entropy Function
• A measure of information in a random event X with possible outcomes {x1, …, xn}:
  H(X) = −Σi p(xi) log2 p(xi)
• Comments on the entropy function:
  – Entropy of an event is zero when the outcome is known
  – Entropy is maximal when all outcomes are equally likely
• The average minimum number of yes/no questions needed to identify the outcome (connection to binary search) [Shannon, 1948]
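A quick numeric check of these comments (my own toy example, in bits):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits: H = -sum_i p_i log2 p_i."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # treat 0 log 0 as 0
    return -np.sum(p * np.log2(p))

print(H([1.0, 0.0]))    # 0.0 : outcome known, zero entropy
print(H([0.5, 0.5]))    # 1.0 : maximal for two equally likely outcomes
print(H([0.25] * 4))    # 2.0 : two yes/no questions for four outcomes
```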
92. Kullback-Leibler divergence
• P is the true distribution; Q is the distribution used to encode the data instead of P
• KL divergence is the expected extra message length per datum that must be transmitted when using Q:
  DKL(P || Q) = Σi P(xi) log (P(xi) / Q(xi))
             = −Σi P(xi) log Q(xi) + Σi P(xi) log P(xi)
             = H(P, Q) − H(P)
             = cross-entropy − entropy
• A measure of how "wrong" Q is with respect to the true distribution P
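A small numeric check of the identity DKL(P || Q) = H(P, Q) − H(P) (example distributions are my own):

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) = sum_i P(x_i) log2 (P(x_i)/Q(x_i)), in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def cross_entropy(p, q):
    """H(P, Q) = -sum_i P(x_i) log2 Q(x_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
H_p = -np.sum(p * np.log2(p))
print(kl(p, q))                   # ~0.737 extra bits per datum
print(cross_entropy(p, q) - H_p)  # same value, via the identity above
```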
93. Learning Bayesian Networks
[figure: Data + Prior Knowledge → Learner → a Bayes net over nodes E, B, R, A, C, with a CPT such as:]
  E   B    P(A|E,B)  P(¬A|E,B)
  e   b    .9        .1
  e   ¬b   .7        .3
  ¬e  b    .8        .2
  ¬e  ¬b   .99       .01
• Model building
• Parameter estimation
• Causal structure discovery
• Passive learning vs active learning
94. Active Learning
• Selective Active Learning
• Interventional Active Learning
• Obtain measure of quality of current model
• Choose query that most improves quality
• Update model
95. Active Learning: Parameter Estimation
[Tong & Koller, NIPS-2000]
• Given a BN structure G
• A prior distribution p(θ)
• The learner requests a particular instantiation q (the query)
[figure: the initial network (G, p(θ)) issues a query q; the expert's response x is added to the training data, yielding the updated distribution p′(θ)]
Two questions: how to update the parameter density, and how to select the next query based on p.
96. Updating parameter density
[figure: a Bayes net over nodes A, B, J, M]
• Do not update A, since we are fixing it
• If we select A, then do not update B: sampling from P(B | A = a) ≠ P(B)
• If we force A, then we can update B: sampling from P(B | A := a) = P(B)*
• Update all other nodes as usual
• Obtain the new density p(θ | A = a, X = x)
* Pearl 2000
97. Bayesian point estimation
• Goal: a single estimate θ̃, instead of a distribution p over θ
[figure: the density p(θ), with the estimate θ̃ and the true model θ′]
• If we choose θ̃ and the true model is θ′, then we incur some loss L(θ′ || θ̃)
98. Bayesian point estimation
• We do not know the true θ′
• The density p represents optimal beliefs over θ′
• Choose the θ̃ that minimizes the expected loss:
  θ̃ = argminθ ∫ p(θ′) L(θ′ || θ) dθ′
• Call θ̃ the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate as a measure of quality of p(θ):
  Risk(p) = ∫ p(θ′) L(θ′ || θ̃) dθ′
99. The Querying component
• Set the controllable variables so as to minimize the expected posterior risk:
  ExPRisk(p | Q = q) = Σx P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ
• KL divergence is used for the loss; the conditional KL divergence is
  KL(θ || θ′) = Σi KL(Pθ(Xi | Ui) || Pθ′(Xi | Ui))
100. Algorithm Summary
• For each potential query q, compute ΔRisk(X | q)
• Choose the q for which ΔRisk(X | q) is greatest
– Cost of computing ΔRisk(X | q): the cost of Bayesian network inference
• Complexity: O(|Q| · cost of inference)
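In code the loop is a greedy argmax; `risk` and `expected_posterior_risk` are hypothetical callbacks standing in for the BN computations on the previous slides.

```python
def select_query(queries, risk, expected_posterior_risk):
    """Greedy query selection: pick the q maximizing
    DeltaRisk(X | q) = Risk(p) - ExPRisk(p | Q = q).
    risk() and expected_posterior_risk(q) are placeholders; each call
    costs one round of BN inference, so the loop is
    O(|Q| * cost of inference)."""
    current = risk()
    return max(queries, key=lambda q: current - expected_posterior_risk(q))
```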
101. Uncertainty sampling
Maintain a single hypothesis, based on the labels seen so far. Query the point about which this hypothesis is most "uncertain".
Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.
[figure: labeled + and − points with the single current hypothesis and its most uncertain query point X]
103. Region of uncertainty
Current version space: the portion of H consistent with the labels so far.
"Region of uncertainty" = the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space).
[figure: data on a circle in R2, hypotheses are linear separators; the spaces X and H are superimposed, showing the current version space and the region of uncertainty in data space]
104. Region of uncertainty
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.
[figure: data and current version space superimposed (both are the surface of the unit sphere in Rd), with the region of uncertainty in data space]
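A minimal sketch of the rule, approximating the version space by a finite sample of hypotheses (my simplification):

```python
import random

def cal_query(unlabeled, version_space):
    """CAL rule: query a random unlabeled point from the region of
    uncertainty, i.e. where version-space hypotheses disagree.
    version_space is assumed to be a finite sample of hypotheses."""
    region = [x for x in unlabeled
              if len({h(x) for h in version_space}) > 1]
    return random.choice(region) if region else None
```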
105. Region of uncertainty
The number of labels needed depends on H and also on P.
Special case: H = {linear separators in Rd}, P = uniform distribution over the unit sphere. Then just d log(1/ε) labels are needed to reach a hypothesis with error rate < ε.
[1] Supervised learning needs d/ε labels. [2] This is the best we can hope for. (For example, with d = 10 and ε = 0.01, that is roughly 1,000 labels passively vs on the order of 10 log 100 ≈ 46 actively.)
106. Region of uncertainty
Algorithm [CAL92]:
of the unlabeled points which lie in the region of uncertainty,
pick one at random to query.
For more general distributions: suboptimal… We need to measure the quality of a query – or, alternatively, the size of the version space.