2. Table of Contents
1. Probability 101: Mass & Density Functions
2. Probability 102: Simplex and its geometrical meaning
3. Dirichlet Distribution
4. Dirichlet Process
5. A demo
6. An application
3. Probability 101
Probability Mass Function (PMF) vs Density Function (PDF)
• PDF = a function f(x) of a continuous random variable whose integral over a range gives the probability that the variable falls in that range
• PMF = a function giving the probability that a discrete random variable is exactly equal to some value
• In continuous setting:
$\int_a^b f(x)\,dx$ = prob. that outcome is between a and b
i.e., units of f(x) = prob. per unit length (dx)
= how dense is probability per unit length near x
• In discrete setting:
f(x) = Pr(X = x)
i.e., units of f(x) = simple probability
= what is the mass of object X at point x
Probability Simplex
• Set of all PMFs over the entire sample space (n outcomes):
$S = \{ x \in \mathbb{R}^n : x_i \ge 0, \ \sum_{i=1}^{n} x_i = 1 \}$
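The difference between mass and density can be checked numerically. The sketch below is illustrative only: the fair die and the uniform density on [0, 0.5] are assumed examples, not taken from the slides, and `integrate` is a hypothetical helper using a simple midpoint sum.

```python
# Assumed discrete example: a fair six-sided die.
die_pmf = {face: 1 / 6 for face in range(1, 7)}


def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of a uniform variable on [lo, hi]: 1/(hi - lo) inside, 0 outside."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


pmf_total = sum(die_pmf.values())                  # masses add up to 1
density_at_point = uniform_pdf(0.25)               # a density value, may exceed 1
prob_interval = integrate(uniform_pdf, 0.1, 0.3)   # area under f between 0.1 and 0.3
pdf_total = integrate(uniform_pdf, 0.0, 0.5)       # total area under f
```

Here the density at a point is 2, which cannot be a probability; only the area under f over an interval is one.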
4. Probability 102: K-Simplex – geometrical meaning
• A k-dimensional polytope (a geometric object with flat sides) formed from the convex hull of its k+1 vertices.
• Let u0, u1, …, uk ∈ Rk be (k+1) points, then the simplex determined by them = set of points:
$C = \{\theta_0 u_0 + \dots + \theta_k u_k \mid \sum_{i=0}^{k} \theta_i = 1 \text{ and } \theta_i \ge 0 \ \forall i\}$
Look at u0, u1, u2 as a disjoint set of possible events, such that their probs. sum to 1,
i.e. p0 + p1 + p2 = 1, where 0 <= pi <= 1.
Consider the three probabilities as a point (p0, p1, p2) in Euclidean space.
The resulting set of points forms a triangle with vertices (1,0,0), (0,1,0) and (0,0,1).
While this set lies in a 3-dim. space, the object it forms is 2-dimensional; in general, pmfs with k components lie in a k-dim. space but form a (k-1) dimensional object.
Each point p in the simplex = a pmf in its own right (i.e. each component of p lies in [0,1] and all its components sum up to 1).
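A minimal sketch of the convex-combination view, assuming the three vertices are the unit vectors of 3-dimensional space. The helper names `convex_combination` and `is_pmf` are hypothetical, chosen for illustration.

```python
def convex_combination(weights, vertices):
    """Return sum_i weights[i] * vertices[i]; weights must lie on the simplex."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(vertices[0])
    return tuple(
        sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim)
    )


def is_pmf(point, tol=1e-9):
    """True if every component is in [0, 1] and the components sum to 1."""
    in_range = all(-tol <= x <= 1 + tol for x in point)
    return in_range and abs(sum(point) - 1.0) <= tol


# Assumed vertices: one per event, i.e. the corners of the triangle.
vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
point = convex_combination((0.2, 0.5, 0.3), vertices)
```

With unit-vector vertices the weights are the coordinates of the point, so every point of the triangle is a pmf over the three events.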
5. Dirichlet distribution
“A distribution of distributions”
• Let Q = [Q1, Q2, …, Qk] = a random pmf, i.e. Qi >= 0 for i = 1,2,…, k and ∑i=1..k Qi = 1.
• Let α = [α1, α2, . . . , αk], with αi > 0 for each i, and let α0 = ∑i=1..k αi
• Then, Q has a Dirichlet distribution with param. α, denoted by Q ∼ Dir(α), if its density is:
$P(Q_1, Q_2, \dots, Q_k) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} Q_i^{\alpha_i - 1}$
• A probability distribution whose samples lie in the (k-1) dimensional probability simplex ∆k, i.e., a distribution over pmfs of length k.
• Ranges over possible parameter vectors for a multinomial distribution and is the conjugate prior of the multinomial distribution.
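The standard Dirichlet density, Γ(α0) / ∏ Γ(αi) · ∏ Qi^(αi − 1), can be evaluated directly. The sketch below assumes q lies strictly inside the simplex (all components > 0); the function name and the example parameter vectors are illustrative choices, not from the slides.

```python
import math


def dirichlet_density(q, alpha):
    """Dirichlet density at pmf q (all components > 0) for parameters alpha."""
    alpha0 = sum(alpha)
    # Work in log space so the Gamma functions do not overflow.
    log_norm = math.lgamma(alpha0) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(x) for a, x in zip(alpha, q))
    return math.exp(log_norm + log_kernel)


# With alpha = [1, 1, 1] the density is constant over the simplex.
flat_density = dirichlet_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])

# With alpha = [2, 3, 5] the density is evaluated at the mean alpha / alpha0.
peaked_density = dirichlet_density([0.2, 0.3, 0.5], [2.0, 3.0, 5.0])
```

Setting every αi = 1 gives the uniform distribution over pmfs, which is a useful sanity check for the normalizing constant.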
6. Dirichlet distribution – an example use-case
• X = vector of outcome counts from n = 10 draws of a random var. with 3 possible outcomes = [4,4,2]
• PMF of X = multinomial distribution: $P(X = [n_1, n_2, n_3]) = \frac{n!}{n_1!\, n_2!\, n_3!}\, p_1^{n_1} p_2^{n_2} p_3^{n_3}$
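As a worked instance of the multinomial PMF for the counts [4,4,2], the sketch below assumes p = (0.4, 0.4, 0.2). These probabilities are hypothetical values chosen for illustration; the slides do not specify them.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """n! / (n1! * ... * nk!) * p1^n1 * ... * pk^nk for the given counts."""
    n = sum(counts)
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)  # exact integer division at every step
    return coefficient * prod(p ** c for p, c in zip(probs, counts))


probs = [0.4, 0.4, 0.2]  # assumed outcome probabilities
pmf_value = multinomial_pmf([4, 4, 2], probs)
```

The coefficient 10! / (4! 4! 2!) = 3150 counts the orderings of the ten draws that produce the same count vector.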
Q) What if p1, p2, p3 are unknown? i.e., there is no certainty over what the distribution of the categorical var. is!
Solution: use a Dirichlet distribution with params α1, α2, α3 to first draw P ~ Dir(α), and then draw X ~ Multi(P).
• Introduces one level of indirection in the model for X: instead of stating which P generated X, use params α1, α2, α3 to find likely prob. distributions and then draw samples X according to the random P.
• Since sampling is directly from the probability simplex, every draw from a k-dim. Dirichlet distribution is itself a pmf, and the draws average to the mean of the Dirichlet, E[Pi] = αi / α0.
• Addition of the Dirichlet distribution = introducing prior beliefs about which P is likely to occur, i.e., a random pmf has a Dirichlet distribution with param α. [1]
• Analogy 1: if a random pmf = a bag full of dice, then a sample from the Dirichlet = a specific die.
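The two-stage model (first draw P from the Dirichlet, then draw X from the multinomial) can be sketched as follows. This is a minimal illustration using only the standard library: the Dirichlet draw is obtained by normalizing independent Gamma(αi, 1) variables, and α = [4, 4, 2], the seed, and the helper names are assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one pmf from Dir(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_multinomial(n, p, rng):
    """Draw outcome counts for n independent categorical draws with pmf p."""
    counts = [0] * len(p)
    for outcome in rng.choices(range(len(p)), weights=p, k=n):
        counts[outcome] += 1
    return counts


rng = random.Random(42)          # fixed seed so the run is repeatable
alpha = [4.0, 4.0, 2.0]          # assumed prior parameters

p = sample_dirichlet(alpha, rng)     # "pick one die out of the bag"
x = sample_multinomial(10, p, rng)   # "roll that die 10 times"

# Averaging many Dirichlet draws approaches the mean alpha_i / alpha_0.
draws = [sample_dirichlet(alpha, rng) for _ in range(20_000)]
empirical_mean = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
```

Each `p` is a valid pmf, and `empirical_mean` should come out close to (0.4, 0.4, 0.2) for these parameters.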
7. Dirichlet Process
• In the dice analogy, the dice must have a finite no. of faces.
• Limitation of Dirichlet distribution = assumes a finite set of events.
Dirichlet Processes to the Rescue!
• A Dirichlet process enables working with an infinite set of events, and hence modelling prob. distributions over infinite sample spaces.
Analogy 2:
• Ask pedestrians on the street to choose their fav. color out of {V,I,B,G,Y,O,R}.
• Based on the answer, model each person as a pmf over 7 colors.
• Each person’s pmf = a realization of a draw from a Dirichlet distribution over 7 colors.
What if the choices are no longer restricted to 7 colors?
• Modelling an individual’s pmf (over infinite dim.) = a distribution over distributions over an infinite sample space.
• One solution = a Dirichlet process.
8. Dirichlet Process – definition
• Assign elements A, B, C, … to an unknown no. of categories following the algorithm:
Input = H (a prob. distribution a.k.a. base distribution), α (a +ve real no. a.k.a. concentration param.)
For n = 1: assign the 1st element to a new category drawn from H.
For n > 1:
Assign the nth element to a new category (drawn from H) with prob. α / (α + n – 1).
Assign the nth element to a pre-existing category x with prob. nx / (α + n – 1), where nx = no. of elements already assigned to x.
• Used when modelling data that tends to repeat previous values in a “rich get richer” fashion.
• Can also be defined as a Chinese Restaurant Process.
• Applications: morphological segmentation in NLP, modelling mutation rates of genes in evolutionary biology.
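The assignment rule can be sketched as a Chinese Restaurant Process simulation. This is an illustrative simplification: categories are labelled by integer index rather than by values drawn from the base distribution H, and the function name, seed, and parameter values are assumptions.

```python
import random


def chinese_restaurant_process(num_elements, alpha, rng):
    """Return (assignments, counts): a category index per element, and category sizes."""
    assignments = []
    counts = []  # counts[x] = no. of elements already assigned to category x
    for n in range(1, num_elements + 1):
        # Existing category x has weight counts[x]; a new category has weight alpha.
        # The weights sum to alpha + n - 1, matching the denominators in the rule.
        weights = counts + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(counts):
            counts.append(1)      # open a new category
        else:
            counts[choice] += 1   # join an existing one ("rich get richer")
        assignments.append(choice)
    return assignments, counts


rng = random.Random(7)  # fixed seed so the run is repeatable
assignments, counts = chinese_restaurant_process(100, alpha=1.0, rng=rng)
```

For n = 1 the only available choice is a new category, so the first element always opens category 0.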
10. An application: Learning of hierarchical Morphology paradigms [3]
• A paradigm = a pair (StemList, SuffixList) where each Stem+Suffix string = a valid word.
• Can be modelled as a hierarchical structure.
• Morphologically similar words = close to each other in the structure.
• Similarity metric = # common morphemes
• Notations: w = word, s = stem, m = suffix
• Assumption: stems and suffixes are generated independently from each other.
• Prob. of a word = p(w = s+m) = p(s) * p(m)
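A small worked instance of the independence assumption p(w = s+m) = p(s) * p(m). The stems, suffixes, and probabilities below are hypothetical values made up for illustration; they are not taken from [3].

```python
# Hypothetical stem and suffix distributions (illustrative values only).
stem_prob = {"walk": 0.5, "talk": 0.3, "jump": 0.2}
suffix_prob = {"": 0.4, "s": 0.2, "ed": 0.2, "ing": 0.2}


def word_prob(stem, suffix):
    """p(w = stem + suffix) under independent stem and suffix generation."""
    return stem_prob[stem] * suffix_prob[suffix]


p_walking = word_prob("walk", "ing")  # p("walk") * p("ing")

# Summing over every (stem, suffix) pair gives 1, as both factors are pmfs.
total = sum(word_prob(s, m) for s in stem_prob for m in suffix_prob)
```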
11. An application: Learning of hierarchical Morphology paradigms [3]
1. Two Dirichlet processes generate stems and suffixes independently:
• βs = concentration parameter, which determines the no. of stem types generated by the DP.
• If β = small, new stem/suffix types are less likely to be generated.
• If β = large, new stem/suffix types are more likely to be generated, thus yielding a more uniform distribution.
• Authors choose β < 1, i.e. to yield a more skewed distribution with sparse stems & suffixes.
• P = base distribution specifying the prior prob. distribution for morpheme lengths.
• The joint prob. of stems can then be calculated as given in [3].
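The effect of the concentration parameter on the number of types can be illustrated with a standard property of the Dirichlet process: after n draws, the expected number of distinct types is the sum of β / (β + i) for i = 0 … n−1. The sketch below is not the model from [3]; the function name and the β values are illustrative assumptions.

```python
def expected_num_types(beta, n):
    """Expected no. of distinct types after n draws from a DP with concentration beta."""
    # The (i+1)-th draw creates a new type with probability beta / (beta + i).
    return sum(beta / (beta + i) for i in range(n))


n_tokens = 1000
sparse = expected_num_types(0.1, n_tokens)    # beta < 1: few types, skewed usage
moderate = expected_num_types(1.0, n_tokens)
diffuse = expected_num_types(10.0, n_tokens)  # large beta: many more types
```

The expected count grows with β, which matches the bullets above: a small β concentrates mass on a few stem/suffix types.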
12. References
1. Frigyik, Bela A. et al. “Introduction to the Dirichlet Distribution and Related Processes.” (2010).
2. http://phyletica.org/dirichlet-process/
3. Can, Burcu and Suresh Manandhar. “Probabilistic Hierarchical Clustering of Morphological Paradigms.” EACL (2012).