1. Mining the Social Web: Chapter 6
LinkedIn: Clustering Your Professional
Network for Fun (and Profit?)
chois79
2. Contents
• Introduction
• Motivation for Clustering
• Clustering Contacts by Job Title
– Standardizing and Counting Job Titles
– Common Similarity Metrics for Clustering
– A Greedy Approach to Clustering
– Hierarchical and k-Means Clustering
• Fetching Extended Profile Information
• Closing Remarks
3. Introduction
• LinkedIn is a popular social networking site focused
on professional and business relationships
• The two primary ways you can access your LinkedIn data
– Exporting it as address book data
– Using the LinkedIn API
• This chapter introduces fundamental clustering
techniques to answer the following kinds of queries
– Which of your connections are the most similar based
upon a criterion like job title?
– Which of your connections have worked in companies you
want to work for?
– Where do most of your connections reside geographically?
4. Motivation for Clustering
• Which of your connections are the most similar based
upon a criterion like job title?
– LinkedIn members are able to enter their professional
information as free text.
• job titles, company name, professional interests…
• There are two issues
– How to measure the similarity between two values
• Ex) Chief Executive Officer vs. Chief Technology Officer
– How to cluster all of the people
• It would be ideal to compare every member to every other member
• This is an n-squared (O(n²)) problem
5. Clustering Contacts by Job Title
Standardizing and Counting Job Titles
• Standardizing and Counting Job Titles
– Use patterns for transforming common job titles
– Perform a basic frequency analysis on the standardized titles
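A minimal sketch of this step: a few hypothetical regex transforms applied to sample titles, with frequencies counted via collections.Counter (the patterns and titles are illustrative, not the book's actual transform table):

```python
import re
from collections import Counter

# Illustrative sample of raw job titles (free text, as on LinkedIn)
titles = [
    "Sr. Software Engineer",
    "Senior Software Engineer",
    "Software Engineer",
    "CEO",
    "Chief Executive Officer",
]

# Hypothetical standardization patterns: expand common abbreviations
transforms = [
    (r"\bSr\b\.?", "Senior"),
    (r"\bJr\b\.?", "Junior"),
    (r"\bCEO\b", "Chief Executive Officer"),
    (r"\bCTO\b", "Chief Technology Officer"),
]

def standardize(title):
    """Apply each abbreviation-expanding pattern in turn."""
    for pattern, replacement in transforms:
        title = re.sub(pattern, replacement, title)
    return title

standardized = [standardize(t) for t in titles]
counts = Counter(standardized)
print(counts.most_common())
```

After standardization, "Sr. Software Engineer" and "Senior Software Engineer" collapse into one bucket, so the frequency analysis reflects roles rather than spelling variants.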
6. Clustering Contacts by Job Title
Common Similarity Metrics for Clustering
• Edit distance (Levenshtein distance)
– The minimum number of operations (insertions, deletions,
substitutions) required to transform one string into the other
– Ex1) dad into bad = 1
– Ex2) park into spake = 3
• Alignments of park against spake (□ = gap) and their costs:
s p a k e   distance
□ p a r k   3
p a r k □   4
p □ a r k   4
p a r □ k   5
7. Clustering Contacts by Job Title
Common Similarity Metrics for Clustering
• N-gram similarity
– A terse way of expressing every possible contiguous sequence of n
tokens from a text
– Ex) bi-gram (n = 2)
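A hand-rolled sketch of extracting bi-grams from a tokenized title (nltk provides an equivalent via nltk.util.ngrams):

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Chief Technology Officer".split()
print(ngrams(tokens, 2))
# bi-grams: [('Chief', 'Technology'), ('Technology', 'Officer')]
```

Comparing two titles by the overlap of their bi-gram sets captures shared word order, not just shared words.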
8. Clustering Contacts by Job Title
Common Similarity Metrics for Clustering
• Jaccard distance
– One minus the number of items in common between the two sets
divided by the total number of distinct items
• 1 – |Set1 intersection Set2| / |Set1 union Set2|
– In nltk.metrics.distance.jaccard_distance
• (len(X.union(Y)) – len(X.intersection(Y))) / float(len(X.union(Y)))
• MASI distance
– Weighted version of Jaccard similarity
• adjusts the score to result in a smaller distance than Jaccard when a
partial overlap between the sets exists
• 1 – float(len(X.intersection(Y))) / float(max(len(X), len(Y)))
9. Clustering Contacts by Job Title
Common Similarity Metrics for Clustering
• Jaccard distance vs. MASI distance
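The two metrics can be compared directly on a pair of token sets using the formulas above (the MASI variant here is the slide's simplified weighting, not necessarily nltk's exact masi_distance implementation):

```python
def jaccard_distance(x, y):
    # 1 - |X ∩ Y| / |X ∪ Y|, as in nltk.metrics.distance.jaccard_distance
    return (len(x | y) - len(x & y)) / float(len(x | y))

def masi_distance(x, y):
    # Slide's simplified weighting: yields a smaller distance than
    # Jaccard when the sets partially overlap
    return 1 - float(len(x & y)) / float(max(len(x), len(y)))

a = set("Chief Executive Officer".split())
b = set("Chief Technology Officer".split())

print(jaccard_distance(a, b))  # 2 shared of 4 distinct terms -> 0.5
print(masi_distance(a, b))     # 2 shared of max(3, 3) terms -> ~0.33
```

On these two titles MASI reports a smaller distance than Jaccard, which is exactly the behavior that makes it a good fit for partially overlapping job titles.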
10. Clustering Contacts by Job Title
A Greedy Approach to Clustering
• Cluster job titles by comparing them using MASI
distance
– Comparing all pairs of titles is an n-squared problem
• Scalable clustering sure ain't easy
– An O(n²) algorithm is simply unacceptable
• The inner loop executes len(all_titles) * len(all_titles) times
11. Clustering Contacts by Job Title
A Greedy Approach to Clustering
• A random sample is selected for the scoring
function
– This executes the inner loop a much smaller, fixed number of
times
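A minimal greedy clustering sketch along these lines, using the simplified MASI-style distance from the earlier slide; the random-sampling optimization is omitted for clarity, and the titles and threshold are illustrative:

```python
def masi_distance(x, y):
    # Simplified MASI-style distance between two token sets
    return 1 - float(len(x & y)) / float(max(len(x), len(y)))

def greedy_cluster(titles, threshold=0.5):
    """Single-pass greedy clustering: each title joins the first
    existing cluster whose representative is within the distance
    threshold; otherwise it starts a new cluster."""
    clusters = []  # list of (representative token set, member titles)
    for title in titles:
        tokens = set(title.split())
        for rep, members in clusters:
            if masi_distance(tokens, rep) <= threshold:
                members.append(title)
                break
        else:
            clusters.append((tokens, [title]))
    return [members for _, members in clusters]

titles = ["Chief Executive Officer", "Chief Technology Officer",
          "Software Engineer", "Senior Software Engineer"]
print(greedy_cluster(titles))
```

The greedy pass is order-dependent and compares each title only against one representative per cluster, which is what keeps it well under the full n² comparisons.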
12. Clustering Contacts by Job Title
Hierarchical Clustering
• Hierarchical clustering (agglomerative clustering)
– A deterministic and exhaustive technique
• Computes the full matrix of distances between all items
• Walks through the matrix, clustering items that meet a minimum
distance threshold
– About 0.5*n² distance computations suffice, since distance is
symmetric (results can be reused via dynamic programming)
• Ex) (abc, def) and (def, abc) have the same distance
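A naive agglomerative sketch (single linkage, with an illustrative threshold and the simplified MASI-style distance); note the pair loop only visits i < j, which is the 0.5*n² symmetry point above:

```python
def masi_distance(x, y):
    # Simplified MASI-style distance between two token sets
    return 1 - float(len(x & y)) / float(max(len(x), len(y)))

def agglomerate(titles, threshold=0.5):
    """Naive agglomerative clustering: repeatedly merge the closest
    pair of clusters until no pair is within the threshold.
    Cluster-to-cluster distance is the minimum over member pairs
    (single linkage)."""
    clusters = [[t] for t in titles]

    def dist(c1, c2):
        return min(masi_distance(set(a.split()), set(b.split()))
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        # Only i < j pairs are scored: the distance matrix is symmetric
        d, i, j = min(((dist(c1, c2), i, j)
                       for i, c1 in enumerate(clusters)
                       for j, c2 in enumerate(clusters) if i < j),
                      key=lambda t: t[0])
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

titles = ["Chief Executive Officer", "Chief Technology Officer",
          "Software Engineer"]
print(agglomerate(titles))
```

Unlike the greedy pass, this is deterministic and exhaustive: the result does not depend on input order, at the cost of scoring every pair.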
13. Clustering Contacts by Job Title
K-Means Clustering
• K-Means Clustering
– Generally executes on the order of O(k*n) times
– Steps
1. Randomly pick k points in the data space as initial values that will
be used to compute the k clusters: K1, K2, …, Kk
2. Assign each of the n points to a cluster by finding the nearest Ki –
effectively creating k clusters and requiring k * n comparisons
3. For each of the k clusters, calculate the centroid, or the mean of the
cluster, and reassign its Ki value to be that value
4. Repeat steps 2-3 until the members of the clusters do not change
between iterations. Generally speaking, relatively few iterations are
required for convergence
– Interactive applet: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
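The four steps can be sketched for 1-D points in plain Python (the sample data and seed are illustrative):

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 1-D points, following the steps above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # step 1: random initial K1..Kk
    assignment = None
    for _ in range(iterations):
        # step 2: assign each point to its nearest centroid
        # (k * n comparisons per pass)
        new_assignment = [min(range(k), key=lambda i: abs(p - centroids[i]))
                          for p in points]
        if new_assignment == assignment:  # step 4: members unchanged
            break
        assignment = new_assignment
        # step 3: recompute each centroid as the mean of its cluster
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = sum(members) / len(members)
    return centroids, assignment

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, assignment = kmeans(points, k=2)
print(sorted(centroids))  # the two well-separated cluster means
```

On this well-separated sample the loop converges in a handful of iterations, illustrating the "relatively few iterations" claim in step 4.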
14. Fetching Extended Profile Information (1/2)
• OAuth
– Open standard for
authorization
– Allows users to share
their private resources
stored on one site with
another site without
having to hand out their
credentials
16. Closing Remarks
• This chapter covered some serious ground
– Introduced fundamental clustering techniques
– Applied them to your professional network data on
LinkedIn in a variety of ways