Generating Supplemental Content Information Using
Virtual Profiles
Haishan Liu
Linkedin Corporation
2029 Stierlin Court
Mountain View, CA, 94043
haliu@linkedin.com
Mohammad Amin
Linkedin Corporation
2029 Stierlin Court
Mountain View, CA, 94043
mamin@linkedin.com
Baoshi Yan
Linkedin Corporation
2029 Stierlin Court
Mountain View, CA, 94043
byan@linkedin.com
Anmol Bhasin
Linkedin Corporation
2029 Stierlin Court
Mountain View, CA, 94043
abhasin@linkedin.com
ABSTRACT
We describe a hybrid recommendation platform and technique at
LinkedIn that seeks to extract the most relevant information about
the items to be recommended. Extending the notion of an item
profile, we propose the concept of a “virtual profile” that augments
the content of an item with a rich set of features inherited from
members who have already shown explicit interest in it. Unlike
item-based collaborative filtering, we focus on discovering the
characteristic descriptors that underlie the item-user association.
This information is used as supplemental features in a content-based
filtering system. The main objective of virtual profiles is to tap
into rich-content information from one type of entity and propagate
the extracted features to affiliated entities that may suffer from
relative data scarcity. We empirically evaluate the proposed method
on a real-world community recommendation problem at LinkedIn. The
results show that virtual profiles outperform a
collaborative-filtering-based approach (users who like this also like
that). In particular, the improvement is more significant for new
users with only limited connections, demonstrating the ability of the
method to address the cold-start problem of pure collaborative
filtering systems.
Categories and Subject Descriptors
H.2.8 [Database Management]: Data Mining
General Terms
Theory
Keywords
hybrid recommender systems, feature generation and extraction,
model-based recommendation, virtual profiles
1. INTRODUCTION
Large-scale recommender systems, in an era of internet-scale data
deluge, contribute significantly to mitigating the information
overload problem by surfacing relevant and interesting objects to
users. Rather than hoping for serendipitous encounters, recommender
systems enable personalized information discovery by presenting the
user with a smaller pool of relevant objects. Collaborative
filtering, the de facto mechanism for recommendation, fails to
address the “cold start” problem, which has led to the exploration of
hybrid recommenders. Hybrid recommenders combine information obtained
from different sources and techniques to achieve better outcomes.
Typically, a hybrid recommender system incorporates information from
a myriad of sources, e.g., content meta data, interaction data,
global popularity, social network and social interaction information,
and so on. Each of these information sources offers a different level
of relevance guarantee at a different computational overhead. Hence,
how these information sources are computed and how they are combined
play a vital role in the final outcome.
As of today, LinkedIn has more than 220 million users. As the largest
and most popular professional networking site, LinkedIn presents
unique opportunities and challenges for content discovery and
recommendation. It is imperative that members be able to discover and
subscribe to companies and groups (referred to as communities
henceforth) that might be relevant to them in a professional context.
In this paper, we describe a hybrid community recommendation platform
and technique at LinkedIn that combines information from multiple
sources. To extract more relevant information about the community to
be recommended, i.e., to further extend the notion of content meta
data, we propose the concept of a “virtual profile” that augments the
content meta data with a rich set of features inherited from the
members who have already shown explicit interest in it. In general, a
virtual profile answers the question: “What are the most dominant
features of the members who have shown interest in a particular
community?” Answering this question essentially maps an object into
the same feature space as that of its subscribers. Content meta data
extended with this inferred information provides additional
protection against the cold start problem. LinkedIn data presents a
unique opportunity to extend content features with extracted
features, since the data set contains rich information about the
subscribers, which makes the combination particularly valuable.
The contributions of this paper are as follows:
1. A generic content meta data extension method, i.e., virtual
profile generation.
2. A scalable and generic recommendation computation platform
that powers multiple real-time recommendation products at
LinkedIn.
3. Seamless integration of multiple, heterogeneous data
sources to compute an optimal outcome.
2. RELATED WORK
There has been a flurry of research in the domain of recommender
systems with the objective of improving personalization [1]. Most
traditional recommenders are powered by collaborative filtering
[9, 17], content-based predictors [8, 14], and knowledge-based
filtering techniques [11]. Each technique has its own strengths and
weaknesses: for example, collaborative filtering techniques suffer
from data sparsity and cold start problems [15], while content-based
techniques are prone to skewed recommendations [14]. Hybrid
recommenders aim to combine the best of both worlds, making
recommenders more robust in practice. Much work has been done on
combining multiple recommenders effectively so as to outperform any
single one. In [5], Burke presents a taxonomy of recommender systems
in which multiple recommenders are arranged to execute in a parallel
or cascaded topology. A system described in [4] combines multiple
collaborative filtering approaches using a linear combination with
static weights learned via linear regression. STREAM [2], which
combines multi-tier predictors, uses dynamically generated metrics to
learn the next level of predictors. In [12], a hybrid movie
recommender system is proposed that uses content-based predictors to
boost the user data that drives the ensuing collaborative filtering
recommendation. The content information is obtained from IMDB, and a
Naive Bayes classifier is used to build user item profiles. Finally,
user-based collaborative filtering is employed to obtain the final
recommendation. However, this approach suffers from scalability
issues. Pazzani [13] proposed a hybrid recommender system in which
content-based user profiles are used to group similar users, and the
groups are subsequently used to predict user preferences. In many of
these user-item recommendation frameworks, the items to be
recommended can be augmented with meta data corresponding to the
members who have already shown explicit interest in them. In other
words, these items can be represented as objects in the same feature
space as the users. These representations can be thought of as
“virtual user profiles” or “virtual profiles”, and they could add
another layer of information to guide the recommendation process. In
our approach, we describe a large-scale recommender system that
combines data from multiple heterogeneous sources, including virtual
profiles and the social network, to serve real-time traffic on a
large professional social networking site.
3. METHOD
3.1 System Overview
We build our recommender system on content filtering because we have
abundant access to rich-content entities, such as user profiles,
which enable a straightforward means of feature extraction, indexing,
and matching. Target entities (those the client wants recommendations
of) are feature extracted and put into a reverse index, and source
entities (those the client wants recommendations for) are converted
into complex queries against the index. This provides a form of
content-based recommendation score in which the match is determined
by the degree of similarity between the source and target entity
features, with different fields weighted by a set of parameters
determined by an offline learning-to-rank process. Figure 1
illustrates a brief workflow of the system. It also shows how the
system can be augmented with more information, such as virtual
profiles, as new features in the content filtering recommendation, as
detailed below.
Figure 1: A brief workflow for the recommender
system with virtual profiles.
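As a rough illustration of the index-and-query matching described in this section, the following Python sketch builds a term index over target entities and scores them against a source entity with per-field-pair weights. It is a minimal sketch under stated assumptions, not the production system: the function names (build_index, recommend), the field names, and the weights are all hypothetical, and in the system described here the weights would come from the offline learning-to-rank process rather than being set by hand.

```python
from collections import defaultdict

def build_index(targets):
    """Index target entities by (field, term).

    targets: {target_id: {field_name: set_of_terms}} (hypothetical layout).
    """
    index = defaultdict(set)
    for target_id, fields in targets.items():
        for field, terms in fields.items():
            for term in terms:
                index[(field, term)].add(target_id)
    return index

def recommend(index, source, field_weights, top_n=10):
    """Score targets by weighted term overlap with the source entity.

    source: {field_name: set_of_terms}
    field_weights: {(source_field, target_field): weight}; assumed to be
    learned offline, hard-coded here only for illustration.
    """
    scores = defaultdict(float)
    for (source_field, target_field), weight in field_weights.items():
        for term in source.get(source_field, ()):
            for target_id in index.get((target_field, term), ()):
                scores[target_id] += weight
    # Highest score first; ties broken by id for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Toy data: two groups, one with a virtual profile that mentions "java".
targets = {
    "g1": {"name": {"hadoop", "users"}, "virtual_profile": {"java", "scala"}},
    "g2": {"name": {"sales", "leaders"}, "virtual_profile": {"crm"}},
}
source = {"skills": {"java", "hadoop"}, "title": {"sales"}}
weights = {
    ("skills", "name"): 2.0,
    ("skills", "virtual_profile"): 1.0,
    ("title", "name"): 0.5,
}
ranked = recommend(build_index(targets), source, weights)
print(ranked)  # [('g1', 3.0), ('g2', 0.5)]
```

In this toy example the virtual profile field contributes a match ("java") that the name field alone would have missed, which is the role supplemental features play in the scoring.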
We view every entity as being characterized by two sets of content
features: one extracted from explicit information associated with the
entity, which we name the “primary profile”, and the other inferred
from the entity’s behavior and association with other entities, which
we name the “virtual profile.” The main objective of virtual profiles
is to tap into rich-content information from one type of entity and
propagate the extracted features to affiliated entities that may
suffer from relative data scarcity. Essentially, a virtual profile of
an entity is an aggregation of statistically relevant features from
the primary profiles of affiliated entities, and in this way it
introduces a collaborative filtering aspect into our content
filtering system. For example, the virtual profile of a LinkedIn
group consists of distinctive features of its participants, so that
the group can be most effectively distinguished from others.
To first extract features from entities to generate primary
profiles, we utilize a feature extractor layer, a standalone
service that accumulates underlying entity database change
events and identifies various fields in the document. Various
types of fields that could be feature extracted include rich
text fields, such as job summary, member position summary
etc., and specialized fields, such as Geo entities including
region, country, city, coordinates, etc.
The presented content filtering system can be extended to
consider other collaborative filtering aspects, for example,
by including network proximity as a feature while computing
relevance scores. We describe a browsemap-based method
along this line as a comparison in Section 4. As a gen-
eral platform, every application consuming recommenda-
tions from this system can easily build its own logic for
reranking/reordering of results based on custom filtering cri-
teria. the concept of network proximity, e.g., recommending
jobs to discussion groups.
3.2 Generating Virtual Profiles
The virtual profile generation process for an entity aims at
selecting, from a total of n features of its affiliated entities, a
subset of k < n features that is “maximally informative” about the
entity. From a classification point of view, the entity for which we
generate the virtual profile represents a target class for a set of
documents (primary profiles). We need a measure to evaluate the
“information content” of each individual feature with regard to the
target class, and we propose to use mutual information for this
purpose. Mutual information measures arbitrary dependencies between
random variables. Moreover, it is independent of the coordinates
chosen, which permits robust estimation and makes it suitable for
assessing the “information content” of features in complex
classification tasks.
In accordance with Shannon’s information theory, the un-
certainty of a document class C as a random variable can
be measured as:
$$H(C) = -\sum_{c \in C} P(c) \log P(c).$$
After knowing the feature vector F, the conditional entropy
H(C|F) measures the remaining uncertainty about C:
$$H(C \mid F) = -\sum_{f \in F} P(f) \sum_{c \in C} P(c \mid f) \log P(c \mid f).$$
After having observed the feature vector F, the mutual in-
formation, i.e., the amount of decreased class uncertainty is
defined as:
$$I(C; F) = H(C) - H(C \mid F) = \sum_{c, f} P(c, f) \log \frac{P(c, f)}{P(c) P(f)},$$
where P(c, f) is the joint probability of class c and feature
f.
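To make the three quantities concrete, the following Python sketch computes the entropy, the conditional entropy, and the mutual information from a small joint probability table and checks numerically that I(C; F) = H(C) − H(C|F). The joint table and the function names are illustrative assumptions, and a single binary feature stands in for the feature vector F.

```python
import math
from collections import defaultdict

def marginals(joint):
    """Return (P(c), P(f)) from a joint table joint[(c, f)] = P(c, f)."""
    p_c, p_f = defaultdict(float), defaultdict(float)
    for (c, f), p in joint.items():
        p_c[c] += p
        p_f[f] += p
    return p_c, p_f

def entropy(p_c):
    """H(C) = -sum_c P(c) log P(c), in nats."""
    return -sum(p * math.log(p) for p in p_c.values() if p > 0)

def conditional_entropy(joint):
    """H(C|F) = -sum_f P(f) sum_c P(c|f) log P(c|f).

    Uses P(f) * P(c|f) = P(c, f) and P(c|f) = P(c, f) / P(f).
    """
    _, p_f = marginals(joint)
    return -sum(p * math.log(p / p_f[f])
                for (c, f), p in joint.items() if p > 0)

def mutual_information(joint):
    """I(C; F) = sum_{c,f} P(c, f) log [P(c, f) / (P(c) P(f))]."""
    p_c, p_f = marginals(joint)
    return sum(p * math.log(p / (p_c[c] * p_f[f]))
               for (c, f), p in joint.items() if p > 0)

# Hypothetical joint distribution of a class (group vs. other) and one
# binary feature (1 = present in the profile, 0 = absent).
joint = {("group", 1): 0.4, ("group", 0): 0.1,
         ("other", 1): 0.1, ("other", 0): 0.4}
p_c, _ = marginals(joint)
gap = entropy(p_c) - conditional_entropy(joint) - mutual_information(joint)
print(abs(gap) < 1e-12)  # True: both forms of I(C; F) agree
```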
Therefore, to generate virtual profiles, the goal is to find the
optimal feature subset, S ⊆ F, such that I(C; S) is maximized. From
an information-theoretic perspective, selecting features that
maximize I(C; S) translates into selecting those features that
contain the maximum information about class C. However, locating the
optimal subset requires an exhaustive combinatorial search over the
feature space, requiring a number of runs equal to $\binom{n}{k}$,
where n is the size of the original feature set and k is that of the
desired subset. Besides, an exact solution also demands large
training sample sizes to estimate the higher-order joint probability
distribution in I(C; F). For example, Fraser’s method [6], a
computationally efficient algorithm for calculating the optimal
I(C; S), requires for its convergence a number of samples “in the
millions” when the number of features in the input vector is larger
than 3 or 4.
Given these difficulties, most of the existing approaches ap-
proximate I(F; C) based on the assumption of lower-order
dependencies between features. For example, a second-order
feature dependence assumption is proposed by Battiti [3]
to approximate I(F; C) by a greedy incremental selection
scheme with a heuristic to account for correlations between
features: Given a set of already selected features, the algo-
rithm chooses the next feature as the one that maximizes the
information about the class corrected by subtracting a quan-
tity proportional to the average mutual information with the
selected features.
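The greedy incremental scheme can be sketched in Python as follows. This is a minimal sketch assuming that the class-feature and feature-feature mutual information values have already been estimated; the function name, the beta parameter, and the toy values are hypothetical. The penalty here is beta times the summed pairwise mutual information with the already selected features, which for a given selected set is proportional to the average; the exact scaling is an assumption.

```python
def greedy_select(mi_class, mi_pair, k, beta=0.5):
    """Greedy feature selection with a redundancy penalty.

    mi_class: {feature: I(C; f)}, assumed precomputed.
    mi_pair:  {frozenset({f, g}): I(f; g)}, assumed precomputed;
              missing pairs are treated as 0.
    beta:     strength of the redundancy correction (assumed parameter).
    """
    selected = []
    candidates = set(mi_class)
    while candidates and len(selected) < k:
        def score(f):
            redundancy = sum(mi_pair.get(frozenset((f, s)), 0.0)
                             for s in selected)
            return mi_class[f] - beta * redundancy
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy values: "java" and "jvm" are individually informative but redundant.
mi_class = {"java": 0.30, "jvm": 0.28, "hiring": 0.20}
mi_pair = {frozenset(("java", "jvm")): 0.25}

print(greedy_select(mi_class, mi_pair, k=2, beta=1.0))  # ['java', 'hiring']
print(greedy_select(mi_class, mi_pair, k=2, beta=0.0))  # ['java', 'jvm']
```

With beta set to 0 the scheme reduces to ranking features by their individual mutual information with the class, ignoring redundancy between features.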
Unfortunately, calculating the pairwise feature correlation I(f; f′)
is impractical in our case because the feature dimension is extremely
high, given the bag-of-words extracted from textual content.
Therefore, we make a first-order class dependence assumption that
each feature independently influences the class variable, which means
that the mth selected feature, fm, is independent of the (m − 1)
already selected features given the class, i.e.,
P(fm | f1, . . . , fm−1, C) = P(fm | C). This results in a
straightforward greedy algorithm for generating the virtual profile
of an entity c, which consists of the following steps: 1) gather
features from all primary profiles associated with entities that have
an affiliation with c, 2) calculate the mutual information, I(f; c),
between each feature f and c, and 3) select the top k features with
the highest I(f; c) into the virtual profile. More specifically,
I(f; c) can be calculated as follows.
$$I(f; c) = \sum_{e_f \in \{1, 0\}} \sum_{e_c \in \{1, 0\}} P(f = e_f, c = e_c) \log \frac{P(f = e_f, c = e_c)}{P(f = e_f) P(c = e_c)}, \tag{1}$$
where f is a random variable that takes values e_f = 1 (the entity
primary profile contains feature f) and e_f = 0 (the entity primary
profile does not contain feature f), and c is a random variable that
takes values e_c = 1 (the entity is affiliated with c) and e_c = 0
(the entity is not affiliated with c). The probabilities in
Equation 1 can be calculated using maximum likelihood estimation.
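The three steps and Equation 1 can be sketched in Python as follows, with the probabilities estimated by maximum likelihood from a 2x2 contingency table of counts. This is a minimal sketch: the data layout (a dictionary from entity id to a set of features), the function names, and the toy profiles are assumptions for illustration, and a production version would need to handle a very large vocabulary and entity set.

```python
import math

def mi_from_counts(n11, n10, n01, n00):
    """I(f; c) per Equation 1, with maximum likelihood probabilities.

    n11: profiles containing f and affiliated with c
    n10: profiles containing f, not affiliated with c
    n01: profiles without f, affiliated with c
    n00: profiles without f, not affiliated with c
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # e_f = 1, e_c = 1
        (n10, n11 + n10, n10 + n00),  # e_f = 1, e_c = 0
        (n01, n01 + n00, n11 + n01),  # e_f = 0, e_c = 1
        (n00, n01 + n00, n10 + n00),  # e_f = 0, e_c = 0
    )
    total = 0.0
    for n_fc, n_f, n_c in cells:
        if n_fc > 0:  # empty cells contribute nothing (0 log 0 = 0)
            total += (n_fc / n) * math.log(n * n_fc / (n_f * n_c))
    return total

def virtual_profile(profiles, members, k):
    """Top-k features by I(f; c) for an entity c with the given members.

    profiles: {entity_id: set_of_features} for all entities (assumed layout)
    members:  ids of entities affiliated with c (a subset of profiles)
    """
    # Step 1: gather candidate features from affiliated primary profiles.
    candidates = set()
    for m in members:
        candidates |= profiles[m]
    # Step 2: mutual information between each candidate feature and c.
    scores = {}
    for f in candidates:
        n11 = sum(1 for m in members if f in profiles[m])
        n_f = sum(1 for feats in profiles.values() if f in feats)
        n10 = n_f - n11
        n01 = len(members) - n11
        n00 = len(profiles) - n_f - n01
        scores[f] = mi_from_counts(n11, n10, n01, n00)
    # Step 3: keep the k highest-scoring features.
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]

# Toy data: six members, three of whom follow the group of interest.
profiles = {
    "u1": {"hadoop", "java", "manager"},
    "u2": {"hadoop", "scala"},
    "u3": {"hadoop", "java"},
    "u4": {"sales", "manager"},
    "u5": {"sales", "java"},
    "u6": {"marketing", "manager"},
}
members = {"u1", "u2", "u3"}
print(virtual_profile(profiles, members, k=2))  # ['hadoop', 'scala']
```

In the toy data, "hadoop" appears in every follower profile and in no other profile, so it receives the maximum score, whereas "java" is spread across followers and non-followers and scores low. Note that mutual information is symmetric in the direction of association, so a feature that is notably rare among followers can also score highly; whether such features should be filtered is a design choice this sketch leaves open.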
4. EXPERIMENTS
Our goal is to test whether virtual profiles are a valuable source of
features for improving recommendation performance. In designing the
experiments, we want to verify the heuristic assumption that a
virtual profile can be built from features greedily selected by
mutual information. We also want to compare the performance of
virtual profiles with other classic collaborative filtering methods
and study their tradeoffs. Furthermore, by experimenting with
different parameter settings for generating virtual profiles, we want
to provide general guidance on how virtual profiles can best be
implemented in practice.
4.1 Methodologies
We choose a community recommendation problem at LinkedIn as the test
application. Successful recommendations result in users following
certain communities, while users are also given the choice to opt out
of communities at any later point.
We extract three kinds of features from entities (users and
communities) in this application domain, as follows.
1. content features: features from users’ and communi-
ties’ textual information extracted into predefined stan-
dardized fields (e.g., name, industry, description, etc.).
2. virtual profile: as described in Section 3, a set of fea-
tures selected from a community’s followers as supple-
ments to the community’s primary profile.
3. browsemap: a collaborative feature representing the
co-affiliation relationship, or “users who follow X also
follow Y.”
Browsemaps capture a notion of similarity between communities that is
driven by users’ preferences. To generate a browsemap for a
community, from all other communities with which it shares followers,
we choose the top 50 ranked by TF/IDF. Then, for each user, we take
the closure of the communities she has already followed with respect
to browsemaps, and select the top 50 weighted by their TF/IDF scores
normalized over the number of communities followed. Communities
selected in this way can essentially be seen as recommendations by
collaborative filtering. We instead treat them as a standalone
feature: when combined with users’ content features to generate a
search query, it leads to extra field matches, with hits against the
communities that appear in the feature. The weight of this match,
just like matches in other features, can be determined in an offline
learning process.
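The browsemap construction can be sketched in Python as follows. The text does not spell out how the TF/IDF score is defined, so this sketch assumes one plausible reading: the term frequency is the number of shared followers and the inverse document frequency is the logarithm of the total number of users divided by the number of followers of the candidate community. The function names and toy data are hypothetical, and the all-pairs loop is only suitable for small examples.

```python
import math
from collections import defaultdict

def build_browsemaps(follows, top_n=50):
    """Per-community browsemaps: {community: {other_community: score}}.

    follows: {user: set_of_followed_communities} (assumed layout).
    Score (an assumed TF/IDF form): shared followers * log(N / followers).
    """
    followers = defaultdict(set)
    for user, communities in follows.items():
        for c in communities:
            followers[c].add(user)
    n_users = len(follows)
    browsemaps = {}
    for x, followers_x in followers.items():
        scores = {}
        for y, followers_y in followers.items():
            shared = len(followers_x & followers_y)
            if y != x and shared:
                scores[y] = shared * math.log(n_users / len(followers_y))
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        browsemaps[x] = dict(ranked[:top_n])
    return browsemaps

def user_browsemap_feature(followed, browsemaps, top_n=50):
    """Merge the browsemaps of a user's followed communities.

    Scores are summed and normalized by the number of communities followed.
    A user who follows nothing gets an empty feature (the cold-start case).
    """
    if not followed:
        return {}
    totals = defaultdict(float)
    for c in followed:
        for y, score in browsemaps.get(c, {}).items():
            totals[y] += score
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return {y: score / len(followed) for y, score in ranked}

follows = {"u1": {"a", "b"}, "u2": {"a", "b"}, "u3": {"a", "c"}, "u4": {"d"}}
browsemaps = build_browsemaps(follows)
print(sorted(browsemaps["a"]))                 # ['b', 'c']
print(user_browsemap_feature(set(), browsemaps))  # {}
```

The empty result for a user with no followed communities is the behavior the later experiments rely on when discussing cold start.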
The content features extracted for communities contain only three
fields (i.e., name, description, and tags). They represent nearly the
minimum amount of information required for a content filtering
recommender system to function, and are therefore considered the
baseline in the experiment. Browsemaps, on the other hand, are
designed as an alternative to virtual profiles for comparison, given
that both take into account the interaction among entities.
For model fitting, we use a training set of 3.4 million positive and
2.2 million negative examples gathered from both explicit and
implicit user feedback (e.g., follow/unfollow or lack of action on
recommendations). We apply L2-regularized logistic regression with
various combinations of the above-mentioned features. The best model
under each configuration is selected by optimizing the area under the
ROC curve (AUC-ROC). The performance of the different models is
evaluated both offline and online. The results are presented in the
next sections.
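The model fitting and selection criterion can be illustrated with a small, dependency-free Python sketch: an L2-regularized logistic regression trained by batch gradient descent, and an AUC computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The optimizer, hyperparameters, function names, and toy data are assumptions for illustration only; the paper does not state how the model was trained at scale.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg_l2(X, y, lam=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on mean log loss + (lam / 2) * ||w||^2.

    The bias term is not regularized. Hyperparameters are illustrative.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auc_roc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: feature 0 (say, a virtual-profile match) predicts a follow;
# feature 1 is uninformative.
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
y = [1, 1, 0, 0]
w, b = fit_logreg_l2(X, y)
scores = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) for x in X]
print(auc_roc(y, scores))                             # 1.0
print(auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))    # 0.75
```

In practice the AUC used for model selection would be measured on held-out data rather than on the training examples, as is done here only to keep the example short.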
4.2 Results
4.2.1 Offline evaluation
We compare the AUC for models obtained by training with
four different feature configurations, namely, (A) content
features only, (B) content features plus virtual profiles, (C)
content features plus browsemaps, and (D) content features
plus both virtual profiles and browsemaps. It can be seen
from Figure 2 that, the ROC curve of model A completely
dominates that of model B (with AUCs 0.72 vs. 0.60), and
both of them dominate that of model C (AUC 0.44). The
same performance pattern is also exhibited in the precision-
recall curve, as shown in Figure 3.
[Figure 2 plot: true positive rate vs. false positive rate, both from
0.0 to 1.0; curves: content features only, content features + vp,
content features + bm, content features + vp + bm.]
Figure 2: ROC curves for different models.
[Figure 3 plot: precision vs. recall, both from 0.0 to 1.0; curves:
content features only, content features + vp, content features + bm,
content features + vp + bm.]
Figure 3: Precision-recall curves for different models.
Besides classification performance, another important measure that
can be evaluated offline is coverage, which refers to the degree to
which recommendations cover the set of available items (item space
coverage) and the degree to which recommendations can be generated
for all potential users (user space coverage) [7, 10]. Owing to a
distributed algorithm developed at LinkedIn, we are able to calculate
recommendations offline for all of our 220 million users. Using each
of the trained models described above, we calculate a different set
of recommendations for each user, with the size of each set capped at
50. We count the number of times each unique community appears in the
recommendations (its frequency) under the different models. Figure 4
plots these frequencies, sorted in descending order, on a logarithmic
scale against their ranks.
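The coverage measurements described above can be sketched in Python as follows. This is a minimal single-machine sketch with hypothetical function names and toy data; the computation in the paper runs over all users with a distributed algorithm that is not described here.

```python
from collections import Counter

def recommendation_frequencies(recs_by_user, cap=50):
    """Times each community is recommended, sorted in descending order.

    recs_by_user: {user: ranked_list_of_communities} (assumed layout);
    each list is truncated to `cap` items, as in the experiment.
    """
    counts = Counter()
    for recs in recs_by_user.values():
        counts.update(set(recs[:cap]))
    return sorted(counts.values(), reverse=True)

def item_space_coverage(recs_by_user, catalog, cap=50):
    """Fraction of the catalog recommended to at least one user."""
    recommended = set()
    for recs in recs_by_user.values():
        recommended.update(recs[:cap])
    return len(recommended & set(catalog)) / len(catalog)

def user_space_coverage(recs_by_user, all_users):
    """Fraction of users who receive at least one recommendation."""
    covered = sum(1 for u in all_users if recs_by_user.get(u))
    return covered / len(all_users)

recs = {"u1": ["a", "b"], "u2": ["a"], "u3": []}
catalog = ["a", "b", "c", "d"]
print(recommendation_frequencies(recs))          # [2, 1]
print(item_space_coverage(recs, catalog))        # 0.5
```

Plotting the output of recommendation_frequencies on a logarithmic scale against rank gives a curve of the kind compared across models in this section.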
It is not surprising that the baseline curve from the
content-features-only model is the lowest, since the features
extracted for communities in this case contain the least amount of
information. The distribution of the recommendation frequency simply
reflects the distribution of the amount of textual content of each
community, which follows a power law. On the other hand, the curve
from the model with the addition of browsemaps visibly bulges
outwards from the baseline for about two thirds of the points,
indicating that those communities show up more frequently in
recommendations, hence more coverage. Most remarkably, the model with
the addition of virtual profiles significantly increases the
frequencies for almost all points on the curve, except for cases
where the original baseline frequencies are extremely high or low.
The reason why browsemaps slightly boost the coverage for some
communities is that those communities bear little content information
yet already have followers. Having followers makes them eligible to
be included in other communities’ browsemaps, and thus leads to a
higher chance of matching with users. However, for users who have not
followed any communities at all, browsemaps become an empty feature,
which is why about a third of the communities see no increase in
coverage from browsemaps compared with the baseline. This phenomenon
is also illustrated in Figure 5, in which the recommendation
frequencies of unique companies are counted only for new users (i.e.,
users who have not started following communities yet). We observe
that the model with browsemaps produces a curve identical to the
baseline, while the model with virtual profiles yields a consistent
boost. This shows that browsemaps, as a collaborative filtering
feature, fail to address cold start, while virtual profiles provide a
well-rounded improvement in terms of both coverage and predictive
power.
[Figure 4 plot: number of recommendations (logarithmic scale) vs.
rank; curves: content features only, content features + vp, content
features + bm.]
Figure 4: Number of recommendations per unique company.
4.2.2 Online evaluation
To further evaluate models with various feature configura-
tions (i.e., content features with vp, content features with
bm, content features with both vp and bm, and content fea-
tures only), we deployed them to serve realtime online rec-
ommendation requests and compare performances through
a bucket test. We assign a unique bucket of 2.5% randomly
selected users to each model. The bucket with the model
based only on content features is the control, while others
are variants.
[Figure 5 plot: number of recommendations (logarithmic scale) vs.
rank; curves: content features only, content features + vp, content
features + bm.]
Figure 5: Number of recommendations for new users per unique company.
The duration of the test is determined according to Wheeler [18],
where a conservative estimate of the sample size needed to achieve
80% power (the probability of correctly rejecting the null hypothesis
when it is indeed false) is given by Equation 2:
$$n = \left( \frac{4 r \sigma}{\Delta} \right)^2, \tag{2}$$
where n is the minimum number of samples (impressions to be
delivered) for each equal-sized variant, r is the number of variants,
σ² is the variance of the OEC (Overall Evaluation Criterion [16], a
quantitative measure of the experiment’s objective), and ∆ is the
sensitivity, or the desired amount of change. The OEC in this test is
the click-through rate (CTR) of recommendations. Assuming each
click-through event is a Bernoulli trial with probability p = ctr0
(the control CTR, estimated from historical data), σ² = p(1 − p).
Applying Equation 2 and knowing the approximate number of
recommendation impressions per day, we derive the length of the test
to be 7 days.
Figure 7 presents the results of the test by showing the percentage
change in CTR of the variant models relative to the control on each
individual day of the test. Overall, the model with virtual profiles
outperforms the control by 91.2%. Surprisingly, however, we do not
observe any improvement from the model with browsemaps. The model
with both virtual profiles and browsemaps increases the CTR by 84.4%.
The difference between the two best-performing models is not
significant (p-value 0.062), which is similar to the offline
evaluation result. The failure of browsemaps to increase overall CTR
may be attributed to the fact that only one third of users have
followed communities in this particular application, meaning the cold
start effect is very pronounced. Virtual profiles, on the other hand,
are not vulnerable to this problem, since they are content-based and
do not rely on pre-existing user-item affiliations, as demonstrated
in this experiment.
5. CONCLUSION AND FUTURE WORK
We presented virtual profiles, a generic content meta data extension
method. We also described how it is used in a scalable and generic
content-based hybrid recommender system that powers multiple
real-time recommendation products at LinkedIn. The goal of virtual
profiles is to tap into rich-content information from one type of
entity and propagate the extracted features to affiliated entities
that may suffer from relative data scarcity. The method brings a
collaborative filtering aspect, in the form of a supplement to
content features, into the recommender system. It is shown to
outperform a method that directly incorporates network proximity from
collaborative filtering.
[Figure 6 plot: true positive rate vs. false positive rate, both from
0.0 to 1.0; curves: vp-top50, vp-top100, vp-top200.]
Figure 6: ROC curves for virtual profiles with different numbers of
terms.
The experiments support our first-order class dependence assumption
and the greedy algorithm for calculating the mutual information as a
reasonable approximation. In future work, we will investigate
scalable ways to account for dependencies among features. We also
plan to explore term weighting methods besides mutual information,
including other classic information-theoretic quantities such as the
Kullback-Leibler divergence, as well as TF/IDF.
6. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next
generation of recommender systems: A survey of the
state-of-the-art and possible extensions. IEEE
TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING, 17(6):734–749, 2005.
[2] X. Bao, L. Bergman, and R. Thompson. Stacking
recommendation engines with additional
meta-features. In Proceedings of the third ACM
conference on Recommender systems, RecSys ’09,
pages 109–116, 2009.
[3] R. Battiti. Using mutual information for selecting
features in supervised neural net learning. Trans.
Neur. Netw., 5(4):537–550, July 1994.
[4] R. M. Bell, Y. Koren, and C. Volinsky. The BellKor
solution to the Netflix Prize.
[5] R. Burke. Hybrid recommender systems: Survey and
experiments. User Modeling and User-Adapted
Interaction, 12(4):331–370, Nov. 2002.
[Figure 7 plot: CTR % by day of the test (days 1 to 7); series:
content features + vp, content features + bm, content features + vp +
bm.]
Figure 7: Model CTRs.
[6] A. M. Fraser and H. L. Swinney. Independent
coordinates for strange attractors from mutual
information. Physical Review A, 33(2):1134–1140, Feb.
1986.
[7] M. Ge, C. Delgado-Battenfeld, and D. Jannach.
Beyond accuracy: evaluating recommender systems by
coverage and serendipity. In Proceedings of the fourth
ACM conference on Recommender systems, RecSys
’10, pages 257–260, New York, NY, USA, 2010. ACM.
[8] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins.
Eigentaste: A constant time collaborative filtering
algorithm. Inf. Retr., 4(2):133–151, July 2001.
[9] J. L. Herlocker, J. A. Konstan, and J. Riedl.
Explaining collaborative filtering recommendations. In
Proceedings of the 2000 ACM conference on Computer
supported cooperative work, CSCW ’00, pages 241–250,
New York, NY, USA, 2000. ACM.
[10] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and
J. T. Riedl. Evaluating collaborative filtering
recommender systems. ACM Trans. Inf. Syst.,
22(1):5–53, Jan. 2004.
[11] P. B. Kantor. Recommender systems handbook.
Springer, 2009.
[12] P. Melville, R. J. Mooney, and R. Nagarajan.
Content-boosted collaborative filtering for improved
recommendations. pages 187–192, 2002.
[13] M. J. Pazzani. A framework for collaborative,
content-based and demographic filtering. Artif. Intell.
Rev., 13(5-6):393–408, Dec. 1999.
[14] M. J. Pazzani and D. Billsus. The adaptive web.
chapter Content-based recommendation systems,
pages 325–341. Springer-Verlag, Berlin, Heidelberg,
2007.
[15] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and
J. Riedl. Grouplens: an open architecture for
collaborative filtering of netnews. In Proceedings of the
1994 ACM conference on Computer supported
cooperative work, CSCW ’94, pages 175–186, New
York, NY, USA, 1994. ACM.
[16] R. K. Roy. Design of experiments using the Taguchi
approach: 16 steps to product and process
improvement. Wiley, 2001.
[17] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
Item-based collaborative filtering recommendation
algorithms. In Proceedings of the 10th international
conference on World Wide Web, WWW ’01, pages
285–295, 2001.
[18] R. E. Wheeler. Portable power. Technometrics,
16(2):177–179, 1974.

More Related Content

What's hot

Aiim Webinar Helen Mitchell Unified Search Final 7 21 2010
Aiim Webinar Helen Mitchell  Unified Search Final 7 21 2010Aiim Webinar Helen Mitchell  Unified Search Final 7 21 2010
Aiim Webinar Helen Mitchell Unified Search Final 7 21 2010Helen Mitchell
 
Recsys virtual-profiles

  • 1. Generating Supplemental Content Information Using Virtual Profiles
Haishan Liu, Linkedin Corporation, 2029 Stierlin Court, Mountain View, CA, 94043, haliu@linkedin.com
Mohammad Amin, Linkedin Corporation, 2029 Stierlin Court, Mountain View, CA, 94043, mamin@linkedin.com
Baoshi Yan, Linkedin Corporation, 2029 Stierlin Court, Mountain View, CA, 94043, byan@linkedin.com
Anmol Bhasin, Linkedin Corporation, 2029 Stierlin Court, Mountain View, CA, 94043, abhasin@linkedin.com
ABSTRACT
We describe a hybrid recommendation platform/technique at LinkedIn that seeks to optimally extract relevant information pertaining to items to be recommended. By extending the notion of an item profile, we propose the concept of a “virtual profile” that augments the content of the item with a rich set of features inherited from members who have already shown explicit interest in it. Unlike item-based collaborative filtering, we focus on discovering the characteristic descriptors that underlie the item-user association. Such information is used as supplemental features in a content-based filtering system. The main objective of virtual profiles is to provide a means to tap into rich-content information from one type of entity and propagate the features extracted from it to other affiliated entities that may suffer from relative data scarcity. We empirically evaluate the proposed method on a real-world community recommendation problem at LinkedIn. The results show that virtual profiles outperform a collaborative filtering based approach (users who like this also like that). In particular, the improvement is more significant for new users with only limited connections, demonstrating the capability of the method to address the cold-start problem in pure collaborative filtering systems.
Categories and Subject Descriptors
H.2.8 [Database Management]: Data Mining
General Terms
Theory
Keywords
hybrid recommender systems, feature generation and extraction, model-based recommendation, virtual profiles
1. INTRODUCTION
In the era of internet-scale data deluge, large-scale recommender systems contribute significantly to mitigating the information overload problem by unveiling relevant and interesting objects to users. Rather than hoping for serendipitous encounters, recommender systems bring forth the notion of personalized information discovery by presenting to the user a smaller pool of relevant objects. Collaborative filtering, the de facto mechanism for recommendation, fails to address the “cold start” problem, which has led to the exploration of hybrid recommenders. Hybrid recommenders combine information obtained from different sources and techniques to achieve a better outcome. Typically a hybrid recommender system incorporates information from a myriad of sources, e.g., content meta data, interaction data, global popularity, social network and social interaction information, and so on. Each of these information sources offers a different level of relevance guarantee at a different computational overhead. Hence, how these information sources are computed and how they are combined play a vital role in the final outcome.
As of today LinkedIn has more than 220 million users. As the largest and most popular professional networking site, LinkedIn presents some unique opportunities and challenges for content discovery and recommendation. It is imperative for the members to be able to discover and subscribe to companies and groups (referred to as communities henceforth) that might be relevant to them from a professional context. In this paper, we describe a hybrid community recommendation platform/technique at LinkedIn that optimally combines information from multiple sources.
In order to extract more relevant information pertaining to the community to be recommended, i.e., to further extend the notion of content meta data, we propose the concept of a “virtual profile” that augments the content meta data with a rich set of features inherited from the set of members who have already shown explicit interest in it. In general the notion of a virtual profile answers: “What are the most dominant features pertaining to the members who have shown interest in a particular
  • 2. community?”. This question essentially maps an object into the same feature space as that of its subscribers. Content meta data, extended with this inferred information, provides additional protection against the cold start problem. LinkedIn data presents a unique opportunity to extend the content features with extracted features, since the data set contains abundant, rich information about the subscribers, which makes the synergy especially valuable.
The contributions of this paper are as follows:
1. A generic content meta data extension method, i.e., virtual profile generation.
2. A scalable and generic recommendation computation platform that powers multiple real-time recommendation products at LinkedIn.
3. Seamless integration of multiple, heterogeneous data sources to compute an optimal outcome.
2. RELATED WORK
There has been a flurry of research in the domain of recommender systems with the objective of improving personalization [1]. Most traditional recommenders are powered by collaborative filtering [9, 17], content-based predictors [8, 14] and knowledge-based filtering techniques [11]. Each technique has its own strengths and weaknesses; e.g., while collaborative filtering techniques suffer from data sparsity and cold start problems [15], content-based techniques are prone to skewed recommendations [14]. Hybrid recommenders combine the best of both worlds, making the recommenders more robust in practice. Much work has been done to combine multiple recommenders in an effective way to outperform any single one. In [5] Burke depicts a taxonomy of recommender systems, where multiple recommenders are arranged to allow execution in a parallel or cascaded topology. A system described in [4] combines multiple collaborative filtering approaches using a linear combination of static weights learned via linear regression.
STREAM [2], which combines multi-tier predictors, uses dynamically generated metrics to learn the next level of predictors. In [12], a hybrid movie recommender system is proposed that uses content-based predictors to boost user data, which drives the ensuing collaborative filtering based recommendation. The content information is obtained from IMDB and a Naive Bayes classifier is used for building user item profiles. Finally, user-based collaborative filtering is employed to obtain the final recommendation. However, this approach suffers from scalability issues. Pazzani [13] proposed a hybrid recommender system in which content-based user profiles are used to group similar users, and the grouping is subsequently used to predict user preferences. In many of these user-item recommendation frameworks, items to be recommended can be augmented with meta-data corresponding to the members who have already shown explicit interest in them. In other words, these items can be represented as objects in the same feature space as that of the users. These representations could be thought of as “virtual user profiles” or “virtual profiles”. This could potentially add another layer of information to guide the recommendation process. In our approach, we describe a large-scale recommender system that combines data from multiple heterogeneous sources, including virtual profiles and the social network, to serve real-time traffic in a large professional social networking site.
3. METHOD
3.1 System Overview
We build our recommender system on content filtering because we have abundant access to rich-content entities, such as user profiles, which enables straightforward feature extraction, indexing and matching. Target entities (those the client wants recommendations of) have their features extracted and are put into a reverse index, and source entities (those the client wants recommendations for) are converted into complex queries against the index.
This provides a form of content-based recommendation score where the match is determined by the degree of similarity between the source and target entity features, with different fields weighted by a set of parameters determined by an offline learning-to-rank process. Figure 1 illustrates a brief workflow of the system. It also shows how we can augment the system by including more information, such as virtual profiles, as new features in the content filtering recommendation, as detailed below.
Figure 1: A brief workflow for the recommender system with virtual profiles.
We view every entity as being characterized by two sets of content features: one extracted from explicit information associated with the entity, which we name the “primary profile”, and the other inferred from the entity’s behavior and association with other entities, which we name the “virtual profile.” The main objective of virtual profiles is to provide a means to tap into rich-content information from one type of entity and propagate the features extracted from it to other affiliated entities that may suffer from relative data scarcity. Essentially, a virtual profile of an entity is an aggregation of statistically relevant features from the primary profiles of affiliated entities, and in this way it introduces a collaborative
  • 3. filtering aspect in our content filtering system. For example, a virtual profile of a LinkedIn group consists of distinctive features from its participants so that the group can be most effectively distinguished from others.
To extract features from entities and generate primary profiles, we use a feature extractor layer, a standalone service that accumulates underlying entity database change events and identifies various fields in the document. The types of fields from which features can be extracted include rich text fields, such as job summary and member position summary, and specialized fields, such as geo entities including region, country, city and coordinates.
The presented content filtering system can be extended to consider other collaborative filtering aspects, for example, by including network proximity as a feature while computing relevance scores. We describe a browsemap-based method along this line as a comparison in Section 4. As a general platform, every application consuming recommendations from this system can easily build its own logic for reranking/reordering of results based on custom filtering criteria. the concept of network proximity, e.g., recommending jobs to discussion groups.
3.2 Generating Virtual Profiles
The virtual profile generation process for an entity aims at selecting, from a total of n features of its affiliated entities, a subset of k < n features that is “maximally informative” about the entity. From a classification point of view, the entity that we generate the virtual profile for represents a target class for a set of documents (primary profiles). We need a measure to evaluate the “information content” of each individual feature with regard to the target class. We propose to use mutual information for this purpose. Mutual information measures arbitrary dependencies between random variables.
Moreover, the mutual information is independent of the coordinates chosen, which permits a robust estimation and makes it suitable for assessing the “information content” of features in complex classification tasks.
In accordance with Shannon’s information theory, the uncertainty of a document class C as a random variable can be measured as:
$$H(C) = -\sum_{c \in C} P(c) \log P(c).$$
After knowing the feature vector F, the conditional entropy H(C|F) measures the remaining uncertainty about C:
$$H(C|F) = -\sum_{f \in F} P(f) \sum_{c \in C} P(c|f) \log P(c|f).$$
After having observed the feature vector F, the mutual information, i.e., the amount by which the class uncertainty is decreased, is defined as:
$$I(C; F) = H(C) - H(C|F) = \sum_{c, f} P(c, f) \log \frac{P(c, f)}{P(c) P(f)},$$
where P(c, f) is the joint probability of class c and feature f.
Therefore, to generate virtual profiles, the goal is to find the optimal feature subset, S ⊆ F, so that I(C; S) is maximized. From an information theoretic perspective, selecting features that maximize I(C; S) translates into selecting those features that contain the maximum information about class C. However, locating the optimal subset requires an exhaustive combinatorial search over the feature space, requiring a number of runs equal to $\binom{n}{k}$, where n is the size of the original feature set and k is that of the desired subset. Besides, an exact solution also demands large training sample sizes to estimate the higher-order joint probability distribution in I(C; F). For example, Fraser’s method [6], a computationally efficient algorithm for calculating the optimal I(C; S), requires for its convergence a number of samples “in the millions” when the number of features in the input vector is larger than 3 or 4.
Given these difficulties, most of the existing approaches approximate I(C; F) based on the assumption of lower-order dependencies between features.
For example, a second-order feature dependence assumption is proposed by Battiti [3] to approximate I(C; F) by a greedy incremental selection scheme with a heuristic to account for correlations between features: given a set of already selected features, the algorithm chooses the next feature as the one that maximizes the information about the class, corrected by subtracting a quantity proportional to the average mutual information with the selected features.
Unfortunately, the calculation of the pairwise feature correlation $I(f; f')$ is impractical in our case because the feature dimension is extremely high given the bag-of-words extracted from textual contents. Therefore, we make a first-order class dependence assumption that each feature independently influences the class variable, which means that the selection of the m-th feature, $f_m$, is independent of the (m − 1) already selected features, i.e., $P(f_m \mid f_1, \ldots, f_{m-1}, C) = P(f_m \mid C)$. This results in a straightforward greedy algorithm to generate the virtual profile for an entity c, which consists of the following steps: 1) gather features from all primary profiles associated with entities that have an affiliation with c, 2) calculate the mutual information, I(f; c), between each feature and c, and 3) select the top k features with the highest I(f; c) into the virtual profile. More specifically, I(f; c) can be calculated as follows:
$$I(f; c) = \sum_{e_f \in \{1, 0\}} \sum_{e_c \in \{1, 0\}} P(f = e_f, c = e_c) \log \frac{P(f = e_f, c = e_c)}{P(f = e_f) P(c = e_c)}, \tag{1}$$
where f is a random variable that takes values $e_f = 1$ (the entity primary profile contains feature f) and $e_f = 0$ (the entity primary profile does not contain feature f), and c is a random variable that takes values $e_c = 1$ (the entity is affiliated with c) and $e_c = 0$ (the entity is not affiliated with c). The probabilities in Equation 1 can be calculated using maximum likelihood estimation.
4. EXPERIMENTS
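As an aside on Section 3.2, the three-step greedy procedure and Equation 1 can be sketched in Python as follows. This is only an illustrative sketch, not the production implementation: the function names `mutual_information` and `virtual_profile`, the representation of primary profiles as sets of features, and the alphabetical tie-breaking are all assumptions made for the example. Probabilities are estimated by maximum likelihood from a 2x2 table of counts.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """I(f; c) of Equation 1, computed from a 2x2 table of counts.

    n11: entities that contain feature f and are affiliated with c
    n10: entities that contain f but are not affiliated with c
    n01: entities that lack f but are affiliated with c
    n00: entities that lack f and are not affiliated with c
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for joint, f_marginal, c_marginal in cells:
        if joint > 0:  # a cell with zero count contributes nothing
            mi += (joint / n) * math.log(joint * n / (f_marginal * c_marginal))
    return mi


def virtual_profile(profiles, affiliated, k):
    """Return the k features with the highest I(f; c) (hypothetical helper).

    profiles:   dict mapping entity id -> set of features (primary profile)
    affiliated: set of entity ids affiliated with the target entity c
    """
    n_total = len(profiles)
    n_aff = sum(1 for entity in profiles if entity in affiliated)
    total_counts, aff_counts = Counter(), Counter()
    for entity, features in profiles.items():
        total_counts.update(features)
        if entity in affiliated:
            aff_counts.update(features)
    # Step 1: candidates are features seen in affiliated primary profiles.
    # Step 2: score each candidate with I(f; c).
    scores = {}
    for feature, n11 in aff_counts.items():
        n10 = total_counts[feature] - n11
        n01 = n_aff - n11
        n00 = n_total - n_aff - n10
        scores[feature] = mutual_information(n11, n10, n01, n00)
    # Step 3: keep the top k, breaking ties alphabetically (an assumption).
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Toy data: "java" appears everywhere, "hadoop" only among the followers.
profiles = {
    "u1": {"java", "hadoop"},
    "u2": {"java", "hadoop"},
    "u3": {"java", "sales"},
    "u4": {"java", "sales"},
}
top_features = virtual_profile(profiles, affiliated={"u1", "u2"}, k=1)
```

In the toy data, a feature shared by every entity carries no information about the affiliation, whereas a feature found only among the affiliated entities is maximally informative, which is why the selection prefers the latter.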
  • 4. Our goal is to test whether virtual profiles are a valuable source of features to improve the recommendation performance. In designing experiments, we want to verify the heuristic assumption that a virtual profile can be built from features greedily selected by mutual information. We also want to compare the performance of virtual profiles with other classic collaborative filtering methods and study their tradeoffs. Furthermore, by experimenting with different parameter settings to generate virtual profiles, we want to provide general guidance on how virtual profiles can be best implemented in practice.
4.1 Methodologies
We choose a community recommendation problem at LinkedIn as the test application. Successful recommendations would result in users following certain communities, while users are also given the choice to opt out of communities at any later point.
We extract three kinds of features from entities (users and communities) in this application domain as follows.
1. content features: features from users’ and communities’ textual information extracted into predefined standardized fields (e.g., name, industry, description, etc.).
2. virtual profile: as described in Section 3, a set of features selected from a community’s followers as supplements to the community’s primary profile.
3. browsemap: a collaborative feature representing the co-affiliation relationship, or “users who follow X also follow Y.”
Browsemaps capture a notion of similarity between communities that is driven by users’ preferences. To generate a browsemap for a community, from all other communities that it shares followers with, we choose the top 50 ranked by TF/IDF. Then, for each user, we take the closure of the communities she has already followed with respect to browsemaps, and select the top 50 weighted by their TF/IDF scores normalized over the number of communities followed.
Communities selected in this way can essentially be seen as recommendations by collaborative filtering. We instead treat them as a standalone feature: when combined with a user’s content features to generate a search query, the feature leads to extra field matches, with hits against the communities that appear in it. The weight of this match, like that of matches in other features, can be determined in an offline learning process.
The content features extracted for communities contain only three fields (i.e., name, description, and tags). They represent nearly the minimum amount of information that is required for a content filtering recommender system to function, and are therefore considered as a baseline in the experiment. Browsemaps, on the other hand, are designed as an alternative to virtual profiles for comparison, given that they both take into account the interaction among entities.
As for model fitting, we use a training set including 3.4 million positive and 2.2 million negative examples gathered from both explicit and implicit user feedback (e.g., follow/unfollow or lack of action on recommendations). We apply an L2-regularized logistic regression with various combinations of the above-mentioned features. The best model under each configuration is selected by optimizing the area under the ROC curve (AUC-ROC). The performance of the different models is evaluated both offline and online. The results are presented in the next sections.
4.2 Results
4.2.1 Offline evaluation
We compare the AUC for models obtained by training with four different feature configurations, namely, (A) content features only, (B) content features plus virtual profiles, (C) content features plus browsemaps, and (D) content features plus both virtual profiles and browsemaps. It can be seen from Figure 2 that the ROC curve of model A completely dominates that of model B (with AUCs 0.72 vs. 0.60), and both of them dominate that of model C (AUC 0.44).
The precision-recall curves in Figure 3 show the same performance pattern.

Figure 2: ROC curves for different models. [Plot: true positive rate against false positive rate; curves: content features + vp, content features + bm, content features + vp + bm, content features only.]

Besides classification performance, another important measure that can be evaluated offline is coverage, which refers to the degree to which recommendations cover the set of available items (item space coverage) and the degree to which recommendations can be generated for all potential users (user space coverage) [7, 10]. Owing to a distributed algorithm developed at LinkedIn, we are able to calculate recommendations offline for all of our 220 million users. Using each of the trained models described above, we calculate a different set of recommendations for each user, with the size of each set capped at 50. We counted the number of times
unique communities appeared in recommendations (frequencies) under the different models. Figure 4 plots these frequencies, sorted in descending order and shown on a logarithmic scale, against their ranks.

Figure 3: Precision-recall curves for different models. [Plot: precision against recall; curves: content features + vp, content features + bm, content features + vp + bm, content features only.]

It is not surprising that the baseline curve from the content-features-only model is the lowest, since the features extracted for communities in this case contain the least information. The distribution of the recommendation frequency simply reflects the distribution of the amount of textual content of each community, which follows a power law. The curve from the model with browsemaps added, on the other hand, visibly bulges outwards from the baseline for about two thirds of the points, indicating that those communities show up in recommendations more frequently and hence gain coverage. Most remarkably, the model with virtual profiles added significantly increases the frequencies for almost all points on the curve, except where the original baseline frequencies are extremely high or low.

Browsemaps slightly boost the coverage of some communities because those communities bear little content information yet already have followers. Having followers makes them eligible for inclusion in other communities' browsemaps, and thus leads to a higher chance of matching users. However, for users who have not followed any communities at all, the browsemap is an empty feature, which is why about a third of communities see no increase in coverage from browsemaps compared with the baseline.
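The frequency and coverage measures discussed above can be computed from per-user recommendation lists roughly as follows. This is a small sketch under assumed inputs (a dictionary `recs_by_user` mapping a user id to a list of recommended community ids); the function names are hypothetical, and the distributed computation used for the full user base is not shown.

```python
from collections import Counter


def recommendation_frequencies(recs_by_user):
    """Count how often each item appears across all recommendation lists.

    Returns the counts sorted in descending order, i.e. the values that
    would be plotted against rank on a logarithmic scale.
    """
    counts = Counter(item for recs in recs_by_user.values() for item in recs)
    return sorted(counts.values(), reverse=True)


def item_space_coverage(recs_by_user, catalog):
    """Fraction of catalog items recommended to at least one user."""
    recommended = set().union(*recs_by_user.values())
    return len(recommended & set(catalog)) / len(catalog)


def user_space_coverage(recs_by_user, all_users):
    """Fraction of users who receive at least one recommendation."""
    served = sum(1 for user in all_users if recs_by_user.get(user))
    return served / len(all_users)
```

Restricting `recs_by_user` to users who follow no communities yet gives the new-user view of the same frequencies.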
The cold-start limitation of browsemaps is also illustrated in Figure 5, in which the recommendation frequencies of unique companies are counted only for new users (i.e., users who have not yet started following communities). We observe that the model with browsemaps produces a curve identical to the baseline, while the model with virtual profiles yields a consistent boost. This shows that browsemaps, as a collaborative filtering feature, fail to address cold start, while virtual profiles provide a well-rounded improvement in terms of both coverage and predictive power.

Figure 4: Number of recommendations per unique company. [Plot: number of recommendations (log scale) against rank; curves: content features + vp, content features + bm, content features only.]

4.2.2 Online evaluation

To further evaluate the models with the various feature configurations (i.e., content features with vp, content features with bm, content features with both vp and bm, and content features only), we deployed them to serve real-time online recommendation requests and compared their performance through a bucket test. We assign a unique bucket of 2.5% randomly selected users to each model. The bucket whose model is based only on content features is the control, while the others are variants. The duration of the test is determined according to Wheeler [18], where a conservative estimate of the sample size needed to achieve 80% power (the probability of correctly rejecting the null hypothesis when it is indeed false) is given by Equation 2:

$$ n = \left( \frac{4 r \sigma}{\Delta} \right)^2, \qquad (2) $$

where $n$ is the minimum number of samples (impressions to be delivered) for each equal-sized variant, $r$ is the number of variants, $\sigma^2$ is the variance of the OEC (Overall Evaluation Criterion [16], a quantitative measure of the experiment's objective), and $\Delta$ is the sensitivity, or the desired amount of change. The OEC in this test is the click-through rate (CTR) of recommendations. Assume each click-through
event is a Bernoulli trial with probability $p = \mathrm{ctr}_0$ (the control CTR, estimated from historical data); then $\sigma^2 = p(1 - p)$. Applying Equation 2 and knowing the approximate number of recommendation impressions per day, we derive a test length of 7 days.

Figure 5: Number of recommendations for new users per unique company. [Plot: number of recommendations (log scale) against rank; curves: content features + vp, content features + bm, content features only.]

Figure 7 presents the results of the test, showing the percentage change in CTR of the variant models relative to the control on each individual day of the test. Overall, the model with virtual profiles outperforms the control by 91.2%. Surprisingly, however, we do not observe any improvement from the model with browsemaps. The model with both virtual profiles and browsemaps increased the CTR by 84.4%. The difference between the two best-performing models is not significant (p-value 0.062), which is similar to the offline evaluation result. The failure of browsemaps to increase overall CTR may be attributed to the fact that only one third of users have followed communities in this particular application, meaning that the cold-start effect is pronounced. Virtual profiles, on the other hand, are not vulnerable to this problem, since they are content-based and do not rely on pre-existing user-item affiliations, as this experiment demonstrates.

5. CONCLUSION AND FUTURE WORK

We presented virtual profiles, a generic content metadata extension method. We also described how it is used in a scalable, generic content-based hybrid recommender system that powers multiple real-time recommendation products at LinkedIn. The goal of virtual profiles is to provide a means to tap into rich content information from one type of entity and propagate the features extracted from it to other affiliated entities that may suffer from relative data scarcity.
The method brings a collaborative filtering aspect into the recommender system in the form of a supplement to content features, and is shown to outperform a method that directly incorporates network proximity from collaborative filtering.

Experiments support that our first-order class dependence assumption and the greedy algorithm for calculating the mutual information are a reasonable approximation. In future work, we will investigate scalable ways to account for dependencies among features. We also plan to explore term weighting methods other than mutual information, including other classic information-theoretic quantities such as the Kullback-Leibler divergence, as well as TF/IDF.

Figure 6: ROC curves for virtual profiles with different numbers of terms. [Plot: true positive rate against false positive rate; curves: vp-top50, vp-top100, vp-top200.]

6. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2] X. Bao, L. Bergman, and R. Thompson. Stacking recommendation engines with additional meta-features. In Proceedings of the third ACM conference on Recommender systems, RecSys '09, pages 109–116, 2009.
[3] R. Battiti. Using mutual information for selecting features in supervised neural net learning. Trans. Neur. Netw., 5(4):537–550, July 1994.
[4] R. M. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize.
[5] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, Nov. 2002.
[6] A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Physical Review A, 33(2):1134–1140, Feb. 1986.
[7] M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. In Proceedings of the fourth ACM conference on Recommender systems, RecSys '10, pages 257–260, New York, NY, USA, 2010. ACM.
[8] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, July 2001.
[9] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM conference on Computer supported cooperative work, CSCW '00, pages 241–250, New York, NY, USA, 2000. ACM.
[10] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5–53, Jan. 2004.
[11] P. B. Kantor. Recommender systems handbook. Springer, 2009.
[12] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. pages 187–192, 2002.
[13] M. J. Pazzani. A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev., 13(5-6):393–408, Dec. 1999.
[14] M. J. Pazzani and D. Billsus. The adaptive web. chapter Content-based recommendation systems, pages 325–341. Springer-Verlag, Berlin, Heidelberg, 2007.
[15] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work, CSCW '94, pages 175–186, New York, NY, USA, 1994. ACM.
[16] R. K. Roy. Design of experiments using the Taguchi approach: 16 steps to product and process improvement. Wiley, 2001.
[17] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, WWW '01, pages 285–295, 2001.
[18] R. E. Wheeler. Portable power. Technometrics, 16(2):177–179, 1974.

Figure 7: Model CTRs. [Plot: CTR % by day (1 to 7); series: content features + vp, content features + bm, content features + vp + bm.]