1. Multidimensional Relevance Modeling
via Psychometrics & Crowdsourcing
Matt Lease
• School of Information @mattlease
University of Texas at Austin ml@utexas.edu
Joint work with
Yinglong Zhang, Jin Zhang, Jacek Gwizdka
slides: www.slideshare.net/mattlease
2. Saracevic’s ‘97 Salton Award address
“…the human-centered side was often highly critical
of the systems side for ignoring users... [when]
results have implications for systems design &
practice. Unfortunately… beyond suggestions,
concrete design solutions were not delivered.
“…the systems side by and large ignores the user
side and user studies… the stance is ‘tell us what
to do and we will.’ But nobody is telling...
“Thus, there are not many interactions…”
Matt Lease <ml@utexas.edu> 2/20
3. Primary Research Question
• What is relevance?
– What factors constitute it? Can we quantify their
relative importance? How do they interact?
• Old IR question, many studies, little agreement
• Potential impacts?
– Further understanding of cognitive relevance
– Guide IR engineering toward inferring key factors
– Foster multi-dimensional evaluation of IR systems
4. Secondary Research Question
• How can we measure/ensure the quality of
subjective relevance judgments?
– How can we distinguish valid subjectivity from human
error in judging disagreements (traditional or online)?
• Potential impacts
– Help explain/reduce judging disagreements
– Enable evaluation wrt. distribution of opinions
– Encourage other subjective data collection in HCOMP
5. Psychology to the Rescue!
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk: A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
7. Contributions
• Describe a simple, reliable, scalable method for
collecting diverse (subjective), multi-dimensional
relevance judgments from online participants
– Online survey techniques from psychometrics
– Data available online
• Describe a rigorous, positivist, data-driven framework
for inferring & modeling multi-dimensional relevance
– Structural equation modeling (SEM) from psychometrics
– Run the experiment & let the data speak for itself!
– Implemented in standard R libraries available online
8. An example model of multi-dimensional relevance
9. Experimental Design
• Define some search tasks
• Pick some documents to be judged
• Hypothesize some relevance dimensions
• Ask participants to answer some questions
• Analyze data via Structural Equation Modeling (SEM)
– Use Exploratory Factor Analysis (EFA) to assess question-
factor relationships, then prune “bad” questions
– Use Confirmatory Factor Analysis (CFA) to assess
correlations, test significance, & compare models
– Cousin to graphical models in statistics/AI
10. Collecting multi-dimensional relevance
judgments
• Participant picks one of several pre-defined topics
– You want to plan a one-week vacation in China
• Participant assigned a Web page to judge
– We wrote a query for each topic, submitted to a popular
search engine, and did stratified sampling of results
• Participant answers a set of Likert-scale questions
– I think the information in this page is incorrect
– It’s difficult to understand the information in this page
– …
11. What Questions might we ask?
• What factors do you think impact relevance…
• We hypothesize the same 5 factors as Xu & Chen ’06
– Topicality, reliability, novelty, understandability, & scope
– Choose same to make revised mechanics & any
difference in findings maximally clear
• Assume factors are incomplete & imperfect
– Positivist approach: do these factors explain
observed data better than other alternatives:
uni-dimensional relevance or another set of factors?
12. How do we ask the questions?
• Ask 3+ questions per hypothesized dimension
– Ask repeated, similar questions, & change polarity
– Randomize question order (don’t group questions)
– Over-generate questions to allow for later pruning
– Exclude participants failing self-consistency checks
• Usual stuff
– Use clear, familiar, non-leading wording
– Balance the Likert response scale
– Pre-test survey in-house, then pilot study online
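The polarity and self-consistency ideas above can be sketched in a few lines. The responses, tolerance, and question pairings below are invented for illustration:

```python
import numpy as np

# Hypothetical 7-point Likert responses: rows = participants, columns = questions.
# Suppose columns 1 and 3 are negatively worded ("reversed polarity") twins of
# columns 0 and 2; all names and thresholds here are illustrative.
responses = np.array([
    [7, 1, 6, 2],   # self-consistent: reversed items mirror their pairs
    [6, 2, 5, 3],   # self-consistent
    [7, 7, 6, 6],   # inconsistent: agrees with both polarities
])

SCALE_MAX = 7
reversed_items = [1, 3]          # columns asked with opposite polarity
paired_items = {1: 0, 3: 2}      # reversed column -> its positively worded twin

# Reverse-code the negatively worded items onto the positive scale.
recoded = responses.copy()
recoded[:, reversed_items] = SCALE_MAX + 1 - recoded[:, reversed_items]

# Flag participants whose paired answers disagree by more than a tolerance.
TOLERANCE = 2
keep = np.ones(len(recoded), dtype=bool)
for rev, pos in paired_items.items():
    keep &= np.abs(recoded[:, rev] - recoded[:, pos]) <= TOLERANCE

print(keep.tolist())  # [True, True, False] -- third participant fails the check
```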
13. Structural Equation Modeling (SEM)
• Based on Sewell Wright’s path analysis (1921)
– A factor model is parameterized by factor loadings,
covariances, & residual error terms
• Graphical representation: path diagram
– Observed variables in boxes
– Latent variables in ovals
– Directed edges denote
causal relationships
– Residual error terms
implicitly assumed
14. Exploratory Factor Analysis (EFA) – 1 of 2
• Is the sample large enough for EFA?
– Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy
– Bartlett’s Test of Sphericity
• Principal Axis Factoring (PAF) to find eigenvalues
– Assume some large, constant # of latent factors
– Assume each factor has a connecting edge to each question
– Estimate factor model parameters by least squares (or maximum likelihood, ML)
• Promax (oblique) rotation, allowing factors to correlate
• Prune factors via Parallel Analysis
– Create random data with same # of participants & questions
– Create correlation matrix and find eigenvalues
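Bartlett's test can be computed directly from the correlation matrix. A minimal numpy sketch, with invented toy data; a real analysis would also report a p-value from the chi-square distribution:

```python
import numpy as np

# Bartlett's test asks whether the correlation matrix differs from identity
# (if it doesn't, the items share no common variance and EFA is pointless).
# Statistic: -(n - 1 - (2p + 5)/6) * ln|R|, chi-square with p(p-1)/2 df.
def bartlett_sphericity(X):
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return statistic, df

# Toy data: 100 "participants", 4 correlated Likert-like items (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = base + 0.5 * rng.normal(size=(100, 4))

statistic, df = bartlett_sphericity(X)
print(round(statistic, 1), df)  # large statistic vs. chi-square(6) => reject sphericity
```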
15. Exploratory Factor Analysis (EFA) – 2 of 2
• Perform Parallel Analysis
– Create random data w/ same # of participants & questions
– Create correlation matrix and find eigenvalues
• Create Scree Plot of Eigenvalues
• Re-run EFA for reduced factors
• Compute Pearson correlations
• Discard questions with:
– Weak factor loading
– Strong cross-factor loading
– Lack of logical interpretation
• Kenny’s Rule: need >= 2 questions per factor for EFA
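A minimal sketch of parallel analysis in plain numpy (the talk's own implementation used standard R libraries); the toy one-factor data is invented:

```python
import numpy as np

# Parallel analysis: retain a factor only while its observed eigenvalue exceeds
# the corresponding eigenvalue from random data of the same shape (same number
# of participants and questions). A sketch; toolkits such as R's
# psych::fa.parallel average many simulations and can use reduced correlations.
def parallel_analysis(X, n_sims=100, quantile=0.95, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        Z = rng.normal(size=(n, p))
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    threshold = np.quantile(sims, quantile, axis=0)
    n_keep = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        n_keep += 1
    return n_keep

# Toy data: 200 "participants", 6 questions driven by one common factor.
rng = np.random.default_rng(1)
factor = rng.normal(size=(200, 1))
X = factor + 0.7 * rng.normal(size=(200, 6))

print(parallel_analysis(X))  # the single common factor should be retained
```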
17. CFA: Assess and Compare Models
• First-order baseline model uses a single
latent factor to explain observed data
• Posited hierarchical factor model
uses 5 relevance dimensions
18. Confirmatory Factor Analysis (CFA)
• Null model assumes observations are independent
– Covariances between questions fixed at 0; all means and
variances left free
• Comparison stats
– Non-Normed Fit Index (NNFI)
– Comparative Fit Index (CFI)
– Root Mean Square Error of Approximation (RMSEA)
– Standardized Root Mean-square Residual (SRMR)
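The comparison indices can all be derived from the model and null-model chi-square statistics. The numbers below are invented for illustration, and RMSEA here uses the common N−1 denominator:

```python
import math

# Standard formulas for incremental and absolute fit indices, computed from
# model chi-square (chi2_m, df_m), null-model chi-square (chi2_0, df_0), and
# sample size n. (SRMR needs the residual correlation matrix, so it is omitted.)
def fit_indices(chi2_m, df_m, chi2_0, df_0, n):
    cfi = 1 - max(chi2_m - df_m, 0) / max(chi2_0 - df_0, chi2_m - df_m, 1e-12)
    nnfi = (chi2_0 / df_0 - chi2_m / df_m) / (chi2_0 / df_0 - 1)
    rmsea = math.sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))
    return cfi, nnfi, rmsea

# Illustrative values only: a well-fitting model vs. a badly fitting null model.
cfi, nnfi, rmsea = fit_indices(chi2_m=180.0, df_m=160, chi2_0=2000.0, df_0=190, n=300)
print(round(cfi, 3), round(nnfi, 3), round(rmsea, 3))  # 0.989 0.987 0.02
```

Conventional cutoffs treat CFI/NNFI near or above .95 and RMSEA below .06 as good fit.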
19. Our model of multi-dimensional relevance
20. Future Directions
• More data-driven positivist research into factors
– Different user groups, search scenarios, devices, etc.
– Need more data to support normative claims
• Train/test operational systems for varying factors
– Identify/extend detected features for each dimension
– Personalize search results for individual preferences
• Improve judging agreement by making task more
natural and/or assessing impact of latent factors?
• Intra-subject vs. inter-subject aggregation?
– Other methods for ensuring subjective data quality?