The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track

This paper describes the participation of the uc3m team in both tasks of the TREC 2011 Crowdsourcing Track. For the first task we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality control mechanism at the task level based on a simple reading comprehension test. For the second task we also submitted three runs: one with a stepwise execution of the GetAnotherLabel algorithm, and two others with a rule-based and an SVM-based model. According to the NIST gold labels, our runs performed very well in both tasks, ranking at the top for most measures.

In a Nutshell
3 runs, Amazon Mechanical Turk, External HITs
One HIT for each set of 5 documents = 435 HITs (2175 judgments)
$0.20 per HIT = $0.04 per document
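The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the 10% commission rate (`FEE_RATE`) is an assumption that does not appear on the poster, chosen because it reproduces the reported total cost of $95.7 per run when exactly 435 HITs are paid.

```python
# Back-of-the-envelope check of the HIT design figures.
DOCS_PER_HIT = 5        # from the poster: one HIT per set of 5 documents
NUM_HITS = 435          # from the poster
PRICE_PER_HIT = 0.20    # from the poster, in USD
FEE_RATE = 0.10         # ASSUMPTION: platform commission, not stated on the poster

judgments = NUM_HITS * DOCS_PER_HIT
cost_per_document = PRICE_PER_HIT / DOCS_PER_HIT
base_cost = NUM_HITS * PRICE_PER_HIT
total_cost = base_cost * (1 + FEE_RATE)

print(f"judgments: {judgments}")
print(f"cost per document: ${cost_per_document:.2f}")
print(f"total cost with fees: ${total_cost:.1f}")
```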
Run 3
Stepwise execution of the GetAnotherLabel algorithm.
Hypothesis: workers who are bad for one type of topic are not necessarily bad for others.
For each worker w_i, compute the expected quality q_i on all topics and the quality q_ij on each topic type t_j. For topics in t_j, use only workers with q_ij > q_i.
Topic categorization: TREC category (closed, advice, navigational, etc.), topic subject (politics, shopping, etc.) and rarity of the topic words.
Runs 1 & 2
Train rule-based and SVM-based ML models. Features:
•Worker confusion matrix from GetAnotherLabel:
•For all workers, average posterior probability of relevant/nonrelevant
•For all workers, average correct-to-incorrect ratio when saying relevant or not
•For the document, relevant-to-nonrelevant ratio
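The worker selection rule of run 3 (keep, for each topic type, only the workers whose quality on that type exceeds their overall quality) can be sketched as follows. The quality estimates would come from GetAnotherLabel; here they are passed in as plain dictionaries, and the function name, data layout and example values are hypothetical.

```python
def select_workers(overall_quality, type_quality, topic_type):
    """Return the workers w_i with q_ij > q_i for the given topic type t_j.

    overall_quality: dict worker -> q_i (expected quality on all topics)
    type_quality:    dict (worker, topic_type) -> q_ij
    Workers with no estimate for the topic type are left out (an assumption).
    """
    selected = set()
    for worker, q_i in overall_quality.items():
        q_ij = type_quality.get((worker, topic_type))
        if q_ij is not None and q_ij > q_i:
            selected.add(worker)
    return selected


# Hypothetical quality estimates for two workers and two topic types.
overall = {"w1": 0.7, "w2": 0.6}
per_type = {
    ("w1", "navigational"): 0.8,
    ("w2", "navigational"): 0.5,
    ("w2", "advice"): 0.9,
}
print(select_workers(overall, per_type, "navigational"))
print(select_workers(overall, per_type, "advice"))
```

Labels for topics of a given type would then be aggregated using only the selected workers.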
Julián Urbano, Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns
Gaithersburg, USA, November 16th, 2011
                                      run 1       run 2        run 3
Hours to complete                     8.5         38           20.5
HITs submitted (overhead)             438 (+1%)   535 (+23%)   448 (+3%)
Submitted workers (just previewers)   29 (102)    83 (383)     30 (163)
Average documents per worker          76          32           75
Total cost (including fees)           $95.7       $95.7        $95.7
much better control of the whole process
fair for most workers (previous trials)
2. Display Modes
•With images
•Black & white, same layout but no images
Topic key terms (run 3)
3. Task focus: keywords (runs 1 & 2) or relevance (run 3)
4. Tabbed design
5. Quality Control
Worker Level
50 HITs at most, at least 100 approved and 95% approval (98% in run 3)
Implicit Task Level: Work Time
At least 4.5 s/document (preview+work)
Explicit Task Level: Comprehension
Which set of keywords better describes the document?
•Correct: top 3 by TF + 2 from next 5
•Incorrect: 5 random in last 25
some folks work while previewing
subjects always recognize top 1-2 by TF
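The construction of the two keyword sets for the comprehension question can be sketched as below. This is a minimal illustration, not the team's implementation: tokenization and stopword handling are left out, "last 25" is read as the 25 lowest-TF terms of the document, and the function name and the minimum-vocabulary check are assumptions.

```python
import random
from collections import Counter


def keyword_options(tokens, rng=None):
    """Build the correct and incorrect keyword sets for one document.

    Correct:   top 3 terms by term frequency (TF) + 2 sampled from the next 5.
    Incorrect: 5 terms sampled from the 25 lowest-TF terms (ASSUMED reading
               of "last 25").
    """
    rng = rng or random.Random()
    # Terms ranked by descending TF; ties keep first-seen order.
    ranked = [term for term, _ in Counter(tokens).most_common()]
    if len(ranked) < 33:  # 8 head terms + 25 tail terms, so the sets cannot overlap
        raise ValueError("document has too few distinct terms")
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    incorrect = rng.sample(ranked[-25:], 5)
    return correct, incorrect


# Synthetic document: term t0 is the most frequent, t39 the least.
tokens = [f"t{i}" for i in range(40) for _ in range(50 - i)]
correct, incorrect = keyword_options(tokens, random.Random(0))
print("correct:", correct)
print("incorrect:", incorrect)
```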
Rejecting & Blocking
Action            Failure   run 1     run 2       run 3
Reject            Keyword   1         0           1
Reject            Time      2         1           1
Block             Keyword   1         1           1
Block             Time      2         1           1
HITs rejected               3 (1%)    100 (23%)   13 (3%)
Workers blocked             0 (0%)    40 (48%)    4 (13%)
7. Relevance Labels
Binary
•run 1: bad = 0, fair or good = 1
•runs 2 & 3: normalize slider range to [0, 1]; if value > 0.4 then 1, else 0
Ranking
•run 1: order by relevance, then by failures in keywords and then by time spent
•runs 2 & 3: explicit in sliders
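The label derivation rules above can be sketched in a few functions. Several details are assumptions made only for illustration: the raw slider range passed as `lo` and `hi`, the numeric coding of the 3-point scale, and the sort directions for keyword failures (fewer first) and time spent (longer first), none of which are stated on the poster.

```python
GRADES = {"bad": 0, "fair": 1, "good": 2}  # ASSUMED numeric coding of the 3-point scale


def binary_label_run1(grade):
    """run 1: bad = 0, fair or good = 1."""
    return 0 if GRADES[grade] == 0 else 1


def binary_label_slider(value, lo, hi, threshold=0.4):
    """runs 2 & 3: normalize the slider value to [0, 1], then threshold at 0.4."""
    normalized = (value - lo) / (hi - lo)
    return 1 if normalized > threshold else 0


def rank_run1(docs):
    """run 1: order by relevance, then keyword failures, then time spent.

    ASSUMED directions: higher relevance first, fewer keyword failures first,
    more time spent first.
    """
    return sorted(
        docs,
        key=lambda d: (-GRADES[d["grade"]], d["keyword_failures"], -d["seconds"]),
    )


# Hypothetical judgments for three documents.
docs = [
    {"id": "a", "grade": "fair", "keyword_failures": 1, "seconds": 30},
    {"id": "b", "grade": "good", "keyword_failures": 0, "seconds": 10},
    {"id": "c", "grade": "fair", "keyword_failures": 0, "seconds": 20},
]
print([d["id"] for d in rank_run1(docs)])
```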
Task I
Task II
         Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median   .623   .729   .773   .536   .931   .922
run 1    .748   .802   .841   .632   .922   .958
run 2    .690   .720   .821   .607   .889   .935
run 3    .731   .737   .857   .728   .894   .932
         Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median   .640   .754   .625   .560   .111   .359
run 1    .699   .754   .679   .644   .166   .415
run 2    .714   .750   .700   .678   .082   .331
run 3    .571   .659   .560   .484   .060   .299
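For reference, the four classification measures in these tables (Acc., Rec., Prec., Spec.) follow the standard confusion-matrix definitions for binary labels, sketched below. This is a generic illustration rather than the official Track evaluation script; AP and NDCG are ranking measures and are not covered here. The example labels are made up.

```python
def binary_metrics(gold, predicted):
    """Accuracy, recall, precision and specificity for binary (0/1) labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)  # true negatives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Undefined ratios are reported as NaN instead of raising.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up gold and crowd labels for five documents.
gold = [1, 1, 1, 0, 0]
crowd = [1, 1, 0, 0, 1]
metrics = binary_metrics(gold, crowd)
print(metrics)
```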
according to Wordnet
unbiased majority voting
1. Document Preprocessing
Cleanup for smooth loading and safe rendering: remove everything unrelated to style or layout
6. Relevance: run 1, run 2, run 3
* Unofficial, as per NIST gold labels
