This talk, "Quantifying Subjective Phenomena: Simulation, Detection, and Synthetic Training Data with LLMs", was given by Fabian Haak at the Cologne AI and Machine Learning Meetup on July 2, 2024 (https://www.meetup.com/de-DE/cologne-ai-and-machine-learning-meetup/events/299362587/).
Many aspects of media that are worth measuring are subjective. Is a text easy to read? Is a comment offensive? Is a document relevant? Is a decision morally just? The inherent problem with measuring these aspects is that different people judge them differently, depending on a range of often opaque item properties as well as on each person's experiences, attitudes, and prior knowledge. Nevertheless, effective means of measuring these aspects, and of creating training data representative of a typical user or a specific target audience, are essential for ensuring quality, fairness, and relevance in media consumption and production. We at CIR, the Cologne Information Retrieval Group at TH Köln, use a multi-stage approach that leverages large language models to quantify subjective phenomena and to construct synthetic training data that are on par with human-annotated datasets.