Fabian Haak, Björn Engelmann
CIR @ CAIML31
Jul 2, 2024
Quantifying Subjective Phenomena
Simulation, Quantification, and Synthetic Training Data with LLMs
Fabian Haak, Björn Engelmann
TH Köln
This Presentation
● Introduction: Data and Subjectivity
● Research: How we use LLMs to generate synthetic training data
What if we do not have data or cannot easily label data?
THE DATA PROBLEM
Example: Spam Mail Detection
Data Model
Classification, Generation, …
Mails
● Private Communication
● Business Communication
● Automated Notifications
● Password resets
● Phishing
● “Spam”
● …
Features
● Text length
● Punctuation
● Sender mail address
● Orthographic correctness
● Lexical features
● …
Spam?
● Dichotomous Label
● Multi-Class Categorical Labels
● Probabilities
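A minimal sketch of how some of the listed features could be computed for a single mail. The function name `extract_features`, the toy vocabulary `KNOWN_WORDS` (standing in for a real spell checker), and the exact feature definitions are illustrative assumptions, not taken from the talk.

```python
import re

# Hypothetical toy vocabulary; a real system would use a proper dictionary.
KNOWN_WORDS = {"hello", "the", "funny", "video", "dinner", "you", "sent"}

def extract_features(sender: str, body: str) -> dict:
    """Compute a few surface features of a mail (illustrative definitions)."""
    # Lowercased alphabetic tokens (Unicode letters only).
    tokens = re.findall(r"[^\W\d_]+", body.lower())
    known = sum(1 for t in tokens if t in KNOWN_WORDS)
    return {
        "text_length": len(body),                                 # text length
        "exclamation_marks": body.count("!"),                     # punctuation
        "sender_domain": sender.rsplit("@", 1)[-1].lower(),       # sender mail address
        "known_word_ratio": known / len(tokens) if tokens else 0.0,               # orthographic correctness (proxy)
        "unique_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,  # lexical feature
    }

features = extract_features(
    "Schwiegermutter61@gmail.com",
    "Hello Fabian!!!! You sent the funny wideo",
)
```

Such a feature dictionary could feed a classifier that outputs any of the label types above (dichotomous, multi-class, or probabilities).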
To: fabian_haak@th-koeln.de ✉
From: Schwiegermutter61@gmail.com
Subject: Re: Re: Re: The funny teddy bear video
Hello Fabian!!!!
You still haven't replied to the funny teddy bear video
that I sent you yesterday.
Here is another one, isn't it super funny?
https://www.youtube.com/watch?v=abUxMmcXoNo
PS: You are coming over for dinner today, right??!
Kisses
To: fabian_haak@th-koeln.de ✉
From: Schwiegermutter61@gmail.com
Subject: Re: Re: Re: The funny teddy bear wideo
Hello Fabian!!!!
You still haven't replied to the funny teddy bear wideo
that I Sent you yesterday.
Here is another one, isn't it super funny?
https://www.youtube.com/watch?v=abUxMmcXoNo
PS: You are coming over for dinner today, right??!
Kisses
SPAM?
Private Communication?
Both?
Problem: SUBJECTIVITY
Subjective Phenomena
● Usually involves user-labeled data
● Judgment is highly subjective and personal
● Further: temporal and local differences
Bias, Text Simplicity, Relevance, Profanity, Sentiment, Morality, Beauty of Poem/Art, …
Data/Object
Model/Annotator
Modality
● Text
○ Sentence
○ Paragraph
○ Poem
● Table
● Image
● …
Subjective Quantification
User Context
● Demographics
● Personal experiences and morality
● Cultural/ethnic aspects
● Knowledge
● Language expertise
● …
Object properties
● Semantic aspects
● Linguistic aspects
● Image features
● …
Model Context
● Interaction language
● Prompt
● Model properties
● …
Phenomena
● Profanity
● Simplicity
● Beauty
● …
Subjective Definition
Data
Features
Model
What if something is missing?
BATS & ARTS
Evaluating Text Readability and Simplicity
CLEF 2023 SimpleText:
Simplification of scientific texts for non-experts.
Current Evaluation Challenges
● Limitations of current evaluation approaches
○ too shallow: Flesch-Kincaid and similar metrics evaluate only basic, surface-level readability
○ reference-based: SARI, BLEU, etc. assess similarity to a reference simplification that is assumed to be optimal
● Lack of Explainability
● Not taking domain & target audience into account
● Lack of Datasets
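To illustrate why surface metrics are limited, here is a minimal sketch of the standard Flesch-Kincaid grade level formula, 0.39 · (words / sentences) + 11.8 · (syllables / words) − 15.59. The syllable counter is a naive vowel-group heuristic and the sentence splitter is a simple regex; both are assumptions made for this example, and the input is assumed to be non-empty English text.

```python
import re

def count_syllables(word: str) -> int:
    """Naive heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level; assumes at least one sentence and one word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = flesch_kincaid_grade("The cat sat. The dog ran.")
complex_ = flesch_kincaid_grade("Institutional heterogeneity complicates generalization.")
```

The score depends only on word, sentence, and syllable counts, so it cannot take meaning, domain, or target audience into account.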
BATS
BenchmArking Text Simplicity:
How to evaluate simplicity?
From Features to Snorkel Labels
Features Interpretations
● few unique entities
● few long words
● few words per sentence
● few negations
● low depth of the
syntactic tree
● …
37 135 1249
Parametrizations
● few unique entities
○ entities/sentence
○ entities/text
○ entities/tokens
● few unique entities
○ entities/sentence
■ < 5
■ < 4
■ < 3
■ < 2
■ < 1
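The parametrizations above can be sketched as a family of threshold-based labeling functions in the style of Snorkel-like weak supervision, written here in plain Python without the Snorkel library. The label constants, the factory name `make_lf`, the choice to abstain (rather than vote "not simple") when the threshold is not met, and the assumption that entity counts are already available are all illustrative.

```python
SIMPLE, ABSTAIN = 1, -1  # label convention assumed for this sketch

def make_lf(threshold: float):
    """Build one labeling function for 'few unique entities':
    votes SIMPLE if entities/sentence < threshold, otherwise abstains."""
    def lf(n_unique_entities: int, n_sentences: int) -> int:
        ratio = n_unique_entities / n_sentences
        return SIMPLE if ratio < threshold else ABSTAIN
    lf.__name__ = f"lf_entities_per_sentence_lt_{threshold}"
    return lf

# One labeling function per threshold listed on the slide.
labeling_functions = [make_lf(t) for t in (5, 4, 3, 2, 1)]

# Hypothetical text with 5 unique entities across 2 sentences (ratio 2.5).
votes = [lf(5, 2) for lf in labeling_functions]
```

For the hypothetical text, the three loosest thresholds vote "simple" and the two strictest abstain, which shows how a single interpretation fans out into several parametrizations.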
Which dataset-, target audience-, and domain-specific
characteristics can be found regarding simplicity?
Available datasets are inconsistent
Data
Features
Model
RE: What if something is missing?
We need more, better datasets!
(faster & cheaper!)
ARTS
Assessing Readability & Text Simplicity
1. Overcoming subjectivity
2. LLM-labeled synthetic data
Hard:
How simple is this text on a scale from 0 to 100?
Easier:
Which of these two texts is simpler?
Generating Labeled Data with ARTS
1. Compare texts pairwise
2. Apply the Elo algorithm
3. Derive a ranking
4. Assign simplicity/readability scores
Elo Algorithm
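A minimal sketch of the four ARTS steps using the standard Elo update, where the expected score of A against B is 1 / (1 + 10^((R_B − R_A) / 400)). The start rating of 1000, the K-factor of 32, the min-max mapping to a 0 to 100 score, and the `toy_judge` function (a stand-in for the LLM's pairwise judgment that simply prefers the shorter text) are assumptions for illustration only.

```python
import itertools

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A wins against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return the updated ratings of A and B after one comparison."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

def toy_judge(a: str, b: str) -> bool:
    """Hypothetical stand-in for the pairwise judgment: shorter text = simpler."""
    return len(a) < len(b)

texts = [
    "The cat sleeps.",
    "The cat is sleeping on the warm sofa.",
    "The domesticated feline is presently resting upon the heated furniture.",
]

# Steps 1 and 2: pairwise comparisons, each followed by an Elo update.
ratings = {t: 1000.0 for t in texts}
for a, b in itertools.combinations(texts, 2):
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], toy_judge(a, b))

# Step 3: derive a ranking (simplest first).
ranking = sorted(texts, key=ratings.get, reverse=True)

# Step 4: map ratings to a 0-100 simplicity score (min-max scaling, assumed).
lo, hi = min(ratings.values()), max(ratings.values())
scores = {t: 100 * (ratings[t] - lo) / (hi - lo) for t in texts}
```

Asking only "which of these two is simpler?" sidesteps the hard absolute-scale question, while the Elo ratings still yield a graded score per text.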
Back to BATS…
ARTS3000
BATS
embeddings
Test: ARTS94
RF/GB Model
SYNTHETIC DATA
Idea:
Simulate user decisions with Large Language Models.
Example: User Aspect Simulation
Motoki et al. (2024). More human than human: measuring ChatGPT political bias.
https://doi.org/10.1007/s11127-023-01097-2
Small Study:
Simulating user groups with GPT personas
For 1000 German news headlines:
“How agreeable/biased/correct/offensive is the following news headline on a scale from 1 to 10? Answer as a typical CDU/SPD/Grüne/AFD/Linkspartei voter.”
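A minimal sketch of how such persona prompts could be generated, assuming one prompt per combination of headline, rating dimension, and party. The helper names `build_prompt` and `build_all_prompts`, the trailing "Headline:" line, and the one-request-per-combination setup are assumptions; the actual model call is left out.

```python
import itertools

DIMENSIONS = ["agreeable", "biased", "correct", "offensive"]
PARTIES = ["CDU", "SPD", "Grüne", "AFD", "Linkspartei"]

def build_prompt(headline: str, dimension: str, party: str) -> str:
    """Fill the prompt template from the slide for one persona and dimension."""
    return (
        f"How {dimension} is the following news headline on a scale from 1 to 10? "
        f"Answer as a typical {party} voter.\n"
        f"Headline: {headline}"
    )

def build_all_prompts(headlines: list[str]) -> list[dict]:
    """One prompt per (headline, dimension, party) combination."""
    return [
        {"headline": h, "dimension": d, "party": p, "prompt": build_prompt(h, d, p)}
        for h, d, p in itertools.product(headlines, DIMENSIONS, PARTIES)
    ]

prompts = build_all_prompts(["Apple liefert Ersatzteile nun an Kunden"])
```

Under this setup, each headline produces 4 × 5 = 20 prompts, so 1000 headlines would require 20,000 model responses.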
How agreeable is the news headline?
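Headline                                                                        CDU  SPD  Grüne  AFD  Linkspartei
Apple liefert Ersatzteile nun an Kunden (Apple now supplies spare parts to customers)  9    9    9      9    9
Verbietet den Test von Weltraumwummen! (Ban the testing of space guns!)          5    7    9      3    8
"Ungeimpfte gefährden uns alle" ("The unvaccinated endanger us all")             9    9    9      2    8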
Agreeability: average absolute difference
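A minimal sketch of an average absolute difference between personas, computed on the three example headlines from the agreeability table. How the study actually aggregated (which headlines, which pairs) is not shown on the slides, so the pairwise-per-party definition below is an assumption.

```python
from itertools import combinations

PARTIES = ["CDU", "SPD", "Grüne", "AFD", "Linkspartei"]

# Agreeability ratings for the three example headlines (one row per headline).
RATINGS = [
    [9, 9, 9, 9, 9],
    [5, 7, 9, 3, 8],
    [9, 9, 9, 2, 8],
]

def mean_abs_diff(i: int, j: int) -> float:
    """Average absolute rating difference between persona i and persona j."""
    return sum(abs(row[i] - row[j]) for row in RATINGS) / len(RATINGS)

pairwise = {
    (PARTIES[i], PARTIES[j]): mean_abs_diff(i, j)
    for i, j in combinations(range(len(PARTIES)), 2)
}

most_different = max(pairwise, key=pairwise.get)
```

On these three rows, the largest average gap is between the Grüne and AFD personas, while CDU and SPD differ only on the second headline.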
Limitation: Unintended Biases
Gupta et al. (2023). Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs.
http://arxiv.org/abs/2311.04892
Typical Workflow of Using LLMs for Synthetic Data
Synthetic Training Data
Small Model
Classification, Generation, …
Classification, Generation, …
LLM
LLM Personas
Unlabelled Data
Expert Knowledge
Labelled Data
When and Why Synthetic Data
● Rare Events and Edge Cases
● Unavailable Users/Data
● Subjective Phenomena
● Accelerated Development
● Cost
● But: Biases & Validation
Questions?