Roman Kyslyi: Methods for Avoiding Hallucinations in Large Language Models (UA)
AI & BigData Online Day 2023 Autumn
Website – www.aiconf.com.ua
Youtube – https://www.youtube.com/startuplviv
FB – https://www.facebook.com/aiconf
2. LLM (Large Language Models)
An LLM is a language model trained for next-token prediction
Input sequence (token embeddings) → Transformer → Softmax → predicted tokens
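A minimal sketch of this pipeline (an illustration only, assuming the Hugging Face transformers library and the public gpt2 checkpoint): the softmax over the last position's logits gives the next-token distribution.

```python
# Minimal sketch: next-token prediction with a small causal LM.
# Assumes the Hugging Face transformers library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The sky is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

# Softmax over the last position turns logits into next-token probabilities.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: p={p.item():.3f}")
```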
3. Hallucination
- Sentence contradiction:
- Prompt: What is the color of the sky?
Answer: The sky is green
- Prompt contradiction:
- Prompt: Write a positive review of the restaurant.
Answer: The food was terrible and
- Fact contradiction:
- Prompt: Who is Joe Biden?
Answer: Joe Biden was the first US president
- Contextual:
- Prompt: Tell me about Paris
Answer: Paris is the capital of France and the name of a singer
4. Why?
- Data quality
- Scraping large amounts of data brings in wrong facts and class imbalance
- Wrong generalization
- Generation method
- High probability of generic words vs low probability of specific words
- Insufficient context (“Can cats speak eng?”)
- Inaccurate generation parameters
- Generation length
- Temperature
- etc.
Source: https://arxiv.org/pdf/2212.10511.pdf
6. Two model paradigm
- Completion model (generates response)
- Probability model (response tokens with probability)
Ideally these should be the same model
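A hedged sketch of this paradigm, assuming the Hugging Face transformers library and using the same gpt2 checkpoint for both roles (the ideal case mentioned above); model names and sampling settings are placeholders.

```python
# Two-model paradigm sketch: a completion model generates the response,
# a probability model re-scores the response tokens. Here both are the same
# "gpt2" checkpoint (the ideal case); names and settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
completion_model = AutoModelForCausalLM.from_pretrained("gpt2")
probability_model = completion_model  # ideally the very same model

prompt_ids = tokenizer("Paris is", return_tensors="pt").input_ids
generated = completion_model.generate(
    prompt_ids, max_new_tokens=10, do_sample=True, top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)

# Re-run the full sequence and keep log-probabilities of the generated tokens.
with torch.no_grad():
    log_probs = torch.log_softmax(probability_model(generated).logits, dim=-1)

prompt_len = prompt_ids.shape[1]
response_ids = generated[0, prompt_len:]
token_log_probs = [
    log_probs[0, prompt_len + i - 1, tok].item()   # logits at position t predict token t+1
    for i, tok in enumerate(response_ids)
]
print(tokenizer.decode(response_ids), token_log_probs)
```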
7. Metrics
Datasets for evaluation: WikiBio, OpenAssistant Conversations Dataset
Probability metrics:
- LogProbs
- PPL-5 (approximate top 5 LogProbs)
Aggregation metrics:
- Min (LogProbs and PPL-5) over the text
- Average (LogProbs and PPL-5) over the text
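A small sketch of the aggregation step, assuming per-token log-probabilities (e.g., from the probability model above) are already available; the exact PPL-5 formula is an assumption here, read as a perplexity over the top-5 candidate log-probabilities per position.

```python
import math

# Aggregation over a text, given per-token log-probabilities.
def min_logprob(token_log_probs):
    # The single least likely token; a very low value flags a suspicious span.
    return min(token_log_probs)

def avg_logprob(token_log_probs):
    # Mean log-probability over the whole text.
    return sum(token_log_probs) / len(token_log_probs)

def ppl5(top5_log_probs_per_token):
    # Assumed reading of PPL-5: perplexity over the mean of the top-5
    # candidate log-probabilities at each position.
    per_position = [sum(t5) / len(t5) for t5 in top5_log_probs_per_token]
    return math.exp(-sum(per_position) / len(per_position))

# Example values only (not real model output).
token_log_probs = [-0.2, -1.5, -4.8, -0.7]
top5 = [[-0.2, -2.1, -3.0, -3.5, -4.0]] * 4
print(min_logprob(token_log_probs), avg_logprob(token_log_probs), ppl5(top5))
```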
8. How to reduce (simple)
- Precise prompt
- Instead of “What happened in 1964?” use “List the release dates of the Beatles albums”
- Temperature
- Higher values make outputs more random, while lower values make the outputs more deterministic
- Multi-shot prompting
- Use several prompts to give more context
- Ask the model to verify itself
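A hedged sketch combining these simple techniques (precise prompt, low temperature, self-verification). It assumes the openai Python SDK (v1 client) and a placeholder model name; adapt it to whichever LLM API you actually use.

```python
# Simple hallucination-reduction techniques in one place: precise prompt,
# low temperature, and a self-verification pass. Model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-3.5-turbo"  # placeholder model name

def ask(prompt: str, temperature: float = 0.0) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=temperature,  # low temperature -> more deterministic output
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Precise prompt instead of a vague one.
answer = ask("List the release dates of the Beatles studio albums.")

# Self-verification: ask the model to check its own answer against the question.
verdict = ask(
    "Question: List the release dates of the Beatles studio albums.\n"
    f"Answer: {answer}\n"
    "Check the answer for factual errors and reply with 'OK' or a list of corrections."
)
print(answer)
print(verdict)
```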
9. How to reduce (more complex)
Sample several responses to the same prompt and compute BLEU between all pairs:
- Measures quality of machine-generated translations.
- Compares machine translation to human reference translations.
- Calculates precision of n-grams (e.g., 1-4 word sequences).
- Higher BLEU score (0-100) = closer to human translation.
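A sketch of this consistency check, assuming NLTK is available; the sampled answers below are stand-ins for real model outputs to the same prompt.

```python
# Self-consistency check: sample several answers and compute BLEU between all
# pairs. Low pairwise BLEU means the answers disagree with each other, which
# often correlates with hallucination.
from itertools import combinations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

answers = [
    "Joe Biden is the 46th president of the United States.",
    "Joe Biden is the 46th US president, elected in 2020.",
    "Joe Biden was the first president of the United States.",
]
smooth = SmoothingFunction().method1
scores = [
    sentence_bleu([a.split()], b.split(), smoothing_function=smooth)
    for a, b in combinations(answers, 2)
]
avg_pairwise_bleu = sum(scores) / len(scores)
print(f"average pairwise BLEU: {avg_pairwise_bleu:.3f}")  # low -> inconsistent answers
```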
10. How to reduce (more complex): several requests with slightly different questions (NER + distances)
(https://crfm.stanford.edu/2023/03/13/alpaca.html)
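A hedged sketch of the idea, assuming spaCy with the en_core_web_sm model: extract named entities from each answer and measure how far the entity sets diverge (here via Jaccard distance); the answers are illustrative stand-ins.

```python
# Ask slightly different versions of the question, run NER over each answer,
# and compare the extracted entity sets. Large distances between the sets
# suggest the model is not answering consistently.
import spacy

nlp = spacy.load("en_core_web_sm")

answers = [
    "The Beatles released Abbey Road in 1969 in London.",
    "Abbey Road came out in 1969.",
    "The Beatles released Abbey Road in 1972 in Paris.",  # inconsistent entities
]

entity_sets = [{(ent.text.lower(), ent.label_) for ent in nlp(a).ents} for a in answers]

def jaccard_distance(s1, s2):
    union = s1 | s2
    return 1.0 if not union else 1 - len(s1 & s2) / len(union)

for i in range(len(entity_sets)):
    for j in range(i + 1, len(entity_sets)):
        print(i, j, round(jaccard_distance(entity_sets[i], entity_sets[j]), 2))
```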
11. How to reduce (more complex): check against a Knowledge Graph (with additional data)
https://github.com/neo4j/NaLLM
Reference: https://towardsdatascience.com/from-text-to-knowledge-the-information-extraction-pipeline-b65e7e30273e
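A hedged sketch of such a check, in the spirit of the NaLLM project linked above: look up a (subject, relation, object) triple extracted from the model's answer in a Neo4j knowledge graph. The connection details, node labels, and relationship type are assumptions about the graph schema.

```python
# Verify a generated fact against a Neo4j knowledge graph.
# Connection details, the :Entity label, and the relationship type are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def fact_exists(subject: str, relation: str, obj: str) -> bool:
    # Parameterised Cypher query over an assumed (:Entity)-[r]->(:Entity) schema.
    query = (
        "MATCH (s:Entity {name: $subject})-[r]->(o:Entity {name: $object}) "
        "WHERE type(r) = $relation RETURN count(r) > 0 AS exists"
    )
    with driver.session() as session:
        record = session.run(query, subject=subject, relation=relation, object=obj).single()
        return bool(record["exists"])

# Triple extracted from the model's answer (the extraction step is not shown here).
print(fact_exists("Paris", "CAPITAL_OF", "France"))
driver.close()
```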