The search for relevant case law is a difficult task to automate, as the reasoning behind a relevance assessment does not translate into clear and scalable rules. The similarity of the circumstances, the legislation references, the reasoning underlying the legal qualification, and the argumentation path can all be used to evaluate relevance. The level of legal expertise required limits the availability of properly annotated data. Generalized Language Models tackle the problem of having little data available for machine learning; we will explore how they perform in this specific task, what limitations we have to consider, and how to move forward.
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
1. Legal Information Retrieval with Generalized Language
Models
Julien ROSSI, Evangelos KANOULAS
September 19, 2019
Big Data Expo
2. Who we are
Prof. Dr. Evangelos Kanoulas
Professor at Amsterdam Business School, UvA
Professor at Institute of Informatics, UvA
Researcher in Information Retrieval, Conversational Agents
Julien Rossi
Lecturer at Amsterdam Business School, UvA
PhD Candidate, Legal Text Analytics
MBA Big Data, ABS
MSc Computer Sciences
5. COLIEE 2019
COLIEE stands for Competition on Legal Information Extraction/Entailment. This
competition started in 2014, in collaboration between University of Alberta (Canada)
and National Institute of Informatics (Japan).
It is a testing ground for applying text analytics to legal documents and tasks.
We have 2 Research Questions:
• RQ1: Can we deal with long documents?
• RQ2: Can we improve retrieval with limited data?
6. Task 1 - Legal Case Retrieval Task
• Collection of Canadian Supreme Court Judgments
• Single Topic: Immigration & Citizenship
• Search for Relevant Documents within a collection
• Query is a Legal Case
• Noticed cases are relevant to the query case
• Relevance is binary, and not motivated
7. Task 1 - Legal Case Retrieval Task
• Labeled Dataset contains 285 query cases
• Each query case comes with a collection of 200 candidate cases
• In total 10000 unique documents
• Unlabeled Dataset contains 61 unknown query cases
• All cases involving Immigration and Citizenship
8. Task 1 - Legal Case Retrieval Task
[Figure: Cumulative Distribution of Document size; proportion of documents (y-axis) vs. number of tokens per document (x-axis, 0 to 10000)]
• Cumulative Distribution of number of tokens per document
• Up to 12000 tokens in a document
• Median around 2500 tokens
• We can address RQ1 and RQ2
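These document-length statistics can be computed with a few lines; the token counts below are illustrative, not the actual COLIEE numbers:

```python
# Sketch: given the number of tokens per document, compute the median and the
# fraction of documents that fit within BERT's 512-token limit.
# The token counts are illustrative, not the actual COLIEE statistics.

def cdf_at(token_counts, threshold):
    """Proportion of documents with at most `threshold` tokens."""
    return sum(1 for n in token_counts if n <= threshold) / len(token_counts)

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

token_counts = [300, 800, 1500, 2500, 3200, 5000, 9000, 12000]
print(median(token_counts))        # typical document length
print(cdf_at(token_counts, 512))   # share of documents BERT could ingest whole
```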
9. Task 3 - Statute Law Retrieval
• Search for Civil Code articles relevant to assess the validity of a legal statement
• Based on Japanese Bar Exam
• Query is a statement
• Relevant articles explain the point made in the query
• The legislation is the entire Japanese Civil Code, about 1000 articles in English
• The labeled Dataset contains 650 queries
• We can address RQ2
10. Evaluation
• We use Recall and Precision on the ranked list of retrieved documents
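A minimal sketch of these metrics over a ranked list with binary relevance, as in Task 1 (the document IDs are hypothetical):

```python
# Sketch of the evaluation: precision and recall over the top-k retrieved
# documents, given binary relevance labels (as in COLIEE Task 1).

def precision_at_k(ranked_ids, relevant_ids, k):
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc in retrieved if doc in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc in retrieved if doc in relevant_ids)
    return hits / len(relevant_ids)

ranked = ["d3", "d7", "d1", "d9", "d2"]      # ranker output, best first
relevant = {"d7", "d2", "d8"}                # gold "noticed" cases
print(precision_at_k(ranked, relevant, 5))   # 2 hits out of 5 retrieved -> 0.4
print(recall_at_k(ranked, relevant, 5))      # 2 of 3 relevant cases found
```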
12. Model, Workflow
• Binary classifier for pairwise relevance, trained on the labeled dataset
• Derive it into a ranker
• Predict relevance on the unlabeled dataset
• BERT implementation with google-bert [1], pytorch-transformers [2], fast-bert [3] and apex [4]
• LTR implementation with Tensorflow Ranking [5]
[1] https://github.com/google-research/bert
[2] https://github.com/huggingface/pytorch-transformers
[3] https://github.com/kaushaltrivedi/fast-bert
[4] https://github.com/NVIDIA/apex
[5] https://github.com/tensorflow/ranking
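The classifier-to-ranker step can be sketched as follows; `score_pair` is a hypothetical stand-in for the fine-tuned BERT classifier's relevance probability, replaced here by simple word overlap so the sketch is self-contained:

```python
# Sketch of the workflow: a pairwise relevance classifier scores each
# (query, candidate) pair, and sorting candidates by that score yields a
# ranker. `score_pair` is a placeholder for the BERT classifier's output.

def score_pair(query, candidate):
    # Placeholder scorer: word overlap instead of a BERT probability.
    q, c = set(query.split()), set(candidate.split())
    return len(q & c) / max(len(q | c), 1)

def rank_candidates(query, candidates):
    # Higher relevance score first.
    return sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)

query = "appeal of a deportation order"
candidates = ["tax assessment dispute",
              "appeal of a removal order in an immigration case",
              "deportation order appeal dismissed"]
print(rank_candidates(query, candidates)[0])  # the case sharing the most terms
```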
13. BERT: A Language Model
[Figure: masked language modeling; the input ”pursuant to article 41.3 , the [MASK] can defer (...)” produces one logit per vocabulary entry at the [MASK] position (e.g. 0.12, -0.3, 1.45, 0.001), scoring candidate tokens such as ”truck”, ”apple”, ”be”, ”defender”, ”beyond”, ”the”, ”dropped”]
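A toy illustration of the masked-language-model head sketched on this slide; the tiny vocabulary and logit values are illustrative only:

```python
import math

# BERT emits one logit per vocabulary entry at the [MASK] position; a softmax
# turns them into probabilities, and the highest-probability token is the
# model's guess. The tiny vocabulary and logits here are illustrative only.

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["truck", "apple", "be", "defender"]
logits = [0.12, -0.3, 1.45, 0.001]       # one score per vocabulary entry

probs = softmax(logits)
best = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
print(best)   # "be": it has the largest logit
```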
14. Pairwise Relevance Classifier
• Solve the sequence length limitation (512 WordPiece tokens) by summarizing the English part of the documents
• Summarization based on TextRank (Barrios et al., 2016), implemented in gensim [6]
• Fine-tuning of a BERT (Devlin et al., 2018) model, followed by an MLP
• This model is named Fine-Tuned in the results
[6] https://radimrehurek.com/gensim/
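A minimal sketch of the TextRank idea behind the gensim summarizer (this is an illustration of the approach, not gensim's actual implementation): rank sentences by a PageRank-style iteration over a word-overlap similarity graph, then keep the top sentences within a token budget so the summary fits BERT's input limit.

```python
# TextRank-style extractive summarization, sketched in pure Python.

def similarity(s1, s2):
    """Jaccard overlap between the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / max(len(w1 | w2), 1)

def textrank(sentences, damping=0.85, iterations=30):
    n = len(sentences)
    sim = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                out_weight = sum(sim[j])
                if sim[j][i] > 0 and out_weight > 0:
                    incoming += scores[j] * sim[j][i] / out_weight
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

def summarize(sentences, token_budget=512):
    """Greedily keep the best-ranked sentences that fit the token budget."""
    scores = textrank(sentences)
    by_rank = sorted(range(len(sentences)), key=lambda i: scores[i],
                     reverse=True)
    kept, used = [], 0
    for i in by_rank:
        length = len(sentences[i].split())
        if used + length <= token_budget:
            kept.append(i)
            used += length
    return [sentences[i] for i in sorted(kept)]  # restore document order

sentences = [
    "the court dismissed the appeal of the deportation order",
    "the weather was sunny",
    "the appeal of the order was dismissed by the court",
]
print(summarize(sentences, token_budget=10))
```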
16. Ranker with Learning to Rank
• Generate Features from the Fine-Tuned BERT
• Use these features as input to a Learning to Rank model
• These models are named LTR in the results
• Training material limited to 285 lists
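A hedged sketch of the pairwise learning-to-rank idea behind this stage (the idea only, not the Tensorflow Ranking API): learn a linear scorer over BERT-derived features so that relevant candidates score above non-relevant ones. The feature values are illustrative assumptions.

```python
import math

# Pairwise learning to rank with a logistic pairwise loss, sketched in pure
# Python. Feature tuples stand in for BERT-derived features; values are toy.

def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def train_pairwise(pairs, dims, lr=0.1, epochs=50):
    """pairs: list of (relevant_features, non_relevant_features) tuples."""
    weights = [0.0] * dims
    for _ in range(epochs):
        for pos, neg in pairs:
            margin = score(weights, pos) - score(weights, neg)
            # Gradient of the pairwise logistic loss log(1 + exp(-margin)).
            grad = -1.0 / (1.0 + math.exp(margin))
            for k in range(dims):
                weights[k] -= lr * grad * (pos[k] - neg[k])
    return weights

# Toy features per candidate, e.g. (BERT relevance probability, lexical score).
pairs = [([0.9, 1.2], [0.2, 0.5]),
         ([0.7, 0.8], [0.3, 1.0])]
w = train_pairwise(pairs, dims=2)
print(score(w, [0.9, 1.2]) > score(w, [0.2, 0.5]))  # relevant ranks higher
```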
18. In-Domain Pre-Training
• Starting from a pre-trained BERT model
• Running additional iterations of pre-training tasks
• Using in-domain texts
• Canadian Court Decisions for Task 1
• Japanese Supreme Court rulings in English for Task 3
• Pre-training only: 100k iterations, around 24 hours on 1 GPU
• This model is named Pre-Trained in the results
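One way the data preparation for such continued pre-training can look: long court decisions are split into fixed-size token windows that fit the pre-training sequence length. Window size and overlap are illustrative assumptions, not the settings used here.

```python
# Sketch: split a long in-domain document into overlapping fixed-size token
# windows for additional pre-training. Sizes below are illustrative.

def windows(tokens, size=128, stride=64):
    """Overlapping fixed-size token windows over one document."""
    out = []
    for start in range(0, max(len(tokens) - size, 0) + 1, stride):
        out.append(tokens[start:start + size])
    return out

decision = ("the applicant seeks judicial review of the decision "
            "refusing the claim for refugee protection").split()
chunks = windows(decision, size=8, stride=4)
print(len(chunks), chunks[0])   # 2 overlapping windows of 8 tokens
```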
23. Results
Back to our 2 Research Questions:
• RQ1: Can we deal with long documents?
• Summarized texts as input to a Neural Language Model allowed for retrieval performance on par with or higher than the baselines
• RQ2: Can we improve retrieval with limited data?
• Additional pre-training improves the performance of information retrieval for small datasets
• The uniformity of the legal language at hand (Court Decisions in English) allows for quick training
24. Critical Review of BERT
• BERT is a Language Model: it learns the language it is presented with during pre-training
• BERT is strong with syntactic and semantic tasks
• ”Open Sesame (...)”, Lin & al., June 2019
• ”What does BERT look at? (...)”, Clark & al., June 2019
• ”BERT rediscovers the Classical NLP Pipeline”, Tenney & al., May 2019
• It is well suited to tasks operating at the text level, less suited to tasks operating at higher levels of language understanding
• Pre-training on texts similar to those of the downstream task is proven to add language knowledge
25. Critical Review of BERT
• Ongoing discussion about Attention and Explanation
• ”Attention is not Explanation”, Jain and Wallace, May 2019
• ”Is Attention Interpretable?”, Serrano and Smith, June 2019
• ”Attention is not not Explanation”, Wiegreffe and Pinter, August 2019
• This is common to all systems based on Transformers: Open-AI GPT, GPT2,
Transformer XL, XLNet, XLM, etc.
• Going through attention weights with bertviz [7], it seems the model focuses more on word similarities than on semantics
[7] https://github.com/jessevig/bertviz
26. Critical Review of BERT
• The ”Clever Hans” [8] effect: mistaking surface correlations for deep knowledge
• ”Probing Neural Network (...)”, Niven and Kao, August 2019
• In the age of AI, ”Correlation is not causation” becomes ”Good results are not knowledge acquisition”
• Focus on the dataset’s diversity of texts for similar usage of knowledge
[8] https://thegradient.pub/nlps-clever-hans-moment-has-arrived/
27. Take-Home
• Stay aware of dataset’s limitations
• Get to know what the model actually learns vs. how it performs
• Learn through unsupervised Pre-Training
• Lots of orthogonal ways forward:
• More data for pre-training
• New pre-training tasks
• New Model architecture
• Different heuristic for summarization
• More annotation on relevance assessment
28. Work with us?
We are interested in collaborations with industry, in the framework of research projects.
We cover many domains:
• Information Extraction from Contracts
• Summarization of Legal Documents
• Information Retrieval, Search for Regulation
We need access to relevant data