Legal Information Retrieval with Generalized Language Models
Julien ROSSI, Evangelos KANOULAS
September 19, 2019
Big Data Expo
Who we are
Prof. Dr. Evangelos Kanoulas
Professor at Amsterdam Business School, UvA
Professor at Institute of Informatics, UvA
Researcher in Information Retrieval, Conversational Agents
Julien Rossi
Lecturer at Amsterdam Business School, UvA
PhD Candidate, Legal Text Analytics
MBA Big Data, ABS
MSc Computer Sciences
Agenda
• Task, Dataset, Evaluation
• Model, Setup, Results
• Lessons Learned
Task, Dataset, Evaluation
COLIEE 2019
COLIEE stands for Competition on Legal Information Extraction/Entailment. This
competition started in 2014, in collaboration between University of Alberta (Canada)
and National Institute of Informatics (Japan).
It is a testing ground for applying text analytics to legal documents and tasks.
We have 2 Research Questions:
• RQ1: Can we deal with long documents?
• RQ2: Can we improve retrieval with limited data?
Task 1 - Legal Case Retrieval Task
• Collection of Canadian Supreme Court Judgments
• Single Topic: Immigration & Citizenship
• Search for Relevant Documents within a collection
• Query is a Legal Case
• Noticed cases are relevant to the query case
• Relevance is binary, and no justification is provided
• Labeled Dataset contains 285 query cases
• Each query case comes with a collection of 200 candidate cases
• In total 10000 unique documents
• Unlabeled Dataset contains 61 unknown query cases
• All cases involving Immigration and Citizenship
[Figure: Cumulative Distribution of Document size. x-axis: Number of Tokens per Document (0 to 10000); y-axis: Proportion of Documents (0.0 to 1.0)]
• Cumulative Distribution of number of
tokens per document
• Up to 12000 tokens in a document
• Median around 2500 tokens
• We can address RQ1 and RQ2
Task 3 - Statute Law Retrieval
• Search for Civil Code articles relevant to assess the validity of a legal statement
• Based on Japanese Bar Exam
• Query is a statement
• Relevant articles explain the point made in the query
• The legislation is the entire Japanese Civil Code, about 1000 articles in English
• The labeled Dataset contains 650 queries
• We can address RQ2
Evaluation
• We use Recall and Precision on the ranked list of retrieved documents
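As a minimal sketch of these metrics (not the official COLIEE evaluation script), the functions below compute Precision@k, Recall@k and Average Precision for one ranked list. MAP, reported in the results tables, is the mean of Average Precision over all queries. The document identifiers are made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the top-k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values taken at the rank of each relevant document."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


# Hypothetical ranked list and relevance judgments for one query case.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 2))     # 0.5
print(average_precision(ranked, relevant))  # 0.5
```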
Model, Setup, Results
Model, Workflow
• Binary classifier for pairwise relevance trained on labeled dataset
• Derive a ranker from it
• Predict relevance on unlabeled dataset
• BERT Implementation with google-bert [1], pytorch-transformers [2],
fast-bert [3] and apex [4]
• LTR Implementation with TensorFlow Ranking [5]
[1] https://github.com/google-research/bert
[2] https://github.com/huggingface/pytorch-transformers
[3] https://github.com/kaushaltrivedi/fast-bert
[4] https://github.com/NVIDIA/apex
[5] https://github.com/tensorflow/ranking
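One simple way to turn a pairwise relevance classifier into a ranker is to score every (query, candidate) pair and sort the candidates by score. The sketch below assumes this approach; `overlap_score` is a hypothetical stand-in for the classifier's predicted probability of relevance, used only so that the example runs without a trained model.

```python
def rank_candidates(query, candidates, score_fn):
    """Score each (query, candidate) pair and sort by decreasing score."""
    scored = [(cand_id, score_fn(query, text)) for cand_id, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap_score(query, text):
    """Hypothetical scorer: share of query words that also occur in the candidate."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Made-up candidate texts, keyed by a made-up identifier.
candidates = {
    "a": "visa refused by officer",
    "b": "contract law dispute",
    "c": "visa officer decision review",
}
ranking = rank_candidates("officer refused visa", candidates, overlap_score)
print([cand_id for cand_id, _ in ranking])  # ['a', 'c', 'b']
```

In the setup described here, `score_fn` would be the fine-tuned BERT classifier applied to the pair of summaries.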
BERT: A Language Model
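[Figure: masked language modeling example. Input tokens: "pursuant to article 41.3 , the [MASK] can defer (...)". Values shown for the [MASK] position: 0.12, -0.3, 1.45, 0.001, (...). Candidate tokens shown: truck, apple, be, defender, beyond, the, dropped, (...)]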
Pairwise Relevance Classifier
• Solve the sequence length limitation (512 WordPiece tokens) by summarizing the
English part of documents
• Summarization based on TextRank (Barrios et al., 2016), implemented in gensim [6]
• Fine-Tuning of a BERT (Devlin et al., 2018) model followed by an MLP
• This model is named Fine-Tuned in results
[6] https://radimrehurek.com/gensim/
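To illustrate the summarization step, here is a simplified, pure-Python sketch of TextRank-style extractive summarization under a word budget. It is not the gensim implementation: the sentence splitter, the word-overlap similarity and the `max_words` parameter are assumptions made for this example.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter on '.', '!' or '?' followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def similarity(a, b):
    """Word overlap normalized by log sentence sizes, in the spirit of TextRank."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator, since log(1) == 0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph (power iteration)."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(text, max_words=100):
    """Keep the highest-scoring sentences that fit the budget, in original order."""
    sentences = split_sentences(text)
    scores = textrank_scores(sentences)
    by_score = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in by_score:
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    return " ".join(sentences[i] for i in sorted(chosen))
```

The budget here is counted in whitespace-separated words. To respect the 512 WordPiece limit for a (query, candidate) pair, the budget would have to be set conservatively, because WordPiece usually produces more tokens than words.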
Pairwise Relevance Classifier with Summarization
[Figure: candidate documents ranked by classifier score: 1. 0.94, 2. 0.91, 3. ...]
Ranker with Learning to Rank
• Generate Features from the Fine-Tuned BERT
• Use these features as input to a Learning to Rank model
• These models are named LTR in results
• Training material limited to 285 lists
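The slides do not say which loss the LTR models were trained with. As one illustration of a listwise objective, the sketch below computes a softmax cross-entropy loss over a single candidate list. The scores and labels are made up, and the function is a plain-Python illustration rather than the TensorFlow Ranking API.

```python
import math


def softmax_cross_entropy(scores, labels):
    """Listwise loss for one query: -sum(label * log_softmax(score))."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return -sum(label * (score - log_norm) for score, label in zip(scores, labels))


labels = [1, 0, 0]  # only the first candidate is relevant
good = softmax_cross_entropy([2.0, 0.0, 0.0], labels)  # relevant candidate scored highest
bad = softmax_cross_entropy([0.0, 2.0, 0.0], labels)   # a non-relevant candidate scored highest
print(good < bad)  # True
```

Each of the 285 labeled query cases contributes one such list, which is why the amount of training material is small for this stage.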
[Figure: candidate documents ranked by the LTR model: 1. 0.21, 2. 0.14, 3. ...]
In-Domain Pre-Training
• Starting from a pre-trained BERT model
• Running additional iterations of pre-training tasks
• Using in-domain texts
  • Canadian Court Decisions for Task 1
  • Japanese Supreme Court rulings in English for Task 3
• For pre-training only, 100k iterations, around 24 hours on 1 GPU
• This model is named Pre-Trained in results
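BERT's pre-training tasks are masked language modeling and next sentence prediction. The sketch below shows how masked language modeling examples can be built from in-domain text, following the masking rule published with BERT (about 15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). It is a simplified illustration, not the authors' pipeline: tokens are selected independently, and the token list and vocabulary are made up.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")           # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            masked.append(token)
    return masked, labels


# Hypothetical in-domain sentence, already split into tokens.
tokens = ["the", "officer", "refused", "the", "visa", "application"]
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

A higher `mask_prob` is used in the example only so that a six-token sentence is likely to contain a masked position.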
[Figure: two training stages, Pre-Training followed by Fine-Tuning]
Results - Case Law Retrieval
System          R@10  P@10  R@1   P@1   P@R   MAP
BM25 Summaries  0.14  0.07  0.02  0.11  0.11  0.14
BM25            0.76  0.36  0.32  0.70  0.68  0.73
PERFECT         0.97  0.50  0.41  1.00  1.00  1.00
Fine-Tuned      0.75  0.34  0.31  0.80  0.63  0.70
LTR 01          0.74  0.34  0.32  0.85  0.63  0.69
LTR 02          0.75  0.35  0.32  0.84  0.65  0.70
Pre-Trained     0.81  0.39  0.34  0.90  0.73  0.79
Table 1: Summary of metrics for all systems
Results - Statute Law Retrieval
System               R@5     R@10    R@30    MAP     P@1
UB3 - COLIEE Winner  0.7978  0.8539  0.9551  0.7988  n/a
Fine-Tuned           0.9010  0.9203  0.9686  0.8246  0.7971
Pre-Trained          0.8913  0.9130  0.9444  0.8321  0.8261
Table 2: Summary of metrics for all systems
Lessons Learned
Results
Back to our 2 Research Questions:
• RQ1: Can we deal with long documents?
  • Summarized texts as input to a Neural Language Model allowed for retrieval
  performance on par with or higher than baselines
• RQ2: Can we improve retrieval with limited data?
  • Additional pre-training improves the performance of information retrieval on small
  datasets
  • The uniformity of the legal language at hand (Court Decisions in English) allows for
  quick training
Critical Review of BERT
• BERT is a Language Model: it learns the language presented at pre-training
• BERT is strong with syntactic and semantic tasks
  • "Open Sesame (...)", Lin et al., June 2019
  • "What does BERT look at? (...)", Clark et al., June 2019
  • "BERT rediscovers the Classical NLP Pipeline", Tenney et al., May 2019
• It is adapted to tasks operating at text level, less adapted to tasks operating at
higher levels of language understanding
• Pre-Training on texts similar to those of the downstream task has been shown to add
language knowledge
• Ongoing discussion about Attention and Explanation
  • "Attention is not Explanation", Jain and Wallace, May 2019
  • "Is Attention Interpretable?", Serrano and Smith, June 2019
  • "Attention is not not Explanation", Wiegreffe and Pinter, August 2019
• This is common to all systems based on Transformers: OpenAI GPT, GPT-2,
Transformer-XL, XLNet, XLM, etc.
• Going through attention weights with bertviz [7], it seems the model focuses more
on word similarities than on semantics
[7] https://github.com/jessevig/bertviz
• The "Clever Hans" [8] effect: mistaking surface correlations for deep knowledge
  • "Probing Neural Network (...)", Niven and Kao, August 2019
• In the age of AI, "Correlation is not causation" becomes "Good results are not
knowledge acquisition"
• Focus on dataset’s diversity of text for similar usage of knowledge
[8] https://thegradient.pub/nlps-clever-hans-moment-has-arrived/
Take-Home
• Stay aware of dataset’s limitations
• Get to know what the model actually learns vs. how it performs
• Learn through unsupervised Pre-Training
• Lots of orthogonal ways forward:
  • More data for pre-training
  • New pre-training tasks
  • New Model architecture
  • Different heuristic for summarization
  • More annotation on relevance assessment
Work with us?
We are interested in collaborations with the Industry, in the framework of Research
Projects.
We cover many domains:
• Information Extraction from Contracts
• Summarization of Legal Documents
• Information Retrieval, Search for Regulation
We need access to relevant Data
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019

  • 1. Legal Information Retrieval with Generalized Language Models Julien ROSSI, Evangelos KANOULAS September 19, 2019 Big Data Expo
  • 2. Who we are Prof. Dr. Evangelos Kanoulas Professor at Amsterdam Business School, UvA Professor at Institute of Informatics, UvA Researcher in Information Retrieval, Conversational Agents Julien Rossi Lecturer at Amsterdam Business School, UvA PhD Candidate, Legal Text Analytics MBA Big Data, ABS MSc Computer Sciences 1
  • 3. Agenda • Task, Dataset, Evaluation • Model, Setup, Results • Lessons Learned 2
  • 5. COLIEE 2019 COLIEE stands for Competition on Legal Information Extraction/Entailment. This competition started in 2014, in collaboration between University of Alberta (Canada) and National Institute of Informatics (Japan). It is a testing ground for applying text analytics to legal documents and tasks. We have 2 Research Questions: • RQ1: Can we deal with long documents ? • RQ2: Can we improve retrieval with limited data ? 3
  • 6. Task 1 - Legal Case Retrieval Task • Collection of Canadian Supreme Court Judgments • Single Topic: Immigration & Citizenship • Search for Relevant Documents within a collection • Query is a Legal Case • Noticed cases are relevant to the query case • Relevance is binary, and not motivated 4
  • 7. Task 1 - Legal Case Retrieval Task • Labeled Dataset contains 285 query cases • Each query case comes with a collection of 200 candidate cases • In total 10000 unique documents • Unlabeled Dataset contains 61 unknown query cases • All cases involving Immigration and Citizenship 5
  • 8. Task 1 - Legal Case Retrieval Task 0 2000 4000 6000 8000 10000 Number of Tokens per Document 0.0 0.5 1.0 ProportionofDocuments Cumulative Distribution of Document size • Cumulative Distribution of number of tokens per document • Up to 12000 tokens in a document • Median around 2500 tokens • We can address RQ1 and RQ2 6
  • 9. Task 3 - Statute Law Retrieval • Search for Civil Code articles relevant to assess the validity of a legal statement • Based on Japanese Bar Exam • Query is a statement • Relevant articles explain the point made in the query • The legislation is the entire Japanese Civil Code, about 1000 articles in English • The labeled Dataset contains 650 queries • We can address RQ2 7
  • 10. Evaluation • We use Recall and Precision on the ranked list of retrieved documents 8
  • 12. Model, Workflow • Binary classifier for pairwise relevance trained on labeled dataset • Derivate it into a ranker • Predict relevance on unlabeled dataset • BERT Implementation with google-bert1, pytorch-transformers2, fast-bert3 and apex4 • LTR Implementation with Tensorflow Ranking5 1 https://github.com/google-research/bert 2 https://github.com/huggingface/pytorch-transformers 3 https://github.com/kaushaltrivedi/fast-bert 4 https://github.com/NVIDIA/apex 5 https://github.com/tensorflow/ranking 9
  • 13. BERT: A Language Model pursuant to article 41.3 , the [MASK] can defer (...) [MASK] 0.12 -0.3 1.45 0.001 (...) truck apple be defender beyond the dropped (...) 10
  • 14. Pairwise Relevance Classifier • Solve the sequence length limitation (512 WordPiece tokens) by summarizing the English part of documents • Summarization based on TextRank (Barrios & al, 2016), implemented in gensim6 • Fine-Tuning of a BERT (Devlin & al, 2018) model followed by MLP • This model is named Fine-Tuned in results 6 https://radimrehurek.com/gensim/ 11
  • 15. Pairwise Relevance Classifier with Summarization
      [Figure: pipeline diagram; the output is a ranked list with scores: 1. 0.94, 2. 0.91, 3. ...]
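A rough sketch of how a pairwise relevance classifier can be turned into a ranked list: each (query, candidate) pair is packed into the standard BERT sentence-pair layout within the 512-token limit, scored, and the candidates are sorted by score. The even split of the token budget and the `toy_score` function are assumptions for illustration only; in the slides the score comes from the fine-tuned BERT model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

MAX_LEN = 512  # BERT sequence length limit in WordPiece tokens


def build_pair(query_tokens: Sequence[str], cand_tokens: Sequence[str],
               max_len: int = MAX_LEN) -> Tuple[List[str], List[int]]:
    """Pack a pair as [CLS] query [SEP] candidate [SEP] with segment ids."""
    budget = max_len - 3  # room left after [CLS] and the two [SEP] tokens
    # Assumption: give the query at most half the budget, the candidate the rest.
    q = list(query_tokens[: budget // 2])
    c = list(cand_tokens[: budget - len(q)])
    tokens = ["[CLS]"] + q + ["[SEP]"] + c + ["[SEP]"]
    segment_ids = [0] * (len(q) + 2) + [1] * (len(c) + 1)
    return tokens, segment_ids


def toy_score(tokens: Sequence[str], segment_ids: Sequence[int]) -> float:
    # Placeholder for the classifier: share of candidate tokens also in the query.
    query = {t for t, s in zip(tokens, segment_ids)
             if s == 0 and t not in ("[CLS]", "[SEP]")}
    cand = [t for t, s in zip(tokens, segment_ids) if s == 1 and t != "[SEP]"]
    return sum(t in query for t in cand) / len(cand) if cand else 0.0


def rank_candidates(query_tokens: Sequence[str],
                    candidates: Dict[str, Sequence[str]],
                    score_fn: Callable[[Sequence[str], Sequence[int]], float]
                    ) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort by descending score."""
    scored = []
    for cand_id, cand_tokens in candidates.items():
        tokens, segments = build_pair(query_tokens, cand_tokens)
        scored.append((cand_id, score_fn(tokens, segments)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical, already summarized and tokenized documents.
query = ["visa", "application", "refused"]
candidates = {"case_a": ["visa", "refused"], "case_b": ["truck", "apple"]}
for position, (case_id, score) in enumerate(
        rank_candidates(query, candidates, toy_score), start=1):
    print(position, case_id, round(score, 2))
```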
  • 16. Ranker with Learning to Rank
      • Generate Features from the Fine-Tuned BERT
      • Use these features as input to a Learning to Rank model
      • These models are named LTR in results
      • Training material limited to 285 lists
  • 17. Ranker with Learning to Rank
      [Figure: pipeline diagram; the LTR model outputs a ranked list with scores: 1. 0.21, 2. 0.14, 3. ...]
  • 18. In-Domain Pre-Training
      • Starting from a pre-trained BERT model
      • Running additional iterations of pre-training tasks
      • Using in-domain texts
          • Canadian Court Decisions for the Task 1
          • Japanese Supreme Court rulings in English, for the Task 3
      • For pre-training only, 100k iterations, around 24 hours on 1 GPU
      • This model is named Pre-Trained in results
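The slides do not detail the pre-training tasks. As an illustration of one of them, the sketch below applies the masked language model corruption described by Devlin et al. (2018): about 15% of tokens are selected, and a selected token is replaced by [MASK] 80% of the time, by a random vocabulary token 10% of the time, and left unchanged 10% of the time. Selecting each token independently is a simplification, and the function name and toy vocabulary are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens: Sequence[str], vocab: Sequence[str],
                mask_prob: float = 0.15, seed: Optional[int] = None
                ) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Example sentence from the slides, with a toy vocabulary.
sentence = "[CLS] pursuant to article 41.3 , the officer can defer [SEP]".split()
vocab = ["truck", "apple", "be", "defender", "beyond", "the", "dropped"]
corrupted, targets = mask_tokens(sentence, vocab, seed=0)
print(corrupted)
print(targets)
```

Running such corruption over in-domain texts and continuing to train the pre-trained model on the result is the general idea of additional pre-training; the actual scripts used by the authors are not shown in the source.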
  • 20. Results - Case Law Retrieval
      Table 1: Summary of metrics for all systems

      System           R@10   P@10   R@1    P@1    P@R    MAP
      BM25 Summaries   0.14   0.07   0.02   0.11   0.11   0.14
      BM25             0.76   0.36   0.32   0.70   0.68   0.73
      PERFECT          0.97   0.50   0.41   1.00   1.00   1.00
      Fine-Tuned       0.75   0.34   0.31   0.80   0.63   0.70
      LTR 01           0.74   0.34   0.32   0.85   0.63   0.69
      LTR 02           0.75   0.35   0.32   0.84   0.65   0.70
      Pre-Trained      0.81   0.39   0.34   0.90   0.73   0.79
  • 21. Results - Statute Law Retrieval
      Table 2: Summary of metrics for all systems

      System                R@5      R@10     R@30     MAP      P@1
      UB3 - COLIEE Winner   0.7978   0.8539   0.9551   0.7988   n/a
      Fine-Tuned            0.9010   0.9203   0.9686   0.8246   0.7971
      Pre-Trained           0.8913   0.9130   0.9444   0.8321   0.8261
  • 23. Results
      Back to our 2 Research Questions:
      • RQ1: Can we deal with long documents?
          • Using summarized texts as input to the neural language model gave retrieval performance on par with or better than the baselines
      • RQ2: Can we improve retrieval with limited data?
          • Additional pre-training improves information retrieval performance for small datasets
          • The uniformity of the legal language at hand (court decisions in English) allows for quick training
  • 24. Critical Review of BERT
      • BERT is a Language Model: it learns the language presented at pre-training
      • BERT is strong on syntactic and semantic tasks
          • "Open Sesame (...)", Lin et al., June 2019
          • "What does BERT look at? (...)", Clark et al., June 2019
          • "BERT rediscovers the Classical NLP Pipeline", Tenney et al., May 2019
      • It is well suited to tasks operating at text level, less suited to tasks operating at higher levels of language understanding
      • Pre-training on texts similar to those of the downstream task is proven to add language knowledge
  • 25. Critical Review of BERT
      • Ongoing discussion about Attention and Explanation
          • "Attention is not Explanation", Jain and Wallace, May 2019
          • "Is Attention Interpretable?", Serrano and Smith, June 2019
          • "Attention is not not Explanation", Wiegreffe and Pinter, August 2019
      • This is common to all systems based on Transformers: OpenAI GPT, GPT-2, Transformer-XL, XLNet, XLM, etc.
      • Going through attention weights with bertviz (https://github.com/jessevig/bertviz), it seems the model focuses more on word similarities than on semantics
  • 26. Critical Review of BERT
      • The "Clever Hans" effect (https://thegradient.pub/nlps-clever-hans-moment-has-arrived/): mistaking surface correlations for deep knowledge
          • "Probing Neural Network (...)", Niven and Kao, August 2019
      • In the age of AI, "Correlation is not causation" becomes "Good results are not knowledge acquisition"
      • Focus on the diversity of texts in the dataset for similar usage of knowledge
  • 27. Take-Home
      • Stay aware of the dataset's limitations
      • Get to know what the model actually learns versus how it performs
      • Learn through unsupervised pre-training
      • Lots of orthogonal ways forward:
          • More data for pre-training
          • New pre-training tasks
          • New model architecture
          • Different heuristic for summarization
          • More annotation on relevance assessment
  • 28. Work with us?
      We are interested in collaborations with industry, in the framework of research projects. We cover many domains:
      • Information Extraction from Contracts
      • Summarization of Legal Documents
      • Information Retrieval, Search for Regulation
      We need access to relevant data