1. Mining Legal Text
Information Mining and Visualization of a Large
Volume of Legal Texts
Flávio Codeço Coelho, Renato Rocha Souza and Pablo de
Camargo Cerdeira
Applied Mathematics School – Getulio Vargas Foundation
August 22, 2011
Introduction
Conquering text
Scraping and indexing the world’s web pages has changed the
world...
Should PageRank be our main measure of information relevance?
What is possible if we go a little further?
Introduction
It’s documents all the way down...
Luckily, we didn’t have to scan
them...
We have to conquer an
information mountain...
Web-Scraping
Obtaining the Data
No API for access; some heuristics were necessary
Scraping took more than 3 months.
1.3 million cases
Web-Scraping
Example: Photos
Navigating with Mechanize [1]
    import mechanize
    from mechanize import LinkNotFoundError

    br = mechanize.Browser()
    # The URL is cut off at the slide edge in the source.
    br.open("http://www.stf.jus.br/portal/ministro/ministro.asp?periodo=st
    i = 0
    link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
    while 1:
        br.follow_link(link)
        il = br.find_link(url_regex='imagem.asp')
        url = "http://www.stf.jus.br/portal" + il.url.strip('..')
        nome = il.text
        # download_photo is the authors' helper; it is not shown on the slide.
        download_photo(url, nome.decode('latin1').split('[')[0])
        br.back()
        try:
            i += 1  # advance before the lookup so link 0 is not followed twice
            link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
        except LinkNotFoundError:
            break
[1] http://wwwsearch.sourceforge.net/mechanize/
Web-Scraping
HTML Parsing
Parsing scraped HTML
Beautiful Soup [2] to the rescue!
Firebug helped analyze page structure.
Parsing was done during scraping, to clean the data for insertion into MySQL.
Some parts of the page were stored as HTML for later parsing.
    import re
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3
    sopa = BeautifulSoup(d['decisao'].strip('[]'), fromEncoding='ISO8859-1')  # d: one scraped record
    rs = sopa.findAll('strong', text=re.compile('^Legisla'))
[2] http://www.crummy.com/software/BeautifulSoup/
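The snippet above looks for <strong> elements whose text starts with "Legisla". A minimal sketch of the same lookup using only Python 3's standard html.parser, as a stand-in for Beautiful Soup rather than the authors' code. The class name StrongTextFinder, the helper find_strong and the sample HTML are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class StrongTextFinder(HTMLParser):
    """Collect the text of every <strong> element matching a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.depth = 0      # > 0 while inside a <strong> element
        self.buffer = []    # text pieces of the current <strong>
        self.matches = []   # texts that matched the pattern

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.buffer).strip()
                self.buffer = []
                if self.pattern.match(text):
                    self.matches.append(text)


def find_strong(html, pattern):
    """Return the texts of <strong> elements that match the pattern."""
    finder = StrongTextFinder(pattern)
    finder.feed(html)
    finder.close()
    return finder.matches


# Hypothetical fragment standing in for a stored decision page.
sample = "<p><strong>Legislação</strong> CF ART-00102</p><strong>Ementa</strong>"
print(find_strong(sample, r"^Legisla"))
```

Beautiful Soup remains the more forgiving choice for malformed pages; the standard-library parser is shown only because it needs no installation.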
Pattern Matching
Extracting Even more Information
With the data in a local database, we started mining it:
Tried to use the best that SQL and Python had to offer
Pattern matching, aggregation, string matching [3], etc.
Read from Db → Process → Write to Db
SQL → Python → SQL
[3] difflib
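The SQL → Python → SQL loop with difflib string matching could look roughly like the sketch below. It uses sqlite3 so that it runs without a server; the table decisions, the CANONICAL list and the function names are hypothetical, and the 0.8 cutoff is an arbitrary assumption, not a value from the project.

```python
import difflib
import sqlite3

# Hypothetical canonical spellings; a real list would come from the database.
CANONICAL = ["SYDNEY SANCHES", "MARCO AURELIO", "CELSO DE MELLO"]


def normalize_name(raw, choices=CANONICAL, cutoff=0.8):
    """Return the closest canonical name, or the cleaned input if none is close."""
    cleaned = " ".join(raw.upper().split())
    hits = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else cleaned


def clean_table(conn):
    """Read rows with SQL, fix them in Python, write them back with SQL."""
    rows = conn.execute("SELECT id, relator FROM decisions").fetchall()
    updates = [(normalize_name(name), row_id) for row_id, name in rows]
    conn.executemany("UPDATE decisions SET relator = ? WHERE id = ?", updates)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decisions (id INTEGER PRIMARY KEY, relator TEXT)")
conn.executemany(
    "INSERT INTO decisions (relator) VALUES (?)",
    [("Sydney  Sanchez",), ("MARCO AURELIO",), ("Unknown Person",)],
)
clean_table(conn)
print(conn.execute("SELECT relator FROM decisions ORDER BY id").fetchall())
```

Names with no close match are only cleaned, not replaced, so a too-strict cutoff leaves data untouched instead of corrupting it.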
Pattern Matching
Regular expressions
Regular Expressions
The re module is great, but tricky with different encodings.
Kodos [a]: visual debugging, indispensable!
    import re
    rawstr = r""">*\s*([A-Z]{2,3}\s*-\s*.[A-Z0-9]*)|(CF)|("CAPUT")\s+"""
    compile_obj = re.compile(rawstr, re.LOCALE)
[a] http://kodos.sourceforge.net/
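A simplified variant of this kind of pattern, written with re.VERBOSE and named groups so that each piece is documented. This is a sketch for Python 3, where re.LOCALE is only accepted for bytes patterns and is therefore left out. The pattern, the function name and the sample line are assumptions, not the project's actual expression or data.

```python
import re

# Assumed, simplified variant of the slide's pattern.
REFERENCE = re.compile(
    r"""
    (?P<tag>[A-Z]{2,3})        # two- or three-letter tag such as ART or INC
    \s*-\s*                    # hyphen, optionally surrounded by whitespace
    (?P<value>[A-Z0-9]+)       # number or letters that follow the tag
    |
    \b(?P<keyword>CF|CAPUT)\b  # bare keywords named on the slide
    """,
    re.VERBOSE,
)


def extract_references(text):
    """Return (tag, value) pairs and bare keywords in order of appearance."""
    found = []
    for m in REFERENCE.finditer(text):
        if m.group("keyword"):
            found.append((m.group("keyword"), None))
        else:
            found.append((m.group("tag"), m.group("value")))
    return found


# Hypothetical sample line, not taken from the scraped data.
sample = "LEG-FED CF ANO-1988 ART-00102 INC-00003 CAPUT"
print(extract_references(sample))
```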
Database Interaction
Structuring the Data
Goals:
Reflect the original structure of the data
Store additional info coming from raw text
Design data model with future analytical needs in mind
Database Interaction
MySQLdb
Databases and Drivers
MySQL (MariaDB [4]) was the relational database of choice
MySQLdb’s cursor.execute('select * from ...')
Server-side cursors were essential.
MongoDB + PyMongo
[4] http://mariadb.org
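Server-side cursors keep the result set on the server and hand rows to the client as they are consumed, instead of loading everything into memory first. With MySQLdb this is selected through the cursor class (SSCursor or SSDictCursor). The sketch below only illustrates the streaming access pattern, using sqlite3 and fetchmany as a stand-in so that it runs without a MySQL server; the table cases and the function stream_rows are hypothetical.

```python
import sqlite3


def stream_rows(conn, query, batch_size=1000):
    """Yield rows one at a time while fetching them in fixed-size batches."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (id INTEGER PRIMARY KEY, duration INTEGER)")
conn.executemany("INSERT INTO cases (duration) VALUES (?)", [(d,) for d in range(10)])

# Aggregate without ever holding the full result set in a Python list.
total = 0
for case_id, duration in stream_rows(conn, "SELECT id, duration FROM cases", batch_size=3):
    total += duration
print(total)
```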
Database Interaction
SQLAlchemy
What about ORMs?
Object-relational mappers are great but...
SQLAlchemy [5] was used mostly for table creation and data insertion.
For analytical purposes, server-side raw SQL, stored procedures and views can’t be beaten.
We mostly used Elixir to design the tables.
[5] http://www.sqlalchemy.org
Database Interaction
MongoDb
Escaping from 2D data
Benefits:
Exploring MongoDB [a] as an alternative for analytics
Auto-sharding + Map/reduce!
Escape costly joins in MySQL
Tips:
db.cursor(cursorclass=SSDictCursor)
Convert every string to UTF-8
PyMongo’s transparent conversion of dictionaries to BSON
[a] www.mongodb.org
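The map/reduce idea over nested documents can be sketched in plain Python. This emulates the pattern over dictionaries of the kind PyMongo would convert to BSON; it is not MongoDB's API, and the field names and sample documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents: each decision embeds its cited laws, so no join is needed.
decisions = [
    {"relator": "A", "tipo": "Monocrática", "leis": ["CF", "CPC"]},
    {"relator": "B", "tipo": "Presidência", "leis": ["CF"]},
    {"relator": "A", "tipo": "Monocrática", "leis": []},
]


def map_decision(doc):
    """Emit one (key, 1) pair per cited law."""
    for lei in doc["leis"]:
        yield lei, 1


def reduce_counts(key, values):
    """Combine all values emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


print(map_reduce(decisions, map_decision, reduce_counts))
```

In a relational schema the same count would need a join between a decisions table and a citations table, which is the cost the slide refers to.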
Natural Language Processing
Understanding Text
Biggest challenge is extracting
meaning from decisions
Is a given decision for or against the defendant?
What is the vote count on
non-unanimous decisions?
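A rule-based first pass at these questions might look like the sketch below. The cue phrases are assumptions chosen for illustration and cover only a few common Portuguese formulations. The sketch separates unanimous from majority decisions but does not recover the vote count, and it is not the method used in the project.

```python
import re

# Assumed cue phrases; real decisions use many more variants.
UNANIMOUS = re.compile(r"\bpor unanimidade\b|\bunânime\b", re.IGNORECASE)
MAJORITY = re.compile(r"\bpor maioria\b", re.IGNORECASE)
GRANTED = re.compile(r"\b(deu|dar|dou)\s+provimento\b", re.IGNORECASE)
DENIED = re.compile(r"\b(negou|negar|nego)\s+provimento\b", re.IGNORECASE)


def classify(text):
    """Return a coarse vote type and outcome for one decision text."""
    if UNANIMOUS.search(text):
        vote = "unanimous"
    elif MAJORITY.search(text):
        vote = "majority"
    else:
        vote = "unknown"
    if DENIED.search(text):
        outcome = "denied"
    elif GRANTED.search(text):
        outcome = "granted"
    else:
        outcome = "unknown"
    return {"vote": vote, "outcome": outcome}


# Hypothetical sentence, not taken from the scraped data.
print(classify("O Tribunal, por maioria, negou provimento ao recurso."))
```

Whether "denied" favors the defendant depends on who filed the appeal, so the outcome still has to be combined with party information before answering the first question.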
Natural Language Processing
NLTK
Natural Language Toolkit
Lots of batteries
included
Visualization
Visualizing the Data
You can’t ask questions about what you don’t know...
Data driven research
Visualization
Matplotlib
Standard Charting and Plotting: Matplotlib
Great for plotting summary
statistics
Together with NetworkX, it can help visualize small graphs
Visualization
Ubigraph
Large Graph Visualization: Ubigraph
Ubigraph rocks! [a]
Navigating huge graphs gave powerful insights
Takes advantage of multiple cores and GPU
[a] http://ubietylab.net/ubigraph/
Visualization
Gource
Untangling Temporal patterns:
A bit of Python to create logs compatible with Gource [6]
This:
    # cm and rgb2hex come from matplotlib; normalize is presumably
    # matplotlib.colors.normalize (Normalize in current releases).
    # The SELECT statement and the log-line format are cut off in the source.
    Q = dbdec.execute("SELECT relator, processo, tipo, proc_classe, duracao, U
    decs = Q.fetchall()
    durations = [d[4] for d in decs]
    cmap = cm.jet
    norm = normalize(min(durations), max(durations))  # normalizing duration
    with open('decisoes_%s.log' % ano, 'w') as f:
        for d in decs:
            c = rgb2hex(cmap(norm(d[4]))[:3]).strip('#')
            path = "/%s/%s/%s/%s" % (d[5], d[2], d[3], d[1])  # /State/tipo/proc
            l = "%s|%s|%s|%s|%s\n" % (int(time.mktime(d[6].timetuple())), d[
            f.write(l)
Generates this:
    885967200|MIN. SYDNEY SANCHES|A|/MG/Monocrática/INQUÉRITO/1606809|0000
    885967200|MIN. SYDNEY SANCHES|A|/MG/Presidência/INQUÉRITO/1606809|0000
[6] http://code.google.com/p/gource/
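Gource's custom log format is one pipe-separated entry per line: timestamp, user, action (A, M or D), path and an optional RRGGBB color. A self-contained sketch of producing such lines follows. It replaces matplotlib's colormap with a simple blue-to-red hue ramp from the standard colorsys module; the function names and sample values are assumptions.

```python
import colorsys
import time
from datetime import datetime


def duration_to_hex(value, lo, hi):
    """Map a duration onto a blue-to-red hue ramp and return RRGGBB."""
    frac = 0.0 if hi == lo else (value - lo) / (hi - lo)
    hue = (1.0 - frac) * 2.0 / 3.0  # 2/3 is blue, 0 is red
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return "%02X%02X%02X" % (round(r * 255), round(g * 255), round(b * 255))


def gource_line(when, user, path, color, action="A"):
    """Format one entry as timestamp|user|action|path|color."""
    timestamp = int(time.mktime(when.timetuple()))
    return "%d|%s|%s|%s|%s" % (timestamp, user, action, path, color)


# Hypothetical record: the longest duration in the range gets the red end of the ramp.
color = duration_to_hex(10, 0, 10)
print(gource_line(datetime(1998, 1, 28), "MIN. SYDNEY SANCHES",
                  "/MG/Monocrática/INQUÉRITO/1606809", color))
```

Here the judge plays the role Gource normally gives to a committer, and the State/type/class/case path plays the role of a file in a repository tree.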
Visualization
Gource
A snapshot of the Supreme Court activities: 1998
Visualization
Gource
The Dynamics
Video
Visualization
Visual Python
It’s a Jungle Out There. . .
Division of labor in the Supreme Court
VPython [a] is great for quickly creating complex animations.
Here judges are trees, branches are subjects and leaves are legal decisions
[a] vpython.org
Results
Results
Detailed X-ray of the inner workings of the Supreme Court
92% of the cases are appeals of a non-constitutional nature
These results led to the proposal of an amendment to the constitution!
More questions than answers!
Python for data mining rocks!
Future Directions
To be continued...
Further automate and optimize
More explorations
Scale up the pipeline
Model the life history of a legal process
Future Directions
Acknowledgements
FGV - Direito Rio
FGV - EMAp
Brazilian Supreme Court
Asla Sá (for kindly lending us her server)