Semantic Web and natural-language-processing techniques meet the Code of Federal Regulations. A presentation from CALICON12 by the Legal Information Institute, covering definition extraction, Linked Data publishing, search enhancement, and vocabulary discovery.
Joint presentation with Núria Casellas.
The Semantic Web meets the Code of Federal Regulations
1. THE CFR MEETS THE
SEMANTIC WEB
(with a little unnatural language processing thrown in)
2. BACKGROUND: A TWO-PART HISTORY OF
THE SEMANTIC WEB
• SW is a maze of confusing buzzwords
• Can be thought of in two parts
• Pre-2005 (the “top-down” period)
• Post-2005 (the “bottom-up” period)
3. SW PRE-2005
o A fascination with inferencing & top-down analysis
o Staked out a lot of theoretical territory
o Built basic standards:
• RDF (statement encoding) : saying things about things
• OWL (modeling and inferencing): describing relationships
between things -- that is, creating ontologies
4. SW FROM 2005 TO NOW
o SW now seen as a big heap of statements
o Became more practical
o SKOS (inexpensive conversion method/standard for metadata)
o Linked Data (altruistic, like named anchors ca. 1992)
o Could be seen -- from a library point of view -- as a new set of
techniques for metadata management better suited to the Web
5. THE SEMANTIC WEB AT THE LII
• Tying legal information to the real world, not just itself
• Applications like:
o Improvements to existing finding aids
Table of Popular Names; Tables I and III
Finer-grained, more expressive PTOA
o Search enhancement via term substitution and expansion
o Publication of “regulated nouns” and definitions as Linked Data
• Research-driven engineering as a practice/culture
6. WHY USE THE SW TOOLSET?
• Sometimes the whole thing looks like an illustration of the Two Fool
Rule
• Why RDF?
o XML is more cumbersome and less expressive
o RDF supports inferencing
o RDF allows processing of partial information
• Why SPARQL?
o um, SPARQL is how you query RDF
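To make the RDF/SPARQL bullets concrete: RDF data is just a set of subject-predicate-object triples, and a SPARQL query is, at heart, pattern matching over that set. A toy pure-Python sketch (not the LII stack; all names below are invented) shows why partial information is fine — you query for whatever patterns do match:

```python
# Toy illustration: RDF-style triples as Python tuples, and
# SPARQL-style querying as pattern matching over them.
triples = {
    ("cfr:title21", "dc:title", "Food and Drugs"),
    ("cfr:title21", "cfr:defines", "term:sponsor"),
    ("term:sponsor", "skos:prefLabel", "sponsor"),
    # Partial information is fine: nothing else is said about this term yet.
    ("term:drug", "skos:prefLabel", "drug"),
}

def match(pattern, store):
    """Return every triple matching an (s, p, o) pattern;
    None plays the role of a SPARQL variable."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Roughly "SELECT ?s ?o WHERE { ?s skos:prefLabel ?o }":
for subj, _, label in sorted(match((None, "skos:prefLabel", None), triples)):
    print(subj, "->", label)
```

The point of the sketch is the `None`-as-variable trick: a query never fails because a node is missing properties, it simply returns the triples that exist.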
7. WHY USE SKOS?
o it's a simple knowledge organization system
o lightweight representation of things we need a lot:
o thesauri
o taxonomies
o classification schemes
o it might be a little too simple
8. SKOS: DRIVING INTO A DITCH
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:definition>A feature type category for places
        such as the Erie Canal</skos:definition>
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>canalized streams</skos:altLabel>
    <skos:altLabel>ditch mouths</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
    <skos:related rdf:resource="http://www.my.com/#channels"/>
    <skos:related rdf:resource="http://www.my.com/#locks"/>
    <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
    <skos:related rdf:resource="http://www.my.com/#tunnels"/>
    <skos:scopeNote>Manmade waterway used by watercraft
        or for drainage, irrigation, mining, or water
        power</skos:scopeNote>
  </skos:Concept>
</rdf:RDF>
9. DATA REUSE: DRUGBANK
• Acetaminophen vs. Tylenol: the CFR regulates by generic name
• DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/)
o http://www.drugbank.ca/
o Offered as Linked Data by Freie Universität Berlin
• DrugBank associates brand names with their components
• We offer component names as suggested search terms in Title 21
10. CAN'T EVERYTHING BE DONE WITH
RECYCLED DATA? UM, NO.
• Some datasets suck, or don't exist yet
• Conversion of existing resources is not painless
o Many vocabularies rely on human interpretation
o Many vocabularies are not rigorous enough for SKOS encoding
(lotta bad SKOS out there)
11. CURATION ISSUES FOR EXISTING DATASETS
o Appropriateness, coverage, provenance
o Same metadata quality issues as usual
o Many systems of subject terms or identifiers not designed for wide
exposure: the "on a horse" problem
o We’re talking about curation of vocabularies and schemas as much as
we are about curation of data.
13. EXTRACTED VOCABULARIES
• The big idea: enhance CFR search via term expansion, suggestion, etc.
o Reuse existing thesauri
o Make a CFR-specific vocabulary by discovering how the CFR talks about itself
o Use that knowledge to suggest better search terms
• This is not simple phrase or n-gram matching like Google Suggest.
• Rather, we discover how words within the CFR relate to each other
and we structure them into a hierarchy of terms (SKOS)
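The structuring step above — extracted terms going into a SKOS hierarchy — can be sketched with nothing but the standard library's ElementTree (which the definitions pipeline also uses). This is a sketch, not the actual output pipeline; the URIs and terms are invented:

```python
import xml.etree.ElementTree as ET

# Sketch: serializing one extracted term as a skos:Concept with a
# broader link, using only the standard library. URIs are invented.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)

root = ET.Element("{%s}RDF" % RDF)
concept = ET.SubElement(root, "{%s}Concept" % SKOS,
                        {"{%s}about" % RDF: "http://example.org/cfr/term/cars"})
ET.SubElement(concept, "{%s}prefLabel" % SKOS).text = "cars"
ET.SubElement(concept, "{%s}broader" % SKOS,
              {"{%s}resource" % RDF: "http://example.org/cfr/term/vehicles"})

xml = ET.tostring(root, encoding="unicode")
print(xml)
```

Each (narrower term, broader term) pair discovered in the text becomes one `skos:broader` statement, so the vocabulary accumulates into a hierarchy one extraction at a time.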
14. WHERE DO VOCABULARIES COME FROM?
• Input: text elements in the CFR XML
• Extraction and patterns:
o Anaphora resolution (JavaRAP)
o Natural-language parsing (Stanford Parser)
o Hearst patterns
• Output: SKOS (Jena)
15. ANAPHORA RESOLUTION
• John spent time in a Turkish prison. He is now the executive
director of CALI.
• Núria stole Sara’s chocolate and stuffed her face with it. (but
whose face was it?)
• When a sponsor conducting a nonclinical laboratory study intended
to be submitted to or reviewed by the Food and Drug Administration
utilizes the services of a consulting laboratory, contractor, or grantee
to perform an analysis or other service, it shall notify the consulting
laboratory, contractor, or grantee that the service is part of a
nonclinical laboratory study that must be conducted in compliance
with the provisions of this part.
17. HEARST PATTERNS
o lexico-syntactic patterns that indicate hypernymic/hyponymic
relations.
o NP (,)? (such as | like) (NP ,)* (or | and) NP
o Example: All vehicles like cars, trucks, and go-karts
o PS:
o hypernym == the broader term (names the superset)
o hyponym == the narrower, more specific term
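A rough sketch of that Hearst pattern as a regular expression, using the slide's own example. (Real extraction works over noun phrases from a parser; here an "NP" is approximated as a single, possibly hyphenated, word.)

```python
import re

# One Hearst pattern, "NP (such as | like) NP, NP, (or|and) NP",
# approximated with a regex; a sketch, not the production extractor.
HEARST = re.compile(
    r"([\w-]+)\s*,?\s+(?:such as|like)\s+"  # hypernym NP
    r"((?:[\w-]+,?\s+)*?)"                  # intermediate hyponym NPs
    r"(?:or|and)\s+([\w-]+)"                # final hyponym NP
)

def extract(sentence):
    """Return (hypernym, [hyponyms]) for the first match, else None."""
    m = HEARST.search(sentence)
    if not m:
        return None
    return m.group(1), re.findall(r"[\w-]+", m.group(2)) + [m.group(3)]

print(extract("All vehicles like cars, trucks, and go-karts"))
# -> ('vehicles', ['cars', 'trucks', 'go-karts'])
```

The word-level approximation is also a preview of why this is hard: on real CFR sentences, the "NP" slots have to come from a parser, not `\w+`.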
19. WHY IS THIS HARD?
• Legal text is structurally complicated
o Parser dies on long sentences, leading to incorrect extractions
• Named entities ("Food, Drug, and Cosmetic Act") confuse the parser
o Should be separately extracted/tagged
o Parser should think of them as a single token, but doesn't
o May need authority files for entities and acronyms, etc.
• Corpus is huge (CFR == 96.5 million words)
o Strains memory limits and computational resources
20. DEFINITIONS: IMPROVING SEARCH AND
PRESENTATION
• The big idea: find all terms defined by the reg or statute, and do
cool stuff with them, for example
o linking terms in text to their definitions
o pushing definitions to the top of results when the term is
searched for
o altering presentation so that the (legally) naive user understands the
importance of definitions for, e.g., compliance.
• Of course, that also means figuring out what the scope of definitions
is.... :(
21. WHERE DO THE DEFINITIONS COME
FROM?
• Input: heading elements in the CFR XML with the term "definition".
• Using regular expressions, we extract
o Defined term and definition text
o Location of the definition (section of the CFR)
o Scoping information: "For the purposes of this part"
• Output: SKOS/RDF
o defined term --> SKOS Vocabulary
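The regex step above can be sketched in a few lines. This is illustrative only — the sample section text is invented and the real patterns are more elaborate — but it shows the three outputs the slide names: the defined term, the definition text, and a scoping phrase:

```python
import re

# Sketch of regex-based definition extraction (sample text and
# patterns are illustrative, not the production rules).
SECTION = """
For the purposes of this part, Sponsor means a person who initiates
and supports a nonclinical laboratory study. Test article means any
food additive, color additive, drug, or biological product.
"""

SCOPE_RE = re.compile(r"For the purposes of this (part|subpart|section)")
DEF_RE = re.compile(r"([A-Z][\w ]*?)\s+means\s+(.+?)(?=\.\s|\.$)", re.S)

scope = SCOPE_RE.search(SECTION)
print("scope:", scope.group(1) if scope else "unknown")
for term, text in DEF_RE.findall(SECTION):
    print(term, "=>", " ".join(text.split()))
```

Even on this toy text the weak spots show: "means" is doing all the work (what does "means" mean?), and scope has to be inferred when no scoping phrase is present.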
22. DEFINITIONS: TOOLS
• Python Natural Language Toolkit (NLTK)
• ElementTree, an XML parsing library
• Snowball Stemmer Package
• RDFlib, an RDF generation library
24. WHY THIS IS HARD: FINDING
DEFINITIONS
o The text surrounding a definition can make it hard to extract.
o Sponsor means:
o (1) A person who initiates and supports, by provision of
financial or other resources, a nonclinical laboratory study;
o (2) A person who submits a nonclinical study to the Food and
Drug Administration in support of an application for a
research or marketing permit
o Pattern identification/inconsistencies in sections that are not
explicitly meant to be definitions (or, what does “means” mean?)
25. WHY THIS IS HARD: SCOPING DEFINITIONS
o Scoping not stated in text, implicit in structure
o Complex scoping statements:
"The definitions and interpretations contained in section 201 of the act apply to those
terms when used in this part".
"Any term not defined in this part shall have the definition set forth in section 102 of the
Act (21 U.S.C. 802), except that certain terms used in part 1316 of this chapter are
defined at the beginning of each subpart of that part".
27. IMPROVEMENTS
o Vocabulary: better extraction and quality
o Definitions: retrieval and completeness
o Obligations: false positives, identification of parts
o Product Codes: semantic matching
28. FUTURE WORK
o RDF-ification, refinement, implementation:
Table III, PTOA, Popular Names
Agency structure
o Data management and quality
o Crowdsourcing
29. RESOURCES: STANDARDS AND PRIMERS
• RDF:
o Primer: http://www.w3.org/TR/rdf-primer/
o Advantages: http://www.w3.org/RDF/advantages.html
• SKOS
o http://www.w3.org/2004/02/skos/
30. MORE RESOURCES
• Linked Open Data:
o General: http://linkeddata.org/
o Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/
o Government Data: http://logd.tw.rpi.edu/
• W3C Semantic Web resources:
o http://www.w3.org/standards/semanticweb/
31. EVEN MORE RESOURCES: RANTS AND
RAVES
• VoxPop articles on the SW and Law: http://blog.law.cornell.edu/
voxpop/category/semantic-web-and-law/
• Mangy dogs: http://liicr.nl/JPcAb2
• Legal Informatics blog: http://legalinformatics.wordpress.com/
• Books on law and the SW: http://liicr.nl/MGRbkA
32. US
• Núria
o nuria.casellas@liicornell.org
o @ncasellas
o http://nuriacasellas.blogspot.com
• Tom
o tom@liicornell.org
o @trbruce
o http://blog.law.cornell.edu/(tbruce | metasausage)