Semantic Web and natural-language-processing techniques meet the Code of Federal Regulations. A presentation from CALICON12 by the Legal Information Institute, covering definition extraction, Linked Data publishing, search enhancement, and vocabulary discovery.
Joint presentation with Núria Casellas.
The Semantic Web meets the Code of Federal Regulations
1. THE CFR MEETS THE
SEMANTIC WEB
(with a little unnatural language processing thrown in)
2. BACKGROUND: A TWO-PART HISTORY OF
THE SEMANTIC WEB
• SW is a maze of confusing buzzwords
• Can be thought of in two parts
• Pre-2005 (the “top-down” period)
• Post-2005 (the “bottom-up” period)
3. SW PRE-2005
o A fascination with inferencing & top-down analysis
o Staked out a lot of theoretical territory
o Built basic standards:
• RDF (statement encoding) : saying things about things
• OWL (modeling and inferencing): describing relationships
between things -- that is, creating ontologies
4. SW FROM 2005 TO NOW
o SW now seen as a big heap of statements
o Became more practical
o SKOS (inexpensive conversion method/standard for metadata)
o Linked Data (altruistic, like named anchors ca. 1992)
o Could be seen -- from a library point of view -- as a new set of
techniques for metadata management better suited to the Web
5. THE SEMANTIC WEB AT THE LII
• Tying legal information to the real world, not just itself
• Applications like:
o Improvements to existing finding aids
Table of Popular Names; Tables I and III
Finer-grained, more expressive PTOA
o Search enhancement via term substitution and expansion
o Publication of “regulated nouns” and definitions as Linked Data
• Research-driven engineering as a practice/culture
6. WHY USE THE SW TOOLSET?
• Sometimes the whole thing looks like an illustration of the Two Fool
Rule
• Why RDF?
o XML is more cumbersome and less expressive
o RDF supports inferencing
o RDF allows processing of partial information
• Why SPARQL?
o um, SPARQL is how you query RDF
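To make the RDF/SPARQL bullets concrete: RDF data is just a set of subject-predicate-object triples, and a SPARQL query is, at heart, pattern matching over that set. A toy pure-Python sketch (not the LII stack; all names below are invented) shows why partial information is fine — you query for whatever patterns do match:

```python
# Toy illustration: RDF-style triples as Python tuples, and
# SPARQL-style querying as pattern matching over them.
triples = {
    ("cfr:title21", "dc:title", "Food and Drugs"),
    ("cfr:title21", "cfr:defines", "term:sponsor"),
    ("term:sponsor", "skos:prefLabel", "sponsor"),
    # Partial information is fine: nothing else is said about this term yet.
    ("term:drug", "skos:prefLabel", "drug"),
}

def match(pattern, store):
    """Return every triple matching an (s, p, o) pattern;
    None plays the role of a SPARQL variable."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Roughly "SELECT ?s ?o WHERE { ?s skos:prefLabel ?o }":
for subj, _, label in sorted(match((None, "skos:prefLabel", None), triples)):
    print(subj, "->", label)
```

The point of the sketch is the `None`-as-variable trick: a query never fails because a node is missing properties, it simply returns the triples that exist.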
7. WHY USE SKOS?
o it's a simple knowledge organization system
o lightweight representation of things we need a lot:
o thesauri
o taxonomies
o classification schemes
o it might be a little too simple
8. SKOS: DRIVING INTO A DITCH
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:definition>A feature type category for places
        such as the Erie Canal</skos:definition>
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>canalized streams</skos:altLabel>
    <skos:altLabel>ditch mouths</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
    <skos:related rdf:resource="http://www.my.com/#channels"/>
    <skos:related rdf:resource="http://www.my.com/#locks"/>
    <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
    <skos:related rdf:resource="http://www.my.com/#tunnels"/>
    <skos:scopeNote>Manmade waterway used by watercraft
        or for drainage, irrigation, mining, or water
        power</skos:scopeNote>
  </skos:Concept>
</rdf:RDF>
9. DATA REUSE: DRUGBANK
• Acetaminophen vs. Tylenol: the CFR regulates by generic name
• DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/)
o http://www.drugbank.ca/
o Offered as Linked Data by Freie Universität Berlin
• DrugBank associates brand names with their components
• We offer component names as suggested search terms in Title 21
10. CAN'T EVERYTHING BE DONE WITH
RECYCLED DATA? UM, NO.
• Some datasets suck, or don't exist yet
• Conversion of existing resources is not painless
o Many vocabularies rely on human interpretation
o Many vocabularies are not rigorous enough for SKOS encoding
(lotta bad SKOS out there)
11. CURATION ISSUES FOR EXISTING DATASETS
o Appropriateness, coverage, provenance
o Same metadata quality issues as usual
o Many systems of subject terms or identifiers not designed for wide
exposure: the "on a horse" problem
o We’re talking about curation of vocabularies and schemas as much as
we are about curation of data.
13. EXTRACTED VOCABULARIES
• The big idea: enhance CFR search via term expansion, suggestion, etc.
o Reuse existing thesauri
o Make a CFR-specific vocabulary by discovering how the CFR talks about itself
o Use that knowledge to suggest better search terms
• This is not simple phrase or n-gram matching like Google Suggest.
• Rather, we discover how words within the CFR relate to each other
and we structure them into a hierarchy of terms (SKOS)
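The structuring step above — extracted terms going into a SKOS hierarchy — can be sketched with nothing but the standard library's ElementTree (which the definitions pipeline also uses). This is a sketch, not the actual output pipeline; the URIs and terms are invented:

```python
import xml.etree.ElementTree as ET

# Sketch: serializing one extracted term as a skos:Concept with a
# broader link, using only the standard library. URIs are invented.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)

root = ET.Element("{%s}RDF" % RDF)
concept = ET.SubElement(root, "{%s}Concept" % SKOS,
                        {"{%s}about" % RDF: "http://example.org/cfr/term/cars"})
ET.SubElement(concept, "{%s}prefLabel" % SKOS).text = "cars"
ET.SubElement(concept, "{%s}broader" % SKOS,
              {"{%s}resource" % RDF: "http://example.org/cfr/term/vehicles"})

xml = ET.tostring(root, encoding="unicode")
print(xml)
```

Each (narrower term, broader term) pair discovered in the text becomes one `skos:broader` statement, so the vocabulary accumulates into a hierarchy one extraction at a time.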
14. WHERE DO VOCABULARIES COME FROM?
• Input: text elements in the CFR XML
• Extraction and patterns:
o Anaphora resolution (JavaRAP)
o Natural-language parsing (Stanford Parser)
o Hearst patterns
• Output: SKOS (Jena)
15. ANAPHORA RESOLUTION
• John spent time in a Turkish prison. He is now the executive
director of CALI.
• Núria stole Sara’s chocolate and stuffed her face with it. (but
whose face was it?)
• When a sponsor conducting a nonclinical laboratory study intended
to be submitted to or reviewed by the Food and Drug Administration
utilizes the services of a consulting laboratory, contractor, or grantee
to perform an analysis or other service, it shall notify the consulting
laboratory, contractor, or grantee that the service is part of a
nonclinical laboratory study that must be conducted in compliance
with the provisions of this part.
17. HEARST PATTERNS
o lexico-syntactic patterns that indicate hypernymic/hyponymic
relations.
o NP (,)? (such as | like) (NP ,)* (or | and) NP
o Example: All vehicles like cars, trucks, and go-karts
o PS:
o hypernym == the broader term (names the superset)
o hyponym == the narrower, more specific term
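A rough sketch of that Hearst pattern as a regular expression, using the slide's own example. (Real extraction works over noun phrases from a parser; here an "NP" is approximated as a single, possibly hyphenated, word.)

```python
import re

# One Hearst pattern, "NP (such as | like) NP, NP, (or|and) NP",
# approximated with a regex; a sketch, not the production extractor.
HEARST = re.compile(
    r"([\w-]+)\s*,?\s+(?:such as|like)\s+"  # hypernym NP
    r"((?:[\w-]+,?\s+)*?)"                  # intermediate hyponym NPs
    r"(?:or|and)\s+([\w-]+)"                # final hyponym NP
)

def extract(sentence):
    """Return (hypernym, [hyponyms]) for the first match, else None."""
    m = HEARST.search(sentence)
    if not m:
        return None
    return m.group(1), re.findall(r"[\w-]+", m.group(2)) + [m.group(3)]

print(extract("All vehicles like cars, trucks, and go-karts"))
# -> ('vehicles', ['cars', 'trucks', 'go-karts'])
```

The word-level approximation is also a preview of why this is hard: on real CFR sentences, the "NP" slots have to come from a parser, not `\w+`.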
19. WHY IS THIS HARD?
• Legal text is structurally complicated
o Parser dies on long sentences, leading to incorrect extractions
• Named entities ("Food, Drug, and Cosmetic Act") confuse the parser
o Should be separately extracted/tagged
o Parser should think of them as a single token, but doesn't
o May need authority files for entities and acronyms, etc.
• Corpus is huge (CFR == 96.5 million words)
o Strains memory limits and computational resources
20. DEFINITIONS: IMPROVING SEARCH AND
PRESENTATION
• The big idea: find all terms defined by the reg or statute, and do
cool stuff with them, for example
o linking terms in text to their definitions
o pushing definitions to the top of results when the term is
searched for
o altering presentation so that the (legally) naive user understands the
importance of definitions for, e.g., compliance.
• Of course, that also means figuring out what the scope of definitions
is.... :(
21. WHERE DO THE DEFINITIONS COME
FROM?
• Input: heading elements in the CFR XML with the term "definition".
• Using regular expressions, we extract
o Defined term and definition text
o Location of the definition (section of the CFR)
o Scoping information: "For the purposes of this part"
• Output: SKOS/RDF
o defined term --> SKOS Vocabulary
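The regex step above can be sketched in a few lines. This is illustrative only — the sample section text is invented and the real patterns are more elaborate — but it shows the three outputs the slide names: the defined term, the definition text, and a scoping phrase:

```python
import re

# Sketch of regex-based definition extraction (sample text and
# patterns are illustrative, not the production rules).
SECTION = """
For the purposes of this part, Sponsor means a person who initiates
and supports a nonclinical laboratory study. Test article means any
food additive, color additive, drug, or biological product.
"""

SCOPE_RE = re.compile(r"For the purposes of this (part|subpart|section)")
DEF_RE = re.compile(r"([A-Z][\w ]*?)\s+means\s+(.+?)(?=\.\s|\.$)", re.S)

scope = SCOPE_RE.search(SECTION)
print("scope:", scope.group(1) if scope else "unknown")
for term, text in DEF_RE.findall(SECTION):
    print(term, "=>", " ".join(text.split()))
```

Even on this toy text the weak spots show: "means" is doing all the work (what does "means" mean?), and scope has to be inferred when no scoping phrase is present.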
22. DEFINITIONS: TOOLS
• Python Natural Language Toolkit (NLTK)
• ElementTree, an XML parsing library
• Snowball Stemmer Package
• RDFlib, an RDF generation library
24. WHY THIS IS HARD: FINDING
DEFINITIONS
o The text surrounding a definition can make it hard to extract.
o Sponsor means:
o (1) A person who initiates and supports, by provision of
financial or other resources, a nonclinical laboratory study;
o (2) A person who submits a nonclinical study to the Food and
Drug Administration in support of an application for a
research or marketing permit
o Pattern identification/inconsistencies in sections that are not
explicitly meant to be definitions (or, what does “means” mean?)
25. WHY THIS IS HARD: SCOPING DEFINITIONS
o Scoping not stated in text, implicit in structure
o Complex scoping statements:
"The definitions and interpretations contained in section 201 of the act apply to those
terms when used in this part".
"Any term not defined in this part shall have the definition set forth in section 102 of the
Act (21 U.S.C. 802), except that certain terms used in part 1316 of this chapter are
defined at the beginning of each subpart of that part".
27. IMPROVEMENTS
o Vocabulary: better extraction and quality
o Definitions: retrieval and completeness
o Obligations: false positives, identification of parts
o Product Codes: semantic matching
28. FUTURE WORK
o RDF-ification, refinement, implementation:
Table III, PTOA, Popular Names
Agency structure
o Data management and quality
o Crowdsourcing
29. RESOURCES: STANDARDS AND PRIMERS
• RDF:
o Primer: http://www.w3.org/TR/rdf-primer/
o Advantages: http://www.w3.org/RDF/advantages.html
• SKOS
o http://www.w3.org/2004/02/skos/
30. MORE RESOURCES
• Linked Open Data:
o General: http://linkeddata.org/
o Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/
o Government Data: http://logd.tw.rpi.edu/
• W3C Semantic Web resources:
o http://www.w3.org/standards/semanticweb/
31. EVEN MORE RESOURCES: RANTS AND
RAVES
• VoxPop articles on the SW and Law: http://blog.law.cornell.edu/
voxpop/category/semantic-web-and-law/
• Mangy dogs: http://liicr.nl/JPcAb2
• Legal Informatics blog: http://legalinformatics.wordpress.com/
• Books on law and the SW: http://liicr.nl/MGRbkA
32. US
• Núria
o nuria.casellas@liicornell.org
o @ncasellas
o http://nuriacasellas.blogspot.com
• Tom
o tom@liicornell.org
o @trbruce
o http://blog.law.cornell.edu/(tbruce | metasausage)