THE CFR MEETS THE
    SEMANTIC WEB
(with a little unnatural language processing thrown in)
BACKGROUND: A TWO-PART HISTORY OF
        THE SEMANTIC WEB

• SW is a maze of confusing buzzwords
• Can be thought of in two parts
  • Pre-2005 (the “top-down” period)
  • Post-2005 (the “bottom-up” period)
SW PRE-2005


o   A fascination with inferencing & top-down analysis

o   Staked out a lot of theoretical territory

o   Built basic standards:

           • RDF (statement encoding) : saying things about things

           • OWL (modeling and inferencing): describing relationships
             between things -- that is, creating ontologies
SW FROM 2005 TO NOW

o   SW now seen as a big heap of statements

o   Became more practical

    o   SKOS (inexpensive conversion method/standard for metadata)

    o   Linked Data (altruistic, like named anchors ca. 1992)

o   Could be seen -- from a library point of view -- as a new set of
    techniques for metadata management better suited to the Web
THE SEMANTIC WEB AT THE LII
• Tying legal information to the real world, not just itself
• Applications like:
   o   Improvements to existing finding aids

          Table of Popular Names, Tables I and III

          Finer-grained, more expressive PTOA

   o   Search enhancement via term substitution and expansion

   o   Publication of “regulated nouns” and definitions as Linked Data

• Research-driven engineering as a practice/culture
WHY USE THE SW TOOLSET?
• Sometimes the whole thing looks like an illustration of the Two Fool
  Rule

• Why RDF?
  o   XML is more cumbersome and less expressive

  o   RDF supports inferencing

  o   RDF allows processing of partial information

• Why SPARQL?
  o   um, SPARQL is how you query RDF
WHY USE SKOS?

o   it's a simple knowledge organization system

o   lightweight representation of things we need a lot:

    o   thesauri

    o   taxonomies

    o   classification schemes

o   it might be a little too simple
SKOS: DRIVING INTO A DITCH

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:definition>A feature type category for places
     such as the Erie Canal</skos:definition>
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>canalized streams</skos:altLabel>
    <skos:altLabel>ditch mouths</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
    <skos:related rdf:resource="http://www.my.com/#channels"/>
    <skos:related rdf:resource="http://www.my.com/#locks"/>
    <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
    <skos:related rdf:resource="http://www.my.com/#tunnels"/>
    <skos:scopeNote>Manmade waterway used by watercraft
     or for drainage, irrigation, mining, or water
     power</skos:scopeNote>
  </skos:Concept>

</rdf:RDF>
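
A minimal sketch of what a consumer of this record sees, assuming Python's standard-library ElementTree and a trimmed copy of the record above. The function name `read_concepts` is ours for illustration, not part of any SKOS tooling. One reading of the slide title: every altLabel comes back as a flat synonym of "canals", so naive expansion of a canal search also pulls in "ditches".

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# Trimmed copy of the slide's record (fewer labels, same structure).
SKOS_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
  </skos:Concept>
</rdf:RDF>"""


def read_concepts(xml_text):
    """Return {concept URI: {"pref": str, "alt": [str], "broader": [str]}}."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for node in root.findall(f"{{{SKOS_NS}}}Concept"):
        uri = node.get(f"{{{RDF_NS}}}about")
        concepts[uri] = {
            "pref": node.findtext(f"{{{SKOS_NS}}}prefLabel"),
            # SKOS gives no way to say *how* an altLabel relates to the
            # concept, so all of them look like plain synonyms here.
            "alt": [e.text for e in node.findall(f"{{{SKOS_NS}}}altLabel")],
            "broader": [
                e.get(f"{{{RDF_NS}}}resource")
                for e in node.findall(f"{{{SKOS_NS}}}broader")
            ],
        }
    return concepts


concepts = read_concepts(SKOS_XML)
```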
DATA REUSE: DRUGBANK
• Acetaminophen vs. Tylenol: CFR regulates by generic name
• DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/)
  o   http://www.drugbank.ca/

  o   Offered as Linked Data by Freie Universität Berlin

• DrugBank associates brand names with their components
• We offer component names as suggested search terms in Title 21
  [*]
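
A rough sketch of the suggestion step, assuming the brand-to-component associations have already been pulled out of DrugBank into a plain dictionary. The dictionary, the function name `suggest_terms`, and the word-splitting are all ours for illustration; only the Tylenol/acetaminophen pair comes from the slide.

```python
# Hypothetical, hand-built stand-in for the brand-to-component data that
# DrugBank publishes; only the Tylenol entry comes from the slide.
BRAND_TO_COMPONENTS = {
    "tylenol": ["acetaminophen"],
}


def suggest_terms(query, brand_map=BRAND_TO_COMPONENTS):
    """Return generic component names to suggest for brand names in the query."""
    suggestions = []
    for word in query.lower().split():
        # Strip trailing punctuation so "Tylenol," still matches.
        for component in brand_map.get(word.strip(".,;:?!"), []):
            if component not in suggestions:
                suggestions.append(component)
    return suggestions
```

Searching Title 21 for "Tylenol dosage" would then surface "acetaminophen" as a suggested term, since that is the name the regulation actually uses.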
CAN'T EVERYTHING BE DONE WITH
         RECYCLED DATA? UM, NO.

• Some datasets suck, or don't exist yet
• Conversion of existing resources is not painless
  o   Many vocabularies rely on human interpretation

  o   Many vocabularies are not rigorous enough for SKOS encoding
      (lotta bad SKOS out there)
CURATION ISSUES FOR EXISTING DATASETS

o   Appropriateness, coverage, provenance

o   Same metadata quality issues as usual

o   Many systems of subject terms or identifiers not designed for wide
    exposure: the "on a horse" problem

o   We’re talking about curation of vocabularies and schemas as much as
    we are about curation of data.
LII SW FEATURES
EXTRACTED VOCABULARIES
• The big idea: enhance CFR search via term expansion, suggestion,
  etc.

   o   Reuse existing thesauri

   o   Make a CFR-specific vocabulary by discovering how the CFR
       talks about itself

   o   Use that knowledge to suggest better search terms

• This is not simple phrase or n-gram matching like Google Suggest.
• Rather, we discover how words within the CFR relate to each other
  and we structure them into a hierarchy of terms (SKOS)
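
To make the difference from phrase matching concrete, here is a small sketch of expansion over a term hierarchy, assuming the skos:narrower links have been loaded into a dictionary. The terms are illustrative (the vehicle terms echo the example used for Hearst patterns; "pickup trucks" is invented), and `expand` is a hypothetical helper, not LII's implementation.

```python
from collections import deque

# Illustrative narrower-term links, as they might look after extraction.
NARROWER = {
    "vehicles": ["cars", "trucks", "go-karts"],
    "trucks": ["pickup trucks"],
}


def expand(term, narrower=NARROWER):
    """Return the term plus every narrower term reachable from it."""
    seen = [term]
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:  # guards against cycles in messy data
                seen.append(child)
                queue.append(child)
    return seen
```

A search for "vehicles" can then also match text that only mentions "pickup trucks", which no n-gram suggestion would find.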
WHERE DO VOCABULARIES COME FROM?


• Input: text elements in the CFR XML
• Extraction and patterns:
    o   Anaphora resolution (JavaRAP)

    o   Natural Language Parser (Stanford Parser)

    o   Hearst patterns

• Output: SKOS (Jena)
ANAPHORA RESOLUTION

• John spent time in a Turkish prison. He is now the executive
 director of CALI.

• Núria stole Sara’s chocolate and stuffed her face with it. (but
 whose face was it?)

• When a sponsor conducting a nonclinical laboratory study intended
 to be submitted to or reviewed by the Food and Drug Administration
 utilizes the services of a consulting laboratory, contractor, or grantee
 to perform an analysis or other service, it shall notify the consulting
 laboratory, contractor, or grantee that the service is part of a
 nonclinical laboratory study that must be conducted in compliance
 with the provisions of this part.
STANFORD PARSER


• Structured grammar trees & typed dependencies

• Noun modifier: nn(product-10, chemical-9)

         • “product skos:narrower chemical_product”


• Conjunctions: conj(doctor-7, practitioner-9)

         • “doctor skos:related practitioner”
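
The two mappings above can be sketched as a small rule table, assuming the parser's typed dependencies arrive as strings in the `rel(governor-N, dependent-N)` form shown on the slide. The regular expression and the function `dependency_to_skos` are ours; a real pipeline would read the parser's objects directly and would need more rules than these two.

```python
import re

# Matches strings such as "nn(product-10, chemical-9)".
DEP_RE = re.compile(r"^(\w+)\((\S+)-(\d+),\s*(\S+)-(\d+)\)$")


def dependency_to_skos(dep):
    """Turn one typed-dependency string into a (subject, predicate, object)
    statement, or None if no rule applies."""
    m = DEP_RE.match(dep.strip())
    if m is None:
        return None
    rel, governor, _, dependent, _ = m.groups()
    if rel == "nn":
        # Noun modifier: the modified noun is the broader term, and the
        # compound (modifier + noun) is the narrower one.
        return (governor, "skos:narrower", f"{dependent}_{governor}")
    if rel == "conj":
        # Conjoined nouns are treated as related terms.
        return (governor, "skos:related", dependent)
    return None
```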
HEARST PATTERNS
o   Lexico-syntactic patterns that indicate hypernymic/hyponymic
    relations

o   NP (,)? (such as | like) (NP,)* (or | and) NP

o   Example: All vehicles like cars, trucks, and go-karts

o   PS:

    o     hypernym == word for superset containing term

    o     hyponym == more specific term
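
A toy version of the pattern above, assuming every NP is a single (possibly hyphenated) word rather than a noun phrase found by a real parser. That assumption is what keeps this to a plain regular expression. The names `HEARST` and `extract_hyponyms` are ours.

```python
import re

# Assumption: a "noun phrase" is one word, optionally hyphenated.
NP = r"[A-Za-z][A-Za-z-]*"

# NP (,)? (such as | like) NP (, NP)* (,)? (or | and) NP   or a single NP
HEARST = re.compile(
    rf"(?P<hyper>{NP}),? (?:such as|like) "
    rf"(?P<hypos>{NP}(?:, {NP})*,? (?:or|and) {NP}|{NP})"
)


def extract_hyponyms(sentence):
    """Return (hypernym, hyponym) pairs found by the pattern, if any."""
    m = HEARST.search(sentence)
    if m is None:
        return []
    # Split the list part on ", ", " and ", " or ", ", and ", ", or ".
    hypos = re.split(r",? (?:or|and) |, ", m.group("hypos"))
    return [(m.group("hyper"), h) for h in hypos if h]
```

On the slide's example, `extract_hyponyms("All vehicles like cars, trucks, and go-karts")` pairs "vehicles" with each of "cars", "trucks", and "go-karts". Each pair is a candidate broader/narrower link for the SKOS output.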
principal display panel

parser understands “display” as a verb. oops.
WHY IS THIS HARD?
• Legal text is structurally complicated
   o Parser dies on long sentences, leading to incorrect extractions

• Named entities ("Food, Drug, and Cosmetic Act") confuse the parser
   o Should be separately extracted/tagged

   o Parser should think of them as a single token, but doesn't

   o   May need authority files for entities and acronyms, etc.

• Corpus is huge (CFR == 96.5 million words)
   o   Strains memory limits and computational resources
DEFINITIONS: IMPROVING SEARCH AND
            PRESENTATION
• The big idea: find all terms defined by the reg or statute, and do
  cool stuff with them, for example

  o   linking terms in text to their definitions

  o   pushing definitions to the top of results when the term is
      searched for

  o   altering presentation so that a (legally) naive user understands the
      importance of definitions for, e.g., compliance.

• Of course, that also means figuring out what the scope of definitions
  is.... :(
WHERE DO THE DEFINITIONS COME
                 FROM?
• Input: heading elements in the CFR XML with the term "definition".
• Using regular expressions, we extract
  o   Defined term and definition text

  o   Location of the definition (section of the CFR)

  o   Scoping information: "For the purposes of this part"

• Output: SKOS/RDF
  o   defined term --> SKOS Vocabulary
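
A simplified sketch of the extraction step, assuming a section has already been reduced to its heading text and a list of paragraph strings. Both regular expressions are illustrative guesses at the simplest "X means ..." and "for the purposes of this part" shapes, not the patterns LII uses, and `extract_definitions` and the location string are hypothetical.

```python
import re

# "Term means ..." or "Term means: ...", with an optional "(a)" style marker.
DEFINITION_RE = re.compile(
    r"^\s*(?:\([a-z0-9]+\)\s*)?(?P<term>[A-Z][\w -]*?) means:?\s*(?P<text>.*)$",
    re.S,
)
# Explicit scoping statement, e.g. "For the purposes of this part".
SCOPE_RE = re.compile(
    r"[Ff]or (?:the )?purposes of this (?P<unit>part|subpart|section|chapter)"
)


def extract_definitions(heading, paragraphs, location):
    """Return one dict per defined term found in a definitions section."""
    if "definition" not in heading.lower():
        return []
    scope = None
    found = []
    for para in paragraphs:
        s = SCOPE_RE.search(para)
        if s and scope is None:
            scope = s.group("unit")
        m = DEFINITION_RE.match(para)
        if m:
            found.append({
                "term": m.group("term").strip(),
                "definition": m.group("text").strip(),
                "location": location,
            })
    # Attach the section-level scope (or None if it was only implicit).
    for entry in found:
        entry["scope"] = scope
    return found


example = extract_definitions(
    "Definitions.",
    [
        "For the purposes of this part, the following terms apply:",
        "Sponsor means: (1) A person who initiates and supports, by "
        "provision of financial or other resources, a nonclinical "
        "laboratory study;",
    ],
    "example-section",  # hypothetical location identifier
)
```

The `scope` value stays `None` whenever scoping is implicit in the structure rather than stated in the text.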
DEFINITIONS: TOOLS


• Python Natural Language Toolkit (NLTK)

• ElementTree, XML parsing library

• Snowball Stemmer Package

• RDFlib, an RDF generation library
WHY THIS IS HARD: FINDING
                    DEFINITIONS
o   The text containing a definition can make the definition hard to extract.

    o   Sponsor means:

        o   (1) A person who initiates and supports, by provision of
            financial or other resources, a nonclinical laboratory study;

        o   (2) A person who submits a nonclinical study to the Food and
            Drug Administration in support of an application for a
            research or marketing permit

o   Pattern identification/inconsistencies in sections that are not
    explicitly meant to be definitions (or, what does “means” mean?)
WHY THIS IS HARD: SCOPING DEFINITIONS


o   Scoping not stated in text, implicit in structure

o   Complex scoping statements:

          "The definitions and interpretations contained in section 201 of the act apply to those
           terms when used in this part."

          "Any term not defined in this part shall have the definition set forth in section 102 of the
           Act (21 U.S.C. 802), except that certain terms used in part 1316 of this chapter are
           defined at the beginning of each subpart of that part."
SO, WHAT CAN WE DO? [*]
IMPROVEMENTS


o   Vocabulary: better extraction and quality

o   Definitions: retrieval and completeness

o   Obligations: false positives, identification of parts

o   Product Codes: semantic matching
FUTURE WORK


o   RDF-ification, refinement, implementation:

    o   Table III, PTOA, Popular Names

    o   Agency structure

o   Data management and quality

o   Crowdsourcing
RESOURCES: STANDARDS AND PRIMERS
• RDF:
  o   Primer: http://www.w3.org/TR/rdf-primer/

  o   Advantages: http://www.w3.org/RDF/advantages.html

• SKOS
  o   http://www.w3.org/2004/02/skos/
MORE RESOURCES

• Linked Open Data:
  o   General: http://linkeddata.org/

  o   Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/

  o   Government Data: http://logd.tw.rpi.edu/

• W3C Semantic Web resources:
  o   http://www.w3.org/standards/semanticweb/
EVEN MORE RESOURCES: RANTS AND
                 RAVES

• VoxPop articles on the SW and Law:
  http://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/

• Mangy dogs: http://liicr.nl/JPcAb2
• Legal Informatics blog: http://legalinformatics.wordpress.com/
• Books on law and the SW: http://liicr.nl/MGRbkA
US
• Núria
  o   nuria.casellas@liicornell.org

  o   @ncasellas

  o   http://nuriacasellas.blogspot.com

• Tom
  o   tom@liicornell.org

  o   @trbruce

  o   http://blog.law.cornell.edu/(tbruce | metasausage)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

The Semantic Web meets the Code of Federal Regulations

  • 1. THE CFR MEETS THE SEMANTIC WEB (with a little unnatural language processing thrown in )
  • 2. BACKGROUND: A TWO-PART HISTORY OF THE SEMANTIC WEB • SW is a maze of confusing buzzwords • Can be thought of in two parts • Pre-2005 (the “top-down” period) • Post-2005 (the “bottom-up” period)
  • 3. SW PRE-2005 o A fascination with inferencing & top-down analysis o Staked out a lot of theoretical territory o Built basic standards: • RDF (statement encoding) : saying things about things • OWL (modeling and inferencing): describing relationships between things -- that is, creating ontologies
  • 4. SW FROM 2005 TO NOW o SW now seen as a big heap of statements o Became more practical o SKOS ( inexpensive conversion method/standard for metadata) o Linked Data ( altruistic, like named anchors ca. 1992 ) o Could be seen -- from a library point of view -- as a new set of techniques for metadata management better suited to the Web
  • 5. THE SEMANTIC WEB AT THE LII • Tying legal information to the real world, not just itself • Applications like: o Improvements to existing finding aids ▪ Table of Popular Names, Tables I and III ▪ Finer-grained, more expressive PTOA o Search enhancement via term substitution and expansion o Publication of “regulated nouns” and definitions as Linked Data • Research-driven engineering as a practice/culture
  • 6. WHY USE THE SW TOOLSET? • Sometimes the whole thing looks like an illustration of the Two Fool Rule • Why RDF? o XML is more cumbersome and less expressive o RDF supports inferencing o RDF allows processing of partial information • Why SPARQL? o um, SPARQL is how you query RDF
  • 7. WHY USE SKOS? o it's a simple knowledge organization system o lightweight representation of things we need a lot: o thesauri o taxonomies o classification schemes o it might be a little too simple
  • 8. SKOS: DRIVING INTO A DITCH
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:skos="http://www.w3.org/2004/02/skos/core#">
        <skos:Concept rdf:about="http://www.my.com/#canals">
          <skos:definition>A feature type category for places such as the Erie Canal</skos:definition>
          <skos:prefLabel>canals</skos:prefLabel>
          <skos:altLabel>canal bends</skos:altLabel>
          <skos:altLabel>canalized streams</skos:altLabel>
          <skos:altLabel>ditch mouths</skos:altLabel>
          <skos:altLabel>ditches</skos:altLabel>
          <skos:altLabel>drainage canals</skos:altLabel>
          <skos:altLabel>drainage ditches</skos:altLabel>
          <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
          <skos:related rdf:resource="http://www.my.com/#channels"/>
          <skos:related rdf:resource="http://www.my.com/#locks"/>
          <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
          <skos:related rdf:resource="http://www.my.com/#tunnels"/>
          <skos:scopeNote>Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power</skos:scopeNote>
        </skos:Concept>
      </rdf:RDF>
  • 9. DATA REUSE: DRUGBANK • Acetaminophen vs. Tylenol : CFR regulates by generic name • DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/) o http://www.drugbank.ca/ o Offered as Linked Data by Freie Universität Berlin • DrugBank associates brand names with their components • We offer component names as suggested search terms in Title 21 [*]
  • 10. CAN'T EVERYTHING BE DONE WITH RECYCLED DATA? UM, NO. • Some datasets suck, or don't exist yet • Conversion of existing resources is not painless o Many vocabularies rely on human interpretation o Many vocabularies are not rigorous enough for SKOS encoding (lotta bad SKOS out there)
  • 11. CURATION ISSUES FOR EXISTING DATASETS o Appropriateness, coverage, provenance o Same metadata quality issues as usual o Many systems of subject terms or identifiers not designed for wide exposure: the "on a horse" problem o We’re talking about curation of vocabularies and schemas as much as we are about curation of data.
  • 13. EXTRACTED VOCABULARIES • The big idea: enhance CFR search via term expansion, suggestion, etc. ▪ Reuse existing thesauri ▪ Make a CFR-specific vocabulary by discovering how the CFR talks about itself ▪ Use that knowledge to suggest better search terms • This is not simple phrase or n-gram matching like Google Suggest. • Rather, we discover how words within the CFR relate to each other and we structure them into a hierarchy of terms (SKOS)
  • 14. WHERE DO VOCABULARIES COME FROM? • Input: text elements in the CFR XML • Extraction and patterns: o Anaphora resolution (JavaRAP) o Natural Language Parser (Stanford Parser) o Hearst patterns: o Output: SKOS (Jena)
  • 15. ANAPHORA RESOLUTION • John spent time in a Turkish prison. He is now the executive director of CALI. • Núria stole Sara’s chocolate and stuffed her face with it. (but whose face was it?) • When a sponsor conducting a nonclinical laboratory study intended to be submitted to or reviewed by the Food and Drug Administration utilizes the services of a consulting laboratory, contractor, or grantee to perform an analysis or other service, it shall notify the consulting laboratory, contractor, or grantee that the service is part of a nonclinical laboratory study that must be conducted in compliance with the provisions of this part.
  • 16. STANFORD PARSER ▪ Structured grammar trees & typed dependencies • Noun modifier: nn(product-10, chemical-9) • “product skos:narrower chemical_product” • Conjunctions: conj(doctor-7, practitioner-9) • “doctor skos:related practitioner”
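
The mapping on the slide above, from typed dependencies to SKOS statements, can be sketched roughly as follows. This is a minimal illustration, not the pipeline the authors ran: the function name `dependency_to_skos`, the regular expression, and the choice to handle only `nn` and `conj` are assumptions made for the example.

```python
import re

# Matches Stanford-style typed dependency strings such as "nn(product-10, chemical-9)".
# (Assumed input format: relation(governor-index, dependent-index).)
DEP_RE = re.compile(r"^(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)$")


def dependency_to_skos(dep):
    """Map one typed dependency string to a (subject, predicate, object) triple.

    Returns None for strings that do not parse or relations not handled here.
    """
    match = DEP_RE.match(dep.strip())
    if match is None:
        return None
    relation, governor, _gov_index, dependent, _dep_index = match.groups()
    if relation == "nn":
        # Noun modifier: the modified phrase is narrower than its head noun.
        return (governor, "skos:narrower", f"{dependent}_{governor}")
    if relation == "conj":
        # Conjoined nouns are treated as related concepts.
        return (governor, "skos:related", dependent)
    return None


for dep in ("nn(product-10, chemical-9)", "conj(doctor-7, practitioner-9)"):
    print(dependency_to_skos(dep))
```

A real run would take the dependency strings from the parser output rather than from literals, and would need to decide how to treat every other relation type.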
  • 17. HEARST PATTERNS o lexico-syntactic patterns that indicate hypernymic/hyponymic relations. o NP (,)? (such as | like) (NP,)* (or | and) NP o Example: All vehicles like cars, trucks, and go-karts o PS: o hypernym == word for superset containing term o hyponym == more specific term
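
A rough sketch of the Hearst pattern above as a plain regular expression. It assumes every noun phrase is a single word, which is a simplification; the slides describe working from parser output, where real noun phrases are available. The names `HEARST_RE` and `hearst_pairs` are hypothetical, and the pattern will also fire on "like" used as a verb.

```python
import re

# Simplified Hearst pattern: HYPERNYM (such as | like) NP, NP, (and | or) NP.
# Each noun phrase is assumed to be a single (possibly hyphenated) word.
HEARST_RE = re.compile(
    r"(?P<hypernym>[\w-]+),? (?:such as|like) "
    r"(?P<hyponyms>[\w-]+(?:, (?!and |or )[\w-]+)*(?:,? (?:and|or) [\w-]+)?)"
)


def hearst_pairs(sentence):
    """Return (hypernym, hyponym) pairs found in the sentence."""
    pairs = []
    for match in HEARST_RE.finditer(sentence):
        # Split the enumeration on commas and on the final "and"/"or".
        hyponyms = re.split(r",? (?:and|or) |, ", match.group("hyponyms"))
        pairs.extend((match.group("hypernym"), h) for h in hyponyms if h)
    return pairs


print(hearst_pairs("All vehicles like cars, trucks, and go-karts"))
```

Each pair could then be written out as a `skos:narrower` statement from the hypernym to the hyponym.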
  • 18. principal display panel parser understands “display” as a verb. oops.
  • 19. WHY IS THIS HARD? • Legal text is structurally complicated o Parser dies on long sentences, leading to incorrect extractions • Named entities ("Food, Drug, and Cosmetic Act") confuse the parser o Should be separately extracted/tagged o Parser should think of them as a single token, but doesn't o May need authority files for entities and acronyms, etc. • Corpus is huge (CFR == 96.5 million words) o Strains memory limits and computational resources
  • 20. DEFINITIONS: IMPROVING SEARCH AND PRESENTATION • The big idea: find all terms defined by the reg or statute, and do cool stuff with them, for example o linking terms in text to their definitions o pushing definitions to the top of results when the term is searched for o altering presentation so that a (legally) naive user understands the importance of definitions for, e.g., compliance. • Of course, that also means figuring out what the scope of definitions is.... :(
  • 21. WHERE DO THE DEFINITIONS COME FROM? • Input: heading elements in the CFR XML with the term "definition". • Using regular expressions, we extract o Defined term and definition text o Location of the definition (section of the CFR) o Scoping information: "For the purposes of this part" • Output: SKOS/RDF o defined term --> SKOS Vocabulary
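
A minimal sketch of the regular-expression step described above, assuming definitions follow the "Term means ..." shape and that scope is announced with a phrase like "For the purposes of this part". Both patterns, the function name `extract_definition`, and the `location` string used in the example are illustrative assumptions, not the expressions actually used on the CFR.

```python
import re

# "Term means ..." at the start of a paragraph (assumed shape of a definition).
TERM_RE = re.compile(r"^(?P<term>[A-Z][\w\- ]*?) means:?\s+(?P<text>.+)$", re.DOTALL)
# Scoping phrase such as "For the purposes of this part".
SCOPE_RE = re.compile(
    r"[Ff]or (?:the )?purposes of this (?P<unit>part|subpart|section|chapter)"
)


def extract_definition(paragraph, location, intro=""):
    """Return a dict describing the definition in `paragraph`, or None.

    `location` identifies where the paragraph sits in the CFR; `intro` is the
    introductory text of the definitions section, searched for a scope phrase.
    """
    match = TERM_RE.match(paragraph.strip())
    if match is None:
        return None
    scope = SCOPE_RE.search(intro)
    return {
        "term": match.group("term"),
        "definition": " ".join(match.group("text").split()),
        "location": location,
        "scope": scope.group("unit") if scope else None,
    }


record = extract_definition(
    "Sponsor means: (1) A person who initiates and supports a nonclinical laboratory study",
    location="example-section",  # hypothetical identifier
    intro="For the purposes of this part, the following terms apply:",
)
print(record)
```

Paragraphs that define a term with another verb, or that use "means" in running prose, would slip past or confuse a pattern this simple.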
  • 22. DEFINITIONS: TOOLS • Python Natural Language Toolkit (NLTK) • ElementTree, XML parsing library • Snowball Stemmer Package • RDFlib, an RDF generation library
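
To show what the SKOS/RDF output for one defined term might look like, here is a small sketch that serializes a concept as RDF/XML. The slides list RDFlib for this step; the sketch uses the standard-library ElementTree instead so that it runs without extra packages. The function name `concept_to_rdfxml` and the example URI are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# Use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("skos", SKOS_NS)


def concept_to_rdfxml(uri, term, definition):
    """Serialize one defined term as a skos:Concept in RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    concept = ET.SubElement(
        root, f"{{{SKOS_NS}}}Concept", {f"{{{RDF_NS}}}about": uri}
    )
    ET.SubElement(concept, f"{{{SKOS_NS}}}prefLabel").text = term
    ET.SubElement(concept, f"{{{SKOS_NS}}}definition").text = definition
    return ET.tostring(root, encoding="unicode")


print(
    concept_to_rdfxml(
        "http://example.org/cfr/definitions#sponsor",  # hypothetical URI
        "Sponsor",
        "A person who initiates and supports a nonclinical laboratory study",
    )
)
```

Scope and location would need properties of their own; which vocabulary to use for them is left open here.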
  • 23.
  • 24. WHY THIS IS HARD: FINDING DEFINITIONS o Text containing definition can make it hard to extract. o Sponsor means: o (1) A person who initiates and supports, by provision of financial or other resources, a nonclinical laboratory study; o (2) A person who submits a nonclinical study to the Food and Drug Administration in support of an application for a research or marketing permit o Pattern identification/inconsistencies in sections that are not explicitly meant to be definitions (or, what does “means” mean?)
  • 25. WHY THIS IS HARD: SCOPING DEFINITIONS o Scoping not stated in text, implicit in structure o Complex scoping statements: ▪ "The definitions and interpretations contained in section 201 of the act apply to those terms when used in this part". ▪ "Any term not defined in this part shall have the definition set forth in section 102 of the Act (21 U.S.C. 802), except that certain terms used in part 1316 of this chapter are defined at the beginning of each subpart of that part".
  • 26. SO, WHAT CAN WE DO? [*]
  • 27. IMPROVEMENTS o Vocabulary: better extraction and quality o Definitions: retrieval and completeness o Obligations: false positives, identification of parts o Product Codes: semantic matching
  • 28. FUTURE WORK o RDF-ification, refinement, implementation: ▪ Table III, PTOA, Popular Names ▪ Agency structure o Data management and quality o Crowdsourcing
  • 29. RESOURCES: STANDARDS AND PRIMERS • RDF: o Primer: http://www.w3.org/TR/rdf-primer/ o Advantages: http://www.w3.org/RDF/advantages.html • SKOS o http://www.w3.org/2004/02/skos/
  • 30. MORE RESOURCES • Linked Open Data: o General: http://linkeddata.org/ o Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/ o Government Data: http://logd.tw.rpi.edu/ • W3C Semantic Web resources: o http://www.w3.org/standards/semanticweb/
  • 31. EVEN MORE RESOURCES: RANTS AND RAVES • VoxPop articles on the SW and Law: http://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/ • Mangy dogs: http://liicr.nl/JPcAb2 • Legal Informatics blog: http://legalinformatics.wordpress.com/ • Books on law and the SW: http://liicr.nl/MGRbkA
  • 32. US • Núria o nuria.casellas@liicornell.org o @ncasellas o http://nuriacasellas.blogspot.com • Tom o tom@liicornell.org o @trbruce o http://blog.law.cornell.edu/(tbruce | metasausage)
