Searching The United States Code with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies
pnelson@searchtechnologies.com, rmatamoros@searchtechnologies.com
5/25/2011

Searching the United States Code
§  Who are we:
  •  Paul Nelson, Chief Architect
  •  Ronald Matamoros, Lead Engineer
§  Our Mission: Replace Personal Librarian Search
  •  A 20-Year-Old Search Engine!
§  Key Challenges
  •  How to index this massive, complex, 85-year-old
     document?
  •  How to replicate 20-Year-Old search features?
§  Government Documents are Fun!

Search Technologies
§  The largest independent provider of enterprise
    search expertise and services
§  80 full-time dedicated search engine experts
§  200+ customers
§  Technology Neutral
   •  (yeah, we know Sphinx too)
§  Offices All Over
   •  DC, NY, CA, MD, OH, UK, CR…


A Quick Civics Lesson…
§  The United States Code
  •  The general & permanent laws of the U.S.
     Government – All in one place
  •  51 titles
     §  Agriculture, Armed Forces, Conservation, The President,
         Food and Drugs, Postal Service, Public Health…
  •  First Version: 1926
§  The Office of the Law Revision Counsel (OLRC)
  •  20 lawyers who author the U.S. Code
  •  They report to the Speaker of the House of
     Representatives
§  Bonus Question: Which Title is the largest?
Major Challenges
1.  Document Parsing
  •  A 50 Volume Table Of Contents!


2.  Query Parsing
  •  Custom Features (exact case, exact suffix,
     proximity, query templates, lemmatization, lots
     of fields…)


3.  Searching & Highlighting Fields
  •  Some fields are embedded in the document
  •  These fields must be highlighted in context

Part The First:
Document Processing



Document Processing / Indexing

USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr

Field Type 1: Extracted to Index
Slide callouts: Page Numbers, Heading, Title, Source Credit
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
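The comment markers in the sample above can be read back out with a few regular expressions. The following is a minimal sketch, assuming the `key:value` and `field-start:`/`field-end:` comment conventions shown on the slide hold throughout; the function names `parse_metadata` and `parse_fields` are hypothetical and are not the presenters' parser.

```python
import html
import re

COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)
FIELD = re.compile(
    r"<!--\s*field-start:([\w-]+)\s*-->(.*?)<!--\s*field-end:\1\s*-->",
    re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def parse_metadata(xhtml):
    """Collect key:value pairs from comments that are not field markers."""
    meta = {}
    for body in COMMENT.findall(xhtml):
        if body.startswith(("field-start:", "field-end:")):
            continue
        parts = body.split()
        if all(":" in part for part in parts):
            pairs = parts                # e.g. documentid:14_1 usckey:...
        else:
            pairs = [" ".join(parts)]    # e.g. itempath:/140/PART I/...
        for pair in pairs:
            key, _, value = pair.partition(":")
            meta[key] = value
    return meta


def parse_fields(xhtml):
    """Map each closed field-start/field-end pair to its plain text.

    Fields nested inside a closed outer field are not returned separately
    by this simple version.
    """
    fields = {}
    for name, body in FIELD.findall(xhtml):
        fields[name] = html.unescape(TAG.sub("", body)).strip()
    return fields
```

Values such as `documentPDFPage` could then be sent to the index as ordinary fields, which is what "Extracted to Index" suggests.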
Document Processing / Indexing

USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr

Title 14
   ch. 1    ch. 2    ch. 3    …
   pt. A    pt. B    pt. C    …
   sec. 1   sec. 2   sec. 3   …

Field Type 2: Embedded Refs
Slide callouts: Statute at Large, Public Law, USC Refs, Other
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
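As a rough illustration of spotting two of the reference types called out above (Statute at Large and Public Law), the sketch below runs two regular expressions over a fragment. The patterns and the `find_refs` name are assumptions made for this example; the slides do not show how the Embed Refs stage recognizes or marks references, and this sketch only detects them.

```python
import html
import re

# Assumed citation shapes, taken from the sample text only:
#   "63 Stat. 496"       -> Statute at Large (volume, page)
#   "Pub. L. 94-546"     -> Public Law (congress, law number)
STAT = re.compile(r"\b(\d+) Stat\. (\d+)")
PUBL = re.compile(r"\bPub\. L\. (\d+)[\u2013-](\d+)")


def find_refs(xhtml_fragment):
    """Return (type, citation) pairs found in an XHTML fragment."""
    text = html.unescape(xhtml_fragment)  # turns &ndash; into an en dash
    refs = [("statute-at-large", f"{vol} Stat. {page}")
            for vol, page in STAT.findall(text)]
    refs += [("public-law", f"Pub. L. {congress}-{number}")
             for congress, number in PUBL.findall(text)]
    return refs
```

For the source-credit line in the sample, this yields one Statute at Large reference and one Public Law reference.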
Document Processing / Indexing

USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr

Repository:
§  /US-Code
   §  /2010
      §  /title2
         §  /USC-title2-section1532.htm
         §  /USC-title2-node3-rule5.htm

Part The Second:
Token Processing



Token Processing 1
xhtml tag tokenizer

Input:
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->

Tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->
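A tokenizer with this behavior can be approximated in a few lines. This is a sketch under stated assumptions, not the custom Lucene tokenizer from the talk: comments and tags are kept whole as single tokens, entities are decoded before splitting, and words are runs of ASCII letters and digits. The `tokenize` name is hypothetical.

```python
import html
import re

# Order matters: try comments first, then tags, then plain words.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|[A-Za-z0-9]+", re.DOTALL)


def tokenize(xhtml):
    """Split XHTML into comment, tag, and word tokens.

    Entities are decoded first so that "&mdash;" does not leak the word
    "mdash". A production tokenizer would decode text nodes only, since
    decoding "&lt;" here could create a spurious tag.
    """
    return TOKEN.findall(html.unescape(xhtml))
```

Keeping the `field-start`/`field-end` comments in the token stream is what lets later stages turn them into searchable boundary markers.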
Field Type 3: Marked Within Doc
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
Token Processing 2
Mark Start and End Tags
<!-- field-start:amendment-note -->   S/amendment
<h4 class="note-head">                <h4 class="note-head">
Amendments                            Amendments
</h4>                                 </h4>
<p class="note-body">                 <p class="note-body">
2002                                  2002
Pub                                   Pub
L                                     L
107                                   107
296                                   296
Substituted                           Substituted
Department                            Department
of                                    of
<!-- field-end:amendment-note -->     E/amendment


Token Processing 3
Remove XHTML Tags
S/amendment                  S/amendment
<h4 class="note-head">
Amendments                   Amendments
</h4>
<p class="note-body">
2002                         2002
Pub                          Pub
L                            L
107                          107
296                          296
Substituted                  Substituted
Department                   Department
of                           of
E/amendment                  E/amendment
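The two filter steps above (marking field boundaries, then dropping XHTML tags) could be sketched as one pass over the token list. This assumes tokens arrive as plain strings with comments and tags intact; `mark_and_strip` is a hypothetical name. Note one difference from the slides: this sketch keeps the full field name (`S/amendment-note`), whereas the slides show the shortened `S/amendment`.

```python
import re

MARKER = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")


def mark_and_strip(tokens):
    """Replace field comments with S/ and E/ markers; drop other tags."""
    out = []
    for tok in tokens:
        match = MARKER.fullmatch(tok)
        if match:
            prefix = "S/" if match.group(1) == "start" else "E/"
            out.append(prefix + match.group(2))
        elif tok.startswith("<"):
            continue  # an XHTML tag or unrelated comment: not indexed
        else:
            out.append(tok)
    return out
```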


Token Processing 4
Tag Original Case & Lower Case

S/amendment                S/amendment
Amendments                 O/Amendments    L/amendments
2002                       O/2002          L/2002
Pub                        O/Pub           L/pub
L                          O/L             L/l
107                        O/107           L/107
296                        O/296           L/296
Substituted                O/Substituted   L/substituted
Department                 O/Department    L/department
of                         O/of            L/of
E/amendment                E/amendment
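The case-tagging step can be pictured as emitting two terms per word. In the sketch below each position is a tuple of terms, on the assumption that the `O/` and `L/` variants are indexed at the same token position (the slides do not state this explicitly); `tag_case` is a hypothetical name.

```python
def tag_case(tokens):
    """Emit an original-case (O/) and a lowercase (L/) term per word.

    Field boundary markers (S/..., E/...) pass through unchanged.
    """
    positions = []
    for tok in tokens:
        if tok.startswith(("S/", "E/")):
            positions.append((tok,))
        else:
            positions.append(("O/" + tok, "L/" + tok.lower()))
    return positions
```

With both variants indexed, an `exact:` query can match the `O/` term while ordinary queries match the `L/` term.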




Token Processing 5
Lemmatize
Uses a dictionary-based lemmatizer based on GCIDE and WordNet

S/amendment                         S/amendment
O/Amendments    L/amendments        O/Amendments    L/amendments    amendment
O/2002          L/2002              O/2002          L/2002          2002
O/Pub           L/pub               O/Pub           L/pub           pub
O/L             L/l                 O/L             L/l             l
O/107           L/107               O/107           L/107           107
O/296           L/296               O/296           L/296           296
O/Substituted   L/substituted       O/Substituted   L/substituted   substitute
O/Department    L/department        O/Department    L/department    department
O/of            L/of                O/of            L/of            of
E/amendment                         E/amendment
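A dictionary lookup is enough to illustrate the lemmatize step. The sketch below uses a two-entry toy dictionary in place of the GCIDE/WordNet data mentioned on the slide, and again models each position as a tuple of terms; `LEMMAS` and `add_lemmas` are hypothetical names.

```python
# Toy stand-in for a real lemma dictionary (assumption for illustration).
LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def add_lemmas(positions, lemmas=LEMMAS):
    """Append an untagged lemma term to every position that has an L/ term.

    Words missing from the dictionary fall back to their lowercase form.
    """
    out = []
    for terms in positions:
        lower = next((t[2:] for t in terms if t.startswith("L/")), None)
        if lower is None:
            out.append(terms)  # S/ or E/ marker: nothing to lemmatize
        else:
            out.append(terms + (lemmas.get(lower, lower),))
    return out
```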




Part The Third:
Query Processing



Query Processing
(not all stages shown)

Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search


   §  Communicates via generic QNode Class
         •  Simpler to manipulate than Lucene operators
   §  Can produce FAST FQL as well
         •  (cue the derisive catcalls)
   §  But most importantly:
         •  It is a Query Processing Pipeline
            §  Mix and match query processing modules
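To make the pipeline idea concrete, here is a minimal sketch of a generic query node and two interchangeable stages. The real QNode class is not shown in the slides, so the fields (`op`, `value`, `children`) and the stage functions are assumptions; the stages are also simplified and treat every untagged term the same way.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QNode:
    op: str                      # e.g. "and", "phrase", "exact:", "term"
    value: str = ""
    children: List["QNode"] = field(default_factory=list)


def _rebuild(node, stage):
    """Copy a node, applying the stage to each child."""
    return QNode(node.op, node.value, [stage(c) for c in node.children])


def mark_original(node):
    """Collapse exact:(term) into a single O/-tagged term."""
    if node.op == "exact:":
        return QNode("term", "O/" + node.children[0].value)
    return _rebuild(node, mark_original)


def mark_lowercase(node):
    """Tag every term not already marked O/ with a lowercase L/ form."""
    if node.op == "term":
        if node.value.startswith("O/"):
            return node
        return QNode("term", "L/" + node.value.lower())
    return _rebuild(node, mark_lowercase)


def run_pipeline(node: QNode, stages: List[Callable[[QNode], QNode]]) -> QNode:
    """Apply stages in order; stages can be added, removed, or reordered."""
    for stage in stages:
        node = stage(node)
    return node
```

Because every stage takes and returns a `QNode`, a final stage could emit Lucene queries or another engine's syntax without changing the earlier ones.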


Query Processing
exact:FOIA top secret amendment:RECORDS

Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

Tree after parse:
and
   exact:
      |FOIA|
   phrase
      |top|
      |secret|
   amendment:
      |RECORDS|

Query Processing
exact:FOIA top secret amendment:RECORDS

Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

Tree after mark original:
and
   O/FOIA
   phrase
      |top|
      |secret|
   amendment:
      |RECORDS|

Query Processing
exact:FOIA top secret amendment:RECORDS

Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

Tree after mark lowercase:
and
   O/FOIA
   phrase
      |L/top|
      |L/secret|
   amendment:
      |records|

Query Processing
exact:FOIA top secret amendment:RECORDS

Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

Tree after lemmatize:
and
   O/FOIA
   phrase
      |L/top|
      |L/secret|
   amendment:
      |record|

Query Processing
exact:FOIA top secret amendment:RECORDS

Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

Tree after query template:
and
   O/FOIA
   phrase
      |L/top|
      |L/secret|
   between
      S/amendment
      |record|
      E/amendment

The between() Operator
§  between(start-tag, end-tag, pos-clause, neg-clause)

§  start-tag → Starting tag, e.g. S/amendment
§  end-tag → Ending tag, e.g. E/amendment

§  pos-clause → words which must occur between
    start and end
   •  Note: Requires a nested ScanAnd() operator
§  neg-clause → words which must not occur between
    start and end
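The semantics of between() can be illustrated on an in-memory token list. This is only a sketch of the matching rule described above, assuming one term per position and non-nested start/end markers; it is not the custom Lucene operator from the talk, and it uses plain set containment where the slide mentions a nested ScanAnd().

```python
def between(tokens, start_tag, end_tag, pos_clause, neg_clause=()):
    """Return True if some start..end span contains every word in
    pos_clause and none of the words in neg_clause."""
    span = None  # set of words seen since the last start tag, if any
    for tok in tokens:
        if tok == start_tag:
            span = set()
        elif tok == end_tag and span is not None:
            if set(pos_clause) <= span and not (set(neg_clause) & span):
                return True
            span = None
        elif span is not None:
            span.add(tok)
    return False
```

For example, with tokens `S/amendment amendment record E/amendment`, `between(tokens, "S/amendment", "E/amendment", ["record"])` matches, while a search for `record` inside `S/statute … E/statute` would not.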

Part The Fourth:
Hierarchical Navigation



Hierarchies: Requirements
§  Any number of levels
      §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part,
          Section
§  Levels vary across titles
      §  Title 1: 3 levels
      §  Title 26: 8 levels
§  Multiple views:
      §  Children
      §  Ancestors
      §  Ancestor's Siblings
§  Multiple search scopes:
      §  Only children, all descendants, everything

Hierarchies: Ancestor-Siblings
§  US-Code
  •  Title 1
  •  Title 2
     §  Chapter 1
     §  Chapter 2
         –  Part 1
         –  Part 2
              •  Section 2.1
              •  Section 2.2
         –  Part 3
         –  Part 4
     §  Chapter 3
     §  Chapter 4
  •  Title 3

Hierarchies: Fields
§  ancestors
   •  Searching
      §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§  encodedAncestors – for display only
   •  Where the node exists within the hierarchy
      §  id;heading;subjectTitle//id;heading;subjectTitle//...
      §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//
          USC-title2-chapter25-subchapter2;Subchapter II;Regulatory
          Accountability and Reform
§  parentId – ID of the parent node
      §  USC-title2-chapter25-subchapter2
§  treesort – Hierarchical sort field, e.g. 13/000/0/00882
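The field values above could be derived as follows. This sketch assumes that a container node's ID spells out its path with hyphens, as in the examples on the slide; the helper names are hypothetical.

```python
def ancestor_ids(parent_id):
    """List every ancestor ID, from the root down to parent_id itself."""
    parts = parent_id.split("-")
    return ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ancestors_field(parent_id):
    """Space-separated value for the searchable `ancestors` field."""
    return " ".join(ancestor_ids(parent_id))


def encode_ancestors(nodes):
    """Build the display-only `encodedAncestors` value from
    (id, heading, subjectTitle) tuples."""
    return "//".join(";".join(node) for node in nodes)
```

For a section whose parent is `USC-title2-chapter25-subchapter2`, `ancestors_field` reproduces the four-ID string shown on the slide.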

Hierarchies: Tree Sort
§  Sorting In Print Order
   •  Front Matter → Titles → Tables → etc.
   •  Everything padded to fixed-length

Example: 01/011/1/02032
   •  01 = USC Title
   •  011 = Title 11
   •  1 = An Appendix
   •  02032 = Sequence # in file
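A fixed-width key like this sorts correctly as a plain string. The sketch below builds one from its four parts; the parameter names (`doc_type`, `title`, `appendix`, `sequence`) are assumptions based on the legend above, and the widths are taken from the example.

```python
def treesort_key(doc_type, title, appendix, sequence):
    """Build a zero-padded sort key such as 01/011/1/02032.

    doc_type: 2 digits (01 = USC Title in the example)
    title:    3 digits
    appendix: 1 if the node belongs to an appendix, else 0
    sequence: 5-digit position within the source file
    """
    return f"{doc_type:02d}/{title:03d}/{int(appendix)}/{sequence:05d}"
```

Because every component is padded, lexicographic order equals numeric order: Title 2 sorts before Title 11, which would not be the case for unpadded strings.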




Hierarchies: Sample Searches
§  Assuming Node = USC-title2-chapter25
§  Search Children
   •  parentId:USC-title2-chapter25
§  Search All Descendants
   •  ancestors:USC-title2-chapter25
§  Ancestor Siblings
   •  (parentId:USC OR parentId:USC-title2 OR
      parentId:USC-title2-chapter25)
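The three sample searches can be generated from a node ID. This is a sketch that assumes hyphen-delimited path IDs as in the example node; the function names are hypothetical, and the output is a plain query string using the `parentId` and `ancestors` fields named in the slides.

```python
def children_query(node_id):
    """Match only the direct children of node_id."""
    return f"parentId:{node_id}"


def descendants_query(node_id):
    """Match everything below node_id at any depth."""
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id):
    """Match the children of the root, of each ancestor, and of the node
    itself, which is the set needed to draw an expanded navigation tree."""
    parts = node_id.split("-")
    prefixes = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{p}" for p in prefixes) + ")"
```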




Contact
§  Paul Nelson
   •  pnelson@searchtechnologies.com
§  Ronald Matamoros
   •  rmatamoros@searchtechnologies.com
§  Search Technologies
   •  http://searchtechnologies.com




                                          40

Weitere ähnliche Inhalte

Andere mochten auch

Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Lucidworks (Archived)
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationLucidworks (Archived)
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル彰 村地
 
Hellosong
HellosongHellosong
Hellosongtanica
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrellaguest986e5ae
 
Discover the new techniques about search application
Discover the new techniques about search applicationDiscover the new techniques about search application
Discover the new techniques about search applicationLucidworks (Archived)
 
Using Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User ExperienceUsing Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User ExperienceLucidworks (Archived)
 
Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemLucidworks (Archived)
 
Zombie
ZombieZombie
Zombietanica
 
Civil War
Civil WarCivil War
Civil Wartanica
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Tv ролики
Tv роликиTv ролики
Tv роликиtarodnova
 

Andere mochten auch (18)

Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"
 
All Data Big and Small
All Data Big and SmallAll Data Big and Small
All Data Big and Small
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to Information
 
What’s New in Apache Lucene 3.0
What’s New in Apache Lucene 3.0What’s New in Apache Lucene 3.0
What’s New in Apache Lucene 3.0
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル
 
Hellosong
HellosongHellosong
Hellosong
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrella
 
Discover the new techniques about search application
Discover the new techniques about search applicationDiscover the new techniques about search application
Discover the new techniques about search application
 
Using Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User ExperienceUsing Solr in Online Travel Shopping to Improve User Experience
Using Solr in Online Travel Shopping to Improve User Experience
 
Jazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search ProblemJazeed about Solr - People as A Search Problem
Jazeed about Solr - People as A Search Problem
 
What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9
 
Zombie
ZombieZombie
Zombie
 
Civil War
Civil WarCivil War
Civil War
 
Portades
PortadesPortades
Portades
 
Linked In Introduction
Linked In IntroductionLinked In Introduction
Linked In Introduction
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Tv ролики
Tv роликиTv ролики
Tv ролики
 

Ähnlich wie Searching The United States Code with Solr/Lucene

Searching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald MatamorosSearching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald Matamoroslucenerevolution
 
Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)Alison Hitchens
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs HadoopFujio Turner
 
Paper_Scalable database logging for multicores
Paper_Scalable database logging for multicoresPaper_Scalable database logging for multicores
Paper_Scalable database logging for multicoresHyo jeong Lee
 
W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in databasegafurov_x
 
Apache solr tech doc
Apache solr tech docApache solr tech doc
Apache solr tech docBarot Sagar
 
Ugif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugifUgif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugifUGIF
 
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAPOpen Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAPPieter De Leenheer
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Kai Chan
 
SPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseumSPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseumThomas Francart
 
Oracle10g New Features I
Oracle10g New Features IOracle10g New Features I
Oracle10g New Features IDenish Patel
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with ElasticsearchSamantha Quiñones
 

Ähnlich wie Searching The United States Code with Solr/Lucene (13)

Searching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald MatamorosSearching The United States Code with Solr/Lucene - By Ronald Matamoros
Searching The United States Code with Solr/Lucene - By Ronald Matamoros
 
Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)Trouble-shooting Tips for Primo (2013)
Trouble-shooting Tips for Primo (2013)
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
 
Paper_Scalable database logging for multicores
Paper_Scalable database logging for multicoresPaper_Scalable database logging for multicores
Paper_Scalable database logging for multicores
 
W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in database
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 
Apache solr tech doc
Apache solr tech docApache solr tech doc
Apache solr tech doc
 
Ugif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugifUgif 10 2012 beauty ofifmxdiskstructs ugif
Ugif 10 2012 beauty ofifmxdiskstructs ugif
 
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAPOpen Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
Open Standards for the Semantic Web: XML / RDF(S) / OWL / SOAP
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
 
SPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseumSPARQL queries on CIDOC-CRM data of BritishMuseum
SPARQL queries on CIDOC-CRM data of BritishMuseum
 
Oracle10g New Features I
Oracle10g New Features IOracle10g New Features I
Oracle10g New Features I
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with Elasticsearch
 

Searching The United States Code with Solr/Lucene

  • 1. Searching The United States Code with Solr/Lucene. Paul Nelson (pnelson@searchtechnologies.com) / Ronald Matamoros (rmatamoros@searchtechnologies.com), Search Technologies, 5/25/2011
  • 2. Searching the United States Code §  Who are we: •  Paul Nelson, Chief Architect •  Ronald Matamoros, Lead Engineer §  Our Mission: Replace Personal Librarian Search •  A 20-Year-Old Search Engine! §  Key Challenges •  How to index this massive, complex, 85-year-old document? •  How to replicate 20-year-old search features? §  Government Documents are Fun!
  • 3. Search Technologies §  The largest independent provider of enterprise search expertise and services §  80 full-time dedicated search engine experts §  200+ customers §  Technology Neutral •  (yeah, we know Sphinx too) §  Offices All Over •  DC, NY, CA, MD, OH, UK, CR…
  • 4. A Quick Civics Lesson… §  The United States Code •  The general & permanent laws of the U.S. Government – All in one place •  51 titles §  Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health… •  First Version: 1926 §  The Office of the Law Revision Counsel (OLRC) •  20 lawyers who author the U.S. Code •  They report to the Speaker of the House of Representatives §  Bonus Question: Which Title is the largest?
  • 5. Major Challenges 1.  Document Parsing •  A 50 Volume Table Of Contents! 2.  Query Parsing •  Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…) 3.  Searching & Highlighting Fields •  Some fields are embedded in the document •  These fields must be highlighted in context
  • 10. Part The First: Document Processing
  • 11. Document Processing / Indexing. Pipeline diagram: USC Title, Parse & Granularize, Embed Refs, Construct XHTML, Xform & Store, Index; stores: Solr, Repository
  • 12. Field Type 1: Extracted to Index (slide callouts: Page Numbers, Heading, Title, Source Credit)
    <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
    <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
    <!-- itemsortkey:140AAAD -->
    <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
    <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
    <!-- field-end:head -->
    <!-- field-start:statute -->
    <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
    <!-- field-end:statute -->
    <!-- field-start:sourcecredit -->
    <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
    <!-- field-end:sourcecredit -->
    <!-- field-start:notes -->
    <!-- field-start:historicalandrevision-note -->
    <h4 class="note-head">Historical and Revision Notes</h4>
    <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
    <!-- field-end:historicalandrevision-note -->
    <!-- field-start:amendment-note -->
    <h4 class="note-head">Amendments</h4>
    <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
    <!-- field-end:amendment-note -->
    <!-- field-start:effectivedate-amendment-note -->
    <h4 class="note-head">Effective Date of 2002 Amendment</h4>
    <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
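The "extracted to index" fields in this sample live in XHTML comments as `key:value` pairs. A minimal sketch of pulling them out might look like the following; the function name `extract_metadata` and the parsing rules (a key is a word followed directly by a colon, `field-start`/`field-end` markers are skipped) are assumptions made for illustration, not the presenters' implementation.

```python
import re

# Any XHTML comment body.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Assumed key syntax: a word at the start of the comment or after
# whitespace, followed directly by a colon.
KEY = re.compile(r"(?:^|\s)([A-Za-z][\w-]*):")


def extract_metadata(xhtml: str) -> dict:
    """Collect key:value pairs from comments, skipping field markers."""
    fields = {}
    for comment in COMMENT.findall(xhtml):
        body = comment.strip()
        keys = list(KEY.finditer(body))
        for i, match in enumerate(keys):
            name = match.group(1)
            if name in ("field-start", "field-end"):
                continue  # these delimit text spans, not index fields
            # The value runs up to the next key (or the end of the comment).
            end = keys[i + 1].start() if i + 1 < len(keys) else len(body)
            fields[name] = body[match.end():end].strip()
    return fields


sample = (
    "<!-- documentid:14_1 usckey:140000000000100000000000000000000 "
    "currentthrough:20080108 documentPDFPage:3 -->\n"
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->\n"
    "<!-- itemsortkey:140AAAD -->\n"
    '<!-- field-start:head --><h3 class="section-head">&sect;1. '
    "Establishment of Coast Guard</h3>\n"
    "<!-- field-end:head -->"
)
metadata = extract_metadata(sample)
```

Values that contain spaces, such as `itempath`, survive because only a word followed by a colon starts a new key.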
  • 13. Document Processing / Indexing (same pipeline diagram as slide 11). Granularization example: Title 14; ch. 1, ch. 2, ch. 3, …; pt. A, pt. B, pt. C, …; sec. 1, sec. 2, sec. 3, …
  • 14. Field Type 2: Embedded Refs (slide callouts: Statute at Large, Public Law, USC Refs, Other)
    <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
    <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
    <!-- itemsortkey:140AAAD -->
    <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
    <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
    <!-- field-end:head -->
    <!-- field-start:statute -->
    <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
    <!-- field-end:statute -->
    <!-- field-start:sourcecredit -->
    <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
    <!-- field-end:sourcecredit -->
    <!-- field-start:notes -->
    <!-- field-start:historicalandrevision-note -->
    <h4 class="note-head">Historical and Revision Notes</h4>
    <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
    <!-- field-end:historicalandrevision-note -->
    <!-- field-start:amendment-note -->
    <h4 class="note-head">Amendments</h4>
    <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
    <!-- field-end:amendment-note -->
    <!-- field-start:effectivedate-amendment-note -->
    <h4 class="note-head">Effective Date of 2002 Amendment</h4>
    <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
  • 15. Document Processing / Indexing (same pipeline diagram as slide 11)
  • 16. Document Processing / Indexing (same pipeline diagram as slide 11). Repository layout:
    /US-Code
      /2010
        /title2
          /USC-title2-section1532.htm
          /USC-title2-node3-rule5.htm
  • 17. Part The Second: Token Processing
  • 18. Token Processing 1: XHTML tag tokenizer
    Input:
      <!-- field-start:amendment-note -->
      <h4 class="note-head">Amendments</h4>
      <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
      <!-- field-end:amendment-note -->
    Output tokens:
      <!-- field-start:amendment-note --> | <h4 class="note-head"> | Amendments | </h4> | <p class="note-body"> | 2002 | Pub | L | 107 | 296 | Substituted | Department | of | <!-- field-end:amendment-note -->
  • 19. Field Type 3: Marked Within Doc
    <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
    <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
    <!-- itemsortkey:140AAAD -->
    <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
    <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
    <!-- field-end:head -->
    <!-- field-start:statute -->
    <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
    <!-- field-end:statute -->
    <!-- field-start:sourcecredit -->
    <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
    <!-- field-end:sourcecredit -->
    <!-- field-start:notes -->
    <!-- field-start:historicalandrevision-note -->
    <h4 class="note-head">Historical and Revision Notes</h4>
    <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
    <!-- field-end:historicalandrevision-note -->
    <!-- field-start:amendment-note -->
    <h4 class="note-head">Amendments</h4>
    <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
    <!-- field-end:amendment-note -->
    <!-- field-start:effectivedate-amendment-note -->
    <h4 class="note-head">Effective Date of 2002 Amendment</h4>
    <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
  • 20. Token Processing 2: Mark Start and End Tags
    <!-- field-start:amendment-note --> → S/amendment
    <!-- field-end:amendment-note --> → E/amendment
    Unchanged: <h4 class="note-head"> | Amendments | </h4> | <p class="note-body"> | 2002 | Pub | L | 107 | 296 | Substituted | Department | of
  • 21. Token Processing 3: Remove XHTML Tags
    Input: S/amendment | <h4 class="note-head"> | Amendments | </h4> | <p class="note-body"> | 2002 | Pub | L | 107 | 296 | Substituted | Department | of | E/amendment
    Output: S/amendment | Amendments | 2002 | Pub | L | 107 | 296 | Substituted | Department | of | E/amendment
  • 22. Token Processing 4: Tag Original Case & Lower Case
    S/amendment → S/amendment
    Amendments → O/Amendments L/amendments
    2002 → O/2002 L/2002
    Pub → O/Pub L/pub
    L → O/L L/l
    107 → O/107 L/107
    296 → O/296 L/296
    Substituted → O/Substituted L/substituted
    Department → O/Department L/department
    of → O/of L/of
    E/amendment → E/amendment
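Token-processing steps 1 through 4 can be sketched in a few lines. This is an illustrative approximation rather than the presenters' Lucene filter chain: the function name `token_stream`, the regular expressions, and the rule that a trailing `-note` is dropped from the marker name (inferred from `amendment-note` becoming `S/amendment` on the slide) are all assumptions. Each position holds a list of terms so that the `O/` and `L/` variants of a word stay together.

```python
import html
import re

# Step 1: split the XHTML into comments, tags, and text runs.
PIECE = re.compile(r"<!--.*?-->|<[^>]+>|[^<]+", re.DOTALL)
# Step 2: recognise the field markers (assumed naming rule, see lead-in).
FIELD_MARK = re.compile(
    r"<!--\s*field-(start|end):([\w-]+?)(?:-note)?\s*-->"
)
WORD = re.compile(r"\w+")


def token_stream(xhtml):
    """Return a list of positions; each position is a list of terms."""
    positions = []
    for piece in PIECE.findall(xhtml):
        mark = FIELD_MARK.fullmatch(piece)
        if mark:
            prefix = "S/" if mark.group(1) == "start" else "E/"
            positions.append([prefix + mark.group(2)])
        elif piece.startswith("<"):
            continue  # Step 3: drop ordinary tags and other comments
        else:
            # Step 4: emit original-case and lower-case variants together.
            for word in WORD.findall(html.unescape(piece)):
                positions.append(["O/" + word, "L/" + word.lower()])
    return positions


sample = (
    "<!-- field-start:amendment-note -->\n"
    '<h4 class="note-head">Amendments</h4>\n'
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    "&ldquo;Department of\n"
    "<!-- field-end:amendment-note -->"
)
positions = token_stream(sample)
```

Keeping both variants at one position is what lets a case-sensitive term and a case-insensitive term match the same word in the text.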
  • 23. Token Processing 5: Lemmatize. Uses a dictionary-based lemmatizer built on GCIDE and WordNet
    S/amendment → S/amendment
    O/Amendments L/amendments → O/Amendments L/amendments amendment
    O/2002 L/2002 → O/2002 L/2002 2002
    O/Pub L/pub → O/Pub L/pub pub
    O/L L/l → O/L L/l l
    O/107 L/107 → O/107 L/107 107
    O/296 L/296 → O/296 L/296 296
    O/Substituted L/substituted → O/Substituted L/substituted substitute
    O/Department L/department → O/Department L/department department
    O/of L/of → O/of L/of of
    E/amendment → E/amendment
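The lemmatization step adds an unprefixed lemma next to the `O/` and `L/` terms at each position. A rough sketch follows, assuming a plain word-to-lemma lookup; the tiny `LEMMAS` table is a hypothetical stand-in for the GCIDE/WordNet dictionary, and `add_lemmas` is an illustrative name.

```python
# Hypothetical stand-in for a GCIDE/WordNet-derived dictionary.
LEMMAS = {
    "amendments": "amendment",
    "substituted": "substitute",
    "records": "record",
}


def add_lemmas(positions, lemmas=LEMMAS):
    """Append the lemma of each lower-cased term at the same position."""
    result = []
    for terms in positions:
        expanded = list(terms)
        for term in terms:
            if term.startswith("L/"):
                word = term[2:]
                # Words missing from the dictionary are their own lemma.
                expanded.append(lemmas.get(word, word))
        result.append(expanded)
    return result


lemmatized = add_lemmas([
    ["S/amendment"],
    ["O/Amendments", "L/amendments"],
    ["O/of", "L/of"],
    ["E/amendment"],
])
```

Start and end markers carry no `L/` term, so they pass through unchanged.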
  • 24. Part The Third: Query Processing
  • 25. Query Processing (not all stages shown). Pipeline: Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search §  Communicates via generic QNode Class •  Simpler to manipulate than Lucene operators §  Can produce FAST FQL as well •  (cue the derisive catcalls) §  But most importantly: •  It is a Query Processing Pipeline §  Mix and match query processing modules
  • 26. Query Processing. Query: exact:FOIA top secret amendment:RECORDS. Pipeline: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search. Query tree: and( exact:( |FOIA| ), phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 27. Query Processing. Query: exact:FOIA top secret amendment:RECORDS (same pipeline diagram as slide 26). Query tree: and( O/FOIA, phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 28. Query Processing. Query: exact:FOIA top secret amendment:RECORDS (same pipeline diagram as slide 26). Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |records| ) )
  • 29. Query Processing. Query: exact:FOIA top secret amendment:RECORDS (same pipeline diagram as slide 26). Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |record| ) )
  • 30. Query Processing. Query: exact:FOIA top secret amendment:RECORDS (same pipeline diagram as slide 26). Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), between( S/amendment, |record|, E/amendment ) )
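The query slides walk one query tree through a chain of rewriting stages. The sketch below shows how such a mix-and-match pipeline could be organised. The slides name a generic QNode class but do not show it, so the class layout, the stage functions, the `rewrite` helper, and the one-entry `LEMMAS` table here are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QNode:
    op: str              # "and", "phrase", "term", "exact", "field", "between"
    value: str = ""      # term text, or the field name for "field"/"between"
    children: List["QNode"] = field(default_factory=list)


def rewrite(node, stage):
    """Apply one stage bottom-up to every node of the tree."""
    kids = [rewrite(child, stage) for child in node.children]
    return stage(QNode(node.op, node.value, kids))


def mark_original(node):
    # exact:WORD becomes a term that matches the original-case token.
    if node.op == "exact":
        return QNode("term", "O/" + node.children[0].value)
    return node


def mark_lowercase(node):
    # Phrase words match the L/ token; other bare terms are lower-cased.
    if node.op == "phrase":
        kids = [QNode("term", "L/" + c.value.lower()) for c in node.children]
        return QNode("phrase", "", kids)
    if node.op == "term" and "/" not in node.value:
        return QNode("term", node.value.lower())
    return node


LEMMAS = {"records": "record"}  # hypothetical stand-in dictionary


def lemmatize(node):
    if node.op == "term" and "/" not in node.value:
        return QNode("term", LEMMAS.get(node.value, node.value))
    return node


def apply_template(node):
    # A field restriction becomes a between() over its start/end markers.
    if node.op == "field":
        return QNode("between", node.value, node.children)
    return node


PIPELINE = [mark_original, mark_lowercase, lemmatize, apply_template]


def process(query):
    for stage in PIPELINE:
        query = rewrite(query, stage)
    return query


def show(node):
    if node.op == "term":
        return node.value
    args = [show(child) for child in node.children]
    if node.op == "between":
        args = ["S/" + node.value, "E/" + node.value] + args
    return node.op + "(" + ", ".join(args) + ")"


parsed = QNode("and", "", [
    QNode("exact", "", [QNode("term", "FOIA")]),
    QNode("phrase", "", [QNode("term", "top"), QNode("term", "secret")]),
    QNode("field", "amendment", [QNode("term", "RECORDS")]),
])
final = show(process(parsed))
```

Because every stage is a plain function from node to node, stages can be reordered or swapped by editing the `PIPELINE` list, which is the property the slides emphasise.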
  • 31. The between() Operator §  between(start-tag, end-tag, pos-clause, neg-clause) §  start-tag → Starting tag, e.g. S/amendment §  end-tag → Ending tag, e.g. E/amendment §  pos-clause → words which must occur between start and end •  Note: Requires a nested ScanAnd() operator §  neg-clause → words which must not occur between start and end
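The semantics of between() can be illustrated over a plain in-memory token stream. This sketch only models the matching rule described on the slide (all positive terms inside one start/end span, no negative terms inside it); it is not the Lucene operator itself, it does not handle phrases, and the function signature is an assumption.

```python
def between(positions, start_tag, end_tag, must, must_not=()):
    """True if one start_tag..end_tag span holds every term in `must`
    and no term in `must_not`. `positions` is a list of term lists."""
    inside = None  # terms seen since the most recent start tag
    for terms in positions:
        if start_tag in terms:
            inside = set()
        elif end_tag in terms and inside is not None:
            if set(must) <= inside and not (set(must_not) & inside):
                return True
            inside = None
        elif inside is not None:
            inside.update(terms)
    return False


doc = [
    ["S/amendment"],
    ["O/Amendments", "L/amendments", "amendment"],
    ["O/top", "L/top", "top"],
    ["O/secret", "L/secret", "secret"],
    ["O/Records", "L/records", "record"],
    ["E/amendment"],
    ["O/FOIA", "L/foia", "foia"],
]
hit = between(doc, "S/amendment", "E/amendment", ["L/top", "record"])
miss = between(doc, "S/amendment", "E/amendment", ["O/FOIA"])
```

Here `miss` is false because FOIA occurs only after the end marker, which is the in-context restriction that a separate Solr field would not give for highlighting.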
  • 34. Hierarchies: Requirements §  Any number of levels §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section §  Levels vary across titles §  Title 1: 3 levels §  Title 26: 8 levels §  Multiple views: §  Children §  Ancestors §  Ancestor's Siblings §  Multiple search scopes: §  Only children, all descendants, everything
  • 35. Hierarchies: Ancestor-Siblings
    US-Code
      Title 1
      Title 2
        Chapter 1
        Chapter 2
          Part 1
          Part 2
            Section 2.1
            Section 2.2
          Part 3
          Part 4
        Chapter 3
        Chapter 4
      Title 3
  • 36. Hierarchies: Fields §  ancestors •  Searching §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2 §  encodedAncestors – for display only •  Where the node exists within the hierarchy §  id;heading;subjectTitle//id;heading;subjectTitle//... §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform §  parentId – ID of the parent node §  USC-title2-chapter25-subchapter2 §  treesort – Hierarchical sort field, e.g. 13/000/0/00882
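The three hierarchy fields can all be derived from a node's chain of ancestors. A small sketch, assuming the indexer already knows that chain as (id, heading, subjectTitle) tuples ordered from the top down; the function name and the input shape are hypothetical, while the field names and separators follow the slide.

```python
def hierarchy_fields(ancestor_chain):
    """Build ancestors, encodedAncestors and parentId for one node.

    `ancestor_chain` lists (id, heading, subject_title) tuples from the
    outermost ancestor down to the node's direct parent.
    """
    ids = [node_id for node_id, _, _ in ancestor_chain]
    return {
        # Space-separated so each ancestor id is a searchable term.
        "ancestors": " ".join(ids),
        # Display-only encoding: id;heading;subjectTitle joined by //.
        "encodedAncestors": "//".join(
            ";".join(entry) for entry in ancestor_chain
        ),
        "parentId": ids[-1],
    }


fields = hierarchy_fields([
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
])
```

The sketch takes the chain as input rather than deriving it from the node id, since ids such as `USC-title2-section1532` do not spell out every level above them.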
  • 37. Hierarchies: Tree Sort §  Sorting In Print Order •  Front Matter → Titles → Tables → etc. •  Everything padded to fixed-length. Example: 01/011/1/02032, where 01 = USC Title Sequence # in file, 011 = Title 11, 1 = An Appendix
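Fixed-width padding is what makes a plain string sort reproduce print order. A minimal sketch of building such a key follows; the slide explains only the first three components, so treating the last one as a five-digit node sequence number is an assumption, as are the function and parameter names.

```python
def treesort_key(file_seq, title, is_appendix, node_seq):
    """Zero-pad every component so string order equals numeric order."""
    return f"{file_seq:02d}/{title:03d}/{int(is_appendix)}/{node_seq:05d}"


example = treesort_key(1, 11, True, 2032)
keys = sorted([
    treesort_key(1, 11, False, 5),
    treesort_key(1, 2, False, 900),
])
```

Without the padding, Title 11 would sort before Title 2 in a string comparison.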
  • 38. Hierarchies: Sample Searches §  Assuming Node = USC-title2-chapter25 §  Search Children •  parentId:USC-title2-chapter25 §  Search All Descendants •  ancestors:USC-title2-chapter25 §  Ancestor Siblings •  (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
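The three sample searches differ only in which field they query and which ids they list. The helpers below are a sketch that builds the same query strings shown on the slide; the function names are hypothetical, and the ancestor ids are passed in explicitly rather than derived from the node id.

```python
def children_query(node_id):
    return f"parentId:{node_id}"


def descendants_query(node_id):
    return f"ancestors:{node_id}"


def ancestor_siblings_query(ancestor_ids, node_id):
    """Children of every ancestor plus children of the node itself."""
    clauses = [f"parentId:{i}" for i in [*ancestor_ids, node_id]]
    return "(" + " OR ".join(clauses) + ")"


node = "USC-title2-chapter25"
siblings = ancestor_siblings_query(["USC", "USC-title2"], node)
```

The ancestor-siblings query returns every node needed to draw the expanded tree around the current node, one level under each ancestor.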
  • 39. Contact §  Paul Nelson •  pnelson@searchtechnologies.com §  Ronald Matamoros •  rmatamoros@searchtechnologies.com §  Search Technologies •  http://searchtechnologies.com