Books and Webs: Pulling the Down Rows
Peter Brantley
Internet Archive
The Presidio, 11.09
  
Essential premise:
combining web search with book search is an engineering challenge
  
I. Presenting combined search
  
For several years, I served the University of California as the Director of Technology for the California Digital Library.
(the digital library group for the UC system)
  
We held various conversations over time with Google engineers in similar spaces ...
grappling with the indexing, search, and user interface issues with combined but disparate content pools (books, journals, web, image, video).
(an important issue for digital libraries)
  
In academic info markets, “metasearch” (distributed queries with central resolution) contested for primacy with search over aggregated content.
To an extent, only LANL and commercial search pursued aggregation at scale.
Aggregation wins.
  	
  	
  
“Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.”
“With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.”
Danny Sullivan, “Google 2.0”, May 16 2007, Search Engine Land
  
Simple search box ... but
User search intentionality for books vs. web can differ
     “mark twain hawai’i”
  
Google Scholar is a vertical search engine.
Explicit opt-in discovery service for STM journal content, utilized in HE academia.
Many concerns with combining the Scholar product with Big Daddy. User search goals differ; content distinct; different indexing.
  	
  	
  
From 2007 to early 2009, I was the Director of the Digital Library Federation. I made a request of Google to update members on GBS status at DLF’s Fall Forum, Nov. 2008.
They issued an explicit request for HE CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well solved problem”.
  	
  
Some comparisons between web pages and books.
  
web:   short doc (web page) length
books: long doc (book) length

web:   high data density (per doc size)
books: highly variant data density (e.g. fiction vs. non-fiction)

web:   trillions of unique web pages
books: (low) millions of unique books

web:   many complex media types
books: text and image media

web:   dynamic over time (avg. TTL of web pages is short)
books: static over time (print books permanently fixed)

web:   single instances (web pages)
books: duplicate instances (copies), similar instances (editions), in multiple languages

web:   hyperlinked in/out (useful in relevance)
books: normally quiescent (sometimes citations)

web:   designed component structure {page hierarchy > web site}
books: artificial component structure {page images > book}
  
Bibliographic data cf. full text (book) data:

The Melvyl Recommender Project
Full Text Extension
(Supplementary Report)
California Digital Library
October 2006

Funded by the Andrew W. Mellon Foundation

Project Lead
     Peter Brantley, Director of Technology

Implementation Team
     Kirk Hastings, Text Systems Designer
     Martin Haye, Programmer (Contractor)
     Steve Toub, Web Design Manager
     Colleen Whitney, Programmer and Coordinator

Assessment Team
     Jane Lee, Assessment Analyst
     Felicia Poe, Assessment Coordinator
     Lisa Schiff, Digital Ingest Programmer
  
Often many different editions of popular books.
Can easily artificially boost search (n_copies).

e.g. “Moby Dick” published 100s of times (and in many languages)

Depending on publication date:
     either public domain (dep. on country)
     or in-copyright (out-of-print or in-print)
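
One way to keep the number of copies and editions from inflating a work's rank is to collapse hits to a work-level key before ranking. The sketch below is illustrative only: the `work_key` normalization, the hit tuples, and the scores are hypothetical and not something described in the talk, and a key this crude would not match translated editions.

```python
import re
import unicodedata
from collections import defaultdict


def work_key(title, author):
    """Crude work-level key: lowercase, strip accents and punctuation."""
    def norm(text):
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
        return " ".join(text.split())
    return norm(title), norm(author)


def collapse_editions(hits):
    """Group (title, author, score) hits by work and keep the best score.

    The number of editions is reported but never added to the score.
    """
    groups = defaultdict(list)
    for title, author, score in hits:
        groups[work_key(title, author)].append(score)
    collapsed = [(key, max(scores), len(scores)) for key, scores in groups.items()]
    return sorted(collapsed, key=lambda row: row[1], reverse=True)


# Hypothetical hits: three editions of one work plus one other work.
hits = [
    ("Moby-Dick", "Herman Melville", 7.1),
    ("Moby Dick", "Herman Melville", 6.8),
    ("MOBY DICK.", "Herman Melville", 6.9),
    ("Typee", "Herman Melville", 3.2),
]
for key, best, n_editions in collapse_editions(hits):
    print(key, best, n_editions)
```

Reporting the edition count separately leaves it available as a display feature ("3 editions") without letting it act as a relevance boost.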
  
In CDL tests, for texts vs. bib records:
Search scores for full-text documents were typically 10 to 100 times larger than for metadata-only records.
(Probably the approximate magnitude cf. representative web pages as well.)
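
When raw scores from one index run 10 to 100 times larger than those from another, some rescaling is needed before results can share one ranked list. A minimal sketch, assuming per-source min-max normalization; this is an illustrative choice, not the method used by CDL or Google, and the document ids and scores are made up.

```python
def min_max(scores):
    """Rescale a list of raw scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(sources):
    """Merge per-source result lists after normalizing each source separately.

    `sources` maps a source name to a list of (doc_id, raw_score) pairs.
    """
    merged = []
    for name, results in sources.items():
        if not results:
            continue
        scaled = min_max([score for _, score in results])
        merged.extend((doc_id, name, s) for (doc_id, _), s in zip(results, scaled))
    return sorted(merged, key=lambda row: row[2], reverse=True)


# Hypothetical raw scores: full-text hits score far higher than metadata hits.
sources = {
    "full_text": [("book-1", 840.0), ("book-2", 410.0), ("book-3", 120.0)],
    "metadata": [("record-1", 9.0), ("record-2", 6.0), ("record-3", 1.5)],
}
ranked = blend(sources)
for doc_id, name, score in ranked:
    print(f"{score:.2f}  {name:9}  {doc_id}")
```

Sorted on raw scores, all three full-text hits would outrank every metadata record; after rescaling the two sources interleave. Min-max scaling is sensitive to outliers, so a real system would likely need something more robust.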
  
Easy for a single work to overwhelm web pages in relevance for a well-fitting query.

E.g. “English working class labor industrial”

  The making of the English working class.
  Author: E P Thompson
  Publisher: New York, Pantheon Books
  [1964, ©1963]
  
Books are long strings of many words, split into n_sized chunks for parsing.

Term indexing based on overlapping and variant length “word vectors”

     “battle” “of” “britain”
     “battle of” “britain”
     “battle” “of britain”
     “battle of britain”
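
The four groupings above are every way of cutting a three-word phrase into contiguous pieces. A minimal sketch of generating them, plus the set of distinct overlapping terms they produce; the function names are hypothetical and this is not the CDL indexer.

```python
from itertools import combinations


def segmentations(tokens):
    """Yield every way of cutting `tokens` into contiguous phrases."""
    n = len(tokens)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]


def index_terms(tokens):
    """All distinct contiguous word n-grams, i.e. the overlapping terms."""
    n = len(tokens)
    return {" ".join(tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)}


phrase = ["battle", "of", "britain"]
for seg in segmentations(phrase):
    print(seg)
print(sorted(index_terms(phrase)))
```

A phrase of n words has 2^(n-1) segmentations, so a practical index would presumably cap the phrase length rather than enumerate them all.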
  
{Search Term} and {Document} weights

1.    How often is a search term found within a given sized chunk of text?
2.    How many chunks of text is the term found within?
3.    How many chunks of text does the document contain?
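
The three questions above can be read as three counts over a chunked document. A minimal sketch, assuming fixed-size chunks and single-word terms; the chunk size and sample text are made up, and how the counts are combined into a weight is left open because the slides do not say.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def chunk_stats(tokens, term, size):
    """Return the three counts for one single-word term in one document."""
    chunks = chunk(tokens, size)
    per_chunk = [c.count(term) for c in chunks]        # 1. hits within each chunk
    chunks_with_term = sum(1 for n in per_chunk if n)  # 2. chunks containing the term
    total_chunks = len(chunks)                         # 3. chunks in the document
    return per_chunk, chunks_with_term, total_chunks


text = "the whale and the whale at sea the ship at sea the white whale"
tokens = text.split()
per_chunk, with_term, total = chunk_stats(tokens, "whale", size=5)
print(per_chunk, with_term, total)  # [2, 0, 1] 2 3
```

A book yields thousands of chunks where a web page yields a handful, which is why the third count matters when both kinds of document share one index.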
  
Which is better?

1.    Adequate matches over many fields,
2.    Better matches in fewer fields.

Metrics vary between books and web.
One learns from one’s mistakes.
More books, more mistakes.
  	
  
1.    Books are sooo much longer than web pages.
2.    Books produce 1000’s more chunks than web.
3.    Term weighting is very complex for long docs.
4.    Indexes must be integrated for web and books.
5.    But source term indexes are biased differently.
  
II. What you get from books
  
The dialectic between books and web provides benefits from their integration (no matter the pain).

Books enrich general web search, not just via the data within books, but also by books-as-data.
  
All search is made smarter by analysis.

1.    structure
2.    contextualization
3.    relatedness
4.    normalization
5.    association
  
Because of digitization, books have complications cf. web pages; a result of OCR.

1.    Language detection
2.    Determining which words get indexed
      (minus stop words like “of”, “a”, “the”, etc.)
3.    OCR mistakes hamper word recognition
  
Common OCR traps:

     embedded languages
     Latin or archaic spelling
     complex scripts (e.g. captions)
     hyphenated words
  	
  
    ricain       ricanant
    ricaine      ricanante
    ricaines     ricane
    ricana       ricamente
    ricanai      ricanement
    ricains      ricanements
    rical        rican
    rically      ricanes
    ricals       ricans
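
Line-end hyphenation is one way word fragments can end up in an index of OCR text. A common mitigation is to rejoin a word broken across a line end when the joined form appears in a vocabulary. The sketch below assumes a tiny hypothetical lexicon (`KNOWN_WORDS`); a real system would need a large lexicon per language, and nothing here is taken from the talk itself.

```python
import re

# Hypothetical vocabulary; a real system would use a large per-language lexicon.
KNOWN_WORDS = {"american", "historical"}

# A word fragment, a hyphen at the end of a line, then the continuation.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphenate(text, known_words=KNOWN_WORDS):
    """Rejoin words split across line ends when the joined form is known."""
    def rejoin(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in known_words:
            return joined            # "Ame-\nrican" -> "American"
        return f"{head}-{tail}"      # keep the hyphen, drop the line break
    return LINE_BREAK_HYPHEN.sub(rejoin, text)


page = "an Ame-\nrican author of histo-\nrical novels, well-\nknown abroad"
print(dehyphenate(page))
```

Keeping the hyphen when the joined form is unknown preserves genuine compounds such as "well-known" that happened to break at the hyphen.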
  
More words from more books, more spelling mistakes.

This is a good thing!

Leads to improved spelling correction (in multiple languages) and more sensitive translation.
  	
  
“Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.”
     - Hank Bromley, IA
  
“Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.”
     - JDB
  
Statistical analysis (of which terms tend to appear in the vicinity of which others), useful not only for context-sensitive OCR, but more significantly, for building semantic maps and other kinds of knowledge representation.

“dead as a door nail”: the term “door nail” is not commonly found elsewhere.
  
Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
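
Co-occurrence analysis starts from counting which words appear near which others. A minimal sketch, assuming a fixed token window and a toy corpus; the window size, sample sentences, and function names are hypothetical, and a real system would add stop-word handling and a statistical association measure.

```python
from collections import Counter


def cooccurrence(tokens, window=2):
    """Count unordered pairs of distinct words within `window` tokens of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def neighbours(pairs, word):
    """Words seen near `word`, most frequent first."""
    seen = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            seen[b] += n
        elif b == word:
            seen[a] += n
    return seen.most_common()


corpus = (
    "he deposited cash at the bank . "
    "the bank approved the loan . "
    "they fished from the river bank . "
    "the river bank was muddy ."
).split()
pairs = cooccurrence(corpus, window=2)
print(neighbours(pairs, "bank")[:5])
```

With enough text, the neighbours of "bank" separate into a financial group and a riverside group, which is the kind of context signal that helps distinguish among the meanings of a word.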
  
LSA (latent semantic analysis) is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.”
     - Wikipedia.org
  
(LSI = LSA in context of info retrieval (IR).)

“Clustering is a way to group documents based on their conceptual similarity to each other ... . This is very useful when dealing with an unknown collection of unstructured text.”
  
“Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”
  
“[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”
  
“LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).
“This is especially important for applications using text derived from Optical Character Recognition (OCR) ...”
     - Wikipedia.org
  
The More Data, The Better ...

The More Books, The Better Web Search.
  
Contact information:

peter brantley, internet archive
@naypinya (twitter)
peter @ archive.org
  

More Related Content

What's hot

Social Networking: Tools and Technologies for enhancing user interaction
Social Networking: Tools and Technologies for enhancing user interactionSocial Networking: Tools and Technologies for enhancing user interaction
Social Networking: Tools and Technologies for enhancing user interactionADINET Ahmedabad
 
There is such thing as a freebie!
There is such thing as a freebie!There is such thing as a freebie!
There is such thing as a freebie!ddefebbo
 
Enterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalEnterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalMarianne Sweeny
 
The library and the network: scale, engagement, innovation
The library and the network: scale, engagement, innovationThe library and the network: scale, engagement, innovation
The library and the network: scale, engagement, innovationlisld
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexingKhwaja Aamer
 
NAG2007
NAG2007NAG2007
NAG2007daveyp
 
Open Library at Make Books Apparent
Open Library at Make Books ApparentOpen Library at Make Books Apparent
Open Library at Make Books ApparentGeorge Oates
 
Josh Moulin: What every prosecutor should know about peer to-peer investigations
Josh Moulin: What every prosecutor should know about peer to-peer investigationsJosh Moulin: What every prosecutor should know about peer to-peer investigations
Josh Moulin: What every prosecutor should know about peer to-peer investigationsJosh Moulin, MSISA,CISSP
 
From Snake People to Solution: A Case Study in Repurposing Open-Source Code
From Snake People to Solution: A Case Study in Repurposing Open-Source CodeFrom Snake People to Solution: A Case Study in Repurposing Open-Source Code
From Snake People to Solution: A Case Study in Repurposing Open-Source CodeNASIG
 
December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...
December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...
December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...DeVonne Parks, CEM
 
Intertwingularity, Semantic Web and linked Geo data
Intertwingularity, Semantic Web and linked Geo dataIntertwingularity, Semantic Web and linked Geo data
Intertwingularity, Semantic Web and linked Geo dataDan Brickley
 
Vks Presentation, Jankowski,15 Jan2009, Websites & Books, Near Final
Vks Presentation, Jankowski,15 Jan2009, Websites & Books, Near FinalVks Presentation, Jankowski,15 Jan2009, Websites & Books, Near Final
Vks Presentation, Jankowski,15 Jan2009, Websites & Books, Near FinalNick Jankowski
 
LIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersLIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersPrattSILS
 
A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...
A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...
A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...Charleston Conference
 
Linked Data Workshop Stanford University
Linked Data Workshop Stanford University Linked Data Workshop Stanford University
Linked Data Workshop Stanford University Talis Consulting
 

What's hot (18)

Social Networking: Tools and Technologies for enhancing user interaction
Social Networking: Tools and Technologies for enhancing user interactionSocial Networking: Tools and Technologies for enhancing user interaction
Social Networking: Tools and Technologies for enhancing user interaction
 
There is such thing as a freebie!
There is such thing as a freebie!There is such thing as a freebie!
There is such thing as a freebie!
 
Digital Public Library of America
Digital Public Library of AmericaDigital Public Library of America
Digital Public Library of America
 
Enterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalEnterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices Final
 
The library and the network: scale, engagement, innovation
The library and the network: scale, engagement, innovationThe library and the network: scale, engagement, innovation
The library and the network: scale, engagement, innovation
 
Inverted textindexing
Inverted textindexingInverted textindexing
Inverted textindexing
 
NAG2007
NAG2007NAG2007
NAG2007
 
Open Library at Make Books Apparent
Open Library at Make Books ApparentOpen Library at Make Books Apparent
Open Library at Make Books Apparent
 
Josh Moulin: What every prosecutor should know about peer to-peer investigations
Josh Moulin: What every prosecutor should know about peer to-peer investigationsJosh Moulin: What every prosecutor should know about peer to-peer investigations
Josh Moulin: What every prosecutor should know about peer to-peer investigations
 
Our World is Socio-technical
Our World is Socio-technicalOur World is Socio-technical
Our World is Socio-technical
 
From Snake People to Solution: A Case Study in Repurposing Open-Source Code
From Snake People to Solution: A Case Study in Repurposing Open-Source CodeFrom Snake People to Solution: A Case Study in Repurposing Open-Source Code
From Snake People to Solution: A Case Study in Repurposing Open-Source Code
 
Xcongressonbookbarcelona
XcongressonbookbarcelonaXcongressonbookbarcelona
Xcongressonbookbarcelona
 
December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...
December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...
December 2, 2015: NISO/NFAIS Virtual Conference: Semantic Web: What's New and...
 
Intertwingularity, Semantic Web and linked Geo data
Intertwingularity, Semantic Web and linked Geo dataIntertwingularity, Semantic Web and linked Geo data
Intertwingularity, Semantic Web and linked Geo data
 
Vks Presentation, Jankowski,15 Jan2009, Websites & Books, Near Final
Vks Presentation, Jankowski,15 Jan2009, Websites & Books, Near FinalVks Presentation, Jankowski,15 Jan2009, Websites & Books, Near Final
Vks Presentation, Jankowski,15 Jan2009, Websites & Books, Near Final
 
LIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersLIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project posters
 
A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...
A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...
A Comparative Overview of Journal Discovery Systems: Library Users Offer Thei...
 
Linked Data Workshop Stanford University
Linked Data Workshop Stanford University Linked Data Workshop Stanford University
Linked Data Workshop Stanford University
 

Viewers also liked

GBS Amended Settlement: A status update
GBS Amended Settlement: A status updateGBS Amended Settlement: A status update
GBS Amended Settlement: A status updatePeter Brantley
 
Digital Books and Flying Cars: Disruption in Publishing
Digital Books and Flying Cars: Disruption in PublishingDigital Books and Flying Cars: Disruption in Publishing
Digital Books and Flying Cars: Disruption in PublishingPeter Brantley
 
Making the best of small libraries and small budgets
Making the best of small libraries and small budgetsMaking the best of small libraries and small budgets
Making the best of small libraries and small budgetscaroline
 
HandBook solar engineering
HandBook solar engineeringHandBook solar engineering
HandBook solar engineeringsolarpictures
 
Organizational Fields and the Book Industry
Organizational Fields and the Book IndustryOrganizational Fields and the Book Industry
Organizational Fields and the Book IndustryPeter Brantley
 
HistòRia Duna Pastanaga
HistòRia Duna PastanagaHistòRia Duna Pastanaga
HistòRia Duna Pastanagavirgi
 
What Rupert would tell the DLF
What Rupert would tell the DLFWhat Rupert would tell the DLF
What Rupert would tell the DLFPeter Brantley
 
Digital Books and Flying Cars: The Library edition
Digital Books and Flying Cars: The Library editionDigital Books and Flying Cars: The Library edition
Digital Books and Flying Cars: The Library editionPeter Brantley
 
Railties
RailtiesRailties
RailtiesDefV
 
Projecte dofins
Projecte dofinsProjecte dofins
Projecte dofinsvirgi
 
Samuelson: GBS as Copyright Reform
Samuelson: GBS as Copyright ReformSamuelson: GBS as Copyright Reform
Samuelson: GBS as Copyright ReformPeter Brantley
 
Relatiedag260607
Relatiedag260607Relatiedag260607
Relatiedag260607NSO
 
Abelles Micos I Dofinsdsffff
Abelles Micos I DofinsdsffffAbelles Micos I Dofinsdsffff
Abelles Micos I Dofinsdsffffvirgi
 
Extending the Digital Self
Extending the Digital SelfExtending the Digital Self
Extending the Digital SelfPeter Brantley
 
Literature as a (web) Service
Literature as a (web) ServiceLiterature as a (web) Service
Literature as a (web) ServicePeter Brantley
 
BookServer: A Web of Books
BookServer: A Web of BooksBookServer: A Web of Books
BookServer: A Web of BooksPeter Brantley
 

Viewers also liked (20)

GBS Amended Settlement: A status update
GBS Amended Settlement: A status updateGBS Amended Settlement: A status update
GBS Amended Settlement: A status update
 
Digital Books and Flying Cars: Disruption in Publishing
Digital Books and Flying Cars: Disruption in PublishingDigital Books and Flying Cars: Disruption in Publishing
Digital Books and Flying Cars: Disruption in Publishing
 
Making the best of small libraries and small budgets
Making the best of small libraries and small budgetsMaking the best of small libraries and small budgets
Making the best of small libraries and small budgets
 
HandBook solar engineering
HandBook solar engineeringHandBook solar engineering
HandBook solar engineering
 
Reading the Next Book
Reading the Next BookReading the Next Book
Reading the Next Book
 
Organizational Fields and the Book Industry
Organizational Fields and the Book IndustryOrganizational Fields and the Book Industry
Organizational Fields and the Book Industry
 
HistòRia Duna Pastanaga
HistòRia Duna PastanagaHistòRia Duna Pastanaga
HistòRia Duna Pastanaga
 
What Rupert would tell the DLF
What Rupert would tell the DLFWhat Rupert would tell the DLF
What Rupert would tell the DLF
 
Digital Books and Flying Cars: The Library edition
Digital Books and Flying Cars: The Library editionDigital Books and Flying Cars: The Library edition
Digital Books and Flying Cars: The Library edition
 
Forever Together
Forever TogetherForever Together
Forever Together
 
Railties
RailtiesRailties
Railties
 
Projecte dofins
Projecte dofinsProjecte dofins
Projecte dofins
 
Redefining Libraries
Redefining LibrariesRedefining Libraries
Redefining Libraries
 
Cloud Libraries
Cloud LibrariesCloud Libraries
Cloud Libraries
 
Samuelson: GBS as Copyright Reform
Samuelson: GBS as Copyright ReformSamuelson: GBS as Copyright Reform
Samuelson: GBS as Copyright Reform
 
Relatiedag260607
Relatiedag260607Relatiedag260607
Relatiedag260607
 
Abelles Micos I Dofinsdsffff
Abelles Micos I DofinsdsffffAbelles Micos I Dofinsdsffff
Abelles Micos I Dofinsdsffff
 
Extending the Digital Self
Extending the Digital SelfExtending the Digital Self
Extending the Digital Self
 
Literature as a (web) Service
Literature as a (web) ServiceLiterature as a (web) Service
Literature as a (web) Service
 
BookServer: A Web of Books
BookServer: A Web of BooksBookServer: A Web of Books
BookServer: A Web of Books
 

Similar to Books and Webs: Pulling the Down Rows

The Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature ReviewThe Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature Reviewsstose
 
Tutorial on Semantic Digital Libraries (ESWC'2007)
Tutorial on Semantic Digital Libraries (ESWC'2007)Tutorial on Semantic Digital Libraries (ESWC'2007)
Tutorial on Semantic Digital Libraries (ESWC'2007)Sebastian Ryszard Kruk
 
The Future of Library Cataloguing
The Future of Library CataloguingThe Future of Library Cataloguing
The Future of Library CataloguingKathryne Dunlap
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Stuart Weibel
 
Porting Library Vocabularies to the Semantic Web - IFLA 2010
Porting Library Vocabularies to the Semantic Web - IFLA 2010Porting Library Vocabularies to the Semantic Web - IFLA 2010
Porting Library Vocabularies to the Semantic Web - IFLA 2010Bernard Vatant
 
Semantic Libraries: the Container, the Content and the Contenders
Semantic Libraries: the Container, the Content and the ContendersSemantic Libraries: the Container, the Content and the Contenders
Semantic Libraries: the Container, the Content and the ContendersStefan Gradmann
 
DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0John Breslin
 
Introduction
IntroductionIntroduction
Introductionsriniefs
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spacesMounia Lalmas-Roelleke
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - IndexingSean Golliher
 
Research on collaborative information sharing systems
Research on collaborative information sharing systemsResearch on collaborative information sharing systems
Research on collaborative information sharing systemsDavide Eynard
 

Similar to Books and Webs: Pulling the Down Rows (20)

The Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature ReviewThe Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature Review
 
63demo dfa
63demo dfa63demo dfa
63demo dfa
 
63demo dfa
63demo dfa63demo dfa
63demo dfa
 
63demo dfa
63demo dfa63demo dfa
63demo dfa
 
EDS for JIBS
EDS for JIBSEDS for JIBS
EDS for JIBS
 
Tutorial on Semantic Digital Libraries (ESWC'2007)
Tutorial on Semantic Digital Libraries (ESWC'2007)Tutorial on Semantic Digital Libraries (ESWC'2007)
Tutorial on Semantic Digital Libraries (ESWC'2007)
 
The Future of Library Cataloguing
The Future of Library CataloguingThe Future of Library Cataloguing
The Future of Library Cataloguing
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Books and Webs: Pulling the Down Rows

  • 1. Peter Brantley, Internet Archive, The Presidio, 11.09
  • 2. Essential premise: combining web search with book search is an engineering challenge
  • 3. I. Presenting combined search
  • 4. For several years, I served the University of California as the Director of Technology for the California Digital Library (the digital library group for the UC system).
  • 5. We held various conversations over time with Google engineers in similar spaces ... grappling with the indexing, search, and user interface issues with combined but disparate content pools (books, journals, web, image, video). (an important issue for digital libraries)
  • 6. In academic info markets, “metasearch” (distributed queries with central resolution) contested for primacy with search over aggregated content. To an extent, only LANL and commercial search pursued aggregation at scale. Aggregation wins.
  • 7. “Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.” “With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.” Danny Sullivan, “Google 2.0”, May 16 2007, Search Engine Land
  • 8. Simple search box ... but user search intentionality for books vs. web can differ: “mark twain hawai’i”
  • 9. Google Scholar is a vertical search engine: an explicit opt-in discovery service for STM journal content, utilized in HE academia. Many concerns with combining the Scholar product with Big Daddy. User search goals differ; content distinct; different indexing.
  • 10. From 2007 to early 2009, I was the Director of the Digital Library Federation. I made a request of Google to update members on GBS status at DLF’s Fall Forum, Nov. 2008. They issued an explicit request for HE CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well solved problem”.
  • 11. Some comparisons between web pages and books.
  • 12. web: short doc (web page) length; books: long doc (book) length
  • 13. web: high data density (per doc size); books: highly variant data density (e.g. fiction vs. non-fiction)
  • 14. web: trillions of unique web pages; books: (low) millions of unique books
  • 15. web: many complex media types; books: text and image media
  • 16. web: dynamic over time (avg. TTL of web pages is short); books: static over time (print books permanently fixed)
  • 17. web: single instances (web pages); books: duplicate instances (copies), similar instances (editions), in multiple languages
  • 18. web: hyperlinked in/out (useful in relevance); books: normally quiescent (sometimes citations)
  • 19. web: designed component structure {page hierarchy > web site}; books: artificial component structure {page images > book}
  • 20. Bibliographic data cf. full text (book) data: The Melvyl Recommender Project, Full Text Extension (Supplementary Report), California Digital Library, October 2006. Funded by the Andrew W. Mellon Foundation.
  • 21. Project Lead: Peter Brantley, Director of Technology. Implementation Team: Kirk Hastings, Text Systems Designer; Martin Haye, Programmer (Contractor); Steve Toub, Web Design Manager; Colleen Whitney, Programmer and Coordinator. Assessment Team: Jane Lee, Assessment Analyst; Felicia Poe, Assessment Coordinator; Lisa Schiff, Digital Ingest Programmer.
  • 22. Often many different editions of popular books. Can easily artificially boost search (n_copies). E.g. “Moby Dick” published 100s of times (and in many languages). Depending on publication date: either public domain (dep. on country) or in-copyright (out-of-print or in-print).
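
One way to picture the n_copies problem on slide 22 is to collapse hits that share a work before ranking, so that many editions count once. The sketch below is only an illustration under stated assumptions: the hit dictionaries, the `work_key` normalization, and the "keep the best-scoring edition" rule are all hypothetical and are not described in the talk.

```python
import re

def normalize(text):
    """Lowercase and reduce punctuation to spaces so 'Moby-Dick' matches 'MOBY DICK'."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def work_key(title, author):
    """Hypothetical grouping key: editions sharing title and author count as one work."""
    return (normalize(title), normalize(author))

def collapse_editions(hits):
    """Keep only the best-scoring edition of each work, ranked by score."""
    best = {}
    for hit in hits:
        key = work_key(hit["title"], hit["author"])
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

# Invented sample hits: three editions of one work plus one other title.
hits = [
    {"title": "Moby Dick", "author": "Herman Melville", "score": 2.5},
    {"title": "Moby-Dick", "author": "Herman Melville", "score": 3.0},
    {"title": "MOBY DICK", "author": "Herman Melville", "score": 1.5},
    {"title": "Typee", "author": "Herman Melville", "score": 1.0},
]
collapsed = collapse_editions(hits)
```

A real system would need a far more careful notion of "same work" (translations, author name forms, abridgements); exact string matching is used here only to keep the example short.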
  • 23. In CDL tests, for texts vs. bib records: search scoring for full text documents was typically 10 to 100 times larger than for metadata-only records. (Probably approximate magnitude cf. to representative web pages).
  • 24. Easy for a single work to overwhelm web pages in relevance for a well-fitting query. E.g. “English working class labor industrial”: The making of the English working class. Author: E P Thompson. Publisher: New York, Pantheon Books [1964, ©1963]
  • 25. Books are long strings of many words, split into n_sized chunks for parsing. Term indexing based on overlapping and variant length “word vectors”: “battle” “of” “britain”; “battle of” “britain”; “battle” “of britain”; “battle of britain”
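
The four groupings of “battle of britain” on slide 25 are exactly the ways of cutting a three-word phrase into contiguous groups. A minimal sketch of how such variants could be enumerated follows; the function name `segmentations` is an assumption for illustration, and nothing here claims to reproduce the indexer actually used.

```python
from itertools import combinations

def segmentations(phrase):
    """Yield every way to split a phrase into contiguous word groups."""
    words = phrase.split()
    gaps = range(1, len(words))  # positions between words where a cut may fall
    for n_cuts in range(len(words)):
        for cuts in combinations(gaps, n_cuts):
            bounds = [0, *cuts, len(words)]
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]

for groups in segmentations("battle of britain"):
    print(groups)
```

The number of groupings doubles with each added word (2^(n-1) for n words), which is one reason long documents produce so many index terms.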
  • 26. {Search Term} and {Document} weights: 1. How often is a search term found within a given sized chunk of text? 2. How many chunks of text is the term found within? 3. How many chunks of text does the document contain?
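
The three questions on slide 26 correspond to three simple counts over a chunked document. The sketch below computes them for a toy text; the fixed chunk size, whitespace tokenization, and the function names are assumptions made for illustration, not details from the talk.

```python
def chunk(words, size):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def term_stats(text, term, chunk_size=5):
    """Return the three counts asked about on the slide for one term."""
    words = text.lower().split()
    chunks = chunk(words, chunk_size)
    hits_per_chunk = [c.count(term) for c in chunks]        # 1. frequency within each chunk
    chunks_with_term = sum(1 for n in hits_per_chunk if n)  # 2. chunks containing the term
    total_chunks = len(chunks)                              # 3. chunks in the document
    return hits_per_chunk, chunks_with_term, total_chunks

text = "the whale swam and the whale dived while the crew watched the sea"
stats = term_stats(text, "whale")
```

How these counts are combined into a single weight is the hard part for book-length texts, since a book yields far more chunks than a typical web page.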
  • 27. Which is better? 1. Adequate matches over many fields, 2. Better matches in fewer fields. Metrics vary between books and web. One learns from one’s mistakes. More books, more mistakes.
  • 28. 1. Books are sooo much longer than web pages. 2. Books produce 1000’s more chunks than web. 3. Term weighting is very complex for long docs. 4. Indexes must be integrated for web and books. 5. But source term indexes are biased differently.
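
To make points 4 and 5 concrete: if raw book scores run 10 to 100 times larger than scores from another pool, a naive merge puts every book above every other result. One simple remedy, shown here purely as an assumed illustration (the talk does not say how Google or CDL handled this), is to rescale each pool by its own top score before blending. The document IDs and scores are invented sample data.

```python
def normalize_pool(hits):
    """Rescale one pool's (doc, score) pairs so its top score becomes 1.0."""
    top = max((score for _, score in hits), default=0.0)
    if top <= 0:
        return [(doc, 0.0) for doc, _ in hits]
    return [(doc, score / top) for doc, score in hits]

def blend(*pools):
    """Merge several normalized pools into one list ranked by rescaled score."""
    merged = [hit for pool in pools for hit in normalize_pool(pool)]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

# Hypothetical raw scores: the book pool is on a much larger scale.
books = [("book-a", 250.0), ("book-b", 100.0)]
web = [("page-x", 4.0), ("page-y", 3.0)]
blended = blend(books, web)
```

Dividing by the pool maximum is the crudest possible choice; it ignores how scores are distributed within each pool, which is part of why the slide calls the source indexes "biased differently".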
  • 29. II. What you get from books
  • 30. The dialectic between books and web provides benefits from their integration (no matter the pain). Books enrich general web search, not just via the data within books, but also by books-as-data.
  • 31. All search is made smarter by analysis. 1. structure 2. contextualization 3. relatedness 4. normalization 5. association
  • 32. Because of digitization, books have complications cf. web pages; a result of OCR. 1. Language detection 2. Determining which words get indexed (minus stop words like “of” “a” “the” etc.) 3. OCR mistakes hamper word recognition
  • 33. Common OCR traps: embedded languages; Latin or archaic spelling; complex scripts (e.g. captions); hyphenated words
  • 34. ricain ricanant ricaine ricanante ricaines ricane ricana ricamente ricanai ricanement ricains ricanements rical rican rically ricanes ricals ricans
  • 35. More words from more books, more spelling mistakes. This is a good thing! Leads to improved spelling correction (in multiple languages) and more sensitive translation.
  • 36. “Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.” - Hank Bromley, IA
  • 37. “Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.” - JDB
  • 38. Statistical analysis (of which terms tend to appear in the vicinity of which others) is useful not only for context-sensitive OCR, but more significantly, for building semantic maps and other kinds of knowledge representation. “dead as a door nail” – the term “door nail” is not commonly found elsewhere.
  • 39. Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
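
The co-occurrence analysis described on slides 38 and 39 starts from a very simple count: how often two terms appear within a few words of each other. A minimal sketch, assuming whitespace tokens and a fixed look-ahead window (both choices, and the function name, are illustrative assumptions):

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Count unordered term pairs that occur within `window` words of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            counts[tuple(sorted((left, right)))] += 1
    return counts

pairs = cooccurrence("dead as a door nail".split(), window=1)
```

Aggregated over millions of books, counts like these show that “door” and “nail” rarely sit together outside this idiom, which is the kind of signal that can support both context-sensitive OCR correction and word-sense disambiguation.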
  • 40. LSA is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.” - Wikipedia.org
  • 41. (LSI = LSA in context of info retrieval (IR).) “Clustering is a way to group documents based on their conceptual similarity to each other ... . This is very useful when dealing with an unknown collection of unstructured text.”
  • 42. “Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”
  • 43. “[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”
  • 44. “LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.). “This is especially important for applications using text derived from Optical Character Recognition (OCR) ...” - Wikipedia.org
  • 45. The More Data, The Better ... The More Books, The Better Web Search.
  • 46. Contact information: peter brantley, internet archive; @naypinya (twitter); peter @ archive.org