Books and Webs: Pulling the Down Rows

Peter Brantley
Internet Archive
The Presidio
11.09

Essential premise: combining web search with book search is an engineering challenge.

I. Presenting combined search

For several years, I served the University of California as the Director of Technology for the California Digital Library (the digital library group for the UC system).

We held various conversations over time with Google engineers in similar spaces ... grappling with the indexing, search, and user interface issues with combined but disparate content pools (books, journals, web, image, video).

(An important issue for digital libraries.)

In academic info markets, “metasearch” (distributed queries with central resolution) contested for primacy with search over aggregated content.

To an extent, only LANL and commercial search pursued aggregation at scale.

Aggregation wins.

“Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.”

“With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.”

Danny Sullivan, “Google 2.0”, May 16, 2007, Search Engine Land

Simple search box ... but

User search intentionality for books vs. web can differ

“mark twain hawai’i”

Google Scholar is a vertical search engine.

Explicit opt-in discovery service for STM journal content, utilized in HE academia.

Many concerns with combining the Scholar product with Big Daddy.

User search goals differ; content distinct; different indexing.

From 2007 to early 2009, I was the Director of the Digital Library Federation. I made a request of Google to update members on GBS status at DLF’s Fall Forum, Nov. 2008.

They issued an explicit request for HE CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well solved problem”.

Some comparisons between web pages and books.

web: short doc (web page) length
books: long doc (book) length

web: high data density (per doc size)
books: highly variant data density (e.g. fiction vs. non-fiction)

web: trillions of unique web pages
books: (low) millions of unique books

web: many complex media types
books: text and image media

web: dynamic over time (avg. TTL of web pages is short)
books: static over time (print books permanently fixed)

web: single instances (web pages)
books: duplicate instances (copies), similar instances (editions), in multiple languages

web: hyperlinked in/out (useful in relevance)
books: normally quiescent (sometimes citations)

web: designed component structure {page hierarchy > web site}
books: artificial component structure {page images > book}

Bibliographic data cf. full text (book) data:

The Melvyl Recommender Project
Full Text Extension
(Supplementary Report)

California Digital Library
October 2006
Funded by the Andrew W. Mellon Foundation

Project Lead
  Peter Brantley, Director of Technology

Implementation Team
  Kirk Hastings, Text Systems Designer
  Martin Haye, Programmer (Contractor)
  Steve Toub, Web Design Manager
  Colleen Whitney, Programmer and Coordinator

Assessment Team
  Jane Lee, Assessment Analyst
  Felicia Poe, Assessment Coordinator
  Lisa Schiff, Digital Ingest Programmer

Often many different editions of popular books. Can easily artificially boost search (n_copies).

e.g. “Moby Dick” published 100s of times (and in many languages)
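One way to keep n_copies from inflating a result list is to collapse editions into a single work before ranking. The sketch below is only an illustration: the hit fields (`title`, `author`, `score`) and the title-plus-author key are assumptions, and a real system would need proper work-level identifiers, since translations carry different titles.

```python
from collections import defaultdict

def collapse_editions(hits):
    """Keep one hit per work (the best-scoring edition) and count the rest.

    The (title, author) key is a hypothetical stand-in for a real
    work identifier; it will not match translated titles.
    """
    best = {}
    counts = defaultdict(int)
    for hit in hits:
        key = (hit["title"].casefold().strip(), hit["author"].casefold().strip())
        counts[key] += 1
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Attach the edition count as information, not as a score boost.
    merged = [dict(h, n_editions=counts[k]) for k, h in best.items()]
    return sorted(merged, key=lambda h: h["score"], reverse=True)

hits = [
    {"title": "Moby Dick", "author": "Herman Melville", "score": 2.0},
    {"title": "moby dick", "author": "herman melville", "score": 3.5},
    {"title": "Typee", "author": "Herman Melville", "score": 3.0},
]
works = collapse_editions(hits)
```

Here the two “Moby Dick” editions count once in the ranking, and the number of editions is carried along as metadata.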

Depending on publication date: either public domain (dep. on country) or in-copyright (out-of-print or in-print)

In CDL tests, for texts vs. bib records: search scores for full text documents were typically 10 to 100 times larger than for metadata-only records.

(Probably the approximate magnitude cf. representative web pages.)

Easy for a single work to overwhelm web pages in relevance for a well-fitting query.

E.g. “English working class labor industrial”

  The making of the English working class.
  Author: E P Thompson
  Publisher: New York, Pantheon Books
  [1964, ©1963]
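A rough illustration of why raw scores from different pools cannot simply be merged: if book scores run 10 to 100 times higher, books always win. The sketch below rescales each pool to z-scores before blending. This is one common normalization, chosen here as an assumption; it is not a description of how Google or CDL actually combine results, and the pool names and scores are made up.

```python
from statistics import mean, pstdev

def zscores(scores):
    """Rescale scores to zero mean and unit variance within one pool."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]

def blend(pools):
    """pools: {pool_name: [(doc_id, raw_score), ...]} -> merged ranking."""
    merged = []
    for pool, hits in pools.items():
        if not hits:
            continue
        normalized = zscores([score for _, score in hits])
        merged.extend((doc_id, pool, z) for (doc_id, _), z in zip(hits, normalized))
    return sorted(merged, key=lambda item: item[2], reverse=True)

pools = {
    "web": [("w1", 1.2), ("w2", 0.8), ("w3", 0.4)],
    "books": [("b1", 90.0), ("b2", 50.0), ("b3", 10.0)],  # ~100x larger raw scores
}
ranking = blend(pools)
```

With raw scores, all three books would outrank every web page; after per-pool normalization the best web page and the best book compete at the top.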

Books are long strings of many words, split into n_sized chunks for parsing.

Term indexing based on overlapping and variant length “word vectors”

“battle” “of” “britain”
“battle of” “britain”
“battle” “of britain”
“battle of britain”
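The four groupings above are every way of cutting a three-word phrase into contiguous runs. A minimal sketch of generating them, assuming simple whitespace tokenization (the function name is hypothetical):

```python
from itertools import combinations

def segmentations(phrase):
    """Return every split of a phrase into contiguous multi-word terms."""
    words = phrase.split()
    n = len(words)
    results = []
    for k in range(n):  # k = number of cut points between words
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            results.append([" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])])
    return results

splits = segmentations("battle of britain")
```

A phrase of n words has 2^(n-1) such splits, which is one reason long documents produce so many index terms.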

{Search Term} and {Document} weights

1. How often is a search term found within a given sized chunk of text?
2. How many chunks of text is the term found within?
3. How many chunks of text does the document contain?
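The three questions map onto the usual term frequency, document (here, chunk) frequency, and collection size. A small sketch under stated assumptions: fixed-size word chunks, lowercase whitespace tokenization, and a smoothed inverse frequency. None of this is taken from the CDL implementation.

```python
import math
from collections import Counter

def chunk_words(text, size):
    """Split a text into consecutive chunks of `size` words."""
    words = text.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]

def term_weights(text, term, size=5):
    chunks = chunk_words(text, size)
    n_chunks = len(chunks)                              # question 3
    tf = [Counter(chunk)[term] for chunk in chunks]     # question 1
    df = sum(1 for count in tf if count > 0)            # question 2
    # Smoothed inverse chunk frequency: rarer terms weigh more.
    idf = math.log((1 + n_chunks) / (1 + df)) + 1
    return {"tf_per_chunk": tf, "chunks_with_term": df,
            "n_chunks": n_chunks, "idf": idf}

text = "the whale swam past the ship and the whale dived"
ship = term_weights(text, "ship")
whale = term_weights(text, "whale")
```

A term found in every chunk (“whale”) gets a lower inverse frequency than one found in only some (“ship”).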

Which is better?

1. Adequate matches over many fields,
2. Better matches in fewer fields.

Metrics vary between books and web.

One learns from one’s mistakes. More books, more mistakes.

1. Books are sooo much longer than web pages.
2. Books produce 1000s more chunks than web.
3. Term weighting is very complex for long docs.
4. Indexes must be integrated for web and books.
5. But source term indexes are biased differently.

II. What you get from books

The dialectic between books and web provides benefits from their integration (no matter the pain).

Books enrich general web search, not just via the data within books, but also by books-as-data.

All search is made smarter by analysis.

1. structure
2. contextualization
3. relatedness
4. normalization
5. association

Because of digitization, books have complications cf. web pages; a result of OCR.

1. Language detection
2. Determining which words get indexed (minus stop words like “of”, “a”, “the”, etc.)
3. OCR mistakes hamper word recognition

Common OCR traps:

  embedded languages
  Latin or archaic spelling
  complex scripts (e.g. captions)
  hyphenated words

  ricain      ricanant
  ricaine     ricanante
  ricaines    ricane
  ricana      ricamente
  ricanai     ricanement
  ricains     ricanements
  rical       rican
  rically     ricanes
  ricals      ricans
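Fragments like these typically enter an index when a word is hyphenated across a line break and each half is indexed separately. A minimal sketch of rejoining them, assuming a known vocabulary is available to decide whether the joined form is a real word (the function and the tiny vocabulary are hypothetical):

```python
def dehyphenate(lines, vocabulary):
    """Rejoin words split by a line-end hyphen when the joined form is known.

    Unknown joins keep their hyphen, so compounds like "well-known" survive.
    """
    out = []
    carry = ""  # first half of a word left hanging at the end of a line
    for line in lines:
        words = line.split()
        if carry and words:
            joined = carry + words[0]
            if joined.lower() in vocabulary:
                words[0] = joined
            else:
                words[0] = carry + "-" + words[0]
            carry = ""
        if words and words[-1].endswith("-") and len(words[-1]) > 1:
            carry = words.pop()[:-1]
        out.extend(words)
    if carry:
        out.append(carry + "-")
    return out

page = ["the histo-", "rical record of an ame-", "rican well-", "known author"]
tokens = dehyphenate(page, {"historical", "american"})
```

The larger the vocabulary gathered from other books, the more of these fragments can be resolved.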

More words from more books, more spelling mistakes.

This is a good thing!

Leads to improved spelling correction (in multiple languages) and more sensitive translation.
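To make the link between corpus size and spelling correction concrete: a simple corrector proposes known words one edit away and picks the most frequent. This is a generic frequency-based sketch, not the method of any particular search engine; the alphabet, the single-edit limit, and the toy corpus are all assumptions.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, swap, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)

def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

counts = Counter("the whale the whale the ship while".split())
fixed = correct("whalc", counts)
```

More text means better frequency counts, in every language the corpus covers.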

“Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.”

- Hank Bromley, IA

“Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.”

- JDB

Statistical analysis (of which terms tend to appear in the vicinity of which others) is useful not only for context-sensitive OCR, but more significantly, for building semantic maps and other kinds of knowledge representation.

“dead as a door nail”: the term “door nail” is not commonly found elsewhere.

Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
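Co-occurrence analysis starts from something as simple as counting which words appear near which others. A minimal sketch, assuming a symmetric window of two words on each side (the window size and the sample text are illustrative choices):

```python
from collections import Counter, defaultdict

def cooccurrence(tokens, window=2):
    """Count, for each word, the words appearing within `window` positions."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

tokens = "dead as a door nail and the door nail was dead".split()
near = cooccurrence(tokens)
```

Over a large corpus, such counts show that “door nail” keeps the company of “dead”, and the neighbors of an ambiguous word help tell its senses apart.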

LSA (latent semantic analysis) is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.”

- Wikipedia.org

(LSI = LSA in context of info retrieval (IR).)

“Clustering is a way to group documents based on their conceptual similarity to each other ... This is very useful when dealing with an unknown collection of unstructured text.”

“Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”

“[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”

“LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.). This is especially important for applications using text derived from Optical Character Recognition (OCR) ...”

- Wikipedia.org

The More Data, The Better ...

The More Books, The Better Web Search.

Contact information:

peter brantley
internet archive
@naypinya (twitter)
peter @ archive.org
