Elementary explanation of the difficulties of combining indexes for web pages and books, and means by which book index data can optimize general web searches at scale.
4. For several years, I served the University of California as the Director of Technology for the California Digital Library (the digital library group for the UC system).
5. Over time, we held various conversations with Google engineers working in similar spaces, grappling with the indexing, search, and user interface issues raised by combined but disparate content pools (books, journals, web, image, video), an important issue for digital libraries.
6. In academic information markets, “metasearch” (distributed queries with central resolution) competed for primacy with search over aggregated content. To an extent, only LANL and commercial search pursued aggregation at scale. Aggregation wins.
7. “Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.” “With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.” Danny Sullivan, “Google 2.0”, May 16 2007, Search Engine Land
8. Simple search box ... but user search intent for books vs. web can differ, e.g. “mark twain hawai’i”.
9. Google Scholar is a vertical search engine: an explicit opt-in discovery service for STM journal content, used in higher education (HE). There are many concerns with combining the Scholar product with Big Daddy: user search goals differ, the content is distinct, and the indexing is different.
10. From 2007 to early 2009, I was the Director of the Digital Library Federation. I asked Google to update members on GBS status at DLF’s Fall Forum, Nov. 2008. They issued an explicit request for HE CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well-solved problem.”
19. web: designed component structure {page hierarchy > web site}; books: artificial component structure {page images > book}
20. Bibliographic data cf. full text (book) data: The Melvyl Recommender Project, Full Text Extension (Supplementary Report). California Digital Library, October 2006. Funded by the Andrew W. Mellon Foundation.
21. Project Lead: Peter Brantley, Director of Technology. Implementation Team: Kirk Hastings, Text Systems Designer; Martin Haye, Programmer (Contractor); Steve Toub, Web Design Manager; Colleen Whitney, Programmer and Coordinator. Assessment Team: Jane Lee, Assessment Analyst; Felicia Poe, Assessment Coordinator; Lisa Schiff, Digital Ingest Programmer.
22. Popular books often exist in many different editions, which can easily and artificially boost search scores (n_copies). E.g. “Moby Dick” has been published hundreds of times (and in many languages). Depending on publication date, an edition is either public domain (depending on country) or in copyright (out of print or in print).
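One way to keep many editions of a single work from inflating results is to collapse hits onto a work-level key before ranking. The sketch below is illustrative only: the `Hit` record, the `work_key` normalization, and the choice to keep the best-scoring edition are all assumptions, not a method described in the slides.

```python
from dataclasses import dataclass
import re


@dataclass
class Hit:
    # Hypothetical search hit; field names are assumptions for this sketch.
    title: str
    author: str
    year: int
    score: float


def work_key(hit: Hit) -> tuple[str, str]:
    """Crude work-level key: lowercased title and author, punctuation removed."""
    def norm(text: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return norm(hit.title), norm(hit.author)


def collapse_editions(hits: list[Hit]) -> list[tuple[Hit, int]]:
    """Keep the best-scoring edition per work, with the number of editions matched."""
    best: dict[tuple[str, str], Hit] = {}
    counts: dict[tuple[str, str], int] = {}
    for hit in hits:
        key = work_key(hit)
        counts[key] = counts.get(key, 0) + 1
        if key not in best or hit.score > best[key].score:
            best[key] = hit
    ranked = sorted(best.values(), key=lambda h: h.score, reverse=True)
    return [(h, counts[work_key(h)]) for h in ranked]


hits = [
    Hit("Moby Dick", "Herman Melville", 1851, 7.2),
    Hit("Moby-Dick", "Herman Melville", 1930, 8.1),
    Hit("MOBY DICK", "Herman Melville", 2001, 6.5),
    Hit("Typee", "Herman Melville", 1846, 5.0),
]
collapsed = collapse_editions(hits)
```

Here the three “Moby Dick” editions count as one result, so n_copies no longer multiplies the work's presence in the ranked list. A real system would need a far more careful notion of "same work" (translations, abridgements, variant titles).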
23. In CDL tests of texts vs. bib records, search scores for full-text documents were typically 10 to 100 times larger than for metadata-only records. (Probably of approximately the same magnitude when compared with representative web pages.)
25. Books are long strings of many words, split into n_sized chunks for parsing. Term indexing is based on overlapping, variable-length “word vectors”: “battle” “of” “britain”; “battle of” “britain”; “battle” “of britain”; “battle of britain”.
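The overlapping word vectors listed above amount to every contiguous run of words up to some maximum length. A minimal sketch, assuming whitespace tokenization and lowercasing (both simplifications; `word_ngrams` and `max_len` are names chosen here for illustration):

```python
def word_ngrams(text: str, max_len: int = 3) -> list[str]:
    """Return every contiguous run of 1..max_len words, shortest runs first."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_len + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


terms = word_ngrams("Battle of Britain")
# terms == ["battle", "of", "britain", "battle of", "of britain", "battle of britain"]
```

For a three-word phrase this yields six index terms; for a whole book the number of terms grows roughly as the word count times `max_len`, which is one reason book indexes become so large.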
26. {Search Term} and {Document} weights
    1. How often is a search term found within a given-sized chunk of text?
    2. How many chunks of text is the term found within?
    3. How many chunks of text does the document contain?
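The three questions above map onto three simple counts. The sketch below computes them for one term in one document; the final `score` uses a textbook TF-IDF-style combination that is an assumption made here for illustration, not the weighting used in the CDL tests.

```python
import math


def chunk_text(words: list[str], size: int) -> list[list[str]]:
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def term_stats(term: str, chunks: list[list[str]]) -> dict[str, float]:
    """Compute the three counts from the slide plus an illustrative score."""
    per_chunk = [chunk.count(term) for chunk in chunks]    # 1. frequency per chunk
    chunks_with_term = sum(1 for c in per_chunk if c > 0)  # 2. chunks containing the term
    n_chunks = len(chunks)                                 # 3. chunks in the document
    # Assumed combination: peak in-chunk frequency, damped when the term
    # is spread across most of the document's chunks.
    idf = math.log((1 + n_chunks) / (1 + chunks_with_term)) + 1.0
    max_in_chunk = max(per_chunk, default=0)
    return {
        "max_in_chunk": max_in_chunk,
        "chunks_with_term": chunks_with_term,
        "n_chunks": n_chunks,
        "score": max_in_chunk * idf,
    }


words = "the battle of britain was an air battle over britain and the channel".split()
chunks = chunk_text(words, size=4)
stats = term_stats("battle", chunks)
```

With 4-word chunks, the 13-word sample splits into 4 chunks, and “battle” appears once in each of 2 of them.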
27. Which is better?
    1. Adequate matches over many fields.
    2. Better matches in fewer fields.
    Metrics vary between books and web. One learns from one’s mistakes. More books, more mistakes.
28. 1. Books are far longer than web pages.
    2. Books produce thousands more chunks than web pages.
    3. Term weighting is very complex for long documents.
    4. Indexes must be integrated for web and books.
    5. But the source term indexes are biased differently.
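When raw scores from two indexes sit on very different scales, some rescaling step is needed before their results can be ranked together. A minimal sketch, assuming per-source z-score normalization; this is a generic technique picked for illustration, the document identifiers and scores are made up, and the slides do not say how any production system handles this.

```python
from statistics import mean, pstdev


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one source's raw scores to zero mean and unit variance."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # all-equal scores: avoid dividing by zero
    return {doc: (s - mu) / sigma for doc, s in scores.items()}


def merge(*sources: dict[str, float]) -> list[tuple[str, float]]:
    """Normalize each source separately, then rank all results together."""
    merged: dict[str, float] = {}
    for source in sources:
        merged.update(normalize(source))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Made-up raw scores: the book index scores run about 100x higher.
book_scores = {"book:moby-dick": 950.0, "book:typee": 400.0, "book:omoo": 150.0}
web_scores = {"web:melville-bio": 9.0, "web:whaling-faq": 5.0, "web:nantucket": 1.0}
ranked = merge(book_scores, web_scores)
```

Sorted on raw scores, every book would outrank every web page. After normalization the two sources interleave, which is the behavior an integrated index needs, though z-scores are only one of many possible calibrations.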
30. The dialectic between books and web provides benefits from their integration (no matter the pain). Books enrich general web search, not just via the data within books, but also by books-as-data.
31. All search is made smarter by analysis.
    1. structure
    2. contextualization
    3. relatedness
    4. normalization
    5. association
32. Because they are digitized, books have complications compared with web pages, a result of OCR:
    1. Language detection
    2. Determining which words get indexed (excluding stop words like “of”, “a”, “the”, etc.)
    3. OCR mistakes hamper word recognition
33. Common OCR traps: embedded languages; Latin or archaic spelling; complex scripts (e.g. captions); hyphenated words.
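The hyphenated-word trap can be partly handled after OCR by rejoining words split at a line end. The sketch below assumes a known-word `vocabulary` is available and treats any unknown joined form as a genuine compound; both the function name and that policy are assumptions for illustration.

```python
import re


def rejoin_hyphenated(text: str, vocabulary: set[str]) -> str:
    """Merge words split by a line-end hyphen when the joined form is a known word."""
    def fix(match: re.Match) -> str:
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in vocabulary:
            return joined
        # Unknown joined form: keep the hyphen, since it may be a real compound.
        return f"{head}-{tail}"
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", fix, text)


vocab = {"harvest", "wagon", "standing"}
page = "at har-\nvest time, a mule-\ndrawn wagon"
cleaned = rejoin_hyphenated(page, vocab)
# cleaned == "at harvest time, a mule-drawn wagon"
```

The vocabulary check is what separates “har-vest” (a line-break artifact) from “mule-drawn” (a real hyphenated compound); without it, every line-end hyphen would be ambiguous.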
35. More words from more books means more spelling mistakes. This is a good thing! It leads to improved spelling correction (in multiple languages) and more sensitive translation.
36. “Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.” - Hank Bromley, IA
37. “Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.” - JDB
38. Statistical analysis of co-occurrence (which terms tend to appear in the vicinity of which others) is useful not only for context-sensitive OCR but, more significantly, for building semantic maps and other kinds of knowledge representation. “dead as a door nail”: the term “door nail” is not commonly found elsewhere.
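Counting which terms appear near which others can be sketched with a sliding window and a pointwise mutual information (PMI) score. The four-sentence corpus, the window size, and the rough probability estimates below are assumptions for illustration only.

```python
import math
from collections import Counter


def cooccurrence(sentences: list[list[str]], window: int = 4):
    """Count single words and unordered word pairs seen within `window` words."""
    word_counts, pair_counts = Counter(), Counter()
    for words in sentences:
        word_counts.update(words)
        for i, left in enumerate(words):
            for right in words[i + 1:i + 1 + window]:
                if left != right:
                    pair_counts[frozenset((left, right))] += 1
    return word_counts, pair_counts


def pmi(a: str, b: str, word_counts: Counter, pair_counts: Counter) -> float:
    """Rough PMI estimate; -inf when the pair never co-occurs."""
    joint = pair_counts[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    p_joint = joint / sum(pair_counts.values())
    p_a = word_counts[a] / sum(word_counts.values())
    p_b = word_counts[b] / sum(word_counts.values())
    return math.log2(p_joint / (p_a * p_b))


corpus = [s.split() for s in (
    "dead as a door nail",
    "he shut the door",
    "the door was open",
    "dead as a door nail",
)]
words, pairs = cooccurrence(corpus)
nail_score = pmi("dead", "nail", words, pairs)
door_score = pmi("dead", "door", words, pairs)
```

In this toy corpus “nail” only ever appears in the idiom, while “door” also appears in other sentences, so “dead”/“nail” gets the higher PMI. That asymmetry is the kind of signal that lets context-sensitive OCR or a semantic map prefer one reading over another.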
39. Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
40. LSA (latent semantic analysis) is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.” - Wikipedia.org
41. (LSI = LSA in the context of information retrieval (IR).) “Clustering is a way to group documents based on their conceptual similarity to each other ... This is very useful when dealing with an unknown collection of unstructured text.”
42. “Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”
43. “[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”
44. “LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).” “This is especially important for applications using text derived from Optical Character Recognition (OCR) ...” - Wikipedia.org
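The LSA/LSI idea quoted above can be illustrated on a toy term-by-document matrix: a truncated singular value decomposition places documents in a small "concept" space, where documents can be similar without sharing a single term. The sketch below is an assumption-laden illustration: it substitutes a plain-Python orthogonalized power iteration for a library SVD routine, the seven-term vocabulary is invented, and it assumes `k` does not exceed the matrix rank.

```python
import math
import random


def doc_gram(matrix):
    """Compute A^T A for a term-by-document count matrix (rows are terms)."""
    n_docs = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n_docs)]
            for i in range(n_docs)]


def top_components(gram, k, iterations=300, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via orthogonalized power iteration."""
    rng = random.Random(seed)
    n = len(gram)
    found = []
    for _ in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        value = 0.0
        for _step in range(iterations):
            vec = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
            for _val, prev in found:  # strip directions already extracted
                dot = sum(a * b for a, b in zip(vec, prev))
                vec = [a - dot * b for a, b in zip(vec, prev)]
            value = math.sqrt(sum(a * a for a in vec))
            vec = [a / value for a in vec]
        found.append((value, vec))
    return found


def lsa_doc_vectors(matrix, k=2):
    """Document coordinates in the k-dim latent space (singular value times v_k)."""
    comps = top_components(doc_gram(matrix), k)
    return [[math.sqrt(val) * vec[d] for val, vec in comps]
            for d in range(len(matrix[0]))]


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


# Rows: whale, ship, sea, harpoon, corn, wagon, field. Columns: documents d0..d4.
term_doc = [
    [1, 0, 0, 0, 0],  # whale
    [1, 1, 0, 0, 0],  # ship
    [0, 1, 1, 0, 0],  # sea
    [0, 0, 1, 0, 0],  # harpoon
    [0, 0, 0, 1, 0],  # corn
    [0, 0, 0, 1, 1],  # wagon
    [0, 0, 0, 0, 1],  # field
]
vectors = lsa_doc_vectors(term_doc, k=2)
same_topic = cosine(vectors[0], vectors[2])   # d0 and d2 share no terms
cross_topic = cosine(vectors[0], vectors[3])  # seafaring vs. farming
```

Documents d0 (“whale ship”) and d2 (“sea harpoon”) have no term in common, yet they land on the same latent axis because d1 links their vocabularies; d3 lands on the other axis. Nothing in the computation looks at spelling or language, which is the property the quotations emphasize.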
45. The More Data, The Better ... The More Books, The Better Web Search.