Web classification of Digital Libraries using GATE Machine Learning

Stephen J. Stose
April 18, 2011
IST 565: Final Project
Introduction
Text mining is considered by some as a form of data mining that operates on unstructured and semi-structured texts. It applies natural language processing models to analyze textual content in order to extract and generate actionable (i.e., potentially useful) knowledge from the information inherent in words, sentences, paragraphs and documents (Witten, 2005). However, many of the linguistic patterns easy for humans to comprehend and reproduce end up being astonishingly complicated for machines to process. For instance, machines struggle to interpret natural language forms quite simple for most humans, such as metaphor, misspellings, irregular forms, slang, irony, verbal tense and aspect, anaphora and ellipses, and the context that frames meaning. On the other hand, humans lack a computer's ability to process large volumes of data at high speeds. The key to successful text mining is to combine these assets into a single technology.

There are many uses of this new interdisciplinary effort at mining unstructured texts towards the discovery of new knowledge. For instance, some techniques attempt to extract structure to fill out templates (e.g., address forms) or extract key phrases as a form of document metadata. Others attempt to summarize the content of a document, identify a document's language, classify the document into a pre-established taxonomy, or cluster it along with similar documents based on token or sentence similarity (see Witten, 2005 for others). Other techniques include concept linkage, whereby concepts across swathes of scientific research articles can be linked to elucidate new hypotheses that otherwise wouldn't occur to humans, as well as topic tracking and question answering (Fan et al., 2006).

Consider the implications of being able to automatically classify text documents. Given the massive size of the World Wide Web and all it contains (e.g., news feeds, e-mail, medical and corporate records, digital libraries, journal and magazine articles, blogs), imagine the practical consequences of training machines to automatically categorize this content. Indeed, text classification algorithms have already had moderate success in cataloging news articles (Joachims, 1998) and web pages (Nigam, McCallum, Thrun & Mitchell, 1999). Some text mining systems have even been incorporated into digital library systems (e.g., the Greenstone DAM), such that users benefit from digital library items being automatically co-referenced by use of semantic annotations (Witten, Don, Dewsnip & Tablan, 2004).
Natural language pre-processing for text and document classification

Text and document classification make use of natural language processing (NLP) technology to pre-process, encode and store linguistic features of texts and documents, and then to process selected features using Machine Learning (ML) algorithms, which are in turn applied to a new set of texts and documents.

The first step in this process usually involves tokenization, a process that removes punctuation marks, tabs, and other non-textual characters and replaces them with white space. This produces a stream of word tokens which forms the set of data upon which further processing occurs. From this stream, a filter is usually applied to remove all stop-words (e.g., prepositions, articles, conjunctions, etc.) that otherwise provide little if any meaning.
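The tokenization and stop-word filtering just described can be sketched in a few lines of Python (a minimal illustration, not GATE's actual tokenizer, which handles many more cases; the stop-word list here is a tiny illustrative sample):

```python
import re

# A small illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "on"}

def tokenize(text):
    # Replace punctuation and other non-textual characters with white space,
    # then split the resulting stream of characters on white space.
    cleaned = re.sub(r"[^A-Za-z0-9']+", " ", text)
    return cleaned.lower().split()

def filter_stop_words(tokens):
    # Drop tokens that carry little meaning on their own.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = filter_stop_words(tokenize("The library's digital collection, searchable on the Web."))
```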
In a related vein, we see in such instances that tokens are not always the same as words per se. Tokenization may wrongly insert boundaries within two- and three-word tokens: "New York" should be considered one token, not two (not "New" and "York"). Hyphens and apostrophes present difficult challenges. Often words like "don't" are tokenized into two separate words, "do" and "n't", the latter of which is later transduced as "n't" = "not". When considering all the continually changing conventions used to display words as text, you begin to appreciate the multitude of problems.

Often, pre-processing can stop here, as many text and document classification methods rely on simple tokenization, such that each token represents one term amongst a bag of other words occurring within each document and across all documents in the corpus. One common approach to determining word importance within a bag-of-words is the term frequency-inverse document frequency (tf-idf) approach.

In this approach, each document is represented as a vector of terms. In the simplest encoding, each term is binary: 1 (term occurs) or 0 (term does not occur). Weighting schemes then assign more weight to terms occurring frequently within relevant documents but infrequently across all documents considered together. In a corpus of documents about political parties, for instance, the word "political" may occur often in relevant documents, but its weight would be low given that it also occurs frequently in all the other documents within the corpus.

This renders the term rather meaningless for distinguishing relevant from non-relevant documents, as they are all about something political. If, on the other hand, the word "suffrage" occurs frequently in relevant documents but rarely across the corpus, its specificity, and hence its weight for determining document type, is considered much greater. This is the reason tf is balanced with (i.e., multiplied by) idf, a factor that diminishes the weight of frequent terms and increases the weight of rare ones (for the mathematics of such an approach, see Hotho, Nurnberger & Paass, 2005).
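The weighting just described can be sketched as follows. This is a simplified variant (raw counts for tf, log(N/df) for idf); GATE and the literature use several refinements, and the toy corpus is illustrative:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf weights for a list of tokenized documents.

    tf is the raw count of a term in a document; idf = log(N / df),
    where N is the corpus size and df the number of documents
    containing the term.
    """
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # each document counts a term at most once
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

corpus = [
    ["political", "suffrage", "vote"],
    ["political", "party", "vote"],
    ["political", "economy"],
]
w = tf_idf(corpus)
# "political" occurs in every document, so its idf (and hence weight) is 0;
# "suffrage" occurs in only one document, so it receives the highest weight.
```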
In this way a set of documents can be mined for keywords. If all of the documents within our corpus are related to political parties, the word "political" hardly qualifies as a keyword. Words that occur frequently within a subset of documents, by contrast, serve to categorize content. As such, if the word "suffrage" occurs frequently in some documents but not in all of them, it qualifies as a good candidate keyword for classifying the relevant texts. A good text-mining program utilizing the tf-idf weighting scheme would be able to extract this term and present it to a human as a possible keyword.

These weighting schemes are applied within vector space models in order to retrieve, filter and index terms occurring in documents (Salton, Wong & Yang, 1975). Such models form the basis of many search and indexing engines (e.g., Apache Lucene), insofar as the HTML content of each Web page is crawled and indexed to determine its relevance based on words and phrases occurring within the <title> and heading elements, among other signals (see Chau & Chen, 2008).
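In a vector space model, the relevance of one term vector to another is typically scored by cosine similarity. A minimal sketch (the term weights below are illustrative stand-ins for tf-idf scores):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors.

    vec_a, vec_b: dicts mapping terms to weights (e.g., tf-idf scores).
    Returns a value in [0, 1] for non-negative weights; 1 means the
    vectors point in the same direction.
    """
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = {"digital": 1.0, "library": 1.0}
page = {"digital": 0.8, "library": 0.9, "news": 0.2}
score = cosine_similarity(query, page)
```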
Still, a bag-of-words approach to text and document mining can be improved upon by incorporating expert domain knowledge into the analysis. For instance, experts can identify domain-specific words, phrases and/or rules. If a document or Web page is checked against a dictionary of these listed features, the documents containing them will be deemed more relevant to the search. This is what often occurs after tokenization in many kinds of NLP software (e.g., GATE). That is, tokenized words are mapped to an internal gazetteer (an internal dictionary), which operates as a sort of pre-classification, such that commonly occurring or well-known entities are extracted and annotated as such. For instance, a gazetteer might by default be outfitted to recognize all common first names and surnames (Noam or Bradeley; Chomsky or Manning), organizations (UN, United Nations, OPEC, White House, Planned Parenthood), or date formats (02/10/1973 or February 10, 1973). Thus, the selection of these kinds of annotations constrains the set of words chosen to represent documents in vector space models.

Thus, if we want to ensure a domain-specific vocabulary is annotated as relevant to text or document classification, we might create a separate gazetteer list for those terms, and annotate each term as belonging to a particular category. As described later, we created a gazetteer of terms most likely to occur on Web sites functioning as digital libraries, such that when a random Web site contains these terms it is more likely to be classified as relevant.
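To make the mechanism concrete: a GATE gazetteer pairs an index file, which maps each list file to a majorType (and optionally a minorType), with plain-text list files holding one entry per line. The file names, entries, and annotation comments below are an illustrative sketch, not the exact files used in this study:

```text
# lists.def (index file) -- each line is list_file:majorType[:minorType]
dlwords.lst:dlwords

# dlwords.lst -- one entry per line; matches are annotated as
# {Lookup.majorType=dlwords}
Advanced Search
Digital Library
Digital Collection
Special Collections
```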
Other forms of linguistic pre-processing exist which may or may not enhance document and text classification algorithms, depending on the nature and specificity of the task. For instance, sentence splitters chunk tokens into sentence spans when phrases are an important feature in classification. At times, tagging each term within a document with its part of speech (POS tagging) is important; for instance, it allows for the classification of documents into language groups (e.g., Spanish vs. English vs. German) or sentence types.

Given that language is full of ambiguity, of which we'll only scratch the surface here, Named-Entity (NE) transducers ease the confusion by contextualizing certain tokens. For instance, General Motors can be recognized as a company, and not as the name of a military officer (e.g., General Lee). Likewise, "May 10" is a date, "May Day" is a holiday, "May I leave the room" is a request, and "Sallie May Jones" is a person. That is, the transducer disambiguates homographs, homonyms and other such linguistic confusions.

Another common problem in pre-processing is co-reference matching. Often, the same entity is known in different ways or by different spellings: "center" is the same as "centre"; NATO is the same entity as the North Atlantic Treaty Organization; and Mr. Smith is the same person as Joachim Smith, who is the same person as "he" or "him" (e.g., "Joachim Smith went to town. Everyone greeted him as Mr. Smith and he didn't care for that"). This is an important element when computing frequency weights in vector space models: two different tokens referencing the same entity should be co-referenced and counted as one term occurring with frequency = 2, not as two separate terms each with frequency = 1.
Basic classification models

Most classification models are forms of supervised learning, in that each input value (e.g., a word vector) is paired with an expected discrete output value (i.e., the pre-defined category). During training, the supervised algorithm analyzes these pairings to produce an inferred classifier function, which during testing should be able to predict the output value (i.e., the correct classification) for any new valid input. One instance commonly used in document classification is training a classifier to automatically classify Web pages into a pre-established taxonomy of categories (e.g., sports, politics, art, design, poetry, automobiles, etc.). The accuracy of the trained function in correctly classifying the test set is then computed as a performance measure, each document falling within the expected class by some degree of confidence.

Herein we establish a trade-off between recall and precision. High precision implies a high threshold for allowing membership into a class. In this way, the algorithm refuses to accept many false positives, but in doing so sacrifices its ability to recall an otherwise larger set of documents, and thus risks missing many relevant documents (i.e., they become false negatives). On the other hand, if a threshold favoring high recall is permitted, we risk lowering our rate of precision and thus allow many documents not relevant to the category (i.e., false positives) into the set. The F1-score serves as a statistical compromise (the harmonic mean) between recall and precision.
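The trade-off can be made concrete with a small helper computing all three measures from raw counts (a sketch; tp, fp and fn are counts of true positives, false positives and false negatives):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and the F1-score from raw counts.

    precision = tp / (tp + fp)  -- how many retrieved documents are relevant
    recall    = tp / (tp + fn)  -- how many relevant documents are retrieved
    F1 is their harmonic mean, trading the two off against each other.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A strict classifier: few false positives, but many relevant documents missed.
p, r, f1 = precision_recall_f1(tp=8, fp=1, fn=9)
```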
For the mathematical details of many of the classification algorithms, we defer to Hotho, Nurnberger and Paass (2005), but here outline the rudimentary basics of the four most common algorithms: Naïve Bayes, k-nearest neighbor, decision trees, and support vector machines (SVM).

Naïve Bayes applies conditional probability to decide whether document d, with its vector of terms t1,…,tn, belongs to a certain class, combining the term likelihoods P(ti|classj). Documents whose probability reaches a pre-established threshold are deemed as belonging to the category.
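A minimal multinomial Naïve Bayes sketch of this idea. Add-one smoothing is an assumption not spelled out above, and the toy corpus and labels are illustrative:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect the counts needed to estimate P(class) and P(term | class)."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
        vocab.update(doc)
    return class_counts, term_counts, vocab

def classify(doc, class_counts, term_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(term | class)."""
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, count in class_counts.items():
        score = math.log(count / n)
        total = sum(term_counts[c].values())
        for t in doc:
            # Add-one (Laplace) smoothing so unseen terms get nonzero probability.
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["suffrage", "vote"], ["goal", "match"], ["vote", "party"]]
labels = ["politics", "sports", "politics"]
model = train_naive_bayes(docs, labels)
```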
Instead of building a probability model, the k-nearest neighbor method of classification is an instance-based approach that operates on the basis of the similarity of a document to its k nearest neighbors. Word vectors are stored as document attributes and document labels as the class; most computation occurs at testing time, when a class label is assigned based on the most frequent class among the k training samples nearest to the document to be classified.

Decision trees (e.g., C4.5) operate by information gain over a recursively built hierarchy of word selections. From the labeled documents, the term t that best predicts the class, according to the amount of information gain, is selected. The tree splits into subsets, one branch for documents containing the term and the other for those without, then finds the next term to split on; this is applied recursively until all documents in a subset belong to the same class.

Support vector machines (SVM) operate by representing each document as a weighted vector td1,…,tdn based on word frequencies within each document. The SVM determines a maximum-margin hyperplane that separates positive (+1) class examples from negative (-1) class examples in the training set. Only a small fraction of documents are support vectors, and any new document is classified as belonging to the class if its decision value is greater than 0, and as not belonging to the class if it is less than 0. SVMs can be used with linear or polynomial kernels that transform the space to ensure the classes can be separated linearly.
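The linear decision rule described here can be sketched directly: once training has produced a weight vector and a bias, classifying a new document reduces to the sign of a dot product. The weights below are hypothetical, not learned:

```python
def svm_decision(weights, bias, doc_vector):
    """Apply a trained linear SVM decision function: sign(w . x + b).

    weights: dict term -> learned weight; doc_vector: dict term -> tf-idf
    weight. Returns +1 (in the class) if the score is positive, else -1.
    """
    score = bias + sum(weights.get(t, 0.0) * v for t, v in doc_vector.items())
    return 1 if score > 0 else -1

# Hypothetical weights a trained SVM might assign: class-specific terms
# strongly positive, generic terms near zero, off-class terms negative.
weights = {"suffrage": 1.2, "election": 0.8, "political": 0.05, "sports": -1.0}
doc = {"suffrage": 0.9, "political": 0.3}
label = svm_decision(weights, 0.0, doc)
```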
While the level of performance of each of these classifiers depends on the kind of classification task, the SVM algorithm most reliably outperforms other kinds of algorithms on document classification (Sebastiani, 2002), and thus will be utilized with priority in the study that follows.
Goals and objectives of the current study

For our own purposes, we focus on the domain of web classification in order to achieve a twofold purpose: 1) to learn and teach my colleagues about the natural language processing suite known as GATE (General Architecture for Text Engineering), especially with regard to its Machine Learning (ML) capabilities; and 2) to utilize the GATE architecture in order to classify web documents into two groups: those sites that function as digital library sites (DL), as distinguished from all other, non-digital-library sites (non-DL).
The purpose of such an exercise is to identify, from amongst the millions of websites, only those sites that operate as digital libraries. Assuming digital library sites are identifiable through certain characteristic earmarks that distinguish them as containing searchable digital collections, the goal is to develop a set of annotations that, by way of an ML algorithm, can be applied as part of a web crawler in order to extract the URL of each site that qualifies as belonging to the DL group, while omitting those that do not. While somewhat confident it is possible to obtain a strong level of precision in retrieving many of the relevant sites, of greater concern are sites that merely seem relevant (e.g., those merely about digital libraries); that is, the false positives.

The current author is developing as a prototype a website (www.digitallibrarycentral.com) that seeks to operate as a digital library of all digital library websites: a sort of one-stop visual reference library that points to the collection of all digital libraries. Achieving the goal outlined here would serve to populate this site. Before such grand ideals can be implemented, however, the current paper will outline some of the first steps in implementing the GATE ML architecture towards this objective. Of immediate concern is understanding the GATE architecture and how it functions in natural language processing tasks, in order that we can properly pre-process and annotate our target corpora before carrying out ML algorithms on them. We turn now to an explanation of the GATE architecture.
The GATE architecture and text annotation

GATE (General Architecture for Text Engineering) is a set of Java tools developed at the University of Sheffield for natural language processing and text engineering tasks in various languages. At its core is an information extraction system called ANNIE (A Nearly-New Information Extraction System), a set of functions that operates on individual documents (including XML, TXT, DOC, PDF, database and HTML structures) and across the corpora to which many documents can belong. These functions comprise tokenizing, a gazetteer, sentence splitting, part-of-speech tagging, named-entity transduction, and co-reference tagging, among others. GATE also boasts extensive tools for RDF and OWL metadata annotation, for creating ontologies for use within the Semantic Web.
Most of these language processes operate seamlessly within GATE Developer's integrated development environment (IDE) and graphical user interface (GUI), the latter allowing users to visualize these functions within a user-friendly environment. For instance, a left-sidebar resource tree displays the Language Resources panel, where the documents and document sets (the corpus) reside. Below that, it also displays the ANNIE Processing Resources (PRs), the natural language processing functions mentioned above that form part of an ordered pipeline to linguistically pre-process the documents. A right sidebar lists the resulting color-coded annotations after pipeline processing. Additionally, a bottom table exposes the various resulting annotation attributes, as well as a popup annotation editor that allows one to edit and classify (i.e., provide values to) these annotation sets for training, prototyping, and/or analysis. Figure 1 below shows all of these elements in action.

Figure 1.
These tools complete much of the gritty text-engineering work of document pre-processing so that useful research can be quickly deployed, in a way that is visually explicit and apparent to those less initiated with these common natural language engineering pre-processing tasks, and in a way that allows for editing these functions as well as for introducing the various pre-processing plugins and other scripts developed for individual text-mining applications.

Figure 1 displays four open documents uploaded directly by entering their URLs: Newsweek and Reuters (news sites), and JohnJayPapers and DigitalScriptorium (digital libraries). These, along with 10 other news sites and 11 other digital libraries, all belong to the corpus named "DL_eval_2" above (which will serve as Sample 1 later, our first test of DL discrimination). This provides a testing sample to ensure the pre-processing pipeline and Machine Learning (ML) functions operate correctly on our soon-to-be annotated documents. Just by uploading URLs, GATE by default automatically annotates the HTML markup, as can be seen within the bottom right sidebar where the <a>, <body> and <br> tags are located.
After running the PR pipeline over the "DL_eval_2" corpus, the upper right sidebar shows the annotations that result from running the tokenizer, gazetteer, sentence splitter, POS tagger, NE transducer and co-referencing orthomatcher. Organization is checked and highlighted in green, for instance, and by clicking on "White House" (one instantiation of Organization), we learn about GATE's {Type.feature=value} syntax, which in the case of "White House" is represented accordingly: {Organization.orgType=government}. This syntax operates as the core annotation engine, and allows for the scripting and manipulation of annotation strings. The ANNIE PRs in this case provide automatic annotations that serve as a rudimentary start upon which any text engineering project can build. There are many other plugins and PR functions we will not discuss within this review.
For our own purposes, we want to call attention to two annotation types ANNIE generates: 1) Token, and 2) Lookup. A few examples of the Type.feature syntax for the Token type are: the kind of token {Token.kind=word}; the token character length {Token.length=12}; the token POS {Token.category=NN}; the token orthography {Token.orth=lowercase}; or the content of the token string {Token.string=painter}. Our interest is in analyzing string content: determining whether a particular document is an instance of a digital library or not will require an ML analysis of the string unigrams comprising both DL sites and non-DL sites.
We can either use all tokens (after removing stop-words) to analyze the tf-idf weighting of the documents in question, or we can constrain the kinds of tokens analyzed within the documents by making further specifications. The ANNIE annotation schema provides many default annotations (e.g., Person, Organization, Money, Date, Job Title, etc.) to constrain the kinds of words chosen for analysis, as can be seen in Figure 1 in the upper right sidebar. Additionally, the Gazetteer provides many other kinds of dictionary lookup entries (60,000 arranged in 80 lists) above and beyond the ANNIE default annotations. For instance, the list named "city" has as dictionary entities a list of all worldwide cities, such that by mapping these onto the text, a new annotation of the kind {Lookup.minorType=city} is created, annotating each instance of a city with this markup. The lookup uses a set-subset hierarchy we will not describe, except to say that {Lookup.majorType} is a parent of {Lookup.minorType}. Thus there are different kinds of locations, for instance city and country; city and country are minorTypes (children) of {Lookup.majorType=location}.
Classification with GATE Machine Learning

Given that GATE Developer extracts and annotates training documents, several processing plugins that operate at the end of a document pre-processing pipeline serve Machine Learning (ML) functions. The Batch Learning PR has three functions: chunk recognition, relation extraction and classification. This paper is interested in applying supervised ML processes to classify web documents as instances of digital libraries (DL) or not (non-DL). Supervised ML requires two phases: learning and application.
The first phase requires building a data model from instances within documents that have already been correctly classified. In our case, it requires giving value to certain sets of annotations that, as a whole, will represent the document instance (i.e., the website) as either a hit (DL) or a miss (non-DL). The point is to develop a training set D = (d1,…,dn) of correctly classified DL website documents (d) to build a classification model able to discriminate any future website d as being either a true DL or some other website (non-DL).

The first task requires annotating each document as a whole, and in doing so assigning it to the dependent DL or non-DL class. Up until now, annotations have referred to parts of a document (tokens, sentences, dates, etc.). To annotate a whole document, we begin by creating a new {Type.feature=value} term. To do so, we demarcate the entire text within each document and create a new annotation type called "Mention," a feature called "type" (not to be confused with the annotation Type itself), and two distinct values: {Mention.type=dl} and {Mention.type=nondl}.
The attributes used to predict class membership are the two annotation types we highlighted above: 1) Token {Token.string}, and 2) Lookup {Lookup.majorType}. To take full advantage of the Gazetteer, we added a list entry named "dlwords" (i.e., digital library words) with a list of terms commonly found on many digital library websites. This list of words is reproduced below1:
Advanced Search, Digital Collection(s), Manuscript(s), Archive(s), Digital Content, Repository(ies), Browse, Digital Library(ies), Search Catalog, Digitization, Search Tip(s), Collection(s), Digitisation, Special Collection(s), Digital Image Collection(s), University(ies), Digital Archive(s), Keyword(s), University Library(ies), Image(s), Library(ies)
All of our analyses will operate using the bag-of-words, to which GATE by default applies tf-idf weighting schemes over a specified n-gram size (we'll be using only unigrams). Two attribute annotations, each representing a slightly different bag-of-words, will be used to predict DL or non-DL class membership:

1. When the {Token.string} attribute is chosen to predict {Mention.type} class membership, the bag-of-words includes all non-stop-word tokens within its attribute set.

2. When the Gazetteer is used and "dlwords" is included as part of its internal dictionary, the attribute {Lookup.majorType="dlwords"}, along with all the other 60,000 entries, will serve to constrain the set of tokens predicting {Mention.type} class membership.
The GATE Batch Learning PR requires an XML configuration file specifying the ML parameters and the attribute-class annotation sets2. We will only discuss a few essential settings here. For starters, we set the evaluation method to "holdout" with a ratio of .66/.33 training to test.
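A holdout split of this kind can be sketched as follows. This is a generic shuffle-and-split, not GATE's internal implementation; note that with 25 documents a 0.66 ratio yields a 16/9 split:

```python
import random

def holdout_split(documents, ratio=0.66, seed=42):
    """Shuffle a corpus and split it into training and test sets.

    ratio is the fraction of documents held for training; the rest
    are held out for testing.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # fixed seed for reproducibility
    cut = int(len(docs) * ratio)
    return docs[:cut], docs[cut:]

train, test = holdout_split(range(25))
```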
The main algorithm we will be using is SVM (in GATE, SVMLibSvmJava), with the following parameter settings being varied, the ones reported below providing the best results:
1 Note: the Gazetteer neither stems nor normalizes case for these entries; thus plural and uppercase variations of these words were provided, but are not reproduced here.
2 For a list of all parameter setting possibilities, see http://gate.ac.uk/sale/tao/splitch17.html#x22-43500017.
-t: kernel: 0 (linear (0) vs. polynomial (1))
-c: cost: 0.7 (controls the softness of the margin, trading training error against generalization)
-tau: uneven margins: 0.5 (varies the positive-to-negative instance ratio)
The XML configuration file is reproduced below, in case others are interested in getting started using GATE ML for basic document classification:

<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="false"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <multiClassification2Binary method="one-vs-another"/>
  <EVALUATION method="holdout" ratio="0.66"/>
  <FILTERING ratio="0.0" dis="near"/>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
          options=" -c 0.7 -t 0 -m 100 -tau 0.4 "/>
  <DATASET>
    <INSTANCE-TYPE>Mention</INSTANCE-TYPE>
    <NGRAM>
      <NAME>ngram</NAME>
      <NUMBER>1</NUMBER>
      <CONSNUM>1</CONSNUM>
      <CONS-1>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
      </CONS-1>
    </NGRAM>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Mention</TYPE>
      <FEATURE>type</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>
The ngram is set to 1 (unigram), and the <CLASS/> tag within the <ATTRIBUTE> tag indicates that this attribute is the class being predicted, with <TYPE>Mention</TYPE> and <FEATURE>type</FEATURE> as {Mention.type=dl or nondl}. <TYPE>Lookup</TYPE> and <FEATURE>majorType</FEATURE> will be substituted in to accommodate the other attribute value. Thus, after running the ANNIE PR pipeline over the "DL_eval_2" corpus, the ML Batch Learning PR is placed alone in the pipeline to run over the annotated set of documents in Evaluation mode. As mentioned, the Batch Learning PR can also operate in Training-Application mode on two separate sets of corpora: one corpus for training and another for application (i.e., testing). The initial results below reflect only the results of a holdout 0.66 evaluation run over one corpus; the current report does not utilize the Training-Application mode.
Sample corpora and results

The Web now boasts over 8 billion indexable pages (Chau & Chen, 2008). Thus training an ML algorithm to pick out the estimated few thousand digital libraries will not be a simple matter. Assuming there are 5 thousand library-standard digital libraries (which may be a high estimate), some of which reside within umbrella Digital Asset Management portals, discriminating these will be cherry-picking at a ratio of, on average, 5 digital libraries per every 8 million Web sites. Spiders (or Web crawlers) can curtail this number greatly by crawling only to a specified argument depth from the starting URL.
Unfortunately, librarians are not always good at applying search engine optimization (SEO) standards, and many well-known DLs are deeply embedded in URL arguments or served on unusual ports (the University of Wyoming's uses port 8180) or within site subdomains. Thus, curtailing this argument space too much will result in decreased recall. Additionally, there are many non-DL websites that use language quite similar to DL websites. For instance, many websites operate as librarian blogs or digital library magazines that serve as discussion spaces regarding DLs, but are not DLs themselves. Unfortunately, these false positives will prove daunting to exclude.
We seek only DLs or DL portals that boast archival collections that have been digitized, and as such serve as electronic resources that are co-referenced, searchable, browsable, and catalogued according to some taxonomy or ontology. One suggested way of narrowing down to only these kinds of resources might be to tap into the <meta content> elements, in which librarians often apply conventions such as Dublin Core to demarcate these spaces as digital collection spaces. This is an avenue for further research, and is possible within GATE by utilizing a {Meta.content} attribute. On quick pre-testing, however, it provided no worthy results.
In what follows are two samples of data we evaluated for the ML classification of DL and non-DL websites.

Sample 1

This sample mostly ensured that the GATE ML software and configuration files were operating correctly given the kinds of document-level attributions made. As mentioned, the first corpus we tested was called "DL_eval_2," which contained 25 websites: 13 DL sites (from Columbia University Digital Collections) and 12 distinct news sites, listed below:
Reuters, LA Times, CNN, Bloomberg, Newsweek, The Guardian, Chicago Tribune, BBC, National Review, CS Monitor, Boston Globe, Wall Street Journal
Using both {Token.string} and {Lookup.majorType} as attributes, the results of the classification of {Mention.type} as either DL or non-DL follow. These results correspond to the ML configuration file found above and utilize the SVMLibSvmJava ML engine at .66 holdout evaluation. The training set thus included 16/25 websites (.66) and the ML algorithm was tested on the 9/25 remaining sites (.33).

{Token.string} misclassified only one instance: Bloomberg News was falsely classified as belonging to {Mention.type=dl}. Nothing in the text of Bloomberg's front page gave any indication as to why this was the case. Thus, precision, recall and the F1 value for the set were each 0.89.

{Lookup.majorType} comprises the Gazetteer, but here also includes the digital library terms ("dlwords") we added. It is thus a more constrained bag-of-words, smaller than the set of all tokens. Classification improved to 100% using the "dlwords"-enhanced Gazetteer. Given that this is such a small sample, we cannot conclude very much, except to say that there is something about DL content, when compared to ordinary mainstream news sites, that allows for their discrimination.
Sample 2

To train the ML algorithm to make our target discrimination and allow for generalizable conclusions, we increased the sample size. Sample 2 consists of 181 non-DLs and 62 DLs. Each set was chosen in the following way:

Non-DL set

A random website generator was used to generate 181 websites that were not digital libraries, were English-language only, and had at least some text (excluding websites generated with only images, etc.).
• http://www.whatsmyip.org/random_websites/
DL set

A set of 62 university digital libraries was chosen, mostly across three main DL university portals:

• Harvard University Digital Collections
  o http://digitalcollections.harvard.edu/
• Cornell University Libraries “Windows on the Past”
  o http://cdl.library.cornell.edu/
• Columbia University Digital Collections
  o http://www.columbia.edu/cu/lweb/digital/collections/index.html
These websites are slightly more representative, but still fall far short of the kind of precision that will be needed to crawl the web as a whole. The results bode well, nevertheless. Again, using both {Token.string} and {Lookup.majorType} as attributes, the results of the classification of {Mention.type} as either DL or nonDL follow.
The .66/.33 holdout training/test split of the data was 160/83 of the 243 total websites: 40/22 of the 62 DL sites and 120/61 of the 181 non-DL sites.
The Naïve Bayes and C4.5 algorithms misclassified 100% of the DL websites (22/22) with both sets of attributes, achieving a total F1 of only 0.73. It is not clear why this is the case.
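One plausible reading of that 0.73 figure, offered here as conjecture rather than as the paper's own analysis: a classifier that collapses to the majority class and labels every test instance nonDL would get exactly 61 of the 83 test sites right, which matches the reported value.

```python
# If a learner degenerates to always predicting the majority class
# ("nonDL"), its accuracy is simply the majority share of the test set.
test_dl, test_nondl = 22, 61                  # test split reported above
majority_accuracy = test_nondl / (test_dl + test_nondl)
print(round(majority_accuracy, 2))            # -> 0.73
```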
Given that SVM is well known as the best-performing classifier for texts (Sebastiani, 2002), we stick with it for our purposes. {Token.string} performed slightly better than {Lookup.majorType}.
In both cases, there were very few misclassifications, and most of these were false positives. When only the Gazetteer entry words, including “dlwords,” were taken into account ({Lookup.majorType}), 3/22 DLs were misclassified as non-DL (precision=0.95; recall=0.86; F1=0.90) and 1/61 non-DLs was misclassified as DL (precision=0.95; recall=0.98; F1=0.97).
When all tokens were entered into the bag-of-words ({Token.string}), precision was perfect for DL classification and recall was perfect for nonDL classification. That is, 3/22 DLs were still misclassified as nonDL (precision=1.0; recall=0.86; F1=0.93). All 61/61 nonDLs were
classified correctly, resulting in perfect recall but imperfect precision, insofar as 64 total websites were classified as non-DL: the 61 expected, plus the 3 that should have been classified as DL (precision=0.95; recall=1.0; F1=0.98).
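These figures can be reproduced directly from the per-class confusion counts. A short sketch, using the counts reported above for {Token.string}:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from per-class confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# {Token.string}: 3 of 22 DLs missed, no non-DLs flagged as DL
print(prf(tp=19, fp=0, fn=3))    # DL class    -> (1.0, 0.86, 0.93)
print(prf(tp=61, fp=3, fn=0))    # nonDL class -> (0.95, 1.0, 0.98)
```

The same function recovers the {Lookup.majorType} DL-class figures from tp=19, fp=1, fn=3.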
Thus, overall, using all tokens achieved slightly higher rates of precision and recall for the discrimination of DL websites from all websites, based on this small and still very non-proportional sample. Total F1 values were 0.96 for {Token.string} and 0.95 for {Lookup.majorType}.
The question remains as to whether both attributes misclassified the exact same websites. It turns out that 2/3 of the misclassified websites were the same for both attributes. Figure 2 below illustrates the breakdown of these statistics per attribute.
{Token.string} (Precision=1.0, Recall=0.86)

False negatives (misclassified as nonDL):
• Digital Scriptorium
  o www.scriptorium.columbia.edu
• Holocaust Rescue & Relief (Andover-Harvard Theological)
  o www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection
  o www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/

False positives (misclassified as DL): none

{Lookup.majorType} (Precision=0.95, Recall=0.86)

False negatives (misclassified as nonDL):
• Harvard Business Education for Women (1937-1970)
  o http://www.library.hbs.edu/hc/daring/intro.html#nav-intro
• Holocaust Rescue & Relief (Andover-Harvard Theological)
  o www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection
  o www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/

False positives (misclassified as DL):
• www.spi-poker.sourceforge.net

Figure 2.
Conclusion

In this paper we discussed how GATE (General Architecture for Text Engineering) employs Machine Learning to classify documents from the web into two categories: websites that operate as digital library sites, and websites that do not. This exercise was completed first in order to learn about GATE, but also to provide a possible solution to populating a site the current author is creating for digital libraries (www.digitallibrarycentral.com).
No current directory exists as a single-stop, go-to resource for digital libraries; as it is, digital libraries are difficult to find and hence often un- or under-utilized by the ordinary web user. By creating a digital library of all digital libraries, we hope to bring the ordinary user to the plethora of digitized resources available, and to categorize these digital collections according to a taxonomy that allows for the collation of similar kinds and types of digital libraries. Indeed, once these digital resources are all
collected, GATE Machine Learning might provide a solution to the automatic classification of these resources into the supervised taxonomy.
As it is, we first seek to locate these resources using Machine Learning. If the web were made up of three ordinary non-DL websites for every one DL website, the classifier we trained would have a very easy time locating all of the DLs (with 96% accuracy).
As it is, however, of the 8 billion websites in existence today, we reckon that only 3-6 thousand operate as digital libraries in some form or another. Thus, a lot of work still needs to be done in order to find the DL needle in the haystack of all websites online today.
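To make the needle-in-a-haystack point concrete, here is some hedged back-of-the-envelope arithmetic. It assumes the web-size and DL-count estimates above and, purely for illustration, that a crawler-scale classifier kept the Sample 2 {Lookup.majorType} error profile of roughly one false positive per 61 non-DLs:

```python
total_sites = 8_000_000_000   # rough web size cited above
true_dls = 5_000              # midpoint of the 3-6 thousand estimate
fp_rate = 1 / 61              # Sample 2 {Lookup.majorType}: 1 false positive per 61 non-DLs

# Even assuming every true DL is found, the non-DL pool is so large
# that false positives swamp the results.
false_positives = (total_sites - true_dls) * fp_rate
precision = true_dls / (true_dls + false_positives)
print(f"{false_positives:,.0f} false positives; precision ~ {precision:.5f}")
```

Under these assumptions the crawl would surface on the order of a hundred million false positives, and precision would fall well below one in a thousand: this is why substantially more filtering work is needed before crawling the web as a whole.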
References

Chau, M. and Chen, H. (2008). A machine learning approach to web page filtering using content and structure analysis. Decision Support Systems, 44(2), 482-494.

Hotho, A., Nurnberger, A. and Paass, G. (2005). A brief survey of text mining. LDV Forum – GLDV Journal for Computational Linguistics and Language Technology.

Joachims, T. (1998). Text categorization with Support Vector Machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning (ECML), Springer.

Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1-47.

Witten, I. H. (2005). “Text mining.” In Practical handbook of internet computing, ed. M. P. Singh. Chapman & Hall/CRC Press, Boca Raton, Florida.

Witten, I. H., Don, K. J., Dewsnip, M. and Tablan, V. (2004). Text mining in a digital library. Journal of Digital Libraries, 4(1), 56-59.