Stephen J. Stose
April 18, 2011
IST 565: Final Project

Web classification of Digital Libraries using GATE Machine Learning


Introduction
Text mining is considered by some to be a form of data mining that operates on unstructured and semi-structured texts. It applies natural language processing models to analyze textual content in order to extract and generate actionable (i.e., potentially useful) knowledge from the information inherent in words, sentences, paragraphs and documents (Witten, 2005). However, many of the linguistic patterns easy for humans to comprehend and reproduce turn out to be astonishingly complicated for machines to process. For instance, machines struggle to interpret natural language forms quite simple for most humans, such as metaphor, misspellings, irregular forms, slang, irony, verbal tense and aspect, anaphora and ellipses, and the context that frames meaning. On the other hand, humans lack a computer's ability to process large volumes of data at high speed. The key to successful text mining is to combine these assets into a single technology.
  	
  
	
  
There are many uses of this new interdisciplinary effort at mining unstructured texts towards the discovery of new knowledge. For instance, some techniques attempt to extract structure to fill out templates (e.g., address forms) or to extract key phrases as a form of document metadata. Others attempt to summarize the content of a document, identify a document's language, classify the document into a pre-established taxonomy, or cluster it along with similar documents based on token or sentence similarity (see Witten, 2005 for others). Still other techniques include concept linkage, whereby concepts across swathes of scientific research articles can be linked to elucidate new hypotheses that would otherwise not occur to humans, as well as topic tracking and question answering (Fan et al., 2006).
  	
  
	
  
Consider the implications of being able to automatically classify text documents. Given the massive size of the World Wide Web and all it contains (e.g., news feeds, e-mail, medical and corporate records, digital libraries, journal and magazine articles, blogs), imagine the practical consequences of training machines to automatically categorize this content. Indeed, text classification algorithms have already had moderate success in cataloging news articles (Joachims, 1998) and web pages (Nigam, McCallum, Thrun & Mitchell, 1999). Some text mining systems have even been incorporated into digital library systems (e.g., the Greenstone DAM), such that users benefit from digital library items automatically co-referenced by means of semantic annotations (Witten, Don, Dewsnip & Tablan, 2004).
  	
  
	
  
	
  
Natural language pre-processing for text and document classification
  
	
  
Text and document classification make use of natural language processing (NLP) technology to pre-process, encode and store linguistic features of texts and documents, and then to process selected features using Machine Learning (ML) algorithms, which are then applied to a new set of texts and documents. The first step in this process usually involves tokenization, which removes punctuation marks, tabs, and other non-textual characters and replaces them with white space. This produces a bare stream of word tokens that forms the data set upon which further processing occurs. From this stream, a filter is usually applied to remove all stop-words (e.g., prepositions, articles, conjunctions, etc.) that otherwise provide little if any meaning.
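The tokenization and stop-word filtering steps just described can be sketched in a few lines of Python (a minimal illustration, not GATE's actual tokenizer; the stop-word list here is a tiny sample):

```python
import re

# A sketch of the pre-processing described above: non-textual characters
# are replaced with white space (tokenization), then stop-words are
# filtered out. The stop-word list is a tiny illustrative sample.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is"}

def tokenize(text):
    """Replace punctuation/non-textual characters with white space,
    then split into a stream of lowercase word tokens."""
    return re.sub(r"[^A-Za-z0-9']+", " ", text).lower().split()

def remove_stop_words(tokens):
    """Drop tokens that provide little if any meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The suffrage movement, in short, reshaped politics.")
print(remove_stop_words(tokens))
# ['suffrage', 'movement', 'short', 'reshaped', 'politics']
```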
  	
  
	
  
In a related vein, we see in such instances that tokens are not always the same as words per se. Tokenization may wrongly insert white space within two- and three-word tokens: "New York" should be considered one token, not two (not "New" and "York"). Hyphens and apostrophes present difficult challenges. Often words like "don't" are tokenized into two separate words, "do" and "n't", the latter of which is later transduced as "n't" = "not". When considering all the continually changing conventions used to display words as text, you begin to appreciate the multitude of problems.
  	
  
	
  
Often, pre-processing can stop here, as many text and document classification methods rely on simple tokenization, such that each token represents one term amongst a bag of other words occurring within each document and across all documents in the corpus. One common approach to determining word importance within a bag-of-words is the term frequency-inverse document frequency (tf-idf) approach. Here each document is represented as a vector of terms, each term initially encoded in binary form as 1 (term occurs) or 0 (term does not occur), upon which weighting schemes assign more weight to terms occurring frequently within relevant documents but infrequently across all documents considered together. In a corpus of documents about political parties, for instance, the word "political" may occur a lot in relevant documents, but its weight would be low given that it also occurs frequently in all the other documents within the corpus. This renders the term rather meaningless when trying to distinguish relevant from non-relevant documents, as they are all about something political. If the word "suffrage" occurs frequently in relevant documents, on the other hand, but rarely across the corpus, its specificity, and hence its weight for determining document type, is considered much greater. This is the reason tf is balanced with (i.e., multiplied by) idf, a factor that diminishes the weight of frequent terms and increases the weight of rare ones (for the mathematics of such an approach, see Hotho, Nurnberger & Paass, 2005).
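A minimal Python sketch of the tf-idf weighting just described, using raw term frequency multiplied by the standard logarithmic idf (real systems use many smoothed variants):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: a list of token lists, one per document.
    Returns one {term: weight} dict per document, where
    weight = term frequency * log(N / document frequency)."""
    n_docs = len(corpus)
    # Number of documents each term appears in at least once.
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: count * math.log(n_docs / doc_freq[t])
                        for t, count in tf.items()})
    return weights

corpus = [
    ["political", "suffrage", "suffrage", "reform"],
    ["political", "party", "platform"],
    ["political", "election", "party"],
]
w = tf_idf(corpus)
# "political" occurs in every document, so idf = log(1) = 0 and its
# weight vanishes; "suffrage" is frequent in one document only, so it
# receives the highest weight there.
```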
  
	
  
In this way a set of documents can be mined for keywords. If all of the documents within our corpus are related to political parties, the word "political" hardly qualifies as a keyword. Words that occur frequently within only a subset of documents serve to categorize content. As such, if the word "suffrage" occurs frequently in some documents, but not in all of them, it qualifies as a good candidate keyword for classifying the relevant text. A good text-mining program utilizing the tf-idf weighting scheme would be able to extract this term and present it to a human as a possible keyword. These weighting schemes are applied within vector space models in order to retrieve, filter and index terms occurring in documents (Salton, Wong & Yang, 1975). Such models form the basis of many search and indexing engines (e.g., Apache Lucene) insofar as the HTML content of each Web page is crawled and indexed to determine its relevance based on words and phrases occurring within the <title> and heading elements, among other signals (see Chau & Chen, 2008).
  	
  
	
  
Still, a bag-of-words approach to text and document mining can be improved upon by incorporating domain knowledge from experts into the analysis. For instance, experts can identify domain-specific words, phrases and/or rules. If a document or Web page is checked against a dictionary of these listed features, those documents containing the features will be deemed more relevant to the search. This is what often occurs after tokenization in many kinds of NLP software (e.g., GATE). That is, tokenized words are mapped to an internal gazetteer (an internal dictionary), which operates as a sort of pre-classification, such that commonly occurring or well-known entities are extracted and annotated as such. For instance, a gazetteer might by default be outfitted to recognize all common first names and surnames (Noam or Bradeley; Chomsky or Manning), organizations (UN, United Nations, OPEC, White House, Planned Parenthood), or date formats (02/10/1973 or February 10, 1973).
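The gazetteer idea can be sketched as a dictionary lookup over token spans. This is a hypothetical toy dictionary and matcher, only meant to illustrate the pre-classification step, not GATE's actual gazetteer lists or matching engine:

```python
# Toy gazetteer: dictionary entries mapped to their list's category.
GAZETTEER = {
    "united nations": "organization",
    "white house": "organization",
    "noam": "first_name",
    "chomsky": "surname",
}

def gazetteer_lookup(tokens):
    """Return (start, end, category) annotations for dictionary hits,
    trying longer spans first so "White House" beats "White"."""
    annotations = []
    i = 0
    while i < len(tokens):
        for span in (2, 1):  # try two-token entries before single tokens
            phrase = " ".join(tokens[i:i + span]).lower()
            if phrase in GAZETTEER:
                annotations.append((i, i + span, GAZETTEER[phrase]))
                i += span
                break
        else:
            i += 1
    return annotations

tokens = "Noam Chomsky spoke at the White House".split()
print(gazetteer_lookup(tokens))
# [(0, 1, 'first_name'), (1, 2, 'surname'), (5, 7, 'organization')]
```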
  	
  
	
  
The selection of these kinds of annotations thus constrains the set of words chosen to represent documents in vector space models. If we want to ensure that a domain-specific vocabulary is annotated as relevant to text or document classification, we might create a separate list for those terms and annotate each term as belonging to a particular category. As described later, we created a gazetteer of terms most likely to occur on Web sites functioning as digital libraries, such that a random Web site containing these terms would with higher likelihood be classified as relevant.
  	
  
	
  
Other forms of linguistic pre-processing exist which may or may not enhance document and text classification algorithms, depending on the nature and specificity of the task. For instance, sentence splitters chunk tokens into sentence spans when phrases are an important feature in classification. At times, tagging each term within a document with its part-of-speech (POS tagging) is important; for instance, it allows for the classification of documents into language groups (e.g., Spanish vs. English vs. German) or sentence types. Given that language is full of ambiguity, of which we'll only scratch the surface here, Named-Entity (NE) transducers ease the confusion by contextualizing certain tokens. For instance, General Motors can be recognized as a company, and not as the name of a military officer (e.g., General Lee). Likewise, "May 10" is a date, "May Day" is a holiday, "May I leave the room" is a request, and "Sallie May Jones" is a person. That is, the transducer disambiguates homographs, homonyms and other such linguistic confusions.
  	
  	
  
	
  
Another common problem in pre-processing is co-reference matching. Often, the same entity is known in different ways or by different spellings: "center" is the same as "centre"; NATO is the same entity as North Atlantic Treaty Organization; and Mr. Smith is the same person as Joachim Smith, who is the same person as "he" or "him" (e.g., "Joachim Smith went to town. Everyone greeted him as Mr. Smith and he didn't care for that"). This is an important element when considering frequency weights in vector space models, as two different tokens referencing the same entity should be co-referenced as that entity occurring with frequency = 2, and not with frequency = 1 for each way of referring to it.
  	
  
	
  
	
  
Basic classification models
  
	
  
Most classification models are forms of supervised learning, in that each input value (e.g., a word vector) is paired with an expected discrete output value (i.e., the pre-defined category). The supervised algorithm analyzes these pairings in training to produce an inferred classifier function, and is thereby able in testing to predict the output value (i.e., the correct classification) for any new valid input. One instance commonly used in document classification is training a classifier to automatically classify Web pages into a pre-established taxonomy of categories (e.g., sports, politics, art, design, poetry, automobiles, etc.). The accuracy of the trained function in correctly classifying the test set is then computed as a performance measure, each document falling within the expected class by some degree. Herein we establish a trade-off between recall and precision. High precision implies a high threshold for allowing membership into a class. In this way, the algorithm refuses to accept many false positives, but in doing so sacrifices its ability to recall an otherwise larger set of documents, and thus risks missing many relevant documents (i.e., they become false negatives). On the other hand, if a threshold favoring high recall is permitted, we risk lowering our rate of precision and thus allow many documents not relevant to the category (i.e., false positives) into the set. The F1-score serves as a statistical compromise (the harmonic mean) between recall and precision.
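The trade-off can be made concrete with the standard definitions; the counts below are invented purely for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from counts of true
    positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# A strict classifier: few false positives, but many relevant
# documents missed (false negatives): high precision, low recall.
print(precision_recall_f1(tp=8, fp=1, fn=12))
# A permissive classifier: most relevant documents found, but many
# irrelevant ones admitted: high recall, lower precision.
print(precision_recall_f1(tp=18, fp=15, fn=2))
```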
  	
  
	
  
For the mathematical details of many of the classification algorithms, we defer to Hotho, Nurnberger and Paass (2005), but here outline the rudimentary basics of the four most common algorithms: Naïve Bayes, k-nearest neighbor, decision trees, and support vector machines (SVM).
  	
  
	
  
Naïve Bayes applies conditional probability to estimate whether document d with the vector of terms t1, …, tn belongs to a certain class, combining the per-term probabilities P(ti | classj) under an assumption of term independence. Documents whose probability reaches a pre-established threshold are deemed to belong to the category.
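A minimal multinomial Naïve Bayes sketch with Laplace smoothing follows; the toy corpus and labels are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: token lists; labels: the class of each document."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)  # per-class term frequencies
    vocab = set()
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
        vocab.update(doc)
    return class_counts, term_counts, vocab

def classify_nb(doc, model):
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        # log P(class) + sum over terms of log P(t_i | class), smoothed
        score = math.log(class_counts[c] / n_docs)
        total = sum(term_counts[c].values())
        for t in doc:
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["suffrage", "vote", "reform"], ["goal", "match", "team"],
        ["ballot", "vote"], ["team", "coach"]]
labels = ["politics", "sports", "politics", "sports"]
model = train_nb(docs, labels)
print(classify_nb(["vote", "reform"], model))  # politics
```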
  	
  
	
  
Instead of building a probability model, the k-nearest neighbor method of classification is an instance-based approach that operates on the similarity of a document to its k nearest neighbors. Using word vectors stored as document attributes and document labels as the class, most computation occurs at testing time, whereby class labels are assigned based on the most frequent label among the k training samples nearest to the document to be classified.
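A minimal instance-based sketch using cosine similarity over bag-of-words vectors; the toy training set is invented for illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(doc, training, k=3):
    """training: list of (term_counts, label) pairs. All computation
    happens here at test time: find the k most similar training
    documents and take a majority vote of their labels."""
    vec = Counter(doc)
    neighbors = sorted(training, key=lambda x: cosine(vec, x[0]),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

training = [
    (Counter(["vote", "ballot", "reform"]), "politics"),
    (Counter(["vote", "suffrage"]), "politics"),
    (Counter(["team", "goal", "match"]), "sports"),
    (Counter(["coach", "team"]), "sports"),
]
print(knn_classify(["ballot", "vote"], training, k=3))  # politics
```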
  	
  
	
  
Decision trees (e.g., C4.5) operate on the information gain established over a recursively built hierarchy of word selection. From labeled documents, the term t that best predicts the class, according to the amount of information gain, is selected. The tree splits into subsets, one branch for documents containing the term and the other for those without, and then finds the next term to split on; this is applied recursively until all documents in a subset belong to the same class.
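The information-gain criterion for choosing the split term can be sketched as follows (toy documents and labels invented for illustration): how much does knowing "term present / term absent" reduce the entropy of the class labels?

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """docs: list of token sets. Gain from splitting on the
    presence vs. absence of term."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    remainder = ((len(with_t) / n) * entropy(with_t)
                 + (len(without) / n) * entropy(without))
    return entropy(labels) - remainder

docs = [{"suffrage", "vote"}, {"vote", "team"}, {"team", "goal"}, {"goal"}]
labels = ["politics", "politics", "sports", "sports"]
print(information_gain(docs, labels, "vote"))  # 1.0: a perfect split
print(information_gain(docs, labels, "team"))  # 0.0: both branches mixed
```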
  	
  
	
  
Support vector machines (SVM) operate by representing each document as a weighted vector td1, …, tdn based on word frequencies within each document. SVM determines a maximum-margin hyperplane that separates positive (+1) class examples from negative (-1) class examples in the training set. Only a small fraction of documents serve as support vectors, and any new document is classified as belonging to the class if its decision value falls on the positive side of the hyperplane (greater than 0), and as not belonging if it falls on the negative side (less than 0). SVMs can be used with linear or polynomial kernels that transform the space to ensure the classes can be separated linearly.
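The decision rule itself can be sketched in a few lines. The weight vector below is invented for illustration; actually learning the maximum-margin weights is the hard part and is omitted here:

```python
def svm_decide(x, weights, b):
    """Linear SVM decision rule: the sign of w . x + b.
    x and weights are sparse vectors represented as dicts."""
    score = sum(weights.get(t, 0.0) * v for t, v in x.items()) + b
    return +1 if score > 0 else -1

# Illustrative (not learned) weights: positive terms pull toward the
# positive class, negative terms toward the negative class.
weights = {"suffrage": 1.5, "ballot": 1.0, "team": -1.2, "goal": -0.8}
print(svm_decide({"suffrage": 2, "vote": 1}, weights, b=-0.5))  # 1
print(svm_decide({"team": 1, "goal": 2}, weights, b=-0.5))      # -1
```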
  	
  	
  
	
  
While the level of performance of each of these classifiers depends on the kind of classification task, the SVM algorithm most reliably outperforms other kinds of algorithms on document classification (Sebastiani, 2002), and thus will be utilized with priority in the study that follows.
  
	
  
	
  
Goals and objectives of the current study
  
	
  
For our own purposes, we focus on the domain of web classification with a twofold purpose: 1) to learn and teach my colleagues about the natural language processing suite known as GATE (General Architecture for Text Engineering), especially with regard to its Machine Learning (ML) capabilities; and 2) to utilize the GATE architecture to classify web documents into two groups: those sites that function as digital library sites (DL), as distinguished from all other, non-digital library sites (non-DL).
  
	
  
The purpose of such an exercise is to identify, from amongst the millions of websites, only those sites that operate as digital libraries. Assuming digital library sites are identifiable through certain characteristic earmarks that distinguish them as containing searchable digital collections, the goal is to develop a set of annotations that, by way of an ML algorithm, can be applied as part of a web crawler in order to extract the URL of each site that qualifies as belonging to the DL group, while omitting those that do not. While somewhat confident it is possible to obtain a strong level of precision in retrieving many of the relevant sites, of greater concern are sites that merely seem relevant (e.g., those about digital libraries rather than functioning as one); that is, the false positives. The current author is developing as a prototype a website (www.digitallibrarycentral.com) that seeks to operate as a digital library of all digital library websites: a sort of one-stop visual reference library that points to the collection of all digital libraries. Achieving the goal outlined here would serve to populate this site.
  
	
  
Before such grand ideals can be implemented, however, and indeed to test whether they can be, the current paper will outline some of the first steps in applying the GATE ML architecture towards this objective. Of immediate concern is understanding the GATE architecture and how it functions in natural language processing tasks, so that we can properly pre-process and annotate our target corpora before running ML algorithms on them. We turn now to an explanation of the GATE architecture.
  	
  
	
  	
  	
  
	
  
The GATE architecture and text annotation
  
	
  
GATE (General Architecture for Text Engineering) is a set of Java tools developed at the University of Sheffield for natural language processing and text engineering tasks in various languages. At its core is an information extraction system called ANNIE (A Nearly-New Information Extraction System), a set of functions that operates on individual documents (including XML, TXT, DOC, PDF, database and HTML formats) and across the corpora to which many documents can belong. These functions comprise tokenizing, a gazetteer, sentence splitting, part-of-speech tagging, named-entity transduction, and co-reference tagging, among others. GATE also boasts extensive tools for RDF and OWL metadata annotation for creating ontologies for use within the Semantic Web.
  
	
  
Most of these language processes operate seamlessly within GATE Developer's integrated development environment (IDE) and graphical user interface (GUI), the latter allowing users to visualize these functions within a user-friendly environment. For instance, a left-sidebar resource tree displays the Language Resources panel, where the documents and document sets (the corpus) reside. Below that, it also displays the ANNIE Processing Resources (PRs), the natural language processing functions mentioned above that form part of an ordered pipeline that linguistically pre-processes the documents. A right sidebar shows the resulting color-coded annotation lists after pipeline processing. Additionally, a bottom table exposes the various resulting annotation attributes, as well as a popup annotation editor that allows one to edit and classify (i.e., provide values to) these annotation sets for training, prototyping, and/or analysis. Figure 1 below shows all of these elements in action.
  	
  
	
  




                                                                                                                                	
  
Figure 1.
  
	
  
	
  
These tools complete much of the gritty text-engineering work of document pre-processing so that useful research can be quickly deployed, in a way that is visually explicit and apparent to those less initiated in these common natural language pre-processing tasks, and in a way that allows for editing these functions as well as for introducing various pre-processing plugins and other scripts developed for individual text-mining applications.
  	
  
	
  
Figure 1 displays four open documents uploaded directly by entering their URLs: Newsweek and Reuters (news sites), and JohnJayPapers and DigitalScriptorium (digital libraries). These, along with 10 other news sites and 11 other digital libraries, all belong to the corpus named "DL_eval_2" above (which will serve as Sample 1 later, our first test of DL discrimination). This provides a testing sample to ensure the pre-processing pipeline and Machine Learning (ML) functions operate correctly on our soon-to-be annotated documents.
  	
  
	
  
Just by uploading URLs, GATE by default automatically annotates the HTML markup, as can be seen in the bottom right-sidebar, where the <a>, <body> and <br> tags are located.
  
  
	
  
After running the PR pipeline over the "DL_eval_2" corpus, the upper right-sidebar shows the annotations that result from running the tokenizer, gazetteer, sentence splitter, POS tagger, NE transducer and co-referencing orthomatcher. Organization, for instance, is checked and highlighted in green, and by clicking on "White House" (one instantiation of Organization) we learn about GATE's {Type.feature=value} syntax, which in the case of "White House" is represented as {Organization.orgType=government}. This syntax operates as the core annotation engine and allows for the scripting and manipulation of annotation strings.
  	
  	
  
	
  
The ANNIE PRs in this case provide automatic annotations that serve as a rudimentary start upon which any text engineering project can build. There are many other plugins and PR functions we will not discuss within this review. For our own purposes, we want to call attention to two annotation types ANNIE generates: 1) Token, and 2) Lookup.
  	
  	
  
	
  
A few examples of the Type.feature syntax for the Token type are: the kind of token {Token.kind=word}; the token character length {Token.length=12}; the token POS {Token.category=NN}; the token orthography {Token.orth=lowercase}; or the content of the token string {Token.string=painter}.
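To make the {Type.feature=value} pattern concrete, here is a minimal conceptual sketch in Python. GATE itself is Java; the dictionaries below only mimic its annotation model, and the sample annotations are invented for illustration:

```python
# Conceptual sketch only: annotations modeled as {type, features} dicts,
# mirroring GATE's {Type.feature=value} syntax described above.
annotations = [
    {"type": "Token", "features": {"kind": "word", "string": "painter",
                                   "category": "NN", "orth": "lowercase",
                                   "length": 7}},
    {"type": "Token", "features": {"kind": "punctuation", "string": ".",
                                   "length": 1}},
    {"type": "Organization", "features": {"orgType": "government",
                                          "string": "White House"}},
]

def match(anns, ann_type, feature, value):
    # return every annotation satisfying {Type.feature=value}
    return [a for a in anns
            if a["type"] == ann_type and a["features"].get(feature) == value]

words = match(annotations, "Token", "kind", "word")   # {Token.kind=word}
```

Here {Token.kind=word} selects only the “painter” token, while {Organization.orgType=government} would select the “White House” annotation.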
  	
  	
  
	
  
Our interest is in analyzing string content: determining whether a particular document is an instance of a digital library will require an ML analysis of the unigram strings comprising both DL sites and nonDL sites. We can either use all tokens (after removing stop-words) to analyze the tf-idf weighting of the documents in question, or we can constrain the kinds of tokens analyzed within the documents by making further specifications. The ANNIE annotation schema provides many default annotations (e.g., Person, Organization, Money, Date, Job Title etc.) to constrain the kinds of words chosen for analysis, as can be seen in Figure 1 in the upper right-sidebar.
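The tf-idf weighting mentioned above can be sketched with a stdlib-only computation; the toy documents below are illustrative, not drawn from the actual corpus:

```python
import math
from collections import Counter

# Minimal tf-idf sketch of the bag-of-words weighting applied by default:
# term frequency within a document, scaled down by how many documents
# contain the term.
docs = [
    "digital library search browse collection",
    "breaking news markets politics",
    "digital archive manuscript collection search",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: (tf[t] / len(doc_tokens)) * math.log(N / df[t]) for t in tf}

weights = tfidf(tokenized[0])
# "library" appears in only one document, so it outweighs "digital",
# which appears in two.
```

Terms concentrated in few documents (like “library” here) receive higher weights than terms spread across the corpus, which is exactly why tf-idf vectors help separate DL from nonDL pages.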
  	
  
	
  
Additionally, the Gazetteer provides many other kinds of dictionary lookup entries (60,000 arranged in 80 lists) above and beyond the ANNIE default annotations. For instance, the list named “city” has as dictionary entities a list of all worldwide cities, such that by mapping these onto the text, a new annotation of the kind {Lookup.minorType=city} is created, annotating each instance of a city with this markup. The lookup uses a set-subset hierarchy we will not describe, except to say that {Lookup.majorType} is a parent of {Lookup.minorType}. Thus, there are different kinds of locations, for instance, city and country. City and country are thus minorTypes (children) of {Lookup.majorType=location}.
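The gazetteer mechanism can be sketched as a dictionary-driven lookup; the city and country entries below are a tiny invented subset, not GATE’s real 60,000-entry lists:

```python
# Hedged sketch of a gazetteer lookup: each list carries a majorType and
# minorType, and any token found in a list's entries receives a Lookup
# annotation with those features.
gazetteer = {
    "city": {"majorType": "location", "minorType": "city",
             "entries": {"paris", "london", "syracuse"}},
    "country": {"majorType": "location", "minorType": "country",
                "entries": {"france", "england"}},
}

def lookup(text):
    lookups = []
    for token in text.lower().split():
        for lst in gazetteer.values():
            if token in lst["entries"]:
                lookups.append({"string": token,
                                "majorType": lst["majorType"],
                                "minorType": lst["minorType"]})
    return lookups

anns = lookup("The London office moved to Paris in France")
```

The sentence yields three Lookup annotations: two with {Lookup.minorType=city} and one with {Lookup.minorType=country}, all children of {Lookup.majorType=location}.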
  	
  	
  
	
  
	
  
Classification with GATE Machine Learning
  
	
  
Given that the GATE Developer extracts and annotates training documents, several processing plugins that operate at the end of a document pre-processing pipeline serve Machine Learning (ML) functions. The Batch Learning PR has three functions: chunk recognition, relation extraction and classification. This paper is interested in applying supervised ML processes to classify web documents as instances of digital libraries (DL) or not (nonDL).
  	
  	
  
	
  
Supervised ML requires two phases: learning and application. The first phase requires building a data model from instances within a document that has already been correctly classified. In our case, it requires giving value to certain sets of annotations that, as a whole, will represent the document instance (i.e., the website) as either a hit (DL) or a miss (nonDL). The point is to develop a training set D = (d1,…,dn) of correctly classified DL website documents (d) to build a classification model able to discriminate any future website d as being either a true DL or some other website (nonDL).
  	
  
	
  
The first task requires annotating each document as a whole, and in doing so assigning it to the dependent DL or non-DL class. Up until now, annotations have referred to parts of a document (tokens, sentences, dates etc.). To annotate a whole document, we begin by creating a new {Type.feature=value} term. To do so, we demarcate the entire text within each document and create a new annotation type called “Mention,” a feature called “type” (not to be confused with the annotation Type itself) and two distinct values: {Mention.type=dl} and {Mention.type=nondl}.
  	
  
	
  
The attributes used to predict class membership are the two annotation types we highlighted above: 1) Token {Token.string}, and 2) Lookup {Lookup.majorType}. To take full advantage of the Gazetteer, we added a list entry named “dlwords” (i.e., digital library words) with a list of terms commonly found on many digital library websites. This list of words is reproduced below1:
  
	
  
Advanced Search       Digital Collection(s)    Manuscript(s)
Archive(s)            Digital Content          Repository(ies)
Browse                Digital Library(ies)     Search
Catalog               Digitization             Search Tip(s)
Collection(s)         Digitisation             Special Collection(s)
Digital               Image Collection(s)      University(ies)
Digital Archive(s)    Keyword(s)               University Library(ies)
Image(s)              Library(ies)
  
	
  
All of our analyses will operate using the bag-of-words model, which by GATE default applies tf-idf weighting schemes to a specified n-gram size (we will be using only unigrams). Two attribute annotations, each representing a slightly different bag-of-words, will be used to predict DL or nonDL class membership:
  
	
  
1. When the {Token.string} attribute is chosen to predict {Mention.type} class membership, the bag-of-words includes all non-stop-word tokens within its attribute set.

2. When the Gazetteer is used and “dlwords” are included as part of its internal dictionary, the attribute {Lookup.majorType=“dlwords”} along with all the other 60,000 entries will serve to constrain the set of tokens predicting {Mention.type} class membership.
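The difference between the two attribute sets can be sketched as follows; the stop-word list, the “dlwords” subset and the sample sentence are invented for illustration:

```python
# Illustrative sketch of the two bags-of-words described above.
STOP_WORDS = {"the", "of", "and", "a", "for"}
DLWORDS = {"archive", "collection", "search", "browse", "catalog", "library"}

text = "Browse the digital collection and search the catalog of the library"
tokens = [t.lower() for t in text.split()]

# 1) {Token.string}: every non-stop-word token enters the bag
token_string_bag = [t for t in tokens if t not in STOP_WORDS]

# 2) {Lookup.majorType}: only tokens matched by a gazetteer list ("dlwords")
lookup_bag = [t for t in tokens if t in DLWORDS]
```

Note that “digital” survives in the {Token.string} bag but is dropped from the constrained {Lookup.majorType} bag because it is not in the (toy) gazetteer list.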
  	
  	
  
	
  
The GATE Batch Learning PR requires an XML configuration file specifying the ML parameters and the attribute-class annotation sets2. We will only discuss a few essential settings here. For starters, we set the evaluation method as “holdout” with a ratio of .66/.33 training to test. The main algorithm we will be using is SVM (in GATE, the SVMLibSvmJava) with the following parameters varied; the values reported below provided the best results:
  
	
  
1 Note: the Gazetteer performs no stemming and is case-sensitive; plural and uppercase variations of these words were therefore provided but are not reproduced here.
2 For a list of all parameter setting possibilities, see http://gate.ac.uk/sale/tao/splitch17.html#x22-43500017.
  
  
-t: kernel: 0 (linear (0) vs. polynomial (1))
-c: cost: 0.7 (lower values allow softer margins, which tend to generalize better)
-tau: uneven margins: 0.5 (varies the positive-to-negative instance ratio)
         	
  
The XML configuration file is reproduced below, in case others are interested in getting started using GATE ML for basic document classification:
  
         	
  
<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="false"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <multiClassification2Binary method="one-vs-another"/>
  <EVALUATION method="holdout" ratio="0.66"/>
  <FILTERING ratio="0.0" dis="near"/>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
          options=" -c 0.7 -t 0 -m 100 -tau 0.4 "/>
  <DATASET>
    <INSTANCE-TYPE>Mention</INSTANCE-TYPE>
    <NGRAM>
      <NAME>ngram</NAME>
      <NUMBER>1</NUMBER>
      <CONSNUM>1</CONSNUM>
      <CONS-1>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
      </CONS-1>
    </NGRAM>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Mention</TYPE>
      <FEATURE>type</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>
  
	
  
	
  
The ngram is set to 1 (unigram), and the <CLASS/> tag within the <ATTRIBUTE> element indicates that this attribute is the class being predicted, with <TYPE>Mention</TYPE> and <FEATURE>type</FEATURE> corresponding to {Mention.type=dl or nondl}. To run the other attribute, <TYPE>Lookup</TYPE> and <FEATURE>majorType</FEATURE> are substituted for the Token/string pair.
  
	
  
Thus, after running the ANNIE PR pipeline over the “DL_eval_2” corpus, the ML Batch Learning PR is placed alone in the pipeline to run over the annotated set of documents in Evaluation mode. As mentioned, the Batch Learning PR can also operate in Training-Application mode on two separate corpora: one for training and another for application (i.e., testing). The initial results below reflect only a holdout 0.66 evaluation run over one corpus; the current report does not utilize the Training-Application mode.
  	
  
	
  
	
  
Sample corpora and results
  
	
  
The Web now boasts over 8 billion indexable pages (Chau & Chen, 2008). Training an ML algorithm to pick out the estimated few thousand digital libraries will therefore not be a simple matter. Assuming there are 5 thousand library-standard digital libraries (which may be a high estimate), some of which reside within umbrella Digital Asset Management portals, discriminating these will be cherry-picking at a ratio of roughly 5 digital libraries per every 8 million websites. Spiders (or Web crawlers) can curtail this number greatly by crawling only a specified argument depth from the starting URL. Unfortunately, librarians are not always good at applying search engine optimization (SEO) standards, and many well-known DLs are deeply embedded in URL arguments, on unusual ports (the University of Wyoming’s uses port 8180), or within site subdomains. Thus, curtailing this argument space too much will result in decreased recall.
  
	
  
Additionally, there are many non-DL websites that use language quite similar to DL websites. For instance, many websites operate as librarian blogs or digital library magazines that serve as discussion spaces regarding DLs, but are not DLs themselves. Unfortunately, these false positives will prove daunting to exclude. We seek only DLs or DL portals that boast archival collections that have been digitized, and as such serve as electronic resources that are co-referenced, searchable, browsable, and catalogued according to some taxonomy or ontology. One suggested way of narrowing down to only these kinds of resources might be to tap into the <meta content> tags, in which librarians often apply conventions such as Dublin Core to demarcate these spaces as digital collection spaces. This is an avenue for further research, and is possible within GATE by utilizing a {Meta.content} attribute. On quick pre-testing, however, it provided no worthwhile results.
  	
  
	
  
In what follows are two samples of data we evaluated for the ML classification of DL and non-DL websites.
  	
  
	
  
	
  
Sample 1
  
	
  
This sample mostly ensured that the GATE ML software and configuration files were operating correctly given the kinds of document-level annotations made. As mentioned, the first corpus we tested was called “DL_eval_2,” which contained 25 websites: 13 DL sites (from Columbia University Digital Collections) and 12 distinct news sites, listed below:
  
	
  
Reuters            LA Times         CNN                Bloomberg
Newsweek           The Guardian     Chicago Tribune    BBC
National Review    CS Monitor       Boston Globe       Wall Street Journal
  
	
  
Using both {Token.string} and {Lookup.majorType} as attributes, the results of the classification of {Mention.type} as either DL or nonDL follow. These results correspond to the ML configuration file found above and use the SVMLibSvmJava ML engine at .66 holdout evaluation. The training set thus included 16/25 websites (.66), and the learning algorithm was tested on the remaining 9/25 sites (.33).
  	
  
	
  
{Token.string} misclassified only one instance: Bloomberg News was falsely classified as belonging to {Mention.type=dl}. Nothing in the text of Bloomberg’s front page gave any indication as to why this was the case. Precision, recall and the F1 value for the set were each 0.89.
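As a sanity check on the 0.89 figure: with nine test documents and a single error, micro-averaged precision, recall and F1 all reduce to overall accuracy, 8/9 ≈ 0.89. A minimal sketch (the counts are inferred from the sentence above, not taken from a GATE output file):

```python
# Each misclassification contributes one false positive (to the predicted
# class) and one false negative (to the true class), so micro-averaged
# precision, recall and F1 all equal correct/total.
total, errors = 9, 1
tp = total - errors
precision = tp / (tp + errors)   # micro FP count == errors
recall = tp / (tp + errors)      # micro FN count == errors
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # prints 0.89
```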
  	
  
	
  
{Lookup.majorType} draws on the Gazetteer entries, including the digital library terms (“dlwords”) we added. It is thus a more constrained bag-of-words, smaller than the set of all tokens. Classification improved to 100% using the “dlwords”-enhanced Gazetteer.
  	
  
	
  
Given that this is such a small sample, we cannot conclude very much, except to say that there is something about DL content, as compared to ordinary mainstream news sites, that allows for their discrimination.
  	
  
	
  
	
  
Sample 2
  
	
  
To train the ML algorithm to make our target discrimination and allow for generalizable conclusions, we increased the sample size. Sample 2 consists of 181 non-DLs and 62 DLs. Each set was chosen in the following way:
  
	
  
Non-DL set
  
	
  
A random website generator was used to generate 181 websites that were not digital libraries, were English-language only, and had at least some text (excluding websites consisting only of images etc.):

• http://www.whatsmyip.org/random_websites/
  	
  
       	
  
	
  
DL set
  
	
  
A set of 62 university digital libraries was chosen, mostly from three main university DL portals:

• Harvard University Digital Collections
  o http://digitalcollections.harvard.edu/
• Cornell University Libraries “Windows on the Past”
  o http://cdl.library.cornell.edu/
• Columbia University Digital Collections
  o http://www.columbia.edu/cu/lweb/digital/collections/index.html
  	
  
	
  
	
  
These websites are slightly more representative, but still fall well short of the kind of precision that will be needed to crawl the web as a whole. The results bode well, nevertheless.
  	
  
	
  
Again, using both {Token.string} and {Lookup.majorType} as attributes, the results of the classification of {Mention.type} as either DL or nonDL follow. The .66/.33 (total) holdout training-test splits of the data were: 160/83 (243) websites overall, comprising 40/22 (62) DL and 120/61 (181) non-DL. The Naïve Bayes and C4.5 algorithms misclassified 100% of the DL websites (22/22) with both sets of attributes, achieving a total F1 of only 0.73. It is not clear why this is the case. Given that SVM is well known as the best-performing classifier for texts (Sebastiani, 2002), we stick with it for our purposes.
  	
  	
  
	
  
{Token.string} performed slightly better than {Lookup.majorType}. In both cases there were very few misclassifications, and most of these were false negatives. When only the Gazetteer entry words, including “dlwords,” were taken into account ({Lookup.majorType}), 3/22 DLs were misclassified as non-DL (precision=0.95; recall=0.86; F1=0.90) and 1/61 non-DLs was misclassified as DL (precision=0.95; recall=0.98; F1=0.97).
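These figures follow directly from the raw confusion counts given above (22 DL and 61 non-DL test documents; 3 DLs misclassified as non-DL, 1 non-DL misclassified as DL). A small sketch verifying them:

```python
# Precision/recall/F1 from confusion counts, rounded to two decimals
# as reported in the text.
def prf(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return round(p, 2), round(r, 2), round(f1, 2)

# DL class: 19 of 22 DLs found, plus 1 non-DL wrongly labeled DL
dl_scores = prf(tp=19, fp=1, fn=3)      # (0.95, 0.86, 0.9)
# non-DL class: 60 of 61 found, plus 3 DLs wrongly labeled non-DL
nondl_scores = prf(tp=60, fp=3, fn=1)   # (0.95, 0.98, 0.97)
```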
  	
  	
  
	
  
When all tokens were entered into the bag-of-words ({Token.string}), precision was perfect for DL classification and recall was perfect for nonDL classification. That is, 3/22 DLs were still misclassified as nonDL (precision=1.0; recall=0.86; F1=0.93). All 61/61 nonDLs were classified correctly, resulting in perfect recall but imperfect precision insofar as 64 total websites were classified as non-DL: the 61 expected, plus the 3 that should have been classified as DL (precision=0.95; recall=1.0; F1=0.98).
  	
  
	
  
Thus, overall, using all tokens achieved slightly higher precision and recall for discriminating DL websites from all websites, based on this small and still very non-proportional sample. Total F1 values were 0.96 for {Token.string} and 0.95 for {Lookup.majorType}. The question remains whether both attributes misclassified exactly the same websites; it turns out that 2 of the 3 websites misclassified were the same for both attributes. Figure 2 below illustrates the breakdown of these statistics per attribute.
  
	
  
{Token.string} (Precision=1.0, Recall=0.86)
False Negatives (misclassified as nonDL):
• Digital Scriptorium (www.scriptorium.columbia.edu)
• Holocaust Rescue & Relief, Andover-Harvard Theological (www.hds.harvard.edu/library/collections/digital/service_committee.html)
• Joseph Urban Stage Design Collection (www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/)
False positives (misclassified as DL):
• none

{Lookup.majorType} (Precision=0.95, Recall=0.86)
False Negatives (misclassified as nonDL):
• Harvard Business Education for Women, 1937-1970 (http://www.library.hbs.edu/hc/daring/intro.html#nav-intro)
• Holocaust Rescue & Relief, Andover-Harvard Theological (www.hds.harvard.edu/library/collections/digital/service_committee.html)
• Joseph Urban Stage Design Collection (www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/)
False positives (misclassified as DL):
• www.spi-poker.sourceforge.net

Figure 2.
  
	
  
	
  
Conclusion	
  
	
  
In this paper we discussed how GATE (General Architecture for Text Engineering) employs Machine Learning to classify documents from the web into two categories: websites that operate as digital library sites, and websites that do not. This exercise was completed firstly in order to learn about GATE, but secondarily to provide a possible solution to populating a site the current author is creating for digital libraries (www.digitallibrarycentral.com). No current directory exists as a single-stop go-to resource for digital libraries; as it is, digital libraries are difficult to find and hence often un- or under-utilized by the ordinary web user. By creating a digital library of all digital libraries, we hope to bring the ordinary user to the plethora of digitized resources available, and to categorize these digital collections according to a taxonomy that allows for the collation of similar kinds and types of digital libraries. Indeed, once these digital resources are all collected, GATE Machine Learning might provide a solution to the automatic classification of these resources into the supervised taxonomy.
  	
  
	
  
As it is, we first seek to locate these resources using Machine Learning. If the web were made up of three ordinary non-DL websites for every one DL website, the classifier we trained would have a very easy time locating all of the DLs (with 96% accuracy). As it is, however, of the 8 billion websites in existence today, we reckon that only 3-6 thousand operate as digital libraries in some form or another. Thus, a lot of work still needs to be done in order to find the DL needle in the haystack of all websites online today.
  	
  	
  	
  
	
  
	
  
	
  
	
  
References	
  
	
  
Chau, M. and Chen, H. (2008). A machine learning approach to web page filtering using content and structure analysis. Decision Support Systems, 44(2), 482-494.

Hotho, A., Nurnberger, A. and Paass, G. (2005). A brief survey of text mining. LDV Forum – GLDV Journal for Computational Linguistics and Language Technology.

Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer.

Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (2000). Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3), 103-134.

Salton, G., Wong, A., and Yang, C. S. (1975). A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11), 613-620.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1-47.

Witten, I. H. (2005). “Text mining.” In Practical Handbook of Internet Computing, ed. M. P. Singh. Chapman & Hall/CRC Press, Boca Raton, Florida.

Witten, I. H., Don, K. J., Dewsnip, M. and Tablan, V. (2004). Text mining in a digital library. Journal of Digital Libraries, 4(1), 56-59.
  	
  
	
  
	
  

Web classification of Digital Libraries using GATE Machine Learning  

  • 1. Stephen  J.  Stose     April  18,  2011   1   IST  565:  Final  Project         Web  classification  of  Digital  Libraries  using  GATE  Machine  Learning         Introduction     Text  mining  is  considered  by  some  as  a  form  of  data  mining  that  operates  on  unstructured   and  semi-­‐structured  texts.  It  applies  natural  language  processing  models  to  analyze  textual   content  in  order  to  extract  and  generate  actionable  (i.e.,  potentially  useful)  knowledge  from   the  information  inherent  in  words,  sentences,  paragraphs  and  documents  (Witten,  2005).   However,  many  of  the  linguistic  patterns  easy  for  humans  to  comprehend  and  reproduce   end  up  being  astonishingly  complicated  for  machines  to  process.  For  instance,  machines   struggle  interpreting  natural  language  forms  quite  simple  for  most  humans,  such  as   metaphor,  misspellings,  irregular  forms,  slang,  irony,  verbal  tense  and  aspect,  anaphora  and   ellipses,  and  the  context  that  frames  meaning.  On  the  other  hand,  humans  lack  a  computer’s   ability  to  process  large  volumes  of  data  at  high  speeds.  The  key  to  successful  text  mining  is   to  combine  these  assets  into  a  single  technology.       There  are  many  uses  of  this  new  interdisciplinary  effort  at  mining  unstructured  texts   towards  the  discovery  of  new  knowledge.  For  instance,  some  techniques  attempt  to  extract   structure  to  fill  out  templates  (e.g.,  address  forms)  or  extract  key-­‐phrases  as  a  form  of   document  metadata.  Others  attempt  to  summarize  the  content  of  a  document,  identify  a   document’s  language,  classify  the  document  into  a  pre-­‐established  taxonomy,  or  cluster  it   along  with  similar  documents  based  on  token  or  sentence  similarity  (see  Witten,  2005  for   others).  
Other  techniques  include  concept  linkage,  whereby  concepts  across  swathes  of   scientific  research  articles  can  be  linked  to  elucidate  new  hypotheses  that  otherwise   wouldn’t  occur  to  humans,  but  also  topic  tracking  and  question-­‐answering  (Fan  et  al.,   2006).       Consider  the  implications  of  being  able  to  automatically  classify  text  documents.  Given  the   massive  size  of  the  World  Wide  Web  and  all  it  contains  (e.g.,  news  feeds,  e-­‐mail,  medical  and   corporate  records,  digital  libraries,  journal  and  magazine  articles,  blogs),  imagine  the   practical  consequences  of  training  machines  to  automatically  categorize  this  content.   Indeed,  text  classification  algorithms  have  already  had  moderate  success  in  cataloging  news   articles  (Joachims,  1998)  and  web  pages  (Nigam,  MacCallum,  Thrun  &  Mitchell,  1999).     Indeed,  some  text  mining  systems  have  even  been  incorporated  into  digital  library  systems   (the  Greenstone  DAM),  such  that  users  can  benefit  by  displaying  digital  library  items   automatically  co-­‐referenced  by  use  of  semantic  annotations  (Witten,  Don,  Dewsnip  &   Tablan,  2004).         Natural  language  pre-­‐processing  for  text  and  document  classification     Text  and  document  classification  make  use  of  natural  language  processing  (NLP)   technology  to  pre-­‐process,  encode  and  store  linguistic  features  to  texts  and  documents,  and   then  to  processes  selected  features  using  Machine  Learning  (ML)  algorithms  which  then  are   applied  to  a  new  set  of  texts  and  documents.  The  first  step  in  this  process  usually  involves   tokenization,  a  process  that  involves  removing  punctuation  marks,  tabs,  and  other  non-­‐ textual  characters  to  replace  these  with  white  space.  
This  produces  a  mere  stream  of  word   tokens  which  forms  the  set  of  data  upon  which  further  processing  occurs.  From  this  stream,  
  • 2. Stephen  J.  Stose     April  18,  2011   2   IST  565:  Final  Project       a  filter  usually  is  applied  to  reduce  from  this  set  of  tokens  all  stop-­‐words  (e.g.,  prepositions,   articles,  conjunctions  etc.)  that  otherwise  provide  little  if  any  meaning.       In  a  related  vein,  we  see  in  such  instances  that  tokens  are  not  always  the  same  as  words  per   se.    Tokenization  may  insert  white  space  between  two  and  three-­‐word  tokens.  “New  York”   should  be  considered  one  token,  not  two  (not  “New”  and  “York”).  Hyphens  and  apostrophes   present  difficult  challenges.  Often  words  like  “don’t”  are  tokenized  into  two  separate  words:   “do”  and  “n’t”,  the  latter  which  is  later  transduced  as  “n’t”  =  “not”.    When  considering  all  the   continually  changing  conventions  used  to  display  words  as  text,  you  will  begin  to  appreciate   the  multitude  of  problems.       Often,  pre-­‐processing  can  stop  here,  as  many  text  and  document  classification  methods  rely   on  simple  tokenization,  such  that  each  token  represents  one  term  amongst  a  bag  of  other   words  occurring  within  each  document  and  between  all  documents  in  the  corpus.  One   common  approach  to  determining  word  importance  within  a  bag-­‐of-­‐words  is  the  term   frequency-­‐inverse  document  frequency  approach  (tf-­‐idf).    In  this  way,  each  document   represents  a  vector  of  terms,  and  each  term  is  encoded  in  binary  1  (term  occurs)  or  0  (term   does  not  occur)  form,  upon  which  weighting  schemes  apply  more  weight  to  terms  occurring   frequently  within  relevant  documents  but  infrequently  between  all  documents  considered   together.  
In  a  corpus  of  documents  about  political  parties,  for  instance,  the  word  “political”   may  occur  a  lot  in  relevant  documents,  but  its  weight  would  be  low  given  that  it  also  occurs   frequently  in  all  the  other  documents  within  the  corpus.  This  renders  the  term  rather   meaningless  when  trying  to  distinguish  relevant  from  non-­‐relevant  documents,  as  they  all   are  about  something  political.    If  the  word  “suffrage”  occurs  frequently  in  relevant   documents,  on  the  other  hand,  but  rarely  across  the  corpus,  its  specificity  and  hence  weight   for  determining  document  type  is  considered  much  greater.  This  is  the  reason  tf  is  balanced   with  (i.e.,  multiplied  by)  idf,  a  factor  that  diminishes  the  weight  of  frequent  terms  and   increases  the  weight  of  rare  ones  (for  the  mathematics  of  such  an  approach,  see  Hotho,   Nurnberger  &  Paass,  2005).     In  this  way  a  set  of  documents  can  be  mined  for  keywords.  If  all  of  the  documents  within   our  corpus  are  related  to  political  parties,  the  word  “political”  hardly  qualifies  as  a  keyword.   Words  that  occur  frequently  within  a  subset  of  documents  serve  as  words  that  categorize   content.  As  such,  if  the  word  “suffrage”  occurs  frequently  in  some  documents,  but  not  in  all   of  the  documents,  thus  qualifies  as  a  good  candidate  as  a  keyword  that  classifies  the   relevant  text.  A  good  text-­‐mining  program  utilizing  the  tf-­‐idf  weighting  scheme  would  be   able  to  extract  this  term  and  present  it  to  a  human  as  a  possible  keyword.    These  weighting   schemes  are  applied  within  vector  space  models  in  order  to  retrieve,  filter  and  index  terms   occurring  in  documents  (Salton,  Wong  &  Yang,  1975).  
Such  models  form  the  basis  of  many   search  and  indexing  engines  (e.g.,  Apache  Lucene)  insofar  as  the  HTML  content  of  each  Web   page  is  crawled  and  indexed  to  determine  its  relevance  based  on  words  and  phrases   occurring  within  the  <title>  and  <heading>  elements,  among  other  ways  (see  Chau  &  Chen,   2008).       Still,  a  bag-­‐of-­‐words  approach  to  text  and  document  mining  can  be  improved  upon  by   incorporating  domain  knowledge  from  experts  into  the  analysis.  For  instance,  experts  can   identify  domain-­‐specific  words,  phrases  and/or  rules.  If  a  document  or  Web  page  is  checked   against  a  dictionary  of  these  listed  features,  those  documents  containing  the  features  will  be   deemed  more  relevant  to  the  search.    This  is  what  often  occurs  after  tokenization  in  many   kinds  of  NLP  software  (e.g.,  GATE).    That  is,  tokenized  words  are  mapped  to  an  internal  
  • 3. Stephen  J.  Stose     April  18,  2011   3   IST  565:  Final  Project       gazetteer  (an  internal  dictionary),  which  operates  as  a  sort  of  pre-­‐classification,  such  that   commonly  occurring  or  well-­‐known  entities  are  extracted  and  annotated  as  such.  For   instance,  a  gazetteer  might  by  default  be  outfitted  to  recognize  all  common  first-­‐  and   surnames  (Noam  or  Bradeley;  Chomsky  or  Manning)  or  organizations  (UN,  United  Nations,   OPEC,  White  House,  Planned  Parenthood)  or  dates  formats  (02/10/1973  or  February  10,   1973).       Thus,  the  selection  of  these  kinds  of  annotations  constrains  the  set  of  words  chosen  to   represent  space  in  space  vector  models.    Thus,  if  we  want  to  ensure  a  domain-­‐specific   vocabulary  is  annotated  as  relevant  to  text  or  document  classification,  we  might  create  a   separate  space  for  those  terms,  and  annotate  each  term  as  belonging  to  a  particular   category.  As  described  later,  we  created  a  gazetteer  of  terms  most  likely  to  occur  on  Web   sites  functioning  as  digital  libraries,  such  that  when  a  random  Web  site  contains  these   terms  it  would  with  a  higher  likelihood  be  classified  as  relevant.       Other  forms  of  linguistic  pre-­‐processing  exist  which  may  or  may  not  enhance  document  and   text  classification  algorithms,  depending  on  the  nature  and  specificity  of  the  task.    For   instance,  sentence  splitters  chunk  tokens  into  sentence  spaces  when  phrases  are  an   important  feature  in  classification.  At  times,  tagging  each  term  within  a  document  with  its   part-­‐of-­‐speech  (POS  Tagging)  is  important.  For  instance,  it  allows  for  the  classification  of   documents  into  language  groups  (e.g.,  Spanish  vs.  English  vs.  German  etc.)  or  sentence   types.  
Given that language is full of ambiguity, of which we'll only scratch the surface here, Named-Entity (NE) transducers ease the confusion by contextualizing certain tokens. For instance, General Motors can be recognized as a company, and not as the name of a military officer (e.g., General Lee). Or "May 10" is a date, "May Day" is a holiday, "May I leave the room" is a request, and "Sallie May Jones" is a person. That is, the transducer disambiguates homographs and homonyms and other such linguistic confusions.

Another common problem in pre-processing is co-reference matching. Often, the same entity is known in different ways or by different spellings: "center" is the same as "centre"; NATO is the same entity as North Atlantic Treaty Organization; Mr. Smith is the same person as Joachim Smith, who is the same person as "he" or "him" (e.g., "Joachim Smith went to town. Everyone greeted him as Mr. Smith and he didn't care for that"). This is an important element when considering frequency weights in vector space models, as two different tokens referencing the same entity should be co-referenced as occurring with frequency 2, not with frequency 1 for each way of referring to the entity.

Basic classification models

Most classification models are forms of supervised learning in that each input value (e.g., a word vector) is paired with an expected discrete output value (i.e., the pre-defined category). As such, the supervised algorithm in training analyzes these pairings to produce an inferred classifier function, and can thereby in testing predict the output value (i.e., the correct classification) for any new valid input.
One instance commonly used in document classification is training a classifier to automatically classify Web pages into a pre-established taxonomy of categories (e.g., sports, politics, art, design, poetry, automobiles, etc.). The accuracy of the trained function in correctly classifying the test set is then computed as a performance measure, each document falling within the expected class to some degree. Herein we establish a trade-off between recall and precision. High
precision implies a high decision threshold for allowing membership into a class. In this way, the algorithm refuses to accept many false positives, but in doing so sacrifices its ability to recall an otherwise larger set of documents, and thus risks missing many relevant documents (i.e., they become false negatives). On the other hand, if a threshold favoring high recall is permitted, we risk lowering our rate of precision and thus allow into the set many documents not relevant to the category (i.e., false positives). The F1-score serves as a statistical compromise between the two: the harmonic mean of recall and precision.

For the mathematical details of many of the classification algorithms, we defer to Hotho, Nurnberger and Paass (2005), but here outline the rudimentary basics of the four most common algorithms: Naïve Bayes, k-nearest neighbor, decision trees, and support vector machines (SVM).

Naïve Bayes applies the conditional probability that document d with the vector of terms t1,…,tn belongs to a certain class: P(ti|classj). Documents with a probability reaching a pre-established threshold are deemed as belonging to the category.

Instead of building a model of probability, the k-nearest neighbor method of classification is an instance-based approach that operates on the basis of the similarity of a document's k nearest neighbors. Using word vectors stored as document attributes and document labels as the class, most computation occurs at testing time, whereby class labels are assigned based on the k most frequent training samples nearest to the document to be classified.
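The precision/recall trade-off described above can be made concrete with a few lines of Python. The counts below are invented for illustration; they simply contrast a strict threshold (few false positives, many misses) with a loose one.

```python
# Worked sketch of precision, recall, and F1 from raw counts of
# true positives (tp), false positives (fp) and false negatives (fn).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Strict threshold: high precision, low recall (illustrative counts).
strict = f1(tp=40, fp=2, fn=20)
# Loose threshold: high recall, lower precision.
loose = f1(tp=55, fp=25, fn=5)
```

Because F1 is a harmonic rather than arithmetic mean, it penalizes a large gap between precision and recall: both must be reasonably high for F1 to be high.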
Decision trees (e.g., C4.5) operate by information gain computed over a recursively built hierarchy of word selection. From labeled documents, term t is selected as the best predictor of the class according to the amount of information gain. The tree splits into subsets, one branch of documents containing the term and the other without, only to then find the next term to split on; this is applied recursively until all documents in a subset belong to the same class.

Support vector machines (SVM) operate by representing each document according to a weighted vector td1,…,tdn based on word frequencies within each document. SVM determines a maximum-margin hyperplane that separates positive (+1) class examples from negative (-1) class examples in the training set. Only a small fraction of documents serve as support vectors, and any new document is classified as belonging to the class if its decision value is greater than 0, and as not belonging to the class if it is less than 0. SVMs can be used with linear or polynomial kernels that transform the space to ensure the classes can be separated linearly.

While the level of performance of each of these classifiers depends on the kind of classification task, the SVM algorithm most reliably outperforms other kinds of algorithms on document classification (Sebastiani, 2002), and thus will be utilized with priority in the study that follows.

Goals and objectives of current study

For our own purposes, we focus on the domain of web classification in order to achieve a twofold purpose: 1) to learn and teach my colleagues about the natural language processing
suite known as GATE (General Architecture for Text Engineering), especially with regard to its Machine Learning (ML) capabilities; and 2) to utilize the GATE architecture in order to classify web documents into two groups: those sites that function as digital library sites (DL), distinguished from all other non-digital-library sites (non-DL).

The purpose of such an exercise is to identify, from amongst the millions of websites, only those sites that operate as digital library sites. Assuming digital library sites are identifiable through certain characteristic earmarks that distinguish them as containing searchable digital collections, the goal is to develop a set of annotations that, by way of an ML algorithm, can be applied as part of a web crawler in order to extract the URL of each site that qualifies as belonging to the DL group, while omitting those that do not. While we are somewhat confident it is possible to obtain a strong level of precision in retrieving many of the relevant sites, excluding sites that merely seem relevant (e.g., those about digital libraries rather than digital libraries themselves), that is, the false positives, is of greater concern. The current author is developing as a prototype a website (www.digitallibrarycentral.com) that seeks to operate as a digital library of all digital library websites: a sort of one-stop visual reference library that points to the collection of all digital libraries. Achieving the goal outlined here would serve to populate this site.
Before such grand ideals can be implemented, however, the current paper will outline some of the first steps in applying the GATE ML architecture towards this objective. Of immediate concern is understanding the GATE architecture and how it functions in natural language processing tasks, so that we can properly pre-process and annotate our target corpora before carrying out ML algorithms on them. We turn now to an explanation of the GATE architecture.

The GATE architecture and text annotation

GATE (General Architecture for Text Engineering) is a set of Java tools developed at the University of Sheffield for the processing of natural language and text engineering tasks in various languages. At its core is an information extraction system called ANNIE (A Nearly-New Information Extraction System), a set of functions that operates on individual documents (including XML, TXT, DOC, PDF, database and HTML structures) and across the corpora to which many documents can belong. These functions comprise tokenizing, a gazetteer, sentence splitting, part-of-speech tagging, named-entity transduction, and co-reference tagging, among others. It also boasts extensive tools for RDF and OWL metadata annotation for creating ontologies for use within the Semantic Web.

Most of these language processes operate seamlessly within GATE Developer's integrated development environment (IDE) and graphical user interface (GUI), the latter allowing users to visualize these functions within a user-friendly environment. For instance, a left-sidebar resource tree displays the Language Resources panel, where the document and document sets (the corpus) reside.
Below that, it also displays the ANNIE Processing Resources (PRs), the natural language processing functions mentioned above that form part of an ordered pipeline to linguistically pre-process the documents. A right sidebar illustrates the resulting color-coded annotation lists after pipeline processing. Additionally, a bottom table exposes the various resulting annotation attributes, as well as a popup annotation editor that allows one to edit and classify (i.e., provide values to) these
annotation sets for training, prototyping, and/or analysis. Figure 1 below shows all of these elements in action.

Figure 1.

These tools complete much of the gritty text-engineering work of document pre-processing so that useful research can be quickly deployed, but in a way that is visually explicit and apparent to those less initiated in these common natural language engineering pre-processing tasks, and in a way that allows for editing these functions as well as introducing various pre-processing plugins and other scripts developed for individual text-mining applications.

Figure 1 displays four open documents uploaded directly by entering their URLs: Newsweek and Reuters (news sites), and JohnJayPapers and DigitalScriptorium (digital libraries). These, along with 10 other news sites and 11 other digital libraries, all belong to the corpus named "DL_eval_2" above (which will serve as Sample 1 later, our first test of DL discrimination). This provides a testing sample to ensure the pre-processing pipeline and Machine Learning (ML) functions operate correctly on our soon-to-be annotated documents.

Just by uploading URLs, GATE by default automatically annotates the HTML markup, as can be seen in the bottom right sidebar where the <a>, <body> and <br> tags are located.
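The idea of turning HTML markup into annotations over the raw text can be illustrated with the Python standard library. This is only a stand-alone sketch of the general technique; GATE performs its own markup annotation internally, with offsets into the document content.

```python
# Sketch: strip HTML tags while recording each tag as an annotation
# with (start, end) character offsets into the remaining text.
from html.parser import HTMLParser

class MarkupAnnotator(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = ""          # document content with markup stripped
        self.annotations = []   # (tag, start, end) offsets into self.text
        self._open = []         # stack of (tag, start) for unclosed tags

    def handle_starttag(self, tag, attrs):
        self._open.append((tag, len(self.text)))

    def handle_endtag(self, tag):
        # close the most recent matching open tag
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, start = self._open.pop(i)
                self.annotations.append((tag, start, len(self.text)))
                break

    def handle_data(self, data):
        self.text += data

annot = MarkupAnnotator()
annot.feed("<body><a>Reuters</a> headlines</body>")
# annot.text is "Reuters headlines"; the <a> annotation spans "Reuters".
```

The stripped text is what downstream tokenizing and gazetteer lookup would operate on, while the offset annotations preserve where each tag applied.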
After running the PR pipeline over the "DL_eval_2" corpus, the upper right sidebar shows the annotations that result from running the tokenizer, gazetteer, sentence splitter, POS tagger, NE transducer and co-referencing orthomatcher. Organization is checked and highlighted in green, for instance, and by clicking on "White House" (one instantiation of Organization), we learn about GATE's {Type.feature=value} syntax, which in the case of "White House" is represented accordingly: {Organization.orgType=government}. This syntax operates as the core annotation engine, and allows for the scripting and manipulation of annotation strings.

The ANNIE PRs in this case provide automatic annotations that serve as a rudimentary start upon which to build any text engineering project. There are many other plugins and PR functions we will not discuss within this review. For our own purposes, we want to call attention to two annotation types ANNIE generates: 1) Token, and 2) Lookup.

A few examples of the Type.feature syntax for the Token type are: the kind of token {Token.kind=word}; the token character length {Token.length=12}; the token POS {Token.category=NN}; the token orthography {Token.orth=lowercase}; or the content of the token string {Token.string=painter}.

Our interest is in analyzing string content: determining whether a particular document is an instance of a digital library or not will require an ML analysis of the unigram strings comprising both DL sites and non-DL sites.
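Such a unigram analysis typically rests on a tf-idf-weighted bag-of-words, which can be sketched compactly. The stop-word list below is an illustrative stub, not GATE's; GATE computes its own weighting internally.

```python
# Compact sketch of a unigram bag-of-words with tf-idf weighting:
# term frequency in the document times log of inverse document frequency.
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative stub

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # document frequency: number of documents each term appears in
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tf_idf(["browse the digital collection",
               "sports headlines of the day"])
# "digital" occurs in only one of the two documents, so its idf is log(2).
```

Terms occurring in every document get weight log(1) = 0, which is exactly why rare, discriminative terms dominate the resulting vectors.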
We can either use all tokens (after removing stop words) to analyze the tf-idf weighting of the documents in question, or we can constrain the kinds of tokens analyzed within the documents by making further specifications. The ANNIE annotation schema provides many default annotations (e.g., Person, Organization, Money, Date, Job Title, etc.) to constrain the kinds of words chosen for analysis, as can be seen in Figure 1 in the upper right sidebar.

Additionally, the Gazetteer provides many other kinds of dictionary lookup entries (60,000 arranged in 80 lists) above and beyond the ANNIE default annotations. For instance, the list named "city" will have as dictionary entities a list of all worldwide cities, such that by mapping these onto the text, a new annotation of the kind {Lookup.minorType=city} is created, which annotates each instance of a city with this markup. The lookup uses a set-subset hierarchy we will not describe, except to say that {Lookup.majorType} is a parent of {Lookup.minorType}. Thus, there are different kinds of locations, for instance, city and country. City and country are thus minorTypes (children) of the {Lookup.majorType=location}.

Classification with GATE Machine Learning

Given that GATE Developer extracts and annotates training documents, several processing plugins that operate at the end of a document pre-processing pipeline serve Machine Learning (ML) functions. The Batch Learning PR has three functions: chunk recognition, relation extraction and classification. This paper is interested in applying supervised ML processes to classify web documents as instances of digital libraries (DL) or not (non-DL).
Supervised ML requires two phases: learning and application. The first phase requires building a data model from instances within a document that has already been correctly
classified. In our case, it requires giving values to certain sets of annotations that, as a whole, will represent the document instance (i.e., the website) as either a hit (DL) or a miss (non-DL). The point is to develop a training set D = (d1,…,dn) of correctly classified DL website documents (d) to build a classification model able to discriminate any future website d as being either a true DL or some other website (non-DL).

The first task requires annotating each document as a whole, and in doing so assigning it to the dependent DL or non-DL class. Up until now, annotations have referred to parts of a document (tokens, sentences, dates, etc.). To annotate a whole document, we begin by creating a new {Type.feature=value} term. To do so, we demarcate the entire text within each document and create a new annotation type called "Mention," a feature called "type" (not to be confused with the annotation type itself) and two distinct values: {Mention.type=dl} and {Mention.type=nondl}.

The attributes used to predict class membership are the two annotation types we highlighted above: 1) Token {Token.string}, and 2) Lookup {Lookup.majorType}. To take full advantage of the Gazetteer, we added a list entry named "dlwords" (i.e., digital library words) with a list of terms commonly found on many digital library websites.
This list of words is reproduced below1:

Advanced Search      Digital Collection(s)    Manuscript(s)
Archive(s)           Digital Content          Repository(ies)
Browse               Digital Library(ies)     Search
Catalog              Digitization             Search Tip(s)
Collection(s)        Digitisation             Special Collection(s)
Digital              Image Collection(s)      University(ies)
Digital Archive(s)   Keyword(s)               University Library(ies)
Image(s)             Library(ies)

All of our analyses will operate using the bag-of-words, to which GATE by default applies tf-idf weighting schemes over a specified n-gram (we'll be using only unigrams). Two attribute annotations, each representing a slightly different bag-of-words, will be used to predict DL or non-DL class membership:

1. When the {Token.string} attribute is chosen to predict {Mention.type} class membership, the bag-of-words includes all non-stop-word tokens within its attribute set.

2. When the Gazetteer is used and "dlwords" are included as part of its internal dictionary, the attribute {Lookup.majorType=dlwords}, along with all the other 60,000 entries, will serve to constrain the set of tokens predicting {Mention.type} class membership.

The GATE Batch Learning PR requires an XML configuration file specifying the ML parameters and the attribute-class annotation sets2. We will only discuss a few essential settings here. For starters, we set the evaluation method as "holdout" with a ratio of 0.66/0.33 training to test.
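The holdout evaluation just mentioned can be sketched in plain Python. GATE's Batch Learning PR performs this split itself when configured with method="holdout"; the function below is only a stand-alone illustration, assuming a simple random shuffle.

```python
# Minimal holdout split: shuffle the corpus, take the first 66% for
# training and the remainder for testing.
import random

def holdout_split(documents, ratio=0.66, seed=42):
    """Shuffle and split documents into (training, test) lists."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)   # fixed seed for reproducibility
    cut = round(len(docs) * ratio)
    return docs[:cut], docs[cut:]

corpus = [f"doc{i}" for i in range(25)]   # e.g. a 25-site corpus
train, test = holdout_split(corpus)
# 25 documents at a 0.66 ratio gives 16 for training and 9 for testing.
```

With a corpus the size of "DL_eval_2" (25 sites), this yields the 16/9 split reported in the results below.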
The main algorithm we will be using is SVM (in GATE, SVMLibSvmJava), with the following parameter settings being varied, the ones reported below providing the best results:

1 Note: the Gazetteer does not stem and is not case-sensitive, thus plural and uppercase variations of these words were provided but are not reproduced here.
2 For a list of all parameter setting possibilities, see http://gate.ac.uk/sale/tao/splitch17.html#x22-43500017.
-t: kernel: 0 (linear (0) vs. polynomial (1))
-c: cost: 0.7 (higher values allow softer margins leading to better generalization)
-tau: uneven margins: 0.5 (varies positive to negative instance ratio)

The XML configuration file is reproduced below, in case others are interested in getting started using GATE ML for basic document classification:

<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="false"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <multiClassification2Binary method="one-vs-another"/>
  <EVALUATION method="holdout" ratio="0.66"/>
  <FILTERING ratio="0.0" dis="near"/>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
          options=" -c 0.7 -t 0 -m 100 -tau 0.4 "/>
  <DATASET>
    <INSTANCE-TYPE>Mention</INSTANCE-TYPE>
    <NGRAM>
      <NAME>ngram</NAME>
      <NUMBER>1</NUMBER>
      <CONSNUM>1</CONSNUM>
      <CONS-1>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
      </CONS-1>
    </NGRAM>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Mention</TYPE>
      <FEATURE>type</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>

The n-gram is set to 1 (unigram), and the <CLASS/> tag within the <ATTRIBUTE> tag indicates this attribute is the class being predicted, with <TYPE>Mention</TYPE> and <FEATURE>type</FEATURE> as {Mention.type=dl or nondl}.
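The SVM decision rule this configuration drives can be sketched schematically. The weights below are invented for illustration: in GATE they are learned by SVMLibSvmJava, and GATE's tau is an uneven-margins parameter applied during training rather than the simple post-hoc threshold shown here.

```python
# Schematic of a linear SVM decision rule: a document is assigned to the
# positive class when its decision value w·x + b clears a threshold.
def decision_value(weights, bias, doc_vector):
    """Dot product of learned term weights with the document's term vector."""
    return sum(weights.get(t, 0.0) * v for t, v in doc_vector.items()) + bias

def classify(weights, bias, doc_vector, tau=0.0):
    """Positive class iff the decision value exceeds the (shifted) margin."""
    return decision_value(weights, bias, doc_vector) > tau

# Illustrative, hand-picked weights -- NOT learned values.
weights = {"archive": 1.2, "collection": 0.9, "sports": -1.5}
dl_like = {"archive": 1.0, "collection": 1.0}
news_like = {"sports": 1.0}
# classify(weights, -0.5, dl_like) is positive; news_like is negative,
# and raising tau tightens the margin, trading recall for precision.
```

Raising the threshold in this way is the mechanism behind the precision/recall trade-off discussed earlier: fewer documents clear the margin, so precision rises while recall falls.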
The <TYPE>Lookup</TYPE> and <FEATURE>majorType</FEATURE> values are substituted in to accommodate the other attribute.

Thus, after running the ANNIE PR pipeline over the "DL_eval_2" corpus, the ML Batch Learning PR is placed alone in the pipeline to run over the annotated set of documents in Evaluation mode. The Batch Learning PR can also operate in Training-Application mode on two separate corpora: one corpus for training and another for application (i.e., testing). The initial results below reflect only a holdout 0.66 evaluation run over one corpus; the current report does not utilize the Training-Application mode.

Sample corpora and results

The Web now boasts over 8 billion indexable pages (Chau & Chen, 2008). Thus training an ML algorithm to pick out the estimated few thousand digital libraries will not be a simple matter. Assuming there are 5 thousand library-standard digital libraries (which may be a high estimate), some of which reside within umbrella Digital Asset Management portals, discriminating these will be cherry-picking at a ratio of, on average, 5 digital libraries per every 8 million Web sites. Spiders (or Web crawlers) can curtail this number greatly by
crawling only a specified argument depth from the starting URL. Unfortunately, librarians are not always good at applying search engine optimization (SEO) standards, and many well-known DLs are deeply embedded in URL arguments, served from unusual ports (the University of Wyoming's uses port 8180), or hidden within site subdomains. Thus, curtailing this argument space too much will result in decreased recall.

Additionally, there are many non-DL websites that use language quite similar to DL websites. For instance, many websites operate as librarian blogs or digital library magazines that serve as discussion spaces regarding DLs, but are not DLs themselves. Unfortunately, these false positives will prove daunting to exclude. We seek only DLs or DL portals that boast archival collections that have been digitized, and as such serve as electronic resources that are co-referenced, searchable, browsable, and catalogued according to some taxonomy or ontology. One suggested way of narrowing down to only these kinds of resources might be to tap into the <meta content> elements, in which librarians often apply conventions such as Dublin Core to demarcate these spaces as digital collection spaces. This is an avenue for further research, and is possible within GATE by utilizing a {Meta.content} attribute. On quick pre-testing, however, it provided no worthy results.

In what follows are two samples of data we evaluated for the ML classification of DL and non-DL websites.

Sample 1

This sample mostly ensured that the GATE ML software and configuration files were operating correctly given the kinds of document-level attributions made.
As mentioned, the first corpus we tested was called "DL_eval_2," which contained 25 websites: 13 DL sites (from Columbia University Digital Collections) and 12 distinct news sites, listed below:

Reuters            LA Times         CNN                CNN
Newsweek           The Guardian     Chicago Tribune    BBC
National Review    CS Monitor       Boston Globe       Wall Street Journal

Using both {Token.string} and {Lookup.majorType} as attributes, the results of the classification of {Mention.type} as either DL or non-DL follow. These results correspond to the ML configuration file found above and utilize the SVMLibSvmJava ML engine at 0.66 holdout evaluation. The training set thus included 16 of the 25 websites (0.66), and the ML algorithm was tested on the remaining 9 (0.33).

{Token.string} misclassified only one instance: Bloomberg News was falsely classified as belonging to {Mention.type=dl}. Nothing in the text of the front page of Bloomberg gave any indication as to why this was the case. Thus, precision, recall and the F1 value for the set were 0.89.

{Lookup.majorType} comprises the Gazetteer, but also includes the digital library terms ("dlwords") we added. Thus, it is a more constrained bag-of-words, smaller than the set of all tokens. Classification improved to 100% using the "dlwords"-enhanced Gazetteer.

Given that this is such a small sample, we cannot conclude very much, except to say that there is something about DL content, when compared to ordinary mainstream news sites, that allows for their discrimination.
Sample 2

To train the ML algorithm to make our target discrimination and allow for any generalizable conclusions, we increased the sample size. Sample 2 consists of 181 non-DLs and 62 DLs. Each set was chosen in the following way:

Non-DL set

A random website generator was used to generate 181 websites that were not digital libraries, were English-language only, and had at least some text (excluding websites generated with only images, etc.):

• http://www.whatsmyip.org/random_websites/

DL set

A set of 62 university digital libraries was chosen, mostly across three main DL university portals:

• Harvard University Digital Collections
  o http://digitalcollections.harvard.edu/
• Cornell University Libraries "Windows on the Past"
  o http://cdl.library.cornell.edu/
• Columbia University Digital Collections
  o http://www.columbia.edu/cu/lweb/digital/collections/index.html

These websites are slightly more representative, but still fall far short of the kind of precision that will be needed to crawl the web as a whole. The results bode well, nevertheless.

Again, using both {Token.string} and {Lookup.majorType} as attributes, the results of the classification of {Mention.type} as either DL or non-DL follow. The 0.66/0.33 (total) holdout training/test splits of the data were: 160/83 (243) websites: 40/22 (62) DL and 120/61 (181) non-DL. The Naïve Bayes and C4.5 algorithms misclassified 100% of the DL test websites (22/22) with both sets of attributes, achieving a total F1 of 0.73. It is not clear why this is the case.
Given that SVM is well known as the best-performing classifier for texts (Sebastiani, 2002), we stick to it for our purposes.

{Token.string} performed slightly better than {Lookup.majorType}. For both cases, there were very few misclassifications, and most of these were false negatives. When only the Gazetteer entry words, including "dlwords," were taken into account ({Lookup.majorType}), 3/22 DLs were misclassified as non-DL (precision=0.95; recall=0.86; F1=0.90) and 1/61 non-DLs was misclassified as DL (precision=0.95; recall=0.98; F1=0.97).

When all tokens were entered into the bag-of-words ({Token.string}), precision was perfect for DL classification and recall was perfect for non-DL classification. That is, 3/22 DLs were still misclassified as non-DL (precision=1.0; recall=0.86; F1=0.93). All 61/61 of non-DLs were
classified correctly, resulting in perfect recall, but lacking precision insofar as 64 total websites were classified as non-DL: the 61 expected, plus the 3 others which should have been classified as DL (precision=0.95; recall=1.0; F1=0.98).

Thus, overall, using all tokens achieved slightly higher rates of precision and recall for the discrimination of DL websites from all websites, based on this small and still very non-proportional sample. Total F1 values were 0.96 for {Token.string} and 0.95 for {Lookup.majorType}. The question remains as to whether both attributes were misclassifying the exact same websites. It turns out that 2 of the 3 websites misclassified by each attribute were the same. Figure 2 below illustrates the breakdown of these statistics per attribute.
{Token.string}: Precision=1.0, Recall=0.86

False negatives (misclassified as non-DL):
• Digital Scriptorium
  www.scriptorium.columbia.edu
• Holocaust Rescue & Relief (Andover-Harvard Theological)
  www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection
  www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/

False positives (misclassified as DL):
• none

{Lookup.majorType}: Precision=0.95, Recall=0.86

False negatives (misclassified as non-DL):
• Harvard Business Education for Women (1937-1970)
  http://www.library.hbs.edu/hc/daring/intro.html#nav-intro
• Holocaust Rescue & Relief (Andover-Harvard Theological)
  www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection
  www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/

False positives (misclassified as DL):
• www.spi-poker.sourceforge.net

Figure 2.

Conclusion

In this paper we discussed how GATE (General Architecture for Text Engineering) employs Machine Learning to classify documents from the web into two categories: websites that operate as digital library sites, and websites that do not. This exercise was completed firstly in order to learn about GATE, but secondarily to provide a possible solution for populating a site the current author is creating for digital libraries (www.digitallibrarycentral.com). No current directory exists as a single-stop go-to resource for digital libraries; as is, digital libraries are difficult to find and hence often un- or under-utilized by the ordinary web user.
By creating a digital library of all digital libraries, we hope to bring the ordinary user to the plethora of digitized resources available, and to categorize these digital collections according to a taxonomy that allows for the collation of similar kinds and types of digital libraries. Indeed, once these digital resources are all
brought into a single collection, GATE Machine Learning might provide a solution to the automatic classification of these resources into the supervised taxonomy.

As it is, we first seek to locate these resources using machine learning. If the web were made up of three ordinary non-DL websites for every one DL website, the classifier we trained would have a very easy time locating all of the DLs (with 96% accuracy). As it is, however, of the 8 billion websites in existence today, we reckon that only 3-6 thousand operate as digital libraries in some form or another. Thus, a lot of work still needs to be done in order to find the DL needle in the haystack of all websites online today.

References

Chau, M. and Chen, H. (2008). A machine learning approach to web page filtering using content and structure analysis. Decision Support Systems, 44(2), 482-494.

Hotho, A., Nurnberger, A. and Paass, G. (2005). A brief survey of text mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning (ECML), Springer.

Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.

Sebastiani, F. (2002). Machine learning in automated text categorization.
ACM Computing Surveys, 34, 1-47.

Witten, I. H. (2005). Text mining. In M. P. Singh (Ed.), Practical handbook of internet computing. Chapman & Hall/CRC Press, Boca Raton, Florida.

Witten, I. H., Don, K. J., Dewsnip, M. and Tablan, V. (2004). Text mining in a digital library. Journal of Digital Libraries, 4(1), 56-59.