Machine Learning in DIADEM
Reading Course Presentation

Andrey Kravchenko
20th of January, 2010
  
Current area of research
- Real estate page classification [figure: two example pages, one vs. the other];
- Input and output page distinction;
- Page element classification.
  
The Reading List
Papers not included in this presentation
- "An interactive clustering-based approach to integrating source query interfaces on the Deep Web"
  - This paper is concerned with input forms.
- "Automatic wrapper induction from hidden-web sources with domain knowledge"
  - Only a part of the paper deals with the output pages. Their methodology for processing the output pages is based on gazetteers and is thus closer to linguistics than ML.
- "Web scale extraction of structured data"
  - Deals with the whole Web.
- "An adaptive information extraction system based on wrapper induction with POS tagging"
  - The labels are of very low granularity (e.g. work_name, work_location) and of linguistic nature. The comparison is done against linguistics systems such as Rapier (another excluded paper on the reading list), GATE-SVM, etc. Introducing POS tagging provides only a 5% gain in accuracy, and only for some target slots in one corpus; there is no gain for the other two corpora.
  
- "Learning (k,l)-contextual tree languages for information extraction from Web pages"
  - The paper deals with learning an extraction language rather than extraction itself.
- "Bottom-up relational learning of pattern matching rules for Information Extraction"
  - Deals with textual documents only.
- "Learning rules to pre-process Web data for automatic integration"
  - Relies on web data extraction and alignment phases performed by the VIPER system that are not described in the paper. I wasn't able to detect any ML involved in the stage of rule learning. No clear description of practical results. Low-level granularity of labels.
- "Learning rules for information extraction"
  - Is not HTML/DOM specific.
  
The Reading List
Papers included in this presentation
#1 "Web page classification: features and algorithms" (2007)
#2 "Web page element classification based on visual features"
#3 "Stylistic and lexical co-training for Web block classification"
#4 "Can we learn a template-independent wrapper for news article extraction from a single training site?"
#5 "Efficient record-level wrapper induction"
#6 "Towards combining Web classification and Web Information Extraction: a case study"
  	
  
	
  
Paper #1
Web page classification: features and algorithms
X. Qi and B. Davison (Lehigh University, 2007)
- The paper distinguishes between four types of classification;
- They also distinguish between subject classification, functional classification, sentiment classification, and other types of classification;
- The paper distinguishes between on-page features and the features of the neighbours;
- On-page features:
  - Textual analysis: bag of words vs n-gram;
  - Visual analysis: the multigraph approach.
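To make the bag-of-words vs n-gram distinction concrete, here is a minimal sketch. It is not taken from the paper: the whitespace tokenisation, the function names and the sample text are all assumptions chosen for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count individual lowercase tokens, ignoring word order."""
    return Counter(text.lower().split())


def word_ngrams(text, n=2):
    """Count contiguous runs of n tokens, which keeps local word order."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical page text, made up for this example.
page_text = "two bedroom flat for sale two bedroom house to rent"
bow = bag_of_words(page_text)
bigrams = word_ngrams(page_text, n=2)

print(bow["bedroom"])                # how often the single word occurs
print(bigrams[("two", "bedroom")])   # how often the ordered pair occurs
```

The bag of words loses the fact that "two" precedes "bedroom", while the bigram counts keep it at the cost of a much larger feature space.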
  
	
  
  
- When using the features of neighbouring pages, the authors distinguish between the weak assumption and the strong assumption;
- They also distinguish between different types of neighbours: parents/children, grandparents/grandchildren and siblings/spouses;
- It appears that siblings are the most important neighbours;
- Various features are used for different types of neighbouring pages;
- Algorithm survey: dimension reduction and relational learning approaches.
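The neighbour types can be read off a link graph. The sketch below assumes the usual convention: a parent links to the page, a child is linked from it, siblings share a parent and spouses share a child. The function name and the toy links are hypothetical.

```python
from collections import defaultdict


def neighbours(edges, page):
    """Return the parent, child, sibling and spouse sets of `page`.

    `edges` is an iterable of (source, target) hyperlinks.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = set(in_links[page])
    children = set(out_links[page])
    # Siblings share at least one parent with the page.
    siblings = {s for p in parents for s in out_links[p]} - {page}
    # Spouses share at least one child with the page.
    spouses = {s for c in children for s in in_links[c]} - {page}
    return {"parents": parents, "children": children,
            "siblings": siblings, "spouses": spouses}


# Toy link graph, made up for this example.
links = [("hub", "a"), ("hub", "b"), ("a", "detail"), ("c", "detail")]
result = neighbours(links, "a")
print(result)
```

Grandparents and grandchildren would follow by applying the parent or child step twice.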
  
Paper #2
Web page element classification based on visual features
R. Burget and I. Rudolfova (Brno University, 2009)
- Problem: Classification of elements from a web page based on its visual rendering;
- Assumptions: A tagged corpus, DOM tree, CSSBox layout;
- Approach: Page segmentation followed by block classification performed via Weka's J48 decision tree classifier;
- Features: Font features, spatial features, text features, colour features;
- Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
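Since most of the evaluations in this presentation are reported as F1 values, here is a short reminder of how the measure is computed. F1 is the harmonic mean of precision and recall. The counts in the example are made up and do not come from any of the papers.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, or 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one label: 8 correct, 2 spurious, 2 missed.
example_f1 = f1_score(8, 2, 2)
print(round(example_f1, 3))
```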
  
- The approach of this paper is split into two phases:
  - Page segmentation;
  - Page element classification;
- Page segmentation is done in four phases:
  - Page rendering;
  - Detecting basic visual areas;
  - Text line detection;
  - Block detection;
- As a result of page segmentation we obtain a tree of areas.
  
- The actual page element classification is performed for each area via Weka's J48 decision tree classifier based on the following set of features:
  - Font features {fontsize, weight};
  - Spatial features {aabove, abelow, aleft, aright};
  - Text features {tdigits, tlower, tupper, tspaces, tlength};
  - Colour features {contrast}.
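The slide only gives the names of the text features, not their definitions. As a rough sketch, assume that tdigits, tlower, tupper and tspaces are the shares of digits, lowercase letters, uppercase letters and whitespace in the text of an area, and that tlength is its character count. These definitions are my assumption and may differ from the paper.

```python
def text_features(text):
    """Compute illustrative text features for the text of one visual area."""
    length = len(text)
    if length == 0:
        return {"tdigits": 0.0, "tlower": 0.0, "tupper": 0.0,
                "tspaces": 0.0, "tlength": 0}

    def share(predicate):
        # Fraction of characters for which the predicate holds.
        return sum(1 for ch in text if predicate(ch)) / length

    return {
        "tdigits": share(str.isdigit),
        "tlower": share(str.islower),
        "tupper": share(str.isupper),
        "tspaces": share(str.isspace),
        "tlength": length,
    }


# Hypothetical text of one area on a property page.
features = text_features("Price 250000 GBP")
print(features)
```

A feature dictionary like this, one per area, is the kind of input a decision tree learner such as J48 would be trained on.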
  
      	
  
[Figures: the set of labels; results (the testing pages from another source than the training pages)]
  
Paper #3
Stylistic and Lexical Co-training for Web Block Classification
C. Lee et al. (National University of Singapore, 2004)
- Problem: Classification of elements from a web page based on both stylistic and lexical features;
- Assumptions: A tagged corpus, DOM tree, CSSBox layout;
- Approach: Web block division followed by co-training with BoosTexter, an ensemble learning method with a decision stump corresponding to a single weak learner;
- Features: Lexical and stylistic;
- Evaluation: News domain. Average F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
  
- The authors aim to combine two different classifiers with distinct sets of features (lexical and stylistic);
- They've created a PARser for Content Extraction and Layout Structure (PARCELS);
- Web page division: the authors differentiate between structural tags and content tags.
  
  
                                                 	
  
- The authors distinguish between labels of different levels of granularity. They define 17 tags for labelling;
- Stylistic features:
  - Linear structure: paragraph (<p>), header (<h1>-<h6>) and rule tags (<hr>);
  - Table structure: cell flow, neighbouring cells' data, the position of table cells;
  - XHTML/CSS structure: height, width, z-index;
  - Font features: colour, weight, family, size, hyperlink features;
  - Images: size, number of images within a block;
- Lexical features:
  - Low-level features: count and vocabulary of the words present in the text block;
  - High-level features: POS-tags, mailto-links, image-links, text-links, total-links;
- BoosTexter is used for co-training. It is an ensemble learning method with a decision stump corresponding to a single weak learner.
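To illustrate what "an ensemble of decision stumps" means, here is a minimal sketch of plain discrete AdaBoost on a binary toy problem. It is neither BoosTexter itself nor the co-training procedure of the paper. The data, the function names and the number of rounds are assumptions made for this example.

```python
import math


def stump_predict(stump, row):
    """A stump tests one feature against one threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] >= threshold else -polarity


def train_stump(samples, labels, weights):
    """Pick the (feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({row[feature] for row in samples}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for row, y, w in zip(samples, labels, weights)
                            if stump_predict(stump, row) != y)
                if best is None or error < best[0]:
                    best = (error, stump)
    return best


def boost(samples, labels, rounds=3):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    weights = [1.0 / len(samples)] * len(samples)
    ensemble = []
    for _ in range(rounds):
        error, stump = train_stump(samples, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        weights = [w * math.exp(-alpha * y * stump_predict(stump, row))
                   for row, y, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: only the middle value is positive, so no single stump fits it.
samples = [[1.0], [2.0], [3.0]]
labels = [-1, 1, -1]
ensemble = boost(samples, labels, rounds=3)
print([predict(ensemble, row) for row in samples])
```

On this toy set a single stump necessarily misclassifies one point, while the weighted vote of three stumps separates all three.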
  
  
Paper #4
Can we learn a template-independent wrapper for news article extraction from a single training site?
J. Wang et al. (Zhejiang University and MS Research, 2009)
- Problem: Classification of titles and bodies of news taken from webpages belonging to the news domain;
- Assumptions: A tagged corpus, DOM tree, CSSBox layout;
- Approach: SVM; the decision function gets converted to a posterior probability;
- Features: Different sets of features for body and title extraction. Features are divided into content and spatial features;
- Evaluation: Overall 99% extraction accuracy.
  
- The aim of the paper is to efficiently extract and then combine titles and bodies of news articles;
- The main problem is dealing with the various kinds of noise around the titles.
  
- News body extraction:
  - Content features: FormattingElementsNum and FormattedContentLen;
  - Spatial features: normalised RectLeft, RectTop, RectWidth and RectHeight;
  - News body extraction heuristics: TopInScreen(T) and BigEnough(T);
- News title extraction:
  - Content features: FontSize, EndWithFullStop, WordNum;
  - Spatial features: RectLeft, RectTop, RectWidth, RectHeight, Overlap, Distance, Flat;
  - News title extraction heuristics: WholeInScreen(T), NoAnchorText(T), NotCategoryName(T);
- An SVM approach is chosen for classification. The decision function gets converted to a posterior probability.
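The slide does not say how the decision function is converted to a probability. One common choice is a Platt-style sigmoid, P(y=1 | f) = 1 / (1 + exp(a*f + b)), where f is the SVM decision value and a, b are fitted on held-out data. The sketch below assumes that form. The parameter values are placeholders, not fitted values from the paper.

```python
import math


def platt_probability(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value f to 1 / (1 + exp(a*f + b)).

    `a` and `b` are placeholder values here. In practice they would be
    fitted on held-out data; a negative `a` makes larger decision
    values more probable.
    """
    z = a * decision_value + b
    # Numerically stable evaluation of 1 / (1 + exp(z)).
    if z >= 0:
        ez = math.exp(-z)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))


# Hypothetical decision values for three candidate title blocks.
for value in (-2.0, 0.0, 2.0):
    print(value, round(platt_probability(value), 3))
```

A block lying on the decision boundary gets probability 0.5, and the probability grows with the distance on the positive side.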
  
[Figures: testing results on the large-scale experiment; extraction results]
  
Paper #5
Efficient record-level wrapper induction
S. Zheng et al. (Pennsylvania State University, 2009)
- Problem: Efficient extraction of records from Web pages and classification of their elements;
- Assumptions: A tagged corpus, DOM tree;
- Approach: Alignment of the DOM subtree and the possible wrappers;
- Features: None;
- Evaluation: Four different domains (online shops, user reviews, digital libraries, search results). Seven detail page datasets and eleven list page datasets. A 99% F1 value.
  
- The paper is concerned with extracting records and their respective attributes;
- The key distinction from other approaches is record-level extraction as opposed to page-level extraction;
- The authors propose a novel broom structure for this task;
- The broom structure has a head and a stick;
- One of the main issues is crossing records.
  
- The general architecture of the system involves training and testing phases.
  
- The authors claim to achieve a remarkable extraction accuracy and a significant boost in running time performance.
  
Paper #6
Towards combining Web classification and Web Information Extraction: a case study
P. Luo et al. (HP Labs China, 2009)
- Problem: Combining the classification of web pages by their relevance to a specific domain with the extraction of their specific elements, using both forward and backward dependencies;
- Assumptions: A tagged corpus, DOM tree;
- Approach: Conditional Random Fields (CRFs);
- Features: Course terms and heuristics for course homepage detection; format, position and content features for course title extraction;
- Evaluation: OfCourse system for online course information extraction. 90% F1 value for course page classification, 83% F1 value for course title extraction.
  
- The authors propose a method that utilises both forward and backward dependencies between Web classification and information extraction;
- The authors use a unified graphical CRF model for joint and simultaneous optimisation of these two steps;
- This methodology has been used for building the OfCourse online search engine;
- In their results for OfCourse the authors claim that their model significantly outperforms the two baseline methods;
- Drawbacks: they only deal with DOM leaf nodes as classification variables for the information extraction phase.
  
Lessons learnt from the Reading Course
#1 "Web page classification: features and algorithms" by X. Qi and B. Davison (2007): the importance of the features of neighbouring pages;
#2 "Web page element classification based on visual features" by R. Burget and I. Rudolfova (2009): a broad set of visual features (font features, spatial features, text features and colour features);
#3 "Stylistic and Lexical Co-training for Web Block Classification" by C. Lee et al. (2004): a useful web block division algorithm. A possibility of co-training on the same corpus using two distinct sets of features;
#4 "Can we learn a template-independent wrapper for news article extraction from a single training site?" by J. Wang et al. (2009): a distinctive set of features for news title extraction, many of which can be used for property title extraction in DIADEM;
#5 "Efficient record-level wrapper induction" by S. Zheng et al. (2009): a new record-level approach for extraction. Performs much better and faster than the page-level approaches. Can be useful for DIADEM extraction in record-heavy domains;
#6 "Towards combining Web classification and Web Information Extraction: a case study" by P. Luo et al. (2009): the backward dependency between these two tasks can work as well. Thus it is worthwhile to experiment with their mutual tie-up.
  
General lessons learnt
- Most of the papers are recent or very recent (2004-2009);
- Features play a much more important role than algorithms;
- Initial page segmentation into blocks can help with subsequent determination of relevant DOM subtrees;
- All features can be broadly divided into content features and visual features;
- News domain is a very popular one (3 out of 5 reviewed systems). No mention of real estate in any of the papers.
  
Summary of the Reading Course and its relevance to DIADEM
- The six proposed papers are of relevance to all three areas of my current research:
  - Real estate page classification;
  - Output/Input page distinction;
  - Property page elements' classification;
- The most obvious synergy is with Omer's NLP work, although overlaps with Cheng's and Xiaonan's work are also possible;
- I plan to use a subset of the features presented in these papers in the classification of the elements of output pages and subsequent real estate page classification.
  	
  
Thank you for your attention!
  


Machine Learning Web Page Classification Features

  • 1. Machine  Le arning  in   DIADEM   Reading  Co urse  Presen tation     Andrey  Kra vchenko   20 th  of  Janu ary,  2010  
  • 2. Current  area  of  research   Real  estate  page  classiDication   vs  
  • 3. Current  area  of  research     Input  and  output  page  distinction
  • 4. Current  area  of  research   Page  element  classiDication  
  • 5. The Reading List: Papers not included in this presentation
      • “An interactive clustering-based approach to integrating source query interfaces on the Deep Web”
          • This paper is concerned with input forms.
      • “Automatic wrapper induction from hidden-web sources with domain knowledge”
          • Only a part of the paper deals with the output pages. Their methodology for processing the output pages is based on gazetteers and is thus closer to linguistics than ML.
      • “Web scale extraction of structured data”
          • Deals with the whole Web.
      • “An adaptive information extraction system based on wrapper induction with POS tagging”
          • The labels are of very low granularity (e.g. work_name, work_location) and of linguistic nature. The comparison is done against linguistic systems such as Rapier (another excluded paper on the reading list), GATE-SVM, etc. Introducing POS tagging provides only a 5% gain in accuracy, and only for some target slots in one corpus; there is no gain for the other two corpora.
  • 6. The Reading List: Papers not included in this presentation
      • “Learning (k,l)-contextual tree languages for information extraction from Web pages”
          • The paper deals with learning an extraction language rather than extraction itself.
      • “Bottom-up relational learning of problem matching rules for Information Retrieval”
          • Deals with textual documents only.
      • “Learning rules to pre-process Web data for automatic integration”
          • Relies on web data extraction and alignment phases performed by the VIPER system that are not described in the paper. I wasn’t able to detect any ML involved in the rule learning stage. No clear description of practical results. Low-level granularity of labels.
      • “Learning rules for information extraction”
          • Is not HTML/DOM specific.
  • 7. The Reading List: Papers included in this presentation
      • #1 “Web-page classification: features and algorithms” (2007)
      • #2 “Web page element classification based on visual features”
      • #3 “Stylistic and lexical co-training for Web-block classification”
      • #4 “Can we learn a template-independent wrapper for news article extraction from a single training site?”
      • #5 “Efficient record-level wrapper induction”
      • #6 “Towards combining Web classification and Web Information Extraction: a case study”
  • 8. Paper #1: Web page classification: features and algorithms. X. Qi and B. Davison (Lehigh University, 2007)
      • The paper distinguishes between four types of classification;
      • They also distinguish between subject classification, functional classification, sentiment classification, and other types of classification;
      • The paper distinguishes between on-page features and the features of the neighbours;
      • On-page features:
          • Textual analysis: bag of words vs n-gram;
          • Visual analysis: the multigraph approach.
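To make the "bag of words vs n-gram" contrast on this slide concrete, here is a minimal sketch in plain Python. The whitespace tokenizer, the function names and the sample sentence are illustrative assumptions, not taken from the paper; the point is only that a bag of words discards word order, whereas word n-grams keep local order.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Count single tokens; word order is lost."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count runs of n consecutive tokens; local word order is kept."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    page_text = "two bedroom flat for sale in Oxford"  # hypothetical page text
    print(bag_of_words(page_text))
    print(word_ngrams(page_text, n=2))
```

With bigrams, "for sale" and "sale for" become different features, which a bag of words cannot distinguish.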
  • 9. Paper #1: Web page classification: features and algorithms. X. Qi and B. Davison (Lehigh University, 2007)
  • 10. Paper #1: Web page classification: features and algorithms. X. Qi and B. Davison (Lehigh University, 2007)
      • When using the features of neighbouring pages, the authors distinguish between the weak assumption and the strong assumption;
      • They also distinguish between different types of neighbours: parents/children, grandparents/grandchildren and siblings/spouses;
      • It appears that siblings are the most important neighbours;
      • There are various features used for different types of neighbouring pages;
      • Algorithm survey: dimension reduction and relational learning approaches.
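The neighbour types listed on this slide can be read off a link graph. The sketch below assumes the usual link-based reading: a parent links to the page, a child is linked from it, siblings share a parent, and spouses share a child. The `links` dictionary and the page names are hypothetical.

```python
def neighbours(links, page):
    """Group the neighbours of `page` by relation.

    `links` maps each page to the set of pages it links to (assumed input
    format). Relations are derived purely from hyperlink structure.
    """
    children = set(links.get(page, ()))
    parents = {p for p, outs in links.items() if page in outs}
    grandchildren = {g for c in children for g in links.get(c, ())}
    grandparents = {g for g, outs in links.items() if parents & set(outs)}
    # Siblings share at least one parent; spouses share at least one child.
    siblings = {s for p in parents for s in links.get(p, ())} - {page}
    spouses = {p for p, outs in links.items() if children & set(outs)} - {page}
    return {
        "parents": parents,
        "children": children,
        "grandparents": grandparents,
        "grandchildren": grandchildren,
        "siblings": siblings,
        "spouses": spouses,
    }


if __name__ == "__main__":
    toy_links = {"root": {"hub"}, "hub": {"a", "b"}, "a": {"c"}, "b": {"c"}}
    print(neighbours(toy_links, "a"))
```

Features of each group (for example the text or labels of sibling pages) could then be aggregated separately, which is one way to test the observation that siblings matter most.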
  • 11. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009)
      • Problem: Classification of elements from a web page based on its visual rendering;
      • Assumptions: A tagged corpus, DOM tree, CSSBox layout;
      • Approach: Page segmentation followed by block classification performed via Weka’s J48 decision tree classifier;
      • Features: Font features, spatial features, text features, colour features;
      • Evaluation: News domain. Moderate F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
  • 12. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009)
      • The approach of this paper is split into two phases:
          • Page segmentation;
          • Page element classification;
      • Page segmentation is done in four phases:
          • Page rendering;
          • Detecting basic visual areas;
          • Text line detection;
          • Block detection;
      • As a result of page segmentation we obtain a tree of areas.
  • 13. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009)
      • The actual page element classification is performed for each area via Weka’s J48 decision tree classifier, based on the following set of features:
          • Font features {fontsize, weight};
          • Spatial features {aabove, abelow, aleft, aright};
          • Text features {tdigits, tlower, tupper, tspaces, tlength};
          • Colour features {contrast}.
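The slide names the text features but does not define them. As a hedged illustration, the sketch below assumes that tdigits, tlower, tupper and tspaces are the proportions of digits, lowercase letters, uppercase letters and whitespace in the text of an area, and that tlength is its length in characters. These definitions are an assumption made for the example and may differ from the paper's.

```python
def text_features(text):
    """Compute character-class proportions for the text of one visual area.

    The feature names follow the slide; their exact definitions here are
    an assumption (share of characters in each class, plus total length).
    """
    n = len(text)
    if n == 0:
        return {"tdigits": 0.0, "tlower": 0.0, "tupper": 0.0,
                "tspaces": 0.0, "tlength": 0}

    def share(predicate):
        return sum(1 for ch in text if predicate(ch)) / n

    return {
        "tdigits": share(str.isdigit),
        "tlower": share(str.islower),
        "tupper": share(str.isupper),
        "tspaces": share(str.isspace),
        "tlength": n,
    }


if __name__ == "__main__":
    # A hypothetical dateline; a high digit share hints at a date or price.
    print(text_features("20 January 2010"))
```

A feature vector like this, joined with the font, spatial and colour features, is the kind of input a decision tree classifier such as J48 would consume.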
  • 14. Paper #2: Web page element classification based on visual features. R. Burget and I. Rudolfova (Brno University, 2009). [Figures: the set of labels; results (the testing pages come from a different source than the training pages)]
  • 15. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
      • Problem: Classification of elements from a web page based on both stylistic and lexical features;
      • Assumptions: A tagged corpus, DOM tree, CSSBox layout;
      • Approach: Web block division followed by co-training with BoosTexter, an ensemble learning method in which each weak learner is a decision stump;
      • Features: Lexical and stylistic;
      • Evaluation: News domain. Moderate F1 measure on coarse-grained labels, low F1 measure on fine-grained labels.
  • 16. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
      • The authors aim to combine two different classifiers with distinct sets of features (lexical and stylistic);
      • They’ve created a PARser for Content Extraction and Layout Structure (PARCELS);
      • Web page division: the authors differentiate between structural tags and content tags.
  • 17. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
  • 18. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
      • The authors distinguish between labels of different levels of granularity. They define 17 tags for labelling;
      • Stylistic features:
          • Linear structure: paragraph (<p>), header (<h1>-<h6>) and rule tags (<hr>);
          • Table structure: cell flow, neighbouring cells’ data, the position of table cells;
          • XHTML/CSS structure: height, width, z-index;
          • Font features: colour, weight, family, size, hyperlink features;
          • Images: size, number of images within a block;
      • Lexical features:
          • Low-level features: count and vocabulary of the words present in the text block;
          • High-level features: POS-tags, mailto-links, image-links, text-links, total-links;
      • BoosTexter is used for co-training. It is an ensemble learning method in which each weak learner is a decision stump.
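To illustrate what "an ensemble of decision stumps" means, here is a minimal sketch of plain binary AdaBoost over one-feature threshold stumps. It is not BoosTexter itself and it leaves out the co-training loop; the toy features (font size, link count), the labels and all function names are assumptions made for the example.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with lowest weighted error.

    The stump predicts `polarity` if x[feature] > threshold, else -polarity.
    """
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (pol if x[f] > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def stump_predict(stump, x):
    _, f, thr, pol = stump
    return pol if x[f] > thr else -pol


def adaboost(X, y, rounds=5):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Hypothetical blocks: [font size, number of links]; +1 = title block.
    X = [[10, 0], [12, 1], [20, 0], [24, 1]]
    y = [-1, -1, 1, 1]
    model = adaboost(X, y)
    print([predict(model, x) for x in X])
```

In a co-training setting, one such ensemble would be trained on the stylistic view and another on the lexical view, with each labelling unlabelled blocks for the other.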
  • 19. Paper #3: Stylistic and Lexical Co-training for Web Block Classification. C. Lee et al (National University of Singapore, 2004)
  • 21. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research)
      • Problem: Classification of titles and bodies of news articles taken from web pages belonging to the news domain;
      • Assumptions: A tagged corpus, DOM tree, CSSBox layout;
      • Approach: SVM; the decision function gets converted to a posterior probability;
      • Features: Different sets of features for body and title extraction. Features are divided into content and spatial features;
      • Evaluation: Overall 99% extraction accuracy.
  • 22. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research)
      • The aim of the paper is to efficiently extract and then combine titles and bodies of news articles;
      • The main problem is in dealing with the various kinds of noise around the titles.
  • 23. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research)
      • News body extraction:
          • Content features: FormattingElementsNum and FormattedContentLen;
          • Spatial features: normalised RectLeft, RectTop, RectWidth and RectHeight;
          • News body extraction heuristics: TopInScreen(T) and BigEnough(T);
      • News title extraction:
          • Content features: FontSize, EndWithFullStop, WordNum;
          • Spatial features: RectLeft, RectTop, RectWidth, RectHeight, Overlap, Distance, Flat;
          • News title extraction heuristics: WholeInScreen(T), NoAnchorText(T), NotCategoryName(T);
      • An SVM approach is chosen for classification. The decision function gets converted to a posterior probability.
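The slide does not say how the SVM decision function is turned into a posterior probability. A common choice, assumed here purely for illustration, is a Platt-style sigmoid P(y = 1 | f) = 1 / (1 + exp(a·f + b)), where f is the decision value and a, b are fitted on held-out data. The default values of `a` and `b` below are placeholders, not fitted parameters.

```python
import math


def posterior(decision_value, a=-1.0, b=0.0):
    """Map an SVM decision value to a probability with a sigmoid.

    Computes 1 / (1 + exp(a * f + b)) in a numerically stable way.
    `a` and `b` are hypothetical defaults; in practice they would be
    fitted on held-out data (with a < 0, so that larger margins give
    higher probabilities).
    """
    z = a * decision_value + b
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


if __name__ == "__main__":
    for f in (-2.0, 0.0, 2.0):  # hypothetical decision values
        print(f, posterior(f))
```

Having probabilities rather than raw margins makes it possible to rank candidate title and body nodes on a common scale and pick the most probable one per page.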
  • 24. Paper #4: Can we learn a template-independent wrapper for news article extraction from a single training site? J. Wang et al (2009, Zhejiang University, MS Research). [Figures: extraction results; testing results on the large-scale experiment]
  • 25. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
      • Problem: Efficient extraction of records from Web pages and classification of their elements;
      • Assumptions: A tagged corpus, DOM tree;
      • Approach: Alignment of the DOM subtree and the possible wrappers;
      • Features: None;
      • Evaluation: Four different domains (online shops, user reviews, digital libraries, search results). Seven detail page datasets and eleven list page datasets. A 99% F1 value.
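Several evaluations in this presentation are reported as F1 values. As a reminder of what the number measures, the sketch below computes precision, recall and F1 for a set of extracted records against a gold standard. The record identifiers are made up for the example and have no connection to the paper's datasets.

```python
def precision_recall_f1(predicted, actual):
    """Return (precision, recall, F1) for two collections of item ids.

    F1 is the harmonic mean of precision and recall:
    F1 = 2 * P * R / (P + R), defined as 0 when both are 0.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    extracted = {"r1", "r2", "r3", "r4"}    # hypothetical wrapper output
    gold = {"r1", "r2", "r3", "r5", "r6"}   # hypothetical ground truth
    print(precision_recall_f1(extracted, gold))
```

Because F1 is a harmonic mean, a 99% value requires both precision and recall to be close to 99%; a high score on one cannot compensate for a low score on the other.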
  • 26. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
      • The paper is concerned with extracting records and their respective attributes;
      • The key distinction from other approaches is record-level extraction as opposed to page-level extraction;
      • The authors propose a novel broom structure for this task;
      • The broom structure has a head and a stick;
      • One of the main issues is crossing records.
  • 27. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
  • 28. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
      • The general architecture of the system involves training and testing phases.
  • 29. Paper #5: Efficient record level wrapper induction. S. Zheng et al (Pennsylvania State University, 2009)
      • The authors claim to achieve remarkable extraction accuracy and a significant improvement in running time.
  • 30. Paper #6: Towards combining Web classification and Web Information Extraction: a case study. P. Luo et al (HP Labs China, 2009)
      • Problem: Combination of web page classification, based on the pages’ relevance to a specific domain, with the extraction of their specific elements, using both forward and backward dependencies;
      • Assumptions: A tagged corpus, DOM tree;
      • Approach: Conditional Random Fields (CRFs);
      • Features: Course terms and heuristics for course homepage detection; format, position and content features for course title extraction;
      • Evaluation: OfCourse system for online course information extraction. 90% F1 value for course page classification, 83% F1 value for course title extraction.
  • 31. Paper #6: Towards combining Web classification and Web Information Extraction: a case study. P. Luo et al (HP Labs China, 2009)
      • The authors propose a method that utilises both forward and backward dependencies between Web classification and information extraction;
      • The authors use a unified graphical CRF model for joint and simultaneous optimisation of these two steps;
      • This methodology has been used for building the OfCourse online search engine;
      • In their results for OfCourse the authors claim that their model significantly outperforms the two baseline methods;
      • Drawbacks: they only deal with DOM leaf nodes as classification variables for the information extraction phase.
  • 32.
  • 33. Lessons learnt from the Reading Course
      • #1 “Web page classification: features and algorithms” by X. Qi and B. Davison (2007): the importance of the features of neighbouring pages;
      • #2 “Web page element classification based on visual features” by R. Burget and I. Rudolfova (2009): a broad set of visual features (font features, spatial features, text features and colour features);
      • #3 “Stylistic and Lexical Co-training for Web Block Classification” by C. Lee et al (2004): a useful web block division algorithm. A possibility of co-training on the same corpus using two distinct sets of features;
  • 34. Lessons learnt from the Reading Course
      • #4 “Can we learn a template-independent wrapper for news article extraction from a single training site?” by J. Wang et al (2009): a distinctive set of features for news title extraction, many of which can be used for property title extraction in DIADEM;
      • #5 “Efficient record level wrapper induction” by S. Zheng et al (2009): a new record-level approach for extraction. Performs much better and faster than the page-level approaches. Can be useful for DIADEM extraction in the record-heavy domains;
      • #6 “Towards combining Web classification and Web Information Extraction: a case study” by P. Luo et al (2009): the backward dependency between these two tasks can work as well. Thus it is worthwhile to experiment with coupling them.
  • 35. General lessons learnt
      • Most of the papers are recent or very recent (2004-2009);
      • Features play a much more important role than algorithms;
      • Initial page segmentation into blocks can help with subsequent determination of relevant DOM subtrees;
      • All features can be broadly divided into content features and visual features;
      • The news domain is a very popular one (3 out of 5 reviewed systems). No mention of real estate in any of the papers.
  • 36. Summary of the Reading Course and its relevance to DIADEM
      • The six proposed papers are of relevance to all three areas of my current research:
          • Real estate page classification;
          • Output/Input page distinction;
          • Property page elements’ classification;
      • The most obvious synergy is with Omer’s NLP work, although overlaps with Cheng’s and Xiaonan’s work are also possible;
      • I plan to use a subset of the features presented in these papers in the classification of the elements of output pages and subsequent real estate page classification.
  • 37. Thank you for your attention!