Result Page Analysis (Cheng Wang)

²  A list of results decorated with
³  Ø Side bars
³  Ø Branding banners

³  Ø Advertisement

³  Ø Merchant Information

³  Ø Search forms

³  Ø Navigation part

²  Data Area Identification
²  Record Segmentation

²  Data Alignment

²  Visual Information
³  Ø ViDE, VIPER
²  Ontology
³  Ø ODE
²  HTML Page based
³  Ø FiVaTech
²  Regular Expression
³  Ø EXALG, DELA

²  Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.

²  1. Domain ontology construction
³  Ø query interface
³  Ø query result pages

²  2. Data extraction using the ontology
³  Ø Identify data area
³  Ø Segment records
³  Ø Data value alignment

²  Multiple Query Result Page
³  Ø PADE

²  1. Match query interface elements to data values.
³  Ø title=“%orientalism%”

²  2. Search for voluntary labels in table headers.

²  3. Search for voluntary labels encoded together with data values.
³  Ø ISBN No: 0814756654
³  Ø ISBN No: 0789204592

²  4. Data values formats
³  Ø 18/09/2008 : 20080918
³  Ø 03/18/98 : 19980318

²  1. Value level matching
³  Ø Data value similarity
²  2. Label level matching
³  Ø Label co-occurrence
²  3. Label-value matching
³  Ø Check assigned label
³  Ø Assign a suitable label for columns

³  Ø Matching conflict resolution

²  1. Matching is unique → create attribute
²  2. Matching is 1:1 → alias
³  Ø Category : Subject
²  3. Matching is 1:n → n+1 attributes
³  Ø Author: {Last Name, First Name}
²  4. Matching is n:m → n:1 + 1:m
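
The four matching cases can be sketched as a single function. This is one possible reading of the slide, not the ODE implementation: the `Attribute` class, the `build_attributes` name, and in particular the `bridge` attribute used to split an n:m matching into n:1 plus 1:m are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    aliases: list = field(default_factory=list)


def build_attributes(left, right, bridge=None):
    """Turn one label matching into ontology attributes.

    left and right are the lists of labels matched to each other;
    right is empty when the label on the left is unmatched (unique).
    """
    n, m = len(left), len(right)
    if m == 0:                      # unique: create the attribute
        return [Attribute(name) for name in left]
    if n == 1 and m == 1:           # 1:1: the second label becomes an alias
        return [Attribute(left[0], aliases=[right[0]])]
    if n == 1:                      # 1:n: n+1 attributes
        return [Attribute(left[0])] + [Attribute(r) for r in right]
    if m == 1:                      # n:1 is the mirror image of 1:n
        return build_attributes(right, left)
    # n:m: split into n:1 + 1:m through a bridging attribute (assumption).
    if bridge is None:
        raise ValueError("an n:m matching needs a bridge attribute name")
    first = build_attributes(left, [bridge])
    second = build_attributes([bridge], right)
    return first + second[1:]       # keep the bridge only once


print(build_attributes(["Category"], ["Subject"]))
print(build_attributes(["Author"], ["Last Name", "First Name"]))
```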

²  One result page → One data area
²  Maximum Entropy Model
³  Ø Maximum Correlation Subtree Identification

²  Ø 1 result
²  Ø several results (CABABABAD)

³  Ø find continuous repeated patterns
³  Ø Visual gap

²  Each data value is assigned a label
³  Ø Maximum Entropy Model
³  Ø Match with Ontology
²  Ø Label → Column

²  Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.

²  ViDRE: Data Record Extractor
²  ViDIE: Data Item Extractor

²  New measure: revision

²  1. Build a Visual Block tree
²  2. Extract data records

³  Ø Noise block filtering
³  Ø Blocks clustering

³  Ø Regroup blocks

²  3. Partition data records into data items and align them

²  Mandatory data items
²  Optional data items

²  Static data items

²  Simple one-pass clustering algorithm
³  Ø Take the first block from the list and use it to form a cluster.
³  Ø For each remaining block, compute its similarity to the existing clusters.
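
A minimal sketch of the one-pass clustering above. The slide does not say what happens after the similarities are computed, so the sketch assumes the usual rule: a block joins the most similar cluster if the similarity reaches a threshold, and otherwise starts a new cluster. The threshold value, the average-to-members similarity, and the `feature_overlap` function are assumptions.

```python
def one_pass_cluster(blocks, similarity, threshold=0.8):
    """Cluster blocks in a single pass over the list."""
    clusters = []
    for block in blocks:
        best, best_score = None, threshold
        for cluster in clusters:
            # Similarity to a cluster: average over its members (assumption).
            score = sum(similarity(block, b) for b in cluster) / len(cluster)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([block])   # the first block always lands here
        else:
            best.append(block)
    return clusters


def feature_overlap(a, b):
    """Hypothetical similarity: fraction of equal features."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Blocks described by (content type, font style); values are made up.
blocks = [("text", "bold"), ("text", "bold"), ("image", "-"), ("text", "plain")]
print(one_pass_cluster(blocks, feature_overlap))
```

Because each block is visited once and clusters are never merged afterwards, the result depends on the order of the block list.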

²  ViDE assumes
³  1. Blocks in the same cluster all come from different data records.
³  2. The cluster with the maximum number n of blocks may contain a mandatory value of the data records.

²  Step 1: Rearrange the blocks in each cluster.
²  Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
²  Step 3: For every block (in all clusters), determine which group it belongs to.
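
The three regrouping steps can be sketched as follows. The slide does not give the ordering used in Step 1 or the membership rule used in Step 3, so two simplifying assumptions are made: blocks are ordered by their position on the page, and each seed block is the first item of its record, so a block belongs to the closest seed at or before it. Blocks are represented as hypothetical (position, text) tuples.

```python
from bisect import bisect_right


def regroup(clusters):
    """Regroup clustered blocks into one group per data record."""
    # Step 1: order the blocks of each cluster by page position (assumption).
    ordered = [sorted(cluster) for cluster in clusters]

    # Step 2: the largest cluster is the seed; one group per seed block.
    seed = max(ordered, key=len)
    seed_positions = [position for position, _ in seed]
    groups = [[] for _ in seed]

    # Step 3: assign every block to the last seed at or before it (assumption).
    for cluster in ordered:
        for block in cluster:
            index = bisect_right(seed_positions, block[0]) - 1
            groups[max(index, 0)].append(block)
    return [sorted(group) for group in groups]


clusters = [
    [(0, "Title A"), (3, "Title B"), (5, "Title C")],   # seed, n = 3
    [(1, "$10"), (4, "$12"), (6, "$9")],
    [(2, "Used")],                                      # optional item
]
for record in regroup(clusters):
    print(record)
```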

²  WDBt: total number of web databases processed
²  WDBc: number of web databases whose precision and recall are both 100%
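
The slides list the two counts but not the formula that combines them. The sketch below assumes that revision is the share of web databases that are not perfectly extracted, (WDBt - WDBc) / WDBt; the function name and the example counts are hypothetical.

```python
def revision(wdb_total, wdb_correct):
    """Fraction of web databases whose precision or recall is below 100%.

    Assumed definition: (WDBt - WDBc) / WDBt.
    """
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_correct <= wdb_total:
        raise ValueError("wdb_correct must lie between 0 and wdb_total")
    return (wdb_total - wdb_correct) / wdb_total


# Made-up counts: 100 web databases, 90 of them extracted perfectly.
print(revision(100, 90))  # 0.1
```

A lower value is better under this definition: it is the fraction of sources whose output would still need manual revision.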

[Figure: tree with Root at the top, Data Area (LCA) below it, and the children Record, Separator, Record, Separator, Record.]

²  Real-estate domain
²  60 agents’ websites

³  Ø MRP: 95.0%
³  Ø ERP: 90.0%

[Figure: tree with Root at the top, Data Area below it, and six child nodes: Record 1 Part A, Record 1 Part B, Record 2 Part A, Record 2 Part B, Record 3 Part A, Record 3 Part B.]

²  DIADEM 0.1:
³  Ø Construct Real-estate result page ontology
³  Ø Ontological Record Segmentation

°  (More features)
³  Ø Data labeling and data alignment
²  After:
³  Ø Add visual information
