The document discusses techniques for extracting data from web pages. It describes approaches using visual information, ontologies, HTML parsing, and regular expressions. Example systems described include ViDE, ODE, FiVaTech, EXALG and DELA. The document also discusses challenges such as handling multiple query results, matching data to labels, resolving labeling conflicts, and extracting both mandatory and optional data items.
3. - A list of results decorated with
     - Side bars
     - Branding banners
     - Advertisements
     - Merchant information
     - Search forms
     - Navigation elements
4. - Data Area Identification
   - Record Segmentation
   - Data Alignment
5.
6.
7. - Visual Information
     - ViDE, VIPER
   - Ontology
     - ODE
   - HTML Page based
     - FiVaTech
   - Regular Expression
     - EXALG, DELA
8. - Weifeng Su, Jiying Wang, Frederick H. Lochovsky. 2009.
   - 1. Domain ontology construction
     - Query interface
     - Query result pages
   - 2. Data extraction using the ontology
     - Identify the data area
     - Segment records
     - Align data values
13. - 1. Match query interface elements to data values.
      - title=“%orientalism%”
    - 2. Search for voluntary labels in table headers.
    - 3. Search for voluntary labels encoded together with data values.
      - ISBN No: 0814756654
      - ISBN No: 0789204592
    - 4. Data value formats
      - 18/09/2008 : 20080918
      - 03/18/98 : 19980318
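Steps 3 and 4 above can be illustrated with a minimal Python sketch. It is not the ODE implementation: the function names, the regular expression, and the ordered list of candidate date formats are all assumptions made for illustration. It splits a label that is encoded together with its value, and normalizes the two date examples to the YYYYMMDD form shown on the slide.

```python
import re
from datetime import datetime

# Candidate source formats, tried in order. This ordering is an assumption;
# a real system would need per-site rules for ambiguous day/month order.
DATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%d/%m/%y", "%m/%d/%y"]


def normalize_date(text):
    """Return the date as YYYYMMDD, or None if no candidate format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


# Hypothetical pattern: everything before the first colon is the label.
LABELLED_VALUE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def split_label(text):
    """Split 'label: value' text; return (None, text) when there is no label."""
    match = LABELLED_VALUE.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text.strip()
```

Trying four-digit-year formats first keeps "18/09/2008" from being misread, and "03/18/98" only fits the month-first, two-digit-year format because 18 is not a valid month.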
14. - 1. Value-level matching
      - Data value similarity
    - 2. Label-level matching
      - Label co-occurrence
    - 3. Label-value matching
      - Check the assigned label
      - Assign a suitable label to columns
      - Matching conflict resolution
15.
16.
17. - 1. Matching is unique → create attribute
    - 2. Matching is 1:1 → alias
      - Category : Subject
    - 3. Matching is 1:n → n+1 attributes
      - Author: {Last Name, First Name}
    - 4. Matching is n:m → n:1 + 1:m
18.
19. - One result page → One data area
    - Maximum Entropy Model
      - Maximum Correlation Subtree Identification
20. - 1 result
    - Several results (CABABABAD)
      - Find continuous repeated patterns
      - Visual gap
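Finding a continuous repeated pattern in a sequence such as CABABABAD can be sketched as a brute-force search for the contiguous repeat that covers the most symbols. This is an illustrative sketch only: the function name, the tie-breaking rule (earliest start, then shortest unit), and the use of one character per symbol are assumptions, and the slides do not say how ODE performs the search.

```python
def find_repeated_pattern(sequence):
    """Return (start, unit, count) for the contiguous repeat covering the
    most symbols, or None when nothing repeats back to back."""
    best = None  # (coverage, start, unit, count)
    n = len(sequence)
    for start in range(n):
        # A unit can repeat only if at least two copies fit after `start`.
        for size in range(1, (n - start) // 2 + 1):
            unit = sequence[start:start + size]
            count = 1
            # Count how many copies of `unit` follow one another directly.
            while sequence[start + count * size:start + (count + 1) * size] == unit:
                count += 1
            coverage = count * size
            if count >= 2 and (best is None or coverage > best[0]):
                best = (coverage, start, unit, count)
    return None if best is None else best[1:]
```

For CABABABAD the sketch reports the unit AB repeated three times starting at index 1, which leaves the leading C and trailing D as non-record content. A page with a single result has no repeat, so the function returns None and another cue, such as the visual gap, would be needed.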
21. - Each data value is assigned a label
      - Maximum Entropy Model
      - Match with ontology
    - Label → Column
22.
23. - Wei Liu, Xiaofeng Meng and Weiyi Meng. 2009.
    - ViDRE: Data Record Extractor
    - ViDIE: Data Item Extractor
    - New measure: revision
24. - 1. Build a Visual Block tree
    - 2. Extract data records
      - Noise block filtering
      - Block clustering
      - Block regrouping
    - 3. Partition data records into data items and align them
28. - Simple one-pass clustering algorithm
      - Take the first block from the list and use it to form a cluster.
      - For each remaining block, compute its similarity to the existing clusters.
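The one-pass clustering idea can be sketched as follows. The slide only says that similarities to existing clusters are computed; the rest is assumed here: a block joins the most similar cluster when its average similarity to that cluster's members reaches a threshold, and otherwise starts a new cluster. The `similarity` callback and `threshold` parameter are hypothetical.

```python
def one_pass_cluster(blocks, similarity, threshold):
    """Cluster blocks in a single pass over the list.

    similarity(a, b) should return a score in [0, 1]. Both the averaging
    rule and the threshold test are assumptions for illustration.
    """
    clusters = []
    for block in blocks:
        best_cluster, best_score = None, None
        for cluster in clusters:
            # Average similarity between the block and the cluster members.
            score = sum(similarity(block, member) for member in cluster) / len(cluster)
            if score >= threshold and (best_score is None or score > best_score):
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([block])  # the first block always lands here
        else:
            best_cluster.append(block)
    return clusters


def field_overlap(a, b):
    """Toy similarity: fraction of equal fields in two equal-length tuples."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

With blocks described by toy (font, size) tuples and `field_overlap` as the similarity, blocks that share both features end up in the same cluster regardless of where they appear in the list.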
29. - ViDE assumes
      - 1. Blocks in the same cluster all come from different data records.
      - 2. The cluster with the maximum number n of blocks may contain the mandatory value of the data records.
30. - Step 1: Rearrange the blocks in each cluster.
    - Step 2: Use a cluster with n blocks as the seed. Initialize n groups, each containing one seed block.
    - Step 3: For every block (in all clusters), determine which group it belongs to.
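The three regrouping steps can be sketched in Python under strong simplifying assumptions. Each block is a hypothetical `(cluster_id, y, text)` tuple, Step 1 orders blocks by vertical position, and Step 3 attaches a block to the group whose seed block is vertically nearest. The slides do not state the actual ordering or assignment rule, so this only illustrates the overall shape of the procedure.

```python
from collections import defaultdict


def regroup(blocks):
    """Regroup clustered blocks into one group per data record.

    blocks: list of (cluster_id, y, text) tuples (a made-up representation).
    """
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block[0]].append(block)

    # Step 1: rearrange the blocks in each cluster (here: top to bottom).
    for members in clusters.values():
        members.sort(key=lambda b: b[1])

    # Step 2: the largest cluster supplies the seeds, one group per seed.
    seed_id = max(clusters, key=lambda cid: len(clusters[cid]))
    seeds = clusters[seed_id]
    groups = [[seed] for seed in seeds]

    # Step 3: place every other block in the group of the nearest seed
    # (an assumed rule based on vertical distance).
    for cluster_id, members in clusters.items():
        if cluster_id == seed_id:
            continue
        for block in members:
            nearest = min(range(len(seeds)), key=lambda i: abs(seeds[i][1] - block[1]))
            groups[nearest].append(block)
    return groups
```

Because the seed cluster is the one assumed to hold a mandatory value, every record gets a group even when an optional item, such as a price, is missing from it.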
31.
32.
33. - WDBt: total number of web databases processed
    - WDBc: number of web databases whose precision and recall are both 100%
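The slides name a revision measure but do not show its formula. Assuming it is the share of web databases that were not extracted perfectly, (WDBt - WDBc) / WDBt, it can be computed as in the sketch below. The function and parameter names are illustrative.

```python
def revision(wdb_total, wdb_correct):
    """Share of web databases needing manual revision.

    Assumed formula: (WDBt - WDBc) / WDBt, where WDBc counts the databases
    whose precision and recall are both 100%.
    """
    if wdb_total <= 0:
        raise ValueError("wdb_total must be positive")
    if not 0 <= wdb_correct <= wdb_total:
        raise ValueError("wdb_correct must lie between 0 and wdb_total")
    return (wdb_total - wdb_correct) / wdb_total
```

Under this reading, a lower revision is better: it is 0 when every web database is extracted with perfect precision and recall.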
34.
35.
36. [Figure: tree with Root at the top, the Data Area (LCA) below it, and Records divided by Record Separators]