24. Dimension 2: Techniques
Tools | Scan Pass | Extraction Rule Type | Features Used | Learning Algorithm | Tokenization Schemes
Minerva | Single | Regular exp. | HTML tags/Literal words | None | Manually
TSIMMIS | Single | Regular exp. | HTML tags/Literal words | None | Manually
WebOQL | Single | Regular exp. | Hypertree | None | Manually
W4F | Single | Regular exp. | DOM tree path addressing | None | Tag Level
XWRAP | Single | Context-Free | DOM tree | None | Tag Level
RAPIER | Multiple | Logic rules | Syntactic/Semantic | ILP (bottom-up) | Word Level
SRV | Multiple | Logic rules | Syntactic/Semantic | ILP (top-down) | Word Level
WHISK | Single | Regular exp. | Syntactic/Semantic | Set covering (top-down) | Word Level
NoDoSE | Single | Regular exp. | HTML tags/Literal words | Data Modeling | Word Level
DEByE | Multiple | Regular exp. | HTML tags/Literal words | Data Modeling | Word Level
WIEN | Single | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
STALKER | Multiple | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
SoftMealy | Both | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
IEPAD | Single | Regular exp. | HTML tags | Pattern Mining, String Alignment | Multi-Level
OLERA | Single | Regular exp. | HTML tags | String Alignment | Multi-Level
DeLa | Single | Regular exp. | HTML tags | Pattern Mining | Tag Level
RoadRunner | Single | Regular exp. | HTML tags | String Alignment | Tag Level
EXALG | Single | Regular exp. | HTML tags/Literal words | Equivalent Class and Role Differentiation by DOM tree path | Word Level
DEPTA | Single | Tag Tree | HTML tags tree | Pattern Mining, String comparison, Partial tree alignment | Tag Level
ViPER | Single | Tag Tree | Visual Features/HTML tags tree | Pattern Mining, global string alignment by Divide and Conquer | Tag Level
MSE | Single | Tag Tree | Visual Features/HTML tags tree | Pattern Mining with visual features | Tag Level
25. Dimension 3: Automation degree
Tools | User Expertise | Fetch support | Output/API Support | Applicability | Limitation
Minerva | Programming | No | XML | High | Not restricted
TSIMMIS | Programming | No | Text | High | Not restricted
WebOQL | Programming | No | Text | High | Not restricted
W4F | Programming | Yes | XML | Medium | Not restricted
XWRAP | Programming | Yes | XML | Medium | Not restricted
RAPIER | Labeling | No | Text | Medium | Not restricted
SRV | Labeling | No | Text | Medium | Not restricted
WHISK | Labeling | No | Text | Medium | Not restricted
NoDoSE | Labeling | No | XML, OEM | Medium | Not restricted
DEByE | Labeling | Yes | XML, SQL DB | Medium | Not restricted
WIEN | Labeling | No | Text | Medium | Not restricted
STALKER | Labeling | No | Text | Medium | Not restricted
SoftMealy | Labeling | Yes | XML, SQL DB | Medium | Not restricted
IEPAD | Post labeling Pattern selection | No | Text | Low | Multiple-records page
OLERA | Partial Labeling | No | XML | Low | Not restricted
DeLa | No Interaction | Yes | Text | Low | Multiple-records page, More than one page
RoadRunner | No Interaction | Yes | XML | Low | More than one page
EXALG | No Interaction | No | Text | Low | More than one page
DEPTA | Pattern selection | No | SQL DB | Low | Multiple-records pages
ViPER | No Interaction | No | SQL DB | Low | Multiple-records pages
MSE | No Interaction | No | -- | Low | More than one page
30. Page Generation Model A Web page is generated by embedding data values x (taken from a Database) into a predefined template T. All data instances of the database conform to a common schema.
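The generation model can be illustrated with a minimal Python sketch. The template string, the field names `title` and `price`, and the two database rows are all hypothetical; the only point is that every page is the same template T with a different data instance x filled in.

```python
from string import Template

# Hypothetical template T: fixed markup with slots for the data values.
T = Template("<html><body><h1>$title</h1><p>Price: $price</p></body></html>")

# Hypothetical database; every instance conforms to the same schema (title, price).
database = [
    {"title": "Book A", "price": "10"},
    {"title": "Book B", "price": "12"},
]


def generate_page(template, x):
    """Embed one data instance x into the template."""
    return template.substitute(x)


pages = [generate_page(T, x) for x in database]
for page in pages:
    print(page)
```

Extraction is the inverse of `generate_page`: given only `pages`, recover both the fixed markup and the values that vary.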
33. Tree Templates T1 ⊕_i T2 is the new tree that results from appending tree T2 to the i-th node (counted from the reference point) on the rightmost path of tree T1.
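A small Python sketch of this append operation follows. The `Node` class and the helper names are hypothetical, and one indexing choice is an assumption rather than something the slide states: i is counted upward from the last node of the rightmost path, so i = 0 is the rightmost leaf.

```python
class Node:
    """Hypothetical minimal tree node: a label plus ordered children."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])


def rightmost_path(root):
    """Return the nodes from the root down to the rightmost leaf."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def append_at(t1, i, t2):
    """Attach t2 as the new last child of the i-th node on t1's rightmost path.

    Assumption: i counts upward from the end of the path (i = 0 is the leaf).
    The tree t1 is modified in place and returned.
    """
    path = rightmost_path(t1)
    if not 0 <= i < len(path):
        raise IndexError("i is outside the rightmost path")
    path[len(path) - 1 - i].children.append(t2)
    return t1


def to_string(node):
    """Render a tree as label(child child ...)."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(to_string(c) for c in node.children) + ")"


t1 = Node("html", [Node("body", [Node("div", [Node("p")])])])
result = to_string(append_at(t1, 1, Node("span")))
print(result)  # html(body(div(p span)))
```

With i = 1 the new subtree lands under `div`, one step above the rightmost leaf `p`.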
38. Problem Formulation Definition: Given a set of n DOM trees, DOM_i = (T, x_i) (1 ≤ i ≤ n), created from some unknown template T and values {x_1, ..., x_n}, deduce the template and the values from the set of DOM trees alone. We call this problem page-level information extraction. If a single page (n = 1) that contains tuple constructors is given as input, the problem is to deduce the template for the schema inside the tuple constructors. We call this problem record-level information extraction.
40. FiVaTech System Overview Given several DOM trees (Web pages) as input, FiVaTech merges all of them at once into a single tree called the fixed/variant pattern tree. From this pattern tree, it recognizes variant leaf nodes as basic-typed data and mines repetitive nodes as set-typed data.
48. Step 2: Peer Matrix Alignment The peerMatrixAlignment algorithm.
50. Peer Matrix Alignment (cont.) The spans of a, b, c, d, e are 0, 3, 3, 3, 0.
51. Peer Matrix Alignment (cont.) The function alignmentResult handles the problem of different functionalities by a clustering algorithm.
52. Peer Matrix Alignment (cont.) The clustering algorithm. The principle here is: nodes in each row of the matrix M should have not only the same structure but also the same functionality.
53. Step 3: Frequent Pattern Mining A Formal Description of a Repetitive Pattern.
61. Data Schema Detection (cont.) The schema S is the pattern tree after excluding all tag nodes that have no types.
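The pruning step can be sketched in Python as follows. The `PatternNode` class and the type names `"basic"`, `"set"`, and `"tuple"` are hypothetical, and one behavior is an assumption: when an untyped tag node is dropped, its typed descendants are attached to the nearest typed ancestor.

```python
class PatternNode:
    """Hypothetical pattern-tree node; node_type is None for untyped tag nodes."""

    def __init__(self, tag, node_type=None, children=None):
        self.tag = tag
        self.node_type = node_type
        self.children = list(children or [])


def extract_schema(node):
    """Return the list of schema nodes that replaces `node`.

    Typed nodes are copied; untyped nodes are dropped and their typed
    descendants are promoted (an assumption of this sketch).
    """
    kept = []
    for child in node.children:
        kept.extend(extract_schema(child))
    if node.node_type is None:
        return kept
    return [PatternNode(node.tag, node.node_type, kept)]


def show(node):
    """Render a schema tree by its types only."""
    if not node.children:
        return node.node_type
    return node.node_type + "(" + ", ".join(show(c) for c in node.children) + ")"


pattern_tree = PatternNode("html", None, [
    PatternNode("body", None, [
        PatternNode("table", "set", [
            PatternNode("tr", "tuple", [
                PatternNode("td", "basic"),
                PatternNode("td", None, [PatternNode("b", "basic")]),
            ]),
        ]),
    ]),
])

schema = extract_schema(pattern_tree)
print([show(s) for s in schema])  # ['set(tuple(basic, basic))']
```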
63. Template Identification Templates are identified by segmenting the pre-order traversal of the tree (skipping basic-type nodes) at every reference node.
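A rough Python sketch of this segmentation follows. The slide does not define which nodes are reference nodes or on which side of them the cut falls, so both are assumptions here: each node carries a hypothetical `is_reference` flag, and a reference node closes the segment it appears in.

```python
class TNode:
    """Hypothetical tree node with a kind ("basic" or "tag") and a reference flag."""

    def __init__(self, tag, kind="tag", is_reference=False, children=None):
        self.tag = tag
        self.kind = kind
        self.is_reference = is_reference
        self.children = list(children or [])


def preorder(node):
    """Yield nodes in pre-order (node first, then children left to right)."""
    yield node
    for child in node.children:
        yield from preorder(child)


def segment_templates(root):
    """Split the pre-order tag sequence into segments at reference nodes.

    Basic-type nodes are skipped. Assumption: a reference node ends the
    current segment.
    """
    segments, current = [], []
    for node in preorder(root):
        if node.kind == "basic":
            continue
        current.append(node.tag)
        if node.is_reference:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


tree = TNode("html", children=[
    TNode("body", children=[
        TNode("h1", kind="basic"),
        TNode("table", is_reference=True, children=[
            TNode("tr", is_reference=True, children=[TNode("td", kind="basic")]),
        ]),
        TNode("div", children=[TNode("span", kind="basic")]),
    ]),
])

segments = segment_templates(tree)
print(segments)  # [['html', 'body', 'table'], ['tr'], ['div']]
```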
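67. FiVaTech as a Schema Extractor Experiments The comparison with EXALG schema. Dataset: 9 Web sites on EXALG home page.
site | N | Manual A_m | Manual O_m | Manual {} | EXALG A_e | EXALG O_e | EXALG {} | EXALG c | EXALG Incorr. i | EXALG Incorr. n | FiVaTech A_e | FiVaTech O_e | FiVaTech {} | FiVaTech c | FiVaTech Incorr. i | FiVaTech Incorr. n
Amazon (Cars) | 21 | 13 | 0 | 5 | 15 | 0 | 5 | 11 | 4 | 2 | 8 | 1 | 4 | 8 | 0 | 0
Amazon (Pop) | 19 | 5 | 0 | 1 | 5 | 0 | 1 | 5 | 0 | 0 | 5 | 0 | 1 | 5 | 0 | 0
MLB | 10 | 7 | 0 | 4 | 7 | 0 | 4 | 7 | 0 | 0 | 6 | 0 | 1 | 6 | 0 | 1
RPM | 20 | 6 | 1 | 3 | 6 | 1 | 3 | 6 | 0 | 0 | 5 | 0 | 3 | 5 | 0 | 1
UEFA (Teams) | 20 | 9 | 0 | 0 | 9 | 0 | 0 | 9 | 0 | 0 | 9 | 0 | 0 | 9 | 0 | 0
UEFA (Play) | 20 | 2 | 0 | 1 | 4 | 2 | 1 | 2 | 2 | 0 | 2 | 0 | 0 | 2 | 0 | 0
E-Bay | 50 | 22 | 3 | 0 | 28 | 2 | 0 | 18 | 10 | 4 | 20 | 5 | 0 | 19 | 1 | 3
Netflix | 50 | 29 | 9 | 6 | 37 | 2 | 1 | 25 | 12 | 4 | 34 | 12 | 7 | 29 | 5 | 0
US Open | 32 | 35 | 13 | 10 | 42 | 4 | 10 | 33 | 9 | 2 | 33 | 14 | 11 | 33 | 0 | 2
Total | 242 | 128 | 26 | 25 | 153 | 11 | 23 | 116 | 37 | 12 | 122 | 32 | 20 | 116 | 6 | 7
Recall: EXALG 90.6%, FiVaTech 90.6%
Precision: EXALG 75.8%, FiVaTech 95.1%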
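The reported percentages are consistent with the usual definitions applied to the Total row: recall as correct attributes over manually identified attributes, and precision as correct attributes over extracted attributes. The slide does not state these formulas, so they are an assumption; the short Python check below only shows that the totals reproduce the figures under that reading.

```python
# Values copied from the Total row of the table.
manual_attributes = 128  # attributes in the manual schema
systems = {
    "EXALG": {"extracted": 153, "correct": 116},
    "FiVaTech": {"extracted": 122, "correct": 116},
}


def percent(numerator, denominator):
    """Ratio as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Assumed definitions: recall = correct / manual, precision = correct / extracted.
results = {
    name: (
        percent(s["correct"], manual_attributes),
        percent(s["correct"], s["extracted"]),
    )
    for name, s in systems.items()
}

for name, (recall, precision) in results.items():
    print(f"{name}: recall {recall}%, precision {precision}%")
```

Both systems find the same number of correct attributes, so the recall is identical; the precision gap comes from EXALG extracting 153 attributes against FiVaTech's 122.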
68. FiVaTech as a SRR Extractor Experiments (cont.) To recognize the data sections of a Web site, FiVaTech identifies the set of nodes n_SRRs that are the outermost set-type nodes, i.e. the path from a node n_SRRs to the root of the schema tree contains no other set-type node. A special case arises when an identified node n_SRRs has only one child, and that child is itself of set type. This means the data records of the section are presented in more than one column of the Web page; FiVaTech still catches the data in this case.
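The selection rule can be sketched in Python. The `SchemaNode` class, the type names, and the example tree are hypothetical; the sketch only shows how to collect set-type nodes that have no set-type ancestor and how to detect the multi-column special case.

```python
class SchemaNode:
    """Hypothetical schema-tree node with a type such as "set", "tuple" or "basic"."""

    def __init__(self, tag, node_type, children=None):
        self.tag = tag
        self.node_type = node_type
        self.children = list(children or [])


def outermost_set_nodes(node):
    """Collect set-type nodes whose path to the root has no other set-type node."""
    if node.node_type == "set":
        # Stop here: every node below this one has a set-type ancestor.
        return [node]
    found = []
    for child in node.children:
        found.extend(outermost_set_nodes(child))
    return found


def is_multi_column(n_srrs):
    """Special case: the node's only child is itself of set type."""
    return len(n_srrs.children) == 1 and n_srrs.children[0].node_type == "set"


schema = SchemaNode("root", "tuple", [
    SchemaNode("h1", "basic"),
    SchemaNode("ul", "set", [
        SchemaNode("li", "tuple", [
            SchemaNode("a", "basic"),
            SchemaNode("span", "set", [SchemaNode("i", "basic")]),
        ]),
    ]),
    SchemaNode("table", "set", [
        SchemaNode("tr", "set", [
            SchemaNode("td", "tuple", [SchemaNode("b", "basic")]),
        ]),
    ]),
])

sections = outermost_set_nodes(schema)
print([(n.tag, is_multi_column(n)) for n in sections])
# [('ul', False), ('table', True)]
```

The inner `span` set is not reported because it sits below the `ul` set, while `table` is flagged as the multi-column case because its only child `tr` is also a set.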