7. Training Phase
Clean the HTML
DOM sub-trees
CSS class co-occurrence
CSS Selectors
8. Training Phase
Clean the HTML
DOM sub-trees
CSS class co-occurrence
Value Constraints
CSS Selectors
We could determine patterns for emails, for example:
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com
vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
vcard:email mailto : ALPHA @ ALPHANUMERIC . com
vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE
… or even for birthdays:
vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER
vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
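As a rough illustration of how such value patterns might be derived, the sketch below splits a value into tokens and maps each token to a coarse class. It is not μRaptor's actual implementation: the function names (classify, to_pattern), the set of literals kept as-is, and the numeric thresholds behind SMALL_NUMBER and MEDIUM_NUMBER are all assumptions made for this example.

```python
import re

# Runs of ASCII letters/digits, or a single non-space punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def classify(token):
    """Map one token to a coarse class name.

    The numeric thresholds (100 and 10000) are assumptions for this sketch;
    the slides do not say where SMALL_NUMBER or MEDIUM_NUMBER end.
    """
    if token.isdigit():
        n = int(token)
        if n < 100:
            return "SMALL_NUMBER"
        if n < 10000:
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"  # mixed-case letters
    if token.isalnum():
        return "ALPHANUMERIC"  # letters and digits mixed
    return "PUNCTUATION"


def to_pattern(value, keep=("mailto", ":", "@", ".", "-")):
    """Turn a raw value into a space-separated pattern string.

    Tokens listed in `keep` (a hypothetical choice of literals) stay as-is;
    every other token is replaced by its class name.
    """
    tokens = TOKEN_RE.findall(value)
    return " ".join(t if t in keep else classify(t) for t in tokens)


def matches(value, learned_patterns):
    """Check whether a candidate value fits one of the learned patterns."""
    return to_pattern(value) in learned_patterns
```

With these assumptions, to_pattern("mailto:John@example42.org") yields "mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE" and to_pattern("1985-07-23") yields "MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER". Shrinking or growing the keep set trades specific patterns (a literal "@") for more general ones (PUNCTUATION), which is the kind of variation visible between the email patterns above.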
9. Extraction Phase
Clean the HTML
DOM sub-trees
CSS class co-occurrence
Value Constraints
Pattern Detection
CSS Selectors
10. Extraction Phase
Clean the HTML
DOM sub-trees
CSS class co-occurrence
Value Constraints
Pattern Detection
Elements Qualification
CSS Selectors
11. Extraction Phase
Clean the HTML
DOM sub-trees
CSS class co-occurrence
Value Constraints
Pattern Detection
Elements Qualification
Models Validation
CSS Selectors
RDF Model
From μRaptor
RDF Model
Test set
? = 0.94 = 0.7 = 0.8
13. We made the discovery of the new μRaptor species, and I am very pleased that some researchers helped us understand its feeding habits
Godzilla is a doll compared to μRaptor! I am currently working on a script for an upcoming movie
As a kid I always wanted to see an actual dinosaur. Today my dream comes true
Damn, he is better than me!