Processing XML
A rewriting system approach
Alberto Simões
alberto.simoes@eu.ipp.pt
Portuguese Perl Workshop – 2010
Alberto Simões Processing XML: a rewriting system approach
Motivation and Goals
XML is usually generated from structured information:
databases, spreadsheets, forms, etc.
but it can also be generated from unstructured
(or poorly structured) data:
textual documents, domain-specific languages.
The question arises:
How to produce XML documents from textual documents?
write a parser (natural language, domain-specific, etc.);
or produce XML by rewriting the textual document!
How does textual rewriting work?
write rewriting rules:
rule ≅ pattern × restriction × action
pattern: a regular (or irregular) expression that should
be textually matched;
restriction: conditional code that checks whether the rule
should be applied;
action: a piece of code (or simply a string) that
produces text that should replace the
originally matched text;
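To make the triple concrete, here is a minimal sketch in plain Perl. The hash layout and the name apply_rule are assumptions made for this illustration, not the internals of any module, and only the first match of the pattern is considered.

```perl
use strict;
use warnings;

# A rule as a triple (hypothetical hash layout, for illustration only).
# Apply it once: returns the new text and a flag saying whether it fired.
sub apply_rule {
    my ($rule, $text) = @_;
    return ($text, 0) unless $text =~ $rule->{pattern};
    my ($from, $len) = ($-[0], $+[0] - $-[0]);       # span of the match
    my $match = substr($text, $from, $len);
    return ($text, 0) unless $rule->{restriction}->($match);
    my $action = $rule->{action};
    # The action is either code to run on the match or a plain string.
    my $new = ref($action) eq 'CODE' ? $action->($match) : $action;
    substr($text, $from, $len, $new);                # replace the matched text
    return ($text, 1);
}

# Example rule: double a number, but only if it is below 100.
my %double_small = (
    pattern     => qr/\d+/,
    restriction => sub { $_[0] < 100 },
    action      => sub { $_[0] * 2 },
);

my ($out, $fired) = apply_rule(\%double_small, 'price 21 eur');
print "$out\n";    # price 42 eur
```

With 'price 500 eur' the pattern still matches, but the restriction rejects the match and the text is returned unchanged.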
Are there text rewriting tools?
For this work we used Text::RewriteRules:
written in Perl:
the power of Perl's regular expression engine;
Reflexive language (code can be generated on the fly);
supports different rewriting approaches:
Fixed-point rewriting approach;
Sliding-cursor rewriting approach;
Lexical analyzer approach;
home-developed;
Fixed-point rewriting approach
Algorithm
easy to understand;
a sequence of rules that are applied in order;
the first rule is applied, and the following rules are only applied if
no previous rule can be applied;
a rule may change the document in such a way
that a previous rule can be applied again;
the process ends when there are no rules that can be
applied (or if a specific rule forces the system to end);
Code example: anonymization of emails
RULES anonymize
\w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
ENDRULES
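The algorithm above can be sketched in plain Perl as follows. This is an illustration under assumptions, not the code that Text::RewriteRules generates: the function name rewrite_fixed_point and the [pattern, action] rule layout are made up here, and nothing protects against rule sets that never reach a fixed point.

```perl
use strict;
use warnings;

# Each rule is [ pattern, action ]; the action receives the matched text.
sub rewrite_fixed_point {
    my ($text, @rules) = @_;
    RESTART: while (1) {
        for my $rule (@rules) {
            my ($pattern, $action) = @$rule;
            # A rule fired: start again from the first rule.
            next RESTART if $text =~ s/($pattern)/$action->($1)/e;
        }
        last;    # no rule applies any more: fixed point reached
    }
    return $text;
}

# The e-mail anonymization rule from the slide.
my @anonymize = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, sub { '[[hidden email]]' } ],
);

# A later rule (tab to two spaces) makes an earlier rule applicable again.
my @tabs = (
    [ qr/  /, sub { ' ' }  ],
    [ qr/\t/, sub { '  ' } ],
);

print rewrite_fixed_point('write to john.doe@example.org today', @anonymize), "\n";
print rewrite_fixed_point("a\tb", @tabs), "\n";    # a b
```

The second call shows the restart behaviour: the tab rule fires first, and only then can the space-collapsing rule, which comes earlier in the list, be applied.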
Sliding-cursor rewriting approach
Algorithm
the cursor is placed at the beginning of the string;
patterns are matched only if they occur right after the cursor;
if a rule is applied, the cursor is placed after the rewritten region;
if no rule matches, the cursor moves ahead one character;
the process ends when the cursor reaches the end of the string;
it will never rewrite text that was already rewritten.
Code example: brute force translation
RULES/m translate
(\w+)=e=> $translation{$1} !! exists($translation{$1})
ENDRULES
Example
_ latest train
último _ train
último combóio _
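A minimal plain-Perl sketch of the sliding-cursor idea, using \G and pos() as the cursor. The name rewrite_sliding and the [pattern, action, restriction] layout are assumptions for this illustration, and the module may well implement the cursor differently.

```perl
use strict;
use warnings;
use utf8;

# Each rule is [ pattern, action, restriction ]; the restriction is optional.
sub rewrite_sliding {
    my ($text, @rules) = @_;
    my $out = '';
    pos($text) = 0;                                   # cursor at the start
    while (pos($text) < length $text) {
        my $start   = pos($text);
        my $applied = 0;
        for my $rule (@rules) {
            my ($pattern, $action, $restriction) = @$rule;
            # \G anchors the match at the cursor; /c keeps pos on failure.
            next unless $text =~ /\G($pattern)/gc;
            my $match = $1;
            if (length($match) && (!$restriction || $restriction->($match))) {
                $out .= $action->($match);            # rewritten text is never revisited
                $applied = 1;
                last;
            }
            pos($text) = $start;                      # rule rejected: restore the cursor
        }
        next if $applied;
        $out .= substr($text, $start, 1);             # no rule: copy one character
        pos($text) = $start + 1;
    }
    return $out;
}

# The translation example from the slide.
my %translation = (latest => 'último', train => 'combóio');
my @translate = (
    [ qr/\w+/, sub { $translation{ $_[0] } }, sub { exists $translation{ $_[0] } } ],
);

binmode STDOUT, ':encoding(UTF-8)';
print rewrite_sliding('latest train', @translate), "\n";    # último combóio
```

Because the output is built separately from the input, a replacement can never be matched again, which is the property stated in the last bullet.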
Valid Rewriting Rules
Different approaches support different rules...
but the most relevant ones are:
==> simple pattern substitution: the left-hand side is
a Perl regular expression and the right-hand side
is the string that will replace the match;
=e=> similar to the previous one, but the right-hand side
is Perl code to be evaluated; its result
replaces the match;
=begin=> has no left-hand side; the right-hand side code is
executed before the rewrite starts;
=end=> has no right-hand side; when the left-hand side
pattern matches, the rewrite system quits.
Rules can include a restriction block (!!) to the right of the action.
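As a rough analogy only, the roles of =begin=>, =end=> and =e=> can be mimicked in plain Perl. The function name and the TODO/DONE markers below are invented for this sketch, and it does not claim to show how the module compiles its rules.

```perl
use strict;
use warnings;

# Number the first three TODO markers, then stop rewriting.
sub mark_first_three {
    my ($text) = @_;
    my $count = 0;                                        # like =begin=>: set-up code, run once
    while (1) {
        last if $text =~ /\bDONE3\b/;                     # like =end=>: a match quits the system
        next if $text =~ s/\bTODO\b/'DONE' . ++$count/e;  # like =e=>: the replacement is evaluated code
        last;                                             # nothing applied: stop
    }
    return $text;
}

print mark_first_three('TODO a TODO b TODO c TODO d'), "\n";
# DONE1 a DONE2 b DONE3 c TODO d
```

The fourth TODO survives because the end pattern matches before the substitution rule gets another chance.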
Rewriting Text into XML
How to produce XML from weakly structured data?
write a parser;
or rewrite the data step-by-step into XML!
Two case studies:
Rewriting a dictionary in textual format into TEI;
Rewriting an XML DSL authoring tool into XML;
Rewriting Text into TEI
Rewrite this...
*Cachimbo*,
_m._
Apparelho de fumador, composto d..
Peça de ferro, em que entra o es..
Buraco, em que se encaixa a vela..
* _Bras. de Pernambuco._
Bebida, preparada com aguardente..
* _Pl. Gír._
Pés.
(Do químb. _quixima_)
...into this!
<entry id="cachimbo">
<form><orth>Cachimbo</orth></form>
<sense>
<gramGrp>m.</gramGrp>
<def>
Apparelho de fumador, composto d..
Peça de ferro, em que entra o es..
Buraco, em que se encaixa a vela..
</def>
</sense>
<sense ast="1">
<usg type="geo">Bras. de Pernamb..
<def>
Bebida, preparada com aguardente..
</def>
</sense>
<sense ast="1"><gramGrp>Pl.</gra..
<usg type="style">Gír.</usg>
<def>
Pés.
</def>
</sense>
<etym ori="químb">(Do químb. _qu..
Rewriting Text into TEI
This rewrite was all based on:
a few tables (grammatical and usage strings);
entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit
rewriting the little mark-up there is into a better XML structure;
((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_
rewriting the new XML structure to detect and annotate a
more complex structure;
<gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g
detecting and correcting wrong XML elements.
</form></sense>==></form>
</form></def>\n</sense>==></form>
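A small plain-Perl sketch of the first two steps (tables plus mark-up rewriting), reproducing only what the before/after example shows for the headword line and for simple labels. The function names are hypothetical, the fallback to gramGrp for any label outside the genre table is an assumption of this sketch, and geographic labels such as "Bras. de Pernambuco." are not handled.

```perl
use strict;
use warnings;
use utf8;

# Usage labels taken from the genre table on the slide.
my %is_genre = map { $_ => 1 } qw(Gír Fam Pop Des Fig Vulg Ant Chul Euph);

# XML for a headword, following the "Cachimbo" example.
sub headword_xml {
    my ($word) = @_;
    return sprintf(qq{<entry id="%s">\n<form><orth>%s</orth></form>}, lc($word), $word);
}

# "*Cachimbo*," becomes the entry opening plus its <form>.
sub rewrite_headword {
    my ($line) = @_;
    $line =~ s{^\*([^*]+)\*,\s*\z}{ headword_xml($1) }e;
    return $line;
}

# One label such as "Gír." or "m."; non-genre labels fall back to gramGrp.
sub label_xml {
    my ($label) = @_;
    (my $key = $label) =~ s/\.$//;                    # table keys carry no final dot
    return $is_genre{$key}
        ? qq{<usg type="style">$label</usg>}
        : qq{<gramGrp>$label</gramGrp>};
}

# "_Pl. Gír._" becomes one element per label, one per line.
sub rewrite_labels {
    my ($line) = @_;
    $line =~ s{_([^_]+)_}{ join "\n", map { label_xml($_) } split ' ', $1 }ge;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print rewrite_headword('*Cachimbo*,'), "\n";
print rewrite_labels('_Pl. Gír._'), "\n";
```

Keeping the tables as plain hashes means new labels can be added without touching the rewriting code.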
Rewriting Text into TEI
Case study conclusions:
flexible tool;
works on big files:
Text file is 13 MB;
Output XML is 30 MB;
Process takes about nine minutes!
we even rewrote XML into XML.
Hey!! XML is text!!
How can we rewrite it!?
Rewriting XML
different from the usual DOM- or SAX-oriented approaches;
looks at XML as text, as unstructured data;
rewriting can be done:
as in any other text rewriting system;
taking advantage of irregular expressions.
Irregular expressions? Are you kidding?
Not so regular expressions
Perl has a powerful regular expression engine:
regular expressions can define capture zones:
small pieces of the match that can be used later;
regular expressions can define look-ahead or look-behind:
check the context of the matching zone;
since Perl 5.10, regular expressions can be recursive:
a regular expression can refer to itself.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
For XML, we defined two classes:
[[:XML:]] matches any well-formed XML fragment;
[[:XML(tag):]] matches an XML fragment with a specific
root element;
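In the spirit of [[:XML(tag):]], the same recursion trick can match one element with a given name. This is a simplified sketch and not the pattern the module actually uses: the helper xml_fragment_re is hypothetical, it balances only elements with the same name, and it assumes there are no comments, CDATA sections or ">" characters inside attribute values.

```perl
use strict;
use warnings;

# Build a pattern for one element with the given name, with balanced
# nesting of same-named elements (simplified, see the assumptions above).
sub xml_fragment_re {
    my ($tag) = @_;
    my $t = quotemeta $tag;
    return qr{
        (                                   # the group that recurses into itself
            <${t}\b [^<>]* />               # an empty element
          |
            <${t}\b [^<>]* >                # start tag
            (?: (?-1)                       # a nested element with the same name
              | (?! </? ${t} \b ) .         # or any character not starting such a tag
            )*+
            </${t} \s* >                    # end tag
        )
    }xs;
}

my $re  = xml_fragment_re('tu');
my $doc = '<tmx><tu id="1"><seg>a <tu>nested</tu></seg></tu><tu/></tmx>';
my @found = $doc =~ /$re/g;
print scalar(@found), " fragments\n";    # 2 fragments
```

The relative reference (?-1) keeps working when the compiled pattern is embedded in a larger expression with other capture groups.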
Rewriting XML
As a simple example, we can remove duplicate translation units
in a translation memory file:
Code example
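use Text::RewriteRules;
use XML::DT;                 # provides dtstring and $c
use Digest::MD5 'md5';
my %visited;                 # digests of the translation units seen so far
RULES/m duplicates
([[:XML(tu):]])==>!!duplicate($1)
ENDRULES
# True if a translation unit with the same text content was already seen.
sub duplicate {
    my $tu = shift;
    my $tumd5 = md5(dtstring($tu,
        -default => sub{$c}));
    return 1 if exists $visited{$tumd5};
    $visited{$tumd5}++;
    return 0;
}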
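For readers without the module at hand, the same idea can be sketched with core Perl only. This is an approximation under assumptions: the function names are hypothetical, the units are found with a simple non-nested regular expression instead of [[:XML(tu):]], and stripping tags with a substitution stands in for dtstring.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# True when the text content of this unit was already seen.
sub is_duplicate {
    my ($tu, $visited) = @_;
    (my $content = $tu) =~ s/<[^>]*>//g;       # drop the tags, keep the text
    $content =~ s/\s+/ /g;                      # normalise whitespace
    my $digest = md5_hex(encode_utf8($content));
    return $visited->{$digest}++ ? 1 : 0;
}

# Remove every unit whose content repeats an earlier one.
sub drop_duplicate_units {
    my ($tmx) = @_;
    my %visited;
    $tmx =~ s{(<tu\b[^>]*>.*?</tu>)}{ my $unit = $1; is_duplicate($unit, \%visited) ? '' : $unit }gse;
    return $tmx;
}

my $tmx = '<body><tu><seg>Hello</seg></tu><tu id="2"><seg>Hello</seg></tu>'
        . '<tu><seg>Bye</seg></tu></body>';
print drop_duplicate_units($tmx), "\n";
# <body><tu><seg>Hello</seg></tu><tu><seg>Bye</seg></tu></body>
```

Hashing the content keeps memory use proportional to the number of distinct units rather than to their size.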
Conclusions
The rewriting approach is:
flexible;
powerful;
easy to learn;
grows quickly;
big systems can be difficult to maintain;
The Perl regular expression engine:
makes it easy to match anything;
almost supports full grammars;
makes it possible to define block structures;
So, it can be applied to XML easily!
Thank you
Thank You!
Alberto Simões
alberto.simoes@eu.ipp.pt