Processing XML
A rewriting system approach

           Alberto Simões
 alberto.simoes@eu.ipp.pt


  Portuguese Perl Workshop – 2010




       Alberto Simões   Processing XML: a rewriting system approach
Motivation and Goals


     XML is usually generated from structured information:
         databases, spreadsheets, forms, etc.

     but it can also be generated from unstructured
     (or poorly structured) data:
         textual documents, domain specific languages;

 The question arises:
 How to produce XML documents from textual documents?
     write a parser (natural language, domain specific, etc);
     produce XML by rewriting the textual document!




How does textual rewriting work?


    write rewriting rules:

                rule ≅ pattern × restriction × action


         pattern a regular (or irregular) expression that should
                 be textually matched;
      restriction conditional code that checks whether the rule
                  should be applied;
          action a piece of code (or simply a string) that
                 produces text that should replace the
                 originally matched text;



Are there text rewriting tools?


 For this work we used Text::RewriteRules:

     written in Perl:
          Perl regular expression engine power;
          reflective language (code can be generated on the fly);
     supports different rewriting approaches:
          Fixed-point rewriting approach;
          Sliding-cursor rewriting approach;
          Lexical analyzer approach;

     developed in-house;




Fixed-point rewriting approach

 Algorithm
     easy to understand;
     a sequence of rules that are applied in order;
     the first rule is applied, and each following rule is applied
     only if no previous rule can be applied;
     a rule may change the document in such a way
     that a previous rule can be applied again;
     the process ends when no rule can be
     applied (or when a specific rule forces the system to end);

 Code example: anonymization of emails
   RULES anonymize
   \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
   ENDRULES
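
 The fixed-point strategy described above can be sketched in plain Perl, without Text::RewriteRules. This is only an illustration of the idea, not the module's implementation: the function name fixed_point_rewrite, the rule list layout, and the step limit are assumptions made for this example.

```perl
use strict;
use warnings;

# Each rule is [pattern, replacement]; rules are tried in order.
my @rules = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, '[[hidden email]]' ],
);

# Apply the first rule that matches, then start again from the first rule.
# Stop when no rule matches (the fixed point) or after $max_steps rewrites.
sub fixed_point_rewrite {
    my ($text, $rules, $max_steps) = @_;
    $max_steps //= 10_000;    # hypothetical safety cap against endless rewriting
    STEP: for (1 .. $max_steps) {
        for my $rule (@$rules) {
            my ($pattern, $replacement) = @$rule;
            next STEP if $text =~ s/$pattern/$replacement/;
        }
        last;                 # no rule applied: fixed point reached
    }
    return $text;
}

print fixed_point_rewrite('Mail ana@example.org or rui.s@example.com', \@rules), "\n";
```

 The replacement text contains no "@", so the rule cannot match its own output and the loop stops once every address has been replaced.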


Sliding-cursor rewriting approach
 Algorithm
     the cursor is placed at the beginning of the string;
     patterns are matched if they occur right after the cursor;
     if a rule is applied, the cursor is placed after that region;
     if no rule matches, the cursor moves ahead one character;
     process ends when cursor reaches the end of the string;
     it will never rewrite text that was already rewritten.
 Code example: brute force translation
 RULES/m translate
 (\w+)=e=> $translation{$1} !! exists($translation{$1})
 ENDRULES

 Example
   _ latest train
   último _ train
   último combóio _
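
 The cursor movement shown in the example can be sketched in plain Perl. This is a rough illustration under stated assumptions, not how Text::RewriteRules is implemented: the function name sliding_cursor_translate and the two-entry %translation table are hypothetical.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical two-word dictionary, for illustration only.
my %translation = (latest => 'último', train => 'combóio');

sub sliding_cursor_translate {
    my ($text) = @_;
    my $out = '';
    my $pos = 0;                              # cursor at the beginning of the string
    while ($pos < length $text) {
        my $rest = substr($text, $pos);       # text right after the cursor
        if ($rest =~ /^(\w+)/ && exists $translation{$1}) {
            $out .= $translation{$1};         # rule applied: emit the rewritten text
            $pos += length $1;                # cursor jumps past the matched region
        }
        else {
            $out .= substr($text, $pos, 1);   # no rule applies: copy one character
            $pos++;                           # and move the cursor ahead
        }
    }
    return $out;
}

print sliding_cursor_translate('latest train'), "\n";
```

 Rewritten text is appended to the output and never scanned again, which is why this strategy cannot rewrite text that was already rewritten.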
Valid Rewriting Rules

 Different approaches allow different kinds of rules,
 but the most relevant ones are:
         ==> simple pattern substitution: the left-hand side is
             a Perl regular expression and the right-hand side
             is the string that will replace the match;
        =e=> similar to the previous one, but the right-hand side
             is Perl code to be evaluated. The result is
             used to replace the match;
  =begin=> has no left-hand side; the right-hand side code is
           executed before the rewrite starts;
     =end=> has no right-hand side; when the left-hand side
            pattern matches, the rewrite system stops;
 Rules can include a restriction block (!!) to the right of the action.


Rewriting Text into XML



 How to produce XML from weakly structured data?
     write a parser;
     or rewrite the data step-by-step into XML!



 Two case studies:
     Rewriting a dictionary in textual format into TEI;
     Rewriting an XML DSL authoring tool into XML;




Rewriting Text into TEI
 Rewrite this...

   *Cachimbo*,
   _m._
   Apparelho de fumador, composto d..
   Peça de ferro, em que entra o es..
   Buraco, em que se encaixa a vela..
   * _Bras. de Pernambuco._
   Bebida, preparada com aguardente..
   * _Pl. Gír._
   Pés.
   (Do químb. _quixima_)

 ...into this!

   <entry id="cachimbo">
   <form><orth>Cachimbo</orth></form>
   <sense>
   <gramGrp>m.</gramGrp>
   <def>
   Apparelho de fumador, composto d..
   Peça de ferro, em que entra o es..
   Buraco, em que se encaixa a vela..
   </def>
   </sense>
   <sense ast="1">
   <usg type="geo">Bras. de Pernamb..
   <def>
   Bebida, preparada com aguardente..
   </def>
   </sense>
   <sense ast="1"><gramGrp>Pl.</gra..
   <usg type="style">Gír.</usg>
   <def>
   Pés.
   </def>
   </sense>
   <etym ori="químb">(Do químb. _qu..
Rewriting Text into TEI

 This rewrite was all based on:
     a few tables (grammatical and usage strings);
          entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
          entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit

     rewriting the little existing mark-up into a better XML structure;
 ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_


     rewriting the new XML structure to detect and annotate a
     more complex structure;
 <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g


     detecting and correcting wrong XML elements.
 </form></sense>==></form>
 </form></def>\n</sense>==></form>


Rewriting Text into TEI

 Case study conclusions:
     flexible tool;
     works on big files:
         Text file is 13 MB;
         Output XML is 30 MB;
         Process takes about nine minutes!
     we even rewrote XML into XML.



                Hey!! XML is text!!
              How can we rewrite it!?

Rewriting XML



    different from the usual DOM or SAX oriented approaches;
    looks at XML as text, that is, as unstructured data;
    the rewrite can be done:
        as in any other text rewriting system;
        taking advantage of irregular expressions.



     Irregular expressions? Are you kidding?




Not so regular expressions

 Perl has a powerful regular expression engine:
     regular expressions can define capture groups:
     small pieces of the match that can be used later;
     regular expressions can define look-ahead or look-behind:
     check the context of the matching zone;
     since Perl 5.10, regular expressions can be recursive:
     a regular expression that depends on itself.
     my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
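
 As a small usage sketch of the recursive pattern above (Perl 5.10 or later), the following collects every balanced parenthesised group in a string. The sample text and the variable names other than $parens are made up for this illustration.

```perl
use strict;
use warnings;

# Group 1 matches "(", then any mix of non-parenthesis runs or a nested
# balanced group (the (?-1) recursion back into group 1), then ")".
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my $text = 'f(a, g(b, (c)), d) + h(e';

# Collect each outermost balanced group; the unclosed "(e" never matches.
my @balanced;
while ($text =~ /$parens/g) {
    push @balanced, $1;
}

print "$_\n" for @balanced;
```

 A plain, non-recursive regular expression cannot express this kind of arbitrarily deep nesting, and XML elements need exactly that.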

 For XML, we defined two classes:
 [[:XML:]] matches any well-formed XML fragment;
 [[:XML(tag):]] matches an XML fragment with a specific
          root element;
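
 The slides do not show how these classes are implemented. As a rough idea only, a class such as [[:XML(tag):]] could be approximated with a recursive pattern like the one below. The helper name xml_element_re is hypothetical, and the sketch deliberately ignores comments, CDATA sections, and ">" characters inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a recursive pattern for one element named $tag.
sub xml_element_re {
    my ($tag) = @_;
    my $t = quotemeta $tag;
    return qr{
        (                               # group 1: the whole element
          <$t(?=[\s/>])[^>]*/>          # self-closing form
        |
          <$t(?=[\s/>])[^>]*(?<!/)>     # opening tag
          (?: (?-1)                     # a nested element with the same name
            | (?!</?$t(?=[\s/>])) .     # or one character that starts no such tag
          )*+
          </$t\s*>                      # closing tag
        )
    }xs;
}

my $re  = xml_element_re('tu');
my $doc = '<body><tu id="1"><seg>a</seg><tu>inner</tu></tu><tu/><tuv>x</tuv></body>';

my @found;
push @found, $1 while $doc =~ /$re/g;

print "$_\n" for @found;
```

 The look-ahead after the tag name keeps "tu" from matching inside "tuv", and the recursion lets an element contain further elements with the same name.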


Rewriting XML

 As a simple example, we can remove duplicate translation units
 in a translation memory file:
 Code example
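 use Text::RewriteRules;
 use XML::DT;               # provides dtstring
 use Digest::MD5 qw(md5);   # provides md5

 my %visited;               # digests of the units seen so far

 RULES/m duplicates
 ([[:XML(tu):]])==>!!duplicate($1)
 ENDRULES

 # True when an identical translation unit was already seen.
 sub duplicate {
   my $tu = shift;
   # Digest of the unit's contents, with the mark-up dropped.
   my $tumd5 = md5(dtstring($tu,
                            -default => sub{$c}));
   return 1 if exists $visited{$tumd5};
   $visited{$tumd5}++;
   return 0;
 }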


Conclusions


    The rewriting approach is:
        flexible;
        powerful;
        easy to learn;
        grows quickly;
        big systems can be difficult to maintain;
    The Perl regular expression engine:
        makes it easy to match anything;
        almost supports full grammars;
        makes it possible to define block structures;

    So, it can be applied to XML easily!




Thank you




               Thank You!



              Alberto Simões
        alberto.simoes@eu.ipp.pt




Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de Informação
 
Building C and C++ libraries with Perl
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with Perl
 
PLN em Perl
PLN em PerlPLN em Perl
PLN em Perl
 
Classification Systems
Classification SystemsClassification Systems
Classification Systems
 
Redes de Pert
Redes de PertRedes de Pert
Redes de Pert
 
Dancing Tutorial
Dancing TutorialDancing Tutorial
Dancing Tutorial
 
Sistemas de Numeração
Sistemas de NumeraçãoSistemas de Numeração
Sistemas de Numeração
 
Álgebra de Boole
Álgebra de BooleÁlgebra de Boole
Álgebra de Boole
 
Arquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução Automática
 
Extracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução Automática
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Processing XML: a rewriting system approach

  • 1. Processing XML A rewriting system approach Alberto Simões alberto.simoes@eu.ipp.pt Portuguese Perl Workshop – 2010 Alberto Simões Processing XML: a rewriting system approach
  • 2. Motivation and Goals XML is usually generated from structured information: databases, spreadsheets, forms, etc. But it can also be generated from unstructured (or poorly structured) data: textual documents, domain-specific languages. The question arises: how to produce XML documents from textual documents? Either write a parser (natural language, domain specific, etc.), or produce XML by rewriting the textual document!
  • 7. How does textual rewriting work? Write rewriting rules: rule ≅ pattern × restriction × action. pattern: a regular (or irregular) expression that should be textually matched; restriction: conditional code that checks whether the rule should be applied; action: a piece of code (or simply a string) that produces the text that replaces the originally matched text.
  • 11. Are there text rewriting tools? For this work we used Text::RewriteRules: written in Perl (the power of the Perl regular expression engine; a reflective language, so code can be generated on the fly); supports different rewriting approaches (fixed-point rewriting; sliding-cursor rewriting; lexical analyzer); home-developed.
  • 15. Fixed-point rewriting approach Algorithm: easy to understand; a sequence of rules is applied in order; the first rule is tried first, and a later rule is applied only if no earlier rule can be applied; a rule may change the document in such a way that an earlier rule becomes applicable again; the process ends when no rule can be applied (or when a specific rule forces the system to stop). Code example: anonymization of emails RULES anonymize \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]] ENDRULES
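The fixed-point loop described on this slide can be sketched in plain Perl, without Text::RewriteRules. This is only an illustration of the algorithm, not the module's implementation: the `fixed_point_rewrite` function and the `[pattern, action]` rule layout are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Each rule is [ pattern, action ]; this layout is hypothetical and is
# not the Text::RewriteRules rule syntax.
my @rules = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, sub { '[[hidden email]]' } ],
);

# Try the rules in order. As soon as one rewrites the text, start again
# from the first rule. Stop when no rule matches (the fixed point).
sub fixed_point_rewrite {
    my ($text, @rules) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            my ($pattern, $action) = @$rule;
            if ($text =~ s/$pattern/$action->()/e) {
                $changed = 1;
                last;    # restart from the first rule
            }
        }
    }
    return $text;
}

print fixed_point_rewrite('Mail ana@example.org or rui.silva@example.com', @rules), "\n";
```

The loop terminates here because the replacement text contains no `@`, so the rule cannot match its own output. A rule whose output matches its own pattern would loop forever, which is the main thing to watch for with this approach.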
  • 17. Sliding-cursor rewriting approach Algorithm: the cursor is placed at the beginning of the string; patterns are matched only if they occur right after the cursor; if a rule is applied, the cursor is placed after the rewritten region; if no rule matches, the cursor moves ahead one character; the process ends when the cursor reaches the end of the string; text that was already rewritten is never rewritten again. Code example: brute-force translation RULES/m translate (\w+)=e=> $translation{$1} !! exists($translation{$1}) ENDRULES Example (the cursor is shown as _): "_ latest train", then "último _ train", then "último combóio _".
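The sliding-cursor steps can also be sketched in plain Perl. The function name `sliding_cursor_translate` and the two-word table are hypothetical, chosen to mirror the slide's example. The sketch follows the algorithm as listed on the slide and is not taken from the Text::RewriteRules source.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical translation table, mirroring the example on the slide.
my %translation = (latest => 'último', train => 'combóio');

# Single rule: a word right after the cursor is replaced when the
# restriction (the word exists in the table) holds.
sub sliding_cursor_translate {
    my ($text, $dict) = @_;
    my ($out, $cursor) = ('', 0);
    while ($cursor < length $text) {
        my $rest = substr $text, $cursor;
        if ($rest =~ /^(\w+)/ && exists $dict->{$1}) {
            $out    .= $dict->{$1};     # rewritten text is never revisited
            $cursor += length $1;       # cursor jumps past the matched region
        }
        else {
            $out    .= substr $rest, 0, 1;   # no rule applies: copy one character
            $cursor += 1;
        }
    }
    return $out;
}

print sliding_cursor_translate('latest train', \%translation), "\n";
```

Because the output is built up behind the cursor, a translation that happens to be a key in the table is not translated a second time. That is the property the slide describes as never rewriting text that was already rewritten.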
  • 20. Valid Rewriting Rules Different approaches support different rules, but the most relevant ones are: ==> simple pattern substitution: the left-hand side is a Perl regular expression and the right-hand side is the string that replaces the match; =e=> similar to the previous one, but the right-hand side is Perl code to be evaluated, and its result replaces the match; =begin=> has no left-hand side: the right-hand side code is executed before the rewrite starts; =end=> has no right-hand side: when the left-hand side pattern matches, the rewrite system stops. Rules can include a restriction block (!!) to the right of the action.
  • 26. Rewriting Text into XML How to produce XML from weakly structured data? Write a parser, or rewrite the data step by step into XML! Two case studies: rewriting a dictionary in textual format into TEI; rewriting an XML DSL authoring tool into XML.
  • 28. Rewriting Text into TEI Rewrite this: *Cachimbo*, _m._ Apparelho de fumador, composto d.. Peça de ferro, em que entra o es.. Buraco, em que se encaixa a vela.. * _Bras. de Pernambuco._ Bebida, preparada com aguardente.. * _Pl. Gír._ Pés. (Do químb. _quixima_) Into this: <entry id="cachimbo"> <form><orth>Cachimbo</orth></form> <sense> <gramGrp>m.</gramGrp> <def> Apparelho de fumador, composto d.. Peça de ferro, em que entra o es.. Buraco, em que se encaixa a vela.. </def> </sense> <sense ast="1"> <usg type="geo">Bras. de Pernamb.. <def> Bebida, preparada com aguardente.. </def> </sense> <sense ast="1"><gramGrp>Pl.</gra.. <usg type="style">Gír.</usg> <def> Pés. </def> </sense> <etym ori="químb">(Do químb. _qu..
  • 29. Rewriting Text into TEI This rewrite was all based on: a few tables (grammatical and usage strings); entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph; entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit. Rewrite the little existing mark-up into a better XML structure: ((\* )?_([^_]|_[^_]{1,5}_)+_( \*)?)\n=e=>$a=$1;end_def.end_sense.start_ Rewrite the new XML structure to detect and annotate a more complex structure: <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g Detect and correct wrong XML elements: </form></sense>==></form> </form></def>\n</sense>==></form>
  • 33. Rewriting Text into TEI Case study conclusions: flexible tool; works on big files: the text file is 13 MB, the output XML is 30 MB, and the process takes about nine minutes! We even rewrote XML into XML. Hey!! XML is text!! How can we rewrite it!?
  • 35. Rewriting XML Different from the usual DOM- or SAX-oriented approaches; looks at XML as text, that is, as unstructured data; the rewrite can be done as in any other text rewriting system, taking advantage of irregular expressions. Irregular expressions? Are you kidding?
  • 39. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones (small pieces of the match that can be used later); regular expressions can define look-ahead or look-behind (checking the context of the matching zone); since Perl 5.10, regular expressions can be recursive (a regular expression that refers to itself). my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; For XML, we defined two classes: [[:XML:]] matches any well-formed XML fragment; [[:XML(tag):]] matches an XML fragment with a specific root element.
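To make the recursion concrete, here is a small runnable sketch. The `$parens` pattern is the balanced-parentheses example from the Perl documentation. `element_re` is a hypothetical helper that applies the same idea to a single element name. It is a simplification that ignores comments, CDATA sections, self-closing tags and attribute values containing `>`, and it is not how `[[:XML(tag):]]` is implemented.

```perl
use strict;
use warnings;

# Balanced parentheses: (?-1) recurses into the capture group that
# encloses it, so nested parentheses are matched to any depth.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Hypothetical helper: a pattern for one element with the given name,
# including nested elements of the same name.
sub element_re {
    my ($tag) = @_;
    my $t = quotemeta $tag;
    # Content is either any character that does not start an opening or
    # closing tag of this name, or a nested element (the recursion).
    return qr{(<$t\b[^>]*>(?:(?:(?!</?$t\b).)++|(?-1))*+</$t>)}s;
}

my $re = element_re('b');
if ('say <b>x<b>y</b>z</b> now' =~ $re) {
    print "matched: $1\n";
}
if ('f(a(b)c) d' =~ $parens) {
    print "matched: $1\n";
}
```

Relative recursion with `(?-1)` keeps the pattern reusable: it still points at its own group when the compiled pattern is interpolated into a larger expression with other capture groups before it.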
  • 45. Rewriting XML As a simple example, we can remove duplicate translation units from a translation memory file. Code example: RULES/m duplicates ([[:XML(tu):]])==>!!duplicate($1) ENDRULES sub duplicate { my $tu = shift; my $tumd5 = md5(dtstring($tu, -default => sub{$c})); return 1 if exists $visited{$tumd5}; $visited{$tumd5}++; return 0; }
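The same duplicate-removal idea can be sketched with core Perl only. This version assumes that `<tu>` elements do not nest, so a non-greedy match is enough. It normalises whitespace instead of calling `dtstring`, which is a simpler and not equivalent way to build the comparison key. The name `remove_duplicate_units` is hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Drop every <tu> element whose whitespace-normalised text was seen before.
sub remove_duplicate_units {
    my ($tmx) = @_;
    my %visited;
    $tmx =~ s{(<tu\b[^>]*>.*?</tu>)(\s*)}{
        my ($unit, $space) = ($1, $2);      # copy before the inner s/// resets them
        (my $key = $unit) =~ s/\s+/ /g;     # ignore layout differences
        $visited{ md5_hex(encode_utf8($key)) }++ ? '' : $unit . $space;
    }gse;
    return $tmx;
}

print remove_duplicate_units("<tu>a</tu>\n<tu>b</tu>\n<tu>a</tu>\n");
```

Hashing the unit keeps the `%visited` table small even when translation units are long, at the cost of treating two units as equal only when their normalised text is identical.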
  • 46. Conclusions The rewriting approach is: flexible; powerful; easy to learn. But rule sets grow quickly, and big systems can be difficult to maintain. The Perl regular expression engine: makes it easy to match anything; almost supports full grammars; makes it possible to define block structures. So, it can be applied to XML easily!
  • 47. Thank you! Alberto Simões alberto.simoes@eu.ipp.pt