Processing XML
A rewriting system approach

           Alberto Simões
 alberto.simoes@eu.ipp.pt


  Portuguese Perl Workshop – 2010




       Alberto Simões   Processing XML: a rewriting system approach
Motivation and Goals


     XML is usually generated from structured information:
         databases, spreadsheets, forms, etc.

     but it can also be generated from unstructured
     (or poorly structured) data:
         textual documents, domain-specific languages;

 The question arises:
 How to produce XML documents from textual documents?
     write a parser (natural language, domain-specific, etc.);
     produce XML by rewriting the textual document!




How does textual rewriting work?


    write rewriting rules:

                rule ≅ pattern × restriction × action


         pattern a regular (or irregular) expression that should
                 be textually matched;
      restriction conditional code that checks whether the rule
                  should be applied;
          action a piece of code (or simply a string) that
                 produces text that should replace the
                 originally matched text;
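
A rough way to picture the triple in plain Perl, without Text::RewriteRules: the sketch below stores one rule as a hash holding a pattern, a restriction and an action, and applies it once. The hash layout, the `apply_rule` helper and the small `%translation` table are hypothetical names made up for this illustration; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical word table used by the restriction and the action.
my %translation = (black => 'preto', cat => 'gato');

# One rule as a (pattern, restriction, action) triple.
my $rule = {
    pattern     => qr/\w+/,
    restriction => sub { exists $translation{ $_[0] } },
    action      => sub { $translation{ $_[0] } },
};

# Apply the rule at most once: at the first position where the pattern
# matches AND the restriction holds, replace the match with the action's
# result. Returns the (possibly new) text and a flag telling if it fired.
sub apply_rule {
    my ($rule, $text) = @_;
    my $pattern = $rule->{pattern};
    my $scan    = $text;    # search a copy so the edit cannot disturb the scan
    while ($scan =~ /$pattern/g) {
        my ($start, $length) = ($-[0], $+[0] - $-[0]);
        my $matched = substr($scan, $start, $length);
        next unless $rule->{restriction}->($matched);
        substr($text, $start, $length, $rule->{action}->($matched));
        return ($text, 1);
    }
    return ($text, 0);
}

my ($result, $changed) = apply_rule($rule, 'the black cat');
print "$result\n";    # prints "the preto cat"
```

Here "the" matches the pattern but fails the restriction, so the rule moves on and fires on "black".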



Are there text rewriting tools?


 For this work we used Text::RewriteRules:

     written in Perl:
          the power of the Perl regular expression engine;
          reflective language (code can be generated on the fly);
     supports different rewriting approaches:
          Fixed-point rewriting approach;
          Sliding-cursor rewriting approach;
          Lexical analyzer approach;

     developed in-house;




Fixed-point rewriting approach

 Algorithm
     easy to understand;
     a sequence of rules that are applied in order;
     the first rule is applied, and each following rule is only
     applied if no previous rule can be applied;
     a rule may change the document in such a way
     that a previous rule can be applied again;
     the process ends when no rule can be
     applied (or when a specific rule forces the system to end);

 Code example: anonymization of emails
   RULES anonymize
   \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
   ENDRULES
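
The fixed-point loop described above can be sketched in plain Perl as follows. This is only an illustration of the control flow, not the code Text::RewriteRules generates; the `rewrite_fixed_point` function, the rule-list layout and the second (space-collapsing) rule are assumptions added for the example.

```perl
use strict;
use warnings;

# Hypothetical rule list: each entry is [pattern, action], tried in order.
my @rules = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, sub { '[[hidden email]]' } ],
    [ qr/ {2,}/,                             sub { ' ' } ],
);

sub rewrite_fixed_point {
    my ($text, @rules) = @_;
    my $fired = 1;
    while ($fired) {
        $fired = 0;
        for my $rule (@rules) {
            my ($pattern, $action) = @$rule;
            # Apply the first rule that matches, once, then start over
            # from the first rule again.
            if ($text =~ s/$pattern/$action->($&)/e) {
                $fired = 1;
                last;
            }
        }
    }
    return $text;    # no rule matches any more: the fixed point
}

print rewrite_fixed_point('write  to ana.silva@example.org', @rules), "\n";
# prints "write to [[hidden email]]"
```

The loop only terminates because no rule here produces text that a rule can match again; a rule set without that property would loop forever, which is the main thing to watch for with this approach.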


Sliding-cursor rewriting approach
 Algorithm
     the cursor is placed at the beginning of the string;
     patterns are matched if they occur right after the cursor;
     if a rule is applied, the cursor is placed after that region;
     if no rule matches, the cursor moves ahead one character;
     process ends when cursor reaches the end of the string;
     it will never rewrite text that was already rewritten.
 Code example: brute force translation
 RULES/m translate
 (\w+)=e=> $translation{$1} !! exists($translation{$1})
 ENDRULES

 Example
   _ latest train
   último _ train
   último combóio _
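
The cursor movement shown in the example can be approximated in plain Perl with an explicit offset and the `\G` anchor. This is a sketch of the algorithm as listed on the slide, not the module's implementation; `rewrite_sliding` and the word table are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical translation table.
my %translation = (black => 'preto', cat => 'gato');

sub rewrite_sliding {
    my ($text, $table) = @_;
    my $out    = '';
    my $cursor = 0;
    while ($cursor < length $text) {
        pos($text) = $cursor;    # \G anchors the match exactly at the cursor
        if ($text =~ /\G(\w+)/g && exists $table->{$1}) {
            $out   .= $table->{$1};    # emit the replacement ...
            $cursor = pos($text);      # ... and jump past the matched region
        }
        else {
            $out .= substr($text, $cursor, 1);    # copy one character as is
            $cursor++;
        }
    }
    return $out;
}

print rewrite_sliding('black cat!', \%translation), "\n";
# prints "preto gato!"
```

Because the output is built separately from the input, replaced text is never scanned again, which matches the last point of the algorithm.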
Valid Rewriting Rules

 Different approaches support different kinds of rules...
 but the most relevant ones are:
         ==> simple pattern substitution: the left-hand side is
             a Perl regular expression and the right-hand side
             is the string that will replace the match;
        =e=> similar to the previous one, but the right-hand side
             is Perl code to be evaluated. Its result is
             used to replace the match;
  =begin=> has no left-hand side; the right-hand side code is
           executed before the rewriting starts;
     =end=> has no right-hand side; when the left-hand side
            pattern matches, the rewrite system stops;
 Rules can include a restriction block (!!) to the right of the action.


Rewriting Text into XML



 How to produce XML from weakly structured data?
     write a parser;
     or rewrite the data step-by-step into XML!



 Two case studies:
     Rewriting a dictionary in textual format into TEI;
     Rewriting an XML DSL authoring tool into XML;




Rewriting Text into TEI
 Rewrite this...
     *Cachimbo*,
     _m._
     Apparelho de fumador, composto d..
     Peça de ferro, em que entra o es..
     Buraco, em que se encaixa a vela..
     * _Bras. de Pernambuco._
     Bebida, preparada com aguardente..
     * _Pl. Gír._
     Pés.
     (Do químb. _quixima_)

 ...into this!
     <entry id="cachimbo">
     <form><orth>Cachimbo</orth></form>
     <sense>
     <gramGrp>m.</gramGrp>
     <def>
     Apparelho de fumador, composto d..
     Peça de ferro, em que entra o es..
     Buraco, em que se encaixa a vela..
     </def>
     </sense>
     <sense ast="1">
     <usg type="geo">Bras. de Pernamb..
     <def>
     Bebida, preparada com aguardente..
     </def>
     </sense>
     <sense ast="1"><gramGrp>Pl.</gra..
     <usg type="style">Gír.</usg>
     <def>
     Pés.
     </def>
     </sense>
     <etym ori="químb">(Do químb. _qu..
Rewriting Text into TEI

 This rewrite was all based on:
     a few tables (grammatical and usage strings);
          entries genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
          entries domains: Agr Anat Anthrop Apicult Arith Artilh Archit

     rewrite the little existing mark-up into a better XML structure;
 ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_


     rewrite the new XML structure to detect and annotate a
     more complex structure;
 <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g


     detect and correct wrong XML elements.
 </form></sense>==></form>
 </form></def>\n</sense>==></form>


Rewriting Text into TEI

 Case study conclusions:
     flexible tool;
     works on big files:
         Text file is 13 MB;
         Output XML is 30 MB;
         Process takes about nine minutes!
     we even rewrote XML into XML.



                Hey!! XML is text!!
              How can we rewrite it!?

Rewriting XML



    different from the usual DOM or SAX oriented approaches;
    looks at XML as text, unstructured data;
    rewriting can be done:
        as in any other text rewriting system;
        taking advantage of irregular expressions.



     Irregular expressions? Are you kidding?




Not so regular expressions

 Perl has a powerful regular expression engine:
     regular expressions can define capture groups:
     small pieces of the match that can be used later;
     regular expressions can define look-ahead or look-behind:
     check the context of the matching zone;
     since Perl 5.10, regular expressions can be recursive:
     a regular expression can refer to itself.
     my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

 For XML, we defined two classes:
 [[:XML:]] matches any well-formed XML fragment;
 [[:XML(tag):]] matches an XML fragment with a specific
          root element;
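
To give an idea of how such classes could be built on recursive patterns, here is a simplified sketch in plain Perl (5.10 or later). It is not the definition used by Text::RewriteRules: `$any_element`, `element_named` and `find_elements` are hypothetical names, and the patterns deliberately ignore comments, CDATA sections, processing instructions and `>` characters inside attribute values.

```perl
use strict;
use warnings;

# Any element: a start tag, then text or nested elements, then the end tag
# with the same name; or an empty-element tag. The references are relative,
# (?-2) recurses into the enclosing group and \g{-1} is the tag name, so the
# pattern still works when embedded in a larger one.
my $any_element = qr{
    (
        < ([\w:.-]+) [^<>]* (?<!/) >
        (?: [^<]++ | (?-2) )*+
        </ \g{-1} \s* >
      |
        < [\w:.-]+ [^<>]* />
    )
}x;

# An element whose root has the given name; nested content may be anything.
sub element_named {
    my ($tag) = @_;
    my $name = quotemeta $tag;
    return qr{
        < $name (?=[\s/>]) [^<>]* (?<!/) >
        (?: [^<]++ | $any_element )*+
        </ $name \s* >
      |
        < $name (?=[\s/>]) [^<>]* />
    }x;
}

# Collect every top-level occurrence of the named element in a document.
sub find_elements {
    my ($document, $tag) = @_;
    my $pattern = element_named($tag);
    my @found;
    while ($document =~ /$pattern/g) {
        push @found, substr($document, $-[0], $+[0] - $-[0]);
    }
    return @found;
}

print "$_\n" for find_elements('<tmx><tu><seg>a</seg></tu><tu/></tmx>', 'tu');
```

The per-level backreference relies on Perl restoring capture groups when a recursion returns, the same behaviour the palindrome example in perlretut depends on.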


Rewriting XML

 As a simple example, we can remove duplicate translation units
 in a translation memory file:
 Code example
 RULES/m duplicates
 ([[:XML(tu):]])==>!!duplicate($1)
 ENDRULES

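 use Digest::MD5 qw(md5);   # provides md5()
 use XML::DT;               # provides dtstring() and $c

 my %visited;               # digests of the translation units seen so far

 # Returns true when an identical translation unit was already seen.
 sub duplicate {
   my $tu = shift;
   # Reduce the unit to its text content and hash it.
   my $tumd5 = md5(dtstring($tu,
                            -default => sub{$c}));
   return 1 if exists $visited{$tumd5};
   $visited{$tumd5}++;
   return 0;
 }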
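
For readers without Text::RewriteRules or XML::DT at hand, the same idea can be tried with core Perl only. The sketch below is an approximation under stated assumptions: `remove_duplicate_units` is a hypothetical helper, `<tu>` elements are assumed not to nest (so a non-greedy match stands in for `[[:XML(tu):]]`), and units are compared after stripping whitespace rather than by their text content as `dtstring` does above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

sub remove_duplicate_units {
    my ($tmx) = @_;
    my %visited;    # digests of the units already kept
    $tmx =~ s{(<tu\b[^>]*>.*?</tu>)}{
        my $unit = $1;                     # save it: the next match resets $1
        (my $key = $unit) =~ s/\s+//g;     # ignore whitespace differences
        $visited{ md5_hex($key) }++ ? '' : $unit;    # drop repeated units
    }gse;
    return $tmx;
}

my $memory = '<body><tu><seg>a</seg></tu><tu><seg>b</seg></tu>'
           . '<tu> <seg>a</seg> </tu></body>';
print remove_duplicate_units($memory), "\n";
# prints "<body><tu><seg>a</seg></tu><tu><seg>b</seg></tu></body>"
```

Keeping only a digest per unit, as the slide's code does, keeps the memory footprint small even for large translation memories.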


Conclusions


    The rewriting approach is:
        flexible;
        powerful;
        easy to learn;
        grows quickly;
        big systems can be difficult to maintain;
    The Perl regular expression engine:
        makes it easy to match anything;
        almost supports full grammars;
        makes it possible to define block structures;

    So, it can be applied to XML easily!





Experiments with Different Models of Statistcial Machine Translation
 

Mehr von Alberto Simões

Language Identification: A neural network approach
Language Identification: A neural network approachLanguage Identification: A neural network approach
Language Identification: A neural network approachAlberto Simões
 
Making the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionaryMaking the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionaryAlberto Simões
 
Dictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment by Rewrite-based Entry TranslationDictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment by Rewrite-based Entry TranslationAlberto Simões
 
EMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesEMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesAlberto Simões
 
Aula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAlberto Simões
 
Aula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de AtividadeAula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de AtividadeAlberto Simões
 
Aula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAlberto Simões
 
Aula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAlberto Simões
 
Building C and C++ libraries with Perl
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with PerlAlberto Simões
 
Arquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaAlberto Simões
 
Extracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaAlberto Simões
 

Mehr von Alberto Simões (20)

Source Code Quality
Source Code QualitySource Code Quality
Source Code Quality
 
Language Identification: A neural network approach
Language Identification: A neural network approachLanguage Identification: A neural network approach
Language Identification: A neural network approach
 
Google Maps JS API
Google Maps JS APIGoogle Maps JS API
Google Maps JS API
 
Making the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionaryMaking the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionary
 
Dictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment by Rewrite-based Entry TranslationDictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment by Rewrite-based Entry Translation
 
EMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesEMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized Dictionaries
 
Modelação de Dados
Modelação de DadosModelação de Dados
Modelação de Dados
 
Aula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de Sequência
 
Aula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de AtividadeAula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de Atividade
 
Aula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de Requisitos
 
Aula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de Informação
 
Building C and C++ libraries with Perl
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with Perl
 
PLN em Perl
PLN em PerlPLN em Perl
PLN em Perl
 
Classification Systems
Classification SystemsClassification Systems
Classification Systems
 
Redes de Pert
Redes de PertRedes de Pert
Redes de Pert
 
Dancing Tutorial
Dancing TutorialDancing Tutorial
Dancing Tutorial
 
Sistemas de Numeração
Sistemas de NumeraçãoSistemas de Numeração
Sistemas de Numeração
 
Álgebra de Boole
Álgebra de BooleÁlgebra de Boole
Álgebra de Boole
 
Arquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução Automática
 
Extracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução Automática
 

Kürzlich hochgeladen

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Kürzlich hochgeladen (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Processing XML: a rewriting system approach

  • 1. Processing XML A rewriting system approach Alberto Simões alberto.simoes@eu.ipp.pt Portuguese Perl Workshop – 2010 Alberto Simões Processing XML: a rewriting system approach
  • 2. Motivation and Goals
      XML is usually generated from structured information:
          databases, spreadsheets, forms, etc.
      but it can also be generated from unstructured (or poorly structured) data:
          textual documents, domain-specific languages.
      The question arises: how can XML documents be produced from textual documents?
          write a parser (natural language, domain specific, etc.);
          or produce XML by rewriting the textual document!
  • 7. How does textual rewriting work?
      Write rewriting rules: rule ∼ pattern × restriction × action
          pattern: a regular (or irregular) expression that should be textually matched;
          restriction: conditional code that checks whether the rule should be applied;
          action: a piece of code (or simply a string) that produces the text that replaces the originally matched text.
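The pattern × restriction × action triple can be sketched in a few lines. This is a minimal Python illustration, not the Text::RewriteRules implementation: the names `Rule`, `apply_once` and `YEAR_RULE` are hypothetical, and the year-tagging rule is an invented example.

```python
import re
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    """A rewriting rule as a (pattern, restriction, action) triple."""
    pattern: re.Pattern                      # what must be textually matched
    restriction: Callable[[re.Match], bool]  # should the rule fire for this match?
    action: Callable[[re.Match], str]        # text that replaces the match


def apply_once(rule: Rule, text: str):
    """Rewrite the first match accepted by the restriction.

    Returns (new_text, True) if the rule was applied, (text, False) otherwise.
    """
    for m in rule.pattern.finditer(text):
        if rule.restriction(m):
            return text[:m.start()] + rule.action(m) + text[m.end():], True
    return text, False


# Hypothetical rule: wrap four-digit numbers in <date>, but only plausible years.
YEAR_RULE = Rule(
    pattern=re.compile(r"\b\d{4}\b"),
    restriction=lambda m: 1000 <= int(m.group(0)) <= 2100,
    action=lambda m: f"<date>{m.group(0)}</date>",
)

print(apply_once(YEAR_RULE, "PIN 9999, born 1984"))
```

Here the restriction rejects `9999` even though the pattern matches it, so only `1984` is rewritten.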
  • 11. Are there text rewriting tools?
      For this work we used Text::RewriteRules:
          written in Perl:
              the power of the Perl regular expression engine;
              a reflexive language (code can be generated on the fly);
          supports different rewriting approaches:
              fixed-point rewriting approach;
              sliding-cursor rewriting approach;
              lexical analyzer approach;
          home-developed.
  • 15. Fixed-point rewriting approach
      Algorithm:
          easy to understand;
          a sequence of rules that are applied in order;
          the first rule is tried first, and a later rule is applied only if no earlier rule can be applied;
          a rule may change the document in such a way that an earlier rule becomes applicable again;
          the process ends when no rule can be applied (or when a specific rule forces the system to end).
      Code example: anonymization of emails
          RULES anonymize
          \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
          ENDRULES
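The fixed-point loop described above can be sketched as follows. This is a Python approximation of the algorithm as the slide states it, not the code Text::RewriteRules generates; `fixed_point_rewrite`, `ANONYMIZE` and the `max_steps` safety limit are assumptions added for the illustration.

```python
import re


def fixed_point_rewrite(text, rules, max_steps=10_000):
    """Apply ordered (pattern, replacement) rules until none of them matches.

    After every successful rewrite the scan restarts at the first rule, so an
    earlier rule can fire again on text produced by a later one.
    """
    for _ in range(max_steps):
        for pattern, replacement in rules:
            new_text, count = pattern.subn(replacement, text, count=1)
            if count:
                text = new_text
                break            # something changed: go back to the first rule
        else:
            return text          # no rule applied: fixed point reached
    raise RuntimeError("no fixed point reached within max_steps")


# The e-mail anonymization rule from the slide, as a single-rule system.
ANONYMIZE = [
    (re.compile(r"\w+(\.\w+)*@\w+\.\w+(\.\w+)*"), "[[hidden email]]"),
]

print(fixed_point_rewrite("Write to ambs@example.org or j.doe@mail.example.com", ANONYMIZE))
```

The system terminates because the replacement text contains no `@`, so the rule can never match its own output. A rule whose output still matches its pattern would loop forever, which is why the sketch carries a step limit.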
  • 17. Sliding-cursor rewriting approach
      Algorithm:
          the cursor is placed at the beginning of the string;
          patterns are matched only if they occur right after the cursor;
          if a rule is applied, the cursor is placed after the rewritten region;
          if no rule matches, the cursor moves ahead one character;
          the process ends when the cursor reaches the end of the string;
          text that was already rewritten is never rewritten again.
      Code example: brute-force translation
          RULES/m translate
          (\w+)=e=> $translation{$1} !! exists($translation{$1})
          ENDRULES
      Example (_ marks the cursor):
          _ latest train
          último _ train
          último combóio _
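A sliding cursor can be sketched like this. It is a Python illustration of the steps listed on the slide, under the assumption that each rule is a (pattern, restriction, action) triple; `sliding_cursor_rewrite` and the rule layout are hypothetical names, not the module's API.

```python
import re


def sliding_cursor_rewrite(text, rules):
    """Rewrite `text` left to right; rules only match right at the cursor.

    Output is accumulated separately from the input, so text produced by a
    rule is never examined again.
    """
    out = []
    pos = 0                                    # the cursor
    while pos < len(text):
        for pattern, restriction, action in rules:
            m = pattern.match(text, pos)       # anchored at the cursor
            if m and m.end() > pos and restriction(m):
                out.append(action(m))
                pos = m.end()                  # jump past the rewritten region
                break
        else:
            out.append(text[pos])              # no rule applied: copy one char
            pos += 1
    return "".join(out)


# The brute-force translation example from the slide.
translation = {"latest": "último", "train": "combóio"}
translate = [
    (re.compile(r"\w+"),
     lambda m: m.group(0) in translation,      # the `!! exists(...)` restriction
     lambda m: translation[m.group(0)]),
]

print(sliding_cursor_rewrite("the latest train", translate))
```

Because the cursor only moves forward, a dictionary such as `{"a": "b", "b": "c"}` turns `a b` into `b c`, whereas a fixed-point system with the same two rules would keep rewriting until it reached `c c`.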
  • 20. Valid Rewriting Rules
      Different approaches allow different kinds of rules, but the most relevant ones are:
          ==>       simple pattern substitution: the left-hand side is a Perl regular expression and the right-hand side is the string that replaces the match;
          =e=>      like the previous one, but the right-hand side is Perl code to be evaluated; its result replaces the match;
          =begin=>  has no left-hand side; the right-hand-side code is executed before the rewrite starts;
          =end=>    has no right-hand side; when the left-hand-side pattern matches, the rewrite system quits.
      Rules can include a restriction block (!!) to the right of the action.
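One way the four rule kinds and the `!!` restriction could be interpreted by a fixed-point loop is sketched below. Everything here is an assumption made for illustration: the `Rule` dataclass, the `kind` labels and the `run` function are hypothetical, and the real module may order or evaluate rules differently.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    kind: str                       # "subst" (==>), "eval" (=e=>), "begin", "end"
    pattern: Optional[str] = None   # left-hand side (unused by "begin")
    action: object = None           # str for "subst", callable for "eval"/"begin"
    restriction: Callable[[re.Match], bool] = lambda m: True  # the !! block


def run(text, rules, max_steps=10_000):
    for rule in rules:              # =begin=> code runs before any rewriting
        if rule.kind == "begin":
            rule.action()
    active = [r for r in rules if r.kind != "begin"]
    for _ in range(max_steps):
        for rule in active:
            # first match that the restriction accepts, if any
            m = next((m for m in re.finditer(rule.pattern, text)
                      if rule.restriction(m)), None)
            if m is None:
                continue
            if rule.kind == "end":  # =end=> quits the rewrite system
                return text
            new = rule.action if rule.kind == "subst" else rule.action(m)
            text = text[:m.start()] + new + text[m.end():]
            break
        else:
            return text             # no rule applied: fixed point
    raise RuntimeError("no fixed point reached within max_steps")


state = {"count": 0}

def reset():
    state["count"] = 0

def shout(m):
    state["count"] += 1
    return m.group(0).upper()

rules = [
    Rule("begin", action=reset),
    Rule("end", pattern=r"<!-- stop -->"),
    Rule("subst", pattern=r"&(?!amp;)", action="&amp;"),
    Rule("eval", pattern=r"\b[a-z]+\b", action=shout,
         restriction=lambda m: len(m.group(0)) > 3),
]

print(run("fish & chips", rules))
```

In this example the restriction keeps the three-letter `amp` inside `&amp;` from being upper-cased, and any input containing `<!-- stop -->` is returned unchanged because the `end` rule comes first.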
  • 26. Rewriting Text into XML
      How to produce XML from weakly structured data?
          write a parser;
          or rewrite the data step by step into XML!
      Two case studies:
          rewriting a dictionary in textual format into TEI;
          rewriting an XML DSL authoring tool into XML.
  • 28. Rewriting Text into TEI
      Rewrite this. . .
          *Cachimbo*,
          _m._
          Apparelho de fumador, composto d..
          Peça de ferro, em que entra o es..
          Buraco, em que se encaixa a vela..
          * _Bras. de Pernambuco._
          Bebida, preparada com aguardente..
          * _Pl. Gír._
          Pés.
          (Do químb. _quixima_)
      . . . into this!
          <entry id="cachimbo">
          <form><orth>Cachimbo</orth></form>
          <sense>
          <gramGrp>m.</gramGrp>
          <def>
          Apparelho de fumador, composto d..
          Peça de ferro, em que entra o es..
          Buraco, em que se encaixa a vela..
          </def>
          </sense>
          <sense ast="1">
          <usg type="geo">Bras. de Pernamb..
          <def>
          Bebida, preparada com aguardente..
          </def>
          </sense>
          <sense ast="1"><gramGrp>Pl.</gra..
          <usg type="style">Gír.</usg>
          <def>
          Pés.
          </def>
          </sense>
          <etym ori="químb">(Do químb. _qu..
  • 29. Rewriting Text into TEI
      This rewrite was entirely based on:
          a few tables (grammatical and usage strings);
              entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
              entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit
          rewriting the little existing mark-up into a better XML structure;
              ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_
          rewriting the new XML structure to detect and annotate a more complex structure;
              <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g
          detecting and correcting wrong XML elements.
              </form></sense>==></form>
              </form></def>\n</sense>==></form>
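The table-driven step can be illustrated with a small Python sketch. It only covers the simplest case, a single abbreviation between underscores such as `_Gír._`, and it is not the author's rule set: `table_rule` and `tag_usage` are hypothetical names, and while `type="style"` appears on the slides, the `type="domain"` value is an assumption.

```python
import re

# Tables taken from the slide (both lists are partial there too).
GENRES = ["Gír", "Fam", "Pop", "Des", "Fig", "Vulg", "Ant", "Chul", "Euph"]
DOMAINS = ["Agr", "Anat", "Anthrop", "Apicult", "Arith", "Artilh", "Archit"]


def table_rule(labels, usg_type):
    """Build one (pattern, replacement) rule from a table of abbreviations."""
    alternation = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(rf"_({alternation})\._")
    replacement = rf'<usg type="{usg_type}">\1.</usg>'
    return pattern, replacement


RULES = [
    table_rule(GENRES, "style"),
    table_rule(DOMAINS, "domain"),   # assumed type value, not from the slides
]


def tag_usage(text):
    """Turn _Abbr._ markers listed in the tables into <usg> elements."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(tag_usage("_Gír._ Pés."))
```

Markers that are not in any table, such as the grammatical `_m._`, are left untouched for other rules to handle.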
  • 33. Rewriting Text into TEI
      Case study conclusions:
          flexible tool;
          works on big files:
              text file is 13 MB;
              output XML is 30 MB;
              the process takes about nine minutes!
          we even rewrote XML into XML.
      Hey!! XML is text!! How can we rewrite it!?
  • 35. Rewriting XML
      different from the usual DOM- or SAX-oriented approaches;
      looks at XML as text, as non-structured data;
      rewriting can be done:
          as in any other text rewriting system;
          taking advantage of irregular expressions.
      Irregular expressions? Are you kidding?
  • 39. Not so regular expressions
      Perl has a powerful regular expression engine:
          regular expressions can define capture zones:
              small pieces of the match that can be used later;
          regular expressions can define look-ahead or look-behind:
              check the context of the matching zone;
          since Perl 5.10, regular expressions can be recursive:
              a regular expression can refer to itself.
                  my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
      For XML, we defined two classes:
          [[:XML:]] matches any well-formed XML fragment;
          [[:XML(tag):]] matches an XML fragment with a specific root element.
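The recursive pattern above can be tried directly, and the same trick gives a rough idea of how a class such as `[[:XML(tag):]]` might work. The sketch below is an assumption, not the implementation behind the talk's classes: `xml_fragment` is a hypothetical helper that builds a recursive pattern for one tag name, and it ignores comments, CDATA sections, processing instructions, and self-closing forms of the target tag.

```perl
use strict;
use warnings;

# Balanced parentheses: (?-1) recurses into the capture group it sits in.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Hypothetical helper: a pattern matching one <tag>...</tag> fragment,
# including nested occurrences of the same tag. Other tags are treated
# as opaque tokens, so their nesting is not checked.
sub xml_fragment {
    my $tag = quotemeta shift;
    return qr{
        (                                  # the group we recurse into
            <$tag\b [^<>]* (?<!/) >        # opening tag (not self-closing)
            (?: [^<]++                     # plain text
              | <(?!/?$tag\b) [^<>]* >     # any tag with a different name
              | (?-1)                      # nested fragment with the same tag
            )*+
            </$tag>                        # matching closing tag
        )
    }x;
}

my ($group)    = 'f(a(b)c) (d' =~ $parens;
my $b_re       = xml_fragment('b');
my ($fragment) = '<a><b>x<b>y</b></b><c/></a>' =~ $b_re;

print "$group\n";
print "$fragment\n";
```

Using the relative reference `(?-1)` instead of an absolute group number keeps the pattern usable when it is interpolated into a larger expression with its own capture groups.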
  • 45. Rewriting XML
      As a simple example, we can remove duplicate translation units in a translation memory file:
      Code example
          # Assumed imports: the slide does not show them.
          use Text::RewriteRules;
          use XML::DT;                # dtstring, $c
          use Digest::MD5 'md5';

          my %visited;                # digests of the units already seen

          RULES/m duplicates
          ([[:XML(tu):]])==>!!duplicate($1)
          ENDRULES

          # True when a unit with the same text content was already seen.
          sub duplicate {
            my $tu = shift;
            my $tumd5 = md5(dtstring($tu, -default => sub{$c}));
            return 1 if exists $visited{$tumd5};
            $visited{$tumd5}++;
            return 0;
          }
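For readers without the rewriting module at hand, the same idea can be sketched with core Perl only. This is an approximation under stated assumptions, not the code from the talk: `$tu_re` is a hypothetical stand-in for `[[:XML(tu):]]` that assumes `<tu>` elements never nest, and `normalise` is a hypothetical helper that roughly imitates hashing only the text content of each unit.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical stand-in for [[:XML(tu):]]: assumes <tu> elements do not nest.
my $tu_re = qr{<tu\b[^<>]*>.*?</tu>\s*}s;

# Reduce a unit to its text content: drop tags, collapse whitespace.
sub normalise {
    my $s = shift;
    $s =~ s/<[^>]*>//g;
    $s =~ s/\s+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# Delete every translation unit whose text content was already seen.
sub remove_duplicates {
    my ($tmx) = @_;
    my %visited;
    $tmx =~ s{($tu_re)}{
        my $unit = $1;                                   # copy before other matches run
        my $key  = md5_hex(encode_utf8(normalise($unit)));
        $visited{$key}++ ? '' : $unit;                   # drop repeats, keep first
    }ge;
    return $tmx;
}

my $tmx = '<body><tu><tuv>a</tuv></tu><tu><tuv>b</tuv></tu><tu><tuv>a</tuv></tu></body>';
my $out = remove_duplicates($tmx);
print "$out\n";
```

Hashing the normalised text rather than the raw fragment means two units that differ only in attributes or whitespace are treated as duplicates, which may or may not be what a given translation memory needs.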
  • 46. Conclusions
      The rewriting approach is:
          flexible;
          powerful;
          easy to learn;
          quick to grow;
          hard to maintain once systems get big.
      The Perl regular expression engine:
          makes it easy to match anything;
          almost supports full grammars;
          makes it possible to define block structures.
      So, it can be applied to XML easily!
  • 47. Thank you! Alberto Simões, alberto.simoes@eu.ipp.pt