Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books

Text::Perfide::BookCleaner, un módulo Perl para limpieza de libros en texto plano

André Santos                              José João Almeida
Departamento de Informática               Departamento de Informática
Universidade do Minho                     Universidade do Minho
Braga, Portugal                           Braga, Portugal
pg15973@alunos.uminho.pt                  jj@di.uminho.pt

Resumen: En este trabajo se presenta Text::Perfide::BookCleaner, un módulo Perl para pre-procesamiento de libros en formato de texto plano para posterior alineamiento u otros usos. Tareas de limpieza incluyen: eliminación de saltos de página, números de página, encabezados y pies de página; encontrar títulos y fronteras de las secciones; eliminación de las notas de pie de página y normalización de la notación de párrafo y caracteres Unicode. El proceso es guiado con la ayuda de objetos declarativos como ontologías y ficheros de configuración. Se realizó una evaluación comparativa de las alineaciones con y sin Text::Perfide::BookCleaner, sobre la cual se presentan los resultados y conclusiones.
Palabras clave: corpora paralelos, alineamiento de documentos, herramientas de PLN

       Abstract: This paper presents Text::Perfide::BookCleaner, an application to pre-
       process plain text books and clean them for any further arbitrary use, e.g., text alignment,
       format conversion or information retrieval. Cleaning tasks include removing page breaks,
       page numbers, headers and footers; finding section titles and boundaries; removing footnotes
       and normalizing paragraph notation and Unicode characters. The process is guided with
       the help of declarative objects such as ontologies and configuration files. A comparative
       evaluation of alignments with and without Text::Perfide::BookCleaner was performed,
       and the results and conclusions are presented.
       Keywords: parallel corpora, document alignment, NLP tools

1 Introduction

In many tasks related to natural language processing, text is the starting point, and the quality of the results that can be obtained depends strongly on the quality of the text itself.

Many kinds of noise may be present in texts, resulting in a wrong interpretation of:

• individual characters,
• words, phrases, sentences,
• paragraph frontiers,
• pages, headers, footers and footnotes,
• titles and section frontiers.

The variety of existing texts is huge, and many of the problems are related to specific types of text.

In this document, the central concern is the task of cleaning text books, with a special focus on making them ready for alignment.

1.1 Context

The work presented in this paper has been developed within Project Per-Fide, a project which aims to build a large parallel corpus (Araújo et al., 2010). This process involves gathering, preparing, aligning and making available for query thousands of documents in several languages.

The documents gathered come from a wide range of sources, and are obtained in several different formats and conditions. Some of these documents present specific issues which prevent them from being successfully aligned.

1.2 Motivation

The alignment of text documents is a process which depends heavily on the format and condition of the documents to be aligned (Véronis, 2000). When processing documents of a literary type (i.e. books), there are specific issues which give rise to additional difficulties. The main characteristics of the documents that may negatively affect book alignment are:

File format: Documents available in PDF or Microsoft Word formats (or, generally, any structured or unstructured format other than plain text) need to be converted to plain text before being processed, using tools such as pdftotext (Noonburg, 2001) or Apache Tika (Gupta and Ahmed, 2007; Mattmann and Zitting, 2011). This conversion often leads to loss of information (text structure, text formatting, images, etc.) and to the introduction of noise (bad conversion of mathematical expressions, diacritics, tables and others) (Robinson, 2001);

Text encoding format: There are many different text encodings available. For example, Western-European languages can be encoded as ISO-8859-1 (also known as Latin1), Windows CP 1252, UTF-8 or others. Discovering the encoding used in a given document is frequently not a straightforward task (a small detection sketch is given after this list), and dealing with a text while assuming the wrong encoding may have a negative impact on the results;

Structural residues: Some texts, despite being in plain text format, still contain structural elements from their previous format (for example, it is common to find page numbers, headers and footers in the middle of the text of documents which have been converted from PDF to plain text);

Unpaired sections: Sometimes one of the documents to be aligned contains one or more sections which are not present in the other document. It is a quite common occurrence with forewords to a given edition, preambles, etc.;

Sectioning notation: There are endless ways to represent and numerate the several divisions and subdivisions of a document (parts, chapters, sections, etc.). Several of these notations are language-dependent (e.g. Part One or Première Partie). As such, aligning section titles is not a trivial task. Identifying these marks can not only help to prevent this, but it can even be used to guide the alignment process by pairing the sections first (structural alignment).
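The encoding problem mentioned in the list above can be pictured with the core Encode::Guess module. This is only an illustration: the file name and the candidate encodings below are arbitrary examples, not anything prescribed by T::P::BC.

    use strict;
    use warnings;
    use Encode::Guess;

    # Read the raw bytes of a book whose encoding is unknown
    # ('book.txt' is a hypothetical input file).
    open my $fh, '<:raw', 'book.txt' or die "open: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    # Let Encode::Guess pick among the default checks (ASCII, UTF-8) plus
    # one extra suspect; ambiguous or unknown input is reported as a string.
    my $enc = guess_encoding($bytes, qw/iso-8859-1/);
    die "Could not guess encoding: $enc\n" unless ref $enc;

    # Decode to Perl's internal representation before any cleaning step.
    my $text = $enc->decode($bytes);
    print "Guessed encoding: ", $enc->name, "\n";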
These problems are often enough to derail the alignment process, producing unacceptable results, and are found frequently enough to justify the development of a tool to pre-process the books, cleaning and preparing them for further processing.

There are other tools available which also address the issue of preparing corpora for alignment, such as the one described by Okita (2009), or text cleaning, such as TextSoap, a program for cleaning text in several formats (UnmarkedSoftware, 2011).

2 Book Cleaner

In order to solve the problems described in the previous section, the first approach was to implement a Perl script which, with the help of UNIX utilities like grep and some regular expressions, attempted to find the undesired elements and normalize, replace or delete them. As we started to get a better grasp of the task, it became clear that such a naive approach was not enough.

It was then decided to develop a Perl module, Text::Perfide::BookCleaner (T::P::BC), divided into separate functions, each one dealing with a specific problem.

2.1 Design Goals

T::P::BC was developed with several design goals in mind. Each component of the module meets the following three requirements:

Optional: depending on the conditions of the books to be cleaned, some steps of this process may not be desired or necessary. Thus, each step may be performed independently from the others, or not be performed at all;

Chainable: the functions can be used in sequence, with the output of one function passing as input to the next, with no intermediary steps required;

Reversible: as much as possible, no information is lost (everything removed from the original document is kept in separate files). This means that, at any time, it is possible to revert the book to a previous (or even the original) state.
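The chaining style can be pictured with the short sketch below. The step functions are hypothetical stand-ins written for this illustration; they are not the actual identifiers exported by T::P::BC.

    use strict;
    use warnings;

    # Hypothetical cleaning steps: each takes the whole text and returns it
    # cleaned, so any subset of steps can be chained in any order.
    sub clean_pages       { my ($t) = @_; $t =~ s/\x0C/\n_pb_\n/g;      return $t }
    sub clean_footnotes   { my ($t) = @_; $t =~ s/^\[\d+\][^\n]*\n//mg; return $t }
    sub clean_blank_lines { my ($t) = @_; $t =~ s/\n{3,}/\n\n/g;        return $t }

    my $text = do { local $/; <> };   # read a plain text book from STDIN or a file argument

    # The output of one step feeds the next; skipping a step is just leaving it out.
    $text = clean_blank_lines( clean_footnotes( clean_pages($text) ) );
    print $text;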
Additionally, after the cleaning process, a diagnostic report is generated, aimed at helping the user to understand the problems detected in the book, and what was performed in order to solve them.

Below is presented an example of the diagnostic produced after cleaning a version of La maison à vapeur by Jules Verne:

    1   footers = ['( Page _NUM_ ) = 240'];
    2   headers=$VAR1 =
    3     ["(La maison \x{e0} vapeur Jules Verne) = 241"];
    4   ctrL=1;
    5   pagnum_ctrL=241;
    6   sectionsO=2;
    7   sectionsN=30;
    8   word_per_emptyline=3107.86842105263;
    9   emptylines=37;
    10  word_per_line=11.5658603466849;
    11  lines=10211;
    12  To_be_indented=1;
    13  words=118099;
    14  word_per_indent=52.2793271359008;
    15  lines_w_pont=3263;

In this diagnostic, we can see, among other things, that this document had headers composed of the book title and author name, that footers contained the word Page followed by a number (presumably the page number) and that the 241 pages were separated with ^L.

The process is also configurable with the help of an ontology, and the use of configuration files written in domain-specific languages is also planned. Both are described in section 2.3.

2.2 Pipeline

We decided to tackle the problem in five different steps: pages, sections, paragraphs, footnotes, and words and characters. Each of these stages deals with a specific type of problem commonly found in plain text books. A commit option is also available, a final step which irreversibly removes all the markup added along the cleaning process. The different steps of this process are implemented as functions in T::P::BC.

Figure 1 presents the pipeline of the T::P::BC module.

[Figure 1: Pipeline of T::P::BC.]

T::P::BC accepts as input documents in plain text format, with UTF-8 text encoding (other encoding formats are also supported). The final output is a summary report, and the cleaned version of the original document, which may then be used in several processes, such as alignment (Varga et al., 2005; Evert, 2001), ebook generation (Lee, Guttenberg, and McCrary, 2002), or others.

2.2.1 Pages

Page breaks, headers and footers are structural elements of pages that must be handled with care when automatically processing texts. In the particular case of bitext alignment, the presence of these elements is often enough to confuse the aligner and decrease its performance. The headers and footers usually contain information about the book author, book name and section title. Despite the fact that the information provided by these elements might be relevant for some tasks, they should be identified, marked and possibly removed before further processing.

Below is presented an extract of a plain text version of La Maison à Vapeur, from Jules Verne, containing, from line 3 to 6, a page break (a footer, a page break character and a header, surrounded by blank lines).

    1   rencontrer le nabab, et assez audacieux pour
    2   s'emparer de sa personne.
    3
    4                       Page 3
    5    ^L La maison à vapeur                Jules Verne
    6
    7     Le faquir, - évidemment le seul entre tous
    8   que ne surexcitât pas l'espoir de gagner la

The 0xC character (also known as Control+L, ^L or \f) indicates, in plain text, a page break. When present, these characters provide a reliable way to delimit pages. Page numbers, headers and footers should also be found near these breaks. There are books, however, which do not have their pages separated with this character.

Our algorithm starts by looking for page break characters. If it finds them, it immediately replaces them with a custom page break mark.
If not, it tries to find occurrences of page numbers. These are typically lines with only three or fewer digits (four digits would occasionally get year references confused with page numbers), with a preceding and a following blank line.

After finding the page breaks, the algorithm attempts to identify headers and footers: it considers the lines which are near each page break as candidates for headers or footers (depending on whether they come after or before the break, respectively). Then the number of occurrences of each candidate (normalized so that changes in numbers are ignored) is calculated, and those which occur more often than a given threshold are considered headers or footers.

At the end, the headers and footers are moved to a standoff file, and only the page break mark is left in the original text.

A cleaned up version of the previous example is presented below:

    1   est vrai qu'il fallait être assez chanceux pour
    2   rencontrer le nabab, et assez audacieux pour
    3   s'emparer de sa personne. _pb2_
    4     Le faquir, - évidemment le seul entre tous
    5   que ne surexcitât pas l'espoir de gagner la
    6   prime, - filait au milieu des groupes, s'arrêtant
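A condensed sketch of the strategy just described: split on the page break character, take the first and last non-blank lines of each page as header and footer candidates, normalise digits, and keep the candidates that repeat often enough. The 0.5 threshold is an illustrative choice, and books without page break characters are not handled here.

    use strict;
    use warnings;

    my $text  = do { local $/; <> };        # the whole book as one string
    my @pages = split /\x0C/, $text;        # 0x0C is the page break character

    my (%header, %footer);
    for my $page (@pages) {
        my @lines = grep { /\S/ } split /\n/, $page;
        next unless @lines;
        # First/last non-blank lines are candidates; digits are normalised so
        # that changing page numbers still count as the same candidate.
        (my $head = $lines[0])  =~ s/\d+/_NUM_/g;
        (my $foot = $lines[-1]) =~ s/\d+/_NUM_/g;
        $header{$head}++;
        $footer{$foot}++;
    }

    # Candidates seen on more than half of the pages are accepted.
    my $threshold = 0.5 * @pages;
    print "header: $_ (seen $header{$_} times)\n"
        for grep { $header{$_} > $threshold } keys %header;
    print "footer: $_ (seen $footer{$_} times)\n"
        for grep { $footer{$_} > $threshold } keys %footer;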
2.2.2 Sections

There are several reasons that justify the importance of delimiting sections in a book:

• automatically generating tables of contents;

• identifying sections from one version of a book that are missing in another version (for example, it is common for some editions of a book to have an exclusive preamble or afterword which cannot be found in the other editions);

• matching and comparing the sections of two books before aligning them. This allows to assess the books' alignability (compatibility for alignment), to predict the results of the alignment and even to simply discard the worst sections;

• in areas like biomedical text mining, being able to detect the different sections of a document allows to search for specific information in specific sections (Agarwal and Yu, 2009).

Being an unstructured format, plain text by itself does not have the necessary means to formally represent the hierarchy of divisions of a book (chapters, sections, etc.). As such, the notation used to write the titles of the several divisions is dictated by the personal choices of whoever transcribed the book to electronic format, or by their previous format and the tool used to perform the conversion to plain text. Additionally, the nomenclature used is often language-dependent (e.g. Part One and Première Partie, or Chapter II and Capítulo 2).

Below is presented the beginning of the first section of a Portuguese version of Les Misérables, by Victor Hugo:

    1    PRIMEIRA PARTE
    2
    3    FANTINE
    4
    5
    6     ^L LIVRO PRIMEIRO
    7
    8    UM JUSTO
    9
    10   O abade Myriel
    11
    12   Em 1815, era bispo de Digne, o reverendo Carlos
    13   Francisco Bemvindo Myriel, o qual contava setenta

In order to help in the sectioning process, an ontology was built (see section 2.3.1), which includes several section types and their relations, names of common sections, and ordinal and cardinal numbers.

The algorithm tries to find lines containing section names, pages or lines containing only numbers, or lines with Roman numerals. Then a custom section mark is added, containing a normalized version of the original title (e.g. Roman numerals are converted to Arabic numbers).

The processed version of the example shown above follows:

    1    _sec+N:part=1_ PRIMEIRA PARTE
    2
    3    FANTINE
    4
    5    _sec+N:book=1_ LIVRO PRIMEIRO
    6
    7    UM JUSTO
    8
    9    O abade Myriel
    10
    11   Em 1815, era bispo de Digne, o reverendo Carlos
    12   Francisco Bemvindo Myriel, o qual contava setenta
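A much reduced sketch of this matching step: a small keyword table stands in for the sections ontology, and only two-word titles such as PRIMEIRA PARTE or LIVRO PRIMEIRO and lines holding a single Roman numeral are recognised. The mark syntax follows the example above; everything else is simplified for illustration.

    use strict;
    use warnings;

    # Tiny stand-in for the sections ontology.
    my %type = (parte => 'part', part => 'part', livro => 'book', book => 'book',
                capitulo => 'chap', chapter => 'chap');
    my %ord  = (primeira => 1, primeiro => 1, one => 1,
                segunda  => 2, segundo  => 2, two => 2);

    # Convert a Roman numeral into an Arabic number.
    sub roman2arabic {
        my %v = (I => 1, V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000);
        my ($n, $prev) = (0, 0);
        for my $c (reverse split //, uc shift) {
            my $x = $v{$c} or return;
            $n += $x < $prev ? -$x : $x;
            $prev = $x;
        }
        return $n;
    }

    while (my $line = <>) {
        if ($line =~ /^\s*(\w+)\s+(\w+)\s*$/) {
            my ($a, $b) = (lc $1, lc $2);
            my ($t, $n) = $type{$a} && $ord{$b} ? ($type{$a}, $ord{$b})
                        : $type{$b} && $ord{$a} ? ($type{$b}, $ord{$a})
                        : ();
            if (defined $t) { print "_sec+N:${t}=${n}_ $line"; next }
        }
        if ($line =~ /^\s*([IVXLCDM]+)\s*$/i) {
            # A line holding only a Roman numeral is treated as a section title.
            my $n = roman2arabic($1);
            if ($n) { print "_sec+N:sec=${n}_ $line"; next }
        }
        print $line;
    }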
2.2.3 Paragraphs

When reading a book, we often assume that a new line, especially if indented, represents a new paragraph, and that sentences within the same paragraph are separated only by the punctuation mark and a space.
However, when dealing with plain text books, several other representations are possible: some books break sentences with a new line and paragraphs with a blank line (two consecutive new line characters); others use indentation to represent paragraphs; and there are even books where new line characters are used only to wrap text. There are also books where direct speech (from a character of the story) is represented with a trailing dash; others use quotation marks to the same end.

The detection of paragraph and sentence boundaries is of the utmost importance in the alignment process, and a wrong delimitation is often responsible for a bad alignment.

In order to be able to detect how paragraphs and sentences are represented, our algorithm starts by measuring several parameters, such as the number of words, the number of lines, the number of empty lines and the number of indented lines.

Then several metrics are calculated based on these parameters, which are used to understand how paragraphs are represented, and the text is fixed accordingly.

    Algorithm 1: Paragraph
    Input:  txt: text of the book
    Output: txt: text with improved paragraphs

    elines   ← ...calculate empty lines
    lines    ← ...calculate number of lines
    words    ← ...calculate number of words
    plines   ← ...number of lines with punctuation
    indent_i ← ...calculate indentation distribution
    plr ← plines / lines                   // punctuated lines ratio
    forall the i ∈ dom(indent) do
        if i > 10 ∨ indent_i < 12 then
            remove(indent_i)               // remove false indent
    wpl  ← words / lines                   // words per line
    wpel ← words / (1 + elines)            // words per empty line
    wpi  ← words / indent                  // words per indent
    if wpel > 150 then
        if wpi ∈ [10..100] then
            separate paragraphs by indentation
        if wpl ∈ [10..100] ∧ plr > 0.6 then
            separate paragraphs by new lines
    else
        separate paragraphs by empty lines
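The measuring part of Algorithm 1 translates almost directly into Perl. The sketch below only computes the metrics and prints the decision; the indentation-distribution filtering and the actual rewriting of the paragraphs are omitted.

    use strict;
    use warnings;

    my $text  = do { local $/; <> };
    my @lines = split /\n/, $text;
    die "empty input\n" unless @lines;

    my $lines  = scalar @lines;
    my $elines = grep { /^\s*$/ }     @lines;   # empty lines
    my $plines = grep { /[.!?]\s*$/ } @lines;   # lines ending in punctuation
    my $indent = grep { /^\s+\S/ }    @lines;   # indented lines
    my $words  = () = $text =~ /\S+/g;          # total words

    my $plr  = $plines / $lines;                # punctuated lines ratio
    my $wpl  = $words  / $lines;                # words per line
    my $wpel = $words  / (1 + $elines);         # words per empty line
    my $wpi  = $indent ? $words / $indent : 0;  # words per indented line

    if ($wpel > 150) {                          # too few empty lines to be paragraph breaks
        print "separate paragraphs by indentation\n" if $wpi >= 10 && $wpi <= 100;
        print "separate paragraphs by new lines\n"   if $wpl >= 10 && $wpl <= 100 && $plr > 0.6;
    }
    else {
        print "separate paragraphs by empty lines\n";
    }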
2.2.4 Footnotes

Footnotes are formed by a call mark inserted in the middle of the text (often appended to the end of a word or phrase), and the footnote expansion (i.e. the text of the footnote). The placement of the footnote expansion varies from book to book, but it generally appears either at the end of the same page where its call mark can be found, or in a dedicated section at the end of the document.

Depending on the notation used to represent footnote call marks, these can introduce noise in the alignment (for example, if the aligner confuses them with sentence breaks). At best, the alignment might work well, but the resulting corpora will be polluted with the call marks, interfering with their further use. Footnote expansions are even more likely to disturb the process, as it is very unlikely that the matching footnotes are found at the same place in both books (the page divisions would have to be exactly the same).

Below is presented an example of a page containing footnote call marks and, at the end, footnote expansions, followed by the beginning of another page:

    1   roi Charles V, fils de Jean II, auprès de la rue
    2   Saint-Antoine, à la porte des Tournelles[1].
    3
    4   [1] La Bastille, qui fut prise par le peuple de
    5   Paris, le 14 juillet 1789, puis démolie. B.
    6
    7    ^L Quel était en chemin l'étonnement de l'Ingénu!
    8   je vous le laisse à penser. Il crut d'abord que

The removal of footnotes is performed in two steps: first the expansions are removed, then the call marks. The devised algorithm searches for lines starting with a plausible pattern (e.g. <<1>>, [2] or ^3) and followed by blank lines. These are considered footnote expansions, and are consequently replaced by a custom footnote mark and moved to a standoff file.

Once the footnote expansions have been removed, the remaining marks (following the same patterns) are likely to be call marks, and as such they are removed and replaced by a custom normalized mark.¹

¹ The ideal would be to be able to establish the correspondence between the footnote call marks and their respective expansions. However, that feature is not especially relevant to the focus of this work, as the effort it would require is not compensated by its practical results.
After removing the footnotes, the example given above would look like the following:

    1   roi Charles V, fils de Jean II, auprès de la rue
    2   Saint-Antoine, à la porte des Tournelles_fnr29_.
    3   _fne8_
    4    ^L Quel était en chemin l'étonnement de l'Ingénu!
    5   je vous le laisse à penser. Il crut d'abord que
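A compressed sketch of the two steps (expansions first, then call marks). The accepted patterns, the mark names and the standoff handling are simplified with respect to the real module; in particular this version numbers expansions sequentially and reuses the original number in the call marks.

    use strict;
    use warnings;

    my $text = do { local $/; <> };
    my @notes;                                   # standoff storage for removed expansions

    # Step 1: an expansion is a block starting with [1], <<2>> or ^3 at the
    # beginning of a line and ending at a blank line; move it out, leave a mark.
    $text =~ s{^((?:\[\d+\]|<<\d+>>|\^\d+)[^\n]*(?:\n[^\n]+)*)\n\n}
              { push @notes, $1; '_fne' . scalar(@notes) . "_\n\n" }gmex;

    # Step 2: the same patterns still found in running text are call marks.
    $text =~ s{\[(\d+)\]|<<(\d+)>>|\^(\d+)}
              { '_fnr' . ($1 // $2 // $3) . '_' }gex;

    print $text;

    # Keep the removed expansions in a separate (standoff) file.
    open my $out, '>', 'footnotes.standoff' or die "open: $!";
    print {$out} join("\n\n", @notes), "\n" if @notes;
    close $out;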
2.2.5 Words and characters

At the word and character level, there are several problems that can affect the alignment of two books: Unicode characters, translineations and transpaginations:

Unicode characters: often, Unicode-only versions of previously existing ASCII characters are used. For example, Unicode has several types of dashes (e.g. '-', '–' and '—'), where ASCII only has one (i.e. '-'). While these do not represent the exact same character, they are close enough, and their normalization allows to produce more similar results.

Glyphs: some documents use glyphs (Unicode characters) to represent combinations of some specific characters, like fi or ff.

Translineation: translineation happens when a given word is split across two lines in order to keep the line length constant. If the two parts of translineated words are not rejoined before aligning, the aligner may see them as two separate words.

Transpagination: transpagination happens when a translineation occurs at the last line of a page, resulting in a word split across pages. However, the concept of page is removed by the first function, pages, which removes headers and footers and concatenates the pages. This means that transpagination occurrences get reduced to translineations.

The algorithm for dealing with translineations and transpaginations is available as an optional feature. This is mainly because different languages have different standard ways of dealing with translineations, and many documents use incorrect forms of translineation. As such, the user has the option to turn this feature on or off.

The algorithm works by looking for words split across lines (word characters, followed by a dash, a newline and more word characters). When this pattern is found, the newline is removed from the middle of the word and appended after it instead.
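The pattern just described fits in a single substitution. This bare-bones version applies the rejoining unconditionally, ignoring the language-specific conventions mentioned above.

    use strict;
    use warnings;

    my $text = do { local $/; <> };

    # Rejoin words split across lines: "charac-\nters" becomes "characters\n",
    # i.e. the dash disappears and the newline is moved to after the word.
    $text =~ s/(\w+)-\n(\w+)/$1$2\n/g;

    print $text;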
Dealing with Unicode characters is performed with a simple search and replace. Characters which have a corresponding character in ASCII are directly replaced, while stranger characters are replaced with a normalized custom mark.
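A minimal version of this search and replace. The table below maps a few typographic characters and ligature glyphs to ASCII; the _uXXXX_ mark and the narrow character ranges chosen for the fallback are illustrative choices, not the module's actual tables.

    use strict;
    use warnings;

    binmode STDIN,  ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    my %ascii_for = (
        "\x{2013}" => '-',    # en dash
        "\x{2014}" => '-',    # em dash
        "\x{2018}" => "'",    # left single quotation mark
        "\x{2019}" => "'",    # right single quotation mark
        "\x{FB00}" => 'ff',   # ff ligature glyph
        "\x{FB01}" => 'fi',   # fi ligature glyph
    );
    my $known = join '', map { quotemeta } keys %ascii_for;

    while (my $line = <STDIN>) {
        # Characters with an ASCII counterpart are replaced directly...
        $line =~ s/([$known])/$ascii_for{$1}/g;
        # ...other "stranger" characters (here only the general punctuation
        # and ligature blocks) get a normalised custom mark.
        $line =~ s/([\x{2000}-\x{206F}\x{FB00}-\x{FB4F}])/sprintf '_u%04X_', ord $1/ge;
        print $line;
    }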
2.2.6 Commit

This last step removes the custom marks introduced by the previous functions and outputs a cleaned document ready to be processed. The only marks left are section marks, which can be helpful to guide the alignment.

There are situations where elements such as page numbers are important (for example, having page numbers in a book is convenient when making citations). However, removing these marks makes the previous steps irreversible. As such, this step is optional, and it is meant to be used only when a clean document is required for further processing.

2.3 Declarative objects

The first attempt at writing a program to clean books was a self-contained Perl script: all the patterns and logical steps were hard-coded into the program, with only a few command-line flags allowing to fine-tune the process. When this solution was discarded for not being powerful enough to allow more complex processing, several higher-level configuration objects were created: an ontology to describe section types and names, and domain-specific languages to describe configuration files.

These elements open the possibility to discuss the subject with people who have no programming experience, allowing them to directly contribute with their knowledge to the development of the software.

2.3.1 Sections Ontology

The information about sections relevant to the sectioning process is stored in an ontology file (Uschold and Gruninger, 1996). Currently it contains several section types and their relations (e.g. a Section is part of a Chapter, an Act contains one or more Scenes), names of common sections (e.g. Introduction, Index, Foreword) and ordinal and cardinal numbers.
The recognition of these elements is language-dependent. As such, mechanisms to deal with several different languages had to be implemented. Some of these languages even use a different alphabet (e.g. Russian), which led to interesting challenges.

Several extracts of the sections ontology are presented below. The first example shows several translations and ways of writing chapter; NT sec indicates section as a narrower term of chapter:

    1   cap
    2   PT capítulo, cap, capitulo
    3   FR chapitre, chap
    4   EN chapter, chap
    5   RU глава
    6   NT sec

In the next example, act is indicated as a broader term of scene:

    1   cena
    2   PT cena
    3   FR scène
    4   EN scene
    5   BT act

BT _alone allows to indicate that this term must appear alone in a line:

    1   end
    2   PT fim
    3   FR fin
    4   EN the_end
    5   BT _alone

The next extract shows the number 3 and several variations. BT _numeral identifies this term as a number, which consequently might be used to numerate some of the other terms:

    1   3
    2   PT   terceiro, terceira, três, tres
    3   EN   third, three
    4   FR   troisième, troisieme
    5   BT   _numeral

The typification of sections allows to define different rules and patterns to recognize each type. For example, a line beginning with Chapter One may be almost undoubtedly considered a section title, even if more text follows in the same line (which in many cases is the title of the chapter). On the other hand, a line beginning with a Roman numeral may also represent a chapter beginning, but only if nothing follows in the same line (or else century references, for example, would frequently be mistaken for chapter divisions).

Despite not yet being fully implemented, the relations between sections described in the ontology will allow to establish hierarchies and, for example, to search for Scenes after finding an Act, or even to classify a text as a play in the presence of either.

Taking advantage of Perl being a dynamic programming language (i.e. able to load new code at runtime), the ontology is being used to directly create the Perl code containing the complex data structures which are used in the process of sectioning. The ontology is in the ISO Thesaurus format, and the Perl module Biblio::Thesaurus (Simões and Almeida, 2002) is being used to manipulate it.
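The idea of turning ontology entries into Perl data structures can be pictured with the toy parser below. It reads entries written in the same style as the extracts above into hashes and builds a regular expression from them; the real module does this through Biblio::Thesaurus and produces considerably richer structures.

    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';

    # A toy excerpt in the spirit of the extracts shown above.
    my $ontology = join "\n",
        'cap',
        'PT capítulo, cap, capitulo',
        'FR chapitre, chap',
        'EN chapter, chap',
        'NT sec';

    my (%name, %narrower);
    my $term;
    for my $line (split /\n/, $ontology) {
        if    ($line =~ /^(\S+)\s*$/)                 { $term = $1 }   # new concept
        elsif ($line =~ /^NT\s+(.+)$/ && $term)       { push @{ $narrower{$term} }, split /,\s*/, $1 }
        elsif ($line =~ /^[A-Z]{2}\s+(.+)$/ && $term) { push @{ $name{$term} }, split /,\s*/, $1 }
    }

    # One alternation that recognises any of the ways of writing "chapter".
    my $chapter_re = join '|', @{ $name{cap} };
    print "pattern for 'cap': $chapter_re\n";
    print "narrower terms of 'cap': @{ $narrower{cap} }\n";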
2.3.2 Domain Specific Languages

The use of domain-specific languages (DSLs) will allow configuration files to control the flow of the process. Although not yet implemented, it is planned to have T::P::BC read a file describing external filters to be used at specific stages of the process (before or after any step).

This will allow calls to third-party tools, such as PDF-to-text converters or auxiliary cleaning scripts specific to a given type of file.

Both the sections ontology and the DSLs help to achieve a better organization, allowing to abstract the domain layer from the code and implementation details.

3 Evaluation

The evaluation of a tool such as T::P::BC might be performed by comparing the results of alignments of texts before and after being cleaned with T::P::BC.

In order to test T::P::BC, a set of 20 pairs of books from Shakespeare, in Portuguese and English, was selected. From the 40 books, half (the Portuguese ones) contained both headers and footers.

The books were aligned using cwb-align, an aligner tool bundled with the IMS Open Corpus Workbench (IMS Corpus Workbench, 1994-2002).

Pages, headers and footers

From a total of 1093 footers present in the documents, 1077 were found and removed (98.5%); 1183 headers were detected and removed out of a similar number of total occurrences.
Translation units

The alignment of texts resulted in the creation of translation memories, files in the TMX format (Translation Memory eXchange (OSCAR, 2000)). Another way to evaluate T::P::BC is by comparing the number of each type of translation unit in each file: 1:1, 1:0 or 0:1, 2:1 or 1:2, or 2:2 (the numbers represent how many segments were included in each variant of the translation unit). Table 1 summarizes the results obtained in the alignment of the 20 pairs of Shakespeare books.

    Alignm. Type      Original   Cleaned     Δ%
    0:1 or 1:0            8333      6570   -21.2
    1:1                  18802     23433   +24.6
    2:1 or 1:2           15673     11365   -27.5
    2:2                   5413      4297   -20.6
    Total Seg. PT        54864     53204    -3.0
    Total Seg. EN        59744     51515   -13.8

Table 1: Number of translation units obtained for each type of alignment, with and without using T::P::BC.

The total number of 1:1 alignments has increased: in the original documents, 39.0% of the translation units were of the 1:1 type, while the translation memory created after cleaning the files contained 51.3% of 1:1 translation units.

The total number of segments decreased in the cleaned version; some of the books used a specific theatrical notation which artificially increased the number of segments. This notation was normalized during the cleaning process with theatre_norm, a small Perl script created to normalize these undesired specific features, which led to the smaller number of segments.

The number of alignments considered bad by the aligner also decreased, from a total of 10 "bad alignment" cases in the original documents to 4 in the cleaned ones.

4 Conclusions and Future Work

Working with real situations, Text::Perfide::BookCleaner proved to be a useful tool to reduce text noise.

The possibility of using a set of configuration options helped to obtain the necessary adaptations to different uses.

Using T::P::BC, the translation memories produced in the alignment process have demonstrated an increase in the number of 1:1 translation units.

The use of ontologies for describing and storing the notation of text units (chapter, section, etc.) in a declarative, human-readable notation was very important for maintenance and for asking for collaboration from translator experts.

While processing the books from Shakespeare, we felt the need for a theater-specific filter which would normalize, for example, the stage directions and the notation used in the characters' names before each line. This is a practical example of the need to allow external filters to be applied at specific stages of the cleaning process. This feature will be implemented soon in T::P::BC.

A special effort will be devoted to the recognition of tables of contents, which is still one of the main causes of T::P::BC errors.

Additionally, the possibility of using standoff annotation in the intermediary steps is under consideration. That would mean indexing the document contents before processing, and keeping all the information about the changes to be made to the document in a separate file. This would allow the original file to remain untouched during the whole process, while still being able to commit the changes and produce a cleaned version.

Text::Perfide::BookCleaner will be made available on CPAN (http://www.cpan.org) in the very near future.

Acknowledgments

André Santos has a scholarship from Fundação para a Computação Científica Nacional, and the work reported here has been partially funded by Fundação para a Ciência e Tecnologia through project Per-Fide (PTDC/CLE-LLI/108948/2008).

References

Agarwal, S. and H. Yu. 2009. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion. Bioinformatics, 25(23):3174.

Araújo, S., J.J. Almeida, I. Dias, and A. Simões. 2010. Apresentação do projecto Per-Fide: Paralelizando o Português com seis outras línguas. Linguamática, page 71.
Evert, S. 2001. The CQP query language tutorial. IMS Stuttgart, 13.

Gupta, R. and S. Ahmed. 2007. Project Proposal Apache Tika.

IMS Corpus Workbench. 1994-2002. http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/.

Lee, K.H., N. Guttenberg, and V. McCrary. 2002. Standardization aspects of eBook content formats. Computer Standards & Interfaces, 24(3):227–239.

Mattmann, Chris A. and Jukka L. Zitting. 2011. Tika in Action. Manning Publications Co., 1st edition.

Noonburg, D. 2001. xpdf: A C++ library for accessing PDF.

Okita, T. 2009. Data cleaning for word alignment. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 72–80. Association for Computational Linguistics.

OSCAR. 2000. TMX: Translation Memory eXchange. LISA.

Robinson, N. 2001. A Comparison of Utilities for converting from Postscript or Portable Document Format to Text. Geneva: CERN, 31.

Simões, A. and J.J. Almeida. 2002. Library::*: a toolkit for digital libraries.

UnmarkedSoftware. 2011. TextSoap: For people working with other people's text.

Uschold, M. and M. Gruninger. 1996. Ontologies: Principles, methods and applications. The Knowledge Engineering Review, 11(02):93–136.

Varga, D., P. Halácsy, A. Kornai, V. Nagy, L. Németh, and V. Trón. 2005. Parallel corpora for medium density languages. Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005.

Véronis, J. 2000. From the Rosetta stone to the information society. pages 1–24.

Scientific and technical translation in English - week 3 2019Ron Martinez
 
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...kevig
 
ALGORITHM FOR TEXT TO GRAPH CONVERSION
ALGORITHM FOR TEXT TO GRAPH CONVERSION ALGORITHM FOR TEXT TO GRAPH CONVERSION
ALGORITHM FOR TEXT TO GRAPH CONVERSION ijnlc
 

Ähnlich wie Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books (20)

REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 
Intro_script_1_1programmingonfirstchaptwrinc
Intro_script_1_1programmingonfirstchaptwrincIntro_script_1_1programmingonfirstchaptwrinc
Intro_script_1_1programmingonfirstchaptwrinc
 
SECOND LANGUAGE RESEARCH.pptx
SECOND LANGUAGE RESEARCH.pptxSECOND LANGUAGE RESEARCH.pptx
SECOND LANGUAGE RESEARCH.pptx
 
A Lightweight Approach To Semantic Annotation Of Research Papers
A Lightweight Approach To Semantic Annotation Of Research PapersA Lightweight Approach To Semantic Annotation Of Research Papers
A Lightweight Approach To Semantic Annotation Of Research Papers
 
Lfnw2016
Lfnw2016Lfnw2016
Lfnw2016
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptx
 
Poster Aegis intenal poster_accessodf_20111117
Poster Aegis intenal poster_accessodf_20111117Poster Aegis intenal poster_accessodf_20111117
Poster Aegis intenal poster_accessodf_20111117
 
Tlf2016
Tlf2016Tlf2016
Tlf2016
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Olf2016
Olf2016Olf2016
Olf2016
 
Wroclaw university library - Grazyna Piotrowicz
Wroclaw university library - Grazyna PiotrowiczWroclaw university library - Grazyna Piotrowicz
Wroclaw university library - Grazyna Piotrowicz
 
Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  	Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  
 
Reengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsReengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software Specifications
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019
 
ResearchPaper
ResearchPaperResearchPaper
ResearchPaper
 
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...
 
ALGORITHM FOR TEXT TO GRAPH CONVERSION
ALGORITHM FOR TEXT TO GRAPH CONVERSION ALGORITHM FOR TEXT TO GRAPH CONVERSION
ALGORITHM FOR TEXT TO GRAPH CONVERSION
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
NLP todo
NLP todoNLP todo
NLP todo
 
Scale2016
Scale2016Scale2016
Scale2016
 

Mehr von andrefsantos

Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pintoandrefsantos
 
Identifying similar text documents
Identifying similar text documentsIdentifying similar text documents
Identifying similar text documentsandrefsantos
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesandrefsantos
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment andrefsantos
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatosandrefsantos
 
Bigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesBigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesandrefsantos
 

Mehr von andrefsantos (10)

Elasto Mania
Elasto ManiaElasto Mania
Elasto Mania
 
Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pinto
 
Slides
SlidesSlides
Slides
 
Identifying similar text documents
Identifying similar text documentsIdentifying similar text documents
Identifying similar text documents
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challenges
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
 
Bigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesBigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challenges
 
Bigorna
BigornaBigorna
Bigorna
 
Mojolicious lite
Mojolicious liteMojolicious lite
Mojolicious lite
 

Kürzlich hochgeladen

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Kürzlich hochgeladen (20)

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books

Many kinds of noise may be present in texts, resulting in a wrong interpretation of:

  • individual characters,
  • words, phrases, sentences,
  • paragraph frontiers,
  • pages, headers, footers and footnotes,
  • titles and section frontiers.

The variety of existing texts is huge, and many of the problems are related to specific types of text. In this document, the central concern is the task of cleaning text books, with a special focus on making them ready for alignment.

This process involves gathering, preparing, aligning and making available for query thousands of documents in several languages. The documents gathered come from a wide range of sources, and are obtained in several different formats and conditions. Some of these documents present specific issues which prevent them from being successfully aligned.

1.2 Motivation

The alignment of text documents is a process which depends heavily on the format and conditions of the documents to be aligned (Véronis, 2000). When processing documents of literary type (i.e. books), there are specific issues which give rise to additional difficulties.

The main characteristics of the documents that may negatively affect book alignment are:
File format: Documents available in PDF or Microsoft Word formats (or, generally, any structured or unstructured format other than plain text) need to be converted to plain text before being processed, using tools such as pdftotext (Noonburg, 2001) or Apache Tika (Gupta and Ahmed, 2007; Mattmann and Zitting, 2011). This conversion often leads to loss of information (text structure, text formatting, images, etc.) and to the introduction of noise (bad conversion of mathematical expressions, diacritics, tables and others) (Robinson, 2001);

Text encoding format: There are many different text encodings available. For example, Western-European languages can be encoded as ISO-8859-1 (also known as Latin1), Windows CP 1252, UTF-8 or others. Discovering the encoding used in a given document is frequently not a straightforward task, and dealing with a text while assuming the wrong encoding may have a negative impact on the results (a small detection sketch is given right after this list);

Structural residues: Some texts, despite being in plain text format, still contain structural elements from their previous format (for example, it is common to find page numbers, headers and footers in the middle of the text of documents which have been converted from PDF to plain text);

Unpaired sections: Sometimes one of the documents to be aligned contains one or more sections which are not present in the other document. This is a quite common occurrence with forewords to a given edition, preambles, etc.;

Sectioning notation: There are endless ways to represent and numerate the several divisions and subdivisions of a document (parts, chapters, sections, etc.). Several of these notations are language-dependent (e.g. Part One or Première Partie). As such, aligning section titles is not a trivial task. Identifying these marks can not only help to prevent alignment errors, but can even be used to guide the alignment process by pairing the sections first (structural alignment).
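As a small illustration of the encoding problem mentioned in the list above, the core Perl module Encode::Guess can be asked to guess an encoding among a set of suspects. This sketch is not part of T::P::BC; the candidate list (ISO-8859-1 and CP 1252) is an assumption for Western-European texts, and guessing between overlapping 8-bit encodings may simply fail.

    #!/usr/bin/perl
    # Illustrative only: guess a file's text encoding among a few suspects.
    use strict;
    use warnings;
    use Encode::Guess;

    my $file = shift @ARGV or die "usage: $0 file.txt\n";
    open my $fh, '<:raw', $file or die "$file: $!";
    my $data = do { local $/; <$fh> };
    close $fh;

    # utf8 and ascii are always tried; the 8-bit suspects are assumptions.
    my $enc = guess_encoding($data, qw/iso-8859-1 cp1252/);
    if ( ref $enc ) {
        print "probable encoding: ", $enc->name, "\n";
    }
    else {
        print "could not decide: $enc\n";   # $enc holds an error message
    }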
These problems are often enough to derail the alignment process, producing unacceptable results, and are found frequently enough to justify the development of a tool to pre-process the books, cleaning and preparing them for further processing.

There are other tools available which also address the issue of preparing corpora for alignment, such as the one described by Okita (2009), or of text cleaning, such as TextSoap, a program for cleaning text in several formats (UnmarkedSoftware, 2011).

2 Book Cleaner

In order to solve the problems described in the previous section, the first approach was to implement a Perl script which, with the help of UNIX utilities like grep and some regular expressions, attempted to find the undesired elements and normalize, replace or delete them. As we started to get a better grasp on the task, it became clear that such a naive approach was not enough. It was then decided to develop a Perl module, Text::Perfide::BookCleaner (T::P::BC), divided into separate functions, each one dealing with a specific problem.

2.1 Design Goals

T::P::BC was developed with several design goals in mind. Each component of the module meets the following three requirements:

Optional: depending on the conditions of the books to be cleaned, some steps of this process may not be desired or necessary. Thus, each step may be performed independently from the others, or not be performed at all;

Chainable: the functions can be used in sequence, with the output of one function passing as input to the next, with no intermediary steps required;

Reversible: as much as possible, no information is lost (everything removed from the original document is kept in separate files). This means that, at any time, it is possible to revert the book to a previous (or even the original) state.

Additionally, after the cleaning process, a diagnostic report is generated, aimed at helping the user to understand the problems detected in the book, and what was performed in order to solve them.
Below is presented an example of the diagnostic produced after cleaning a version of La maison à vapeur by Jules Verne:

    footers = ['( Page _NUM_ ) = 240'];
    headers = $VAR1 = ["(La maison \x{e0} vapeur    Jules Verne) = 241"];
    ctrL=1;
    pagnum_ctrL=241;
    sectionsO=2;
    sectionsN=30;
    word_per_emptyline=3107.86842105263;
    emptylines=37;
    word_per_line=11.5658603466849;
    lines=10211;
    To_be_indented=1;
    words=118099;
    word_per_indent=52.2793271359008;
    lines_w_pont=3263;

In this diagnostic, we can see, among other things, that this document had headers composed of the book title and author name, that footers contained the word Page followed by a number (presumably the page number), and that the 241 pages were separated with ^L.

The process is also configurable with the help of an ontology, and the use of configuration files written in domain-specific languages is also planned. Both are described in section 2.3.

2.2 Pipeline

We decided to tackle the problem in five different steps: pages, sections, paragraphs, footnotes, and words and characters. Each of these stages deals with a specific type of problem commonly found in plain text books. A commit option is also available, a final step which irreversibly removes all the markup added along the cleaning process. The different steps of this process are implemented as functions in T::P::BC.

[Figure 1: Pipeline of T::P::BC.]

Figure 1 presents the pipeline of the T::P::BC module. T::P::BC accepts as input documents in plain text format, with UTF-8 text encoding (other encoding formats are also supported). The final output is a summary report and the cleaned version of the original document, which may then be used in several processes, such as alignment (Varga et al., 2005; Evert, 2001), ebook generation (Lee, Guttenberg, and McCrary, 2002), or others.
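The paper does not show the module's programming interface, so the following is a purely illustrative sketch of how an optional, chainable, step-per-function pipeline like the one in Figure 1 might be driven. The placeholder names (clean_pages, mark_sections, etc.) only mirror the step names above and are not the real T::P::BC functions.

    #!/usr/bin/perl
    # Illustrative sketch of a chainable cleaning pipeline; the placeholder
    # steps below are identity transforms standing in for the real functions.
    use strict;
    use warnings;

    my @steps = (
        \&clean_pages,        # page breaks, headers, footers
        \&mark_sections,      # section titles and boundaries
        \&fix_paragraphs,     # paragraph and sentence notation
        \&move_footnotes,     # footnote call marks and expansions
        \&normalize_words,    # Unicode characters, translineations
    );

    my $text = do { local $/; <STDIN> };   # slurp the whole book
    $text = $_->($text) for @steps;        # output of one step feeds the next
    print $text;

    # Placeholder implementations only; each real step would also write the
    # removed material (headers, footnotes, ...) to a standoff file.
    sub clean_pages     { my ($t) = @_; return $t }
    sub mark_sections   { my ($t) = @_; return $t }
    sub fix_paragraphs  { my ($t) = @_; return $t }
    sub move_footnotes  { my ($t) = @_; return $t }
    sub normalize_words { my ($t) = @_; return $t }

Because every step takes plain text and returns plain text, any step can be skipped or the list reordered, which is one way to realize the Optional and Chainable design goals of section 2.1.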
2.2.1 Pages

Page breaks, headers and footers are structural elements of pages that must be handled with care when automatically processing texts. In the particular case of bitext alignment, the presence of these elements is often enough to confuse the aligner and decrease its performance. The headers and footers usually contain information about the book author, book name and section title. Despite the fact that the information provided by these elements might be relevant for some tasks, they should be identified, marked and possibly removed before further processing.

Below is presented an extract of a plain text version of La Maison à Vapeur, by Jules Verne, containing, from line 3 to 6, a page break (a footer, a page break character and a header, surrounded by blank lines):

    1 rencontrer le nabab, et assez audacieux pour
    2 s'emparer de sa personne.
    3
    4 Page 3
    5 ^L La maison à vapeur    Jules Verne
    6
    7 Le faquir, - évidemment le seul entre tous
    8 que ne surexcitât pas l'espoir de gagner la

The 0xC character (also known as Control+L, ^L or \f) indicates, in plain text, a page break. When present, these characters provide a reliable way to delimit pages, and page numbers, headers and footers should also be found near these breaks. There are books, however, which do not have their pages separated with this character.

Our algorithm starts by looking for page break characters. If it finds them, it immediately replaces them with a custom page break mark. If not, it tries to find occurrences of page numbers. These are typically lines with only three or fewer digits (four digits would occasionally get year references confused with page numbers), with a preceding and a following blank line.

After finding the page breaks, the algorithm attempts to identify headers and footers: it considers the lines which are near each page break as candidates for headers or footers (depending on whether they appear after or before the break, respectively). Then the number of occurrences of each candidate (normalized so that changes in numbers are ignored) is calculated, and those which occur more than a given threshold are considered headers or footers.

At the end, the headers and footers are moved to a standoff file, and only the page break mark is left in the original text. A cleaned up version of the previous example is presented below:

    est vrai qu'il fallait être assez chanceux pour
    rencontrer le nabab, et assez audacieux pour
    s'emparer de sa personne. _pb2_
    Le faquir, - évidemment le seul entre tous
    que ne surexcitât pas l'espoir de gagner la
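For illustration, the two heuristics just described can be sketched as follows. This is not the module's code: the two-line neighbourhood around each break and the "more than half of the pages" threshold are assumptions; only the idea of counting number-normalized candidates comes from the description above.

    #!/usr/bin/perl
    # Illustrative sketch: find page-number lines and repeated header/footer
    # candidates in a plain text book read from STDIN.
    use strict;
    use warnings;

    my @lines = <STDIN>;

    # A "page number" line: at most 3 digits, surrounded by blank lines.
    my @page_breaks;
    for my $i (1 .. $#lines - 1) {
        if (   $lines[$i]     =~ /^\s*\d{1,3}\s*$/
            && $lines[$i - 1] =~ /^\s*$/
            && $lines[$i + 1] =~ /^\s*$/ )
        {
            push @page_breaks, $i;
        }
    }

    # Candidate footers (just before the break) and headers (just after it),
    # with digits normalized away so that changing page numbers still count
    # as one and the same candidate.
    my %count;
    for my $i (@page_breaks) {
        for my $j ($i - 2, $i + 2) {
            next if $j < 0 or $j > $#lines;
            ( my $norm = $lines[$j] ) =~ s/\d+/_NUM_/g;
            $norm =~ s/^\s+|\s+$//g;
            $count{$norm}++ if length $norm;
        }
    }

    # Candidates repeated on more than half of the detected pages are
    # reported as probable headers or footers (threshold is an assumption).
    my $threshold = @page_breaks / 2;
    print "$_ = $count{$_}\n" for grep { $count{$_} > $threshold } keys %count;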
2.2.2 Sections

There are several reasons that justify the importance of delimiting sections in a book:

  • automatically generating tables of contents;
  • identifying sections from one version of a book that are missing in another version (for example, it is common for some editions of a book to have an exclusive preamble or afterword which cannot be found in the other editions);
  • matching and comparing the sections of two books before aligning them. This makes it possible to assess the books' alignability (compatibility for alignment), to predict the results of the alignment and even to simply discard the worst sections;
  • in areas like biomedical text mining, being able to detect the different sections of a document allows searching for specific information in specific sections (Agarwal and Yu, 2009).

Being an unstructured format, plain text by itself does not have the necessary means to formally represent the hierarchy of divisions of a book (chapters, sections, etc.). As such, the notation used to write the titles of the several divisions is dictated by the personal choices of whoever transcribed the book to electronic format, or by their previous format and the tool used to perform the conversion to plain text. Additionally, the nomenclature used is often language-dependent (e.g. Part One and Première Partie, or Chapter II and Capítulo 2).

Below is presented the beginning of the first section of a Portuguese version of Les Misérables, by Victor Hugo:

    PRIMEIRA PARTE

    FANTINE

    ^L LIVRO PRIMEIRO

    UM JUSTO

    O abade Myriel

    Em 1815, era bispo de Digne, o reverendo Carlos
    Francisco Bemvindo Myriel, o qual contava setenta

In order to help in the sectioning process, an ontology was built (see section 2.3.1), which includes several section types and their relations, names of common sections, and ordinal and cardinal numbers.

The algorithm tries to find lines containing section names, pages or lines containing only numbers, or lines with Roman numerals. Then a custom section mark is added, containing a normalized version of the original title (e.g. Roman numerals are converted to Arabic numbers). The processed version of the example shown above follows:

    _sec+N:part=1_ PRIMEIRA PARTE

    FANTINE

    _sec+N:book=1_ LIVRO PRIMEIRO

    UM JUSTO

    O abade Myriel

    Em 1815, era bispo de Digne, o reverendo Carlos
    Francisco Bemvindo Myriel, o qual contava setenta
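A minimal sketch of this step is shown below. The mark format (_sec+N:part=1_) follows the example above, but everything else is an assumption: the word lists are tiny stand-ins for the ontology of section 2.3.1, and only simple two-word titles are handled.

    #!/usr/bin/perl
    # Illustrative sectioning sketch: tag simple section-title lines with a
    # normalized mark.  Word lists are small samples, not the real ontology.
    use strict;
    use warnings;

    my %ordinal = (
        'primeiro' => 1, 'primeira' => 1, 'first'  => 1, 'one' => 1,
        'segundo'  => 2, 'segunda'  => 2, 'second' => 2, 'two' => 2,
    );
    my %section_type = (
        'parte' => 'part', 'part' => 'part',
        'livro' => 'book', 'book' => 'book',
        'capitulo' => 'chapter', 'chapter' => 'chapter',
    );

    sub roman2arabic {     # tiny Roman numeral helper (returns undef if not Roman)
        my %v = (I => 1, V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000);
        my ($s, $total, $prev) = (uc shift, 0, 0);
        for my $c (reverse split //, $s) {
            return undef unless exists $v{$c};
            $total += $v{$c} < $prev ? -$v{$c} : $v{$c};
            $prev = $v{$c};
        }
        return $total;
    }

    while (my $line = <STDIN>) {
        # e.g. "PRIMEIRA PARTE", "LIVRO PRIMEIRO", "Chapter II"
        if ($line =~ /^\s*(\w+)\s+(\w+)\s*$/) {
            my ($a, $b) = (lc $1, lc $2);
            my ($type) = grep { $section_type{$_} } ($a, $b);
            my ($num)  = grep { defined } map { $ordinal{$_} // roman2arabic($_) } ($a, $b);
            if ($type && defined $num) {
                print "_sec+N:$section_type{$type}=${num}_ $line";
                next;
            }
        }
        print $line;
    }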
2.2.3 Paragraphs

When reading a book, we often assume that a new line, especially if indented, represents a new paragraph, and that sentences within the same paragraph are separated only by the punctuation mark and a space.

However, when dealing with plain text books, several other representations are possible: some books break sentences with a new line and paragraphs with a blank line (two consecutive new line characters); others use indentation to represent paragraphs; and there are even books where new line characters are used only to wrap text. There are also books where direct speech (from a character of the story) is represented with a leading dash; others use quotation marks to the same end.

The detection of paragraph and sentence boundaries is of the utmost importance in the alignment process, and a wrong delimitation is often responsible for a bad alignment.

In order to be able to detect how paragraphs and sentences are represented, our algorithm starts by measuring several parameters, such as the number of words, the number of lines, the number of empty lines and the number of indented lines. Then several metrics are calculated based on these parameters, which are used to understand how paragraphs are represented, and the text is fixed accordingly.

    Algorithm 1: Paragraph
    Input:  txt: text of the book
    Output: txt: text with improved paragraph notation
      elines   <- calculate empty lines
      lines    <- calculate number of lines
      words    <- calculate number of words
      plines   <- number of lines with punctuation
      indent_i <- calculate indentation distribution
      plr <- plines / lines              // punctuated lines ratio
      forall i in dom(indent) do
          if i > 10 or indent_i < 12 then
              remove(indent_i)           // remove false indent
      wpl  <- words / lines              // words per line
      wpel <- words / (1 + elines)       // words per empty line
      wpi  <- words / indent             // words per indented line
      if wpel > 150 then
          if wpi in [10..100] then
              separate paragraphs by indentation
          else if wpl in [10..100] and plr > 0.6 then
              separate paragraphs by new lines
      else
          separate paragraphs by empty lines
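The following Perl sketch mirrors the measurements of Algorithm 1. The thresholds (150, 10..100, 0.6) are taken from the pseudocode above; the "two or more leading spaces" definition of an indented line, and the nesting of the decision (which is ambiguous in the source layout), are assumptions.

    #!/usr/bin/perl
    # Illustrative sketch: measure a book and guess its paragraph notation.
    use strict;
    use warnings;

    my @lines = <STDIN>;
    my $lines = scalar @lines;
    die "empty input\n" unless $lines;

    my $elines = grep { /^\s*$/ }      @lines;   # empty lines
    my $plines = grep { /[.!?]\s*$/ }  @lines;   # lines ending in punctuation
    my $indent = grep { /^\s{2,}\S/ }  @lines;   # indented lines (assumption)

    my $words = 0;
    for my $line (@lines) {
        my @w = $line =~ /\S+/g;                 # tokens on this line
        $words += @w;
    }

    my $wpl  = $words / $lines;                  # words per line
    my $wpel = $words / (1 + $elines);           # words per empty line
    my $wpi  = $indent ? $words / $indent : 0;   # words per indented line
    my $plr  = $plines / $lines;                 # punctuated lines ratio

    my $strategy = 'empty lines';
    if ( $wpel > 150 ) {                         # few empty lines: not separators
        if    ( $wpi >= 10 && $wpi <= 100 )               { $strategy = 'indentation' }
        elsif ( $wpl >= 10 && $wpl <= 100 && $plr > 0.6 ) { $strategy = 'new lines'   }
    }
    print "paragraphs appear to be separated by: $strategy\n";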
2.2.4 Footnotes

Footnotes are formed by a call mark inserted in the middle of the text (often appended to the end of a word or phrase), and the footnote expansion (i.e. the text of the footnote). The placement of the footnote expansion shifts from book to book, but it generally appears either at the end of the same page where its call mark can be found, or in a dedicated section at the end of the document.

Depending on the notation used to represent footnote call marks, these can introduce noise in the alignment (for example, if the aligner confuses them with sentence breaks). At best, the alignment might work well, but the resulting corpora will be polluted with the call marks, interfering with their further use. Footnote expansions are even more likely to disturb the process, as it is very unlikely that the matching footnotes are found at the same place in both books (the page divisions would have to be exactly the same).

Below is presented an example of a page containing footnote call marks and, at the end, footnote expansions, followed by the beginning of another page:

    roi Charles V, fils de Jean II, auprès de la rue
    Saint-Antoine, à la porte des Tournelles[1].

    [1] La Bastille, qui fut prise par le peuple de
    Paris, le 14 juillet 1789, puis démolie. B.

    ^L Quel était en chemin l'étonnement de l'Ingénu!
    je vous le laisse à penser. Il crut d'abord que

The removal of footnotes is performed in two steps: first the expansions are removed, then the call marks. The devised algorithm searches for lines starting with a plausible pattern (e.g. <<1>>, [2] or ^3) and followed by blank lines. These are considered footnote expansions, and are consequently replaced by a custom footnote mark and moved to a standoff file. Once the footnote expansions have been removed, the remaining marks (following the same patterns) are likely to be call marks, and as such are removed and replaced by a custom normalized mark. (Ideally one would establish the correspondence between each footnote call mark and its respective expansion; however, that feature is not especially relevant to the focus of this work, as the effort it would require is not compensated by its practical results.)
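A minimal sketch of this two-pass removal is shown below. The mark names (_fne..._, _fnr..._) follow the cleaned example in the next extract; the exact regular expressions and the numbering are assumptions, not the module's code.

    #!/usr/bin/perl
    # Illustrative two-pass footnote removal: expansions first, then call marks.
    use strict;
    use warnings;

    local $/;                      # slurp the whole book
    my $text = <STDIN>;

    my $mark = qr/(?:<<\d+>>|\[\d+\]|\^\d+)/;   # <<1>>, [2] or ^3

    # Pass 1: expansions (a mark at line start, ended by a blank line) are
    # collected and replaced by a placeholder mark.
    my @expansions;
    my $n = 0;
    $text =~ s{^($mark[^\n]*(?:\n[^\n]+)*)\n\s*\n}
              { push @expansions, $1; '_fne' . $n++ . "_\n\n" }gmex;

    # Pass 2: the remaining marks are treated as call marks and normalized.
    my $c = 0;
    $text =~ s/$mark/'_fnr' . $c++ . '_'/ge;

    print $text;
    # @expansions would be written to a standoff file by the real tool.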
After removing the footnotes, the example given above would look like the following:

    roi Charles V, fils de Jean II, auprès de la rue
    Saint-Antoine, à la porte des Tournelles_fnr29_.
    _fne8_
    ^L Quel était en chemin l'étonnement de l'Ingénu!
    je vous le laisse à penser. Il crut d'abord que

2.2.5 Words and characters

At word and character level, there are several problems that can affect the alignment of two books: Unicode characters, translineations and transpaginations.

Unicode characters: often, Unicode-only versions of previously existing ASCII characters are used. For example, Unicode has several types of dashes (e.g. the hyphen-minus '-', the en dash '–' and the em dash), where ASCII only has one (i.e. '-'). While these do not represent the exact same character, they are close enough, and their normalization allows the production of more similar results.

Glyphs: some documents use glyphs (Unicode characters) to represent combinations of specific characters, like fi or ff.

Translineation: translineation happens when a given word is split across two lines in order to keep the line length constant. If the two parts of translineated words are not rejoined before aligning, the aligner may see them as two separate words.

Transpagination: transpagination happens when a translineation occurs at the last line of a page, resulting in a word split across pages. However, the concept of page is removed by the first function, pages, which removes headers and footers and concatenates the pages. This means that transpagination occurrences get reduced to translineations.

The algorithm for dealing with translineations and transpaginations is available as an optional feature. This is mainly because different languages have different standard ways of dealing with translineations, and many documents use incorrect forms of translineation. As such, the user has the option to turn this feature on or off. The algorithm works by looking for words split across lines (word characters, followed by a dash, a newline and more word characters). When this pattern is found, the newline is removed and appended to the end of the word instead.

Dealing with Unicode characters is performed with a simple search and replace. Characters which have a corresponding character in ASCII are directly replaced, while stranger characters are replaced with a normalized custom mark.
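The word- and character-level fixes above can be sketched as follows. The translineation pattern follows the description in the text; the dash and ligature tables are small samples chosen here for illustration, not the module's actual replacement tables.

    #!/usr/bin/perl
    # Illustrative sketch: rejoin translineated words and normalize a few
    # Unicode characters towards their ASCII counterparts.
    use strict;
    use warnings;
    binmode STDIN,  ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    local $/;
    my $text = <STDIN>;

    # Translineation: word characters, a dash, a newline, more word characters.
    # The newline is removed and re-appended after the rejoined word.
    $text =~ s/(\w+)-\n(\w+)/$1$2\n/g;

    # Unicode dashes close to the ASCII hyphen are normalized to '-'.
    $text =~ tr/\x{2013}\x{2014}\x{2212}/-/;     # en dash, em dash, minus sign

    # Common ligature glyphs expanded to their letter sequences.
    my %glyph = ( "\x{FB01}" => 'fi', "\x{FB02}" => 'fl' );
    $text =~ s/([\x{FB01}\x{FB02}])/$glyph{$1}/g;

    print $text;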
2.2.6 Commit

This last step removes the custom marks introduced by the previous functions and outputs a cleaned document ready to be processed. The only marks left are section marks, which can be helpful to guide the alignment.

There are situations where elements such as page numbers are important (for example, having page numbers in a book is convenient when making citations). However, removing these marks makes the previous steps irreversible. As such, this step is optional, and it is meant to be used only when a clean document is required for further processing.

2.3 Declarative objects

The first attempt at writing a program to clean books was a self-contained Perl script: all the patterns and logical steps were hard-coded into the program, with only a few command-line flags allowing the process to be fine-tuned. When this solution was discarded for not being powerful enough to allow more complex processing, several higher level configuration objects were created: an ontology to describe section types and names, and domain-specific languages to describe configuration files.

These elements open the possibility of discussing the subject with people who have no programming experience, allowing them to directly contribute with their knowledge to the development of the software.

2.3.1 Sections Ontology

The information about sections relevant to the sectioning process is stored in an ontology file (Uschold and Gruninger, 1996). Currently it contains several section types and their relations (e.g. a Section is part of a Chapter, an Act contains one or more Scenes), names of common sections (e.g. Introduction, Index, Foreword) and ordinal and cardinal numbers.

The recognition of these elements is language-dependent. As such, mechanisms to deal with several different languages had to be implemented. Some of these languages even use a different alphabet (e.g. Russian), which led to interesting challenges.
Several extracts of the sections ontology are presented below. The first example shows several translations and ways of writing chapter, and NT sec indicates section as a narrower term of chapter:

    cap
        PT capítulo, cap, capitulo
        FR chapitre, chap
        EN chapter, chap
        RU глава
        NT sec

In the next example, act is indicated as a broader term of scene:

    cena
        PT cena
        FR scène
        EN scene
        BT act

BT _alone allows us to indicate that a term must appear alone in a line:

    end
        PT fim
        FR fin
        EN the_end
        BT _alone

The following entry defines the number 3 and several variations; BT _numeral identifies the term as a number, and consequently it might be used to numerate some of the other terms:

    3
        PT terceiro, terceira, três, tres
        EN third, three
        FR troisième, troisieme
        BT _numeral

The typification of sections allows the definition of different rules and patterns to recognize each type. For example, a line beginning with Chapter One may be almost undoubtedly considered a section title, even if more text follows in the same line (which in many cases is the title of the chapter). On the other hand, a line beginning with a Roman numeral may also represent a chapter beginning, but only if nothing follows in the same line (or else century references, for example, would frequently be mistaken for chapter divisions).

Although not yet fully implemented, the relations between sections described in the ontology will allow hierarchies to be established and, for example, to search for Scenes after finding an Act, or even to classify a text as a play in the presence of either.

Taking advantage of Perl being a dynamic programming language (i.e. able to load new code at runtime), the ontology is used to directly create the Perl code containing the complex data structures which are used in the process of sectioning. The ontology is in the ISO Thesaurus format, and the Perl module Biblio::Thesaurus (Simões and Almeida, 2002) is used to manipulate it.
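Purely for illustration, the kind of Perl data structure that could be generated from such an ontology might look like the sketch below. The actual structures produced via Biblio::Thesaurus are not shown in the paper, so the shape and the entries here are assumptions based only on the extracts above.

    #!/usr/bin/perl
    # Illustrative only: a possible compiled form of a few ontology entries.
    use strict;
    use warnings;

    my %section_term = (
        'capitulo' => { type => 'chapter', lang => 'PT' },
        'chapitre' => { type => 'chapter', lang => 'FR' },
        'chapter'  => { type => 'chapter', lang => 'EN' },
        'cena'     => { type => 'scene',   lang => 'PT' },
        'scene'    => { type => 'scene',   lang => 'EN' },
    );
    my %narrower = ( chapter => ['sec'], act => ['scene'] );   # NT relations
    my %numeral  = ( 'terceiro' => 3, 'third' => 3, 'troisieme' => 3 );

    print "term types: ", join(', ', sort keys %section_term), "\n";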
2.3.2 Domain Specific Languages

The use of domain-specific languages (DSLs) will allow configuration files which control the flow of the process. Although not yet implemented, it is planned to have T::P::BC read a file describing external filters to be used at specific stages of the process (before or after any step). This will allow calls to third-party tools, such as PDF-to-text converters or auxiliary cleaning scripts specific to a given type of file.

Both the sections ontology and the DSLs help to achieve a better organization, allowing the domain layer to be abstracted from the code and implementation details.
3 Evaluation

The evaluation of a tool such as T::P::BC might be performed by comparing the results of alignments of texts before and after being cleaned with T::P::BC.

In order to test T::P::BC, a set of 20 pairs of books from Shakespeare, both in Portuguese and English, was selected. Of the 40 books, half (the Portuguese ones) contained both headers and footers. The books were aligned using cwb-align, an aligner tool bundled with the IMS Open Corpus Workbench (IMS Corpus Workbench, 1994-2002).

Pages, headers and footers

From a total of 1093 footers present in the documents, 1077 were found and removed (98.5%), and 1183 headers were detected and removed from a similar number of total occurrences.

Translation units

The alignment of the texts resulted in the creation of translation memories, files in the TMX format (Translation Memory eXchange) (Oscar, 2000). Another way to evaluate T::P::BC is to compare the number of each type of translation unit in each file, either 1:1, 1:0 or 0:1, or 2:2 (the numbers represent how many segments were included in each variant of the translation unit). Table 1 summarizes the results obtained in the alignment of the 20 pairs of Shakespeare books.

    Alignm. Type     Original   Cleaned     ∆%
    0:1 or 1:0           8333      6570   -21.2
    1:1                 18802     23433   +24.6
    2:1 or 1:2          15673     11365   -27.5
    2:2                  5413      4297   -20.6
    Total Seg. PT       54864     53204    -3.0
    Total Seg. EN       59744     51515   -13.8

    Table 1: Number of translation units obtained for each type of alignment,
    with and without using T::P::BC.

The total number of 1:1 alignments has increased: in the original documents, 39.0% of the translation units were of the 1:1 type, while the translation memory created after cleaning the files contained 51.3% of 1:1 translation units.

The total number of segments decreased in the cleaned version; some of the books used a specific theatrical notation which artificially increased the number of segments. This notation was normalized during the cleaning process with theatre_norm, a small Perl script created to normalize these undesired specific features, which led to the smaller number of segments.

The number of alignments considered bad by the aligner also decreased, from a total of 10 "bad alignment" cases in the original documents to 4 in the cleaned ones.
4 Conclusions and Future Work

Working with real situations, Text::Perfide::BookCleaner proved to be a useful tool to reduce text noise. The possibility of using a set of configuration options helped to obtain the necessary adaptations to different uses. Using T::P::BC, the translation memories produced in the alignment process demonstrated an increase in the number of 1:1 translation units.

The use of ontologies for describing and storing the notation of text units (chapter, section, etc.) in a declarative, human-readable notation was very important for maintenance and for asking for collaboration from translation experts.

While processing the books from Shakespeare, we felt the need for a theater-specific filter which would normalize, for example, the stage directions and the notation used for character names before each line. This is a practical example of the need to allow external filters to be applied at specific stages of the cleaning process. This feature will be implemented soon in T::P::BC.

A special effort will be devoted to the recognition of tables of contents, which is still one of the main causes of T::P::BC errors.

Additionally, the possibility of using standoff annotation in the intermediary steps is under consideration. That would mean indexing the document contents before processing, and keeping all the information about the changes to be made to the document in a separate file. This would allow the original file to remain untouched during the whole process, while still being able to commit the changes and produce a cleaned version.

Text::Perfide::BookCleaner will be made available on CPAN (http://www.cpan.org) in the very near future.

Acknowledgments

André Santos has a scholarship from Fundação para a Computação Científica Nacional, and the work reported here has been partially funded by Fundação para a Ciência e Tecnologia through project Per-Fide PTDC/CLE-LLI/108948/2008.
References

Agarwal, S. and H. Yu. 2009. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion. Bioinformatics, 25(23):3174.

Araújo, S., J.J. Almeida, I. Dias, and A. Simões. 2010. Apresentação do projecto Per-Fide: Paralelizando o Português com seis outras línguas. Linguamática, page 71.

Evert, S. 2001. The CQP query language tutorial. IMS Stuttgart, 13.

Gupta, R. and S. Ahmed. 2007. Project Proposal Apache Tika.

IMS Corpus Workbench. 1994-2002. http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/.

Lee, K.H., N. Guttenberg, and V. McCrary. 2002. Standardization aspects of eBook content formats. Computer Standards & Interfaces, 24(3):227-239.

Mattmann, Chris A. and Jukka L. Zitting. 2011. Tika in Action. Manning Publications Co., 1st edition.

Noonburg, D. 2001. xpdf: A C++ library for accessing PDF.

Okita, T. 2009. Data cleaning for word alignment. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 72-80. Association for Computational Linguistics.

Oscar. 2000. TMX: Translation Memory eXchange. LISA.

Robinson, N. 2001. A Comparison of Utilities for converting from Postscript or Portable Document Format to Text. Geneva: CERN, 31.

Simões, A. and J.J. Almeida. 2002. Library::*: a toolkit for digital libraries.

UnmarkedSoftware. 2011. TextSoap: For people working with other people's text.

Uschold, M. and M. Gruninger. 1996. Ontologies: Principles, methods and applications. The Knowledge Engineering Review, 11(02):93-136.

Varga, D., P. Halácsy, A. Kornai, V. Nagy, L. Németh, and V. Trón. 2005. Parallel corpora for medium density languages. Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005.

Véronis, J. 2000. From the Rosetta stone to the information society. pages 1-24.