Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books

Text::Perfide::BookCleaner, un módulo Perl para limpieza de libros en texto plano

André Santos (pg15973@alunos.uminho.pt)
José João Almeida (jj@di.uminho.pt)
Departamento de Informática, Universidade do Minho, Braga, Portugal
Resumen: En este trabajo se presenta Text::Perfide::BookCleaner, un módulo Perl para pre-procesamiento de libros en formato de texto plano para posterior alineamiento u otros usos. Tareas de limpieza incluyen: eliminación de saltos de página, números de página, encabezados y pies de página; encontrar títulos y fronteras de las secciones; eliminación de las notas de pie de página y normalización de la notación de párrafo y caracteres Unicode. El proceso es guiado con la ayuda de objetos declarativos como ontologías y ficheros de configuración. Se realizó una evaluación comparativa de las alineaciones con y sin Text::Perfide::BookCleaner, sobre la cual se presentan los resultados y conclusiones.
Palabras clave: corpora paralelos, alineamiento de documentos, herramientas de PLN
Abstract: This paper presents Text::Perfide::BookCleaner, an application to pre-process plain text books and clean them for any further arbitrary use, e.g., text alignment, format conversion or information retrieval. Cleaning tasks include removing page breaks, page numbers, headers and footers; finding section titles and boundaries; removing footnotes; and normalizing paragraph notation and Unicode characters. The process is guided with the help of declarative objects such as ontologies and configuration files. A comparative evaluation of alignments with and without Text::Perfide::BookCleaner was performed, and the results and conclusions are presented.
Keywords: parallel corpora, document alignment, NLP tools
1 Introduction

In many tasks related to natural language processing, text is the starting point, and the quality of the results that can be obtained depends strongly on the quality of the text itself.
Many kinds of noise may be present in texts, resulting in a wrong interpretation of:

• individual characters,
• words, phrases, sentences,
• paragraph frontiers,
• pages, headers, footers and footnotes,
• titles and section frontiers.

The variety of existing texts is huge, and many of the problems are related to specific types of text.
In this document, the central concern is the task of cleaning text books, with a special focus on making them ready for alignment.

1.1 Context

The work presented in this paper has been developed within Project Per-Fide, which aims to build a large parallel corpus (Araújo et al., 2010). This process involves gathering, preparing, aligning and making available for query thousands of documents in several languages.
The documents gathered come from a wide range of sources, and are obtained in several different formats and conditions. Some of these documents present specific issues which prevent them from being successfully aligned.

1.2 Motivation

The alignment of text documents is a process which depends heavily on the format and condition of the documents to be aligned (Véronis, 2000). When processing documents of a literary type (i.e. books), there are specific issues which give rise to additional difficulties.
The main characteristics of the documents
that may negatively affect book alignment are:

File format: Documents available in PDF or Microsoft Word formats (or, generally, any structured or unstructured format other than plain text) need to be converted to plain text before being processed, using tools such as pdftotext (Noonburg, 2001) or Apache Tika (Gupta and Ahmed, 2007; Mattmann and Zitting, 2011). This conversion often leads to loss of information (text structure, text formatting, images, etc.) and to the introduction of noise (bad conversion of mathematical expressions, diacritics, tables and others) (Robinson, 2001);

Text encoding format: There are many different text encoding formats available. For example, Western-European languages can be encoded as ISO-8859-1 (also known as Latin-1), Windows CP-1252, UTF-8 or others. Discovering the encoding used in a given document is frequently not a straightforward task, and dealing with a text while assuming the wrong encoding may have a negative impact on the results;

Structural residues: Some texts, despite being in plain text format, still contain structural elements from their previous format (for example, it is common to find page numbers, headers and footers in the middle of the text of documents which have been converted from PDF to plain text);

Unpaired sections: Sometimes one of the documents to be aligned contains one or more sections which are not present in the other document. This is a quite common occurrence with forewords to a given edition, preambles, etc.;

Sectioning notation: There are endless ways to represent and number the several divisions and subdivisions of a document (parts, chapters, sections, etc.). Several of these notations are language-dependent (e.g. Part One or Première Partie). As such, aligning section titles is not a trivial task. Identifying these marks can not only help to prevent this problem, but can even be used to guide the alignment process by pairing the sections first (structural alignment).

These problems are often enough to derail the alignment process, producing unacceptable results, and are found frequently enough to justify the development of a tool to pre-process the books, cleaning and preparing them for further processing.
There are other tools available which also address the issue of preparing corpora for alignment, such as the one described by Okita (2009), or text cleaning, such as TextSoap, a program for cleaning text in several formats (UnmarkedSoftware, 2011).

2 Book Cleaner

In order to solve the problems described in the previous section, the first approach was to implement a Perl script which, with the help of UNIX utilities like grep and some regular expressions, attempted to find the undesired elements and normalize, replace or delete them. As we started to get a better grasp of the task, it became clear that such a naive approach was not enough.
It was then decided to develop a Perl module, Text::Perfide::BookCleaner (T::P::BC), divided into separate functions, each one dealing with a specific problem.

2.1 Design Goals

T::P::BC was developed with several design goals in mind. Each component of the module meets the following three requirements:

Optional: depending on the condition of the books to be cleaned, some steps of this process may not be desired or necessary. Thus, each step may be performed independently from the others, or not be performed at all;

Chainable: the functions can be used in sequence, with the output of one function passing as input to the next, with no intermediary steps required;

Reversible: as much as possible, no information is lost (everything removed from the original document is kept in separate files). This means that, at any time, it is possible to revert the book to a previous (or even the original) state.

Additionally, after the cleaning process, a diagnostic report is generated, aimed at helping the user to understand the problems detected in the book, and what was performed in order to solve them.
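These three requirements can be illustrated with a short sketch, written in Python rather than the module's Perl; the step functions, patterns and names below are invented for illustration and are not T::P::BC's actual API:

```python
import re

# Each step takes the book text and returns (cleaned_text, removed), where
# `removed` collects everything stripped out so it can be stored in a
# standoff file and the step later reverted.  Function names are invented.
def strip_page_numbers(text):
    pat = r'(?m)^\s*\d{1,3}\s*$'
    return re.sub(pat, '', text), re.findall(pat, text)

def normalize_dashes(text):
    pat = r'[\u2013\u2014]'          # en and em dash -> ASCII hyphen
    return re.sub(pat, '-', text), re.findall(pat, text)

def clean(text, steps):
    """Chainable: the output of one step feeds the next.
    Optional: the caller chooses which steps run at all."""
    standoff = {}                    # reversible: removals kept per step
    for step in steps:
        text, standoff[step.__name__] = step(text)
    return text, standoff

cleaned, standoff = clean('Some text.\n42\nAn en\u2013dash.',
                          [strip_page_numbers, normalize_dashes])
```

Because each step returns both the cleaned text and what was removed, skipping a step, reordering the chain, or restoring removed material are all straightforward.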
Below is presented an example of the diagnostic produced after cleaning a version of La Maison à vapeur, by Jules Verne:

1 footers = ['( Page _NUM_ ) = 240'];
2 headers=$VAR1 =
3 ["(La maison \x{e0} vapeur Jules Verne) = 241"];
4 ctrL=1;
5 pagnum_ctrL=241;
6 sectionsO=2;
7 sectionsN=30;
8 word_per_emptyline=3107.86842105263;
9 emptylines=37;
10 word_per_line=11.5658603466849;
11 lines=10211;
12 To_be_indented=1;
13 words=118099;
14 word_per_indent=52.2793271359008;
15 lines_w_pont=3263;

In this diagnostic, we can see, among other things, that this document had headers composed of the book title and author name, that footers contained the word Page followed by a number (presumably the page number), and that the 241 pages were separated with ^L.
The process is also configurable with the help of an ontology, and the use of configuration files written in domain-specific languages is also planned. Both are described in section 2.3.

2.2 Pipeline

We decided to tackle the problem in five different steps: pages, sections, paragraphs, footnotes, and words and characters. Each of these stages deals with a specific type of problem commonly found in plain text books. A commit option is also available, a final step which irreversibly removes all the markup added along the cleaning process. The different steps of this process are implemented as functions in T::P::BC.
Figure 1 presents the pipeline of the T::P::BC module.

Figure 1: Pipeline of T::P::BC.

T::P::BC accepts as input documents in plain text format, with UTF-8 text encoding (other encoding formats are also supported). The final output is a summary report, and the cleaned version of the original document, which may then be used in several processes, such as alignment (Varga et al., 2005; Evert, 2001), ebook generation (Lee, Guttenberg, and McCrary, 2002), or others.

2.2.1 Pages

Page breaks, headers and footers are structural elements of pages that must be dealt with care when automatically processing texts. In the particular case of bitext alignment, the presence of these elements is often enough to confuse the aligner and decrease its performance. The headers and footers usually contain information about the book author, book name and section title. Despite the fact that the information provided by these elements might be relevant for some tasks, they should be identified, marked and possibly removed before further processing.
Below is presented an extract of a plain text version of La Maison à vapeur, by Jules Verne, containing, from line 3 to 6, a page break (a footer, a page break character and a header, surrounded by blank lines).

1 rencontrer le nabab, et assez audacieux pour
2 s'emparer de sa personne.
3
4 Page 3
5 ^L La maison à vapeur Jules Verne
6
7 Le faquir, - évidemment le seul entre tous
8 que ne surexcitât pas l'espoir de gagner la

The 0x0C character (also known as Control-L, ^L or \f) indicates, in plain text, a page break. When present, these characters provide a reliable way to delimit pages. Page numbers, headers and footers should also be found near these breaks. There are books, however, which do not have their pages separated with this character.
Our algorithm starts by looking for page break characters. If it finds them, it immediately replaces them with a custom page break mark.
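This replacement of explicit break characters with a numbered mark can be sketched as follows (in Python for illustration; the module itself is Perl, and the counter logic is an assumption based on the `_pb2_` mark visible in the cleaned extract):

```python
import re

def mark_page_breaks(text):
    """Replace form-feed page breaks (0x0C, ^L) with numbered custom marks.

    The `_pbN_` mark syntax mirrors the marks shown in the paper's cleaned
    extracts; numbering the marks sequentially is an assumption.
    """
    counter = 0
    def number(_match):
        nonlocal counter
        counter += 1
        return f'_pb{counter}_'
    return re.sub('\f', number, text)

marked = mark_page_breaks('end of page one.\fstart of page two.')
# marked == 'end of page one._pb1_start of page two.'
```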
If not, it tries to find occurrences of page numbers. These are typically lines containing only three or fewer digits (four digits would occasionally get year references confused with page numbers), with a preceding and a following blank line.
After finding the page breaks, the algorithm attempts to identify headers and footers: it considers the lines which are near each page break as candidates for headers or footers (depending on whether they are after or before the break, respectively). Then the number of occurrences of each candidate (normalized so that changes in numbers are ignored) is calculated, and those which occur more than a given threshold are considered headers or footers.
At the end, the headers and footers are moved to a standoff file, and only the page break mark is left in the original text.
A cleaned up version of the previous example is presented below:

1 est vrai qu'il fallait être assez chanceux pour
2 rencontrer le nabab, et assez audacieux pour
3 s'emparer de sa personne. _pb2_
4 Le faquir, - évidemment le seul entre tous
5 que ne surexcitât pas l'espoir de gagner la
6 prime, - filait au milieu des groupes, s'arrêtant

2.2.2 Sections

There are several reasons that justify the importance of delimiting sections in a book:

• automatically generating tables of contents;
• identifying sections from one version of a book that are missing in another version (for example, it is common for some editions of a book to have an exclusive preamble or afterword which cannot be found in the other editions);
• matching and comparing the sections of two books before aligning them. This allows assessing the books' alignability (compatibility for alignment), predicting the results of the alignment and even simply discarding the worst sections;
• in areas like biomedical text mining, being able to detect the different sections of a document allows searching for specific information in specific sections (Agarwal and Yu, 2009).

Being an unstructured format, plain text by itself does not have the necessary means to formally represent the hierarchy of divisions of a book (chapters, sections, etc.). As such, the notation used to write the titles of the several divisions is dictated by the personal choices of whoever transcribed the book to electronic format, or by its previous format and the tool used to perform the conversion to plain text. Additionally, the nomenclature used is often language-dependent (e.g. Part One and Première Partie, or Chapter II and Capítulo 2).
Below is presented the beginning of the first section of a Portuguese version of Les Misérables, by Victor Hugo:

1 PRIMEIRA PARTE
2
3 FANTINE
4
5
6 ^L LIVRO PRIMEIRO
7
8 UM JUSTO
9
10 O abade Myriel
11
12 Em 1815, era bispo de Digne, o reverendo Carlos
13 Francisco Bemvindo Myriel, o qual contava setenta

In order to help in the sectioning process, an ontology was built (see section 2.3.1), which includes several section types and their relations, names of common sections, and ordinal and cardinal numbers.
The algorithm tries to find lines containing section names, pages or lines containing only numbers, or lines with Roman numerals. Then a custom section mark is added containing a normalized version of the original title (e.g. Roman numerals are converted to Arabic numbers).
The processed version of the example shown above follows:

1 _sec+N:part=1_ PRIMEIRA PARTE
2
3 FANTINE
4
5 _sec+N:book=1_ LIVRO PRIMEIRO
6
7 UM JUSTO
8
9 O abade Myriel
10
11 Em 1815, era bispo de Digne, o reverendo Carlos
12 Francisco Bemvindo Myriel, o qual contava setenta

2.2.3 Paragraphs

When reading a book, we often assume that a new line, especially if indented, represents a new paragraph, and that sentences within
the same paragraph are separated only by the punctuation mark and a space.
However, when dealing with plain text books, several other representations are possible: some books break sentences with a new line and paragraphs with a blank line (two consecutive new line characters); others use indentation to represent paragraphs; and there are even books where new line characters are used only to wrap text. There are also books where direct speech (from a character of the story) is represented with a leading dash; others use quotation marks to the same end.
The detection of paragraph and sentence boundaries is of the utmost importance in the alignment process, and a wrong delimitation is often responsible for a bad alignment.
In order to be able to detect how paragraphs and sentences are represented, our algorithm starts by measuring several parameters, such as the number of words, the number of lines, the number of empty lines and the number of indented lines.
Then several metrics are calculated based on these parameters, which are used to understand how paragraphs are represented, and the text is fixed accordingly.

Algorithm 1: Paragraph
Input: txt: text of the book
Output: txt: text with improved paragraphs
  elines  <- number of empty lines
  lines   <- number of lines
  words   <- number of words
  plines  <- number of lines with punctuation
  indent_i <- indentation distribution
  plr <- plines / lines            // punctuated lines ratio
  forall i in dom(indent) do
    if i > 10 or indent_i < 12 then
      remove(indent_i)             // remove false indents
  wpl  <- words / lines            // words per line
  wpel <- words / (1 + elines)     // words per empty line
  wpi  <- words / indent           // words per indent
  if wpel > 150 then
    if wpi in [10..100] then
      separate paragraphs by indentation
    if wpl in [10..100] and plr > 0.6 then
      separate paragraphs by new lines
  else
    separate paragraphs by empty lines

2.2.4 Footnotes

Footnotes are formed by a call mark inserted in the middle of the text (often appended to the end of a word or phrase), and the footnote expansion (i.e. the text of the footnote). The placement of the footnote expansion shifts from book to book, but it generally appears either at the end of the same page where its call mark can be found, or in a dedicated section at the end of the document.
Depending on the notation used to represent footnote call marks, these can introduce noise in the alignment (for example, if the aligner confuses them with sentence breaks). At best, the alignment might work well, but the resulting corpora will be polluted with the call marks, interfering with their further use. Footnote expansions are even more likely to disturb the process, as it is very unlikely that the matching footnotes are found at the same place in both books (the page divisions would have to be exactly the same).
Below is presented an example of a page containing footnote call marks and, at the end, footnote expansions, followed by the beginning of another page:

1 roi Charles V, fils de Jean II, auprès de la rue
2 Saint-Antoine, à la porte des Tournelles[1].
3
4 [1] La Bastille, qui fut prise par le peuple de
5 Paris, le 14 juillet 1789, puis démolie. B.
6
7 ^L Quel était en chemin l'étonnement de l'Ingénu!
8 je vous le laisse à penser. Il crut d'abord que

The removal of footnotes is performed in two steps: first the expansions are removed, then the call marks. The devised algorithm searches for lines starting with a plausible pattern (e.g. <<1>>, [2] or ^3) and followed by blank lines. These are considered footnote expansions, and are consequently replaced by a custom footnote mark and moved to a standoff file.
Once the footnote expansions have been removed, the remaining marks (following the same patterns) are likely to be call marks, and as such are removed and replaced by a custom normalized mark.¹

¹ The ideal would be to be able to establish the correspondence between the footnote call marks and their respective expansions. However, that feature is not especially relevant to the focus of this work, as the effort it would require is not compensated by its practical results.
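A rough Python sketch of this two-step removal follows (the module itself is Perl; the `_fne`/`_fnr` mark names follow the cleaned example shown next, and the exact patterns and numbering are assumptions):

```python
import re

CALL = r'(?:<<\d+>>|\[\d+\]|\^\d+)'      # plausible call-mark patterns

def remove_footnotes(text):
    standoff = []
    def stash(match):
        # Move the expansion to standoff, leave a numbered mark behind.
        standoff.append(match.group(0).strip())
        return f'_fne{len(standoff)}_\n'
    # Step 1: a line *starting* with a call-mark pattern and followed by a
    # blank line is taken as a footnote expansion.
    text = re.sub(rf'(?m)^{CALL} .*\n(?=\s*\n)', stash, text)
    # Step 2: remaining inline marks are assumed to be call marks.
    text = re.sub(CALL, '_fnr_', text)
    return text, standoff

page = ("Saint-Antoine, a la porte des Tournelles[1].\n"
        "\n"
        "[1] La Bastille, qui fut prise par le peuple.\n"
        "\n"
        "Quel etait en chemin ...\n")
cleaned, notes = remove_footnotes(page)
```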
After removing the footnotes, the example given above would look like the following:

1 roi Charles V, fils de Jean II, auprès de la rue
2 Saint-Antoine, à la porte des Tournelles_fnr29_.
3 _fne8_
4 ^L Quel était en chemin l'étonnement de l'Ingénu!
5 je vous le laisse à penser. Il crut d'abord que

2.2.5 Words and characters

At the word and character level, there are several problems that can affect the alignment of two books: Unicode characters, translineations and transpaginations:

Unicode characters: often, Unicode-only versions of previously existing ASCII characters are used. For example, Unicode has several types of dashes (e.g. '-', '–' and '—'), where ASCII only has one (i.e. '-'). While these do not represent the exact same character, they are close enough, and their normalization allows the production of more similar results.

Glyphs: some documents use glyphs (Unicode characters) to represent combinations of some specific characters, like fi or ff.

Translineation: translineation happens when a given word is split across two lines in order to keep the line length constant. If the two parts of translineated words are not rejoined before aligning, the aligner may see them as two separate words.

Transpagination: transpagination happens when a translineation occurs at the last line of a page, resulting in a word split across pages. However, the concept of page is removed by the first function, pages, which removes headers and footers and concatenates the pages. This means that transpagination occurrences get reduced to translineations.

The algorithm for dealing with translineations and transpaginations is available as an optional feature. This is mainly because different languages have different standard ways of dealing with translineations, and many documents use incorrect forms of translineation. As such, the user has the option to turn this feature on or off.
The algorithm works by looking for words split across lines (word characters, followed by a dash, a newline and more word characters). When this pattern is found, the newline is removed and the second half of the word is appended to the end of the first.
Dealing with Unicode characters is performed with a simple search and replace. Characters which have a corresponding character in ASCII are directly replaced, while stranger characters are replaced with a normalized custom mark.

2.2.6 Commit

This last step removes the custom marks introduced by the previous functions and outputs a cleaned document ready to be processed. The only marks left are section marks, which can be helpful to guide the alignment.
There are situations where elements such as page numbers are important (for example, having page numbers in a book is convenient when making citations). However, removing these marks makes the previous steps irreversible. As such, this step is optional, and is meant to be used only when a clean document is required for further processing.

2.3 Declarative objects

The first attempt at writing a program to clean books was a self-contained Perl script: all the patterns and logical steps were hardcoded into the program, with only a few command-line flags allowing fine tuning of the process. When this solution was discarded for not being powerful enough to allow more complex processing, several higher level configuration objects were created: an ontology to describe section types and names, and domain-specific languages to describe configuration files.
These elements open the possibility of discussing the subject with people who have no programming experience, allowing them to directly contribute with their knowledge to the development of the software.

2.3.1 Sections Ontology

The information about sections relevant to the sectioning process is stored in an ontology file (Uschold and Gruninger, 1996). Currently it contains several section types and their relations (e.g. a Section is part of a Chapter, an Act contains one or more Scenes), names of common sections (e.g. Introduction, Index, Foreword) and ordinal and cardinal numbers.
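As a rough illustration of how thesaurus-style records like the chapter entry shown below might be loaded into a lookup structure, here is a Python sketch (the real module uses Biblio::Thesaurus in Perl; the block format assumed here, with the term on the first line and then a tag followed by comma-separated values, is inferred from the extracts):

```python
# Parse thesaurus-like section records into a dict, assuming blank-line-
# separated blocks: first line the term, then lines with a tag (a language
# code, NT or BT) and comma-separated values.  Illustrative only.
def parse_sections(src):
    ontology = {}
    for block in src.strip().split('\n\n'):
        lines = block.splitlines()
        term, entry = lines[0].strip(), {}
        for line in lines[1:]:
            tag, values = line.split(None, 1)
            entry[tag] = [v.strip() for v in values.split(',')]
        ontology[term] = entry
    return ontology

src = """cap
PT capítulo, cap, capitulo
FR chapitre, chap
EN chapter, chap
NT sec"""

onto = parse_sections(src)
```

With such a structure, looking up every language variant of a section type, or following NT/BT relations, becomes a dictionary access.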
The recognition of these elements is language-dependent. As such, mechanisms to deal with several different languages had to be implemented. Some of these languages even use a different alphabet (e.g. Russian), which led to interesting challenges.
Several extracts of the sections ontology are presented below. The first example shows several translations and ways of writing chapter; NT sec indicates section as a narrower term of chapter:

1 cap
2 PT capítulo, cap, capitulo
3 FR chapitre, chap
4 EN chapter, chap
5 RU глава
6 NT sec

In the next example, act is indicated as a broader term of scene:

1 cena
2 PT cena
3 FR scène
4 EN scene
5 BT act

BT _alone allows indicating that this term must appear alone in a line:

1 end
2 PT fim
3 FR fin
4 EN the_end
5 BT _alone

The next extract defines the number 3 and several variations. BT _numeral identifies this term as a number, which consequently might be used to numerate some of the other terms:

1 3
2 PT terceiro, terceira, três, tres
3 EN third, three
4 FR troisième, troisieme
5 BT _numeral

The typification of sections allows the definition of different rules and patterns to recognize each type. For example, a line beginning with Chapter One may be almost undoubtedly considered a section title, even if more text follows in the same line (which in many cases is the title of the chapter). On the other hand, a line beginning with a Roman numeral may also represent a chapter beginning, but only if nothing follows in the same line (or else century references, for example, would frequently be mistaken for chapter divisions).
Despite not yet being fully implemented, the relations between sections described in the ontology will allow the establishment of hierarchies and, for example, searching for Scenes after finding an Act, or even classifying a text as a play in the presence of either.
Taking advantage of Perl being a dynamic programming language (i.e. able to load new code at runtime), the ontology is used to directly create the Perl code containing the complex data structures which are used in the process of sectioning. The ontology is in the ISO Thesaurus format, and the Perl module Biblio::Thesaurus (Simões and Almeida, 2002) is used to manipulate it.

2.3.2 Domain Specific Languages

The use of domain-specific languages (DSLs) will allow configuration files which control the flow of the process. Although not yet implemented, it is planned to have T::P::BC read a file describing external filters to be used at specific stages of the process (before or after any step).
This will allow calls to third-party tools, such as PDF-to-text converters or auxiliary cleaning scripts specific to a given type of file.
Both the sections ontology and the DSLs help to achieve a better organization, abstracting the domain layer from the code and implementation details.

3 Evaluation

The evaluation of a tool such as T::P::BC may be performed by comparing the results of alignments of texts before and after being cleaned with T::P::BC.
In order to test T::P::BC, a set of 20 pairs of books by Shakespeare, in Portuguese and English, was selected. Of the 40 books, half (the Portuguese ones) contained both headers and footers.
The books were aligned using cwb-align, an aligner tool bundled with the IMS Open Corpus Workbench (IMS Corpus Workbench, 1994-2002).

Pages, headers and footers

From a total of 1093 footers present in the documents, 1077 were found and removed (98.5%), and 1183 headers were detected and removed out of a similar number of total occurrences.
Translation units

The alignment of the texts resulted in the creation of translation memories, files in the TMX (Translation Memory eXchange) format (Oscar, 2000). Another way to evaluate T::P::BC is by comparing the number of each type of translation unit in each file: 1:1, 1:0 or 0:1, 2:1 or 1:2, or 2:2 (the numbers represent how many segments were included in each variant of the translation unit). Table 1 summarizes the results obtained in the alignment of the 20 pairs of Shakespeare books.

Alignm. Type    Original   Cleaned     ∆%
0:1 or 1:0          8333      6570   -21.2
1:1                18802     23433   +24.6
2:1 or 1:2         15673     11365   -27.5
2:2                 5413      4297   -20.6
Total Seg. PT      54864     53204    -3.0
Total Seg. EN      59744     51515   -13.8

Table 1: Number of translation units obtained for each type of alignment, with and without using T::P::BC.

The total number of 1:1 alignments increased: in the original documents, 39.0% of the translation units were of the 1:1 type, while the translation memory created after cleaning the files contained 51.3% of 1:1 translation units.
The total number of segments decreased in the cleaned version; some of the books used a specific theatrical notation which artificially increased the number of segments. This notation was normalized during the cleaning process with theatre_norm, a small Perl script created to normalize these undesired specific features, which led to the smaller number of segments.
The number of alignments considered bad by the aligner also decreased, from a total of 10 "bad alignment" cases in the original documents to 4 in the cleaned ones.

4 Conclusions and Future Work

Working with real situations, Text::Perfide::BookCleaner proved to be a useful tool to reduce text noise. The possibility of using a set of configuration options helped to obtain the necessary adaptations to different uses. Using T::P::BC, the translation memories produced in the alignment process have demonstrated an increase in the number of 1:1 translation units.
The use of ontologies for describing and storing the notation of text units (chapter, section, etc.) in a declarative, human-readable notation was very important for maintenance and for inviting collaboration from translation experts.
While processing the books by Shakespeare, we felt the need for a theater-specific filter which would normalize, for example, the stage directions and the notation used in the characters' names before each line. This is a practical example of the need to allow external filters to be applied at specific stages of the cleaning process. This feature will be implemented soon in T::P::BC.
A special effort will be devoted to the recognition of tables of contents, which is still one of the main causes of T::P::BC errors.
Additionally, the possibility of using standoff annotation in the intermediary steps is under consideration. That would mean indexing the document contents before processing, and keeping all the information about the changes to be made to the document in a separate file. This would allow the original file to remain untouched during the whole process, while still being able to commit the changes and produce a cleaned version.
Text::Perfide::BookCleaner will be made available on CPAN (http://www.cpan.org) in the very near future.

Acknowledgments

André Santos has a scholarship from Fundação para a Computação Científica Nacional, and the work reported here has been partially funded by Fundação para a Ciência e Tecnologia through project Per-Fide PTDC/CLE-LLI/108948/2008.

References

Agarwal, S. and H. Yu. 2009. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion. Bioinformatics, 25(23):3174.

Araújo, S., J.J. Almeida, I. Dias, and A. Simões. 2010. Apresentação do projecto Per-Fide: Paralelizando o Português com seis outras línguas. Linguamática, page 71.
Evert, S. 2001. The CQP query language tutorial. IMS Stuttgart, 13.

Gupta, R. and S. Ahmed. 2007. Project Proposal Apache Tika.

IMS Corpus Workbench. 1994-2002. http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/.

Lee, K.H., N. Guttenberg, and V. McCrary. 2002. Standardization aspects of eBook content formats. Computer Standards & Interfaces, 24(3):227-239.

Mattmann, Chris A. and Jukka L. Zitting. 2011. Tika in Action. Manning Publications Co., 1st edition.

Noonburg, D. 2001. xpdf: A C++ library for accessing PDF.

Okita, T. 2009. Data cleaning for word alignment. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 72-80. Association for Computational Linguistics.

Oscar, TMX. 2000. Translation Memory eXchange. LISA.

Robinson, N. 2001. A Comparison of Utilities for converting from Postscript or Portable Document Format to Text. Geneva: CERN, 31.

Simões, A. and J.J. Almeida. 2002. Library::*: a toolkit for digital libraries.

UnmarkedSoftware. 2011. TextSoap: For people working with other people's text.

Uschold, M. and M. Gruninger. 1996. Ontologies: Principles, methods and applications. The Knowledge Engineering Review, 11(02):93-136.

Varga, D., P. Halácsy, A. Kornai, V. Nagy, L. Németh, and V. Trón. 2005. Parallel corpora for medium density languages. Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005.

Véronis, J. 2000. From the Rosetta stone to the information society. pages 1-24.