SlideShare ist ein Scribd-Unternehmen logo
1 von 53
Downloaden Sie, um offline zu lesen
A tool for syntax-based
intra-language text alignment
Tariq Yousef, Chiara Palladino
University of Leipzig
Berlin Digital Classicist Seminars,
November 29, 2016
What is text alignment?
● Text alignment is the comparison of two or more parallel texts
● It tries to define correspondences/similarities and divergences/variants
● One of the most important tasks in Natural Language Processing: it can be
performed automatically through algorithmic and dynamic programming
methods
Intra-Language alignment: alignment of
texts in the same language
Cross-language alignment: alignment of
texts in different languages
● Cross-language
alignment is
difficult to perform
automatically
● It still needs
training data from
manual alignment
A Persian poem manually aligned with an English translation, from the project Open
Persian (http://www.dh.uni-leipzig.de/wo/open-philology-project/open-persian/)
...So, there is also manual alignment
● The Perseids Project
and Alpheios Texts
provide tools for
manual alignment of
texts in different
languages
(http://www.perseids.
org/,
http://alpheios.net/)
Homer, Iliad XXI, aligned with the English translation by A.T. Murray
Pairwise alignment: alignment of two texts
We distinguish on the
number of text because
it determines
differences in the use of
the alignment algorithm
Two versions of Emily Dickinson’s Faith is a fine invention, aligned using
the Versioning Machine (http://v-machine.org/samples/faith.html)
Multiple alignment: alignment of multiple
texts (i.e. more than two)
The number of
multiple texts is
virtually unlimited: in
an ideal world, you
can align as many
texts as you want
(but you should be
careful and avoid
“alignment
monsters”)
Six versions of the same poem by Emily Dickinson
Four texts
aligned with
iAligner
Alignment can be
visualized in
different ways
As a table
As a graph
Alignment graph using CollateX
(http://collatex.net/)
AlignmentgraphusingTRAViz
(http://www.traviz.vizcovery.org/)
As matching segments in aligned sentences
Alignment of three
sample texts on
CATView
(http://catview.uzi.uni-h
alle.de/overview.html)
As a dynamic visualization
(http://www.digitalvariants.org/variants/valerio-magrelli)
As overlapping variants
(http://juxtacommons.org/)
As parallel texts with variants
highlighted in the corresponding
sections (http://juxtacommons.org/)
Why do we align
texts?
To highlight
correspondences
in different
versions of a text
(http://v-machine.org/samples/faith.html)
To highlight
divergences
across various
versions of the
same text
(http://juxtacommons.org/)
To establish relations between witnesses of a text and see where they
overlap and diverge
Comparing texts as philological practice
Collatio
- Detection and transcription of
variants in witnesses
- It is made by close reading each
witness and comparing the texts
with each other
- Evaluation of the variants and of
the witnesses bearing them
….and yes, it is usually done manually.
Recensio
- To establish relationships
between witnesses and
which ones bear the “best
text”
- To establish an organic
scheme the transmission of
a text, often represented as
a genealogical tree of
witnesses (stemma)
Example of a stemma. Stemma for De nuptiis Philologiae et Mercurii by Martianus
Capella proposed by Danuta Shanzer (1986, p. 62-81).
Critical editions
- Usually display textual variants
in the form of apparatus
criticus
- The apparatus is a choice in
itself: it does not collect all the
variants found through
collation, but only those that
the editor had judged
significant for the
reconstruction of the text
- The apparatus can be very
complex to understand in large
textual traditions
Sallust’s Catiline in Axel Ahlberg’s 1919 Editio Major.
Critical text
Critical
apparatus
Now we can do some of
these things automatically
iAligner
http://i-alignment.com/
https://github.com/OpenGreekAndLatin/ILA_python
A tool for automatic syntax-based
intra-language alignment
● Automatic: it is performed with algorithmic methods to reduce human
intervention in the mechanical process of comparison.
● Syntax-based: in programming language, defines the order of the
characters and the order of the words in a sentence.
● Intra-language: works with texts in the same language.
● Pairwise or multiple: works with two texts or with an unlimited number of
multiple texts.
Algorithmic methods to produce alignment
The Needleman-Wunsch algorithm
- used in bioinformatics to align protein or nucleotide sequences.
- it uses Dynamic Programming to find the optimal alignment.
- divides a large problem into a series of smaller problems and uses the solutions to
the smaller problems to reconstruct a solution to the larger problem.
- uses a score function and similarity matrix to represent all possible combinations
of tokens and their resulting score.
The Needleman-Wunsch algorithm
- Aligning Bible Text John 1:1
NLT: In the beginning the Word already existed.
KJB: In the beginning was the Word
The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
In the beginning the Word already existed .
0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→
In -2↓
the -4↓
beginning -6↓
was -8↓
the -10↓
Word -12↓
, -14↓
The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
In the beginning the Word already existed .
0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→
In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 →
the -4↓ 0↓
beginning -6↓ -2↓
was -8↓ -4↓
the -10↓ -8↓
Word -12↓ -10↓
, -14↓ -12↓
The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
In the beginning the Word already existed .
0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→
In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 →
the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 →
beginning -6↓ -2↓ 8 ↓
was -8↓ -4↓ 6 ↓
the -10↓ -8↓ 4 ↓
Word -12↓ -10↓ 2 ↓
, -14↓ -12↓ -0 ↓
The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
In the beginning the Word already existed .
0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→
In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 →
the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 →
beginning -6↓ -2↓ 8 ↓ 15 ↘ 13 → 11 → 9 → 7 → -5 →
was -8↓ -4↓ 6 ↓ 8 ↓
the -10↓ -8↓ 4 ↓ 5 ↓
Word -12↓ -10↓ 2 ↓ 0 ↓
, -14↓ -12↓ -0 ↓ -5 ↓
The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
In the beginning the Word already existed .
0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→
In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 →
the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 →
beginning -6↓ -2↓ 8 ↓ 15 ↘ 13 → 11 → 9 → 7 → -5 →
was -8↓ -4↓ 6 ↓ 8 ↓ 11↓ 9→ 7→ 5→ 3→
the -10↓ -8↓ 4 ↓ 5 ↓ 13↘ 11→ 9→ 7→ 5→
Word -12↓ -10↓ 2 ↓ 0 ↓ 11↓ 18↘ 16→ 14→ 12→
, -14↓ -12↓ -0 ↓ -5 ↓ 9↓ 16↓ 14→ 12→ 10→
The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
In the beginning the Word already existed .
0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→
In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 →
the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 →
beginning -6↓ -2↓ 8 ↓ 15 ↘ 13 → 11 → 9 → 7 → -5 →
was -8↓ -4↓ 6 ↓ 8 ↓ 11↓ 9→ 7→ 5→ 3→
the -10↓ -8↓ 4 ↓ 5 ↓ 13↘ 11→ 9→ 7→ 5→
Word -12↓ -10↓ 2 ↓ 0 ↓ 11↓ 18↘ 16→ 14→ 12→
, -14↓ -12↓ -0 ↓ -5 ↓ 9↓ 16↓ 14→ 12→ 10→
The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
The Needleman-Wunsch algorithm
John 1:1
New Living
Translation
In the beginning the Word already existed .
King James
Bible
In the beginning was the Word ,
The modification to the algorithm
The goal is to optimize the algorithm
by reducing the search space
compares a token W at the position i
in S1 with a range of tokens [i-k, i+k]
in S2 with length of 2k+1.
The resulting search space is
reduced from (n * m) to ([2k +1]* m) ,
where k < n/2
The modification to the algorithm
k = 14, n = 157, m = 134
Search space = m*n = 21038
after modification
(2k+1)*m = 3886
Multiple Sequence Alignment ( In progress)
● Progressive alignment
builds up a final MSA by combining pairwise alignments beginning with the
most similar pair and progressing to the most distantly related, it requires two
stages:
- creating the guide tree (clustering)
- adding the sequences sequentially to the
growing MSA according to the guide tree
Multiple Sequence Alignment ( In progress)
● Iterative alignment
The aim is to reduce the problem of a multiple alignment to an iteration of
pairwise alignments.
How to align your texts with iAligner: copy
your text on the editor
The text has to be parsed in sentences first
...Or upload it
Currently supports .txt and .csv files
Refinement criteria
● Ignore non-alphabetical: ignores symbols, such as
punctuation and numbers, anything that is not an alphabetical
character
● Case sensitive: if activated, detects variation across words
according to the case
● Ignore diacritics: ignores any type of diacritical character
(including punctuation marks)
● Levenshtein distance: applies a revised version of the
Levenshtein algorithm and increases the tolerance threshold
on the alignment of similar words.
The Levenshtein distance
The Levenshtein distance between two words is the minimum number of
single-character edits (i.e. insertions, deletions or substitutions) required to
change one word into the other. e.g
lev(Hellanikos, Hellanicus) = 2
Mathematically, the Levenshtein distance between two strings { a,b} (of length
|a| and |b| respectively) is given by leva,b( |a| , |b| )
Modified Levenshtein Distance
Levenshtein distance is not very helpful in our case, because it is binary and
there is no tolerance with errors produced by OCR or Transcription.
the distance between letters is not binary, but it is on scale. The cost of insertion
or deletion depends on:
- Letter position
- Letter type (vowel or consonant)
lev(Hellanikos, Hellanicus) = 0.3
A Greek text with no refinement criteria
The same text with additional refinement criteria applied
Alignment output: a table-graph
iAligner displays all the nuances of variants according to a color-key:
- Completely aligned tokens (deep green)
- Tokens aligned by excluding case sensitivity or punctuation detection (light green)
- Gaps (yellow)
- Divergences (red)
- Tokens aligned by applying Levenshtein distance (blue-green)
What can you do
with iAligner?
Some case studies
Manuscript
alignment
Three manuscripts of Plato’s Crito
aligned (http://i-alignment.com/crito/)
OCR
output
alignment
Alignment of two OCR outputs from the Patrologia Graeca. The third column shows the overlapping
sections and offers the user the choice between two variants where the two texts diverge.
OCR
output
alignment Patrologia Latina: OCR output vs. correct version: www.i-alignment.com/pl/
Alignment of
editions
Three excerpted editions of Aeschylus’ Supplices aligned.
www.i-alignment.com/Aeschylus
Future work
Import and export options
Language dependent options for Latin, Greek, Arabic
Handling crossings and transpositions
Thanks for the attention!

Weitere ähnliche Inhalte

Andere mochten auch

Andere mochten auch (7)

[DCSB] Silvia Polla (Topoi) Between Demography and Consumption: Digital and Q...
[DCSB] Silvia Polla (Topoi) Between Demography and Consumption: Digital and Q...[DCSB] Silvia Polla (Topoi) Between Demography and Consumption: Digital and Q...
[DCSB] Silvia Polla (Topoi) Between Demography and Consumption: Digital and Q...
 
[DCSB] Charlotte Roueché (King's College London), "Digital Classics: Back to ...
[DCSB] Charlotte Roueché (King's College London), "Digital Classics: Back to ...[DCSB] Charlotte Roueché (King's College London), "Digital Classics: Back to ...
[DCSB] Charlotte Roueché (King's College London), "Digital Classics: Back to ...
 
[DCSB] Undine Lieberwirth & Axel Gering (TOPOI) 3D GIS in archaeology – a mic...
[DCSB] Undine Lieberwirth & Axel Gering (TOPOI) 3D GIS in archaeology – a mic...[DCSB] Undine Lieberwirth & Axel Gering (TOPOI) 3D GIS in archaeology – a mic...
[DCSB] Undine Lieberwirth & Axel Gering (TOPOI) 3D GIS in archaeology – a mic...
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShare
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & Tricks
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShare
 

Ähnlich wie [DCSB] Chiara Palladino & Tariq Youssef (Leipzig) iAligner: a tool for syntax-based intra-language text alignment

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
butest
 
Align, Disambiguate and Walk : A Unified Approach forMeasuring Semantic Simil...
Align, Disambiguate and Walk  : A Unified Approach forMeasuring Semantic Simil...Align, Disambiguate and Walk  : A Unified Approach forMeasuring Semantic Simil...
Align, Disambiguate and Walk : A Unified Approach forMeasuring Semantic Simil...
Koji Matsuda
 

Ähnlich wie [DCSB] Chiara Palladino & Tariq Youssef (Leipzig) iAligner: a tool for syntax-based intra-language text alignment (20)

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
 
Yelp challenge reviews_sentiment_classification
Yelp challenge reviews_sentiment_classificationYelp challenge reviews_sentiment_classification
Yelp challenge reviews_sentiment_classification
 
Constraint Grammar and Apertium
Constraint Grammar and ApertiumConstraint Grammar and Apertium
Constraint Grammar and Apertium
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
NLP
NLPNLP
NLP
 
NLP
NLPNLP
NLP
 
Pycon ke word vectors
Pycon ke   word vectorsPycon ke   word vectors
Pycon ke word vectors
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Deep network notes.pdf
Deep network notes.pdfDeep network notes.pdf
Deep network notes.pdf
 
Computational model language and grammar bnf
Computational model language and grammar bnfComputational model language and grammar bnf
Computational model language and grammar bnf
 
Align, Disambiguate and Walk : A Unified Approach forMeasuring Semantic Simil...
Align, Disambiguate and Walk  : A Unified Approach forMeasuring Semantic Simil...Align, Disambiguate and Walk  : A Unified Approach forMeasuring Semantic Simil...
Align, Disambiguate and Walk : A Unified Approach forMeasuring Semantic Simil...
 
Simple perl scripts
Simple perl scriptsSimple perl scripts
Simple perl scripts
 
vlang
vlangvlang
vlang
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
Natural Language Processing Course in AI
Natural Language Processing Course in AINatural Language Processing Course in AI
Natural Language Processing Course in AI
 
Intern presentation
Intern presentationIntern presentation
Intern presentation
 
Recipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tastyRecipe2Vec: Or how does my robot know what’s tasty
Recipe2Vec: Or how does my robot know what’s tasty
 
Function Applicative for Great Good of Palindrome Checker Function - Polyglot...
Function Applicative for Great Good of Palindrome Checker Function - Polyglot...Function Applicative for Great Good of Palindrome Checker Function - Polyglot...
Function Applicative for Great Good of Palindrome Checker Function - Polyglot...
 
https://www.slideshare.net/amaresimachew/hot-topics-132093738
https://www.slideshare.net/amaresimachew/hot-topics-132093738https://www.slideshare.net/amaresimachew/hot-topics-132093738
https://www.slideshare.net/amaresimachew/hot-topics-132093738
 

Mehr von Digital Classicist Seminar Berlin

Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...
Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...
Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...
Digital Classicist Seminar Berlin
 
[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...
[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...
[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...
Digital Classicist Seminar Berlin
 
[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...
[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...
[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...
Digital Classicist Seminar Berlin
 
[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...
[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...
[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...
Digital Classicist Seminar Berlin
 
[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...
[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...
[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...
Digital Classicist Seminar Berlin
 
[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...
[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...
[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...
Digital Classicist Seminar Berlin
 
[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...
[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...
[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...
Digital Classicist Seminar Berlin
 
[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...
[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...
[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...
Digital Classicist Seminar Berlin
 
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
Digital Classicist Seminar Berlin
 
[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...
[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...
[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...
Digital Classicist Seminar Berlin
 

Mehr von Digital Classicist Seminar Berlin (20)

[DCSB] Aline Deicke (Digital Academy Mainz) From E19 to MATCH and MERGE. Mapp...
[DCSB] Aline Deicke (Digital Academy Mainz) From E19 to MATCH and MERGE. Mapp...[DCSB] Aline Deicke (Digital Academy Mainz) From E19 to MATCH and MERGE. Mapp...
[DCSB] Aline Deicke (Digital Academy Mainz) From E19 to MATCH and MERGE. Mapp...
 
[DCSB] Pau de Soto (University of Southampton), “Network Analysis to Understa...
[DCSB] Pau de Soto (University of Southampton), “Network Analysis to Understa...[DCSB] Pau de Soto (University of Southampton), “Network Analysis to Understa...
[DCSB] Pau de Soto (University of Southampton), “Network Analysis to Understa...
 
[DCSB] Torsten Roeder (Julius Maximilian University of Würzburg) and Yury Arz...
[DCSB] Torsten Roeder (Julius Maximilian University of Würzburg) and Yury Arz...[DCSB] Torsten Roeder (Julius Maximilian University of Würzburg) and Yury Arz...
[DCSB] Torsten Roeder (Julius Maximilian University of Würzburg) and Yury Arz...
 
[DCSB] Christian Fron (University of Stuttgart), “Beyond the visual. The acou...
[DCSB] Christian Fron (University of Stuttgart), “Beyond the visual. The acou...[DCSB] Christian Fron (University of Stuttgart), “Beyond the visual. The acou...
[DCSB] Christian Fron (University of Stuttgart), “Beyond the visual. The acou...
 
[DCSB] Silke Vanbeselaere (KU Leuven), “Love Thy (Theban) Neighbours, or how ...
[DCSB] Silke Vanbeselaere (KU Leuven), “Love Thy (Theban) Neighbours, or how ...[DCSB] Silke Vanbeselaere (KU Leuven), “Love Thy (Theban) Neighbours, or how ...
[DCSB] Silke Vanbeselaere (KU Leuven), “Love Thy (Theban) Neighbours, or how ...
 
[DCSB] Jorit Wintjes (University of Würzburg), “Diekplous! – understanding an...
[DCSB] Jorit Wintjes (University of Würzburg), “Diekplous! – understanding an...[DCSB] Jorit Wintjes (University of Würzburg), “Diekplous! – understanding an...
[DCSB] Jorit Wintjes (University of Würzburg), “Diekplous! – understanding an...
 
[DCSB] Gregory Crane (University of Leipzig): "Digital Philology, World Liter...
[DCSB] Gregory Crane (University of Leipzig): "Digital Philology, World Liter...[DCSB] Gregory Crane (University of Leipzig): "Digital Philology, World Liter...
[DCSB] Gregory Crane (University of Leipzig): "Digital Philology, World Liter...
 
[DCSB] Chris Forstall, Lavinia Galli Milić (University of Geneva): "Thematic ...
[DCSB] Chris Forstall, Lavinia Galli Milić (University of Geneva): "Thematic ...[DCSB] Chris Forstall, Lavinia Galli Milić (University of Geneva): "Thematic ...
[DCSB] Chris Forstall, Lavinia Galli Milić (University of Geneva): "Thematic ...
 
Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...
Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...
Kathryn Piquette (U of Cologne), "The Herculaneum Papyri and Greek Magical Te...
 
[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...
[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...
[DCSB] Gabriel Bodard / Faith Lawrence (KCL), "Standards for Networking Ancie...
 
[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...
[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...
[DCSB] Tom Brughmans (U of Konstanz), "Roman bazaar or market economy? Explai...
 
[DCSB] Yannick Anné and Toon Van Hal (U of Leuven), "Creating a Dynamic Gramm...
[DCSB] Yannick Anné and Toon Van Hal (U of Leuven), "Creating a Dynamic Gramm...[DCSB] Yannick Anné and Toon Van Hal (U of Leuven), "Creating a Dynamic Gramm...
[DCSB] Yannick Anné and Toon Van Hal (U of Leuven), "Creating a Dynamic Gramm...
 
[DCSB] Markus Schnöpf (BBAW/IDE), "Reviewing digital editions: The Codex Sina...
[DCSB] Markus Schnöpf (BBAW/IDE), "Reviewing digital editions: The Codex Sina...[DCSB] Markus Schnöpf (BBAW/IDE), "Reviewing digital editions: The Codex Sina...
[DCSB] Markus Schnöpf (BBAW/IDE), "Reviewing digital editions: The Codex Sina...
 
[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...
[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...
[DCSB] Georg Roth (Universität zu Köln) "Die Rückkehr des Leitfundes? Die Ver...
 
[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...
[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...
[DCSB] Rainer Komp (DAI, Berlin) "Chronological Concepts of the Ancient World...
 
[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...
[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...
[DCSB] Henry Mendell (California State University, USA) "Visualization of Anc...
 
[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...
[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...
[DCSB] Amiz Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visual...
 
[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...
[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...
[DCSB] Agnes Thomas, Karen Schwane, Alexander Recht (DAI, Köln) "The Hellespo...
 
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
[DCSB] Gregory Crane, Stella Dee, Maryam Foradi, Monica Lent, Maria Moritz (U...
 
[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...
[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...
[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr­Universität Bochum) "The Gl...
 

Kürzlich hochgeladen

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Kürzlich hochgeladen (20)

Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 

[DCSB] Chiara Palladino & Tariq Youssef (Leipzig) iAligner: a tool for syntax-based intra-language text alignment

  • 1. A tool for syntax-based intra-language text alignment Tariq Yousef, Chiara Palladino University of Leipzig Berlin Digital Classicist Seminars, November 29, 2016
  • 2. What is text alignment? ● Text alignment is the comparison of two or more parallel texts ● It tries to define correspondences/similarities and divergences/variants ● One of the most important tasks in Natural Language Processing: it can be performed automatically through algorithmic and dynamic programming methods
  • 3. Intra-Language alignment: alignment of texts in the same language
  • 4. Cross-language alignment: alignment of texts in different languages ● Cross-language alignment is difficult to perform automatically ● It still needs training data from manual alignment A Persian poem manually aligned with an English translation, from the project Open Persian (http://www.dh.uni-leipzig.de/wo/open-philology-project/open-persian/)
  • 5. ...So, there is also manual alignment ● The Perseids Project and Alpheios Texts provide tools for manual alignment of texts in different languages (http://www.perseids. org/, http://alpheios.net/) Homer, Iliad XXI, aligned with the English translation by A.T. Murray
  • 6. Pairwise alignment: alignment of two texts We distinguish on the number of text because it determines differences in the use of the alignment algorithm Two versions of Emily Dickinson’s Faith is a fine invention, aligned using the Versioning Machine (http://v-machine.org/samples/faith.html)
  • 7. Multiple alignment: alignment of multiple texts (i.e. more than two) The number of multiple texts is virtually unlimited: in an ideal world, you can align as many texts as you want (but you should be careful and avoid “alignment monsters”) Six versions of the same poem by Emily Dickinson
  • 9. Alignment can be visualized in different ways
  • 11. As a graph Alignment graph using CollateX (http://collatex.net/) AlignmentgraphusingTRAViz (http://www.traviz.vizcovery.org/)
  • 12. As matching segments in aligned sentences Alignment of three sample texts on CATView (http://catview.uzi.uni-h alle.de/overview.html)
  • 13. As a dynamic visualization (http://www.digitalvariants.org/variants/valerio-magrelli)
  • 14. As overlapping variants (http://juxtacommons.org/) As parallel texts with variants highlighted in the corresponding sections (http://juxtacommons.org/)
  • 15. Why do we align texts?
  • 16. To highlight correspondences in different versions of a text (http://v-machine.org/samples/faith.html)
  • 17. To highlight divergences across various versions of the same text (http://juxtacommons.org/)
  • 18. To establish relations between witnesses of a text and see where they overlap and diverge
  • 19. Comparing texts as philological practice Collatio - Detection and transcription of variants in witnesses - It is made by close reading each witness and comparing the texts with each other - Evaluation of the variants and of the witnesses bearing them ….and yes, it is usually done manually.
  • 20. Recensio - To establish relationships between witnesses and which ones bear the “best text” - To establish an organic scheme the transmission of a text, often represented as a genealogical tree of witnesses (stemma) Example of a stemma. Stemma for De nuptiis Philologiae et Mercurii by Martianus Capella proposed by Danuta Shanzer (1986, p. 62-81).
  • 21. Critical editions - Usually display textual variants in the form of apparatus criticus - The apparatus is a choice in itself: it does not collect all the variants found through collation, but only those that the editor had judged significant for the reconstruction of the text - The apparatus can be very complex to understand in large textual traditions Sallust’s Catiline in Axel Ahlberg’s 1919 Editio Major. Critical text Critical apparatus
  • 22. Now we can do some of these things automatically
  • 24. A tool for automatic syntax-based intra-language alignment ● Automatic: it is performed with algorithmic methods to reduce human intervention in the mechanical process of comparison. ● Syntax-based: in programming language, defines the order of the characters and the order of the words in a sentence. ● Intra-language: works with texts in the same language. ● Pairwise or multiple: works with two texts or with an unlimited number of multiple texts.
  • 25. Algorithmic methods to produce alignment The Needleman-Wunsch algorithm - used in bioinformatics to align protein or nucleotide sequences. - it uses Dynamic Programming to find the optimal alignment. - divides a large problem into a series of smaller problems and uses the solutions to the smaller problems to reconstruct a solution to the larger problem. - uses a score function and similarity matrix to represent all possible combinations of tokens and their resulting score.
  • 26. The Needleman-Wunsch algorithm - Aligning Bible Text John 1:1 NLT: In the beginning the Word already existed. KJB: In the beginning was the Word The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
  • 27. In the beginning the Word already existed . 0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→ In -2↓ the -4↓ beginning -6↓ was -8↓ the -10↓ Word -12↓ , -14↓ The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
  • 28. In the beginning the Word already existed . 0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→ In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 → the -4↓ 0↓ beginning -6↓ -2↓ was -8↓ -4↓ the -10↓ -8↓ Word -12↓ -10↓ , -14↓ -12↓ The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
  • 29. In the beginning the Word already existed . 0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→ In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 → the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 → beginning -6↓ -2↓ 8 ↓ was -8↓ -4↓ 6 ↓ the -10↓ -8↓ 4 ↓ Word -12↓ -10↓ 2 ↓ , -14↓ -12↓ -0 ↓ The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
  • 30. In the beginning the Word already existed . 0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→ In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 → the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 → beginning -6↓ -2↓ 8 ↓ 15 ↘ 13 → 11 → 9 → 7 → -5 → was -8↓ -4↓ 6 ↓ 8 ↓ the -10↓ -8↓ 4 ↓ 5 ↓ Word -12↓ -10↓ 2 ↓ 0 ↓ , -14↓ -12↓ -0 ↓ -5 ↓ The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
  • 31. In the beginning the Word already existed . 0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→ In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 → the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 → beginning -6↓ -2↓ 8 ↓ 15 ↘ 13 → 11 → 9 → 7 → -5 → was -8↓ -4↓ 6 ↓ 8 ↓ 11↓ 9→ 7→ 5→ 3→ the -10↓ -8↓ 4 ↓ 5 ↓ 13↘ 11→ 9→ 7→ 5→ Word -12↓ -10↓ 2 ↓ 0 ↓ 11↓ 18↘ 16→ 14→ 12→ , -14↓ -12↓ -0 ↓ -5 ↓ 9↓ 16↓ 14→ 12→ 10→ The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
  • 32. In the beginning the Word already existed . 0 -2→ -4→ -6→ -8→ -10→ -12→ -14→ -16→ In -2↓ 5 ↘ 3→ 1 → -1 → -3 → -5 → -7 → -9 → the -4↓ 0↓ 10 ↘ 8 → 13 → 11 → 9 → 7 → -5 → beginning -6↓ -2↓ 8 ↓ 15 ↘ 13 → 11 → 9 → 7 → -5 → was -8↓ -4↓ 6 ↓ 8 ↓ 11↓ 9→ 7→ 5→ 3→ the -10↓ -8↓ 4 ↓ 5 ↓ 13↘ 11→ 9→ 7→ 5→ Word -12↓ -10↓ 2 ↓ 0 ↓ 11↓ 18↘ 16→ 14→ 12→ , -14↓ -12↓ -0 ↓ -5 ↓ 9↓ 16↓ 14→ 12→ 10→ The used score function ( Matching = 5, Mismatching = -5, In/Del = -2 )
  • 33. The Needleman-Wunsch algorithm John 1:1 New Living Translation In the beginning the Word already existed . King James Bible In the beginning was the Word ,
  • 34. The modification to the algorithm The goal is to optimize the algorithm by reducing the search space compares a token W at the position i in S1 with a range of tokens [i-k, i+k] in S2 with length of 2k+1. The resulting search space is reduced from (n * m) to ([2k +1]* m) , where k < n/2
  • 35. The modification to the algorithm k = 14, n = 157, m = 134 Search space = m*n = 21038 after modification (2k+1)*m = 3886
  • 36. Multiple Sequence Alignment ( In progress) ● Progressive alignment builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related, it requires two stages: - creating the guide tree (clustering) - adding the sequences sequentially to the growing MSA according to the guide tree
  • 37. Multiple Sequence Alignment ( In progress) ● Iterative alignment The aim is to reduce the problem of a multiple alignment to an iteration of pairwise alignments.
  • 38. How to align your texts with iAligner: copy your text on the editor The text has to be parsed in sentences first
  • 39. ...Or upload it Currently supports .txt and .csv files
  • 41. ● Ignore non-alphabetical: ignores symbols, such as punctuation and numbers, anything that is not an alphabetical character ● Case sensitive: if activated, detects variation across words according to the case ● Ignore diacritics: ignores any type of diacritical character (including punctuation marks) ● Levenshtein distance: applies a revised version of the Levenshtein algorithm and increases the tolerance threshold on the alignment of similar words.
  • 42. The Levenshtein distance The Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. e.g lev(Hellanikos, Hellanicus) = 2 Mathematically, the Levenshtein distance between two strings { a,b} (of length |a| and |b| respectively) is given by leva,b( |a| , |b| )
  • 43. Modified Levenshtein Distance Levenshtein distance is not very helpful in our case, because it is binary and there is no tolerance with errors produced by OCR or Transcription. the distance between letters is not binary, but it is on scale. The cost of insertion or deletion depends on: - Letter position - Letter type (vowel or consonant) lev(Hellanikos, Hellanicus) = 0.3
  • 44. A Greek text with no refinement criteria The same text with additional refinement criteria applied
  • 45. Alignment output: a table-graph iAligner displays all the nuances of variants according to a color-key: - Completely aligned tokens (deep green) - Tokens aligned by excluding case sensitivity or punctuation detection (light green) - Gaps (yellow) - Divergences (red) - Tokens aligned by applying Levenshtein distance (blue-green)
  • 46. What can you do with iAligner? Some case studies
  • 47. Manuscript alignment Three manuscripts of Plato’s Crito aligned (http://i-alignment.com/crito/)
  • 48. OCR output alignment Alignment of two OCR outputs from the Patrologia Graeca. The third column shows the overlapping sections and offers the user the choice between two variants where the two texts diverge.
  • 49. OCR output alignment Patrologia Latina: OCR output vs. correct version: www.i-alignment.com/pl/
  • 50. Alignment of editions Three excerpted editions of Aeschylus’ Supplices aligned. www.i-alignment.com/Aeschylus
  • 52. Import and export options Language dependent options for Latin, Greek, Arabic Handling crossings and transpositions
  • 53. Thanks for the attention!