APPLICATION OF GENETIC PROGRAMMING TOWARDS WORD ALIGNERS
BENJAMIN HEILERS
Department of Electrical Engineering and Computer Science
University of California, Berkeley
December 2004
CS 294-5
keywords: Genetic Programming, Word Aligners, Machine Translation, Machine
Learning, Genetic Algorithms, Natural Language Processing, Artificial Intelligence
1. Introduction
This paper details the (as-of-yet-unfruitful and thus determinedly ongoing) research into the application of genetic programming to optimizing word aligners. Popular belief holds the use of genetic programming in machine translation to be infeasible. Regardless, it is the goal of the author, admittedly due to an infatuation with machine learning in general, to convince himself personally that such a widespread sentiment is either well-founded or wholly amiss.
A word aligner is a program coupling words in sentence pairs, in effect
constructing a bilingual dictionary [1:484, 2]. Genetic Programming, a specific branch of Genetic Algorithms, denotes search across a function space in which the natural reproductive mechanisms – selection, mutation, crossover – are mimicked, in the hope that the great successes of evolution on living creatures may be repeated on programs [3:47-56, 4]. Genetic Algorithms deal more broadly with evolving all types of functions. Genetic Programming here takes a set of programs and selects those most resembling word aligners, subjecting them to alterations in the hope of finding still better candidates. The process by which programs are selected tests each program on a subset of the sentence-pair corpus, thus qualifying this approach as supervised learning.
Another concept appearing in this paper and warranting a definition is that of the Abstract
Syntax Tree (AST), a representation for computer code which renders the code in a
particularly useful format for genetically reproductive processes [5:9]. Abstract Syntax
Trees are preferable for two reasons:
• it is conceptually far simpler to apply crossover and mutation to a tree representation than to program code in string form;
• a design pattern, the Visitor Pattern, suggests an easily implementable approach to traversing this representation of program code [9, 10].
Figure 1: An Eclipse Abstract Syntax Tree in Graphical and Textual Forms.
Note in figure 1 that the Eclipse AST, the package used in this research, maintains some
information within nodes (such as the operator in infix expressions), whereas some AST
representations formulate this information as a child node.
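The Visitor Pattern mentioned above can be sketched in miniature. The node and visitor classes below are illustrative toys, not the Eclipse JDT classes used in this research; the point is that a traversal (here, evaluation) is written without touching the node classes themselves:

```java
// Minimal sketch of the Visitor Pattern over a toy expression AST.
// Node types here (Num, Add) are illustrative, not Eclipse JDT classes.
interface Expr { <R> R accept(Visitor<R> v); }

class Num implements Expr {
    final int value;
    Num(int value) { this.value = value; }
    public <R> R accept(Visitor<R> v) { return v.visit(this); }
}

class Add implements Expr {
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public <R> R accept(Visitor<R> v) { return v.visit(this); }
}

interface Visitor<R> {
    R visit(Num n);
    R visit(Add a);
}

// One traversal (evaluation) defined entirely outside the node classes:
class Evaluator implements Visitor<Integer> {
    public Integer visit(Num n) { return n.value; }
    public Integer visit(Add a) {
        return a.left.accept(this) + a.right.accept(this);
    }
}

public class VisitorSketch {
    public static int eval(Expr e) { return e.accept(new Evaluator()); }
    public static void main(String[] args) {
        Expr tree = new Add(new Num(2), new Add(new Num(3), new Num(4)));
        System.out.println(eval(tree)); // prints 9
    }
}
```

A mutation or crossover pass is then just another Visitor implementation over the same node classes.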
2. Literature Review
There is little literature on the application of genetic algorithms to word aligners.
Instead, we turn to the literature on genetic programming, where suggestions to
counteract various results-limiting phenomena abound.
There are myriad decisions to make in implementing a genetic algorithm. Fortunately, the literature provides enough detailed discussion to allow for preparations against the most common problems with genetic programming. Franz Rothlauf was the first to write a book on the pros and cons of various representations in genetic algorithms [7]. Like many of his colleagues, he strongly recommends tree representations for genetic programming. This representation eases the implementation of mutation and crossover tremendously, compared with the traditional representation as bit strings, whereby the chances that a mutated string still resembles working code are less than slim.
public Alignment alignSentencePair
(SentencePair sentencePair){
MISSING=2092010418 <= -1198423683;
alignment=new Alignment();
I4=addAlignment(alignment,I4,I4,B3);
B4=false;
I2=numEnglishWordsInSentence(sentencePair);
if (I2 < -594586326){
D2=getDouble(L3,I1);
} else {
addInt(L2,I1,I3);
getInt(L5,I2);
addBoolean(L3,I1,B2);
while (I2 < 1564864814){
addDouble(L5,I5,D1);
MISSING=664939021;
alignment=getString(L1,I2);
MISSING=311599999 * 1197784289;
D1=-287916828;
}
}
I2=numFrenchWordsInSentence(sentencePair);
for (I3=0;I3 < I1;I3++){
I4=-1;
D1=0;
for (I5=0;I5 < I2;I5++){
D2=50 / (1 + abs(I3 - I5));
if (D2 >= D1){
D1=D2;
I4=I5;
}
}
addAlignment(alignment,I4,I3,true);
}
return alignment;
}
Figure 2: Example of Bloat. Lines resembling the original file are in bold; red lines were added as a result of the bloat phenomenon.
The phenomenon of bloat is widely mentioned: each successive generation displays a much larger file size than the previous, yet most of the added code contributes little to no added functionality. With high rates of mutation, I have seen 450 lines (nine pages) of code introduced to an initially twenty-line file after fewer than ten generations. Several mechanisms are in place to cope with bloat, as discussed later.
Another commonly observed problem to cope with is over-fitting, which occurs when the genetic process is allowed to run for too long. For example, the corpus used in this research consists of 447 sentence pairs, pairing English and French sentences. If we choose the first ten and evolve randomly generated programs to return alignments of these, then at some point we may theoretically find a reasonable solution which not only achieves superb results on the ten training sentence pairs, but on the 447 total sentence pairs as well. However, if we continue to evolve past this point, chances are that our population will become over-fit to these ten sentences. This is similar, for example, to hoping to find the equation y = x^2, but instead arriving at y = 1, given training data of only (-1, 1) and (1, 1).
Figure 3: After fitness is reached, over-fitting to the training data may occur.
Figure 4: Example of Over-Fitting. The solid black line is y = x^2, the red dotted line is y = |x|, and the blue dashed line is y = 1. The training data is { (1, 1), (-1, 1) }, but the desired function is { (x, y) : y = x^2 }.
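The y = x^2 versus y = 1 example can be made concrete: on the two training points both functions have zero error, so the fitness function cannot tell them apart, but they diverge as soon as we evaluate outside the training set. A small illustrative sketch:

```java
// Illustration of the over-fitting example: on training data {(-1,1),(1,1)}
// the functions y = x^2 and y = 1 are indistinguishable, but they diverge
// as soon as we evaluate outside the training set.
public class OverfitExample {
    static double trueFn(double x)  { return x * x; } // desired function
    static double overfit(double x) { return 1.0; }   // over-fit "solution"

    static double trainingError(java.util.function.DoubleUnaryOperator f) {
        double[][] train = { {-1, 1}, {1, 1} };
        double err = 0;
        for (double[] p : train) err += Math.abs(f.applyAsDouble(p[0]) - p[1]);
        return err;
    }

    public static void main(String[] args) {
        System.out.println(trainingError(OverfitExample::trueFn));  // 0.0
        System.out.println(trainingError(OverfitExample::overfit)); // 0.0
        // But at x = 2 the two disagree badly:
        System.out.println(trueFn(2) - overfit(2)); // 3.0
    }
}
```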
The literature is also helpful in suggesting approximate values for the frequencies
at which to apply mutation and crossover to members of the population, though the
perfect values are apparently learned only by trial and error.
3. General Overview of Algorithm
In general, instead of searching across the solution space, we utilize GP to aid in searching across the function space. As the function space is of immense proportions, we randomly sample it, and then search through not only these functions but others similar to them. In the graph above, we may have a function y = x + 5. This would lead us to search similar functions such as y = x + 6, y = 3x + 5, y = x^2 + 5, etc. Since it is infeasible to evaluate every possible function with a form similar to y = x + 5, we must again find a method with which to decide which functions to search. This is the basic concept of genetic programming, where a desired set of (input, output) pairs is known, but we search for the function (or possibly one of many functions) which produces these outputs. Doing this search across program code is many times more complex than across math equations.
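The idea of searching the function space toward known (input, output) pairs can be sketched with a toy example: candidate functions y = a*x + b are mutated at random, and a mutant survives only if it fits the pairs better. This is an illustration of the search principle only; real GP mutates program code, not two coefficients:

```java
import java.util.Random;

// Toy sketch of search across a function space: candidates y = a*x + b
// are mutated at random and kept only when they better fit the known
// (input, output) pairs. Illustrative only; GP operates on program code.
public class FunctionSearch {
    static double error(double a, double b, double[][] pairs) {
        double err = 0;
        for (double[] p : pairs) err += Math.abs((a * p[0] + b) - p[1]);
        return err;
    }

    // Returns {a, b} after 'steps' rounds of mutate-and-select.
    static double[] search(double[][] pairs, int steps, long seedVal) {
        Random rnd = new Random(seedVal);
        double a = 0, b = 0;
        for (int i = 0; i < steps; i++) {
            double a2 = a + rnd.nextGaussian();  // mutation
            double b2 = b + rnd.nextGaussian();
            if (error(a2, b2, pairs) < error(a, b, pairs)) { // selection
                a = a2; b = b2;
            }
        }
        return new double[] {a, b};
    }

    public static void main(String[] args) {
        double[][] pairs = { {0, 5}, {1, 6}, {2, 7} }; // generated by y = x + 5
        double[] fit = search(pairs, 5000, 42);
        System.out.printf("a=%.2f b=%.2f%n", fit[0], fit[1]); // a, b near 1 and 5
    }
}
```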
The flow chart shown here is exactly the order in which genetic programming is implemented in this research. An initial population is created by taking files such as random.java in the Appendix and sending them through several generations of high mutation. Since the current version of this GP process is still prone to producing erroneous code, many more programs are generated than asked for. Each is then evaluated according to the fitness function, and those which have compile or runtime errors (at this point mostly due to invalid arguments, incorrect casting, and undeclared variable names; see Results) are filtered out and thrown away. Thus the GP process begins with only valid programs in its initial population. From here, the population undergoes a number of iterations wherein each member is evaluated, then the next generation is selected, then crossover and mutation are allowed to occur. The rates for these are currently an 80% chance for 3-point crossover to occur (see Section 4) and 0.05% for mutation, as suggested by most literature.

Figure 5: Flow Chart of GP
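The generational loop just described (evaluate, select, crossover, mutate) can be sketched as follows. Individuals are stubbed as bit arrays and the fitness, selection, and crossover placeholders are my own simplifications; in this research the individuals are Eclipse ASTs and the fitness function runs the candidate word aligner:

```java
import java.util.Arrays;
import java.util.Random;

// Skeleton of the generational loop with the stated rates (80% chance of
// crossover, 0.05% per-gene mutation). Individuals are stubbed as bit
// arrays; fitness(), crossover(), and parent choice are placeholders.
public class GpLoop {
    static final Random RND = new Random(7);

    static int fitness(boolean[] g) {           // placeholder: count of 1s
        int f = 0;
        for (boolean b : g) if (b) f++;
        return f;
    }

    static boolean[] crossover(boolean[] a, boolean[] b) { // single-point
        int cut = RND.nextInt(a.length);
        boolean[] child = Arrays.copyOf(a, a.length);
        System.arraycopy(b, cut, child, cut, a.length - cut);
        return child;
    }

    static boolean[] best(boolean[][] pop) {
        boolean[] champ = pop[0];
        for (boolean[] g : pop) if (fitness(g) > fitness(champ)) champ = g;
        return champ;
    }

    public static boolean[] evolve(int popSize, int genomeLen, int generations) {
        boolean[][] pop = new boolean[popSize][genomeLen];
        for (boolean[] g : pop)
            for (int i = 0; i < genomeLen; i++) g[i] = RND.nextBoolean();

        for (int gen = 0; gen < generations; gen++) {
            boolean[][] next = new boolean[popSize][];
            next[0] = best(pop);                          // elitist strategy
            for (int i = 1; i < popSize; i++) {
                boolean[] p1 = pop[RND.nextInt(popSize)];
                boolean[] p2 = pop[RND.nextInt(popSize)];
                boolean[] child = RND.nextDouble() < 0.80 // crossover rate
                        ? crossover(p1, p2)
                        : Arrays.copyOf(p1, p1.length);
                for (int j = 0; j < child.length; j++)    // mutation rate
                    if (RND.nextDouble() < 0.0005) child[j] = !child[j];
                next[i] = child;
            }
            pop = next;
        }
        return best(pop);
    }

    public static void main(String[] args) {
        System.out.println(fitness(evolve(30, 20, 50)));
    }
}
```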
4. Details of Implementation Decisions
Since the design decisions are largely independent of one another, and many need to be presented simultaneously, I have chosen to format this section by discussing each one on its own, as orderly as possible.
AST representation: The members of the population could be represented in a myriad of ways. Many people implementing genetic programming choose to create an original representation of their functions. This, unfortunately, is due to the newness of the field, and hinders the progress of future work by allowing researchers to get bogged down in minor details which could potentially be settled already. I admire Franz Rothlauf's efforts to correct this problem, and agree with him that the best representation for my purposes is an AST. This allows for an easy crossover implementation and only necessitates moderate work to implement mutation.
Generational model: The generational model of a population allows for the
lifespan of each population member to be only a single generation, as opposed to the
Steady-State model, which not only selects members for reproduction but also selects
which member they will be replacing as well [8:134]. Both models use a constant-sized
population. I have chosen to go with the generational model here because of simplicity in
design.
Initial Population: The usual approach is to start with a completely random set of
programs. This seems unreasonable. Why create programs which construct strings and
draw websites when we are looking for a program to add two numbers together? I have
taken an approach for which I could not find any literature, to start with a population of
programs similar to that under random.java in the Appendix. In some cases, I have even placed some initial code within the for-loop, to make alignments based on superficial traits. I do not rule out the possibility that this second strategy may in effect steer the search in the wrong direction, and so I intend to run the program both with random.java and with the other versions (entitled superficial.java). It is my hope that by providing some base code, completely random results will be avoided and thus there is a better chance of finding an optimal solution.
Fitness function: The fitness function, the measure by which we decide which members yield the most desirable results and thus have the most potential as prototypes of our desired word aligner, seems obvious. The goal is to maximize precision (P) and recall (R) while minimizing the AER, as defined in [11:1]. Thus the fitness function calls alignSentencePair of each population member on a small subset of the full corpus, and returns the weighted sum of these numbers (where w1, w2, w3 are the weights):
10 * [ w1 * P + w2 * R + w3 * (1 – AER) ]
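A direct rendition of this formula is below. Setting w1 = w2 = w3 = 1 is an assumption on my part, but it appears consistent with the results table in Section 5 (e.g. P = 0.3658, R = 0.2258, AER = 0.6864 gives fitness 9.052):

```java
// Direct rendition of the fitness formula above. With the assumed weights
// w1 = w2 = w3 = 1, the gen-0 row of the results table is reproduced:
// P = 0.3658, R = 0.2258, AER = 0.6864 yields fitness ~ 9.052.
public class Fitness {
    public static double fitness(double p, double r, double aer,
                                 double w1, double w2, double w3) {
        return 10 * (w1 * p + w2 * r + w3 * (1 - aer));
    }

    public static void main(String[] args) {
        System.out.println(fitness(0.3658, 0.2258, 0.6864, 1, 1, 1)); // ~9.052
    }
}
```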
Countering Over-Fitting: An easy way to counter over-fitting to the training data used in the fitness function is to keep the training set dynamic. I have implemented this by choosing a random set each time. Thus there is no worry of over-fitting to a specific set of sentence pairs, since there is no specific set of sentence pairs.
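One way to implement the dynamic training set is a shuffle-and-take: each fitness evaluation draws a fresh random subset of the corpus. This sketch is my own; the sentence pairs are stubbed as Integer ids and the subset size of 10 is illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of keeping the training set dynamic: each fitness evaluation
// draws a fresh random subset of the corpus, so no fixed set of sentence
// pairs can be over-fit. Sentence pairs are stubbed as Integer ids.
public class DynamicTrainingSet {
    public static List<Integer> randomSubset(List<Integer> corpus, int k,
                                             Random rnd) {
        List<Integer> copy = new ArrayList<>(corpus);
        Collections.shuffle(copy, rnd);
        return copy.subList(0, Math.min(k, copy.size()));
    }

    public static void main(String[] args) {
        List<Integer> corpus = new ArrayList<>();
        for (int i = 0; i < 447; i++) corpus.add(i); // 447 sentence pairs
        System.out.println(randomSubset(corpus, 10, new Random()).size()); // 10
    }
}
```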
Fitness-Proportionate Selection: There are two methods of selection in widespread use: tournament and proportionate fitness selection [3:37]. In tournament selection, several tournaments are held in which the fitness is calculated and the winner of the tournament is selected for reproduction. As the fitness function here requires no small amount of time, for matters of efficiency I chose the less computationally expensive selection process, fitness-proportionate selection. Here each member of the population is evaluated once, and then the new generation is randomly selected, with probability proportional to the fitness of the member [see figure]. The general risk is that with a wide variety of fitness values, those with the lower fitness values will be excluded from selection and the diversity of the population will disappear, leading to premature convergence. It is my hope that with a non-random initial population, the disparity in the fitness values will not be as dangerous as if the population had been truly initialized randomly.

Figure 6: The Proportionate Fitness Selection Process is Akin to a Game of Darts.
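The darts analogy translates directly into code: each member owns a wedge of the board proportional to its fitness, and a uniform random throw selects the member whose wedge it lands in. A minimal sketch (the sample fitness values are illustrative):

```java
import java.util.Random;

// Sketch of fitness-proportionate ("darts" / roulette wheel) selection:
// each member owns a slice of [0, totalFitness) proportional to its
// fitness, and a uniform random throw selects the slice it lands in.
public class RouletteSelection {
    public static int select(double[] fitness, Random rnd) {
        double total = 0;
        for (double f : fitness) total += f;
        double throwAt = rnd.nextDouble() * total;
        double acc = 0;
        for (int i = 0; i < fitness.length; i++) {
            acc += fitness[i];
            if (throwAt < acc) return i;
        }
        return fitness.length - 1; // guard against rounding at the edge
    }

    public static void main(String[] args) {
        double[] fitness = {9.05, 9.77, 8.86};
        int[] counts = new int[3];
        Random rnd = new Random(0);
        for (int i = 0; i < 10000; i++) counts[select(fitness, rnd)]++;
        // Each member is picked roughly in proportion to its fitness.
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```

Note that a member with zero fitness owns an empty wedge and is never selected, which is exactly the diversity risk described above.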
Elitist Strategy: This is the
decision to leave the best fit member of the population in the next generation,
unmodified, although this does not rule out the possibility of including genetic
reproductions of this member as well.
Copying the best of each generation to file: Looking back to the chart in Section 2, Literature Review, we acknowledge that there exists a reduction in quality when one allows the GP process to run for too many generations. To alleviate this, a copy of the member with the highest fitness value in each generation is written to file in a separate folder. The end process, where we test the evolved word aligners, tests the best fit member of each generation, not solely that of the final generation.
N-point crossover: Two common methods of crossover are single-point and n-point. In single-point crossover, a single point in the genome is chosen and two members have their code swapped at this point. N-point crossover allows this to happen at multiple points, and is much more suitable to crossover on trees, where we are not dealing with the traditional fixed-length representation as with bit strings.
Protected functions and variable types: To alleviate casting problems and protect against null pointer exceptions, all functions and variable types are protected; refer to the WordAligner class in the Appendix, which is the superclass of all other word aligner classes.
Confine alterations: To minimize the number of off-track members of the
population, alterations to the code are kept in the area where they matter the most. The
basic information needed by all word aligners is as is shown in the appendix for
random.java. Every word aligner should align each French word to some word in
English. Thus, only the body of the for loop is altered.
Halting problem: It may occur through mutation or crossover that infinite loops
are created [8:293-294]. To counter this, the fitness function makes use of threads and
halts after a reasonable amount of time has elapsed. This also serves as an additional
measure against excessive bloat.
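The timeout safeguard against non-halting candidates can be sketched with an ExecutorService. The time budget and the stub tasks are illustrative; the paper's implementation uses threads but its exact mechanism is not shown:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the halting safeguard: a candidate's alignSentencePair call
// runs on a worker thread and is abandoned if it exceeds a time budget,
// so mutated-in infinite loops cannot hang the fitness function.
public class TimeoutGuard {
    // Returns the task's result, or null if it did not finish in time.
    public static <T> T runWithTimeout(Callable<T> task, long millis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(task).get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            return null; // treat as a failed (worst-fitness) candidate
        } finally {
            pool.shutdownNow(); // interrupt the runaway worker
        }
    }

    public static void main(String[] args) {
        // A well-behaved candidate finishes; an "infinite loop" is cut off.
        Integer ok = runWithTimeout(() -> 42, 1000);
        Integer loop = runWithTimeout(() -> {
            while (!Thread.currentThread().isInterrupted()) { /* spin */ }
            return -1;
        }, 200);
        System.out.println(ok + " " + loop); // 42 null
    }
}
```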
5. Results
As can be seen in the example code, mutation never worked as desired. Nearly every mutation results in erroneous code. It seems the large majority are casting issues (in the figure here, D3 is a double and S1 is a string, while H5 is a HashMap). Those mutated programs which do compile include useless code, such as incrementing variables used nowhere else in the program.

Figure 7: Some Mutation Errors Still Mishandled.

Without mutation, the rest of the genetic programming process is little more than a search over the different orderings of the statements within the for loop, which holds no possible word aligners not already visible at first glance.

Figure 8: Initial Code and Code with "Better" Result.
Multiple runs each proved futile. As can be seen in the figure above, the best word aligner returned is only slightly improved in terms of performance. Looking at the code, it appears to be a fluke, due to the off-chance that using the number of English words in place of the number of French words tends to give better recall results (since English sentences are generally shorter than their French counterparts, fewer proposed alignments allow for fewer erroneous guesses).
Indeed, generations proved of no use. In effect, the code as is merely stares at effectively the same program each generation, since crossover alone is not enough to introduce variety, and the initial population is not truly random (not with the mutation process in its current state). The table below shows the evaluation results of the member of each generation with the highest fitness.
Gen. Precision Recall AER Fitness
0 0.3658 0.2258 0.6864 9.0520
1 0.3658 0.2258 0.6864 9.0520
2 0.3535 0.2909 0.6678 9.7660
3 0.3658 0.2258 0.6864 9.0520
4 0.3658 0.2258 0.6864 9.0520
5 0.3658 0.2258 0.6864 9.0520
6 0.3935 0.1889 0.6966 8.8580
7 0.3658 0.2258 0.6864 9.0520
8 0.3658 0.2258 0.6864 9.0520
9 0.3658 0.2258 0.6864 9.0520
10 0.3658 0.2250 0.6864 9.0520
11 0.3658 0.2250 0.6864 9.0520
12 0.3658 0.2250 0.6864 9.0520
13 0.3658 0.2250 0.6864 9.0520
Figure 9: Table of Fitness Function Results on a GP run
6. Conclusion
In addition to deciphering the process of mutating code in a meaningful way,
there are a few other tricks which I did not have time to experiment with, but think may
prove useful.
With regards to premature convergence, it would be interesting to add a feature
whereby the mutation rate is raised greatly for a generation to promote an increase in
diversity, triggered by a low standard deviation in the fitnesses.
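That proposed trigger can be sketched directly: compute the standard deviation of the population's fitnesses and boost the mutation rate for the next generation when it drops below a threshold. The threshold and boosted rate below are illustrative values of my own:

```java
// Sketch of the proposed diversity trigger: when the standard deviation
// of the population's fitnesses drops below a threshold, the mutation
// rate is raised for a generation. Threshold and rates are illustrative.
public class DiversityTrigger {
    public static double stddev(double[] fitnesses) {
        double mean = 0;
        for (double f : fitnesses) mean += f;
        mean /= fitnesses.length;
        double var = 0;
        for (double f : fitnesses) var += (f - mean) * (f - mean);
        return Math.sqrt(var / fitnesses.length);
    }

    // Returns the mutation rate to use for the next generation.
    public static double nextMutationRate(double[] fitnesses,
                                          double baseRate, double boostedRate,
                                          double stddevThreshold) {
        return stddev(fitnesses) < stddevThreshold ? boostedRate : baseRate;
    }

    public static void main(String[] args) {
        double[] converged = {9.052, 9.052, 9.052, 9.052};
        double[] diverse   = {9.05, 9.77, 8.86, 7.10};
        System.out.println(nextMutationRate(converged, 0.0005, 0.05, 0.1)); // 0.05
        System.out.println(nextMutationRate(diverse,   0.0005, 0.05, 0.1)); // 5.0E-4
    }
}
```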
Since I was not able to get mutation working, I never actually started off with random.java. I instead started off with the likeness of the file seen in Figure 8. From what I have seen, it seems that initializing from this file yields a population lacking in diversity, even with higher mutation rates in the initial generations. It would be interesting to run a comparison between initializing from this file and from random.java.
It is acknowledged widely that the process of finding correct values for mutation
and crossover rates, for the number of generations, for the population size, and choosing
a heuristic function are all decisions which are still made by trial and error. Studying the
exact effects of raising and lowering each of these values will consume quite an amount
of time, but is vital before much more work can be done in the area of genetic
programming in general.
Bibliography
1. Manning and Schütze. Foundations of Statistical Natural Language Processing,
pg. 484
2. Automatic Construction of a Bilingual Lexicon:
wwwhome.cs.utwente.nl/~irgroup/align/
3. Ghanea-Hercock. Applied Evolutionary Algorithms in Java
4. Genetic-Programming.Org: www.genetic-programming.org/
5. Grune, Bal, Jacobs, and Langendoen. Modern Compiler Design, pg. 9, 22, 52-55
6. Langdon and Poli. Foundations of Genetic Programming
7. Rothlauf. Representations for Genetic and Evolutionary Algorithms
8. Banzhaf, Nordin, Keller, and Francone. Genetic Programming: An Introduction
9. Gamma, Helm, Johnson, and Vlissides. Design Patterns
10. Visitor Pattern: http://en.wikipedia.org/wiki/Visitor_pattern
11. Assignment 4: Word Alignment Models:
www.cs.berkeley.edu/~klein/cs294-5/cs294-5%20assignment%204.pdf
Figures
(original unless otherwise noted)
1. Abstract Syntax Tree in graphical and textual forms.
2. Example of Bloat.
3. After fitness is reached, overfitting to the training data may occur. [source:
Schmiedle F, Drechsler N, Grosse D, Drechsler R. “Heuristic learning based on
genetic programming.” Genetic Programming & Evolvable Machines, Vol. 3,
Dec. 2002, pg 376]
4. Example of Over-Fitting
5. Flow Chart of GP Approach. [chart source: Sette S, Boullart L. “Genetic
programming: principles and applications.” Engineering Applications of
Artificial Intelligence, Vol. 14, Dec. 2001, pg 728]
6. Proportionate Fitness Selection
7. Some Mutation Errors Still Mishandled
8. Initial Code and Code with "Better" Result.
9. Table of Fitness Function Results
Appended Code
Crossover: Takes two statements whose parents are of the same type (for/for, while/while, etc.). The Eclipse AST toolkit requires that each node belong to a particular tree, so simply swapping subtrees between trees is not possible; instead, we clone each subtree under its new owner with the static copySubtree(targetAST, sourceNode) method.
private void crossover (int index1, Statement switch1,
                        int index2, Statement switch2) {
  CompilationUnit cu1 = newPop[index1];
  CompilationUnit cu2 = newPop[index2];
  AST ast1 = cu1.getAST();
  AST ast2 = cu2.getAST();
  ASTNode p1 = switch1.getParent();
  ASTNode p2 = switch2.getParent();
  Statement switch1_under_ast2 =
      (Statement) ASTNode.copySubtree(ast2, switch1);
  Statement switch2_under_ast1 =
      (Statement) ASTNode.copySubtree(ast1, switch2);
  switch (p1.getNodeType()) {
    case ASTNode.BLOCK:
      List m1 = ((Block) p1).statements();
      List m2 = ((Block) p2).statements();
      m1.set(m1.indexOf(switch1), switch2_under_ast1);
      m2.set(m2.indexOf(switch2), switch1_under_ast2);
      break;
    case ASTNode.IF_STATEMENT:
      if (switch1.getLocationInParent().getId()
          .equals("elseStatement")) {
        ((IfStatement) p2).setElseStatement(switch1_under_ast2);
        ((IfStatement) p1).setElseStatement(switch2_under_ast1);
      } else {
        ((IfStatement) p2).setThenStatement(switch1_under_ast2);
        ((IfStatement) p1).setThenStatement(switch2_under_ast1);
      }
      break;
    case ASTNode.WHILE_STATEMENT:
      ((WhileStatement) p2).setBody(switch1_under_ast2);
      ((WhileStatement) p1).setBody(switch2_under_ast1);
      break;
    case ASTNode.FOR_STATEMENT:
      ((ForStatement) p2).setBody(switch1_under_ast2);
      ((ForStatement) p1).setBody(switch2_under_ast1);
      break;
    default:
      throw new RuntimeException("unhandled crossover for nodeType: "
          + p1.getNodeType());
  }
}
Mutation: Uses the Visitor Pattern and extends org.eclipse.jdt.internal.corext.dom.GenericVisitor with Mutator to implement mutation. Mutator is a file much too long to display here. The essentials are that it randomly changes register names and values in the code, as well as occasionally inserting newly generated lines of code and making calls to safely defined methods (that is to say, a divide that checks for division by zero, etc.).
public void mutate(int index) {
  float seed = random.nextFloat() * interchangeableTable.size();
  Mutator mutator = new Mutator(seed, numRegisters);
  CompilationUnit cu = newPop[index];
  cu.accept(mutator);
}
WordAligner parent class: The following is edited for length; redundant and obvious methods have been abbreviated. The Statistics object contains data from an initial pass over the corpus beforehand, gathering data such as is used in unsupervised learning: Pr(f), Pr(e), Pr(f, e).
public class WordAligner {
protected WordAligner (Statistics s) {
statistics = s;
}
public Alignment alignSentencePair(SentencePair s) {
return null;
}
public float prob_f(String f) {
return (float) statistics.prob_f(f);
}
public float prob_e(String e) {
return (float) statistics.prob_e(e);
}
public float prob_e_and_f(SentencePair s, String f, String e) {
return (float) statistics.prob_f_and_e(s,f,e);
}
public List getFrenchWords (SentencePair s) {
return s.getFrenchWords();
}
public List getEnglishWords (SentencePair s) {
return s.getEnglishWords();
}
public float abs (float i) {
return Math.abs(i);
}
public float numFrenchWordsInSentence (SentencePair s) {
return s.getFrenchWords().size();
}
public float numEnglishWordsInSentence (SentencePair s) {
return s.getEnglishWords().size();
}
public float getSentenceID (SentencePair s) {
return s.getSentenceID();
}
public boolean addAlignment(float englishPosition,
float frenchPosition, boolean sure) {
int e = Math.round(englishPosition);
int f = Math.round(frenchPosition);
alignment.addAlignment(e, f, sure);
return true;
}
/** GET methods **/
public String getString(List L, float i) {
if (L== null || L.size() == 0)
return "";
if (i >= L.size())
i = L.size()-1;
if (i < 0)
i = 0;
return (String) L.get(Math.round(i));
}
Also: getBoolean, getNumber
/** ADD methods **/
public boolean addString (List L, float i, String o) {
if (L == null)
L = new ArrayList();
if (i >= L.size())
i = L.size()-1;
if (i < 0)
i = 0;
L.add(Math.round(i), o);
return true;
}
Also: addBoolean, addNumber
/** FIELDS **/
public LinkedList L1 = new LinkedList();
public LinkedList L2 = new LinkedList();
public LinkedList L3 = new LinkedList();
public LinkedList L4 = new LinkedList();
public LinkedList L5 = new LinkedList();
public float N1 = 0; public float N2 = 0;
public float N3 = 0; public float N4 = 0;
public float N5 = 0; public float N6 = 0;
public float N7 = 0; public float N8 = 0;
public float N9 = 0; public float N0 = 0;
public boolean B1 = true; public boolean B2 = true;
public boolean B3 = true; public boolean B4 = true;
public boolean B5 = true;
public String S1 = ""; public String S2 = "";
public String S3 = ""; public String S4 = "";
public String S5 = "";
public Alignment alignment = new Alignment();
public static Statistics statistics;
}
An extension of WordAligner class: This is the base class for the random initialization.
Several instances of this class are made, then subjected to many generations at a higher
than normal mutation rate. Mutation occurs within the for-loop.
public class random extends WordAligner {
  public Alignment alignSentencePair(SentencePair sentencePair) {
    alignment = new Alignment();
    N1 = numEnglishWordsInSentence(sentencePair);
    N2 = numFrenchWordsInSentence(sentencePair);
    for (N3 = 0; N3 < N2; N3++) {
      B5 = addAlignment(N4, N3, true);
    }
    return alignment;
  }
  public random(Statistics s) {
    super(s);
  }
}