Corpus Linguistics and Lexicography *

                                                    WOLFGANG TEUBERT




Corpus Linguistics—More Than a Slogan?

During the last decade, it has been common practice among the linguistic
community in Europe, both on the continent and in the British Isles, to
use corpus linguistics to verify the results of classical linguistics. In North
America, however, the situation is different. There, the Philadelphia-based
Linguistic Data Consortium, responsible for the dissemination of language
resources, is addressing the commercially oriented market of language engi-
neering rather than academic research, the latter often being more interested
in universal grammar or semantic universals than in the idiosyncrasies of
natural languages. American corpus linguists such as Doug Biber or Nancy
Ide and general linguists who are corpus users by conviction such as Charles
Fillmore are almost better known in Europe than in the United States, which
is even more astonishing when we take into account that the first real corpus
in the modern sense, the Brown Corpus, was compiled in Providence, R.I.,
during the sixties.
     Meanwhile, European corpus linguistics is gradually becoming a
subdiscipline in its own right. Unfortunately, during the last few years, this
has led to a slight bias towards ‘self-centred’ issues such as the problems
of corpus compilation, encoding, annotation and validation, the procedures
needed for transforming raw corpus data into artificial intelligence applica-
tions and automatic language processing software, not to mention the problem
of standardisation with regard to form and content (cf. the long-term project
EAGLES [Expert Advisory Group on Language Engineering Standards] and

        INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS Vol. 6(special issue), 2001. 125–153
                                                       John Benjamins Publishing Co.

the transatlantic TEI [Text Encoding Initiative]). Today, these issues often
tend to prevail over the original gain that the analysis of corpora may con-
tribute to our knowledge of language. But it was exactly this corpus-specific
knowledge that the first generation of European corpus linguists such as Sture
Allén, Vladislav Andrushenko, Stig Johansson, Ferenc Kiefer, Bernard Quemada,
Helmut Schnelle, or John Sinclair had in mind. In West Germany, the
Institut für Deutsche Sprache was among the first institutions that considered
the collection of corpus data one of their permanent tasks; its corpora date
back as early as the late sixties, although at that time most corpus data was
still only used for the verification of research results gained from traditional
methods. But has today’s corpus linguistics really advanced from there?
      The recent textbooks claiming to provide an introduction to corpus lin-
guistics still do not add up to more than a dozen—all of them in English.
Unfortunately, except for the commendable books of Stubbs 1996 and Biber,
Conrad and Reppen 1998, they do deplorably little to establish corpus lin-
guistics as a linguistic discipline in its own right. Instead, they are focussing
on the use of corpora and corpus analysis in traditional linguistics (syn-
tax, lexicology, stylistics, diachrony, variety research) and applied linguistics
(language teaching, translation, language technology). Corpus Linguistics by
Tony McEnery and Andrew Wilson (McEnery and Wilson 1996) may serve
as an example of this kind. Forty pages describe aspects of encoding;
20 pages deal with quantitative analysis; 25 pages describe the usefulness of
corpus data for computational linguistics; another 30 pages cover the use
of corpora in speech, lexicology, grammar, semantics, pragmatics, discourse
analysis, sociolinguistics, stylistics, language teaching, diachrony,
dialectology, language variation studies, psycholinguistics, cultural
anthropology and social psychology; and the final 20 pages contain a case study
on sublanguages and closure. McEnery and Wilson’s book reflects the current
state of corpus linguistics. In fact, it more or less corresponds to the topics
covered at the annual meetings held by the venerable ICAME, an association
dealing with English language corpora (cf. Renouf 1998). Semantics is largely
left aside.
      Surprisingly, when judged by their commercial value, it is not the written
language corpora that are most successful, but rather speech corpora that can
claim the highest prices. Speech corpora are special collections of some care-
fully selected text samples (words, phrases, sentences) spoken by numerous
different speakers under various acoustic conditions. They caused the final
breakthrough in automatic speech recognition that computer models based
on cognitive linguistics failed to achieve for many years. The recognition
of speech patterns was only made possible by a combination of categorial
and probabilistic approaches towards a connectionist model trained on large
speech corpora. Speech analysis can thus be seen as an early impetus
for the establishment of corpus linguistics as an independent discipline with
its own theoretical background.
      Lexicography is the second major field where corpus linguistics not
only introduced new methods but also extended the entire scope of research,
albeit without putting much emphasis on the theoretical aspects of
corpus-based lexicography. Here again, it was John Sinclair who led the
way as initiator of the first strictly corpus-based dictionary of general lan-
guage (COBUILD 1987). Britain was also the site of the first corpus-based
collocation dictionaries (such as Kjellmer 1994). Bilingual lexicography may
also benefit from a corpus-oriented approach: a fact that is evident when
comparing the traditional Le Robert & Collins English-French Dictionary
edited by B.T.S. Atkins with Valerie Grundy and Marie-Hélène Corréard’s
Oxford-Hachette Dictionary, which covers the same language pair. Here, the
use of (monolingual) corpora led to a remarkably greater number of multi-word
translation units (collocations, set phrases) and to context profiles that
had been written with the target language in mind. Wörter und Wortgebrauch
in Ost und West [Words and Word Usage in East and West Germany] (1992)
by Manfred W. Hellmann may serve as the only German example of that era,
using the corpus for lemma selection rather than semantic description. Only
recently, in 1997, did a true corpus-based dictionary appear: Schlüsselwörter
der Wendezeit [Keywords during German Unification] by Dieter Herberg,
Doris Steffens and Elke Tellenbach.
      Thus, at least in the field of written language, corpus linguistics is still in
its infancy as a discipline with its own theoretical background—a statement
which holds true not only for Germany but also for most other European
countries. In this orientation phase, where corpus linguistics is still in the
process of defining its position, most publications are in English, the language
that has become the interlingua of the modern world. But this does not mean that
corpus linguistics is dominated mainly by English and American scholars:
this can be clearly seen when browsing through any issue of the International
Journal of Corpus Linguistics. Still, German linguistics appears somewhat
underrepresented in this discussion. One exception is Hans Jürgen Heringer.
His innovative study on ‘distributive semantics’ shows a growing reception
of the programme for corpus linguistics which is outlined below. In his book
Das höchste der Gefühle [The most sublime of feelings] (Heringer 1999), he
describes the validation of semantic cohesion between adjacent words on the
basis of larger corpora. Above all, it is this area between lexis and syntax
where corpus linguistics offers new insights.


Corpus Linguistics—A Programme

Corpus linguistics believes in structuralism as defined by John R. Firth; there-
fore, it insists on the notion that language as a research object can only be
observed in the form of written or spoken texts. Neither language-independent
cognition nor propositional logic can provide information on the nature of
natural languages. For these are, as stated in an apophthegm by Mario Wan-
druszka, characterised by a mixture of analogy and anomaly. The quest for a
universal structure of grammar and lexicon which is typical for the follow-
ers of Chomsky or Lakoff cannot meet the demands of these two aspects.1
Instead, corpus linguistics is closer to the semantic concept inherent in the
continental European structuralism of Ferdinand de Saussure, which regards
the meaning as inseparable from the form, that is, the word, the phrase,
the text. In this theory, the meaning does not exist per se. Corpus linguis-
tics rejects the ubiquitous concept of the meaning being ‘pure information,’
encoded into language by the sender and decoded by the receiver. Corpus
linguistics, instead, holds that content cannot be separated from form; rather,
they constitute the two aspects under which texts can be analysed. The word,
the phrase, the text is both form and meaning.
      The above statement clearly outlines the programme of corpus linguis-
tics. It is mainly interested in those phenomena on the fringe between syntax
and lexicon, the two subjects of classical linguistics. It deals with the pat-
terns and structures of semantic cohesion between text elements that are
interpreted as compounds, multi-word units, collocations and set phrases. In
these phenomena, the importance of the context for the meaning becomes
evident.
      Corpus linguistics extends our knowledge of language by combining
three different approaches: the (procedural) identification of language data
by categorial analysis, the correlation of language data by statistical methods
and finally the (intellectual) interpretation of the results. Whilst the first two
steps should be done automatically as much as possible, the last step requires
human intentionality, as any interpretation is an act involving consciousness
and, therefore, not transmutable into an algorithmic procedure. This is the
main difference between corpus linguistics and computational linguistics,
which reduces language to a set of procedures.
      Corpus linguistics assumes that language is a social phenomenon, to be
observed and described above all in accessible empirical data—as it were,
communication acts. Corpora are cross-sections through a universe of dis-
course which incorporates virtually all communication acts of any selected
language community, be it monolingual (e.g., German or English), bilingual
(e.g., South Tyrolean, Welsh) or multilingual (e.g., Western European). How-
ever, the majority of texts that are preserved and made accessible through
corpora in principle only have a limited life-span: most printed texts such as
newspaper texts are out of public reach within a very short time.
      If we consider language as a social phenomenon, we do not know—
and do not want to know—what is going on in the minds of the people,
how the speaker or the hearer understands the words, sentences and
texts that they speak or hear. Language as a social phenomenon manifests
itself only in texts that can be observed, recorded, described and analysed.
Most texts happen to be communication acts, that is, interactions between
members of a language community. An ideal universe of discourse would be
the sum of all communication acts ever uttered by members of a language
community. Therefore, it has an inherent diachronic dimension. Of course,
this ideal universe of discourse would be far too large for linguistics to
explore it in its entirety. It would have to be broken down into cross-sections
with regard to the phenomena that we want to describe. There is no such
thing as a ‘one-size-fits-all’ corpus. It is the responsibility of the linguist to
limit the scope of the universe of discourse in such a way that it may be
reduced to a manageable corpus, by means of parameters such as language
(sociolect, terminology, jargon), time, region, situation, external and internal
textual characteristics, to mention just a few.
      When we look at language as a social phenomenon, we assume that
meaning is expressed in texts. What a text element or text segment means is
the result of negotiation among the members of a language community, and
these negotiations are also part of the discourse. Thus, the language com-
munity sets the conventions on the formal correctness of sentences and on
their meaning. Those conventions are both implicit and dynamic; they are not
engraved in stone like commandments. Any communication act may utilise
syntactic structures in a new way, create new collocations, introduce new
words or redefine existing ones. If those modifications are used in a suffi-
cient number of other communication acts or texts, they may well result in the
modification or amendment of an existing convention. One basic difference
between natural and formal languages is the fact that natural language not
only permits but actually integrates metalinguistic statements without explic-
itly marking the metalinguistic level. There is no separation between object
language and metalanguage. Any convention may be discussed, questioned
or even rejected in a text. Above all, discourses deal with meaning, and it
is corpus linguistics that is best suited to deal with this dynamic aspect of
meaning.
     We, as linguists, have no access to the cognitive encoding of the con-
ventions of a language community. We only know what is expressed in texts.
Dictionaries, grammars, and language textbooks are also texts; therefore, they
are part of the universe of discourse. As long as they represent socially ac-
cepted standards, we have to consider their special status. Still, their contents
are neither comprehensive nor always based on factual evidence. Corpus
linguistics, on the other hand, aims to reveal the conventions of a certain
language community on the basis of a relevant corpus. In a corpus, words
are embedded in their context. Corpus linguistics is, therefore, especially
suited to describe the gradual changes in meaning: it is the context which
determines the concrete meaning in most areas of the vocabulary.


Cognitive Linguistics, Logical Semantics and Corpus Linguistics

People normally—if they are not linguists, that is—listen to or read texts
because of their meaning. They are interested in the syntactic features of
phrases, sentences or texts only insofar as is necessary for understanding
them. Meaning is the core feature of natural language, and this is the reason
why semantics is the central linguistic discipline. Still, regardless of the
enormous progress that phonology, syntax and many other disciplines have
made, when it comes to explaining and describing the meaning of phrases,
sentences, and texts, we are far from a consensus.
      As said above, corpus linguistics regards language as a social phenom-
enon. This implies a strict division between meaning and understanding. Is
it really the task of linguistics to investigate how the speaker and the listener
understand the words, sentences or texts that they utter or perceive? Un-
derstanding is a psychological, a mental, or—in modern words—a cognitive
phenomenon. This is why no bond exists between cognitive linguistics and
corpus linguistics. Language as a social phenomenon is laid down in texts
and only there. If we, as corpus linguists, wish to find out how a text is
understood, we have to ask the listeners for paraphrases; these paraphrases,
being texts themselves, again become part of the discourse and can become
the object of linguistic analysis.
      The difference between cognitive linguistics and corpus linguistics lies
in how each deals with the unique property of language to signify. Any text
element is inevitably both form (expression) and meaning. If you delete the
form, the meaning is deleted as well. There is no meaning without form,
without an expression. Text elements and segments are symbols, and being
symbols, linguistic signs, they can be analysed in principle under two aspects:
the form aspect or the meaning aspect. The consequence of this stance is that
the only way to express the meaning of a text element or a text segment
is to interpret it, that is, to paraphrase it. This is the stance of hermeneutic
philosophy, as opposed to analytic philosophy (cf. Keller 1995, Jäger [2000]).
      In cognitive linguistics, which is embedded in analytic philosophy, meaning
and understanding are seen as one. Here, text elements and text segments
correspond to conceptual representations on the mental level. Within this
system, however, it is not clear what the term ‘representation’ means. Does
it refer to content linked with a form (what we could call presentations) or
does it refer to pure content disconnected from form (what we could call
ideations)? This ambiguity is of vast consequence (Janik and Toulmin 1973:
133), as presentations themselves are signs, that is, symbols, and thus need
to be understood, that is, interpreted. Cognitive linguistics, however, does not
tell us how this is to happen. Rather, it describes the manipulation of mental
representations as a process (whereas an interpretation is an act, presuppos-
ing intentionality). Processes themselves are meaningless. It is only the act
of interpretation that assigns meaning to them. Both Daniel Dennett and John
Searle point out this aporia of the cognitive approach. In their opinion, the
mental processes would again require a central meaner (Dennett 1998: 287f.)
or homunculus (Searle 1992: 212f.) on a level higher than cognition, that is,
for understanding mental representations, and the same would then apply for
that level, too, and so on, ad infinitum.
     On the other hand, if we translate ‘representation’ with ‘ideation,’ we
dismiss the assumption of the symbolic character of language. The meaning of
a word, a sentence or a text would then correspond to something immaterial,
something without form, formulated in a so-called ‘mental language,’ whose
elements would consist of either complex or atomistic concepts, depending on
whether one refers to Anna Wierzbicka and the early Jerry Fodor (Wierzbicka
1996, Fodor 1975) or to the later Jerry Fodor (Fodor 1998). On a large scale,
these concepts of cognitive linguistics seem to correspond to words, but the
difference lies in the fact that they are not material symbols which call for
interpretation, but instead they are pure astral ideation, not contaminated by
any form (cf. Teubert 1999).
     In practice, particularly in artificial intelligence and automatic transla-
tion, this cognitive approach has failed. Alan Melby gave a plausible ex-
planation why it was due to fail no matter which formal language had been
defined for encoding the conceptual representations: “The real problem could
be that the language-independent universal sememes we were looking for do
not exist. . . [O]ur approach to word senses was dead wrong.” (Melby 1995:
48.)
     It seems that the idea behind cognitive linguistics is the transduction or
translation of phrases, sentences and texts in natural language, that is, of sym-
bolic units, into an obviously language-independent ‘language of thought’ or
‘mentalese,’ which is non-symbolic and is exclusively defined by syntax.
This transduction or translation is seen as a process and does not involve in-
tentionality. Cognitive linguistics is committed to the computational model of
mind. According to this theory, mental representations are seen as structures
consisting of what is called uninterpreted symbols, while mental processes are
caused by the manipulation of these representations according to rule-based,
that is, exclusively syntactic, algorithms. But does it really make sense to
use the term ‘symbols’ for these mental representation units, just as we call
words ‘linguistic signs’? On a cognitive (or computative) level, those entities
are only symbols inasmuch as a content can be assigned to them from
the outside of the mental (or computational) calculus. This content or mean-
ing, however, does not affect the permissibility of manipulations with regard
to their representation. The content of a text consisting of linguistic signs, on
the other hand, is something inherent to the text itself (and not assigned from
the outside), a feature we can and must investigate if we want to make sense
of a text. As Rudi Keller has pointed out, the symbols of natural language
are suitable for and in need of interpretation (Keller 1995).
      What appeals to many researchers of semantics is the fact that in cogni-
tive semantics the meaning of a text is expressed through a calculation whose
expressions are based exclusively on syntactic rules, or in other words, that
semantics is transformed into syntax. They take it for granted that this is
possible, as they claim that both natural and formal languages work
with symbols. But in natural language, these symbols need to be interpreted
whereas symbols in formal languages work without being assigned a cer-
tain (external) definition. Whether a formal language, a calculus, permits a
certain permutation of symbols or not has nothing to do with the meaning
or the definition of these symbols; it is just a question of syntax. As early
as 1847, George Boole stated: “Those who are acquainted with the present
state of the theory of Symbolic Algebra, are aware that the validity of the
processes of analysis does not depend upon the interpretation of the symbols
which are employed, but solely upon the laws of their combination.” Richard
Montague also believes in the possibility of describing natural language se-
mantics the same way as formal language semantics: “There is in my opinion
no important theoretical difference between natural languages and the artificial
languages of logicians; indeed, I consider it possible to comprehend
the syntax and semantics of both kinds of languages within a single natural
and mathematically precise theory. On this point I differ from a number of
philosophers, but agree, I believe, with Chomsky and his associates.” (Both
quotes from Devlin, 1997: 73 and 117.)
      From the point of view of corpus linguistics, the meaning of natural
language symbols, of text elements or text segments is negotiated by the
discourse participants and can be found in the paraphrases they offer, and it
is contained in language usage, that is, in context patterns. Natural language
symbols refer not so much to language-external facts, but rather they create
semantic links to other language signs. The meaning of a text segment is the
history of the use of its constituents.
      Linguistic signs always require interpretation. Whoever understands a
text is able to interpret it. This interpretation can be communicated as a text
in itself, a paraphrase of the original text. The act of interpretation requires
intentionality, and therefore, cannot be reduced to a rule-based, algorithmic,
‘mathematically precise’ procedure. If we see language as a social phenom-
enon, natural language semantics can leave aside the mental or cognitive
level. Everything that can be said about the meaning of words, phrases or
sentences will be found in the discourse. Anything that cannot be paraphrased
in natural language has nothing to do with meaning. In a nutshell, this is the
core programme that distinguishes corpus linguistics from cognitive linguis-
tics.


Collocation and Meaning

In traditional linguistics, it is rather difficult to pinpoint the difference be-
tween a collocation such as harte Auseinandersetzung (hefty discussion) and
a free combination such as harte Matratze (hard mattress). In corpus lin-
guistics, on the other hand, it is possible to trace this awareness among the
members of a language community of a distinct semantic cohesion between
the lexical elements of a collocation by statistical means, that is, by detecting
a significant co-occurrence of these elements within a sufficiently large
corpus. Before it was possible to procedurally and systematically process
large amounts of language data, syntactic rules had been the only way to
describe the complex behaviour of co-occurrence between textual elements
(i.e., words). Such rules describe the relation between different classes of ele-
ments, for instance, between nouns and modifying adjectives. Still, syntactic
descriptions such as ‘Adjective + Noun’ are not specific enough to detect
collocations as distinct types of semantic relationships. Traditional lexicology
fails to come up with a feasible definition for collocations that would allow
their automatic identification in a corpus. To classify certain co-occurring
textual elements as semantic units, that is, as collocations, it is necessary to
recognise these text segments as recurrent phenomena, which is only possible
within a sufficiently large corpus. Therefore, we must complement the intra-
textual perspective with its intertextual counterpart. By applying probabilistic
methods, it is possible to measure recurrence within a virtual universe of dis-
course, or more precisely, within a real corpus. Collocation dictionaries in the
strict sense are always corpus-based. Even so, the speaker’s competence is
still needed to check statistically determined collocation candidates for their
relevant semantic cohesion. The following case study aims to illustrate the
potential of the corpus linguistic approach:

Case study 1: hart as collocator

The collocation dictionary Kollokationswörterbuch Adjektive mit ihren
Begleitsubstantiven (Teubert, Kervio-Berthou and Windisch [in preparation]),
which is currently being compiled at the Institut für Deutsche Sprache,
is based on the IDS corpora of about 320 million words. The 400 adjectives
were mainly selected from basic vocabulary lists. Candidates for
collocations were combinations of adjectives and nouns showing a significantly
higher frequency than the frequency expected on the basis of the occurrence
of the individual words. The candidates are ranked according to
significance; their overall frequency thus has, in principle, no influence
on the rank. The concept for the statistical procedures applied here was
designed by Cyril Belica. It is up to the competent speaker to decide whether
sufficient lexical cohesion can be seen in the collocation candidates detected
by the computer. Manually selected citations are provided in order to
facilitate this interpretation. If a collocation candidate is translated into a
foreign language as a whole rather than word by word, this can
be seen as evidence of a distinct semantic cohesion; therefore, we have
added the English translation equivalents to our German examples. The
example below covers ranks 1 to 10 [for an explanation of the abbreviations see
http://www.ids-mannheim.de/kt/cosmas.html]:
    Kern Rank: 1 Frequency: 63 WKB In der Treuhand selbst hat sich ein harter
    Kern aus fr¨ heren SED-Betonk¨ pfen eingegraben. WKB Dennoch enthalte der
                 u                   o
    Bericht einen “harten Kern an Wahrheit.” H68 Die “Kommandoebene,” der
    harte Kern der RAF, umfaßt 25 bis 30 Mitglieder. H87 Der “harte Kern” um-
    faßt 187 Personen. H87 [. . . ] ein sicherer Hinweis, daß sich die Betreffenden
    dem harten Kern der RAF angeschlossen haben. H87 140 eingeschriebene
    Soulm¨ nner kamen regelm¨ ßig, ein harter Kern von 50 Jugendlichen fast
            a                     a
    t¨ glich. (Engl.: diehards/ hard core)
     a
    Arbeit Rank: 2 Frequency: 94 WKD In harter Arbeit haben wir unseren
                      ¨
    Staat aufgebaut. (Uberschrift) WKD Aber wir haben eben in dieser harten Ar-
    beit alle noch ein bißchen zu lernen. H85 [. . . ] Risikobereitschaft und harte
    Arbeit sollen sich in Malaysia wieder lohnen. H86 Mangelnde pers¨ nlicheo
    Ausstrahlung machte er durch harte Arbeit, eiserne Disziplin und Wil-
    lensst¨ rke wett. WKD Ein Sommer h¨ rtester Arbeit steht bevor. H85 Die
          a                                  a
    Technik macht es m¨ glich, den Menschen von harter und uberm¨ ßiger Arbeit
                        o                                       ¨     a
    auch zeitlich zu entlasten. (Engl.: hard work)
    W¨ hrung Rank: 3 Frequency: 40 WKB Die Deutschen w¨ rden nicht nur
      a                                                   u
    durch eine harte W¨ hrung vereint. WKB Harte W¨ hrung soll mangelnden
                      a                           a
136                             WOLFGANG TEUBERT

      Geist wettmachen. WKD Doch wundersam ist die Umwandlung der Ostmark
      in harte W¨ hrung allemal. H87 Dann w¨ re es endg¨ ltig vorbei mit dem Glanz
                 a                            a          u
      der einst h¨ rtesten W¨ hrung der Welt. (Engl.: hard currency)
                 a          a
      Schlag Rank: 4 Frequency: 24 BZK Das war f¨ r ihn ein harter Schlag. MK1
                                                    u
      Ich habe eine junge Mannschaft, die einen harten Schlag verkraften kann, ohne
      zu zerbrechen. MK2 Es war ein harter, gezielter Schlag, der mich prompt von
      den Beinen holte. (Engl.: heavy blow)
      Drogen Rank: 5 Frequency: 20 H88 Außerdem sei ein immer st¨ rker werden-
                                                                       a
      der Trend zu harten Drogen zu beobachten. H87 Kontakt zu harten Drogen
      hatte der Jugendliche bald bekommen [. . . ] (Engl.: hard drugs)
      Kritik Rank: 6 Frequency: 34 H86 Aber sie erfuhren schon damals von vielen
                                                                         ¨
      Seiten harte Kritik. MK2 Harte Kritik am Biedenkopf-Plan. (Uberschrift)
      H88 Zugleich ubte er harte Kritik an der Landesregierung [. . . ] (Engl.: harsh
                    ¨
      criticism)
      Bandagen Rank: 7 Frequency: 12 H86 Beide Seiten schlagen derweil mit
      harten Bandagen zu [. . . ] WKB Der Kampf um Berlin als Hauptstadt wird
                                     ¨
      mit harten Bandagen gef¨ hrt. (Uberschrift) (Engl.: taking one’s gloves off )
                                u
      Kampf Rank: 8 Frequency: 30 MK1 Amerika müsse notfalls auf einen langen
      harten Kampf vorbereitet sein. H86 Die meisten sehen zu, daß sie im harten
      Kampf um die Zehntel für sich das Beste rausholen. BZK Verkaufsförderung
      gewinnt immer mehr Bedeutung im harten Kampf um die Gunst der
      Verbraucher. WKD Für sie geht es jetzt nicht einfach um einen harten Kampf
      um Arbeitsplätze. (Engl.: close fight)
      D-Mark Rank: 9 Frequency: 22 WKB Dann bekämen die DDR-Bürger harte
      D-Mark in die Hand und würden drübenbleiben. WKB Nichts hat Vormarsch
      und Endsieg der harten D-Mark aufhalten können. WKD Die harte D-Mark
      dient als Schmiedehammer. (Engl.: strong Deutschmark)
      Worte Rank: 10 Frequency: 25 H85 Harte Worte - Berliner Verhältnisse?
      H86 Selbst Außenminister Shultz benutzte harte Worte. H85 [. . . der] erste
      Vorsitzende der Gesellschaft, findet nicht minder harte Worte, um den Bruch
      zu begründen [. . . ] (Engl.: bitter words)



Discourse and Meaning

One of corpus linguistics’ most essential tenets is the assumption that the
meaning of text elements and segments can be found solely in discourse.
This assumption makes sense if we call to mind that in principle, every word
or combination of words was once a neologism. Neologisms are introduced
to the discourse by explicitly assigning certain meanings to new expressions,
that is, by paraphrasing what a new word is supposed to mean. As stated
above, we can determine meaning in two ways: by paraphrase and by usage.
Neologisms, however, still lack the usage. They only become used once other
participants of the discourse start using them either by accepting the proposed
meaning or by negotiating the meaning by offering a new paraphrase. This
also applies to those cases where a new meaning is assigned to an already
existing word.
     It is obvious that we cannot go ‘back to the roots’ for all our established
vocabulary; also, this is not how children learn the meaning of words. But
even so, it is not simply the usage of words that leads to their meaning. In
most cases, an act of explanation, very often by the parents, but sometimes
also through picture-books, sets the starting point for language acquisition.
Obviously, deictic references to reality (or images thereof) are of the highest
importance, but they are not understood without narrative explanations of
words that describe what we have to watch out for in reality (or in images of
reality). The meaning of school, for instance, cannot be explained by pictures
of the building, classroom, teachers or pupils. In fact, only very few words
relate to images unambiguously. Picture-book texts play a more important
role with regard to the acquisition of word meanings than dictionaries.
     Since the times of the German lexicographers Adelung and Campe, the
basic principle of German lexicography has been the assumption that the
meaning of words can be found in text samples, a basic principle also for
corpus linguistics. Nevertheless, corpus linguistics differs from traditional
lexicography in various details. Firstly, corpus linguistics does not use cor-
pora merely for examples: it explores them systematically. Secondly, corpus
linguistics does not try to decontextualise the objects it describes. In other
words, it does not abstract the meaning from the context. Thirdly, corpus
linguistics tries to capture different usages in their correlation to different
contexts, unlike traditional lexicography which tries to position word mean-
ings upon a blueprint of a language-independent ontological concept (for
instance, by genus proximum and differentia specifica). Fourthly, corpus lin-
guistics is less interested in the single text element or word than in the
semantic interaction between text elements and context.
     The following case study of Globalisierung [globalisation] aims to
demonstrate that it is indeed the discourse (or in other words: our corpus)
where information about the meaning of words can be found. The reason why
we all seem to know the meaning of Globalisierung as it is used currently
is the fact that we all have read those texts that explain Globalisierung. We
cannot depict Globalisierung, any more than we can point at it. In its current
use, Globalisierung is certainly a neologism. It is characteristic of the
introductory phase of new words that the first citations show a large number
of paraphrases, a fact that demonstrates the role of the discourse participants
in negotiating meaning.


Case study 2: Globalisierung

Globalisierung (Engl.: globalisation) as a non-lexicalised derivation has been,
for a long time, part of our vocabulary. Its semantic vagueness is indicative
of its non-lexicalised status. As nomen actionis or nomen resultativum, it
has long been nothing more than the nominalisation of globalisieren. The
presence of descriptive attributes is indicative of its lack of semantic
specification; metalingual indicators (like paraphrases), on the other hand, are
almost totally absent. The following examples were found in the German
daily Tageszeitung:
      Die Vorstellung [. . . ] der Globalisierung der Kleistschen Verzückung [. . . ]
      scheint mir denn doch eher märchenhaft. [14.10.89]
      Aber die Globalisierung von Politik, Ökonomie und Technologie dulde keinen
      partikularen Bezugspunkte mehr [. . . ] [05.06.92]
      Mit der Globalisierung der Lebensweise der modernen Zivilisation geht die
      Selbstaufhebung der [. . . ] Ideale und Grundüberzeugungen einher. [25.02.95]

As a neologism, Globalisierung manages to almost completely displace the
original, non-lexicalised derivation only as late as 1996. Suddenly, there is
a distinct rise in frequency: whereas we have only about 160 citations from
1988 to the end of 1995, there are about 320 citations for 1996 alone. Also,
most citations come without descriptive attributes: apparently, it is no longer
necessary to explain what is being globalised. Finally, many citations show
metalingual indicators (below printed in italics) that demonstrate how the
discourse participants take part in assigning a meaning to the word, namely,
the following examples:
      Die “Globalisierung” – ein etwas unscharfer Begriff, mit dem zugleich die
      Ausweitung des Handels, die Liberalisierung der Finanzmärkte, der Sieg der
    Freiheitsideologie, die unkontrollierte Macht der multinationalen Unternehmen,
    die Internationalisierung des Arbeitsmarktes und die Umstrukturierung der
    Volkswirtschaften gemeint ist – hat die Gewerkschaften weiter geschwächt.
    [12.01.96]
    Verbissener Konkurrenzkampf im Inneren und nach außen hin eine maximale
    Öffnung für Kapital, Güter und Dienstleistungen. So lautet eine der möglichen
    Definitionen der Globalisierung. [12.01.96]
    [. . . ] die Globalisierung, das heißt die vollständige Liberalisierung aller
    Märkte auf der Welt [. . . ] [10.05.96]
    Lisa Maza [. . . ] sieht die Globalisierung völlig anders: Sie sei eine Fortsetzung
    der Kolonialisierung mit anderen Mitteln – zum Nachteil des Südens, der
    Armen und der Frauen. [08.06.96]
    Stichwort Globalisierung: In einer globalen Wirtschaft wird es auf Dauer
    kein geschütztes Umfeld für die Wirtschaft irgendeines Landes mehr geben.
    [27.07.96]
    Globalisierung bedeutet auch die Europäisierung des Globus, Kolonialismus,
    ökonomischer und ökologischer Imperialismus. [04.05.96]
    Denn in der Tat bedeutet Globalisierung Amerikanisierung, und zwar nicht nur
    der Weltwirtschaft, sondern auch eine normative Amerikanisierung. [11.10.96]
    Das Stichwort “Maastricht” und das Modewort “Globalisierung” sind zu Synonymen
    für sozialen Rückschritt geworden. [18.10.96]
    Typischerweise schweigen die Intellektuellen in Deutschland beharrlich zu Eu-
    ropa, Globalisierung und Zukunft der Arbeit [. . . ] [13.12.96]

This is a brief list of comparable English citations taken from the Bank of
English and shortened:
    What does globalisation mean? The term can happily accommodate all manner
    of things: expanding international trade, the growth of multinational business,
    the rise in international joint ventures and increasing interdependence through
    capital flows.
    Globalisation: Low wages in other countries contribute to low wages in the
    United States.
    Words like globalisation and outsourcing are now in common use.
    Watkins sees globalisation as a euphemism for a race to maximise profit by
    lowering workers’ pay and condition.
    As Mr. Keegan says, globalisation means that tax cuts for business are crucial.
    Globalisation represents an attempt to exploit South Korea’s enormous potential.
      But doesn’t globalisation mean world-wide sameness?
      Globalisation is still more a philosophy than a business reality.
      Globalisation comes in many flavours.

More so than other words, neologisms show that the meaning of words is
to be found in the texts rather than in some discourse-external reality. The
citations—be it in their virtual entirety within the universe of discourse or
be it in some cross-section in a real corpus—are the meaning, and we may
understand this meaning by interpreting the citation.
      The formulation of a dictionary entry for globalisation, however, is the
responsibility of lexicography, not of corpus linguistics, whose main task—
apart from finding the references—would instead be the correlation (by sys-
tematic context analysis) of the various sets of paraphrases and usage patterns
to different parameters such as text type (newspaper), genre (politics/society),
ideological stance and so on. Particularly in the area of ideologically contro-
versial keywords, it seems as if a useful selection of citations can be more
helpful to the user than traditional definitions.


Linguistic Knowledge and Encyclopaedic Knowledge

Corpus linguistics aims to analyse the meaning of words within texts, or
rather, within their individual context. First and foremost, words are text
elements, not lexicon or dictionary entries. Corpus linguistics is interested in
text segments whose elements exhibit an inherent semantic cohesion which
can be made visible through quantitative analyses of discourse or corpus
(Biber, Conrad and Reppen 1998).
     If the research focus is shifted from single words to text segments,
the distinction between linguistic and encyclopaedic knowledge gradually
becomes fuzzy. The word Machtergreifung (seizure of power), outside its
context, may be described as an incident where a certain group, previously
excluded from political influence, seizes the power by its own force and
without democratic legitimation. However, we will interpret text segments
such as braune Machtergreifung or die Machtergreifung im Jahre 1933 as
referring to the ‘seizure of power by the Nazis’ without hesitation. Is this
because these texts refer to an extralingual reality, to a language-independent
knowledge? Although the majority of linguists would agree with this assumption,
there may well be another, simpler explanation: we have learned from a
large number of citations that whenever braune Machtergreifung or
Machtergreifung im Jahre 1933 is mentioned, this refers to the seizure of power by the
Nazis and to nothing else. There is a co-occurrence between both expressions
that may result, for instance, in an anaphoric situation: the expressions are
paraphrases of each other.
     In the tradition of German lexicography, linguistic knowledge is sepa-
rated from encyclopaedic knowledge by the process of decontextualisation, by
the endeavour to describe the meaning of words unadulterated by the context
in which they occur. If we detach all references from their relevant context,
the isolated meaning remains. The different events of Machtergreifung that
are dealt with in texts are viewed as references to a discourse-external reality.
Corpus linguistics, on the other hand, is interested above all in the meaning
of textual segments displaying a distinct semantic cohesion. Machtergreifung
im Jahre 1933 is such a segment, and by projecting it upon our discourse (i.e.,
linguistic) knowledge, we are able to interpret it as ‘Nazi seizure of power’
without problem. If we are no longer limited to single words detached from
their contexts and if we do away with decontextualisation, we can give up
the distinction between linguistic and encyclopaedic knowledge. For
what we normally call encyclopaedic knowledge is in fact nothing but dis-
course knowledge. Everything we know and are able to know about the Nazi
seizure of power is based on texts. Although some may even have witnessed
one relevant incident or the other, their ability to interpret the whole course
of events as Machtergreifung is also based on texts from other persons. If
we reduce encyclopaedic knowledge to discourse knowledge, the distinction
disappears.
     Let us take a look at the example klassische Rollenverteilung (traditional
role allocation) (Spiegel 13, 1999: 128):
    Ein Zuhause wie ein Bilderbuchideal. Hier [. . . ] ist die klassische
    Rollenverteilung die Regel: Ein Elternteil kümmert sich um Haushalt und
    Kindererziehung, der andere verdient das Geld. Auch dieser traditionellen
    Familienvorstellung entspricht das Leben im Reihenhaus.
    [A home like a picture-book cliché. Here [. . . ] the traditional role allocation
    is still the rule: one parent takes care of the household and of bringing up the
    children, the other parent earns the family income. Also living in a terraced
    house contributes to this traditional image of family.]

Within the context of family/home, the meaning of the collocation klassis-
che Rollenverteilung in the above example corresponds exactly to the sen-
tence that may serve as definition: Ein Elternteil. . . [One parent. . . ]. Note
the sublime subversive touch that is present here, characteristic of so many
Spiegel articles: what seems to be a generally acceptable definition, actu-
ally shows an essential deviation from the traditional meaning of klassische
Rollenverteilung—it does not distinguish between male and female.
      The above example aptly illustrates challenges and achievements of cor-
pus linguistics. Firstly, it is not interested in the meaning of isolated words
outside their relevant contexts, but instead in the meaning of semantically
connected text segments, extracted from discourse or, in practice, from the
corpus. In the context of home and family, klassische Rollenverteilung can
be interpreted in different ways with regard to period and genre. If the above
Spiegel-definition becomes the accepted thing, we may apply the term klas-
sische Rollenverteilung even to gay or lesbian partnerships. For corpus lin-
guistics, this implies a dynamic view of meaning. Every new reference may
add to the meaning of a certain text segment; older meanings may fall into
oblivion if they are not sanctioned by new evidence. The above example also
shows that the ways in which meaning can be negotiated within the language
community can be controversial indeed. Not so long ago, lesbian
partnership and family were two distinct meanings that could not be imagined,
let alone used, synonymously. Corpus linguistics may thus serve as a
useful instrument to detect changes of meaning that are essential to neology.
      Secondly, corpus linguistics is developing devices for the identification
and extraction of potentially metalinguistic elements of citations, that is, of
text elements co-occurring with a paraphrase, thus enabling the automatic
extraction, processing and presentation of semantically relevant material from
corpora. Phrases such as something is the rule; x means y; this is to say;
we understand it as; it can be said etc. point to metalingual content. If
the meaning of a semantically controversial textual segment is negotiated,
we often find indicators such as: some time ago; in fact; strictly speaking;
without doubt; wrongly etc. These indicators can give us important clues.
Above all, it must be realised that just as the meaning of a text segment
is a paraphrase found in earlier citations, people’s interpretations are also
paraphrases and therefore part of the discourse. In principle, the meaning
of a text element or a text segment is everything that has been said about
it, in terms of a paraphrase or as a matter of usage; it is the result of the
negotiation of the meaning within the discourse community. Indeed this is
the difference between natural language words and technical terms. Technical
terms are defined by experts, and their meaning is restricted to that definition
(and thus, is discourse-external). For instance, if a tree meets the criteria
for elm-trees listed in the expert’s definition, it is rightly called an elm-
tree no matter what the citations say. Any terminological definition is—at
least in principle—an algorithmic instruction for the usage of the relevant
term. This explains why it is possible to automatically translate technical
texts, provided they are monosemous and only use specialist vocabulary.
Lexicographic definitions, on the other hand, are interpretations of citations,
that is, results of intentional acts. They cannot automatically be processed
from corpus citations, because every citation can be interpreted in various
different ways. Therefore, an automatic translation of general language texts
is not feasible.
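     The indicator phrases quoted above can be searched for mechanically, even though the interpretation of the hits remains a human task. The following is a minimal sketch in Python, assuming that simple regular expressions are an adequate encoding of the phrases; the pattern lists and the function names (`find_indicators`, `candidate_paraphrases`) are illustrative assumptions and not the devices actually under development. The sample citations are taken from the Bank of English examples given earlier.

```python
import re

# Indicator phrases quoted in the text. Their encoding as regular
# expressions is an assumption made for this sketch, not a tested inventory.
METALINGUAL = [
    r"\bis the rule\b",
    r"\bmeans?\b",
    r"\bthis is to say\b",
    r"\bwe understand it as\b",
    r"\bit can be said\b",
]
NEGOTIATION = [
    r"\bsome time ago\b",
    r"\bin fact\b",
    r"\bstrictly speaking\b",
    r"\bwithout doubt\b",
    r"\bwrongly\b",
]


def compile_patterns(patterns):
    """Join the phrase patterns into one case-insensitive expression."""
    return re.compile("|".join(patterns), re.IGNORECASE)


METALINGUAL_RE = compile_patterns(METALINGUAL)
NEGOTIATION_RE = compile_patterns(NEGOTIATION)


def find_indicators(citation):
    """Return the indicator phrases found in one citation, grouped by type."""
    return {
        "metalingual": [m.group(0).lower() for m in METALINGUAL_RE.finditer(citation)],
        "negotiation": [m.group(0).lower() for m in NEGOTIATION_RE.finditer(citation)],
    }


def candidate_paraphrases(citations):
    """Keep only the citations that contain at least one indicator."""
    hits = []
    for citation in citations:
        found = find_indicators(citation)
        if found["metalingual"] or found["negotiation"]:
            hits.append((citation, found))
    return hits


citations = [
    "What does globalisation mean? The term can happily accommodate all manner of things.",
    "As Mr. Keegan says, globalisation means that tax cuts for business are crucial.",
    "Globalisation comes in many flavours.",
]

for citation, found in candidate_paraphrases(citations):
    print(found, "|", citation)
```

     A filter of this kind only proposes candidates. A hit such as means may equally belong to a phrase like by all means, so every extracted citation still has to be read and interpreted.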
     Thirdly, corpus linguistics uses the context to distinguish between us-
ages. For example, the collocation klassische Rollenverteilung is not only
found in the family context but also at work or in society in general. Its
meaning differs according to the context.
     Fourthly, corpus linguistics is interested in larger units of meaning,
namely, in text segments. The traditional lexicographic practice of decon-
textualisation and isolation of single words impedes us from knowing the
meaning of larger units such as klassische Rollenverteilung. As a rule, the
meaning of text segments such as multi word units, collocations or set phrases
is far more specific than that of single words. The reason why traditional lin-
guistics is focussing on the single word, isolated from its context, can only
be explained by space constraints in the past, as it is impossible to list all
collocations and set phrases even in a dictionary consisting of several vol-
umes. But is klassische Rollenverteilung really a true collocation? Is corpus
linguistics really able to provide a credible validation of semantic cohesion?
Is the co-occurrence klassische Rollenverteilung more than a mere addition
of klassisch and Rollenverteilung? In a sufficiently large corpus, if the fre-
quency of klassische Rollenverteilung differs significantly from the statisti-
cally expected frequency of this combination, this can be seen as one sign
for possible collocation. Another sign would be the occurrence of a special
meaning that cannot be derived from the sum of the individual meanings
of the text elements. For instance, if we find six tokens of klassische Rol-
lenverteilung within the corpus although we would only expect three, given
the frequency of the constituents, and if they all suggest that one parent is
the wage-earner whereas the other is bringing up the children, then we may
regard this co-occurrence as collocation.
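     The comparison of observed and expected frequency described above can be sketched in a few lines of Python. The sketch assumes the simplest possible model, in which the expected number of adjacent co-occurrences is the product of the two word frequencies divided by the corpus size. The corpus size and the frequencies of the constituents are hypothetical figures, chosen only so that the expectation comes out at three as in the example; only the observed count of six is taken from the text. The sketch reports a ratio and does not perform a significance test.

```python
import math


def expected_cooccurrences(freq_a, freq_b, corpus_size):
    """Expected number of adjacent co-occurrences if the two words were
    distributed independently of each other across the corpus."""
    return freq_a * freq_b / corpus_size


def association(observed, expected):
    """Observed/expected ratio and its base-2 logarithm."""
    ratio = observed / expected
    return ratio, math.log2(ratio)


# Hypothetical counts, chosen only so that the expectation comes out at 3.
corpus_size = 10_000_000        # tokens in the corpus (assumption)
freq_klassische = 3_000         # tokens of klassische (assumption)
freq_rollenverteilung = 10_000  # tokens of Rollenverteilung (assumption)
observed = 6                    # tokens of the combination, as in the text

expected = expected_cooccurrences(freq_klassische, freq_rollenverteilung, corpus_size)
ratio, log_ratio = association(observed, expected)
print(f"expected={expected:.1f} observed={observed} ratio={ratio:.1f} log2={log_ratio:.1f}")
```

     The frequency ratio addresses only the first of the two signs mentioned above. Whether all six tokens share the special meaning is a question that no count can answer.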
     Finally, corpus linguistics considers meaning as a feature of language, of
text elements, segments, and texts, and not as an external feature existing only
in the human mind or in reality. The meaning of klassische Rollenverteilung
in the context of family is represented in texts, and only there; it is not the
reflection of a non-textual external reality that we could point our fingers at.
There is no meaning outside language, outside the discourse. We know what
globalisation means today, because we have read the texts that explain it, but
we cannot see globalisation.


Multilingual Corpus Linguistics

When translating a text into another language, we paraphrase the source
text. The translation represents the meaning of the original text just like a
paraphrase within the source language. Translation requires understanding
and thus intentionality. Only if we understand a text can we interpret or
even paraphrase it. This implies that different translations will yield different
versions of the same text, which again shows that translation or paraphrasing
cannot be reduced to algorithmic procedures.
      The universe of discourse, containing all texts ever translated along with
their translations, is the empirical base for multilingual corpus linguistics. It
is a virtual universe, and it can be realised by multilingual parallel corpora (or
a collection of bilingual parallel corpora). Parallel corpora consist of source
texts along with their translations into other languages, whereas reciprocal
parallel corpora contain the source texts in two languages along with their
translations into the target languages.
      Just as in monolingual corpus linguistics, meaning is also seen as a
strictly linguistic (or better, textual) term here. Meaning is paraphrase. The
entire meaning of a text segment within a multilingual universe of discourse
is enclosed in the history of all translation equivalents of the segment.
      The translation unit, that is, the text segment completely represented by
the translation equivalent, is the base unit of multilingual corpus semantics.
Translation units, consisting of a single word or of several words, are the
minimal units of translation. If they consist of several words, they are trans-
lated as a whole and not word by word. Therefore, translation equivalents
correspond to the text segments of monolingual corpus linguistics.
     Within the framework of multilingual corpus linguistics, we assume that
the meaning of translation units is contained in their translation equivalents
in other languages. This corresponds to the base assumption of corpus lin-
guistics, which does not regard semantic cohesion as something fixed but
as belonging to a large spectrum reaching from inalterable units to text seg-
ments whose elements can be varied, expanded or omitted. Identifying these
translation units (or text segments) again involves interpretation. The transla-
tion shows us whether a given co-occurrence of words is a single translation
equivalent or a combination of them, that is, merely a chain of text elements.
This leads to two consequences. What can be seen as an integral translation
equivalent in one target language may be a simple word-by-word transla-
tion in another. This may even be the case within a single target language,
depending on the stylistic preferences of different translators. In fact, it is
the community of translators (along with the translation critics) who in their
daily practice decide what is the translation equivalent, just as the monolin-
gual language community decides what is a text segment.
     The definition of a translation unit therefore depends both on the target
language and the common practice of translation. A virtual text segment is a
translation unit only in respect to those languages into which it is translated as
a whole. Translation units and their equivalents are not metaphysical entities;
they are the contingent results of translation acts. According to the analysis
of parallel corpora, more than half of the translation units are larger than
the single word—another example of how corpus linguistics may help to
investigate the nature of text segments.
     The meaning of a translation unit is its paraphrase, that is, the translation
equivalent in the target language. For ambiguous translation units, this im-
plies that there are as many meanings to the unit as there are non-synonymous
translation equivalents. If the phenomenon of meaning is thus operationalised,
the meaning of a translation unit depends on the selected target language. A
given translation unit in language A may have two non-synonymous equiv-
alents in language B, but three non-synonymous equivalents in language C.
     Let us look at an example. The English word sorrow (a translation unit
consisting only of a single word) will usually be translated into French by
one of the three equivalents chagrin, peine or tristesse; the first two, chagrin
and peine, are obviously synonymous in a variety of contexts. They both
point at a cause for this emotion and, therefore, are sometimes interchange-
able with deuil (‘loss,’ the term for the cause). Tristesse, on the other hand, is
the variety of sorrow which is not caused by a special incident. In German,
there are also three standard equivalents for sorrow, namely, Trauer (caused
by loss), Kummer (caused by an adverse incident, intense and usually lim-
ited in duration) and finally Gram (caused by unhappiness resulting from
an incident, not very intense, more a disposition than a feeling, but often of
long duration). Those three German equivalents are neither synonymous with
nor corresponding to the three French equivalents. By the way, the differ-
ent senses of sorrow usually found in English monolingual dictionaries and
thesauri correspond to neither the French nor the German distinctions.
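     Counting the meanings of sorrow relative to each target language can be illustrated by a short Python sketch. It assumes that the synonymy judgements are supplied by a human analyst as pairs of equivalents; here chagrin and peine are treated as synonymous, which simplifies the statement that they are synonymous in a variety of contexts. The function name `group_equivalents` and the data layout are illustrative assumptions.

```python
def group_equivalents(equivalents, synonym_pairs):
    """Merge equivalents that were judged synonymous into groups.

    The number of resulting groups is the number of meanings the
    translation unit has with respect to this target language.
    """
    groups = {equivalent: {equivalent} for equivalent in equivalents}
    for a, b in synonym_pairs:
        if groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for member in merged:
                groups[member] = merged
    unique = []
    for group in groups.values():
        if group not in unique:
            unique.append(group)
    return unique


# Equivalents of sorrow as given in the text. The synonym pairs are
# human judgements entered by hand, not something the program decides.
french = group_equivalents(["chagrin", "peine", "tristesse"], [("chagrin", "peine")])
german = group_equivalents(["Trauer", "Kummer", "Gram"], [])

print("French:", len(french), french)
print("German:", len(german), german)
```

     The program merely does the bookkeeping. The decision that two equivalents are synonymous has to be taken beforehand and entered by hand, and it may differ from one context to the next.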
     The above example of sorrow shows that the concept of synonymy can-
not be expressed in an algorithm. To call two expressions synonymous re-
quires a prior understanding of their meaning, that is, an act of interpretation.
For instance, if we look at how the Greek verb proséuchomai in the first sentence
of Plato’s Republic is translated into English, we will find six different
equivalents in eight different translations of this book: to make my prayers,
to say a prayer, to offer up my prayers, to worship, to pay my devoirs and to
pay my devotions. We, as human beings, must decide whether we consider
the Greek verb ambiguous or just fuzzy and whether the relevant equivalents
can be seen as synonyms. This is something computers cannot do. The ex-
ample also shows that the concept of synonymy can only be applied locally,
referring to translation equivalents or text segments within a defined context.
Although we may assume that Plato’s contemporary audience considered the
verb proséuchomai as unambiguous within the above context, this is not the
case with native speakers of English, where there is no synonymy between
to make my prayers and to pay my devotions. It can be clearly seen that
meaning has a dynamic quality and also that the act of translation requires
intention and thus cannot be reduced to a mere procedure. We will never find
the correct German equivalent for sorrow or the correct English equivalent
for proséuchomai just by defining formal instructions for a machine. Before
we can translate texts and their elements, we must understand them.

Multilingual Corpus Linguistics in Practice

Neither a lexicon derived from a bilingual dictionary nor the supposedly
language-neutral conceptual ontologies applied within Artificial Intelligence
will solve the problem of machine translation of general language texts.
By now, this fact is acknowledged by the experts. Therefore, they focus
on the machine translation of texts written in a controlled documentation
language, which is a more or less formal language in which all technical terms
are defined unambiguously along with a syntax that rejects all ambiguous
expressions as non-grammatical.
      General language texts written in natural languages cannot be translated
without interpretation. Here, multilingual corpus linguistics steers clear of this
obstacle in an elegant way. Unlike disciplines such as Artificial Intelligence
and Machine Translation, which are based on cognitive linguistics, it does
not try to model and emulate mental processes, but instead tries to support
the translator by processing parallel corpora. They contain the practice of
previous human translation. In these corpora, those translation equivalents
that are proven to be reliable and accepted will, in the long run, outweigh
equivalents that have been dismissed as inadequate. If, for instance, proséuchomai
is translated as to make my prayers three times out of eight, it may well be
assumed that it is an accepted—albeit not the ideal—equivalent within the
given context.
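     How a parallel corpus lets accepted equivalents outweigh the others can be shown with a simple count. The Python sketch below assumes eight translations of the Greek verb. Only the figure of three out of eight for to make my prayers is taken from the text; the distribution of the remaining five translations over the other equivalents listed earlier is an assumption made for the sake of illustration, and `rank_equivalents` is a hypothetical helper name.

```python
from collections import Counter

# One entry per English translation of the passage (eight in total).
# Only the count of three for "to make my prayers" comes from the text;
# the single occurrences of the other equivalents are assumed.
translations = [
    "to make my prayers",
    "to make my prayers",
    "to make my prayers",
    "to say a prayer",
    "to offer up my prayers",
    "to worship",
    "to pay my devoirs",
    "to pay my devotions",
]


def rank_equivalents(equivalents):
    """Rank the translation equivalents by how often translators chose them."""
    counts = Counter(equivalents)
    total = sum(counts.values())
    return [(equivalent, n, n / total) for equivalent, n in counts.most_common()]


for equivalent, n, share in rank_equivalents(translations):
    print(f"{equivalent:24} {n} of {len(translations)} ({share:.1%})")
```

     The ranking records what translators have done so far. It supports the choice of the human translator and does not replace it.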
      Parallel corpora are translation repositories. They link translation units
with their equivalents. As first studies have shown (Steyer and Teubert 1998),
we may assume that 90 percent of all translation units along with their rel-
evant equivalents may be found in a carefully compiled corpus of about 20
million words per language, provided that the text to be translated is suffi-
ciently close to the corpus with regard to text type and genre.
      Multilingual corpus linguistics does not pretend to solve the problem of
machine translation of general language. But it may help the human translator
in finding a suitable equivalent for the unit to be translated more efficiently
than traditional bilingual dictionaries, because it includes the context even in
those cases where the translation equivalent is not a syntagmatically defined
collocation but a certain textual element within a sequence. The goal is to
select from among all given elements the one whose contextual profile is
closest to that of the textual segment to be translated.

Case study 3: The translation into German of sorrow and grief

For the two words sorrow and grief, we find three common non-
synonymous German translation equivalents: Trauer, Kummer and Gram.
An analysis of the contexts of all references of these German words
as found in the IDS corpora, based on a method designed by Cyril
Belica (see http://www.ids-mannheim.de/cgi-bin/idsforms/cosmas-www-client),
gives us the context profiles listed below. In our
example, the number of neighbouring words (i.e. span) has been restricted to
5 words on each side. The context profiles given below have been slightly
edited for the sake of clarity.
      Context profile for Trauer: Wut, Angst, Betroffenheit, Schmerz, Tod,
Bestürzung, Freude, Hoffnung, Verzweiflung, Scham; tragen, empfinden; tief,
groß-
      Context profile for Kummer: Sorgen, Schmerz, Leid, Seele, Freude,
Stress, Ärger, Not; bereiten, machen, gewohnt/gewöhnt sein; viel, groß-
      Context profile for Gram: Leid, Hass, Bitterkeit, Scham; sterben;
gebeugt, lauter, voll-
      In an English-German parallel corpus we would distinguish between
three groups of translations for sorrow and grief: the first group would contain those
cases where sorrow or grief is translated by Trauer; the second group where
it is translated by Kummer, and finally, the third group where it is translated
by Gram. For each of the above cases, we could compute a context profile
similar to the ones quoted above for the German words from the IDS corpus.
We may assume that the context profile for sorrow and grief, as taken from
the parallel corpus, in the case of the translation equivalent Kummer, will not
differ much from the context profile for Kummer extracted from the German
reference corpus, apart from it being in English instead of German.
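
      As a sketch of how these three groups might be profiled, assume that each hit in the parallel corpus is available as a triple: the English sentence as a list of tokens, the position of sorrow or grief in it, and the German equivalent chosen by the translator. This input format, the function name profiles_by_equivalent and the three invented sentences are assumptions for illustration; no such corpus was available for this study.

```python
from collections import Counter, defaultdict

def profiles_by_equivalent(aligned_hits, span=5):
    """Build one English context profile per German translation equivalent.

    `aligned_hits` is an iterable of (tokens, node_index, equivalent) triples,
    a hypothetical format for hits extracted from a parallel corpus.
    """
    profiles = defaultdict(Counter)
    for tokens, i, equivalent in aligned_hits:
        # Words within `span` tokens to the left and right of the node word.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        profiles[equivalent].update(window)
    return profiles

# Invented sentences standing in for parallel-corpus hits.
hits = [
    ("her grief and anger after his death".split(), 1, "Trauer"),
    ("a life full of stress and sorrow".split(), 6, "Kummer"),
    ("he died of grief and bitterness".split(), 3, "Gram"),
]
profiles = profiles_by_equivalent(hits)
```

      Each entry of profiles is then an English context profile for one German equivalent, which could be compared with the corresponding German profile from the reference corpus.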
      Unfortunately, a sufficiently large English-German parallel corpus
that would allow the extraction of English context profiles for German
translation equivalents on the basis of recurrence is not yet available. As an
alternative, I have searched the Bank of English for those instances of sor-
row and grief whose contexts are similar to our context profiles for Trauer,
Kummer and Gram. So far these results are not thoroughly convincing: one
reason is the different composition of the IDS corpora compared to the Bank
of English which results in a clear imbalance of the German and English
instances with regard to text type and genre; also, the search criteria for the
English contexts have been too narrow, and last but not least, sorrow and
grief along with their German counterparts Trauer, Kummer, and Gram be-
long to an area of vocabulary which is highly culture-specific and is almost
impossible to reduce to a common denominator.
     Still, the following instances taken from the Bank of English show that,
in practice, the approach to detecting equivalents outlined above
will function to some extent. The words in square brackets are the German
equivalents of the context words contained within the context profiles.
     (1) Trauer
         So on the night of the crucifixion I place Simon in the home in
         Bethany of Mary called Magdalene and her sister Maria. I envision
         a scene in which trauma, grief, anger [Wut], and despair
         [Bestürzung] were all present, to say nothing of fear [Angst].
     (2) Kummer
         She enjoys her job though it is full of stress [Stress], sorrow and
         never-ending challenges.
     (3) Gram
         The terrible affliction [Leid] that has fallen so suddenly upon our
         unhappy country fills and monopolises my thoughts. My soul
         is full of grief and bitterness [Bitterkeit] and hate [Hass] and
         vengeance.
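
     The matching illustrated by examples (1) to (3) can be sketched as a simple overlap count. The sketch assumes that each German profile has been glossed in English; here only the glosses given in square brackets above are used, so the profiles are deliberately incomplete. The function name best_equivalent and the rule of returning nothing when there is no overlap or a tie are assumptions for this illustration, not a description of an existing tool.

```python
def best_equivalent(context_words, glossed_profiles):
    """Return the candidate whose profile shares most words with the context.

    Ties and empty overlaps yield None, since the context then carries no
    selection-relevant information.
    """
    scores = {
        candidate: len(set(context_words) & profile)
        for candidate, profile in glossed_profiles.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == 0 or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]

# English glosses limited to those given in square brackets in examples (1)-(3).
glossed_profiles = {
    "Trauer": {"anger", "despair", "fear"},
    "Kummer": {"stress"},
    "Gram": {"affliction", "bitterness", "hate"},
}

context = "my soul is full of grief and bitterness and hate and vengeance".split()
choice = best_equivalent(context, glossed_profiles)
```

     For the context of example (3), two profile words (bitterness, hate) match the profile of Gram and none match the others, so Gram is selected.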
     Although matching the context of the element to be translated against
the context profiles of all possible equivalents may suggest a method for the
automatic selection of suitable equivalents, this only works in those cases
where we have clear selection-relevant contextual information at our disposal.
As stated above, this is not always the case, especially if the text element to be
translated is referring to earlier instances within the same text. In these cases,
we may assume that, provided the intratextual continuity is sufficiently high,
the text element (sorrow or grief in our example) can always be translated by
the same equivalent with regard to the target language, be it Trauer, Kummer
or Gram. In most cases, whenever a word with a fuzzy, strongly context-
dependent meaning appears in a text for the first time, the information needed
for the specification of its meaning will be found within the context. Later
instances of the word within the text often tend to omit this information
as redundant. Within a text, we must find one or two references where a
suitable translation equivalent is indicated by the context profile and apply
the result to the other instances. This shows that it is imperative to only
include complete texts in the corpus.
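
The propagation step described above can be sketched as follows. The sketch assumes that each occurrence of the word in a text has already been assigned either a suggested equivalent or nothing (None) where its context is uninformative, and that intratextual continuity holds, so that one equivalent is valid for the whole text. The function name propagate_equivalent and the majority rule used when several clear instances disagree are assumptions made for this example.

```python
from collections import Counter

def propagate_equivalent(per_instance_choices):
    """Apply the equivalent found for the clear instances to all instances.

    `per_instance_choices` lists, in text order, the equivalent suggested by
    the context of each occurrence, or None where the context is uninformative.
    Assumes intratextual continuity, i.e. one equivalent per text.
    """
    decided = Counter(c for c in per_instance_choices if c is not None)
    if not decided:
        return list(per_instance_choices)  # nothing to propagate
    text_level_choice = decided.most_common(1)[0][0]
    return [text_level_choice] * len(per_instance_choices)

# Hypothetical text: only the first occurrence has a telling context.
resolved = propagate_equivalent(["Kummer", None, None, None])
```

Such a procedure only works on complete texts, because the one informative occurrence may lie far from the instance to be translated.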


Future Prospects

Corpus linguistics sees itself not in opposition to but as a complement of tra-
ditional linguistics. Corpus linguistics helps to make us aware not only of the
interaction between text element and context but also of text segments, that
is, larger, flexible units whose elements are semantically linked in a certain
way: multi-word-units, collocations, set phrases. It explains the repeated co-
occurrence of text elements as a discourse phenomenon that can be explored
by statistical means, and it makes those co-occurrence patterns visible by a
combination of quantitative and categorial devices.
      The investigation of the context enables us to better cope with words
displaying fuzzy meanings, words of the ‘Thespian vocabulary,’ as John Sin-
clair called them (Sinclair 1996), by generating context profiles as presented
above on the basis of sufficiently large corpora. Especially when combin-
ing these context profiles with those citations containing a paraphrase of the
meaning or aspects thereof (cf. our case study of globalisation), this may lead
to descriptions of meaning enabling the user to participate in the discourse.
      Corpus linguistics distinguishes between text segments on the one hand
and text elements embedded in context on the other, depending on how
they can be described. Context profiles are only statistically defined. Within
a context profile, there is no such thing as an obligatory element that is
indispensable within the context of a citation. The lexical constituents of text
segments, however, can be defined either as indispensable or as optional.
But there is still another difference between the text element with its context
profile and the text segment: the latter is defined not only on a lexical but
also on a syntactic level. The collocation Kummer gewöhnt ceases to be a
collocation as soon as the verb gewöhnt sein is replaced by gewöhnen: Er
hatte sich an seinen Kummer gewöhnt is not a collocation. The same applies
to collocations such as geheimer Kummer, Kummer bereiten, Kummer und
Sorgen. If we change the syntagma or even just the word order (for example,
into Sorgen und Kummer), the words lose their collocational character.
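
     The difference between the two kinds of description can be made concrete in a small sketch. Statistical co-occurrence is checked within a window and in any order, whereas a text segment is approximated here, as a simplifying assumption, by a contiguous sequence in a fixed order. Real syntactic constraints, such as the difference between gewöhnt sein and sich gewöhnen an, would need a parser and are not modelled. Both function names are hypothetical.

```python
def is_cooccurrence(tokens, a, b, span=5):
    """True if `a` and `b` occur within `span` tokens of each other, any order."""
    positions_a = [i for i, t in enumerate(tokens) if t == a]
    positions_b = [i for i, t in enumerate(tokens) if t == b]
    return any(0 < abs(i - j) <= span for i in positions_a for j in positions_b)

def is_text_segment(tokens, pattern):
    """True if `pattern` occurs as a contiguous sequence in exactly this order."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))

pattern = ["kummer", "und", "sorgen"]
original = "sie hatte kummer und sorgen".split()
reordered = "sie hatte sorgen und kummer".split()
```

     Both sentences satisfy the co-occurrence test for Kummer and Sorgen, but only the first contains the text segment Kummer und Sorgen.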
     During the last decades, we have witnessed a growing interest in seman-
tic cohesion, in the special semantic relations between words within sentences
and phrases, even in traditional linguistics. Among the relatively new con-
cepts are lexical solidarities, collocations, set phrases, valency, case roles,
semantic frames and scripts. They all try to demonstrate that language is
more than just the assembling of context-free words using semantics-free
rules. The co-occurrence patterns developed by corpus linguistics may help
to clarify heuristically the concept of text segments defined by semantic co-
hesion.
     When it comes to the identification of text segments, multilingual cor-
pus linguistics holds a privileged position. Within monolingual corpora, this
identification is an arduous task that can only be turned into an automatic
procedure by a painstaking combination of various procedures based on fre-
quencies, lists or rules. The use of parallel corpora makes it easier to identify
text segments (as translation units or equivalents), as they are the true prac-
tical results of interpretation and paraphrase. They show what usually takes
place within the minds of the speakers without leaving their traces in texts.
Parallel corpora, therefore, provide direct access to the translation practice of
human translators. If we assume that we may find the meaning of a textual
element through its paraphrase, which is also a text, then we may describe
parallel corpora as repositories for such paraphrases. Obviously, dictionaries
also attempt to list those paraphrases. However, since their size is limited,
they need to decontextualise and isolate the lexical units, whereas the para-
phrases of translators display the text elements embedded within their con-
texts, along with whole text segments. Parallel corpus evidence helps us to
trace the phenomenon of semantic cohesion.
     Meanwhile, with the availability of large corpora and improved software
for their exploration, corpus linguistics has become part of general lexicog-
raphy. Linguistics is gradually becoming more interested in larger units of
meaning and the use of context for their definition. Also, it is generally
accepted that the next generation of dictionaries, both monolingual and bilin-
gual, needs to be corpus-validated, if not entirely corpus-based. But there is
more to the corpus linguistic approach. By interactive procedures, the am-
bitious user should be able to have direct access to corpus evidence instead
of being confronted with the subjective findings provided by lexicographers.
Such a corpus platform would allow the members of the language community
to participate in the social activity of negotiating meanings in a committed
and informed way.


Notes

*     This contribution is a revised version of my article ‘Korpuslinguistik und Lexikographie’
      in Deutsche Sprache 4/99, pp. 292–313, translated into English by Norbert Volz.
1.    The rules that those followers of a universal grammar hope to find in their quest for the
      language organ are not based on deductions of analogy. Whereas rules based on innateness
      had been the central factor in Chomskyan language theory until recently (cf. Steven
      Pinker in The Language Instinct [Pinker 1994]), Pinker now sees the language faculty as an
      interaction between ‘distinct mental mechanisms’ which is not yet fully explored, namely,
      the ‘symbolic computation’ [i.e., the algorithmic processing of uninterpreted symbols]
      as opposed to the ‘memory’ [i.e., recollection], the latter being responsible for the as-
      signment of form and meaning of symbols (Pinker 1999). The memory is seen as partly
      associative—an appropriate term for its description could be ‘connectionist network’.
      However, Pinker still sees ‘symbolic computation’ as a strictly rule-based process. We
      may assume that this tentative change in attitude towards language faculty and the extent
      of its genetic embedding might be partly due to Terrence W. Deacon’s convincing ex-
      planation of first language acquisition which does without any language organ (Deacon
      1997).



References

Biber, Douglas; Conrad, Susan; Reppen, Randi. 1998. Corpus Linguistics. Investigating
   Language Structure and Use. Cambridge University Press.
Collins COBUILD. 1987. English Language Dictionary. Editor in Chief: John Sinclair.
Deacon, Terrence W. 1997. The Symbolic Species. New York: Norton.
Dennett, Daniel C. 1998. “Reflections on Language and Mind.” In: Peter Carruthers/
   Jill Boucher (Eds.): Language and Thought. Interdisciplinary Themes. Cambridge:
   Cambridge University Press, 284–294.
Devlin, Keith. 1997. Goodbye, Descartes. New York: Wiley.
Fodor, Jerry A. 1975. The Language of Thought. New York: Crowell.
Fodor, Jerry A. 1998. Concepts. Where Cognitive Science Went Wrong. Oxford: Clarendon
   Press.
Hellmann, Manfred W. 1992. Wörter und Wortgebrauch in Ost und West. Vol. 1–3.
   Tübingen: Narr.
Herberg, Dieter; Steffens, Doris; Tellenbach, Elke. 1997. Schlüsselwörter der Wendezeit.
   Wörter-Buch zum öffentlichen Sprachgebrauch 1989/90. Berlin: Walter de Gruyter.
Heringer, Hans Jürgen. 1999. Das höchste der Gefühle. Empirische Studien zur
   distributiven Semantik. Tübingen: Stauffenberg Verlag.
Jäger, Ludwig. 2000. “Die Sprachvergessenheit der Medientheorie. Ein Plädoyer für das
    Medium Sprache.” In: Werner Kallmeyer (Ed.): Sprache und neue Medien. Jahrbuch
    1999 des Instituts für Deutsche Sprache. Berlin/New York: de Gruyter, 9–30.
Janik, Allan; Toulmin, Stephen. 1973. Wittgenstein’s Vienna. New York: Simon &
    Schuster.
Keller, Rudi. 1995. Zeichentheorie. Tübingen: Francke.
Kjellmer, Göran. 1994. A Dictionary of English Collocations. Based on the Brown Corpus.
    Oxford: Clarendon Press.
Lenz, Susanne. 2000. Studienbibliographie Korpuslinguistik. Heidelberg: Groos.
McEnery, Tony; Wilson, Andrew. 1996. Corpus Linguistics. Edinburgh: Edinburgh
    University Press.
Melby, Alan K. 1995. The Possibility of Language. A Discussion of the Nature of
    Language with Implications for Human and Machine Translation. Amsterdam: John
    Benjamins.
The Oxford-Hachette French Dictionary. 1994. French-English/English-French.
    Marie-Hélène Corréard, Valerie Grundy (Eds.). Oxford: Oxford University Press.
Pinker, Steven. 1994. The Language Instinct. New York: William Morrow.
Pinker, Steven. 1999. “Regular habits. How we learn language by mixing memory and
    rules.” In: Times Literary Supplement, October 29, 1999, 11–13.
Renouf, Antoinette (Ed.). 1998. Working with Corpora. Selected Papers from the 18th
    ICAME Conference. Amsterdam: Rodopi.
Le Robert & Collins. 1993. Dictionnaire Français–Anglais/Anglais–Français. 4th Edition.
    Editor in Chief: Beryl S. Atkins.
Searle, John R. 1992. The Rediscovery of the Mind. Cambridge, Mass.: The MIT Press.
Sinclair, John M. 1996. “The Empty Lexicon.” In: International Journal of Corpus
    Linguistics 1(1): 99–120.
Steyer, Kathrin; Teubert, Wolfgang. 1998. “Deutsch-Französische Übersetzungsplattform.
    Ansätze, Methoden, empirische Möglichkeiten.” In: Deutsche Sprache 4(97): 343–359.
Stubbs, Michael. 1996. Text and Corpus Analysis. Oxford: Blackwell.
Teubert, Wolfgang. 1999. In: Modelle der Übersetzung – Grundlagen der Methodik.
    Frankfurt/M.: Lang, 118–135.
Teubert, Wolfgang; Kervio-Berthou, Valérie; Windisch, Eric. To be published.
    Kollokationswörterbuch Adjektive und ihre Begleitsubstantive.
Wierzbicka, Anna. 1996. Semantics. Primes and Universals. Oxford: Oxford University
    Press.

A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 

corpus linguistics and lexicography

  • 1. Corpus Linguistics and Lexicography *
    WOLFGANG TEUBERT
    Corpus Linguistics: More Than a Slogan?
    During the last decade, it has been common practice among the linguistic community in Europe, both on the continent and in the British Isles, to use corpus linguistics to verify the results of classical linguistics. In North America, however, the situation is different. There, the Philadelphia-based Linguistic Data Consortium, responsible for the dissemination of language resources, is addressing the commercially oriented market of language engineering rather than academic research, the latter often being more interested in universal grammar or semantic universals than in the idiosyncrasies of natural languages. American corpus linguists such as Doug Biber or Nancy Ide and general linguists who are corpus users by conviction such as Charles Fillmore are almost better known in Europe than in the United States, which is even more astonishing when we take into account that the first real corpus in the modern sense, the Brown Corpus, was compiled in Providence, R.I., during the sixties.
    Meanwhile, European corpus linguistics is gradually becoming a subdiscipline in its own right. Unfortunately, during the last few years, this has led to a slight bias towards ‘self-centred’ issues such as the problems of corpus compilation, encoding, annotation and validation, the procedures needed for transforming raw corpus data into artificial intelligence applications and automatic language processing software, not to mention the problem of standardisation with regard to form and content (cf. the long-term project EAGLES [Expert Advisory Group on Language Engineering Standards] and
    INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS Vol. 6 (special issue), 2001. 125–153. John Benjamins Publishing Co.
  • 2. the transatlantic TEI [Text Encoding Initiative]). Today, these issues often tend to prevail over the original gain that the analysis of corpora may contribute to our knowledge of language. But it was exactly this corpus-specific knowledge that the first generation of European corpus linguists such as Sture Allén, Vladislav Andrushenko, Stig Johansson, Ferenc Kiefer, Bernard Quemada, Helmut Schnelle, or John Sinclair had in mind. In West Germany, the Institut für Deutsche Sprache was among the first institutions that considered the collection of corpus data as one of their permanent tasks; its corpora date back as early as the late sixties, although at that time most corpus data was still only used for the verification of research results gained from traditional methods. But has today’s corpus linguistics really advanced from there?
    The recent textbooks claiming to provide an introduction to corpus linguistics still do not add up to more than a dozen, all of them in English. Unfortunately, except for the commendable books of Stubbs 1996 and Biber, Conrad and Reppen 1998, they do deplorably little to establish corpus linguistics as a linguistic discipline in its own right. Instead, they focus on the use of corpora and corpus analysis in traditional linguistics (syntax, lexicology, stylistics, diachrony, variety research) and applied linguistics (language teaching, translation, language technology). Corpus Linguistics by Tony McEnery and Andrew Wilson (McEnery and Wilson 1996) may serve as an example of this kind. Forty pages describe the aspects of encoding; 20 pages deal with quantitative analysis; 25 pages describe the usefulness of corpus data for computational linguistics, with another 30 pages covering the use of corpora in speech, lexicology, grammar, semantics, pragmatics, discourse analysis, sociolinguistics, stylistics, language teaching, diachrony, dialectology, language variation studies, psycholinguistics, cultural anthropology and social psychology; the final 20 pages contain a case study on sublanguages and closure. McEnery and Wilson’s book reflects the current state of corpus linguistics. In fact, it more or less corresponds to the topics covered at the annual meetings held by the venerable ICAME, an association dealing with English language corpora (cf. Renouf 1998). Semantics is mainly left aside.
    Surprisingly, when judged by their commercial value, it is not the written language corpora that are most successful, but rather speech corpora that can claim the highest prices. Speech corpora are special collections of carefully selected text samples (words, phrases, sentences) spoken by numerous different speakers under various acoustic conditions. They caused the final
  • 3. breakthrough in automatic speech recognition that computer models based on cognitive linguistics failed to achieve for many years. The recognition of speech patterns was only made possible by a combination of categorial and probabilistic approaches towards a connectionist model trained on large speech corpora. Speech analysis can thus be seen as an early impetus for the establishment of corpus linguistics as an independent discipline with its own theoretical background.
    Lexicography is the second major field where corpus linguistics not only introduced new methods but also extended the entire scope of research, albeit without putting much emphasis on the theoretical aspects of corpus-based lexicography. Here again, it was John Sinclair who led the way as initiator of the first strictly corpus-based dictionary of general language (COBUILD 1987). Britain was also the site of the first corpus-based collocation dictionaries (such as Kjellmer 1994). Bilingual lexicography may also benefit from a corpus-oriented approach, a fact that is evident when comparing the traditional Le Robert & Collins English-French Dictionary edited by B.T.S. Atkins with Valerie Grundy and Marie-Hélène Corréard’s Oxford-Hachette Dictionary, which covers the same language pair. Here, the use of (monolingual) corpora led to a remarkably greater number of multi-word translation units (collocations, set phrases) and to context profiles that had been written with the target language in mind. Wörter und Wortgebrauch in Ost und West [Words and Word Usage in East and West Germany] (1992) by Manfred W. Hellmann may serve as the only German example of that era, using the corpus for lemma selection rather than semantic description. Only recently, in 1997, did a true corpus-based dictionary appear: Schlüsselwörter der Wendezeit [Keywords during German Unification] by Dieter Herberg, Doris Steffens and Elke Tellenbach.
    Thus, at least in the field of written language, corpus linguistics is still in its infancy as a discipline with its own theoretical background, a statement which holds true not only for Germany but also for most other European countries. In this orientation phase, where corpus linguistics is still in the process of defining its position, most publications are in English, the language that has become the interlingua of the modern world. But this does not mean that corpus linguistics is dominated mainly by English and American scholars: this can be clearly seen when browsing through any issue of the International Journal of Corpus Linguistics. Still, German linguistics appears somewhat underrepresented in this discussion. One exception is Hans Jürgen Heringer.
  • 4. His innovative study on ‘distributive semantics’ shows a growing reception of the programme for corpus linguistics which is outlined below. In his book Das höchste der Gefühle [The most sublime of feelings] (Heringer 1999), he describes the validation of semantic cohesion between adjacent words on the basis of larger corpora. Above all, it is this area between lexis and syntax where corpus linguistics offers new insights.
    Corpus Linguistics: A Programme
    Corpus linguistics believes in structuralism as defined by John R. Firth; therefore, it insists on the notion that language as a research object can only be observed in the form of written or spoken texts. Neither language-independent cognition nor propositional logic can provide information on the nature of natural languages. For these are, as stated in an apophthegm by Mario Wandruszka, characterised by a mixture of analogy and anomaly. The quest for a universal structure of grammar and lexicon which is typical for the followers of Chomsky or Lakoff cannot meet the demands of these two aspects.1 Instead, corpus linguistics is closer to the semantic concept inherent in the continental European structuralism of Ferdinand de Saussure, which regards meaning as inseparable from form, that is, from the word, the phrase, the text. In this theory, meaning does not exist per se. Corpus linguistics rejects the ubiquitous concept of meaning being ‘pure information,’ encoded into language by the sender and decoded by the receiver. Corpus linguistics, instead, holds that content cannot be separated from form; rather, they constitute the two aspects under which texts can be analysed. The word, the phrase, the text is both form and meaning.
    The above statement clearly outlines the programme of corpus linguistics. It is mainly interested in those phenomena on the fringe between syntax and lexicon, the two subjects of classical linguistics. It deals with the patterns and structures of semantic cohesion between text elements that are interpreted as compounds, multi-word units, collocations and set phrases. In these phenomena, the importance of the context for the meaning becomes evident. Corpus linguistics extends our knowledge of language by combining three different approaches: the (procedural) identification of language data by categorial analysis, the correlation of language data by statistical methods
  • 5. and finally the (intellectual) interpretation of the results. Whilst the first two steps should be done automatically as much as possible, the last step requires human intentionality, as any interpretation is an act involving consciousness and, therefore, not transmutable into an algorithmic procedure. This is the main difference between corpus linguistics and computational linguistics, which reduces language to a set of procedures.
    Corpus linguistics assumes that language is a social phenomenon, to be observed and described above all in accessible empirical data (as it were, communication acts). Corpora are cross-sections through a universe of discourse which incorporates virtually all communication acts of any selected language community, be it monolingual (e.g., German or English), bilingual (e.g., South Tyrolean, Welsh) or multilingual (e.g., Western European). However, the majority of texts that are preserved and made accessible through corpora in principle only have a limited life-span: most printed texts such as newspaper texts are out of public reach within a very short time.
    If we consider language as a social phenomenon, we do not know, and do not want to know, what is going on in the minds of the people, how the speaker or the hearer understands the words, sentences and texts that they speak or hear. Language as a social phenomenon manifests itself only in texts that can be observed, recorded, described and analysed. Most texts happen to be communication acts, that is, interactions between members of a language community. An ideal universe of discourse would be the sum of all communication acts ever uttered by members of a language community. Therefore, it has an inherent diachronic dimension. Of course, this ideal universe of discourse would be far too large for linguistics to explore in its entirety. It would have to be broken down into cross-sections with regard to the phenomena that we want to describe. There is no such thing as a ‘one-size-fits-all’ corpus. It is the responsibility of the linguist to limit the scope of the universe of discourse in such a way that it may be reduced to a manageable corpus, by means of parameters such as language (sociolect, terminology, jargon), time, region, situation, external and internal textual characteristics, to mention just a few.
    When looking at language as a social phenomenon, we assume that meaning is expressed in texts. What a text element or text segment means is the result of negotiation among the members of a language community, and these negotiations are also part of the discourse. Thus, the language community sets the conventions on the formal correctness of sentences and on
  • 6. their meaning. Those conventions are both implicit and dynamic; they are not engraved in stone like commandments. Any communication act may utilise syntactic structures in a new way, create new collocations, introduce new words or redefine existing ones. If those modifications are used in a sufficient number of other communication acts or texts, they may well result in the modification or amendment of an existing convention. One basic difference between natural and formal languages is the fact that natural language not only permits but actually integrates metalinguistic statements without explicitly marking the metalinguistic level. There is no separation between object language and metalanguage. Any convention may be discussed, questioned or even rejected in a text. Above all, discourses deal with meaning, and it is corpus linguistics that is best suited to deal with this dynamic aspect of meaning.
    We, as linguists, have no access to the cognitive encoding of the conventions of a language community. We only know what is expressed in texts. Dictionaries, grammars, and language textbooks are also texts; therefore, they are part of the universe of discourse. As long as they represent socially accepted standards, we have to consider their special status. Still, their contents are neither comprehensive nor always based on factual evidence. Corpus linguistics, on the other hand, aims to reveal the conventions of a certain language community on the basis of a relevant corpus. In a corpus, words are embedded in their context. Corpus linguistics is, therefore, especially suited to describe the gradual changes in meaning: it is the context which determines the concrete meaning in most areas of the vocabulary.
    Cognitive Linguistics, Logical Semantics and Corpus Linguistics
    People normally (if they are not linguists, that is) listen to or read texts because of their meaning. They are interested in the syntactic features of phrases, sentences or texts only insofar as is necessary for understanding them. Meaning is the core feature of natural language, and this is the reason why semantics is the central linguistic discipline. Still, regardless of the enormous progress that phonology, syntax and many other disciplines have made, when it comes to explaining and describing the meaning of phrases, sentences, and texts, we are far from a consensus.
  • 7. As said above, corpus linguistics regards language as a social phenomenon. This implies a strict division between meaning and understanding. Is it really the task of linguistics to investigate how the speaker and the listener understand the words, sentences or texts that they utter or perceive? Understanding is a psychological, a mental, or, in modern words, a cognitive phenomenon. This is why no bond exists between cognitive linguistics and corpus linguistics. Language as a social phenomenon is laid down in texts and only there. If we, as corpus linguists, wish to find out how a text is understood, we have to ask the listeners for paraphrases; these paraphrases, being texts themselves, again become part of the discourse and can become the object of linguistic analysis.
    The difference between cognitive linguistics and corpus linguistics lies in how each deals with the unique property of language to signify. Any text element is inevitably both form (expression) and meaning. If you delete the form, the meaning is deleted as well. There is no meaning without form, without an expression. Text elements and segments are symbols, and being symbols, linguistic signs, they can be analysed in principle under two aspects: the form aspect or the meaning aspect. The consequence of this stance is that the only way to express the meaning of a text element or a text segment is to interpret it, that is, to paraphrase it. This is the stance of hermeneutic philosophy, as opposed to analytic philosophy (cf. Keller 1995, Jäger [2000]).
    In cognitive linguistics, which is embedded in analytic philosophy, meaning and understanding are seen as one. Here, text elements and text segments correspond to conceptual representations on the mental level. Within this system, however, it is not clear what the term ‘representation’ means. Does it refer to content linked with a form (what we could call presentations) or does it refer to pure content disconnected from form (what we could call ideations)? This ambiguity is of vast consequence (Janik and Toulmin 1973: 133), as presentations themselves are signs, that is, symbols, and thus need to be understood, that is, interpreted. Cognitive linguistics, however, does not tell us how this is to happen. Rather, it describes the manipulation of mental representations as a process (whereas an interpretation is an act, presupposing intentionality). Processes themselves are meaningless. It is only the act of interpretation that assigns meaning to them. Both Daniel Dennett and John Searle point out this aporia of the cognitive approach. In their opinion, the mental processes would again require a central meaner (Dennett 1998: 287f.) or homunculus (Searle 1992: 212f.) on a level higher than cognition, that is,
  • 8. for understanding mental representations, and the same would then apply for that level, too, and so on, ad infinitum.
    On the other hand, if we translate ‘representation’ with ‘ideation,’ we dismiss the assumption of the symbolic character of language. The meaning of a word, a sentence or a text would then correspond to something immaterial, something without form, formulated in a so-called ‘mental language,’ whose elements would consist of either complex or atomistic concepts, depending on whether one refers to Anna Wierzbicka and the early Jerry Fodor (Wierzbicka 1996, Fodor 1975) or to the later Jerry Fodor (Fodor 1998). On a large scale, these concepts of cognitive linguistics seem to correspond to words, but the difference lies in the fact that they are not material symbols which call for interpretation; instead they are pure astral ideation, not contaminated by any form (cf. Teubert 1999).
    In practice, particularly in artificial intelligence and automatic translation, this cognitive approach has failed. Alan Melby gave a plausible explanation why it was bound to fail no matter which formal language had been defined for encoding the conceptual representations: “The real problem could be that the language-independent universal sememes we were looking for do not exist. . . [O]ur approach to word senses was dead wrong.” (Melby 1995: 48.)
    It seems that the idea behind cognitive linguistics is the transduction or translation of phrases, sentences and texts in natural language, that is, of symbolic units, into an obviously language-independent ‘language of thought’ or ‘mentalese,’ which is non-symbolic and is exclusively defined by syntax. This transduction or translation is seen as a process and does not involve intentionality. Cognitive linguistics is committed to the computational model of mind. According to this theory, mental representations are seen as structures consisting of what are called uninterpreted symbols, while mental processes are caused by the manipulation of these representations according to rule-based, that is, exclusively syntactic, algorithms. But does it really make sense to use the term ‘symbols’ for these mental representation units, just as we call words ‘linguistic signs’? On a cognitive (or computational) level, those entities are only symbols inasmuch as a content can be assigned to them from outside the mental (or computational) calculus. This content or meaning, however, does not affect the permissibility of manipulations with regard to their representation. The content of a text consisting of linguistic signs, on the other hand, is something inherent to the text itself (and not assigned from
  • 9. the outside), a feature we can and must investigate if we want to make sense of a text. As Rudi Keller has pointed out, the symbols of natural language are suitable for and in need of interpretation (Keller 1995).
    What appeals to many researchers of semantics is the fact that in cognitive semantics the meaning of a text is expressed through a calculation whose expressions are based exclusively on syntactic rules, or in other words, that semantics is transformed into syntax. They take it for granted that this is possible, as they claim that both natural and formal languages work with symbols. But in natural language, these symbols need to be interpreted, whereas symbols in formal languages work without being assigned a certain (external) definition. Whether a formal language, a calculus, permits a certain permutation of symbols or not has nothing to do with the meaning or the definition of these symbols; it is just a question of syntax. As early as 1847, George Boole stated: “Those who are acquainted with the present state of the theory of Symbolic Algebra, are aware that the validity of the processes of analysis does not depend upon the interpretation of the symbols which are employed, but solely upon the laws of their combination.” Richard Montague also believes in the possibility of describing natural language semantics the same way as formal language semantics: “There is in my opinion no important theoretical difference between natural languages and the artificial languages of logicians; indeed, I consider it possible to comprehend the syntax and semantics of both kinds of languages within a single natural and mathematically precise theory. On this point I differ from a number of philosophers, but agree, I believe, with Chomsky and his associates.” (Both quotes from Devlin, 1997: 73 and 117.)
    From the point of view of corpus linguistics, the meaning of natural language symbols, of text elements or text segments is negotiated by the discourse participants and can be found in the paraphrases they offer, and it is contained in language usage, that is, in context patterns. Natural language symbols refer not so much to language-external facts, but rather they create semantic links to other language signs. The meaning of a text segment is the history of the use of its constituents. Linguistic signs always require interpretation. Whoever understands a text is able to interpret it. This interpretation can be communicated as a text in itself, a paraphrase of the original text. The act of interpretation requires intentionality, and therefore, cannot be reduced to a rule-based, algorithmic, ‘mathematically precise’ procedure. If we see language as a social phenomenon,
  • 10. natural language semantics can leave aside the mental or cognitive level. Everything that can be said about the meaning of words, phrases or sentences will be found in the discourse. Anything that cannot be paraphrased in natural language has nothing to do with meaning. In a nutshell, this is the core programme that distinguishes corpus linguistics from cognitive linguistics.
    Collocation and Meaning
    In traditional linguistics, it is rather difficult to pinpoint the difference between a collocation such as harte Auseinandersetzung (hefty discussion) and a free combination such as harte Matratze (hard mattress). In corpus linguistics, on the other hand, it is possible to trace this awareness among the members of a language community of a distinct semantic cohesion between the lexical elements of a collocation by statistical means, that is, by detecting a significant co-occurrence of these elements within a sufficiently large corpus.
    Before it was possible to procedurally and systematically process large amounts of language data, syntactic rules had been the only way to describe the complex behaviour of co-occurrence between textual elements (i.e., words). Such rules describe the relation between different classes of elements, for instance, between nouns and modifying adjectives. Still, syntactic descriptions such as ‘Adjective + Noun’ are not specific enough to detect collocations as distinct types of semantic relationships. Traditional lexicology fails to come up with a feasible definition for collocations that would allow their automatic identification in a corpus. To classify certain co-occurring textual elements as semantic units, that is, as collocations, it is necessary to recognise these text segments as recurrent phenomena, which is only possible within a sufficiently large corpus. Therefore, we must complement the intratextual perspective with its intertextual counterpart. By applying probabilistic methods, it is possible to measure recurrence within a virtual universe of discourse, or more precisely, within a real corpus. Collocation dictionaries in the strict sense are always corpus-based. Even so, the speaker’s competence is still needed to check statistically determined collocation candidates for their relevant semantic cohesion. The following case study aims to illustrate the potential of the corpus linguistic approach:
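    Before turning to the case study, the idea of measuring recurrence probabilistically can be made concrete with a small sketch. It assumes that adjective-noun pairs have already been extracted from a corpus, compares each pair's observed frequency with the frequency expected from the frequencies of its two parts, and ranks the candidates by a log-likelihood ratio score. The choice of that score, the function names `log_likelihood` and `rank_candidates`, and the toy counts are all assumptions made for illustration; the text does not say which significance measure is actually applied.

```python
import math
from collections import Counter


def log_likelihood(o11, f_adj, f_noun, n):
    """Log-likelihood ratio (G2) for one adjective-noun pair.

    o11    -- observed frequency of the pair
    f_adj  -- frequency of the adjective among all extracted pairs
    f_noun -- frequency of the noun among all extracted pairs
    n      -- total number of extracted pairs
    """
    observed = [
        o11,                       # this adjective with this noun
        f_adj - o11,               # this adjective with another noun
        f_noun - o11,              # another adjective with this noun
        n - f_adj - f_noun + o11,  # neither of the two words
    ]
    rows = (f_adj, n - f_adj)
    cols = (f_noun, n - f_noun)
    # Expected cell counts if adjective and noun were independent.
    expected = [r * c / n for r in rows for c in cols]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def rank_candidates(pairs):
    """Return (pair, score) tuples for over-represented pairs, best first."""
    n = len(pairs)
    pair_freq = Counter(pairs)
    adj_freq = Counter(adj for adj, _ in pairs)
    noun_freq = Counter(noun for _, noun in pairs)
    scored = []
    for (adj, noun), o11 in pair_freq.items():
        expected = adj_freq[adj] * noun_freq[noun] / n
        if o11 > expected:  # keep only pairs more frequent than chance predicts
            score = log_likelihood(o11, adj_freq[adj], noun_freq[noun], n)
            scored.append(((adj, noun), score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data, not taken from any real corpus.
toy_pairs = (
    [("hart", "Kern")] * 4
    + [("hart", "Matratze")]
    + [("weich", "Matratze")] * 4
    + [("weich", "Kern")]
)

for pair, score in rank_candidates(toy_pairs):
    print(pair, round(score, 2))
```

    The ranking here depends on the significance score rather than on raw frequency, and the output is only a list of candidates: whether a pair really shows semantic cohesion remains a judgement for the competent speaker.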

Case study 1: hart as collocator

The collocation dictionary Kollokationswörterbuch Adjektive mit ihren Begleitsubstantiven (Teubert, Kervio-Berthou and Windisch [in preparation]), which is currently being compiled at the Institut für Deutsche Sprache, is based on the IDS corpora of about 320 million words. The 400 adjectives were mainly selected from basic vocabulary lists. Candidates for collocations were combinations of adjectives and nouns showing a significantly higher frequency than the frequency expected on the basis of the occurrence of the individual words. The candidates are ranked according to significance; their overall frequency thus has no direct influence on the ranking. The concept for the statistical procedures applied here was designed by Cyril Belica. It is up to the competent speaker to decide whether sufficient lexical cohesion can be seen in the collocation candidates detected by the computer. Manually selected citations are provided in order to facilitate this interpretation. If a collocation candidate is translated into a foreign language as a whole rather than word by word, this can be seen as evidence of a distinct semantic cohesion; therefore, we have added the English translation equivalents to our German examples. The example below covers ranks 1-10 [for an explanation of the abbreviations see http://www.ids-mannheim.de/kt/cosmas.html]:

Kern  Rank: 1  Frequency: 63
WKB In der Treuhand selbst hat sich ein harter Kern aus früheren SED-Betonköpfen eingegraben.
WKB Dennoch enthalte der Bericht einen “harten Kern an Wahrheit.”
H68 Die “Kommandoebene,” der harte Kern der RAF, umfaßt 25 bis 30 Mitglieder.
H87 Der “harte Kern” umfaßt 187 Personen.
H87 [. . . ] ein sicherer Hinweis, daß sich die Betreffenden dem harten Kern der RAF angeschlossen haben.
H87 140 eingeschriebene Soulmänner kamen regelmäßig, ein harter Kern von 50 Jugendlichen fast täglich.
(Engl.: diehards/hard core)

Arbeit  Rank: 2  Frequency: 94
WKD In harter Arbeit haben wir unseren Staat aufgebaut. (Überschrift)
WKD Aber wir haben eben in dieser harten Arbeit alle noch ein bißchen zu lernen.
H85 [. . . ] Risikobereitschaft und harte Arbeit sollen sich in Malaysia wieder lohnen.
H86 Mangelnde persönliche Ausstrahlung machte er durch harte Arbeit, eiserne Disziplin und Willensstärke wett.
WKD Ein Sommer härtester Arbeit steht bevor.
H85 Die Technik macht es möglich, den Menschen von harter und übermäßiger Arbeit auch zeitlich zu entlasten.
(Engl.: hard work)

Währung  Rank: 3  Frequency: 40
WKB Die Deutschen würden nicht nur durch eine harte Währung vereint.
WKB Harte Währung soll mangelnden Geist wettmachen.
WKD Doch wundersam ist die Umwandlung der Ostmark in harte Währung allemal.
H87 Dann wäre es endgültig vorbei mit dem Glanz der einst härtesten Währung der Welt.
(Engl.: hard currency)

Schlag  Rank: 4  Frequency: 24
BZK Das war für ihn ein harter Schlag.
MK1 Ich habe eine junge Mannschaft, die einen harten Schlag verkraften kann, ohne zu zerbrechen.
MK2 Es war ein harter, gezielter Schlag, der mich prompt von den Beinen holte.
(Engl.: heavy blow)

Drogen  Rank: 5  Frequency: 20
H88 Außerdem sei ein immer stärker werdender Trend zu harten Drogen zu beobachten.
H87 Kontakt zu harten Drogen hatte der Jugendliche bald bekommen [. . . ]
(Engl.: hard drugs)

Kritik  Rank: 6  Frequency: 34
H86 Aber sie erfuhren schon damals von vielen Seiten harte Kritik.
MK2 Harte Kritik am Biedenkopf-Plan. (Überschrift)
H88 Zugleich übte er harte Kritik an der Landesregierung [. . . ]
(Engl.: harsh criticism)

Bandagen  Rank: 7  Frequency: 12
H86 Beide Seiten schlagen derweil mit harten Bandagen zu [. . . ]
WKB Der Kampf um Berlin als Hauptstadt wird mit harten Bandagen geführt. (Überschrift)
(Engl.: taking one’s gloves off)

Kampf  Rank: 8  Frequency: 30
MK1 Amerika müsse notfalls auf einen langen harten Kampf vorbereitet sein.
H86 Die meisten sehen zu, daß sie im harten Kampf um die Zehntel für sich das Beste rausholen.
BZK Verkaufsförderung gewinnt immer mehr Bedeutung im harten Kampf um die Gunst der Verbraucher.
WKD Für sie geht es jetzt nicht einfach um einen harten Kampf um Arbeitsplätze.
(Engl.: close fight)

D-Mark  Rank: 9  Frequency: 22
WKB Dann bekämen die DDR-Bürger harte D-Mark in die Hand und würden drübenbleiben.
WKB Nichts hat Vormarsch und Endsieg der harten D-Mark aufhalten können.
WKD Die harte D-Mark dient als Schmiedehammer.
(Engl.: strong Deutschmark)

Worte  Rank: 10  Frequency: 25
H85 Harte Worte - Berliner Verhältnisse?
H86 Selbst Außenminister Shultz benutzte harte Worte.
H85 [. . . der] erste Vorsitzende der Gesellschaft, findet nicht minder harte Worte, um den Bruch zu begründen [. . . ]
(Engl.: bitter words)

Discourse and Meaning

One of corpus linguistics’ most essential tenets is the assumption that the meaning of text elements and segments can be found solely in discourse. This assumption makes sense if we call to mind that, in principle, every word or combination of words was once a neologism. Neologisms are introduced to the discourse by explicitly assigning certain meanings to new expressions, that is, by paraphrasing what a new word is supposed to mean. As stated above, we can determine meaning in two ways: by paraphrase and by usage. Neologisms, however, still lack usage. They only become established once other participants in the discourse start using them, either by accepting the proposed meaning or by negotiating the meaning through a new paraphrase. This also applies to those cases where a new meaning is assigned to an already existing word.

It is obvious that we cannot go ‘back to the roots’ for all our established vocabulary; nor is this how children learn the meaning of words. But even so, it is not simply the usage of words that leads to their meaning. In most cases, an act of explanation, very often by the parents, but sometimes also through picture-books, sets the starting point for language acquisition. Obviously, deictic references to reality (or images thereof) are of the highest importance, but they are not understood without narrative explanations of words that describe what we have to watch out for in reality (or in images of reality). The meaning of school, for instance, cannot be explained by pictures of the building, classroom, teachers or pupils. In fact, only very few words relate to images unambiguously. Picture-book texts play a more important role in the acquisition of word meanings than dictionaries.

Since the times of the German lexicographers Adelung and Campe, the basic principle of German lexicography has been the assumption that the meaning of words can be found in text samples, a principle that is also basic to corpus linguistics. Nevertheless, corpus linguistics differs from traditional lexicography in various details. Firstly, corpus linguistics does not use corpora merely for examples: it explores them systematically.
Secondly, corpus linguistics does not try to decontextualise the objects it describes. In other words, it does not abstract the meaning from the context. Thirdly, corpus linguistics tries to capture different usages in their correlation to different contexts, unlike traditional lexicography, which tries to position word meanings upon a blueprint of a language-independent ontological concept (for instance, by genus proximum and differentia specifica). Fourthly, corpus linguistics is less interested in the single text element or word than in the semantic interaction between text elements and context.

The following case study of Globalisierung [globalisation] aims to demonstrate that it is indeed the discourse (or in other words: our corpus) where information about the meaning of words can be found. The reason why we all seem to know the meaning of Globalisierung as it is currently used is that we have all read those texts that explain Globalisierung. We cannot depict Globalisierung, any more than we can point at it. In its current use, Globalisierung is certainly a neologism. It is characteristic of the introductory phase of new words that the first citations show a large number of paraphrases, a fact that demonstrates the role of the discourse participants in negotiating meaning.

Case study 2: Globalisierung

Globalisierung (Engl.: globalisation) as a non-lexicalised derivation has long been part of our vocabulary. Its semantic vagueness is indicative of its non-lexicalised status. As nomen actionis or nomen resultativum, it was for a long time nothing more than the nominalisation of globalisieren. The presence of descriptive attributes is indicative of its lack of semantic specification; metalingual indicators (like paraphrases), on the other hand, are almost totally absent. The following examples were found in the German daily Tageszeitung:

Die Vorstellung [. . . ] der Globalisierung der Kleistschen Verzückung [. . . ] scheint mir denn doch eher märchenhaft. [14.10.89]

Aber die Globalisierung von Politik, Ökonomie und Technologie dulde keinen partikularen Bezugspunkte mehr [. . . ] [05.06.92]

Mit der Globalisierung der Lebensweise der modernen Zivilisation geht die Selbstaufhebung der [. . . ] Ideale und Grundüberzeugungen einher. [25.02.95]

As a neologism, Globalisierung manages to almost completely displace the original, non-lexicalised derivation only as late as 1996. Suddenly, there is a distinct rise in frequency: whereas we have only about 160 citations from 1988 to the end of 1995, there are about 320 citations for 1996 alone. Also, most citations come without descriptive attributes: apparently, it is no longer necessary to explain what is being globalised.
Finally, many citations show metalingual indicators (below printed in italics) that demonstrate how the discourse participants take part in assigning a meaning to the word, as in the following examples:

Die “Globalisierung” – ein etwas unscharfer Begriff, mit dem zugleich die Ausweitung des Handels, die Liberalisierung der Finanzmärkte, der Sieg der Freiheitsideologie, die unkontrollierte Macht der multinationalen Unternehmen, die Internationalisierung des Arbeitsmarktes und die Umstrukturierung der Volkswirtschaften gemeint ist – hat die Gewerkschaften weiter geschwächt. [12.01.96]

Verbissener Konkurrenzkampf im Inneren und nach außen hin eine maximale Öffnung für Kapitel, Güter und Dienstleistungen. So lautet eine der möglichen Definitionen der Globalisierung. [12.01.96]

[. . . ] die Globalisierung, das heißt die vollständige Liberalisierung aller Märkte auf der Welt [. . . ] [10.05.96]

Lisa Maza [. . . ] sieht die Globalisierung völlig anders: Sie sei eine Fortsetzung der Kolonialisierung mit anderen Mitteln – zum Nachteil des Südens, der Armen und der Frauen. [08.06.96]

Stichwort Globalisierung: In einer globalen Wirtschaft wird es auf Dauer kein geschütztes Umfeld für die Wirtschaft irgendeines Landes mehr geben. [27.07.96]

Globalisierung bedeutet auch die Europäisierung des Globus, Kolonialismus, ökonomischer und ökologischer Imperialismus. [04.05.96]

Denn in der Tat bedeutet Globalisierung Amerikanisierung, und zwar nicht nur der Weltwirtschaft, sondern auch eine normative Amerikanisierung. [11.10.96]

Das Stichwort “Maastricht” und das Modewort “Globalisierung” sind zu Synonymen für sozialen Rückschritt geworden. [18.10.96]

Typischerweise schweigen die Intellektuellen in Deutschland beharrlich zu Europa, Globalisierung und Zukunft der Arbeit [. . . ] [13.12.96]

This is a brief list of comparable English citations taken from the Bank of English and shortened:

What does globalisation mean? The term can happily accommodate all manner of things: expanding international trade, the growth of multinational business, the rise in international joint ventures and increasing interdependence through capital flows.

Globalisation: Low wages in other countries contribute to low wages in the United States.

Words like globalisation and outsourcing are now in common use.

Watkins sees globalisation as a euphemism for a race to maximise profit by lowering workers’ pay and condition.

As Mr. Keegan says, globalisation means that tax cuts for business are crucial.

Globalisation represents an attempt to exploit South Korea’s enormous potential.

But doesn’t globalisation mean world-wide sameness?

Globalisation is still more a philosophy than a business reality.

Globalisation comes in many flavours.

More so than other words, neologisms show that the meaning of words is to be found in the texts rather than in some discourse-external reality. The citations, whether in their virtual entirety within the universe of discourse or in some cross-section in a real corpus, are the meaning, and we may understand this meaning by interpreting the citations.

The formulation of a dictionary entry for globalisation, however, is the responsibility of lexicography, not of corpus linguistics, whose main task, apart from finding the references, would instead be the correlation (by systematic context analysis) of the various sets of paraphrases and usage patterns with different parameters such as text type (newspaper), genre (politics/society), ideological stance and so on. Particularly in the area of ideologically controversial keywords, a useful selection of citations may well be more helpful to the user than traditional definitions.

Linguistic Knowledge and Encyclopaedic Knowledge

Corpus linguistics aims to analyse the meaning of words within texts, or rather, within their individual context. First and foremost, words are text elements, not lexicon or dictionary entries. Corpus linguistics is interested in text segments whose elements exhibit an inherent semantic cohesion which can be made visible through quantitative analyses of discourse or corpus (Biber, Conrad and Reppen 1998). If the research focus is shifted from single words to text segments, the distinction between linguistic and encyclopaedic knowledge gradually becomes fuzzy. The word Machtergreifung (seizure of power), outside its context, may be described as an incident where a certain group, previously excluded from political influence, seizes power by its own force and without democratic legitimation.
However, we will interpret text segments such as braune Machtergreifung or die Machtergreifung im Jahre 1933 as referring to the ‘seizure of power by the Nazis’ without hesitation. Is this because these texts refer to an extralingual reality, to a language-independent knowledge? Although the majority of linguists would agree with this assumption, there may well be another, simpler explanation: we have learned from a large number of citations that whenever braune Machtergreifung or Machtergreifung im Jahre 1933 is mentioned, this refers to the seizure of power by the Nazis and to nothing else. There is a co-occurrence between both expressions that may result, for instance, in an anaphoric situation: the expressions are paraphrases of each other.

In the tradition of German lexicography, linguistic knowledge is separated from encyclopaedic knowledge by the process of decontextualisation, by the endeavour to describe the meaning of words unadulterated by the context in which they occur. If we detach all references from their relevant context, the isolated meaning remains. The different events of Machtergreifung that are dealt with in texts are viewed as references to a discourse-external reality.

Corpus linguistics, on the other hand, is above all interested in the meaning of textual segments displaying a distinct semantic cohesion. Machtergreifung im Jahre 1933 is such a segment, and by projecting it upon our discourse (i.e., linguistic) knowledge, we are able to interpret it as ‘Nazi seizure of power’ without difficulty. If we are no longer limited to single words detached from their contexts and if we do away with decontextualisation, we can give up the distinction between linguistic and encyclopaedic knowledge. For what we normally call encyclopaedic knowledge is in fact nothing but discourse knowledge. Everything we know and are able to know about the Nazi seizure of power is based on texts. Although some may even have witnessed one relevant incident or another, their ability to interpret the whole course of events as Machtergreifung is also based on texts from other persons. If we reduce encyclopaedic knowledge to discourse knowledge, the distinction disappears.

Let us take a look at the example klassische Rollenverteilung (traditional role allocation) (Spiegel 13, 1999: 128):

Ein Zuhause wie ein Bilderbuchideal. Hier [. . . ] ist die klassische Rollenverteilung die Regel: Ein Elternteil kümmert sich um Haushalt und Kindererziehung, der andere verdient das Geld. Auch dieser traditionellen Familienvorstellung entspricht das Leben im Reihenhaus.

[A home like a picture-book cliché. Here [. . . ] the traditional role allocation is still the rule: one parent takes care of the household and of bringing up the children, the other parent earns the family income. Living in a terraced house also contributes to this traditional image of family.]

Within the context of family/home, the meaning of the collocation klassische Rollenverteilung in the above example corresponds exactly to the sentence that may serve as definition: Ein Elternteil. . . [One parent. . . ]. Note the subtle subversive touch that is present here, characteristic of so many Spiegel articles: what seems to be a generally acceptable definition actually shows an essential deviation from the traditional meaning of klassische Rollenverteilung, in that it does not distinguish between male and female.

The above example aptly illustrates the challenges and achievements of corpus linguistics. Firstly, it is not interested in the meaning of isolated words outside their relevant contexts, but instead in the meaning of semantically connected text segments, extracted from discourse or, in practice, from the corpus. In the context of home and family, klassische Rollenverteilung can be interpreted in different ways with regard to period and genre. If the above Spiegel definition becomes generally accepted, we may apply the term klassische Rollenverteilung even to gay or lesbian partnerships. For corpus linguistics, this implies a dynamic view of meaning. Every new reference may add to the meaning of a certain text segment; older meanings may fall into oblivion if they are not sanctioned by new evidence. The above example also shows that the ways in which meaning is negotiated within the language community can be controversial indeed. It is not so long ago that lesbian partnership and family were two different meanings that could not be imagined, let alone used, synonymously. Corpus linguistics may thus serve as a useful instrument to detect changes of meaning that are essential to neology.

Secondly, corpus linguistics is developing devices for the identification and extraction of potentially metalinguistic elements of citations, that is, of text elements co-occurring with a paraphrase, thus enabling the automatic extraction, processing and presentation of semantically relevant material from corpora. Phrases such as something is the rule; x means y; this is to say; we understand it as; it can be said etc. point to metalingual content. If the meaning of a semantically controversial textual segment is negotiated, we often find indicators such as: some time ago; in fact; strictly speaking; without doubt; wrongly etc. These indicators can give us important clues. Above all, it must be realised that just as the meaning of a text segment is a paraphrase found in earlier citations, people’s interpretations are also paraphrases and therefore part of the discourse. In principle, the meaning of a text element or a text segment is everything that has been said about it, in terms of a paraphrase or as a matter of usage; it is the result of the negotiation of the meaning within the discourse community. Indeed, this is the difference between natural language words and technical terms. Technical terms are defined by experts, and their meaning is restricted to that definition (and thus is discourse-external). For instance, if a tree meets the criteria for elm-trees listed in the expert’s definition, it is rightly called an elm-tree no matter what the citations say. Any terminological definition is, at least in principle, an algorithmic instruction for the usage of the relevant term. This explains why it is possible to translate technical texts automatically, provided they are monosemous and only use specialist vocabulary. Lexicographic definitions, on the other hand, are interpretations of citations, that is, results of intentional acts. They cannot be derived automatically from corpus citations, because every citation can be interpreted in various ways. Therefore, an automatic translation of general language texts is not feasible.

Thirdly, corpus linguistics uses the context to distinguish between usages. For example, the collocation klassische Rollenverteilung is not only found in the family context but also at work or in society in general. Its meaning differs according to the context.

Fourthly, corpus linguistics is interested in larger units of meaning, namely, in text segments. The traditional lexicographic practice of decontextualisation and isolation of single words prevents us from knowing the meaning of larger units such as klassische Rollenverteilung. As a rule, the meaning of text segments such as multi-word units, collocations or set phrases is far more specific than that of single words.
The fact that traditional linguistics focuses on the single word, isolated from its context, can only be explained by space constraints in the past, as it is impossible to list all collocations and set phrases even in a dictionary consisting of several volumes. But is klassische Rollenverteilung really a true collocation? Is corpus linguistics really able to provide a credible validation of semantic cohesion? Is the co-occurrence klassische Rollenverteilung more than a mere addition of klassisch and Rollenverteilung? In a sufficiently large corpus, if the frequency of klassische Rollenverteilung differs significantly from the statistically expected frequency of this combination, this can be seen as one sign of a possible collocation. Another sign would be the occurrence of a special meaning that cannot be derived from the sum of the individual meanings of the text elements. For instance, if we find six tokens of klassische Rollenverteilung within the corpus although we would only expect three, given the frequency of the constituents, and if they all suggest that one parent is the wage-earner whereas the other is bringing up the children, then we may regard this co-occurrence as a collocation.

Finally, corpus linguistics considers meaning as a feature of language, of text elements, segments, and texts, and not as an external feature existing only in the human mind or in reality. The meaning of klassische Rollenverteilung in the context of family is represented in texts, and only there; it is not the reflection of a non-textual external reality that we could point our fingers at. There is no meaning outside language, outside the discourse. We know what globalisation means today because we have read the texts that explain it, but we cannot see globalisation.

Multilingual Corpus Linguistics

When translating a text into another language, we paraphrase the source text. The translation represents the meaning of the original text just like a paraphrase within the source language. Translation requires understanding and thus intentionality. Only if we understand a text can we interpret or even paraphrase it. This implies that different translators will produce different versions of the same text, which again shows that translation or paraphrasing cannot be reduced to algorithmic procedures.

The universe of discourse, containing all texts ever translated along with their translations, is the empirical base for multilingual corpus linguistics. It is a virtual universe, and it can be realised by multilingual parallel corpora (or a collection of bilingual parallel corpora). Parallel corpora consist of source texts along with their translations into other languages, whereas reciprocal parallel corpora contain the source texts in two languages along with their translations into the target languages.

Just as in monolingual corpus linguistics, meaning is also seen as a strictly linguistic (or better, textual) term here. Meaning is paraphrase.
The entire meaning of a text segment within a multilingual universe of discourse is enclosed in the history of all translation equivalents of the segment. The translation unit, that is, the text segment completely represented by the translation equivalent, is the base unit of multilingual corpus semantics. Translation units, consisting of a single word or of several words, are the minimal units of translation. If they consist of several words, they are translated as a whole and not word by word. Therefore, translation equivalents correspond to the text segments of monolingual corpus linguistics.

Within the framework of multilingual corpus linguistics, we take it that the meaning of translation units is contained in their translation equivalents in other languages. This corresponds to the basic assumption of corpus linguistics, which does not regard semantic cohesion as something fixed but as belonging to a large spectrum reaching from inalterable units to text segments whose elements can be varied, expanded or omitted. Identifying these translation units (or text segments) again involves interpretation. The translation shows us whether a given co-occurrence of words is a single translation equivalent or a combination of them, that is, merely a chain of text elements. This leads to two consequences. What can be seen as an integral translation equivalent in one target language may be a simple word-by-word translation in another. This may even be the case within a single target language, depending on the stylistic preferences of different translators. In fact, it is the community of translators (along with the translation critics) who in their daily practice decide what is the translation equivalent, just as the monolingual language community decides what is a text segment. The definition of a translation unit therefore depends both on the target language and on the common practice of translation. A virtual text segment is a translation unit only in respect to those languages into which it is translated as a whole. Translation units and their equivalents are not metaphysical entities; they are the contingent results of translation acts. According to the analysis of parallel corpora, more than half of the translation units are larger than the single word, which is another example of how corpus linguistics may help to investigate the nature of text segments.

The meaning of a translation unit is its paraphrase, that is, the translation equivalent in the target language. For ambiguous translation units, this implies that there are as many meanings to the unit as there are non-synonymous translation equivalents. If the phenomenon of meaning is thus operationalised, the meaning of a translation unit depends on the selected target language. A given translation unit in language A may have two non-synonymous equivalents in language B, but three non-synonymous equivalents in language C.

Let us look at an example. The English word sorrow (a translation unit consisting only of a single word) will usually be translated into French by one of the three equivalents chagrin, peine or tristesse; the first two, chagrin and peine, are obviously synonymous in a variety of contexts. They both point at a cause for this emotion and, therefore, are sometimes interchangeable with deuil (‘loss,’ the term for the cause). Tristesse, on the other hand, is the variety of sorrow which is not caused by a special incident. In German, there are also three standard equivalents for sorrow, namely, Trauer (caused by loss), Kummer (caused by an adverse incident, intense and usually limited in duration) and finally Gram (caused by unhappiness resulting from an incident, not very intense, more a disposition than a feeling, but often of long duration). Those three German equivalents are neither synonymous with nor corresponding to the three French equivalents. By the way, the different senses of sorrow usually found in English monolingual dictionaries and thesauri correspond to neither the French nor the German distinctions.

The above example of sorrow shows that the concept of synonymy cannot be expressed in an algorithm. To call two expressions synonymous requires a prior understanding of their meaning, that is, an act of interpretation. For instance, if we look at how the Greek verb proséuchomai in the first sentence of Plato’s Republic is translated into English, we will find six different equivalents in eight different translations of this book: to make my prayers, to say a prayer, to offer up my prayers, to worship, to pay my devoirs and to pay my devotions. We, as human beings, must decide whether we consider the Greek verb ambiguous or just fuzzy and whether the relevant equivalents can be seen as synonyms. This is something computers cannot do. The example also shows that the concept of synonymy can only be applied locally, referring to translation equivalents or text segments within a defined context.
Although we may assume that Plato’s contemporary audience considered the verb proséuchomai as unambiguous within the above context, this is not the case with native speakers of English, for whom there is no synonymy between to make my prayers and to pay my devotions. It can be clearly seen that meaning has a dynamic quality and also that the act of translation requires intention and thus cannot be reduced to a mere procedure. We will never find the correct German equivalent for sorrow or the correct English equivalent for proséuchomai just by defining formal instructions for a machine. Before we can translate texts and their elements, we must understand them.

Multilingual Corpus Linguistics in Practice

Neither a lexicon derived from a bilingual dictionary nor the supposedly language-neutral conceptual ontologies applied within Artificial Intelligence will solve the problem of machine translation of general language texts. By now, this fact is acknowledged by the experts. Therefore, they focus on the machine translation of texts written in a controlled documentation language, which is a more or less formal language in which all technical terms are defined unambiguously, along with a syntax that rejects all ambiguous expressions as non-grammatical. General language texts written in natural languages cannot be translated without interpretation.

Here, multilingual corpus linguistics steers clear of this obstacle in an elegant way. Unlike disciplines such as Artificial Intelligence and Machine Translation, which are based on cognitive linguistics, it does not try to model and emulate mental processes, but instead tries to support the translator by processing parallel corpora. These corpora record the practice of previous human translation. In them, those translation equivalents that have proven to be reliable and accepted will, in the long run, outweigh equivalents that have been dismissed as inadequate. If, for instance, proséuchomai is translated as to make my prayers three times out of eight, it may well be assumed that it is an accepted (albeit not the ideal) equivalent within the given context.

Parallel corpora are translation repositories. They link translation units with their equivalents. As first studies have shown (Steyer and Teubert 1998), we may assume that 90 percent of all translation units along with their relevant equivalents may be found in a carefully compiled corpus of about 20 million words per language, provided that the text to be translated is sufficiently close to the corpus with regard to text type and genre.
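
The view of a parallel corpus as a translation repository can be made concrete with a small sketch. It assumes that alignment has already been done and that the corpus is available as (translation unit, equivalent) pairs; the function names and the pair format are hypothetical. Of the sample data, only the figure of three occurrences of to make my prayers in eight translations comes from the text; the distribution of the remaining five translations is invented for illustration.

```python
from collections import Counter, defaultdict


def build_repository(aligned_pairs):
    """Map each translation unit to a Counter of its observed equivalents."""
    repository = defaultdict(Counter)
    for unit, equivalent in aligned_pairs:
        repository[unit][equivalent] += 1
    return repository


def ranked_equivalents(repository, unit):
    """Return (equivalent, count, relative frequency), most frequent first."""
    counts = repository[unit]
    total = sum(counts.values())
    return [(eq, n, n / total) for eq, n in counts.most_common()]


# Illustrative data: 3 of 8 is from the text, the rest is an assumption.
pairs = [("proséuchomai", "to make my prayers")] * 3 + [
    ("proséuchomai", "to say a prayer"),
    ("proséuchomai", "to offer up my prayers"),
    ("proséuchomai", "to worship"),
    ("proséuchomai", "to pay my devoirs"),
    ("proséuchomai", "to pay my devotions"),
]

repository = build_repository(pairs)
ranking = ranked_equivalents(repository, "proséuchomai")
for equivalent, count, share in ranking:
    print(f"{equivalent}: {count} of 8 ({share:.1%})")
```

Such a ranking only records how often translators chose an equivalent. Whether a frequent equivalent is acceptable in a new context remains a matter of interpretation.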
Multilingual corpus linguistics does not pretend to solve the problem of machine translation of general language. But it may help the human translator in finding a suitable equivalent for the unit to be translated more efficiently than traditional bilingual dictionaries, because it includes the context even in those cases where the translation equivalent is not a syntagmatically defined collocation but a certain textual element within a sequence. The goal is to select from among all given elements the one whose contextual profile is closest to that of the textual segment to be translated.

Case study 3: The translation into German of sorrow and grief

For the two words sorrow and grief, we find three common non-synonymous German translation equivalents: Trauer, Kummer and Gram. An analysis of the contexts of all references of these German words as found in the IDS corpora, based on a method designed by Cyril Belica (see http://www.ids-mannheim.de/cgi-bin/idsforms/cosmas-www-client), gives us the context profiles listed below. In our example, the number of neighbouring words (i.e. span) has been restricted to 5 words on each side. The context profiles given below have been slightly edited for the sake of clarity.

Context profile for Trauer: Wut, Angst, Betroffenheit, Schmerz, Tod, Bestürzung, Freude, Hoffnung, Verzweiflung, Scham; tragen, empfinden; tief, groß-

Context profile for Kummer: Sorgen, Schmerz, Leid, Seele, Freude, Stress, Ärger, Not; bereiten, machen, gewohnt/gewöhnt sein; viel, groß-

Context profile for Gram: Leid, Hass, Bitterkeit, Scham; sterben; gebeugt, lauter, voll-

In an English-German parallel corpus we would distinguish between three translations for sorrow and grief: the first group would contain those cases where sorrow or grief is translated by Trauer; the second group where it is translated by Kummer, and finally, the third group where it is translated by Gram. For each of the above cases, we could compute a context profile similar to the ones quoted above for the German words from the IDS corpus. We may assume that the context profile for sorrow and grief, as taken from the parallel corpus, in the case of the translation equivalent Kummer, will not differ much from the context profile for Kummer extracted from the German reference corpus, apart from it being in English instead of German.
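
How a context profile with a span of five words on each side might be collected can be sketched as follows. This is a simplified stand-in and not Cyril Belica's method: it counts raw co-occurrences, whereas the profiles quoted above rest on a statistical analysis. The function name, the tokenised input and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter


def context_profile(tokens, node, span=5, top_n=10):
    """Count the words found within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        # Other occurrences of the node word itself are not part of its profile.
        counts.update(w for w in left + right if w != node)
    return counts.most_common(top_n)


# Invented toy sentence; a real profile would be drawn from millions of words.
tokens = "sie empfand tiefe Trauer und Wut über den Tod".split()

profile = context_profile(tokens, "Trauer", span=2)
print(profile)
```

In practice the raw counts would be compared with the counts expected by chance, so that frequent function words such as und do not dominate the profile.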
Unfortunately, a sufficiently large English-German parallel corpus that would allow the extraction of English context profiles for German translation equivalents on the basis of recurrence is not yet available. As an alternative, I have searched the Bank of English for those instances of sorrow and grief whose contexts are similar to our context profiles for Trauer, Kummer and Gram. So far these results are not thoroughly convincing: one reason is the different composition of the IDS corpora compared to the Bank of English, which results in a clear imbalance of the German and English instances with regard to text type and genre; also, the search criteria for the
English contexts have been too narrow, and last but not least, sorrow and grief, along with their German counterparts Trauer, Kummer and Gram, belong to an area of vocabulary which is highly culture-specific and is almost impossible to reduce to a common denominator. Still, the following instances taken from the Bank of English show that, in practice, the approach to detecting equivalents outlined above works to some extent. The words in square brackets are the German equivalents of the context words contained within the context profiles.

(1) Trauer
So on the night of the crucifixion I place Simon in the home in Bethany of Mary called Magdalene and her sister Maria. I envision a scene in which trauma, grief, anger [Wut], and despair [Bestürzung] were all present, to say nothing of fear [Angst].

(2) Kummer
She enjoys her job though it is full of stress [Stress], sorrow and never-ending challenges.

(3) Gram
The terrible affliction [Leid] that has fallen so suddenly upon our unhappy country fills and monopolises my thoughts. My soul is full of grief and bitterness [Bitterkeit] and hate [Hass] and vengeance.

Although matching the context of the element to be translated against the context profiles of all possible equivalents may suggest a method for the automatic selection of suitable equivalents, this only works in those cases where we have clear selection-relevant contextual information at our disposal. As stated above, this is not always the case, especially if the text element to be translated refers to earlier instances within the same text. In these cases, we may assume that, provided the intratextual continuity is sufficiently high, the text element (sorrow or grief in our example) can always be translated by the same target-language equivalent, be it Trauer, Kummer or Gram.
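The selection step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an implementation of any existing system: the English profile words are limited to the bracketed pairings given in examples (1) to (3), the score is a plain overlap count, and the function names are hypothetical. Instances without selection-relevant context inherit the equivalent chosen elsewhere in the same text, which mirrors the assumption of intratextual continuity.

```python
import re

# English counterparts of German profile words, restricted to the pairings
# given in examples (1)-(3); a real profile would be far larger.
PROFILES = {
    "Trauer": {"anger", "despair", "fear"},
    "Kummer": {"stress"},
    "Gram": {"affliction", "bitterness", "hate"},
}


def tokenize(text):
    """Lower-case the text and keep alphabetic word forms only."""
    return re.findall(r"[a-z]+", text.lower())


def choose_equivalent(tokens, index, span=5, profiles=PROFILES):
    """Pick the equivalent whose profile overlaps most with the context.

    Returns None if no profile word occurs within the span. Ties are
    resolved by dictionary order, which is an arbitrary simplification.
    """
    window = set(tokens[max(0, index - span):index]
                 + tokens[index + 1:index + 1 + span])
    scores = {eq: len(window & words) for eq, words in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def translate_instances(text, nodes=("sorrow", "grief"), span=5):
    """Choose an equivalent for every instance of a node word in one text."""
    tokens = tokenize(text)
    choices = [choose_equivalent(tokens, i, span)
               for i, tok in enumerate(tokens) if tok in nodes]
    # Instances lacking contextual evidence reuse the first decision that
    # could be made within the same text (assumed intratextual continuity).
    decided = [c for c in choices if c is not None]
    default = decided[0] if decided else None
    return [c if c is not None else default for c in choices]


print(translate_instances(
    "My soul is full of grief and bitterness and hate and vengeance."))
```

The fallback only makes sense for complete texts; on isolated fragments the function may return None for every instance.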
In most cases, whenever a word with a fuzzy, strongly context-dependent meaning appears in a text for the first time, the information needed for the specification of its meaning will be found within the context. Later instances of the word within the text often tend to omit this information as redundant. Within a text, we must find one or two references where a
suitable translation equivalent is indicated by the context profile and apply the result to the other instances. This shows that it is imperative to only include complete texts in the corpus.

Future Prospects

Corpus linguistics sees itself not in opposition to but as a complement of traditional linguistics. Corpus linguistics helps to make us aware not only of the interaction between text element and context but also of text segments, that is, larger, flexible units whose elements are semantically linked in a certain way: multi-word units, collocations, set phrases. It explains the repeated co-occurrence of text elements as a discourse phenomenon that can be explored by statistical means, and it makes those co-occurrence patterns visible by a combination of quantitative and categorial devices.

The investigation of the context enables us to better cope with words displaying fuzzy meanings, words of the ‘Thespian vocabulary,’ as John Sinclair called them (Sinclair 1996), by generating context profiles as presented above on the basis of sufficiently large corpora. Combining these context profiles with citations that contain a paraphrase of the meaning or aspects thereof (cf. our case study of globalisation) may lead to descriptions of meaning that enable the user to participate in the discourse.

Corpus linguistics distinguishes between text segments on the one hand and text elements embedded in context on the other, depending on how they can be described. Context profiles are only statistically defined. Within a context profile, there is no such thing as an obligatory element that is indispensable within the context of a citation. The lexical constituents of text segments, however, can be defined either as indispensable or as optional.
But there is still another difference between the text element with its context profile and the text segment: the latter is defined not only on a lexical but also on a syntactic level. The collocation Kummer gewöhnt ceases to be a collocation as soon as the verb gewöhnt sein is replaced by gewöhnen: Er hatte sich an seinen Kummer gewöhnt is not a collocation. The same applies to collocations such as geheimer Kummer, Kummer bereiten, Kummer und Sorgen. If we change the syntagma or even just the word order (for example, into Sorgen und Kummer), the words lose their character as a collocation.
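The contrast between a statistically defined context profile and a syntactically fixed text segment can be made concrete with a small Python sketch. Both helper functions are hypothetical and deliberately crude: the first counts a word sequence only in its exact order, the second counts unordered co-occurrence within a span. Neither performs the syntactic analysis that would be needed to tell gewöhnt sein from gewöhnen.

```python
def count_sequence(tokens, phrase):
    """Count occurrences of `phrase` as a contiguous sequence in fixed order."""
    n = len(phrase)
    target = list(phrase)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == target)


def count_cooccurrence(tokens, a, b, span=5):
    """Count how often `b` occurs within `span` tokens of `a`, in any order."""
    total = 0
    for i, tok in enumerate(tokens):
        if tok == a:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            total += window.count(b)
    return total


reordered = "Sorgen und Kummer".split()
# The fixed-order sequence is not found, but the co-occurrence is.
print(count_sequence(reordered, ("Kummer", "und", "Sorgen")))
print(count_cooccurrence(reordered, "Kummer", "Sorgen"))
```

A context profile corresponds to the second kind of count, whereas identifying a text segment such as Kummer und Sorgen requires at least the first.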
During the last decades, we have witnessed a growing interest, even in traditional linguistics, in semantic cohesion, that is, in the special semantic relations between words within sentences and phrases. Among the relatively new concepts are lexical solidarities, collocations, set phrases, valency, case roles, semantic frames and scripts. They all try to demonstrate that language is more than just the assembling of context-free words using semantics-free rules. The co-occurrence patterns developed by corpus linguistics may help to clarify heuristically the concept of text segments defined by semantic cohesion.

When it comes to the identification of text segments, multilingual corpus linguistics holds a privileged position. Within monolingual corpora, this identification is a gruelling task that can only be turned into an automatic procedure by a painstaking combination of various procedures based on frequencies, lists or rules. The use of parallel corpora makes it easier to identify text segments (as translation units or equivalents), as they are the true practical results of interpretation and paraphrase. They show what usually takes place within the minds of the speakers without leaving traces in texts. Parallel corpora, therefore, provide direct access to the translation practice of human translators. If we assume that we may find the meaning of a textual element through its paraphrase, which is also a text, then we may describe parallel corpora as repositories for such paraphrases. Obviously, dictionaries also attempt to list those paraphrases. However, since their size is limited, they need to decontextualise and isolate the lexical units, whereas the paraphrases of translators display the text elements embedded within their contexts, along with whole text segments. Parallel corpus evidence helps us to trace the phenomenon of semantic cohesion.
Meanwhile, with the availability of large corpora and improved software for their exploration, corpus linguistics has become part of general lexicography. Linguistics is gradually becoming more interested in larger units of meaning and the use of context for their definition. Also, it is generally accepted that the next generation of dictionaries, both monolingual and bilingual, needs to be corpus-validated, if not entirely corpus-based. But there is more to the corpus linguistic approach. By interactive procedures, the ambitious user should be able to have direct access to corpus evidence instead of being confronted with the subjective findings provided by lexicographers. Such a corpus platform would allow the members of the language community
to participate in the social activity of negotiating meanings in a committed and informed way.

Notes

* This contribution is a revised version of my article ‘Korpuslinguistik und Lexikographie’ in Deutsche Sprache 4/99, pp. 292–313, translated into English by Norbert Volz.

1. The rules that those followers of a universal grammar hope to find in their quest for the language organ are not based on deductions of analogy. Whereas rules based on innateness had been the central factor in Chomskyan language theory until recently (cf. Steven Pinker in The Language Instinct [Pinker 1994]), Pinker now sees the language faculty as an interaction, not yet fully explored, between ‘distinct mental mechanisms’, namely ‘symbolic computation’ [i.e., the algorithmic processing of uninterpreted symbols] as opposed to ‘memory’ [i.e., recollection], the latter being responsible for the assignment of form and meaning of symbols (Pinker 1999). The memory is seen as partly associative; an appropriate term for its description could be ‘connectionist network’. However, Pinker still sees ‘symbolic computation’ as a strictly rule-based process. We may assume that this tentative change in attitude towards the language faculty and the extent of its genetic embedding might be partly due to Terrence W. Deacon’s convincing explanation of first language acquisition, which does without any language organ (Deacon 1997).

References

Biber, Douglas; Conrad, Susan; Reppen, Randi. 1998. Corpus Linguistics. Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Collins COBUILD. 1987. English Language Dictionary. Editor in Chief: John Sinclair.
Deacon, Terrence W. 1997. The Symbolic Species. New York: Norton.
Dennett, Daniel C. 1998. “Reflections on Language and Mind.” In: Peter Carruthers/Jill Boucher (Eds.): Language and Thought. Interdisciplinary Themes. Cambridge: Cambridge University Press, 284–294.
Devlin, Keith. 1997. Goodbye, Descartes. New York: Wiley.
Fodor, Jerry A. 1975. The Language of Thought. New York: Crowell.
Fodor, Jerry A. 1998. Concepts. Where Cognitive Science Went Wrong. Oxford: Clarendon Press.
Hellmann, Manfred W. 1992. Wörter und Wortgebrauch in Ost und West. Vol. 1–3. Tübingen: Narr.
Herberg, Dieter; Steffens, Doris; Tellenbach, Elke. 1997. Schlüsselwörter der Wendezeit. Wörter-Buch zum öffentlichen Sprachgebrauch 1989/90. Berlin: Walter de Gruyter.
Heringer, Hans Jürgen. 1999. Das höchste der Gefühle. Empirische Studien zur distributiven Semantik. Tübingen: Stauffenburg Verlag.
Jäger, Ludwig. 2000. “Die Sprachvergessenheit der Medientheorie. Ein Plädoyer für das Medium Sprache.” In: Werner Kallmeyer (Ed.): Sprache und neue Medien. Jahrbuch 1999 des Instituts für Deutsche Sprache. Berlin/New York: de Gruyter, 9–30.
Janik, Allan; Toulmin, Stephen. 1973. Wittgenstein’s Vienna. New York: Simon & Schuster.
Keller, Rudi. 1995. Zeichentheorie. Tübingen: Francke.
Kjellmer, Göran. 1994. A Dictionary of English Collocations. Based on the Brown Corpus. Oxford: Clarendon Press.
Lenz, Susanne. 2000. Studienbibliographie Korpuslinguistik. Heidelberg: Groos.
McEnery, Tony; Wilson, Andrew. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.
Melby, Alan K. 1995. The Possibility of Language. A Discussion of the Nature of Language with Implications for Human and Machine Translation. Amsterdam: John Benjamins.
The Oxford-Hachette French Dictionary. 1994. French-English/English-French. Marie-Hélène Corréard, Valerie Grundy (Eds.). Oxford: Oxford University Press.
Pinker, Steven. 1994. The Language Instinct. New York: William Morrow.
Pinker, Steven. 1999. “Regular habits. How we learn language by mixing memory and rules.” In: Times Literary Supplement, October 29, 1999, 11–13.
Renouf, Antoinette (Ed.). 1998. Working with Corpora. Selected Papers from the 18th ICAME Conference. Amsterdam: Rodopi.
Le Robert & Collins. 1993. Dictionnaire Français–Anglais/Anglais–Français. 4th Edition. Editor in Chief: Beryl S. Atkins.
Searle, John R. 1992. The Rediscovery of the Mind. Cambridge, Mass.: The MIT Press.
Sinclair, John M. 1996. “The Empty Lexicon.” In: International Journal of Corpus Linguistics 1(1): 99–120.
Steyer, Kathrin; Teubert, Wolfgang. 1998. “Deutsch-Französische Übersetzungsplattform. Ansätze, Methoden, empirische Möglichkeiten.” In: Deutsche Sprache 4(97): 343–359.
Stubbs, Michael. 1996. Text and Corpus Analysis. Oxford: Blackwell.
Teubert, Wolfgang. 1999. In: Modelle der Übersetzung – Grundlagen der Methodik. Frankfurt/M.: Lang, 118–135.
Teubert, Wolfgang; Kervio-Berthou, Valérie; Windisch, Eric. To be published. Kollokationswörterbuch Adjektive und ihre Begleitsubstantive.
Wierzbicka, Anna. 1996. Semantics. Primes and Universals. Oxford: Oxford University Press.