Zurich Open Repository and
Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch
Year: 2003
German prepositions and their kin. A survey with respect to the resolution
of PP attachment ambiguities
Volk, Martin
Abstract: This paper surveys German prepositions and their relatives: contracted prepositions, pronominal
adverbs, and reciprocal pronouns. We elaborate on corpus frequencies for these and on their properties
with respect to PP attachment. We show that prepositions and contracted prepositions can be handled
together. They show an overall attachment tendency towards the noun. But pronominal adverbs and
reciprocal pronouns show an overall attachment tendency towards the verb and therefore must be treated
separately.
Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-20340
Conference or Workshop Item
Originally published at:
Volk, Martin (2003). German prepositions and their kin. A survey with respect to the resolution of PP
attachment ambiguities. In: Workshop on The Linguistic Dimensions of Prepositions and their Use in
Computational Linguistics Formalisms and Applications, Toulouse, 2003.
German prepositions and their kin. A survey with respect to the
resolution of PP attachment ambiguities
Martin Volk
Stockholm University
Department of Linguistics
SE-10691 Stockholm
volk@ling.su.se
Abstract

This paper surveys German prepositions and their relatives: contracted prepositions, pronominal adverbs, and reciprocal pronouns. We elaborate on corpus frequencies for these and on their properties with respect to PP attachment. We show that prepositions and contracted prepositions can be handled together. They show an overall attachment tendency towards the noun. But pronominal adverbs and reciprocal pronouns show an overall attachment tendency towards the verb and therefore must be treated separately.¹

Keywords: Corpus linguistics, ambiguity resolution, unsupervised learning

1 Introduction

Any computer system for natural language processing has to struggle with the problem of ambiguities. If the system is meant to extract precise information from a text, these ambiguities must be resolved. One of the most frequent ambiguities arises from the attachment of prepositional phrases (PPs). A PP that follows a noun (in English or German) can be attached to the noun or to the verb. We did an in-depth study on unsupervised statistical methods to resolve such ambiguities in German sentences based on cooccurrence values derived from a shallow parsed corpus (see [Volk, 2001] and [Volk, 2002]).

Corpus processing consisted of proper name recognition and classification, Part-of-Speech tagging, lemmatization, phrase chunking, and clause boundary detection. We used a corpus of more than 5 million words from the Computer-Zeitung (CZ), a weekly computer science newspaper. In addition to this training corpus, we prepared a 3000 sentence corpus with manually annotated syntax trees. From this treebank we extracted over 4000 test cases with ambiguously positioned PPs for the evaluation of the disambiguation method. We will call these test cases the ‘CZ test set’.

As a basis for this study we surveyed German prepositions and their relatives and we checked for prepositions, contracted prepositions, pronominal adverbs and reciprocal pronouns whether they can mutually benefit from each other with respect to attachment tendencies.

2 German prepositions

Prepositions in German are a class of words relating linguistic elements to each other with respect to a semantic dimension such as local, temporal, causal or modal. They do not inflect and cannot function by themselves as a sentence unit (cf. [Bußmann, 1990]). But, unlike other function words, a German preposition governs the grammatical case of its argument (genitive, dative or accusative). Frequent German prepositions are an, für, in, mit, zwischen.

Prepositions are considered to be a closed word class. Nevertheless it is difficult to determine the exact number of German prepositions. [Schröder, 1990] speaks of “more than 200 prepositions”, but his “Lexikon deutscher Präpositionen” lists only 110 of them. In this dictionary all entries are marked with their case requirement and their semantic features. For instance, ohne requires the accusative and is marked with the semantic functions instrumental, modal, conditional and part-of.²

¹ This paper is based on my research at the University of Zurich in a project supported by the Swiss National Science Foundation under grant 12-54106.98.
² See also [Klaus, 1999] for a detailed comparison of the range of German prepositions as listed in a number of recent grammar books.

The lexical database CELEX [Baayen et al., 1995] contains 108 German prepositions with frequency counts derived from corpora of the “Institut für deutsche Sprache”. This results in the arbitrary inclusion of nördlich, nordöstlich, südlich while östlich and westlich are missing.

Searching through 5.5 million tokens of our tagged computer magazine corpus we found around 540,000 preposition tokens corresponding to 99 preposition types.³ These counts do not include contracted prepositions. A list of the 66 most frequent German prepositions with frequencies from our corpus can be found in appendix A.

An early frequency count for German by [Meier, 1964] lists 18 prepositions among the 100 most frequent word forms. 17 out of these 18 prepositions are also in our top-20 list. Only gegen is missing which is on rank 23 in our corpus. This means that the usage of the most frequent prepositions is stable over corpora and time.

All frequent prepositions in German have some homograph serving as

• separable verb prefix (e.g. ab, auf, mit, zu),
• clause conjunction (e.g. bis, um)⁴,
• adverb (e.g. auf, für, über) in often idiomatic expressions (e.g. auf und davon, über und über),
• infinitive marker (zu),
• proper name component (von), or
• predicative adjective (e.g. an, auf, aus, in, zu as in Die Maschine ist an/aus. Die Tür ist auf/zu.).

The most frequent homographic functions are separable verb prefix and conjunction. Fortunately, these functions are clearly marked by their position within the clause. A clause conjunction usually occurs at the beginning of a clause, and a separated verb prefix mostly occurs at the end of a clause (rechte Satzklammer). A part-of-speech tagger can therefore disambiguate these cases.⁵

³ These figures are based on automatically assigned part-of-speech tags. If the tagger systematically mistagged a preposition, the counting procedure does not find it. In the course of the project we realized that this happened to the prepositions à, via and voller as used in the following example sentences (all examples in this paper are from the Computer-Zeitung, Konradin-Verlag, 1993-1997).
   (1) Derselbe Service in der Regionalzone (bis zu 50 Kilometern) kostet 23 Pfennig à 60 Sekunden.
   (2) Master und Host kommunizieren via IPX.
   (3) Windows steckt voller eigener Fehler.
⁴ [Jaworska, 1999] (p. 306) argues that “clause-introducing preposition-like elements are indeed prepositions”.
⁵ Note the high degree of ambiguity for zu which can be a preposition (zu ihm), a separated verb prefix (sie sieht ihm zu), the infinitive marker (ihn zu sehen), a predicative adjective (das Fenster ist zu), an adjectival or adverb marker (zu gross, zu sehr), or the ordinal number marker (sie kommen zu zweit).

Typical (i.e. frequent) prepositions are monomorphemic words (e.g. an, auf, für, in, mit, über, von, zwischen). Many of the less frequent prepositions are derived or complex. They have turned into prepositions over time and still show traces of their origin. They are derived from other parts-of-speech such as

• nouns (e.g. angesichts, zwecks),
• adjectives (e.g. fern, unweit),
• participle forms of verbs (e.g. entsprechend, während; ungeachtet), or
• lexicalized prepositional phrases (e.g. anhand, aufgrund, zugunsten).

German prepositions typically do not allow compounding. It is generally not possible to form a new preposition by a concatenation of prepositions. The two exceptions are gegenüber and mitsamt. Other concatenated prepositions have led to adverbs like inzwischen, mitunter, zwischendurch.

[Helbig and Buscha, 1998] call the monomorphemic prepositions primary prepositions and the derived prepositions secondary prepositions. This distinction is based on the fact that only primary prepositions form prepositional objects, pronominal adverbs (cf. section 2.2) and prepositional reciprocal pronouns (cf. section 2.3).

In addition, this distinction corresponds to different case requirements. The primary prepositions govern accusative (durch, für, gegen, ohne, um) or dative (aus, bei, mit, nach, von, zu) or both (an, auf, hinter, in, neben, über, unter, vor, zwischen). Most of the secondary prepositions govern genitive (angesichts, bezüglich, dank). Some
prepositions (most notably während) are in the process of changing from genitive to dative. Some prepositions do not show overt case requirements (je, pro, per; cf. [Schaeder, 1998]) and are used with determiner-less noun phrases.

Some prepositions show other idiosyncrasies. The preposition bis often takes another preposition (in, um, zu as in 4) or combines with the particle hin plus a preposition (as in 5). The preposition zwischen is special in that it requires a plural argument (as in 6), often realized as a coordination of NPs (as in 7).

(4) Portables mit 486er-Prozessor werden bis zu 20 Prozent billiger.
(5) ... und berücksichtigt auch Daten und Datentypen bis hin zu Arrays oder den Records im VAX-Fortran.
(6) Die Verbindungstopologie zwischen den Prozessoren läßt sich als dreidimensionaler Torus darstellen.
(7) Durch Microsoft Access müssen sich die Anwender nicht mehr länger zwischen Bedienerfreundlichkeit und Leistung entscheiden.

Results for PP attachment

We explored various possibilities to extract PP disambiguation information from the automatically annotated CZ corpus. We first used it to gather frequency data on the cooccurrence of pairs: nouns + prepositions and verbs + prepositions.

The cooccurrence value is the ratio of the bigram frequency count freq(word, preposition) divided by the unigram frequency freq(word). For our purposes word can be the verb V or the reference noun N1. The ratio describes the percentage of the cooccurrence of word + preposition against all occurrences of word. It is thus a straightforward association measure for a word pair. The cooccurrence value can be seen as the attachment probability of the preposition based on maximum likelihood estimates. We write:

    cooc(W, P) = freq(W, P) / freq(W)

with W ∈ {V, N1}. The cooccurrence values for verb V and noun N1 correspond to the probability estimates in [Ratnaparkhi, 1998] except that Ratnaparkhi includes a back-off to the uniform distribution for the zero denominator case. We added special precautions for this case in our disambiguation algorithm. The cooccurrence values are also very similar to the probability estimates in [Hindle and Rooth, 1993].

We started by computing the cooccurrence values over word forms for nouns, prepositions, and verbs based on their part-of-speech tags. In order to compute the pair frequencies freq(N1, P), we search the training corpus for all token pairs in which a noun is immediately followed by a preposition. The treatment of verb + preposition cooccurrences is different from the treatment of N+P pairs since verb and preposition are seldom adjacent to each other in a German sentence. On the contrary, they can be far apart from each other, the only restriction being that they cooccur within the same clause. We use the clause boundary information in our training corpus to enforce this restriction. For computing the cooccurrence values we accept only verbs and nouns with an occurrence frequency of more than 10.

With the N+P and V+P cooccurrence values for word forms we did a first evaluation over the CZ test set with the following simple disambiguation algorithm.

    if ( cooc(N1,P) && cooc(V,P) ) then
        if ( cooc(N1,P) >= cooc(V,P) ) then
            noun attachment
        else
            verb attachment

We found that we can only decide 57% of the test cases with an accuracy of 71.4% (93.9% correct noun attachments and 55.0% correct verb attachments). This shows a striking imbalance between the noun attachment accuracy and the verb attachment accuracy. This imbalance was countered with a noun factor which was automatically derived from the corpus based on the overall attachment tendency of prepositions towards nouns in comparison to their tendency towards verbs (cf. [Volk, 2002]). This move leads to an improvement of the overall attachment accuracy to 81.3%. We then went on to lemmatize all word forms which also included mapping contracted prepositions to their corresponding bare forms.
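
The cooccurrence values and the simple disambiguation algorithm described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The input format (a list of clauses, each a list of (word, tag) pairs), the tag names "N", "V" and "P", the function names and the toy sentences are all assumptions made for illustration. Counting a verb once with every preposition in its clause is likewise a simplifying assumption, and the sketch includes neither the noun factor nor any back-off.

```python
from collections import Counter

# Hypothetical input format: each clause is a list of (word, tag) pairs,
# with tag "N" (noun), "V" (verb), "P" (preposition). Tag names are assumed.

def count_frequencies(clauses):
    """Collect unigram frequencies and word+preposition pair frequencies."""
    unigram = Counter()
    pair = Counter()
    for clause in clauses:
        preps = [w for w, t in clause if t == "P"]
        for i, (word, tag) in enumerate(clause):
            if tag not in ("N", "V"):
                continue
            unigram[(word, tag)] += 1
            if tag == "N":
                # N+P: the preposition must immediately follow the noun.
                if i + 1 < len(clause) and clause[i + 1][1] == "P":
                    pair[(word, tag, clause[i + 1][0])] += 1
            else:
                # V+P: verb and preposition only need to share a clause.
                for p in preps:
                    pair[(word, tag, p)] += 1
    return unigram, pair

def cooc(word, tag, prep, unigram, pair, min_freq=10):
    """cooc(W, P) = freq(W, P) / freq(W); None if W is not frequent enough."""
    freq_w = unigram[(word, tag)]
    if freq_w <= min_freq:  # only words with frequency above min_freq count
        return None
    return pair[(word, tag, prep)] / freq_w

def attach(noun, verb, prep, unigram, pair, min_freq=10):
    """Return "noun", "verb", or None when the case cannot be decided."""
    n = cooc(noun, "N", prep, unigram, pair, min_freq)
    v = cooc(verb, "V", prep, unigram, pair, min_freq)
    if n and v:  # both values must exist and be non-zero
        return "noun" if n >= v else "verb"
    return None

# Toy data (invented, far too small for real use).
clauses = [
    [("Preis", "N"), ("für", "P"), ("Software", "N"), ("steigt", "V")],
    [("Preis", "N"), ("für", "P"), ("Rechner", "N"), ("sinkt", "V")],
    [("Preis", "N"), ("steigt", "V")],
    [("Kunde", "N"), ("zahlt", "V"), ("für", "P"), ("Software", "N")],
]
unigram, pair = count_frequencies(clauses)

# The frequency threshold is lowered to 0 because the toy corpus is tiny.
print(attach("Preis", "steigt", "für", unigram, pair, min_freq=0))
print(attach("Software", "zahlt", "für", unigram, pair, min_freq=0))
```

On the toy data the first call compares cooc(Preis, für) = 2/3 with cooc(steigt, für) = 1/2 and chooses noun attachment. The second call stays undecided because Software is never directly followed by für, so its cooccurrence value is zero. Undecided cases of this kind are why the simple algorithm covers only part of the test set.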