Arabic Morphology Generation Using a Concatenative Strategy
Violetta Cavalli-Sforza
Carnegie Technology Education
4615 Forbes Avenue
Pittsburgh, PA, 15213
violetta@cs.cmu.edu

Abdelhadi Soudi
Computer Science Department
Ecole Nationale de L'Industrie Minerale
Rabat, Morocco
asoudi@enim.ac.ma

Teruko Mitamura
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
teruko@cs.cmu.edu
Abstract

Arabic inflectional morphology requires infixation, prefixation and suffixation, giving rise to a large space of morphological variation. In this paper we describe an approach to reducing the complexity of Arabic morphology generation using discrimination trees and transformational rules. By decoupling the problem of stem changes from that of prefixes and suffixes, we gain a significant reduction in the number of rules required, as much as a factor of three for certain verb types. We focus on hollow verbs but discuss the wider applicability of the approach.

Introduction

Morphologically, Arabic is a non-concatenative language. The basic problem with generating Arabic verbal morphology is the large number of variants that must be generated. Verbal stems are based on triliteral or quadriliteral roots (3- or 4-radicals). Stems are formed by a derivational combination of a root morpheme and a vowel melody; the two are arranged according to canonical patterns. Roots are said to interdigitate with patterns to form stems. For example, the Arabic stem katab (he wrote) is composed of the morpheme ktb (notion of writing) and the vowel melody morpheme 'a-a'. The two are coordinated according to the pattern CVCVC (C=consonant, V=vowel).

There are 15 triliteral patterns, of which at least 9 are in common use, and 4 much rarer quadriliteral patterns. All these patterns undergo some stem changes with respect to voweling in the 2 tenses (perfect and imperfect), the 2 voices (active and passive), and the 5 moods (indicative, subjunctive, jussive, imperative and energetic).¹ The stem used in the conjugation of the verb may differ depending on the person, number, gender, tense, mood, and the presence of certain root consonants. Stem changes combine with suffixes in the perfect indicative (e.g., katab-naa 'we wrote', kutib-a 'it was written') and the imperative (e.g. uktub-uu 'write', plural), and with both prefixes and suffixes for the imperfect tense in the indicative, subjunctive, and jussive moods (e.g. ya-ktub-na 'they write, feminine plural') and in the energetic mood (e.g. ya-ktub-unna or ya-ktub-un 'he certainly writes'). There are a total of 13 person-number-gender combinations. Distinct prefixes are used in the active and passive voices in the imperfect, although in most cases this results in a change in the written form only if diacritic marks are used.²

¹ The jussive is used in specific constructions, for example, negation in the past with the negative particle lam (e.g., lam aktub 'I didn't write'). The energetic expresses corroboration of an action taking place. The indicative is common to both perfect and imperfect tenses, but the subjunctive and the jussive are restricted to the imperfect tense. The imperative has a special form, and the energetic can be derived from either the imperfect or the imperative.

² Diacritic marks are used in Arabic language textbooks and occasionally in regular texts to resolve ambiguous words (e.g. to mark a passive verb use).

Most previous computational treatments of Arabic morphology are based on linguistic models that describe Arabic in a non-concatenative way and focus primarily on analysis. Beesley (1991) describes a system that analyzes Arabic words based on Koskenniemi's
(1983) two-level morphology. In Beesley (1996) the system is reworked into a finite-state lexical transducer to perform analysis and generation. In two-level systems, the lexical level includes short vowels that are typically not realized on the surface level. Kiraz (1994) presents an analysis of Arabic morphology based on the CV-, moraic-, and affixational models. He introduces a multi-tape two-level model and a formalism where three tapes are used for the lexical level (root, pattern, and vocalization) and one tape for the surface level.

In this paper, we propose a computational approach that applies a concatenative treatment to Arabic morphology generation by separating the issue of infixation from other inflectional variations. We are developing an Arabic morphological generator using MORPHE (Leavitt, 1994), a tool for modeling morphology based on discrimination trees and regular expressions. MORPHE is part of a suite of tools developed at the Language Technologies Institute, Carnegie Mellon University, for knowledge-based machine translation. Large systems for MT from English to Spanish, French, German, Portuguese and a prototype for Italian have already been developed. Within this framework, we are exploring English to Arabic translation and Arabic generation for pedagogical purposes. We generate Arabic words including short vowels and diacritic marks, since they are pedagogically useful and can always be stripped before display.

Our approach seeks to reduce the number of rules for generating morphological variants of Arabic verbs by breaking the problem into two parts. We observe that, with the exception of a few verb types, there is very little interaction between stem changes and the processes of prefixation and suffixation. It is therefore possible to decouple, in large part, the problem of stem changes from that of prefixes and suffixes. The gain is a significant reduction in the number of transformational rules, as much as a factor of three for certain verb classes. This improves the space efficiency of the system and its maintainability by reducing duplication of rules, and simplifies the rules by isolating different types of changes.

To illustrate our approach, we focus on a particular type of verb, termed hollow verbs, and show how we integrate their treatment with that of more regular verbs. We also discuss how the approach can be extended to other classes of verbs and other parts of speech.

1 Arabic Verbal Morphology

Verb roots in Arabic can be classified as shown in Figure 1.³ A primary distinction is made between weak and strong verbs. Weak verbs have a weak consonant ('w' or 'y') as one or more of their radicals; strong verbs do not have any weak radicals.

³ Grammars of Arabic are not uniform in their classification of "hamzated" verbs, verbs containing the glottal stop as one of the radicals (e.g. [sa?al] 'to ask'). Wright (1968) includes them as weak verbs, but Cowan (1964) doesn't. Hamzated verbs change the written 'seat' of the hamza from 'alif' to 'waaw' or 'yaa?', depending on the phonetic context.

Strong verbs undergo systematic changes in stem voweling from the perfect to the imperfect. The first radical vowel disappears in the imperfect. Verbs whose middle radical vowel in the perfect is 'a' can change it to 'a' (e.g., qaTa'a 'he cut' -> yaqTa'u 'he cuts'),⁴ 'i' (e.g., Daraba 'he hit' -> yaDribu 'he hits'), or 'u' (e.g., kataba 'he wrote' -> yaktubu 'he writes') in the imperfect. Verbs whose middle radical vowel in the perfect is 'i' can only change it to 'a' (e.g., shariba 'he drank' -> yashrabu 'he drinks') or 'i' (e.g., Hasiba 'he supposed' -> yaHsibu 'he supposes'). Verbs with middle radical vowel 'u' in the perfect do not change it in the imperfect (e.g., Hasuna 'he was beautiful' -> yaHsunu 'he is beautiful'). For strong verbs, neither perfect nor imperfect stems change with person, gender, or number.

⁴ In the Arabic transcription capital letters indicate emphatic consonants; 'H' is the voiceless pharyngeal fricative; ''' is the voiced pharyngeal fricative; '?' is the glottal stop 'hamza'.

Hollow verbs are those with a weak middle radical. In both perfect and imperfect tenses, the underlying stem is realized by two characteristic allomorphs, one short and one long, whose use depends on the person, number and gender.
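The strong-verb voweling changes described in Section 1 can be illustrated with a minimal Python sketch. It is not part of MORPHE; the function name `imperfect_stem` and the table `ALLOWED` are hypothetical, and the sketch assumes that the imperfect middle vowel is supplied by the lexicon, since the text indicates it is not predictable from the perfect stem alone.

```python
import re

# Middle-vowel alternations for strong verbs, as listed in the text:
# perfect 'a' -> imperfect 'a', 'i' or 'u'; 'i' -> 'a' or 'i'; 'u' -> 'u'.
ALLOWED = {"a": {"a", "i", "u"}, "i": {"a", "i"}, "u": {"u"}}

def imperfect_stem(perfect_stem, imperfect_vowel):
    """Derive a strong verb's imperfect stem from its CVCVC perfect stem.

    Hypothetical helper: the first-radical vowel is dropped and the
    middle-radical vowel is replaced by the lexically given one.
    """
    # A consonant slot may span several transcription characters ("sh", "'").
    m = re.fullmatch(r"([^aiu]+)[aiu]([^aiu]+)([aiu])([^aiu]+)", perfect_stem)
    if m is None:
        raise ValueError("not a CVCVC stem: " + perfect_stem)
    c1, c2, perfect_vowel, c3 = m.groups()
    if imperfect_vowel not in ALLOWED[perfect_vowel]:
        raise ValueError("vowel change not attested for strong verbs")
    return c1 + c2 + imperfect_vowel + c3

# Third person masculine singular imperfect indicative: ya- + stem + -u
print("ya" + imperfect_stem("katab", "u") + "u")   # yaktubu
print("ya" + imperfect_stem("sharib", "a") + "u")  # yashrabu
```

Because the stem does not vary with person, gender, or number for strong verbs, a single derived stem per tense is enough in this sketch.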
[Figure 1 (tree diagram; layout lost in extraction). Triliteral roots divide into strong (regular, hamzated, doubled) and weak (weak initial radical: assimilated; weak middle radical: hollow; weak final radical: defective). The lower part of the tree shows tense (preterit/perfect, present/imperfect, participle), mood (indicative, imperative, subjunctive, jussive, energetic), and voice (active, passive).]

Figure 1: Classification of Arabic Verbal Roots and Mood Tense System
Hollow verbs fall into four classes:

1. Verbs of the pattern CawaC or CawuC (e.g. [Tawul] 'to be long'), where the middle radical is 'w'. Their characteristic is a long 'uu' between the first and last radical in the imperfect. E.g.,
   From the underlying root [zawar]: zaara 'he visited' and yazuuru 'he visits'
   Stem allomorphs:
   Perfect: -zur- and -zaar-
   Imperfect: -zur- and -zuur-

2. Verbs of the pattern CawiC, where the middle radical is 'w'. Their characteristic is a long 'aa' between the first and last radical in the imperfect. E.g.,
   From the underlying root [nawim]: naama 'he slept' and yanaamu 'he sleeps'
   Stem allomorphs:
   Perfect: -nim- and -naam-
   Imperfect: -nam- and -naam-

3. Verbs of the pattern CayaC, where the middle radical is 'y'. Their characteristic is a long 'ii' between the first and last radical in the imperfect. E.g.,
   From the underlying root [baya']: baa'a 'he sold' and yabii'u 'he sells'
   Stem allomorphs:
   Perfect: -bi'- and -baa'-
   Imperfect: -bi'- and -bii'-

4. Verbs of the pattern CayiC, where the middle radical is 'y'. E.g.,
   From the underlying root [hayib]: haaba 'he feared' and yahaabu 'he fears'
   Stem allomorphs:
   Perfect: -hib- and -haab-
   Imperfect: -hab- and -haab-

In the relevant literature (e.g., Beesley, 1998; Kiraz, 1994), verbs belonging to the above classes are all assumed to have the pattern CVCVC. The pattern does not show the verb conjugation class and makes it difficult to predict the type of stem allomorph to use. To avoid these problems, we keep information on the middle radical and vowel in the base form of the verb. In generation, classes 2 and 4 of the verb can be handled as one because they have the same perfect and imperfect stems.⁵

⁵ The only exception is the passive participle. Verbs of classes 1 and 2 behave the same (e.g. Class 1: [zawar] -> mazuwr 'visited'; Class 2: [nawil] -> manuwl 'obtained'), as do verbs of classes 3 and 4 (e.g. Class 3: [baya'] -> mabii' 'sold'; Class 4: [hayib] -> mahiib 'feared').
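The four hollow-verb classes can be summarized as a lookup from the middle radical and stem vowel of the base form to the vowels of the short and long allomorphs. The Python sketch below is only an illustration of that observation, not the authors' MORPHE rules; the names `VOWELS` and `hollow_allomorphs` are hypothetical, and the vowel table is read directly off the stem allomorphs listed for the four example roots.

```python
import re

# (middle radical, stem vowel) ->
#   (perfect short, perfect long, imperfect short, imperfect long) vowels
VOWELS = {
    ("w", "a"): ("u", "aa", "u", "uu"),  # class 1, CawaC
    ("w", "u"): ("u", "aa", "u", "uu"),  # class 1, CawuC
    ("w", "i"): ("i", "aa", "a", "aa"),  # class 2, CawiC
    ("y", "a"): ("i", "aa", "i", "ii"),  # class 3, CayaC
    ("y", "i"): ("i", "aa", "a", "aa"),  # class 4, CayiC (same as class 2)
}

def hollow_allomorphs(base):
    """Return the short and long stem allomorphs of a hollow base form.

    Hypothetical helper: 'base' keeps the middle radical and vowel,
    e.g. "zawar", as the text proposes for the base form of the verb.
    """
    m = re.fullmatch(r"([^aiu]+)a([wy])([aiu])([^aiu]+)", base)
    if m is None or (m.group(2), m.group(3)) not in VOWELS:
        raise ValueError("not a recognized hollow base form: " + base)
    c1, middle, vowel, c3 = m.groups()
    p_short, p_long, i_short, i_long = VOWELS[(middle, vowel)]
    return {
        "perfect": (c1 + p_short + c3, c1 + p_long + c3),
        "imperfect": (c1 + i_short + c3, c1 + i_long + c3),
    }

print(hollow_allomorphs("zawar"))
print(hollow_allomorphs("hayib"))
```

The identical table rows for CawiC and CayiC reflect the remark that classes 2 and 4 can be handled as one in generation. Which allomorph a given person, number, and gender selects is left out of the sketch.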
We describe our approach to modeling strong and hollow verbs below, following a description of the implementation framework.

2 The MORPHE System

MORPHE (Leavitt, 1994) is a tool that compiles morphological transformation rules into either a word parsing program or a word generation program.⁶ In this paper we will focus on the use of MORPHE in generation.

⁶ MORPHE is written in Common Lisp and the compiled MFH and transformation rules are themselves a set of Common Lisp functions.

Input and Output. MORPHE's output is simply a string. Input is a feature structure (FS) which describes the item that MORPHE must transform. A FS is implemented as a recursive Lisp list. Each element of the FS is a feature-value pair (FVP), where the value can be atomic or complex. A complex value is itself a FS. For example, the FS for generating the Arabic zurtu 'I visited' would be:

    ((ROOT "zawar")
     (CAT V) (PAT CVCVC) (VOW HOL)
     (TENSE PERF) (MOOD IND)
     (VOICE ACT)
     (NUMBER SG) (PERSON 1))

The choice of feature names and values, other than ROOT, which identifies the lexical item to be transformed, is entirely up to the user. The FVPs in a FS come from one of two sources. Static features, such as CAT (part of speech) and ROOT, come from the syntactic lexicon, which, in addition to the base form of words, can contain morphological and syntactic features. Dynamic features, such as TENSE and NUMBER, are set by MORPHE's caller.

The Morphological Form Hierarchy. MORPHE is based on the notion of a morphological form hierarchy (MFH) or tree. Each internal node of the tree specifies a piece of the FS that is common to that entire subtree. The root of the tree is a special node that simply binds all subtrees together. The leaf nodes of the tree correspond to distinct morphological forms in the language. Each node in the tree below the root is built by specifying the parent of the node and the conjunction or disjunction of FVPs that define the node. Portions of the Arabic MFH are shown in Figures 2-4.

Transformational Rules. A rule attached to each leaf node of the MFH effects the desired morphological transformations for that node. A rule consists of one or more mutually exclusive clauses. The 'if' part of a clause is a regular expression pattern, which is matched against the value of the feature ROOT (a string). The 'then' part includes one or more operators, applied in the given order. Operators include addition, deletion, and replacement of prefixes, infixes, and suffixes. The output of the transformation is the transformed ROOT string. An example of a rule attached to a node in the MFH is given in Section 3.1 below.

Process Logic. In generation, the MFH acts as a discrimination network. The specified FS is matched against the features defining each subtree until a leaf is reached. At that point, MORPHE first checks in the irregular forms lexicon for an entry indexed by the name of the leaf node (i.e., the MF) and the value of the ROOT feature in the FS. If an irregular form is not found, the transformation rule attached to the leaf node is tried. If no rule is found or none of the clauses of the applicable rule match, MORPHE returns the value of ROOT unchanged.

3 Handling Arabic Verbal Morphology in MORPHE

Figure 2 sketches the basic MFH and the division of the verb subtree into stem changes and prefix/suffix additions.⁷ The inflected verb is generated in two steps. MORPHE is first called with the feature CHG set to STEM. The required stem is returned and temporarily substituted for the value of the ROOT feature.

⁷ The use of two parts of the same tree for the two problems is a constraint of MORPHE's implementation, which does not permit multiple trees with separate roots.
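The generation process described in Section 2 (descend the MFH, consult the irregular forms lexicon, then try the leaf's rule clauses, otherwise return ROOT unchanged) can be sketched in a few lines of Python. This is an illustrative toy, not MORPHE's actual Common Lisp implementation or rule syntax. Every name here (`Node`, `find_leaf`, `generate`, `replace_infix`, the leaf name, its defining features, and the regular expression) is a hypothetical stand-in, and only conjunctions of FVPs are supported. Returning ROOT unchanged when no leaf is reached is also an assumption.

```python
import re

class Node:
    """One MFH node: a name, the FVPs that define it, children, and a rule."""
    def __init__(self, name, features, children=(), rule=None):
        self.name = name
        self.features = features        # conjunction of feature-value pairs
        self.children = list(children)
        self.rule = rule or []          # clauses: (regex, [operators])

def find_leaf(node, fs):
    """Use the tree as a discrimination network: follow matching children."""
    while node.children:
        for child in node.children:
            if all(fs.get(k) == v for k, v in child.features.items()):
                node = child
                break
        else:
            return None                 # no subtree matches the FS
    return node

def generate(tree, irregulars, fs):
    root = fs["ROOT"]
    leaf = find_leaf(tree, fs)
    if leaf is None:
        return root                     # assumption: behave as if no rule
    if (leaf.name, root) in irregulars: # irregular forms lexicon first
        return irregulars[(leaf.name, root)]
    for pattern, operators in leaf.rule:
        if re.fullmatch(pattern, root): # 'if' part: regex on ROOT
            for op in operators:        # 'then' part: operators in order
                root = op(root)
            return root
    return root                         # nothing matched: ROOT unchanged

def replace_infix(old, new):
    return lambda s: s.replace(old, new, 1)

# Toy tree with a single hypothetical stem-change leaf for hollow verbs.
TREE = Node("root", {}, [
    Node("verb", {"CAT": "V"}, [
        Node("hollow-perf-short-stem",
             {"CHG": "STEM", "VOW": "HOL", "TENSE": "PERF", "PERSON": 1},
             rule=[(r".+awa.+", [replace_infix("awa", "u")])]),
    ]),
])

FS = {"ROOT": "zawar", "CAT": "V", "PAT": "CVCVC", "VOW": "HOL",
      "TENSE": "PERF", "MOOD": "IND", "VOICE": "ACT",
      "NUMBER": "SG", "PERSON": 1, "CHG": "STEM"}

print(generate(TREE, {}, FS))  # zur
```

The example covers only the first of the two generation steps, the call with CHG set to STEM. The returned stem would then replace the value of ROOT before affixes are added.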