An Arabic to English Example-Based Translation System
K. Bar, Y. Choueka, and N. Dershowitz
Manuscript received January 7, 2007.
K. Bar, Dept. of Computer Science, Tel Aviv University, Ramat Aviv, Israel (email: kfirbar@post.tau.ac.il).
Y. Choueka, Dept. of Computer Science, Bar-Ilan University, Ramat Gan, Israel (email: ycsarah@netvision.net.il).
N. Dershowitz, Dept. of Computer Science, Tel Aviv University, Ramat Aviv, Israel (email: nachumd@post.tau.ac.il).

I. INTRODUCTION

THE example-based (or "memory-based") paradigm has become a fairly common technique for natural language processing (NLP), and especially for machine-translation applications, ever since it was first proposed by Nagao in [1]. That paper expressed the main idea behind the example-based machine translation (EBMT) paradigm, namely to emulate the way a human translator operates in some cases. Such a system exploits a large bilingual corpus to find similar examples for fragments of the input source-language (Arabic, in our case) text, and imitates their translations [2]. Searching for similar fragments is called matching. Given a group of matched fragments, the next step is to extract possible translations from the target-language (English, in our case) version of the corpus. This is the transfer step. The last step is recombination, which is the generation of a complete target-language text by pasting together translated fragments. Fig. 1 outlines an example-based system for Arabic-to-English. The reader may refer to the comprehensive survey of example-based machine translation systems by Somers [3].

Fig. 1. Main steps of example-based translation system.

We describe an implementation of the major components of an EBMT system that translates short Modern Standard Arabic (MSA) sentences into English. It is a non-structural system, so it stores the translation examples as textual strings, with some additional morphology and part-of-speech information. Our work is still in progress. Currently, the system fragments any newly introduced input sentence and translates each fragment separately. Recombining those translations into a final coherent form is left for future work.

Our final goal is to develop an automated assistant for Arabic-to-English machine translation systems that work within a rule-based or statistical paradigm, so as to better handle complicated cases and especially to improve the fluency of the generated translations.

The following section is a general description of our system. In Section III, we give some experimental results using common automatic metrics. Conclusions are presented in Section IV.

II. SYSTEM DESCRIPTION

The translation examples we need were extracted from a collection of parallel unvocalized Arabic-English documents taken from the United Nations document inventory available under the Official Document System (ODS) [4]. We automatically aligned each parallel document on the paragraph level, and each parallel paragraph was taken as a translation example. These examples were morphologically analyzed using the well-known Buckwalter morphological analyzer (version 1.0) [5], and part-of-speech tagged using SVM-POS [6], in such a way that, for each word, we considered only the relevant Buckwalter analyses with the corresponding SVM-POS part-of-speech tag. A special lookup table that maps Arabic words to their corresponding English words in each parallel paragraph was also created. For each Arabic word in the translation example, we look up its English equivalents in the lexicon and expand that with synonyms from WordNet. Then we search the English version of the
translation example for all instances on the lemma level and insert them in the table.

The Arabic version of the corpus was indexed on word, stem and lemma levels (stem and lemma as defined by the Buckwalter analyzer), so, for each given word, we are able to retrieve all translation examples that contain that word on any of the three levels.

Given a new input sentence, the system begins by searching the corpus for translation examples for which the Arabic version matches fragments of the input sentence. A matched fragment must contain at least two adjacent words of the input sentence. The same fragment can be found in more than one translation example. Therefore, a special match score is assigned to each fragment-translation pair, representing the quality of the matched fragment in the specific translation example. Fragments are matched word by word, so the score for a fragment is the average of the individual word match scores.

Words are matched on text, stem, lemma, and part-of-speech levels, with each level assigned a different score. Text (exact string) and stem matches credit the words with the maximum possible; a lemma match credits them with less, and a part-of-speech match credits the fragment match score with a minimal amount. Table I summarizes the several match levels we used in our experiments.

Text and stem matches receive almost the same score since, currently, we do not yet handle the translation modification needed. When dealing with unvocalized text, there are, of course, complicated situations when both words have the same stem but different lemmas, for example, the words كتب ("wrote") and كتب ("books"). Such cases are not yet handled, since we have not worked with a context-sensitive Arabic lemmatizer and so cannot derive the correct lemma of an Arabic word. Still, the combination of the Buckwalter morphological analyzer and the SVM-POS tagger allows us to reduce the number of possible lemmas for every Arabic word so as to reduce the amount of ambiguity. By lemma match, we mean that words match on any one of their possible lemmas. The match score in such a case is the ratio between the number of equal lemmas and the total number of lemma pairs (one per word). Further investigation, as well as developing and working with a context-sensitive Arabic lemmatizer, is needed to better handle all such situations.

Fragments with a score below some predefined threshold are discarded, since passing low-score fragments to the next step dramatically increases total running time. Note that a larger corpus, with the concomitant increase in the number of potential fragments, would require raising the threshold.

TABLE I
WORD MATCHING LEVELS

Match Level | Description | Match Score
Text | Exact match of the words. | 1
Stem | Words match in their stems but not in their surface form. For instance, the words الدستورية ("the constitutionality") and دستوريتي ("my constitutional") share the stem دستوري. | 0.9
Lemma | Words share a lemma. For instance, the following words match in their lemmas: مارق ("apostate") and مراق ("apostates"). Note that the stems of these words are not the same. | Dynamic score
Content | This level is planned but not yet implemented. The idea is that, for example, two location names would get a higher score than two dissimilar proper nouns. | 0.8
Part of Speech | Words match only in their part-of-speech. For instance, both are nouns. We require that both also have the same tags for their affixes. For example, if a word is tagged as a noun and has the definite-article prefix ال ("the"), the matched word must agree on both features: it must be a noun and also have the definite-article prefix. | 0.3
Common Word Match | This level is relevant only for common words and affixes, taken from a predefined list. These words/affixes are organized in groups that represent the same meaning. Clearly, a word/affix may be a member of more than one group. Words/affixes that are members of the same group are also matched on this level. For example, the prefix ب ("with", "by", "in") is in the same group as the preposition في ("in"). | 1

Fragments are stored in a structure comprising the following: (1) source pattern: the fragment's Arabic text, taken from the input sentence; (2) example pattern: the fragment's Arabic text, taken from the matched translation example; (3) the English translation of the example pattern; (4) the match score of the fragment and its example translation.

For efficiency, fragments sharing the same example pattern are collected and stored in a higher-level, general-fragment structure. (Note that a general fragment consisting of only one fragment is also possible.)

The input to the transfer step consists of all the collected general fragments that were found in the matching step, and its output is the translations of those general fragments. The translation of a general fragment is taken to be the best generated translation among the comprised fragments. Translating a fragment is done in two main steps: (1) extracting the translation of the example pattern from the English version of the translation example; (2) fixing the extracted translation so that it will be the translation of the fragment's source pattern.

The first step is to extract the translation of the fragment's example pattern from the English version of the translation example. Here we use the lookup table prepared for every translation example within our corpus. For every Arabic word in the pattern, we look up its English equivalents in the table
and mark them in the English version of the translation example. Then, we extract the shortest English segment that contains the maximum number of equivalent words. Usually a word in some Arabic example pattern has several English equivalents, which makes the translation extraction process complicated and error-prone. For this reason, we also restrict the ratio between the number of Arabic words in the example pattern and the number of English words in the extracted translation, bounding it by a function of the ratio between the total number of words in the Arabic and English versions of the translation example.

For example, take the following translation example:
A: الخدمات الاستشارية والتعاون التقني في ميدان حقوق الإنسان
E: "Advisory services and technical cooperation in the field of human rights."
Table II is the corresponding lookup table. Now, suppose the example pattern is ميدان حقوق الإنسان ("the field of human rights"), so we want to extract its translation from the English version of the translation example. Using the extracted lookup, we mark the English equivalents of the pattern words in the translation example ("field", "human" and "rights" in "Advisory services and technical cooperation in the field of human rights"), and then we extract the shortest English segment that contains the maximum number of equivalent words, viz. "field of human rights".

TABLE II
ALIGNMENT LOOKUP TABLE

English | Arabic
Services | الخدمات
Advisory | الاستشارية
Cooperation | والتعاون
Technical | التقني
In | في
Field | ميدان
Rights | حقوق
Human | الإنسان

This is of course a simple example. More complicated ones would have more than one equivalent for each Arabic word.

Sometimes it is hard to find the corresponding English equivalents for a specific Arabic word. Usually this happens when the Arabic word is part of some phrase whose translation does not follow word for word, as in, for example, the Arabic example pattern غير رسمي (meaning "not formal"). In many cases, we might find "informal" in the English version instead. The problem is that neither the synonym list of the word رسمي ("formal"), nor the list of the word غير ("not"), contains the word "informal". Such a situation is handled by a manually defined rule that is triggered whenever the word غير ("not") appears. The system checks the following word and, instead of building a synonym list, builds an antonym list, using WordNet. In this example, the word "informal" appears as an antonym of the word "formal" in WordNet.

There are more complicated structures that are not handled yet, but capturing and writing rules for such cases seems quite feasible.

Recall that the match of a corpus fragment to the input fragment can be inexact: words may be matched on several levels. Exactly matched words are assumed to have the same translation, but stem or lemma matched words may require modifications (mostly inflection and preposition issues) to the extracted translation. These issues were left for future work. Words matched on the part-of-speech level require a complete change of meaning. For example, take the input fragment مجلس الامن ("the Security Council"), matched to the fragment مسؤولية الامن ("the security responsibility") in some translation example. The words مجلس ("council") and مسؤولية ("responsibility") are matched on the part-of-speech level (both are nouns). Assume that the extracted translation from the translation example is "the security responsibility", which is actually a translation of مسؤولية الامن ("the security responsibility") and is not the translation of the input pattern at all. But, by replacing the word "responsibility" from the translation example with the translation of مجلس ("council") from the lexicon, we get the correct phrase: "the security council". The lexicon is implemented using the glossaries extracted from the Buckwalter morphological analyzer and expanded with WordNet synonyms, as was explained above.

Sometimes the extracted translation contains some extra unnecessary words in the middle. Those words appear mostly because of the different structure of a noun phrase in the two languages. For example, consider the example موضوع الامن الاقليمي and its translation: "the subject of regional security". By extracting the translation of the pattern موضوع الامن, we obtain: "the subject of regional security" (since it is the shortest segment that contains maximum word alignments). Clearly, the word "regional" is unnecessary in the translation because it is the translation of the word الاقليمي ("the regional"), which does not appear in the pattern. So by removing that word from the translation we obtain the correct translation of the pattern. The word "regional" appears in the extracted translation due to the fact that Arabic adjectives come after the nouns they qualify, which is the opposite of English syntax. Here, the noun phrase الامن الاقليمي ("the regional security") is translated so that the translation of الاقليمي ("the regional") appears before the translation of الامن (AlAmn, "security"). Currently, identifying such situations is done by looking for the Arabic source of the word "regional" among a fixed number of Arabic words that come immediately after the pattern in the translation example. However, this method is insufficient for more complex situations and is also very time-consuming. Our plan is to apply an Arabic chunker to extract the boundaries of the noun phrase and in that way delimit the search area.

Removing unnecessary words from the extracted translation must preserve the correct English syntax of the remaining translation, which in some cases seems to be a difficult task.
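The extraction step of the transfer stage (mark the English equivalents of the pattern words, then take the shortest English segment containing the maximum number of them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `extract_translation`, the dictionary-based lookup table, and the simple regular-expression tokenizer are all assumptions made for illustration, and the length-ratio restriction described in the text is omitted.

```python
import re

# Hypothetical lookup table for one translation example (after Table II):
# each Arabic word maps to the set of its lowercase English equivalents.
LOOKUP = {
    "ميدان": {"field"},
    "حقوق": {"rights"},
    "الإنسان": {"human"},
}

EXAMPLE_EN = ("Advisory services and technical cooperation "
              "in the field of human rights.")
PATTERN = ["ميدان", "حقوق", "الإنسان"]


def extract_translation(english, pattern_words, lookup):
    """Return the shortest English segment that covers the maximum
    number of pattern words, or "" if no pattern word is aligned."""
    # Simplified tokenization (an assumption): keep alphabetic tokens only.
    tokens = re.findall(r"[A-Za-z']+", english)

    # For each English token, the set of pattern words it may translate.
    marks = [
        {w for w in pattern_words if tok.lower() in lookup.get(w, set())}
        for tok in tokens
    ]

    # Pattern words that have at least one equivalent in the sentence.
    coverable = set().union(*marks)
    if not coverable:
        return ""

    best = None  # (start, end) token indices of the best window so far
    for start in range(len(tokens)):
        if not marks[start]:
            continue  # a shortest window always starts on a marked token
        covered = set()
        for end in range(start, len(tokens)):
            covered |= marks[end]
            if covered == coverable:
                if best is None or end - start < best[1] - best[0]:
                    best = (start, end)
                break  # extending further can only lengthen this window

    return " ".join(tokens[best[0]:best[1] + 1])


print(extract_translation(EXAMPLE_EN, PATTERN, LOOKUP))
```

With the lookup entries above, the call returns the segment "field of human rights", in line with the worked example in the text. A quadratic scan over window start and end positions is used for clarity; since translation examples are single paragraphs, a more elaborate sliding-window search is probably unnecessary for a sketch of this kind.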
For that purpose, we have compiled several rules to deal with different situations. These rules are based on the syntax of the extracted English translation and identify cases that need special care. First, we chunk the translation to discover its basic noun phrases, using the BaseNP [7] chunker. To do that, we first apply Brill's part-of-speech tagger [8] to the translation. Then, by looking at the chunked English text, we can ascertain the effect of removing the unnecessary word. In the previous example, removing the word "regional" from the text, "the subject of regional security", may be done without any further modification, since by tagging and chunking the segment we get
[the/DT subject/NN] of/IN [regional/JJ security/NN]
(the phrases in brackets are noun phrases) and "regional" is simply an adjective within a noun phrase, which still has the same head. Prepositions and other function words that relate to the phrase are still necessary, so we keep them.

As already mentioned, a general fragment may contain several fragments sharing the same Arabic example pattern. Among the extracted translations of the comprised fragments, which are all translations of the same Arabic pattern, we choose the translation that covers the maximum number of Arabic words to represent the general fragment. The translation score calculated for the chosen translation is the ratio between the number of covered words and the total number of words in the Arabic pattern. The total score of a general fragment is the product of its match score and its translation score.

In the recombination step, we paste together the extracted translations to form a complete translation of the input sentence. This is generally composed of two subtasks. The first is finding the N best recombinations of the extracted translations that cover the entire input sentence, and the second is smoothing out the recombined translations to make a fully grammatical English sentence. Currently, we handle only the first subtask; the second is left for future work. By multiplying the total scores of the comprised general fragments, we calculate a final-translation score for each generated recombination. The N best (where N is configurable) recombinations are reported.

III. EXPERIMENTAL RESULTS

Experiments were conducted on a corpus containing 13,500 translation examples. The following results are based on 400 short Arabic sentences (5.5 words per sentence on average) that were taken from unseen documents of the United Nations inventory. The ten best results were evaluated by some of the common automatic criteria for machine translation evaluation (BLEU [9], NIST, and METEOR [10]), although our system is still under construction. Also, we used only two different translation references for the evaluation. Table III shows some preliminary experimental results. The first row contains the results of evaluating the system's highest ranked translation for each input sentence. The second is the same, but on the best translation from the viewpoint of a human referee. In most cases, the best translation chosen by the referee had a final-translation score close to (or even the same as) that of the system's best translation.

TABLE III
EVALUATION RESULTS

 | BLEU (4-gram) | NIST | METEOR
Best translation chosen by the system | 0.1849 | 4.1792 | 0.4851
Best translation chosen by a human referee | 0.2488 | 5.1281 | 0.5363

IV. CONCLUSION

We believe we have demonstrated the potential of the example-based approach for Arabic, with only minimum investment in Arabic syntactical and linguistic issues. We found that matching fragments on the level of lemma and stem, as well as part-of-speech, enabled the system to better exploit the small number of examples in the corpus we used. More work is needed to enlarge and enrich the corpus, as well as to formulate rules to deal with various problematic situations that are not yet handled. This all appears quite feasible. Finally, we do not claim that the example-based method is sufficient to handle the complete translation process. It seems that, for Arabic, it should work together with some kind of rule-based engine, as part of a multi-engine system, so as to better handle more complicated situations.

REFERENCES

[1] M. Nagao, "A Framework of Mechanical Translation between Japanese and English by Analogy Principle", In A. Elithorn and R. Banerji, eds., Artificial and Human Intelligence. North-Holland, 1984.
[2] S. Sato and M. Nagao, "Toward memory-based translation," Proceedings of COLING, vol. 3, pp. 247-252, 1990.
[3] H. L. Somers, "Review article: Example-based machine translation", Machine Translation, pp. 113-157, 1999.
[4] United Nations Official Document System (ODS), URL http://www.ods.un.org (viewed on 29/11/06).
[5] T. Buckwalter, "Buckwalter Arabic Morphological Analyzer Version 1.0". Linguistic Data Consortium, Philadelphia, 2002. URL http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49 (viewed on 21/11/2006).
[6] M. Diab, K. Hacioglu and D. Jurafsky, "Automatic tagging of Arabic text: from raw text to base phrase chunks", The National Science Foundation, USA, 2004.
[7] L. Ramshaw and M. Marcus, "Text chunking using transformation-based learning", In Proceedings of the Third Workshop on Very Large Corpora, MIT, 1995.
[8] E. Brill, "A simple rule-based part of speech tagger", In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 112-116. Morgan Kaufmann, San Mateo, California, 1992.
[9] K. Papineni, S. Roukos, T. Ward and W. J. Zhu, "Bleu: a method for automatic evaluation of machine translation", Proceedings of the 40th Annual Meeting of the ACL, pp. 311-318, Philadelphia, PA, July, 2002.
[10] S. Banerjee and A. Lavie, "Meteor: an automatic metric for MT evaluation with improved correlation with human judgments", In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, Ann Arbor, MI, June, 2005.