An Attempt at Multilingual POS Tagging for Tamil
Madhu Ramanathan, Vijay Chidambaram, Ashish Patro
Department of Computer Sciences
University of Wisconsin Madison
Abstract

Part of Speech (POS) tagging is the process of assigning a syntactic category to every word in a corpus. In this project we apply supervised and unsupervised methods of POS tagging, using a multilingual parallel corpus, to Tamil, an agglutinative language of ancient Dravidian origin. The multilingual parallel corpus includes three other languages, namely Hindi, English and French. We experimented on monolingual, bilingual and multilingual corpora using various models and techniques such as the HMM model, SVM model, CRF model and the Projection and Probability Re-estimation technique (Yarowsky, 2001), and carried out a detailed performance comparison in an attempt to capture the properties of the language that aid in increased accuracy for POS tagging. Supervised CRF modeling using a variety of features on a monolingual Tamil corpus revealed that word-specific features such as prefixes and suffixes produce an increase of 10%, the highest among all combinations of features. Bilingual and multilingual learning shows that the addition of other languages generally produces a decrease in accuracy, mainly because of the one-to-many association among the words; a further reason is the drop in accuracy at every stage of the pre-processing steps involved in accomplishing the word-level pairing. The results of our experiments clearly reflect the relatively free word order and agglutinative nature of the Tamil language and motivate the need for a morpheme-based POS tagger to attain greater accuracy.

1 Introduction

Part of speech (POS) tagging is the process of assigning a part of speech or other lexical class marker to each word in a sentence. POS tagging is an essential part of many applications such as speech recognition, natural language parsing, information retrieval and machine translation.

Our aim is to perform POS tagging for Tamil, a Dravidian language spoken in the southern part of India that has existed for over two thousand years. Tamil and Sanskrit are considered the two longest surviving classical languages in India, from which the other Dravidian and Indo-Aryan languages have been derived. Tamil also has a rich set of literary works, such as the Thirukkural, which have been manually translated into a number of languages. Our aim is to use such parallel corpora to build a method that improves the accuracy of existing taggers, which can then be used for other applications like automatic machine translation, speech recognition and parsing.

Tamil uses a relatively free word order and an agglutinative grammar, where suffixes are used to mark noun class, number, and case, verb tense and other grammatical categories. Tamil words consist of a lexical root to which one or more affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes are of two types: derivational suffixes, which change either the part of speech of the word or its meaning, and inflectional suffixes, which mark categories such as person, number, mood, tense, etc.
There is no absolute limit on the length and extent of agglutination, which can lead to long words with a large number of suffixes (Tamil, Wikipedia). Much of Tamil grammar is extensively described in the oldest known grammar book for Tamil, the Tolkāppiyam.

The agglutinative nature of Tamil makes tagging a complex process. Various methodologies, both statistical and rule based, have been developed and widely used for POS tagging in different languages. Because Tamil is a free-form language with a large variety of morphological combinations, inflections and exceptions, developing a rule-based method for it would require a lot of effort as well as extensive knowledge of its complex grammatical structures, which makes it almost impractical. Supervised statistical methods require a large, reliably annotated corpus for training. At the same time, a considerably large amount of sentence-aligned parallel data (UDHR corpora, Bible corpora, Thirukkural corpora, TV news, newspaper articles, etc.) is available in a number of languages that we can put to use for this purpose. A large number of those languages, such as the European languages, have pre-trained POS taggers that can be used to label the text in those languages. Considering these factors, we tried to address three main questions:

• When trained on a monolingual corpus, what properties/features of the language contribute to increasing the POS tagging accuracy?
• Does the addition of one or more languages from a parallel corpus help in increasing the POS tagging accuracy? If the addition of languages does improve the tagging accuracy, are there any specific properties of the language being paired that lead to an increase in accuracy?

As a means to find the answers to these questions we experimented with monolingual, bilingual and multilingual corpora using various methods such as the SVM model, HMM model, CRF model and the bilingual projection and probability re-estimation method (Yarowsky, 2001). The languages that we chose are Hindi, English and French. Tamil follows an SOV word order, and we chose Hindi as it is a well-studied Indian language with the same word order. We also chose two other languages that have the SVO word order, namely English and French, to see how much the word order property influences the accuracy of the results.

The remainder of the paper is organized into 5 sections. Section 2 deals with the related work, Section 3 describes the method, Section 4 covers the experiments and analysis, and Section 5 gives the concluding remarks.

2 Related Work

Tamil is one of the classical Indian languages and has a very strong linguistic base with a well-defined set of morpho-syntactic rules. However, parsing, development of parsing models, chunking, generation of treebanks, POS tagging, morphological analysis, and development of semi-automated and automated tools for these processes in Tamil are at a nascent stage. The existing work on POS tagging is based on morphological analyzers built by Vasu Renganathan (Renganathan, 2001), Ganesan, and RCILTS-T. Due to constraints such as limited coverage of morpho-syntactic and semantic rules, non-availability of methodologies for large-scale development of parsing models, non-availability of standards, non-applicability of statistical methods and resource deficiency, the reported tools cannot be used directly for all types of NLP applications. These existing tools have been developed using rule-based approaches. However, rule-based techniques cannot fully address all inflectional and derivational word forms or peculiar characteristics such as relatively free word order, syntax with semantics and long-distance relationships. Only moderate accuracy can be achieved with rule-based techniques. This motivates the need for a statistical approach to POS tagging in Tamil.

Various methods for bilingual POS tagging, such as projection and induction, have been used to train highly accurate part-of-speech taggers (Yarowsky, 2001) for languages such as Vietnamese (Dieng, 2003). As one of our methods we use Yarowsky's robust projection and probability re-estimation technique to learn the POS tags for Tamil in a semi-supervised manner. There has been some recent work on bilingual (Snyder, 2008) and multilingual learning (Snyder, 2009) where the results show that adding languages generally increases the accuracy when unsupervised learning is done. There has been one attempt at a bilingual rule-based POS tagger for Tamil using projection and induction techniques that reports an increase in performance (Selvam, 2009). However, we aim for a purely statistical approach to POS tagging which does not require any prior knowledge of the grammar rules.

3 Methodology

We used the Universal Declaration of Human Rights (UDHR) corpus, which has been translated into over 300 languages, for our experimentation (UDHR, UDHRcorpus). The UDHR corpus consists of 75 lines of short text translated into all 300 languages, from which we chose the text for our set of languages: Tamil, Hindi, English and French. The following sections describe in detail the preprocessing step and the monolingual, bilingual and multilingual learning approaches that we experimented with.

3.1 Preprocessing

Before working on this data, we applied a preprocessing step to make it usable for our experiments. We arranged the text by pairing the Tamil text with each of the other 3 languages, giving a total of 3 language pairs. Sentence alignment was done using Microsoft Research's Bilingual Sentence Aligner tool (Microsoft, 2003). The sentence-aligned files were given to the GIZA++ word aligner and the union method was used to obtain the word alignments (Giza, 1999). The union method was chosen over the intersection method, which would give a 1-1 pairing, because Tamil is an agglutinative language and pairing it with languages that do not possess that property would yield very low recall under the intersection method of word alignment. The UDHR corpus was plain text without any POS tagging done for the words. For well-studied languages like Hindi, English and French we used existing pre-trained taggers. For Hindi we used the tagger developed by the Society for Natural Language Technology Research, and for English and French we used the TreeTagger tool (TreeTagger, 1994). For Tamil, as no such pre-trained tagger was available in a usable form, we had to hand-tag the corpus. Table 1 shows the set of 12 tags used for tagging the Tamil corpus. These tags were chosen as they were the frequently occurring tags that also appear in other languages. We tried to perform this tagging to the best of our ability, though some errors may have been introduced in this step. These tags were used as the gold standard for all our experiments.

Tag     Description
NN      Noun
CNN     Compound Noun
PRN     Pronoun
CPRN    Compound Pronoun
VRB     Verb
ADJ     Adjective
ADV     Adverb
CONJ    Conjunction
PP      Preposition
NUM     Number
X       Others
P       Punctuation marks

Table 1: Tagset used for Tamil corpus

3.2 Monolingual Supervised learning

In this method we use the monolingual Tamil corpus alone and apply supervised learning with various models, to estimate the maximum accuracy that can be obtained using a single language and also to find out which features of the language aid in increasing the tagging accuracy. For this purpose we split the dataset into training and test sets. The training set comprised 80% of the lines while the test set comprised 20% of the lines. Since the corpus was small we used 10-fold cross validation to estimate the accuracies. We trained using three well-known models, namely the Hidden Markov Model (HMM), Support Vector Machines (SVM) and Conditional Random Fields (CRF).
3.2.1 Hidden Markov Model (HMM)

We applied a bigram HMM model along with the Viterbi algorithm to the corpus. A maximum likelihood estimator was used to determine the emission and transition parameters. The transition and emission parameters were calculated as follows:

$$P(t \mid t') = \frac{\mathrm{count}(t', t)}{\mathrm{count}(t')}$$

$$P(w \mid t) = \frac{\mathrm{count}(t, w) + \delta}{\mathrm{count}(t) + |V| \cdot \delta} \qquad (1)$$

Here $t'$ denotes the preceding tag, $\delta$ is an additive smoothing constant and $|V|$ is the vocabulary size.

After determining the emission and transition probabilities, the probability of a given tag sequence $s$ for a given word sequence $w$ was determined using the following formula:

$$P(s, w) = \prod_i P(t_i \mid t_{i-1}) \cdot P(w_i \mid t_i)$$

3.2.2 Support Vector Machines

We used SVMTool, a general POS tagger based on Support Vector Machines, to train and test on our corpus. The tool offers several tagging modes, each adding a little more complexity to the tagging. We used a set of six strategies to determine the one that gives the maximum accuracy. The six strategies are listed in Table 2.

Strategy  Description
0         one-pass default strategy
1         two-pass revisiting results and relabeling
2         one-pass robust against unknown words
4         one-pass very robust against unknown words
5         one-pass sentence-level likelihood
6         one-pass robust sentence-level likelihood

Table 2: Strategies used in the SVM Model

3.2.3 Conditional Random Fields

For the conditional random fields we used the CRF++ tool, which is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. The CRF++ tool allows us to define our own set of features. It requires the training and testing files to be in a specific format. It also requires us to define a template file specifying the unigram and bigram features. For every unigram and bigram feature specified in the template file, the tool converts it into a set of binary feature functions associating the specified feature with the output category. Using this tool we built our training and testing files in the required formats and trained and tested on a variety of combinations of features. The combinations of features are listed in Table 3. From the results obtained, we try to determine the features that give the maximum increase in accuracy for POS tagging.

Feature  Description
1        Actual word
2        1 Previous word + Actual word
3        2 Previous words + Actual word
4        2 Previous words + Actual word
5        4 Previous words + Actual word
6        1 Next word + Actual word
7        2 Next words + Actual word
8        3 Next words + Actual word
9        1 Previous word + 1 Next word + Actual word
10       1 Prefix + Actual word
11       2 Prefixes + Actual word
12       Prefixes + 2 Suffixes + Actual word
13       Prefixes + 4 Suffixes + Actual word
14       Prefixes + 5 Suffixes + Actual word

Table 3: Feature sets used in monolingual learning

3.3 Bilingual Learning

3.3.1 Supervised

For the supervised method of bilingual learning we used the same CRF++ tool described above. Tamil was paired with each of the other three languages separately and the tags from the foreign language were projected onto the Tamil words using the word alignments. Then the training and testing files for the CRF++ tool were prepared and the template files were created considering the various combinations of possible features that could affect the accuracy of tagging. The feature sets that we tested are given in Table 4.

3.3.2 Semi-Supervised

For this we used the projection and aggressive tag probability re-estimation technique (Yarowsky, 2001). We used POS tag projection from an input language (e.g. English) to Tamil using the word alignments computed during the pre-processing