Towards Sindhi Corpus Construction
Mutee U Rahman
Department of Computer Science, Isra University, Hyderabad Sindh 71000, Pakistan
muteeurahman@gmail.com
Abstract

This paper discusses the current state of Sindhi corpus construction. Corpus development issues, including corpus acquisition, preprocessing, and tokenization, are discussed in detail. Preliminary results and observations are presented, covering letter unigram, bigram and trigram frequencies, word frequencies, and word bigram frequencies. The current state of the Sindhi corpus, its limitations, and future work are also discussed. The paper also explores the orthography and script of the Sindhi language with reference to corpus development.

1. Introduction

Sindhi is one of the major languages of Pakistan, spoken by approximately 30-40 million people [1][2]. Sindhi is frequently used on the internet: Sindhi blogs, literary websites, online newspapers and discussion forums are increasing day by day. After Urdu, Sindhi is the second largest written language of Pakistan. Despite its online usage and popularity, only a few language processing resources are available to NLP researchers; these include a lexicon, fonts and simple word processors. The development of Sindhi language processing resources such as linguistic corpora and a comprehensive computational lexicon has not even been initiated.

Sindhi is written in Perso-Arabic (سنڌي), Devanagari (सिन्धी) and Roman (sindhi) scripts. Perso-Arabic is the most common script for Sindhi writing in Pakistan and India. Devanagari script is also used for Sindhi writing in India. Roman script, though not yet standardized, is also gaining popularity. Very few written documents are available in Roman script, but it is used frequently for communication on the internet, cell phones and other smart devices. Because most online and offline written material in Sindhi is in Perso-Arabic script, the Sindhi corpus being constructed is in Perso-Arabic script, using UTF-16 encoding.

The following sections discuss existing work on Pakistani language corpora, the orthography and script of the Sindhi language, corpus construction issues, corpus acquisition, preprocessing, tokenization, and the results of a preliminary statistical analysis. Finally, future work is discussed along with the conclusion.

2. Previous work

Apart from fonts, keyboard design [3] and a few digital dictionaries [4], Sindhi language processing resources are not publicly available. Studies or development projects for resources such as linguistic corpora and a comprehensive computational lexicon have not even been initiated. Various research organizations and individuals are working on the development of linguistic corpora for different Pakistani languages. For Urdu, EMILLE [5], the Baker Riaz corpus [6], the Jang newspaper corpus [7], and the parallel English, Urdu and Nepali corpus [8] are some key examples. For Pashto, the projects include the BBN Byblos Pashto OCR System [9] and the machine-readable Pashto text corpus being developed at the University of Peshawar [10]. The first Punjabi language corpus was developed by the Central Institute of Indian Languages (CIIL), India [11]. The Hindi and Punjabi parallel corpus developed by CDAC Noida is another useful linguistic corpus. No linguistic corpora of this type can be found for Sindhi, Balouchi, Siraiki and many other Pakistani languages. In contrast to other Pakistani languages (excluding Urdu), Sindhi text in electronic format is easily available and is being collected continuously for the corpus under discussion.

3. Orthography and script of Sindhi

Sindhi is written in Perso-Arabic script, based on an extended Arabic character set, in Naskh style. The Sindhi
alphabet consists of 52 letters, shown in Figure 1. The alphabet contains basic letters like ا، ب، ٻ and secondary letters like جھ and گھ, which are aspirated versions of ج and گ.

Figure 1. Sindhi alphabet.

Sindhi words always end in a vowel [12]; this vocalic ending is optionally marked by diacritics in written text. Diacritics are also used inside words to represent additional vocalic features. The absence of diacritics in written text sometimes causes semantic ambiguity. For instance, the words دٻڻ (to push) and دٻڻ (bog) differ only in a vowel diacritic (damma or fatha) and are semantically ambiguous without it. The diacritics used in Sindhi are shown in Figure 2.

Figure 2. Diacritics used in Sindhi.

Sindhi has its own numerals, based on Perso-Arabic numerals, shown in Figure 3. The use of Hindu-Arabic numerals is also very common in Sindhi writing. The special symbols shown in Figure 3 are also used in Sindhi written text.

Figure 3. Special symbols and numerals used in Sindhi written text.

4. Sindhi Corpus Development

Since the arrival of Unicode support and a Unicode-based Sindhi keyboard design [13], the availability of Unicode-based Sindhi text on the internet has been increasing day by day. The key motivation for Sindhi corpus construction is the availability of online text in Sindhi newspapers, blogs, literary websites and discussion forums. Although the available online resources do not yet provide a huge amount of text, they are increasing day by day, and the corpus is being collected continuously. Software routines for preprocessing, normalization, tokenization and frequency calculation are implemented in C# using the Microsoft .NET Framework libraries.

4.1. Corpus Acquisition

Data is gathered from various domains, which include news, blogs, literature, essays, and letters. Subdomains include current affairs, sports, showbiz, short stories, discussions and opinions. The sources of data collection are shown in Table 1.

Table 1. Sources of data collection.
Source              URL(s)
Daily Kawish        http://www.thekawish.com
Daily Awami Awaz    http://www.awamiawaz.com
Daily Ibrat         http://dailyibrat.com
Blogs               http://shikarpuri.wordpress.com
Literary Writings   http://voiceofsindh.net
                    http://sindhsalamat.com

4.2. Preprocessing and Normalization

Almost all of the data gathered was already in Unicode format; nevertheless, all collected text is converted into standard UTF-16 encoding. Letters represented by multiple Unicode code points, and equivalent representations in composed and decomposed form [14], are reduced to the same underlying form. Letters with aspirated versions like گھ, which are combinations of two Unicode characters (for instance گ and ھ in the case of گھ), are treated as single letters during text processing.
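To illustrate the normalization step of section 4.2, the following is a minimal Python sketch. It is not the author's C# routine, which the paper does not show. It assumes that the equivalence of composed and decomposed forms is handled by Unicode NFC composition, and it uses a small hypothetical variant table (the two mappings are illustrative assumptions, not taken from the paper) to fold look-alike code points into one form. The names VARIANT_MAP, normalize_text and to_utf16 are likewise hypothetical.

```python
import unicodedata

# Hypothetical variant table (an assumption, not taken from the paper):
# look-alike code points folded into the form assumed standard for Sindhi.
VARIANT_MAP = str.maketrans({
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER YEH
    "\u0643": "\u06AA",  # ARABIC LETTER KAF -> ARABIC LETTER SWASH KAF
})


def normalize_text(text: str) -> str:
    """Compose canonically equivalent sequences (NFC), then fold variants."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(VARIANT_MAP)


def to_utf16(text: str) -> bytes:
    """Encode normalized text as UTF-16, little-endian, without a byte order mark."""
    return normalize_text(text).encode("utf-16-le")
```

With NFC, the decomposed sequence alef (U+0627) followed by maddah above (U+0653) and the precomposed letter آ (U+0622) end up as the same single code point, which matches the paper's later treatment of آ as one letter.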
4.3. Tokenization

For tokenization, white space, punctuation marks, special symbols (like $, %, # etc.) and digits are used as word boundaries. Using white space as a word boundary caused embedded-space word breaking (for example, the single word صاحب قدرت is divided into the two words صاحب and قدرت); this is handled with the same technique used for Urdu [15]. Another problem in Sindhi word tokenization occurs when the two special words ۾ (in) and ۽ (and) occur without a space, as in ۾ملائڻ (mẽ: mila:iɳa), which was tokenized as a single word. Similarly, in ڪتاب۽قلم, kita:ba ain qalama (book and pen), three words written without spaces were tokenized as a single word. The same problem was observed with all words that have a non-connective ending, like کيرپي, kʰi:ra pi:a (drink milk), or non-connective starting letters, like سنڌاندر, sindʰa ander (in Sindh). A semiautomatic (software-based plus manual) approach was used to overcome this problem.

5. Results and observations

A corpus of 4.1 million words in total was analyzed quantitatively. This preliminary analysis includes letter frequency analysis, letter bigram analysis, letter trigram analysis, word frequency analysis, and word bigram analysis. These quantitative results are discussed in the following sections.

5.1. Letter frequencies

A total of 13,968,112 characters in the corpus were analyzed when calculating letter frequencies. Along with the 52 letters of the Sindhi alphabet, آ was also counted as a single letter, because it is a single key on the Sindhi keyboard and has a single Unicode representation. The most frequent letter was the vowel ي, while the least frequent letter was the consonant ڱ. Table 2 shows the 20 most frequent letters in the Sindhi corpus with their percentages.

While analyzing frequencies, it was observed that the frequency distribution of individual letters in a single file of 50,000 or more words was identical to the letter frequency distribution of the whole corpus. This similarity can be seen in the graphs of Figures 4 and 5.

Figure 4. Letter frequency distribution in Sindhi corpus.

Letter bigram and trigram frequencies were also analyzed. Almost 50% of the 20 most frequent bigrams are valid two-letter words, like ان, جي, ھن and کي. The same holds for trigrams, where this ratio is more than 60%. The 20 most frequent bigrams and trigrams, with their percentages, are shown in Tables 3 and 4 respectively.

Table 2. Top 20 most frequent letters.
S.No.  Letter  Percent
1      ي       13.77%
2      ا       11.42%
3      ن       8.99%
4      و       7.84%
5      ه       6.27%
6      ر       6.15%
7      م       3.73%
8      ج       3.64%
9      ل       3.30%
10     ت       3.26%
11     ڪ       3.25%
12     س       3.23%
13     د       2.50%
14     ب       2.00%
15     پ       1.80%
16     آ       1.18%
17     ڻ       1.16%
18     ک       1.16%
19     ع       0.99%
20     ٽ       0.94%

Table 3. Top 20 bigrams in Sindhi corpus.
S.No.  Bigram  Percent
1      ان      3.16%
2      جي      2.55%
3      ري      1.95%
4      ھن      1.80%
5      يو      1.79%
6      ار      1.79%
7      ھي      1.79%
8      ين      1.69%
9      نه      1.28%
10     ند      1.27%
11     ون      1.18%
12     يا      1.10%
13     آه      1.10%
14     جو      1.07%
15     وا      1.02%
16     لا      1.01%
17     کي      0.99%
18     ور      0.97%
19     لا      0.95%
20     تي      0.93%

Table 4. Top 20 letter trigrams in Sindhi corpus.
S.No.  Trigram  Percent
1      آھي      1.40%
2      نھن      1.34%
3      اري      0.81%
4      يون      0.74%
5      ڪري      0.71%
6      کان      0.61%
7      يند      0.60%
8      ندي      0.53%
9      وار      0.47%
10     مان      0.46%
11     ھنج      0.45%
12     آھن      0.44%
13     ڪيو      0.44%
14     انه      0.42%
15     ندو      0.41%
16     اٹي      0.40%
17     نجي      0.36%
18     ڏھن      0.35%
19     پنه      0.35%
20     دار      0.35%
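The letter statistics in Tables 2 to 4 rest on the delimiter rules of section 4.3 and on counting aspirated digraphs as single letters (section 4.2). The following is a minimal Python sketch of such a pipeline, not the author's C# implementation. Several points are assumptions: the two signs ۾ and ۽ are always split off as separate words (a simple stand-in for the semiautomatic approach, covering only that one case); only the two digraphs named in the paper are kept together; n-grams are counted within words; and diacritics are kept and counted like letters. The names tokenize, split_letters and ngram_percentages are hypothetical.

```python
import unicodedata
from collections import Counter

# Assumption: these two Sindhi signs are always separate words,
# even when typed without surrounding spaces.
STANDALONE_WORDS = {"\u06FD", "\u06FE"}  # "and", "in"

# Assumption: only the two digraphs named in the paper (jeem+heh, gaf+heh)
# are counted as single letters.
DIGRAPHS = {"\u062C\u06BE", "\u06AF\u06BE"}


def tokenize(text):
    """Split text into words; anything that is not a letter or mark is a boundary."""
    tokens, current = [], []
    for ch in text:
        if ch not in STANDALONE_WORDS and unicodedata.category(ch)[0] in "LM":
            current.append(ch)
            continue
        if current:  # a delimiter closes the word being built
            tokens.append("".join(current))
            current = []
        if ch in STANDALONE_WORDS:  # the sign itself is kept as a word
            tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def split_letters(word):
    """Split a word into letters, keeping the listed digraphs together."""
    letters, i = [], 0
    while i < len(word):
        step = 2 if word[i:i + 2] in DIGRAPHS else 1
        letters.append(word[i:i + step])
        i += step
    return letters


def ngram_percentages(words, n):
    """Return {letter n-gram: percent of all n-grams}, counted within words."""
    counts = Counter()
    for word in words:
        if word in STANDALONE_WORDS:  # assumption: signs are not alphabet letters
            continue
        letters = split_letters(word)
        for i in range(len(letters) - n + 1):
            counts["".join(letters[i:i + n])] += 1
    total = sum(counts.values())
    return {gram: 100.0 * count / total for gram, count in counts.items()}
```

For a normalized text string, ngram_percentages(tokenize(text), 2) gives bigram percentages in the spirit of Table 3. Embedded-space words and words joined after non-connecting letters are not handled by this sketch and would still need the dictionary-assisted or manual step that the paper describes.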
Figure 5. Letter frequency distribution in a single file.

5.2. Word frequencies

A total of 4.1 million words were analyzed, and 70,576 distinct word forms were found. The most frequent words include case markers (like ۾, تي and کان) and auxiliary/incomplete verbs (like آھي and آھن). The postposition جي has the highest frequency of occurrence, as shown in Table 5.

Table 5. Top 20 most frequent words in Sindhi corpus.
S.No.  Word  Percent
1      جي    3.71%
2      ۾     2.44%
3      ۽     2.17%
4      ته    1.78%
5      آھي   1.61%
6      کي    1.61%
7      جو    1.50%
8      تي    1.05%
9      به    0.82%
10     نه    0.71%
11     ڪري   0.69%
12     سان   0.69%
13     ان    0.67%
14     کان   0.63%
15     ٿي    0.57%
16     آھن   0.55%
17     لاءِ   0.51%
18     ھن    0.50%
19     ھو    0.50%
20     ڪيو   0.46%

Word bigram occurrences were also calculated and are shown in Table 6. The proper name bigram بينظير ڀٽو is among the top 10 bigrams. This is because the current affairs domain contains essays and newspaper columns about the life of former prime minister Benazir Bhutto.

Table 6. Top 10 most frequent word bigrams.
S.No.  Word bigram   Percentage
1      چيو ته        7.52
2      آھي ته        6.75
3      ھن جي         2.66
4      بينظير ڀٽو    1.93
5      سنڌ جي        1.84
6      ان جي         1.72
7      جڏھن ته       1.60
8      ھن چيو        1.60
9      ڪيو ويو       1.44
10     ويو آھي       1.21

6. Future work

The corpus is being collected continuously and the results are being updated. Currently the corpus is simply a UTF-16 encoded text collection. Studies are in progress on proper annotation, POS tagging, corpus-based lexicon development and n-gram based text categorization. A Sindhi tokenization algorithm needs to be worked out for the problems discussed in section 4.3. Because Sindhi has no standard sentence-terminating punctuation mark, the full stop, comma and other punctuation marks are all used as sentence terminators in Sindhi writing. Sentence segmentation is therefore another key area to be worked out. More specific Sindhi computational linguistic studies are needed for further development and maturity of the corpus. For example, there is currently no comprehensive POS tagging algorithm available for Sindhi. The POS tagging algorithm presently available for Sindhi [16] needs to be analyzed and extended further. A Sindhi tagset needs to be designed before POS tagging of the corpus. Qualitative and quantitative improvements, proper annotation and comprehensive statistical analysis are areas that require extensive work.

7. Conclusion

In the absence of language processing resources for the Sindhi language, the Sindhi corpus construction project is a valuable initiative. Regardless of its size and preliminary results, the corpus in its current state will provide a basis for further natural language processing studies of Sindhi. Letter frequencies, including bigram and trigram frequencies, provide a basis for intelligent text processing and compact keyboard design for cell phones and other smart devices. Word-level unigram and bigram frequencies provide a basis for spelling correction and automatic sentence completion applications. Further development of the corpus will be useful for advanced language processing tasks like morphological analysis, syntax analysis, semantic analysis, information retrieval and extraction, and machine translation.

8. References

[1] Sindhi Language Authority. Official Website. http://www.sindhila.org. (Accessed 2010).