A Transliteration Keyboard Configuration with Tamil Unicode
Characters
M.A.C.M. Raafi* and H. M. Nasir#
*Department of Mathematical Sciences, South Eastern University of Sri Lanka
E-mail: raafim@seu.ac.lk
#Department of Mathematics, University of Peradeniya, Sri Lanka
E-mail: nasirh@pdn.ac.lk
5th IEEE International Conference on Advanced Computing & Communication Technologies [ICACCT-2011], ISBN 81-87885-03-3

Abstract
Keyboard configurations for typing are available for many languages and for data processing tasks. The most common keyboard in use today is the QWERTY keyboard, whose layout is designed for typing the English alphabet and numerals. Typing in other languages requires configurations that remap the QWERTY keys to the characters of those languages. Such configurations often run into difficulties because these languages have far larger character sets than English. To address this issue, a transliteration keyboard configuration can be considered. Transliteration is a method by which one can read a text of one language in the writing system of another language. In this paper, we discuss the development of a phonetic transliteration keyboard configuration for the Tamil language using Unicode encodings.

Introduction
Input devices are used to enter data and commands into a computer system for data processing. One of the most commonly used input devices is the keyboard, which consists of letters, numerals and other special characters.
There are different types of keyboard systems available in the computing environment. The standard keyboard is known as the QWERTY keyboard. It is designed for typing English letters and related symbols, and it is inconvenient for other languages, such as Asian languages.
Entering Asian language characters with the QWERTY keyboard is impossible without a suitable configuration that maps the English keys to those characters. Even with such a mapping, typing the letters of the language is difficult, because one has to memorize or be familiar with the keyboard mapping in the configuration.
Despite these limitations, transliteration is worth considering for typing text, to the benefit of end users. Transliteration is the process by which one reads and pronounces the words and sentences of one language using the letters and special symbols of another language. It is helpful in situations where one does not know the script of a language but knows how to speak and understand the language [1].
For example, Tamil, one of the Asian languages, can be introduced to English-literate Tamils and non-Tamils with a transliteration scheme. There are 247 characters in Tamil: 12 vowels, 18 consonants, 216 compound letters and one Aayitha character. In Tamil word-processors the large number of compound letters is obtained by keying the corresponding consonant and vowel in sequence. For example, the keystroke for the consonant k (க்) followed by the vowel i produces the compound character ki (கி).
Keyboard layouts of this kind have been called "phonetic", and Tamil transliteration is a phonetic keyboard system. Thus, in the transliteration program the Tamil word for father (அப்பா) is written as appA (or appaa), and mother (அம்மா) as ammA (or ammaa).
Our transliteration system normally offers the following advantages:
1. User-friendly keystrokes; users type in a more familiar way.
2. No need to memorize the whole key mapping of the keyboard.
3. A new user coming from another language can type easily.
4. There is no need to change the font each time to type special characters and symbols such as / , : < > | ) ( * & ^ % $ # @ ! ~ + ? etc.
5. By introducing Unicode:
   a. Text can be displayed everywhere.
   b. The language does not matter.
6. The font does not matter.
7. Wrongly formed words are corrected.

Encoding Systems
An encoding scheme is a necessary part of the configuration of a keyboard layout for the transliteration program. An encoding is the system by which the characters in a set are represented in binary form in a file. In computers and in data transmission between them, that is, in digital data processing and transfer, data is as a rule represented internally as octets. Octets are often called bytes, but in principle octet is the more precise term: an octet always consists of eight bits [6].

Tamil Character encodings
In Tamil, the form taken by the same vowel sound differs from one letter to another. This is the reason for the inclusion of a large number of letters in the Tamil keyboards
designed so far. Tamil is a language in which, in addition to the basic vowels (uyir) and consonants (mei), the compound (uyirmei) characters all have unique glyph forms. Some popular Tamil font encoding schemes are TSCII, TAM, TAB, ISCII and Unicode.

TSCII
The first and most popular one is the Tamil Standard Code for Information Interchange (TSCII), a glyph-based, 8-bit bilingual encoding. It uses a unique set of Tamil glyphs alongside the usual lower ASCII set: Roman letters with standard punctuation marks occupy the first 128 slots (0-127) and the Tamil glyphs occupy the upper segment, slots 128-255.

TAM and TAB
TAM is a monolingual encoding scheme (TAmil Monolingual), whereas TAB is a bilingual encoding scheme (TAmil Bilingual). They were proposed by the Tamil Nadu Government. TAM is of limited use in an OS environment.

ISCII
The Indian Standard Code for Information Interchange (ISCII) is an 8-bit, single-byte umbrella standard, defined in such a way that all Indian languages can be treated using one single character encoding scheme. ISCII is a bilingual character encoding scheme (not glyph-based). Roman characters and punctuation marks as defined in standard lower ASCII take up the first half of the character set (the first 128 slots). Characters for Indic languages are allocated to the upper slots (128-255) [5].

Unicode
Unicode is an international standard for multilingual word-processing. It assigns every character a unique number, called a code point: there is exactly one number per character, and exactly one character per number. Unicode was originally designed as a two-byte scheme that represents each character as a number from 0 to 65535. This 16-bit range, with over 65000 slots, remains the Basic Multilingual Plane and holds the characters of nearly all of the world's common writing systems, so that text in many languages can be handled simultaneously; the full code space now extends to U+10FFFF. Along with other Asian languages, Tamil has been assigned specific slots from U+0B80 to U+0BFF (in decimal, 2944 to 3071; 128 locations) in this multilingual standard [6].
Unicode encodes only the basic vowel and consonant characters and a set of modifiers to represent situations where a vowel/consonant pair appears as a combination (uyirmei) in Tamil. A Unicode file stores textual information solely at this "character" level. It does not care about the actual form of the glyphs. Rendering the glyphs that correspond to the stored characters is left to software.
Once we get beyond the ASCII world, there are many different native encodings for different languages and operating systems. Converting between all of these is easiest with a central "common point", and that is Unicode.
Technically, Unicode is used wherever the characters used are all drawn from the Unicode set; in other words, just about everywhere. Systems that use ASCII are also using Unicode, since Unicode contains the ASCII set and gives its characters the same code points they had in ASCII [6].

Unicode Code Charts
The code charts present the characters of the Unicode Standard. Characters are organized into related groups called blocks. In the Unicode Standard, character blocks generally contain characters from a single script. In many cases, a script is fully represented in its character block. There are, however, important exceptions, most notably in the area of punctuation characters.

Literature Review
Transliteration of Asian language input is a subject of recent research. During the past several years, different methods have been introduced to prepare Indian language documents by entering the text through specific transliteration schemes. Data entry through transliteration is quite close to a phonetic mapping of Indian language characters to the letters of the Roman alphabet.
The earliest and most widely used transliteration scheme is the Library of Congress Transliteration Scheme. It uses Roman letters with diacritics (horizontal bars or circles added above or below the letters) to represent the letters of Indian languages. Diacritical marks added to a letter or symbol show its pronunciation, accent, etc., typically indicating that a phonetic value is different from the unmarked state. The scheme is very general in scope and hence can be used for almost all world languages. Established Tamil research centers all around the world are aware of this scheme and most of them implement it as such, without modifications [5].
ADAMI, one of the early Tamil word-processors for MS-DOS PCs, was produced by Dr. K. Srinivasan of Canada in the early eighties and released in 1984 to recast such transliterated text into Tamil. The Tamil text is typed using a plain ASCII transliteration scheme. Upon compiling and executing the linked macro, the romanized text page is recast on screen in equivalent Tamil. One needs to return to the romanized text mode to make any corrections. A more recent version of this software, called THIRU, uses a split screen in which the Roman text being typed in the bottom half is continuously recast in Tamil in the upper half. ADHAWIN is another recent implementation of the same software for Windows-based PCs [5].
The Murasu and Anjal word-processing packages are widely used in Malaysian, Singaporean and Tamil newspapers and magazines. These packages belong to the group of "romanized input and interpreted output" tools. The 'inaimathi' and related font faces used in these packages are of the 8-bit bilingual type. The first 128 slots (0-127) are filled by Roman characters as in basic ASCII, and the Tamil characters occupy the upper slots (128-255). By invoking the keyboard editor it is possible to access either of these two blocks. In the Tamil typing mode, the Roman keystrokes and their relative sequence are continuously interpreted to present the equivalent Tamil characters on screen. Thus we can type 'kathai' to get the equivalent Tamil word 'கதை' [8].

Keyboard Configuration Program
There are a number of computer programs used to develop
transliteration keyboard configuration software, such as Keyman, C, C++ and Java. In our work we use Keyman as the keyboard configuration program. Keyman is a keyboard management utility that makes it practical to input many different languages. It fully supports Unicode and allows us to create our own keyboard layouts. It interprets and translates input from the computer keyboard according to a set of rules called a keyboard. These rules are stored in a keyboard file. It includes features such as an on-screen keyboard and phonetic and visual-order input methods.
Keyman includes full support for Unicode. It supports input and output of any of the thousands of characters defined in Unicode. There are two applications included in Keyman Developer: TIKE and KMComp. TIKE, the Tavultesoft Integrated Keyboard Editor, is a complete environment for designing, developing, testing and packaging our keyboards for distribution.
KMComp, the command-line compiler, is a simple tool that lets us compile keyboards, packages and installers from the command line. This is useful if we want to use batch builds or Make files.

Keyboard File
The keyboard file is the most important component of a keyboard configuration. It contains the set of rules that represent a particular keyboard. To create a new keyboard, we need to create a keyboard file. There are two ways to create a keyboard file:

The Keyboard Wizard
It gives us a simple interface to quickly create a keyboard using a visual representation of a computer keyboard. We can drag and drop characters from a character map, and create ANSI and Unicode keyboard layouts. We cannot access most of the program's more powerful features from the Keyboard Wizard, but it is useful for getting started on a design. We can convert keyboards created in the Keyboard Wizard to standard program source files in TIKE.

The Keyboard Language
It provides the flexibility that is needed to write keyboards with complex character management, including constraints, dead keys, post-entry parsing, virtual key management (accessing any key on the keyboard), and other features.
A keyboard file is divided into two sections: the header and the rules. The header section defines the name of the keyboard, its bitmap, and other general settings. The rules define how the keyboard responds to keystrokes from the user, and are divided into groups.
The keyboard header is the first part of a keyboard; it consists of statements that help Keyman identify the keyboard and set default options for it. Each statement in the header must be on a separate line and is usually written in capital letters. The body of the keyboard is the other important part: it determines the behavior of the keyboard. The body consists of groups, which in turn contain one or more rules that define the responses of the keyboard to certain keystrokes.

Methodology
The keyboard program interprets and translates input from the computer keyboard according to a set of rules called a keyboard. Transliteration of Tamil has to meet the need for Tamil to be usable, like English, with a 26-letter keyboard. It is the plan of our work to develop simple methods to use Tamil on the computer and to introduce Tamil through transliteration.
Tamil has over 230 characters: 13 vowel letters (the uyir letters together with the aayitham), 18 consonants (mei) and the compound letters (uyirmei) derived from these. Tamil is one of the Indian languages in which many of the compound (uyirmei) letters have a complex geometric form (glyph) of their own. There are 12 vowel characters and one aayitham letter in Tamil.
There are 18 mei letters (consonants) and 216 uyirmei letters. The mei characters are formed by adding the pulli sign ( ◌் , U+0BCD, named TAMIL SIGN VIRAMA in Unicode) to the consonant letter. The uyirmei letters are formed by combining the 12 uyir letters with the 18 mei characters (12 x 18 = 216).
There are also 13 digits in Tamil (Unicode U+0BE6 to U+0BF2). These digits are not much used now, but they were used in earlier times. They are as follows:

Value:  0  1  2  3  4  5  6  7  8  9  10  100  1000
Digit:  ௦  ௧  ௨  ௩  ௪  ௫  ௬  ௭  ௮  ௯  ௰   ௱    ௲

Choosing the mapping for characters
We define the output characters to be produced by the keyboard and select appropriate keystrokes from the QWERTY keyboard to map to them. Some keystrokes are used to represent output characters while others are not. Keystrokes that do not represent any output are called dead keys; dead keys produce null output.

Analyzing the Keystrokes and Assigning Keystrokes
We need to analyze how to create all the Tamil characters using this limited number of codes. Some characters have their own Unicode numbers and can be assigned directly, while others do not; these have to be produced by combining two or more other Unicode characters. A key, or a sequence of keystrokes, is assigned to a particular character or combination of characters to represent each Tamil character, so a character may be represented by one or more keystrokes.
The 247 letters in the Tamil alphabet are the product of 31 basic Tamil letters. 18 English letters have a similar sound connection with 18 Tamil letters. It is only the 13 remaining Tamil letters that need a 'sound connection' with English. We can make this 'sound connection', that is, devise the new connections, by allocating letters that are already in use, either in combination or singly, as follows:

+ "a" > U+0B85 அ
U+0B85 + "a" > U+0B86 ஆ
+ "A" > U+0B86 ஆ
+ "i" > U+0B87 இ
U+0B87 + "i" > U+0B88 ஈ
+ "I" > U+0B88 ஈ

Conclusion
The use of the Tamil language on computers enters a new era with the emergence of the Unicode standard and its support on more modern platforms and applications. These days, most Tamil websites support Unicode, and typography-related techniques are also switching to the new standard.
This paper is useful to people who are interested in developing their own transliteration software to type words and sentences for word processing and for World Wide Web applications, easily, using a QWERTY keyboard.
This study also provides solutions for some existing problems with Tamil typography. Many non-Unicode Tamil fonts with stylish glyphs are available at present, and using such fonts can give documents a great appearance. However, because the keyboard mappings of these fonts are unfamiliar, they are not widely used for typing Tamil. With the support of a keyboard configuration environment, it is possible to give these stylish fonts a familiar keyboard configuration mapping, and then use them with our keyboard configuration.
It is also possible to extend this keyboard configuration to other platforms such as Linux, Mac OS and Solaris, as these already support Unicode. The only thing to be done is to set up a keyboard layout in each operating system's native format.

Appendix
Some typing examples:
naan or nAn        நான்
avan               அவன்
manithan           மனிதன்
paadasaalai        பாடசாலை
paLkaLaikazakam    பல்கலைகழகம்

References
[1] Acharya, Multilingual Computing for Literacy and Education, SDL, IIT Madras, India, http://acharya.iitm.ac.in/acharya.html, 2005.
[2] Addison-Wesley Pub Co, The Unicode Standard 3.0 (www.unicode.org), 1998.
[3] Elengo, Tamil 99 Keyboard Layout, www.cadgraf.com, 2000.
[4] Ilakkuvanar, S., Tholkappiyam in English.
[5] Kalyanasundaram, K., An Overview Of Different Tools For Word-Processing Of Tamil And A Proposal Towards Standardisation, Institute of Physical Chemistry, Swiss Federal Inst. of Technology, 1997.
[6] Mark Pilgrim, "Python and Unicode", http://diveintophython.org/toc/index.html.
[7] Muguntharaj, Tamil-TSCIIANJAL, 1998.
[8] Muthu Nedumaran, Murasu Anjal, 2000.
[9] Ramalingam Shanmugalingam, jAzhan, Transliteration of Tamil to English for the Information Technology, 2002.
[10] Samaranayake, V. K., Nandasara, S. T., Dissanayake, J. B., Weerasinghe, A. R., Wijayawardhana, H., An Introduction to UNICODE for Sinhala Characters, University of Colombo School of Computing, 2003.
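
The typing examples in the Appendix can be reproduced with a small program. What follows is a minimal sketch in Python, not the Keyman keyboard described in this paper: instead of keystroke-by-keystroke context rules it scans a whole romanized word with longest-match lookup. The tables cover only an illustrative subset of letters, the function names are hypothetical, and the treatment of "n" (ந at the start of a word, ன elsewhere) is an assumption chosen so that the Appendix examples come out as printed.

```python
PULLI = "\u0bcd"  # TAMIL SIGN VIRAMA: marks a consonant that carries no vowel

# Roman key -> (independent vowel, dependent vowel sign). "a" has no sign
# because the short a is inherent in a bare consonant letter.
VOWELS = {
    "a": ("\u0b85", ""),
    "aa": ("\u0b86", "\u0bbe"), "A": ("\u0b86", "\u0bbe"),
    "i": ("\u0b87", "\u0bbf"),
    "ii": ("\u0b88", "\u0bc0"), "I": ("\u0b88", "\u0bc0"),
    "u": ("\u0b89", "\u0bc1"),
    "uu": ("\u0b8a", "\u0bc2"), "U": ("\u0b8a", "\u0bc2"),
    "e": ("\u0b8e", "\u0bc6"),
    "ee": ("\u0b8f", "\u0bc7"), "E": ("\u0b8f", "\u0bc7"),
    "ai": ("\u0b90", "\u0bc8"),
    "o": ("\u0b92", "\u0bca"),
    "oo": ("\u0b93", "\u0bcb"), "O": ("\u0b93", "\u0bcb"),
    "au": ("\u0b94", "\u0bcc"),
}

# Illustrative subset of consonant keys (assumed mapping, not the paper's table).
CONSONANTS = {
    "k": "\u0b95", "s": "\u0b9a", "d": "\u0b9f", "th": "\u0ba4",
    "n": "\u0ba9", "p": "\u0baa", "m": "\u0bae", "y": "\u0baf",
    "r": "\u0bb0", "l": "\u0bb2", "v": "\u0bb5",
}
INITIAL_N = "\u0ba8"  # assumption: word-initial "n" is written with this letter


def match(table, word, pos):
    """Return the longest key of `table` starting at word[pos], or None."""
    for length in (2, 1):
        key = word[pos:pos + length]
        if key in table:
            return key
    return None


def transliterate_word(word):
    """Convert one romanized word to Tamil Unicode text."""
    out = []
    pos = 0
    while pos < len(word):
        cons = match(CONSONANTS, word, pos)
        if cons is not None:
            letter = INITIAL_N if (cons == "n" and pos == 0) else CONSONANTS[cons]
            pos += len(cons)
            vowel = match(VOWELS, word, pos)
            if vowel is None:
                out.append(letter + PULLI)             # pure consonant (mei)
            else:
                out.append(letter + VOWELS[vowel][1])  # compound (uyirmei)
                pos += len(vowel)
            continue
        vowel = match(VOWELS, word, pos)
        if vowel is not None:
            out.append(VOWELS[vowel][0])               # independent vowel (uyir)
            pos += len(vowel)
        else:
            out.append(word[pos])                      # pass unknown keys through
            pos += 1
    return "".join(out)


for roman in ("naan", "nAn", "avan", "manithan", "paadasaalai", "appA", "ammaa"):
    print(roman, transliterate_word(roman))
```

A whole-word pass keeps the sketch short. A keyboard program such as Keyman instead rewrites the text already on screen after every keystroke, which is what the context part of rules like U+0B85 + "a" > U+0B86 expresses.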
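
The character counts and the Unicode block range quoted in this paper can be checked with a few lines of Python. This is an illustrative sketch rather than part of the keyboard configuration; the variable names are hypothetical and the only library used is Python's standard unicodedata module. It also shows how a compound letter that has no code point of its own is stored as a sequence of two code points.

```python
import unicodedata

# Tamil block as stated in the text: U+0B80 to U+0BFF.
TAMIL_BLOCK = range(0x0B80, 0x0BFF + 1)

PULLI = "\u0bcd"  # sign that turns a consonant letter into a pure consonant

# The 18 consonant letters.
CONSONANTS = [chr(cp) for cp in (
    0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9,
)]

# Dependent signs for the 12 vowels; the empty string stands for the
# inherent short a, which needs no sign.
VOWEL_SIGNS = ["", "\u0bbe", "\u0bbf", "\u0bc0", "\u0bc1", "\u0bc2",
               "\u0bc6", "\u0bc7", "\u0bc8", "\u0bca", "\u0bcb", "\u0bcc"]

mei = [c + PULLI for c in CONSONANTS]
uyirmei = [c + sign for c in CONSONANTS for sign in VOWEL_SIGNS]
total = len(VOWEL_SIGNS) + 1 + len(mei) + len(uyirmei)  # + 1 for the aayitham

print(len(TAMIL_BLOCK), TAMIL_BLOCK[0], TAMIL_BLOCK[-1])  # 128 2944 3071
print(len(uyirmei), total)                                # 216 247
print(unicodedata.name(PULLI))                            # TAMIL SIGN VIRAMA

# The compound "ki" is the consonant letter followed by the vowel sign i.
ki = "\u0b95" + "\u0bbf"
print([f"U+{ord(ch):04X}" for ch in ki])                  # ['U+0B95', 'U+0BBF']
print(len(ki.encode("utf-8")))                            # 6 octets in UTF-8
```

Each Tamil code point takes three octets in UTF-8, so the two-code-point compound occupies six octets on disk even though it is displayed as a single glyph.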