Indian Journal of Science and Technology, Vol 10(16), DOI: 10.17485/ijst/2017/v10i16/111895, April 2017
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645
Approaches for Improving Hindi to English Machine
Translation System
Rajesh Kumar Chakrawarti¹ and Pratosh Bansal²
¹Faculty of Computer Engineering, Institute of Engineering and Technology, Devi Ahilya Vishwavidyalaya, Indore – 452017, Madhya Pradesh, India; rajesh_kr_chakra@yahoo.com
²Department of Information Technology, Institute of Engineering and Technology, Devi Ahilya Vishwavidyalaya, Indore – 452017, Madhya Pradesh, India; pratosh@hotmail.com
Abstract
Objectives: To present approaches for effective Hindi-to-English Machine Translation (MT) that can help in the inexpensive and easy implementation of MT systems. Methods/Statistical Analysis: The structures of the Hindi and English languages have been studied thoroughly. The possible steps in processing natural languages have also been studied. The methods, rules, approaches, tools, resources, etc. related to MT are discussed in detail. Findings: MT is the idea of automatically translating a language. India is a country full of diversity in culture and languages. More than 20 regional languages are spoken, along with several dialects. Hindi is widely spoken in all the states of the country. A large body of literature, poetry and other valuable texts is available in Hindi, which creates opportunities for translation into English. At the same time, the new generation is learning English rapidly and is keen to learn it in a simple, lucid manner. Several efforts have been made in this direction. A large number of approaches and solutions exist for MT, yet there is still considerable scope for improvement. The paper addresses the challenges of MT and the efforts made to solve them. This motivates researchers to implement new Hindi-to-English Machine Translation systems. Application/Improvements: Efficient, inexpensive and easy translation of available Hindi literature, poetry and other valuable texts into English. Children can easily learn the culture through poetry and literature; hence Machine Translation of these texts will have a significant impact.
Keywords: English Language, Hindi Language, Machine Translation, Translation-Rules and Translation Approaches
1. Introduction
India is one of the finest examples of a multi-lingual and multi-social country. People from different regions speak different languages; it has been observed that the spoken language may change every few tens of kilometers. In India, Hindi is regarded as the national language and is spoken by most of the people. English is an internationally accepted language used for communication throughout the world. The Constitution of India accepts only these two languages, Hindi and English, as official languages. Official communication between the central and state governments is also carried out in these two languages. The state governments may use their own regional languages to carry out their work. Most newspapers are also published in various regional languages. There are 22 regional languages, namely "Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi (also an official language), Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu", spoken in various regions. Hence there is a great demand for better Machine Translation systems to establish better communication and exchange of information with other countries, states and central governments [1,2].
Machine Translation is a key research area in the field of Natural Language Processing (NLP). It is a computerized, automated process responsible for translating text or documents from one language (called the source language) to another language (called the target language).
*Author for correspondence
Work on machine translation has been going on for several decades, but efficient machine translation is still a challenging task. In India, the market for Machine Translation is the largest [3]. Figure 1 represents a block diagram of a simple Machine Translation system.
Figure 1. A simple Machine Translation (MT) System.
Machine Translation poses challenges at all levels of Natural Language Processing: Phonetics and Phonology, Morphology, Syntax, Semantics, Pragmatics and Discourse. Among these, ambiguity (Semantics) is the biggest one. In addition, different languages may also exhibit the language diversity problem (called translation divergence). Machine Translation systems deal with the ambiguity and linguistic diversity problems under the umbrella of Natural Language Processing [4].
In India, we feel that the most important Machine Translations are Hindi→English and Hindi→Regional Language.
1.1 Hindi-to-English Translation
Hindi is our national language. People speak different regional languages, but Hindi is the main official language for standard communication. Outside India, Hindi is also known in other countries such as Pakistan, Bangladesh and Nepal.
The default structure of a Hindi sentence is Subject-Object-Verb (SOV), e.g. "पृथ्वी सोना चाहता है।" where S = पृथ्वी, O = सोना and V = चाहना.
Indian languages (primarily Hindi) have the following characteristics:
• Highly inflectional language,
• Rich morphology, and
• Relatively free word order.
Hindi-to-English Machine Translation is more complex because of these characteristics. Anything written in Hindi may carry different senses depending upon the context. The spoken sequence of any statement in an Indian language may differ from person to person [5,6]. Figure 2 represents a block diagram of a Hindi-to-English Machine Translation system.
Figure 2. Hindi → English Machine Translation.
1.2 English-to-Hindi Translation
English is a major internationally accepted language which is spoken and used in all kinds of communication among almost all countries throughout the world. We can also say that English is almost the only language that is popular among people all over the world.
The default structure of an English sentence is Subject-Verb-Object (SVO), e.g. "Prithvi wants gold" where S = Prithvi, V = want and O = gold.
English has the following main characteristics:
• Highly positional language
• Rudimentary (poor) morphology.
English-to-Hindi Machine Translation results in verb movements over large distances. Hindi also exhibits gender agreement, which is not possible in English. By enriching the source-side English resources with linguistic factors, the morphological issues can be resolved [5,6]. Figure 3 shows a block diagram of an English-to-Hindi Machine Translation system.
Figure 3. English → Hindi Machine Translation.
Hindi↔English Machine Translation can be improved by incorporating a technique called Word Sense Disambiguation. Word Sense Disambiguation (WSD) is defined as the task of identifying the correct sense of a word depending upon the context. Word Sense Disambiguation algorithms can be broadly classified as knowledge/dictionary-based, supervised, semi-supervised and unsupervised approaches. However, there is no restriction on using either a single approach or a combination of them. Combinations have also produced good results in earlier work [7,8].
Over the last three decades, a lot of research and many research projects have been carried out in India in the area of Machine Translation. Although they have produced some good Machine Translation systems, they all have their own advantages, disadvantages and limitations, and "It is not possible to have fully automatic, qualitative, and general-purpose Machine Translation" [5]. Hence, there is still scope for researchers to do more research in this area. A lot of research and many research projects are also ongoing to overcome these disadvantages and limitations. This scope is motivating the teaching of Machine Translation from the Indian perspective to students and researchers [9].
In the field of Machine Translation, many surveys have been carried out from the Indian perspective. The first kind of survey relates to resources, services and tools for Machine Translation systems throughout India. This survey is a rigorous collection from the Indian perspective [10]. The second kind of survey covers Word Sense Disambiguation approaches, which can be used for improving a Machine Translation system [11]. It contains the type of approach (such as knowledge-based, supervised, minimally-supervised, unsupervised, hybrid, etc.), corpus or WordNet details, features, advantages, disadvantages and limitations of the approach, new techniques under these approaches, etc. The third kind of survey covers the different types of Machine Translation approaches used for developing the systems [12-15]. Surveys related to approaches include the name of the approach (such as direct, rule-based, corpus-based, hybrid, etc.) for developing the Machine Translation system, features, advantages, disadvantages and limitations of the approach, new techniques under these approaches, etc. The fourth kind of survey covers the different types of Machine Translation systems developed in India. Surveys related to these systems contain the name, year of development, people and/or organization, funding agency, place of development, domains/applications of the system, approaches/techniques and tools/resources used, features, etc. [14-17]. All of these types of surveys also give the web links for using these kinds of Machine Translation systems. The literature discussed in this paragraph is based on survey papers only, whereas the next paragraph is based on actual research, research projects and resources.
A Machine Translation system faces ambiguity and divergence issues at all levels of Natural Language Processing [4,18]. It is observed that a multilingual system is bound by resource constraints such as WordNet, which is costly and takes more time in processing. Anglabharti is a machine-aided translation system from English to Indian languages [19]. It uses a rule-based (pseudo-interlingua-based) method. The system produces good results; however, it sometimes produces more than one target sentence for a given source English sentence. The Computer Assisted Translation System Mantra, which translates texts from English to Hindi in the domain of Personnel Administration, was developed using a rule-based (transfer-based) method [20]. Research through this system opens new areas in which to contribute other facilities. The Anusaaraka system, which makes documents in one Indian language accessible in another Indian language, was developed using a direct (word-to-word) method [21]. This system also produces good results, but it has major implications if it enters into common use. The Universal Networking Language (UNL) (interlingua)-based Machine Translation system is used for translation from English to Indian languages; although it is a good system, language divergence issues between the source and target with respect to UNL have implications [22]. AnglaHindi is a participant project of the Anglabharti translation effort and is responsible for English-to-Hindi translation [23]. It was developed using a hybrid rule-based and example-based method. MaTra is a fully automatic system for English-Hindi Machine Translation (MT) of general-purpose texts [24]. It was developed using a rule-based (transfer-based) method. The statistical Machine Translation systems from Google, Microsoft, Worldlingo and IBM are Google Translate, Bing Translator, Worldlingo and IBM Server, respectively.
Machine Translation approaches are classified as direct translation, rule-based (transfer-based and interlingua-based) translation, corpus-based (statistical and example-based) translation and hybrid (a combination of one or more) translation [25]. These systems and approaches have their own features, advantages, disadvantages and limitations. The Statistical Machine Translation (SMT) model [3,14] and its types (Word-based, Phrase-based and Hierarchical Phrase-based models, among others) provide the basis for improving Machine Translation systems. They are also helpful in developing new systems.
A number of online applications are available and accessible for Hindi-to-English Machine Translation. Table 1 gives a detailed analysis of the effectiveness of those applications. For example, the Hindi statement "पृथ्वी सोना चाहता है।" has been converted into English using the online applications mentioned in the table. By analyzing the output, it can easily be observed that most of the applications failed to produce the desired output. Only "Google Translate" produces a reasonably good result, "Earth wants to sleep". However, it cannot identify the noun "पृथ्वी" as a name, which is why it produces "Earth" where it should write "Prithvi". The remaining applications produce improper results. Hence, it can easily be concluded that there is a need for an enhanced and appropriate version of a Hindi-to-English Machine Translator which can provide better and more appropriate results.
WordNet is an online lexical database designed for the English language. It includes four main Parts-of-Speech (PoS), (i) Noun, (ii) Verb, (iii) Adjective and (iv) Adverb, which are organized into sets of synonyms [26]. HindiWordNet is an online lexical database designed for the Hindi language on the basis of the English WordNet. Like the English WordNet, it also includes the four main parts-of-speech of Hindi, (i) Noun, (ii) Verb, (iii) Adjective and (iv) Adverb, which are organized into sets of synonyms. IndoWordNet is a linked structure of the wordnets of major Indian languages [27].
Word Sense Disambiguation algorithms and applications are categorized as knowledge/dictionary-based, supervised, semi-supervised, unsupervised and hybrid approaches [7]. They have their own features, advantages, disadvantages and limitations. A critical analysis provides the knowledge needed to choose the appropriate Word Sense Disambiguation approach for improving Machine Translation systems [28]. An experimental study of graph connectivity for unsupervised Word Sense Disambiguation helps in improving Machine Translation [29].
Concept map construction might help in improving Machine Translation because, with its help, ideas and knowledge that are related to each other in some respect can be combined. This creates a semantic binding between two ideas or pieces of knowledge. With a concept map, we can interlink the concepts which belong to the same domain [30,31].
A proposed Chinese-Japanese Sign Language Translation system provides research directions for other similar kinds of translation, such as a Hindi→English Sign Language Translation System [32]. Bilingual Hindi-English (Hinglish) Machine Translation offers an important research direction for separating the pure component languages from mixed-language text [33].
BLEU (Bilingual Evaluation Understudy) is the major metric, and some other metrics are also helpful, in the automatic evaluation of Machine Translation systems. There are different techniques under BLEU which play an important role in evaluating Machine Translation systems [6,34].
A lot of ancient literature exists in Hindi. It is written in the "Devanagari lipi (script)", which was developed during the 15th century. Most books, novels, volumes, etc. are in Hindi script. In the modern era, there is a huge demand for English translation. Over the last decades, research in this area has increased [35].
One of the hardest kinds of machine translation is poetry translation. A lot of poetry is available in Hindi, and a lot of work has been done in this direction. The available systems require a better mechanism for translating poetry into English [36].
Many researchers, institutions and research organizations have started working on Machine Translation systems for Hindi-to-English, English-to-Hindi and Hindi-to-regional-language translation and vice versa, and have succeeded in obtaining very satisfactory results. The prominent institutions and research organizations which have worked, and are still working, in the area of Machine Translation are as follows [2,5,17]:
• Technology Development for Indian Languages (TDIL) project by the Department of Electronics and Information Technology (DeitY), Ministry of Communications and Information Technology, Government of India.
• Department of Computer Science and Engineering, Indian Institute of Technology (IIT), Kanpur, Bombay and Delhi.
• Department of Computer and Information Sciences, University of Hyderabad (UoH), Hyderabad.
• Language Technologies Research Center (LTRC), International Institute of Information Technology (IIIT), Hyderabad.
• Centre for Development of Advanced Computing (CDAC), Pune, Noida and Bangalore.
• National Center for Software Technology (NCST) (now CDAC), Bombay.
• Department of Computer Science and Engineering, Jadavpur University, Kolkata.
• Machine Learning Lab, CSA, Indian Institute of Science (IISc), Bangalore.
• AU-KBC Research Centre, Chennai.
• Department of Computer Science and Application, Utkal University, Bhubaneswar.
• Advanced Center for Technical Development of Punjabi Language, Literature and Culture, Punjabi University, Patiala.
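The word-order contrast described in Sections 1.1 and 1.2 (Hindi SOV versus English SVO) can be sketched in a few lines of Python. This is only a minimal illustration and not the method of any system surveyed in this paper: the romanized tokens, the role tags supplied by hand, the lexicon `TOY_LEXICON` and the function `sov_to_svo` are all hypothetical, and a real system would obtain the roles from a parser and handle inflection properly.

```python
# Toy transfer step: reorder a role-tagged Hindi clause (SOV) into English (SVO).
# Every name and dictionary entry below is an illustrative assumption.

# Hypothetical word-to-word lexicon (romanized Hindi -> English).
TOY_LEXICON = {
    "prithvi": "Prithvi",   # proper noun, kept as a name
    "sona": "gold",         # sense picked by hand; "sona" can also mean "to sleep"
    "chahta hai": "wants",  # verb group of "chahna", third person singular
}


def sov_to_svo(tagged_clause):
    """Map [(phrase, role), ...] with roles S, O and V to an English SVO string."""
    by_role = {role: phrase for phrase, role in tagged_clause}
    # English order: subject first, then verb, then object.
    english = [TOY_LEXICON[by_role[role]] for role in ("S", "V", "O")]
    return " ".join(english)


hindi_clause = [("prithvi", "S"), ("sona", "O"), ("chahta hai", "V")]
print(sov_to_svo(hindi_clause))
```

For the example sentence used in this paper the sketch prints "Prithvi wants gold". The reordering itself is trivial once the roles are known; the hard parts are finding the roles and choosing the right sense of each word, which is where the approaches surveyed above come in.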
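The knowledge/dictionary-based family of Word Sense Disambiguation mentioned above can be illustrated with a simplified Lesk-style overlap count, in which the sense whose signature shares the most words with the sentence is chosen. The sketch below is an assumption-laden toy: the sense inventory `TOY_SENSES` is hand-made (it is not taken from HindiWordNet or IndoWordNet), the tokens are romanized, and `disambiguate` is a hypothetical helper name.

```python
# Simplified Lesk-style WSD: choose the sense whose signature words overlap
# most with the sentence context. The inventory is a hypothetical toy.

TOY_SENSES = {
    "sona": {
        "gold": {"kharidna", "abhushan", "dhatu", "bazaar", "keemat"},
        "sleep": {"raat", "bistar", "neend", "thakna", "jaldi"},
    }
}


def disambiguate(word, context_tokens, default=None):
    """Return the best-overlapping sense label, or `default` if nothing overlaps."""
    context = set(context_tokens) - {word}
    best_sense, best_overlap = default, 0
    for sense, signature in TOY_SENSES.get(word, {}).items():
        overlap = len(signature & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("sona", "vah bazaar se sona kharidna chahta hai".split()))
print(disambiguate("sona", "vah raat ko jaldi sona chahta hai".split()))
print(disambiguate("sona", "prithvi sona chahta hai".split()))
```

With this toy inventory the first call selects "gold", the second selects "sleep", and the third returns `None` because the short example sentence from this paper contains no context word that favours either sense. That last case hints at why such a sentence is difficult for a translation system and why richer resources or combined approaches are needed.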
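To make the role of BLEU in automatic evaluation more concrete, the following sketch computes a sentence-level score from clipped n-gram precisions and a brevity penalty. It rests on stated simplifications rather than on any evaluation setup used in the cited work: a single reference, uniform weights, no smoothing, and n-grams only up to length 2 (`max_n=2`) because the example sentences are very short, whereas BLEU is commonly reported with n-grams up to length 4 over a whole corpus. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(candidate, reference, max_n=2):
    """Single-reference BLEU with uniform weights over 1..max_n-grams, unsmoothed."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum)


reference = "Prithvi wants gold"
for candidate in ("Prithvi wants gold", "Prithvi wants sleep", "Earth wants to sleep"):
    print(candidate, "->", round(sentence_bleu(candidate, reference), 3))
```

Taking "Prithvi wants gold" as the reference, an identical candidate scores 1.0, "Prithvi wants sleep" scores about 0.577, and "Earth wants to sleep" scores 0.0 because it shares no bigram with the reference. The zero illustrates why unsmoothed sentence-level scores are harsh on short sentences.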
Vol 10 (16) | April 2017 | www.indjst.org | Indian Journal of Science and Technology