International Journal of Computer Applications (0975 – 8887)
Volume 39– No.6, February 2012
Shirorekha Chopping Integrated Tesseract OCR
Engine for Enhanced Hindi Language Recognition
Nitin Mishra, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
C. Patvardhan, Dept. of Electrical Engg., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
C. Vasantha Lakshmi, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
Sarika Singh, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
ABSTRACT
The Tesseract OCR Engine is one of the most efficient open source OCR engines currently available. Tesseract OCR 3.01 can recognize Hindi, but its performance still needs improvement. Recognition accuracy for Hindi is quite low even for printed text, because the conjunct character combinations of Hindi are not easily separable due to partial overlapping. The proposed approach addresses this problem so that Devanagari conjunct characters can be segmented and recognized easily using the Tesseract OCR Engine. This paper presents a complete methodology to improve Hindi recognition accuracy. It also compares the approach with other available Devanagari OCR engines on the basis of recognition accuracy, processing time, font variations and database size.

General Terms
Pattern Recognition

Keywords
Tesseract, Hindi, OCR, Shirorekha Chopping, Character Segmentation

1. INTRODUCTION
Today, Tesseract is considered one of the most accurate open source OCR engines available. The Tesseract OCR Engine was one of the top 3 engines in the 1995 UNLV Accuracy Test. Between 1995 and 2006, however, there was little activity in Tesseract, until it was open sourced by HP and UNLV in 2005. It was re-released to the open source community in August 2006 by Google [1]. Tesseract can also be trained for new languages and scripts [2]. A complete overview of the Tesseract OCR engine can be found in [3]. While Tesseract was originally developed for English, it has since been extended to recognize French, Italian, Catalan, Czech, Danish, Polish, Bulgarian, Russian, Greek, Korean, Spanish, Japanese, Dutch, Chinese, Indonesian, Swedish, German, Thai, Arabic and Hindi, among others. Training the Tesseract OCR Engine for Hindi requires in-depth knowledge of the Devanagari script in order to collect the character set [4]. Moreover, the Tesseract OCR Engine requires not just training on the collected dataset but also handling of the character segmentation and clubbing issues that arise from script-specific features [5] such as the Shirorekha and maatras. Hindi has an enormous number of character combinations [6], so training all possible combinations of Hindi characters is not a good technique. It is highly desirable to choose a Smart Database that contains all basic characters, half characters and the minimal set of conjunct character combinations that may occur in some word, leaving out all unfavorable combinations. The segmentation issues related to Shirorekha-based scripts are presented in [7]. The proposed Hindi Language Database consists of basic vowels, consonants, extensions, special symbols, punctuation marks, English numerals, Devanagari numerals and a minimal set of favorable vowel-consonant combinations, bi-consonant combinations and bi-consonant-vowel combinations. Tesseract-based research has shown robust results for Bangla and Kannada [8, 9], but no efficient recognition results have yet been shown for Hindi. This paper presents an improvement in printed Devanagari script recognition using the Tesseract OCR Engine.

Table 1: General Vowels
Vowel            अ    आ      इ     ई       उ    ऊ
Matra            -    ◌ा     ◌ि    ◌ी      ◌ु    ◌ू
Transliteration  a    aa/A   e/i   ee/ii   u    oo/uu
Vowel            ए    ऐ      ओ     औ       अं   अः
Matra            ◌े    ◌ै     ◌ो    ◌ौ      ◌ं    ◌ः
Transliteration  e    ai     o     ou      aM   aH

Table 2: Other Vowels
ॠ     ॡ     ॐ
r^^   l^^   AUM

Table 3: Consonants
क     ख      ग     घ      ङ
ka    kha    ga    gha    nga
च     छ      ज     झ      ञ
cha   chha   ja    jha    nja
ट     ठ      ड     ढ      ण
Ta    Tha    Da    Dha    Na
त     थ      द     ध      न
ta    tha    da    dha    na
प     फ      ब     भ      म
pa    Pha/fa ba    bha    ma
य     र      ल     व      श
ya    ra     la    va/wa  Sha
ष     स      ह     क्ष     त्र
shh   sa     ha    ksh    tra
ज्ञ
jnja
Table 4: Dot+Consonants (Extensions)
ऩ     ऱ     ऴ      क़     ख़      ग़
.na   .ra   .La    .ka   .kha   .ga
ज़     ड़     ढ़      फ़     य़
.ja   .Da   .Dha   .fa   .ya

Table 5: Special Symbols
Anusvara        ◌ं
Visarga         ◌ः
Chandra Bindu   ◌ँ
Chandra         ◌ॅ
Nukta           ◌़
Virama          ◌्
Udatta          ◌॑
Anudatta        ◌॒
Purna virama    ।
Deergha virama  ॥
Avagraha        ऽ
Grave Accent    ◌॓
Acute Accent    ◌॔

Table 6: Punctuation Marks and Other Symbols
“ ? ; % * / ( ) \
= { } [ ] , - : !

Table 7: Numerals
०  १  २  ३  ४  ५  ६  ७  ८  ९
0  1  2  3  4  5  6  7  8  9

2. METHODOLOGY
As Fig 1 shows, the proposed approach can be divided into two major components, described below.

[Block diagram]
Training Data Generation: Smart Hindi Database Selection, Training Image Generation, Box file Generation, Train file generation, Character set file generation, Font properties Selection, Feature Extraction, Clustering, Dictionary Data Preparation, Post Processing Ambiguity Removal, Training Data Compaction.
Test Data Processing: Shirorekha Chopping Based Preprocessing, followed by Recognizing the Test Image (Binarization, Noise Elimination, Blob Detection, Skew Detection and Correction, Character Segmentation, Matching, Post Processing, Result Generation).
Fig 1: Block Level Diagram

2.1 Training Data Generation
The basic guidelines for preparing training data are clearly explained in [10] and were followed to prepare the customized training data. The process has the following phases:

2.1.1 Smart Hindi database selection
The training database consists of 15 vowels, 36 consonants, 11 extensions, 13 special symbols, 18 punctuation marks and other symbols, 10 English numerals, 10 Devanagari numerals, a minimal set of 218 vowel-consonant combinations, 276 bi-consonant combinations and 179 bi-consonant-vowel combinations, giving a total of 786 character combinations in the 18 pt Mangal font. The coarse classification of Hindi characters is presented in [11].

2.1.2 Training image generation
This phase creates a text image in a single font with sufficiently spaced-out characters. For each new font, the Tesseract OCR Engine suggests preparing a new image file.

2.1.3 Box file generation
Bounding box information for all the characters in the training image is generated so that the Devanagari script components can be specified in the box file. The default bounding boxes can easily be edited using box file editors such as the cowboxer tool.

2.1.4 Train file generation
Box file editors also allow editing the Unicode characters assigned to the corresponding bounding boxes.

2.1.5 Character set file generation
The character set file specifies properties of the Unicode characters, such as uppercase, lowercase, digit and punctuation mark. Since Devanagari does not distinguish upper and lower case characters, only digits and punctuation marks have to be specified.

2.1.6 Font properties selection
Font properties such as italic, bold, fixed and serif must be specified before training. In this work only normal fonts have been considered.

2.1.7 Feature extraction
This phase extracts the shape features of the characters from the training data image.

2.1.8 Clustering
This phase clusters the character shape features into prototypes.

2.1.9 Dictionary data preparation
Tesseract may use up to 5 types of dictionary files, which are converted into Directed Acyclic Word Graph (DAWG) files.

2.1.10 Post processing ambiguity removal
Editing the unicharambigs file allows removing the intrinsic ambiguity between two similar-looking characters or their combinations by using a substitution rule.
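As a quick consistency check on the database composition given in Section 2.1.1, the category sizes can be summed to confirm the stated total of 786 entries. The snippet below is only an illustrative sketch; the dictionary name `database` and its keys are chosen here for readability and do not come from the training files.

```python
# Category sizes exactly as listed in Section 2.1.1.
# The dictionary name and keys are illustrative, not part of any Tesseract file.
database = {
    "vowels": 15,
    "consonants": 36,
    "extensions": 11,
    "special symbols": 13,
    "punctuation marks and other symbols": 18,
    "English numerals": 10,
    "Devanagari numerals": 10,
    "vowel-consonant combinations": 218,
    "bi-consonant combinations": 276,
    "bi-consonant-vowel combinations": 179,
}

total = sum(database.values())

# Entries that are combinations rather than basic symbols.
combinations = sum(
    size for name, size in database.items() if name.endswith("combinations")
)

print(total)         # 786
print(combinations)  # 673
```

The sum shows that 673 of the 786 entries are character combinations, so the size of the database is driven mainly by how many conjunct combinations are selected.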
2.1.11 Training data compaction
Finally, all the generated files are compacted into a single file.

OS used: Ubuntu 10.04
Tesseract OCR version used: 3.01
Training image used: hin.mangal.exp1.tif
Commands used for Training Data Generation:
tesseract hin.mangal.exp1.tif hin.mangal.exp1 batch.nochop makebox
tesseract hin.mangal.exp1.tif hin.mangal.exp1 nobatch box.train
unicharset_extractor hin.mangal.exp1.box
cp unicharset hin.unicharset
echo mangal 0 0 0 0 0 > font_properties
mftraining -F font_properties -U hin.unicharset hin.mangal.exp1.tr
cntraining hin.mangal.exp1.tr
mv Microfeat hin.Microfeat
mv normproto hin.normproto
mv pffmtable hin.pffmtable
mv mfunicharset hin.mfunicharset
mv inttemp hin.inttemp
wordlist2dawg frequent_words_list hin.freq-dawg hin.unicharset
combine_tessdata hin.
Fig 2: Resources and Commands used

Fig 2 lists all the resources and commands used in the experiments. The Mangal font was used in the training image.

2.2 Test Data Processing
This component can be divided into two sub-components, described below:

2.2.1 Shirorekha Chopping Algorithm
In the preprocessing phase, horizontal and vertical histograms are generated for each line of text identified in the test image. The Shirorekha of the text is chopped wherever the distance between the bottom of a valley and the x-axis of the corresponding vertical histogram falls below a threshold T, which depends on the font size. The motivation behind Shirorekha Chopping is that good segmentation techniques can increase OCR performance [12].

Test image used: test.tif
Font used: Mangal
Font size: 18
Threshold = 18/8 = 2.25

Fig 3: Shirorekha Chopping based on Font size specific threshold

The dots in Fig 3 represent the chopping points on the Shirorekha for the corresponding word in the test image.

Fig 4: Shirorekha Chopping in Test Image

Fig 4 illustrates the Shirorekha Chopping. The short lines highlight those valleys at which the distance between the bottom of the valley and the x-axis of the corresponding vertical histogram falls below the threshold T. The Shirorekha is chopped at these valleys. After preprocessing is complete, the Shirorekha-chopped test image shown in Fig 5 is obtained.

Fig 5: Shirorekha Chopped Test Image

The Shirorekha-chopped test image is now easily segmented using the built-in segmentation technique of the Tesseract OCR Engine, as shown in Fig 6.

Fig 6: Shirorekha Chopping based Character Segmentation

2.2.2 Recognizing the Test Image
In this phase, the preprocessed test image is recognized using the training data.

Commands used for Test Data Processing:
tesseract test.tif result -l hin

3. EXPERIMENTAL RESULTS
The recognition accuracy, processing time and database size, with preprocessing and font variations, were tested against Google's hin.traineddata [13] and Parichit's hin.traineddata [14].

Fig 7: Test image
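For reference, the chopping rule of Section 2.2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments. The function name `chop_shirorekha` and the parameter `band_ratio` are hypothetical. The Shirorekha is assumed to be the set of rows whose horizontal histogram is close to the row maximum. The vertical histogram is assumed to be measured on the ink outside that band, since the paper does not state whether the Shirorekha rows are counted. Every column below the threshold is cleared, which is a simplification of the chopping points shown in Fig 3.

```python
def chop_shirorekha(line, font_size_pt, band_ratio=0.8):
    """Chop the headline of one binarized text line (1 = ink, 0 = background).

    Returns the chopped line and the list of columns treated as valleys.
    """
    threshold = font_size_pt / 8.0  # T from Section 2.2.1, e.g. 18 / 8 = 2.25

    # Horizontal histogram: ink pixels per row.
    row_profile = [sum(row) for row in line]
    peak = max(row_profile)

    # Assumption: rows nearly as dense as the densest row form the Shirorekha.
    band = [r for r, count in enumerate(row_profile)
            if peak and count >= band_ratio * peak]

    # Assumption: the vertical histogram counts only ink outside the band.
    body = [row for r, row in enumerate(line) if r not in band]
    width = len(line[0])
    col_profile = [sum(row[c] for row in body) for c in range(width)]

    # A valley is any column whose histogram value falls below T.
    valleys = [c for c, count in enumerate(col_profile) if count < threshold]

    # Chop: clear the Shirorekha pixels in every valley column.
    chopped = [list(row) for row in line]
    for r in band:
        for c in valleys:
            chopped[r][c] = 0
    return chopped, valleys


# Toy line: a full-width headline over two two-pixel-wide stems.
toy_line = [[1] * 9] + [[0, 1, 1, 0, 0, 0, 1, 1, 0] for _ in range(5)]
toy_chopped, toy_valleys = chop_shirorekha(toy_line, font_size_pt=18)
print(toy_valleys)
print(toy_chopped[0])
```

In the toy line the headline survives only above the two stems, so a connected-component segmenter would see two separate blobs instead of one joined word.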
Fig 8: Experimental Results Comparison

The test image sample is shown in Fig 7. The test results can be compared in Fig 8. After a number of tests, the final results were obtained; they are described below.

Table 8: Font Variation Tolerance Comparison
Training data               Recognition rate with       Recognition rate with
                            Mangal as testing font      Krutidev as testing font
Google's hin.traineddata    45.6 %                      44.8 %
Parichit's hin.traineddata  23.4 %                      21.2 %
Proposed hin.traineddata    94.9 %                      86.9 %

Table 9: Average Recognition Rate Comparison
Training data               Average recognition rate    Preprocessing used on test image
Google's hin.traineddata    45.2 %                      No preprocessing
Parichit's hin.traineddata  22.3 %                      No preprocessing
Proposed hin.traineddata    90.9 %                      Shirorekha Chopping

Table 10: Processing Time Comparison
Training data               Processing time             Total characters in test image
Google's hin.traineddata    2000 ms                     94
Parichit's hin.traineddata  1500 ms                     94
Proposed hin.traineddata    1000 ms                     94

Table 11: Training Data Size Comparison
Training data               Training data size          Training font
Google's hin.traineddata    13.8 MB                     -
Parichit's hin.traineddata  13.1 MB                     -
Proposed hin.traineddata    7.5 MB                      Mangal

4. CONCLUSIONS
Integrating Shirorekha Chopping with the Tesseract OCR Engine significantly improves the recognition rate, the processing time and the size of the training database. Table 8 shows higher accuracy when the testing font is the same as the training font and lower accuracy when it differs, but the font variation tolerance is still considerably better than that of the existing training data. Table 9 shows that the average recognition rate is considerably enhanced by Shirorekha Chopping. The proposed Shirorekha chopping based preprocessing approach not only improves the recognition rate but also allows training only the two-or-more-touching conjunct characters along with basic characters and isolated half characters. The single-touching conjunct characters may be left out, as Shirorekha Chopping easily segments them into the basic components that were trained. This leads to a comparatively smaller training database (Table 11). The proposed approach also runs faster than the Google and Parichit training data (Table 10). Extension to multiple fonts is in progress as future work.
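The relation between the reported tables can be checked with a short calculation: each average in Table 9 is the mean of the two font-specific rates in Table 8. The Python sketch below reproduces that arithmetic and additionally derives the accuracy drop between fonts and the processing time per character from Table 10. The variable and function names are illustrative, and the derived per-character figures are not reported in the paper.

```python
# Values copied from Tables 8 and 10; names are illustrative only.
rates = {  # (Mangal, Krutidev) recognition rates in percent, Table 8
    "Google": (45.6, 44.8),
    "Parichit": (23.4, 21.2),
    "Proposed": (94.9, 86.9),
}
time_ms = {"Google": 2000, "Parichit": 1500, "Proposed": 1000}  # Table 10
CHARS_IN_TEST_IMAGE = 94                                        # Table 10


def average_rate(name):
    """Mean of the two font-specific rates, as reported in Table 9."""
    mangal, krutidev = rates[name]
    return round((mangal + krutidev) / 2, 1)


def font_drop(name):
    """Accuracy lost when the testing font differs from Mangal."""
    mangal, krutidev = rates[name]
    return round(mangal - krutidev, 1)


def ms_per_char(name):
    """Processing time divided by the number of characters in the test image."""
    return time_ms[name] / CHARS_IN_TEST_IMAGE


for name in rates:
    print(name, average_rate(name), font_drop(name), round(ms_per_char(name), 1))
```

The averages come out as 45.2, 22.3 and 90.9 percent, matching Table 9. The proposed training data loses 8.0 percentage points on Krutidev, a larger absolute drop than the other two, yet it remains well ahead of both on either font.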