International Journal of Computer Applications (0975 – 8887)
Volume 39 – No. 6, February 2012

Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition

Nitin Mishra, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
C. Patvardhan, Dept. of Electrical Engg., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
C. Vasantha Lakshmi, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
Sarika Singh, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
                                                                                                       
                 
ABSTRACT
Tesseract OCR Engine is one of the most efficient open source OCR engines currently available. The recent Tesseract OCR 3.01 release can recognize the Hindi language, but its performance still needs improvement. Hindi recognition accuracy is quite low even for printed text, because the conjunct character combinations of Hindi are not easily separable due to partial overlapping. The proposed approach addresses this problem so that Devanagari conjunct characters can be segmented and recognized easily using Tesseract OCR Engine. This paper presents a complete methodology to improve Hindi language recognition accuracy. It also compares the approach with other available Devanagari OCR engines on the basis of recognition accuracy, processing time, font variations and database size.

General Terms
Pattern Recognition

Keywords
Tesseract, Hindi, OCR, Shirorekha Chopping, Character Segmentation

1.  INTRODUCTION
Today, Tesseract is considered one of the most accurate open source OCR engines available. Tesseract OCR Engine was one of the best 3 engines in the 1995 UNLV Accuracy Test. Between 1995 and 2006, however, there was little activity in Tesseract, until it was open sourced by HP and UNLV in 2005. It was re-released to the open source community in August 2006 by Google [1]. Tesseract can also be trained for new languages and scripts [2]. A complete overview of the Tesseract OCR engine can be found in [3]. While Tesseract was originally developed for English, it has since been extended to recognize French, Italian, Catalan, Czech, Danish, Polish, Bulgarian, Russian, Greek, Korean, Spanish, Japanese, Dutch, Chinese, Indonesian, Swedish, German, Thai, Arabic and Hindi, among others.

Training the Tesseract OCR Engine for Hindi requires in-depth knowledge of the Devanagari script in order to collect the character set [4]. Moreover, training on the collected dataset is not sufficient: the character segmentation and clubbing issues that arise from script-specific features [5], i.e. Shirorekha, maatra etc., must also be tackled. Hindi has an enormous number of character combinations [6], so it is not a good technique to train all the possible combinations of Hindi characters. It is highly desirable to choose a Smart Database that has all basic characters, half characters, and the minimal set of conjunct character combinations that may occur in some word, leaving out all unfavorable combinations. The segmentation issues related to Shirorekha based scripts are presented in [7].

The proposed Hindi Language Database consists of basic vowels, consonants, extensions, special symbols, punctuation marks, English numerals, Devanagari numerals and a minimal set of favorable vowel-consonant combinations, bi-consonant combinations and bi-consonant-vowel combinations. Tesseract based research has shown robust results on Bangla and Kannada [8, 9], but no efficient recognition results have yet been shown for Hindi. This paper presents an improvement in printed Devanagari script recognition using Tesseract OCR Engine.

Table 1: General Vowels
Vowel              अ      आ       इ       ई        उ      ऊ
Matra              -      ◌ा      ◌ि      ◌ी       ◌ु     ◌ू
Transliteration    a      aa/A    e/i     ee/ii    u      oo/uu

Vowel              ए      ऐ       ओ       औ        अं     अः
Matra              ◌े     ◌ै      ◌ो      ◌ौ       ◌ं     ◌ः
Transliteration    e      ai      o       ou       aM     aH

Table 2: Other Vowels
ॠ       ॡ       ॐ
r^^     l^^     AUM

Table 3: Consonants
क       ख        ग      घ        ङ
ka      kha      ga     gha      nga
च       छ        ज      झ        ञ
cha     chha     ja     jha      nja
ट       ठ        ड      ढ        ण
Ta      Tha      Da     Dha      Na
त       थ        द      ध        न
ta      tha      da     dha      na
प       फ        ब      भ        म
pa      Pha/fa   ba     bha      ma
य       र        ल      व        श
ya      ra       la     va/wa    Sha
ष       स        ह      क्ष       त्र
shh     sa       ha     ksh      tra
ज्ञ
jnja
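The last three entries of Table 3 (ksh, tra, jnja) are conjuncts rather than single letters, which is one reason the number of trainable combinations grows so quickly. As a minimal illustration, assuming the conjuncts are encoded in the usual Unicode way (consonant + virama + consonant), the following Python sketch lists the code points behind each one. The helper name `decompose` and the dictionary `CONJUNCTS` are introduced here only for illustration and are not part of the paper or of Tesseract.

```python
import unicodedata


def decompose(text):
    """Return the Unicode name of every code point in `text`."""
    return [unicodedata.name(ch) for ch in text]


# Conjuncts from the end of Table 3, written as escapes so that the
# assumed encoding (consonant + VIRAMA + consonant) is explicit.
CONJUNCTS = {
    "ksh": "\u0915\u094d\u0937",
    "tra": "\u0924\u094d\u0930",
    "jnja": "\u091c\u094d\u091e",
}

for label, conjunct in CONJUNCTS.items():
    # Only ASCII is printed, so the sketch runs on any console encoding.
    print(label, len(conjunct), decompose(conjunct))
```

Each conjunct is three code points long, so a recognizer sees one visual cluster where the text encoding has three symbols.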
Table 4: Dot+Consonants (Extensions)
ऩ       ऱ       ऴ       क़      ख़       ग़
.na     .ra     .La     .ka     .kha    .ga
ज़       ड़       ढ़       फ़      य़
.ja     .Da     .Dha    .fa     .ya

Table 5: Special Symbols
Anusvara        Visarga           Chandra Bindu    Chandra
◌ं              ◌ः                ◌ँ               ◌ॅ
Nukta           Virama            Udatta           Anudatta
◌़              ◌्                ◌॑               ◌॒
Purna virama    Deergha virama    Avagraha         Grave Accent
।               ॥                 ऽ                ◌॓
Acute Accent
◌॔

Table 6: Punctuation Marks and Other Symbols
“    ?    ;    %    *    /    (    )    \
=    {    }    [    ]    ,    -    :    !

Table 7: Numerals
०    १    २    ३    ४    ५    ६    ७    ८    ९
0    1    2    3    4    5    6    7    8    9

2.  METHODOLOGY
As Fig 1 shows, the proposed approach can be divided into two major components described below:

[Fig 1 block diagram; boxes listed in the order they appear]
Training Data Generation: Smart Hindi Database Selection, Training Image Generation, Box file Generation, Train file generation, Character set file generation, Font properties Selection, Feature Extraction, Clustering, Dictionary Data Preparation, Post Processing Ambiguity Removal, Training Data Compaction
Test Data Processing: Shirorekha Chopping Based Preprocessing, Binarization, Noise Elimination, Blob Detection, Skew Detection and Correction, Character Segmentation, Matching, Post Processing, Result Generation, Recognizing the Test Image

Fig 1: Block Level Diagram

2.1  Training Data Generation
The basic guidelines for preparing training data are clearly explained in [10] and were followed to prepare the customized training data. The process has the following phases:

2.1.1  Smart Hindi database selection
The training database consists of 15 vowels, 36 consonants, 11 extensions, 13 special symbols, 18 punctuation marks and other symbols, 10 English numerals, 10 Devanagari numerals, a minimal set of 218 vowel-consonant combinations, 276 bi-consonant combinations and 179 bi-consonant-vowel combinations, providing a total of 786 character combinations in the Mangal font at 18 pt. The coarse classification of Hindi characters is presented in [11].

2.1.2  Training image generation
This phase creates a text image in a single font with sufficiently spaced-out characters. Tesseract OCR Engine suggests preparing a new image file for each new font.

2.1.3  Box file generation
The bounding boxes for all the characters present in the training image are generated, specifying the Devanagari script components in the box file. The default generated bounding boxes can easily be edited using box file editors, e.g. the cowboxer tool.

2.1.4  Train file generation
Box file editors also allow editing the Unicode characters that correspond to each bounding box.

2.1.5  Character set file generation
The character set file specifies properties of the Unicode characters such as uppercase, lowercase, digit and punctuation mark. Since Devanagari does not distinguish upper and lower case characters, only digits and punctuation marks have to be specified.

2.1.6  Font properties selection
Font properties such as italic, bold, fixed, serif etc. must be specified before training the data. In this work only normal fonts have been considered.

2.1.7  Feature extraction
This phase extracts the features of the character shapes from the training data image.

2.1.8  Clustering
This phase clusters the character shape features into prototypes.

2.1.9  Dictionary data preparation
Tesseract may use up to 5 types of dictionary files, which are converted into Directed Acyclic Word Graph (DAWG) files.

2.1.10  Post processing ambiguity removal
Editing the unicharambigs file allows removing the intrinsic ambiguity between two similar looking characters, or their combinations, by using a substitution rule.
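The total of 786 quoted in Section 2.1.1 can be checked by summing the category counts given there. The short Python sketch below does only this arithmetic; the dictionary keys and variable names are labels chosen here for readability, not names used by the authors.

```python
# Category counts exactly as listed in Section 2.1.1.
INVENTORY = {
    "vowels": 15,
    "consonants": 36,
    "extensions": 11,
    "special_symbols": 13,
    "punctuation_and_other_symbols": 18,
    "english_numerals": 10,
    "devanagari_numerals": 10,
    "vowel_consonant_combinations": 218,
    "bi_consonant_combinations": 276,
    "bi_consonant_vowel_combinations": 179,
}

total = sum(INVENTORY.values())

# Split the total into trained combinations and stand-alone symbols.
combinations = sum(
    count for name, count in INVENTORY.items() if name.endswith("combinations")
)
basic_symbols = total - combinations

print(total, basic_symbols, combinations)
```

The split shows that most of the training set consists of combinations, which is why keeping that set minimal matters for the size of the training data.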
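Sections 2.1.3 and 2.1.4 both revolve around the box file. In Tesseract 3.x each line of a box file holds a symbol followed by the left, bottom, right and top pixel coordinates of its bounding box (origin at the bottom-left corner of the image) and a page number. The Python sketch below parses one such line. The `Box` type, the `parse_box_line` helper and the sample coordinates are hypothetical and serve only to illustrate the format; they are not taken from the paper's training data.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' line."""
    # Split from the right so the symbol is whatever precedes the five numbers.
    symbol, *numbers = line.rsplit(None, 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


# Hypothetical entry for a conjunct: one box, several code points.
sample = "\u0915\u094d\u0937 10 20 42 58 0"
box = parse_box_line(sample)
print(len(box.symbol), box.right - box.left, box.top - box.bottom)
```

A conjunct occupies a single box even though its symbol spans several code points, which is the kind of entry a box file editor lets the user correct by hand.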
2.1.11  Training data compaction
Finally, all the generated files are compacted into a single file.

  OS used: Ubuntu 10.04
  Tesseract OCR version used: 3.01
  Training image used: hin.mangal.exp1.tif

  Commands used for Training Data Generation:

  tesseract hin.mangal.exp1.tif hin.mangal.exp1 batch.nochop makebox
  tesseract hin.mangal.exp1.tif hin.mangal.exp1 nobatch box.train
  unicharset_extractor hin.mangal.exp1.box
  cp unicharset hin.unicharset
  echo mangal 0 0 0 0 0 > font_properties
  mftraining -F font_properties -U hin.unicharset hin.mangal.exp1.tr
  cntraining hin.mangal.exp1.tr
  mv Microfeat hin.Microfeat
  mv normproto hin.normproto
  mv pffmtable hin.pffmtable
  mv mfunicharset hin.mfunicharset
  mv inttemp hin.inttemp
  wordlist2dawg frequent_words_list hin.freq-dawg hin.unicharset
  combine_tessdata hin.

Fig 2: Resources and Commands used

Fig 2 lists all the resources and commands used from the experimental point of view. The Mangal font was used in the training image.

2.2  Test Data Processing
This component can be divided into two sub-components described below:

2.2.1  Shirorekha Chopping Algorithm
In the Preprocessing Phase, the horizontal and vertical histograms are generated for each line of text identified in the test image. The Shirorekha of the text in the image is chopped each time the distance between the bottom of a valley and the x-axis of the corresponding vertical histogram goes below a threshold T, which depends on the font size. The motivation behind Shirorekha Chopping is that good segmentation techniques can increase OCR performance [12].

  Font used: Mangal
  Font size: 18
  Threshold = 18/8 = 2.25

The dots in Fig 3 represent the chopping points on the Shirorekha for the corresponding word in the test image.

Fig 4: Shirorekha Chopping in Test Image

Fig 4 illustrates the Shirorekha Chopping. The short lines highlight those valleys at which the distance between the bottom of the valley and the x-axis of the corresponding vertical histogram goes below the threshold T. The Shirorekha is chopped at these valleys. After the preprocessing is completed, the Shirorekha Chopped test image shown in Fig 5 is obtained.

Fig 5: Shirorekha Chopped Test Image

The Shirorekha Chopped test image is now easily segmented using the inbuilt segmentation technique of Tesseract OCR Engine, as shown in Fig 6.

Fig 6: Shirorekha Chopping based Character Segmentation

2.2.2  Recognizing the Test Image
In this phase, the preprocessed test image is recognized using the training data.

  Test image used: test.tif

  Commands used for Test Data Processing:

  tesseract test.tif result -l hin
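The chopping rule of Section 2.2.1 can be sketched in a few lines of Python. This is one possible reading of the description, not the authors' implementation, and it rests on several assumptions made here: a text line is a binary image stored as nested lists with 1 for ink; the threshold generalizes the 18/8 = 2.25 example to font size divided by 8; the Shirorekha band is taken to be the rows whose ink count is within 80% of the densest row; and every column in a valley is cleared, not a single point per valley. All function names and the `band_ratio` parameter are hypothetical.

```python
def chop_threshold(font_size, divisor=8.0):
    """Threshold T; font_size / 8 mirrors the 18/8 = 2.25 example (assumed rule)."""
    return font_size / divisor


def row_histogram(image):
    """Ink pixels per row (the horizontal histogram)."""
    return [sum(row) for row in image]


def column_histogram(image):
    """Ink pixels per column (the vertical histogram)."""
    return [sum(column) for column in zip(*image)]


def shirorekha_rows(image, band_ratio=0.8):
    """Rows assumed to form the Shirorekha: those close to the densest row."""
    counts = row_histogram(image)
    peak = max(counts)
    if peak == 0:
        return []
    return [y for y, count in enumerate(counts) if count >= band_ratio * peak]


def chop_shirorekha(image, font_size):
    """Clear the Shirorekha wherever the column ink count is at most T."""
    threshold = chop_threshold(font_size)
    band = shirorekha_rows(image)
    chopped = [row[:] for row in image]  # leave the input untouched
    chop_columns = []
    for x, count in enumerate(column_histogram(image)):
        if 0 < count <= threshold:  # valley: little besides the head line
            for y in band:
                chopped[y][x] = 0
            chop_columns.append(x)
    return chopped, chop_columns


# Toy text line: a one-pixel Shirorekha (row 1) over two character bodies.
DEMO_LINE = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
]

chopped_line, columns = chop_shirorekha(DEMO_LINE, font_size=8)
print(columns)
print(chopped_line[1])
```

With `font_size=18` the threshold evaluates to 2.25, the value used in the paper. In the toy line the head line survives only above the two character bodies, so the bodies are no longer joined when the image is handed to the recognizer.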
                                                                                              
Fig 3 (referenced in Section 2.2.1): Shirorekha Chopping based on font-size-specific threshold (T = 2.25)

3.  EXPERIMENTAL RESULTS
The recognition accuracy, processing time and database size of the proposed approach, with preprocessing and font variations, were tested against Google’s hin.traineddata [13] and Parichit’s hin.traineddata [14].

Fig 7: Test image
Fig 8: Experimental Results Comparison

The test image sample is shown in Fig 7. The test results can be compared in Fig 8. After a number of tests, the final results were obtained, which are described below:

Table 8: Font Variation Tolerance Comparison
Training data                 Recognition rate with       Recognition rate with
                              Mangal as testing font      Krutidev as testing font
Google’s hin.traineddata      45.6 %                      44.8 %
Parichit’s hin.traineddata    23.4 %                      21.2 %
Proposed hin.traineddata      94.9 %                      86.9 %

Table 9: Average Recognition Rate Comparison
Training data                 Average recognition rate    Preprocessing used on test image
Google’s hin.traineddata      45.2 %                      No preprocessing
Parichit’s hin.traineddata    22.3 %                      No preprocessing
Proposed hin.traineddata      90.9 %                      Shirorekha Chopping

Table 10: Processing Time Comparison
Training data                 Processing time             Total characters in test image
Google’s hin.traineddata      2000 ms                     94
Parichit’s hin.traineddata    1500 ms                     94
Proposed hin.traineddata      1000 ms                     94

Table 11: Training Data Size Comparison
Training data                 Training data size          Training font
Google’s hin.traineddata      13.8 MB                     -
Parichit’s hin.traineddata    13.1 MB                     -
Proposed hin.traineddata      7.5 MB                      Mangal

4.  CONCLUSIONS
There is a significant improvement in the recognition rate, processing time and size of the training database after integrating Shirorekha Chopping with Tesseract OCR Engine. Table 8 shows higher accuracy when the testing font is the same as the training font and lower accuracy when it differs, but the font variation tolerance is still considerably better than that of the existing training data. Table 9 shows that the average recognition rate is substantially higher with Shirorekha Chopping. The proposed Shirorekha chopping based preprocessing approach does not just improve the recognition rate; it also allows training only the two-or-more-touching conjunct characters along with basic characters and isolated half characters. The single-touching conjunct characters may be left out, as Shirorekha Chopping easily segments them into the basic components that were trained. This leads to a comparatively smaller training database (Table 11). The proposed approach also runs faster than Google’s and Parichit’s training data (Table 10). Extension to multiple fonts is in progress as future work.
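The averages in Table 9 are consistent with the per-font rates in Table 8. The Python sketch below recomputes them and also derives the processing-time ratios implied by Table 10. It is a check of the reported arithmetic only; the variable names are chosen here, and no new measurements are introduced.

```python
# Recognition rates from Table 8: (Mangal, Krutidev) as testing font, in percent.
TABLE_8 = {
    "Google": (45.6, 44.8),
    "Parichit": (23.4, 21.2),
    "Proposed": (94.9, 86.9),
}

# Processing times from Table 10, in milliseconds, for the 94-character image.
TABLE_10_MS = {"Google": 2000, "Parichit": 1500, "Proposed": 1000}

# Mean of the two per-font rates, rounded to one decimal as in Table 9.
average_rate = {
    name: round(sum(rates) / len(rates), 1) for name, rates in TABLE_8.items()
}

# How many times longer each training data set takes than the proposed one.
time_ratio = {
    name: ms / TABLE_10_MS["Proposed"] for name, ms in TABLE_10_MS.items()
}

print(average_rate)
print(time_ratio)
```

The recomputed means match Table 9, which suggests that the average recognition rate is the plain mean over the two testing fonts.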
                 