Offline Tamil handwritten character recognition using statistical features.

INTRODUCTION

Computerizing languages has become popular around the world as a way to prevent them from becoming extinct. At the same time, digitizing ancient documents, palm scripts and the handwritten documents of old pandits is important for passing their life-giving messages on to children and to the wider world. Optical Character Recognition (OCR) is a well-known field concerned with converting such documents into machine-readable, editable code. Machine understanding and editing of handwritten characters is not an easy task for researchers, because writing style differs from one individual to another, changes with mood and age, and produces considerable variation even within the same character.

OCR recognizes characters in either an online or an offline phase. In the online phase, characters are identified from the movement of the pen tip, which is the less complicated problem to research. In the offline phase, characters are recognized from images of printed or handwritten text, which remains an open challenge that calls for a great deal of research. In this paper, Tamil handwritten characters were chosen for recognition because of their complex structure and the need for such a system. Tamil, a South Indian language, contains 247 characters: 18 consonants, 12 vowels, 216 combinational characters and one special character. Most Tamil characters have curved shapes and many have loops. Computerizing Tamil is worthwhile because it helps to glean the good thoughts and life-giving words of ancient people and to pass them on to the present generation.

Various researchers have worked towards a good recognition rate in Tamil handwritten OCR systems, but further improvement is still needed. The success of an OCR system depends mainly on selecting and extracting the needed features from the character images, and this work focuses on these two steps to maximize recognition accuracy. A Tamil OCR system can be divided into five major parts: Pre-processing, Segmentation, Feature Selection, Feature Extraction and Classification. The following sections describe these in detail. Section 1 reviews the main contributions of others to Tamil OCR. Section 2 describes the main contributions of this work. Finally, Sections 3 and 4 present the experimental results and the conclusion.

1. Literature Survey:

Many contributions have addressed the problems of Tamil recognition systems. Rajashekararadhya S.V. et al (2008, 2009, 2008, 2008) contributed extensively, using the zoning procedure in several ways to recognize Tamil handwritten numerals. N. Shanthi and K. Duraiswami (2007, 2010) applied SVM techniques in a Tamil recognition system and also used the zoning procedure to obtain the features. A quadratic classifier was used to obtain the final predicted results.

R. Ramanathan et al (2009) used a Gabor filter to extract features from the character image, with an SVM producing the final result. Akshay Apte and Harshad Gado (2010) extracted structural shapes from the character portion and used Euclidean distance to identify the recognized character.

U. Bhattacharya et al (2007) collected Tamil data samples from HP Lab. In their work, the number of transitions was used as the feature for the unsupervised K-means clustering algorithm, and chain code histogram features were used with a supervised multilayer perceptron to predict the final answers.

U. Pal et al (2007) developed a system using the bounding box and chain code feature extraction algorithms, with a quadratic classifier to determine the final character. Sukalpa Chanda et al (2008) implemented a reservoir based feature extraction algorithm and used an SVM classifier to obtain the classified character.

The statistical features projection profile, word profile and transition were extracted in the work of A.N. Sigappi et al (2011), where a Hidden Markov Model was used to obtain a considerable result. R. Jagadeesh Kannan and R. Prabhakar (2008) experimented with an octal graph based projection profile as the feature extraction algorithm and used feature matching to obtain the results. C. Sureshkumar and T. Ravichandran (2010) extracted histogram based structural features and recognized the characters using a Neural Network classifier.

2. Statistical Features based Handwritten OCR System:

The first stage of the recognition system is pre-processing, where the character image is cleaned and brought to a standard size to reduce the complexity of the recognition work. The other stages find the features of the character and identify the character itself. Fig. 1 shows the architecture of the handwritten OCR system.

Pre-processing:

Pre-processing is an important step in our work; it consists of four stages: Binarization, Noise removal, Skeletonization and Normalization. Otsu's thresholding procedure (2012, 2014) was used in the Binarization stage to obtain a black and white image. Following Otsu's procedure, the character image was logically decomposed into two classes, with two peak values calculated from the histogram. These two values were used to separate the foreground (black, pixel value 1) from the background (white, pixel value 0).
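The binarization stage can be sketched in a few lines. This is only an illustration of the standard Otsu criterion (choose the threshold that maximizes the between-class variance of the grey-level histogram), not the authors' Matlab code. The function names `otsu_threshold` and `binarize`, the 0-255 integer input and the convention that dark ink maps to 1 are assumptions.

```python
def otsu_threshold(gray):
    """Return the grey level t (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray):
    """Map dark ink to 1 (foreground) and light paper to 0 (background)."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]
```

For a scanned page with dark ink on light paper, `binarize` returns the 0/1 image that the later stages operate on.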

Noise present in the character image can cause wrong predictions. A median filter was used to remove unwanted pixels from the character image, which provided a clean image for the rest of the recognition process.
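As a rough illustration of the noise removal stage, the sketch below applies a median filter to a 0/1 image. The 3 x 3 window, the unchanged border pixels and the name `median_filter3` are assumptions; the paper does not state the window size.

```python
def median_filter3(img):
    """3x3 median filter; border pixels are copied unchanged."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            window = sorted(
                img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            )
            out[r][c] = window[4]  # middle of the nine sorted values
    return out
```

An isolated black speck surrounded by white pixels is replaced by white, while a pixel inside a solid region keeps its value.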

To simplify our work, a character image with single-pixel-wide lines was needed, and Skeletonization is the procedure that produces such images. A thinning algorithm was implemented for Skeletonization, removing the unwanted portion of the character image without affecting the regular shape of the real character, so that the character portion was reduced and only the skeleton was retained. Finally, Normalization was applied to the skeletonized image to convert the arbitrarily sized image into a standard size: the image was resized to 90 x 90 pixels.
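The normalization to 90 x 90 pixels might look like the following. The paper does not say which interpolation was used, so nearest-neighbour sampling and the name `resize_nearest` are assumptions made for illustration only.

```python
def resize_nearest(img, size=90):
    """Resize a 2-D list image to size x size by nearest-neighbour sampling."""
    height, width = len(img), len(img[0])
    return [
        [img[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]
```

Nearest-neighbour sampling keeps the image strictly binary, but when shrinking it can drop pixels from a one-pixel-wide skeleton, which is one reason a real system might normalize differently.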

Feature Selection and Extraction:

Feature Selection:

A zone based method was proposed for selecting the entire foreground portion of the character image, and the Eight Directional Chain Code algorithm was implemented to extract the proper foreground portion of the image. Before feature selection, the character image was normalized to a standard size (90 x 90 pixels) without changing its real shape. The image was then divided into nine equal parts by zoning, as shown in Fig. 2.
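The zoning step is simple to express in code. The sketch below cuts a square image into a 3 x 3 grid, so a 90 x 90 image yields nine 30 x 30 zones; the function name and the row-major zone order are assumptions.

```python
def split_into_zones(img, grid=3):
    """Split a square image into grid x grid equal zones, in row-major order."""
    step = len(img) // grid
    zones = []
    for zone_row in range(grid):
        for zone_col in range(grid):
            rows = img[zone_row * step:(zone_row + 1) * step]
            zones.append([row[zone_col * step:(zone_col + 1) * step] for row in rows])
    return zones
```

Each zone can then be passed independently to the feature selection and extraction steps described next.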

The eight-direction chain code algorithm (Jun Cao, M. Ahmadi and M. Shridhar, 1995) was then applied to each zone to select the proper features. It was applied to the black pixels of each sub-zone image in an anticlockwise direction.

Before applying the chain code algorithm, the sub-zone image was scanned row-wise from the top to find the first black pixel with only one neighbor. If one was found, the chain code traversal began from that pixel and continued in an anticlockwise direction. During the traversal, the chain code checked the eight neighboring directions to find the next black pixel; once it was found, the chain code head moved to that pixel. The traversal ended at another black pixel with only one black neighbor, or when the current pixel had three neighbors (a junction point). The visited foreground (black) pixels were placed in another image of the same size, which was used for further processing. The chain code procedure was repeated until no further black pixels were available to visit.

If the chain code process could not find any black pixel with one neighbor, the image was searched for a pixel with two black neighbors. If such a pixel was found, the shape might be circular. All visited black pixels were taken for further processing. Fig. 3 shows the features selected by the chain code algorithm from the characters [TEXT NOT REPRODUCIBLE IN ASCII.].
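The traversal described above can be sketched for a single stroke. This is a simplified reading of the text, not the authors' implementation: it traces only one stroke per call, the anticlockwise direction order starting from east is an assumption, and all names (`DIRS`, `find_start`, `trace_stroke`) are hypothetical.

```python
# Eight neighbour offsets (row, col), anticlockwise starting from east.
# Codes: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE. The start direction is assumed.
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def is_black(img, r, c):
    return 0 <= r < len(img) and 0 <= c < len(img[0]) and img[r][c] == 1


def neighbour_count(img, r, c):
    return sum(is_black(img, r + dr, c + dc) for dr, dc in DIRS)


def find_start(img):
    """First black pixel (row-wise) with one neighbour, else one with two."""
    for wanted in (1, 2):  # end points first, then possible loop pixels
        for r in range(len(img)):
            for c in range(len(img[0])):
                if img[r][c] == 1 and neighbour_count(img, r, c) == wanted:
                    return (r, c)
    return None


def trace_stroke(img):
    """Return (visited pixels, chain codes) for one stroke."""
    start = find_start(img)
    if start is None:
        return [], []
    path, codes, visited = [start], [], {start}
    r, c = start
    while True:
        for code, (dr, dc) in enumerate(DIRS):
            nr, nc = r + dr, c + dc
            if is_black(img, nr, nc) and (nr, nc) not in visited:
                break
        else:
            return path, codes  # no unvisited black neighbour left
        r, c = nr, nc
        path.append((r, c))
        codes.append(code)
        visited.add((r, c))
        n = neighbour_count(img, r, c)
        if n == 1 or n >= 3:  # end point or junction point
            return path, codes
```

The returned pixel list corresponds to the "visited foreground pixels" that the paper copies into a separate image; repeating the trace on the remaining pixels would cover the whole zone.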

Feature Extraction:

Pixel based location:

Each selected feature image was traversed from start to end horizontally, vertically and diagonally to count the black pixels, as shown in Fig. 4. The procedure for finding the pixel based location is detailed in Algorithm 1.

Algorithm 1

INITIALIZE [R1, C1] ← SIZE OF THE IMAGE
[C ← 0, LOC_FEA_1 ← 0, X1 ← 0, Y1 ← 0, r ← 1, c ← 1]
FOR EACH (r to R1)
   X1 ← r and incremented by 2
   FOR EACH (c to C1)
      Y1 ← c and incremented by 1
      FIND [X1, Y1] POSITION equal to 1
         DO (Score[C] += 1)
LOC_FEA_1 ← MEAN (Score[C])

INITIALIZE [R2, C2] ← SIZE OF THE IMAGE
[C ← 0, LOC_FEA_2 ← 0, X2 ← 0, Y2 ← 0, r ← 1, c ← 1]
FOR EACH (c to C2)
   X2 ← c and incremented by 2
   FOR EACH (r to R2)
      Y2 ← r and incremented by 1
      FIND [X2, Y2] POSITION equal to 1
         DO (Score[C] += 1)
LOC_FEA_2 ← MEAN (Score[C])
[Continue the procedure until LOC_FEA_3 and LOC_FEA_4 are found]
RETURN MAX_OF (LOC_FEA_1, LOC_FEA_2, LOC_FEA_3, LOC_FEA_4)


Finally, the mean value of each pixel location feature (LOC_FEA_1, LOC_FEA_2, LOC_FEA_3, LOC_FEA_4) was calculated and the four values were compared with each other. The maximum value was taken as the selected feature.
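One possible reading of Algorithm 1 is sketched below. Several details are assumptions, since the pseudocode leaves them open: each score is taken to be the black pixel count along one scan line, rows and columns are visited in steps of 2 as the pseudocode suggests, LOC_FEA_3 and LOC_FEA_4 are taken to be the two diagonal directions (scanned without skipping), and the function name is hypothetical.

```python
def mean(values):
    return sum(values) / len(values) if values else 0.0


def pixel_location_feature(img):
    """Return (maximum of the four mean scores, list of the four mean scores)."""
    height, width = len(img), len(img[0])
    # LOC_FEA_1: black pixels per row, visiting every second row.
    rows = [sum(img[r]) for r in range(0, height, 2)]
    # LOC_FEA_2: black pixels per column, visiting every second column.
    cols = [sum(img[r][c] for r in range(height)) for c in range(0, width, 2)]
    # LOC_FEA_3 / LOC_FEA_4 (assumed): black pixels per diagonal and anti-diagonal.
    diag = [
        sum(img[r][r + k] for r in range(height) if 0 <= r + k < width)
        for k in range(-(height - 1), width)
    ]
    anti = [
        sum(img[r][k - r] for r in range(height) if 0 <= k - r < width)
        for k in range(height + width - 1)
    ]
    features = [mean(rows), mean(cols), mean(diag), mean(anti)]
    return max(features), features
```

Under this reading, a stroke that runs mostly along one direction produces its largest mean in that direction, which is what makes the maximum a usable feature.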

Vertical and horizontal way of Feature extraction:

The selected image was further logically divided into five equal parts (A, B, C, D, E), both horizontally and vertically, as shown in Fig. 5. The feature was then taken from whichever one of the following cases applied.

In the vertical way of extraction:

Step1: Pixels were found only in block A; this was taken as y1.

Step2: Pixels were found only in block B; this was taken as y2.

Step3: Pixels were found only in block C; this was taken as y3.

Step4: Pixels were found only in block D; this was taken as y4.

Step5: Pixels were found only in block E; this was taken as y5.

Step6: Pixels were found in blocks A and B; this was taken as y6 (A ∪ B → y6).

Step7: Pixels were found in blocks B and C; this was taken as y7 (B ∪ C → y7).

Step8: Pixels were found in blocks C and D; this was taken as y8 (C ∪ D → y8).

Step9: Pixels were found in blocks D and E; this was taken as y9 (D ∪ E → y9).

Step10: Pixels were found in blocks A, B and C; this was taken as y10 (A ∪ B ∪ C → y10).

Step11: Pixels were found in blocks B, C and D; this was taken as y11 (B ∪ C ∪ D → y11).

Step12: Pixels were found in blocks C, D and E; this was taken as y12 (C ∪ D ∪ E → y12).

Step13: Pixels were found in blocks A, B, C and D; this was taken as y13 (A ∪ B ∪ C ∪ D → y13).

Step14: Pixels were found in blocks B, C, D and E; this was taken as y14 (B ∪ C ∪ D ∪ E → y14).

Step15: Pixels were found in blocks A, B, C, D and E; this was taken as y15 (A ∪ B ∪ C ∪ D ∪ E → y15).

In the horizontal way of extraction, the features were extracted in the same manner as in the vertical way, giving one of z1 to z15.
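The fifteen cases are exactly the contiguous runs of blocks A to E, ordered by length and then by starting block, so they can be indexed programmatically. The sketch below assumes that the vertical way uses five side-by-side column bands and the horizontal way five stacked row bands; the paper does not state this explicitly. Returning `None` for an empty image or a non-contiguous pattern (for example A and C only) is also an assumption, as is the function name.

```python
def band_feature(img, axis="vertical", bands=5):
    """Return 'y1'..'y15' (vertical) or 'z1'..'z15' (horizontal), or None."""
    height, width = len(img), len(img[0])
    if axis == "vertical":
        # True for every column that contains at least one black pixel.
        profile = [any(img[r][c] for r in range(height)) for c in range(width)]
        prefix = "y"
    else:
        # True for every row that contains at least one black pixel.
        profile = [any(row) for row in img]
        prefix = "z"
    step = len(profile) // bands
    occupied = [i for i in range(bands) if any(profile[i * step:(i + 1) * step])]
    if not occupied:
        return None
    first, last = occupied[0], occupied[-1]
    if len(occupied) != last - first + 1:
        return None  # occupied blocks are not contiguous: not one of the 15 cases
    # Contiguous runs ordered as in Step1..Step15: singles, pairs, triples, ...
    runs = [(s, s + n) for n in range(1, bands + 1) for s in range(bands - n + 1)]
    return f"{prefix}{runs.index((first, last + 1)) + 1}"
```

For a 30 x 30 zone each block is six pixels wide; a stroke covering blocks B to E maps to y14, matching Step14.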

Axis based pixel location:

In feature extraction, the axis based pixel location was calculated as one set of features. The selected image was divided into three axis based subdivisions, as shown in Fig. 6. The division was based on the x-axis and y-axis value ranges 0-10, 11-20 and 21-30.

The following procedures were used for picking the features from selected image.

Algorithm 2

INITIALIZE [R4, C4] ← SIZE OF THE IMAGE
[D1 ← 0, E ← 0, r ← 1, c ← 1, Score[C]]
[LOC_FEA1 ← 0, LOC_FEA2 ← 0, LOC_FEA3 ← 0]
[X1, Y1, Z1] ← DIVIDE {R4} BY 3 [ROW WISE]
[X2, Y2, Z2] ← DIVIDE {C4} BY 3 [COLUMN WISE]
FOR EACH (MAX_OF_Y1+1 to R4)
   X1 ← r and incremented by 1
   FOR EACH (r to MAX_OF_X2)
      Y1 ← c and incremented by 1
      FIND [X1, Y1] POSITION equal to 1
         DO (Score[C] = 1)
FIND Score[C] equal to 1
LOC_FEA1 ← 1                   -- (1)


[Continued the procedure until all features are found as shown in Fig.6.]

Algorithm 2 was applied to each pixel in the axis blocks (1-10), (11-20), (21-30), (1-10 and 11-20), (11-20 and 21-30) or (1-10, 11-20 and 21-30), yielding one of the location features LOC_FEA1, LOC_FEA2, LOC_FEA3, LOC_FEA4, LOC_FEA5 or LOC_FEA6.
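A compact way to picture the six axis based features is as a lookup from the set of occupied ranges to a feature name. The sketch assumes that the six block combinations map to LOC_FEA1 through LOC_FEA6 in the order listed, that the three ranges are equal thirds of the chosen axis, and that any other pattern returns `None`; the names `AXIS_FEATURES` and `axis_location_feature` are hypothetical.

```python
# Occupied thirds (0 = 1-10, 1 = 11-20, 2 = 21-30) -> feature name (assumed order).
AXIS_FEATURES = {
    (0,): "LOC_FEA1",
    (1,): "LOC_FEA2",
    (2,): "LOC_FEA3",
    (0, 1): "LOC_FEA4",
    (1, 2): "LOC_FEA5",
    (0, 1, 2): "LOC_FEA6",
}


def axis_location_feature(img, along_rows=True):
    """Name the feature for the thirds of one axis that contain black pixels."""
    height, width = len(img), len(img[0])
    if along_rows:
        lines = [any(row) for row in img]
    else:
        lines = [any(img[r][c] for r in range(height)) for c in range(width)]
    step = len(lines) // 3
    hit = tuple(i for i in range(3) if any(lines[i * step:(i + 1) * step]))
    return AXIS_FEATURES.get(hit)
```

Calling the function once with `along_rows=True` and once with `along_rows=False` would give one feature per axis for each 30 x 30 zone.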

Points Count:

Another set of features was calculated from 'pixel count' values. The following procedure was used to find the pixel count features.

In each sub image, the row-wise and column-wise black pixel counts were calculated. (If one pixel was found in a row or column, it was taken as one count; multiple pixels found in one row or column were also considered as one count.)

Step1: Both the row count and the column count were one. This might be a dot. This was considered as "Q1".

Step2: Both the row count and the column count were more than one. The mean values of the row and column values were calculated and compared.

* If both mean values were almost equal, the shape might be a diagonal curve or line. This was considered as "Q2".

* If the mean value of the row count was higher than the mean value of the column count, the shape might be a horizontal line or curve. This was considered as "Q3".

* If the mean value of the row count was lower than the mean value of the column count, the shape might be a vertical line or curve. This was considered as "Q4".
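The Q1 to Q4 rules can be sketched as follows, under one interpretation of the text. The dot test uses the number of occupied rows and columns; for Step2 the "mean value" is read as the mean number of black pixels per occupied row or column, since that is the reading under which a horizontal stroke gets the higher row mean. The 25% tolerance for "almost equal", the handling of cases where only one of the two counts is one, and the function name are all assumptions.

```python
def points_count_feature(img, tol=0.25):
    """Classify a sub image as 'Q1' (dot), 'Q2', 'Q3' or 'Q4'; None if empty."""
    height, width = len(img), len(img[0])
    # Black pixel totals for the rows and columns that contain any black pixel.
    row_sums = [sum(row) for row in img if any(row)]
    col_sums = [
        s for s in (sum(img[r][c] for r in range(height)) for c in range(width)) if s
    ]
    if not row_sums:
        return None
    if len(row_sums) == 1 and len(col_sums) == 1:
        return "Q1"  # a single occupied row and column: a dot
    row_mean = sum(row_sums) / len(row_sums)
    col_mean = sum(col_sums) / len(col_sums)
    if abs(row_mean - col_mean) <= tol * max(row_mean, col_mean):
        return "Q2"  # almost equal: diagonal line or curve
    return "Q3" if row_mean > col_mean else "Q4"  # horizontal vs vertical
```

A straight horizontal stroke has many pixels in few rows and so yields Q3, while a vertical stroke yields Q4.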

Fig. 7 shows sample features collected by the procedures described above, where AA denotes the Tamil character '[TEXT NOT REPRODUCIBLE IN ASCII.]', AAA denotes the Tamil character '[TEXT NOT REPRODUCIBLE IN ASCII.]', and so on.

Classification:

The statistical classifier Support Vector Machine (SVM) was used to predict the character from the given features. SVMs have been used in many applications with strong results. The training samples acted as support vectors from which the SVM identified the characters. The SVM hyperplane is given by $W^T X_i + B = 0$, where $W$ is the weight vector, calculated by the Lagrangian procedure from the feature space gathered from the feature samples, $X_i$ are the labeled feature samples and $B$ is the bias. The Multiclass SVM predicted the class from the chain of conditions $Y_i (W^T X_i + B) > a_1 < Y_i (W^T X_i + B) > a_2 < Y_i (W^T X_i + B) < \dots$, where $Y_i = 1$ or $-1$ and $a_1, a_2, \dots, a_N$ are threshold values calculated from the values $X_i$ and $Y_i$.
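To make the decision rule concrete, the sketch below evaluates the linear decision value $W^T X + B$ and picks a class with a one-vs-rest scheme. One-vs-rest is a common multiclass strategy swapped in here for illustration; it is not the threshold chain described in the paper, and training (finding $W$ and $B$) is not shown. The function names and the dictionary layout of `models` are hypothetical.

```python
def decision_value(weights, bias, features):
    """Linear SVM decision value W^T x + B."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_one_vs_rest(models, features):
    """Pick the label whose binary classifier gives the largest decision value.

    models maps each label to a (weights, bias) pair for a classifier
    trained to separate that label (+1) from all others (-1).
    """
    return max(models, key=lambda label: decision_value(*models[label], features))
```

With thirty characters, `models` would hold thirty (weights, bias) pairs, and the sign of each decision value corresponds to $Y_i = 1$ or $-1$ for that character.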

3. Experimental Results:

Data collection:

In total, thirty characters, both vowels and consonants, were chosen from the Tamil character set. The data samples were taken from the HP data set and also gathered from handwritten papers by various people. In total we gathered 12000 samples, of which 7200 were used for training and 4800 for testing. Features were extracted from each sample set, stored in Microsoft Excel and fed into Multiclass SVM code written in Matlab. An overall accuracy of 89% was achieved on the testing samples. The final experimental results are shown in Table 1, and Fig. 8 shows the achieved accuracy rates as a graph.
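The reported overall figure can be checked against the per-character values in Table 1. The short script below copies the thirty accuracies from that table and computes their mean together with the training fraction; the variable names are illustrative only.

```python
# Per-character accuracies (%) copied from Table 1, S. No 1 to 30.
TABLE_1_ACCURACY = [
    90.1, 89.3, 92, 90, 89.2, 91.2, 89.5, 89, 88.5, 85.1,
    83, 95.3, 91.1, 92, 88.1, 87.1, 95.1, 85, 87.2, 85.2,
    95, 91.4, 87, 89.3, 85.2, 88, 86.2, 89.5, 86.3, 89.4,
]

mean_accuracy = sum(TABLE_1_ACCURACY) / len(TABLE_1_ACCURACY)
train_fraction = 7200 / 12000  # share of the 12000 samples used for training

print(f"mean accuracy: {mean_accuracy:.2f}%")
print(f"training share: {train_fraction:.0%}")
```

The mean of the tabulated values is about 89.01%, consistent with the 89% quoted in the text, and the split corresponds to 60% training and 40% testing.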

4. Conclusion and Future Work:

This paper presents an offline Tamil handwritten character recognition system using the SVM learning algorithm. Two feature selection algorithms (zone and chain code) and several statistical feature extraction algorithms (pixel based location, vertical and horizontal way of feature extraction, axis based pixel location and points count) were implemented and achieved a significant result of 89% on the testing samples. The algorithms used in this paper are well suited to cursive characters. If the writing style varies to an extreme degree, however, these algorithms lead to poor results.

ARTICLE INFO

Article history:

Received 12 October 2014

Received in revised form 26 December 2014

Accepted 1 January 2015

Available online 25 February 2015

REFERENCES

Akshay Apte and Harshad Gado, 2010. Tamil character recognition using structural features.

Antony Robert Raj, M. and S. Abirami, 2012. A Survey on Tamil Handwritten Character Recognition using OCR techniques. The Second International Conference on Computer Science, Engineering and Applications (CCSEA), 05, pp: 115-127.

Antony Robert Raj, M. and S. Abirami, 2013. Analysis of Statistical Feature Extraction Approaches used in Tamil Handwritten OCR. 12th Tamil Internet Conference- INFITT, pp: 144-150.

Antony Robert Raj, M. and S. Abirami, 2014. Offline Tamil Handwritten Character Recognition using Chain Code and Zone based features. 13th Tamil Internet Conference- INFITT, pp: 28-34.

Bhattacharya, U., S.K. Ghosh and S.K. Parui, 2007. A Two Stage Recognition Scheme for Handwritten Tamil Characters. Ninth International Conference on Document Analysis and Recognition, 1: 511- 515.

Jagadeesh Kannan, R. and R. Prabhakar, 2008. An improved Handwritten Tamil Character Recognition System using Octal Graph. Int. J. of Computer Science, ISSN 1549-3636, 4(7): 509-516.

Jun Cao, M. Ahmadi and M. Shridhar, 1995. Recognition of Handwritten Numerals with Multiple Feature and Multistage Classifier. Elsevier, Pattern Recognition, 28(2): 153-160.

Pal, U., T. Wakabayashi and F. Kimura, 2007. Handwritten numeral recognition of six popular scripts. Ninth International conference on Document Analysis and Recognition ICDAR, 2: 749-753.

Rajashekararadhya, S.V. and P. Vanaja Ranjan, 2008. Efficient Zone based Feature Extraction Algorithm for Handwritten Numeral Recognition of Four Popular south Indian Scripts. Int. J. of Theoretical and Applied Information Technology, pp: 1171-1181.

Rajashekararadhya, S.V. and P. Vanaja Ranjan, 2008. Neural Network Based Handwritten Numeral Recognition of Kannada and Telugu Script. IEEE TENCON Conference, pp: 1-5.

Rajashekararadhya, S.V. and P. Vanaja Ranjan, 2009. Zone-Based Hybrid Feature Extraction Algorithm for Handwritten Numeral Recognition of two popular Indian Script. World Congress on Nature & Biologically Inspired Computing, pp: 526-530.

Rajashekararadhya, S.V., P. Vanaja Ranjan and V.N. Manjunath Aradhya, 2008. Isolated Handwritten Kannada and Tamil Numeral Recognition: A Novel Approach. First IEEE International Conference on Emerging Trends in Engineering and Technology, pp: 1192-1195.

Ramanathan, R., S. Ponmathavan, L. Thaneshwaran, Arun S. Nair and N. Valliappan, 2009. Tamil font Recognition Using Gabor and Support vector machines. International Conference on Advances in Computing, Control, & Telecommunication Technologies, pp: 613-615.

Shanthi, N. and K. Duraiswami, 2007. Performance Comparison of Different Image size for Recognizing unconstrained Handwritten Tamil character using SVM. Journal of Computer Science, 3(9): 760-764.

Shanthi, N. and K. Duraiswami, 2010. A Novel SVM-based Handwritten Tamil character recognition system. Springer, Pattern Analysis & Application, 13(2): 173-180.

Sigappi, A.N., S. Palanivel and V. Ramalingam, 2011. Handwritten Document Retrieval System for Tamil Language. International Journal of Computer Application, ISSN: 0975-8887, 31(4): 42-47.

Sukalpa Chanda, Srikanta Pal and Umapada Pal, 2008. Word-wise Sinhala Tamil and English Script Identification Using Gaussian Kernel SVM. IEEE, 19th international conference on pattern recognition, ICPR, pp: 1-4.

Sureshkumar, C., T. Ravichandran, 2010. Recognition and Conversion of Handwritten Tamil Characters. International Journal on Research and Reviews in Computer Sciences, 1(4): 158-163.

(1) M. Antony Robert Raj and (2) S. Abirami

(1) Department of Information Science and Technology, Anna University, Chennai, India -- 600 025.

(2) Department of Information Science and Technology, Anna University, Chennai, India -- 600 025.

Corresponding Author: M. Antony Robert Raj, Department of Information Science and Technology, Anna University, Chennai, India--600 025.

E-mail: antorobert@gmail.com

Table 1: Accuracy achieved.

S. No   Vowels & Consonants   Accuracy Achieved (%)

1       [??]                  90.1
2       [??]                  89.3
3       [??]                  92
4       [??]                  90
5       [??]                  89.2
6       [??]                  91.2
7       [??]                  89.5
8       [??]                  89
9       [??]                  88.5
10      [??]                  85.1
11      [??]                  83
12      [??]                  95.3
13      [??]                  91.1
14      [??]                  92
15      [??]                  88.1
16      [??]                  87.1
17      [??]                  95.1
18      [??]                  85
19      [??]                  87.2
20      [??]                  85.2
21      [??]                  95
22      [??]                  91.4
23      [??]                  87
24      [??]                  89.3
25      [??]                  85.2
26      [??]                  88
27      [??]                  86.2
28      [??]                  89.5
29      [??]                  86.3
30      [??]                  89.4

COPYRIGHT 2015 American-Eurasian Network for Scientific Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Raj, M. Antony Robert; Abirami, S.
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:Jun 1, 2015