DSpace Repository

Hybrid approach for spell checking of tamil language

Show simple item record

dc.contributor.author Jananie, S.
dc.contributor.author Sarveswaran, K.
dc.date.accessioned 2021-11-22T06:08:39Z
dc.date.accessioned 2022-06-27T09:57:58Z
dc.date.available 2021-11-22T06:08:39Z
dc.date.available 2022-06-27T09:57:58Z
dc.date.issued 2014
dc.identifier.uri http://repo.lib.jfn.ac.lk/ujrr/handle/123456789/4222
dc.description.abstract The spell checkers are specialised application programs that flags words in a document that may be misspelled. Though there are several spell checkers available for languages like English, no fully functional application is available for the Tamil language. The existing systems either find the misspelled words from an existing list of words stored in those systems or Canti mistakes. Omission of a required letter or inclusion of an inappropriate letter between two adjoined words is called Canti mistake. Further, several issues have been also identified in these systems. A new approach for Tamil spell checker has been proposed in this research by integrating existing approaches and new approaches such as rule-base, crowd sourcing and suggestions generation using character level n-gram. According to the proposed approach, each word is checked whether it exists in the dictionary using a Levenshtein distance finding algorithm. If it does not exist, then the n-gram based technique is used to generate possible suggestions for the given word. And required rules are written to get the appropriate suggestions by considering Canti check as well to identify the appropriate joining letter of two adjoined words. A list of 250,000 unique and error-free words are included in the dictionary. These words have been collected from various sources, including websites. It is very difficult to gather all the words in Tamil language. Therefore, add to dictionary option has been introduced to collect new words from users and add to the existing dictionary after the moderation. To reduce the search space, the dictionary has been divided into different files based on the first letter of the word. Due to the complex nature of Tamil script compared to English, stacks and lists have been used during the processing of words. These rules have been written in such a way that it can be extended further in future. All these processing is being done without Romanising the Tamil text, while in most of the other approaches Tamil language is processed in Romanised form. The proposed system gives better accuracy than the existing systems; 85.77% accuracy was noted when considering the suggestions generation. This result had been calculated by analysing the suggestions generated by the system for the words that are not in the dictionary. Hence the proposed approach, which has dictionary check with Levenshtein algorithm, suggestions generation with n-grams, Canti check with a rule-base and crowd sourcing, is a complete solution for Tamil spell checking. en_US
dc.language.iso en en_US
dc.publisher Proceedings of the Peradeniya Univ. International Research Sessions, Sri Lanka en_US
dc.title Hybrid approach for spell checking of tamil language en_US
dc.type Article en_US


Files in this item

This item appears in the following Collection(s)

Show simple item record