Khmer (Cambodian) Spell Checker
Khmer is the official language of Cambodia. It does not use spaces
to separate individual words, which makes spell checking hard unless word
boundaries are provided. Users can provide them by placing invisible spaces
between words, but this means that every segmentation point in an
existing document has to be entered manually, which is time consuming.
There is another problem. If there is
no alternative to this approach, Khmer will eventually come to
use spaces between words. This implies that we would need to change our
language to meet the requirements of technology. Were this to happen, Khmer,
which represents not only the language but also the people, culture
and history, would lose one of its unique characteristics. From a
human responsibility perspective, if the
Khmer people fail to offer an alternative solution to the problem, we
should not be proud of being Khmer in the 21st century.
Therefore, the immediate objective of this project is to develop an
effective, portable and free word segmentation algorithm for
everyone. So far, a number of Python prototypes have been released. To
download them, go to this project's download site. The prototype bundles
a word segmentation algorithm with spell checking functionality and a
Graphical User Interface, which makes it hard for others to integrate the
algorithm with their own applications. Moreover, it has not yet met its
performance target of 95 percent accuracy. So, this
project will work on the following tasks in the immediate future:
- Keep improving the existing prototype.
- Implement the word segmentation algorithm from the existing
prototype in C++ with GCC and put it under a license which is
compatible with both the open source and commercial software
communities. The existing prototype is licensed under the GNU Lesser
General Public License (LGPL), which restricts its use in
commercial applications. The prototype itself will be kept under the LGPL
because it may be harder to use directly anyway.
- If the C++ implementation proves to be useful, it will be
turned into a library and ported to the Windows platform.
These objectives may differ from the initial project description
of khspell, which was intended to integrate khspell with hunspell, used
by Open Office. The initial description had to be revised to the
current objectives once it was realised that Open Office can
spell check Khmer as long as a customised Khmer dictionary and word
segmentation points are provided. This means that having an effective
Khmer word segmentation algorithm is the most important goal for the
future of the computerised Khmer language. It is not so important whether
this algorithm makes its way into Open Office or not. If this implementation
is useful, effective and better than any other available solution, it
will eventually be available in Open Office.
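The observation that Khmer can be spell checked once a dictionary and word
segmentation points are available can be illustrated with a small Python
sketch. This is not the khspell algorithm, which is not described here; it
uses plain greedy longest matching against a word list as a stand-in. The
function names (segment, mark_boundaries, misspelled), the tiny sample
dictionary, and the choice of the zero-width space (U+200B) as the
"invisible space" are all assumptions made for illustration.

```python
ZWSP = "\u200b"  # assumption: zero-width space used as the invisible word separator


def segment(text, dictionary):
    """Split unspaced text into words by greedy longest matching (illustrative only)."""
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # nothing matched: emit a single character as an unknown token
        words.append(text[i:j])
        i = j
    return words


def mark_boundaries(text, dictionary):
    """Insert an invisible separator at every segmentation point."""
    return ZWSP.join(segment(text, dictionary))


def misspelled(marked_text, dictionary):
    """With boundaries in place, spell checking is a per-word dictionary lookup."""
    return [w for w in marked_text.split(ZWSP) if w and w not in dictionary]


if __name__ == "__main__":
    # Hypothetical toy dictionary; a real checker needs a full Khmer word list.
    sample_words = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
    dictionary = set(sample_words)
    marked = mark_boundaries("".join(sample_words), dictionary)
    print(marked.split(ZWSP))
    print(misspelled(marked, dictionary))
```

Greedy matching is easy to follow but can pick the wrong boundary when one
dictionary word is a prefix of another, which is one reason a segmentation
algorithm needs an accuracy target in the first place.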
The longer-term aim is to promote science, research and technology in
solving the remaining problems in Khmer computational linguistics once and
for all, through the Open Source and Free Software community.
Copyright (C) 2006 by Puthick Hok
Last update: 25th July 2006. If you have any comments, please drop me
an email. I would like to hear from you. My address is puthick
"AT" users.sourceforge.net. This address will forward your mail to my
daily mailbox.