gogaku

Gogaku is an implementation of Nei Kato's Directional Feature Extraction algorithm for kanji representation (see citations below).

In brief, each kanji is read in as a 64x64 binary image and converted into a vector of 196 positive integers. The distance between two kanji can be reasonably modeled as the Euclidean distance between their 196-dimensional vectors, although other distance functions may be substituted to adjust performance.
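As a rough illustration of the distance described above, here is a minimal Go sketch. The names featureLen and euclidean are made up for this example and are not taken from the gogaku source; only the vector length of 196 comes from this README.

```go
package main

import (
	"fmt"
	"math"
)

// featureLen is the vector length stated in this README.
// The constant name itself is hypothetical.
const featureLen = 196

// euclidean returns the Euclidean distance between two feature vectors.
// This is an illustrative helper, not gogaku's actual function.
func euclidean(a, b [featureLen]int) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i] - b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

func main() {
	// Two toy vectors that differ in only two components (3 and 4),
	// so the distance is that of a 3-4-5 right triangle.
	var a, b [featureLen]int
	a[0] = 3
	b[1] = 4
	fmt.Println(euclidean(a, b)) // prints 5
}
```

Swapping in another distance function only means replacing the body of the loop, which is why the choice of metric is easy to experiment with.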

This implementation compares an input image to each of the two thousand or so Jouyou kanji and returns the closest match.
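The comparison step amounts to a linear nearest-neighbor scan. The sketch below shows one way to write it in Go, under stated assumptions: the entry type, sqDist, and closest are hypothetical names, the toy vectors have 3 components instead of 196, and squared Euclidean distance is used because it ranks candidates in the same order as the true distance.

```go
package main

import "fmt"

// entry is a hypothetical database record: a kanji and its feature vector.
type entry struct {
	kanji    string
	features []int
}

// sqDist returns the squared Euclidean distance between a and b.
// It assumes both slices have the same length.
func sqDist(a, b []int) int {
	sum := 0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// closest scans every entry and returns the one nearest to query.
// ok is false when db is empty.
func closest(query []int, db []entry) (best entry, ok bool) {
	bestDist := 0
	for _, e := range db {
		d := sqDist(query, e.features)
		if !ok || d < bestDist {
			best, bestDist, ok = e, d, true
		}
	}
	return best, ok
}

func main() {
	// Toy 3-dimensional vectors stand in for the real 196-dimensional ones.
	db := []entry{
		{"日", []int{1, 0, 0}},
		{"月", []int{0, 5, 0}},
		{"木", []int{0, 0, 9}},
	}
	if m, ok := closest([]int{0, 4, 1}, db); ok {
		fmt.Println(m.kanji) // prints 月
	}
}
```

With only a couple of thousand entries, a full scan like this is cheap, so no index structure is needed for a sketch.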

requirements

Gogaku requires a reasonably modern version of the Go compiler to build, and training data generation requires Python 2.7 (or similar) and the Python Imaging Library. The automated scripts require a Bourne shell or similar.

setup

After cloning the gogaku repository, run build.sh to build the binaries. Once they are built (the trainer binary in particular), run gentrain.sh to create the Jouyou training dataset.

Warning: the Jouyou dataset is fairly large. The set is included as a text file, but running gentrain.sh will create around 9 MB of PNG images in a directory called img/training. Additionally, these files will be generated with UTF-8 filenames, which may not display properly on your system.

execution

The recog binary is used to recognize kanji. It takes a 64x64 kanji image and a kanji database file as input. The kanji image should be black and white, with black strokes on a white background. Anti-aliased strokes are fine, since any non-white pixel is treated as black. The Jouyou database file is generated by default at txt/db.txt by running gentrain.sh.

miscellanea

I've included the Arial Unicode MS font for rendering of the dataset. I'm not sure if it's legal, but I take pride in how quickly and diligently I respond to cease and desist letters.

known issues

The current default dataset is rendered in Arial Unicode MS, and as a result sometimes does not match well with natural, handwritten characters. Though accuracy is often quite good, I plan to write a parser soon for the ETL9B dataset, which consists of actual handwritten kanji. I believe this will boost accuracy by quite a bit.

citations

1) Nei Kato, Masato Abe, and Yoshiaki Nemoto, "A Fine Classification Method of Handwritten Character by Using Automatic Learning Algorithm of Partial Area Matching," The Transactions of IEICE D-II (Japanese Edition), Vol. J78-D-II, No. 3, pp. 492-500, 1995.

2) Nei Kato, Masato Abe, and Yoshiaki Nemoto, "A Handwritten Character Recognition System by Using Improved Directional Element Feature and Subspace Method," The Transactions of IEICE D-II (Japanese Edition), Vol. J78-D-II, No. 6, pp. 922-930, 1995.

3) Nei Kato, Masato Suzuki, Shinichiro Omachi, Hirotomo Aso, and Yoshiaki Nemoto, "A Handwritten Character Recognition System Using Directional Element Feature and Asymmetric Mahalanobis Distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 3, pp. 258-262, 1999.