Data handling in zkanji mini-series, Part II.

*Imagine there is a disclaimer for non-programmers here.*

This is only the second entry in a planned short series about the data used in zkanji. I realized that the last entry was more detailed than necessary for such a series, so I’ll try to limit myself to the “bare minimum”. In the previous entry I described the format of the word dictionary as the program sees it before it builds its internal data. I could do the same for the kanji, but there is no magic there. The KANJIDIC file does not need to be pre-processed before working with it, as it is in a very simple one-line-per-kanji format. The data is not organized in any special way; all kanji searches use a brute-force algorithm that checks every entry in turn. (Looking something up among so few entries is no challenge for a modern processor.)
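To make the brute-force idea concrete, here is a minimal sketch in Python (zkanji itself is not written in Python). The sample lines are made up and heavily simplified: they only mimic the "kanji first, then space-separated fields" shape, and the sketch reads just two field kinds, `S<n>` for stroke count and `{...}` for meanings. The names `parse_line`, `load` and `search` are hypothetical, not zkanji's.

```python
import re

# Made-up, simplified sample in a one-line-per-kanji layout.
# The real KANJIDIC file carries many more fields per line.
SAMPLE = """\
日 S4 G1 {day} {sun} {Japan}
月 S4 G1 {month} {moon}
語 S14 G2 {word} {speech} {language}
"""

def parse_line(line):
    """Split one line into the kanji, its stroke count and its meanings."""
    kanji, rest = line.split(" ", 1)
    strokes = re.search(r"(?:^|\s)S(\d+)", rest)
    meanings = re.findall(r"\{([^}]*)\}", rest)
    return {
        "kanji": kanji,
        "strokes": int(strokes.group(1)) if strokes else None,
        "meanings": meanings,
    }

def load(text):
    """Parse every non-empty, non-comment line into a flat list."""
    return [parse_line(l) for l in text.splitlines() if l and not l.startswith("#")]

def search(entries, predicate):
    """Brute force: test every entry and keep the ones that match."""
    return [e for e in entries if predicate(e)]

entries = load(SAMPLE)
four_strokes = search(entries, lambda e: e["strokes"] == 4)
by_meaning = search(entries, lambda e: "language" in e["meanings"])
```

With only a few thousand entries, a linear scan like `search` finishes quickly enough that an index would add complexity without a noticeable benefit.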

When zkanji is run to generate its kanji data, it looks for a file called “kanji.txt” in its data folder, which is the KANJIDIC file saved as Unicode. (When I write Unicode I always mean UTF-16, which is native in the Windows environment. On Linux, for example, wide characters are usually UTF-32, but spending 4 bytes on every single character is overkill in my opinion.) The data is stored in a simple structure in memory and later saved in the same structure, so there is no magic happening here. There are a few other steps, for example the information about kanji radicals is in RADKFILE, which has to be imported separately, but that is not a big deal either.
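The size difference between the encodings is easy to check. The following Python snippet is only an illustration and has nothing to do with zkanji's own code; it encodes the same two kanji in three encodings and counts the bytes.

```python
text = "漢字"  # two common kanji, both in the Basic Multilingual Plane

utf16 = text.encode("utf-16-le")  # 2 bytes per character here
utf32 = text.encode("utf-32-le")  # always 4 bytes per character
utf8 = text.encode("utf-8")       # 3 bytes per character for these kanji

sizes = (len(utf16), len(utf32), len(utf8))
print(sizes)  # (4, 8, 6)

# A rare kanji outside the Basic Multilingual Plane needs a surrogate
# pair, so it takes 4 bytes even in UTF-16.
rare = "\U00020BB7"
rare_size = len(rare.encode("utf-16-le"))
```

Almost all kanji in everyday use fit in a single 16-bit unit, which is why UTF-16 is compact for this kind of data.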

Once everything is imported, I go over all kanji readings and throw out those that are not found in any word. This might seem to be an unnecessary step, but I figured that most of the unused readings are either very rare or only important for researchers (zkanji is mainly for students of Japanese). When this is done, there is nothing else to do with kanji. The stroke order diagrams and animations are not imported: I made them right in zkanji, which saves their data file in its final format. I won’t be explaining that in this mini-series as it is irrelevant to data handling.
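The post does not say how a reading is matched against the words, so the following Python sketch rests on a loud assumption: a reading counts as "used" if some word is written with the kanji and the word's kana contains the reading as a substring. That is a naive test (it ignores sound changes and can produce false positives), the function name `used_readings` is hypothetical, and all readings are written in hiragana here to keep the comparison simple.

```python
def used_readings(kanji, readings, words):
    """Keep the readings that occur in the kana of some word using the kanji.

    words is a list of (written form, kana) pairs.
    Assumption: plain substring matching stands in for real reading analysis.
    """
    kept = []
    for reading in readings:
        if any(kanji in written and reading in kana for written, kana in words):
            kept.append(reading)
    return kept

# Tiny made-up word list; a real dictionary has far more entries.
words = [
    ("日本", "にほん"),
    ("毎日", "まいにち"),
    ("日曜日", "にちようび"),
    ("三日", "みっか"),
]
candidates = ["にち", "に", "ひ", "び", "か", "じつ"]

kept = used_readings("日", candidates, words)
```

In this small sample "ひ" and "じつ" are dropped only because no listed word uses them; with a full dictionary they would survive.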

All of this is also saved in zdict.zkj. When a user creates his or her own dictionary, it is saved in a file with the .zkd extension, which is very similar to zdict.zkj, but for obvious reasons there is no need to save the kanji information a second time.

In the last entry I mentioned that I might explain how the example sentences file is kept. I’ll try now in a few words without going into details. The examples.zkj file was a separate download for a long time, but it is now included in the setup for the program. zkanji can run without it, but it only takes up ~14 megabytes, which is really nothing nowadays. Still, the data is saved compressed in the data file with the free zlib library (no relation to zkanji). The data contains around 150,000 sentences and would take up more than 40 MB if not compressed, so this is a good ratio.

Because I didn’t want to load the examples data into memory when the program starts, the data had to be compressed in a way that still allows a sentence to be accessed quickly and easily when it is needed. The sentences are compressed in packs of 100, and when one has to be displayed, its pack is uncompressed and all 100 sentences are loaded into memory. 100 sentences take up around 26 KB, which is not much. When there is more than 1 MB of sentences in memory, the ones not used for the longest time are freed. This way only about a megabyte of memory is used for example sentences at a time, and browsing them is still lightning fast.
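The pack-and-cache scheme can be sketched in a few lines of Python using its standard `zlib` module. This is not zkanji's code, only an illustration under assumptions: the class and function names are hypothetical, sentences are joined with newlines (so they must not contain one), and the compressed packs are kept in a list, whereas a real program would read each pack from the data file on demand.

```python
import zlib
from collections import OrderedDict

PACK_SIZE = 100          # sentences per compressed pack, as in the post
CACHE_LIMIT = 1_000_000  # rough budget for uncompressed packs, about 1 MB

def build_packs(sentences):
    """Compress the sentences in independent groups of PACK_SIZE."""
    packs = []
    for i in range(0, len(sentences), PACK_SIZE):
        raw = "\n".join(sentences[i:i + PACK_SIZE]).encode("utf-8")
        packs.append(zlib.compress(raw))
    return packs

class SentenceStore:
    """Decompress packs on demand and free the least recently used ones."""

    def __init__(self, packs, limit=CACHE_LIMIT):
        self.packs = packs
        self.limit = limit
        self.cache = OrderedDict()  # pack number -> (sentences, raw size)
        self.used = 0               # total raw bytes currently cached

    def get(self, index):
        pack_no, offset = divmod(index, PACK_SIZE)
        if pack_no in self.cache:
            self.cache.move_to_end(pack_no)  # mark as most recently used
        else:
            raw = zlib.decompress(self.packs[pack_no])
            self.cache[pack_no] = (raw.decode("utf-8").split("\n"), len(raw))
            self.used += len(raw)
            # Free the oldest packs, but never the one just loaded.
            while self.used > self.limit and len(self.cache) > 1:
                _, (_, size) = self.cache.popitem(last=False)
                self.used -= size
        return self.cache[pack_no][0][offset]

sentences = [f"例文 {i}" for i in range(1000)]
store = SentenceStore(build_packs(sentences))
print(store.get(250))  # 例文 250
```

Compressing each pack independently is what makes random access cheap: showing one sentence costs a single small decompression instead of unpacking the whole file.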

Each sentence in the examples file contains additional data, like which words are in it, so when the mouse is moved over a sentence in the program, you can see words underlined, and clicking on them looks them up in the dictionary. Making that work was a bit complicated, but it is working fine, so I have had no reason to look into it again to refresh my memory, and I won’t explain it in detail.

The next time (which is hopefully soon) I’ll write about user data files, like word and kanji groups and user made dictionaries.

Categories: Under-the-hood Tags: