Posts Tagged ‘data structures’

zkanji under-the-hood #1

November 17, 2010 2 comments

DISCLAIMER: this entry is the first in a series I’ll be trying to write about the inner workings of zkanji. It might not be suitable for non-programmers while programmers might find it too trivial. :p

Today I would like to write about how the program stores its dictionary in memory, and how the search works for both the kana and the meaning look-up. Many users noticed that zkanji takes up a fair amount of memory when running. This is mainly because JMDict (the dictionary that I convert with a perl script before every new release) has over 180,000 entries, and it is kept in memory without caching. It’s a huge amount of data, and it has to be stored in a structure which must be fast enough to search and easy to expand.
I have seen other programs using freely available database engines (like SQLite) to solve the problem of database storage and look-up. I haven’t tried any of these myself, but I wanted to write the whole program from scratch anyway, so I had to think up something instead of being a good programmer (=lazy). It wasn’t really difficult to find the perfect solution, a specialized tree data structure to hold indexes to every word, and I could finish the first version of the dictionary in a few days.

Fig. 1: data structure of the word dictionary

Everyone knows about trees so I won’t be going into much detail about them in general. In zkanji the words are held in a big list without any special structuring, only their indexes (the number representing their order in the list) are stored in a tree. The words can be (and are) in random order in the list because there is no need to search in it. There are at least 26 (but usually more) trees representing a letter of the English alphabet. All nodes that make up a tree  have a label, which is a single-byte character buffer of lower case characters (not wide characters nor MBCs). The root node’s label is a single character, while each child node of a parent node must have a label at least one character longer than its parent’s, and the label must start with the parent’s label. For example if the topmost node is labeled “a”, its direct child nodes can be “ab”, “ac” … “az”, but also “aba”, “abc”, “ame”, “amphibious” etc. The direct child nodes for “ame” can be anything that starts with ame, for example “amend”, “ameri” etc. The order of the nodes is important too. They must come in alphabetical order because otherwise looking up the right one would take much longer.
Beside labels, every node has a list of word indexes for the words that start with the node’s label, but not with the child nodes’ labels. If a word starts with a child node’s label, its index is probably in the list of the child node, or in the list of the child node’s child node.

In case the user wanted to look up the word “Japanese”, the tree to look for it would be the one with the root labeled “j”. The first step is to find the direct child node labeled “ja”, “jap”, “japa” etc. Any of these can be right under “j” but only one of them. (If there is “ja”, there won’t be “jap” or “japa” or anything longer starting with “ja”.) The program keeps walking the nodes until the one with the longest label, which is still shorter or same as “Japanese”, is found. For example if it stops at the node labeled “japa”, the resulting node’s word index list will probably contain the index for the word “Japanese” too, if it’s in the dictionary at all. The last step after this is to find “Japanese” among the resulted data, but that’s done with a brute force algorithm, taking the word indexes and finding the word structures in the main list.

Not just the word meanings can be looked up in such trees but Japanese words’ kana forms too (from different trees), though those are first converted to romaji before walking the tree. For end-of-word searches there are another set of trees with labels for kana words written in romaji backwards.

Because most word structures hold several meanings for Japanese words and each meaning is usually made up of more than one word, there had to be a way to find the one word structure by searching for any of those. The solution is easy enough. The index of such words are in several nodes’ list.
For example the word 辞書 (jisho) has two meanings: 1) dictionary, lexicon 2) letter of resignation
When the user searches for “dictionary”, “lexicon”, “letter” or “resignation”, the program walks the tree for the given word and finds the index at the right place. (Storing those words was much more difficult, but I will write about that later.)

Let’s say, that the average length of an English word is 5 characters long, each Japanese word has 2 meanings and each meaning has 3 words in the definition in average (numbers were made up on the spot). That would mean that a word needs at least (5*3+2)*2 =34bytes (5 chars, 3 words, 2 spaces between words, 2 meanings) just for the definition, and we haven’t counted the required memory for storing the addresses of these words, the structures etc. If the kana form takes up around 12 bytes (6 wide characters), the kanji form another 8 (4 wide characters) then that is already 54 bytes for a single word. Adding pointers to the mix this number grows much higher.

This is the exact c++ structure for a single word entry:

// each meaning in a word is stored in such structures
struct TWordData {
 char *meaning; // meaning (4 bytes + character string + 1 byte for the
                // trailing 0)
 int types; // word types, each bit representing a different type (4 bytes)
 int notes; // word notes, same as above (4 bytes)
 int fields; // word fields, same as above (4 bytes)
 word nametag; // nametag, same as above (will be used when name dictionaries
               // are incorporated) (2 bytes)
}; // counted twice if we assume 2 meanings per word (38 bytes total)

// data for the word itself
struct TWord {
 int frequency; // word frequency (4 bytes)

 wchar_t *kanji; // pointer to kanji string (4 bytes + wide character string
                 // + 2 bytes for the trailing 0)
 wchar_t *kana; // pointer to kana string (4 bytes + wide character string
                // + 2 bytes for the trailing 0)
 char *romaji; // pointer to romaji representation of kana (4 bytes + single
               // byte character string + 1 byte for the trailing 0)

 TWordData *data; // array of meanings and types (4 bytes)
 byte meaningcnt; // number of meanings (1 byte)

 TWordStatGlobal *stats; // (4 bytes - used in some tests)
 TWordGroups *group; // linked list of groups this word is in (4 bytes at least)

 TExampleData *exdata; // data for example words linked to this one (4 bytes +
                       // data)
 word examplecnt; number of examples for this word (2 bytes)
}; // +4 byte for every TWord structure in the main list (44 bytes total)
44bytes+38bytes = 82bytes total

The sum is 82 bytes just for the structures. The “average” word is 136 bytes so far. Adding the space taken up by indexes for every kana and meaning in trees (2 indexes for kana (normal and end-of-word) = 2*4 = 8 bytes, 6 indexes for every word in meanings = 6*4 = 24 bytes) and the sum reaches 168 bytes. Thus 168*180,000 is around 30 megabytes just for the word definitions.

A single kanji (you can look up the structures in kanji.h and collection.h in the source) takes up a bit more than 140 bytes as well, but there are only 6353 kanji listed in zkanji so they don’t take up more space than 889,420 bytes.

Adding the 3.3 megabytes of the stroke order data on the disk (probably 4MB with the additional structures when loaded) makes the sum more than 35MBs, but it’s still nowhere near the actual memory usage indicated by windows.

A freshly “installed” (unpacked) zkanji takes up 67 megabytes of memory space on my machine, and half of that is taken up by the dictionary. What takes up the rest? The value I just counted is probably nowhere near the real memory usage, but I still have no idea where the rest of the memory goes to. My guess is that windows generously allocates more space for every program than they actually use. If I rename the stroke order data and run zkanji, the used memory falls to 47 megabytes which either means that the 3.3 megabytes large data file is either “extracted” to 20 megabytes in memory somehow, or that windows simply uses that much unnecessarily. (I will probably look into this later.)