Dictionary file changes

It is not uncommon for me to change the format of the data files, for example to add a new kind of data to every group, but that doesn’t require much change in the rest of the program. Fundamental changes to how the dictionary or other data files are handled, on the other hand, don’t happen often. Unfortunately, although such changes require a lot of work and rigorous testing, the user usually doesn’t notice much right away. This time won’t be different, so please be patient.

To explain what changes will be made and why, I need to describe how the program currently works with the dictionary files. Be warned that the explanation might be difficult for some readers. I’ll write a few words about the benefits this change will bring at the end of the post.

The problem

zdict.zkj – This file contains most of the data required for zkanji to work. It mainly holds the English dictionary, but some parts of it are not saved in user dictionaries, so a user dictionary cannot replace it. When zkanji starts up, it first looks for this file in the data folder, and if the file is not found the program can’t start. It’s not enough for the file to be in the data folder; it must also be a version the program understands. (zkanji might be able to load some older formats, but formats that are newer or too old won’t be recognized.)
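The version rule above can be sketched in a few lines of Python. The version numbers and the function name are made up for illustration; the real layout of zdict.zkj is not described in this post.

```python
# Hypothetical version numbers, chosen only to illustrate the rule.
OLDEST_LOADABLE_VERSION = 3   # anything older is "too old"
CURRENT_VERSION = 7           # anything newer is unknown to this build

def can_load(file_version: int) -> bool:
    """Accept the current format and the older formats that can still be read."""
    return OLDEST_LOADABLE_VERSION <= file_version <= CURRENT_VERSION

for version in (2, 5, 7, 8):
    print(version, can_load(version))
```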

As zdict.zkj is essential in the form it is distributed, it is currently impossible to add, remove or change the words in it. So many features depend on the file that I can’t even recall most of them, and whenever I make a change I have to make sure everything still works.

Still, even this main dictionary file can change between releases. The JMDict project, which provides the Japanese-English dictionary database, is still very much alive: new words are constantly added, old ones are updated, and even common words can change often. While this is a great thing, it creates a huge problem for programs like zkanji, because user data relies on an unchanging dictionary. For example, if you add the third meaning of a word to a word group, and that meaning later gets merged into the second one because they mean almost the same thing (or at least someone thinks so), the program has no way to notice. In such cases there is no other solution but to remove the invalid entry from the group. The problem is even harder to notice when two meanings are simply swapped: zkanji will still store the “third” meaning in the group, but the original third meaning is now in second place in the new dictionary data.
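A minimal Python sketch of why a stored position is fragile. The meanings and names below are invented for illustration and are not taken from zkanji or JMDict; indexes are 0-based, so the "third" meaning is index 2.

```python
# Made-up meanings of one word before and after a dictionary update.
old_meanings = ["sense A", "sense B", "sense C"]
swapped = ["sense A", "sense C", "sense B"]   # second and third meanings swapped
merged = ["sense A", "sense B; sense C"]      # third meaning merged into the second

def resolve(meanings, index):
    """Return the meaning at a stored 0-based index, or None if it no longer exists."""
    return meanings[index] if 0 <= index < len(meanings) else None

stored_index = 2  # the group only remembers "third meaning"

print(resolve(old_meanings, stored_index))  # sense C
print(resolve(swapped, stored_index))       # sense B (silently the wrong meaning)
print(resolve(merged, stored_index))        # None (the entry became invalid)
```

The merged case can at least be detected because the index falls off the end of the list. The swapped case cannot, since the index is still valid and simply points at a different meaning.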

This simple fact causes headaches for the developer (that is, me) and for the users alike. It also makes it impossible to add some features people need. To change this situation I had to come up with a solution that would work for most of us.

The solution

There is no painless solution to the problem, but I would like to keep the impact on the users as small as possible. If the main data file cannot be changed, the only straightforward solution I can think of is to create a copy of it. It is already possible to create a so-called “user dictionary”, which holds custom word and kanji information. Word and kanji groups can be created in it just like in the main dictionary, and in general it works the same way. (User dictionaries were made possible for people like me whose primary language is not English, but the main English dictionary is still used for many important features.)

If a copy of the main dictionary is created, it can work the same way as any user dictionary. It will be possible to change it in any way the user needs: new words can be added, definitions updated, meanings deleted and so on. User dictionaries are saved as separate files, and the English user dictionary won’t be an exception, so this will take up approximately an additional 30 MB of space on the drive where the user document folder is found (usually your system drive). Hopefully this won’t cause trouble for anyone. As a side effect, however, this also increases memory usage, and memory is still not as abundant as disk space. (Some people will notice that this statement is not entirely correct. I’ll leave it to you to figure out why.)

zkanji in its current form takes up 60-70 MB of space in memory (it takes up 112 MB for me, as I already have my own user dictionary), and it will take up another 40 MB after an English user dictionary is added, because after the change both will be used: the main data for program features that require the unchanged file, and the user dictionary for displaying everything to the user.

This is still just theory, though. It is probably possible to get rid of the main data file, but I’d need to look through the whole program (the source files take up a few MB, that is, a few million typed characters) and figure out exactly what would cause bugs when the main dictionary changes. (For example, the example sentences data, which depends on the original data right now, could be made independent of the English dictionary.)

What will this change mean for the user?

If you had no intention of editing the main dictionary anyway and had no need for any additional features, probably nothing. On the other hand, apart from making it possible for the user to edit the English dictionary (although proposing changes directly to the JMDict project would benefit everyone), user group data won’t change with a new dictionary version anymore. If an entry in the main dictionary changes, you will still have the old data, and zkanji can offer you a dialog where you can pick the meaning from the new dictionary to use in the group. After all changes are accepted, the old user dictionary will be discarded and replaced by an updated one, making it possible to use the updated dictionary.
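One way such a check could work, sketched in Python under the assumption that each group item keeps a copy of the definition text it was created from. `GroupItem`, `find_conflicts` and the sample data are hypothetical and only illustrate the idea; they are not zkanji’s actual design.

```python
from dataclasses import dataclass

@dataclass
class GroupItem:
    word: str
    meaning_index: int   # 0-based position in the old dictionary
    meaning_text: str    # copy of the definition as it was when the item was added

def find_conflicts(items, new_dictionary):
    """Return (item, candidate meanings) pairs for items the new dictionary no longer matches."""
    conflicts = []
    for item in items:
        meanings = new_dictionary.get(item.word, [])
        still_valid = (
            item.meaning_index < len(meanings)
            and meanings[item.meaning_index] == item.meaning_text
        )
        if not still_valid:
            # These are the choices a dialog could offer to the user.
            conflicts.append((item, meanings))
    return conflicts

# Made-up data: the second and third meanings of "word-a" were swapped.
new_dictionary = {
    "word-a": ["first sense", "third sense", "second sense"],
    "word-b": ["only sense"],
}
items = [
    GroupItem("word-a", 2, "third sense"),
    GroupItem("word-b", 0, "only sense"),
]
conflicts = find_conflicts(items, new_dictionary)
for item, candidates in conflicts:
    print(item.word, "needs review, candidates:", candidates)
```

Because the old text is kept, even a plain swap of two meanings is caught, and the user’s original choice is never lost while the question is pending.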

Group export and, more importantly, import can be added. I made a group export some time ago when a user asked for it, although they probably thought that this would also make an import possible. Unfortunately I can never know whether the group data is still valid for another user (for example they might have a different version of the main dictionary, in which some word is missing or different), so it wasn’t possible to create an import without putting a lot of work into making sure it doesn’t create invalid data. With an editable dictionary I can make sure that the imported data looks exactly the same by creating new entries if needed, although I’ll still have to work a bit on a dialog that confirms the user’s choices when entries are different.
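A rough Python sketch of an import that creates missing entries, assuming the export carries the definition text itself rather than a position in the dictionary. The function name and the data layout are invented for illustration.

```python
def import_group(exported_items, dictionary):
    """Import (word, definition) pairs into an editable dictionary.

    `dictionary` maps a word to its list of definitions and is modified in place.
    Returns the imported group as (word, meaning index) pairs, plus the list of
    entries that had to be created because the dictionary lacked them.
    """
    group = []
    created = []
    for word, definition in exported_items:
        meanings = dictionary.setdefault(word, [])
        if definition not in meanings:
            meanings.append(definition)
            created.append((word, definition))
        group.append((word, meanings.index(definition)))
    return group, created

# Made-up data: the importing user's dictionary lacks one meaning and one word.
dictionary = {"word-a": ["first sense"]}
exported = [
    ("word-a", "first sense"),
    ("word-a", "extra sense"),
    ("word-b", "some sense"),
]
group, created = import_group(exported, dictionary)
print(group)
print(created)
```

The `created` list is what a confirmation dialog could show before anything is saved.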

For some time I have wanted to make it possible to hold several meanings of a word in a word group as a single item, not only as separate ones. If, for example, a word has 3 meanings and you wanted to include 2 of them as a single item in a test (i.e. you were not interested in learning the third one), you could only curse the programmer (me), because it wasn’t possible to do so. This feature wouldn’t be impossible even now, but in the current setup it would make user data even more vulnerable to dictionary changes, so I had to give up on the idea. If this is implemented, it will also become less frustrating to collect word groups based on kanji, because only a single item will be added to a word group for each word, instead of a separate one for each meaning.
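As a sketch of what such an item could look like, here is a small Python data structure in which one group item refers to a set of meanings. The class, the function and the data are hypothetical, not zkanji’s actual structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordGroupItem:
    word: str
    meaning_indexes: frozenset  # 0-based indexes of the meanings studied together

def study_text(item, dictionary):
    """Join the selected meanings into the single text shown in a test."""
    meanings = dictionary[item.word]
    return "; ".join(meanings[i] for i in sorted(item.meaning_indexes))

# Made-up word with 3 meanings, of which only the first two are studied.
dictionary = {"word-a": ["first sense", "second sense", "third sense"]}
item = WordGroupItem("word-a", frozenset({0, 1}))
print(study_text(item, dictionary))  # first sense; second sense
```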

I’m not sure whether the last thing I’m going to mention is my fault, or the data was simply missing from JMDict back in 2006 when I started working on zkanji. Entries in JMDict list the possible kanji forms, the possible kana readings, and the possible English (or other language) definitions for the words. Not all kana readings can be used with all kanji forms (a single entry can hold all forms of a word even when, for example, the only difference between them is katakana or hiragana usage), and not all definitions can be used for all forms either. Still, the program shows all definitions for all forms, even though they might be incorrect for some kanji. This is misleading for students, but because many people use zkanji already and rely on the data in their groups, I couldn’t easily solve this. The solution is to first copy the dictionary in its old form to an English user dictionary, then tell the user about each missing definition, so they can update the incorrect data. (This will be similar to what the program will do when the dictionary has changed between versions, and thus easy to do correctly without extra work.)
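The restriction described above can be sketched in Python as follows, assuming each definition optionally lists the kanji forms it applies to. The class, the field name and the placeholder forms are invented for illustration and do not mirror how JMDict or zkanji store the data.

```python
from dataclasses import dataclass

@dataclass
class Definition:
    text: str
    # Hypothetical field: kanji forms this definition is limited to.
    # An empty set means the definition applies to every form of the entry.
    only_for_kanji: frozenset = frozenset()

def definitions_for(kanji_form, definitions):
    """Return only the definitions that are valid for the given kanji form."""
    return [
        d.text
        for d in definitions
        if not d.only_for_kanji or kanji_form in d.only_for_kanji
    ]

# Made-up entry with two written forms, FORM_A and FORM_B.
definitions = [
    Definition("shared sense"),
    Definition("restricted sense", frozenset({"FORM_A"})),
]
print(definitions_for("FORM_A", definitions))  # ['shared sense', 'restricted sense']
print(definitions_for("FORM_B", definitions))  # ['shared sense']
```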

This is all I could think of off the top of my head. If I missed anything, I will find out when I do the work.

To summarize (what probably only a few people will read)

I need to make changes that are hidden from the user, because without them there are many ways in which zkanji can’t develop.

The pros:

  1. User data will be safer when the main dictionary is updated.
  2. It will be possible to change word entries in the English dictionary or to add new ones.
  3. It will be possible to add multiple meanings to an existing item in a word group, not just as separate items.
  4. As a result, collecting words into word groups based on kanji won’t have to produce a separate entry for each meaning of a word.
  5. Word group export/import will work as expected (possibly including word tests).
  6. Mistakes in the dictionary data will be easier to fix without corrupting user data.

The cons:

  1. User data for the English dictionary will take up an extra 30 MB of space.
  2. Two versions of the English dictionary must be loaded into memory, because both will be used for some features.
  3. Slower starting time and (possibly) slower running of the program, although this might not be noticeable if I’m careful enough.
  4. It will take some time to release the next version, because the changes I’m going to make are not simple.
  1. April 13, 2012 at 4:39 pm
