Creating a usable and safe backup system is my last aim for the next release, before I go over the user-reported bugs and complaints. Just like most other things that seem simple at first glance, this is not as easy as it looks.
In past releases, zkanji created a copy of successfully loaded files in the user data folder with the TEMP extension, after loading them. (Thinking about it, isn’t the TEMP extension a bit misleading?) The user had a single safe(?) copy of data files that loaded correctly, or at least that didn’t generate an immediate error. Past backups were overwritten. This solution worked fine in the utopian world in my mind, that is, as long as errors only occurred on load (which is not very likely). Unfortunately there has been at least one case when a user only noticed a few days later that something was not right with his or her data. That situation is obviously not solved by such a simple backup.
The obvious solution would be to keep a backup of all user data files for the past few days, or even weeks. I started working on this, but a few things bugged me about it all along. In the new data handling system, users will be able to change their main English dictionary, so a safe copy must be made of it too. The dictionary file is nearly 25 megabytes, and even without the few additional kilobytes of user data, making several backups of this size is not an acceptable solution. As I’m also working on a dictionary in a different language, the total size for me would be nearly 35 megabytes. In my experience, at least 2 weeks of backups are necessary to be on the safe side, which adds up to 350 megabytes normally, and in my case nearly half a gigabyte! We can probably do better than that.
If someone never changes his or her main dictionary, and the only files to save are groups or study data, not saving unchanged files keeps the size of backups to a minimum. This seems like a good solution to the problem; unfortunately it brings the complications to a whole new level. How can we know that the main data file has not been changed? We could read it and compare it to the unchanged dictionary data (there is a data file which is never touched, but is required for the update system to work). Comparing files is slow though, and nobody wants to wait the additional seconds every time zkanji creates new backups. I also thought of comparing file times, but if a user unintentionally changed the main dictionary and reverted the changes later, the file times would differ while the data stayed the same. Not to mention the case when the user data is on some central server and files have to be read and written over a network several times. (I know of at least one such case.)
As I have decided not to do any kind of complicated magic that could also be slow, a compromise is forming in my head. (This is just the current idea, which may be rejected in the next second.) Keeping 2-3 backups of each file doesn’t seem to be that much of a burden. If the files are backed up at longer intervals, for example every 4-5 days, and are not kept for too long, the user gets relative safety which is relatively cheap. Data loss happens, but this way only a few days’ worth of data would be lost. If the user only notices a problem a week later, this is still better than losing everything. (In case you have terabytes of space for backups, you will be able to tweak the interval in days and the number of backups in the settings.) Safe copies of the data would be created once on startup, and if you are the kind of person who doesn’t power off their computer for months, I’m considering checking the running time of the program as well, and creating a copy when the time comes.
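The compromise above could be sketched roughly like this. (A toy model, not actual zkanji code; the names BackupSet, backupDue and rotate are made up, and a real implementation would of course copy files, not just track dates.)

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Planned policy: keep at most `maxCopies` backups per data file, and
// make a new one only if the newest backup is at least `intervalDays` old.
struct BackupSet {
    int maxCopies;                   // e.g. 2-3 copies per data file
    int intervalDays;                // e.g. a new backup every 4-5 days
    std::vector<int64_t> backupDays; // creation day of each existing copy
};

// Returns true if a new backup should be written now.
bool backupDue(const BackupSet &set, int64_t todayDay)
{
    if (set.backupDays.empty())
        return true;
    int64_t newest = *std::max_element(set.backupDays.begin(), set.backupDays.end());
    return todayDay - newest >= set.intervalDays;
}

// After writing a new copy, drop the oldest copies over the limit.
void rotate(BackupSet &set, int64_t todayDay)
{
    set.backupDays.push_back(todayDay);
    std::sort(set.backupDays.begin(), set.backupDays.end());
    while ((int)set.backupDays.size() > set.maxCopies)
        set.backupDays.erase(set.backupDays.begin());
}
```

With 3 copies every 4 days, the oldest surviving backup is around 8-12 days old, which matches the "about two weeks" safety margin without keeping two weeks of daily copies.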
I have been back coding zkanji since yesterday, and I’m now much clearer about the difficulties that made me put the project on hold for almost a year. I should read my blog more often, because I had completely forgotten about my attempt at making the example sentences data independent from the main dictionary. I stopped because the code of the whole program had become so complicated that everything depends on everything else, and even the smallest change requires, if not rewriting, then at least rereading and understanding everything.
For example, I got to a point with the example sentences data handling where a new dictionary doesn’t necessarily make old example sentences data unusable. It works without complaining, but it is still a lie that it doesn’t depend on the original dictionary. When the sentences file is generated, it only allows sentences that have at least one word from the dictionary, in the exact form that word had in the main English dictionary loaded in the program at the time. If the example sentences contain something not in the dictionary yet, but a later update adds that word, the sentences panel on the interface won’t reflect the change. This is not a big deal, but any perfectionist would say it won’t do.

I might only be a half perfectionist (is that an oxymoron?), but as it turns out, the JLPT data uses the same structure for linking with words in the dictionary. The new “zdict.zkj” file not only contains data about the kanji, but also a big list of words that have a JLPT level, and the level itself. And that list is for the dictionary version at the time of compiling the database. Anything I might add in the future that is in theory independent from the dictionary (that is, that more languages could use) would do the same, so I have to fix this. Fixing it adds another half megabyte of memory usage. I’ll have to get used to this: adding features takes memory. So goodbye to my dreams of making a program that won’t grow exponentially. Though in my opinion it is still better to make something that works correctly than to save that extra megabyte and break the code altogether.
I want to finish fixing this problem today or tomorrow, and if I succeed, I will have to face the other great obstacle: export/import, because with the new dictionary format it is now broken. (Once that’s out of the way there are a few things I have to tune but nothing serious and you will get a test version.)
(UPDATE: fix done without running it once, testing comes tomorrow!)
(UPDATE2: testing revealed a few bugs that I fixed. There is a small performance problem with loading the example sentences though. I’m not sure whether the new loading time is acceptable on low-end machines.)
Writing the code for export and import is not difficult. I could even say it is pretty easy compared to some other stuff in zkanji. The problem is not the writing, but designing a good file format and an easy-to-use interface. Around last April I was getting ready to do it, but after racking my brains for 2 whole days I couldn’t come up with anything usable, and gave up for the time being (telling myself I would do it in a week). Why? Because I have no idea what my users need. I would have no use for the feature myself (except when I have to share the Hungarian-Japanese dictionary I’m working on), so I don’t know what makes sense. I realized fast that exporting and importing everything in every combination is not possible, or at least can break the data easily. I imagined a window where you can check what you want to export: groups (kanji and words separately), study group data, long-term study data, words from the dictionary (as the dictionary is editable, people would obviously want to share their changes) etc. I also don’t know yet how to handle it when people want to share data from different dictionary versions. Should the exported group data contain the whole word with all meanings and word types, or only the indexes? Or should even this be selectable, creating another heap of problems to solve?
So I had to realize that I need help (and I don’t mean being introduced to a good psychologist). What real-world applications of export/import can you think of? I want to design this feature like I usually design other features: by asking what people would use it for, and not from the programmer’s point of view, that is, as the largest set of functionality that is easy to write (and probably hard to use).
With this I close this mini-series, which didn’t have that much to do with data handling, but helped me start working on zkanji again. I’ll have to make up a new post title for future writings…
The first problem I tried to solve was that the main English dictionary was not editable. Of course I could have simply allowed editing the main dictionary like any user dictionary, saving it every few minutes (if the auto save option is on). I didn’t go with this because some features might need the original main dictionary. For example, the example sentences data relies on word indexes into the main dictionary, and if those indexes changed, the program would crash when looking up sentences.
This is not the real reason though. The real reason is that I don’t remember what parts would break (if any) if the dictionary changed, so I avoided it. To be even more precise, even the way I solved this problem (for the next release) won’t allow deleting words that were not added by the user. I could have made the program this way from the beginning. My suspicion though is that there wouldn’t be a problem even if words were deleted from the main dictionary (apart from breaking the example sentences, which will be fixed in the next release or the one after that). I just never had enough patience to check.
So the next release will allow changing the main dictionary as well. It changes the data, but keeps the original words in a separate list, in case they have to be reverted. Reverting to, or otherwise using, the original data is not implemented yet, but I didn’t want to break anything for future releases, so I decided to keep the originals anyway. Their list will be saved with the changed English dictionary data.
From the next release, there will be three files for the main dictionary instead of one. I will keep “zdict.zkj”, but from now on it will only hold data about the kanji, which is shared among all dictionaries. For example, stroke count does not depend on the target language; only the kanji meanings do, and those are still kept in this file. Of course changed meanings will be saved in user dictionaries like before. I have decided to keep the kanji data separate from the word data because handling the other files will be simpler this way.
The other two files will be “English.zkj” and “English.zkd”, both holding the word dictionary data, and they will be identical at first. The .zkj file will be the dictionary as it was installed, and the .zkd will be a copy of it. You might have noticed that .zkd is the extension for user dictionaries. This is because “English.zkd” will be handled just like any other user dictionary: any user changes will be reflected in it, but not in “English.zkj”. When the program starts, it will check for the user version of the dictionary file, and if found, load that one; otherwise it will load the original and create a copy with the .zkd extension. This wastes around 30Mb of disk space.
I could have made the program either update the original file when the user changes it, or delete the original once the copy is created, or something similar, so only one version would be present. A file holding the user’s changes can’t keep the original name though (I’ll explain why in a minute), so renaming it or creating a copy was the only viable option. It is probably not even necessary to keep a file under the original name, but it makes things a bit easier.
The reason for having an original and a user version of the same data is the simple fact that the setup program (and the zip package) contains the dictionary data under the name “English.zkj”, so updating the program could very easily delete any changes the user has made to his or her own English dictionary. Both the original and the user data will contain a date. When a future release of zkanji runs, it will check the date in the two files, and if they are not identical, it will know that the program was updated, or at least that the original dictionary file was replaced with a different one. If the two dates are identical, it will run as usual; otherwise it will bring up a dialog where the user can check which differing words were added to a group or study list, so he or she can resolve any issues.
Keeping a separate user English dictionary file will avoid a lot of difficulties the current release has to deal with. For example, the file which keeps the data for word groups will no longer have to store both the kanji and kana forms of words; it will be enough to save an index into the user dictionary. It will also be possible to create word groups where each entry can hold more than one meaning of a word, as an update won’t break anything, since the user will be notified of changes and will be able to resolve them.
The next one will probably be the last in this mini-series. I will write about what is not ready yet for a new release (apart from the fact that the big changes I just described need a lot of testing, though I’ll probably ask for help with that), and why it is so challenging for me to finish it. I can almost imagine how excited you must be, waiting for the last part to finally be here!
In the past two entries I described what is in the data files included with the program. In this part let me write a bit about data that is generated by the user, which must be saved and restored.
Since Vista, programs cannot create files in some folders unless they are given administrator privileges or are run by an administrator. The Program Files folder is one such location. Unless zkanji is “installed” in such a folder, it keeps user data files in the “data” folder next to the executable; otherwise user data files are saved in the user’s documents folder. There can be two user files for each user dictionary, though if you are only using the English dictionary, there is a single file, as there is no dictionary data to be saved.
As I wrote in the previous entry, user-made dictionaries are saved in exactly the same format as the main dictionary data file, but the kanji data, which stays the same for all languages (so everything apart from the meaning of a kanji), is not written. There is no JLPT data stored in these files either. The other file saved is the group / study progress file with the .zkd extension. A user dictionary file is unnecessary for English, but the group file is created even in that case. The group file obviously stores which kanji are moved to kanji groups, and which words are moved to word groups. It must also contain study progress, which is mainly a list of words and their standing in some study group or the long-term study list.
It is less obvious how a word or its identifier is saved and loaded in groups and the study list. One possibility would be to store a unique index number for each word (and probably meaning, where it makes sense) that gives the word’s position in the dictionary, but unfortunately this approach wouldn’t work. With each update the English dictionary is also updated. I have no control over how that is done, and unfortunately when the JMDict data changes, it is very common that the indexes of words change as well. If I just saved the user data with an index number, once the program and its dictionary were updated, most words in word groups and study groups would not be the ones that should be there. The best workaround I could find was saving both the kanji and the kana form of every single word that was added to a group or to a study list.

This can increase the size of user files considerably, but more importantly, it influences loading times, loading huge user data files much more slowly than would otherwise be necessary. Even if you keep zkanji on an SSD drive, it’s not reading the file into memory which is slow, but looking up every word of the user data in the dictionary to find their current indexes. This is even more of a pity since updates are rare nowadays (sorry), and indexes only change if the dictionary changes as well. The situation is even worse though, because the meanings of words often change in the dictionary too. What was the first meaning could be the third in the next release, or some meaning might be split into several meanings. Unfortunately I couldn’t find any solution to this problem (until now).
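The workaround above, re-resolving every saved word to its current index after an update, might be sketched like this. (Hypothetical types and names, not the actual zkanji code; the real lookup reuses the dictionary’s own search structures instead of building a throwaway map.)

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One dictionary entry: written (kanji) form and kana reading.
struct Word {
    std::string kanji;
    std::string kana;
};

// Builds a lookup table over the current dictionary once, then resolves
// every saved (kanji, kana) pair to its current index.
// Returns -1 for words no longer present in the dictionary.
std::vector<int> resolveIndexes(const std::vector<Word> &dict,
                                const std::vector<Word> &saved)
{
    std::map<std::pair<std::string, std::string>, int> table;
    for (int i = 0; i < (int)dict.size(); ++i)
        table[{dict[i].kanji, dict[i].kana}] = i;

    std::vector<int> result;
    for (const Word &w : saved) {
        auto it = table.find({w.kanji, w.kana});
        result.push_back(it == table.end() ? -1 : it->second);
    }
    return result;
}
```

Building the table is the per-load cost the entry complains about: it has to touch every saved word once, no matter how rarely the dictionary actually changed.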
Once I release the next zkanji, what I wrote in this entry will be out of date. The next version, which has been mostly done for more than 6 months now, does things differently and more reliably, at the price of taking up a lot more disk space. This “lot more” is in the tens of megabytes, an amount which in my opinion is nothing to fret about. In the next part I’ll try to explain the changes. I have already written about what the user will see when the dictionary is updated, but this time I will explain how it concerns data files.
*Imagine there is a disclaimer for non-programmers here.*
This is only the second entry in a planned short series about the data used in zkanji. I realized that the last entry was more detailed than necessary for such a series, so I’ll try to limit myself to the “bare minimum”. In the previous entry I described the format of the word dictionary as the program sees it before it builds the inner data. I could do the same about the kanji, but there is no magic there. It is not necessary to pre-process the KANJIDIC file to be able to work with it, as it is in a very simple 1 line / 1 kanji format. The data is not organized in any special way, all kanji searches are with a brute-force algorithm. (Looking up something in such a small number of entries is not a challenge for modern computer processors.)
When zkanji is run to generate its kanji data, it looks for a file called “kanji.txt” in its data folder, which is the KANJIDIC file saved in UNICODE. (When I write UNICODE I always mean UTF-16, which is native in the Windows environment. Linux for example uses UTF-32 for its wide characters, but wasting 4 bytes on a single character is overkill in my opinion.) The data is stored in a simple structure in memory and later saved in the same structure, so there is no magic happening here. There are a few other steps, for example information about the kanji radicals is in RADKFILE, which has to be imported separately, but that is not a big deal either.
Once everything is imported, I go over all kanji readings, and throw out those that are not found in any word. This might seem to be an unnecessary step, but I figured that most of the readings that are not used are either very rare or only important for researchers (zkanji is mainly for students of Japanese). When this is done, there is nothing else to do with kanji. The stroke order diagrams and animations are not imported, I made them right in zkanji, and it saves their data file in the final format. I won’t be explaining that in this mini series as it is irrelevant to data handling.
All of this is saved in zdict.zkj as well. When a user creates his or her own dictionary, it is saved in a file with the .zkd extension, very similar to zdict.zkj, but for obvious reasons it is unnecessary to save the kanji information a second time.
In the last entry I mentioned that I might explain how the example sentences file is stored. I’ll try now in a few words, without going into details. The examples.zkj was a separate download for a long time, but it is now included in the setup for the program. zkanji can run without it, but it only takes up ~14 megabytes, which is really nothing nowadays. Still, the data is saved compressed in the data file with the free zlib library (no relation to zkanji). The data contains around 150,000 sentences and would take up more than 40Mb uncompressed, so this is a good ratio. Because I didn’t want to load the examples data into memory when the program starts, it was necessary to compress the data in a way that can still be accessed fast and easily when a sentence is needed. The sentences are compressed in packs of 100, and when one has to be displayed, its pack is decompressed and all 100 sentences are loaded into memory. 100 sentences take up around 26Kb, which is not much. When there is more than 1Mb of sentences in memory, the ones not used for the longest time are freed. This way only a megabyte of memory is used up for example sentences at a time, and they can still be browsed lightning fast.
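The pack scheme could be sketched like this. (A simplified model: decompressPack stands in for the real zlib decompression of a pack from the data file, and all the names are made up.)

```cpp
#include <list>
#include <map>
#include <string>
#include <vector>

const int kPackSize = 100;              // sentences per compressed pack
const size_t kCacheLimit = 1024 * 1024; // keep at most ~1Mb decompressed

// Stand-in for decompressing one pack of sentences with zlib.
std::vector<std::string> decompressPack(int packIndex)
{
    std::vector<std::string> pack(kPackSize);
    for (int i = 0; i != kPackSize; ++i)
        pack[i] = "sentence #" + std::to_string(packIndex * kPackSize + i);
    return pack;
}

struct PackCache {
    std::list<int> order; // most recently used pack index at the front
    std::map<int, std::vector<std::string>> packs;
    size_t bytes = 0;

    const std::string &sentence(int sentenceIndex)
    {
        int packIndex = sentenceIndex / kPackSize;
        if (packs.count(packIndex) == 0) {
            // Pack not in memory: decompress all 100 sentences at once.
            packs[packIndex] = decompressPack(packIndex);
            order.push_front(packIndex);
            for (const std::string &s : packs[packIndex])
                bytes += s.size();
            // Free the packs unused for the longest time when over budget.
            while (bytes > kCacheLimit && order.size() > 1) {
                int victim = order.back();
                order.pop_back();
                for (const std::string &s : packs[victim])
                    bytes -= s.size();
                packs.erase(victim);
            }
        } else {
            // Already loaded: just mark it as recently used.
            order.remove(packIndex);
            order.push_front(packIndex);
        }
        return packs[packIndex][sentenceIndex % kPackSize];
    }
};
```

Browsing nearby sentences mostly hits packs that are already decompressed, which is why the scheme stays fast despite the compression.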
Each sentence in the examples file comes with additional data, like which words are in it, so when the mouse is moved over a sentence in the program, you can see words underlined, and clicking on them looks them up in the dictionary. Making that work was a bit complicated, but it is working fine, so I don’t need to look into it to refresh my memory and won’t explain it in detail.
The next time (which is hopefully soon) I’ll write about user data files, like word and kanji groups and user made dictionaries.
I usually write disclaimers at the front of this kind of blog entry so it is clear that what I write about is not for the technically challenged, but this time I will skip that part. Anyone who can read will notice the title and run away screaming anyway. The contents of this mini-series won’t be anything flashy. I will probably not add images unless they cannot be avoided, so if you are here for entertainment, I have to say sorry.
This first part is an introduction about what makes up the zkanji dictionary, how it started out and what is JMDict anyway. I have a tendency to forget what I deem unimportant so please don’t expect me to get into details about my first mistakes.
If you have used the program a little, you know that it has a dictionary (obviously), some data about 6355 kanji, stroke order diagrams (kanji with stroke order and animation) and many, many example sentences. Apart from what is built in, it can group kanji and words for the user, and it can even build a new dictionary from scratch, if that’s your hobby. Of course this is not everything, because there are all kinds of data in the dictionary that cannot be summed up in a few words.
The data in the dictionary comes from a huge XML database (a large file full of text) called JMDict. I don’t know the exact details, but the collection of this data started in ancient times, and saying that making it possible was a huge undertaking is a slight understatement. Go and say your thanks to the people making it possible. (Finding their addresses is your homework.) The data file looks something like this:
<entry>
  <ent_seq>1183090</ent_seq>
  <k_ele>
    <keb>恩</keb>
    <ke_pri>ichi1</ke_pri>
    <ke_pri>news1</ke_pri>
    <ke_pri>nf18</ke_pri>
  </k_ele>
  <r_ele>
    <reb>おん</reb>
    <re_pri>ichi1</re_pri>
    <re_pri>news1</re_pri>
    <re_pri>nf18</re_pri>
  </r_ele>
  <sense>
    <pos>&n;</pos>
    <gloss>favour</gloss>
    <gloss>favor</gloss>
    <gloss>obligation</gloss>
    <gloss>debt of gratitude</gloss>
  </sense>
</entry>
(A little excerpt from JMDICT)
It might not look like anything interesting, but apart from the JLPT data, everything in the dictionary comes from entries like this. You probably didn’t know, but this XML file is not the original source of the data either. Even if the original can be found somewhere online, I have no idea where, and because the XML holds everything necessary, for free, it is not important.
Let me tell you a secret. (XML fans will be shocked.) The code that I wrote to process this data cannot speak XML. It simply looks for text like <entry> or <k_ele> etc., and if it recognizes something, it tries to read as much as possible. There are states like “reading the kanji” or “the next lines are probably the meanings of the word”, until it sees another line it can recognize, and then it goes to the next state. If I used some library to recognize the XML tags and tried to get the data from whatever it converts the text into, the program would probably run 10 times slower without any benefit at all. Instead my script spits out the same data in another format, but what it outputs is already processed and sorted. When the zkanji database is built from this pre-chewed data, there is not much left to do.
Here are a few lines of the output of the script that does the work:
明かり
あかり
akari 6785 2
light, illumination, glow, gleam
p&n;
lamp, light
p&n;
上がる
あがる
agaru 6492 23
to rise, to go up, to come up, to ascend, to be raised
p&v-i;p&v-u;
to enter (esp. from outdoors), to come in, to go in
p&v-i;p&v-u;
to enter (a school), to advance to the next grade
p&v-i;p&v-u;
to get out (of water), to come ashore
p&v-i;p&v-u;
[...] (19 more meanings for 上がる)
The first two lines are the kanji and its reading. The two numbers are the frequency of the word and then the number of meanings. The meanings are each made up of 2 lines: the first is the text of the meaning, and the strange codes in the second tell the types of the meaning. For example “p&v-i;” means intransitive verb. (I would probably do it differently if I started today, but this format hasn’t changed since the very beginning.) The words in this output are in alphabetic, or rather “kanabetic” order. This much data wouldn’t be enough to build a great dictionary database, so here is another interesting file:
'n
135942
'na
6048 23369 34670 50400
'na'
91400 17791 76159 116849 12976 55298 19529 101866 87619 130163 130235 72796 135224 64219 13730 21184 74780 11502 130317
'naa
154117 154118 154119 158118 40812
You might remember the first entry I wrote a long time ago about how the data is kept in memory so that fast dictionary look-ups are possible. The data is in a tree structure, so looking up words by their first few letters is easy. The file part shown above is made up of those starting letters, each followed by the indexes of all the words under the given node in the tree structure. There are 4 such files: one for word meaning look-up, one for kana, another for words written backwards in kana, so words can be looked up by how they end, and the last one for the kanji meanings, for look-ups in the kanji list. zkanji gets the processed data in this format and outputs zdict.zkj.
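The tree idea can be sketched as a simple prefix tree. (A toy version with made-up names; the real in-memory structure and the file layout differ, but the principle, each node listing the indexes of all words below it, is the same.)

```cpp
#include <map>
#include <string>
#include <vector>

struct Node {
    std::vector<int> wordIndexes;  // every word somewhere below this node
    std::map<char, Node> children;
};

// Adds a word: each node along its spelling records the word's index.
void insert(Node &root, const std::string &word, int index)
{
    Node *node = &root;
    for (char c : word) {
        node = &node->children[c];
        node->wordIndexes.push_back(index);
    }
}

// All word indexes whose entry starts with `prefix`, or nullptr if none.
// A prefix search is just one walk down the tree, no scanning.
const std::vector<int> *lookup(const Node &root, const std::string &prefix)
{
    const Node *node = &root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end())
            return nullptr;
        node = &it->second;
    }
    return &node->wordIndexes;
}
```

This is also why the files above can simply list a prefix followed by indexes: each such line corresponds to one node of the tree.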
The next time I’ll (probably) write more about what is in zdict.zkj (kanji data, JLPT levels), and if it doesn’t turn out to be such a long entry, I might explain a bit about examples.zkj as well.
Despite all my efforts at creating an example sentences database that doesn’t depend on the current dictionary, my aim is still out of reach. Since the introduction of example sentences to the program, a new sentences data file has had to be generated for each new dictionary. For example, the word すっきり had the number 22222 in the dictionary data I generated in January, but it is 22226 in the updated dictionary I created yesterday. When looking up examples for this word, I first looked for the index 22222 in the examples data, which pointed me to a list of all the examples the word had. You can easily see what would happen if I still used the old examples with the new dictionary, where the word is marked #22226.
I used a little trick to make these examples work for user dictionaries too, not just for the main English one. Whenever I needed the sentences of a specific word in another dictionary, I first looked up the word’s main index, exactly the same way dictionary searches work. This meant that I still needed the original dictionary file that was compatible with the example sentences data.
To make any future example sentences data work with any future dictionary, instead of storing a number for each word in the example sentences data, I could store the word’s written form and kana reading to identify it. So for example, when you want to see the examples for すっきり, zkanji would look that word up in the data directly from its written and kana form (which are both すっきり in this case), instead of a number that changes with every new release. When I came up with this idea I thought it would be easy to implement. Finding the sentences for the words this way is easy, but unfortunately building the data is impossible, because of the way the Tanaka Corpus is made up.
For each sentence in the corpus, there is a list of words that make up that sentence, which I can use to build my database. But as the JMDict project (where the main dictionary comes from) changes, and sometimes even the written and kana forms change, the list of words in the Tanaka Corpus cannot be independent from the actual state of the dictionary.
I’ll still create an example sentences database which can be used with any future dictionary data, but as the dictionary changes, some words will probably “lose” their examples. And because the corpus changes with time as well, it will still be a good idea to get the latest examples anyway.