Posts Tagged ‘user data’

User data backup problems and solution(?)

May 5, 2013

Creating a usable and safe backup system is my last goal for the next release, before I go over the user-reported bugs and complaints. Just like most other things that seem simple at first glance, this is not as easy as it looks.

In past releases, zkanji created a copy of each successfully loaded file in the user data folder, with the TEMP extension. (Thinking about it, isn’t the TEMP extension a bit misleading?) The user had a single safe(?) copy of the data files that loaded correctly, or at least did not generate an immediate error. Past backups were overwritten. This solution worked fine in the utopian world in my mind, that is, if errors only ever occurred on load (which is not very likely). Unfortunately there has been at least one case when a user only noticed a few days later that something was not right with his or her data. Our simple backup obviously does not solve this situation.

The obvious solution would be to keep a backup of all user data files for the past few days, or even weeks. I started working on this solution, but a few things have bugged me about it all along. In the new data handling system, users will be able to change their main English dictionary, so a safe copy of it must be made as well. The dictionary file is nearly 25 megabytes, and even without the few additional kilobytes of user data, making several backups of this size is not an acceptable solution. As I’m working on a dictionary in a different language, the total size for me would be nearly 35 megabytes. In my experience, at least 2 weeks of backups are necessary to be on the safe side, which comes to 350 megabytes normally, and in my case nearly half a gigabyte! We can probably do better than that.
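As a rough sketch of the arithmetic above, assuming one full copy is kept per day (the function and constant names are made up for illustration, and the sizes are the approximate figures quoted in the text):

```python
# Approximate sizes quoted in the post, in megabytes.
DICTIONARY_MB = 25         # main English dictionary
TWO_DICTIONARIES_MB = 35   # with a second dictionary in another language
DAYS_KEPT = 14             # two weeks, assuming one full copy per day


def total_backup_mb(size_mb: int, copies: int) -> int:
    """Disk space needed when every backup is a full copy of the data."""
    return size_mb * copies


print(total_backup_mb(DICTIONARY_MB, DAYS_KEPT))        # 350
print(total_backup_mb(TWO_DICTIONARIES_MB, DAYS_KEPT))  # 490
print(total_backup_mb(DICTIONARY_MB, 3))                # 75, if only 3 copies are kept
```

The last line hints at why keeping only a handful of copies is so much cheaper than keeping one for every day.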

If someone never changes his or her main dictionary, and the only files to save are groups or study data, then not saving unchanged files can keep the size of backups to a minimum. This seems like a good solution, but unfortunately it takes the complications to a whole new level. How can we know that the main data file has not been changed? We could read it and compare it to the unchanged dictionary data (there is a data file which is not touched, but is required for the update system to work). Comparing files is slow, and nobody would want to wait additional seconds every time zkanji creates new backups. I also thought of comparing file times, but if a user unintentionally changed the main dictionary and reverted the changes later, the file times would be different while the data is the same. Not to mention the case when the user data is on some central server and files have to be read and written several times over a network. (I know of at least one such case.)

As I have decided not to do any kind of complicated magic that can be slow as well, a compromise is forming in my head. (This is just the current idea, which can be rejected in the next second.) Keeping 2-3 backups of each file doesn’t seem to be that much of a burden. If the files are backed up at longer intervals, for example every 4-5 days, and are not kept for too long, the user can enjoy relative safety at a relatively low cost. Data loss happens, but this way only a few days’ worth of data would be lost. If the user only notices some problem a week later, this is still better than losing everything. (In case you have terabytes of space for backups, you will be able to tweak the interval of days and number of backups in the settings.) Safe copies of data would be created once on startup. For the kind of person who doesn’t power off their computer for months, I’m considering checking the running time of the program as well, and creating a copy when the time comes.
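The compromise could look something like the following sketch. It is not zkanji code: the function name, the timestamp-based file naming and the default values are all assumptions made for illustration. A backup is only created when the newest existing one is older than the interval, and the oldest copies are dropped once there are more than the configured number.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def backup_if_due(source: Path, backup_dir: Path, interval_days: float = 4,
                  keep: int = 3, now: Optional[float] = None) -> bool:
    """Copy `source` into `backup_dir` if the newest backup is old enough.

    Returns True when a new backup was created. Names and defaults are
    hypothetical, chosen only to mirror the numbers in the text.
    """
    now = time.time() if now is None else now
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Backups are named "<file name>.<unix timestamp>", so the numeric
    # suffix tells how old each copy is.
    existing = sorted(
        (p for p in backup_dir.glob(source.name + ".*") if p.suffix[1:].isdigit()),
        key=lambda p: int(p.suffix[1:]),
    )
    if existing:
        newest_age = now - int(existing[-1].suffix[1:])
        if newest_age < interval_days * SECONDS_PER_DAY:
            return False  # the last backup is still recent enough
    target = backup_dir / f"{source.name}.{int(now)}"
    shutil.copy2(source, target)
    existing.append(target)
    # Remove the oldest copies beyond the configured number.
    for old in existing[:-keep]:
        old.unlink()
    return True
```

Calling this once on startup, and again from a timer while the program keeps running, would cover both cases mentioned above.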


Data import 2.

February 18, 2013

I have decided to create a separate import for groups and dictionaries, because in some edge cases with a shared import, users would have to deal with 3 separate dialog windows with complicated selections one after the other.

The dictionary import will offer the option to completely replace a dictionary with the imported data (for full dictionary exports or when a free dictionary was converted to the export file format), and another option to expand an existing dictionary with the words in the export file. The former requires no additional work, as it is the same as a normal dictionary update. The latter is meant for a community working on a single dictionary, so they can share the few words they changed in the past month or some other short period. It is a more complicated problem, but it will be very similar to how groups are imported when word definitions differ between the exported file and the current dictionary. Users will be able to select which differing words to import from the export file, adding new entries or changing old ones. This can affect words already added to existing groups. Dictionary expansion can’t deal with the case when words get deleted, so people working on dictionaries will have to do full exports and imports from time to time, but it is easier if they don’t have to do it all the time.

Writing the group import is more complicated, because users usually don’t intend to change the dictionary, but I still want to allow it. First the same dialog will be shown that is used for expanding the dictionary (only when needed), but in this case I can’t avoid showing a similar dialog again. Once the dictionary is updated (or kept unchanged; it is important to be able to do this easily as well), a dialog will be shown for words that cannot be imported directly because of some conflict with existing groups or because their definition was not added in the first step. Users will be able to either select a replacement word in problem cases or choose to skip importing that word.

This much complexity can’t be avoided when the data is in such a complex relation. Hopefully in general cases, users won’t see any dialogs. There is no need for them if the dictionaries match and the groups don’t already contain conflicting definitions. Unfortunately, we are always dealing with the same kind of data, so the dialogs must be very similar, even identical at times, which can confuse users. I don’t know how to help with this, but if users complain I will come up with a solution.

Data handling in zkanji mini-series, Part V.

February 6, 2013

The first problem I tried to solve was that the main English dictionary was not editable. Of course I could have simply allowed editing the main dictionary like any user dictionary, saving it every few minutes (if the auto save option is on). I didn’t go with this because some features might need the original main dictionary. For example the example sentences data relies on word indexes in the main dictionary, and if those indexes change, the program would crash when looking up the sentences.

This is not the real reason though. The real reason is that I don’t remember what parts would break (if any) if the dictionary changed, so I avoided it. To be even more precise, even the way I solved this problem (for the next release) won’t allow deleting words that were not added by the user. I could have made the program this way from the beginning. My suspicion is though, that there wouldn’t be a problem even if words were deleted from the main dictionary (apart from breaking the example sentences, which will be fixed in the next release or the one after that). I just never had enough patience to check.

So the next release will allow changing the main dictionary as well. It changes the data, but keeps the original words in a separate list, in case they have to be reverted. Reverting, or using the original data is not implemented yet, but I didn’t want to break anything for future releases so I decided to keep the originals anyway. Their list will be saved with the changed English dictionary data.

From the next release, there will be three files for the main dictionary instead of one. I will keep “zdict.zkj”, but from now on, it will only hold data about the kanji, which is shared among all dictionaries. For example, the stroke count does not depend on the target language. Only the kanji meanings depend on the language, and the English meanings are still kept in this file. Of course changed meanings will be saved in user dictionaries like before. I have decided to keep the kanji data separate from the word data, because handling the other files will be simpler this way.

The other two files will be “English.zkj” and “English.zkd”, both holding the word dictionary data, and they will be identical at first. The .zkj file will be the dictionary as it was installed, and the .zkd will be a copy of the .zkj. You might have noticed that .zkd is the extension for user dictionaries. This is because “English.zkd” will be handled just like any other user dictionary. Any user changes will be reflected in it, but not in “English.zkj”. When the program starts, it will check for the user version of the dictionary file, and if found it will load that one, otherwise it will load the original and create a copy with the .zkd extension. This wastes around 30 MB of disk space.
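The startup check described above can be sketched in a few lines. This is only an illustration under assumed names (`open_main_dictionary` and its parameters are hypothetical), not the code zkanji uses:

```python
import shutil
from pathlib import Path


def open_main_dictionary(data_dir: Path, name: str = "English") -> Path:
    """Return the path of the dictionary file that should be loaded.

    Hypothetical helper: prefers the user version (.zkd); if it is
    missing, creates it as a copy of the installed original (.zkj)
    and loads the original this one time.
    """
    original = data_dir / f"{name}.zkj"
    user_copy = data_dir / f"{name}.zkd"
    if user_copy.exists():
        return user_copy
    shutil.copy2(original, user_copy)  # the two files are identical at first
    return original
```

On the first run this returns the installed file and leaves an identical user copy behind. On every later run the user copy is found and loaded instead.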

I could have made the program either update the original file when the user changes it, or delete it once it created a copy, or something similar, so only one version would be present. The original file can’t be kept with its original name though (I’ll explain why in a minute), so renaming it or creating a copy was the only viable option. It is probably not necessary to keep the file with the original name, but it makes things a bit easier.

The reason for having an original and a user version of the same data is the simple fact that the setup program (and the zip package) contains the dictionary data under the name “English.zkj”, so updating the program could very easily delete any changes the user has made to his or her own English dictionary. Both the original and the user data will contain a date. When a future release of zkanji runs, it will check that date in the two files, and if they are not identical, it will know that the program was updated, or at least that the original dictionary file was replaced with a different one. If the two dates are identical, it will run as usual, otherwise it will bring up a dialog where the user can check which of the words added to a group or study list differ, so he or she can resolve any issues.
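A minimal sketch of that date check, assuming the date stored in each file has already been read into a `datetime.date` (how and where the date is stored in the real files is not described here, and the function names and return strings are invented for the example):

```python
from datetime import date


def dictionary_was_replaced(original_date: date, user_date: date) -> bool:
    """True when the installed dictionary no longer matches the user copy."""
    return original_date != user_date


def startup_action(original_date: date, user_date: date) -> str:
    """Decide what to do on startup, based only on the two stored dates."""
    if dictionary_was_replaced(original_date, user_date):
        return "show update dialog"
    return "run as usual"


# Same date in both files: nothing was updated.
print(startup_action(date(2013, 2, 6), date(2013, 2, 6)))   # run as usual
# The installer replaced the original with a newer dictionary.
print(startup_action(date(2013, 5, 5), date(2013, 2, 6)))   # show update dialog
```

Note that the check is for inequality, not for "newer than", which matches the description: any replacement of the original file, even with an older one, triggers the dialog.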

Keeping a separate user English dictionary file will avoid a lot of difficulties the current release has to deal with. For example, the file which keeps the data for word groups will no longer have to store both the kanji and kana forms of words; it will be enough to save an index into the user dictionary. It will also be possible to create word groups where each entry in the group can hold more than one meaning of a word, as an update won’t break anything, since the user will be notified of changes and will be able to resolve them.

The next one will probably be the last in this mini-series. I will write about what is not ready yet for a new release (apart from the fact that the big changes I just described need a lot of testing, though I’ll probably ask for help with that), and why it is so challenging for me to finish it. I can almost imagine how excited you might be, waiting for the last part to be finally here!

Data handling in zkanji mini-series, Part IV.

February 1, 2013

I’ve mentioned most of the following before, so I’ll just summarize what considerations made it necessary to change the file format which will be used in the next release of zkanji (hopefully) soon:

  1. Safer dictionary updates: Because the underlying JMDict dictionary changes with every release, words could sometimes disappear from word and study groups without notice. This only happened if the word’s kanji form or kana writing (pronunciation) in the dictionary changed. As I mentioned before, the only way to identify words when loading the dictionary and groups was by those parts. For example in the current JMDict, the word バケツ (baketsu – bucket) has no kanji, but the dictionary at the time of the last program release contained the 馬穴 ateji (kanji selected by pronunciation only) for this word. If I don’t change how the program handles such cases, users could end up with words disappearing from their groups. Also, if a word’s meaning is added to a word group and even just the order of meanings changes, the word group cannot be automatically fixed to reflect that.
  2. English dictionary user changes: In the currently released program it is not possible to change the English definition of words, nor to add new words or to remove existing ones. It was a request by many users to be able to do that, but without the changes I made to handle the previous point, it would have been very difficult to handle dictionary updates. Fortunately the additional work for allowing dictionary changes was nothing compared to that.
  3. Multiple meanings for single word entries in word groups: In the current release when a word is added to a word group, a meaning has to be selected, and only that meaning will be added to the group. This doesn’t seem to have any direct connection to dictionary updates, but if word meanings (the order or number of the meanings) changed compared to an old dictionary, it would have created an even greater problem that is more difficult to handle.

This is all that I could think of right now, but I sometimes remember other features that I could have implemented long ago if zkanji could handle dictionary updates better.

In the next post I’ll write about what changes had to be made to the file formats. Without knowing that, this and the previous entries might seem a bit mysterious. 🙂


Data handling in zkanji mini-series, Part III.

January 28, 2013

In the past two entries I described what is in the data files included with the program. In this part let me write a bit about data that is generated by the user, which must be saved and restored.

Since Vista, files cannot be created in some folders by running programs, unless they are given administrator privileges or run by an administrator. For example the Program Files folder is one such location. Unless zkanji is “installed” in such a folder, it keeps user data files in the “data” folder which is next to the executable. Otherwise user data files are saved in the user’s document folder. There can be 2 user files for each user dictionary. Though if you are only using the English dictionary, there is a single file only, as there is no dictionary data to be saved.

As I wrote in the previous entry, user-made dictionaries are saved in exactly the same format as the main dictionary data file, but the kanji data which stays the same for all languages (so everything apart from the meaning of a kanji) is not written. There is no JLPT data stored in these files either. The other file saved is the group / study progress file with the .zkd extension. A user dictionary file is unnecessary for English, but the group file is created even in that case. The group file obviously stores which kanji are moved to kanji groups, and which words are moved to word groups. It must also contain study progress, which is mainly a list of words and their standing in some study group or the long-term study list.

It is less obvious how a word or its identifier is saved and loaded in groups and the study list. One possibility would be to store a unique index number for each word (and probably meaning where it makes sense) that gives the word’s position in the dictionary, but unfortunately this approach wouldn’t work. With each update the English dictionary is also updated. I have no control over how that is done, and unfortunately when the JMDict data changes, it is very common that the indexes of words change as well. If I just saved the user data with an index number, once the program and its dictionary are updated, most words in word groups and study groups would not be the ones that should be there. The best workaround I could find was saving both the kanji and the kana form of every single word that was added to a group or to a study list. This can increase the size of user files considerably, but more importantly it influences loading times, making huge user data files load much more slowly than would otherwise be necessary. Even if you keep zkanji on an SSD, it’s not reading the file into memory which is slow, but looking up every word of the user data in the dictionary to find its current index. This is all the more a pity, since updates are rare nowadays (sorry) and indexes only change if the dictionary changes as well. The situation is even worse though, because the meanings of words often change in the dictionary as well. What was the first meaning could be the third in the next release, or some meaning might be split into several meanings. Unfortunately I couldn’t find any solution to this problem (until now).
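To illustrate why a saved index is not enough while a saved kanji and kana pair is, here is a small sketch. The data layout and function names are assumptions made for the example, and the hash-map lookup is just one way to do the search, not necessarily how zkanji does it:

```python
from typing import Dict, List, Optional, Tuple

WordKey = Tuple[str, str]  # (kanji form, kana form)


def build_index(dictionary: List[WordKey]) -> Dict[WordKey, int]:
    """Map each (kanji, kana) pair to its current position in the dictionary."""
    return {key: position for position, key in enumerate(dictionary)}


def resolve_saved_words(saved: List[WordKey],
                        dictionary: List[WordKey]) -> List[Optional[int]]:
    """Return the current index of every saved word, or None if it is gone."""
    index = build_index(dictionary)
    return [index.get(key) for key in saved]


old_dictionary = [("犬", "いぬ"), ("猫", "ねこ"), ("鳥", "とり")]
# An update added a word at the front, shifting every index by one.
new_dictionary = [("魚", "さかな"), ("犬", "いぬ"), ("猫", "ねこ"), ("鳥", "とり")]
saved_group = [("猫", "ねこ"), ("鳥", "とり")]

print(resolve_saved_words(saved_group, old_dictionary))  # [1, 2]
print(resolve_saved_words(saved_group, new_dictionary))  # [2, 3]
```

Had the group stored the indexes 1 and 2, after the update it would point at 犬 and 猫 instead of 猫 and 鳥. The price is that every saved word has to be looked up again on each load.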

Once I release the next zkanji, what I wrote in this entry will be out of date. The next version, which has been mostly done for more than 6 months now, does things differently and more reliably, at the price of taking up a lot more disk space. This “lot more” is in the tens of megabytes, an amount which in my opinion is nothing to fret about. In the next part I’ll try to explain the changes. I have already written about what the user will see when the dictionary is updated, but this time I will explain how it concerns data files.


When the dictionary is updated…

April 27, 2012

The words you might have in groups and tests are often changed in the JMDict project, so there should be a way to control the update of the English dictionary. Another reason is that I want to add new features to zkanji that are more sensitive to such changes. I will soon release a beta tester version of the program which starts with a new dialog asking the user to check dictionary changes, in the hope that somebody will look at it and comment. (If nobody does, you will get it unchanged. This is a warning :D)

This is the dialog that is shown on startup if the program detects changes that might affect your groups or tests. The items shown in the window are changes that happened in the JMDict project since January. As you can see, the word ちゃんと was considerably modified, and if you had it in a group and updated with a previous version of zkanji, you would be in for a surprise, as the “perfectly, properly, exactly” definition would have been changed to “diligently, seriously, earnestly, …”, which are not exactly matching meanings.

From the next zkanji you will be able to do the following:

  1. Use copy – This copies the word definitions untouched, overwriting the entry in the updated dictionary, so it will still have the old word definitions.
  2. Remove word data – If you decide that it isn’t worth the trouble, you can simply throw out anything related to this word from your groups and tests. The new dictionary will keep the updated entry though.
  3. [Meanings that were in groups or tests and need change] and
  4. [Meanings of the same word in the updated dictionary] – You can go through all meanings that need change in 3. and select the corresponding meaning you want in the updated data from 4.
  5. Once you have made your choices, click “Next word >>” and they are registered.
  6. There is also an “Abort” button (unnumbered on the picture). If you want, you will be able to skip this update and use the old data. But be aware that it will mean that you will keep using the old English dictionary, and this dialog will be shown again when you start the program the next time.

This is fine for words that can be found in the updated dictionary, but in some cases the words are changed in a way that the program cannot find the corresponding entry.

For example the word “bucket” was written as 馬穴 in the original English data. The new dictionary doesn’t have that word with such kanji, only with a written form of バケツ (same as its kana pronunciation). Because zkanji recognizes words by [written form]+[kana pronunciation], it will think that this word is not in the new dictionary, and if this were an older version, it would simply remove all traces of the word from any groups and tests the user added it to. In the next version you will be able to find another word in the dictionary that you think matches closely enough, and then press the “Select” button. Once you do that you will be presented with the previous page of meanings to select their corresponding definitions.
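The two situations, a word whose meanings changed and a word that cannot be found at all, can be told apart with a check like the one below. This is a sketch under assumptions: dictionaries are modeled as plain mappings from (written form, kana) to a list of meanings, the function name is invented, and the meaning strings are shortened versions of the examples in this post.

```python
from typing import Dict, List, Tuple

WordKey = Tuple[str, str]              # (written form, kana pronunciation)
Dictionary = Dict[WordKey, List[str]]  # word -> list of meanings


def words_needing_attention(grouped: List[WordKey], old: Dictionary,
                            new: Dictionary) -> Tuple[List[WordKey], List[WordKey]]:
    """Split the user's grouped words into (changed, not_found).

    `changed` words exist in the new dictionary with different meanings,
    `not_found` words have no entry with the same written form and kana.
    Words that are identical in both dictionaries need no interaction.
    """
    changed, not_found = [], []
    for key in grouped:
        if key not in new:
            not_found.append(key)
        elif new[key] != old.get(key):
            changed.append(key)
    return changed, not_found


old = {
    ("馬穴", "バケツ"): ["bucket"],
    ("ちゃんと", "ちゃんと"): ["perfectly, properly, exactly"],
    ("犬", "いぬ"): ["dog"],
}
new = {
    ("バケツ", "バケツ"): ["bucket"],  # the 馬穴 written form is gone
    ("ちゃんと", "ちゃんと"): ["diligently, seriously, earnestly"],
    ("犬", "いぬ"): ["dog"],
}

changed, not_found = words_needing_attention(list(old), old, new)
print(changed)    # [('ちゃんと', 'ちゃんと')]
print(not_found)  # [('馬穴', 'バケツ')]
```

Words in the first list would go to the page where meanings are matched up, and words in the second list to the page where a replacement word is selected first.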

Only those words will be listed here that need user interaction so hopefully there won’t be more than 2-3 words needing update. There are currently 13 in this beta that piled up in 3 months, and I had all N3 marked words in groups, so it is not that much.

I believe that this update is so important for future development that once it is released, I recommend that anyone using zkanji download it. Not this one, but the version coming after it won’t run with your old user data! There is a lot of junk code in there that was kept for compatibility reasons, and I want to get rid of all of it.