
Data import 2.

February 18, 2013

I have decided to create a separate import for groups and dictionaries, because in some edge cases with a shared import, users would have to deal with 3 separate dialog windows with complicated selections one after the other.

The dictionary import will offer two options. The first completely replaces a dictionary with the imported data (for full dictionary exports, or when a free dictionary was converted to the export file format). The second expands an existing dictionary with the words in the export file. The first case requires no additional work, as it is the same as a normal dictionary update. The second is meant for a community working on a single dictionary, so its members can share the few words they changed in the past month or some other short period. It is a more complicated problem, but it will be very similar to how groups are imported when word definitions in the export file differ from those in the current dictionary. Users will be able to select which differing words to import from the export file, adding new entries or changing old ones. This can affect words already added to existing groups. Dictionary expansion can’t handle deleted words, so people working on dictionaries will have to do a full export/import from time to time, but it is easier if they don’t have to do it every time.
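The expansion case comes down to comparing the export file with the current dictionary. Here is a minimal sketch in Python (not zkanji’s actual code), assuming a hypothetical in-memory form where each dictionary maps a (kanji, kana) pair to a tuple of definitions. The function name and the sample words are made up for illustration.

```python
def classify_import(current, exported):
    """Split the exported words into new entries and changed entries.

    Both arguments are hypothetical dicts: (kanji, kana) -> tuple of definitions.
    Words that exist only in `current` are never reported, which mirrors the
    limitation that expansion cannot detect deleted words.
    """
    new, changed = [], []
    for key, meanings in exported.items():
        if key not in current:
            new.append(key)        # word is missing from the current dictionary
        elif current[key] != meanings:
            changed.append(key)    # word exists, but its definitions differ
    return new, changed


current = {("食べる", "たべる"): ("to eat",), ("飲む", "のむ"): ("to drink",)}
exported = {
    ("食べる", "たべる"): ("to eat", "to live on"),
    ("走る", "はしる"): ("to run",),
}
new_words, changed_words = classify_import(current, exported)
```

The two lists are what a selection dialog could present to the user. Identical entries never show up, so no dialog would be needed when the lists are empty.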

Writing the group import is more complicated, because users usually don’t intend to change the dictionary, but I still want to allow it. First, the same dialog that is used for expanding the dictionary will be shown (only when needed), but in this case I can’t avoid showing a similar dialog again. Once the dictionary is updated (or kept unchanged – it is important to be able to do this easily as well), a dialog will be shown for words that cannot be imported directly, either because of a conflict with existing groups or because their definition was not added in the first step. In these problem cases, users will be able to either select a replacement word or skip importing that word.

This much complexity can’t be avoided when the data is this interrelated. Hopefully, in the common case, users won’t see any dialogs. There is no need for them if the dictionaries match and the groups don’t already contain conflicting definitions. Unfortunately, we are always dealing with the same kind of data, so the dialogs must be very similar, even identical at times, which can confuse users. I don’t know how to help with this, but if users complain I will come up with a solution.


Planned export/import file format

February 14, 2013

This text will be printed at the top of the zkanji export files. The format might change a bit but the notations will be kept. I’m going to update this entry in case the format changes, but only before the next version is released. After the release I might post another entry if the format changes.

zkanji export file for version 0.73 and later.

The export file must be in UTF-8. In this description, whitespace refers only to the space and TAB characters. Lines that start with zero or more whitespace characters followed by ; or # are comments. Mid-line comments are not supported.
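The comment rule can be sketched in a few lines of Python. This is only an illustration of the rule as described, not zkanji’s own code, and the function name is made up.

```python
def is_comment(line: str) -> bool:
    """Return True if the line is a comment in the sense described above."""
    # Only space and TAB count as whitespace, so strip exactly those two.
    stripped = line.lstrip(" \t")
    return stripped.startswith((";", "#"))
```

Because there are no mid-line comments, a # later in a line (for example in front of a meaning number) is ordinary data and must not cut the line short.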

The file consists of sections for different data. Every section starts with a section name in square brackets like this:
[Section Name]
Only whitespace can be on the same line next to the section name.

Unrecognized sections are ignored, so older versions of the program can read newer export formats correctly. The file can contain the same section any number of times, but if the data in repeated entries differs, only the first version is used.
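A rough Python sketch of splitting a file into sections follows. It assumes that whitespace is allowed on both sides of the section name, and it does not resolve differing repeated entries (it just collects lines). The names KNOWN_SECTIONS, HEADER and split_sections are hypothetical.

```python
import re

# Sections named in this description; anything else is skipped.
KNOWN_SECTIONS = {"About", "Words", "Kanji"}

# A header line: optional spaces/TABs, [Section Name], optional spaces/TABs.
HEADER = re.compile(r"^[ \t]*\[([^\]]+)\][ \t]*$")


def split_sections(lines):
    """Collect entry lines per recognized section, in file order."""
    sections = {}
    current = None  # None while outside any recognized section
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = HEADER.match(line)
        if match:
            name = match.group(1)
            current = name if name in KNOWN_SECTIONS else None
            if current is not None:
                # A repeated section continues the list started earlier.
                sections.setdefault(current, [])
            continue
        if current is None or not line.strip(" \t"):
            continue  # inside an unknown section, or an empty line
        if line.lstrip(" \t").startswith((";", "#")):
            continue  # comment line
        sections[current].append(line)
    return sections
```

Lines under an unknown header are dropped until the next recognized header, which is what lets an older reader skip sections added by a newer format.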

Within a single section each line has the same format and refers to a separate entry. Entries can’t span multiple lines. The lines are made up of one or more tokens, which can contain UTF-8 code points between 0x20 and 0xFFFF. Tokens are made up of function markers (zero or more characters marking the function of the token) and a variable part. The separator character between tokens is the space, but if a token can contain spaces (depending on its type), its variable part must start and end with the TAB character (marked with \t in the format descriptions). If a token is not recognized, it is ignored during import, but incorrect tokens (e.g. when a token should be a number and it is not) can cause all or part of the line to be ignored. If a line, or part of it, is not meaningful without a missing token, that line or part is ignored. Tokens must be in the same order as in the format description.
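The space/TAB rule can be illustrated with a small Python tokenizer. This is a sketch under the rules stated above, not zkanji’s parser. It only splits on spaces and leaves pattern delimiters attached to the neighbouring token. The function name, the sample word and its frequency value are made up.

```python
def split_tokens(line: str) -> list:
    """Split a line on spaces, keeping TAB-delimited parts in one piece."""
    tokens = []
    current = []
    in_tabs = False  # True while inside a TAB-delimited variable part
    for ch in line:
        if ch == "\t":
            in_tabs = not in_tabs
            current.append(ch)
        elif ch == " " and not in_tabs:
            if current:  # a space outside TABs ends the current token
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


tokens = split_tokens("食べる たべる F1200 M{\tto eat\t #0}M")
# tokens == ["食べる", "たべる", "F1200", "M{\tto eat\t", "#0}M"]
```

The space inside "to eat" survives because it sits between two TAB characters. Every other space is treated as a separator.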

Lines can contain repeated patterns on two levels separated by the space character. Top-level patterns are between XX{ … }XX (braces or curly brackets), where XX must match on the ends. Secondary patterns within top-level patterns are between YY( … )YY (parentheses or round brackets).
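Below is a hedged Python sketch of pulling the top-level XX{ … }XX patterns out of a line. It assumes the marker is the run of non-whitespace characters directly before the opening brace, and that braces inside TAB-delimited text are plain data. Both function names and the sample line are hypothetical.

```python
def _find_outside_tabs(line: str, needle: str, start: int) -> int:
    """Index of the first `needle` at or after `start` that is not inside
    a TAB-delimited part, or -1. `start` must itself be outside such a part."""
    in_tabs = False
    for j in range(start, len(line)):
        if line[j] == "\t":
            in_tabs = not in_tabs
        elif not in_tabs and line.startswith(needle, j):
            return j
    return -1


def top_level_patterns(line: str) -> list:
    """Return (marker, body) pairs for every XX{ ... }XX pattern in the line."""
    results = []
    pos = 0
    while True:
        open_at = _find_outside_tabs(line, "{", pos)
        if open_at < 0:
            break
        # Walk back to the start of the marker in front of the brace.
        start = open_at
        while start > 0 and line[start - 1] not in " \t":
            start -= 1
        marker = line[start:open_at]
        closer = "}" + marker  # the marker must match on both ends
        close_at = _find_outside_tabs(line, closer, open_at + 1)
        if close_at < 0:
            break  # unterminated pattern: stop here
        results.append((marker, line[open_at + 1:close_at]))
        pos = close_at + len(closer)
    return results


line = "食べる たべる F1200 M{\tto eat\t #0}M M{\tto {consume}\t #1}M"
patterns = top_level_patterns(line)
# patterns == [("M", "\tto eat\t #0"), ("M", "\tto {consume}\t #1")]
```

Secondary YY( … )YY patterns inside a body could be extracted the same way by searching for parentheses instead of braces.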

The only exception is the [About] section. If present, it must be the first one in the file. The only restrictions on its format are that each line must start with a * or – character, and that no line can be longer than 1000 characters. Lines starting with * begin a new line of text; those starting with – are appended to the previous line during import. On export, the text in this section is an exact copy of the dictionary information of the source dictionary (e.g. license text, authors). It is only imported during full dictionary import.

The following other sections are accepted starting with zkanji v0.73:
[Words], [Kanji]

In the line formats below, square brackets [] mark placeholders for the variable parts of tokens; the text inside the brackets describes the function of the part it replaces.

Line format for the [Words] section:
Each line describes a word entry in the dictionary. The description can be partial (i.e. not all meanings are listed). The line’s structure is:
[word kanji] [word kana] F[frequency number] M{\t[word definition]\t #[meaning number] MT[list of word types separated by comma (see wtypetext in zkformats.cpp)] MN[list of notes (wnotetext)] MF[list of fields (wfieldtext)] NT[list of name tags (ntagtext)] G(\t[group name]\t #[entry index])G}M

The [meaning number] can be missing but if specified, it must be a number between 0 and 99. If the same number is encountered for a second time for the same word, that meaning is skipped. The [entry index] when specified refers to the index of an entry in a group. When several meanings have the same [entry index] in the same group for the same word, they will be imported to the same group entry once this is implemented. Until then, they are all added separately. When the [entry index] is missing, the entries are added to the group in their order in the export file. Repeated group name+entry index pairs are ignored and not added to the same group again.
If lines for different words have the same [entry index] for a group, they can’t be merged, so they will be added in their order in the export file.
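The meaning-number rule for a single word can be sketched in Python as follows. It is an illustration only. The description does not say exactly how an out-of-range number is treated, so dropping that meaning is an assumption here. The function name is made up.

```python
def filter_meanings(meanings):
    """Keep the meanings of one word, skipping repeated meaning numbers.

    `meanings` is a list of (meaning_number or None, definition) pairs in
    export-file order.
    """
    seen = set()
    kept = []
    for number, definition in meanings:
        if number is not None:
            if not 0 <= number <= 99:
                continue  # assumption: an out-of-range number drops the meaning
            if number in seen:
                continue  # second occurrence of the same number is skipped
            seen.add(number)
        kept.append((number, definition))
    return kept


kept = filter_meanings([(0, "to eat"), (1, "to live on"), (0, "duplicate"), (None, "unnumbered")])
# kept == [(0, "to eat"), (1, "to live on"), (None, "unnumbered")]
```

Meanings without a number are always kept, since the number is optional.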

Line format for the [Kanji] section:
[kanji character] D\t[dictionary kanji meaning]\t U\t[user defined kanji meaning]\t G{\t[group name]\t #[entry index]}G W{[word kanji] [word kana] WU\t[user defined word meaning]\t}W

The same kanji character can be repeated in the section if each line lists different groups, but it is better to have them on a single line. The [dictionary kanji meaning] is only imported for user dictionaries. The [entry index] for groups can be missing, but if present, it is used to order group entries. When it is missing, the entries are added to the group in their order in the export file.
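One possible way to apply the ordering rule, sketched in Python. The description does not say how indexed and unindexed entries are combined when both occur in the same group, so placing the unindexed ones last is purely an assumption of this sketch. The function name is also made up.

```python
def order_group_entries(entries):
    """Order the entries of one group.

    `entries` is a list of (entry_index or None, item) pairs in export-file
    order. Indexed entries are sorted by index, with ties keeping file order.
    Unindexed entries follow in file order (an assumption, see above).
    """
    indexed = [
        (index, position, item)
        for position, (index, item) in enumerate(entries)
        if index is not None
    ]
    unindexed = [item for index, item in entries if index is None]
    indexed.sort(key=lambda entry: (entry[0], entry[1]))
    return [item for _, _, item in indexed] + unindexed


ordered = order_group_entries([(2, "水"), (None, "火"), (0, "日"), (1, "月")])
# ordered == ["日", "月", "水", "火"]
```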
Word kanji and kana mark kanji example words, with an optional user defined meaning. The meaning of the selected example words and the definition of kanji can be edited on the groups panel, but they are usually not specified. It is uncertain whether this feature will be supported in future versions of zkanji.

Last update: May 13, 2013

Data export and import in zkanji

February 13, 2013

The release of the next version of zkanji has already been delayed a good 10 months for a single reason: huge changes took place in its data format, which made the export and import feature unusable. (Translation: reason of delay = laziness.) You can ask “So what? Why not just release it and add export and import later, like other features?” I could do that of course, but there is a difference. I usually only remove features if I don’t intend to support them in the future, and some users couldn’t use the program without export and import. I think both are reasonable.

Now that this is out of the way, I’ll write a bit about the previous concept image of the data export dialog. You can see that I have thought of four sections: dictionary, word groups, kanji groups and long-term study list export. After a bit of thought I realized that the dictionary and word group exports both contain the same kind of data. Not just the words’ kanji form and reading must be exported, but the meanings as well. This is obvious for the dictionary. The word groups contain words, but only specific meanings are added. In the near future I plan to change this a bit to be able to include several or all meanings for each word entry in a word group, but it will still be possible to add only some meanings. The biggest problem here (like with everything) is that the dictionary can change between releases, sometimes splitting meanings or changing them altogether. If the export file didn’t contain the word meanings, it would be impossible to check this at the time of import. The only additional data that must be exported with the dictionary is the word usage types and word frequency, so the two are very similar.

It’s a simple case if the dictionary has the same words and definitions when importing (though there can still be conflicting word groups which require user interaction), but when the dictionaries at the time of export and import differ, the program should probably allow users to update the target dictionary not just for dictionary imports, but for word group imports as well. When I took a break from developing zkanji 10 months ago I had this same problem, and it is still difficult. Thus I think that I’ll first make the export/import with a somewhat crude interface, and if users can propose something better I’m going to change it. Because this doesn’t usually happen, you will probably have to put up with that interface for life. 🙂

I don’t foresee such difficulties with kanji group and long-term study list export/import. The kanji group might have to handle words that are selected as examples for kanji, but the meaning is not important there. The long-term study list holds meanings for the words, but those either depend on the current dictionary, or can be independent, in which case there is no need to do anything with the dictionary.


February 12, 2013

Export dialog concept

This is a concept image for a new dialog window, where the user can select data for export, to be imported later or on a different computer. This might still not be final, but I’ll try to finish this and only change it if something comes up. The export window will have a few pages, but even from the main panel you can see all the features it will offer. Maybe this dialog is a bit wordy compared to previous ones, but at least nobody can say that I’m not trying to make it user friendly. :p