
Export/import file format update

March 6, 2013

I had to change the export/import file format considerably, because the original format was a bit difficult to handle and wasn’t ready for expansion either. The new format will hopefully stay, unless I discover something else that makes it unusable.

(I don’t want to bash SourceForge too much because it is free and all, but its quality in some aspects is on the same level as its cost. Just because you don’t see any development progress on the main page of zkanji doesn’t mean there is none.)

Categories: Development, Plans

Planned export/import file format

February 14, 2013

This text will be printed at the top of the zkanji export files. The format might change a bit but the notations will be kept. I’m going to update this entry in case the format changes, but only before the next version is released. After the release I might post another entry if the format changes.

zkanji export file for version 0.73 and later.

The export file must be in UTF-8. In this description, whitespace refers only to the space and TAB characters. Lines that start with zero or more whitespace characters followed by ; or # are comments. There are no mid-line comments.
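The comment rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the code zkanji uses; the name is_comment is made up for the example.

```python
# Illustrative sketch only; is_comment is a hypothetical helper name.
WHITESPACE = " \t"  # the description counts only space and TAB as whitespace


def is_comment(line: str) -> bool:
    """Return True if the line is optional space/TAB followed by ';' or '#'."""
    stripped = line.lstrip(WHITESPACE)
    return stripped.startswith((";", "#"))
```

Because there are no mid-line comments, a ; or # that appears later in a line is ordinary data and the line is not treated as a comment.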

The file consists of sections for different data. Every section starts with a section name in square brackets like this:
[Section Name]
Only whitespace can be on the same line next to the section name.

Unrecognized sections are ignored, so older versions of the program can read newer export formats correctly. The data can contain the same section any number of times, but if data in entries differ, only the first version is used.
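As a rough sketch of how a reader could split a file into sections and skip the ones it does not know, consider the Python below. It is not taken from zkanji. The function name split_sections is hypothetical, and it assumes that whitespace is tolerated on both sides of the section name and that blank lines are skipped, neither of which is stated above.

```python
import re

# Sections this sketch recognizes; everything else is ignored.
KNOWN_SECTIONS = {"About", "Words", "Kanji"}

# A section header: only space/TAB may surround the bracketed name (assumption:
# leading whitespace is tolerated as well as trailing whitespace).
SECTION_RE = re.compile(r"^[ \t]*\[([^\]]*)\][ \t]*$")


def split_sections(text):
    """Return a list of (section name, lines) in file order; names may repeat."""
    sections = []
    current = None  # None while outside a recognized section
    for line in text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            name = header.group(1)
            if name in KNOWN_SECTIONS:
                current = []
                sections.append((name, current))
            else:
                current = None  # unrecognized section: drop its lines
            continue
        if current is None:
            continue
        stripped = line.lstrip(" \t")
        if not stripped or stripped.startswith((";", "#")):
            continue  # blank line (assumption) or comment
        current.append(line)
    return sections
```

Repeated sections are kept as separate list items here, so a caller can apply the "first version wins" rule when it merges entries.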

Within a single section each line has the same format and refers to a separate entry. Entries can’t span multiple lines. Each line is made up of one or more tokens, which can contain characters with code points between 0x20 and 0xFFFF. Tokens are made up of function markers (zero or more characters marking the function of the token) and a variable part. The separator character between tokens is the space, but if a token can contain spaces (depending on its type), its variable part must start and end with the TAB character (marked with \t in the format descriptions). If a token is not recognized, it is ignored during import, but incorrect tokens (e.g. when a token should be a number and it is not) can cause all or part of the line to be ignored. If a line or a part of it is not meaningful without tokens that are missing, the line or that part is ignored. Tokens must be in the same order as in the format description.
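To make the separator rule concrete, here is a minimal Python sketch of splitting a line into tokens. It is an assumption about how one could do it, not the importer’s actual code, and split_tokens is a made-up name. Spaces split tokens unless they fall between a pair of TAB characters.

```python
def split_tokens(line):
    """Split a line on spaces, keeping TAB-delimited variable parts intact."""
    tokens = []
    buf = []
    in_tabs = False  # True while between an opening and a closing TAB
    for ch in line:
        if ch == "\t":
            in_tabs = not in_tabs
            buf.append(ch)
        elif ch == " " and not in_tabs:
            if buf:  # a space outside TABs ends the current token
                tokens.append("".join(buf))
                buf = []
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens
```

For a line such as "F12 M{\tto eat, to consume\t #0 }M" (sample values invented for the example) this gives the tokens F12, M{ followed by the TAB-delimited definition, #0 and }M. If a closing TAB is missing, the rest of the line ends up in one token, which a real importer would probably treat as an incorrect token.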

Lines can contain repeated patterns on two levels separated by the space character. Top-level patterns are between XX{ … }XX (braces or curly brackets), where XX must match on the ends. Secondary patterns within top-level patterns are between YY( … )YY (parentheses or round brackets).
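The two pattern levels can be picked apart with a regular expression that treats TAB-delimited text as opaque. The Python below is a sketch under that assumption, with a hypothetical helper name find_patterns; it is not how zkanji itself parses the line.

```python
import re


def find_patterns(line, marker, brackets="{}"):
    """Return the bodies of all marker{ ... }marker (or marker( ... )marker) spans."""
    open_ch, close_ch = brackets
    # The body is a lazy run of either a whole TAB-delimited part or a single
    # non-TAB character, so a closing marker inside TABs is not mistaken for the end.
    body = r"((?:\t[^\t]*\t|[^\t])*?)"
    pattern = re.escape(marker + open_ch) + body + re.escape(close_ch + marker)
    return [m.group(1) for m in re.finditer(pattern, line)]
```

Calling find_patterns(line, "M") returns the top-level meaning patterns, and calling find_patterns(body, "G", "()") on one of the results returns the secondary group patterns inside it. One limitation of the sketch: an opening marker such as M{ that appears inside TAB-delimited text before the real one would be picked up by mistake.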

The only exception is the [About] section. If present, it must be the first one in the file. The only restrictions on its format are that each line must start with a * or – character, and no line can be longer than 1000 characters. Lines starting with * begin a new line; those which start with – are appended to the previous line during import. On export, the text in this section is an exact copy of the dictionary information of the source dictionary (e.g. license text, authors). It is only imported during full dictionary import.
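Reassembling the [About] text on import could look roughly like the Python sketch below. Several things are assumptions: the name assemble_about is invented, the continuation marker is accepted both as the dash shown above and as a plain hyphen, and lines that break the rules (too long, wrong first character, continuation without a previous line) are simply skipped because the description doesn’t say what happens to them.

```python
# Assumption: accept both the dash printed in the description and a plain hyphen.
CONTINUATION_MARKERS = ("–", "-")


def assemble_about(lines, max_len=1000):
    """Rebuild the dictionary information text from [About] section lines."""
    result = []
    for line in lines:
        if not line or len(line) > max_len:
            continue  # assumption: invalid lines are skipped
        marker, text = line[0], line[1:]
        if marker == "*":
            result.append(text)  # '*' starts a new line of text
        elif marker in CONTINUATION_MARKERS and result:
            result[-1] += text  # continuation is appended to the previous line
    return result
```

The split into * and continuation lines is presumably what lets a long paragraph of license text stay under the 1000 character limit per line.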

The following other sections are accepted starting with zkanji v0.73:
[Words], [Kanji]

In the following line formats, descriptions inside square brackets [] are placeholders for the variable parts of tokens; the text inside the brackets describes the function of the part it replaces.

Line format for the [Words] section:
Each line describes a word entry in the dictionary. The description can be partial (i.e. not all meanings are listed). The line’s structure is:
[word kanji] [word kana] F[frequency number] M{\t[word definition]\t #[meaning number] MT[list of word types separated by comma (see wtypetext in zkformats.cpp)] MN[list of notes (wnotetext)] MF[list of fields (wfieldtext)] NT[list of name tags (ntagtext)] G(\t[group name]\t #[entry index])G}M

The [meaning number] can be omitted, but if specified, it must be a number between 0 and 99. If the same number is encountered a second time for the same word, that meaning is skipped. The [entry index], when specified, refers to the index of an entry in a group. When several meanings have the same [entry index] in the same group for the same word, they will be imported to the same group entry once this is implemented; until then, they are all added separately. When the [entry index] is missing, the entries are added to the group in their order in the export file. Repeated group name+entry index pairs are ignored and not added to the same group again.
If lines for different words have the same [entry index] for a group, they can’t be merged, so they will be added in their order in the export file.
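The meaning number rules can be illustrated with a short Python sketch. It takes the bodies of the M{ … }M patterns of one word (without the braces) and returns the meanings that would be kept. The name select_meanings is hypothetical, and skipping a meaning whose number is not valid is an assumption based on the general rule about incorrect tokens.

```python
import re

# A meaning body starts with the TAB-delimited definition, optionally
# followed by " #" and the meaning number.
MEANING_RE = re.compile(r"^\t([^\t]*)\t(?: #(\S+))?")


def select_meanings(blocks):
    """Return (definition, meaning number or None) for the meanings kept."""
    seen = set()
    kept = []
    for block in blocks:
        m = MEANING_RE.match(block)
        if not m:
            continue  # no definition: the part is not meaningful
        definition, number = m.group(1), m.group(2)
        value = None
        if number is not None:
            if not re.fullmatch(r"[0-9]+", number) or int(number) > 99:
                continue  # assumption: an invalid number drops this meaning
            value = int(number)
            if value in seen:
                continue  # the same number seen again for this word is skipped
            seen.add(value)
        kept.append((definition, value))
    return kept
```

A fresh call is needed for each word, because duplicate numbers only matter within the same word.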

Line format for the [Kanji] section:
[kanji character] D\t[dictionary kanji meaning]\t U\t[user defined kanji meaning]\t G{\t[group name]\t #[entry index]}G W{[word kanji] [word kana] WU\t[user defined word meaning]\t}W

The same kanji character can be repeated in the section if each line lists different groups, but it is better to have them on a single line. The [dictionary kanji meaning] is only imported for user dictionaries. The [entry index] for groups can be missing, but if present, it is used to order group entries. When it is missing, the entries are added to the group in their order in the export file.
Word kanji and kana identify example words for the kanji, with an optional user-defined meaning. The meanings of the selected example words and the definition of the kanji can be edited on the groups panel, but they are usually not specified. It is uncertain whether this feature will be supported in future versions of zkanji.
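For illustration, a [Kanji] line could be taken apart along these lines in Python. This is a sketch, not zkanji’s importer: parse_kanji_line is a made-up name, the example words in W{ … }W are left out, and the group search does not guard against G{ appearing inside the TAB-delimited text of an example word.

```python
import re

# Kanji character, then the optional D and U tokens in the order of the format.
HEAD_RE = re.compile(r"(\S+)(?: D\t([^\t]*)\t)?(?: U\t([^\t]*)\t)?")
# One group pattern: TAB-delimited group name and an optional entry index.
GROUP_RE = re.compile(r"G\{\t([^\t]*)\t(?: #([0-9]+))?\}G")


def parse_kanji_line(line):
    """Return the kanji, its two meanings and its groups, or None if unparsable."""
    line = line.lstrip(" \t")
    head = HEAD_RE.match(line)
    if not head:
        return None
    rest = line[head.end():]
    groups = [
        (g.group(1), int(g.group(2)) if g.group(2) else None)
        for g in GROUP_RE.finditer(rest)
    ]
    return {
        "kanji": head.group(1),
        "dictionary_meaning": head.group(2),  # only imported for user dictionaries
        "user_meaning": head.group(3),
        "groups": groups,  # entry index is None when it is missing
    }
```

Groups come back in file order, so a caller could sort the ones that have an entry index and append the rest in the order they appear.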

Last update: May 13, 2013