Posts Tagged ‘JMDict’

Data handling in zkanji mini-series, Part I.

December 18, 2012

I usually put a disclaimer at the front of this kind of blog entry to make it clear that what I write about is not for the technically challenged, but this time I will skip that part. Anyone who can read will notice the title and run away screaming anyway. The contents of this mini-series won’t be anything flashy. I will probably not add images unless they cannot be avoided, so if you are here for entertainment, I have to say sorry.

This first part is an introduction to what makes up the zkanji dictionary, how it started out, and what JMDict is anyway. I tend to forget what I deem unimportant, so please don’t expect me to go into detail about my first mistakes.

If you have used the program a little, you know that it has a dictionary (obviously), some data about 6355 kanji, stroke order diagrams (kanji with stroke order and animation) and a great many example sentences. Apart from what is built in, it can group kanji and words for the user, and it can even build a new dictionary from scratch if that’s your hobby. Of course this is not everything, because there are all kinds of data in the dictionary that cannot be summed up in a few words.

The data in the dictionary comes from a huge XML database (a large file full of text) called JMDict. I don’t know the exact details, but the collection of this data started in ancient times, and saying that making it possible was a huge undertaking is a slight understatement. Go and say your thanks to the people making it possible. (Finding their addresses is your homework.) The data file looks something like this:

<entry>
<ent_seq>1183090</ent_seq>
<k_ele>
<keb>恩</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf18</ke_pri>
</k_ele>
<r_ele>
<reb>おん</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf18</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>favour</gloss>
<gloss>favor</gloss>
<gloss>obligation</gloss>
<gloss>debt of gratitude</gloss>
</sense>
</entry>

(A little excerpt from JMDict)

It might not look like something interesting, but apart from the JLPT data, everything in the dictionary comes from entries like this. You probably didn’t know, but the origin of the data is not this XML file. Even if the original can be found somewhere online, I have no idea where, and because the XML holds everything necessary and for free, it is not important.

Let me tell you a secret. (XML fans will be shocked.) The code that I wrote to process this data cannot speak XML. It simply looks for text like <entry> or <k_ele> etc., and if it recognizes something it tries to read as much as possible. There are states like “reading the kanji” or “the next lines are probably the meanings of the word”, till it sees another line it can recognize and then it goes to the next state. If I used some library to recognize the XML tags and tried to get the data from what it converts the text into, the program would probably run 10 times slower and there wouldn’t be any benefit at all. Instead my script spits out the same data in another format, but what it outputs is already processed and sorted. When the zkanji database is built from this pre-chewed data, there is not much left to do.
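To make the idea concrete, here is a minimal Python sketch of that kind of line-by-line, state-based reading. It is not the actual zkanji script: the names `inner_text` and `parse_entries`, the state names and the layout of the resulting dictionaries are made up for illustration, and the sketch assumes every tag sits on its own line, as in the excerpt above.

```python
def inner_text(line, tag):
    """Return the text between <tag> and </tag> on a single line, or None."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    if line.startswith(open_tag) and line.endswith(close_tag):
        return line[len(open_tag):len(line) - len(close_tag)]
    return None


def parse_entries(lines):
    """Yield one dict per <entry> using plain string matching, no XML parser."""
    entry, state = None, "outside"
    for raw in lines:
        line = raw.strip()
        if line == "<entry>":
            entry, state = {"kanji": [], "kana": [], "senses": []}, "entry"
        elif entry is None:
            continue  # ignore everything outside an entry
        elif line == "</entry>":
            yield entry
            entry, state = None, "outside"
        elif line == "<k_ele>":
            state = "kanji"  # "reading the kanji"
        elif line == "<r_ele>":
            state = "kana"
        elif line == "<sense>":
            state = "sense"  # the next lines are probably meanings
            entry["senses"].append({"pos": [], "gloss": []})
        elif state == "kanji" and inner_text(line, "keb") is not None:
            entry["kanji"].append(inner_text(line, "keb"))
        elif state == "kana" and inner_text(line, "reb") is not None:
            entry["kana"].append(inner_text(line, "reb"))
        elif state == "sense" and inner_text(line, "pos") is not None:
            entry["senses"][-1]["pos"].append(inner_text(line, "pos"))
        elif state == "sense" and inner_text(line, "gloss") is not None:
            entry["senses"][-1]["gloss"].append(inner_text(line, "gloss"))
        # any other line (priorities, closing tags, ...) is simply skipped
```

Lines the sketch does not recognize are skipped, which is the whole point of the approach: the reader only has to know the handful of tags it cares about.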

Here are a few lines of the output of the script that does the work:

明かり
あかり akari
6785
2
light, illumination, glow, gleam
p&n;
lamp, light
p&n;
上がる
あがる agaru
6492
23
to rise, to go up, to come up, to ascend, to be raised
p&v-i;p&v-u;
to enter (esp. from outdoors), to come in, to go in
p&v-i;p&v-u;
to enter (a school), to advance to the next grade
p&v-i;p&v-u;
to get out (of water), to come ashore
p&v-i;p&v-u;
[...]
(19 more meanings for 上がる)

The first two lines are the word’s written form (with kanji) and its reading in kana and romaji. The two numbers are the frequency of the word, followed by the number of meanings. Each meaning takes up 2 lines: the first is the text of the meaning, and the strange codes in the second give the types of the meaning. For example “p&v-i;” means intransitive verb. (I would probably do it differently if I started today, but this format hasn’t changed since the very beginning.) The words in this output are in alphabetical, or rather “kanabetic”, order. This much data wouldn’t be enough to build a great dictionary database, so here is another interesting file:
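As a rough illustration of how the word listing above could be read back, here is a small Python sketch. The `Word` class and the `read_words` function are hypothetical names, not part of zkanji, and the sketch assumes that the type codes on the second line of a meaning are separated by their trailing `;`.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    written: str    # written form, e.g. 明かり
    kana: str       # reading in kana
    romaji: str     # reading in romaji
    frequency: int
    meanings: list = field(default_factory=list)  # (text, [type codes]) pairs


def read_words(lines):
    """Read words from the processed listing: 4 header lines, then 2 lines per meaning."""
    it = (line.strip() for line in lines)
    words = []
    for written in it:
        if not written:
            continue  # tolerate blank lines between words
        kana, romaji = next(it).split()
        frequency = int(next(it))
        count = int(next(it))
        word = Word(written, kana, romaji, frequency)
        for _ in range(count):
            text = next(it)
            # assumption: codes such as "p&v-i;p&v-u;" are split on ";"
            codes = [code for code in next(it).split(";") if code]
            word.meanings.append((text, codes))
        words.append(word)
    return words
```

Because the number of meanings is stored before the meanings themselves, the reader never has to guess where one word ends and the next begins.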

'n
135942
'na
6048
23369
34670
50400
'na'
91400
17791
76159
116849
12976
55298
19529
101866
87619
130163
130235
72796
135224
64219
13730
21184
74780
11502
130317
'naa
154117
154118
154119
158118
40812

You might remember the first entry I wrote a long time ago about how the data is kept in memory so that fast dictionary look-ups are possible. The data is in a tree structure, so looking up words starting with specific letters is easy. The file part shown above is made up of starting letters, each followed by the indexes of all the words under them in the tree structure. There are 4 such files: one for word meaning look-up, one for kana, another for words written backwards in kana (so looking up words by how they end is possible), and the last for the kanji meanings, used for look-ups in the kanji list. zkanji takes the processed data in this format and outputs zdict.zkj.
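A small Python sketch may help to picture how such an index file can be used. It is only an approximation: zkanji keeps the data in a tree, while this sketch uses a flat dictionary, the names `read_index` and `find_by_prefix` are invented here, and it assumes that any line made up only of digits is a word index and every other line is a key.

```python
def read_index(lines):
    """Group word indexes under the key line that precedes them."""
    index = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.isdigit():
            if current is None:
                raise ValueError("word index found before any key")
            index[current].append(int(line))
        else:
            current = line
            index.setdefault(current, [])
    return index


def find_by_prefix(index, prefix):
    """Collect the word indexes stored under every key that starts with prefix."""
    found = []
    for key in sorted(index):
        if key.startswith(prefix):
            found.extend(index[key])
    return found
```

For the file holding words written backwards, the same look-up would presumably work on the reversed query (`query[::-1]` in Python), which is how searching by word ending becomes a plain prefix search.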

The next time I’ll (probably) write more about what is in zdict.zkj (kanji data, JLPT levels), and if it doesn’t turn out to be such a long entry, I might explain a bit about examples.zkj as well.

Categories: Under-the-hood

When the dictionary is updated…

April 27, 2012

The words you might have in groups and tests are often changed in the JMDict project, so there should be a way to control how the English dictionary is updated. Another reason is that I want to add new features to zkanji that are more sensitive to such changes. I will soon release a beta tester version of the program which starts with a new dialog asking the user to check dictionary changes, in the hope that somebody will look at it and comment. (If nobody does, you will get it unchanged. This is a warning :D)

This is the dialog that is shown on startup if the program detects changes that might affect your groups or tests. The items shown in the window are changes that happened in the JMDict project since January. As you can see, the word ちゃんと was considerably modified, and if you had it in a group and updated with a previous version of zkanji, you would be in for a surprise, as the “perfectly, properly, exactly” definition would have been changed to “diligently, seriously, earnestly, …”, which does not quite mean the same thing.

From the next zkanji you will be able to do the following:

  1. Use copy – This copies the word definitions untouched, overwriting the entry in the updated dictionary, so it will still have the old word definitions.
  2. Remove word data – If you decide that it isn’t worth the trouble, you can simply throw out anything related to this word from your groups and tests. The new dictionary will keep the updated entry though.
  3. [Meanings that were in groups or tests and need change] and
  4. [Meanings of the same word in the updated dictionary] – You can go through all meanings that need change in 3. and select the corresponding meaning you want in the updated data from 4.
  5. Once you have made your choices, click “Next word >>” and they are registered.
  6. There is also an “Abort” button (unnumbered in the picture). If you want, you will be able to skip this update and keep the old data. Be aware, though, that this means you will keep using the old English dictionary, and this dialog will be shown again the next time you start the program.

This is fine for words that can be found in the updated dictionary, but in some cases the words are changed in a way that the program cannot find the corresponding entry.

For example the word “bucket” was written as 馬穴 in the original English data. The new dictionary doesn’t have that word with such kanji, only with a written form of バケツ (same as its kana pronunciation). Because zkanji recognizes words by [written form]+[kana pronunciation], it will think that this word is not in the new dictionary, and if this were an older version, it would simply remove all traces of the word from any groups and tests the user added it to. In the next version you will be able to find another word in the dictionary that you think matches closely enough, and then press the “Select” button. Once you do that you will be presented with the previous page of meanings to select their corresponding definitions.
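The matching problem can be sketched in a few lines of Python. This is only an illustration of comparing words by [written form]+[kana pronunciation]; the function name `find_unmatched` and the dictionary fields `written` and `kana` are made up for the example and do not come from zkanji.

```python
def find_unmatched(user_words, new_dictionary):
    """Return the user's words whose (written form, kana) pair is missing from the new dictionary."""
    new_keys = {(word["written"], word["kana"]) for word in new_dictionary}
    return [
        word for word in user_words
        if (word["written"], word["kana"]) not in new_keys
    ]


# The "bucket" case: same reading, but the written form changed.
old_words = [{"written": "馬穴", "kana": "バケツ"}]
new_words = [{"written": "バケツ", "kana": "バケツ"}]
unmatched = find_unmatched(old_words, new_words)  # the 馬穴 entry is reported
```

Words reported this way are the ones for which the user has to pick a replacement by hand.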

Only words that need user interaction will be listed here, so hopefully there won’t be more than 2-3 words needing an update at a time. There are currently 13 in this beta, which piled up over 3 months, and I had all N3 marked words in groups, so it is not that much.

I believe this update is so important for future development that, once it is released, I recommend everyone using zkanji to download it. Not this version, but the one coming after it won’t run with your old user data! There is a lot of junk code in there that was kept for compatibility reasons, and I want to get rid of all of it.

JMDict going downhill? – UPDATED

September 6, 2011

JMDict is a dictionary database. That is, a list of words in English (+some other languages), and their Japanese equivalent. This data is free for non-commercial use, maybe the only such big dictionary project for Japanese which is free.

I thought of writing something satirical about JMDict, the dictionary database behind zkanji, but I probably couldn’t express myself so well as to make it even slightly funny. JMDict is used in 99.999% of freely available online dictionaries and free dictionary programs, and as such, gives a lot to the students of the Japanese language all over the world. The trouble is, it is a bit difficult to criticize it, as many people think of it as The Authoritative Japanese Dictionary Database.

Unfortunately JMDict has many small glitches. Many English word definitions in it are ambiguous, and anybody who doesn’t know how to use those words will definitely make mistakes. You can say that this is unavoidable in a dictionary, and I agree. The problem comes up when the definition could be better with simply a different choice of words. For example, the English definition they give might fit only 10% of all translations, while in all other cases there is a better, more general translation which is not given in the definition. I also started to wonder, when I saw words with an explanation as part of their definitions, why it is not possible to add explanations to others. Because of this, whenever I come across something ambiguous, I use a Japanese-Japanese dictionary like goo 辞書 or SpaceALC’s 英辞郎 (which is not exactly a dictionary, but still great and very useful).

This doesn’t mean that JMDict would be full of such problems, and at every corner vicious ambiguousness and incorrect translations await the unsuspecting student. (Maybe it DOES mean that, but only for less common words. There are more than 150,000 unique definitions in the database after all.) Another positive aspect is that anyone can send in corrections to words. I have done it myself several times already.

I’m writing this rant right now because this time it seems the maintainers made a mistake they don’t want to admit, and it left me slightly dumbfounded. The word 練る is a godan (or -u) verb. That means when the word is inflected, the る “changes” to another syllable from the r row of the kana table, or in the past tense it is replaced by った. But if we take a look at the verb inflection table at WWWJDIC (the “official” online dictionary that uses JMDict), it says the past tense of 練る is 練た, which is wrong. What is strange is that originally 練る was marked correctly as a -u verb, but not so long ago it was changed to ichidan (or -ru dropping). I sent in a correction for this 3 days ago, and the answer? Well, I didn’t specify an e-mail address (and even if I had, I doubt they would have sent an answer), and the correction was simply refused and deleted from the system without a trace. It might be taken into consideration and we might see some change in the future, but for now it seems they simply removed it.
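The difference the verb class makes can be shown with a tiny Python sketch. It covers only verbs ending in る and only the plain past tense; the function name `past_tense_ru` and the class labels are chosen here for illustration and are not taken from JMDict or WWWJDIC.

```python
def past_tense_ru(verb, verb_class):
    """Plain past tense of a verb ending in る, depending on its class."""
    if not verb.endswith("る"):
        raise ValueError("this sketch only handles verbs ending in る")
    stem = verb[:-1]
    if verb_class == "godan":
        return stem + "った"  # る is replaced by った
    if verb_class == "ichidan":
        return stem + "た"    # る is dropped, た is added
    raise ValueError(f"unknown verb class: {verb_class}")


correct = past_tense_ru("練る", "godan")      # 練った
mislabeled = past_tense_ru("練る", "ichidan")  # 練た, the wrong form described above
```

So a single wrong class marker in the data is enough to produce a wrong inflection table for the whole verb.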

This is just a single case, but it makes me wonder what the future of JMDict will be like if things go on like this. The creator and original maintainer of JMDict is Jim Breen, though he is probably no longer the one doing the maintenance. I’m considering sending him an e-mail explaining this situation, but I want to wait a few days to see if they correct this mistake or not.

UPDATE: I checked again today, and the verb information is now corrected. Either one of the maintainers is reading my blog 🙂 or it simply took this long for the changes in the database to appear on the dictionary server. In any case, I wish they would only remove the page of a suggested correction after the result is visible.

Categories: Rant