Data handling in zkanji mini-series, Part I.

I usually write disclaimers at the front of this kind of blog entry so it is clear that what I write about is not for the technically challenged, but this time I will skip that part. Anyone who can read will notice the title and run away screaming anyway. The contents of this mini-series won’t be anything flashy. I will probably not add images unless they cannot be avoided, so if you are here for entertainment, I have to say sorry.

This first part is an introduction to what makes up the zkanji dictionary, how it started out, and what JMDict is anyway. I tend to forget what I deem unimportant, so please don’t expect me to go into detail about my first mistakes.

If you have used the program a little, you know that it has a dictionary (obviously), some data about 6355 kanji, stroke order diagrams (kanji with stroke order and animation) and many, many example sentences. Apart from what is built in, it can group kanji and words for the user, and it can even build a new dictionary from scratch if that’s your hobby. Of course this is not everything, because there are all kinds of data in the dictionary that cannot be summed up in a few words.

The data in the dictionary comes from a huge XML database (a large file full of text) called JMDict. I don’t know the exact details, but the collection of this data started in ancient times, and saying that making it possible was a huge undertaking is a slight understatement. Go and say your thanks to the people making it possible. (Finding their addresses is your homework.) The data file looks something like this:

<entry>
<ent_seq>1183090</ent_seq>
<k_ele>
<keb>恩</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf18</ke_pri>
</k_ele>
<r_ele>
<reb>おん</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf18</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>favour</gloss>
<gloss>favor</gloss>
<gloss>obligation</gloss>
<gloss>debt of gratitude</gloss>
</sense>
</entry>

(A short excerpt from JMDict)

It might not look interesting, but apart from the JLPT data, everything in the dictionary comes from entries like this. You probably didn’t know, but this XML file is not where the data originates. Even if the original can be found somewhere online, I have no idea where, and because the XML holds everything necessary and is free, it is not important.

Let me tell you a secret. (XML fans will be shocked.) The code that I wrote to process this data cannot speak XML. It simply looks for text like <entry> or <k_ele> etc., and if it recognizes something it tries to read as much as possible. There are states like “reading the kanji” or “the next lines are probably the meanings of the word”, till it sees another line it can recognize and then it goes to the next state. If I used some library to recognize the XML tags and tried to get the data from what it converts the text into, the program would probably run 10 times slower and there wouldn’t be any benefit at all. Instead my script spits out the same data in another format, but what it outputs is already processed and sorted. When the zkanji database is built from this pre-chewed data, there is not much left to do.
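
To make the idea a bit more concrete, here is a rough Python sketch of that kind of line-by-line reading. It is not the actual script, only an illustration of the approach: the function name scan_entries, the state names and the shape of the result are all made up for this example, and it assumes every tag sits on its own line, as in the excerpt above.

```python
import re

# Matches one whole element on a line, e.g. "<keb>恩</keb>" -> ("keb", "恩").
ELEMENT = re.compile(r"<(\w+)>(.*)</\1>")


def scan_entries(lines):
    """Yield one dict per <entry> block without using an XML parser.

    Hypothetical helper: assumes one tag per line, as in the excerpt.
    """
    state = "outside"  # which block the scanner believes it is in
    entry = None
    for raw in lines:
        line = raw.strip()
        if line == "<entry>":
            entry = {"kanji": [], "readings": [], "senses": []}
            state = "entry"
        elif entry is None:
            continue  # ignore anything outside an <entry> block
        elif line == "</entry>":
            yield entry
            entry, state = None, "outside"
        elif line == "<k_ele>":
            state = "kanji"
        elif line == "<r_ele>":
            state = "reading"
        elif line == "<sense>":
            entry["senses"].append({"pos": [], "glosses": []})
            state = "sense"
        elif line in ("</k_ele>", "</r_ele>", "</sense>"):
            state = "entry"
        else:
            match = ELEMENT.fullmatch(line)
            if not match:
                continue  # unrecognized line: skip it and keep the state
            tag, text = match.groups()
            if state == "kanji" and tag == "keb":
                entry["kanji"].append(text)
            elif state == "reading" and tag == "reb":
                entry["readings"].append(text)
            elif state == "sense" and tag == "pos":
                entry["senses"][-1]["pos"].append(text)
            elif state == "sense" and tag == "gloss":
                entry["senses"][-1]["glosses"].append(text)
```

Fed the excerpt above, this sketch would produce a single entry with 恩 as the kanji, おん as the reading, and one sense holding the four glosses. Lines it does not care about (the sequence number and the priority tags here) are simply skipped.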

Here are a few lines of the output of the script that does the work:

明かり
あかり akari
6785
2
light, illumination, glow, gleam
p&n;
lamp, light
p&n;
上がる
あがる agaru
6492
23
to rise, to go up, to come up, to ascend, to be raised
p&v-i;p&v-u;
to enter (esp. from outdoors), to come in, to go in
p&v-i;p&v-u;
to enter (a school), to advance to the next grade
p&v-i;p&v-u;
to get out (of water), to come ashore
p&v-i;p&v-u;
[...]
(19 more meanings for 上がる)

The first two lines are the word written with kanji and its reading (in kana and in romaji). The two numbers are the frequency of the word and then the number of meanings. Each meaning is made up of 2 lines: the first is the text of the meaning, and those strange codes in the second tell the type of the meaning. For example “p&v-i;” means intransitive verb. (I would probably do it differently if I started today, but this format hasn’t changed since the very beginning.) The words in this output are in alphabetical, or rather “kanabetic” order. This much data wouldn’t be enough to build a great dictionary database, so here is another interesting file:

'n
135942
'na
6048
23369
34670
50400
'na'
91400
17791
76159
116849
12976
55298
19529
101866
87619
130163
130235
72796
135224
64219
13730
21184
74780
11502
130317
'naa
154117
154118
154119
158118
40812

You might remember the first entry I wrote a long time ago about how the data is kept in memory so that fast dictionary look-ups are possible. The data is in a tree structure, so looking up words starting with specific letters is easy. The part of the file shown above is made up of starting letters followed by the indexes of all the words under those letters in the tree structure. There are 4 such files: one for word meaning look-up, one for kana, another for words written backwards in kana (so looking up words by how they end is possible), and the last for the kanji meanings, used for look-ups in the kanji list. zkanji reads the processed data in this format and outputs zdict.zkj.
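
For readers who have not met this kind of tree before, here is a small Python sketch of the general technique. It is deliberately simplified and is not the structure zkanji uses internally: the class PrefixTree, the function ending_with and the tiny word list are made up for illustration, and this version stores one character per node, which is an assumption of the sketch rather than something the index file above shows.

```python
class PrefixTree:
    """Stores word indexes under their spelling, one character per node."""

    def __init__(self):
        self.children = {}  # next character -> PrefixTree
        self.indexes = []   # indexes of words that end exactly at this node

    def add(self, key, index):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, PrefixTree())
        node.indexes.append(index)

    def starting_with(self, prefix):
        # Walk down to the node for the prefix, if there is one.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every index stored at or below that node.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            found.extend(current.indexes)
            stack.extend(current.children.values())
        return sorted(found)


# Made-up word list; the position in the list plays the role of the word index.
words = ["agaru", "akari", "nagai", "naka", "sakana"]

forward, backward = PrefixTree(), PrefixTree()
for i, w in enumerate(words):
    forward.add(w, i)
    backward.add(w[::-1], i)  # reversed copy makes look-up by ending possible


def ending_with(suffix):
    """Find words by their ending using the tree of reversed words."""
    return backward.starting_with(suffix[::-1])
```

With this list, forward.starting_with("na") finds the indexes of "nagai" and "naka", and ending_with("ka") finds "naka" through the reversed tree. That is the whole reason for keeping a second copy of the words written backwards.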

Next time I’ll (probably) write more about what is in zdict.zkj (kanji data, JLPT levels), and if it doesn’t turn out to be too long an entry, I might explain a bit about examples.zkj as well.
