Generating the N3 word list

I’m making progress on an N3 list that can point students in the right direction, but be aware that it won’t be a definitive list, just an “opinion”. I have more or less finished writing the algorithm that creates an initial list from the more frequent words, and I’m about to explain how it works. I have seen how others generated their lists and, as I wrote in a previous post, I wasn’t convinced.

 

For example, www.jlptstudy.com included in its N3 list the kanji that are in Jouyou grades 1 to 4 but not in N4 or N5 (3 and 4kyuu). This gives a believable kanji count, but JLPT kanji had no direct connection to the Jouyou grades in the old system, so why would they have one now? (Jouyou are the kanji Japanese children learn in school, and the Jouyou grades correspond to the school years.)

I asked the author of www.tanos.co.uk about his list of N3 words, and he told me that it was generated from the old 2kyuu word list, with the decisions based on the Tanaka Corpus, the same example sentence data zkanji uses. (The Tanaka Corpus is a collection of sentences put together with help from many people and still being revised by enthusiasts. It was meant to help students see how the words in the dictionary are actually used, not to serve as the basis for a study plan.) I don’t know whether kanji were taken into consideration when picking the words, but the words on the page contain 1073 kanji (or 1305 if higher levels are included), which is impressive if we consider that it’s near the number required for N2. (N2 words have 1633 kanji, though probably only around 1200 are really required for the JLPT.)

 

Now, I don’t want to say that my method is better or more reliable, but it’s only fair to tell you how it works so you know what to expect, without getting into technical details nobody really cares about.

The first step is to create an order of all kanji. The order is based on many things: kanji frequency from KANJIDAT, the number of words the kanji is in, the frequency of those words, the number of example sentences for those words, and so on. These are all weighted; for example, I don’t consider the example sentence count too important. I change the weights until I get an order I like. BUT this is not the N3 kanji list, as it contains kanji from all levels.
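
To make the idea of a weighted order more concrete, here is a minimal Python sketch. It is not the actual program: the field names, the weight values and the min-max scaling are all assumptions made for illustration, and every feature is assumed to be expressed so that a larger number means "more common".

```python
from dataclasses import dataclass


@dataclass
class KanjiStats:
    kanji: str
    kanji_frequency: float  # how often the kanji itself occurs (larger = more common)
    word_count: int         # number of words the kanji is in
    word_frequency: float   # combined frequency of those words
    example_count: int      # example sentences for those words


# Hypothetical weights; example sentences deliberately count the least.
WEIGHTS = {
    "kanji_frequency": 0.4,
    "word_count": 0.25,
    "word_frequency": 0.25,
    "example_count": 0.1,
}


def min_max(values):
    """Scale values to the range 0..1 so features with different units can be mixed."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def order_kanji(stats, weights=WEIGHTS):
    """Return the kanji sorted from highest to lowest weighted score."""
    if not stats:
        return []
    columns = {name: min_max([getattr(s, name) for s in stats]) for name in weights}
    scored = []
    for i, s in enumerate(stats):
        score = sum(weights[name] * columns[name][i] for name in weights)
        scored.append((score, s.kanji))
    # Highest score first; ties are broken by the character itself for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [kanji for _, kanji in scored]
```

Changing the numbers in `WEIGHTS` and re-running `order_kanji` is the "change these weights until I get an order I like" loop in miniature.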

In the second step I create an order of all words, but this time I only include those that were in the old 2kyuu list (new N2), because that’s the only official data I have. The order is again based on weighted parameters: the average position of the word’s kanji in the previously generated list, the word frequency, the average old JLPT level of the kanji, and finally the example sentence count (once again not given too much weight). This is still NOT the N3 word list.
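
The two kanji-based parameters could be computed along these lines. This is only a sketch under assumptions: the function and argument names are hypothetical, kana-only words are simply skipped, kanji missing from the old level table are treated as level 1, and "kanji" means any character in the basic CJK Unified Ideographs block.

```python
from statistics import mean


def is_kanji(ch):
    """Simplified check: basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def candidate_words(words, old_2kyuu_words, kanji_position, old_kanji_level):
    """Return (word, average kanji position, average old kanji level) rows.

    words            -- all dictionary words to consider
    old_2kyuu_words  -- set of words on the old 2kyuu list
    kanji_position   -- dict: kanji -> index in the kanji order from step one
    old_kanji_level  -- dict: kanji -> old JLPT level (4 = easiest, 1 = hardest)
    """
    rows = []
    for word in words:
        if word not in old_2kyuu_words:
            continue  # only words from the old 2kyuu list take part
        kanji = [ch for ch in word if is_kanji(ch)]
        if not kanji:
            continue  # assumption: kana-only words are handled elsewhere
        avg_position = mean(kanji_position[k] for k in kanji)
        # Assumption: kanji absent from the old lists count as level 1.
        avg_level = mean(old_kanji_level.get(k, 1) for k in kanji)
        rows.append((word, avg_position, avg_level))
    return rows
```

These rows would then be combined with the word frequency and the example sentence count in a weighted score, in the same spirit as the kanji order from step one.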

In the third and final step, the program goes over the generated word list in order, collecting the kanji that were old 2kyuu (or N2, everyone has learned these numbers by now) until it reaches a set amount (365 currently; together with the 3 and 4kyuu kanji, the sum is 649). I consider the kanji and words collected up to that point to be N3, but the algorithm doesn’t stop there: it keeps collecting words, but only those whose kanji are already in my decided N3 kanji list. This way I can generate N3 kanji and N3 word lists that are connected to each other.
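
The two phases of this step can be sketched roughly as follows. Again, this is an illustration and not the real code: the names are hypothetical, the last word of the first phase may push the kanji count slightly past the target, and in the second phase I assume that kanji from lower levels (3 and 4kyuu) are always allowed.

```python
def is_kanji(ch):
    """Simplified check: basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def collect_n3(ordered_words, old_2kyuu_kanji, lower_kanji, target):
    """Split off N3 kanji and N3 words from the ordered word list.

    ordered_words   -- words from step two, best candidates first
    old_2kyuu_kanji -- set of kanji that were new at old 2kyuu
    lower_kanji     -- set of 3 and 4kyuu kanji (assumed always acceptable)
    target          -- how many old 2kyuu kanji to collect (365 in the post)
    """
    n3_kanji = set()
    n3_words = []
    for word in ordered_words:
        kanji = [ch for ch in word if is_kanji(ch)]
        if len(n3_kanji) < target:
            # Phase one: take every word and collect its old 2kyuu kanji.
            n3_kanji.update(k for k in kanji if k in old_2kyuu_kanji)
            n3_words.append(word)
        elif all(k in lower_kanji or k in n3_kanji for k in kanji):
            # Phase two: the kanji list is fixed; only words that fit it are added.
            n3_words.append(word)
    return n3_kanji, n3_words
```

With toy data and a target of 4, the words 政治 and 経済 fill the kanji list, 政府 and 複雑 are skipped because 府, 複 and 雑 were never collected, and a later word like 経験 is still accepted because 経 is already in the list.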

 

But this is as far as automatic algorithms can go. The fourth and really final step is manually going over all N3 and N2 words and changing their levels if I decide that they were put in the wrong place. I mainly base my opinion on intuition, but the sample N3 test concentrates on everyday topics like study and work, so I can pay attention to words that might come up in such topics. After the final manual decisions are made, I’ll compute a new N3 kanji list based on those words. Then I can mark the words that stay in the N3 vocab list even though their kanji are not in that list as “don’t test kanji for this word on this level”, and the work is done.
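
The marking at the end could look something like this. It is a guess at the mechanics, with hypothetical names: `testable_kanji` is assumed to hold the final N3 kanji plus all lower-level kanji, and a word gets `False` when at least one of its kanji falls outside that set.

```python
def is_kanji(ch):
    """Simplified check: basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def mark_kanji_testing(n3_words, testable_kanji):
    """Map each N3 word to True if its kanji may be tested at this level.

    False corresponds to "don't test kanji for this word on this level".
    Kana-only words come out as True simply because they have no kanji to fail on.
    """
    flags = {}
    for word in n3_words:
        kanji = [ch for ch in word if is_kanji(ch)]
        flags[word] = all(k in testable_kanji for k in kanji)
    return flags
```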

 

…No, it will only begin, because after that I’ll have to check each and every word and give it a different definition if I don’t like the one it already has. (That’s around 7000 words; all the others are either duplicates or the same word in “no kanji” / “kanji” versions.)
