After a three-month break I’m working on the JLPT list again. I have to check the definitions of the remaining 1600 items, which I can hopefully finish in a few weeks. The definitions will mainly come from the dictionary, but I want to shorten the longer ones. Once I’m done, I plan to release the list, though I don’t know yet which format would be best.
After some work on N3/N2 word placements I found that it makes no sense to keep the progress report on the sidebar. I went through a book written for N3 and marked most of the words I found in it. I don’t agree with some of its choices, but for the most part it’s OK. I’ll remove the mark from the words I can’t agree with, and when that’s done, it will take a little programming work to set the marked words (plus some similar words) as N3, and the final list will be done. I can hopefully post my N3 kanji list very soon, and then I’ll upload a new zkanji version as well, which will not only have the N3 kanji but will also indicate the N level of words. (The automatic word selection for the long-term study list will only come after that.)
I haven’t yet decided whether I want to release my version of the JLPT vocabulary list into the public domain, but unless someone asks to use it in their program, it doesn’t matter anyway. (Even less so before the list is done.)
UPDATE: The N3 kanji list is now final, but it seems that some important words are missing from the current JLPT vocabulary list, words that have most probably been part of the revised JLPT since 2010. These are mainly words that don’t have any kanji, for example インターネット, just to name one that was not part of the old list. The first missing word I noticed while checking the sample test on the official JLPT site was ホテル. For some reason it was not included in the previous vocabulary list, yet it may well be an N5 word.
In case you haven’t noticed, I have added a little report on my progress with the JLPT word list on the right side of this blog. It won’t be worth coming back to check the numbers every day though. If you make it a weekly visit, you might see some progress.
I want to finalize a meaning for every word, and decide for most words whether they belong in the N2 or the N3 word list. Because I automatically accepted the dictionary definition of a word whenever it was less than 45 characters long, it looks as if there has been great progress already, but the truth is that I can hardly check 100 words a day (no time and no patience). So it will still take quite some time.
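The automatic part of that check could look roughly like the sketch below. It is only an illustration: the function name, the data layout and the sample entries are made up for the example, and only the 45-character limit comes from the text above.

```python
MAX_AUTO_LENGTH = 45  # limit mentioned in the post


def split_by_definition_length(entries, limit=MAX_AUTO_LENGTH):
    """Split (word, definition) pairs into auto-accepted and to-review lists."""
    accepted, review = [], []
    for word, definition in entries:
        if len(definition) < limit:
            accepted.append((word, definition))
        else:
            review.append((word, definition))
    return accepted, review


# Made-up sample entries, not taken from the real list.
entries = [
    ("ホテル", "hotel"),
    ("掛ける", "to hang; to put on; to sit down; to make a phone call; to multiply"),
]
accepted, review = split_by_definition_length(entries)
print(len(accepted), "accepted,", len(review), "left to shorten by hand")
```

Everything that lands in `review` is what still has to be shortened manually.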
I’m making progress in creating an N3 list that can point students in the right direction, but be aware that it won’t be a definitive list, just an “opinion”. I have more or less finished writing the algorithm to create an initial list from the more frequent words, and I’m about to explain how it works. I have seen how others did the list generation and as I wrote in a previous post, I wasn’t convinced.
For example, www.jlptstudy.com put kanji in its N3 list that were from Jouyou grades 1 to 4 but not in N4 or N5 (3 and 4kyuu). This way it got a believable kanji count, but the JLPT kanji had no direct connection with the Jouyou grades in the old system, so why would they have one now? (The Jouyou kanji are the kanji Japanese children learn at school, and the Jouyou grades correspond to the school years.)
I asked the author of www.tanos.co.uk about his list of N3 words, and he told me that it was generated from the old 2kyuu word list, with the decisions based on the Tanaka Corpus, the example sentence data that zkanji uses too. (The Tanaka Corpus is a collection of sentences made with help from many people and still being revised by enthusiasts. It was meant to help students see how the words in the dictionary are actually used, not to serve as the basis of a study plan.) I don’t know whether kanji were taken into consideration when picking the words, but the words on the page contain 1073 kanji (or 1305 if higher levels are included), which is impressive if we consider that it’s near the number required for N2. (N2 words have 1633 kanji, though probably only around 1200 are really required at the JLPT test.)
Now I don’t want to say that my method is better or more reliable, but it’s only fair to tell you about how it works so you know what to expect, without getting into technical details nobody really cares for.
The first step is to create an order of all kanji. The order is based on many things: kanji frequency from KANJIDAT, the number of words the kanji is in, the frequency of those words, the number of example sentences for those words, and so on. These are all weighted; for example, I don’t consider the example sentence count too important. I change the weights until I get an order I like. BUT, this is not the N3 kanji list, as it contains kanji from all levels.
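To give an idea of what such a weighted ordering means, here is a minimal sketch. It is not the actual zkanji code: the weight values, the feature names and the sample figures are invented, and every feature is assumed to be already scaled to the range 0 to 1, with higher meaning "more common".

```python
# Hypothetical weights; the post only says that the example sentence
# count is not considered too important.
KANJI_WEIGHTS = {
    "kanji_frequency": 0.4,
    "word_count": 0.3,
    "word_frequency": 0.25,
    "example_sentences": 0.05,
}


def kanji_score(features, weights=KANJI_WEIGHTS):
    """Weighted sum of the features of a single kanji."""
    return sum(weights[name] * features[name] for name in weights)


def order_kanji(stats, weights=KANJI_WEIGHTS):
    """Return the kanji sorted from highest to lowest score."""
    return sorted(stats, key=lambda k: kanji_score(stats[k], weights), reverse=True)


# Invented feature values, for illustration only.
stats = {
    "鬱": {"kanji_frequency": 0.05, "word_count": 0.02, "word_frequency": 0.05, "example_sentences": 0.9},
    "日": {"kanji_frequency": 1.0, "word_count": 0.9, "word_frequency": 0.9, "example_sentences": 0.8},
    "曜": {"kanji_frequency": 0.3, "word_count": 0.1, "word_frequency": 0.4, "example_sentences": 0.5},
}
kanji_order = order_kanji(stats)
print(kanji_order)
```

Changing the numbers in `KANJI_WEIGHTS` and sorting again is the "change the weights until I like the order" part.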
In the second step I create an order of all words, but this time I only include those that were in the old 2kyuu list (new N2), because that’s the only official data I have. The order is based on weighted parameters again: the average position of the word’s kanji in the previously generated order, the word frequency, the average old JLPT level of its kanji, and finally the example sentence count (once again not given too much weight). This is still NOT the N3 word list.
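A sketch of how such a word score could be put together follows. Again, the weights and sample data are invented. Treating words without kanji as "easiest", scaling the old levels by dividing by 4, and recognizing kanji by the basic CJK Unicode block are all assumptions made for the example, not something the real program is known to do.

```python
# Hypothetical weights, chosen only for the example.
WORD_WEIGHTS = {
    "kanji_order": 0.4,
    "word_frequency": 0.35,
    "kanji_level": 0.2,
    "example_sentences": 0.05,
}


def is_kanji(ch):
    # Assumption: "kanji" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def mean_or(values, default):
    values = list(values)
    return sum(values) / len(values) if values else default


def word_score(word, info, position, total, old_level, weights=WORD_WEIGHTS):
    """Higher score means the word comes earlier in the order."""
    kanji = [ch for ch in word if is_kanji(ch)]
    features = {
        # Early position in the kanji order gives a value close to 1.
        "kanji_order": mean_or((1 - position.get(k, total) / total for k in kanji), 1.0),
        "word_frequency": info["frequency"],
        # Old levels run from 4 (easiest) to 1; 0 means "not a JLPT kanji".
        "kanji_level": mean_or((old_level.get(k, 0) / 4 for k in kanji), 1.0),
        "example_sentences": info["examples"],
    }
    return sum(weights[name] * features[name] for name in weights)


def order_words(words, kanji_order, old_level):
    position = {k: i for i, k in enumerate(kanji_order)}
    total = len(kanji_order)
    return sorted(
        words,
        key=lambda w: word_score(w, words[w], position, total, old_level),
        reverse=True,
    )


# Invented sample data.
kanji_order = ["日", "本", "曜", "鬱"]
old_level = {"日": 4, "本": 4, "曜": 3, "鬱": 1}
words = {
    "鬱": {"frequency": 0.1, "examples": 0.2},
    "曜日": {"frequency": 0.6, "examples": 0.5},
    "日本": {"frequency": 0.9, "examples": 0.8},
}
word_order = order_words(words, kanji_order, old_level)
print(word_order)
```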
In the third and final step, the program goes over the generated word list in order, collecting the kanji that were old 2kyuu (or N2, everyone has learned these numbers by now) until it reaches a set amount. (365 currently; together with the 3 and 4kyuu kanji the sum is 649.) I consider the collected kanji and words N3, but the algorithm doesn’t stop there. It keeps collecting words, but only those whose kanji are already in my decidedly N3 kanji list. This way I can generate an N3 kanji list and an N3 word list that are connected to each other.
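The collecting step can be sketched in a few lines of Python. This is a guess at the logic from the description above, not the real code. In particular, skipping words that contain kanji from outside the old 2kyuu set, and never going over the target, are assumptions. The sample data uses a target of 3 instead of 365 to keep it short.

```python
def is_kanji(ch):
    # Assumption: "kanji" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def collect_n3(ordered_words, easier_kanji, n2_kanji, target=365):
    """Walk the ordered word list and return (n3_kanji, n3_words)."""
    n3_kanji, n3_words = set(), []
    for word in ordered_words:
        kanji = {ch for ch in word if is_kanji(ch)}
        new = kanji - easier_kanji - n3_kanji
        if not new:
            # Every kanji is already known, so the word can always be taken.
            n3_words.append(word)
        elif new <= n2_kanji and len(n3_kanji) + len(new) <= target:
            # Still room in the kanji budget: take the word and its kanji.
            n3_kanji |= new
            n3_words.append(word)
    return n3_kanji, n3_words


# Invented sample data.
easier_kanji = {"日", "本"}              # stands in for the 3 and 4kyuu kanji
n2_kanji = {"曜", "経", "済", "験"}      # stands in for the old 2kyuu kanji
ordered_words = ["日本", "曜日", "経済", "経験", "済む"]

n3_kanji, n3_words = collect_n3(ordered_words, easier_kanji, n2_kanji, target=3)
print(sorted(n3_kanji), n3_words)
```

In this toy run 経験 is left out because 験 would be a fourth kanji, while 済む still gets in because 済 was already collected.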
But this is as far as automatic algorithms can go. The fourth and really final step is manually going over all N3 and N2 words and changing their levels if I decide that they were put in the wrong place. I mainly base my opinion on intuition, but the sample N3 test concentrates on everyday topics like study and work, so I can pay attention to words that might come up in such topics. After the final manual decision is made, I’ll compute a new N3 kanji list based on those words. Then I can mark the words that are in the N3 vocabulary list but contain kanji outside the N3 kanji list as “don’t test kanji for this word on this level”, and the work is done.
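The last marking step is simple enough to show as a sketch. The function name and the sample data are made up, and the kanji set passed in is assumed to already contain the easier N4/N5 kanji as well.

```python
def is_kanji(ch):
    # Assumption: "kanji" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def words_without_kanji_test(n3_words, known_kanji):
    """Words that stay in N3 but contain kanji outside the given kanji set."""
    return [
        word
        for word in n3_words
        if any(is_kanji(ch) and ch not in known_kanji for ch in word)
    ]


# Invented sample data.
known_kanji = {"日", "本", "曜"}
n3_words = ["曜日", "挨拶", "ホテル"]
flagged = words_without_kanji_test(n3_words, known_kanji)
print(flagged)
```

Words written only in kana, like ホテル, are never flagged because they have no kanji to test.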
…No, it will only begin, because after that I’ll have to check each and every word and give them a different definition if I don’t like what they already have. (around 7000 words – all the others are either duplicates or same word with “no kanji” / “kanji” versions)
I have reached the decision that I’ll either not have an N3 list of words and kanji (there are no reliable sources, those that are free are made up of guesses that are too wild for my taste), or I’ll make my own list based on kanji/word frequency data and my own wild guesses (=experience with the language, though only through the internet, TV and novels).
There is a slight problem with frequency data. It was based on frequency of words in newspapers, ignoring general usage (which is probably way too difficult to measure). Though that might be an advantage regarding the JLPT.
UPDATE: I don’t believe that the results would change drastically, so this poll is closed! If you missed the voting but would still like to share your thoughts on the question, please write a comment.
Although the program is progressing well, I have run into yet another problem with the available data. While trying to create my own N3 list of words (as I’ve decided not to blindly trust the available naive attempts), I identified all the kanji that appear in the words of each JLPT level. The result: total chaos, mainly in 1kyuu/N1.
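For the curious, this kind of count is easy to reproduce. The sketch below is not the code I ran; the function name, the "none" label and the tiny sample data are only for illustration, and kanji are recognized by the basic CJK Unicode block, which is an assumption (it ignores 々 and rare kanji outside that block).

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: "kanji" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def count_kanji_by_level(words, kanji_level):
    """Count the distinct kanji used in `words`, grouped by their JLPT level."""
    used = {ch for word in words for ch in word if is_kanji(ch)}
    return Counter(kanji_level.get(k, "none") for k in used)


# Illustrative data only; the levels are placeholders, not checked values.
kanji_level = {"日": "N5", "本": "N5", "経": "N2", "済": "N2", "憂": "N1"}
words = ["憂鬱", "日本", "経済", "済む"]
counts = count_kanji_by_level(words, kanji_level)
print(counts)
```

Kanji that are not in any level list end up under "none", which is how the non-JLPT kanji show up in the count.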
http://www.tanos.co.uk/, the site I borrowed the list of words from, used the lists (with all the mistakes in them) available at http://www.jlptstudy.com. The latter only has word lists up to 2kyuu/N2, but those were taken from official JLPT material, so they must be relatively good. (I passed JLPT 2kyuu, now N2, with them.)
But how trustworthy is the JLPT 1kyuu/N1 word list? I have never tried to study for N1 and I can only guess. So let’s just look at the facts. (You can skip the following few paragraphs if you are only interested in the final result.)
There are ~3450 words in the N1 list (not including the other levels, together the number would be around 9000).
In these 3450 words, 564 kanji are N1 kanji, 607 N2 kanji, 161 N4 kanji, 93 N5 kanji, and 210 kanji are not in any JLPT level (going by the old official lists, so the newly introduced N3 is not counted). The sum is 1425 JLPT kanji + 210 non-JLPT kanji. That is 1635 kanji used altogether. Officially there were 2230 JLPT kanji across all the levels. (The real number was lower, but the official JLPT kanji were changed over the years, and this 2230 includes them all.) So there are around 800 JLPT kanji missing, not used in the words of the N1 list. This is an interesting result, but we might be able to find the missing ones.
There are 480 kanji in words of lower levels, not used in words at N1, which leaves us with 320 kanji missing! We are talking about JLPT kanji, and yet they were not used in any JLPT word?
I have also counted that although only 564 N1 kanji are used in N1 words, there are another 199 N1 kanji that are only used in words at lower levels. So 763 N1 kanji are used across all the supposed JLPT words. But there should be 1207 N1 kanji. That makes 444 missing N1 kanji.
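If you want to follow the arithmetic so far, the few lines below simply redo the sums with the figures quoted above. Nothing here is new data, and the variable names are only for the example.

```python
# Kanji found in the N1 word list, by old JLPT level (figures from the post).
kanji_in_n1_words = {"N1": 564, "N2": 607, "N4": 161, "N5": 93}
non_jlpt_kanji = 210
official_jlpt_kanji = 2230
only_in_lower_level_words = 480
n1_kanji_only_in_lower_words = 199
official_n1_kanji = 1207

jlpt_kanji_used = sum(kanji_in_n1_words.values())
all_kanji_used = jlpt_kanji_used + non_jlpt_kanji
missing_from_n1_words = official_jlpt_kanji - jlpt_kanji_used
missing_from_all_words = missing_from_n1_words - only_in_lower_level_words
n1_kanji_used = kanji_in_n1_words["N1"] + n1_kanji_only_in_lower_words
n1_kanji_missing = official_n1_kanji - n1_kanji_used

print(jlpt_kanji_used)         # 1425
print(all_kanji_used)          # 1635
print(missing_from_n1_words)   # 805, the "around 800"
print(missing_from_all_words)  # 325, or 320 when starting from the rounded 800
print(n1_kanji_used)           # 763
print(n1_kanji_missing)        # 444
```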
If you compare the numbers, 480 JLPT kanji (from all levels) are not used in any JLPT word, while 444 N1 kanji are not used in any JLPT word. Which means that almost all the missing kanji are from N1, and that’s not a small number! If you also consider that there were 210 non-JLPT kanji in the list of N1 words, that’s enough to make anyone uncertain. I would rather not doubt the validity of the official 1kyuu/N1 kanji list, but there is no assurance about the validity of the unofficial N1 word list.
So once again, I have to find another site with a different N1 word list (or rather, several sites) just to make sure. Unfortunately this will slow down my progress quite a bit…