Independence of the Example Sentences

Despite all my efforts to create an example sentences database that does not depend on the current dictionary, that goal is still out of reach. Ever since example sentences were introduced to the program, a new sentences data file has had to be generated for each new dictionary. For example, the word すっきり had the number 22222 in the dictionary data I generated in January, but it is 22226 in the updated dictionary I created yesterday. To look up examples for this word, the program first looks for index 22222 in the examples data, which points to the list of all the examples the word has. If I kept using the old examples with the new dictionary, where the word is #22226, the lookup would land on a different word’s examples or on nothing at all.
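The index problem can be sketched in a few lines of Python. This is only an illustration, not zkanji's actual data format: the dictionaries, the sentence IDs and the `find_examples` helper are all made up for the example; only the numbers 22222 and 22226 come from the text above.

```python
# Hypothetical word -> index tables for two dictionary releases.
january_dictionary = {"すっきり": 22222}
updated_dictionary = {"すっきり": 22226}

# Hypothetical examples data built against the January dictionary.
# The sentence IDs are invented for illustration.
examples_by_index = {22222: [101, 102, 103]}


def find_examples(dictionary, word):
    """Look up the word's index, then fetch its sentence IDs by that index."""
    index = dictionary[word]
    return examples_by_index.get(index, [])


old_result = find_examples(january_dictionary, "すっきり")
new_result = find_examples(updated_dictionary, "すっきり")
print(old_result, new_result)
```

With the January dictionary the lookup finds the sentences; with the updated one it finds nothing. In a full data file the new index would more likely belong to some other word, so the wrong sentences would be shown instead.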

I used a little trick to make these examples work for user dictionaries too, not just for the main English one. Whenever I needed the sentences for a specific word in another dictionary, I first looked up the word’s index in the main dictionary, exactly the same way dictionary searches work. This meant that I still needed the original dictionary file that was compatible with the example sentences data.
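A rough sketch of that trick, assuming the main dictionary can be searched by written form and kana reading. The table contents and the `examples_for_user_entry` name are hypothetical and only stand in for the real search code.

```python
# Hypothetical main dictionary: (written, kana) -> index used by the examples data.
main_dictionary = {("すっきり", "すっきり"): 22222}

# Hypothetical examples data built against that same main dictionary.
examples_by_index = {22222: [101, 102, 103]}


def examples_for_user_entry(written, kana):
    """Resolve a user-dictionary word to its main index, then fetch its examples."""
    index = main_dictionary.get((written, kana))
    if index is None:
        return []  # not in the main dictionary, so no examples can be found
    return examples_by_index.get(index, [])


found = examples_for_user_entry("すっきり", "すっきり")
missing = examples_for_user_entry("ずっきり", "ずっきり")  # made-up word, not in the table
```

The user dictionary never stores an index of its own here, which is why the matching main dictionary file has to stay around.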

To make any future example sentences data work with any future dictionary, I could identify words in the example sentences data by their written form and kana reading instead of by a number. So when you want to see the examples for すっきり, zkanji would look the word up in the data directly by its written and kana forms (both すっきり in this case), instead of by a number that changes with every new release. When I came up with this idea I thought it would be easy to implement. Finding the sentences for a word this way is easy, but unfortunately building the data so that it is fully independent of the dictionary is impossible because of the way the Tanaka Corpus is put together.
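The lookup side of this idea might look like the following minimal sketch. The `(written, kana)` tuple key, the function names and the sentence IDs are assumptions for illustration, not the format zkanji uses.

```python
from collections import defaultdict

# Hypothetical examples data keyed by (written form, kana reading)
# instead of by a dictionary index.
examples_by_form = defaultdict(list)


def add_example(written, kana, sentence_id):
    """Register one sentence as an example for the given word."""
    examples_by_form[(written, kana)].append(sentence_id)


def find_examples(written, kana):
    """Return the sentence IDs for a word, independent of any dictionary index."""
    return list(examples_by_form.get((written, kana), []))


add_example("すっきり", "すっきり", 101)
add_example("すっきり", "すっきり", 102)

result = find_examples("すっきり", "すっきり")
print(result)
```

Because the key no longer mentions 22222 or 22226, the same examples data would keep working after the dictionary is renumbered, as long as the word's written and kana forms stay the same.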

For each sentence in the corpus there is a list of the words that make up the sentence, which I can use to build my database. But the JMDict project (where the main dictionary comes from) keeps changing, and sometimes even the written and kana forms of words change, so the list of words in the Tanaka Corpus cannot be independent of the actual state of the dictionary.

I’ll still create an example sentences database that can be used with any future dictionary data, but as the dictionary changes some words will probably “lose” their examples. And because the corpus changes over time as well, it will still be a good idea to get the latest examples anyway.
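How a word ends up losing its examples can be sketched like this. Everything here is hypothetical: the corpus rows, the dictionary contents and the `build_examples` helper are invented, and the 林檎/りんご change is a made-up illustration, not a real JMDict edit.

```python
def build_examples(corpus, dictionary_forms):
    """Index sentences by (written, kana); collect forms the dictionary no longer has."""
    examples = {}
    orphaned = set()
    for sentence_id, words in corpus:
        for key in words:
            if key in dictionary_forms:
                examples.setdefault(key, []).append(sentence_id)
            else:
                orphaned.add(key)  # listed in the corpus, but not in this dictionary
    return examples, orphaned


# Hypothetical corpus rows: (sentence ID, words listed for that sentence).
corpus = [
    (101, [("すっきり", "すっきり")]),
    (102, [("すっきり", "すっきり"), ("林檎", "りんご")]),
]

# Hypothetical newer dictionary in which the second word's written form changed.
dictionary_forms = {("すっきり", "すっきり"), ("りんご", "りんご")}

examples, orphaned = build_examples(corpus, dictionary_forms)
print(examples)
print(orphaned)
```

The sentences indexed under the old form are still in the data, but nothing in the newer dictionary points at them any more, which is why picking up the latest examples data remains worthwhile.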

  1. jhack89
    May 1, 2012 at 5:40 pm

    Umm, well, luckily as of now I’m not too worried about having to download a new version of the example sentences database with each new release of the dictionary 🙂

    • May 1, 2012 at 5:55 pm

      There are people with limited internet access though.

      • jhack89
        May 1, 2012 at 6:11 pm

        That is true, indeed. And I was one of them until not too long ago! In Italy the government has only recently started to take the Internet-divide problem seriously. Even now many people don’t have high-speed connections, which creates huge problems of fairness, information availability and dependence on analogue media. I have no idea how many more people have this problem abroad, but unfortunately I believe it is still too many!
