
Independence of the Example Sentences

May 1, 2012

Despite all my efforts to create an example sentences database that does not depend on the current dictionary, that goal is still out of reach. Ever since example sentences were introduced to the program, a new sentences data file has had to be generated for each new dictionary. For example, the word すっきり had the number 22222 in the dictionary data I generated in January, but it is 22226 in the updated dictionary I created yesterday. When looking up examples for this word, I first looked for index 22222 in the examples data, which pointed me to a list of all the examples the word had. You can easily see what would happen if I kept using the old examples with the new dictionary, where the word is marked #22226: the lookup would land on whatever belonged to number 22226 back in January, not on the examples for すっきり.
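The index drift described above can be reproduced in miniature. The sketch below is only an illustration, not zkanji's actual code: the tiny dictionaries, the `examples_for` helper and the placeholder sentence are all made up, and a word's index is assumed to be simply its position in the dictionary.

```python
# Hypothetical miniature dictionaries; a word's index is its position in the list.
old_dictionary = ["あっさり", "すっきり", "はっきり"]
# The updated dictionary gained one entry, shifting every later word by one.
new_dictionary = ["あっさり", "うっかり", "すっきり", "はっきり"]

# Examples data generated against the OLD dictionary, keyed by word index.
# The sentence text is a placeholder, not a real corpus entry.
examples_by_index = {1: ["<example sentence for すっきり>"]}


def examples_for(word, dictionary, examples):
    """Find the word's index in the dictionary, then fetch examples by index."""
    index = dictionary.index(word)
    return examples.get(index, [])


# With the matching dictionary, the lookup works.
print(examples_for("すっきり", old_dictionary, examples_by_index))
# With the new dictionary, すっきり finds nothing...
print(examples_for("すっきり", new_dictionary, examples_by_index))
# ...and the word that took over index 1 gets sentences that are not its own.
print(examples_for("うっかり", new_dictionary, examples_by_index))
```

The last call is the worse failure of the two: nothing signals that the sentences returned belong to a different word.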

I used a little trick to make these examples work for user dictionaries, not just for the main English one. Whenever I needed the sentences of a specific word in another dictionary, I first looked up the word’s index in the main dictionary, exactly the same way dictionary searches work. This meant that I still needed the original dictionary file that was compatible with the example sentences data.

To make any future example sentences data work with any future dictionary, I could identify words in the example sentences data by their written form and kana reading instead of by a number. So, for example, when you want to see the examples for すっきり, zkanji would look that word up in the data directly by its written and kana forms (which are both すっきり in this case), instead of by a number that changes with every new release. When I came up with this idea I thought it would be easy to implement. Finding the sentences for a word this way is easy, but unfortunately building the data is impossible because of the way the Tanaka Corpus is put together.
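The lookup side of this idea is simple enough to sketch. This is an assumption about how such data could be keyed, not the real zkanji format: the `WordKey` type, the `find_examples` function and the placeholder sentence are hypothetical.

```python
from typing import Dict, List, Tuple

# A word is identified by (written form, kana reading) instead of a number.
WordKey = Tuple[str, str]

# Hypothetical examples data; the sentence text is a placeholder.
examples_by_word: Dict[WordKey, List[str]] = {
    ("すっきり", "すっきり"): ["<example sentence for すっきり>"],
}


def find_examples(written: str, kana: str,
                  examples: Dict[WordKey, List[str]]) -> List[str]:
    """Return the examples stored for this written/kana pair, if any."""
    return examples.get((written, kana), [])


# The lookup no longer cares where the word sits in any dictionary release.
print(find_examples("すっきり", "すっきり", examples_by_word))
```

Both forms are part of the key because many Japanese words share a written form or a reading, so either one alone would not identify a word.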

For each sentence in the corpus, there is a list of the words that make up that sentence, which I can use to build my database. But the JMDict project (where the main dictionary comes from) keeps changing, and sometimes even the written and kana forms of words change, so the list of words in the Tanaka Corpus cannot be independent of the actual state of the dictionary.

I’ll still create an example sentences database which can be used with any future dictionary data, but as the dictionary changes, some words will probably “lose” their examples. And because the corpus changes over time as well, it will still be a good idea to get the latest examples.
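How a word "loses" its examples can be shown with a small sketch, assuming the examples are keyed by written form and kana reading. Everything here is hypothetical: the `split_by_dictionary` helper, the placeholder sentences, and the scenario in which a newer dictionary drops one written form.

```python
from typing import Dict, List, Set, Tuple

WordKey = Tuple[str, str]  # (written form, kana reading)


def split_by_dictionary(
    examples: Dict[WordKey, List[str]],
    dictionary_words: Set[WordKey],
) -> Tuple[Dict[WordKey, List[str]], List[WordKey]]:
    """Separate example entries that still match a dictionary word from
    those whose written/kana pair is no longer in the dictionary."""
    usable = {key: s for key, s in examples.items() if key in dictionary_words}
    orphaned = sorted(key for key in examples if key not in dictionary_words)
    return usable, orphaned


# Made-up scenario: 奇麗 was the written form when the examples were built,
# but the newer dictionary in this sketch lists the word only as 綺麗.
examples = {
    ("すっきり", "すっきり"): ["<sentence 1>"],
    ("奇麗", "きれい"): ["<sentence 2>"],
}
newer_dictionary = {("すっきり", "すっきり"), ("綺麗", "きれい")}

usable, orphaned = split_by_dictionary(examples, newer_dictionary)
print(usable)    # entries that still resolve to a dictionary word
print(orphaned)  # entries whose word can no longer be found
```

The orphaned sentences are still in the data, but no dictionary word points at them any more, which is why fetching fresh examples built against the current dictionary remains worthwhile.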