Disclaimer: This is just a short rant to vent my frustration.
I don’t know what it is that they eat at G@@gle, but they are making their search service worse and worse. Or is it just me?
A few days ago I noticed that they had removed the “advanced search” link from the top of the search window. I understand that people run away screaming whenever they see the word “advanced”, but why prevent people like me from using those settings? They are not even that advanced… If the name was so scary, they could have renamed it to “additional settings” or something. (The link can now be found at the bottom of the page.) The only thing I use those “advanced” settings for is to change the language to Japanese whenever my all-kanji search returns Chinese pages only. I know that I could simply tick the checkbox for Japanese in the general search settings, which get saved in a cookie, but even if English is ticked as well, most results come up in Japanese for non-Japanese search text too. I love the language, but I still read English faster when I need to look up something about programming.
The developers at G@@gle must have taken some pill again recently when they decided that the + in front of search keywords is evil. Try searching for a word with a + in front of it and see what it tells you:
The + operator has been replaced.
To search for an exact word or phrase, use double quotation marks
Thanks. Well, as for me, I used + to tell the search engine to include that word in every result and not give me junk without it, not “to search for an exact word or phrase”. I have always used “double quotation marks” for that.
So where can I send G@@gle hate mail?
JMDict is a dictionary database, that is, a list of words in English (plus some other languages) with their Japanese equivalents. The data is free for non-commercial use, and it may be the only dictionary project of this size for Japanese that is free.
I thought of writing something satirical about JMDict, the dictionary database behind zkanji, but I probably couldn’t express myself so well as to make it even slightly funny. JMDict is used in 99.999% of freely available online dictionaries and free dictionary programs, and as such, gives a lot to the students of the Japanese language all over the world. The trouble is, it is a bit difficult to criticize it, as many people think of it as The Authoritative Japanese Dictionary Database.
Unfortunately JMDict has many small glitches. Many English definitions in it are ambiguous, and anybody who doesn’t already know how to use those words will definitely make mistakes. You can say that this is unavoidable in a dictionary, and I agree. The problem comes up when a definition could be improved simply by a different choice of words. For example, the English definition given might fit only 10% of all translations, while in all other cases there is a better, more general translation that the definition does not mention. I also started to wonder, when I saw words with an explanation as part of their definition, why explanations could not be added to others as well. Because of this, whenever I come across something ambiguous, I use a Japanese-Japanese dictionary like goo 辞書 or SpaceALC’s 英辞郎 (which is not exactly a dictionary, but still great and very useful).
This doesn’t mean that JMDict is full of such problems, with vicious ambiguity and incorrect translations awaiting the unsuspecting student at every corner. (Maybe it DOES mean that, but only for less common words. There are more than 150,000 unique definitions in the database after all.) Another positive aspect is that anyone can send in corrections to words. I have done it myself several times already.
I’m writing this rant right now because this time it seems the maintainers made a mistake they don’t want to admit, and it left me slightly dumbfounded. The word 練る is a godan (or -u) verb. That means that when the word is inflected, the る “changes” to another syllable from the r row of the kana table, or in the past tense it is replaced by った. But if we take a look at the verb inflection table at WWWJDIC (the “official” online dictionary that uses JMDict), it says the past tense of 練る is 練た, which is wrong. What is strange is that 練る was originally marked correctly as a -u verb, but not so long ago it was changed to ichidan (or -ru dropping). I sent in a correction for this 3 days ago, and the answer? Well, I hadn’t specified an e-mail address (and even if I had, I doubt they would have sent an answer), and the correction was simply refused and deleted from the system without a trace. It might be taken into consideration and we might see some change in the future, but for now it seems they simply removed it. This is just a single case, but it makes me wonder what the future of JMDict will be like if things go on like this. The creator and original maintainer of JMDict is Jim Breen, though he is probably not the one doing it now. I’m considering sending him an e-mail explaining the situation, but I want to wait a few days to see whether they correct this mistake.
UPDATE: I checked again today, and the verb information is now corrected. Either one of the maintainers is reading my blog 🙂 or it simply took this long for the changes in the database to appear on the dictionary server. In any case, I wish they would only remove the page of a suggested correction after the result is visible.
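To make the godan/ichidan difference concrete, here is a minimal C++ sketch of the plain past tense for verbs ending in る. The names `VerbClass` and `past_tense_ru` are hypothetical and are not taken from zkanji or WWWJDIC, and only る-ending verbs are handled. The kana are written as Unicode escapes so the result does not depend on the source file encoding.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification; real dictionaries carry more verb classes.
enum class VerbClass { Godan, Ichidan };

// Plain past tense of a dictionary form ending in る (U+308B).
// Godan: る is replaced by った. Ichidan: る is dropped and た is added.
// Any other input is returned unchanged, since this sketch does not cover it.
std::u32string past_tense_ru(const std::u32string& verb, VerbClass cls) {
    if (verb.empty() || verb.back() != U'\u308B') return verb;
    std::u32string stem(verb.begin(), verb.end() - 1);
    return stem + (cls == VerbClass::Godan ? U"\u3063\u305F"  // った
                                           : U"\u305F");      // た
}

// Small self-check with 練る (U+7DF4 U+308B), once per classification.
void check_neru() {
    // Godan gives 練った, the correct form.
    assert(past_tense_ru(U"\u7DF4\u308B", VerbClass::Godan) == U"\u7DF4\u3063\u305F");
    // Ichidan gives 練た, the wrong form described in the post.
    assert(past_tense_ru(U"\u7DF4\u308B", VerbClass::Ichidan) == U"\u7DF4\u305F");
}
```

With the wrong class attached to the entry, any table generated this way would show the wrong past tense, which matches the symptom described above.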
Disclaimer: This entry is about programming problems when handling Unicode. One part mentions people who call themselves programmers and try to answer questions at Q&A sites, so it might not be suitable for non-programmers, while many programmers might find it offensive. (Though I hope they won’t.)
The conversion of zkanji to Unicode is almost complete, but as a consequence a completely new family of problems has arisen. This is my first time trying to make a program that works on many systems with different language settings. Although zkanji did work until now as well, its users could not share their data with each other, at least not if they were using different languages. Because of that I never even had to think about what would happen if someone got the idea to distribute a custom-made dictionary in a language that is not the one supported by every operating system. (Which would be English, but I’m not even sure about that.)
The problem: As I have written in a previous entry, zkanji uses a special dictionary tree to look up words. Each node in the tree has a label corresponding to the words under the node and the branches starting from the node. These nodes must be in alphabetical order of their labels so that the tree can be walked, and the labels must be in lowercase. When someone searches for a word, that word can be of mixed case, so the first step is to convert it to lowercase for comparison with the node labels. The problem arises when different languages convert a given uppercase letter to different lowercase letters. The first consequence is that when the user searches for a word in the English dictionary, the entered text might not match anything in English after it has been converted to lowercase. (This could happen for the letter I in Turkish locales, as it will apparently be converted to an ı character. This might not be true; I am just repeating what I have read on a Q&A site.) The second problem is the ordering of entered words in newly created user dictionaries. The nodes will probably end up in a different order on different systems if their languages differ.
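One way to sidestep the ordering half of the problem is to sort and search labels by raw code point value, which gives the same result on every system. The sketch below assumes labels are stored as `std::u32string`. The `Labels` type and both function names are hypothetical stand-ins, since zkanji’s real tree structure is not shown here. The order is not a linguistically correct alphabetical order, only a stable one.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the labels on one level of the tree.
using Labels = std::vector<std::u32string>;

// Sort by code point value. operator< on std::u32string compares the
// char32_t values directly and never consults the system locale.
void sort_labels(Labels& labels) {
    std::sort(labels.begin(), labels.end());
}

// Binary search that relies on the same code point ordering.
bool has_label(const Labels& labels, const std::u32string& key) {
    return std::binary_search(labels.begin(), labels.end(), key);
}

// Small self-check: the dotless ı (U+0131) sorts after every ASCII letter.
void check_order() {
    Labels labels = {U"zebra", U"apple", U"\u0131rmak", U"item"};
    sort_labels(labels);
    assert(labels.front() == U"apple");
    assert(labels.back() == U"\u0131rmak");
    assert(has_label(labels, U"item"));
}
```

The lookup is still case sensitive, so the search text has to be lowercased the same way on every system before `has_label` is called.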
The only solution that seems viable at the moment is to use a conversion function that converts a given uppercase character to the same lowercase one on every single system, without ever looking at the system’s own language. This should be possible, as there is supposed to be a default conversion table for Unicode characters hidden somewhere in the system. Unfortunately the documentation and even the C++ language itself are in turmoil when it comes to Unicode. There are several functions for Unicode character conversion, but their documentation does not always mention whether they use the system’s locale or not. Even when it does, there are contradictory remarks about those functions, and when looking for help online, it turns out that their behavior might differ between implementations of the same C++ library.
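A conversion that never looks at the locale can be sketched as a function with the mapping hard-coded into the program. This is only an illustration under stated assumptions: the names `to_lower_fixed` and `lower_string` are made up, and only a handful of simple ranges (Basic Latin, Latin-1, basic Greek and Cyrillic) are covered. A complete table would have to be generated from the Unicode Character Database.

```cpp
#include <cassert>
#include <string>

// Lowercase one code point using fixed rules only. The system locale is
// never consulted, so the result is the same on every machine.
// Code points outside the listed ranges are returned unchanged.
char32_t to_lower_fixed(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 32;                     // Basic Latin A..Z
    if (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7) return c + 32;  // Latin-1, skipping the multiplication sign
    if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) return c + 32;  // Greek capitals; U+03A2 is unassigned
    if (c >= 0x0410 && c <= 0x042F) return c + 32;                 // Cyrillic U+0410..U+042F
    if (c >= 0x0400 && c <= 0x040F) return c + 80;                 // Cyrillic U+0400..U+040F
    return c;
}

// Apply the fixed mapping to every code point of a string.
std::u32string lower_string(std::u32string s) {
    for (char32_t& c : s) c = to_lower_fixed(c);
    return s;
}

// Small self-check: 'I' always becomes 'i', whatever the system language is.
void check_lower() {
    assert(to_lower_fixed(U'I') == U'i');
    assert(lower_string(U"KANJI Dict") == U"kanji dict");
}
```

The trade-off is that users who expect their own language’s casing rules will not get them, but dictionary files built on one system stay searchable on any other.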
The only thing I can do in such cases is to use an online search engine to look for a solution that works.
Many years ago search engines were not as “smart” as today. They only returned results that contained the exact words one was looking for, and they couldn’t find forum entries at all, only relatively static sites. In recent years the makers of these search engines realized that people are not interested in sites like those. People don’t want to find anything about what they entered in the search field; they need everything else instead. So search engines were developed further to give us sites that have the search terms inflected differently, split up or joined into a single word, or that merely have similar words instead of the ones entered, even when those were placed between quotation marks. The other great innovation of search engines is the inclusion of social activity in the search results. This means it is almost guaranteed that when one searches for a technical term, the first 1000 results will be forum messages, tweets or personal pages from social sites.
Thanks to these innovations in search technology, it has once again become a challenge to find something useful. This is a good thing, because we programmers love a challenge, or we wouldn’t be programming in the first place, right?
Q&A sites (question and answer sites, where anyone can ask a question on a given topic and get answers from people all over the world) are among the results that today’s search engines return while trying to pamper us. Of course I have nothing against sites like those. It’s good that so many experts try to be helpful for free. Or at least that is what I thought at first. Unfortunately, as it turned out, most of these “experts” don’t know what they are talking about, and don’t want to admit it either. There have been several questions about converting Unicode strings to lowercase, all getting the same answers regardless of the needs of the person asking.
General Answer #1: converting to lowercase the same way on every system is impossible, because the upper/lowercase versions of some characters are different in some languages than in others.
General Answer #2: why do you even want to do that? We all speak English!
General Answer #3: use the case conversion of [insert any library or function name here]! It’s using the current locale! You don’t want that? Do it anyway!
General Answer #4: use [insert any library]! It does what you need, converts from anything to anything else, with or without using the locale, it’s perfect in every way! Though I have only heard of it. And it uses [some license not compatible with most others]. And you will have to link another 1MB to your exe just because you needed a single function.
Of course this is not the first time I have had to face such helpful answers after a day of searching online, but I had to rant about it. If one is persistent enough, there are really good, helpful answers out there as well; they just have to be found. But it seems that whenever I need an answer for something, it turns out to be one of the rarest problems on earth… Or it’s so simple that everyone knows the solution but me.