Little Rant on Unicode, Search Engines and Q&A sites

Disclaimer: This entry is about programming problems when handling Unicode. One part mentions people who call themselves programmers and try to answer questions on Q&A sites, so it might not be suitable for non-programmers, while many programmers might find it offensive. (Though I hope they won’t.)

The conversion of zkanji to Unicode is almost complete, but as a consequence a completely new family of problems has arisen. This is my first time trying to make a program that works on many systems with different language settings. zkanji did work on such systems before, but its users could not share their data with each other, at least not if they were using different languages. Because of that I never had to think about what would happen if someone got the idea to distribute a custom-made dictionary in a language that is not supported by every operating system. (The one supported everywhere would be English, but I’m not even sure about that.)

The problem: As I have written in a previous entry, zkanji uses a special dictionary tree to look up words. Each node in the tree has a label corresponding to the words under the node and the branches starting from the node. The nodes must be in alphabetical order of their labels so the tree can be walked, and the labels must be in lowercase. When someone searches for a word, that word can be of mixed case, so the first step is to convert it to lowercase for comparison with the node labels. The trouble starts when different languages convert a given uppercase letter to different lowercase letters. The first problem is that when the user searches for a word in the English dictionary, the entered text might not match anything in English after it has been converted to lowercase. (This could happen with the letter I in Turkish locales, where it is apparently converted to ı, the dotless i. This might not be true; I am just repeating what I have read on a Q&A site.) The second problem is the ordering of entered words in newly created user dictionaries. The nodes will probably end up in a different order on different systems if their languages differ.
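To illustrate the second problem, here is a minimal sketch of an ordering that cannot change between systems: comparing labels by raw code point value instead of by the system’s collation rules. The names `sort_labels_by_code_point` and `has_label` are made up for this example and are not zkanji’s actual code, and I am assuming the labels are stored as `std::u32string`. The result is not what a human would call alphabetical (z sorts before é), but a tree only needs an order that is the same everywhere.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts node labels by raw code point value.
// operator< on std::u32string compares the char32_t values one by one
// and never consults the system locale, so the order is the same on
// every machine.
void sort_labels_by_code_point(std::vector<std::u32string>& labels) {
    std::sort(labels.begin(), labels.end());
}

// Hypothetical lookup: binary search over labels that were sorted with
// the function above. It relies on the very same comparison.
bool has_label(const std::vector<std::u32string>& labels,
               const std::u32string& key) {
    return std::binary_search(labels.begin(), labels.end(), key);
}
```

With the labels “zebra”, “école” and “apple”, this puts “apple” first, “zebra” second and “école” last, whatever language the system is set to.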

The only solution that seems viable at the moment is to use a conversion function that converts a given uppercase character to the same lowercase one on every single system, without ever looking at the system’s own language. This should be possible, as there is supposed to be a default conversion table for Unicode characters hidden somewhere in the system. Unfortunately the documentation, and even the C++ language itself, is in turmoil when it comes to Unicode. There are several functions for Unicode character conversion, but their documentation does not always mention whether they use the system’s locale or not. Even when it does, there are contradictory remarks about those functions, and when looking for help online, it turns out that their behavior might differ between implementations of the same C++ library.
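A minimal sketch of what such a function could look like, assuming the text is held as UTF-32 code points. The name `to_lower_fixed` is made up for this example, and the table is only a tiny hand-picked subset (Basic Latin, Latin-1, and the basic Greek and Cyrillic capitals), nowhere near the full Unicode case mapping data. The point is only that nothing in it asks the system for its locale.

```cpp
#include <string>

// Hypothetical helper: lowercases one code point from a fixed table.
// It never consults the system locale, so 'I' becomes 'i' everywhere,
// Turkish systems included. Only a few ranges are covered in this sketch.
constexpr char32_t to_lower_fixed(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 32;                    // Basic Latin A..Z
    if (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7) return c + 32; // Latin-1, skipping the multiplication sign
    if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) return c + 32; // Greek capitals, skipping unassigned U+03A2
    if (c >= 0x0400 && c <= 0x040F) return c + 80;                // Cyrillic U+0400..U+040F
    if (c >= 0x0410 && c <= 0x042F) return c + 32;                // Cyrillic U+0410..U+042F
    return c; // anything else is left unchanged in this sketch
}

// Applies the fixed mapping to every code point of a string.
std::u32string to_lower_fixed(const std::u32string& s) {
    std::u32string out;
    out.reserve(s.size());
    for (char32_t c : s) out.push_back(to_lower_fixed(c));
    return out;
}
```

A search term would go through `to_lower_fixed` before being compared with the node labels, and the labels themselves would have to be created with the same function. A real version would need the complete mapping table, which is exactly the part I am still looking for.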

The only thing I can do in such cases is to use an online search engine to look for a solution that works.

Many years ago search engines were not as “smart” as they are today. They only returned results that contained the exact words one was looking for, and they couldn’t find forum entries at all, only relatively static sites. In recent years the makers of these search engines realized that people are not interested in sites like those. People don’t want to find anything about what they entered in the search field; they need everything else. So search engines were developed further to give us sites that have the search terms inflected differently, split up or written as a single word, or that have merely similar words instead of the ones entered, even when those were put between quotation marks. The other great innovation of search engines is the inclusion of social activity in the search results. This means it is almost guaranteed that when one searches for a technical term, the first 1000 results will be forum messages, tweets or personal pages from social sites.

Thanks to these innovations in search technology, it has once again become a challenge to find something useful. This is a good thing, because we programmers love a challenge, or we wouldn’t be programming in the first place, right?

Q&A sites (question and answer sites, where anyone can ask a question on a given topic and get answers from people all over the world) are among the results that today’s search engines return while trying to pamper us. Of course I have nothing against sites like those. It’s good that so many experts try to be helpful for free. Or at least that is what I thought at first. Unfortunately, as it turned out, most of these “experts” don’t know what they are talking about, and don’t want to admit it either. There have been several questions about converting Unicode strings to lowercase, all getting the same answers, none of them paying any attention to the needs of the one asking.

General Answer #1: converting to lowercase the same way on every system is impossible, because there are languages where the upper/lowercase versions of some characters differ from those in other languages.
General Answer #2: why do you even want to do that? We all speak English!
General Answer #3: use the case conversion of [insert any library or function name here]! It’s using the current locale! You don’t want that? Do it anyway!
General Answer #4: use [insert any library]! It does what you need, converts from anything to anything else, with or without using the locale, it’s perfect in every way! Though I have only heard of it. And it uses [some license not compatible with most others]. And you will have to link another 1MB to your exe just because you needed a single function.

Of course this is not the first time I have had to face such helpful answers after a day’s search online, but I had to rant about it. If one is persistent enough, there are really good, helpful answers out there as well; they just have to be found. But it seems that whenever I need an answer for something, it turns out to be one of the rarest problems on earth… Or it’s so simple that everyone knows the solution but me.
