Better database handling of UTF-8 data / Case-insensitive search

Wed May 13 19:29:21 UTC 2009

Hello,

I was looking into bugs 178278 <https://bugs.kde.org/show_bug.cgi?id=178278>,
119723 <https://bugs.kde.org/show_bug.cgi?id=119723>,
183743<https://bugs.kde.org/show_bug.cgi?id=183743>(and other similar)
as I fixed a similar bug in Amarok 1.4 (
138482 <http://bugs.kde.org/show_bug.cgi?id=138482>). Specifically, I was
wondering why case-insensitive search for non-Latin characters didn't work
again. After spending some time in researching the issue I identified the
problem. But I am not sure of the best way to fix it, so I write this email
:) (I'm sorry it has become a bit long)

First, some background:
When Amarok starts the embedded MySQL server, it does not set the default
character encoding. This leads to the fact that MySQL uses the default
'latin1' encoding for all textual fields. Before putting data into the
tables, Amarok first (correctly) converts it to UTF8. For ASCII chars this
is no problem, as they are equivalent. But if the data contains a 'real'
utf8-encoded string, it gets stored just as a sequence of bytes. When we ask
the database to match other Unicode string with it, it can only do it if
there is an exact match with the stored string. Had it known what the
characters mean, it could perform case-insensitive search, as well as some
other collation fixes (like recognising the similarity of 'A' and 'Ä').

After trying several options, I identified 2 problems to be solved:
1) The user's DB is not yet created
In this case, all we have to do is instruct the server that it would be
dealing with UTF8 data, so it knows how to compare it correctly. This can be
done by passing the parameter "--default-character-set=utf8" to the server,
and when the connection is established, issuing a "SET NAMES 'utf8'" SQL
command. The first parameter specifies that by default, all newly created
text fields should have an UTF8 encoding, while the second specifies that we
want to use UT8 encoding when talking to the server.

2) The user already has UTF8 data in Latin1 text fields
This is the case I was not sure how to handle. There is the possibility to
convert a field to other encoding. In our case, we should first convert the
fields to BINARY encoding, as we'll get broken characters otherwise. This
can be done by the following 2 SQL commands:
"ALTER TABLE artists MODIFY name VARCHAR(255) CHARACTER SET binary;"
"ALTER IGNORE TABLE artists MODIFY name VARCHAR(255) CHARACTER SET utf8 NOT
NULL;".
Please note that we need to do this conversion for each text field, because
doing this to the table or the database only sets the defaults. So, my
uncertainty in this approach is when and how to perform the conversion.
Maybe we should check the current encoding and perform the conversion if
necessary. Or we should use a different db schema version and convert all
old dbs to the new encoding. Another problem here is the possibility that 2
keys that were previously different can now become identical ("Die Arzte"
and "Die Ärzte".) In this case the command fails because of the UNIQUE_KEY
constraint or deletes the second identical key if we use the IGNORE option.
I don't know if this is a problem, or the missing key will be regenerated.

So, thank you for reading through this quite long email. If anyone is
interested in fixing the problem, please give me hints how this could be
done in an acceptable way.

Best wishes,
Stanislav Nikolov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/amarok/attachments/20090513/852db0a9/attachment.html>