Better database handling of UTF-8 data / Case-insensitive search

Thu May 14 01:29:43 CEST 2009

Forwarding on to our developer list. Thanks for the info Stanislav!

2009/5/14 Stanislav Nikolov <valsinats at gmail.com>:
> Hello,
>
> I was looking into bugs 178278, 119723,  183743 (and other similar) as I
> fixed a similar bug in Amarok 1.4 (138482). Specifically, I was wondering
> why case-insensitive search for non-Latin characters didn't work again.
> After spending some time in researching the issue I identified the problem.
> But I am not sure of the best way to fix it, so I write this email :) (I'm
> sorry it has become a bit long)
>
> First, some background:
> When Amarok starts the embedded MySQL server, it does not set the default
> character encoding. This leads to the fact that MySQL uses the default
> 'latin1' encoding for all textual fields. Before putting data into the
> tables, Amarok first (correctly) converts it to UTF8. For ASCII chars this
> is no problem, as they are equivalent. But if the data contains a 'real'
> utf8-encoded string, it gets stored just as a sequence of bytes. When we ask
> the database to match other Unicode string with it, it can only do it if
> there is an exact match with the stored string. Had it known what the
> characters mean, it could perform case-insensitive search, as well as some
> other collation fixes (like recognising the similarity of 'A' and 'Ä').
>
> After trying several options, I identified 2 problems to be solved:
> 1) The user's DB is not yet created
> In this case, all we have to do is instruct the server that it would be
> dealing with UTF8 data, so it knows how to compare it correctly. This can be
> done by passing the parameter "--default-character-set=utf8" to the server,
> and when the connection is established, issuing a "SET NAMES 'utf8'" SQL
> command. The first parameter specifies that by default, all newly created
> text fields should have an UTF8 encoding, while the second specifies that we
> want to use UT8 encoding when talking to the server.
>
> 2) The user already has UTF8 data in Latin1 text fields
> This is the case I was not sure how to handle. There is the possibility to
> convert a field to other encoding. In our case, we should first convert the
> fields to BINARY encoding, as we'll get broken characters otherwise. This
> can be done by the following 2 SQL commands:
> "ALTER TABLE artists MODIFY name VARCHAR(255) CHARACTER SET binary;"
> "ALTER IGNORE TABLE artists MODIFY name VARCHAR(255) CHARACTER SET utf8 NOT
> NULL;".
> Please note that we need to do this conversion for each text field, because
> doing this to the table or the database only sets the defaults. So, my
> uncertainty in this approach is when and how to perform the conversion.
> Maybe we should check the current encoding and perform the conversion if
> necessary. Or we should use a different db schema version and convert all
> old dbs to the new encoding. Another problem here is the possibility that 2
> keys that were previously different can now become identical ("Die Arzte"
> and "Die Ärzte".) In this case the command fails because of the UNIQUE_KEY
> constraint or deletes the second identical key if we use the IGNORE option.
> I don't know if this is a problem, or the missing key will be regenerated.
>
> So, thank you for reading through this quite long email. If anyone is
> interested in fixing the problem, please give me hints how this could be
> done in an acceptable way.
>
> Best wishes,
> Stanislav Nikolov
>
> _______________________________________________
> Amarok mailing list
> Amarok at kde.org
> https://mail.kde.org/mailman/listinfo/amarok
>
>

-- 
Seb Ruiz

http://www.sebruiz.net/
http://amarok.kde.org/