Better database handling of UTF-8 data / Case-insensitive search
Mark Kretschmann
kretschmann at kde.org
Thu May 14 11:08:00 UTC 2009
On Wed, May 13, 2009 at 9:29 PM, Stanislav Nikolov <valsinats at gmail.com> wrote:
> Hello,
>
> I was looking into bugs 178278, 119723, 183743 (and other similar) as I
> fixed a similar bug in Amarok 1.4 (138482). Specifically, I was wondering
> why case-insensitive search for non-Latin characters didn't work again.
> After spending some time researching the issue, I identified the problem,
> but I am not sure of the best way to fix it, so I am writing this email :)
> (Sorry it has become a bit long.)
>
> First, some background:
> When Amarok starts the embedded MySQL server, it does not set the default
> character encoding, so MySQL falls back to its default 'latin1' encoding
> for all text fields. Before putting data into the tables, Amarok first
> (correctly) converts it to UTF-8. For ASCII characters this is no problem,
> as the two encodings are equivalent there. But if the data contains a
> 'real' UTF-8 encoded string, it gets stored as an opaque sequence of
> bytes. When we then ask the database to match another Unicode string
> against it, it can only find exact byte-for-byte matches. Had it known
> what the characters mean, it could perform case-insensitive search, as
> well as apply other collation rules (like recognising the similarity of
> 'A' and 'Ä').
>
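[Editor's note: the mismatch described above can be illustrated with a short pure-Python sketch (this is not Amarok code; Python's codecs just stand in for MySQL's charset handling). A UTF-8 encoded 'Ä' is two bytes, so a server interpreting the column as latin1 sees two unrelated characters, and per-character case folding can never match it against 'ä'.]

```python
# Pure-Python illustration of the latin1/UTF-8 mismatch (not Amarok code).

text = "Die Ärzte"
utf8_bytes = text.encode("utf-8")          # the bytes Amarok sends to MySQL

# A latin1 server interprets those bytes as latin1 characters:
as_latin1 = utf8_bytes.decode("latin-1")   # 'Ä' becomes two garbage chars

# A case-insensitive query string suffers the same mis-decoding:
query = "die ärzte".encode("utf-8").decode("latin-1")

# Case folding is now applied to the wrong characters, so the
# case-insensitive match fails:
print(as_latin1.lower() == query.lower())                          # False

# Interpreted correctly as UTF-8, the same comparison succeeds:
print(utf8_bytes.decode("utf-8").lower() == "die ärzte".lower())   # True
```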
> After trying several options, I identified 2 problems to be solved:
> 1) The user's DB is not yet created
> In this case, all we have to do is tell the server that it will be
> dealing with UTF-8 data, so it knows how to compare it correctly. This can
> be done by passing the parameter "--default-character-set=utf8" to the
> server and, once the connection is established, issuing a "SET NAMES
> 'utf8'" SQL command. The first specifies that by default all newly created
> text fields should use a UTF-8 encoding, while the second specifies that
> we want to use UTF-8 encoding when talking to the server.
>
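[Editor's note: a minimal sketch of how the two pieces of fix 1) fit together. The helper names and the data directory are hypothetical, not Amarok's actual API; the sketch only builds the option and statement strings described above.]

```python
# Hypothetical sketch of fix 1): one option on the embedded server's
# command line, one statement issued on every new connection.
# Function names are illustrative, not Amarok's real API.

def embedded_server_args(datadir):
    """Arguments for launching the embedded MySQL server."""
    return [
        "--datadir=%s" % datadir,
        # Newly created text columns default to utf8 instead of latin1:
        "--default-character-set=utf8",
    ]

def connection_init_sql():
    """SQL to issue right after a connection is established."""
    # Tell the server the client sends and expects UTF-8:
    return "SET NAMES 'utf8';"

print(embedded_server_args("/tmp/amarok-mysql"))
print(connection_init_sql())
```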
> 2) The user already has UTF8 data in Latin1 text fields
> This is the case I was not sure how to handle. It is possible to convert
> a field to another encoding. In our case, we must first convert the
> fields to BINARY encoding, as we would get broken characters otherwise.
> This can be done with the following two SQL commands:
> "ALTER TABLE artists MODIFY name VARCHAR(255) CHARACTER SET binary;"
> "ALTER IGNORE TABLE artists MODIFY name VARCHAR(255) CHARACTER SET utf8
> NOT NULL;"
> Please note that we need to do this conversion for each text field, because
> doing this to the table or the database only sets the defaults. So, my
> uncertainty in this approach is when and how to perform the conversion.
> Maybe we should check the current encoding and perform the conversion if
> necessary. Or we should use a different db schema version and convert all
> old dbs to the new encoding. Another problem here is that two keys that
> were previously distinct can now become identical ("Die Arzte" and "Die
> Ärzte"). In that case the command either fails because of the UNIQUE_KEY
> constraint, or silently drops the second identical row if we use the
> IGNORE option. I don't know whether this is a problem, or whether the
> missing key would be regenerated.
>
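[Editor's note: a sketch of the conversion and of the key-collision hazard described in 2). The column list and helper names are hypothetical, and the `collation_key` function is only a rough Python model of an accent- and case-insensitive collation such as MySQL's utf8_general_ci, which compares 'A' and 'Ä' as equal.]

```python
import unicodedata

# Hypothetical sketch of fix 2): emit the via-binary conversion for one
# text column (the real schema has many such columns).
def conversion_sql(table, column, length=255):
    return [
        "ALTER TABLE %s MODIFY %s VARCHAR(%d) CHARACTER SET binary;"
        % (table, column, length),
        "ALTER IGNORE TABLE %s MODIFY %s VARCHAR(%d) CHARACTER SET utf8 "
        "NOT NULL;" % (table, column, length),
    ]

# Rough model of an accent-insensitive, case-insensitive collation:
# decompose, strip combining marks, then lowercase.
def collation_key(s):
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

# The hazard from the email: two previously distinct keys collide once
# the column gets a Unicode-aware collation.
print(collation_key("Die Arzte") == collation_key("Die Ärzte"))  # True

for stmt in conversion_sql("artists", "name"):
    print(stmt)
```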
> So, thank you for reading through this quite long email. If anyone is
> interested in fixing the problem, please give me hints on how it could be
> done in an acceptable way.
Hi Stanislav, big thanks for diagnosing this issue.
As for an actual fix, I think we should focus on 1) for now, since it
seems easier and less risky than 2), which we could tackle after the 2.1
release.
But 1) looks like something that should probably be fixed before
releasing, and it doesn't seem very complicated either. Could you
possibly provide a patch? Otherwise I could try to fix it, but you
seem to be more familiar with the details.
--
Mark Kretschmann
Amarok Developer
www.kde.org - amarok.kde.org