Hello,<br><br>I was looking into bugs <a href="https://bugs.kde.org/show_bug.cgi?id=178278" target="_blank">178278</a>, <a href="https://bugs.kde.org/show_bug.cgi?id=119723" target="_blank">119723</a>, <a href="https://bugs.kde.org/show_bug.cgi?id=183743">183743</a> (and other similar ones), as I fixed a similar bug in Amarok 1.4 (<a href="http://bugs.kde.org/show_bug.cgi?id=138482">138482</a>). Specifically, I was wondering why case-insensitive search for non-Latin characters has stopped working again. After spending some time researching the issue I identified the problem, but I am not sure of the best way to fix it, so I am writing this email :) (I'm sorry it has become a bit long.)<br>
<br>First, some background:<br>When Amarok starts the embedded MySQL server, it does not set the default character encoding, so MySQL falls back to its default 'latin1' encoding for all textual fields. Before putting data into the tables, Amarok first (correctly) converts it to UTF-8. For ASCII characters this is not a problem, as the two encodings coincide. But if the data contains a 'real' UTF-8-encoded string, it gets stored merely as an opaque sequence of bytes. When we then ask the database to match another Unicode string against it, the match can only succeed if it is an exact, byte-for-byte match with the stored string. Had the server known what the characters mean, it could perform case-insensitive search and apply other collation rules (such as recognising the similarity of 'A' and 'Ä').<br>
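The difference can be demonstrated directly in a mysql client. This is just an illustrative sketch (it assumes the client sends UTF-8; the _binary/_utf8 introducers and the utf8_general_ci collation are standard MySQL):

```sql
-- Treated as opaque bytes (what a latin1 column effectively holds once
-- UTF-8 data has been stuffed into it): only an exact byte match succeeds.
SELECT _binary 'Die Ärzte' = _binary 'die ärzte';   -- 0: no match

-- Declared as utf8 with a case-insensitive collation: the server knows
-- what the bytes mean and can fold 'Ä' to 'ä'.
SELECT _utf8 'Die Ärzte' COLLATE utf8_general_ci
     = _utf8 'die ärzte';                           -- 1: match
```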
<br>After trying several options, I identified two problems to be solved:<br>1) The user's DB is not yet created<br>In this case, all we have to do is instruct the server that it will be dealing with UTF-8 data, so it knows how to compare it correctly. This can be done by passing the parameter "--default-character-set=utf8" to the server and, once the connection is established, issuing a "SET NAMES 'utf8'" SQL command. The parameter specifies that, by default, all newly created text fields should have a UTF-8 encoding, while the command specifies that we want to use UTF-8 encoding when talking to the server.<br>
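For the connection side, the sequence would look something like the following sketch (the server flag is passed at startup; the exact option spelling may differ between MySQL versions):

```sql
-- Issued on every new connection, before any other statement:
SET NAMES 'utf8';

-- Sanity check: inspect what the server and connection actually use.
SHOW VARIABLES LIKE 'character_set%';
```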
<br>2) The user already has UTF-8 data in latin1 text fields<br>This is the case I was not sure how to handle. There is the possibility of converting a field to another encoding. In our case, we should first convert the fields to the BINARY encoding: converting directly from latin1 to utf8 would re-encode the already-UTF-8 bytes, and we would get broken characters. Going through BINARY leaves the bytes untouched and only changes their interpretation. This can be done with the following two SQL commands:<br>
"ALTER TABLE artists MODIFY name VARCHAR(255) CHARACTER SET binary;"<br>"ALTER IGNORE TABLE artists MODIFY name VARCHAR(255) CHARACTER SET utf8 NOT NULL;"<br>Please note that we need to perform this conversion for each text field, because applying it to the table or the database only changes the defaults. So, my uncertainty with this approach is when and how to perform the conversion. Maybe we should check the current encoding of each column and convert only if necessary. Or we could introduce a new DB schema version and convert all old databases to the new encoding. Another problem here is that two keys that were previously different can now become identical ("Die Arzte" and "Die Ärzte"). In this case the command either fails because of the UNIQUE KEY constraint or, with the IGNORE option, silently drops the second, now-identical row. I don't know whether this is a problem, or whether the missing key will simply be regenerated.<br>
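If we go the check-and-convert route, the current column encoding can be read from information_schema (available since MySQL 5.0). A sketch, assuming the artists/name table and column from the commands above:

```sql
-- Returns 'latin1' if the column still needs the binary->utf8 conversion,
-- and 'utf8' once it has been converted.
SELECT character_set_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND table_name   = 'artists'
   AND column_name  = 'name';
```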
<br>So, thank you for reading through this quite long email. If anyone is interested in fixing the problem, please give me hints on how this could be done in an acceptable way.<br><br>Best wishes,<br>Stanislav Nikolov<br>