Encoding problems

Tue May 3 16:43:21 CEST 2005

Hello and greetings to all TagLib developers!

I've tried to contact wheeler at kde.org directly, but, alas, I've received
no answer so far.

I'd like to express my gratitude to the TagLib authors for creating such
a nice unified interface. However, it still has several problems with
encodings, especially when it comes to non-ISO-8859-1 data. It's a pity
to have such problems in 2005, when almost everything has gone to
Unicode and we've almost solved the majority of i18n/l10n problems.

I'll try to point out some things that need to be fixed in TagLib. I'd
like to know whether anybody is willing to work on them, or maybe we can
discuss the right way to do it and I can write the patches?

So, here's the list of problems:

1. The first major problem is with the String class. It's more or less
nice when it works together with Qt/QString, but some methods really
make me cry. The most evil ones are the to8Bit() method and operator <<,
which try to convert the string into a Latin1 std::string.

First, they fail to do what the documentation states. For example, the
output of UTF-8 strings containing Cyrillic symbols looks like a string
truncated to 8 bits, making it totally unreadable. Sure, it's impossible to
represent these symbols in Latin1, but there are "?" signs for
replacement and, well,
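
To make the "?" replacement concrete, here is a minimal sketch of a
lossy Latin1 conversion. The helper name toLatin1Lossy() and the use of
std::u32string as input are my assumptions for illustration only; this
is not TagLib's actual code or API.

```cpp
#include <string>

// Hypothetical helper, not part of TagLib: narrows Unicode code points
// to Latin1 (ISO-8859-1) and substitutes '?' for everything above U+00FF
// instead of silently cutting off the high bits.
std::string toLatin1Lossy(const std::u32string &text)
{
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        if (cp <= 0xFF)
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        else
            out.push_back('?');   // not representable in Latin1
    }
    return out;
}
```

With this approach a Cyrillic title is still unreadable, but every lost
character is visibly marked as lost rather than turned into an unrelated
Latin1 character.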

Second, they shouldn't restrict everything to Latin1 at all. Operator <<
and to8Bit() seem to be used as helpers for printing to the console, and
nowadays the console can run in a non-Latin1 locale: for Russian users,
for example, ru_RU.KOI8-R (the most widely used 8-bit encoding) or
ru_RU.UTF-8 (a variable-width one). Qt's counterpart of to8Bit() does
these conversions properly, according to the locale.
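
A rough sketch of what "according to the locale" could look like on a
POSIX system: ask the locale for its codeset via nl_langinfo(CODESET)
and pick the output encoding from that. The function names here are
hypothetical. To keep the example self-contained, only UTF-8 is really
encoded; any other codeset falls back to ASCII with "?" replacement,
where a real implementation would hand the text to a converter such as
iconv for KOI8-R and friends.

```cpp
#include <clocale>
#include <langinfo.h>   // POSIX only
#include <string>

// Appends one code point to 'out' encoded as UTF-8.
void appendUtf8(std::string &out, char32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Returns the codeset name of the user's locale, e.g. "UTF-8" or
// "KOI8-R". Note: this switches the process-wide LC_CTYPE locale.
std::string localeCodeset()
{
    std::setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

// Hypothetical helper: encodes 'text' for a console using 'codeset'.
// Assumption: only "UTF-8" is handled natively; everything else gets
// the ASCII subset plus '?' for the rest.
std::string encodeForCodeset(const std::u32string &text,
                             const std::string &codeset)
{
    std::string out;
    const bool utf8 = (codeset == "UTF-8");
    for (char32_t cp : text) {
        if (utf8)
            appendUtf8(out, cp);
        else
            out.push_back(cp < 0x80 ? static_cast<char>(cp) : '?');
    }
    return out;
}
```

A console helper would then call something like
encodeForCodeset(text, localeCodeset()) instead of always producing
Latin1.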

If we fix this problem, at least Ogg Vorbis comments (always proper
UTF-8) would be displayed and saved properly with tagreader/tagwriter.

2. The second major problem is the encoding of ID3v2 tags.

There's a known issue with ID3v1 tags: by the standard they should
contain only Latin1 data but, in fact, at least for Russian-titled MP3s,
99% of ID3v1 tags contain windows-1251 encoded data. This is a de facto
standard and, alas, there are already lots of hardware players that
support it :(

TagLib introduced the TagLib::ID3v1::StringHandler class to solve this
issue and, well, it solves the problem, but only partially.

The problem is that, de facto, 99% of Russian MP3s with ID3v2 tags also
contain windows-1251 encoded data, written as Latin1. Unicode support is
generally very weak in the majority of players (including popular
Windows-based players, such as some versions of WinAmp or Windows Media
Player, and most hardware players that support localized interfaces and
character sets), and most taggers use the local 8-bit encoding instead
of Unicode.
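
To show what such a tag looks like in practice: the word "Привет"
stored as windows-1251 is the byte sequence CF F0 E8 E2 E5 F2, which a
strictly Latin1 reader displays as "Ïðèâåò". Below is a minimal sketch
of the recoding a handler would have to do on read. The function name
is hypothetical, and the mapping is deliberately partial: only ASCII
and the Russian letters are covered, and the remaining windows-1251
bytes are replaced by "?".

```cpp
#include <string>

// Hypothetical helper, not part of TagLib: reinterprets 'bytes' as
// windows-1251 and returns the text as UTF-8.
// Assumption: only ASCII, 0xC0..0xFF (U+0410..U+044F), 0xA8 (U+0401)
// and 0xB8 (U+0451) are mapped; other high bytes become '?'.
std::string cp1251ToUtf8(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes) {
        char32_t cp;
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));   // plain ASCII
            continue;
        } else if (b >= 0xC0) {
            cp = 0x0410 + (b - 0xC0);              // А..я
        } else if (b == 0xA8) {
            cp = 0x0401;                           // Ё
        } else if (b == 0xB8) {
            cp = 0x0451;                           // ё
        } else {
            out.push_back('?');                    // unmapped in this sketch
            continue;
        }
        // Every mapped code point lies in U+0400..U+04FF, so it always
        // takes exactly two bytes in UTF-8.
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

A string handler for Latin1-marked frames could run the raw frame bytes
through a function like this when the user has selected windows-1251.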

So there should also be a recoding handler for ID3v2 tags that only
touches the reading and writing of Latin1-encoded tags, and that lets us
decide flexibly whether to write tags in windows-1251 (if the user wants
it) or in Unicode (by default, I guess?).
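
One possible shape for that decision, as a sketch only: a per-user
policy that picks the ID3v2 text encoding byte for a frame. The policy
enum and function names are made up for illustration; the byte values
0x00 (ISO-8859-1) and 0x01 (UTF-16 with BOM) are the ones defined by
the ID3v2 specification. The windows-1251 check covers only ASCII and
the Russian letters.

```cpp
#include <string>

// Hypothetical policy switch, not an existing TagLib type.
enum class Id3v2WritePolicy { Unicode, Legacy1251 };

// Text encoding bytes from the ID3v2 specification.
constexpr unsigned char kEncLatin1 = 0x00;
constexpr unsigned char kEncUtf16  = 0x01;

// True if 'cp' is ASCII or a Russian letter, i.e. something the
// partial windows-1251 mapping assumed in this sketch can store.
bool fitsRussian1251(char32_t cp)
{
    return cp < 0x80 || (cp >= 0x0410 && cp <= 0x044F) ||
           cp == 0x0401 || cp == 0x0451;
}

// Chooses the encoding byte for a text frame.
// - Pure ASCII is always written as "Latin1", which is unambiguous.
// - With Legacy1251, Russian text is written as windows-1251 bytes
//   under the Latin1 marker (non-standard, but what players expect).
// - Everything else is written as UTF-16.
unsigned char chooseFrameEncoding(const std::u32string &text,
                                  Id3v2WritePolicy policy)
{
    bool ascii = true;
    bool legacyOk = true;
    for (char32_t cp : text) {
        if (cp >= 0x80)
            ascii = false;
        if (!fitsRussian1251(cp))
            legacyOk = false;
    }
    if (ascii)
        return kEncLatin1;
    if (policy == Id3v2WritePolicy::Legacy1251 && legacyOk)
        return kEncLatin1;
    return kEncUtf16;
}
```

Writing only pure ASCII under the Latin1 marker in Unicode mode is a
deliberately conservative choice: such frames read the same whether the
reader assumes Latin1 or windows-1251.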

It's a hard matter and I don't know which way to choose to implement it.

3. Lots of applications use TagLib but, so far, only amaroK includes a
selection for the string handler encoding. This shouldn't be an
application-level issue; it could be some sort of library-wide
setting (for example, a config file in ~/.taglib) that would

If we get a perfect tag library, I'll do my best to make most free
software use a single, nice and clean library instead of the heaps of
garbage we have now...

Thanks for your time,

Waiting for your comments,

-- 
WBR, Mikhail Yakshin AKA GreyCat
ALT Linux [http://www.altlinux.ru] [xmpp:greycat at altlinux.org]