Review Request: Amarok can incorrectly scan non-ascii tags

Sat Mar 17 09:00:10 UTC 2012

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/104310/
-----------------------------------------------------------

Review request for Amarok.

Description
-------

Amarok incorrectly scans files with non-ascii characters in tags. The symptom is that 
some of the files have two "invalid UTF character" symbols instead of a single non-ascii
character (looks like <?><?>, question mark inside a black circle). Most visible effect 
of this issue is that some albums end up in Various Artists because one of the tracks 
had artist name corrupted in this way.  It is not limited to artist name, though - 
there are tracks with corrupted album names or titles.                                                                                                                                                   

The reason for this issue is as follows. When Amarok invokes collection scanner                                                        
process, it receives the results from the amarokcollectionscanner over a pipe. Here is 
a snippet of code from src/core-impl/collections/db/ScanManager.cpp:

void    
ScannerJob::getScannerOutput()
{
     m_incompleteTagBuffer += m_scanner->readAll();                                                                                                      
}

The m_incompleteTagBuffer is declared in src/core-impl/collections/db/ScanManager.h:

     QString m_incompleteTagBuffer

However, m_scanner->readAll() returns QByteArray, not QString. This is okay for ASCII
characters (which are 1 byte in UTF8), but breaks in case of multibyte sequences. If
readAll() method returns a block which terminates in a middle of the multibyte sequence,
conversion to QString in ScannerJob::getScannerOutput replaces the last character with
"invalid UTF character" symbol. When the next block is read, it starts in the middle of
UTF8 multibyte sequence - so it gets replaced with one more "invalid UTF character"
symbol. Thus, a single multibyte UTF8 character is replaced with two "invalid character"
symbols.

The solution implemented by the attached patch is to store incomplete information as
QByteArray and search for partial ("</directory>") or full ("</scanner>") elements in the
byte stream, before conversion to QString. Complete blocks can be safely converted to
QString, as the multibyte characters are inside the XML tags.

Diffs
-----

  src/core-impl/collections/db/ScanManager.h 5f0d153 
  src/core-impl/collections/db/ScanManager.cpp 97d0b1c 

Diff: http://git.reviewboard.kde.org/r/104310/diff/

Testing
-------

Thanks,

Alexey Neyman

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/amarok-devel/attachments/20120317/de3c945b/attachment-0001.html>