Review Request: Amarok can incorrectly scan non-ascii tags

Sat Mar 17 21:42:20 UTC 2012

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/104310/#review11510
-----------------------------------------------------------

This review has been submitted with commit 711b585e6284e1346c8661d60e6bf04c0223fd8c by Sam Lade to branch master.

- Commit Hook

On March 17, 2012, 9 a.m., Alexey Neyman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/104310/
> -----------------------------------------------------------
> 
> (Updated March 17, 2012, 9 a.m.)
> 
> 
> Review request for Amarok.
> 
> 
> Description
> -------
> 
> Amarok incorrectly scans files with non-ascii characters in tags. The symptom is that 
> some of the files have two "invalid UTF character" symbols instead of a single non-ascii
> character (looks like <?><?>, question mark inside a black circle). Most visible effect 
> of this issue is that some albums end up in Various Artists because one of the tracks 
> had artist name corrupted in this way.  It is not limited to artist name, though - 
> there are tracks with corrupted album names or titles.                                                                                                                                                   
>                                                                                                                                                                                                                
> The reason for this issue is as follows. When Amarok invokes collection scanner                                                        
> process, it receives the results from the amarokcollectionscanner over a pipe. Here is 
> a snippet of code from src/core-impl/collections/db/ScanManager.cpp:
> 
> void    
> ScannerJob::getScannerOutput()
> {
>      m_incompleteTagBuffer += m_scanner->readAll();                                                                                                      
> }
> 
> The m_incompleteTagBuffer is declared in src/core-impl/collections/db/ScanManager.h:
> 
>      QString m_incompleteTagBuffer
> 
> However, m_scanner->readAll() returns QByteArray, not QString. This is okay for ASCII
> characters (which are 1 byte in UTF8), but breaks in case of multibyte sequences. If
> readAll() method returns a block which terminates in a middle of the multibyte sequence,
> conversion to QString in ScannerJob::getScannerOutput replaces the last character with
> "invalid UTF character" symbol. When the next block is read, it starts in the middle of
> UTF8 multibyte sequence - so it gets replaced with one more "invalid UTF character"
> symbol. Thus, a single multibyte UTF8 character is replaced with two "invalid character"
> symbols.
> 
> The solution implemented by the attached patch is to store incomplete information as
> QByteArray and search for partial ("</directory>") or full ("</scanner>") elements in the
> byte stream, before conversion to QString. Complete blocks can be safely converted to
> QString, as the multibyte characters are inside the XML tags.
> 
> 
> Diffs
> -----
> 
>   src/core-impl/collections/db/ScanManager.h 5f0d153 
>   src/core-impl/collections/db/ScanManager.cpp 97d0b1c 
> 
> Diff: http://git.reviewboard.kde.org/r/104310/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Alexey Neyman
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/amarok-devel/attachments/20120317/82ebc283/attachment.html>