<html>

 <body>

  <div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">

   <table bgcolor="#f9f3c9" width="100%" cellpadding="8" style="border: 1px #c9c399 solid;">

    <tr>

     <td>

      This is an automatically generated e-mail. To reply, visit:

      <a href="http://git.reviewboard.kde.org/r/104310/">http://git.reviewboard.kde.org/r/104310/</a>

     </td>

    </tr>

   </table>

   <br />

<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="8" style="background-image: url('http://git.reviewboard.kde.org/media/rb/images/review_request_box_top_bg.png'); background-position: left top; background-repeat: repeat-x; border: 1px black solid;">

 <tr>

  <td>

<div>Review request for Amarok.</div>

<div>By Alexey Neyman.</div>

<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>

 <table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">

 <tr>

  <td>

   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Amarok incorrectly scans files with non-ASCII characters in tags. The symptom is that

some of the files have two "invalid UTF character" symbols instead of a single non-ASCII

character (looks like <?><?>, a question mark inside a black circle). The most visible effect

of this issue is that some albums end up in Various Artists because one of the tracks

had its artist name corrupted in this way. It is not limited to artist names, though:

there are also tracks with corrupted album names or titles.

The reason for this issue is as follows. When Amarok invokes the collection scanner

process, it receives the results from amarokcollectionscanner over a pipe. Here is

a snippet of code from src/core-impl/collections/db/ScanManager.cpp:

void

ScannerJob::getScannerOutput()

{

     m_incompleteTagBuffer += m_scanner->readAll();

}

The m_incompleteTagBuffer is declared in src/core-impl/collections/db/ScanManager.h:

     QString m_incompleteTagBuffer;

However, m_scanner->readAll() returns a QByteArray, not a QString. This is okay for ASCII

characters (which are 1 byte in UTF-8), but breaks for multibyte sequences. If readAll()

returns a block that ends in the middle of a multibyte sequence, the conversion to QString

performed by += in ScannerJob::getScannerOutput replaces the truncated last character with

an "invalid UTF character" symbol. When the next block is read, it starts in the middle of

the same UTF-8 multibyte sequence, so its leading bytes get replaced with one more

"invalid UTF character" symbol. Thus, a single multibyte UTF-8 character is replaced with

two "invalid UTF character" symbols.
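
To illustrate the effect described above, here is a minimal standalone sketch in plain
C++ (no Qt). countReplacements() is a hypothetical helper written for this example: a
simplified lossy UTF-8 decoder, not Qt's actual codec. It only counts how many "invalid
character" replacements decoding a chunk in isolation would produce.

```cpp
#include <cstddef>
#include <string>

// Simplified lossy UTF-8 decoding (illustration only, NOT Qt's codec):
// returns how many replacement characters (U+FFFD) would be produced if
// `bytes` were decoded on its own, without any neighbouring chunks.
std::size_t countReplacements(const std::string& bytes)
{
    std::size_t replacements = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);

        // Expected sequence length, derived from the lead byte.
        std::size_t length = 0;
        if (lead < 0x80)                length = 1; // ASCII
        else if ((lead & 0xE0) == 0xC0) length = 2;
        else if ((lead & 0xF0) == 0xE0) length = 3;
        else if ((lead & 0xF8) == 0xF0) length = 4;

        if (length == 0) {
            // Stray continuation byte (or invalid lead): one replacement.
            ++replacements;
            ++i;
            continue;
        }

        // All continuation bytes must be present and of the form 10xxxxxx.
        bool complete = (i + length <= bytes.size());
        for (std::size_t k = 1; complete && k < length; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                complete = false;
        }

        if (complete) {
            i += length;
        } else {
            // Truncated or malformed sequence: one replacement, then resync.
            ++replacements;
            ++i;
        }
    }
    return replacements;
}
```

Take "é" (bytes 0xC3 0xA9) split across two reads: decoding "caf\xC3" and "\xA9"
separately gives one replacement each, two in total, while decoding the joined bytes
"caf\xC3\xA9" gives none.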

The solution implemented by the attached patch is to store the incomplete data as a

QByteArray and to search the byte stream for the end of a partial ("</directory>") or full

("</scanner>") element before converting to QString. Complete blocks can be safely converted

to QString: a block that ends at a closing tag cannot end in the middle of a multibyte

character, since all multibyte characters are inside the XML elements.
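
The idea can be sketched in plain C++ as follows. This is not the attached patch:
ByteBlockBuffer, append(), takeBlock() and pending() are hypothetical names made up for
this example, and std::string stands in for QByteArray. Bytes are accumulated undecoded,
and only a block that ends at a closing tag is handed out for conversion.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the scanner job's buffer handling.
// std::string plays the role of QByteArray: it holds raw, undecoded bytes.
class ByteBlockBuffer
{
public:
    // Append raw bytes exactly as read from the pipe; nothing is decoded here.
    void append(const std::string& chunk) { m_buffer += chunk; }

    // If the buffer contains `closingTag`, move everything up to and including
    // its first occurrence into `block` and return true. Otherwise leave the
    // buffer untouched and return false.
    bool takeBlock(const std::string& closingTag, std::string& block)
    {
        const std::size_t pos = m_buffer.find(closingTag);
        if (pos == std::string::npos)
            return false;

        const std::size_t end = pos + closingTag.size();
        block = m_buffer.substr(0, end); // complete block, safe to decode
        m_buffer.erase(0, end);          // keep the undecoded remainder
        return true;
    }

    // Bytes still waiting for a closing tag.
    const std::string& pending() const { return m_buffer; }

private:
    std::string m_buffer;
};
```

Searching the raw bytes for an ASCII closing tag is safe, because every byte of a UTF-8
multibyte sequence has its high bit set and so cannot match an ASCII character. If "é"
is split across two reads, the first takeBlock("</directory>", block) call simply
returns false, and the character is whole again by the time the closing tag arrives.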

</pre>

  </td>

 </tr>

</table>

<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</h1>

<ul style="margin-left: 3em; padding-left: 0;">

 <li>src/core-impl/collections/db/ScanManager.h <span style="color: grey">(5f0d153)</span></li>

 <li>src/core-impl/collections/db/ScanManager.cpp <span style="color: grey">(97d0b1c)</span></li>

</ul>

<p><a href="http://git.reviewboard.kde.org/r/104310/diff/" style="margin-left: 3em;">View Diff</a></p>

  </td>

 </tr>

</table>

  </div>

 </body>

</html>