Scanner benchmark

Leo Franchi lfranchi at kde.org
Wed Nov 17 23:38:13 CET 2010


On Tue, Nov 16, 2010 at 6:48 PM, Jeff Mitchell <mitchell at kde.org> wrote:
> On 11/13/2010 03:07 PM, Leo Franchi wrote:
>> Hello,
>>
>> Below are my observations as well, to see how they compare with other users' results.
>>
>> On Sat, Nov 13, 2010 at 4:06 AM, Mikko C. <mikko.cal at gmail.com> wrote:
>>> Hi,
>>> I found some time to run some tests with the new scanner.
>>>
>>> Amarok from git master of today:
>>> Full rescan with the collection already being present on the external
>>> MySQL database.
>>>
>>> - 11:30 mins for the first scanning part (up to 50% in the progress bar)
>>> - 2:50 mins for the last part (remaining 50%)
>>>
>>> Total time: around 14:20 mins.
>>>
>>> tracks found: 21113
>>> albums found: 1703
>>> artists found: 1013
>>
>> Rescan with an empty MySQL database:
>>
>> 11:00 amarokcollectionscanner run
>> 16:00 scan result processing / committing
>>
>> total of 26:00
>>
>> 47 636 tracks.
>>
>> Old scanner:
>>
>> 11:30 total time for amarokcollectionscanner + committing.
>
> This difference is almost certainly down to the way that insertions and
> other DB accesses were handled in the old scanning code.
>
> I did a lot of work doing everything I possibly could to minimize DB
> calls, because they were by far the slowest part of the scanning,
> other than actual I/O access on the drives. The end result was a lot of
> really nasty data structures that emulate the behavior of running
> various SQL calls. These data structures would store all the
> information to be committed, and then commit it in one go, using the
> largest packet size possible. This made it quite complex, yes -- but it
> made it extremely fast. You've probably seen these posts before, but
> see e.g.
> http://jefferai.org/2009/07/db-changes-call-for-benchmarkers/ and
> http://jefferai.org/2009/10/speed-never-gets-old-at-least-in-software/
> and especially
> http://jefferai.org/2009/11/the-collection-scanners-ultimate-speed-bump-and-cases/
>
> I haven't seen any proper query logs for the new scanner, because when
> I last looked at them with Leo there were logic problems in the new
> scanner that were mangling the queries -- hopefully those have been
> fixed. But from what I *did* see, I'm guessing that each track triggers
> several database accesses -- an INSERT or two into various tables and
> several SELECT queries. If so, that is going to be the big bottleneck
> and the main reason for the slowdown.
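
To make the batching idea concrete, here is a minimal sketch (not
Amarok's actual code; the table, column, and function names are invented
for illustration) of collecting track rows in memory and committing them
as multi-row INSERT statements kept under a packet-size budget, instead
of issuing one statement per track:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Track { std::string url; std::string title; int artistId; };

// Stand-in for the real database call; real code would escape the values
// or use prepared statements instead of pasting strings together.
static void executeSql(const std::string &sql) { std::cout << sql << "\n"; }

// Commit all tracks with as few INSERT statements as possible, flushing
// whenever the next row would push the statement past maxPacketBytes
// (roughly the server's max_allowed_packet).
static void commitTracks(const std::vector<Track> &tracks, std::size_t maxPacketBytes)
{
    const std::string head = "INSERT INTO tracks (url, title, artist) VALUES ";
    std::string sql = head;
    bool haveRows = false;
    for (const Track &t : tracks) {
        const std::string row = "('" + t.url + "','" + t.title + "',"
                              + std::to_string(t.artistId) + ")";
        if (haveRows && sql.size() + row.size() + 1 > maxPacketBytes) {
            executeSql(sql);            // flush the current batch
            sql = head;
            haveRows = false;
        }
        if (haveRows)
            sql += ",";
        sql += row;
        haveRows = true;
    }
    if (haveRows)
        executeSql(sql);                // flush the final batch
}

int main()
{
    const std::vector<Track> tracks = { { "file:///a.mp3", "A", 1 },
                                        { "file:///b.mp3", "B", 2 } };
    commitTracks(tracks, 16 * 1024 * 1024);  // e.g. a 16 MiB packet budget
}

With tens of thousands of tracks, this turns tens of thousands of round
trips into a handful of large statements, which is the effect the old
scanner's in-memory structures were built to achieve.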

When I profiled the slowness of the new scan result processor, 95% of
the time was spent in MySQL calls, so I just want to underline Jeff's
point: thousands of SQL queries == bad, and all of Jeff's hard work on
minimizing the number of SQL operations the scanner issues is not
something to throw away lightly.
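
For scale, a back-of-the-envelope sketch (the per-track statement count
is an assumption taken from Jeff's guess above, not a measurement of the
new scanner):

#include <iostream>

int main()
{
    // Assumed per-track pattern, based on Jeff's guess above (not measured
    // from the new scanner): two lookups plus an INSERT for every track.
    const long statementsPerTrack = 3;
    const long tracks = 47636;        // Leo's collection size from this thread
    std::cout << tracks * statementsPerTrack
              << " statements for one full scan\n";   // prints 142908
}

At that volume the per-statement round-trip overhead alone can dominate
the scan time, which is consistent with the profile above.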

I do hope and believe we can land the needed fixes in the current
scanner before we get close to the 2.4 betas. But if we get there and
the scanner is still significantly slower for users with large
collections (of which we have a lot!), we should revert to the old
scanner until the issues are worked out.

leo
-- 
_____________________________________________________________________
leo at kdab.com                                 KDAB (USA), LLC
lfranchi at kde.org                             The KDE Project

