Scanner benchmark

Ralf Engels ralf-engels at gmx.de
Thu Nov 18 02:54:32 CET 2010


On Wed, 2010-11-17 at 17:38 -0500, Leo Franchi wrote:
> On Tue, Nov 16, 2010 at 6:48 PM, Jeff Mitchell <mitchell at kde.org> wrote:
> > On 11/13/2010 03:07 PM, Leo Franchi wrote:
> >> Hello,
> >>
> >> Below are my observations too, just to see how they compare with other users'.
> >>
> >> On Sat, Nov 13, 2010 at 4:06 AM, Mikko C. <mikko.cal at gmail.com> wrote:
> >>> Hi,
> >>> I found some time to run some tests with the new scanner.
> >>>
> >>> Amarok from git master of today:
> >>> Full rescan with the collection already being present on the external
> >>> MySQL database.
> >>>
> >>> - 11:30 mins for the first scanning part (up to 50% in the progress bar)
> >>> - 2:50 mins for the last part (remaining 50%)
> >>>
> >>> Total time: around 14:20 mins.
> >>>
> >>> tracks found: 21113
> >>> albums found: 1703
> >>> artists found: 1013
> >>
> >> Rescan with empty mysql database:
> >>
> >> 11:00 amarokcollectionscanner run
> >> 16:00 scan result processing / committing
> >>
> >> total of 26:00
> >>
> >> 47 636 tracks.
> >>
> >> Old scanner:
> >>
> >> 11:30 total time for amarokcollectionscanner + committing.
> >
> > This difference is almost certainly due to the way that insertions
> > and other DB accesses were handled in the old scanning code.
> >
> > I did a lot of work doing everything I possibly could to minimize DB
> > calls, because they were by far the slowest part of the scanning,
> > other than actual I/O access on the drives. The end result was a lot of
> > really nasty data structures to be able to emulate the behavior of
> > running various SQL calls. These data structures would store all
> > information to be committed, and then this information would be
> > committed in one go, using the largest packet size possible. This made
> > it quite complex, yes -- but it made it extremely fast. You've probably
> > seen them before but see e.g.
> > http://jefferai.org/2009/07/db-changes-call-for-benchmarkers/ and
> > http://jefferai.org/2009/10/speed-never-gets-old-at-least-in-software/
> > and especially
> > http://jefferai.org/2009/11/the-collection-scanners-ultimate-speed-bump-and-cases/
> >
> > I haven't seen any proper query logs for the new scanner because when I
> > was last looking at them with Leo, there were logic problems in the
> > new scanner that kept the queries screwed up -- hopefully those have
> > been fixed. But I'm guessing from what I *did* see that each track uses
> > several database accesses -- an INSERT or two into various tables and
> > several SELECT queries. If so, this is going to be the big
> > bottleneck and the big reason for the slowdown.
> 
> When I profiled the slowness of the new scan result processor, 95% of
> the time was spent in MySQL calls. Just wanted to underline Jeff's
> point. Thousands of SQL queries == bad, and all of Jeff's hard work
> making the scanner minimize how many SQL operations it did is not
> something to throw away lightly.
> 
> I do hope and believe we can get the needed fixes to the current
> scanner before getting closer to 2.4 betas. But if we get there and
> the scanner is still significantly worse for users with large
> collections (of which we have a lot!) we should revert to the old
> scanner until the issues are worked out.
> 
> leo


Hi all,
I agree that 16 minutes just for committing the data is too much.
My earlier tests with the scan result processor showed a time increase
of 200% to 300%.
That sounds bad, but my tests also showed that only 5% of the time was
spent in the database.

A collection of 13,000 files needs 58 seconds for a full scan on an
existing collection and 93 seconds on an empty one.
Extrapolating to the 47,000-file collection (47000 / 13000 × 93 s is
roughly 340 s), it should take around five to six minutes, which would
be consistent with the 200% time increase.

Now it seems that for large collections there is an additional delay
somewhere. I still assume that some kind of index buffer is getting too
big for memory, which would cause additional delays.
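
For what it's worth, one way to test that theory would be to compare
key_buffer_size (the MyISAM index buffer) against the accumulated
index size of our tables. A minimal sketch with Qt's SQL module; the
connection details are placeholders, and the idea that the MyISAM key
buffer is the buffer in question is just my assumption:

    #include <QtCore/QCoreApplication>
    #include <QtCore/QDebug>
    #include <QtSql/QSqlDatabase>
    #include <QtSql/QSqlQuery>
    #include <QtSql/QSqlRecord>

    int main(int argc, char **argv)
    {
        QCoreApplication app(argc, argv);

        // Placeholder connection, not Amarok's real setup.
        QSqlDatabase db = QSqlDatabase::addDatabase("QMYSQL");
        db.setDatabaseName("amarok");
        if (!db.open())
            return 1;

        // The configured MyISAM index buffer.
        QSqlQuery q(db);
        q.exec("SHOW VARIABLES LIKE 'key_buffer_size'");
        if (q.next())
            qDebug() << "key_buffer_size:" << q.value(1).toString();

        // Accumulated index size over all tables in the schema.
        q.exec("SHOW TABLE STATUS");
        const int col = q.record().indexOf("Index_length");
        qlonglong indexBytes = 0;
        while (q.next())
            indexBytes += q.value(col).toLongLong();
        qDebug() << "total index size:" << indexBytes << "bytes";
        return 0;
    }

If the indexes outgrow that buffer right around the collection sizes
where the slowdown starts, that would explain it.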

Let's just see what else I can do. There are a lot of options open
before we need to copy whole tables around.
One of them would be for the Registry to realize that it already has
all existing tracks buffered; from that point on it would no longer
need to query for additional tracks.
Another would be to precompile queries, or to combine them as the old
scanner did.
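
To illustrate the combining option (essentially what Jeff describes
above): the processor could accumulate value tuples and flush them as
one multi-row INSERT just below the packet limit, instead of issuing
one INSERT per track. A rough sketch; the table, the columns and the
1 MiB limit are made up for illustration:

    #include <QtCore/QCoreApplication>
    #include <QtCore/QStringList>
    #include <QtSql/QSqlDatabase>
    #include <QtSql/QSqlQuery>

    // Flush the accumulated tuples as one multi-row INSERT instead of
    // one statement per track. Values are assumed to be escaped.
    static void flushBatch(QSqlDatabase &db, QStringList &values)
    {
        if (values.isEmpty())
            return;
        QSqlQuery q(db);
        // Hypothetical table and columns; the real schema differs.
        q.exec("INSERT INTO tracks (url, title, artist) VALUES "
               + values.join(","));
        values.clear();
    }

    int main(int argc, char **argv)
    {
        QCoreApplication app(argc, argv);
        QSqlDatabase db = QSqlDatabase::addDatabase("QMYSQL");
        db.setDatabaseName("amarok");
        if (!db.open())
            return 1;

        QStringList values;
        int batchBytes = 0;
        const int packetLimit = 1 << 20; // stay below max_allowed_packet

        for (int i = 0; i < 47000; ++i) { // stand-in for the scanned tracks
            const QString tuple =
                QString("('/music/%1.mp3','Title %1','Artist')").arg(i);
            values << tuple;
            batchBytes += tuple.size();
            if (batchBytes >= packetLimit) {
                flushBatch(db, values);
                batchBytes = 0;
            }
        }
        flushBatch(db, values); // the remainder
        return 0;
    }

That would replace tens of thousands of round trips with a handful,
which as far as I can tell is where the old scanner got its speed.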

But first I would like to find out why the access time increases so
much. That would not only affect the scanning but also slow down every
other operation.


Also, while we are at it:
I am still thinking that we might commit the changes as we go along.
The only drawback would be that the abort button next to the progress
bar would no longer abort but just stop the scanning.
I think that this might not be a bad thing. The button does not have a
label, just a no-parking symbol.

At least this "committing while we scan" should be done for an empty
collection.
It would decrease the time a new user has to wait before he can start
using Amarok.
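
Roughly what I have in mind, as a sketch (the directory writing is a
stub against a made-up schema, and whether the transaction buys us
anything depends on the table engine):

    #include <QtCore/QStringList>
    #include <QtSql/QSqlDatabase>
    #include <QtSql/QSqlQuery>

    // Stub for writing one directory's scan result; the schema is
    // hypothetical.
    static void writeDirectoryResult(QSqlDatabase &db, const QString &dir)
    {
        QSqlQuery q(db);
        q.prepare("INSERT INTO directories (path) VALUES (?)");
        q.addBindValue(dir);
        q.exec();
    }

    // Commit after every directory instead of in one big batch at the
    // end. Aborting then just stops the scan; everything committed so
    // far stays in the collection.
    void processScan(QSqlDatabase &db, const QStringList &dirs,
                     const bool &abortRequested)
    {
        foreach (const QString &dir, dirs) {
            if (abortRequested)
                break; // earlier directories are already committed
            db.transaction();
            writeDirectoryResult(db, dir);
            db.commit();
        }
    }

For an empty collection in particular, the first directories would
show up while the scanner is still running.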


Cheers,
Ralf


