Scanner benchmark

Jeff Mitchell mitchell at kde.org
Thu Nov 18 04:20:43 CET 2010


On 11/17/2010 08:54 PM, Ralf Engels wrote:
> I agree that 16 minutes just for committing the data is too much.
> My earlier tests with the scan result processor showed a time increase
> of 200% to 300%.
> Now this sounds bad, but my test also showed that only 5% of the time
> was used in the database.

This is probably due to the smaller size of the collection.

> A collection of 13000 files needs 58 seconds for a full scan on an
> existing collection and 93 seconds on an empty one.
> Extrapolating to the 47000-file collection, you could say it should
> take around five to six minutes, which would be consistent with the
> 200% time increase.
> 
> Now it seems that for large collections there is an additional delay
> somewhere. I still assume that some kind of index buffer is getting too
> big for memory, which would cause additional delays.

<snip>

> But first I would like to find out why the access time increases so
> much. That might not only affect the scanning but would also slow down
> every other operation.

With the older scanner it was this way too -- the time delay wasn't
linear. Part of the problem is certainly that you are projecting based
on the entire scan time -- the mysql work is going to scale at a
different rate than the file scanning.

But I think there's a deeper problem -- the delay doesn't seem to scale
linearly anyway; the more calls pile up, the longer each one takes. I
wouldn't be surprised if mysql queues up the actual data processing and
returns early from the calls -- which would explain a more exponential
curve, which in my experience is what happens with both the old scanner
and the new.

> Let's just see what else I can do. There are a lot of options open
> before we need to copy whole tables around.
> One of them, for example, would be for the Registry to realize that it
> has all existing tracks already buffered. From that point on it would
> no longer need to query for additional tracks.

Well, that's basically what the accelerations I put into place did
before. They preloaded data structures from the database -- which could
be the registry, as long as it isn't doing tons of queries internally
(which was a problem before) -- then operated on those, then committed
the results back.
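
To illustrate the idea -- a rough sketch of that kind of preload, using
the C API directly. The table and column names here are made up, not our
actual schema, and error handling is minimal:

    #include <mysql.h>
    #include <string>
    #include <unordered_map>

    // Load every known track URL into memory once, so the scanner can
    // answer "do we already have this file?" without a per-track SELECT.
    std::unordered_map<std::string, long> preloadTrackUrls( MYSQL *db )
    {
        std::unordered_map<std::string, long> cache;
        if( mysql_query( db, "SELECT id, rpath FROM urls" ) != 0 )
            return cache; // real code should report mysql_error( db )

        if( MYSQL_RES *result = mysql_store_result( db ) )
        {
            while( MYSQL_ROW row = mysql_fetch_row( result ) )
                cache[ row[1] ] = std::stol( row[0] );
            mysql_free_result( result );
        }
        return cache;
    }

After that, "does this track exist" is a hash lookup instead of a round
trip to mysql.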

> Another would be to precompile queries, or to combine them as the old
> scanner did.

I did some work on precompiled queries before -- the C API was super
painful and I couldn't get it to work without crashes. But you might
have better luck. Supposedly it should help quite a bit.
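
For reference, this is roughly what a precompiled query looks like
through the C API -- prepare the statement once, then rebind and
re-execute it per track. Again, the schema names are invented and error
handling is minimal:

    #include <mysql.h>
    #include <cstring>
    #include <string>
    #include <vector>

    // Prepare one INSERT and execute it for every path; mysql parses
    // and plans the statement a single time instead of once per track.
    void insertPaths( MYSQL *db, const std::vector<std::string> &paths )
    {
        MYSQL_STMT *stmt = mysql_stmt_init( db );
        if( !stmt )
            return;

        const char *sql = "INSERT INTO urls (rpath) VALUES (?)";
        if( mysql_stmt_prepare( stmt, sql, strlen( sql ) ) != 0 )
        {
            mysql_stmt_close( stmt ); // report mysql_stmt_error( stmt )
            return;
        }

        for( size_t i = 0; i < paths.size(); ++i )
        {
            MYSQL_BIND bind;
            memset( &bind, 0, sizeof( bind ) );
            unsigned long length = paths[i].size();
            bind.buffer_type   = MYSQL_TYPE_STRING;
            bind.buffer        = const_cast<char*>( paths[i].c_str() );
            bind.buffer_length = length;
            bind.length        = &length;

            if( mysql_stmt_bind_param( stmt, &bind ) != 0
                || mysql_stmt_execute( stmt ) != 0 )
                break; // bail on first error
        }
        mysql_stmt_close( stmt );
    }

The win is that mysql parses the statement once instead of once per
insert -- assuming the binding gymnastics don't crash it, which was my
problem.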


> Also, while we are at it:
> I am still thinking that we might commit the changes as we go along.
> The only drawback would be that the abort button near the progress bar
> would not abort but instead just stop the scanning.
> I think that this might not be a bad thing. The button does not have any
> label, just a no-parking symbol.

Why is that necessarily better? You can commit it all at the end and
still allow people to actually abort the scan. (If our DB supported
transactions, we could do it that way :-(  ). What will really help is
drastically reducing the number of commits, which can be done by
batching many inserts into a single query and using the max packet size
to split the batches up.
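
Something like this, as a sketch -- accumulate rows into one multi-row
INSERT and flush whenever the next row would blow past
max_allowed_packet. Schema names invented, and it assumes the values
were escaped beforehand:

    #include <mysql.h>
    #include <string>
    #include <vector>

    // Accumulate rows into one multi-row INSERT, flushing whenever the
    // next tuple would push the query past maxPacket (the server's
    // max_allowed_packet value). Assumes the values were already escaped
    // with mysql_real_escape_string().
    void batchInsert( MYSQL *db,
                      const std::vector<std::string> &escapedPaths,
                      size_t maxPacket )
    {
        const std::string head = "INSERT INTO urls (rpath) VALUES ";
        std::string query;

        for( size_t i = 0; i < escapedPaths.size(); ++i )
        {
            const std::string tuple = "('" + escapedPaths[i] + "')";
            // flush if appending this tuple (plus a comma) would
            // exceed the packet limit
            if( !query.empty()
                && query.size() + 1 + tuple.size() > maxPacket )
            {
                mysql_real_query( db, query.c_str(), query.size() );
                query.clear();
            }
            query += query.empty() ? head + tuple : "," + tuple;
        }
        if( !query.empty() )
            mysql_real_query( db, query.c_str(), query.size() );
    }

That turns tens of thousands of round trips into a handful of big ones.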

> At least this "committing while we scan" should be done for an empty
> collection.
> It would decrease the time a new user has to wait until he can start
> using Amarok.

Only if the track he wants to listen to is part of the initial commit.
But ideally the committing period should be short enough that this
shouldn't need to be a concern. If it's not, something is seriously wrong.

--Jeff
