[kdepim-users] Tbird versus Kmail, performance

Fri Nov 14 14:12:12 GMT 2014

[ Comments below, in-line ]

On 11/14/2014 06:36 AM, René J.V. Bertin wrote:
>
> [ trimmed ]
>
> Which configures there isn't much IO going on during these lengthy
> operations. 

Yes, in my (VM) tests, after ensuring MySQL's logging was disabled,
and increase MySQL's memory to effectively cache the /akonadi/ tables,
I found I was CPU bound with one processor.

My host O/S has eight (logical) CPU's so I gave the VM an additional
four CPU's.  MySQL ate those CPU's too.  Baloo was eating some CPU but
it is nice'd so not that big of a deal.

I dug in to MySQL to see what it was doing.  I didn't do anything very
complicated to measure what SQL was running.  I repeatedly ran:

   > show full processlist;

This shows the database connections and if they're running SQL, the
SQL they're running.

From multiple connections, I'd see the following queries being run:

   SELECT count(DISTINCT PimItemTable.id)
   FROM   PimItemTable
          INNER JOIN PimItemFlagRelation
          ON ( PimItemTable.id = PimItemFlagRelation.PimItem_id )
   WHERE ( PimItemTable.collectionId = ?
   AND   ( PimItemFlagRelation.Flag_id = ?
                      OR
           PimItemFlagRelation.Flag_id = ? ) )

and

   SELECT count(id), sum(size)
   FROM   PimItemTable
   WHERE  collectionId = ?;

I looked at the data distribution for both /collectionId/ and
/Flag_id/ and there weren't many distinct values.

I tried to /cheat/ and create some /covering/ indexes to get slightly
better performance but it was marginal.  This index helps the second
query:

   CREATE INDEX idx01 ON PimItemTable (collectionId, id, size);

I circled back to the queries and realized they're somewhat
unbounded.  I'd have expected the query to use a `date/time' column to
further restrict it.

For example, if we're checking for new data between `time x' and `time
x + 1', we should ask the DBMS to only return the difference.
Otherwise we're forcing the DBMS to do a lot of work, which in this
case, translates itself to high CPU utilization (since all the data is
in cache).

The rule is to limit the amount of data pulled from the DB.  This way
the DB has less `data to traverse' .. which will likely translate to
better scalability.

Carrying the above through the first example, we might do something
like this in the WHERE-clause:

   ::: Time 0 :::

   --
   -- DB Client sets /min_timestamp/ to some low value (e.g.
   -- 1970 00:00:00)
   --
   WHERE ...
   AND   some-timestamp BETWEEN min_timestamp AND curr_timestamp

   ::: Time 1 :::

   --
   -- The DB client knows the last time it checked (curr_timestamp)
   -- and that becomes /last_time_we_checked/ and it re-computes
   -- curr_timestamp.
   --
   WHERE ...
   AND   some-timestamp BETWEEN last_time_we_checked AND curr_timestamp

etc.

Cheers,
--
Pablo Sanchez - Blueoak Database Engineering, Inc
Ph:    819.459.1926         Blog:  http://pablo-blog.blueoakdb.com
iNum:  883.5100.0990.1054

_______________________________________________
KDE PIM users mailing list
Subscription management: https://mail.kde.org/mailman/listinfo/kdepim-users