[Nepomuk] [Soprano-devel] Benchmarking storage backends

Thu Oct 22 14:12:38 CEST 2009

On Thu, 2009-10-22 at 13:22 +0200, Sebastian Trüg wrote: 
> On Thursday 22 October 2009 12:43:55 Ben Martin wrote:
> > Hi,
> >   As I'm tinkering with a new backend design for soprano I'm wondering
> > what folks use to benchmark nepomuk for KDE4 usage?
> > 
> >   Do folks just use the generic RDF benchmarking:
> > http://esw.w3.org/topic/RdfStoreBenchmarking
> > when comparing sesame2 to virtuoso backend for example?
> > 
> 
> folks, in this case me, don't do much benchmarking at all. So far there was no 
> real need for it since there has never been any choice: in the beginning we 
> only had redland. You know that it is slow by using it for a few days. No need 
> for a benchmark. Then we had sesame2 which is deprecated by Virtuoso simply 
> because the latter has so many advantages. Performance is not even in the top 
> 5. ;)

While I have my qualms with redland, a few of which I've made public in
the past, one area where it does perform fairly well is
well-constructed listStatements() calls. If you try to use SPARQL with
redland, though, things get interesting.

A few numbers, using this data set generator:
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html

$ cat run.sh 
#!/bin/bash
java -cp bin:lib/ssj.jar benchmark.generator.Generator "$@"
$ ./run.sh -fc -pc 1000 -s nt 
$ mv dataset.nt  thousand-prods.nt 

So far I'm only comparing redland with my new boostmmap backend.
The SPARQL is based on Query 6 from the above, dropping the second
predicate from the query because my clumsy SPARQL implementation doesn't
handle that much yet :| I also had to hack a few things in the boostmmap
code from 0.0.1 for the test. Gah, a better statistics model and a
non-toy SPARQL evaluator are sorely needed in boostmmap, but at least
the results seem promising so far.

-------

$ time sopranocmd --backend redland \
  --serialization ntriples \
  import thousand-prods.nt >|out 2>&1
35 minutes
485 MB

$ time sopranocmd --backend redland \
  list "" \
  '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>' \
  '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/Product>' \
  >| /tmp/out.r 2>&1
real 0m0.079s

$ time sopranocmd --backend redland \
  query "
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?a ?b ?c
where { ?a rdfs:label ?c .
        filter( regex( str( ?c ), 'excites' ))}"

3.8 seconds

-------------

The same stuff using boostmmap:

$ sopranocmd --backend boostmmap \
  --serialization ntriples \
  import thousand-prods.nt >|out 2>&1

# unfortunately, forgot to time it :(
242M 22.Oct.2009 21:01 triples.mmap

$ time sopranocmd --backend boostmmap \
  list "" \
  '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>' \
  '<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/Product>' \
  >| /tmp/out 2>&1
real 0m0.185s

$ time sopranocmd --backend boostmmap \
  query "
select ?a ?b ?c 
where { ?a http://www.w3.org/2000/01/rdf-schema#label ?c .
        filter( regex( str( ?c ), 'excites' ))}"
0.082s

I'm currently compiling boostmmap without optimization and with debug
symbols in, so with an optimized build the direct p+o->s lookup should
get closer to redland. The SPARQL difference is quite surprising.
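
Given how surprising the gap is, a quick sanity check that both backends
return the same statements seems worthwhile before reading much into the
timings. A minimal sketch for the two list runs above, assuming
sopranocmd writes one result per line and that only the ordering may
differ between backends (I have not verified either). same_results is a
hypothetical helper name, not part of soprano:

```shell
# Hypothetical helper: succeed when two result files contain the same
# lines, ignoring order. Assumes one result per line.
same_results() {
    [ "$(LC_ALL=C sort "$1")" = "$(LC_ALL=C sort "$2")" ]
}
```

Usage would be something like
same_results /tmp/out.r /tmp/out && echo "backends agree"
with the caveat that any timing or summary lines captured via 2>&1 will
differ and would need to be filtered out first.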

These files are on a 3-disk RAID-5 of 500 GB disks, with a Q6600 quad
core and 8 GB RAM. The same filesystem was used for both backends, all
queries were executed multiple times in succession, and the hot-cache
time is reported. All times are the "real" time from the time command.

I chose hot cache because it's easier to measure, and when making
frequent use of the store most data will likely be in one RAM cache or
another.
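
The hot-cache procedure can be scripted roughly as follows. This is only
a sketch under some assumptions: hot_time_ms is a hypothetical helper,
it relies on GNU date for the %N (nanoseconds) format, and taking the
fastest of the repeated runs as "the hot-cache time" is my choice here
rather than a fixed rule.

```shell
# Hypothetical helper: run a command N times back to back and print the
# fastest wall-clock time in milliseconds. The body runs in a subshell
# so its variables do not leak into the caller.
# Assumes GNU date (the %N format is not POSIX).
hot_time_ms() (
    runs=$1
    shift
    best=
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)
        "$@" >/dev/null 2>&1          # discard the command's own output
        end=$(date +%s%N)
        elapsed=$(( (end - start) / 1000000 ))
        # Keep the smallest elapsed time seen so far.
        if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
            best=$elapsed
        fi
        i=$((i + 1))
    done
    echo "$best"
)
```

For example, hot_time_ms 5 sopranocmd --backend redland list "" ...
with the same arguments as in the runs above. The first iteration warms
the cache, so the minimum should reflect the hot-cache case.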

I think perhaps to be fair to virtuoso/sesame2 I should use a dedicated
server instead of having one started each time.

> 
> But benchmarking seems like a good idea. And why not start with the standard 
> one. Any chance that could be integrated into the Soprano model unit test?

We could go this way, or keep the benchmark separate. I guess it depends
on how much stuff I can put together.

Short term it might be a good idea to just put these results and
commands on a wiki somewhere.

> 
> Cheers,
> Sebastian
