[rkward-devel] a "misfeature"

Thu Apr 5 19:31:55 UTC 2007

On Thursday 05 April 2007 20:41, you wrote:
> No, GO has a few large objects.

Well, yes, it only has 24 top level objects (which are environments), but 
inside those, there are hundreds of thousands of small objects.

> And rk.get.structure is not likely to be the main problem. I get (in plain R
> run from a shell):
> > library(GO)
> > length(ls("package:GO"))
>
> [1] 24
>
> > system.time(sapply(ls("package:GO"), exists))
>
>    user  system elapsed
>  44.311   1.848  48.466

At least on my system, doing this leads to heavy swapping (a memory problem), 
and most of the time is spent there. Of course memory *is* a problem in my 
current approach, but we could in fact find a solution that requires "only" 
loading the data, but not keeping it in memory (I looked into this a bit, and 
it seems doable at the C level).

> > system.time(sapply(ls("package:GO"), exists))
>
>    user  system elapsed
>   0.004   0.000   0.003
>
> The second time around it is much faster. I'm pretty sure descending
> into environments inside rk.get.structure has negligible overhead
> compared to the initial load times.

No time to do serious timing right now, but I think it does contribute 
considerably. A simple example:

library(GO)
# let's use just one of the environments in GO, for now:
system.time(sapply(ls(GOTERM), exists))
# [1] 2.408 0.040 2.476 0.000 0.000
system.time(sapply(ls(GOTERM), exists))
# [1] 0.356 0.000 0.355 0.000 0.000

# now, after the data is already loaded:
library (rkward)
system.time (.rk.get.structure (GOTERM, "GOTERM"))
# [1] 23.861  0.136 24.720  0.000  0.000
# this step does not get any better on repetition.

Whichever way we go, I think this is something that really could and should 
be optimized as well (it also becomes a problem for complex nested lists, 
such as those produced by the XML package when parsing large XML files).
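
To make the cost of descending into a large environment easier to reproduce 
without GO or rkward installed, here is a rough sketch. It assumes a synthetic 
environment of small list objects as a stand-in for GOTERM, and 
describe_structure is a hypothetical helper written for this illustration, not 
the actual .rk.get.structure code:

```r
# Hypothetical stand-in for a recursive structure walker. This is NOT the
# real .rk.get.structure; it only mimics the idea of descending into
# environments and lists and recording one node per object.
describe_structure <- function(x, name, max_depth = 10) {
  children <- list()
  if (max_depth > 0) {
    if (is.environment(x)) {
      for (n in ls(x, all.names = TRUE)) {
        child <- get(n, envir = x, inherits = FALSE)
        children[[n]] <- describe_structure(child, n, max_depth - 1)
      }
    } else if (is.list(x)) {
      nms <- names(x)
      if (is.null(nms)) nms <- rep("", length(x))
      for (i in seq_along(x)) {
        children[[i]] <- describe_structure(x[[i]], nms[i], max_depth - 1)
      }
    }
  }
  list(name = name, class = class(x)[1], children = children)
}

# Synthetic environment with many small objects (sizes are arbitrary).
e <- new.env()
for (i in 1:5000) {
  assign(paste0("GO:", i), list(id = i, term = "x"), envir = e)
}

# Cheap check per object, comparable to the exists() timing above.
print(system.time(sapply(ls(e), exists, envir = e)))

# Full recursive descent: one node per object plus one per list element.
print(system.time(s <- describe_structure(e, "e")))
```

The absolute numbers will differ from machine to machine; the point is only 
that the second call does work proportional to the total number of nested 
objects, and does so again on every repetition. The max_depth argument is 
there to guard against environments that contain themselves.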

> I think the solution is to build up a database beforehand. Objects in
> .GlobalEnv are usually not a problem (they are already loaded) and the
> current approach should be fine. Package namespaces are typically
> sealed, and it should be hard for the user to modify things inside (if
> they do, they deserve whatever they get). So, given a specific version
> of a package, one should only need to compute the relevant information
> once. I think this is a general enough problem that a common solution
> that other front-ends (e.g. rcompgen) can use would be helpful. This
> will need some consensus on what information such a database should
> contain. Clearly, rk.get.structure (which I'm not familiar with) can
> be the basis for a starting point.
>
> Ideally, this database should be computed by R CMD build (like the
> INDEX file) and distributed as part of the package. This is not going
> to happen anytime soon, but one good way to move forward in that
> direction would be to write a separate package (not tied to rkward)
> that would create such a file (I would recommend a plain text format
> that read.table can read, rather than fancy XML-type things) given a
> package. The codetools package may be helpful here (or not, I don't
> really know).
>
> Once this is done, there has to be a decision on when to compute and
> where to store that information. That's a topic for later.

This is an interesting suggestion indeed, and may well be the way to go. 
I probably won't have time to read and respond to e-mail for the next few 
days (in fact, I'll be off in a few minutes), but maybe we can take up this 
discussion again next week. I'd be glad to have your input on this.
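
For reference when we pick this up again, a minimal sketch of the 
"plain text database" idea from the quoted suggestion. Everything here is an 
assumption made for illustration: build_index is a hypothetical helper, the 
columns (name, class, type) are only a guess at what such a database might 
contain, and the cache file name and location are arbitrary. It uses the 
stats package simply because it is always available:

```r
# Hypothetical helper: collect basic information about every object on the
# search path entry of an attached package. The choice of columns is an
# assumption, not an agreed-upon format.
build_index <- function(pkg) {
  env <- as.environment(paste0("package:", pkg))
  nms <- ls(env)
  data.frame(
    name = nms,
    class = vapply(nms, function(n) class(get(n, envir = env))[1],
                   character(1), USE.NAMES = FALSE),
    type = vapply(nms, function(n) typeof(get(n, envir = env)),
                  character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

idx <- build_index("stats")

# Key the cache file on the package version, so that it only needs to be
# computed once per version (file location is arbitrary here).
ver <- as.character(packageVersion("stats"))
cache <- file.path(tempdir(), paste0("stats_", ver, ".tsv"))
write.table(idx, cache, sep = "\t", quote = FALSE, row.names = FALSE)

# Reading it back needs nothing beyond read.table.
cached <- read.table(cache, header = TRUE, sep = "\t", quote = "",
                     comment.char = "", colClasses = "character")
```

A front-end could then consult the cached table instead of touching the 
package's objects at all, which would avoid triggering the load of large 
lazy-loaded data in the first place.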

Regards
Thomas