[rkward-devel] a "misfeature"

Wed Apr 18 19:39:36 UTC 2007

Update on .rk.get.structure():

I have finally finished the .rk.get.structure() rework. To activate the
new .rk.get.structure(), you will have to run make install. Overall, I'm not
overly happy with the new solution; read on for details:

1) Correctness:
I hope the new .rk.get.structure() is at least as correct as the previous
function. Unfortunately, it turned out that doing this sort of thing is
quite complex in C, as R objects can, to a considerable degree, disguise
themselves as being of a completely different type. If only the C side of R
were object oriented...

2) Speed:
As a consequence, the speedup is not quite as large as I thought it could be.
However, for typical packages it's a factor of roughly 2 (if there are lots
of functions) to 4 (if there are lots of data objects). For the GO package
it's closer to a factor of 10, but then this package is not typical at all.
Note, however, that these figures apply only once the data is already
loaded. The effective speedup is more likely to be around 50% at most
(again, in typical cases). Add to that all the other noise, and the speedup
is not really noticeable (esp. during startup).

3) Lazy Loading:
The data is still loaded for this purpose, but now the loaded objects are no 
longer kept permanently. This means the CPU cycles are spent for good, but at 
least the memory is not wasted. This surely is not optimal, but most likely 
better than before.
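
For illustration, here is a minimal sketch in plain R of why merely
inspecting a lazily loaded object costs anything at all. It uses
delayedAssign() as a stand-in for a lazy-load promise. The names env, state
and big are made up for this example, and this is not the actual
.rk.get.structure() code:

```r
# Stand-in for a lazily loaded package object (all names are hypothetical).
env <- new.env()
state <- new.env()
state$loaded <- FALSE

# The value is only computed (i.e. "loaded") when it is first accessed.
delayedAssign("big", {
  state$loaded <- TRUE
  seq_len(1e6)
}, assign.env = env)

# Listing the names does not force the promise ...
names_only <- ls(env)
loaded_after_ls <- state$loaded

# ... but asking anything about the object itself does.
big_class <- class(get("big", envir = env))
loaded_after_class <- state$loaded
```

After the ls() call the flag is still FALSE; it only flips once class() has
to look at the value.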

4) Dealing with insane(*) packages:
Packages like GO can now be blacklisted, so .rk.get.structure() will never run 
on them. This is just a hackish solution, but I don't think trying to find
a "real" solution is anywhere near worth the effort. In fact, a real solution
may not be possible at all:
- Caching the structure as suggested by Deepayan (which may be a good idea for 
something to add later) would still mean an insane amount of memory and 
parsing in this particular case.
- Fetching information only when needed may be good enough in many use cases. 
However, this is merely delaying the problem.
- In the case of an object browser, there may not even be a delay at all. Even
just finding out that one of the objects in package GO is an environment
requires loading it (even if you are not interested in any of the child
objects). So, if you intend to show any information whatsoever, the objects
need to be loaded. I think having a decent object browser is a higher
priority than properly dealing with "insane" packages.
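
The blacklisting idea can be sketched in a few lines of R. This is only an
illustration under assumptions: the variable blacklist and the helper
structure_names() are hypothetical names, and the real check in RKWard may
look quite different:

```r
# Hypothetical blacklist; only GO is named in this mail.
blacklist <- c("GO")

# Return the names of the objects in an attached package, or NULL for a
# blacklisted package, whose objects are then never touched.
# structure_names() is an illustrative helper, not an RKWard function.
structure_names <- function(pkg, blacklisted = blacklist) {
  if (pkg %in% blacklisted) {
    return(NULL)
  }
  ls(as.environment(paste0("package:", pkg)))
}

go_names <- structure_names("GO")      # skipped without loading anything
base_names <- structure_names("base")  # an ordinary package is listed
```

The check happens before anything in the package is looked at, so a
blacklisted package costs neither CPU cycles nor memory.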

Well, so much for this. I had hoped I could do better, but for now I don't
think any further effort on this issue would be well invested.

Regards
Thomas

(*) Why do I call package GO "insane"?
<rant>
Well, I can't even load the largest of the contained environments *in a
plain session of R* on my 512MB system. It starts swapping terribly,
making the entire system unresponsive, and I've never waited for it to
actually complete. Sure, "loading" the library goes in a snap. Actually using
it is impossible here. I guess if you really want to provide a large
database, consider actually using a database. There are solutions for that in
R. Also, consider using a storage format that can be handled efficiently in
R. I.e., if you're storing large tables, consider using a data.frame instead
of thousands of objects with attributes (which are themselves full-fledged R
objects, again).
</rant>
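
A rough sketch of the storage point, with invented data (the columns id and
label are placeholders and have nothing to do with the actual contents of
GO). It stores the same information once as a single data.frame and once as
thousands of small objects with an attribute each, then compares what
object.size() reports:

```r
n <- 5000

# One table: two columns, n rows.
tab <- data.frame(id = seq_len(n),
                  label = paste0("term", seq_len(n)),
                  stringsAsFactors = FALSE)

# The same information as n separate objects, each carrying an attribute.
objs <- lapply(seq_len(n), function(i) {
  structure(i, label = paste0("term", i))
})

size_tab <- as.numeric(object.size(tab))
size_objs <- as.numeric(object.size(objs))
many_objects_larger <- size_objs > size_tab
```

Each small object pays for its own header and its own attribute list, which
is why the list of objects comes out larger than the table.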