[rkward-devel] a "misfeature"
Thomas Friedrichsmeier
thomas.friedrichsmeier at ruhr-uni-bochum.de
Wed Apr 18 19:39:36 UTC 2007
Update on .rk.get.structure():
I have finally finished the .rk.get.structure() rework. To activate the
new .rk.get.structure(), you will have to do a make install. Overall, I'm not
overly happy with the new solution; read on for details:
1) Correctness:
I hope the new .rk.get.structure() is at least as correct as the previous
function. Unfortunately, it turned out that doing this sort of thing is
quite complex in C, as R objects can disguise themselves as a completely
different type to a considerable degree. If only the C side of R were
object-oriented...
2) Speed:
As a consequence, the speedup is not quite as large as I thought it could be.
For typical packages, it is roughly a factor of 2 (if there are lots of
functions) to 4 (if there are lots of data objects). For the GO package it's
closer to a factor of 10, but then this package is not typical at all.
Note, however, that these factors are measured after the data is already
loaded. The effective speedup is more likely to be around 50% at most (again,
in typical cases). Add to that all the other noise, and the speedup is not
really noticeable (esp. during startup).
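
To illustrate why a factor of 2 to 4 on the analysis step shrinks to something
much smaller overall, here is a minimal sketch using Amdahl's law. The share
of total time spent in the analysis step (0.5 below) is purely an assumption
for illustration, not a measurement, and overall_speedup() is a made-up helper
name:

```r
# Overall speedup when only part of the total time gets faster (Amdahl's law).
# fraction: share of total time spent in the accelerated step (ASSUMED)
# factor:   speedup of that step alone (2 to 4 in the typical cases above)
overall_speedup <- function(fraction, factor) {
  1 / ((1 - fraction) + fraction / factor)
}

# ASSUMPTION: half of the time is structure analysis, the other half is
# loading the data, which the rework does not make any faster.
overall_speedup(0.5, 2)   # about 1.33
overall_speedup(0.5, 4)   # 1.6
```

Under that assumption, even the factor 4 case only gives an overall speedup
of 1.6, which is in the same ballpark as the "around 50%" estimate.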
3) Lazy Loading:
The data is still loaded for this purpose, but now the loaded objects are no
longer kept permanently. This means the CPU cycles are spent for good, but at
least the memory is not wasted. This surely is not optimal, but most likely
better than before.
4) Dealing with insane(*) packages:
Packages like GO can now be blacklisted, so .rk.get.structure() will never run
on them. This is just a hackish solution, but I don't think trying to find
a "real" solution is anywhere near worth the effort. In fact, a real solution
may not be possible at all:
- Caching the structure as suggested by Deepayan (which may be a good idea for
something to add later) would still mean an insane amount of memory and
parsing in this particular case.
- Fetching information only when needed may be good enough in many use cases.
However, this is merely delaying the problem.
- In the case of an object browser, there may not even be a delay at all. Even
just finding out that one of the objects in package GO is an environment
requires loading it (even if you are not interested in the child objects
at all). So, if you intend to show any information whatsoever, the objects
need to be loaded. I think having a decent object browser is a higher
priority than properly dealing with "insane" packages.
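
For illustration only, the blacklisting idea from 4) can be sketched in plain
R roughly as follows. This is not the actual RKWard code: rk_blacklist,
structure_or_skip() and list_objects() are hypothetical names, and the
"inspection" is reduced to listing object names:

```r
# Hypothetical blacklist of packages that must never be inspected.
rk_blacklist <- c("GO")

# Run the (potentially expensive) inspection only for packages that are not
# blacklisted. The blacklist check happens first, so a blacklisted package
# is never touched at all.
structure_or_skip <- function(pkg, inspect, blacklist = rk_blacklist) {
  if (pkg %in% blacklist) return(NULL)
  inspect(pkg)
}

# Stand-in for the real structure analysis: just list the object names.
list_objects <- function(pkg) ls(asNamespace(pkg))

structure_or_skip("GO", list_objects)           # NULL, nothing gets loaded
head(structure_or_skip("stats", list_objects))  # inspected as usual
```

The point of checking the name before doing anything else is that the
package does not even need to be installed, let alone loaded, to be skipped.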
Well, so much for this. I had hoped I could do better, but for now I don't
think any further effort would be well invested in this issue.
Regards
Thomas
(*) Why do I call package GO "insane"?
<rant>
Well, I can't even load the largest of the contained environments *in a
plain session of R* on my 512MB system. It starts swapping terribly,
making the entire system unresponsive, and I've never waited for it to
actually complete. Sure, "loading" the library goes in a snap. Actually using
it is impossible here. I guess if you really want to provide a large
database, consider actually using a database. There are solutions for that in
R. Also, consider using a storage format that can be handled efficiently in
R, i.e. if you're storing large tables, consider using a data.frame instead
of thousands of objects with attributes (which are themselves full-fledged R
objects, again).
</rant>
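
To make the storage-format point concrete, here is a small sketch comparing
the two layouts. The data is entirely made up (the ids, terms and the
"evidence" attribute are placeholders and do not reproduce how GO really
stores things), and n is kept small so that it runs anywhere:

```r
n <- 1000  # hypothetical number of records

# Layout A: one R object per record, extra fields stored as attributes.
per_object <- new.env()
for (i in seq_len(n)) {
  value <- paste0("term", i)
  attr(value, "evidence") <- "IEA"   # made-up attribute
  assign(paste0("id", i), value, envir = per_object)
}

# Layout B: the same records as rows of a single data.frame.
as_table <- data.frame(
  id = paste0("id", seq_len(n)),
  term = paste0("term", seq_len(n)),
  evidence = "IEA",
  stringsAsFactors = FALSE
)

# Rough memory footprint of both layouts, in bytes.
size_objects <- sum(vapply(
  mget(ls(per_object), envir = per_object),
  function(x) as.numeric(object.size(x)),
  numeric(1)
))
size_table <- as.numeric(object.size(as_table))
c(objects = size_objects, table = size_table)
```

Besides the footprint, anything that walks the objects (such as an object
browser) has to visit n objects plus their attributes in layout A, but only
a single object with three columns in layout B.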