[rkward-devel] a "misfeature"

Thu Apr 5 15:17:52 UTC 2007

Hi again,

On Thursday 05 April 2007 00:35, Prasenjit Kapat wrote:
> A friend of mine (Deepayan: Lattice author) comments the following:
>
> [Quote]
> Basically, as far as I can tell, whenever a new package is loaded
> .rk.get.structure is run on all objects in the package (or at least in
> the namespace). This means that all these objects are evaluated,
> including all lazy-loaded symbols, which defeats the whole point of
> lazy loading. This is not much of an issue for small packages, but try
>
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("GO")
> library(GO)
>
> [/Quote]

After a bit more investigation, the matter turns out to be yet more complex:

1) It is possible to determine whether a symbol is really a promise at least 
from C.
1b) Unfortunately, however, for example in the base package, almost 
*everything* is a promise. That is, not just large datasets, but also the 
majority of functions.
1c) I don't think there is any way, currently, to tell apart promises for 
functions and promises for data. Or of course - as would be optimal - to tell 
apart promises for "small" objects from promises for "large" ones. Once we 
try to get *any* information about the object, the promise is evaluated, i.e. 
the object is loaded. So we're back to square one on this front.
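
To make point 1c concrete, here is a minimal sketch in plain R (not RKWard code). It uses delayedAssign() to create a promise in a stand-in environment. The names pkg, state and big_dataset are made up for illustration. Listing the names leaves the promise alone, but asking for the class of the object forces it:

```r
# Stand-in for a package environment with one lazy-loaded symbol.
# All names here are illustrative, not taken from RKWard.
state <- new.env()
state$forced <- FALSE
pkg <- new.env()

# The promise records (as a side effect) when it gets evaluated.
delayedAssign("big_dataset",
              { state$forced <- TRUE; seq_len(1e6) },
              assign.env = pkg)

# Listing names does not evaluate the promise.
names_only <- ls(pkg)
forced_after_ls <- state$forced

# Asking anything about the value (here: its class) evaluates it.
kind <- class(get("big_dataset", envir = pkg))
forced_after_class <- state$forced
```

After the call to ls() the flag is still FALSE. It flips to TRUE as soon as class() needs the value.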

2) In the example of the GO package, the problem is multiplied by the fact 
that there are literally hundreds of thousands of (small) objects. As far as 
I can see, loading all the data - while somewhat crazy - is not the main 
slowdown. Lazy loading is pretty fast, and mainly uses memory, not CPU 
cycles. Rather, the problem is evaluating .rk.get.structure() on every single 
one of those.
2b) .rk.get.structure() could probably be sped up considerably by implementing 
it in C, instead of R. Likely this could save considerable amounts of 
(temporary) memory as well, but this claim is entirely untested.
2c) Whatever the optimization, as the end result, rkward will build an 
internal representation of the "structure" of each of the objects (i.e. name, 
type of data, child objects, etc.). This results in a small memory overhead 
per object. However, in the case of thousands of small objects, the overhead 
may be noticeable.
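
As a rough illustration of the kind of per-object record meant in 2c, the following sketch collects name, type, class and child names for every object in an environment. get_structure_sketch() is a hypothetical stand-in, not the real .rk.get.structure(), which collects more and is organized differently:

```r
# Hypothetical, simplified stand-in for .rk.get.structure().
get_structure_sketch <- function(name, envir) {
  obj <- get(name, envir = envir)  # note: this forces any promise
  children <- if (is.list(obj)) names(obj) else NULL
  list(name = name,
       type = typeof(obj),
       class = class(obj),
       children = children)
}

# Stand-in for a package environment with two small objects.
pkg <- new.env()
assign("a", 1:3, envir = pkg)
assign("b", list(x = 1, y = "two"), envir = pkg)

# One R-level function call and one small record per object.
structures <- lapply(ls(pkg), get_structure_sketch, envir = pkg)
names(structures) <- ls(pkg)
```

With hundreds of thousands of objects, both the R-level calls and the records they produce are paid once per object.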

So what to do? Getting at least basic structure information about all objects 
is needed for the object browser to be useful. We also use this info for 
object name completion and function argument hinting. (I see that package 
rcompgen provides similar functionality, but looks up potential completions 
dynamically. While in theory such an approach could be used in RKWard as 
well, it would not be easy to fit into the threaded approach we use, which 
allows editing a script with object name completion while other calculations 
are running simultaneously.) In the future it might additionally be used 
to aid in syntax highlighting. So I think overall it's not something we can 
just rip out.

Any way to alleviate the problem? First is to implement .rk.get.structure() in 
C. I'll try to see what I can do here for 0.4.8. Second might be a 
heuristic to determine when it's best not to attempt to fetch the structure. 
Unfortunately, I have no good idea on this. In the case of the GO library, 
simply excluding recursion into environments would make most of the problem 
go away, but this probably does not generalize well. Third might be a 
way to let the user control whether and for which libraries structure 
information is fetched. But I guess this would be a power-user option that 
is not easily discoverable.
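
A minimal sketch of what the second and third options could look like at the R level. Both helpers and their arguments are hypothetical and only meant to show the shape of the idea, not an actual RKWard interface:

```r
# Hypothetical: user-controlled list of packages to skip entirely.
should_fetch_structure <- function(pkg_name, skip_packages = character(0)) {
  !(pkg_name %in% skip_packages)
}

# Hypothetical: list child names, but do not descend into
# environments unless explicitly asked to.
list_children <- function(obj, recurse_into_env = FALSE) {
  if (is.environment(obj)) {
    if (recurse_into_env) ls(obj) else character(0)
  } else if (is.list(obj)) {
    nm <- names(obj)
    if (is.null(nm)) character(0) else nm
  } else {
    character(0)
  }
}

# Example: an environment holding one object.
e <- new.env()
assign("k", 1, envir = e)
children_default <- list_children(e)        # environment is not entered
children_forced  <- list_children(e, TRUE)  # environment is entered
```

Whether skipping environments is a sensible default outside of cases like GO remains the open question.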

Well, any further insight is appreciated.

Regards
Thomas