[rkward-devel] a "misfeature"

Thu Apr 5 18:41:49 UTC 2007

On 4/5/07, Thomas Friedrichsmeier
<thomas.friedrichsmeier at ruhr-uni-bochum.de> wrote:
> Hi again,
>
> On Thursday 05 April 2007 00:35, Prasenjit Kapat wrote:
> > A friend of mine (Deepayan: Lattice author) comments the following:
> >
> > [Quote]
> > Basically, as far as I can tell, whenever a new package is loaded
> > .rk.get.structure is run on all objects in the package (or at least in
> > the namespace). This means that all these objects are evaluated,
> > including all lazy-loaded symbols, which defeats the whole point of
> > lazy loading. This is not much of an issue for small packages, but try
> >
> > source("http://www.bioconductor.org/biocLite.R")
> > biocLite("GO")
> > library(GO)
> >
> > [/Quote]
>
> after a bit more investigation, the matter turns out to be yet more complex:
>
> 1) It is possible to determine whether a symbol is really a promise at least
> from C.
> 1b) Unfortunately, however, for example in the base package, almost
> *everything* is a promise. That is, not just large datasets, but also the
> majority of functions.
> 1c) I don't think there is any way, currently, to tell apart promises for
> functions and promises for data. Or of course - as would be optimal - to tell
> apart promises for "small" objects from promises for "large" ones. Once we
> try to get *any* information about the object, the promise is evaluated, i.e.
> the object is loaded. So we're back to square one on this front.
>
> 2) In the example of the GO package, the problem is multiplied by the fact
> that there are literally hundreds of thousands of (small) objects. As far as
> I can see, loading all the data - while somewhat crazy - is not the main
> slowdown. Lazy loading is pretty fast, and mainly uses memory, not CPU
> cycles. Rather the problem is evaluating .rk.get.structure() on each single
> one of those.

No, GO has a few large objects. And .rk.get.structure is not likely to
be the main problem. I get (in plain R run from a shell):

> library(GO)
> length(ls("package:GO"))
[1] 24
> system.time(sapply(ls("package:GO"), exists))
   user  system elapsed
 44.311   1.848  48.466
> system.time(sapply(ls("package:GO"), exists))
   user  system elapsed
  0.004   0.000   0.003

The second call is much faster, since everything is already loaded by
then. I'm pretty sure descending into environments inside
.rk.get.structure has negligible overhead compared to the initial load
times.
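
The same effect can be reproduced without GO. The following is only a
toy sketch: the promise is created by hand with delayedAssign(), the
0.2 second sleep stands in for the real load cost, and get() is used to
force the promise. The names env, counter and big are made up for the
illustration.

```r
env <- new.env()
counter <- new.env()
counter$n <- 0L

# delayedAssign() binds "big" to a promise; the expression is not run
# until the value is first needed.
delayedAssign("big", {
  counter$n <- counter$n + 1L   # count evaluations of the expression
  Sys.sleep(0.2)                # simulated cost of loading the object
  seq_len(1e5)
}, assign.env = env)

before <- counter$n             # nothing has been evaluated yet

# The first get() forces the promise and pays the full cost.
first <- system.time(get("big", envir = env))[["elapsed"]]

# The second get() finds the cached value and returns immediately.
second <- system.time(get("big", envir = env))[["elapsed"]]

after <- counter$n              # the expression ran exactly once
```

Here first includes the simulated load time while second does not,
which is the same pattern as in the two system.time() calls above.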

> 2b) .rk.get.structure() could probably be sped up considerably by implementing
> it in C, instead of R. Likely this could save considerable amounts of
> (temporary) memory as well, but this claim is entirely untested.

That's not the place I'd start.

> 2c) Whatever the optimization, as the end result, rkward will build an
> internal representation of the "structure" of each of the objects (i.e. name,
> type of data, child objects, etc.). This results in a small memory overhead
> per object. However, in the case of thousands of small objects, the overhead
> may be noticeable.
>
> So what to do? Getting at least basic structure information about all objects
> is needed for the object browser to be useful. Also, we use this info for
> object name completion and function argument hinting (I see that package
> rcompgen provides similar functionality, but looks up potential completions
> dynamically. While in theory such an approach could be used in RKWard as
> well, it would not be easy to fit it into the threaded approach we use (which
> allows editing a script with object name completion while other
> calculations are running simultaneously)). In the future it might additionally be used
> to aid in syntax highlighting. So I think overall it's not something we can
> just rip out.

I think the solution is to build up a database beforehand. Objects in
.GlobalEnv are usually not a problem (they are already loaded) and the
current approach should be fine. Package namespaces are typically
sealed, and it should be hard for the user to modify things inside (if
they do, they deserve whatever they get). So, given a specific version
of a package, one should only need to compute the relevant information
once. I think this is a general enough problem that a common solution
that other front-ends (e.g. rcompgen) can use would be helpful. This
will need some consensus on what information such a database should
contain. Clearly, .rk.get.structure (which I'm not familiar with) can
be the basis for a starting point.
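
A minimal sketch of the "compute once per package version" idea,
assuming an in-memory cache keyed by package name and version. The
names structure_cache, cached_structure and scan_exports are
hypothetical, and scan_exports is only a cheap stand-in for whatever
.rk.get.structure really collects.

```r
# Hypothetical cache: one entry per "<package>_<version>" key.
structure_cache <- new.env()

cached_structure <- function(pkg, compute) {
  key <- paste(pkg, as.character(utils::packageVersion(pkg)), sep = "_")
  if (!exists(key, envir = structure_cache, inherits = FALSE)) {
    # Only reached the first time this package version is seen.
    assign(key, compute(pkg), envir = structure_cache)
  }
  get(key, envir = structure_cache, inherits = FALSE)
}

# Stand-in for the expensive scan: just list the exported names.
calls <- 0L
scan_exports <- function(pkg) {
  calls <<- calls + 1L          # count how often the scan really runs
  sort(getNamespaceExports(pkg))
}

first  <- cached_structure("stats", scan_exports)
second <- cached_structure("stats", scan_exports)
```

The second call returns the stored result without running the scan
again. A new package version produces a new key, so stale entries are
never reused.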

Ideally, this database should be computed by R CMD build (like the
INDEX file) and distributed as part of the package. This is not going
to happen anytime soon, but one good way to move forward in that
direction would be to write a separate package (not tied to rkward)
that would create such a file (I would recommend a plain text format
that read.table can read, rather than fancy XML-type things) given a
package. The codetools package may be helpful here (or not, I don't
really know).
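
As a rough illustration of what such a file could look like, here is a
sketch that writes one line per exported object to a tab-separated
file and reads it back with read.table. The function name build_index
and the choice of columns (name, class, type, length, args) are
assumptions made for the example, not an agreed format. Note that
building the index still loads every object once; the point is that
this cost is paid only when the file is created.

```r
# Hypothetical index builder: one row per exported object of a package.
build_index <- function(pkg) {
  nms <- sort(getNamespaceExports(pkg))
  rows <- lapply(nms, function(nm) {
    obj <- getExportedValue(pkg, nm)   # this loads the object
    data.frame(
      name   = nm,
      class  = class(obj)[1L],
      type   = typeof(obj),
      length = length(obj),
      # argument names for functions, empty string otherwise
      args   = if (is.function(obj)) {
        paste(names(formals(obj)), collapse = ",")
      } else {
        ""
      },
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

index <- build_index("stats")

# Put the package version into the file name, so that a new version
# gets a new file.
path <- file.path(
  tempdir(),
  paste0("stats_", as.character(utils::packageVersion("stats")), "_index.tsv")
)
write.table(index, path, sep = "\t", quote = TRUE, row.names = FALSE)

# A front-end would only need this step at run time.
restored <- read.table(
  path, header = TRUE, sep = "\t", quote = "\"", comment.char = "",
  colClasses = c("character", "character", "character",
                 "integer", "character")
)
```

Reading the file back does not touch the package at all, so nothing is
loaded until the user actually asks for an object.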

Once this is done, there has to be a decision on when to compute and
where to store that information. That's a topic for later.

> Any way to alleviate the problem? First is to implement .rk.get.structure() in
> C. I'll try to see what I can do for 0.4.8, here. Second might be a
> heuristic to determine when it's best not to attempt to fetch the structure.
> Unfortunately, I have no good idea on this. In the case of the GO library,
> simply excluding recursion into environments would make most of the problem
> go away, but this probably does not generalize well. Third might be a
> way to let the user control whether and for which libraries structure
> information is fetched. But I guess this would be a power-user option that
> is not easily discoverable.
>
> Well, any further insight is appreciated.
>
> Regards
> Thomas

-Deepayan