[Kstars-devel] Replacing file-system by database in KStars

Henry de Valence hdevalence at hdevalence.ca
Wed Jan 29 04:14:36 UTC 2014


Hi Vijay,

On January 28, 2014 08:22:08 AM Vijay Dhameliya wrote:
> Hi guys,
> 
> Currently when KStars is launched, it reads data corresponding to different
> SkyObjects from the respective files in loaddata() methods. And I have
> tracked down all the classes where we are loading data by reading files.

Indeed, the code KStars uses to load data from the disk is messy and (IMO) not 
as efficient as it could be.

> I researched a bit on the topic and I found that loading data from a
> database is always a much better option than doing the same from a file.

The database’s data is stored in a file on disk, so loading data from the 
database is still loading from a file. It might be faster, if KStars’ pattern 
of data-loading is served well by the database we use and we can rely on the 
database’s optimized code instead of writing our own.

The problem is that most databases are not actually suited to the kind of data 
we have or our usage patterns. The data we deal with is primarily spatial: we 
have points on the sphere, with extra metadata to tell us about the properties 
of the objects. Currently, KStars has a somewhat complicated system for 
spatially indexing the data with a hierarchical triangular mesh (HTM), and loading 
the data from files as needed.

In order to replace this with an SQL-based system, we’d need to use a database 
that has support for spatial queries. To the best of my knowledge, SQLite does 
not have such support. It would probably be possible to do something with 
PostgreSQL’s PostGIS extension for dealing with geographic data, but KStars 
should not require the user to run and maintain a standalone database server, 
so SQLite is the only SQL option (and we do use it for some data).

> If we replace the file system with QSql, the following are the pros:
> 
> 1) We will not have to ship so many files with KStars

File count is less important than file size; if we’re shipping the same data, 
it’s unclear that we would see a big reduction in size. Also, packing 
everything into one opaque database file makes it harder to keep track of the 
data we have.

> 2) Loading from a database is quicker than doing the same from a file

(See discussion above)

> 3) Code for load methods will be reduced in size

Yes, this would be really nice, but I think that there may be other avenues to 
do this.

> Cons:
> 1) I will have to move all data from files into database by temporary
> methods

I’m not quite sure what you mean. We already have to do this for the data we 
have: there’s a collection of (as I recall quite hacky) scripts for the 
purpose of building the catalog files we use now. If we change our data 
representation, we have to change these, too.

There’s also:

2.   We lose spatial indexing, meaning that we may need to load an entire 2GB 
catalog for one small region of the sky.

3.   The only SQL database we can use is SQLite, which is designed to be 
small, not high-performance.

> So I am planning to start coding to replace the file system with a database
> on my local branch.
> 
> Can you please give your views and suggestions regarding the same? I am
> sure that it will be very helpful to me. :)

I agree that we should rethink the data-handling in KStars, but I think that 
it would be best to take a few steps back first, to see the bigger picture.

The first task, in my opinion, is to clearly set out *what data we have*. For 
instance, it would be good to have scripts that will completely automatically 
fetch the raw datasets we use, and process them into our catalog format so 
that we have the entire process of creating the files written 
programmatically. Even though we don’t need to regenerate the catalogs very 
often, the benefit of this is that exactly what we do to the source data is 
documented in working, runnable, unit-tested code. Some datasets (AFAIR) were 
assembled by us or by the Stellarium people, in which case those files should 
be treated as the ‘raw data’.

The question of how we should store our data is something I’ve been giving 
some thought to recently, but as I’ve been busy with school I haven’t had time 
to implement a prototype yet. Since it’s come up, though, I might as well 
share what I was thinking.

It’s possible to run all of our astrocalculations at much higher speed (using, 
e.g., code from my GSOC project), but actually doing this in practice is hard, 
since it requires reworking the data handling of the sky components.

Currently, each component manages its own data handling, indexing, etc., 
usually using the HTM library to compute spatial queries. Different components 
handle things differently -- for instance, the deep star component does lazy-
loading of stars in blocks to avoid having to load huge catalogs all at once.
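
Schematically, the pattern each component ends up reimplementing looks 
something like this sketch (the names are illustrative, not the actual 
KStars classes):

    #include <QHash>
    #include <QtGlobal>

    typedef quint32 Trixel;   // ID of one triangle in the mesh

    struct StarBlock;         // packed star records for one trixel

    // Illustrative sketch: an index from trixel ID to file offset, with
    // blocks loaded lazily on first use.
    class LazyCatalog
    {
    public:
        StarBlock *blockFor(Trixel t)
        {
            StarBlock *&block = m_resident[t];
            if (!block)                        // first touch: load it
                block = loadBlock(m_offsets.value(t));
            return block;
        }

    private:
        StarBlock *loadBlock(qint64 offset)
        {
            // Real code would seek to 'offset' in the catalog file and
            // deserialize one block of packed records; elided here.
            Q_UNUSED(offset);
            return 0;
        }

        QHash<Trixel, qint64> m_offsets;       // built from a file header
        QHash<Trixel, StarBlock *> m_resident; // cache whose eviction and
                                               // thread-safety are ours
                                               // to get right, per class
    };

Each component carries its own variant of this bookkeeping, which is a big 
part of why the loading code is as large as it is.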

One nice thing about most of our data is that it generally doesn’t change, so 
our problem should be well-suited to an immutable data structure which gives 
us thread-safety and bug-avoidance for free. In addition, I think we should 
explore using facilities of the operating system to do the work for us. For 
instance, we could try using mmap (in the form of QFile::map() for portability) 
to map the contents of a binary catalog file directly into the virtual address 
space. The OS then loads data in pages as needed (and unloads the pages 
according to, AFAIR, least-recently-used *when needed* [^1]). If we arrange 
the data in our catalog file(s) to have spatial locality (i.e., data near each 
other in the file are nearby points in the sky), then we can have the kernel 
do the work of resource management / loading-unloading for us, greatly 
simplifying our code.
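
A minimal sketch of the idea, assuming an invented flat-array catalog format 
(a real file would also need a header, a version field, and the endianness 
check mentioned in the footnote):

    #include <QFile>

    #pragma pack(push, 1)
    struct StarRecord {
        double ra, dec;   // J2000 position
        float  mag;       // visual magnitude
    };
    #pragma pack(pop)

    // Count the stars brighter than magLimit among records
    // [first, first + count) of a memory-mapped catalog. If records that
    // are nearby on the sky are also nearby in the file, this touches
    // only the few pages holding that region; the kernel pages them in
    // on first access and evicts them under memory pressure.
    int countBrighterThan(const QString &path, qint64 first, qint64 count,
                          float magLimit)
    {
        QFile catalog(path);
        if (!catalog.open(QIODevice::ReadOnly))
            return -1;
        uchar *data = catalog.map(0, catalog.size());
        if (!data)
            return -1;
        const StarRecord *stars =
            reinterpret_cast<const StarRecord *>(data);
        const qint64 total = catalog.size() / qint64(sizeof(StarRecord));
        const qint64 end = qMin(first + count, total);
        int brighter = 0;
        for (qint64 i = qMax(first, qint64(0)); i < end; ++i)
            if (stars[i].mag < magLimit)
                ++brighter;
        return brighter;
    }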

Another issue we have is with proper motion. Technically, most of the points 
on the sphere that we have aren’t points at all, but are actually “dual 
points” that have the data both of a point and the first-order differential 
near the point (i.e., the proper motion), which we have to take into account 
when we do queries in the far future. In effect, we have for each point a 
differential equation with initial conditions (the J2000 positions) and the 
equation of motion given by the proper motion, and we want to be able to do 
queries like:

“What are all the points within angle alpha of this direction at time t?”

The HTM library we use is not equipped to answer this question -- it only 
deals with points that don’t move. So what we do now is go through and trash 
our index, reindexing all the points as the simulation runs. But then there 
are all kinds of problems: how fine should the reindexing interval be, what 
to do about stars in multiple trixels, and so on. It’s a real mess.
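
Concretely, a ‘dual point’ is something like the following sketch (the 
representation and units are invented; a real design would need more care):

    #include <cmath>

    // Sketch of a "dual point": the J2000 position is the initial
    // condition, the proper motion is the first-order equation of
    // motion, and the solution is position(t) = position(2000) + mu*dt.
    struct DualPoint {
        double ra0, dec0;     // J2000 position, degrees
        double pmRa, pmDec;   // proper motion, degrees/year (pmRa taken
                              // to include the cos(dec) factor already)

        void positionAt(double epoch, double &ra, double &dec) const
        {
            const double dt = epoch - 2000.0;
            dec = dec0 + pmDec * dt;
            ra  = ra0 + pmRa * dt / std::cos(dec0 * M_PI / 180.0);
        }
    };

The index we want would answer the cone query against positionAt(t) for 
every point, without us having to advance and reinsert each point by hand.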

This got kind of long since it’s sort of a brain dump, but hopefully it will 
stir some discussion.

Cheers,
Henry

P.S. I’m really sorry I haven’t been able to put as much time into KStars as 
I’d like recently.

[^1]: I don’t know about how Windows decides to unload mmap’d files; I assume 
it’s not totally insane, but I guess I don’t really care too much about how it 
performs as long as it runs. The more important portability issue, I think, is 
dealing with endianness issues, but I don’t think that this is a huge problem. 
Worst case, stick a BOM in the beginning of the file and if the endianness is 
wrong, swizzle all the bytes and write the new catalog. Or tell packagers to 
ship compatible files, or something.
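
For what it’s worth, the ‘BOM’ could be as simple as a fixed magic number at 
the start of the file (constants invented for illustration):

    #include <QFile>

    static const quint32 CatalogMagic        = 0x4B535452; // arbitrary
    static const quint32 CatalogMagicSwapped = 0x5254534B; // byte-swapped

    // Returns true if the catalog was written with the opposite byte
    // order to ours, in which case we'd swizzle every field once (e.g.
    // with qbswap() from <QtEndian>) and write the corrected file out.
    bool catalogNeedsSwap(QFile &catalog)
    {
        quint32 magic = 0;
        if (catalog.read(reinterpret_cast<char *>(&magic),
                         sizeof(magic)) != qint64(sizeof(magic)))
            return false;   // too short to be a valid catalog
        return magic == CatalogMagicSwapped;
    }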

