[Kstars-devel] Replacing file-system by database in KStars
Henry de Valence
hdevalence at hdevalence.ca
Wed Jan 29 04:14:36 UTC 2014
Hi Vijay,
On January 28, 2014 08:22:08 AM Vijay Dhameliya wrote:
> Hi guys,
>
> Currently, when KStars is launched, it reads the data corresponding to
> different SkyObjects from their respective files in loadData() methods, and
> I have tracked down all the classes where we are loading data by reading files.
Indeed, the code KStars uses to load data from the disk is messy and (IMO) not
as efficient as it could be.
> I researched the topic a bit and found that loading data from a database is
> always a much better option than doing the same from a file.
The database’s data is stored in a file on disk, so loading data from the
database is loading from a file. It might be faster, if the use case for
KStars’ pattern of data-loading is served well by the database we use, and we
can use the optimized code from the database instead of writing our own.
The problem is that most databases are not actually suited to the kind of data
we have or our usage patterns. The data we deal with is primarily spatial: we
have points on the sphere, with extra metadata to tell us about the properties
of the objects. Currently, KStars has a somewhat complicated system for
spatially indexing the data with a hierarchical triangle mesh, and loading
the data from files as needed.
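For concreteness, here is a toy sketch of the indexing idea -- not the actual
HTM library code, which is much more careful about numerics and ID encoding.
The sphere starts as the 8 faces of an octahedron; each triangle splits into 4
children at the (normalized) edge midpoints; a point’s trixel ID records the
path down the tree, so nearby points tend to share long ID prefixes:

    #include <array>
    #include <cmath>
    #include <cstdint>

    struct Vec3 { double x, y, z; }; // assumed to be a unit vector

    static Vec3 cross(const Vec3 &a, const Vec3 &b)
    {
        return { a.y * b.z - a.z * b.y,
                 a.z * b.x - a.x * b.z,
                 a.x * b.y - a.y * b.x };
    }

    static double dot(const Vec3 &a, const Vec3 &b)
    {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    // Normalized midpoint of two unit vectors: the subdivision point.
    static Vec3 midpoint(const Vec3 &a, const Vec3 &b)
    {
        Vec3 m = { a.x + b.x, a.y + b.y, a.z + b.z };
        double n = std::sqrt(dot(m, m));
        return { m.x / n, m.y / n, m.z / n };
    }

    // p is inside the spherical triangle t iff it lies on the inner side of
    // all three edge planes (vertices counterclockwise seen from outside).
    static bool inTriangle(const Vec3 &p, const std::array<Vec3, 3> &t)
    {
        return dot(cross(t[0], t[1]), p) >= 0 &&
               dot(cross(t[1], t[2]), p) >= 0 &&
               dot(cross(t[2], t[0]), p) >= 0;
    }

    // Walk 'levels' subdivisions down from the octahedron and return an ID
    // that encodes the path taken.
    uint64_t trixelOf(const Vec3 &p, int levels)
    {
        static const Vec3 X = { 1, 0, 0 }, Y = { 0, 1, 0 }, Z = { 0, 0, 1 },
                          mX = { -1, 0, 0 }, mY = { 0, -1, 0 }, mZ = { 0, 0, -1 };
        static const std::array<std::array<Vec3, 3>, 8> roots = { {
            { X, Y, Z },  { Y, mX, Z },  { mX, mY, Z },  { mY, X, Z },
            { Y, X, mZ }, { mX, Y, mZ }, { mY, mX, mZ }, { X, mY, mZ },
        } };

        uint64_t id = 0;
        std::array<Vec3, 3> t = roots[0];
        for (int f = 0; f < 8; ++f)
            if (inTriangle(p, roots[f])) { id = f; t = roots[f]; break; }

        for (int level = 0; level < levels; ++level) {
            Vec3 w0 = midpoint(t[1], t[2]);
            Vec3 w1 = midpoint(t[0], t[2]);
            Vec3 w2 = midpoint(t[0], t[1]);
            const std::array<std::array<Vec3, 3>, 4> kids = { {
                { t[0], w2, w1 }, { t[1], w0, w2 },
                { t[2], w1, w0 }, { w0, w1, w2 },
            } };
            for (int k = 0; k < 4; ++k)
                if (inTriangle(p, kids[k])) { id = id * 4 + k; t = kids[k]; break; }
        }
        return id;
    }

A cone query ("everything within some angle of this direction") then reduces
to finding the trixels that intersect the cone and looking only at the objects
filed under those IDs.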
In order to replace this with an SQL-based system, we’d need to use a database
that has support for spatial queries. To the best of my knowledge, SQLite does
not have such support. It would probably be possible to do something with
PostgreSQL’s PostGIS extension for dealing with geographic data, but KStars
should not require the user to run and maintain a standalone database server,
so SQLite is the only SQL option (and we do use it for some data).
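To make the limitation concrete: with plain SQLite through QtSql, the best
available filter is a rectangular cut on raw coordinates, roughly as below
(the objects table and file name are made up for the example). That is not a
real spherical query -- no RA wrap-around at 0h, no sane behaviour near the
poles, and no hierarchy to prune the search with:

    #include <QCoreApplication>
    #include <QSqlDatabase>
    #include <QSqlQuery>
    #include <QVariant>
    #include <QtDebug>

    // Hypothetical schema: objects(name TEXT, ra REAL, dec REAL, mag REAL).
    int main(int argc, char **argv)
    {
        QCoreApplication app(argc, argv);

        QSqlDatabase db = QSqlDatabase::addDatabase(QStringLiteral("QSQLITE"));
        db.setDatabaseName(QStringLiteral("catalog.db")); // hypothetical file
        if (!db.open())
            return 1;

        // Bounding-box "spatial" query: the closest plain SQL gets to a
        // cone search on the sphere.
        QSqlQuery query(db);
        query.prepare(QStringLiteral(
            "SELECT name, ra, dec FROM objects "
            "WHERE ra BETWEEN :raMin AND :raMax "
            "AND dec BETWEEN :decMin AND :decMax"));
        query.bindValue(QStringLiteral(":raMin"), 5.0);
        query.bindValue(QStringLiteral(":raMax"), 6.0);
        query.bindValue(QStringLiteral(":decMin"), -10.0);
        query.bindValue(QStringLiteral(":decMax"), 10.0);
        query.exec();

        while (query.next())
            qDebug() << query.value(0).toString()
                     << query.value(1).toDouble()
                     << query.value(2).toDouble();
        return 0;
    }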
> If we replace the file system with QSql, the following are the pros:
>
> 1) We will not have to ship so many files with KStars
File count is less important than file size; if we’re shipping the same data,
it’s unclear that we would see a big reduction in size. Also, packing
everything into one database makes it harder to keep track of the data we have.
> 2) Loading from database is quicker than doing same from file
(See discussion above)
> 3) Code for load methods will be reduced in size
Yes, this would be really nice, but I think that there may be other avenues to
do this.
> Cons:
> 1) I will have to move all the data from the files into the database using
> temporary methods
I’m not quite sure what you mean. We already have to do this for the data we
have: there’s a collection of (as I recall quite hacky) scripts for the
purpose of building the catalog files we use now. If we change our data
representation, we have to change these, too.
There’s also:
2. We lose spatial indexing, meaning that we may need to load an entire 2GB
catalog for one small region of the sky.
3. The only SQL database we can use is SQLite, which is designed to be
small, not high-performance.
> So I am planning to start coding to replace file system by database on my
> local branch.
>
> Can you please give your views and suggestions regarding this? I am sure
> that it will be very helpful to me. :)
I agree that we should rethink the data-handling in KStars, but I think that
it would be best to take a few steps back first, to see the bigger picture.
The first task, in my opinion, is to clearly set out *what data we have*. For
instance, it would be good to have scripts that will completely automatically
fetch the raw datasets we use, and process them into our catalog format so
that we have the entire process of creating the files written
programmatically. Even though we don’t need to regenerate the catalogs very
often, the benefit is that exactly what we do to the source data is documented
in working, runnable, unit-tested code. Some datasets (AFAIR) were
assembled by us or by the Stellarium people, in which case those files should
be treated as the ‘raw data’.
The question of how we should store our data is something I’ve been giving
some thought to recently, but as I’ve been busy with school I haven’t had time
to implement a prototype yet. Since it’s come up, though, I might as well
share what I was thinking.
It’s possible to run all of our astrocalculations at much higher speed (using,
e.g., code from my GSOC project), but actually doing this in practice is hard,
since it requires reworking the data handling of the sky components.
Currently, each component manages its own data handling, indexing, etc.,
usually using the HTM library to compute spatial queries. Different components
handle things differently -- for instance, the deep star component does lazy-
loading of stars in blocks to avoid having to load huge catalogs all at once.
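The pattern looks roughly like this (names and layout are made up; the real
component also deals with eviction, file headers, and byte order):

    #include <QFile>
    #include <QHash>
    #include <QString>
    #include <QVector>

    struct StarData { double ra, dec; float mag; }; // hypothetical record

    // On-demand block loading in the spirit of the deep star component:
    // fixed-size runs of records, read the first time they are asked for
    // and cached by block index.
    class BlockCache
    {
    public:
        explicit BlockCache(const QString &path) : m_file(path)
        {
            m_file.open(QIODevice::ReadOnly);
        }

        const QVector<StarData> &block(int index)
        {
            auto it = m_cache.find(index);
            if (it != m_cache.end())
                return *it; // already resident
            QVector<StarData> records(kBlockSize);
            m_file.seek(qint64(index) * kBlockSize * sizeof(StarData));
            m_file.read(reinterpret_cast<char *>(records.data()),
                        kBlockSize * sizeof(StarData));
            return *m_cache.insert(index, records);
        }

    private:
        static const int kBlockSize = 1024; // records per block, arbitrary
        QFile m_file;
        QHash<int, QVector<StarData>> m_cache;
    };

Multiply that bookkeeping by the number of components and you get the
duplication I mean.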
One nice thing about most of our data is that it generally doesn’t change, so
our problem should be well-suited to an immutable data structure which gives
us thread-safety and bug-avoidance for free. In addition, I think we should
explore using facilities of the operating system to do the work for us. For
instance, we could try using mmap (in the form of QFile::map() for portability)
to map the contents of a binary catalog file directly into the virtual address
space. The OS then loads data in pages as needed (and unloads the pages
according to, AFAIR, least-recently-used *when needed* [^1]). If we arrange
the data in our catalog file(s) to have spatial locality (i.e., data near each
other in the file are nearby points in the sky), then we can have the kernel
do the work of resource management / loading-unloading for us, greatly
simplifying our code.
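A minimal sketch of that idea, assuming a made-up fixed-record format (a real
catalog would need a header, versioning, and the byte-order handling mentioned
in the footnote):

    #include <QFile>
    #include <QString>
    #include <cstdio>

    #pragma pack(push, 1)
    struct StarRecord {
        double ra;   // J2000 right ascension, radians
        double dec;  // J2000 declination, radians
        float  mag;  // visual magnitude
    };
    #pragma pack(pop)

    // Map the whole catalog into the address space. Nothing is read from
    // disk here; the kernel pages records in on first access and evicts
    // them under memory pressure, replacing our hand-rolled block caches.
    const StarRecord *mapCatalog(QFile &file, qint64 &count)
    {
        if (!file.open(QIODevice::ReadOnly))
            return nullptr;
        uchar *base = file.map(0, file.size()); // QFile::map wraps mmap portably
        if (!base)
            return nullptr;
        count = file.size() / qint64(sizeof(StarRecord));
        return reinterpret_cast<const StarRecord *>(base);
    }

    int main()
    {
        QFile file(QStringLiteral("stars.bin")); // hypothetical catalog file
        qint64 count = 0;
        const StarRecord *stars = mapCatalog(file, count);
        if (!stars)
            return 1;
        // If the generator sorted records by trixel ID, the stars of one
        // small region live in a contiguous run of pages, so a query only
        // touches those pages.
        for (qint64 i = 0; i < count && i < 10; ++i)
            std::printf("ra=%f dec=%f mag=%f\n",
                        stars[i].ra, stars[i].dec, stars[i].mag);
        return 0;
    }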
Another issue we have is with proper motion. Technically, most of the points
on the sphere that we have aren’t points at all, but are actually “dual
points” that carry both a position and the first-order differential at that
position (i.e., the proper motion), which we have to take into account
when we do queries in the far future. In effect, we have for each point a
differential equation with initial conditions (the J2000 positions) and the
equation of motion given by the proper motion, and we want to be able to do
queries like:
“What are all the points within angle alpha of this direction at time t?”
The HTM library we use is not equipped to answer this question -- it only
deals with points that don’t move. So what we do now is go through and trash
our index, reindexing all the points as we run our simulation. Then there are
all kinds of problems: how fine should the reindexing interval be, what to do
about stars that fall in multiple trixels, and so on... it’s a real mess.
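The per-star model itself is trivial -- it’s indexing the moving points that
is hard. For concreteness (units and conventions here are illustrative):

    #include <algorithm>
    #include <cmath>

    // A catalog "dual point": initial conditions (the J2000 position) plus
    // the equation of motion (constant proper motion). Units here are
    // radians and radians per Julian year; pmRa is assumed to already carry
    // the cos(dec) factor, as catalogs commonly do.
    struct DualPoint {
        double ra0, dec0;
        double pmRa, pmDec;
    };

    // First-order position at epoch t (Julian years since J2000). The
    // cos(dec) division breaks down at the poles, like the convention itself.
    void positionAt(const DualPoint &p, double t, double &ra, double &dec)
    {
        dec = p.dec0 + p.pmDec * t;
        ra  = p.ra0  + p.pmRa * t / std::cos(p.dec0);
    }

    // Angular separation of two directions (spherical law of cosines; fine
    // except at very small separations, where a haversine form is safer).
    double angularDistance(double ra1, double dec1, double ra2, double dec2)
    {
        double c = std::sin(dec1) * std::sin(dec2) +
                   std::cos(dec1) * std::cos(dec2) * std::cos(ra1 - ra2);
        return std::acos(std::max(-1.0, std::min(1.0, c)));
    }

    // The query predicate for a single star. The hard part is evaluating
    // this over millions of stars without scanning all of them, which a
    // static index cannot do.
    bool withinAlpha(const DualPoint &p, double t,
                     double ra, double dec, double alpha)
    {
        double pra, pdec;
        positionAt(p, t, pra, pdec);
        return angularDistance(pra, pdec, ra, dec) <= alpha;
    }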
This got kind of long since it’s sort of a brain dump, but hopefully it will
stir some discussion.
Cheers,
Henry
P.S. I’m really sorry I haven’t been able to put as much time into KStars as
I’d like recently.
[^1]: I don’t know how Windows decides to unload mmap’d files; I assume it’s
not totally insane, but I don’t really care too much about how it performs as
long as it runs. The more important portability issue, I think, is dealing
with endianness, but I don’t think this is a huge problem. Worst case, stick a
BOM at the beginning of the file and, if the endianness is wrong, swizzle all
the bytes and write a new catalog. Or tell packagers to ship compatible files,
or something.
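The BOM check might look something like this (the magic constant is made up):

    #include <QtEndian>

    // Write a known magic word at the start of the catalog and see how it
    // reads back on this machine.
    static const quint32 kCatalogMagic = 0x4B535452; // "KSTR"

    enum class ByteOrder { Native, Swapped, Invalid };

    ByteOrder detectByteOrder(quint32 firstWord)
    {
        if (firstWord == kCatalogMagic)
            return ByteOrder::Native;   // read fields directly
        if (qbswap(firstWord) == kCatalogMagic)
            return ByteOrder::Swapped;  // swizzle fields, or rewrite the file once
        return ByteOrder::Invalid;      // not one of our catalogs
    }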