[Kst] Wide Line ASCII Performance

Nicolas Brisset nicolas.brisset at free.fr
Sun Jun 17 20:39:46 UTC 2012

Hi Barth,

I've also been looking at that issue with ASCII files with many vectors a bit more. Since loading the file provided by Daniyal already takes 25 minutes, I did not go the valgrind way but rather instrumented the code with some debug prints.
What I found out correlates well with your findings, and your fix from today does make things faster by a factor 10 approximately also for this case :-)

Now some more details:
1) First, a remark: I slightly changed the files by removing empty lines (I did the tests under Linux, maybe it's only a line ending stuff...) and the characters in the first column (A, B, etc) so that I have exactly the same number of columns in the header and in the data, otherwise I think there is an offset.
2) I tried both the free-formatted version (with only the changes above to the original file) and the fixed-width one, generated using the one-liner awk script: "awk '{for (i=1;i<=NF;++i) printf("%-16s", $i); printf("\n")}' original_file.txt > fixed_width.txt"
3) Before the data gets loaded, there is some administrative overhead with the creation of vectors, curves, etc. I especially checked vector creation as I originally thought they were read at creation time (which seems to have been a wrong assumption). This overhead takes less than a minute on my PC (Core i5, 32 bits, 4 GB RAM) and expectedly shows a slight speed decrease with the increase of the number of vectors, but I would say it's nothing dramatic. See first plot in the attached image (nb of vectors created vs time in ms). We start at 5000 vectors/s and still reach about half that at 35000 vectors, which I think is OK. 
4) It turns out that what takes long is the call to self::doUpdates() at the end of the wizard, I guess that where the data gets read into the vectors. I put the results in the second plot (time in ms to read a vector vs column number), where we clearly see that the fixed-width approach always takes the same time regardless of how far the vector is in the line, while the free-formatted one has a linear behavior (and not nlog(n) as you mentioned). As can be seen clearly, overall the fixed-width approach is much faster with ca. 8 minutes loading time vs 25 minutes for the free-formatted one. 
5) One important point is that the fixed-width reading is significantly slower (14 ms per vector) than the other one for the first vectors. I think that's where the valgrind approach is interesting: we probably read the whole line into a buffer before accessing the data, which is not optimal in that case. In the past we held a FILE * pointer and did a fseek(), I would bet that it was faster. Probably something to revisit.
6) I have not redone the plots, but with your latest patch (avoiding closing and reopening the file each time) we are about 10x faster in all cases.

So to sum up: 
- The performance problem with many vectors in an ASCII file is linked with the time needed to read in the data. I suppose, though I haven't checked it, that other formats can be significantly faster. For example I expect that netcdf does not exhibit this behavior, but I haven't taken the time to transform the ascii file into netcdf. I don't know about dirfile, it would be interesting to know how it behaves.
- Your last commit makes things much better for that use case (many vectors)
- We should review the code (maybe for 2.0.7) to check if there are other gains to be had. It is especially not normal that reading fixed-width takes 14 ms when it takes only 8 ms for free-formatted. We can probably gain another factor 2 there.
- One thing I've been thinking about would be to add a "Cache file to RAM" option, where we would keep the vectors in RAM as we parse the file. That would certainly be a significant speedup in such cases and could have interesting uses. We'll have to talk about it in more detail...
- As of tonight, Daniyal's file in fixed-width format loads in less than 2 minutes, 66 seconds being used for the doUpdates() call.
- Interestingly, when resizing the window I went down from 28 s to about 7 s with the patch, I don't know why exactly...
- When there are so many objects loaded, the UI tends to lose responsiveness. Clicking on a menu needs 5 s before the menu appears... Either we do stuff without knowing it and we have to find out what, or there is something to improve, probably using threads (Barth, I know you are waiting for this moment :-))

I'd say this is already pretty good for 2.0.6, we can leave the rest for 2.0.7. Especially as I've found a couple of smaller bugs while doing some tests, and I think most users will have more benefit from those being fixed than from further "wide ASCII file" optimizations in the short term.
I'll send another mail with the bugs.



----- Mail original -----
> De: "Barth Netterfield" <netterfield at astro.utoronto.ca>
> À: "kst" <kst at kde.org>
> Envoyé: Samedi 16 Juin 2012 01:45:09
> Objet: [Kst] Wide Line ASCII Performance
> I created an ASCII file with 6 character wide fixed width columns.
> In all cases, there was 1E6 samples in the file.
> The Channel selector in the data wizard showed n^2 behavior.
> With 10000 columns, it took 30s to select all but one field using
> wildcards, and with 3333 columns it took 3s - ie, n^2 behavior.
> This is because selecting or unselecting a single item in the string
> selection widget goes as the length of the list. However selecting
> or unselecting all of them is very fast. So I did the optimization
> of selecting them all, then unselecting the unwanted ones for the
> case
> where you are selecting most of the fields.
> Reading all of the channels into one plot in white-space separated
> mode
> was unbearably slow with 1000 x 1000. This makes sense: the algorithm
> for reading all channels for white-space separated mode is n^2 in the
> number of channels.
> Fixed width mode was faster, but still n^2. Valgrind claims that
> this is because, even in fixed with mode, the entire row is read for
> each sample. Since, with these really long lines, a single line is
> longer
> than a disk block, reading just what is needed could be faster.
> --
> C. Barth Netterfield
> University of Toronto
> 416-845-0946
> _______________________________________________
> Kst mailing list
> Kst at kde.org
> https://mail.kde.org/mailman/listinfo/kst
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Performance_many_ASCII_columns.png
Type: image/png
Size: 20030 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kst/attachments/20120617/084387ed/attachment-0001.png>

More information about the Kst mailing list