[Kst] Wide Line ASCII Performance

Sun Jun 17 20:39:46 UTC 2012

Hi Barth,

I've also been looking at that issue with ASCII files with many vectors a bit more. Since loading the file provided by Daniyal already takes 25 minutes, I did not go the valgrind way but rather instrumented the code with some debug prints.
What I found out correlates well with your findings, and your fix from today does make things faster by a factor 10 approximately also for this case :-)

Now some more details:
1) First, a remark: I slightly changed the files by removing empty lines (I did the tests under Linux, maybe it's only a line ending stuff...) and the characters in the first column (A, B, etc) so that I have exactly the same number of columns in the header and in the data, otherwise I think there is an offset.
2) I tried both the free-formatted version (with only the changes above to the original file) and the fixed-width one, generated using the one-liner awk script: "awk '{for (i=1;i<=NF;++i) printf("%-16s", $i); printf("\n")}' original_file.txt > fixed_width.txt"
3) Before the data gets loaded, there is some administrative overhead with the creation of vectors, curves, etc. I especially checked vector creation as I originally thought they were read at creation time (which seems to have been a wrong assumption). This overhead takes less than a minute on my PC (Core i5, 32 bits, 4 GB RAM) and expectedly shows a slight speed decrease with the increase of the number of vectors, but I would say it's nothing dramatic. See first plot in the attached image (nb of vectors created vs time in ms). We start at 5000 vectors/s and still reach about half that at 35000 vectors, which I think is OK. 
4) It turns out that what takes long is the call to self::doUpdates() at the end of the wizard, I guess that where the data gets read into the vectors. I put the results in the second plot (time in ms to read a vector vs column number), where we clearly see that the fixed-width approach always takes the same time regardless of how far the vector is in the line, while the free-formatted one has a linear behavior (and not nlog(n) as you mentioned). As can be seen clearly, overall the fixed-width approach is much faster with ca. 8 minutes loading time vs 25 minutes for the free-formatted one. 
5) One important point is that the fixed-width reading is significantly slower (14 ms per vector) than the other one for the first vectors. I think that's where the valgrind approach is interesting: we probably read the whole line into a buffer before accessing the data, which is not optimal in that case. In the past we held a FILE * pointer and did a fseek(), I would bet that it was faster. Probably something to revisit.
6) I have not redone the plots, but with your latest patch (avoiding closing and reopening the file each time) we are about 10x faster in all cases.

So to sum up: 
- The performance problem with many vectors in an ASCII file is linked with the time needed to read in the data. I suppose, though I haven't checked it, that other formats can be significantly faster. For example I expect that netcdf does not exhibit this behavior, but I haven't taken the time to transform the ascii file into netcdf. I don't know about dirfile, it would be interesting to know how it behaves.
- Your last commit makes things much better for that use case (many vectors)
- We should review the code (maybe for 2.0.7) to check if there are other gains to be had. It is especially not normal that reading fixed-width takes 14 ms when it takes only 8 ms for free-formatted. We can probably gain another factor 2 there.
- One thing I've been thinking about would be to add a "Cache file to RAM" option, where we would keep the vectors in RAM as we parse the file. That would certainly be a significant speedup in such cases and could have interesting uses. We'll have to talk about it in more detail...
- As of tonight, Daniyal's file in fixed-width format loads in less than 2 minutes, 66 seconds being used for the doUpdates() call.
- Interestingly, when resizing the window I went down from 28 s to about 7 s with the patch, I don't know why exactly...
- When there are so many objects loaded, the UI tends to lose responsiveness. Clicking on a menu needs 5 s before the menu appears... Either we do stuff without knowing it and we have to find out what, or there is something to improve, probably using threads (Barth, I know you are waiting for this moment :-))

I'd say this is already pretty good for 2.0.6, we can leave the rest for 2.0.7. Especially as I've found a couple of smaller bugs while doing some tests, and I think most users will have more benefit from those being fixed than from further "wide ASCII file" optimizations in the short term.
I'll send another mail with the bugs.

Cheers,

Nicolas

----- Mail original -----
> De: "Barth Netterfield" <netterfield at astro.utoronto.ca>
> À: "kst" <kst at kde.org>
> Envoyé: Samedi 16 Juin 2012 01:45:09
> Objet: [Kst] Wide Line ASCII Performance
> 
> 
> 
> I created an ASCII file with 6 character wide fixed width columns.
> In all cases, there was 1E6 samples in the file.
> 
> 
> The Channel selector in the data wizard showed n^2 behavior.
> With 10000 columns, it took 30s to select all but one field using
> wildcards, and with 3333 columns it took 3s - ie, n^2 behavior.
> 
> 
> This is because selecting or unselecting a single item in the string
> selection widget goes as the length of the list. However selecting
> or unselecting all of them is very fast. So I did the optimization
> of selecting them all, then unselecting the unwanted ones for the
> case
> where you are selecting most of the fields.
> 
> 
> Reading all of the channels into one plot in white-space separated
> mode
> was unbearably slow with 1000 x 1000. This makes sense: the algorithm
> for reading all channels for white-space separated mode is n^2 in the
> number of channels.
> 
> 
> Fixed width mode was faster, but still n^2. Valgrind claims that
> this is because, even in fixed with mode, the entire row is read for
> each sample. Since, with these really long lines, a single line is
> longer
> than a disk block, reading just what is needed could be faster.
> 
> 
> 
> --
> C. Barth Netterfield
> University of Toronto
> 416-845-0946
> 
> 
> 
> _______________________________________________
> Kst mailing list
> Kst at kde.org
> https://mail.kde.org/mailman/listinfo/kst
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Performance_many_ASCII_columns.png
Type: image/png
Size: 20030 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kst/attachments/20120617/084387ed/attachment-0001.png>