[Kst] Big Ascii file

Sat Dec 7 22:04:08 UTC 2013

Hi Ben,

very interesting analysis!

>
> I've done some more testing with the latest 64 bit build and the big ascii file. I have some questions and comments for you.
>
> Why does kst consume so much memory when loading this file? The file contains 3 vectors, each vector requires around
> 3.2GB of memory, a total of 9.6GB. But when I load this file with no buffer limit kst uses 22GB of my 24GB available.
> I've tried various buffer limits from 12GB down to 3MB and the best case memory usage is 15GB.
>
> The buffer limit can have a dramatic influence on load times. In general I've found the smaller the buffer the quicker
> it loads. However, too small and kst will crash (loading big ascii file, kst crashes with a buffer limit of 2MB and
> less). The best case load time was 5min 18sec with a buffer limit of 5MB, the worst case was 20min 24sec with a buffer
> limit of 12GB. With the buffer limit disabled the load time was 12min 36sec. Using a small memory buffer is faster (up
> to 3 times) and uses less memory (about 30%). Given that there is a lot to be gained by using a small memory buffer are
> there any disadvantages I should be aware of? With such a dramatic difference it would be great if kst could pick a
> suitable buffer limit automatically.

So let's do a calculation:

400e6 rows (400M):

file, 9.7GB
+ 3 columns, 3 * 400M * sizeof(double) = 9.6GB
+ vector of row starts, 400M * 8 = 3.2GB
= 22.5GB

This explains your 22GB when using no buffer limit: nothing is
freed after reading, in order to speed up the next read (which
seems not to work).

But I can't explain why you see at least 15GB: even if the vector of
row starts is not freed, we are only at about 12.8GB (9.6GB + 3.2GB).
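
The estimate above can be re-checked with a short Python sketch. It assumes decimal gigabytes (10^9 bytes), 8-byte doubles, and one 8-byte offset per row start, matching the figures in the calculation; the function name `estimate_gb` is only illustrative and is not part of kst.

```python
GB = 10**9  # decimal gigabytes, as in the estimate above

rows = 400 * 10**6      # 400 million rows
columns = 3
file_bytes = 9.7 * GB   # size of the ASCII file (figure from this thread)
value_bytes = 8         # sizeof(double)
offset_bytes = 8        # assumed: one 64-bit offset per row start

vector_bytes = columns * rows * value_bytes   # the three data vectors
row_index_bytes = rows * offset_bytes         # vector of row starts


def estimate_gb(keep_file: bool, keep_row_index: bool) -> float:
    """Estimated memory in GB, depending on what stays allocated."""
    total = vector_bytes
    if keep_file:
        total += file_bytes
    if keep_row_index:
        total += row_index_bytes
    return total / GB


no_limit = estimate_gb(keep_file=True, keep_row_index=True)
with_limit = estimate_gb(keep_file=False, keep_row_index=True)
print(f"no buffer limit: {no_limit:.1f} GB")
print(f"small buffer, row starts kept: {with_limit:.1f} GB")
```

Under these assumptions the two cases come out at 22.5 GB and 12.8 GB, so the observed 15GB minimum is still about 2GB above what this simple model predicts.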

I also wonder why it loads so slowly with no buffer limit, because
once the file is in memory after the first column has been read, it
does not need to be read again, as it does with a limited buffer.
Maybe you see this effect only with a large number of columns.

Oh, and you have found a crash again ;) I never tried such small buffers.

And maybe it is even faster when you disable threads for small buffers, because
the thread overhead could be relevant when the buffer is split into 8 chunks
and a thread is started for each chunk.
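
To illustrate the chunk-per-thread idea, here is a minimal Python sketch. It is not kst's implementation: the names `split_at_newlines`, `parse_column` and `read_column` are hypothetical, and the comma delimiter is an assumption. It only shows the structure (split the buffer at line boundaries, parse each chunk in its own thread, concatenate), so the fixed cost of starting threads for a very small buffer is easy to see.

```python
from concurrent.futures import ThreadPoolExecutor


def split_at_newlines(buf: str, n_chunks: int) -> list[str]:
    """Cut buf into at most n_chunks pieces, each ending on a line boundary."""
    chunks = []
    target = max(1, -(-len(buf) // n_chunks))  # ceiling division
    start = 0
    while start < len(buf):
        end = min(len(buf), start + target)
        # Extend the chunk to the next newline so no row is split.
        nl = buf.find("\n", end - 1)
        end = len(buf) if nl == -1 else nl + 1
        chunks.append(buf[start:end])
        start = end
    return chunks


def parse_column(chunk: str, col: int) -> list[float]:
    """Parse one column from every non-empty row of a chunk."""
    return [float(line.split(",")[col]) for line in chunk.splitlines() if line]


def read_column(buf: str, col: int, n_chunks: int = 8,
                use_threads: bool = True) -> list[float]:
    chunks = split_at_newlines(buf, n_chunks)
    if not use_threads or not chunks:
        return parse_column(buf, col)
    # One worker per chunk; results come back in chunk order.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        parts = pool.map(lambda c: parse_column(c, col), chunks)
        return [value for part in parts for value in part]


data = "".join(f"{i},{i * 2},{i * 3}\n" for i in range(1000))
threaded = read_column(data, 2, n_chunks=8, use_threads=True)
single = read_column(data, 2, use_threads=False)
print(threaded == single)
```

Both paths return the same values; only the time differs. Note that Python threads do not speed up parsing the way native threads can, so this sketch is about the structure, not about measured performance.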

The data should be loaded correctly for any buffer size; the only difference is load time,
so you can choose whatever fits your setup best.

>
>
>
> I've attached plots of memory usage for each of the buffer sizes I tried. It is interesting to see how different they
> are above 500MB, given that the same file is used in all cases. It can clearly be seen that there is a lot of
> inefficiency with a large buffer.

Yes, this makes me wonder. Maybe it depends on the system you are on (hard disk, memory).

>
> There seems to be a problem with the loading of the last (third) column. It takes much longer to load than the previous
> two columns. For example, with a 3MB buffer size the first two columns load in about 40sec each while the last column
> takes about 200 sec (5 times longer). Do you think there may be a problem with how the last column is processed or is
> this to be expected?

By default it scans from the row start to the requested column, which takes more time the further right the column is,
but 5 times slower? Using fixed column sizes should be faster; this can be enabled in the config dialog.
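
The difference between the two approaches can be sketched in Python as follows. This is only an illustration, not kst's parser: both function names are hypothetical, and it assumes that "fixed size" means every column occupies the same number of characters.

```python
def field_by_scanning(row: str, col: int, delimiter: str = ",") -> str:
    """Walk from the row start past `col` delimiters; cost grows with col."""
    pos = 0
    for _ in range(col):
        pos = row.index(delimiter, pos) + 1
    end = row.find(delimiter, pos)
    return row[pos:] if end == -1 else row[pos:end]


def field_by_fixed_width(row: str, col: int, width: int) -> str:
    """Jump straight to the column; cost does not depend on col."""
    start = col * width
    return row[start:start + width].strip()


delimited_row = "1.5,2.5,3.5"
fixed_row = "  1.5  2.5  3.5"   # three columns, 5 characters each

print(field_by_scanning(delimited_row, 2))
print(field_by_fixed_width(fixed_row, 2, width=5))
```

With delimiters, reaching the third column means scanning over the first two on every row. With fixed widths the offset is computed directly, which is why the rightmost column should not cost more than the leftmost.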

To understand all this better we would need a data file with more columns, not 9GB but big enough to see differences
in the load time. We can log some timing into the debug dialog.

>
> I really like the new status updates during the loading process, however, I have one small suggestion. During the
> loading process, after each column is read in the status bar indicates that each column is being plotted before moving
> on to the next column, however, nothing is rendered to the screen until after the last column is processed. There may be
> a better way to describe the "plotting data..." step because it looks like kst is not doing what it says it's doing.

Yes, indeed this only makes sense when you read one column.

>
> When I plotted the above graph the new progress bar gets stuck at 50%. You can try it yourself with the attached data
> file "mem data.csv".

OK, I'll have a look at it.

Thanks!
Peter

>
> Regards, Ben
>
>
> On 7/12/2013 8:41 AM, Peter Kümmel wrote:
>> On 06.12.2013 12:39, Ben Lewis wrote:
>>> Hi Peter,
>>>
>>> I can now open the big ASCII file using the settings you recommended. :-)
>>>
>>> Build: x64
>>> Limit buffer size: 500MB
>>> Use threads: Yes
>>> Interpret empty value as: NULL
>>>
>>> I have not tried other settings yet.
>>>
>>> I can load all three columns in just under 10 minutes.
>>
>> When you have enough memory disable the buffer limit, then the file is only read once.
>> With buffer limit enabled, for each column the file is read again, and reading is
>> the bottleneck when you don't have a SSD.
>>
>>>
>>> My only criticism is that the progress bar does not behave as expected when loading multiple columns. When loading all
>>> three columns I observe the following behaviour:
>>
>> Should be fixed now.
>>
>> Cheers,
>> Peter
>>
>>>
>>> Searching for rows: 0-50%
>>> Reading data.../Parsing data.. 50-100% (quick)
>>> Reading column 2: 50%
>>> Reading data.../Parsing data... 50-100% (slow)
>>>
>>> Once loaded, performance is a little slow with the full data set displayed, but after zooming in performance is
>>> excellent with smooth scrolling and zooming.
>>>
>>> Regards, Ben
>>
>>
>> _______________________________________________
>> Kst mailing list
>> Kst at kde.org
>> https://mail.kde.org/mailman/listinfo/kst