[Kst] Big Ascii file

Sun Dec 8 10:38:54 UTC 2013

Hi Peter,

I've done some more testing, this time with threads.

If I have "Use Threads" disabled I cannot have a buffer size larger than 2,000MB (also, I cannot 
disable the buffer limit). Doing so will result in a crash.

If I use a buffer size of 2,000MB or smaller the file will load but the progress bar does not update 
above 50%. Once loaded the progress bar remains displayed at 50%. This is probably related to the 
problem I observed with the very small ascii file I sent you.

With "Use Threads" disabled I can use very small buffer sizes without a crash (I've successfully 
tried down to 1MB). So the crash I reported earlier with buffer sizes of 2MB and less only 
occurs when using threads.

Please see additional comments below.

Regards, Ben

On 8/12/2013 9:04 AM, Peter Kümmel wrote:
> Hi Ben,
>
> very interesting analysis!
>
> >
>> I've done some more testing with the latest 64 bit build and the big ascii file. I have some 
>> questions and comments for you.
>>
>> Why does kst consume so much memory when loading this file? The file contains 3 vectors, each 
>> vector requires around
>> 3.2GB of memory, a total of 9.6GB. But when I load this file with no buffer limit kst uses 22GB 
>> of my 24GB available.
>> I've tried various buffer limits from 12GB down to 3MB and the best case memory usage is 15GB.
>>
>> The buffer limit can have a dramatic influence on load times. In general I've found the smaller 
>> the buffer the quicker
>> it loads. However, too small and kst will crash (loading big ascii file, kst crashes with a 
>> buffer limit of 2MB and
>> less). The best case load time was 5min 18sec with a buffer limit of 5MB, the worst case was 
>> 20min 24sec with a buffer
>> limit of 12GB. With the buffer limit disabled the load time was 12min 36sec. Using a small memory 
>> buffer is faster (up
>> to 3 times) and uses less memory (about 30%). Given that there is a lot to be gained by using a 
>> small memory buffer are
>> there any disadvantages I should be aware of? With such a dramatic difference it would be great 
>> if kst could pick a
>> suitable buffer limit automatically.
>
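As a quick cross-check of the figures above, here is a small Python sketch that converts the reported load times to seconds and compares them. The numbers are copied from the message; the names (`to_seconds`, `load_times`, `ratios`) are made up for illustration.

```python
# Figures reported in the message above; nothing here is measured by this script.
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

load_times = {
    "5MB buffer": to_seconds(5, 18),        # best case
    "no buffer limit": to_seconds(12, 36),
    "12GB buffer": to_seconds(20, 24),      # worst case
}

best = load_times["5MB buffer"]
ratios = {label: t / best for label, t in load_times.items()}

for label, t in load_times.items():
    print(f"{label}: {t} s, {ratios[label]:.2f}x the best time")

# Memory: 15GB best case versus 22GB with no buffer limit.
memory_saving = 1 - 15 / 22
print(f"memory saving: {memory_saving:.0%}")
```

On these numbers the 12GB buffer is about 3.85 times slower than the 5MB buffer, no buffer limit is about 2.4 times slower, and the memory saving is about 32%.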
> So let's do a calculation:
>
> 400*10^6 rows:
>
> file, 9.7GB
> + 3 columns, 3 * 400M * sizeof(double) = 9.6GB
> + vector of row starts, 400M * 8 = 3.2GB
> = 22.5GB
>
Aahh, this explains a lot. Thanks.
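
For reference, the estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming 8-byte doubles, 8-byte row offsets and GB meaning 10^9 bytes; the constant names are illustrative.

```python
# Rough memory estimate for loading the file with no buffer limit.
# All inputs come from the calculation quoted above.
ROWS = 400_000_000          # about 400 million rows
COLUMNS = 3
BYTES_PER_DOUBLE = 8        # sizeof(double) on common platforms
BYTES_PER_ROW_START = 8     # assumed: one 64-bit offset per row
GB = 10**9

file_gb = 9.7                                         # file kept in memory
vectors_gb = COLUMNS * ROWS * BYTES_PER_DOUBLE / GB   # data vectors
row_index_gb = ROWS * BYTES_PER_ROW_START / GB        # vector of row starts
total_gb = file_gb + vectors_gb + row_index_gb

print(f"vectors:    {vectors_gb:.1f} GB")
print(f"row starts: {row_index_gb:.1f} GB")
print(f"total:      {total_gb:.1f} GB")
print(f"without the file buffer: {vectors_gb + row_index_gb:.1f} GB")
```

The last line shows the figure expected if only the vectors and the row-start index stay in memory.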

> This explains your 22GB when using no buffer limit, because
> nothing is freed after reading, in order to speed up the next read
> (which does not seem to work).
>
> But I can't explain why you see at least 15GB: even if
> the vector with the row starts is not freed, we are only at 12GB.
>
> I also wonder why it loads so slowly with no buffer limit, because
> once the file is in memory after the first column read, it does
> not need to be read again, as it does with a limited buffer. Maybe you
> see this effect only for a large number of columns.
>
> Oh and you have found a crash again ;) I never tried with such small buffers.
>
> And maybe it is even faster when you disable threads for small buffers, because
> the thread overhead could be relevant when you split the buffer into 8 chunks
> and then start a thread for each chunk.
>
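To make the chunking idea concrete, here is a minimal Python sketch of splitting a buffer of rows into up to 8 chunks and parsing each chunk in its own thread. It is not kst's code: the functions `split_into_chunks`, `parse_column` and `parse_threaded` are hypothetical, and whitespace-separated columns are assumed.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(lines, n_chunks):
    """Split a list of rows into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(lines)))
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks

def parse_column(lines, column):
    """Parse one whitespace-separated column of each row as a float."""
    return [float(line.split()[column]) for line in lines]

def parse_threaded(lines, column, n_chunks=8):
    """Parse one column by handing each chunk to its own worker thread."""
    chunks = split_into_chunks(lines, n_chunks)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # map() returns the results in chunk order, so row order is preserved.
        parts = list(pool.map(parse_column, chunks, [column] * len(chunks)))
    return [value for part in parts for value in part]

rows = [f"{i} {i * 2} {i * 3}" for i in range(20)]
print(parse_threaded(rows, 1)[:5])
```

With a very small buffer each chunk holds only a few rows, so the fixed cost of starting and joining the workers can outweigh the parsing itself. In Python the interpreter lock also prevents a real speed-up here; the sketch only shows the structure.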
> The data should be loaded correctly for any buffer size; the only difference is load time,
> so you can choose whichever best fits your setup.
>
>>
>>
>>
>> I've attached plots of memory usage for each of the buffer sizes I tried. It is interesting to 
>> see how different they
>> are above 500MB, given that the same file is used in all cases. It can clearly be seen that there 
>> is a lot of
>> inefficiency with a large buffer.
>
> Yes, this makes me wonder. Maybe it depends on the system you are on (hard disk, memory).
I do not have an SSD, so my HDD is definitely a bottleneck.
>
>>
>> There seems to be a problem with the loading of the last (third) column. It takes much longer to 
>> load than the previous
>> two columns. For example, with a 3MB buffer size the first two columns load in about 40sec each 
>> while the last column
>> takes about 200 sec (5 times longer). Do you think there may be a problem with how the last 
>> column is processed or is
>> this to be expected?
>
> By default it scans from the row start to the last column, which takes more time the further 
> right the column is,
> but 5 times slower? Using fixed sizes should be faster; this can be enabled in the config dialog.
If the extra time were due to the fact that there is more data to scan with each successive column 
then the load time would be linearly proportional to the column number. This is not the case. The 
first two columns load in a very similar time while the last column loads 5 times slower. There must 
be something else much more significant going on when processing the last column.
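
To put numbers on that argument, the sketch below compares the reported times (about 40 sec, 40 sec and 200 sec with a 3MB buffer) against a simple model in which the time for column k is k times the time for column 1. The model is an assumption for illustration only.

```python
# Reported load times per column with a 3MB buffer, in seconds (approximate).
observed = {1: 40.0, 2: 40.0, 3: 200.0}

# Assumed model: scanning cost grows linearly with the column number.
base = observed[1]
expected = {column: base * column for column in observed}

for column in sorted(observed):
    print(f"column {column}: observed {observed[column]:.0f} s, "
          f"linear model {expected[column]:.0f} s")

print(f"observed column 3 / column 1: {observed[3] / observed[1]:.1f}")
print(f"model    column 3 / column 1: {expected[3] / expected[1]:.1f}")
```

The linear model predicts 120 sec for the third column and 80 sec for the second, so it matches neither the flat first two columns nor the factor of 5 on the last one.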

I tried enabling the "each column has it's own constant width" option. I did not observe any 
improvement in load time. Funny, because I know you observed dramatic speed improvements with this 
option.
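
For context, this is roughly why a constant column width could be expected to help. The Python sketch below contrasts finding a field by scanning for delimiters from the row start with jumping straight to a fixed offset. Both functions are hypothetical illustrations and are not taken from kst.

```python
def field_by_scanning(row, column, delimiter=","):
    """Walk from the row start past `column` delimiters, then read the field."""
    start = 0
    for _ in range(column):
        start = row.index(delimiter, start) + 1
    end = row.find(delimiter, start)
    return row[start:] if end == -1 else row[start:end]

def field_by_fixed_width(row, column, widths):
    """Jump directly to the field using known, constant column widths."""
    start = sum(widths[:column])
    return row[start:start + widths[column]].strip()

delimited_row = "1.5,2.25,300.0"
fixed_row = "  1.50  2.25300.00"   # three fields, each 6 characters wide

print(field_by_scanning(delimited_row, 2))
print(field_by_fixed_width(fixed_row, 2, [6, 6, 6]))
```

The scanning version has to look at every character before the wanted field, while the fixed-width version only adds up the widths.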
>
>
> To understand all this better we would need a data file with more columns, not 9GB but big enough 
> to see differences
> in the load time. We can log some timing into the debug dialog.
Sounds good, beats doing it with a stopwatch.
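
As a stopgap until such logging exists, per-column timing can be sketched in Python as below. The `timed` helper and the `read_column` stand-in are hypothetical; they only show the idea of recording how long each column read takes.

```python
import time
from contextlib import contextmanager

timings = []   # list of (label, elapsed seconds)

@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((label, time.perf_counter() - start))

def read_column(lines, column):
    """Stand-in for a real column read: parse one comma-separated field per row."""
    return [float(line.split(",")[column]) for line in lines]

lines = [f"{i},{i * 0.5},{i * 2}" for i in range(1000)]

for column in range(3):
    with timed(f"column {column + 1}"):
        read_column(lines, column)

for label, elapsed in timings:
    print(f"{label}: {elapsed * 1000:.2f} ms")
```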
>
>>
>> I really like the new status updates during the loading process; however, I have one small 
>> suggestion. During the
>> loading process, after each column is read in, the status bar indicates that the column is being 
>> plotted before moving
>> on to the next column. However, nothing is rendered to the screen until after the last column is 
>> processed. There may be
>> a better way to describe the "plotting data..." step because it looks like kst is not doing what 
>> it says it's doing.
>
> Yes, indeed this only makes sense when you read one column.
>
>>
>> When I plotted the above graph the new progress bar gets stuck at 50%. You can try it yourself 
>> with the attached data
>> file "mem data.csv".
>
> OK, I'll have a look at it.
>
> Thanks!
> Peter
>
>
>>
>> Regards, Ben
>>
>>
>> On 7/12/2013 8:41 AM, Peter Kümmel wrote:
>>> On 06.12.2013 12:39, Ben Lewis wrote:
>>>> Hi Peter,
>>>>
>>>> I can now open the big ASCII file using the settings you recommended. :-)
>>>>
>>>> Build: x64
>>>> Limit buffer size: 500MB
>>>> Use threads: Yes
>>>> Interpret empty value as: NULL
>>>>
>>>> I have not tried other settings yet.
>>>>
>>>> I can load all three columns in just under 10 minutes.
>>>
>>> If you have enough memory, disable the buffer limit; then the file is only read once.
>>> With the buffer limit enabled, the file is read again for each column, and reading is
>>> the bottleneck when you don't have an SSD.
>>>
>>>>
>>>> My only criticism is that the progress bar does not behave as expected when loading multiple 
>>>> columns. When loading all
>>>> three columns I observe the following behaviour:
>>>
>>> Should be fixed now.
>>>
>>> Cheers,
>>> Peter
>>>
>>>>
>>>> Searching for rows: 0-50%
>>>> Reading data.../Parsing data... 50-100% (quick)
>>>> Reading column 2: 50%
>>>> Reading data.../Parsing data... 50-100% (slow)
>>>>
>>>> Once loaded, performance is a little slow with the full data set displayed, but after zooming 
>>>> in performance is
>>>> excellent with smooth scrolling and zooming.
>>>>
>>>> Regards, Ben
>>>
>>>
>>> _______________________________________________
>>> Kst mailing list
>>> Kst at kde.org
>>> https://mail.kde.org/mailman/listinfo/kst
>>
>>
>>