[Kst] Big Ascii file
Ben Lewis
egretengineering at gmail.com
Thu Dec 12 14:17:20 UTC 2013
Hi Peter,
There is another issue with the new status bar updates. When plotting live data the progress bar is
somewhat distracting because it is constantly moving. Also, the text messages (which are great when
loading a data file for the first time) now overwrite more useful information like the X-Y coordinates.
Maybe the new feedback should only apply when a file is loaded for the first time (not while updating).
Alternatively, the file loading messages could all go on the RHS of the status bar, while the X-Y
coordinates stay on the LHS.
Something to think about.
Regards, Ben
On 8/12/2013 9:04 AM, Peter Kümmel wrote:
> Hi Ben,
>
> very interesting analysis!
>
>>
>> I've done some more testing with the latest 64-bit build and the big ASCII file. I have some
>> questions and comments for you.
>>
>> Why does kst consume so much memory when loading this file? The file contains 3 vectors; each
>> vector requires around 3.2GB of memory, a total of 9.6GB. But when I load this file with no
>> buffer limit kst uses 22GB of my 24GB available. I've tried various buffer limits from 12GB
>> down to 3MB and the best case memory usage is 15GB.
>>
>> The buffer limit can have a dramatic influence on load times. In general I've found that the
>> smaller the buffer, the quicker the file loads. However, if it is too small kst will crash
>> (loading the big ASCII file, kst crashes with a buffer limit of 2MB or less). The best case
>> load time was 5min 18sec with a buffer limit of 5MB; the worst case was 20min 24sec with a
>> buffer limit of 12GB. With the buffer limit disabled the load time was 12min 36sec. Using a
>> small memory buffer is faster (up to 3 times) and uses less memory (about 30%). Given that
>> there is a lot to be gained by using a small memory buffer, are there any disadvantages I
>> should be aware of? With such a dramatic difference it would be great if kst could pick a
>> suitable buffer limit automatically.
>
> So let's do a calculation:
>
> 400 * 10^6 rows:
>
> file, 9.7GB
> + 3 columns, 3 * 400M * sizeof(double) = 9.6GB
> + vector of row starts, 400M * 8 = 3.2GB
> = 22.5GB
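>
> In code, the same back-of-envelope estimate (the constants are just the
> numbers above; nothing here is taken from the Kst sources):
>
>     #include <cstdio>
>
>     int main() {
>         const double GB = 1e9;                      // decimal GB, as above
>         const long long rows = 400LL * 1000 * 1000; // ~400 million rows
>         const int cols = 3;
>
>         double file     = 9.7;                                       // ASCII file kept in memory
>         double vectors  = double(cols) * rows * sizeof(double) / GB; // 3 columns of doubles: 9.6GB
>         double rowIndex = double(rows) * 8 / GB;                     // row start offsets: 3.2GB
>         std::printf("total ~ %.1f GB\n", file + vectors + rowIndex); // ~22.5GB
>         return 0;
>     }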
>
> This explains your 22GB when using no buffer limit: nothing is
> freed after reading, in order to speed up the next read (which
> apparently doesn't work).
>
> But I can't explain why you see at least 15GB: even if the vector
> with the row starts is not freed, we are only at ~12.8GB.
>
> I also wonder why it loads so slowly with no buffer limit: once the
> file is in memory after the first column is read, it does not need to
> be read again, as it does with a limited buffer. Maybe you only see
> this effect with a large number of columns.
>
> Oh, and you have found a crash again ;) I never tried with such small buffers.
>
> And maybe it is even faster to disable threads for small buffers, because
> the thread overhead becomes relevant when you split the buffer into 8 chunks
> and then start a thread for each chunk.
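>
> Roughly this pattern, just to illustrate where the overhead would come from
> (not the actual Kst code; parseChunk is a stand-in that only counts rows):
>
>     #include <cstddef>
>     #include <thread>
>     #include <vector>
>
>     // Stand-in for the real per-chunk parse step.
>     static void parseChunk(const char* data, std::size_t len, std::size_t* rows) {
>         std::size_t n = 0;
>         for (std::size_t i = 0; i < len; ++i)
>             if (data[i] == '\n') ++n;
>         *rows = n;
>     }
>
>     void parseBuffer(const char* buf, std::size_t len, int numThreads = 8) {
>         std::vector<std::thread> workers;
>         std::vector<std::size_t> rowCounts(numThreads, 0);
>         const std::size_t chunk = len / numThreads;
>         for (int i = 0; i < numThreads; ++i) {
>             const char* start = buf + std::size_t(i) * chunk;
>             std::size_t n = (i == numThreads - 1) ? len - std::size_t(i) * chunk : chunk;
>             // Creating and joining a thread has a fixed cost; with a 2MB buffer
>             // each of 8 threads only sees ~256KB, so that fixed cost starts to
>             // dominate the actual parsing work.
>             workers.emplace_back(parseChunk, start, n, &rowCounts[i]);
>         }
>         for (std::thread& t : workers)
>             t.join();
>     }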
>
> The data should be loaded correctly for any buffer size; the only difference
> is load time, so you can choose whatever fits your setup best.
>
>>
>> I've attached plots of memory usage for each of the buffer sizes I tried. It is interesting to
>> see how different they are above 500MB, given that the same file is used in all cases. It can
>> clearly be seen that there is a lot of inefficiency with a large buffer.
>
> Yes, this makes me wonder. Maybe it depends on the system you are on, hard disk/memory.
>
>>
>> There seems to be a problem with the loading of the last (third) column. It takes much longer
>> to load than the previous two columns. For example, with a 3MB buffer size the first two
>> columns load in about 40sec each while the last column takes about 200sec (5 times longer). Do
>> you think there may be a problem with how the last column is processed, or is this to be
>> expected?
>
> By default it scans from the row start to the requested column, which takes more time the
> further right the column is, but 5 times slower? Using fixed column widths should be faster;
> that can be enabled in the config dialog.
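>
> Schematically the difference between the two modes is something like this
> (illustrative only, with made-up helper names):
>
>     #include <cstddef>
>     #include <cstdlib>
>     #include <cstring>
>
>     // Variable width: finding column `col` means scanning over one delimiter
>     // per column from the row start, so the cost grows the further right the
>     // column is.
>     double readColumnScanning(const char* row, int col) {
>         const char* p = row;
>         for (int i = 0; i < col; ++i)
>             p = std::strchr(p, ',') + 1; // assumes the delimiter is present
>         return std::strtod(p, nullptr);
>     }
>
>     // Fixed width: every field is `width` bytes, so any column is one jump.
>     double readColumnFixed(const char* row, int col, std::size_t width) {
>         return std::strtod(row + std::size_t(col) * width, nullptr);
>     }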
>
>
> To understand all this better we would need a data file with more columns, not 9GB but big
> enough to see differences in the load time. We can log some timing into the debug dialog.
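>
> Even something minimal like this would already show where the time goes
> (a sketch: QElapsedTimer is standard Qt, readColumn is a hypothetical
> stand-in, and in kst the output would go to the debug dialog rather
> than qDebug):
>
>     #include <QDebug>
>     #include <QElapsedTimer>
>
>     // Hypothetical stand-in for the real column read.
>     static void readColumn(int) {}
>
>     static void timedReadColumn(int col) {
>         QElapsedTimer timer;
>         timer.start();
>         readColumn(col);
>         qDebug() << "column" << col << "read in" << timer.elapsed() << "ms";
>     }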
>
>>
>> I really like the new status updates during the loading process; however, I have one small
>> suggestion. After each column is read in, the status bar indicates that the column is being
>> plotted before moving on to the next column; however, nothing is rendered to the screen until
>> after the last column is processed. There may be a better way to describe the "plotting
>> data..." step, because it looks like kst is not doing what it says it's doing.
>
> Yes, indeed this only makes sense when you read one column.
>
>>
>> When I plotted the above graph, the new progress bar got stuck at 50%. You can try it yourself
>> with the attached data file "mem data.csv".
>
> OK, I'll have a look at it.
>
> Thanks!
> Peter
>
>
>>
>> Regards, Ben
>>
>>
>> On 7/12/2013 8:41 AM, Peter Kümmel wrote:
>>> On 06.12.2013 12:39, Ben Lewis wrote:
>>>> Hi Peter,
>>>>
>>>> I can now open the big ASCII file using the settings you recommended. :-)
>>>>
>>>> Build: x64
>>>> Limit buffer size: 500MB
>>>> Use threads: Yes
>>>> Interpret empty value as: NULL
>>>>
>>>> I have not tried other settings yet.
>>>>
>>>> I can load all three columns in just under 10 minutes.
>>>
>>> When you have enough memory, disable the buffer limit; then the file is only read once.
>>> With the buffer limit enabled, the file is read again for each column, and reading is
>>> the bottleneck when you don't have an SSD.
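>>>
>>> In pseudo-form the limited-buffer path is something like this (a schematic
>>> sketch, not the actual implementation; parseColumnFromBuffer is a made-up
>>> name):
>>>
>>>     #include <QByteArray>
>>>     #include <QFile>
>>>
>>>     // Hypothetical stand-in for parsing one column out of a raw buffer.
>>>     static void parseColumnFromBuffer(const char*, qint64, int) {}
>>>
>>>     void loadWithBufferLimit(QFile& file, int numColumns, int bufferLimit) {
>>>         QByteArray buffer(bufferLimit, '\0');
>>>         for (int col = 0; col < numColumns; ++col) {
>>>             file.seek(0);                  // one full pass over the file per column
>>>             while (!file.atEnd()) {
>>>                 qint64 n = file.read(buffer.data(), buffer.size());
>>>                 parseColumnFromBuffer(buffer.constData(), n, col);
>>>             }
>>>         }
>>>     }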
>>>
>>>>
>>>> My only criticism is that the progress bar does not behave as expected when loading multiple
>>>> columns. When loading all three columns I observe the following behaviour:
>>>
>>> Should be fixed now.
>>>
>>> Cheers,
>>> Peter
>>>
>>>>
>>>> Searching for rows: 0-50%
>>>> Reading data.../Parsing data.. 50-100% (quick)
>>>> Reading column 2: 50%
>>>> Reading data.../Parsing data... 50-100% (slow)
>>>>
>>>> Once loaded, performance is a little slow with the full data set displayed, but after zooming
>>>> in, performance is excellent with smooth scrolling and zooming.
>>>>
>>>> Regards, Ben
>>>
> _______________________________________________
> Kst mailing list
> Kst at kde.org
> https://mail.kde.org/mailman/listinfo/kst