Some mmx/sse2 benchmarking

Tue Dec 19 23:32:54 CET 2006

Perhaps it would be a good idea to build an AltiVec and a 3DNow! set of
instructions as well. Otherwise, I think this looks fairly good.

I'm glad the load instruction worked better. When I looked into the asm
output it was clearly shorter in terms of operations, but you can never
really tell just from that.
Which one did you end up using, _mm_loadu_ps or _mm_load_ps? You said loadu
and memcpy were about the same, so I'm assuming you used _mm_load_ps.
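
For reference, here is a minimal sketch of the two load variants side by
side. It is not the actual testSSE code: the function names add4_unaligned
and add4_aligned are made up for illustration, and the addition is only a
stand-in for whatever the benchmark really computes. The one thing it is
meant to show is that _mm_load_ps / _mm_store_ps require 16-byte aligned
pointers, while _mm_loadu_ps / _mm_storeu_ps accept any address.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: adds four floats, no alignment requirement. */
void add4_unaligned(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Hypothetical helper: adds four floats; a, b and out must all be
 * 16-byte aligned, otherwise the aligned load/store may fault. */
void add4_aligned(const float *a, const float *b, float *out)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    __m128 va = _mm_load_ps(a);    /* aligned load */
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));
}
```

With GCC, the buffers can be aligned by declaring them with
__attribute__((aligned(16))); whether the aligned version is measurably
faster will depend on the CPU, so it is worth timing both.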

Anyways, keep me updated and I'll help with something if you'd like.

-Tom

Cyrille Berger wrote:
> On Friday 15 December 2006 01:06, Tom Burdick wrote:
>   
>> I would suggest replacing the memcpy calls in testSSE with the load
>> functions from xmmintrin.h, for example:
>>
>> v1m = _mm_loadu_ps(v1);
>> v2m = _mm_loadu_ps(v2);
>>
>> If you make sure the vectors are 16-byte aligned, you can do it an even
>> better way: just use _mm_load_ps instead.
>>
>> Let me know if that improves the SSE timings!
>>     
> Yes, it's a lot better! If you have any other tips, I will be glad to take
> them! I also tried to profile with sysprof and, oddly, the memcpy and loadu
> functions run at approximately the same speed (way faster than x86
> instructions).
>
> I didn't progress on the library as fast as I would have wished, but a
> half-written version is available here:
> http://cyrille.diwi.org/tmp/krita/libfastpp.tar.bz2, including an updated
> version of costs.ods.
>
>