
odd speed optimization


Maximus
January 10th, 2011, 09:05
Hi,

I was optimizing a small memset for speed - I need to optimize the 128-512 byte case. This is my code:
Code:

_asm {
pxor mm0, mm0
mov edi, TempEndPtr
mov ecx, len
xor eax, eax
_128_loop:
movntq [edi], mm0
movntq [edi+8], mm0
movntq [edi+16], mm0
movntq [edi+32], mm0
movntq [edi+40], mm0
movntq [edi+48], mm0
movntq [edi+56], mm0
//
movntq [edi+64], mm0
movntq [edi+72], mm0
movntq [edi+80], mm0
movntq [edi+88], mm0
movntq [edi+96], mm0
movntq [edi+104], mm0
movntq [edi+112], mm0
movntq [edi+120], mm0
sub ecx, 128
add edi, 128
cmp ecx, 128
jg _128_loop
je _the_end
_4_loop:
mov [edi], eax
add edi, 4
sub ecx, 4
cmp ecx, 0
jg _4_loop
_the_end:
}


to my surprise, it runs SLOWER than filling it manually with a mov [edi],eax / mov [edi+4],eax loop!!
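Something like this, roughly (just a sketch, assuming len is a multiple of 8 here):
Code:

_asm {
mov edi, TempEndPtr
mov ecx, len
xor eax, eax
_8_loop:
mov [edi], eax ; two dword stores = 8 bytes per iteration
mov [edi+4], eax
add edi, 8
sub ecx, 8
jg _8_loop
}
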
even sadder, it runs SLOWER than:
Code:

mov ecx, len
mov edi, TempEndPtr
shr ecx, 2
xor eax, eax
inc ecx
rep stosd


any suggestions/comments?

Darkelf
January 10th, 2011, 10:02
Hi,

I'm pretty sure you've seen it already, but nevertheless:

http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2004-06/0004.html

There was a discussion going on there about a similar problem. You might want to have a look.

I have one question about your code. Are the numbers following edi in brackets decimal or hex? If they are hex numbers you have gaps, because the spacing between the memory locations is quite inconsistent then. If they are decimal you have a gap too: you're jumping from 16 to 32, leaving out 24. But that's probably totally unrelated.

Regards
darkelf

BanMe
January 10th, 2011, 10:53
The Avisynth filter SDK is the first one on Google...

This is an interesting optimization and opcode, but as stated it will slow down the function right after it - I'm guessing a timing routine..

Maximus
January 10th, 2011, 15:06
yep, I missed the +24, indeed.

However, results are always the same: moving 8 bytes at a time (with fast GP regs, not MMX/SSE ones) is the fastest solution; every other size (4, 16, 32, 64, 128) is slower.
The oddity is that I'd expect write combining to kick in - especially with the ntq variant. Also, I'm on an overclocked i5, so I'd expect SSE to be fast and free of the old Athlon penalties.

mah...

evaluator
January 11th, 2011, 10:50
moving less than ~200 kbytes is better done with STOSD/MOVSD

ps: ahm, that's for 32-bit - are you trying this on 64-bit?

ps2: how about using a 64-byte block per loop iteration + SFENCE after each?
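something like this, I mean (just a rough sketch, assuming len >= 64; the tail is handled as before):
Code:

pxor mm0, mm0
mov edi, TempEndPtr
mov ecx, len
_64_loop:
movntq [edi], mm0
movntq [edi+8], mm0
movntq [edi+16], mm0
movntq [edi+24], mm0
movntq [edi+32], mm0
movntq [edi+40], mm0
movntq [edi+48], mm0
movntq [edi+56], mm0
sfence ; fence after each 64-byte block
add edi, 64
sub ecx, 64
cmp ecx, 64
jge _64_loop
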

GamingMasteR
January 12th, 2011, 13:58
Is the destination address 16-byte aligned?

Maximus
January 12th, 2011, 15:03
hi all,

@evaluator: sfence is a serializing instruction - it would slow down the loop and force a wait for the pending memory writes. If anything, one might use it at the end of a transfer sequence, to make sure the weak ordering of the non-temporal stores doesn't cause trouble.
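i.e., if a fence is wanted at all, something like this (sketch, assuming len is a multiple of 16 for brevity):
Code:

_nt_loop:
movntq [edi], mm0
movntq [edi+8], mm0
add edi, 16
sub ecx, 16
jg _nt_loop
sfence ; one fence after the whole weakly-ordered store sequence
emms
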

@gamingmaster: no, and that's why I'm not using XMM registers - the transfer can be any size and happen at any alignment.

The issue with REP MOVS is that its special fast-string circuitry only kicks in after a number of transfers (unless that has changed on more recent processors), and I do not transfer enough data at a time to cover that startup cost - so a simple mov loop wins over it for a low number of transferred bytes.

(by the way, hand-coding the memory transfers in asm boosted the algorithm by 20%, and quickly recoding another routine from C to asm added almost as much... still, I remember those forums where idiots were saying that C compilers can produce code equal to or even better than hand-written asm... bah bah!)

What I find odd, however, is that an unrolled MMX loop is slower than a mov x2 loop, especially when the number of transferred bytes is between 200 and 300.

(ps: hehe, the IDIOT M$ compiler BY DEFAULT interpreted an INC [stuff] as a byte instead of a DWORD - damn them! ...and it shows a 'smart warning' saying 'hey, you forgot emms!' puah!)
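The workaround is simply spelling out the operand size, e.g. (stuff here being whatever variable is touched):
Code:

inc dword ptr [stuff] ; what I actually meant
inc byte ptr [stuff] ; what the compiler assumed by default
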

BanMe
January 14th, 2011, 11:15
There is a nice tool in development on the masm forums called testbed; with a little feedback to the authors it could help them greatly, and might make your life easier, at least for comparative testing.

Regards BanMe

reverser
January 14th, 2011, 15:35
Here's what MSVC does:
Code:
mov edi, [ebp+arg_0]
mov ecx, [ebp+arg_4]
shr ecx, 7
pxor xmm0, xmm0
jmp short $L
align 10h

$L:
movdqa xmmword ptr [edi], xmm0
movdqa xmmword ptr [edi+10h], xmm0
movdqa xmmword ptr [edi+20h], xmm0
movdqa xmmword ptr [edi+30h], xmm0
movdqa xmmword ptr [edi+40h], xmm0
movdqa xmmword ptr [edi+50h], xmm0
movdqa xmmword ptr [edi+60h], xmm0
movdqa xmmword ptr [edi+70h], xmm0
lea edi, [edi+80h]
dec ecx
jnz short $L

Maximus
January 23rd, 2011, 18:41
...that code requires your data to be paragraph (16-byte) aligned, which is rarely the case for smaller buffers.
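To use it on an arbitrary pointer you'd first need an alignment prologue, roughly like this (just a sketch, assuming len is at least 16; it byte-fills up to the next 16-byte boundary, then the aligned loop can take over):
Code:

mov edi, TempEndPtr
mov ecx, len
xor eax, eax
_align_head:
test edi, 15 ; reached a 16-byte boundary?
jz _aligned
mov [edi], al ; pad one byte at a time up to the boundary
inc edi
dec ecx
jnz _align_head
_aligned:
; ...the movdqa loop runs from here on the remaining ecx bytes...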