These are the new instructions: addps, addss, andnps, andps, cmpeqps, cmpeqss, cmpleps, cmpless, cmpltps, cmpltss, cmpneqps, cmpneqss, cmpnleps, cmpnless, cmpnltps, cmpnltss, cmpordps, cmpordss, cmpunordps, cmpunordss, comiss, cvtpi2ps, cvtps2pi, cvtsi2ss, cvtss2si, cvttps2pi, cvttss2si, divps, divss, fxrstor, fxsave, ldmxcsr, maskmovq, maxps, maxss, minps, minss, movaps, movhps, movlps, movmskps, movntps, movntq, movss, movups, mulps, mulss, orps, pavgb, pavgw, pextrw, pinsrw, pmaxsw, pmaxub, pminsw, pminub, pmovmskb, pmulhuw, prefetchnta, prefetcht0, prefetcht1, prefetcht2, psadbw, pshufw, rcpps, rcpss, rsqrtps, rsqrtss, sfence, shufps, sqrtps, sqrtss, stmxcsr, subps, subss, ucomiss, unpckhps, unpcklps & xorps
Next update...
0F xx | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
---|---|---|---|---|---|---|---|---|
0x | Group#6 | Group#7 | lar | lsl | 04 | loadall 286 | clts | loadall 386 |
1x | MOVSS / MOVUPS xmm,xmm/mem | MOVSS / MOVUPS mem,xmm | MOVLPS xmm,mem; MOVHLPS xmm,xmm | MOVLPS mem,xmm | UNPCKLPS xmm,xmm/mem | UNPCKHPS xmm,xmm/mem | MOVHPS xmm,mem; MOVLHPS xmm,xmm | MOVHPS mem,xmm |
2x | mov | mov | mov | mov | mov | 25 | mov | 27 |
3x | wrmsr | rdtsc | rdmsr | rdpmc | wdecr | 35 | rdecr | 37 |
4x | cmovo | cmovno | cmovb | cmovnb | cmovz | cmovnz | cmovbe | cmovnbe |
5x | MOVMSKPS r32,xmm | SQRTPS / SQRTSS xmm,xmm/mem | RSQRTPS / RSQRTSS xmm,xmm/mem | RCPPS / RCPSS xmm,xmm/mem | ANDPS xmm,xmm/mem | ANDNPS xmm,xmm/mem | ORPS xmm,xmm/mem | XORPS xmm,xmm/mem |
6x | punpcklbw | punpcklwd | punpckldq | packsswb | pcmpgtb | pcmpgtw | pcmpgtd | packuswb |
7x | PSHUFW mm,mm/mem,i8 | Group#A pshimw | Group#A pshimd | Group#A pshimq | pcmpeqb | pcmpeqw | pcmpeqd | emms |
8x | jo | jno | jb | jnb | jz | jnz | jbe | jnbe |
9x | seto | setno | setb | setnb | setz | setnz | setbe | setnbe |
Ax | push fs | pop fs | cpuid | bt | shld | shld | xbts cmpxchg | xbts cmpxchg |
Bx | cmpxchg | cmpxchg | lss | btr | lfs | lgs | movzx | movzx |
Cx | xadd | xadd | CMPxPS / CMPxSS xmm,xmm/mem (x = EQ LT LE UNORD NEQ NLT NLE ORD) | C3 | PINSRW mm,r32/mem,i8 | PEXTRW r32,mm,i8 | SHUFPS xmm,xmm/mem,i8 | Group#9 |
Dx | D0 | psrlw | psrld | psrlq | D4 | pmullw | D6 | PMOVMSKB r32,mm |
Ex | PAVGB mm,mm/mem | psraw | psrad | PAVGW mm,mm/mem | PMULHUW mm,mm/mem | pmulhw | E6 | MOVNTQ mem,mm |
Fx | F0 | psllw | pslld | psllq | F4 | pmaddwd | PSADBW mm,mm/mem | MASKMOVQ mm,mm |
0F xx | x8 | x9 | xA | xB | xC | xD | xE | xF |
---|---|---|---|---|---|---|---|---|
0x | invd | wbinvd | cflsh | ud1 | 0C | 0D | 0E | 0F |
1x | GROUP#C | 19 | 1A | 1B | 1C | 1D | 1E | 1F |
2x | MOVAPS xmm,xmm/mem | MOVAPS mem,xmm | CVTPI2PS xmm,mm/mem; CVTSI2SS xmm,r32/mem | MOVNTPS mem,xmm | CVTTPS2PI mm,xmm/mem; CVTTSS2SI r32,xmm/mem | CVTPS2PI mm,xmm/mem; CVTSS2SI r32,xmm/mem | UCOMISS xmm,xmm/mem | COMISS xmm,xmm/mem |
3x | 38 | 39 | 3A | 3B | 3C | 3D | 3E | 3F |
4x | cmovs | cmovns | cmovp | cmovnp | cmovl | cmovnl | cmovle | cmovnle |
5x | ADDPS / ADDSS xmm,xmm/mem | MULPS / MULSS xmm,xmm/mem | 5A | 5B | SUBPS / SUBSS xmm,xmm/mem | MINPS / MINSS xmm,xmm/mem | DIVPS / DIVSS xmm,xmm/mem | MAXPS / MAXSS xmm,xmm/mem |
6x | punpckhbw | punpckhwd | punpckhdq | packssdw | 6C | 6D | movd | movq |
7x | 78 | 79 | 7A | 7B | 7C | 7D | movd | movq |
8x | js | jns | jp | jnp | jl | jnl | jle | jnle |
9x | sets | setns | setp | setnp | setl | setnl | setle | setnle |
Ax | push gs | pop gs | rsm | bts | shrd | shrd | GROUP#B | imul |
Bx | B8 | ud2 | Group#8 | btc | bsf | bsr | movsx | movsx |
Cx | bswap eax | bswap ecx | bswap edx | bswap ebx | bswap esp | bswap ebp | bswap esi | bswap edi |
Dx | psubusb | psubusw | PMINUB mm,mm/mem | pand | paddusb | paddusw | PMAXUB mm,mm/mem | pandn |
Ex | psubsb | psubsw | PMINSW mm,mm/mem | por | paddsb | paddsw | PMAXSW mm,mm/mem | pxor |
Fx | psubb | psubw | psubd | FB | paddb | paddw | paddd | FF |
modR/M | xx000xxx | xx001xxx | xx010xxx | xx011xxx | xx100xxx | xx101xxx | xx110xxx | xx111xxx |
---|---|---|---|---|---|---|---|---|
group #B 0F AE | FXSAVE mem512 | FXRSTOR mem512 | LDMXCSR mem | STMXCSR mem | | | | SFENCE |
group #C 0F 18 | PREFETCHNTA mem | PREFETCHT0 mem | PREFETCHT1 mem | PREFETCHT2 mem | | | | |
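For the two groups above, the instruction is selected by the middle three bits of the modR/M byte. The following is only a rough sketch of that lookup in C; the function names are mine, and the slot assignments are taken from the byte sequences in the disassembly listing on this page (0FAE06 fxsave, 0FAE0E fxrstor, 0FAE16 ldmxcsr, 0FAE1F stmxcsr, 0FAEFF sfence, and 0F1806 / 0F180C / 0F1812 / 0F1819 for the prefetches). It ignores the mod field, so it does not check that SFENCE uses a register-form modR/M.

```c
#include <stddef.h>
#include <stdint.h>

/* Extract the 3-bit reg/opcode field (bits 5..3) of a modR/M byte. */
static unsigned modrm_reg(uint8_t modrm)
{
    return (modrm >> 3) & 7u;
}

/* Mnemonic selected by the reg field for 0F AE (group #B).
 * Returns NULL for slots that are empty in the table above. */
static const char *group_b_name(uint8_t modrm)
{
    static const char *const names[8] = {
        "fxsave", "fxrstor", "ldmxcsr", "stmxcsr", NULL, NULL, NULL, "sfence"
    };
    return names[modrm_reg(modrm)];
}

/* Mnemonic selected by the reg field for 0F 18 (group #C). */
static const char *group_c_name(uint8_t modrm)
{
    static const char *const names[8] = {
        "prefetchnta", "prefetcht0", "prefetcht1", "prefetcht2",
        NULL, NULL, NULL, NULL
    };
    return names[modrm_reg(modrm)];
}
```

For example, the modR/M byte 1F from "0FAE1F stmxcsr [edi]" has reg field 011, which lands in the STMXCSR column.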
KNI not only adds several new SIMD-INT instructions to bolster the original MMX instruction set; more importantly, it adds a set of SIMD-FP instructions which operate on a new set of eight 128-bit (16-byte) XMM registers. This wealth of registers offers a large opportunity for parallelism, where several INT & FP operations can be executed at once.
3DNow!, while extremely powerful, uses only the original eight 64-bit (8-byte) MM registers, which must be shared between INT & FP operations; the present implementation allows two operations to execute at once. Currently 3DNow! has several advantages: a) chips that support it are actually available, b) software is available that uses it (DirectX 6), and c) the documentation is not shrouded in unnecessary secrecy.
00401025  0F5806  addps xmm0,[esi]  ; Add parallel scalar
00401028  0F58C3  addps xmm0,xmm3
0040102B  F30F58CC  addss xmm1,xmm4  ; Add singular scalar
0040102F  F30F5817  addss xmm2,[edi]
00401033  0F5507  andnps xmm0,[edi]  ; And-Not parallel scalar
00401036  0F55C1  andnps xmm0,xmm1
00401039  0F5416  andps xmm2,[esi]  ; And parallel scalar
0040103C  0F54D1  andps xmm2,xmm1
0040103F  0FC2C700  cmpeqps xmm0,xmm7  ; Compare Equal parallel scalar
00401043  F30FC2C700  cmpeqss xmm0,xmm7  ; Compare Equal singular scalar
00401048  0FC2D502  cmpleps xmm2,xmm5  ; Compare Less than or Equal
0040104C  F30FC2C702  cmpless xmm0,xmm7
00401051  0FC20E01  cmpltps xmm1,[esi]  ; Compare Less Than
00401055  0FC2CA01  cmpltps xmm1,xmm2
00401059  0FC2CE01  cmpltps xmm1,xmm6
0040105D  0FC23CB50000000001  cmpltps xmm7,[0+esi*4]
00401066  F30FC2C701  cmpltss xmm0,xmm7
0040106B  0FC2E304  cmpneqps xmm4,xmm3  ; Compare Not Equal
0040106F  F30FC2C704  cmpneqss xmm0,xmm7
00401074  0FC2F106  cmpnleps xmm6,xmm1  ; Compare Not Less than or Equal
00401078  F30FC2C706  cmpnless xmm0,xmm7
0040107D  0FC2EA05  cmpnltps xmm5,xmm2  ; Compare Not Less Than
00401081  F30FC2C705  cmpnltss xmm0,xmm7
00401086  0FC21CF007  cmpordps xmm3,[eax+esi*8]  ; Compare Ordered
0040108B  0FC2F807  cmpordps xmm7,xmm0
0040108F  F30FC2C707  cmpordss xmm0,xmm7
00401094  0FC2DC03  cmpunordps xmm3,xmm4  ; Compare Unordered
00401098  F30FC2C703  cmpunordss xmm0,xmm7
0040109D  0F2FDC  comiss xmm3,xmm4  ; Compare (int flags) singular scalar
004010A0  0F2F3C24  comiss xmm7,[esp]
004010A4  0F2AC1  cvtpi2ps xmm0,mm1  ; Convert parallel int to parallel scalar
004010A7  0F2A13  cvtpi2ps xmm2,[ebx]
004010AA  0F2D12  cvtps2pi mm2,[edx]  ; Convert parallel scalar to parallel int
004010AD  0F2DD0  cvtps2pi mm2,xmm0
004010B0  F30F2A13  cvtsi2ss xmm2,[ebx]  ; Convert singular int to singular scalar
004010B4  F30F2ADE  cvtsi2ss xmm3,esi
004010B8  F30F2A3F  cvtsi2ss xmm7,[edi]
004010BC  F30F2DD8  cvtss2si ebx,xmm0  ; Convert singular scalar to singular int
004010C0  F30F2D0A  cvtss2si ecx,[edx]
004010C4  0F2C10  cvttps2pi mm2,[eax]  ; Convert (with truncation) parallel scalar to parallel int
004010C7  0F2CD0  cvttps2pi mm2,xmm0
004010CA  F30F2CC0  cvttss2si eax,xmm0  ; Convert (with truncation) singular scalar to singular int
004010CE  F30F2C10  cvttss2si edx,[eax]
004010D2  0F5E06  divps xmm0,[esi]  ; Divide parallel scalar
004010D5  0F5EC3  divps xmm0,xmm3
004010D8  F30F5ECC  divss xmm1,xmm4  ; Divide singular scalar
004010DC  F30F5E17  divss xmm2,[edi]
004010E0  0FAE0E  fxrstor [esi]  ; Fast Extended Restore (FP/MMX/KNI context)
004010E3  0FAE06  fxsave [esi]  ; Fast Extended Save
004010E6  0FAE16  ldmxcsr [esi]  ; LoaD Multimedia eXtended Control/Status Register
004010E9  0FF7CF  maskmovq mm1,mm7  ; Masked move (byte-masked store of mm1 to [edi], mask in mm7)
004010EC  0FF7DC  maskmovq mm3,mm4
004010EF  0F5F1C7E  maxps xmm3,[esi+edi*2]  ; Maximum parallel scalar
004010F3  0F5FD8  maxps xmm3,xmm0
004010F6  F30F5F1F  maxss xmm3,[edi]  ; Maximum singular scalar
004010FA  F30F5FD8  maxss xmm3,xmm0
004010FE  0F5D06  minps xmm0,[esi]  ; Minimum parallel scalar
00401101  0F5DC3  minps xmm0,xmm3
00401104  F30F5D06  minss xmm0,[esi]  ; Minimum singular scalar
00401108  F30F5DC3  minss xmm0,xmm3
0040110C  0F2903  movaps [ebx],xmm0  ; Move aligned parallel scalar
0040110F  0F2803  movaps xmm0,[ebx]
00401112  0F28CA  movaps xmm1,xmm2
00401115  0F12DD  movhlps xmm3,xmm5
00401118  0F1706  movhps [esi],xmm0  ; Move high (qword) parallel scalar
0040111B  0F1603  movhps xmm0,[ebx]
0040111E  0F16FA  movlhps xmm7,xmm2
00401121  0F1306  movlps [esi],xmm0  ; Move low (qword) parallel scalar
00401124  0F1203  movlps xmm0,[ebx]
00401127  0F50DB  movmskps ebx,xmm3  ; Move mask parallel scalar
0040112A  0F50CF  movmskps ecx,xmm7
0040112D  0F2B33  movntps [ebx],xmm6  ; Move non-temporal (cache-bypassing store) parallel scalar
00401130  0FE710  movntq [eax],mm2  ; Move non-temporal (cache-bypassing store) parallel int
00401133  F30F1103  movss [ebx],xmm0  ; Move singular scalar
00401137  F30F1003  movss xmm0,[ebx]
0040113B  F30F10CA  movss xmm1,xmm2
0040113F  0F1103  movups [ebx],xmm0  ; Move unaligned parallel scalar
00401142  0F1003  movups xmm0,[ebx]
00401145  0F10CA  movups xmm1,xmm2
00401148  0F5906  mulps xmm0,[esi]  ; Multiply parallel scalar
0040114B  0F59C3  mulps xmm0,xmm3
0040114E  F30F59CC  mulss xmm1,xmm4  ; Multiply singular scalar
00401152  F30F5917  mulss xmm2,[edi]
00401156  0F5616  orps xmm2,[esi]  ; Or parallel scalar
00401159  0F56D1  orps xmm2,xmm1
0040115C  0FE0DA  pavgb mm3,mm2  ; Average byte parallel int (non sign specific)
0040115F  0FE037  pavgb mm6,[edi]
00401162  0FE3DA  pavgw mm3,mm2  ; Average word parallel int
00401165  0FE337  pavgw mm6,[edi]
00401168  0FC5C400  pextrw eax,mm4,0  ; Extract word parallel int
0040116C  0FC5DA03  pextrw ebx,mm2,3
00401170  0FC40E01  pinsrw mm1,[esi],1  ; Insert word parallel int
00401174  0FC4145F00  pinsrw mm2,[edi+ebx*2],0
00401179  0FC4D303  pinsrw mm2,ebx,3
0040117D  0FC42F02  pinsrw mm5,[edi],2
00401181  0FEED2  pmaxsw mm2,mm2  ; Maximum signed-word parallel int
00401184  0FEE29  pmaxsw mm5,[ecx]
00401187  0FDED2  pmaxub mm2,mm2  ; Maximum unsigned-byte parallel int
0040118A  0FDE29  pmaxub mm5,[ecx]
0040118D  0FEACA  pminsw mm1,mm2  ; Minimum signed-word parallel int
00401190  0FEA3B  pminsw mm7,[ebx]
00401193  0FDACA  pminub mm1,mm2  ; Minimum unsigned-byte parallel int
00401196  0FDA3B  pminub mm7,[ebx]
00401199  0FD7C0  pmovmskb eax,mm0  ; Move TRUE/FALSE bit mask from bytes in parallel int
0040119C  0FD7F2  pmovmskb esi,mm2
0040119F  0FE4DA  pmulhuw mm3,mm2  ; Multiply unsigned word parallel int storing high 16 bits
004011A2  0FE437  pmulhuw mm6,[edi]
004011A5  0F1806  prefetchnta [esi]  ; Prefetch, non-temporal hint (minimise cache pollution)
004011A8  0F180C98  prefetcht0 [eax+ebx*4]  ; Prefetch, temporal hint T0
004011AC  0F1812  prefetcht1 [edx]  ; Prefetch, temporal hint T1
004011AF  0F1819  prefetcht2 [ecx]  ; Prefetch, temporal hint T2
004011B2  0FF6DA  psadbw mm3,mm2  ; Sum of absolute differences of bytes, parallel int
004011B5  0FF637  psadbw mm6,[edi]
004011B8  0F70D103  pshufw mm2,mm1,3  ; Shuffle word parallel int
004011BC  0F701B02  pshufw mm3,[ebx],2
004011C0  0F703CFD0000000001  pshufw mm7,[0+edi*8],1
004011C9  0F5330  rcpps xmm6,[eax]  ; Reciprocal parallel scalar (very coarse)
004011CC  0F53FE  rcpps xmm7,xmm6
004011CF  F30F53DC  rcpss xmm3,xmm4  ; Reciprocal singular scalar (very coarse)
004011D3  F30F5323  rcpss xmm4,[ebx]
004011D7  0F5203  rsqrtps xmm0,[ebx]  ; Reciprocal of square root parallel scalar (very coarse)
004011DA  0F52C5  rsqrtps xmm0,xmm5
004011DD  F30F5203  rsqrtss xmm0,[ebx]  ; Reciprocal of square root singular scalar (very coarse)
004011E1  F30F52C5  rsqrtss xmm0,xmm5
004011E5  0FAEFF  sfence  ; Serialize write combining/queuing buffers
004011E8  0FC604FD0000000001  shufps xmm0,[0+edi*8],1  ; Shuffle parallel scalar
004011F1  0FC61302  shufps xmm2,[ebx],2
004011F5  0FC6F403  shufps xmm6,xmm4,3
004011F9  0F5103  sqrtps xmm0,[ebx]  ; Square root parallel scalar
004011FC  0F51C5  sqrtps xmm0,xmm5
004011FF  F30F5103  sqrtss xmm0,[ebx]  ; Square root singular scalar
00401203  F30F51C5  sqrtss xmm0,xmm5
00401207  0FAE1F  stmxcsr [edi]  ; STore Multimedia eXtended Control/Status Register
0040120A  0F5C06  subps xmm0,[esi]  ; Subtract parallel scalar
0040120D  0F5CC3  subps xmm0,xmm3
00401210  F30F5CCC  subss xmm1,xmm4  ; Subtract singular scalar
00401214  F30F5C17  subss xmm2,[edi]
00401218  0F2E4D00  ucomiss xmm1,[ebp]  ; Unordered Compare (int flags) singular scalar
0040121C  0F2ECA  ucomiss xmm1,xmm2
0040121F  0F151B  unpckhps xmm3,[ebx]  ; Unpack high (qword) parallel scalar
00401222  0F15EC  unpckhps xmm5,xmm4
00401225  0F140B  unpcklps xmm1,[ebx]  ; Unpack low (qword) parallel scalar
00401228  0F14CA  unpcklps xmm1,xmm2
0040122B  0F5707  xorps xmm0,[edi]  ; Exclusive or parallel scalar
0040122E  0F57C1  xorps xmm0,xmm1
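Two encoding patterns stand out in the listing: an F3 prefix turns a PS opcode into its SS twin, and the compare instructions all share opcode 0F C2 with the predicate in a trailing immediate byte (0 to 7 for EQ, LT, LE, UNORD, NEQ, NLT, NLE, ORD). Here is a small C sketch of a decoder for just that case. The function name `decode_cmp` is my own, and to keep it short it only handles the register-to-register forms (mod = 11); memory operands are rejected.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Predicate names indexed by the imm8 that follows the modR/M byte. */
static const char *const cmp_pred[8] = {
    "eq", "lt", "le", "unord", "neq", "nlt", "nle", "ord"
};

/* Decode [F3] 0F C2 /r ib with a register-register modR/M byte.
 * Writes text such as "cmpltps xmm1,xmm2" into out.
 * Returns 0 on success, -1 if the bytes are not such an encoding. */
static int decode_cmp(const uint8_t *code, size_t len, char *out, size_t outlen)
{
    size_t i = 0;
    int scalar = 0;

    if (len > 0 && code[0] == 0xF3) {   /* F3 prefix selects the SS form */
        scalar = 1;
        i = 1;
    }
    /* Need 0F C2, a modR/M byte and the immediate. */
    if (len < i + 4 || code[i] != 0x0F || code[i + 1] != 0xC2)
        return -1;

    uint8_t modrm = code[i + 2];
    uint8_t imm = code[i + 3];
    if ((modrm & 0xC0) != 0xC0 || imm > 7)  /* register forms only */
        return -1;

    snprintf(out, outlen, "cmp%s%s xmm%u,xmm%u",
             cmp_pred[imm], scalar ? "ss" : "ps",
             (unsigned)((modrm >> 3) & 7), (unsigned)(modrm & 7));
    return 0;
}
```

Feeding it the bytes 0F C2 C7 00 gives "cmpeqps xmm0,xmm7", and F3 0F C2 C7 02 gives "cmpless xmm0,xmm7", matching the listing.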
To determine if your processor supports the Katmai New Instructions you have to use the CPUID instruction with EAX=1 to get the "Feature Flags" returned in EDX. Bit 25 is used to indicate whether the processor supports these new instructions. Also present is a bit within Control Register 4 (CR4.KNI - bit 10) which probably needs to be set for these new instructions to work.
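The feature-flag test described above boils down to a single bit mask. A minimal C sketch follows; the macro and function names are mine. It deliberately takes the EDX value as a parameter instead of executing CPUID itself, so it compiles anywhere. On GCC or Clang the value could be fetched with `__get_cpuid` from `<cpuid.h>`, but that part is an assumption about your toolchain and is left out.

```c
#include <stdint.h>

/* Bit positions in EDX after CPUID with EAX=1. */
#define CPUID1_EDX_FXSR (1u << 24)  /* FXSAVE/FXRSTOR available */
#define CPUID1_EDX_KNI  (1u << 25)  /* Katmai New Instructions  */

/* Return non-zero if every bit in mask is set in the feature flags. */
static int has_feature(uint32_t edx, uint32_t mask)
{
    return (edx & mask) == mask;
}
```

With the feature word reported by my Katmai (0387F9FF, see the CPUID dump further down), `has_feature(0x0387F9FF, CPUID1_EDX_KNI)` is true.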
I have also located two additional instructions, MOVLHPS & MOVHLPS (see updated list above).
The following are some examples of Katmai code:
add_array_float(float *p, float *q)
{
  int i;
  for(i=0; i<64; i++)
    p[i] += q[i];
}

00401000  fn_00401000:
00401000  53  push ebx
00401001  8BDC  mov ebx,esp
00401003  83EC08  sub esp,8
00401006  83E4F0  and esp,0FFFFFFF0h
00401009  83C408  add esp,8
0040100C  83EC38  sub esp,38h
0040100F  8B4B0C  mov ecx,[ebx+0Ch]
00401012  8B5308  mov edx,[ebx+8]
00401015  B8C0FFFFFF  mov eax,0FFFFFFC0h
0040101A  8DB600000000  lea esi,[esi]
00401020  loc_00401020:
00401020  F30F10848100010000  movss xmm0,[ecx+eax*4+100h]
00401029  F30F58848200010000  addss xmm0,[edx+eax*4+100h]
00401032  F30F11848200010000  movss [edx+eax*4+100h],xmm0
0040103B  F30F10848104010000  movss xmm0,[ecx+eax*4+104h]
00401044  F30F58848204010000  addss xmm0,[edx+eax*4+104h]
0040104D  F30F11848204010000  movss [edx+eax*4+104h],xmm0
00401056  83C002  add eax,2
00401059  75C5  jnz loc_00401020
0040105B  8BE3  mov esp,ebx
0040105D  5B  pop ebx
0040105E  C3  ret

div_array_float(float *p, float *q)
{
  int i;
  for(i=0; i<64; i++)
    p[i] /= q[i];
}

004010C0  fn_004010C0:
004010C0  53  push ebx
004010C1  8BDC  mov ebx,esp
004010C3  83EC08  sub esp,8
004010C6  83E4F0  and esp,0FFFFFFF0h
004010C9  83C408  add esp,8
004010CC  83EC38  sub esp,38h
004010CF  8B4B0C  mov ecx,[ebx+0Ch]
004010D2  8B5308  mov edx,[ebx+8]
004010D5  B8C0FFFFFF  mov eax,0FFFFFFC0h
004010DA  8DB600000000  lea esi,[esi]
004010E0  loc_004010E0:
004010E0  F30F10848200010000  movss xmm0,[edx+eax*4+100h]
004010E9  F30F108C8204010000  movss xmm1,[edx+eax*4+104h]
004010F2  F30F5E848100010000  divss xmm0,[ecx+eax*4+100h]
004010FB  F30F11848200010000  movss [edx+eax*4+100h],xmm0
00401104  F30F5E8C8104010000  divss xmm1,[ecx+eax*4+104h]
0040110D  F30F118C8204010000  movss [edx+eax*4+104h],xmm1
00401116  83C002  add eax,2
00401119  75C5  jnz loc_004010E0
0040111B  8BE3  mov esp,ebx
0040111D  5B  pop ebx
0040111E  C3  ret

mul_array_float(float *p, float *q)
{
  int i;
  for(i=0; i<64; i++)
    p[i] = p[i] * q[i];
}

00401060  fn_00401060:
00401060  53  push ebx
00401061  8BDC  mov ebx,esp
00401063  83EC08  sub esp,8
00401066  83E4F0  and esp,0FFFFFFF0h
00401069  83C408  add esp,8
0040106C  83EC38  sub esp,38h
0040106F  8B4B0C  mov ecx,[ebx+0Ch]
00401072  8B5308  mov edx,[ebx+8]
00401075  B8C0FFFFFF  mov eax,0FFFFFFC0h
0040107A  8DB600000000  lea esi,[esi]
00401080  loc_00401080:
00401080  F30F10848100010000  movss xmm0,[ecx+eax*4+100h]
00401089  F30F59848200010000  mulss xmm0,[edx+eax*4+100h]
00401092  F30F11848200010000  movss [edx+eax*4+100h],xmm0
0040109B  F30F10848104010000  movss xmm0,[ecx+eax*4+104h]
004010A4  F30F59848204010000  mulss xmm0,[edx+eax*4+104h]
004010AD  F30F11848204010000  movss [edx+eax*4+104h],xmm0
004010B6  83C002  add eax,2
004010B9  75C5  jnz loc_00401080
004010BB  8BE3  mov esp,ebx
004010BD  5B  pop ebx
004010BE  C3  ret

test1_array_float(float *p, float *q)
{
  int i;
  for(i=0; i<64; i++)
    p[i] = (p[i] + 2) * (q[i] * 3);
}

00401120  fn_00401120:
00401120  53  push ebx
00401121  8BDC  mov ebx,esp
00401123  83EC08  sub esp,8
00401126  83E4F0  and esp,0FFFFFFF0h
00401129  83C408  add esp,8
0040112C  83EC38  sub esp,38h
0040112F  8B4B0C  mov ecx,[ebx+0Ch]
00401132  8B5308  mov edx,[ebx+8]
00401135  B8C0FFFFFF  mov eax,0FFFFFFC0h
0040113A  8DB600000000  lea esi,[esi]
00401140  loc_00401140:
00401140  F30F10848200010000  movss xmm0,[edx+eax*4+100h]
00401149  F30F108C8204010000  movss xmm1,[edx+eax*4+104h]
00401152  F30F580504B04000  addss xmm0,[40B004h] (040000000h)
0040115A  F30F59848100010000  mulss xmm0,[ecx+eax*4+100h]
00401163  F30F590508B04000  mulss xmm0,[40B008h] (040400000h)
0040116B  F30F11848200010000  movss [edx+eax*4+100h],xmm0
00401174  F30F580D04B04000  addss xmm1,[40B004h] (040000000h)
0040117C  F30F598C8104010000  mulss xmm1,[ecx+eax*4+104h]
00401185  F30F590D08B04000  mulss xmm1,[40B008h] (040400000h)
0040118D  F30F118C8204010000  movss [edx+eax*4+104h],xmm1
00401196  83C002  add eax,2
00401199  75A5  jnz loc_00401140
0040119B  8BE3  mov esp,ebx
0040119D  5B  pop ebx
0040119E  C3  ret

test2_array_float(float *p, float *q)
{
  int i;
  for(i=0; i<64; i += 8)
    p[i] = (p[i] + 4) * (q[i] * 5);
}

004011A0  fn_004011A0:
004011A0  53  push ebx
004011A1  8BDC  mov ebx,esp
004011A3  83EC08  sub esp,8
004011A6  83E4F0  and esp,0FFFFFFF0h
004011A9  83C408  add esp,8
004011AC  83EC38  sub esp,38h
004011AF  8B4B0C  mov ecx,[ebx+0Ch]
004011B2  8B5308  mov edx,[ebx+8]
004011B5  B8C0FFFFFF  mov eax,0FFFFFFC0h
004011BA  8DB600000000  lea esi,[esi]
004011C0  loc_004011C0:
004011C0  F30F10848200010000  movss xmm0,[edx+eax*4+100h]
004011C9  F30F108C8220010000  movss xmm1,[edx+eax*4+120h]
004011D2  F30F58050CB04000  addss xmm0,[40B00Ch] (040800000h)
004011DA  F30F59848100010000  mulss xmm0,[ecx+eax*4+100h]
004011E3  F30F590510B04000  mulss xmm0,[40B010h] (040A00000h)
004011EB  F30F11848200010000  movss [edx+eax*4+100h],xmm0
004011F4  F30F580D0CB04000  addss xmm1,[40B00Ch] (040800000h)
004011FC  F30F598C8120010000  mulss xmm1,[ecx+eax*4+120h]
00401205  F30F590D10B04000  mulss xmm1,[40B010h] (040A00000h)
0040120D  F30F118C8220010000  movss [edx+eax*4+120h],xmm1
00401216  83C010  add eax,10h
00401219  75A5  jnz loc_004011C0
0040121B  8BE3  mov esp,ebx
0040121D  5B  pop ebx
0040121E  C3  ret

test3_array_float(float *p, float *q, int r)
{
  int i;
  for(i=0; i<64; i++)
    p[i] = q[i] * r;
}

00401220  fn_00401220:
00401220  53  push ebx
00401221  8BDC  mov ebx,esp
00401223  83EC08  sub esp,8
00401226  83E4F0  and esp,0FFFFFFF0h
00401229  83C408  add esp,8
0040122C  83EC38  sub esp,38h
0040122F  0F57C0  xorps xmm0,xmm0
00401232  8B4310  mov eax,[ebx+10h]
00401235  8B530C  mov edx,[ebx+0Ch]
00401238  0F28C8  movaps xmm1,xmm0
0040123B  F30F2AC8  cvtsi2ss xmm1,eax
0040123F  8B4B08  mov ecx,[ebx+8]
00401242  B8C0FFFFFF  mov eax,0FFFFFFC0h
00401247  8BF6  mov esi,esi
00401249  8DBC2700000000  lea edi,[edi]
00401250  loc_00401250:
00401250  F30F10848200010000  movss xmm0,[edx+eax*4+100h]
00401259  F30F59C1  mulss xmm0,xmm1
0040125D  F30F11848100010000  movss [ecx+eax*4+100h],xmm0
00401266  F30F10848204010000  movss xmm0,[edx+eax*4+104h]
0040126F  F30F59C1  mulss xmm0,xmm1
00401273  F30F11848104010000  movss [ecx+eax*4+104h],xmm0
0040127C  83C002  add eax,2
0040127F  75CF  jnz loc_00401250
00401281  8BE3  mov esp,ebx
00401283  5B  pop ebx
00401284  C3  ret

test4_array_float(float *p, float *q)
{
  int i;
  for(i=0; i<64; i++)
    p[i] = q[i] * p[63-i];
  for(i=0; i<64; i++)
    p[i] *= 8;
}

00401290  fn_00401290:
00401290  53  push ebx
00401291  8BDC  mov ebx,esp
00401293  83EC08  sub esp,8
00401296  83E4F0  and esp,0FFFFFFF0h
00401299  83C408  add esp,8
0040129C  55  push ebp
0040129D  8B6B08  mov ebp,[ebx+8]
004012A0  83EC34  sub esp,34h
004012A3  8B4B0C  mov ecx,[ebx+0Ch]
004012A6  8BD5  mov edx,ebp
004012A8  B8C0FFFFFF  mov eax,0FFFFFFC0h
004012AD  8D7600  lea esi,[esi]
004012B0  loc_004012B0:
004012B0  F30F1082FC000000  movss xmm0,[edx+0FCh]
004012B8  F30F59848100010000  mulss xmm0,[ecx+eax*4+100h]
004012C1  F30F11848500010000  movss [ebp+eax*4+100h],xmm0
004012CA  F30F1082F8000000  movss xmm0,[edx+0F8h]
004012D2  83C2F8  add edx,0FFFFFFF8h
004012D5  F30F59848104010000  mulss xmm0,[ecx+eax*4+104h]
004012DE  F30F11848504010000  movss [ebp+eax*4+104h],xmm0
004012E7  83C002  add eax,2
004012EA  75C4  jnz loc_004012B0
004012EC  F30F100D14B04000  movss xmm1,[40B014h] (041000000h)
004012F4  8BC5  mov eax,ebp
004012F6  8D9500010000  lea edx,[ebp+100h]
004012FC  0FC6C900  shufps xmm1,xmm1,0
00401300  0F1000  movups xmm0,[eax]
00401303  0F105010  movups xmm2,[eax+10h]
00401307  0F59C1  mulps xmm0,xmm1
0040130A  0F59D1  mulps xmm2,xmm1
0040130D  0F1100  movups [eax],xmm0
00401310  0F115010  movups [eax+10h],xmm2
00401314  0F104020  movups xmm0,[eax+20h]
00401318  0F105030  movups xmm2,[eax+30h]
0040131C  0F59C1  mulps xmm0,xmm1
0040131F  0F59D1  mulps xmm2,xmm1
00401322  0F114020  movups [eax+20h],xmm0
00401326  0F115030  movups [eax+30h],xmm2
0040132A  0F104040  movups xmm0,[eax+40h]
0040132E  0F105050  movups xmm2,[eax+50h]
00401332  0F59C1  mulps xmm0,xmm1
00401335  0F59D1  mulps xmm2,xmm1
00401338  0F114040  movups [eax+40h],xmm0
0040133C  0F115050  movups [eax+50h],xmm2
00401340  0F104060  movups xmm0,[eax+60h]
00401344  0F105070  movups xmm2,[eax+70h]
00401348  0F59C1  mulps xmm0,xmm1
0040134B  0F59D1  mulps xmm2,xmm1
0040134E  0F114060  movups [eax+60h],xmm0
00401352  0F115070  movups [eax+70h],xmm2
00401356  0F108080000000  movups xmm0,[eax+80h]
0040135D  0F109090000000  movups xmm2,[eax+90h]
00401364  0F59C1  mulps xmm0,xmm1
00401367  0F59D1  mulps xmm2,xmm1
0040136A  0F118080000000  movups [eax+80h],xmm0
00401371  0F119090000000  movups [eax+90h],xmm2
00401378  0F1080A0000000  movups xmm0,[eax+0A0h]
0040137F  0F1090B0000000  movups xmm2,[eax+0B0h]
00401386  0F59C1  mulps xmm0,xmm1
00401389  0F59D1  mulps xmm2,xmm1
0040138C  0F1180A0000000  movups [eax+0A0h],xmm0
00401393  0F1190B0000000  movups [eax+0B0h],xmm2
0040139A  0F1080C0000000  movups xmm0,[eax+0C0h]
004013A1  0F1090D0000000  movups xmm2,[eax+0D0h]
004013A8  0F59C1  mulps xmm0,xmm1
004013AB  0F59D1  mulps xmm2,xmm1
004013AE  0F1180C0000000  movups [eax+0C0h],xmm0
004013B5  0F1190D0000000  movups [eax+0D0h],xmm2
004013BC  0F1080E0000000  movups xmm0,[eax+0E0h]
004013C3  0F1090F0000000  movups xmm2,[eax+0F0h]
004013CA  0F59C1  mulps xmm0,xmm1
004013CD  0F59D1  mulps xmm2,xmm1
004013D0  0F1180E0000000  movups [eax+0E0h],xmm0
004013D7  0F1190F0000000  movups [eax+0F0h],xmm2
004013DE  0500010000  add eax,100h
004013E3  83C434  add esp,34h
004013E6  5D  pop ebp
004013E7  8BE3  mov esp,ebx
004013E9  5B  pop ebx
004013EA  C3  ret
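The second loop of test4 is the only place above where the compiler went four-wide: it loads the constant 8.0 (041000000h) with movss, copies it into all four lanes with "shufps xmm1,xmm1,0", then uses mulps on 16 bytes at a time. Here is a plain C model of what those two instructions do, as I understand them. The `xmm_t` type and the function names are stand-ins of my own, not anything from a real header.

```c
#include <stdint.h>

/* Stand-in for one XMM register: four 32-bit floats, f[0] is the low lane. */
typedef struct { float f[4]; } xmm_t;

/* Model of SHUFPS dst,src,imm8: each 2-bit field of imm8 picks a lane.
 * The two low result lanes come from dst, the two high lanes from src. */
static xmm_t shufps(xmm_t dst, xmm_t src, uint8_t imm)
{
    xmm_t r;
    r.f[0] = dst.f[imm & 3];
    r.f[1] = dst.f[(imm >> 2) & 3];
    r.f[2] = src.f[(imm >> 4) & 3];
    r.f[3] = src.f[(imm >> 6) & 3];
    return r;
}

/* Model of MULPS: four independent multiplies. */
static xmm_t mulps(xmm_t a, xmm_t b)
{
    xmm_t r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] * b.f[i];
    return r;
}
```

With both operands the same register and an immediate of 0, every field selects lane 0, so `shufps(x, x, 0)` broadcasts the low float, which is exactly what the `p[i] *= 8` loop needs.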
BTW, SS relates to Singular Scalar (1 x 32-bit FP) and PS relates to Parallel Scalar (4 x 32-bit FP). In Intel's own terminology these are Scalar Single-precision and Packed Single-precision.
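To make the SS versus PS distinction concrete, here is a small C model of ADDSS and ADDPS. The `xmm_t` struct and the function names are only illustrative stand-ins for a register and its instructions.

```c
/* Stand-in for one XMM register: four 32-bit floats, f[0] is the low lane. */
typedef struct { float f[4]; } xmm_t;

/* ADDPS: all four lanes are added. */
static xmm_t addps(xmm_t a, xmm_t b)
{
    for (int i = 0; i < 4; i++)
        a.f[i] += b.f[i];
    return a;
}

/* ADDSS: only the low lane is added; the upper three lanes of the
 * destination pass through unchanged. */
static xmm_t addss(xmm_t a, xmm_t b)
{
    a.f[0] += b.f[0];
    return a;
}
```

Starting from a = (1, 2, 3, 4) and b = (10, 20, 30, 40), addps gives (11, 22, 33, 44) while addss gives (11, 2, 3, 4).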
CPUID Test

eax->  eax  ebx  ecx  edx
0 : 00000003 756E6547 6C65746E 49656E69  GenuineIntel
1 : 00000672 00000000 00000000 0387F9FF  Family 6, Model 7, Step 2
2 : 03020101 00000000 00000000 0C040843
  01 descriptor pages
  01 code TLB, 4K pages, 4 ways, 32 entries
  02 code TLB, 4M pages, fully, 2 entries
  03 data TLB, 4K pages, 4 ways, 64 entries
  43 code and data L2 cache, 512KB, 4 ways, 32 byte lines
  08 code L1 cache, 16KB, 2 ways, 32 byte lines
  04 data TLB, 4M pages, 4 ways, 8 entries
  0C data L1 cache, 16KB, 4 ways, 32 byte lines
3 : 00000000 00000000 2FC60DF3 0002CCE6  Serial# ?

CPU Information

511.42 MHz, Pentium III (Katmai)

#1 = 0387F9FF
  FPU  Floating point unit on-chip
  VME  Virtual Mode Extensions
  DE  Debugging Extensions
  PSE  Page Size Extension
  TSC  Time Stamp Counter
  MSR  Model Specific Registers
  PAE  Physical Address Extension
  MCE  Machine Check Exception
  CX8  CMPXCHG8 instruction supported
  SEP  Fast System Call
  MTRR  Memory Type Range Registers
  PGE  Page Global Enable
  MCA  Machine Check Architecture
  CMOV  Conditional Move instructions supported
  PAT  Page Attribute Table
  PSE-36  Page Size Extension 36-bit
  18  reserved
  MMX  Intel Architecture MMX technology supported
  FXSR  Fast floating point save and restore
  KNI  Intel Architecture KNI technology supported

Memory Timings
  L1  519% wrt RAM
  L2  345% wrt RAM
  RAM  100%
I haven't got any answers on KNI yet, but the basic P6 core doesn't seem to have changed a whole heck of a lot. For the purpose of comparison I have also knocked down the bus speed to 66 MHz, so that you can evaluate the performance against a PII Deschutes (step 1) 333 MHz. It should be noted that the Deschutes is on a FIC VL601 (Intel LX) motherboard and the Katmai is on the ASUS P2B (Intel BX). Ideally I should have put the Deschutes in the P2B, but I'm not about to waste an evening ripping my computers apart to do some benchmarking. Undoubtedly when Tom and Anand are ungagged they'll have plenty of stats from identical hardware. I have no idea how the NDAs are worded, but I imagine that if there's enough publicly disseminated information on the performance of the core (sans KNI/SSE) they lose a lot of their teeth.
I don't dislike Intel or their products; however, I have a very low tolerance for marketing hype and BS. It serves no useful purpose and just confuses consumers. For those who want to boycott Intel because of the serial number in the PIII, I suggest you take a VERY close look at just how many things in your computer already have electronically readable serial numbers, be it modems, CMOS RTCs, laptop batteries, NICs, etc. Anything that has a Flash ROM or serial EEPROM (a small 8 pin package marked 94C36 or something similar) is in all likelihood going to have a machine readable serial number, along with dates and production information.
Please don't ask me to run any ZD junk or give you frame rates, that's high level stuff. I'm working at the border where hardware meets software and I'm going to stay down there.
MFLOPS was run as a Win32 console application under NT5 Beta 2.
LoopTime (a tool I wrote) was run under plain DOS without HIMEM or EMM386.
Katmai 500 MHz (5x100 MHz)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module  Error  RunTime (usec)  MFLOPS
1  -7.6739e-013  0.0720  194.4972
2  -5.7021e-013  0.0520  134.5345
3  -2.4314e-014  0.0799  212.8221
4  6.8501e-014  0.0773  194.1453
5  -1.6320e-014  0.1760  164.7398
6  1.3961e-013  0.1280  226.5003
7  -3.6209e-011  0.1996  60.1257
8  9.0483e-015  0.1474  203.4707

Iterations = 256000000
NullTime (usec) = 0.0000
MFLOPS(1) = 152.9254
MFLOPS(2) = 114.1553
MFLOPS(3) = 165.8686
MFLOPS(4) = 210.3476

Katmai 333 MHz (5x66 MHz)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module  Error  RunTime (usec)  MFLOPS
1  -7.6739e-013  0.1080  129.6671
2  -5.7021e-013  0.0781  89.6448
3  -2.4314e-014  0.1199  141.7867
4  6.8501e-014  0.1159  129.4586
5  -1.6320e-014  0.2642  109.7608
6  1.3961e-013  0.1921  150.9557
7  -3.6209e-011  0.2996  40.0516
8  9.0483e-015  0.2213  135.5430

Iterations = 256000000
NullTime (usec) = 0.0000
MFLOPS(1) = 101.8952
MFLOPS(2) = 76.0567
MFLOPS(3) = 110.5223
MFLOPS(4) = 140.1710

Deschutes 333 MHz (5x66 MHz)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module  Error  RunTime (usec)  MFLOPS
1  -7.6739e-013  0.1087  128.7819
2  -5.7021e-013  0.0785  89.1188
3  -2.4314e-014  0.1204  141.2345
4  6.8501e-014  0.1159  129.4542
5  -1.6320e-014  0.2649  109.4549
6  1.3961e-013  0.1912  151.6929
7  -3.6209e-011  0.3009  39.8749
8  9.0483e-015  0.2210  135.7586

Iterations = 256000000
NullTime (usec) = 0.0000
MFLOPS(1) = 101.3445
MFLOPS(2) = 75.8165
MFLOPS(3) = 110.3556
MFLOPS(4) = 140.3466

LoopTime (980204)

Instruction  Cycles  Speed  CPU
NOP  0.4  0.001 us  Pentium III (Katmai)
CLC  0.9  0.002 us  510.28 MHz
XCHG DX,DX  1.9  0.004 us
MOV DX,DX  0.7  0.001 us
JMP $+2  5.4  0.011 us
LOOP $+2  10.1  0.020 us
DEC CX & JNZ $+2  3.4  0.007 us
IN AL,DX (0x025C)  712.0  1.395 us  716653.03 Bps
OUT DX,AL  712.0  1.395 us  716712.62 Bps
IN AX,DX  1099.3  2.154 us  928405.03 Bps
OUT DX,AX  1099.2  2.154 us  928430.03 Bps
IN EAX,DX  1924.0  3.771 us  1060850.68 Bps
OUT DX,EAX  1924.0  3.771 us  1060850.68 Bps
REP INSB  699.4  1.371 us  729542.48 Bps
REP OUTSB  699.4  1.371 us  729573.35 Bps
REP INSW  1099.3  2.154 us  928380.03 Bps
REP OUTSW  1099.2  2.154 us  928430.03 Bps
REP INSD  1863.9  3.653 us  1095049.39 Bps
REP OUTSD  1863.9  3.653 us  1095066.78 Bps
INC DX  0.7  0.001 us
ADD DX,2  0.7  0.001 us
ADD DL,DH  0.7  0.001 us
ADD DX,DX  0.8  0.002 us
SHL DX,1  0.6  0.001 us
PUSH AX & POP AX  1.9  0.004 us
XCHG DX,[DI]  20.0  0.039 us  (Memory 19 ns)
MOV DX,[DI]  0.7  0.001 us
MOV [DI],DX  1.2  0.002 us

Instruction  Cycles  Speed  CPU
NOP  0.4  0.001 us  Pentium III (Katmai)
CLC  0.9  0.003 us  336.32 MHz
XCHG DX,DX  1.9  0.006 us
MOV DX,DX  0.7  0.002 us
JMP $+2  5.4  0.016 us
LOOP $+2  10.0  0.030 us
DEC CX & JNZ $+2  3.4  0.010 us
IN AL,DX (0x025C)  515.9  1.534 us  651958.58 Bps
OUT DX,AL  515.8  1.534 us  652032.55 Bps
IN AX,DX  765.4  2.276 us  878760.74 Bps
OUT DX,AX  765.5  2.276 us  878715.94 Bps
IN EAX,DX  1268.0  3.770 us  1060964.93 Bps
OUT DX,EAX  1268.0  3.770 us  1060932.29 Bps
REP INSB  506.9  1.507 us  663424.62 Bps
REP OUTSB  506.9  1.507 us  663424.62 Bps
REP INSW  749.1  2.227 us  897892.59 Bps
REP OUTSW  750.1  2.230 us  896748.27 Bps
REP INSD  1268.0  3.770 us  1060981.26 Bps
REP OUTSD  1268.0  3.770 us  1060932.29 Bps
INC DX  0.7  0.002 us
ADD DX,2  0.7  0.002 us
ADD DL,DH  0.7  0.002 us
ADD DX,DX  0.7  0.002 us
SHL DX,1  0.6  0.002 us
PUSH AX & POP AX  1.9  0.006 us
XCHG DX,[DI]  19.7  0.059 us  (Memory 29 ns)
MOV DX,[DI]  0.7  0.002 us
MOV [DI],DX  1.2  0.003 us

Instruction  Cycles  Speed  CPU
NOP  0.4  0.001 us  Pentium II (Deschutes)
CLC  0.9  0.003 us  334.52 MHz
XCHG DX,DX  1.9  0.006 us
MOV DX,DX  0.7  0.002 us
JMP $+2  5.4  0.016 us
LOOP $+2  10.0  0.030 us
DEC CX & JNZ $+2  3.4  0.010 us
IN AL,DX (0x025C)  458.4  1.370 us  729781.85 Bps
OUT DX,AL  472.7  1.413 us  707656.77 Bps
IN AX,DX  698.1  2.087 us  958423.94 Bps
OUT DX,AX  720.5  2.154 us  928538.35 Bps
IN EAX,DX  1192.6  3.565 us  1121992.92 Bps
OUT DX,EAX  1212.0  3.623 us  1104009.94 Bps
REP INSB  430.8  1.288 us  776503.57 Bps
REP OUTSB  458.4  1.370 us  729781.85 Bps
REP INSW  679.7  2.032 us  984369.96 Bps
REP OUTSW  719.4  2.151 us  929959.35 Bps
REP INSD  1177.1  3.519 us  1136738.20 Bps
REP OUTSD  1200.6  3.589 us  1114546.22 Bps
INC DX  0.7  0.002 us
ADD DX,2  0.7  0.002 us
ADD DL,DH  0.7  0.002 us
ADD DX,DX  0.7  0.002 us
SHL DX,1  0.6  0.002 us
PUSH AX & POP AX  1.9  0.006 us
XCHG DX,[DI]  19.6  0.059 us  (Memory 29 ns)
MOV DX,[DI]  0.7  0.002 us
MOV [DI],DX  1.2  0.004 us
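The LoopTime columns are related by simple arithmetic: the time per instruction is the cycle count divided by the clock in MHz, and for the port I/O rows the Bps figure is the operand size in bytes divided by that time. The C helpers below are a sketch of that relationship with names of my own choosing; the real tool works from measured timings, so expect small rounding differences against the printed numbers.

```c
/* Time for one instruction in microseconds, given its cost in cycles
 * and the CPU clock in MHz (cycles per microsecond). */
static double cycles_to_us(double cycles, double mhz)
{
    return cycles / mhz;
}

/* Transfer rate in bytes per second for an instruction that moves
 * `bytes` bytes and takes `us` microseconds. */
static double bytes_per_second(double bytes, double us)
{
    return bytes * 1e6 / us;
}
```

For the Katmai at 510.28 MHz, "IN AL,DX" at 712.0 cycles works out to about 1.395 us, or roughly 717,000 bytes per second for a one-byte read, in line with the first LoopTime block.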
Optimized using instructions for | P5 | P6 | KNI |
---|---|---|---|
Cycles per addition | 3.4688 | 3.9375 | 2.3594 |
Cycles per multiplication | 3.9141 | 3.9063 | 2.4141 |
Cycles per division | 31.3828 | 31.4375 | 9.0469 |
Cycles per square-root | 82.3047 | 82.8203 | 31.0703 * |

* The compiler didn't vectorize this as fully as it could have
Cycles for some instructions (doing 4 FP operations at once)

Instruction | Cycles | Notes |
---|---|---|
RCPPS | 2 | Very fast, lower accuracy (12-bit), using ROM table? |
RSQRTPS | 2 | ditto |
DIVPS | 24 | |
SQRTPS | 56 | |
MULPS | 2 | |
ADDPS | 2 | |
MINPS | 2 | |
ANDPS | 2 | 128-bit bitwise operation, not any good for FP numbers |
CMPEQPS | 2 | 4 x 32-bit TRUE (0xFFFFFFFF) or FALSE (0x00000000) answers |
MOVHLPS | 1 | |

Other interesting news: last Thursday ASUS posted a new BIOS (1008) for the P2B. It now reports that I have a Pentium III and also offers an option to disable the Serial Number. Bit 18 of the Features Register (CPUID#1) indicates the presence of the Serial Number, which is returned using CPUID#3.