Clive's Katmai Page

Katmai, Pentium III, MMX2, KNI, XMM or something like that


Last updated 7:49am 17-Feb-99

Update 7:37pm 13-May-98 - KNI/MMX2 instructions. I ran into an application on Intel's public FTP site that contains what I believe to be the KNI/MMX2 instruction set. I haven't got the whole encoding figured out, but the following table shows where the instructions fit into the opcode map. The application also makes reference to registers XMM0 thru 7, which I believe will be a set of FP MMX registers that are separate and distinct from the integer registers MM0 thru 7. Given the size of the context space "conveniently" provided by FXSAVE/FXRSTOR (512 bytes), this would allow each register to be 40 bytes in size, or 4 x 80 bit FP / 8 x 40 bit FP. If you look at the instructions you'll see a lot of SS & PS references; I'm not sure if this means single/double precision or primary/secondary registers! In any case Intel's FPUs typically deal with 80 bit FP values, and given 8 registers you could easily represent two 4x4 matrices.

These are the new instructions: addps, addss, andnps, andps, cmpeqps, cmpeqss, cmpleps, cmpless, cmpltps, cmpltss, cmpneqps, cmpneqss, cmpnleps, cmpnless, cmpnltps, cmpnltss, cmpordps, cmpordss, cmpunordps, cmpunordss, comiss, cvtpi2ps, cvtps2pi, cvtsi2ss, cvtss2si, cvttps2pi, cvttss2si, divps, divss, fxrstor, fxsave, ldmxcsr, maskmovq, maxps, maxss, minps, minss, movaps, movhps, movlps, movmskps, movntps, movntq, movss, movups, mulps, mulss, orps, pavgb, pavgw, pextrw, pinsrw, pmaxsw, pmaxub, pminsw, pminub, pmovmskb, pmulhuw, prefetchnta, prefetcht0, prefetcht1, prefetcht2, psadbw, pshufw, rcpps, rcpss, rsqrtps, rsqrtss, sfence, shufps, sqrtps, sqrtss, stmxcsr, subps, subss, ucomiss, unpckhps, unpcklps & xorps

Table of 0F xx Opcodes (MMX2 Capitalized)

0F xx | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7
0x | Group#6 | Group#7 | lar | lsl | 04 | loadall 286 | clts | loadall 386
1x | MOVSS/MOVUPS xmm,xmm/mem | MOVSS/MOVUPS mem,xmm | MOVLPS xmm,mem; MOVHLPS xmm,xmm | MOVLPS mem,xmm | UNPCKLPS xmm,xmm/mem | UNPCKHPS xmm,xmm/mem | MOVHPS xmm,mem; MOVLHPS xmm,xmm | MOVHPS mem,xmm
2x | mov | mov | mov | mov | mov | 25 | mov | 27
3x | wrmsr | rdtsc | rdmsr | rdpmc | wdecr | 35 | rdecr | 37
4x | cmovo | cmovno | cmovb | cmovnb | cmovz | cmovnz | cmovbe | cmovnbe
5x | MOVMSKPS r32,xmm | SQRTPS/SQRTSS xmm,xmm/mem | RSQRTPS/RSQRTSS xmm,xmm/mem | RCPPS/RCPSS xmm,xmm/mem | ANDPS xmm,xmm/mem | ANDNPS xmm,xmm/mem | ORPS xmm,xmm/mem | XORPS xmm,xmm/mem
6x | punpcklbw | punpcklwd | punpckldq | packsswb | pcmpgtb | pcmpgtw | pcmpgtd | packuswb
7x | PSHUFW mm,mm/mem,i8 | Group#A pshimw | Group#A pshimd | Group#A pshimq | pcmpeqb | pcmpeqw | pcmpeqd | emms
8x | jo | jno | jb | jnb | jz | jnz | jbe | jnbe
9x | seto | setno | setb | setnb | setz | setnz | setbe | setnbe
Ax | push fs | pop fs | cpuid | bt | shld | shld | xbts / cmpxchg | ibts / cmpxchg
Bx | cmpxchg | cmpxchg | lss | btr | lfs | lgs | movzx | movzx
Cx | xadd | xadd | CMPxPS/CMPxSS xmm,xmm/mem (x = EQ LT LE UNORD NEQ NLT NLE ORD) | C3 | PINSRW mm,r32/mem,i8 | PEXTRW r32,mm,i8 | SHUFPS xmm,xmm/mem,i8 | Group#9
Dx | D0 | psrlw | psrld | psrlq | D4 | pmullw | D6 | PMOVMSKB r32,mm
Ex | PAVGB mm,mm/mem | psraw | psrad | PAVGW mm,mm/mem | PMULHUW mm,mm/mem | pmulhw | E6 | MOVNTQ mem,mm
Fx | F0 | psllw | pslld | psllq | F4 | pmaddwd | PSADBW mm,mm/mem | MASKMOVQ mm,mm

0F xx | x8 | x9 | xA | xB | xC | xD | xE | xF
0x | invd | wbinvd | cflsh | ud1 | 0C | 0D | 0E | 0F
1x | GROUP#C | 19 | 1A | 1B | 1C | 1D | 1E | 1F
2x | MOVAPS xmm,xmm/mem | MOVAPS mem,xmm | CVTPI2PS xmm,mm/mem; CVTSI2SS xmm,r32/mem | MOVNTPS mem,xmm | CVTTPS2PI mm,xmm/mem; CVTTSS2SI r32,xmm/mem | CVTPS2PI mm,xmm/mem; CVTSS2SI r32,xmm/mem | UCOMISS xmm,xmm/mem | COMISS xmm,xmm/mem
3x | 38 | 39 | 3A | 3B | 3C | 3D | 3E | 3F
4x | cmovs | cmovns | cmovp | cmovnp | cmovl | cmovnl | cmovle | cmovnle
5x | ADDPS/ADDSS xmm,xmm/mem | MULPS/MULSS xmm,xmm/mem | 5A | 5B | SUBPS/SUBSS xmm,xmm/mem | MINPS/MINSS xmm,xmm/mem | DIVPS/DIVSS xmm,xmm/mem | MAXPS/MAXSS xmm,xmm/mem
6x | punpckhbw | punpckhwd | punpckhdq | packssdw | 6C | 6D | movd | movq
7x | 78 | 79 | 7A | 7B | 7C | 7D | movd | movq
8x | js | jns | jp | jnp | jl | jnl | jle | jnle
9x | sets | setns | setp | setnp | setl | setnl | setle | setnle
Ax | push gs | pop gs | rsm | bts | shrd | shrd | GROUP#B | imul
Bx | B8 | ud2 | Group#8 | btc | bsf | bsr | movsx | movsx
Cx | bswap eax | bswap ecx | bswap edx | bswap ebx | bswap esp | bswap ebp | bswap esi | bswap edi
Dx | psubusb | psubusw | PMINUB mm,mm/mem | pand | paddusb | paddusw | PMAXUB mm,mm/mem | pandn
Ex | psubsb | psubsw | PMINSW mm,mm/mem | por | paddsb | paddsw | PMAXSW mm,mm/mem | pxor
Fx | psubb | psubw | psubd | FB | paddb | paddw | paddd | FF

modR/M | xx000xxx | xx001xxx | xx010xxx | xx011xxx | xx100xxx | xx101xxx | xx110xxx | xx111xxx
group #B (0F AE) | FXSAVE mem512 | FXRSTOR mem512 | LDMXCSR mem | STMXCSR mem | - | - | - | SFENCE
group #C (0F 18) | PREFETCHNTA mem | PREFETCHT0 mem | PREFETCHT1 mem | PREFETCHT2 mem | - | - | - | -
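To make the two modR/M group tables concrete, here is a rough sketch of how a disassembler might pick the mnemonic for 0F AE and 0F 18 from the reg field (bits 5..3) of the modR/M byte. The function names are my own and purely illustrative, and putting SFENCE in slot 7 is an assumption taken from the 0F AE FF encoding in my test listing rather than from any official table.

```c
#include <stddef.h>
#include <stdint.h>

/* Split a modR/M byte into its three fields. */
unsigned modrm_mod(uint8_t m) { return (m >> 6) & 3u; }
unsigned modrm_reg(uint8_t m) { return (m >> 3) & 7u; }
unsigned modrm_rm(uint8_t m)  { return m & 7u; }

/* Group #B (0F AE), indexed by the reg field. Slots 4..6 are NULL because
 * nothing is known for them; SFENCE in slot 7 is inferred from 0F AE FF. */
static const char *const group_b_names[8] = {
    "fxsave", "fxrstor", "ldmxcsr", "stmxcsr", NULL, NULL, NULL, "sfence"
};

/* Group #C (0F 18), indexed by the reg field. */
static const char *const group_c_names[8] = {
    "prefetchnta", "prefetcht0", "prefetcht1", "prefetcht2",
    NULL, NULL, NULL, NULL
};

/* Hypothetical helpers: return the mnemonic, or NULL for an unknown slot. */
const char *group_b_mnemonic(uint8_t modrm)
{
    return group_b_names[modrm_reg(modrm)];
}

const char *group_c_mnemonic(uint8_t modrm)
{
    return group_c_names[modrm_reg(modrm)];
}
```

For example, the bytes 0F AE 16 have a modR/M of 16h, whose reg field is 2, so the lookup gives ldmxcsr.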

Update 8:37pm 17-Aug-98 - I have updated the rough disassembler in DUMPLX and DUMPPE to support KNI. So as not to disappoint AMD fans I have also coded support for the 3DNow! instruction set, but don't get too excited: when KNI finally arrives it is going to blow 3DNow! away.

KNI not only adds several new SIMD-INT instructions to bolster the original MMX instruction set, it more importantly adds a set of SIMD-FP instructions which operate on a new set of eight 128 bit (16 byte) XMM registers. This wealth of registers offers an extremely large opportunity for parallelism where several INT & FP operations can be executed at once.

3DNow!, while extremely powerful, uses only the original eight 64 bit (8 byte) MM registers, which must be shared between INT & FP operations; the present implementation allows two operations to execute at once. Currently 3DNow! has several advantages: a) chips that support it are actually available, b) software is available that uses it (DirectX 6), and c) the documentation is not shrouded in unnecessary secrecy.

DumpPE disassembly of my test application showing valid KNI encodings
Please note that this listing is for illustration and does nothing useful

  00401025 0F5806                 addps   xmm0,[esi]        ; Add parallel scalar
  00401028 0F58C3                 addps   xmm0,xmm3
  0040102B F30F58CC               addss   xmm1,xmm4         ; Add singular scalar
  0040102F F30F5817               addss   xmm2,[edi]
  00401033 0F5507                 andnps  xmm0,[edi]        ; And-Not parallel scalar
  00401036 0F55C1                 andnps  xmm0,xmm1
  00401039 0F5416                 andps   xmm2,[esi]        ; And parallel scalar
  0040103C 0F54D1                 andps   xmm2,xmm1
  0040103F 0FC2C700               cmpeqps xmm0,xmm7         ; Compare Equal parallel scalar
  00401043 F30FC2C700             cmpeqss xmm0,xmm7         ; Compare Equal singular scalar
  00401048 0FC2D502               cmpleps xmm2,xmm5         ; Compare Less than or Equal
  0040104C F30FC2C702             cmpless xmm0,xmm7
  00401051 0FC20E01               cmpltps xmm1,[esi]        ; Compare Less Than
  00401055 0FC2CA01               cmpltps xmm1,xmm2
  00401059 0FC2CE01               cmpltps xmm1,xmm6
  0040105D 0FC23CB50000000001     cmpltps xmm7,[0+esi*4]
  00401066 F30FC2C701             cmpltss xmm0,xmm7
  0040106B 0FC2E304               cmpneqps xmm4,xmm3        ; Compare Not Equal
  0040106F F30FC2C704             cmpneqss xmm0,xmm7
  00401074 0FC2F106               cmpnleps xmm6,xmm1        ; Compare Not Less than or Equal
  00401078 F30FC2C706             cmpnless xmm0,xmm7
  0040107D 0FC2EA05               cmpnltps xmm5,xmm2
  00401081 F30FC2C705             cmpnltss xmm0,xmm7
  00401086 0FC21CF007             cmpordps xmm3,[eax+esi*8] ; Compare Ordered
  0040108B 0FC2F807               cmpordps xmm7,xmm0
  0040108F F30FC2C707             cmpordss xmm0,xmm7
  00401094 0FC2DC03               cmpunordps xmm3,xmm4      ; Compare Unordered
  00401098 F30FC2C703             cmpunordss xmm0,xmm7
  0040109D 0F2FDC                 comiss  xmm3,xmm4         ; Compare (int flags) singular scalar
  004010A0 0F2F3C24               comiss  xmm7,[esp]
  004010A4 0F2AC1                 cvtpi2ps xmm0,mm1         ; Convert parallel int to parallel scalar
  004010A7 0F2A13                 cvtpi2ps xmm2,[ebx]
  004010AA 0F2D12                 cvtps2pi mm2,[edx]        ; Convert parallel scalar to parallel int
  004010AD 0F2DD0                 cvtps2pi mm2,xmm0
  004010B0 F30F2A13               cvtsi2ss xmm2,[ebx]       ; Convert singular int to singular scalar
  004010B4 F30F2ADE               cvtsi2ss xmm3,esi
  004010B8 F30F2A3F               cvtsi2ss xmm7,[edi]
  004010BC F30F2DD8               cvtss2si ebx,xmm0         ; Convert singular scalar to singular int
  004010C0 F30F2D0A               cvtss2si ecx,[edx]
  004010C4 0F2C10                 cvttps2pi mm2,[eax]       ; Convert (truncating) parallel scalar to parallel int
  004010C7 0F2CD0                 cvttps2pi mm2,xmm0
  004010CA F30F2CC0               cvttss2si eax,xmm0        ; Convert (truncating) singular scalar to singular int
  004010CE F30F2C10               cvttss2si edx,[eax]
  004010D2 0F5E06                 divps   xmm0,[esi]        ; Divide parallel scalar
  004010D5 0F5EC3                 divps   xmm0,xmm3
  004010D8 F30F5ECC               divss   xmm1,xmm4         ; Divide singular scalar
  004010DC F30F5E17               divss   xmm2,[edi]
  004010E0 0FAE0E                 fxrstor [esi]             ; Fast Extended Restore (FP/MMX/KNI context)
  004010E3 0FAE06                 fxsave  [esi]             ; Fast Extended Save
  004010E6 0FAE16                 ldmxcsr [esi]             ; LoaD Multimedia eXtended Control/Status Register
  004010E9 0FF7CF                 maskmovq mm1,mm7          ; Masked Move?
  004010EC 0FF7DC                 maskmovq mm3,mm4
  004010EF 0F5F1C7E               maxps   xmm3,[esi+edi*2]  ; Maximum parallel scalar
  004010F3 0F5FD8                 maxps   xmm3,xmm0
  004010F6 F30F5F1F               maxss   xmm3,[edi]        ; Maximum singular scalar
  004010FA F30F5FD8               maxss   xmm3,xmm0
  004010FE 0F5D06                 minps   xmm0,[esi]        ; Minimum parallel scalar
  00401101 0F5DC3                 minps   xmm0,xmm3
  00401104 F30F5D06               minss   xmm0,[esi]        ; Minimum singular scalar
  00401108 F30F5DC3               minss   xmm0,xmm3
  0040110C 0F2903                 movaps  [ebx],xmm0        ; Move aligned parallel scalar
  0040110F 0F2803                 movaps  xmm0,[ebx]
  00401112 0F28CA                 movaps  xmm1,xmm2
  00401115 0F12DD                 movhlps xmm3,xmm5
  00401118 0F1706                 movhps  [esi],xmm0        ; Move high (qword) parallel scalar
  0040111B 0F1603                 movhps  xmm0,[ebx]
  0040111E 0F16FA                 movlhps xmm7,xmm2
  00401121 0F1306                 movlps  [esi],xmm0        ; Move low (qword) parallel scalar
  00401124 0F1203                 movlps  xmm0,[ebx]
  00401127 0F50DB                 movmskps ebx,xmm3         ; Move mask parallel scalar
  0040112A 0F50CF                 movmskps ecx,xmm7
  0040112D 0F2B33                 movntps [ebx],xmm6        ; Move non-temporal (cache-bypassing) parallel scalar
  00401130 0FE710                 movntq  [eax],mm2         ; Move non-temporal (cache-bypassing) parallel int
  00401133 F30F1103               movss   [ebx],xmm0        ; Move singular scalar
  00401137 F30F1003               movss   xmm0,[ebx]
  0040113B F30F10CA               movss   xmm1,xmm2
  0040113F 0F1103                 movups  [ebx],xmm0        ; Move unaligned parallel scalar
  00401142 0F1003                 movups  xmm0,[ebx]
  00401145 0F10CA                 movups  xmm1,xmm2
  00401148 0F5906                 mulps   xmm0,[esi]        ; Multiply parallel scalar
  0040114B 0F59C3                 mulps   xmm0,xmm3
  0040114E F30F59CC               mulss   xmm1,xmm4         ; Multiply singular scalar
  00401152 F30F5917               mulss   xmm2,[edi]
  00401156 0F5616                 orps    xmm2,[esi]        ; Or parallel scalar
  00401159 0F56D1                 orps    xmm2,xmm1
  0040115C 0FE0DA                 pavgb   mm3,mm2           ; Average byte parallel int (non sign specific)
  0040115F 0FE037                 pavgb   mm6,[edi]
  00401162 0FE3DA                 pavgw   mm3,mm2           ; Average word parallel int
  00401165 0FE337                 pavgw   mm6,[edi]
  00401168 0FC5C400               pextrw  eax,mm4,0         ; Extract word parallel int
  0040116C 0FC5DA03               pextrw  ebx,mm2,3
  00401170 0FC40E01               pinsrw  mm1,[esi],1       ; Insert word parallel int
  00401174 0FC4145F00             pinsrw  mm2,[edi+ebx*2],0
  00401179 0FC4D303               pinsrw  mm2,ebx,3
  0040117D 0FC42F02               pinsrw  mm5,[edi],2
  00401181 0FEED2                 pmaxsw  mm2,mm2           ; Maximum signed-word parallel int
  00401184 0FEE29                 pmaxsw  mm5,[ecx]
  00401187 0FDED2                 pmaxub  mm2,mm2           ; Maximum unsigned-byte parallel int
  0040118A 0FDE29                 pmaxub  mm5,[ecx]
  0040118D 0FEACA                 pminsw  mm1,mm2           ; Minimum signed-word parallel int
  00401190 0FEA3B                 pminsw  mm7,[ebx]
  00401193 0FDACA                 pminub  mm1,mm2           ; Minimum unsigned-byte parallel int
  00401196 0FDA3B                 pminub  mm7,[ebx]
  00401199 0FD7C0                 pmovmskb eax,mm0          ; Move TRUE/FALSE bit mask from bytes in parallel int
  0040119C 0FD7F2                 pmovmskb esi,mm2
  0040119F 0FE4DA                 pmulhuw mm3,mm2           ; Multiply unsigned word parallel int storing high 16 bits
  004011A2 0FE437                 pmulhuw mm6,[edi]
  004011A5 0F1806                 prefetchnta [esi]         ; Prefetch, non-temporal hint
  004011A8 0F180C98               prefetcht0 [eax+ebx*4]    ; Prefetch, temporal hint T0
  004011AC 0F1812                 prefetcht1 [edx]          ; Prefetch, temporal hint T1
  004011AF 0F1819                 prefetcht2 [ecx]          ; Prefetch, temporal hint T2
  004011B2 0FF6DA                 psadbw  mm3,mm2           ; Sum of absolute differences of unsigned bytes, parallel int
  004011B5 0FF637                 psadbw  mm6,[edi]
  004011B8 0F70D103               pshufw  mm2,mm1,3         ; Shuffle word parallel int
  004011BC 0F701B02               pshufw  mm3,[ebx],2
  004011C0 0F703CFD0000000001     pshufw  mm7,[0+edi*8],1
  004011C9 0F5330                 rcpps   xmm6,[eax]        ; Reciprocal parallel scalar (very coarse)
  004011CC 0F53FE                 rcpps   xmm7,xmm6
  004011CF F30F53DC               rcpss   xmm3,xmm4         ; Reciprocal singular scalar (very coarse)
  004011D3 F30F5323               rcpss   xmm4,[ebx]
  004011D7 0F5203                 rsqrtps xmm0,[ebx]        ; Reciprocal of square root parallel scalar (very coarse)
  004011DA 0F52C5                 rsqrtps xmm0,xmm5
  004011DD F30F5203               rsqrtss xmm0,[ebx]        ; Reciprocal of square root singular scalar (very coarse)
  004011E1 F30F52C5               rsqrtss xmm0,xmm5
  004011E5 0FAEFF                 sfence                    ; Serialize write combining/queuing buffers
  004011E8 0FC604FD0000000001     shufps  xmm0,[0+edi*8],1  ; Shuffle parallel scalar
  004011F1 0FC61302               shufps  xmm2,[ebx],2
  004011F5 0FC6F403               shufps  xmm6,xmm4,3
  004011F9 0F5103                 sqrtps  xmm0,[ebx]        ; Square root parallel scalar
  004011FC 0F51C5                 sqrtps  xmm0,xmm5
  004011FF F30F5103               sqrtss  xmm0,[ebx]        ; Square root singular scalar
  00401203 F30F51C5               sqrtss  xmm0,xmm5
  00401207 0FAE1F                 stmxcsr [edi]             ; STore Multimedia eXtended Control/Status Register
  0040120A 0F5C06                 subps   xmm0,[esi]        ; Subtract parallel scalar
  0040120D 0F5CC3                 subps   xmm0,xmm3
  00401210 F30F5CCC               subss   xmm1,xmm4         ; Subtract singular scalar
  00401214 F30F5C17               subss   xmm2,[edi]
  00401218 0F2E4D00               ucomiss xmm1,[ebp]        ; Unordered Compare (int flags) singular scalar
  0040121C 0F2ECA                 ucomiss xmm1,xmm2
  0040121F 0F151B                 unpckhps xmm3,[ebx]       ; Unpack high (qword) parallel scalar
  00401222 0F15EC                 unpckhps xmm5,xmm4
  00401225 0F140B                 unpcklps xmm1,[ebx]       ; Unpack low (qword) parallel scalar
  00401228 0F14CA                 unpcklps xmm1,xmm2
  0040122B 0F5707                 xorps   xmm0,[edi]        ; Exclusive or parallel scalar
  0040122E 0F57C1                 xorps   xmm0,xmm1
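The compare encodings in the listing all share opcode 0F C2: an F3 prefix selects the SS form, and the trailing immediate byte selects the condition (0=EQ, 1=LT, 2=LE, 3=UNORD, 4=NEQ, 5=NLT, 6=NLE, 7=ORD). The following is a minimal sketch of that naming rule only. The function name and its buffer-based interface are my own invention, and it assumes the caller already knows the instruction length.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Condition names, indexed by the immediate byte that follows the operands. */
static const char *const cmp_pred[8] = {
    "eq", "lt", "le", "unord", "neq", "nlt", "nle", "ord"
};

/* Hypothetical helper: build the mnemonic for one complete 0F C2 instruction.
 * code/len describe the whole instruction, including any F3 prefix and the
 * final immediate. Returns 0 on success, -1 if this is not a known compare. */
int cmp_mnemonic(const uint8_t *code, size_t len, char *out, size_t outsz)
{
    size_t i = 0;
    int scalar = 0;

    if (len >= 1 && code[0] == 0xF3) {   /* F3 prefix selects the SS form */
        scalar = 1;
        i = 1;
    }
    /* Need at least: 0F C2 modR/M imm8 */
    if (len < i + 4 || code[i] != 0x0F || code[i + 1] != 0xC2)
        return -1;

    uint8_t imm = code[len - 1];         /* the immediate is always last */
    if (imm > 7)
        return -1;

    snprintf(out, outsz, "cmp%s%s", cmp_pred[imm], scalar ? "ss" : "ps");
    return 0;
}
```

Feeding it F3 0F C2 C7 02 from the listing gives "cmpless", and 0F C2 DC 03 gives "cmpunordps".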

Update 9:50pm 11-Jan-99 - The official name for Katmai is Pentium III. What the new opcodes are called is still up in the air: is it MMX2, KNI or perhaps XMM?

To determine if your processor supports the Katmai New Instructions you have to use the CPUID instruction with EAX=1 to get the "Feature Flags" returned in EDX. Bit 25 is used to indicate whether the processor supports these new instructions. Also present is a bit within Control Register 4 (CR4.KNI - bit 10) which probably needs to be set for these new instructions to work.
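As a minimal sketch of the feature test, the check boils down to one bit in the EDX value that CPUID returns for EAX=1. The helper below works on a plain 32-bit value so it can be tried anywhere; its name and the mask name are my own, not anything Intel defines.

```c
#include <stdint.h>

/* Bit 25 of the CPUID(EAX=1) feature flags in EDX signals KNI support. */
#define CPUID_EDX_KNI (UINT32_C(1) << 25)

/* Hypothetical helper: returns 1 if the feature flags advertise KNI. */
int has_kni(uint32_t edx_features)
{
    return (edx_features & CPUID_EDX_KNI) != 0;
}
```

With the feature value 0387F9FFh that my Katmai reports, has_kni() returns 1. On a current GCC or Clang, I believe the EDX value can be obtained with __get_cpuid() from <cpuid.h>, but treat that as an assumption about your toolchain.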

I have also located two additional instructions, MOVLHPS & MOVHLPS (see updated list above).

The following are some examples of Katmai code:

  add_array_float(float *p, float *q)
  {
    int i;

    for(i=0; i<64; i++)
      p[i] += q[i];
  }

  00401000                    fn_00401000:
  00401000 53                     push    ebx
  00401001 8BDC                   mov     ebx,esp
  00401003 83EC08                 sub     esp,8
  00401006 83E4F0                 and     esp,0FFFFFFF0h
  00401009 83C408                 add     esp,8
  0040100C 83EC38                 sub     esp,38h
  0040100F 8B4B0C                 mov     ecx,[ebx+0Ch]
  00401012 8B5308                 mov     edx,[ebx+8]
  00401015 B8C0FFFFFF             mov     eax,0FFFFFFC0h
  0040101A 8DB600000000           lea     esi,[esi]
  00401020                    loc_00401020:
  00401020 F30F10848100010000     movss   xmm0,[ecx+eax*4+100h]
  00401029 F30F58848200010000     addss   xmm0,[edx+eax*4+100h]
  00401032 F30F11848200010000     movss   [edx+eax*4+100h],xmm0
  0040103B F30F10848104010000     movss   xmm0,[ecx+eax*4+104h]
  00401044 F30F58848204010000     addss   xmm0,[edx+eax*4+104h]
  0040104D F30F11848204010000     movss   [edx+eax*4+104h],xmm0
  00401056 83C002                 add     eax,2
  00401059 75C5                   jnz     loc_00401020
  0040105B 8BE3                   mov     esp,ebx
  0040105D 5B                     pop     ebx
  0040105E C3                     ret
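The loop in the listing above uses a common compiler trick: EAX starts at 0FFFFFFC0h (-64) and counts up to zero, and every access adds 100h (64 x 4 bytes), so the add that bumps the counter also sets the flag tested by jnz. Here is a rough C equivalent of that shape, unrolled by two like the generated code. The function name is mine, and this is only my reading of the disassembly.

```c
#include <stddef.h>

enum { N = 64 };

/* Same result as add_array_float(), written the way the compiler emitted it:
 * index from the END of each array with a negative counter that runs to 0. */
void add_array_float_neg_index(float *p, const float *q)
{
    float *p_end = p + N;          /* the "+100h" in [edx+eax*4+100h] */
    const float *q_end = q + N;    /* the "+100h" in [ecx+eax*4+100h] */

    for (ptrdiff_t i = -N; i != 0; i += 2) {   /* mov eax,-64 ... add eax,2 / jnz */
        p_end[i]     += q_end[i];              /* movss / addss / movss          */
        p_end[i + 1] += q_end[i + 1];          /* second copy, offset 104h       */
    }
}
```

Note that every element still goes through movss/addss one float at a time; nothing in this loop uses the packed (PS) forms.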

  div_array_float(float *p, float *q)
  {
    int i;

    for(i=0; i<64; i++)
      p[i] /= q[i];
  }

  004010C0                    fn_004010C0:
  004010C0 53                     push    ebx
  004010C1 8BDC                   mov     ebx,esp
  004010C3 83EC08                 sub     esp,8
  004010C6 83E4F0                 and     esp,0FFFFFFF0h
  004010C9 83C408                 add     esp,8
  004010CC 83EC38                 sub     esp,38h
  004010CF 8B4B0C                 mov     ecx,[ebx+0Ch]
  004010D2 8B5308                 mov     edx,[ebx+8]
  004010D5 B8C0FFFFFF             mov     eax,0FFFFFFC0h
  004010DA 8DB600000000           lea     esi,[esi]
  004010E0                    loc_004010E0:
  004010E0 F30F10848200010000     movss   xmm0,[edx+eax*4+100h]
  004010E9 F30F108C8204010000     movss   xmm1,[edx+eax*4+104h]
  004010F2 F30F5E848100010000     divss   xmm0,[ecx+eax*4+100h]
  004010FB F30F11848200010000     movss   [edx+eax*4+100h],xmm0
  00401104 F30F5E8C8104010000     divss   xmm1,[ecx+eax*4+104h]
  0040110D F30F118C8204010000     movss   [edx+eax*4+104h],xmm1
  00401116 83C002                 add     eax,2
  00401119 75C5                   jnz     loc_004010E0
  0040111B 8BE3                   mov     esp,ebx
  0040111D 5B                     pop     ebx
  0040111E C3                     ret

  mul_array_float(float *p, float *q)
  {
    int i;

    for(i=0; i<64; i++)
      p[i] = p[i] * q[i];
  }

  00401060                    fn_00401060:
  00401060 53                     push    ebx
  00401061 8BDC                   mov     ebx,esp
  00401063 83EC08                 sub     esp,8
  00401066 83E4F0                 and     esp,0FFFFFFF0h
  00401069 83C408                 add     esp,8
  0040106C 83EC38                 sub     esp,38h
  0040106F 8B4B0C                 mov     ecx,[ebx+0Ch]
  00401072 8B5308                 mov     edx,[ebx+8]
  00401075 B8C0FFFFFF             mov     eax,0FFFFFFC0h
  0040107A 8DB600000000           lea     esi,[esi]
  00401080                    loc_00401080:
  00401080 F30F10848100010000     movss   xmm0,[ecx+eax*4+100h]
  00401089 F30F59848200010000     mulss   xmm0,[edx+eax*4+100h]
  00401092 F30F11848200010000     movss   [edx+eax*4+100h],xmm0
  0040109B F30F10848104010000     movss   xmm0,[ecx+eax*4+104h]
  004010A4 F30F59848204010000     mulss   xmm0,[edx+eax*4+104h]
  004010AD F30F11848204010000     movss   [edx+eax*4+104h],xmm0
  004010B6 83C002                 add     eax,2
  004010B9 75C5                   jnz     loc_00401080
  004010BB 8BE3                   mov     esp,ebx
  004010BD 5B                     pop     ebx
  004010BE C3                     ret

  test1_array_float(float *p, float *q)
  {
    int i;

    for(i=0; i<64; i++)
      p[i] = (p[i] + 2) * (q[i] * 3);
  }

  00401120                    fn_00401120:
  00401120 53                     push    ebx
  00401121 8BDC                   mov     ebx,esp
  00401123 83EC08                 sub     esp,8
  00401126 83E4F0                 and     esp,0FFFFFFF0h
  00401129 83C408                 add     esp,8
  0040112C 83EC38                 sub     esp,38h
  0040112F 8B4B0C                 mov     ecx,[ebx+0Ch]
  00401132 8B5308                 mov     edx,[ebx+8]
  00401135 B8C0FFFFFF             mov     eax,0FFFFFFC0h
  0040113A 8DB600000000           lea     esi,[esi]
  00401140                    loc_00401140:
  00401140 F30F10848200010000     movss   xmm0,[edx+eax*4+100h]
  00401149 F30F108C8204010000     movss   xmm1,[edx+eax*4+104h]
  00401152 F30F580504B04000       addss   xmm0,[40B004h]        (040000000h)
  0040115A F30F59848100010000     mulss   xmm0,[ecx+eax*4+100h]
  00401163 F30F590508B04000       mulss   xmm0,[40B008h]        (040400000h)
  0040116B F30F11848200010000     movss   [edx+eax*4+100h],xmm0
  00401174 F30F580D04B04000       addss   xmm1,[40B004h]        (040000000h)
  0040117C F30F598C8104010000     mulss   xmm1,[ecx+eax*4+104h]
  00401185 F30F590D08B04000       mulss   xmm1,[40B008h]        (040400000h)
  0040118D F30F118C8204010000     movss   [edx+eax*4+104h],xmm1
  00401196 83C002                 add     eax,2
  00401199 75A5                   jnz     loc_00401140
  0040119B 8BE3                   mov     esp,ebx
  0040119D 5B                     pop     ebx
  0040119E C3                     ret
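The values in parentheses in the listing above are the raw 32-bit contents of the constants the compiler placed in memory. A small sketch, assuming float is the usual IEEE-754 single-precision format, shows how to turn such a bit pattern back into the number from the C source. The helper name is my own.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE-754 binary32). */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);   /* copy the bytes; no numeric conversion */
    return f;
}
```

So 040000000h is 2.0 and 040400000h is 3.0, matching the "+ 2" and "* 3" in test1_array_float.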

  test2_array_float(float *p, float *q)
  {
    int i;

    for(i=0; i<64; i += 8)
      p[i] = (p[i] + 4) * (q[i] * 5);
  }

  004011A0                    fn_004011A0:
  004011A0 53                     push    ebx
  004011A1 8BDC                   mov     ebx,esp
  004011A3 83EC08                 sub     esp,8
  004011A6 83E4F0                 and     esp,0FFFFFFF0h
  004011A9 83C408                 add     esp,8
  004011AC 83EC38                 sub     esp,38h
  004011AF 8B4B0C                 mov     ecx,[ebx+0Ch]
  004011B2 8B5308                 mov     edx,[ebx+8]
  004011B5 B8C0FFFFFF             mov     eax,0FFFFFFC0h
  004011BA 8DB600000000           lea     esi,[esi]
  004011C0                    loc_004011C0:
  004011C0 F30F10848200010000     movss   xmm0,[edx+eax*4+100h]
  004011C9 F30F108C8220010000     movss   xmm1,[edx+eax*4+120h]
  004011D2 F30F58050CB04000       addss   xmm0,[40B00Ch]        (040800000h)
  004011DA F30F59848100010000     mulss   xmm0,[ecx+eax*4+100h]
  004011E3 F30F590510B04000       mulss   xmm0,[40B010h]        (040A00000h)
  004011EB F30F11848200010000     movss   [edx+eax*4+100h],xmm0
  004011F4 F30F580D0CB04000       addss   xmm1,[40B00Ch]        (040800000h)
  004011FC F30F598C8120010000     mulss   xmm1,[ecx+eax*4+120h]
  00401205 F30F590D10B04000       mulss   xmm1,[40B010h]        (040A00000h)
  0040120D F30F118C8220010000     movss   [edx+eax*4+120h],xmm1
  00401216 83C010                 add     eax,10h
  00401219 75A5                   jnz     loc_004011C0
  0040121B 8BE3                   mov     esp,ebx
  0040121D 5B                     pop     ebx
  0040121E C3                     ret

  test3_array_float(float *p, float *q, int r)
  {
    int i;

    for(i=0; i<64; i++)
      p[i] = q[i] * r;
  }

  00401220                    fn_00401220:
  00401220 53                     push    ebx
  00401221 8BDC                   mov     ebx,esp
  00401223 83EC08                 sub     esp,8
  00401226 83E4F0                 and     esp,0FFFFFFF0h
  00401229 83C408                 add     esp,8
  0040122C 83EC38                 sub     esp,38h
  0040122F 0F57C0                 xorps   xmm0,xmm0
  00401232 8B4310                 mov     eax,[ebx+10h]
  00401235 8B530C                 mov     edx,[ebx+0Ch]
  00401238 0F28C8                 movaps  xmm1,xmm0
  0040123B F30F2AC8               cvtsi2ss xmm1,eax
  0040123F 8B4B08                 mov     ecx,[ebx+8]
  00401242 B8C0FFFFFF             mov     eax,0FFFFFFC0h
  00401247 8BF6                   mov     esi,esi
  00401249 8DBC2700000000         lea     edi,[edi]
  00401250                    loc_00401250:
  00401250 F30F10848200010000     movss   xmm0,[edx+eax*4+100h]
  00401259 F30F59C1               mulss   xmm0,xmm1
  0040125D F30F11848100010000     movss   [ecx+eax*4+100h],xmm0
  00401266 F30F10848204010000     movss   xmm0,[edx+eax*4+104h]
  0040126F F30F59C1               mulss   xmm0,xmm1
  00401273 F30F11848104010000     movss   [ecx+eax*4+104h],xmm0
  0040127C 83C002                 add     eax,2
  0040127F 75CF                   jnz     loc_00401250
  00401281 8BE3                   mov     esp,ebx
  00401283 5B                     pop     ebx
  00401284 C3                     ret

  test4_array_float(float *p, float *q)
  {
    int i;

    for(i=0; i<64; i++)
      p[i] = q[i] * p[63-i];

    for(i=0; i<64; i++)
      p[i] *= 8;
  }

  00401290                    fn_00401290:
  00401290 53                     push    ebx
  00401291 8BDC                   mov     ebx,esp
  00401293 83EC08                 sub     esp,8
  00401296 83E4F0                 and     esp,0FFFFFFF0h
  00401299 83C408                 add     esp,8
  0040129C 55                     push    ebp
  0040129D 8B6B08                 mov     ebp,[ebx+8]
  004012A0 83EC34                 sub     esp,34h
  004012A3 8B4B0C                 mov     ecx,[ebx+0Ch]
  004012A6 8BD5                   mov     edx,ebp
  004012A8 B8C0FFFFFF             mov     eax,0FFFFFFC0h
  004012AD 8D7600                 lea     esi,[esi]
  004012B0                    loc_004012B0:
  004012B0 F30F1082FC000000       movss   xmm0,[edx+0FCh]
  004012B8 F30F59848100010000     mulss   xmm0,[ecx+eax*4+100h]
  004012C1 F30F11848500010000     movss   [ebp+eax*4+100h],xmm0
  004012CA F30F1082F8000000       movss   xmm0,[edx+0F8h]
  004012D2 83C2F8                 add     edx,0FFFFFFF8h
  004012D5 F30F59848104010000     mulss   xmm0,[ecx+eax*4+104h]
  004012DE F30F11848504010000     movss   [ebp+eax*4+104h],xmm0
  004012E7 83C002                 add     eax,2
  004012EA 75C4                   jnz     loc_004012B0
  004012EC F30F100D14B04000       movss   xmm1,[40B014h]        (041000000h)
  004012F4 8BC5                   mov     eax,ebp
  004012F6 8D9500010000           lea     edx,[ebp+100h]
  004012FC 0FC6C900               shufps  xmm1,xmm1,0
  00401300 0F1000                 movups  xmm0,[eax]
  00401303 0F105010               movups  xmm2,[eax+10h]
  00401307 0F59C1                 mulps   xmm0,xmm1
  0040130A 0F59D1                 mulps   xmm2,xmm1
  0040130D 0F1100                 movups  [eax],xmm0
  00401310 0F115010               movups  [eax+10h],xmm2
  00401314 0F104020               movups  xmm0,[eax+20h]
  00401318 0F105030               movups  xmm2,[eax+30h]
  0040131C 0F59C1                 mulps   xmm0,xmm1
  0040131F 0F59D1                 mulps   xmm2,xmm1
  00401322 0F114020               movups  [eax+20h],xmm0
  00401326 0F115030               movups  [eax+30h],xmm2
  0040132A 0F104040               movups  xmm0,[eax+40h]
  0040132E 0F105050               movups  xmm2,[eax+50h]
  00401332 0F59C1                 mulps   xmm0,xmm1
  00401335 0F59D1                 mulps   xmm2,xmm1
  00401338 0F114040               movups  [eax+40h],xmm0
  0040133C 0F115050               movups  [eax+50h],xmm2
  00401340 0F104060               movups  xmm0,[eax+60h]
  00401344 0F105070               movups  xmm2,[eax+70h]
  00401348 0F59C1                 mulps   xmm0,xmm1
  0040134B 0F59D1                 mulps   xmm2,xmm1
  0040134E 0F114060               movups  [eax+60h],xmm0
  00401352 0F115070               movups  [eax+70h],xmm2
  00401356 0F108080000000         movups  xmm0,[eax+80h]
  0040135D 0F109090000000         movups  xmm2,[eax+90h]
  00401364 0F59C1                 mulps   xmm0,xmm1
  00401367 0F59D1                 mulps   xmm2,xmm1
  0040136A 0F118080000000         movups  [eax+80h],xmm0
  00401371 0F119090000000         movups  [eax+90h],xmm2
  00401378 0F1080A0000000         movups  xmm0,[eax+0A0h]
  0040137F 0F1090B0000000         movups  xmm2,[eax+0B0h]
  00401386 0F59C1                 mulps   xmm0,xmm1
  00401389 0F59D1                 mulps   xmm2,xmm1
  0040138C 0F1180A0000000         movups  [eax+0A0h],xmm0
  00401393 0F1190B0000000         movups  [eax+0B0h],xmm2
  0040139A 0F1080C0000000         movups  xmm0,[eax+0C0h]
  004013A1 0F1090D0000000         movups  xmm2,[eax+0D0h]
  004013A8 0F59C1                 mulps   xmm0,xmm1
  004013AB 0F59D1                 mulps   xmm2,xmm1
  004013AE 0F1180C0000000         movups  [eax+0C0h],xmm0
  004013B5 0F1190D0000000         movups  [eax+0D0h],xmm2
  004013BC 0F1080E0000000         movups  xmm0,[eax+0E0h]
  004013C3 0F1090F0000000         movups  xmm2,[eax+0F0h]
  004013CA 0F59C1                 mulps   xmm0,xmm1
  004013CD 0F59D1                 mulps   xmm2,xmm1
  004013D0 0F1180E0000000         movups  [eax+0E0h],xmm0
  004013D7 0F1190F0000000         movups  [eax+0F0h],xmm2
  004013DE 0500010000             add     eax,100h
  004013E3 83C434                 add     esp,34h
  004013E6 5D                     pop     ebp
  004013E7 8BE3                   mov     esp,ebx
  004013E9 5B                     pop     ebx
  004013EA C3                     ret
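The second loop of test4 is the only place above where the compiler used the packed forms: it loads 8.0 with movss, copies it into all four lanes with shufps xmm1,xmm1,0, and then multiplies four floats at a time with mulps. The sketch below emulates that in portable C so the lane selection is visible. The struct and function names are my own, and the shufps lane rule (two low lanes from the destination, two high lanes from the source) is stated as I understand the instruction.

```c
#include <stddef.h>

typedef struct { float f[4]; } xmm_t;   /* one 128-bit register as 4 floats */

/* shufps dst,src,imm8: each 2-bit field of imm8 selects a lane. */
xmm_t shufps(xmm_t dst, xmm_t src, unsigned imm8)
{
    xmm_t r;
    r.f[0] = dst.f[imm8 & 3];
    r.f[1] = dst.f[(imm8 >> 2) & 3];
    r.f[2] = src.f[(imm8 >> 4) & 3];
    r.f[3] = src.f[(imm8 >> 6) & 3];
    return r;
}

/* mulps: multiply all four lanes. */
xmm_t mulps(xmm_t a, xmm_t b)
{
    for (int i = 0; i < 4; i++)
        a.f[i] *= b.f[i];
    return a;
}

/* Hypothetical helper: scale n floats by k, four at a time.
 * Any tail of fewer than 4 elements is left untouched. */
void scale_by_broadcast(float *p, size_t n, float k)
{
    xmm_t kv = {{ k, 0.0f, 0.0f, 0.0f }};    /* movss xmm1,[const]   */
    kv = shufps(kv, kv, 0);                  /* shufps xmm1,xmm1,0   */

    for (size_t i = 0; i + 4 <= n; i += 4) {
        xmm_t v;
        for (int j = 0; j < 4; j++)          /* movups xmm0,[eax+..] */
            v.f[j] = p[i + j];
        v = mulps(v, kv);                    /* mulps  xmm0,xmm1     */
        for (int j = 0; j < 4; j++)          /* movups [eax+..],xmm0 */
            p[i + j] = v.f[j];
    }
}
```

Calling scale_by_broadcast(p, 64, 8.0f) does the same job as the sixteen movups/mulps/movups groups in the listing, which the compiler unrolled completely.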

Update 7:30am 21-Jan-99 - Just a brief update: I spent last evening writing a set of MASM macros for KNI. Download KNIMACRO.ZIP

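BTW SS denotes a scalar single-precision operation (1 x 32-bit FP), and PS a packed single-precision operation (4 x 32-bit FP). These are what the comments in my listings call "singular scalar" and "parallel scalar".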
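To illustrate the difference, here is a small portable C model of addps and addss acting on a register of four 32-bit floats. The type and function names are made up for this sketch. The one behaviour I'm assuming beyond the text is that the SS form leaves the upper three lanes of the destination alone.

```c
typedef struct { float lane[4]; } vec4;   /* model of one XMM register */

/* PS form: operate on all four lanes. */
vec4 addps(vec4 dst, vec4 src)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* SS form: operate on lane 0 only; lanes 1..3 of dst pass through. */
vec4 addss(vec4 dst, vec4 src)
{
    dst.lane[0] += src.lane[0];
    return dst;
}
```

With dst = {1, 2, 3, 4} and src = {10, 20, 30, 40}, addps gives {11, 22, 33, 44} while addss gives {11, 2, 3, 4}.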


Update 12:00am 6-Feb-99 - Here are a few pictures of a Pentium III.



Here are some more pictures..


Update 10:23pm 8-Feb-99 - It has been suggested that the CPU I have is not a Pentium III, but some fake, remark or pre-production sample. I'd like to report that Intel's production process appears to be working just fine, and with what looks like a production date in the fifth week of 1999 it's hot off the presses. In any case it contains a Step 3 Katmai die that runs reliably at 500 MHz and, more importantly, it runs KNI; I bought it to do that and it does. It is marked as requiring 2.0 V, and the BIOS indicates that it is being supplied with that voltage. It also seems to run fairly cool. The board I'm using is an ASUS P2B with the most current BIOS 12/23/98 (rel 01/08/99).

Running DOS with a DOS extender, I need to set bit 9 of CR4 to enable KNI & FXSR. Bit 10 of CR4 is also present, but its exact function is unknown at this time; my best guess is that it relates to a protection fault of some description. The CPU feature flags have reserved bit 18 set, perhaps to indicate that the serial or random number generator is present. Bit 25 is also set, which indicates that KNI is present. NT 5 Beta 2 enables the use of KNI, and uses FXSAVE/FXRSTOR to do context switches. There is an additional CPUID descriptor (eax = 3), which I presume is serialization information.

The reciprocal functions provided by KNI [rcp_s (1/x) & rsqrt_s (1/sqrt(x))] appear to have an accuracy of about 12 bits. I have no timing details yet, but suspect that a fast ROM table lookup is being used to approximate. I'm attempting to benchmark KNI vs FPU performance, but I haven't found a suitable benchmark yet. With double precision floating point (64-bit), it looks like the Katmai FPU functions the same as the Deschutes FPU.

CPUID Test

eax->  eax      ebx      ecx      edx

 0 : 00000003 756E6547 6C65746E 49656E69 GenuineIntel

 1 : 00000672 00000000 00000000 0387F9FF Family 6, Model 7, Step 2

 2 : 03020101 00000000 00000000 0C040843
      01 descriptor pages
      01 code TLB, 4K pages, 4 ways, 32 entries
      02 code TLB, 4M pages, fully, 2 entries
      03 data TLB, 4K pages, 4 ways, 64 entries
      43 code and data L2 cache, 512KB, 4 ways, 32 byte lines
      08 code L1 cache, 16KB, 2 ways, 32 byte lines
      04 data TLB, 4M pages, 4 ways, 8 entries
      0C data L1 cache, 16KB, 4 ways, 32 byte lines

 3 : 00000000 00000000 2FC60DF3 0002CCE6 Serial# ?
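
For anyone who wants to pick the raw registers apart by hand, the following Python sketch decodes the values printed above. The function names are mine. The layouts assumed are the usual ones: vendor text in EBX, EDX, ECX order; stepping, model and family in the low three nibbles of EAX; and one descriptor per byte in function 2, with the low byte of EAX acting as a repeat count.

```python
import struct

def vendor_string(ebx, ecx, edx):
    """Function 0: the 12-character vendor ID is spelled EBX, EDX, ECX."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")

def signature(eax):
    """Function 1 EAX: stepping bits 0-3, model bits 4-7, family bits 8-11."""
    return {"family": (eax >> 8) & 0xF,
            "model": (eax >> 4) & 0xF,
            "stepping": eax & 0xF}

def descriptor_bytes(eax, ebx, ecx, edx):
    """Function 2: collect the non-zero descriptor bytes in register order."""
    out = []
    for i, reg in enumerate((eax, ebx, ecx, edx)):
        if reg & 0x80000000:          # top bit set: no valid descriptors here
            continue
        for shift in (0, 8, 16, 24):
            if i == 0 and shift == 0:
                continue              # low byte of EAX is the repeat count
            b = (reg >> shift) & 0xFF
            if b:
                out.append(b)
    return out

print(vendor_string(0x756E6547, 0x6C65746E, 0x49656E69))
print(signature(0x00000672))
print([f"{b:02X}" for b in descriptor_bytes(0x03020101, 0, 0, 0x0C040843)])
```

The descriptor bytes come out in the same order as the listing above (01, 02, 03, 43, 08, 04, 0C); what each byte means still has to be looked up in a table.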

CPU Information

511.42 MHz, Pentium III (Katmai)

#1    = 0387F9FF
FPU     Floating point unit on-chip
VME     Virtual Mode Extensions
DE      Debugging Extensions
PSE     Page Size Extension
TSC     Time Stamp Counter
MSR     Model Specific Registers
PAE     Physical Address Extension
MCE     Machine Check Exception
CX8     CMPXCHG8 instruction supported
SEP     Fast System Call
MTRR    Memory Type Range Registers
PGE     Page Global Enable
MCA     Machine Check Architecture
CMOV    Conditional Move instructions supported
PAT     Page Attribute Table
PSE-36  Page Size Extension 36-bit
18      reserved
MMX     Intel Architecture MMX technology supported
FXSR    Fast floating point save and restore
KNI     Intel Architecture KNI technology supported
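
The feature list above is the set bits of 0387F9FF, lowest bit first. A minimal Python sketch of that decode follows. The table holds only the names shown in the listing, and the bit positions are inferred by matching that listing against the value, so treat it as a cross-check rather than a complete feature map.

```python
# Bit position -> short name, limited to the flags reported above.
FEATURE_NAMES = {
    0: "FPU", 1: "VME", 2: "DE", 3: "PSE", 4: "TSC", 5: "MSR",
    6: "PAE", 7: "MCE", 8: "CX8", 11: "SEP", 12: "MTRR", 13: "PGE",
    14: "MCA", 15: "CMOV", 16: "PAT", 17: "PSE-36", 18: "18",
    23: "MMX", 24: "FXSR", 25: "KNI",
}

def decode_features(edx):
    """Return names of the set bits, lowest bit first; unknown bits as 'bit N'."""
    return [FEATURE_NAMES.get(bit, f"bit {bit}")
            for bit in range(32) if edx & (1 << bit)]

print(decode_features(0x0387F9FF))
```

Any set bit that is not in the table comes back as "bit N" instead of being silently dropped.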

Memory Timings

        L1      519% wrt RAM
        L2      345% wrt RAM
        RAM     100%

Update 9:59pm 10-Feb-99 - Let's look at a few benchmarks today.

I haven't got any answers on KNI yet, but the basic P6 core doesn't seem to have changed a whole heck of a lot. For the purpose of comparison I have also knocked down the bus speed to 66 MHz, so that you can evaluate the performance against a PII Deschutes (step 1) 333 MHz. It should be noted that the Deschutes is on a FIC VL601 (Intel LX) motherboard and the Katmai is on the ASUS P2B (Intel BX). Ideally I should have put the Deschutes in the P2B, but I'm not about to waste an evening ripping my computers apart to do some benchmarking. Undoubtedly when Tom and Anand are ungagged they'll have plenty of stats from identical hardware. I have no idea how the NDAs are worded, but I imagine if there's enough publicly disseminated information on the performance of the core (sans KNI/SSE) they lose a lot of their teeth.

I don't dislike Intel or their products; however, I have a very low tolerance for marketing hype and BS. It serves no useful purpose and just confuses consumers. For those who want to boycott Intel because of the serial number in the PIII, I suggest you take a VERY close look at just how many things in your computer already have electronically readable serial numbers, be it modems, CMOS RTCs, laptop batteries, NICs, etc. Anything that has a Flash ROM or serial EEPROM (a small 8 pin package marked 94C36 or something similar) is in all likelihood going to have a machine readable serial number, along with dates and production information.

Please don't ask me to run any ZD junk or give you frame rates, that's high level stuff. I'm working at the border where hardware meets software and I'm going to stay down there.

MFLOPS was run as a Win32 console application under NT5 Beta 2.
LoopTime (a tool I wrote) was run under plain DOS without HIMEM or EMM386.

Katmai 500 MHz (5x100 MHz)

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1    -7.6739e-013      0.0720    194.4972
     2    -5.7021e-013      0.0520    134.5345
     3    -2.4314e-014      0.0799    212.8221
     4     6.8501e-014      0.0773    194.1453
     5    -1.6320e-014      0.1760    164.7398
     6     1.3961e-013      0.1280    226.5003
     7    -3.6209e-011      0.1996     60.1257
     8     9.0483e-015      0.1474    203.4707

   Iterations      =  256000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =   152.9254
   MFLOPS(2)       =   114.1553
   MFLOPS(3)       =   165.8686
   MFLOPS(4)       =   210.3476

Katmai 333 MHz (5x66 MHz)

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1    -7.6739e-013      0.1080    129.6671
     2    -5.7021e-013      0.0781     89.6448
     3    -2.4314e-014      0.1199    141.7867
     4     6.8501e-014      0.1159    129.4586
     5    -1.6320e-014      0.2642    109.7608
     6     1.3961e-013      0.1921    150.9557
     7    -3.6209e-011      0.2996     40.0516
     8     9.0483e-015      0.2213    135.5430

   Iterations      =  256000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =   101.8952
   MFLOPS(2)       =    76.0567
   MFLOPS(3)       =   110.5223
   MFLOPS(4)       =   140.1710

Deschutes 333 MHz (5x66 MHz)

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1    -7.6739e-013      0.1087    128.7819
     2    -5.7021e-013      0.0785     89.1188
     3    -2.4314e-014      0.1204    141.2345
     4     6.8501e-014      0.1159    129.4542
     5    -1.6320e-014      0.2649    109.4549
     6     1.3961e-013      0.1912    151.6929
     7    -3.6209e-011      0.3009     39.8749
     8     9.0483e-015      0.2210    135.7586

   Iterations      =  256000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =   101.3445
   MFLOPS(2)       =    75.8165
   MFLOPS(3)       =   110.3556
   MFLOPS(4)       =   140.3466
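
One quick way to read the three MFLOPS runs is to divide them against each other. The Python sketch below uses only the four MFLOPS(n) summary figures from each run; the variable names are mine, and the figures are copied by hand from the output above.

```python
# MFLOPS(1)..MFLOPS(4) summary rows copied from the three runs above.
katmai_500 = [152.9254, 114.1553, 165.8686, 210.3476]
katmai_333 = [101.8952, 76.0567, 110.5223, 140.1710]
deschutes_333 = [101.3445, 75.8165, 110.3556, 140.3466]

def ratios(a, b):
    """Element-wise a/b for two equal-length lists of MFLOPS figures."""
    return [x / y for x, y in zip(a, b)]

same_clock = ratios(katmai_333, deschutes_333)   # both cores at 5x66 MHz
clock_scaling = ratios(katmai_500, katmai_333)   # same core, 100 vs 66 MHz bus

for name, r in (("Katmai/Deschutes @ 333", same_clock),
                ("Katmai 500/333", clock_scaling)):
    print(name, " ".join(f"{v:.3f}" for v in r))
```

The same-clock ratios all land within 1% of 1.00 and the clock-scaling ratios within 1% of 1.50, which fits the view that for x87 double precision work the core behaves like a Deschutes and simply scales with the clock.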

LoopTime (980204)

Instruction             Cycles   Speed          CPU

NOP                        0.4   0.001 us       Pentium III (Katmai)
CLC                        0.9   0.002 us       510.28 MHz
XCHG    DX,DX              1.9   0.004 us
MOV     DX,DX              0.7   0.001 us
JMP     $+2                5.4   0.011 us
LOOP    $+2               10.1   0.020 us
DEC CX & JNZ $+2           3.4   0.007 us
IN      AL,DX (0x025C)   712.0   1.395 us        716653.03 Bps
OUT     DX,AL            712.0   1.395 us        716712.62 Bps
IN      AX,DX           1099.3   2.154 us        928405.03 Bps
OUT     DX,AX           1099.2   2.154 us        928430.03 Bps
IN      EAX,DX          1924.0   3.771 us       1060850.68 Bps
OUT     DX,EAX          1924.0   3.771 us       1060850.68 Bps
REP     INSB             699.4   1.371 us        729542.48 Bps
REP     OUTSB            699.4   1.371 us        729573.35 Bps
REP     INSW            1099.3   2.154 us        928380.03 Bps
REP     OUTSW           1099.2   2.154 us        928430.03 Bps
REP     INSD            1863.9   3.653 us       1095049.39 Bps
REP     OUTSD           1863.9   3.653 us       1095066.78 Bps
INC     DX                 0.7   0.001 us
ADD     DX,2               0.7   0.001 us
ADD     DL,DH              0.7   0.001 us
ADD     DX,DX              0.8   0.002 us
SHL     DX,1               0.6   0.001 us
PUSH AX & POP AX           1.9   0.004 us
XCHG    DX,[DI]           20.0   0.039 us       (Memory 19 ns)
MOV     DX,[DI]            0.7   0.001 us
MOV     [DI],DX            1.2   0.002 us

Instruction             Cycles   Speed          CPU

NOP                        0.4   0.001 us       Pentium III (Katmai)
CLC                        0.9   0.003 us       336.32 MHz
XCHG    DX,DX              1.9   0.006 us
MOV     DX,DX              0.7   0.002 us
JMP     $+2                5.4   0.016 us
LOOP    $+2               10.0   0.030 us
DEC CX & JNZ $+2           3.4   0.010 us
IN      AL,DX (0x025C)   515.9   1.534 us        651958.58 Bps
OUT     DX,AL            515.8   1.534 us        652032.55 Bps
IN      AX,DX            765.4   2.276 us        878760.74 Bps
OUT     DX,AX            765.5   2.276 us        878715.94 Bps
IN      EAX,DX          1268.0   3.770 us       1060964.93 Bps
OUT     DX,EAX          1268.0   3.770 us       1060932.29 Bps
REP     INSB             506.9   1.507 us        663424.62 Bps
REP     OUTSB            506.9   1.507 us        663424.62 Bps
REP     INSW             749.1   2.227 us        897892.59 Bps
REP     OUTSW            750.1   2.230 us        896748.27 Bps
REP     INSD            1268.0   3.770 us       1060981.26 Bps
REP     OUTSD           1268.0   3.770 us       1060932.29 Bps
INC     DX                 0.7   0.002 us
ADD     DX,2               0.7   0.002 us
ADD     DL,DH              0.7   0.002 us
ADD     DX,DX              0.7   0.002 us
SHL     DX,1               0.6   0.002 us
PUSH AX & POP AX           1.9   0.006 us
XCHG    DX,[DI]           19.7   0.059 us       (Memory 29 ns)
MOV     DX,[DI]            0.7   0.002 us
MOV     [DI],DX            1.2   0.003 us

Instruction             Cycles   Speed          CPU

NOP                        0.4   0.001 us       Pentium II (Deschutes)
CLC                        0.9   0.003 us       334.52 MHz
XCHG    DX,DX              1.9   0.006 us
MOV     DX,DX              0.7   0.002 us
JMP     $+2                5.4   0.016 us
LOOP    $+2               10.0   0.030 us
DEC CX & JNZ $+2           3.4   0.010 us
IN      AL,DX (0x025C)   458.4   1.370 us        729781.85 Bps
OUT     DX,AL            472.7   1.413 us        707656.77 Bps
IN      AX,DX            698.1   2.087 us        958423.94 Bps
OUT     DX,AX            720.5   2.154 us        928538.35 Bps
IN      EAX,DX          1192.6   3.565 us       1121992.92 Bps
OUT     DX,EAX          1212.0   3.623 us       1104009.94 Bps
REP     INSB             430.8   1.288 us        776503.57 Bps
REP     OUTSB            458.4   1.370 us        729781.85 Bps
REP     INSW             679.7   2.032 us        984369.96 Bps
REP     OUTSW            719.4   2.151 us        929959.35 Bps
REP     INSD            1177.1   3.519 us       1136738.20 Bps
REP     OUTSD           1200.6   3.589 us       1114546.22 Bps
INC     DX                 0.7   0.002 us
ADD     DX,2               0.7   0.002 us
ADD     DL,DH              0.7   0.002 us
ADD     DX,DX              0.7   0.002 us
SHL     DX,1               0.6   0.002 us
PUSH AX & POP AX           1.9   0.006 us
XCHG    DX,[DI]           19.6   0.059 us       (Memory 29 ns)
MOV     DX,[DI]            0.7   0.002 us
MOV     [DI],DX            1.2   0.004 us
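
The Speed and Bps columns follow from the Cycles column and the measured clock. The Python sketch below shows the arithmetic as I read it from the listings: time is cycles divided by MHz, and throughput is the transfer width divided by that time. The helper names are mine, and small differences from the printed figures are to be expected because the Cycles column is rounded to one decimal place.

```python
def cycles_to_us(cycles, mhz):
    """A clock of N MHz ticks N times per microsecond, so time = cycles / N."""
    return cycles / mhz

def bytes_per_second(width_bytes, cycles, mhz):
    """Throughput of one width_bytes-wide transfer taking `cycles` clocks."""
    return width_bytes / (cycles_to_us(cycles, mhz) * 1e-6)

# IN AL,DX / IN AX,DX / IN EAX,DX on the Katmai at 510.28 MHz
for width, cycles in ((1, 712.0), (2, 1099.3), (4, 1924.0)):
    us = cycles_to_us(cycles, 510.28)
    print(f"{width} byte(s): {us:.3f} us, {bytes_per_second(width, cycles, 510.28):.0f} Bps")
```

This also shows why the port I/O cycle counts go up at 500 MHz while the microsecond figures barely move: the port access takes roughly the same wall-clock time whatever the core clock, so a faster core just burns more cycles waiting.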

Update 11:15pm 11-Feb-99 - Here are a few interim results; I'm having a little trouble with the compiler. I have been unable to get a KNI version of MFLOPS using single precision to work properly, so I'll take a slightly different tack. I have opted to stick with doing a few primitive operations on simple arrays. Division is one of the more complex floating-point operations, and KNI appears to provide roughly a 3.5x increase in throughput (31.4 cycles down to 9.0). It should be noted that 3DNow! doesn't offer a division operation; instead you take a reciprocal of the divisor and then multiply. The KNI reciprocal functions have only 12-bit accuracy, so their use in this fashion is probably inadvisable.

Optimized using instructions for   P5        P6        KNI

Cycles per addition              3.4688    3.9375    2.3594
Cycles per multiplication        3.9141    3.9063    2.4141
Cycles per division             31.3828   31.4375    9.0469
Cycles per square-root          82.3047   82.8203   31.0703 *

* The compiler didn't vectorize this as fully as it could have
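
To put numbers on the table, the Python sketch below divides the P6 column by the KNI column. Using P6 rather than P5 as the baseline is my choice; the two x87 columns are close enough that it makes little difference.

```python
# Cycles per operation from the table above: (P5, P6, KNI).
cycles = {
    "addition":       (3.4688, 3.9375, 2.3594),
    "multiplication": (3.9141, 3.9063, 2.4141),
    "division":       (31.3828, 31.4375, 9.0469),
    "square-root":    (82.3047, 82.8203, 31.0703),
}

# Speedup of the KNI build over the P6-optimized build.
speedup = {op: p6 / kni for op, (p5, p6, kni) in cycles.items()}

for op, s in speedup.items():
    print(f"{op:15s} {s:.2f}x")
```

Division gains the most, at roughly 3.5x, with square-root around 2.7x and add and multiply around 1.6 to 1.7x.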

Update 7:49pm 17-Feb-99 - Well, I've done a few more timings of the instructions themselves using the Time Stamp Counter (TSC). These timings are a little rough, because it's hard to account for caching and pipelining issues, but they give an insight into where the speed of KNI may come from.

Cycles for some instructions (doing 4 FP operations at once)

RCPPS   2 (Very Fast, Lower Accuracy (12-bit), using ROM table?)
RSQRTPS 2   ditto
DIVPS   24
SQRTPS  56
MULPS   2
ADDPS   2
MINPS   2
ANDPS   2 (128-bit bitwise operation, not any good for FP numbers)
CMPEQPS 2 (4 x 32-bit TRUE (0xFFFFFFFF) or FALSE (0x00000000) answers)
MOVHLPS 1
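
Given how cheap the reciprocal looks next to DIVPS, the obvious trick is to start from the low-accuracy reciprocal and sharpen it with one Newton-Raphson step. The Python sketch below shows the arithmetic only. approx_recip is a stand-in that rounds the true reciprocal to 12 mantissa bits; it is an assumption for illustration and not how RCPPS actually behaves. All function names are mine.

```python
import math

def approx_recip(d, bits=12):
    """Stand-in for a low-accuracy reciprocal: 1/d rounded to `bits` mantissa bits."""
    m, e = math.frexp(1.0 / d)        # 1/d = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)

def refine(x, d):
    """One Newton-Raphson step toward 1/d: x' = x * (2 - d * x)."""
    return x * (2.0 - d * x)

def accuracy_bits(approx, exact):
    """Number of correct leading bits, from the relative error."""
    err = abs(approx - exact) / abs(exact)
    return float("inf") if err == 0 else -math.log2(err)

d = 3.0
x0 = approx_recip(d)
x1 = refine(x0, d)
print(f"seed:    {accuracy_bits(x0, 1.0 / d):.1f} bits")
print(f"refined: {accuracy_bits(x1, 1.0 / d):.1f} bits")
print(f"10/3 ~= {10.0 * x1:.7f}")
```

Each step roughly doubles the number of good bits, at the cost of two multiplies and a subtract on top of the reciprocal. Whether that beats DIVPS in practice depends on latencies and pipelining that the rough cycle counts above don't capture.
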

In other interesting news, last Thursday ASUS posted a new BIOS (1008) for the P2B. It now reports that I have a Pentium III and also offers an option to disable the Serial Number. Bit 18 of the Features Register (CPUID#1) indicates the presence of the Serial Number, which is returned using CPUID#3.

Here is my KNI rough guide..

Written by Clive Turvey clive@tbcnet.com

Trademarks are the property of their respective owners. No warranty expressed or implied. No deposit. No return. Copyright (C) C Turvey 1998-1999.