Recent Intel® AVX Architectural Changes

Dear Intel® AVX developers,

We recently made some significant changes to the Intel® Advanced Vector Extensions Programmer’s Reference Manual (please download the latest version at /sites/avx/). If you are writing tools or software based on AVX, this may impact you. The big changes are a very different FMA syntax and the removal of two instructions (4-operand permutes).

The initial Intel® AVX spec was released in April 2008, and I want to recap the changes we’ve made in the two Programmer’s Reference Manual updates since then.

Added: VEX forms of AES instructions
The AES instructions we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in the following cores (codename “Sandy Bridge”). The VEX prefix brings a distinct destination register to 4 of the 5 AES instructions (VAESDEC, VAESDECLAST, VAESENC, and VAESENCLAST; AESIMC already had a distinct destination). Only the 128-bit forms of the instructions are supported; the instructions are not promoted to 256-bit. As with the AES instruction set, VAES instructions may not be enabled by all products in all geographies. Detecting the VAES instructions in hardware therefore requires checking that two CPUID flags are both set (CPUID.AES AND CPUID.AVX).
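As a rough illustration of the two-flag check, here is a minimal C sketch. The helper name is hypothetical, and the bit positions (ECX bit 25 for AES, bit 28 for AVX, as returned by CPUID with EAX = 1) come from the general CPUID documentation rather than from this post. It only tests the two hardware flags; the usual check that the operating system has enabled AVX state is assumed to happen elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature flags reported in ECX by CPUID leaf 1 (EAX = 1).
 * Bit positions are taken from the CPUID documentation, not from this post. */
#define CPUID_ECX_AES (UINT32_C(1) << 25)
#define CPUID_ECX_AVX (UINT32_C(1) << 28)

/* Hypothetical helper: returns true only when BOTH flags are set in the
 * ECX value the caller obtained from CPUID leaf 1. */
bool vaes_flags_present(uint32_t cpuid1_ecx)
{
    const uint32_t required = CPUID_ECX_AES | CPUID_ECX_AVX;
    return (cpuid1_ecx & required) == required;
}
```

The function takes the raw ECX value instead of executing CPUID itself, so it stays independent of any particular compiler's CPUID intrinsic.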

Added: 256-bit forms of streaming stores
Astute readers noticed that the streaming store instructions MOVNTDQ, MOVNTPS, and MOVNTPD had originally been supported only in 128-bit forms. We now have them in 256-bit forms in our Sandy Bridge cores. It’s not clear that they will be any faster than the 128-bit forms on Sandy Bridge, but we encourage their use for future performance. Note that Streaming Load (VMOVNTDQA) is still 128-bit only; yes, that’s intentional.

Removed: VPERMIL2PS and VPERMIL2PD
All PERMIL2 instructions are gone, both the 128-bit and 256-bit flavors. Like the old FMA forms described below, they used the VEX.W bit to select which source came from memory, and we’re no longer moving in the direction of using VEX.W for that purpose.

Changed: All FMA instructions
We previously defined 4-operand FMAs with 3 sources and a separate destination. We now have 3 operands and still 3 sources, so one of the sources gets destroyed (this makes the FMA instructions unique in AVX). For each of the old forms, we now have 3 new instructions, using 132, 213, and 231 designations. The VEX.W bit no longer selects the source that comes from memory; instead, it selects the floating point type (single or double precision). Finally, the scalar instructions preserve the upper bits of the destination (up to bit 127) instead of zeroing them. These instructions are (still) not in Sandy Bridge; we are planning to ship them in a subsequent processor.

Example: we previously had

VFMADDSS xmm1, xmm2, xmm3, m32, which was xmm1 = xmm2*xmm3 + m32

(and the upper bits of the XMM register - from bit 32 to 127 - were zeroed).

NOW we have three forms:

VFMADD132SS xmm1, xmm2, m32, which is xmm1 = xmm1*m32 + xmm2
VFMADD213SS xmm1, xmm2, m32, which is xmm1 = xmm2*xmm1 + m32
VFMADD231SS xmm1, xmm2, m32, which is xmm1 = xmm2*m32 + xmm1

(and note that now the upper bits of the XMM register – from bit 32 to 127 - are unchanged).

The numbers in the instruction mnemonics come from the order of the operands in the expression: VFMADD132 is 1*3+2, VFMADD213 is 2*1+3, and VFMADD231 is 2*3+1.
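To make the operand ordering concrete, here is a small C model of the three scalar forms. This is only a sketch: the xmm_t type and the function names are made up for illustration, and it models just the operand order and the preserved upper lanes. It uses an ordinary multiply and add, so the single rounding of a real fused operation is not modeled.

```c
/* Illustrative model of an XMM register: four single-precision lanes,
 * with lane[0] holding bits 0..31. The type is an assumption of this sketch. */
typedef struct {
    float lane[4];
} xmm_t;

/* Each function overwrites only lane 0 of the first operand (the destination)
 * and leaves lanes 1..3 (bits 32..127) unchanged, as in the new definition. */

/* 132: operand1 * operand3 + operand2 */
void vfmadd132ss(xmm_t *xmm1, const xmm_t *xmm2, float m32)
{
    xmm1->lane[0] = xmm1->lane[0] * m32 + xmm2->lane[0];
}

/* 213: operand2 * operand1 + operand3 */
void vfmadd213ss(xmm_t *xmm1, const xmm_t *xmm2, float m32)
{
    xmm1->lane[0] = xmm2->lane[0] * xmm1->lane[0] + m32;
}

/* 231: operand2 * operand3 + operand1 */
void vfmadd231ss(xmm_t *xmm1, const xmm_t *xmm2, float m32)
{
    xmm1->lane[0] = xmm2->lane[0] * m32 + xmm1->lane[0];
}
```

With xmm1 lane 0 = 2, xmm2 lane 0 = 3, and m32 = 5, the three forms leave 13, 11, and 17 in lane 0 of xmm1 respectively, which shows how the choice of form decides which source is overwritten.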

The three forms allow you to avoid having to do extra copies or loads – most of the time. The primary exception is in code where you really did need to re-use all the sources, as in butterflies:

y0 = x0*c0 + x1
y1 = x0*c0 - x1

which generally incurs one copy for every 2 FMAs. So far, this doesn’t appear to be much of a performance hit.
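Here is one way the butterfly might be arranged with destructive 3-operand forms, sketched in C. The helper names are hypothetical, and the sketch assumes a multiply-subtract form that follows the same 231 numbering as the add form. Because each operation overwrites its first operand, x1 has to be copied once before the pair of operations.

```c
/* Destructive 231-style helpers: the first operand is both a source and
 * the destination. Names are illustrative, not real intrinsics. */
void fmadd231(float *dst, float a, float b)
{
    *dst = a * b + *dst;   /* dst = a*b + dst */
}

void fmsub231(float *dst, float a, float b)
{
    *dst = a * b - *dst;   /* dst = a*b - dst (assumed subtract form) */
}

/* Butterfly: y0 = x0*c0 + x1 and y1 = x0*c0 - x1.
 * x1 is needed by both operations, but each one destroys its destination,
 * so one copy is made up front: one copy for every 2 FMAs. */
void butterfly(float x0, float c0, float x1, float *y0_out, float *y1_out)
{
    float x1_copy = x1;          /* the single extra register copy */
    fmadd231(&x1_copy, x0, c0);  /* x1_copy = x0*c0 + x1 */
    fmsub231(&x1, x0, c0);       /* x1      = x0*c0 - x1 */
    *y0_out = x1_copy;
    *y1_out = x1;
}
```

For x0 = 3, c0 = 2, and x1 = 1, this yields y0 = 7 and y1 = 5; x0 and c0 are never overwritten, so only x1 needs the copy.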

Added: VEX forms of the PCLMULQDQ instruction
The PCLMULQDQ instruction we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in our subsequent cores (codename “Sandy Bridge”). The VEX form brings a distinct destination register. Only the 128-bit form of the instruction is supported; it is not promoted to 256-bit. As with the VAES instructions, VPCLMULQDQ will require the careful programmer to check that two CPUID flags are set (in this case CPUID.PCLMULQDQ AND CPUID.AVX).

A number of miscellanea
We clarified the alignment-check exception (#AC) behavior for MASKMOV instructions (by the way, does anyone actually use #AC?).
We clarified the exception type for the packed shift instructions (PSLL, PSRL, PSRA).
The Encoding rule table (4-3) is clarified to reflect PEXTRW.

We hope these changes are not too disruptive to you and thank you for your support of our ongoing early disclosure policy. We have updated the Intel® Software Development Emulator with all of these changes. If you have any questions or concerns about the impact of these changes to your application, or would like more detail on any of these changes, I encourage you to start a thread at /en-us/forums/intel-avx-and-cpu-instructions, or contact me directly.

Regards,
Mark Buxton
Software Engineer
Intel Corporation
Mark.J.Buxton@intel.com