|Opcode/Instruction||Op/En||64/32-bit Mode||CPUID Feature Flag||Description|
|66 0F 38 2A /r MOVNTDQA xmm1, m128||RM||V/V||SSE4_1||Move double quadword from m128 to xmm using non-temporal hint if WC memory type.|
|VEX.128.66.0F38.WIG 2A /r VMOVNTDQA xmm1, m128||RM||V/V||AVX||Move double quadword from m128 to xmm using non-temporal hint if WC memory type.|
|VEX.256.66.0F38.WIG 2A /r VMOVNTDQA ymm1, m256||RM||V/V||AVX2||Move 256-bit data from m256 to ymm using non-temporal hint if WC memory type.|
|Op/En||Operand 1||Operand 2||Operand 3||Operand 4|
|RM||ModRM:reg (w)||ModRM:r/m (r)||NA||NA|
(V)MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint. A processor implementation may make use of the non-temporal hint associated with this instruction if the memory source is WC (write combining) memory type. An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.
A processor’s implementation of the non-temporal hint does not override the effective memory type semantics, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type. Another implementation of the hint for WC memory type may optimize data transfer throughput of WC reads. A third implementation may optimize cache reads generated by (V)MOVNTDQA on WB memory type to reduce cache evictions.
WC Streaming Load Hint
For WC memory type in particular, the processor never appears to read the data into the cache hierarchy. Instead, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available. The temporary internal buffer may be flushed by the processor at any time for any reason, for example:
The memory type of the region being read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC memory region. Information on non-temporal reads and writes can be found in Chapter 11, “Memory Cache Control” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
Because the WC protocol uses a weakly-ordered memory consistency model, an MFENCE or locked instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might reference the same WC memory locations or in order to synchronize reads of a processor with writes by other agents in the system. Because of the speculative nature of fetching due to MOVNTDQA, streaming loads must not be used to reference memory addresses that are mapped to I/O devices having side effects or when reads to these devices are destructive. For additional information on MOVNTDQA usages, see Section 12.10.3 in Chapter 12, “Programming with SSE3, SSSE3 and SSE4” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.
The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP.
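The alignment requirements above can be checked in software before a pointer is handed to a streaming load. The following is a minimal sketch, not part of the instruction definition: `is_aligned` and `alloc_stream_source` are hypothetical helper names, and the allocation assumes a C11 library that provides `aligned_alloc`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p is a multiple of `alignment` bytes, 0 otherwise.
 * `alignment` must be a power of two (16 for the 128-bit forms,
 * 32 for the 256-bit form). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: allocate a 32-byte-aligned buffer, which satisfies
 * both the 16-byte and the 32-byte requirement. The size is rounded up to
 * a multiple of 32 because aligned_alloc expects the size to be a multiple
 * of the alignment. The caller releases the buffer with free(). */
void *alloc_stream_source(size_t bytes)
{
    size_t rounded = (bytes + 31u) & ~(size_t)31u;
    return aligned_alloc(32, rounded);
}
```

A caller would typically test `is_aligned(p, 16)` or `is_aligned(p, 32)` and fall back to an unaligned load path when the check fails, since the instruction itself raises #GP rather than handling the misaligned case.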
Note: In VEX-128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.
MOVNTDQA (128-bit Legacy SSE form)
DEST ← SRC
DEST[VLMAX-1:128] (Unmodified)
VMOVNTDQA (VEX.128 encoded form)
DEST ← SRC
DEST[VLMAX-1:128] ← 0
VMOVNTDQA (VEX.256 encoded form)
DEST[255:0] ← SRC[255:0]
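The difference between the three forms lies only in what happens to the destination bits above the loaded data. The sketch below models that behavior in portable C on a byte array. It is an illustration only: it assumes VLMAX is 256 bits, `vreg_t` and the `model_*` function names are hypothetical, and byte 0 of the array stands for bits 7:0 of the register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VLMAX_BYTES = 32 };   /* assumption: VLMAX = 256 bits */

/* Software stand-in for a vector register; b[0] holds bits 7:0. */
typedef struct { uint8_t b[VLMAX_BYTES]; } vreg_t;

/* MOVNTDQA (legacy SSE): low 128 bits loaded, upper bits unmodified. */
void model_movntdqa_sse(vreg_t *dest, const uint8_t *src)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest->b, src, 16);
}

/* VMOVNTDQA (VEX.128): low 128 bits loaded, upper bits zeroed. */
void model_vmovntdqa_128(vreg_t *dest, const uint8_t *src)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest->b, src, 16);
    memset(dest->b + 16, 0, VLMAX_BYTES - 16);
}

/* VMOVNTDQA (VEX.256): all 256 bits loaded from the source. */
void model_vmovntdqa_256(vreg_t *dest, const uint8_t *src)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest->b, src, 32);
}
```

The model ignores the non-temporal hint entirely, which is consistent with the text above: the hint affects how data moves through the memory hierarchy, not the architectural result in the register.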
__m128i _mm_stream_load_si128 (__m128i *p);
__m256i _mm256_stream_load_si256 (const __m256i *p);
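As a usage illustration for the 128-bit intrinsic, the sketch below copies a buffer with streaming loads. It is one possible pattern under stated assumptions, not a prescribed one: `stream_copy_128` is a hypothetical name, the `target("sse4.1")` attribute is GCC/Clang syntax (other compilers need their own way of enabling SSE4.1), and the `_mm_mfence()` placed before the loads reflects the ordering advice given earlier for WC memory that other agents may have written. On ordinary WB memory a processor may ignore the hint and treat each load like a normal aligned load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Copy `bytes` bytes from src to dst using MOVNTDQA streaming loads.
 * Preconditions (checked by assert): both pointers are 16-byte aligned
 * and `bytes` is a multiple of 16. */
__attribute__((target("sse4.1")))
void stream_copy_128(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(((uintptr_t)src & 15u) == 0);
    assert(bytes % 16 == 0);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* Fence before the streaming loads so that earlier memory operations
     * are ordered ahead of them. */
    _mm_mfence();

    for (size_t i = 0; i < bytes / 16; ++i) {
        /* The cast drops const because the documented prototype takes a
         * non-const __m128i pointer; the load does not write to memory. */
        __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
        _mm_store_si128(&d[i], v);
    }
}
```

Reading a whole aligned cache line with consecutive streaming loads, as this loop does, matches the access pattern the WC streaming-load description is written around, since later loads to unread portions of the same line may be served from the temporary internal buffer.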
See Exceptions Type 1.SSE4.1; additionally
#UD If VEX.L = 1.
#UD If VEX.vvvv ≠ 1111B.