Partial Register Stall Warning

Partial_Reg is a warning issued during static analysis. The VTune™ Performance Analyzer detects partial register stalls within a basic block. It also detects partial register stalls between basic blocks, when a write to a partial register in one block is followed by a jump to a block that reads from the corresponding large register.

The instruction for which Partial_Reg is issued reads from a large register (for example, EAX) after some previous instruction wrote to one of its partial registers (for example, AL, AH, AX). The read stalls until the write retires, even if the instructions are not adjacent.

This applies to all register pairs involving either a larger register with any of its partial registers, or two partial registers in the same set.

Examples of a partial register paired with its larger register are: AX with EAX, BL with BX, and SI with ESI.

Examples of two partial registers in the same set are: AL with AH, and CL with CH.
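The pairing rule above can be sketched as a small predicate. This is an illustrative model only, not the analyzer's implementation: the `REGISTERS` table and the `may_stall` name are assumptions, and the sketch reads the rule as "a stall is possible when the read needs bits of the same register set that the earlier write did not produce".

```python
# Register name -> (register set, bits of the 32-bit register it covers).
# Hypothetical table built for illustration; it is not taken from the analyzer.
REGISTERS = {}
for letter in "abcd":
    parent = "e" + letter + "x"
    REGISTERS[parent] = (parent, 0xFFFFFFFF)
    REGISTERS[letter + "x"] = (parent, 0x0000FFFF)
    REGISTERS[letter + "h"] = (parent, 0x0000FF00)
    REGISTERS[letter + "l"] = (parent, 0x000000FF)
for name in ("si", "di", "bp", "sp"):
    REGISTERS["e" + name] = ("e" + name, 0xFFFFFFFF)
    REGISTERS[name] = ("e" + name, 0x0000FFFF)


def may_stall(written, read):
    """Return True if reading `read` after writing `written` fits the rule."""
    written_set, written_bits = REGISTERS[written.lower()]
    read_set, read_bits = REGISTERS[read.lower()]
    # Same register set, and the read covers bits the write did not produce.
    return written_set == read_set and (read_bits & ~written_bits) != 0
```

Under this reading, `may_stall("AX", "EAX")` and `may_stall("AL", "AH")` are both true, while a read of AL after a write to EAX is not flagged.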

The stall does not occur if the write has already retired when the read begins executing. (In static simulation, there is no way to know exactly when the write retires. Therefore, a fixed distance between the write and the read is used for simulation purposes only. The fixed distance used is usually long enough to enable the write to retire.)
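The fixed-distance approximation can be illustrated with a minimal scan over one basic block. Everything here is an assumption made for the sketch: the `RETIRE_DISTANCE` value, the `REGS` table, the `find_stalls` name, and the representation of an instruction as a tuple of text, registers written, and registers read. The real distance used by the analyzer is not stated in this topic.

```python
RETIRE_DISTANCE = 8  # hypothetical value; the analyzer's actual distance is not stated

# Register name -> (register set, bits covered). Illustrative subset only.
REGS = {}
for s in "abcd":
    REGS.update({"e" + s + "x": (s, 0xFFFFFFFF), s + "x": (s, 0xFFFF),
                 s + "h": (s, 0xFF00), s + "l": (s, 0xFF)})


def find_stalls(block, retire_distance=RETIRE_DISTANCE):
    """block: list of (text, registers_written, registers_read) tuples."""
    last_write = {}  # register set -> (instruction index, bits written)
    flagged = []
    for i, (_text, writes, reads) in enumerate(block):
        for reg in reads:
            reg_set, read_bits = REGS[reg]
            if reg_set in last_write:
                j, written_bits = last_write[reg_set]
                # The read needs bits the last write did not produce, and the
                # write is assumed not to have retired within the distance.
                if read_bits & ~written_bits and i - j < retire_distance:
                    if i not in flagged:
                        flagged.append(i)
        for reg in writes:
            reg_set, written_bits = REGS[reg]
            last_write[reg_set] = (i, written_bits)
    return flagged


close = [("mov al, mem8", ["al"], []), ("inc eax", ["eax"], ["eax"])]
far = [close[0]] + [("nop", [], [])] * 10 + [close[1]]
print(find_stalls(close))  # [1]
print(find_stalls(far))    # []
```

The second sequence is not flagged only because the read falls outside the assumed distance; on real hardware the outcome depends on when the write actually retires.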

A partial register stall also occurs in the following cases because the processor operates on 32 bits internally (even though it seems to be operating on only 16 bits):

Advice

Try to avoid using partial registers. If you must use partial registers, you can still prevent penalties as follows:

Example: Avoiding the Use of Partial Registers

Original

    mov ah, cl
    mov al, dl
    mov mem, ax

Here, the second MOV instruction writes to just the lower portion of the EAX register, AL. The third MOV instruction reads the whole AX register. This causes a partial stall.

Optimized

    mov eax, ecx
    shl eax, 8
    and edx, 0xFF
    or eax, edx
    mov mem, ax

Here, the full ECX register is copied into EAX. The SHL instruction shifts the value of CL into bits 8 to 15 of EAX (AH). ANDing EDX with 0xFF clears every bit of EDX above DL, leaving only DL. When EAX is ORed with EDX, the CL value is left in AH and the DL value is left in AL. The value stored to memory is the same as in the original code, and there is no partial stall.

The code could be further optimized by rescheduling.
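As a sanity check on the claim that the rewritten sequence keeps the original semantics, the two sequences can be modeled on plain integers and compared for every CL and DL value. This is a sketch under stated assumptions: the function names are hypothetical, registers are modeled as 32-bit unsigned values, and only the 16-bit value stored to memory is compared.

```python
def original(ecx, edx):
    ah = ecx & 0xFF          # mov ah, cl
    al = edx & 0xFF          # mov al, dl
    return (ah << 8) | al    # mov mem, ax


def optimized(ecx, edx):
    eax = ecx                        # mov eax, ecx
    eax = (eax << 8) & 0xFFFFFFFF    # shl eax, 8 (result truncated to 32 bits)
    edx &= 0xFF                      # and edx, 0xFF
    eax |= edx                       # or  eax, edx
    return eax & 0xFFFF              # mov mem, ax


# Non-zero upper bits are included to show they do not leak into the result.
mismatches = [
    (cl, dl)
    for cl in range(256)
    for dl in range(256)
    if original(0xDEADBE00 | cl, 0x12345600 | dl)
    != optimized(0xDEADBE00 | cl, 0x12345600 | dl)
]
print(len(mismatches))  # 0
```

The stored value matches, but the rewritten sequence also overwrites EDX and the upper half of EAX, so it is a drop-in replacement only when those values are not needed afterwards.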

 

Example: Using XOR to Prevent Partial Register Stalls

The XOR and SUB instructions can be used to clear the upper bits of a large register before writing to one of its partial registers. When the upper bits of the larger register are cleared in this way, reading it after writing to one of its partial registers does not cause a stall. Other ways of clearing the upper bits of the large register do not prevent a stall.

 

Original

    mov al, mem8
    inc eax

The INC instruction uses the entire EAX register. The preceding MOV instruction used just the lower portion of the EAX register, AL. This causes a partial stall.

Optimized

    xor eax, eax
    mov al, mem8
    inc eax

Using the XOR instruction before writing to the partial register clears all bits in EAX to 0, and prevents the stall.

Example: Using SUB to Prevent Partial Register Stalls

The SUB and XOR instructions can be used to clear the upper bits of a large register before writing to one of its partial registers. When the upper bits of the larger register are cleared in this way, reading it after writing to one of its partial registers does not cause a stall. Other ways of clearing the upper bits of the large register do not prevent a stall.

 

Original

    mov eax, 0
    mov al, mem8
    inc eax

The MOV in the first line clears EAX, but a MOV of zero is not one of the clearing instructions that prevent the stall. The MOV in the second line writes to the partial register AL. The third line increments EAX, causing a stall.

Optimized

    sub eax, eax
    mov al, mem8
    inc eax

Using the SUB instruction before writing to the partial register clears all bits in EAX to 0, and prevents the stall.
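The behavior described in the XOR and SUB examples can be summarized in a short model. This is a sketch, not the analyzer's or the processor's logic: the `stalling_reads` name, the instruction tuples, and the `FULL` and `LOW8_TO_FULL` tables are assumptions, and only MOV, INC, XOR, and SUB on a few registers are modeled. A register counts as cleared only after `xor r, r` or `sub r, r`; `mov r, 0` is deliberately not treated as clearing, as stated in the text.

```python
FULL = {"eax", "ebx", "ecx", "edx"}
LOW8_TO_FULL = {"al": "eax", "bl": "ebx", "cl": "ecx", "dl": "edx"}


def stalling_reads(block):
    """block: list of (mnemonic, dst, src); src is None for one-operand forms."""
    clean = set()    # full registers cleared by xor r, r or sub r, r
    pending = set()  # full registers with a partial write a full read would wait on
    stalls = []
    for i, (mnemonic, dst, src) in enumerate(block):
        if mnemonic in ("xor", "sub") and dst == src and dst in FULL:
            clean.add(dst)
            pending.discard(dst)
            continue
        # mov reads only its source; the other modeled instructions also read dst.
        reads = [src] if mnemonic == "mov" else [dst, src]
        if any(r in pending for r in reads):
            stalls.append(i)
        if dst in LOW8_TO_FULL:
            full = LOW8_TO_FULL[dst]
            if full not in clean:
                pending.add(full)
        elif dst in FULL:
            # A full write replaces the whole register, cleared or not.
            clean.discard(dst)
            pending.discard(dst)
    return stalls


with_mov = [("mov", "eax", "0"), ("mov", "al", "mem8"), ("inc", "eax", None)]
with_sub = [("sub", "eax", "eax"), ("mov", "al", "mem8"), ("inc", "eax", None)]
print(stalling_reads(with_mov))  # [2]
print(stalling_reads(with_sub))  # []
```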

Affected Processors:

Intel® Pentium® Pro processor