Write Combining Buffer (WCB) Full Evictions

Thread Specificity: TI

This event counts when all Write Combining Buffers (WCBs) are occupied and an entry must be evicted to handle a new request.

Counting WCB Full Evictions can provide the following tuning insight:

Such evictions are distinct from evictions caused by aliasing conflicts. Subtracting WCB Full Evictions from the WCB All Evictions event gives an indication of the more expensive form of 64 KB aliasing. 64 KB aliasing can occur between loads, leading to conflicts in the 1st-level cache that incur delays while data is fetched from the 2nd-level cache, or it can occur between stores and other memory references. The latter case causes WCB thrashing, and it is the case indicated by the difference between the two events. If the ratio of that difference to the number of retired instructions is high, avoiding memory references that are a multiple of 64 KB apart may boost performance.
Note that this count is still only an indication of a possible problem; the hardware does not permit a definitive count.
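
The subtraction and ratio described above can be sketched in C as follows. This is only an illustration: the wcb_counts structure and the function names are hypothetical, and the counter values are assumed to have been collected separately with a profiling tool. No threshold for a "high" ratio is given here, so none is encoded.

```c
#include <stdint.h>

/* Hypothetical container for raw event counts from one profiling run. */
typedef struct {
    uint64_t wcb_all_evictions;    /* WCB All Evictions event  */
    uint64_t wcb_full_evictions;   /* WCB Full Evictions event */
    uint64_t instructions_retired; /* retired instruction count */
} wcb_counts;

/* Evictions that are not explained by all WCBs being full.
 * Clamped at zero in case the two events were sampled in
 * different runs and do not line up exactly. */
uint64_t wcb_alias_evictions(const wcb_counts *c)
{
    if (c->wcb_all_evictions < c->wcb_full_evictions)
        return 0;
    return c->wcb_all_evictions - c->wcb_full_evictions;
}

/* Ratio of the difference to retired instructions.
 * A high value is a hint of 64 KB aliasing, not a definitive count. */
double wcb_alias_evictions_per_instruction(const wcb_counts *c)
{
    if (c->instructions_retired == 0)
        return 0.0;
    return (double)wcb_alias_evictions(c) / (double)c->instructions_retired;
}
```

For example, with 1500 total evictions, 500 full evictions, and 4000 retired instructions, the difference is 1000 and the ratio is 0.25.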

For a sequence of stores, this event can sometimes be used as an indication of how efficiently the WCBs are being used. WCBs combine data from stores to a set of contiguous addresses (such as those in the same cache line). However, if an inner loop interleaves stores from more write streams than there are WCBs, the WCBs will be thrashed and will have to be evicted before they are filled. In writeback memory, this simply leads to slight delays as WCBs are deallocated and allocated again. For write-combining memory and non-temporal stores, this leads to partial writes, which make much less efficient use of the bus.
This problem can be detected by comparing the ratio of the number of stores retired to WCB Full Evictions with the number of stores made to each cache line. If the ratio of these two quantities is greater than one, the WCBs may not be used for streaming as efficiently as possible. To fix this, spread the write streams across several inner loops instead of one. This type of optimization is called loop fission.
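
A minimal sketch of loop fission in C is shown below. The function names, the four output streams, and the arithmetic are all hypothetical; the number of streams that can safely share one loop depends on the number of WCBs on the target processor, which is not stated here.

```c
#include <stddef.h>

/* Before fission: one inner loop interleaves stores to four
 * write streams (a, b, c, d). */
void fill_fused(size_t n, const float *src,
                float *a, float *b, float *c, float *d)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;
        b[i] = src[i] * 2.0f;
        c[i] = src[i] - 1.0f;
        d[i] = src[i] * src[i];
    }
}

/* After fission: the same work split into two inner loops, each
 * writing only two streams, so fewer WCBs are needed at a time.
 * The cost is that src is read twice. */
void fill_fissioned(size_t n, const float *src,
                    float *a, float *b, float *c, float *d)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;
        b[i] = src[i] * 2.0f;
    }
    for (size_t i = 0; i < n; i++) {
        c[i] = src[i] - 1.0f;
        d[i] = src[i] * src[i];
    }
}
```

Both functions produce identical results. Whether the fissioned form is faster should be confirmed by measuring WCB Full Evictions again, since an optimizing compiler may fuse or split loops on its own.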
Note that since WCBs may be shared among logical processors, the number of WCB Full Evictions may increase when Hyper-Threading Technology is in use.
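
The store-efficiency comparison described above could be computed as in the following C sketch. The function name and parameters are hypothetical. The text does not say which quantity is the numerator; this sketch assumes the number of stores per cache line is divided by the number of stores retired per WCB Full Eviction, so that a result above one means buffers are being evicted before they fill. Because shared WCBs can inflate the eviction count, the result is best treated as a hint.

```c
#include <stdint.h>

/* Assumed reading of the comparison:
 *
 *   (stores per cache line) / (stores retired / WCB Full Evictions)
 *
 * stores_per_line is supplied by the caller because it depends on
 * the code under test (line size divided by store width). */
double wcb_streaming_inefficiency(uint64_t stores_retired,
                                  uint64_t wcb_full_evictions,
                                  double stores_per_line)
{
    if (stores_retired == 0)
        return 0.0; /* nothing to measure */
    return stores_per_line * (double)wcb_full_evictions
           / (double)stores_retired;
}
```

As an illustration, assuming 16 stores per cache line (for example, 4-byte stores into a 64-byte line), 16000 stores retired, and 2000 WCB Full Evictions, there are 8 stores per eviction and the result is 2.0, which under this reading suggests that the WCBs are being evicted roughly half full.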