In a multi-threaded application, a thread is considered active when it is either running on a processor or waiting in the operating system (OS) scheduler run-queue.
A thread is not considered active when it is waiting for an external event (blocking), waiting for another thread, or spin waiting.
Instructions with explicit memory operands reference the segment containing the operand and the displacement from the beginning of the segment to the operand.
The offset is calculated as follows:
Base + (Index * Scaling Factor) + Displacement
Where:
Base Register can be any of the eight general registers: EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI.
Index Register can be one of these general registers: EAX, EBX, ECX, EDX, EBP, ESI, and EDI.
Scaling Factor can be 1, 2, 4, or 8.
Displacement can be none, an 8-bit number, or a 32-bit number.
On the Pentium(R) processor and on the Pentium processor with MMX(TM) technology support, an Address Generate Interlock (AGI) conflict occurs when a register that is used as the base or index component of an effective address calculation was the destination register of an instruction executed in the immediately preceding cycle.
On the Intel486(TM) processor, AGI conflicts occur when a register that is used as the base (not the index) component of an effective address calculation was the destination register of an instruction executed in the immediately preceding cycle.
A recommended solution or workaround (usually a suggestion to modify the code) to remove or avoid a performance problem.
Alignment of data concerns all kinds of variables:
dynamically allocated
members of a data structure
global or local variables
parameters passed on the stack.
A misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes on the Pentium(R) 4 processor, and 32 bytes on the Pentium III and Pentium II processors. On the Pentium 4 processor, an access to data that is misaligned across a 64-byte boundary leads to two memory accesses and requires several micro-ops to be executed instead of one. Accesses that span either 16-byte or 64-byte boundaries are likely to incur a large performance penalty, since they are executed near retirement and can incur stalls on the order of the depth of the pipeline.
Assembly/Compiler Coding Rule 15. (H impact, H generality) Align data on natural operand size address boundaries
For best performance, align data as follows:
Align 8-bit data at any address.
Align 16-bit data to be contained within an aligned four-byte word.
Align 32-bit data so that its base address is a multiple of four.
Align 64-bit data so that its base address is a multiple of eight.
Align 80-bit data so that its base address is a multiple of sixteen.
Align 128-bit data so that its base address is a multiple of sixteen.
A 64-byte or greater data structure or array should be aligned so that its base address is a multiple of 64. Sorting data in decreasing size order is one heuristic for assisting with natural alignment. As long as 16-byte boundaries (and cache lines) are never crossed, natural alignment is not strictly necessary, though it is an easy way to enforce this.
A parameter value used to invoke the source code advice.
When the hardware needs the assistance of micro-code to deal with some event, the machine takes an assist. One example of such a situation is an underflow condition in the input operands of a floating-point operation. The hardware must internally modify the format of the operands in order to perform the computation. Assists clear the entire machine of micro-ops before they begin to accumulate, and are costly. The assist mechanism on the Intel(R) Pentium(R) 4 processor is similar in principle to that on the Pentium II processors, which also have an assist event.
A group of consecutive instructions, which has only one entry point and one exit.
A register used to compute an effective memory address. It may be any of the eight general registers: EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI. A base represents an indirect offset to an operand. The base register can be used alone (without an index or displacement) for dynamic storage of variables and data structures.
In the Thread Profiler, time when the thread is not active due to a blocking operation done while the thread was on the critical path. A blocking operation is one that blocks or stalls a thread when the thread invokes that operation, causing the thread to be inactive for the duration of the operation.
Examples of blocking operations are sleeps and waits on the operating system for I/O operations (disk or file access, or user input) to complete. Time may also be classified as Blocking time when a thread is waiting for another thread or event (waiting for a signal) if the signal was lost or the Thread Profiler does not know about the signal. If the User-Synchronization APIs were used, a thread might be actively spinning but would still be considered blocked while it waits for a signal that it lost or after it cancelled its prepare().
Branch mispredictions incur a large penalty on microprocessors with deep pipelines. In general, the direction of branches can be predicted with a high degree of accuracy by the front end of the Intel(R) Pentium(R) 4 processor, such that most computations can be performed along the predicted path while waiting for the resolution of the branch.
In the event of a misprediction, instructions and micro-ops that were scheduled to execute along the mispredicted path must be cancelled. These instructions and micro-ops are referred to as bogus instructions and bogus micro-ops. A number of Pentium 4 processor performance monitoring events, for example, instructions retired and micro-ops retired, can count instructions or micro-ops that are retired based on the characterization of bogus versus non-bogus.
In event description tables, the term bogus refers to instructions or micro-ops that must be cancelled because they are on a path taken from a mispredicted branch. The terms retired and non-bogus refer to instructions or micro-ops along the path that results in committed architectural state changes as required by the program execution. Thus instructions and micro-ops are either bogus or non-bogus, but not both.
A small cache where the processor stores information about previous branches. The processor uses the information stored in the BTB to predict whether a branch it encounters will be taken or not taken.
The ratio of the processor clock to the bus clock.
A small, fast memory which holds the active parts of a larger, slower memory. Caches raise system performance by satisfying an internal read request more quickly than a bus cycle to memory. They also reduce the processor's use of the external bus when the same locations are accessed multiple times.
A cache miss is a request for memory that is not found in the cache. When a cache miss occurs, the processor fetches an entire 32-byte line from a higher level cache (L2) or, if necessary, from main memory. If necessary, it replaces the least recently used (LRU) line.
There are three types of cache miss:
A compulsory miss on the first access to a specific memory address that is not in the cache.
A conflict miss occurs in a set-associative cache when all lines in a given set are full.
A capacity miss occurs when all sets in the cache are full.
Maximum number of child nodes to display.
A capacity miss is a request that misses the cache because the cache was full.
An instruction that breaks down into more than four micro-ops, or is longer than seven bytes. Only decoder D0 can decode complex instructions. Complex instructions cannot be decoded in parallel with other instructions.
Examples of instructions that break down into more than four micro-ops are CALL, RET, CMPXCHG, LOCK, and CPUID.
Instructions with both an immediate operand and a displacement, as shown below, are also complex because they are longer than seven bytes:
movzx ebx, word ptr [ecx + 077fb7426h]
lea ecx, dword ptr [esi*2 + 0]
A request that misses because it is the first reference to a given line in memory.
A request that misses due to lack of space in a certain set. A set is a group of lines in the cache into which a given line in memory can be mapped. The number of lines per set depends on the associativity of a cache. A cache with two lines per set is called a two-way set associative cache.
Sub-graphs where an undirected path exists between all pairs of nodes. If no path exists between two nodes in a graph, then those two nodes must belong to different connected components.
A correct multithreaded program is free of race conditions that could lead to non-deterministic results.
The mask that determines when the event counter increments. This corresponds to bits 24 through 31 of the 32-bit Event Select register that controls the performance monitoring counter. When non-zero, the processor compares this mask to the count of events during a single cycle. If the event count is greater than or equal to this mask, the counter is incremented by one. Otherwise the counter is not incremented.
This mask can be used to count events only if there are multiple occurrences per clock (for example, two or more instructions retired per clock). If the counter-mask field is 0, the counter is incremented each cycle.
Cycles per instruction retired. This event ratio is a basic measure of performance.
The act of one processor writing data into the currently executing code segment of a second processor with the intent of having the second processor execute that data as code.
A situation where the order in which two accesses occur is not guaranteed by the program, causing indeterminate results. A data race is due to two different threads accessing the same memory location at the same time without the proper synchronization needed to guarantee consistent results.
A thread is deadlocked when it is waiting for a critical section, mutex, or thread handle that can never be acquired.
For example, if two threads each hold a different mutex and need the mutex held by the other thread to proceed, the threads are deadlocked since neither thread can give up the mutex it already holds nor acquire the second mutex.
The part of the processor that breaks instructions into sequences of simple micro operations, known as micro-ops. Each micro-op has two logical sources and a logical destination.
A relationship between two instructions such that one instruction uses the other's result or defines the other's operands. The compiler cannot reschedule dependent instructions.
For example, a compiler cannot move a load of variable a above a store to a, because the two instructions reference the same variable a and are therefore dependent on each other.
Tiny numbers represented by making the integer bit (and perhaps other leading bits) of the significand zero. The use of leading zeros with denormalized numbers allows smaller numbers to be represented. However, this denormalization causes a loss of precision (the number of significant bits in the fraction is reduced by the leading zeros).
Nodes that have no edges connected to them.
Dynamic branch prediction is used to update the instruction pointer (IP) early (during instruction fetch, instead of during decoding). As a result, fewer instructions enter the pipeline that will need to be flushed.
Prefetching can be done in an upward or downward direction.
A relationship between two functions where one (caller) is calling the other (callee).
Time (microseconds) contributed to the total time of a callee function when called from a specific caller function. The sum of the edge time of all the incoming edges to a function equals this function's total time.
The execution time (microseconds) of a particular function when called from a specific parent function.
Time (microseconds) contributed to the total wait time of a callee function when called from a specific caller function. The sum of the edge wait time of all the incoming edges to a function equals this function's total wait time.
Event skid is the recording of an event not exactly on the code line that caused the event. Event skids may even result in a caller function event being recorded in the callee function.
Event skid is caused by a number of factors:
The delay in propagating the event out of the processor's microcode through the interrupt controller (APIC) and back into the processor.
The current instruction retirement cycle must be completed. For example, on the Pentium(R) Pro processor family, up to three instructions can be retired in a single clock.
When the interrupt is received, the processor must serialize its instruction stream which causes a flushing of the execution pipeline.
The Intel(R) Pentium(R) 4 and Itanium(R) processors support accurate event location for some events. These events are called precise events.
Example: Interpreting Jump and Call Instructions
Events that happen during the execution of a jmp or call instruction may appear on an instruction that is one or two instructions away from the original jmp/call in the execution flow. In this example, the mov instruction at the top of the loop is not responsible for the 1.02% of events attributed to it, because the mov instruction is the target of the branch at the bottom of the loop. The real source of the events is the jmp instruction at the bottom of the loop.
Event %   Instructions
1.02%     top_of_loop:  mov ...
          ...           (any number of lines)
          end_of_loop:  jnz <to someplace>
                        jmp top_of_loop
Synchronous events that are responses of the processor to certain conditions detected during the execution of an instruction. An exception can also be defined as a forced call to a procedure or a task, generated when the processor fails to interpret an instruction or when an INT n instruction is executed. Causes of exceptions include division by zero, stack overflow, undefined opcodes, and memory protection violations.
For IA-32 processors, the names of the execution units in the execution core that are used to execute the micro-ops for each instruction. This information is provided only for IA-32 instructions that decode into no more than 4 micro-ops; micro-ops for instructions that decode into more than 4 micro-ops are supplied by the micro-code ROM.
Several execution units may share the same port, such as FP_ADD, FP_MUL, or MMX_SHFT in the FP_EXECUTE cluster.
An Address Generate Interlock (AGI) conflict occurs when a register that is used as the base or index component of an effective address calculation was the destination register of an instruction executed in the immediately preceding cycle. An explicit AGI conflict occurs when the previous instruction explicitly wrote to the destination register. For example, add eax, 3 explicitly writes to the eax register.
A function selected in the function summary, or in the graph view.
Eight general registers that hold operands for logical and arithmetic operations: eax, ebx, ecx, edx, ebp, esp, esi, and edi.
Some hotspots may turn out to be bottlenecks, but other hotspots, by the very nature of their content, inevitably take a long time to execute. These sections of code may be essential to the running of your program; you cannot, therefore, remove such hotspots. Focus on the hotspots that are bottlenecks and can be removed through optimization of your code or system.
Intel(R) processors are superscalar, enabling them to execute more than one instruction per clock cycle. Intel Hyper-Threading technology extends this capability by making a single physical processor appear to be two logical processors that are capable of executing two threads in parallel. From a software perspective, operating systems and programs can schedule processes or threads as if they were running on two physical processors. From a microarchitecture perspective, instructions from two logical processors execute simultaneously on the shared resources of one physical processor, thus increasing overall resource utilization.
Operating environments can take advantage of Hyper-Threading technology in one of two ways.
First, applications can be multithreaded thus allowing the individual application to execute upon parallel threads in the processor at the same time. Operating systems like Microsoft* Windows* Server 2003 and Windows XP* along with high performance applications such as database engines, scientific computation programs and multimedia design software are currently multithreaded and are typically run in dual- or multi-processor environments.
Second, Hyper-Threading can be used in a multitasking environment where the system is running more than one application in parallel. In this instance, each application can be run as separate threads on the same processor increasing utilization of the execution units and overall platform performance.
A thread is idle when it is waiting on an object such that execution of its next operation is delayed.
The process in which the compiler moves instructions and reorders them in order to allow instructions to execute in parallel and to avoid stalls or wasted cycles between two dependent instructions.
An observation about the performance of your code. It could indicate a potential performance problem that could be a bottleneck to your application's performance.
First-level on-die cache. The L1 cache is a small but very fast cache that is split into an instruction cache and a data cache, dedicated to caching instructions and data respectively.
L1 Instruction and Data Cache Characteristics: (table of L1 Instruction Cache and L1 Data Cache characteristics)
L2 cache misses have a much greater impact on performance than L1 cache misses, and L2 cache problems are often easier to solve than L1 cache problems.
Second-level cache. While the L1 cache is an on-chip cache, the second-level cache (L2 cache) is located between the on-chip cache and the main memory. It is located on a separate die and closely coupled to the CPU die by means of a dedicated full clock-speed bus.
If an access misses the L1 cache, the access passes on to the bigger, but slower L2 cache. The L2 cache is a unified cache of both instructions and data. If the access misses the L2 cache, it goes to the main memory.
L2 Cache Characteristics: (table of L2 Unified Cache characteristics)
In the Pentium(R) processors, the L2 cache is external to the processor package and optional; in the Pentium 4 and P6 family processors, the L2 cache is internal to the processor package.
The number of clock cycles an instruction takes to calculate its result.
The smallest unit of storage that can be allocated in a cache. When the processor fills a cache line, it loads the entire cache line, using multiple read cycles to main memory.
For cache line lengths for different processors, see L1 cache and L2 cache.
Loads on unaligned data reduce performance. Loads on data that crosses cache line boundaries reduce performance even more.
A synchronization object with ownership. For example, a critical section or mutex can be owned by a thread, in contrast to a semaphore, which is an un-owned object.
The order in which a thread acquires multiple locks to be held simultaneously. All threads that acquire the same locks should acquire them in the same order to avoid deadlock.
Loop blocking is structuring code so that it performs several operations on blocks of data of a large matrix (while the data is still in the cache), instead of trying to process each operation separately on all the elements, one at a time.
This enables more work to be done on the same data while it is still in the cache.
A variable value that does not change within a loop.
Least recently used.
The path that takes maximum time.
The most time-consuming path that leads from the function down to the end of the graph.
The most time-consuming path that leads from the root down to the function.
A memory subsystem that acts as a reservation station and a reorder buffer. It holds suspended loads and stores, re-dispatching them when the blocking condition (dependency or resource) disappears.
The combination of two or more VTune(TM) Performance Analyzer Activity results into one virtual Activity result.
A decoded series of micro-operations produced by processor decoders from a stream of Intel(R) architecture instructions. Every assembly instruction can break into one or more micro-ops.
MMX(TM) Technology:
Introduces 64-bit MMX registers.
Introduces support for SIMD operations on packed byte, word, and doubleword integers.
The MMX instructions are useful for multimedia and communications software. The eight MMX registers support execution of single-instruction, multiple-data (SIMD) operations on 64-bit packed byte, word, and doubleword integers.
For more information on the MMX technology, refer to the IA-32 Intel(R) Architecture Software Developer's Manual, Volume 1. See the Intel(R) Processor Information on the Web for more details.
The data collector temporarily stores the module records (executables and DLLs that are loaded) during the sampling process in the module buffer. When the buffer is full, the data collector flushes the buffer to the result file. A small Module Buffer Size can cause the data collector to frequently flush the buffer to file, which can negatively impact system performance.
A multi-threaded program allows different parts of a program to run concurrently. Operating systems like Microsoft* Windows* Server 2003 and Windows XP* are multi-threaded as are many scientific applications. When creating an application, developers must create code so that the threads can run independently and will not interfere with each other.
A synchronization object used to allow multiple threads to serialize their access to shared data. The name derives from the capability it provides; namely, mutual-exclusion. The thread that has locked a mutex becomes its owner and remains the owner until that same thread unlocks the mutex.
See bogus, non-bogus, retire events.
A process that does not slow down the applications running on a system. For example, sampling is non-intrusive since it takes less than 5% of processor time during an Activity. It also does not modify binary files or executables (no instrumentation of code as in call graph).
A specific collection of counters that relate to a specific application, OS subsystem, or hardware device. A software object is a collection of counters that originate from a software component of the system such as an OS subsystem. A hardware object is a collection of counters tracked by the counter registers on a hardware device in the system such as the microprocessor. Each object has a unique set of counters associated with it.
An application has more active threads than the number of available processors.
A microarchitecture design technique that divides the execution of an instruction into sequential steps, using different microarchitectural resources at each step. Pipelined machines have multiple Intel(R) architecture instructions executing at the same time, but at different stages in the machine.
The Pentium(R) and Pentium Pro family processors achieve high execution speed through the superscalar architecture. The Pentium processor has two execution units (U and V pipes) and can process two instructions per clock, so it is superscalar level 2. The Pentium Pro processor can dispatch and retire three instructions per clock so it is superscalar level 3.
Precise events are events for which the exact instruction addresses that caused the event are available. Some events for the Pentium(R) 4 processor are precise events.
The Intel NetBurst(R) microarchitecture supports three prefetching mechanisms:
the first is for instructions only
the second is for data only
the third is for code or data
The first mechanism is a hardware instruction fetcher that automatically prefetches instructions. The second is a software-controlled mechanism that fetches data into the caches using the prefetch instructions. The third is a hardware mechanism that automatically fetches data and instructions into the unified second-level cache. The hardware instruction fetcher reads instructions, along the path predicted by the BTB, into the instruction streaming buffers. Data is read in 32-byte chunks starting at the target address.
Prefixes are one or more bytes that precede an instruction and modify the operation of the instruction.
Variable (or memory addresses) that is designed to be only visible to one thread. For example, a variable declared on a thread's stack is usually visible only to the thread that declared it.
An executing application that has its own reserved virtual address space. It is an intrinsic combination of code, data, and several operating system resources. During execution, a process can contain one or more threads that run in parallel.
Contains Activities and Activity results. The project is contained in a .vpj project file.
A recursive function is a function that directly or indirectly calls itself.
The timing information formulas:
Total [Wait] Time = Self [Wait] Time + SUM (Outgoing Edge [Wait] Time)
Total [Wait] Time = SUM (Incoming Edge [Wait] Time)
do not hold for functions found on a recursive path, because the values calculated on edges forming a recursive chain accumulate the circulating recursive calls. To find out whether a function is on a recursive chain, highlight Recursive functions on the filter toolbar.
For recursive functions, the following statements are true:
Total [Wait] Time <= Self [Wait] Time + SUM (Outgoing Edge [Wait] Time)
Total [Wait] Time <= SUM (Incoming Edge [Wait] Time)
Relative virtual address of any function in your program.
A heuristic that indicates how relevant a particular insight or advice is to the current context. For instance, an extremely high relevance score for an insight may indicate a high probability of a performance bottleneck.
In order to maximize performance for the common case, the Intel NetBurst(R) microarchitecture sometimes aggressively schedules micro-ops for execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, micro-ops must be reissued. This mechanism is called replay.
Some occurrences of replays are caused by cache misses, dependence violations (for example, store-forwarding problems), and unforeseen resource constraints. In normal operation, some number of replays is common and unavoidable. An excessive number of replays indicates a performance problem.
A resteer is a change in the Instruction Pointer (IP) beyond the normal increment to the next instruction bundle. Normally the IP increments to the next sequential instruction bundle. The IP is used to fetch instructions from the cache and/or memory, and then placed into the instruction pipeline for processing.
When a branch instruction is detected and predicted taken or if an exception or interrupt occurs, the IP is changed to a new value and instruction fetching begins at this new IP value. This change in IP to a different address that is not the next sequential instruction bundle is called a resteer.
See bogus, non-bogus, retire events.
Frequency with which the VTune(TM) Performance Analyzer interrupts the processor to collect samples.
Registers that hold segment selectors associated with different forms of memory access. There are six registers (CS, DS, SS, ES, FS, and GS) that determine, at any given time, which segments of memory are currently available.
Time (microseconds) spent inside a function, including time spent waiting between execution activities. It does not include time spent in calls to other instrumented functions.
The number of microseconds spent inside a function while its thread/fiber was non-active (suspended). Self Wait time does not include the non-active time spent in callees of the function.
The call graph calculation of the Self_Wait_Time is based on a heuristic estimation that tracks context switches caused by synchronization events and by other causes.
A signal used to govern access to shared system resources.
Variables or memory addresses that are designed to be visible to more than one thread. For example, global variables are usually visible to all threads and each thread may read or write the shared variable (address).
The three SIMD technologies that have been added to the IA-32 architecture in chronological order are as follows:
MMX(TM) technology
Streaming SIMD Extensions (SSE)
Streaming SIMD Extensions2 (SSE2)
A mechanism that checks whether the address presented is in the cache of the processor. If the address hits a modified line in the data cache, the modified line is written back onto the bus.
Threads that are prevented from running, typically because they are waiting for locks to be released or for other operations to complete.
The number of seconds for which sampling is delayed after you start an Activity.
Streaming SIMD Extensions:
Introduces 128-bit XMM registers.
Introduces 128-bit data type with four packed single-precision floating-point operands.
Introduces data prefetch instructions.
Introduces non-temporal store instructions and other cacheability and memory ordering instructions.
Adds extra 64-bit SIMD integer support.
The SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.
For more information on the Streaming SIMD Extensions, refer to the IA-32 Intel(R) Architecture Software Developer's Manual, Volume 1.
SSE was followed by Streaming SIMD Extensions2 (SSE2).
Wait time during which threads do not perform any work because locks are held by other threads: the time a thread spends in the blocked queue of the OS scheduler waiting for an opportunity to run. This opportunity becomes available when the thread currently holding the lock releases it.
An object, such as a mutex, critical section, semaphore, event, or handle of a thread, used to control or limit access to a section of code. In the Windows* API, synchronization objects, except critical sections, are referenced by HANDLE.
Tagging is a means of marking micro-ops to be counted at retirement. See Appendix A of the IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming for the description of the tagging mechanisms. See the Intel(R) Processor Information on the Web for more details.
The same event can happen more than once per micro-op. The tagging mechanisms allow a micro-op to be tagged once during its lifetime. The retired suffix is used for metrics that increment a count once per micro-op, rather than once per event. For example, a micro-op may encounter a cache miss more than once during its lifetime, but a Misses Retired metric (for example, 1st-Level Cache Misses Retired) will increment only once for that micro-op.
The target number of samples is calculated by multiplying the samples per second (the inverse of the Sampling Interval) by the Duration.
A thread is part of a program that can be run independently of other aspects of the program.
This event is logical processor independent. The sample count measured is the total number of events that occurred on the physical processor package. The event interrupt for thread-independent events is set on logical processor 0. Sample counts for the total system are accurate per physical processor package. However, sample counts at a particular IP address at the process, module, function, or source-code level may not be accurate. That is because sample counts on logical processor 1 at a particular IP address are counted on logical processor 0, which is very likely to be at a different IP address. The IP address on logical processor 0 is recorded as the location of the event. If you use TI events, be aware of this when reviewing data below the system level.
This event is logical processor specific. However, due to an ESCR restriction that prevents both logical processors from being sampled in the same run, the event has been programmed to be thread independent. The sample count measured is the total number of events that occurred on the physical processor package. The event interrupt for thread-independent events is set on logical processor 0. Sample counts for the total system are accurate per physical processor package. However, sample counts at a particular IP address at the process, module, function, or source-code level may not be accurate. That is because sample counts on logical processor 1 at a particular IP address are counted on logical processor 0, which is very likely to be at a different IP address. The IP address on logical processor 0 is recorded as the location of the event. If you use TI-E events, be aware of this when reviewing data below the system level.
For IA-32 processors, number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many IA-32 instructions, the throughput of an instruction can be significantly less than its latency.
A form of sampling that collects information at a regular time interval and allows you to monitor all active software on your system, including the operating system, device drivers, and application software.
The floating-point register stack contains eight numeric registers, numbered ST(0) through ST(7). ST(0) is also called TOS. Most floating-point instructions require at least one operand in the TOS.
Time (microsecond) elapsed between the time a function starts execution until the time it terminates execution. This is the sum of this function's Self Time and all its callees Total Time.
This column shows the total of the Self_Wait_Time of the function and the Self_Wait_Time of all its children.
On-chip caches used to store the most recently used page table entries. Separate TLBs exist for data (DTLB) and instruction caches (ITLB). Most paging is done using the contents of the TLB.
The integer pipe on the Pentium(R) processor.
When an application does not fully utilize all the available processors in the system; the application runs with fewer threads than the number of processors.
Prefetching can be done in an upward or downward direction.
All the software or a subset of the software that was executing when data was collected. This includes the application launched by the Analyzer and any child processes (modules of interest) the application launches.
A write request to a line in the cache updates only the data cache memory. Writes to lines in the memory are not immediately forwarded to main memory; instead they are accumulated in the cache. The modified cache line is written to main memory later when a write back occurs. Write-back operations are triggered when cache lines need to be deallocated, such as when new cache lines are being allocated in a cache which is already full. Write-back operations also are triggered by the mechanisms that are used to maintain cache consistency like snooping.
The write-back policy reduces bus traffic by eliminating unnecessary writes to memory.
Last updated: September 24, 2004