Glossary

A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z  

A

Active Threads

In a multi-threaded application, a thread is considered active when it is either running on a processor or waiting in the operating system (OS) scheduler run-queue.

A thread is not considered active when it is waiting for an external event (blocking), waiting for another thread, or spin waiting.

Back to Top


Activity

Activities are at the core of the data-collection process. They control collection of data. Within an Activity, you can specify the types of performance data you wish to collect. For each type of performance data, you need to configure the appropriate data collector.

Back to Top


Activity Results

Results of an Activity. Each time an Activity runs, its data collectors are launched and an Activity result is generated for each data collector.

Back to Top


Address Offset Calculation

Instructions with explicit memory operands reference the segment containing the operand and the displacement from the beginning of the segment to the operand.

The offset is calculated as follows:

Base + (Index * Scaling Factor) + Displacement

Where:

  • Base is the value in the base register

  • Index is the value in the index register

  • Scaling Factor is 1, 2, 4, or 8

  • Displacement is a constant encoded in the instruction

For example, in mov eax, [ebx + esi*4 + 8], EBX supplies the base, ESI supplies the index with a scaling factor of 4, and 8 is the displacement.

Back to Top


Address Generate Interlock (AGI) Conflict

On the Pentium(R) processor and on the Pentium processor with MMX(TM) technology support, an Address Generate Interlock (AGI) conflict occurs when a register that is used as the base or index component of an effective address calculation was the destination register of an instruction executed in the immediately preceding cycle.

On the Intel486(TM) processor, AGI conflicts occur when a register that is used as the base (not the index) component of an effective address calculation was the destination register of an instruction executed in the immediately preceding cycle.

Back to Top


Advice

A possible solution or workaround recommended (usually a suggestion to modify the code) to remove or avoid a performance problem.

Back to Top


Alignment

Alignment of data concerns all kinds of variables: dynamically allocated variables, members of a data structure, global or local variables, and parameters passed on the stack.

A misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium(R) 4 processor, and is 32 bytes in Pentium III and Pentium II processors. On the Pentium 4 processor, an access to data that is unaligned on a 64-byte boundary leads to two memory accesses and requires several micro-ops to be executed instead of one. Accesses that span either 16-byte or 64-byte boundaries are likely to incur a large performance penalty, since they are executed near retirement and can incur stalls that are on the order of the depth of the pipeline.

Assembly/Compiler Coding Rule 15. (H impact, H generality) Align data on natural operand size address boundaries.

For best performance, align data as follows:

  • Align 8-bit data at any address.

  • Align 16-bit data to be contained within an aligned 4-byte word.

  • Align 32-bit data so that its base address is a multiple of four.

  • Align 64-bit data so that its base address is a multiple of eight.

  • Align 80-bit data so that its base address is a multiple of sixteen.

  • Align 128-bit data so that its base address is a multiple of sixteen.

A 64-byte or greater data structure or array should be aligned so that its base address is a multiple of 64. Sorting data in decreasing size order is one heuristic for assisting with natural alignment. As long as 16-byte boundaries (and cache lines) are never crossed, natural alignment is not strictly necessary, though it is an easy way to enforce this.
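
As an illustration (not from the original text), the following minimal C sketch shows one way to request such alignment, assuming the Microsoft compiler's __declspec(align()) extension; the packet structure and its members are hypothetical:

    /* Hypothetical structure: members are sorted in decreasing size order
       to assist natural alignment, and the whole structure is aligned on
       a 64-byte boundary so that it starts at the beginning of a
       Pentium 4 cache line. */
    __declspec(align(64)) struct packet {
        double timestamp;   /* 8 bytes */
        int    id;          /* 4 bytes */
        short  flags;       /* 2 bytes */
        char   tag;         /* 1 byte  */
    };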

Back to Top


Application/Module Profile

A container of configuration information about the execution of the analyzed application and the modules to analyze.

Back to Top


Arguments

A parameter value used to invoke the source code advice.

Back to Top


Assist

When the hardware needs the assistance of micro-code to deal with some event, the machine takes an assist. One example of such a situation is an underflow condition in the input operands of a floating-point operation. The hardware must internally modify the format of the operands in order to perform the computation. Assists clear the entire machine of micro-ops before they begin to accumulate, and are costly. The assist mechanism on the Intel(R) Pentium(R) 4 processor is similar in principle to that on the Pentium II processors, which also have an assist event.

Back to Top


B

Basic Block

A group of consecutive instructions that has only one entry point and one exit point.

Back to Top


Base Register

A register used to compute an effective memory address. It may be any of the eight general registers: EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI. A base represents an indirect offset to an operand. The base register can be used alone (without an index or displacement) for dynamic storage of variables and data structures.

Back to Top


Blocking Time

In the Thread Profiler, time when the thread is not active due to a blocking operation performed while the thread was on the critical path. A blocking operation is one that blocks or stalls a thread when the thread invokes it, causing the thread to be inactive for the duration of the operation.

Examples of blocking operations are sleeps and waits on the operating system for I/O operations (disk or file access or user input) to complete. Time spent waiting for another thread or event (waiting for a signal) may also be classified as Blocking time if the signal was lost or the Thread Profiler does not know about the signal. If the User-Synchronization APIs were used, the thread might be actively spinning but would still be considered blocked while it waits for a signal that it lost or if it cancelled its prepare().

Back to Top


Bogus, Non-bogus, Retire Events

Branch mispredictions incur a large penalty on microprocessors with deep pipelines. In general, the direction of branches can be predicted with a high degree of accuracy by the front end of the Intel(R) Pentium(R) 4 processor, such that most computations can be performed along the predicted path while waiting for the resolution of the branch.

In the event of a misprediction, instructions and micro-ops that were scheduled to execute along the mispredicted path must be cancelled. These instructions and micro-ops are referred to as bogus instructions and bogus micro-ops. A number of Pentium 4 processor performance monitoring events, for example, instructions retired and micro-ops retired, can count instructions or micro-ops that are retired based on the characterization of bogus versus non-bogus.

In event description tables, the term bogus refers to instructions or micro-ops that must be cancelled because they are on a path taken from a mispredicted branch. The terms retired and non-bogus refer to instructions or micro-ops along the path that results in committed architectural state changes as required by the program execution. Thus instructions and micro-ops are either bogus or non-bogus, but not both.

Back to Top


Bottleneck

A bottleneck is an area in your code that slows down the execution of your application. Bottlenecks appear as hotspots in the Hotspot view. All bottlenecks are hotspots, but not all hotspots are bottlenecks. Remove the bottlenecks to optimize your code.

Back to Top


Branch Target Buffer (BTB)

A small cache where the processor stores information about previous branches. The processor uses the information stored in the BTB to predict whether a branch that it encounters will be taken or not taken.

Back to Top


Buffer Size

The size of the buffer used to store the samples collected during the sampling process. Buffer size is limited by the amount of real memory available during the monitoring process. Samples that occur while the buffer is being flushed are lost, since your program continues to run during the flush.

Back to Top


Bus Ratio

The ratio of the processor clock to the bus clock.

Back to Top


C

Cache

A small, fast memory which holds the active parts of a larger, slower memory. Caches raise system performance by satisfying an internal read request more quickly than a bus cycle to memory. They also reduce the processor's use of the external bus when the same locations are accessed multiple times.

Back to Top


Cache Miss

A cache miss is a request for memory that is not found in the cache. When a cache miss occurs, the processor fetches an entire cache line from a higher-level cache (L2) or, if necessary, from main memory, replacing the least recently used (LRU) line if the set is full. For cache line sizes on different processors, see L1 cache and L2 cache.

There are three types of cache miss: compulsory misses, capacity misses, and conflict misses.

Back to Top


Call Site

The location within a caller from where a call to a callee is made.

Back to Top


Callee

A child function that is called by the current function.  

Back to Top


Callee Threshold

The maximum number of child nodes to display.

Back to Top


Caller

A parent function that calls the current function.

Back to Top


Calls

The number of times this function was called.

Back to Top


Capacity Miss

A capacity miss is a request that misses the cache because the cache was full.

Back to Top


Casting

Accessing data as a type other than the one it was declared as, overriding the default data type. The compiler trusts you to do this, but the reinterpreted access might not work as expected.

Back to Top


Clocktick

Also referred to as a clock or a CPU cycle, a clocktick is the smallest unit of time recognized by the processor. The term is also used to indicate the time required by the processor to execute an instruction.  

Back to Top


Complex Instruction

An instruction that breaks down into more than four micro-ops, or is longer than seven bytes. Only decoder D0 can decode complex instructions. Complex instructions cannot be decoded in parallel with other instructions.

Examples of instructions that break down into more than four micro-ops are CALL, RET, CMPXCHG, LOCK-prefixed instructions, and CPUID.

Instructions with both an immediate operand and a displacement, as shown below, are also complex because they are longer than seven bytes:

movzx ebx, word ptr [ecx] + 077fb7426

lea ecx, dword ptr [esi*02] + 0

Back to Top


Compulsory Miss

A request that misses the cache because it is the first reference to a given line in memory.

Back to Top


Conflict Miss

A request that misses due to lack of space in a certain set. A set is a group of lines in the cache into which a given line in memory can be mapped. The number of lines per set depends on the associativity of a cache. A cache with two lines per set is called a two-way set associative cache.

Back to Top


Connected Components

Sub-graphs where an undirected path exists between all pairs of nodes. If no path exists between two nodes in a graph, then those two nodes must belong to different connected components.

Back to Top


Correctness

A correct multithreaded program is free of race conditions that could lead to non-deterministic results.

Back to Top


Counter

A feature used to count a single event that represents the state of an entity in the system, for example, an application, the OS, or a hardware device.

Back to Top


Counter Mask

The mask that determines when the event counter increments. This corresponds to bits 24 through 31 of the 32-bit Event Select register that controls the performance monitoring counter. When non-zero, the processor compares this mask to the count of events during a single cycle. If the event count is greater than or equal to this mask, the counter is incremented by one. Otherwise the counter is not incremented.

This mask can be used to count events only if there are multiple occurrences per clock (for example, two or more instructions retired per clock). If the counter-mask field is 0, the counter is incremented each cycle.

Back to Top


CPI

Cycles per instruction retired (CPI = clockticks / instructions retired). This event ratio is a basic measure of performance; a lower CPI generally indicates better performance.

Back to Top


Critical Path

The most time-consuming path or call sequence originating from the root.

Back to Top


Cross-modifying Mode

The act of one processor writing data into the currently executing code segment of a second processor with the intent of having the second processor execute that data as code.

Back to Top


D

Data Collector

The data collectors included in an Activity collect performance data. The different data collectors collect different types of performance data.

Back to Top


Data Race

A situation where the order in which two accesses occur is not guaranteed by the program, causing indeterminate results. A data race is due to two different threads accessing the same memory location at the same time without the proper synchronization needed to guarantee consistent results.
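
As a minimal sketch (not from the original glossary), the following C fragment for the Win32 API shows a data race on a hypothetical global counter; two threads running this routine perform unsynchronized read-modify-write updates, so the final value is indeterminate:

    #include <windows.h>

    LONG g_counter = 0;                 /* hypothetical shared variable */

    DWORD WINAPI Worker(LPVOID unused)
    {
        int i;
        for (i = 0; i < 100000; i++)
            g_counter++;                /* unsynchronized read-modify-write: a data race */
        return 0;
    }
    /* Replacing g_counter++ with InterlockedIncrement(&g_counter)
       makes the update atomic and removes the race. */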

Back to Top


Dataset

A set of data gathered as a result of invoking an instance of one data collector.

Back to Top


DCU

Data Cache Unit. This refers to the level 1 (L1) data cache in the processor.

Back to Top


Deadlock

A thread is deadlocked when it is waiting for a critical section, mutex, or thread handle that can never be acquired.

For example, if two threads each hold a different mutex and need the mutex held by the other thread to proceed, the threads are deadlocked since neither thread can give up the mutex it already holds nor acquire the second mutex.
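
To make the example concrete, here is a hedged C sketch using the Win32 API (the mutex handles and thread routines are hypothetical, not from the original text); each thread acquires the two mutexes in the opposite order, so the program deadlocks:

    #include <windows.h>

    HANDLE g_mutexA, g_mutexB;          /* hypothetical mutex handles */

    DWORD WINAPI Thread1(LPVOID unused)
    {
        WaitForSingleObject(g_mutexA, INFINITE);   /* acquire A first */
        Sleep(10);                                 /* widen the deadlock window */
        WaitForSingleObject(g_mutexB, INFINITE);   /* blocks: Thread2 holds B */
        ReleaseMutex(g_mutexB);
        ReleaseMutex(g_mutexA);
        return 0;
    }

    DWORD WINAPI Thread2(LPVOID unused)
    {
        WaitForSingleObject(g_mutexB, INFINITE);   /* acquire B first */
        Sleep(10);
        WaitForSingleObject(g_mutexA, INFINITE);   /* blocks: Thread1 holds A */
        ReleaseMutex(g_mutexA);
        ReleaseMutex(g_mutexB);
        return 0;
    }

    int main(void)
    {
        HANDLE t[2];
        g_mutexA = CreateMutex(NULL, FALSE, NULL);
        g_mutexB = CreateMutex(NULL, FALSE, NULL);
        t[0] = CreateThread(NULL, 0, Thread1, NULL, 0, NULL);
        t[1] = CreateThread(NULL, 0, Thread2, NULL, 0, NULL);
        WaitForMultipleObjects(2, t, TRUE, INFINITE);  /* never returns once both threads block */
        return 0;
    }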

Back to Top


Decoder

The part of the processor that breaks instructions into sequences of simple micro operations, known as micro-ops. Each micro-op has two logical sources and a logical destination.

Back to Top


Dependency

A relationship between two instructions such that one instruction uses the other's result or defines the other's operands. The compiler cannot reschedule dependent instructions.

For example, a compiler cannot move an instruction that loads a above an instruction that stores a, because the two instructions use the same variable a and are dependent on each other.
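
A minimal C illustration (the variable names are hypothetical, not from the original text):

    int dependent(int a, int b, int d)
    {
        int t = a * b;   /* defines t */
        int c = t + d;   /* uses t, so it cannot be scheduled ahead of the multiply */
        return c;
    }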

Back to Top


Denormalized Numbers

Tiny numbers represented by making the integer bit (and perhaps other leading bits) of the significand zero. The use of leading zeros with denormalized numbers allows smaller numbers to be represented. However, this denormalization causes a loss of precision (the number of significant bits in the fraction is reduced by the leading zeros). 

Back to Top


Disconnected Nodes

Nodes that have no edges connected to them.

Back to Top


Dynamic Branch Prediction

Every time a branch instruction is reached, the processor stores its history in the 512-entry branch target buffer (BTB). When the branch instruction is fetched again, the processor uses the information stored in the BTB to dynamically predict whether the branch will be taken or not taken.

Dynamic branch prediction is used to update the instruction pointer (IP) early (during instruction fetch, instead of during decoding). As a result, fewer instructions enter the pipeline that will need to be flushed.

Back to Top


E

Edge

A relationship between two functions where one (caller) is calling the other (callee).

Back to Top


Edge Time

Time (microseconds) contributed to the total time of a callee function when called from a specific caller function. The sum of the edge times of all the incoming edges to a function equals this function's total time.

Back to Top


Edge Total Time

The execution time (microseconds) of a particular function when called from a specific parent function.

Back to Top


Edge Wait Time

Time (microseconds) contributed to the total wait time of a callee function when called from a specific caller function. The sum of the edge wait times of all the incoming edges to a function equals this function's total wait time.

Back to Top


Event-based Sampling

Event-based sampling interrupts the processor after a specified number of events and collects samples containing the instruction addresses of the software modules executing on your system.

Back to Top


Event Skid

Event skid is the recording of an event not exactly on the code line that caused the event. Event skids may even result in a caller function event being recorded in the callee function.

Event skid is caused by a number of factors, such as the depth of the processor pipeline and the delay between the moment an event occurs and the moment the performance monitoring interrupt that records it is actually delivered.

The Intel(R) Pentium(R) 4 or Itanium(R) processors support accurate event location for some events. These events are called precise events.

Example: Interpreting Jump and Call Instructions

Events that happen during the execution of a jmp or call instruction may appear on an instruction that is one or two instructions away from the original jmp/call in the execution flow. In this example, the mov instruction at the top of the loop is not responsible for the 1.02% of the events, because the mov instruction is merely the target of the branch at the bottom of the loop. The real source of the events is the jmp instruction at the bottom of the loop.

Event %   Instructions

1.02%     top_of_loop:   mov ...

          ...            (any number of lines)

          end_of_loop:   jnz <to someplace>

                         jmp top_of_loop

Back to Top


Exception

A synchronous event that is the processor's response to certain conditions detected during the execution of an instruction. It can also be defined as a forced call to a procedure or a task, generated when the processor fails to interpret an instruction or when an INT n instruction is executed. Causes of exceptions include division by zero, stack overflow, undefined opcodes, and memory protection violations.

Back to Top


Execution Unit

For IA-32 processors, the names of the execution units in the execution core that are utilized to execute the micro-ops for each instruction. This information is provided only for IA-32 instructions that are decoded into no more than 4 micro-ops; micro-ops for instructions that decode into more than 4 micro-ops are supplied by the micro-code ROM.

Several execution units may share the same port, such as FP_ADD, FP_MUL, or MMX_SHFT in the FP_EXECUTE cluster.

Back to Top


Explicit AGI Conflict

An Address Generate Interlock (AGI) conflict occurs when a register that is used as the base or index component of an effective address calculation was the destination register of an instruction executed in the immediately preceding cycle. An explicit AGI conflict occurs when the previous instruction explicitly wrote to the destination register. For example, add eax, 3 explicitly writes to the eax register.

Back to Top


F

Focus Function

A function selected in the function summary, or in the graph view.

Back to Top


G

General Registers

Eight general registers that hold operands for logical and arithmetic operations: eax, ebx, ecx, edx, ebp, esp, esi, and edi.

Back to Top


H

Hotspot

A hotspot is a section of code within a module that took a long time to execute, resulting in a high amount of processor time spent executing that section of the module and thus generating a lot of samples for that module. The Hotspot view can display individual hotspots as they appear in memory (RVA or location), hotspots of active functions, or hotspots of active source files in the module, depending on the option you select.

Some hotspots may turn out to be bottlenecks, but other hotspots, by the very nature of their content, inevitably take a long time to execute. These sections of code may be essential to the running of your program. You cannot, therefore, remove such hotspots. Focus on the hotspots that are bottlenecks and can be removed through optimization of your code or system.

Back to Top


Hyper-Threading Technology

Intel(R) processors are superscalar, enabling them to execute more than one instruction per clock cycle. Intel Hyper-Threading technology extends this capability by making a single physical processor appear to be two logical processors that are capable of executing two threads in parallel. From a software perspective, operating systems and programs can schedule processes or threads as if they were running on two physical processors. From a microarchitecture perspective, instructions from two logical processors execute simultaneously on the shared resources of one physical processor, thus increasing overall resource utilization.

Operating environments can take advantage of Hyper-Threading technology in one of two ways.

First, applications can be multithreaded, allowing an individual application to execute on parallel threads in the processor at the same time. Operating systems like Microsoft* Windows* Server 2003 and Windows XP*, along with high-performance applications such as database engines, scientific computation programs, and multimedia design software, are currently multithreaded and are typically run in dual- or multi-processor environments.

Second, Hyper-Threading can be used in a multitasking environment where the system is running more than one application in parallel. In this instance, each application can be run as separate threads on the same processor increasing utilization of the execution units and overall platform performance.

Back to Top


I

Idle Thread

A thread is idle when it is waiting on an object such that execution of its next operation is delayed.

Back to Top


Instruction Scheduling

The process in which the compiler moves instructions and reorders them in order to allow instructions to execute in parallel and to avoid stalls or wasted cycles between two dependent instructions.

Back to Top


Insight

An observation about the performance of your code. It could indicate a potential performance problem that could be a bottleneck to your application's performance.

Back to Top


Instrumentation

The process of modifying a program so that dynamic information is recorded during program execution. Data collection routines, invoked at specific points in the execution of the target program, record runtime information. These routines provide information about the time spent in each function and the call sequence that led to a specific function.

Back to Top


L

L1 Cache

First-level on-die cache. The L1 cache is a small but very fast cache that has an instruction cache and a data cache, the first dedicated to caching instructions and the other to caching data.

L1 Instruction and Data Cache Characteristics:

L1 Instruction Cache

  • Pentium(R) 4 processors: not implemented.

  • P6 family and Pentium processors: 8 or 16 KBytes, 4-way set associative, 32-byte cache line size; 2-way set associative for earlier Pentium processors.

L1 Data Cache

  • Pentium 4 processors: 8 KBytes, 4-way set associative, 64-byte cache line size.

  • P6 family processors: 16 KBytes, 4-way set associative, 32-byte cache line size; 8 KBytes, 2-way set associative for earlier P6 family processors.

  • Pentium processors: 16 KBytes, 4-way set associative, 32-byte cache line size; 8 KBytes, 2-way set associative for earlier Pentium processors.

Note

L2 cache misses have a much greater impact on performance than L1 cache misses, and L2 cache problems are often easier to solve than L1 cache problems.

Back to Top


L2 Cache

Second-level cache. While the L1 cache is an on-chip cache, the second-level cache (L2 cache) is located between the on-chip cache and the main memory. It is located on a separate die and closely coupled to the CPU die by means of a dedicated full clock-speed bus.

If an access misses the L1 cache, the access passes on to the bigger, but slower L2 cache. The L2 cache is a unified cache of both instructions and data. If the access misses the L2 cache, it goes to the main memory.

L2 Cache Characteristics:

L2 Unified Cache

  • Pentium(R) 4 processors: 256 KBytes, 8-way set associative, sectored, 64-byte cache line size.

  • P6 family processors: 128 KBytes, 256 KBytes, 512 KBytes, 1 MByte, or 2 MBytes; 4-way set associative, 32-byte cache line size.

  • Pentium processors: system specific, typically 256 or 512 KBytes, 4-way set associative, 32-byte cache line size.

Back to Top


Latency

The number of clock cycles an instruction takes to calculate its result.

Back to Top


Line

The smallest unit of storage that can be allocated in a cache. When the processor fills a cache line, it loads the entire cache line, using multiple read cycles to main memory.

For cache line lengths for different processors, see L1 cache and L2 cache.

Back to Top


Load

Loads on unaligned data reduce performance. Loads on data that crosses cache line boundaries reduce performance even more.

Back to Top


Lock

A synchronization object with ownership. For example, a critical section or mutex can be owned by a thread, in contrast to a semaphore, which is an un-owned object.

Back to Top


Locking Hierarchy

The order in which a thread acquires multiple locks to be held simultaneously. All threads that acquire the same locks should acquire them in the same order to avoid deadlock.
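
As a hedged C sketch of the rule (Win32 API; the handle names are hypothetical), every thread takes lock A before lock B, so the circular wait shown under Deadlock cannot arise:

    #include <windows.h>

    HANDLE g_lockA, g_lockB;   /* hypothetical; created elsewhere with CreateMutex */

    DWORD WINAPI Worker(LPVOID unused)
    {
        WaitForSingleObject(g_lockA, INFINITE);   /* always A first... */
        WaitForSingleObject(g_lockB, INFINITE);   /* ...then B, in every thread */
        /* ... touch data guarded by both locks ... */
        ReleaseMutex(g_lockB);                    /* release in reverse order */
        ReleaseMutex(g_lockA);
        return 0;
    }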

Back to Top


Loop Blocking

Loop blocking is structuring code so that it performs several operations on blocks of data of a large matrix (while the data is still in the cache), instead of trying to process each operation separately on all the elements, one at a time.

This enables more work to be done on the same data while it is still in the cache.
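
A hedged C sketch of the idea, using a matrix transpose with hypothetical sizes (N and BLOCK are illustrative and would be tuned to the cache):

    #define N     1024
    #define BLOCK 32    /* hypothetical tile size; choose so a tile fits in cache */

    /* Process the matrix in BLOCK x BLOCK tiles so each tile is fully
       worked on while its data is still resident in the cache. */
    void transpose_blocked(double dst[N][N], double src[N][N])
    {
        int ii, jj, i, j;
        for (ii = 0; ii < N; ii += BLOCK)
            for (jj = 0; jj < N; jj += BLOCK)
                for (i = ii; i < ii + BLOCK; i++)
                    for (j = jj; j < jj + BLOCK; j++)
                        dst[j][i] = src[i][j];
    }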

Back to Top


Loop Invariant

A variable value that does not change within a loop.
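
For example (hypothetical C code, not from the original text):

    void scale_all(double *x, int n, double scale)
    {
        int i;
        for (i = 0; i < n; i++)
            x[i] *= scale / 2.0;   /* scale / 2.0 is loop invariant: it has the
                                      same value on every iteration and can be
                                      hoisted out of the loop */
    }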

Back to Top


LRU

Least recently used.

Back to Top


M

Max Path

The path that takes the maximum time.

Back to Top


Max Path to Bottom

The most time-consuming path that leads from the function down to the end of the graph.

Back to Top


Max Path to Function

The most time-consuming path that leads from the root down to the function.

Back to Top


Max Path to Root

The most time-consuming path that leads from the function up to the root.

Back to Top


Memory Order Buffer (MOB)

A memory subsystem that acts as a reservation station and a reorder buffer. It holds suspended loads and stores, re-dispatching them when the blocking condition (dependency or resource) disappears.

Back to Top


Merge

The combination of two or more VTune(TM) Performance Analyzer Activity results into one virtual Activity result.

Back to Top


Micro-ops (Uops)

A decoded series of micro-operations produced by processor decoders from a stream of Intel(R) architecture instructions. Every assembly instruction can break into one or more micro-ops.

Back to Top


MMX(TM) technology

The MMX instructions are useful for multimedia and communications software. The eight MMX registers support execution of single-instruction, multiple-data (SIMD) operations on 64-bit packed byte, word, and doubleword integers.

For more information on the MMX technology, refer to the IA-32 Intel(R) Architecture Software Developer's Manual, Volume 1. See the Intel(R) Processor Information on the Web for more details.

Back to Top


Module Buffer Size

The data collector temporarily stores the module records (executables and DLLs that are loaded) during the sampling process in the module buffer. When the buffer is full, the data collector flushes the buffer to the result file. A small Module Buffer Size can cause the data collector to frequently flush the buffer to the file, which can negatively impact system performance.

Back to Top


Multi-threaded

A multi-threaded program allows different parts of a program to run concurrently. Operating systems like Microsoft* Windows* Server 2003 and Windows XP* are multi-threaded as are many scientific applications. When creating an application, developers must create code so that the threads can run independently and will not interfere with each other.

Back to Top


Mutex

A synchronization object used to allow multiple threads to serialize their access to shared data. The name derives from the capability it provides; namely, mutual-exclusion. The thread that has locked a mutex becomes its owner and remains the owner until that same thread unlocks the mutex.

Back to Top


N

Non-bogus Events

See bogus, non-bogus, retire events.

Back to Top


Non-intrusive

A process that does not slow down the applications running on a system. For example, sampling is non-intrusive since it takes less than 5% of processor time during an Activity. It also does not modify binary files or executables (no instrumentation of code as in call graph).

Back to Top


O

Object

A specific collection of counters that relate to a specific application, OS subsystem, or hardware device. A software object is a collection of counters that originate from a software component of the system such as an OS subsystem. A hardware object is a collection of counters tracked by the counter registers on a hardware device in the system such as the microprocessor. Each object has a unique set of counters associated with it.

Back to Top


Over-Utilization of Processor

A condition in which an application has more active threads than available processors.

Back to Top


P

Pipelining

A microarchitecture design technique that divides the execution of an instruction into sequential steps, using different microarchitectural resources at each step. Pipelined machines have multiple Intel(R) architecture instructions executing at the same time, but at different stages in the machine.

The Pentium(R) and Pentium Pro family processors achieve high execution speed through the superscalar architecture. The Pentium processor has two execution units (U and V pipes) and can process two instructions per clock, so it is superscalar level 2. The Pentium Pro processor can dispatch and retire three instructions per clock so it is superscalar level 3.

Back to Top


Precise Events

Precise events are events for which the exact instruction addresses that caused the event are available. Some events for the Pentium(R) 4 processor are precise events.

Back to Top


Prefetching

The Intel NetBurst(R) microarchitecture supports three prefetching mechanisms:

The first mechanism is a hardware instruction fetcher that automatically prefetches instructions. The second is a software-controlled mechanism that fetches data into the caches using the prefetch instructions. The third is a hardware mechanism that automatically fetches data and instructions into the unified second-level cache. The hardware instruction fetcher reads instructions, along the path predicted by the BTB, into the instruction streaming buffers. Data is read in 32-byte chunks starting at the target address.

Back to Top


Prefix

Prefixes are one or more bytes that precede an instruction and modify the operation of the instruction.

Back to Top


Private Thread Variable

A variable (or memory address) that is designed to be visible to only one thread. For example, a variable declared on a thread's stack is usually visible only to the thread that declared it.

Back to Top



Process

An executing application that has its own reserved virtual address space. It is an intrinsic combination of code, data, and several operating system resources. During execution, it can contain one or more threads that run in parallel.

Back to Top


Project

Contains Activities and Activity results. The project is contained in a .vpj project file.

Back to Top


R

Recursive Functions

A recursive function is a function that directly or indirectly calls itself.

The timing information formulas:

Total [Wait] Time = Self [Wait] Time + SUM (Outgoing Edge [Wait] Time)

Total [Wait] Time = SUM (Incoming Edge [Wait] Time)

do not work for functions found on a recursive path, because the values calculated on edges that form a recursive chain accumulate time from the circulating recursive calls. To find out whether a function is on a recursive chain, highlight Recursive functions on the filter toolbar.

For recursive functions, the following statements are true:

Total [Wait] Time <= Self [Wait] Time + SUM (Outgoing Edge [Wait] Time)

Total [Wait] Time <= SUM (Incoming Edge [Wait] Time)

Back to Top


Relative Virtual Address

The address of a function relative to the base address of the module that contains it.

Back to Top


Relevance Score

A heuristic that indicates how relevant a particular insight or advice is to the current context. For instance, an extremely high relevance score for an insight may indicate a high probability of a performance bottleneck.

Back to Top


Replay

In order to maximize performance for the common case, the Intel NetBurst(R) microarchitecture sometimes aggressively schedules micro-ops for execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, micro-ops must be reissued. This mechanism is called replay.

Replays can be caused by cache misses, dependence violations (for example, store forwarding problems), and unforeseen resource constraints. In normal operation, some number of replays is common and unavoidable. An excessive number of replays indicates a performance problem.

Back to Top


Resteer

A resteer is a change in the Instruction Pointer (IP) beyond the normal increment to the next instruction bundle. Normally the IP increments to the next sequential instruction bundle. The IP is used to fetch instructions from the cache and/or memory; the instructions are then placed into the instruction pipeline for processing.

When a branch instruction is detected and predicted taken or if an exception or interrupt occurs, the IP is changed to a new value and instruction fetching begins at this new IP value. This change in IP to a different address that is not the next sequential instruction bundle is called a resteer.

Back to Top


Retire Events

See bogus, non-bogus, retire events.

Back to Top


S

Sampling Interval

The interval at which the VTune(TM) Performance Analyzer interrupts the processor to collect samples.

Back to Top


Segment Register

Registers that hold segment selectors associated with different forms of memory access. There are six registers (CS, DS, SS, ES, FS, and GS) that determine, at any given time, which segments of memory are currently available.

Back to Top


Self Time

Time (microseconds) spent inside a function, including time spent waiting between execution activities. It does not include time spent in calls to other instrumented functions.

Back to Top


Self Wait Time

The number of microseconds spent inside a function while its thread/fiber was non-active (suspended). Self Wait time does not include the non-active time spent in callees of the function.

The call graph calculation of the Self_Wait_Time is based on a heuristic estimation that tracks context switches caused by synchronization events and by other causes.

Back to Top


Semaphore

A signal used to govern access to shared system resources.

Back to Top


Shared Thread Variable (Memory Address)

Variables or memory addresses that are designed to be visible to more than one thread. For example, global variables are usually visible to all threads, and each thread may read or write the shared variable (address).
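
A brief C sketch contrasting shared and private variables (Win32 API; the names are hypothetical, not from the original text):

    #include <windows.h>

    LONG g_total = 0;                   /* shared: visible to every thread */

    DWORD WINAPI Worker(LPVOID unused)
    {
        LONG local_sum = 0;             /* private: lives on this thread's stack */
        /* ... accumulate into local_sum; no synchronization needed ... */
        InterlockedExchangeAdd(&g_total, local_sum);  /* one atomic update of the shared total */
        return 0;
    }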

Back to Top


Single-Instruction, Multiple Data (SIMD)

The three SIMD technologies that have been added to the IA-32 architecture are, in chronological order:

  • MMX(TM) technology

  • Streaming SIMD Extensions (SSE)

  • Streaming SIMD Extensions 2 (SSE2)

Back to Top


Snooping

A mechanism that checks whether the address presented is in the cache of the processor. If the address hits a modified line in the data cache, the modified line is written back onto the bus.

Back to Top


Stalled Thread

Threads that are prevented from running, typically because they are waiting for locks to be released or for other operations to complete.

Back to Top


Start Delay

The number of seconds for which sampling is delayed after you start an Activity.

Back to Top


Streaming SIMD Extensions (SSE)

The SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.

For more information on the Streaming SIMD Extensions, refer to the IA-32 Intel(R) Architecture Software Developer's Manual, Volume 1.

SSE was followed by Streaming SIMD Extensions 2 (SSE2).

Back to Top


Synchronization Delay

Wait time during which threads do not perform any work because locks are held by other threads: the time a thread spends in the OS scheduler's blocked queue waiting for an opportunity to run. The opportunity becomes available when the thread currently holding the lock releases it.

Back to Top


Synchronization Object

An object, such as a mutex, critical section, semaphore, event, or thread handle, used to control or limit access to a section of code. In the Windows* API, synchronization objects, except critical sections, are referenced by a HANDLE.

Back to Top


T

Tagging

Tagging is a means of marking micro-ops to be counted at retirement. See Appendix A of the IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming for the description of the tagging mechanisms. See the Intel(R) Processor Information on the Web for more details.

The same event can happen more than once per micro-op. The tagging mechanisms allow a micro-op to be tagged once during its lifetime. The retired suffix is used for metrics that increment a count once per micro-op, rather than once per event. For example, a micro-op may encounter a cache miss more than once during its lifetime, but a Misses Retired metric (for example, 1st-Level Cache Misses Retired) will increment only once for that micro-op.

Back to Top


Target Number of Samples

The target number of samples is calculated by multiplying the samples per second (the inverse of the Sampling Interval) by the Duration. For example, a 1 ms sampling interval (1000 samples per second) and a duration of 20 seconds give a target of 20,000 samples.

Back to Top


Thread

A thread is part of a program that can be run independently of other aspects of the program.

Back to Top


Thread Independent (TI)

This event is logical processor independent. The sample count measured is the total number of events that occurred on the physical processor package. The event interrupt for thread independent events is set on logical processor 0. Sample counts for the total system are accurate by physical processor package. However, sample counts at a particular IP address at the process, module, function, or source code level may not be accurate. That is because sample counts on logical processor 1 at a particular IP address will be counted on logical processor 0, which is very likely to be at a different IP address. The IP address on logical processor 0 is recorded as the location of the event. If you use TI events, be aware of this when reviewing data below the system level.

Back to Top


Thread Independent due to ESCR Restriction (TI-E)

This event is logical processor specific. However, due to an ESCR restriction that prevents both logical processors from being sampled in the same run, the event has been programmed to be thread independent. The sample count measured is the total number of events that occurred on the physical processor package. The event interrupt for thread independent events is set on logical processor 0. Sample counts for the total system are accurate by physical processor package. However, sample counts at a particular IP address at the process, module, function, or source code level may not be accurate. That is because sample counts on logical processor 1 at a particular IP address will be counted on logical processor 0, which is very likely to be at a different IP address. The IP address on logical processor 0 is recorded as the location of the event. If you use TI-E events, be aware of this when reviewing data below the system level.

Back to Top


Throughput

For IA-32 processors, number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many IA-32 instructions, the throughput of an instruction can be significantly less than its latency.

Back to Top


Time-based Sampling (TBS)

A form of sampling that collects information at a regular time interval and allows you to monitor all active software on your system, including the operating system, device drivers, and application software.

Back to Top


Top of Stack (TOS)

The floating-point register stack contains eight numeric registers, numbered ST(0) through ST(7). ST(0) is also called TOS. Most floating-point instructions require at least one operand in the TOS.

Back to Top


Total Time

Time (microseconds) elapsed from the time a function starts execution until it terminates. This is the sum of the function's Self Time and the Total Time of all its callees.

Back to Top


Total Wait Time

This column shows the total of the Self_Wait_Time of the function and the Self_Wait_Time of all its children.

Back to Top


Translation Lookaside Buffer (TLB)

On-chip caches used to store the most recently used page table entries. Separate TLBs exist for data (DTLB) and instruction caches (ITLB). Most paging is done using the contents of the TLB.

Back to Top


U

U pipe

One of the two integer pipelines on the Pentium(R) processor. The U pipe can execute any instruction in the Intel(R) architecture.

Back to Top


Under-Utilization of Processor

When an application does not fully utilize all the available processors in the system; the application runs with fewer threads than the number of processors.

Back to Top


Upward and Downward Prefetches

Prefetching can be done in either an upward or a downward direction.

Back to Top


W

Workload

All the software or a subset of the software that was executing when data was collected. This includes the application launched by the Analyzer and any child processes (modules of interest) the application launches.

Back to Top


Write Back

A write request to a line in the cache updates only the data cache memory. Writes to cache lines are not immediately forwarded to main memory; instead, they are accumulated in the cache. The modified cache line is written to main memory later, when a write back occurs. Write-back operations are triggered when cache lines need to be deallocated, such as when new cache lines are being allocated in a cache that is already full. Write-back operations are also triggered by the mechanisms used to maintain cache consistency, such as snooping.

The write-back policy reduces bus traffic by eliminating unnecessary writes to memory.

Back to Top


Last updated: September 24, 2004