NVIDIA Compute Visual Profiler Version 4.0

Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050


Notice

BY DOWNLOADING THIS FILE, USER AGREES TO THE FOLLOWING:

ALL NVIDIA SOFTWARE, DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. These materials supersede and replace all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright (C) 2007-2011 by NVIDIA Corporation. All rights reserved.

PLEASE REFER TO EULA.txt FOR THE LICENSE AGREEMENT FOR USING NVIDIA SOFTWARE.

List of supported features:

Execute a CUDA or OpenCL program (referred to as a Compute program in this document) with profiling enabled and view the profiler output as a table. The table has the following columns for each GPU method:

Please refer to the "Interpreting Profiler Counters" section below for more information on profiler counters. Note that profiler counters are also referred to as profiler signals.

Display the summary profiler table. It has the following columns for each GPU method:
Display various kinds of plots:
Compare profiler output for multiple program runs of the same program or for different programs.

Each program run is referred to as a session.

Save profiling data for multiple sessions. A group of sessions is referred to as a project.

Import/Export Compute Profiler data in CSV format.

Description of different plots:

GPU time summary plot:
GPU time height plot:
It is a bar diagram in which the height of each bar is proportional to the GPU time for a method, and a different bar color is assigned to each method. A legend shows the color assignment for the different methods. The width of each bar is fixed and the bars are displayed in the order in which the methods are executed. When the "fit in window" option is enabled, the display is adjusted so as to fit all the bars in the displayed window width; in this case bars for multiple methods can overlap. The overlapped bars are displayed in decreasing order of height so that all the different bars remain visible. When the "Show CPU Time" option is enabled, the CPU time is shown as a bar in a different color on top of the GPU time bar. The height of this bar is proportional to the difference between the CPU time and the GPU time for the method.
GPU time width plot:
It is a bar diagram in which the width of each bar is proportional to the GPU time for a method, and a different bar color is assigned to each method. A legend shows the color assignment for the different methods. The bars are displayed in the order in which the methods are executed. When time stamps are enabled, the bars are positioned based on the time stamp. The height of each bar is based on the option chosen:
  1. Fixed height: the height is the same for all bars.
  2. Height proportional to instruction issue rate: the instruction issue rate for a method is equal to the profiler "instructions" counter value divided by the GPU time for the method. This plot is available only if the "instructions" counter is present in the context.
  3. Height proportional to uncoalesced load + store rate: the uncoalesced load + store rate for a method is equal to the sum of the profiler "gld uncoalesced" and "gst uncoalesced" counter values divided by the GPU time for the method. This plot is available only if both of these counters are present in the context.
  4. Occupancy: the height of each bar is proportional to the occupancy for the method.
In case of multiple streams or multiple devices the "Split Options" can be used.
  1. No Split: A single horizontal group of bars is displayed. Even in the case of multiple streams or multiple devices the data is displayed in a single group.
  2. Split on Device: In case of multiple devices one separate horizontal group of bars is displayed for each device.
  3. Split on Stream: In case of multiple streams one separate horizontal group of bars is displayed for each stream.
Profiler counter bar plot:
It is a bar plot of the profiler counter values for a method from the profiler output table or the summary table. There is one bar for each profiler counter. The bars are sorted in decreasing order of counter value, and bar length is proportional to the counter value.
Profiler output table column bar plot:
It is a bar plot of any column of values from the profiler output table or the summary table. There is one bar for each row in the table. The bars are sorted in decreasing order of column value, and bar length is proportional to the column value.
Comparison summary plot:
This plot can be used to compare GPU time summary data for two sessions. The Base Session is the session with respect to which the comparison is done, and the other session selected for comparison is called the Compare Session. GPU times for matching kernels from the two sessions are shown as a group. For each matched kernel from the Compare Session, the percentage increment or decrement with respect to the Base Session is displayed at the right end of the bar. After all the matched pairs, the GPU times of the unmatched kernels are shown. At the bottom, two bars with the total GPU times for the two sessions are shown.
Device level summary plot:
There is one bar for each method. The bars are sorted in decreasing order of GPU time, and bar length is proportional to the cumulative GPU time for a method across all contexts for a device.
Session level summary plot:
There is one bar for each device. Bar length is proportional to GPU utilization, which is the ratio of the time during which the GPU was actually executing some method to the total time interval from GPU start to end. The values are presented as percentages; for example, if methods execute for a total of 8 ms within a 10 ms interval, the GPU utilization is 80%.

Steps for sample computeprof usage:


Sample1:



Sample2:

PROFILER DATA ANALYSIS

The analysis feature provides performance analysis of the application at various levels.

Context Level Analysis

Kernel Level Analysis

To view the kernel analysis for any kernel, double-click the kernel name in the summary table. A new pop-up window analyzes that particular kernel in greater detail, as described below:

Session Level Analysis

Device Level Analysis

Integrated CUDA and OpenCL profiler

Compute Visual Profiler can be used for profiling both CUDA and OpenCL applications. The Session settings dialog shows options in CUDA terminology. Most of the options are common to and supported for both CUDA and OpenCL, except for the following: The type of a session, "CUDA" or "OPENCL", is shown within square brackets after the session name, e.g. Context_0 [CUDA] or Context_1 [OPENCL]. The column names in the profiler table or the summary table for a context are displayed based on the compute language of the context: for a CUDA context CUDA terminology is used, and for an OpenCL context OpenCL terminology is used.
A project can contain a mix of CUDA program profiling sessions and OpenCL program profiling sessions. To distinguish such projects from old projects, a new project file extension '.cvp' is used. Support for old projects is still provided: you can open an old CUDA project (with file extension '.cpj') or an old OpenCL project (with file extension '.oclpj'), but when you save such an old project it will be saved in the new format (with file extension '.cvp').
The following is a mapping from C for CUDA terminology to OpenCL terminology:
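
For reference, commonly cited correspondences between the two vocabularies include the following (a partial list based on general CUDA/OpenCL usage, not necessarily the profiler's exact table):

    C for CUDA term        OpenCL term
    thread                 work-item
    thread block           work-group
    grid                   NDRange
    shared memory          local memory
    local memory           private memory
    stream                 command queue
    kernel launch          kernel enqueue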

CUDA API Trace

The CUDA API trace is useful for understanding the CPU-side overhead of CUDA driver API calls, and specifically the overhead involved in each kernel launch and memory transfer request. Capture of CUDA driver API calls can be enabled by selecting "API trace" in the "Session settings" dialog. To view the CUDA API trace for a context, first select the context in the Sessions tree view, then right-click and select the "CUDA API trace" option in the pop-up menu. Alternatively, you can use the main menu option "View->CUDA API trace". The API trace view displays two horizontal rows of bars. The top row shows the GPU methods and the bottom row shows the CUDA driver API functions. Each GPU method or API call is represented by a bar with width proportional to its execution time. The bars are laid out in time order along the horizontal direction based on their start times. A different color is assigned to each GPU method, and all API calls are shown in the same color. A legend shows the color assignment for the different GPU methods and for the APIs. The attributes of a GPU method or an API call can be viewed by pointing the cursor at its bar. The following attributes are displayed for a CUDA driver API call:
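
For orientation, the following is a minimal sketch of a driver API sequence whose calls would each appear as bars in the bottom row of the trace; error checking is omitted, and the module file "kernel.ptx" and kernel name "vecAdd" are hypothetical:

    #include <cuda.h>   /* CUDA driver API */

    int main(void)
    {
        CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
        CUdeviceptr d_buf;
        float h_buf[256] = {0};
        int n = 256;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoad(&mod, "kernel.ptx");            /* hypothetical PTX file */
        cuModuleGetFunction(&fn, mod, "vecAdd");     /* hypothetical kernel   */
        cuMemAlloc(&d_buf, sizeof(h_buf));
        cuMemcpyHtoD(d_buf, h_buf, sizeof(h_buf));   /* memory transfer       */

        void *args[] = { &d_buf, &n };
        cuLaunchKernel(fn, 1, 1, 1, 256, 1, 1,       /* kernel launch         */
                       0, 0, args, 0);
        cuCtxSynchronize();

        cuMemFree(d_buf);
        cuCtxDestroy(ctx);
        return 0;
    }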

Enable or Disable profiling while application is running

For a long-running application, profiling can be interactively enabled or disabled while the application is running. Before launching the application, profiling can be enabled or disabled using the main menu option, the tool bar option, or the checkbox on the Session settings dialog. By default profiling is enabled at application start. After the application is launched and running, profiling can be enabled or disabled using the main menu option or the tool bar option. When viewing the width plot, idle time gaps are shown on the time line for the periods when profiling was disabled.
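
Separately from the interactive toggle described above, an application can also restrict profiling to a region of interest programmatically. Below is a minimal sketch using the CUDA runtime's profiler control calls; this is a general CUDA facility, and its interaction with the Visual Profiler GUI toggle is an assumption not covered by this document:

    #include <cuda_profiler_api.h>   /* cudaProfilerStart / cudaProfilerStop */

    void profile_region_of_interest(void)
    {
        cudaProfilerStart();   /* begin collecting profiler data             */
        /* ... launch the kernels to be profiled ... */
        cudaProfilerStop();    /* later work is excluded from the profile    */
    }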

Description of computeprof GUI components:

The top line shows the main menu options: File, Profile, Session, Options, Window and Help. See the description below for details on the menu options.

The second line has 4 groups of tool bar icons.

The left vertical window lists all the sessions in the current project as a tree with three levels: sessions at the top level, devices under a session at the next level, and contexts under a device at the lowest level.

Summary session information is displayed when a session is selected in the tree view.

Summary device information is displayed when a device is selected in the tree view.

Right-clicking on a session item or a context item in the tree view brings up the context-sensitive menus. See the description below for details on the menu options.

Session context menu.

Session->Device->Context context menu.
The right workspace area contains windows, including a tabbed window for each session, for each device in a session, and for each context of a device.
The different windows for each context are shown as different tabs:
Table Header context menu, for Profiler Output table and Summary table.
Output window - Appears at the bottom when asked to display. It shows the standard output and standard error of the Compute program that is run, along with some additional status messages.

Main menu

Tool bars

Dialogs

Session list context menu :

Session->Device context menu :

Profiler table context menu :

Profiler counters

  • Interpreting profiler counters


  • The performance counter values do not correspond to individual thread activity. Instead, these values represent events within a thread warp. For example, a divergent branch within a thread warp will increment the divergent_branch counter by one, so the final counter value stores information for all divergent branches in all warps. In addition, the profiler can only target one of the multiprocessors in the GPU, so the counter values will not correspond to the total number of warps launched for a particular kernel. For this reason, when using the performance counter options in the profiler the user should always launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. In practice, for consistent results it is best to launch at least twice as many blocks as there are multiprocessors on the device on which you are profiling. For the reasons listed above, users should not expect the counter values to match the numbers one would get by inspecting kernel code. The values are best used to identify relative performance differences between unoptimized and optimized code. For example, if the profiler reports N non-coalesced global loads for the initial version of the program, it is easy to see whether the optimized code produces fewer than N non-coalesced loads. In most cases the goal is to make N go to 0, so the counter value is useful for tracking progress toward this goal.

    Note that the counter values for the same application can differ across runs, even on the same setup, since they depend on the number of thread blocks executed on each multiprocessor. For consistent results it is best to make the number of blocks for each kernel launch equal to, or a multiple of, the total number of multiprocessors on the compute device. In other words, when profiling, the grid configuration should be chosen such that all the multiprocessors are uniformly loaded, i.e. the number of blocks launched on each multiprocessor is the same and the amount of work of interest per block is also the same. This results in better accuracy of extrapolated counts (such as memory and instruction throughput) and provides more consistent results from run to run.

    In every application run only a few counter values can be collected; the number of counters depends on the specific counters selected. Visual Profiler executes the application multiple times to collect all the counter values. Note that if the number of blocks in a kernel is less than, or not a multiple of, the number of multiprocessors, the counter values across multiple runs will not be consistent.
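
    A minimal sketch of sizing the grid as a multiple of the multiprocessor count, as recommended above (the kernel, its block size of 256 threads, and the factor of 4 blocks per multiprocessor are hypothetical choices):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data)
    {
        /* hypothetical kernel body */
    }

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        /* Launch a whole multiple of the SM count so that every
           multiprocessor, including the one the profiler samples,
           receives the same number of blocks. */
        int blocksPerSM = 4;   /* at least 2, per the guidance above */
        int numBlocks   = blocksPerSM * prop.multiProcessorCount;

        float *d_data;
        cudaMalloc(&d_data, (size_t)numBlocks * 256 * sizeof(float));
        myKernel<<<numBlocks, 256>>>(d_data);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }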

  • Profiler counters for a single multiprocessor (SM)


  • These counter values are a cumulative count for all thread blocks that were run on one multiprocessor. Note that the multiprocessor SIMT (single-instruction, multiple-thread) unit creates, manages, schedules, and executes threads in groups of 32 threads called warps. These counters are incremented by one per warp; for example, a branch taken by every thread of a 256-thread block increments the "branch" counter by 8 (one per warp).

  • Profiler counters for all multiprocessors in a Texture Processing Cluster (TPC)


  • These counter values are a cumulative count for all thread blocks that were run on the multiprocessors within one Texture Processing Cluster (TPC). Note that there are two multiprocessors per TPC on compute devices with compute capability less than 1.3, three multiprocessors per TPC on compute devices with compute capability 1.3, and one multiprocessor per TPC on compute devices with compute capability 2.0.

    When simultaneous global memory accesses by threads in a half-warp (during the execution of a single read or write instruction) can be combined into a single memory transaction of 32, 64, or 128 bytes, it is called a coalesced access. If the global memory accesses by all threads of a half-warp do not fulfill the coalescing requirements, it is called a non-coalesced access: a separate memory transaction is issued for each thread and throughput is significantly reduced. The coalescing requirements on devices with compute capability 1.2 and higher are different from those on devices with compute capability 1.0 or 1.1; refer to the CUDA Programming Guide for details. The profiler counters related to global memory count the number of global memory accesses or memory transactions, and they are not per warp; they provide counts for all global memory requests initiated by warps running on a TPC.
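
    As an illustration (a sketch; the kernel names and the stride parameter are hypothetical), on a compute capability 1.0/1.1 device the first kernel below produces coalesced loads while the second, with a sufficiently large stride, issues a separate transaction per thread, which is visible in the "gld coalesced" and "gld uncoalesced" counters:

    /* Coalesced: consecutive threads of a half-warp read consecutive
       4-byte words, which combine into one 64-byte transaction. */
    __global__ void copyCoalesced(const float *in, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid];
    }

    /* Non-coalesced: a large stride scatters the half-warp's addresses,
       so a separate memory transaction is issued for each thread. */
    __global__ void copyStrided(const float *in, float *out, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid * stride];
    }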

  • Normalized counter values


  • When the "Normalize counters" option is selected all counter values are normalized and per block counts are shown. This option is currently supported only for compute devices with compute capability less than 2.0. In the following cases the counter value is set to zero: If any counter value is set to zero a warning is displayed at the end of the application profiling.

    With "Normalize counters" option enabled more number of application runs are required to collect all counter values compared to when the "Normalized counters" option is disabled.

    Also when "Normalize counters" option is enabled the "cta launched" and "sm cta launched" columns are not shown in the profiler table.

  • Supported profiler counters


  • This table lists all the profiler counters which are supported.
    Counter Description Type 1.0 1.1 1.2 1.3 2.0 2.1
    branch Number of branches taken by threads executing a kernel. This counter will be incremented by one if at least one thread in a warp takes the branch. Note that barrier instructions (__syncthreads()) also get counted as branches. SM Y Y Y Y Y Y
    divergent branch Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. The counter will be incremented by one at each point of divergence in a warp. SM Y Y Y Y Y Y
    instructions Number of instructions executed. SM Y Y Y Y N N
    warp serialize If two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. This counter gives the number of thread warps that serialize on address conflicts to either shared or constant memory. SM Y Y Y Y N N
    sm cta launched Number of thread blocks launched on a multiprocessor. SM Y Y Y Y Y Y
    gld uncoalesced Number of non-coalesced global memory loads. TPC Y Y N N N N
    gld coalesced Number of coalesced global memory loads. TPC Y Y N N N N
    gld request Number of global memory load requests. On devices with compute capability 1.3 enabling this counter will result in increased counts for the "instructions" and "branch" counter values if they are also enabled in the same application run. TPC N N Y Y Y Y
    gld 32 byte Number of 32 byte global memory load transactions. This increments by 1 for each 32 byte transaction. TPC N N Y Y N N
    gld 64 byte Number of 64 byte global memory load transactions. This increments by 1 for each 64 byte transaction. TPC N N Y Y N N
    gld 128 byte Number of 128 byte global memory load transactions. This increments by 1 for each 128 byte transaction. TPC N N Y Y N N
    gst coalesced Number of coalesced global memory stores. TPC Y Y N N N N
    gst request Number of global memory store requests. On devices with compute capability 1.3 enabling this counter will result in increased counts for the "instructions" and "branch" counter values if they are also enabled in the same application run. TPC N N Y Y Y Y
    gst 32 byte Number of 32 byte global memory store transactions. This increments by 2 for each 32 byte transaction. TPC N N Y Y N N
    gst 64 byte Number of 64 byte global memory store transactions. This increments by 4 for each 64 byte transaction. TPC N N Y Y N N
    gst 128 byte Number of 128 byte global memory store transactions. This increments by 8 for each 128 byte transaction. TPC N N Y Y N N
    local load Number of local memory load transactions. Each local load request will generate one transaction irrespective of the size of the transaction. TPC Y Y Y Y Y Y
    local store Number of local memory store transactions. This increments by 2 for each 32-byte transaction, by 4 for each 64-byte transaction and by 8 for each 128-byte transaction for compute devices having compute capability 1.x. This increments by 1 irrespective of the size of the transaction for compute devices having compute capability 2.0. TPC Y Y Y Y Y Y
    cta launched Number of thread blocks launched on a TPC. TPC Y Y Y Y N N
    texture cache hit Number of texture cache hits. TPC Y Y Y Y N N
    texture cache miss Number of texture cache misses. TPC Y Y Y Y N N
    prof triggers There are 8 such triggers that the user can profile. They are generic and can be inserted at any place in the code to collect related information (see the sketch after this table). TPC Y Y Y Y Y Y
    shared load Number of executed shared load instructions per warp on a multiprocessor. SM N N N N Y Y
    shared store Number of executed shared store instructions per warp on a multiprocessor. SM N N N N Y Y
    instructions issued Number of instructions issued including replays. SM N N N N Y Y
    instructions executed Number of instructions executed; does not include replays. SM N N N N Y Y
    threads instruction executed Number of instructions executed by all threads; does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction. SM N N N N Y Y
    warps launched Number of warps launched on a multiprocessor. SM N N N N Y Y
    threads launched Number of threads launched on a multiprocessor. SM N N N N Y Y
    active cycles Number of cycles a multiprocessor has at least one active warp. SM N N N N Y Y
    active warps Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 48. SM N N N N Y Y
    l1 global load hit Number of global load hits in L1 cache. SM N N N N Y Y
    l1 global load miss Number of global load misses in L1 cache. SM N N N N Y Y
    l1 local load hit Number of local load hits in L1 cache. SM N N N N Y Y
    l1 local load miss Number of local load misses in L1 cache. SM N N N N Y Y
    l1 local store hit Number of local store hits in L1 cache. SM N N N N Y Y
    l1 local store miss Number of local store misses in L1 cache. SM N N N N Y Y
    l1 shared bank conflicts Number of shared bank conflicts. SM N N N N Y Y
    uncached global load transaction Number of uncached global load transactions. Increments by 1 per transaction. Transaction size can be 32/64/128 bytes. Non-zero values are only seen when the L1 cache is disabled at compile time. Please refer to the CUDA Programming Guide (Section G.4.2) for disabling the L1 cache. SM N N N N Y Y
    global store transaction Number of global store transactions. Increments by 1 per transaction. Transaction size can be 32/64/128 bytes. SM N N N N Y Y
    l2 read requests Number of read requests from L1 to L2 cache. This increments by 1 for each 32-byte access. FB N N N N Y Y
    l2 read texture requests Number of read requests from texture cache to L2 cache. This increments by 1 for each 32-byte access. FB N N N N Y Y
    l2 write requests Number of write requests from L1 to L2 cache. This increments by 1 for each 32-byte access. FB N N N N Y Y
    l2 read misses Number of read misses in L2 cache. This increments by 1 for each 32-byte access. FB N N N N Y Y
    l2 write misses Number of write misses in L2 cache. This increments by 1 for each 32-byte access. FB N N N N Y Y
    dram reads Number of read requests to DRAM. This increments by 1 for each 32-byte access. FB N N N N Y Y
    dram writes Number of write requests to DRAM. This increments by 1 for each 32-byte access. FB N N N N Y Y
    tex cache requests Number of texture cache requests. This increments by 1 for each 32-byte access. SM N N N N Y Y
    tex cache misses Number of texture cache misses. This increments by 1 for each 32-byte access. SM N N N N Y Y
    gld instructions 8bit Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gld instructions 16bit Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gld instructions 32bit Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gld instructions 64bit Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gld instructions 128bit Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gst instructions 8bit Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gst instructions 16bit Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gst instructions 32bit Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gst instructions 64bit Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
    gst instructions 128bit Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks. SW N N N N Y Y
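
    Regarding the "prof triggers" row above: in device code these generic counters are incremented with the __prof_trigger() function described in the CUDA Programming Guide. A minimal sketch (the kernel and the choice of trigger index 0 are hypothetical):

    __global__ void countNegatives(const float *x, float *y)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (x[tid] < 0.0f) {
            __prof_trigger(0);   /* bumps the "prof trigger 0" counter,
                                    once per warp reaching this point */
            y[tid] = -x[tid];
        } else {
            y[tid] = x[tid];
        }
    }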


  • Supported derived statistics


  • This table gives a brief description of all the statistics that are derived from the profiler counter values. These derived statistics appear in the Summary Table.
    Note: The derived statistics displayed in the Summary Table for a particular kernel are the average values taken over all the invocations of that kernel.
    Derived stats Description 1.0 1.1 1.2 1.3 2.0 2.1
    glob mem read throughput Global memory read throughput in giga-bytes per second.
    For compute capability < 2.0 this is calculated as (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / (gputime * 1000)
    For compute capability >= 2.0 this is calculated as ((DRAM reads) * 32) / (gputime * 1000)
    * * * * * *
    glob mem write throughput Global memory write throughput in giga-bytes per second.
    For compute capability < 2.0 this is calculated as (((gst_32*32) + (gst_64*64) + (gst_128*128)) * TPC) / (gputime * 1000)
    For compute capability >= 2.0 this is calculated as ((DRAM writes) * 32) / (gputime * 1000)
    * * * * * *
    glob mem overall throughput Global memory overall throughput in giga-bytes per second.
    This is calculated as Global memory read throughput + Global memory write throughput
    * * * * * *
    gld efficiency Global load efficiency NA NA 0-1 0-1 NA NA
    gst efficiency Global store efficiency NA NA 0-1 0-1 NA NA
    instruction throughput Instruction throughput ratio.
    This is the ratio of the achieved instruction rate to the peak single-issue instruction rate.
    The achieved instruction rate is calculated using the "instructions" profiler counter.
    The peak instruction rate is calculated based on the GPU clock speed.
    When instruction dual-issue comes into play, this ratio can exceed 1.
    This is calculated as (instructions) / (gpu_time * clock_frequency)
    0-1 0-1 0-1 0-1 NA NA
    retire ipc Retired instructions per cycle
    This is calculated as (instructions executed) / (active cycles).
    NA NA NA NA 0-2 0-4
    active warps/active cycles The average number of warps that are active on a multiprocessor per cycle.
    This is calculated as (active warps) / (active cycles).
    This is supported only for GPUs with compute capability 2.0 and higher.
    NA NA NA NA 0-48 0-48
    l1 gld hit rate This is calculated as 100 * (l1 global load hit count) / ((l1 global load hit count) + (l1 global load miss count))
    This is supported only for GPUs with compute capability 2.0 and higher.
    NA NA NA NA 0-100 0-100
    texture hit rate % This is calculated as 100 * (tex_cache_requests - tex_cache_misses) / (tex_cache_requests)
    This is supported only for GPUs with compute capability 2.0 and higher.
    NA NA NA NA 0-100 0-100
    Ideal Instruction/Byte ratio This is a ratio of the peak instruction throughput and the peak memory throughput of the CUDA device.
    This is a property of the device and is independent of the kernel.
    NA NA NA NA * *
    instruction/byte This is the ratio of the total number of instructions issued by the kernel and the total number of bytes
    accessed by the kernel from global memory. If this ratio is greater than the Ideal instruction/byte ratio,
    then the kernel is compute bound and if it’s less, then the kernel is memory bound. This is calculated as
    (32 * instructions issued * #SM) / (32 * (l2 read requests + l2 write requests + l2 read texture requests))
    NA NA NA NA * *
    Achieved Kernel Occupancy This ratio provides the actual occupancy of the kernel based on the number of warps executing per cycle on the SM.
    This is the ratio of active warps and active cycles divided by the max number of warps that can execute on an SM.
    This is calculated as (active warps/active cycles)/48
    NA NA NA NA 0-1 0-1
    Kernel requested global memory read throughput (GB/s) This is the actual number of bytes requested in terms of loads by the kernel from global memory divided by the
    kernel execution time. These requests are made in terms of global load instructions which can be of varying word sizes of
    8, 16, 32, 64 or 128 bits. This is calculated as (gld instructions 8bit + 2 * gld instructions 16bit + 4 *
    gld instructions 32bit + 8 * gld instructions 64bit + 16 * gld instructions 128bit) / (gpu time * 1000)
    NA NA NA NA * *
    Kernel requested global memory write throughput (GB/s) This is the actual number of bytes requested in terms of stores by the kernel from global memory divided by the kernel
    execution time. These requests are made in terms of global store instructions which can be of varying word sizes of
    8, 16, 32, 64 or 128 bits. This is calculated as (gst instructions 8bit + 2 * gst instructions 16bit + 4 *
    gst instructions 32bit + 8 * gst instructions 64bit + 16 * gst instructions 128bit) / (gpu time * 1000)
    NA NA NA NA * *
    Kernel requested global memory throughput (GB/s) This is the combined kernel requested read and write memory throughput. This is calculated as
    (Kernel requested global memory read throughput + Kernel requested global memory write throughput)
    NA NA NA NA * *
    L1 cache read throughput (GB/s) This gives the throughput achieved while accessing data from L1 cache. This is calculated as
    [(l1 global load hit + l1 local load hit) * 128 * #SM + l2 read requests * 32] / (gpu time * 1000)
    NA NA NA NA * *
    L1 cache global hit ratio (%) Percentage of hits that occur in L1 cache while accessing global memory. This statistic will be zero when L1 cache
    is disabled. This is calculated as (100 * l1 global load hit)/(l1 global load hit + l1 global load miss )
    NA NA NA NA 0-100 0-100
    Texture cache memory throughput (GB/s) This gives the memory throughput achieved while reading data from texture memory. This statistic will be zero
    when texture memory is not used. This is calculated as (#SM * tex cache sector queries * 32) / (gpu time * 1000)
    NA NA NA NA * *
    Texture cache hit rate (%) Percentage of hits that occur in texture cache while accessing data from texture memory. This statistic will be zero
    when texture memory is not used. This is calculated as 100 * (tex cache requests - tex cache misses) / tex cache requests
    NA NA NA NA 0-100 0-100
    L2 cache texture memory read throughput (GB/s) This gives the throughput achieved while reading data from L2 cache when a request for data residing in
    texture memory is made. This is calculated as (l2 read texture requests * 32) / (gpu time * 1000)
    NA NA NA NA * *
    L2 cache global memory read throughput (GB/s) This gives the throughput achieved while reading data from L2 cache when a request for data residing in global
    memory is made by L1. This is calculated as (l2 read requests * 32)/(gpu time * 1000)
    NA NA NA NA * *
    L2 cache global memory write throughput (GB/s) This gives the throughput achieved while writing data to L2 cache when a request to store data in
    global memory is made by L1. This is calculated as (l2 write requests * 32)/(gpu time * 1000)
    NA NA NA NA * *
    L2 cache global memory throughput (GB/s) This is the combined L2 cache read and write memory throughput. This is calculated as
    (L2 cache global memory read throughput + L2 cache global memory write throughput)
    NA NA NA NA * *
    L2 cache read hit ratio (%) Percentage of hits that occur in L2 cache while reading from global memory. This is calculated as
    100 * (L2 cache global memory read throughput - glob mem read throughput)/( L2 cache global memory read throughput)
    NA NA NA NA 0-100 0-100
    L2 cache write hit ratio (%) Percentage of hits that occur in L2 cache while writing to global memory. This is calculated as
    100 * (L2 cache global memory write throughput - glob mem write throughput)/( L2 cache global memory write throughput)
    NA NA NA NA 0-100 0-100
    Local memory bus traffic (%) Percentage of bus traffic caused due to accesses to local memory. This is calculated as
    (2 * l1 local load miss * 128 * 100)/((l2 read requests + l2 write requests)* 32 / #SMs)
    NA NA NA NA 0-100 0-100
    Global memory excess load (%) This shows the percentage of excess data that is fetched while making global memory load transactions. Ideally
    0% excess loads will be achieved when kernel requested global memory read throughput is equal to the L2 cache read
    throughput i.e. the number of bytes requested by the kernel in terms of reads are equal to the number of bytes
    actually fetched by the hardware during kernel execution to service the kernel. If this statistic is high, it implies
    that the access pattern for fetch is not coalesced, many extra bytes are getting fetched while serving the threads
    of the kernel. This is calculated as 100 - (100 * kernel requested global memory read throughput / l2 read throughput)
    NA NA NA NA 0-100 0-100
    Global memory excess store (%) This shows the percentage of excess data that is accessed while making global memory store transactions. Ideally 0%
    excess stores will be achieved when kernel requested global memory write throughput is equal to the L2 cache write
    throughput i.e. the number of bytes requested by the kernel in terms of stores are equal to the number of bytes actually
    accessed by the hardware during kernel execution to service the kernel. If this statistic is high, it implies that the
    access pattern for store is not coalesced, many extra bytes are getting accessed while execution of the threads of the
    kernel. This is calculated as 100 - (100 * kernel requested global memory write throughput / l2 write throughput)
    NA NA NA NA 0-100 0-100
    Peak global memory throughput (GB/s) This is the peak memory throughput or bandwidth that can be achieved on the present CUDA device. This is
    a device property and the kernel achieved memory throughput should be as close as possible to this peak.
    * * * * * *
    IPC - Instructions/Cycle This gives the number of instructions issued per cycle. This should be compared to maximum IPC possible
    for the device. The range provided is for single precision floating point instructions.
    This is calculated as (instructions issued/active cycles)
    NA NA NA NA 0-2 0-4
    Divergent branches (%) The percentage of branches that are causing divergence within a warp amongst all the branches present in
    the kernel. Divergence within a warp causes serialization in execution. This is calculated as
    (100*divergent branch)/(divergent branch + branch)
    0-100 0-100 0-100 0-100 0-100 0-100
    Control flow divergence (%) Control flow divergence gives the percentage of thread instructions that were not executed by all threads
    in the warp, hence causing divergence. This should be as low as possible. This is calculated as
    100 * ((32 * instructions executed) - threads instruction executed) / (32 * instructions executed)
    NA NA NA NA 0-100 0-100
    Replayed Instructions (%) This gives the percentage of instructions replayed during kernel execution. Replayed instructions are the
    difference between the number of instructions actually issued by the hardware and the number of
    instructions to be executed by the kernel. Ideally this should be zero. This is calculated as
    100 * (instructions issued - instructions executed) / instructions issued
    NA NA NA NA 0-100 0-100
    Global memory replay (%) Percentage of replayed instructions caused due to global memory accesses. This is calculated as
    100 * (l1 global load miss)/ instructions issued
    NA NA NA NA 0-100 0-100
    Local memory replay (%) Percentage of replayed instructions caused due to local memory accesses. This is calculated as
    100 * (l1 local load miss + l1 local store miss)/ instructions issued
    NA NA NA NA 0-100 0-100
    Shared bank conflict replay (%) Percentage of replayed instructions caused due to shared memory bank conflicts. This is calculated as
    100 * (l1 shared bank conflicts) / instructions issued
    NA NA NA NA 0-100 0-100
    Shared memory bank conflict per shared memory instruction (%) This gives an indication of the number of bank conflicts caused per shared memory instruction. This may
    exceed 100% if there are n-way bank conflicts or the data accessed is double precision. This is calculated as
    100 * (l1 shared bank conflict)/(shared load + shared store)
    NA NA NA NA 0-100 0-100
    SM activity (%) Percentage of multiprocessor utilization. This is calculated as
    100 * (active cycles)/ elapsed clocks
    NA NA NA NA 0-100 0-100
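
    As a worked example of how these derived statistics fall out of the raw counters, the sketch below plugs made-up counter values for a hypothetical compute capability 2.0 kernel into four of the formulas above (the variable names are illustrative, not profiler output fields):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical counter values for one kernel invocation */
        double instructions_issued   = 1.2e6;
        double instructions_executed = 1.0e6;
        double active_cycles         = 8.0e5;
        double active_warps          = 2.4e7;   /* accumulated per cycle  */
        double dram_reads            = 5.0e5;   /* 32-byte accesses       */
        double gputime_us            = 950.0;   /* GPU time, microseconds */

        /* IPC = (instructions issued) / (active cycles); range 0-2 on 2.0 */
        printf("IPC                = %.2f\n",
               instructions_issued / active_cycles);

        /* Achieved kernel occupancy = (active warps / active cycles) / 48 */
        printf("achieved occupancy = %.2f\n",
               (active_warps / active_cycles) / 48.0);

        /* Replayed instructions (%) = 100 * (issued - executed) / issued */
        printf("replayed instr %%   = %.1f\n",
               100.0 * (instructions_issued - instructions_executed)
                     / instructions_issued);

        /* glob mem read throughput (GB/s) = dram reads * 32 / (gputime * 1000) */
        printf("gmem read GB/s     = %.2f\n",
               dram_reads * 32.0 / (gputime_us * 1000.0));
        return 0;
    }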

    computeprof project files saved to disk

    computeprof settings which are saved

    The following is the list of computeprof settings that are saved and remembered across different computeprof sessions. On Windows these settings are saved in the system registry at the location "HKEY_CURRENT_USER\Software\NVIDIA Corporation\computeprof".
    On Linux these settings are saved to the file "$HOME/.config/NVIDIA Corporation/computeprof.conf".

    Compute Visual Profiler Help cache is saved in the folder: There is a separate sub-directory for each version.