OProfile is a profiling system for Linux 2.2/2.4/2.6 systems on a number of architectures. It is capable of profiling all parts of a running system, from the kernel (including modules and interrupt handlers) to shared libraries to binaries. It runs transparently in the background collecting information at a low overhead. These features make it ideal for profiling entire systems to determine bottlenecks in real-world systems.
You'll need to have a configured kernel source for the current kernel to build the module for 2.4 kernels. It is also recommended that if you have a uniprocessor machine, you enable the local APIC / IO_APIC support for your kernel (this is automatically enabled for SMP kernels). On machines with power management, such as laptops, the power management must be turned off when using OProfile with 2.4 kernels. The power management software in the BIOS cannot handle the non-maskable interrupts (NMIs) used by OProfile for data collection. If you use the NMI watchdog, be aware that the watchdog is disabled when profiling starts, and not re-enabled until the OProfile module is removed (or, in 2.5, when OProfile is not running). If you compile OProfile for a 2.2 kernel you must be root to compile the module. If you are using 2.5 kernels or higher, you do not need kernel source, as long as the OProfile driver is enabled; additionally, you should not need to disable power management.
This section gives a brief description of the available OProfile utilities and their purpose.
op_help: This utility lists the available events and short descriptions.
opcontrol: Used for controlling the OProfile data collection, discussed in Chapter 3.
oprofpp: This is the main tool for retrieving useful profile data, described in Chapter 4.
op_time: This utility is useful for examining the relative profile values for all images on the system, to determine the applications with the largest impact on system performance.
op_to_source: This utility can be used to produce annotated source, assembly, or mixed source/assembly. Source-level annotation is available only if the application was compiled with debugging symbols. See Section 3.
op_merge: This utility merges sample files which belong to the same application, which is especially useful when you profile with samples separated for shared libraries. See Section 4.
op_import: This utility converts sample database files from a foreign binary format (abi) to the native format. This is useful only when moving sample files between hosts, for analysis on platforms other than the one used for collection. The abi format of the file to be imported is described in a text file located in /var/lib/oprofile/abi, if the --enable-abi configure-time option was enabled. Furthermore, the op_import tool is not built unless --enable-abi is given. See --enable-abi.
Table of Contents
Before getting into detail about usage, it's probably a good idea to have a quick stroll through an example session (this example is for Intel P6 generation processors, but the process is similar).
First we need to start the profiler running in the background. We need to specify the correct vmlinux file (to allow kernel profiling), and we need to specify what event to count and the counter value. Here I've started with:
opcontrol --setup --vmlinux=/boot/2.4.0ac12/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=600000
opcontrol --start
Here we've enabled counter 0 to count "CPU_CLK_UNHALTED" (number of cycles the CPU is not halted) events, with a count value of 600,000. This event is useful because the resulting profiles generally correspond to time-spent profiles for functions etc.
A quick ps ax confirms that the daemon (oprofiled) has started. Data is now being collected in the kernel. Now we can do whatever we like ... although in this case I'm profiling the C++ application LyX. Note that unlike gprof, no instrumentation (-pg and -a options to gcc) is necessary. This is a major factor in achieving the low overhead of OProfile. Compiling with debug symbols (the -g option) is not necessary to get a basic function-based profile listing, but it must be used in order to retrieve line number information and create annotated source. See the FAQ Section 1 about debug information vs symbol information.
Rather than wait for the buffers to fill up, I now force the profiling data to be processed with:
opcontrol --dump
which will ask the kernel module to dump as much data as it can to the daemon.
Forcing a dump like this can cause the daemon to become very busy, especially the first time it is done. Don't worry, that's not its normal behaviour; if you are profiling over a longer period of time, such spikes won't appear.
I can now ask for a symbol-based summary of the sample profile:
oprofpp --demangle -l ./lyx >oprof.out
This can be quite slow on large binaries, so sit tight. As it's a C++ program, I asked for the symbols to be demangled to a readable form. Examining the output will give the symbols against which the most hits were registered. In this case I got:
...
Row::par(void)[0x0813ab54]: 5.4079% (472 samples)
LyXText::GetRow(LyXParagraph *, int, int &) const[0x08170a4c]: 5.5683% (486 samples)
LyXParagraph::GetFontSettings(BufferParams const &, int) const[0x08145420]: 5.7516% (502 samples)
Row::next(void) const[0x0813ac24]: 15.4904% (1352 samples)
at the top. Note that over a longer run (or with a lower ctr0-count value) the number of samples will be much more statistically reliable. Also note that these sample counts do not necessarily reflect the relative amounts of time spent in each function - it depends on the event being counted. In this case we used CPU_CLK_UNHALTED, which the command opcontrol --list-events tells us is "clocks processor is not halted", so it is in fact likely to represent the relative time spent accurately (indeed, experiments have shown that using this event is far more accurate than the values produced by gprof).
If you're more used to gprof style profile output, you can use oprofpp -g gmon.out and then gprof -p binary to get flat profiles. OProfile does not (cannot) support the call graph feature of gprof.
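For example, a sketch of this (the binary name lyx is taken from the walkthrough; gprof reads gmon.out from the current directory by default):

oprofpp -g gmon.out ./lyx
gprof -p ./lyx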
In this section we describe the configuration and control of the profiling system with opcontrol in more depth.
The opcontrol script provides the following options:
--init: Loads the OProfile module if required and makes the OProfile driver interface available.
--setup: Followed by a list of arguments for profiling set up. The list of arguments is saved in /home/root/.oprofile/daemonrc.
--start-daemon: Start the oprofile daemon without starting actual profiling. The profiling can then be started using --start. This is useful for avoiding measuring the cost of daemon startup, as --start is a simple write to a file in oprofilefs. Not available in 2.2/2.4 kernels.
--start: Start data collection with either arguments provided by --setup or information saved in /home/root/.oprofile/daemonrc. Specifying the additional --verbose makes the daemon generate lots of debug data whilst it is running.
--dump: Force a flush of the collected profiling data to the daemon.
--stop: Stop data collection (this separate step is not possible with 2.2 or 2.4 kernels).
--shutdown: Stop data collection and remove the daemon.
--reset: Clears out data from the current session, but leaves saved sessions.
--save=session_name: Save data from the current session to session_name.
--deinit: Shuts down the daemon. Unloads the OProfile module and oprofilefs.
--list-events: List event types and unit masks.
--help: Generate usage messages.
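Putting these together, a typical session looks like this sketch (the event and paths are taken from the walkthrough above; adjust for your system):

opcontrol --init
opcontrol --setup --vmlinux=/boot/2.4.0ac12/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=600000
opcontrol --start
... run the program you want to profile ...
opcontrol --dump
opcontrol --save=mysession
opcontrol --shutdown
opcontrol --deinit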
The opcontrol script options used with the --setup option:
--buffer-size=num: Number of samples in the kernel buffer.
--ctrN-event=[event]: Set counter N to measure the symbolic event name.
--ctrN-count=[count]: Number of events between samples for counter N.
--ctrN-unit-mask=[mask]: Set the unit mask for counter N (e.g. --ctr0-unit-mask=0xf).
--ctrN-kernel=[0|1]: Whether to count kernel events for counter N.
--rtc-value=[value]: Set the RTC counter value (see Section 5.2).
--pid-filter=pid: Only profile process pid (only available for the 2.4 version).
--pgrp-filter=pgrp: Only profile process tty group pgrp (only available for the 2.4 version).
--separate-samples: Separate samples for each distinct application shared library.
--vmlinux=file: The vmlinux kernel image.
--kernel-range=start,end: Kernel range vma addresses in hexadecimal.
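For example, a sketch of a two-counter setup on an Intel P6 generation CPU (the DCU_LINES_IN event and the count values are illustrative; check the output of opcontrol --list-events for your CPU):

opcontrol --setup --vmlinux=/boot/2.4.0ac12/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=600000 --ctr1-event=DCU_LINES_IN --ctr1-count=10000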
If you are using a 2.5 kernel in timer interrupt mode (i.e. no hardware performance counters are supported for your CPU type), you do not need to specify any counter parameters, because there aren't any (see Section 5.3).
This section describes the OProfile Qt-based interface.
The oprof_start application provides a convenient way to start the profiler. Note that oprof_start is just a wrapper around the opcontrol script, so it does not provide more services than the script itself.
After oprof_start is started you can select the event type for each counter; the sampling rate and other related parameters are explained in Section 2. The "Configuration" section allows you to set general parameters such as the buffer size, kernel filename etc. The counter setup interface should be self-explanatory; Section 5.1 and related links contain information on using unit masks.
A status line shows the current status of the profiler: how long it has been running, the average number of interrupts received per second, and the total over all processors. Note that quitting oprof_start does not stop the profiler.
Your configuration is saved when you quit the GUI, in two files in the ~/.oprofile directory: oprof_start_config and oprof_start_event. These contain the general configuration and the event/counter setup, respectively.
It can often be useful to split up profiling data into several different time periods. For example, you may want to collect data on an application's startup separately from the normal runtime data. You can use the simple command opcontrol --save to do this. For example:
opcontrol --save=blah
will create a sub-directory in /var/lib/oprofile/samples containing the samples up to that point (the current session's sample files are moved into this directory). You can then pass this name as, for example, a parameter to op_time to only get data up to the point you named the session.
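For instance, to restrict a later op_time report to the saved session (using the session name from the example above):

op_time --session=blah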
Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in RTC mode in 2.4 (see Section 5.2), or timer mode in 2.5 (see Section 5.3).
The hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/. The AMD Athlon/Duron implementation is detailed in http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf. These processors are capable of delivering an interrupt to the local APIC LVTPC vector when a counter overflows. This is the basic mechanism on which OProfile is based. The kernel module installs an interrupt handler for this vector. The delivery mode is set to NMI so that blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called, the current PC (EIP) value, process id, and counter number are recorded into the profiling structure. This allows the overflow event to be attached to a specific assembly instruction in a binary image. The daemon is necessary to transform these recorded values into a count against a file offset for a given binary image, in order to produce profile data off-line at a later time.
If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED (GLOBAL_POWER_EVENTS or INSTR_RETIRED, respectively, on the Pentium 4), we can use the overflow counts as an estimate of actual time spent in each part of code. Alternatively we can profile interesting data such as the cache behaviour of routines with the other available counters.
However there are several caveats. Firstly, there are those issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale - this means you cannot rely on the profiles at the instruction level, except as an indicator of whether an instruction was or wasn't executed. If you are using an "event-mode" counter such as the cache counters, a count registered against an instruction doesn't mean that instruction is responsible for that event; it only implies that the counter overflowed in the dynamic vicinity of that instruction, to within a few instructions. Further details on this problem can be found in Chapter 5 and also in the Digital paper "ProfileMe: A Hardware Performance Counter". Also note that a very high number of interrupts can have a large performance effect, and can even overflow the profiling data structures. This can lead to mapping information getting overwritten, and loss of respect from boxing promoters (don't worry, an obscure reference). System stability will never be affected, but profiling may not be able to work properly. An error message from the kernel module will appear in your system log files if this situation occurs.
As described in the Intel manual, each counter, as well as being configured to count an event type, has several more configuration parameters. First, there is the unit mask: this simply further specifies what to count. Second, there is the counter value, discussed below. Third, there is a parameter controlling whether to increment counts whilst in kernel or user space. You can configure these separately for each counter.
So you must specify a counter value with the --ctrX-count option, where X is the logical counter number: in the range 0 to 3 for AMD Athlon processors, 0 to 1 for Intel P6 generation processors, and 0 to 7 for Intel Pentium 4 processors. Using multiple counters is useful for profiling several aspects of the same running program. After each overflow event, the counter will be re-initialized such that another overflow will occur after this many events have been counted. Picking a good value for this parameter is, unfortunately, somewhat of a grey art (not quite black). It is of course dependent on the event you have chosen. For basic time-based profiling, you will probably use CPU_CLK_UNHALTED (on Intel P6 generation processors). You can estimate how many interrupts this value will generate per second by dividing your CPU clock rate by the chosen value. I have a 600MHz Celeron, so specifying an overflow value of 100,000 will generate around 6000 interrupts per second. Specifying too large a value will mean not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer). Specifying too small a value can lead to overflow problems discussed previously.
This section applies to 2.4 kernels only.
Some CPU types do not provide the needed hardware support to use the hardware performance counters. This includes some laptops, classic Pentiums, and other CPU types not yet supported by OProfile (such as Cyrix). On these machines, OProfile falls back to using the real-time clock interrupt to collect samples. This interrupt is also used by the rtc module: you cannot have both the OProfile and rtc modules loaded, nor can the rtc support be compiled into the kernel.
RTC mode is less capable than the hardware counters mode; in particular, it is unable to profile sections of the kernel where interrupts are disabled. There is just one available event, "RTC interrupts", and its value corresponds to the number of interrupts generated per second (that is, a higher number means a better profiling resolution, and higher overhead). The current implementation of the real-time clock supports only power-of-two sampling rates from 2 to 4096 per second. Other values within this range are rounded to the nearest power of two.
Setting the value from the GUI should be straightforward. On the command line, you need to specify the --rtc-value option to opcontrol, e.g.:
opcontrol --setup --vmlinux=/boot/2.4.0ac12/vmlinux --rtc-value=256
Note that the sysctl tree described in the next section is different when the RTC is being used. In particular, the file /proc/sys/dev/oprofile/rtc_value is used by the tools to set the desired RTC sampling rate, and will reflect the actual sampling rate after profiling has started.
This section applies to 2.5 kernels only.
In 2.5 kernels on CPUs without OProfile support for the hardware performance counters, the driver falls back to using the timer interrupt for profiling. Like the RTC mode in 2.4 kernels, this is not able to profile code that has interrupts disabled. Note that there are no configuration parameters for setting this, unlike RTC and hardware performance counter setup.
The Pentium 4 / Xeon performance counters are organized around 3 types of model specific registers (MSRs): 45 event selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a particular set of events to be recorded, and CCCRs bind ESCRs to counters and configure their operation. Unfortunately the relationship between these registers is quite complex; not all of them can be used with one another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of one another, so the OProfile module only accesses those registers, treating them as a bank of 8 "normal" counters, similar to those in the P6 or Athlon families of CPU.
There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store (DS). Current support is limited to the conservative extension of Oprofile's existing interrupt-based model described above. Performance monitoring hardware on Pentium 4 / Xeon processors with Hyperthreading enabled (multiple logical processors on a single die) is not supported either.
The Itanium 2 performance monitoring unit (PMU) organizes the counters as four pairs of performance event monitoring registers. Each pair is composed of a Performance Monitoring Configuration (PMC) register and Performance Monitoring Data (PMD) register. The PMC selects the performance event being monitored and the PMD determines the sampling interval. The IA64 Performance Monitoring Unit (PMU) triggers sampling with maskable interrupts. Thus, samples will not occur in sections of the IA64 kernel where interrupts are disabled.
None of the advanced features of the Itanium 2 performance monitoring unit, such as opcode matching, address range matching, or precise event sampling, are supported by this version of OProfile. The Itanium 2 support only maps OProfile's existing interrupt-based model to the PMU hardware.
OProfile is a low-level profiler which allows continuous profiling with low overhead. If not used carefully, however, it can affect the stability of the system. If too low a count reset value is set for a counter, the system can become overloaded with counter interrupts and appear to have frozen.
This can happen as follows: When the profiler count reaches zero an NMI handler is called which stores the sample values in an internal buffer, then resets the counter to its original value. If the count is very low, a pending NMI can be sent before the NMI handler has completed. Due to the priority of the NMI, the local APIC delivers the pending interrupt immediately after completion of the previous interrupt handler, and control never returns to other parts of the system. In this way the system seems to be frozen.
If this happens, it will be impossible to bring the system back to a workable state. There is no way to provide real security against this happening, other than making sure to use a reasonable value for the counter reset. For example, setting CPU_CLK_UNHALTED event type with a ridiculously low reset count (e.g. 500) is likely to freeze the system.
In short: don't try a foolish sample count value. Unfortunately the definition of a foolish value is really dependent on the event type - if ever in doubt, e-mail the OProfile mailing list.
Do I hear you shout "but my event value is low, but not stupid!"? Yes, this can be the case. In these circumstances, a simple solution is to disable kernel profiling by turning off the kernel option for each enabled counter. As the NMI handler is in-kernel, this avoids the problem.
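A sketch of such a setup, assuming the per-counter kernel flag is spelled --ctr0-kernel as in the --setup options listed earlier:

opcontrol --setup --vmlinux=/boot/2.4.0ac12/vmlinux --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=500 --ctr0-kernel=0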
There are situations where you are only interested in the profiling results of a particular running process, or process tty group. You can set the pid/pgrp values via the --pid-filter and --pgrp-filter options to opcontrol --setup, which will make the daemon ignore samples for processes that don't match the filter.
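For example (2.4 kernels only; the pid 1234 is hypothetical):

opcontrol --setup --vmlinux=/boot/2.4.0ac12/vmlinux --pid-filter=1234 --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=600000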
This section applies to 2.4 kernels only; in 2.5, OProfile can be unloaded safely.
The kernel module can be unloaded, but is designed to take very little memory when profiling is not underway. There is no need to unload the module between profiler runs.
lsmod and similar utilities will still show the module's use count as -1. However, this is not to be relied on - the module will become unloadable some short time after stopping profiling.
Note that by default module unloading is disabled on SMP systems. This is because of a small chance that a module unload race could crash the kernel. As the race window is very small, you can re-enable module unloading by specifying the "allow_unload" parameter to the module:
modprobe oprofile allow_unload=1
This option can be DANGEROUS and should only be used on non-production systems.
OK, so the profiler has been running, but it's not much use unless we can get some data out. Fairly often, OProfile does a little too good a job of keeping overhead low, and no data reaches the profiler. This can happen on lightly-loaded machines. Remember you can force a dump at any time with:
opcontrol --dump
Remember to do this before complaining there is no profiling data! Now that we've got some data, it has to be processed. That's the job of oprofpp or op_to_source. These work on a sample file in the /var/lib/oprofile/samples/ directory, along with the binary file being profiled, to produce human-readable data. Note that if the binary file changes after the sample file was created, you won't be able to get useful data out; this situation is detected for you. Note that samples from several instances of a binary are merged into one sample file. By default, all samples from a dynamically linked library are merged into one sample file as well.
A different scenario happens when restarting profiling with different parameters, as the old sample files from previous sessions don't get deleted (allowing you to build profiles over many distinct profiling sessions). If the last session is determined to be out of date due to the use of different profiling parameters, all the sample files are backed up in a sub-directory named session-#nr. If during profiling the daemon detects a change to a binary image and a sample file belonging to this binary exists, the sample file is silently deleted. So if you change a binary during profiling, it is your responsibility to save the binary image and the sample files if you need them.
Note that kernel modules without symbol data (this can happen with some initrd setups) cannot be profiled (modules with symbols show up in /proc/ksyms).
All post-profile tools accept the following options:
--help: Show the command line options.
--version: Show the version number of OProfile.
Oprofpp can be used in three major modes: list-symbols mode, detailed symbol mode, or gprof mode. The first gives sorted histogram output of sample counts against functions, as shown in the walkthrough. The second can show individual sample counts against instructions inside a function, useful for detailed profiling, whilst the third mode is handy if you're used to gprof-style output. Note that only flat gprof profiles are supported, however.
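A sketch of the three modes using the walkthrough's lyx binary (the symbol name given to --list-symbol is illustrative):

oprofpp --demangle -l ./lyx
oprofpp --list-symbol=main ./lyx
oprofpp -g gmon.out ./lyx
gprof -p ./lyx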
Some interesting options of the post-processor:
--samples-file=file (-f): The samples file to use. By default, the current samples file for the given binary is used; this option can be used to examine older sample files.
--image-file=file (-i): The binary image (shared library, kernel vmlinux, or program) to produce data for.
--demangle (-d): Demangle C++ symbol names.
--counter=nr (-c): Which counter (0 - N) to extract information for. N is dependent on your cpu type: 1 for P6 generation CPUs, 3 for Athlon based CPUs, 7 for Pentium 4 / Xeon CPUs.
--list-symbols (-l): List a histogram of sample counts against symbols. Each line shows the function name, its starting address, the relative percentage of hits across that image, and the absolute number of samples in this function.
--list-symbol=symbol (-s): Provide a detailed listing for the specified symbol name. This shows, for each sample, the position of the address, and the number of samples.
--dump-gprof-file=file (-g): Dump output to the specified file in gprof format. If you specify gmon.out, you can then call gprof -p <binary>.
--list-all-symbols-details (-L): Provide a detailed listing for all symbols. Each line shows the number of samples at the given address for all counters.
--output-linenr-info (-o): Show the function and line number for all samples. This requires that the image was compiled with debug symbols (-g), and is usable only with --list-all-symbols-details, --list-symbol and --list-symbols.
--exclude-symbol=[symbols] (-e): Comma-separated list of symbols to ignore. This can be useful to ignore the leading contributors to the sample histogram, as the percentage values are re-calculated.
--show-shared-libs (-k): Show the details for each shared lib which belongs to the given application. This option is useful only if you have profiled with the --separate-samples option and you specify on the oprofpp command line either --list-symbols or --list-all-symbols-details.
Specify the output and formatting, as given in this table:
v | VMA (symbol address)
s | Sample count
S | Cumulative sample count
p | Sample percentage relative to image
P | Cumulative sample percentage relative to image
q | Sample percentage relative to all images
Q | Cumulative sample percentage relative to all images
n | Name of symbol
l | Source filename and line number
L | Short source filename and line number
i | Image name
I | Short image name
d | Detailed sample results
h | Print a column header
This option is not available with --dump-gprof-file.
--session=session-name: Specify the session name you want to use; session-name can be an absolute path where samples reside, or a session name relative to the samples files base directory. If you specify a samples filename with an absolute path, this option is ignored.
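For example, to examine the session saved earlier as blah:

oprofpp --session=blah --demangle -l ./lyx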
op_to_source generates annotated source files or assembly listings, optionally mixed with source. If you want to see the source file, the profiled application needs to have debug information, and the source must be available through this debug information; e.g. compile the application with -g for gcc. If the binary doesn't contain sufficient debug info, you can use op_to_source --assembly to get assembly annotation. See the FAQ Section 1 about debug information vs symbol information.
Note that for the reason explained in Section 5.1 the results can show some inaccuracy. The debug info itself can add other problems; for example, the line number for a symbol can be incorrect. Assembly instructions can be re-ordered and moved by the compiler, and this can lead to crediting source lines with samples not really "owned" by this line. Also see Chapter 5.
The options allowed are:
--assembly (-a): Output assembly code. Currently the assembly code is sorted in increasing order of vma address. The --sort-by-counter, --with-more-than-samples percent_nr and --until-more-than-samples percent_nr options can also be used with this option to provide filtering capabilities.
--source-dir=dir: This option is used in conjunction with --output-dir. You can use it to specify the base directory of the source which you wish to produce annotated output for. With this option, any source files outside the directory (for example, system header files) are ignored.
--output-dir=dir: Specify that you want to produce an annotated source tree, rather than getting all output to stdout. This creates a hierarchy of annotated source files, and is affected by the --source-dir, --output, and --no-output options.
--output=[patterns]: Specify a set of comma-separated patterns for matching annotated source output filenames. If this option is present, a file is only output if it matches one of the given patterns (which apply to the filename and each component of the containing directory names). For example:
--output '*.c,user.h'
--no-output=[patterns]: Specify a set of comma-separated patterns for filtering annotated source output filenames. If this option is present, a file is only output if it does not match one of the given patterns (which apply to the filename and each component of the containing directory names). For example:
--no-output 'boring.c,boring*.h'
--source-with-assembly (-s): Output assembly code mixed with the source file; implies --assembly.
--objdump-params=[params] (-o): Pass the comma-separated additional parameters to objdump. Check the objdump man page to see which options objdump accepts, e.g. -o '--disassembler-options=intel' to get Intel assembly syntax instead of AT&T syntax. This option can be used only with --assembly or --source-with-assembly.
--sort-by-counter=counter_nr: Sort by decreasing number of samples on counter_nr. For assembly output this option provides only a filtering and not a sort order.
--with-more-than-samples=percent_nr (-w): Output source files which contain at least percent_nr samples. Cannot be combined with --until-more-than-samples.
--until-more-than-samples=percent_nr: Output source files until the amount of samples in these files reaches percent_nr samples. Cannot be combined with --with-more-than-samples.
--samples-file=file: Specify the samples file. At least one of --samples-file or --image-file must be specified.
--image-file=file: Specify the image file.
--exclude-symbol=[symbols]: Comma-separated list of symbols to ignore. This can be useful to ignore the leading contributors to the sample histogram, as the percentage values are re-calculated.
--include-symbol=[symbols]: Comma-separated list of symbols to include. This can be useful to see only the leading contributors to the sample histogram, as the percentage values are re-calculated.
--session=session-name: Specify the session name you want to use; session-name can be an absolute path where samples reside, or a session name relative to the samples files base directory. If you specify a samples filename with an absolute path, this option is ignored.
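For example, a sketch that writes an annotated source tree for the walkthrough's lyx binary (the source and output paths are illustrative):

op_to_source --source-dir=/home/me/lyx-src --output-dir=/tmp/annotated --image-file=./lyx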
op_merge is used to merge sample files which belong to the same binary image. Its main purpose is to merge sample files created by profiling with --separate-samples, so you can create one sample file containing all samples for a shared library: op_merge /usr/lib/ld-2.1.2.so will create a sample file named }usr}lib}ld-2.1.2.so ready to use with oprofpp or other post-profiling tools. Additionally, you can merge a subset of sample files into one sample file by specifying explicitly the sample files to merge. This allows you to use post-profile tools on shared libs for a subset of applications.
The options allowed are:
--counter=nr (-c): Use the given counter number to select the appropriate sample files.
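For instance, after profiling with --separate-samples:

op_merge /usr/lib/ld-2.1.2.so
oprofpp -l --image-file=/usr/lib/ld-2.1.2.so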
You can get a quick look at an overall summary of relative binary profiles using op_time. This utility displays the relative number of samples for each application profiled, sorted in decreasing order of sample count. So with op_time [--option] [image_name[,image_names]] you can get:
/lib/libc-2.1.2.so            19  32.7586%
/usr/X11R6/bin/XF86_SVGA      13  22.4138%
...
/usr/bin/grep                  1   1.72414%
/usr/X11R6/lib/libXt.so.6.0    1   1.72414%
If you don't specify any image_name on the command line, op_time reports information about all profiled binary images. You can use shell wildcards, like: op_time /usr/bin/*
Options allowed are:
--counter=nr (-c): Use counter nr for sorting sample counts.
--show-shared-libs (-k): Show the details for each shared lib which belongs to an application. This option is useful only if you have profiled with the --separate-samples option.
--list-symbols (-l): Show details for each symbol in each profiled file.
--demangle (-d): Demangle GNU C++ symbol names.
--show-image-name (-n): Show the image name when specifying --list-symbols.
--reverse-sort (-r): Sort by decreasing sample count instead of increasing count.
--path=[paths] (-p): Specify an alternate list of paths in which to locate image files. This is useful if your sample file names do not match the image file names, such as for a module loaded at boot time through a ram disk.
--recursive-path=[paths] (-P): Same as --path, but searches for the image file recursively in the path list.
Specify the output and formatting, as given in this table:
v | VMA (symbol address)
s | Sample count
S | Cumulative sample count
p | Sample percentage relative to image
P | Cumulative sample percentage relative to image
n | Name of symbol
l | Source filename and line number
L | Short source filename and line number
i | Image name
I | Short image name
e | Application name
E | Basename of application
d | Detailed sample results
h | Print a column header
e and E are useless unless you profile with --separate-samples.
These options can only be used in conjunction with --list-symbols.
--exclude-symbol=[symbols]: Comma-separated list of symbols to ignore. This can be useful to ignore the leading contributors to the sample histogram, as the percentage values are re-calculated.
--session=session-name: Specify the session name you want to use; session-name can be an absolute path where samples reside, or a session name relative to the samples files base directory.
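For example (assuming /usr/bin/lyx was among the profiled images; --list-symbols as described above):

op_time /usr/bin/*
op_time --list-symbols /usr/bin/lyx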
Another grey art. The standard caveats of profiling come to mind: profile realistic situations, profile different scenarios, profile for as long a time as possible, avoid system-specific artifacts, and don't trust the profile data too much. Also bear in mind the comments on the performance counters above - you cannot rely on totally accurate instruction-level profiling. However, for almost all circumstances the data can be useful. Ideally a utility such as Intel's VTune would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;)
This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.
double fun(double a, double b, double c)
{
	double result = 0;
	for (int i = 0 ; i < 10000; ++i) {
		result += a;
		result *= b;
		result /= c;
	}
	return result;
}
Here the last instruction of the loop is very costly, and you would expect the results to reflect that - but (cutting the instructions inside the loop):
$ op_to_source -a -w 10

/*    9349  0.3788%    */ 8048394: fadd   %st(3),%st
/*   22858  0.9261%    */ 8048396: fmul   %st(2),%st
/*  687682  27.86%     */ 8048398: fdiv   %st(1),%st
/* 1747822  70.81%     */ 804839a: decl   %eax
/*      17  0.0006887% */ 804839b: jns    8048394
The problem comes from the x86 hardware: when the counter overflows, the IRQ line is asserted, but several hardware features can delay the delivery of the NMI. x86 interrupts are synchronous (an instruction cannot be interrupted mid-execution, only at its end); there is also a latency after the IRQ line is asserted, as the hardware can take some cycles to take the interrupt into account; and the multiple execution units and the out-of-order model of the modern x86 family cause problems too. The following shows the same function at the source level:
$ op_to_source -a -w 10

                         double fun(double a, double b, double c)
/* fun(double, double, double) 2468162 100% */
/*     165 0.006685%  */ {
/*       3 0.0001215% */         double result = 0;
                                 for (int i = 0 ; i < 10000; ++i) {
/*    9349 0.3788%    */                 result += a;
/*   22858 0.9261%    */                 result *= b;
/*  687682 27.86%     */                 result /= c;
/* 1747918 70.82%     */         }
                                 return result;
/*     187 0.007576%  */ }
So the conclusion: don't trust samples at the end of a loop, particularly if the last instruction generated by the compiler is costly; the same can occur at each branch in your program. Always bear in mind that samples are often delayed by a few cycles from their real position. That's a hardware problem and OProfile can do nothing about it.
OProfile uses non-maskable interrupts (NMI) on the P6 generation, Pentium 4, Athlon and Duron processors. These interrupts can occur even in sections of the Linux kernel where interrupts are disabled, allowing collection of samples in virtually all executable code. The RTC, timer interrupt mode, and Itanium 2 collection mechanisms use maskable interrupts. Thus, these data collection mechanisms have "sample shadows", or blind spots: regions where no samples will be collected. Typically, the samples will be attributed to the code immediately after the interrupts are re-enabled.
The compiler can introduce some pitfalls in the annotated source output. The optimizer can move pieces of code in such a manner that two lines of code are interleaved (instruction scheduling). Also, debug info generated by the compiler can show strange behavior. This is especially true for complex expressions, e.g. inside an if statement:
if (a && ..
    b && ..
    c && ..)
Here the problem comes from the position of the line numbers. The available debug info does not give enough details for the if condition, so all samples are accumulated at the position of the right brace of the expression. Using op_to_source -a can help to show the real samples at the assembly level.
Depending on your compiler you can fall into the following problem:
struct big_object { int a[500]; };

int main()
{
	big_object a, b;
	for (int i = 0 ; i != 1000 * 1000; ++i)
		b = a;
	return 0;
}
Compiled with gcc 3.0.4 the annotated source is clearly inaccurate:
int main()
{
	big_object a, b;
/* main 7871 100% */
	for (int i = 0 ; i != 1000 * 1000; ++i)
		b = a;
/* 7871 100% */
	return 0;
}
The problem here is distinct from the IRQ latency problem: the debug line number information is not precise enough. Again, looking at the output of op_to_source -as can help.
/* 7871 100% */ int main()
                {
                        big_object a, b;
                        for (int i = 0 ; i != 1000 * 1000; ++i)
 80484c0:	push   %ebp
 80484c1:	mov    %esp,%ebp
 80484c3:	sub    $0xfac,%esp
 80484c9:	push   %edi
 80484ca:	push   %esi
 80484cb:	push   %ebx
                                b = a;
 80484cc:	lea    0xfffff060(%ebp),%edx
 80484d2:	lea    0xfffff830(%ebp),%eax
 80484d8:	mov    $0xf423f,%ebx
 80484dd:	lea    0x0(%esi),%esi
                        return 0;
/*    3 0.03811% */  80484e0:	mov    %edx,%edi
 80484e2:	mov    %eax,%esi
/*    1 0.0127%  */  80484e4:	cld
/*    8 0.1016%  */  80484e5:	mov    $0x1f4,%ecx
/* 7850 99.73%   */  80484ea:	repz movsl %ds:(%esi),%es:(%edi)
/*    9 0.1143%  */  80484ec:	dec    %ebx
 80484ed:	jns    80484e0
 80484ef:	xor    %eax,%eax
 80484f1:	pop    %ebx
 80484f2:	pop    %esi
 80484f3:	pop    %edi
 80484f4:	leave
 80484f5:	ret
So here it's clear that the copying is correctly credited with all of the samples, but the line number information is misplaced. objdump -dS exposes the same problem. Note that maintaining accurate debug information for compilers when optimizing is difficult, so this problem is not surprising. The accuracy of debug information also depends on the binutils version used: some BFD library versions contain a work-around for known problems of gcc, some others do not. This is unfortunate but we must live with that, since profiling is pointless when you disable optimisation (which would give better debug info entries).
Often an assembler cannot generate debug information automatically; gas and nasm are examples of commonly used assemblers where this is the case. This means that you cannot get a source report unless you manually define the necessary debug information; read your assembler documentation for how you might do that. The only debug information currently needed by OProfile is the line number/filename-to-VMA association. When profiling assembly without debug information you can always get a report per symbol, and optionally per VMA, through oprofpp -l or oprofpp -L, but this works only for symbols with the right attributes. For gas you can get this by:
.globl foo
.type foo,@function
while for nasm you must use
GLOBAL foo:function
Note that OProfile does not need the global attribute, only the function attribute. Users of gas and nasm must find the right way to avoid declaring the foo symbol global, if that is necessary.
Another cause of apparent problems is the hidden cost of instructions. A very common example is two memory reads: one that hits in the L1 cache and one that goes to main memory; clearly the second read will show more samples. There are many other causes of hidden instruction costs. A non-exhaustive list: mis-predicted branches, TLB misses, partial register stalls, partial register dependencies, memory mismatch stalls, re-executed µops. If you want to write programs at the assembly level, be sure to take a look at the Intel and AMD documentation at http://developer.intel.com/ and http://www.amd.com/products/cpg/athlon/techdocs/.
One of the major design criteria for OProfile was low overhead. In many cases the profiling is hardly noticeable in terms of overhead (I regularly leave it turned on all the time). It achieves this by judicious use of kernel-side data structures to reduce the analysis overhead to a bare runtime minimum. There are several things that unfortunately complicate the issue, so there are cases where the overhead is noticeable.
The worst-case scenario is where there are many short-lived processes. This can be seen in a kernel compile, for instance. Even in this worst case the overhead is low compared to other profilers; only very detailed profiling of such workloads has an overhead higher than 5%. Actual performance data is presented in the CVS tree. In fact most situations have far fewer processes, leading to far better performance.
Some graphs of performance characteristics of OProfile are available on the website - see Section 2.
Some post-profile tools need only symbol information, such as basic use of oprofpp, whilst others need debugging information, such as op_to_source. The notions of debugging information and symbol information are orthogonal: gcc always creates symbol information, even without -g. You should be aware that strip by default removes both debug and symbol information; you must use strip -d to strip only debug information. Profiling binaries without symbol information leads to a lot of samples reported as (no symbol) by oprofpp.
Note that the post-profile tools do not use dynamic symbols for shared libraries in the absence of symbol information. These symbols are never stripped, but using them risks a misleading profile: dynamic symbols don't provide enough information to create an accurate profiling report, because the symbol size information is lost, and we don't know whether a sample comes from an exported function or from a static function which comes immediately after that exported symbol.
In short, if you want to profile binaries already installed on your system, reinstall them with the proper information: at least with symbol information for basic oprofpp and op_time -l, or with full debug information to get annotated source. Unfortunately debug information can be very voluminous. The debug information needed for annotated source, line numbers, is only generated at the -g2 level.
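For example (a sketch; myprog is a hypothetical binary):

gcc -g -O2 -o myprog myprog.c   # full debug information, suitable for annotated source
strip -d myprog                 # drops debug information but keeps the symbol table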
Thanks to (in no particular order) : Arjan van de Ven, Rik van Riel, Juan Quintela, Philippe Elie, Phillipp Rumpf, Tigran Aivazian, Alex Brown, Alisdair Rawsthorne, Bob Montgomery, Ray Bryant, H.J. Lu, Jeff Esper, Will Cohen, Graydon Hoare, Cliff Woolley, Alex Tsariounov, Al Stone, Richard Reich (rreich@rdrtech.com), Dave Jones, Charles Filtness; and finally Pulp, for "Intro".