collect(1) User Commands collect(1)
NAME
collect - command used to collect program performance data
SYNOPSIS
collect collect-arguments target target-arguments
collect
collect -V
DESCRIPTION
The collect command runs the target process and records performance
data and global data for the process. Performance data is collected
using profiling or tracing techniques. The data can be examined with
the Performance Analyzer graphical tool (analyzer) or a command-line
program (er_print). The data collection software run by the collect
command is referred to here as the Collector.
The data from a single run of the collect command is called an experi‐
ment. The experiment is represented in the file system as a directory,
with various files inside that directory.
The target is the path name of the executable, Java .jar file, or Java
.class file for which you want to collect performance data. For more
information about Java profiling, see JAVA PROFILING, below.
Executables that are targets for the collect command can be compiled
with any level of optimization, but must use dynamic linking. If a pro‐
gram is statically linked, the collect command prints an error message.
In order to see annotated source using analyzer or er_print, targets
should be compiled with the -g flag, and should not be stripped.
The collect command uses the following strategy to find its target:
o If a file with the specified target name exists, has execute
permission set, and is an ELF executable, the collect com‐
mand verifies that it can run on the current machine and
then runs it. If the file is not an ELF executable, the col‐
lect command assumes it is a script, and runs it.
o If a file with the specified target name exists but does not
have execute permission, collect checks whether the file is
a Java jar file (target name ends in .jar) or class file
(target name ends in .class). If the file is a jar file or
class file, collect inserts the Java virtual machine (JVM)
software as the target, with any necessary flags, and col‐
lects data on that JVM. See JAVA PROFILING below.
o If a file with the specified target name is not found, col‐
lect searches your path to find an executable; if an exe‐
cutable file is found, collect verifies it as described
above.
o If a file of the target name is also not found in your path,
the command looks for a file with that name and the string
.class appended; if a file with the class name is found,
collect inserts the JVM machine with the appropriate flags,
as above.
o If none of these procedures can find the target, the command
fails.
OPTIONS
If invoked with no arguments, collect prints a usage summary, including
the default configuration of the experiment.
Data Specifications
-p option
Collect clock-based profiling data. The allowed values of option
are:
off Turns off clock-based profiling.
lo[w] Turns on clock-based profiling with a per-thread rate
of approximately 10 samples per second.
on Turns on clock-based profiling with a per-thread rate
of approximately 100 samples per second.
hi[gh] Turns on clock-based profiling with a per-thread rate
of approximately 1000 samples per second.
n Turns on clock-based profiling with a profile timer
period of n. The value n can be an integer or a float‐
ing-point number, with a suffix of u for values in
microseconds, or m for values in milliseconds. If no
suffix is used, assume the value to be in millisec‐
onds.
If the value is smaller than the clock profiling mini‐
mum, set it to the minimum; if it is not a multiple of
the clock profiling resolution, round down to the
nearest multiple of the clock resolution. If it
exceeds the clock profiling maximum, report an error.
If it is negative or zero, report an error.
If no explicit -p argument is given, and neither count data, nor
race-detection or deadlock data is specified, turn on clock-based
profiling. If -h high or -h low is specified requesting the default
counter set for that chip at high- or low-frequency, the default
clock-profiling will also be set to high or low; an explicit -p
argument will be respected.
Clock-profiling-based dataspace and memoryspace profiling is no
longer supported; all supported machines have hardware counters for
memory operations.
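For example (an illustrative sketch; the target name a.out is hypothetical), clock-based profiling can be requested with an explicit 5-millisecond profile timer period:
      collect -p 5m a.out
The period is rounded down to a multiple of the clock profiling resolution if necessary, as described above.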
-h [parameter]
Hardware counter overflow profiling.
-h Shows extended help for collect hardware counter overflow
(HWC) profiling.
If the -h option is specified without a value for parameter,
collect prints hardware counter information. If the processor
supports hardware counter overflow profiling, collect prints
two lists containing information about hardware counters. The
first list contains "aliased" hardware counters; the second
list contains "raw" hardware counters. The output also con‐
tains the specification for the default HWC experiment for
that processor. For more details, see the "Hardware Counter
Overflow Profiling" section below.
If the processor does not support hardware counter overflow
profiling, the output says so.
The value of parameter can be set to default counters at a specific
rate, a specific counter, or a set of counters.
-h {auto | lo | on | hi}
Turns on Hardware Counter overflow (HWC) profiling data for a
default set of counters at the specified rate:
auto
Matches the rate used by clock-profiling. If clock-profiling is
disabled, use the per-thread maximum rate of approximately 100
samples per second. auto is the default and preferred setting.
lo|low
Uses per-thread maximum rate of approximately 10 samples per
second.
on
Uses per-thread maximum rate of approximately 100 samples per
second.
hi|high
Uses per-thread maximum rate of approximately 1000 samples per
second.
Alternatively, you can use specific counters:
-h ctr_def[,ctr_def]...
Collects hardware-counter-overflow profiles using one or more spec‐
ified counters. The maximum number of counters supported is proces‐
sor-dependent. You can see the maximum number of hardware counter
definitions for profiling on the current machine, the full list of
available hardware counters, and the default counter set by running
collect -h with no other arguments on the current machine.
Each counter definition takes the following form:
[+|-]ctr[~attr=val]...[~attrN=valN][/reg#],[rate]
The meanings of the counter definition options are as follows:
+|-
Optional parameter that can be applied to precise, memory-
related counters, which are the counters used for memoryspace
and dataspace profiling.
A + is the default and is not needed.
A - collects only normal hardware-counter information and not
the extra information that is used for memoryspace and datas‐
pace profiling.
See the section "MEMORYSPACE AND DATASPACE PROFILING" below.
ctr
Counter name. You can see the list of counter names for your
processor by running the collect -h command without any other
command-line arguments. On most systems, you can specify a
counter using a numeric value in hexadecimal (such as 0x00c3)
or decimal even if a counter is not listed in collect -h out‐
put. The numeric values for counters are specified in the pro‐
cessor manufacturer's manuals. The name of the relevant manual
is shown in the collect -h output. Some counters are only
described in proprietary vendor manuals. On Oracle Solaris,
when a counter is specified numerically it can help to specify
the register number also.
~attr=val
Optional one or more attribute options. On some processors,
attribute options can be associated with a hardware counter. If
the processor supports attribute options, collect -h provides a
list of attribute names to use for attr. The value val can be
in decimal or hexadecimal format. Hexadecimal format numbers
are in C program format where the number is prepended by a zero
and lower-case x (0xhex_number). Multiple attributes are con‐
catenated to the counter name. The ~ tilde character in front
of each attr name is required.
/reg#
On Oracle Solaris, hardware register to use for the counter. If
not specified, collect attempts to place the counter into the
first available register and as a result, might be unable to
place subsequent counters due to register conflicts. If you
specify more than one counter, the counters must use different
registers. You can see a list of allowable register numbers by
running the collect -h command without any other command-line
arguments. The / character is required if the register is spec‐
ified.
rate
The sampling frequency. Valid values are as follows:
auto Matches the rate used by clock profiling. If clock
profiling is disabled, use the per-thread maximum
rate of 100 samples per second. auto is the
default and preferred value.
lo Uses per-thread maximum rate of approximately 10
samples per second.
on Uses per-thread maximum rate of approximately 100
samples per second.
hi Uses per-thread maximum rate of approximately 1000
samples per second.
value Specifies a fixed event interval value to trigger
a sample, rather than a sampling rate. When speci‐
fying value, note that the actual frequency is
dependent on the selected counter and the program
under test.
The event interval can be specified in decimal or
hexadecimal format. Exercise caution in setting a
numerical value, especially as setting the inter‐
val too low can overload your application or even
your entire system. As a rule of thumb, aim for
fewer than 1000 events per second per thread. You
can use the Performance Analyzer Timeline view to
visually estimate the rate of samples.
The rate can be omitted, in which case auto will be used. Even
when the rate is omitted, the comma in front of it is required
(except for the last counter in a -h parameter).
EXAMPLES: Some valid examples of -h usage:
-h auto
-h lo
-h hi
Enable the default counters with default, low, or
high rates, respectively
-h cycles,,insts,,dcm
-h cycles -h insts -h dcm
Both have the same meaning: three counters: cycles, insts
and D-cache misses.
-p lo -h cycles,,insts,,dcm
Select a low rate of profiling for clock and HWC cycles, insts
and D-cache misses. A low rate of profiling can be used to
reduce data collection overhead and experiment size when
dealing with long-running or highly multi-threaded applications.
-h cycles~system=1
Count cycles, explicitly including cycles in system mode.
-h 0xc0/0,10000003
On Nehalem, that is equivalent to
-h inst_retired.any_p/0,10000003
Some invalid examples of -h usage:
-h cycles -h off
Can't use off with any other -h arguments
-h cycles,insts
Missing comma, and "insts" does not parse as a number for
<interval>
If the -h argument specifies the use of hardware counters but hard‐
ware counters are in use by others at the time the command is
given, the collect command will report an error and no experiment
will be run.
If no -h argument is given, no HW counter profiling data will be
collected. An experiment can specify both hardware counter overflow
profiling and clock-based profiling. Specifying hardware counter
overflow profiling will not disable clock-profiling, even if it is
enabled by default.
For more information on hardware counters, see the "Hardware
Counter Overflow Profiling" section below.
-s option[,scope]
Collect synchronization tracing data.
The minimum delay threshold for tracing events is set using option
and optionally the scope of APIs traced is set by scope.
The allowed values of option are:
on
Turns on synchronization delay tracing and set the threshold
value by calibration at runtime
calibrate
Same as on
off
Turns off synchronization delay tracing
n
Turns on synchronization delay tracing with a threshold value
of n microseconds; if n is zero, trace all events
all
Turns on synchronization delay tracing and trace all synchro‐
nization events
By default, turns off synchronization delay tracing.
For native API tracing on Oracle Solaris, the following func‐
tions are traced: mutex_lock(), rw_rdlock(), rw_wrlock(),
cond_wait(), cond_timedwait(), cond_reltimedwait(), thr_join(),
sema_wait(), pthread_mutex_lock(), pthread_rwlock_rdlock(),
pthread_rwlock_wrlock(), pthread_cond_wait(),
pthread_cond_timedwait(), pthread_cond_reltimedwait_np(),
pthread_join(), and sem_wait().
On Linux, the following functions are traced:
pthread_mutex_lock(), pthread_cond_wait(), pthread_cond_timed‐
wait(), pthread_join(), and sem_wait().
For Java programs, record synchronization events for Java moni‐
tors in user code.
The allowed values of scope are:
n
Traces native APIs.
j
Traces Java APIs
nj
Traces native and Java APIs
By default, trace both native and Java APIs.
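For example (an illustrative sketch; the target name a.out is hypothetical), the following turns on synchronization delay tracing with a 100-microsecond threshold, restricted to native APIs:
      collect -s 100,n a.out
Omitting the ,n scope traces both native and Java APIs, which is the default.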
-H option
Collects heap trace data. The allowed values of option are:
on Turns on tracing of memory allocation requests
off Turns off tracing of memory allocation requests
By default, turns off heap tracing.
Records heap-tracing events for any native calls.
Treats calls to mmap as memory allocations.
Heap profiling for Java programs traces native alloca‐
tions only, not Java allocations.
Note that heap tracing might produce very large exper‐
iments. Such experiments are very slow to load and
browse.
-i option
Collects I/O trace data. The allowed values of option are:
on Turns on tracing of I/O operations
off Turns off tracing of I/O operations
By default, turns off tracing of I/O operations.
Note that I/O tracing might produce very large experiments. Such
experiments are very slow to load and browse.
-M option
Specifies collection of an MPI experiment. (See MPI PROFILING,
below.) The target of collect should be mpirun, and its arguments
should be separated from the user target (that is, the programs that
are to be run by mpirun) by an inserted -- argument. The experiment
is named as usual, and is referred to as the "founder experiment";
its directory contains subexperiments for each of the MPI pro‐
cesses, named by rank. It is recommended that the -- argument
always be used with mpirun, so that an experiment can be collected
by prepending collect and its options to the mpirun command line.
The allowed values of option are:
MPI-version
Turns on collection of an MPI experiment, assuming the MPI ver‐
sion named. The recognized versions of MPI are printed when you
type collect with no arguments, or in response to an unrecog‐
nized version specified with -M.
off
Turns off collection of an MPI experiment.
By default, turns off collection of an MPI experiment. When an
MPI experiment is turned on, the default setting for -m (see
below) is changed to on.
-m option
Collect MPI tracing data. (See MPI PROFILING, below.)
The allowed values of option are:
on Turns on MPI tracing information.
off Turns off MPI tracing information.
By default, turn off MPI tracing, except if the -M flag is enabled,
in which case MPI tracing is turned on by default. Normally, MPI
experiments are collected with -M, and no user control of MPI trac‐
ing is needed. If you want to collect an MPI experiment, but not
collect MPI trace data, you can use the explicit flags:
-M MPI-version -m off
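For example, the following sketch collects an MPI experiment without MPI trace data, assuming the Oracle Message Passing Toolkit version of MPI (listed as OMPT by collect, as in the MPI PROFILING example below) and a hypothetical target a.out:
      collect -M OMPT -m off mpirun -np 8 -- a.out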
-c option
Collects count data. The allowed values of option are:
on Turns on count data.
static Turns on simulated count data, based on the assumption
that every instruction was executed exactly once.
off Turns off count data.
By default, turn off count data. Count data cannot be collected
with any other type of data. For count data or simulated count
data, the executable and any shared-objects that are instrumented
and statically linked are counted; for count data, but not simu‐
lated count data, dynamically loaded shared objects are also
instrumented and counted.
On Oracle Solaris, no special compilation is needed, although the
count option is incompatible with compile flags -p, -pg, -qp, -xpg,
and -xlinkopt. On Linux, the executable must be compiled with the
-xannotate=yes flag in order to collect count data.
-I directory
Specifies a directory for count data instrumentation.
-N libname
Specifies a library to be excluded from instrumentation for count
data, whether the library is linked into the executable, or loaded
with dlopen(3C). Multiple -N options can be specified.
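For example (an illustrative sketch; the target a.out, the instrumentation directory ./instrument_dir, and the library name libhelper.so are hypothetical), count data can be collected while excluding one library from instrumentation:
      collect -c on -I ./instrument_dir -N libhelper.so a.out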
-r option
Collects data for data race detection or deadlock detection for the
Thread Analyzer.
The allowed values of option are:
race
Collects data for detecting data races.
deadlock
Collects data for detecting deadlocks and potential deadlocks.
all
Collects data for detecting data races, deadlocks, and poten‐
tial deadlocks. Can also be specified as race,deadlock.
off
Turns off data collection for data races, deadlocks, and poten‐
tial deadlocks.
on
Collects data for detecting data races (same as race).
terminate
If an unrecoverable error is detected, terminates the target
process.
abort
If an unrecoverable error is detected, terminates the target
process with a core dump.
continue
If an unrecoverable error is detected, enables the process to
continue.
By default, turn off collection of all Thread Analyzer data.
The terminate, abort, and continue options can be added to any
data-collection options, and govern the behavior when an unrecover‐
able error, such as a real (not potential) deadlock, is detected. The default
behavior is terminate.
Thread Analyzer data cannot be collected with any tracing data, but
can be collected in conjunction with clock- or hardware counter
profiling data. Thread Analyzer data significantly slows down the
execution of the target, and profiles might not be meaningful as
applied to the user code.
Thread Analyzer experiments can be examined with either analyzer or
with tha. The latter displays a simplified list of default tabs,
but is otherwise identical.
In order to enable data-race detection, executables must be instru‐
mented, either at compile time, or by invoking a post-processor. If
the target is not instrumented, and none of the shared objects on
its library list is instrumented, a warning is displayed, but the
experiment is run. Other Thread Analyzer data do not require
instrumentation.
See the tha(1) man page or the Oracle Developer Studio 12.6: Thread
Analyzer User's Guide for more detail.
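For example (an illustrative sketch; the multithreaded target a.out is hypothetical), data-race detection data can be collected and the resulting experiment, test.1.er under the default naming, examined with tha or analyzer:
      collect -r race a.out
      tha test.1.er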
-S interval
Periodically samples process-wide resource utilization at the
interval specified (in seconds). The allowed values of interval
are:
off Turns off periodic sampling.
on Turns on periodic sampling with the default sampling
interval (1 second).
n Turns on periodic sampling with a sampling interval of
n in seconds; n must be positive.
By default, turn on periodic sampling.
Experiment Controls
-L size
Limit the amount of profiling and tracing data recorded to size
megabytes. The limit applies to the sum of all profiling data and
tracing data, but not to process-wide resource-utilization samples.
The limit is only approximate, and can be exceeded. When the limit
is reached, stop recording profiling and tracing data, but keep the
experiment open and record samples until the target process terminates.
The allowed values of size are:
unlimited or none
Do not impose a size limit on the experiment.
n
Imposes a limit of n megabytes. The value of n must be greater
than zero.
By default, there is no limit on the amount of data recorded.
-F option
Controls whether descendant processes should have their data
recorded. (Data is always collected on the founder process, inde‐
pendent of any -F setting.) The allowed values of option are:
on | all
Records experiments on all descendant processes.
off
Does not record experiments on any descendant processes.
=<regex>
Records experiments on those descendant processes whose exe‐
cutable name (a.out name) matches the regular expression. Only
the basename of the executable is used, not the full path. If
the <regex> that you use contains blanks or characters inter‐
preted by your shell, be sure to enclose the full =<regex>
argument in single quotes.
By default, record experiments on all descendant processes. For more
details, read the sections "FOLLOWING DESCENDANT PROCESSES", and
"PROFILING SCRIPTS" below.
-A option
Controls whether to perform archiving as part of data collection.
Archiving is required to make an experiment self-contained and por‐
table. The allowed values of option are:
on
Copies load objects (the target and any shared objects it uses)
into the experiment. Also copy any ancillary files (.anc) and
object files (.o) which have Stabs or DWARF information not in
the load object.
src
In addition to copying load objects as in -A on, copies into
the experiment all source files and ancillary files (.anc) that
can be found.
usedsrc
Similar to -A src, but only copies source files, ancillary
files (.anc), and load objects that are needed for analytics
and can be found. This option might require additional process‐
ing time, but might result in smaller experiment sizes.
off
Does not copy or archive load objects or source files into the
experiment.
Archiving will not be performed in the following circumstances:
o A profiled process is terminated before it exits nor‐
mally
o -A off is specified
In such cases, you must run er_archive explicitly on the same
machine where the profiling data was recorded.
When many processes are being profiled, enabling archiving as part
of data collection can be very expensive and might change the tim‐
ing of the application run. With many processes, a better strategy
is to collect the data with -A off, and later, when the profiling
is complete, archive the experiment using er_archive -s all. In this
case all binaries and source files will be saved in the experiment.
The minimum archiving required to enable an experiment to be
accessed on another machine is -A on. When using this option, note
that -A on does not copy any sources or object files (.o's); it is
your responsibility to ensure that those files are accessible from
the machine where the experiment is being examined, and that they
are not changed or rebuilt after the experiment was recorded.
The default setting for -A is on.
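For example, archiving can be deferred and performed after the run completes (an illustrative sketch; the target a.out is hypothetical, and test.1.er stands for the experiment name produced by the default naming):
      collect -A off a.out
      er_archive -s all test.1.er
The er_archive step must be run on the same machine where the profiling data was recorded.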
-j option
Controls Java profiling when the target is a JVM machine. The
allowed values of option are:
on
Records profiling data for the JVM, recognizes methods compiled
by the Java HotSpot virtual machine, and also records Java call
stacks. This is the default.
off
Does not record Java profiling data. Profiling data for native
call stacks is still recorded.
<path>
Records profiling data for the JVM, and uses the JVM as
installed in <path>.
See the section "JAVA PROFILING", below.
-J java_arg
Specifies additional arguments to be passed to the JVM used for
profiling. If -J is specified, Java profiling (-j on) will be
enabled. The java_arg must be surrounded by quotes if it contains
more than one argument. It consists of a set of tokens, separated
by either a blank or a tab; each token is passed as a separate
argument to the JVM. Note that most arguments to the JVM must begin
with a "-" character.
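For example (an illustrative sketch; the application name myapp.jar is hypothetical), Java profiling can be run with additional JVM arguments:
      collect -J "-Xmx1g -verbose:gc" myapp.jar
Because -J is given, Java profiling (-j on) is enabled automatically; the quoted string is split on blanks and each token is passed as a separate argument to the JVM.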
-l signal
Samples process-wide resource-utilization whenever the given signal
is delivered to the process.
See the section "DATA COLLECTION AND SIGNALS" below for more infor‐
mation about choosing a signal.
-y signal[,r]
Controls recording of data with signal, referred to as the pause-
resume signal. Whenever the given signal is delivered to the
process, switch between paused (no data is recorded) and resumed
(data is recorded) states. Start in the resumed state if the
optional ,r flag is given, otherwise start in the paused state.
This option does not affect the recording of process-wide resource-
utilization samples.
One use of the pause-resume signal is to start a target without
collecting data, allowing it to reach steady-state, and then
enabling the data.
See the section "DATA COLLECTION AND SIGNALS" below for more infor‐
mation about choosing a signal.
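For example, using SIGUSR1 as the pause-resume signal (an illustrative sketch; the target a.out is hypothetical, and the signal is given here by name), data recording can be left paused through initialization and enabled once the target reaches steady state:
      collect -y SIGUSR1 a.out &
      kill -USR1 <pid of the target process>
The first delivery of the signal switches the process from the paused state to the resumed state.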
Output Controls
-o experiment_name
Uses experiment_name as the name of the experiment to be recorded.
The experiment_name must end in the string .er; if not, print an
error message and do not run the experiment.
If -o is not specified, give the experiment a name of the form
stem.n.er, where stem is a string, and n is a number. If a group
name has been specified with -g, set stem to the group name without
the .erg suffix. If no group name has been specified, set stem to
the string "test".
If invoked from one of the commands used to run MPI jobs, for exam‐
ple, mpirun, but without -M MPI-versions, and -o is not specified,
take the value of n used in the name from the environment variable
used to define the MPI rank of that process. Otherwise, set n to
one greater than the highest integer currently in use. (See MPI
PROFILING, below.)
If the name is not specified in the form stem.n.er, and the given
name is in use, print an error message and do not run the experi‐
ment. If the name is of the form stem.n.er and the name supplied is
in use, record the experiment under a name corresponding to one
greater than the highest value of n that is currently in use. Print
a warning if the name is changed.
-d directory_name
Places the experiment in directory directory_name. If no directory
is given, place the experiment in the current working directory. If
a group is specified (see -g, below), the group file is also writ‐
ten to the directory named by -d.
For the lightest-weight data collection, it is best to record data
to a local file, with -d used to specify a directory in which to
put the data. However, for MPI experiments on a cluster, the
founder experiment must be available at the same path to all pro‐
cesses to have all data recorded into the founder experiment.
Experiments written to long-latency file systems are especially
problematic, and might progress very slowly.
-g group_name
Adds the experiment to the experiment group group_name. The
group_name string must end in the string .erg; if not, report an
error and do not run the experiment. The first line of a group file
must contain the string
#analyzer experiment group
and each subsequent line is the name of an experiment.
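For example, the -o, -d, and -g options above can be combined (an illustrative sketch; the experiment name, directory, and group name are hypothetical):
      collect -o run1.er -d /var/tmp/exps -g myruns.erg a.out
The group file myruns.erg is written to the directory named by -d, as described above.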
-O file
Appends all output from collect itself to the named file, but do
not redirect the output from the spawned target, nor from dbx (as
invoked with the -P argument), nor from the processes involved in
recording count data (as invoked with the -c argument). If file is
set to /dev/null, suppress all output from collect, including any
error messages.
-t duration
Collects data for the specified duration. duration can be a single
number followed by either m to specify minutes, or s to specify
seconds (default), or two such numbers separated by a - sign. If
one number is given, data is collected from the start of the run
until the given time; if two numbers are given, data is collected
from the first time to the second. If the second time is zero, data
is collected until the end of the run. If two non-zero numbers are
given, the first must be less than the second.
Although you specify duration in minutes or seconds, the start and
end of data collection is recognized with greater accuracy. If
clock profiling is enabled, the accuracy is approximately twice the
clock profiling interval. If clock profiling is not enabled, the
accuracy is 200 milliseconds.
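For example (an illustrative sketch; the target a.out is hypothetical), data can be collected only between the 30th and 90th second of the run:
      collect -t 30-90 a.out
A single number, such as -t 2m, would instead collect data from the start of the run until the two-minute mark.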
Other Arguments
-C comment
Puts the comment into the notes file for the experiment. Up to ten
-C arguments can be supplied.
-P <pid>
Write a script for dbx to attach to the process with the given PID,
and collect data from it, and then invoke dbx with that script.
Clock or HW counter profiling data may be specified, but neither
tracing nor count data are supported. See the collector(1) man page
for more information.
When attaching to a process, the directory is created with the
umask of the user running collect -P, but the experiment is written
as the user running the process which is being attached to. If the
user doing the attach is root, and the umask is not zero, the
experiment will fail.
Note -
On Linux, attaching to a multithreaded process, including Java,
will not properly collect data. Data for the thread that was
attached to will be captured, but not data for other threads.
-n
Dry run: do not run the target, but print all the details of the
experiment that would be run. Turn on -v.
-V
Prints the current version. No further arguments are examined, and
no further processing is done.
-v
Prints the current version and further detailed information about
the experiment being run.
-x
Leaves the target process stopped on the exit from the exec system
call, in order to allow a debugger to attach to it. The collect
command prints a message with the process PID.
To attach a debugger to the target once it is stopped by collect,
you can follow the procedure below.
o Obtain the PID of the process from the message printed
by the collect -x command
o Start the debugger
o Configure the debugger to ignore SIGPROF and, if you
chose to collect hardware counter data, SIGEMT on
Solaris or SIGIO on Linux
o Attach to the process using dbx's attach command.
o Set the collector parameters for the experiment you wish
to collect
o Issue the collector enable command
o Issue the cont command to allow the target process to
run
As the process runs under the control of the debugger, the Collec‐
tor records an experiment.
Alternatively, you can attach to the process and collect an experi‐
ment using the collect -P PID command.
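The following sketch illustrates that procedure. The target name a.out and the PID 12345 are hypothetical, and the dbx command spellings summarize the steps above rather than give exact syntax; see the dbx documentation and the collector(1) man page for details. On Oracle Solaris, also ignore SIGEMT if hardware counter data will be collected (SIGIO on Linux).
      % collect -x a.out           (collect prints a message with the PID; assume 12345)
      % dbx a.out 12345            (start dbx and attach to the stopped process)
      (dbx) ignore PROF            (ignore the clock-profiling signal, SIGPROF)
      (dbx) collector ...          (set the collector parameters for the experiment)
      (dbx) collector enable
      (dbx) cont                   (the target runs; the Collector records an experiment)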
FOLLOWING DESCENDANT PROCESSES
Data from the initial process spawned by collect, called the founder
process, is always collected. Processes can create descendant processes
by calling system library functions, including the variants of fork,
exec, system, etc. If a -F argument is used, the collector can collect
data for descendant processes, and it opens a new experiment for each
descendant process inside the parent experiment. These new experiments
are named with their lineage as follows:
o An underscore is appended to the creator's experiment name.
o A code letter is added: either "f" for a fork, or "x" for
other descendants, including exec. On Linux, "C" is used for
a descendant generated by clone(2).
o A number is added after the code letter, which is the index
of the descendant.
o The experiment suffix, ".er" is appended to the lineage.
For example, if the experiment name for the initial process is
"test.1.er", the experiment for the descendant process created by its
third fork is "test.1.er/_f3.er". If that descendant process execs a
new image, the corresponding experiment name is "test.1.er/_f3_x1.er".
If the default, -F on, is used, descendant processes initiated by calls
to fork(2), fork1(2), fork(3F), vfork(2), and exec(2) and its variants
are followed. The call to vfork is replaced internally by a call to
fork1. Descendants created by calls to system(3C), system(3F), sh(3F),
popen(3C) , and similar functions, and their associated descendant pro‐
cesses, are also followed. On Linux, descendants created by clone()
without the CLONE_VM flag are followed by default; descendants created
with the CLONE_VM flag are treated as threads, rather than processes,
and are always followed, independent of the -F setting.
If the -F =<regex> argument is used, all descendants whose name matches
the regular expression are followed. When matching names, only the
basename of the executable is used, not the full path, and not any
arguments.
For example, to capture data on the descendant process of the first
exec from the first fork from the first call to system in the founder,
use:
collect -F '=_x1_f1_x1'
To capture data on all the variants of exec, but not fork, use:
collect -F '=.*_x[0-9]/*'
To capture data from a call to system("echo hello") but not sys‐
tem("goodbye"), use:
collect -F '=echo hello'
The Analyzer and er_print automatically read experiments for descendant
processes when the founder experiment is read, and the experiments for
the descendant processes are selected for data display.
To specifically select the data for display from the command line,
specify the path name explicitly to either er_print or Analyzer. The
specified path must include the founder experiment name, and the
descendant experiment's name inside the founder directory.
For example, to see the data for the third fork of the test.1.er exper‐
iment:
er_print test.1.er/_f3.er
analyzer test.1.er/_f3.er
You can prepare an experiment group file with the explicit names of
descendant experiments of interest.
To examine descendant processes in the Analyzer, load the founder
experiment and choose View > Filter data. The Analyzer displays a list
of experiments with only the founder experiment checked. Uncheck the
founder experiment and check the descendant experiment of interest.
PROFILING SCRIPTS
By default, collect no longer requires that its target be an ELF exe‐
cutable. If collect is invoked on a script, data is collected on the
program launched to execute the script, and on all descendant pro‐
cesses. To collect data only on a specific process, use the -F flag to
specify the name of the executable to follow.
For example, to profile the script foo.sh, but collect data primarily
from the executable bar, use the command:
collect -F =bar foo.sh
Data will be collected on the founder process launched to execute the
script, and all bar processes spawned from the script, but not for
other processes.
JAVA PROFILING
Java profiling consists of collecting a performance experiment on the
JVM machine as it runs your .class or .jar files. If possible, call
stacks are collected in both the Java model and in the machine model.
On x86 platforms, if Java applications crash during data collection,
disabling capture of machine model call stacks with the SP_COLLEC‐
TOR_NATIVE_MAX_STACKDEPTH environment variable might help. See "Envi‐
ronment Variables" below.
Data can be shown with view mode set to User, Expert, or Machine. User
mode shows each method by name, with data for interpreted and HotSpot-
compiled methods aggregated together; it also suppresses data for non-
user-Java threads. Expert mode separates HotSpot-compiled methods from
interpreted methods, and does not suppress non-user Java threads.
Machine mode shows data for interpreted Java methods against the JVM
machine as it does the interpreting, while data for methods compiled
with the Java HotSpot virtual machine is reported for named methods.
All threads are shown. In all three modes, data is reported in the
usual way for any non-OpenMP C, C++, or Fortran code called by a Java
target. Such code corresponds to Java native methods. The Analyzer and
the er_print utility can switch between the view mode User, view mode
Expert, and view mode Machine, with User being the default.
Clock-based profiling and hardware counter overflow profiling are sup‐
ported. Synchronization tracing collects data only on the Java monitor
calls, and synchronization calls from native code; it does not collect
data about internal synchronization calls within the JVM.
Heap tracing is not supported for Java, and generates an error if spec‐
ified.
Some Java codes have shared objects contained within a jar file. The
shared objects are extracted to a temporary directory when the applica‐
tion runs, and are deleted when the application terminates. The shared-
object names are recorded in the experiment map file, but the jar file
name is not. To read such experiments, be sure to add an addpath direc‐
tive listing the jar file to your .er.rc file, or add the path from the
Analyzer GUI, or with the addpath command in er_print. If the addpath
directive is in your .er.rc file at the time the experiment is
archived, the shared objects will be archived.
When collect inserts a target name of java into the argument list, it
examines environment variables for a path to the java target, in the
order JDK_HOME, and then JAVA_PATH. For the first of these environment
variables that is set, the resultant target is verified as an ELF exe‐
cutable. If it is not, collect fails with an error indicating which
environment variable was used, and the full path name that was tried.
If neither of those environment variables is set, the collect command
uses the version set by your PATH. If there is no java in your PATH, a
system default of /usr/java/bin/java is tried.
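For example (an illustrative sketch; the application name myapp.jar and the JDK path /usr/jdk/latest are hypothetical), the JVM to be profiled can be selected through the environment:
      % env JDK_HOME=/usr/jdk/latest collect -j on myapp.jar
Alternatively, the JVM path can be given directly with -j <path>, as described under OPTIONS.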
JAVA PROFILING WITH A DLOPEN
Some applications are not pure Java, but are C or C++ applications that
invoke dlopen to load libjvm.so, and then start the JVM by calling into
it. The collector sets an environment variable so that Java profiling
is automatically enabled.
SHARED_OBJECT HANDLING
Normally, the collect command causes data to be collected for all
shared objects in the address space of the target, whether on the ini‐
tial library list, or explicitly dlopen'd. However, there are some cir‐
cumstances under which some shared objects are not profiled.
One such scenario is when the target program is invoked with lazy-load‐
ing. In such cases, the library is not loaded at startup time, and is
not loaded by explicitly calling dlopen, so the shared object name is
not included in the experiment, and all PCs from it are mapped to the
<Unknown> function. The workaround is to set LD_BIND_NOW, to force the
library to be loaded at startup time.
Another such scenario is when the executable is built with the -B
direct linking option. In that case the object is dynamically loaded by
a call specifically to the dynamic linker entry point of dlopen, and
the libcollector interposition is bypassed. The shared object name is
not included in the experiment, and all PCs from it are mapped to the
<Unknown> function. The workaround is to not use -B direct.
DATA COLLECTION AND SIGNALS
Profiling Signals
Signals are used for both clock- and hardware-counter-overflow profil‐
ing. SIGPROF is used in data collection for all experiments. The period
for generating the signal depends on the data being collected. SIGEMT
(Solaris) or SIGIO (Linux) is used for hardware counter overflow pro‐
filing. The overflow interval depends on the user parameter for profil‐
ing. Any user code that uses or manipulates the profiling signals may
potentially interfere with data collection. When the Collector installs
its signal handler for a profile signal, it sets a flag that ensures
that system calls are not interrupted to deliver signals. This setting
could change the behavior of a target program that uses the profiling
signals for other purposes.
When the Collector installs its signal handler for a profile signal, it
remembers whether or not the target had installed its own signal han‐
dler. The Collector also interposes on some signal-handling routines
and does not allow the user to install a signal handler for these sig‐
nals; it saves the user's handler, just as it does when the Collector
replaces a user handler on starting the experiment.
Profiling signals are delivered from the profiling timer or hard‐
ware-counter-overflow handling code in the kernel, or in response to:
the kill(2), sigsend(2), tkill(2), tgkill(2) or _lwp_kill(2) system
calls; the raise(3C) or sigqueue(3C) library calls; or the kill(1) com‐
mand. A signal code is delivered with the signal so that the Collector
can distinguish the origin. If it is delivered for profiling, it is
processed by the Collector; if it is not delivered for profiling, it is
delivered to the target signal handler.
When the Collector is running under dbx, the profiling signal delivered
occasionally has its signal code corrupted, and a profile signal may be
treated as if it were generated from a system or library call or a com‐
mand. In that case, it will be incorrectly delivered to the user's han‐
dler. If the user handler was set to SIG_DFL, it will cause the process
to fail with a core dump.
When the Collector is invoked after attaching to a target process, it
will install its signal handler, but it cannot interpose on the signal-
handling routines. If the user code installs a signal handler after
the attach, it will override the Collector's signal handler, and data
will be lost.
Note that any signal, including either of the profiling signals, may
cause premature termination of a system call, and the program must be
prepared to handle that behavior. When libcollector installs the signal
handlers for data collection, it specifies restarting those system
calls that are restartable, but some, like sleep(3C), will return early
without reporting an error.
Process-Wide Sample and Pause-Resume Signals
Signals can be specified by the user as a sample signal (-l) or a
pause-resume signal (-y). SIGUSR1 or SIGUSR2 are recommended for this
use, but any signal that is not used by the target can be used.
The profiling signals can be used if the process does not otherwise use
them, but they should be used only if no other signal is available. The
Collector interposes on some signal-handling routines and does not
allow the user to install a signal handler for these signals; it saves
the user's handler, just as it does when the Collector replaces a user
handler on starting the experiment.
If the Collector is invoked after attaching to a target process, and
the user code installs a signal handler for the sample or pause-resume
signal, those signals will no longer operate as specified.
OPENMP PROFILING
Data collection for OpenMP programs collects data that can be displayed
in any of the three view modes, just as for Java programs. In User
mode, slave threads are shown as if they were really cloned from the
master thread, and have call stacks matching those from the master
thread. Frames in the call stack coming from the OpenMP runtime code
(libmtsk.so) are suppressed. In Expert user mode, the master and slave
threads are shown differently, and the explicit functions generated by
the compiler are visible, and the frames from the OpenMP runtime code
(libmtsk.so) are suppressed. For Machine mode, the actual native stacks
are shown.
In User mode, various artificial functions are introduced as the leaf
function of a call stack whenever the runtime library is in one of sev‐
eral states. These functions are <OMP-overhead>, <OMP-idle>, <OMP-
reduction>, <OMP-implicit_barrier>, <OMP-explicit_barrier>, <OMP-
lock_wait>, <OMP-critical_section_wait>, and <OMP-ordered_sec‐
tion_wait>.
Three additional clock-profiling metrics are added to the data for
clock-profiling experiments:
OpenMP Work (ompwork)
OpenMP Wait (ompwait)
Master Thread Time (masterthread)
OpenMP Work is counted when the OpenMP runtime thinks the code is doing
work. It includes time when the process is consuming User-CPU time, but
it also can include time when the process is consuming System-CPU time,
waiting for page faults, waiting for the CPU, etc. Hence, OpenMP Work
can exceed User-CPU time. OpenMP Wait is accumulated when the OpenMP
runtime thinks the process is waiting. OpenMP Wait can include User-CPU
time for busy-waits (spin-waits), but it also includes Other-Wait time
for sleep-waits.
Master Thread Time is the total time spent in the master thread. It is
only available from Oracle Solaris experiments. It corresponds to wall-
clock time.
The inclusive metrics are visible by default; the exclusive metrics are
not. Together, the sum of OpenMP Work and OpenMP Wait equals the Total Thread Time
metric. These metrics are added for all clock- and hardware counter
profiling experiments.
Collecting information for every parallel-region entry in the execution
of the program can be very expensive. You can suppress that cost by
setting the environment variable SP_COLLECTOR_NO_OMP. If you set
SP_COLLECTOR_NO_OMP, the program will have substantially less dilation,
but you will not see the data from slave threads propagate up to the
caller, and eventually to main(), as you would when the variable is not
set.
A collector for OpenMP 3.0 is enabled by default in this release. It
can profile programs that use explicit tasking. Programs built with
earlier compilers can be profiled with the new collector only if a
patched version of libmtsk.so is available. If it is not installed, you
can switch data collection to use the old collector by setting the
environment variable SP_COLLECTOR_OLDOMP.
Note that the OpenMP profiling functionality is only available for
applications compiled with the Oracle Developer Studio compilers, since
it depends on the Oracle Developer Studio compiler runtime. GNU-com‐
piled code will only see machine-level call stacks.
MEMORYSPACE AND DATASPACE PROFILING
A memoryspace profile is a profile in which memory-related events such
as cache misses are reported against the physical structures of the
machine, such as cache-lines, memory-banks, or pages. Memoryspace pro‐
filing is available on Oracle SPARC systems and on Intel systems
running Oracle Solaris.
A dataspace profile is a profile in which those memory-related events
are reported against the data structures whose references cause the
events rather than just the instructions where the memory-related
events occur. Dataspace profiling is only available on SPARC systems
running Oracle Solaris.
For either memoryspace or dataspace profiling, you must collect hard‐
ware counter profiles on an Oracle Solaris system using precise, mem‐
ory-related counters. Such counters are found in the counter list
obtained by running the collect -h command without any other command-
line arguments; the counters are annotated memoryspace.
Further, in order to support dataspace profiling, executables should be
compiled for a SPARC platform with the -xhwcprof -xdebugformat=dwarf -g
flags.
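For example, for dataspace profiling on a SPARC system running Oracle Solaris, a hypothetical source file myprog.c could be compiled and then profiled with a precise, memory-related counter; the counter name dcm is taken from the sample alias list shown later, and the counters actually available on a given machine are reported by collect -h:
      cc -xhwcprof -xdebugformat=dwarf -g -o myprog myprog.c
      collect -h dcm,on myprog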
Memoryspace profiling data can be viewed with er_print commands or Per‐
formance Analyzer views relating to Memory Objects.
Dataspace profiling data can be viewed with the er_print utility com‐
mands data_objects, data_single, and data_layout or with Performance
Analyzer using the data views labeled DataObjects and DataLayout.
MPI PROFILING
The collect command can be used for MPI profiling to manage collection
of the data from the constituent MPI processes, collect MPI trace data,
and organize the data into a single "founder" experiment, with "subex‐
periments" for each MPI process.
The collect command can be used with MPI by simply prefacing the com‐
mand that starts the MPI job and its arguments with the desired collect
command and its arguments (assuming you have inserted the -- argument
to indicate the end of the mpirun arguments). For example, on an SMP
machine,
% mpirun -np 16 -- a.out 3 5
can be replaced by
% collect -M OMPT mpirun -np 16 -- a.out 3 5
This command runs an MPI tracing experiment on each of the 16 MPI pro‐
cesses, collecting them all in an MPI experiment, named by the usual
conventions for naming experiments. It assumes use of the Oracle Mes‐
sage Passing Toolkit (previously known as Sun HPC ClusterTools) version
of MPI.
The initial collect process reformats the mpirun command to specify
running collect with appropriate arguments on each of the individual
MPI processes.
Note that the -- argument immediately before the target name is
required for MPI profiling (although it is optional for mpirun itself),
so that collect can separate the mpirun arguments from the target and
its arguments. If the -- argument is not supplied, collect prints an
error message, and no experiment is run.
Furthermore, a -x PATH argument is added to the mpirun arguments by
collect, so that the remote collect processes can find their targets. If any
environment variables in your environment begin with "VT_" or with
"SP_COLLECTOR_", they are passed to the remote collect with -x flags
for each.
MIMD MPI runs are supported, with the similar requirement that there
must be a "--" argument after each ":" (indicating a new target and
local mpirun arguments for it). If the -- argument is not supplied,
collect prints an error message, and no experiment is run.
Some versions of Oracle Message Passing Toolkit, or Sun HPC Cluster‐
Tools have functionality for MPI State profiling. When clock-profiling
data is collected on an MPI experiment run with such a version of MPI,
two additional metrics can be shown:
MPI Work (mpiwork)
MPI Wait (mpiwait)
MPI Work accumulates when the process is inside the MPI runtime doing
work, such as processing requests or messages; MPI Wait accumulates
when the process is inside the MPI runtime, but waiting for an event,
buffer, or message.
On Oracle Solaris systems, MPI Wait is accumulated whether the MPI
library sleeps or spins when waiting. On Linux systems, MPI Wait is
accumulated when the MPI library spins when waiting; it is not accumu‐
lated if the MPI library sleeps (yields the CPU) when waiting, and will
be undercounted relative to the real wait time.
In the Analyzer, when MPI trace data is collected, two additional tabs
are shown, MPI Timeline and MPI Chart.
The technique of using mpirun to spawn explicit collect commands on the
MPI processes is no longer supported to collect MPI trace data, and
should not be used. It can still be used for all other types of data.
MPI profiling is based on the open source VampirTrace 5.5.3 release. It
recognizes several VampirTrace environment variables, and a new one,
VT_STACKS, which controls whether or not call stacks are recorded in
the data. For further information on the meaning of these variables,
see the VampirTrace 5.5.3 documentation.
The default value of the environment variable VT_BUFFER_SIZE limits the
internal buffer of the MPI API trace collector to 64 MB, and the
default value of VT_MAX_FLUSHES limits the number of times that the
buffer is flushed to 1. Events that are to be recorded after the limits
have been reached are no longer written into the trace file. The envi‐
ronment variables apply to every process of a parallel application,
meaning that applications with n processes will typically create trace
files n times the size of a serial application.
To remove the limit and get a complete trace of an application, set
VT_MAX_FLUSHES to 0. This setting causes the MPI API trace collector to
flush the buffer to disk whenever the buffer is full. To change the
size of the buffer, use the environment variable VT_BUFFER_SIZE. The
optimal value for this variable depends on the application which is to
be traced. Setting a small value will increase the memory available to
the application but will trigger frequent buffer flushes by the MPI API
trace collector. These buffer flushes can significantly change the
behavior of the application. On the other hand, setting a large value,
like 2G, will minimize buffer flushes by the MPI API trace collector,
but decrease the memory available to the application. If not enough
memory is available to hold the buffer and the application data this
might cause parts of the application to be swapped to disk, also leading
to a significant change in the behavior of the application.
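For example, to remove the flush limit and use a larger trace buffer for the MPI example shown earlier in this section (the values here are illustrative):
      % env VT_MAX_FLUSHES=0 VT_BUFFER_SIZE=2G collect -M OMPT mpirun -np 16 -- a.out 3 5
Because the variable names begin with VT_, they are passed on to the remote collect processes automatically.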
Another important variable is VT_VERBOSE, which turns on various error
and status messages, and setting it to 2 or higher is recommended if
problems arise.
Normally, MPI trace output data is post-processed when the mpirun tar‐
get exits; a processed data file is written to the experiment, and
information about the post-processing time is written into the experi‐
ment header. MPI post-processing is not done if MPI tracing is explic‐
itly disabled.
In the event of a failure in post-processing, an error is reported, and
no MPI Tabs or MPI tracing metrics will be available.
If the mpirun target does not actually invoke MPI, an experiment will
still be recorded, but no MPI trace data will be produced. The experi‐
ment will report an MPI post-processing error, and no MPI Tabs or MPI
tracing metrics will be available.
If the environment variable VT_UNIFY is set to "0", the post-processing
routines, er_vtunify and er_mpipp, will not be run by collect. They will
be run the first time either er_print or analyzer are invoked on the
experiment.
USING COLLECT WITH PPGSZ
The collect command can be used with ppgsz by running the collect com‐
mand on the ppgsz command, and specifying the -F on flag. The founder
experiment is on the ppgsz executable and is uninteresting. If your
path finds the 32-bit version of ppgsz, and the experiment is being run
on a system that supports 64-bit processes, the first thing the collect
command does is execute an exec function on its 64-bit version, creat‐
ing _x1.er. That executable forks, creating _x1_f1.er. The descendant
process attempts to execute an exec function on the named target, in
the first directory on your path, then in the second, and so forth,
until one of the exec functions succeeds. If, for example, the third
attempt succeeds, the first two descendant experiments are named
_x1_f1_x1.er and _x1_f1_x2.er, and both are completely empty. The
experiment on the target is the one from the successful exec, the third
one in the example, and is named _x1_f1_x3.er, stored under the founder
experiment. It can be processed directly by invoking the Analyzer or
the er_print utility on test.1.er/_x1_f1_x3.er.
If the 64-bit ppgsz is the initial process run, or if the 32-bit ppgsz
is invoked on a 32-bit kernel, the fork descendant that executes exec
on the real target has its data in _f1.er, and the real target's exper‐
iment is in _f1_x3.er, assuming the same path properties as in the
example above.
See the section "FOLLOWING DESCENDANT PROCESSES", above. For more
information on hardware counters, see the "Hardware Counter Overflow
Profiling" section below.
USING COLLECT ON SETUID/SETGID TARGETS
The collect command operates by inserting a shared library, libcollec‐
tor.so, into the target's address space (LD_PRELOAD), along with addi‐
tional shared libraries for specific tracing data collection. Those
shared libraries write the files that constitute the experiment.
Several problems might arise if collect is invoked on executables that
call setuid or setgid, or that create descendant processes that call
setuid or setgid. If the user running the experiment is not root, col‐
lection fails because the shared libraries are not installed in a
trusted directory. The workaround is to run the experiments as root, or
use crle(1) to grant permission. Users should, of course, take great
care when circumventing security barriers, and do so at their own risk.
In addition, the umask for the user running the collect command must be
set to allow write permission for that user, and for any users or
groups that are set by the setuid/setgid attributes of a program being
exec'd and for any user or group to which that program sets itself. If
the mask is not set properly, some files might not be written to the
experiment, and processing of the experiment might not be possible. If
the log file can be written, an error is shown when the user attempts
to process the experiment.
Note that when attaching as one user to a process that is owned by
another user, umask must be set to allow writing by the user owning the
process to which you are attaching.
Other problems can arise if the target itself makes any of the system
calls to set UID or GID, or if it changes its umask and then forks or
runs exec on some other process, or if crle was used to configure how the
runtime linker searches for shared objects.
If an experiment is started as root on a target that changes its effec‐
tive GID, the er_archive process that is automatically run when the
experiment terminates fails, because it needs a shared library that is
not marked as trusted. In that case, you can run er_archive (or
er_print or Analyzer) explicitly by hand, on the machine on which the
experiment was recorded, immediately following the termination of the
experiment.
DATA COLLECTED
Three types of data are collected: profiling data, tracing data, and
process-wide resource-utilization data. The data packets recorded in
profiling and tracing include the callstack of each LWP, the LWP,
thread, and CPU IDs, and some event-specific data. The data packets
recorded in process-wide resource-utilization samples contain global
data such as execution statistics, but no program-specific or event-
specific data. All data packets include a timestamp.
For each data type, the description below lists the metrics derived
from that data, both by name and by the string the user would use in a
metrics command when examining an experiment.
Clock-based Profiling
The event-specific data recorded in clock-based profiling is an
array of counts for each accounting microstate. The microstate
array is incremented by the system at a prescribed frequency, and
is recorded by the Collector when a profiling signal is processed.
Clock-based profiling can run at a range of frequencies which must
be multiples of the clock resolution used for the profiling timer.
If you try to do high-resolution profiling on a machine with an
operating system that does not support it, the command prints a
warning message and uses the highest resolution supported. Simi‐
larly, a custom setting that is not a multiple of the resolution
supported by the system is rounded down to the nearest non-zero
multiple of that resolution, and a warning message is printed.
On Oracle Solaris, clock-based profiling data is converted into the
following metrics:
Total Thread Time (total) = sum over all ten microstates
Total CPU Time (totalcpu) = user + system + trap
User CPU Time (user)
System CPU Time (system)
Trap CPU Time (trap)
User Lock Time (lock)
Data Page Fault Time (datapfault)
Text Page Fault Time (textpfault)
Kernel Page Fault Time (kernelpfault)
Stopped Time (stop)
Wait CPU Time (wait)
Sleep Time (sleep)
For experiments on multithreaded applications, all of the times are
summed across all threads in the process. Total Thread Time adds up
to the real elapsed time, multiplied by the average number of
threads in the process.
On Linux, clock-based profiling data produces one metric: Total CPU
Time (totalcpu).
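As a sketch of how these metric strings are used (test.1.er is an
assumed experiment name, and the e. prefix requests the exclusive form
of a metric in er_print):
er_print -metrics e.totalcpu:e.user:e.system -sort e.totalcpu -functions test.1.er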
If clock-based profiling is performed on an OpenMP program, three
additional metrics are provided:
OpenMP Work (ompwork)
OpenMP Wait (ompwait)
Master Thread Time (masterthread)
On Oracle Solaris, OpenMP Work accumulates when work is being done
in parallel. OpenMP Wait accumulates when the OpenMP runtime is
waiting for synchronization, whether the wait consumes CPU time or
sleeps; it also accumulates when work is being done in parallel but
the thread is not scheduled on a CPU. Master Thread Time represents
time spent in the master thread only.
On Linux, OpenMP Work and OpenMP Wait are accumulated only when the
process is active in either user or system mode. Unless you have
specified that OpenMP should do a busy wait, OpenMP Wait on Linux
will not be useful. Master Thread Time is not provided on Linux.
If clock-based profiling is performed on an MPI program, run under
Oracle Message Passing Toolkit or Sun HPC ClusterTools release 8.1
or later, two additional metrics are provided:
MPI Work (mpiwork)
MPI Wait (mpiwait)
On Oracle Solaris, MPI Work accumulates when the MPI runtime is
active. MPI Wait accumulates when the MPI runtime is waiting for
the send or receive of a message, or when the MPI runtime is
active, but the thread is not running on a CPU.
On Linux, MPI Work and MPI Wait are accumulated only when the
process is active in either user or system mode. Unless you have
specified that MPI should do a busy wait, MPI Wait on Linux will
not be useful.
Hardware Counter Overflow Profiling
Hardware counter overflow profiling records the number of events
counted by the hardware counter at the time the overflow signal was
processed.
The counters available depend on the specific processor chip and
operating system. Running the command collect -h with no other
arguments will describe the processor, and the number of hardware
counters available, along with a list of all counters and a default
hardware-counter set for that processor. The counters that are
aliased to common names are displayed first in the list, followed
by a list of the raw hardware counters. After the list of known
counters is printed, the name of the reference manual for the chip
and the default counter set defined for that chip are printed.
If neither the performance counter subsystem nor collect knows the
names of the counters on a specific chip, the tables are empty. Even
so, the counters can still be specified numerically as described
above.
The lines of output are formatted like the following:
Aliases for most useful HW counters:
alias     raw name        type         units       regs  description
cycles    Cycles_user                  CPU-cycles  0123  CPU Cycles
insts     Instr_all                    events      0123  Instructions Executed
c_stalls  Commit_0_cyc                 CPU-cycles  0123  Stall Cycles
loads     Instr_ld        memoryspace  events      0123  Load Instructions
stores    Instr_st        memoryspace  events      0123  Store Instructions
dcm       DC_miss_commit  memoryspace  events      0123  L1 D-cache Misses
...
Raw HW counters:
name                 type  units       regs  description
Sel_pipe_drain_cyc         CPU-cycles  0123
Sel_0_wait_cyc             CPU-cycles  0123
Sel_0_ready_cyc            CPU-cycles  0123
...
The top section labeled Aliases for most useful HW counters con‐
tains the following columns.
alias Gives a convenient non-processor-specific alias that
can be used in a -h argument.
raw name Lists the real unaliased processor-specific counter
name.
type Lists counter type information, when applicable. Coun‐
ters of type memoryspace can be used for memoryspace
and, where available, dataspace profiling. Rarely, a
not-program-related type appears indicating a counter
that captures events that cannot be attributed
directly to your program. Specifying such a counter
produces a warning and profiling will not record a
call stack; time will be attributed to an artificial
function called collector_not_program_related; and
Thread IDs and LWP IDs will be meaningless.
units Shows either CPU-cycles, which can approximately be
converted to time during analysis, or events, which are
raw hardware counts.
regs Specifies which registers can be used for the counter.
description Provides a description of the counter.
The Raw HW counters section is similar except that no aliases are
listed. Introductory paragraphs describing the counters might be
available for certain processors.
If the two aliases cycles and insts are collected, two additional
metrics are available, CPI (cycles per instruction) and IPC
(instructions per cycle). A high CPI ratio or a low IPC ratio indi‐
cates code that runs inefficiently in the machine. A low CPI ratio
or a high IPC ratio indicates code that runs efficiently in the
pipeline.
EXAMPLES:
Example 1: Using the aliased counter information listed in the
above sample output, the following command:
collect -p hi -h cycles
enables CPU Cycles profiling, with hi chosen to generate a peak
event rate of approximately 1000 events/second/thread. Note that
generating too high an event rate will ultimately distort the per‐
formance you are trying to profile.
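Example 2 (a sketch added for illustration; a.out is a hypothetical
target): to make the CPI and IPC metrics described above available,
collect both the cycles and insts aliases, either through the default
counter set or by naming them explicitly. The comma-separated form
below follows the per-counter interval keywords accepted by earlier
releases; run collect -h with no other arguments to confirm the exact
syntax for your release:
collect -h auto a.out
collect -h cycles,on,insts,on a.out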
Synchronization Delay Tracing
Synchronization delay tracing records all calls to the various
thread synchronization routines where the real-time delay in the
call exceeds a specified threshold. The data packet contains time‐
stamps for entry and exit to the synchronization routines, the
thread ID, and the LWP ID at the time the request is initiated.
Synchronization requests from a thread can be initiated on one LWP,
but complete on another.
Synchronization delay tracing data is converted into the following
metrics:
Synchronization Wait Time (sync)
Synchronization Delay Events (syncn)
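For example, a sketch assuming that the -s option enables
synchronization delay tracing (run collect with no arguments to confirm
the option on your release); a.out and test.1.er are hypothetical names:
collect -s on a.out
er_print -metrics e.sync:e.syncn -sort e.sync -functions test.1.er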
Heap Tracing
Heap tracing records all calls to malloc, free, realloc, memalign,
and valloc with the size of the block requested, its address, and
for realloc, the previous address. Calls to calloc are recorded on
Oracle Solaris but not on Linux.
Heap tracing data is converted into the following metrics:
Allocations (heapalloccnt)
Bytes Allocated (heapallocbytes)
Leaks (heapleakcnt)
Bytes Leaked (heapleakbytes)
Leaks are defined as allocations that are not freed. If a zero-
length block is allocated, it counts as an allocation with zero
bytes allocated. If a zero-length block is not freed, it counts as
a leak with zero bytes leaked.
Heap tracing experiments can be very large, and might be slow to
process.
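For example, a sketch assuming that the -H option enables heap tracing
(confirm with the collect usage summary); a.out and test.1.er are
hypothetical names:
collect -H on a.out
er_print -metrics e.heapleakbytes:e.heapleakcnt -sort e.heapleakbytes -functions test.1.er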
IO Tracing
IO tracing records all calls to the standard IO routines and all IO
system calls.
IO tracing data is converted into the following metrics:
Bytes Read (ioreadbytes)
Read Count (ioreadcnt)
Read Time (ioreadtime)
Bytes Written (iowritebytes)
Write Count (iowritecnt)
Write Time (iowritetime)
Other IO Count (ioothrcnt)
Other IO Time (ioothertime)
IO Error Count (ioerrorcnt)
IO Error Time (ioerrortime)
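For example, a sketch assuming that the -i option enables IO tracing
(confirm with the collect usage summary); a.out and test.1.er are
hypothetical names:
collect -i on a.out
er_print -metrics e.ioreadbytes:e.ioreadtime:e.iowritebytes -functions test.1.er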
MPI Tracing
MPI tracing records calls to the MPI library for functions that can
take a significant amount of time to complete. MPI tracing is
implemented using the Open Source Vampir Trace code.
MPI tracing data is converted into the following metrics:
MPI Time (mpitime)
MPI Sends (mpisendcnt)
MPI Bytes Sent (mpisendbytes)
MPI Receives (mpirecvcnt)
MPI Bytes Received (mpirecvbytes)
Other MPI Events (mpiothercnt)
MPI Time is the total thread time spent in the MPI function. If MPI
state times are also collected, MPI Work Time plus MPI Wait Time
for all MPI functions other than MPI_Init and MPI_Finalize should
approximately equal MPI Time. On Linux, MPI Wait and MPI Work are
based on user+system CPU time, while MPI Time is based on real
time, so the numbers will not match.
The MPI Bytes Received metric counts the actual number of bytes
received in all messages. MPI Bytes Sent counts the actual number
of bytes sent in all messages. MPI Sends counts the number of mes‐
sages sent, and MPI Receives counts the number of messages
received. MPI_Sendrecv counts as both a send and a receive. MPI
Other Events counts the events in the trace that are neither sends
nor receives.
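As a sketch of examining these metrics (test.1.er is an assumed name
for an MPI experiment that has already been recorded):
er_print -metrics e.mpitime:e.mpisendbytes:e.mpirecvbytes -sort e.mpitime -functions test.1.er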
Count Data
Count data is recorded by instrumenting the executable and counting
the number of times each instruction was executed. It also counts
the number of times the first instruction in a function is
executed, and reports that count as the function execution count.
On SPARC systems only, it also counts the number of times an
instruction in a branch-delay slot is annulled.
Count data is converted into the following metrics:
Bit Func Count (bit_fcount)
Bit Inst Exec (bit_instx)
Bit Inst Annul (bit_annul) -- SPARC only
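For example, a sketch using the -c on option referenced under
SP_COLLECTOR_SIZE_LIMIT below (a.out and test.1.er are hypothetical
names):
collect -c on a.out
er_print -metrics e.bit_fcount:e.bit_instx -functions test.1.er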
Data-race Detection Data
Data-race detection data consists of pairs of race-access events
that constitute a race. The events are combined into a race, and
races for which the call stacks of the two accesses are identical
are merged into a race group.
Data-race detection data is converted into the following metric:
Race Accesses (raccess)
Deadlock Detection Data
Deadlock detection data consists of pairs of threads with conflict‐
ing locks.
Deadlock detection data is converted into the following metric:
Deadlocks (deadlocks)
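For example, a sketch assuming the -r keywords used by the Thread
Analyzer workflow, see tha(1); a.out is a hypothetical target and the
keywords should be confirmed with the collect usage summary:
collect -r race a.out        # record data-race detection data
collect -r deadlock a.out    # record deadlock detection data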
Process-Wide Resource-Utilization Samples
Process-wide resource utilization can be sampled occasionally. The
data is attributed to the process and does not map to function-
level metrics.
Process-wide resource utilization is always sampled at the start
and termination of the process. By default or if a non-zero -S
argument is specified, samples are taken periodically at the speci‐
fied interval. In addition, samples can be taken by using the lib‐
collector(3) API.
The data recorded at each sample consists of microstate accounting
information from the kernel, along with various other statistics
maintained within the kernel.
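For example, a sketch that records clock-based profiling and takes
process-wide resource-utilization samples every 10 seconds, assuming
the -S argument is a period in seconds (a.out is a hypothetical
target):
collect -p on -S 10 a.out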
ENVIRONMENT VARIABLES
SP_COLLECTOR_JAVA_MAX_STACKDEPTH
Set the maximum number of callstack frames captured, or set to '0'
to prevent capturing Java callstacks. The default behavior is to
capture up to 256 frames.
SP_COLLECTOR_NATIVE_MAX_STACKDEPTH
Set the maximum number of callstack frames captured, or set to '0'
to prevent capturing native callstacks. The default behavior is to
capture up to 256 frames. When profiling Java on x86 systems, set‐
ting SP_COLLECTOR_NATIVE_MAX_STACKDEPTH=0 might reduce the risk of
fatal errors related to native stack unwind. When native callstacks
are disabled, JNI and assembly stacks will not be captured.
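For example, a sketch raising the native stack depth for a deeply
recursive program, and disabling native callstacks for a Java run on
x86 (a.out and app.jar are hypothetical targets):
SP_COLLECTOR_NATIVE_MAX_STACKDEPTH=1024 collect -p on a.out
SP_COLLECTOR_NATIVE_MAX_STACKDEPTH=0 collect app.jar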
SP_COLLECTOR_NO_VALIDATE
Define this variable to disable checking hardware, system, and Java
versions. The default is to do all checks. Setting this variable
will significantly speed up the start-up of the collect command.
SP_COLLECTOR_OUTPUT
Specify a file to which the output of the collect command is redirected.
SP_COLLECTOR_SIZE_LIMIT
When using the -c on option, enables you to specify the maximum
size of the experiment in megabytes. For all collect options except
-c on, you can use -L to specify a maximum experiment size.
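For example, a sketch limiting a count-data experiment to approximately
2000 megabytes, and doing the same for a clock-profiling experiment
with -L, assuming -L also takes a size in megabytes (a.out is a
hypothetical target):
SP_COLLECTOR_SIZE_LIMIT=2000 collect -c on a.out
collect -L 2000 -p on a.out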
SP_ER_PRINT_ALLOW_COREDUMP
Define this variable to allow the operating system to generate a
core file if the analyzer back-end (er_print process) encounters a
fatal error. If not defined, the analyzer back-end will not gener‐
ate core files, but will instead create an error report located at
/tmp/analyzer.process-ID/crash.sigsignal.process-ID where process-
ID is the Process ID and signal is the signal number.
SP_COLLECTOR_HWC_DEFAULT
Define this variable to turn on profiling with the default hardware
counters. This is equivalent to using the -h auto option.
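For example, the following two invocations request the same default
hardware counter set (a.out is a hypothetical target):
SP_COLLECTOR_HWC_DEFAULT=1 collect a.out
collect -h auto a.out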
SP_COLLECTOR_NO_OMP
Define this variable to suppress tracking of parallel regions. The
program will have substantially less dilation, but the data from
slave threads will not propagate to main().
SP_COLLECTOR_OLDOMP
Define this variable to profile a program built with compilers from
Sun Studio 12.0 or earlier versions.
RESTRICTIONS
Most of the Performance Analyzer binaries depend on finding a shared
library from the installation containing the binaries. Users must not
set LD_LIBRARY_PATH to include any library directories from a different
installation of the tools. The binaries might fail to execute if the
LD_LIBRARY_PATH is set to a different installation.
By default, the Collector collects stacks that are 256 frames deep. To
support deeper stacks, set the environment variable SP_COLLEC‐
TOR_NATIVE_MAX_STACKDEPTH to a larger number. If you are profiling a
Java binary, set the SP_COLLECTOR_JAVA_MAX_STACKDEPTH environment vari‐
able.
The Collector interposes on some signal-handling routines to protect
its use of SIGPROF signals for clock-based profiling and SIGEMT (Oracle
Solaris) or SIGIO (Linux) for hardware counter overflow profiling
against disruption by the target program. See the section "DATA COLLEC‐
TION AND SIGNALS" above.
The Collector interposes on setitimer(2) for clock profiling, periodic
sampling, and hardware counter checking. Any setitimer calls from tar‐
get programs will fail.
On Oracle Solaris, the Collector interposes on functions in the hard‐
ware counter library, libcpc.so, so that an application cannot use
hardware counters while the Collector is collecting performance data.
The interposed functions return a value of -1.
Dataspace profiling is only available on SPARC systems running Oracle
Solaris.
For this release, the data from process-wide resource utilization sam‐
ples might not be reliable on systems running the Linux OS.
Hardware counter overflow profiling cannot be run on an Oracle Solaris
system where cpustat is running, because cpustat takes control of the
counters, and does not let a user process use them.
Java profiling requires Java 2 SDK (JDK) 7, Update 11, or a later JDK.
collect cannot be used on executables compiled with the -xprofile=tcov
flag.
Data is not collected on descendant processes that are created to use
the setuid attribute, nor on any descendant processes created with an
exec call for an executable that is not dynamically linked. Further‐
more, subsequent descendant processes might produce corrupted or
unreadable experiments. The workaround is to ensure that all processes
spawned are dynamically-linked and do not have the setuid attribute.
Applications that call vfork(2) have these calls replaced by a call to
fork1(2).
Count data (collect -c) cannot be collected on Oracle Linux 5 systems;
count data cannot be collected for 32-bit binaries on any Linux system
at all.
On Linux systems, data cannot be collected on applications using
clone(2) with the CLONE_VM flag.
SEE ALSO
analyzer(1), collector(1), dbx(1), er_archive(1), er_cp(1),
er_export(1), er_mv(1), er_print(1), er_rm(1), tha(1), libcollector(3)
Performance Analyzer manual
Studio 12.6 May 2017 collect(1)