This chapter provides information on how to view and analyze experiment results by using the prof(1) report generator. This chapter has the following sections:
Performance data is examined using prof, a text-based report generator that prints to stdout.
Use either of the following syntaxes to generate a report from performance data gathered during experiments recorded by ssrun(1):
prof [options][speedshop_data_file]... |
or
prof [options] executable_name [speedshop_data_file]... |
The arguments for prof when used with data files from ssrun are as follows:
options: zero or more of the options described in Table 7-1.
executable_name: the name of the executable file (including its path) created by the compiler. This argument is needed if prof is unable to locate the executable relative to the location of the data files being analyzed because the data or the executable were moved after the files were created.
speedshop_data_file: one or more names of performance data files generated by ssrun. The file names may differ only in the ID portion of their names. The exp_type portion of the names must be identical.
The following table lists prof options that are current for this release. For more information and for a list of any newly added options since this printing, see the prof(1) man page.
The prof command generates a performance report that is printed to stdout. Warning and fatal errors are printed to stderr.
![]() | Note: Fortran alternate entry point times are attributed to the main function or subroutine, since there is no general way for prof to separate the times for the alternate entries. |
When you call prof with one or more SpeedShop performance data files, it collects the data from all the output files and produces a listing. The prof command is able to detect which experiment was run and generate an appropriate report. The command can identify all of the experiment types used with the ssrun command.
In cases where prof accepts more than one data file as input, it sums up the results. The multiple input data files must be generated from the same executable, using the same experiment type.
The prof command may report times for procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF stands for Dummy Function and indicates cycles spent in parts of text which are not in any function: init, fini, and MIPS.stubs sections, for example.
The most frequently used reports that prof generates are described in the following sections:
For usertime experiments, prof generates CPU times for individual routines and shows how those times compare with the rest of the program. The column heading are as follows:
The index column provides an index number for reference.
The excl.secs column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in __start(), but 0.03 of a second was spent in fread.
The excl.% column shows the percentage of a program's total time that was spent in the function.
The cum.% column shows the percentage of the complete program time that has been spent in the functions that have been listed so far.
The incl.secs column shows how much time, in seconds, was spent in the function and descendents of the function.
The incl.% column shows the cumulative percentage of inclusive time spent in each function and its descendents.
The samples column provides the number of samples of the function and all of its descendants.
The function (dso:file,line) column lists the function name, its DSO name, its file name, and its line number.
The following example is an abbreviated version of the full report. For a complete report, see “Generating a Report” in Chapter 2.
------------------------------------------------------------------------- SpeedShop profile listing generated Mon Feb 2 11:07:15 1998 prof generic.usertime.m10981 generic (n32): Target program usertime: Experiment name ut:cu: Marching orders R4400 / R4000: CPU / FPU 1: Number of CPUs 175: Clock frequency (MHz.) Experiment notes-- From file generic.usertime.m10981: Caliper point 0 at target begin, PID 10981 /usr/demos/SpeedShop/linpack.demos/c/generic Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of statistical callstack sampling data (usertime)-- 809: Total Samples 0: Samples with incomplete traceback 24.270: Accumulated Time (secs.) 30.0: Sample interval (msecs.) ------------------------------------------------------------------------- Function list, in descending order by exclusive time ------------------------------------------------------------------------- [index] excl.secs excl.% cum.% incl.secs incl.% samples function (dso: file, line) [4] 22.770 93.8% 93.8% 22.770 93.8% 759 anneal (generic: generic.c, 1573) |
For [f]pcsamp[x] experiments, prof generates a function list annotated with the number of samples taken for the function and the estimated time spent in the function. The column headings are as follows:
The [index] column assigns a reference number to each function.
The secs column shows the amount of CPU time that was spent in the function.
The % column shows the percentage of the total program time that was spent in the function.
The cum.% column shows the percentage of the complete program time that has been spent in the functions that have been listed so far.
The samples column shows how many samples were taken when the process was executing in the function.
The function (dso:file, line) column lists the function, its DSO name, its file name, and its line number.
The following is output from an fpcsamp experiment:
------------------------------------------------------------------------- SpeedShop profile listing generated Mon Feb 2 11:01:36 1998 prof generic.fpcsamp.m11140 generic (n32): Target program fpcsamp: Experiment name pc,2,1000,0:cu: Marching orders R4400 / R4000: CPU / FPU 1: Number of CPUs 175: Clock frequency (MHz.) Experiment notes-- From file generic.fpcsamp.m11140: Caliper point 0 at target begin, PID 11140 /usr/demos/SpeedShop/linpack.demos/c/generic Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of statistical PC sampling data (fpcsamp)-- 23828: Total samples 23.828: Accumulated time (secs.) 1.0: Time per sample (msecs.) 2: Sample bin width (bytes) ------------------------------------------------------------------------- Function list, in descending order by time ------------------------------------------------------------------------- [index] secs % cum.% samples function (dso: file, line) [1] 22.279 93.5% 93.5% 22279 anneal (generic: generic.c,1573) |
For the various hwc experiments, prof generates a function list annotated with the number of overflows of hardware counters generated by the function. The column headings are as follows:
The [index] column assigns a reference number to each function.
The counts column shows the extrapolated event count based on the number of samples and the overflow value for the particular counter.
The % column shows the percentage of the program's overflows that occurred in the function.
The cum.% column shows the percentage of the program's overflows that occurred in the functions that have been listed so far.
The samples column shows the number of times the program counter was sampled during execution of the function.
The function (dso:file, line) column lists the name, the DSO, the file name, and line number of the function.
The following is output from a dsc_hwc hardware counter experiment:
------------------------------------------------------------------------- SpeedShop profile listing generated Mon Feb 2 11:11:44 1998 prof generic.dsc_hwc.m294398 generic (n32): Target program dsc_hwc: Experiment name hwc,26,131:cu: Marching orders R10000 / R10010: CPU / FPU 16: Number of CPUs 195: Clock frequency (MHz.) Experiment notes-- From file generic.dsc_hwc.m294398: Caliper point 0 at target begin, PID 294398 /usr/demos/SpeedShop/linpack.demos/c/generic Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of R10K perf. counter overflow PC sampling data (dsc_hwc)-- 6: Total samples Sec cache D misses (26): Counter name (number) 131: Counter overflow value 786: Total counts ------------------------------------------------------------------------- Function list, in descending order by counts ------------------------------------------------------------------------- [index] counts % cum.% samples function (dso: file, line) [1] 131 16.7% 16.7% 1 init2da (generic: generic.c, 1430) [2] 131 16.7% 33.3% 1 genLog (generic: generic.c, 1686) [3] 131 16.7% 50.0% 1 _write (libc.so.1: writeSCI.c, 27) 393 50.0% 100.0% 3 **OTHER** (includes excluded DSOs, rld, etc.) 786 100.0% 100.0% 6 TOTAL |
For bbcounts experiments, prof generates a function list annotated with the number of cycles and instructions attributed to the function and the estimated time spent in the function.
The prof command does not take into account interactions between basic blocks. Within a single basic block, prof computes cycles for one execution and multiplies it with the number of times that basic block is executed.
If any of the object files linked into the application have been stripped of line number information (with ld -x, for example), prof warns about the affected procedures. The instruction counts for such procedures are shown as a procedure total, not on a per-basic-block basis. Where a line number would normally appear in a report on a function without line numbers, question marks appear instead. The column headings are as follows:
The [index] column assigns a reference number to each function.
The excl.secs column shows the minimum number of seconds that might be spent in the function under bbcounts conditions.
The excl.% column represents how much of the program's total time was spent in the function.
The cum.% column shows the cumulative percentage of time spent in the functions that have been listed so far.
The cycles column reports the number of machine cycles used by the function.
The instructions column shows the number of instructions executed by a function.
The calls column reports the number of calls to the function.
The function (dso:file, line) column lists the function, its DSO name, its file name, and the line number.
The following is output from a bbcounts experiment.
------------------------------------------------------------------------- SpeedShop profile listing generated Mon Aug 14 13:51:00 2000 prof -butterfly generic.bbcounts.m46372 generic (n32): Target program bbcounts: Experiment name it:cu: Marching orders R12000 / R12010: CPU / FPU 127: Number of CPUs 400: Clock frequency (MHz.) Experiment notes-- From file generic.bbcounts.m46372: Caliper point 0 at target begin, PID 46372 generic Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of bbcounts time data (bbcounts)-- 2048835049: Total number of instructions executed 2552056463: Total computed cycles 6.380: Total computed execution time (secs.) 1.246: Average cycles / instruction ------------------------------------------------------------------------- Function list, in descending order by exclusive bbcounts time ------------------------------------------------------------------------- [index] excl.secs excl.% cum.% cycles instructions incl.secs incl.% calls function (dso: file, line) [5] 6.088 95.4% 95.4% 2435240026 1956780024 6.088 95.4% 1 anneal (generic: generic.c, 1559) |
If the -butterfly flag is added to prof, a list of callers and callees of each function is provided:
------------------------------------------------------------------------- Butterfly function list, in descending order by inclusive bbcounts time ------------------------------------------------------------------------- attrib.% attrib.time(#calls) incl.time caller (callsite) [index] [index] incl.% incl.time self% self-time procedure [index] attrib.% attrib.time(#calls) incl.time callee (callsite) [index] ------------------------------------------------------------------------- [1] 99.9% 6.376 0.0% 0.000 __start [1] 99.9% 6.376(0000001) 6.376 main [2] 0.0% 0.000(0000001) 0.000 __readenv_sigfpe [314] 0.0% 0.000(0000001) 0.000 __istart [315] ------------------------------------------------------------------------- 99.9% 6.376(0000001) 6.376 __start [1] [2] 99.9% 6.376 0.0% 0.000 main [2] 99.9% 6.376(0000001) 6.376 Scriptstring [3] ------------------------------------------------------------------------- 99.9% 6.376(0000001) 6.376 main [2] [3] 99.9% 6.376 0.0% 0.000 Scriptstring [3] 95.4% 6.088(0000001) 6.088 usrtime [4] 3.7% 0.238(0000001) 0.238 libdso [6] 0.8% 0.050(0000001) 0.050 cvttrap [9] 0.0% 0.000(0000001) 0.000 iofile [31] 0.0% 0.000(0000002) 0.000 genLog [36] 0.0% 0.000(0000001) 0.000 dirstat [56] 0.0% 0.000(0000001) 0.000 linklist [63] 0.0% 0.000(0000001) 0.000 fpetraps [65] 0.0% 0.000(0000002) 0.000 fprintf [54] 0.0% 0.000(0000002) 0.000 sprintf [49] 0.0% 0.000(0000061) 0.000 strcmp [47] ------------------------------------------------------------------------- 95.4% 6.088(0000001) 6.376 Scriptstring [3] [4] 95.4% 6.088 0.0% 0.000 usrtime [4] 95.4% 6.088(0000001) 6.088 anneal [5] 0.0% 0.000(0000001) 0.000 genLog [36] 0.0% 0.000(0000001) 0.000 fprintf [54] ------------------------------------------------------------------------- |
The fpe trace report shows information for each function. The function name is shown in the right column of the report. The remaining columns are described below.
The [index] column assigns a reference number to each function.
The excl.FPEs column shows how many floating point exceptions were found in the function.
The excl.% column shows the percentage of the total number of floating-point exceptions that were found in the function.
The cum.% column shows the percentage of floating-point exceptions in the program that have been encountered so far in the list.
The incl.FPEs column shows how many floating-point exceptions were attributed to the function and descendents of the function.
The incl.% column shows the cumulative percentage of floating-point exceptions attributed to the function and its descendents.
The function (dso:file, line) column lists the function name, its DSO name, its file name, and its line number.
------------------------------------------------------------------------- SpeedShop profile listing generated Mon Feb 2 13:26:33 1998 prof generic.fpe.m12213 generic (n32): Target program fpe: Experiment name fpe:cu: Marching orders R4400 / R4000: CPU / FPU 1: Number of CPUs 175: Clock frequency (MHz.) Experiment notes-- From file generic.fpe.m12213: Caliper point 0 at target begin, PID 12213 /usr/demos/SpeedShop/linpack.demos/c/generic Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of FPE callstack tracing data (fpe)-- 4: Total FPEs 0: Samples with incomplete traceback ------------------------------------------------------------------------- Function list, in descending order by exclusive FPEs ------------------------------------------------------------------------- [index] excl.FPEs excl.% cum.% incl.FPEs incl.% function (dso: file, line) [1] 4 100.0% 100.0% 4 100.0% fpetraps (generic: generic.c, 405) |
This section shows the output from calling prof with some of the options available for prof.
For pcsamp and bbcounts experiments, the -dis option to prof can be used to obtain machine instruction information. prof provides the standard report and then appends the machine instruction information to the end of the report. The following example shows partial output from prof for a pcsamp experiment.
------------------------------------------------------------------------- SpeedShop profile listing generated Tue Feb 3 10:48:59 1998 prof -dis generic.pcsamp.m14493 generic (n32): Target program pcsamp: Experiment name pc,2,10000,0:cu: Marching orders R4400 / R4000: CPU / FPU 1: Number of CPUs 175: Clock frequency (MHz.) Experiment notes-- From file generic.pcsamp.m14493: Caliper point 0 at target begin, PID 14493 /usr/demos/SpeedShop/c/generic Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of statistical PC sampling data (pcsamp)-- 2707: Total samples 27.070: Accumulated time (secs.) 10.0: Time per sample (msecs.) 2: Sample bin width (bytes) ------------------------------------------------------------------------- Function list, in descending order by time ------------------------------------------------------------------------- [index] secs % cum.% samples function (dso: file, line) [1] 25.240 93.2% 93.2% 2524 anneal (generic: generic.c, 1573) [2] 1.090 4.0% 97.3% 109 slaveusrtime (dlslave.so: dlslave.c, 22) [3] 0.390 1.4% 98.7% 39 __read (libc.so.1: read.s, 20) [4] 0.230 0.8% 99.6% 23 cvttrap (generic: generic.c, 317) [5] 0.090 0.3% 99.9% 9 _xstat (libc.so.1: xstat.s, 12) [6] 0.010 0.0% 99.9% 1 __write (libc.so.1: write.s, 20) [7] 0.010 0.0% 100.0% 1 _ngetdents (libc.so.1: ngetdents.s, 16) [8] 0.010 0.0% 100.0% 1 _doprnt (libc.so.1: doprnt.c, 285) 27.070 100.0% 100.0% 2707 TOTAL ------------------------------------------------------------------------- Disassembly listing, annotated with PC sampling overflows ------------------------------------------------------------------------- . . . /usr/demos/SpeedShop/linpack.demos/c/generic.c anneal: <0x10006830-0x10006b3c> 2524 total samples(93.24%) [1573] 0x10006830 0x27bdffd0 addiu sp,sp,-48 # 1 [1573] 0x10006834 0xffbc0020 sd gp,32(sp) # 2 [1573] 0x10006838 0xffbf0018 sd ra,24(sp) # 3 [1573] 0x1000683c 0x3c030002 lui v1,0x2 # 4 [1573] 0x10006840 0x246397e8 addiu v1,v1,-26648 # 5 [1573] 0x10006844 0x0323e021 addu gp,t9,v1 # 6 [1575] 0x10006848 0xd7808370 ldc1 $f0,-31888(gp) # 7 <2 cycle stall for following instruction> [1575] 0x1000684c 0xf7a00000 sdc1 $f0,0(sp) # 10 [1577] 0x10006850 0x24010001 li at,1 # 11 [1577] 0x10006854 0x8f82816c lw v0,-32404(gp) # 12 <2 cycle stall for following instruction> [1577] 0x10006858 0xac410000 sw at,0(v0) # 15 [1578] 0x1000685c 0x8f998148 lw t9,-32440(gp) # 16 [1578] 0x10006860 0x0c00171b jal 0x10005c6c # 17 [1578] 0x10006864 0000000000 nop # 18 <2 cycle stall for following instruction> [1586] 0x10006868 0xafa00008 sw zero,8(sp) # 21 [1586] 0x1000686c 0x8fa40008 lw a0,8(sp) # 22 <2 cycle stall for following instruction> [1586] 0x10006870 0x28842710 slti a0,a0,10000 # 25 [1586] 0x10006874 0x108000ac beq a0,zero,0x10006b28 # 26 [1586] 0x10006878 0000000000 nop # 27 <2 cycle stall for following instruction> [1588] 0x1000687c 0x24070001 li a3,1 # 30 [1588] 0x10006880 0xafa7000c sw a3,12(sp) # 31 [1588] 0x10006884 0x8f868164 lw a2,-32412(gp) # 32 <2 cycle stall for following instruction> [1588] 0x10006888 0x8cc60000 lw a2,0(a2) # 35 <2 cycle stall for following instruction> [1588] 0x1000688c 0x24c6ffff addiu a2,a2,-1 # 38 [1588] 0x10006890 0x8fa5000c lw a1,12(sp) # 39 <2 cycle stall for following instruction> [1588] 0x10006894 0x00a6282a slt a1,a1,a2 # 42 [1588] 0x10006898 0x10a0009c beq a1,zero,0x10006b0c # 43 [1588] 0x1000689c 0000000000 nop # 44 <2 cycle stall for following instruction> [1589] 0x100068a0 0x240a0001 li t2,1 # 47 ^------ 1 samples(0.04%)------^ [1589] 0x100068a4 0xafaa0010 sw t2,16(sp) # 48 ^------ 1 samples(0.04%)------^ [1589] 0x100068a8 0x8f898164 lw t1,-32412(gp) # 49 <2 cycle stall for following instruction> [1589] 0x100068ac 0x8d290000 lw t1,0(t1) # 52 <2 cycle stall for following instruction> [1589] 0x100068b0 0x2529ffff addiu t1,t1,-1 # 55 [1589] 0x100068b4 0x8fa80010 lw t0,16(sp) # 56 <2 cycle stall for following instruction> [1589] 0x100068b8 0x0109402a slt t0,t0,t1 # 59 [1589] 0x100068bc 0x11000089 beq t0,zero,0x10006ae4 # 60 [1589] 0x100068c0 0000000000 nop # 61 <2 cycle stall for following instruction> [1590] 0x100068c4 0x8faf000c lw t7,12(sp) # 64 ^------ 27 samples(1.00%)------^ <2 cycle stall for following instruction> [1590] 0x100068c8 0x25ef0001 addiu t7,t7,1 # 67 ^------ 7 samples(0.26%)------^ [1590] 0x100068cc 0x000f7080 sll t6,t7,2 # 68 ^------ 30 samples(1.11%)------^ [1590] 0x100068d0 0x01cf7021 addu t6,t6,t7 # 69 ^------ 8 samples(0.30%)------^ [1590] 0x100068d4 0x000e70c0 sll t6,t6,3 # 70 ^------ 5 samples(0.18%)------^ [1590] 0x100068d8 0x8faf0010 lw t7,16(sp) # 71 ^------ 8 samples(0.30%)------^ <2 cycle stall for following instruction> [1590] 0x100068dc 0x01cf7021 addu t6,t6,t7 # 74 ^------ 9 samples(0.33%)------^ [1590] 0x100068e0 0x000e70c0 sll t6,t6,3 # 75 ^------ 27 samples(1.00%)------^ [1590] 0x100068e4 0x8f8f817c lw t7,-32388(gp) # 76 ^------ 14 samples(0.52%)------^ <2 cycle stall for following instruction> [1590] 0x100068e8 0x01cf7021 addu t6,t6,t7 # 79 ^------ 9 samples(0.33%)------^ [1590] 0x100068ec 0x25ce0008 addiu t6,t6,8 # 80 ^------ 28 samples(1.03%)------^ [1590] 0x100068f0 0xd5c10000 ldc1 $f1,0(t6) # 81 ^------ 7 samples(0.26%)------^ [1590] 0x100068f4 0x8fad000c lw t5,12(sp) # 82 ^------ 10 samples(0.37%)------^ <2 cycle stall for following instruction> [1590] 0x100068f8 0x25ad0001 addiu t5,t5,1 # 85 ^------ 21 samples(0.78%)------^ [1590] 0x100068fc 0x000d6080 sll t4,t5,2 # 86 ^------ 19 samples(0.70%)------^ [1590] 0x10006900 0x018d6021 addu t4,t4,t5 # 87 ^------ 9 samples(0.33%)------^ [1590] 0x10006904 0x000c60c0 sll t4,t4,3 # 88 ^------ 14 samples(0.52%)------^ [1590] 0x10006908 0x8fad0010 lw t5,16(sp) # 89 ^------ 8 samples(0.30%)------^ <2 cycle stall for following instruction> [1590] 0x1000690c 0x018d6021 addu t4,t4,t5 # 92 ^------ 8 samples(0.30%)------^ [1590] 0x10006910 0x000c60c0 sll t4,t4,3 # 93 ^------ 30 samples(1.11%)------^ [1590] 0x10006914 0x8f8d817c lw t5,-32388(gp) # 94 ^------ 10 samples(0.37%)------^ <2 cycle stall for following instruction> [1590] 0x10006918 0x018d6021 addu t4,t4,t5 # 97 ^------ 8 samples(0.30%)------^ [1590] 0x1000691c 0xd5820000 ldc1 $f2,0(t4) # 98 ^------ 28 samples(1.03%)------^ [1590] 0x10006920 0x8fab000c lw t3,12(sp) # 99 ^------ 9 samples(0.33%)------^ <2 cycle stall for following instruction> [1590] 0x10006924 0x256b0001 addiu t3,t3,1 # 102 ^------ 11 samples(0.41%)------^ [1590] 0x10006928 0x000b5080 sll t2,t3,2 # 103 ^------ 25 samples(0.92%)------^ [1590] 0x1000692c 0x014b5021 addu t2,t2,t3 # 104 ^------ 11 samples(0.41%)------^ [1590] 0x10006930 0x000a50c0 sll t2,t2,3 # 105 ^------ 8 samples(0.30%)------^ [1590] 0x10006934 0x8fab0010 lw t3,16(sp) # 106 ^------ 11 samples(0.41%)------^ <2 cycle stall for following instruction> [1590] 0x10006938 0x014b5021 addu t2,t2,t3 # 109 ^------ 7 samples(0.26%)------^ [1590] 0x1000693c 0x000a50c0 sll t2,t2,3 # 110 ^------ 26 samples(0.96%)------^ [1590] 0x10006940 0x8f8b817c lw t3,-32388(gp) # 111 ^------ 13 samples(0.48%)------^ <2 cycle stall for following instruction> [1590] 0x10006944 0x014b5021 addu t2,t2,t3 # 114 ^------ 9 samples(0.33%)------^ [1590] 0x10006948 0x254afff8 addiu t2,t2,-8 # 115 ^------ 26 samples(0.96%)------^ [1590] 0x1000694c 0xd5430000 ldc1 $f3,0(t2) # 116 ^------ 11 samples(0.41%)------^ [1590] 0x10006950 0x8fa9000c lw t1,12(sp) # 117 ^------ 10 samples(0.37%)------^ <2 cycle stall for following instruction> [1590] 0x10006954 0x00094080 sll t0,t1,2 # 120 ^------ 11 samples(0.41%)------^ . . . |
The listing shows statistics about the procedure anneal() in the file generic.c and lists the beginning and ending addresses of anneal(): <0x10006830-0x10006b3c>. The five columns display the following information:
Column | Displays |
1 | Line number of the instruction: [1573]. |
2 | Beginning address of the instruction: 0x10006830. |
3 | Instruction in hexadecimal: 0x27bdffd0. |
4 | Assembler form (mnemonic) of the instruction: addiu sp,sp,-48. |
5 | Cycle in which the instruction executed: # 1. |
Other information includes:
The number of times the immediately preceding branch was executed and taken (bbcounts only).
The total number of cycles in a basic block and the percentage of the total cycles for that basic block; the number of times the branch terminating that basic block was executed; and the number of cycles for one execution of that basic block (bbcounts only).
The total number of samples at an instruction (pcsamp only).
Any cycle stalls, that is, cycles that were wasted.
For bbcounts experiments, the -S option to prof can be used to obtain source line information. prof provides the standard report and then appends the source line information to the end of the report.
This example shows output from calling prof for a bbcounts experiment. Note that some lines are wrapped here to accommodate page width:
------------------------------------------------------------------------- SpeedShop profile listing generated Mon Jul 17 14:45:28 2000 prof -S generic.bbcounts.m190404 generic (n32): Target program bbcounts: Experiment name it:cu: Marching orders R12000 / R12010: CPU / FPU 128: Number of CPUs 400: Clock frequency (MHz.) Experiment notes-- From file generic.bbcounts.m190404: Caliper point 0 at target begin, PID 190404 generic Caliper point 1 bgn_calc Caliper point 2 at exit(0) ------------------------------------------------------------------------- Summary of bbcounts time data (bbcounts)-- 2048886059: Total number of instructions executed 2552098900: Total computed cycles 6.380: Total computed execution time (secs.) 1.246: Average cycles / instruction ------------------------------------------------------------------------- Function list, in descending order by exclusive bbcounts time ------------------------------------------------------------------------- [index] excl.secs excl.% cum.% cycles instructions calls function (dso: file, line) [1] 6.088 95.4% 95.4% 2435240026 1956780024 1 anneal (generic: generic.c, 1560) [2] 0.238 3.7% 99.1% 95000839 75000732 1 slaveusrtime (dlslave.so: dlslave.c, 22) [3] 0.050 0.8% 99.9% 20000056 15000054 1 cvttrap (generic: generic.c, 317) [4] 0.001 0.0% 99.9% 503138 559313 5212 resolve_relocations (rld: rld.c, 2636) [5] 0.001 0.0% 100.0% 274847 282220 1255 general_find_symbol (rld: rld.c, 2038) [6] 0.000 0.0% 100.0% 116756 120371 3 fix_all_defineds (rld: rld.c, 3419) [7] 0.000 0.0% 100.0% 115819 145585 1270 elfhash (rld: obj.c, 1184) [8] 0.000 0.0% 100.0% 102496 146324 6406 obj_dynsym_got (rld: objfcn.c, 46) [9] 0.000 0.0% 100.0% 89123 116619 948 fread (libc.so.1: fread.c, 27) [10] 0.000 0.0% 100.0% 74339 58123 1 init2da (generic: generic.c, 1417) . . . ----------------------- disassembly listing ----------------------- *DF*_generic.MIPS.stubs_1 *DF*_generic.MIPS.stubs_1@0x10001fd4-0x100023d8: <0x10001fd4-0x100023d8> . . . /usr/people/n4733/demos/SpeedShop/generic/generic.c main: <0x10002500-0x10002640> 31 total cycles(0.00%) invoked 1 times, average 31 cycles/invocation File '/usr/people/n4733/demos/SpeedShop/generic/generic.c': Skipping source listing to line 91 92: void sproctestgrandchild(void *); /* sproc grandchild code */ 93: 94: static struct timeval starttime; /* starting time - first timestamp */ 95: static struct timeval ttime; /* last-recorded timestamp */ 96: static struct timeval deltatime; 97: 98: int pagesize; 99: 100: main(unsigned argc, char **argv) 101: { [101] 0x10002500 0x27bdffd0 addiu sp,sp,-48 # 1 [101] 0x10002504 0xffbc0010 sd gp,16(sp) # 2 [101] 0x10002508 0xffbf0008 sd ra,8(sp) # 3 [101] 0x1000250c 0x3c010002 lui at,0x2 # 3 [101] 0x10002510 0x2421eb28 addiu at,at,-5336 # 4 [101] 0x10002514 0x0321e021 addu gp,t9,at # 5 [101] 0x10002518 0xafa40024 sw a0,36(sp) # 5 [101] 0x1000251c 0xafa5002c sw a1,44(sp) # 6 102: int i; 103: 104: /* initialize the timestamp */ 105: (void) gettimeofday(&starttime, NULL); [105] 0x10002520 0x27848198 addiu a0,gp,-32360 # 6 [105] 0x10002524 0x00002825 move a1,zero # 6 [105] 0x10002528 0x8f998064 lw t9,-32668(gp) # 7 <2 cycle stall for following instruction> [105] 0x1000252c 0x0320f809 jalr ra,t9 # 8 [105] 0x10002530 0000000000 nop # 9 ^--- 11 total cycles(0.00%) executed 1 times, average 11 cycles.---^ 106: 107: /* set up to reap any children */ 108: (void) sigset(SIGCHLD, (SIG_PF)reapSig); [108] 0x10002534 0x24040012 li a0,18 # 1 [108] 0x10002538 0x8f858130 lw a1,-32464(gp) # 2 [108] 0x1000253c 0x8f998068 lw t9,-32664(gp) # 3 <2 cycle stall for following instruction> [108] 0x10002540 0x0320f809 jalr ra,t9 # 4 [108] 0x10002544 0000000000 nop # 5 ^--- 6 total cycles(0.00%) executed 1 times, average 6 cycles.---^ 109: 110: if(argc == 1) { [110] 0x10002548 0x8fa20024 lw v0,36(sp) # 1 [110] 0x1000254c 0x24030001 li v1,1 # 2 <2 cycle stall for following instruction> [110] 0x10002550 0x1443000c bne v0,v1,0x10002584 # 3 [110] 0x10002554 0000000000 nop # 4 Preceding branch executed 1 times, taken 0 times. ^--- 5 total cycles(0.00%) executed 1 times, average 5 cycles.---^ 111: Scriptstring(DEFAULT_SCRIPT); [111] 0x10002558 0x8f848034 lw a0,-32716(gp) # 1 <1 cycle stall for following instruction> [111] 0x1000255c 0x24848008 addiu a0,a0,-32760 # 2 [111] 0x10002560 0x8f998138 lw t9,-32456(gp) # 2 [111] 0x10002564 0x0c000a14 jal 0x10002850 # 2 [111] 0x10002568 0000000000 nop # 3 ^--- 4 total cycles(0.00%) executed 1 times, average 4 cycles.---^ 112: exit(0); [112] 0x1000256c 0x00002025 move a0,zero # 1 [112] 0x10002570 0x8f99805c lw t9,-32676(gp) # 2 <2 cycle stall for following instruction> [112] 0x10002574 0x0320f809 jalr ra,t9 # 3 [112] 0x10002578 0000000000 nop # 4 ^--- 5 total cycles(0.00%) executed 1 times, average 5 cycles.---^ [112] 0x1000257c 0x10000020 b 0x10002600 # 1 [112] 0x10002580 0000000000 nop # 2 ^--- 0 total cycles(0.00%) executed 0 times, average 2 cycles.---^ 113: } else { 114: 115: i = argc; [115] 0x10002584 0x8fa60024 lw a2,36(sp) # 1 [115] 0x10002588 0xafa60000 sw a2,0(sp) # 2 . . . |
When you run prof on the output of an experiment in which you have recorded caliper points, you can use the -calipers option to specify the area of the program for which you want to generate a performance report. For example, if you record just one caliper point in the middle of your program, prof can provide a report from the beginning of the program up to the first caliper point using the following command:
% prof -calipers 0,1 executable_name.exp_type.id |
The prof command can also provide a report from the caliper point to the end of the program using the following command:
% prof -calipers 1,2 executable_name.exp_type.id |
If you record two caliper points (0, 1, 2, 3), prof can generate a report from the second to the third caliper point:
% prof -calipers 1,2 executable_name.exp_type.id |
For bbcounts, usertime, and fpe experiments, the -butterfly option to prof can be used to obtain inclusive metric information. prof provides the standard report and then appends the inclusive function counts information to the end of the report. The following example is partial output from prof, showing just the inclusive function counts report.
With inclusive cycle counting, prof prints a list of functions at the end, which are called but not defined. It also includes functions from libss; they are instrumented, but their data is normally excluded.
The prof report does not list the cycles of a procedure in the inclusive listing for the following reasons:
init, fini, and MIPS.stubs sections are not part of any procedure.
Calls to procedures that do not use a “jump and link” are not recognized as procedure calls. (This is not true for bbcounts experiments.)
When global procedures with the same name are executed in different DSOs, only one of them is listed.
These exceptions are listed at the end of the report.
This example shows output from calling prof for a usertime experiment:
------------------------------------------------------------------------- SpeedShop profile listing generated Thu Feb 12 13:52:09 1998 prof -butterfly generic.usertime.m10981 generic (n32): Target program usertime: Experiment name ut:cu: Marching orders R4400 / R4000: CPU / FPU 1: Number of CPUs 175: Clock frequency (MHz.) Experiment notes-- From file generic.usertime.m10981: Caliper point 0 at target begin, PID 10981 /usr/demos/SpeedShop/linpack.demos/c/generic Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of statistical callstack sampling data (usertime)-- 809: Total Samples 0: Samples with incomplete traceback 24.270: Accumulated Time (secs.) 30.0: Sample interval (msecs.) ------------------------------------------------------------------------- Function list, in descending order by exclusive time ------------------------------------------------------------------------- [index] excl.secs excl.% cum.% incl.secs incl.% samples function (dso: file, line) [4] 22.770 93.8% 93.8% 22.770 93.8% 759 anneal (generic: generic.c, 1573) [6] 1.020 4.2% 98.0% 1.020 4.2% 34 slaveusrtime (dlslave.so: dlslave.c, 22) [9] 0.210 0.9% 98.9% 0.210 0.9% 7 cvttrap (generic: generic.c, 317) [12] 0.120 0.5% 99.4% 0.120 0.5% 4 _pm_create_special (libc.so.1: pm.c, 191) [14] 0.090 0.4% 99.8% 0.090 0.4% 3 _migr_policy_args_init (libc.so.1: pm.c, 398) [10] 0.030 0.1% 99.9% 0.180 0.7% 6 iofile (generic: generic.c, 464) [11] 0.030 0.1% 100.0% 0.150 0.6% 5 _doscan_f (libc.so.1: inline_doscan.c, 615) [1] 0.000 0.0% 100.0% 24.270 100.0% 809 __start (generic: crt1text.s, 101) [2] 0.000 0.0% 100.0% 24.270 100.0% 809 main (generic: generic.c, 101) [3] 0.000 0.0% 100.0% 24.270 100.0% 809 Scriptstring (generic: generic.c, 184) [5] 0.000 0.0% 100.0% 22.770 93.8% 759 usrtime (generic: generic.c, 1377) [15] 0.000 0.0% 100.0% 0.090 0.4% 3 dirstat (generic: generic.c, 348) [16] 0.000 0.0% 100.0% 0.090 0.4% 3 _pread (libc.so.1: preadSCI.c, 33) [13] 0.000 0.0% 100.0% 0.120 0.5% 4 _fullocale (libc.so.1: _locale.c, 77) [7] 0.000 0.0% 100.0% 1.020 4.2% 34 libdso (generic: generic.c, 619) [8] 0.000 0.0% 100.0% 1.020 4.2% 34 dlslave_routine (dlslave.so: dlslave.c, 7) ------------------------------------------------------------------------- Butterfly function list, in descending order by inclusive time ------------------------------------------------------------------------- attrib.% attrib.time incl.time caller (callsite) [index] [index] incl.% incl.time self% self-time function [index] attrib.% attrib.time incl.time callee (callsite) [index] ------------------------------------------------------------------------- [1] 100.0% 24.270 0.0% 0.000 __start [1] 100.0% 24.270 24.270 main (@0x10001fac; generic: crt1text.s, 177) [2] ------------------------------------------------------------------------- 100.0% 24.270 24.270 __start (@0x10001fac; generic: crt1text.s, 177) [1] [2] 100.0% 24.270 0.0% 0.000 main [2] 100.0% 24.270 24.270 Scriptstring (@0x10002040; generic: generic.c, 111) [3] ------------------------------------------------------------------------- 100.0% 24.270 24.270 main (@0x10002040; generic: generic.c, 111) [2] [3] 100.0% 24.270 0.0% 0.000 Scriptstring [3] 93.8% 22.770 22.770 usrtime (@0x10002460; generic: generic.c, 214) [5] 4.2% 1.020 1.020 libdso (@0x10002460; generic: generic.c, 214) [7] 0.9% 0.210 0.210 cvttrap (@0x10002460; generic: generic.c, 214) [9] 0.7% 0.180 0.180 iofile (@0x10002460; generic: generic.c, 214) [10] 0.4% 0.090 0.090 dirstat (@0x10002460; generic: generic.c, 214) [15] ------------------------------------------------------------------------- 93.8% 22.770 22.770 usrtime (@0x10005c30; generic: generic.c, 1393) [5] [4] 93.8% 22.770 93.8% 22.770 anneal [4] ------------------------------------------------------------------------- 93.8% 22.770 24.270 Scriptstring (@0x10002460; generic: generic.c, 214) [3] [5] 93.8% 22.770 0.0% 0.000 usrtime [5] 93.8% 22.770 22.770 anneal (@0x10005c30; generic: generic.c, 1393) [4] ------------------------------------------------------------------------- 4.2% 1.020 1.020 dlslave_routine (@0x5ffe0690; dlslave.so: dlslave.c, 9) [8] [6] 4.2% 1.020 4.2% 1.020 slaveusrtime [6] ------------------------------------------------------------------------- 4.2% 1.020 24.270 Scriptstring (@0x10002460; generic: generic.c, 214) [3] [7] 4.2% 1.020 0.0% 0.000 libdso [7] 4.2% 1.020 1.020 dlslave_routine (@0x100032a0; generic: generic.c, 650) [8] ------------------------------------------------------------------------- 4.2% 1.020 1.020 libdso (@0x100032a0; generic: generic.c, 650) [7] [8] 4.2% 1.020 0.0% 0.000 dlslave_routine [8] 4.2% 1.020 1.020 slaveusrtime (@0x5ffe0690; dlslave.so: dlslave.c, 9) [6] ------------------------------------------------------------------------- 0.9% 0.210 24.270 Scriptstring (@0x10002460; generic: generic.c, 214) [3] [9] 0.9% 0.210 0.9% 0.210 cvttrap [9] ------------------------------------------------------------------------- 0.7% 0.180 24.270 Scriptstring (@0x10002460; generic: generic.c, 214) [3] [10] 0.7% 0.180 0.1% 0.030 iofile [10] 0.6% 0.150 0.150 _doscan_f (@0x10002d48; generic: generic.c, 483) [11] ------------------------------------------------------------------------- 0.6% 0.150 0.180 iofile (@0x10002d48; generic: generic.c, 483) [10] [11] 0.6% 0.150 0.1% 0.030 _doscan_f [11] 0.5% 0.120 0.120 _fullocale (@0x0fadebe4; libc.so.1: inline_doscan.c, 808) [13] ------------------------------------------------------------------------- 0.5% 0.120 0.120 _fullocale (@0x0fb0b1b8; libc.so.1: _locale.c, 84) [13] [12] 0.5% 0.120 0.5% 0.120 _pm_create_special [12] ------------------------------------------------------------------------- 0.5% 0.120 0.150 _doscan_f (@0x0fadebe4; libc.so.1: inline_doscan.c, 808) [11] [13] 0.5% 0.120 0.0% 0.000 _fullocale [13] 0.5% 0.120 0.120 _pm_create_special (@0x0fb0b1b8; libc.so.1: _locale.c, 84) [12] ------------------------------------------------------------------------- 0.4% 0.090 0.090 _pread (@0x0fb05928; libc.so.1: preadSCI.c, 33) [16] [14] 0.4% 0.090 0.4% 0.090 _migr_policy_args_init [14] ------------------------------------------------------------------------- 0.4% 0.090 24.270 Scriptstring (@0x10002460; generic: generic.c, 214) [3] [15] 0.4% 0.090 0.0% 0.000 dirstat [15] 0.4% 0.090 0.090 _pread (@0x10002a5c; generic: generic.c, 381) [16] ------------------------------------------------------------------------- 0.4% 0.090 0.090 dirstat (@0x10002a5c; generic: generic.c, 381) [15] [16] 0.4% 0.090 0.0% 0.000 _pread [16] 0.4% 0.090 0.090 _migr_policy_args_init (@0x0fb05928; libc.so.1: preadSCI.c, 33) [14] ------------------------------------------------------------------------- |
If you need to generate a report for a machine model that is different from the one on which the experiment was performed, you can use several of the prof options to specify a machine model.
For example, if you record a bbcounts experiment on an R4000 processor with a clock frequency of 100 megahertz, but you want to generate a report for an R10000 processor with a clock frequency of 196 megahertz, the prof command would be the following:
% prof -r10000 -clock 196 generic.bbcounts.m4561 |
You can gather data from executables that use the sproc(2) and sprocsp(2) system calls, such as those executables generated by POWER Fortran and POWER C. Prepare and run the job using the same method as for uniprocessed executables. For multiprocessed executables, each thread of execution writes its own separate data file. View these data files with prof.
The only difference between multiprocessed and regular executables is how the data files are named. The data files are named prog_name.experiment_type.id.
The experiment ID, id, consists of one or two letters (designating the process type) and the process ID number. See Table 1-1 for the letter codes and their meanings. This naming convention avoids the potential conflict of multiple threads attempting to write simultaneously to the same file.
You can determine the overhead of a parallel program by including the -overhead argument on the prof command line. The overhead information will be included at the end of the usual report.
You can get the overhead report for any experiment, but it may be the most useful for the following experiments:
pcsamp
usertime
bbcounts
The hardware counter experiments (*_hwc and *_hwctime)
Use the following steps to generate an overhead report on a system with multiple processors.
Run the bbcounts experiment on the executable:
% ssrun -bbcounts linpackd |
The ssrun command generates the following files, each from a different processor, for an 8-processor program:
linpackd.bbcounts.m422744 linpackd.bbcounts.p421778 linpackd.bbcounts.p422191 linpackd.bbcounts.p422252 linpackd.bbcounts.p422313 linpackd.bbcounts.p422620 linpackd.bbcounts.p422704 linpackd.bbcounts.p422731 |
Combine the experiment files into one experiment file using the ssaggregate(1) command.
% ssaggregate -e linpackd.bbcounts* -o bbcounts.8ps |
Execute the prof command with the -overhead option to create the data file.
% prof -overhead bbcounts.8ps >result |
The result file contains the usual bbcounts experiment output with the overhead information at the end. Note that some lines are wrapped here to accommodate page width:
------------------------------------------------------------------------- SpeedShop profile listing generated Fri Jun 25 09:21:27 1999 prof -overhead bbcounts.8ps linpackd (n32): Target program bbcounts: Experiment name it:cu: Marching orders R10000 / R10010: CPU / FPU 16: Number of CPUs 195: Clock frequency (MHz.) Experiment notes-- From file bbcounts.8ps: Caliper point 0 at target begin, PID 422744 linpackd Caliper point 0 at target begin, PID 422620 linpackd Caliper point 0 at target begin, PID 422731 linpackd Caliper point 0 at target begin, PID 422704 linpackd Caliper point 0 at target begin, PID 422252 linpackd Caliper point 0 at target begin, PID 421778 linpackd Caliper point 0 at target begin, PID 422191 linpackd Caliper point 0 at target begin, PID 422313 linpackd Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of bbcounts time data (bbcounts)-- 29877509668: Total number of instructions executed 20592366537: Total computed cycles 105.602: Total computed execution time (secs.) 0.689: Average cycles / instruction ------------------------------------------------------------------------- Function list, in descending order by exclusive bbcounts time ------------------------------------------------------------------------- [index] excl.secs excl.% cum.% cycles instructions calls function (dso: file, line) [1] 72.955 69.1% 69.1% 14226219437 24895879414 140 __mp_slave_wait_for_work (libmp.so: mp_parallel_do.s, 593) [2] 30.344 28.7% 97.8% 5917081268 4669997342 772633 daxpy (linpackd: linpack.f, 495) . . . ------------------------------------------------------------------------- OpenMP Report ------------------------------------------------------------------------- Parallelization Overhead: 00.000% Load Imbalance: 00.076% Insufficient Parallelism: 69.085% Barrier Loss: 00.002% Synchronization Loss: 00.000% Other Model-specific Overhead: 00.000% |
The parallel model used in the program was OpenMP, although other parallel models (such as MPI and pthreads) are supported. The categories for which information is returned vary depending on the model. The OpenMP categories have the following meanings:
Parallelization Overhead: the percentage of the program's time spent doing work necessary to a parallel program, such as distributing loop iterations and data among the processors. The percentage is negligible for this program.
Load Imbalance: the percentage of a program's time lost because work was not spread evenly across the processors. This number would be 0 if each processor had exactly the same amount of work.
Insufficient Parallelism: the percentage of a program's time in which the processors were not working in parallel. The number returned for this program tells us that about two-thirds of the program time was not parallelized.
Barrier Loss: the percentage of the program's time used by the barrier process. This is not the time processors spent waiting at barriers.
Synchronization Loss: the percentage of the program's time used by the other synchronization processes.
Other Model-specific Overhead: the percentage of a program's time spent in other OpenMP (in this case) processes.
The same aggregated experiment file created above can be used by the cvperf(1) command to display overhead information in its own window. For an example, see the ProDev WorkShop: Performance Analyzer User's Guide.
If you run a bbcounts experiment, run prof with the -feedback option to generate a feedback file that can be used to arrange procedures more efficiently on the next compilation. You can rearrange procedures using the -fb option on compiler command lines.
To reorder code regions for the cord(1) command, use the -cordfb or -cordfball option to prof.
For more information, see your compiler man page or the cord(1) man page.
After running experiments, you can compare experiment results by using the sscompare command. This command can be used to analyze performance data generated by ssrun and produce a comparison report. This can be particularly useful when comparing the effects of different optimization techniques, for example, or when comparing different experiments for the same application.
The sscompare command has the following syntax:
sscompare [-by type] [-individual] [-left caliper] [-output format] [-path pathname] [-percentages] [-precision digits] [rough] [-right caliper] [statistics] [-ut_exclusive] [-ut_inclusive] [-width characters] |
The comparison report contains a legend and a table of performance data. Each input file and the type of performance data it contains is listed in the legend with a numeric column key. The table contains multiple columns of data. There is one column for each experiment file (if the individual option is used) or one for each statistic (if the statistics option was used).
The following example demonstrates this command's use. In this example, a SpeedShop PC sampling experiment is run on the OpenMP implementation of the NAS Conjugate Gradient Parallel Benchmark using four threads. The resulting experiment files are then compared side-by-side using sscompare:
% setenv OMP_NUM_THREADS 4 % ssrun -fpcsampx cg.A NAS Parallel Benchmarks 2.3 OpenMP C version - CG Benchmark Size: 14000 Iterations: 15 ... ... ... % sscompare -by function -individual -percentages cg.A.fpcsampx.* 1: Exclusive `PC sampling' time for cg.A.fpcsampx.T0.m2675229 2: Exclusive `PC sampling' time for cg.A.fpcsampx.T1.p2676575 3: Exclusive `PC sampling' time for cg.A.fpcsampx.T2.p2678981 4: Exclusive `PC sampling' time for cg.A.fpcsampx.T3.p2681324 1 2 3 4 63.973% 74.819% 67.227% 63.820% conj_grad (cg.A: cg.c, 356) 0.000% 22.881% 23.754% 24.255% __mp_slave_wait_for_work (libmp.so: mp_parallel_do.s, 593) 10.758% 1.608% 7.913% 10.943% __mp_barrier_nthreads (libmp.so: mp_barrier.c, 271) 16.161% 0.000% 0.000% 0.000% sparse (cg.A: cg.c, 709) ... ... ... ... |
sscompare supports the following SpeedShop experiment types:
usertime
totaltime
pcsamp
bbcounts