Profiling and performance measurement with OpenACC and CUDA code

Profiling and performance counters are available with both PrgEnv-cray and PrgEnv-pgi for OpenACC codes. For CUDA codes, the Nvidia profiling tools may be used.

OpenACC performance measurement and profiling

PrgEnv-cray

Cray perftools can gather profile and counter information when run with OpenACC programs. Compile the executable and instrument it with perftools. This is an example scenario, though other compiler and pat_build flags may also be appropriate:

ftn -h func_trace -o mycode.exe mycode.f90
pat_build -w mycode.exe

Perftools may capture any of many defined accelerator counters, selected by setting PAT_RT_ACCPC. See "man accpc_k20" for the full list of available metrics. Only one metric from the set may be measured per aprun invocation.

The requirements for a batch job are:

module load PrgEnv-cray
module load craype-accel-nvidia35
module load perftools
module unload darshan
export PAT_RT_ACCPC=threads_launched
export CRAY_ACC_DEBUG=1 # <-- set this to trace all the kernel calls to the device
                        # to stderr; see "man intro_openacc" for more info. Levels 1, 2, 3
                        # will increase the level of detail. Level 3 shows information
                        # about how kernels are launched.

# an alternative and simpler perftools approach:
# module load perftools-base perftools-lite-gpu
# module unload darshan
# rebuild the code and run the a.out produced;
# *.rpt and *.ap2 files will be generated automatically

aprun -n N mycode.exe+pat

After the batch job has run, a .xf file or directory ending in t (for MPI codes) will be created. Process the .xf file or directory with pat_report, and a .ap2 file or directory will be created that you can view with Apprentice2 (app2).

pat_report mycode.exe+pat+78082-81t.xf
...
# text output results from pat_report; redirect to a file with "> file.rpt"

Table 2:  Time and Bytes Transferred for Accelerator Regions

    Host |    Host |    Acc | Acc Copy | Acc Copy | Events | Calltree
   Time% |    Time |   Time |       In |      Out |        |  PE=HIDE
         |         |        | (MBytes) | (MBytes) |        |  Thread=HIDE

  100.0% | 117.505 | 49.067 |    29781 |    0.063 |   9574 | Total
|-----------------------------------------------------------------------------
| 100.0% | 117.505 | 49.067 |    29781 |    0.063 |   9574 | cc_triples_restart_
|        |         |        |          |          |        |  cc_triples_
...

# launch the X-window Apprentice2 GUI:
app2 mycode.exe+pat+78082-81t.ap2
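For reference, the requirements above can be collected into a complete batch script. The following is a minimal sketch; the PBS directives, rank count, and executable name are illustrative assumptions, not part of the example above:

#!/bin/bash
#PBS -l nodes=1:ppn=16:xk    # hypothetical GPU-node request
#PBS -l walltime=00:30:00    # hypothetical walltime

cd $PBS_O_WORKDIR
module load PrgEnv-cray craype-accel-nvidia35 perftools
module unload darshan
export PAT_RT_ACCPC=threads_launched
aprun -n 16 ./mycode.exe+pat   # rank count is a placeholder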
PrgEnv-pgi

Profile and trace info for OpenACC kernels is available via environment variables with the PGI environment. The batch job will need:

module load cudatoolkit
module load PrgEnv-pgi
module unload darshan
export PGI_ACC_TIME=1 # profiling; and/or PGI_ACC_NOTIFY=1 or 3 for tracing

aprun -n N mycode.exe

stdout will contain information about OpenACC regions:

main()
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269

Accelerator Kernel Timing data
./laplace2d.c
  laplace
    74: region entered 1000 times
        time(us): total=1,742,193 init=192 region=1,742,001
                  kernels=1,624,325
                  w/o init: total=1,742,001 max=72,263 min=1,666 avg=1,742
        77: kernel launched 1000 times
            grid: [64x1024]  block: [64x4]
            time(us): total=1,624,325 max=1,683 min=1,620 avg=1,624
./laplace2d.c
  laplace
    63: region entered 1000 times
        time(us): total=3,973,944 init=151 region=3,973,793
                  kernels=3,572,204
                  w/o init: total=3,973,793 max=66,883 min=3,899 avg=3,973
        66: kernel launched 1000 times
            grid: [64x1024]  block: [64x4]
            time(us): total=3,435,500 max=4,745 min=3,429 avg=3,435
        70: kernel launched 1000 times
            grid: [1]  block: [256]
            time(us): total=136,704 max=1,384 min=134 avg=136
./laplace2d.c
  laplace
    58: region entered 1 time
        time(us): total=6,259,767 init=469,009 region=5,790,758
                  data=71,063
                  w/o init: total=5,790,758 max=5,790,758 min=5,790,758 avg=5,790,758
total: 6.259794 s
Application 140307 exit codes: 19
Application 140307 resources: utime ~4s, stime ~3s
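As with PrgEnv-cray, these settings can be combined into a complete batch script. This is a minimal sketch; the PBS directives, rank count, and executable name are illustrative assumptions:

#!/bin/bash
#PBS -l nodes=1:ppn=16:xk    # hypothetical GPU-node request
#PBS -l walltime=00:30:00    # hypothetical walltime

cd $PBS_O_WORKDIR
module load cudatoolkit PrgEnv-pgi
module unload darshan
export PGI_ACC_TIME=1        # and/or PGI_ACC_NOTIFY for tracing
aprun -n 16 ./mycode.exe     # rank count is a placeholder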
CUDA performance measurement and profiling

Only one of nvprof or the command-line profiler described below may be used per program invocation.

Nvprof

The Nvidia profiling and tracing tool nvprof is available and can be used with CUDA code. The requirements for using nvprof:

module load cudatoolkit
module unload darshan
export LD_LIBRARY_PATH=$CRAY_CUDATOOLKIT_DIR/lib64:$LD_LIBRARY_PATH
export COMPUTE_PROFILE=0 # or unset
# sample MPI wrapper script for profiling MPI applications with nvprof
# ( aprun -n <ranks> wrap.sh )
$ cat wrap.sh
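The wrap.sh listing is not reproduced here. As a minimal sketch, a wrapper along the following lines writes one nvprof output file per rank; the script body and the use of the ALPS_APP_PE rank variable for naming are illustrative assumptions:

#!/bin/bash
# hypothetical wrap.sh sketch (not the original script):
# run each MPI rank under nvprof and write a separate output file per rank.
# ALPS sets ALPS_APP_PE to the rank index of each PE launched by aprun.
exec nvprof -o profile.rank${ALPS_APP_PE}.nvprof ./mycode.exe

The resulting per-rank .nvprof files can then be imported into NVVP for analysis (see "Using Nvprof with MPI" under "See also").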
This is a sample run.

laplace2d-data> export \
LD_LIBRARY_PATH=/opt/nvidia/cudatoolkit/default/lib64:$LD_LIBRARY_PATH
laplace2d-data> cd $PBS_O_WORKDIR
laplace2d-data> aprun -b -n 1 nvprof laplace2d_accpgi
======== NVPROF is profiling laplace2d_accpgi...
======== Command: laplace2d_accpgi
main()
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 6.712810 s
======== Warning: Application returned non-zero code 19
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   65.24     3.48s    1000    3.48ms    3.47ms    3.49ms  laplace_66_gpu
   31.11     1.66s    1000    1.66ms    1.66ms    1.66ms  laplace_77_gpu
    2.41  128.73ms    1000  128.73us  127.68us  130.33us  laplace_70_gpu_red
    0.72   38.63ms    1001   38.59us    2.53us   36.03ms  [CUDA memcpy DtoH]
    0.51   27.25ms    1128   24.16us    3.74us  182.66us  [CUDA memcpy HtoD]
Application 83077 resources: utime ~5s, stime ~3s

Command-line profiler via environment variables (MPI or serial profiling, for CUDA versions < 9.x)

In addition to the nvprof profiler, the CUDA environment provides a built-in profiler via the CUDA libraries linked into your code. PGI OpenACC code can also be profiled with this method. MPI codes profiled this way may be analyzed with the NVVP tool by following the steps at http://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-multi-nvprof-session (section 2.2.2.3). You can enable the built-in profiler by setting COMPUTE_PROFILE to a non-zero value:

-data> module unload darshan
-data> export COMPUTE_PROFILE=1
nid00031-[IN_JOB]arnoldg@nid00010:-data> aprun -b -n 1 ./laplace2d_acc
main()
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 3.945941 s
Application 140290 resources: utime ~4s, stime ~1s
nid00031-[IN_JOB]arnoldg@nid00010:-data> ls -lt | head -2
total 3876
-rw------- 1 arnoldg bw_staff 236416 Mar 12 13:39 cuda_profile_0.log
-data> more cuda_profile_0.log
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K20X
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff69047ada518
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 53270.656 ] cputime=[ 53558.000 ]
method=[ memcpyHtoD ] gputime=[ 1.600 ] cputime=[ 37.000 ]
method=[ laplace$ck_L64_3 ] gputime=[ 1899.712 ] cputime=[ 26.0 ] occupancy=[ 0.75 ]
method=[ memcpyDtoH ] gputime=[ 3.104 ] cputime=[ 49.000 ]
method=[ laplace$ck_L75_5 ] gputime=[ 1757.760 ] cputime=[ 10.0 ] occupancy=[ 1.00 ]
method=[ laplace$ck_L64_3 ] gputime=[ 1905.536 ] cputime=[ 8.0 ] occupancy=[ 0.75 ]
...

For MPI, a wrapper may be used to assign unique logfiles. Use aprun with the wrapper script:

> cat simpleMPI.sh
#!/bin/bash -login
module load cudatoolkit
THIS_NODE=`hostname`
export COMPUTE_PROFILE_LOG=$THIS_NODE.log
export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CSV=1
export COMPUTE_PROFILE_CONFIG=mynvvp.cfg
./simpleMPI

> cat mynvvp.cfg
streamid
gpustarttimestamp

> grep aprun myjobscript.pbs
aprun -b -n 16 ./simpleMPI.sh

> nvvp *.log   # after the job completes

See also:

Cray man page: "man pat_build"
Cray man page: "man accpc_k20"
Using OpenACC with PGI compilers (pages 12-13)
nvprof documentation at Nvidia
command-line profiler overview at Nvidia
Using Nvprof with MPI