|
Perftools with Cray's timeline view: OpenACC
The latest perftools supports OpenACC accelerated regions (and should support CUDA in a similar fashion). A compute_pi C code was marked up with OpenACC pragmas on its main loop and run with perftools to produce the following views with app2 and reveal.
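For orientation, here is a minimal sketch of what an OpenACC-annotated pi loop can look like. It is not the actual cpi.c used for the run below: the function name compute_pi, its signature, and the choice of a parallel loop with a reduction clause are assumptions made for illustration, and the MPI calls that the real code evidently uses (it is traced with -gmpi and launched on two ranks) are left out to keep the example self-contained.

```c
#include <assert.h>

/* Hypothetical stand-in for the loop in cpi.c (not the original source).
 * Estimates pi as the integral of 4/(1+x^2) over [0,1] using the
 * midpoint rule with n intervals. */
double compute_pi(long n)
{
    assert(n > 0);
    const double h = 1.0 / (double)n;   /* width of one interval */
    double sum = 0.0;

    /* Ask the compiler to offload the loop to the accelerator.
     * The reduction clause combines the per-thread partial sums into sum.
     * A compiler without OpenACC support ignores the pragma and runs the
     * loop serially on the host. */
    #pragma acc parallel loop reduction(+:sum)
    for (long i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);   /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}
```

Calling compute_pi(1000000) from main and printing the result is enough to exercise the loop. When built with cc -h acc,msgs, the compiler messages should indicate whether the loop was turned into an accelerator kernel.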
Here are the steps.
(for building):
module load craype-accel-nvidia35
module unload darshan
module load perftools/6.1.3 # or later
cc -h pl=myprogram_library.pl cpi.c
cc -h acc,msgs cpi.c
pat_build -u -gmpi a.out
|
(for the compute node/job):
module unload darshan # conflicts with perftools
module load perftools
export CRAY_CUDA_MPS=0 # enabling MPS crashes perftools; also make sure CRAY_CUDA_PROXY is unset or 0
export PAT_RT_SUMMARY=0 # disables run-time summarization, which enables the timeline view for app2
# see Cray Performance Measurement and Analysis Tools, section 5.4.11, page 76
export CRAY_ACC_DEBUG=1 # ok simultaneously with perftools
aprun -n 2 -N 1 ./a.out+pat
|
CrayPat/X: Version 6.1.3 Revision 12145 11/18/13 21:56:10
ACC: Transfer 1 items (to acc 8 bytes, to host 0 bytes) from cpi.c:18
ACC: Transfer 2 items (to acc 4 bytes, to host 0 bytes) from cpi.c:18
ACC: Execute kernel main$ck_L18_2 async(auto) from cpi.c:18
ACC: Transfer 2 items (to acc 0 bytes, to host 0 bytes) from cpi.c:18
ACC: Transfer 1 items (to acc 8 bytes, to host 0 bytes) from cpi.c:18
ACC: Transfer 2 items (to acc 4 bytes, to host 0 bytes) from cpi.c:18
ACC: Execute kernel main$ck_L18_2 async(auto) from cpi.c:18
ACC: Transfer 2 items (to acc 0 bytes, to host 0 bytes) from cpi.c:18
ACC: Wait async(auto) from cpi.c:21
ACC: Transfer 1 items (to acc 0 bytes, to host 8 bytes) from cpi.c:21
pi is approximately 3.1415926535897931, Error is 0.0000000000000000
ACC: Wait async(auto) from cpi.c:21
ACC: Transfer 1 items (to acc 0 bytes, to host 8 bytes) from cpi.c:21
Experiment data file written:
/mnt/abc/u/staff/arnoldg/c/cpi/a.out+pat+208328-81t.xf
Application 208328 resources: utime ~0s, stime ~2s, Rss ~120416, inblocks ~3389,
outblocks ~4231
|
(for analysis after the job has run):
pat_report a.out+pat+*.xf
app2 a.out+pat+*.xf
reveal myprogram_library.pl a.out+pat+*.ap2
|