This guide describes how to use NVIDIA Nsight Systems
on Blue Waters. There are two reasons we created this guide: (1) NVIDIA
Nsight Systems is not installed on Blue Waters, and (2) its requirements are
not satisfied by the software available on Blue Waters. As a result,
one cannot simply upload NVIDIA Nsight Systems to Blue Waters and run
it. This guide shows how one can place NVIDIA Nsight Systems inside a
Docker container and then use it on Blue Waters with the help of an
HPC-oriented container solution called Shifter.
This part of the guide describes the steps necessary to prepare a
Docker container with NVIDIA Nsight Systems. Because Docker is not
available on Blue Waters, these steps have to be carried out on any other
system with Docker that you have access to.
NVIDIA Nsight Systems: download NVIDIA Nsight Systems for Linux from the NVIDIA Developer website and unpack it into a directory called NsightSystems.
Docker: download and install Docker for your operating system.
To build a Docker container with NVIDIA Nsight Systems to be used with Shifter on Blue Waters,
we have to use a base image that satisfies the requirements (and
limitations) of all components: NVIDIA Nsight Systems, Shifter, and Blue
Waters. In particular, this means that the base Docker image has to
have:
glibc 2.15 or newer (for NVIDIA Nsight Systems)
glibc that supports the version of the Linux kernel installed on Blue Waters, which is version 3.0.101 .
The first requirement above means that one cannot use a container
image of an operating system that is too old. For example, NVIDIA Nsight
Systems will work on Ubuntu Precise (12.04), which provides glibc 2.15, but it will not work on Ubuntu Lucid (10.04), which has glibc 2.11.
The second requirement is trickier to check, as one has to make sure that the glibc
provided by the operating system in the container supports the version
of the Linux kernel used on Blue Waters. This support is fixed at
the time glibc is compiled (every glibc build has a minimum supported
kernel version), so one has to check it manually. Continuing with the
example of Ubuntu-based Docker images, images based on Ubuntu Artful (17.10), Bionic
(18.04), and later will not work on Blue Waters because their glibc is
built with a minimum kernel version newer than 3.0.101.
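One way to verify both requirements for a candidate base image is to inspect its glibc directly. The sketch below prints the glibc version and the minimum kernel version that libc was built for in an Ubuntu Xenial image; the libc.so.6 path and the need to install the file utility are assumptions that may differ for other distributions:
# check_base_image.sh
# print the glibc version shipped with the candidate base image
docker run --rm ubuntu:16.04 ldd --version | head -n 1
# print the minimum kernel version this glibc was built for: the
# "for GNU/Linux X.Y.Z" part of the output must not exceed 3.0.101
docker run --rm ubuntu:16.04 bash -c 'apt-get update -qq && apt-get install -y -qq file >/dev/null && file -L /lib/x86_64-linux-gnu/libc.so.6'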
Given all of the above requirements, let's choose a Docker image based on Ubuntu Xenial (16.04).
The simplest Dockerfile for NVIDIA Nsight Systems based on Ubuntu Xenial can be written like so:
# Dockerfile.simple
FROM ubuntu:16.04
COPY NsightSystems /opt/NsightSystems
RUN /opt/NsightSystems/Target-x86_64/x86_64/setup_nsys_env
ENV PATH="/opt/NsightSystems/Target-x86_64/x86_64:$PATH"
Here we copy the folder NsightSystems with NVIDIA Nsight Systems to /opt/NsightSystems
inside the container. This Dockerfile produces an image of
approximately 330 MB in size. It is, of course, possible to reduce its
size by removing unnecessary files and directories (via a 2-stage
build), to install additional programs and tools, or even to combine the two
approaches (use a 2-stage build and install additional tools).
However, we will leave these exercises to the interested reader. Instead,
we recommend installing the strace tool, which can help you identify, in case of a failure, the files your application is trying to access:
# Dockerfile.strace
FROM ubuntu:16.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends strace && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
COPY NsightSystems /opt/NsightSystems
RUN /opt/NsightSystems/Target-x86_64/x86_64/setup_nsys_env
ENV PATH="/opt/NsightSystems/Target-x86_64/x86_64:$PATH"
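With strace in the image, a failing run can be traced to see which files the application attempts (and fails) to open. A minimal sketch, where ./application and its arguments stand for your own executable:
strace -f -e trace=open,openat -o strace.log ./application --with --some --args
grep ENOENT strace.log   # paths the application looked for but could not find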
Let us now consider the application that we want to profile with Nsight Systems.
If the application resides in the same Docker image as Nsight
Systems, all the libraries it is linked against are in
that image and we don't have to worry about them. However, if the
application uses NVIDIA GPUs, we have to make sure that, when executed on
Blue Waters, it has access to the NVIDIA drivers installed on the
system. To do that, we have to bind-mount the
directory with the drivers on Blue Waters to a directory in the container.
Because we will be using Shifter, this means that we have to create that
directory in our image first. We can do so by adding, for example, this
line to the Dockerfile:
RUN mkdir /opt/nvidia-drivers
If the application we wish to interact with was built on Blue Waters,
the libraries it uses are also on Blue Waters. In such a case, we have
to make all directories containing these libraries accessible from
within the Shifter container, and again, to do that we have to
bind-mount them into the container. If your application uses any of
Blue Waters' modules, we can use the fact that modules reside in the /opt folder: we can bind-mount the /opt directory on Blue Waters to a directory, say, /opt/bluewaters, inside the container. Again, we have to create a directory in our container for that purpose:
RUN mkdir /opt/bluewaters
If your application uses libraries that are stored outside of Blue Waters' /opt directory, create the necessary number of empty directories under /opt/other like so (the example below creates 10 directories; bash -c is needed because Docker's default shell, /bin/sh, does not perform brace expansion):
RUN bash -c 'mkdir -p /opt/other/{1..10}'
Two quick notes:
- NVIDIA drivers are located in one of the subdirectories of /opt. Therefore, it is sufficient to create the /opt/bluewaters directory regardless of whether the application we wish to interact with resides in the container or on the Blue Waters file system.
- Although we could bind-mount the /opt directory on Blue Waters to the /opt directory in the container, this would ultimately defeat the purpose of the container by making all libraries installed on Blue Waters accessible from within the container. Therefore, we choose to bind-mount to the /opt/bluewaters directory in the container.
Finally, the Dockerfile that we will use to build our Docker image looks like this:
# Dockerfile.strace_and_libs
FROM ubuntu:16.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends strace && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN bash -c 'mkdir -p /opt/bluewaters /opt/other/{1..10}'
COPY NsightSystems /opt/NsightSystems
RUN /opt/NsightSystems/Target-x86_64/x86_64/setup_nsys_env
ENV PATH="/opt/NsightSystems/Target-x86_64/x86_64:$PATH"
To build a Docker image, execute the following (you may need sudo or root privileges on your Linux system):
docker build -t dockerhub-username/nsight-and-app:tag -f Dockerfile.strace_and_libs .
and push it to Docker Hub:
docker login
docker push dockerhub-username/nsight-and-app:tag
Docker is not available on Blue Waters. To use Docker images on Blue
Waters we have to use a container solution that (a) works on Blue Waters
and (b) understands Docker images. There are two solutions that satisfy
these criteria: Shifter and Singularity. In this guide we will focus on
using Shifter, a container solution that is designed specifically for
HPC systems.
To start a Shifter job on Blue Waters, all we have to do is add a generic resource request for shifter16
when we submit our job. For example, to request an interactive Shifter
job using 1 XK node (each XK node is equipped with 1 NVIDIA GPU) with a
time limit of 1 hour, one would execute:
qsub -I -l nodes=1:ppn=16:xk -l walltime=01:00:00 -l gres=shifter16 -N nsight-job -q normal
Once our job goes through the queue and starts, we have to load the shifter module and use it to pull our image from Docker Hub:
module load shifter
shifterimg pull dockerhub-username/nsight-and-app:tag
If our application (whether on Blue Waters or in a container) uses NVIDIA GPUs, we have to set the CUDA_VISIBLE_DEVICES
environment variable to a zero-based, comma-separated list of the GPUs on each node that
we want the job to use. Since XK nodes on Blue Waters have only one
NVIDIA GPU, the only acceptable value is 0:
export CUDA_VISIBLE_DEVICES=0
To run any program on Blue Waters' compute nodes, we have to use the aprun command (which is analogous to mpirun and srun on other HPC systems). When running Shifter jobs on Blue Waters, it is important to use aprun with the -b flag, which tells aprun not to copy the executable to the compute nodes.
We also have to specify the total (-n) and per-node (-N) number of MPI ranks we'd like our application to use,
as well as the image that we have just pulled from Docker Hub via the --image flag:
aprun -b -n 1 -N 1 \
shifter --image=dockerhub-username/nsight-and-app:tag ...
Note that the -N 1 option above is important as it instructs aprun to execute only 1 instance of the shifter command on a compute node. Strictly speaking, this option is not necessary if you use a single node (nodes=1 in the qsub call) and execute only one MPI task (-n 1 in the aprun call), but it is still a good idea to specify -N 1
so that, if/when you start working with MPI-enabled applications, you
don't accidentally start more than one shifter process on a node.
To bind-mount a directory on Blue Waters to a directory in our container, we have to use the --volume (or -V) flag, for example, like so:
aprun -b -n 1 -N 1 \
shifter --image=... -V /path/on/bw:/directory/in/container ...
Considering the /opt to /opt/bluewaters mapping mentioned above, the /path/on/bw directory corresponds to /opt and /directory/in/container to /opt/bluewaters. There are two caveats, however:
- we have to prefix the bind-mounted directories on Blue Waters with /dsl (so that /opt becomes /dsl/opt, for example)
- we cannot use symbolic links when bind-mounting the directories.
With these caveats in mind, the command becomes:
aprun -b -n 1 -N 1 \
shifter --image=dockerhub-username/nsight-and-app:tag \
-V /dsl/opt:/opt/bluewaters \
...
With the above command, we'll be able to access any directory /opt/dir on Blue Waters as /opt/bluewaters/dir
in the container. When the number of directories to be bind-mounted is
large, it makes sense to prepare a script that executes the aprun command:
# call_aprun.sh
aprun -b -n 1 -N 1 \
-- \
shifter --image=dockerhub-username/nsight-and-app:tag \
-V /dsl/opt:/opt/bluewaters \
-V /dsl/another/directory:/opt/other/1 \
...
Our application, however, does not know about this mapping between the directories and will still look for libraries under /opt/dir. To solve this problem, we have to add the corresponding container-side directories to the LD_LIBRARY_PATH
environment variable so that the application can find the libraries stored in them.
This, in turn, means that we have to prepare a script that Shifter will
execute. The simplest example of such a script is as follows:
# execute_nsys.sh
LD_LIBRARY_PATH="/opt/bluewaters/cray/nvidia/390.46-1_1.0502.2481.1.1.gem/lib64:$LD_LIBRARY_PATH" # for NVIDIA drivers
LD_LIBRARY_PATH="/opt/bluewaters/nvidia/cudatoolkit9.1/9.1.85_3.10-1.0502.df1cc54.3.1/lib64:$LD_LIBRARY_PATH" # for CUDA Toolkit 9.1
LD_LIBRARY_PATH="/opt/other/1:$LD_LIBRARY_PATH" # for "some other library"
export LD_LIBRARY_PATH
cd /to/where/necessary
nsys profile ./application --with --some --args
Once we have the above script (execute_nsys.sh), we can execute it on compute nodes using the call_aprun.sh script:
# call_aprun.sh
aprun -b -n 1 -N 1 \
-- \
shifter --image=dockerhub-username/nsight-and-app:tag \
-V /dsl/opt:/opt/bluewaters \
-V /dsl/another/directory:/opt/other/1 \
-- \
bash execute_nsys.sh
Note that while the call_aprun.sh script exists purely for our convenience, we must use the execute_nsys.sh script in order to set LD_LIBRARY_PATH.
Of course, we could combine all the scripts into a single job
submission script. However, given the nature of the tasks NVIDIA Nsight
Systems is used for, one would most likely use it in an interactive
job, and, therefore, we leave the exercise of creating a batch job
script to the interested user (advice: execute aprun ... bash -i and then set up LD_LIBRARY_PATH as shown in 2.4).
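For reference, a minimal sketch of that interactive approach, using the same placeholder image name and bind mounts as above:
aprun -b -n 1 -N 1 \
-- \
shifter --image=dockerhub-username/nsight-and-app:tag \
-V /dsl/opt:/opt/bluewaters \
-- \
bash -i
# inside the resulting interactive shell, set LD_LIBRARY_PATH as in
# execute_nsys.sh and then run: nsys profile ./application --with --some --args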