Wednesday, June 28, 2017
GPU Architecture and Concepts
Presenter: John Stone, University of Illinois at Urbana-Champaign
Abstract:
- Basic GPU hardware intro (3 slides)
- Heterogeneous computing concepts (5 slides)
- GPU-accelerated application development strategies (10 slides)
  - GPU-accelerated HPC application development/optimization cycle
  - GPUs in the context of distributed memory message passing codes
  - GPU-accelerated libraries, frameworks, domain-specific languages, ...
  - Directive-based parallelism, e.g., w/ OpenACC
  - Programmer-provided explicit parallelism: CUDA, OpenCL, etc.
  - Overview of advanced technologies: C++11, NVRTC, parallel STL, …
  - Overview of profiling and debugging approaches
- GPU hardware introduction, trends, futures (12 slides)
  - Throughput-oriented hardware, latency hiding, occupancy, and their relation to SIMT concepts
  - GPU memory systems (on-board/on-chip, registers, caches, coalescing)
  - Computational thinking in the context of GPU hardware, e.g., "scatter" vs. "gather" algorithms and their application to GPUs, use of data privatization schemes (e.g., for histograms)
  - GPU arithmetic hardware capabilities, mixed precision, special functions
  - Host-GPU, GPU P2P, and RDMA concepts/issues:
    - General concepts
    - GPU Unified Memory
    - Interactions with host NUMA
    - Zero-copy approaches, pinned memory, GPUDirect RDMA, P2P and NVLink, …
OpenACC
Presenter: Justin Luitjens, NVIDIA
Abstract:
- Why Use OpenACC?
- Basic Profiling with PGProf
- Parallelizing Loops with OpenACC
- Controlling Data Movement
- Simple Loop Optimizations
CUDA Programming
Presenter: John Stone, University of Illinois at Urbana-Champaign
Abstract:
- Part 1
  - Introduction to the CUDA programming model, key abstractions and terminology (5 slides)
  - CUDA thread model, differences w/ other programming systems (5 slides)
  - CUDA resource management intro (malloc/free/memcpy, etc.) (5 slides)
  - Mapping parallelism to grids/blocks/warps/threads, indexing work by thread IDs (5 slides)
  - Anatomy of basic CUDA kernels, comparison with serial code, loop nests, and so on; work through simple examples (10 slides)
- Part 2
  - Execution of grids/blocks/warps/threads, divergence, etc. (5 slides)
  - Memory-bandwidth-bound kernels vs. arithmetic-bound kernels, concepts and strategies (5 slides)
  - Memory systems, performance traits and requirements, optimizations (10 slides)
    - Global memory, coalescing, SOA vs. AOS, broadcasts of reads to multiple threads, use of vector intrinsic types for higher bandwidth
    - Shared memory, bank conflicts, use for AOS-to-SOA conversion
    - Collective operations and synchronization basics, use of shared memory
    - Other memory systems: constant cache, 1D/2D/3D textures, host-mapped memory over PCIe/NVLink, peer-to-peer memory accesses, and the like
  - Atomic operations
  - Quick overview of GPU occupancy, register usage, launch configurations, and other kernel tuning concepts (5 slides)
  - Exciting new features in CUDA 9 (5 slides)
GPU Application Optimization and Scaling with Profiling and Debugging
Presenter: Fernanda Foertter, Oak Ridge Leadership Computing Facility
Abstract:
This session will cover lessons learned from porting applications to GPU-accelerated machines such as Titan and Blue Waters. Best practices include profile-driven development, what to look for in a profile, and analysis of data structures and call stacks. The session will also look ahead to future multi-GPU architectures such as Summit and how data movement will behave on such systems.