Lecture 12
Beyond OpenMP & MPI: GPU parallelization
Introduction, Architecture, Programming
IKC-MH.57 Introduction to High Performance and Parallel Computing, January 05, 2024
Dr. Cem Özdoğan
Engineering Sciences Department
İzmir Kâtip Çelebi University
Contents
1 Exploring the GPU Architecture
2 Execution and Programming Models
Exploring the GPU Architecture I
CPUs are latency-oriented (minimize the execution time of serial code):
If the CPU has n cores, each core processes 1/n of the elements.
Launching and scheduling threads adds overhead.
GPUs are throughput-oriented (maximize the number of floating-point operations):
GPUs process one element per thread.
Threads are scheduled by the GPU hardware, not by the OS.
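To make the "one element per thread" model concrete, the sketch below maps each GPU thread to a single array element; the kernel name, element count, and launch configuration are assumptions made for illustration, not part of the lecture.

__global__ void square_elements(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one global index per thread
    if (i < n)                                      // guard the last, partially filled block
        out[i] = in[i] * in[i];                     // this thread handles exactly one element
}

// Possible launch, assuming d_in/d_out were set up with cudaMalloc/cudaMemcpy:
// square_elements<<<(n + 255) / 256, 256>>>(d_in, d_out, n);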
Exploring the GPU Architecture II
A Graphics Processing Unit (GPU) is best known as the hardware device used to run graphics-heavy applications.
It is a highly parallel, highly multithreaded multiprocessor optimized for graphics computing and other applications.
1 GPU Programming API: CUDA (Compute Unified Device Architecture): a parallel GPU programming API created by NVIDIA.
NVIDIA GPUs can be programmed with CUDA, an extension of the C language.
API libraries for the C/C++/Fortran languages.
CUDA C is compiled with nvcc.
Numerical libraries: cuBLAS, cuFFT, Magma, ... (a usage sketch follows below).
2 GPU Programming API: OpenGL - an open standard for GPU programming.
3 GPU Programming API: DirectX - a series of Microsoft multimedia programming interfaces.
https://developer.nvidia.com/ Download: CUDA Toolkit, NVIDIA HPC SDK (Software Development Kit)
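As a small illustration of calling one of the numerical libraries listed above, the sketch below uses cuBLAS to compute y = alpha*x + y on the device; the vector length, data values, and variable names are assumptions for the example (link with -lcublas).

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(double);
    double *h_x = (double *)malloc(bytes), *h_y = (double *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_x[i] = 1.0; h_y[i] = 2.0; }

    double *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    double alpha = 3.0;
    cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y on the GPU

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);                    // expect 5.0

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}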
Exploring the GPU Architecture II
SP: Scalar Processor, a 'CUDA core'. Executes one thread.
SM: Streaming Multiprocessor: 32 SPs (or 16, 48, or more).
Fast local 'shared memory' (shared between the SPs): 16 KiB (or 64 KiB).
For example, the NVIDIA Maxwell GeForce GTX 750 Ti:
32 SPs, 20 SMs: 640 CUDA cores.
Parallelization: decomposition into threads.
Memory: shared memory, global memory.
Thread communication: synchronization.
Exploring the GPU Architecture III
Threads are grouped into thread blocks: 128, 192, or 256 threads in a block.
One thread block executes on one SM.
All threads in the block share the 'shared memory'.
Each thread block is divided into scheduled units known as warps.
Blocks form a grid.
Thread ID: unique within a block.
Block ID: unique within the grid.
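Because the thread ID is unique only within its block and the block ID only within the grid, a kernel usually combines the two into one global index; a minimal sketch for the 1D case (kernel name assumed):

#include <stdio.h>

__global__ void print_global_index(void)
{
    // Block ID (unique in the grid) and thread ID (unique in the block)
    // combine into an index that is unique across all threads of the grid.
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, global_id);
}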
Exploring the GPU Architecture IV
A kernel is executed as a grid of thread blocks. All threads share the data memory space.
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution.
Efficiently sharing data through a low-latency shared memory.
Two threads from two different blocks cannot cooperate.
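A common pattern that uses both cooperation mechanisms is staging data in shared memory and synchronizing before using it. The sketch below reverses a small tile within each block; the tile size, kernel name, and launch configuration are assumptions for illustration.

#define TILE 256   // assumed number of threads per block for this sketch

__global__ void reverse_within_block(const int *in, int *out)
{
    __shared__ int tile[TILE];          // low-latency memory shared by the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];          // each thread stages one element

    __syncthreads();                    // wait until every thread of the block has written

    // Each thread now reads an element written by another thread of the same block.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch with blockDim.x == TILE and a grid that exactly covers the input, e.g.:
// reverse_within_block<<<n / TILE, TILE>>>(d_in, d_out);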
Execution and Programming Models I
Computation partitioning (where to run):
Declarations on functions: __host__, __global__, __device__

__global__ void cuda_hello() {
}

Mapping of thread programs to the device: compute<<<gs,bs>>>(<args>)

cuda_hello<<<blocks_per_grid, threads_per_block>>>();

Data partitioning (where does data reside, who may access it, and how?):
Declarations on data: __shared__, __device__, __constant__, ...

__device__ const char *STR = "HELLO WORLD!";

Data management and orchestration:
Copying to/from the host, e.g., cudaMemcpy(h_obj, d_obj, bytes, cudaMemcpyDeviceToHost)

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
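A minimal sketch combining the declaration qualifiers above in one file; the helper names, constant value, and kernel body are assumptions for illustration, not from the lecture.

__constant__ float scale = 2.0f;          // __constant__ data: lives in device constant memory

__device__ float squared(float x)         // __device__ function: callable from device code only
{
    return x * x;
}

__host__ __device__ float halve(float x)  // usable from both CPU and GPU code
{
    return 0.5f * x;
}

__global__ void transform(const float *in, float *out, int n)  // kernel, launched with <<<...>>>
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = halve(scale * squared(in[i]));
}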
Execution and Programming Models II
Concurrency management, e.g.:
__syncthreads()
cudaDeviceSynchronize();
Kernel:
a simple C function
executes on the GPU in parallel, as many times as there are threads
The keyword __global__ tells the compiler nvcc to make a function a kernel (and compile/run it for the GPU instead of the CPU).
These are the functions that you may call from the host side using the CUDA kernel call semantics (<<< ... >>>).
Execution and Programming Models III
Setup and data transfer:
cudaMemcpy: transfer data to and from the GPU (global memory)
cudaMalloc: allocate memory on the GPU (global memory)

double *h_a;                       // Host input vectors
double *d_a;                       // Device input vectors
h_a = (double *)malloc(bytes);     // Allocate memory for each vector on the host
cudaMalloc(&d_a, bytes);           // Allocate memory for each vector on the GPU
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);       // Copy data from host array h_a to device array d_a
add_vectors<<<blk_in_grid, thr_per_blk>>>(d_a, d_b, d_c);  // Execute the kernel
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);       // Copy data from device array d_c to host array h_c

The GPU is the 'device', the CPU is the 'host'. They do not share memory!
The HOST launches a kernel that executes on the DEVICE.
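The snippet above launches an add_vectors kernel that is not defined on the slide; a plausible definition, together with the matching cleanup calls, could look like the sketch below (d_b, d_c, h_b, h_c are assumed to be set up the same way as d_a and h_a).

__global__ void add_vectors(const double *a, const double *b, double *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one vector element per thread
    c[i] = a[i] + b[i];                             // assumes blk_in_grid * thr_per_blk covers the vectors
}

// Once the result has been copied back, device and host memory are released:
cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);
free(h_a);      free(h_b);      free(h_c);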
Execution and Programming Models IV
Threads and blocks have IDs:
So each thread can decide what data to work on.
Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
Thread ID: 1D, 2D, or 3D (threadIdx.x, threadIdx.y, threadIdx.z)
This simplifies memory addressing when processing multi-dimensional data.
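For 2D data, the block and thread IDs combine naturally into row and column indices; a minimal sketch (matrix layout, names, and launch configuration are assumptions for illustration):

__global__ void scale_matrix(float *m, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x covers the columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y covers the rows

    if (row < height && col < width)
        m[row * width + col] *= factor;               // row-major addressing
}

// Possible launch with 2D blocks and a 2D grid:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale_matrix<<<grid, block>>>(d_m, width, height, 2.0f);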
The compiler nvcc takes a .cu program as input and produces:
code for the host processor (CPU), compiled by the native C compiler;
code for the device processor (GPU), compiled by the nvcc compiler.
Execution and Programming Models V - Hello World I
CUDA Code:

#include <stdio.h>
#include <unistd.h>

__device__ const char *STR = "HELLO WORLD!";
const int STR_LENGTH = 12;

__global__ void cuda_hello() {
    // blockIdx.x  : block index within the grid in the x-direction
    // threadIdx.x : thread index within the block
    // blockDim.x  : number of threads in a block
    printf("Hello World from GPU! (%d,%d): %c ThreadID %d\n",
           blockIdx.x, threadIdx.x, STR[threadIdx.x % STR_LENGTH],
           (threadIdx.x + blockIdx.x * blockDim.x));
}

int main() {
    printf("Hello World from CPU!\n");
    sleep(2);
    int threads_per_block = 12;
    int blocks_per_grid = 2;
    cuda_hello<<<blocks_per_grid, threads_per_block>>>();
    cudaDeviceSynchronize();  /* Halt host thread execution on the CPU until the device
                                 has finished processing all previously requested tasks */
    return 0;
}
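The program above is an ordinary C file with one kernel; saved as, say, hello.cu (the filename is an assumption), it can be compiled and run with nvcc, for example:

nvcc hello.cu -o hello && ./hello

The cudaDeviceSynchronize() call is what makes the host wait for the kernel and flushes the GPU printf output before the program exits.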
Execution and Programming Models VI - Hello World I
Execution and Programming Models VII - Hello World II
CUDA Code:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define N 720           // number of computations
#define GRID_D1 20      // constants for grid and block sizes
#define GRID_D2 3
#define BLOCK_D1 12
#define BLOCK_D2 1
#define BLOCK_D3 1

__global__ void hello(void)  // this is the kernel function called for each thread
{
    // CUDA variables {threadIdx, blockIdx, blockDim, gridDim} determine a unique thread ID
    int myblock = blockIdx.x + blockIdx.y * gridDim.x;       // id of the block
    int blocksize = blockDim.x * blockDim.y * blockDim.z;    // size of each block
    int subthread = threadIdx.z * (blockDim.x * blockDim.y)
                  + threadIdx.y * blockDim.x + threadIdx.x;  // id of the thread in a given block
    int idx = myblock * blocksize + subthread;               // overall id/index of the thread
    int nthreads = blocksize * gridDim.x * gridDim.y;        // total # of threads
    int chunk = 20;                                          // vary this value to see the changes at the output
    if (idx < chunk || idx > nthreads - chunk) {             // only print first and last chunks of threads
        if (idx < N) {
            printf("Hello world! My block index is (%d,%d) [Grid dims=(%d,%d)], "
                   "3D-thread index within block=(%d,%d,%d) => thread index=%d\n",
                   blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
                   threadIdx.x, threadIdx.y, threadIdx.z, idx);
        }
        else {
            printf("Hello world! My block index is (%d,%d) [Grid dims=(%d,%d)], "
                   "3D-thread index within block=(%d,%d,%d) => thread index=%d "
                   "[### this thread would not be used for N=%d ###]\n",
                   blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
                   threadIdx.x, threadIdx.y, threadIdx.z, idx, N);
        }
    }
}
Execution and Programming Models VIII - Hello World II
int main(int argc, char **argv)
{
    // objects containing the block and grid info
    const dim3 blockSize(BLOCK_D1, BLOCK_D2, BLOCK_D3);
    const dim3 gridSize(GRID_D1, GRID_D2, 1);
    int nthreads = BLOCK_D1 * BLOCK_D2 * BLOCK_D3 * GRID_D1 * GRID_D2;  // total # of threads
    if (nthreads < N) {
        printf("\n============ NOT ENOUGH THREADS TO COVER N=%d ===============\n\n", N);
    }
    else {
        printf("Launching %d threads (N=%d)\n", nthreads, N);
    }
    hello<<<gridSize, blockSize>>>();  // launch the kernel on the specified grid of thread blocks
    cudaError_t cudaerr = cudaDeviceSynchronize();  /* Need to flush prints; otherwise none of the prints
                                                       from within the kernel will show up, as program
                                                       exit does not flush the print buffer */
    if (cudaerr) {
        printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(cudaerr));
    }
    else {
        printf("kernel launch success!\n");
    }
    printf("That's all!\n");
    return 0;
}
Execution and Programming Models IX - Hello World II