Lecture 12
Beyond OpenMP & MPI: GPU parallelization
Introduction, Architecture, Programming
IKC-MH.57 Introduction to High Performance and Parallel Computing, January 05, 2024
Dr. Cem Özdoğan
Engineering Sciences Department
İzmir Kâtip Çelebi University
Contents
1 Exploring the GPU Architecture
2 Execution and Programming Models
Exploring the GPU Architecture I
CPUs are latency-oriented (minimize the execution time of serial code):
If the CPU has n cores, each core processes 1/n of the elements.
Launching and scheduling threads adds overhead.
GPUs are throughput-oriented (maximize the number of floating-point operations):
GPUs process one element per thread.
Threads are scheduled by the GPU hardware, not by the OS.
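To make the "one element per thread" model concrete, the sketch below maps each GPU thread to a single array element; the kernel name, element count, and launch configuration are assumptions made for illustration, not part of the lecture.

__global__ void square_elements(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one global index per thread
    if (i < n)                                      // guard the last, partially filled block
        out[i] = in[i] * in[i];                     // this thread handles exactly one element
}

// Possible launch, assuming d_in/d_out were set up with cudaMalloc/cudaMemcpy:
// square_elements<<<(n + 255) / 256, 256>>>(d_in, d_out, n);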
Exploring the GPU Architecture II
A Graphics Processing Unit (GPU) is best known as the hardware device used to run graphics-heavy applications.
It is a highly parallel, highly multithreaded multiprocessor optimized for graphics computing and other applications.
1 GPU Programming API: CUDA (Compute Unified Device Architecture): a parallel GPU programming API created by NVIDIA.
NVIDIA GPUs can be programmed with CUDA, an extension of the C language.
API libraries for the C/C++/Fortran languages.
CUDA C is compiled with nvcc.
Numerical libraries: cuBLAS, cuFFT, Magma, ... (a usage sketch follows below).
2 GPU Programming API: OpenGL - an open standard for GPU programming.
3 GPU Programming API: DirectX - a series of Microsoft multimedia programming interfaces.
https://developer.nvidia.com/ Download: CUDA Toolkit, NVIDIA HPC SDK (Software Development Kit)
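As a small illustration of calling one of the numerical libraries listed above, the sketch below uses cuBLAS to compute y = alpha*x + y on the device; the vector length, data values, and variable names are assumptions for the example (link with -lcublas).

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(double);
    double *h_x = (double *)malloc(bytes), *h_y = (double *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_x[i] = 1.0; h_y[i] = 2.0; }

    double *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    double alpha = 3.0;
    cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y on the GPU

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);                    // expect 5.0

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}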
Exploring the GPU Architecture II
SP: Scalar Processor, a 'CUDA core'. Executes one thread.
SM: Streaming Multiprocessor: 32 SPs (or 16, 48, or more).
Fast local 'shared memory' (shared between the SPs): 16 KiB (or 64 KiB).
For example, the NVIDIA Maxwell GeForce GTX 750 Ti:
32 SPs, 20 SMs: 640 CUDA cores.
Parallelization: decomposition into threads.
Memory: shared memory, global memory.
Thread communication: synchronization.
Exploring the GPU Architecture III
Threads are grouped into thread blocks: 128, 192, or 256 threads in a block.
One thread block executes on one SM.
All threads in the block share the 'shared memory'.
Each thread block is divided into scheduled units known as warps.
Blocks form a grid.
Thread ID: unique within a block.
Block ID: unique within the grid.
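Because the thread ID is unique only within its block and the block ID only within the grid, a kernel usually combines the two into one global index; a minimal sketch for the 1D case (kernel name assumed):

#include <stdio.h>

__global__ void print_global_index(void)
{
    // Block ID (unique in the grid) and thread ID (unique in the block)
    // combine into an index that is unique across all threads of the grid.
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, global_id);
}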
Exploring the GPU Architecture IV
A kernel is executed as a grid of thread blocks. All threads share the data memory space.
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution.
Efficiently sharing data through a low-latency shared memory.
Two threads from two different blocks cannot cooperate.
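A common pattern that uses both cooperation mechanisms is staging data in shared memory and synchronizing before using it. The sketch below reverses a small tile within each block; the tile size, kernel name, and launch configuration are assumptions for illustration.

#define TILE 256   // assumed number of threads per block for this sketch

__global__ void reverse_within_block(const int *in, int *out)
{
    __shared__ int tile[TILE];          // low-latency memory shared by the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];          // each thread stages one element

    __syncthreads();                    // wait until every thread of the block has written

    // Each thread now reads an element written by another thread of the same block.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch with blockDim.x == TILE and a grid that exactly covers the input, e.g.:
// reverse_within_block<<<n / TILE, TILE>>>(d_in, d_out);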
Execution and Programming Models I
Computation partitioning (where to run):
Declarations on functions: __host__, __global__, __device__

__global__ void cuda_hello() {
}

Mapping of thread programs to the device: compute<<<gs,bs>>>(<args>)

cuda_hello<<<blocks_per_grid, threads_per_block>>>();

Data partitioning (where does data reside, who may access it, and how?):
Declarations on data: __shared__, __device__, __constant__, ...

__device__ const char *STR = "HELLO WORLD!";

Data management and orchestration:
Copying to/from the host, e.g., cudaMemcpy(h_obj, d_obj, bytes, cudaMemcpyDeviceToHost)

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
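A minimal sketch combining the declaration qualifiers above in one file; the helper names, constant value, and kernel body are assumptions for illustration, not from the lecture.

__constant__ float scale = 2.0f;          // __constant__ data: lives in device constant memory

__device__ float squared(float x)         // __device__ function: callable from device code only
{
    return x * x;
}

__host__ __device__ float halve(float x)  // usable from both CPU and GPU code
{
    return 0.5f * x;
}

__global__ void transform(const float *in, float *out, int n)  // kernel, launched with <<<...>>>
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = halve(scale * squared(in[i]));
}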
Execution and Programming Models II
Concurrency management, e.g.:
__syncthreads()
cudaDeviceSynchronize();
Kernel:
a simple C function
executes on the GPU in parallel, as many times as there are threads
The keyword __global__ tells the compiler nvcc to make a function a kernel (and compile/run it for the GPU instead of the CPU).
These are the functions that you may call from the host side using the CUDA kernel call semantics (<<< ... >>>).
Execution and Programming Models III
Setup and data transfer:
cudaMemcpy: transfer data to and from the GPU (global memory)
cudaMalloc: allocate memory on the GPU (global memory)

double *h_a;                       // Host input vectors
double *d_a;                       // Device input vectors
h_a = (double *)malloc(bytes);     // Allocate memory for each vector on the host
cudaMalloc(&d_a, bytes);           // Allocate memory for each vector on the GPU
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);       // Copy data from host array h_a to device array d_a
add_vectors<<<blk_in_grid, thr_per_blk>>>(d_a, d_b, d_c);  // Execute the kernel
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);       // Copy data from device array d_c to host array h_c

The GPU is the 'device', the CPU is the 'host'. They do not share memory!
The HOST launches a kernel that executes on the DEVICE.
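The snippet above launches an add_vectors kernel that is not defined on the slide; a plausible definition, together with the matching cleanup calls, could look like the sketch below (d_b, d_c, h_b, h_c are assumed to be set up the same way as d_a and h_a).

__global__ void add_vectors(const double *a, const double *b, double *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one vector element per thread
    c[i] = a[i] + b[i];                             // assumes blk_in_grid * thr_per_blk covers the vectors
}

// Once the result has been copied back, device and host memory are released:
cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);
free(h_a);      free(h_b);      free(h_c);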
Execution and Programming Models IV
Threads and blocks have IDs:
So each thread can decide what data to work on.
Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
Thread ID: 1D, 2D, or 3D (threadIdx.x, threadIdx.y, threadIdx.z)
This simplifies memory addressing when processing multi-dimensional data.
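For 2D data, the block and thread IDs combine naturally into row and column indices; a minimal sketch (matrix layout, names, and launch configuration are assumptions for illustration):

__global__ void scale_matrix(float *m, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x covers the columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y covers the rows

    if (row < height && col < width)
        m[row * width + col] *= factor;               // row-major addressing
}

// Possible launch with 2D blocks and a 2D grid:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale_matrix<<<grid, block>>>(d_m, width, height, 2.0f);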
The compiler nvcc takes a .cu program as input and produces:
code for the host processor (CPU), compiled by the native C compiler;
code for the device processor (GPU), compiled by the nvcc compiler.
Execution and Programming Models V - Hello World I
CUDA Code:

#include <stdio.h>
#include <unistd.h>

__device__ const char *STR = "HELLO WORLD!";
const int STR_LENGTH = 12;

__global__ void cuda_hello() {
    // blockIdx.x  : block index within the grid in the x-direction
    // threadIdx.x : thread index within the block
    // blockDim.x  : number of threads in a block
    printf("Hello World from GPU! (%d,%d): %c ThreadID %d\n",
           blockIdx.x, threadIdx.x, STR[threadIdx.x % STR_LENGTH],
           (threadIdx.x + blockIdx.x * blockDim.x));
}

int main() {
    printf("Hello World from CPU!\n");
    sleep(2);
    int threads_per_block = 12;
    int blocks_per_grid = 2;
    cuda_hello<<<blocks_per_grid, threads_per_block>>>();
    cudaDeviceSynchronize();  /* Halt host thread execution on the CPU until the device
                                 has finished processing all previously requested tasks */
    return 0;
}
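The program above is an ordinary C file with one kernel; saved as, say, hello.cu (the filename is an assumption), it can be compiled and run with nvcc, for example:

nvcc hello.cu -o hello && ./hello

The cudaDeviceSynchronize() call is what makes the host wait for the kernel and flushes the GPU printf output before the program exits.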
Execution and Programming Models VI - Hello World I
Execution and Programming Models VII - Hello World II
CUDA Code:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define N 720           // number of computations
#define GRID_D1 20      // constants for grid and block sizes
#define GRID_D2 3
#define BLOCK_D1 12
#define BLOCK_D2 1
#define BLOCK_D3 1

__global__ void hello(void)  // this is the kernel function called for each thread
{
    // CUDA variables {threadIdx, blockIdx, blockDim, gridDim} determine a unique thread ID
    int myblock = blockIdx.x + blockIdx.y * gridDim.x;       // id of the block
    int blocksize = blockDim.x * blockDim.y * blockDim.z;    // size of each block
    int subthread = threadIdx.z * (blockDim.x * blockDim.y)
                  + threadIdx.y * blockDim.x + threadIdx.x;  // id of the thread in a given block
    int idx = myblock * blocksize + subthread;               // overall id/index of the thread
    int nthreads = blocksize * gridDim.x * gridDim.y;        // total # of threads
    int chunk = 20;                                          // vary this value to see the changes at the output
    if (idx < chunk || idx > nthreads - chunk) {             // only print first and last chunks of threads
        if (idx < N) {
            printf("Hello world! My block index is (%d,%d) [Grid dims=(%d,%d)], "
                   "3D-thread index within block=(%d,%d,%d) => thread index=%d\n",
                   blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
                   threadIdx.x, threadIdx.y, threadIdx.z, idx);
        }
        else {
            printf("Hello world! My block index is (%d,%d) [Grid dims=(%d,%d)], "
                   "3D-thread index within block=(%d,%d,%d) => thread index=%d "
                   "[### this thread would not be used for N=%d ###]\n",
                   blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
                   threadIdx.x, threadIdx.y, threadIdx.z, idx, N);
        }
    }
}
Execution and Programming Models VIII - Hello World II
int main(int argc, char **argv)
{
    // objects containing the block and grid info
    const dim3 blockSize(BLOCK_D1, BLOCK_D2, BLOCK_D3);
    const dim3 gridSize(GRID_D1, GRID_D2, 1);
    int nthreads = BLOCK_D1 * BLOCK_D2 * BLOCK_D3 * GRID_D1 * GRID_D2;  // total # of threads
    if (nthreads < N) {
        printf("\n============ NOT ENOUGH THREADS TO COVER N=%d ===============\n\n", N);
    }
    else {
        printf("Launching %d threads (N=%d)\n", nthreads, N);
    }
    hello<<<gridSize, blockSize>>>();  // launch the kernel on the specified grid of thread blocks
    cudaError_t cudaerr = cudaDeviceSynchronize();  /* Need to flush prints; otherwise none of the prints
                                                       from within the kernel will show up, as program
                                                       exit does not flush the print buffer */
    if (cudaerr) {
        printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(cudaerr));
    }
    else {
        printf("kernel launch success!\n");
    }
    printf("That's all!\n");
    return 0;
}
Execution and Programming Models IX - Hello World II