Exploring the GPU Architecture

[Figure: CPU vs. GPU architecture]
  • CPUs are latency-oriented: minimize the execution time of serial code.
  • If the CPU has n cores, each core processes 1/n of the elements.
  • Launching and scheduling OS threads adds overhead.
  • GPUs are throughput-oriented: maximize the number of floating-point operations per second.
  • GPUs process one element per thread.
  • Threads are scheduled by the GPU hardware, not by the OS.
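The "one element per thread" model above can be sketched as a minimal CUDA kernel (a hedged example; the kernel name and the global-index formula for a 1-D launch are illustrative):

```cuda
// One GPU thread handles exactly one array element.
__global__ void add_one(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)          // guard: the launched grid may be larger than n
        a[i] += 1.0f;
}
```

The CPU equivalent would be a serial loop over all n elements (or n cores each looping over 1/n of them); on the GPU the loop disappears and the hardware schedules one thread per element.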
A Graphics Processing Unit (GPU) is best known as the hardware device used to run graphics-heavy applications. It is a highly parallel, highly multithreaded multiprocessor optimized for graphics computation and other data-parallel applications.
1 GPU Programming API: CUDA (Compute Unified Device Architecture): a parallel GPU programming API created by NVIDIA.
  • NVIDIA GPUs can be programmed with CUDA, an extension of the C language.
  • API libraries for the C/C++/Fortran languages.
  • CUDA C is compiled with nvcc.
  • Numerical libraries: cuBLAS, cuFFT, Magma, ...
2 GPU Programming API: OpenGL - an open standard for GPU programming.
3 GPU Programming API: DirectX - a series of Microsoft multimedia programming interfaces.
From https://developer.nvidia.com/ download the CUDA Toolkit and the NVIDIA HPC SDK (Software Development Kit).
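A complete "hello world" of CUDA C, compiled with nvcc as mentioned above (a sketch; the file name vecAdd.cu, the kernel name, and the use of unified memory via cudaMallocManaged are illustrative choices, not part of the original slides). Build with: nvcc -o vecAdd vecAdd.cu

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise vector addition: one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory: visible to CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;                        // threads per block
    int grid  = (n + block - 1) / block;    // enough blocks to cover n
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();                // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```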

  • SP: Scalar Processor, a 'CUDA core'. Executes one thread.
  • SM: Streaming Multiprocessor, a group of 32 SPs (or 16, 48, or more).
  • Fast local 'shared memory', shared among the SPs: 16 KiB (or 64 KiB).
  • For example: NVIDIA Maxwell GeForce GTX 750 Ti.
    • 5 SMs x 128 SPs = 640 CUDA cores
  • Parallelization: Decomposition to threads.
  • Memory: Shared memory, global memory.
  • Thread communication: Synchronization
[Figure: Streaming Multiprocessor]

  • Threads are grouped into thread blocks: typically 128, 192, or 256 threads per block.
  • One thread block executes on one SM.
    • All threads in a block share the SM's 'shared memory'.
    • Each thread block is divided into scheduling units known as warps (32 threads each).
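The warp decomposition above can be made concrete in a small kernel (a sketch; the kernel name and output arrays are illustrative, and the warp size of 32 is assumed):

```cuda
// A block of 256 threads is scheduled as 256 / 32 = 8 warps.
__global__ void warp_info(int *warp_id, int *lane_id)
{
    int tid = threadIdx.x;         // thread index within the block (1-D)
    warp_id[tid] = tid / 32;       // which warp this thread belongs to (0..7)
    lane_id[tid] = tid % 32;       // position ('lane') within that warp
}
```

All 32 threads of a warp execute the same instruction at the same time, which is why block sizes are chosen as multiples of 32.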
[Figure: Thread block]

  • Blocks form a GRID.
  • Thread ID: unique within its block.
  • Block ID: unique within the grid.
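Because the thread ID is only unique within a block, a globally unique index is built from the block ID and the thread ID together. A 1-D sketch (kernel and launch names are illustrative):

```cuda
// Combine block ID and thread ID into a grid-wide unique index.
__global__ void global_index(int *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // unique across the grid
    out[gid] = gid;
}
// Launch: global_index<<<numBlocks, threadsPerBlock>>>(out);
```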
[Figure: Blocks in a grid]

  • A kernel is executed as a grid of thread blocks. All threads in the grid share the same global memory space.
  • A thread block is a batch of threads that can cooperate with each other by:
    • Synchronizing their execution.
    • Efficiently sharing data through low-latency shared memory.
  • Two threads from different blocks cannot cooperate (they can only exchange data through global memory).
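Both cooperation mechanisms above — shared memory and synchronization — appear in a classic per-block sum reduction (a sketch assuming a block size of 256; the kernel name is illustrative):

```cuda
// Threads of one block cooperate through shared memory and the
// __syncthreads() barrier to compute one partial sum per block.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[256];              // shared by all threads in the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;      // each thread loads one element
    __syncthreads();                        // wait until every thread has written

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                    // synchronize after each step
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];           // one partial sum per block
}
```

Note that __syncthreads() only synchronizes threads within one block; combining the per-block partial sums requires a second kernel launch or atomics, precisely because threads from different blocks cannot cooperate directly.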
[Figure: Grids and blocks]