CUDA 1


Summary

  • CUDA.
  • Basic examples.

Nicolás Wolovick, 20200527.


Kudos for CUDA

It's been almost three years since GPU computing broke into the mainstream of HPC with the introduction of NVIDIA’s CUDA API in September 2007. Adoption of the technology since then has proceeded at a surprisingly strong and steady pace. Many organizations that began with small pilot projects a year or two ago have moved on to enterprise deployment, and GPU accelerated machines are now represented on the TOP500 list starting at position two. The relatively rapid adoption of CUDA by a community not known for the rapid adoption of much of anything is a noteworthy signal. Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming. Further, it more clearly and succinctly expresses the parallelism of a large class of problems leading to code that is easier to maintain, more scalable and better positioned to map to future many-core architectures.

Vincent Natol, “Kudos for CUDA”, HPCwire (2010)

  • Hennessy, Patterson, CAAQA5, L-48.


Origins of CUDA

Brook for GPUs: Stream Computing on Graphics Hardware. Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Computer Science Department, Stanford University.
To appear at SIGGRAPH 2004.

Figure 1: Programming model for current programmable graphics hardware.

Ian Buck

Ian Buck is general manager and vice president of Accelerated Computing at NVIDIA. He is responsible for the company’s worldwide datacenter business, including server GPUs and the enabling NVIDIA computing software for AI and HPC used by millions of developers, researchers and scientists. Buck joined NVIDIA in 2004 after completing his PhD in computer science from Stanford University, where he was development lead for Brook, the forerunner to generalized computing on GPUs. He is also the creator of CUDA, which has become the world’s leading platform for accelerated parallel computing. Buck has testified before the U.S. Congress on artificial intelligence and has advised the White House on the topic. Buck also received a BSE degree in computer science from Princeton University.


SIMT Model

Single Instruction, Multiple Thread

Halfway between MIMD and SIMD

  • Programmed like SIMD, but control flow and data are allowed to diverge.
  • Programmed like MIMD, but all threads execute the same code.

A hardware abstraction, independent of:

  • Warp size.
  • Number of ALUs per SM.
  • Ratio of fp32 to fp64 units.
  • Number of special function units (transcendental functions).
  • Amount of parallelism per SM (warps in flight).
  • Number of SMs.

It allows an interesting subset of the language to run on anything from a G80 to a GV100.
It may suffer performance problems, e.g. the divergence sketched below.
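A minimal sketch (hypothetical kernel name) of the control-flow divergence that SIMT tolerates: both branches are legal, but within a warp the two paths are serialized, which is where the performance cost comes from.

// Even and odd lanes take different branches: legal under SIMT,
// but a warp executes both paths one after the other.
__global__ void divergent(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // even lanes
        else
            out[i] = in[i] + 1.0f;   // odd lanes
    }
}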


CUDA Parallelism Model

CUDA Parallelism

The CUDA language abstracts the hardware:

  • Thread: a virtualized scalar processor (PC, registers, stack).
  • Block: a virtualized multiprocessor (threads, shared memory).

The CUDA runtime schedules onto the hardware:

  • Non-preemptive. Threads within a block are launched and run until they finish. (There are no critical sections or semaphores in the OS sense.)
  • Blocks are independent (see the sketch below).
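A minimal sketch (hypothetical kernel) of both abstractions: each thread behaves as a scalar processor handling one element, and blocks never communicate, which is what lets the runtime schedule them in any order.

__global__ void saxpy(float *y, const float *x, float a, int n)
{
    // One virtual scalar processor per element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Blocks are independent: any scheduling order gives the same result.
    if (i < n)
        y[i] = a * x[i] + y[i];
}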


CUDA Memory Model

Hierarchy of threads, blocks, grids

CUDA hierarchy of threads, blocks, grids


Automatic scalability

This makes it possible to run on hardware with a different number of SMs, or even on remote SMs: rCUDA.

GTX 1030 (3 SM), GTX 1050 (5 SM), GTX 1050 Ti (6 SM), GTX 1060 (9/10 SM), GTX 1070 (15 SM), GTX 1070 Ti (19 SM), GTX 1080 (20 SM), GTX 1080 Ti (28 SM), GTX Titan Xp (30 SM).

Performance is linear in the number of SMs, and this did not happen by magic.


Example: 2D grid of 2D thread blocks

This scheme makes data parallelism straightforward for 2D and 3D grids.
It does not always fit what we need. The usual 2D index computation is sketched below.
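A minimal sketch (hypothetical kernel and sizes) of the usual 2D mapping: each thread computes its (x, y) coordinate from the block and thread indices and touches one matrix element.

__global__ void scale2d(float *m, int width, int height, float alpha)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        m[y * width + x] *= alpha;                   // row-major element
}

// Launch: 16x16 threads per block, enough 2D blocks to cover the matrix.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale2d<<<grid, block>>>(d_m, width, height, 2.0f);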


CUDA

It extends "C++" with just a few things.
Interoperating with "C" is straightforward.

Key abstractions

  • Fork-join scheme, as in OpenMP.
    • But fork and join have zero cost.
  • Two-level thread hierarchy.
    • Blocks of threads.
    • Grids of blocks.
  • Very cheap barrier synchronization within a block.
    • None across blocks. (CUDA 9+ does have it: Cooperative Groups.)
    • Use fork-join to synchronize across blocks.
  • Shared memory within a block (see the sketch after this list).
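A minimal sketch (hypothetical kernel, assuming the input length is a multiple of the block size) of the last two abstractions: block-shared memory plus the cheap intra-block barrier.

#define BLOCK 256

__global__ void reverse_tiles(float *out, const float *in)
{
    __shared__ float tile[BLOCK];               // visible to every thread in the block
    int g = blockIdx.x * BLOCK + threadIdx.x;   // global index of this thread's element
    tile[threadIdx.x] = in[g];                  // stage the block's tile
    __syncthreads();                            // barrier: the whole tile is now loaded
    out[g] = tile[BLOCK - 1 - threadIdx.x];     // write it back reversed
}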


Fork-join scheme


Thread hierarchy

thread ∈ warp ∈ block ∈ grid

Typical quantities

32 threads in 1 warp
32 warps in 1 block
1048576 blocks in a grid.

CUDA hierarchy of threads, blocks, grids


Cooperation Units

Warp

Block

  • Barrier synchronization.
  • Local shared memory (shared).
  • Atomic instructions on shared memory: atomicAdd, CAS.

Grid

  • Fork-join synchronization via kernel launches. Volta now adds global barriers!
  • Global memory shared by all blocks.
  • Atomic instructions on global memory: atomicAdd, CAS (see the sketch below).
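A minimal sketch (hypothetical kernel) using atomics at both cooperation levels: each block accumulates a partial histogram in shared memory, then flushes it to the global histogram.

#define BINS 64

__global__ void histogram(unsigned int *global_hist, const unsigned char *data, int n)
{
    __shared__ unsigned int local_hist[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        local_hist[b] = 0u;                           // clear the block-local histogram
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&local_hist[data[i] % BINS], 1u);   // atomic on shared memory
    __syncthreads();

    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);    // atomic on global memory
}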


Variable scoping and lifetime

int i;

This simple i can be:

  • Local to a thread (in the register file).
  • Shared within a warp (via warp shuffle).
  • Shared within a block (via shared memory).
  • Shared across the grid (via global memory).

Compare with OpenMP, which only had local and global! Now there are 4 levels, as sketched below.
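A minimal sketch (hypothetical names) of the four levels the same int can live at; the kernel computes nothing useful, it only illustrates scope.

__device__ int i_grid;                          // shared by the whole grid (global memory)

__global__ void scopes(void)
{
    __shared__ int i_block;                     // shared by the block (shared memory)
    int i_thread = threadIdx.x;                 // local to the thread (register file)

    // Shared within the warp: lane 0 broadcasts its value via warp shuffle.
    int i_warp = __shfl_sync(0xffffffffu, i_thread, 0);

    if (threadIdx.x == 0) i_block = i_warp;     // one writer, then a barrier
    __syncthreads();
    if (threadIdx.x == 0 && blockIdx.x == 0) i_grid = i_block;
}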


Locality = Performance

All performance is from parallelism

Machines are power limited

(efficiency IS performance)

Machines are communication limited

(locality IS performance)

locality = performance


"C++" para CUDA

Function qualifiers

__host__ int add_cpu() { return 0; } // CPU function
__global__ void add_gpu() { } // kernel, callable from the host
__device__ float add_partial_gpu() { return 0.0f; } // GPU-only function

Variable qualifiers

__device__ float a[256]; // device global variable
__constant__ float my_constant_array[32]; // constant global array
__shared__ float my_shared_array[32]; // block-shared array (declared inside a kernel)
__global__ void add_gpu(const float* __restrict__ a); // kernel taking a non-aliased pointer

Data types

typedef struct { float x, y; } float2; // aligned (the real definition carries an alignment attribute)
// also float3, float4, double2, uint2, uint3, etc.
typedef uint3 dim3; // like uint3, but unspecified trailing components default to 1

Execution configuration

dim3 grid_dim(100, 50); // 5000 thread blocks
dim3 block_dim(4, 8, 8); // 256 threads per block
my_kernel <<< grid_dim, block_dim >>> (...); // launch the kernel
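A minimal end-to-end sketch (hypothetical names, error checks omitted) putting the qualifiers and the execution configuration together:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void fill(float *a, float v)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = v;                                  // one element per thread
}

int main(void)
{
    const unsigned int N = 1 << 20;
    float *d_a = NULL;
    cudaMalloc((void **)&d_a, N * sizeof(float));

    dim3 block_dim(256);                       // 256 threads per block
    dim3 grid_dim(N / block_dim.x);            // enough blocks to cover N
    fill<<<grid_dim, block_dim>>>(d_a, 1.0f);  // launch the kernel

    float h_a0 = 0.0f;
    cudaMemcpy(&h_a0, d_a, sizeof(float), cudaMemcpyDeviceToHost);
    printf("a[0] = %f\n", h_a0);
    cudaFree(d_a);
    return 0;
}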


"C++" para CUDA (cont'd)

Built-in variables and functions

dim3 gridDim; // grid dimensions
dim3 blockDim; // block dimensions
uint3 blockIdx; // block index within the grid
uint3 threadIdx; // thread index within the block
int warpSize; // warp size in threads
void __syncwarp(); // intra-warp synchronization
void __syncthreads(); // intra-block thread synchronization

Many more things (a few of them appear in the sketch after this list):

  • Memory fences.
  • Texture functions (1D, 2D and 3D interpolation) and surfaces.
  • Timing functions (clock).
  • Atomic functions.
  • Warp vote functions.
  • Warp shuffle functions.
  • Assertions (assert).
  • Standard output (printf).
  • Dynamic memory allocation (malloc).
  • Math functions.
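A minimal sketch (hypothetical kernel) touching a few of these: a warp vote, a warp shuffle, a device-side assert and a device-side printf.

#include <cstdio>
#include <cassert>

__global__ void warp_demo(const int *data)
{
    int lane = threadIdx.x % warpSize;
    int v = data[blockIdx.x * blockDim.x + threadIdx.x];

    int all_positive = __all_sync(0xffffffffu, v > 0);   // warp vote
    int lane0_value  = __shfl_sync(0xffffffffu, v, 0);   // warp shuffle

    assert(blockDim.x % warpSize == 0);                  // device-side assert
    if (lane == 0)
        printf("block %d: all_positive=%d lane0=%d\n", blockIdx.x, all_positive, lane0_value);
}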


Compute Capability (CC)

  • 1.0 (G80): float only, no atomics.
  • 1.1 (G86): atomics.
  • 1.2 (GT216): more atomics, atomics on shared, warp vote.
  • 1.3 (GT200 Tesla): double precision.
  • 2.0 (GF100 Fermi): 3D grids, float atomics, predicated syncthreads, surfaces, printf, assert.
  • 2.1 (GF104, a.k.a. "aborto'e mono"): 3 half-warp core groups (48 cores), ILP.
  • 3.0 (GK104 Kepler): grid.x up to 2^32-1, more configurable shared vs. L1 split, ILP, warp shuffle, 16 concurrent kernels.
  • 3.5 (GK110, full Kepler): 255 registers per thread, dynamic parallelism, ldg, 64-bit atomics, funnel shifts, 32 concurrent kernels.


Compute Capability (CC)

  • 5.0 (GM107, small Maxwell): (software) unified memory, crypto instructions.
  • 5.3 (GM20B, Jetson TX1): half precision at twice the single-precision rate (2:1).
  • 6.x (GP10y, Pascal): fp64 atomics on global and shared.
  • 7.x (GV100, Volta): tensor cores.
  • 8.x (GA100, Ampere): fp64 tensor cores, bf16, tf32 (a.k.a. bf19), int4, async copy global->shared without going through the register file.


Example: CC 3.0 -> CC 3.5

CC 3.5 added funnel shift.

hashcat benchmark for WPA2.

GTX 680: GK104, CC 3.0, 8 SM, 29 KHash/s.
GTX 780: GK110, CC 3.5, 12 SM, 77 KHash/s.
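As a rough illustration of what the instruction enables (a sketch, not hashcat's actual code): __funnelshift_l returns the upper 32 bits of the 64-bit concatenation hi:lo shifted left, which gives 32-bit rotations, the inner loop of many hash functions, in a single instruction.

// 32-bit left rotation, n in [0, 31].
__device__ unsigned int rotl32(unsigned int x, unsigned int n)
{
#if __CUDA_ARCH__ >= 350
    return __funnelshift_l(x, x, n);             // upper 32 bits of (x:x) << n
#else
    return (x << n) | (x >> ((32u - n) & 31u));  // two shifts and an OR on older CCs
#endif
}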


Bibliography


Next class

  • More examples of trivial parallelism.
