Nicolás Wolovick, 20180524.
It's been almost three years since GPU computing broke into the mainstream of HPC with the introduction of NVIDIA's CUDA API in September 2007. Adoption of the technology since then has proceeded at a surprisingly strong and steady pace. Many organizations that began with small pilot projects a year or two ago have moved on to enterprise deployment, and GPU accelerated machines are now represented on the TOP500 list starting at position two. The relatively rapid adoption of CUDA by a community not known for the rapid adoption of much of anything is a noteworthy signal. Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming. Further, it more clearly and succinctly expresses the parallelism of a large class of problems leading to code that is easier to maintain, more scalable and better positioned to map to future many-core architectures. Vincent Natoli, "Kudos for CUDA", HPCwire (2010)
(Hennessy, Patterson, CAAQA5, L-48)
Brook for GPUs: Stream Computing on Graphics Hardware. Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Computer Science Department, Stanford University.
SIGGRAPH 2004.
Ian Buck is "VP / General Manager, Tesla Data Center Business" at NVIDIA.
Single Instruction, Multiple Thread
fp32 vs. fp64 units.
It lets an interesting subset of the language run on anything from a G80 to a GV100.
It may suffer performance problems.
The CUDA language virtualizes the hardware:
The CUDA runtime schedules onto the hardware:
This makes it possible to run on hardware with a different number of SMs, or even on remote SMs: rCUDA.
GTX 1030 (3 SM), GTX 1050 (5 SM), GTX 1050 Ti (6 SM), GTX 1060 (9/10 SM), GTX 1070 (15 SM), GTX 1070 Ti (19 SM), GTX 1080 (20 SM), GTX 1080 Ti (28 SM), GTX Titan Xp (30 SM).
Performance is linear in the number of SMs: "it wasn't magic".
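This virtualization is visible in even the smallest kernel: the grid size is derived from the problem size, and the runtime distributes the blocks over however many SMs the GPU has. A minimal sketch (the kernel name and sizes are illustrative assumptions, not from the course material):

```cuda
// Launch enough blocks to cover n elements; the CUDA runtime schedules
// those blocks onto the available SMs, so the same binary runs on a
// 3-SM GTX 1030 or a 30-SM Titan Xp -- only the wall time changes.
__global__ void scale(float *a, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the last block may be partial
        a[i] *= k;
}

// Host side, with illustrative sizes:
//   int n = 1 << 20;
//   scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);
```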
This scheme makes data parallelism easy for 2D and 3D grids.
This scheme does not always fit what we need.
It extends "C" with just a few things.
thread ∈ warp ∈ block ∈ grid
32 threads in 1 warp
32 warps in 1 block
1048576 blocks in a grid.
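The hierarchy above is what each thread sees through the built-in index variables. A hedged sketch (kernel name and the output encoding are illustrative assumptions):

```cuda
// How a thread locates itself inside the hierarchy
// thread ∈ warp ∈ block ∈ grid (warpSize == 32).
__global__ void who_am_i(int *out) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
    int lane = threadIdx.x % warpSize;                // position inside its warp
    int warp = threadIdx.x / warpSize;                // warp index inside the block
    out[tid] = 100 * warp + lane;                     // illustrative payload
}
```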
__syncwarp(), atomicAdd(), CAS
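A minimal sketch of these primitives: atomicAdd for a race-free counter, and the classic retry loop that implements the same increment with compare-and-swap (the kernel and helper names are illustrative assumptions):

```cuda
// Count even elements without a data race on *counter.
__global__ void count_evens(const int *a, int n, int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && a[i] % 2 == 0)
        atomicAdd(counter, 1);        // hardware read-modify-write
}

// The same increment written by hand with CAS:
__device__ void cas_add(int *addr, int val) {
    int old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed, assumed + val);
    } while (old != assumed);         // another thread raced us: retry
}
```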
int i;

This simple i can be:
Compare with OpenMP, which only had local and global! Now there are 4 levels.
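The four levels can be sketched with one variable declared at each of them; this is a hedged illustration (names are assumptions), not code from the course:

```cuda
__device__   int i_device;      // global memory: one copy per GPU, grid-visible
__constant__ int i_constant;    // constant memory: read-only inside kernels

__global__ void four_levels(int *out) {
    __shared__ int i_shared;    // shared memory: one copy per block
    int i_local = threadIdx.x;  // register/local: private to each thread

    if (threadIdx.x == 0)
        i_shared = blockIdx.x;  // one thread initializes the block's copy
    __syncthreads();            // make i_shared visible to the whole block

    out[blockIdx.x * blockDim.x + threadIdx.x] =
        i_local + i_shared + i_device + i_constant;
}
```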
(efficiency IS performance)
(locality IS performance)
"C" for CUDA

Function qualifiers
__host__ int add_cpu() { return 0; }    // CPU function
__global__ void add_gpu() { }           // kernel, callable from the host
__device__ float add_partial_gpu() { return 0.0f; } // GPU function, callable from the GPU
Variable qualifiers
__device__ float a[256];                  // device global variable
__constant__ float my_constant_array[32]; // constant global array
__shared__ float my_shared_array[32];     // per-block shared array
__global__ void add_gpu(const float* __restrict__ a); // non-aliased pointer
Data types
typedef struct { float x, y; } float2; // aligned
// also float3, float4, double2, uint2, uint3, etc.
typedef uint3 dim3; // like uint3; unspecified trailing components default to 1
Execution configuration
dim3 grid_dim(100, 50);  // 5000 thread blocks
dim3 block_dim(4, 8, 8); // 256 threads per block
my_kernel<<<grid_dim, block_dim>>>(...); // launch kernel
Built-in variables and functions
dim3 gridDim;         // grid dimensions
dim3 blockDim;        // block dimensions
uint3 blockIdx;       // block index within the grid
uint3 threadIdx;      // thread index within the block
int warpSize;         // warp size (32 on all current architectures)
void __syncwarp();    // intra-warp sync
void __syncthreads(); // intra-block thread synchronization
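Putting the built-ins together, a hedged sketch of a block-local reversal, a classic case where __syncthreads() is mandatory (the kernel name and block size are illustrative assumptions):

```cuda
#define B 256
__global__ void reverse_block(float *a) {
    __shared__ float tmp[B];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tmp[t] = a[base + t];
    __syncthreads();     // wait until every thread in the block has written tmp
    a[base + t] = tmp[blockDim.x - 1 - t];
}
// launch: reverse_block<<<n / B, B>>>(d_a);  (assumes n is a multiple of B)
```

Without the barrier, a thread could read a tmp slot that its partner has not written yet.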
Many more things:
· Memory fences.
· Texture functions (1D, 2D and 3D interpolation) and surfaces.
· Timing functions (clock).
· Atomic functions.
· Warp vote functions.
· Warp shuffle functions.
· Assertions (assert).
· Standard output (printf).
· Dynamic memory allocation (malloc).
· Math functions.
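As one example from the list, warp shuffle lets threads in a warp exchange registers directly, with no shared memory. A hedged sketch of a warp-level reduction (function name is an assumption; the `_sync` intrinsic form is the modern spelling, older GPUs used `__shfl_down`):

```cuda
// Warp-level sum via shuffle: after the loop, lane 0 holds the warp's total.
__device__ float warp_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset); // add neighbor's register
    return v; // meaningful in lane 0
}
```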
· float, no atomics.
· float atomics, predicated __syncthreads, surfaces, printf, assert.
· grid.x limit raised, more configurable shared vs. L1 split, ILP, warp shuffle, 16 concurrent kernels.
· ldg, 64-bit atomics, funnel shifts, 32 concurrent kernels.
· fp64 atomics in global and shared memory. Added funnel shift.
hashcat benchmark for WPA2.
GK104, CC 3.0, 8 SM, 29 KHash/s
GK110, CC 3.5, 12 SM, 77 KHash/s
map, reduce, sgemm.
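The first of those patterns, map, is the simplest possible kernel: apply a function elementwise. A hedged sketch (the function and kernel names are illustrative assumptions):

```cuda
__device__ float f(float x) { return 2.0f * x + 1.0f; } // example map function

// map: out[i] = f(in[i]) -- one thread per element.
__global__ void map_f(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = f(in[i]);
}
```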