Nicolás Wolovick, 20160518.
opcode.type d, a, b, c;
(Hennessy, Patterson, CAAQA5, p.298)
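As an illustration of that three-address format, a few hedged PTX lines (the register names %f1..%f4, %rd1 and the particular instructions are illustrative, not taken from the book):

```cuda
add.f32       %f3, %f1, %f2;       // d = a + b
fma.rn.f32    %f4, %f1, %f2, %f3;  // d = a*b + c, round-to-nearest
ld.global.f32 %f1, [%rd1];         // load an operand from global memory
```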
(Hennessy, Patterson, CAAQA5, Fig-4.17)
A demonstration of the value of the PTX is that the Fermi architecture radically changed the hardware instruction set -- from being memory-oriented like x86 to being register-oriented like MIPS as well as doubling the address size to 64 bits -- without disrupting the NVIDIA software stack.
(Hennessy, Patterson, CAAQA5, p.330)
It's been almost three years since GPU computing broke into the mainstream of HPC with the introduction of NVIDIA’s CUDA API in September 2007. Adoption of the technology since then has proceeded at a surprisingly strong and steady pace. Many organizations that began with small pilot projects a year or two ago have moved on to enterprise deployment, and GPU accelerated machines are now represented on the TOP500 list starting at position two. The relatively rapid adoption of CUDA by a community not known for the rapid adoption of much of anything is a noteworthy signal. Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming. Further, it more clearly and succinctly expresses the parallelism of a large class of problems leading to code that is easier to maintain, more scalable and better positioned to map to future many-core architectures.
Vincent Natol, “Kudos for CUDA”, HPCwire (2010)
(Hennessy, Patterson, CAAQA5, L-48)
Brook for GPUs: Stream Computing on Graphics Hardware
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan Computer Science Department Stanford University To appear at SIGGRAPH 2004
Ian Buck is "VP Accelerated Computing at NVIDIA".
Single Instruction, Multiple Thread
fp32 vs. fp64 units. It allows an interesting subset of the language to run on anything from a G80 to a GM200.
It can suffer performance problems.
The CUDA language virtualizes the hardware:
The CUDA runtime schedules onto the hardware:
(NVIDIA’s Next Generation CUDA Compute Architecture: Fermi™, NVIDIA, 2009.)
This makes it possible to run on hardware with a different number of SMs, or, at some point, even on remote SMs!
Note that cheap boards (GTX 635, 1 SMX) have fewer SMs. The program runs just as it does on a mid-range board (GTX 680, 8 SMX) or a full one (GTX 780 Ti, 15 SMX), only with lower performance.
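This independence from the SM count is what a grid-stride loop exploits: the same kernel is correct for any grid size the runtime ends up scheduling. A minimal sketch (the kernel name `scale` and its parameters are illustrative):

```cuda
// Grid-stride loop: each thread starts at its global index and
// jumps by the total number of threads in the grid, so the kernel
// works unchanged on 1 SMX or 15 SMX boards.
__global__ void scale(float *a, float s, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)   // stride = total threads launched
        a[i] *= s;
}
```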
This scheme facilitates data parallelism for 2D and 3D grids.
This scheme does not always fit what we need.
It extends "C" with just a few small things.
thread ∈ warp ∈ block ∈ grid
32 threads in 1 warp
32 warps in 1 block
1048576 blocks in a grid.
(via ballots & warp shuffling)
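A sketch of where a thread sits in that hierarchy, for a 1D launch (the kernel name `who_am_i` and the output array are illustrative):

```cuda
// Each thread computes its position at every level of the hierarchy.
__global__ void who_am_i(int *lane_of)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // thread ∈ grid
    int lane = threadIdx.x % warpSize;                // thread ∈ warp (0..31)
    lane_of[tid] = lane;  // warp vote/shuffle intrinsics operate per lane
}
```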
atomicAdd, CAS.
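CAS is the more primitive of the two: an atomicAdd can be built from it with a retry loop. A sketch for float, following the pattern the CUDA C Programming Guide uses for double (the function name `my_atomic_add` is illustrative):

```cuda
// Retry until our CAS wins: reread, add, and swap only if the word
// has not changed underneath us.
__device__ float my_atomic_add(float *addr, float val)
{
    int *iaddr = (int *)addr;
    int old = *iaddr, assumed;
    do {
        assumed = old;
        old = atomicCAS(iaddr, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);   // value before the add, like atomicAdd
}
```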
I declare
int i;
This i can be:
Compare with OpenMP, which only had local and global!
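A sketch of how the same kind of declaration lands in different memory spaces depending on the qualifier (the names `i_global`, `i_shared`, `spaces_demo` are illustrative):

```cuda
__device__ int i_global;         // global memory: visible to the whole grid

__global__ void spaces_demo(void)
{
    int i_local = 0;             // register (or local memory): per thread
    __shared__ int i_shared;     // shared memory: per block

    if (threadIdx.x == 0) i_shared = i_local;
    __syncthreads();             // make the shared write visible
    if (threadIdx.x == 0) i_global = i_shared;
}
```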
(efficiency IS performance)
(locality IS performance)
(Bill Dally, Efficiency and Programmability: Enablers for ExaScale, SC13)
"C" for CUDA
Function qualifiers
__host__ int add_cpu(void) { return 0; }                 // CPU function
__global__ void add_gpu(void) { }                        // kernel, callable from the host
__device__ float add_partial_gpu(void) { return 0.0f; }  // GPU function
Variable qualifiers
__device__ float a[256];                               // device global variable
__constant__ float my_constant_array[32];              // constant global array
__shared__ float my_shared_array[32];                  // device block-shared array
__global__ void add_gpu(const float* __restrict__ a);  // non-aliased pointer
Types
typedef struct { float x, y; } float2;  // aligned
// also float3, float4, double2, uint2, uint3, etc.
typedef uint3 dim3;  // initial sections; trailing values default to 1
Execution configuration
dim3 grid_dim(100, 50);                    // 5000 thread blocks
dim3 block_dim(4, 8, 8);                   // 256 threads per block
my_kernel<<<grid_dim, block_dim>>>(...);   // launch kernel
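Putting the configuration together for a 1D problem of n elements, a hedged sketch (the kernel `axpy` and the device pointers `d_x`, `d_y` are illustrative; the round-up covers n not divisible by the block size):

```cuda
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // guard the tail of the array
}

// Host side: enough blocks of 128 threads to cover n elements.
dim3 block_dim(128);
dim3 grid_dim((n + block_dim.x - 1) / block_dim.x);
axpy<<<grid_dim, block_dim>>>(2.0f, d_x, d_y, n);
```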
Built-in variables and functions
dim3 gridDim;          // grid dimension
dim3 blockDim;         // block dimension
uint3 blockIdx;        // block index
uint3 threadIdx;       // thread index
int warpSize;          // warp size (32)
void __syncthreads();  // intra-block thread synchronization
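A typical use of these built-ins, sketched below: __syncthreads() separates the write phase from the read phase on shared memory (the kernel name `reverse_block` is illustrative; it assumes blockDim.x ≤ 256):

```cuda
// Reverse each block's slice of d using shared memory as staging.
__global__ void reverse_block(float *d)
{
    __shared__ float tile[256];
    int t = threadIdx.x;
    tile[t] = d[blockIdx.x * blockDim.x + t];
    __syncthreads();   // all writes to tile visible to the whole block
    d[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}
```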
Many more things:
· Memory fences.
· Texture functions (1D, 2D and 3D interpolation) and surfaces.
· Timing functions (clock).
· Atomic functions.
· Warp vote functions.
· Warp shuffle functions.
· Assertions (assert).
· Standard output (printf).
· Dynamic memory allocation (malloc).
· Math functions.
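As one example from this list, a warp shuffle lets 32 threads reduce a value without touching shared memory. A sketch using the pre-CC-7.0 intrinsic `__shfl_down` of the Kepler-era toolkits this deck targets (later toolkits rename it `__shfl_down_sync`):

```cuda
// Butterfly-style warp sum: halve the offset each step;
// after the loop, lane 0 holds the total for its warp.
__device__ float warp_sum(float v)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down(v, offset);
    return v;
}
```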
· float, no atomics.
· float atomics, predicated __syncthreads, surfaces, printf, assert.
· Larger grid.x, more configurable shared vs. L1, ILP, warp shuffle, 16 concurrent kernels.
· ldg, 64-bit atomics, funnel shifts, 32 concurrent kernels.