CUDA 2

Presenter Notes

Never used it; I only showed programs.

Summary

First program (straightforward parallelism).
Second program (not so easy).

Nicolás Wolovick, 20140522.

Embarrassingly parallel

Multiply and Add 4 (MA4)

CPU code

#define N (1<<28)
float a[N], b[N], c[N], d[N];   /* four global arrays of N floats each */

int main(void) {
    /* one multiply-add per element; the iterations are independent */
    for(unsigned int i=0; i<N; ++i)
        d[i] = a[i]*b[i]+c[i];

    return 0;
}

Parallelization

CUDA is designed for fine-grained data parallelism.
Like OpenMP, but with millions of threads available.
- Not just the ~23000 cores (virtualization).
- The more threads, the better latency is hidden and the higher the throughput (up to a point).
Strategy: one thread per data element (unthinkable on a CPU!).

With millions of threads, task parallelism makes no sense.

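MA4 in CUDA

#include <cuda.h>

#define N (1<<28)
#define BLOCK_SIZE 128

/* arrays statically allocated in device global memory */
__device__ float a[N], b[N], c[N], d[N];

__global__ void ma4(void) {
    /* global thread index: one thread per array element */
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    d[tid] = a[tid]*b[tid]+c[tid];
}

int main(void) {
    /* N/BLOCK_SIZE blocks of BLOCK_SIZE threads each */
    ma4<<<N/BLOCK_SIZE, BLOCK_SIZE>>>();
    /* the launch is asynchronous: wait for the kernel to finish */
    cudaDeviceSynchronize();
    return 0;
}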
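
To see how the kernel's index arithmetic covers the arrays, the launch can be emulated on the CPU in plain C. This is only a sketch: the names ma4_thread, ma4_emulated, ma4_check and GRID_SIZE, the small N, and the input pattern are assumptions made for illustration. The two loops run sequentially, whereas the GPU runs the threads in parallel with no guaranteed order.

```c
#define N (1 << 10)                 /* small N so the emulation is instant (assumption) */
#define BLOCK_SIZE 128
#define GRID_SIZE (N / BLOCK_SIZE)  /* hypothetical name for the number of blocks */

static float a[N], b[N], c[N], d[N];

/* Kernel body for one (block, thread) pair, same arithmetic as the slide. */
static void ma4_thread(unsigned int block_idx, unsigned int thread_idx) {
    unsigned int tid = block_idx * BLOCK_SIZE + thread_idx;
    d[tid] = a[tid] * b[tid] + c[tid];
}

/* Emulates ma4<<<GRID_SIZE, BLOCK_SIZE>>>() with two sequential loops. */
static void ma4_emulated(void) {
    for (unsigned int block = 0; block < GRID_SIZE; ++block)
        for (unsigned int thread = 0; thread < BLOCK_SIZE; ++thread)
            ma4_thread(block, thread);
}

/* Fills the inputs with a known pattern, runs the emulated launch and
   returns how many elements of d hold the expected value. */
static unsigned int ma4_check(void) {
    unsigned int ok = 0;
    for (unsigned int i = 0; i < N; ++i) {
        a[i] = 2.0f;
        b[i] = (float)i;
        c[i] = 1.0f;
    }
    ma4_emulated();
    for (unsigned int i = 0; i < N; ++i)
        if (d[i] == 2.0f * (float)i + 1.0f)
            ++ok;
    return ok;
}
```

Calling ma4_check() from a main function returns N: every index from 0 to N-1 is produced by exactly one (block, thread) pair, which is the "one thread per data element" strategy.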

Compiling and running

$ nvcc fma.cu -O3 -arch=sm_20 --ptxas-options=-v -o fmacuda
ptxas info    : Compiling entry function '_Z3fmav' for 'sm_20'
ptxas info    : Function properties for _Z3fmav
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 13 registers, 32 bytes cmem[0], 32 bytes cmem[14]
$ ./fmacuda
$

With #define N (1<<23) and BLOCK_SIZE 128, the grid has 65536 blocks, one more than the 65535 blocks that GF100 allows per grid dimension. The launch fails silently.
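
The block-count arithmetic behind that limit can be checked on the host before launching. The following is a minimal C sketch; the names MAX_GRID_DIM_X, blocks_needed and grid_fits are hypothetical helpers, and the constant assumes a compute capability 2.x device such as GF100.

```c
/* Per-dimension grid limit on compute capability 2.x (assumed target). */
#define MAX_GRID_DIM_X 65535u

/* Number of blocks needed to cover n elements, rounding up. */
static unsigned int blocks_needed(unsigned int n, unsigned int block_size) {
    return (n + block_size - 1) / block_size;
}

/* Returns 1 if a one-dimensional grid with that many blocks can be launched. */
static int grid_fits(unsigned int n, unsigned int block_size) {
    return blocks_needed(n, block_size) <= MAX_GRID_DIM_X;
}
```

For example, blocks_needed(1u << 23, 128) is 65536, so grid_fits(1u << 23, 128) is 0, while grid_fits(1u << 22, 128) is 1. In the CUDA program itself, calling cudaGetLastError() right after the launch returns an error code instead of cudaSuccess when the configuration is rejected, which makes the failure visible.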

Nontrivial parallelism

Presenter Notes

Sum all the elements of an array.
- Concurrency problems with x:=x+1.
- Atomics in global memory.
- Atomics in shared memory.
- Trick: warp shuffling.
- Point to Mark Harris's article/slides and leave that as an exercise for the students.
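
Two of the ideas listed above can be illustrated on the CPU with a small C sketch. It is a sequential simulation, not CUDA code, and the names lost_update, warp_reduce_sim and WARP_SIZE are assumptions for illustration. The first function replays one bad interleaving of x:=x+1 by hand; the second emulates the shuffle-down tree that a warp would run on the GPU (where the corresponding intrinsic is __shfl_down_sync, or __shfl_down in older toolkits, and the atomic alternative is atomicAdd).

```c
#define WARP_SIZE 32

/* Lost update: two "threads" run x := x + 1 as read, add, write,
   and both reads happen before either write. */
static int lost_update(void) {
    int x = 0;
    int r0 = x;      /* thread 0 reads 0 */
    int r1 = x;      /* thread 1 reads 0 */
    x = r0 + 1;      /* thread 0 writes 1 */
    x = r1 + 1;      /* thread 1 writes 1, overwriting thread 0's update */
    return x;        /* 1, although two increments ran */
}

/* Tree reduction over one warp, emulating the shuffle-down pattern:
   at each step lane i adds the value held by lane i + offset. */
static float warp_reduce_sim(const float in[WARP_SIZE]) {
    float lane[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i)
        lane[i] = in[i];
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
        /* Ascending order guarantees lane[i + offset] still holds the
           previous step's value when lane i reads it. */
        for (int i = 0; i + offset < WARP_SIZE; ++i)
            lane[i] += lane[i + offset];
    }
    return lane[0];  /* lane 0 ends up with the sum of all 32 inputs */
}
```

The reduction needs five steps (offsets 16, 8, 4, 2, 1) for 32 values and no atomics, because each lane only writes its own slot. That is the appeal of the shuffle trick compared with having every thread update a single shared accumulator.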

Bibliography

NVIDIA, NVIDIA CUDA C Programming Guide Version 6.0, 2014.
Kayvon Fatahalian, From Shader Code to a Teraflop: How GPU Shader Cores Work, 2011.
David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed., Morgan Kaufmann, 2012.
Rob Farber, CUDA Application Design and Development, Morgan Kaufmann, 2011.
NVIDIA, Parallel Thread Execution ISA Version 4.0, 2014.
NVIDIA, NVIDIA CUDA Compiler Driver NVCC Version 6.0, 2014.

Next class

More CUDA examples.
Debugging.
Tools for detecting errors.
Profiling.
Good programming practices.
