CUDA 2

Presenter Notes

Resumen

Primer programa (paralelismo directo).
Segundo programa (no es tán fácil).

Nicolás Wolovick, 20180523.

Presenter Notes

Paralelismo que da vergüenza

Presenter Notes

Multiply and Add 4 (MA4)

Código CPU

1 #define N (1<<28)
2 float a[N], b[N], c[N], d[N];
3 
4 int main(void) {
5     for(unsigned int i=0; i<N; ++i)
6         d[i] = a[i]*b[i]+c[i];
7 
8     return 0;
9 }

Paralelización

CUDA está pensado para paralelización de datos de grano fino.
Como OpenMP, pero tenemos millones de hilos disponibles.
- No solo los 24*128 = 3072 cores de una Titan X Maxwell.
- Con el SMT-16 que tiene 49152 threads corriendo.
- Cuanto más hilos más oculta la latencia, mayor througput (ma non troppo).
Estrategia: un hilo por dato ¡Impensable en CPU!

Con millones de hilos no tiene sentido paralelización de tareas.

Presenter Notes

MA4 en CUDA

 1 #include <cuda.h>
 2 
 3 #define N (1<<28)
 4 #define BLOCK_SIZE 128
 5 
 6 __device__ float a[N], b[N], c[N], d[N];
 7 
 8 __global__ void ma4(void) {
 9     unsigned int gtid = blockIdx.x*blockDim.x + threadIdx.x;
10     d[gtid] = a[gtid]*b[gtid]+c[gtid];
11 }
12 
13 int main(void) {
14     ma4<<<N/BLOCK_SIZE, BLOCK_SIZE>>>();
15     cudaDeviceSynchronize();
16     return 0;
17 }

Presenter Notes

Comparación con CPU

 1 #include <stddef.h>
 2 #define N (1<<28)
 3 
 4 float a[N], b[N], c[N], d[N];
 5 
 6 int main(void)
 7 {
 8     #pragma omp parallel for simd
 9     for(size_t i=0; i<N; ++i) {
10         d[i] = a[i]*b[i]+c[i];
11     }
12 
13     return (int)a[N/2]+b[N/3]+c[N/4]+d[N/5];
14 }

2*E52620v3: 43ms
GTX Titan X Maxwell: 15ms

2.86x más rápida la GPU que las dos CPU.

Problema totalmente memory-bound.

Presenter Notes

Paralelismo no trivial

Presenter Notes

Reducción

Paralelizar s = \sum a[i]

1 sum = 0.0f;
2 for(size_t i=0; i<N; ++i)
3     s += a[i];

Suma paralela en O(log(N)).

Presenter Notes

Comunicación entre hilos

Sumar todos los elementos de un arreglo

Problemas de concurrencia x=x+1.
Atomics en global.
Atomics en shared.
Último nivel de localidad: warp shuffling.

Un artículo viejo, pero interesante

Mark Harris, Optimizing Parallel Reduction in CUDA, NVIDIA, 2007.

Presenter Notes

Jerarquía de Paralelismo

CUDA hierarchy of threads, blocks, grids

Presenter Notes

Unidades de Cooperación

Warp

Ejecución interlocked y ...
Comunicación de variables privadas.

(via ballots & warp shuffling)

Bloque

Sincronización de barrera.
Memoria compartida local shared.
Instrucciones atómicas sobre la shared: atomicAdd, CAS.

Grilla

Sincronización fork-join por lanzamiento de kernels.
Memoria compartida global.
Instrucciones atómicas sobre la global: atomicAdd, CAS.

Presenter Notes

Convención

Identificadores de jerarquía de paralelismo

 1 #include "helper_cuda.h"
 2 
 3 uint lane = tid & CUDA_WARP_MASK; // lane dentro del warp
 4 
 5 uint tid = threadIdx.x; // hilo dentro del bloque.
 6 uint warp = tid / CUDA_WARP_SIZE;  // warp dentro del bloque
 7 
 8 uint gtid = threadIdx.x + blockDim.x*blockIdx.x;  // Identificador global de hilo
 9 uint gwarp = gtid / CUDA_WARP_SIZE;  // Identificador global de warp
10 uint bid = blockIdx.x;  // Identificador de bloque

Presenter Notes

Mejoras escandalosas

Comparación de las diferentes versiones en una GTX Titan X Maxwell.

reduce1.cu

256ms, pero incorrecto.
790ms, con atomicAdd, y es correcto.

reduce2.cu

1200ms, usamos reducción por bloque con atomic en la shared y luego acumulamos 1 por bloque en la global.
¡Empeoró!

reduce3.cu

1200ms idem anterior, solo que en vez de acumular en la global, dejamos un valor por bloque en un arreglo y acumula la CPU.

Presenter Notes

Mejoras escandalosas

reduce3_1.cu

12ms, idem anterior, pero hay 32 valores de acumulación sobre la shared para evitar que los 32 lanes usen atomics sobre el mismo.
¡Funcionó!

reduce4.cu

11.6ms, en vez de lo anterior, warp shuffle para acumular por warp en los registros como suma paralela O(log(N)), atomic a la shared por warp y luego atomic a la global por bloque.
¡Funcionó!

Locality=Performance

Presenter Notes

Comparación con CPU

 1 #define N (1<<28)
 2 float a[N];
 3 
 4 int main(void) {
 5     float s = 0.0f;
 6     #pragma omp parallel for simd reduction (+:s)
 7     for(size_t i=0; i<N; ++i)
 8         s += a[i];
 9     return (int)s;
10 }

Hacemos

1 $ gcc -fopenmp -O3 -ffast-math -mcmodel=medium reduce.c && perf stat -r 16 ./a.out

2*E52620v3: 18ms
GTX Titan X Maxwell: 11ms

1.63x más rápida la GPU que las dos CPU.

Hay muchísimo margen para mejorar en la GPU: Optimizing Parallel Reduction in CUDA

Presenter Notes

Bibliografía

Mark Harris Optimizing Parallel Reduction in CUDA, 2007.
Justin Luitjens, Faster Parallel Reductions on Kepler, ∥∀, 2014.
NVIDIA, NVIDIA CUDA C Programming Guide v7.5, 2016.
David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed., Morgan Kaufmann, 2012.
Rob Farber, CUDA Application Design and Development, Morgan Kaufmann, 2011.
NVIDIA, Parallel Thread Execution ISA v4.3, 2016.
NVIDIA, NVIDIA CUDA Compiler Driver NVCC v7.5, 2016.
Hennessy, Patterson, Computer Architecture: A Quantitative Approach, Fifth Edition, Morgan Kaufmann, 2011.

Presenter Notes

La clase que viene

Herramientas.
Buenas prácticas.
Profiling.

Table of Contents	t
Exposé	ESC
Full screen slides	e
Presenter View	p
Source Files	s
Slide Numbers	n
Toggle screen blanking	b
Show/hide slide context	c
Notes	2
Help	h

Código CPU

Paralelización

Sumar todos los elementos de un arreglo

Warp

Bloque

Grilla

Identificadores de jerarquía de paralelismo

Locality=Performance

Table of Contents

Help