sgemm
Nicolás Wolovick, 20180607

sgemm
N×N threads, each one computes c[i][j] = a[i][·] * b[·][j]: 2N FLOPS and 2N memory accesses per element of c.
NVIDIA, NVIDIA CUDA C Programming Guide.
N×N defined at runtime. BX×BY defined at runtime. N not necessarily a multiple of BX or BY.

Kernel
// 2D to 1D bijection
#define IX(i,j) ((i)*(N)+(j))

__global__ void sgemm(const uint N, float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
    const uint i = blockIdx.y*blockDim.y + threadIdx.y;
    const uint j = blockIdx.x*blockDim.x + threadIdx.x;
    if (i<N && j<N)
        for (uint k=0; k<N; ++k)
            c[IX(i,j)] += a[IX(i,k)] * b[IX(k,j)];
}
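For example, with N = 4 the macro maps (i, j) = (2, 3) to 2·4 + 3 = 11: plain row-major storage.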
Note:
» Use of __restrict__.
» blockDim.{x,y} to obtain the block size at runtime.
» The if guard so that the extra threads launched do nothing.
» Types defined by CUDA: uint.
// SIZE is presumably N*N; its definition is not shown in this excerpt
checkCudaErrors(cudaMalloc(&d_a, SIZE * sizeof(float)));
checkCudaErrors(cudaMalloc(&d_b, SIZE * sizeof(float)));
checkCudaErrors(cudaMalloc(&d_c, SIZE * sizeof(float)));
h_a = (float *) calloc(SIZE, sizeof(float));
h_b = (float *) calloc(SIZE, sizeof(float));
h_c = (float *) calloc(SIZE, sizeof(float));
assert(d_a && d_b && d_c && h_a && h_b && h_c);
// integer ceiling division
#define DIV_CEIL(a,b) (((a)+(b)-1)/(b))
dim3 grid_size(DIV_CEIL(N,BX), DIV_CEIL(N,BY));
dim3 block_size(BX,BY);
setmm<<<grid_size, block_size>>>(N, d_a, d_b, d_c);
getLastCudaError("setmm kernel failed");
sgemm<<<grid_size, block_size>>>(N, d_a, d_b, d_c);
getLastCudaError("sgemm kernel failed");
cudaDeviceSynchronize();
Note:
» Use of DIV_CEIL so there are enough blocks to cover the N×N matrix.
» Error checking after each call.
» Use of calloc so that the host memory is actually committed and zeroed.
» setmm (not shown in these slides) presumably fills a and b on the device and zeroes c; a sketch follows.
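A minimal sketch of what such an initialization kernel could look like (an assumption; the real setmm surely differs — the ptxas report later on shows it uses a stack frame and constant memory):

// Hypothetical setmm, NOT the original: fill a and b with arbitrary
// deterministic values and zero c, so that sgemm's += accumulates from 0.
__global__ void setmm(const uint N, float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
    const uint i = blockIdx.y*blockDim.y + threadIdx.y;
    const uint j = blockIdx.x*blockDim.x + threadIdx.x;
    if (i<N && j<N) {
        a[IX(i,j)] = 1.0f/(1.0f+(float)(i+j));
        b[IX(i,j)] = 1.0f/(1.0f+(float)(i*j));
        c[IX(i,j)] = 0.0f;
    }
}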
checkCudaErrors(cudaMemcpy(h_a, d_a, SIZE*sizeof(float), cudaMemcpyDefault));
checkCudaErrors(cudaMemcpy(h_b, d_b, SIZE*sizeof(float), cudaMemcpyDefault));
checkCudaErrors(cudaMemcpy(h_c, d_c, SIZE*sizeof(float), cudaMemcpyDefault));
double max_diff = 0.0;
for (unsigned int i=0; i<N; ++i) {
    for (unsigned int j=0; j<N; ++j) {
        float cij = 0.0f;
        for (unsigned int k=0; k<N; ++k)
            cij += h_a[IX(i,k)] * h_b[IX(k,j)];
        max_diff = MAX(max_diff, abs(cij-h_c[IX(i,j)]));
    }
}
printf("max_diff: %f\n", max_diff);
Note:
» We do not have to say in which direction we copy: cudaMemcpyDefault uses unified pointers to figure out on which side each address lives.
» We ask for anything but exact equality!
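MAX is used in the check but never defined in the excerpt; a typical definition (an assumption, not necessarily the one in sgemm.cu):

// assumed helper, not shown in the slides
#define MAX(a,b) (((a) > (b)) ? (a) : (b))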
$ nvprof ./sgemm 256 16 16
==1336== NVPROF is profiling process 1336, command: ./sgemm 256 16 16
max_diff: 0.000023
==1336== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 53.79%  111.74us      1  111.74us  111.74us  111.74us  sgemm(unsigned int, float*, float*, float*)
 43.47%  90.305us      3  30.101us  22.112us  34.145us  [CUDA memcpy DtoH]
  2.74%  5.6960us      1  5.6960us  5.6960us  5.6960us  setmm(unsigned int, float*, float*, float*)
$ cuda-memcheck ./sgemm 256 16 16
========= CUDA-MEMCHECK
max_diff: 0.000023
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck --tool racecheck ./sgemm 256 16 16
========= CUDA-MEMCHECK
max_diff: 0.000023
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)
$ cuda-memcheck --tool initcheck ./sgemm 256 16 16
========= CUDA-MEMCHECK
max_diff: 0.000023
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck --tool synccheck ./sgemm 256 16 16
========= CUDA-MEMCHECK
max_diff: 0.000023
========= ERROR SUMMARY: 0 errors
» The kernel runs.
» The difference with respect to the specification (CPU) version is small.
» No memory errors.
» No potential race conditions, no initialization problems, no bar.sync problems.
» We tread very carefully.
sgemm N BX BY
$ ./sgemm 1024 64 64
Cuda error: setmm kernel failed in file 'sgemm.cu' in line 54 : invalid configuration argument.
$ ./sgemm 1024 32 32
max_diff: 0.000092
$ ./sgemm 512 32 32
max_diff: 0.000046
$ ./sgemm 2048 32 32
max_diff: 0.000183
$ ./sgemm 256 1 1
max_diff: 0.000023
Note:
» 64×64 = 4096 threads per block exceeds the 1024-threads-per-block limit, hence the invalid configuration argument (a pre-launch check is sketched below).
» cuda-memcheck with 2048 32 32: the CPU verification is not merely costly, it is reeeeally costly.
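A guard like the following is not in the original code, but it would catch the bad configuration before launching (a sketch; BX and BY are assumed to be the values parsed from the command line):

// Hypothetical pre-launch check, not part of sgemm.cu: validate the block shape
// against the device limits instead of waiting for "invalid configuration argument".
cudaDeviceProp prop;
checkCudaErrors(cudaGetDeviceProperties(&prop, 0));
assert(BX * BY <= (unsigned)prop.maxThreadsPerBlock);   // 1024 on these boards
assert(BX <= (unsigned)prop.maxThreadsDim[0] && BY <= (unsigned)prop.maxThreadsDim[1]);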
PTX for sgemm:

$ nvcc -arch=sm_52 sgemm.cu -ptx
BB7_2:
    mul.wide.u32 %rd9, %r21, 4;
    add.s64 %rd10, %rd2, %rd9;
    mul.wide.u32 %rd11, %r22, 4;
    add.s64 %rd12, %rd1, %rd11;
    ld.global.nc.f32 %f4, [%rd12];
    ld.global.nc.f32 %f5, [%rd10];
    fma.rn.f32 %f6, %f5, %f4, %f6;
    add.s32 %r22, %r22, %r17;
    add.s32 %r21, %r21, 1;
    add.s32 %r23, %r23, 1;
    setp.lt.u32 %p6, %r23, %r17;
    @%p6 bra BB7_2;

    st.global.f32 [%rd3], %f6;
Also obtainable with:

$ cuobjdump -ptx sgemm
$ cuobjdump -sass sgemm

We look at the assembler (SASS):
/*0588*/  @!P0 BRA 0x638;                                 /* 0xe24000000a88000f */
/*0590*/  SHL R8, R7.reuse, 0x2;                          /* 0x3848000000270708 */
/*0598*/  SHR.U32 R9, R7, 0x1e;                           /* 0x3828000001e70709 */
                                                          /* 0x001fd400fe2207f4 */
/*05a8*/  SHL R6, R0.reuse, 0x2;                          /* 0x3848000000270006 */
/*05b0*/  IADD R8.CC, R8, c[0x0][0x150];                  /* 0x4c10800005470808 */
/*05b8*/  SHR.U32 R11, R0, 0x1e;                          /* 0x3828000001e7000b */
                                                          /* 0x0001d800fe0007e2 */
/*05c8*/  IADD.X R9, R9, c[0x0][0x154];                   /* 0x4c10080005570909 */
/*05d0*/  { IADD R10.CC, R6, c[0x0][0x148];               /* 0x4c1080000527060a */
/*05d8*/    LDG.E.CI R6, [R8]; }                          /* 0xeed4a00000070806 */
                                                          /* 0x001fd8002e2007f2 */
/*05e8*/  IADD.X R11, R11, c[0x0][0x14c];                 /* 0x4c10080005370b0b */
/*05f0*/  LDG.E.CI R11, [R10];                            /* 0xeed4a00000070a0b */
/*05f8*/  IADD32I R4, R4, 0x1;                            /* 0x1c00000000170404 */
                                                          /* 0x001fec00fc2007f1 */
/*0608*/  ISETP.LT.U32.AND P0, PT, R4, c[0x0][0x140], PT; /* 0x4b62038005070407 */
/*0610*/  IADD R7, R7, c[0x0][0x140];                     /* 0x4c10000005070707 */
/*0618*/  IADD32I R0, R0, 0x1;                            /* 0x1c00000000170000 */
                                                          /* 0x0403c403ffa147f0 */
/*0628*/  { FFMA R5, R11, R6, R5;                         /* 0x5980028000670b05 */
/*0630*/    @P0 BRA 0x590; }                              /* 0xe2400ffff580000f */
/*0638*/  STG.E [R2], R5;                                 /* 0xeedc200000070205 */
» Different from the PTX.
» Instructions are reordered.
» Use of LDG to read read-only memory through the texture (read-only data) cache (a variant making this explicit is sketched below).
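The nc/CI loads appear because the compiler can prove a and b are only read; a common way to make this explicit (a variant, not the code used in these slides) is const plus __restrict__, or forcing the path with __ldg() on CC ≥ 3.5:

// Variant kernel (an assumption, not the slides' source): const + __restrict__
// is what allows ptxas to emit ld.global.nc / LDG.E.CI through the read-only cache.
__global__ void sgemm_ro(const uint N, const float * __restrict__ a,
                         const float * __restrict__ b, float * __restrict__ c) {
    const uint i = blockIdx.y*blockDim.y + threadIdx.y;
    const uint j = blockIdx.x*blockDim.x + threadIdx.x;
    if (i<N && j<N)
        for (uint k=0; k<N; ++k)
            c[IX(i,j)] += __ldg(&a[IX(i,k)]) * __ldg(&b[IX(k,j)]);
}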
--ptxas-options=-v
$ nvcc sgemm.cu -arch=sm_52 --ptxas-options=-v -o sgemm.o
ptxas info    : 0 bytes gmem, 24 bytes cmem[3]
ptxas info    : Compiling entry function '_Z5sgemmjPfS_S_' for 'sm_52'
ptxas info    : Function properties for _Z5sgemmjPfS_S_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 28 registers, 352 bytes cmem[0]
ptxas info    : Compiling entry function '_Z5setmmjPfS_S_' for 'sm_52'
ptxas info    : Function properties for _Z5setmmjPfS_S_
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 19 registers, 352 bytes cmem[0], 40 bytes cmem[2]
$ nvprof --print-gpu-trace ./sgemm 1024 32 32
==6805== NVPROF is profiling process 6805, command: ./sgemm 1024 32 32
max_diff: 0.000092
==6805== Profiling application: ./sgemm 1024 32 32
==6805== Profiling result:
   Start  Duration  Grid Size  Block Size  Regs*  SSMem*  DSMem*  Size  Throughput  Device           Context  Stream  Name
365.27ms  100.29us  (32 32 1)   (32 32 1)     19      0B      0B     -           -  GeForce GTX TIT        1       7  setmm(unsigned int, float*, float*, float*) [185]
365.37ms  6.1918ms  (32 32 1)   (32 32 1)     28      0B      0B     -           -  GeForce GTX TIT        1       7  sgemm(unsigned int, float*, float*, float*) [192]
Both report 19 registers for setmm and 28 registers for sgemm.
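As a quick sanity check (assuming a Maxwell SM with 64 K 32-bit registers): 28 registers × 1024 threads ≈ 28 K registers per 32×32 block, so two such blocks (2048 threads, the per-SM maximum) still fit in the register file; register pressure is not what limits this kernel.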
$ nvprof --metrics ipc,flops_sp,gld_throughput,gst_throughput ./sgemm 1024 32 32
==8208== NVPROF is profiling process 8208, command: ./sgemm 1024 32 32
max_diff: 0.000092
==8208== Profiling application: ./sgemm 1024 32 32
==8208== Profiling result:
==8208== Metric result:
Invocations  Metric Name     Metric Description  Min         Max         Avg
Device "GeForce GTX TITAN X (0)"
    Kernel: sgemm(unsigned int, float*, float*, float*)
          1  ipc                                 4.236410    4.236410    4.236410
          1  gld_throughput                      2e+03GB/s   2e+03GB/s   2e+03GB/s
          1  gst_throughput                      644.16MB/s  644.16MB/s  644.16MB/s
          1  flop_count_sp                       2147483648  2147483648  2147483648
    Kernel: setmm(unsigned int, float*, float*, float*)
          1  ipc                                 3.174295    3.174295    3.174295
          1  gld_throughput                      0.00000B/s  0.00000B/s  0.00000B/s
          1  gst_throughput                      114.57GB/s  114.57GB/s  114.57GB/s
          1  flop_count_sp                       39836070    39836070    39836070
» Terrible use of the memory bandwidth: stores at 0.64 GiB/s out of ~200 GiB/s available.
» Terrible use of the compute power: 2.15 Gflop in ~6.2 ms, i.e. roughly 350 GFLOPS sustained on a card with a 6 TFLOPS sp peak.
Who are we?: sgeeeeemm!!!
What is our arithmetic intensity?: linear in the problem size!!!
How are we doing?: teeeerribly!
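A rough count for N = 1024, assuming each matrix ideally crosses the DRAM bus only once (an assumption this naive kernel does not come close to satisfying):

FLOPs                = 2·N³        = 2.15 Gflop
compulsory traffic   = 3·N²·4 B    = 12 MiB
arithmetic intensity = 2N³ / 12N²  = N/6 ≈ 171 flop/byte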
[Figure: gputime in µs vs. BX, BY (C2070, Fermi, 2012).]
[Figure: gputime in µs vs. BX, BY (C2070, zoomed in on BX = 16, 24, 32).]

gputime = 33.264 ms for 2×1024³ FLOP: 60 GFLOPS.

>>> ((2*1024**3)/0.033) / (1<<30)
60.6060606060606
$ grep -v "^ " sgemm-c2070.dat | sort -n -k 3 | head -16
32  5  33264.832
32  6  33275.199
32 24  33732.734
...
16 10  35538.559
32  3  35789.473
32  8  35800.672
b is read and c is written by full warps of 32 threads touching 128 consecutive bytes of memory per access (coalesced), while all the threads of a warp read the same element of a (a broadcast).

[Figure: gputime in µs vs. BX, BY (log color scale).]

The top 16:
$ grep -v "^ " sgemm-k40.dat | sort -n -k 3 | head -16
32  7  15819.616
16 14  15915.968
16 16  16027.520
32  8  16068.768
32 16  16092.128
 8 32  16112.224
16 18  16115.968
32  9  16156.960
 4 32  16188.736
 9 32  16195.232
18 16  16196.064
 8 28  16201.536
 7 32  16203.808
16  8  16228.736
 8 16  16229.504
17 15  16255.296
[Figures: gputime in µs vs. BX, BY.]
$ grep -v "^ " sgemm-gtxtitanx.dat | sort -n -k 3 | head -16
32 26  6211.488
32 24  6227.584
32 22  6239.200
32 32  6240.192
32 30  6257.536
32 28  6259.072
32 17  6271.136
32 20  6277.696
32 21  6285.920
32 12  6304.640
32 29  6306.304
16 30  6311.392
32 19  6316.480
32 16  6316.640
32 18  6317.408
32 31  6329.440
[Figures: gputime in µs vs. BX, BY.]
$ grep -v "^ " sgemm-gtx1080ti.dat | sort -n -k 3 | head -16
32 12  3407.531000
32 14  3407.627000
32 10  3412.555000
32 20  3414.572000
32 18  3418.316000
32 16  3440.014000
32 15  3444.589000
32 28  3446.767000
32 32  3457.583000
32 17  3458.062000
32 30  3472.720000
32 21  3482.064000
32 11  3487.482000
32 29  3491.986000
32 31  3495.473000
32 13  3495.889000
[Figure: gputime in µs vs. BX, BY.]
Board     Arch     µs     GFLOPS  Year  Peak  %Peak (eff.)
C2070     Fermi    33264      60  2011   1TF  6.0%
K40c      Kepler   15819     133  2013   5TF  2.6%
Titan X   Maxwell   6211     322  2015   6TF  5.3%
1080 Ti   Pascal    3407     588  2017  10TF  5.8%
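The GFLOPS column follows the convention of the earlier interactive computation, (2·1024³ FLOP / gputime) / 2³⁰; for example, for the Titan X, (2·1024³ / 6211 µs) / 2³⁰ ≈ 322.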
This is a trivial version of sgemm, but it already lets us see quite a few things.

Unified memory: available starting with CUDA 6.0 and CC 3.5. Useful for prototyping solutions.
checkCudaErrors(cudaMallocManaged(&a, SIZE * sizeof(float)));
checkCudaErrors(cudaMallocManaged(&b, SIZE * sizeof(float)));
checkCudaErrors(cudaMallocManaged(&c, SIZE * sizeof(float)));
assert(a && b && c);

dim3 grid_size(DIV_CEIL(N,BX), DIV_CEIL(N,BY));
dim3 block_size(BX,BY);
setmm<<<grid_size, block_size>>>(N, a, b, c);
getLastCudaError("setmm kernel failed");
sgemm<<<grid_size, block_size>>>(N, a, b, c);
getLastCudaError("sgemm kernel failed");
cudaDeviceSynchronize();

double max_diff = 0.0;
for (size_t i=0; i<N; ++i) {
    for (size_t j=0; j<N; ++j) {
        float cij = 0.0f;
        for (size_t k=0; k<N; ++k)
            cij += a[IX(i,k)] * b[IX(k,j)];
        max_diff = MAX(max_diff, abs(cij-c[IX(i,j)]));
    }
}
printf("max_diff: %f\n", max_diff);

checkCudaErrors(cudaFree(c));
checkCudaErrors(cudaFree(b));
checkCudaErrors(cudaFree(a));
$ nvprof --print-gpu-summary ./sgemm-unified 1024 16 16
==17649== NVPROF is profiling process 17649, command: ./sgemm-unified 1024 16 16
max_diff: 0.000671
==17649== Profiling application: ./sgemm-unified 1024 16 16
==17649== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 99.24%  146.81ms      1  146.81ms  146.81ms  146.81ms  sgemm(unsigned int, float*, float*, float*)
  0.76%  1.1184ms      1  1.1184ms  1.1184ms  1.1184ms  setmm(unsigned int, float*, float*, float*)
The calls to cudaMemcpy() (or anything similar) no longer show up.
The walltime is similar to the version with manual copies!
$ time ./sgemm-unified 1024 16 16

real    0m0.454s
user    0m0.136s
sys     0m0.300s
$ time ./sgemm 1024 16 16

real    0m0.292s
user    0m0.012s
sys     0m0.280s
cudaMallocManaged behaves like malloc, with a mechanism similar to demand paging driven by the GPU's MMU.

Let's run it:
$ nvprof --print-gpu-summary ./sgemm-unified 1024 32 16
==20210== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   98.46%  6.1160ms      1  6.1160ms  6.1160ms  6.1160ms  sgemm(unsigned int, float*, float*, float*)
                    1.54%  95.777us      1  95.777us  95.777us  95.777us  setmm(unsigned int, float*, float*, float*)

==20210== Unified Memory profiling result:
Device "GeForce GTX TITAN X (0)"
Count  Avg Size  Min Size  Max Size    Total Size  Total Time  Name
  112  109.71KB  4.0000KB  0.9961MB  12.00000MB  2.000928ms  Device To Host
Total CPU Page faults: 48
Starting with GP100 (Pascal) there is hardware on-demand page migration.
Nikolay Sakharnykh, Beyond GPU Memory Limits with Unified Memory on Pascal, NVIDIA, 2016.
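On Pascal and later (CUDA ≥ 8) the managed buffers can also be prefetched explicitly, paying the migration cost up front instead of faulting inside the kernels; a sketch of how this could be added to sgemm-unified (not part of the original program):

// Hypothetical addition: prefetch the managed memory to the GPU before the kernels,
// and bring c back to the host before the CPU verification.
int dev = 0;
checkCudaErrors(cudaGetDevice(&dev));
checkCudaErrors(cudaMemPrefetchAsync(a, SIZE * sizeof(float), dev));
checkCudaErrors(cudaMemPrefetchAsync(b, SIZE * sizeof(float), dev));
checkCudaErrors(cudaMemPrefetchAsync(c, SIZE * sizeof(float), dev));
// ... launch setmm and sgemm, then cudaDeviceSynchronize() ...
checkCudaErrors(cudaMemPrefetchAsync(c, SIZE * sizeof(float), cudaCpuDeviceId));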
// Hand-unrolled inner loop (note: this assumes N is a multiple of 4)
__global__ void sgemm(const uint N, float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
    const uint i = blockIdx.y*blockDim.y + threadIdx.y;
    const uint j = blockIdx.x*blockDim.x + threadIdx.x;
    if (i<N && j<N) {
        for (uint k=0; k<N; k+=4)
            c[IX(i,j)] += a[IX(i,k+0)] * b[IX(k+0,j)] +
                          a[IX(i,k+1)] * b[IX(k+1,j)] +
                          a[IX(i,k+2)] * b[IX(k+2,j)] +
                          a[IX(i,k+3)] * b[IX(k+3,j)];
    }
}
Tests
$ nvprof --print-gpu-summary ./sgemm-unroll 1024 32 16
==29268== NVPROF is profiling process 29268, command: ./sgemm-unroll 1024 32 16
==29268== Profiling application: ./sgemm-unroll 1024 32 16
==29268== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   64.13%  6.1083ms      1  6.1083ms  6.1083ms  6.1083ms  sgemm(unsigned int, float*, float*, float*)
                   34.88%  3.3223ms      3  1.1074ms  978.67us  1.1744ms  [CUDA memcpy DtoH]
                    0.99%  94.337us      1  94.337us  94.337us  94.337us  setmm(unsigned int, float*, float*, float*)
It improves nothing at all.
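The bottleneck is memory access, not loop overhead, so unrolling by hand buys nothing. If unrolling is still wanted, it is simpler to ask the compiler for it (a variant, not the slides' code), which also avoids the hand-unrolled version's hidden assumption that N is a multiple of 4:

// Variant (an assumption): accumulate in a register and let the compiler unroll.
if (i<N && j<N) {
    float cij = c[IX(i,j)];
    #pragma unroll 4
    for (uint k=0; k<N; ++k)
        cij += a[IX(i,k)] * b[IX(k,j)];
    c[IX(i,j)] = cij;
}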
(Plot note: the surfaces use a log scale to make good use of the palette; the best-performing block widths cluster around warpsize/2 and warpsize.)