CUDA via C++: Simple 2D Matrix Multiplication

Author: rubens
Date: 11/08/2010

This document shows how to create Python C/C++ modules that make use of a CUDA GPU.

Adessowiki Page Layout

Adessowiki pages that define Python compiled modules should have the following layout.

  1. C/C++ Module

    One or more C/C++ code fragments defined inside the directive:

    .. code:: cpp module
    

    These code fragments will be concatenated and saved in one source file.

  2. C++/Python Interface Definition

    One and only one C/C++ code fragment defined with the directive:

    .. code:: cpp header
    

    Functions to be exported to Python have their prototypes declared inside this directive.

    The presence of this directive starts the compile/link process to create the shared library to be imported in Python scripts.

  3. Python Modules

    One or more Python code fragments defined through the directive:

    .. code:: python module
    

    These code fragments will be collected in a file named after the Adessowiki page name and with extension ".py".

  4. Python Scripts

    One or more Python code fragments defined through the directive:

    .. code:: python
    
    or
    
    .. code:: python script
    

    These code fragments will be executed while rendering the web page, and the results of this execution will be incorporated into the page.
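Putting the four directives together, a minimal page skeleton might look like the sketch below (the fragment bodies are illustrative placeholders, not part of any real page):

```rst
.. code:: cpp module

    // one or more C/C++ implementation fragments;
    // all fragments are concatenated into a single source file

.. code:: cpp header

    // prototypes of the functions exported to Python;
    // this single fragment triggers the compile/link step

.. code:: python module

    # optional Python code collected into <pagename>.py

.. code:: python

    # script executed when the page is rendered;
    # its output is incorporated into the page
```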

Module Creation

The code below implements the multiplication of two NxN matrices, using a GPU. This code is based on Figure 4.6 of "Programming Massively Parallel Processors", by Kirk and Hwu. Note that

  • the code specifies a function to implement the Python interface instead of the "main" function used in the CUDA toolkit examples. For a more detailed explanation of the creation of compiled Python modules, see this page.
  • the Adessowiki sandbox that will be used to compile/execute the code is specified by the parameter :xsandbox_key: (see the page source code).
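Before reading the kernel, it may help to see its indexing in plain Python: each output element Pd[Row*Width + Col] is the dot product of row Row of Md and column Col of Nd, with all three matrices stored as flat row-major arrays. This is a reference sketch only, not part of the module:

```python
def matrix_mul_ref(Md, Nd, Width):
    """Flat row-major NxN matrix product, mirroring MatrixMulKernel."""
    Pd = [0.0] * (Width * Width)
    for Row in range(Width):          # in CUDA: by*TILE_WIDTH + ty
        for Col in range(Width):      # in CUDA: bx*TILE_WIDTH + tx
            Pvalue = 0.0
            for k in range(Width):
                # same index arithmetic as the kernel's inner loop
                Pvalue += Md[Row * Width + k] * Nd[k * Width + Col]
            Pd[Row * Width + Col] = Pvalue
    return Pd
```

On the GPU, the two outer loops disappear: each (Row, Col) pair is handled by one thread, and only the inner loop over k remains in the kernel.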

Memory Management Notes

The data that flow through the C++/Python interface are automatically created and destroyed by the Adessowiki code generator. In the example below, this interface is implemented by the Python function:

matrix_mul(A, B) -> C

    A, B, C: NxN numpy arrays

and its C/C++ equivalent function:

Image32F *matrix_mul(Image32F *A, Image32F *B);

The code generator takes care of creating the C++ images of type Image32F from the Python/numpy arrays passed to the Python function matrix_mul. These images are then used to call the C/C++ function, which executes and creates an output image to be returned to the Python layer. From this C++ image, the code generator creates the Python/numpy array returned to the Python caller. The C++ images are then destroyed.

An important point should be stressed here: the images created by the C/C++ function to be returned should be allocated solely for that purpose, because the code generator will destroy them afterwards.
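The ownership rule can be illustrated with a small Python analogy (the function name is hypothetical; in the real module the rule applies to the C++ Image32F objects, which the code generator deletes after converting them back to numpy):

```python
import numpy as np

def matrix_mul_fresh(A, B):
    """Analogy for the C++ contract: the result must be a fresh object.

    Returning A or B themselves would hand the caller (in the real module,
    the code generator) an object it does not own and will destroy.
    """
    C = np.empty_like(A)   # freshly allocated output, owned by no one else
    np.dot(A, B, out=C)    # fill it with the product
    return C               # ownership passes to the caller
```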

#include "simple_arrays.h"

#include <cuda.h>
#include <string.h>
#include <math.h>

#define TILE_WIDTH 16

int get_cuda_device_count()
{
    int n;
    cudaGetDeviceCount(&n);
    return n;
}

// Each thread computes one element of Pd as the dot product of a row of Md
// and a column of Nd. There is no bounds check, so Width must be a
// multiple of TILE_WIDTH.
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
{
    float Pvalue = 0;

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    for(int k = 0; k < Width; ++k) {
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    }
    Pd[Row*Width + Col] = Pvalue;
}

int checkError(const char *msg)
{
    cudaError_t status = cudaGetLastError();
    if(status != cudaSuccess) {
        pyprintf("*** %s [%s] ***\n", msg, cudaGetErrorString(status));
        return 1;
    }
    return 0;
}

#define CHECK_ERROR(msg) if(checkError(msg)) return

void MatrixMultiplication(float *a, float *b, float *c, int w)
{
    int size = w*w*sizeof(float);
    float *ad, *bd, *cd;

    // Allocate device memory for a, b and c
    cudaMalloc((void **)&ad, size);
    CHECK_ERROR("cudaMalloc a");
    cudaMalloc((void **)&bd, size);
    CHECK_ERROR("cudaMalloc b");
    cudaMalloc((void **)&cd, size);
    CHECK_ERROR("cudaMalloc c");

    // Copy input data to device
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    CHECK_ERROR("cudaMemcpy a");
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);
    CHECK_ERROR("cudaMemcpy b");

    // Invoke kernel to perform the multiplication
    // (w is assumed to be a multiple of TILE_WIDTH)
    dim3 gridDim(w/TILE_WIDTH, w/TILE_WIDTH);
    dim3 blockDim(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernel<<<gridDim, blockDim>>>(ad, bd, cd, w);

    cudaThreadSynchronize();
    CHECK_ERROR("Kernel");

    // Copy data from device
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    CHECK_ERROR("cudaMemcpy c");

    // Release allocated device memory
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);
}

Image32F *matrix_mul(Image32F *A, Image32F *B)
{
    // Validate the inputs before touching their dimensions
    if((A->nd != 2) || (B->nd != 2)) return 0;
    int width = A->dims[0];
    if((A->dims[1] != width) || (B->dims[0] != width) || (B->dims[1] != width)) return 0;

    if(get_cuda_device_count() <= 0) {
        pyprintf(">>> Cannot find a CUDA device.\n\n");
        return 0;
    }

    // pyprintf(">>> __CUDA_ARCH__ is %d\n", __CUDA_ARCH__);
    // pyprintf(">>> Image width is %dx%d pixels\n\n", width, width);

    Image32F *C = new Image32F(A->nd, A->dims);
    MatrixMultiplication((float *)A->raster, (float *)B->raster, (float *)C->raster, width);
    return C;
}

The following code fragment triggers the building of the above code. The parameter ":require_cuda: yes" causes the code to be pre-processed by the NVIDIA compiler. Remember that the Image32F datatype is defined in the file simple_arrays.h.

int get_cuda_device_count();
Image32F *matrix_mul(Image32F *A, Image32F *B);

Using the Module

To use the module from Python, import it: import cudacpp. The module name is the Adessowiki page name. When using modules defined on other pages, the import must include the namespace: import main.cudacpp.

import time
import numpy as np
import cudacpp

if cudacpp.get_cuda_device_count() > 0:

    N = 64*32

    a = 10.0*np.ones((N,N), np.float32)
    b =  1.5*np.ones((N,N), np.float32)

    t0 = 1000.0*time.time()
    c  = np.dot(a, b)
    t1 = 1000.0*time.time()
    C  = cudacpp.matrix_mul(a, b)
    t2 = 1000.0*time.time()

    print 'N = %d Error = %f' % (N, abs(C - c).max())
    print
    print '    CPU: %8.3f ms' % (t1-t0,)
    print '    GPU: %8.3f ms' % (t2-t1,)

else:
    print "Cannot find a CUDA device."