CUDA via C++: Simple 2D Matrix Multiplication

Author: rubens
Date: 11/08/2010

This document shows how to create Python C/C++ modules that make use of a CUDA GPU.

Adessowiki Page Layout

Adessowiki pages that define Python compiled modules should have the following layout; a sketch of the complete page skeleton follows the list.

  1. C/C++ Module

    One or more C/C++ code fragments defined inside the directive:

    .. code:: cpp module
    

    These code fragments will be concatenated and saved in one source file.

  2. C++/Python Interface Definition

    One and only one C/C++ code fragment defined with the directive:

    .. code:: cpp header
    

    The functions that should be exported to be used from Python have their prototypes defined within this directive.

    The presence of this directive starts the compile/link process to create the shared library to be imported in Python scripts.

  3. Python Modules

    One or more Python code fragments defined through the directive:

    .. code:: python module
    

    These code fragments will be collected in a file named after the Adessowiki page name and with extension ".py".

  4. Python Scripts

    One or more Python code fragments defined through the directive:

    .. code:: python
    
    or
    
    .. code:: python script
    

    These code fragments will be executed while the web page is rendered, and the results of the execution will be incorporated into the page.
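For orientation, the four directives combine into a page skeleton like the sketch below. The directive bodies are placeholders, not working code:

.. code:: cpp module

    // C/C++ implementation fragments (concatenated into one source file)

.. code:: cpp header

    // prototypes of the exported functions; triggers the compile/link step

.. code:: python module

    # collected into "<pagename>.py"

.. code:: python

    # executed while the page is rendered; output is incorporated into the page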

Module Creation

The code below implements the multiplication of two NxN matrices, using a GPU. This code is based on Figure 4.6 of "Programming Massively Parallel Processors", by Kirk and Hwu. Note that

  • the code specifies a function to implement the Python interface instead of the "main" function used in the CUDA toolkit examples. For a more detailed explanation of the creation of compiled Python modules, see this page.
  • the Adessowiki sandbox that will be used to compile/execute the code is specified by the parameter :xsandbox_key: (see the page source code).
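In the page source, such parameters appear as directive options. A sketch, assuming the option attaches to the code directive and using the sandbox key visible in this page's configuration:

.. code:: cpp module
    :xsandbox_key: xsb_pycuda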

Memory Management Notes

The data structures that flow through the C++/Python interface are created and destroyed automatically by the Adessowiki code generator. In the example below, this interface is implemented by the Python function:

def matrix_mul(A, B) -> C

    A, B, C: numpy arrays

and its C/C++ equivalent function:

Image32F *matrix_mul(Image32F *A, Image32F *B);

The code generator takes care of creating C++ images of type Image32F from the Python/numpy arrays passed to the Python function matrix_mul. These images are then used to call the C/C++ function, which executes and creates an output image to be returned to the Python layer. From this C++ image, the code generator creates the Python/numpy array that is returned to the Python caller. The C++ images are then destroyed.

An important point should be stressed here: an image created by the C/C++ function as its return value must be allocated solely for that purpose, because the code generator destroys it right after converting it.
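To make the ownership rule concrete, here is a minimal sketch of the return convention (the function name make_result is hypothetical; the "wrong" variant is shown only as a counter-example):

Image32F *make_result(Image32F *A)
{
    // Right: a fresh image that exists solely to be returned; the code
    // generator converts it to a numpy array and then deletes it.
    Image32F *C = new Image32F(A->nd, A->dims);
    return C;

    // Wrong: returning an input (or any image still owned elsewhere)
    // would let the code generator destroy live memory.
    // return A;
}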

#include "simple_arrays.h"

#include <cuda.h>
#include <string.h>
#include <math.h>

#define TILE_WIDTH 16

int get_cuda_device_count()
{
    int n;
    cudaGetDeviceCount(&n);
    return n;
}

// Each thread computes one element of the output matrix Pd.
// Assumes Width is a multiple of TILE_WIDTH.
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
{
    float Pvalue = 0;

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    for(int k = 0; k < Width; ++k) {
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    }
    Pd[Row*Width + Col] = Pvalue;
}

// Report the last CUDA error, if any; returns 1 on error, 0 otherwise
int checkError(const char *msg)
{
    cudaError_t status = cudaGetLastError();
    if(status != cudaSuccess) {
        pyprintf("*** %s [%s] ***\n", msg, cudaGetErrorString(status));
        return 1;
    }
    return 0;
}

#define CHECK_ERROR(msg) if(checkError(msg)) return

void MatrixMultiplication(float *a, float *b, float *c, int w)
{
    int size = w*w*sizeof(float);
    float *ad, *bd, *cd;

    // Allocate device memory for a, b and c
    cudaMalloc((void **)&ad, size);
    CHECK_ERROR("cudaMalloc a");
    cudaMalloc((void **)&bd, size);
    CHECK_ERROR("cudaMalloc b");
    cudaMalloc((void **)&cd, size);
    CHECK_ERROR("cudaMalloc c");

    // Copy input data to device
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    CHECK_ERROR("cudaMemcpy a");
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);
    CHECK_ERROR("cudaMemcpy b");

    // Invoke kernel to perform the multiplication:
    // one TILE_WIDTH x TILE_WIDTH thread block per output tile
    dim3 gridDim(w/TILE_WIDTH, w/TILE_WIDTH);
    dim3 blockDim(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernel<<<gridDim, blockDim>>>(ad, bd, cd, w);

    cudaThreadSynchronize();
    CHECK_ERROR("Kernel");

    // Copy data from device
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    CHECK_ERROR("cudaMemcpy c");

    // Release allocated device memory
    cudaFree(ad);
    cudaFree(bd);
    cudaFree(cd);
}

Image32F *matrix_mul(Image32F *A, Image32F *B)
{
    // Validate shapes before reading dims: both inputs must be square NxN
    if((A->nd != 2) || (B->nd != 2)) return 0;
    int width = A->dims[0];
    if((A->dims[1] != width) || (B->dims[0] != width) || (B->dims[1] != width)) return 0;

    if(get_cuda_device_count() <= 0) {
        pyprintf(">>> Cannot find a CUDA device.\n\n");
        return 0;
    }

    // pyprintf(">>> Image width is %dx%d pixels\n\n", width, width);

    // The output image is allocated solely to be returned; the code
    // generator converts it to a numpy array and then destroys it.
    Image32F *C = new Image32F(A->nd, A->dims);
    MatrixMultiplication((float *)A->raster, (float *)B->raster, (float *)C->raster, width);
    return C;
}
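As a sanity check on the launch configuration: with w = 2048 and TILE_WIDTH = 16, MatrixMultiplication launches a 128 x 128 grid of 16 x 16 thread blocks, i.e. exactly one thread per element of the 2048 x 2048 output. Because gridDim is computed by integer division, the kernel silently assumes that w is a multiple of TILE_WIDTH; the driver script below satisfies this with N = 64*32 = 2048.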

The following code fragment triggers the building of the above code. The parameter ":require_cuda: yes" causes the code to be processed by the NVIDIA compiler (nvcc). Remember that the Image32F datatype is defined in the file simple_arrays.h.

int get_cuda_device_count();
Image32F *matrix_mul(Image32F *A, Image32F *B);

Using the Module

To use the module in Python, import it: import cudacpp. The module name is the Adessowiki page name. When using modules defined on other pages, qualify the import with the namespace name: import main.cudacpp.

import time
import numpy as np
import cudacpp

if cudacpp.get_cuda_device_count() > 0:

    N = 64*32

    a = 10.0*np.ones((N,N), np.float32)
    b =  1.5*np.ones((N,N), np.float32)

    # Time the CPU reference (numpy dot) and the GPU kernel
    t0 = 1000.0*time.time()
    c  = np.dot(a, b)
    t1 = 1000.0*time.time()
    C  = cudacpp.matrix_mul(a, b)
    t2 = 1000.0*time.time()

    print 'N = %d Error = %f' % (N, abs(C - c).max())
    print
    print '    CPU: %8.3f ms' % (t1-t0,)
    print '    GPU: %8.3f ms' % (t2-t1,)

else:
    print "Cannot find a CUDA device."
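As a sanity check on the script's output: since a is filled with 10.0 and b with 1.5, every element of the product equals 10.0 * 1.5 * N = 30720.0 for N = 2048, so the Error line should print 0.000000 when the GPU result matches numpy.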