Solving tridiagonal linear systems in CUDA

Question

I am trying to implement a tridiagonal system solver based on the Cyclic Reduction method on my GTS450.

Cyclic Reduction is illustrated in this paper

Y. Zhang, J. Cohen, J.D. Owens, "Fast Tridiagonal Solvers on GPU"

However, whatever I do, my CUDA code is slower than its sequential counterpart. For a total of 512 x 512 points the GPU takes 7 ms, whereas the sequential code on my 3.4 GHz i7 takes 5 ms. The GPU is not accelerating anything!

What could be the problem?

#include "cutrid.cuh"
 __global__ void cutrid_RC_1b(double *a,double *b,double *c,double *d,double *x)
{
 int idx_global=blockIdx.x*blockDim.x+threadIdx.x;
 int idx=threadIdx.x;

 __shared__ double asub[512];
 __shared__ double bsub[512];
 __shared__ double csub[512];
 __shared__ double dsub[512];

 double at=0;
 double bt=0;
 double ct=0;
 double dt=0;

 asub[idx]=a[idx_global];
 bsub[idx]=b[idx_global];
 csub[idx]=c[idx_global];
 dsub[idx]=d[idx_global];


 for(int stride=1;stride<N;stride*=2)
  {
    int margin_left,margin_right;
    margin_left=idx-stride;
    margin_right=idx+stride;


    at=(margin_left>=0)?(-csub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f; 

    bt=bsub[idx]+((margin_left>=0)?(-csub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f)
    -((margin_right<512)?asub[idx+stride]*csub[idx]/bsub[idx+stride]:0.f); 

    ct=(margin_right<512)?(-csub[idx+stride]*asub[idx]/bsub[idx+stride]):0.f; 

    dt=dsub[idx]+((margin_left>=0)?(-dsub[idx-stride]*asub[idx]/bsub[idx-stride]):0.f)
    -((margin_right<512)?dsub[idx+stride]*csub[idx]/bsub[idx+stride]:0.f); 

    __syncthreads();
    asub[idx]=at;
    bsub[idx]=bt;
    csub[idx]=ct;
    dsub[idx]=dt;
    __syncthreads();
  }


x[idx_global]=dsub[idx]/bsub[idx];

}/*}}}*/

I launched this kernel with cutrid_RC_1b<<<512,512>>>(d_a,d_b,d_c,d_d,d_x) and reached 100% device occupancy. This result has puzzled me for days.

There is an improved version of my code:

    #include "cutrid.cuh"
    __global__ void cutrid_RC_1b(float *a,float *b,float *c,float *d,float *x)
    {/*{{{*/
     int idx_global=blockIdx.x*blockDim.x+threadIdx.x;
     int idx=threadIdx.x;

     __shared__ float asub[512];
     __shared__ float bsub[512];
     __shared__ float csub[512];
     __shared__ float dsub[512];

    asub[idx]=a[idx_global];
    bsub[idx]=b[idx_global];
    csub[idx]=c[idx_global];
    dsub[idx]=d[idx_global];
 __syncthreads();
   //Reduction  
    for(int stride=1;stride<512;stride*=2)
    {
        int margin_left=(idx-stride);
        int margin_right=(idx+stride);
        if(margin_left<0) margin_left=0;
        if(margin_right>=512) margin_right=511;
        float tmp1 = asub[idx] / bsub[margin_left];
        float tmp2 = csub[idx] / bsub[margin_right];
        float tmp3 = dsub[margin_right];
        float tmp4 = dsub[margin_left];
        __syncthreads();

        dsub[idx] = dsub[idx] - tmp4*tmp1-tmp3*tmp2;
        bsub[idx] = bsub[idx]-csub[margin_left]*tmp1-asub[margin_right]*tmp2;

        tmp3 = -csub[margin_right]; 
        tmp4 = -asub[margin_left];

        __syncthreads();
        asub[idx] = tmp3*tmp1;
        csub[idx] = tmp4*tmp2;
        __syncthreads();
     }

        x[idx_global]=dsub[idx]/bsub[idx];

    }/*}}}*/

This version runs in 0.73 ms on a Quadro K4000 for a 512 x 512 system; however, the code in the paper mentioned above runs in 0.5 ms on a GTX 280.

You didn't write it correctly. Can you tell anything about chunks of work, GPU loads, etc.? — duffymo, Oct 23 '13 at 12:26
Your code is full of conditional statements, which perhaps is one of the main limiting factors. — Vitality, Oct 23 '13 at 12:28
@duffymo i launched my kernel by cutrid_RC_1b<<<512,512>>>(d_a,d_b,d_c,d_d,d_x). In this way i reached 100% occupancy on gts480. — pengjun, Oct 23 '13 at 12:31
@JackOLantern the conditional statements are inevitable for this algorithm. there may be ways to improve it, but i haven't figured out. — pengjun, Oct 23 '13 at 12:33
I'm not sure if the code at [google tridiagonal solver](http://tridiagonalsolvers.googlecode.com/svn/trunk/tridiagonalsolvers/) is the same as that in the referenced paper, but you can have a look at their implementation and see if you can grasp some ideas to improve your approach. — Vitality, Oct 23 '13 at 12:45
You saw multiple GPUs engaged on your problem? If yes, I'd say you didn't partition it into chunks properly. Could N GPUs have solved the same problem N times? — duffymo, Oct 23 '13 at 12:46
@JackOLantern thank you for the link, i am reading it. it used a similar algorithm. — pengjun, Oct 23 '13 at 12:54
@duffymo i only use one GPU. I used 512 blocks so that the total computation may equal to that in the paper. — pengjun, Oct 23 '13 at 12:56
You are comparing a low-end GPU and a fairly high-end CPU. You may want to compare the raw double-precision GFLOPS of both devices to get an idea whether acceleration is likely assuming well-optimized code is run on both CPU and GPU. — njuffa, Oct 23 '13 at 19:16
@njuffa thank you for your advice. I used a Quadro K4000 device and improved my code following the one at [google code](http://tridiagonalsolvers.googlecode.com/svn/trunk/tridiagonalsolvers/). The speed improved a lot, but it is still slower than in the paper mentioned above. It seems to me that the device may matter more than the code. — pengjun, Oct 25 '13 at 05:55

score 6 · Accepted Answer · answered Oct 23 '13 at 21:17

Solving a tridiagonal system of equations is a challenging parallel problem since the classical solution scheme, i.e., Gaussian elimination, is inherently sequential.
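To make the sequential dependency concrete, here is a minimal host-side sketch of Gaussian elimination specialised to a tridiagonal matrix (usually called the Thomas algorithm). The function name `thomas_solve` and the argument layout are illustrative assumptions, not code from the question or the paper, and the sketch does no pivoting, so it assumes a well-conditioned (for example diagonally dominant) matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Thomas algorithm (hypothetical helper, not from the original post).
// a = lower diagonal (a[0] unused), b = main diagonal,
// c = upper diagonal (c[n-1] unused), d = right-hand side.
// b and d are taken by value because the sweep overwrites them.
std::vector<double> thomas_solve(const std::vector<double>& a,
                                 std::vector<double> b,
                                 const std::vector<double>& c,
                                 std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);

    // Forward sweep: row i needs the already-updated row i-1,
    // so the iterations cannot run in parallel.
    for (std::size_t i = 1; i < n; ++i) {
        const double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }

    // Back substitution: x[i] needs x[i+1], again a serial chain.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    }
    return x;
}
```

Both loops carry a dependency from one iteration to the next, which is why a single system solved this way cannot keep many GPU threads busy. A routine like this is also a convenient CPU reference against which a GPU kernel's output can be compared.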

Cyclic Reduction consists of two phases:

Forward Reduction. The original system is split into two independent tridiagonal systems for two sets of unknowns, the ones with odd index and the ones with even index. These systems can be solved independently, and this step can be seen as the first of a divide et impera scheme. The two smaller systems are split again in the same way into two subsystems, and the process is repeated until a system of only 2 equations is reached.
Backward Substitution. The system of 2 equations is solved first. Then, the divide et impera structure is climbed back up by solving the subsystems independently on different cores.
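The two phases can be sketched on the host as follows. This is a serial reference under stated assumptions, not the paper's implementation: the name `cyclic_reduction_solve` is made up for illustration, the system size must be a power of two, and no pivoting is done. Each inner `for` loop touches independent equations, so on a GPU every iteration of it would map to one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for cyclic reduction (hypothetical helper).
// a = lower diagonal, b = main diagonal, c = upper diagonal, d = right-hand side.
// All arrays are taken by value because the reduction overwrites them.
std::vector<double> cyclic_reduction_solve(std::vector<double> a,
                                           std::vector<double> b,
                                           std::vector<double> c,
                                           std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n >= 2 && (n & (n - 1)) == 0);  // n must be a power of two
    assert(a.size() == n && c.size() == n && d.size() == n);
    a[0] = 0.0;      // row 0 has no left neighbour
    c[n - 1] = 0.0;  // row n-1 has no right neighbour

    // Forward reduction: each level halves the number of active equations.
    std::size_t stride = 1;
    while (stride < n / 2) {
        const std::size_t delta = stride;
        stride *= 2;
        for (std::size_t i = stride - 1; i < n; i += stride) {
            const std::size_t left = i - delta;
            // Clamping is safe because c[n-1] stays zero, so k2 is zero there.
            const std::size_t right = (i + delta < n) ? i + delta : n - 1;
            const double k1 = a[i] / b[left];
            const double k2 = c[i] / b[right];
            b[i] -= c[left] * k1 + a[right] * k2;
            d[i] -= d[left] * k1 + d[right] * k2;
            a[i] = -a[left] * k1;
            c[i] = -c[right] * k2;
        }
    }

    // Solve the remaining 2x2 system directly.
    std::vector<double> x(n);
    const std::size_t i1 = stride - 1;
    const std::size_t i2 = 2 * stride - 1;
    const double det = b[i2] * b[i1] - c[i1] * a[i2];
    x[i1] = (b[i2] * d[i1] - c[i1] * d[i2]) / det;
    x[i2] = (d[i2] * b[i1] - d[i1] * a[i2]) / det;

    // Backward substitution: each level doubles the number of known unknowns.
    while (stride > 1) {
        const std::size_t delta = stride / 2;
        for (std::size_t i = delta - 1; i < n; i += stride) {
            const double left_term = (i >= delta) ? a[i] * x[i - delta] : 0.0;
            x[i] = (d[i] - left_term - c[i] * x[i + delta]) / b[i];
        }
        stride /= 2;
    }
    return x;
}
```

Note how the number of independent iterations shrinks from n/2 down to 1 and then grows back, which is the thread structure referred to later in this answer.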

I'm not sure (but correct me if I'm wrong) that your code will return consistent results. N does not appear to be defined. Also, you are accessing csub[idx-stride], but I'm not sure what that means when idx==0 and stride>1. Furthermore, you are using several conditional statements, essentially for boundary checking. Finally, your code lacks a thread structure capable of dealing with the divide et impera scheme mentioned above, conceptually much like the one used in the CUDA SDK reduction samples.
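One way to avoid both the out-of-range accesses and the per-term conditionals is to clamp the neighbour indices once per step. The host-side sketch below applies this to an update in which every equation is rewritten at every stride, which is how the kernel in the question appears to be organised (a parallel-cyclic-reduction style update). The name `pcr_solve` is hypothetical, and the sketch assumes a[0] = 0 and c[n-1] = 0 and does no pivoting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for a parallel-cyclic-reduction style update
// (hypothetical helper). Arrays are taken by value and overwritten.
std::vector<double> pcr_solve(std::vector<double> a, std::vector<double> b,
                              std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);
    a[0] = 0.0;      // row 0 has no left neighbour
    c[n - 1] = 0.0;  // row n-1 has no right neighbour

    // Separate output buffers play the role of the __syncthreads() barrier:
    // every row reads the old coefficients before any row is overwritten.
    std::vector<double> na(n), nb(n), nc(n), nd(n);
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 0; i < n; ++i) {  // one GPU thread per i
            // Clamp instead of branching on every term. This is safe because
            // a[i] is zero for i < stride and c[i] is zero for i >= n - stride,
            // so the factor multiplying the clamped neighbour is zero.
            const std::size_t left = (i >= stride) ? i - stride : 0;
            const std::size_t right = (i + stride < n) ? i + stride : n - 1;
            const double k1 = a[i] / b[left];
            const double k2 = c[i] / b[right];
            nb[i] = b[i] - c[left] * k1 - a[right] * k2;
            nd[i] = d[i] - d[left] * k1 - d[right] * k2;
            na[i] = -a[left] * k1;
            nc[i] = -c[right] * k2;
        }
        a.swap(na);
        b.swap(nb);
        c.swap(nc);
        d.swap(nd);
    }

    // Every equation is now decoupled: b[i] * x[i] = d[i].
    std::vector<double> x(n);
    for (std::size_t i = 0; i < n; ++i) x[i] = d[i] / b[i];
    return x;
}
```

The zero pattern in the comment holds by induction: after a pass with a given stride, the new a[i] is zero for every i below twice that stride, and symmetrically for c at the other end.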

As mentioned in one of my comments above, at tridiagonalsolvers you can find an implementation of the Cyclic Reduction scheme for solving tridiagonal systems of equations. Browsing the related Google pages, it seems to me that the code is maintained, among others, by the first author of the above paper (Yao Zhang). The code is copied and pasted below. Note that the boundary check is done only once (if (iRight >= systemSize) iRight = systemSize - 1;), thus limiting the number of conditional statements involved. Note also the thread structure, which is capable of dealing with a divide et impera scheme.

The code by Zhang, Cohen and Owens

// T is the floating-point type of the system (float or double)
template <class T>
__global__ void crKernel(T *d_a, T *d_b, T *d_c, T *d_d, T *d_x)
{
   int thid = threadIdx.x;
   int blid = blockIdx.x;

   int stride = 1;

   int numThreads = blockDim.x;
   const unsigned int systemSize = blockDim.x * 2;

   int iteration = (int)log2(T(systemSize/2));
   #ifdef GPU_PRINTF 
    if (thid == 0 && blid == 0) printf("iteration = %d\n", iteration);
   #endif

   __syncthreads();

   extern __shared__ char shared[];

   T* a = (T*)shared;
   T* b = (T*)&a[systemSize];
   T* c = (T*)&b[systemSize];
   T* d = (T*)&c[systemSize];
   T* x = (T*)&d[systemSize];

   a[thid] = d_a[thid + blid * systemSize];
   a[thid + blockDim.x] = d_a[thid + blockDim.x + blid * systemSize];

   b[thid] = d_b[thid + blid * systemSize];
   b[thid + blockDim.x] = d_b[thid + blockDim.x + blid * systemSize];

   c[thid] = d_c[thid + blid * systemSize];
   c[thid + blockDim.x] = d_c[thid + blockDim.x + blid * systemSize];

   d[thid] = d_d[thid + blid * systemSize];
   d[thid + blockDim.x] = d_d[thid + blockDim.x + blid * systemSize];

   __syncthreads();

   //forward elimination
   for (int j = 0; j <iteration; j++)
   {
       __syncthreads();
       stride *= 2;
       int delta = stride/2;

    if (threadIdx.x < numThreads)
    {
        int i = stride * threadIdx.x + stride - 1;
        int iLeft = i - delta;
        int iRight = i + delta;
        if (iRight >= systemSize) iRight = systemSize - 1;
        T tmp1 = a[i] / b[iLeft];
        T tmp2 = c[i] / b[iRight];
        b[i] = b[i] - c[iLeft] * tmp1 - a[iRight] * tmp2;
        d[i] = d[i] - d[iLeft] * tmp1 - d[iRight] * tmp2;
        a[i] = -a[iLeft] * tmp1;
        c[i] = -c[iRight] * tmp2;
    }
       numThreads /= 2;
   }

   if (thid < 2)
   {
     int addr1 = stride - 1;
     int addr2 = 2 * stride - 1;
     T tmp3 = b[addr2]*b[addr1]-c[addr1]*a[addr2];
     x[addr1] = (b[addr2]*d[addr1]-c[addr1]*d[addr2])/tmp3;
     x[addr2] = (d[addr2]*b[addr1]-d[addr1]*a[addr2])/tmp3;
   }

   // backward substitution
   numThreads = 2;
   for (int j = 0; j <iteration; j++)
   {
       int delta = stride/2;
       __syncthreads();
       if (thid < numThreads)
       {
           int i = stride * thid + stride/2 - 1;
           if(i == delta - 1)
                 x[i] = (d[i] - c[i]*x[i+delta])/b[i];
           else
                 x[i] = (d[i] - a[i]*x[i-delta] - c[i]*x[i+delta])/b[i];
        }
        stride /= 2;
        numThreads *= 2;
     }

   __syncthreads();

   d_x[thid + blid * systemSize] = x[thid];
   d_x[thid + blockDim.x + blid * systemSize] = x[thid + blockDim.x];

}

Thank you for your answer. I improved my code and reached a higher speed, but it is still a bit slower than the authors claimed: 0.72 ms vs. 0.5 ms. It seems that the device is the limiting factor in this case. — pengjun, Oct 25 '13 at 05:57
I download the tridiagonalsolvers from googlecode, how can I compile in linux? — xhg, Jun 06 '14 at 07:55

Vitality · Answer 2 · 2018-12-16T07:49:03.953

I want to add a further answer to mention that tridiagonal systems can easily be solved within the cuSPARSE library using the function

cusparse<t>gtsv()

cuSPARSE also provides

cusparse<t>gtsv_nopivot()

which, unlike the first routine, does not perform pivoting. Both functions solve a single linear system with multiple right-hand sides. A batched routine

cusparse<t>gtsvStridedBatch()

also exists which solves multiple linear systems.
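As I read the cuSPARSE documentation, the batched routine expects the diagonals and right-hand sides of all systems packed into single arrays, with system k starting at offset k * batchStride and batchStride at least as large as the system size m. The host-side sketch below only illustrates that layout; the types `TridiagSystem` and `StridedBatch` and the function `pack_strided` are hypothetical helpers, not part of cuSPARSE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One tridiagonal system on the host (hypothetical helper type).
// dl[0] and du[m-1] are expected to be zero.
struct TridiagSystem {
    std::vector<double> dl, d, du, rhs;
};

// All systems packed in the strided layout (hypothetical helper type).
struct StridedBatch {
    std::vector<double> dl, d, du, x;
    int m;            // size of each system
    int batchCount;   // number of systems
    int batchStride;  // elements between the starts of consecutive systems
};

StridedBatch pack_strided(const std::vector<TridiagSystem>& systems,
                          int batchStride) {
    assert(!systems.empty());
    const std::size_t m = systems[0].d.size();
    const std::size_t stride = static_cast<std::size_t>(batchStride);
    assert(batchStride > 0 && stride >= m);

    StridedBatch out;
    out.m = static_cast<int>(m);
    out.batchCount = static_cast<int>(systems.size());
    out.batchStride = batchStride;

    // Padding elements between systems (if stride > m) are left at zero.
    const std::size_t total = stride * systems.size();
    out.dl.assign(total, 0.0);
    out.d.assign(total, 0.0);
    out.du.assign(total, 0.0);
    out.x.assign(total, 0.0);

    for (std::size_t k = 0; k < systems.size(); ++k) {
        const TridiagSystem& s = systems[k];
        assert(s.dl.size() == m && s.d.size() == m &&
               s.du.size() == m && s.rhs.size() == m);
        for (std::size_t i = 0; i < m; ++i) {
            out.dl[k * stride + i] = s.dl[i];
            out.d[k * stride + i] = s.d[i];
            out.du[k * stride + i] = s.du[i];
            out.x[k * stride + i] = s.rhs[i];
        }
    }
    return out;
}
```

The packed arrays would then be copied to the device and passed to the batched solver together with m, batchCount and batchStride. Unlike the multiple right-hand-side routines, each system in the batch can have different diagonals.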

For all the above routines, the system matrix is specified simply by its lower diagonal, main diagonal and upper diagonal. Newer CUDA toolkits provide these solvers as the cusparse<t>gtsv2() family, which additionally takes a user-allocated workspace buffer.

Below is a fully worked example using cusparse<t>gtsv() to solve a tridiagonal linear system.

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <assert.h>

#include <cuda_runtime.h>
#include <cusparse_v2.h>

/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) { exit(code); }
   }
}

extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }

/***************************/
/* CUSPARSE ERROR CHECKING */
/***************************/
static const char *_cusparseGetErrorEnum(cusparseStatus_t error)
{
    switch (error)
    {

        case CUSPARSE_STATUS_SUCCESS:
            return "CUSPARSE_STATUS_SUCCESS";

        case CUSPARSE_STATUS_NOT_INITIALIZED:
            return "CUSPARSE_STATUS_NOT_INITIALIZED";

        case CUSPARSE_STATUS_ALLOC_FAILED:
            return "CUSPARSE_STATUS_ALLOC_FAILED";

        case CUSPARSE_STATUS_INVALID_VALUE:
            return "CUSPARSE_STATUS_INVALID_VALUE";

        case CUSPARSE_STATUS_ARCH_MISMATCH:
            return "CUSPARSE_STATUS_ARCH_MISMATCH";

        case CUSPARSE_STATUS_MAPPING_ERROR:
            return "CUSPARSE_STATUS_MAPPING_ERROR";

        case CUSPARSE_STATUS_EXECUTION_FAILED:
            return "CUSPARSE_STATUS_EXECUTION_FAILED";

        case CUSPARSE_STATUS_INTERNAL_ERROR:
            return "CUSPARSE_STATUS_INTERNAL_ERROR";

        case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
            return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";

        case CUSPARSE_STATUS_ZERO_PIVOT:
            return "CUSPARSE_STATUS_ZERO_PIVOT";
    }

    return "<unknown>";
}

inline void __cusparseSafeCall(cusparseStatus_t err, const char *file, const int line)
{
    if(CUSPARSE_STATUS_SUCCESS != err) {
        // --- Report the location, the numeric status code and its symbolic name
        fprintf(stderr, "CUSPARSE error in file '%s', line %d\nerror %d: %s\nterminating!\n",
                                file, line, (int)err, _cusparseGetErrorEnum(err));
        cudaDeviceReset(); assert(0);
    }
}

extern "C" void cusparseSafeCall(cusparseStatus_t err) { __cusparseSafeCall(err, __FILE__, __LINE__); }

/********/
/* MAIN */
/********/
int main()
{
    // --- Initialize cuSPARSE
    cusparseHandle_t handle;    cusparseSafeCall(cusparseCreate(&handle));

    const int N     = 5;        // --- Size of the linear system

    // --- Lower diagonal, diagonal and upper diagonal of the system matrix
    double *h_ld = (double*)malloc(N * sizeof(double));
    double *h_d  = (double*)malloc(N * sizeof(double));
    double *h_ud = (double*)malloc(N * sizeof(double));

    h_ld[0]     = 0.;
    h_ud[N-1]   = 0.;
    for (int k = 0; k < N - 1; k++) {
        h_ld[k + 1] = -1.;
        h_ud[k]     = -1.;
    }

    for (int k = 0; k < N; k++) h_d[k] = 2.;

    double *d_ld;   gpuErrchk(cudaMalloc(&d_ld, N * sizeof(double)));
    double *d_d;    gpuErrchk(cudaMalloc(&d_d,  N * sizeof(double)));
    double *d_ud;   gpuErrchk(cudaMalloc(&d_ud, N * sizeof(double)));

    gpuErrchk(cudaMemcpy(d_ld, h_ld, N * sizeof(double), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_d,  h_d,  N * sizeof(double), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_ud, h_ud, N * sizeof(double), cudaMemcpyHostToDevice));

    // --- Allocating and defining dense host and device data vectors
    double *h_x = (double *)malloc(N * sizeof(double)); 
    h_x[0] = 100.0;  h_x[1] = 200.0; h_x[2] = 400.0; h_x[3] = 500.0; h_x[4] = 300.0;

    double *d_x;        gpuErrchk(cudaMalloc(&d_x, N * sizeof(double)));   
    gpuErrchk(cudaMemcpy(d_x, h_x, N * sizeof(double), cudaMemcpyHostToDevice));

    // --- Allocating the host and device side result vector
    double *h_y = (double *)malloc(N * sizeof(double)); 
    double *d_y;        gpuErrchk(cudaMalloc(&d_y, N * sizeof(double))); 

    cusparseSafeCall(cusparseDgtsv(handle, N, 1, d_ld, d_d, d_ud, d_x, N));

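    // --- cusparseDgtsv overwrites the right-hand side in d_x with the solution
    gpuErrchk(cudaMemcpy(h_x, d_x, N * sizeof(double), cudaMemcpyDeviceToHost));
    for (int k=0; k<N; k++) printf("%f\n", h_x[k]);

    // --- Release library, device and host resources
    cusparseSafeCall(cusparseDestroy(handle));
    gpuErrchk(cudaFree(d_ld)); gpuErrchk(cudaFree(d_d)); gpuErrchk(cudaFree(d_ud));
    gpuErrchk(cudaFree(d_x));  gpuErrchk(cudaFree(d_y));
    free(h_ld); free(h_d); free(h_ud); free(h_x); free(h_y);

    return 0;
}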
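To sanity-check the values printed by the example above, the solution can be substituted back into the system on the host. The sketch below uses the same storage convention as the example (the first element of the lower diagonal and the last element of the upper diagonal are zero); the function names `tridiag_matvec` and `max_residual` are hypothetical helpers, not cuSPARSE calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y = A * x for a tridiagonal A stored as three diagonals.
// ld[0] and ud[n-1] are never read.
std::vector<double> tridiag_matvec(const std::vector<double>& ld,
                                   const std::vector<double>& d,
                                   const std::vector<double>& ud,
                                   const std::vector<double>& x) {
    const std::size_t n = d.size();
    assert(ld.size() == n && ud.size() == n && x.size() == n);
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = d[i] * x[i];
        if (i > 0) y[i] += ld[i] * x[i - 1];
        if (i + 1 < n) y[i] += ud[i] * x[i + 1];
    }
    return y;
}

// Largest absolute entry of A * x - rhs.
double max_residual(const std::vector<double>& ld,
                    const std::vector<double>& d,
                    const std::vector<double>& ud,
                    const std::vector<double>& x,
                    const std::vector<double>& rhs) {
    const std::vector<double> y = tridiag_matvec(ld, d, ud, x);
    assert(rhs.size() == y.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        worst = std::max(worst, std::fabs(y[i] - rhs[i]));
    }
    return worst;
}
```

For the 5 x 5 system in the example (diagonal 2, off-diagonals -1, right-hand side 100, 200, 400, 500, 300), working the equations by hand gives a solution of approximately 633.33, 1166.67, 1500, 1433.33, 866.67, so `max_residual` evaluated on the copied-back h_x and the original right-hand side should be close to zero.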

At this GitHub repository, a comparison of the different CUDA routines available in the cuSOLVER library for the solution of tridiagonal linear systems is reported.

The problem is that it only solves multiple right-hand sides with the same matrix, and it is not callable from the device. — JimBamFeng, Feb 08 '19 at 16:17

Farzad · Answer 3 · 2013-10-23T18:04:00.400

0

Things I see:

The first __syncthreads() seems redundant.
There are repeated sets of operations such as (-csub[idx-stride]*asub[idx]/bsub[idx-stride]) in your code. Use intermediate variables to hold the results and reuse them instead of making the GPU recompute them each time.
Use the NVIDIA profiler to see where the issues are.

edited Oct 23 '13 at 18:04

answered Oct 23 '13 at 17:15

Thank you for your advice, I am still improving my code to achieve an ideal speed! – pengjun Oct 25 '13 at 05:58
