How to normalize matrix columns in CUDA with max performance?

Question

How to effectively normalize matrix columns in CUDA?

My matrix is stored in column-major, and the typical size is 2000x200.

The operation can be represented in the following matlab code.

A = rand(2000,200);

A = exp(A);
A = A./repmat(sum(A,1), [size(A,1) 1]);

Can this be done effectively by Thrust, cuBLAS and/or cuNPP?

A rapid implementation including 4 kernels is shown as follows.

Wondering if these can be done in 1 or 2 kernels to improve the performance, especially for the column summation step implemented by cublasDgemv().

#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/transform.h>
#include <thrust/iterator/constant_iterator.h>
#include <math.h>

struct Exp
{
    __host__ __device__ void operator()(double& x)
    {
        x = exp(x);
    }
};

struct Inv
{
    __host__ __device__ void operator()(double& x)
    {
        x = (double) 1.0 / x;
    }
};

int main()
{
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    cublasHandle_t hd;
    curandGenerator_t rng;
    cublasCreate(&hd);
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_DEFAULT);

    const size_t m = 2000, n = 200;
    const double c1 = 1.0;
    const double c0 = 0.0;

    thrust::device_vector<double> A(m * n);
    thrust::device_vector<double> sum(1 * n);
    thrust::device_vector<double> one(m * n, 1.0);

    double* pA = thrust::raw_pointer_cast(&A[0]);
    double* pSum = thrust::raw_pointer_cast(&sum[0]);
    double* pOne = thrust::raw_pointer_cast(&one[0]);

    for (int i = 0; i < 100; i++)
    {
        curandGenerateUniformDouble(rng, pA, A.size());


        thrust::for_each(A.begin(), A.end(), Exp());

        cublasDgemv(hd, CUBLAS_OP_T, m, n,
                &c1, pA, m, pOne, 1, &c0, pSum, 1);

        thrust::for_each(sum.begin(), sum.end(), Inv());

        cublasDdgmm(hd, CUBLAS_SIDE_RIGHT, m, n, pA, m, pSum, 1, pA, m);
    }

    curandDestroyGenerator(rng);
    cublasDestroy(hd);

    return 0;
}

Yes it can be done effectively with CUDA. Show some CUDA code you have written to achieve what you want. — sgarizvi, Jan 08 '13 at 10:33

kangshiyin · Accepted Answer · 2016-06-13T18:28:50.943

I compared the performance of 3 approaches on M2090 with CUDA 5.0.

[173.179 us] cublas implementation as shown in the question
[733.734 us] pure Thrust implementation with thrust::reduce_by_key from @talonmies
[1.508 ms] pure Thrust implementation with thrust::inclusive_scan_by_key

$Performance on A_{2,000 x 200}$

It can be seen that,

cublas has highest performance in this case;
both thrust::reduce_by_key & thrust::inclusive_scan_by_key launch multiple kernels, which lead to extra overhead;
thrust::inclusive_scan_by_key writes much more data to DRAM compared to thrust::reduce_by_key, which can be one of the reasons for longer kernel time;
the main performance difference between cublas and thrust approach is the matrix column summation. thrust is slower possibly because thrust::reduce_by_key is designed to do reduction on segments with variant length, but cublas_gemv() can only apply to fixed length segments (row/col).

When the matrix A is large enough to ignore the kernel launching overhead, the cublas appoach still performs best. The profiling result on A_{20,000 x 2,000} is shown as follows.

$Performance on A_{20,000 x 2,000}$

Fusing the first for_each operation with the cublasSgemv call as indicated by @talonmies may further improve the performance, but I think kernel written by hand should be used instead of thrust::reduce_by_key.

The code for the 3 approaches is shown as follows.

#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <math.h>

struct Exp: public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(double x)
    {
        return exp(x);
    }
};

struct Inv: public thrust::unary_function<double, double>
{
    __host__ __device__ double operator()(double x)
    {
        return (double) 1.0 / x;
    }
};

template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ MulC(T c) :
        C(c)
    {
    }
    __host__ __device__ T operator()(T x)
    {
        return x * C;
    }
};

template<typename T>
struct line2col: public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ line2col(T C) :
            C(C)
    {
    }

    __host__ __device__ T operator()(T i)
    {
        return i / C;
    }
};

int main()
{
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    cublasHandle_t hd;
    curandGenerator_t rng;
    cublasCreate(&hd);
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_DEFAULT);

    const size_t m = 2000, n = 200;
    const double c1 = 1.0;
    const double c0 = 0.0;

    thrust::device_vector<double> A(m * n);
    thrust::device_vector<double> B(m * n);
    thrust::device_vector<double> C(m * n);
    thrust::device_vector<double> sum1(1 * n);
    thrust::device_vector<double> sum2(1 * n);
    thrust::device_vector<double> one(m * n, 1);

    double* pA = thrust::raw_pointer_cast(&A[0]);
    double* pB = thrust::raw_pointer_cast(&B[0]);
    double* pSum1 = thrust::raw_pointer_cast(&sum1[0]);
    double* pSum2 = thrust::raw_pointer_cast(&sum2[0]);
    double* pOne = thrust::raw_pointer_cast(&one[0]);

    curandGenerateUniformDouble(rng, pA, A.size());

    const int count = 2;

    for (int i = 0; i < count; i++)
    {
        thrust::transform(A.begin(), A.end(), B.begin(), Exp());
        cublasDgemv(hd, CUBLAS_OP_T, m, n, &c1, pB, m, pOne, 1, &c0, pSum1, 1);
        thrust::transform(sum1.begin(), sum1.end(), sum1.begin(), Inv());
        cublasDdgmm(hd, CUBLAS_SIDE_RIGHT, m, n, pB, m, pSum2, 1, pB, m);
    }

    for (int i = 0; i < count; i++)
    {
        thrust::reduce_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), line2col<int>(m)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), line2col<int>(m)) + A.size(),
                thrust::make_transform_iterator(A.begin(), Exp()),
                thrust::make_discard_iterator(),
                sum2.begin());
        thrust::transform(
                A.begin(), A.end(),
                thrust::make_permutation_iterator(
                        sum2.begin(),
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), line2col<int>(m))),
                C.begin(),
                thrust::divides<double>());
    }

    for (int i = 0; i < count; i++)
    {
        thrust::inclusive_scan_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), line2col<int>(m)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), line2col<int>(m)) + A.size(),
                thrust::make_transform_iterator(A.begin(), Exp()),
                C.begin());
        thrust::copy(
                thrust::make_permutation_iterator(
                        C.begin() + m - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(m))),
                thrust::make_permutation_iterator(
                        C.begin() + m - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(m))) + n,
                sum2.begin());
        thrust::transform(
                A.begin(), A.end(),
                thrust::make_permutation_iterator(
                        sum2.begin(),
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), line2col<int>(m))),
                C.begin(),
                thrust::divides<double>());
    }

    curandDestroyGenerator(rng);
    cublasDestroy(hd);

    return 0;
}

score 3 · Answer 2 · answered Jan 09 '13 at 03:43

You should be able to fuse the first for_each operation with the cublasSgemv call into a single reduce_by_key call. If you define/redefine functors as:

struct Accessor : public thrust::unary_function<int,int>
{
    int lda;
    __host__ __device__ Accessor(int _lda) : lda(_lda) {};
    __host__ __device__ int operator()(const int& idx)
    {
        return idx/lda;
    }
};

struct Exp : public thrust::unary_function<double,double>
{
    __host__ __device__ double operator()(const double& x)
    {
        return exp(x);
    }
};

struct Inv : public thrust::unary_function<double,double>
{
    __host__ __device__ double operator()(const double& x)
    {
        return double(1.0) / x;
    }
};

You can then calculate the normalised output as

Accessor columns(m);
thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(int(0)), columns),
        thrust::make_transform_iterator(thrust::make_counting_iterator(int(m*n)), columns),
        thrust::make_transform_iterator(A.begin(), Exp()),
        thrust::make_discard_iterator(),
        sum.begin());

thrust::for_each(sum.begin(), sum.end(), Inv());

cublasDdgmm(hd, CUBLAS_SIDE_RIGHT, m, n, pA, m, pSum, 1, pA, m);

[disclaimer: all code written in browser and is untested, use at own risk]

Apart from reducing the number of kernel calls, using fancy iterators eliminates the need for the large unit matrix which should reduce memory footprint and total number of memory transactions to do the summation and exponentiation operations.

The iterators are really _fancy_. I compared the the cublas and thrust approaches. Although `thrust::reduce_by_key` may require lower memory bandwidth, it's still slower compared to the `cublasDgemv`. Any ideas? — kangshiyin, Jan 09 '13 at 08:50
I suspect that the relative performance will depend quite a lot on what GPU and type you use. On a different GPU using 32 bit types, you might find a reduction approach is closer in performance that the pure CUBLAS implementation. The thrust developers have acknowledged that the state of the art reduction has moved on a bit since they did the current implementation in thrust, but in general the tree like reduction pattern will always be less efficient that something optimal expressed as a stream of FMADs, as in this case. — talonmies, Jan 09 '13 at 09:13
I would also suggest looking at trying `thrust::transform` instead of `thrust_for_each`. In some cases (admittedly some time ago), I found it a bit faster than `for_each`. But it probably won't change the performance much. — talonmies, Jan 09 '13 at 09:15

Pavan Yalamanchili · Answer 3 · 2013-01-11T02:35:29.553

2

You could use ArrayFire in the following manner

array A = randu(2000, 2000);
A = exp(A);
A /= tile(sum(A, 0), A.dims(0), 1);

You could do this in thrust as well. But if you are going to work with matrices (as opposed to plain vectors), you'd have to do it in a for loop which would not be as efficient.

DISCLAIMER I am a developer at Accelereyes, working on arrayfire.

EDIT I am working on generating new benchmarks as requested.

EDIT We found performance bugs for exp in our code because of this benchmark. We are reviewing and fixing it.

edited Jan 11 '13 at 02:35

answered Jan 08 '13 at 23:24

Pavan Yalamanchili

12,021
2
35
55

Thanks! It's impressive that the code can be as simple as Matlab. Could you also compare the performance of your code with mine? As I don't have the ArrayFire lib on the hand. – kangshiyin Jan 09 '13 at 07:03
@EricShiyinKang Updated with results. – Pavan Yalamanchili Jan 09 '13 at 18:17
I think there's an issue in your benchmark code, which lead to pool timing result for cublas/thrust approach. Here is the modified [bench.cu](https://gist.github.com/786a7ea0597be5ab10d3) – kangshiyin Jan 09 '13 at 21:55
1

@EricShiyinKang Any reason you are generating random numbers outside as well as inside the loop ? Also I realized I was not using device synchronize before the timer::stop, causing it to skew results for both thrust and arrayfire. I am working on revising the code again. – Pavan Yalamanchili Jan 09 '13 at 23:05
First call to curandGenerateUniformDouble() after curandCreateGenerator() requires extra time as mentioned in [Performance Notes](http://docs.nvidia.com/cuda/curand/index.html#topic_1_2_6) of CURAND ref manual. – kangshiyin Jan 09 '13 at 23:39
m~[500,5000]. n~[100,1000] – kangshiyin Jan 10 '13 at 05:39
@EricShiyinKang it will take a couple of days for us to fix our bug. If you are interested contact me through the email address in my profile. I updated the answer removing the reference to the benchmarks. I will put new benchmarks here once the fix is publicly available later this week. – Pavan Yalamanchili Jan 11 '13 at 02:36
Thanks for your time. Work comes first. – kangshiyin Jan 11 '13 at 03:35

How to normalize matrix columns in CUDA with max performance?

3 Answers3

Linked