How to do segmented reduction sum of segments of equal size?

Question

I have a series of M single-channel images, each of size NxN, stored contiguously in a device memory array. (N is not a power of two.) So the array has length MxNxN. I need to find the sum of all pixels for each of these images, so the output is M values, one per image.

I am generating an additional array that holds the image index of every pixel and using it as the key array for reduce_by_key, so that each image is one segment. This reduce_by_key is quite slow; it takes more time than everything else I'm doing on these pixels.

Is there a faster way to do this segmented reduction sum, where the segments are all the same size?

score 1 · Accepted Answer · edited May 23 '17 at 10:28

OpenCV provides a matrix reduction API implemented with CUDA. You can find it here.

http://docs.opencv.org/modules/gpu/doc/matrix_reductions.html#gpu-reduce

If you don't want to include extra third-party libraries, you could use cuBLAS. In this case, your task can be expressed in MATLAB code as follows.

result(1:M) = sum(images(1:N*N, 1:M), 1);

which is equivalent to

result(1:M) = ones(1, N*N) * images(1:N*N, 1:M);

This is a matrix-vector multiplication, which can be done efficiently with the BLAS level-2 function cublas<t>gemv() provided by cuBLAS.

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemv
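To make the matrix-vector form above concrete, here is a minimal CPU reference in plain C++ of what the gemv call computes. It is only a sketch under stated assumptions: the images are treated as an (N*N) x M column-major matrix whose column m is image m (which matches the contiguous layout in the question), and the function name sum_images_gemv and the device pointer names in the comment are hypothetical, not from the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for result = A^T * ones, where A is the (N*N) x M
// column-major matrix whose column m holds image m.
//
// On the device, the same product would correspond roughly to
//   cublasSgemv(handle, CUBLAS_OP_T, N*N, M, &alpha,
//               d_images, N*N, d_ones, 1, &beta, d_result, 1);
// with alpha = 1 and beta = 0. The names d_images, d_ones and d_result
// are hypothetical device pointers used only for illustration.
std::vector<float> sum_images_gemv(const std::vector<float>& images,
                                   std::size_t M, std::size_t N) {
    const std::size_t pixels = N * N;  // rows of A, also its leading dimension
    assert(images.size() == M * pixels);

    const std::vector<float> ones(pixels, 1.0f);  // the vector x in gemv
    std::vector<float> result(M, 0.0f);           // the vector y in gemv

    for (std::size_t m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (std::size_t p = 0; p < pixels; ++p) {
            // Element (p, m) of A lives at offset m * pixels + p.
            acc += ones[p] * images[m * pixels + p];
        }
        result[m] = acc;
    }
    return result;
}
```

Calling sum_images_gemv(images, M, N) on a host copy of the data returns one sum per image, which can serve as a check for the result of the cuBLAS call.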

On the other hand, using reduce_by_key() for your task does not require generating an additional array of image indices. Fancy iterators in Thrust are designed for exactly this situation: the keys are computed on the fly from the element index, which reduces the global memory bandwidth requirement.

Please refer to this answer for more details.

Reduce matrix rows with CUDA
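The idea of computing keys on the fly can be sketched on the CPU in plain C++. This is a sequential model, not Thrust code: in Thrust the key sequence would typically be built from thrust::counting_iterator wrapped in thrust::make_transform_iterator with a functor like the one below and then passed to thrust::reduce_by_key. The names IndexToSegment and reduce_by_computed_key are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Maps a linear pixel index to the index of the image that contains it.
// With equal-sized segments this is a single integer division, so no key
// array has to be stored or read from memory.
struct IndexToSegment {
    std::size_t segment_size;  // N*N pixels per image
    std::size_t operator()(std::size_t i) const { return i / segment_size; }
};

// Sequential model of reduce_by_key in which the key of element i is
// computed by key_of(i) instead of being loaded from a key array.
// Consecutive elements with equal keys are summed into one output pair.
template <typename KeyFn>
std::vector<std::pair<std::size_t, float>>
reduce_by_computed_key(const std::vector<float>& values, KeyFn key_of) {
    std::vector<std::pair<std::size_t, float>> out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::size_t key = key_of(i);
        if (out.empty() || out.back().first != key) {
            out.emplace_back(key, values[i]);  // a new segment starts here
        } else {
            out.back().second += values[i];    // same segment, accumulate
        }
    }
    return out;
}
```

For M images of NxN pixels, reduce_by_computed_key(values, IndexToSegment{N * N}) yields M (image index, sum) pairs. The only input read from memory is the pixel data itself, which is the bandwidth saving the answer refers to.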

The fancy iterator approach Eric describes is demonstrated [here](https://github.com/thrust/thrust/blob/master/examples/sum_rows.cu). — Jared Hoberock, Sep 30 '13 at 20:15
