Find max of matrix in CUDA

Question

I just started with CUDA and have a question. I have an N*N matrix and a window of size 8x8. I want to subdivide the matrix into multiple sub-matrices and find the max value of each one. For example, if I have a 64*64 matrix, I will have 8 small matrices of size 8*8 and find 8 max values. Finally, I save all the max values into a new array, but their order always changes. I want to find a solution that keeps them in the right order.

__global__ void calculate_emax_kernel(float emap[],float emax[], int img_height, int img_width,int windows_size)
{
    int x_index = blockIdx.x*blockDim.x+threadIdx.x;
    int y_index = blockIdx.y*blockDim.y+threadIdx.y;

    int num_row_block = img_height/windows_size;
    int num_col_block = img_width/windows_size;
    __shared__ float window_elements[256];
    __shared__ int counter;
    __shared__ int emax_count;

    if (threadIdx.x == 0) emax_count = 0;
    __syncthreads();
    int index;
    int emax_idx = 0;


    if(y_index >= img_height|| x_index >= img_width) return;
    for(int i = 0; i < num_row_block; i++)
    {
        for(int j = 0; j < num_col_block; j++)
        {
            counter = 0;
            if(y_index >= i*windows_size && y_index < (i+1)*windows_size
                    && x_index >= j*windows_size && x_index < (j+1)*windows_size)
            {
                int idx = y_index*img_height + x_index;
                index = atomicAdd(&counter, 1);

                window_elements[index] = emap[idx];
                __syncthreads();


                // reduction
                unsigned int k = (windows_size*windows_size)/2;
                while(k != 0)
                {
                    if(index < k)
                    {
                        window_elements[index] = fmaxf(window_elements[index], window_elements[index+k]);

                    }
                    k /= 2;
                }
                if(index == 0)
                {
                    emax[i*num_row_block+j] = window_elements[index];
                }
            }
            __syncthreads();
        }
        __syncthreads();
    }
    __syncthreads();
}

This is my configuration

void construct_emax(float *input,float *output, int img_height, int img_width)
{
    int windows_size = 4;
    float * d_input, * d_output;
    cudaMalloc(&d_input, img_width*img_height*sizeof(float));
    cudaMalloc(&d_output, img_width*img_height*sizeof(float));

    cudaMemcpy(d_input, input, img_width*img_height*sizeof(float), cudaMemcpyHostToDevice);
    dim3 blocksize(16,16);
    dim3 gridsize;

    gridsize.x=(img_width+blocksize.x-1)/blocksize.x;
    gridsize.y=(img_height+blocksize.y-1)/blocksize.y;

    calculate_emax_kernel<<<gridsize,blocksize>>>(d_input,d_output,img_height,img_width,windows_size);

}

Do you mean "I will have 8x8 small matrices of size 8*8 and find 8x8 max values"? - kangshiyin, Jul 20 '16 at 06:45
@kangshiyin Sorry, it is difficult to explain. I mean that I will split the input matrix into smaller matrices, and how many depends on the window size. For example, if I have a 16*16 matrix and an 8*8 window, I will have 4 small matrices and find the max value of each one. - Nguyễn Cường, Jul 20 '16 at 07:34
What are your possible window sizes? 1, 2, 3, 4, 5, ... or only 2, 4, 8, 16, ...? - kangshiyin, Jul 20 '16 at 12:04

score 3 · Accepted Answer · edited May 23 '17 at 12:32

With CUDA, parallel reduction is tricky; segmented parallel reduction is trickier. Now you are doing it in 2-D, and your segment/window is smaller than the thread block.

For a large window size, I don't think this is a problem. You could use one thread block to reduce one window. For example, if you have a 16x16 window, you could simply use a 16x16 thread block. If you have an even larger window, for example 64x64, you could still use a 16x16 thread block: first reduce the 64x64 window to 16x16 elements while loading the data, then reduce those to one scalar within the thread block.
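
The two-stage idea for a large window can be tried on the CPU first. The sketch below is plain host-side C++ (no GPU needed) that emulates a 16x16 block reducing one 64x64 window: each emulated thread folds its strided share of the window into one partial maximum, then the partials are combined by a tree reduction. The function name `window_max_two_stage`, its parameters, and the row-major image layout are assumptions made for illustration; they are not part of the original answer.

```cpp
#include <algorithm>
#include <cfloat>
#include <vector>

// Host-side emulation of one thread block reducing one win x win window.
// `block` is the (hypothetical) block edge, e.g. 16 for a 16x16 block.
// Assumes block * block is a power of two and the window lies inside img.
float window_max_two_stage(const std::vector<float>& img, int width,
                           int row0, int col0, int win, int block) {
  std::vector<float> partial(block * block, -FLT_MAX);

  // Stage 1: "thread" (tx, ty) strides over the window in steps of `block`,
  // so a 64x64 window collapses to 16x16 partial maxima while loading.
  for (int ty = 0; ty < block; ++ty)
    for (int tx = 0; tx < block; ++tx)
      for (int y = ty; y < win; y += block)
        for (int x = tx; x < win; x += block)
          partial[ty * block + tx] =
              std::max(partial[ty * block + tx],
                       img[(row0 + y) * width + (col0 + x)]);

  // Stage 2: tree reduction over the partials, halving the active range.
  // On the GPU, each pass would be followed by __syncthreads().
  for (int k = block * block / 2; k > 0; k /= 2)
    for (int t = 0; t < k; ++t)
      partial[t] = std::max(partial[t], partial[t + k]);

  return partial[0];
}
```

Calling `window_max_two_stage(img, width, 0, 0, 64, 16)` would reduce the top-left 64x64 window. In a real kernel the inner loops of stage 1 are what each thread executes, and the outer two loops are replaced by `threadIdx.x` and `threadIdx.y`.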

For a window size smaller than the block size, you have to reduce multiple windows per thread block to get good performance. You could keep your current block/grid configuration, where each 256-thread block (16x16) is responsible for sixteen 4x4 windows. But this is not optimal, because each 32-thread warp covers two rows of 16 threads (2x16). That is bad for coalesced global memory access, and it is hard to map a 2x16 warp onto one or more 4x4 windows for an efficient parallel reduction.
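
To see the 2x16 warp shape concretely, recall that CUDA numbers the threads of a block with x varying fastest and groups every 32 consecutive threads into a warp. The host-side C++ sketch below counts how many distinct windows a given warp touches; the helper names are hypothetical and the warp size of 32 is assumed.

```cpp
#include <set>
#include <utility>

// Linear thread id within a 2-D block: x varies fastest, as in CUDA.
int linear_tid(int tx, int ty, int block_w) { return ty * block_w + tx; }

// Number of distinct win x win windows touched by one warp of a
// block_w x block_h thread block (warp size assumed to be 32).
int windows_touched_by_warp(int warp, int block_w, int block_h, int win) {
  std::set<std::pair<int, int>> windows;
  for (int ty = 0; ty < block_h; ++ty)
    for (int tx = 0; tx < block_w; ++tx)
      if (linear_tid(tx, ty, block_w) / 32 == warp)
        windows.insert({ty / win, tx / win});  // (window row, window column)
  return static_cast<int>(windows.size());
}
```

For a 16x16 block and 4x4 windows, `windows_touched_by_warp(0, 16, 16, 4)` gives 4: warp 0 spans thread rows 0 and 1, so it holds only the top half of each of those four windows, and the other half lives in a different warp.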

Alternatively, I would suggest a 1-D thread block with 256 threads, where every m threads reduce one mxm window. You can then use a 2-D grid to cover the whole image.

const int m = window_size;
dim3 blocksize(256);
dim3 gridsize((img_width+255)/256, (img_height+m-1)/m);
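
With this layout, the position of each window's result can be computed from the block and thread indices alone, which is what keeps the output in a fixed order regardless of which thread finishes first. The host-side C++ sketch below mirrors the grid arithmetic above and the row-major output indexing used in this answer's kernel; the `Grid` struct and both function names are hypothetical helpers for illustration.

```cpp
// Grid dimensions for the 1-D block layout: 256 columns per block in x,
// one strip of m rows per block in y.
struct Grid {
  int x;
  int y;
};

Grid grid_for(int img_width, int img_height, int m) {
  return Grid{(img_width + 255) / 256, (img_height + m - 1) / m};
}

// Row-major slot in the output for the window that contains image column
// block_x * 256 + tid in strip block_y. It depends only on indices,
// never on execution order.
int output_slot(int block_x, int block_y, int tid, int m, int img_width) {
  const int out_width = (img_width + m - 1) / m;  // windows per image row
  const int col = block_x * 256 + tid;            // image column of thread
  return block_y * out_width + col / m;
}
```

For a 1024x1024 image with m = 4, `grid_for` yields a 4 x 256 grid, and each block handles 256 / 4 = 64 adjacent windows of one strip.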

In the kernel function, you could:

1. reduce each mxm window to a 1xm vector while loading the data from global memory;
2. use a tree reduction to reduce the 1xm vector to a scalar.

The following code is a conceptual demo that works when m is a power of 2 and m <= 32. You could extend it to arbitrary m and add more thorough boundary checking.

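#include <assert.h>
#include <cfloat>
#include <cuda.h>
#include <thrust/device_vector.h>

// Each 256-thread block handles a strip of win_size rows by 256 columns,
// i.e. 256 / win_size adjacent windows. win_size must be a power of 2, <= 32.
__global__ void calculate_emax_kernel(const float* input, float* output,
                                      int height, int width, int win_size,
                                      int out_width) {
  const int tid = threadIdx.x;
  const int i = blockIdx.y * win_size;   // first image row of this strip
  const int j = blockIdx.x * 256 + tid;  // image column owned by this thread
  const int win_id = j % win_size;       // column position inside the window

  __shared__ float smax[256];

  // Step 1: reduce this thread's column of win_size elements while loading.
  float tmax = -FLT_MAX;
  if (j < width) {
    for (int tile = 0; tile < win_size; tile++) {
      if (i + tile < height) {
        tmax = fmaxf(tmax, input[(i + tile) * width + j]);
      }
    }
  }
  smax[tid] = tmax;
  __syncthreads();  // make all column maxima visible before reducing

  // Step 2: tree reduction across the win_size columns of each window.
  for (int shift = win_size / 2; shift > 0; shift /= 2) {
    if (win_id < shift) {
      smax[tid] = fmaxf(smax[tid], smax[tid + shift]);
    }
    __syncthreads();  // reached by every thread; there is no early return
  }
  // The first thread of each window writes its result in row-major order.
  if (win_id == 0 && j < width) {
    output[blockIdx.y * out_width + (j / win_size)] = smax[tid];
  }
}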

int main() {
  const int height = 1024;
  const int width = 1024;
  const int m = 4;
  thrust::device_vector<float> in(height * width);
  thrust::device_vector<float> out(
      ((height + m - 1) / m) * ((width + m - 1) / m));

  dim3 blocksize(256);
  dim3 gridsize((width + 255) / 256, (height + m - 1) / m);

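  assert(m == 2 || m == 4 || m == 8 || m == 16 || m == 32);
  calculate_emax_kernel<<<gridsize, blocksize>>>(
      thrust::raw_pointer_cast(in.data()),
      thrust::raw_pointer_cast(out.data()),
      height, width, m, (width + m - 1) / m);

  // Kernel launches are asynchronous; wait here and check for errors.
  cudaError_t err = cudaDeviceSynchronize();
  assert(err == cudaSuccess);

  return 0;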
}
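
Since the original problem was results landing in the wrong order, it may help to compare the kernel output against a straightforward host reference. The sketch below is plain C++ and assumes the same row-major output layout as the kernel above, with ceil(width / m) windows per row; the function name `window_max_reference` is hypothetical.

```cpp
#include <algorithm>
#include <cfloat>
#include <vector>

// Reference window maxima on the host, written in row-major window order.
// Partial windows at the right and bottom edges are reduced over the
// pixels that exist.
std::vector<float> window_max_reference(const std::vector<float>& img,
                                        int height, int width, int m) {
  const int out_h = (height + m - 1) / m;
  const int out_w = (width + m - 1) / m;
  std::vector<float> out(out_h * out_w, -FLT_MAX);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      // Pixel (y, x) belongs to window (y / m, x / m).
      float& slot = out[(y / m) * out_w + (x / m)];
      slot = std::max(slot, img[y * width + x]);
    }
  }
  return out;
}
```

One way to use it would be to copy the device result back into a `thrust::host_vector<float>` and compare it element by element with the vector returned here.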

score 0 · Answer 2 · answered May 12 '19 at 10:19

If you are willing to use a library, here are a few pointers:

Use NPP, a set of primitives from NVIDIA: https://docs.nvidia.com/cuda/npp/group__image__filter__max.html
Use CUB, a lower-level library from NVIDIA / NVlabs, for other reduce operations and finer control over how you use the hardware: http://nvlabs.github.io/cub/
