I have a series of M single-channel images, each of size NxN, stored contiguously in a single device memory array. (N is not a power of two.) The array therefore has length MxNxN. I need to find the sum of all pixels in each image, so the output is M values, one per image.
Currently I generate an additional array that holds the image index of every pixel and use it as the key array for reduce_by_key, so that each image is one segment. This reduce_by_key is slow: it takes more time than everything else I do on these pixels.
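
For reference, the approach described above might look roughly like the following minimal sketch. It assumes the reduce_by_key in question is Thrust's, that pixels are float, and that MxNxN fits in an int; the names PixelToImage and sum_per_image are hypothetical and only illustrate the idea, they are not taken from the actual code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>

// Maps a flat pixel index to the index of the image that contains it.
struct PixelToImage {
    int pixels_per_image;  // N * N
    __host__ __device__ int operator()(int i) const {
        return i / pixels_per_image;
    }
};

// Hypothetical helper: returns M sums for M contiguous N x N images.
// Assumes float pixels; the real pixel type may differ.
thrust::device_vector<float> sum_per_image(
    const thrust::device_vector<float>& pixels, int M, int N)
{
    const int pixels_per_image = N * N;

    // The additional key array: keys[i] is the image index of pixel i.
    thrust::device_vector<int> keys(pixels.size());
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(M * pixels_per_image),
                      keys.begin(),
                      PixelToImage{pixels_per_image});

    // One segment per image; the output keys are not needed, so discard them.
    thrust::device_vector<float> sums(M);
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          pixels.begin(),
                          thrust::make_discard_iterator(),
                          sums.begin());
    return sums;
}
```

In this sketch the key array is as long as the pixel array itself, so it is written once and read once in addition to the pixel data.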
Is there a faster way to do this segmented reduction sum, where the segments are all the same size?