As far as I'm aware, cv::cuda::PtrStep is used to pass GpuMat data directly to a custom kernel. I found examples of single-channel access here; however, my case is a 2-channel Mat (CV_32FC2). I'm trying to compute the squared absolute value of complex numbers, where the real part is stored in the 1st channel and the imaginary part in the 2nd channel of the given Mat.
I tried:
__global__ void testKernel(const cv::cuda::PtrStepSz<cv::Vec2f> input, cv::cuda::PtrStepf output)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y; // row index
    if (x < input.cols && y < input.rows)
    {
        // PtrStep's operator() takes (row, column), i.e. (y, x)
        float val_re = input(y, x)[0];
        float val_im = input(y, x)[1];
        output(y, x) = val_re * val_re + val_im * val_im;
    }
}
but this results in the following error:
calling a __host__ function("cv::Vec<float, (int)2> ::operator []") from a __global__ function("gpuholo::testKernel") is not allowed
I get it: operator[] is a __host__-only function because the type is cv::Vec2f, not cv::cuda::Vec2f (which apparently does not exist). Still, I would really like to access the data.
Is there another mechanism to access 2-channel data on the device side, similar to Vec2f?
I thought of a workaround: splitting the input into two CV_32FC1 Mats, so that the kernel signature would look like:
__global__ void testKernel(const cv::cuda::PtrStepSzf re, const cv::cuda::PtrStepSzf im, cv::cuda::PtrStepf output)
but I'm wondering whether there's a "cleaner", Vec2f-like solution.
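For illustration, one Vec2f-like candidate might be CUDA's built-in float2, a plain struct of two floats whose layout should match one interleaved CV_32FC2 pixel. The sketch below is untested against OpenCV and rests on two assumptions: that a CV_32FC2 GpuMat can be passed as cv::cuda::PtrStepSz<float2>, and that .x / .y then correspond to the real / imaginary channels. The absSquared helper and the HOST_DEVICE macro are hypothetical names introduced here so that the arithmetic also builds with a plain C++ compiler.

```cpp
// Sketch only: using float2 as the PtrStepSz element type is an assumption,
// based on float2 being a plain two-float struct like one CV_32FC2 pixel.
#if defined(__CUDACC__)
#include <opencv2/core/cuda_types.hpp>  // cv::cuda::PtrStepSz, cv::cuda::PtrStepf
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE  // lets the helper build with a plain C++ compiler
#endif

// Hypothetical helper: |re + i*im|^2, callable from host and device code.
HOST_DEVICE inline float absSquared(float re, float im)
{
    return re * re + im * im;
}

#if defined(__CUDACC__)
__global__ void testKernel(const cv::cuda::PtrStepSz<float2> input,
                           cv::cuda::PtrStepf output)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    const int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < input.cols && y < input.rows)
    {
        const float2 v = input(y, x);         // PtrStep indexes as (row, column)
        output(y, x) = absSquared(v.x, v.y);  // assumed: .x = real, .y = imaginary
    }
}
#endif
```

If this works, no split into two CV_32FC1 Mats would be needed, since each thread reads both channels of its pixel in a single access; only plain member access (.x, .y) is used on the device, so no __host__-only operator is involved.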
 
     
    