After fixing the code I posted here (adding *sizeof(float) to the shared-memory allocation, though that doesn't matter here, since I allocate shared memory through MATLAB), I ran the code, and it successfully returned results of size up to sizeof(float)*18*18*5000*100 bytes.
I took the PTX and used it to run the code through MATLAB (it found the right entry point, i.e. the function I wanted to run):
    kernel=parallel.gpu.CUDAKernel('Tst.ptx','float *,const float *,int');
    mask=gpuArray.randn([7,7,1],'single');
    toConv=gpuArray.randn([12,12,5],'single'); %%generate random data for testing
    setConstantMemory(kernel,'masks',mask);  %%transfer data to constant memory.
    kernel.ThreadBlockSize=[(12+2*7)-2 (12+2*7)-2 1];
    kernel.GridSize=[1 5 1]; %%first element is how many convolution masks
    %%second one is how many matrices we want to convolve
    kernel.SharedMemorySize=(24*24*4);
    foo=gpuArray.zeros([18 18 5 1],'single'); %%result size
    foo=reshape(foo,[numel(foo) 1]);
    toConv=reshape(toConv,[numel(toConv) 1]);
    foo=feval(kernel,foo,toConv,12);
I get:
    Error using parallel.gpu.CUDAKernel/feval
    An unexpected error occurred trying to launch a kernel. The CUDA error was:
    CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

    Error in tst (line 12)
    foo=feval(kernel,foo,toConv,12);
Out of resources for such a small example? It worked for a problem a hundred thousand times larger in Visual Studio...
I have a GTX 480 (compute capability 2.0, about 1.5 GB of memory, a maximum of 1024 threads per block, 48 KB of shared memory). The ptxas output from the Visual Studio build:
    1>  ptxas : info : 0 bytes gmem, 25088 bytes cmem[2]
    1>  ptxas : info : Compiling entry function '_Z6myConvPfPKfi' for 'sm_21'
    1>  ptxas : info : Function properties for _Z6myConvPfPKfi
    1>      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    1>  ptxas : info : Used 10 registers, 44 bytes cmem[0]
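As a sanity check (my own arithmetic, not output from the toolbox), the launch configuration above should fit comfortably within the GTX 480's limits, which is why the error is so puzzling:

```python
# Sanity check of the launch configuration against GTX 480 (compute 2.0) limits.
# All inputs come from the MATLAB snippet and the ptxas log above.

block_dim = (12 + 2 * 7) - 2               # 24, as set in kernel.ThreadBlockSize
threads_per_block = block_dim * block_dim  # 24 * 24 = 576
shared_bytes = 24 * 24 * 4                 # kernel.SharedMemorySize
registers_per_block = threads_per_block * 10  # ptxas reports 10 registers/thread
out_dim = 12 + 7 - 1                       # 18: full-convolution output size,
                                           # matching the [18 18 5 1] result array

# Compute capability 2.0 limits (per block, and registers per multiprocessor)
MAX_THREADS_PER_BLOCK = 1024
MAX_SHARED_BYTES = 48 * 1024
MAX_REGISTERS_PER_SM = 32768

assert threads_per_block <= MAX_THREADS_PER_BLOCK
assert shared_bytes <= MAX_SHARED_BYTES
assert registers_per_block <= MAX_REGISTERS_PER_SM
print(threads_per_block, shared_bytes, registers_per_block, out_dim)
```

Threads (576), shared memory (2304 bytes), and registers per block (5760) are all well under the hardware limits, so the failure is not caused by the kernel's resource demands themselves.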
EDIT: The problem was resolved by compiling with Configuration set to Release and Platform set to x64.
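For reference, a hypothetical nvcc command line mirroring those Visual Studio settings (assuming the kernel source file is named Tst.cu; this is my guess at the equivalent flags, not taken from the project):

```shell
# -m64 is the flag that matters: 64-bit MATLAB needs 64-bit PTX, since a
# 32-bit PTX declares 4-byte pointer parameters and the mismatched kernel
# argument sizes can surface as CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES.
# -arch=sm_21 matches the target shown in the ptxas log above.
nvcc -ptx -m64 -arch=sm_21 -O2 Tst.cu -o Tst.ptx
```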