After fixing the code I posted here (adding *sizeof(float) to the shared-memory allocation, though that doesn't matter here, since I allocate shared memory through MATLAB), I ran the code, and it successfully returned results of size up to sizeof(float)*18*18*5000*100 bytes.
I took the PTX and used it to run the code through MATLAB (it found the right entry point, i.e. the function I wanted to run):
    kernel=parallel.gpu.CUDAKernel('Tst.ptx','float *,const float *,int');
    mask=gpuArray.randn([7,7,1],'single');
    toConv=gpuArray.randn([12,12,5],'single'); %%generate random data for testing
    setConstantMemory(kernel,'masks',mask);  %%transfer data to constant memory.
    kernel.ThreadBlockSize=[(12+2*7)-2 (12+2*7)-2 1];
    kernel.GridSize=[1 5 1]; %%first element is how many convolution masks
    %%second one is how many matrices we want to convolve
    kernel.SharedMemorySize=(24*24*4);
    foo=gpuArray.zeros([18 18 5 1],'single'); %%result size
    foo=reshape(foo,[numel(foo) 1]);
    toConv=reshape(toConv,[numel(toConv) 1]);
    foo=feval(kernel,foo,toConv,12);
I get:
    Error using parallel.gpu.CUDAKernel/feval
    An unexpected error occurred trying to launch a kernel. The CUDA error was:
    CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

    Error in tst (line 12)
    foo=feval(kernel,foo,toConv,12);
Out of resources for such a small example? It worked for a problem a hundred thousand times larger in Visual Studio...
I have a GTX 480 (compute capability 2.0, about 1.5 GB of memory, a maximum of 1024 threads per block, 48 KB of shared memory). The ptxas output from the Visual Studio build:
    1>  ptxas : info : 0 bytes gmem, 25088 bytes cmem[2]
    1>  ptxas : info : Compiling entry function '_Z6myConvPfPKfi' for 'sm_21'
    1>  ptxas : info : Function properties for _Z6myConvPfPKfi
    1>      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    1>  ptxas : info : Used 10 registers, 44 bytes cmem[0]
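As a sanity check (my own arithmetic, not output from the toolbox), the launch configuration above should fit comfortably within the GTX 480's limits, which is why the error is so puzzling:

```python
# Sanity check of the launch configuration against GTX 480 (compute 2.0) limits.
# All inputs come from the MATLAB snippet and the ptxas log above.

block_dim = (12 + 2 * 7) - 2               # 24, as set in kernel.ThreadBlockSize
threads_per_block = block_dim * block_dim  # 24 * 24 = 576
shared_bytes = 24 * 24 * 4                 # kernel.SharedMemorySize
registers_per_block = threads_per_block * 10  # ptxas reports 10 registers/thread
out_dim = 12 + 7 - 1                       # 18: full-convolution output size,
                                           # matching the [18 18 5 1] result array

# Compute capability 2.0 limits (per block, and registers per multiprocessor)
MAX_THREADS_PER_BLOCK = 1024
MAX_SHARED_BYTES = 48 * 1024
MAX_REGISTERS_PER_SM = 32768

assert threads_per_block <= MAX_THREADS_PER_BLOCK
assert shared_bytes <= MAX_SHARED_BYTES
assert registers_per_block <= MAX_REGISTERS_PER_SM
print(threads_per_block, shared_bytes, registers_per_block, out_dim)
```

Threads (576), shared memory (2304 bytes), and registers per block (5760) are all well under the hardware limits, so the failure is not caused by the kernel's resource demands themselves.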
EDIT: The problem was resolved by compiling with Configuration set to Release and Platform set to x64.
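For reference, a hypothetical nvcc command line mirroring those Visual Studio settings (assuming the kernel source file is named Tst.cu; this is my guess at the equivalent flags, not taken from the project):

```shell
# -m64 is the flag that matters: 64-bit MATLAB needs 64-bit PTX, since a
# 32-bit PTX declares 4-byte pointer parameters and the mismatched kernel
# argument sizes can surface as CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES.
# -arch=sm_21 matches the target shown in the ptxas log above.
nvcc -ptx -m64 -arch=sm_21 -O2 Tst.cu -o Tst.ptx
```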