CUDA/C++ noob here.
The error I receive when I try to debug my CUDA project is:
First-chance exception at 0x000000013F889467 in simple6.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000223000).
The program '[2668] simple6.exe' has exited with code 0 (0x0).
From research on the web, it seems that some of my variables are too large for the "stack" and need to be moved to the "heap".
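A quick size check supports that diagnosis. 0xC00000FD is the Windows stack-overflow status code, and the sketch below assumes the main thread has MSVC's default 1 MB stack (kStackBytes is an assumed value; the real size is set by the /STACK linker option). Each array is counted with the element type of its device counterpart (float, double, float); the int arrays in the code below are the same size or smaller.

```cpp
#include <cstddef>

// Array sizes copied from the question's main().
constexpr std::size_t reps = 1024;
constexpr std::size_t iterations = 18000;
constexpr std::size_t tmCount = 2592;

// Assumption: MSVC's default 1 MB stack reserve for the main thread.
constexpr std::size_t kStackBytes = 1024 * 1024;

// Bytes each local array needs when declared inside main().
constexpr std::size_t hF_bytes  = reps * iterations * sizeof(float);  // 73,728,000 with 4-byte float
constexpr std::size_t hS_bytes  = reps * sizeof(double);              // 8,192 with 8-byte double
constexpr std::size_t hTM_bytes = tmCount * sizeof(float);            // 10,368 with 4-byte float

// h_F alone is about 70 times larger than the whole stack;
// h_S and h_TM together use under 2% of it.
static_assert(hF_bytes > kStackBytes, "h_F cannot fit on a 1 MB stack");
static_assert(hS_bytes + hTM_bytes < kStackBytes, "h_S and h_TM are small");
```

So h_S and h_TM could stay where they are; h_F is the array that has to move.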
Can someone please show me the code modifications I need?
My code is below. The kernel uses the data from h_S and h_TM (copied to d_S and d_TM) to compute a set of values and writes them into d_F, which is copied back into h_F at the very end. This is why h_F is never copied to the GPU.
int main()
{
int blockSize= 1024; 
int gridSize = 1; 
const int reps = 1024; 
const int iterations = 18000; 
float h_F[reps * iterations] = {0};
double h_S[reps] = {0}; // not actually zeros in my code; this just simplifies things
float h_TM[2592] = {0}; // not actually zeros in my code; this just simplifies things
// Device buffers (d_S and d_TM are inputs, d_F receives the results)
float *d_F;
double *d_S;
float *d_TM;
//Select GPU
cudaSetDevice(0);
// Allocate memory for each vector on GPU
cudaMalloc((void**)&d_F, iterations * reps * sizeof(float));
cudaMalloc((void**)&d_S, reps * sizeof(double));
cudaMalloc((void**)&d_TM, 2592 * sizeof(float));
// Copy host vectors to device
cudaMemcpy( d_S, h_S, reps * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( d_TM, h_TM, 2592 * sizeof(float), cudaMemcpyHostToDevice);
// Execute the kernel
myKern<<<gridSize, blockSize>>>(d_TM, d_F, d_S, reps);
cudaDeviceSynchronize(); 
// Copy array back to host
cudaMemcpy( h_F, d_F, iterations * reps * sizeof(float), cudaMemcpyDeviceToHost );
// Release device memory
cudaFree(d_F);
cudaFree(d_TM);
cudaFree(d_S);
cudaDeviceReset();
return 0;
}
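A minimal sketch of the heap-based fix, written in plain C++ so it compiles without a GPU. It assumes the host arrays should use the same element types as the device buffers (float for d_F and d_TM, double for d_S); with int arrays, the `reps * sizeof(double)` copy from h_S would read past the end of h_S. HostBuffers and checkSizes are hypothetical names for illustration, not part of CUDA or any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes from the question's main().
constexpr std::size_t reps = 1024;
constexpr std::size_t iterations = 18000;
constexpr std::size_t tmCount = 2592;

// Hypothetical helper: std::vector stores its elements on the heap, so only
// the small vector objects themselves sit on main()'s stack.
// Element types match the device buffers d_F (float), d_S (double), d_TM (float).
struct HostBuffers {
    std::vector<float>  F;   // receives the results copied back from d_F
    std::vector<double> S;   // copied to d_S before the kernel launch
    std::vector<float>  TM;  // copied to d_TM before the kernel launch

    HostBuffers() : F(reps * iterations, 0.0f), S(reps, 0.0), TM(tmCount, 0.0f) {}

    std::size_t bytesF()  const { return F.size()  * sizeof(float); }
    std::size_t bytesS()  const { return S.size()  * sizeof(double); }
    std::size_t bytesTM() const { return TM.size() * sizeof(float); }
};

// Checks that each buffer holds exactly the byte count the question's
// cudaMemcpy calls transfer, so no copy runs past the end of a host array.
inline void checkSizes(const HostBuffers& b) {
    assert(b.bytesF()  == iterations * reps * sizeof(float));
    assert(b.bytesS()  == reps * sizeof(double));
    assert(b.bytesTM() == tmCount * sizeof(float));
}
```

In main(), the three local arrays would then be replaced by `HostBuffers buf;` (or by three std::vector locals directly), and each cudaMemcpy would take `.data()` as its host pointer, for example `cudaMemcpy(d_S, buf.S.data(), buf.bytesS(), cudaMemcpyHostToDevice);` and `cudaMemcpy(buf.F.data(), d_F, buf.bytesF(), cudaMemcpyDeviceToHost);`. std::vector frees the memory automatically when main() returns; `new[]`/`delete[]`, `malloc`/`free`, or pinned memory from `cudaMallocHost`/`cudaFreeHost` would also move the data off the stack.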
A related question: would making these huge arrays `__shared__` variables solve my problem?
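A rough size check suggests it would not. `__shared__` variables are declared inside a kernel, live in the on-chip memory of a single thread block, exist only while that block runs, and cannot be filled directly from the host; the overflow above happens in main() on the CPU, before the kernel is launched. The sketch assumes the 48 KB per-block limit for statically declared shared memory (kStaticSharedBytes is that assumed value); dynamic shared memory with an opt-in allows more on newer GPUs, but still at most a few hundred KB per block.

```cpp
#include <cstddef>

// Sizes from the question's main().
constexpr std::size_t reps = 1024;
constexpr std::size_t iterations = 18000;
constexpr std::size_t tmCount = 2592;

// Assumption: 48 KB per block for statically declared __shared__ arrays.
constexpr std::size_t kStaticSharedBytes = 48 * 1024;

// Sizes using the device-side element types from the question.
constexpr std::size_t dF_bytes  = reps * iterations * sizeof(float);  // output
constexpr std::size_t dS_bytes  = reps * sizeof(double);              // input
constexpr std::size_t dTM_bytes = tmCount * sizeof(float);            // input

// The two inputs together would fit in one block's shared memory...
static_assert(dS_bytes + dTM_bytes <= kStaticSharedBytes, "inputs fit in shared memory");
// ...but the output array is 1,500 times too large (with 4-byte float).
static_assert(dF_bytes > kStaticSharedBytes, "output does not fit in shared memory");
```

Staging d_S and d_TM in shared memory inside the kernel might speed up repeated reads, but d_F has to stay in global memory, and the host-side h_F still has to move off the stack.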
Many thanks.