After reading about Cooperative Groups in CUDA 9, I've been trying to synchronize at the grid level.
I'm using Visual Studio 2017, a GTX 1060 and CUDA 9.1.
I altered my code as follows:
__global__ void ExplicitKernel_American(/* ... */) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    auto grid = cooperative_groups::this_grid();
    if (i < sizeS) {
        //...
        for (int j = 1; j < sizeT; ++j) {
            // ...
            grid.sync(); // __syncthreads();
        }
    }
}
As stated in the documentation, I launch the kernel this way:
void* Explicit_Args[] = { &PDE_Grid, /* ... */, &sizeS, &sizeT };
cudaLaunchCooperativeKernel(
    (void*)ExplicitKernel_American, 
    dim3((sizeS + TPB - 1) / TPB), 
    dim3(TPB),  
    Explicit_Args
); // TPB being 256...
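For reference, a minimal self-contained sketch of this launch pattern. StepKernel, its arguments and the sizes are hypothetical stand-ins, not the real pricing kernel. The sketch assumes the device reports cooperative launch support and that the whole grid can be resident at once, which a cooperative launch requires; it checks both before launching and reports the error returned by cudaLaunchCooperativeKernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical stand-in kernel. Every thread reaches grid.sync() on every
// time step; only the per-element work is guarded by the bounds check.
__global__ void StepKernel(float* data, int n, int steps) {
    cg::grid_group grid = cg::this_grid();
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    for (int j = 1; j < steps; ++j) {
        if (i < n) {
            data[i] += 1.0f;  // placeholder for the real update
        }
        grid.sync();  // grid-wide barrier between time steps
    }
}

int main() {
    const int TPB = 256;       // threads per block, as in the question
    int n = 1024, steps = 4;   // hypothetical sizes
    int device = 0, supported = 0, numSMs = 0, blocksPerSM = 0;

    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
    if (!supported) {
        printf("Cooperative launch is not supported on this device\n");
        return 1;
    }

    // All blocks of a cooperative launch must be co-resident on the GPU.
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, StepKernel, TPB, 0);
    int blocks = (n + TPB - 1) / TPB;
    if (blocks > blocksPerSM * numSMs) {
        printf("Grid of %d blocks exceeds the co-resident limit of %d\n",
               blocks, blocksPerSM * numSMs);
        return 1;
    }

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    void* args[] = { &d, &n, &steps };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)StepKernel, dim3(blocks), dim3(TPB), args, 0, 0);
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();  // surfaces errors raised during execution
    }
    if (err != cudaSuccess) {
        printf("Cooperative launch failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Unlike the kernel above, the sketch keeps grid.sync() outside the bounds check, because the barrier has to be reached by every thread in the grid.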
Unfortunately, I get linker errors as soon as I add the "grid" part to the kernel:
error LNK2001: unresolved external symbol __fatbinwrap_38_cuda_device_runtime_compute_70_cpp1_ii_8b1a5d37
fatal error LNK1120: 1 unresolved externals
I've set -rdc=true and sm_61, but I cannot figure out why it is not working. Any ideas?
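For comparison outside Visual Studio, those two settings correspond roughly to the nvcc command line below. This is only a sketch: the file name kernel.cu and the output name app are hypothetical, and it assumes a single-file build where nvcc performs the device link step itself.

```shell
# Relocatable device code plus a Pascal (sm_61) target, compiled and linked in one step.
# kernel.cu and app are placeholder names.
nvcc -rdc=true -arch=sm_61 kernel.cu -o app
```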
Many thanks!