I recently found that the bottleneck of my code is the following block. N is of the order of 10,000 and L of the order of (10,000)^2. RQ_func is a function that takes two indices (tuples) and returns a float V and a dictionary sp_dist in {index: probability} format.
Is there a way to parallelize this code? I have access to a computing cluster on which I can use up to 20 cores at a time, and I would like to take advantage of that.
    import numpy as np
    import scipy.sparse

    R = np.empty((L,))                    # one float per (state, action) pair
    Q = scipy.sparse.lil_matrix((L, N))   # one sparse row of probabilities per (state, action) pair

    # Populate R and Q by traversing every (state, action) combination
    traverser = 0
    for s_index in state_indices:
        for a_index in action_indices:
            V, sp_dist = RQ_func(s_index, a_index)
            R[traverser] = V
            for sp_index, prob in sp_dist.items():
                Q[traverser, sp_index] = prob
            traverser += 1
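
For what it's worth, the only direction I have come up with so far is the multiprocessing.Pool sketch below. It assumes the 20 cores sit on a single node, and that RQ_func, state_indices, action_indices, L, and N are defined at module level and picklable. It parallelizes the RQ_func calls and then assembles R and Q serially in the parent process, so I don't know whether this is the right approach:

    import itertools
    import multiprocessing as mp
    import numpy as np
    import scipy.sparse

    def _evaluate(pair):
        # unpack one (state, action) pair and evaluate it in a worker process
        s_index, a_index = pair
        return RQ_func(s_index, a_index)

    # same iteration order as the nested loops above, so len(pairs) == L
    pairs = list(itertools.product(state_indices, action_indices))

    with mp.Pool(processes=20) as pool:
        # run RQ_func on all pairs in parallel; pool.map preserves order
        results = pool.map(_evaluate, pairs, 1000)   # chunksize of 1000 is a guess

    # assemble R and Q serially from the gathered results
    R = np.empty((L,))
    Q = scipy.sparse.lil_matrix((L, N))
    for traverser, (V, sp_dist) in enumerate(results):
        R[traverser] = V
        for sp_index, prob in sp_dist.items():
            Q[traverser, sp_index] = prob

I am not sure the serial assembly at the end won't dominate, or whether multiprocessing is even the right tool here given the cluster setup.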
