If your operation takes a long time (say, 30 seconds or more), you may benefit from splitting the work into as many pieces as you want Python processes and using Python's multiprocessing module. If the operation is faster than that, the overhead of starting new processes will likely outweigh any benefit from using them.
Since the operation being carried out does not depend on the values already stored in v, each process can write to an independent vector and you can aggregate the results at the end. Pass each process a vector v_prime of zeros with the same length as v. Each process then handles a portion of the output_diffs in results, incrementing the corresponding values in v_prime instead of v, and returns its v_prime when done. Finally, sum all of the returned v_primes together with the original v to get the correct result. This is where having the items expressed as NumPy arrays helps, since adding NumPy vectors of the same length is trivial.
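A minimal sketch of that pattern, assuming results is a list of (index, output_diff) pairs (the exact shape of your results isn't shown, so adjust the inner loop accordingly):

```python
import numpy as np
from multiprocessing import Pool

def process_chunk(args):
    """Build an independent accumulator vector for one chunk of results."""
    chunk, n = args
    v_prime = np.zeros(n)               # per-process vector of zeros
    for index, output_diff in chunk:    # assumed (index, diff) pairs
        v_prime[index] += output_diff
    return v_prime

def parallel_update(v, results, n_procs=4):
    n = len(v)
    # Split results into one chunk per process (round-robin).
    chunks = [results[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as pool:
        v_primes = pool.map(process_chunk, [(c, n) for c in chunks])
    # Aggregate: original v plus every process's partial vector.
    return v + np.sum(v_primes, axis=0)
```

Note that on platforms where multiprocessing spawns rather than forks (Windows, recent macOS), the call to parallel_update must live under an `if __name__ == "__main__":` guard.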