a way to improve my use of joblib ?
joblib can and will help, but only for code that can benefit from being distributed across some pool of resources, and only if the costs of doing so stay lower than the effective speedup achieved.
Tweaking joblib's pre-load and batch-size parameters starts to make sense only after the to-be-distributed code has been performance-polished.
Some efforts on this, as demonstrated below, have shown a core speedup of ~ 8x, achieving still pure-[SERIAL] run-times of ~ 217,000 [us] per one Random Walk ( instead of the ~ 1,640,000 [us] per item reported above ).
Only after that does the harder work on Cluster-resources related optimisation ( the performance-devastation avoidance efforts ) begin to pay off, for the ab-initio postulated intention to organise a distributed work-flow of the above defined 400 repetitions thereof.
That makes sense if and only if:
- the CPU-starvation can be avoided, and

- no excessive add-on costs of the distributed job-scheduling have to be paid.
 
Perhaps a long, but important story about where the performance gets saved or lost:
Executive summary
There could hardly be a better reward for Dr. Gene AMDAHL's argument.
The inner structure of the above defined task is heavily [SERIAL]:
- the random-number genesis is principally a pure-[SERIAL] process, due to the PRNG design
- the 1E6-step iteration atop a 1D-vector of pre-calculated drunken-sailor steps is pure-[SERIAL]
Yes, the "outer"-scope-of-work ( the 400 repetitions of the same process ) can be easily converted to a "just"-[CONCURRENT] ( not a true-[PARALLEL], even if professors and wannabe gurus try to tell you ) process scheduling, but the add-on costs of doing so get worse than linearly added to the run-time, and given the [SERIAL] part was not performance re-engineered, the net-effect of such efforts could easily devastate the original good intentions ( Q.E.D. above, as the posted run-times grew up, from 10:52, for K == 1, towards ~ 13 minutes for even small number of K-s ).
A brief test has proved that the whole task could, after using standard python tools, be run in a pure-[SERIAL] fashion in < 1.45 [s] ( instead of the reported ~ 12 - 13 minutes ), even on a rather stone-age archaic desktop device ( some in-cache computing effects were possible, yet more as an accidental side-effect than as an intentional, HPC-motivated code-refactoring for an HPC-cluster specific performance ):
u@amd64FX:~$ lstopo --of ascii
+-----------------------------------------------------------------+
| Machine (7969MB)                                                |
|                                                                 |
| +------------------------------------------------------------+  |
| | Package P#0                                                |  |
| |                                                            |  |
| | +--------------------------------------------------------+ |  |
| | | L3 (8192KB)                                            | |  |
| | +--------------------------------------------------------+ |  |
| |                                                            |  |
| | +--------------------------+  +--------------------------+ |  |
| | | L2 (2048KB)              |  | L2 (2048KB)              | |  |
| | +--------------------------+  +--------------------------+ |  |
| |                                                            |  |
| | +--------------------------+  +--------------------------+ |  |
| | | L1i (64KB)               |  | L1i (64KB)               | |  |
| | +--------------------------+  +--------------------------+ |  |
| |                                                            |  |
| | +------------++------------+  +------------++------------+ |  |
| | | L1d (16KB) || L1d (16KB) |  | L1d (16KB) || L1d (16KB) | |  |
| | +------------++------------+  +------------++------------+ |  |
| |                                                            |  |
| | +------------++------------+  +------------++------------+ |  |
| | | Core P#0   || Core P#1   |  | Core P#2   || Core P#3   | |  |
| | |            ||            |  |            ||            | |  |
| | | +--------+ || +--------+ |  | +--------+ || +--------+ | |  |
| | | | PU P#0 | || | PU P#1 | |  | | PU P#2 | || | PU P#3 | | |  |
| | | +--------+ || +--------+ |  | +--------+ || +--------+ | |  |
| | +------------++------------+  +------------++------------+ |  |
| +------------------------------------------------------------+  |
|                                                                 |
+-----------------------------------------------------------------+
+-----------------------------------------------------------------+
| Host: amd64FX                                                   |
| Date: Fri 15 Jun 2018 07:08:44 AM                               |
+-----------------------------------------------------------------+
< 1.45 [s]?
Why ? How ? That is the whole story below ... ( due HPC efforts could drive it further, well under 1 [s] )
Dr. Gene AMDAHL's argument, even in its original, add-on-overheads agnostic form stated in his well cited report, showed that any composition of [SERIAL] and [PARALLEL] blocks of work will have a principally limited benefit from an increasing amount of processing units harnessed for the [PARALLEL] part ( a.k.a. the Law of Diminishing Returns, towards an asymptotically limited speedup even for an infinite number of processors ), whereas any improvement introduced for the [SERIAL] part will continue to additively increase the speedup ( in a pure linear fashion ). Let me skip here the adverse effects ( which also affect the speedup, some in a similarly pure linear fashion, yet in an adverse sense - the add-on overheads ), as these will get discussed below.
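For a quick illustration of why that matters, a minimal sketch of the classical, overhead-naive formulation follows ( the names p and N are illustrative only - p for the parallelisable fraction of the work, N for the number of processing units ):

def amdahl_speedup( p, N ):
    # classical, overhead-naive Amdahl's Law:  S( N ) = 1 / ( ( 1 - p ) + p / N )
    return 1. / ( ( 1. - p ) + p / N )

for N in ( 1, 2, 4, 8, 16, 1E9 ):               # 1E9 ~ "infinitely many" processing units
    print( N, amdahl_speedup( 0.95, N ) )       # even a 95% parallelisable work caps out at 20 x

Even with 95 % of the work being parallelisable, the speedup can never exceed 20 x, no matter how many SLURM-managed nodes get thrown at it.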
Step No. 1:
Repair the code, so as to make something useful.
Given the code as-is above, there is no random-walk at all.
Why ?
>>> [ op( np.random.binomial( 1, 1 /3,  1E9 ) ) for op in ( sum, min, max, len ) ]
[0, 0, 0, 1000000000]
So, the code as-is produces a rather expensive list of a-priori known constants. No randomness at all. Damned python rounding of the integer division. :o)
>>> [ op( np.random.binomial( 1, 1./3., 1E9 ) ) for op in ( sum, min, max, len ) ]
[333338430, 0, 1, 1000000000]
So this is fixed.
Step No. 2:
Understand the overheads ( and best also any hidden performance-blockers )
Any attempt to instantiate a distributed process ( for each one of the instructed K joblib-spawned processes, calling multiprocessing with the subprocess backend, not the threading one ) makes you pay a cost. Always ...
Given that, your code execution will get additional [SERIAL] add-on code, which has to run before any ... still just theoretical ... ( 1 / n_jobs ) split-effect may start to take place.
A closer look at the "useful" work:
def f( j ):                                         # T0
    #pass;   np.random.seed( None )                 #  +      ~ 250 [us]
    prnGEN = np.random.RandomState()                #  +      ~ 230 [us]
    # = 2 * np.random.binomial( 1, 1./3., 1E6 ) - 1 #  +  ~ 465,000 [us]
    X = prnGEN.binomial( 1, 1./3., int( 1E6 ) )     #  +  ~ 393,000 [us]
    X*= 2                                           #  +    ~ 2,940 [us]
    X-= 1                                           #  +    ~ 2,940 [us]
    s = 0; m = 0                                    #  +        ~ 3 [us]
    for t in range( 0, int( 1E6 ) ):                #    ( py3+ does not allocate the range(), it works as an xrange()-generator )
        s = max( 0, s + X[t] )                      #  +       ~ 15 [us] cache-line friendly consecutive { hit | miss }-rulez here, heavily ...
        m = max( m, s )                             #  +       ~  5 [us]
    return  m                                       # = ~ 2,150,000 [us] @  i5/2.67 GHz
#                                                   # = ~ 1,002,250 [us] @ amd/3.6  GHz
For this sort of workpackage, the best demonstration-purpose speedups will be demonstrable from non-interpreted, GIL-free, threading-backend, multiprocessing.Pool-spawned, Cython-ised code-packages, cdef-ed with a nogil directive. One may expect such a code-execution at about ~ 217,000 [us] per one pure-[SERIAL] Random Walk with 1E6 steps, which is where it starts to make sense to harness the pool of code-execution nodes with some pre-load tweaking, so as not to let them starve. Yet, all premature-optimisation warnings are due and valid in this simplified context, and proper engineering practices are to be used for achieving a professional-grade result.
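Just to illustrate the headroom hidden inside the [SERIAL] part ( not claiming this is the exact refactoring behind the < 1.45 [s] figure above ), the per-step python for-loop can be replaced by a vectorised numpy formulation of the very same recursion s = max( 0, s + X[t] ), using the identity s[t] = W[t] - min( W[0], ..., W[t] ), with W[] being the cumulative sum of the steps and W[0] = 0:

import numpy as np

def walk_max_vectorised( X ):
    # s[t] = max( 0, s[t-1] + X[t] )  ==  W[t] - min( W[0], ..., W[t] ),  with W[0] = 0
    W = np.concatenate( ( [ 0 ], np.cumsum( X ) ) )
    return int( ( W - np.minimum.accumulate( W ) ).max() )

rng = np.random.RandomState()                        # an independent, process-private PRNG instance
X   = 2 * rng.binomial( 1, 1./3., int( 1E6 ) ) - 1   # +1 with p = 1/3, -1 otherwise
m   = walk_max_vectorised( X )

The cumsum + minimum.accumulate pair stays inside the optimised numpy C-loops, so the 1E6-iterations python-level for-loop disappears completely; the Cython-ised, nogil cdef-loop remains the next step, if even the temporary arrays were to become a concern.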
Some tools may help you see, down to the assembly level, how many instructions were actually added by any respective high-level-language syntax-constructor element ( or by the concurrency / parallelisation #pragma masquerades ), so as to "smell" the add-on processing costs one will pay during the finalised code-execution.
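In a pure-python context, a first, rough impression of the same idea is available right from the standard-library dis module ( a bytecode-level view only, not the machine-assembly level, yet the principle of counting the per-step added instructions is the same ):

import dis

def per_step( s, x ):
    # the inner-loop body of the Random Walk, taken as a standalone function
    return max( 0, s + x )

dis.dis( per_step )        # lists every bytecode instruction, each paid ~ 1E6-times per one Walk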

Given these add-on processing costs, a "small"-(thin)-amount of work "inside" the "just"-concurrently executed part ( beware, not automatically a true [PARALLEL]-scheduling ) may make you pay way more than you would ever receive back from the split.
The blockers:
Any add-on communication / synchronisation may further devastate the theoretical flow-of-code-execution Speedup. Locks, the GIL ( avoided only if not using the threading-backend ), semaphores, socket-communications, sharing et al are the common blockers.
For a carefully designed source-of-randomness, any draw from such a "device" must also get centrally re-synchronised, so as to maintain the quality of such randomness. This may cause additional trouble behind the curtain ( a common issue on systems with some authority-certified sources of randomness ).
The next step?
Read more details on Amdahl's Law, best in its contemporary re-formulated version, where both the setup + termination overheads are added in the Overhead-strict mode and where an atomicity-of-processing is also taken into account for a practical evaluation of the realistic speedup limitations.
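A minimal sketch of such an overhead-strict form ( the names pSO and pTO are illustrative only, standing for the setup- and termination-overhead add-on costs, expressed as fractions of the original pure-[SERIAL] run-time; the atomicity-of-processing term is left out for brevity ):

def overhead_strict_speedup( p, N, pSO, pTO ):
    # Amdahl-like speedup, extended with the add-on setup ( pSO ) + termination ( pTO ) overheads
    return 1. / ( ( 1. - p ) + pSO + p / N + pTO )

print( overhead_strict_speedup( p = 0.95, N = 20, pSO = 0.10, pTO = 0.05 ) )   # ~ 4 x, instead of the overhead-naive ~ 10 x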
Next: measure the net-duration costs of your code, and you will indirectly get the add-on costs of the setup + termination overheads as they appear on your in-vivo system execution.
import time
import numpy as np

def f( j ):
    ts = time.time()
    #------------------------------------------------------<clock>-ed SECTION
    N = 10**6
    p = 1./3.
    np.random.seed( None )                    # RandomState coordination ...
    X = 2 * np.random.binomial( 1, p, N ) - 1 # X = +1 with probability p
    s = 0                                     # X = -1 with probability 1-p
    m = 0
    for t in range( 0, N ):
        s = max( 0, s + X[t] )
        m = max( m, s )
    #------------------------------------------------------<clock>-ed SECTION
    return ( m, time.time() - ts )            # tuple
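A minimal usage sketch of the instrumented f() above ( the K and n_jobs values are illustrative only ), making the elementary cost/benefit proportion directly visible - the sum of the returned per-walk net durations against the overall wall-clock time, which also contains all the process-instantiation, workpackage-distribution and process-termination add-on costs:

import time
from joblib import Parallel, delayed

K  = 400                                        # illustrative repetition count
t0 = time.time()
W  = Parallel( n_jobs = 4 )( delayed( f )( j ) for j in range( K ) )
t1 = time.time()

net   = sum( w[1] for w in W )                  # sum of the "useful" per-walk computing times
gross = t1 - t0                                 # wall-clock, incl. all spawn / distribute / terminate add-on costs
print( "net   ~ %8.2f [s]" % net )
print( "gross ~ %8.2f [s] ( incl. all add-on overheads )" % gross )
print( "effectively used workers ~ %.2f out of n_jobs == 4" % ( net / gross ) )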
For classroom tutorials, I've successfully parallelized my random walk code using special modules within R, Matlab, Julia & Stata. (By "successfully", I mean it is abundantly clear that 20 workers do at least 10 times as much work as 1 worker in the same time interval.) Is such internal parallelization not feasible in Python ?
Well, the last comment seems to display some sort of inconvenience towards the people who tried to help and who have brought in the reasons why the posted code, as-is, works the way it was observed to work. It was not our choice to define the processing strategy this exact way, was it?
So, again. Given the original decision was to use python-3.6 + joblib.Parallel() + joblib.delayed() instrumentation, simply Alea Iacta Est ...
What might have worked ( as cited ) for { R | MATLAB | Julia | Stata } simply does not mean that it will work the same way in a GIL-stepped eco-system, the less in a joblib.Parallel()-spawned one.
The first cost one will ALWAYS pay for a joblib.Parallel()-spawned job is the cost of re-constructing a whole, 1:1 copy of the current python-interpreter state. Given that the current state contains way more object-instances than a stripped-down MCVE-code ( as was demonstrated for a python2.x MCVE-script by @rth ), the whole, many-times-replicated memory-image first has to get copied + transported + re-constructed onto all the distributed-processing nodes across the SLURM-managed cluster-footprint, all of which adds ( non-productive ) overhead time. If in doubt, add a few GB-sized numpy-arrays into the state of the python interpreter, put the measured timestamps for the respective duration calculations into the first and the last array cells and finally return ( m, aFatArray ). The overall execution times will jump up, as both the initial 1:1-copy and the returning path will have to move a much larger amount of data there and back ( again, details on the instantiation-related add-on costs have been posted in many places here, including templates for systematic benchmarking of the respective add-on costs ).
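A minimal sketch of that very experiment, assuming the instrumented f() above and an illustrative payload size ( tune SIZE_GB to whatever RAM is actually available; this is intended only to make the copy + transport + re-construct overhead visible, not as a recommended practice ):

import time
import numpy as np

SIZE_GB   = 1                                                       # illustrative payload size
aFatArray = np.zeros( SIZE_GB * ( 1 << 27 ), dtype = np.float64 )   # ~ 1 GB of float64-s

def f_fat( j ):
    # the same walk as f() above, but dragging a fat payload there and back
    aFatArray[ 0] = time.time()              # timestamp on arrival inside the "remote" process
    m, net        = f( j )                   # the instrumented walk from above
    aFatArray[-1] = time.time()              # timestamp right before the fat return-path starts
    return ( m, aFatArray )                  # returning ~ 1 GB now pays the SER/DES + transport costs

Comparing the wall-clock duration of a Parallel()( delayed( f_fat )( j ) ... ) run against the plain f() run makes the copy + transport + re-construct share of the lumpsum directly visible.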
Exactly this was the reason for advising the kind O/P to indeed measure the effective amount of computing time ( the "benefit" part of the elementary cost/benefit argument ), which is cheap to get in the trivialised experiment above. It will show the scale, the sum and the actual proportion of the useful work inside the "remotely"-executed efficient-computing payload(s): ref. the code modification proposed above, which returns the values so that W[:][1] will tell the actual "net" computing costs of the "useful work" spent during the efficient-workpackage-computing time, once the workpackages have finally arrived and been activated inside the respective "remote" code-execution eco-system(s) ( here, in the form of the joblib.Parallel()-spawned binary full-scale replicas of the original python interpreter ). The flow of time between the start and the end of the main-code execution, on the contrary, shows the actual expenses - the lumpsum amount of time spent, i.e. including all the "remote"-process instantiation(s) + all the respective workpackage(s) distribution(s) + all the "remote"-process termination(s).
A final remark for those who have not yet spent time reading about the source-of-Randomness related problems:
Any good practice shall rather avoid the blocking logic hidden behind a "shared randomness". Better use individually configured PRNG-sources. If interested, or in need of a certifiable PRNG-robustness, feel free to read more on this in a discussion here.
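A minimal sketch of such individually configured PRNG-sources, using numpy's SeedSequence spawning ( the concrete seeding policy is an assumption here, to be replaced by whatever the certification requirements may dictate ):

import numpy as np
from joblib import Parallel, delayed

def one_walk( child_seed, N = 10**6, p = 1./3. ):
    rng = np.random.default_rng( child_seed )    # each job owns its private, independent PRNG stream
    X   = 2 * rng.binomial( 1, p, N ) - 1
    s = m = 0
    for x in X:                                  # kept as the plain loop, for clarity only
        s = max( 0, s + x )
        m = max( m, s )
    return m

K       = 400                                                 # illustrative repetition count
seeds   = np.random.SeedSequence( 12345 ).spawn( K )          # K statistically independent child streams
results = Parallel( n_jobs = 4 )( delayed( one_walk )( s ) for s in seeds )

Each worker thus draws from its own, deterministic yet mutually independent stream, so no central re-synchronisation of a "shared randomness" source is ever needed.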