I'm trying to batch the rows of a CSV file read with Dask. Can this task be done with Dask?
import json
import dask.dataframe as dd

batch_size = 1000  # rows per batch
batch = []
count = 0

def batch_row_csv(row):
    global batch
    global count
    batch.append(row.to_dict())
    if len(batch) < batch_size:
        return
    with open(f"batch_{count}.json", "w") as f:  # placeholder output path
        json.dump(batch, f)  # save batch
    count = count + 1
    batch = []
    return

df = dd.read_csv(path, header=0)
df["output"] = df.apply(lambda x: batch_row_csv(x), axis=1, meta=("output", "object"))
df.compute()
Is there a problem with using global variables together with multiprocessing? Dask's best practices advise against using global variables ... What would be the alternative?
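For what it's worth, I was wondering whether something like the sketch below would avoid the globals entirely: treat each partition as one batch and write it out with a delayed task. The write_batch helper, the blocksize value, and the output filename pattern are placeholders of mine (and blocksize is in bytes, not rows), so I don't know if this is the idiomatic Dask way:

import dask
import dask.dataframe as dd

def write_batch(part, index):
    # hypothetical helper: write one partition (= one batch) to its own JSON file
    part.to_json(f"batch_{index}.json", orient="records")  # placeholder output path
    return len(part)  # rows written, just so compute() returns something small

df = dd.read_csv(path, header=0, blocksize="1MB")  # blocksize (bytes) roughly controls batch size

# one delayed task per partition, no shared/global state between tasks
tasks = [dask.delayed(write_batch)(part, i) for i, part in enumerate(df.to_delayed())]
dask.compute(*tasks)

The idea is that each task only touches its own partition, so nothing is shared between workers, but I'm not sure this is the right approach or whether it scales the way my apply-with-globals version was meant to.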
