Dask keeps failing with killed worker exception while running TPOT

Question

Im running tpot with dask running on kubernetes cluster on gcp, the cluster is 24 cores 120 gb memory with 4 nodes of kubernetes, my kubernetes yaml is

apiVersion: v1
kind: Service
metadata:
name: daskd-scheduler
labels:
app: daskd
role: scheduler
spec:
ports:
- port: 8786
  targetPort: 8786
  name: scheduler
- port: 8787
  targetPort: 8787
  name: bokeh
- port: 9786
  targetPort: 9786
  name: http
- port: 8888
  targetPort: 8888
  name: jupyter
selector:
  app: daskd
  role: scheduler

 type: LoadBalancer
 --- 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
 name: daskd-scheduler
spec:
 replicas: 1
 template:
  metadata:
  labels:
    app: daskd
    role: scheduler
spec:
  containers:
  - name: scheduler
    image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
    imagePullPolicy: Always
    command: ["/opt/conda/bin/dask-scheduler"]
    resources:
      requests:
        cpu: 1
        memory: 20000Mi # set aside some extra resources for the scheduler
    ports:
     - containerPort: 8786
     ---
     apiVersion: extensions/v1beta1
     kind: Deployment
     metadata:
       name: daskd-worker
     spec:
       replicas: 3
       template:
      metadata:
        labels:
        app: daskd
        role: worker
    spec:
  containers:
  - name: worker
    image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
    imagePullPolicy: Always
    command: [
      "/bin/bash",
      "-cx",
      "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 8 --nprocs 1 --memory-limit 5e9",
    ]
    resources:
      requests:
        cpu: 2
        memory: 20000Mi

My data is 4 million rows and 77 columns, whenever i run fit on the tpot classifier, it runs on the dask cluster for a while then it crashes, the output log looks like

KilledWorker:
("('gradientboostingclassifier-fit-1c9d29ce92072868462946c12335e5dd',
0, 4)", 'tcp://10.8.1.14:35499')

I tried increasing threads per worker as suggested by the dask distributed docs, yet the problem persists. Some observations i have made are:

It'll take longer time to crash if n_jobs is less (for n_jobs=4, it ran for 20 mins before crashing) where as crashes instantly for n_jobs=-1.
It'll actually start working and get optimized model for fewer data, with 10000 data it works fine.

So my question is, what changes and modifications do i need to make this work, I guess its doable as ive heard dask is capable of handling even bigger data than mine.

I often find that a `KilledWorker` error is the result of running our of RAM on the workers. If you are able I would watch the memory usage of your workers as you run the calculation. — Jacob Tomlinson, Jan 29 '19 at 09:28

score 3 · Answer 1 · answered Jan 29 '19 at 12:36

Best practices described on Dask`s official documentation page say:

Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. Otherwise your workers may get killed by Kubernetes as they pack into the same node and overwhelm that nodes’ available memory, leading to KilledWorker errors.

In your case these configuration parameters` values don`t match from what I can see:

Kubernetes` Container limit 20 GB vs. dask-worker command limit 5 GB

Dask keeps failing with killed worker exception while running TPOT

1 Answers1

Linked