
I submitted some jobs as usual with sbatch and later canceled some of them with scancel. However, they are stuck in the CG (completing) state and I cannot remove them from my queue.

Is there any way to get rid of those CG jobs? Sadly, I'm not the administrator of the cluster, nor do I have the root password.
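
For reference, a quick way to list only the jobs stuck in the completing state (this assumes a recent Slurm that supports squeue --me; older versions can use -u $USER instead):

$ squeue --me --states=CG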

3 Answers


I have seen the same issue; here is how I resolved it:

  • requeue the job, then release it, then cancel it with scancel:
[test@test02-scheduler ~]$ scontrol release 9
Job has already finished for job 9
slurm_suspend error: Job has already finished
[test@test02-scheduler ~]$ scontrol requeue 9
[test@test02-scheduler ~]$ scontrol release 9
[test@test02-scheduler ~]$
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:17:57 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             9       hpc sleep.sh      test COMPLETI       0:00      5:00      1 test02-hpc-pg0-[1-3,5,9]
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:24 test02-hpc-pg0-1
[test@test02-scheduler ~]$ scancel 9
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:30 test02-hpc-pg0-1
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:32 test02-hpc-pg0-1
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:18:12 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             9       hpc sleep.sh      test COMPLETI       0:21      5:00      1 test02-hpc-pg0-[1-3,5,9]
[test@test02-scheduler ~]$
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:21:04 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
[test@test02-scheduler ~]$

Killing the slurmstepd process on the first node that your job occupies should work. That process should run under your user, so in principle killing it shouldn't require special privileges.

Be careful not to kill the slurmstepd of another of your jobs that may be running on the same node. You can probably tell them apart by their start times.
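
As a rough sketch of what that looks like in practice, using the node name and job ID from the transcript in the other answer (substitute your own stuck job and node):

$ ssh test02-hpc-pg0-1
# list your slurmstepd processes with their start time and full command line,
# so you can pick the one that belongs to the stuck job
$ ps -u "$USER" -o pid,lstart,cmd | grep '[s]lurmstepd'
# then send SIGTERM to the PID you identified; fall back to SIGKILL only if it ignores that
$ kill <pid>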


For me it worked by running:

$ scontrol requeue <job_id>
$ scontrol release <job_id>
$ scancel <job_id>

Check it with:

$ squeue --me
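
If several of your jobs are stuck, the same sequence can be applied to every job of yours in the CG state in one go (a sketch; the squeue options used here are standard, but verify them against your cluster's Slurm version):

$ for j in $(squeue --me --states=CG --noheader --format=%i); do scontrol requeue "$j"; scontrol release "$j"; scancel "$j"; done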