
I submitted some jobs as usual with sbatch and later canceled some of them with scancel. However, they are stuck in the CG (completing) state and I cannot remove them from my queue.

Is there any way to get rid of those CG jobs? Sadly, I'm not the administrator of the cluster, nor do I have the root password.
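
For reference, a quick way to list only the jobs stuck in the completing state (this assumes a recent Slurm that supports squeue --me; older versions can use -u $USER instead):

$ squeue --me --states=CG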

3 Answers


I have seen the same issue; here is how I resolved it:

  • requeue the job, then release it, then cancel it with scancel:
[test@test02-scheduler ~]$ scontrol release 9
Job has already finished for job 9
slurm_suspend error: Job has already finished
[test@test02-scheduler ~]$ scontrol requeue 9
[test@test02-scheduler ~]$ scontrol release 9
[test@test02-scheduler ~]$
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:17:57 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             9       hpc sleep.sh      test COMPLETI       0:00      5:00      1 test02-hpc-pg0-[1-3,5,9]
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:24 test02-hpc-pg0-1
[test@test02-scheduler ~]$ scancel 9
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:30 test02-hpc-pg0-1
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:32 test02-hpc-pg0-1
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:18:12 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             9       hpc sleep.sh      test COMPLETI       0:21      5:00      1 test02-hpc-pg0-[1-3,5,9]
[test@test02-scheduler ~]$
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:21:04 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
[test@test02-scheduler ~]$

Killing the slurmstepd process on the first node that your job occupies should work. That process should run under your user, so in principle killing it shouldn't require special privileges.

Be careful not to kill the slurmstepd of another of your jobs that may be running on the same node. You can probably tell them apart by their start times.
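
As a rough sketch of what that looks like in practice, using the node name and job ID from the transcript in the other answer (substitute your own stuck job and node):

$ ssh test02-hpc-pg0-1
# list your slurmstepd processes with their start time and full command line,
# so you can pick the one that belongs to the stuck job
$ ps -u "$USER" -o pid,lstart,cmd | grep '[s]lurmstepd'
# then send SIGTERM to the PID you identified; fall back to SIGKILL only if it ignores that
$ kill <pid>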


For me it worked by running:

$ scontrol requeue <job_id>
$ scontrol release <job_id>
$ scancel <job_id>

Check it with:

$ squeue --me
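
If several of your jobs are stuck, the same sequence can be applied to every job of yours in the CG state in one go (a sketch; the squeue options used here are standard, but verify them against your cluster's Slurm version):

$ for j in $(squeue --me --states=CG --noheader --format=%i); do scontrol requeue "$j"; scontrol release "$j"; scancel "$j"; done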