A possible approach (free of busy waiting and similar ugliness) is to track the number of jobs on the client side, cap their total count at 500 and, whenever one of them finishes, immediately start a new one in its place. (This does, however, assume that the client script outlives the jobs.) Concrete steps:
Make the qsub tool block and (passively) wait for the completion of its remote job. Depending on the particular qsub implementation, a -sync flag may suffice, or something more involved may be needed; see the sketch after these steps.
Keep exactly 500 such waiting qsub instances around, no more and, as long as work remains, no fewer. This can be automated using this answer or this answer with MAX_PARALLELISM set to 500 there; qsub itself would be started from the do_something_and_maybe_fail() function.
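For instance, with a Grid Engine qsub, which accepts -sync y, the function from the linked answers could look roughly like this (a sketch only, not tested against a real cluster; any queue or resource options would go alongside the positional arguments):

do_something_and_maybe_fail() {
    # Block until the remote job completes; qsub then exits
    # with the job's exit status, which the caller sees via $?.
    qsub -sync y "$@"
}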
Here’s a copy&paste of the Bash outline from the answers linked above, just to make this answer more self-contained, starting with a trivial, runnable harness / dummy example (with a sleep in place of a qsub -sync). Note that it relies on wait -p, which requires Bash 5.1 or newer:
#!/bin/bash
set -euo pipefail

declare -ir MAX_PARALLELISM=500 # pick a limit
declare -i pid
declare -a pids=()

do_something_and_maybe_fail() {
    ### qsub -sync "$@" ... ### # add the real thing here
    sleep $((RANDOM % 10))      # remove this :-)
    return $((RANDOM % 2 * 5))  # remove this :-)
}

for pname in some_program_{a..j}{00..59}; do # 600 items
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n \
            && echo "${pids[pid]} succeeded" 1>&2 \
            || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi
    do_something_and_maybe_fail & # forking here
    pids[$!]="${pname}"
    echo "${#pids[@]} running" 1>&2
done

# Drain: wait for the jobs still running after the loop finishes.
for pid in "${!pids[@]}"; do
    wait -n "$((pid))" \
        && echo "${pids[pid]} succeeded" 1>&2 \
        || echo "${pids[pid]} failed with ${?}" 1>&2
done
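If the limit should be tunable without editing the script, the readonly declaration can take its default from the environment instead (a small variation, not part of the linked answers):

declare -ir MAX_PARALLELISM="${MAX_PARALLELISM:-500}" # default 500, overridable from the environment

Something like MAX_PARALLELISM=250 ./script.sh (the script name is just an example) would then run with a lower cap.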
The first loop needs to be adjusted for the specific use case. An example follows, assuming that the right do_something_and_maybe_fail() implementation is in place and that one_command_per_line.txt lists the arguments for qsub, one invocation per line, with an arbitrary number of lines. (The script could also accept the file name as an argument or read the commands from standard input, whichever works best; see the variation after the loop.) The rest of the script would look exactly like the boilerplate above, keeping the number of parallel qsubs at MAX_PARALLELISM at most.
while read -ra job_args; do
    if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n \
            && echo "${pids[pid]} succeeded" 1>&2 \
            || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
    fi
    do_something_and_maybe_fail "${job_args[@]}" & # forking here
    pids[$!]="${job_args[*]}"
    echo "${#pids[@]} running" 1>&2
done < /path/to/one_command_per_line.txt
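As mentioned above, the path does not have to be hard-coded. A sketch of the same loop taking an optional file name argument and falling back to standard input (the ${1:-/dev/stdin} handling is an assumption about how the script would be invoked, not part of the original outline):

while read -ra job_args; do
    # ... same loop body as above ...
done < "${1:-/dev/stdin}"

It could then be run as ./script.sh one_command_per_line.txt or as a filter, e.g. some_generator | ./script.sh.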