
My script aims to extract a text log file using tail -f and a Wireshark trace using tshark, but I don't know whether these are the best options for my goal.

My script has to ssh into a machine (which I call server) and then, from that machine, ssh into another (called blade), so I created these two functions to streamline sending commands:

processIDs=()

# sends command $2 to server $1, piping output to file $3 on the local machine
server_cmd() { ssh -i /home/$USER/.ssh/id_rsa root@$1 $2 1>>$3 2>>$errorOutput & processIDs+=($!); }

# sends command $3 to blade $2 of server $1, piping output to file $4 on the local machine
blade_cmd() { server_cmd $1 "ssh root@$2 "$3"" $4; }

The process ID is appended to the array every time I send an ssh call into the background.

In my script I make a variable number of calls (depending on user choices) to the blade_cmd function:

blade_cmd $server_ip $server_blade_ip "tail -f \\\$(ls -1tr ${path}_Debug_* | tail -1)" debug.log
blade_cmd $server_ip $server_blade_ip "tail -f \\\$(ls -1tr ${path}_Report_* | tail -1)" report.log

blade_cmd $server_ip $server_blade_ip "tshark -i eth7 -w -" tshark.pcap

Then I perform the actions that generate the logs/traces, and finally kill the processes like so:

# kill all generated processes on the array
for i in ${!processIDs[@]}; do
    kill ${processIDs[i]}
    wait ${processIDs[i]} 2>>$errorOutput
done

But with this setup the processes on the remote machines don't get killed and are left hanging.

The solution I found for killing the processes is to call ssh with the -tt flag to force a tty. That does fix the problem of the kill from the local machine not being propagated, but then the logs/traces I receive get corrupted by the login banner and various extra newlines, which renders the logs, and especially the tshark traces, useless.
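
For reference, the -tt variant I tried only changes the ssh call inside server_cmd, roughly like this:

# force a pseudo-terminal so the kill propagates to the remote side; unfortunately
# the login banner and CR/LF conversions then end up in the captured output
server_cmd() { ssh -tt -i /home/$USER/.ssh/id_rsa root@$1 $2 1>>$3 2>>$errorOutput & processIDs+=($!); }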

I require some guidance on how to go forward with this.


1 Answer


In my tests a remote tshark -w - … exits automatically when the local ssh exits, even if I chain ssh sessions like you do; at least once tshark tries to keep writing. I think it's the mechanism described here: Why isn't tail -f … | grep -q … quitting when it finds a match?

Note that the first write after the local ssh exits will probably only cause the intermediate ssh (on server) to exit; the next write will cause tshark to exit.

It's similar with tail -f on blade, except that this tail may (or may not) be "smart" and detect the broken pipe (see the linked answer). Still, the intermediate ssh is not that smart.

So yes, in some circumstances "the processes on the remote machines […] are left hanging".

There is a trick to make remote tshark or tail -f exit just after the local ssh exits. You cannot use this trick with remote commands that should read from stdin (i.e. ultimately from stdin of the local ssh), but since tshark and tail -f don't use their stdin, in this case the trick will be useful. This is the trick:

On blade, instead of tshark … run:

# shell code for blade
tshark … & cat >/dev/null; kill "$!"

Note that locally you need to adjust the quoting so $! is expanded on blade, not earlier. For brevity I decided to post the command in the form a shell on blade should get.

tshark … now runs in the background on blade, but this shouldn't prevent it from working. cat ultimately reads from the stdin of the local ssh and sits there until the local ssh exits or its stdin breaks or is depleted; when that happens, cat gets an EOF condition and exits. Then kill kills tshark on blade.

When tshark, cat and kill are no more, the shell on blade exits and the relevant instance of sshd on blade exits, so ssh on server also exits. It's clean.

The same trick can be used to make tail -f … exit.
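
For example, in the form a shell on blade should get (the log path is only a placeholder here):

# shell code for blade; same pattern as with tshark
tail -f /path/to/current_debug_log & cat >/dev/null; kill "$!"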

Now we need to take care of the local side.

The remote cat will read what the local ssh reads. Without a redirection, the local ssh will consume the stdin of the local script (stdin may be a terminal, a regular file or whatever). In general you may not want the local ssh to consume the stdin of the script. Also, if an EOF condition happens prematurely, cat will exit prematurely; we want it to exit only after the local ssh exits, not when the local stdin is depleted. For these reasons it's good to connect the stdin of the local ssh to something other than the stdin of the script. It cannot be /dev/null because EOF would happen immediately. It shouldn't be /dev/zero because it would transfer zero bytes in vain. It could be tail -f /dev/null:

# locally inside your function
tail -f /dev/null | ssh … &

but then $! would tell you the PID of ssh, and after killing ssh this local tail would remain, unless it's "smart" (again, see the linked answer). Only if your local tail is "smart" will this be a clean way that doesn't leave hanging processes.

Alternatively you can use a dummy fifo. You should open it for reading and writing beforehand, so nothing stalls. Still, there is no need to pass anything through the fifo. Example:

# locally
mkfifo dummy
exec 3<>dummy
rm dummy

then inside your function

<&3 ssh … &

You can use the same fifo for many instances of ssh. In the example I unlinked the fifo from the directory just after opening the file descriptor, so there is no need for maintenance later. The kernel will truly get rid of the fifo when it's no longer in use.

One way or another (i.e. with "smart" tail or a fifo) it's now enough to kill ssh locally and the remote cat on blade will see EOF and exit, and things on blade and server will happen as described above.
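
For completeness, here is a rough end-to-end sketch of the fifo variant. I bypass your helper functions so the nested quoting stays visible; the names $server_ip, $server_blade_ip and $errorOutput are taken from your question, and the quoting is an untested assumption you should verify against your setup rather than a ready-made solution:

# locally, once at the start of the script: a dummy fifo opened read-write,
# so the open doesn't block and readers never see EOF
mkfifo dummy
exec 3<>dummy
rm dummy

processIDs=()

# the local ssh reads from fd 3, so the remote cat only sees EOF when this
# local ssh dies; the single quotes keep the blade command away from server's
# shell, and \" \$ make "$!" expand on blade, not locally
<&3 ssh -i /home/$USER/.ssh/id_rsa "root@$server_ip" \
    "ssh root@$server_blade_ip 'tshark -i eth7 -w - & cat >/dev/null; kill \"\$!\"'" \
    1>>tshark.pcap 2>>"$errorOutput" &
processIDs+=($!)

# ... run whatever generates the traffic/logs ...

# killing the local ssh now unwinds the whole chain: cat on blade gets EOF,
# kills tshark, the shell on blade exits, and the ssh on server exits too
for i in "${!processIDs[@]}"; do
    kill "${processIDs[i]}"
    wait "${processIDs[i]}" 2>>"$errorOutput"
done

The same fd 3 can back every ssh started this way, as noted above.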