
I do various sysadmin tasks to clean up my disks, such as (but not limited to):

find /media/me/disk_with_huge_inode_count -type d -empty | xargs rmdir -p

and the rmdir part is really slow, while find produces a huge amount of output by comparison.

What would be the behaviour of the find under such a scenario?

I'm not looking for specific advice for this operation because I have this concern with other similar jobs. What I want to understand is how the Linux kernel (or shell?) handles pipeline overflows when the producer and consumer have a load mismatch.



Specific case

Yes, find will block for as long as necessary. A "simple" test:

find / -print -exec sh -c 'printf "%s\n" "$1" >/dev/tty' find-sh {} \; \
| while </dev/tty read user_input; do read from_find; done

Here find -prints pathnames to a pipe (the find in your command does the same; it uses an implicit -print). Each pathname is additionally printed to /dev/tty, so you see it after -print succeeds. At some moment the output you see will stall; this is when the pipe buffer is (almost) full. Press Enter to trigger read from_find, which reads from the pipe and makes room in the buffer.

Most likely you will need several presses of Enter (in practice it's good to hold it and be patient) until find prints another bunch of pathnames to /dev/tty. Nevertheless, by not pressing Enter you can make find block for arbitrarily long.


General case

You wrote:

What I want to understand is how the Linux kernel (or shell?) handles pipeline overflows […]

The shell is responsible for setting things up: creating processes with descriptors (including standard descriptors: stdin, stdout, stderr) connected to respective files (unnamed fifos, i.e. pipes; or files of other types). Data that flows through a pipe between processes (like your find and xargs) does not flow through the shell. The shell does not act as a relay.

To understand how pipeline overflows are handled, in general you can get some insight from the following fragments of the POSIX specification of write():

DESCRIPTION

[…]

Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:

[…]

  • If the O_NONBLOCK flag is clear, a write request may cause the thread to block, but on normal completion it shall return nbyte.

  • If the O_NONBLOCK flag is set, write() requests shall be handled differently, in the following ways:

    • The write() function shall not block the thread.

    […]

[…]

ERRORS

[…]

[EAGAIN]
The file is a pipe or FIFO, the O_NONBLOCK flag is set for the file descriptor, and the thread would be delayed in the write operation.

[…]

RATIONALE

[…]

An attempt to write to a pipe or FIFO has several major characteristics:

[…]

  • Blocking/immediate: Blocking is only possible with O_NONBLOCK clear. If there is enough space for all the data requested to be written immediately, the implementation should do so. Otherwise, the calling thread may block; that is, pause until enough space is available for writing. […]

[…]

This means a writing thread can:

  • set O_NONBLOCK and in a case of not enough room in the buffer it will get [EAGAIN], be able to continue (with other tasks) and eventually try to write again; or
  • clear O_NONBLOCK and in a case of not enough room in the buffer it will block.

If a program uses write() with O_NONBLOCK set, it must keep track of what has been written successfully and what requires another try. With O_NONBLOCK clear, a full pipe buffer is not a concern: the thread just calls write() and blocks for as long as necessary; the blocking happens inside write() and requires no additional code (no polling, no trap, nothing).

read() can block like write(), the situation is quite similar and I won't elaborate separately.

This design allows programs to use pipes reliably and easily. The whole idea of the pipe is writers wait for room in the respective buffer, readers wait for something in the respective buffer; thus data will eventually flow through even if there's a bottleneck.

It's possible to write an impatient program that will exit if it's unable to write (almost) immediately (or read, in case of a reader). Programs designed to work with pipes shall be infinitely patient. Standard *nix tools (including find) by design are infinitely patient¹. It takes additional effort to build an impatient program or to wrap a standard patient tool in something that implements a timeout.

Deadlock can happen if "plumbing" is circular (example) or if it branches and converges later (like here). It's a separate issue that has little (if anything) to do with how fast programs process data. It does not occur in a linear arrangement of pipes.

We considered the POSIX specifications of write() and read(). Linux is classified as "mostly POSIX-compliant". It's not "fully compliant", but I don't expect it to significantly deviate from POSIX in the area in question. Reliably working pipes are just too important.

¹ I don't expect implementations to be able to block to the end of the Universe, or past the year 2038 or 2147485547. By "infinitely patient program" I mean a program that is not deliberately impatient by itself.


Conclusion

Your command is fine as a pipeline. (It is flawed for another reason, see below.)


Side note

Your find … | xargs rmdir -p will misbehave for pathnames containing blanks (like spaces), newlines, single or double quotes, or backslashes, because xargs without specific options interprets these characters specially.

A reliable way is find … -print0 | xargs -0 … or find … -exec …; the latter is portable.

find /media/me/disk_with_huge_inode_count -type d -empty -exec rmdir -p {} +

(AFAIK find is also infinitely patient when waiting for -exec to finish.)