
While taking a backup of a collection of folders containing source code (where I realized later that I could have excluded certain folders containing library files, like node_modules), I noticed that the file transfer speed slows to a crawl (a few KB/s against the usual 60 MB/s that the backup drive allows).

I'd like to understand where the bottleneck is. Is it some computation that has to be performed, interleaved with the pure I/O, that slows the whole thing down, or does the filesystem index on the target drive have some central lock which must be acquired and released between files?

I'm using NTFS on the target backup drive, and it's an HDD.

1 Answer


The problem is that the filesystem catalog, which records where each file's data is located on the hard disk, has to be accessed several times for every file copied.

For each file the copy needs to do the following (see the sketch after this list):

  • Open the source file from the source catalog
  • Create a target file in the target catalog
  • Copy the file
  • Close the source file and update its catalog entry (e.g. the last-access timestamp)
  • Close the target file and finalize its catalog entry (size, timestamps).
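
A minimal sketch of that per-file sequence in Python, with comments marking where the catalog (the Master File Table on NTFS) has to be touched; the chunked copy loop is illustrative only, not how any particular backup tool works:

```python
def copy_one_file(src, dst, chunk_size=1024 * 1024):
    # Opening the source reads its catalog entry to find where the
    # file's data lives on the source disk.
    with open(src, "rb") as fsrc:
        # Creating the target writes a new entry into the target
        # catalog before any data has been copied.
        with open(dst, "wb") as fdst:
            # Copy the data itself.
            while True:
                chunk = fsrc.read(chunk_size)
                if not chunk:
                    break
                fdst.write(chunk)
    # Closing both handles updates the catalog entries again
    # (last-access time on the source, size and timestamps on the
    # target), so the heads seek back to the metadata area once more.
```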

This causes the heads of both the source and target disks to switch between the file metadata in the catalog and the file data itself several times during each file copy.

On an SSD this wouldn't matter much, but on an HDD it can slow the copy of a large number of small files to a crawl. The drive ends up spending most of its time moving the head(s), which is a much slower operation than actually reading or writing data.
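
One way to confirm that the per-file overhead, rather than raw transfer speed, is the bottleneck is to measure throughput directly. This is a rough sketch (the function name and the reporting format are mine, not from any tool) that copies a tree with Python's shutil and prints the effective rate; run it once over a folder of many small files (e.g. node_modules) and once over a single large file of similar total size, and compare the MB/s figures:

```python
import os
import shutil
import time

def copy_tree_timed(src_root, dst_root):
    """Copy every file under src_root to dst_root and report MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            # One open/create/copy/close round trip per file.
            shutil.copy2(src, os.path.join(target_dir, name))
            total_bytes += os.path.getsize(src)
    elapsed = time.perf_counter() - start
    print(f"{total_bytes / 1e6:.1f} MB in {elapsed:.1f} s "
          f"({total_bytes / 1e6 / elapsed:.1f} MB/s)")
```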

Windows also can't use the RAM effectively as a write cache, since closing a file causes it to be flushed to the disk.
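
The slowdown is even worse if the copy tool explicitly forces each file to stable storage before moving on, which many backup programs do for safety; whether a plain close also triggers a flush depends on the tool and on Windows' cache manager, so treat the following as an illustrative pattern rather than what Explorer itself does. A minimal sketch, using Python's os.fsync as a stand-in for the Win32 FlushFileBuffers call:

```python
import os
import shutil

def copy_and_flush(src, dst):
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)
        fdst.flush()              # push Python's user-space buffer to the OS
        os.fsync(fdst.fileno())   # force the OS cache out to the HDD
    # Because each file is flushed before the next one starts, the
    # write-back cache never gets a chance to batch many small files
    # into one sequential write.
```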

harrymc