16

In our office we use a RAID 5 of SSDs as a network share on a Linux server. This share is accessed as a network drive from Windows PCs and Macs. Sometimes this network share becomes awfully slow in terms of access times and transfer speed.

I am not the admin and therefore do not have full insight into the system.

One of the admins now proposed that this may have to do with the number of files that are stored on the network share. Some folders contain millions of files of only a few kB each.

Does the access speed depend on the number of files on a network share?

bobbolous
  • 395

4 Answers

33

It's not the sheer number of files on the drive, so much as how many are in any given folder.

Every time someone accesses a folder, its contents must be read so the file list can be presented. This is also independent of the file sizes; only the names, created/modified dates & other outwardly-visible info needs to be fetched.
Icon caches can also take a heavy hit if thumbnails are used.

Splitting these gigantic folders into sub-sets might be just what the structure needs.
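
A hedged sketch of one way to do that split on the Linux side; the path, the two-character prefix and the "sub_" naming are just examples, not anything the share actually uses:

    # bash sketch: move files from one huge flat folder into subfolders
    # named after the first two characters of each filename.
    cd /srv/share/hugefolder || exit 1
    for f in *; do
        [ -f "$f" ] || continue        # only regular files; leave existing subfolders alone
        prefix=${f:0:2}                # first two characters of the name
        mkdir -p "sub_$prefix"         # "sub_" avoids clashes with very short filenames
        mv -- "$f" "sub_$prefix/"
    done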

Tetsujin
  • 50,917
9
  1. The speed of listing files obviously does depend on the number of files to be listed.

  2. The speed of opening a specific file (i.e. starting the retrieval) can depend on the number of files.

    Depending on which filesystem is being used on the server (e.g. NTFS, XFS, ext4, ZFS), it will use different data structures to store the list of files in each directory – some of which are noticeably better at handling massive lists than others (e.g. B-trees vs hash tables vs linear lists). (A quick way to check the ext4 case is sketched just after this list.)

    Every time a new file is opened (or otherwise touched), the server needs to find it within that directory, and this may take some time. (Especially if the directory listing isn't cached in memory and needs to be read from an HDD.)

    With millions of files, you should definitely consider sharding them into subdirectories, e.g. based on the first few letters of the filename (similar to what you might see in .git/objects/ of a Git repository).

  3. The speed of transferring a file's contents (not including the time needed to open it) doesn't depend on the number of files in that directory at all.

    It does depend on how much the disks need to seek (if they're mechanical), which is especially bad for many tiny files.
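
If the share happens to sit on ext4 (the question doesn't say, so this is only an assumption), one quick check is whether hashed directory indexes are enabled; the device name below is just an example:

    # Assumes ext4 on an example device; substitute the real one.
    sudo tune2fs -l /dev/sdb1 | grep -i features
    # If the feature list includes "dir_index", ext4 uses hashed (HTree)
    # directory lookups instead of scanning the directory linearly.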

If you're transferring thousands of tiny files, I guess most of the time will be spent in per-file overhead (lookups, opens, metadata) and – if the server is using HDDs – physically seeking the HDD heads back and forth from one tiny file to another, and from one metadata entry to another.
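
A rough way to make that overhead visible, assuming you can run commands on the server itself and that the hypothetical paths below exist: read roughly the same amount of data once as thousands of small files and once as a single large file, and compare the times.

    # Hypothetical paths; both read roughly the same number of bytes.
    # (Repeat runs are skewed by the page cache, so treat this as a rough test.)
    time cat /srv/share/manytiny/*.dat > /dev/null     # thousands of small files
    time cat /srv/share/onebigfile.bin > /dev/null     # one file of similar total size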

grawity
  • 501,077
6

At least in Linux-based file systems (and the question says the server is Linux), large directories are certainly slow. If you create millions of files in one directory, the directory index grows. You can actually see that if you do ls -lhd <dir>. And directories only grow; they don't get smaller.
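
For example (the path is hypothetical), the size that ls -lhd reports is the directory index itself, not the data inside it:

    ls -lhd /srv/share/hugefolder    # size of the directory index entry
    du -sh  /srv/share/hugefolder    # total size of the files it contains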

I manage a system that deals with many queue files, and to avoid slowdowns because of that, there are two things I do:

  • Split up the millions of files across various subdirectories. This is a very common practice. If you look at the Postfix SMTP server, for instance, you'll see its queue dir is subdivided into subdirectories based on the first letter (this can be done with hashing or any algorithm you want).
  • Occasionally recreate all the subdirectories. There are events that cause even those subdirectories to grow, and once a directory is dozens or hundreds of megabytes in size (not the contents, just the dir index), it slows down all access on it; a minimal sketch of this recreate step is shown below.
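
A minimal sketch of that recreate step, assuming a queue-style subdirectory whose index has grown while it now holds only a handful of files, and assuming nothing writes to it during the swap (paths and names are examples):

    # Recreate a bloated subdirectory so its index shrinks back down.
    cd /srv/share/queue || exit 1
    ls -lhd a                     # note the size of the old directory index
    mkdir a.new
    mv a/* a.new/                 # move the (few) remaining files across
    rmdir a && mv a.new a         # swap the fresh directory into place
    ls -lhd a                     # the index is small again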

So, avoid millions of files in one dir and put them in subdirectories.

When you're talking about millions of files spread out over many subdirectories, though, the total count shouldn't be a factor.

Halfgaar
  • 326
1

A likely bottleneck is the network interface.

The answer to the question as asked is "it depends". It depends on the OS, the filesystem, the file-sharing protocol, the amount of RAM, the SSD interface, whether (and how) encryption at rest is used, and the RAID controller, among other things.

It is possible that the number of files on the drive is impacting performance, but this is likely only an issue if files are read only occasionally and/or the server is very memory-constrained: the file system pointers are typically kept in memory, and as the disks are SSDs, "seek times" are a non-issue.

It's also possible that one or more SSDs are nearing their end of life, or not handling TRIM correctly, in which case they could be greatly slowing reads and particularly writes, possibly disproportionately affecting access to the other disks as data is striped across all of them.
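
A hedged sketch of how an admin could check that; the device name is an example, smartmontools may need installing, and behind a hardware RAID controller smartctl may need an extra -d option for the controller type:

    # SSD wear/health attributes (example device; needs smartmontools).
    sudo smartctl -a /dev/sda
    # Do the SSDs and the layers above them advertise discard/TRIM support?
    lsblk --discard
    # Is periodic TRIM actually running? (systemd-based distros)
    systemctl status fstrim.timer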

davidgo
  • 73,366