
I'm trying to run a bash script on all the XML files inside a folder. After some investigation, I believe the likely bottleneck is reading the files once their filenames are known, rather than finding them. My script itself is likely fast enough that CPU is not the bottleneck.

Here's the command I would like to run:

find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 | while IFS= read -r -d '' file; do myscript.sh "$file"; done

Here are some stats about the dataset:

$ du -sh ~/books
16G     /home/ec2-user/books
$ find /home/ec2-user/books -type f | wc -l
find: ‘/home/ec2-user/books/dir_113fa74f0fcfabeeeee0abc0ab4f35c0/OEBPS’: Permission denied
696755
$ find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' | wc -l
544952

$ mkdir ~/justxml
$ find /home/ec2-user/books/ -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -exec cp {} justxml \;
$ du -sh ~/justxml
981M    justxml
$ ls ~/justxml | wc -l
48243

Here's the time taken to find and access the files:

$ date; find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 | while IFS= read -r -d '' file; do touch "$file"; done; date
Wed Oct  9 08:10:58 UTC 2024
Wed Oct  9 08:32:19 UTC 2024
$ date; find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 >~/temp.txt; date
Wed Oct  9 08:34:14 UTC 2024
Wed Oct  9 08:34:16 UTC 2024

My actual command takes over 2 hours (estimated), so I haven't run it to completion. Simply waiting 2 hours isn't acceptable, because I need to repeat this task many times on other datasets similar to this one.

find itself takes almost no time; accessing the files is what takes time. 21 minutes to touch 545k files is ~432 files per second.

I'm using an AWS EBS volume (SSD) attached to an AWS EC2 t2.medium instance. If I understand the fio output correctly, here's the disk performance:

  • ~4 MB/s random read, 4K blocksize, single-threaded
  • ~50 MB/s sequential read, 16K blocksize, single-threaded
  • ~100 MB/s sequential read, 1M blocksize, single-threaded

Is there anything I can do to speed up this task?

I am not an expert in how hard disks work or how to optimise file reading, hence this question.

My guess is that my script is slow because it does random reads, and sequential reads would be much faster (at 100 MB/s I could, in theory, read the entire 16 GB in under 3 minutes).

  1. Is there a way I can take advantage of sequential read speeds here?
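
The closest thing I can think of with the files already unpacked is to process them in roughly on-disk order, e.g. by sorting on inode number. I haven't benchmarked this, and the sketch below assumes filenames contain no newlines (which is why it drops -print0):

# Untested sketch: sort the file list by inode number so reads roughly follow
# the on-disk layout, then run the script on each file in that order.
find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -printf '%i\t%p\n' \
  | sort -n \
  | cut -f2- \
  | while IFS= read -r file; do myscript.sh "$file"; done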

  2. Would reading zip files make any difference?

The books folder is actually distributed as a tarball containing a large number of zip files: I first untarred the tarball, then unzipped each zip file to obtain the books folder. If reading the original zip files directly would be faster, I'm open to doing that.

  3. Would copying files to /tmp or /dev/shm make a difference?

I tried copying the entire directory to /tmp (on a machine with sufficient RAM). This increased the read speed, but the initial copy itself took time. Is there a way to split the books folder into chunks and process each chunk in RAM, so that the overall operation takes less time? My guess is that this would not be any faster, but I'm not sure. A rough sketch of what I have in mind is below.
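
For illustration only (untested; the 1000-file chunk size and the ~/filelist0 path are arbitrary choices, and it assumes /dev/shm can hold one chunk at a time):

# Untested sketch: copy files in chunks of 1000 into a /dev/shm scratch
# directory, run the script on the in-RAM copies, then delete them before
# fetching the next chunk.
# Note: files sharing a basename within a chunk would overwrite each other.
find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 > ~/filelist0
chunkdir=$(mktemp -d /dev/shm/chunk.XXXXXX)
xargs -0 -n 1000 sh -c '
  cp -- "$@" "$0"                              # $0 holds the chunk directory
  for f in "$0"/*; do myscript.sh "$f"; done
  rm -f -- "$0"/*
' "$chunkdir" < ~/filelist0
rmdir "$chunkdir"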

  4. Would parallelising / async code make a difference?

If I understand correctly, the disk is idle while myscript.sh runs on a file, so starting to read the next file while myscript.sh works on the previous one could make a difference. In practice, though, when I tried GNU parallel instead of the while loop, it didn't make a difference. What I tried looks roughly like the sketch below.
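
Roughly (the -j 4 job count is just an example):

# Roughly what I tried: let GNU parallel run several instances of the script
# at once, so the next file can be read while another is being processed.
find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 \
  | parallel -0 -j 4 myscript.sh {}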

I'm only using a 4 GB machine, though. I could rent a machine with more threads, but I assume paying double for double the threads would at best double the throughput, so the cost per GB wouldn't improve. (The metric I actually care about is something like dollars spent per GB of output, while keeping programmer time under a month.)

  5. Is there a faster way to do this in C rather than bash?

I'm assuming yes. Are there any resources on how to write optimised C code for reading the files in a folder? I understand basics like fseek and fscanf, but not much about optimisation.


Appendix

Disk benchmarks of a 50 GB AWS EBS SSD volume attached to an AWS EC2 t2.medium instance:

$ fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randread --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1                                                           
fio-3.32                                                                                                                                                             
Starting 1 process                                                                                                                                                   
Jobs: 1 (f=1): [r(1)][11.7%][r=4408KiB/s][r=1102 IOPS][eta 00m:53s]                                                                                                  
Jobs: 1 (f=1): [r(1)][21.7%][r=4100KiB/s][r=1025 IOPS][eta 00m:47s] 
Jobs: 1 (f=1): [r(1)][31.7%][r=3984KiB/s][r=996 IOPS][eta 00m:41s]  
Jobs: 1 (f=1): [r(1)][41.7%][r=4104KiB/s][r=1026 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [r(1)][51.7%][r=3819KiB/s][r=954 IOPS][eta 00m:29s] 
Jobs: 1 (f=1): [r(1)][61.7%][r=2666KiB/s][r=666 IOPS][eta 00m:23s] 
Jobs: 1 (f=1): [r(1)][71.7%][r=3923KiB/s][r=980 IOPS][eta 00m:17s]  
Jobs: 1 (f=1): [r(1)][81.7%][r=3864KiB/s][r=966 IOPS][eta 00m:11s] 
Jobs: 1 (f=1): [r(1)][91.7%][r=3988KiB/s][r=997 IOPS][eta 00m:05s] 
Jobs: 1 (f=1): [r(1)][100.0%][r=3819KiB/s][r=954 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3336320: Wed Oct  9 09:18:22 2024
  read: IOPS=1006, BW=4025KiB/s (4121kB/s)(236MiB/60001msec)
    slat (usec): min=10, max=1954, avg=26.74, stdev=28.11
    clat (nsec): min=1973, max=74708k, avg=962785.46, stdev=670441.37
     lat (usec): min=266, max=74736, avg=989.52, stdev=671.14
    clat percentiles (usec):
     |  1.00th=[  351],  5.00th=[  490], 10.00th=[  537], 20.00th=[  668],
     | 30.00th=[  766], 40.00th=[  840], 50.00th=[  922], 60.00th=[ 1004],
     | 70.00th=[ 1090], 80.00th=[ 1205], 90.00th=[ 1369], 95.00th=[ 1532],
     | 99.00th=[ 1909], 99.50th=[ 2089], 99.90th=[ 2900], 99.95th=[ 7635],
     | 99.99th=[27395]
   bw (  KiB/s): min= 1424, max= 5536, per=100.00%, avg=4030.04, stdev=438.60, samples=119
   iops        : min=  356, max= 1384, avg=1007.50, stdev=109.65, samples=119
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=6.19%, 750=22.12%, 1000=31.05%
  lat (msec)   : 2=39.89%, 4=0.64%, 10=0.03%, 20=0.03%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.65%, sys=4.06%, ctx=60390, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=60374,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=4025KiB/s (4121kB/s), 4025KiB/s-4025KiB/s (4121kB/s-4121kB/s), io=236MiB (247MB), run=60001-60001msec

Disk stats (read/write):
  xvda: ios=60263/99, merge=0/4, ticks=50400/91, in_queue=50490, util=85.39%

$ sudo fio --directory=/ --name fio_test_file --direct=1 --rw=randread --bs=16k --size=1G --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
fio_test_file: (g=0): rw=randread, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1                                       
...                                                                                                                                                                  
fio-3.32                                                                                                                                                             
Starting 16 processes                                                                                                                                                
fio_test_file: Laying out IO file (1 file / 1024MiB)                                                                                                                 
fio_test_file: Laying out IO file (1 file / 1024MiB)                                                                                                                 
fio_test_file: Laying out IO file (1 file / 1024MiB)                                                                                                                 
fio_test_file: Laying out IO file (1 file / 1024MiB)                                                                                                                 
fio_test_file: Laying out IO file (1 file / 1024MiB)                                                                                                                 
fio_test_file: Laying out IO file (1 file / 1024MiB)                                                                                                                 
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
Jobs: 16 (f=16): [r(16)][100.0%][r=46.9MiB/s][r=3001 IOPS][eta 00m:00s]
fio_test_file: (groupid=0, jobs=16): err= 0: pid=3336702: Wed Oct  9 09:31:27 2024
  read: IOPS=3015, BW=47.1MiB/s (49.4MB/s)(8482MiB/180005msec)
    clat (usec): min=223, max=120278, avg=5301.13, stdev=803.46
     lat (usec): min=223, max=120278, avg=5301.85, stdev=803.46
    clat percentiles (usec):
     |  1.00th=[ 3425],  5.00th=[ 4555], 10.00th=[ 4752], 20.00th=[ 4948],
     | 30.00th=[ 5080], 40.00th=[ 5211], 50.00th=[ 5276], 60.00th=[ 5407],
     | 70.00th=[ 5538], 80.00th=[ 5669], 90.00th=[ 5932], 95.00th=[ 6128],
     | 99.00th=[ 6652], 99.50th=[ 6849], 99.90th=[ 8029], 99.95th=[12387],
     | 99.99th=[27132]
   bw (  KiB/s): min=43192, max=144017, per=100.00%, avg=48297.40, stdev=319.26, samples=5744
   iops        : min= 2698, max= 9000, avg=3017.45, stdev=19.95, samples=5744
  lat (usec)   : 250=0.01%, 500=0.04%, 750=0.20%, 1000=0.13%
  lat (msec)   : 2=0.32%, 4=0.69%, 10=98.55%, 20=0.04%, 50=0.02%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=0.14%, sys=0.64%, ctx=543370, majf=0, minf=222
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=542833,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=47.1MiB/s (49.4MB/s), 47.1MiB/s-47.1MiB/s (49.4MB/s-49.4MB/s), io=8482MiB (8894MB), run=180005-180005msec

Disk stats (read/write):
  xvda: ios=542583/252, merge=0/7, ticks=2828968/1381, in_queue=2830349, util=85.25%

1 Answer


The books folder is actually distributed as a tarball containing a large number of zip files: I first untarred the tarball, then unzipped each zip file to obtain the books folder. If reading the original zip files directly would be faster, I'm open to doing that.

I suspect your problem is mostly disk IO.

I'd aim to do as much as possible in a pipeline, in memory or in a RAM-based filesystem.

e.g.

tar -x -O -f booksfolder.tar | {some magic involving unzip -p ?} | {process data}

That avoids writing lots of intermediate files to disk and reading them back.
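
One possible, untested way to fill in the "magic", assuming GNU tar and that each member of the tarball is a zip file: since unzip can't read an archive from a pipe, buffer each zip in /dev/shm, unzip it there, process it, and clean up. GNU tar's --to-command runs a helper once per regular member, with the member's data on stdin and its name in $TAR_FILENAME:

#!/bin/sh
# handle-zip.sh -- untested helper: receives one tar member on stdin, with its
# name in $TAR_FILENAME (set by GNU tar's --to-command).
case "$TAR_FILENAME" in
  *.zip) ;;                                 # only handle zip members
  *) cat > /dev/null; exit 0 ;;             # drain and skip anything else
esac
work=$(mktemp -d /dev/shm/book.XXXXXX)      # RAM-backed scratch space
cat > "$work/book.zip"                      # unzip needs a seekable file
unzip -qq "$work/book.zip" -d "$work/book"
find "$work/book" -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' \
  -exec myscript.sh {} \;
rm -rf "$work"

driven by a single pass over the tarball:

chmod +x handle-zip.sh
tar -x -f booksfolder.tar --to-command=./handle-zip.sh

This reads the tarball once, sequentially, which is where your ~100 MB/s figure would apply; the /dev/shm scratch space only ever has to hold one book at a time.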


Although caching ought to work in your favour, I suppose you might instead reduce disk-head seek-times by using a second disk for extracting the tar files and a third disk for unzipping the extracted zip files. I generally avoid trying to second guess OS programmers though.