I'm trying to run a bash script on all the XML files inside a folder. After some investigation, the likely bottleneck is reading the files once I have their filenames; my script itself runs fast enough that CPU is not the bottleneck.
Here's the command I would like to run:
find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 | while IFS= read -r -d '' file; do myscript.sh "$file"; done
Here are some stats about the dataset:
$ du -sh ~/books
16G /home/ec2-user/books
$ find /home/ec2-user/books -type f | wc -l
find: ‘/home/ec2-user/books/dir_113fa74f0fcfabeeeee0abc0ab4f35c0/OEBPS’: Permission denied
696755
$ find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' | wc -l
544952
$ mkdir ~/justxml; find /home/ec2-user/books/ -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -exec cp {} justxml \;
$ du -sh ~/justxml
981M justxml
$ ls ~/justxml | wc -l
48243
Here's the time taken to find and access the files:
$ date; find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 | while IFS= read -r -d '' file; do touch "$file"; done; date
Wed Oct 9 08:10:58 UTC 2024
Wed Oct 9 08:32:19 UTC 2024
$ date; find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 >~/temp.txt; date
Wed Oct 9 08:34:14 UTC 2024
Wed Oct 9 08:34:16 UTC 2024
My actual command would take over 2 hours (estimated), so I haven't run it to completion. Simply waiting out the 2 hours isn't acceptable, because I need to repeat this task many times on other datasets similar to this one.
find itself is not what takes the time; accessing the files is. 21 minutes to access 545k files is ~432 files per second.
I'm using an AWS EBS volume (SSD) attached to an AWS EC2 t2.medium instance. If I understand the fio output correctly (full output in the appendix), the disk performance is roughly:
- ~4 MB/s random read 4K blocksize single-threaded
- ~50 MB/s sequential read 16K blocksize single-threaded
- ~100 MB/s sequential read 1M blocksize single-threaded
Is there anything I can do to speed up this task?
I am not an expert in how hard disks work or how to optimise file reading, hence this question.
My guess is that my script is slow because it does random reads, and sequential reads would be much faster (at 100 MB/s I could in theory read the entire 16 GB in under 3 minutes).
Is there a way I can take advantage of sequential read speeds here?
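For example, would listing the files sorted by inode number and processing them in that order get the reads closer to sequential? Something like this untested sketch is what I imagine (it assumes inode order roughly matches the on-disk layout, and GNU sort/cut for the -z options):
# untested: sort by inode (%i) so files are read in roughly on-disk order
find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -printf '%i\t%p\0' \
  | sort -z -n \
  | cut -z -f2- \
  | while IFS= read -r -d '' file; do myscript.sh "$file"; done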
- Would reading zip files make any difference?
The books folder is actually distributed as a tarball containing a large number of zip files: I first untarred the tarball, then unzipped each zip file to obtain the books folder. If there is a way to read the original zip files directly that would be faster, I'm open to doing that.
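For example, assuming myscript.sh can be made to read its input sequentially from a pipe rather than a regular file, is something along these lines worth trying? (Untested sketch; ~/zips is a hypothetical path standing in for wherever the original zip files live.)
# untested: feed each matching member of each zip to myscript.sh via a pipe,
# without extracting to disk; assumes myscript.sh only reads the file front to back
for z in ~/zips/*.zip; do
  unzip -Z1 "$z" | grep -Ei '\.(html|htm|xml|xhtml|xhtm)$' | while IFS= read -r member; do
    myscript.sh <(unzip -p "$z" "$member")
  done
done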
- Would copying files to /tmp or /dev/shm make a difference?
I tried copying the entire directory to /tmp (on a machine with sufficient RAM). This increased the read speed, but the initial copy itself took time. Is there a way to split the books folder into chunks and process each chunk in RAM, so that the overall operation takes less time? My guess is that this shouldn't be any faster, but I'm not sure. Roughly what I have in mind is sketched below.
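Something like this untested sketch, which stages one per-book directory at a time in /dev/shm and copies the next one in the background while the current one is being processed (it assumes each top-level directory fits comfortably in /dev/shm):
process_chunk() {
  # run myscript.sh on every matching file in the staged chunk, then free the RAM
  find "$1" -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 \
    | while IFS= read -r -d '' file; do myscript.sh "$file"; done
  rm -rf "$1"
}
prev=""
for d in /home/ec2-user/books/*/; do
  chunk="/dev/shm/$(basename "$d")"
  cp -r "$d" "$chunk" &                      # start copying this chunk in the background
  cppid=$!
  [ -n "$prev" ] && process_chunk "$prev"    # process the previously staged chunk meanwhile
  wait "$cppid"                              # make sure the copy finished before moving on
  prev="$chunk"
done
[ -n "$prev" ] && process_chunk "$prev"      # process the final staged chunk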
- Would parallelising / async code make a difference?
If I understand correctly, the disk is idle while myscript.sh runs on a file, so starting to read the next file while myscript.sh is still working on the previous one could make a difference. In practice, though, I tried GNU parallel instead of the while loop (roughly as sketched after this bullet) and it made no difference.
I'm only on a 4 GB machine, though. I could rent a machine with more cores, but I assume that paying double for double the cores would at best double the performance, so it doesn't really help me. (The metric I actually care about is something like dollars spent per GB of output, while keeping programmer time under a month.)
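What I tried was along these lines (a sketch, not my exact command):
# -j 4: more jobs than vCPUs, hoping reads for some files overlap with CPU work on others
find /home/ec2-user/books -type f -regex '.*\.\(html\|htm\|xml\|xhtml\|xhtm\)$' -print0 \
  | parallel -0 -j 4 myscript.sh {}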
- Is there a faster way to do this in C rather than bash?
I'm assuming yes. Are there any resources on how to write optimised C code for reading a large number of files in a folder? I understand the basics like fseek and fscanf, but not much about optimisation.
Appendix
Disk benchmarks of a 50 GB AWS EBS SSD volume attached to an AWS EC2 t2.medium instance:
$ fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randread --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.32
Starting 1 process
Jobs: 1 (f=1): [r(1)][11.7%][r=4408KiB/s][r=1102 IOPS][eta 00m:53s]
Jobs: 1 (f=1): [r(1)][21.7%][r=4100KiB/s][r=1025 IOPS][eta 00m:47s]
Jobs: 1 (f=1): [r(1)][31.7%][r=3984KiB/s][r=996 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [r(1)][41.7%][r=4104KiB/s][r=1026 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [r(1)][51.7%][r=3819KiB/s][r=954 IOPS][eta 00m:29s]
Jobs: 1 (f=1): [r(1)][61.7%][r=2666KiB/s][r=666 IOPS][eta 00m:23s]
Jobs: 1 (f=1): [r(1)][71.7%][r=3923KiB/s][r=980 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [r(1)][81.7%][r=3864KiB/s][r=966 IOPS][eta 00m:11s]
Jobs: 1 (f=1): [r(1)][91.7%][r=3988KiB/s][r=997 IOPS][eta 00m:05s]
Jobs: 1 (f=1): [r(1)][100.0%][r=3819KiB/s][r=954 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=3336320: Wed Oct 9 09:18:22 2024
read: IOPS=1006, BW=4025KiB/s (4121kB/s)(236MiB/60001msec)
slat (usec): min=10, max=1954, avg=26.74, stdev=28.11
clat (nsec): min=1973, max=74708k, avg=962785.46, stdev=670441.37
lat (usec): min=266, max=74736, avg=989.52, stdev=671.14
clat percentiles (usec):
| 1.00th=[ 351], 5.00th=[ 490], 10.00th=[ 537], 20.00th=[ 668],
| 30.00th=[ 766], 40.00th=[ 840], 50.00th=[ 922], 60.00th=[ 1004],
| 70.00th=[ 1090], 80.00th=[ 1205], 90.00th=[ 1369], 95.00th=[ 1532],
| 99.00th=[ 1909], 99.50th=[ 2089], 99.90th=[ 2900], 99.95th=[ 7635],
| 99.99th=[27395]
bw ( KiB/s): min= 1424, max= 5536, per=100.00%, avg=4030.04, stdev=438.60, samples=119
iops : min= 356, max= 1384, avg=1007.50, stdev=109.65, samples=119
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 500=6.19%, 750=22.12%, 1000=31.05%
lat (msec) : 2=39.89%, 4=0.64%, 10=0.03%, 20=0.03%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=0.65%, sys=4.06%, ctx=60390, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=60374,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=4025KiB/s (4121kB/s), 4025KiB/s-4025KiB/s (4121kB/s-4121kB/s), io=236MiB (247MB), run=60001-60001msec
Disk stats (read/write):
xvda: ios=60263/99, merge=0/4, ticks=50400/91, in_queue=50490, util=85.39%
$ sudo fio --directory=/ --name fio_test_file --direct=1 --rw=randread --bs=16k --size=1G --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
fio_test_file: (g=0): rw=randread, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.32
Starting 16 processes
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
fio_test_file: Laying out IO file (1 file / 1024MiB)
Jobs: 16 (f=16): [r(16)][100.0%][r=46.9MiB/s][r=3001 IOPS][eta 00m:00s]
fio_test_file: (groupid=0, jobs=16): err= 0: pid=3336702: Wed Oct 9 09:31:27 2024
read: IOPS=3015, BW=47.1MiB/s (49.4MB/s)(8482MiB/180005msec)
clat (usec): min=223, max=120278, avg=5301.13, stdev=803.46
lat (usec): min=223, max=120278, avg=5301.85, stdev=803.46
clat percentiles (usec):
| 1.00th=[ 3425], 5.00th=[ 4555], 10.00th=[ 4752], 20.00th=[ 4948],
| 30.00th=[ 5080], 40.00th=[ 5211], 50.00th=[ 5276], 60.00th=[ 5407],
| 70.00th=[ 5538], 80.00th=[ 5669], 90.00th=[ 5932], 95.00th=[ 6128],
| 99.00th=[ 6652], 99.50th=[ 6849], 99.90th=[ 8029], 99.95th=[12387],
| 99.99th=[27132]
bw ( KiB/s): min=43192, max=144017, per=100.00%, avg=48297.40, stdev=319.26, samples=5744
iops : min= 2698, max= 9000, avg=3017.45, stdev=19.95, samples=5744
lat (usec) : 250=0.01%, 500=0.04%, 750=0.20%, 1000=0.13%
lat (msec) : 2=0.32%, 4=0.69%, 10=98.55%, 20=0.04%, 50=0.02%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=0.14%, sys=0.64%, ctx=543370, majf=0, minf=222
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=542833,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=47.1MiB/s (49.4MB/s), 47.1MiB/s-47.1MiB/s (49.4MB/s-49.4MB/s), io=8482MiB (8894MB), run=180005-180005msec
Disk stats (read/write):
xvda: ios=542583/252, merge=0/7, ticks=2828968/1381, in_queue=2830349, util=85.25%