
High-Level Problem Summary

We are working on an application that requires sustained high-throughput writes to RAID0 for extended periods. Up to 8 independent 5 GB/s data streams are written to dedicated RAIDs (one RAID per data stream). This works fine most of the time; however, apparently unpredictable file-write latency spikes cause the stream buffers to overflow, resulting in data loss.

Has anyone seen similar issues? If so, what changes might I make in my software to prevent it from happening?

Offending Code

The following code runs on our file IO threads. Note that, due to restrictions of the platform the rest of our app is built on, we can pass only a single argument to this function, which is why we unpack it at the top. Also note that we are most concerned with the pwritev2() call.

void IoThreadFunction(IoThreadArgument *argument)
{
    /////////////// Unpack the argument ///////////////
    rte_ring* io_job_buffer = argument->io_job_buffer;
    rte_ring* job_pool = argument->job_pool;
    bool* stop_signal = argument->stop_signal;
    int fd = argument->fd;
    std::shared_ptr<bool> initialized = argument->initialized;
    ///////////////////////////////////////////////////

    ssize_t status;
    Job* job;

    spdlog::trace("io thread servicing fd {0} started on lcore {1}.", fd, rte_lcore_id());

    *initialized = true;

    while(!*stop_signal || rte_ring_count(io_job_buffer))
    {
        /////   This part of the code receives data from
        /////   other parts of the app and creates an iovec
        /////   array that will be used for vectorized
        /////   file IO. It is not suspected to be the
        /////   root of the problem.
        /////   START SECTION

        // Poll the io job queue
        if(rte_ring_mc_dequeue(io_job_buffer, (void**)&job) == -ENOENT) continue;

        // Populate iovecs
        job->populate_iovecs();

        ///// END SECTION

        // Write the data to file; log failed writes rather than
        // silently discarding the return value
        status = pwritev2(fd, job->iovecs, job->num_packets, job->file_offset, RWF_HIPRI);
        if(status < 0)
            spdlog::error("pwritev2 on fd {0} failed: {1}", fd, strerror(errno));

        // Free dpdk packets
        rte_pktmbuf_free_bulk(job->packets, job->num_packets);

        // Restore the job to the pool
        rte_ring_mp_enqueue(job_pool, job);
    }
}

System Hardware

  • The computer running our app is a server with 2 AMD EPYC 7643 48-core processors. Simultaneous multithreading (hyperthreading) is intentionally disabled.
  • Each RAID0 is built using two NVMes, each capable of 3.5 GB/s sustained write speeds, so we should be able to get up to 7 GB/s write speeds, in theory.
  • All of our hardware has the latest firmware installed.

System Software

  • We are using Ubuntu 22.04 LTS running the 5.15.0-86-generic Linux kernel.
  • The following boot arguments are used to optimize the software environment for our application: isolcpus=0-39,48-87 rcu_nocbs=0-39,48-87 processor.max_cstate=0 nohz=off rcu_nocb_poll audit=0 nosoftlockup amd_iommu=on iommu=pt mce=ignore_c. Note that some of these boot arguments are required by parts of the application unrelated to writing data to disk. The isolcpus argument isolates the cores our application uses to stream data to disk, minimizing system interrupts that might add latency.

Other Relevant Details

  • Our application is NUMA-aware, so data that originates on a given NUMA node will always end up on a RAID belonging to the same NUMA node.
  • The app uses up to 4 dedicated threads per data stream for file IO. We have tried using as few as 2 threads per stream, but 4 are required for reliable IO throughput.
  • Each stream is written to a single file that may grow as large as 5 TB.
  • We're using synchronous pwritev2() calls to write the data to the RAIDs. Note that we have experimented thoroughly with other approaches, such as io_uring, but due to the nature of the data stream, synchronous pwritev2() calls yield the highest and most reliable throughput for us.
  • The data arrives in 8192-byte packets, and we write 1024 packets (8 MiB) at a time with vectorized writes to take full advantage of pwritev2().
  • All data is page-aligned, and all file writes bypass the Linux page cache using the O_DIRECT flag.
  • The space for each file is pre-allocated using fallocate().
  • The RAIDs are configured with mdadm with the following options: mdadm --create /dev/md0 --chunk=256 --level=0 --raid-devices=2 /dev/nvme[n]n1 /dev/nvme[n+1]n1
  • The RAIDs all have XFS file systems that are created with the following options: mkfs.xfs -b size=4096 -d sunit=512,swidth=1024 -f /dev/md[n]. The filesystem is configured to best accommodate the RAID geometry.
Smitch