I've written a naive and an "optimized" transpose function for order-3 tensors containing double-precision complex numbers, and I would like to analyze their performance.
Approximate code for the naive transpose function:
    #pragma omp for schedule(static)
    for (auto i2 = std::size_t(0); i2 < n2; ++i2)
    {
        for (auto i1 = std::size_t{}; i1 < n1; ++i1)
        {
            for (auto i3 = std::size_t{}; i3 < n3; ++i3)
            {
                tens_tr(i3, i2, i1) = tens(i1, i2, i3);
            }
        }
    }
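(The accessor tens(i1, i2, i3) is not shown; below is a minimal sketch of the kind of indexing I have in mind. The contiguous, row-major-like layout with i3 as the fastest-varying index is an assumption for illustration, not my exact tensor class.)
    // Minimal sketch of the element accessor (illustrative assumption:
    // contiguous storage with the last index i3 varying fastest).
    // With such a layout the naive loop reads tens(i1, i2, i3) contiguously
    // while writing tens_tr(i3, i2, i1) with a stride of n1 * n2 elements.
    #include <complex>
    #include <cstddef>
    #include <vector>
    
    struct tensor3
    {
        std::size_t n1, n2, n3;
        std::vector<std::complex<double>> data; // n1 * n2 * n3 elements
        
        std::complex<double>& operator()(std::size_t i1, std::size_t i2, std::size_t i3)
        {
            return data[(i1 * n2 + i2) * n3 + i3];
        }
    };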
Approximate code for the optimized transpose function (the remainder loops are not shown; assume the extents are divisible by block_size):
    #pragma omp for schedule(static)
    for (auto i2 = std::size_t(0); i2 < n2; ++i2)
    {        
        // blocked loop
        for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
        {
            for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
            {
                for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
                {
                    for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                    {
                        cache_buffer[i3 * block_size + i1] = tens(bi1 + i1, i2, bi3 + i3);
                    }
                }
                
                for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
                {
                    for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                    {
                        tens_tr(bi3 + i1, i2, bi1 + i3) = cache_buffer[i1 * block_size + i3];
                    }
                }
            }
        }
    }
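(Not shown above: cache_buffer is a thread-private scratch buffer of block_size * block_size elements. The sketch below illustrates the set-up I have in mind; the concrete block_size value is only an example, not necessarily the one used in the benchmarks.)
    // Illustrative sketch of the scratch-buffer set-up. With block_size = 16,
    // one block of complex<double> is 16 * 16 * 16 B = 4 KiB, i.e. well
    // within L1.
    #include <complex>
    #include <cstddef>
    #include <vector>
    
    constexpr std::size_t block_size = 16;
    
    void transpose_blocked(/* tens, tens_tr, extents n1, n2, n3, ... */)
    {
        #pragma omp parallel
        {
            // One scratch buffer per thread, allocated once, outside the loops.
            std::vector<std::complex<double>> buffer(block_size * block_size);
            auto* const cache_buffer = buffer.data();
            
            // ... the "#pragma omp for" blocked loops shown above go here,
            // staging each tile through cache_buffer ...
        }
    }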
Assumption: I decided to use a streaming kernel as the reference because I reasoned that a perfect implementation of the transpose would closely resemble any other bandwidth-saturating streaming function.
For this purpose, I chose the DAXPY loop as the first reference.
    #pragma omp parallel for schedule(static)
    for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
    {
        auto* slice_a = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
        auto* slice_b = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
        const auto slice_size = 2 * tens_a_->get_slice_size(); // 2 doubles for a complex
        
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < slice_size; ++index)
        {
            slice_b[index] += lambda_ * slice_a[index]; // fp_count: 2, traffic: 2+1
        }
    }
Also, I used a simple copy kernel as a second reference.
    #pragma omp parallel for schedule(static)
    for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
    {
        const auto* op1_begin = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
        const auto* op1_end = op1_begin + 2 * tens_a_->get_slice_size(); // 2 doubles in a complex
        auto* op2_iter = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
        
        #pragma omp simd safelen(8)
        for (auto* iter = op1_begin; iter != op1_end; ++iter, ++op2_iter)
        {
            *op2_iter = *iter;
        }
    }
Hardware:
- Intel(R) Xeon(R) Platinum 8168 (Skylake) with 24 cores @ 2.70 GHz and L1, L2 and L3 caches of 32 KiB (per core), 1 MiB (per core) and 33 MiB (shared), respectively.
- 48 GiB of memory @ 2666 MHz. Intel Advisor's roofline view reports a memory bandwidth of 115 GB/s.
Benchmarking: 20 warm-up runs followed by 100 timed experiments, each on newly allocated data that is "touched" beforehand so that page faults are not included in the measurements.
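(For reference, this is roughly how a measured time is converted into the GB/s values reported below. It is a simplified sketch and the helper name is made up; the key assumption is that store traffic is counted once per element, i.e. no extra write-allocate term is added.)
    // Simplified sketch of the GB/s accounting (assumption: every element is
    // counted once per read stream and once per write stream; no extra
    // write-allocate traffic is added for the stores).
    #include <cstddef>
    
    constexpr std::size_t bytes_per_element = 16; // one std::complex<double>
    
    // DAXPY: 3 streams (read a, read b, write b) -> the "traffic: 2+1" comment.
    // Copy / transpose: 2 streams (read source, write destination).
    double bandwidth_gbps(std::size_t num_elements, std::size_t num_streams, double seconds)
    {
        const auto bytes = static_cast<double>(num_elements) * num_streams * bytes_per_element;
        return bytes / seconds * 1.0e-9; // decimal GB/s
    }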
Compiler and flags:
Intel compiler from oneAPI 2022.1.0, optimization flags: -O3;-ffast-math;-march=native;-qopt-zmm-usage=high.
Results (sizes assumed to be adequately large):
- Using 24 threads pinned on 24 cores (total size of both tensors ~10 GiB): 
 DAXPY 102 GB/s
 Copy 101 GB/s
 naive transpose 91 GB/s
 optimized transpose 93 GB/s
- Using 1 thread pinned on a single core (total size of both tensors ~10 GiB): 
 DAXPY 20 GB/s
 Copy 20 GB/s
 naive transpose 9.3 GB/s
 optimized transpose 9.3 GB/s
Questions:
- Why is my naive transpose function performing so well?
- Why is the performance gap between the reference kernels and the transpose functions so much larger when using only 1 thread?
I'm glad to receive any kind of input on any of the above questions, and I will gladly provide additional information when required. Unfortunately, I cannot provide a minimal reproducer because of the size and complexity of each benchmark program. Thank you very much for your time and help in advance!
Updates:
- Could it be that the Intel compiler performed loop blocking on the naive transpose function as an optimization?