I've written a naive and an "optimized" transpose function for order-3 tensors containing double-precision complex numbers, and I would like to analyze their performance.
Approximate code for the naive transpose function:
    #pragma omp for schedule(static)
    for (auto i2 = std::size_t(0); i2 < n2; ++i2)
    {
        for (auto i1 = std::size_t{}; i1 < n1; ++i1)
        {
            for (auto i3 = std::size_t{}; i3 < n3; ++i3)
            {
                tens_tr(i3, i2, i1) = tens(i1, i2, i3);
            }
        }
    }
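(The accessor tens(i1, i2, i3) is not shown; below is a minimal sketch of the kind of indexing I have in mind. The contiguous, row-major-like layout with i3 as the fastest-varying index is an assumption for illustration, not my exact tensor class.)
    // Minimal sketch of the element accessor (illustrative assumption:
    // contiguous storage with the last index i3 varying fastest).
    // With such a layout the naive loop reads tens(i1, i2, i3) contiguously
    // while writing tens_tr(i3, i2, i1) with a stride of n1 * n2 elements.
    #include <complex>
    #include <cstddef>
    #include <vector>
    
    struct tensor3
    {
        std::size_t n1, n2, n3;
        std::vector<std::complex<double>> data; // n1 * n2 * n3 elements
        
        std::complex<double>& operator()(std::size_t i1, std::size_t i2, std::size_t i3)
        {
            return data[(i1 * n2 + i2) * n3 + i3];
        }
    };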
Approximate code for the optimized transpose function (the remainder loops are not shown; assume the extents are divisible by block_size):
    #pragma omp for schedule(static)
    for (auto i2 = std::size_t(0); i2 < n2; ++i2)
    {        
        // blocked loop
        for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
        {
            for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
            {
                for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
                {
                    for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                    {
                        cache_buffer[i3 * block_size + i1] = tens(bi1 + i1, i2, bi3 + i3);
                    }
                }
                
                for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
                {
                    for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                    {
                        tens_tr(bi3 + i1, i2, bi1 + i3) = cache_buffer[i1 * block_size + i3];
                    }
                }
            }
        }
    }
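(Not shown above: cache_buffer is a thread-private scratch buffer of block_size * block_size elements. The sketch below illustrates the set-up I have in mind; the concrete block_size value is only an example, not necessarily the one used in the benchmarks.)
    // Illustrative sketch of the scratch-buffer set-up. With block_size = 16,
    // one block of complex<double> is 16 * 16 * 16 B = 4 KiB, i.e. well
    // within L1.
    #include <complex>
    #include <cstddef>
    #include <vector>
    
    constexpr std::size_t block_size = 16;
    
    void transpose_blocked(/* tens, tens_tr, extents n1, n2, n3, ... */)
    {
        #pragma omp parallel
        {
            // One scratch buffer per thread, allocated once, outside the loops.
            std::vector<std::complex<double>> buffer(block_size * block_size);
            auto* const cache_buffer = buffer.data();
            
            // ... the "#pragma omp for" blocked loops shown above go here,
            // staging each tile through cache_buffer ...
        }
    }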
Assumption: I decided to use a streaming kernel as the reference because I reasoned that a perfect implementation of the transpose would closely resemble any other bandwidth-saturating streaming function.
For this purpose, I chose the DAXPY loop as the first reference.
    #pragma omp parallel for schedule(static)
    for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
    {
        auto* slice_a = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
        auto* slice_b = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
        const auto slice_size = 2 * tens_a_->get_slice_size(); // 2 doubles for a complex
        
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < slice_size; ++index)
        {
            slice_b[index] += lambda_ * slice_a[index]; // fp_count: 2, traffic: 2+1
        }
    }
Also, I used a simple copy kernel as a second reference.
    #pragma omp parallel for schedule(static)
    for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
    {
        const auto* op1_begin = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
        const auto* op1_end = op1_begin + 2 * tens_a_->get_slice_size(); // 2 doubles in a complex
        auto* op2_iter = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
        
        #pragma omp simd safelen(8)
        for (auto* iter = op1_begin; iter != op1_end; ++iter, ++op2_iter)
        {
            *op2_iter = *iter;
        }
    }
Hardware:
- Intel(R) Xeon(R) Platinum 8168 (Skylake) with 24 cores @ 2.70 GHz and L1, L2 and L3 caches of 32 KiB (per core), 1 MiB (per core) and 33 MiB (shared), respectively.
- 48 GiB of memory @ 2666 MHz. Intel Advisor's roofline view reports a memory bandwidth of 115 GB/s.
Benchmarking: 20 warm-up runs followed by 100 timed experiments, each on newly allocated data that is "touched" beforehand so that page faults are not included in the measurements.
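(For reference, this is roughly how a measured time is converted into the GB/s values reported below. It is a simplified sketch and the helper name is made up; the key assumption is that store traffic is counted once per element, i.e. no extra write-allocate term is added.)
    // Simplified sketch of the GB/s accounting (assumption: every element is
    // counted once per read stream and once per write stream; no extra
    // write-allocate traffic is added for the stores).
    #include <cstddef>
    
    constexpr std::size_t bytes_per_element = 16; // one std::complex<double>
    
    // DAXPY: 3 streams (read a, read b, write b) -> the "traffic: 2+1" comment.
    // Copy / transpose: 2 streams (read source, write destination).
    double bandwidth_gbps(std::size_t num_elements, std::size_t num_streams, double seconds)
    {
        const auto bytes = static_cast<double>(num_elements) * num_streams * bytes_per_element;
        return bytes / seconds * 1.0e-9; // decimal GB/s
    }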
Compiler and flags:
Intel compiler from oneAPI 2022.1.0, optimization flags: -O3;-ffast-math;-march=native;-qopt-zmm-usage=high.
Results (sizes assumed to be adequately large):
- Using 24 threads pinned on 24 cores (total size of both tensors ~10 GiB): 
 DAXPY 102 GB/s
 Copy 101 GB/s
 naive transpose 91 GB/s
 optimized transpose 93 GB/s
- Using 1 thread pinned on a single core (total size of both tensors ~10 GiB): 
 DAXPY 20 GB/s
 Copy 20 GB/s
 naive transpose 9.3 GB/s
 optimized transpose 9.3 GB/s
Questions:
- Why is my naive transpose function performing so well?
- Why is the performance gap between the reference kernels and the transpose functions so much larger when using only 1 thread?
I'm glad to receive any kind of input on any of the above questions, and I will gladly provide additional information when required. Unfortunately, I cannot provide a minimal reproducer because of the size and complexity of each benchmark program. Thank you very much for your time and help in advance!
Updates:
- Could it be that the Intel compiler performed loop blocking on the naive transpose function as an optimization?