This is a follow-up to the question "How can I resolve data dependency in pointer arrays?".
I'll restate the problem from that question first:
If we have an array of integer pointers that all point to the same int, and we loop over it applying ++ through each pointer, the loop is 100% slower than when the pointers alternate between two different ints.
Here is a new version of the example code:
#include <vector>
using std::vector;

// Make sure each Cacheline object spans at least two cache lines
struct Cacheline {
    int c[128]{};
};
int main() {
    Cacheline d[4];
    vector<int*> f;
    f.resize(100000000);
    // case 1 : counting over the same location
    {
        for (auto i = 0ul; i < f.size(); ++i) {
            f[i] = d[i % 1].c;
        }
        /// this takes 200ms
        for (auto i = 0ul; i < f.size(); ++i) {
            ++*f[i];
        }
    }
    {
        // case 2 : two locations interleaved
        for (auto i = 0ul; i < f.size(); ++i) {
            f[i] = d[i % 2].c;
        }
        /// this takes 100ms
        for (auto i = 0ul; i < f.size(); ++i) {
            ++*f[i];
        }
    }
    ....
    // three locations take 90ms and four locations take 85ms
}
I understand that case 2 is faster because out-of-order execution kicks in and hides the latency of the data dependency between consecutive increments of the same location. I'm looking for a general way to optimize this by exploiting OoO execution. The method should have negligible pre-processing cost, because my use case involves dynamic workloads.
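For anyone who wants to reproduce the numbers, here is a minimal sketch of how one case could be set up and timed. The helper name `time_increments` and the use of `std::chrono::steady_clock` are assumptions made for illustration, not necessarily the harness behind the timings quoted above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same layout as in the question: each object spans several cache lines.
struct Cacheline {
    int c[128]{};
};

// Hypothetical helper (not from the original code): points f[i] at
// d[i % locations].c, then times only the increment loop.
// Returns the elapsed time in milliseconds.
double time_increments(std::vector<int*>& f, Cacheline* d, std::size_t locations) {
    assert(locations > 0);  // guard against modulo by zero

    // Setup phase, excluded from the measurement.
    for (std::size_t i = 0; i < f.size(); ++i) {
        f[i] = d[i % locations].c;
    }

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < f.size(); ++i) {
        ++*f[i];  // read-modify-write through the pointer
    }
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_increments(f, d, 1)` corresponds to case 1 and `time_increments(f, d, 2)` to case 2, with `d` declared as `Cacheline d[4]` and `f` sized as in the code above. The absolute times will depend on the CPU and the compiler flags.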
 
    