I'm an R developer who uses C for algorithmic work, and I have a question about why a C loop that seems like it should be slow is actually faster than the alternative approach.
In R, our Boolean type can actually take three values: true, false, and na, and we represent this with an int at the C level.
I'm looking into a vectorized && operation (yes, we have this in R already, but bear with me) that also handles the na case. The scalar results would look like this:
F && F == F
F && T == F
F && N == F
T && F == F
T && T == T
T && N == N
N && F == F
N && T == N
N && N == N
Note that it works like && in C, except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.
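For concreteness, here is the same table written as a tiny scalar C helper. The name and3 is purely illustrative, and it assumes the coding used in the script further down (0 for false, 1 for true, INT_MIN for na):

#include <limits.h>

// Illustrative scalar version of the truth table above,
// using the coding 0 = false, 1 = true, INT_MIN = na.
static int and3(int x, int y) {
    if (x == 0 || y == 0) {
        return 0;       // false dominates everything
    }
    if (x == INT_MIN || y == INT_MIN) {
        return INT_MIN; // otherwise na propagates
    }
    return 1;           // both true
}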
Now to the implementation. Assume we have two vectors, v_out and v_x, and we'd like to perform the vectorized && on them. We are allowed to overwrite v_out with the result. One option is:
// Option 1
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
        // Done
    } else if (elt_x == 0) {
        v_out[i] = 0;
    } else if (elt_out == na) {
        // Done
    } else if (elt_x == na) {
        v_out[i] = na;
    }
}
And another option is:
// Option 2
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];

    if (elt_out == 0) {
        continue;
    }

    int elt_x = v_x[i];

    if (elt_x == 0) {
        v_out[i] = 0;
    } else if (elt_out == na) {
        // Done
    } else if (elt_x == na) {
        v_out[i] = na;
    }
}
I sort of expected the second option to be faster, because it avoids accessing v_x[i] when it isn't required. But in fact it was twice as slow when compiled with -O2!
Using the reproducible script below, I get the timing results shown here (note that I am on a Mac and compile with Clang). At -O0 the results seem reasonable: the two options are about the same. But at -O2, Option 1 is about 2x faster!
Option 1, `clang -O0`: 0.110560
Option 2, `clang -O0`: 0.107710
Option 1, `clang -O2`: 0.032223
Option 2, `clang -O2`: 0.070557
What is going on here? My best guess is that it has something to do with the fact that in Option 1, v_x[i] is always accessed linearly, which is extremely fast. In Option 2, on the other hand, v_x[i] is accessed only intermittently, in a sort-of-random pattern: it might read v_x[10] and then not need another element of v_x until v_x[120], and because that access isn't linear, it is presumably much slower.
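One way I could imagine testing that guess (I have not benchmarked this variant, so treat it as a sketch) is to keep Option 2's early continue but make the load of v_x[i] unconditional, which would separate the effect of the skipped load from the effect of the extra branch:

// Option 2b (untested sketch): unconditional load, early skip kept
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i]; // always read, as in Option 1

    if (elt_out == 0) {
        continue;
    }

    if (elt_x == 0) {
        v_out[i] = 0;
    } else if (elt_out == na) {
        // Done
    } else if (elt_x == na) {
        v_out[i] = na;
    }
}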
Reproducible script:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>

int main() {
    srand(123);

    int size = 1e7;
    int na = INT_MIN;

    int* v_out = (int*) malloc(size * sizeof(int));
    int* v_x = (int*) malloc(size * sizeof(int));

    // Generate random numbers between 1-3
    // 1 -> false
    // 2 -> true
    // 3 -> na
    for (int i = 0; i < size; ++i) {
        int elt_out = rand() % 3 + 1;

        if (elt_out == 1) {
            v_out[i] = 0;
        } else if (elt_out == 2) {
            v_out[i] = 1;
        } else {
            v_out[i] = na;
        }

        int elt_x = rand() % 3 + 1;

        if (elt_x == 1) {
            v_x[i] = 0;
        } else if (elt_x == 2) {
            v_x[i] = 1;
        } else {
            v_x[i] = na;
        }
    }

    clock_t start = clock();

    // Option 1
    for (int i = 0; i < size; ++i) {
        int elt_out = v_out[i];
        int elt_x = v_x[i];

        if (elt_out == 0) {
            // Done
        } else if (elt_x == 0) {
            v_out[i] = 0;
        } else if (elt_out == na) {
            // Done
        } else if (elt_x == na) {
            v_out[i] = na;
        }
    }

    // // Option 2
    // for (int i = 0; i < size; ++i) {
    //     int elt_out = v_out[i];
    //
    //     if (elt_out == 0) {
    //         continue;
    //     }
    //
    //     int elt_x = v_x[i];
    //
    //     if (elt_x == 0) {
    //         v_out[i] = 0;
    //     } else if (elt_out == na) {
    //         // Done
    //     } else if (elt_x == na) {
    //         v_out[i] = na;
    //     }
    // }

    clock_t end = clock();
    double time = (double) (end - start) / CLOCKS_PER_SEC;

    free(v_out);
    free(v_x);

    printf("%f\n", time);

    return 0;
}
Based on a few questions in the comments, here are a few points of clarification for future readers:

- I am on a 2018 15-inch MacBook Pro with a 2.9 GHz 6-core Intel i9-8950HK (Coffee Lake).
- The particular Clang version I tested with is Apple clang version 13.1.6 (clang-1316.0.21.2.5), Target: x86_64-apple-darwin21.6.0.
- I am restricted by R to use `int` as the data type (even though there are more efficient options) and the following coding: `false = 0`, `true = 1`, `na = INT_MIN`. The reproducible example that I provided respects this.
- The original question was not actually a request to make the code run faster. I just wanted to get an idea of what the difference was between my two if/else approaches. That said, some answers have shown that branchless approaches can be much faster, and I really appreciate the explanations those users have provided! That has greatly influenced the final version of the implementation I am working on (a rough sketch of the idea is below).
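For future readers, here is a rough sketch of the kind of branchless formulation those answers pointed toward (my own paraphrase, not anyone's exact code, and not benchmarked here); the idea is that the ternaries compute from unconditionally evaluated predicates and can compile to conditional selects rather than branches:

// Branchless-style sketch (paraphrased, not benchmarked here):
// evaluate both predicates unconditionally, then select the result.
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    int either_false = (elt_out == 0) | (elt_x == 0);
    int either_na = (elt_out == na) | (elt_x == na);

    // false wins, then na propagates, otherwise true
    v_out[i] = either_false ? 0 : (either_na ? na : 1);
}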