About the only way to know for sure is to test. I'd have to agree that it would take a fairly clever compiler to produce output as efficient for:
if(n%2) // or (n%2==0) and flip the order
    n=n-1
else
    n=n+1
as it could for n ^= 1;, but I haven't checked anything similar recently enough to say with any certainty.
As for your second question, I doubt it makes any difference -- an equality comparison is going to end up fast for any of these methods. If you want speed, the main thing to do is avoid having a branch involved at all -- e.g. something like:
if (a == b)
    c += d;
can be written as:  c += d * (a==b);. Looking at the assembly language, the second will often look a bit messy (with ugly cruft to get the result of the comparison from the flags into a normal register) but still often perform better by avoiding any branches.
Edit: At least the compilers I have handy (gcc & MSVC), do not generate a cmov for the if, but they do generate a sete for the * (a==b). I expanded the code to something testable.
Edit2: Since Potatoswatter brought up another possibility using bit-wise and instead of multiplication, I decided to test that along with the others. Here's the code with that added:
#include <time.h>
#include <iostream>
#include <stdlib.h>
int addif1(int a, int b, int c, int d) { 
    if (a==b) 
        c+=d;
    return c;
}
int addif2(int a, int b, int c, int d) {
    return c += d * (a == b);
}
int addif3(int a, int b, int c, int d) {
    return c += d & -(a == b);
}
int main() {
    const int iterations = 50000;
    int x = rand();
    unsigned tot1 = 0;
    unsigned tot2 = 0;
    unsigned tot3 = 0;
    clock_t start1 = clock();
    for (int i=0; i<iterations; i++) {
        for (int j=0; j<iterations; j++) 
            tot1 += addif1(i, j, i, x);
    }
    clock_t stop1 = clock();
    clock_t start2 = clock();
    for (int i=0; i<iterations; i++) {
        for (int j=0; j<iterations; j++) 
            tot2 += addif2(i, j, i, x);
    }
    clock_t stop2 = clock();
    clock_t start3 = clock();
    for (int i=0; i<iterations; i++) {
        for (int j=0; j<iterations; j++) 
            tot3 += addif3(i, j, i, x);
    }
    clock_t stop3 = clock();
    std::cout << "Ignore: " << tot1 << "\n";
    std::cout << "Ignore: " << tot2 << "\n";
    std::cout << "Ignore: " << tot3 << "\n";
    std::cout << "addif1: " << stop1-start1 << "\n";
    std::cout << "addif2: " << stop2-start2 << "\n";
    std::cout << "addif3: " << stop3-start3 << "\n";    
    return 0;
}
Now the really interesting part: the results for the third version. For MS VC++, we get roughly what most of us would probably expect:
Ignore: 2682925904
Ignore: 2682925904
Ignore: 2682925904
addif1: 4814
addif2: 3504
addif3: 3021
Using & instead of * gives a definite improvement -- almost as large an improvement as * gives over the if. With gcc the result is quite a bit different though:
Ignore: 2680875904
Ignore: 2680875904
Ignore: 2680875904
addif1: 2901
addif2: 2886
addif3: 7675
In this case, the code using if is much closer to the speed of the code using *, but the code using & is slower than either one -- a lot slower! In case anybody cares, I found this surprising enough that I re-compiled a couple of times with different flags, re-ran a few times with each, and so on, and the result was entirely consistent -- the code using & was always considerably slower.
The poor result from the third version of the code compiled with gcc brings us back to what I said to start with [and ends this edit]:
As I said to start with, "the only way to know for sure is to test" -- but at least in this limited testing, the multiplication consistently beats the if. There may be some combination of compiler, compiler flags, CPU, data pattern, iteration count, etc., that favors the if over the multiplication -- there's no question that the difference is small enough that a test going the other direction is entirely believable. Nonetheless, I believe that it's a technique worth knowing; for mainstream compilers and CPUs, it seems reasonably effective (though it's certainly more helpful with MSVC than with gcc).
[resumption of edit2:] the result with gcc using & demonstrates 1) how compiler-specific micro-optimizations can be, and 2) how much real-life results can differ from expectations.