Why xor eax, eax?
Written by me, proof-read by an LLM.
Details at the end.
In one of my talks on assembly, I show a list of the 20 most executed instructions on an average x86 Linux desktop. All the usual suspects are there, mov, add, lea, sub, jmp, call and so on, but the surprise interloper is xor: “eXclusive OR”. In my 6502 hacking days, the presence of an exclusive OR was a sure-fire indicator that you'd found either the encryption part of the code or some kind of sprite routine. It’s surprising, then, that a Linux machine just minding its own business would be executing so many.
That is, until you remember that compilers love to emit an xor when setting a register to zero:
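A minimal sketch of the kind of code in question, assuming a hypothetical function called return_zero (the name is illustrative, not taken from the post). On x86-64 an int result is returned in eax, so the compiler has to get a zero into that register somehow:

```c
/* Hypothetical stand-in example; the function name is an assumption. */
int return_zero(void) {
    return 0; /* on x86-64 the int return value is placed in eax */
}
```

Compiling something like this with optimizations enabled is one way to see which zeroing sequence your compiler picks.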
We know that exclusive-OR-ing anything with itself produces zero, but why does the compiler emit this sequence? Is it just showing off?
In the example above, I’ve compiled with -O2 and enabled Compiler Explorer’s “Compile to binary object” so you can see the machine code that the CPU actually sees:
31 c0 xor eax, eax
c3 ret
If you change GCC’s optimization level down to -O1, you’ll see:
b8 00 00 00 00 mov eax, 0x0
c3 ret
The much clearer, more intention-revealing mov eax, 0 takes five bytes to set the EAX register to zero, compared to two for the exclusive OR. By using a slightly more obscure instruction, we save three bytes every time we set a register to zero, which is a very common operation. Saving bytes makes the program smaller and makes more efficient use of the instruction cache.
It gets better, though! Since this is a very common operation, x86 CPUs spot this “zeroing idiom” early in the pipeline and can specifically optimize around it: the out-of-order tracking system knows that the value of “eax” (or whichever register is being zeroed) does not depend on the previous value of eax, so it can allocate a fresh, dependency-free zeroed register renamer slot. Having done that, it removes the operation from the execution queue: the xor takes zero execution cycles! It is essentially optimized out by the CPU!
You may wonder why you see xor eax, eax but never xor rax, rax (the 64-bit version), even when returning a long:
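A minimal sketch of such a function, assuming x86-64 Linux, where long is 64 bits wide and is returned in rax. The name return_zero_long is hypothetical:

```c
/* Hypothetical example; assumes a platform where long is 64 bits
   and is returned in rax (e.g. x86-64 Linux). */
long return_zero_long(void) {
    return 0; /* the whole 64-bit rax must end up zero */
}
```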
In this case, even though rax is required to hold the full 64-bit long result, by writing to eax we get a nice effect: unlike other partial register writes, when writing to an e register like eax, the architecture zeroes the top 32 bits for free. So xor eax, eax sets all 64 bits to zero.
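The rule can be modelled in plain C. This is only a sketch of the architectural behaviour, not real register access: a uint64_t stands in for rax, and the helper names write_low32 and write_low16 are made up for illustration. A 32-bit write discards the old upper half, whereas a 16-bit write (such as one to ax) preserves everything above the bits it touches:

```c
#include <stdint.h>

/* Model of writing a 32-bit sub-register (e.g. eax): the upper
   32 bits of the full register are cleared, so the old value
   does not matter. */
static uint64_t write_low32(uint64_t old_value, uint32_t v) {
    (void)old_value;        /* previous contents are irrelevant */
    return (uint64_t)v;     /* zero-extended to 64 bits */
}

/* Model of writing a 16-bit sub-register (e.g. ax): bits 16..63
   of the old value are preserved. */
static uint64_t write_low16(uint64_t old_value, uint16_t v) {
    return (old_value & ~UINT64_C(0xFFFF)) | v;
}
```

In this model, zeroing through the 32-bit helper clears the whole register, while zeroing through the 16-bit helper leaves the old upper bits behind.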
Interestingly, when zeroing the “extended” numbered registers (like r8), GCC still uses the d (double width, i.e. 32-bit) variant:
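One way to provoke this, sketched under the assumption of the System V x86-64 calling convention (used on Linux), where the fifth integer argument travels in r8. The function names are hypothetical. The callee is defined here only so the snippet links and runs; an optimizer that can see its body may fold the call away entirely, so to inspect the call site in Compiler Explorer it is safer to leave only a declaration of the callee:

```c
/* Hypothetical callee. Under the System V x86-64 ABI the fifth
   integer argument (e) is passed in r8. */
#if defined(__GNUC__)
__attribute__((noinline)) /* GCC/Clang extension; discourages inlining */
#endif
long take_five(long a, long b, long c, long d, long e) {
    return a + b + c + d + e;
}

/* The caller passes a constant zero as the fifth argument, so the
   compiler needs to zero r8 before the call (if the call survives). */
long call_with_zero_fifth(void) {
    return take_five(1, 2, 3, 4, 0);
}
```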
Notice how it is xor r8d, r8d (the 32-bit variant) even though, with the REX prefix (here 45), it would be the same number of bytes to xor r8, r8 at full width. This probably makes something easier in the compilers, as Clang does this too.
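To make the byte counts concrete, here is a small sketch listing the encodings as C arrays. The mov eax, 0x0 and xor eax, eax bytes and the 45 prefix come from the listings discussed above; the encodings for xor rax, rax and xor r8, r8 are my addition, based on the standard REX prefix rules, and the array names are illustrative:

```c
/* Machine-code bytes for several ways of zeroing a register. */
static const unsigned char mov_eax_0[]   = {0xb8, 0x00, 0x00, 0x00, 0x00}; /* mov eax, 0x0  */
static const unsigned char xor_eax_eax[] = {0x31, 0xc0};                   /* xor eax, eax  */
static const unsigned char xor_rax_rax[] = {0x48, 0x31, 0xc0};             /* xor rax, rax: REX.W prefix */
static const unsigned char xor_r8d_r8d[] = {0x45, 0x31, 0xc0};             /* xor r8d, r8d: REX.R+B prefix */
static const unsigned char xor_r8_r8[]   = {0x4d, 0x31, 0xc0};             /* xor r8, r8: REX.W+R+B prefix */
```

Taking sizeof of each array gives 5, 2, 3, 3 and 3 bytes respectively: the 64-bit xor rax, rax costs an extra prefix byte over xor eax, eax, while for r8 a prefix is needed either way.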
xor eax, eax saves you code space and execution time! Thanks, compilers!
Watch the video accompanying this post.
This post marks the first day of Compiler Optimization 2025, a 25-day series exploring how compilers change our code.
This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.
Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.