A few days ago I was thinking about what you can do when the branch predictor is effectively working against you, slowing your program down instead of speeding it up.
Let’s work with something relatively simple and concrete: say we are writing some kind of financial system (perhaps a trading system) in which every transaction request reaches a certain function before it is either (a) sent on to some server, or (b) abandoned. Let’s also assume that:
- The vast majority of our transaction requests end up abandoned at this last stage.
- We care a lot about the speed of the ‘send’ path and want to make it as fast as possible.
- We don’t care at all about the speed of the ‘abandon’ path.
The code will look approximately like this:
```cpp
// Reconstructed sketch of the function described above.
void process(const Transaction& t) {
    if (should_send(t)) {
        send(t);      // hot path: we want this as fast as possible
    } else {
        abandon(t);   // cold path, but taken for the vast majority of requests
    }
}
```
The implication of assumption #1 is that the branch predictor will be primed to predict should_send as false. Since we only care about the speed of the send() path, I was wondering whether there is a way to tell the CPU that we don't want to rely on the branch predictor at all when deciding between send() and abandon(): we always want it to assume that send() will be executed.
A low-level solution?
I asked Claude whether there is a way to natively hard-code branch prediction rules into machine code, and the answer was that there is no way to do this on x86, but there is on ARM: the BEQP ('branch, predict taken') and BEQNP ('branch, predict not taken') instructions.
Those ARM instructions are pure hallucination, and the reality is actually the other way around: ARM has no way to hard-code 'predictions', but x86 does. More precisely: some old x86 processors do. On the Pentium 4 series, those hints are encoded as instruction prefixes: 0x2E (branch not taken) and 0x3E (branch taken).
So if a jump instruction carries the 0x3E prefix, the processor will assume the branch is taken. On modern x86 processors, however, those prefixes are simply ignored, so compilers won't bother generating them when you target such a CPU.
Another 'low-level' approach that doesn't work here is the [[likely]] and [[unlikely]] attributes introduced in C++20. These attributes usually make the compiler rearrange labels/paths so that the path marked [[likely]] needs fewer jumps. In our case it makes no difference: whether we mark the send() branch as [[likely]] or leave the code as it is, clang and gcc generate the same assembly:
```asm
; Illustrative sketch of the codegen, not exact compiler output.
process(Transaction const&):
        call    should_send(Transaction const&)
        test    al, al
        je      .L_abandon        ; only the abandon() path needs a jump
        call    send(Transaction const&)   ; send() is the fall-through path
        ret
.L_abandon:
        call    abandon(Transaction const&)
        ret
```
It doesn't really matter that the send() path requires no jump, since the branch predictor will still be primed to expect the abandon() branch. If you are writing code for modern x86 processors, keep in mind that [[likely]] and [[unlikely]] are completely unrelated to the CPU's branch predictor; this microarchitecture offers no way to override its predictions, so these attributes are purely a code-layout mechanism. However, as stated in the proposal document, on some other architectures the compiler can use likely/unlikely hints to override the branch predictor:
Some potential code generation improvements from using branch probability hints include:
- […]
- Some microarchitectures, such as Power, have branch hints that can override dynamic branch prediction.
A high-level solution?
So if we can't bypass the branch predictor with some granular, guaranteed mechanism on modern x86 processors, and we don't want to buy a bunch of Pentium 4 CPUs and run our code on them, we need to think about a higher-level solution that works well enough; it doesn't need to be guaranteed to work 100% of the time.
One such solution that I know of is the one Carl Cook described in his CppCon 2017 talk: we can constantly feed our system fake transaction data for which should_send returns true, so that the send() path keeps being executed (possibly only partially, up to the point where the fake can be detected) and the branch predictor stays primed to predict that should_send is true.
As for getting rid of those fake transactions after they are 'sent' from our program, Carl Cook mentions that network cards are able to recognize and discard such messages without adding significant overhead to real transactions, so we can leave that job to the NIC.
As he explains in the talk, this whole 'fake/dummy transaction' scheme saves them around 5 microseconds, which, as the title of the talk suggests, matters a lot for his use case.
Thanks for reading!
You can check out the r/cpp discussion about this blog post here.
References: