Shrinking while linking – Tweag

If you're worried about the size of your binary, there's lots of useful advice on the internet to help you reduce it. In my experience, though, people are reluctant to discuss the size of their static libraries. If static libraries come up at all, you'll be told not to worry about their size: dead code will be optimized out when linking the final binary, and the final binary's size is all that matters.

But that advice didn’t help me, because I wanted to distribute a static library, and its size was the problem. Specifically, I had a Rust library that I wanted to make available to Go developers. Both Rust and Go can interoperate with C, so I compiled the Rust code into a C-compatible library and wrote a small Go wrapper package around it. Like most pre-compiled C libraries, it can be distributed as either a static or a dynamic library. Go developers have become accustomed to static linking, which produces self-contained binaries that are refreshingly easy to deploy. Bundling a pre-compiled static library with our Go package lets Go developers simply

go get https://github.com/nickel-lang/go-nickel

and get to work. Dynamic libraries, on the other hand, mean runtime dependencies, linker paths, and installation instructions.

So I really wanted to go the static route, even if it meant a modest size penalty. How big a penalty are we talking about, anyway?

❯ ls -sh target/release/
132M libnickel_lang.a
15M  libnickel_lang.so

😳 Okay, that’s too much. Even if I could make my peace with a 132MB library, it far exceeds GitHub’s 50MB file size limit. (To be honest, even the 15MB shared library seems big to me; we haven’t put much effort into optimizing for code size yet.)

Compilation Process in Brief

Classically, your compiler or assembler turns each source file into an “object” file containing compiled code. To allow source files to call functions defined in other source files, each object file declares a list of functions that it defines, and a list of functions that it dearly hopes someone else will define. Then you run a linker: a program that takes all those object files and mashes them together into a binary, matching up hoped-for functions with actual function definitions, or yelling “undefined symbol” if it can’t. Modern compiled languages tweak this pipeline slightly (Rust, for example, generates roughly one object file per crate rather than one per source file), but the fundamentals haven’t changed much.

A static library is nothing but a bundle of object files, wrapped in an ancient and never-quite-standardized archive format. No linker is involved in building a static library: that part is deferred until the static library is eventually linked into a binary. The unfortunate consequence is that a static library contains a lot of information we don’t want. For starters, it includes all the code in all of our dependencies, even if most of that code is unused. If you compiled your code with support for link-time optimization (LTO), it includes one more copy of all our code and all of our dependencies’ code (as LLVM bitcode; more on that later). And then, because it contains so much unnecessary code, it also includes a pile of metadata (section headers) whose job is to help the linker remove that unnecessary code later. The underlying reason for all this is that extra fluff in object files is generally not considered a problem: it gets removed when linking the final binary (or shared library), and that’s all most people care about.

Relinking with ld

I wrote above that a linker takes a bunch of object files and mashes them together into a single binary. Like everything in the previous section, this was an oversimplification: if you pass the --relocatable flag to your linker, it will still mash your object files together, but it will write the result as another object file instead of a binary. And if you also pass the --gc-sections flag, it will remove unused code while it’s at it.

This gives us our first strategy for shrinking a static library:

  • Unpack the archive, recovering all the object files.
  • Link them together into one big object file, removing unused code. In this step we need to tell the linker which code is used; it will then remove anything that isn’t reachable from the used code.
  • Pack that single object file back into a static library.

# Unpack the archive, recovering all the object files
ar x libnickel_lang.a

# Link them into one big object file; each -u marks a symbol as used,
# so only code reachable from these entry points is kept
ld --relocatable --gc-sections -o merged.o *.o -u nickel_context_alloc -u nickel_context_free ...

# Pack the merged object back into a static library
ar rcs libsmaller_nickel_lang.a merged.o

This helped a bit: the archive size went from 132 MB to 107 MB. But there is clearly still room for improvement.

Examining our merged object file with the size command, the largest section by far, weighing in at 84MB, is .llvmbc. Remember when I wrote that we’d come back to LLVM bitcode? Well, when you compile something with LLVM (and the Rust compiler uses LLVM), it first translates the source code into an intermediate representation, then translates that intermediate representation into machine code, and then writes both the intermediate representation and the machine code into the object file. It keeps the intermediate representation around because it contains information that’s useful for further optimization at link time. Useful as that information may be, 84MB of it is not. Out it goes:

objcopy --remove-section .llvmbc merged.o without_llvmbc.o

The next largest sections contain debug information. They might be useful, but we’ll strip them for now, to see how small we can get.

strip --strip-unneeded without_llvmbc.o -o stripped.o

There are no huge sections left at this point, but there are more than 48,000 small ones. It turns out that the Rust compiler puts each function into its own little section within the object file. It does this to help the linker remove unused code: remember the --gc-sections argument to ld? It removes unused sections, so if the sections are small, unused code can be removed at a fine granularity. But we’ve already removed the unused code, and each of those 48,000 section headers is taking up space.

To fix this, we write a linker script that tells ld to merge sections together. The meaning of the different sections isn’t important here: the point is that we’re merging sections with names like .text._ZN11nickel_lang4Expr7to_json17h and .text._ZN11nickel_lang4Expr7to_yaml17h into one big .text section.


SECTIONS
{
  .text :
  {
    *(.text .text.*)
  }

  .rodata :
  {
    *(.rodata .rodata.*)
  }

  
}

And we use it like this:

ld --relocatable --script merge.ld stripped.o -o without_tiny_sections.o

Let’s take stock of what we did and how much each step helped:

step                         size
original                     132 MB
linked with --gc-sections    107 MB
removed .llvmbc               33 MB
stripped                      25 MB
merged sections               19 MB

It’s probably possible to push this further, but it’s already a big improvement: we got rid of over 85% of the original size!

However, the last two steps did cost us something. Stripping the debug information can make backtraces less useful, and merging sections removes the ability of future linking steps to strip unused code from the final binaries. In our case, our library has a relatively small and coarse API; I checked that as soon as you use one non-trivial function, less than 150KB of dead code remains. But you’ll have to decide for yourself whether these costs are worth the size reduction.

More portability with LLVM bitcode

I was quite pleased with the result of the previous section, until I tried to port it to macOS: it turns out that the macOS linker doesn’t support --gc-sections. (It has a -dead_strip option, but that’s incompatible with --relocatable, because apparently nobody cares about code size unless they’re building a binary.) After drafting this post but before publishing it, I found a nice post on shrinking macOS static libraries using the toolchain from Xcode. I’m no macOS expert, so maybe I’m holding it wrong, but using those tools I’ve only been able to get down to about 25MB (after stripping). (If you know how to do better, let me know!)

But there’s another way! Remember that we had two copies of all our code: the LLVM intermediate representation and the machine code. Last time, we threw out the intermediate representation and worked with the machine code. Since I don’t know how to massage machine code on macOS, this time we’ll work with the intermediate representation instead.

The first step is to extract the LLVM bitcode and throw away everything else. (On macOS the section is named __LLVM,__bitcode instead of .llvmbc as it was on Linux.)

for obj_file in ./*.o; do
  llvm-objcopy --dump-section=__LLVM,__bitcode="$obj_file.bc" "$obj_file"
done

Then we combine all the small bitcode files into one huge file:

llvm-link -o merged.bc ./*.bc

Then we remove unused code by telling LLVM which functions make up the public API. We tell it to “internalize” every function that isn’t on the list, and then to remove any code that isn’t reachable from a public function (the “dce” in “globaldce” stands for “dead code elimination”).

opt \
  --internalize-public-api-list=nickel_context_alloc,... \
  --passes='internalize,globaldce' \
  -o small.bc \
  merged.bc

Finally, we recompile the result into an object file and pack it into a static library. llc translates the LLVM bitcode back into machine code, so the resulting object file can be consumed by non-LLVM toolchains.

llc --filetype=obj --relocation-model=pic small.bc -o small.o
ar rcs libsmaller_nickel_lang.a small.o

The result is a 19MB static library, the same size as the other workflow produced. Note that we don’t need the section-merging step here, because we never asked llc to generate one section per function.

Dragonfire

Shortly after drafting this post, I learned about Dragonfire, a recently released and wonderfully named tool for shrinking static libraries by stripping and deduplicating object files. I don’t think the techniques in this post can be combined with it for additional savings, since you can’t both deduplicate and merge object files. (I suppose in theory you could deduplicate some and merge others, if you have very specific needs.) But it’s a great read, and I was gratified to learn that someone else shares my huge-Rust-static-library concerns.

Conclusion

We looked at two ways to significantly reduce the size of a static library: one using classic binutils like ld and objcopy, and the other using LLVM-specific tools. They produce outputs of similar size, but as with everything in life, there are tradeoffs. The “classic” binutils approach works with both GNU binutils and the LLVM equivalents, and it’s significantly faster: a few seconds, compared to a minute or more for the LLVM approach, which has to recompile everything from the intermediate representation back to machine code. The binutils approach should also work on any static library, not just ones built with LLVM-based toolchains.

On the other hand, the LLVM approach works on macOS (as well as Linux, Windows, and probably others). For that reason alone, it’s how we’ll build our static libraries for Nickel.


