Reproducible C++ builds by logging git hashes

Sometimes I find myself in a tricky situation: I’ve written a program that writes some kind of output to disk, and I want to remember which version of the program produced that output. This is actually the norm for me at the moment because of my research, which always involves a lot of trial-and-error algorithm design. I think similar problems exist in all sorts of other areas, but especially during rapid development, because once software has been properly deployed and versioned it’s trivial to simply put the version number in the log.

For a slightly more (but not too much more) concrete example: I’m working on an algorithm implementation right now. I won’t say too much about the details, but it takes several configuration options which, of course, I can very easily write to the log file. There are also a lot of implementation details in the program that can be changed, literally dozens of things, and I keep coming up with new ideas I want to try. This means that I have a folder full of output generated by code that probably doesn’t even exist anymore, and which I can’t fully reproduce by running the current version of the program with the configuration options recorded in the log file.

This is also not the first time that I have faced a similar problem. I think (hope) it’s not just me, so I thought I’d write up the solution I came up with.

Git commit hashes

As you probably know, you can identify Git commits by their hashes, which are long strings of hexadecimal digits, such as b5a994c260105b7cc979aead986532b51c37df75. Specifically, they are 40 characters long, and are the SHA-1 hash of the commit object, which covers the snapshot of the repository’s contents along with the commit metadata and parent commits.

My idea is very simple: tell the program to write the hash of the current commit to the log file. Then, looking at any log file, I can see the commit used to generate it, and go back into the Git history to see what my code was doing at that point.

Basic implementation

How do we get the hash into the logs? A super easy but wrong approach would be to invoke git directly from my program, retrieve the hash of the current commit, and write it to the log file. However, this doesn’t work, because it would give us the state of the Git repository at runtime, whereas we want to know which commit was checked out when the code was compiled.

We really need to integrate the commit hash into the build system. Since I’m writing my code in C++, the natural way to provide compile-time information like this is a #define, so let’s start by writing a script that creates a C++ header file containing one:

#!/usr/bin/bash
commit_hash=$(git rev-parse HEAD)
echo "#pragma once"
echo "#define GIT_COMMIT_HASH \"${commit_hash}\""

Quite simple: we’re just defining a macro, GIT_COMMIT_HASH, as a string literal containing whatever git rev-parse HEAD prints, which is the hash of the currently checked-out commit. I’m sure there is a more “proper” C++ way of defining a compile-time literal like this, with “proper” type checking or something, but #define is good enough.
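For what it’s worth, one possible shape for that more “proper” typed definition is a constexpr std::string_view wrapped around the macro. This is only a sketch: the build_info namespace and the commit_hash name are made up, and the fallback #define is there just so the snippet compiles on its own without the generated header.

```cpp
#include <string_view>

// Stand-in so this sketch compiles on its own; in the real build the macro
// would come from the generated git_info.h instead.
#ifndef GIT_COMMIT_HASH
#define GIT_COMMIT_HASH "b5a994c260105b7cc979aead986532b51c37df75"
#endif

namespace build_info {  // hypothetical namespace, not part of the original code

// Typed, scoped view of the macro's string literal (inline variables need C++17).
inline constexpr std::string_view commit_hash = GIT_COMMIT_HASH;

// A full SHA-1 hash is 40 hex digits, so anything shorter means the
// generator script produced something unexpected.
static_assert(commit_hash.size() >= 40, "expected a full 40-character hash");

}  // namespace build_info
```

The rest of the program could then refer to build_info::commit_hash and never touch the macro directly, and the static_assert would fail the build if the header ever contained an empty or truncated hash.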

The final step is to actually, somehow, run this script on every compile. For reference (because I always forget), CMAKE_BINARY_DIR is the top-level build directory, i.e. the place where you run cmake from, which for me is /build, and CMAKE_SOURCE_DIR is the root of the project, i.e. where the top-level CMakeLists.txt is. I added the following to CMakeLists.txt:

add_custom_target(git_info ALL
  COMMAND scripts/gen_git_info.sh > ${CMAKE_BINARY_DIR}/git_info.h
  WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
  COMMENT "Generating git info header"
)

add_dependencies(my_program git_info)
include_directories(${CMAKE_BINARY_DIR})

This target simply tells cmake that I want to run the specified command, which runs our script and writes the output to a new header file in the build directory. Since I’m writing the header file to the build directory, I need to add it as an include directory for my program. Of course, if you’re using a plain Makefile, you’ll need another method. It’s probably even simpler: maybe make a phony target that runs the script to build git_info.h, and make it a dependency.

From C++ it’s very simple:

#include <iostream>
#include <string>
#include "git_info.h"
// ... then, somewhere near the start of main():
std::string git_info = GIT_COMMIT_HASH;
std::cout << "git_commit_hash: " << git_info << "\n";

Good! In my particular program, I redirect stdout to my log file, so this is enough for me…

…Almost.

Uncommitted code?

Most of you will have noticed by now that there is a problem here. It assumes, or hopes, that I am always compiling committed code. This is certainly not always the case during rapid development, though I can certainly force myself to only run “proper” experiments using code that I have actually committed. Still, I don’t want to mislead myself into thinking that the output of some “improper” experiment came from code compiled directly from a particular commit.

The fix I chose is the simplest possible option: if the code being compiled has uncommitted changes, I append “-dirty” to the commit hash:

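#!/usr/bin/bash
commit_hash=$(git rev-parse HEAD)
# Compare the working tree against HEAD so that staged changes count as dirty too.
# Note that untracked files are not detected by git diff.
dirty=$(git diff --quiet HEAD || echo "-dirty")
echo "#pragma once"
echo "#define GIT_COMMIT_HASH \"${commit_hash}${dirty}\""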

And to make everything extra clear (since I don’t want to forget to commit my code if I want to run a “proper” experiment), I can add the following:

// std::string::ends_with requires C++20
if (git_info.ends_with("dirty")) {
  std::cout << "note: you're running a build with non-committed changes, "
               "which may limit reproducibility\n";
}

The way my code works, this cout runs before stdout starts being redirected to the log file, so I see the warning on the command line, allowing me to immediately stop and recompile if I want. Or I can run it anyway, if I don’t care.
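One caveat: ends_with is a C++20 member of std::string. If a project is stuck on C++17, a small helper along these lines might do the same job. This is a sketch, not the code I actually use; is_dirty_build is a made-up name, and it assumes the “-dirty” suffix produced by the script above.

```cpp
#include <string_view>

// Hypothetical helper: true if `hash` ends with the "-dirty" marker that the
// generator script appends for builds with uncommitted changes.
constexpr bool is_dirty_build(std::string_view hash) {
    constexpr std::string_view suffix = "-dirty";
    // Compare the last suffix.size() characters, guarding against short input.
    return hash.size() >= suffix.size() &&
           hash.substr(hash.size() - suffix.size()) == suffix;
}
```

The warning check would then read if (is_dirty_build(git_info)) { ... }, since a std::string converts implicitly to std::string_view.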

Improvement

This system works great for me. It doesn’t have to be that professional, as it is just a research project. I doubt anyone else will look at the log files, let alone the code. However, it can definitely be improved.

Mainly, it would be nice to actually record which files are dirty in the case of a dirty build. This would again mean defining a new macro, this time holding a list of those files. It could even be defined as a C++ vector for easy printing!

Similarly, we only care about changes that modify source code. There are some other files in my repository, like some Python scripts to plot the output, and also some configuration files for other things. If these are changed, I don’t want to claim that this is a dirty build. It would be a bit more work to handle this, but if I have a list of dirty files, I can simply check whether any of them are in, for example, the src/ or include/ directories.
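To make these two ideas a bit more concrete, here is a rough sketch of what the C++ side might look like. Everything in it is hypothetical: I’m assuming the generator script would emit a GIT_DIRTY_FILES macro as a braced list of string literals (built, for example, from the output of git diff --name-only HEAD), the file names are invented, and is_source_path and has_dirty_sources are names I made up.

```cpp
#include <algorithm>
#include <string_view>
#include <vector>

// Stand-in for a macro that the generator script could emit; the file names
// are invented for illustration.
#ifndef GIT_DIRTY_FILES
#define GIT_DIRTY_FILES {"src/solver.cpp", "scripts/plot.py"}
#endif

// Hypothetical helper: does `path` live under src/ or include/?
inline bool is_source_path(std::string_view path) {
    // rfind(prefix, 0) == 0 is the pre-C++20 spelling of starts_with(prefix).
    return path.rfind("src/", 0) == 0 || path.rfind("include/", 0) == 0;
}

// A build only counts as dirty if at least one modified file is source code.
inline bool has_dirty_sources(const std::vector<std::string_view>& files) {
    return std::any_of(files.begin(), files.end(), is_source_path);
}

// The macro expands to a braced list, so it can initialise a vector directly.
inline const std::vector<std::string_view> dirty_files = GIT_DIRTY_FILES;
```

With something like this, the warning could be printed only when has_dirty_sources(dirty_files) is true, and the log could list each entry of dirty_files with a simple loop.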

Ultimately, this could even lead to saving richer information, like the diffs of the dirty files (so I can reproduce dirty builds), and even library version numbers.

But for now, it’s good enough.


