AMD GPU Debugger · Abdelhadi


I’ve always wondered why we don’t have a GPU debugger similar to the one used for CPUs. A tool that allows pausing execution and examining the current state. This capability feels essential, especially since the GPU’s concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD’s ROCm environment. Unfortunately, its scope is limited to that environment. Still, this shows it’s technically possible. I then found a helpful series of blog posts by Marcell Kiss, detailing how he achieved this, which inspired me to try to recreate the process myself.

The best place to start learning about this is RADV. By tracing what it does, we can figure out how to do it ourselves. Our goal here is to run the most basic shader, a single nop 0, without using Vulkan, aka RADV in our case.

First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open("/dev/dri/cardX"). Tracing RADV, we find that it then calls amdgpu_device_initialize, a function defined in libdrm, the library that acts as middleware between user-mode drivers (UMD) like RADV and kernel-mode drivers (KMD) like the amdgpu driver. When we want to do some actual work, we have to create a context, which is done by calling amdgpu_cs_ctx_create, again from libdrm. Next up, we need to allocate two buffers: one for our code and the other for writing our commands into. We do this by calling a couple of functions; here's how I do it:
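Here's a minimal sketch of those calls using libdrm's amdgpu wrapper; the sizes, alignment, and GTT heap choice are illustrative, and error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

int main(void)
{
    /* Open the DRM node and initialize the libdrm device handle. */
    int fd = open("/dev/dri/cardX", O_RDWR);
    uint32_t major, minor;
    amdgpu_device_handle dev;
    amdgpu_device_initialize(fd, &major, &minor, &dev);

    /* Create a context for our submissions. */
    amdgpu_context_handle ctx;
    amdgpu_cs_ctx_create(dev, &ctx);

    /* Allocate two buffers: one for shader code, one for PM4 commands. */
    struct amdgpu_bo_alloc_request req = {
        .alloc_size = 4096,
        .phys_alignment = 4096,
        .preferred_heap = AMDGPU_GEM_DOMAIN_GTT,
    };
    amdgpu_bo_handle code_bo, cmd_bo;
    amdgpu_bo_alloc(dev, &req, &code_bo);
    amdgpu_bo_alloc(dev, &req, &cmd_bo);

    /* Map them for the CPU so we can fill them... */
    void *code_cpu, *cmd_cpu;
    amdgpu_bo_cpu_map(code_bo, &code_cpu);
    amdgpu_bo_cpu_map(cmd_bo, &cmd_cpu);

    /* ...and give each one a GPU virtual address. */
    amdgpu_va_handle code_va_h, cmd_va_h;
    uint64_t code_va, cmd_va;
    amdgpu_va_range_alloc(dev, amdgpu_gpu_va_range_general, 4096, 4096, 0,
                          &code_va, &code_va_h, 0);
    amdgpu_bo_va_op(code_bo, 0, 4096, code_va, 0, AMDGPU_VA_OP_MAP);
    amdgpu_va_range_alloc(dev, amdgpu_gpu_va_range_general, 4096, 4096, 0,
                          &cmd_va, &cmd_va_h, 0);
    amdgpu_bo_va_op(cmd_bo, 0, 4096, cmd_va, 0, AMDGPU_VA_OP_MAP);

    return 0;
}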

Now we have the context and two buffers. Next, we fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing.

Let's compile our code. We can use the clang assembler for that, like this:

The bash script compiles the code. We're only interested in the actual machine code, so we use objdump to figure out the offset and size of the text section and copy it into a new file called asmc.bin. Then we can just load that file and write its bytes to the CPU-mapped address of the code buffer.
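That last step, loading the extracted bytes into the code buffer, is just a read plus memcpy into the CPU mapping from earlier (asmc.bin is the file produced by the script above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read asmc.bin (the raw machine code extracted from the object file) and
 * copy it into the CPU mapping of the code buffer. */
static size_t load_shader(const char *path, void *code_cpu, size_t max_size)
{
    FILE *f = fopen(path, "rb");
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    rewind(f);

    void *tmp = malloc(size);
    fread(tmp, 1, size, f);
    fclose(f);

    memcpy(code_cpu, tmp, size < max_size ? size : max_size);
    free(tmp);
    return size;
}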

Next up, filling in the commands. This was extremely confusing for me because it’s not well documented.
It was mostly a matter of learning how RADV does things and trying to do something similar. Also, shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck. The commands are encoded in a special format called PM4 packets, which come in multiple types. We only care about Type 3: each packet has an opcode and a count of the dwords that follow the header.

The first thing we need to do is program the GPU registers, then dispatch the shader. Among those registers are rsrc[1-3], which control a number of shader configuration options; pgm_[lo/hi], which hold the pointer to the code buffer; and num_thread_[x/y/z], which set the number of threads inside a workgroup. All of those are set using set-shader-register packets, and here is how to encode them:
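Here's a sketch of how such a packet can be built. The opcode (0x76) and the 0xB000 base of the shader register space are the values Mesa's sid.h uses (PKT3_SET_SH_REG and SI_SH_REG_OFFSET), so double-check them, and the individual register addresses, for your GPU generation:

#include <stdint.h>

/* PM4 type-3 header: bits [31:30] = type (3), [29:16] = number of dwords in
 * the body minus one, [15:8] = opcode, bit 0 = predicate. */
static uint32_t pkt3(unsigned opcode, unsigned count)
{
    return (3u << 30) | ((count & 0x3FFF) << 16) | ((opcode & 0xFF) << 8);
}

#define PKT3_SET_SH_REG 0x76       /* opcode as used by Mesa */
#define SH_REG_BASE     0x0000B000 /* base of the shader register space */

/* Emit a SET_SH_REG packet writing `count` consecutive registers starting at
 * byte address `reg` (e.g. COMPUTE_PGM_LO). Returns the new write pointer. */
static uint32_t *set_sh_reg_seq(uint32_t *cs, uint32_t reg, unsigned count,
                                const uint32_t *values)
{
    *cs++ = pkt3(PKT3_SET_SH_REG, count); /* body = 1 offset dword + count values */
    *cs++ = (reg - SH_REG_BASE) >> 2;     /* dword offset within the SH space */
    for (unsigned i = 0; i < count; i++)
        *cs++ = values[i];
    return cs;
}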

It’s worth mentioning that we can set multiple registers in 1 packet if they’re consecutive.

Then we append the dispatch command:
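Reusing the pkt3 helper above, a dispatch packet can look like the sketch below. The DISPATCH_DIRECT opcode (0x15) and the COMPUTE_SHADER_EN bit in the dispatch initiator come from Mesa's headers; RADV sets a few more initiator bits (wave size, etc.), so treat this as the bare minimum:

#define PKT3_DISPATCH_DIRECT 0x15 /* opcode as used by Mesa */

/* Append a DISPATCH_DIRECT packet: workgroup counts in x/y/z followed by the
 * dispatch initiator dword (bit 0 = COMPUTE_SHADER_EN). */
static uint32_t *emit_dispatch(uint32_t *cs, uint32_t x, uint32_t y, uint32_t z)
{
    *cs++ = pkt3(PKT3_DISPATCH_DIRECT, 3); /* 4 body dwords -> count field = 3 */
    *cs++ = x;
    *cs++ = y;
    *cs++ = z;
    *cs++ = 1; /* dispatch initiator: COMPUTE_SHADER_EN */
    return cs;
}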

Now we want to write those commands into our buffer and send them to the KMD:
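Submission goes through libdrm again: build a BO list containing every buffer the job touches, point an IB at the command buffer's GPU address, and call amdgpu_cs_submit. A rough sketch, using the handles and addresses from earlier:

#include <amdgpu.h>
#include <amdgpu_drm.h>
#include <stdint.h>

/* Submit `num_dw` dwords of PM4 from the command buffer to the compute ring
 * and wait for the job to finish. */
static void submit(amdgpu_device_handle dev, amdgpu_context_handle ctx,
                   amdgpu_bo_handle code_bo, amdgpu_bo_handle cmd_bo,
                   uint64_t cmd_va, uint32_t num_dw)
{
    /* The kernel needs to know about every BO the submission uses. */
    amdgpu_bo_handle bos[] = { code_bo, cmd_bo };
    amdgpu_bo_list_handle bo_list;
    amdgpu_bo_list_create(dev, 2, bos, NULL, &bo_list);

    struct amdgpu_cs_ib_info ib = {0};
    ib.ib_mc_address = cmd_va; /* GPU VA of the command buffer */
    ib.size = num_dw;          /* size in dwords */

    struct amdgpu_cs_request req = {0};
    req.ip_type = AMDGPU_HW_IP_COMPUTE;
    req.ring = 0;
    req.resources = bo_list;
    req.number_of_ibs = 1;
    req.ibs = &ib;
    amdgpu_cs_submit(ctx, 0, &req, 1);

    /* Wait on the fence before reading anything back. */
    struct amdgpu_cs_fence fence = {0};
    fence.context = ctx;
    fence.ip_type = AMDGPU_HW_IP_COMPUTE;
    fence.ring = 0;
    fence.fence = req.seq_no;
    uint32_t expired = 0;
    amdgpu_cs_query_fence_status(&fence, AMDGPU_TIMEOUT_INFINITE, 0, &expired);

    amdgpu_bo_list_destroy(bo_list);
}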

This is a good point to write a more complex shader that outputs something, for example one that writes 1 to a buffer.

No GPU hangs?! Nothing happened?! Cool, cool. Now we have a shader that runs on the GPU, what's next? Let's try to hang the GPU by pausing its execution, aka make the GPU trap.

The RDNA3 ISA manual does mention two registers, TBA and TMA; here's how it describes them, respectively:

Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap
handler is present (1) or not (0) and is not considered part of the address
(bit[62] is replicated into address bit[63]). Accessed via S_SENDMSG_RTN.

Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler.

You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual.

We know from Marcell Kiss's blog posts that we need to compile a trap handler, which is a normal shader the GPU switches to when it encounters an s_trap. The TBA register has a special bit that indicates whether the trap handler is enabled.

Since these are privileged registers, we cannot write to them from user space. To bridge this gap for debugging, we can use the debugfs interface. Luckily, we have UMR, which uses that debugfs interface, and it's open source, so we can copy AMD's homework here, which is great.

The amdgpu KMD exposes a couple of files in debugfs under /sys/kernel/debug/dri/{PCI address}; one of them is regs2, which is an interface to amdgpu_debugfs_regs2_write in the kernel that writes to the registers. It works by simply opening the file, seeking to the register's offset, and then writing; the kernel side also performs some synchronisation and writes the value correctly. We do need to provide more parameters about the register before writing to the file, tho, and we do that with an ioctl call. Here are the ioctl arguments:
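This is my reconstruction of the ioctl data from the kernel's amdgpu_umr.h (the same definitions UMR uses); verify the layout and the ioctl number against your kernel sources:

#include <linux/types.h>
#include <sys/ioctl.h>

struct amdgpu_debugfs_regs2_iocdata {
    __u32 use_srbm, use_grbm, pg_lock;
    struct {
        __u32 se, sh, instance;      /* GRBM banking: shader engine / array / instance */
    } grbm;
    struct {
        __u32 me, pipe, queue, vmid; /* SRBM banking: micro engine / pipe / queue / VMID */
    } srbm;
};

#define AMDGPU_DEBUGFS_REGS2_IOC_SET_STATE \
    _IOWR(0x20, 0, struct amdgpu_debugfs_regs2_iocdata)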

The two structs are there because there are two types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them here in the Linux kernel documentation.

Turns out our registers here are SRBM registers, banked by VMID, meaning each VMID has its own TBA and TMA registers. Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate virtual memory addresses. The context is created when we open the DRM file, but VMIDs get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch.

We can obtain the VMID of the dispatched process by querying the HW_ID2 register with s_getreg_b32. I do a hack here: I enable the trap handler in every VMID. There are 16 of them; the first is special and used by the KMD, and the last 8 are allocated to the amdkfd driver. We loop over the remaining VMIDs and write to their registers. This can cause issues for other processes using those VMIDs, since we enable trap handlers for them and write the virtual address of our trap handler, which is only valid within our own virtual address space. It's relatively safe tho, since most other processes won't cause a trap.

Now we can write to TMA and TBA, here’s the code:
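Here's a sketch of that write path: set the SRBM banking state (the VMID) with the ioctl from above, then pwrite the value at the register's byte offset. The register offsets themselves have to come from the register database / UMR, and the exact debugfs file name may differ (on my reading of the kernel it's amdgpu_regs2 under the card's dri directory):

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Write a 32-bit register, banked on a given VMID, through the regs2 debugfs
 * file (e.g. /sys/kernel/debug/dri/<pci address>/amdgpu_regs2). Needs root. */
static int write_banked_reg(const char *regs2_path, uint32_t vmid,
                            uint64_t reg_byte_offset, uint32_t value)
{
    int fd = open(regs2_path, O_WRONLY);
    if (fd < 0)
        return -1;

    struct amdgpu_debugfs_regs2_iocdata io = {0};
    io.use_srbm = 1;     /* TBA/TMA are SRBM-banked */
    io.srbm.vmid = vmid; /* bank on the VMID we care about */
    ioctl(fd, AMDGPU_DEBUGFS_REGS2_IOC_SET_STATE, &io);

    /* Seek to the register's byte offset and write the value. */
    pwrite(fd, &value, sizeof(value), (off_t)reg_byte_offset);

    close(fd);
    return 0;
}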

And here’s how we write to TMA and TBA:
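A sketch of what that can look like with bitfields. The field widths, whether the address is stored pre-shifted, and the assumption that the LO/HI halves (and TBA/TMA) sit next to each other all need to be checked against the register database for your ASIC; the trap-enable bit is the one described in the TBA quote above:

#include <stdint.h>

/* Illustrative layout for the high dword of TBA; verify against the register
 * database before using. */
union sq_shader_tba_hi {
    struct {
        uint32_t addr_hi : 31; /* high bits of the trap handler address */
        uint32_t trap_en : 1;  /* "trap handler present" bit */
    };
    uint32_t raw;
};

/* tba_off and tma_off are the byte offsets of SQ_SHADER_TBA_LO and
 * SQ_SHADER_TMA_LO for your ASIC; the HI halves are assumed to follow. */
static void write_tba_tma(const char *regs2_path, uint32_t vmid,
                          uint64_t tba_off, uint64_t tma_off,
                          uint64_t trap_va, uint64_t tma_va)
{
    union sq_shader_tba_hi hi = { .raw = 0 };
    hi.addr_hi = (uint32_t)(trap_va >> 32) & 0x7FFFFFFF;
    hi.trap_en = 1;

    write_banked_reg(regs2_path, vmid, tba_off,     (uint32_t)trap_va);
    write_banked_reg(regs2_path, vmid, tba_off + 4, hi.raw);
    write_banked_reg(regs2_path, vmid, tma_off,     (uint32_t)tma_va);
    write_banked_reg(regs2_path, vmid, tma_off + 4, (uint32_t)(tma_va >> 32));
}
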
If you noticed, I'm using bitfields. I use them because working with them is much easier than with macros, and while the bit-field layout is not guaranteed by the C spec, it is specified by the System V ABI, which Linux adheres to.

Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader, provided we added an s_trap instruction to it or enabled the TRAP_ON_START bit in the rsrc3 register.

Now, let’s try to write a trap handler.

If you wrote a different shader that outputs to a buffer, you can try writing to that buffer from the trap handler, which is a nice way to make sure it's actually being run.

We need two things: our trap handler and some scratch memory to use when needed; we'll store the scratch memory's address in the TMA register.

The trap handler is just a normal program running in a privileged state, meaning we have access to special registers like TTMP[0-15]. When we enter the trap handler, we first need to ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context switching, by saving a copy of the registers, the program counter, and so on. The problem, tho, is that we don't have a stable ABI for GPUs, or at least not one I'm aware of, and compilers use all the registers they can, so we need to save everything.

AMD GPUs' Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders. The problem is that they're not documented, and we have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I kinda did a workaround here since I didn't have much luck understanding how it works, plus some other reasons I'll discuss later in the post.

The workaround here is to use only TTMP registers and a combination of specific instructions to copy the values of some registers, which then lets us use more instructions to copy the remaining ones. The main idea is to make use of the global_store_addtid_b32 instruction, which adds the index of the current thread within the wave to the write address, aka

    thread_id * 4 + address

The problem is that if we have more than one wave, their stores will overlap, and we will have a race condition.

Here is the code:

Now that we have those values in memory, we need to tell the CPU "hey, we got the data", and pause the GPU's execution until the CPU issues a command. Also, notice that we can simply modify those values from the CPU.

Before we tell the CPU, we need to write some values that might help the CPU. Here they are:

Now the GPU should just wait for the CPU. Here's the spin code; it's implemented as described by Marcell Kiss here:

The main loop on the CPU goes like this: enable the trap handler, dispatch the shader, wait for the GPU to write a specific value at a specific address to signal that all the data is there, then examine and display it, and finally tell the GPU "all clear, go ahead".
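In code, that loop looks roughly like the sketch below. All the names here are made up, standing in for pieces described elsewhere in the post (the SQ_CMD halt/resume writes, the VMID loop, the PM4 submission), and the magic values and TMA-buffer layout are whatever convention you pick on the GPU side:

#include <stdint.h>

#define TRAP_DATA_READY 0xDEADBEEF /* arbitrary magic values, pick your own */
#define TRAP_ALL_CLEAR  0xCAFECAFE

/* Placeholders for the pieces built earlier in the post. */
void enable_trap_handler_for_all_vmids(void);
void dispatch_shader(void);
void halt_wave(void);   /* SQ_CMD write through regs2 */
void resume_wave(void); /* SQ_CMD write again */
void inspect_and_display_state(volatile uint32_t *dump);

/* tma_cpu is the CPU mapping of the (uncached) TMA buffer the trap handler
 * dumps into: a flag dword followed by the saved wave state. */
static void debugger_loop(volatile uint32_t *tma_cpu)
{
    enable_trap_handler_for_all_vmids();
    dispatch_shader();

    /* Wait for the trap handler to signal that the register dump is complete. */
    while (tma_cpu[0] != TRAP_DATA_READY)
        ;

    halt_wave();                        /* keep the wave parked while we poke around */
    inspect_and_display_state(tma_cpu); /* PC, TTMPs, dumped SGPRs/VGPRs, ... */

    tma_cpu[0] = TRAP_ALL_CLEAR;        /* tell the trap handler to carry on */
    resume_wave();
}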

Now that our uncached buffers are in play, we just keep looping and checking whether the GPU has written the register values. When it does, the first thing we do is halt the wave by writing into the SQ_CMD register, which lets us do whatever we want with the wave without causing any issues. If we stay halted for too long, tho, the GPU CP will reset the command queue and kill the process, but we can change that behaviour by adjusting the lockup_timeout parameter of the amdgpu kernel module.

From here on, we can do whatever we want with the data we have, which is all the data we need to build a proper debugger. We will come back to what to do with it in a bit; let's assume we did what was needed for now.

Now that we’re done with the CPU, we need to write to the first byte in our TMA buffer, since the trap handler checks for that, then resume the wave, and the trap handler should pick it up. We can resume by writing to the SQ_CMD register again:

Then the GPU should continue. We need to restore everything and return the program counter to the original address. Based on whether it’s a hardware trap or not, the program counter may point to the instruction before or the instruction itself. The ISA manual and Marcell Kiss’s posts explain that well, so refer to them.

Now we can run compiled code directly, but we don’t want people to compile their code manually, then extract the text section, and give it to us. The plan is to take SPIR-V code, compile it correctly, then run it, or, even better, integrate with RADV and let RADV give us more information to work with.

My main plan was something like forking RADV and making it report the Vulkan calls to us, so we'd have a better view of the GPU work and know the buffers/textures it's using, etc. This seems like a lot more work tho, so I'll keep it in mind but won't do it for now, unless someone is willing to pay me for it ;).

For now, let's just use RADV's compiler, ACO. Luckily, RADV has a null_winsys mode, aka it will not do actual work or open DRM files, just expose a fake Vulkan device, which is perfect for our case here, since we care about nothing other than compiling code. We can enable it by setting the env var RADV_FORCE_FAMILY, and then we just call what we need like this:

Now that we have a well-structured loop and communication between the GPU and the CPU, we can run SPIR-V binaries to some extent. Let’s see how we can make it an actual debugger.

We talked earlier about CPs natively supporting context switching; this appears to be a compute-specific feature, which prevents us from implementing it for other types of shaders. It appears, tho, that mesh shaders and ray-tracing shaders are just compute shaders under the hood, which would allow us to use that functionality for them too. For now, debugging one wave feels like enough, and we can also modify the wave parameters to debug some specific indices.

Here are some of the features:

Breakpoints and Stepping

For stepping, we can use two bits: one in RSRC1 and the other in RSRC3. They're DEBUG_MODE and TRAP_ON_START, respectively. The former makes the wave enter the trap handler after each instruction, and the latter before the first instruction. This gives us instruction-level stepping essentially for free.

Regarding breakpoints, I haven't implemented them, but they're rather simple to implement here: since we have the base address of the code buffer and know the size of each instruction, we can calculate the program counter values ahead of time, make the list of them available to the GPU, and binary-search it in the trap handler.

Source Code Line Mapping

The ACO shader compiler does generate instruction-level source code mapping, which is good enough for our purposes here. By taking the offset of the current program counter and indexing into the code buffer, we can retrieve the current instruction and disassemble it, as well as find the source code mapping from the debug info.

Address Watching aka Watchpoints

We can implement this by marking the GPU page as protected. On a GPU fault, we enter the trap handler, check whether it’s within the range of our buffers and textures, and then act accordingly. Also, looking at the registers, we can find these:

which suggests that the hardware already supports this natively, so we don’t even need to do that dance. It needs more investigation on my part, tho, since I didn’t implement this.

Variable Types and Names

This needs some serious plumbing, since we need to make NIR (Mesa's intermediate representation) optimisation passes propagate debug info correctly. I already started on this here. Then we need to make ACO track variables and store that information.

Vulkan Integration

This requires ditching the simple UMD we made earlier and using RADV, which is what should happen eventually. Our custom driver could then pause before a specific frame, or get triggered by a key press, and ask before each dispatch whether to attach to it, or something similar. Since we'd have a full, proper Vulkan implementation, we'd already have all the information we need, like buffers, textures, push constants, types, variable names, etc. That would be a much better and more pleasant debugger to use.


Finally, here’s some live footage:


Here is some incomplete user-mode page-walking code for gfx11, aka the RX 7900 XTX.


