How a 40-Line Fix Eliminated a 400x Performance Gap

I have a habit of skimming the OpenJDK commit log every few weeks. Most of it is too complex for me to understand in the limited time I reserve for this… special interest. But sometimes something catches my eye.

Last week, this commit stopped me mid-scroll:

858d2e434dd 8372584: [Linux]: Replace reading proc to get thread CPU time with clock_gettime

The diffstat was interesting: +96 insertions, -54 deletions. A 55-line JMH benchmark was added in the changeset, which means the production code actually shrank.

Here is what was removed from os_linux.cpp:

static jlong user_thread_cpu_time(Thread *thread) {
  // ... /proc parsing, described below ...
}

This was the implementation behind ThreadMXBean.getCurrentThreadUserTime(). To get the user CPU time of the current thread, the old code was:

  1. Formatting a path like /proc/self/task/<tid>/stat
  2. Opening that file
  3. Reading it into a stack buffer
  4. Parsing a hostile format where the command name may contain parentheses (hence the strrchr for the last `)`)
  5. Running sscanf to extract the utime and stime fields
  6. Converting clock ticks to nanoseconds

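The steps above can be sketched as a standalone C++ program. This is a reconstruction of the approach, not the removed HotSpot code; the helper name, buffer sizes, and the exact sscanf format are mine, and the tick-to-nanosecond conversion assumes _SC_CLK_TCK divides a second evenly (it is typically 100):

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical sketch of the removed /proc-based approach.
// Returns the calling thread's user CPU time in nanoseconds, or -1.
static long long proc_user_time_ns() {
  char path[64];
  // 1. format the per-thread stat path
  snprintf(path, sizeof(path), "/proc/self/task/%ld/stat",
           (long)syscall(SYS_gettid));
  FILE* fp = fopen(path, "r");                     // 2. open
  if (fp == nullptr) return -1;
  char buf[2048];
  size_t n = fread(buf, 1, sizeof(buf) - 1, fp);   // 3. read into stack buffer
  fclose(fp);
  if (n == 0) return -1;
  buf[n] = '\0';
  // 4. the command name may itself contain parentheses, so scan from the last ')'
  char* s = strrchr(buf, ')');
  if (s == nullptr) return -1;
  // 5. skip the state char and ten fields, then read utime (per proc(5))
  unsigned long utime = 0;
  if (sscanf(s + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu",
             &utime) != 1) return -1;
  // 6. convert clock ticks to nanoseconds
  long ticks_per_sec = sysconf(_SC_CLK_TCK);
  return (long long)utime * (1000000000LL / ticks_per_sec);
}
```

Every call pays for path formatting, three syscalls, and the parsing dance.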
For comparison, here is what getCurrentThreadCpuTime() does, and has always done:

jlong os::current_thread_cpu_time() {
  return os::Linux::thread_cpu_time(CLOCK_THREAD_CPUTIME_ID);
}

jlong os::Linux::thread_cpu_time(clockid_t clockid) {
  struct timespec tp;
  int status = clock_gettime(clockid, &tp);
  assert(status == 0, "clock_gettime error");
  return jlong(tp.tv_sec) * NANOSECS_PER_SEC + tp.tv_nsec;
}

Just one clock_gettime() call. There is no file I/O, no fragile parsing, and no buffers to manage.
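A minimal standalone equivalent of that fast path (the helper name is mine, not HotSpot's):

```cpp
#include <cassert>
#include <ctime>

// Total CPU time (user + system) of the calling thread, in nanoseconds,
// via a single clock_gettime() call.
static long long thread_cpu_ns() {
  timespec tp;
  int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
  assert(rc == 0);
  return tp.tv_sec * 1000000000LL + tp.tv_nsec;
}
```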

The original bug report, filed in 2018, quantified the difference:

“getCurrentThreadUserTime is 30x-400x slower than getCurrentThreadCpuTime”

The difference increases under concurrency. Why is clock_gettime() so fast? Both approaches require entering the kernel, but the difference is what happens next.

The /proc path:

  • open() syscall
  • VFS dispatch + dentry lookup
  • procfs synthesizes the file contents at read time
  • kernel formats the string into a buffer
  • read() syscall, copy to userspace
  • userspace sscanf() parsing
  • close() syscall

The clock_gettime(CLOCK_THREAD_CPUTIME_ID) path:

  • A single syscall → posix_cpu_clock_get() → cpu_clock_sample() → task_sched_runtime() → reads directly from the sched_entity

The /proc path involves multiple syscalls, VFS machinery, string formatting on the kernel side, and parsing on the userspace side. The clock_gettime() path is one syscall with a direct function call chain.
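To get a feel for the gap without involving the JVM, here is an unscientific standalone comparison. It times reading /proc/self/stat (the whole-process file, for simplicity) against a single clock_gettime() call; the helper names are mine:

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <ctime>

// Wall-clock cost per call of the /proc read path (open + read + close).
static double ns_per_call_proc(int iters) {
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) {
    char buf[2048];
    FILE* fp = std::fopen("/proc/self/stat", "r");
    if (fp != nullptr) {
      (void)std::fread(buf, 1, sizeof(buf), fp);
      std::fclose(fp);
    }
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

// Wall-clock cost per call of a single clock_gettime() syscall.
static double ns_per_call_clock(int iters) {
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) {
    timespec tp;
    (void)clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

On any Linux box the /proc loop should lose by a wide margin, for exactly the reasons listed above.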

Under concurrent load, the /proc approach also suffers from kernel lock contention. The bug report notes:

“Reading /proc is slow (hence it is placed in the slow_thread_cpu_time(…) method) and can cause noticeable spikes in case of contention for kernel resources.”

So why didn’t getCurrentThreadUserTime() just use clock_gettime() from the beginning?

The answer is (probably) POSIX. The standard mandates that CLOCK_THREAD_CPUTIME_ID return total CPU time (user + system). There is no portable way to request user time only. Hence the /proc-based implementation.

The Linux port of OpenJDK is not limited to the POSIX definition; it can use Linux-specific features. Let’s see how.

Since kernel 2.6.12 (released in 2005), Linux encodes clock type information directly in the clockid_t value. When you call pthread_getcpuclockid(), you get back a clockid with a specific bit pattern:

Bit 2:    thread vs process clock
Bits 1-0: clock type
  00 = PROF
  01 = VIRT  (user time only)
  10 = SCHED (user + system, POSIX-compliant)
  11 = FD

The remaining bits encode the target PID/TID. We’ll come back to that in the bonus section.
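On Linux with glibc, this encoding can be checked directly. The sketch below (the helper name is mine) verifies the bit layout described above, including recovering the TID from the upper bits the way the kernel's CPUCLOCK_PID() macro does; it relies on implementation details of glibc and the kernel, not on POSIX:

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Returns true if the clockid for the calling thread matches the
// documented-in-kernel-source bit layout: bit 2 set (per-thread),
// low two bits 10b (SCHED), upper bits = bitwise NOT of the TID.
static bool check_clockid_encoding() {
  clockid_t clockid;
  if (pthread_getcpuclockid(pthread_self(), &clockid) != 0) return false;
  if ((clockid & 4) != 4) return false;  // bit 2: thread, not process
  if ((clockid & 3) != 2) return false;  // bits 1-0: 10 = SCHED
  // Decode the target TID: arithmetic shift drops the 3 flag bits,
  // then invert, exactly as the kernel's CPUCLOCK_PID() does.
  return (pid_t)~(clockid >> 3) == (pid_t)syscall(SYS_gettid);
}
```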

A POSIX-compliant pthread_getcpuclockid() returns a clockid with the low bits set to 10 (SCHED). But if you flip those low bits to 01 (VIRT), clock_gettime() will return user time only.

The new implementation:

static bool get_thread_clockid(Thread* thread, clockid_t* clockid, bool total) {
  constexpr clockid_t CLOCK_TYPE_MASK = 3;
  constexpr clockid_t CPUCLOCK_VIRT = 1;
  int rc = pthread_getcpuclockid(thread->osthread()->pthread_id(), clockid);
  if (rc != 0) {
    // Thread may have terminated
    assert_status(rc == ESRCH, rc, "pthread_getcpuclockid failed");
    return false;
  }
  if (!total) {
    // Flip to CPUCLOCK_VIRT for user-time-only
    *clockid = (*clockid & ~CLOCK_TYPE_MASK) | CPUCLOCK_VIRT;
  }
  return true;
}

static jlong user_thread_cpu_time(Thread *thread) {
  clockid_t clockid;
  bool success = get_thread_clockid(thread, &clockid, false);
  return success ? os::Linux::thread_cpu_time(clockid) : -1;
}

And that’s all. The new version has no file I/O, no buffers, and certainly no sscanf() with thirteen format specifiers.
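The same technique works outside HotSpot. Here is a hedged standalone sketch (the function name is mine; the bit values mirror the kernel's CPUCLOCK_* encoding and are Linux-specific):

```cpp
#include <pthread.h>
#include <time.h>

// User CPU time of the calling thread, in nanoseconds, or -1 on failure:
// obtain the thread's CPU clock, then flip the low two bits from
// SCHED (10) to VIRT (01) before calling clock_gettime().
static long long current_thread_user_time_ns() {
  clockid_t clockid;
  if (pthread_getcpuclockid(pthread_self(), &clockid) != 0) return -1;
  constexpr clockid_t CLOCK_TYPE_MASK = 3;  // low two bits select the type
  constexpr clockid_t CPUCLOCK_VIRT = 1;    // 01 = user time only
  clockid = (clockid & ~CLOCK_TYPE_MASK) | CPUCLOCK_VIRT;
  timespec tp;
  if (clock_gettime(clockid, &tp) != 0) return -1;
  return tp.tv_sec * 1000000000LL + tp.tv_nsec;
}
```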

Let’s see how it performs in practice. I am using the JMH benchmark included in the fix; the only changes are that I increased the thread count from 1 to 16 and added a main() method for easy execution from the IDE:

@State(Scope.Benchmark)
@Warmup(iterations = 2, time = 5)
@Measurement(iterations = 5, time = 5)
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(16)
@Fork(value = 1)
public class ThreadMXBeanBench {

    static final ThreadMXBean mxThreadBean = ManagementFactory.getThreadMXBean();
    static long user; // To avoid dead-code elimination

    @Benchmark
    public void getCurrentThreadUserTime() throws Throwable {
        user = mxThreadBean.getCurrentThreadUserTime();
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ThreadMXBeanBench.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

Aside: this is an unscientific benchmark; there are other processes running on my desktop, etc. Anyway, here’s the setup: Ryzen 9950X, JDK main branch at commit 8ab7d3b89f656e5c. For the “before” case, I reverted the fix instead of checking out the old revision.

Here is the result:

Benchmark                                             Mode      Cnt     Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  8912714    11.186 ± 0.006  us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample              2.000          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample             10.272          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample             17.984          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample             20.832          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample             27.552          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample             56.768          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             79.709          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           1179.648          us/op

We can see that a single invocation took 11.2 microseconds on average, with a median of about 10 microseconds.

The CPU profile looks like this:

[Figure: CPU profile before the fix]

The CPU profile confirms that each invocation of getCurrentThreadUserTime() makes multiple syscalls. In fact, most of the CPU time is spent in syscalls. We can see files being opened and closed. Closing the file alone generates multiple syscalls, including futex locks.

Let’s see the benchmark results with the improvements applied:

Benchmark                                             Mode       Cnt     Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  11037102     0.279 ± 0.001  us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               0.070          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               0.310          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               0.440          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               0.530          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               0.610          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample               1.030          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample               3.088          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample            1230.848          us/op

The average dropped from 11 microseconds to 279 nanoseconds: the latency of the fixed version is about 40 times lower than that of the old one. Although this is not a 400x improvement, it is within the 30x-400x range from the original report, and chances are the delta would be higher with a different setup. Let’s take a look at the new profile:

[Figure: CPU profile after the fix]

The profile is very clean: there is only one syscall. If the profile is to be trusted, most of the time is spent outside the kernel, in the JVM.

Is this bit encoding documented? Barely. It is stable (it hasn’t changed in 20 years), but you won’t find it in the clock_gettime(2) man page. The closest thing to official documentation is the kernel source itself: kernel/time/posix-cpu-timers.c and its CPUCLOCK_* macros.

The kernel’s policy is clear: do not break user space.

[Figure: Linus on kernel ABI stability: don't break userspace]
Linus’s position on ABI stability is… clear.

My opinion: If glibc depends on it, it’s not going away.

Looking at the profiler data from the “after” run, I noticed another optimization opportunity: a good portion of the remaining syscall is spent inside a radix tree lookup. Have a look:

[Figure: zoomed-in CPU profile showing the radix tree lookup]

When the JVM calls pthread_getcpuclockid(), it receives a clockid which encodes the thread’s ID. When that clockid is passed to clock_gettime(), the kernel extracts the thread ID and performs a radix tree lookup to find the pid structure associated with it.

However, the Linux kernel has a fast path: if the PID encoded in the clockid is 0, the kernel interprets it as “the current thread” and skips the radix tree lookup altogether, going straight to the current task’s structure.

The OpenJDK fix currently gets the specific TID, flips the bits, and passes the result to clock_gettime(). This forces the kernel onto the generalized path (the radix tree lookup).

The kernel source looks like this:

/*
 * Functions for validating access to tasks.
 */
static struct pid *pid_for_clock(const clockid_t clock, bool gettime)
{
	[...]
	/*
	 * If the encoded PID is 0, then the timer is targeted at current
	 * or the process to which current belongs.
	 */
	if (upid == 0)
		// the fast path: current task lookup, cheap
		return thread ? task_pid(current) : task_tgid(current);

	// the generalized path: radix tree lookup, more expensive
	pid = find_vpid(upid);
	[...]
If the JVM constructed the entire clockid manually with PID=0 encoded (instead of obtaining the clockid through pthread_getcpuclockid()), the kernel could take the fast path and avoid the radix tree lookup altogether. The JVM already pokes at the clockid’s bits, so building it completely from scratch wouldn’t be a huge leap in terms of compatibility.

Let’s try it!

First, a refresher on the clockid encoding. The clockid is constructed as follows:

clockid for TID=42, user-time-only:

  1111_1111_1111_1111_1111_1110_1010_1101
  └───────────────~42────────────────┘│└┘
                                      │ └─ 01 = VIRT (user time only)
                                      └─── 1 = per-thread

For the current thread, we want PID=0 encoded, which gives ~0 in the upper bits:

  1111_1111_1111_1111_1111_1111_1111_1101
  └─────────────── ~0 ───────────────┘│└┘
                                      │ └─ 01 = VIRT (user time only)
                                      └─── 1 = per-thread

We can translate this into C++ as:

// Linux kernel internal bit encoding for dynamic CPU clocks:
// [31:3] : bitwise NOT of the PID or TID (~0 for current thread)
// [2]    : 1 = per-thread clock, 0 = per-process clock
// [1:0]  : clock type (0 = PROF, 1 = VIRT/user-only, 2 = SCHED)
static_assert(sizeof(clockid_t) == 4, "Linux clockid_t must be 32-bit");
constexpr clockid_t CLOCK_CURRENT_THREAD_USERTIME = static_cast<clockid_t>(~0u << 3 | 4 | 1);
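Before wiring this into HotSpot, the hand-built constant can be sanity-checked in a standalone program. This sketch (the helper name is mine) assumes the Linux bit layout above; a non-Linux kernel, or one that changed the encoding, would make clock_gettime() fail rather than crash:

```cpp
#include <time.h>

// User CPU time of the current thread via a hand-built dynamic clockid:
// upper bits ~0 (PID=0 -> "current thread", kernel fast path),
// bit 2 set (per-thread), low bits 01 (VIRT = user time only).
static long long fast_path_user_time_ns() {
  static_assert(sizeof(clockid_t) == 4, "Linux clockid_t must be 32-bit");
  constexpr clockid_t kCurrentThreadUserTime =
      static_cast<clockid_t>(~0u << 3 | 4 | 1);
  timespec tp;
  // The kernel treats negative clockids as dynamic CPU/fd clocks.
  if (clock_gettime(kCurrentThreadUserTime, &tp) != 0) return -1;
  return tp.tv_sec * 1000000000LL + tp.tv_nsec;
}
```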

and then make a tiny change where user_thread_cpu_time() was called:

jlong os::current_thread_cpu_time(bool user_sys_cpu_time) {
  if (user_sys_cpu_time) {
    return os::Linux::thread_cpu_time(CLOCK_THREAD_CPUTIME_ID);
  } else {
-   return user_thread_cpu_time(Thread::current());
+   return os::Linux::thread_cpu_time(CLOCK_CURRENT_THREAD_USERTIME);
  }
}
That change is enough to make getCurrentThreadUserTime() use the kernel fast path.

Given that we are already in nanosecond territory, we change the benchmark slightly:

  • increase the iteration and fork counts
  • use only one thread to reduce noise
  • switch to nanosecond output

These changes eliminate noise from the rest of my system and give a more accurate measurement of the small delta we expect:

@State(Scope.Benchmark)
@Warmup(iterations = 4, time = 5)
@Measurement(iterations = 10, time = 5)
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(1)
@Fork(value = 3)
public class ThreadMXBeanBench {

    static final ThreadMXBean mxThreadBean = ManagementFactory.getThreadMXBean();
    static long user; // To avoid dead-code elimination

    @Benchmark
    public void getCurrentThreadUserTime() throws Throwable {
        user = mxThreadBean.getCurrentThreadUserTime();
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ThreadMXBeanBench.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

The version currently in the JDK main branch gives:

Benchmark                                             Mode      Cnt       Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  4347067      81.746 ± 0.510  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               69.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               80.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              230.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1980.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           653312.000          ns/op

With manual clockid construction, which uses the kernel fast path, we get:

Benchmark                                             Mode      Cnt       Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  5081223      70.813 ± 0.325  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               59.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               80.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              170.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1830.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           425472.000          ns/op

The average dropped from 81.7 ns to 70.8 ns, approximately a 13% improvement, visible across all percentiles. Is constructing the clockid manually, instead of using pthread_getcpuclockid(), worth the loss of clarity? I’m not quite sure. The absolute gain is small, and it makes additional assumptions about kernel internals, including the size of clockid_t. On the other hand, in practice this is a benefit with no downside. (Famous last words…)

This is why I love browsing the commit logs of large open source projects. A roughly 40-line fix eliminated a 400x performance difference. The solution doesn’t require any new kernel features, just knowledge of a stable-but-obscure piece of the Linux ABI.

Lessons:

Read the kernel source. POSIX tells you what is portable; the kernel source tells you what is possible. Sometimes there is a 400x difference between the two. Whether it is worth exploiting is a different question.

Examine old assumptions. The /proc parsing approach made sense when it was written, before anyone realized the clockid bits could be exploited this way. Assumptions turn into code. Sometimes it pays to revisit them.

The change landed on December 3, 2025, just a day before the JDK 26 feature freeze. If you are using ThreadMXBean.getCurrentThreadUserTime(), JDK 26 (scheduled for March 2026) brings you a 30x-400x speedup for free!


