How a 40-Line Fix Eliminated a 400x Performance Gap

I have a habit of skimming the OpenJDK commit log every few weeks. Most of it is too complex for me to understand in the limited time I reserve for this… special interest. But sometimes something catches my eye.

Last week, this commit stopped me mid-scroll:

858d2e434dd 8372584: [Linux]: Replace reading proc to get thread CPU time with clock_gettime

The diffstat was interesting: +96 insertions, -54 deletions. A 55-line JMH benchmark was added in the changeset, which means the production code actually shrank.

Here is what was removed from os_linux.cpp:

static jlong user_thread_cpu_time(Thread *thread) {
  // ... /proc parsing, described below ...
}

This was the implementation behind ThreadMXBean.getCurrentThreadUserTime(). To get the user CPU time of the current thread, the old code was:

  1. Formatting a path like /proc/self/task/<tid>/stat
  2. Opening that file
  3. Reading it into a stack buffer
  4. Parsing a hostile format where the command name may contain parentheses (hence the strrchr for the last `)`)
  5. Running sscanf to extract the utime and stime fields
  6. Converting clock ticks to nanoseconds

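The steps above can be sketched as a standalone C++ program. This is a reconstruction of the approach, not the removed HotSpot code; the helper name, buffer sizes, and the exact sscanf format are mine, and the tick-to-nanosecond conversion assumes _SC_CLK_TCK divides a second evenly (it is typically 100):

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical sketch of the removed /proc-based approach.
// Returns the calling thread's user CPU time in nanoseconds, or -1.
static long long proc_user_time_ns() {
  char path[64];
  // 1. format the per-thread stat path
  snprintf(path, sizeof(path), "/proc/self/task/%ld/stat",
           (long)syscall(SYS_gettid));
  FILE* fp = fopen(path, "r");                     // 2. open
  if (fp == nullptr) return -1;
  char buf[2048];
  size_t n = fread(buf, 1, sizeof(buf) - 1, fp);   // 3. read into stack buffer
  fclose(fp);
  if (n == 0) return -1;
  buf[n] = '\0';
  // 4. the command name may itself contain parentheses, so scan from the last ')'
  char* s = strrchr(buf, ')');
  if (s == nullptr) return -1;
  // 5. skip the state char and ten fields, then read utime (per proc(5))
  unsigned long utime = 0;
  if (sscanf(s + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu",
             &utime) != 1) return -1;
  // 6. convert clock ticks to nanoseconds
  long ticks_per_sec = sysconf(_SC_CLK_TCK);
  return (long long)utime * (1000000000LL / ticks_per_sec);
}
```

Every call pays for path formatting, three syscalls, and the parsing dance.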
For comparison, here is what getCurrentThreadCpuTime() does, and has always done:

jlong os::current_thread_cpu_time() {
  return os::Linux::thread_cpu_time(CLOCK_THREAD_CPUTIME_ID);
}

jlong os::Linux::thread_cpu_time(clockid_t clockid) {
  struct timespec tp;
  int status = clock_gettime(clockid, &tp);
  assert(status == 0, "clock_gettime error");
  return jlong(tp.tv_sec) * NANOSECS_PER_SEC + tp.tv_nsec;
}

Just one clock_gettime() call. There is no file I/O, no fragile parsing, and no buffers to manage.
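A minimal standalone equivalent of that fast path (the helper name is mine, not HotSpot's):

```cpp
#include <cassert>
#include <ctime>

// Total CPU time (user + system) of the calling thread, in nanoseconds,
// via a single clock_gettime() call.
static long long thread_cpu_ns() {
  timespec tp;
  int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
  assert(rc == 0);
  return tp.tv_sec * 1000000000LL + tp.tv_nsec;
}
```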

The original bug report, filed in 2018, quantified the difference:

“getCurrentThreadUserTime is 30x-400x slower than getCurrentThreadCpuTime”

The difference increases under concurrency. Why is clock_gettime() so fast? Both approaches require entering the kernel, but the difference is what happens next.

The /proc path:

  • open() syscall
  • VFS dispatch + dentry lookup
  • procfs synthesizes the file contents at read time
  • kernel formats the string into a buffer
  • read() syscall, copy to userspace
  • userspace sscanf() parsing
  • close() syscall

The clock_gettime(CLOCK_THREAD_CPUTIME_ID) path:

  • A single syscall → posix_cpu_clock_get() → cpu_clock_sample() → task_sched_runtime() → reads directly from the sched_entity

The /proc path involves multiple syscalls, VFS machinery, string formatting on the kernel side, and parsing on the userspace side. The clock_gettime() path is one syscall with a direct function call chain.
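To get a feel for the gap without involving the JVM, here is an unscientific standalone comparison. It times reading /proc/self/stat (the whole-process file, for simplicity) against a single clock_gettime() call; the helper names are mine:

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <ctime>

// Wall-clock cost per call of the /proc read path (open + read + close).
static double ns_per_call_proc(int iters) {
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) {
    char buf[2048];
    FILE* fp = std::fopen("/proc/self/stat", "r");
    if (fp != nullptr) {
      (void)std::fread(buf, 1, sizeof(buf), fp);
      std::fclose(fp);
    }
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

// Wall-clock cost per call of a single clock_gettime() syscall.
static double ns_per_call_clock(int iters) {
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) {
    timespec tp;
    (void)clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

On any Linux box the /proc loop should lose by a wide margin, for exactly the reasons listed above.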

Under concurrent load, the /proc approach also suffers from kernel lock contention. The bug report notes:

“Reading /proc is slow (hence it is placed in the slow_thread_cpu_time(…) method) and can cause noticeable spikes in case of contention for kernel resources.”

So why didn’t getCurrentThreadUserTime() just use clock_gettime() from the beginning?

The answer is (probably) POSIX. The standard mandates that CLOCK_THREAD_CPUTIME_ID return total CPU time (user + system). There is no portable way to request user time only. Hence the /proc-based implementation.

The Linux port of OpenJDK is not limited to the POSIX definition; it can use Linux-specific features. Let’s see how.

Since kernel 2.6.12 (released in 2005), Linux encodes clock type information directly in the clockid_t value. When you call pthread_getcpuclockid(), you get back a clockid with a specific bit pattern:

Bit 2:    thread vs process clock
Bits 1-0: clock type
  00 = PROF
  01 = VIRT  (user time only)
  10 = SCHED (user + system, POSIX-compliant)
  11 = FD

The remaining bits encode the target PID/TID. We’ll come back to that in the bonus section.
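On Linux with glibc, this encoding can be checked directly. The sketch below (the helper name is mine) verifies the bit layout described above, including recovering the TID from the upper bits the way the kernel's CPUCLOCK_PID() macro does; it relies on implementation details of glibc and the kernel, not on POSIX:

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Returns true if the clockid for the calling thread matches the
// documented-in-kernel-source bit layout: bit 2 set (per-thread),
// low two bits 10b (SCHED), upper bits = bitwise NOT of the TID.
static bool check_clockid_encoding() {
  clockid_t clockid;
  if (pthread_getcpuclockid(pthread_self(), &clockid) != 0) return false;
  if ((clockid & 4) != 4) return false;  // bit 2: thread, not process
  if ((clockid & 3) != 2) return false;  // bits 1-0: 10 = SCHED
  // Decode the target TID: arithmetic shift drops the 3 flag bits,
  // then invert, exactly as the kernel's CPUCLOCK_PID() does.
  return (pid_t)~(clockid >> 3) == (pid_t)syscall(SYS_gettid);
}
```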

A POSIX-compliant pthread_getcpuclockid() returns a clockid with the low bits set to 10 (SCHED). But if you flip those low bits to 01 (VIRT), clock_gettime() will return user time only.

The new implementation:

static bool get_thread_clockid(Thread* thread, clockid_t* clockid, bool total) {
  constexpr clockid_t CLOCK_TYPE_MASK = 3;
  constexpr clockid_t CPUCLOCK_VIRT = 1;
  int rc = pthread_getcpuclockid(thread->osthread()->pthread_id(), clockid);
  if (rc != 0) {
    // Thread may have terminated
    assert_status(rc == ESRCH, rc, "pthread_getcpuclockid failed");
    return false;
  }
  if (!total) {
    // Flip to CPUCLOCK_VIRT for user-time-only
    *clockid = (*clockid & ~CLOCK_TYPE_MASK) | CPUCLOCK_VIRT;
  }
  return true;
}

static jlong user_thread_cpu_time(Thread *thread) {
  clockid_t clockid;
  bool success = get_thread_clockid(thread, &clockid, false);
  return success ? os::Linux::thread_cpu_time(clockid) : -1;
}

And that’s all. The new version has no file I/O, no buffers, and certainly no sscanf() with thirteen format specifiers.
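The same technique works outside HotSpot. Here is a hedged standalone sketch (the function name is mine; the bit values mirror the kernel's CPUCLOCK_* encoding and are Linux-specific):

```cpp
#include <pthread.h>
#include <time.h>

// User CPU time of the calling thread, in nanoseconds, or -1 on failure:
// obtain the thread's CPU clock, then flip the low two bits from
// SCHED (10) to VIRT (01) before calling clock_gettime().
static long long current_thread_user_time_ns() {
  clockid_t clockid;
  if (pthread_getcpuclockid(pthread_self(), &clockid) != 0) return -1;
  constexpr clockid_t CLOCK_TYPE_MASK = 3;  // low two bits select the type
  constexpr clockid_t CPUCLOCK_VIRT = 1;    // 01 = user time only
  clockid = (clockid & ~CLOCK_TYPE_MASK) | CPUCLOCK_VIRT;
  timespec tp;
  if (clock_gettime(clockid, &tp) != 0) return -1;
  return tp.tv_sec * 1000000000LL + tp.tv_nsec;
}
```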

Let’s see how it performs in practice. I am using the JMH benchmark included in the fix; the only changes are that I increased the thread count from 1 to 16 and added a main() method for easy execution from the IDE:

@State(Scope.Benchmark)
@Warmup(iterations = 2, time = 5)
@Measurement(iterations = 5, time = 5)
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Threads(16)
@Fork(value = 1)
public class ThreadMXBeanBench {

    static final ThreadMXBean mxThreadBean = ManagementFactory.getThreadMXBean();
    static long user; // To avoid dead-code elimination

    @Benchmark
    public void getCurrentThreadUserTime() throws Throwable {
        user = mxThreadBean.getCurrentThreadUserTime();
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ThreadMXBeanBench.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

Aside: this is an unscientific benchmark; there are other processes running on my desktop, etc. Anyway, here’s the setup: Ryzen 9950X, JDK main branch at commit 8ab7d3b89f656e5c. For the “before” case, I reverted the fix instead of checking out the old revision.

Here is the result:

Benchmark                                             Mode      Cnt     Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  8912714    11.186 ± 0.006  us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample              2.000          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample             10.272          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample             17.984          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample             20.832          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample             27.552          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample             56.768          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             79.709          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           1179.648          us/op

We can see that a single invocation took 11.2 microseconds on average, with a median of about 10 microseconds.

The CPU profile looks like this:

[Figure: CPU profile before the fix]

The CPU profile confirms that each invocation of getCurrentThreadUserTime() makes multiple syscalls. In fact, most of the CPU time is spent in syscalls. We can see files being opened and closed. Closing the file alone generates multiple syscalls, including futex locks.

Let’s see the benchmark results with the improvements applied:

Benchmark                                             Mode       Cnt     Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  11037102     0.279 ± 0.001  us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               0.070          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               0.310          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               0.440          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               0.530          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               0.610          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample               1.030          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample               3.088          us/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample            1230.848          us/op

The average dropped from 11 microseconds to 279 nanoseconds: the latency of the fixed version is about 40 times lower than that of the old one. Although this is not a 400x improvement, it is within the 30x-400x range from the original report, and chances are the delta would be higher with a different setup. Let’s take a look at the new profile:

[Figure: CPU profile after the fix]

The profile is very clean: there is only one syscall. If the profile is to be trusted, most of the time is spent outside the kernel, in the JVM.

Is this bit encoding documented? Barely. It is stable (it hasn’t changed in 20 years), but you won’t find it in the clock_gettime(2) man page. The closest thing to official documentation is the kernel source itself: kernel/time/posix-cpu-timers.c and its CPUCLOCK_* macros.

The kernel’s policy is clear: do not break user space.

[Figure: Linus on kernel ABI stability: don't break userspace]
Linus’s position on ABI stability is… clear.

My opinion: If glibc depends on it, it’s not going away.

Looking at the profiler data from the “after” run, I noticed another optimization opportunity: a good portion of the remaining syscall is spent inside a radix tree lookup. Have a look:

[Figure: zoomed-in CPU profile showing the radix tree lookup]

When the JVM calls pthread_getcpuclockid(), it receives a clockid which encodes the thread’s ID. When that clockid is passed to clock_gettime(), the kernel extracts the thread ID and performs a radix tree lookup to find the pid structure associated with it.

However, the Linux kernel has a fast path: if the PID encoded in the clockid is 0, the kernel interprets it as “the current thread” and skips the radix tree lookup altogether, going straight to the current task’s structure.

The OpenJDK fix currently gets the specific TID, flips the bits, and passes the result to clock_gettime(). This forces the kernel onto the generalized path (the radix tree lookup).

The kernel source looks like this:

/*
 * Functions for validating access to tasks.
 */
static struct pid *pid_for_clock(const clockid_t clock, bool gettime)
{
	[...]
	/*
	 * If the encoded PID is 0, then the timer is targeted at current
	 * or the process to which current belongs.
	 */
	if (upid == 0)
		// the fast path: current task lookup, cheap
		return thread ? task_pid(current) : task_tgid(current);

	// the generalized path: radix tree lookup, more expensive
	pid = find_vpid(upid);
	[...]
If the JVM constructed the entire clockid manually with PID=0 encoded (instead of obtaining the clockid through pthread_getcpuclockid()), the kernel could take the fast path and avoid the radix tree lookup altogether. The JVM already pokes at the clockid’s bits, so building it completely from scratch wouldn’t be a huge leap in terms of compatibility.

Let’s try it!

First, a refresher on the clockid encoding. The clockid is constructed as follows:

clockid for TID=42, user-time-only:

  1111_1111_1111_1111_1111_1110_1010_1101
  └───────────────~42────────────────┘│└┘
                                      │ └─ 01 = VIRT (user time only)
                                      └─── 1 = per-thread

For the current thread, we want PID=0 encoded, which gives ~0 in the upper bits:

  1111_1111_1111_1111_1111_1111_1111_1101
  └─────────────── ~0 ───────────────┘│└┘
                                      │ └─ 01 = VIRT (user time only)
                                      └─── 1 = per-thread

We can translate this into C++ as:

// Linux kernel internal bit encoding for dynamic CPU clocks:
// [31:3] : bitwise NOT of the PID or TID (~0 for current thread)
// [2]    : 1 = per-thread clock, 0 = per-process clock
// [1:0]  : clock type (0 = PROF, 1 = VIRT/user-only, 2 = SCHED)
static_assert(sizeof(clockid_t) == 4, "Linux clockid_t must be 32-bit");
constexpr clockid_t CLOCK_CURRENT_THREAD_USERTIME = static_cast<clockid_t>(~0u << 3 | 4 | 1);
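Before wiring this into HotSpot, the hand-built constant can be sanity-checked in a standalone program. This sketch (the helper name is mine) assumes the Linux bit layout above; a non-Linux kernel, or one that changed the encoding, would make clock_gettime() fail rather than crash:

```cpp
#include <time.h>

// User CPU time of the current thread via a hand-built dynamic clockid:
// upper bits ~0 (PID=0 -> "current thread", kernel fast path),
// bit 2 set (per-thread), low bits 01 (VIRT = user time only).
static long long fast_path_user_time_ns() {
  static_assert(sizeof(clockid_t) == 4, "Linux clockid_t must be 32-bit");
  constexpr clockid_t kCurrentThreadUserTime =
      static_cast<clockid_t>(~0u << 3 | 4 | 1);
  timespec tp;
  // The kernel treats negative clockids as dynamic CPU/fd clocks.
  if (clock_gettime(kCurrentThreadUserTime, &tp) != 0) return -1;
  return tp.tv_sec * 1000000000LL + tp.tv_nsec;
}
```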

and then make a tiny change where user_thread_cpu_time() was called:

jlong os::current_thread_cpu_time(bool user_sys_cpu_time) {
  if (user_sys_cpu_time) {
    return os::Linux::thread_cpu_time(CLOCK_THREAD_CPUTIME_ID);
  } else {
-   return user_thread_cpu_time(Thread::current());
+   return os::Linux::thread_cpu_time(CLOCK_CURRENT_THREAD_USERTIME);
  }
}
That change is enough to make getCurrentThreadUserTime() use the kernel fast path.

Given that we are already in nanosecond territory, we change the benchmark slightly:

  • increase the iteration and fork counts
  • use only one thread to reduce noise
  • switch to nanosecond output

These changes eliminate noise from the rest of my system and give a more accurate measurement of the small delta we expect:

@State(Scope.Benchmark)
@Warmup(iterations = 4, time = 5)
@Measurement(iterations = 10, time = 5)
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(1)
@Fork(value = 3)
public class ThreadMXBeanBench {

    static final ThreadMXBean mxThreadBean = ManagementFactory.getThreadMXBean();
    static long user; // To avoid dead-code elimination

    @Benchmark
    public void getCurrentThreadUserTime() throws Throwable {
        user = mxThreadBean.getCurrentThreadUserTime();
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ThreadMXBeanBench.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

The version currently in the JDK main branch gives:

Benchmark                                             Mode      Cnt       Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  4347067      81.746 ± 0.510  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               69.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               80.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              230.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1980.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           653312.000          ns/op

With manual clockid construction, which uses the kernel fast path, we get:

Benchmark                                             Mode      Cnt       Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  5081223      70.813 ± 0.325  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               59.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               80.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              170.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1830.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           425472.000          ns/op

The average dropped from 81.7 ns to 70.8 ns, approximately a 13% improvement, visible across all percentiles. Is constructing the clockid manually, instead of using pthread_getcpuclockid(), worth the loss of clarity? I’m not quite sure. The absolute gain is small, and it makes additional assumptions about kernel internals, including the size of clockid_t. On the other hand, in practice this is a benefit with no downside. (Famous last words…)

This is why I love browsing the commit logs of large open source projects. A roughly 40-line fix eliminated a 400x performance difference. The solution doesn’t require any new kernel features, just knowledge of a stable-but-obscure piece of the Linux ABI.

Lessons:

Read the kernel source. POSIX tells you what is portable; the kernel source tells you what is possible. Sometimes there is a 400x difference between the two. Whether it is worth exploiting is a different question.

Examine old assumptions. The /proc parsing approach made sense when it was written, before anyone realized the clockid bits could be exploited this way. Assumptions turn into code. Sometimes it pays to revisit them.

The change landed on December 3, 2025, just a day before the JDK 26 feature freeze. If you are using ThreadMXBean.getCurrentThreadUserTime(), JDK 26 (scheduled for March 2026) brings you a 30x-400x speedup for free!


