
NPS0 is a special NUMA mode that goes in the other direction. Instead of subdividing the system, NPS0 exposes a dual socket system as a single monolithic unit. It interleaves memory accesses evenly across all memory controller channels in the system, providing uniform memory access much like a desktop platform. NPS0 and similar modes exist because optimizing for NUMA can be complex and time-consuming. The programmer has to specify a NUMA node for each memory allocation and minimize cross-node memory accesses. Each NUMA node only represents a fraction of the system’s resources, so code pinned to a NUMA node is constrained by that node’s CPU core count, memory bandwidth, and memory capacity. The effort spent scaling an application across NUMA nodes may be effort not spent on other goals of a software project.
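
To make that concrete, here is a minimal sketch of what NUMA-aware allocation looks like on Linux with libnuma. The node index and buffer size are placeholders, and this illustrates the general pattern rather than any code used for the testing in this article.

```c
// Minimal sketch of NUMA-aware allocation on Linux with libnuma.
// Build with: gcc numa_alloc.c -o numa_alloc -lnuma
// The node index and buffer size below are arbitrary placeholders.
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports no NUMA support\n");
        return 1;
    }

    int node = 0;                 // explicitly chosen NUMA node
    size_t size = 1ull << 30;     // 1 GiB buffer

    // Pin this thread to the chosen node and back the buffer with that
    // node's memory, so accesses stay local instead of crossing sockets.
    numa_run_on_node(node);
    char *buf = numa_alloc_onnode(size, node);
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(buf, 0, size);         // touch pages so they are actually placed

    numa_free(buf, size);
    return 0;
}
```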

A big thank you to Verda (formerly DataCrunch), who provided a system with 2 AMD EPYC 9575Fs and 8 Nvidia B200 GPUs and gave us about 3 weeks to use it however we wanted. While this article takes a look at the AMD EPYC 9575Fs, there will be upcoming coverage of the B200s found in the VMs.
The system appears to be running in NPS0 mode, providing an opportunity to see how a modern server behaves when 24 memory controllers are presented as a single pool with uniform memory access.
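
As a quick way to see what the OS is presented with (a hedged sketch, assuming Linux and libnuma), the configured node count tells the story: in NPS0, both sockets’ memory controllers sit behind a single visible NUMA node.

```c
// Report how many NUMA nodes and CPUs the OS exposes.
// Build with: gcc nps_check.c -o nps_check -lnuma
// In NPS0, a dual socket system should report a single NUMA node.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("libnuma reports no NUMA support\n");
        return 1;
    }
    printf("NUMA nodes visible: %d\n", numa_num_configured_nodes());
    printf("CPUs visible:       %d\n", numa_num_configured_cpus());
    return 0;
}
```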

A simple latency test immediately shows the cost of providing uniform memory access. DRAM latency exceeds 220 ns, a penalty of about 90 ns compared to the EPYC 9355P running in NPS1 mode. That is a higher penalty than the equivalent of NPS0 carried on older systems. For example, a dual-socket Broadwell system sees 75.8 ns of DRAM latency when each socket is treated as its own NUMA node, and 104.6 ns when set up for uniform memory access across both sockets.[1]
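
For context on how numbers like these are typically measured (a generic sketch, not the exact harness behind the results above), a latency test chases pointers through a buffer far larger than the caches, so each load depends on the one before it and exposes the full trip to DRAM.

```c
// Minimal pointer-chasing latency sketch: the buffer is turned into a
// random cyclic chain of indices, then chased so every load depends on
// the previous one. Buffer size and iteration count are placeholders.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void) {
    size_t count = (1ull << 30) / sizeof(uint64_t);   // 1 GiB region, far beyond L3
    uint64_t *chain = malloc(count * sizeof(uint64_t));
    if (!chain) return 1;

    // Build a random single cycle (Sattolo shuffle) so the chase visits the
    // whole buffer in an unpredictable order, defeating the prefetchers.
    for (size_t i = 0; i < count; i++) chain[i] = i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % i;
        uint64_t tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
    }

    struct timespec start, end;
    uint64_t idx = 0;
    const uint64_t iters = 20 * 1000 * 1000;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iters; i++) idx = chain[idx];   // dependent loads
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    // Print idx so the compiler cannot optimize the chase away.
    printf("%.1f ns per load (last index %llu)\n", ns / iters, (unsigned long long)idx);
    free(chain);
    return 0;
}
```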

Bringing twice as many memory controllers into play gives NPS0 mode a bandwidth benefit. But the additional bandwidth does not translate into a latency advantage until bandwidth demand reaches approximately 400 GB/s, which is where the EPYC 9355P starts to suffer when the latency test thread is mixed with bandwidth-heavy threads. A bandwidth test using only linear read patterns can achieve 479 GB/s in NPS1 mode. However, my bandwidth test produces lower values on the EPYC 9575F because not all test threads terminate at the same time. I avoid this problem in loaded memory latency testing by having the bandwidth load threads check a flag, which lets me stop all of them at almost the same time.
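
The idea behind that flag is simple enough to sketch (hypothetical code, not my actual test): bandwidth load threads stream through their own buffers and poll a shared atomic flag, so the main thread can stop them all at nearly the same instant once the latency measurement is done.

```c
// Sketch of stopping bandwidth load threads with a shared flag.
// Build with: gcc loaded_latency_sketch.c -o loaded_latency_sketch -pthread
// Thread count, buffer size, and the sleep standing in for the latency
// measurement are all placeholders.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BW_THREADS 8

static atomic_bool stop_flag;

static void *bandwidth_thread(void *arg) {
    size_t count = (256ull << 20) / sizeof(uint64_t);   // 256 MiB per thread
    uint64_t *buf = calloc(count, sizeof(uint64_t));
    if (!buf) return NULL;
    uint64_t sum = 0, bytes = 0;
    // Keep streaming linear reads until the main thread raises the flag.
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        for (size_t i = 0; i < count; i++) sum += buf[i];
        bytes += count * sizeof(uint64_t);
    }
    *(uint64_t *)arg = bytes;          // report how much traffic was generated
    free(buf);
    return (void *)(uintptr_t)sum;     // keep the reads from being optimized out
}

int main(void) {
    pthread_t threads[BW_THREADS];
    uint64_t bytes[BW_THREADS] = {0};
    for (int i = 0; i < BW_THREADS; i++)
        pthread_create(&threads[i], NULL, bandwidth_thread, &bytes[i]);

    sleep(5);   // stand-in for running the latency (pointer chasing) thread

    atomic_store(&stop_flag, true);    // stop every bandwidth thread together
    uint64_t total = 0;
    for (int i = 0; i < BW_THREADS; i++) {
        pthread_join(threads[i], NULL);
        total += bytes[i];
    }
    printf("bytes streamed by load threads: %llu\n", (unsigned long long)total);
    return 0;
}
```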

Per-CCD bandwidth is barely affected by different NPS modes. Both the EPYC 9355P and 9575F use “GMI-wide” links to connect their core complex dies, or CCDs, which provide 64B/cycle of read and write bandwidth at the Infinity Fabric clock. On both chips, each CCD enjoys more bandwidth to the rest of the system than it would with the standard “GMI-narrow” configuration. For reference, a typical desktop GMI-narrow setup running a 2 GHz FCLK would be limited to 64 GB/s of read and 32 GB/s of write bandwidth.
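
Those desktop figures are just link width multiplied by clock. Assuming the commonly cited GMI-narrow widths of 32 bytes per cycle for reads and 16 bytes per cycle for writes (an assumption on my part, not something measured here):

32 B/cycle × 2 GHz = 64 GB/s read, and 16 B/cycle × 2 GHz = 32 GB/s write
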
High memory latency can be expected to hurt performance, especially in single-threaded workloads. But the EPYC 9575F performs surprisingly well in SPEC CPU2017. It runs at a higher 5 GHz clock speed, and DRAM latency is just one of many factors affecting CPU performance.

Individual workloads show a more complex picture. The EPYC 9575F does best when a workload rarely misses in cache; then its higher 5 GHz clock speed can shine, with 548.exchange2 being an example. On the other hand, workloads that lean heavily on DRAM suffer much more in NPS0 mode. The EPYC 9575F’s high clock speed counts for little in 502.gcc, 505.mcf, and 520.omnetpp, where the higher clocked chip performs worse than a 4.4 GHz setup with lower DRAM latency.

SPEC CPU2017’s floating point suite also shows diverse behavior. 549.fotonik3d and 554.roms suffer in NPS0 mode as the EPYC 9575F struggles to keep its cores fed. 538.imagick, by contrast, plays to the EPYC 9575F’s strengths: in that test, a high cache hitrate lets its greater core throughput shine.

NPS0 mode performs surprisingly well in the single-threaded SPEC CPU2017 run. Some sub-tests suffer from the higher memory latency, but others benefit enough from the 5 GHz clock speed to make up the difference. This is a lesson in the importance of clock speed and good caching in a modern server CPU. Those two factors go together, because faster cores only provide performance benefits if the memory subsystem can feed them. The EPYC 9575F’s good overall performance despite over 220 ns of memory latency shows how good its caching setup is.
As for running in NPS0 mode, I don’t think it’s appropriate for a modern system. The latency penalty is very high, and the bandwidth gain for NUMA-unaware code is modest. I expect the latency penalty to get worse as server core and memory controller counts continue to increase. For workloads that need to scale across socket boundaries, optimizing for NUMA seems to be an unfortunate necessity.
Again, thank you very much to Verda (formerly DataCrunch), without whom this article, and the upcoming B200 articles, would not be possible!