Evaluating Uniform Memory Access Mode on AMD’s Turin ft. Verda (formerly DataCrunch.io)

NUMA, or Non-Uniform Memory Access, lets hardware expose the relationship between cores and memory controllers to software. NUMA nodes traditionally align with socket boundaries, but modern server chips can subdivide a socket into multiple NUMA nodes, reflecting how non-uniform the interconnect becomes as core and memory controller counts increase. AMD designates its NUMA modes with the NPS (nodes per socket) prefix.


NPS0 is a special NUMA mode that goes in the other direction. Instead of subdividing the system, NPS0 exposes a dual socket system as a single monolithic unit, interleaving memory accesses evenly across all memory controller channels and providing uniform memory access much like a desktop system. NPS0 and similar modes exist because optimizing for NUMA can be complex and time-consuming. The programmer has to specify a NUMA node for each memory allocation and minimize cross-node memory accesses. Each NUMA node only represents a fraction of the system's resources, so code pinned to a NUMA node will be constrained by that node's CPU core count, memory bandwidth, and memory capacity. Effort spent scaling an application across NUMA nodes may be effort not spent on other goals of a software project.
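
For readers who haven't dealt with NUMA-aware code, here is a minimal sketch of what node-aware allocation looks like on Linux with libnuma. It is an illustration of the general technique, not code from our testing; the node number is arbitrary and error handling is kept to a minimum.

```c
// Minimal libnuma sketch: pin a thread to a NUMA node and allocate
// memory on that same node, so accesses stay node-local.
// Build with: gcc numa_example.c -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = 0;                      // arbitrary node, for illustration only
    size_t size = 1ull << 30;          // 1 GiB buffer

    // Run this thread only on CPUs belonging to the chosen node
    numa_run_on_node(node);

    // Allocate memory backed by that node's memory controllers
    char *buf = numa_alloc_onnode(size, node);
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(buf, 0, size);              // touch pages so they are actually placed

    printf("Allocated %zu bytes on node %d\n", size, node);
    numa_free(buf, size);
    return 0;
}
```

Doing this consistently across a large codebase, and keeping threads from wandering off their home node, is exactly the kind of effort NPS0 lets a team skip.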

A dual socket Zen 5 (Turin) setup shown in NPS1 mode, from AMD’s EPYC 9005 series architecture overview

Thank you very much to Verda (formerly DataCrunch) for giving us access to a system with 2 AMD EPYC 9575Fs and 8 Nvidia B200 GPUs. Verda gave us about 3 weeks to use it however we wanted. While this article takes a look at the AMD EPYC 9575Fs, there will be upcoming coverage of the B200s found in the VMs.

The system appears to be running in NPS0 mode, providing an opportunity to see how a modern server behaves with all 24 memory controllers presenting uniform memory access.
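
One quick way to sanity-check this from inside the VM is to ask how many NUMA nodes the OS actually sees. A short sketch using libnuma is below; under NPS0 a dual socket system shows up as a single node, while NPS1 on the same hardware would report two.

```c
// Query how many NUMA nodes the OS exposes.
// Build with: gcc nodes.c -lnuma
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("Kernel reports no NUMA support (single node)\n");
        return 0;
    }
    printf("NUMA nodes visible to the OS: %d\n", numa_num_configured_nodes());
    printf("Highest node number: %d\n", numa_max_node());
    return 0;
}
```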


A simple latency test immediately shows the cost of providing uniform memory access. DRAM latency exceeds 220 ns, a penalty of about 90 ns compared to the EPYC 9355P running in NPS1 mode. That's a steeper penalty than the equivalent of NPS0 carried on older systems. For example, a dual-socket Broadwell system had 75.8 ns of DRAM latency when each socket was treated as a NUMA node, and 104.6 ns when set up for uniform memory access.[1]
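
For context, memory latency tests of this sort are typically built around a dependent pointer chase, where each load's address comes from the previous load, so iterations can't overlap and the time per iteration approximates round-trip memory latency. The sketch below illustrates that general technique under that assumption; it is not the exact harness used for these numbers.

```c
// Sketch of a dependent pointer chase over a randomly permuted array.
// Each load depends on the previous one, so average time per iteration
// approximates memory latency once the array is far larger than cache.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t count = 1ull << 25;            // 32M entries * 8B = 256 MiB
    uint64_t *arr = malloc(count * sizeof(uint64_t));
    if (!arr) return 1;

    // Sattolo's algorithm: build a single-cycle permutation so the chase
    // visits every entry once before repeating (defeats short loops in cache)
    for (size_t i = 0; i < count; i++) arr[i] = i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        uint64_t tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    uint64_t idx = 0;
    const size_t iters = 1ull << 24;
    for (size_t i = 0; i < iters; i++) idx = arr[idx];   // dependent loads

    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("~%.1f ns per load (result %llu)\n", ns / iters, (unsigned long long)idx);
    free(arr);
    return 0;
}
```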


Bringing twice as many memory controllers into play gives NPS0 mode a bandwidth advantage. But the extra bandwidth doesn't translate into latency benefits until bandwidth demand reaches approximately 400 GB/s, which is where the EPYC 9355P starts to suffer as the latency test thread contends with bandwidth-heavy threads. A bandwidth test using only linear read patterns can achieve 479 GB/s in NPS1 mode. However, my bandwidth test produces lower figures on the EPYC 9575F because not all test threads finish at the same time. I avoid that problem in loaded memory latency testing by having the bandwidth load threads check a flag, which lets me stop all threads at almost the same time.
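
To illustrate that flag mechanism, here is a rough sketch of how load threads can be stopped in near-unison: the main thread raises an atomic flag when the measurement is done, and each bandwidth thread polls that flag as it streams through its buffer. This is a simplified illustration of the approach described above, not the actual test code; buffer sizes and thread counts are arbitrary.

```c
// Sketch: bandwidth load threads that poll a shared stop flag, so all of
// them halt at nearly the same moment instead of finishing a fixed
// iteration count at different times. Build with: gcc -O2 -pthread load.c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static atomic_bool stop_flag;

static void *bandwidth_thread(void *arg) {
    const size_t elems = 32 * 1024 * 1024;   // 256 MiB of doubles per thread
    const size_t chunk = 1 << 20;            // re-check the flag every 8 MiB read
    double *buf = malloc(elems * sizeof(double));
    double sum = 0.0;

    for (size_t i = 0; i < elems; i++) buf[i] = (double)i;   // fault pages in

    // Stream reads over the buffer until the main thread raises the flag
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        for (size_t base = 0; base < elems; base += chunk) {
            for (size_t i = base; i < base + chunk; i++) sum += buf[i];
            if (atomic_load_explicit(&stop_flag, memory_order_relaxed)) break;
        }
    }

    *(volatile double *)arg = sum;           // keep reads from being optimized away
    free(buf);
    return NULL;
}

int main(void) {
    enum { NTHREADS = 8 };
    pthread_t threads[NTHREADS];
    double sinks[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, bandwidth_thread, &sinks[i]);

    // ... the loaded latency measurement would run on another core here ...

    atomic_store(&stop_flag, true);          // stop every load thread together
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    puts("load threads stopped");
    return 0;
}
```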


Per-CCD bandwidth is barely affected by NPS mode. Both the EPYC 9355P and 9575F use “GMI-Wide” links for their core complex dies, or CCDs, which provide 64B/cycle of read and write bandwidth at the Infinity Fabric clock. On both chips, each CCD therefore enjoys more bandwidth to the rest of the system than a standard “GMI-Narrow” configuration would give it. For reference, a typical desktop GMI-Narrow setup running a 2 GHz FCLK is limited to 64 GB/s of read and 32 GB/s of write bandwidth.
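
As a quick sanity check on those figures, link bandwidth is just bytes per cycle multiplied by the fabric clock. The snippet below works through the numbers quoted above at an assumed 2 GHz FCLK, which is the desktop example; the FCLK on these server parts may well differ.

```c
// Link bandwidth = bytes per cycle * fabric clock (FCLK).
// Figures from the text: GMI-Wide moves 64B/cycle for reads and writes;
// a desktop GMI-Narrow link at 2 GHz tops out at 64 GB/s read and
// 32 GB/s write, i.e. 32B/cycle read and 16B/cycle write.
#include <stdio.h>

static void print_bw(const char *name, double fclk_ghz, double rd_bytes, double wr_bytes) {
    printf("%-10s @ %.1f GHz FCLK: %.0f GB/s read, %.0f GB/s write\n",
           name, fclk_ghz, rd_bytes * fclk_ghz, wr_bytes * fclk_ghz);
}

int main(void) {
    print_bw("GMI-Narrow", 2.0, 32, 16);   // matches the desktop reference above
    print_bw("GMI-Wide",   2.0, 64, 64);   // per-cycle figure from the text
    return 0;
}
```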

High memory latency could be expected to hurt performance, especially in single threaded workloads. But the EPYC 9575F performs surprisingly well in SPEC CPU2017. It runs at a higher 5 GHz clock speed, and DRAM latency is just one of many factors that affect CPU performance.


Individual workloads show a more complex picture. The EPYC 9575F does best when a workload rarely misses cache, letting its higher 5 GHz clock speed shine; 548.exchange2 is an example. On the other hand, workloads that hit DRAM hard suffer from NPS0 mode’s higher latency. The EPYC 9575F’s clock speed advantage counts for nothing in 502.gcc, 505.mcf, and 520.omnetpp, where the higher clocked chip performs worse than a 4.4 GHz setup with lower DRAM latency.


SPEC CPU2017’s floating point suite also shows diverse behavior. 549.fotonik3d and 554.roms suffer in NPS0 mode as the EPYC 9575F struggles to keep itself fed. 538.imagick plays to the EPYC 9575F’s strengths: a high cache hitrate in that test lets the 9575F’s higher core throughput shine.


NPS0 mode performs surprisingly well in the single threaded SPEC CPU2017 run. Some sub-tests suffer from higher memory latency, but others benefit enough from the 9575F’s higher 5 GHz clock speed to make up the difference. This is a lesson in the importance of clock speed and good caching in a modern server CPU. Those two factors go together, because faster cores only provide performance benefits if the memory subsystem can feed them. The EPYC 9575F’s good overall performance despite over 220 ns of memory latency shows how good its caching setup is.

As for running in NPS0 mode, I don’t think it’s a good fit for a modern system. The latency penalty is very high, and the bandwidth gain for NUMA-unaware code is modest. I expect the latency penalty to get worse as server core and memory controller counts continue to increase. For workloads that need to scale across socket boundaries, optimizing for NUMA looks like an unfortunate necessity.

Again, thank you very much to Verda (formerly DataCrunch), without whom this article, and the upcoming B200 articles, would not be possible!


