Non-Uniform Memory Access (NUMA), a special type of memory organization used in multi-processor AMD Opteron platforms, has been around for quite a while, in fact since the announcement of the AMD Opteron 200 and 800 series, which support multi-processor configurations. But we still haven't carried out a low-level analysis of its advantages and disadvantages. That will be the subject of this article. Fortunately, our lab has a dual-processor system based on AMD Opteron CPUs. But first, let's review the key features of this architecture.
The majority of typical multi-processor systems are based on symmetric multiprocessing (SMP), in which all processors share a common FSB and, consequently, a common memory bus.
Simplified flowchart of an SMP system
On the one hand, this architecture provides almost identical memory access latencies for every processor. On the other hand, the common system bus is a potential bottleneck of the entire memory system in terms of bandwidth, which is no less (and often much more) important. Indeed, if a multi-threaded application is sensitive to memory bandwidth, its performance will be limited by this memory organization.
What does AMD offer in its Non-Uniform Memory Access architecture (in full, Cache-Coherent Non-Uniform Memory Access, ccNUMA)? The idea is simple: since AMD64 processors have an integrated memory controller, each processor in a multi-processor system has its own local memory. Processors are connected to each other via HyperTransport links, which, unlike the traditional FSB, bear no direct relation to the memory subsystem.
Simplified flowchart of a NUMA system
In a NUMA system, processors access their local memory at low latencies (especially compared to an SMP system), while remote memory (belonging to the other processor) is accessed at higher latencies. This is where the notion of non-uniform memory organization comes from. It is not hard to see that if memory accesses are organized correctly (each processor operates solely on data in its local memory), such an architecture has an advantage over a classic SMP solution, because there is no shared system bus to limit bandwidth. The total peak memory bandwidth in this case equals twice the bandwidth of the memory modules installed on each node.
But correct memory access organization is the key notion here. NUMA platforms must be supported both by the operating system (at the very least, the OS and applications should be able to "see" the memory of all processors as one contiguous block) and by applications. The latest versions of Windows XP (SP2) and Windows Server 2003 fully support NUMA systems (32-bit versions require Physical Address Extension to be enabled (/PAE in boot.ini), which is fortunately on by default on AMD64 platforms, as it is required by Data Execution Prevention). As for applications, support first of all means that a program should not place its data in the memory of one processor and then access it from the other. We will now examine the effect of following this recommendation, and of failing to do so.