Gem #98: High Performance Multi-core Programming - Part 2
by Pat Rogers —AdaCore
Let’s get started…
Chameneos-Redux is part of the Computer Language Shootout, a suite of benchmarks that compares the implementations of various programming languages across different kinds of applications and platforms. The program is required to perform a specified number of rendezvous between mythical "chameneos" creatures, where each creature is represented by a distinct thread. Each rendezvous is symmetric, in that the participating creatures can be either the caller or the called member in any given encounter. The multi-core benchmark results for all implementations of Chameneos-Redux are available here: Chameneos-Redux.
In the previous Gem in this series, we made the point that design trumps tuning: when it comes to performance, no amount of tweaking can compensate for an inherently slow design. In this Gem we explore another important aspect of the design: minimizing resource conflicts via processor assignments for threads. This issue arises because the Chameneos-Redux benchmark requires two distinct "games," in which a number of threads perform the required number of rendezvous. (One game has a different number of threads, but both must execute the same number of rendezvous.) When multiple processors are available the program runs the two games concurrently, so it would be possible for the two sets of threads to run on the same cores.
To prevent the threads of one game from interfering with the threads of another game running concurrently, we permanently assign threads to processors rather than let the operating system assign them dynamically. Specifically, we allow a thread to execute on any of the cores within a single assigned processor. All of the leading Chameneos-Redux implementations use the same approach to prevent this sharing.
Assigning threads to processors is specified in terms of "slots" instead of directly in terms of processors. A slot is an integer value corresponding to a processor number, but since there are likely more threads than processors, when the slot number exceeds the number of processors present in the machine, the slot number "wraps around" back to the beginning processor. As a result, there is effectively no limit to the number of slots available. This does not prevent threads from sharing (cores on) processors, it simply makes assignment convenient. If there are too many threads, sharing will be unavoidable.
Slot assignment is achieved using pragma Task_Info within the task type declaration representing the chameneos creatures' threads. The argument to the pragma is an access value designating a value of type Thread_Attributes, which on the benchmark operating system is a record type containing a single affinity bit-mask component. The affinity mask indicates the cores on which tasks may execute. We define a function Affinity that returns a value of that type. The following code fragment illustrates use of the pragma and the function:
task type Thread (This : access Creature; Slot : Natural) is pragma Task_Info (new Thread_Attributes'(CPU_Affinity => Affinity (Slot))); end Thread;
The argument to the Affinity function is a task type discriminant indicating the slot to use. A slot logically contains an affinity mask that indicates the cores in the corresponding processor, and it is this mask that is returned by the function.
Slot zero is used to hold a bit-mask indicating all the cores known to the system. We import function Sched_Getaffinity to get this value. For example, imagine the function returns a mask with the first eight bits enabled, indicating that a total of eight cores are available. Slot zero will then hold a bit mask with eight bits set. Assuming two cores per processor, and assuming that every two contiguous bits represent cores on the same processor, the following illustrates the resulting masks per slot:
Bit # Slot # 0123456789... 0 1111111100 1 1100000000 2 0011000000 3 0000110000 4 0000001100 5 1100000000 6 0011000000
Observe that at slot 5 the two cores on the first processor are again involved. You can see the details of the implementation by examining the Chameneos.Processors package.
Getting back to the game, the main program determines whether the target is a multi-core machine and assigns slots for the two required games, and thus the threads in the games, accordingly:
if Processor_Count < 4 then -- run the games sequentially Game1.Start (Game1_Creature_Colors, N, Slot => 0); Game1.Await_Completion; Game2.Start (Game2_Creature_Colors, N, Slot => 0); Game2.Await_Completion; else -- run the games in parallel Game1.Start (Game1_Creature_Colors, N, Slot => 1); Game2.Start (Game2_Creature_Colors, N, Slot => 2); Game1.Await_Completion; Game2.Await_Completion; end if;
In this manner the two games do not share processors, so they do not conflict with each other.
In the next Gem in this series we will explore another aspect of the implementation relative to the cache, specifically a user-defined storage allocator that allocates dynamic memory on cache-aligned boundaries.