With dual-core processors widely available at your average electronics shop, and quad-core models not far on the horizon, scheduled for release in 2007 by both Intel and AMD, it's only a matter of time before we see even higher levels of integration, with eight or more cores integrated into one physical processor package.
However, while the move to dual core was more of a manufacturing problem, mostly a matter of getting high enough yields to justify production, the move beyond four cores will be a design challenge: finding a way to let all those cores communicate effectively with each other, and with the remainder of the system, without requiring an overwhelming amount of resources.
So, how could a processor designer manage to integrate, say, 16 cores into one package without an overwhelming amount of circuitry to glue those cores together?
One way would be to use a layered approach. To get a grasp of this approach, first let's assume that quad-core processors would expand on the same methodology used to connect dual-core processors, i.e., using a crossbar to let both cores talk to each other and to the remainder of the system. Taking the crossbar beyond four cores would be a very big design problem, as it would need to provide tremendous amounts of bandwidth.
Here is where the layered approach kicks in, by borrowing an interconnect architecture that would divide those 16 cores into groups, with each group having a super-fast interconnect, like the crossbar, and then gluing those groups together with a relatively slower interconnect, something like a HyperTransport link on steroids.
For example, in a 16-core model, we would have four groups of four cores each. Each quad-core group would connect as usual through its crossbar, and then the crossbars of the four quad groups would talk to each other and to the outside world through a simpler, relatively slower interconnect.
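The two-level topology above can be sketched as a toy cost model. The latency figures below are made-up illustrative values, not measurements of any real part; the point is only that a message pays the slower inter-group link if and only if it leaves its group of four.

```python
# Toy model of the two-level interconnect: 16 cores, four groups of four.
# Latency numbers are hypothetical, chosen only to show the asymmetry.

CORES_PER_GROUP = 4
CROSSBAR_LATENCY_NS = 10     # fast intra-group crossbar (assumed value)
INTER_GROUP_LATENCY_NS = 40  # slower group-to-group link (assumed value)

def group_of(core: int) -> int:
    """Which group a given core ID belongs to."""
    return core // CORES_PER_GROUP

def hop_latency(src: int, dst: int) -> int:
    """Latency for a message from core src to core dst."""
    if group_of(src) == group_of(dst):
        return CROSSBAR_LATENCY_NS  # stays on the local crossbar
    # Crossbar out of the source group, slower link, crossbar into the
    # destination group.
    return CROSSBAR_LATENCY_NS + INTER_GROUP_LATENCY_NS + CROSSBAR_LATENCY_NS

print(hop_latency(0, 3))   # same group  -> 10
print(hop_latency(0, 12))  # cross-group -> 60
```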
For such an approach to succeed, there have to be some considerations at both the hardware and software levels. At the hardware level, each core would preferably have its own caches, or, if the design permits, each group of cores would share one large cache (similar to Intel's approach with the Pentium-M). This per-core, or per-group, cache would greatly minimize the amount of traffic between the groups and the remainder of the system. In a multi-socket setup, a shared bus would never work, as there would be a tremendous amount of load on the memory controller, on top of all the traffic that would need to go between the cores. A setup similar to what AMD did with their Opteron line would be highly successful: each physical processor has its own memory controller, and a point-to-point topology connects the physical processors around the system.
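To see why a shared per-group cache filters traffic off the slower interconnect, here is a minimal sketch: a tiny LRU cache standing in for the group cache, where every miss represents a request that has to leave the group. The capacity and access pattern are arbitrary, purely for illustration.

```python
# Toy per-group cache: hits stay inside the group; each miss is one
# request that must cross the slower inter-group interconnect.

from collections import OrderedDict

class GroupCache:
    """Tiny LRU cache standing in for a shared per-group cache."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()  # cache lines, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Look up addr; returns True on a hit, False on a miss."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        return False

cache = GroupCache(capacity=2)
for addr in [0, 1, 0, 1, 2, 0]:
    cache.access(addr)
print(cache.hits, cache.misses)  # 2 4
```

Even this tiny example shows the effect: four of the six accesses would have crossed the slow link without the cache, and a realistically sized cache with real locality would filter far more.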
One addition at the hardware level that would greatly help at the software level, in interprocess communication, would be a number of small buffers, say four buffers of 4-8KB each per core, that act as a very high-speed shared memory that is addressable by, and accessible to, all the cores around the system. Of course, the OS would be responsible for assigning who could access those buffers. Such high-speed buffers would eliminate the need for a shared memory area in RAM for interprocess communication.
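As a sketch of how these proposed buffers might behave, here is a pure-Python model: a small scratch buffer with an OS-managed access list, so one core can pass a message to another without a round trip through main RAM. The class name, sizes, and methods are all hypothetical; no real hardware or OS exposes this interface.

```python
# Model of the proposed per-core IPC buffers. Everything here is
# hypothetical: BUF_SIZE, the grant/read/write interface, and the
# access-control scheme are illustrations of the idea, not a real API.

BUF_SIZE = 4 * 1024  # 4 KB, within the 4-8 KB range suggested above

class IPCBuffer:
    def __init__(self, owner_core: int):
        self.data = bytearray(BUF_SIZE)
        self.allowed = {owner_core}  # OS-managed access list

    def grant(self, core: int) -> None:
        """The OS grants another core access to this buffer."""
        self.allowed.add(core)

    def write(self, core: int, offset: int, payload: bytes) -> None:
        if core not in self.allowed:
            raise PermissionError(f"core {core} has no access")
        self.data[offset:offset + len(payload)] = payload

    def read(self, core: int, offset: int, length: int) -> bytes:
        if core not in self.allowed:
            raise PermissionError(f"core {core} has no access")
        return bytes(self.data[offset:offset + length])

# Core 0 passes a message to core 5 without touching main RAM.
buf = IPCBuffer(owner_core=0)
buf.grant(5)
buf.write(0, 0, b"hello")
print(buf.read(5, 0, 5))  # b'hello'
```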
At the software level, which would perhaps have a greater impact on how such an arrangement performs, there will have to be some changes in the way the operating system deals with each core in an SMP system. First, the OS will need to be aware of this layered approach and distribute the processes/threads at hand accordingly. At the most basic level, which is nothing new, the OS will try its best to keep each process/thread on the same core. In a system with a large number of cores, this would substantially increase cache hits, as there would be far fewer context switches on each core. Then, if dealing with a multi-threaded application, the OS should try to keep the threads within the same group of cores, as those cores have the highest amount of bandwidth interconnecting them. If there are more threads than cores in the group, and all of those threads are fully loading the group, then the OS would allocate more cores for the application within the same package. Only when the given application has more processes/threads than all the cores on the package/socket can handle would the OS consider running the extra threads/processes across multiple physical processors. In short, at the software level, it's all about affinity.
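The group-first placement policy described above can be sketched in a few lines. This is a simplification of what a real scheduler would do (it ignores load, priorities, and other sockets); the 16-core, four-group topology is the running example from this article, not a real API.

```python
# Sketch of group-first thread placement: fill the home group's cores
# before spilling into the other groups of the same package. Purely a
# policy illustration; real OS schedulers also weigh load and priority.

CORES_PER_GROUP = 4
GROUPS_PER_PACKAGE = 4

def place_threads(n_threads: int, home_group: int) -> list[int]:
    """Return a core ID for each thread, preferring the home group."""
    # Order the groups so the home group's cores are filled first.
    group_order = [home_group] + [g for g in range(GROUPS_PER_PACKAGE)
                                  if g != home_group]
    cores = [g * CORES_PER_GROUP + c
             for g in group_order
             for c in range(CORES_PER_GROUP)]
    # Wrap around if there are more threads than cores in the package;
    # a real OS would spill to other sockets at that point instead.
    return [cores[i % len(cores)] for i in range(n_threads)]

print(place_threads(3, home_group=2))  # [8, 9, 10]  - stays in group 2
print(place_threads(6, home_group=2))  # [8, 9, 10, 11, 0, 1] - spills over
```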
Of course, if we are talking about 16 full-fledged cores per processor package, the OS would have a lot more flexibility in scheduling and assigning processes/threads to a given core. The fewer processes/threads assigned to each core, the fewer context switches that have to be carried out, which also minimizes the need for large caches.
In the end, while such a layered solution for integrating a large number of cores into a single processor package would require a considerable amount of hardware redesign, along with software changes at the OS level, there is very little change required at the application level: it doesn't break backward compatibility with older applications, while enabling substantial performance improvements for applications that are coded with the system design and features in mind.