Multi-core Microprocessors
V Rajaraman
The first improvement was an increase in the chunk of data that could be processed in each clock cycle: it grew from 8 bits to 64 bits as the data path width was doubled every few years (see Table 1). Increasing the data path width also allowed a microprocessor to directly address a larger main memory. Throughout the history of computers, the processor has been much faster than the main memory, so reducing the speed mismatch between the processor and the main memory in a microcomputer was the next problem to be addressed. This was done by increasing the number of registers in the processor and by introducing on-chip cache memory, using the larger number of transistors that could be packed in a chip. The availability of a large number of transistors in a chip also enabled architects to increase the number of arithmetic units in a processor. Multiple arithmetic units allow a processor to execute several instructions in one clock cycle; this is called instruction-level parallel processing. Another architectural method of increasing the speed of processing, besides increasing the clock speed, is pipelining. In pipelining, the instruction cycle of a processor is broken up into p steps, each taking approximately equal time to execute. When a set of independent instructions is executed sequentially, their p steps are overlapped (as in an assembly line), thereby increasing the speed of execution of a long sequence of independent instructions p-fold. All these techniques, namely increasing the clock frequency, increasing the data path width, executing several instructions in one clock cycle, increasing on-chip memory, and pipelining, which were used to increase the speed of a single processor in a chip, could not be sustained, as will be explained in what follows.
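The p-fold speed-up from pipelining can be checked with a small calculation. The sketch below is illustrative only (the values of n and p are assumed, not taken from the article): it compares the cycles needed to run n independent instructions with and without a p-stage pipeline.

```python
# Illustrative sketch: cycles needed to execute n independent
# instructions when the instruction cycle is split into p equal steps.

def cycles_without_pipeline(n, p):
    # Each instruction finishes all p steps before the next one starts.
    return n * p

def cycles_with_pipeline(n, p):
    # The first instruction takes p cycles to fill the pipeline;
    # after that, one instruction completes every cycle.
    return p + (n - 1)

n, p = 1000, 5
speedup = cycles_without_pipeline(n, p) / cycles_with_pipeline(n, p)
print(speedup)  # about 4.98, approaching p (= 5) as n grows
```

For a long run of independent instructions the ratio n*p / (p + n - 1) tends to p, which is the p-fold speed-up claimed above.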
Why Multi-core?
We saw in the last section that the number of transistors packed in a single silicon chip has been doubling every two years, with the result that around 2006 designers were able to pack about 240 million transistors in a chip.
In 1965 Gordon Moore*, who was working at Fairchild Semiconductor, predicted, based on empirical data available since integrated circuits were invented in 1958, that the number of transistors in an integrated circuit chip would double every two years. This became a self-fulfilling prophecy in the sense that the semiconductor industry took it as a challenge and a benchmark to be achieved. A semiconductor manufacturers' association, comprising chip fabricators and the suppliers of materials and components to the industry, was formed to cooperate in providing better components, purer materials, and better fabrication techniques, which enabled the number of transistors in chips to double every two years.
Table 1 indicates how the number of transistors has increased in microprocessor chips made by Intel, a major manufacturer of integrated circuits. In 1971, there were 2300 transistors in the microprocessors made by Intel. By 2016, the count had reached 7,200,000,000, an increase by a factor of about 3 × 10^6 in 45 years, very close to Moore's law's prediction. The width of the gate of a transistor in the present technology (called 14 nm technology) is about 50 times the diameter of a silicon atom. Any further increase in the number of transistors fabricated in a planar chip will lead to the transistor gate size approaching 4 to 5 atoms. At this size quantum effects will be evident and a transistor may not work reliably. Thus we may soon reach the end of Moore's law as we know it now [1].
*Gordon E Moore, Cramming More Components onto Integrated Circuits, Electronics, Vol.38, No.8, pp.114–117, April 1965.
3. During the small but finite time when a transistor gate toggles, a direct path exists between the voltage source and the ground. This causes a short-circuit current I_st and leads to a power dissipation proportional to V_d · I_st. This dissipation also increases as transistors become smaller.
Figure 1. Structure of a multi-core microprocessor.
Mode of Cooperation
1. Each processing core may be assigned a separate program. All the cores execute the programs assigned to them independently and concurrently. This mode of cooperation is called Request-Level Parallel Processing.

2. All processing cores execute the same program concurrently but on different data sets. This is called the Single Program Multiple Data (SPMD) mode of parallel processing.

3. Spreading the heat dissipation evenly without creating hot spots in chips.

4. The on-chip interconnection network should normally be planar, as the interconnection wires are placed in a layer on top of the processor layer. In multi-core microprocessors, a bus is used to interconnect the processors.

5. The cost of designing multi-core processors is reduced if a single module, namely the core, its cache, and the bus interconnection, is replicated, since the maximum design effort is required in designing a core and its cache and in verifying the design.
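The SPMD mode in item 2 can be sketched in a few lines of Python. This is a hypothetical illustration, not the article's code: the function square_all and the data sets are invented, and each worker process stands in for a core running the same program on its own data.

```python
# SPMD sketch: every worker runs the same program (square_all),
# but each on a different data set.
from multiprocessing import Pool

def square_all(chunk):
    # The "single program" each core executes on its own data chunk.
    return [x * x for x in chunk]

if __name__ == "__main__":
    data_sets = [[1, 2], [3, 4], [5, 6]]   # one chunk per core
    with Pool(processes=3) as pool:
        results = pool.map(square_all, data_sets)
    print(results)  # [[1, 4], [9, 16], [25, 36]]
```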
Each core has its own cache, called an L1 cache. The L1 cache is divided into two parts: a data cache, which stores data that will be immediately required by the core, and an instruction cache, which stores the segment of a program needed immediately by the processing core. The data and instruction cache sizes vary between 16 and 64 KB. Another interesting part of a core is the temperature sensor. Temperature sensors are not normally used in single-core systems. In multi-core systems they are used particularly when the cores are complex processors such as Intel's i7 (a multi-threaded superscalar processor). Such processors tend to heat up at high clock frequencies (around 3 GHz). If a core heats up, its neighbouring cores may also be affected and the chip may fail. Temperature sensors are used to switch off a core when it overheats and to distribute its load to other cores. In Figure 2 we have shown an L2 cache as a shared on-chip memory. This is feasible as the huge number of transistors now available on a chip may be used to fabricate an L2 cache. Unlike processors, memory does not heat up much, as memory cells switch only when data is written to or read from them. The memory outside the chip is divided into a static RAM used as an L3 cache
statistical calculations on large populations, protein folding, and image processing. In such problems, multi-core computers are very efficient as the cores process tasks independently and rarely communicate. Such problems are often called embarrassingly parallel.

2. Multiple Instruction Multiple Data Programming
In a shared memory multi-core processor in which a program stored in the shared memory is to be executed by the cores cooperatively, the program is viewed as a collection of processes. Each core is assigned a different process to execute, along with the required data stored in the shared memory. The cores execute the processes assigned to them independently. When all the cores complete the tasks assigned to them, they re-join to complete execution of the program. Two statements are added to a programming language to enable the creation of processes and to wait for them to complete and re-join the main program. These statements are fork, to create a process, and join, used when the invoking process needs the results from the invoked process to continue processing. For example, consider the following statements of a parallel program in which P1 and P2 are two processes to be executed in parallel [3]:
Core x:
    begin P1
      ...
      fork P2;
      ...
      join P2;
      ...
    end P1

Core y:
    begin P2
      ...
    end P2
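The fork and join statements above map directly onto thread creation and joining in most languages. A minimal Python sketch under that analogy (the body of P2 is invented for illustration): starting a thread plays the role of fork, and Thread.join plays the role of join.

```python
# fork/join sketch using threads.
import threading

result_of_p2 = []

def p2():
    # Work of process P2, executed concurrently (e.g. on another core).
    result_of_p2.append(sum(range(10)))

t = threading.Thread(target=p2)  # fork P2
t.start()
# ... P1 continues its own work here, concurrently with P2 ...
t.join()                         # join P2: wait until P2 terminates
print(result_of_p2[0])  # 45
```

After t.join() returns, P1 is guaranteed that P2 has finished, so it can safely use P2's result, exactly as described for the join statement.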
If P2 has terminated, P1 takes the result from Core y and continues processing, and Core y is free to be assigned another process. If not, Core x waits until Core y completes P2, so that it can get the results it needs to continue processing.

When multiple processes work concurrently in different cores and update data stored in the shared memory, it is necessary to ensure that a shared variable is not initialised or updated independently and simultaneously by these processes. We illustrate this with a parallel program to compute sum ← sum + f(A) + f(B).
Suppose we write the following parallel program:
Core x:
    begin P1
      ...
      fork P2;
      ...
      sum ← sum + f(A);
      ...
      join P2;
      ...
    end P1

Core y:
    begin P2
      ...
      sum ← sum + f(B);
      ...
    end P2
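The danger in the program above is a lost update: both cores may read sum before either writes its result back. The sketch below replays that bad interleaving deterministically to show the wrong answer (f, A, and B are invented placeholders, not from the article).

```python
# Deterministic replay of the harmful interleaving:
# both processes read `sum` before either writes its update back.

def f(x):
    return x * x  # placeholder for some computation

A, B = 2, 3
sum_ = 0

# Core x and Core y both read sum (each sees 0) ...
read_by_x = sum_
read_by_y = sum_
# ... then both write back, and Core x's update is lost.
sum_ = read_by_x + f(A)   # Core x writes 0 + 4 = 4
sum_ = read_by_y + f(B)   # Core y overwrites with 0 + 9 = 9
print(sum_)  # 9, not the intended 0 + f(A) + f(B) = 13
```

With real concurrent threads this wrong result appears only on some runs, which is what makes such races hard to debug.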
A correct parallel program to compute sum ← sum + f(A) + f(B) is written below using the lock and unlock statements.
Core x:
    begin P1
      ...
      fork P2;
      ...
      lock sum;
      sum ← sum + f(A);
      unlock sum;
      ...
      join P2;
      ...
    end P1

Core y:
    begin P2
      ...
      lock sum;
      sum ← sum + f(B);
      unlock sum;
      ...
    end P2
In the above program, the process that reaches lock sum first will update sum. Until the lock is released, sum cannot be accessed by any other process that uses the locking mechanism. After it is unlocked, the other process can update it. This serialises the operations on the shared variable. As was pointed out earlier in this article, the machine instructions of cores to be used in a multi-core microprocessor normally incorporate fork, join, lock, and unlock instructions.
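The lock/unlock pattern above can be sketched with Python's threading.Lock standing in for the lock and unlock statements (f, A, and B are invented placeholders, as before): the lock serialises the two updates, so no update is lost.

```python
# lock/unlock sketch: a lock serialises updates to the shared variable.
import threading

def f(x):
    return x * x  # placeholder computation

shared = {"sum": 0}
lock = threading.Lock()

def add(value):
    with lock:                     # lock sum
        shared["sum"] += f(value)  # critical section: read, add, write
    # leaving the `with` block releases the lock (unlock sum)

t = threading.Thread(target=add, args=(3,))  # fork P2
t.start()
add(2)        # P1's own update, also under the lock
t.join()      # join P2
print(shared["sum"])  # 13
```

Whichever thread acquires the lock first runs its whole read-add-write sequence before the other can start, so the result is 13 on every run.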
An important requirement for correctly executing multiple processes concurrently is known as sequential consistency [3]. It is defined as:

"A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each individual processor occur in this sequence in the order specified by its program".

To ensure this in hardware, each processor must appear to issue and complete memory operations one at a time in program order.
Acknowledgment