Quantifying The Cost of Context Switch

Figure 1: The effect of data size on the cost of the context switch (x-axis: array size, 1KB to 2048KB; y-axis: context switch cost, 0 to 250)
writes a message to the other process and then blocks on the next read operation. We still have a single process simulating the two processes' behavior except for context switches. The simulation process does the same amount of array accesses as each of the two communicating processes. Assuming the execution time of 10,000 round-trip communications between the two test processes is s1 and the execution time of 10,000 simulated round-trip communications is s2, we get the total time cost per context switch as c2 = s1/20000 − s2/10000.

We change the following two parameters during different runs of our test.

• Array size: the total data accessed by each process.

• Access stride: the size of the strided access.

L2 cache and the cache line size is 128B. The operating system is the Linux 2.6.17 kernel with Redhat 9. The compiler is gcc 3.2.2. We do not use any optimization option for compilation. Our source code can be found at http://www.cs.rochester.edu/u/cli/research/switch.htm.

The average direct context switch cost (c1) in our system is 3.8 microseconds. The results shown below are for the total cost per context switch (c2). In general, c2 ranges from several microseconds to more than one thousand microseconds. The indirect context switch cost can be estimated as c2 − c1.

In the following subsections, we first discuss the effects of data size and access stride on the cost of context switch. Then we discuss the effect of the experimental environment on measurement accuracy.
Figure 2: The effect of the access stride on the cost of context switch (x-axis: array size, 32KB to 2048KB; y-axis: context switch time in microseconds, 0 to 1500; one curve per access stride: 8B, 16B, 32B, 64B, 128B; cache size: 512KB)
hence the cases of RMW and write showing twice the cost as the case of read at the next three array sizes.

The third region starts from the array size 512KB. Here the dataset of both the communicating processes and the simulation process is larger than the L2 cache. The cost of context switch is still high compared to the first region, showing the presence of cache interference. But the curves do not increase monotonically with the array size. This is because context switch is no longer the only cause of cache misses. Since the dataset of each process is too large to fit in cache, cache misses happen even when there is no context switch.

These results show the nonlinear effect of cache sharing. One may question whether it is proper to count this as part of the indirect cost of a context switch, because contention for the cache resource also happens when multiple processors share the same cache, regardless of whether they incur context switches or not. However, we note that the overhead of a time-shared cache is not the same as that of a concurrently shared cache. Take the simple example of two concurrent processes writing to the same data block. The cost of their cache interference at each context switch is the re-loading of the cache block, which is very different from the cost of parallel access. In general, the interference manifests as cache warm-ups in the case of context switch. A number of past studies have examined this cost in detail, including its relation to the cache size and other parameters, the workload, and the length of CPU quanta [5, 2, 7, 1]. Further work comparing time-shared and concurrently shared caches would be interesting.

3.2 Effect of access stride
We show the effect of the access stride on the cost of context switch in figure 2. In this experiment, each process accesses an array of floating-point numbers in a strided pattern. Suppose the access stride size is s. Starting from the first element, the process accesses every s-th element. Then, starting from the second element, it again accesses every s-th element. It repeats striding until every element of the array has been accessed. We show the array access behavior in the following code.

for (i=0; i<s; i++)
    for (j=i; j<array_size; j=j+s)
        array[j]++;

If s is 1, the access pattern is actually sequential. Each process does a read-modify-write operation on each element of the array. We show results on arrays of size between 32KB and 2MB, which include arrays from all three regions described in section 3.1.

For array sizes of 32KB and 128KB, since the datasets fit into cache, there is not much difference in the cost of context switch when we change the access stride. However, when the datasets do not fit into cache, the cost of context switch increases significantly with the stride size. When the access stride is 8B, the cost ranges between 44.1µs and 183.8µs, with a mean of 116.5µs. When the access stride is 128B, the cost ranges between 133.8µs and 1496.1µs, with a mean of 825.3µs. This substantial difference in the cost of context switch is caused only by the increase in the access stride. In other words, the data access pattern can significantly affect the cost of context switch. The reason is that the stride affects the cost of cache warm-ups in much the same way it affects program running time: for contiguous memory access, hardware prefetching works well, so the cost is relatively lower than in the case of strided access.

3.3 Effect of experimental environment on measurement accuracy
All the above results are measured on a dual-processor machine. In our design, the two communicating processes are bound to the same processor and have the maximum real-time priority. This design aims to avoid interference from background interrupt handling and from other processes in the system. We call it the augmented design for an interfered measurement environment.

We evaluate the effectiveness of this design by comparing the execution time of the two-process communication program in the following experimental settings.

• dual-processor: The program with our augmented design runs in the dual-processor environment as described in our experiment.

• single-processor: The program without our augmented design runs in a single-processor environment that we create on the same machine by disabling multiple-processor support in the Linux kernel.

We run the programs with 3 different array sizes in both settings, 6 times each. Assuming the measured results follow a normal distribution, we report the 90% confidence
intervals [3] for the mean of each test. Generally, when the confidence interval is wide, the results are unstable.

When the machine has nothing else to run and there is no outside interference, we say it is in a quiet environment. Table 1 reports the confidence intervals of six tests in such an environment. We can see that the width of each confidence interval in the dual-processor setting is similar to the width of the corresponding confidence interval in the single-processor setting. This means both settings generate relatively stable results. The confidence interval boundary values for the two settings differ slightly because the Linux kernel scheduler code for the two settings is not the same; remember that single-processor has multiple-processor support disabled in the Linux kernel.

Array size   dual-processor      single-processor
256KB        (242.07, 246.46)    (221.00, 226.35)
384KB        (462.78, 474.43)    (461.80, 474.40)
512KB        (614.68, 629.87)    (614.69, 634.78)

Table 1: Confidence intervals of the execution time in a quiet environment

However, in reality we cannot always guarantee that our experimental environment is free of interference. Thus, we simulate outside interference by sending "ping" packets to the test machine with variable waiting intervals (between 0 and 200 milliseconds). We show the confidence intervals in the interfered environment in table 2. The confidence intervals obtained from single-processor are much wider than the intervals from dual-processor. This shows the instability of the single-processor results in the interfered environment. Comparing the single-processor results in table 2 to the corresponding results in table 1, we can see the measurement inaccuracy of the single-processor setting in the interfered environment.

Array size   dual-processor      single-processor
256KB        (237.45, 245.54)    (220.72, 263.79)
384KB        (459.38, 470.03)    (510.91, 555.67)
512KB        (623.28, 630.77)    (635.75, 683.84)

Table 2: Confidence intervals of the execution time in the presence of external (network) interference

4. SUMMARY
We summarize our observations from the experiment as follows.

• The effect of the access stride on the cost of context switch is significant. The larger the stride is, the larger the cost of context switch is.

• The experimental environment may affect measurement accuracy. Our suggested augmented design can help avoid interference from background interrupt handling and from other processes in the system.

5. ACKNOWLEDGMENTS
We wish to thank Linlin Chen for demonstrating how to use a statistical analysis package and Trishul Chilimbi for pointing out a related paper in ACM TOCS. We thank the anonymous reviewers of the ExpCS workshop and our colleagues at Rochester, in particular Michael Huang and Michael Scott, for their comments on the work and the presentation.

6. REFERENCES
[1] A. Agarwal, J. L. Hennessy, and M. Horowitz. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst., 6(4):393–431, 1988.
[2] R. Fromm and N. Treuhaft. Revisiting the cache interference costs of context switching. http://citeseer.ist.psu.edu/252861.html.
[3] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modeling. John Wiley & Sons, 2001.
[4] L. McVoy and C. Staelin. lmbench: Portable Tools for Performance Analysis. In Proc. of the USENIX Annual Technical Conference, pages 279–294, San Diego, CA, January 1996.
[5] J. C. Mogul and A. Borg. The Effect of Context Switches on Cache Performance. In Proc. of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 75–84, Santa Clara, CA, April 1991.
[6] J. K. Ousterhout. Why Aren't Operating Systems Getting Faster As Fast As Hardware? In Proc. of the USENIX Summer Conference, pages 247–256, Anaheim, CA, June 1990.
[7] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In Proceedings of the International Conference on Supercomputing, pages 1–12, 2001.