IEEE TRANSACTIONS ON EDUCATION, VOL. 39, NO. 4, NOVEMBER 1996
Laboratory Exercises for Practical Performance of Algorithms and Data Structures
S. Mansoor Sarwar, Member, IEEE, Edwin E. Parks, and Syed Aqeel Sarwar
Abstract- This paper addresses the question of theoretical
versus practical performance of algorithms and data structures.
It describes an experiment the first author has been using in
his data structures course to achieve the primary objective of
comparing theoretical behavior of algorithms with their actual
performance. It also presents enhancements in performance evaluation experiments and a general methodology that can be used
to develop experiments in senior or first-year graduate courses
in data structures, algorithm analysis, software engineering, and
operating systems.
I. INTRODUCTION
THE STUDY of data structures, algorithms, and their
performance has been an essential component of the
computer science and engineering curriculum for a long time
[1] and [2]. Analysis of algorithms and abstract data types
(ADT’s), however, is typically limited to their theoretical
behavior. Students are introduced to the concepts of upper
bound (the Big-Oh notation), lower bound (the Big-Omega
notation), and tight bound (the Big-Theta notation), and the
rules used to estimate running times of language constructs
[3], [4]. The students are then asked to estimate worst-case,
best-case, and average-case running times of some well-known
algorithms and small pieces of codes written in a Pascal- or
C-like language.
The algorithmic analysis stops here. The effects of using
different data structures, languages, compilers, memory hierarchies, cache sizes, page sizes, and CPU’s on the performance
of an implementation are completely ignored. From an implementor’s point of view, all these things play key roles in
determining the performance of a piece of software. Another
point that is not considered while estimating the performance
of an algorithm or ADT is its interaction with the memory
management subsystem. Performance of an algorithm that
frequently interacts with the memory management subsystem
cannot be estimated correctly. This is so because the performance of the memory management subsystem is dependent
on memory loading of the computer executing the piece of
software at hand. Therefore, the best way to measure the
performance of an algorithm is to implement it, run it on
different platforms under varying memory loadings by using
Manuscript received November 23, 1994; revised July 29, 1996.
S. M. Sarwar is with the Department of Electrical Engineering, Multnomah
School of Engineering, University of Portland, Portland, OR 97203-5798 USA
(e-mail: sarwar@up.edu).
E. E. Parks is with Ames Research Laboratories, Albany, OR 97321 USA.
S. A. Sarwar is with Academic Computing Laboratories, New York Institute
of Technology, Old Westbury, NY 11568 USA.
Publisher Item Identifier S 0018-9359(96)08931-5.
different compilers, collect actual running times, and find
polynomials that best fit the collected times under different
conditions. Practical aspects and experimentation, however,
are completely overlooked in a traditional course on data
structures and algorithm analysis.
In the remainder of the paper, we describe a laboratory
experiment that is being used by the first author to introduce
practical aspects of performance in his data structures course,
including a typical submittal expected for the experiment.
It also presents a general methodology that can be used to
develop performance evaluation experiments in data structures,
algorithm analysis, operating systems, and parallel computing
courses.
II. SAMPLE LABORATORY EXPERIMENT
The experiment described below is representative of the
many that the first author has been successfully using over the
past several years in his data structures, operating systems, and
network programming courses. The experiment is to measure the
run-time performance of the simple matrix multiply algorithm
on different platforms and compare it with its theoretical
behavior.
The students are asked to do the following.
1) Implement the simple matrix multiply algorithm in the
C language on the following platforms available in the
Engineering Computer Laboratory:
a) an i80486DX66-based PC;
b) a Sun SPARC workstation;
c) a 12-processor supermini Sequent Symmetry.
2) Execute the implementation for two N x N matrices
of integers, for N = 0, 25, 50, ..., 400, and collect
the execution times for all runs for nonoptimized and
optimized codes generated by the compilers at hand.
Repeat each run at least three times, every time with
a new matrix, and compute the average of the three. Do
not charge the time for data generation to the algorithm.
(The matrix and step sizes can be varied according to
the computing facilities available in the department.)
3) Document running times in the form of tables as well
as plots, showing performance of the algorithm on different platforms for optimized as well as nonoptimized
versions.
4) The running time, T(N), for the algorithm is O(N^3),
i.e., T(N) <= KN^3 for two constants K and N_0, when
N >= N_0. Compute the value of K for every platform
for both optimized and nonoptimized codes.
Fig. 1. The process of collecting running times: a driver compiles the
source file, executes it, and collects the running times into an output file.
5) Analyze the results and give the relative performance of the
algorithm on all platforms. Make comments such as,
"the implementation is T times faster on machine X
as compared to its execution on machine Y" and "the
optimized code on machine X is T times faster than its
nonoptimized counterpart."
6) Make any other comments on the experiment.
III. REQUIREMENTS ANALYSIS
A careful analysis of the above requirements will identify
a number of software resources that are needed to effectively
perform the exercises specified in the experiment.
1) The implementation will use three two-dimensional arrays,
two for storing the input matrices and one to store the
final result. The C code would look like:

   for (i = 0; i < MAX; i++)
       for (j = 0; j < MAX; j++)
           for (k = 0; k < MAX; k++)
               Result[i][j] += Matrix1[i][k] * Matrix2[k][j];

2) The experiment requires the ability to randomly generate
integer matrices. This can be done by using a random
number generator available on the host operating system.
In a UNIX environment, one may use the library routines
rand() or random(). Studies have shown that random()
produces better random numbers than rand() [5]-[7].
3) The experiment requires determining the average execution
times of the algorithm for various matrix sizes on
different platforms. This can be accomplished by either
one of the following methods:
   a) by using a start-stop clock that can be started,
   stopped, and displayed, and is typically available as
   a library call;
   b) by using a shell command that displays the time
   taken by the execution of a program run.
Regardless of the method used, one must make sure
that the time displayed is for execution of the given
piece of code only. In the UNIX operating system, for
example, one can use the shell command time or the
library routine clock(). Although the time command
gives us less precision than the SunOS library routine
clock(), we prefer it because it enables us to collect
the true running time of a program (the user time) as
opposed to the total CPU time during a program run,
which includes the time for system activities as well.
The same objective can be achieved by using the system call
getrusage(), but time is easier to use with a driver. Every
program run is repeated several times, each time with
newly generated data, and the average of all times is used
to compute the constant K.
4) Since the exercise explicitly says that the data generation
time should not be charged to the execution time of the
algorithm, a mechanism has to be devised to separate
the two times. This can be accomplished in one of
two ways:
   a) Write a driver that uses a random number generator
   to generate two matrices and make them the initial
   values of two 2-D arrays, to be used as input, in
   a header (.h) file. Include the header file in the
   C source file before its compilation. By doing so,
   we eliminate the time needed for constructing the input
   matrices, making time measurements more reliable.
   b) Make data generation a part of the implementation
   of the matrix multiply algorithm. Measure the execution
   time for a given matrix; call this time T_total.
   Delete the multiplication part of the code in the
   source file and measure the time for data generation
   only; call this time T_generation. The time for matrix
   multiply is then (T_total - T_generation).
Although both can be automated with equal ease,
method a) is preferred because it is more accurate and
cleaner.
The process described in Steps 1)-4) can be automated as
shown in Fig. 1. We have been using this process for a number
of experiments on measuring practical behavior of algorithms
and data structures.
IV. PROCEDURES, RESULTS, AND DISCUSSION
In this section, we describe a typical submittal expected for
the exercise outlined in the previous section.
The program for matrix multiplication was executed three
times on each platform. The algorithm was also run three
times with optimized code. Every run used newly generated
input matrices of random integers. Table I shows a sample
of running times for different platforms for optimized and
nonoptimized codes. Figs. 2 and 3 show relative performance
of the algorithm on different platforms. As can be seen, the
performance curves for Sun SPARC and PC for nonoptimized
codes are smooth, but the one for Sequent Symmetry is not.
Of particular interest is the turbulence in the performance
curve for Sequent for values of N around 300 in Fig. 2 and
abnormal behavior for values of N between 225-275 and
between 325-350 in Fig. 3. This behavior cannot be explained
and is currently under study.
TABLE I
SAMPLE EXECUTION TIMES ON DIFFERENT PLATFORMS

TABLE II
VALUES OF K FOR DIFFERENT PLATFORMS

Platform                          Value of K (Non-Optimized)    Value of K (Optimized)
i80486DX66-Based PC               1.52 x 10^-6                  0.59 x 10^-6
Sun SPARC Workstation             4.28 x 10^-6                  1.90 x 10^-6
12-Processor Sequent Symmetry     24.30 x 10^-6                 3.05 x 10^-6
Fig. 2. Running times of the nonoptimized matrix multiply code for the
i80486DX66-based PC, Sun SPARC workstation, and 12-processor Sequent
Symmetry S81 super minicomputer.

Fig. 3. Running times of the optimized matrix multiply code for the
i80486DX66-based PC, Sun SPARC workstation, and 12-processor Sequent
Symmetry S81 super minicomputer.
Table II shows values of the execution time constant, K, for
optimized and nonoptimized codes for the three platforms. The
values of K were calculated by using the equation T(N) =
KN^3 for all data points for a given platform and taking the
average of these values. One could obviously take the maximum
of all the values to get K. Accuracy of the values of K is
less for smaller values of T(N), however, as compared to their
accuracy for larger values of T(N). We therefore computed the
average of the more accurate values of K. It is clear from this
table that optimized code for the algorithm runs 2.58, 2.25, and
7.97 times faster than the nonoptimized code on the i80486DX66,
Sun SPARC, and Sequent Symmetry, respectively. This means
that if executable code is optimized, a tremendous amount of
system hardware/software resources, including processor time,
memory, and operating system data structures, can be saved.

TABLE III
RATIO OF RUNNING TIMES FOR DIFFERENT PLATFORMS
Table I11 summarizes relative performance of the algorithm
on different platforms. Figs. 2 and 3 and Table I11 clearly
show that 486-based PC outperforms both the Sun SPARC
workstation and the 12-processor Sequent Symmetry and that
Sequent has the worst performance. Although performance
of Sun SPARC is surprising, that of Sequent should not
be because it uses 20-MHz 80386 processor boards. So,
performance of the Sequent should be close to that of a
20-MHz 80386-based system. The multiprocessor configuration
of Sequent is of no help in this situation because the code
is inherently sequential and cannot utilize more than one
processor at a time. The performance of Sun SPARC is worse
zyxwvutsrqponmlkj
SARWAR et al.: LABORATORY EXERCISES FOR PRACTICAL PERFORMANCE OF ALGORITHMS AND DATA
529
TABLE IV
CONFIGURATIONS OF THE PLATFORMS

Resource Type      | i80486DX66-Based PC        | Sun SPARC Workstation          | Sequent Symmetry S81
CPU                | i80486DX66, internal math  | SPARC IPX 4/50, internal math  | six i386DX20-based dual-processor
                   | coprocessor                | coprocessor, 40-MHz clock      | boards, i80386DX math coprocessor
Main Memory        | 6 Mbytes                   | 64 Mbytes                      |
Cache Memory       | 8 Kbytes internal          | 256 Kbytes external            |
System Bus Type    | ISA 16-bit, 8-MHz clock    | S Bus                          |
Operating System   | DOS 6.2                    | SunOS v4.1.3b                  | Dynix/ptx v2.0.4
than the 486-based PC because SPARC hardware is optimized
for floating-point arithmetic and performs integer operations
poorly.
The compilers used for the experiment are Borland C++
version 3.1 for the PC and GNU C compiler gcc version
2.4 for the two UNIX platforms. The optimized codes were
generated with appropriate optimization options (flags) for the
two compilers. Optimization details can be found in the manual
pages for the compilers. Table IV gives configurations of the
three platforms.
V. OUTCOME ASSESSMENT
In order to assess the impact of these exercises, the first
author interviewed graduating and graduated students. As
perceived by these students, the exercises have been highly
educational. In this section, we summarize student
evaluations of the experiments and narrate some anecdotal
accounts.
All of the students who have performed these experiments
over the years have been very positive about their experiences.
A common feature of student evaluations of the experiments
has been that students enjoy the experiments. Students find
practical performance evaluation of the algorithms, data structures, and operating system primitives they learn throughout
the course of their education interesting. They feel that doing
this gives them insight into the practical aspects of computing
by using real machines. “This was a very fun exercise” and “I
very much enjoyed doing this lab” are the kind of comments
the lab exercises have received. In 1994, one student wrote
about the matrix multiply laboratory:
It is interesting to see different execution times for the
different processor platforms. Some of the results were
surprising. I expected Sun workstations to perform the
best, but they ended up in the middle. From the results
of the experiment, I can only imagine what the future
of computing will be.
Another wrote, “This experiment shows the importance of
optimized code. It shows how the true computing power of a
processor is not realized until running optimized code.”
Another observation made by students is that performance
of a parallel or distributed version of an algorithm does
not improve linearly with increase in number of processors.
They, for example, find out that process management, message
passing overheads, and fundamental changes in the sequential
version of an algorithm contribute significantly toward the
practical performance of its parallel (or distributed) version,
thereby making the parallel version useful for very large size
inputs only. These conclusions are drawn by students after
observing the run-time performance of parallel and distributed
versions of matrix multiply and Quicksort in the Senior
Laboratory course.
Students also observe that although some of the algorithms
found in the literature may be mathematically elegant and
have better theoretical performance than some well-known
algorithms, they are not very practical because they may have
limited application and coding them may not be a trivial task
either. The difficult implementation of Strassen's Algorithm [8]
and its poor performance against the simple matrix multiply
algorithm is one example that students experience. One student
summarized his experience of this experiment:
When compared with the simple matrix multiply algorithm, Strassen’s Algorithm is much more difficult to
code, needs much more memory space to execute, and
generally exhibits poor performance for small to medium
size matrices.
Another wrote:
Large matrix sizes favor Strassen’s Algorithm, but there
are a lot of restrictions. Memory requirement is a big
problem. The matrix size is also restricted to 2^k. This
algorithm can be used to solve some specific problems,
but cannot be used in general-purpose mathematical
applications.
Students also tend to have more appreciation of the computing environment they work in after finishing the experiments.
They, for example, understand that in a networked environment a machine running the file server software is inconsistent
and slower on the average because of its file-serving duties
and, therefore, is not a good choice for normal day-to-day use.
Probably the most important achievement of the experiments is that students start practicing experimental algorithmics [8], [9] in their professional careers. One student noted,
“had I not performed these experiments, I would have always
chosen Quicksort, with median of three for finding the pivot
element, for sorting internal arrays. Not anymore. I know now
that theoretical bounds are not always good enough metrics
for choosing an algorithm in practice; experimentation is a
must if a piece of code will be used hundreds or thousands
of times.” Another student, who did a summer internship at
a local software house, said that he convinced his boss that
although an algorithm that the student had proposed for string
matching had worse upper bound than some known algorithms,
it would perform better for inputs that their software product
would deal with.
VI. ENHANCEMENTS AND GENERAL METHODOLOGY
The process outlined for the experiment discussed in the
paper can be used in any laboratory exercise involving performance analysis of algorithms, abstract data types, and
interprocess communication primitives. The first author has
successfully used a number of such exercises in his data structures, operating systems, and network programming courses.
The exercise described in this paper can be extended to
include performance of the algorithm on different platforms
for floating-point numbers. By doing this exercise, students
will get to use a floating-point number generator, or combine
a random integer with a random real number (typically real
number generators generate numbers between zero and one) to
create a floating-point number > 1. The relative performance
of platforms for matrices of floating-point numbers may be
different from their performance for integer matrices, and it
would be interesting to analyze the system components and
parameters (CPU speed, cache size, page size, main memory
size, availability of the math coprocessor, etc.) that dictate
performance in both cases.
Care must be taken while measuring the performance of
internal algorithms, that is, algorithms that do not involve any
I/O; for example, sorting algorithms like Quicksort and Shellsort. In order to collect running times for these algorithms, they
should be coded such that the code does not include any kind
of I/O. This is to be done to mimic the internal nature of these
algorithms. The inclusion of any kind of I/O would corrupt
run-time data nondeterministically if the implementation is
executed on a machine that runs under a multiprocess timesharing operating system, like UNIX. This is due to waiting
times of processes in different operating system queues, as
these times depend on the system load in terms of the number
of processes running in the system. The paging time can be
eliminated by minimizing memory loading. If the executable
code, including input data, is smaller than the main storage
available for executing user programs and the machine is not
running any other user process, there is no reason for the
paging traffic to arise due to an algorithm itself.
In order to have students appreciate physical machine effects,
including effects of memory loading (cache misses, page
faults, and disk I/O), they can be asked to change the nesting
of the iteration variables in the matrix multiply algorithm.
This would show them the adverse effects of manipulating
matrices in column-major order in languages that store them
in row-major order (e.g., C, C++, Pascal), or vice versa
(e.g., FORTRAN). Depending on the page size used by a
machine, matrix size, and the number of frames allocated to
the executing program, it may cause one page fault for every
matrix element accessed. For such an experiment, students
can either do the running-time comparison as outlined in this
paper or enhance their study by monitoring cache misses and
number of page faults as a function of the matrix size. The
latter can be accomplished if the system under study has a tool
to monitor cache misses and page faults. Contemporary CPU’s,
like Intel’s Pentium, have counters that enable measurements
like instruction counts and on-chip cache miss counts. These
enhancements to the experiment are only recommended if
students have ample background in computer architecture and
operating systems. Senior or first-year graduate students will
benefit most from these experiments.
If exercises are to be given in senior or first-year graduate
courses, students should be asked to find polynomials that
best fit the run-time data. By doing so, students can confirm
that practical running times of algorithms are limited by
their theoretical upper bounds. In reality, they will find tight
bounds for the algorithms they study. The data can be fitted
by using any one of a number of math tools available,
like Mathematica. Preferably, students should be asked to
use nonlinear regression for fitting the data, although linear
regression can be used as well. Students should also be asked
to show the goodness of their fits by computing percentage
errors, and plotting actual data and corresponding polynomials
side by side.
VII. SUMMARY AND FINAL COMMENTS
The paper outlined a methodology that can be used to
develop performance evaluation experiments in data structures,
algorithm analysis, operating systems, software engineering,
and other such courses. The paper also described one such
experiment that is being used by the first author in his data
structures course. The experiment outlined is for measuring
practical performance of the simple matrix multiply algorithm
on various platforms for optimized and nonoptimized codes.
Also discussed in the paper is a typical submittal expected
of a student for the experiment. The paper ends with an
outcome assessment of the experiments that have been used
by one of the authors for several years and suggestions for
enhancing these experiments for use in senior or first-year
graduate courses.
ACKNOWLEDGMENT
The authors sincerely thank the anonymous reviewers for
their comments and suggestions that greatly improved the
quality of the paper.
REFERENCES
[1] ACM Curriculum Committee on Computer Science, "Curriculum '78-recommendations for the undergraduate program in computer science," Comm. ACM, vol. 22, no. 3, pp. 147-166, 1979.
[2] A. B. Tucker et al., Computing Curricula 1991: Report of the ACM/IEEE-CS Joint Curriculum Task Force. Los Alamitos, CA: IEEE Computer Soc. Press, 1991.
[3] A. Aho, J. Hopcroft, and J. D. Ullman, Data Structures and Algorithms. Reading, MA: Addison-Wesley, 1983.
[4] M. A. Weiss, Data Structures and Algorithm Analysis in C. Reading, MA: Addison-Wesley, 1993.
[5] S. M. Sarwar, M. H. Jaragh, and M. Wind, "An empirical study of the run-time behavior of quicksort, shellsort, and mergesort for medium to large size data," Comp. Languages, vol. 20, no. 2, pp. 127-134, 1994.
[6] S. M. Sarwar, M. H. A. Jaragh, S. A. Sarwar, and J. Brandeburg, "Engineering quicksort," Comp. Languages, to be published.
[7] M. A. Weiss, "Empirical study of the expected running time of shellsort," The Comp. J., vol. 34, no. 1, pp. 88-91, 1991.
[8] B. M. E. Moret and H. D. Shapiro, "An empirical analysis of algorithms for constructing a minimum cost spanning tree," Lecture Notes in Comp. Sci., no. 519, p. 411, 1990.
[9] B. M. E. Moret and H. D. Shapiro, "How to find a minimum cost spanning tree in practice," Lecture Notes in Comp. Sci., no. 555, pp. 192-203, 1991.
Edwin E. Parks graduated magna cum laude in electrical engineering from
the University of Portland, OR, in 1994.
He is a Lead Engineer at the Ames Research Laboratories in Albany, OR,
where he has worked on various GPS navigation hardware and software
projects. His current professional interests are in GPS navigation, image
processing, and computer-based system design.
S. Mansoor Sarwar (S’82-M’91) received the undergraduate degree in
electrical engineering from UET, Lahore, Pakistan, in 1981, and the M.S. and
Ph.D. degrees in computer engineering from Iowa State University, Ames, in
1985 and 1988, respectively.
He has taught at UET and Kuwait University and is currently an Associate
Professor of Electrical Engineering at the University of Portland, OR. His
current teaching and research interests are in operating systems, parallel and
distributed computing, software engineering, experimental algorithmics, and
engineering education.
Syed Aqeel Sarwar received the undergraduate degree in computer science
from Iowa State University, Ames, in 1988, and the M.S. degree in computer
science from the New York Institute of Technology (NYIT) in 1992.
He is currently with the Academic Computing Laboratories at the Old
Westbury campus of NYIT. His professional interests are in databases, 4GLs,
and computer networks.