IEEE TRANSACTIONS ON EDUCATION, VOL. 39, NO. 4, NOVEMBER 1996
Laboratory Exercises for Practical Performance of Algorithms and Data Structures
S. Mansoor Sarwar, Member, IEEE, Edwin E. Parks, and Syed Aqeel Sarwar
Abstract- This paper addresses the question of theoretical
versus practical performance of algorithms and data structures.
It describes an experiment the first author has been using in
his data structures course to achieve the primary objective of
comparing theoretical behavior of algorithms with their actual
performance. It also presents enhancements in performance evaluation experiments and a general methodology that can be used
to develop experiments in senior or first-year graduate courses
in data structures, algorithm analysis, software engineering, and
operating systems.
I. INTRODUCTION
THE STUDY of data structures, algorithms, and their
performance has been an essential component of the
computer science and engineering curriculum for a long time
[1] and [2]. Analysis of algorithms and abstract data types
(ADT’s), however, is typically limited to their theoretical
behavior. Students are introduced to the concepts of upper
bound (the Big-Oh notation), lower bound (the Big-Omega
notation), and tight bound (the Big-Theta notation), and the
rules used to estimate running times of language constructs
[3], [4]. The students are then asked to estimate worst-case,
best-case, and average-case running times of some well-known
algorithms and small pieces of codes written in a Pascal- or
C-like language.
The algorithmic analysis stops here. The effects of using
different data structures, languages, compilers, memory hierarchies, cache sizes, page sizes, and CPU’s on the performance
of an implementation are completely ignored. From an implementor’s point of view, all these things play key roles in
determining the performance of a piece of software. Another
point that is not considered while estimating the performance
of an algorithm or ADT is its interaction with the memory
management subsystem. Performance of an algorithm that
frequently interacts with the memory management subsystem
cannot be estimated correctly. This is so because the performance of the memory management subsystem is dependent
on memory loading of the computer executing the piece of
software at hand. Therefore, the best way to measure the
performance of an algorithm is to implement it, run it on
different platforms under varying memory loadings by using
Manuscript received November 23, 1994; revised July 29, 1996.
S. M. Sarwar is with the Department of Electrical Engineering, Multnomah
School of Engineering, University of Portland, Portland, OR 97203-5798 USA
(e-mail: sarwar@up.edu).
E. E. Parks is with Ames Research Laboratories, Albany, OR 97321 USA.
S. A. Sarwar is with Academic Computing Laboratories, New York Institute
of Technology, Old Westbury, NY 11568 USA.
Publisher Item Identifier S 0018-9359(96)08931-5.
different compilers, collect actual running times, and find
polynomials that best fit the collected times under different
conditions. Practical aspects and experimentation, however,
are completely overlooked in a traditional course on data
structures and algorithm analysis.
In the remainder of the paper, we describe a laboratory
experiment that is being used by the first author to introduce
practical aspects of performance in his data structures course,
including a typical submittal expected for the experiment.
It also presents a general methodology that can be used to
develop performance evaluation experiments in data structures,
algorithm analysis, operating systems, and parallel computing
courses.
II. SAMPLE LABORATORY EXPERIMENT
The experiment described below is representative of the
many that the first author has been successfully using over the
past several years in his data structures, operating systems, and
network programming courses. The experiment is to measure the
run-time performance of the simple matrix multiply algorithm
on different platforms and compare it with its theoretical
behavior.
The students are asked to do the following.
1) Implement the simple matrix multiply algorithm in the
C language on the following platforms available in the
Engineering Computer Laboratory:
a) an i80486DX66-based PC;
b) a Sun SPARC workstation;
c) a 12-processor supermini Sequent Symmetry.
2) Execute the implementation for two N x N matrices
of integers, for N = 0, 25, 50, ..., 400, and collect
the execution times for all runs for nonoptimized and
optimized codes generated by the compilers at hand.
Repeat each run at least three times, every time with
a new matrix, and compute the average of the three. Do
not charge the time for data generation to the algorithm.
(The matrix and step sizes can be varied according to
the computing facilities available in the department.)
3) Document running times in the form of tables as well
as plots, showing performance of the algorithm on different platforms for optimized as well as nonoptimized
versions.
4) The running time, T(N), for the algorithm is O(N^3),
i.e., T(N) <= KN^3 for two constants K and N_0, when
N >= N_0. Compute the value of K for every platform
for both optimized and nonoptimized codes.
Fig. 1. The process of collecting running times: a driver compiles the
source file, executes it, and collects the running times into an output file.
5) Analyze the results and give the relative performance of the
algorithm on all platforms. Make comments such as,
"the implementation is T times faster on machine X
as compared to its execution on machine Y" and "the
optimized code on machine X is T times faster than its
nonoptimized counterpart."
6) Make any other comments on the experiment.
III. REQUIREMENTS ANALYSIS
A careful analysis of the above requirements will identify
a number of software resources that are needed to effectively
perform the exercises specified in the experiment.
1) The implementation will use three two-dimensional arrays,
two for storing the input matrices and one to store the
final result. The C code would look like:

   for (i = 0; i < MAX; i++)
       for (j = 0; j < MAX; j++)
           for (k = 0; k < MAX; k++)
               Result[i][j] += Matrix1[i][k] * Matrix2[k][j];

2) The experiment requires the ability to randomly generate
integer matrices. This can be done by using a random
number generator available on the host operating system.
In a UNIX environment, one may use the library routines
rand() or random(). Studies have shown that random()
produces better random numbers than rand() [5]-[7].
3) The experiment requires determining the average execution
times of the algorithm for various matrix sizes on
different platforms. This can be accomplished by either
one of the following methods:
   a) by using a start-stop clock that can be started,
   stopped, and displayed, and is typically available as
   a library call;
   b) by using a shell command that displays the time
   taken by the execution of a program run.
Regardless of the method used, one must make sure
that the time displayed is for execution of the given
piece of code only. In the UNIX operating system, for
example, one can use the shell command time or the
library routine clock(). Although the time command
gives us less precision than the SunOS library routine
clock(), we prefer it because it enables us to collect
the true running time of a program (the user time) as
opposed to the total CPU time during a program run,
which includes the time for system activities as well.
The same objective can be achieved by using the system call
getrusage(), but time is easier to use with a driver. Every
program run is repeated several times, each time with
newly generated data, and the average of all times is used
to compute the constant K.
4) Since the exercise explicitly says that the data generation
time should not be charged to the execution time of the
algorithm, a mechanism has to be devised to separate
the two times. This can be accomplished in one of
two ways:
   a) Write a driver that uses a random number generator
   to generate two matrices and make them the initial
   values of two 2-D arrays, to be used as input, in
   a header (.h) file. Include the header file in the
   C source file before its compilation. By doing so,
   we eliminate the time needed for constructing the input
   matrices, making time measurements more reliable.
   b) Make data generation a part of the implementation
   of the matrix multiply algorithm. Measure the execution
   time for a given matrix; call this time T_total.
   Delete the multiplication part of the code in the
   source file and measure the time for data generation
   only; call this time T_generation. The time for matrix
   multiply is then (T_total - T_generation).
Although both can be automated with equal ease,
method a) is preferred because it is more accurate and
cleaner.
The process described in Steps 1)-4) can be automated as
shown in Fig. 1. We have been using this process for a number
of experiments on measuring practical behavior of algorithms
and data structures.
IV. PROCEDURES, RESULTS, AND DISCUSSION
In this section, we describe a typical submittal expected for
the exercise outlined in the previous section.
The program for matrix multiplication was executed three
times on each platform. The algorithm was also run three
times with optimized code. Every run used newly generated
input matrices of random integers. Table I shows a sample
of running times for different platforms for optimized and
nonoptimized codes. Figs. 2 and 3 show relative performance
of the algorithm on different platforms. As can be seen, the
performance curves for Sun SPARC and PC for nonoptimized
codes are smooth, but the one for Sequent Symmetry is not.
Of particular interest is the turbulence in the performance
curve for Sequent for values of N around 300 in Fig. 2 and
abnormal behavior for values of N between 225-275 and
between 325-350 in Fig. 3. This behavior cannot be explained
and is currently under study.
TABLE I
SAMPLE EXECUTION TIMES ON DIFFERENT PLATFORMS

TABLE II
VALUES OF K FOR DIFFERENT PLATFORMS

Platform                          Value of K (Non-Optimized)    Value of K (Optimized)
i80486DX66-Based PC               1.52 x 10^-6                  0.59 x 10^-6
Sun SPARC Workstation             4.28 x 10^-6                  1.90 x 10^-6
12-Processor Sequent Symmetry     24.30 x 10^-6                 3.05 x 10^-6
Fig. 2. Running times of the nonoptimized matrix multiply code for the
i80486DX66-based PC, Sun SPARC workstation, and 12-processor Sequent
Symmetry S81 super minicomputer.

Fig. 3. Running times of the optimized matrix multiply code for the
i80486DX66-based PC, Sun SPARC workstation, and 12-processor Sequent
Symmetry S81 super minicomputer.
Table II shows values of the execution time constant, K, for
optimized and nonoptimized codes for the three platforms. The
values of K were calculated by using the equation T(N) =
KN^3 for all data points for a given platform and taking the
average of these values. One could obviously take the maximum
of all the values to get K. Accuracy of the values of K is
less for smaller values of T(N), however, as compared to their
accuracy for larger values of T(N). We therefore computed the
average of the more accurate values of K. It is clear from this
table that optimized code for the algorithm runs 2.58, 2.25, and
7.97 times faster than the nonoptimized code on the i80486DX66,
Sun SPARC, and Sequent Symmetry, respectively. This means
that if executable code is optimized, a tremendous amount of
system hardware/software resources, including processor time,
memory, and operating system data structures, can be saved.

TABLE III
RATIO OF RUNNING TIMES FOR DIFFERENT PLATFORMS
Table I11 summarizes relative performance of the algorithm
on different platforms. Figs. 2 and 3 and Table I11 clearly
show that 486-based PC outperforms both the Sun SPARC
workstation and the 12-processor Sequent Symmetry and that
Sequent has the worst performance. Although performance
of Sun SPARC is surprising, that of Sequent should not
be because it uses 20-MHz 80386 processor boards. So,
performance of the Sequent should be close to that of a
20-MHz 80386-based system. The multiprocessor configuration
of Sequent is of no help in this situation because the code
is inherently sequential and cannot utilize more than one
processor at a time. The performance of Sun SPARC is worse
zyxwvutsrqponmlkj
SARWAR et al.: LABORATORY EXERCISES FOR PRACTICAL PERFORMANCE OF ALGORITHMS AND DATA
529
TABLE IV
CONFIGURATIONS OF THE PLATFORMS

Resource Type      | i80486DX66-Based PC        | Sun SPARC Workstation          | Sequent Symmetry S81
CPU                | i80486DX66, internal math  | SPARC IPX 4/50, internal math  | six i386DX20-based dual-processor
                   | coprocessor                | coprocessor, 40-MHz clock      | boards, i80386DX math coprocessor
Main Memory        | 6 Mbytes                   | 64 Mbytes                      |
Cache Memory       | 8 Kbytes internal          | 256 Kbytes external            |
System Bus Type    | ISA 16-bit, 8-MHz clock    | S Bus                          |
Operating System   | DOS 6.2                    | SunOS v4.1.3b                  | Dynix/ptx v2.0.4
than the 486-based PC because SPARC hardware is optimized
for floating-point arithmetic and performs integer operations
poorly.
The compilers used for the experiment are Borland C++
version 3.1 for the PC and GNU C compiler gcc version
2.4 for the two UNIX platforms. The optimized codes were
generated with appropriate optimization options (flags) for the
two compilers. Optimization details can be found in the manual
pages for the compilers. Table IV gives configurations of the
three platforms.
V. OUTCOME ASSESSMENT
In order to assess the impact of these exercises, the first
author interviewed graduating and graduated students. As
perceived by these students, the exercises have been highly
educational. In this section, we summarize student
evaluations of the experiments and narrate some anecdotal
accounts.
All of the students who have performed these experiments
over the years have been very positive about their experiences.
A common feature of student evaluations of the experiments
has been that students enjoy the experiments. Students find
practical performance evaluation of the algorithms, data structures, and operating system primitives they learn throughout
the course of their education interesting. They feel that doing
this gives them insight into the practical aspects of computing
by using real machines. “This was a very fun exercise” and “I
very much enjoyed doing this lab” are the kind of comments
the lab exercises have received. In 1994, one student wrote
about the matrix multiply laboratory:
It is interesting to see different execution times for the
different processor platforms. Some of the results were
surprising. I expected Sun workstations to perform the
best, but they ended up in the middle. From the results
of the experiment, I can only imagine what the future
of computing will be.
Another wrote, “This experiment shows the importance of
optimized code. It shows how the true computing power of a
processor is not realized until running optimized code.”
Another observation made by students is that performance
of a parallel or distributed version of an algorithm does
not improve linearly with increase in number of processors.
They, for example, find out that process management, message
passing overheads, and fundamental changes in the sequential
version of an algorithm contribute significantly toward the
practical performance of its parallel (or distributed) version,
thereby making the parallel version useful for very large size
inputs only. These conclusions are drawn by students after
observing the run-time performance of parallel and distributed
versions of matrix multiply and Quicksort in the Senior
Laboratory course.
Students also observe that although some of the algorithms
found in the literature may be mathematically elegant and
have better theoretical performance than some well-known
algorithms, they are not very practical because they may have
limited application and coding them may not be a trivial task
either. The difficult implementation of Strassen's Algorithm [8]
and its poor performance against the simple matrix multiply
algorithm is one example that students experience. One student
summarized his experience of this experiment:
When compared with the simple matrix multiply algorithm, Strassen’s Algorithm is much more difficult to
code, needs much more memory space to execute, and
generally exhibits poor performance for small to medium
size matrices.
Another wrote:
Large matrix sizes favor Strassen’s Algorithm, but there
are a lot of restrictions. Memory requirement is a big
problem. The matrix size is also restricted to 2^k. This
algorithm can be used to solve some specific problems,
but cannot be used in general-purpose mathematical
applications.
Students also tend to have more appreciation of the computing environment they work in after finishing the experiments.
They, for example, understand that in a networked environment a machine running the file server software is inconsistent
and slower on the average because of its file-serving duties
and, therefore, is not a good choice for normal day-to-day use.
Probably the most important achievement of the experiments is that students start practicing experimental algorithmics [8], [9] in their professional careers. One student noted,
“had I not performed these experiments, I would have always
chosen Quicksort, with median of three for finding the pivot
element, for sorting internal arrays. Not anymore. I know now
that theoretical bounds are not always good enough metrics
for choosing an algorithm in practice; experimentation is a
must if a piece of code will be used hundreds or thousands
of times.” Another student, who did a summer internship at
a local software house, said that he convinced his boss that
although an algorithm that the student had proposed for string
matching had worse upper bound than some known algorithms,
it would perform better for inputs that their software product
would deal with.
VI. ENHANCEMENTS AND GENERAL METHODOLOGY
The process outlined for the experiment discussed in the
paper can be used in any laboratory exercise involving performance analysis of algorithms, abstract data types, and
interprocess communication primitives. The first author has
successfully used a number of such exercises in his data structures, operating systems, and network programming courses.
The exercise described in this paper can be extended to
include performance of the algorithm on different platforms
for floating-point numbers. By doing this exercise, students
will get to use a floating-point number generator, or combine
a random integer with a random real number (typically real
number generators generate numbers between zero and one) to
create a floating-point number > 1. The relative performance
of platforms for matrices of floating-point numbers may be
different from their performance for integer matrices, and it
would be interesting to analyze the system components and
parameters (CPU speed, cache size, page size, main memory
size, availability of the math coprocessor, etc.) that dictate
performance in both cases.
Care must be taken while measuring the performance of
internal algorithms, that is, algorithms that do not involve any
I/O; for example, sorting algorithms like Quicksort and Shellsort. In order to collect running times for these algorithms, they
should be coded such that the code does not include any kind
of I/O. This is to be done to mimic the internal nature of these
algorithms. The inclusion of any kind of I/O would corrupt
run-time data nondeterministically if the implementation is
executed on a machine that runs under a multiprocess timesharing operating system, like UNIX. This is due to waiting
times of processes in different operating system queues, as
these times depend on the system load in terms of the number
of processes running in the system. The paging time can be
eliminated by minimizing memory loading. If the executable
code, including input data, is smaller than the main storage
available for executing user programs and the machine is not
running any other user process, there is no reason for the
paging traffic to arise due to an algorithm itself.
In order to have students appreciate physical machine effects,
including effects of memory loading (cache misses, page
faults, and disk I/O), they can be asked to change the nesting
of the iteration variables in the matrix multiply algorithm.
This would show them the adverse effects of manipulating
matrices in column-major order in languages that store them
in row-major order (e.g., C, C++, Pascal), or vice versa
(e.g., FORTRAN). Depending on the page size used by a
machine, matrix size, and the number of frames allocated to
the executing program, it may cause one page fault for every
matrix element accessed. For such an experiment, students
can either do the running-time comparison as outlined in this
paper or enhance their study by monitoring cache misses and
number of page faults as a function of the matrix size. The
latter can be accomplished if the system under study has a tool
to monitor cache misses and page faults. Contemporary CPU’s,
like Intel’s Pentium, have counters that enable measurements
like instruction counts and on-chip cache miss counts. These
enhancements to the experiment are only recommended if
students have ample background in computer architecture and
operating systems. Senior or first-year graduate students will
benefit most from these experiments.
If exercises are to be given in senior or first-year graduate
courses, students should be asked to find polynomials that
best fit the run-time data. By doing so, students can confirm
that practical running times of algorithms are limited by
their theoretical upper bounds. In reality, they will find tight
bounds for the algorithms they study. The data can be fitted
by using any one of a number of math tools available,
like Mathematica. Preferably, students should be asked to
use nonlinear regression for fitting the data, although linear
regression can be used as well. Students should also be asked
to show the goodness of their fits by computing percentage
errors, and plotting actual data and corresponding polynomials
side by side.
VII. SUMMARY AND FINAL COMMENTS
The paper outlined a methodology that can be used to
develop performance evaluation experiments in data structures,
algorithm analysis, operating systems, software engineering,
and other such courses. The paper also described one such
experiment that is being used by the first author in his data
structures course. The experiment outlined is for measuring
practical performance of the simple matrix multiply algorithm
on various platforms for optimized and nonoptimized codes.
Also discussed in the paper is a typical submittal expected
of a student for the experiment. The paper ends with an
outcome assessment of the experiments that have been used
by one of the authors for several years and suggestions for
enhancing these experiments for use in senior or first-year
graduate courses.
ACKNOWLEDGMENT
The authors sincerely thank the anonymous reviewers for
their comments and suggestions that greatly improved the
quality of the paper.
REFERENCES
[1] ACM Curriculum Committee on Computer Science, "Curriculum '78-recommendations for the undergraduate program in computer science," Comm. ACM, vol. 22, no. 3, pp. 147-166, 1979.
[2] A. B. Tucker et al., Computing Curricula 1991: Report of the ACM/IEEE-CS Joint Curriculum Task Force. Los Alamitos, CA: IEEE Computer Soc. Press, 1991.
[3] A. Aho, J. Hopcroft, and J. D. Ullman, Data Structures and Algorithms. Reading, MA: Addison-Wesley, 1983.
[4] M. A. Weiss, Data Structures and Algorithm Analysis in C. Reading, MA: Addison-Wesley, 1993.
[5] S. M. Sarwar, M. H. Jaragh, and M. Wind, "An empirical study of the run-time behavior of quicksort, shellsort, and mergesort for medium to large size data," Comp. Languages, vol. 20, no. 2, pp. 127-134, 1994.
[6] S. M. Sarwar, M. H. A. Jaragh, S. A. Sarwar, and J. Brandeburg, "Engineering quicksort," Comp. Languages, to be published.
[7] M. A. Weiss, "Empirical study of the expected running time of shellsort," The Comp. J., vol. 34, no. 1, pp. 88-91, 1991.
[8] B. M. E. Moret and H. D. Shapiro, "An empirical analysis of algorithms for constructing a minimum cost spanning tree," Lecture Notes in Comp. Sci., no. 519, p. 411, 1990.
[9] B. M. E. Moret and H. D. Shapiro, "How to find a minimum cost spanning tree in practice," Lecture Notes in Comp. Sci., no. 555, pp. 192-203, 1991.
Edwin E. Parks graduated magna cum laude in electrical engineering from
the University of Portland, OR, in 1994.
He is a Lead Engineer at the Ames Research Laboratories in Albany, OR,
where he has worked on various GPS navigation hardware and software
projects. His current professional interests are in GPS navigation, image
processing, and computer-based system design.
S. Mansoor Sarwar (S’82-M’91) received the undergraduate degree in
electrical engineering from UET, Lahore, Pakistan, in 1981, and the M.S. and
Ph.D. degrees in computer engineering from Iowa State University, Ames, in
1985 and 1988, respectively.
He has taught at UET and Kuwait University and is currently an Associate
Professor of Electrical Engineering at the University of Portland, OR. His
current teaching and research interests are in operating systems, parallel and
distributed computing, software engineering, experimental algorithmics, and
engineering education.
Syed Aqeel Sarwar received the undergraduate degree in computer science
from Iowa State University, Ames, in 1988, and the M.S. degree in computer
science from the New York Institute of Technology (NYIT) in 1992.
He is currently with the Academic Computing Laboratories at the Old
Westbury campus of NYIT. His professional interests are in databases, 4GLs,
and computer networks.