Genetics-Based Learning of New Heuristics: Rational Scheduling of Experiments and Generalization

Benjamin W. Wah, Arthur Ieumwananonthachai, Lon-Chan Chu, and Akiko N. Aizawa

Abstract — In this paper we present new methods for the automated learning of heuristics in knowledge-lean applications and for finding heuristics that can be generalized to unlearned domains. These applications lack domain knowledge for credit assignment; hence, operators for composing new heuristics are generally model-free, domain-independent, and syntactic in nature. The operators we have used are genetics-based; examples include mutation and cross-over. Learning is based on a generate-and-test paradigm that maintains a pool of competing heuristics, tests them to a limited extent, creates new ones from those that have performed well in the past, and prunes poor ones from the pool. We have studied three important issues in learning better heuristics: (a) anomalies in performance evaluation, (b) rational scheduling of limited computational resources in testing candidate heuristics in single-objective as well as multi-objective learning, and (c) finding heuristics that can be generalized to unlearned domains. We show experimental results in learning better heuristics for (a) process placement for distributed-memory multicomputers, (b) node decomposition in a branch-and-bound search, (c) generation of test patterns in VLSI circuit testing, and (d) VLSI cell placement and routing.

Index Terms — Branch-and-bound search, generalization, genetics-based learning, heuristics, knowledge-lean applications, performance evaluation, process mapping, resource scheduling, VLSI circuit testing, VLSI placement and routing.

Benjamin W. Wah (b-wah@uiuc.edu) and Arthur Ieumwananonthachai are with the Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL 61801.
Lon-Chan Chu is with Microsoft, Inc., One Microsoft Way, Redmond, WA 98052. A. N. Aizawa is with the National Center for Science Information System, 3-29-1 Otsuka, Bunkyo-ku, Tokyo 112, Japan. This research was supported in part by National Science Foundation Grants MIP 92-10584 and MIP 88-10584 and National Aeronautics and Space Administration Grants NCC 2-481, NAG 1-613 and NGT 50743 (NASA Graduate Fellowship Program). Manuscript submitted: August, 1991; Revised: November 1994; Accepted: March 1995.

1. INTRODUCTION

The design of problem solving algorithms for many applications generally relies on the expertise of designers and the amount of domain knowledge available. This design is difficult when there is little domain knowledge or when the environment under consideration differs from the one in which the algorithm is applied. In this paper we study two important problems in designing efficient algorithms: (a) automated design of problem solving heuristics in knowledge-lean application environments, and (b) systematic search for heuristics that can be generalized to unlearned domains.

A problem solver (PS) can be optimal or heuristic. An optimal problem solver is a realization of an optimal algorithm that solves the problem optimally with respect to certain objectives. In contrast, a heuristic problem solver has components that were designed in an ad hoc fashion, leading to possibly suboptimal solutions when applied. When there is no optimal algorithm, the design of effective heuristics is crucial. Without ambiguity, we simply use ‘‘problem solvers’’ in this paper to refer to ‘‘heuristic problem solvers.’’

Heuristics, in general terms, are ‘‘rules of thumb’’ or ‘‘common-sense knowledge’’ used in attempting the solution of a problem [22]. Newell, Shaw and Simon defined heuristics as ‘‘A process that may solve a given problem, but offers no guarantees of doing so’’ [20].
Pearl defined heuristics as ‘‘Strategies using readily accessible though loosely applicable information to control problem-solving processes in human beings and machines’’ [22]. In this paper, we define a heuristic method (HM) to mean a problem solving procedure in a problem solver. Without loss of generality, a HM can be considered as a collection of interrelated heuristic decision elements (HDEs) or heuristic decision rules. As illustrated in Fig. 1, a problem solver takes a problem instance (or test case) and generates a solution. Note that the solution is generally suboptimal since heuristics are used.

Heuristics are usually designed by experts with strong expertise in the target application domain, or by automated learning systems using machine learning techniques. Both methods focus on explaining the relation between heuristics and their performance, and on generating ‘‘good’’ heuristics based on observed information or explained relations. There are three major issues in designing good heuristics.

(1) Generation of heuristics. The way that heuristics are generated depends on the domain knowledge available in the application environment. An application environment can be knowledge-rich or knowledge-lean with respect to the heuristics to be designed. In knowledge-rich domains, a world model helps explain the relationship among decisions, states, actions, and the performance feedback generated by the learning system or measured in the environment. This model is important in identifying good heuristics that otherwise may be difficult to find. In contrast, such models do not exist in knowledge-lean domains.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 7, NO. 5, OCTOBER 1995

Fig. 1. A heuristic method applied to a problem instance in a knowledge-lean application domain. (Figure labels: problem subdomain; test case; problem solving procedure in PS = heuristic method (HM); heuristic decision element (HDE); suboptimal solution; missing feedback model in credit assignment.)
In this case, the heuristics generator cannot rely on performance feedback (or credit assignment as shown in Fig. 1) to decide how new heuristics should be generated or how existing heuristics should be modified. Note that we can use credit-assignment algorithms that do not rely on the world model; however, their effect on performance improvement is weak. In this paper, we study the learning of heuristics in knowledge-lean application domains. Since credit assignment is difficult in these domains, operators for composing new HMs are usually model-free, domain-independent, and syntactic in nature. Heuristics are generally represented in a form that can be modified syntactically; examples include bit strings and collections of symbols and numbers. The operators we use are based on those in genetics-based learning (such as mutation, cross-over and random generation) that perturb existing parameters and rules in order to obtain new ones [6]. Genetics-based learning [6], also called population-based learning [1, 31], of HMs is based on a generate-and-test paradigm that maintains a pool of competing HMs, tests them to a limited extent, creates new ones from those that perform well in the past, and prunes poor ones from the pool. It involves the application of genetic algorithms to machine learning problems. Examples of such learning include genetic programming [16] and classifier-system learning [6]. Genetics-based learning is suitable for learning performance-related HMs but not for learning correctness-related ones. A HM is said to be performance related if the constraints of the target problem are trivially satisfied, and the goal of learning is to improve the performance of the resulting solution, where performance is characterized by the quality of the resulting solution and the cost of getting it. 
In contrast, a HM is correctness related if the constraints of the problem are hard to satisfy, and the goal of learning is to find HMs that lead to efficient as well as feasible solutions. In knowledge-lean applications, HMs that are performance related are easier to learn than those that are correctness related. For this reason, we only study performance-related HMs in this paper. (2) Testing of heuristics and evaluating their performance. HMs generated are tested on a set of problem instances (or test cases) in a problem domain. In this paper, we are interested in two types of application domains: (a) those with a large number of test cases and possibly an infinite number of deterministic HMs for solving them, and (b) those with a small number of test cases but the HMs concerned have a nondeterministic component, such as a random initialization point, that allows different results to be generated for each test case. In both types, the performance of a HM is nondeterministic, requiring multiple evaluations of the HM on different test cases (type i) or multiple evaluations of the HM on the same test case (type ii). Consequently, we need to define valid statistical metrics for comparing two HMs without exhaustively testing all test cases using these HMs. This requires identifying subsets of test cases whose collective behavior on a HM can be evaluated statistically. We present in Section 2 issues on selecting appropriate aggregate metrics in performance evaluation of heuristics. An important issue in implementing a learning system is the scheduling of finite computational resources for testing a possibly infinite set of test cases and infinitely many variations of HMs. This entails apportioning computational resources to tests so that the best HM is found when resources are expended. The problem is especially difficult when tests are expensive and noisy. (The latter means that multiple tests are necessary in order to determine the performance of a HM.) 
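The pool-based generate-and-test paradigm described above can be sketched as follows. This is a minimal illustration, not the configuration used in our experiments: the pool size, number of generations, tests per HM, and the toy HM representation, operators, and noisy fitness function in the usage example are all illustrative assumptions.

```python
import random

def learn_heuristics(random_hm, mutate, crossover, evaluate,
                     pool_size=10, generations=20, tests_per_hm=5):
    """Generate-and-test loop: maintain a pool of competing HMs, test each to
    a limited extent under noise, create new HMs from well-performing ones,
    and prune poor ones from the pool."""
    pool = [random_hm() for _ in range(pool_size)]
    for _ in range(generations):
        # Limited testing: average a small number of noisy evaluations per HM.
        scored = sorted(
            pool,
            key=lambda hm: sum(evaluate(hm) for _ in range(tests_per_hm)) / tests_per_hm,
            reverse=True)
        survivors = scored[:pool_size // 2]          # prune poor HMs
        while len(survivors) < pool_size:            # breed replacements
            a, b = random.sample(survivors[:pool_size // 2], 2)
            survivors.append(mutate(crossover(a, b)))  # genetics-based operators
        pool = survivors
    return pool[0]                                   # best HM of the final pool

# Toy usage: a HM is a vector of three numeric parameters; the (hypothetical)
# fitness rewards parameters near zero and adds evaluation noise.
random.seed(0)
best = learn_heuristics(
    random_hm=lambda: [random.uniform(-1.0, 1.0) for _ in range(3)],
    mutate=lambda hm: [x + random.gauss(0.0, 0.1) for x in hm],
    crossover=lambda a, b: [random.choice(pair) for pair in zip(a, b)],
    evaluate=lambda hm: -sum(x * x for x in hm) + random.gauss(0.0, 0.01))
```

Because evaluations are noisy, each HM is scored by an average over several tests rather than a single run; Section 4 addresses how much testing each candidate rationally deserves.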
We study in Section 4 the scheduling of computational time for learning.

(3) Generalization of heuristics learned to unlearned domains. Since the problem space is very large and learning can only cover a small subset, it is necessary to generalize HMs learned to test cases not studied in learning. Generalization is difficult when HMs do not perform consistently or have different performance distributions across different test cases. Section 5 examines issues in generalization.

In short, we study in this paper the automated learning and generalization of performance-related heuristics by genetics-based learning methods for knowledge-lean applications. We assume that the performance of a HM is represented by one or more statistical metrics and is based on evaluating multiple test cases (noisy evaluations). The major issues that we study are methods to cope with inconsistencies in performance evaluation of heuristics (Section 2), resource scheduling of tests of heuristics (Section 4), and generalization of learned heuristics to unlearned domains (Section 5). Section 3 presents the overall learning framework, and experimental results are shown in Section 6.

2. HEURISTICS

In applying a HM, a problem solver applies a sequence of decisions defined in the HDEs of the HM, one after another, until an input test case is solved. These decisions, initiated by the problem solver at decision points, change the state of the application environment as evaluated by a number of user-defined performance measurables. The problem solver then uses the performance measured to make further decisions.

Table 1. Examples of knowledge-lean applications and their learnable domain-dependent heuristics.
(1) Application: Process mapping for placing a set of processes on a multicomputer [14]. Objective(s): Minimize overall completion time, and minimize the time to find such mappings. Domain-dependent parametric heuristics: If (processor utilization / average utilization of all processors) > (threshold), then evict one process. Heuristic element(s): Numeric threshold value. Example(s) of element: 1.10.

(2) Application: Load balancing in distributed systems [19]. Objective(s): Minimize completion time of an incoming job. Heuristics: If (average WL(•)) > (threshold), then migrate this process. Heuristic element(s): Workload function WL, numeric threshold value. Example(s): WL(•), 2.0.

(3) Application: Simulated annealing: TimberWolf [26]. Objective(s): Minimize area of layout with fixed maximum number of layers. Heuristics: If ((acceptance ratio) > (threshold)), then reduce temperature to next lower value. Heuristic element(s): Numeric threshold value, cost function, temperature function. Example(s): 0.9, C(•), T(•).

(4) Application: Depth perception in stereo vision [25]. Objective(s): Minimize error in range estimation. Heuristics: Marr and Poggio’s iterative algorithm. Heuristic element(s): (Low edge-detection threshold, channel width, high threshold). Example(s): (0.6, 2.0, 5.0).

(5) Application: Genetic search of the best VLSI test sequence [24]. Objective(s): Maximize fault coverage. Heuristics: Controls used in the genetic algorithm: iteration, rejection ratio, sequence depth, control factor, frequency of usage. Heuristic element(s): Numeric values, fitness function. Example(s): (2, 3, 4, 3.2, 100), H(•).

(6) Application: Branch-and-bound search for finding a minimum-cost tour in a graph. Objective(s): Minimize cost of tour while satisfying the constraint on visiting each node exactly once. Heuristics: If a node has the smallest decomposition-function value among all active nodes, then expand this node. Heuristic element(s): Symbolic formula. Example(s): Lower bound + upper bound of node.

(7) Application: Designing a blind equalizer. Objective(s): Minimize convergence time, accumulated errors, and cost; maximize S/N ratio. Heuristics: Objective (error) function for gradient descent. Heuristic element(s): Symbolic formula of the error function. Example(s): E(•).

A solution in this context is defined as a sequence of decisions made by the HM on an input test case in order to reach the final state. The performance of a HM on a test case depends on the quality of the solution found by the HM for this test case as well as the cost (e.g., computation time) of finding the solution.

Here, we define the quality (resp., cost) of a solution with respect to an input test case to be one or more measures of how good the final state is (resp., how expensive it is to reach the final state) when the test case is solved, and to be independent of the intermediate states reached. Note that cost and quality are in turn defined as functions of measurables in the application environment. We call quality and cost the performance measures of an application.

In this section, we discuss issues related to the performance evaluation of HMs. We show that a HM can be found to be better or worse than another HM depending on the evaluation criterion. Such inconsistencies are called anomalies in this paper and are attributed to the different methods of evaluating performance and the different behavior of HMs under different conditions. We propose methods to cope with these anomalies. When such anomalies cannot be avoided, alternative HMs should be learned and generalized so that users can pick the best HM(s) to apply.

2.1. Example Applications

A problem solver in general consists of a domain-independent part and a domain-dependent part. The domain-independent part is a general solution method that is applicable across different applications. For example, a divide-and-conquer method is domain-independent because it can be applied to many different applications. In contrast, the domain-dependent part is specific to a particular application. For example, the mechanism for partitioning a problem in a divide-and-conquer method is domain-dependent. The domain-independent and domain-dependent parts interact with each other to make decisions during the solution process. The domain-dependent part provides information on the current state to the domain-independent part, which returns a decision according to the information provided.
The domain-dependent part then applies the decision to change the state of the application environment. Heuristics can be used in the domain-dependent part to improve the solution cost or the solution quality (or both). In Table 1, we present examples of practical applications and identify their domain-dependent heuristics.

2.2. Problem Subspace and Subdomain

In an application domain, different regions of its problem space may have different characteristics, each of which may best be solved by a unique HM [23]. Since learning is difficult when test cases have different behavior, and since it is necessary to compare HMs quantitatively, we need to decompose the problem space into smaller partitions before learning begins. In the following, we define the concepts of problem subspace and problem subdomain.

A problem subspace is a user-defined partition of a problem space so that HMs for one subspace are learned independently of HMs in other subspaces. Such partitioning is generally guided by common-sense knowledge or by user experience in solving similar problems. To identify a problem subspace, we need to know one or more attributes to classify test cases and a set of decision rules to identify the subspace to which a test case belongs. For instance, consider the development of HMs for CRIS [24], a genetic-algorithm package for generating test patterns in order to test sequential VLSI circuits. (More details are provided in Section 6.3.) CRIS aims to find a test sequence that discovers as many faults as possible in a circuit. (Fault coverage measures the percentage of faults that can be detected by a test pattern generated by CRIS.) Previous experience shows that sequential circuits require tests that are different from those of combinational circuits.
As a result, we can partition the problem space into two subspaces, one for combinational circuits and the other for sequential circuits. As another example, consider solving a vertex-cover problem whose goal is to find a minimal set of vertices in a graph such that all edges have at least one of their vertices covered by this set. In designing a decomposition HM in a branch-and-bound (B&B) search for deciding which vertex to include in the covered set, previous experience with other optimization problems indicates that HMs for densely connected graphs are generally different from HMs for sparsely connected ones. Consequently, the problem space of all graphs may be partitioned into two subspaces, one for tightly connected graphs and one for loosely connected ones.

Given a subspace of test cases, we next define a subdomain. A problem subdomain is a partitioning of a problem subspace into smaller partitions so that one or more HMs can be designed for each partition. The reason for this partitioning is to allow quantitative comparison of the performance of HMs in a subdomain, which may be difficult across subdomains. In comparing the performance of HMs, it is necessary to aggregate their performance values into a small number of performance metrics (such as average or maximum). Computing these aggregate metrics is not meaningful when performance values are of different ranges and distributions. Hence, we define a subdomain as a maximal partitioning of test cases in a subspace so that different HMs in a subdomain can be compared quantitatively based on their aggregate metrics. It is important to point out that performance values may need to be normalized with respect to those of the baseline HM before they are aggregated. In the same way that test cases are partitioned into subspaces, we need to know the attributes to classify test cases and a set of decision rules to identify the subdomain to which a test case belongs.
For example, in solving a vertex-cover problem, we can treat graph connectivity as an attribute to classify graphs into subdomains. In some applications, it may be difficult to determine the subdomain to which a test case belongs. This is true because the available attributes may not be well defined, or the set of possible attributes may be too large to be useful. For instance, in test-pattern generation for sequential circuits, there are many attributes that can be used to characterize these circuits (such as the length of the longest path and the maximum number of fan-ins and fan-outs). However, none of these attributes is a clear winner. When we do not know the attributes to classify test cases into subdomains, we can treat each test case as a subdomain by itself. This works well when the HM to be learned has a random component: by using different random seeds in the HM, we can obtain statistically valid performance values of the HM on a test case. We have used this approach in the two circuit-related applications discussed in Section 6 and have chosen each circuit as an independent subdomain for learning. Another possibility is to learn one HM for each subdomain, but apply multiple HMs when a new circuit is encountered.

After learning good HMs for each subdomain, we need to compare the performance of HMs across subdomains. This comparison may be difficult because test cases in different subdomains of a subspace may have different performance distributions, even though they can be evaluated by a common HM. As a result, it may be difficult to compare the performance of test cases statistically. It should now be clear that there can be many subdomains in an application, and learning can only be performed on a small number of them. Consequently, it is important to generalize the HMs learned to unlearned subdomains.
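As noted above, when classification attributes are unknown, each test case can be treated as a subdomain by itself, and statistically valid performance values come from rerunning the randomized HM under different seeds. The sketch below illustrates this; the stand-in HM and its numbers are hypothetical, not one of the HMs studied in this paper.

```python
import random
import statistics

def evaluate_on_test_case(run_hm, test_case, seeds):
    """Run a randomized HM once per seed on a single test case and summarize
    the resulting performance distribution (the test case is its own
    subdomain, so the values can be aggregated directly)."""
    values = [run_hm(test_case, seed) for seed in seeds]
    return {"mean": statistics.mean(values),
            "max": max(values),
            "stdev": statistics.pstdev(values)}  # spread matters, not just the mean

# Hypothetical stand-in for a HM with a nondeterministic component, e.g. a
# seed-driven test-pattern generator; the coverage model is illustrative only.
def toy_hm(test_case, seed):
    rng = random.Random(seed)
    return test_case["base_coverage"] + rng.uniform(-5.0, 5.0)

summary = evaluate_on_test_case(toy_hm, {"base_coverage": 90.0}, seeds=range(10))
```

Reporting the maximum alongside the mean mirrors Table 2 below, where multiple seeded runs of the same HM yield both an average and a best-case fault coverage.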
Informally, generalization entails finding a good HM from the set of learned HMs so that this HM has a high probability of performing better than other competing HMs in solving a randomly chosen test case in the subspace. In some situations, multiple HMs may have to be identified and applied together, at a higher cost, to find a solution of higher quality. In Section 5, we propose a method for generalizing learned HMs.

To illustrate the importance of and difficulty in generalization, consider a HM developed for the previously described CRIS [24]. A HM in this application is a vector of seven parameters and a random seed used in the genetic algorithm in CRIS. As mentioned earlier, we treat each circuit as a separate subdomain in learning because we do not know the attributes for grouping circuits into subdomains. Note that different fault coverages can be obtained for a circuit by varying the random seed used in a HM. Table 2 shows the maximum and average fault coverages (over ten random seeds) of a HM we have learned and generalized for CRIS across six circuits. It shows that (a) the HM behaves differently across different circuits — not only is the range of fault coverages different, but the HM may perform better than CRIS [24] and HITEC (a program that uses a deterministic search to find good patterns) [21] for one circuit, yet worse for another; (b) multiple applications of the same HM using different random seeds can improve the coverage; and (c) there are limitations in CRIS that may make it difficult to improve over HITEC.

Table 2. Fault coverages of a learned and generalized HM used in CRIS on six circuits, compared to the fault coverages of the original CRIS [24] and HITEC [21].

  Circuit ID   Total Faults   HITEC   CRIS   Generalized HM (Avg.)   Generalized HM (Max.)
  s344          342           95.9    93.7          96.1                    96.2
  s382          399           90.9    68.6          72.4                    87.0
  s641          467           86.5    85.2          85.0                    86.1
  s832          870           93.9    42.5          44.1                    45.6
  s1238        1355           94.6    90.7          88.2                    89.2
  s5378        4603           70.3    65.8          65.3                    69.9

As another example, we show in Table 3 the results of learning and generalizing decomposition HMs used in a B&B search for solving vertex-cover problems. Here, we treat all test cases as belonging to one subspace, and graphs with the same average degree of connectivity are grouped into a subdomain. We applied genetics-based learning (to be discussed in Section 3) to find two HMs, one for each of two subdomains with connectivities 0.1 and 0.5. We then applied our generalization procedure (to be discussed in Section 5) to find one HM that generalizes across the two subdomains. Finally, we verified the speedups of the generalized HM on six subdomains. The results in Table 3 show that the generalized HM is not the top HM learned in either subdomain, indicating that the best HM learned in each subdomain may be too specialized to that subdomain. Further, we have found one HM that performs better than the baseline HM across all six subdomains.

Table 3. Average speedups of learned HMs and generalized HMs used in a B&B search for solving the vertex-cover problem. (All speedups are normalized with respect to those of the baseline HM. Subdomains are classified by degree of connectivity, DC.)

  Subdomain DC   HM learned in DC = 0.1   HM learned in DC = 0.5   Generalized HM
  0.1                    1.035                    0.993                 1.260
  0.2                    0.950                    1.001                 1.086
  0.3                    1.012                    0.988                 1.074
  0.4                    1.043                    0.986                 1.106
  0.5                    0.993                    1.013                 1.009
  0.6                    0.997                    1.012                 1.042

2.3. Anomalies in Performance Evaluation

To design a good and general HM for an application, we must compare HMs in terms of their performance. There are two steps in accomplishing this task. First, we must compare HMs in the same subdomain [31]. Second, we must compare the performance of HMs across multiple subdomains. Accomplishing the first step is necessary before we can deal with the second. In this section, we present the issues involved in these two steps.

2.3.1.
Anomalies within a Subdomain

Recall that the HMs studied in this research have nondeterministic performance, implying the need to evaluate each HM multiple times in a subdomain. Further, performance may be made up of multiple inter-related measures (for instance, higher quality may require higher cost). To compare the performance of different HMs, it is necessary to aggregate performance values before comparing them. This is, however, difficult, as the objectives of a HM as well as the trade-offs among its performance measures may be unknown. A possible solution is to derive a single objective function of the performance measures with tunable parameters, and to find a combination of values of these parameters that leads to the HM with the best trade-off. Using this approach, we have observed the following difficulties before [31].

• It is difficult to find a good combination of parameter values in the objective function so that HMs with the best quality-cost trade-offs can be found. We have seen similar difficulties in the goal-attainment method [12].

• It is difficult to compare the performance of two HMs when they are evaluated on test cases of different sizes or behavior.

• Inconsistent conclusions (anomalies) about the performance of two HMs may be reached when they are compared using either different user-defined objective functions, or the same objective function with different parameters. In fact, it is possible to show that one HM is better than another by finding a new parametric objective function of the performance measures.

We have previously proposed [31] three solutions to cope with these difficulties.

• Identify a reference or baseline HM against which all other HMs are compared. A good choice for an application is the best existing HM for this application.
• Normalize each raw performance measure of a new HM with respect to the same measure of the reference HM (evaluated on either the same set of test cases or test cases with the same distribution of performance) so that it is meaningful to compare two HMs based on their normalized measures.

• Compare two HMs based on individual normalized performance measures, not on a single parametric function of the measures. We have previously proposed a multi-dimensional graphical representation of performance values, representing each performance measure on a separate axis [15]. Two HMs are, therefore, compared based on their relative positions in this multi-dimensional plot. (This method is discussed later in Section 6.1.)

In this section, we extend the anomalies found earlier [31] and classify all the anomalies into three classes. Note that anomalies happen because there is more than one dimension of performance variation.

Table 4. Inconsistent performance of HMs across test cases. Let t_i,j be the completion time (the quality measure) of test case j using HM_i. HM_2 has a better average rank and a better average completion time than HM_1 after two tests, but a worse average completion time and a better average rank after three tests.

  Test case j      1          2          3       Average
  t_1,j         1474.89    1665.38    1381.34    1507.20
  t_2,j         1269.25    1513.14    1988.42    1590.27

(a) Inconsistent performance across different test cases. When a HM is evaluated on a set of test cases, we must determine (i) the number of tests to be made and (ii) the evaluation method (or metric) for aggregating performance values (such as mean, maximum, or average rank). Inconsistent conclusions may be reached when one HM is better than another on one set of test cases, but worse on a different set of test cases. For example, assume that performance is evaluated by either the average metric or the average-rank metric.
Each of these metrics may improve or degrade when more tests are done, changing the ordering of HMs. Table 4 illustrates this phenomenon using HMs developed for Post-Game Analysis (PGA) (to be discussed in Section 6.1). These HMs were used to map a collection of communicating processes onto a network of computers. This example shows that different conclusions can be drawn depending on the performance metric used.

In general, we must first select a method for aggregating the performance values of each measure. This can usually be determined from the application requirements. In this paper, we use the average metric as the primary method for comparing HMs, assuming that the performance values of tests in a subdomain are i.i.d. In addition, we must examine the actual distribution obtained in the experiments, since the average metric alone does not show the spread of performance values. HMs that have good average behavior but a large spread in performance are not desirable. On the other hand, when the metric is unknown, alternative HMs should be found for different metrics, and users can select the appropriate HM(s) to use.

(b) Multiple objectives with unknown trade-offs. The performance of a HM may be evaluated by multiple objectives (such as quality and cost). Of course, we would like to find HMs with improved quality and reduced cost. However, this may not always be possible. The problem, generally known as the multi-objective optimization problem [12], involves trade-offs among the objectives. In evaluating HMs with multiple objectives, we must evaluate them based on individual performance measures and not combine multiple measures into a single parametric function [31]. During learning, the learning system should constrain all but one measure and optimize the single unconstrained measure. A HM is pruned from further testing when one of its performance constraints is violated.
The goal is to find a HM that satisfies all the constraints and has the best performance in the unconstrained measure. If a good HM satisfying the constraints can be found, then the constraints are further refined, and learning is repeated (see Section 5.2). A similar approach has been used in MOGA, a multiple-objective genetic algorithm [11]. The difficulty with this approach lies in setting the constraints. We study in this paper the case in which there are two performance measures. The general case with more than two performance measures is still open at this time.

(c) Inconsistencies in normalization. Normalization involves choosing a baseline HM and computing the performance values of a new HM on a set of test cases in a subdomain relative to the corresponding performance values of the baseline HM. This is necessary when performance is assessed by evaluating multiple test cases (type i as discussed in Section 1), and is not needed when nondeterminism in performance is due to randomness in the problem solver (type ii as discussed in Section 1). In the former case, performance values from different tests may have different ranges and distributions, and normalization establishes a reference point for performance comparison. In the latter, raw performance values within a subdomain are from one test case and presumably have the same distribution.

Normalization may lead to inconsistent conclusions about the performance of HMs when multiple normalization methods are combined. This anomaly is illustrated as follows.

Example A. Referring to Table 4, if we use HM_1 as the baseline for normalization, we can compute the average normalized speedup of HM_2 by one of the following methods:

  q_2 = (1/n) Σ_{j=1..n} t_1,j / t_2,j = 0.986;   Q_2 = (1/n) Σ_{j=1..n} (t_1,j − 1200) / (t_2,j − 1200) = 1.900,   where n = 3.   (1)

Since the average normalized speedup of HM_1 is one, HM_2 is found to be worse using the first method and better using the second.
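The anomalies of Table 4 and Example A can be checked directly from the listed completion times; the constant 1200 below is simply the offset used in the second method of Eq. (1).

```python
t1 = [1474.89, 1665.38, 1381.34]   # completion times t_1,j of HM_1 (Table 4)
t2 = [1269.25, 1513.14, 1988.42]   # completion times t_2,j of HM_2 (Table 4)

# The average metric flips between two and three tests ...
avg_after_2 = (sum(t1[:2]) / 2, sum(t2[:2]) / 2)   # HM_2 ahead after two tests
avg_after_3 = (sum(t1) / 3, sum(t2) / 3)           # HM_1 ahead after three tests

# ... while the average-rank metric keeps favoring HM_2 (rank 1 = faster).
avg_rank_hm1 = sum(1 if a < b else 2 for a, b in zip(t1, t2)) / 3   # 5/3
avg_rank_hm2 = sum(1 if b < a else 2 for a, b in zip(t1, t2)) / 3   # 4/3

# Example A: the two normalization methods of Eq. (1) disagree about HM_2.
q2 = sum(a / b for a, b in zip(t1, t2)) / 3                     # ~0.986 < 1
Q2 = sum((a - 1200) / (b - 1200) for a, b in zip(t1, t2)) / 3   # ~1.90  > 1
```

The script confirms both inconsistencies: the ordering under the average metric depends on how many tests have been run, and the ordering under normalization depends on which normalization method is chosen.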
Inconsistencies may also occur when normalization overemphasizes or deemphasizes performance changes. For instance, the speedup measure is biased against slowdown (as slowdowns are in the range between 0 and 1, whereas speedups are in the range between 1 and infinity). Consider the following example. Example B. Suppose the speedups of a HM on two test cases are 10 and 0.1. Then the average speedup is 5.05, and the average slowdown is also 5.05, where the average slowdown is defined as the average of the reciprocals of the speedups. Hence, the average speedup and the average slowdown are both greater than one. In general, when normalizing performance values, it is important to note that the ordering of HMs may change when a different normalization method is used, and that the spread of performance values may vary across subdomains in an application. Here, we propose three methods to cope with anomalies in normalization. First, we should use only one normalization method consistently throughout learning and evaluation, thereby preserving the ordering of HMs throughout the process. Second, we need to evaluate the spread of normalized performance values to detect bias. This can be done by detecting outliers and by examining higher-order moments of the performance values.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 7, NO. 5, OCTOBER 1995

Third, to avoid placing unequal emphasis on normalized values, we need a normalization method that gives equal emphasis to improvement as well as to degradation. To simplify understanding, we describe this symmetric normalization method using the speedup measure. We define symmetric speedup as

Speedup_symmetric = { Speedup − 1          if Speedup ≥ 1
                    { 1 − 1/Speedup       if 1 > Speedup ≥ 0     (2)

where Speedup is the ratio of the time of the original HM to the time of the new HM.
Note that slowdown is the reciprocal of speedup, and that symmetric speedup is computed for each pair of performance values. Eq. (2) dictates that speedups and slowdowns carry the same weight: symmetric speedups are in the range from zero to infinity, and symmetric slowdowns are in the range from zero to negative infinity. In a similar way, we can define symmetric slowdown as

Slowdown_symmetric = { Slowdown − 1        if Slowdown ≥ 1
                     { 1 − 1/Slowdown     if 1 > Slowdown ≥ 0     (3)

It is easy to prove that Speedup_symmetric = − Slowdown_symmetric, thereby eliminating the anomalous condition in which the average speedup and the average slowdown are both greater than one or both less than one. In Example (A) discussed earlier, the average symmetric speedup is −0.059, which shows that HM 2 is worse than HM 1. In Example (B), both the average symmetric speedup and the average symmetric slowdown are zero, hence avoiding the anomaly in which the average speedup and the average slowdown are both greater than one. To further illustrate the difference between speedups and symmetric speedups, we show in Fig. 2 the distributions of speedups as well as symmetric speedups of a HM used to solve the vertex-cover problem. 2.3.2. Anomalies across Subdomains We now discuss the difficulty in comparing the performance of HMs across multiple subdomains. This comparison is difficult when there is a wide discrepancy in performance across subdomains. To illustrate this point, consider two HMs learned for CRIS (Table 5). These HMs behave differently in different subdomains: not only can the range and distribution of performance values be different, but a good HM in one subdomain may not perform well in another. With respect to circuit s444, HM 101 has worse fault coverages and a wider distribution of coverage values than HM 535, but performs better than HM 535 for circuit s1196. The major difficulty in handling multiple subdomains is that performance values from different subdomains cannot be aggregated statistically.
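Eqs. (2) and (3) are straightforward to implement. The sketch below checks the identity Speedup_symmetric = − Slowdown_symmetric and re-runs Example B (speedups of 10 and 0.1), where both symmetric averages come out to zero:

```python
def sym_speedup(speedup):
    """Symmetric speedup of Eq. (2): linear above 1, mirrored through
    the reciprocal for slowdowns (0 <= speedup < 1)."""
    return speedup - 1 if speedup >= 1 else 1 - 1 / speedup

def sym_slowdown(slowdown):
    """Symmetric slowdown of Eq. (3), same form applied to 1/speedup."""
    return slowdown - 1 if slowdown >= 1 else 1 - 1 / slowdown

# Example B: speedups of 10 and 0.1 on two test cases.
speedups = [10, 0.1]
slowdowns = [1 / s for s in speedups]

avg_sym_su = sum(map(sym_speedup, speedups)) / 2    # (9 + (-9)) / 2 = 0
avg_sym_sd = sum(map(sym_slowdown, slowdowns)) / 2  # also 0
```

With the raw measures, the same data yield an average speedup and an average slowdown that are both 5.05; the symmetric forms cancel exactly.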
For instance, it is not meaningful to find the average fault coverage of HM 101 in Table 5.

[Fig. 2. Contour plots showing the distribution of performance values of one HM on 15 test cases for solving the vertex-cover problem: (a) speedup measure; (b) symmetric speedup measure.]

Table 5. Inconsistent HM behavior in various subdomains.

                           Fault Coverages (%)
 Circuit   HM ID   Random Seeds used in HM      Max.    Avg.
                   61801    98052    15213
 s444      101     60.3     13.9     11.2       60.3    28.5
 s444      535     81.9     86.3     86.3       86.3    84.8
 s1196     101     93.2     94.4     94.9       94.9    94.2
 s1196     535     93.2     92.5     93.6       93.6    93.1

Scaling and normalization of performance values are possible ways to match the difference in distributions, but will lead to new inconsistencies for the reasons discussed in (c) in the last subsection. Another way is to rank HMs by their performance values across different subdomains, and to use the average ranks of HMs for comparing HMs. This does not work well because it does not account for actual differences in performance values, and two HMs with very close or very different performance may differ only by one in their ranks. Further, the maximum rank of HMs depends on the number of HMs evaluated, thereby biasing the average ranks of individual HMs. To address this problem, we propose in Section 5 a new metric called probability of win. Informally, the probability of win is a range-independent metric that evaluates the probability that the true mean performance of a HM in one subdomain is better than the true mean performance of another randomly selected HM in the same subdomain. The advantage of using this measure is that it is between zero and one, independent of the number of HMs evaluated and of the range and distribution of performance values. 3.
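The probability of win is developed formally in Section 5. As a rough illustration of the idea only, the sketch below uses a normal approximation (our assumption, not the paper's exact formula) to estimate the probability that the true mean of one HM exceeds that of another, given their sample statistics:

```python
from statistics import NormalDist

def win_prob(mean_i, var_i, n_i, mean_j, var_j, n_j):
    """Approximate P[true mean of HM i > true mean of HM j] by treating
    the difference of sample means as normally distributed with standard
    error sqrt(var_i/n_i + var_j/n_j).  Illustrative assumption only."""
    std_err = (var_i / n_i + var_j / n_j) ** 0.5
    return NormalDist().cdf((mean_i - mean_j) / std_err)

# Two HMs with identical statistics are equally likely to win...
print(win_prob(1.0, 0.5, 10, 1.0, 0.5, 10))   # 0.5
# ...while a clearly better, well-tested HM wins with near certainty.
print(win_prob(2.0, 0.5, 50, 1.0, 0.5, 50))   # close to 1
```

Note the result is always in [0, 1] regardless of the range of the raw performance values, which is the property the text emphasizes.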
TEACHER: A SYSTEM FOR LEARNING NEW HEURISTICS

In this section, we discuss TEACHER, an acronym for TEchniques for the Automated Creation of HEuRistics. TEACHER is a genetics-based learning system we have developed over the last six years [31]. Preliminary designs of TEACHER have been studied with respect to learning process-placement strategies for a network of workstations [19], learning process-placement strategies on distributed-memory multicomputers [15], tuning parameters in a stereo-vision algorithm [25], finding smaller feed-forward neural networks [29], and learning heuristics for B&B search [18, 33]. We have also studied resource scheduling strategies in genetics-based learning algorithms. Our present learning system is aimed at methods for coping with anomalies in performance evaluation, general resource scheduling strategies in multi-objective learning, and finding HMs that can be generalized. By combining the following three features, our system is unique as compared to other genetics-based learning studies.
• Our environment is noisy, so the performance of a HM cannot be evaluated using a single test.
• We consider applications in which HMs behave differently under different situations (subdomains). Existing methods generally ignore this problem and focus on only one set of statistically related test cases.
• We assume that the cost of evaluating a HM on a test case is high. This forbids performing extensive tests on each HM. In the applications presented in Section 6, one to two thousand tests are what can be performed in a few days on a fast workstation. This is in contrast to many other studies that assume that tests are inexpensive and that many tests can be performed in the time allowed [16].
For simplicity, we consider logical time in this paper, in which one unit of time is needed to perform one test of a HM.
The goal of resource scheduling is to learn, under limited computational resources, good HMs for solving application problems and to generalize the HMs learned to unlearned subdomains. We use the average metric for comparing HMs, and examine the spread of performance values when HMs have similar average performance. When there are multiple objectives in comparing HMs, we constrain all but one objective during learning, and optimize the unconstrained objective. In this case, our learning system proposes more than one HM, showing the trade-offs among these objectives.

3.1. Three Phases of Learning in TEACHER

There are three phases of learning in TEACHER: classification, learning, and generalization. The first phase partitions the test cases in an application into distinct subsets. There are two steps in this phase. a) Subspace classification. As is discussed in Section 2.2, the problem space is first partitioned into a small number of distinct subspaces so that new HMs are learned/designed for each. Such partitioning is guided by common-sense knowledge expressed in the form of decision rules. By applying these rules, we can determine, for a new test case, the subspace it belongs to. b) Subdomain classification. For a problem subspace, we need to partition it into subdomains so that the performance of HMs in each subdomain can be represented collectively by some meaningful statistical metrics. As we have seen in Sections 2.2 and 2.3, the performance of HMs may not be comparable across subdomains in a learning experiment. In the learning phase, the goal is to find effective HMs for each of a limited set of subdomains. The tasks in this phase and the generalization phase are shown in Fig. 3. To perform learning, the system first selects a subdomain, generates good HMs (or uses existing HMs) for this subdomain, and schedules tests of the HMs based on the available computational resources.
When learning is completed, the resulting HMs need to be fully verified, as HMs obtained during learning may not have been tested adequately. Note that learning is performed on one subdomain at a time. There are three key issues in this phase. a) Heuristics generation. This entails the generation of good HMs given the performance of ‘‘empirically good’’ HMs. As is discussed in Section 1, we use weak generation operators here [6, 16]. b) Performance of HMs in a subdomain. This problem is related to the performance evaluation of HMs during learning, given that there may be multiple performance measures and that there is no defined relationship among them (Sec. 2.3.1). c) Resource scheduling. The issues here are the selection of HMs for further testing, the termination of the current generation, and the initiation of the next generation, given the performance information of the HMs under consideration. These problems are important when limited computational resources are available and tests of HMs are expensive and noisy. We schedule computational resources rationally by choosing (i) the number of tests on each HM, (ii) the number of competing HMs to be maintained at any time, and (iii) the number of problem subdomains to be used for learning and for generalization. We study in Section 4 two related problems in resource scheduling: sample allocation and duration scheduling. The last phase is the generalization phase, whose goal is to find a HM from the set of learned HMs that achieves the same high level of performance improvement on unlearned subdomains. There are two key issues here. a) Performance of HMs across subdomains. As is discussed in Section 2.3, HMs may have different distributions of performance values in different subdomains; hence, these values cannot be compared directly. We present in Section 5 a method to evaluate the performance of HMs for a group of subdomains.
[Fig. 3 (block diagram): the learning and generalization phases in TEACHER, organized into a subdomain-learning level (select subdomain within subspace; generalize HMs to all subdomains), a HM generate-and-test level (generate/select HMs for a subdomain; learn HMs for the subdomain; verify and fully test HMs), and a HM test level (select a HM and generate/select a test case; apply the problem solver on the test case).]

b) Cost-quality trade-offs. This involves determining efficient HMs that perform well in the application. Should there be multiple HMs to be applied (at a higher total cost and better quality), or should there be one HM that is costly to run but generates high-quality results? These issues are studied in Section 6, where we present experimental results on learning new HMs for four applications.

3.2. Architecture of Learning System for One Subdomain

Fig. 4 shows the architecture of our resource-constrained learning system for one subdomain [31]. There are five main components in the system: a) the Resource Scheduler, which decides the best way to use the available resources; b) the Internal Critic, which provides feedback, based on the measured performance, indicating how well a particular HM has performed; c) the Population-Based Learning Element, which generates new HMs and maintains a pool of existing ones and their past performance; d) the Test-Case Manager, which generates and maintains a database of test cases used in HM evaluation; and e) the Problem Solver, which evaluates a HM using a test case.
In this paper, we assume that the application-specific Problem Solver and Test-Case Manager are user-supplied. In our current implementation, the Test-Case Manager selects from a user-supplied pool of test cases. The Internal Critic normalizes the performance value of each test case tested by a candidate HM against the performance value of the same test case evaluated by the baseline HM. It then updates the performance metrics of the candidate HM. Note that this is similar to updating the fitness values of HMs in classifier-system learning. In general, the Internal Critic performs credit assignment [28], which apportions credit/blame to HDEs using results obtained in testing (see Fig. 1). Credit assignments can be classified into temporal credit assignment (TCA) and structural credit assignment (SCA). TCA is the first stage in the assimilation of feedback and precedes SCA during learning. It divides up feedback between current and past decisions. Methods for TCA depend on whether the state space is Markovian: non-Markovian representations often require more complex TCA procedures. On the other hand, SCA translates the (temporally local but structurally global) feedback associated with a decision point into modifications associated with various parameters of the decision process. In the knowledge-lean applications we consider in this paper, we lack a world model that relates states, decisions, and feedback signals generated by the learning system or measured in the environment. As a result, credit assignment has a much weaker influence on performance improvement. An example of such a TCA algorithm is the bucket-brigade algorithm in classifier-system learning [6]. Note that the lack of a world model for credit assignment is the main reason for maintaining competing HMs in our learning system. The Resource Scheduler schedules tests of HMs based on the available resources. Note that scheduling is critical when tests are computationally expensive.
Two related problems, sample allocation and duration scheduling, as well as the scheduling of tests under multiple performance objectives, are studied in the next section.

4. SCHEDULING TESTS IN GENETICS-BASED LEARNING

Resource scheduling of tests in learning is crucial when tests are expensive. To illustrate the importance of scheduling, consider the testing of HMs in the vertex-cover problem discussed in Section 2.2. Suppose we have identified two subdomains: D_A (with graph connectivity of 0.1) and D_B (with graph connectivity of 0.6). To illustrate the effect of scheduling, we generated 100 decomposition HMs randomly and evaluated each on D_A and D_B.

[Fig. 4. Architecture of learning system for one subdomain.]

Table 6 shows the average symmetric speedups of HMs selected under three resource schedules with respect to those of the conventional HM.

Table 6. Symmetric speedups of the best HMs based on three different schedules and 150 tests. The HMs in each run are selected randomly from a pool of 100 HMs. The average speedup is evaluated using 15 randomly generated test cases.

                                            Sym-Speedup of the Best HM
 Schedule                       Subdomain    Run 1     Run 2     Run 3
 10 HMs, 15 tests each          DA           −0.56     −2.08      0.01
                                DB           −0.22     −0.16     −0.01
 75 HMs, 2 tests each           DA            0.01      0.01      0.01
                                DB           −0.03     −0.03     −0.03
 20 HMs, 5 tests each; then     DA            0         −0.56     0.01
 the 5 best HMs, 10 tests each  DB            0         −0.16    −0.01

The results show that (a) there are trade-offs between the number of HMs tested and the performance of the best HM found, and (b) a more detailed evaluation of several top HMs at the end of learning is beneficial.
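The breadth-versus-depth trade-off in Table 6 can be reproduced qualitatively with a toy simulation (entirely synthetic numbers and helper names, not the paper's experiment): each schedule is a list of phases, where a phase keeps the currently best-looking HMs and draws more noisy samples from each:

```python
import random

def mean_or_low(xs):
    """Sample mean, or -inf for an untested HM so it ranks last."""
    return sum(xs) / len(xs) if xs else float("-inf")

def run_schedule(true_means, plan, noise=1.0, rng=None):
    """Simulate a test schedule.  `plan` is a list of (n_keep, n_tests)
    phases: keep the n_keep HMs with the best sample means so far, then
    draw n_tests noisy samples from each survivor.  Returns the true
    mean of the HM finally selected."""
    rng = rng or random.Random(0)
    pool = {i: [] for i in range(len(true_means))}
    for n_keep, n_tests in plan:
        ranked = sorted(pool, key=lambda i: -mean_or_low(pool[i]))[:n_keep]
        pool = {i: pool[i] for i in ranked}
        for i in pool:
            pool[i] += [rng.gauss(true_means[i], noise) for _ in range(n_tests)]
    best = max(pool, key=lambda i: mean_or_low(pool[i]))
    return true_means[best]
```

For instance, with 100 hypothetical HMs and 150 tests, `[(10, 15)]`, `[(75, 2)]`, and `[(20, 5), (5, 10)]` mimic the three schedules of Table 6; the screen-then-refine plan concentrates samples on promising HMs instead of spending them uniformly.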
In this section, we discuss our model of and assumptions about the sample-allocation and duration-scheduling problems, issues in designing resource scheduling strategies, and our proposed scheduling algorithms.

4.1. Model and Assumptions

We describe in this section a statistical model for scheduling tests in our learning system. A general comprehensive model is too complex to be analyzed, since many parameters are unknown a priori. Here, we find good scheduling strategies based on a simplified model and apply these strategies as heuristics in practice. We assume that the performance values of HM i over a problem subdomain constitute a population with a distribution f_i(x). Each evaluation of HM i is equivalent to drawing a performance value from this distribution. We make the following assumptions in our study.
• In multi-objective applications, we assume that there are k + 1 performance metrics {C_1, ..., C_k, Q}, where {C_i, i = 1, ..., k} are constrained metrics and Q is an unconstrained metric (see Sec. 2.3). We define c_i, ĉ_i, and θ_i to be, respectively, the expected value, the approximated mean value, and the maximum acceptable expected value of constrained metric C_i. Here, f_i(x) is the distribution of performance metric Q. We assume that Q for test cases in the subdomain is to be maximized, while the other performance metrics must satisfy the constraints.
• f_i(x) is generally unknown and non-identical for different i's, and tests drawn from f_i(x) may be dependent. In our simplified analysis, we assume that samples drawn from f_i(x) are i.i.d.
• The means of the populations belong to a distribution that is hard to estimate. Further, cross-overs and mutations applied at the end of a generation may change this distribution in an unknown fashion. For simplicity, we ignore this effect in our simplified model.
• We assume that the time to evaluate one test case using one HM is one unit; that is, we consider only logical time in our scheduling study.
4.2. Previous Work

In this section, we discuss two problems in resource scheduling and their solutions in our population-based learning system: sample allocation and duration scheduling. (A) Sample allocation entails the scheduling of tests of HMs in a generation, given fixed numbers of tests in the generation and of HMs to be tested. This problem is known in statistics as the (sequential) allocation problem [4, 30], and the corresponding scheduler is called the local scheduler. The original problem suggested by Bechhofer in 1954 [4] is to decide the optimal allocation of picks, given a fixed total number of picks, assuming that the population means and variances are known. The objective of these strategies is to maximize P(CS), the probability of correctly selecting the population with the highest population mean when the time is expended. Optimal solutions to problems in this class are unknown, and many extensions have been proposed to accommodate various trade-offs and relaxed assumptions. Existing strategies can be classified into static and dynamic. Static sample-allocation strategies have a selection sequence fixed ahead of time, independent of the values of the picks obtained during selection. They are easier to analyze because of their simplicity. The most commonly used static strategy is the round-robin strategy, which takes samples from each population in turn. It allocates T/n tests to each population, given T tests and n populations, and maximizes the worst-case P(CS) when all populations have the same variance. Its drawback is that it tests the worst population to the same extent as the best, an obviously inefficient use of resources. This is also the most commonly used strategy in genetics-based learning systems [5, 6, 13]. Dynamic (or adaptive) sample-allocation strategies select the population for testing based on previous sample values and other run-time information.
Although more flexible, they are more complicated. One such strategy was developed by Tong and Wetzell to optimize P(CS) when the selection process ends. It focuses on populations with high sample means, but also tests others with smaller means if they have not been tested enough [30]. Sample-allocation strategies developed in statistics are not directly applicable in our learning system because they were developed with different objectives. In statistical sample allocation, the objective is to maximize P(CS), given a finite number of populations. In contrast, our objective is to maximize the expected population-mean value of the population selected, given infinitely many populations initially. Since the maximum number of tests in learning is limited, we are interested in how close the actual performance of the selected HM is to the maximum performance within a pool of HMs. We have previously developed a minimum-risk scheduling strategy [15], which is a dynamic sample-allocation strategy with the above objective in mind. The goal of the strategy is to minimize the risk in identifying the best population:

minimize risk = minimize E[(µ_max − µ̂_max)²].     (4)

In our derivation, we assume that the distribution of each population is normal with a common variance, an obviously restrictive assumption for many applications. (B) Duration scheduling entails deciding when to terminate an existing generation and start a new one. A common strategy is to allocate a fixed duration to each generation, although better decisions can be made if past information is used. Duration-scheduling strategies can be classified as static and dynamic. A static (or fixed) duration-scheduling strategy simply sets the duration of each generation to a predetermined value. Previous work [10] has shown that the most appropriate duration depends on the total time allocated to learning and on the target application.
To find a proper duration for a given time limit, experiments with different durations must be run. The overhead for this is deemed too high to be useful. A dynamic (or adaptive) strategy, on the other hand, uses run-time information to determine when each generation should end. A new set of HMs should be generated when the expected improvement from the new HMs is larger than the expected improvement from further testing the current set of HMs. There is very little research on this problem in statistics. One strategy we have studied extends our minimum-risk sample-allocation strategy [15] by estimating the distribution of new populations to be generated in the next generation, using statistics collected in previous generations [2, 3]. (In the first generation, samples have to be drawn to estimate the initial distribution.) This strategy is restricted because it assumes that all populations have normal distributions with the same variance. Another dynamic strategy we have studied is based on Bayesian analysis [2, 3], which results in a strategy that increases the duration as the variance ratio (the ratio of the sample variance to the variance of the µ_i's) decreases. When the variance of the µ_i's is large, it is easy to identify good populations; hence, the duration should be small. In general, the variance ratio is large when learning begins and decreases as learning proceeds. Consequently, the duration should be small initially and increase gradually. The difficulty with this strategy is that it is hard to find the correct duration without making simplifying assumptions about the distributions. Instead of varying the duration, a dual strategy is to fix the duration of a generation but vary the number of populations in it [15]. This is less flexible because it is difficult to adjust the population size dynamically.
The main shortcoming of existing work is that it assumes that the HMs generated always have acceptable performance, even though most HMs may be pruned after a few tests. This is especially true in multi-objective applications, in which we set constraints on performance metrics, and there may not be any acceptable HMs at the end of a generation. We address this problem in Section 4.4.

4.3. Non-parametric Minimum-Risk Sample-Allocation Strategy

A general sample-allocation strategy should not require information on the distributions of the performance measures, as they change dynamically and are difficult to estimate. In this section, we propose a non-parametric sample-allocation strategy for determining the HMs to be evaluated based on run-time performance information of the populations. Our non-parametric minimum-risk strategy is extended from the parametric minimum-risk method we have developed earlier [15]. The objective of resource scheduling is to find the best HMs when all resources are exhausted. In general, this objective cannot be achieved, since we cannot model changes in distributions between generations. To cope with this problem, we restrict our objective for sample allocation within a generation to the following: ‘‘minimize the risk that the populations selected for generating new HMs when the generation ends are wrong.’’ Note that this objective is for scheduling within a generation, not across generations. Consider a generation of K populations. Population i is characterized by n_i (the number of tests performed), µ_i (the unknown population mean), σ_i² (the unknown population variance), µ̂_i (the sample mean), S_i² (the sample variance), F_i (the true fitness value), and f_i (the sample fitness value), where F_i ≜ µ_i − c, f_i ≜ µ̂_i − c, and c is a constant that is usually set to the minimum fitness value over all populations. Note that f_i is an unbiased estimator of F_i, since µ̂_i and S_i² are unbiased estimators of µ_i and σ_i², respectively.
In this paper, we use the average metric as the fitness function. We define the loss due to believing f_i as L_i ≜ E[(f_i − F_i)²]. Given µ̂_i and σ_i (or S_i), we can calculate the value of L_i, noting that E[(f_i − F_i)²] = Var[f_i − F_i]:

L_i = σ_i² / n_i.     (5)

The probability that population i will be selected for generating new ones is defined as P_i ≜ f_i / Σ_{j=1}^{K} f_j. The scheduling problem can be formulated (heuristically) as follows:

minimize φ ≜ Σ_{i=1}^{K} P_i L_i     subject to     Σ_{i=1}^{K} n_i = N,     (6)

where N is the number of tests performed in the current generation. By using a Lagrange multiplier, we have

minimize φ̂ ≜ φ + λ (N − Σ_{i=1}^{K} n_i).     (7)

By equating ∂φ̂/∂n_i to zero, we obtain the optimality criteria

−λ = P_i σ_i² / n_i² = P_j σ_j² / n_j²,     for i ≠ j.     (8)

At any time t, the strategy is to minimize Eq. (6) for time (t + 1); i.e., only one of the n_i's can be increased by 1.

ONE-STAGE POLICY: n_j ← n_j + 1, where j = arg max_i (P_i σ_i² / n_i²).     (9)

Eq. (9) says that the population to be tested next is one that has a large fitness value (i.e., a large probability of being chosen for reproduction in the next generation) and a large variance (i.e., large uncertainty in its mean). Note that Eq. (9) only tries to find the next population to be tested. In this case, P_i generally changes slowly (P_i′ ≈ P_i), and P_i can be approximated using information in the current generation.
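The one-stage policy of Eq. (9) is simple to implement once sample statistics are available. The sketch below (our variable names) scores each population by P_i S_i²/n_i², substituting the sample standard deviation S_i for the unknown σ_i as the text describes, and returns the index of the population to test next:

```python
def next_population(stats):
    """One-stage minimum-risk policy, Eq. (9): pick j maximizing
    P_j * S_j^2 / n_j^2.  `stats` is a list of per-population triples
    (sample_mean, sample_std, n_tests); the constant c is set to the
    minimum sample mean, as is usual in the text."""
    c = min(m for m, s, n in stats)
    fitness = [m - c for m, s, n in stats]     # f_i = mean_i - c
    total = sum(fitness) or 1.0                # guard: all-zero fitness
    risks = [(f / total) * s * s / (n * n)     # P_i * S_i^2 / n_i^2
             for f, (m, s, n) in zip(fitness, stats)]
    return max(range(len(stats)), key=risks.__getitem__)

# A high-variance, lightly tested population with a decent mean is
# preferred over a well-tested one with a slightly higher mean.
pool = [(1.005, 0.036, 4), (0.910, 0.050, 4), (0.950, 0.100, 2)]
print(next_population(pool))  # 2
```

This matches the behavior illustrated by the paper's worked example: populations with a high mean (hence high P_i) and a high S_i²/n_i attract the next test.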
In our experiments, we use S_i as an approximation to σ_i. To estimate S_i, at least two tests must be performed on each population (preferably four or more). Note that the Lagrange-multiplier procedure is valid only when the performance values are continuous. As an example, consider population i with four samples: 0.971, 1.006, 0.988, and 1.055. In this case, µ̂_i, S_i, and n_i are 1.005, 0.036, and 4, respectively. Assuming that there are a total of 30 populations and that the minimum µ̂ is 0.910, the fitness value of population i is µ̂_i − 0.910 = 0.095. Further, assume the current total fitness of the remaining 29 populations to be 0.781. Hence, P_i is 0.095/(0.781 + 0.095) = 0.108, and the risk value of this population is P_i S_i² / n_i² (= 8.78 × 10^−6). Assume that population i has the largest risk value and that a new sample with value 0.920 is drawn from it. With this new sample, µ̂_i, S_i, n_i, f_i, and P_i become 0.988, 0.049, 5, 0.078, and 0.091, respectively, and the new risk value is reduced to 7.59 × 10^−6. This example shows that populations with a high mean (hence, high P_i) and a high S_i²/n_i are more likely to have high risk values and to be tested. Generally, risk values are reduced as more samples are drawn.

4.4. Duration Scheduling for Multi-Objective Applications

This subsection presents duration-scheduling methods for multi-objective applications. As discussed in Section 2.3, we must constrain all but one objective and optimize the unconstrained objective. However, all HMs may be pruned during learning when the constraints are too tight. Applying random generation at that point is not helpful, because random generation is the weakest of all generation methods, and it is unlikely that newly generated HMs will satisfy the constraints. To avoid this undesirable scenario, we must relax our original goal and find HMs as close as possible to the desired level of constraints, given the available resources. To this end, we must first start with loose constraints and gradually tighten them as learning proceeds.

4.4.1. Constraint Handling

We outline in the following a method for determining the likelihood that a HM satisfies the given constraints, using the notation defined in Section 4.1.
It is not possible to prune every HM that violates one or more constraints (c_i > θ_i) on one or more test cases, because (a) c_i is unknown and the estimated ĉ_i is used instead, (b) there is uncertainty in determining ĉ_i, and (c) it is not possible to set worst-case performance bounds for a HM on a test case because, by the nature of heuristics, their worst-case behavior may not be bounded. We want to penalize HMs based on P_ok, the probability of satisfying the given set of constraints. Since the problems we study have high evaluation costs, we need to prune HMs that are unlikely to satisfy the constraints (P_ok << 0.5). Further, we would like to give HMs with P_ok close to 1 a higher chance of further reproduction and testing. Given the performance values of a HM over n test cases, with sample mean ĉ_i and sample variance S²(c_i) for each constrained metric C_i, the random variable (ĉ_i − c_i) √n / S(c_i) has Student's t-distribution with n − 1 degrees of freedom [9]. Accordingly, we can compute the probability that this HM satisfies threshold value θ_i on C_i:

P[c_i ≤ θ_i] = F_t( n − 1, (θ_i − ĉ_i) / (S(c_i) / √n) ),     (10)

where F_t(ν, x) is the c.d.f. of Student's t distribution with ν degrees of freedom. It is important to point out that Eq. (10) is valid only when c_i is the average metric. When there are multiple constrained metrics, the probability that all constraints (θ_i for i = 1, ..., k) are satisfied is equal to P[c_1 ≤ θ_1 ∩ ... ∩ c_k ≤ θ_k]. Based on probability theory, we know that

Π_i P[c_i ≤ θ_i]  ≤  P_ok = P[c_1 ≤ θ_1 ∩ ... ∩ c_k ≤ θ_k]  ≤  min_i P[c_i ≤ θ_i].     (11)

Hence, we use min_i P[c_i ≤ θ_i] as an approximation to P_ok.

4.4.2. Dynamic Multi-Objective Duration-Scheduling Strategy (DMDS)

Using the relaxed goal, the learning system iteratively finds HMs using increasingly harder constraints, instead of trying to find HMs that satisfy the final target constraints immediately.
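Eq. (10) and the bound of Eq. (11) can be sketched as follows. Since the Python standard library lacks a Student-t c.d.f., this illustration substitutes a normal approximation (our assumption, reasonable only for moderately large n):

```python
from statistics import NormalDist, mean, stdev

def p_satisfies(samples, theta):
    """Approximate P[c_i <= theta_i] of Eq. (10) from n sample values of
    one constrained metric, using a normal c.d.f. in place of Student's
    t-distribution (an approximation for moderately large n)."""
    n = len(samples)
    c_hat, s = mean(samples), stdev(samples)
    return NormalDist().cdf((theta - c_hat) / (s / n ** 0.5))

def p_ok(sample_sets, thetas):
    """Upper-bound approximation to P_ok, Eq. (11): minimum over the
    constrained metrics of the per-metric satisfaction probabilities."""
    return min(p_satisfies(s, t) for s, t in zip(sample_sets, thetas))

costs = [0.9, 1.1, 1.0, 1.0]        # hypothetical values of one metric
print(p_satisfies(costs, 1.0))      # threshold at the sample mean: 0.5
```

HMs whose `p_ok` is far below 0.5 are candidates for pruning, while those close to 1 are favored for reproduction and further testing, as described above.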
The initial set of constraints is selected in such a way that almost all randomly generated HMs will be accepted. This ensures that some HMs are available for generating new HMs in the next generation.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 7, NO. 5, OCTOBER 1995

To set constraints in successive iterations, we apply an iterative refinement method we developed in a real-time search algorithm for solving time-constrained combinatorial optimization problems [7]. We set new thresholds so that the times used in learning with successive thresholds grow geometrically. In this way, a small portion of the total time is used in intermediate iterations, and most of the effort is spent in the last iteration. Using this iterative method, we need to set intermediate thresholds on the constrained variables {C_i} that are easier to achieve than the final target thresholds. We define the j'th set of intermediate thresholds on k performance metrics as {θ̂_1,j, θ̂_2,j, ..., θ̂_k,j}. Since we want increasingly tougher constraints, we have the following property:

∞ = θ̂_i,0 > θ̂_i,1 > ... > θ̂_i,∞ = θ_i,   i = 1, ..., k,   (12)

where θ_i is the final target threshold for C_i, and θ̂_i,0 is the initial threshold at the start of learning. To control both the duration of each generation and the intermediate thresholds, we have developed the DMDS duration-scheduling strategy. This strategy has two stages.

Stage 1: ∃ C_i with θ̂_i,j > θ_i (not all target thresholds are satisfied). In this stage, DMDS decides the time for updating constraints and the values of the thresholds. It starts a new generation and a new set of thresholds when HMs satisfying the current constraints have been found and most HMs violating the current constraints have been eliminated. DMDS determines new thresholds based on profile data collected on thresholds during learning and on the amount of time spent in finding acceptable HMs using these thresholds.
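The geometric growth of the per-iteration learning times can be sketched as follows. This is our illustration, not the paper's method: the closed-form budgets and the growth factor are assumptions, since DMDS actually derives the thresholds from profile data rather than fixing the durations directly:

```python
def geometric_durations(total_time, iterations, ratio=2.0):
    """Per-iteration time budgets that grow geometrically (ratio > 1), so
    that most of the total learning time is spent under the final (target)
    thresholds.  The growth factor `ratio` is our choice for illustration."""
    scale = total_time * (ratio - 1.0) / (ratio ** iterations - 1.0)
    return [scale * ratio ** j for j in range(iterations)]
```

With ratio = 2 and five iterations, the last iteration receives about half of the total time (16/31 of it), matching the intent that most of the effort be spent under the final thresholds.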
The thresholds are set so that the time spent in each iteration to find feasible HMs satisfying the constraints grows geometrically.

Stage 2: θ̂_i,j = θ_i for all i (all target thresholds have been reached). When all the performance constraints are satisfied in a generation, the learning system finds the best HM that satisfies all the constraints. To do so, good HMs found in this generation are tested more thoroughly, to ascertain with a high degree of certainty that they satisfy the constraints, before the next generation begins. We defer to Section 6 the effects of the various scheduling algorithms presented in this section.

5. FINDING GENERAL HEURISTIC METHODS FOR ALL SUBDOMAINS

One of the key reasons for learning is to find a good HM that can generalize to test cases in new problem subdomains. Generalization is important because we perform learning on a very small number of subdomains, and there may be infinitely many subdomains in an application. Further, it is desirable to use one or very few HMs in an application rather than a new HM for each problem instance. The goal of generalization is somewhat vague: we would like to find one or more HMs that perform well most of the time across multiple subdomains as compared to a baseline HM (if one exists). There are four issues in achieving this goal.

• How to compare the performance of HMs within a subdomain in a range-independent and distribution-independent fashion?
• How to define the notion that one HM performs well across multiple subdomains?
• How to find the condition(s) under which a specific HM should be applied?
• What are the trade-offs between cost and quality in generalization?

5.1. Probability of Win within a Subdomain

There are many ways to address the first issue raised in this section, and the solutions to the remaining problems depend on this solution. As discussed at the end of Section 2, scaling, normalization, and ranking do not work well.
In this section, we propose a metric called probability of win to select good HMs. P_win, the probability of win of HM_i within a subdomain, is defined as the probability that the true mean of HM_i (with respect to one performance measure) is better than the true mean of an HM_j randomly selected from the pool. When HM_i is applied to test cases in subdomain d_m, we have

P_win(HM_i, d_m) = (1 / (|s| − 1)) Σ_{j≠i} P[ µ_i^m > µ_j^m | µ̂_i^m, σ̂_i^m, n_i^m, µ̂_j^m, σ̂_j^m, n_j^m ],   (13)

where |s| is the number of HMs under consideration, d_m is a subdomain, and n_i^m, σ̂_i^m, µ̂_i^m, and µ_i^m are, respectively, the number of tests, sample standard deviation, sample mean, and true mean of HM_i in d_m. Under the assumptions that (i) the performance values of each HM are normally distributed, (ii) the true variance σ_i² of HM_i is known, and (iii) the HMs in a subdomain are independent, µ̂_i, the sample mean for HM_i, can be assumed to have a N(µ_i, σ_i/√n_i) distribution. Consequently, µ̂_i − µ̂_j has a N(µ_i − µ_j, √(σ_i²/n_i + σ_j²/n_j)) distribution, and Z = ((µ̂_i − µ̂_j) − (µ_i − µ_j)) / √(σ_i²/n_i + σ_j²/n_j) has a N(0, 1) distribution. From the above assumptions, the probability that HM_i is better than HM_j in d_m can be computed approximately as follows:

P[ µ_i^m > µ_j^m | µ̂_i^m, σ_i^m, n_i^m, µ̂_j^m, σ_j^m, n_j^m ] = Φ( (µ̂_i^m − µ̂_j^m) / √(σ_i^m²/n_i^m + σ_j^m²/n_j^m) ),   (14)

where Φ(x) is the cumulative distribution function of the N(0, 1) distribution. When n_i > 30 and n_j > 30, assumptions (i) and (ii) above can be relaxed. Using the Law of Large Numbers [9], the performance of each HM does not have to be normally distributed for the condition stated in Eq. (14) to hold. In addition, σ̂_i, the sample standard deviation, can be used in place of σ_i, the true standard deviation, without significant loss of accuracy.
In this case, the following equation can be used instead:

P[ µ_i^m > µ_j^m | µ̂_i^m, σ̂_i^m, n_i^m, µ̂_j^m, σ̂_j^m, n_j^m ] = Φ( (µ̂_i^m − µ̂_j^m) / √(σ̂_i^m²/n_i^m + σ̂_j^m²/n_j^m) ).   (15)

Note that using Eq. (15) when n_i and n_j are less than or equal to 30 will result in less accurate predictions.

Table 7. Probabilities of win of four HMs.

h_i    µ̂_i    σ̂_i    n_i    P_win(h_i)
1      43.2   13.5   10     0.4787
2      46.2    6.4   12     0.7976
3      44.9    2.5   10     0.6006
4      33.6   25.9    8     0.1231

To illustrate the concept, we show in Table 7 the probabilities of win of four HMs tested to various degrees. Note that the probability of win is not directly related to the sample mean but instead depends on the sample mean, the sample variance, and the number of tests performed. Further, the probability that h_i is better than h_j and the probability that h_j is better than h_i are both counted in the evaluation. Hence, the sum of P_win over all HMs is half the number of HMs evaluated. P_win as defined in Eq. (13) is range-independent and distribution-independent because all performance values are transformed into probabilities between 0 and 1. It assumes that all HMs are i.i.d. and takes into account the uncertainty in their sample averages (by using the variance values); hence, it is better than simple scaling, which only compresses all performance averages into a range between 0 and 1.

5.2. Probability of Win across Subdomains

The use of the probability of win leads to two ways to solve the second issue posed earlier in this section, namely, how to define the notion that one HM performs better than another across multiple subdomains. First, we assume that when HM h is applied over multiple subdomains in a partition Π_p of subdomains, all subdomains are equally likely. Therefore, we compute the probability of win of h over the subdomains in Π_p as the average probability of win of h over all subdomains in Π_p.
P_win(h, Π_p) = (1 / |Π_p|) Σ_{d ∈ Π_p} P_win(h, d),   (16)

where Π_p is the p'th partition of subdomains in the subspace. The HM picked is the one that maximizes Eq. (16). An HM picked this way usually wins with a high probability across most of the subdomains in Π_p but occasionally may not perform well in a few of them.

Second, we consider the problem of finding a good HM across the multiple subdomains in Π_p as a multi-objective optimization problem itself. As indicated in Section 2.3, evaluating HMs based on a combined objective function (such as the average probability of win in Eq. (16)) may lead to inconsistent conclusions. To alleviate such inconsistencies, we should treat each subdomain independently and find a common HM across all subdomains in Π_p satisfying some common constraints. For example, let δ be the allowable deviation of the probability of win of any chosen HM from q_win^m, the maximum P_win in subdomain m. Generalization, therefore, amounts to finding an h that satisfies the following constraints:

P_win(h, m) ≥ q_win^m − δ for every m ∈ Π_p.   (17)

In this formulation, δ may need to be refined if too many or too few HMs satisfy the constraints.

Fig. 5. Histogram showing the probabilities of win of four HMs (188, 130, 107, and 129) generalized across six subdomains (graph connectivities 0.1, 0.15, 0.3, 0.4, 0.5, and 0.6) and those of the baseline HM. (HM 129 will be picked if Eq. (16) is used as the selection criterion; HM 107 will be selected if Eq. (17) is used as the criterion.)

To illustrate the generalization procedure, consider the vertex-cover problem discussed in Section 2.2. Assume that learning had been performed on six subdomains (with graph connectivities 0.1, 0.15, 0.3, 0.4, 0.5, and 0.6, respectively) and that the five best decomposition HMs were generated from each. After full evaluation of the 30 HMs across the six subdomains, we computed the probability of win of each HM in each subdomain. Fig. 5 shows the probabilities of win of several of these HMs. If we generalize HMs based on Eq. (16), then HM 129 will be picked, since it has the highest average P_win. In contrast, if we generalize using Eq. (17), then HM 107 will be picked, because it has the smallest deviation from the maximum P_win in each subdomain. Note that both HMs are reasonable choices as a generalized HM that can be applied across all subdomains. To narrow the choice down to one single generalized HM, further evaluation of the spread of performance values would be necessary (see Section 6.2).

Using probabilities of win to assess an HM across subdomains, we can now address the last two issues raised earlier in this section, which deal with the selection of multiple HMs. There are two ways in which multiple HMs can be used: (a) each HM is used in a non-overlapping subset of the subdomains in the subspace (third issue), and (b) multiple HMs are always applied in solving a test case in the subspace (fourth issue). The issue of finding the condition(s) under which a specific HM should be applied is a difficult open problem.
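The probability-of-win computation of Section 5.1 can be reproduced directly from Eqs. (13) and (15) with the standard normal c.d.f. The sketch below is our code (assuming, as in Table 7, that "better" means a larger true mean) and recomputes the Table 7 entries:

```python
from statistics import NormalDist
import math

def p_better(m_i, s_i, n_i, m_j, s_j, n_j):
    """Eq. (15): probability that HM i's true mean exceeds HM j's, from
    sample means, sample standard deviations, and numbers of tests."""
    z = (m_i - m_j) / math.sqrt(s_i ** 2 / n_i + s_j ** 2 / n_j)
    return NormalDist().cdf(z)

def p_win(hms, i):
    """Eq. (13): probability of win of HM i within one subdomain.
    hms: list of (sample_mean, sample_std, n_tests) tuples."""
    others = [p_better(*hms[i], *hms[j]) for j in range(len(hms)) if j != i]
    return sum(others) / (len(hms) - 1)

# The four HMs of Table 7: (sample mean, sample std, number of tests).
hms = [(43.2, 13.5, 10), (46.2, 6.4, 12), (44.9, 2.5, 10), (33.6, 25.9, 8)]
wins = [p_win(hms, i) for i in range(len(hms))]  # ≈ [0.479, 0.798, 0.601, 0.123]
```

Averaging each HM's per-subdomain values of `p_win` as in Eq. (16), or comparing them against the per-subdomain maxima as in Eq. (17), then yields the two selection criteria discussed above.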
Solving this problem amounts to designing decision rules to partition the subspace of test cases into a finite number of partitions, each of which can be solved by one HM. This is possible in some applications that can be characterized by a small number of well-defined attributes. For instance, in the vertex-cover problem discussed in Section 2.2, graph connectivity is a unique attribute that allows us to decompose the space of all random graphs into partitions. For other applications, this may not be easy. For instance, in the Post-Game Analysis system discussed in Section 2.3.1, there are a few attributes that could be used to decompose the subspace (e.g., the number of processes in an application and the number of processors in a multicomputer system), but none of them is a good choice. In the CRIS test-pattern generation system [24] discussed in Section 2.2, there are too many attributes that could be used to classify circuits (e.g., the number of gates or the length of the longest path), and it is not clear which attributes should be used.

The last issue raised earlier in this section concerns the trade-offs between cost and quality in generalization. Since it may be difficult to identify a unique HM for each test case in the subspace, we can pick multiple HMs, each of which works well for some subdomains in the subspace, and apply all of them when a new test case is encountered. This is feasible only when the added cost of applying multiple HMs is compensated by the improved quality of the solutions. In this approach, the cost reflects the total computational cost of applying all the chosen HMs to solve a given test case. The problem of selecting a set of HMs for a subspace amounts to picking multiple HMs and assigning each to a subdomain in the subspace.
Assuming that |H| such HMs are to be found, we need to decompose the subspace into |H| partitions of subdomains and assign one HM to all subdomains in each partition. The overall probability of win over the subspace is computed in a way similar to Eq. (16). In mathematical form, let Ω be the set of all HMs tested in the subspace and Π be the set of all subdomains in this subspace; we are interested in finding H ⊆ Ω, with |H| constant, such that the average P_win is maximized. That is,

P_win^max(Ω, Π) = max_{H ⊆ Ω, |H| = constant} (1 / |Π|) Σ_{d ∈ Π} max_{h ∈ H} P_win(h, d),   (18)

where |Π| is the number of subdomains in subspace Π. One way to find H in Eq. (18) is to enumerate over all possible ways of decomposing Π and assign the best HM to each partition. The problem is equivalent to the minimum-cover problem: given a set Π of subdomains and a set Ω of HMs (each of which covers one or more subdomains), find the minimum subset H of Ω such that each element of Π is covered by one HM in H. The problem is NP-hard and is solvable in polynomial time only when |H| is two. Fortunately, by applying Eq. (17), we can make the number of candidate HMs arbitrarily small by choosing a proper δ. In this case, a fixed set of HMs that best covers all subdomains in the subspace can be found by enumeration. Experimental results on such cost-quality trade-offs are presented in Section 6.3.

5.3. Generalization Procedure

The procedure to generalize the HMs learned for subdomains in a problem subspace is summarized as follows. (a) Using the collective set of HMs obtained in the subdomains learned, find the probability of win (using Eq. (13)) of each HM in each subdomain learned or to be generalized. (b) Apply Eq. (18) to select the necessary number of HMs for evaluating test cases in the subspace. Eq. (17) can be used to restrict the set of HMs considered in the selection process.

6.
EXPERIMENTAL RESULTS

To illustrate the learning methods developed in this paper, we present in this section results on learning and generalization for the four applications discussed in Section 2.

6.1. Process Mapping on Distributed-Memory Multicomputers

Process mapping involves placing a set of communicating processes on a multicomputer system so that the completion time of the processes is minimized. The problem is characterized by non-deterministic (data-dependent) execution times between inter-process communications and non-deterministic amounts of data communicated between processes. It can be solved as a deterministic optimization problem using average execution times and data volumes; however, the solution is inaccurate when execution and communication activities change with time. Further, a deterministic model does not account for delays incurred by blocked messages. Such unpredictable delays can be found only when the processes are actually executed or simulated.

Yan and Lundstrom proposed Post-Game Analysis (PGA), a simulation-based method for finding good mappings [32]. Their system collects an execution trace consisting of the actual execution times between communications and the amounts of data sent between processes, and uses them in a simulation system to find the actual completion time of a specific mapping. It then applies heuristics to propose a new mapping, evaluating the effectiveness of the new mapping through the simulation system. This iterative refinement is repeated until no further improvement is possible. PGA can be applied in practice by repeatedly collecting a trace for a short period of time, optimizing the mapping by PGA on a different computer while the original application program runs, and proposing a new mapping for the application program to use.
There are four components of the heuristics used in PGA: (a) proposal-generation heuristics for generating proposals to perturb a mapping based on independent optimization subgoals, (b) priority-assessment heuristics for prioritizing each site and process, (c) transformation-generation heuristics for generating possible transformations from the ordered list of sites and processes, and (d) feasibility heuristics for checking the feasibility of a move. These heuristics are represented as expressions (or heuristic decision elements, HDEs) that combine values collected during program execution and are applied to make decisions.

These four heuristic components interact extensively in proposing alternative mappings. Consequently, we cannot isolate each set of heuristics and learn them independently. Instead, we consider the four components to make up a PGA HM, and learning aims to find the best collection of HDEs and the proper value for each threshold. The PGA HMs used in learning are generated randomly as well as by cross-overs and mutations. PGA HMs are evaluated by two performance measures: the quality of a mapping (the process completion time under the mapping) and the cost of finding the mapping (the PGA execution time). Learning aims to find PGA HMs with the minimum completion time and with cost within a user-specified limit. This is necessary because PGA has to run concurrently with the application program, and a new mapping should be proposed within a time constraint.

In learning PGA HMs, we chose an application based on a divide-and-conquer algorithm: each process computes for a random amount of time, sends a message to each of its child processes to start their computation, and waits for the results from its descendants before reporting to its parent. We used three machine architectures (3-by-3, 4-by-4, and 5-by-5 meshes) and two process sizes (100 and 200 processes), resulting in six subdomains.
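As an illustration of the genetics-based operators mentioned above, the sketch below applies single-point cross-over and mutation to a flat, four-component HM representation. This representation is a simplification we made for the example: real HDEs are expressions over run-time statistics, and the strings and the threshold field here are hypothetical stand-ins:

```python
import random

# Hypothetical flat representation of a PGA HM: one heuristic decision
# element (HDE) per component, plus a numeric threshold.  Strings stand in
# for the HDE expression trees.
COMPONENTS = ["proposal", "priority", "transformation", "feasibility"]

def crossover(parent_a, parent_b, rng):
    """Single-point cross-over: the child takes its first components from
    parent_a and the remaining ones from parent_b."""
    point = rng.randrange(1, len(COMPONENTS))
    return {**{c: parent_a[c] for c in COMPONENTS[:point]},
            **{c: parent_b[c] for c in COMPONENTS[point:]},
            "threshold": parent_a["threshold"]}

def mutate(hm, rng, scale=0.1):
    """Mutation: perturb the threshold; a fuller implementation would also
    rewrite one HDE expression."""
    child = dict(hm)
    child["threshold"] += rng.uniform(-scale, scale)
    return child

rng = random.Random(42)
a = {c: f"{c}-A" for c in COMPONENTS}; a["threshold"] = 0.5
b = {c: f"{c}-B" for c in COMPONENTS}; b["threshold"] = 0.8
child = mutate(crossover(a, b, rng), rng)
```

A generate-and-test learner would apply such operators to the well-performing HMs in the pool and submit the offspring for testing under the scheduling strategies of Section 4.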
All subdomains were assumed to belong to one problem subspace. Each PGA test case specifies the number of processes in the divide-and-conquer tree, the execution time of each node of the tree, one of the machine architectures, and an initial mapping of the processes on the architecture. All experiments were performed on a Sun SparcStation 10/30. For the applications we have used, testing a PGA HM on one test case is time consuming: a small learning experiment with 6,400 evaluations (6 subdomains, each with 800 tests for learning and 400 tests for final verification) took between 7 and 11 days of CPU time.

We first evaluated three resource-scheduling strategies: DMDS duration scheduling with minimum-risk sample allocation, fixed duration with minimum-risk sample allocation (MR), and fixed duration with round-robin sample allocation (RR). For each strategy, we performed five learning experiments of 800 tests over each of the six subdomains, using a cost constraint of 30% of the cost of the original PGA HM by Yan and Lundstrom [32]. We studied six cases in which 1, 3, 5, 10, 15, and 20 HMs were retained at the end of learning for full verification, and compared the best and average qualities of the HMs achieved over the five runs of each scheduling strategy.

Fig. 6 shows the average speedups (quality) of the HMs achieved by the three scheduling strategies over the six subdomains. Since the cost constraint is tight and all performance values represent slowdowns, we do not use symmetric speedups here. All of these HMs have significantly lower costs and slightly worse qualities than the original PGA HM. Further, DMDS performs the best in four of the six subdomains and second best in the other two. The other results are similar, with smaller differences in quality as the number of HMs retained for verification increases. DMDS also consistently finds better HMs more often than the two fixed-duration scheduling strategies when the cost constraint is tight.
Of the 30 experiments performed under a 30% cost constraint, DMDS failed once when the five best HMs were retained for verification (RR failed five times and MR four times). When the cost constraint is loose, DMDS does not perform better than the other scheduling strategies.

Fig. 6. Average quality (E[normalized performance]) of HMs selected under a cost constraint of 30% of the original PGA HM for three resource-scheduling strategies: DMDS, MR, and RR, on 3-by-3, 4-by-4, and 5-by-5 meshes with 100 and 200 processes. (The number of HMs retained for full verification is 5.)

Our next experiments address the generalization of the HMs learned.
We used three subdomains (a 3-by-3 mesh with 100 processes, a 3-by-3 mesh with 200 processes, and a 5-by-5 mesh with 200 processes) for learning, and generalized the HMs learned to the remaining three subdomains. In learning, 800 tests were performed for each subdomain, and the best five HMs that satisfied the cost constraint were selected for full verification. We considered two cost constraints: 30% and 100% of the cost of the original PGA HM by Yan and Lundstrom [32]. By applying Eq. (18), we found that under the 1.0 cost constraint all subdomains should belong to one partition and can be evaluated by one HM, and that under the 0.3 cost constraint there are two partitions (partition 1 contains the three subdomains with 100 processes, and partition 2 the remaining three subdomains with 200 processes). In this case, the PGA HMs learned do not generalize well and are biased towards the number of processes in the application program. Table 8 shows the costs and qualities of the generalized HMs as compared to those of the learned HMs. We see that both have similar costs and qualities. In contrast, the cost of learning is around 500 times higher than the cost of applying the generalized HMs.

Next, we show the performance of the learned PGA HMs when generalized across the three multicomputer architectures under various combinations of cost constraint and number of processes in the application program. As discussed in Section 2.3, the performance values related to each objective in a multi-objective application need to be considered independently in order to avoid inconsistencies in evaluation. To this end, we plot in a two-dimensional graph the distribution of the normalized quality of a HM on a test case against the corresponding normalized cost of the same HM on the same test case.
Using a method we have developed earlier to show cost-quality trade-offs [15], we identified a 90% constant probability contour for each HM by removing outliers, checking for bivariate normality, and finding the 90% constant probability-density contour of the resulting bivariate distribution.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 7, NO. 5, OCTOBER 1995

Table 8. Quality-cost comparison of learned and generalized HMs using a cost constraint of 30% of the original PGA cost. Subdomains with "*" are learned; subdomains with "+" are generalized from subdomains with the same number of processes. (Subdomains 1-3 belong to one partition, and subdomains 4-6 belong to another.) The cost of learning is the sum of normalized execution times with respect to the baseline HM.

                                        Generalized HM(s)   Learned HM(s)     Normalized
ID   Architecture   Processes   Quality   Cost     Quality   Cost     Learning Cost
1*   3-by-3 mesh    100         0.934     0.251    0.934     0.251    583.3
2+   4-by-4 mesh    100         0.933     0.231    0.948     0.277    349.9
3+   5-by-5 mesh    100         0.954     0.235    0.951     0.230    428.8
4*   3-by-3 mesh    200         0.993     0.283    0.993     0.283    505.8
5+   4-by-4 mesh    200         0.986     0.244    0.993     0.274    348.0
6+   5-by-5 mesh    200         0.964     0.274    0.961     0.269    416.9

[Figure 7: four panels of normalized quality of mapping found (y-axis) versus normalized cost (x-axis), each showing contours for the six subdomains (3x3, 4x4, and 5x5 meshes with 100 and 200 processes); the contour data is not recoverable from this extraction.]

Fig. 7. Performance of PGA HMs learned for a 3-by-3 mesh architecture and (a) Top-left: 100-process subdomain and 1.0 cost constraint, (b) Top-right: 200-process subdomain and 1.0 cost constraint, (c) Bottom-left: 100-process subdomain and 0.3 cost constraint, and (d) Bottom-right: 200-process subdomain and 0.3 cost constraint.

Fig. 7 shows the cost-quality behavior of generalized PGA HMs on various architectural configurations and numbers of nodes in the divide-and-conquer tree. Each of the four graphs represents the performance of one HM obtained by learning in one subdomain and generalizing to two other subdomains. The HM used in Fig. 7a (resp., 7b) was obtained by generalizing HMs learned in a 100-process (resp., 200-process) subdomain under a 1.0 cost constraint and a 3-by-3 mesh architecture to the two remaining architectural configurations. Likewise, the HMs used in Fig. 7c (resp., 7d) are HM 1 (resp., HM 4) in Table 8. In learning the HMs, the fixed-duration minimum-risk strategy (resp., the DMDS minimum-risk strategy) was used in Fig. 7a and 7b (resp., Fig. 7c and 7d). We see in Fig. 7a and 7b that all the contour plots are clustered together, implying that the PGA HMs selected under the 1.0 cost constraint generalize well to other subdomains. Further, these HMs have lower costs than the original HM (normalized to 1) while having qualities close to or better than those of the original HM. In Fig. 7c and 7d, we find both generalized HMs to have similar quality levels, but the HMs generalized from the 100-process subdomains have higher costs than those from the 200-process subdomains.
In this case, HMs that perform well for the 200-process subdomains would violate the cost constraint for the 100-process subdomains. This happens because the achievable cost for each subdomain in the 0.3 cost-constraint case is lower; for larger test cases, there is more room for improvement in terms of cost, and a lower relative cost can be achieved. Similar conclusions can be drawn by computing P_win. For instance, the average P_win for a single HM generalized across all six subdomains is 0.78 for the 1.0 cost-constraint case and 0.61 for the 0.3 cost-constraint case. In the latter case, the PGA HMs learned tend to specialize to the number of processes in the application program. For instance, the average P_win values for two partitions of the problem subspace, one for the 100-process subdomains and the other for the 200-process subdomains, are 0.79 and 0.85, respectively. We have also observed [15] that generalization to different divide-and-conquer trees or different application problems is similarly easier when the cost constraint is loose and more difficult when it is tight.

6.2. Branch-and-Bound Search

A branch-and-bound (B&B) search algorithm is a systematic method for traversing a search tree or search graph in order to find a solution that optimizes a given objective while satisfying given constraints. It decomposes a problem into smaller subproblems and repeatedly decomposes them until a solution is found or infeasibility is proved. Each subproblem is represented by a node in the search tree/graph.
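As a concrete, hypothetical illustration of this scheme (a best-first B&B for the 0/1 knapsack problem, not one of the paper's test-case generators), with the roles of the heuristic components marked in comments:

```python
import heapq

def knapsack_bb(profits, weights, capacity):
    """Best-first branch-and-bound for the 0/1 knapsack problem.
    Assumes positive weights; illustrative sketch only."""
    # Sort objects by profit/weight so a fractional bound is easy to compute.
    items = sorted(zip(profits, weights), key=lambda pw: -pw[0] / pw[1])

    def bound(level, profit, weight):
        # Pruning heuristic: optimistic profit via the fractional relaxation.
        for p, w in items[level:]:
            if weight + w <= capacity:
                profit, weight = profit + p, weight + w
            else:
                return profit + p * (capacity - weight) / w
        return profit

    best = 0
    # Selection heuristic: a max-heap keyed on the bound (best-first search).
    heap = [(-bound(0, 0, 0), 0, 0, 0)]       # (-key, level, profit, weight)
    while heap:                               # termination: pool exhausted
        neg_key, level, profit, weight = heapq.heappop(heap)
        if -neg_key <= best or level == len(items):
            continue                          # node pruned or fully expanded
        p, w = items[level]
        # Decomposition heuristic: branch on including/excluding the object.
        if weight + w <= capacity:
            best = max(best, profit + p)
            heapq.heappush(heap, (-bound(level + 1, profit + p, weight + w),
                                  level + 1, profit + p, weight + w))
        heapq.heappush(heap, (-bound(level + 1, profit, weight),
                              level + 1, profit, weight))
    return best
```

Changing the heap key changes the selection heuristic, and changing `bound` changes the pruning heuristic, which is exactly the kind of component the learning system searches over.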
The algorithm has four sets of HMs: (a) a selection HM for choosing the next search node to expand, based on a sequence of selection keys for ordering search nodes; (b) a decomposition HM (or branching mechanism) for expanding a search node into descendants, using operators that transform a node into child nodes; (c) a pruning HM for eliminating inferior nodes in order to trim potentially poor subtrees; and (d) a termination HM for determining when to stop. In this paper, we apply learning to find only new decomposition HMs; preliminary results on learning selection and pruning HMs can be found in [8]. We consider optimization search, which involves finding the optimal solution and proving its optimality. We illustrate our method on three applications: traveling-salesman problems on incompletely connected graphs mapped on a 2-D plane (TSP), vertex-cover problems (VC), and knapsack problems (KS). Table 9 shows the parameters used to generate a test case in each application. All subdomains are assumed to belong to one problem subspace.

Table 9. Generation of 12 subdomains of test cases for testing decomposition HMs in a B&B search.

VC:
- Connectivity of vertices is 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, or 0.6
- Number of vertices is between 16 and 45

TSP:
- Distribution of 8-18 cities (uniformly distributed between 0-100 on both the X and Y axes, uniformly distributed on one axis and normally distributed on the other, or normally distributed on both axes)
- Graph connectivity of cities is 0.1, 0.2, 0.3, or 1.0

KS:
- Range of both profits and weights is (100-1000), (100-200), or (100-105)
- Variance of the profit/weight ratio is 1.05, 1.5, 10, or 100
- 13-60 objects in the knapsack

[Figure 8: average symmetric speedup (0 to 0.05, y-axis) versus generation (1-10, x-axis) for the MR, RR, Top-MR, and Top-RR strategies; average performance over 5 experiments with 160 tests per generation; the plotted data is not recoverable from this extraction.]

Fig. 8. Average performance over 5 runs of the HMs selected for the VC problem with edge connectivity of 0.15 using the MR and RR scheduling strategies. (Top-RR and Top-MR mean that only one HM was retained for verification; RR and MR mean that five HMs were retained.)

The problem solver here is a B&B algorithm, and a test case is considered solved when its optimal solution is found. Note that the decomposition HM is a component of the B&B algorithm. We use well-known decomposition HMs developed for these applications as our baseline HMs (see Table 11). The normalized cost of a candidate decomposition HM is defined in terms of its average symmetric speedup (see Eq. (2)), which relates the number of nodes expanded by a B&B search using the baseline HM to the number expanded using the new HM. Note that we do not need to measure quality, since both the new and existing HMs, when applied in a B&B search, find the optimal solution. Our first experiments study the effects of the fixed-duration strategies (RR and MR) on learning. DMDS was not used because there is only one objective measure. Fig. 8 shows the quality of the HMs found as a function of learning time for the two scheduling strategies. Each point in the graph was obtained by averaging the symmetric speedups of the HMs that would have been selected had learning stopped at that point.

Table 10. Results of learning and generalization for VC, TSP, and KS. (In the results on generalization, numbers with "*" are the ones learned; only one common HM is generalized to all 12 subdomains.)
Learning results (per-subdomain symmetric speedup of the top HM learned, and normalized learning cost):

Appl.  Measure   1        2        3        4        5        6        7       8       9       10      11      12       Average
VC     Sym-SU    0.000    0.011    0.041    0.000    0.044    0.022    0.008   0.013   0.000   0.000   0.000   0.000    0.012
VC     Cost      26343.5  23570.9  21934.1  12951.6  11034.3  12414.4  5871.0  8093.3  6878.0  5051.2  3826.2  3107.3   11756.3
TSP    Sym-SU    0.194    0.073    0.288    0.378    0.106    0.068    0.267   0.382   0.048   0.165   0.208   0.083    0.188
TSP    Cost      2846.6   1543.9   2077.7   2207.7   2314.9   1865.6   1889.9  1847.5  2509.7  1947.0  1445.4  1958.8   2037.9
KS     Sym-SU    0.000    0.000    0.000    0.000    0.000    0.000    0.893   0.000   0.263   0.107   2.840   0.089    0.349
KS     Cost      25707.7  32587.9  9671.6   26408.1  24903.6  22309.1  3648.1  7943.1  8114.7  6476.2  772.9   10684.4  14935.6

[The generalization and validation rows of Table 10 (per-subdomain Sym-SU values for VC, TSP, and KS) are scrambled in this extraction and are not reliably recoverable; their average improvements are summarized in the text.]

Table 11. Original and generalized decomposition HMs used in a B&B search. (The new HM learned for VC can be interpreted as follows: l is the primary key, and n − ∆l is the secondary key.)

Vertex-cover problem: original HM: l; generalized HM: 1000 l + n − ∆l
  l  = live degree of vertex (uncovered edges)
  d  = dead degree of vertex (covered edges)
  n  = average live degree of all neighbors
  ∆l = difference between l from parent node to current node

Traveling-salesman problem: original HM: c; generalized HM: m * c
  c = length of current partial tour
  m = minimum length to complete the current tour
  a = average length to complete the current tour
  l = number of neighboring cities not yet visited
  d = number of neighbors already visited

Knapsack problem: original HM: p/w; generalized HM: p/w
  p, w = profit/weight of object
  s = weight slack = weight limit − current weight
  pmax, pmin = maximum/minimum profit of unselected objects
  wmax, wmin = maximum/minimum weight of unselected objects

Fig. 8 shows that MR outperforms RR, that MR is more likely to identify the top HM, and that MR requires fewer HMs to be retained for full verification at the end of learning. For this reason, we use the fixed-duration MR scheduling strategy in the remaining results in this subsection. Fig. 8 also shows that verifying more HMs at the end leads to better HMs (albeit at higher verification cost). Next, we generalize the HMs learned to new subdomains. For each application, we selected six subdomains and performed learning on each using 1,600 tests. We then selected the top five HMs from each learned subdomain, fully verified them on all the learned subdomains, and selected one final HM to be used across all subdomains. Table 10 summarizes the learning, generalization, and validation results. For learning, we show the average symmetric speedup of the top HM learned in each subdomain and the normalized cost of learning, where the latter was computed as the ratio of the total CPU time for learning to the harmonic mean of the CPU times required by the baseline HM on the test cases used in learning. The results show that a new HM learned for a subdomain achieves around a 1-35% improvement in its average symmetric speedup, at a learning cost 3,000-16,000 times that of the baseline. Table 10 also shows the average symmetric speedups of the generalized HMs. We picked six subdomains randomly for learning. After learning and full verification of the top five HMs in each subdomain, we applied Eq. (18) to identify the generalized HM across all twelve subdomains. Our results show that the generalized HMs have between 0-8% improvement in average symmetric speedups.
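The paper's Eq. (2) for symmetric speedup is not reproduced in this excerpt; a commonly used symmetric form, whose sign pattern matches Table 10 (values above 1 for large speedups, small negative values for slowdowns), is sketched below as an assumption, not the paper's definition:

```python
def symmetric_speedup(n_base, n_new):
    """One common symmetric form: positive values are speedups, negative
    values are slowdowns, and swapping the arguments flips the sign.
    Here n_base and n_new are, e.g., node counts expanded by the baseline
    and the new HM."""
    if n_new <= n_base:
        return n_base / n_new - 1.0   # twice as fast -> +1.0
    return 1.0 - n_new / n_base       # twice as slow -> -1.0
```

Under this form, a value of 2.840 (KS, subdomain 11) means the new HM expands roughly a quarter of the baseline's nodes, while -0.010 means a 1% slowdown.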
Note that these results are worse than those obtained by learning, and that the baseline HM is the best HM for solving the knapsack problem. The bottom part of Table 10 shows the average symmetric speedups when we validate the generalized HMs on larger test cases. These test cases generally require 10-50 times more node expansions than those used earlier. Surprisingly, our results show better improvements (9-23%). Further, six of the twelve subdomains with a high degree of connectivity in the vertex-cover problem show slowdowns. This is a clear indication that these subdomains should be grouped in a different subspace and learned separately. Finally, Table 11 shows the new decomposition HMs learned for the three applications. We list the variables that were supplied to the learning system. In addition to these variables, we have also included constants that can be used by the heuristics generator. An example of such a constant is shown in the HM learned for the vertex-cover problem. This formula can be interpreted as using l as the primary key for deciding which node to include in the covered set. If the l's of two alternatives are different, then the remaining terms in the formula (n − ∆l) are insignificant.

Table 12. HM parameters in CRIS. (The type, range, and step size were recommended to us by the designer of CRIS.)

Param.  Type    Range     Step   Definition                                               Learned Value
P1      int.    1-10      1      related to the number of stages in a flip-flop           1
P2      int.    1-40      1      sensitivity of changes of state of a flip-flop           12
P3      int.    1-40      1      survival rate of a test sequence in the next generation  38
P4      float   0.1-10.0  0.1    number of test vectors concatenated to form a new vector 7.06
P5      int.    50-800    10     number of useless trials before quitting                 623
P6      int.    1-20      1      number of generations                                    1
P7      float   0.1-1.0   0.1    how genes are spliced in the genetic algorithm           0.1
P8      int.    any       1      seed for the random number generator                     -
In contrast, when the l’s are the same, then we use n − ∆l as a tie breaker. In short, our results show that reasonable improvements can be obtained by learning and by generalization. We anticipate further improvements by (a) learning and generalizing new pruning HMs in a depth-first search, (b) partitioning the problem space into a number of subspaces and learning a new HM for each, and (c) identifying attributes that help explain why one HM performs well in some subdomains. 6.3. Heuristics for Sequential Circuit Testing CRIS is a genetic-search package developed by experts for generating test patterns to test VLSI circuits [24]. It is based on continuous mutations of a given input test sequence and on analyzing the mutated vectors for selecting a test set. The package has been applied successfully to generate test patterns with high fault coverages for large combinatorial and sequential circuits. In our application of TEACHER to improve CRIS, we treat CRIS as a problem solver in a black box. The inputs to CRIS are a set of eight parameters that we treat as our HM (see Table 12). We were also given a suite of 20 sequential benchmark circuits [17]. Since these circuits are from different applications, we cannot characterize them by some common features. Consequently, we treat each circuit as an individual subdomain. We further assume that all the circuits are from one subspace, and wish to learn one common HM for all the circuits. Note that parameter P 8 is a random seed, imply- Table 13. Performance of HMs found for CRIS (learned subdomains are marked by ‘‘*’’ and generalized subdomains by ‘‘+’’). The costs and fault coverages of HITEC are from the literature. Costs of our experiments are running times in seconds on a Sun SparcStation 10/512, whereas costs of HITEC are running times in seconds on a Sun SparcStation SLC (around 4-6 times slower than a Sun SparcStation 10/512). Ckt. 
ID *s298 s344 +s382 s386 *s400 s444 *s526 s641 +s713 s820 *s832 s1196 *s1238 s1488 +s1494 s1423 +s5378 am2910 +div16 tc100 Fault Coverage HITEC Learned HMs HIAvg. Max. CRIS Cost Cost TEC FC FC 82.1 93.7 68.6 76.0 84.7 83.7 77.1 85.2 81.7 53.1 42.5 95.0 90.7 91.2 90.1 77.0 65.8 83.0 75.0 70.8 Generalized HM Avg. Max. Avg. FC FC Cost 86.0 15984.0 84.9 86.4 15255.1 84.7 86.4 10.9 95.9 4788.0 96.1 96.2 22305.8 96.1 96.2 21.8 90.9 43200.0 86.9 88.0 12612.3 72.4 87.0 7.2 81.7 61.8 78.9 80.5 6750.6 77.5 78.9 3.5 89.9 43560.0 85.5 88.8 13402.8 71.2 85.7 8.4 87.3 57960.0 85.8 87.1 13863.6 79.8 85.4 9.3 65.7 168480.0 76.8 77.3 17278.2 70.0 77.1 10.0 86.5 1080.0 85.7 86.5 18914.3 85.0 86.1 19.5 81.9 91.2 81.6 81.9 24714.5 81.3 81.9 23.0 95.6 5796.0 47.8 55.2 55948.0 44.7 46.7 51.3 93.9 6336.0 45.6 59.4 56526.2 44.1 45.6 44.6 99.7 91.8 93.9 95.2 22444.4 92.0 94.1 20.0 94.6 132.0 90.6 91.7 26961.4 88.2 89.2 23.0 97.0 12960.0 94.4 96.3 87314.4 94.1 95.2 85.6 96.4 6876.0 93.7 95.4 88649.9 93.2 94.1 85.5 40.0 84.6 89.3 241983.0 82.0 88.3 210.4 70.3 71.5 73.7 560312.0 65.3 69.9 501.8 85.0 85.4 86.5 345253.0 83.7 85.2 307.6 72.0 80.6 82.6 167876.0 79.1 81.0 149.9 80.6 75.8 76.8 244061.0 72.6 75.9 163.8 Table 14. Summary of comparison of our generalized HMs with respect to HITEC and CRIS. (The first number in each entry shows the number of wins out of 20 circuits, and the second, the number of ties.) Our HM wins/ties Generalized HM Learned HM with respect to the following Max. FC Avg. FC Max. FC Avg. FC HITEC CRIS Both HITEC and CRIS 6, 1 16, 1 4, 0 11, 0 7, 2 20, 0 6, 0 15, 0 5, 2 3, 0 7, 2 5, 0 ing that CRIS can be run multiple times using different random seeds in order to obtain better fault coverages. (In our experiments, we used a fixed sequence of ten random seeds.) A major problem in using the original CRIS is that it is hard to find the proper values of the seven parameters (excluding the random seed) for a particular circuit. 
The designer of CRIS manually tuned these parameters for each circuit, resulting in HMs that are hard to generalize. This was done because the designer wanted to obtain the highest possible fault coverage for each circuit, and computation cost was only a secondary consideration. Note that the times for manually tuning these HMs were exceedingly high and were not reported in [24]. Our goal in learning and generalization is to develop one common HM that can be applied across all the benchmark circuits and that has fault coverages similar to or better than those of the original CRIS. The advantage of having one HM is that it can be applied to new circuits without further manual tuning. In our learning experiments, we chose five circuits (subdomains) for learning. Each learning experiment had ten fixed-duration generations with a total of 1,000 applications of CRIS. The learning system started with a pool of 30 active HMs in each generation and retained the top ten HMs at the end of the generation. The minimum-risk sample-allocation strategy was used in learning. The best HMs at the end of learning were selected for final verification. Using the five best HMs from each of the five learning experiments, we evaluated the 25 HMs fully using ten random seeds on the five subdomains used in learning and five new subdomains. We then selected one generalized HM to be used across the ten circuits. The generalized HM we have found is shown in Table 12. Table 13 shows the results after learning and generalization. For each circuit, we present the average and maximum (over ten random seeds) fault coverages and the costs of the learned and generalized HMs. These fault coverages are compared against the published fault coverages of CRIS [24] as well as those of the deterministic HITEC system [21].
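The learning protocol just described (a pool of 30 active HMs per generation, the top ten retained, the remainder regenerated) follows the generate-and-test loop outlined below. The names `evaluate`, `mutate`, and `crossover` are placeholders for the problem-specific test procedure and the paper's genetics-based operators; the loop is a sketch, not TEACHER's actual implementation.

```python
import random

def learn_generation_based(init_pool, evaluate, mutate, crossover,
                           generations=10, pool_size=30, keep=10):
    """Generate-and-test loop: score the pool, retain the best HMs,
    and refill the pool with genetic variations of the survivors."""
    pool = list(init_pool)
    for _ in range(generations):
        ranked = sorted(pool, key=evaluate, reverse=True)
        survivors = ranked[:keep]              # retain top HMs
        pool = list(survivors)
        while len(pool) < pool_size:           # refill with offspring
            a, b = random.sample(survivors, 2)
            pool.append(mutate(crossover(a, b)))
    return sorted(pool, key=evaluate, reverse=True)
```

For the CRIS application, each pool element would be an 8-parameter vector and `evaluate` would run CRIS as a black box over the fixed sequence of random seeds.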
Note that the cost of generating the maximum fault coverage is 10 times the average cost. Table 14 summarizes the improvements of our learned and generalized HMs as compared to the published results of CRIS and HITEC. Each entry of the table shows the number of times our HM wins (and ties) as compared to the method(s) in the first column. Our results show that our generalized HM wins 8 out of 20 circuits as compared to the best fault coverages of HITEC and CRIS. Learning can improve this further, resulting in 10 wins (and 2 ties) out of 20 circuits (albeit at much higher computational costs than those of HITEC and CRIS). Our results are significant in the following respects: (a) new faults detected by our generalized HMs were not discovered by previous methods; (b) only one HM (rather than many circuit-dependent HMs) was found for all the circuits.

6.4. Heuristics for VLSI Placement and Routing

In our last application, we use TimberWolf [27] as our problem solver. This is a software package based on simulated annealing to place and route various components (transistors, resistors, capacitors, wires, etc.) on a piece of silicon. Its goal is to minimize the chip area needed while satisfying constraints such as the number of layers of poly-silicon for routing and the maximum signal delay through any path. Its operation can be divided into three steps: placement, global routing, and detailed routing. The placement and routing problem is NP-hard; hence, heuristics are generally used. Simulated annealing (SA), as used in TimberWolf, is an efficient method for randomly searching the space of possible placements. Although in theory SA converges asymptotically to the global optimum with probability one, the results generated in finite time are usually suboptimal. As a result, there is a trade-off between the quality of a result and the cost (or computational time) of obtaining it. In TimberWolf version 6.0, the version we have studied, there are two parameters to control the running time (which indirectly control the quality of the result): fast-n and slow-n. The larger fast-n is, the shorter SA runs; in contrast, the larger slow-n is, the longer SA runs. Only one of these two parameters can be used at a time. TimberWolf has six major components: cost function, generate function, initial temperature, temperature decrement, equilibrium condition, and stopping criterion. Many parameters in these components have been tuned manually. However, their settings are generally heuristic because we lack the domain knowledge to set them optimally. In Table 15, we list the parameters we have studied.

Table 15. Parameters in TimberWolf (version 6) used in our HM for learning and for generalization.

Param.  Range          Step   Meaning                                                Default   Learned
p1      0.1 - 2.5      0.1    vertical path weight for estimating the cost function  1.0       0.9584
p2      0.1 - 2.5      0.1    vertical wire weight for estimating the cost function  1.0       0.2315
p3      3 - 10         1      orientation ratio                                      6         10
p4      0.33 - 2.0     0.1    range limiter window change ratio                      1.0       1.2987
p5      10.0 - 35.0    1.0    high temperature finishing point                       23.0      10.0416
p6      50.0 - 99.0    1.0    intermediate temperature finishing point               81.0      63.6962
p7      100.0 - 150.0  1.0    low temperature finishing point                        125.0     125.5509
p8      130.0 - 180.0  1.0    final iteration temperature                            155.0     147.9912
p9      0.29 - 0.59    0.01   critical ratio that determines acceptance probability  0.44      0.3325
p10     0.01 - 0.12    0.01   temperature for controller turn off                    0.06      0.1124
p11     integer        1      seed for the random number generator                   -         -
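The temperature-related parameters in Table 15 shape an annealing schedule of the general form sketched below. This is a generic SA skeleton, not TimberWolf's internals; the function names and default values are illustrative. A longer schedule (more moves per temperature, slower cooling) improves expected quality at higher cost, which is the trade-off that fast-n and slow-n expose.

```python
import math
import random

def anneal(cost, neighbor, x0, t0=10.0, t_final=0.1, alpha=0.95,
           moves_per_temp=50):
    """Generic simulated-annealing skeleton: accept downhill moves always,
    uphill moves with Boltzmann probability, and cool geometrically."""
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    t = t0
    while t > t_final:
        for _ in range(moves_per_temp):
            y = neighbor(x)
            fy = cost(y)
            if fy <= fx or random.random() < math.exp((fx - fy) / t):
                x, fx = y, fy
                if fx < fbest:
                    best, fbest = x, fx       # track best placement seen
        t *= alpha                            # temperature decrement
    return best, fbest
```

In TimberWolf's terms, the learned HM tunes the analogous knobs (the temperature finishing points and the acceptance-probability ratio) rather than the move generator itself.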
Our goal is to illustrate the power of our learning and generalization procedures and to show improved quality and reduced cost for the placement and routing of large circuits, despite the fact that only small circuits were used in learning. In our experiments, we used seven benchmark circuits [17] (s298, s420, fract, primary1, struct, primary2, industrial1). We have studied only the application of TimberWolf to standard-cell placement, although other kinds of placement (gate-array placement and macro/custom-cell placement) can be studied in a similar fashion. In our experiments, we used fast-n values of 1, 5, and 10. We first applied TEACHER to learn good HMs for circuits s298 with fast-n of 1, s420 with fast-n of 5, and primary1 with fast-n of 10, each of which was taken as a learning subdomain. We used a fixed sequence of ten random seeds (the random-seed parameter in Table 15) in each subdomain to find the statistical performance of a HM. Each learning experiment involved 1,000 applications of TimberWolf divided into ten generations. Based on the best 30 HMs (10 from each subdomain), we applied our generalization procedure to obtain one generalized HM. This generalized HM, as well as the default HM, is shown in Table 15.

[Figure 9: scatter plot of normalized symmetric quality (y-axis, 0 to 1) versus normalized symmetric cost (x-axis, -0.5 to 3.5), with arrows from the default HM's performance to the generalized HM's performance; the plotted data is not recoverable from this extraction.]

Fig. 9. Comparison of normalized average performance between the default and the generalized HMs. The plots are normalized with respect to the performance of the baseline HM on each circuit using fast-n = 10. (See Eq. (2).)

Fig. 9 plots the quality (a higher value on the y-axis means reduced chip area, averaged over 10 runs using the defined random seeds) and the cost (average execution time of TimberWolf) of the generalized HM and the default HM on all seven circuits with fast-n of 1, 5, and 10.
Note that all performance values in Fig. 9 are normalized with respect to those of fast-n of 10, and that the positive (resp., negative) portion of the x-axis shows the fractional improvement (resp., degradation) in computational cost with respect to the baseline HM using fast-n of 10 for the same circuit. Each arrow in this figure points from the average performance of the default HM to the average performance of the generalized HM. Among the 21 test cases, the generalized HM has worse quality than the default in only two instances (both for circuit fract), and has worse cost in 4 out of 21 cases. We see in Fig. 9 that most of the arrows point in a left-upward direction, implying improved quality and reduced cost. Note that these experiments are meant to illustrate the power of our generalization procedure. We expect to see more improvement as we learn other functions and parameters in TimberWolf.

7. CONCLUSIONS

In this paper, we have studied the automated learning of performance-related heuristics for knowledge-lean applications using genetics-based learning methods. To summarize, we have derived the following results. (a) We have found inconsistencies in the performance evaluation of heuristics due to multiple tests, multiple learning objectives, normalization, and the changing behavior of heuristics across problem subdomains. We have proposed methods to cope with some of these anomalies. (b) We have studied strategies to schedule resources for tests in learning. An improvement over previous strategies is that our strategy is non-parametric and does not rely on the underlying performance distribution of heuristics. We have also proposed a scheduling strategy to cope with one or more learning objectives. Our results show that scheduling is important when tests are expensive and test results are noisy. (c) We have studied methods to find good HMs that can generalize to unlearned domains.
Using a range-independent measure called the probability of win, we can compare and rank heuristics across problem subdomains in a uniform manner. When there are trade-offs between cost and quality, our learning system proposes alternative heuristics that exhibit such trade-offs. (d) We have found better heuristics for process mapping, branch-and-bound search, test-pattern generation in circuit testing, and VLSI cell placement and routing. There are several areas that we plan to study in the future. (a) One of these areas is the identification of problem subdomains for learning and subspaces for generalization. Since such demarcation is generally vague and imprecise, we plan to apply fuzzy sets to help define subdomains and subspaces. Fuzzy logic can also help identify heuristics that can be generalized, especially when there are multiple objectives in the application. (b) We plan to study metrics for performance evaluation besides the average metric studied in this paper. One such metric is the maximum metric, which is useful when a heuristic method can be applied multiple times in order to generate better results at higher cost. This is also related to better generalization procedures that trade improved quality against higher cost. (c) Finally, we plan to carry out learning on more applications. The merits of our system, of course, lie in finding better heuristics for real-world problems, which may involve many conflicting objectives. Our experience in this paper is with an application with two objectives. To cope with applications with many objectives, we need to extend our scheduling and generalization strategies.

ACKNOWLEDGMENTS

The authors are indebted to Prof. C. V. Ramamoorthy, who processed the original submission in 1991 and the revised submission in 1994. The authors would like to thank Dr. Jerry Yan for letting us use his Post-Game Analysis package described in Sec. 6.1. The authors would also like to acknowledge Mr.
Yong-Cheng Li for interfacing TEACHER to TimberWolf and for collecting some preliminary results in Sec. 6.4. The authors are grateful to Mr. Steve Schwartz who co-developed the first version of TEACHER in 1991. 23 IIEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 7, NO 5, OCTOBER 1995 REFERENCES [1] D. H. Ackley, A Connectionist Machine for Genetic Hillclimbing, Kluwer Academic Pub., Boston, MA, 1987. [2] A. K. Aizawa and B. W. Wah, ‘‘Scheduling of Genetic Algorithms in a Noisy Environment,’’ Evolutionary Computation, vol. 2, no. 2, pp. 97-122, MIT Press, 1994. [3] A. N. Aizawa and B. W. Wah, ‘‘A Sequential Sampling Procedure for Genetic Algorithms,’’ Computers and Mathematics with Applications, vol. 27, no. 9/10, pp. 77-82, Pergamon Press, Ltd., Tarrytown, NY, May 1994. [4] R. E. Bechhofer, ‘‘A Single-Sample Multiple Decision Procedure for Ranking Means of Normal Populations with Known Variances,’’ Ann. Math. Statist., vol. 25, no. 1, pp. 16-39, Institute of Mathematical Statistics, Ann Arbor, MI, March 1954. [5] R. E. Bechhofer, A. J. Hayter, and A. C. Tamhane, ‘‘Designing Experiments for Selecting the Largest Normal Mean when the Variances are Known and Unequal: Optimal Sample Size Allocation,’’ J. of Statistical Planning and Inference, vol. 28, pp. 271-289, Elsevier Science Pub., 1991. [6] L. B. Booker, D. E. Goldberg, and J. H. Holland, ‘‘Classifier Systems and Genetic Algorithms,’’ in Machine Learning: Paradigm and Methods, ed. J. Carbonell, MIT press, 1990. [7] L.-C. Chu and B. W. Wah, ‘‘Optimization in Real Time,’’ Proc. Real Time Systems Symp., pp. 150-159, IEEE, Nov. 1991. [8] L.-C. Chu, Algorithms for Combinatorial Optimization in Real Time and their Automated Refinement by Genetic Programming, Ph.D. Thesis, Dept. of Electrical and Computer Engineering, Univ. of Illinois, Urbana, IL, May 1994. [9] J. L. Devore, Probability and Statistics for Engineering and the Sciences, Brooks/Cole Pub. Co., Monterey, CA, 1982. [10] J. M. Fitzpatrick and J. J. 
Grefenstette, ‘‘Genetic Algorithms in Noisy Environments,’’ Machine Learning, vol. 3, no. 2/3, pp. 101-120, Kluwer Academic Pub., Boston, MA, Oct. 1988.
[11] C. M. Fonseca and P. J. Fleming, ‘‘Genetic Algorithms for Multiobjective Optimization: Formulation, Discussion, and Generalization,’’ Proc. of the Fifth Int’l Conf. on Genetic Algorithms, pp. 416-423, Int’l Soc. for Genetic Algorithms, Morgan Kaufman, June 1993.
[12] F. W. Gembicki, Vector Optimization for Control with Performance and Parameter Sensitivity Indices, Ph.D. Thesis, Case Western Reserve University, Cleveland, OH, 1974.
[13] J. J. Grefenstette, C. L. Ramsey, and A. C. Schultz, ‘‘Learning Sequential Decision Rules using Simulation Models and Competition,’’ Machine Learning, vol. 5, pp. 355-381, Kluwer Academic Pub., Boston, MA, 1990.
[14] A. Ieumwananonthachai, A. N. Aizawa, S. R. Schwartz, B. W. Wah, and J. C. Yan, ‘‘Intelligent Mapping of Communicating Processes in Distributed Computing Systems,’’ Proc. Supercomputing 91, pp. 512-521, ACM/IEEE, Albuquerque, NM, Nov. 1991.
[15] A. Ieumwananonthachai, A. Aizawa, S. R. Schwartz, B. W. Wah, and J. C. Yan, ‘‘Intelligent Process Mapping Through Systematic Improvement of Heuristics,’’ J. of Parallel and Distributed Computing, vol. 15, pp. 118-142, Academic Press, June 1992.
[16] J. R. Koza, Genetic Programming, The MIT Press, Cambridge, MA, 1992.
[17] LayoutSynth92, International Workshop on Layout Synthesis, ftp site: mcnc.mcnc.org in directory /pub/benchmark, 1992.
[18] M. B. Lowrie and B. W. Wah, ‘‘Learning Heuristic Functions for Numeric Optimization Problems,’’ Proc. Computer Software and Applications Conf., pp. 443-450, IEEE, Chicago, IL, Oct. 1988.
[19] P. Mehra and B. W. Wah, Load Balancing: An Automated Learning Approach, World Scientific Publishing Co. Pte. Ltd., 1995.
[20] A. Newell, J. C. Shaw, and H. A. Simon, ‘‘Programming the Logic Theory Machine,’’ Proc. 1957 Western Joint Computer Conf., pp. 230-240, IRE, 1957.
[21] T. M. Niermann and J. H. Patel, ‘‘HITEC: A Test Generation Package for Sequential Circuits,’’ European Design Automation Conference, pp. 214-218, 1991.
[22] J. Pearl, Heuristics--Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley, Reading, MA, 1984.
[23] C. L. Ramsey and J. J. Grefenstette, ‘‘Case-Based Initialization of Genetic Algorithms,’’ Proc. of the Fifth Int’l Conf. on Genetic Algorithms, pp. 84-91, Int’l Soc. for Genetic Algorithms, Morgan Kaufman, June 1993.
[24] D. G. Saab, Y. G. Saab, and J. A. Abraham, ‘‘CRIS: A Test Cultivation Program for Sequential VLSI Circuits,’’ Proc. of Int’l Conf. on Computer Aided Design, pp. 216-219, IEEE, Santa Clara, CA, Nov. 8-12, 1992.
[25] S. R. Schwartz and B. W. Wah, ‘‘Automated Parameter Tuning in Stereo Vision Under Time Constraints,’’ Proc. Int’l Conf. on Tools for Artificial Intelligence, pp. 162-169, IEEE, Nov. 1992.
[26] C. Sechen and A. Sangiovanni-Vincentelli, ‘‘The TimberWolf Placement and Routing Package,’’ J. of Solid-State Circuits, vol. 20, no. 2, pp. 510-522, IEEE, 1985.
[27] C. Sechen, VLSI Placement and Global Routing Using Simulated Annealing, Kluwer Academic Publishers, Boston, MA, 1988.
[28] R. S. Sutton, Temporal Credit Assignment in Reinforcement Learning, Ph.D. Thesis, Univ. of Massachusetts, Amherst, MA, Feb. 1984.
[29] C.-C. Teng and B. W. Wah, ‘‘An Automated Design System for Finding the Minimal Configuration of a Feed-Forward Neural Network,’’ Int’l Conf. on Neural Networks, vol. 3, pp. 1295-1300, IEEE, June 1994.
[30] Y. L. Tong and D. E. Wetzell, ‘‘Allocation of Observations for Selecting the Best Normal Population,’’ in Design of Experiments: Ranking and Selection, ed. T. J. Santner and A. C. Tamhane, pp. 213-224, Marcel Dekker, New York, NY, 1984.
[31] B. W. Wah, ‘‘Population-Based Learning: A New Method for Learning from Examples under Resource Constraints,’’ Trans. on Knowledge and Data Engineering, vol. 4, no. 5, pp. 454-474, IEEE, Oct. 1992.
[32] J. C. Yan and S. F.
Lundstrom, ‘‘The Post-Game Analysis Framework--Developing Resource Management Strategies for Concurrent Systems,’’ Trans. on Knowledge and Data Engineering, vol. 1, no. 3, pp. 293-309, IEEE, Sept. 1989.
[33] C. F. Yu and B. W. Wah, ‘‘Learning Dominance Relations in Combinatorial Search Problems,’’ Trans. on Software Engineering, vol. SE-14, no. 8, pp. 1155-1175, IEEE, Aug. 1988.

Benjamin W. Wah received his Ph.D. degree in computer science from the University of California, Berkeley, CA, in 1979. He is currently a Professor in the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory of the University of Illinois at Urbana-Champaign, Urbana, IL. Previously, he served on the faculty of Purdue University (1979-85), as a Program Director at the National Science Foundation (1988-89), as Fujitsu Visiting Chair Professor of Intelligence Engineering at the University of Tokyo (1992), and as McKay Visiting Professor of Electrical Engineering and Computer Science at the University of California, Berkeley (1994). In 1989, he was named a University Scholar of the University of Illinois. His current research interests are in the areas of parallel and distributed processing, data and knowledge base management, artificial intelligence, and nonlinear optimization. Dr. Wah is the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering, and serves on the editorial boards of the Journal of Parallel and Distributed Computing, Information Sciences, the International Journal on Artificial Intelligence Tools, and the Journal of VLSI Signal Processing. He has chaired a number of conferences and will chair the 1996 International Conference on Neural Networks. He has served in the IEEE Computer Society as a member of its Governing Board, and currently serves on the IEEE-CS Publications Board, Press Activities Board, and Fellows Committee, and as a society representative to the IEEE Neural Network Council.

Arthur Ieumwananonthachai received his B.S.
degree in Electrical Engineering and Computer Science from the University of Washington, Seattle, WA, in 1986, and his M.S. degree in Computer Science from the University of California, Los Angeles, in 1988. Since then, he has been working toward his Ph.D. degree in Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign under the supervision of Prof. Benjamin Wah. His research interests include computer networks, distributed systems, and machine learning.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 7, NO. 5, OCTOBER 1995

Lon-Chan Chu received his B.S. degree in Electrical Engineering from the National Taiwan University in 1985, and his M.S. and Ph.D. degrees in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 1991 and 1994, respectively. He joined Microsoft Corporation in 1993. His research interests include real-time intelligent systems and real-time embedded systems.

Akiko N. Aizawa received her B.S. degree in 1985, M.S. degree in 1987, and Ph.D. degree in 1990, all in Electrical Engineering from the University of Tokyo. Since 1990, she has been with NACSIS (National Center for Science Information System), Japan. During 1990-92, she was a visiting researcher at the University of Illinois at Urbana-Champaign. Her research interests include stochastic optimization methods, knowledge engineering, networked information, and communication protocols.