Program autotuning has been shown to achieve better or more portable performance in a number of d... more Program autotuning has been shown to achieve better or more portable performance in a number of domains. How-ever, autotuners themselves are rarely portable between projects, for a number of reasons: using a domain-informed search space representation is critical to achieving good re-sults; search spaces can be intractably large and require ad-vanced machine learning techniques; and the landscape of search spaces can vary greatly between different problems, sometimes requiring domain specific search techniques to explore efficiently. This paper introduces OpenTuner, a new open source framework for building domain-specific multi-objective pro-gram autotuners. OpenTuner supports fully-customizable configuration representations, an extensible technique rep-resentation to allow for domain-specific techniques, and an easy to use interface for communicating with the program to be autotuned. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneo...
Abstract—Approximating ideal program outputs is a common technique for solving computationally di... more Abstract—Approximating ideal program outputs is a common technique for solving computationally difficult problems, for adhering to processing or timing constraints, and for performance optimization in situations where perfect precision is not necessary. To this end, programmers often use approximation algorithms, iterative methods, data resampling, and other heuristics. However, programming such variable accuracy algorithms presents difficult challenges since the optimal algorithms and parameters may change with different accuracy requirements and usage environments. This problem is further compounded when multiple variable accuracy algorithms are nested together due to the complex way that accuracy requirements can propagate across algorithms and because of the size of the set of allowable compositions. As a result, programmers often deal with this issue in an ad-hoc manner that can sometimes violate sound programming practices such as maintaining library abstractions. In this pape...
Approximating ideal program outputs is a common technique for solving computationally difficult p... more Approximating ideal program outputs is a common technique for solving computationally difficult problems, for adhering to processing or timing constraints, and for performance optimization in situations where perfect precision is not necessary. To this end, programmers often use approximation algorithms, iterative methods, data resampling, and other heuristics. However, programming such variable accuracy algorithms presents difficult challenges since the optimal algorithms and parameters may change with different accuracy requirements and usage environments. This problem is further compounded when multiple variable accuracy algorithms are nested together due to the complex way that accuracy requirements can propagate across algorithms and because of the resulting size of the set of allowable compositions.
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package... more DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphor...
Algorithmic choice is essential in any problem domain to realizing optimal computational performa... more Algorithmic choice is essential in any problem domain to realizing optimal computational performance. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels to take advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start all over when migrating from one machine to another. We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom of independently tuning both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms that have different accuracies and performances. To search this sp...
With the dawn of the multi-core era, programmers are being challenged to write code that performs... more With the dawn of the multi-core era, programmers are being challenged to write code that performs well on an increasingly diverse array of architectures. A single program or library may be used on systems ranging in power from large servers with dozens or hundreds of cores to small single-core netbooks or mobile phones. A program may need to run efficiently both on architectures with many simple cores and on those with fewer monolithic cores. Some of the systems a program encounters might have GPU coprocessors, while others might not. Looking forward, processor designs such as asymmetric multi-core [3], with different types of cores on a single chip, will present an even greater challenge for programmers to utilize effectively. Programmers often find they must make algorithmic changes to their program in order to get performance when moving between these different types
Prediction intervals are a valuable way of quantifying uncertainty in regression problems. Good p... more Prediction intervals are a valuable way of quantifying uncertainty in regression problems. Good prediction intervals should be both correct, containing the actual value between the lower and upper bound at least a target percentage of the time; and tight, having a small mean width of the bounds. Many prior techniques for generating prediction intervals make assumptions on the distribution of error, which causes them to work poorly for problems with asymmetric distributions. This paper presents Expanded Interval Minimization (EIM), a novel loss function for generating prediction intervals using neural networks. This loss function uses minibatch statistics to estimate the coverage and optimize the width of the prediction intervals. It does not make the same assumptions on the distributions of data and error as prior work. We compare to three published techniques and show EIM produces on average 1.37x tighter prediction intervals and in the worst case 1.06x tighter intervals across two...
We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The... more We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically created POSIX threads and forked child processes. It also includes checkpointing of remotel y spawned processes via ssh and other mechanisms. As with all user-level checkpointing, no modification of the kernel is needed, and the application code is not modified. The package also checkpoints signal handlers, ordinary file descriptors, socket descriptors, and c ertain other types of file descriptors. Each checkpointed process has an associated checkpoint file . Hence, process migration, and even migration of an entire computation to a new cluster, are achieved through the simple expedient of copying checkpoint files to a new host. However, process migration adds the additional restriction that the source and destination hos t must be homogeneous.
The process of optimizing programs and libraries, both for performance and quality of service, ca... more The process of optimizing programs and libraries, both for performance and quality of service, can be viewed as a search problem over the space of implementation choices. This search is traditionally manually conducted by the programmer and often must be repeated when systems, tools, or requirements change. The overriding goal of this work is to automate this search so that programs can change themselves and adapt to achieve performance portability across different environments and requirements. To achieve this, first, this work presents the PetaBricks programming language which focuses on ways for expressing program implementation search spaces at the language level. Second, this work presents OpenTuner which provides sophisticated techniques for searching these search spaces in a way that can easily be adopted by other projects. PetaBricks is a implicitly parallel language and compiler where having multiple implementations of multiple algorithms to solve a problem is the natural w...
Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increas... more Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can serve as a common benchmark for comparing techniques, and as a platform to accelerate progress in the field. We introduce CompilerGym, a set of environments for real world compiler optimization tasks, and a toolkit for exposing new optimization tasks to compiler researchers. CompilerGym enables anyone to experiment on production compiler optimization problems through an easy-to-use package, regardless of their experience with compilers. We build upon the popular OpenAI Gym interface enabling res...
A daunting challenge faced by program performance autotuning is input sensitivity, where the best... more A daunting challenge faced by program performance autotuning is input sensitivity, where the best autotuned configuration may vary with different input sets. This paper presents a novel two-level input learning algorithm to tackle the challenge for an important class of autotuning problems, algorithmic autotuning. The new approach uses a two-level input clustering method to automatically refine input grouping, feature selection, and classifier construction. Its design solves a series of open issues that are particularly essential to algorithmic autotuning, including the enormous optimization space, complex influence by deep input features, high cost in feature extraction, and variable accuracy of algorithmic choices. Experimental results show that the new solution yields up to a 3x speedup over using a single configuration for all inputs, and a 34x speedup over a traditional one-level method for addressing input sensitivity in program optimizations.
Program autotuning has been shown to achieve better or more portable performance in a number of d... more Program autotuning has been shown to achieve better or more portable performance in a number of domains. How-ever, autotuners themselves are rarely portable between projects, for a number of reasons: using a domain-informed search space representation is critical to achieving good re-sults; search spaces can be intractably large and require ad-vanced machine learning techniques; and the landscape of search spaces can vary greatly between different problems, sometimes requiring domain specific search techniques to explore efficiently. This paper introduces OpenTuner, a new open source framework for building domain-specific multi-objective pro-gram autotuners. OpenTuner supports fully-customizable configuration representations, an extensible technique rep-resentation to allow for domain-specific techniques, and an easy to use interface for communicating with the program to be autotuned. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneo...
Abstract—Approximating ideal program outputs is a common technique for solving computationally di... more Abstract—Approximating ideal program outputs is a common technique for solving computationally difficult problems, for adhering to processing or timing constraints, and for performance optimization in situations where perfect precision is not necessary. To this end, programmers often use approximation algorithms, iterative methods, data resampling, and other heuristics. However, programming such variable accuracy algorithms presents difficult challenges since the optimal algorithms and parameters may change with different accuracy requirements and usage environments. This problem is further compounded when multiple variable accuracy algorithms are nested together due to the complex way that accuracy requirements can propagate across algorithms and because of the size of the set of allowable compositions. As a result, programmers often deal with this issue in an ad-hoc manner that can sometimes violate sound programming practices such as maintaining library abstractions. In this pape...
Approximating ideal program outputs is a common technique for solving computationally difficult p... more Approximating ideal program outputs is a common technique for solving computationally difficult problems, for adhering to processing or timing constraints, and for performance optimization in situations where perfect precision is not necessary. To this end, programmers often use approximation algorithms, iterative methods, data resampling, and other heuristics. However, programming such variable accuracy algorithms presents difficult challenges since the optimal algorithms and parameters may change with different accuracy requirements and usage environments. This problem is further compounded when multiple variable accuracy algorithms are nested together due to the complex way that accuracy requirements can propagate across algorithms and because of the resulting size of the set of allowable compositions.
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package... more DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphor...
Algorithmic choice is essential in any problem domain to realizing optimal computational performa... more Algorithmic choice is essential in any problem domain to realizing optimal computational performance. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels to take advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start all over when migrating from one machine to another. We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom of independently tuning both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms that have different accuracies and performances. To search this sp...
With the dawn of the multi-core era, programmers are being challenged to write code that performs... more With the dawn of the multi-core era, programmers are being challenged to write code that performs well on an increasingly diverse array of architectures. A single program or library may be used on systems ranging in power from large servers with dozens or hundreds of cores to small single-core netbooks or mobile phones. A program may need to run efficiently both on architectures with many simple cores and on those with fewer monolithic cores. Some of the systems a program encounters might have GPU coprocessors, while others might not. Looking forward, processor designs such as asymmetric multi-core [3], with different types of cores on a single chip, will present an even greater challenge for programmers to utilize effectively. Programmers often find they must make algorithmic changes to their program in order to get performance when moving between these different types
Prediction intervals are a valuable way of quantifying uncertainty in regression problems. Good p... more Prediction intervals are a valuable way of quantifying uncertainty in regression problems. Good prediction intervals should be both correct, containing the actual value between the lower and upper bound at least a target percentage of the time; and tight, having a small mean width of the bounds. Many prior techniques for generating prediction intervals make assumptions on the distribution of error, which causes them to work poorly for problems with asymmetric distributions. This paper presents Expanded Interval Minimization (EIM), a novel loss function for generating prediction intervals using neural networks. This loss function uses minibatch statistics to estimate the coverage and optimize the width of the prediction intervals. It does not make the same assumptions on the distributions of data and error as prior work. We compare to three published techniques and show EIM produces on average 1.37x tighter prediction intervals and in the worst case 1.06x tighter intervals across two...
We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The... more We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically created POSIX threads and forked child processes. It also includes checkpointing of remotel y spawned processes via ssh and other mechanisms. As with all user-level checkpointing, no modification of the kernel is needed, and the application code is not modified. The package also checkpoints signal handlers, ordinary file descriptors, socket descriptors, and c ertain other types of file descriptors. Each checkpointed process has an associated checkpoint file . Hence, process migration, and even migration of an entire computation to a new cluster, are achieved through the simple expedient of copying checkpoint files to a new host. However, process migration adds the additional restriction that the source and destination hos t must be homogeneous.
The process of optimizing programs and libraries, both for performance and quality of service, ca... more The process of optimizing programs and libraries, both for performance and quality of service, can be viewed as a search problem over the space of implementation choices. This search is traditionally manually conducted by the programmer and often must be repeated when systems, tools, or requirements change. The overriding goal of this work is to automate this search so that programs can change themselves and adapt to achieve performance portability across different environments and requirements. To achieve this, first, this work presents the PetaBricks programming language which focuses on ways for expressing program implementation search spaces at the language level. Second, this work presents OpenTuner which provides sophisticated techniques for searching these search spaces in a way that can easily be adopted by other projects. PetaBricks is a implicitly parallel language and compiler where having multiple implementations of multiple algorithms to solve a problem is the natural w...
Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increas... more Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can serve as a common benchmark for comparing techniques, and as a platform to accelerate progress in the field. We introduce CompilerGym, a set of environments for real world compiler optimization tasks, and a toolkit for exposing new optimization tasks to compiler researchers. CompilerGym enables anyone to experiment on production compiler optimization problems through an easy-to-use package, regardless of their experience with compilers. We build upon the popular OpenAI Gym interface enabling res...
A daunting challenge faced by program performance autotuning is input sensitivity, where the best... more A daunting challenge faced by program performance autotuning is input sensitivity, where the best autotuned configuration may vary with different input sets. This paper presents a novel two-level input learning algorithm to tackle the challenge for an important class of autotuning problems, algorithmic autotuning. The new approach uses a two-level input clustering method to automatically refine input grouping, feature selection, and classifier construction. Its design solves a series of open issues that are particularly essential to algorithmic autotuning, including the enormous optimization space, complex influence by deep input features, high cost in feature extraction, and variable accuracy of algorithmic choices. Experimental results show that the new solution yields up to a 3x speedup over using a single configuration for all inputs, and a 34x speedup over a traditional one-level method for addressing input sensitivity in program optimizations.
Uploads
Papers by Jason Ansel