Energy efficiency has become increasingly important in high performance computing (HPC) as power constraints and costs escalate. Workload and system characteristics form a complex optimization search space in which the optimal settings for energy efficiency and for performance often diverge, so we must identify trade-off options between the two to find the desired balance. We present a statistical model that accurately predicts the Pareto-optimal performance and energy-efficiency trade-off options using only user-controllable parameters, and that tolerates both measurement and model errors. We study model training and validation on several HPC kernels, then explore the feasibility of applying the model to more complex workloads, including AMG and LAMMPS. We can calibrate an accurate model from as few as 12 runs, with prediction error under 10%. Our results identify trade-off options allowing up to 40% improvement in energy efficiency at the cost of less than 20% performance loss. For AMG, we reduce the required sample-measurement time from 13 hours to 74 minutes, a roughly 90% reduction.
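The paper's model itself is not reproduced here, but the Pareto-optimal trade-off front that its predictions target is easy to illustrate. A minimal sketch, assuming each candidate configuration has already been measured (or predicted) as a (runtime, energy) pair; the function name and data are illustrative, not from the paper:

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point (t, e) is dominated if some other point has both
    lower-or-equal runtime t and lower-or-equal energy e, with at
    least one strictly lower.
    """
    front = []
    for i, (t, e) in enumerate(points):
        dominated = any(
            (t2 <= t and e2 <= e) and (t2 < t or e2 < e)
            for j, (t2, e2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((t, e))
    return sorted(front)

# Hypothetical (runtime_seconds, energy_joules) measurements for a
# handful of user-controllable configurations (e.g. core count, CPU
# frequency). Values are invented for illustration.
measurements = [(100, 500), (120, 400), (90, 650), (130, 390), (95, 640)]
print(pareto_front(measurements))  # the trade-off options a user picks from
```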
The growth in compute power and complexity of supercomputing systems has driven decreases in the feature size and supply voltage of internal components. This development makes unintended errors, such as soft errors potentially caused by random bit flips, inevitable given the sheer number of resources involved (CPU cores, memory, and so on). In this paper, we discuss a non-parametric statistical modelling technique for implementing a soft error detector. By exploiting temporal autocorrelation within key variables of a running scientific simulation, we introduce an automatic anomaly detection technique: runtime data from a time-step-based simulation is converted into a time series, and a time-series modelling technique identifies soft errors at runtime. Experiments with LAMMPS, a high-performance molecular dynamics simulator, and with PLUTO, an open-source astrophysical code, show that the time-series-based detector keeps both the false-positive and false-negative rates below 3% while incurring only 6% performance overhead.
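The abstract does not give the specific non-parametric model, so the sketch below illustrates only the underlying idea: treat a per-time-step simulation variable as a time series and flag steps that deviate sharply from a short-history autoregressive prediction. The window size, threshold, and injected fault are all assumptions for illustration:

```python
import numpy as np

def detect_anomalies(series, window=20, k=4.0):
    """Flag time steps whose value deviates strongly from a one-step
    prediction built from the recent history of the series.

    Uses a simple AR(1)-style predictor fitted over a sliding window;
    a step is anomalous if its residual exceeds k standard deviations
    of the window's residuals.
    """
    series = np.asarray(series, dtype=float)
    flags = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        # Least-squares fit of x[i+1] ~ a * x[i] + b over the window.
        a, b = np.polyfit(hist[:-1], hist[1:], 1)
        pred = a * series[t - 1] + b
        resid = hist[1:] - (a * hist[:-1] + b)
        if abs(series[t] - pred) > k * (resid.std() + 1e-12):
            flags.append(t)
    return flags

# Hypothetical per-time-step total energy from a simulation, with a
# bit-flip-like spike injected at step 150.
rng = np.random.default_rng(0)
energy = np.cumsum(rng.normal(0, 0.01, 300)) + 100.0
energy[150] += 5.0
print(detect_anomalies(energy))  # expect step 150 (and possibly 151)
```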
2016 5th Workshop on Extreme-Scale Programming Tools (ESPT), 2016
Relative debugging helps trace software errors by comparing two concurrent executions of a program, one a reference version and the other a suspect, possibly faulty, version. By locating data divergence between the runs, relative debugging is effective at finding coding errors when a program is scaled up to solve larger problems or migrated from one platform to another. In this work, we envision changes to our current relative debugging scheme to address exascale factors such as increased fault rates and nondeterministic outputs. First, we propose a statistics-based comparison scheme to support verifying results that are stochastic. Second, we leverage a scalable data reduction network to adapt to the complex network hierarchy of an exascale system, and we extend our debugger to support the statistics-based comparison in an environment subject to failures.
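The statistics-based comparison scheme is not spelled out in the abstract; the sketch below shows one plausible shape for such a check, accepting two sets of stochastic runs as equivalent when their means agree within a few standard errors, rather than requiring bit-for-bit equality. The test form and tolerance are assumptions, not the authors' scheme:

```python
import numpy as np

def statistically_equivalent(ref, suspect, k=3.0):
    """Compare two sets of runs of the same output variable
    statistically rather than bit-for-bit: accept if the means differ
    by less than k standard errors of the difference (a z-style test).
    """
    ref = np.asarray(ref, dtype=float)
    suspect = np.asarray(suspect, dtype=float)
    se = np.sqrt(ref.var(ddof=1) / len(ref) +
                 suspect.var(ddof=1) / len(suspect))
    return abs(ref.mean() - suspect.mean()) <= k * (se + 1e-30)

# Hypothetical final residuals from repeated runs of a reference code
# and a ported "suspect" code whose outputs are legitimately noisy.
rng = np.random.default_rng(1)
reference = rng.normal(1.00, 0.05, 30)
ported_ok = rng.normal(1.00, 0.05, 30)
ported_bad = rng.normal(1.30, 0.05, 30)
print(statistically_equivalent(reference, ported_ok))   # True expected
print(statistically_equivalent(reference, ported_bad))  # False expected
```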
2017 IEEE 13th International Conference on e-Science (e-Science), 2017
Coral reefs are of global economic and biological significance but are subject to increasing threats. It is therefore essential to understand the risk of coral reef ecosystem collapse and to develop assessment processes for these ecosystems. The International Union for Conservation of Nature (IUCN) Red List of Ecosystems (RLE) is a framework for assessing the vulnerability of an ecosystem. Importantly, the assessment process needs to be repeatable as new monitoring data arrive; repeatability also enhances transparency. In this paper, we discuss the evolution of a computational pipeline for risk assessment of the Meso-American reef, a diverse reef ecosystem located in the Caribbean, focusing on reducing execution time as the pipeline progressed from a sequential implementation to a parallel one and finally to Apache Spark. The final form of the pipeline is a scientific workflow, which improves its repeatability and reproducibility.
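The pipeline's actual stages are not listed in the abstract, but the restructuring it describes, turning an independent per-site computation into a distributed Spark job, has a standard shape. A minimal PySpark sketch; the site data and indicator function are invented stand-ins for one unit of the real pipeline's work:

```python
from pyspark.sql import SparkSession

# Hypothetical per-reef-site assessment: compute_indicator is a
# stand-in for one independent, per-site unit of the pipeline's work.
def compute_indicator(site):
    site_id, coral_cover_series = site
    decline = coral_cover_series[0] - coral_cover_series[-1]
    return site_id, decline / max(coral_cover_series[0], 1e-9)

spark = SparkSession.builder.appName("reef-risk-sketch").getOrCreate()

# Invented monitoring data: a short coral-cover time series per site.
sites = [("site-%03d" % i, [50.0 - 0.5 * (i % 10) * t for t in range(10)])
         for i in range(1000)]

# The embarrassingly parallel step: distribute sites across executors,
# exactly the structure that lets a sequential loop become a Spark job.
results = (spark.sparkContext.parallelize(sites)
                .map(compute_indicator)
                .collect())
print(results[:3])
spark.stop()
```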
Proceedings of the Royal Society B: Biological Sciences, 2017
Effective ecosystem risk assessment relies on a conceptual understanding of ecosystem dynamics and the synthesis of multiple lines of evidence. Risk assessment protocols and ecosystem models integrate limited observational data with threat scenarios, making them valuable tools for monitoring ecosystem status and diagnosing key mechanisms of decline to be addressed by management. We applied the IUCN Red List of Ecosystems criteria to quantify the risk of collapse of the Meso-American Reef, a unique ecosystem containing the second longest barrier reef in the world. We collated a wide array of empirical data (field and remotely sensed), and used a stochastic ecosystem model to backcast past ecosystem dynamics, as well as forecast future ecosystem dynamics under 11 scenarios of threat. The ecosystem is at high risk from mass bleaching in the coming decades, with compounding effects of ocean acidification, hurricanes, pollution and fishing. The overall status of the ecosystem is Critical...
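The stochastic ecosystem model behind the backcasts and forecasts is not reproduced in the abstract. The sketch below shows only the general Monte Carlo shape of such a forecast: many simulated trajectories of coral cover under a threat scenario, from which a collapse probability is read off. Every rate and threshold here is invented for illustration:

```python
import numpy as np

def forecast_cover(initial_cover, years, bleaching_prob, growth=0.02,
                   bleaching_loss=0.4, n_runs=1000, seed=0):
    """Monte Carlo forecast of coral cover: each year the reef grows a
    little, and with some probability a mass-bleaching event removes a
    fixed fraction of cover. Returns the per-run final covers (%).
    """
    rng = np.random.default_rng(seed)
    cover = np.full(n_runs, initial_cover, dtype=float)
    for _ in range(years):
        cover *= 1.0 + growth
        bleached = rng.random(n_runs) < bleaching_prob
        cover[bleached] *= 1.0 - bleaching_loss
        np.clip(cover, 0.0, 100.0, out=cover)
    return cover

# Two hypothetical threat scenarios; all rates are invented.
mild = forecast_cover(30.0, years=50, bleaching_prob=0.05)
severe = forecast_cover(30.0, years=50, bleaching_prob=0.30)
# Probability of falling below a 5% "collapse" threshold per scenario.
print((mild < 5.0).mean(), (severe < 5.0).mean())
```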
2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012), 2012
Parallel debugging faces challenges in both scalability and efficiency. A number of advanced methods have been invented to improve the efficiency of parallel debugging, but as system scale increases, these methods rely heavily on a scalable communication protocol to remain usable in large-scale distributed environments. This paper describes a debugging middleware that provides fundamental debugging functions supporting...
2009 Fifth IEEE International Conference on e-Science, 2009
Virtual Microscopy and Analysis using Scientific Workflows. David Abramson, Blair Bethwaite, Minh Ngoc Dinh, Colin Enticott, Stephen Firth, Slavisa Garic, Ian Harper, Martin Lackmann, Hoang Nguyen, ...
Traditional debuggers are of limited value for modern scientific codes that manipulate large, complex data structures. This paper discusses a novel debug-time assertion, called a "Statistical Assertion", that allows a user to reason about large data structures; its primitives are parallelised to provide an efficient solution. We present the design and implementation of statistical assertions and illustrate the debugging technique with a molecular dynamics simulation. We evaluate the performance of the tool on a 12,000-core Cray XE6.
Traditional debuggers are of limited value for modern scientific codes that manipulate large, complex data structures. Current parallel machines make this even more complicated, because a data structure may be distributed across processors, making it difficult to view, interpret, and validate its contents. As a result, many application developers resort to placing validation code directly in the source program. This paper discusses a novel debug-time assertion, called a "Statistical Assertion", that uses extracted statistics instead of raw data to reason about large data structures, thereby helping to locate coding defects. We present the design and implementation of an extensible statistical framework that executes assertions in parallel by exploiting the underlying parallel system. We illustrate the debugging technique with a molecular dynamics simulation, and evaluate performance on a 20,000-processor Cray XE6 to show that the approach is useful for real-time debugging.
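The framework itself is not shown in the abstract; the sketch below illustrates only the core idea, asserting on a parallel-reduced statistic of a distributed array rather than on its raw elements. mpi4py is used as a stand-in runtime (not the paper's framework), and the field, expected value, and tolerance are invented:

```python
from mpi4py import MPI
import numpy as np

def assert_global_mean(local_data, expected, tol, comm=MPI.COMM_WORLD):
    """A debug-time 'statistical assertion': reduce per-rank partial
    sums to a global mean, then check it against an expected value,
    instead of inspecting the raw distributed array element by element.
    """
    local_sum = np.array([local_data.sum(), local_data.size], dtype="d")
    global_sum = np.zeros(2, dtype="d")
    comm.Allreduce(local_sum, global_sum, op=MPI.SUM)
    mean = global_sum[0] / global_sum[1]
    assert abs(mean - expected) <= tol, f"global mean {mean} off by > {tol}"
    return mean

# Each rank holds one slice of a distributed temperature field
# (values invented for illustration). Run under mpiexec.
comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())
local_field = rng.normal(300.0, 1.0, 100_000)
print(assert_global_mean(local_field, expected=300.0, tol=0.1))
```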
IEEE Transactions on Parallel and Distributed Systems, 2014
Detecting and isolating bugs that arise only at high processor counts is a challenging task. Over a number of years, we have implemented a special debugging method, called "relative debugging", that supports debugging applications as they evolve or are ported to larger machines. It allows a user to compare the state of a suspect program against a reference version even as the number of processors is increased; the innovative idea is to compare runtime data in order to reason about the state of the suspect program. Whilst powerful, a naïve implementation of the comparison phase does not scale to large problems running on large machines. In this paper, we propose two solutions: a hash-based scheme and a direct point-to-point scheme. We demonstrate the implementation, a case study, and the performance of our techniques on 20K cores of a Cray XE6 system.
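The abstract does not detail the hash-based scheme; the sketch below shows the generic shape of such a comparison, in which each run hashes contiguous blocks of its state so that only small digests, rather than raw arrays, need to be exchanged to locate divergent regions. The block count and data are illustrative:

```python
import hashlib
import numpy as np

def block_digests(array, n_blocks):
    """Hash contiguous blocks of an array; comparing digests lets two
    runs detect which regions diverge while exchanging only a few
    bytes per block instead of the raw data.
    """
    blocks = np.array_split(array, n_blocks)
    return [hashlib.sha256(b.tobytes()).hexdigest() for b in blocks]

# Hypothetical state arrays from a reference run and a suspect run.
reference = np.arange(1_000_000, dtype="d")
suspect = reference.copy()
suspect[654_321] += 1e-9        # a tiny divergence somewhere

ref_d = block_digests(reference, 64)
sus_d = block_digests(suspect, 64)
diverged = [i for i, (a, b) in enumerate(zip(ref_d, sus_d)) if a != b]
print(diverged)  # only these blocks need a direct element-wise compare
```

Note that hashing compares bit-for-bit, so legitimate floating-point noise between runs would also flag blocks; handling that case is exactly what tolerance- or statistics-based comparison schemes are for.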
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010
Data Centric Highly Parallel Debugging. David Abramson, Minh Ngoc Dinh, Donny Kurniawan, Bob Moench, Luiz DeRose. Faculty of Information Technology, Monash University, Clayton 3800, Victoria, Australia; Cray Inc, Cray Plaza, 380 Jackson St, Suite 210, St. ...