Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004), 2004
Performing inlining of routines across file boundaries is known to yield significant run-time performance improvements. In this paper, we present a scalable cross-module inlining framework that reduces the compiler's memory footprint, file thrashing, and overall compile-time. Instead of using ...
2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), 2013
Due to the complexity and the massive scale of modern warehouse-scale computers (WSCs), it is challenging to quantify the performance impact of individual microarchitectural properties and the potential optimization benefits in the production environment. As a result of these challenges, there is currently a lack of understanding of the microarchitecture-workload interaction, leaving potentially significant performance on the table. This paper argues for a two-phase performance analysis methodology for ...
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2013
JavaScript is the dominant language for implementing dynamic web pages in browsers. Even though it is standardized, many browsers implement language and browser bindings in different and incompatible ways. As a result, a plethora of web development frameworks were developed to hide cross-browser issues and to ease development of large web applications. An unwelcome side-effect of these frameworks is that they can introduce memory leaks, despite the fact that JavaScript is garbage collected. Memory bloat is a major issue for web applications, as it affects user-perceived latency and may even prevent large web applications from running on devices with limited resources. In this paper we present JSWhiz, an extension to the open-source Closure JavaScript compiler. Based on experiences analyzing memory leaks in Gmail, JSWhiz detects five identified common problem patterns. JSWhiz found a total of 89 memory leaks across Google's Gmail, Docs, Spreadsheets, Books, and Closure itself. It contributed significantly in a recent effort to reduce Gmail memory footprint, which resulted in bloat reduction of 75% at the 99th percentile, and by roughly 50% at the median.
2009 International Symposium on Code Generation and Optimization, 2009
Online optimization allows the continuous re-structuring and adaptation of an executing application using live information about its execution environment. The further advancement of performance monitoring hardware presents new opportunities for online optimization ...
International Symposium on Code Generation and Optimization (CGO'07), 2007
Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multithreaded applications that optimizes both
Symposium on Code Generation and Optimization, 2006
With the delta between processor clock frequency and memory latency ever increasing and with the standard locality improving transformations maturing, compilers increasingly seek to modify an application's data layout to improve spatial and temporal locality and to reduce cache miss and page fault penalties. In this paper, we describe a practical implementation of the data layout optimizations structure splitting, structure
Feedback-directed optimization (FDO) is effective in improving application performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulties in generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated execution profiles overcomes these drawbacks. Yet, hardware event samples are typically not precise at the instruction or basic-block granularity. These inaccuracies lead to missed ...
Precisely predicting performance degradation due to colocating multiple executing applications on a single machine is critical for improving utilization in modern warehouse-scale computers (WSCs). Bubble-Up is the first mechanism for such precise prediction. As opposed to over-provisioning machines, Bubble-Up enables the safe colocation of multiple workloads on a single machine for Web service applications that have quality of service constraints,
Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application-platform affinity measurements and identification of platform-specific, microarchitectural peculiarities.
Concurrency bugs, particularly data races, are notoriously difficult to debug and are a significant source of unreliability in multithreaded applications. Many tools to catch data races rely on program instrumentation to obtain memory instruction traces. Unfortunately, this instrumentation introduces significant runtime overhead, is extremely invasive, or has a limited domain of applicability, making these tools unsuitable for many production systems. Consequently, these tools are typically used during application testing where many data ...
In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and
Papers by Robert Hundt