  • Beck T, Baroni A, Bennink R, Buchs G, Pérez E, Eisenbach M, da Silva R, Meena M, Gottiparthi K, Groszkowski P, Humble T, Landfield R, Maheshwari K, Oral S, Sandoval M, Shehata A, Suh I and Zimmer C. (2024). Integrating quantum computing resources into scientific HPC ecosystems. Future Generation Computer Systems. 10.1016/j.future.2024.06.058. 161. (11-25). Online publication date: 1-Dec-2024.

    https://linkinghub.elsevier.com/retrieve/pii/S0167739X24003583

  • Lungu N, Al Rababah A, Dash B, Syed A, Barik L, Rout S, Tembo S, Lubobya C and Patra S. (2024). NIST CSF-2.0 Compliant GPU Shader Execution. Engineering, Technology & Applied Science Research. 10.48084/etasr.7351. 14:4. (15187-15193).

    https://etasr.com/index.php/ETASR/article/view/7351

  • Gouk D, Kang S, Bae H, Ryu E, Lee S, Kim D, Jang J and Jung M. Breaking Barriers: Expanding GPU Memory with Sub-Two Digit Nanosecond Latency CXL Controller. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems. (108-115).

    https://doi.org/10.1145/3655038.3665953

  • Trochatos T, Etim A and Szefer J. (2024). Covert-channels in FPGA-enabled SmartSSDs. ACM Transactions on Reconfigurable Technology and Systems. 17:2. (1-23). Online publication date: 30-Jun-2024.

    https://doi.org/10.1145/3635312

  • Feng Y, Na S, Kim H and Jeon H. (2024). Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 10.1109/ISCA59077.2024.00065. 979-8-3503-2658-1. (834-847).

    https://ieeexplore.ieee.org/document/10609639/

  • Priya A, Choudhury R, Patni S, Sharma H, Mohanty M, Narayanam K, Devi U, Moogi P, Patil P and Parag P. Energy-minimizing workload splitting and frequency selection for guaranteed performance over heterogeneous cores. Proceedings of the 15th ACM International Conference on Future and Sustainable Energy Systems. (308-322).

    https://doi.org/10.1145/3632775.3661968

  • Oh C, Yi S, Seok J, Jung H, Yoon I and Yi Y. (2023). Hybridhadoop: CPU-GPU hybrid scheduling in hadoop. Cluster Computing. 10.1007/s10586-023-04178-5. 27:3. (3875-3892). Online publication date: 1-Jun-2024.

    https://link.springer.com/10.1007/s10586-023-04178-5

  • Tyagi A, Mishra A, Vedavathi N, Kakulapati V and Sajidha S. (2024). Futuristic Technologies for Smart Manufacturing. Automated Secure Computing for Next‐Generation Systems. 10.1002/9781394213948.ch21. (415-441). Online publication date: 3-May-2024.

    https://onlinelibrary.wiley.com/doi/10.1002/9781394213948.ch21

  • Cheng J, Coward S, Chelini L, Barbalho R and Drane T. SEER: Super-Optimization Explorer for High-Level Synthesis using E-graph Rewriting. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (1029-1044).

    https://doi.org/10.1145/3620665.3640392

  • Crisci L, Carpentieri L, Thoman P, Alpay A, Heuveline V and Cosenza B. SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs. Proceedings of the 12th International Workshop on OpenCL and SYCL. (1-12).

    https://doi.org/10.1145/3648115.3648120

  • Frachtenberg E, Mittal V, Bruel P, Faloutsos M and Milojicic D. (2024). The Distribution Is the Performance. Computer. 57:4. (143-149). Online publication date: 1-Apr-2024.

    https://doi.org/10.1109/MC.2024.3362448

  • Hasler J and Hao C. (2023). Programmable Analog System Benchmarks Leading to Efficient Analog Computation Synthesis. ACM Transactions on Reconfigurable Technology and Systems. 17:1. (1-25). Online publication date: 31-Mar-2024.

    https://doi.org/10.1145/3625298

  • Wang Y, Li B, Jaleel A, Yang J and Tang X. (2024). GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA57654.2024.00085. 979-8-3503-9313-2. (1080-1094).

    https://ieeexplore.ieee.org/document/10476474/

  • Na S, Kim J, Lee S and Huh J. (2024). Supporting Secure Multi-GPU Computing with Dynamic and Batched Metadata Management. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA57654.2024.00025. 979-8-3503-9313-2. (204-217).

    https://ieeexplore.ieee.org/document/10476487/

  • Jeong E, Park E, Koo G, Oh Y and Yoon M. (2024). Conflict-aware compiler for hierarchical register file on GPUs. Journal of Systems Architecture. 10.1016/j.sysarc.2024.103099. (103099). Online publication date: 1-Feb-2024.

    https://linkinghub.elsevier.com/retrieve/pii/S1383762124000365

  • Kumar V, Ranjbar B and Kumar A. Utilizing Machine Learning Techniques for Worst-Case Execution Time Estimation on GPU Architectures. IEEE Access. 10.1109/ACCESS.2024.3379018. 12. (41464-41478).

    https://ieeexplore.ieee.org/document/10474357/

  • Mustafa D, Alkhasawneh R, Obeidat F and Shatnawi A. MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access. 10.1109/ACCESS.2024.3372990. 12. (34354-34377).

    https://ieeexplore.ieee.org/document/10458910/

  • Mohamed K. (2024). An Introduction to Heterogeneous SoC Design and Verification “A Conceptual-Level”. Heterogeneous SoC Design and Verification. 10.1007/978-3-031-56152-8_1. (1-26).

    https://link.springer.com/10.1007/978-3-031-56152-8_1

  • Tian S, Giechaskiel I, Xiong W and Szefer J. (2024). Fingerprinting and Mapping Cloud FPGA Infrastructures. Security of FPGA-Accelerated Cloud Computing Environments. 10.1007/978-3-031-45395-3_9. (239-272).

    https://link.springer.com/10.1007/978-3-031-45395-3_9

  • Giechaskiel I, Tian S and Szefer J. (2024). Contention-Based Threats Between Single-Tenant Cloud FPGA Instances. Security of FPGA-Accelerated Cloud Computing Environments. 10.1007/978-3-031-45395-3_6. (137-172).

    https://link.springer.com/10.1007/978-3-031-45395-3_6

  • Zoni D, Galimberti A and Fornaciari W. (2023). A Survey on Run-time Power Monitors at the Edge. ACM Computing Surveys. 55:14s. (1-33). Online publication date: 31-Dec-2024.

    https://doi.org/10.1145/3593044

  • Saini A, Shende O, Pandit M, Sen R and Ananthanarayanan G. Bang for the Buck: Evaluating the cost-effectiveness of Heterogeneous Edge Platforms for Neural Network Workloads. Proceedings of the Eighth ACM/IEEE Symposium on Edge Computing. (94-107).

    https://doi.org/10.1145/3583740.3628437

  • Tan J, Chen K, Wang W, Yan K and Wei X. MCM-GPU Voltage Noise Characterization and Architecture-Level Mitigation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 10.1109/TCAD.2023.3279304. 42:12. (5084-5097).

    https://ieeexplore.ieee.org/document/10131993/

  • Lin F, Liu Y, Wang X and Gai X. (2023). Leveraging simulation of high performance computing systems with node simulation using architecture simulator. CCF Transactions on High Performance Computing. 10.1007/s42514-023-00173-9. 5:4. (442-464). Online publication date: 1-Dec-2023.

    https://link.springer.com/10.1007/s42514-023-00173-9

  • Weckert C, Solis-Vasquez L, Oppermann J, Koch A and Sinnen O. Altis-SYCL: Migrating Altis Benchmarking Suite from CUDA to SYCL for GPUs and FPGAs. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. (547-555).

    https://doi.org/10.1145/3624062.3624542

  • Afzal A, Hager G and Wellein G. SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. (1245-1254).

    https://doi.org/10.1145/3624062.3624197

  • Rodríguez-Borbón J, Wang X, Diéguez A, Ibrahim K and Wong B. (2023). TRAVOLTA: GPU Acceleration and Algorithmic Improvements for Constructing Quantum Optimal Control Fields in Photo-Excited Systems. Computer Physics Communications. 10.1016/j.cpc.2023.109017. (109017). Online publication date: 1-Nov-2023.

    https://linkinghub.elsevier.com/retrieve/pii/S0010465523003624

  • Liu C, Sun Y and Carlson T. Photon: A Fine-grained Sampled Simulation Methodology for GPU Workloads. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. (1227-1241).

    https://doi.org/10.1145/3613424.3623773

  • Li B, Guo Y, Wang Y, Jaleel A, Yang J and Tang X. IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE Invalidations. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. (1163-1177).

    https://doi.org/10.1145/3613424.3614269

  • Sung S, Hur S, Kim S, Ha D, Oh Y and Ro W. MAD MAcce: Supporting Multiply-Add Operations for Democratizing Matrix-Multiplication Accelerators. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. (367-379).

    https://doi.org/10.1145/3613424.3614247

  • Dutta A, Alcaraz J, TehraniJamsaz A, Cesar E, Sikora A and Jannesari A. Performance Optimization using Multimodal Modeling and Heterogeneous GNN. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing. (45-57).

    https://doi.org/10.1145/3588195.3592984

  • Meyer M, Kenter T and Plessl C. (2023). Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-switched Inter-FPGA Networks. ACM Transactions on Reconfigurable Technology and Systems. 10.1145/3576200. 16:2. (1-27). Online publication date: 30-Jun-2023.

    https://dl.acm.org/doi/10.1145/3576200

  • Barbierato E, Manini D and Gribaudo M. (2023). A Multiformalism-Based Model for Performance Evaluation of Green Data Centres. Electronics. 10.3390/electronics12102169. 12:10. (2169).

    https://www.mdpi.com/2079-9292/12/10/2169

  • Tørring J, van Werkhoven B, Petrovič F, Willemsen F, Filipovič J and Elster A. (2023). Towards a Benchmarking Suite for Kernel Tuners. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW59300.2023.00124. 979-8-3503-1199-0. (724-733).

    https://ieeexplore.ieee.org/document/10196663/

  • Kamatar A, Friese R and Gioiosa R. (2023). A Task Based Approach for Co-Scheduling Ensemble Workloads on Heterogeneous Nodes. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW59300.2023.00015. 979-8-3503-1199-0. (6-15).

    https://ieeexplore.ieee.org/document/10196582/

  • Emonds Y, Braun L and Fröning H. (2023). CUDAsap: Statically-Determined Execution Statistics as Alternative to Execution-Based Profiling. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 10.1109/CCGrid57682.2023.00021. 979-8-3503-0119-9. (119-130).

    https://ieeexplore.ieee.org/document/10171571/

  • Meyer J, Alpay A, Hack S, Fröning H and Heuveline V. Implementation Techniques for SPMD Kernels on CPUs. Proceedings of the 2023 International Workshop on OpenCL. (1-12).

    https://doi.org/10.1145/3585341.3585342

  • Sawalha L and Deljevic G. (2023). Workload Characterization Using Hierarchical PCA. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS57527.2023.00043. 979-8-3503-9739-0. (331-333).

    https://ieeexplore.ieee.org/document/10158189/

  • Jin Z and Vetter J. (2023). A Benchmark Suite for Improving Performance Portability of the SYCL Programming Model. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS57527.2023.00041. 979-8-3503-9739-0. (325-327).

    https://ieeexplore.ieee.org/document/10158214/

  • Giechaskiel I, Tian S and Szefer J. (2022). Cross-VM Covert- and Side-Channel Attacks in Cloud FPGAs. ACM Transactions on Reconfigurable Technology and Systems. 16:1. (1-29). Online publication date: 31-Mar-2023.

    https://doi.org/10.1145/3534972

  • Lee J, Lee J, Oh Y, Song W and Ro W. (2023). SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA56546.2023.10071063. 978-1-6654-7652-2. (1195-1207).

    https://ieeexplore.ieee.org/document/10071063/

  • Li B, Yin J, Holey A, Zhang Y, Yang J and Tang X. (2023). Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA56546.2023.10071054. 978-1-6654-7652-2. (456-470).

    https://ieeexplore.ieee.org/document/10071054/

  • Wang X, Li Y, Guo F, Xu Y and Lui J. Dynamic GPU Scheduling with Multi-resource Awareness and Live Migration Support. IEEE Transactions on Cloud Computing. 10.1109/TCC.2023.3264242. (1-16).

    https://ieeexplore.ieee.org/document/10091187/

  • Paul B, Choudhury N, Saikia E and Trivedi G. (2023). Digital Boolean Logic Equivalent Reversible Quantum Gates Design. Third Congress on Intelligent Systems. 10.1007/978-981-19-9379-4_20. (253-271).

    https://link.springer.com/10.1007/978-981-19-9379-4_20

  • Defour D. (2022). Using scheduling entropy amplification in CUDA/OpenMP code to exhibit non-reproducibility issues. 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC). 10.1109/MCSoC57363.2022.00040. 978-1-6654-6499-4. (200-207).

    https://ieeexplore.ieee.org/document/10008469/

  • Oh Y, Jeong I, Ro W and Yoon M. CASH-RF: A Compiler-Assisted Hierarchical Register File in GPUs. IEEE Embedded Systems Letters. 10.1109/LES.2022.3163749. 14:4. (187-190).

    https://ieeexplore.ieee.org/document/9745582/

  • Hammond J, Deakin T, Cownie J and McIntosh-Smith S. (2022). Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream. 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). 10.1109/PMBS56514.2022.00013. 978-1-6654-5185-7. (82-99).

    https://ieeexplore.ieee.org/document/10024026/

  • Gomez-Hernandez E, Cebrian J, Kaxiras S and Ros A. (2022). Splash-4: A Modern Benchmark Suite with Lock-Free Constructs. 2022 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC55918.2022.00015. 978-1-6654-8798-6. (51-64).

    https://ieeexplore.ieee.org/document/9975421/

  • Peng W and Belikov E. (2022). CAMP: a Synthetic Micro-Benchmark for Assessing Deep Memory Hierarchies. 2022 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar). 10.1109/HiPar56574.2022.00009. 978-1-6654-6345-4. (28-36).

    https://ieeexplore.ieee.org/document/10024617/

  • Bao Y, Sun Y, Feric Z, Shen M, Weston M, Abellán J, Baruah T, Kim J, Joshi A and Kaeli D. NaviSim. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. (333-345).

    https://doi.org/10.1145/3559009.3569666

  • Belayneh L, Ye H, Chen K, Blaauw D, Mudge T, Dreslinski R and Talati N. Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. (304-316).

    https://doi.org/10.1145/3559009.3569649

  • B P, Jawalkar N and Basu A. Designing Virtual Memory System of MCM GPUs. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture. (404-422).

    https://doi.org/10.1109/MICRO56248.2022.00036

  • Zhang Y and Jung C. (2022). Featherweight Soft Error Resilience for GPUs. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 10.1109/MICRO56248.2022.00030. 978-1-6654-6272-3. (245-262).

    https://ieeexplore.ieee.org/document/9923801/

  • Tan J, Chen K and Yan K. (2022). MG-Voltage: Characterizing and Mitigating Voltage Noise in MCM-GPU Architectures. 2022 IEEE 40th International Conference on Computer Design (ICCD). 10.1109/ICCD56317.2022.00109. 978-1-6654-6186-3. (714-721).

    https://ieeexplore.ieee.org/document/9978493/

  • Jagasivamani M, Fong C, Goodnow K and Voigt R. (2022). Model And Evaluation Of A Superconducting-Logic Based Hybrid CPU-Accelerator System. 2022 Annual Modeling and Simulation Conference (ANNSIM). 10.23919/ANNSIM55834.2022.9859454. 978-1-71-385288-9. (140-151).

    https://ieeexplore.ieee.org/document/9859454/

  • Jin H, Jeong D, Park T, Ko J and Kim J. Multi-Prediction Compression: An Efficient and Scalable Memory Compression Framework for GP-GPU. IEEE Computer Architecture Letters. 10.1109/LCA.2022.3177419. 21:2. (37-40).

    https://ieeexplore.ieee.org/document/9780608/

  • Zhao C, Gao W, Nie F and Zhou H. A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2021.3115630. 33:6. (1451-1463).

    https://ieeexplore.ieee.org/document/9548839/

  • Liu Y, Azami N, Walters C and Burtscher M. (2022). The Indigo Program-Verification Microbenchmark Suite of Irregular Parallel Code Patterns. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS55109.2022.00003. 978-1-6654-5954-9. (24-34).

    https://ieeexplore.ieee.org/document/9804647/

  • Jin Z and Vetter J. (2022). Evaluating Unified Memory Performance in HIP. 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW55747.2022.00096. 978-1-6654-9747-3. (562-568).

    https://ieeexplore.ieee.org/document/9835548/

  • Heldens S, Hijma P, Van Werkhoven B, Maassen J and van Nieuwpoort R. (2022). Lightning: Scaling the GPU Programming Model Beyond a Single GPU. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS53621.2022.00054. 978-1-6654-8106-9. (492-503).

    https://ieeexplore.ieee.org/document/9820612/

  • Saiz A, Prieto P, Abad P, Gregorio J and Puente V. (2022). Top-Down Performance Profiling on NVIDIA's GPUs. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS53621.2022.00026. 978-1-6654-8106-9. (179-189).

    https://ieeexplore.ieee.org/document/9820717/

  • Brunst H, Chandrasekaran S, Ciorba F, Hagerty N, Henschel R, Juckeland G, Li J, Vergara V, Wienke S and Zavala M. (2022). First Experiences in Performance Benchmarking with the New SPEChpc 2021 Suites. 2022 22nd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 10.1109/CCGrid54584.2022.00077. 978-1-6654-9956-9. (675-684).

    https://ieeexplore.ieee.org/document/9826013/

  • Chen G, Zhang J, Zhu Z, Wang H, Jiang H and Pang C. (2020). CRAC: An automatic assistant compiler of checkpoint/restart for OpenCL program. Concurrency and Computation: Practice and Experience. 10.1002/cpe.6048. 34:8. Online publication date: 10-Apr-2022.

    https://onlinelibrary.wiley.com/doi/10.1002/cpe.6048

  • van Stigt R, Swatman S and Varbanescu A. Isolating GPU Architectural Features Using Parallelism-Aware Microbenchmarks. Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering. (77-88).

    https://doi.org/10.1145/3489525.3511673

  • Olabi M, Luna J, Mutlu O, Hwu W and El Hajj I. A compiler framework for optimizing dynamic parallelism on GPUs. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization. (1-13).

    https://doi.org/10.1109/CGO53902.2022.9741284

  • Dalmia P, Mahapatra R and Sinclair M. (2022). Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA53966.2022.00056. 978-1-6654-2027-3. (676-691).

    https://ieeexplore.ieee.org/document/9773230/

  • Li W, Chen Z, He X, Duan G, Sun J and Chen H. (2022). CVFuzz. Future Generation Computer Systems. 127:C. (384-395). Online publication date: 1-Feb-2022.

    https://doi.org/10.1016/j.future.2021.09.006

  • Kim S and Kim Y. (2021). K-Scheduler: dynamic intra-SM multitasking management with execution profiles on GPUs. Cluster Computing. 10.1007/s10586-021-03429-7. 25:1. (597-617). Online publication date: 1-Feb-2022.

    https://link.springer.com/10.1007/s10586-021-03429-7

  • Jeong I, Oh Y, Ro W and Yoon M. TEA-RC: Thread Context-Aware Register Cache for GPUs. IEEE Access. 10.1109/ACCESS.2022.3196149. 10. (82049-82062).

    https://ieeexplore.ieee.org/document/9848819/

  • Nermend M, Singh S and Singh U. (2022). An evaluation of decision on paradigm shift in higher education by digital transformation. Procedia Computer Science. 207:C. (1959-1969). Online publication date: 1-Jan-2022.

    https://doi.org/10.1016/j.procs.2022.09.255

  • Giechaskiel I, Tian S and Szefer J. (2021). Cross-VM Information Leaks in FPGA-Accelerated Cloud Environments. 2021 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). 10.1109/HOST49136.2021.9702277. 978-1-6654-1357-2. (91-101).

    https://ieeexplore.ieee.org/document/9702277/

  • Naderan-Tahan M and Eeckhout L. (2021). Cactus: Top-Down GPU-Compute Benchmarking using Real-Life Applications. 2021 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC53511.2021.00026. 978-1-6654-4173-5. (176-188).

    https://ieeexplore.ieee.org/document/9668300/

  • Meyer M, Kenter T and Plessl C. (2021). In-depth FPGA Accelerator Performance Evaluation with Single Node Benchmarks from the HPC Challenge Benchmark Suite for Intel and Xilinx FPGAs using OpenCL. Journal of Parallel and Distributed Computing. 10.1016/j.jpdc.2021.10.007. Online publication date: 1-Nov-2021.

    https://linkinghub.elsevier.com/retrieve/pii/S0743731521002057

  • Kamath A and Basu A. iGUARD. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. (49-65).

    https://doi.org/10.1145/3477132.3483545

  • Li B, Yin J, Zhang Y and Tang X. Improving Address Translation in Multi-GPUs via Sharing and Spilling aware TLB Design. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. (1154-1168).

    https://doi.org/10.1145/3466752.3480083

  • Cabrera A, Hitefield S, Kim J, Lee S, Miniskar N and Vetter J. (2021). Toward Performance Portable Programming for Heterogeneous Systems on a Chip: A Case Study with Qualcomm Snapdragon SoC. 2021 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC49654.2021.9622794. 978-1-6654-2369-4. (1-7).

    https://ieeexplore.ieee.org/document/9622794/

  • Xiao C, Ran W, Lin F and Zhang L. (2021). Dynamic Fine-Grained Workload Partitioning for Irregular Applications on Discrete CPU-GPU Systems. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00148. 978-1-6654-3574-1. (1067-1074).

    https://ieeexplore.ieee.org/document/9644895/

  • Biookaghazadeh S, Ren F and Zhao M. (2021). Characterizing Loop Acceleration in Heterogeneous Computing. 2021 IEEE 14th International Conference on Cloud Computing (CLOUD). 10.1109/CLOUD53861.2021.00059. 978-1-6654-0060-2. (445-455).

    https://ieeexplore.ieee.org/document/9582262/

  • Geng T, Amaris M, Zuckerman S, Goldman A, Gao G and Gaudiot J. (2021). A Profile-Based AI-Assisted Dynamic Scheduling Approach for Heterogeneous Architectures. International Journal of Parallel Programming. 10.1007/s10766-021-00721-2.

    https://link.springer.com/10.1007/s10766-021-00721-2

  • Zhang C, Zhang F, Guo X, He B, Zhang X and Du X. iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3046870. 32:7. (1740-1752).

    https://ieeexplore.ieee.org/document/9305972/

  • Tsuji M, Kramer W, Weill J, Nominé J and Sato M. (2021). A new sustained system performance metric for scientific performance evaluation. The Journal of Supercomputing. 10.1007/s11227-020-03545-y. 77:7. (6476-6504). Online publication date: 1-Jul-2021.

    https://link.springer.com/10.1007/s11227-020-03545-y

  • Fotouhi P, Fariborz M, Proietti R, Lowe-Power J, Akella V and Yoo S. HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads. High Performance Computing. (176-194).

    https://doi.org/10.1007/978-3-030-78713-4_10

  • Meyer M. Towards Performance Characterization of FPGAs in Context of HPC using OpenCL Benchmarks. Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies. (1-2).

    https://doi.org/10.1145/3468044.3468058

  • Abdolrashidi A, Esfeden H, Jahanshahi A, Singh K, Abu-Ghazaleh N and Wong D. BlockMaestro. Proceedings of the 48th Annual International Symposium on Computer Architecture. (333-346).

    https://doi.org/10.1109/ISCA52012.2021.00034

  • Jin Z and Vetter J. (2021). Evaluating CUDA Portability with HIPCL and DPCT. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW52791.2021.00065. 978-1-6654-3577-2. (371-376).

    https://ieeexplore.ieee.org/document/9460636/

  • Di B, Sun J, Chen H and Li D. Efficient Buffer Overflow Detection on GPU. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3042965. 32:5. (1161-1177).

    https://ieeexplore.ieee.org/document/9286775/

  • Tian S, Giechaskiel I, Xiong W and Szefer J. (2021). Cloud FPGA Cartography using PCIe Contention. 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 10.1109/FCCM51124.2021.00035. 978-1-6654-3555-0. (224-232).

    https://ieeexplore.ieee.org/document/9444054/

  • Schmitt N, Lange K, Sharma S, Rawtani N, Ponder C and Kounev S. The SPECpowerNext Benchmark Suite, its Implementation and New Workloads from a Developer's Perspective. Proceedings of the ACM/SPEC International Conference on Performance Engineering. (225-232).

    https://doi.org/10.1145/3427921.3450239

  • Baruah T, Shivdikar K, Dong S, Sun Y, Mojumder S, Jung K, Abellan J, Ukidave Y, Joshi A, Kim J and Kaeli D. (2021). GNNMark: A Benchmark Suite to Characterize Graph Neural Network Training on GPUs. 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS51385.2021.00013. 978-1-7281-8643-6. (13-23).

    https://ieeexplore.ieee.org/document/9408205/

  • Pratheek B, Jawalkar N and Basu A. (2021). Improving GPU Multi-tenancy with Page Walk Stealing. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA51647.2021.00059. 978-1-6654-2235-2. (626-639).

    https://ieeexplore.ieee.org/document/9407125/

  • Ibrahim M, Kayiran O, Eckert Y, Loh G and Jog A. (2021). Analyzing and Leveraging Decoupled L1 Caches in GPUs. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA51647.2021.00047. 978-1-6654-2235-2. (467-478).

    https://ieeexplore.ieee.org/document/9407080/

  • Lin F, Liu Y, Guo Y and Qian D. (2020). ELS: Emulation system for debugging and tuning large-scale parallel programs on small clusters. The Journal of Supercomputing. 10.1007/s11227-020-03319-6. 77:2. (1635-1666). Online publication date: 1-Feb-2021.

    https://link.springer.com/10.1007/s11227-020-03319-6

  • Feng Y, Han X, Xu N, Gong J, Le L, Xing C, Yang K, Wang Y, Chen X and An W. (2021). Development of Heterogeneous Computing and Virtualization in Spaceborne IMA During 2010–2020. Signal and Information Processing, Networking and Computers. 10.1007/978-981-33-4102-9_46. (374-383).

    http://link.springer.com/10.1007/978-981-33-4102-9_46

  • Han W, Mawhirter D, Wu B, Ma L and Tian C. (2021). FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization. Languages and Compilers for Parallel Computing. 10.1007/978-3-030-72789-5_3. (32-48).

    http://link.springer.com/10.1007/978-3-030-72789-5_3

  • Tsai Y, Cojean T, Ribizel T and Anzt H. (2021). Preparing Ginkgo for AMD GPUs – A Testimonial on Porting CUDA Code to HIP. Euro-Par 2020: Parallel Processing Workshops. 10.1007/978-3-030-71593-9_9. (109-121).

    http://link.springer.com/10.1007/978-3-030-71593-9_9

  • Wang Q and Chu X. GPGPU Performance Estimation With Core and Memory Frequency Scaling. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3004623. 31:12. (2865-2881).

    https://ieeexplore.ieee.org/document/9124659/

  • Eyraud-Dubois L and Bentes C. (2020). Algorithms for Preemptive Co-scheduling of Kernels on GPUs. 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC). 10.1109/HiPC50609.2020.00033. 978-1-6654-2292-5. (192-201).

    https://ieeexplore.ieee.org/document/9406773/

  • Carvalho P, Clua E, Paes A, Bentes C, Lopes B and Drummond L. (2020). Using machine learning techniques to analyze the performance of concurrent kernel execution on GPUs. Future Generation Computer Systems. 10.1016/j.future.2020.07.038. 113. (528-540). Online publication date: 1-Dec-2020.

    https://linkinghub.elsevier.com/retrieve/pii/S0167739X19312658

  • Chen G, Zhang J, Zhu Z, Jiang Q, Jiang H and Pang C. (2020). CRState: checkpoint/restart of OpenCL program for in-kernel applications. The Journal of Supercomputing. 10.1007/s11227-020-03460-2.

    http://link.springer.com/10.1007/s11227-020-03460-2

  • Kamatar A, Friese R and Gioiosa R. (2020). Locality-Aware Scheduling for Scalable Heterogeneous Environments. 2020 IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers (ROSS). 10.1109/ROSS51935.2020.00011. 978-1-6654-2268-0. (50-58).

    https://ieeexplore.ieee.org/document/9307939/

  • Chen Y, Long X, He J, Chen Y, Tan H, Zhang Z, Winslett M and Chen D. (2020). HaoCL: Harnessing Large-scale Heterogeneous Processors Made Easy. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 10.1109/ICDCS47774.2020.00120. 978-1-7281-7002-2. (1231-1234).

    https://ieeexplore.ieee.org/document/9355742/

  • Meyer M, Kenter T and Plessl C. (2020). Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite. 2020 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC). 10.1109/H2RC51942.2020.00007. 978-1-6654-1592-7. (10-18).

    https://ieeexplore.ieee.org/document/9306963/

  • Sultana T, Allen B and Qasem A. Intelligent Data Placement on Discrete GPU Nodes with Unified Memory. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. (139-151).

    https://doi.org/10.1145/3410463.3414651

  • Baruah T, Sun Y, Mojumder S, Abellán J, Ukidave Y, Joshi A, Rubin N, Kim J and Kaeli D. Valkyrie. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. (455-466).

    https://doi.org/10.1145/3410463.3414639

  • Lavin P, Young J, Vuduc R, Riedy J, Vose A and Ernst D. Evaluating Gather and Scatter Performance on CPUs and GPUs. Proceedings of the International Symposium on Memory Systems. (209-222).

    https://doi.org/10.1145/3422575.3422794

  • Rho S, Park G, Choi J and Park C. (2020). Development of benchmark automation suite and evaluation of various high-performance computing systems. Cluster Computing. 10.1007/s10586-020-03167-2.

    http://link.springer.com/10.1007/s10586-020-03167-2

  • Zheng R, Liu Y and Jin H. (2020). Optimizing non-coalesced memory access for irregular applications with GPU computing. Frontiers of Information Technology & Electronic Engineering. 10.1631/FITEE.1900262. 21:9. (1285-1301). Online publication date: 1-Sep-2020.

    http://link.springer.com/10.1631/FITEE.1900262

  • Hu B and Rossbach C. (2020). Altis: Modernizing GPGPU Benchmarks. 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS48437.2020.00011. 978-1-7281-4798-7. (1-11).

    https://ieeexplore.ieee.org/document/9238617/

  • Azimi R, Jing C and Reda S. (2020). PowerCoord: Power Capping Coordination for Multi-CPU/GPU Servers using Reinforcement Learning. Sustainable Computing: Informatics and Systems. 10.1016/j.suscom.2020.100412. (100412). Online publication date: 1-Jul-2020.

    https://linkinghub.elsevier.com/retrieve/pii/S2210537920301396

  • Wu Y, Shen M, Chen Y and Zhou Y. Tuning applications for efficient GPU offloading to in-memory processing. Proceedings of the 34th ACM International Conference on Supercomputing. (1-12).

    https://doi.org/10.1145/3392717.3392760

  • Mendonça G, Liao C and Pereira F. AutoParBench. Proceedings of the 34th ACM International Conference on Supercomputing. (1-10).

    https://doi.org/10.1145/3392717.3392744

  • Stevens J and Klöckner A. (2020). A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling. The International Journal of High Performance Computing Applications. 10.1177/1094342020921340. (109434202092134).

    http://journals.sagepub.com/doi/10.1177/1094342020921340

  • Feinberg B, Heyman B, Mikhailenko D, Wong R, Ho A and Ipek E. (2020). Commutative Data Reordering: A New Technique to Reduce Data Movement Energy on Sparse Inference Workloads. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). 10.1109/ISCA45697.2020.00091. 978-1-7281-4661-4. (1076-1088).

    https://ieeexplore.ieee.org/document/9138978/

  • Nie B, Jog A and Smirni E. (2020). Characterizing Accuracy-Aware Resilience of GPGPU Applications. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). 10.1109/CCGrid49817.2020.00-82. 978-1-7281-6095-5. (111-120).

    https://ieeexplore.ieee.org/document/9139732/

  • Rodríguez-Borbón J, Kalantar A, Yamijala S, Oviedo M, Najjar W and Wong B. (2020). Field Programmable Gate Arrays for Enhancing the Speed and Energy Efficiency of Quantum Dynamics Simulations. Journal of Chemical Theory and Computation. 10.1021/acs.jctc.9b01284. 16:4. (2085-2098). Online publication date: 14-Apr-2020.

    https://pubs.acs.org/doi/10.1021/acs.jctc.9b01284

  • Yeh T, Green R and Rogers T. Dimensionality-Aware Redundant SIMT Instruction Elimination. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. (1327-1340).

    https://doi.org/10.1145/3373376.3378520

  • Jadidi A, Kandemir M and Das C. (2020). Selective Caching: Avoiding Performance Valleys in Massively Parallel Architectures. 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP). 10.1109/PDP50117.2020.00051. 978-1-7281-6582-0. (290-298).

    https://ieeexplore.ieee.org/document/9092211/

  • Chang C, Carpenter I and Jones W. The ESIF-HPC-2 benchmark suite. Proceedings of the Workshop on Benchmarking in the Datacenter. (1-8).

    https://doi.org/10.1145/3380868.3398200

  • Baruah T, Sun Y, Dincer A, Mojumder S, Abellan J, Ukidave Y, Joshi A, Rubin N, Kim J and Kaeli D. (2020). Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA47549.2020.00055. 978-1-7281-6149-5. (596-609).

    https://ieeexplore.ieee.org/document/9065453/

  • Kadam G, Zhang D and Jog A. (2020). BCoal: Bucketing-Based Memory Coalescing for Efficient and Secure GPUs. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA47549.2020.00053. 978-1-7281-6149-5. (570-581).

    https://ieeexplore.ieee.org/document/9065581/

  • Reyes Fernandez de Bulnes D, Maldonado Y, Trujillo L and Acacio Sanchez M. (2020). Development of Multiobjective High-Level Synthesis for FPGAs. Scientific Programming. 2020. Online publication date: 1-Jan-2020.

    https://doi.org/10.1155/2020/7095048

  • Eassa F, Alghamdi A, Haridi S, Khemakhem M, Al-Ghamdi A and Alsolami E. ACC_TEST: Hybrid Testing Approach for OpenACC-Based Programs. IEEE Access. 10.1109/ACCESS.2020.2991009. 8. (80358-80368).

    https://ieeexplore.ieee.org/document/9079851/

  • Chen G, Zhang J, Zhu Z, Zhu C, Jiang H and Pang C. (2020). CRAC: An Automatic Assistant Compiler of Checkpoint/Restart for OpenCL Program. Data Science. 10.1007/978-981-15-2810-1_54. (574-586).

    http://link.springer.com/10.1007/978-981-15-2810-1_54

  • Geng T, Amaris M, Zuckerman S, Goldman A, Gao G and Gaudiot J. (2020). PDAWL: Profile-Based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures. Job Scheduling Strategies for Parallel Processing. 10.1007/978-3-030-63171-0_8. (145-162).

    http://link.springer.com/10.1007/978-3-030-63171-0_8

  • Lal S, Alpay A, Salzmann P, Cosenza B, Hirsch A, Stawinoga N, Thoman P, Fahringer T and Heuveline V. (2020). SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing. Euro-Par 2020: Parallel Processing. 10.1007/978-3-030-57675-2_39. (629-644).

    http://link.springer.com/10.1007/978-3-030-57675-2_39

  • Gerzhoy D, Sun X, Zuzak M and Yeung D. (2019). Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors. ACM Transactions on Architecture and Code Optimization. 16:4. (1-27). Online publication date: 31-Dec-2020.

    https://doi.org/10.1145/3368304

  • Chen G, Zhang J, Lin Q, Jiang H and Pang C. (2019). CRState: In-Kernel Checkpoint/Restart of OpenCL Program Execution on GPU. 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS). 10.1109/ICPADS47876.2019.00054. 978-1-7281-2583-1. (335-342).

    https://ieeexplore.ieee.org/document/8975814/

  • Garg A, Kulkarni P, Kurkure U, Sivaraman H and Vu L. (2019). Empirical Analysis of Hardware-Assisted GPU Virtualization. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC). 10.1109/HiPC.2019.00054. 978-1-7281-4535-8. (395-405).

    https://ieeexplore.ieee.org/document/8990619/

  • Guo F, Li Y, Lui J and Xu Y. DCUDA. Proceedings of the ACM Symposium on Cloud Computing. (114-125).

    https://doi.org/10.1145/3357223.3362714

  • Sun H, Gorlatch S and Zhao R. (2019). Vectorizing programs with IF-statements for processors with SIMD extensions. The Journal of Supercomputing. 10.1007/s11227-019-03057-4.

    http://link.springer.com/10.1007/s11227-019-03057-4

  • Zhang H and Hollingsworth J. (2019). Understanding the Performance of GPGPU Applications from a Data-Centric View. 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools). 10.1109/ProTools49597.2019.00006. 978-1-7281-6026-9. (1-8).

    https://ieeexplore.ieee.org/document/8955684/

  • Do Y, Kim H, Oh P, Park D and Lee J. (2019). SNU-NPB 2019: Parallelizing and Optimizing NPB in OpenCL and CUDA for Modern GPUs. 2019 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC47752.2019.9041954. 978-1-7281-4045-2. (93-105).

    https://ieeexplore.ieee.org/document/9041954/

  • Goyat S, Kant S and Dhariwal N. (2019). Dynamic Heterogeneous scheduling of GPU-CPU in Distributed Environment. 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT). 10.1109/ICSSIT46314.2019.8987886. 978-1-7281-2119-2. (329-336).

    https://ieeexplore.ieee.org/document/8987886/

  • Green O, Fox J, Young J, Shirako J and Bader D. (2019). Performance Impact of Memory Channels on Sparse and Irregular Algorithms. 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3). 10.1109/IA349570.2019.00016. 978-1-7281-5987-4. (67-70).

    https://ieeexplore.ieee.org/document/8945089/

  • Blott M, Halder L, Leeser M and Doyle L. (2019). QuTiBench. ACM Journal on Emerging Technologies in Computing Systems. 15:4. (1-38). Online publication date: 31-Oct-2019.

    https://doi.org/10.1145/3358700

  • Cruz R, Bentes C, Breder B, Vasconcellos E, Clua E, de Carvalho P and Drummond L. (2018). Maximizing the GPU resource usage by reordering concurrent kernels submission. Concurrency and Computation: Practice and Experience. 10.1002/cpe.4409. 31:18. Online publication date: 25-Sep-2019.

    https://onlinelibrary.wiley.com/doi/10.1002/cpe.4409

  • Ibrahim M, Liu H, Kayiran O and Jog A. (2019). Analyzing and Leveraging Remote-Core Bandwidth for Enhanced Performance in GPUs. 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2019.00028. 978-1-7281-3613-4. (258-271).

    https://ieeexplore.ieee.org/document/8891655/

  • Akshintala A, Yu H, Peters A and Rossbach C. (2019). Trillium: The code is the IR. 2019 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCS48598.2019.9188169. 978-1-7281-4484-9. (880-889).

    https://ieeexplore.ieee.org/document/9188169/

  • Jin Z and Finkel H. (2019). Base64 Encoding on Heterogeneous Computing Platforms. 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 10.1109/ASAP.2019.00014. 978-1-7281-1601-3. (247-254).

    https://ieeexplore.ieee.org/document/8825134/

  • Lee S, Gounley J, Randles A and Vetter J. (2019). Performance portability study for massively parallel computational fluid dynamics application on scalable heterogeneous architectures. Journal of Parallel and Distributed Computing. 129:C. (1-13). Online publication date: 1-Jul-2019.

    https://doi.org/10.1016/j.jpdc.2019.02.005

  • Pattnaik A, Tang X, Kayiran O, Jog A, Mishra A, Kandemir M, Sivasubramaniam A and Das C. Opportunistic computing in GPU architectures. Proceedings of the 46th International Symposium on Computer Architecture. (210-223).

    https://doi.org/10.1145/3307650.3322212

  • Uhrie R, Bliss D, Chakrabarti C, Ogras U, Brunhaver J and Suresh R. (2019). Machine understanding of domain computation for Domain-Specific System-on-Chips (DSSoC). Open Architecture/Open Business Model Net-Centric Systems and Defense Transformation 2019. 10.1117/12.2519264. 9781510626959. (21).

    https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11015/2519264/Machine-understanding-of-domain-computation-for-Domain-Specific-System-on/10.1117/12.2519264.full

  • Matz A and Fröning H. Quantifying the NUMA Behavior of Partitioned GPGPU Applications. Proceedings of the 12th Workshop on General Purpose Processing Using GPUs. (53-62).

    https://doi.org/10.1145/3300053.3319420

  • Pellauer M, Shao Y, Clemons J, Crago N, Hegde K, Venkatesan R, Keckler S, Fletcher C and Emer J. Buffets. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. (137-151).

    https://doi.org/10.1145/3297858.3304025

  • Pearson C, Dakkak A, Hashash S, Li C, Chung I, Xiong J and Hwu W. Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering. (209-218).

    https://doi.org/10.1145/3297663.3310299

  • von Kistowski J, Pais J, Wahl T, Lange K, Block H, Beckett J and Kounev S. Measuring the Energy Efficiency of Transactional Loads on GPGPU. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering. (219-230).

    https://doi.org/10.1145/3297663.3309667

  • Navarro A, Corbera F, Rodriguez A, Vilches A and Asenjo R. (2019). Heterogeneous parallel_for Template for CPU-GPU Chips. International Journal of Parallel Programming. 47:2. (213-233). Online publication date: 1-Apr-2019.

    https://doi.org/10.1007/s10766-018-0555-0

  • Davila G, Oliveira D, Navaux P and Rech P. (2019). Identifying the Most Reliable Collaborative Workload Distribution in Heterogeneous Devices. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). 10.23919/DATE.2019.8715107. 978-3-9819263-2-3. (1325-1330).

    https://ieeexplore.ieee.org/document/8715107/

  • Kim K, Park J and Baek W. Improving the Performance and Energy Efficiency of GPGPU Computing through Integrated Adaptive Cache Management. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2018.2868658. 30:3. (630-645).

    https://ieeexplore.ieee.org/document/8454288/

  • Liu Y, Huang L, Wu M, Cui H, Lv F, Feng X and Xue J. PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion. Proceedings of the 28th International Conference on Compiler Construction. (2-16).

    https://doi.org/10.1145/3302516.3307350

  • Sakdhnagool P, Sabne A and Eigenmann R. Optimizing GPU programs by register demotion. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. (405-406).

    https://doi.org/10.1145/3293883.3297859

  • Fuchs A and Wentzlaff D. (2019). The Accelerator Wall: Limits of Chip Specialization. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2019.00023. 978-1-7281-1444-6. (1-14).

    https://ieeexplore.ieee.org/document/8675237/

  • Carvalho P, Cruz R, Drummond L, Bentes C, Clua E, Cataldo E and Marzulo L. (2019). Kernel concurrency opportunities based on GPU benchmarks characterization. Cluster Computing. 10.1007/s10586-018-02901-1.

    http://link.springer.com/10.1007/s10586-018-02901-1

  • Tripathy S, Sahoo D and Satpathy M. (2019). Multidimensional Grid Aware Address Prediction for GPGPU. 2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID). 10.1109/VLSID.2019.00064. 978-1-7281-0409-6. (263-268).

    https://ieeexplore.ieee.org/document/8711244/

  • Zhang F, Zhai J, Wu B, He B, Chen W and Du X. Automatic Irregularity-Aware Fine-Grained Workload Partitioning on Integrated Architectures. IEEE Transactions on Knowledge and Data Engineering. 10.1109/TKDE.2019.2940184. (1-1).

    https://ieeexplore.ieee.org/document/8827952/

  • Tan T, Nurvitadhi E and Chiou D. Dark Wires and the Opportunities for Reconfigurable Logic. IEEE Computer Architecture Letters. 10.1109/LCA.2019.2909867. 18:1. (67-70).

    https://ieeexplore.ieee.org/document/8684249/

  • Khaleghzadeh H, Manumachu R and Lastovetsky A. A Hierarchical Data-partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-accelerator NUMA Nodes. IEEE Access. 10.1109/ACCESS.2019.2959905. (1-1).

    https://ieeexplore.ieee.org/document/8933138/

  • Guerreiro J, Ilic A, Roma N and Tomas P. GPU Static Modeling Using PTX and Deep Structured Learning. IEEE Access. 10.1109/ACCESS.2019.2951218. 7. (159150-159161).

    https://ieeexplore.ieee.org/document/8890640/

  • Zhao D and Chen Q. Current Prediction Model of GPU Oriented to General Purpose Computing. IEEE Access. 10.1109/ACCESS.2019.2939256. 7. (127920-127931).

    https://ieeexplore.ieee.org/document/8822998/

  • Alghamdi A and Eassa F. OpenACC Errors Classification and Static Detection Techniques. IEEE Access. 10.1109/ACCESS.2019.2935498. 7. (113235-113253).

    https://ieeexplore.ieee.org/document/8801837/

  • Kanekawa N, Miyoshi T, Fujita M, Matsumoto T, Yoshida H, Jo S, Kajihara S, Ohtake S, Imai M, Yoneda T, Takizawa H, Gao Y, Sato M, Egawa R and Kobayashi H. (2019). Unknown Threats and Provisions. VLSI Design and Test for Systems Dependability. 10.1007/978-4-431-56594-9_12. (475-509).

    http://link.springer.com/10.1007/978-4-431-56594-9_12

  • Lim R, Norris B and Malony A. (2019). A Similarity Measure for GPU Kernel Subgraph Matching. Languages and Compilers for Parallel Computing. 10.1007/978-3-030-34627-0_3. (37-53).

    http://link.springer.com/10.1007/978-3-030-34627-0_3

  • Schrödter T, Pallasch D, Wienke S, Schmitt R and Müller M. (2019). Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography. Euro-Par 2018: Parallel Processing Workshops. 10.1007/978-3-030-10549-5_33. (421-433).

    https://link.springer.com/10.1007/978-3-030-10549-5_33

  • Ben-Nun T, Jakobovits A and Hoefler T. Neural code comprehension. Proceedings of the 32nd International Conference on Neural Information Processing Systems. (3589-3601).

    /doi/10.5555/3327144.3327276

  • Yu C, Bai Y, Yang H, Cheng K, Gu Y, Luan Z and Qian D. SMGuard: A Flexible and Fine-Grained Resource Management Framework for GPUs. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2018.2848621. 29:12. (2849-2862).

    https://ieeexplore.ieee.org/document/8388218/

  • Sathre P, Helal A and Feng W. (2018). A Composable Workflow for Productive Heterogeneous Computing on FPGAs via Whole-Program Analysis and Transformation. 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig). 10.1109/RECONFIG.2018.8641694. 978-1-7281-1968-7. (1-8).

    https://ieeexplore.ieee.org/document/8641694/

  • Bannwart Perina A and Bonato V. (2018). Mapping Estimator for OpenCL Heterogeneous Accelerators. 2018 International Conference on Field-Programmable Technology (FPT). 10.1109/FPT.2018.00057. 978-1-7281-0214-6. (294-297).

    https://ieeexplore.ieee.org/document/8742290/

  • Ausavarungnirun R, Miller V, Landgraf J, Ghose S, Gandhi J, Jog A, Rossbach C and Mutlu O. (2018). MASK. ACM SIGPLAN Notices. 53:2. (503-518). Online publication date: 30-Nov-2018.

    https://doi.org/10.1145/3296957.3173169

  • Di B, Sun J, Li D, Chen H and Quan Z. GMOD. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques. (1-13).

    https://doi.org/10.1145/3243176.3243194

  • Luo H, Chen G, Liu F, Li P, Ding C and Shen X. Footprint modeling of cache associativity and granularity. Proceedings of the International Symposium on Memory Systems. (232-242).

    https://doi.org/10.1145/3240302.3240419

  • Azimi R, Jing C and Reda S. (2018). PowerCoord: A Coordinated Power Capping Controller for Multi-CPU/GPU Servers. 2018 Ninth International Green and Sustainable Computing Conference (IGSC). 10.1109/IGCC.2018.8752132. 978-1-5386-7466-6. (1-9).

    https://ieeexplore.ieee.org/document/8752132/

  • Umar M, Moore S, Meredith J, Vetter J and Cameron K. (2018). Aspen-based performance and energy modeling frameworks. Journal of Parallel and Distributed Computing. 10.1016/j.jpdc.2017.11.005. 120. (222-236). Online publication date: 1-Oct-2018.

    https://linkinghub.elsevier.com/retrieve/pii/S0743731517303039

  • Basu A, Greathouse J, Venkataramani G and Vesely J. (2018). Interference from GPU System Service Requests. 2018 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2018.8573485. 978-1-5386-6780-4. (179-190).

    https://ieeexplore.ieee.org/document/8573485/

  • Li A, Song S, Chen J, Liu X, Tallent N and Barker K. (2018). Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite. 2018 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2018.8573483. 978-1-5386-6780-4. (191-202).

    https://ieeexplore.ieee.org/document/8573483/

  • Mammeri N and Juurlink B. (2018). VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs. 2018 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2018.8573477. 978-1-5386-6780-4. (25-35).

    https://ieeexplore.ieee.org/document/8573477/

  • Jamieson P, Sanaullah A and Herbordt M. (2018). Benchmarking Heterogeneous HPC Systems Including Reconfigurable Fabrics: Community Aspirations for Ideal Comparisons. 2018 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2018.8547635. 978-1-5386-5989-2. (1-6).

    https://ieeexplore.ieee.org/document/8547635/

  • Chen M, Chung I, Abali B and Crumley P. (2018). Towards a Single-Host Many-GPU System. 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 10.1109/CAHPC.2018.8645874. 978-1-5386-7769-8. (140-147).

    https://ieeexplore.ieee.org/document/8645874/

  • Ausavarungnirun R, Landgraf J, Miller V, Ghose S, Gandhi J, Rossbach C and Mutlu O. (2018). Mosaic. ACM SIGOPS Operating Systems Review. 52:1. (27-44). Online publication date: 28-Aug-2018.

    https://doi.org/10.1145/3273982.3273986

  • Sawin J, Myre J and Wilken H. (2018). Economic Considerations for Integrating Massively Parallel Heterogeneous Devices into the Cloud. 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud). 10.1109/FiCloud.2018.00011. 978-1-5386-7503-8. (17-24).

    https://ieeexplore.ieee.org/document/8457988/

  • Shen D, Liu X and Lin F. (2016). Characterizing emerging heterogeneous memory. ACM SIGPLAN Notices. 51:11. (13-23). Online publication date: 19-Jul-2018.

    https://doi.org/10.1145/3241624.2926702

  • Sinha H, Raj G, Kumar P and Choudhury T. (2018). Effective E-Healthcare System. International Journal of Big Data and Analytics in Healthcare. 3:2. (10-27). Online publication date: 1-Jul-2018.

    https://doi.org/10.4018/IJBDAH.2018070102

  • Betts A, Chong N, Deligiannis P, Donaldson A and Ketema J. Implementing and Evaluating Candidate-Based Invariant Generation. IEEE Transactions on Software Engineering. 10.1109/TSE.2017.2718516. 44:7. (631-650).

    https://ieeexplore.ieee.org/document/7955079/

  • Heo J, Jo G, Han H and Yang H. (2018). Accelerated Code Generator for Processing Ocean Color Remote Sensing Data on GPU. IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium. 10.1109/IGARSS.2018.8519420. 978-1-5386-7150-4. (9218-9221).

    https://ieeexplore.ieee.org/document/8519420/

  • Zacharopoulos G, Barbon A, Ansaloni G and Pozzi L. (2018). Machine Learning Approach for Loop Unrolling Factor Prediction in High Level Synthesis. 2018 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCS.2018.00030. 978-1-5386-7878-7. (91-97).

    https://ieeexplore.ieee.org/document/8514335/

  • Losch A and Platzner M. (2018). A Highly Accurate Energy Model for Task Execution on Heterogeneous Compute Nodes. 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 10.1109/ASAP.2018.8445098. 978-1-5386-7479-6. (1-8).

    https://ieeexplore.ieee.org/document/8445098/

  • Trompouki M and Kosmidis L. Brook auto. Proceedings of the 55th Annual Design Automation Conference. (1-6).

    https://doi.org/10.1145/3195970.3196002

  • Jain A, Khairy M and Rogers T. (2018). A Quantitative Evaluation of Contemporary GPU Simulation Methodology. Proceedings of the ACM on Measurement and Analysis of Computing Systems. 2:2. (1-28). Online publication date: 13-Jun-2018.

    https://doi.org/10.1145/3224430

  • Li A, Liu W, Wang L, Barker K and Song S. Warp-Consolidation. Proceedings of the 2018 International Conference on Supercomputing. (53-64).

    https://doi.org/10.1145/3205289.3205294

  • Sinha H, ang D and Raj G. (2018). Elastic Search in Cache Based Service Management for Healthcare Automation. 2018 12th International Conference on Communications (COMM). 10.1109/ICComm.2018.8430162. 978-1-5386-2350-3. (01-06).

    https://ieeexplore.ieee.org/document/8430162/

  • Sinha H, Dewang and Raj G. (2018). Elastic Search in Cache Based Service Management For Healthcare Automation. 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE). 10.1109/ICACCE.2018.8441722. 978-1-5386-4485-0. (445-450).

    https://ieeexplore.ieee.org/document/8441722/

  • Trompouki M and Kosmidis L. (2018). Brook Auto: High-Level Certification-Friendly Programming for GPU-powered Automotive Systems. 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 10.1109/DAC.2018.8465869. 978-1-5386-4114-9. (1-6).

    https://ieeexplore.ieee.org/document/8465869/

  • Hong C, Spence I and Nikolopoulos D. (2017). GPU Virtualization and Scheduling Methods. ACM Computing Surveys. 50:3. (1-37). Online publication date: 31-May-2018.

    https://doi.org/10.1145/3068281

  • Zhang P, Fang J, Yang C, Tang T, Huang C and Wang Z. MOCL. Proceedings of the 15th ACM International Conference on Computing Frontiers. (26-35).

    https://doi.org/10.1145/3203217.3203244

  • Jacobs J. (2018). Finding the edge: Art and automation. XRDS: Crossroads, The ACM Magazine for Students. 24:3. (5-6). Online publication date: 3-Apr-2018.

    https://doi.org/10.1145/3186703

  • Ausavarungnirun R, Miller V, Landgraf J, Ghose S, Gandhi J, Jog A, Rossbach C and Mutlu O. MASK. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. (503-518).

    https://doi.org/10.1145/3173162.3173169

  • Lin J. (2018). Python Non-Uniform Fast Fourier Transform (PyNUFFT): An Accelerated Non-Cartesian MRI Package on a Heterogeneous Platform (CPU/GPU). Journal of Imaging. 10.3390/jimaging4030051. 4:3. (51).

    https://www.mdpi.com/2313-433X/4/3/51

  • Saussard R, Bouzid B, Vasiliu M and Reynaud R. (2018). A novel global methodology to analyze the embeddability of real-time image processing algorithms. Journal of Real-Time Image Processing. 14:3. (565-583). Online publication date: 1-Mar-2018.

    https://doi.org/10.1007/s11554-017-0686-3

  • Hagedorn B, Stoltzfus L, Steuwer M, Gorlatch S and Dubach C. High performance stencil code generation with Lift. Proceedings of the 2018 International Symposium on Code Generation and Optimization. (100-112).

    https://doi.org/10.1145/3168824

  • Dao T and Lee J. An Auto-Tuner for OpenCL Work-Group Size on GPUs. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2017.2755657. 29:2. (283-296).

    http://ieeexplore.ieee.org/document/8048544/

  • Wang H, Luo F, Ibrahim M, Kayiran O and Jog A. (2018). Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2018.00030. 978-1-5386-3659-6. (247-258).

    http://ieeexplore.ieee.org/document/8327013/

  • Sahoo D, Sha S, Satpathy M, Mutyam M and Bhuyan L. CAMO. Proceedings of the 23rd Asia and South Pacific Design Automation Conference. (215-220).

    /doi/10.5555/3201607.3201652

  • (2018). Evaluating attainable memory bandwidth of parallel programming models via BabelStream. International Journal of Computational Science and Engineering. 17:3. (247-262). Online publication date: 1-Jan-2018.

    /doi/10.5555/3292750.3292751

  • Hagedorn B, Stoltzfus L, Steuwer M, Gorlatch S and Dubach C. (2018). High performance stencil code generation with Lift. The 2018 International Symposium. 10.1145/3179541.3168824. 9781450356176. (100-112).

    http://dl.acm.org/citation.cfm?doid=3179541.3168824

  • Sahoo D, Sha S, Satpathy M, Mutyam M and Bhuyan L. (2018). CAMO: A novel cache management organization for GPGPUs. 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC). 10.1109/ASPDAC.2018.8297308. 978-1-5090-0602-1. (215-220).

    http://ieeexplore.ieee.org/document/8297308/

  • Carvalho P, Drummond L, Bentes C, Clua E, Cataldo E and Marzulo L. (2018). Analysis and Characterization of GPU Benchmarks for Kernel Concurrency Efficiency. High Performance Computing. 10.1007/978-3-319-73353-1_5. (71-86).

    http://link.springer.com/10.1007/978-3-319-73353-1_5

  • Matsumura K, Sato M, Boku T, Podobas A and Matsuoka S. (2018). MACC: An OpenACC Transpiler for Automatic Multi-GPU Use. Supercomputing Frontiers. 10.1007/978-3-319-69953-0_7. (109-127).

    http://link.springer.com/10.1007/978-3-319-69953-0_7

  • Fang J, Zhang P, Tang T, Huang C and Yang C. (2017). Implementing and Evaluating OpenCL on an ARMv8 Multi-Core CPU. 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC). 10.1109/ISPA/IUCC.2017.00131. 978-1-5386-3790-6. (860-867).

    https://ieeexplore.ieee.org/document/8367361/

  • Xiao Y, Xue Y, Nazarian S and Bogdan P. A load balancing inspired optimization framework for exascale multicore systems. Proceedings of the 36th International Conference on Computer-Aided Design. (217-224).

    /doi/10.5555/3199700.3199729

  • Haidl M, Moll S, Klein L, Sun H, Hack S and Gorlatch S. PACXXv2 + RV. Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC. (1-12).

    https://doi.org/10.1145/3148173.3148185

  • Mishra A, Li L, Kong M, Finkel H and Chapman B. Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading. Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC. (1-10).

    https://doi.org/10.1145/3148173.3148184

  • Yoon M, Oh Y, Kim S, Lee S, Kim D and Ro W. Dynamic Resizing on Active Warps Scheduler to Hide Operation Stalls on GPUs. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2017.2704080. 28:11. (3142-3156).

    http://ieeexplore.ieee.org/document/7927466/

  • Xiao Y, Xue Y, Nazarian S and Bogdan P. (2017). A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach. 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 10.1109/ICCAD.2017.8203781. 978-1-5386-3093-8. (217-224).

    http://ieeexplore.ieee.org/document/8203781/

  • Chen G, Zhao Y, Shen X and Zhou H. (2017). EffiSha. ACM SIGPLAN Notices. 52:8. (3-16). Online publication date: 26-Oct-2017.

    https://doi.org/10.1145/3155284.3018748

  • Ausavarungnirun R, Landgraf J, Miller V, Ghose S, Gandhi J, Rossbach C and Mutlu O. Mosaic. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. (136-150).

    https://doi.org/10.1145/3123939.3123975

  • Li A, Zhao W and Song S. BVF. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. (532-545).

    https://doi.org/10.1145/3123939.3123944

  • Lee S and Wu C. (2017). Performance characterization, prediction, and optimization for heterogeneous systems with multi-level memory interference. 2017 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2017.8167755. 978-1-5386-1233-0. (43-53).

    http://ieeexplore.ieee.org/document/8167755/

  • Koo G, Oh Y, Ro W and Annavaram M. (2017). Access Pattern-Aware Cache Management for Improving Data Utilization in GPU. ACM SIGARCH Computer Architecture News. 45:2. (307-319). Online publication date: 14-Sep-2017.

    https://doi.org/10.1145/3140659.3080239

  • Maurer L, Downen P, Ariola Z and Peyton Jones S. (2017). Compiling without continuations. ACM SIGPLAN Notices. 52:6. (482-494). Online publication date: 14-Sep-2017.

    https://doi.org/10.1145/3140587.3062380

  • Mamouras K, Raghothaman M, Alur R, Ives Z and Khanna S. (2017). StreamQRE: modular specification and efficient evaluation of quantitative queries over streaming data. ACM SIGPLAN Notices. 52:6. (693-708). Online publication date: 14-Sep-2017.

    https://doi.org/10.1145/3140587.3062369

  • Feng Y, Martins R, Van Geffen J, Dillig I and Chaudhuri S. (2017). Component-based synthesis of table consolidation and transformation tasks from examples. ACM SIGPLAN Notices. 52:6. (422-436). Online publication date: 14-Sep-2017.

    https://doi.org/10.1145/3140587.3062351

  • Chu S, Weitz K, Cheung A and Suciu D. (2017). HoTTSQL: proving query rewrites with univalent SQL semantics. ACM SIGPLAN Notices. 52:6. (510-524). Online publication date: 14-Sep-2017.

    https://doi.org/10.1145/3140587.3062348

  • Eizenberg A, Peng Y, Pigli T, Mansky W and Devietti J. (2017). BARRACUDA: binary-level analysis of runtime RAces in CUDA programs. ACM SIGPLAN Notices. 52:6. (126-140). Online publication date: 14-Sep-2017.

    https://doi.org/10.1145/3140587.3062342

  • Cummins C, Petoumenos P, Wang Z and Leather H. (2017). End-to-End Deep Learning of Optimization Heuristics. 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2017.24. 978-1-5090-6764-0. (219-232).

    http://ieeexplore.ieee.org/document/8091247/

  • Huang Y and Li D. (2017). Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems. 2017 IEEE International Conference on Cluster Computing (CLUSTER). 10.1109/CLUSTER.2017.42. 978-1-5386-2326-8. (166-177).

    http://ieeexplore.ieee.org/document/8048928/

  • Fang Y, Chen Q, Xiong N, Zhao D and Wang J. (2017). RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization. Sensors. 10.3390/s17081799. 17:8. (1799).

    https://www.mdpi.com/1424-8220/17/8/1799

  • Amrizal M and Takizawa H. (2017). Optimizing Energy Consumption on HPC Systems with a Multi-Level Checkpointing Mechanism. 2017 International Conference on Networking, Architecture, and Storage (NAS). 10.1109/NAS.2017.8026868. 978-1-5386-3486-8. (1-9).

    http://ieeexplore.ieee.org/document/8026868/

  • Ham T, Aragón J and Martonosi M. (2017). Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization. 14:2. (1-27). Online publication date: 30-Jun-2017.

    https://doi.org/10.1145/3075620

  • Koo G, Oh Y, Ro W and Annavaram M. Access Pattern-Aware Cache Management for Improving Data Utilization in GPU. Proceedings of the 44th Annual International Symposium on Computer Architecture. (307-319).

    https://doi.org/10.1145/3079856.3080239

  • Eizenberg A, Peng Y, Pigli T, Mansky W and Devietti J. BARRACUDA: binary-level analysis of runtime RAces in CUDA programs. Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. (126-140).

    https://doi.org/10.1145/3062341.3062342

  • Khairy M, Zahran M and Wassal A. (2017). SACAT. IEEE Transactions on Parallel and Distributed Systems. 28:6. (1740-1753). Online publication date: 1-Jun-2017.

    https://doi.org/10.1109/TPDS.2016.2627560

  • Loghin D, Ramapantulu L and Teo Y. (2017). On Understanding Time, Energy and Cost Performance of Wimpy Heterogeneous Systems for Edge Computing. 2017 IEEE International Conference on Edge Computing (EDGE). 10.1109/IEEE.EDGE.2017.10. 978-1-5386-2017-5. (1-8).

    http://ieeexplore.ieee.org/document/8029250/

  • Losada N, Fraguela B, González P and Martín M. (2017). A portable and adaptable fault tolerance solution for heterogeneous applications. Journal of Parallel and Distributed Computing. 104:C. (146-158). Online publication date: 1-Jun-2017.

    https://doi.org/10.1016/j.jpdc.2017.01.020

  • Che S, Beckmann B and Reinhardt S. (2017). Programming GPGPU Graph Applications with Linear Algebra Building Blocks. International Journal of Parallel Programming. 45:3. (657-679). Online publication date: 1-Jun-2017.

    https://doi.org/10.1007/s10766-016-0448-z

  • Tang L, Barrett R, Cook J and Hu X. (2017). PeaPaw. ACM Transactions on Design Automation of Electronic Systems. 22:3. (1-26). Online publication date: 31-May-2017.

    https://doi.org/10.1145/2999540

  • Gleeson J, Kats D, Mei C and de Lara E. Crane. Proceedings of the 10th ACM International Systems and Storage Conference. (1-13).

    https://doi.org/10.1145/3078468.3078478

  • Wang Q, Xu P, Zhang Y and Chu X. EPPMiner. Proceedings of the Eighth International Conference on Future Energy Systems. (23-33).

    https://doi.org/10.1145/3077839.3077858

  • Hou K, Wang H and Feng W. GPU-UniCache. Proceedings of the Computing Frontiers Conference. (107-116).

    https://doi.org/10.1145/3075564.3075583

  • Wu B, Liu X, Zhou X and Jiang C. (2017). FLEP. ACM SIGPLAN Notices. 52:4. (483-496). Online publication date: 12-May-2017.

    https://doi.org/10.1145/3093336.3037742

  • Wu B, Liu X, Zhou X and Jiang C. (2017). FLEP. ACM SIGARCH Computer Architecture News. 45:1. (483-496). Online publication date: 11-May-2017.

    https://doi.org/10.1145/3093337.3037742

  • Pino S, Pollock L and Chandrasekaran S. (2017). Exploring translation of OpenMP to OpenACC 2.5: lessons learned. 2017 IEEE International Parallel and Distributed Processing Symposium: Workshops (IPDPSW). 10.1109/IPDPSW.2017.84. 978-1-5386-3408-0. (673-682).

    http://ieeexplore.ieee.org/document/7965109/

  • Lal S, Lucas J and Juurlink B. (2017). E^2MC: Entropy Encoding Based Memory Compression for GPUs. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS.2017.101. 978-1-5386-3914-6. (1119-1128).

    http://ieeexplore.ieee.org/document/7967202/

  • Jadidi A, Arjomand M, Kandemir M and Das C. Optimizing energy consumption in GPUs through feedback-driven CTA scheduling. Proceedings of the 25th High Performance Computing Symposium. (1-12).

    https://dl.acm.org/doi/10.5555/3108096.3108108

  • Jun T, Yoo M, Kim D, Cho K, Lee S and Yeun K. HPC Supported Mission-Critical Cloud Architecture. Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering. (223-232).

    https://doi.org/10.1145/3030207.3044531

  • Wu B, Liu X, Zhou X and Jiang C. (2017). FLEP. ACM SIGOPS Operating Systems Review. 10.1145/3093315.3037742. 51:2. (483-496). Online publication date: 4-Apr-2017.

    http://dl.acm.org/citation.cfm?doid=3093315.3037742

  • Wu B, Liu X, Zhou X and Jiang C. FLEP. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. (483-496).

    https://doi.org/10.1145/3037697.3037742

  • Lopes A, Pratas F, Sousa L and Ilic A. (2017). Exploring GPU performance, power and energy-efficiency bounds with Cache-aware Roofline Modeling. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2017.7975297. 978-1-5386-3890-3. (259-268).

    http://ieeexplore.ieee.org/document/7975297/

  • Chen H, Wang M, Hu Y, Song M and Li T. (2017). GaaS workload characterization under NUMA architecture for virtualized GPU. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2017.7975271. 978-1-5386-3890-3. (65-76).

    http://ieeexplore.ieee.org/document/7975271/

  • Gomez-Luna J, Hajj I, Chang L, Garcia-Flores V, de Gonzalo S, Jablin T, Pena A and Hwu W. (2017). Chai: Collaborative heterogeneous applications for integrated-architectures. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2017.7975269. 978-1-5386-3890-3. (43-54).

    https://ieeexplore.ieee.org/document/7975269/

  • Menon V and Raju K. (2017). Performance analysis of ray tracing based rendering using OpenCL. 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). 10.1109/IPACT.2017.8244923. 978-1-5090-5682-8. (1-5).

    http://ieeexplore.ieee.org/document/8244923/

  • Chen G, Shen X, Wu B and Li D. (2017). Optimizing Data Placement on GPU Memory. IEEE Transactions on Computers. 66:3. (473-487). Online publication date: 1-Mar-2017.

    https://doi.org/10.1109/TC.2016.2604372

  • Cummins C, Petoumenos P, Wang Z and Leather H. Synthesizing benchmarks for predictive modeling. Proceedings of the 2017 International Symposium on Code Generation and Optimization. (86-99).

    https://dl.acm.org/doi/10.5555/3049832.3049843

  • Erb C, Collins M and Greathouse J. Dynamic buffer overflow detection for GPGPUs. Proceedings of the 2017 International Symposium on Code Generation and Optimization. (61-73).

    https://dl.acm.org/doi/10.5555/3049832.3049840

  • Zhang F, Wu B, Zhai J, He B and Chen W. FinePar: irregularity-aware fine-grained workload partitioning on integrated architectures. Proceedings of the 2017 International Symposium on Code Generation and Optimization. (27-38).

    https://dl.acm.org/doi/10.5555/3049832.3049836

  • Majumdar A, Piga L, Paul I, Greathouse J, Huang W and Albonesi D. (2017). Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. 2017 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA.2017.34. 978-1-5090-4985-1. (613-624).

    http://ieeexplore.ieee.org/document/7920860/

  • Cummins C, Petoumenos P, Wang Z and Leather H. (2017). Synthesizing benchmarks for predictive modeling. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2017.7863731. 978-1-5090-4931-8. (86-99).

    http://ieeexplore.ieee.org/document/7863731/

  • Erb C, Collins M and Greathouse J. (2017). Dynamic buffer overflow detection for GPGPUs. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2017.7863729. 978-1-5090-4931-8. (61-73).

    http://ieeexplore.ieee.org/document/7863729/

  • Zhang F, Wu B, Zhai J, He B and Chen W. (2017). FinePar: Irregularity-aware fine-grained workload partitioning on integrated architectures. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2017.7863726. 978-1-5090-4931-8. (27-38).

    http://ieeexplore.ieee.org/document/7863726/

  • Chen G, Zhao Y, Shen X and Zhou H. EffiSha. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. (3-16).

    https://doi.org/10.1145/3018743.3018748

  • Benkner S, Pllana S, Träff J, Tsigas P, Richards A, Russell G, Thibault S, Augonnet C, Namyst R, Cornelius H, Keßler C, Moloney D and Sanders P. (2017). Peppher: Performance Portability and Programmability for Heterogeneous Many‐Core Architectures. Programming multi‐core and many‐core computing systems. 10.1002/9781119332015.ch12. (241-260). Online publication date: 24-Jan-2017.

    https://onlinelibrary.wiley.com/doi/10.1002/9781119332015.ch12

  • Tamarit S, Mariño J, Vigueras G and Carro M. (2017). Towards a Semantics-Aware Code Transformation Toolchain for Heterogeneous Systems. Electronic Proceedings in Theoretical Computer Science. 10.4204/EPTCS.237.3. 237. (34-51).

    http://arxiv.org/abs/1701.03319

  • Küsters A, Wienke S and Arnold L. (2017). Performance Portability Analysis for Real-Time Simulations of Smoke Propagation Using OpenACC. High Performance Computing. 10.1007/978-3-319-67630-2_35. (477-495).

    http://link.springer.com/10.1007/978-3-319-67630-2_35

  • Steinbach P and Werner M. (2017). gearshifft – The FFT Benchmark Suite for Heterogeneous Platforms. High Performance Computing. 10.1007/978-3-319-58667-0_11. (199-216).

    http://link.springer.com/10.1007/978-3-319-58667-0_11

  • Tamarit S, Vigueras G, Carro M and Mariño J. (2017). Machine Learning-Driven Automatic Program Transformation to Increase Performance in Heterogeneous Architectures. Tools for High Performance Computing 2016. 10.1007/978-3-319-56702-0_7. (115-140).

    http://link.springer.com/10.1007/978-3-319-56702-0_7

  • Bridges R, Imam N and Mintz T. (2016). Understanding GPU Power. ACM Computing Surveys. 49:3. (1-27). Online publication date: 13-Dec-2016.

    https://doi.org/10.1145/2962131

  • Tupinamba A and Sztajnberg A. (2016). Transparent and Optimized Distributed Processing on GPUs. IEEE Transactions on Parallel and Distributed Systems. 27:12. (3673-3686). Online publication date: 1-Dec-2016.

    https://doi.org/10.1109/TPDS.2016.2550445

  • Xie B, Liu X, McKee S, Zhan J, Jia Z, Wang L and Zhang L. (2016). Understanding Data Analytics Workloads on Intel(R) Xeon Phi(R). 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS). 10.1109/HPCC-SmartCity-DSS.2016.0039. 978-1-5090-4297-5. (206-215).

    http://ieeexplore.ieee.org/document/7828380/

  • Allen T and Ge R. Characterizing power and performance of GPU memory access. Proceedings of the 4th International Workshop on Energy Efficient Supercomputing. (46-53).

    https://dl.acm.org/doi/10.5555/3018076.3018083

  • Allen T and Ge R. (2016). Characterizing Power and Performance of GPU Memory Access. 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC). 10.1109/E2SC.2016.012. 978-1-5090-3856-5. (46-53).

    http://ieeexplore.ieee.org/document/7830508/

  • Hajj I, Gómez-Luna J, Li C, Chang L, Milojicic D and Hwu W. KLAP. The 49th Annual IEEE/ACM International Symposium on Microarchitecture. (1-12).

    https://dl.acm.org/doi/10.5555/3195638.3195654

  • Chang L, Hajj I, Rodrigues C, Gómez-Luna J and Hwu W. Efficient kernel synthesis for performance portable programming. The 49th Annual IEEE/ACM International Symposium on Microarchitecture. (1-13).

    https://dl.acm.org/doi/10.5555/3195638.3195653

  • Yoon M, Kim K, Lee S, Ro W and Annavaram M. (2016). Virtual thread. ACM SIGARCH Computer Architecture News. 44:3. (609-621). Online publication date: 12-Oct-2016.

    https://doi.org/10.1145/3007787.3001201

  • Umar M, Meredith J, Vetter J and Cameron K. (2016). A Study of Power-Performance Modeling Using a Domain-Specific Language. 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 10.1109/SBAC-PAD.2016.19. 978-1-5090-6108-2. (84-92).

    http://ieeexplore.ieee.org/document/7789327/

  • Hajj I, Gomez-Luna J, Li C, Chang L, Milojicic D and Hwu W. (2016). KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 10.1109/MICRO.2016.7783716. 978-1-5090-3508-3. (1-12).

    http://ieeexplore.ieee.org/document/7783716/

  • Chang L, Hajj I, Rodrigues C, Gomez-Luna J and Hwu W. (2016). Efficient kernel synthesis for performance portable programming. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 10.1109/MICRO.2016.7783715. 978-1-5090-3508-3. (1-13).

    http://ieeexplore.ieee.org/document/7783715/

  • Kim K, Park J and Baek W. (2016). IACM: Integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. 2016 IEEE 34th International Conference on Computer Design (ICCD). 10.1109/ICCD.2016.7753308. 978-1-5090-5142-7. (380-383).

    http://ieeexplore.ieee.org/document/7753308/

  • Wang B, Zhu Y and Yu W. OAWS. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. (45-55).

    https://doi.org/10.1145/2967938.2967947

  • Kayiran O, Jog A, Pattnaik A, Ausavarungnirun R, Tang X, Kandemir M, Loh G, Mutlu O and Das C. μC-States. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. (17-30).

    https://doi.org/10.1145/2967938.2967941

  • Pattnaik A, Tang X, Jog A, Kayiran O, Mishra A, Kandemir M, Mutlu O and Das C. Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. (31-44).

    https://doi.org/10.1145/2967938.2967940

  • Saussard R, Bouzid B, Vasiliu M and Reynaud R. (2016). A Robust Methodology for Performance Analysis on Hybrid Embedded Multicore Architectures. 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC). 10.1109/MCSoC.2016.35. 978-1-5090-3531-1. (77-84).

    http://ieeexplore.ieee.org/document/7774423/

  • Adhinarayanan V, Paul I, Greathouse J, Huang W, Pattnaik A and Feng W. (2016). Measuring and modeling on-chip interconnect power on real hardware. 2016 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2016.7581263. 978-1-5090-3896-1. (1-11).

    http://ieeexplore.ieee.org/document/7581263/

  • Sun Y, Gong X, Ziabari A, Yu L, Li X, Mukherjee S, Mccardwell C, Villegas A and Kaeli D. (2016). Hetero-mark, a benchmark suite for CPU-GPU collaborative computing. 2016 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2016.7581262. 978-1-5090-3896-1. (1-10).

    http://ieeexplore.ieee.org/document/7581262/

  • Chang L, Kim H and Hwu W. (2016). DySel. ACM SIGARCH Computer Architecture News. 44:2. (667-680). Online publication date: 29-Jul-2016.

    https://doi.org/10.1145/2980024.2872373

  • Gallardo E, Teller P, Argueta A and Jaloma J. Cross-Accelerator Performance Profiling. Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale. (1-8).

    https://doi.org/10.1145/2949550.2949567

  • Sen R and Wood D. (2016). GPGPU Footprint Models to Estimate per-Core Power. IEEE Computer Architecture Letters. 15:2. (97-100). Online publication date: 1-Jul-2016.

    https://doi.org/10.1109/LCA.2015.2456909

  • Delporte B, Rigamonti R and Dassatti A. (2016). HPA: An opportunistic approach to embedded energy efficiency. 2016 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCSim.2016.7568415. 978-1-5090-2088-1. (792-799).

    http://ieeexplore.ieee.org/document/7568415/

  • Obrecht C, Asinari P, Kuznik F and Roux J. (2016). Thermal link-wise artificial compressibility method. Computers & Mathematics with Applications. 72:2. (375-385). Online publication date: 1-Jul-2016.

    https://doi.org/10.1016/j.camwa.2015.05.022

  • Jog A, Kayiran O, Pattnaik A, Kandemir M, Mutlu O, Iyer R and Das C. (2016). Exploiting Core Criticality for Enhanced GPU Performance. ACM SIGMETRICS Performance Evaluation Review. 44:1. (351-363). Online publication date: 30-Jun-2016.

    https://doi.org/10.1145/2964791.2901468

  • Yoon M, Kim K, Lee S, Ro W and Annavaram M. Virtual thread. Proceedings of the 43rd International Symposium on Computer Architecture. (609-621).

    https://doi.org/10.1109/ISCA.2016.59

  • Shen D, Liu X and Lin F. Characterizing emerging heterogeneous memory. Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management. (13-23).

    https://doi.org/10.1145/2926697.2926702

  • Jog A, Kayiran O, Pattnaik A, Kandemir M, Mutlu O, Iyer R and Das C. Exploiting Core Criticality for Enhanced GPU Performance. Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science. (351-363).

    https://doi.org/10.1145/2896377.2901468

  • Hasabnis N and Sekar R. (2016). Lifting Assembly to Intermediate Representation. ACM SIGPLAN Notices. 51:4. (311-324). Online publication date: 9-Jun-2016.

    https://doi.org/10.1145/2954679.2872380

  • Chang L, Kim H and Hwu W. (2016). DySel. ACM SIGPLAN Notices. 51:4. (667-680). Online publication date: 9-Jun-2016.

    https://doi.org/10.1145/2954679.2872373

  • Panda R, Eckert Y, Jayasena N, Kayiran O, Boyer M and John L. Prefetching Techniques for Near-memory Throughput Processors. Proceedings of the 2016 International Conference on Supercomputing. (1-14).

    https://doi.org/10.1145/2925426.2926282

  • Chen G and Shen X. Coherence-Free Multiview. Proceedings of the 2016 International Conference on Supercomputing. (1-13).

    https://doi.org/10.1145/2925426.2926277

  • Kumar S, Srinivasan V, Sharifian A, Sumner N and Shriraman A. Peruse and Profit. Proceedings of the 2016 International Conference on Supercomputing. (1-13).

    https://doi.org/10.1145/2925426.2926269

  • Li A, Song S, Wijtvliet M, Kumar A and Corporaal H. SFU-Driven Transparent Approximation Acceleration on GPUs. Proceedings of the 2016 International Conference on Supercomputing. (1-14).

    https://doi.org/10.1145/2925426.2926255

  • Wu W, Bosilca G, vandeVaart R, Jeaugey S and Dongarra J. GPU-Aware Non-contiguous Data Movement In Open MPI. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. (231-242).

    https://doi.org/10.1145/2907294.2907317

  • Adhinarayanan V, Subramaniam B and Feng W. Online power estimation of graphics processing units. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (245-254).

    https://doi.org/10.1109/CCGrid.2016.93

  • Ukidave Y, Li X and Kaeli D. (2016). Mystic: Predictive Scheduling for GPU Based Cloud Servers Using Machine Learning. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS.2016.73. 978-1-5090-2140-6. (353-362).

    http://ieeexplore.ieee.org/document/7516031/

  • Tallent N, Manzano J, Gawande N, Kang S, Kerbyson D, Hoisie A and Cross J. (2016). Algorithm and Architecture Independent Benchmarking with SEAK. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS.2016.25. 978-1-5090-2140-6. (63-72).

    http://ieeexplore.ieee.org/document/7516002/

  • Heinecke A, Karlstetter R, Pflüger D and Bungartz H. (2016). Data mining on vast data sets as a cluster system benchmark. Concurrency and Computation: Practice & Experience. 28:7. (2145-2165). Online publication date: 1-May-2016.

    https://doi.org/10.1002/cpe.3514

  • Aviv R and Wang G. OpenCL-Based Mobile GPGPU Benchmarking. Proceedings of the 4th International Workshop on OpenCL. (1-4).

    https://doi.org/10.1145/2909437.2909441

  • Dev K, Paul I and Huang W. A framework for evaluating promising power efficiency techniques in future GPUs for HPC. Proceedings of the 24th High Performance Computing Symposium. (1-8).

    https://doi.org/10.22360/SpringSim.2016.HPC.003

  • Adhinarayanan V and Feng W. (2016). An automated framework for characterizing and subsetting GPGPU workloads. 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2016.7482105. 978-1-5090-1953-3. (307-317).

    http://ieeexplore.ieee.org/document/7482105/

  • Giefers H, Staar P, Bekas C and Hagleitner C. (2016). Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA. 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2016.7482073. 978-1-5090-1953-3. (46-56).

    http://ieeexplore.ieee.org/document/7482073/

  • Chang L, Kim H and Hwu W. (2016). DySel. ACM SIGOPS Operating Systems Review. 10.1145/2954680.2872373. 50:2. (667-680). Online publication date: 25-Mar-2016.

    http://dl.acm.org/citation.cfm?doid=2954680.2872373

  • Chang L, Kim H and Hwu W. DySel. Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. (667-680).

    https://doi.org/10.1145/2872362.2872373

  • Soldado F, Alexandre F and Paulino H. (2016). Execution of compound multi-kernel OpenCL computations in multi-CPU/multi-GPU environments. Concurrency and Computation: Practice & Experience. 28:3. (768-787). Online publication date: 10-Mar-2016.

    https://doi.org/10.1002/cpe.3612

  • de Oliveira D, Pilla L, Santini T and Rech P. (2016). Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units. IEEE Transactions on Computers. 65:3. (791-804). Online publication date: 1-Mar-2016.

    https://doi.org/10.1109/TC.2015.2444855

  • Wong D, Kim N and Annavaram M. (2016). Approximating warps with intra-warp operand value similarity. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2016.7446063. 978-1-4673-9211-2. (176-187).

    http://ieeexplore.ieee.org/document/7446063/

  • Wu J, Belevich A, Bendersky E, Heffernan M, Leary C, Pienaar J, Roune B, Springer R, Weng X and Hundt R. gpucc: an open-source GPGPU compiler. Proceedings of the 2016 International Symposium on Code Generation and Optimization. (105-116).

    https://doi.org/10.1145/2854038.2854041

  • Langenkämper D, Jakobi T, Feld D, Jelonek L, Goesmann A and Nattkemper T. (2016). Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations. Frontiers in Genetics. 10.3389/fgene.2016.00005. 7.

    http://journal.frontiersin.org/Article/10.3389/fgene.2016.00005/abstract

  • Lopez-Novoa U, Mendiburu A and Miguel-Alonso J. (2016). Kernel density estimation in accelerators. The Journal of Supercomputing. 72:2. (545-566). Online publication date: 1-Feb-2016.

    https://doi.org/10.1007/s11227-015-1577-7

  • Cherubin S, Scandale M and Agosta G. Stack size estimation on machine-independent intermediate code for OpenCL kernels. Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms. (1-6).

    https://doi.org/10.1145/2872421.2872425

  • Paul I, Huang W, Arora M and Yalamanchili S. (2015). Harmonia. ACM SIGARCH Computer Architecture News. 43:3S. (54-65). Online publication date: 4-Jan-2016.

    https://doi.org/10.1145/2872887.2750404

  • Bhura M, Deshpande P and Chandrasekaran K. (2016). CUDA or OpenCL. Research Advances in the Integration of Big Data and Smart Computing. 10.4018/978-1-4666-8737-0.ch015. (267-279).

    http://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/978-1-4666-8737-0.ch015

  • Welch A and Venkata M. (2016). On Synchronisation and Memory Reuse in OpenSHMEM. OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments. 10.1007/978-3-319-50995-2_6. (82-94).

    http://link.springer.com/10.1007/978-3-319-50995-2_6

  • Grodowitz M, D’Azevedo E, Powers S and Imam N. (2016). Using Hybrid Model OpenSHMEM + CUDA to Implement the SHOC Benchmark Suite. OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments. 10.1007/978-3-319-50995-2_14. (204-216).

    http://link.springer.com/10.1007/978-3-319-50995-2_14

  • Deakin T, Price J, Martineau M and McIntosh-Smith S. (2016). GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models. High Performance Computing. 10.1007/978-3-319-46079-6_34. (489-507).

    http://link.springer.com/10.1007/978-3-319-46079-6_34

  • Manochio R, Buzatto D, de Ávila P and Pantoni R. (2016). Algorithms Performance Evaluation in Hybrid Systems. Information Technology: New Generations. 10.1007/978-3-319-32467-8_101. (1169-1181).

    http://link.springer.com/10.1007/978-3-319-32467-8_101

  • Steuwer M, Fensch C, Lindley S and Dubach C. (2015). Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code. ACM SIGPLAN Notices. 50:9. (205-217). Online publication date: 18-Dec-2015.

    https://doi.org/10.1145/2858949.2784754

  • Daga M and Greathouse J. Structural Agnostic SpMV. Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC). (64-74).

    https://doi.org/10.1109/HiPC.2015.55

  • Li T, Narayana V and El-Ghazawi T. A Power-Aware Symbiotic Scheduling Algorithm for Concurrent GPU Kernels. Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS). (562-569).

    https://doi.org/10.1109/ICPADS.2015.76

  • Chen G and Shen X. Free launch. Proceedings of the 48th International Symposium on Microarchitecture. (407-419).

    https://doi.org/10.1145/2830772.2830818

  • Ham T, Aragón J and Martonosi M. DeSC. Proceedings of the 48th International Symposium on Microarchitecture. (191-203).

    https://doi.org/10.1145/2830772.2830800

  • Shao Y and Brooks D. (2015). Research Infrastructures for Hardware Accelerators. Synthesis Lectures on Computer Architecture. 10.2200/S00677ED1V01Y201511CAC034. 10:4. (1-99). Online publication date: 18-Nov-2015.

    http://www.morganclaypool.com/doi/10.2200/S00677ED1V01Y201511CAC034

  • Lopez M, Young J, Meredith J, Roth P, Horton M and Vetter J. Examining recent many-core architectures and programming models using SHOC. Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems. (1-12).

    https://doi.org/10.1145/2832087.2832090

  • Wen M, Huang D, Xun C and Chen D. (2015). Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations. Frontiers of Information Technology & Electronic Engineering. 10.1631/FITEE.1500032. 16:11. (899-916). Online publication date: 1-Nov-2015.

    http://link.springer.com/10.1631/FITEE.1500032

  • Baghdadi R, Beaugnon U, Cohen A, Grosser T, Kruse M, Reddy C, Verdoolaege S, Betts A, Donaldson A, Ketema J, Absar J, Haastregt S, Kravets A, Lokhmotov A, David R and Hajiyev E. PENCIL. Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT). (138-149).

    https://doi.org/10.1109/PACT.2015.17

  • Pisal T, Walunj S, Shrimali A, Gautam O and Patil L. Acceleration of CUDA programs for non-GPU users using cloud. Proceedings of the 2015 International Conference on Green Computing and Internet of Things (ICGCIoT). (365-370).

    https://doi.org/10.1109/ICGCIoT.2015.7380490

  • Jog A, Kayiran O, Kesten T, Pattnaik A, Bolotin E, Chatterjee N, Keckler S, Kandemir M and Das C. Anatomy of GPU Memory System for Multi-Application Execution. Proceedings of the 2015 International Symposium on Memory Systems. (223-234).

    https://doi.org/10.1145/2818950.2818979

  • Majumdar A, Wu G, Dev K, Greathouse J, Paul I, Huang W, Venugopal A, Piga L, Freitag C and Puthoor S. A Taxonomy of GPGPU Performance Scaling. Proceedings of the 2015 IEEE International Symposium on Workload Characterization. (118-119).

    https://doi.org/10.1109/IISWC.2015.22

  • Awan A, Hamidouche K, Venkatesh A, Perkins J, Subramoni H and Panda D. GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks. Proceedings of the 22nd European MPI Users' Group Meeting. (1-10).

    https://doi.org/10.1145/2802658.2802672

  • Aji A, Peña A, Balaji P and Feng W. Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL. Proceedings of the 2015 IEEE International Conference on Cluster Computing. (42-51).

    https://doi.org/10.1109/CLUSTER.2015.15

  • Ryoo J, Quirem S, Lebeane M, Panda R, Song S and John L. GPGPU Benchmark Suites. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP). (320-329).

    https://doi.org/10.1109/ICPP.2015.41

  • Vilches A, Asenjo R, Navarro A, Corbera F, Gran R and Garzarán M. (2015). Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips. Procedia Computer Science. 51:C. (140-149). Online publication date: 1-Sep-2015.

    https://doi.org/10.1016/j.procs.2015.05.213

  • Loghin D, Ramapantulu L, Barbu O and Teo Y. (2015). A time-energy performance analysis of MapReduce on heterogeneous systems with GPUs. Performance Evaluation. 91:C. (255-269). Online publication date: 1-Sep-2015.

    https://doi.org/10.1016/j.peva.2015.06.015

  • Walsh J and Dukes J. Application Support for Virtual GPGPUs in Grid Infrastructures. Proceedings of the 2015 IEEE 11th International Conference on e-Science. (67-77).

    https://doi.org/10.1109/eScience.2015.45

  • Steuwer M, Fensch C, Lindley S and Dubach C. Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code. Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming. (205-217).

    https://doi.org/10.1145/2784731.2784754

  • Mittal S and Vetter J. (2015). A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys. 47:4. (1-35). Online publication date: 21-Jul-2015.

    https://doi.org/10.1145/2788396

  • Dao T, Kim J, Seo S, Egger B and Lee J. A Performance Model for GPUs with Caches. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2014.2333526. 26:7. (1800-1813).

    http://ieeexplore.ieee.org/document/6844867/

  • Chen G, Wu B, Li D and Shen X. (2015). Enabling Portable Optimizations of Data Placement on GPU. IEEE Micro. 35:4. (16-24). Online publication date: 1-Jul-2015.

    https://doi.org/10.1109/MM.2015.53

  • Zheng Z, Wang Z and Lipasti M. (2015). Adaptive Cache and Concurrency Allocation on GPGPUs. IEEE Computer Architecture Letters. 14:2. (90-93). Online publication date: 1-Jul-2015.

    https://doi.org/10.1109/LCA.2014.2359882

  • Tamarit S, Vigueras G, Carro M and Mariño J. A Haskell Implementation of a Rule-Based Program Transformation for C Programs. Proceedings of the 17th International Symposium on Practical Aspects of Declarative Languages - Volume 9131. (105-114).

    https://doi.org/10.1007/978-3-319-19686-2_8

  • Paul I, Huang W, Arora M and Yalamanchili S. Harmonia. Proceedings of the 42nd Annual International Symposium on Computer Architecture. (54-65).

    https://doi.org/10.1145/2749469.2750404

  • Wang B, Yu W, Sun X and Wang X. DaCache. Proceedings of the 29th ACM on International Conference on Supercomputing. (89-98).

    https://doi.org/10.1145/2751205.2751239

  • Wu B, Chen G, Li D, Shen X and Vetter J. Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations. Proceedings of the 29th ACM on International Conference on Supercomputing. (119-130).

    https://doi.org/10.1145/2751205.2751213

  • Ndu G, Navaridas J and Luján M. CHO. Proceedings of the 3rd International Workshop on OpenCL. (1-10).

    https://doi.org/10.1145/2791321.2791331

  • Shao Y, Reagen B, Gu-Yeon Wei and Brooks D. (2015). The Aladdin Approach to Accelerator Design and Modeling. IEEE Micro. 35:3. (58-70). Online publication date: 1-May-2015.

    https://doi.org/10.1109/MM.2015.50

  • Tang L, Hu X and Barrett R. PerDome. Proceedings of the Symposium on High Performance Computing. (225-232).

    /doi/10.5555/2872599.2872627

  • Wang B, Liu Z, Wang X and Yu W. Eliminating intra-warp conflict misses in GPU. Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. (689-694).

    /doi/10.5555/2755753.2755911

  • Guttman D and Kandemir M. (2015). Performance and energy evaluation of data prefetching on Intel Xeon Phi. 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2015.7095814. 978-1-4799-1957-4. (288-297).

    http://ieeexplore.ieee.org/document/7095814/

  • Oka K, Jia W, Martonosi M and Inoue K. (2015). Characterization and cross-platform analysis of high-throughput accelerators. 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2015.7095797. 978-1-4799-1957-4. (161-162).

    http://ieeexplore.ieee.org/document/7095797/

  • Guttman D, Kandemir M, Arunachalam M and Khanna R. (2015). Machine learning techniques for improved data prefetching. 2015 International Conference on Energy Aware Computing (ICEAC). 10.1109/ICEAC.2015.7352208. 978-1-4799-1771-6. (1-4).

    http://ieeexplore.ieee.org/document/7352208/

  • Saeed I, Young J and Yalamanchili S. A portable benchmark suite for highly parallel data intensive query processing. Proceedings of the 2nd Workshop on Parallel Programming for Analytics Applications. (31-38).

    https://doi.org/10.1145/2726935.2726943

  • Fauzia N, Pouchet L and Sadayappan P. Characterizing and enhancing global memory data coalescing on GPUs. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization. (12-22).

    /doi/10.5555/2738600.2738603

  • Tarakji A, Börger L and Leupers R. A comparative investigation of device-specific mechanisms for exploiting HPC accelerators. Proceedings of the 8th Workshop on General Purpose Processing using GPUs. (1-12).

    https://doi.org/10.1145/2716282.2716293

  • Khairy M, Zahran M and Wassal A. Efficient utilization of GPGPU cache hierarchy. Proceedings of the 8th Workshop on General Purpose Processing using GPUs. (36-47).

    https://doi.org/10.1145/2716282.2716291

  • Naik V and Kusur C. (2015). Analysis of performance enhancement on graphic processor based heterogeneous architecture: A CUDA and MATLAB experiment. 2015 National Conference on Parallel Computing Technologies (PARCOMPTECH). 10.1109/PARCOMPTECH.2015.7084519. 978-1-4799-6916-6. (1-5).

    http://ieeexplore.ieee.org/document/7084519/

  • Wu G, Greathouse J, Lyashevsky A, Jayasena N and Chiou D. (2015). GPGPU performance and power estimation using machine learning. 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2015.7056063. 978-1-4799-8930-0. (564-576).

    http://ieeexplore.ieee.org/document/7056063/

  • Tiwari D, Gupta S, Rogers J, Maxwell D, Rech P, Vazhkudai S, Oliveira D, Londo D, DeBardeleben N, Navaux P, Carro L and Bland A. (2015). Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2015.7056044. 978-1-4799-8930-0. (331-342).

    http://ieeexplore.ieee.org/document/7056044/

  • Fauzia N, Pouchet L and Sadayappan P. (2015). Characterizing and enhancing global memory data coalescing on GPUs. 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2015.7054183. 978-1-4799-8161-8. (12-22).

    http://ieeexplore.ieee.org/document/7054183/

  • Ukidave Y, Paravecino F, Yu L, Kalra C, Momeni A, Chen Z, Materise N, Daley B, Mistry P and Kaeli D. NUPAR. Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering. (253-264).

    https://doi.org/10.1145/2668930.2688046

  • Elangovan V, Badia R and Ayguadé E. Auto-Tuning OmpSs-OpenCL Kernels Across GPU Machines. Proceedings of the 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures. (31-36).

    https://doi.org/10.1145/2701310.2701316

  • Schaub T, Moll S, Karrenberg R and Hack S. (2015). The Impact of the SIMD Width on Control-Flow and Memory Divergence. ACM Transactions on Architecture and Code Optimization. 11:4. (1-25). Online publication date: 9-Jan-2015.

    https://doi.org/10.1145/2687355

  • Wang Z, Grewe D and O'Boyle M. (2014). Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems. ACM Transactions on Architecture and Code Optimization. 11:4. (1-26). Online publication date: 9-Jan-2015.

    https://doi.org/10.1145/2677036

  • Mittal S and Vetter J. (2014). A Survey of Methods for Analyzing and Improving GPU Energy Efficiency. ACM Computing Surveys. 47:2. (1-23). Online publication date: 8-Jan-2015.

    https://doi.org/10.1145/2636342

  • Suwancharoen C and Marurngsith W. (2015). Compiler Support for Accelerating C++11 Range-Based Loops on Heterogeneous Systems. International Journal of Computer and Electrical Engineering. 10.17706/IJCEE.2015.V7.877. 7:2. (109-117).

    http://www.ijcee.org/index.php?m=content&c=index&a=show&catid=73&id=990

  • HUANG D, XUN C, WU N, WEN M, ZHANG C, CAI X and YANG Q. (2015). Enabling a Uniform OpenCL Device View for Heterogeneous Platforms. IEICE Transactions on Information and Systems. 10.1587/transinf.2014EDP7244. E98.D:4. (812-823).

    https://www.jstage.jst.go.jp/article/transinf/E98.D/4/E98.D_2014EDP7244/_article

  • Pallipuram V, Smith M, Sarma N, Anand R, Weill E and Sapra K. (2015). Subjective versus objective. The Journal of Supercomputing. 71:1. (162-201). Online publication date: 1-Jan-2015.

    https://doi.org/10.1007/s11227-014-1292-9

  • Juckeland G, Brantley W, Chandrasekaran S, Chapman B, Che S, Colgrove M, Feng H, Grund A, Henschel R, Hwu W, Li H, Müller M, Nagel W, Perminov M, Shelepugin P, Skadron K, Stratton J, Titov A, Wang K, van Waveren M, Whitney B, Wienke S, Xu R and Kumaran K. (2015). SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance. High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. 10.1007/978-3-319-17248-4_3. (46-67).

    https://link.springer.com/10.1007/978-3-319-17248-4_3

  • Chen G, Wu B, Li D and Shen X. PORPLE. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. (88-100).

    https://doi.org/10.1109/MICRO.2014.20

  • Agosta G, Barenghi A, Pelosi G and Scandale M. Towards Transparently Tackling Functionality and Performance Issues across Different OpenCL Platforms. Proceedings of the 2014 Second International Symposium on Computing and Networking. (130-136).

    https://doi.org/10.1109/CANDAR.2014.53

  • Gao S and Chritz J. (2014). Characterization of OpenCL on a scalable FPGA architecture. 2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig). 10.1109/ReConFig.2014.7032505. 978-1-4799-5944-0. (1-6).

    http://ieeexplore.ieee.org/document/7032505/

  • Sajjapongse K, Agarwal T and Becchi M. (2014). A flexible scheduling framework for heterogeneous CPU-GPU clusters. 2014 21st International Conference on High Performance Computing (HiPC). 10.1109/HiPC.2014.7116892. 978-1-4799-5976-1. (1-11).

    http://ieeexplore.ieee.org/document/7116892/

  • Greathouse J and Daga M. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (769-780).

    https://doi.org/10.1109/SC.2014.68

  • Shi R, Lu X, Potluri S, Hamidouche K, Zhang J and Panda D. HAND. Proceedings of the 2014 Brazilian Conference on Intelligent Systems. (221-230).

    https://doi.org/10.1109/ICPP.2014.31

  • Shao Y, Reagen B, Wei G and Brooks D. (2014). Aladdin. ACM SIGARCH Computer Architecture News. 42:3. (97-108). Online publication date: 16-Oct-2014.

    https://doi.org/10.1145/2678373.2665689

  • Jenkins J, Dinan J, Balaji P, Peterka T, Samatova N and Thakur R. Processing MPI Derived Datatypes on Noncontiguous GPU-Resident Data. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2013.234. 25:10. (2627-2637).

    http://ieeexplore.ieee.org/document/6600679/

  • Reagen B, Adolf R, Shao Y, Wei G and Brooks D. (2014). MachSuite: Benchmarks for accelerator design and customized architectures. 2014 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2014.6983050. 978-1-4799-6454-3. (110-119).

    http://ieeexplore.ieee.org/document/6983050/

  • Wang J and Yalamanchili S. (2014). Characterization and analysis of dynamic parallelism in unstructured GPU applications. 2014 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2014.6983039. 978-1-4799-6454-3. (51-60).

    http://ieeexplore.ieee.org/document/6983039/

  • Che S. (2014). GasCL: A vertex-centric graph model for GPUs. 2014 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2014.7040962. 978-1-4799-6233-4. (1-6).

    http://ieeexplore.ieee.org/document/7040962/

  • Che S, Beckmann B and Reinhardt S. (2014). BelRed: Constructing GPGPU graph applications with software building blocks. 2014 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2014.7040961. 978-1-4799-6233-4. (1-6).

    http://ieeexplore.ieee.org/document/7040961/

  • Romero P and Idler C. (2014). Methodologies and application of machine learning algorithms to classify the performance of high performance cluster components. 2014 IEEE International Conference On Cluster Computing (CLUSTER). 10.1109/CLUSTER.2014.6968669. 978-1-4799-5548-0. (400-407).

    http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6968669

  • Mateo Lázaro J, Sánchez Navarro J, García Gil A and Edo Romero V. (2014). 3D-geological structures with digital elevation models using GPU programming. Computers & Geosciences. 10.1016/j.cageo.2014.05.014. 70. (138-146). Online publication date: 1-Sep-2014.

    https://linkinghub.elsevier.com/retrieve/pii/S0098300414001411

  • Griessl R, Peykanu M, Hagemeyer J, Porrmann M, Krupop S, Berge M, Kiesel T and Christmann W. A Scalable Server Architecture for Next-Generation Heterogeneous Compute Clusters. Proceedings of the 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing. (146-153).

    https://doi.org/10.1109/EUC.2014.29

  • Shen J, Varbanescu A and Sips H. Look before You Leap. Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS). (383-391).

    https://doi.org/10.1109/HPCC.2014.65

  • Ukidave Y, Ziabari A, Mistry P, Schirner G and Kaeli D. (2014). Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms. International Journal of High Performance Computing Applications. 28:3. (319-334). Online publication date: 1-Aug-2014.

    https://doi.org/10.1177/1094342014526907

  • Breslauer D and Galil Z. (2014). Real-Time Streaming String-Matching. ACM Transactions on Algorithms. 10:4. (1-12). Online publication date: 1-Aug-2014.

    https://doi.org/10.1145/2635814

  • Tamizharasan P, Yadav P, Ramasubramanian N and Geetha K. (2014). Performance enhancing factors for manycore architectures: State-of-the-art. 2014 International Conference on Networks & Soft Computing (ICNSC). 10.1109/CNSC.2014.6906686. 978-1-4799-3486-7. (278-283).

    http://ieeexplore.ieee.org/document/6906686/

  • Yan X, Shi X, Wang L and Yang H. (2014). An OpenCL micro-benchmark suite for GPUs and CPUs. The Journal of Supercomputing. 69:2. (693-713). Online publication date: 1-Aug-2014.

    https://doi.org/10.1007/s11227-014-1112-2

  • Bardsley E, Betts A, Chong N, Collingbourne P, Deligiannis P, Donaldson A, Ketema J, Liew D and Qadeer S. Engineering a Static Verification Tool for GPU Kernels. Proceedings of the 16th International Conference on Computer Aided Verification - Volume 8559. (226-242).

    https://doi.org/10.1007/978-3-319-08867-9_15

  • Merritt A, Farooqui N, Slawinska M, Gavrilovska A, Schwan K and Gupta V. Slices. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment. (1-8).

    https://doi.org/10.1145/2616498.2616531

  • Walters J, Younge A, Kang D, Yao K, Kang M, Crago S and Fox G. GPU Passthrough Performance. Proceedings of the 2014 IEEE International Conference on Cloud Computing. (636-643).

    https://doi.org/10.1109/CLOUD.2014.90

  • Elangovan V, Badia R and Ayguadé E. Scalability and Parallel Execution of OmpSs-OpenCL Tasks on Heterogeneous CPU-GPU Environment. Proceedings of the 29th International Conference on Supercomputing - Volume 8488. (141-155).

    https://doi.org/10.1007/978-3-319-07518-1_9

  • Shao Y, Reagen B, Wei G and Brooks D. Aladdin. Proceeding of the 41st annual international symposium on Computer architecture. (97-108).

    /doi/10.5555/2665671.2665689

  • Shao Y, Reagen B, Wei G and Brooks D. (2014). Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). 10.1109/ISCA.2014.6853196. 978-1-4799-4394-4. (97-108).

    http://ieeexplore.ieee.org/document/6853196/

  • Krommydas K, Feng W, Owaida M, Antonopoulos C and Bellas N. (2014). On the characterization of OpenCL dwarfs on fixed and reconfigurable platforms. 2014 IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 10.1109/ASAP.2014.6868650. 978-1-4799-3609-0. (153-160).

    http://ieeexplore.ieee.org/document/6868650/

  • Iparraguirre J, Balmaceda L and Mariani C. (2014). Speeded-up robust features (SURF) as a benchmark for heterogeneous computers. 2014 IEEE Biennial Congress of Argentina (ARGENCON). 10.1109/ARGENCON.2014.6868545. 978-1-4799-4269-5. (519-524).

    http://ieeexplore.ieee.org/document/6868545/

  • Younge A and Fox G. Advanced virtualization techniques for high performance cloud cyberinfrastructure. Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (583-586).

    https://doi.org/10.1109/CCGrid.2014.93

  • Younge A, Walters J, Crago S and Fox G. Evaluating GPU Passthrough in Xen for High Performance Cloud Computing. Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. (852-859).

    https://doi.org/10.1109/IPDPSW.2014.97

  • Che S and Skadron K. (2014). BenchFriend. International Journal of High Performance Computing Applications. 28:2. (238-250). Online publication date: 1-May-2014.

    https://doi.org/10.1177/1094342013507960

  • Zhang D, Xu L and Howes L. Efficient parallel image clustering and search on a heterogeneous platform. Proceedings of the High Performance Computing Symposium. (1-8).

    /doi/10.5555/2663510.2663527

  • Paul I, Ravi V, Manne S, Arora M and Yalamanchili S. (2014). Coordinated energy management in heterogeneous processors. Scientific Programming. 22:2. (93-108). Online publication date: 1-Apr-2014.

    https://doi.org/10.1155/2014/210762

  • Alexandre F, Marques R and Paulino H. On the support of task-parallel algorithmic skeletons for multi-GPU computing. Proceedings of the 29th Annual ACM Symposium on Applied Computing. (880-885).

    https://doi.org/10.1145/2554850.2555018

  • Boulos V, Huet S, Fristot V, Salvo L and Houzet D. (2014). Efficient implementation of data flow graphs on multi-gpu clusters. Journal of Real-Time Image Processing. 9:1. (217-232). Online publication date: 1-Mar-2014.

    https://doi.org/10.1007/s11554-012-0279-0

  • Chong N, Donaldson A and Ketema J. (2014). A sound and complete abstraction for reasoning about parallel prefix sums. ACM SIGPLAN Notices. 49:1. (397-409). Online publication date: 13-Jan-2014.

    https://doi.org/10.1145/2578855.2535882

  • Chong N, Donaldson A and Ketema J. A sound and complete abstraction for reasoning about parallel prefix sums. Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. (397-409).

    https://doi.org/10.1145/2535838.2535882

  • Zhao Q, Yang H, Wei G, Luan Z and Qian D. (2014). Energy Efficiency Evaluation of Workload Execution on Intel Xeon Phi Coprocessor. Trustworthy Computing and Services. 10.1007/978-3-662-43908-1_34. (268-275).

    https://link.springer.com/10.1007/978-3-662-43908-1_34

  • DeBardeleben N, Blanchard S, Monroe L, Romero P, Grunau D, Idler C and Wright C. (2014). GPU Behavior on a Large HPC Cluster. Euro-Par 2013: Parallel Processing Workshops. 10.1007/978-3-642-54420-0_66. (680-689).

    http://link.springer.com/10.1007/978-3-642-54420-0_66

  • Rogers T, O'Connor M and Aamodt T. Divergence-aware warp scheduling. Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. (99-110).

    https://doi.org/10.1145/2540708.2540718

  • Panwar L, Aji A, Meng J, Balaji P and Feng W. (2013). Online Performance Projection for Clusters with Heterogeneous GPUs. 2013 International Conference on Parallel and Distributed Systems (ICPADS). 10.1109/ICPADS.2013.48. 978-1-4799-2081-5. (283-290).

    http://ieeexplore.ieee.org/document/6808185/

  • Shen J, Fang J, Sips H and Varbanescu A. (2013). An application-centric evaluation of OpenCL on multi-core CPUs. Parallel Computing. 39:12. (834-850). Online publication date: 1-Dec-2013.

    https://doi.org/10.1016/j.parco.2013.08.009

  • Viñas M, Bozkus Z and Fraguela B. (2013). Exploiting heterogeneous parallelism with the Heterogeneous Programming Library. Journal of Parallel and Distributed Computing. 73:12. (1627-1638). Online publication date: 1-Dec-2013.

    https://doi.org/10.1016/j.jpdc.2013.07.013

  • Paul I, Ravi V, Manne S, Arora M and Yalamanchili S. Coordinated energy management in heterogeneous processors. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-12).

    https://doi.org/10.1145/2503210.2503227

  • Kim S, Roy I and Talwar V. Evaluating integrated graphics processors for data center workloads. Proceedings of the Workshop on Power-Aware Computing and Systems. (1-5).

    https://doi.org/10.1145/2525526.2525847

  • Xun C, Chen D, Lan Q and Zhang C. (2013). Efficient fine-grained shared buffer management for multiple OpenCL devices. Journal of Zhejiang University SCIENCE C. 10.1631/jzus.C1300078. 14:11. (859-872). Online publication date: 1-Nov-2013.

    http://link.springer.com/10.1631/jzus.C1300078

  • Ji F, Lin H and Ma X. RSVM. Proceedings of the 22nd international conference on Parallel architectures and compilation techniques. (269-278).

    /doi/10.5555/2523721.2523758

  • Feng Ji, Heshan Lin and Xiaosong Ma. (2013). Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG. 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2013.6618823. 978-1-4799-1018-2. (341-352).

    http://ieeexplore.ieee.org/document/6618823/

  • Komoda T, Miwa S, Nakamura H and Maruyama N. Integrating Multi-GPU Execution in an OpenACC Compiler. Proceedings of the 2013 42nd International Conference on Parallel Processing. (260-269).

    https://doi.org/10.1109/ICPP.2013.35

  • Reagen B, Shao Y, Wei G and Brooks D. Quantifying acceleration. Proceedings of the 2013 International Symposium on Low Power Electronics and Design. (395-400).

    /doi/10.5555/2648668.2648759

  • Shao Y and Brooks D. Energy characterization and instruction-level energy model of Intel's Xeon Phi processor. Proceedings of the 2013 International Symposium on Low Power Electronics and Design. (389-394).

    /doi/10.5555/2648668.2648758

  • Reagen B, Shao Y, Wei G and Brooks D. (2013). Quantifying acceleration: Power/performance trade-offs of application kernels in hardware. 2013 IEEE International Symposium on Low Power Electronics and Design (ISLPED). 10.1109/ISLPED.2013.6629329. 978-1-4799-1235-3. (395-400).

    http://ieeexplore.ieee.org/document/6629329/

  • Shao Y and Brooks D. (2013). Energy characterization and instruction-level energy model of Intel's Xeon Phi processor. 2013 IEEE International Symposium on Low Power Electronics and Design (ISLPED). 10.1109/ISLPED.2013.6629328. 978-1-4799-1235-3. (389-394).

    http://ieeexplore.ieee.org/document/6629328/

  • Che S, Beckmann B, Reinhardt S and Skadron K. (2013). Pannotia: Understanding irregular GPGPU graph applications. 2013 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2013.6704684. 978-1-4799-0553-9. (185-195).

    https://ieeexplore.ieee.org/document/6704684/

  • Young J, Shon S, Yalamanchili S, Merritt A, Schwan K and Froning H. (2013). Oncilla: A GAS runtime for efficient resource allocation and data movement in accelerated clusters. 2013 IEEE International Conference on Cluster Computing (CLUSTER). 10.1109/CLUSTER.2013.6702679. 978-1-4799-0898-1. (1-8).

    http://ieeexplore.ieee.org/document/6702679/

  • Expósito R, Taboada G, Ramos S, Touriño J and Doallo R. (2012). General‐purpose computation on GPUs for high performance cloud computing. Concurrency and Computation: Practice and Experience. 10.1002/cpe.2845. 25:12. (1628-1642). Online publication date: 25-Aug-2013.

    https://onlinelibrary.wiley.com/doi/10.1002/cpe.2845

  • Grasso I, Kofler K, Cosenza B and Fahringer T. (2013). Automatic problem size sensitive task partitioning on heterogeneous parallel systems. ACM SIGPLAN Notices. 48:8. (281-282). Online publication date: 23-Aug-2013.

    https://doi.org/10.1145/2517327.2442545

  • Wu B, Zhao Z, Zhang E, Jiang Y and Shen X. (2013). Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU. ACM SIGPLAN Notices. 48:8. (57-68). Online publication date: 23-Aug-2013.

    https://doi.org/10.1145/2517327.2442523

  • Defour D and Petit E. (2013). GPUburn: A system to test and mitigate GPU hardware failures. 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII). 10.1109/SAMOS.2013.6621133. 978-1-4799-0103-6. (263-270).

    http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6621133

  • Kofler K, Grasso I, Cosenza B and Fahringer T. An automatic input-sensitive approach for heterogeneous task partitioning. Proceedings of the 27th international ACM conference on International conference on supercomputing. (149-160).

    https://doi.org/10.1145/2464996.2465007

  • Ukidave Y and Kaeli D. Analyzing Optimization Techniques for Power Efficiency on Heterogeneous Platforms. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. (1040-1049).

    https://doi.org/10.1109/IPDPSW.2013.220

  • Song S, Su C, Rountree B and Cameron K. A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. (673-686).

    https://doi.org/10.1109/IPDPS.2013.73

  • Wu J and Hong B. Collocating CPU-only jobs with GPU-assisted jobs on GPU-assisted HPC. Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (418-425).

    https://doi.org/10.1109/CCGrid.2013.19

  • Ukidave Y, Ziabari A, Mistry P, Schirner G and Kaeli D. (2013). Quantifying the energy efficiency of FFT on heterogeneous platforms. 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2013.6557174. 978-1-4673-5779-1. (235-244).

    http://ieeexplore.ieee.org/document/6557174/

  • Docampo J, Ramos S, Taboada G, Exposito R, Tourino J and Doallo R. Evaluation of Java for General Purpose GPU Computing. Proceedings of the 2013 27th International Conference on Advanced Information Networking and Applications Workshops. (1398-1404).

    https://doi.org/10.1109/WAINA.2013.234

  • Shih C, Chen Y, Chen J and Chang N. Virtual Cloud Core. Proceedings of the 2013 IEEE Seventh International Symposium on Service-Oriented System Engineering. (486-493).

    https://doi.org/10.1109/SOSE.2013.70

  • Mistry P, Ukidave Y, Schaa D and Kaeli D. Valar. Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units. (54-65).

    https://doi.org/10.1145/2458523.2458529

  • Shen J, Fang J, Sips H and Varbanescu A. Performance Traps in OpenCL for CPUs. Proceedings of the 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. (38-45).

    https://doi.org/10.1109/PDP.2013.16

  • Grasso I, Kofler K, Cosenza B and Fahringer T. Automatic problem size sensitive task partitioning on heterogeneous parallel systems. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming. (281-282).

    https://doi.org/10.1145/2442516.2442545

  • O'Boyle M, Wang Z and Grewe D. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). (1-10).

    https://doi.org/10.1109/CGO.2013.6494993

  • Ardila Y, Kawai N, Nakamura T and Tamura Y. (2013). Support tools for porting legacy applications to multicore. 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC 2013). 10.1109/ASPDAC.2013.6509658. 978-1-4673-3030-5. (568-573).

    http://ieeexplore.ieee.org/document/6509658/

  • Zhang Y, Sinclair M and Chien A. (2013). Improving Performance Portability in OpenCL Programs. Supercomputing. 10.1007/978-3-642-38750-0_11. (136-150).

    https://link.springer.com/10.1007/978-3-642-38750-0_11

  • Wu J, Shi W and Hong B. (2013). Dynamic Kernel/Device Mapping Strategies for GPU-Assisted HPC Systems. Job Scheduling Strategies for Parallel Processing. 10.1007/978-3-642-35867-8_6. (96-113).

    http://link.springer.com/10.1007/978-3-642-35867-8_6

  • Yan X, Shi X and Sun Q. An OpenCL Micro-Benchmark Suite for GPUs and CPUs. Proceedings of the 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies. (53-58).

    https://doi.org/10.1109/PDCAT.2012.52

  • Williams S, Kalamkar D, Singh A, Deshpande A, Van Straalen B, Smelyanskiy M, Almgren A, Dubey P, Shalf J and Oliker L. Optimization of geometric multigrid for emerging multi- and manycore processors. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-11).

    /doi/10.5555/2388996.2389126

  • Williams S, Kalamkar D, Singh A, Deshpande A, Van Straalen B, Smelyanskiy M, Almgren A, Dubey P, Shalf J and Oliker L. Optimization of geometric multigrid for emerging multi- and manycore processors. Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis. (1-11).

    https://doi.org/10.1109/SC.2012.85

  • Zhou L, Clifford Chao K and Chang J. (2012). Fast polyenergetic forward projection for image formation using OpenCL on a heterogeneous parallel computing platform. Medical Physics. 10.1118/1.4758062. 39:11. (6745-6756). Online publication date: 1-Nov-2012.

    https://aapm.onlinelibrary.wiley.com/doi/10.1118/1.4758062

  • Amrizal A, Hirasawa S, Komatsu K, Takizawa H and Kobayashi H. (2012). Improving the scalability of transparent checkpointing for GPU computing systems. TENCON 2012 - 2012 IEEE Region 10 Conference. 10.1109/TENCON.2012.6412343. 978-1-4673-4824-9. (1-6).

    http://ieeexplore.ieee.org/document/6412343/

  • Tupinamba A and Sztajnberg A. DistributedCL. Proceedings of the 2012 13th Symposium on Computing Systems. (187-193).

    https://doi.org/10.1109/WSCAD-SSC.2012.36

  • Bureddy D, Wang H, Venkatesh A, Potluri S and Panda D. OMB-GPU. Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface. (110-120).

    https://doi.org/10.1007/978-3-642-33518-1_16

  • Prabhakar R, Govindarajan R and Thazhuthaveetil M. CUDA-for-clusters. Proceedings of the 18th international conference on Parallel Processing. (415-426).

    https://doi.org/10.1007/978-3-642-32820-6_42

  • Pratas F, Trancoso P, Sousa L, Stamatakis A, Shi G and Kindratenko V. (2012). Fine-grain parallelism using multi-core, Cell/BE, and GPU Systems. Parallel Computing. 38:8. (365-390). Online publication date: 1-Aug-2012.

    https://doi.org/10.1016/j.parco.2011.08.002

  • Barrio P, Carreras C, Sierra R, Kenter T and Plessl C. (2012). Turning control flow graphs into function calls: Code generation for heterogeneous architectures. 2012 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCSim.2012.6266973. 978-1-4673-2362-8. (559-565).

    http://ieeexplore.ieee.org/document/6266973/

  • Ji F, Aji A, Dinan J, Buntinas D, Balaji P, Thakur R, Feng W and Ma X. DMA-Assisted, Intranode Communication in GPU Accelerated Systems. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems. (461-468).

    https://doi.org/10.1109/HPCC.2012.69

  • Calandrini G, Gardel A, Revenga P and Lázaro J. GPU Acceleration on Embedded Devices. A Power Consumption Approach. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems. (1806-1812).

    https://doi.org/10.1109/HPCC.2012.272

  • Wang C, Chandrasekaran S and Chapman B. An OpenMP 3.1 validation testsuite. Proceedings of the 8th international conference on OpenMP in a Heterogeneous World. (237-249).

    https://doi.org/10.1007/978-3-642-30961-8_18

  • Jaros J. (2012). Multi-GPU island-based genetic algorithm for solving the knapsack problem. 2012 IEEE Congress on Evolutionary Computation (CEC). 10.1109/CEC.2012.6256131. 978-1-4673-1509-8. (1-8).

    http://ieeexplore.ieee.org/document/6256131/

  • Hartley T, Saule E and Çatalyürek Ü. (2012). Improving performance of adaptive component-based dataflow middleware. Parallel Computing. 38:6-7. (289-309). Online publication date: 1-Jun-2012.

    https://doi.org/10.1016/j.parco.2012.03.005

  • Qin C and Zhan L. (2012). Parallelizing flow-accumulation calculations on graphics processing units - From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm. Computers & Geosciences. 43. (7-16). Online publication date: 1-Jun-2012.

    https://doi.org/10.1016/j.cageo.2012.02.022

  • Nowrouzezahrai D, Simari P and Fiume E. (2012). Sparse zonal harmonic factorization for efficient SH rotation. ACM Transactions on Graphics. 31:3. (1-9). Online publication date: 31-May-2012.

    https://doi.org/10.1145/2167076.2167081

  • Ji F, Aji A, Dinan J, Buntinas D, Balaji P, Feng W and Ma X. Efficient Intranode Communication in GPU-Accelerated Systems. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. (1838-1847).

    https://doi.org/10.1109/IPDPSW.2012.227

  • Bozkus Z and Fraguela B. A Portable High-Productivity Approach to Program Heterogeneous Systems. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. (163-173).

    https://doi.org/10.1109/IPDPSW.2012.15

  • Spafford K, Meredith J, Lee S, Li D, Roth P and Vetter J. The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. Proceedings of the 9th conference on Computing Frontiers. (103-112).

    https://doi.org/10.1145/2212908.2212924

  • Unat D, Zhou J, Cui Y, Baden S and Cai X. (2012). Accelerating a 3D Finite-Difference Earthquake Simulation with a C-to-CUDA Translator. Computing in Science and Engineering. 14:3. (48-59). Online publication date: 1-May-2012.

    https://doi.org/10.1109/MCSE.2012.44

  • Xiao S, Balaji P, Zhu Q, Thakur R, Coghlan S, Lin H, Wen G, Hong J and Feng W. (2012). VOCL: An optimized environment for transparent virtualization of graphics processing units. 2012 Innovative Parallel Computing (InPar). 10.1109/InPar.2012.6339609. 978-1-4673-2633-9. (1-12).

    http://ieeexplore.ieee.org/document/6339609/

  • Stratton J, Anssari N, Rodrigues C, Sung I, Obeid N, Chang L, Liu G and Hwu W. (2012). Optimization and architecture effects on GPU computing workload performance. 2012 Innovative Parallel Computing (InPar). 10.1109/InPar.2012.6339605. 978-1-4673-2633-9. (1-10).

    http://ieeexplore.ieee.org/document/6339605/

  • Gupta K, Stuart J and Owens J. (2012). A study of Persistent Threads style GPU programming for GPGPU workloads. 2012 Innovative Parallel Computing (InPar). 10.1109/InPar.2012.6339596. 978-1-4673-2633-9. (1-14).

    http://ieeexplore.ieee.org/document/6339596/

  • Braithwaite R, Feng W and McCormick P. Automatic NUMA characterization using Cbench. Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering. (295-298).

    https://doi.org/10.1145/2188286.2188342

  • Jaros J and Pospichal P. A fair comparison of modern CPUs and GPUs running the genetic algorithm under the knapsack benchmark. Proceedings of the 2012 European conference on Applications of Evolutionary Computation. (426-435).

    https://doi.org/10.1007/978-3-642-29178-4_43

  • Miyoshi T, Irie H, Shima K, Honda H, Kondo M and Yoshinaga T. FLAT. Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units. (20-29).

    https://doi.org/10.1145/2159430.2159433

  • Jaros J, Treeby B and Rendell A. Use of multiple GPUs on shared memory multiprocessors for ultrasound propagation simulations. Proceedings of the Tenth Australasian Symposium on Parallel and Distributed Computing - Volume 127. (43-52).

    https://dl.acm.org/doi/10.5555/2523685.2523691

  • Pereira K, Athanas P, Lin H and Feng W. Spectral Method Characterization on FPGA and GPU Accelerators. Proceedings of the 2011 International Conference on Reconfigurable Computing and FPGAs. (487-492).

    https://doi.org/10.1109/ReConFig.2011.83

  • Madduri K, Ibrahim K, Williams S, Im E, Ethier S, Shalf J and Oliker L. Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

    https://doi.org/10.1145/2063384.2063415

  • Zhang Y, Peng L, Li B, Peir J and Chen J. Architecture comparisons between Nvidia and ATI GPUs. Proceedings of the 2011 IEEE International Symposium on Workload Characterization. (205-215).

    https://doi.org/10.1109/IISWC.2011.6114180

  • Seo S, Jo G and Lee J. Performance characterization of the NAS Parallel Benchmarks in OpenCL. Proceedings of the 2011 IEEE International Symposium on Workload Characterization. (137-148).

    https://doi.org/10.1109/IISWC.2011.6114174

  • Wang H, Potluri S, Luo M, Singh A, Ouyang X, Sur S and Panda D. Optimized Non-contiguous MPI Datatype Communication for GPU Clusters. Proceedings of the 2011 IEEE International Conference on Cluster Computing. (308-316).

    https://doi.org/10.1109/CLUSTER.2011.42

  • Malony A, Biersdorff S, Shende S, Jagode H, Tomov S, Juckeland G, Dietrich R, Poole D and Lamb C. Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs. Proceedings of the 2011 International Conference on Parallel Processing. (176-185).

    https://doi.org/10.1109/ICPP.2011.71

  • Fang J, Varbanescu A and Sips H. A Comprehensive Performance Comparison of CUDA and OpenCL. Proceedings of the 2011 International Conference on Parallel Processing. (216-225).

    https://doi.org/10.1109/ICPP.2011.45

  • Meredith J, Roth P, Spafford K and Vetter J. (2011). Performance Implications of Nonuniform Device Topologies in Scalable Heterogeneous Architectures. IEEE Micro. 31:5. (66-75). Online publication date: 1-Sep-2011.

    https://doi.org/10.1109/MM.2011.79

  • Vetter J, Glassbrook R, Dongarra J, Schwan K, Loftis B, McNally S, Meredith J, Rogers J, Roth P, Spafford K and Yalamanchili S. (2011). Keeneland. Computing in Science and Engineering. 13:5. (90-95). Online publication date: 1-Sep-2011.

    https://doi.org/10.1109/MCSE.2011.83

  • Thoman P, Kofler K, Studt H, Thomson J and Fahringer T. Automatic OpenCL device characterization. Proceedings of the 17th international conference on Parallel processing - Volume Part II. (438-452).

    https://dl.acm.org/doi/10.5555/2033408.2033459

  • Daga M, Aji A and Feng W. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing. (141-149).

    https://doi.org/10.1109/SAAHPC.2011.29

  • Takizawa H, Koyama K, Sato K, Komatsu K and Kobayashi H. CheCL. Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium. (864-876).

    https://doi.org/10.1109/IPDPS.2011.85

  • Grewe D and O'Boyle M. A static task partitioning approach for heterogeneous systems using OpenCL. Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software. (286-305).

    https://dl.acm.org/doi/10.5555/1987237.1987259

  • Spafford K, Meredith J and Vetter J. Quantifying NUMA and contention effects in multi-GPU systems. Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units. (1-7).

    https://doi.org/10.1145/1964179.1964194

  • Karantasis K and Polychronopoulos E. Programming GPU Clusters with Shared Memory Abstraction in Software. Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing. (223-230).

    https://doi.org/10.1109/PDP.2011.91

  • Grewe D and O’Boyle M. (2011). A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL. Compiler Construction. 10.1007/978-3-642-19861-8_16. (286-305).

    http://link.springer.com/10.1007/978-3-642-19861-8_16

  • Che S, Sheaffer J, Boyer M, Szafaryn L, Wang L and Skadron K. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'10). (1-11).

    https://doi.org/10.1109/IISWC.2010.5650274

  • Hartley T, Saule E and Catalyurek U. (2010). Automatic dataflow application tuning for heterogeneous systems. 2010 International Conference on High Performance Computing (HiPC). 10.1109/HIPC.2010.5713173. 978-1-4244-8518-5. (1-10).

    http://ieeexplore.ieee.org/document/5713173/

  • Jurecko M, Kocisova J, Jr. J, Kasanicky T, Domiter M and Zvada M. Evaluation Framework for GPU Performance Based on OpenCL Standard. Proceedings of the 2010 First International Conference on Networking and Computing. (256-261).

    https://doi.org/10.1109/IC-NC.2010.32

  • Barak A, Ben-Nun T, Levy E and Shiloh A. (2010). A package for OpenCL based heterogeneous computing on clusters with many GPU devices. 2010 IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS). 10.1109/CLUSTERWKSP.2010.5613086. 978-1-4244-8395-2. (1-7).

    http://ieeexplore.ieee.org/document/5613086/

  • Spafford K, Meredith J and Vetter J. Maestro. Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II. (275-286).

    https://dl.acm.org/doi/10.5555/1885276.1885305

  • Malony A, Biersdorff S, Spear W and Mayanglambam S. An experimental approach to performance measurement of heterogeneous parallel applications using CUDA. Proceedings of the 24th ACM International Conference on Supercomputing. (127-136).

    https://doi.org/10.1145/1810085.1810105

  • Li X, Li Z, David F, Zhou P, Zhou Y, Adve S and Kumar S. (2004). Performance directed energy management for main memory and disks. ACM SIGARCH Computer Architecture News. 32:5. (271-283). Online publication date: 1-Dec-2004.

    https://doi.org/10.1145/1037947.1024425

  • Gomaa M, Powell M and Vijaykumar T. (2004). Heat-and-run. ACM SIGARCH Computer Architecture News. 32:5. (260-270). Online publication date: 1-Dec-2004.

    https://doi.org/10.1145/1037947.1024424

  • Wu Q, Juang P, Martonosi M and Clark D. (2004). Formal online methods for voltage/frequency control in multiple clock domain microprocessors. ACM SIGARCH Computer Architecture News. 32:5. (248-259). Online publication date: 1-Dec-2004.

    https://doi.org/10.1145/1037947.1024423

  • Bronevetsky G, Marques D, Pingali K, Szwed P and Schulz M. (2004). Application-level checkpointing for shared memory programs. ACM SIGARCH Computer Architecture News. 32:5. (235-247). Online publication date: 1-Dec-2004.

    https://doi.org/10.1145/1037947.1024421

  • Smolens J, Gold B, Kim J, Falsafi B, Hoe J and Nowatzyk A. (2004). Fingerprinting. ACM SIGARCH Computer Architecture News. 32:5. (224-234). Online publication date: 1-Dec-2004.

    https://doi.org/10.1145/1037947.1024420

  • Lowell D, Saito Y and Samberg E. (2004). Devirtualizable virtual machines enabling general, single-node, online maintenance. ACM SIGARCH Computer Architecture News. 32:5. (211-223). Online publication date: 1-Dec-2004.

    https://doi.org/10.1145/1037947.1024419

  • Cher C, Hosking A and Vijaykumar T. (2004). Software prefetching for mark-sweep garbage collection. ACM SIGARCH Computer Architecture News. 32:5. (199-210). Online publication date: 1-Dec-2004.

    https://doi.org/10.1145/1037947.1024417

  • Singh Umrao L and Pandey J. Performance Analysis and Optimization of Graphics Processing Unit. SSRN Electronic Journal. 10.2139/ssrn.3350249.

    https://www.ssrn.com/abstract=3350249