Beck T, Baroni A, Bennink R, Buchs G, Pérez E, Eisenbach M, da Silva R, Meena M, Gottiparthi K, Groszkowski P, Humble T, Landfield R, Maheshwari K, Oral S, Sandoval M, Shehata A, Suh I and Zimmer C. (2024). Integrating quantum computing resources into scientific HPC ecosystems. Future Generation Computer Systems. 10.1016/j.future.2024.06.058. 161. (11-25). Online publication date: 1-Dec-2024.

https://linkinghub.elsevier.com/retrieve/pii/S0167739X24003583

Lungu N, Al Rababah A, Dash B, Syed A, Barik L, Rout S, Tembo S, Lubobya C and Patra S. (2024). NIST CSF-2.0 Compliant GPU Shader Execution. Engineering, Technology & Applied Science Research. 10.48084/etasr.7351. 14:4. (15187-15193).

https://etasr.com/index.php/ETASR/article/view/7351

Gouk D, Kang S, Bae H, Ryu E, Lee S, Kim D, Jang J and Jung M. Breaking Barriers: Expanding GPU Memory with Sub-Two Digit Nanosecond Latency CXL Controller. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems. (108-115).

https://doi.org/10.1145/3655038.3665953

Trochatos T, Etim A and Szefer J. (2024). Covert-channels in FPGA-enabled SmartSSDs. ACM Transactions on Reconfigurable Technology and Systems. 17:2. (1-23). Online publication date: 30-Jun-2024.

https://doi.org/10.1145/3635312

Feng Y, Na S, Kim H and Jeon H. (2024). Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 10.1109/ISCA59077.2024.00065. 979-8-3503-2658-1. (834-847).

https://ieeexplore.ieee.org/document/10609639/

Priya A, Choudhury R, Patni S, Sharma H, Mohanty M, Narayanam K, Devi U, Moogi P, Patil P and Parag P. Energy-minimizing workload splitting and frequency selection for guaranteed performance over heterogeneous cores. Proceedings of the 15th ACM International Conference on Future and Sustainable Energy Systems. (308-322).

https://doi.org/10.1145/3632775.3661968

Oh C, Yi S, Seok J, Jung H, Yoon I and Yi Y. (2023). Hybridhadoop: CPU-GPU hybrid scheduling in hadoop. Cluster Computing. 10.1007/s10586-023-04178-5. 27:3. (3875-3892). Online publication date: 1-Jun-2024.

https://link.springer.com/10.1007/s10586-023-04178-5

Tyagi A, Mishra A, Vedavathi N, Kakulapati V and Sajidha S. (2024). Futuristic Technologies for Smart Manufacturing. Automated Secure Computing for Next‐Generation Systems. 10.1002/9781394213948.ch21. (415-441). Online publication date: 3-May-2024.

https://onlinelibrary.wiley.com/doi/10.1002/9781394213948.ch21

Cheng J, Coward S, Chelini L, Barbalho R and Drane T. SEER: Super-Optimization Explorer for High-Level Synthesis using E-graph Rewriting. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (1029-1044).

https://doi.org/10.1145/3620665.3640392

Crisci L, Carpentieri L, Thoman P, Alpay A, Heuveline V and Cosenza B. SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs. Proceedings of the 12th International Workshop on OpenCL and SYCL. (1-12).

https://doi.org/10.1145/3648115.3648120

Frachtenberg E, Mittal V, Bruel P, Faloutsos M and Milojicic D. (2024). The Distribution Is the Performance. Computer. 57:4. (143-149). Online publication date: 1-Apr-2024.

https://doi.org/10.1109/MC.2024.3362448

Hasler J and Hao C. (2023). Programmable Analog System Benchmarks Leading to Efficient Analog Computation Synthesis. ACM Transactions on Reconfigurable Technology and Systems. 17:1. (1-25). Online publication date: 31-Mar-2024.

https://doi.org/10.1145/3625298

Wang Y, Li B, Jaleel A, Yang J and Tang X. (2024). GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA57654.2024.00085. 979-8-3503-9313-2. (1080-1094).

https://ieeexplore.ieee.org/document/10476474/

Na S, Kim J, Lee S and Huh J. (2024). Supporting Secure Multi-GPU Computing with Dynamic and Batched Metadata Management. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA57654.2024.00025. 979-8-3503-9313-2. (204-217).

https://ieeexplore.ieee.org/document/10476487/

Jeong E, Park E, Koo G, Oh Y and Yoon M. (2024). Conflict-aware compiler for hierarchical register file on GPUs. Journal of Systems Architecture. 10.1016/j.sysarc.2024.103099. (103099). Online publication date: 1-Feb-2024.

https://linkinghub.elsevier.com/retrieve/pii/S1383762124000365

Kumar V, Ranjbar B and Kumar A. Utilizing Machine Learning Techniques for Worst-Case Execution Time Estimation on GPU Architectures. IEEE Access. 10.1109/ACCESS.2024.3379018. 12. (41464-41478).

https://ieeexplore.ieee.org/document/10474357/

Mustafa D, Alkhasawneh R, Obeidat F and Shatnawi A. MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access. 10.1109/ACCESS.2024.3372990. 12. (34354-34377).

https://ieeexplore.ieee.org/document/10458910/

Mohamed K. (2024). An Introduction to Heterogeneous SoC Design and Verification “A Conceptual-Level”. Heterogeneous SoC Design and Verification. 10.1007/978-3-031-56152-8_1. (1-26).

https://link.springer.com/10.1007/978-3-031-56152-8_1

Tian S, Giechaskiel I, Xiong W and Szefer J. (2024). Fingerprinting and Mapping Cloud FPGA Infrastructures. Security of FPGA-Accelerated Cloud Computing Environments. 10.1007/978-3-031-45395-3_9. (239-272).

https://link.springer.com/10.1007/978-3-031-45395-3_9

Giechaskiel I, Tian S and Szefer J. (2024). Contention-Based Threats Between Single-Tenant Cloud FPGA Instances. Security of FPGA-Accelerated Cloud Computing Environments. 10.1007/978-3-031-45395-3_6. (137-172).

https://link.springer.com/10.1007/978-3-031-45395-3_6

Zoni D, Galimberti A and Fornaciari W. (2023). A Survey on Run-time Power Monitors at the Edge. ACM Computing Surveys. 55:14s. (1-33). Online publication date: 31-Dec-2024.

https://doi.org/10.1145/3593044

Saini A, Shende O, Pandit M, Sen R and Ananthanarayanan G. Bang for the Buck: Evaluating the cost-effectiveness of Heterogeneous Edge Platforms for Neural Network Workloads. Proceedings of the Eighth ACM/IEEE Symposium on Edge Computing. (94-107).

https://doi.org/10.1145/3583740.3628437

Tan J, Chen K, Wang W, Yan K and Wei X. MCM-GPU Voltage Noise Characterization and Architecture-Level Mitigation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 10.1109/TCAD.2023.3279304. 42:12. (5084-5097).

https://ieeexplore.ieee.org/document/10131993/

Lin F, Liu Y, Wang X and Gai X. (2023). Leveraging simulation of high performance computing systems with node simulation using architecture simulator. CCF Transactions on High Performance Computing. 10.1007/s42514-023-00173-9. 5:4. (442-464). Online publication date: 1-Dec-2023.

https://link.springer.com/10.1007/s42514-023-00173-9

Weckert C, Solis-Vasquez L, Oppermann J, Koch A and Sinnen O. Altis-SYCL: Migrating Altis Benchmarking Suite from CUDA to SYCL for GPUs and FPGAs. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. (547-555).

https://doi.org/10.1145/3624062.3624542

Afzal A, Hager G and Wellein G. SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. (1245-1254).

https://doi.org/10.1145/3624062.3624197

Rodríguez-Borbón J, Wang X, Diéguez A, Ibrahim K and Wong B. (2023). TRAVOLTA: GPU Acceleration and Algorithmic Improvements for Constructing Quantum Optimal Control Fields in Photo-Excited Systems. Computer Physics Communications. 10.1016/j.cpc.2023.109017. (109017). Online publication date: 1-Nov-2023.

https://linkinghub.elsevier.com/retrieve/pii/S0010465523003624

Liu C, Sun Y and Carlson T. Photon: A Fine-grained Sampled Simulation Methodology for GPU Workloads. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. (1227-1241).

https://doi.org/10.1145/3613424.3623773

Li B, Guo Y, Wang Y, Jaleel A, Yang J and Tang X. IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE Invalidations. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. (1163-1177).

https://doi.org/10.1145/3613424.3614269

Sung S, Hur S, Kim S, Ha D, Oh Y and Ro W. MAD MAcce: Supporting Multiply-Add Operations for Democratizing Matrix-Multiplication Accelerators. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. (367-379).

https://doi.org/10.1145/3613424.3614247

Dutta A, Alcaraz J, TehraniJamsaz A, Cesar E, Sikora A and Jannesari A. Performance Optimization using Multimodal Modeling and Heterogeneous GNN. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing. (45-57).

https://doi.org/10.1145/3588195.3592984

Meyer M, Kenter T and Plessl C. (2023). Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-switched Inter-FPGA Networks. ACM Transactions on Reconfigurable Technology and Systems. 10.1145/3576200. 16:2. (1-27). Online publication date: 30-Jun-2023.

https://dl.acm.org/doi/10.1145/3576200

Barbierato E, Manini D and Gribaudo M. (2023). A Multiformalism-Based Model for Performance Evaluation of Green Data Centres. Electronics. 10.3390/electronics12102169. 12:10. (2169).

https://www.mdpi.com/2079-9292/12/10/2169

Tørring J, van Werkhoven B, Petrovič F, Willemsen F, Filipovič J and Elster A. (2023). Towards a Benchmarking Suite for Kernel Tuners. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW59300.2023.00124. 979-8-3503-1199-0. (724-733).

https://ieeexplore.ieee.org/document/10196663/

Kamatar A, Friese R and Gioiosa R. (2023). A Task Based Approach for Co-Scheduling Ensemble Workloads on Heterogeneous Nodes. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW59300.2023.00015. 979-8-3503-1199-0. (6-15).

https://ieeexplore.ieee.org/document/10196582/

Emonds Y, Braun L and Fröning H. (2023). CUDAsap: Statically-Determined Execution Statistics as Alternative to Execution-Based Profiling. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 10.1109/CCGrid57682.2023.00021. 979-8-3503-0119-9. (119-130).

https://ieeexplore.ieee.org/document/10171571/

Meyer J, Alpay A, Hack S, Fröning H and Heuveline V. Implementation Techniques for SPMD Kernels on CPUs. Proceedings of the 2023 International Workshop on OpenCL. (1-12).

https://doi.org/10.1145/3585341.3585342

Sawalha L and Deljevic G. (2023). Workload Characterization Using Hierarchical PCA. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS57527.2023.00043. 979-8-3503-9739-0. (331-333).

https://ieeexplore.ieee.org/document/10158189/

Jin Z and Vetter J. (2023). A Benchmark Suite for Improving Performance Portability of the SYCL Programming Model. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS57527.2023.00041. 979-8-3503-9739-0. (325-327).

https://ieeexplore.ieee.org/document/10158214/

Giechaskiel I, Tian S and Szefer J. (2022). Cross-VM Covert- and Side-Channel Attacks in Cloud FPGAs. ACM Transactions on Reconfigurable Technology and Systems. 16:1. (1-29). Online publication date: 31-Mar-2023.

https://doi.org/10.1145/3534972

Lee J, Lee J, Oh Y, Song W and Ro W. (2023). SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA56546.2023.10071063. 978-1-6654-7652-2. (1195-1207).

https://ieeexplore.ieee.org/document/10071063/

Li B, Yin J, Holey A, Zhang Y, Yang J and Tang X. (2023). Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA56546.2023.10071054. 978-1-6654-7652-2. (456-470).

https://ieeexplore.ieee.org/document/10071054/

Wang X, Li Y, Guo F, Xu Y and Lui J. Dynamic GPU Scheduling with Multi-resource Awareness and Live Migration Support. IEEE Transactions on Cloud Computing. 10.1109/TCC.2023.3264242. (1-16).

https://ieeexplore.ieee.org/document/10091187/

Paul B, Choudhury N, Saikia E and Trivedi G. (2023). Digital Boolean Logic Equivalent Reversible Quantum Gates Design. Third Congress on Intelligent Systems. 10.1007/978-981-19-9379-4_20. (253-271).

https://link.springer.com/10.1007/978-981-19-9379-4_20

Defour D. (2022). Using scheduling entropy amplification in CUDA/OpenMP code to exhibit non-reproducibility issues. 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC). 10.1109/MCSoC57363.2022.00040. 978-1-6654-6499-4. (200-207).

https://ieeexplore.ieee.org/document/10008469/

Oh Y, Jeong I, Ro W and Yoon M. CASH-RF: A Compiler-Assisted Hierarchical Register File in GPUs. IEEE Embedded Systems Letters. 10.1109/LES.2022.3163749. 14:4. (187-190).

https://ieeexplore.ieee.org/document/9745582/

Hammond J, Deakin T, Cownie J and McIntosh-Smith S. (2022). Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream. 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). 10.1109/PMBS56514.2022.00013. 978-1-6654-5185-7. (82-99).

https://ieeexplore.ieee.org/document/10024026/

Gomez-Hernandez E, Cebrian J, Kaxiras S and Ros A. (2022). Splash-4: A Modern Benchmark Suite with Lock-Free Constructs. 2022 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC55918.2022.00015. 978-1-6654-8798-6. (51-64).

https://ieeexplore.ieee.org/document/9975421/

Peng W and Belikov E. (2022). CAMP: a Synthetic Micro-Benchmark for Assessing Deep Memory Hierarchies. 2022 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar). 10.1109/HiPar56574.2022.00009. 978-1-6654-6345-4. (28-36).

https://ieeexplore.ieee.org/document/10024617/

Bao Y, Sun Y, Feric Z, Shen M, Weston M, Abellán J, Baruah T, Kim J, Joshi A and Kaeli D. NaviSim. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. (333-345).

https://doi.org/10.1145/3559009.3569666

Belayneh L, Ye H, Chen K, Blaauw D, Mudge T, Dreslinski R and Talati N. Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. (304-316).

https://doi.org/10.1145/3559009.3569649

B P, Jawalkar N and Basu A. Designing Virtual Memory System of MCM GPUs. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture. (404-422).

https://doi.org/10.1109/MICRO56248.2022.00036

Zhang Y and Jung C. (2022). Featherweight Soft Error Resilience for GPUs. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 10.1109/MICRO56248.2022.00030. 978-1-6654-6272-3. (245-262).

https://ieeexplore.ieee.org/document/9923801/

Tan J, Chen K and Yan K. (2022). MG-Voltage: Characterizing and Mitigating Voltage Noise in MCM-GPU Architectures. 2022 IEEE 40th International Conference on Computer Design (ICCD). 10.1109/ICCD56317.2022.00109. 978-1-6654-6186-3. (714-721).

https://ieeexplore.ieee.org/document/9978493/

Jagasivamani M, Fong C, Goodnow K and Voigt R. (2022). Model And Evaluation Of A Superconducting-Logic Based Hybrid CPU-Accelerator System. 2022 Annual Modeling and Simulation Conference (ANNSIM). 10.23919/ANNSIM55834.2022.9859454. 978-1-71-385288-9. (140-151).

https://ieeexplore.ieee.org/document/9859454/

Jin H, Jeong D, Park T, Ko J and Kim J. Multi-Prediction Compression: An Efficient and Scalable Memory Compression Framework for GP-GPU. IEEE Computer Architecture Letters. 10.1109/LCA.2022.3177419. 21:2. (37-40).

https://ieeexplore.ieee.org/document/9780608/

Zhao C, Gao W, Nie F and Zhou H. A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2021.3115630. 33:6. (1451-1463).

https://ieeexplore.ieee.org/document/9548839/

Liu Y, Azami N, Walters C and Burtscher M. (2022). The Indigo Program-Verification Microbenchmark Suite of Irregular Parallel Code Patterns. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS55109.2022.00003. 978-1-6654-5954-9. (24-34).

https://ieeexplore.ieee.org/document/9804647/

Jin Z and Vetter J. (2022). Evaluating Unified Memory Performance in HIP. 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW55747.2022.00096. 978-1-6654-9747-3. (562-568).

https://ieeexplore.ieee.org/document/9835548/

Heldens S, Hijma P, Van Werkhoven B, Maassen J and van Nieuwpoort R. (2022). Lightning: Scaling the GPU Programming Model Beyond a Single GPU. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS53621.2022.00054. 978-1-6654-8106-9. (492-503).

https://ieeexplore.ieee.org/document/9820612/

Saiz A, Prieto P, Abad P, Gregorio J and Puente V. (2022). Top-Down Performance Profiling on NVIDIA's GPUs. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS53621.2022.00026. 978-1-6654-8106-9. (179-189).

https://ieeexplore.ieee.org/document/9820717/

Brunst H, Chandrasekaran S, Ciorba F, Hagerty N, Henschel R, Juckeland G, Li J, Vergara V, Wienke S and Zavala M. (2022). First Experiences in Performance Benchmarking with the New SPEChpc 2021 Suites. 2022 22nd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 10.1109/CCGrid54584.2022.00077. 978-1-6654-9956-9. (675-684).

https://ieeexplore.ieee.org/document/9826013/

Chen G, Zhang J, Zhu Z, Wang H, Jiang H and Pang C. (2020). CRAC: An automatic assistant compiler of checkpoint/restart for OpenCL program. Concurrency and Computation: Practice and Experience. 10.1002/cpe.6048. 34:8. Online publication date: 10-Apr-2022.

https://onlinelibrary.wiley.com/doi/10.1002/cpe.6048

van Stigt R, Swatman S and Varbanescu A. Isolating GPU Architectural Features Using Parallelism-Aware Microbenchmarks. Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering. (77-88).

https://doi.org/10.1145/3489525.3511673

Olabi M, Luna J, Mutlu O, Hwu W and El Hajj I. A compiler framework for optimizing dynamic parallelism on GPUs. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization. (1-13).

https://doi.org/10.1109/CGO53902.2022.9741284

Dalmia P, Mahapatra R and Sinclair M. (2022). Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA53966.2022.00056. 978-1-6654-2027-3. (676-691).

https://ieeexplore.ieee.org/document/9773230/

Li W, Chen Z, He X, Duan G, Sun J and Chen H. (2022). CVFuzz. Future Generation Computer Systems. 127:C. (384-395). Online publication date: 1-Feb-2022.

https://doi.org/10.1016/j.future.2021.09.006

Kim S and Kim Y. (2021). K-Scheduler: dynamic intra-SM multitasking management with execution profiles on GPUs. Cluster Computing. 10.1007/s10586-021-03429-7. 25:1. (597-617). Online publication date: 1-Feb-2022.

https://link.springer.com/10.1007/s10586-021-03429-7

Jeong I, Oh Y, Ro W and Yoon M. TEA-RC: Thread Context-Aware Register Cache for GPUs. IEEE Access. 10.1109/ACCESS.2022.3196149. 10. (82049-82062).

https://ieeexplore.ieee.org/document/9848819/

Nermend M, Singh S and Singh U. (2022). An evaluation of decision on paradigm shift in higher education by digital transformation. Procedia Computer Science. 207:C. (1959-1969). Online publication date: 1-Jan-2022.

https://doi.org/10.1016/j.procs.2022.09.255

Giechaskiel I, Tian S and Szefer J. (2021). Cross-VM Information Leaks in FPGA-Accelerated Cloud Environments. 2021 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). 10.1109/HOST49136.2021.9702277. 978-1-6654-1357-2. (91-101).

https://ieeexplore.ieee.org/document/9702277/

Naderan-Tahan M and Eeckhout L. (2021). Cactus: Top-Down GPU-Compute Benchmarking using Real-Life Applications. 2021 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC53511.2021.00026. 978-1-6654-4173-5. (176-188).

https://ieeexplore.ieee.org/document/9668300/

Meyer M, Kenter T and Plessl C. (2021). In-depth FPGA Accelerator Performance Evaluation with Single Node Benchmarks from the HPC Challenge Benchmark Suite for Intel and Xilinx FPGAs using OpenCL. Journal of Parallel and Distributed Computing. 10.1016/j.jpdc.2021.10.007. Online publication date: 1-Nov-2021.

https://linkinghub.elsevier.com/retrieve/pii/S0743731521002057

Kamath A and Basu A. iGUARD. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. (49-65).

https://doi.org/10.1145/3477132.3483545

Li B, Yin J, Zhang Y and Tang X. Improving Address Translation in Multi-GPUs via Sharing and Spilling aware TLB Design. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. (1154-1168).

https://doi.org/10.1145/3466752.3480083

Cabrera A, Hitefield S, Kim J, Lee S, Miniskar N and Vetter J. (2021). Toward Performance Portable Programming for Heterogeneous Systems on a Chip: A Case Study with Qualcomm Snapdragon SoC. 2021 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC49654.2021.9622794. 978-1-6654-2369-4. (1-7).

https://ieeexplore.ieee.org/document/9622794/

Xiao C, Ran W, Lin F and Zhang L. (2021). Dynamic Fine-Grained Workload Partitioning for Irregular Applications on Discrete CPU-GPU Systems. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00148. 978-1-6654-3574-1. (1067-1074).

https://ieeexplore.ieee.org/document/9644895/

Biookaghazadeh S, Ren F and Zhao M. (2021). Characterizing Loop Acceleration in Heterogeneous Computing. 2021 IEEE 14th International Conference on Cloud Computing (CLOUD). 10.1109/CLOUD53861.2021.00059. 978-1-6654-0060-2. (445-455).

https://ieeexplore.ieee.org/document/9582262/

Geng T, Amaris M, Zuckerman S, Goldman A, Gao G and Gaudiot J. (2021). A Profile-Based AI-Assisted Dynamic Scheduling Approach for Heterogeneous Architectures. International Journal of Parallel Programming. 10.1007/s10766-021-00721-2.

https://link.springer.com/10.1007/s10766-021-00721-2

Zhang C, Zhang F, Guo X, He B, Zhang X and Du X. iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3046870. 32:7. (1740-1752).

https://ieeexplore.ieee.org/document/9305972/

Tsuji M, Kramer W, Weill J, Nominé J and Sato M. (2021). A new sustained system performance metric for scientific performance evaluation. The Journal of Supercomputing. 10.1007/s11227-020-03545-y. 77:7. (6476-6504). Online publication date: 1-Jul-2021.

https://link.springer.com/10.1007/s11227-020-03545-y

Fotouhi P, Fariborz M, Proietti R, Lowe-Power J, Akella V and Yoo S. HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads. High Performance Computing. (176-194).

https://doi.org/10.1007/978-3-030-78713-4_10

Meyer M. Towards Performance Characterization of FPGAs in Context of HPC using OpenCL Benchmarks. Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies. (1-2).

https://doi.org/10.1145/3468044.3468058

Abdolrashidi A, Esfeden H, Jahanshahi A, Singh K, Abu-Ghazaleh N and Wong D. BlockMaestro. Proceedings of the 48th Annual International Symposium on Computer Architecture. (333-346).

https://doi.org/10.1109/ISCA52012.2021.00034

Jin Z and Vetter J. (2021). Evaluating CUDA Portability with HIPCL and DPCT. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW52791.2021.00065. 978-1-6654-3577-2. (371-376).

https://ieeexplore.ieee.org/document/9460636/

Di B, Sun J, Chen H and Li D. Efficient Buffer Overflow Detection on GPU. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3042965. 32:5. (1161-1177).

https://ieeexplore.ieee.org/document/9286775/

Tian S, Giechaskiel I, Xiong W and Szefer J. (2021). Cloud FPGA Cartography using PCIe Contention. 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 10.1109/FCCM51124.2021.00035. 978-1-6654-3555-0. (224-232).

https://ieeexplore.ieee.org/document/9444054/

Schmitt N, Lange K, Sharma S, Rawtani N, Ponder C and Kounev S. The SPECpowerNext Benchmark Suite, its Implementation and New Workloads from a Developer's Perspective. Proceedings of the ACM/SPEC International Conference on Performance Engineering. (225-232).

https://doi.org/10.1145/3427921.3450239

Baruah T, Shivdikar K, Dong S, Sun Y, Mojumder S, Jung K, Abellan J, Ukidave Y, Joshi A, Kim J and Kaeli D. (2021). GNNMark: A Benchmark Suite to Characterize Graph Neural Network Training on GPUs. 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS51385.2021.00013. 978-1-7281-8643-6. (13-23).

https://ieeexplore.ieee.org/document/9408205/

Pratheek B, Jawalkar N and Basu A. (2021). Improving GPU Multi-tenancy with Page Walk Stealing. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA51647.2021.00059. 978-1-6654-2235-2. (626-639).

https://ieeexplore.ieee.org/document/9407125/

Ibrahim M, Kayiran O, Eckert Y, Loh G and Jog A. (2021). Analyzing and Leveraging Decoupled L1 Caches in GPUs. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA51647.2021.00047. 978-1-6654-2235-2. (467-478).

https://ieeexplore.ieee.org/document/9407080/

Lin F, Liu Y, Guo Y and Qian D. (2020). ELS: Emulation system for debugging and tuning large-scale parallel programs on small clusters. The Journal of Supercomputing. 10.1007/s11227-020-03319-6. 77:2. (1635-1666). Online publication date: 1-Feb-2021.

https://link.springer.com/10.1007/s11227-020-03319-6

Feng Y, Han X, Xu N, Gong J, Le L, Xing C, Yang K, Wang Y, Chen X and An W. (2021). Development of Heterogeneous Computing and Virtualization in Spaceborne IMA During 2010–2020. Signal and Information Processing, Networking and Computers. 10.1007/978-981-33-4102-9_46. (374-383).

http://link.springer.com/10.1007/978-981-33-4102-9_46

Han W, Mawhirter D, Wu B, Ma L and Tian C. (2021). FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization. Languages and Compilers for Parallel Computing. 10.1007/978-3-030-72789-5_3. (32-48).

http://link.springer.com/10.1007/978-3-030-72789-5_3

Tsai Y, Cojean T, Ribizel T and Anzt H. (2021). Preparing Ginkgo for AMD GPUs – A Testimonial on Porting CUDA Code to HIP. Euro-Par 2020: Parallel Processing Workshops. 10.1007/978-3-030-71593-9_9. (109-121).

http://link.springer.com/10.1007/978-3-030-71593-9_9

Wang Q and Chu X. GPGPU Performance Estimation With Core and Memory Frequency Scaling. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3004623. 31:12. (2865-2881).

https://ieeexplore.ieee.org/document/9124659/

Eyraud-Dubois L and Bentes C. (2020). Algorithms for Preemptive Co-scheduling of Kernels on GPUs. 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC). 10.1109/HiPC50609.2020.00033. 978-1-6654-2292-5. (192-201).

https://ieeexplore.ieee.org/document/9406773/

Carvalho P, Clua E, Paes A, Bentes C, Lopes B and Drummond L. (2020). Using machine learning techniques to analyze the performance of concurrent kernel execution on GPUs. Future Generation Computer Systems. 10.1016/j.future.2020.07.038. 113. (528-540). Online publication date: 1-Dec-2020.

https://linkinghub.elsevier.com/retrieve/pii/S0167739X19312658

Chen G, Zhang J, Zhu Z, Jiang Q, Jiang H and Pang C. (2020). CRState: checkpoint/restart of OpenCL program for in-kernel applications. The Journal of Supercomputing. 10.1007/s11227-020-03460-2.

http://link.springer.com/10.1007/s11227-020-03460-2

Kamatar A, Friese R and Gioiosa R. (2020). Locality-Aware Scheduling for Scalable Heterogeneous Environments. 2020 IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers (ROSS). 10.1109/ROSS51935.2020.00011. 978-1-6654-2268-0. (50-58).

https://ieeexplore.ieee.org/document/9307939/

Chen Y, Long X, He J, Chen Y, Tan H, Zhang Z, Winslett M and Chen D. (2020). HaoCL: Harnessing Large-scale Heterogeneous Processors Made Easy. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 10.1109/ICDCS47774.2020.00120. 978-1-7281-7002-2. (1231-1234).

https://ieeexplore.ieee.org/document/9355742/

Meyer M, Kenter T and Plessl C. (2020). Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite. 2020 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC). 10.1109/H2RC51942.2020.00007. 978-1-6654-1592-7. (10-18).

https://ieeexplore.ieee.org/document/9306963/

Sultana T, Allen B and Qasem A. Intelligent Data Placement on Discrete GPU Nodes with Unified Memory. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. (139-151).

https://doi.org/10.1145/3410463.3414651

Baruah T, Sun Y, Mojumder S, Abellán J, Ukidave Y, Joshi A, Rubin N, Kim J and Kaeli D. Valkyrie. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. (455-466).

https://doi.org/10.1145/3410463.3414639

Lavin P, Young J, Vuduc R, Riedy J, Vose A and Ernst D. Evaluating Gather and Scatter Performance on CPUs and GPUs. Proceedings of the International Symposium on Memory Systems. (209-222).

https://doi.org/10.1145/3422575.3422794

Rho S, Park G, Choi J and Park C. (2020). Development of benchmark automation suite and evaluation of various high-performance computing systems. Cluster Computing. 10.1007/s10586-020-03167-2.

http://link.springer.com/10.1007/s10586-020-03167-2

Zheng R, Liu Y and Jin H. (2020). Optimizing non-coalesced memory access for irregular applications with GPU computing. Frontiers of Information Technology & Electronic Engineering. 10.1631/FITEE.1900262. 21:9. (1285-1301). Online publication date: 1-Sep-2020.

http://link.springer.com/10.1631/FITEE.1900262

Hu B and Rossbach C. (2020). Altis: Modernizing GPGPU Benchmarks. 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS48437.2020.00011. 978-1-7281-4798-7. (1-11).

https://ieeexplore.ieee.org/document/9238617/

Azimi R, Jing C and Reda S. (2020). PowerCoord: Power Capping Coordination for Multi-CPU/GPU Servers using Reinforcement Learning. Sustainable Computing: Informatics and Systems. 10.1016/j.suscom.2020.100412. (100412). Online publication date: 1-Jul-2020.

https://linkinghub.elsevier.com/retrieve/pii/S2210537920301396

Wu Y, Shen M, Chen Y and Zhou Y. Tuning applications for efficient GPU offloading to in-memory processing. Proceedings of the 34th ACM International Conference on Supercomputing. (1-12).

https://doi.org/10.1145/3392717.3392760

Mendonça G, Liao C and Pereira F. AutoParBench. Proceedings of the 34th ACM International Conference on Supercomputing. (1-10).

https://doi.org/10.1145/3392717.3392744

Stevens J and Klöckner A. (2020). A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling. The International Journal of High Performance Computing Applications. 10.1177/1094342020921340. (109434202092134).

http://journals.sagepub.com/doi/10.1177/1094342020921340

Feinberg B, Heyman B, Mikhailenko D, Wong R, Ho A and Ipek E. (2020). Commutative Data Reordering: A New Technique to Reduce Data Movement Energy on Sparse Inference Workloads 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). 10.1109/ISCA45697.2020.00091. 978-1-7281-4661-4. (1076-1088).

https://ieeexplore.ieee.org/document/9138978/

Nie B, Jog A and Smirni E. (2020). Characterizing Accuracy-Aware Resilience of GPGPU Applications 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). 10.1109/CCGrid49817.2020.00-82. 978-1-7281-6095-5. (111-120).

https://ieeexplore.ieee.org/document/9139732/

Rodríguez-Borbón J, Kalantar A, Yamijala S, Oviedo M, Najjar W and Wong B. (2020). Field Programmable Gate Arrays for Enhancing the Speed and Energy Efficiency of Quantum Dynamics Simulations. Journal of Chemical Theory and Computation. 10.1021/acs.jctc.9b01284. 16:4. (2085-2098). Online publication date: 14-Apr-2020.

https://pubs.acs.org/doi/10.1021/acs.jctc.9b01284

Yeh T, Green R and Rogers T. Dimensionality-Aware Redundant SIMT Instruction Elimination. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. (1327-1340).

https://doi.org/10.1145/3373376.3378520

Jadidi A, Kandemir M and Das C. (2020). Selective Caching: Avoiding Performance Valleys in Massively Parallel Architectures 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP). 10.1109/PDP50117.2020.00051. 978-1-7281-6582-0. (290-298).

https://ieeexplore.ieee.org/document/9092211/

Chang C, Carpenter I and Jones W. The ESIF-HPC-2 benchmark suite. Proceedings of the Workshop on Benchmarking in the Datacenter. (1-8).

https://doi.org/10.1145/3380868.3398200

Baruah T, Sun Y, Dincer A, Mojumder S, Abellan J, Ukidave Y, Joshi A, Rubin N, Kim J and Kaeli D. (2020). Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA47549.2020.00055. 978-1-7281-6149-5. (596-609).

https://ieeexplore.ieee.org/document/9065453/

Kadam G, Zhang D and Jog A. (2020). BCoal: Bucketing-Based Memory Coalescing for Efficient and Secure GPUs 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA47549.2020.00053. 978-1-7281-6149-5. (570-581).

https://ieeexplore.ieee.org/document/9065581/

Reyes Fernandez de Bulnes D, Maldonado Y, Trujillo L and Acacio Sanchez M. (2020). Development of Multiobjective High-Level Synthesis for FPGAs. Scientific Programming. 2020. Online publication date: 1-Jan-2020.

https://doi.org/10.1155/2020/7095048

Eassa F, Alghamdi A, Haridi S, Khemakhem M, Al-Ghamdi A and Alsolami E. ACC_TEST: Hybrid Testing Approach for OpenACC-Based Programs. IEEE Access. 10.1109/ACCESS.2020.2991009. 8. (80358-80368).

https://ieeexplore.ieee.org/document/9079851/

Chen G, Zhang J, Zhu Z, Zhu C, Jiang H and Pang C. (2020). CRAC: An Automatic Assistant Compiler of Checkpoint/Restart for OpenCL Program. Data Science. 10.1007/978-981-15-2810-1_54. (574-586).

http://link.springer.com/10.1007/978-981-15-2810-1_54

Geng T, Amaris M, Zuckerman S, Goldman A, Gao G and Gaudiot J. (2020). PDAWL: Profile-Based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures. Job Scheduling Strategies for Parallel Processing. 10.1007/978-3-030-63171-0_8. (145-162).

http://link.springer.com/10.1007/978-3-030-63171-0_8

Lal S, Alpay A, Salzmann P, Cosenza B, Hirsch A, Stawinoga N, Thoman P, Fahringer T and Heuveline V. (2020). SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing. Euro-Par 2020: Parallel Processing. 10.1007/978-3-030-57675-2_39. (629-644).

http://link.springer.com/10.1007/978-3-030-57675-2_39

Gerzhoy D, Sun X, Zuzak M and Yeung D. (2019). Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors. ACM Transactions on Architecture and Code Optimization. 16:4. (1-27). Online publication date: 31-Dec-2020.

https://doi.org/10.1145/3368304

Chen G, Zhang J, Lin Q, Jiang H and Pang C. (2019). CRState: In-Kernel Checkpoint/Restart of OpenCL Program Execution on GPU 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS). 10.1109/ICPADS47876.2019.00054. 978-1-7281-2583-1. (335-342).

https://ieeexplore.ieee.org/document/8975814/

Garg A, Kulkarni P, Kurkure U, Sivaraman H and Vu L. (2019). Empirical Analysis of Hardware-Assisted GPU Virtualization 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC). 10.1109/HiPC.2019.00054. 978-1-7281-4535-8. (395-405).

https://ieeexplore.ieee.org/document/8990619/

Guo F, Li Y, Lui J and Xu Y. DCUDA. Proceedings of the ACM Symposium on Cloud Computing. (114-125).

https://doi.org/10.1145/3357223.3362714

Sun H, Gorlatch S and Zhao R. (2019). Vectorizing programs with IF-statements for processors with SIMD extensions. The Journal of Supercomputing. 10.1007/s11227-019-03057-4.

http://link.springer.com/10.1007/s11227-019-03057-4

Zhang H and Hollingsworth J. (2019). Understanding the Performance of GPGPU Applications from a Data-Centric View 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools). 10.1109/ProTools49597.2019.00006. 978-1-7281-6026-9. (1-8).

https://ieeexplore.ieee.org/document/8955684/

Do Y, Kim H, Oh P, Park D and Lee J. (2019). SNU-NPB 2019: Parallelizing and Optimizing NPB in OpenCL and CUDA for Modern GPUs 2019 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC47752.2019.9041954. 978-1-7281-4045-2. (93-105).

https://ieeexplore.ieee.org/document/9041954/

Goyat S, Kant S and Dhariwal N. (2019). Dynamic Heterogeneous scheduling of GPU-CPU in Distributed Environment 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT). 10.1109/ICSSIT46314.2019.8987886. 978-1-7281-2119-2. (329-336).

https://ieeexplore.ieee.org/document/8987886/

Green O, Fox J, Young J, Shirako J and Bader D. (2019). Performance Impact of Memory Channels on Sparse and Irregular Algorithms 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3). 10.1109/IA349570.2019.00016. 978-1-7281-5987-4. (67-70).

https://ieeexplore.ieee.org/document/8945089/

Blott M, Halder L, Leeser M and Doyle L. (2019). QuTiBench. ACM Journal on Emerging Technologies in Computing Systems. 15:4. (1-38). Online publication date: 31-Oct-2019.

https://doi.org/10.1145/3358700

Cruz R, Bentes C, Breder B, Vasconcellos E, Clua E, de Carvalho P and Drummond L. (2018). Maximizing the GPU resource usage by reordering concurrent kernels submission. Concurrency and Computation: Practice and Experience. 10.1002/cpe.4409. 31:18. Online publication date: 25-Sep-2019.

https://onlinelibrary.wiley.com/doi/10.1002/cpe.4409

Ibrahim M, Liu H, Kayiran O and Jog A. (2019). Analyzing and Leveraging Remote-Core Bandwidth for Enhanced Performance in GPUs 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2019.00028. 978-1-7281-3613-4. (258-271).

https://ieeexplore.ieee.org/document/8891655/

Akshintala A, Yu H, Peters A and Rossbach C. (2019). Trillium: The code is the IR 2019 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCS48598.2019.9188169. 978-1-7281-4484-9. (880-889).

https://ieeexplore.ieee.org/document/9188169/

Jin Z and Finkel H. (2019). Base64 Encoding on Heterogeneous Computing Platforms 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 10.1109/ASAP.2019.00014. 978-1-7281-1601-3. (247-254).

https://ieeexplore.ieee.org/document/8825134/

Lee S, Gounley J, Randles A and Vetter J. (2019). Performance portability study for massively parallel computational fluid dynamics application on scalable heterogeneous architectures. Journal of Parallel and Distributed Computing. 129:C. (1-13). Online publication date: 1-Jul-2019.

https://doi.org/10.1016/j.jpdc.2019.02.005

Pattnaik A, Tang X, Kayiran O, Jog A, Mishra A, Kandemir M, Sivasubramaniam A and Das C. Opportunistic computing in GPU architectures. Proceedings of the 46th International Symposium on Computer Architecture. (210-223).

https://doi.org/10.1145/3307650.3322212

Uhrie R, Bliss D, Chakrabarti C, Ogras U, Brunhaver J and Suresh R. (2019). Machine understanding of domain computation for Domain-Specific System-on-Chips (DSSoC) Open Architecture/Open Business Model Net-Centric Systems and Defense Transformation 2019. 10.1117/12.2519264. 9781510626959. (21).

https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11015/2519264/Machine-understanding-of-domain-computation-for-Domain-Specific-System-on/10.1117/12.2519264.full

Matz A and Fröning H. Quantifying the NUMA Behavior of Partitioned GPGPU Applications. Proceedings of the 12th Workshop on General Purpose Processing Using GPUs. (53-62).

https://doi.org/10.1145/3300053.3319420

Pellauer M, Shao Y, Clemons J, Crago N, Hegde K, Venkatesan R, Keckler S, Fletcher C and Emer J. Buffets. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. (137-151).

https://doi.org/10.1145/3297858.3304025

Pearson C, Dakkak A, Hashash S, Li C, Chung I, Xiong J and Hwu W. Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering. (209-218).

https://doi.org/10.1145/3297663.3310299

von Kistowski J, Pais J, Wahl T, Lange K, Block H, Beckett J and Kounev S. Measuring the Energy Efficiency of Transactional Loads on GPGPU. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering. (219-230).

https://doi.org/10.1145/3297663.3309667

Navarro A, Corbera F, Rodriguez A, Vilches A and Asenjo R. (2019). Heterogeneous parallel_for Template for CPU-GPU Chips. International Journal of Parallel Programming. 47:2. (213-233). Online publication date: 1-Apr-2019.

https://doi.org/10.1007/s10766-018-0555-0

Davila G, Oliveira D, Navaux P and Rech P. (2019). Identifying the Most Reliable Collaborative Workload Distribution in Heterogeneous Devices 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). 10.23919/DATE.2019.8715107. 978-3-9819263-2-3. (1325-1330).

https://ieeexplore.ieee.org/document/8715107/

Kim K, Park J and Baek W. Improving the Performance and Energy Efficiency of GPGPU Computing through Integrated Adaptive Cache Management. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2018.2868658. 30:3. (630-645).

https://ieeexplore.ieee.org/document/8454288/

Liu Y, Huang L, Wu M, Cui H, Lv F, Feng X and Xue J. PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion. Proceedings of the 28th International Conference on Compiler Construction. (2-16).

https://doi.org/10.1145/3302516.3307350

Sakdhnagool P, Sabne A and Eigenmann R. Optimizing GPU programs by register demotion. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. (405-406).

https://doi.org/10.1145/3293883.3297859

Fuchs A and Wentzlaff D. (2019). The Accelerator Wall: Limits of Chip Specialization 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2019.00023. 978-1-7281-1444-6. (1-14).

https://ieeexplore.ieee.org/document/8675237/

Carvalho P, Cruz R, Drummond L, Bentes C, Clua E, Cataldo E and Marzulo L. (2019). Kernel concurrency opportunities based on GPU benchmarks characterization. Cluster Computing. 10.1007/s10586-018-02901-1.

http://link.springer.com/10.1007/s10586-018-02901-1

Tripathy S, Sahoo D and Satpathy M. (2019). Multidimensional Grid Aware Address Prediction for GPGPU 2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID). 10.1109/VLSID.2019.00064. 978-1-7281-0409-6. (263-268).

https://ieeexplore.ieee.org/document/8711244/

Zhang F, Zhai J, Wu B, He B, Chen W and Du X. Automatic Irregularity-Aware Fine-Grained Workload Partitioning on Integrated Architectures. IEEE Transactions on Knowledge and Data Engineering. 10.1109/TKDE.2019.2940184. (1-1).

https://ieeexplore.ieee.org/document/8827952/

Tan T, Nurvitadhi E and Chiou D. Dark Wires and the Opportunities for Reconfigurable Logic. IEEE Computer Architecture Letters. 10.1109/LCA.2019.2909867. 18:1. (67-70).

https://ieeexplore.ieee.org/document/8684249/

Khaleghzadeh H, Manumachu R and Lastovetsky A. A Hierarchical Data-partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-accelerator NUMA Nodes. IEEE Access. 10.1109/ACCESS.2019.2959905. (1-1).

https://ieeexplore.ieee.org/document/8933138/

Guerreiro J, Ilic A, Roma N and Tomas P. GPU Static Modeling Using PTX and Deep Structured Learning. IEEE Access. 10.1109/ACCESS.2019.2951218. 7. (159150-159161).

https://ieeexplore.ieee.org/document/8890640/

Zhao D and Chen Q. Current Prediction Model of GPU Oriented to General Purpose Computing. IEEE Access. 10.1109/ACCESS.2019.2939256. 7. (127920-127931).

https://ieeexplore.ieee.org/document/8822998/

Alghamdi A and Eassa F. OpenACC Errors Classification and Static Detection Techniques. IEEE Access. 10.1109/ACCESS.2019.2935498. 7. (113235-113253).

https://ieeexplore.ieee.org/document/8801837/

Kanekawa N, Miyoshi T, Fujita M, Matsumoto T, Yoshida H, Jo S, Kajihara S, Ohtake S, Imai M, Yoneda T, Takizawa H, Gao Y, Sato M, Egawa R and Kobayashi H. (2019). Unknown Threats and Provisions. VLSI Design and Test for Systems Dependability. 10.1007/978-4-431-56594-9_12. (475-509).

http://link.springer.com/10.1007/978-4-431-56594-9_12

Lim R, Norris B and Malony A. (2019). A Similarity Measure for GPU Kernel Subgraph Matching. Languages and Compilers for Parallel Computing. 10.1007/978-3-030-34627-0_3. (37-53).

http://link.springer.com/10.1007/978-3-030-34627-0_3

Schrödter T, Pallasch D, Wienke S, Schmitt R and Müller M. (2019). Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography. Euro-Par 2018: Parallel Processing Workshops. 10.1007/978-3-030-10549-5_33. (421-433).

https://link.springer.com/10.1007/978-3-030-10549-5_33

Ben-Nun T, Jakobovits A and Hoefler T. Neural code comprehension. Proceedings of the 32nd International Conference on Neural Information Processing Systems. (3589-3601).

https://dl.acm.org/doi/10.5555/3327144.3327276

Yu C, Bai Y, Yang H, Cheng K, Gu Y, Luan Z and Qian D. SMGuard: A Flexible and Fine-Grained Resource Management Framework for GPUs. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2018.2848621. 29:12. (2849-2862).

https://ieeexplore.ieee.org/document/8388218/

Sathre P, Helal A and Feng W. (2018). A Composable Workflow for Productive Heterogeneous Computing on FPGAs via Whole-Program Analysis and Transformation 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig). 10.1109/RECONFIG.2018.8641694. 978-1-7281-1968-7. (1-8).

https://ieeexplore.ieee.org/document/8641694/

Bannwart Perina A and Bonato V. (2018). Mapping Estimator for OpenCL Heterogeneous Accelerators 2018 International Conference on Field-Programmable Technology (FPT). 10.1109/FPT.2018.00057. 978-1-7281-0214-6. (294-297).

https://ieeexplore.ieee.org/document/8742290/

Ausavarungnirun R, Miller V, Landgraf J, Ghose S, Gandhi J, Jog A, Rossbach C and Mutlu O. (2018). MASK. ACM SIGPLAN Notices. 53:2. (503-518). Online publication date: 30-Nov-2018.

https://doi.org/10.1145/3296957.3173169

Di B, Sun J, Li D, Chen H and Quan Z. GMOD. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques. (1-13).

https://doi.org/10.1145/3243176.3243194

Luo H, Chen G, Liu F, Li P, Ding C and Shen X. Footprint modeling of cache associativity and granularity. Proceedings of the International Symposium on Memory Systems. (232-242).

https://doi.org/10.1145/3240302.3240419

Azimi R, Jing C and Reda S. (2018). PowerCoord: A Coordinated Power Capping Controller for Multi-CPU/GPU Servers 2018 Ninth International Green and Sustainable Computing Conference (IGSC). 10.1109/IGCC.2018.8752132. 978-1-5386-7466-6. (1-9).

https://ieeexplore.ieee.org/document/8752132/

Umar M, Moore S, Meredith J, Vetter J and Cameron K. (2018). Aspen-based performance and energy modeling frameworks. Journal of Parallel and Distributed Computing. 10.1016/j.jpdc.2017.11.005. 120. (222-236). Online publication date: 1-Oct-2018.

https://linkinghub.elsevier.com/retrieve/pii/S0743731517303039

Basu A, Greathouse J, Venkataramani G and Vesely J. (2018). Interference from GPU System Service Requests 2018 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2018.8573485. 978-1-5386-6780-4. (179-190).

https://ieeexplore.ieee.org/document/8573485/

Li A, Song S, Chen J, Liu X, Tallent N and Barker K. (2018). Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite 2018 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2018.8573483. 978-1-5386-6780-4. (191-202).

https://ieeexplore.ieee.org/document/8573483/

Mammeri N and Juurlink B. (2018). VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs 2018 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2018.8573477. 978-1-5386-6780-4. (25-35).

https://ieeexplore.ieee.org/document/8573477/

Jamieson P, Sanaullah A and Herbordt M. (2018). Benchmarking Heterogeneous HPC Systems Including Reconfigurable Fabrics: Community Aspirations for Ideal Comparisons 2018 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2018.8547635. 978-1-5386-5989-2. (1-6).

https://ieeexplore.ieee.org/document/8547635/

Chen M, Chung I, Abali B and Crumley P. (2018). Towards a Single-Host Many-GPU System 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 10.1109/CAHPC.2018.8645874. 978-1-5386-7769-8. (140-147).

https://ieeexplore.ieee.org/document/8645874/

Ausavarungnirun R, Landgraf J, Miller V, Ghose S, Gandhi J, Rossbach C and Mutlu O. (2018). Mosaic. ACM SIGOPS Operating Systems Review. 52:1. (27-44). Online publication date: 28-Aug-2018.

https://doi.org/10.1145/3273982.3273986

Sawin J, Myre J and Wilken H. (2018). Economic Considerations for Integrating Massively Parallel Heterogeneous Devices into the Cloud 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud). 10.1109/FiCloud.2018.00011. 978-1-5386-7503-8. (17-24).

https://ieeexplore.ieee.org/document/8457988/

Shen D, Liu X and Lin F. (2016). Characterizing emerging heterogeneous memory. ACM SIGPLAN Notices. 51:11. (13-23). Online publication date: 19-Jul-2018.

https://doi.org/10.1145/3241624.2926702

Sinha H, Raj G, Kumar P and Choudhury T. (2018). Effective E-Healthcare System. International Journal of Big Data and Analytics in Healthcare. 3:2. (10-27). Online publication date: 1-Jul-2018.

https://doi.org/10.4018/IJBDAH.2018070102

Betts A, Chong N, Deligiannis P, Donaldson A and Ketema J. Implementing and Evaluating Candidate-Based Invariant Generation. IEEE Transactions on Software Engineering. 10.1109/TSE.2017.2718516. 44:7. (631-650).

https://ieeexplore.ieee.org/document/7955079/

Heo J, Jo G, Han H and Yang H. (2018). Accelerated Code Generator for Processing Ocean Color Remote Sensing Data on GPU IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium. 10.1109/IGARSS.2018.8519420. 978-1-5386-7150-4. (9218-9221).

https://ieeexplore.ieee.org/document/8519420/

Zacharopoulos G, Barbon A, Ansaloni G and Pozzi L. (2018). Machine Learning Approach for Loop Unrolling Factor Prediction in High Level Synthesis 2018 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCS.2018.00030. 978-1-5386-7878-7. (91-97).

https://ieeexplore.ieee.org/document/8514335/

Losch A and Platzner M. (2018). A Highly Accurate Energy Model for Task Execution on Heterogeneous Compute Nodes 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 10.1109/ASAP.2018.8445098. 978-1-5386-7479-6. (1-8).

https://ieeexplore.ieee.org/document/8445098/

Trompouki M and Kosmidis L. Brook auto. Proceedings of the 55th Annual Design Automation Conference. (1-6).

https://doi.org/10.1145/3195970.3196002

Jain A, Khairy M and Rogers T. (2018). A Quantitative Evaluation of Contemporary GPU Simulation Methodology. Proceedings of the ACM on Measurement and Analysis of Computing Systems. 2:2. (1-28). Online publication date: 13-Jun-2018.

https://doi.org/10.1145/3224430

Li A, Liu W, Wang L, Barker K and Song S. Warp-Consolidation. Proceedings of the 2018 International Conference on Supercomputing. (53-64).

https://doi.org/10.1145/3205289.3205294

Sinha H, ang D and Raj G. (2018). Elastic Search in Cache Based Service Management for Healthcare Automation 2018 12th International Conference on Communications (COMM). 10.1109/ICComm.2018.8430162. 978-1-5386-2350-3. (01-06).

https://ieeexplore.ieee.org/document/8430162/

Sinha H, Dewang and Raj G. (2018). Elastic Search in Cache Based Service Management For Healthcare Automation 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE). 10.1109/ICACCE.2018.8441722. 978-1-5386-4485-0. (445-450).

https://ieeexplore.ieee.org/document/8441722/

Trompouki M and Kosmidis L. (2018). Brook Auto: High-Level Certification-Friendly Programming for GPU-powered Automotive Systems 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 10.1109/DAC.2018.8465869. 978-1-5386-4114-9. (1-6).

https://ieeexplore.ieee.org/document/8465869/

Hong C, Spence I and Nikolopoulos D. (2017). GPU Virtualization and Scheduling Methods. ACM Computing Surveys. 50:3. (1-37). Online publication date: 31-May-2018.

https://doi.org/10.1145/3068281

Zhang P, Fang J, Yang C, Tang T, Huang C and Wang Z. MOCL. Proceedings of the 15th ACM International Conference on Computing Frontiers. (26-35).

https://doi.org/10.1145/3203217.3203244

Jacobs J. (2018). Finding the edge: Art and automation. XRDS: Crossroads, The ACM Magazine for Students. 24:3. (5-6). Online publication date: 3-Apr-2018.

https://doi.org/10.1145/3186703

Ausavarungnirun R, Miller V, Landgraf J, Ghose S, Gandhi J, Jog A, Rossbach C and Mutlu O. MASK. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. (503-518).

https://doi.org/10.1145/3173162.3173169

Lin J. (2018). Python Non-Uniform Fast Fourier Transform (PyNUFFT): An Accelerated Non-Cartesian MRI Package on a Heterogeneous Platform (CPU/GPU). Journal of Imaging. 10.3390/jimaging4030051. 4:3. (51).

https://www.mdpi.com/2313-433X/4/3/51

Saussard R, Bouzid B, Vasiliu M and Reynaud R. (2018). A novel global methodology to analyze the embeddability of real-time image processing algorithms. Journal of Real-Time Image Processing. 14:3. (565-583). Online publication date: 1-Mar-2018.

https://doi.org/10.1007/s11554-017-0686-3

Hagedorn B, Stoltzfus L, Steuwer M, Gorlatch S and Dubach C. High performance stencil code generation with Lift. Proceedings of the 2018 International Symposium on Code Generation and Optimization. (100-112).

https://doi.org/10.1145/3168824

Dao T and Lee J. An Auto-Tuner for OpenCL Work-Group Size on GPUs. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2017.2755657. 29:2. (283-296).

http://ieeexplore.ieee.org/document/8048544/

Wang H, Luo F, Ibrahim M, Kayiran O and Jog A. (2018). Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2018.00030. 978-1-5386-3659-6. (247-258).

http://ieeexplore.ieee.org/document/8327013/

Sahoo D, Sha S, Satpathy M, Mutyam M and Bhuyan L. CAMO. Proceedings of the 23rd Asia and South Pacific Design Automation Conference. (215-220).

https://dl.acm.org/doi/10.5555/3201607.3201652

(2018). Evaluating attainable memory bandwidth of parallel programming models via BabelStream. International Journal of Computational Science and Engineering. 17:3. (247-262). Online publication date: 1-Jan-2018.

https://dl.acm.org/doi/10.5555/3292750.3292751

Hagedorn B, Stoltzfus L, Steuwer M, Gorlatch S and Dubach C. (2018). High performance stencil code generation with Lift the 2018 International Symposium. 10.1145/3179541.3168824. 9781450356176. (100-112).

http://dl.acm.org/citation.cfm?doid=3179541.3168824

Sahoo D, Sha S, Satpathy M, Mutyam M and Bhuyan L. (2018). CAMO: A novel cache management organization for GPGPUs 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC). 10.1109/ASPDAC.2018.8297308. 978-1-5090-0602-1. (215-220).

http://ieeexplore.ieee.org/document/8297308/

Carvalho P, Drummond L, Bentes C, Clua E, Cataldo E and Marzulo L. (2018). Analysis and Characterization of GPU Benchmarks for Kernel Concurrency Efficiency. High Performance Computing. 10.1007/978-3-319-73353-1_5. (71-86).

http://link.springer.com/10.1007/978-3-319-73353-1_5

Matsumura K, Sato M, Boku T, Podobas A and Matsuoka S. (2018). MACC: An OpenACC Transpiler for Automatic Multi-GPU Use. Supercomputing Frontiers. 10.1007/978-3-319-69953-0_7. (109-127).

http://link.springer.com/10.1007/978-3-319-69953-0_7

Fang J, Zhang P, Tang T, Huang C and Yang C. (2017). Implementing and Evaluating OpenCL on an ARMv8 Multi-Core CPU 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC). 10.1109/ISPA/IUCC.2017.00131. 978-1-5386-3790-6. (860-867).

https://ieeexplore.ieee.org/document/8367361/

Xiao Y, Xue Y, Nazarian S and Bogdan P. A load balancing inspired optimization framework for exascale multicore systems. Proceedings of the 36th International Conference on Computer-Aided Design. (217-224).

https://dl.acm.org/doi/10.5555/3199700.3199729

Haidl M, Moll S, Klein L, Sun H, Hack S and Gorlatch S. PACXXv2 + RV. Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC. (1-12).

https://doi.org/10.1145/3148173.3148185

Mishra A, Li L, Kong M, Finkel H and Chapman B. Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading. Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC. (1-10).

https://doi.org/10.1145/3148173.3148184

Yoon M, Oh Y, Kim S, Lee S, Kim D and Ro W. Dynamic Resizing on Active Warps Scheduler to Hide Operation Stalls on GPUs. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2017.2704080. 28:11. (3142-3156).

http://ieeexplore.ieee.org/document/7927466/

Xiao Y, Xue Y, Nazarian S and Bogdan P. (2017). A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 10.1109/ICCAD.2017.8203781. 978-1-5386-3093-8. (217-224).

http://ieeexplore.ieee.org/document/8203781/

Chen G, Zhao Y, Shen X and Zhou H. (2017). EffiSha. ACM SIGPLAN Notices. 52:8. (3-16). Online publication date: 26-Oct-2017.

https://doi.org/10.1145/3155284.3018748

Ausavarungnirun R, Landgraf J, Miller V, Ghose S, Gandhi J, Rossbach C and Mutlu O. Mosaic. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. (136-150).

https://doi.org/10.1145/3123939.3123975

Li A, Zhao W and Song S. BVF. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. (532-545).

https://doi.org/10.1145/3123939.3123944

Lee S and Wu C. (2017). Performance characterization, prediction, and optimization for heterogeneous systems with multi-level memory interference 2017 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2017.8167755. 978-1-5386-1233-0. (43-53).

http://ieeexplore.ieee.org/document/8167755/

Koo G, Oh Y, Ro W and Annavaram M. (2017). Access Pattern-Aware Cache Management for Improving Data Utilization in GPU. ACM SIGARCH Computer Architecture News. 45:2. (307-319). Online publication date: 14-Sep-2017.

https://doi.org/10.1145/3140659.3080239

Maurer L, Downen P, Ariola Z and Peyton Jones S. (2017). Compiling without continuations. ACM SIGPLAN Notices. 52:6. (482-494). Online publication date: 14-Sep-2017.

https://doi.org/10.1145/3140587.3062380

Mamouras K, Raghothaman M, Alur R, Ives Z and Khanna S. (2017). StreamQRE: modular specification and efficient evaluation of quantitative queries over streaming data. ACM SIGPLAN Notices. 52:6. (693-708). Online publication date: 14-Sep-2017.

https://doi.org/10.1145/3140587.3062369

Feng Y, Martins R, Van Geffen J, Dillig I and Chaudhuri S. (2017). Component-based synthesis of table consolidation and transformation tasks from examples. ACM SIGPLAN Notices. 52:6. (422-436). Online publication date: 14-Sep-2017.

https://doi.org/10.1145/3140587.3062351

Chu S, Weitz K, Cheung A and Suciu D. (2017). HoTTSQL: proving query rewrites with univalent SQL semantics. ACM SIGPLAN Notices. 52:6. (510-524). Online publication date: 14-Sep-2017.

https://doi.org/10.1145/3140587.3062348

Eizenberg A, Peng Y, Pigli T, Mansky W and Devietti J. (2017). BARRACUDA: binary-level analysis of runtime RAces in CUDA programs. ACM SIGPLAN Notices. 52:6. (126-140). Online publication date: 14-Sep-2017.

https://doi.org/10.1145/3140587.3062342

Cummins C, Petoumenos P, Wang Z and Leather H. (2017). End-to-End Deep Learning of Optimization Heuristics 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2017.24. 978-1-5090-6764-0. (219-232).

http://ieeexplore.ieee.org/document/8091247/

Huang Y and Li D. (2017). Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems 2017 IEEE International Conference on Cluster Computing (CLUSTER). 10.1109/CLUSTER.2017.42. 978-1-5386-2326-8. (166-177).

http://ieeexplore.ieee.org/document/8048928/

Fang Y, Chen Q, Xiong N, Zhao D and Wang J. (2017). RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization. Sensors. 10.3390/s17081799. 17:8. (1799).

https://www.mdpi.com/1424-8220/17/8/1799

Amrizal M and Takizawa H. (2017). Optimizing Energy Consumption on HPC Systems with a Multi-Level Checkpointing Mechanism 2017 International Conference on Networking, Architecture, and Storage (NAS). 10.1109/NAS.2017.8026868. 978-1-5386-3486-8. (1-9).

http://ieeexplore.ieee.org/document/8026868/

Ham T, Aragón J and Martonosi M. (2017). Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization. 14:2. (1-27). Online publication date: 30-Jun-2017.

https://doi.org/10.1145/3075620

Koo G, Oh Y, Ro W and Annavaram M. Access Pattern-Aware Cache Management for Improving Data Utilization in GPU. Proceedings of the 44th Annual International Symposium on Computer Architecture. (307-319).

https://doi.org/10.1145/3079856.3080239

Eizenberg A, Peng Y, Pigli T, Mansky W and Devietti J. BARRACUDA: binary-level analysis of runtime RAces in CUDA programs. Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. (126-140).

https://doi.org/10.1145/3062341.3062342

Khairy M, Zahran M and Wassal A. (2017). SACAT. IEEE Transactions on Parallel and Distributed Systems. 28:6. (1740-1753). Online publication date: 1-Jun-2017.

https://doi.org/10.1109/TPDS.2016.2627560

Loghin D, Ramapantulu L and Teo Y. (2017). On Understanding Time, Energy and Cost Performance of Wimpy Heterogeneous Systems for Edge Computing. 2017 IEEE International Conference on Edge Computing (EDGE). 10.1109/IEEE.EDGE.2017.10. 978-1-5386-2017-5. (1-8).

http://ieeexplore.ieee.org/document/8029250/

Losada N, Fraguela B, González P and Martín M. (2017). A portable and adaptable fault tolerance solution for heterogeneous applications. Journal of Parallel and Distributed Computing. 104:C. (146-158). Online publication date: 1-Jun-2017.

https://doi.org/10.1016/j.jpdc.2017.01.020

Che S, Beckmann B and Reinhardt S. (2017). Programming GPGPU Graph Applications with Linear Algebra Building Blocks. International Journal of Parallel Programming. 45:3. (657-679). Online publication date: 1-Jun-2017.

https://doi.org/10.1007/s10766-016-0448-z

Tang L, Barrett R, Cook J and Hu X. (2017). PeaPaw. ACM Transactions on Design Automation of Electronic Systems. 22:3. (1-26). Online publication date: 31-May-2017.

https://doi.org/10.1145/2999540

Gleeson J, Kats D, Mei C and de Lara E. Crane. Proceedings of the 10th ACM International Systems and Storage Conference. (1-13).

https://doi.org/10.1145/3078468.3078478

Wang Q, Xu P, Zhang Y and Chu X. EPPMiner. Proceedings of the Eighth International Conference on Future Energy Systems. (23-33).

https://doi.org/10.1145/3077839.3077858

Hou K, Wang H and Feng W. GPU-UniCache. Proceedings of the Computing Frontiers Conference. (107-116).

https://doi.org/10.1145/3075564.3075583

Wu B, Liu X, Zhou X and Jiang C. (2017). FLEP. ACM SIGPLAN Notices. 52:4. (483-496). Online publication date: 12-May-2017.

https://doi.org/10.1145/3093336.3037742

Wu B, Liu X, Zhou X and Jiang C. (2017). FLEP. ACM SIGARCH Computer Architecture News. 45:1. (483-496). Online publication date: 11-May-2017.

https://doi.org/10.1145/3093337.3037742

Pino S, Pollock L and Chandrasekaran S. (2017). Exploring translation of OpenMP to OpenACC 2.5: lessons learned. 2017 IEEE International Parallel and Distributed Processing Symposium: Workshops (IPDPSW). 10.1109/IPDPSW.2017.84. 978-1-5386-3408-0. (673-682).

http://ieeexplore.ieee.org/document/7965109/

Lal S, Lucas J and Juurlink B. (2017). E^2MC: Entropy Encoding Based Memory Compression for GPUs. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS.2017.101. 978-1-5386-3914-6. (1119-1128).

http://ieeexplore.ieee.org/document/7967202/

Jadidi A, Arjomand M, Kandemir M and Das C. Optimizing energy consumption in GPUs through feedback-driven CTA scheduling. Proceedings of the 25th High Performance Computing Symposium. (1-12).

https://dl.acm.org/doi/10.5555/3108096.3108108

Jun T, Yoo M, Kim D, Cho K, Lee S and Yeun K. HPC Supported Mission-Critical Cloud Architecture. Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering. (223-232).

https://doi.org/10.1145/3030207.3044531

Wu B, Liu X, Zhou X and Jiang C. (2017). FLEP. ACM SIGOPS Operating Systems Review. 10.1145/3093315.3037742. 51:2. (483-496). Online publication date: 4-Apr-2017.

http://dl.acm.org/citation.cfm?doid=3093315.3037742

Wu B, Liu X, Zhou X and Jiang C. FLEP. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. (483-496).

https://doi.org/10.1145/3037697.3037742

Lopes A, Pratas F, Sousa L and Ilic A. (2017). Exploring GPU performance, power and energy-efficiency bounds with Cache-aware Roofline Modeling. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2017.7975297. 978-1-5386-3890-3. (259-268).

http://ieeexplore.ieee.org/document/7975297/

Chen H, Wang M, Hu Y, Song M and Li T. (2017). GaaS workload characterization under NUMA architecture for virtualized GPU. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2017.7975271. 978-1-5386-3890-3. (65-76).

http://ieeexplore.ieee.org/document/7975271/

Gomez-Luna J, Hajj I, Chang L, Garcia-Flores V, de Gonzalo S, Jablin T, Pena A and Hwu W. (2017). Chai: Collaborative heterogeneous applications for integrated-architectures. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2017.7975269. 978-1-5386-3890-3. (43-54).

https://ieeexplore.ieee.org/document/7975269/

Menon V and Raju K. (2017). Performance analysis of ray tracing based rendering using OpenCL. 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). 10.1109/IPACT.2017.8244923. 978-1-5090-5682-8. (1-5).

http://ieeexplore.ieee.org/document/8244923/

Chen G, Shen X, Wu B and Li D. (2017). Optimizing Data Placement on GPU Memory. IEEE Transactions on Computers. 66:3. (473-487). Online publication date: 1-Mar-2017.

https://doi.org/10.1109/TC.2016.2604372

Cummins C, Petoumenos P, Wang Z and Leather H. Synthesizing benchmarks for predictive modeling. Proceedings of the 2017 International Symposium on Code Generation and Optimization. (86-99).

https://dl.acm.org/doi/10.5555/3049832.3049843

Erb C, Collins M and Greathouse J. Dynamic buffer overflow detection for GPGPUs. Proceedings of the 2017 International Symposium on Code Generation and Optimization. (61-73).

https://dl.acm.org/doi/10.5555/3049832.3049840

Zhang F, Wu B, Zhai J, He B and Chen W. FinePar: irregularity-aware fine-grained workload partitioning on integrated architectures. Proceedings of the 2017 International Symposium on Code Generation and Optimization. (27-38).

https://dl.acm.org/doi/10.5555/3049832.3049836

Majumdar A, Piga L, Paul I, Greathouse J, Huang W and Albonesi D. (2017). Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. 2017 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA.2017.34. 978-1-5090-4985-1. (613-624).

http://ieeexplore.ieee.org/document/7920860/

Cummins C, Petoumenos P, Wang Z and Leather H. (2017). Synthesizing benchmarks for predictive modeling. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2017.7863731. 978-1-5090-4931-8. (86-99).

http://ieeexplore.ieee.org/document/7863731/

Erb C, Collins M and Greathouse J. (2017). Dynamic buffer overflow detection for GPGPUs. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2017.7863729. 978-1-5090-4931-8. (61-73).

http://ieeexplore.ieee.org/document/7863729/

Zhang F, Wu B, Zhai J, He B and Chen W. (2017). FinePar: Irregularity-aware fine-grained workload partitioning on integrated architectures. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2017.7863726. 978-1-5090-4931-8. (27-38).

http://ieeexplore.ieee.org/document/7863726/

Chen G, Zhao Y, Shen X and Zhou H. EffiSha. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. (3-16).

https://doi.org/10.1145/3018743.3018748

Benkner S, Pllana S, Träff J, Tsigas P, Richards A, Russell G, Thibault S, Augonnet C, Namyst R, Cornelius H, Keßler C, Moloney D and Sanders P. (2017). Peppher: Performance Portability and Programmability for Heterogeneous Many‐Core Architectures. Programming multi‐core and many‐core computing systems. 10.1002/9781119332015.ch12. (241-260). Online publication date: 24-Jan-2017.

https://onlinelibrary.wiley.com/doi/10.1002/9781119332015.ch12

Tamarit S, Mariño J, Vigueras G and Carro M. (2017). Towards a Semantics-Aware Code Transformation Toolchain for Heterogeneous Systems. Electronic Proceedings in Theoretical Computer Science. 10.4204/EPTCS.237.3. 237. (34-51).

http://arxiv.org/abs/1701.03319

Küsters A, Wienke S and Arnold L. (2017). Performance Portability Analysis for Real-Time Simulations of Smoke Propagation Using OpenACC. High Performance Computing. 10.1007/978-3-319-67630-2_35. (477-495).

http://link.springer.com/10.1007/978-3-319-67630-2_35

Steinbach P and Werner M. (2017). gearshifft – The FFT Benchmark Suite for Heterogeneous Platforms. High Performance Computing. 10.1007/978-3-319-58667-0_11. (199-216).

http://link.springer.com/10.1007/978-3-319-58667-0_11

Tamarit S, Vigueras G, Carro M and Mariño J. (2017). Machine Learning-Driven Automatic Program Transformation to Increase Performance in Heterogeneous Architectures. Tools for High Performance Computing 2016. 10.1007/978-3-319-56702-0_7. (115-140).

http://link.springer.com/10.1007/978-3-319-56702-0_7

Bridges R, Imam N and Mintz T. (2016). Understanding GPU Power. ACM Computing Surveys. 49:3. (1-27). Online publication date: 13-Dec-2016.

https://doi.org/10.1145/2962131

Tupinamba A and Sztajnberg A. (2016). Transparent and Optimized Distributed Processing on GPUs. IEEE Transactions on Parallel and Distributed Systems. 27:12. (3673-3686). Online publication date: 1-Dec-2016.

https://doi.org/10.1109/TPDS.2016.2550445

Xie B, Liu X, McKee S, Zhan J, Jia Z, Wang L and Zhang L. (2016). Understanding Data Analytics Workloads on Intel(R) Xeon Phi(R). 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS). 10.1109/HPCC-SmartCity-DSS.2016.0039. 978-1-5090-4297-5. (206-215).

http://ieeexplore.ieee.org/document/7828380/

Allen T and Ge R. Characterizing power and performance of GPU memory access. Proceedings of the 4th International Workshop on Energy Efficient Supercomputing. (46-53).

https://dl.acm.org/doi/10.5555/3018076.3018083

Allen T and Ge R. (2016). Characterizing Power and Performance of GPU Memory Access. 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC). 10.1109/E2SC.2016.012. 978-1-5090-3856-5. (46-53).

http://ieeexplore.ieee.org/document/7830508/

Hajj I, Gómez-Luna J, Li C, Chang L, Milojicic D and Hwu W. KLAP. The 49th Annual IEEE/ACM International Symposium on Microarchitecture. (1-12).

https://dl.acm.org/doi/10.5555/3195638.3195654

Chang L, Hajj I, Rodrigues C, Gómez-Luna J and Hwu W. Efficient kernel synthesis for performance portable programming. The 49th Annual IEEE/ACM International Symposium on Microarchitecture. (1-13).

https://dl.acm.org/doi/10.5555/3195638.3195653

Yoon M, Kim K, Lee S, Ro W and Annavaram M. (2016). Virtual thread. ACM SIGARCH Computer Architecture News. 44:3. (609-621). Online publication date: 12-Oct-2016.

https://doi.org/10.1145/3007787.3001201

Umar M, Meredith J, Vetter J and Cameron K. (2016). A Study of Power-Performance Modeling Using a Domain-Specific Language. 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 10.1109/SBAC-PAD.2016.19. 978-1-5090-6108-2. (84-92).

http://ieeexplore.ieee.org/document/7789327/

Hajj I, Gomez-Luna J, Li C, Chang L, Milojicic D and Hwu W. (2016). KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 10.1109/MICRO.2016.7783716. 978-1-5090-3508-3. (1-12).

http://ieeexplore.ieee.org/document/7783716/

Chang L, Hajj I, Rodrigues C, Gomez-Luna J and Hwu W. (2016). Efficient kernel synthesis for performance portable programming. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 10.1109/MICRO.2016.7783715. 978-1-5090-3508-3. (1-13).

http://ieeexplore.ieee.org/document/7783715/

Kim K, Park J and Baek W. (2016). IACM: Integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. 2016 IEEE 34th International Conference on Computer Design (ICCD). 10.1109/ICCD.2016.7753308. 978-1-5090-5142-7. (380-383).

http://ieeexplore.ieee.org/document/7753308/

Wang B, Zhu Y and Yu W. OAWS. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. (45-55).

https://doi.org/10.1145/2967938.2967947

Kayiran O, Jog A, Pattnaik A, Ausavarungnirun R, Tang X, Kandemir M, Loh G, Mutlu O and Das C. μC-States. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. (17-30).

https://doi.org/10.1145/2967938.2967941

Pattnaik A, Tang X, Jog A, Kayiran O, Mishra A, Kandemir M, Mutlu O and Das C. Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. (31-44).

https://doi.org/10.1145/2967938.2967940

Saussard R, Bouzid B, Vasiliu M and Reynaud R. (2016). A Robust Methodology for Performance Analysis on Hybrid Embedded Multicore Architectures. 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC). 10.1109/MCSoC.2016.35. 978-1-5090-3531-1. (77-84).

http://ieeexplore.ieee.org/document/7774423/

Adhinarayanan V, Paul I, Greathouse J, Huang W, Pattnaik A and Feng W. (2016). Measuring and modeling on-chip interconnect power on real hardware. 2016 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2016.7581263. 978-1-5090-3896-1. (1-11).

http://ieeexplore.ieee.org/document/7581263/

Sun Y, Gong X, Ziabari A, Yu L, Li X, Mukherjee S, Mccardwell C, Villegas A and Kaeli D. (2016). Hetero-mark, a benchmark suite for CPU-GPU collaborative computing. 2016 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2016.7581262. 978-1-5090-3896-1. (1-10).

http://ieeexplore.ieee.org/document/7581262/

Chang L, Kim H and Hwu W. (2016). DySel. ACM SIGARCH Computer Architecture News. 44:2. (667-680). Online publication date: 29-Jul-2016.

https://doi.org/10.1145/2980024.2872373

Gallardo E, Teller P, Argueta A and Jaloma J. Cross-Accelerator Performance Profiling. Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale. (1-8).

https://doi.org/10.1145/2949550.2949567

Sen R and Wood D. (2016). GPGPU Footprint Models to Estimate per-Core Power. IEEE Computer Architecture Letters. 15:2. (97-100). Online publication date: 1-Jul-2016.

https://doi.org/10.1109/LCA.2015.2456909

Delporte B, Rigamonti R and Dassatti A. (2016). HPA: An opportunistic approach to embedded energy efficiency. 2016 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCSim.2016.7568415. 978-1-5090-2088-1. (792-799).

http://ieeexplore.ieee.org/document/7568415/

Obrecht C, Asinari P, Kuznik F and Roux J. (2016). Thermal link-wise artificial compressibility method. Computers & Mathematics with Applications. 72:2. (375-385). Online publication date: 1-Jul-2016.

https://doi.org/10.1016/j.camwa.2015.05.022

Jog A, Kayiran O, Pattnaik A, Kandemir M, Mutlu O, Iyer R and Das C. (2016). Exploiting Core Criticality for Enhanced GPU Performance. ACM SIGMETRICS Performance Evaluation Review. 44:1. (351-363). Online publication date: 30-Jun-2016.

https://doi.org/10.1145/2964791.2901468

Yoon M, Kim K, Lee S, Ro W and Annavaram M. Virtual thread. Proceedings of the 43rd International Symposium on Computer Architecture. (609-621).

https://doi.org/10.1109/ISCA.2016.59

Shen D, Liu X and Lin F. Characterizing emerging heterogeneous memory. Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management. (13-23).

https://doi.org/10.1145/2926697.2926702

Jog A, Kayiran O, Pattnaik A, Kandemir M, Mutlu O, Iyer R and Das C. Exploiting Core Criticality for Enhanced GPU Performance. Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science. (351-363).

https://doi.org/10.1145/2896377.2901468

Hasabnis N and Sekar R. (2016). Lifting Assembly to Intermediate Representation. ACM SIGPLAN Notices. 51:4. (311-324). Online publication date: 9-Jun-2016.

https://doi.org/10.1145/2954679.2872380

Chang L, Kim H and Hwu W. (2016). DySel. ACM SIGPLAN Notices. 51:4. (667-680). Online publication date: 9-Jun-2016.

https://doi.org/10.1145/2954679.2872373

Panda R, Eckert Y, Jayasena N, Kayiran O, Boyer M and John L. Prefetching Techniques for Near-memory Throughput Processors. Proceedings of the 2016 International Conference on Supercomputing. (1-14).

https://doi.org/10.1145/2925426.2926282

Chen G and Shen X. Coherence-Free Multiview. Proceedings of the 2016 International Conference on Supercomputing. (1-13).

https://doi.org/10.1145/2925426.2926277

Kumar S, Srinivasan V, Sharifian A, Sumner N and Shriraman A. Peruse and Profit. Proceedings of the 2016 International Conference on Supercomputing. (1-13).

https://doi.org/10.1145/2925426.2926269

Li A, Song S, Wijtvliet M, Kumar A and Corporaal H. SFU-Driven Transparent Approximation Acceleration on GPUs. Proceedings of the 2016 International Conference on Supercomputing. (1-14).

https://doi.org/10.1145/2925426.2926255

Wu W, Bosilca G, vandeVaart R, Jeaugey S and Dongarra J. GPU-Aware Non-contiguous Data Movement In Open MPI. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. (231-242).

https://doi.org/10.1145/2907294.2907317

Adhinarayanan V, Subramaniam B and Feng W. Online power estimation of graphics processing units. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (245-254).

https://doi.org/10.1109/CCGrid.2016.93

Ukidave Y, Li X and Kaeli D. (2016). Mystic: Predictive Scheduling for GPU Based Cloud Servers Using Machine Learning. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS.2016.73. 978-1-5090-2140-6. (353-362).

http://ieeexplore.ieee.org/document/7516031/

Tallent N, Manzano J, Gawande N, Kang S, Kerbyson D, Hoisie A and Cross J. (2016). Algorithm and Architecture Independent Benchmarking with SEAK. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS.2016.25. 978-1-5090-2140-6. (63-72).

http://ieeexplore.ieee.org/document/7516002/

Heinecke A, Karlstetter R, Pflüger D and Bungartz H. (2016). Data mining on vast data sets as a cluster system benchmark. Concurrency and Computation: Practice & Experience. 28:7. (2145-2165). Online publication date: 1-May-2016.

https://doi.org/10.1002/cpe.3514

Aviv R and Wang G. OpenCL-Based Mobile GPGPU Benchmarking. Proceedings of the 4th International Workshop on OpenCL. (1-4).

https://doi.org/10.1145/2909437.2909441

Dev K, Paul I and Huang W. A framework for evaluating promising power efficiency techniques in future GPUs for HPC. Proceedings of the 24th High Performance Computing Symposium. (1-8).

https://doi.org/10.22360/SpringSim.2016.HPC.003

Adhinarayanan V and Feng W. (2016). An automated framework for characterizing and subsetting GPGPU workloads. 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2016.7482105. 978-1-5090-1953-3. (307-317).

http://ieeexplore.ieee.org/document/7482105/

Giefers H, Staar P, Bekas C and Hagleitner C. (2016). Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA. 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2016.7482073. 978-1-5090-1953-3. (46-56).

http://ieeexplore.ieee.org/document/7482073/

Chang L, Kim H and Hwu W. (2016). DySel. ACM SIGOPS Operating Systems Review. 10.1145/2954680.2872373. 50:2. (667-680). Online publication date: 25-Mar-2016.

http://dl.acm.org/citation.cfm?doid=2954680.2872373

Chang L, Kim H and Hwu W. DySel. Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. (667-680).

https://doi.org/10.1145/2872362.2872373

Soldado F, Alexandre F and Paulino H. (2016). Execution of compound multi-kernel OpenCL computations in multi-CPU/multi-GPU environments. Concurrency and Computation: Practice & Experience. 28:3. (768-787). Online publication date: 10-Mar-2016.

https://doi.org/10.1002/cpe.3612

de Oliveira D, Pilla L, Santini T and Rech P. (2016). Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units. IEEE Transactions on Computers. 65:3. (791-804). Online publication date: 1-Mar-2016.

https://doi.org/10.1109/TC.2015.2444855

Wong D, Kim N and Annavaram M. (2016). Approximating warps with intra-warp operand value similarity. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2016.7446063. 978-1-4673-9211-2. (176-187).

http://ieeexplore.ieee.org/document/7446063/

Wu J, Belevich A, Bendersky E, Heffernan M, Leary C, Pienaar J, Roune B, Springer R, Weng X and Hundt R. gpucc: an open-source GPGPU compiler. Proceedings of the 2016 International Symposium on Code Generation and Optimization. (105-116).

https://doi.org/10.1145/2854038.2854041

Langenkämper D, Jakobi T, Feld D, Jelonek L, Goesmann A and Nattkemper T. (2016). Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations. Frontiers in Genetics. 10.3389/fgene.2016.00005. 7.

http://journal.frontiersin.org/Article/10.3389/fgene.2016.00005/abstract

Lopez-Novoa U, Mendiburu A and Miguel-Alonso J. (2016). Kernel density estimation in accelerators. The Journal of Supercomputing. 72:2. (545-566). Online publication date: 1-Feb-2016.

https://doi.org/10.1007/s11227-015-1577-7

Cherubin S, Scandale M and Agosta G. Stack size estimation on machine-independent intermediate code for OpenCL kernels. Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms. (1-6).

https://doi.org/10.1145/2872421.2872425

Paul I, Huang W, Arora M and Yalamanchili S. (2015). Harmonia. ACM SIGARCH Computer Architecture News. 43:3S. (54-65). Online publication date: 4-Jan-2016.

https://doi.org/10.1145/2872887.2750404

Bhura M, Deshpande P and Chandrasekaran K. (2016). CUDA or OpenCL. Research Advances in the Integration of Big Data and Smart Computing. 10.4018/978-1-4666-8737-0.ch015. (267-279).

http://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/978-1-4666-8737-0.ch015

Welch A and Venkata M. (2016). On Synchronisation and Memory Reuse in OpenSHMEM. OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments. 10.1007/978-3-319-50995-2_6. (82-94).

http://link.springer.com/10.1007/978-3-319-50995-2_6

Grodowitz M, D’Azevedo E, Powers S and Imam N. (2016). Using Hybrid Model OpenSHMEM + CUDA to Implement the SHOC Benchmark Suite. OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments. 10.1007/978-3-319-50995-2_14. (204-216).

http://link.springer.com/10.1007/978-3-319-50995-2_14

Deakin T, Price J, Martineau M and McIntosh-Smith S. (2016). GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models. High Performance Computing. 10.1007/978-3-319-46079-6_34. (489-507).

http://link.springer.com/10.1007/978-3-319-46079-6_34

Manochio R, Buzatto D, de Ávila P and Pantoni R. (2016). Algorithms Performance Evaluation in Hybrid Systems. Information Technology: New Generations. 10.1007/978-3-319-32467-8_101. (1169-1181).

http://link.springer.com/10.1007/978-3-319-32467-8_101

Steuwer M, Fensch C, Lindley S and Dubach C. (2015). Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code. ACM SIGPLAN Notices. 50:9. (205-217). Online publication date: 18-Dec-2015.

https://doi.org/10.1145/2858949.2784754

Daga M and Greathouse J. Structural Agnostic SpMV. Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC). (64-74).

https://doi.org/10.1109/HiPC.2015.55

Li T, Narayana V and El-Ghazawi T. A Power-Aware Symbiotic Scheduling Algorithm for Concurrent GPU Kernels. Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS). (562-569).

https://doi.org/10.1109/ICPADS.2015.76

Chen G and Shen X. Free launch. Proceedings of the 48th International Symposium on Microarchitecture. (407-419).

https://doi.org/10.1145/2830772.2830818

Ham T, Aragón J and Martonosi M. DeSC. Proceedings of the 48th International Symposium on Microarchitecture. (191-203).

https://doi.org/10.1145/2830772.2830800

Shao Y and Brooks D. (2015). Research Infrastructures for Hardware Accelerators. Synthesis Lectures on Computer Architecture. 10.2200/S00677ED1V01Y201511CAC034. 10:4. (1-99). Online publication date: 18-Nov-2015.

http://www.morganclaypool.com/doi/10.2200/S00677ED1V01Y201511CAC034

Lopez M, Young J, Meredith J, Roth P, Horton M and Vetter J. Examining recent many-core architectures and programming models using SHOC. Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems. (1-12).

https://doi.org/10.1145/2832087.2832090

Wen M, Huang D, Xun C and Chen D. (2015). Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations. Frontiers of Information Technology & Electronic Engineering. 10.1631/FITEE.1500032. 16:11. (899-916). Online publication date: 1-Nov-2015.

http://link.springer.com/10.1631/FITEE.1500032

Baghdadi R, Beaugnon U, Cohen A, Grosser T, Kruse M, Reddy C, Verdoolaege S, Betts A, Donaldson A, Ketema J, Absar J, Haastregt S, Kravets A, Lokhmotov A, David R and Hajiyev E. PENCIL. Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT). (138-149).

https://doi.org/10.1109/PACT.2015.17

Pisal T, Walunj S, Shrimali A, Gautam O and Patil L. Acceleration of CUDA programs for non-GPU users using cloud. Proceedings of the 2015 International Conference on Green Computing and Internet of Things (ICGCIoT). (365-370).

https://doi.org/10.1109/ICGCIoT.2015.7380490

Jog A, Kayiran O, Kesten T, Pattnaik A, Bolotin E, Chatterjee N, Keckler S, Kandemir M and Das C. Anatomy of GPU Memory System for Multi-Application Execution. Proceedings of the 2015 International Symposium on Memory Systems. (223-234).

https://doi.org/10.1145/2818950.2818979

Majumdar A, Wu G, Dev K, Greathouse J, Paul I, Huang W, Venugopal A, Piga L, Freitag C and Puthoor S. A Taxonomy of GPGPU Performance Scaling. Proceedings of the 2015 IEEE International Symposium on Workload Characterization. (118-119).

https://doi.org/10.1109/IISWC.2015.22

Awan A, Hamidouche K, Venkatesh A, Perkins J, Subramoni H and Panda D. GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks. Proceedings of the 22nd European MPI Users' Group Meeting. (1-10).

https://doi.org/10.1145/2802658.2802672

Aji A, Peña A, Balaji P and Feng W. Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL. Proceedings of the 2015 IEEE International Conference on Cluster Computing. (42-51).

https://doi.org/10.1109/CLUSTER.2015.15

Ryoo J, Quirem S, Lebeane M, Panda R, Song S and John L. GPGPU Benchmark Suites. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP). (320-329).

https://doi.org/10.1109/ICPP.2015.41

Vilches A, Asenjo R, Navarro A, Corbera F, Gran R and Garzarán M. (2015). Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips. Procedia Computer Science. 51:C. (140-149). Online publication date: 1-Sep-2015.

https://doi.org/10.1016/j.procs.2015.05.213

Loghin D, Ramapantulu L, Barbu O and Teo Y. (2015). A time-energy performance analysis of MapReduce on heterogeneous systems with GPUs. Performance Evaluation. 91:C. (255-269). Online publication date: 1-Sep-2015.

https://doi.org/10.1016/j.peva.2015.06.015

Walsh J and Dukes J. Application Support for Virtual GPGPUs in Grid Infrastructures. Proceedings of the 2015 IEEE 11th International Conference on e-Science. (67-77).

https://doi.org/10.1109/eScience.2015.45

Steuwer M, Fensch C, Lindley S and Dubach C. Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code. Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming. (205-217).

https://doi.org/10.1145/2784731.2784754

Mittal S and Vetter J. (2015). A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys. 47:4. (1-35). Online publication date: 21-Jul-2015.

https://doi.org/10.1145/2788396

Dao T, Kim J, Seo S, Egger B and Lee J. A Performance Model for GPUs with Caches. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2014.2333526. 26:7. (1800-1813).

http://ieeexplore.ieee.org/document/6844867/

Chen G, Wu B, Li D and Shen X. (2015). Enabling Portable Optimizations of Data Placement on GPU. IEEE Micro. 35:4. (16-24). Online publication date: 1-Jul-2015.

https://doi.org/10.1109/MM.2015.53

Zheng Z, Wang Z and Lipasti M. (2015). Adaptive Cache and Concurrency Allocation on GPGPUs. IEEE Computer Architecture Letters. 14:2. (90-93). Online publication date: 1-Jul-2015.

https://doi.org/10.1109/LCA.2014.2359882

Tamarit S, Vigueras G, Carro M and Mariño J. A Haskell Implementation of a Rule-Based Program Transformation for C Programs. Proceedings of the 17th International Symposium on Practical Aspects of Declarative Languages - Volume 9131. (105-114).

https://doi.org/10.1007/978-3-319-19686-2_8

Paul I, Huang W, Arora M and Yalamanchili S. Harmonia. Proceedings of the 42nd Annual International Symposium on Computer Architecture. (54-65).

https://doi.org/10.1145/2749469.2750404

Wang B, Yu W, Sun X and Wang X. DaCache. Proceedings of the 29th ACM on International Conference on Supercomputing. (89-98).

https://doi.org/10.1145/2751205.2751239

Wu B, Chen G, Li D, Shen X and Vetter J. Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations. Proceedings of the 29th ACM on International Conference on Supercomputing. (119-130).

https://doi.org/10.1145/2751205.2751213

Ndu G, Navaridas J and Luján M. CHO. Proceedings of the 3rd International Workshop on OpenCL. (1-10).

https://doi.org/10.1145/2791321.2791331

Shao Y, Reagen B, Wei G and Brooks D. (2015). The Aladdin Approach to Accelerator Design and Modeling. IEEE Micro. 35:3. (58-70). Online publication date: 1-May-2015.

https://doi.org/10.1109/MM.2015.50

Tang L, Hu X and Barrett R. PerDome. Proceedings of the Symposium on High Performance Computing. (225-232).

https://dl.acm.org/doi/10.5555/2872599.2872627

Wang B, Liu Z, Wang X and Yu W. Eliminating intra-warp conflict misses in GPU. Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. (689-694).

https://dl.acm.org/doi/10.5555/2755753.2755911

Guttman D and Kandemir M. (2015). Performance and energy evaluation of data prefetching on intel Xeon Phi. 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2015.7095814. 978-1-4799-1957-4. (288-297).

http://ieeexplore.ieee.org/document/7095814/

Oka K, Jia W, Martonosi M and Inoue K. (2015). Characterization and cross-platform analysis of high-throughput accelerators. 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2015.7095797. 978-1-4799-1957-4. (161-162).

http://ieeexplore.ieee.org/document/7095797/

Guttman D, Kandemir M, Arunachalam M and Khanna R. (2015). Machine learning techniques for improved data prefetching. 2015 International Conference on Energy Aware Computing (ICEAC). 10.1109/ICEAC.2015.7352208. 978-1-4799-1771-6. (1-4).

http://ieeexplore.ieee.org/document/7352208/

Saeed I, Young J and Yalamanchili S. A portable benchmark suite for highly parallel data intensive query processing. Proceedings of the 2nd Workshop on Parallel Programming for Analytics Applications. (31-38).

https://doi.org/10.1145/2726935.2726943

Fauzia N, Pouchet L and Sadayappan P. Characterizing and enhancing global memory data coalescing on GPUs. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization. (12-22).

/doi/10.5555/2738600.2738603

Tarakji A, Börger L and Leupers R. A comparative investigation of device-specific mechanisms for exploiting HPC accelerators. Proceedings of the 8th Workshop on General Purpose Processing using GPUs. (1-12).

https://doi.org/10.1145/2716282.2716293

Khairy M, Zahran M and Wassal A. Efficient utilization of GPGPU cache hierarchy. Proceedings of the 8th Workshop on General Purpose Processing using GPUs. (36-47).

https://doi.org/10.1145/2716282.2716291

Naik V and Kusur C. (2015). Analysis of performance enhancement on graphic processor based heterogeneous architecture: A CUDA and MATLAB experiment 2015 National Conference on Parallel Computing Technologies (PARCOMPTECH). 10.1109/PARCOMPTECH.2015.7084519. 978-1-4799-6916-6. (1-5).

http://ieeexplore.ieee.org/document/7084519/

Wu G, Greathouse J, Lyashevsky A, Jayasena N and Chiou D. (2015). GPGPU performance and power estimation using machine learning 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2015.7056063. 978-1-4799-8930-0. (564-576).

http://ieeexplore.ieee.org/document/7056063/

Tiwari D, Gupta S, Rogers J, Maxwell D, Rech P, Vazhkudai S, Oliveira D, Londo D, DeBardeleben N, Navaux P, Carro L and Bland A. (2015). Understanding GPU errors on large-scale HPC systems and the implications for system design and operation 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2015.7056044. 978-1-4799-8930-0. (331-342).

http://ieeexplore.ieee.org/document/7056044/

Fauzia N, Pouchet L and Sadayappan P. (2015). Characterizing and enhancing global memory data coalescing on GPUs 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 10.1109/CGO.2015.7054183. 978-1-4799-8161-8. (12-22).

http://ieeexplore.ieee.org/document/7054183/

Ukidave Y, Paravecino F, Yu L, Kalra C, Momeni A, Chen Z, Materise N, Daley B, Mistry P and Kaeli D. NUPAR. Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering. (253-264).

https://doi.org/10.1145/2668930.2688046

Elangovan V, Badia R and Ayguadé E. Auto-Tuning OmpSs-OpenCL Kernels Across GPU Machines. Proceedings of the 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures. (31-36).

https://doi.org/10.1145/2701310.2701316

Schaub T, Moll S, Karrenberg R and Hack S. (2015). The Impact of the SIMD Width on Control-Flow and Memory Divergence. ACM Transactions on Architecture and Code Optimization. 11:4. (1-25). Online publication date: 9-Jan-2015.

https://doi.org/10.1145/2687355

Wang Z, Grewe D and O'Boyle M. (2014). Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems. ACM Transactions on Architecture and Code Optimization. 11:4. (1-26). Online publication date: 9-Jan-2015.

https://doi.org/10.1145/2677036

Mittal S and Vetter J. (2014). A Survey of Methods for Analyzing and Improving GPU Energy Efficiency. ACM Computing Surveys. 47:2. (1-23). Online publication date: 8-Jan-2015.

https://doi.org/10.1145/2636342

Suwancharoen C and Marurngsith W. (2015). Compiler Support for Accelerating C++11 Range-Based Loops on Heterogeneous Systems. International Journal of Computer and Electrical Engineering. 10.17706/IJCEE.2015.V7.877. 7:2. (109-117).

http://www.ijcee.org/index.php?m=content&c=index&a=show&catid=73&id=990

Huang D, Xun C, Wu N, Wen M, Zhang C, Cai X and Yang Q. (2015). Enabling a Uniform OpenCL Device View for Heterogeneous Platforms. IEICE Transactions on Information and Systems. 10.1587/transinf.2014EDP7244. E98.D:4. (812-823).

https://www.jstage.jst.go.jp/article/transinf/E98.D/4/E98.D_2014EDP7244/_article

Pallipuram V, Smith M, Sarma N, Anand R, Weill E and Sapra K. (2015). Subjective versus objective. The Journal of Supercomputing. 71:1. (162-201). Online publication date: 1-Jan-2015.

https://doi.org/10.1007/s11227-014-1292-9

Juckeland G, Brantley W, Chandrasekaran S, Chapman B, Che S, Colgrove M, Feng H, Grund A, Henschel R, Hwu W, Li H, Müller M, Nagel W, Perminov M, Shelepugin P, Skadron K, Stratton J, Titov A, Wang K, van Waveren M, Whitney B, Wienke S, Xu R and Kumaran K. (2015). SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance. High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. 10.1007/978-3-319-17248-4_3. (46-67).

https://link.springer.com/10.1007/978-3-319-17248-4_3

Chen G, Wu B, Li D and Shen X. PORPLE. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. (88-100).

https://doi.org/10.1109/MICRO.2014.20

Agosta G, Barenghi A, Pelosi G and Scandale M. Towards Transparently Tackling Functionality and Performance Issues across Different OpenCL Platforms. Proceedings of the 2014 Second International Symposium on Computing and Networking. (130-136).

https://doi.org/10.1109/CANDAR.2014.53

Gao S and Chritz J. (2014). Characterization of OpenCL on a scalable FPGA architecture 2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig). 10.1109/ReConFig.2014.7032505. 978-1-4799-5944-0. (1-6).

http://ieeexplore.ieee.org/document/7032505/

Sajjapongse K, Agarwal T and Becchi M. (2014). A flexible scheduling framework for heterogeneous CPU-GPU clusters 2014 21st International Conference on High Performance Computing (HiPC). 10.1109/HiPC.2014.7116892. 978-1-4799-5976-1. (1-11).

http://ieeexplore.ieee.org/document/7116892/

Greathouse J and Daga M. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (769-780).

https://doi.org/10.1109/SC.2014.68

Shi R, Lu X, Potluri S, Hamidouche K, Zhang J and Panda D. HAND. Proceedings of the 2014 43rd International Conference on Parallel Processing. (221-230).

https://doi.org/10.1109/ICPP.2014.31

Shao Y, Reagen B, Wei G and Brooks D. (2014). Aladdin. ACM SIGARCH Computer Architecture News. 42:3. (97-108). Online publication date: 16-Oct-2014.

https://doi.org/10.1145/2678373.2665689

Jenkins J, Dinan J, Balaji P, Peterka T, Samatova N and Thakur R. Processing MPI Derived Datatypes on Noncontiguous GPU-Resident Data. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2013.234. 25:10. (2627-2637).

http://ieeexplore.ieee.org/document/6600679/

Reagen B, Adolf R, Shao Y, Wei G and Brooks D. (2014). MachSuite: Benchmarks for accelerator design and customized architectures 2014 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2014.6983050. 978-1-4799-6454-3. (110-119).

http://ieeexplore.ieee.org/document/6983050/

Wang J and Yalamanchili S. (2014). Characterization and analysis of dynamic parallelism in unstructured GPU applications 2014 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2014.6983039. 978-1-4799-6454-3. (51-60).

http://ieeexplore.ieee.org/document/6983039/

Che S. (2014). GasCL: A vertex-centric graph model for GPUs 2014 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2014.7040962. 978-1-4799-6233-4. (1-6).

http://ieeexplore.ieee.org/document/7040962/

Che S, Beckmann B and Reinhardt S. (2014). BelRed: Constructing GPGPU graph applications with software building blocks 2014 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2014.7040961. 978-1-4799-6233-4. (1-6).

http://ieeexplore.ieee.org/document/7040961/

Romero P and Idler C. (2014). Methodologies and application of machine learning algorithms to classify the performance of high performance cluster components 2014 IEEE International Conference On Cluster Computing (CLUSTER). 10.1109/CLUSTER.2014.6968669. 978-1-4799-5548-0. (400-407).

http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6968669

Mateo Lázaro J, Sánchez Navarro J, García Gil A and Edo Romero V. (2014). 3D-geological structures with digital elevation models using GPU programming. Computers & Geosciences. 10.1016/j.cageo.2014.05.014. 70. (138-146). Online publication date: 1-Sep-2014.

https://linkinghub.elsevier.com/retrieve/pii/S0098300414001411

Griessl R, Peykanu M, Hagemeyer J, Porrmann M, Krupop S, Berge M, Kiesel T and Christmann W. A Scalable Server Architecture for Next-Generation Heterogeneous Compute Clusters. Proceedings of the 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing. (146-153).

https://doi.org/10.1109/EUC.2014.29

Shen J, Varbanescu A and Sips H. Look before You Leap. Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS). (383-391).

https://doi.org/10.1109/HPCC.2014.65

Ukidave Y, Ziabari A, Mistry P, Schirner G and Kaeli D. (2014). Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms. International Journal of High Performance Computing Applications. 28:3. (319-334). Online publication date: 1-Aug-2014.

https://doi.org/10.1177/1094342014526907

Breslauer D and Galil Z. (2014). Real-Time Streaming String-Matching. ACM Transactions on Algorithms. 10:4. (1-12). Online publication date: 1-Aug-2014.

https://doi.org/10.1145/2635814

Tamizharasan P, Yadav P, Ramasubramanian N and Geetha K. (2014). Performance enhancing factors for manycore architectures: State-of-the-art 2014 International Conference on Networks & Soft Computing (ICNSC). 10.1109/CNSC.2014.6906686. 978-1-4799-3486-7. (278-283).

http://ieeexplore.ieee.org/document/6906686/

Yan X, Shi X, Wang L and Yang H. (2014). An OpenCL micro-benchmark suite for GPUs and CPUs. The Journal of Supercomputing. 69:2. (693-713). Online publication date: 1-Aug-2014.

https://doi.org/10.1007/s11227-014-1112-2

Bardsley E, Betts A, Chong N, Collingbourne P, Deligiannis P, Donaldson A, Ketema J, Liew D and Qadeer S. Engineering a Static Verification Tool for GPU Kernels. Proceedings of the 16th International Conference on Computer Aided Verification - Volume 8559. (226-242).

https://doi.org/10.1007/978-3-319-08867-9_15

Merritt A, Farooqui N, Slawinska M, Gavrilovska A, Schwan K and Gupta V. Slices. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment. (1-8).

https://doi.org/10.1145/2616498.2616531

Walters J, Younge A, Kang D, Yao K, Kang M, Crago S and Fox G. GPU Passthrough Performance. Proceedings of the 2014 IEEE International Conference on Cloud Computing. (636-643).

https://doi.org/10.1109/CLOUD.2014.90

Elangovan V, Badia R and Ayguadé E. Scalability and Parallel Execution of OmpSs-OpenCL Tasks on Heterogeneous CPU-GPU Environment. Proceedings of the 29th International Conference on Supercomputing - Volume 8488. (141-155).

https://doi.org/10.1007/978-3-319-07518-1_9

Shao Y, Reagen B, Wei G and Brooks D. Aladdin. Proceedings of the 41st annual international symposium on Computer architecture. (97-108).

/doi/10.5555/2665671.2665689

Shao Y, Reagen B, Wei G and Brooks D. (2014). Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). 10.1109/ISCA.2014.6853196. 978-1-4799-4394-4. (97-108).

http://ieeexplore.ieee.org/document/6853196/

Krommydas K, Feng W, Owaida M, Antonopoulos C and Bellas N. (2014). On the characterization of OpenCL dwarfs on fixed and reconfigurable platforms 2014 IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 10.1109/ASAP.2014.6868650. 978-1-4799-3609-0. (153-160).

http://ieeexplore.ieee.org/document/6868650/

Iparraguirre J, Balmaceda L and Mariani C. (2014). Speeded-up robust features (SURF) as a benchmark for heterogeneous computers 2014 IEEE Biennial Congress of Argentina (ARGENCON). 10.1109/ARGENCON.2014.6868545. 978-1-4799-4269-5. (519-524).

http://ieeexplore.ieee.org/document/6868545/

Younge A and Fox G. Advanced virtualization techniques for high performance cloud cyberinfrastructure. Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (583-586).

https://doi.org/10.1109/CCGrid.2014.93

Younge A, Walters J, Crago S and Fox G. Evaluating GPU Passthrough in Xen for High Performance Cloud Computing. Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. (852-859).

https://doi.org/10.1109/IPDPSW.2014.97

Che S and Skadron K. (2014). BenchFriend. International Journal of High Performance Computing Applications. 28:2. (238-250). Online publication date: 1-May-2014.

https://doi.org/10.1177/1094342013507960

Zhang D, Xu L and Howes L. Efficient parallel image clustering and search on a heterogeneous platform. Proceedings of the High Performance Computing Symposium. (1-8).

/doi/10.5555/2663510.2663527

Paul I, Ravi V, Manne S, Arora M and Yalamanchili S. (2014). Coordinated energy management in heterogeneous processors. Scientific Programming. 22:2. (93-108). Online publication date: 1-Apr-2014.

https://doi.org/10.1155/2014/210762

Alexandre F, Marques R and Paulino H. On the support of task-parallel algorithmic skeletons for multi-GPU computing. Proceedings of the 29th Annual ACM Symposium on Applied Computing. (880-885).

https://doi.org/10.1145/2554850.2555018

Boulos V, Huet S, Fristot V, Salvo L and Houzet D. (2014). Efficient implementation of data flow graphs on multi-gpu clusters. Journal of Real-Time Image Processing. 9:1. (217-232). Online publication date: 1-Mar-2014.

https://doi.org/10.1007/s11554-012-0279-0

Chong N, Donaldson A and Ketema J. (2014). A sound and complete abstraction for reasoning about parallel prefix sums. ACM SIGPLAN Notices. 49:1. (397-409). Online publication date: 13-Jan-2014.

https://doi.org/10.1145/2578855.2535882

Chong N, Donaldson A and Ketema J. A sound and complete abstraction for reasoning about parallel prefix sums. Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. (397-409).

https://doi.org/10.1145/2535838.2535882

Zhao Q, Yang H, Wei G, Luan Z and Qian D. (2014). Energy Efficiency Evaluation of Workload Execution on Intel Xeon Phi Coprocessor. Trustworthy Computing and Services. 10.1007/978-3-662-43908-1_34. (268-275).

https://link.springer.com/10.1007/978-3-662-43908-1_34

DeBardeleben N, Blanchard S, Monroe L, Romero P, Grunau D, Idler C and Wright C. (2014). GPU Behavior on a Large HPC Cluster. Euro-Par 2013: Parallel Processing Workshops. 10.1007/978-3-642-54420-0_66. (680-689).

http://link.springer.com/10.1007/978-3-642-54420-0_66

Rogers T, O'Connor M and Aamodt T. Divergence-aware warp scheduling. Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. (99-110).

https://doi.org/10.1145/2540708.2540718

Panwar L, Aji A, Meng J, Balaji P and Feng W. (2013). Online Performance Projection for Clusters with Heterogeneous GPUs 2013 International Conference on Parallel and Distributed Systems (ICPADS). 10.1109/ICPADS.2013.48. 978-1-4799-2081-5. (283-290).

http://ieeexplore.ieee.org/document/6808185/

Shen J, Fang J, Sips H and Varbanescu A. (2013). An application-centric evaluation of OpenCL on multi-core CPUs. Parallel Computing. 39:12. (834-850). Online publication date: 1-Dec-2013.

https://doi.org/10.1016/j.parco.2013.08.009

Viñas M, Bozkus Z and Fraguela B. (2013). Exploiting heterogeneous parallelism with the Heterogeneous Programming Library. Journal of Parallel and Distributed Computing. 73:12. (1627-1638). Online publication date: 1-Dec-2013.

https://doi.org/10.1016/j.jpdc.2013.07.013

Paul I, Ravi V, Manne S, Arora M and Yalamanchili S. Coordinated energy management in heterogeneous processors. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-12).

https://doi.org/10.1145/2503210.2503227

Kim S, Roy I and Talwar V. Evaluating integrated graphics processors for data center workloads. Proceedings of the Workshop on Power-Aware Computing and Systems. (1-5).

https://doi.org/10.1145/2525526.2525847

Xun C, Chen D, Lan Q and Zhang C. (2013). Efficient fine-grained shared buffer management for multiple OpenCL devices. Journal of Zhejiang University SCIENCE C. 10.1631/jzus.C1300078. 14:11. (859-872). Online publication date: 1-Nov-2013.

http://link.springer.com/10.1631/jzus.C1300078

Ji F, Lin H and Ma X. RSVM. Proceedings of the 22nd international conference on Parallel architectures and compilation techniques. (269-278).

/doi/10.5555/2523721.2523758

Ji F, Lin H and Ma X. (2013). Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2013.6618823. 978-1-4799-1018-2. (341-352).

http://ieeexplore.ieee.org/document/6618823/

Komoda T, Miwa S, Nakamura H and Maruyama N. Integrating Multi-GPU Execution in an OpenACC Compiler. Proceedings of the 2013 42nd International Conference on Parallel Processing. (260-269).

https://doi.org/10.1109/ICPP.2013.35

Reagen B, Shao Y, Wei G and Brooks D. Quantifying acceleration. Proceedings of the 2013 International Symposium on Low Power Electronics and Design. (395-400).

/doi/10.5555/2648668.2648759

Shao Y and Brooks D. Energy characterization and instruction-level energy model of Intel's Xeon Phi processor. Proceedings of the 2013 International Symposium on Low Power Electronics and Design. (389-394).

/doi/10.5555/2648668.2648758

Reagen B, Shao Y, Wei G and Brooks D. (2013). Quantifying acceleration: Power/performance trade-offs of application kernels in hardware 2013 IEEE International Symposium on Low Power Electronics and Design (ISLPED). 10.1109/ISLPED.2013.6629329. 978-1-4799-1235-3. (395-400).

http://ieeexplore.ieee.org/document/6629329/

Shao Y and Brooks D. (2013). Energy characterization and instruction-level energy model of Intel's Xeon Phi processor 2013 IEEE International Symposium on Low Power Electronics and Design (ISLPED). 10.1109/ISLPED.2013.6629328. 978-1-4799-1235-3. (389-394).

http://ieeexplore.ieee.org/document/6629328/

Che S, Beckmann B, Reinhardt S and Skadron K. (2013). Pannotia: Understanding irregular GPGPU graph applications 2013 IEEE International Symposium on Workload Characterization (IISWC). 10.1109/IISWC.2013.6704684. 978-1-4799-0553-9. (185-195).

https://ieeexplore.ieee.org/document/6704684/

Young J, Shon S, Yalamanchili S, Merritt A, Schwan K and Froning H. (2013). Oncilla: A GAS runtime for efficient resource allocation and data movement in accelerated clusters 2013 IEEE International Conference on Cluster Computing (CLUSTER). 10.1109/CLUSTER.2013.6702679. 978-1-4799-0898-1. (1-8).

http://ieeexplore.ieee.org/document/6702679/

Expósito R, Taboada G, Ramos S, Touriño J and Doallo R. (2012). General‐purpose computation on GPUs for high performance cloud computing. Concurrency and Computation: Practice and Experience. 10.1002/cpe.2845. 25:12. (1628-1642). Online publication date: 25-Aug-2013.

https://onlinelibrary.wiley.com/doi/10.1002/cpe.2845

Grasso I, Kofler K, Cosenza B and Fahringer T. (2013). Automatic problem size sensitive task partitioning on heterogeneous parallel systems. ACM SIGPLAN Notices. 48:8. (281-282). Online publication date: 23-Aug-2013.

https://doi.org/10.1145/2517327.2442545

Wu B, Zhao Z, Zhang E, Jiang Y and Shen X. (2013). Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU. ACM SIGPLAN Notices. 48:8. (57-68). Online publication date: 23-Aug-2013.

https://doi.org/10.1145/2517327.2442523

Defour D and Petit E. (2013). GPUburn: A system to test and mitigate GPU hardware failures 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII). 10.1109/SAMOS.2013.6621133. 978-1-4799-0103-6. (263-270).

http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6621133

Kofler K, Grasso I, Cosenza B and Fahringer T. An automatic input-sensitive approach for heterogeneous task partitioning. Proceedings of the 27th international ACM conference on supercomputing. (149-160).

https://doi.org/10.1145/2464996.2465007

Ukidave Y and Kaeli D. Analyzing Optimization Techniques for Power Efficiency on Heterogeneous Platforms. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. (1040-1049).

https://doi.org/10.1109/IPDPSW.2013.220

Song S, Su C, Rountree B and Cameron K. A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. (673-686).

https://doi.org/10.1109/IPDPS.2013.73

Wu J and Hong B. Collocating CPU-only jobs with GPU-assisted jobs on GPU-assisted HPC. Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (418-425).

https://doi.org/10.1109/CCGrid.2013.19

Ukidave Y, Ziabari A, Mistry P, Schirner G and Kaeli D. (2013). Quantifying the energy efficiency of FFT on heterogeneous platforms 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10.1109/ISPASS.2013.6557174. 978-1-4673-5779-1. (235-244).

http://ieeexplore.ieee.org/document/6557174/

Docampo J, Ramos S, Taboada G, Expósito R, Touriño J and Doallo R. Evaluation of Java for General Purpose GPU Computing. Proceedings of the 2013 27th International Conference on Advanced Information Networking and Applications Workshops. (1398-1404).

https://doi.org/10.1109/WAINA.2013.234

Shih C, Chen Y, Chen J and Chang N. Virtual Cloud Core. Proceedings of the 2013 IEEE Seventh International Symposium on Service-Oriented System Engineering. (486-493).

https://doi.org/10.1109/SOSE.2013.70

Mistry P, Ukidave Y, Schaa D and Kaeli D. Valar. Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units. (54-65).

https://doi.org/10.1145/2458523.2458529

Shen J, Fang J, Sips H and Varbanescu A. Performance Traps in OpenCL for CPUs. Proceedings of the 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. (38-45).

https://doi.org/10.1109/PDP.2013.16

Grasso I, Kofler K, Cosenza B and Fahringer T. Automatic problem size sensitive task partitioning on heterogeneous parallel systems. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming. (281-282).

https://doi.org/10.1145/2442516.2442545

O'Boyle M, Wang Z and Grewe D. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). (1-10).

https://doi.org/10.1109/CGO.2013.6494993

Ardila Y, Kawai N, Nakamura T and Tamura Y. (2013). Support tools for porting legacy applications to multicore 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC 2013). 10.1109/ASPDAC.2013.6509658. 978-1-4673-3030-5. (568-573).

http://ieeexplore.ieee.org/document/6509658/

Zhang Y, Sinclair M and Chien A. (2013). Improving Performance Portability in OpenCL Programs. Supercomputing. 10.1007/978-3-642-38750-0_11. (136-150).

https://link.springer.com/10.1007/978-3-642-38750-0_11

Wu J, Shi W and Hong B. (2013). Dynamic Kernel/Device Mapping Strategies for GPU-Assisted HPC Systems. Job Scheduling Strategies for Parallel Processing. 10.1007/978-3-642-35867-8_6. (96-113).

http://link.springer.com/10.1007/978-3-642-35867-8_6

Yan X, Shi X and Sun Q. An OpenCL Micro-Benchmark Suite for GPUs and CPUs. Proceedings of the 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies. (53-58).

https://doi.org/10.1109/PDCAT.2012.52

Williams S, Kalamkar D, Singh A, Deshpande A, Van Straalen B, Smelyanskiy M, Almgren A, Dubey P, Shalf J and Oliker L. Optimization of geometric multigrid for emerging multi- and manycore processors. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-11).

/doi/10.5555/2388996.2389126

Williams S, Kalamkar D, Singh A, Deshpande A, Van Straalen B, Smelyanskiy M, Almgren A, Dubey P, Shalf J and Oliker L. Optimization of geometric multigrid for emerging multi- and manycore processors. Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis. (1-11).

https://doi.org/10.1109/SC.2012.85

Zhou L, Clifford Chao K and Chang J. (2012). Fast polyenergetic forward projection for image formation using OpenCL on a heterogeneous parallel computing platform. Medical Physics. 10.1118/1.4758062. 39:11. (6745-6756). Online publication date: 1-Nov-2012.

https://aapm.onlinelibrary.wiley.com/doi/10.1118/1.4758062

Amrizal A, Hirasawa S, Komatsu K, Takizawa H and Kobayashi H. (2012). Improving the scalability of transparent checkpointing for GPU computing systems TENCON 2012 - 2012 IEEE Region 10 Conference. 10.1109/TENCON.2012.6412343. 978-1-4673-4824-9. (1-6).

http://ieeexplore.ieee.org/document/6412343/

Tupinamba A and Sztajnberg A. DistributedCL. Proceedings of the 2012 13th Symposium on Computing Systems. (187-193).

https://doi.org/10.1109/WSCAD-SSC.2012.36

Bureddy D, Wang H, Venkatesh A, Potluri S and Panda D. OMB-GPU. Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface. (110-120).

https://doi.org/10.1007/978-3-642-33518-1_16

Prabhakar R, Govindarajan R and Thazhuthaveetil M. CUDA-for-clusters. Proceedings of the 18th international conference on Parallel Processing. (415-426).

https://doi.org/10.1007/978-3-642-32820-6_42

Pratas F, Trancoso P, Sousa L, Stamatakis A, Shi G and Kindratenko V. (2012). Fine-grain parallelism using multi-core, Cell/BE, and GPU Systems. Parallel Computing. 38:8. (365-390). Online publication date: 1-Aug-2012.

https://doi.org/10.1016/j.parco.2011.08.002

Barrio P, Carreras C, Sierra R, Kenter T and Plessl C. (2012). Turning control flow graphs into function calls: Code generation for heterogeneous architectures 2012 International Conference on High Performance Computing & Simulation (HPCS). 10.1109/HPCSim.2012.6266973. 978-1-4673-2362-8. (559-565).

http://ieeexplore.ieee.org/document/6266973/

Ji F, Aji A, Dinan J, Buntinas D, Balaji P, Thakur R, Feng W and Ma X. DMA-Assisted, Intranode Communication in GPU Accelerated Systems. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems. (461-468).

https://doi.org/10.1109/HPCC.2012.69

Calandrini G, Gardel A, Revenga P and Lázaro J. GPU Acceleration on Embedded Devices. A Power Consumption Approach. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems. (1806-1812).

https://doi.org/10.1109/HPCC.2012.272

Wang C, Chandrasekaran S and Chapman B. An OpenMP 3.1 validation testsuite. Proceedings of the 8th international conference on OpenMP in a Heterogeneous World. (237-249).

https://doi.org/10.1007/978-3-642-30961-8_18

Jaros J. (2012). Multi-GPU island-based genetic algorithm for solving the knapsack problem 2012 IEEE Congress on Evolutionary Computation (CEC). 10.1109/CEC.2012.6256131. 978-1-4673-1509-8. (1-8).

http://ieeexplore.ieee.org/document/6256131/

Hartley T, Saule E and Çatalyürek Ü. (2012). Improving performance of adaptive component-based dataflow middleware. Parallel Computing. 38:6-7. (289-309). Online publication date: 1-Jun-2012.

https://doi.org/10.1016/j.parco.2012.03.005

Qin C and Zhan L. (2012). Parallelizing flow-accumulation calculations on graphics processing units-From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm. Computers & Geosciences. 43. (7-16). Online publication date: 1-Jun-2012.

https://doi.org/10.1016/j.cageo.2012.02.022

Nowrouzezahrai D, Simari P and Fiume E. (2012). Sparse zonal harmonic factorization for efficient SH rotation. ACM Transactions on Graphics. 31:3. (1-9). Online publication date: 31-May-2012.

https://doi.org/10.1145/2167076.2167081

Ji F, Aji A, Dinan J, Buntinas D, Balaji P, Feng W and Ma X. Efficient Intranode Communication in GPU-Accelerated Systems. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. (1838-1847).

https://doi.org/10.1109/IPDPSW.2012.227

Bozkus Z and Fraguela B. A Portable High-Productivity Approach to Program Heterogeneous Systems. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. (163-173).

https://doi.org/10.1109/IPDPSW.2012.15

Spafford K, Meredith J, Lee S, Li D, Roth P and Vetter J. The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. Proceedings of the 9th conference on Computing Frontiers. (103-112).

https://doi.org/10.1145/2212908.2212924

Unat D, Zhou J, Cui Y, Baden S and Cai X. (2012). Accelerating a 3D Finite-Difference Earthquake Simulation with a C-to-CUDA Translator. Computing in Science and Engineering. 14:3. (48-59). Online publication date: 1-May-2012.

https://doi.org/10.1109/MCSE.2012.44

Xiao S, Balaji P, Zhu Q, Thakur R, Coghlan S, Lin H, Wen G, Hong J and Feng W. (2012). VOCL: An optimized environment for transparent virtualization of graphics processing units 2012 Innovative Parallel Computing (InPar). 10.1109/InPar.2012.6339609. 978-1-4673-2633-9. (1-12).

http://ieeexplore.ieee.org/document/6339609/

Stratton J, Anssari N, Rodrigues C, Sung I, Obeid N, Chang L, Liu G and Hwu W. (2012). Optimization and architecture effects on GPU computing workload performance 2012 Innovative Parallel Computing (InPar). 10.1109/InPar.2012.6339605. 978-1-4673-2633-9. (1-10).

http://ieeexplore.ieee.org/document/6339605/

Gupta K, Stuart J and Owens J. (2012). A study of Persistent Threads style GPU programming for GPGPU workloads 2012 Innovative Parallel Computing (InPar). 10.1109/InPar.2012.6339596. 978-1-4673-2633-9. (1-14).

http://ieeexplore.ieee.org/document/6339596/

Braithwaite R, Feng W and McCormick P. Automatic NUMA characterization using Cbench. Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering. (295-298).

https://doi.org/10.1145/2188286.2188342

Jaros J and Pospichal P. A fair comparison of modern CPUs and GPUs running the genetic algorithm under the knapsack benchmark. Proceedings of the 2012 European conference on Applications of Evolutionary Computation. (426-435).

https://doi.org/10.1007/978-3-642-29178-4_43

Miyoshi T, Irie H, Shima K, Honda H, Kondo M and Yoshinaga T. FLAT. Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units. (20-29).

https://doi.org/10.1145/2159430.2159433

Jaros J, Treeby B and Rendell A. Use of multiple GPUs on shared memory multiprocessors for ultrasound propagation simulations. Proceedings of the Tenth Australasian Symposium on Parallel and Distributed Computing - Volume 127. (43-52).

/doi/10.5555/2523685.2523691

Pereira K, Athanas P, Lin H and Feng W. Spectral Method Characterization on FPGA and GPU Accelerators. Proceedings of the 2011 International Conference on Reconfigurable Computing and FPGAs. (487-492).

https://doi.org/10.1109/ReConFig.2011.83

Madduri K, Ibrahim K, Williams S, Im E, Ethier S, Shalf J and Oliker L. Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

https://doi.org/10.1145/2063384.2063415

Zhang Y, Peng L, Li B, Peir J and Chen J. Architecture comparisons between Nvidia and ATI GPUs. Proceedings of the 2011 IEEE International Symposium on Workload Characterization. (205-215).

https://doi.org/10.1109/IISWC.2011.6114180

Seo S, Jo G and Lee J. Performance characterization of the NAS Parallel Benchmarks in OpenCL. Proceedings of the 2011 IEEE International Symposium on Workload Characterization. (137-148).

https://doi.org/10.1109/IISWC.2011.6114174

Wang H, Potluri S, Luo M, Singh A, Ouyang X, Sur S and Panda D. Optimized Non-contiguous MPI Datatype Communication for GPU Clusters. Proceedings of the 2011 IEEE International Conference on Cluster Computing. (308-316).

https://doi.org/10.1109/CLUSTER.2011.42

Malony A, Biersdorff S, Shende S, Jagode H, Tomov S, Juckeland G, Dietrich R, Poole D and Lamb C. Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs. Proceedings of the 2011 International Conference on Parallel Processing. (176-185).

https://doi.org/10.1109/ICPP.2011.71

Fang J, Varbanescu A and Sips H. A Comprehensive Performance Comparison of CUDA and OpenCL. Proceedings of the 2011 International Conference on Parallel Processing. (216-225).

https://doi.org/10.1109/ICPP.2011.45

Meredith J, Roth P, Spafford K and Vetter J. (2011). Performance Implications of Nonuniform Device Topologies in Scalable Heterogeneous Architectures. IEEE Micro. 31:5. (66-75). Online publication date: 1-Sep-2011.

https://doi.org/10.1109/MM.2011.79

Vetter J, Glassbrook R, Dongarra J, Schwan K, Loftis B, McNally S, Meredith J, Rogers J, Roth P, Spafford K and Yalamanchili S. (2011). Keeneland. Computing in Science and Engineering. 13:5. (90-95). Online publication date: 1-Sep-2011.

https://doi.org/10.1109/MCSE.2011.83

Thoman P, Kofler K, Studt H, Thomson J and Fahringer T. Automatic OpenCL device characterization. Proceedings of the 17th international conference on Parallel processing - Volume Part II. (438-452).

https://dl.acm.org/doi/10.5555/2033408.2033459

Daga M, Aji A and Feng W. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing. (141-149).

https://doi.org/10.1109/SAAHPC.2011.29

Takizawa H, Koyama K, Sato K, Komatsu K and Kobayashi H. CheCL. Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium. (864-876).

https://doi.org/10.1109/IPDPS.2011.85

Grewe D and O'Boyle M. A static task partitioning approach for heterogeneous systems using OpenCL. Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software. (286-305).

https://dl.acm.org/doi/10.5555/1987237.1987259

Spafford K, Meredith J and Vetter J. Quantifying NUMA and contention effects in multi-GPU systems. Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units. (1-7).

https://doi.org/10.1145/1964179.1964194

Karantasis K and Polychronopoulos E. Programming GPU Clusters with Shared Memory Abstraction in Software. Proceedings of the 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing. (223-230).

https://doi.org/10.1109/PDP.2011.91

Grewe D and O’Boyle M. (2011). A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL. Compiler Construction. 10.1007/978-3-642-19861-8_16. (286-305).

http://link.springer.com/10.1007/978-3-642-19861-8_16

Che S, Sheaffer J, Boyer M, Szafaryn L, Wang L and Skadron K. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'10). (1-11).

https://doi.org/10.1109/IISWC.2010.5650274

Hartley T, Saule E and Catalyurek U. (2010). Automatic dataflow application tuning for heterogeneous systems. 2010 International Conference on High Performance Computing (HiPC). 10.1109/HIPC.2010.5713173. 978-1-4244-8518-5. (1-10).

http://ieeexplore.ieee.org/document/5713173/

Jurecko M, Kocisova J, Jr. J, Kasanicky T, Domiter M and Zvada M. Evaluation Framework for GPU Performance Based on OpenCL Standard. Proceedings of the 2010 First International Conference on Networking and Computing. (256-261).

https://doi.org/10.1109/IC-NC.2010.32

Barak A, Ben-Nun T, Levy E and Shiloh A. (2010). A package for OpenCL based heterogeneous computing on clusters with many GPU devices. 2010 IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS). 10.1109/CLUSTERWKSP.2010.5613086. 978-1-4244-8395-2. (1-7).

http://ieeexplore.ieee.org/document/5613086/

Spafford K, Meredith J and Vetter J. Maestro. Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II. (275-286).

https://dl.acm.org/doi/10.5555/1885276.1885305

Malony A, Biersdorff S, Spear W and Mayanglambam S. An experimental approach to performance measurement of heterogeneous parallel applications using CUDA. Proceedings of the 24th ACM International Conference on Supercomputing. (127-136).

https://doi.org/10.1145/1810085.1810105

Li X, Li Z, David F, Zhou P, Zhou Y, Adve S and Kumar S. (2004). Performance directed energy management for main memory and disks. ACM SIGARCH Computer Architecture News. 32:5. (271-283). Online publication date: 1-Dec-2004.

https://doi.org/10.1145/1037947.1024425

Gomaa M, Powell M and Vijaykumar T. (2004). Heat-and-run. ACM SIGARCH Computer Architecture News. 32:5. (260-270). Online publication date: 1-Dec-2004.

https://doi.org/10.1145/1037947.1024424

Wu Q, Juang P, Martonosi M and Clark D. (2004). Formal online methods for voltage/frequency control in multiple clock domain microprocessors. ACM SIGARCH Computer Architecture News. 32:5. (248-259). Online publication date: 1-Dec-2004.

https://doi.org/10.1145/1037947.1024423

Bronevetsky G, Marques D, Pingali K, Szwed P and Schulz M. (2004). Application-level checkpointing for shared memory programs. ACM SIGARCH Computer Architecture News. 32:5. (235-247). Online publication date: 1-Dec-2004.

https://doi.org/10.1145/1037947.1024421

Smolens J, Gold B, Kim J, Falsafi B, Hoe J and Nowatzyk A. (2004). Fingerprinting. ACM SIGARCH Computer Architecture News. 32:5. (224-234). Online publication date: 1-Dec-2004.

https://doi.org/10.1145/1037947.1024420

Lowell D, Saito Y and Samberg E. (2004). Devirtualizable virtual machines enabling general, single-node, online maintenance. ACM SIGARCH Computer Architecture News. 32:5. (211-223). Online publication date: 1-Dec-2004.

https://doi.org/10.1145/1037947.1024419

Cher C, Hosking A and Vijaykumar T. (2004). Software prefetching for mark-sweep garbage collection. ACM SIGARCH Computer Architecture News. 32:5. (199-210). Online publication date: 1-Dec-2004.

https://doi.org/10.1145/1037947.1024417

Singh Umrao L and Pandey J. Performance Analysis and Optimization of Graphics Processing Unit. SSRN Electronic Journal. 10.2139/ssrn.3350249.

https://www.ssrn.com/abstract=3350249