Fang J, Huang C, Tang T and Wang Z. (2020). Parallel programming models for heterogeneous many-cores: a comprehensive survey. CCF Transactions on High Performance Computing. 10.1007/s42514-020-00039-4. 2:4. (382-400). Online publication date: 1-Dec-2020.

https://link.springer.com/10.1007/s42514-020-00039-4

Gangwon Jo, Jeongho Nah, Jun Lee, Jungwon Kim and Jaejin Lee. (2015). Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes. IEEE Transactions on Parallel and Distributed Systems. 26:7. (1814-1825). Online publication date: 1-Jul-2015.

https://doi.org/10.1109/TPDS.2014.2321742

Rohr D and Lindenstruth V. A Flexible and Portable Large-Scale DGEMM Library for Linpack on Next-Generation Multi-GPU Systems. Proceedings of the 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. (664-668).

https://doi.org/10.1109/PDP.2015.89

Pinto V and Maillard N. (2012). Work Stealing on Hybrid Architectures. 2012 13th Symposium on Computer Systems - XIII Simpósio de Sistemas Computacionais (WSCAD-SSC). 10.1109/WSCAD-SSC.2012.28. 978-1-4673-4468-5. (17-24).

http://ieeexplore.ieee.org/document/6391759/

Wang F, Yang C, Du Y, Chen J, Yi H and Xu W. (2011). Optimizing linpack benchmark on GPU-accelerated petascale supercomputer. Journal of Computer Science and Technology. 26:5. (854-865). Online publication date: 1-Sep-2011.

https://doi.org/10.1007/s11390-011-0184-1

Yang C, Wang F, Du Y, Chen J, Liu J, Yi H and Lu K. Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing. Proceedings of the 2010 IEEE International Conference on Cluster Computing. (19-28).

https://doi.org/10.1109/CLUSTER.2010.12

Endo T, Matsuoka S, Nukada A and Maruyama N. (2010). Linpack evaluation on a supercomputer with heterogeneous accelerators. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). 10.1109/IPDPS.2010.5470353. 978-1-4244-6442-5. (1-8).

http://ieeexplore.ieee.org/document/5470353/

Brodtkorb A, Dyken C, Hagen T, Hjelmervik J and Storaasli O. (2010). State-of-the-art in heterogeneous computing. Scientific Programming. 18:1. (1-33). Online publication date: 1-Jan-2010.

https://doi.org/10.1155/2010/540159

Jaejin Lee, Jun Lee, Sangmin Seo, Jungwon Kim, Seungkyun Kim and Sura Z. (2010). COMIC++: A software SVM system for heterogeneous multicore accelerator clusters. 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2010.5416633. 978-1-4244-5658-1. (1-12).

http://ieeexplore.ieee.org/document/5416633/

Sarkar V, Harrod W and Snavely A. (2009). Software challenges in extreme scale systems. Journal of Physics: Conference Series. 10.1088/1742-6596/180/1/012045. 180. (012045). Online publication date: 1-Jul-2009.

https://iopscience.iop.org/article/10.1088/1742-6596/180/1/012045

Kistler M, Gunnels J, Brokenshire D and Benton B. (2009). Programming the Linpack benchmark for the IBM PowerXCell 8i processor. Scientific Programming. 17:1-2. (43-57). Online publication date: 1-Jan-2009.

https://doi.org/10.1155/2009/401691

Chalmers N, Kurzak J, Mcdougall D and Bauman P. Optimizing High-Performance Linpack for Exascale Accelerated Architectures. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

https://doi.org/10.1145/3581784.3607066

Kim J, Kwon H, Kang J, Park J, Lee S and Lee J. SnuHPL. Proceedings of the 36th ACM International Conference on Supercomputing. (1-12).

https://doi.org/10.1145/3524059.3532370

Ham J, Peter Cho Y, Kim J, Lyuh C, Kim J, Han J and Kwon Y. HPC LINPACK Parameter Optimization on Homo-/Heterogeneous System of ARM Neoverse N1SDP. The International Conference on High Performance Computing in Asia-Pacific Region. (139-143).

https://doi.org/10.1145/3432261.3439864

Chen C, Yang W, Wang F, Zhao D, Liu Y, Deng L and Yang C. Reverse Offload Programming on Heterogeneous Systems. IEEE Access. 10.1109/ACCESS.2019.2891740. 7. (10787-10797).

https://ieeexplore.ieee.org/document/8606083/

Yang C, Chen C, Tang T, Chen X, Fang J and Xue J. (2016). An Energy-Efficient Implementation of LU Factorization on Heterogeneous Systems. 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS). 10.1109/ICPADS.2016.0130. 978-1-5090-4457-3. (971-979).

http://ieeexplore.ieee.org/document/7823845/

Rohr D, Cuveland J and Lindenstruth V. (2016). A Model for Weak Scaling to Many GPUs at the Basis of the Linpack Benchmark. 2016 IEEE International Conference on Cluster Computing (CLUSTER). 10.1109/CLUSTER.2016.15. 978-1-5090-3653-0. (192-202).

http://ieeexplore.ieee.org/document/7776509/

Kroshko A and Spiteri R. (2013). Efficient SIMD solution of multiple systems of stiff IVPs. Journal of Computational Science. 10.1016/j.jocs.2012.08.017. 4:5. (377-385). Online publication date: 1-Sep-2013.

http://linkinghub.elsevier.com/retrieve/pii/S1877750312001068
