SPHINX: Search Space-Pruning Heterogeneous Task Scheduling for Deep Neural Networks
Pages 524–533
Abstract
Given the trend toward increasingly heterogeneous AI systems and the large workloads of deep neural networks (DNNs), there is an urgent demand for model scheduling that improves execution performance on heterogeneous computing systems. However, this is very challenging because task scheduling over a high-dimensional search space is an NP-hard problem. Existing works either search naive, unsimplified spaces or oversimplify the optimization, and thus struggle to strike a balance between efficiency and optimality.
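To give a sense of the scale involved: placing n operators on d devices already admits d^n candidate assignments before any ordering decisions are made. The short Python sketch below illustrates this combinatorial growth and why structure-aware pruning pays off; it is an illustration only, and the grouping factor is a hypothetical parameter, not a quantity taken from the paper.

```python
# Illustration only: size of the device-placement search space.
def placement_space(n_ops: int, n_devices: int, group_size: int = 1) -> int:
    """Candidate placements when operators are fused into groups of
    `group_size` that must share a device (group_size=1 means no pruning)."""
    n_groups = n_ops // group_size
    return n_devices ** n_groups

print(placement_space(100, 4))                 # 4**100, about 1.6e60 placements
print(placement_space(100, 4, group_size=10))  # 4**10 = 1,048,576 after grouping
```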
To address these challenges, we propose SPHINX, a novel and efficient search space-pruning heterogeneous task scheduling engine that improves DNN execution performance. SPHINX reduces the search space by exploiting prior knowledge of model structures, and a cost model guides the search process. Combined with our novel search method, the Critical Path Genetic Algorithm (CPGA), this significantly improves search efficiency and stability, yielding higher DNN execution performance. The SPHINX engine is implemented as a Multi-Level Intermediate Representation (MLIR) dialect, making it scalable and reusable within existing AI compilers. We evaluate SPHINX with six popular deep neural networks across different heterogeneous scenarios, and the results show that it outperforms existing state-of-the-art works with up to 1.44× speedup. Compared to other search-based methods, our method converges faster and boosts search speed by 23.6–35.5×.
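The abstract does not detail CPGA's internals, so the following is only a minimal sketch of the general technique its name suggests: a genetic algorithm over device placements whose fitness comes from a cost model and whose mutation is biased toward critical-path operators. All names here (cost_model, critical_ops, the population parameters) are assumptions for illustration, not SPHINX's actual interfaces.

```python
import random

# Hypothetical sketch of a critical-path-guided genetic algorithm for
# device placement. `cost_model(placement)` is assumed to estimate
# end-to-end latency; `critical_ops` is the set of operators on the
# DAG's critical path. None of these names come from the paper.
def cpga(ops, critical_ops, devices, cost_model,
         pop_size=32, generations=200, mutation_rate=0.1):
    # A chromosome maps each operator to a device.
    def random_chromosome():
        return {op: random.choice(devices) for op in ops}

    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by estimated latency from the cost model (lower is better).
        population.sort(key=cost_model)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            # Uniform crossover over operators.
            child = {op: (a[op] if random.random() < 0.5 else b[op])
                     for op in ops}
            # Bias mutation toward critical-path operators, where a
            # better device choice shortens the schedule the most.
            for op in ops:
                rate = mutation_rate * (2.0 if op in critical_ops else 1.0)
                if random.random() < rate:
                    child[op] = random.choice(devices)
            children.append(child)
        population = survivors + children
    return min(population, key=cost_model)
```

The design intuition this sketch tries to capture is that improving device choices on the critical path shortens the makespan most directly, which concentrates the search and is consistent with the faster convergence the abstract reports.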
Publication Information
Published in: ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, Gotland, Sweden, August 12–15, 2024. Association for Computing Machinery, New York, NY, United States. 1279 pages.
Published: 12 August 2024
ISBN: 979-8-4007-1793-2
DOI: 10.1145/3673038
Copyright © 2024 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.
Acceptance rate: 91 of 313 submissions, 29%