
MalleTrain: Deep Neural Networks Training on Unfillable Supercomputer Nodes

Published: 07 May 2024
Abstract

    First-come first-served scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well suited to deep neural network (DNN) training, owing to the flexible nature of DNN training tasks, Liu et al. proposed formulating the re-scaling of DNN training tasks to fit gaps in schedules as a mixed-integer linear programming (MILP) problem, and demonstrated the potential benefits of the approach via simulation. Here we introduce MalleTrain, a system that provides the first practical implementation of this approach and, furthermore, generalizes it so that it can be used even for DNN training applications whose model information is unknown before runtime. Key to this latter innovation is a lightweight online job profiling advisor (JPA) that collects critical scalability information for DNN jobs and then uses that information to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present a detailed experimental evaluation on a supercomputer GPU cluster with several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but also improve significantly on prior results, increasing training throughput by up to 22.3% without requiring users to provide job scalability information.
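    To make the scheme concrete, the sketch below illustrates, in simplified form, the kind of decision the abstract describes: given throughput profiles collected online for each malleable training job (the role the job profiling advisor plays), assign transiently idle nodes so as to maximize aggregate training throughput. This is a minimal stand-in only, not MalleTrain's actual interface or algorithm: it replaces the MILP formulation with a greedy marginal-gain heuristic (optimal only for concave scaling curves), and all names (ThroughputProfile, allocate_idle_nodes) and numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ThroughputProfile:
    """Hypothetical per-job scaling profile, in the spirit of what an online
    profiling advisor might record: training throughput (samples/s) measured
    at a few node counts."""
    name: str
    samples_per_s: Dict[int, float] = field(default_factory=dict)

    def throughput(self, nodes: int) -> float:
        """Throughput at `nodes`; unmeasured counts fall back to the largest
        measured count not exceeding `nodes` (a crude stand-in for the
        interpolation a real profiler would do)."""
        measured = [n for n in self.samples_per_s if 0 < n <= nodes]
        return self.samples_per_s[max(measured)] if measured else 0.0

    def marginal_gain(self, nodes: int) -> float:
        """Throughput gained by giving this job one more node."""
        return self.throughput(nodes + 1) - self.throughput(nodes)


def allocate_idle_nodes(jobs, idle_nodes: int) -> Dict[str, int]:
    """Greedy marginal-gain allocation of transiently idle nodes to malleable
    jobs. A simplification of the MILP formulation mentioned in the abstract;
    greedy is optimal only when the scaling curves are concave."""
    alloc = {job.name: 0 for job in jobs}
    for _ in range(idle_nodes):
        best = max(jobs, key=lambda j: j.marginal_gain(alloc[j.name]))
        if best.marginal_gain(alloc[best.name]) <= 0:
            break  # no remaining job benefits from another node
        alloc[best.name] += 1
    return alloc


if __name__ == "__main__":
    # Example profiles and numbers are invented purely for illustration.
    jobs = [
        ThroughputProfile("nas-search", {1: 120.0, 2: 230.0, 3: 330.0, 4: 420.0}),
        ThroughputProfile("hpo-trial", {1: 300.0, 2: 480.0, 3: 560.0, 4: 600.0}),
    ]
    print(allocate_idle_nodes(jobs, idle_nodes=6))
    # -> {'nas-search': 4, 'hpo-trial': 2} with these made-up curves
```

    In the paper's setting the allocation would additionally respect the duration of each schedule gap and the cost of re-scaling running jobs, which is why an MILP (rather than this heuristic) is used.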

    References

    [1]
    2023. https://aws.amazon.com/ec2/spot/. Accessed: 2023-10-23.
    [2]
    2023. https://cloud.google.com/spot-vms. Accessed: 2023-10-23.
    [3]
    2023. https://azure.microsoft.com/en-us/products/virtual-machines/spot/. Accessed: 2023-10-23.
    [4]
    2023. https://www.alcf.anl.gov/polaris. Accessed: 2023-10-23.
    [5]
    2023. https://www.top500.org/lists/top500/2023/11/. Accessed: 2023-11-15.
    [6]
    Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, et al. 2014. Parallel programming with migratable objects: Charm++ in practice. In SC'14: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 647--658.
    [7]
    Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. ACM SIGCOMM Computer Communication Review 38, 4 (2008), 63--74.
    [8]
    Ahsan Ali, Hemant Sharma, Rajkumar Kettimuthu, Peter Kenesei, Dennis Trujillo, Antonino Miceli, Ian Foster, Ryan Coffee, Jana Thayer, and Zhengchun Liu. 2022. fairDMS: Rapid model training by data and model reuse. In 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 394--405.
    [9]
    William Allcock, Paul Rich, Yuping Fan, and Zhiling Lan. 2017. Experience and practice of batch scheduling on Leadership Supercomputers at Argonne. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1--24.
    [10]
    Selin Aslan, Zhengchun Liu, Viktor Nikitin, Tekin Bicer, Sven Leyffer, and Doga Gursoy. 2020. Distributed optimization with tunable learned priors for robust ptycho-tomography. arXiv preprint arXiv:2009.09498 (2020).
    [11]
    Sebastian Buchwald, Manuel Mohr, and Andreas Zwinkau. 2015. Malleable Invasive Applications. In Software Engineering (Workshops). 123--126.
    [12]
    Ewa Deelman, Karan Vahi, Mats Rynge, Rajiv Mayani, Rafael Ferreira da Silva, George Papadimitriou, and Miron Livny. 2019. The evolution of the Pegasus workflow management software. Computing in Science & Engineering 21, 4 (2019), 22--36.
    [13]
    Travis Desell, Kaoutar El Maghraoui, and Carlos A Varela. 2007. Malleable applications for scalable high performance computing. Cluster Computing 10, 3 (2007), 323--337.
    [14]
    Stefan Falkner, Aaron Klein, and Frank Hutter. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning. PMLR, 1437--1446.
    [15]
    Dror G Feitelson and Larry Rudolph. 1996. Toward convergence in job schedulers for parallel supercomputers. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1--26.
    [16]
    Hanhua Feng, Vishal Misra, and Dan Rubenstein. 2007. PBS: A unified priority-based scheduler. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. 203--214.
    [17]
    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
    [18]
    Matthew D Jones, Joseph P White, Martins Innus, Robert L DeLeon, Nikolay Simakov, Jeffrey T Palmer, Steven M Gallo, Thomas R Furlani, Michael Showerman, Robert Brunner, et al. 2017. Workload analysis of Blue Waters. arXiv preprint arXiv:1703.00924 (2017).
    [19]
    Julian Kates-Harbeck, Alexey Svyatkovskiy, and William Tang. 2019. Predicting disruptive instabilities in controlled fusion plasmas through deep learning. Nature 568, 7753 (2019), 526--531.
    [20]
    John Kim, William J Dally, Steve Scott, and Dennis Abts. 2008. Technology-driven, highly-scalable dragonfly topology. ACM SIGARCH Computer Architecture News 36, 3 (2008), 77--88.
    [21]
    Christopher Kleman, Shoaib Anwar, Zhengchun Liu, Jiaqi Gong, Xishi Zhu, Austin Yunker, Rajkumar Kettimuthu, and Jiaze He. 2023. Full Waveform Inversion-Based Ultrasound Computed Tomography Acceleration Using Two-Dimensional Convolutional Neural Networks. Journal of Nondestructive Evaluation, Diagnostics and Prognostics of Engineering Systems 6, 4 (2023), 041004.
    [22]
    Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-Tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. 2020. A system for massively parallel hyperparameter tuning. Proceedings of Machine Learning and Systems 2 (2020), 230--246.
    [23]
    Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
    [24]
    Zhengchun Liu, Tekin Bicer, Rajkumar Kettimuthu, and Ian Foster. 2019. Deep learning accelerated light source experiments. In IEEE/ACM Third Workshop on Deep Learning on Supercomputers. IEEE, 20--28.
    [25]
    Zhengchun Liu, Rajkumar Kettimuthu, Michael E Papka, and Ian Foster. 2023. FreeTrain: A Framework to Utilize Unused Supercomputer Nodes for Training Neural Networks. In IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 299--310.
    [26]
    Zhengchun Liu, Hemant Sharma, Jun-Sang Park, Peter Kenesei, Jonathan Almer, Rajkumar Kettimuthu, and Ian Foster. 2020. BraggNN: Fast X-ray Bragg Peak Analysis Using Deep Learning. arXiv preprint arXiv:2008.08198 (2020).
    [27]
    Ahuva W. Mu'alem and Dror G. Feitelson. 2001. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems 12, 6 (2001), 529--543.
    [28]
    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
    [29]
    Tirthak Patel, Zhengchun Liu, Rajkumar Kettimuthu, Paul Rich, William Allcock, and Devesh Tiwari. 2020. Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, 1186--1202.
    [30]
    Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R Ganger, and Eric P Xing. 2021. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In OSDI, Vol. 21. 1--18.
    [31]
    Aurick Qiao, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R Ganger, and Eric P Xing. 2020. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. arXiv preprint arXiv:2008.12260 (2020).
    [32]
    Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4780--4789.
    [33]
    Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
    [34]
    Sathish S Vadhiyar and Jack J Dongarra. 2003. SRS: A framework for developing malleable and migratable parallel applications for distributed systems. Parallel Processing Letters 13, 02 (2003), 291--312.
    [35]
    Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. 2019. NAS-Bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning. PMLR, 7105--7114.
    [36]
    Andy B Yoo, Morris A Jette, and Mark Grondona. 2003. Slurm: Simple Linux utility for resource management. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 44--60.
    [37]
    Haihang You and Hao Zhang. 2012. Comprehensive workload analysis and modeling of a petascale supercomputer. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 253--271.
    [38]
    Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. ImageNet training in minutes. In 47th International Conference on Parallel Processing. 1--10.
    [39]
    Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8697--8710.

    Published In

    ICPE '24: Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering
    May 2024
    310 pages
    ISBN: 9798400704444
    DOI: 10.1145/3629526
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deep neural network
    2. distributed deep learning training
    3. resource management
    4. scheduling
    5. supercomputer

    Qualifiers

    • Research-article

    Conference

    ICPE '24

    Acceptance Rates

    Overall Acceptance Rate 252 of 851 submissions, 30%
