
MalleTrain: Deep Neural Networks Training on Unfillable Supercomputer Nodes

Published: 07 May 2024
Abstract

    First-come first-served scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well suited to deep neural network (DNN) training, owing to the flexible nature of DNN training tasks, Liu et al. proposed formulating the re-scaling of DNN training tasks to fit gaps in schedules as a mixed-integer linear programming (MILP) problem, and demonstrated the potential benefits of the approach via simulation. Here we introduce MalleTrain, a system that provides the first practical implementation of this approach and, furthermore, generalizes it so that it can be used even for DNN training applications whose model information is unknown before runtime. Key to this latter innovation is a lightweight online job profiling advisor (JPA) that collects critical scalability information for DNN jobs and then uses that information to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present a detailed experimental evaluation on a supercomputer GPU cluster with several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but also improve significantly on prior results, increasing training throughput by up to 22.3% without requiring users to provide job scalability information.
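    To make the scheme concrete, the sketch below illustrates, in simplified form, the kind of decision the abstract describes: given throughput profiles collected online for each malleable training job (the role the job profiling advisor plays), assign transiently idle nodes so as to maximize aggregate training throughput. This is a minimal stand-in only, not MalleTrain's actual interface or algorithm: it replaces the MILP formulation with a greedy marginal-gain heuristic (optimal only for concave scaling curves), and all names (ThroughputProfile, allocate_idle_nodes) and numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ThroughputProfile:
    """Hypothetical per-job scaling profile, in the spirit of what an online
    profiling advisor might record: training throughput (samples/s) measured
    at a few node counts."""
    name: str
    samples_per_s: Dict[int, float] = field(default_factory=dict)

    def throughput(self, nodes: int) -> float:
        """Throughput at `nodes`; unmeasured counts fall back to the largest
        measured count not exceeding `nodes` (a crude stand-in for the
        interpolation a real profiler would do)."""
        measured = [n for n in self.samples_per_s if 0 < n <= nodes]
        return self.samples_per_s[max(measured)] if measured else 0.0

    def marginal_gain(self, nodes: int) -> float:
        """Throughput gained by giving this job one more node."""
        return self.throughput(nodes + 1) - self.throughput(nodes)


def allocate_idle_nodes(jobs, idle_nodes: int) -> Dict[str, int]:
    """Greedy marginal-gain allocation of transiently idle nodes to malleable
    jobs. A simplification of the MILP formulation mentioned in the abstract;
    greedy is optimal only when the scaling curves are concave."""
    alloc = {job.name: 0 for job in jobs}
    for _ in range(idle_nodes):
        best = max(jobs, key=lambda j: j.marginal_gain(alloc[j.name]))
        if best.marginal_gain(alloc[best.name]) <= 0:
            break  # no remaining job benefits from another node
        alloc[best.name] += 1
    return alloc


if __name__ == "__main__":
    # Example profiles and numbers are invented purely for illustration.
    jobs = [
        ThroughputProfile("nas-search", {1: 120.0, 2: 230.0, 3: 330.0, 4: 420.0}),
        ThroughputProfile("hpo-trial", {1: 300.0, 2: 480.0, 3: 560.0, 4: 600.0}),
    ]
    print(allocate_idle_nodes(jobs, idle_nodes=6))
    # -> {'nas-search': 4, 'hpo-trial': 2} with these made-up curves
```

    In the paper's setting the allocation would additionally respect the duration of each schedule gap and the cost of re-scaling running jobs, which is why an MILP (rather than this heuristic) is used.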

    References

    [1]
    2023. https://aws.amazon.com/ec2/spot/. Accessed: 2023-10-23.
    [2]
    2023. https://cloud.google.com/spot-vms. Accessed: 2023-10-23.
    [3]
    2023. https://azure.microsoft.com/en-us/products/virtual-machines/spot/. Accessed: 2023-10-23.
    [4]
    2023. https://www.alcf.anl.gov/polaris. Accessed: 2023-10-23.
    [5]
    2023. https://www.top500.org/lists/top500/2023/11/. Accessed: 2023-11-15.
    [6]
    Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, et al. 2014. Parallel programming with migratable objects: Charm++ in practice. In SC'14: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 647--658.
    [7]
    Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. ACM SIGCOMM Computer Communication Review 38, 4 (2008), 63--74.
    [8]
    Ahsan Ali, Hemant Sharma, Rajkumar Kettimuthu, Peter Kenesei, Dennis Trujillo, Antonino Miceli, Ian Foster, Ryan Coffee, Jana Thayer, and Zhengchun Liu. 2022. fairDMS: Rapid model training by data and model reuse. In 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 394--405.
    [9]
    William Allcock, Paul Rich, Yuping Fan, and Zhiling Lan. 2017. Experience and practice of batch scheduling on Leadership Supercomputers at Argonne. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1--24.
    [10]
    Selin Aslan, Zhengchun Liu, Viktor Nikitin, Tekin Bicer, Sven Leyffer, and Doga Gursoy. 2020. Distributed optimization with tunable learned priors for robust ptycho-tomography. arXiv preprint arXiv:2009.09498 (2020).
    [11]
    Sebastian Buchwald, Manuel Mohr, and Andreas Zwinkau. 2015. Malleable Invasive Applications. In Software Engineering (Workshops). 123--126.
    [12]
    Ewa Deelman, Karan Vahi, Mats Rynge, Rajiv Mayani, Rafael Ferreira da Silva, George Papadimitriou, and Miron Livny. 2019. The evolution of the Pegasus workflow management software. Computing in Science & Engineering 21, 4 (2019), 22--36.
    [13]
    Travis Desell, Kaoutar El Maghraoui, and Carlos A Varela. 2007. Malleable applications for scalable high performance computing. Cluster Computing 10, 3 (2007), 323--337.
    [14]
    Stefan Falkner, Aaron Klein, and Frank Hutter. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning. PMLR, 1437--1446.
    [15]
    Dror G Feitelson and Larry Rudolph. 1996. Toward convergence in job schedulers for parallel supercomputers. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1--26.
    [16]
    Hanhua Feng, Vishal Misra, and Dan Rubenstein. 2007. PBS: A unified priority-based scheduler. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. 203--214.
    [17]
    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
    [18]
    Matthew D Jones, Joseph P White, Martins Innus, Robert L DeLeon, Nikolay Simakov, Jeffrey T Palmer, Steven M Gallo, Thomas R Furlani, Michael Showerman, Robert Brunner, et al. 2017. Workload analysis of Blue Waters. arXiv preprint arXiv:1703.00924 (2017).
    [19]
    Julian Kates-Harbeck, Alexey Svyatkovskiy, and William Tang. 2019. Predicting disruptive instabilities in controlled fusion plasmas through deep learning. Nature 568, 7753 (2019), 526--531.
    [20]
    John Kim, William J Dally, Steve Scott, and Dennis Abts. 2008. Technology-driven, highly-scalable dragonfly topology. ACM SIGARCH Computer Architecture News 36, 3 (2008), 77--88.
    [21]
    Christopher Kleman, Shoaib Anwar, Zhengchun Liu, Jiaqi Gong, Xishi Zhu, Austin Yunker, Rajkumar Kettimuthu, and Jiaze He. 2023. Full Waveform Inversion-Based Ultrasound Computed Tomography Acceleration Using Two-Dimensional Convolutional Neural Networks. Journal of Nondestructive Evaluation, Diagnostics and Prognostics of Engineering Systems 6, 4 (2023), 041004.
    [22]
    Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-Tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. 2020. A system for massively parallel hyperparameter tuning. Proceedings of Machine Learning and Systems 2 (2020), 230--246.
    [23]
    Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
    [24]
    Zhengchun Liu, Tekin Bicer, Rajkumar Kettimuthu, and Ian Foster. 2019. Deep learning accelerated light source experiments. In IEEE/ACM Third Workshop on Deep Learning on Supercomputers. IEEE, 20--28.
    [25]
    Zhengchun Liu, Rajkumar Kettimuthu, Michael E Papka, and Ian Foster. 2023. FreeTrain: A Framework to Utilize Unused Supercomputer Nodes for Training Neural Networks. In IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 299--310.
    [26]
    Zhengchun Liu, Hemant Sharma, Jun-Sang Park, Peter Kenesei, Jonathan Almer, Rajkumar Kettimuthu, and Ian Foster. 2020. BraggNN: Fast X-ray Bragg Peak Analysis Using Deep Learning. arXiv preprint arXiv:2008.08198 (2020).
    [27]
    Ahuva W. Mu'alem and Dror G. Feitelson. 2001. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems 12, 6 (2001), 529--543.
    [28]
    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
    [29]
    Tirthak Patel, Zhengchun Liu, Rajkumar Kettimuthu, Paul Rich, William Allcock, and Devesh Tiwari. 2020. Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, 1186--1202.
    [30]
    Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R Ganger, and Eric P Xing. 2021. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In OSDI, Vol. 21. 1--18.
    [31]
    Aurick Qiao, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R Ganger, and Eric P Xing. 2020. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. arXiv preprint arXiv:2008.12260 (2020).
    [32]
    Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4780--4789.
    [33]
    Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
    [34]
    Sathish S Vadhiyar and Jack J Dongarra. 2003. SRS: A framework for developing malleable and migratable parallel applications for distributed systems. Parallel Processing Letters 13, 02 (2003), 291--312.
    [35]
    Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. 2019. NAS-Bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning. PMLR, 7105--7114.
    [36]
    Andy B Yoo, Morris A Jette, and Mark Grondona. 2003. Slurm: Simple Linux utility for resource management. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 44--60.
    [37]
    Haihang You and Hao Zhang. 2012. Comprehensive workload analysis and modeling of a petascale supercomputer. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 253--271.
    [38]
    Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. ImageNet training in minutes. In 47th International Conference on Parallel Processing. 1--10.
    [39]
    Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8697--8710.

    Published In

    ICPE '24: Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering
    May 2024
    310 pages
    ISBN: 9798400704444
    DOI: 10.1145/3629526
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deep neural network
    2. distributed deep learning training
    3. resource management
    4. scheduling
    5. supercomputer

    Qualifiers

    • Research-article

    Conference

    ICPE '24

    Acceptance Rates

    Overall Acceptance Rate 252 of 851 submissions, 30%
