MAD-Max beyond single-node: Enabling large machine learning model acceleration on distributed systems

Published: 29 June 2024

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructure, and incurs high operational costs. Our analysis, grounded in real-world large-model training on datacenter-scale infrastructure, reveals that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce MAD-Max, an agile performance modeling framework designed to optimize parallelization strategies and facilitate hardware-software co-design. Applying MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we show potential throughput improvements of up to 2.24× for pre-training and up to 5.27× for inference.
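To make "communication with no overlapping computation" concrete, the sketch below computes the exposed-communication share of a training step and the speedup available if all communication were hidden behind compute. It is an illustrative back-of-the-envelope model, not MAD-Max itself: the function names, the additive step-time formula, and the example timings are all assumptions made here.

```python
def exposed_comm_fraction(compute_ms: float, comm_ms: float, overlapped_ms: float) -> float:
    """Share of step time spent on communication that is NOT hidden by compute.

    Assumed model (hypothetical, for illustration only):
        step time = compute + (communication - overlapped communication)
    """
    if not 0 <= overlapped_ms <= min(compute_ms, comm_ms):
        raise ValueError("overlap must lie between 0 and min(compute, comm)")
    exposed_ms = comm_ms - overlapped_ms
    step_ms = compute_ms + exposed_ms
    return exposed_ms / step_ms


def speedup_if_fully_overlapped(compute_ms: float, comm_ms: float, overlapped_ms: float) -> float:
    """Throughput gain if communication ran entirely in parallel with compute.

    With perfect overlap the step is bounded by the slower of the two phases.
    """
    if not 0 <= overlapped_ms <= min(compute_ms, comm_ms):
        raise ValueError("overlap must lie between 0 and min(compute, comm)")
    current_ms = compute_ms + comm_ms - overlapped_ms
    ideal_ms = max(compute_ms, comm_ms)
    return current_ms / ideal_ms


# Made-up timings: 80 ms compute, 40 ms communication, 20 ms of it overlapped.
fraction = exposed_comm_fraction(80.0, 40.0, 20.0)
speedup = speedup_if_fully_overlapped(80.0, 40.0, 20.0)
print(f"exposed communication: {fraction:.0%}, ideal speedup: {speedup:.2f}x")
```

With these made-up timings, the exposed share falls inside the 14-32% range quoted above, which shows how a modest amount of unhidden communication translates into a measurable throughput gap.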


url: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=IR0yJB8AAAAJ&sortby=pubdate&citation_for_view=IR0yJB8AAAAJ:J3KpcKIlIpsC

Harvard Innovation Labs


125 Western Ave


Boston, MA 02163

© Copyright 2025 Stochastic.  All rights reserved.
