All Research

Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

Published: 5 May 2024

As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into two categories: Diffusion-based and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that …


url: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=IR0yJB8AAAAJ&sortby=pubdate&citation_for_view=IR0yJB8AAAAJ:GWiaReNCd0YC
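
The execution-time breakdown described in the abstract comes from operator-level profiling. Below is a minimal sketch of how such a characterization can be reproduced with PyTorch's built-in profiler; the `TinyDiffusionBlock` is a hypothetical stand-in for a Diffusion-based TTI model, not the paper's actual models or harness.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

class TinyDiffusionBlock(nn.Module):
    """Hypothetical stand-in for one UNet block: convolutions plus a linear projection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        # Move channels last so the Linear layer applies per spatial location.
        return self.proj(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

model = TinyDiffusionBlock().eval()
latents = torch.randn(1, 64, 32, 32)  # mock latent input (batch, channels, H, W)

# Profile one forward pass and report per-operator self time; on a real
# Diffusion-based TTI model, this is the kind of view in which Convolution
# shows up as the dominant operator once attention is optimized.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(latents)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The per-operator table this prints is the same kind of evidence the characterization relies on: summing self time across `aten::conv2d` versus `aten::linear` rows shows which operator family dominates a given architecture.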

Harvard Innovation Labs
125 Western Ave
Boston, MA 02163

© Copyright 2025 Stochastic. All rights reserved.
