Stochastic

Stochastic Research

Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models

As cutting-edge large language models (LLMs) continue to transform various industries, their fast-growing model size and sequence length have led to memory traffic and capacity challenges…

Dec 15th, 2024

JointNF: Enhancing DNN Performance through Adaptive N:M Pruning Across both Weight and Activation

Balancing accuracy and hardware efficiency remains a challenge with traditional pruning methods. N:M sparsity is a recent approach offering a compromise, allowing up to N non-zero weights…

Aug 5th, 2024

MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs…

June 29th, 2024

Is Flash Attention Stable?

Training large-scale machine learning models poses distinct system challenges, given both the size and complexity of today's workloads. Recently, many organizations…

May 5th, 2024

Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal…

May 5th, 2024

Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation

We introduce a method that dramatically reduces fine-tuning VRAM requirements and rectifies quantization errors in quantized Large Language Models…

June 13th, 2023

Increasing GPU Utilization During Generative Inference for Higher Throughput

Generating texts with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters…

June 09th, 2023

Neural Architecture Search for Quantized Transformer Models

While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry…

Sep 25th, 2022

A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference With Bayesian Sound Source Separation and Attention-Based DNNs

The proliferation of personal artificial intelligence (AI)-assistant technologies with speech-based conversational AI interfaces is driving the exponential growth in the consumer…

June 06th, 2022

CoopMC: Algorithm-Architecture Co-Optimization for Markov Chain Monte Carlo Accelerators

A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and…
