
Conversation

@vkuzo vkuzo commented Nov 10, 2025

Summary:

Purpose

This is a POC to map torchao's quantized checkpoints for Mixture-of-Experts modules to the compressed-tensors MoE path for the w8a8 rowwise quant scheme. We do this by:

  1. subclassing CompressedTensorsW8A8Fp8MoEMethod and overriding its create_weights and process_weights_after_loading methods to map from the torchao quantized tensor format to the compressed-tensors plain tensor format. This is where the "torchao -> compressed-tensors" conversion happens (a sketch of this wiring is included below). Note that for float8 w8a8 rowwise, this is a metadata-only change.
  2. adding logic to TorchAOConfig to select TorchAOWrappingCompressedTensorsW8A8Fp8MoEMethod as the quant method when appropriate.

Note that there are no changes to the compressed-tensors path due to the existing vLLM APIs already being expressive enough to do this mapping.
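
For readers unfamiliar with the vLLM quant-method plumbing, here is a minimal sketch of the wiring described in (1) and (2). The class and method names come from the description above; the import path, signatures, and the exact shape of the torchao-to-plain-tensor conversion are assumptions and may not match the actual diff.

```python
import torch

# Sketch only: import path and signatures are assumptions, not the PR's code.
from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe import (
    CompressedTensorsW8A8Fp8MoEMethod,
)


class TorchAOWrappingCompressedTensorsW8A8Fp8MoEMethod(CompressedTensorsW8A8Fp8MoEMethod):
    """Load torchao float8-rowwise MoE expert weights, then defer to the existing
    compressed-tensors w8a8 fp8 MoE method for kernel selection."""

    def create_weights(self, layer, num_experts, hidden_size,
                       intermediate_size_per_partition, params_dtype,
                       **extra_weight_attrs):
        # Register parameters so that torchao Float8Tensor checkpoint entries can
        # be loaded, instead of the plain fp8 data + scale tensors the parent registers.
        ...

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Metadata-only for float8 w8a8 rowwise: expose each torchao Float8Tensor's
        # plain fp8 data and rowwise scale in the layout the parent expects, then
        # reuse the parent's weight preparation and kernel selection.
        ...
        super().process_weights_after_loading(layer)


# In TorchAOConfig.get_quant_method (pseudocode sketch): return the wrapping method
# for FusedMoE layers when the torchao module config is float8 dynamic activation
# with rowwise (PerRow) weight scaling; otherwise fall back to the existing paths.
```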

For now, I only implemented one w8a8 scheme to demonstrate a proof of concept. In the future, the following could be done:
a. map to more schemes for w8a8
b. add mappings for w4a8, etc

Test Plan

  1. integration test for correctness on 1xH100 with Qwen/Qwen1.5-MoE-A2.7B, with experts quantized to float8 with rowwise scaling via torchao (a sketch of how such a checkpoint could be produced follows the benchmark output below)
  2. tested locally on 4xH100 with meta-llama/Llama-4-Scout-17B-16E-Instruct, with experts quantized to float8 with rowwise scaling via torchao
  3. very simple performance testing; throughput is essentially the same for the torchao checkpoint and the corresponding compressed-tensors checkpoint. Note that model loading time is higher for torchao (since it is not using safetensors), but this is not captured in the serving benchmark.
> CUDA_VISIBLE_DEVICES=4,5,6,7 vllm bench throughput --model ../pytorch_scripts/hf_torchao_vllm/data/torchao/fp8-experts-only-mnk-testing-Llama-4-Scout-17B-16E-Instruct --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts 10 --tensor-parallel-size 4 --max-model-len 2048 --gpu-memory-utilization 0.8
...
Throughput: 4.07 requests/s, 2623.30 total tokens/s, 609.79 output tokens/s
Total num prompt tokens:  4953
Total num output tokens:  1500
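
As referenced in item 1 above, a hedged sketch of how such a checkpoint could be produced with torchao; the config class and PerRow granularity appear in this PR's diff, but the filter function and the rest of the workflow (loading and saving the model) are assumptions.

```python
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)

# Rowwise (per-row) scaling for both activations and weights; the resulting config's
# granularity compares equal to [PerRow(), PerRow()], which is what this PR checks for.
config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())

# quantize_(model, config, filter_fn=is_expert_linear)  # apply to expert weights only
```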

Test Result

See above; this functionality did not exist before this PR.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a work-in-progress to add support for mapping torchao quantized MoE checkpoints to vLLM's optimized kernels. The changes primarily involve modifying torchao.py to handle FusedMoE layers and qwen2_moe.py to load Float8Tensor weights. The approach seems reasonable for the stated goal. My review includes a few suggestions to improve code robustness and remove an internal link before this can be landed.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@vkuzo vkuzo force-pushed the pr_26095 branch 3 times, most recently from 41f071f to 9d047b3 on November 11, 2025 16:05
@vkuzo vkuzo changed the title [wip|not for land] map torchao quantized checkpoints to vLLM's MoE kernels [wip] map torchao quantized checkpoints to vLLM's MoE kernels Nov 11, 2025
# TODO(before land): test other formats, and make it explicit when
# something is not supported with a nice error message.

layer.w13_weight_scale = torch.nn.Parameter(layer.w13_weight.scale)

@vkuzo (Contributor, Author) commented on the diff above:

before land: need to polish this to set requires_grad and other attrs properly
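
One hedged way to address this, assuming the scale only needs to be re-registered as an inference-only parameter; the attribute names come from the diff above, and the helper itself is hypothetical.

```python
import torch


def _register_frozen_scale(layer: torch.nn.Module, weight_name: str) -> None:
    # Re-register the torchao Float8Tensor's rowwise scale as a non-trainable
    # Parameter (requires_grad=False is the "polish" mentioned above). The .scale
    # attribute comes from the diff; everything else here is illustrative.
    weight = getattr(layer, weight_name)
    scale = torch.nn.Parameter(weight.scale, requires_grad=False)
    setattr(layer, f"{weight_name}_scale", scale)


# e.g. _register_frozen_scale(layer, "w13_weight") sets layer.w13_weight_scale
```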

@vkuzo vkuzo force-pushed the pr_26095 branch 3 times, most recently from 5f32c46 to 047c44b on November 12, 2025 20:25
@vkuzo vkuzo changed the title [wip] map torchao quantized checkpoints to vLLM's MoE kernels map torchao quantized checkpoints to vLLM's MoE kernels Nov 12, 2025
@vkuzo vkuzo changed the title map torchao quantized checkpoints to vLLM's MoE kernels POC: map torchao quantized checkpoints to vLLM's MoE kernels Nov 13, 2025
is_float8_rowwise = isinstance(
    torchao_config, Float8DynamicActivationFloat8WeightConfig
) and torchao_config.granularity == [PerRow(), PerRow()]
# Special case of float8 rowwise where the HuggingFace weight is stored

@vkuzo (Contributor, Author) commented on the diff above:

note: it's faster to preprocess the checkpoint to convert expert weights to MNK, but good to have this path for supporting naive HF format
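
A rough illustration of the slower, naive-HF-format path mentioned here; the (num_experts, K, N) versus (num_experts, N, K) layouts are assumptions about what "MNK" refers to, not taken from the diff.

```python
import torch


def hf_expert_weight_to_mnk(w: torch.Tensor) -> torch.Tensor:
    # Assumed layouts: the naive HF checkpoint stores a fused expert weight as
    # (num_experts, K, N), while the fused MoE kernels want (num_experts, N, K).
    # Transposing and materializing at load time is what makes this path slower
    # than preprocessing the checkpoint offline.
    assert w.dim() == 3
    return w.transpose(1, 2).contiguous()
```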

@kylesayrs kylesayrs left a comment

I'll call out that compressed tensors has some nice matching utilities, especially ones that might be good for matching these sets of fused weights.

There are plans to use these utils in compressed_tensors.py as well, not sure if they might be helpful here.

vkuzo commented Nov 13, 2025

> I'll call out that compressed tensors has some nice matching utilities, especially ones that might be good for matching these sets of fused weights.

Thank you! Will take a look; I'd like to reuse instead of reimplementing where it makes sense :)

@kylesayrs kylesayrs left a comment

Looks good to me! I think doing this conversion on a per-scheme basis is probably the correct approach.

You could potentially generalize this to some sort of adapter mixin for Linear and MoE layers (that way you can share conversion logic for modules which have the same parameters), but that can be thought through more later.

vkuzo commented Nov 14, 2025

> Looks good to me! I think doing this conversion on a per-scheme basis is probably the correct approach.
>
> You could potentially generalize this to some sort of adapter mixin for Linear and MoE layers (that way you can share conversion logic for modules which have the same parameters), but that can be thought through more later.

Thank you, good to hear this approach looks reasonable! Going to polish this PR a bit.
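
A loose sketch of the adapter-mixin idea from this exchange; every name below is hypothetical, and nothing here exists in vLLM or torchao today.

```python
import torch


class TorchAOToCompressedTensorsAdapterMixin:
    # Parameter names to unwrap; each concrete method class lists its own, e.g.
    # ["weight"] for a Linear method vs ["w13_weight", "w2_weight"] for an MoE method.
    _torchao_param_names: list[str] = []

    def _unwrap_torchao_params(self, layer: torch.nn.Module) -> None:
        for name in self._torchao_param_names:
            qparam = getattr(layer, name)
            # Assumption: the torchao tensor subclass exposes plain quantized data
            # (.qdata) and a scale (.scale).
            setattr(layer, name,
                    torch.nn.Parameter(qparam.qdata, requires_grad=False))
            setattr(layer, f"{name}_scale",
                    torch.nn.Parameter(qparam.scale, requires_grad=False))
```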

@vkuzo vkuzo force-pushed the pr_26095 branch 2 times, most recently from d58f788 to 96c59da on November 20, 2025 18:48
…rnels

Summary:

This is not ready for review yet; for now, just hacking to understand
quantization and MoEs in vLLM.

High level, I want to map torchao generated MoE checkpoints to vLLM
optimized fused kernels. Note that I **do not** want torchao to take
over the kernels here, instead torchao just provides the checkpoint and
there is glue code to let vLLM select the kernels.

TODO iterate some more and write down the design in more detail

Test Plan:

Tested locally with Qwen1.5-MoE-A2.7B with experts quantized to float8
with rowwise scaling.

Reviewers:

Subscribers:

Tasks:

Tags:

Signed-off-by: vasiliy <[email protected]>

Signed-off-by:  <[email protected]>
@vkuzo vkuzo changed the title POC: map torchao quantized checkpoints to vLLM's MoE kernels map torchao quantized checkpoints to vLLM's MoE kernels Nov 20, 2025
vkuzo added a commit to pytorch/ao that referenced this pull request Dec 1, 2025
Summary:

Adds an e2e example of how to use torchao to quantize LLaMa 4 Scout.

Note that this needs:
* a recent `transformers` version (higher than 4.57, not officially
  released yet so user needs to build from source)
* a recent `fbgemm_gpu` version nightly from `2025.11.22` or after
* to run this in vLLM, vllm-project/vllm#28421
  is needed (not yet landed).

Test Plan:

```bash
with-proxy time python examples/quantize_llama_4.py ~/local/tmp/20251201_test/
```

Reviewers:

Subscribers:

Tasks:

Tags:
ghstack-source-id: 3c47130
ghstack-comment-id: 3599037297
Pull-Request: #3408
vkuzo added a commit to pytorch/ao that referenced this pull request Dec 2, 2025
@vkuzo vkuzo closed this Jan 13, 2026

Labels

qwen Related to Qwen models
