
Conversation

@Xia-Weiwen
Collaborator

@Xia-Weiwen Xia-Weiwen commented Sep 26, 2025

Summary
We split the original big PR #2505 into the following smaller ones:

Test plan

pytest -sv test/quantization/quantize_/workflows/float8/test_float8_opaque_tensor.py

@pytorch-bot

pytorch-bot bot commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3075

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c26d34b with merge base 1a9b6f4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 26, 2025
@Xia-Weiwen Xia-Weiwen added the topic: new feature Use this tag if this PR adds a new feature label Sep 26, 2025
@Xia-Weiwen
Collaborator Author

CC @mingfeima for review. Thanks.

@Xia-Weiwen
Collaborator Author

Hi @mingfeima @jerryzh168 @andrewor14 Could you please review this PR? Thanks.

@Xia-Weiwen Xia-Weiwen marked this pull request as draft September 30, 2025 01:28
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review September 30, 2025 01:35
@Xia-Weiwen
Collaborator Author

Hi @mingfeima @jerryzh168 @andrewor14 Although this PR depends on #3100, could you please review it? Thanks.

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 October 14, 2025 01:59
@Xia-Weiwen
Collaborator Author

@jerryzh168 Could you please review this PR again? Thanks.

f"Shapes of input and weight do not match, input:{input_tensor.shape}, weight: {weight_tensor.shape}"
)

act_mat = input_tensor.contiguous()
Contributor

isn't this going to be slow?

Collaborator Author

On CPU, we require input tensors to be contiguous. In practice we almost always get contiguous inputs, so the reordering won't actually happen; the call here just enforces that assumption.

Contributor

if it's always contiguous I feel it might be better to do an assert here

Collaborator Author

I mean in most cases it's contiguous, so we don't need to worry about performance. But we cannot guarantee that, and if an input tensor happens not to be contiguous, the extra cost of making it contiguous is acceptable to us.

Contributor

OK sounds good
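
For reference, a minimal sketch of the pattern discussed in this thread (illustrative only, not the PR's actual code; it just demonstrates that `contiguous()` is a no-op on already-contiguous inputs):

```python
import torch

def _prepare_act_mat(input_tensor: torch.Tensor) -> torch.Tensor:
    # The CPU kernel assumes a contiguous activation. torch.Tensor.contiguous()
    # returns the tensor itself when it is already contiguous, so the common
    # case pays no copy; only a genuinely strided input triggers a reorder.
    return input_tensor.contiguous()

# An already-contiguous input is returned as-is (no copy).
x = torch.randn(4, 8)
assert _prepare_act_mat(x) is x
# A transposed (non-contiguous) view is copied into contiguous memory.
y = x.t()
assert _prepare_act_mat(y).is_contiguous()
```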

Comment on lines 231 to 246
granularity = weight_tensor.act_quant_kwargs.granularity
if isinstance(granularity, PerGroup):
group_size = granularity.group_size
if weight_tensor.block_size[1] < weight_tensor.size(-1):
# weight_tensor is also per group quantized
assert weight_tensor.block_size[1] == group_size, (
"input and weight should have the same group size but got"
f" {weight_tensor.block_size[1]} and {group_size}"
)
act_block_size = get_block_size(act_mat.shape, granularity)
act_scale = _choose_scale_float8(
act_mat,
float8_dtype=torch.float8_e4m3fn,
block_size=act_block_size,
)
act_mat = _quantize_affine_float8(act_mat, act_scale, torch.float8_e4m3fn)
Contributor

why is this not using

input_tensor = _choose_quant_func_and_quantize_tensor(

Collaborator Author

Thanks for the pointer. However, _choose_quant_func_and_quantize_tensor does the following:

    if isinstance(quant_kwargs, QuantizeTensorToFloat8Kwargs):
        return Float8Tensor.from_hp(...)

Unfortunately, Float8OpaqueTensor also uses QuantizeTensorToFloat8Kwargs, so the function cannot distinguish the two.
Besides, in the Float8Tensor implementation, the activation is quantized by Float8Tensor.from_hp into a Float8Tensor and then unwrapped to get the quantized tensor data for computation, and that part of the logic is not exposed to users. So I feel it's unnecessary to use Float8OpaqueTensor.from_hp to quantize the activation and then unwrap it; quantizing it directly with _quantize_affine_float8 looks fine.
What do you think? If you want Float8OpaqueTensor to be aligned with Float8Tensor, we may need to define a counterpart of QuantizeTensorToFloat8Kwargs for Float8OpaqueTensor so that the two can be distinguished. Thanks.

Contributor

We should add one QuantizeTensorToFloat8Kwargs-style class for each tensor subclass, I think, so we should create QuantizeTensorToOpaqueFloat8Kwargs.
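
For illustration, a rough sketch of what such a dedicated kwargs class could look like, modeled loosely on the existing QuantizeTensorToFloat8Kwargs; the field names and the dispatch helper below are hypothetical, not the actual torchao definitions:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class QuantizeTensorToOpaqueFloat8Kwargs:
    """Hypothetical kwargs container that would let the dispatch helper
    route to Float8OpaqueTensor.from_hp instead of Float8Tensor.from_hp."""

    float8_dtype: torch.dtype = torch.float8_e4m3fn
    granularity: Optional[object] = None  # e.g. PerRow() or PerGroup(group_size)


def _choose_quant_func_and_quantize_tensor_sketch(tensor, quant_kwargs):
    # Sketch of the dispatch: with a distinct kwargs type, the opaque and
    # plain float8 paths can be told apart by an isinstance check.
    if isinstance(quant_kwargs, QuantizeTensorToOpaqueFloat8Kwargs):
        return "Float8OpaqueTensor.from_hp(...)"  # placeholder
    return "Float8Tensor.from_hp(...)"  # placeholder
```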

Contributor

Yeah, it's optional; if you feel it's better not to use it, that's OK as well.

The main thing is that using it reduces duplication and makes it easier to adapt in the future.

Collaborator Author

Thanks for the suggestion. Since input tensors should be plain, we can reuse QuantizeTensorToFloat8Kwargs and _choose_quant_func_and_quantize_tensor here. I have updated this part.

@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 I have updated this PR per your comments. Could you please review again? Thanks.

return activation


def _input_activation_quant_cpu_fp8(
Contributor

is this function used?

Collaborator Author

Thanks. Removed.

return weight

elif not _fp8_mm_compat(weight):
elif float8_packing_format == Float8PackingFormat.PLAIN and not _fp8_mm_compat(
Contributor

To make these less complicated, can you lift the float8_packing_format == Float8PackingFormat.PLAIN check before line 1851?
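
Roughly what the suggested restructuring looks like, with stand-in stubs rather than the actual quant_api.py code; the point is simply to branch on the packing format once, so PLAIN-only checks such as _fp8_mm_compat stay inside the PLAIN branch:

```python
from enum import Enum


class Float8PackingFormat(Enum):
    # Stand-in for torchao's packing-format enum, for illustration only.
    PLAIN = "plain"
    OPAQUE = "opaque"


def _quantize_weight_sketch(weight, float8_packing_format, fp8_mm_compat):
    # Branch on the packing format first ...
    if float8_packing_format == Float8PackingFormat.PLAIN:
        # ... so the PLAIN-only compatibility check lives inside this branch.
        if not fp8_mm_compat(weight):
            return weight
        return ("plain", weight)  # placeholder for the plain float8 path
    elif float8_packing_format == Float8PackingFormat.OPAQUE:
        return ("opaque", weight)  # placeholder for the Float8OpaqueTensor path
    raise ValueError(f"Unsupported packing format: {float8_packing_format}")
```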

Collaborator Author

Updated. Thanks.


# Note: Tiny-GEMM kernel only uses BF16 inputs
def example_inputs(self, batch_size=1):
def example_inputs(self, batch_size=1, dtype=None):
Contributor

When do you use a dtype that's not the same as the original weight dtype?

Collaborator Author

I have removed this. Thanks.

return
device = "cpu"
m = ToyTwoLinearModel(256, 256, 256, dtype, device, bias).eval()
example_inputs = m.example_inputs(batch_size=bs, dtype=dtype)
Contributor

dtype here seems to be the same as the one the model takes, so you don't need to specify it, right?

Collaborator Author

Updated. Thanks.

@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Please review again. Thanks.

@jerryzh168 jerryzh168 merged commit ba6f428 into pytorch:main Nov 14, 2025
18 checks passed
namgyu-youn pushed a commit to namgyu-youn/ao that referenced this pull request Nov 21, 2025
…torch#3075)

* [CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight

* Update _normalize_granularity

* Update torchao/quantization/quant_api.py

* Fix CI

* remove unnecessary changes

* Refine code

* Refine code

* Refine code