[CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight #3075
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3075
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c26d34b with merge base 1a9b6f4.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
CC @mingfeima for review. Thanks.

Hi @mingfeima @jerryzh168 @andrewor14 Could you please review this PR? Thanks.
Hi @mingfeima @jerryzh168 @andrewor14 Though this PR depends on #3100, could you please review this PR? Thanks.

@jerryzh168 Could you please review this PR again? Thanks.
| f"Shapes of input and weight do not match, input:{input_tensor.shape}, weight: {weight_tensor.shape}" | ||
| ) | ||
|
|
||
| act_mat = input_tensor.contiguous() |
isn't this going to be slow?
On CPU, we require input tensors to be contiguous. In practice, inputs are almost always contiguous, so the reordering won't actually happen; the call here just enforces that assumption.
if it's always contiguous I feel it might be better to do an assert here
I mean, in most cases it's contiguous, so we don't need to worry about performance. But we cannot guarantee that, and it's acceptable for us if a non-contiguous input is slower because it has to be made contiguous.
OK sounds good
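For context, a minimal sketch of the behavior discussed in this thread (illustrative only, not the PR's exact code): .contiguous() returns the input unchanged when it is already contiguous and only copies in the rare non-contiguous case.

    import torch

    # Illustrative sketch, not the PR's code: .contiguous() is a no-op for
    # already-contiguous inputs and only copies otherwise.
    def _prepare_activation(input_tensor: torch.Tensor) -> torch.Tensor:
        return input_tensor.contiguous()

    x = torch.randn(4, 8)
    assert _prepare_activation(x) is x               # contiguous: no copy, no cost
    y = x.t()                                        # transposed view: not contiguous
    assert not y.is_contiguous()
    assert _prepare_activation(y).is_contiguous()    # copy happens only here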
    granularity = weight_tensor.act_quant_kwargs.granularity
    if isinstance(granularity, PerGroup):
        group_size = granularity.group_size
        if weight_tensor.block_size[1] < weight_tensor.size(-1):
            # weight_tensor is also per group quantized
            assert weight_tensor.block_size[1] == group_size, (
                "input and weight should have the same group size but got"
                f" {weight_tensor.block_size[1]} and {group_size}"
            )
    act_block_size = get_block_size(act_mat.shape, granularity)
    act_scale = _choose_scale_float8(
        act_mat,
        float8_dtype=torch.float8_e4m3fn,
        block_size=act_block_size,
    )
    act_mat = _quantize_affine_float8(act_mat, act_scale, torch.float8_e4m3fn)
why is this not using

    input_tensor = _choose_quant_func_and_quantize_tensor(
Thanks for the pointer. However, _choose_quant_func_and_quantize_tensor does the following:

    if isinstance(quant_kwargs, QuantizeTensorToFloat8Kwargs):
        return Float8Tensor.from_hp(...)

Unfortunately, Float8OpaqueTensor also uses QuantizeTensorToFloat8Kwargs, so it cannot distinguish between the two.
Besides, in the implementation of Float8Tensor, the activation is quantized by Float8Tensor.from_hp into a Float8Tensor and then unwrapped to get the quantized tensor data for computation, and this part of the logic is not exposed to users. So I feel it's unnecessary to use Float8OpaqueTensor.from_hp to quantize the activation and then unwrap it; quantizing it with _quantize_affine_float8 looks fine.
What do you think? If you want Float8OpaqueTensor to be aligned with Float8Tensor, we may need to define a counterpart of QuantizeTensorToFloat8Kwargs for Float8OpaqueTensor so that we can distinguish them. Thanks.
we should add one QuantizeTensorToFloat8Kwargs-style class for each tensor type I think, so we should create QuantizeTensorToOpaqueFloat8Kwargs
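A rough sketch of what this suggestion could look like (hypothetical, simplified stand-ins for the torchao classes; only the type-based dispatch pattern is the point):

    from dataclasses import dataclass

    import torch

    # Hypothetical stand-ins; the real torchao kwargs classes carry more fields.
    @dataclass
    class QuantizeTensorToFloat8Kwargs:
        float8_dtype: torch.dtype = torch.float8_e4m3fn

    @dataclass
    class QuantizeTensorToOpaqueFloat8Kwargs(QuantizeTensorToFloat8Kwargs):
        pass  # same fields; the distinct type is what enables dispatch

    def choose_quant_path(quant_kwargs) -> str:
        # Check the more specific (opaque) type first.
        if isinstance(quant_kwargs, QuantizeTensorToOpaqueFloat8Kwargs):
            return "Float8OpaqueTensor.from_hp"
        if isinstance(quant_kwargs, QuantizeTensorToFloat8Kwargs):
            return "Float8Tensor.from_hp"
        raise NotImplementedError(type(quant_kwargs))

    print(choose_quant_path(QuantizeTensorToOpaqueFloat8Kwargs()))  # opaque CPU path
    print(choose_quant_path(QuantizeTensorToFloat8Kwargs()))        # plain path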
yeah, it's optional; if you feel it's better not to use it, that's OK as well.
The main thing is that using it reduces duplication and makes it easier to adapt in the future.
Thanks for the suggestion. Since input tensors should be plain, we can reuse QuantizeTensorToFloat8Kwargs and _choose_quant_func_and_quantize_tensor here. I have updated this part.
Hi @jerryzh168 I have updated this PR per your comments. Could you please review again? Thanks.
torchao/quantization/quant_api.py (Outdated)

        return activation


    def _input_activation_quant_cpu_fp8(
is this function used?
Thanks. Removed.
torchao/quantization/quant_api.py (Outdated)

        return weight

    -    elif not _fp8_mm_compat(weight):
    +    elif float8_packing_format == Float8PackingFormat.PLAIN and not _fp8_mm_compat(
to make these less complicated, can you lift the float8_packing_format == Float8PackingFormat.PLAIN check before line 1851?
Updated. Thanks.
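A hedged sketch of the restructuring suggested above (illustrative stand-ins, not the real quant_api.py code): branching on the packing format once, up front, keeps the PLAIN-only _fp8_mm_compat check out of every individual condition.

    from enum import Enum

    import torch

    # Stand-ins for torchao's enum and helper, just to show the control flow.
    class Float8PackingFormat(Enum):
        PLAIN = "plain"
        OPAQUE = "opaque"

    def _fp8_mm_compat(weight: torch.Tensor) -> bool:
        # stand-in: the real helper checks shape compatibility with fp8 GEMM kernels
        return weight.size(0) % 16 == 0 and weight.size(1) % 16 == 0

    def _maybe_quantize_weight(weight, float8_packing_format):
        if float8_packing_format == Float8PackingFormat.PLAIN:
            if not _fp8_mm_compat(weight):
                return weight  # fall back to the high-precision weight
            # ... construct the plain Float8Tensor here ...
            return weight
        if float8_packing_format == Float8PackingFormat.OPAQUE:
            # ... construct the CPU Float8OpaqueTensor here ...
            return weight
        raise ValueError(f"Unsupported packing format: {float8_packing_format}")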
    # Note: Tiny-GEMM kernel only uses BF16 inputs
    -    def example_inputs(self, batch_size=1):
    +    def example_inputs(self, batch_size=1, dtype=None):
when do you use the dtype that's not the same as the original weight dtype?
I have removed this. Thanks.
        return
    device = "cpu"
    m = ToyTwoLinearModel(256, 256, 256, dtype, device, bias).eval()
    example_inputs = m.example_inputs(batch_size=bs, dtype=dtype)
dtype here seems to be the same as the one taken by the model, so you don't need to specify it, right?
Updated. Thanks.
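For illustration, a minimal hypothetical version of the test helper after this change (class and argument names are assumptions, not the PR's exact code): example_inputs reuses the dtype and device of the model's own weights, so callers don't need to pass a separate dtype.

    import torch
    import torch.nn as nn

    # Hypothetical toy model mirroring the discussion above (not the PR's code).
    class ToyTwoLinearModel(nn.Module):
        def __init__(self, k, n, out, dtype=torch.bfloat16, device="cpu", bias=False):
            super().__init__()
            self.linear1 = nn.Linear(k, n, bias=bias, dtype=dtype, device=device)
            self.linear2 = nn.Linear(n, out, bias=bias, dtype=dtype, device=device)

        def example_inputs(self, batch_size=1):
            # Reuse the dtype/device of the model's own weights; no dtype argument needed.
            w = self.linear1.weight
            return (torch.randn(batch_size, w.size(1), dtype=w.dtype, device=w.device),)

        def forward(self, x):
            return self.linear2(self.linear1(x))

    m = ToyTwoLinearModel(256, 256, 256).eval()
    (x,) = m.example_inputs(batch_size=4)
    assert x.dtype == m.linear1.weight.dtype  # dtype comes from the model itself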
Hi @jerryzh168 Please review again. Thanks.
…torch#3075)
* [CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight
* Update _normalize_granularity
* Update torchao/quantization/quant_api.py
* Fix CI
* remove unnecessary changes
* Refine code
* Refine code
* Refine code
Summary
We split the original big PR #2505 into the following smaller ones:
Test plan