@syed-ahmed
Contributor

I'm seeing the following error on B200 and GB200.

        y_ref.backward(g)
        y.backward(g)
        w_g_ref = m_mx[0].weight.grad
        w_g = getattr(m_mx_c, "0").weight.grad
        # TODO(future): investigate why we can't match with rtol=0 atol=0
        # after moving to torchao repo. Technically compile does not give
        # bit exactness guarantees, but there also might be a bug lurking
        # around.
>       torch.testing.assert_close(w_g_ref, w_g, atol=0.02, rtol=0.02)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 493 / 131072 (0.4%)
E       Greatest absolute difference: 10.5625 at index (23, 0) (up to 0.02 allowed)
E       Greatest relative difference: 14272.0 at index (92, 0) (up to 0.02 allowed)

test/prototype/mx_formats/test_mx_linear.py:353: AssertionError

for the following unit test:

pytest test/prototype/mx_formats/test_mx_linear.py -k test_linear_compile[ScaleCalculationMode.FLOOR-MXFP8Dim1CastKernelChoice.TRITON-False-mxfp8_cublas-hp_dtype1] -vvv
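For readers interpreting the mismatch report above: `torch.testing.assert_close` documents its check as `|actual - expected| <= atol + rtol * |expected|`. The sketch below restates that rule in plain Python (no torch needed) and works out how large the reference value would have to be for the reported absolute difference to pass. The helper names are made up for illustration, and the only inputs are the numbers printed in the report.

```python
def is_close(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Closeness rule documented for torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def min_expected_magnitude(abs_diff: float, atol: float, rtol: float) -> float:
    """Smallest |expected| for which a given absolute difference would pass.

    Derived by solving abs_diff <= atol + rtol * |expected| for |expected|.
    """
    return max(0.0, (abs_diff - atol) / rtol)


# Numbers copied from the assertion message; helper names are illustrative.
greatest_abs_diff = 10.5625
atol = rtol = 0.02
mismatched, total = 493, 131072

needed = min_expected_magnitude(greatest_abs_diff, atol, rtol)
print(f"|expected| would need to be at least {needed:.3f} to pass")
print(f"mismatched fraction: {mismatched / total:.4%}")
```

Under these tolerances the worst element could only pass if the reference gradient were in the hundreds, so the failure does not look like ordinary rounding noise, although that reading is an inference from the report rather than something the log states.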

Excluding 512 from COL_TILE_SIZE suppresses this issue. I found this by selectively including values for ROW_TILE_SIZE, COL_TILE_SIZE, num_warps, and num_stages until I isolated the config that produces the error. I'm not sure whether a COL_TILE_SIZE of 512 is unsupported on B200/GB200 or whether this is a bug in Triton.
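A rough sketch of that isolation procedure, in plain Python. Everything concrete here is an assumption for illustration: the candidate values in `SEARCH_SPACE` are made up (the real lists live in the kernel's autotune configs), and `fake_passes` stands in for actually running the unit test with a single pinned config.

```python
from itertools import product

# Hypothetical search space; NOT the values used by the real kernel.
SEARCH_SPACE = {
    "ROW_TILE_SIZE": [128, 256, 512],
    "COL_TILE_SIZE": [128, 256, 512],
    "num_warps": [4, 8],
    "num_stages": [2, 3],
}


def all_configs(space):
    """Expand the search space into one dict per parameter combination."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def find_failing(configs, passes):
    """Return every config for which the supplied check reports a failure."""
    return [cfg for cfg in configs if not passes(cfg)]


def common_settings(failing):
    """Return the key/value pairs shared by every failing config."""
    if not failing:
        return {}
    shared = set(failing[0].items())
    for cfg in failing[1:]:
        shared &= set(cfg.items())
    return dict(shared)


def fake_passes(cfg):
    # Stand-in for running the test with only this config enabled.
    # It simply mimics the observation reported in this comment.
    return cfg["COL_TILE_SIZE"] != 512


failing = find_failing(all_configs(SEARCH_SPACE), fake_passes)
print(len(failing), "failing configs; shared settings:", common_settings(failing))
```

In practice `fake_passes` would be replaced by something that restricts the autotuner to one config and runs the test, which is slow but makes the culprit setting fall out of the intersection.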

@pytorch-bot

pytorch-bot bot commented Nov 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3357

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3c7a8e8 with merge base 01374eb (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Nov 20, 2025
@syed-ahmed
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot bot added the topic: not user facing label Nov 20, 2025
@syed-ahmed
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • Facebook CLA Check

Dig deeper by viewing the pending checks on hud


Failing merge rule: superuser

@syed-ahmed moved this to In Progress in PyTorch + CUDA Nov 20, 2025
@vkuzo merged commit 5f33595 into pytorch:main Nov 24, 2025
20 checks passed
@github-project-automation bot moved this from In Progress to Done in PyTorch + CUDA Nov 24, 2025
namgyu-youn pushed a commit to namgyu-youn/ao that referenced this pull request Dec 19, 2025
* Fixes accuracy error for mxfp8 linear

* Adds todo