
Conversation

@jerryzh168 (Contributor) commented Nov 11, 2025

Summary:
We added this to float8_inference_roofline to reuse code, but we haven't enabled the roofline feature for conv yet. For now we just need the end-to-end (e2e) speedup of a single conv2d/conv3d against bf16 to understand the expected speedup.

This introduces some if/else branches for now; hopefully we can remove some of them once more roofline support is added for conv2d/conv3d.

Also added the B200 hardware spec.
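
For context, a rough sketch of the two pieces mentioned above: the hardware spec is just another entry keyed by GPU name, and conv2d/conv3d take a separate code path until roofline support covers them. Field names, placeholder values, and function names below are illustrative assumptions, not the actual code in this PR.

# Illustrative sketch only; the real field names, B200 numbers, and dispatch
# live in float8_inference_roofline and the roofline utils, not here.
from typing import Callable

B200_SPEC = {
    "bf16_peak_tops": None,         # dense bf16 tensor-core throughput (placeholder)
    "fp8_peak_tops": None,          # dense fp8 tensor-core throughput (placeholder)
    "peak_mem_bw_bytes_sec": None,  # HBM3e bandwidth (placeholder)
}

def pick_path(op_name: str, gemm_path: Callable, conv_path: Callable) -> Callable:
    # The extra if/else branching: conv2d/conv3d only get e2e timings for now,
    # while gemm keeps the full roofline estimate.
    if op_name in ("conv2d", "conv3d"):
        return conv_path
    return gemm_path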

Test Plan:

python float8_inference_roofline.py ~/local/tmp/test.csv \
    --recipe_name tensorwise \
    --shape_gen_name custom \
    --M 1 --K 160 --N 160 \
    --D 3 --H 50 --W 50 \
    --kernel_size 3 \
    --op_name conv3d

Requires torch nightly and fbgemm_gpu_genai nightly:

pip install --pre torch fbgemm_gpu_genai --index-url https://download.pytorch.org/whl/nightly/cu129
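
For intuition, the bf16 side of the e2e measurement is conceptually just timing a single conv forward pass. The sketch below is a minimal stand-alone approximation using the shapes from the command above; the M/K/N to batch/in_channels/out_channels mapping and the padding choice are assumptions here, not something stated in the PR, and the real script also runs the fp8-quantized counterpart.

# Minimal sketch of the kind of timing reported in the b_bf16_e2e_s column.
# Assumes M=batch, K=in_channels, N=out_channels; requires a CUDA device.
import torch
from torch.utils import benchmark

M, K, N = 1, 160, 160
D, H, W = 3, 50, 50
kernel_size = 3

conv = torch.nn.Conv3d(K, N, kernel_size, padding=1, device="cuda", dtype=torch.bfloat16)
x = torch.randn(M, K, D, H, W, device="cuda", dtype=torch.bfloat16)

t = benchmark.Timer(stmt="conv(x)", globals={"conv": conv, "x": x})
print(t.timeit(100).median)  # seconds per forward pass, analogous to b_bf16_e2e_s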

Sample results:

,fwd_M,fwd_K,fwd_N,D,H,W,kernel_size,r_bf16_gemm_s,r_fp8_gemm_s,r_fp8_ovhd_s,r_fp8_gemm_and_ovhd_s,r_fp8_gemm_and_ovhd_spdp,b_bf16_gemm_s,b_fp8_gemm_s,b_bf16_e2e_s,b_fp8_e2e_s,b_fp8_e2e_spdp,rb_bf16_gemm_ratio,rb_fp8_gemm_ratio
0,1,160,320,3,194,130,3,,,,,,0,0,8.248959999999995e-05,0.00011273280000000014,0.7317267024326535,-1,-1
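
Here b_fp8_e2e_spdp is the bf16-to-fp8 e2e time ratio (8.25e-05 / 1.13e-04 ≈ 0.73), so for this particular shape the fp8 path is still slower than bf16; the empty r_* columns and the -1 gemm ratios are consistent with roofline estimates not being enabled for conv yet.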

Reviewers:

Subscribers:

Tasks:

Tags:

@pytorch-bot (bot) commented Nov 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3329

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit e2d0395 with merge base e8c4d09:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Nov 11, 2025
@jerryzh168 jerryzh168 requested a review from vkuzo November 11, 2025 21:36
@jerryzh168 jerryzh168 force-pushed the add-fp8-conv-e2e-bench branch from 8f92276 to 1d92a60 on November 11, 2025 21:37
@jerryzh168 jerryzh168 added the topic: improvement label Nov 11, 2025
Summary:
As titled: we added this to float8_inference_roofline to reuse code, but we haven't enabled the roofline feature. For now we just need the e2e speedup time for a single conv2d/conv3d against bf16 to understand the speedup expectation.

Also added B200 hardware spec.

Test Plan:
python $SCRIPT_PATH $OUTPUT_FILE \
    --recipe_name $RECIPE_NAME \
    --shape_gen_name $SHAPE_GEN_NAME \
    --M $M --K $K --N $N  \
    --D $D --H $H --W $W  \
    --kernel_size $kernel_size \
    --op_name conv3d

This doesn't run yet because OSS fbgemm can't be installed on the B200 machine.

Reviewers:

Subscribers:

Tasks:

Tags:
@jerryzh168 jerryzh168 force-pushed the add-fp8-conv-e2e-bench branch from 1d92a60 to e2d0395 on November 11, 2025 23:12
@jerryzh168 jerryzh168 merged commit 0ed9cce into pytorch:main Nov 12, 2025
14 of 18 checks passed
jainapurva pushed a commit that referenced this pull request Nov 13, 2025
namgyu-youn pushed a commit to namgyu-youn/ao that referenced this pull request Nov 21, 2025