
Conversation

@jerryzh168 (Contributor) commented Nov 11, 2025

Summary:
We added this to float8_inference_roofline to reuse code, but we haven't enabled the roofline feature for conv yet. For now we just need the end-to-end (e2e) speedup of a single conv2d/conv3d against bf16 to understand the expected speedup.

This introduces some if/else branches for now; hopefully we can remove some of them once more roofline support is added for conv2d/conv3d.

Also added the B200 hardware spec.
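
For context, a rough sketch of the two pieces mentioned above: the hardware spec is just another entry keyed by GPU name, and conv2d/conv3d take a separate code path until roofline support covers them. Field names, placeholder values, and function names below are illustrative assumptions, not the actual code in this PR.

# Illustrative sketch only; the real field names, B200 numbers, and dispatch
# live in float8_inference_roofline and the roofline utils, not here.
from typing import Callable

B200_SPEC = {
    "bf16_peak_tops": None,         # dense bf16 tensor-core throughput (placeholder)
    "fp8_peak_tops": None,          # dense fp8 tensor-core throughput (placeholder)
    "peak_mem_bw_bytes_sec": None,  # HBM3e bandwidth (placeholder)
}

def pick_path(op_name: str, gemm_path: Callable, conv_path: Callable) -> Callable:
    # The extra if/else branching: conv2d/conv3d only get e2e timings for now,
    # while gemm keeps the full roofline estimate.
    if op_name in ("conv2d", "conv3d"):
        return conv_path
    return gemm_path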

Test Plan:

python float8_inference_roofline.py ~/local/tmp/test.csv \
    --recipe_name tensorwise \
    --shape_gen_name custom \
    --M 1 --K 160 --N 160 \
    --D 3 --H 50 --W 50 \
    --kernel_size 3 \
    --op_name conv3d

Requires torch nightly and fbgemm_gpu_genai nightly:

pip install --pre torch fbgemm_gpu_genai --index-url https://download.pytorch.org/whl/nightly/cu129
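
For intuition, the bf16 side of the e2e measurement is conceptually just timing a single conv forward pass. The sketch below is a minimal stand-alone approximation using the shapes from the command above; the M/K/N to batch/in_channels/out_channels mapping and the padding choice are assumptions here, not something stated in the PR, and the real script also runs the fp8-quantized counterpart.

# Minimal sketch of the kind of timing reported in the b_bf16_e2e_s column.
# Assumes M=batch, K=in_channels, N=out_channels; requires a CUDA device.
import torch
from torch.utils import benchmark

M, K, N = 1, 160, 160
D, H, W = 3, 50, 50
kernel_size = 3

conv = torch.nn.Conv3d(K, N, kernel_size, padding=1, device="cuda", dtype=torch.bfloat16)
x = torch.randn(M, K, D, H, W, device="cuda", dtype=torch.bfloat16)

t = benchmark.Timer(stmt="conv(x)", globals={"conv": conv, "x": x})
print(t.timeit(100).median)  # seconds per forward pass, analogous to b_bf16_e2e_s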

Sample results:

,fwd_M,fwd_K,fwd_N,D,H,W,kernel_size,r_bf16_gemm_s,r_fp8_gemm_s,r_fp8_ovhd_s,r_fp8_gemm_and_ovhd_s,r_fp8_gemm_and_ovhd_spdp,b_bf16_gemm_s,b_fp8_gemm_s,b_bf16_e2e_s,b_fp8_e2e_s,b_fp8_e2e_spdp,rb_bf16_gemm_ratio,rb_fp8_gemm_ratio
0,1,160,320,3,194,130,3,,,,,,0,0,8.248959999999995e-05,0.00011273280000000014,0.7317267024326535,-1,-1
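
Here b_fp8_e2e_spdp is the bf16-to-fp8 e2e time ratio (8.25e-05 / 1.13e-04 ≈ 0.73), so for this particular shape the fp8 path is still slower than bf16; the empty r_* columns and the -1 gemm ratios are consistent with roofline estimates not being enabled for conv yet.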

Reviewers:

Subscribers:

Tasks:

Tags:

@pytorch-bot (bot) commented Nov 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3329

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit e2d0395 with merge base e8c4d09:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Nov 11, 2025
@jerryzh168 jerryzh168 requested a review from vkuzo November 11, 2025 21:36
@jerryzh168 jerryzh168 force-pushed the add-fp8-conv-e2e-bench branch from 8f92276 to 1d92a60 on November 11, 2025 21:37
@jerryzh168 jerryzh168 added the topic: improvement label Nov 11, 2025
Summary:
As titled: we added this to float8_inference_roofline to reuse code, but we haven't enabled the roofline feature. For now we just need the e2e speedup time for a single conv2d/conv3d against bf16 to understand the speedup expectation.

Also added B200 hardware spec.

Test Plan:
python $SCRIPT_PATH $OUTPUT_FILE \
    --recipe_name $RECIPE_NAME \
    --shape_gen_name $SHAPE_GEN_NAME \
    --M $M --K $K --N $N  \
    --D $D --H $H --W $W  \
    --kernel_size $kernel_size \
    --op_name conv3d

This doesn't run yet because OSS fbgemm can't be installed on the B200 machine.

Reviewers:

Subscribers:

Tasks:

Tags:
@jerryzh168 jerryzh168 force-pushed the add-fp8-conv-e2e-bench branch from 1d92a60 to e2d0395 on November 11, 2025 23:12
@jerryzh168 jerryzh168 merged commit 0ed9cce into pytorch:main Nov 12, 2025
14 of 18 checks passed
jainapurva pushed a commit that referenced this pull request Nov 13, 2025
namgyu-youn pushed a commit to namgyu-youn/ao that referenced this pull request Nov 21, 2025