Conversation

@wj-laskowski (Contributor) commented Jan 15, 2026

Added additional flavors for WMMA conv fwd large tensor.

The following operations are added for the FP16/BF16 data types and the NHWGCxGKYXC layout:

  • grouped_conv2d_fwd_bias_clamp
  • grouped_conv2d_fwd_clamp
  • grouped_conv3d_fwd_bias_clamp
  • grouped_conv3d_fwd_clamp

Plus a small modification to the device code to accommodate the extra tensors.
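
For context, the two flavors differ only in the output element-wise operation: clamp applies a range limit to the convolution result, while bias_clamp first adds a bias supplied through an extra D tensor. Below is a minimal sketch of the two epilogue functors, using illustrative names rather than CK's actual element_wise types:

```cpp
// Illustrative functors only; CK's actual element-wise operations live in
// ck::tensor_operation::element_wise and may differ in naming and signature.
struct ClampSketch
{
    float floor_;
    float ceil_;

    // y = clamp(x, floor_, ceil_)
    template <typename Y, typename X>
    void operator()(Y& y, const X& x) const
    {
        const float v = static_cast<float>(x);
        y = static_cast<Y>(v < floor_ ? floor_ : (v > ceil_ ? ceil_ : v));
    }
};

struct AddClampSketch
{
    float floor_;
    float ceil_;

    // y = clamp(x + bias); the bias comes in through the extra D tensor
    template <typename Y, typename X, typename B>
    void operator()(Y& y, const X& x, const B& bias) const
    {
        const float v = static_cast<float>(x) + static_cast<float>(bias);
        y = static_cast<Y>(v < floor_ ? floor_ : (v > ceil_ ? ceil_ : v));
    }
};
```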

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch from f5f3e01 to 3af62e5 on January 16, 2026 09:20
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch from 3af62e5 to 34333d6 on January 16, 2026 10:10
@wj-laskowski wj-laskowski marked this pull request as ready for review January 19, 2026 12:58
@krithalith (Contributor)

Looks good! I have a couple of comments:

  1. They did want full instance lists for these flavors, so that's good, but it seems like your Large Tensor instance lists do not contain generic instances (instances with scalarPerVector values of 1; see the sketch after this list). You can find examples of these in the XDL instance lists. I think we should add one generic instance per instance list.
  2. There is a bias clamp "large cases" test, which presumably leads to splitting in the Large Tensor implementation. Do you think it's necessary to also have "large cases" for clamp?
  3. Did you check that with only the Large Tensor instances available, we still have support for all the relevant tests?
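
For readers outside the review: a generic instance is one whose ScalarPerVector template parameters are all 1, so it imposes no vector-alignment requirement on the tensor lengths and serves as a guaranteed fallback for shapes the fast, vectorized instances reject. A schematic of the idea, with an illustrative template rather than the real CK kernel parameter list:

```cpp
// Schematic only: real CK instance lists spell out the device kernel template
// and its dozens of tuning parameters; the point here is just the trailing
// ScalarPerVector values.
template <int BlockM, int BlockN, int SrcScalarPerVectorA,
          int SrcScalarPerVectorB, int DstScalarPerVectorC>
struct DeviceConvInstanceSketch;

// Vectorized instance: fast, but only valid when the contiguous dimension is
// divisible by the vector width (e.g. C % 8 == 0).
using VectorizedInstance = DeviceConvInstanceSketch<128, 128, 8, 8, 8>;

// Generic instance: every ScalarPerVector is 1, so it accepts any tensor
// lengths and guarantees fallback coverage.
using GenericInstance = DeviceConvInstanceSketch<128, 128, 1, 1, 1>;
```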

@ErwinTerpstra (Contributor)

Looks good from my side. No comments.

@wj-laskowski (Contributor, Author)

Hi @krithalith

  1. I cherry-picked a generic instance from [WMMA grouped conv fwd large tensor bias bnorm clamp #3595](https://github.com/ROCm/composable_kernel/pull/3595) and integrated it into the clamp and bias_clamp operations now.
  2. I don't think it's necessary to have a large test for clamp considering we have one for bias_clamp.
  3. With the generic instance, we have full coverage now.

@krithalith (Contributor)

Ok great, the new generic instances look good and thanks for checking the coverage. Yes I agree that in this case another Large test is not necessary. All good on my end, I had one micro-nit about the instance list headers but approved.

krithalith previously approved these changes Jan 21, 2026
bartekxk previously approved these changes Jan 21, 2026
@bartekxk (Contributor) left a comment:

lgtm please rebase

Copilot AI left a comment:

Pull request overview

This PR adds WMMA large tensor support for grouped convolution forward operations with clamp and bias+clamp element-wise operations for FP16 and BF16 data types. The changes enable additional operation flavors that were previously commented out.

Changes:

  • Added device instance files for grouped_conv2d_fwd_clamp and grouped_conv3d_fwd_clamp with large tensor support (FP16/BF16, both regular and generic variants)
  • Added device instance files for grouped_conv2d_fwd_bias_clamp and grouped_conv3d_fwd_bias_clamp with large tensor support (FP16/BF16, both regular and generic variants)
  • Modified the device implementation to properly handle operations with and without D tensors (bias), using if constexpr guards and replacing manual struct initialization with an Emplace method
  • Added an Emplace utility method to the Array class for in-place construction with designated initializers (sketched after this list)
  • Uncommented and enabled previously disabled function declarations and calls for large tensor instances
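
A short sketch may help picture the Emplace change described above. This is an approximation under stated assumptions (a plain C-array-backed container; CK's real ck::Array and the merged method may differ), showing why in-place construction lets callers use designated initializers:

```cpp
#include <new>     // placement new
#include <utility> // std::forward

// Approximation of the idea only; CK's ck::Array and the merged Emplace may
// differ in storage, constraints, and qualifiers.
template <typename T, int N>
struct ArraySketch
{
    T data_[N];

    // Destroy the current element and construct a new one in place. Because
    // the argument is built at the call site, callers can use designated
    // initializers on aggregates (Desc below is a hypothetical aggregate):
    //   arr.Emplace(2, Desc{.offset = 16, .length = 4});
    template <typename... Args>
    T& Emplace(int i, Args&&... args)
    {
        data_[i].~T();
        return *(::new(static_cast<void*>(&data_[i]))
                     T(std::forward<Args>(args)...));
    }
};
```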

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.

Summary per file:

  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 3D convolution forward with clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_bias_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 3D convolution forward with bias+clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 2D convolution forward with clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_bias_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 2D convolution forward with bias+clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_clamp/CMakeLists.txt: registers new 3D clamp instance files in the build system
  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_bias_clamp/CMakeLists.txt: registers new 3D bias+clamp instance files in the build system
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_clamp/CMakeLists.txt: registers new 2D clamp instance files in the build system
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_bias_clamp/CMakeLists.txt: registers new 2D bias+clamp instance files in the build system
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_clamp_wmma_cshufflev3.inc: uncomments function declarations for large tensor clamp instances
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_clamp.hpp: enables calls to large tensor clamp instance functions
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_bias_clamp_wmma_cshufflev3.inc: uncomments function declarations for large tensor bias+clamp instances
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_bias_clamp.hpp: enables calls to large tensor bias+clamp instance functions
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_wmma_cshufflev3_large_tensor_instance.hpp: adds generic instance templates and element-wise operation type aliases
  • include/ck/utility/array.hpp: adds an Emplace method for in-place construction with proper object lifetime management
  • include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_d_wmma_cshuffle_v3_large_tensor.hpp: refactors initialization to use Emplace with designated initializers and adds conditional compilation guards for D tensor operations
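
On that last point, the conditional guards follow a common C++17 pattern: the bias (D tensor) code path is compiled only when the tuple of D data types is non-empty, so the plain clamp instantiation never touches bias members. A minimal sketch of that shape, assuming hypothetical names (KernelArgSketch and its members are not the merged CK code):

```cpp
#include <tuple>

// Illustrative shape only: the real kernel argument uses CK's Tuple and its
// own member names, not std::tuple or anything shown here.
template <typename DsDataTypes>
struct KernelArgSketch
{
    static constexpr int NumD = std::tuple_size<DsDataTypes>::value;

    void Init()
    {
        if constexpr(NumD > 0)
        {
            // bias_clamp path: wire up the extra D (bias) tensor pointers
            // and descriptors here.
        }
        // With NumD == 0 (plain clamp) the branch above is discarded at
        // compile time, so the no-bias instantiation never references any
        // D-tensor members.
    }
};

// bias_clamp carries one D tensor; clamp carries none.
using BiasClampArg = KernelArgSketch<std::tuple<float>>;
using ClampArg     = KernelArgSketch<std::tuple<>>;
```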


@wj-laskowski wj-laskowski dismissed stale reviews from bartekxk and krithalith via 8651b4c January 21, 2026 11:53
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch 2 times, most recently from 7475853 to 72a000f on January 21, 2026 11:57
- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accommodate extra tensors
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch from 72a000f to 3c64b49 on January 22, 2026 10:55
@wj-laskowski wj-laskowski requested a review from bartekxk January 23, 2026 07:26
@wj-laskowski wj-laskowski merged commit 81ee19b into develop Jan 23, 2026
22 checks passed
@wj-laskowski wj-laskowski deleted the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch January 23, 2026 11:19