Conversation

@wj-laskowski (Contributor) commented Jan 15, 2026

Added additional flavors for WMMA conv fwd large tensor.

The following operations are added for the FP16/BF16 data types and the NHWGCxGKYXC layout:

  • grouped_conv2d_fwd_bias_clamp
  • grouped_conv2d_fwd_clamp
  • grouped_conv3d_fwd_bias_clamp
  • grouped_conv3d_fwd_clamp

Plus a small modification to the device code to accommodate the extra tensors.
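
For context, the two flavors differ only in the output element-wise operation: clamp applies a range limit to the convolution result, while bias_clamp first adds a bias supplied through an extra D tensor. Below is a minimal sketch of the two epilogue functors, using illustrative names rather than CK's actual element_wise types:

```cpp
// Illustrative functors only; CK's actual element-wise operations live in
// ck::tensor_operation::element_wise and may differ in naming and signature.
struct ClampSketch
{
    float floor_;
    float ceil_;

    // y = clamp(x, floor_, ceil_)
    template <typename Y, typename X>
    void operator()(Y& y, const X& x) const
    {
        const float v = static_cast<float>(x);
        y = static_cast<Y>(v < floor_ ? floor_ : (v > ceil_ ? ceil_ : v));
    }
};

struct AddClampSketch
{
    float floor_;
    float ceil_;

    // y = clamp(x + bias); the bias comes in through the extra D tensor
    template <typename Y, typename X, typename B>
    void operator()(Y& y, const X& x, const B& bias) const
    {
        const float v = static_cast<float>(x) + static_cast<float>(bias);
        y = static_cast<Y>(v < floor_ ? floor_ : (v > ceil_ ? ceil_ : v));
    }
};
```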

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch from f5f3e01 to 3af62e5 on January 16, 2026 09:20
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch from 3af62e5 to 34333d6 on January 16, 2026 10:10
@wj-laskowski wj-laskowski marked this pull request as ready for review January 19, 2026 12:58
@krithalith (Contributor)

Looks good! I have a couple of comments:

  1. They did want full instance lists for these flavors, so that's good, but it seems like your Large Tensor instance lists do not contain generic instances (instances with scalarPerVector values of 1; see the sketch after this list). You can find examples of these in the XDL instance lists. I think we should add one generic instance per instance list.
  2. There is a bias clamp "large cases" test, which presumably leads to splitting in the Large Tensor implementation. Do you think it's necessary to also have "large cases" for clamp?
  3. Did you check that with only the Large Tensor instances available, we still have support for all the relevant tests?
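
For readers outside the review: a generic instance is one whose ScalarPerVector template parameters are all 1, so it imposes no vector-alignment requirement on the tensor lengths and serves as a guaranteed fallback for shapes the fast, vectorized instances reject. A schematic of the idea, with an illustrative template rather than the real CK kernel parameter list:

```cpp
// Schematic only: real CK instance lists spell out the device kernel template
// and its dozens of tuning parameters; the point here is just the trailing
// ScalarPerVector values.
template <int BlockM, int BlockN, int SrcScalarPerVectorA,
          int SrcScalarPerVectorB, int DstScalarPerVectorC>
struct DeviceConvInstanceSketch;

// Vectorized instance: fast, but only valid when the contiguous dimension is
// divisible by the vector width (e.g. C % 8 == 0).
using VectorizedInstance = DeviceConvInstanceSketch<128, 128, 8, 8, 8>;

// Generic instance: every ScalarPerVector is 1, so it accepts any tensor
// lengths and guarantees fallback coverage.
using GenericInstance = DeviceConvInstanceSketch<128, 128, 1, 1, 1>;
```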

@ErwinTerpstra (Contributor)

Looks good from my side. No comments.

@wj-laskowski (Contributor, Author)

Hi @krithalith

  1. I cherry-picked a generic instance from [WMMA grouped conv fwd large tensor bias bnorm clamp #3595](https://github.com/ROCm/composable_kernel/pull/3595) and integrated it into the clamp and bias_clamp operations now.
  2. I don't think it's necessary to have a large test for clamp considering we have one for bias_clamp.
  3. With the generic instance, we have full coverage now.

@krithalith (Contributor)

Ok great, the new generic instances look good and thanks for checking the coverage. Yes I agree that in this case another Large test is not necessary. All good on my end, I had one micro-nit about the instance list headers but approved.

krithalith previously approved these changes Jan 21, 2026
bartekxk previously approved these changes Jan 21, 2026
@bartekxk (Contributor) left a comment:

lgtm please rebase

Copilot AI left a comment:

Pull request overview

This PR adds WMMA large tensor support for grouped convolution forward operations with clamp and bias+clamp element-wise operations for FP16 and BF16 data types. The changes enable additional operation flavors that were previously commented out.

Changes:

  • Added device instance files for grouped_conv2d_fwd_clamp and grouped_conv3d_fwd_clamp with large tensor support (FP16/BF16, both regular and generic variants)
  • Added device instance files for grouped_conv2d_fwd_bias_clamp and grouped_conv3d_fwd_bias_clamp with large tensor support (FP16/BF16, both regular and generic variants)
  • Modified the device implementation to properly handle operations with and without D tensors (bias), using if constexpr guards and replacing manual struct initialization with an Emplace method
  • Added an Emplace utility method to the Array class for in-place construction with designated initializers (sketched after this list)
  • Uncommented and enabled previously disabled function declarations and calls for large tensor instances
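
A short sketch may help picture the Emplace change described above. This is an approximation under stated assumptions (a plain C-array-backed container; CK's real ck::Array and the merged method may differ), showing why in-place construction lets callers use designated initializers:

```cpp
#include <new>     // placement new
#include <utility> // std::forward

// Approximation of the idea only; CK's ck::Array and the merged Emplace may
// differ in storage, constraints, and qualifiers.
template <typename T, int N>
struct ArraySketch
{
    T data_[N];

    // Destroy the current element and construct a new one in place. Because
    // the argument is built at the call site, callers can use designated
    // initializers on aggregates (Desc below is a hypothetical aggregate):
    //   arr.Emplace(2, Desc{.offset = 16, .length = 4});
    template <typename... Args>
    T& Emplace(int i, Args&&... args)
    {
        data_[i].~T();
        return *(::new(static_cast<void*>(&data_[i]))
                     T(std::forward<Args>(args)...));
    }
};
```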

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.

Summary per file:

  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 3D convolution forward with clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_bias_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 3D convolution forward with bias+clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 2D convolution forward with clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_bias_clamp/wmma/large_tensor/*.cpp: 4 new instance files for 2D convolution forward with bias+clamp (F16/BF16, regular/generic)
  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_clamp/CMakeLists.txt: registers new 3D clamp instance files in the build system
  • library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_bias_clamp/CMakeLists.txt: registers new 3D bias+clamp instance files in the build system
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_clamp/CMakeLists.txt: registers new 2D clamp instance files in the build system
  • library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_bias_clamp/CMakeLists.txt: registers new 2D bias+clamp instance files in the build system
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_clamp_wmma_cshufflev3.inc: uncomments function declarations for large tensor clamp instances
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_clamp.hpp: enables calls to large tensor clamp instance functions
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_bias_clamp_wmma_cshufflev3.inc: uncomments function declarations for large tensor bias+clamp instances
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_bias_clamp.hpp: enables calls to large tensor bias+clamp instance functions
  • library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_wmma_cshufflev3_large_tensor_instance.hpp: adds generic instance templates and element-wise operation type aliases
  • include/ck/utility/array.hpp: adds an Emplace method for in-place construction with proper object lifetime management
  • include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_d_wmma_cshuffle_v3_large_tensor.hpp: refactors initialization to use Emplace with designated initializers and adds conditional compilation guards for D tensor operations
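
On that last point, the conditional guards follow a common C++17 pattern: the bias (D tensor) code path is compiled only when the tuple of D data types is non-empty, so the plain clamp instantiation never touches bias members. A minimal sketch of that shape, assuming hypothetical names (KernelArgSketch and its members are not the merged CK code):

```cpp
#include <tuple>

// Illustrative shape only: the real kernel argument uses CK's Tuple and its
// own member names, not std::tuple or anything shown here.
template <typename DsDataTypes>
struct KernelArgSketch
{
    static constexpr int NumD = std::tuple_size<DsDataTypes>::value;

    void Init()
    {
        if constexpr(NumD > 0)
        {
            // bias_clamp path: wire up the extra D (bias) tensor pointers
            // and descriptors here.
        }
        // With NumD == 0 (plain clamp) the branch above is discarded at
        // compile time, so the no-bias instantiation never references any
        // D-tensor members.
    }
};

// bias_clamp carries one D tensor; clamp carries none.
using BiasClampArg = KernelArgSketch<std::tuple<float>>;
using ClampArg     = KernelArgSketch<std::tuple<>>;
```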


@wj-laskowski wj-laskowski dismissed stale reviews from bartekxk and krithalith via 8651b4c January 21, 2026 11:53
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch 2 times, most recently from 7475853 to 72a000f on January 21, 2026 11:57
- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accommodate extra tensors
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch from 72a000f to 3c64b49 on January 22, 2026 10:55
@wj-laskowski wj-laskowski requested a review from bartekxk January 23, 2026 07:26
@wj-laskowski wj-laskowski merged commit 81ee19b into develop Jan 23, 2026
22 checks passed
@wj-laskowski wj-laskowski deleted the streamhpc/grouped-conv-fwd-wmma-large-tensor-extra-flavors branch January 23, 2026 11:19