Enable group mode (varlen) kernel generation for PyTorch integration #3553

chinmaydk99 · 2026-01-12T17:28:44Z

Proposed changes

This PR enables group mode (variable-length attention) kernel generation for PyTorch's CK SDPA backend.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

[x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
[ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
[ ] I have added inline documentation which enables the maintainers with understanding the motivation
[ ] I have removed the stale documentation which is no longer relevant after this pull request
[ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
[x] I have run clang-format on all changed files
[ ] Any dependent changes have been merged

Discussion

The change is minimal (single line deletion) but enables a significant feature: variable-length attention support for ROCm users via PyTorch's torch.nn.attention.varlen API.
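To illustrate what variable-length ("group mode") attention means, here is a minimal pure-Python sketch. It is not the CK kernel and not the `torch.nn.attention.varlen` API. The helper names (`cumulative_seqlens`, `varlen_attention`) and the `cu_seqlens` offset convention are assumptions chosen for illustration. The idea is that sequences of different lengths are packed back to back in one buffer without padding, and attention is computed independently inside each sequence's slice.

```python
import math
from itertools import accumulate


def cumulative_seqlens(lengths):
    """Offsets [0, l0, l0+l1, ...] that delimit each sequence in the packed buffer."""
    return [0, *accumulate(lengths)]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Plain scaled dot-product attention for a single sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out


def varlen_attention(q, k, v, cu_seqlens):
    """Attend within each packed sequence only: no padding, no cross-sequence mixing."""
    out = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        out.extend(attention(q[start:end], k[start:end], v[start:end]))
    return out


# Three sequences of lengths 2, 3 and 1 packed into one buffer of 6 tokens.
lengths = [2, 3, 1]
cu_seqlens = cumulative_seqlens(lengths)
total_tokens = cu_seqlens[-1]
packed = [[float(i), 1.0] for i in range(total_tokens)]  # hypothetical toy data

result = varlen_attention(packed, packed, packed, cu_seqlens)
```

The last sequence holds a single token, so its output equals its own value vector. That is a quick way to check that no attention leaks across sequence boundaries.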

poyenc · 2026-01-13T07:59:44Z

May I know why the group-mode kernels were not included previously, and why they are required now?

chinmaydk99 · 2026-01-20T16:10:06Z

> May I know why the group-mode kernels were not included previously, and why they are required now?

This is part of the ongoing parity effort to bridge feature gaps between the CK and AOTriton backends in PyTorch. Varlen attention is one of the features being enabled.

pruthvistony · 2026-02-03T20:21:04Z

@illsilin @poyenc,
Please help review this PR.

ammallya · 2026-02-03T22:02:04Z

Imported to ROCm/rocm-libraries

@chinmaydk99

…4292) 🔁 Imported from ROCm/composable_kernel#3553 🧑‍💻 Originally authored by @chinmaydk99 Co-authored-by: Chinmay_Kuchinad <[email protected]>

commit b7f136734ad26314386ca2b4f5a99467804f1bb7
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date: Mon Feb 9 20:58:57 2026 +0000

Enable group mode (varlen) kernel generation for PyTorch integration (#4292)

🔁 Imported from [ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553) 🧑‍💻 Originally authored by @chinmaydk99 Co-authored-by: Chinmay_Kuchinad <[email protected]>
commit b34fa62134425a611b18c05aae687f1bc09c7d87 Author: BrianHarrisonAMD <[email protected]> Date: Fri Feb 6 08:57:01 2026 -0700 [hipDNN] Remove spdlog dependency for consumers of hipdnn (#4312) ## Motivation Draft of changes to remove spdlog and fmt dependencies from hipDNN frontend & consumer facing SDKs. Consumers of hipDNN can have conflicts as a result of these dependencies. Removing these extra dependencies, and relying on the C API logging methods from hipDNN backend will reduce friction, and make it easier for adoption of hipDNN. ## Technical Details - Remove spdlog and fmt from frontend + consumer SDKs. - Add new string stream style formatter that will forward to backend logging API callback - Note: since we are set at C++17 standards, we cannot use std::format. - Add Spdlog and fmt optional dependencies as opt in for plugin_sdk - This enables existing plugins to maintain logging style they have in place, and allows plugin authors to decide what style of logging to use. - Unify consumption of spdlog & fmt dependencies in CMake's using unified method's ## Test Plan - Ensure build and tests are working for all components - Ensure tests with logs enabled are working properly with expected format - Ensure build of samples is working, and logging format is correct ## Test Result Build, tests, and testing with logging enabled is working locally for hipDNN, providers, and samples commit 0c37fdc37ba545b2ced5211b9f59c2381fc93753 Author: amd-chunxlin <[email protected]> Date: Fri Feb 6 09:44:39 2026 -0600 [rocRoller] Address long StreamK test runtimes (#4095) ## Motivation Some streamK tests take long time to finish, and this PR addresses the performance issue. 
<details> <summary>Comparison of test runtimes</summary> | Test name | Develop branch | This branch | | --- | --- | --- | | GPU_BasicGEMMStreamKWorkgroupMapping/0 | 14.412s | 4.059s| | GPU_BasicGEMMStreamKWorkgroupMapping/1 | 14.447s | 4.018s| | GPU_BasicGEMMStreamKWorkgroupMapping/2 | 14.452s | 4.026s| | GPU_BasicGEMMStreamKWorkgroupMapping/3 | 14.444s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/4 | 38.872s | 9.446s| | GPU_BasicGEMMStreamKWorkgroupMapping/5 | 38.828s | 9.445s| | GPU_BasicGEMMStreamKWorkgroupMapping/6 | 38.913s | 9.446s| | GPU_BasicGEMMStreamKWorkgroupMapping/7 | 38.812s | 9.435s| | GPU_BasicGEMMStreamKWorkgroupMapping/8 | 38.878s | 9.456s| | GPU_BasicGEMMStreamKWorkgroupMapping/9 | 38.889s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/10 | 38.884s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/11 | 38.859s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/12 | 14.414s | 4.035s| | GPU_BasicGEMMStreamKWorkgroupMapping/13 | 14.429s | 4.024s| | GPU_BasicGEMMStreamKWorkgroupMapping/14 | 14.47s | 4.045s| | GPU_BasicGEMMStreamKWorkgroupMapping/15 | 14.428s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/16 | 38.854s | 9.454s| | GPU_BasicGEMMStreamKWorkgroupMapping/17 | 38.861s | 9.448s| | GPU_BasicGEMMStreamKWorkgroupMapping/18 | 38.95s | 9.461s| | GPU_BasicGEMMStreamKWorkgroupMapping/19 | 38.826s | 9.458s| | GPU_BasicGEMMStreamKWorkgroupMapping/20 | 38.833s | 9.48s| | GPU_BasicGEMMStreamKWorkgroupMapping/21 | 38.888s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/22 | 38.868s | 9.477s| | GPU_BasicGEMMStreamKWorkgroupMapping/23 | 38.907s | 9.485s| | GPU_BasicGEMMStreamKWorkgroupMapping/24 | 14.426s | 4.025s| | GPU_BasicGEMMStreamKWorkgroupMapping/25 | 14.435s | 4.051s| | GPU_BasicGEMMStreamKWorkgroupMapping/26 | 14.472s | 4.053s| | GPU_BasicGEMMStreamKWorkgroupMapping/27 | 14.471s | 4.058s| | GPU_BasicGEMMStreamKWorkgroupMapping/28 | 38.879s | 9.457s| | GPU_BasicGEMMStreamKWorkgroupMapping/29 | 38.814s | 9.445s| | 
GPU_BasicGEMMStreamKWorkgroupMapping/30 | 38.853s | 9.45s| | GPU_BasicGEMMStreamKWorkgroupMapping/31 | 38.963s | 9.458s| | GPU_BasicGEMMStreamKWorkgroupMapping/32 | 38.924s | 9.466s| | GPU_BasicGEMMStreamKWorkgroupMapping/33 | 38.898s | 9.482s| | GPU_BasicGEMMStreamKWorkgroupMapping/34 | 38.951s | 9.455s| | GPU_BasicGEMMStreamKWorkgroupMapping/35 | 38.924s | 9.459s| | GPU_BasicGEMMStreamKWorkgroupMapping/36 | 14.461s | 4.037s| | GPU_BasicGEMMStreamKWorkgroupMapping/37 | 14.452s | 4.032s| | GPU_BasicGEMMStreamKWorkgroupMapping/38 | 14.43s | 4.053s| | GPU_BasicGEMMStreamKWorkgroupMapping/39 | 14.43s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/40 | 38.868s | 9.473s| | GPU_BasicGEMMStreamKWorkgroupMapping/41 | 38.925s | 9.461s| | GPU_BasicGEMMStreamKWorkgroupMapping/42 | 38.884s | 9.452s| | GPU_BasicGEMMStreamKWorkgroupMapping/43 | 38.925s | 9.455s| | GPU_BasicGEMMStreamKWorkgroupMapping/44 | 39.012s | 9.476s| | GPU_BasicGEMMStreamKWorkgroupMapping/45 | 38.915s | 9.479s| | GPU_BasicGEMMStreamKWorkgroupMapping/46 | 38.933s | 9.457s| | GPU_BasicGEMMStreamKWorkgroupMapping/47 | 38.936s | 9.469s| | GPU_BasicGEMMStreamKWorkgroupMapping/48 | 14.461s | 4.041s| | GPU_BasicGEMMStreamKWorkgroupMapping/49 | 14.468s | 4.049s| | GPU_BasicGEMMStreamKWorkgroupMapping/50 | 14.466s | 4.046s| | GPU_BasicGEMMStreamKWorkgroupMapping/51 | 14.479s | 4.038s| | GPU_BasicGEMMStreamKWorkgroupMapping/52 | 38.907s | 9.473s| | GPU_BasicGEMMStreamKWorkgroupMapping/53 | 38.914s | 9.471s| | GPU_BasicGEMMStreamKWorkgroupMapping/54 | 38.885s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/55 | 38.891s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/56 | 38.859s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/57 | 38.899s | 9.475s| | GPU_BasicGEMMStreamKWorkgroupMapping/58 | 38.936s | 9.47s| | GPU_BasicGEMMStreamKWorkgroupMapping/59 | 38.952s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/60 | 14.485s | 4.036s| | GPU_BasicGEMMStreamKWorkgroupMapping/61 | 14.419s | 4.031s| | 
GPU_BasicGEMMStreamKWorkgroupMapping/62 | 14.455s | 4.035s| | GPU_BasicGEMMStreamKWorkgroupMapping/63 | 14.488s | 4.045s| | GPU_BasicGEMMStreamKWorkgroupMapping/64 | 38.846s | 9.462s| | GPU_BasicGEMMStreamKWorkgroupMapping/65 | 38.923s | 9.475s| | GPU_BasicGEMMStreamKWorkgroupMapping/66 | 38.884s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/67 | 38.93s | 9.441s| | GPU_BasicGEMMStreamKWorkgroupMapping/68 | 38.9s | 9.473s| | GPU_BasicGEMMStreamKWorkgroupMapping/69 | 38.914s | 9.477s| | GPU_BasicGEMMStreamKWorkgroupMapping/70 | 38.961s | 9.487s| | GPU_BasicGEMMStreamKWorkgroupMapping/71 | 38.93s | 9.489s| | GPU_BasicGEMMStreamKWorkgroupMapping/72 | 14.432s | 4.051s| | GPU_BasicGEMMStreamKWorkgroupMapping/73 | 14.442s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/74 | 14.44s | 4.036s| | GPU_BasicGEMMStreamKWorkgroupMapping/75 | 14.456s | 4.042s| | GPU_BasicGEMMStreamKWorkgroupMapping/76 | 38.891s | 9.486s| | GPU_BasicGEMMStreamKWorkgroupMapping/77 | 38.938s | 9.444s| | GPU_BasicGEMMStreamKWorkgroupMapping/78 | 38.919s | 9.448s| | GPU_BasicGEMMStreamKWorkgroupMapping/79 | 38.858s | 9.455s| | GPU_BasicGEMMStreamKWorkgroupMapping/80 | 38.878s | 9.491s| | GPU_BasicGEMMStreamKWorkgroupMapping/81 | 38.975s | 9.478s| | GPU_BasicGEMMStreamKWorkgroupMapping/82 | 38.945s | 9.496s| | GPU_BasicGEMMStreamKWorkgroupMapping/83 | 38.905s | 9.477s| | GPU_BasicGEMMStreamKWorkgroupMapping/84 | 14.433s | 4.033s| | GPU_BasicGEMMStreamKWorkgroupMapping/85 | 14.41s | 4.038s| | GPU_BasicGEMMStreamKWorkgroupMapping/86 | 14.478s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/87 | 14.473s | 4.035s| | GPU_BasicGEMMStreamKWorkgroupMapping/88 | 38.927s | 9.49s| | GPU_BasicGEMMStreamKWorkgroupMapping/89 | 38.938s | 9.472s| | GPU_BasicGEMMStreamKWorkgroupMapping/90 | 38.913s | 9.454s| | GPU_BasicGEMMStreamKWorkgroupMapping/91 | 38.835s | 9.463s| | GPU_BasicGEMMStreamKWorkgroupMapping/92 | 38.901s | 9.494s| | GPU_BasicGEMMStreamKWorkgroupMapping/93 | 38.864s | 9.488s| | 
GPU_BasicGEMMStreamKWorkgroupMapping/94 | 38.839s | 9.508s| | GPU_BasicGEMMStreamKWorkgroupMapping/95 | 38.921s | 9.487s| | GPU_BasicGEMMStreamKWorkgroupMapping/96 | 14.445s | 4.05s| | GPU_BasicGEMMStreamKWorkgroupMapping/97 | 14.432s | 4.041s| | GPU_BasicGEMMStreamKWorkgroupMapping/98 | 14.472s | 4.055s| | GPU_BasicGEMMStreamKWorkgroupMapping/99 | 14.449s | 4.039s| | GPU_BasicGEMMStreamKWorkgroupMapping/100 | 38.87s | 9.485s| | GPU_BasicGEMMStreamKWorkgroupMapping/101 | 38.866s | 9.474s| | GPU_BasicGEMMStreamKWorkgroupMapping/102 | 38.836s | 9.471s| | GPU_BasicGEMMStreamKWorkgroupMapping/103 | 38.847s | 9.464s| | GPU_BasicGEMMStreamKWorkgroupMapping/104 | 38.985s | 9.468s| | GPU_BasicGEMMStreamKWorkgroupMapping/105 | 38.928s | 9.469s| | GPU_BasicGEMMStreamKWorkgroupMapping/106 | 38.866s | 9.496s| | GPU_BasicGEMMStreamKWorkgroupMapping/107 | 38.92s | 9.48s| | GPU_BasicGEMMStreamKWorkgroupMapping/108 | 14.452s | 4.046s| | GPU_BasicGEMMStreamKWorkgroupMapping/109 | 14.448s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/110 | 14.469s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/111 | 14.457s | 4.052s| | GPU_BasicGEMMStreamKWorkgroupMapping/112 | 38.895s | 9.474s| | GPU_BasicGEMMStreamKWorkgroupMapping/113 | 38.901s | 9.453s| | GPU_BasicGEMMStreamKWorkgroupMapping/114 | 38.882s | 9.467s| | GPU_BasicGEMMStreamKWorkgroupMapping/115 | 38.861s | 9.446s| | GPU_BasicGEMMStreamKWorkgroupMapping/116 | 38.945s | 9.498s| | GPU_BasicGEMMStreamKWorkgroupMapping/117 | 38.94s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/118 | 38.881s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/119 | 38.911s | 9.486s| | GPU_BasicGEMMStreamKWorkgroupMapping/120 | 14.434s | 4.044s| | GPU_BasicGEMMStreamKWorkgroupMapping/121 | 14.455s | 4.061s| | GPU_BasicGEMMStreamKWorkgroupMapping/122 | 14.464s | 4.067s| | GPU_BasicGEMMStreamKWorkgroupMapping/123 | 14.488s | 4.058s| | GPU_BasicGEMMStreamKWorkgroupMapping/124 | 38.932s | 9.499s| | GPU_BasicGEMMStreamKWorkgroupMapping/125 | 38.868s | 
9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/126 | 38.863s | 9.488s| | GPU_BasicGEMMStreamKWorkgroupMapping/127 | 38.899s | 9.467s| | GPU_BasicGEMMStreamKWorkgroupMapping/128 | 38.948s | 9.511s| | GPU_BasicGEMMStreamKWorkgroupMapping/129 | 38.915s | 9.502s| | GPU_BasicGEMMStreamKWorkgroupMapping/130 | 38.932s | 9.508s| | GPU_BasicGEMMStreamKWorkgroupMapping/131 | 38.914s | 9.491s| | GPU_BasicGEMMStreamKWorkgroupMapping/132 | 14.42s | 4.031s| | GPU_BasicGEMMStreamKWorkgroupMapping/133 | 14.464s | 4.054s| | GPU_BasicGEMMStreamKWorkgroupMapping/134 | 14.494s | 4.048s| | GPU_BasicGEMMStreamKWorkgroupMapping/135 | 14.48s | 4.046s| | GPU_BasicGEMMStreamKWorkgroupMapping/136 | 38.911s | 9.49s| | GPU_BasicGEMMStreamKWorkgroupMapping/137 | 38.91s | 9.478s| | GPU_BasicGEMMStreamKWorkgroupMapping/138 | 38.914s | 9.478s| | GPU_BasicGEMMStreamKWorkgroupMapping/139 | 38.92s | 9.475s| | GPU_BasicGEMMStreamKWorkgroupMapping/140 | 38.883s | 9.508s| | GPU_BasicGEMMStreamKWorkgroupMapping/141 | 38.935s | 9.5s| | GPU_BasicGEMMStreamKWorkgroupMapping/142 | 38.848s | 9.494s| | GPU_BasicGEMMStreamKWorkgroupMapping/143 | 38.932s | 9.502s| </details> ## Technical Details - Short-circuit expression comparison - Remove unused code - Caching expressions of kernel arguments to eliminate redundant regeneration. - Change `AssemblyKernelArgument` to a `class` ## Test Plan No functional changes; covered by existing tests. ## Test Result See CI report commit 66e22ac6c6c0c286325cabf2b1faa269ea640446 Author: hcman2 <[email protected]> Date: Fri Feb 6 10:55:08 2026 +0800 [formocast] [tensilelite] enable tuning with formocast (#4043) ## Motivation Integration plan of formocast and origami : Steps 1.Push Formocast code to the origami subfolder. (we are here now) 2.Submit tuning code calling the API of origami. 3.Push and reserve origami prediction mode. Add APIs to pass sizemapping data via config_t. Use an environment variable to switch modes without affecting other code. 
4.Enable predictionThreshold with tox tests. 5.Push Origami code with Formocast backend. This step will enable the bench with different modes. 6.Refine Formocast and Origami to verify API usage and identify functions to move. This PR is to include step2 and step4. ## Technical Details  ## Test Plan  ## Test Result  ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Henry Ho <[email protected]> Co-authored-by: Peter Cheng <[email protected]> commit 37a74ef54eaa1bb1df603db6ec8aff22f342bc71 Author: Illia Silin <[email protected]> Date: Thu Feb 5 17:06:57 2026 -0800 [CK] a bunch of CI fixes. (#4361) ## Motivation Fixing some of the CK CI issues ## Technical Details fixing paths to dockerfiles and scripts; moving codegen tests to separate stage (collides with main build since you must call cmake from same folder but different options); fixing a couple of clang compilation issues with staging compiler; commit 808e9496d17be0826164dbb34457f54e157dd2bd Author: Kerry Wang <[email protected]> Date: Thu Feb 5 17:00:01 2026 -0600 refractor observers to have new runtime with context concept commit 3c9beb38b8dba1301a961cc5dc3f44ca9d4185e3 Author: Eiden Yoshida <[email protected]> Date: Thu Feb 5 17:56:12 2026 -0500 [CK] MICI: Fix git diff in selective_test_filter.py (#4352) ## Motivation - git diff needs access to reference repo ## Technical Details - mount reference repo path into docker for selective_test_filter.py to access ## Test Plan - tested in MICI ## Test Result - launch_tests.sh ran successfully commit 1663ac026d46c3dd02edb73d2bfa7310c54695d5 Author: Torre Zuk <[email protected]> Date: Thu Feb 5 14:20:45 2026 -0700 [rocBLAS] trsm doc & test; trsv change noted in log (#4198) ## Motivation Tests trsm use of new trsv kernel for big batches Adds chagelog note on trsv which used to call hipGetDevice and potentially hipSetDevice commit 
d8bb9d2b9fe278d07fe63b395d87268c94e53fd8 Author: Jeffrey Novotny <[email protected]> Date: Thu Feb 5 14:54:48 2026 -0500 [rocsolver] Doxygen API cleanup part 2 (#4330) ## Motivation  Continuing with rocSOLVER API/Doxygen copy edits and polishing ## Technical Details  Edit Doxygen comments in header file. ## Test Plan  Build locally. ## Test Result  NA ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit 5aa1f1d4c189f779ea699be250fb1b284f3d6ac2 Author: Geo Min <[email protected]> Date: Thu Feb 5 11:01:53 2026 -0800 [ci] Updating variable group-id for OSSCI (#4360) OSSCI migrated mi325s, so need a new groupID Sanity works here: https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659665907 normal run works here: https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659791422 I've dabbled with organization variables, however, this does not work for forks so for now, we will do the manual update commit 6273d3b30b32d6e0856394d37a421153dceb33c1 Author: Dmitrii Polomin <[email protected]> Date: Thu Feb 5 19:53:45 2026 +0100 [MIOpen] Ported solver test to gtest (#3713) ## Motivation Porting tests from CTest to GTest, in this case, `solver.cpp` ## Technical Details Pretty straightforward port, although I had to get creative in order to conform to `INSTANTIATE_TEST_SUITE_P` pattern and naming conventions ## Test Plan Running locally, using the CI launched by this PR ## Test Result See CI actions launched by this PR ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit ece0c48dd152856c3b443c12c58a89bc7d7c34e5 Author: Nathan Henderson <[email protected]> Date: Thu Feb 5 09:03:42 2026 -0700 [rocroller] Use StreamKMode in hipBLASLt integration and client (#4028) ## Motivation RocRoller has a `StreamKMode` enum that is used internally and in the GEMM tests. 
However, the client and hipBLASLt integration still used boolean values to represent the StreamK state. This PR replaces the individual boolean flags (`--streamK`, `--streamKTwoTile`, `--streamKTwoTileDPFirst`) with a single `--streamK` string option that accepts one of the `StreamKMode` values (`None`, `Standard`, `TwoTile`, or `TwoTileDPFirst`). ## Technical Details - Update the rocRoller GEMM client to use `StreamKMode` enum instead of three separate booleans - Update the hipBLASLt rocRoller integration to match - Add `enumStrings<T>()` utility function in `Utils_impl.hpp` for CLI validation of enum values ## Test Plan Update `test_gemm_client.py` YAML fixtures to use the new `streamK: None` format. ## Test Result Validated by the StreamK rrperf tests commit 3b98c98a23e76075a6a1e4e580482a627e39d59b Author: Jobbins <[email protected]> Date: Thu Feb 5 08:56:42 2026 -0700 [composablekernel] fix failure status (#4351) ## Motivation Pipelines were failing on Math CI status check. ## Technical Details For the success case, we just changed the config in Jenkins to use a proper app token and no code changes were required. However, the failure case would not have worked as coded, so we needed to move that outside of the `rocmnode()` block. ## Test Plan I removed all of the CI in one of the commits to quickly test, and then added it back. 
Got a successful "success" message and "failure" message produced commit 9bb7f5c31253643cd72363314c3d3ee02f723406 Author: Eiden Yoshida <[email protected]> Date: Thu Feb 5 10:55:44 2026 -0500 [CK] MICI: Correct path for build trace script (#4349) ## Motivation - Corrects path to script due to superrepo migration - Forces all tests to run by default ## Technical Details - now in /projects/composablekernel --------- Co-authored-by: illsilin_amdeng <[email protected]> commit 120f91dd211117e308b3713593ac7f061cc02c08 Author: bibek <[email protected]> Date: Thu Feb 5 09:47:16 2026 -0600 [HIPDNN][DOC] Add TYPED_TEST guidance for multi-datatype tests (#4000) ## Motivation Doc update : add `TYPED_TEST` guidance for multi-datatype tests Update `.clinerules`, `.cursor/rules/testing.mdc`, and `docs/Testing.md` to recommend `TYPED_TEST` for tests covering `float`/`half`/`bfloat16`. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Samuel Reeder <[email protected]> commit d26a7820b58f789d19efd6064d4c2c4f4fc72a95 Author: Eiden Yoshida <[email protected]> Date: Wed Feb 4 21:43:22 2026 -0500 [CK] MICI: Use reference repo for checkout operations (#4336) ## Motivation - Maintain a reference repo on slave nodes that speeds up any clone/checkout operations ## Technical Details - clone a ref repo if it does not exist - update ref repo if it does exist - checkout after ref repo is updated - eliminates double clone ## Test Result - Initial checkouts succeeded commit f2f187ab40738272232f571f58112697da405b1a Author: Geo Min <[email protected]> Date: Wed Feb 4 15:43:38 2026 -0800 [ci] Fixing rocm-libs race condition (#4192) Currently, there is a race condition that overwrites BLAS libraries during MIOpen/hipdnn builds. 
(error: https://github.com/ROCm/rocm-libraries/actions/runs/21228188053/job/61080555083) Tested locally: ``` # With all three geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py [{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_MIOPEN=ON -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel', 'project_to_test': 'miopen_plugin,miopen,hipdnn'}] # Only hipdnn geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py [{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON', 'project_to_test': 'hipdnn,miopen_plugin'}] ``` This fixes this error, as if all libraries are ran, they will combine (saves resources + no overwriting) commit f34aec25c434b3044b75481d70693af3bf0ade1e Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com> Date: Wed Feb 4 18:25:31 2026 -0500 [CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263) Implement per-page K/V quantization for paged attention: - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum - Use exp2 shift trick to eliminate explicit P scaling overhead - Prefetch physical pages offset for KV cache, overlaps with computations ## Proposed changes Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. 
- [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696) 🧑‍💻 Originally authored by @Jeff-Huang --------- Co-authored-by: Jeff Huang <[email protected]> Co-authored-by: Illia Silin <[email protected]> commit df32df51ea6a9ebfba9a459c77ed82c4877df22b Author: Yiqian Liu <[email protected]> Date: Wed Feb 4 17:12:11 2026 -0600 [rocRoller] [hipblaslt] Enable more workgroup tile sizes for pre-swizzled scale data (#4175) ## Motivation Pre-swizzle is an optimization that pre-swizzle the scale data to match the layout that kernel expects. The purpose of this PR is to add more possible workgroup tile sizes that supports pre-swizzled scale data. ## Technical Details 1. Configure the workgroup tile size to 256 at K dimension when the input data format is pre-swizzled. 2. Filter out the invalid workgroup tile size (i.e., MN dimension is not multiple of 32, or MN dimension is 96). 3. Configure the solution parameters for pre-swizzled input. ## Test Plan 1. Added a rocRoller client test that uses 32x32x256 workgroup tile. ## Test Result 1. 
This PR should not change any kernel that the data is not pre-swizzled. 2. All the tests should pass and no performance changes. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Yiqian Liu <[email protected]> Co-authored-by: yiqialiu <[email protected]> commit 87d1a8fa005ef2f75e48e5c9c4e70f8235236b03 Author: Samuel Reeder <[email protected]> Date: Wed Feb 4 16:10:01 2026 -0700 Use `--latest-release` flag for installing rocm in clang-tidy (#4120) ## Motivation `--latest-release` was added in TheRock [2997](https://github.com/ROCm/TheRock/pull/2997) to grab latest nightly for the specified target. We can use this to keep the clang-tidy workflow somewhat up-to-date. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. commit c5…

chinmaydk99 mentioned this pull request Jan 12, 2026

[ROCm][CK] Enable variable-length attention support for CK SDPA backend pytorch/pytorch#172246 (Draft)

chinmaydk99 force-pushed the ck-enable-varlen-group-mode branch 2 times, most recently from 691b71a to 4670a96 Compare January 13, 2026 01:21

chinmaydk99 force-pushed the ck-enable-varlen-group-mode branch from 4670a96 to 6258488 Compare January 13, 2026 22:27

chinmaydk99 force-pushed the ck-enable-varlen-group-mode branch 3 times, most recently from 070806a to 49d0721 Compare January 26, 2026 03:34

chinmaydk99 marked this pull request as ready for review January 26, 2026 15:50

chinmaydk99 requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 26, 2026 15:50

Enable group mode varlen kernel generation for PyTorch

f4d8329

chinmaydk99 force-pushed the ck-enable-varlen-group-mode branch from 49d0721 to f4d8329 Compare February 3, 2026 14:57

illsilin assigned poyenc Feb 3, 2026

assistant-librarian bot mentioned this pull request Feb 3, 2026

Enable group mode (varlen) kernel generation for PyTorch integration ROCm/rocm-libraries#4292 (Merged)

ammallya closed this Feb 3, 2026

4 participants
