@chinmaydk99

Proposed changes

This PR enables group mode (variable-length attention) kernel generation for PyTorch's CK SDPA backend.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

The change is minimal (single line deletion) but enables a significant feature: variable-length attention support for ROCm users via PyTorch's torch.nn.attention.varlen API.
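
To illustrate what variable-length (group mode) attention operates on, here is a minimal sketch of the packed layout commonly used for varlen attention: sequences of different lengths are concatenated without padding, and a cumulative-lengths array marks where each one starts. The helper name `pack_sequences` and the `cu_seqlens` naming are assumptions for illustration; they are not taken from this PR, from CK, or from the `torch.nn.attention.varlen` API.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length sequences and record their boundaries.

    Hypothetical helper for illustration only; not part of CK or PyTorch.
    """
    lengths = [len(seq) for seq in sequences]
    # Start offset of each sequence in the packed buffer, plus the total length.
    cu_seqlens = [0] + list(accumulate(lengths))
    packed = [token for seq in sequences for token in seq]
    return packed, cu_seqlens, max(lengths, default=0)


batch = [[11, 12, 13], [21], [31, 32]]
packed, cu_seqlens, max_len = pack_sequences(batch)

# Sequence i occupies packed[cu_seqlens[i]:cu_seqlens[i + 1]].
second = packed[cu_seqlens[1]:cu_seqlens[2]]
```

With this layout no padding tokens are stored, which is the property a group-mode kernel relies on when the batch contains sequences of mixed length.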

@poyenc
Contributor

poyenc commented Jan 13, 2026

May I know why the group-mode kernels were not included previously, and why they are required now?

@chinmaydk99 chinmaydk99 force-pushed the ck-enable-varlen-group-mode branch from 4670a96 to 6258488 Compare January 13, 2026 22:27
@chinmaydk99
Author

chinmaydk99 commented Jan 20, 2026

> May I know why the group-mode kernels were not included previously, and why they are required now?

This is part of the ongoing parity effort to bridge feature gaps between the CK and AOTriton backends in PyTorch. Varlen attention is one of the features being enabled.

@chinmaydk99 chinmaydk99 force-pushed the ck-enable-varlen-group-mode branch from 49d0721 to f4d8329 Compare February 3, 2026 14:57
@pruthvistony

@illsilin @poyenc,
Please help review this PR.

@ammallya
Contributor

ammallya commented Feb 3, 2026

Imported to ROCm/rocm-libraries

@ammallya ammallya closed this Feb 3, 2026
illsilin pushed a commit to ROCm/rocm-libraries that referenced this pull request Feb 9, 2026
…4292)

## Proposed changes

This PR enables group mode (variable-length attention) kernel generation
for PyTorch's CK SDPA backend.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

The change is minimal (single line deletion) but enables a significant
feature: variable-length attention support for ROCm users via PyTorch's
torch.nn.attention.varlen API.



---
🔁 Imported from
[ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553)
🧑‍💻 Originally authored by @chinmaydk99

Co-authored-by: Chinmay_Kuchinad <[email protected]>
WorldofKerry added a commit to ROCm/rocm-libraries that referenced this pull request Feb 10, 2026
commit 8c40fb6cac48969d6237cccdcbbbad56b44ff0a3
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 20:27:33 2026 -0500

    more consistent skip text

commit 557e2764b3a001884a004f0a183a307c4fbc2bd2
Merge: 6bd6e49791 219f365e7b
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 18:38:48 2026 -0600

    Merge remote-tracking branch 'origin/develop' into users/kerrwang/lds-queue

commit 6bd6e497910e3ba681b22a47630bc5f0dedb16b8
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 18:38:31 2026 -0600

    fix format

commit 219f365e7bc40c9ce3f5c382228a7b2e14b90520
Author: James Sandham <[email protected]>
Date:   Mon Feb 9 19:08:04 2026 -0500

    [hipsparse] Match behaviour of csr2csr_compress from rocsparse (#4420)

    ## Motivation

    In the hipSPARSE test code host solution, we were incorrectly checking
    if a value satisfied:

    `testing_abs(csr_val_A[j]) > testing_real(tol) &&
    testing_abs(csr_val_A[j]) > std::numeric_limits<float>::min()`

    instead of the correct criteria:

    `testing_abs(csr_val_A[j]) > testing_real(tol)`
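
The two criteria only differ when the tolerance is smaller than the smallest positive normal `float`. A minimal Python sketch of that difference follows; the function names and the `FLT_MIN` constant are illustrative stand-ins, not the hipSPARSE test code.

```python
# Smallest positive normal float32, i.e. std::numeric_limits<float>::min().
FLT_MIN = 1.17549435e-38


def keep_entry_old(value, tol):
    """Previous host check: also drops anything at or below FLT_MIN."""
    return abs(value) > tol and abs(value) > FLT_MIN


def keep_entry_fixed(value, tol):
    """Criterion described as correct in the commit message."""
    return abs(value) > tol


values = [0.0, 1e-40, 0.5, -2.0]
tol = 0.0

kept_old = [v for v in values if keep_entry_old(v, tol)]
kept_fixed = [v for v in values if keep_entry_fixed(v, tol)]
```

Here the tiny value `1e-40` survives the fixed check but not the old one, so the two host references would disagree on the compressed matrix.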

commit 698d5d09184a24fde32ab7309fcd88410fc7ff8e
Author: amd-hsong <[email protected]>
Date:   Mon Feb 9 16:40:07 2026 -0700

    [rocprim] Fix a call to intrinsics in test_device_reduce_by_key (#4391)

    ## Motivation

    <!-- Explain the purpose of this PR and the goals it aims to achieve.
    -->
    Fix a call to __clzll in test_device_reduce_by_key

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->
    There are a couple of issues in the call to __clzll:
    - the argument is cast to `long long`: it should be cast to `unsigned
    long long` instead
    - in rocprim there exists a wrapper for clz, so for better portability
    rocprim::clz should be used instead.
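
As a reference for what a 64-bit count-leading-zeros computes, here is a small Python model. It treats the input as an unsigned 64-bit value, mirroring the `unsigned long long` cast described above. The name `clz64` is hypothetical, returning 64 for a zero input is a choice of this model, and none of this is the rocPRIM implementation.

```python
MASK64 = (1 << 64) - 1


def clz64(x):
    """Count leading zero bits of x viewed as an unsigned 64-bit integer."""
    x &= MASK64  # negative inputs become their two's-complement bit pattern
    return 64 - x.bit_length()


leading_for_one = clz64(1)
leading_for_top_bit = clz64(1 << 63)
leading_for_minus_one = clz64(-1)
```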

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->
    Run test_device_reduce_by_key to verify the test runs correctly.

    ## Test Result

    <!-- Briefly summarize test outcomes. -->
    The test passes.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 557f5baa6d68bb5a8126d9730a8d48983778aac3
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 16:53:41 2026 -0600

    skip on non-gfx950

commit 8b72bc8759d9c11dfcbf410182fa332152b97e69
Author: bnemanich <[email protected]>
Date:   Mon Feb 9 16:26:51 2026 -0500

    [hipBLASLt] Enable custom MXFP4 kernels (#4384)

    ## Motivation

    Allow hipBLASLt to call custom MX FP4 kernels for higher performance.

    ## Technical Details

    A single kernel was added in this PR. The kernel was originally from:
    https://github.com/ROCm/aiter/tree/main/hsa/gfx950/f4gemm.
    This kernel used a slightly different shuffled scaling layout than
    rocRoller. hipBLASLt will only support this new shuffled layout, plus
    the original non-shuffled layout. All rocRoller kernels will be disabled
    when using shuffled scales for now. Once rocRoller supports the new
    layout, they will be added back in.

    This PR also adds some new MX datatype generation patterns that were
    useful during debugging.

    New custom kernels can be added to the custom_kernels directory. They
    will also need to be added in the customer_kernels.cpp file that was
    added in this PR.

    ## Test Plan

    Check that performance improved when using MXFP4 GEMMs with shuffled
    scales.

    ## Test Result

    Performance improved by about 17%.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Andrew Whittle <[email protected]>
    Co-authored-by: Bryant Nelson <[email protected]>

commit 61f9f906dcc0a9d4f6c327fea713aebc6d4b0a1d
Author: Bartłomiej Kocot <[email protected]>
Date:   Mon Feb 9 22:08:57 2026 +0100

    [CK] CK Tile grouped convolution direct load (#4406)

    ## Motivation

    CK Tile grouped convolution forward direct load support.

    ## Technical Details

    Basic pipeline for direct load and new instances for forward for v1 and
    v4 pipelines.

    ## Test Plan

    test_grouped_convnd_fwd_tile

    ## Test Result

    CI pending

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
    AICK-130

commit b7f136734ad26314386ca2b4f5a99467804f1bb7
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Mon Feb 9 20:58:57 2026 +0000

    Enable group mode (varlen) kernel generation for PyTorch integration (#4292)

    ## Proposed changes

    This PR enables group mode (variable-length attention) kernel generation
    for PyTorch's CK SDPA backend.

    ## Checklist

    Please put an `x` into the boxes that apply. You can also fill these out
    after creating the PR. If you're not sure, please don't hesitate to ask.

    - [X] I have added tests relevant to the introduced functionality, and
    the unit tests are passing locally
    - [ ] I have added the test to REGRESSION_TESTS list defined at the top
    of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
    than 30 seconds to run.
    - [ ] I have added inline documentation which enables the maintainers
    with understanding the motivation
    - [ ] I have removed the stale documentation which is no longer relevant
    after this pull request
    - [ ] (If this change is user-facing) I have added release notes which
    provide the end users with a brief summary of the improvement from this
    pull request
    - [X] I have run `clang-format` on all changed files
    - [ ] Any dependent changes have been merged

    ## Discussion

    The change is minimal (single line deletion) but enables a significant
    feature: variable-length attention support for ROCm users via PyTorch's
    torch.nn.attention.varlen API.

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553)
    🧑‍💻 Originally authored by @chinmaydk99

    Co-authored-by: Chinmay_Kuchinad <[email protected]>

commit f48a5e63edb7102996b0b769e76114c0bbfd35cf
Author: Mihnea Chirila <[email protected]>
Date:   Mon Feb 9 14:55:12 2026 -0600

    [Tensilelite] Added MIArchVgpr support for Complex Datatypes. (#4332)

    ## Motivation

    Added MIArchVgpr support for Complex Datatypes.

    ## Technical Details

    Fixed AlphaTmpVgpr initialization, and rocisa register offset bug:

    - Updated the condition so that AlphaTmpVgpr is initialized whenever the
    MIArchVgpr parameter is enabled. This is required to generate the
    `MulMIOutAlphaToArch` code
    (https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L251)
    regardless of the postGSU accumulation scheme.
    - Fixed a bug in the underlying `Holder` struct so that offsets passed as
    strings are correctly forwarded to `RegisterContainer`. This is required
    to update the imaginary register for C/ZGEMM.
    (https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L288)

    ## Test Plan

    Tested for C & Z with MIArchVgpr: [0, 1] on gfx942 and gfx950

    ## Test Result

    Success

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 3de83b9b4035006b5ddd825df9404edc43ca9b39
Author: CMiservaAMD <[email protected]>
Date:   Mon Feb 9 13:42:54 2026 -0700

    [hipDNN] Add integration tests for frontend configuration knobs APIs.  (#4307)

    Add integration tests to verify correct operation of new hipDNN frontend API
    functions for managing engine config settings.

commit 2752a8a5105e11929b876ce0e343bcc73a9cf308
Author: DarylHawkinsAMD <[email protected]>
Date:   Mon Feb 9 13:08:30 2026 -0700

    [MIOpen] First set of kernels using CK Builder end to end (#4123)

commit e55f37bad667987f74989bc95e08f86603438963
Author: Mitchell Ousdahl <[email protected]>
Date:   Mon Feb 9 10:05:26 2026 -0800

    Modified test plugin rpaths (#4350)

    ## Motivation
    In order to successfully get hipDNN added to the python ROCm wheels, the
    RPATHs on Linux for the test plugins need to be updated to make them
    portable. We will leverage TheRock's existing RPATH update mechanism to
    do this.

    ## Technical Details
    - Update all test plugin RPATHs

    ## Test Plan
    - Build ROCm
    - Build the wheels
    - Use the "Test ROCm Wheel" workflow, which verifies that the test
    plugins can load and find their dependencies.

    ## Test Result
    - [ ] "Test ROCm Wheel" workflow succeeds

    ## Submission Checklist
    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit a7485411874b0650b31068364f6b1155d9890212
Author: Muhammad Osama <[email protected]>
Date:   Mon Feb 9 09:30:52 2026 -0800

    [Origami] Skip test-selector if torch not found. (#4359)

    ## Motivation

    Makes `torch` completely optional by skipping dependent tests if it is
    not found.

    ## Technical Details

    ```
    # Skip entire module if torch is not available (selector requires torch)
    torch = pytest.importorskip("torch", reason="torch is required for OrigamiMatmulSelector tests.")
    ```
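
For context, the same optional-dependency idea can be sketched outside pytest with only the standard library. The helper `optional_import` below is hypothetical and is not part of Origami's test suite; inside a test module, `pytest.importorskip` is usually the simpler choice because it also reports the tests as skipped.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


json_mod = optional_import("json")
missing_mod = optional_import("module_that_does_not_exist_xyz")

# Callers can then branch on availability instead of failing at import time.
have_missing = missing_mod is not None
```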

    ## Test Plan

    Run tests using CI + TheRock build.

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 0c5cd629a94a454a350eb651b5921baeb1c82546
Author: Swati Rawat <[email protected]>
Date:   Mon Feb 9 22:51:13 2026 +0530

    Update Tensile CHANGELOG.md (#4164)

    ## Motivation

    <!-- Explain the purpose of this PR and the goals it aims to achieve.
    -->

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 331512e9e13e197d8d7fdf7b72f5b60eb63d7d1e
Author: Bartłomiej Kocot <[email protected]>
Date:   Mon Feb 9 16:36:52 2026 +0100

    [CK] Fix grouped conv fwd transform for merged groups (#4399)

    ## Motivation

    [CK] Fix grouped conv fwd transform for merged groups for 1d and 3d.

    ## Technical Details

    After the optimizations for 2D, the corresponding implementation for 1D
    and 3D was missing.

    ## Test Plan

    test_grouped_convnd_fwd

    ## Test Result

    pending CI

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 1c2927530e176c63cf814b44eb8147e89d2bcaf7
Author: Eiden Yoshida <[email protected]>
Date:   Mon Feb 9 10:23:47 2026 -0500

    [CK] MICI: Disable failure pattern checking (#4373)

    ## Motivation

    - ck mici jobs hanging at end, possibly at failure pattern checking

    ## Technical Details

    - Disable failure pattern checking to see if hanging goes away

    ## Test Plan

    - Observe behavior after merge

commit a3058d1dc0b3f176f56fbecd040c2fc48c7258ad
Author: COrruDXC <[email protected]>
Date:   Mon Feb 9 14:02:05 2026 +0100

    Reduce boost usage by replacing time calls (#3875)

    ## Motivation

    Reduce boost usage by replacing time calls.

    ## Technical Details

    Replace boost::posix_time related data types with the corresponding
    std::chrono data types.

    ## Submission Checklist

    - [X] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 959bd9393ad9a578711334c40948ac1321e41c1f
Author: Yi-Yao (Alex), Wang <[email protected]>
Date:   Mon Feb 9 17:15:47 2026 +0800

    Update gfx942/gfx950 BBS/HHS/I8I8S SPB/SPA logic yaml (#4365)

    ## Motivation

    - Update BBS/HHS/I8I8S SPB/SPA logic yaml for gfx942/gfx950

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    - Run local hipsparselt-test
    - Run local tests for all matrix sizes using hipsparselt-bench

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: chiahlin <[email protected]>

commit ad03e58dabbf2bbc348c031a06ec73011d85d2c3
Author: Chuck Wu <[email protected]>
Date:   Mon Feb 9 13:04:38 2026 +0800

    [hipblaslt] Fix memory leaks & uninitialized value use (#4338)

    ## Motivation

    <!-- Explain the purpose of this PR and the goals it aims to achieve.
    -->
    According to the
    [ROCM-1835](https://amd-hub.atlassian.net/browse/ROCM-1835?focusedCommentId=109304&sourceType=mention),
    there are some memory leaks and instances of uninitialized value being
    used during the gtest.
    <img width="450" height="367" alt="image"
    src="https://github.com/user-attachments/assets/2345e1f8-6062-4a5a-b294-97042709b18e"
    />

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->
    1. Add the code to call the corresponding destroy functions for the data
    that has not been released yet.
    2. Add the default value to compute_input_typeA/B

    Flow (before this commit) ->
    a. hipblasLtMatmulDescCreate:
    - compute_input_typeA = ???
    - compute_input_typeB = ???

    b. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_A, HIP_R_16F, ...)
    - compute_input_typeA = HIP_R_16F
    - compute_input_typeB = ???
    - call _matmul_desc_determine_compute_type()
      - Read  compute_input_typeA & compute_input_typeB

    c. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_B, HIP_R_16F, ...)
    - compute_input_typeA = HIP_R_16F
    - compute_input_typeB = HIP_R_16F
    - call _matmul_desc_determine_compute_type()
      - Read  compute_input_typeA & compute_input_typeB
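
The flow above can be modelled in a few lines of Python. This is only a sketch of the idea of giving both fields a defined default so that the first read is well defined. The class, the `DEFAULT` sentinel, and the rule inside `determine_compute_type` are assumptions for illustration, not hipBLASLt's actual logic.

```python
from dataclasses import dataclass

DEFAULT = "DEFAULT"  # hypothetical sentinel meaning "not set by the caller"


@dataclass
class MatmulDesc:
    # Both fields start with a defined value instead of being left unset.
    compute_input_type_a: str = DEFAULT
    compute_input_type_b: str = DEFAULT

    def set_attribute(self, name, value):
        setattr(self, name, value)
        # Mirrors the flow above: every set triggers a read of both fields.
        return self.determine_compute_type()

    def determine_compute_type(self):
        a, b = self.compute_input_type_a, self.compute_input_type_b
        if DEFAULT in (a, b):
            return "undetermined"
        return "matched" if a == b else "mixed"


desc = MatmulDesc()
after_a = desc.set_attribute("compute_input_type_a", "HIP_R_16F")
after_b = desc.set_attribute("compute_input_type_b", "HIP_R_16F")
```

In step b of the flow only A has been set, so the read of B returns the default rather than an indeterminate value.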

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->
    1. Test command: `valgrind --leak-check=full ./hipblaslt-test
    --gtest_filter=_/aux_test.*`
    2. Before this commit:

    Uninitialized value being used
    <img width="691" height="81" alt="image"
    src="https://github.com/user-attachments/assets/22a897f1-c25e-4608-850e-0c6bcb5ad0a3"
    />
    Memory leaks
    <img width="708" height="78" alt="image"
    src="https://github.com/user-attachments/assets/02268893-a29a-4db4-95aa-c93385371d5a"
    />

    ## Test Result

    <!-- Briefly summarize test outcomes. -->
    1. The Valgrind errors shown above no longer appear.
    2. gtest all pass on Navi3.
    <img width="952" height="92" alt="image"
    src="https://github.com/user-attachments/assets/6e1b9b01-afc4-4a51-8a9f-e0196fc8495a"
    />

    ## Submission Checklist

    - [X] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    [ROCM-1835]:
    https://amd-hub.atlassian.net/browse/ROCM-1835?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

commit b7de1e14cea70681a23cd1a136df42910c776e4a
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Mon Feb 9 11:54:54 2026 +0800

    [CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)

    ## Proposed changes

    gemm blockscale eightwarps support

    ## Checklist

    Please put an `x` into the boxes that apply. You can also fill these out
    after creating the PR. If you're not sure, please don't hesitate to ask.

    - [ ] I have added tests relevant to the introduced functionality, and
    the unit tests are passing locally
    - [ ] I have added the test to REGRESSION_TESTS list defined at the top
    of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
    than 30 seconds to run.
    - [ ] I have added inline documentation which enables the maintainers
    with understanding the motivation
    - [ ] I have removed the stale documentation which is no longer relevant
    after this pull request
    - [ ] (If this change is user-facing) I have added release notes which
    provide the end users with a brief summary of the improvement from this
    pull request
    - [x] I have run `clang-format` on all changed files
    - [x] Any dependent changes have been merged

    ## Discussion

    If this is a relatively large or complex change, feel free to start a
    discussion by explaining why you chose the solution you did and what
    alternatives you considered

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3650](https://github.com/ROCm/composable_kernel/pull/3650)
    🧑‍💻 Originally authored by @kensclin

    ---------

    Co-authored-by: KenSCLin <[email protected]>
    Co-authored-by: Ding, Yi <[email protected]>
    Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
    Co-authored-by: Thomas Ning <[email protected]>

commit 774cfc6410ed55802691fef19a34449182878be5
Author: Ethan <[email protected]>
Date:   Mon Feb 9 11:39:20 2026 +0800

    [hipblaslt] do some debug operations only in debug setting

    ## Motivation

    Single solution selection time has increased slightly.
    <!-- Explain the purpose of this PR and the goals it aims to achieve.
    -->

    ## Technical Details

    Many "assign matchingTag" operations can be avoided when not in debug
    mode (printProperty), but I kept the Equal assignment since it was
    there before #2757
    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit ff3e9821bbe2d14b9874e61ebb518bbbce621ac3
Author: jakpiase <[email protected]>
Date:   Sun Feb 8 20:57:14 2026 +0100

    [CK_TILE] Add support and tests for V6 pipeline in conv fwd (#4357)

    Added support for the conv v6 pipeline in CK Tile's convolution forward
    kernel. The CK Tile v6 pipeline is equivalent to old CK's V5 pipeline
    and should be faster than other pipelines in some cases. This PR also
    adds tests to the profiler, which currently lives in the experimental
    directory, so regressions should now be easier to detect.

    ---------

    Co-authored-by: Illia Silin <[email protected]>
    Co-authored-by: subhajitdchow <[email protected]>

commit 591f50450241d6b1965f9f6ee3fe2526ef71ab8d
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Sun Feb 8 12:34:59 2026 +0100

    [CK] Add fwd conv group merging to v3 conv instances (#4273)

    ## Proposed changes

    Added conv group merging to the (universal) V3 fwd conv pipeline. The
    new instance improves fwd conv performance when the number of
    input/output channels per group is low.

    On MI300 (`gfx942`) we get

    | CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
    |:-----|:------:|------:|
    | grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 | 3.86035 | 8.36796 |
    | grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 | 10.1867 | 13.4677 |
    | grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 | 11.7875 | 16.3657 |
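
The speedup implied by each row is the ratio of the two throughput columns. A quick calculation using the numbers reported in the table (the variable names are illustrative):

```python
# (baseline TFLOPS, V3 group merging TFLOPS), in table order.
results = [
    (3.86035, 8.36796),
    (10.1867, 13.4677),
    (11.7875, 16.3657),
]

speedups = [merged / baseline for baseline, merged in results]

for row, speedup in enumerate(speedups, start=1):
    print(f"row {row}: {speedup:.2f}x")
```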

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3675](https://github.com/ROCm/composable_kernel/pull/3675)
    🧑‍💻 Originally authored by @vpietila-amd

    ---------

    Co-authored-by: Ville Pietilä <>
    Co-authored-by: Ville Pietilä <[email protected]>
    Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
    Co-authored-by: Illia Silin <[email protected]>
    Co-authored-by: Bartlomiej Kocot <[email protected]>

commit cad7fa2c1849b0863ed52ef6cd47198e421d5b6e
Author: BrianHarrisonAMD <[email protected]>
Date:   Fri Feb 6 23:48:54 2026 -0700

    [hipDNN] Fix codecov target breaks (#4374)

    ## Motivation

    code_cov stage for hipDNN was breaking due to environment differences.
    Newer clang tooling flags false positives on added files.
    Environments that are missing spdlog, but have fmt present were causing
    issues due to mixed include expectations.

    ## Technical Details

    - Ignore false positives for lint
    - Only add fmt if spdlog was built with external FMT

    ## Test Plan

    - Code cov target builds successfully

    ## Test Result

    Waiting on CI

commit 91627789d86acc7dff4bf5eaafe3b774a7037f76
Author: Koji Nakajima <[email protected]>
Date:   Fri Feb 6 23:44:05 2026 -0700

    [hipblaslt] Fix memory access error with DtlPlusLdsBuf (#4303)

    ## Motivation

    Fix a memory access error with DtlPlusLdsBuf

    ## Technical Details

    - generate all GlobalRead Inc code before local read addr swap

    ## Test Plan

    Added a test case in dtl.yaml

    ## Test Result

    Confirmed that the new test fails before the change and runs without
    error after it

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 8b5a98b48c007663765865d2e14247ef1f056b01
Author: Aaron St George <[email protected]>
Date:   Fri Feb 6 23:18:13 2026 -0600

    [hipDNN] Add `FUSILLI_PLUGIN` to `EngineNames.hpp` (#4362)

    ## Motivation

    <!-- Explain the purpose of this PR and the goals it aims to achieve.
    -->

    Following the approach outlined in
    [hipdnn/docs/rfcs/0003_EngineIdDesign.md](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipdnn/docs/rfcs/0003_EngineIdDesign.md)
    this PR defines an engine ID for fusilli in
    `hipdnn/data_sdk/include/hipdnn_data_sdk/utilities/EngineNames.hpp`.

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ID + tests defined.

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    Test defined in the PR.

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    Tests pass.

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit e3a9b3f95d29ce626efc3d2134e4e924b1c591a9
Author: James Newling <[email protected]>
Date:   Fri Feb 6 21:12:55 2026 -0800

    [hipblaslt] Raise exception instead of segfaulting (#3995)

    ## Motivation

    Faster problem diagnosis on failure.

    ## Technical Details

    Throw exception if library is nullptr.
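
The pattern is to check the handle before using it and fail with a descriptive error. A minimal Python analogue follows, with `load_library` and `get_library` as hypothetical stand-ins; the real code is C++ and checks for `nullptr`.

```python
def load_library(path):
    """Hypothetical loader: returns None when the file cannot be parsed."""
    if path.endswith("invalid.yaml"):
        return None
    return {"path": path}


def get_library(path):
    library = load_library(path)
    if library is None:
        # Fail loudly here instead of crashing later on a null handle.
        raise RuntimeError("Failed to load solution library")
    return library


try:
    get_library("TensileLibrary.invalid.yaml")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```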

    ## Test

    Before:

    ```
    TensileLibrary.yaml:181:31: error: invalid boolean
        customMainLoopScheduling: 0
                                  ^
    [Lots of logging]

    Segmentation fault
    + ERR2=139
    + ERR=0
    ```

    After:

    ```
    TensileLibrary.yaml:181:31: error: invalid boolean
        customMainLoopScheduling: 0

    terminate called after throwing an instance of 'std::runtime_error'
      what():  Failed to load solution library

    + ERR2=134
    + ERR=0
    ```

commit f48eaa54f7395aa8ce4980dcc6725fe38784f7e6
Author: CMiservaAMD <[email protected]>
Date:   Fri Feb 6 21:48:52 2026 -0700

    [hipDNN] Fix a couple log messages in test plugins broken by recent merge. (#4380)

    Include correct function name in test plugin log output.

commit 5df3343ecfae6b39201995d8178fe39e061e0c40
Author: Emily Martins <[email protected]>
Date:   Fri Feb 6 17:26:57 2026 -0700

    [CK_TILE] Fix MMA concepts compiler error (#4381)

    ## Motivation

    CK Tile is required to support certain older OSs, and on these OSs C++20
    is not fully supported. For ROCm 7.2, compiler errors occur on one of
    these older OSs. An example of this error is as follows:

    ```bash
    /composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments
       34 |     { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>;
          |
    ```

    The goal of this PR is to resolve these compiler errors.

    ## Technical Details

    The existing guards around the mma concepts only check if the concepts
    language feature is supported, as follows:

    ```cpp
    #if defined(__cpp_concepts) && __cpp_concepts >= 201907L
    // ...
    template <typename CtrlFlags>
    concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) {
        // Flag members for Gfx9 MFMA instructions
        { CtrlFlags::Cbsz } -> std::convertible_to<int>;
        { CtrlFlags::Abid } -> std::convertible_to<int>;
        { CtrlFlags::Blgp } -> std::convertible_to<int>;
    };

    #endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L
    ```
    That said, in cases where functionality from the `<concepts>` header is
    used (e.g., `std::convertible_to`), this guard fails to check whether
    the `<concepts>` header is available.

    This change adds an additional check to the concepts that make use of
    functionality from the `<concepts>` header to ensure the header is
    available.

    ## Test Plan

    I tested the changes on the relevant docker for gfx90a, gfx950, and
    gfx942 and the compiler issue is not present.

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 06976b37a2f0353b80c71fb3d56bee92bb6b9bab
Author: Aviral Goel <[email protected]>
Date:   Sat Feb 7 04:14:28 2026 +0400

    Increase tolerance for FP16 GEMM tests to handle non-deterministic rounding (#4335)

    Three tests were failing intermittently with small errors (0.01-1.5%)
    due to non-deterministic FP16 accumulation order from GPU thread
    scheduling:
    - test_ck_tile_batched_gemm
    - test_ck_tile_grouped_gemm_preshuffle
    - test_ck_tile_grouped_gemm_multi_d

    These tests use kbatch=1 (no split-K), so errors are from
    order-dependent rounding, not atomics. Increased tolerances from 1e-3 to
    2e-3 (0.2%) to account for FP16 precision limits while still catching
    real bugs.
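
Order-dependent rounding can be reproduced on the host with a small model. The sketch below rounds every partial sum to half precision using the standard library's `struct` format `e`; the helper names are hypothetical, and the three-term input is a contrived example rather than data from the failing tests.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum(values):
    """Accumulate left to right, rounding to FP16 after every addition."""
    total = 0.0
    for value in values:
        total = to_fp16(total + to_fp16(value))
    return total


def rel_diff(a, b):
    return abs(a - b) / max(abs(a), abs(b))


terms = [2048.0, 1.0, 1.0]
forward = fp16_sum(terms)             # large term first
backward = fp16_sum(reversed(terms))  # small terms first
difference = rel_diff(forward, backward)
```

Adding the small terms first lets them combine before meeting the large one, so the two orders land on different FP16 values even though the inputs are identical.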

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    Co-authored-by: Illia Silin <[email protected]>

commit 07e9d561402c717946a1c08cfdce2681d5733335
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Fri Feb 6 16:10:23 2026 -0800

    [CK] add inter/intrawave scheduling concept doc (#4300)

    ## Proposed changes

    Adding information about inter/intrawave scheduling

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3660](https://github.com/ROCm/composable_kernel/pull/3660)
    🧑‍💻 Originally authored by @spolifroni-amd

    ---------

    Co-authored-by: spolifroni-amd <[email protected]>
    Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
    Co-authored-by: Illia Silin <[email protected]>
    Co-authored-by: illsilin_amdeng <[email protected]>

commit 4d773b636ca00996e971d55bcd0530f641837b42
Author: JonathanLichtnerAMD <[email protected]>
Date:   Fri Feb 6 17:09:55 2026 -0700

    Add .cline_storage to .gitignore (#4390)

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 738ffd7689ba0759f00c0e9430889b2ed995fa94
Author: Enrico Degregori <[email protected]>
Date:   Sat Feb 7 01:09:08 2026 +0100

    [CK] Workaround blockscale wp test failure (#4372)

    ## Motivation

    Workaround to fix blockscale wp test failure for pipeline v3

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 612bf0b710b399276916c222d8d4c5f9c34f9f62
Author: James Sandham <[email protected]>
Date:   Fri Feb 6 18:44:55 2026 -0500

    [rocsparse] Add bfloat16 and complex-types tests for code coverage (#4204)

    ## Motivation

    Add bfloat16 and complex-types tests for code coverage. Also renames the
    atomic_add tests to belong to pre_checkin so that they will be run as
    part of code coverage pre_checkin tests.

commit 287fbc900071d5f9f8df7efdf1cfd25d9c8ea338
Author: Kerry Wang <[email protected]>
Date:   Fri Feb 6 15:55:22 2026 -0600

    don't include waitcnts; renames

commit f52966a377bfd26725f35f103fbc7975cd9b4ec9
Author: Yiqian Liu <[email protected]>
Date:   Fri Feb 6 15:50:20 2026 -0600

    [rocRoller] Explicitly convert when typeAcc differs with typeD (#3977)

    ## Motivation

    This PR explicitly converts the data type when the Accumulator type
    differs from the type of matrix D. The purpose of this change is to make
    the rocRoller client add the same operation as hipBLASLt.

    ## Technical Details

    Added a convert operation when Accumulator is not the same type as
    matrix D.
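
    The idea can be sketched in plain C++ as follows. This is only an
    illustration under assumptions: `store_to_d` is a hypothetical helper,
    and the actual change is made inside the rocRoller client rather than on
    `std::vector` data.

    ```cpp
    #include <cassert>
    #include <type_traits>
    #include <vector>

    // Copy accumulator values into the D output, inserting an explicit
    // conversion only when the two element types differ.
    template <typename TypeD, typename TypeAcc>
    std::vector<TypeD> store_to_d(const std::vector<TypeAcc>& acc)
    {
        if constexpr(std::is_same_v<TypeD, TypeAcc>)
        {
            return acc; // same type: no conversion step is needed
        }
        else
        {
            std::vector<TypeD> d;
            d.reserve(acc.size());
            for(const TypeAcc& value : acc)
                d.push_back(static_cast<TypeD>(value)); // explicit convert
            assert(d.size() == acc.size());
            return d;
        }
    }
    ```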

    ## Test Plan

    All the existing tests should pass.
    This change should not affect the performance.

    ## Test Result

    Existing tests passed.

    ---------

    Co-authored-by: yiqialiu <[email protected]>

commit 323a8d256e2409de54fa9dfa9523f4c50295c305
Author: Kerry Wang <[email protected]>
Date:   Fri Feb 6 15:28:17 2026 -0600

    minor quality improvements

commit 6c4a1fb6d0f2eff761cd95e690f3ef0090266367
Author: Ali Yazdani <[email protected]>
Date:   Fri Feb 6 14:24:53 2026 -0700

    [Origami] AutoWgm for NonTemporal Kernels. (#4218)

    AIGESOLSEL-71

    ## Motivation

    This PR enhances the Origami workgroup mapping (WGM) selection logic to
    support nontemporal kernels and improve automatic WGM value selection.
    Previously, nontemporal cases (NTA/NTB > 3) were excluded from automatic
    workgroup mapping optimizations, limiting potential performance.

    ## Technical Details

    1. Enabling NonTemporal support in AutoWGM with an enhanced logic
    2. Improved WGM Candidate Generation

    ## Test Plan

    CI, and locally ran performance tests.

    ## Test Result

    Performance benchmarks show uplifts coming from changes.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 50e39459b541f978edd9acf645792cf496c16bea
Author: Kerry Wang <[email protected]>
Date:   Fri Feb 6 15:21:16 2026 -0600

    add string-based observer test

commit 16b3b1840d61871c116c9ae80cf59324233377d5
Author: Torre Zuk <[email protected]>
Date:   Fri Feb 6 14:19:00 2026 -0700

    [rocBLAS] Users/torrezuk/rocm 1157 amd smi rocblas (#4353)

    ## Motivation

    Deprecated dependency rocm-smi replaced by amd-smi

    ## Test Plan

    Test frequency reporting using rocblas-bench with environment variable set

commit 45b616b1e6df1b1d3816a2f03a05a0f6ab754652
Author: Illia Silin <[email protected]>
Date:   Fri Feb 6 10:17:02 2026 -0800

    [CK] fix path for build filter (#4375)

    ## Motivation

    Fix the filter that determines whether CI builds are necessary.

    ## Technical Details

    A script inspects the file list returned by git diff and checks whether
    any source code was modified. If only documentation was changed, it
    allows the builds to be skipped. We make sure we only look at the
    changes in the projects/composablekernel/ folder.
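
    The decision logic described above can be sketched as below. This is not
    the CI script itself: `build_needed` and `is_documentation` are
    hypothetical names, and the rule used to recognize documentation (file
    extension or a `docs/` directory) is an assumption made for the example.

    ```cpp
    #include <cassert>
    #include <string>
    #include <string_view>
    #include <vector>

    // True if `path` starts with `prefix`.
    inline bool starts_with(std::string_view path, std::string_view prefix)
    {
        return path.substr(0, prefix.size()) == prefix;
    }

    // True if `path` ends with `suffix`.
    inline bool ends_with(std::string_view path, std::string_view suffix)
    {
        return path.size() >= suffix.size() &&
               path.substr(path.size() - suffix.size()) == suffix;
    }

    // ASSUMPTION: documentation is recognized by extension or a docs/ folder.
    inline bool is_documentation(std::string_view path)
    {
        return ends_with(path, ".md") || ends_with(path, ".rst") ||
               path.find("/docs/") != std::string_view::npos;
    }

    // Decide whether a CI build is needed from the list of changed files.
    inline bool build_needed(const std::vector<std::string>& changed_files,
                             std::string_view project_dir = "projects/composablekernel/")
    {
        assert(!project_dir.empty());
        for(const std::string& file : changed_files)
        {
            if(!starts_with(file, project_dir))
                continue; // changes outside the project folder are ignored
            if(!is_documentation(file))
                return true; // a non-documentation file changed
        }
        return false; // nothing relevant changed, or documentation only
    }
    ```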

commit d8e2826bedff1183eaedeb6d6f5b2eeaa65dab7b
Author: Geo Min <[email protected]>
Date:   Fri Feb 6 09:59:29 2026 -0800

    [ci] Adding mi350 required group ID (#4378)

    After updating mi325 group-id, we are noticing errors for mi350.

    Tested here for mi350:
    https://github.com/ROCm/TheRock/actions/runs/21733399385/job/62692971370
    Tested here for mi325:
    https://github.com/ROCm/TheRock/actions/runs/21759203211/job/62778060417

    Adding both works properly

commit 78497b37bb4e5853b5da3feb96381c6b643556f7
Author: bibek <[email protected]>
Date:   Fri Feb 6 10:46:48 2026 -0600

    Fix intermittent kernel compilation failures in BnFwdTrainingSpatial (#4202)

    ## Motivation
    Fix kernel compilation failures in BnFwdTrainingSpatial caused by two
    related issues:

    1. Uninitialized local size variables producing random garbage values
    2. Missing compile-time guard for warp-reduction LDS arrays when
    workgroup < 64 threads

    ## To Reproduce

    ```bash
    rm -rf ~/.config/miopen/*.udb.txt ~/.cache/miopen/
    rm -rf /tmp/.config/miopen/ /tmp/.cache/miopen/
    MIOPEN_FIND_ENFORCE=SEARCH MIOPEN_LOG_LEVEL=5 ./bin/MIOpenDriver bnorm -n 1024 -c 64 -H 13 -W 13 -m 1 --forw 1 -s 1 -V 1
    ```

    ## Technical Details

    ### Bug 1: Uninitialized Variables (Host)

    Variables declared without initialization:

    ```cpp
    size_t xlocalsize, xgridsize;  // uninitialized
    ```

    For Variants 0/1/3, early returns skip initialization, leaving garbage
    values that propagate to kernel template parameters. Depending on stack
    memory state, errors include:

    - `error: array is too large (18446744073709545792 elements)`
    - `error: variable length array declaration cannot have 'static' storage
    duration`
    - `error: zero-length arrays are not permitted in HIP device code`

    ### Bug 2: Missing Compile-Time Guard (Kernel)

    The warp-reduction path divides LDS size by 64:

    ```cpp
    __shared__ FpAccumCType lcl_data_x[MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL / 64];
    ```

    When Variants 0/1/3 set workgroup size to 1×1×1, this computes `1/64 =
    0`, producing illegal zero-length arrays. Note that `if constexpr` only
    suppresses template instantiation, not parsing of ill-formed code like
    zero-length arrays.

    ## Fix

    ### Host side (`common_spatial.hpp`, `forward_spatial.cpp`)

    Initialize variables to safe defaults:

    ```cpp
    size_t xlocalsize = 1, xgridsize = 1;
    size_t ylocalsize = 1, ygridsize = 1;
    size_t zlocalsize = 1, zgridsize = 1;
    size_t nelements = 1;
    unsigned int ldsgcn = 0, ldsnogcn = 0;
    ```

    ### Kernel side (`MIOpenBatchNormFwdTrainSpatial.cpp`)

    Use C++17 constexpr ternary to ensure array size is always ≥ 1:

    ```cpp
    else
    {
        // C++17 idiomatic: ensure array size is never zero using constexpr ternary
        constexpr auto grp_final_total =
            MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL;
        constexpr auto lds_gcn_array_size = grp_final_total >= 64 ? grp_final_total / 64 : 1;

        commitID = 64;
        __shared__ FpAccumCType lcl_data_x[lds_gcn_array_size];
        __shared__ FpAccumCType lcl_data_y[lds_gcn_array_size];
        miopen::reduction::gcn_reduce2(...);
    }
    ```

    __Why this works:__

    - `constexpr` ensures compile-time evaluation (zero runtime overhead)
    - When workgroup ≥ 64: array size = `grp_final_total / 64` (correct,
    same as before)
    - When workgroup < 64: array size = 1 (valid), but this `else` branch is
    never taken due to `if constexpr` guard
    - Dead code elimination removes the unused size-1 arrays from the binary
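
    The clamp can also be checked in isolation. The sketch below is plain
    C++17 rather than HIP device code, and `lds_slots` and `ReductionScratch`
    are hypothetical names used only for this illustration:

    ```cpp
    #include <cstddef>

    // Number of per-wavefront slots for a workgroup of `workgroup_size`
    // threads, clamped so that it can always be used as an array bound.
    constexpr std::size_t lds_slots(std::size_t workgroup_size)
    {
        return workgroup_size >= 64 ? workgroup_size / 64 : 1;
    }

    static_assert(lds_slots(1) == 1, "1x1x1 workgroup must not give a zero-length array");
    static_assert(lds_slots(256) == 4, "large workgroups keep the original size / 64");

    // The clamped value is a constant expression, so it is a legal array bound.
    struct ReductionScratch
    {
        float data[lds_slots(1)];
    };
    ```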

    ## Test Plan

    - Existing batchnorm training tests pass
    - Verified no zero-length array errors with deterministic workgroup
    sizes
    - Confirmed warp-reduction path only executes when workgroup ≥ 64
    threads

commit 8f8b97a40d36cb4095e929b0ef1b71ffea7ba170
Author: SreecharanGundaboluAMD <[email protected]>
Date:   Fri Feb 6 08:18:12 2026 -0800

    [miopen] upgrade clang-format (#4194)

    This PR updates the project's code formatting tooling to use
    `clang-format-18` instead of `clang-format-12` throughout the codebase,
    as part of our transition to TheRock for CI.

    **Tooling and Configuration Updates:**

    * Updated all references to `clang-format-12` to `clang-format-18` in
    the pre-commit hook (`.githooks/pre-commit`), CMake configuration
    (`ClangCheck.cmake`), and Dockerfile (`Dockerfile`). The Dockerfile now
    also adds the appropriate LLVM 18 repository and keyring for
    installation.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit b34fa62134425a611b18c05aae687f1bc09c7d87
Author: BrianHarrisonAMD <[email protected]>
Date:   Fri Feb 6 08:57:01 2026 -0700

    [hipDNN] Remove spdlog dependency for consumers of hipdnn (#4312)

    ## Motivation

    Draft of changes to remove spdlog and fmt dependencies from hipDNN
    frontend & consumer facing SDKs.

    Consumers of hipDNN can have conflicts as a result of these
    dependencies.

    Removing these extra dependencies, and relying on the C API logging
    methods from hipDNN backend will reduce friction, and make it easier for
    adoption of hipDNN.

    ## Technical Details

    - Remove spdlog and fmt from frontend + consumer SDKs.
    - Add new string stream style formatter that will forward to backend
    logging API callback
    - Note: since we are set at C++17 standards, we cannot use std::format.
    - Add Spdlog and fmt optional dependencies as opt in for plugin_sdk
    - This enables existing plugins to maintain logging style they have in
    place, and allows plugin authors to decide what style of logging to use.
    - Unify consumption of spdlog & fmt dependencies in CMake using a
    unified method
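
    A string-stream style formatter of this kind might look like the sketch
    below. The names are hypothetical and do not reflect the hipDNN API:
    `LogCallback` stands in for a C logging callback, and `format_stream` and
    `log_message` are illustrative helpers that need only C++17.

    ```cpp
    #include <cassert>
    #include <sstream>
    #include <string>
    #include <utility>

    // HYPOTHETICAL callback type standing in for a C logging API: it receives
    // a severity level and a NUL-terminated message.
    using LogCallback = void (*)(int level, const char* message);

    // Concatenate all arguments with operator<< (C++17 fold expression).
    template <typename... Args>
    std::string format_stream(Args&&... args)
    {
        std::ostringstream stream;
        (stream << ... << std::forward<Args>(args));
        return stream.str();
    }

    // Format the message, then forward it to the callback if one is installed.
    template <typename... Args>
    void log_message(LogCallback callback, int level, Args&&... args)
    {
        if(callback == nullptr)
            return; // logging disabled: nothing to format or forward
        const std::string message = format_stream(std::forward<Args>(args)...);
        callback(level, message.c_str());
    }

    // Example: the pieces are joined in order, without separators.
    inline void format_stream_example()
    {
        assert(format_stream("tensor ", 3, "x", 4) == "tensor 3x4");
    }
    ```

    Because anything with an `operator<<` overload can be passed, callers do
    not need fmt or `std::format` to build a message.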

    ## Test Plan

    - Ensure build and tests are working for all components
    - Ensure tests with logs enabled are working properly with expected
    format
    - Ensure build of samples is working, and logging format is correct

    ## Test Result

    Build, tests, and testing with logging enabled is working locally for
    hipDNN, providers, and samples

commit 0c37fdc37ba545b2ced5211b9f59c2381fc93753
Author: amd-chunxlin <[email protected]>
Date:   Fri Feb 6 09:44:39 2026 -0600

    [rocRoller] Address long StreamK test runtimes (#4095)

    ## Motivation

    Some StreamK tests take a long time to finish; this PR addresses the
    performance issue.

    <details>
    <summary>Comparison of test runtimes</summary>

    | Test name | Develop branch | This branch |
    | --- | --- | --- |
    | GPU_BasicGEMMStreamKWorkgroupMapping/0    |  14.412s   |    4.059s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/1    |  14.447s   |    4.018s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/2    |  14.452s   |    4.026s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/3    |  14.444s   |    4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/4    |  38.872s   |    9.446s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/5    |  38.828s   |    9.445s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/6    |  38.913s   |    9.446s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/7    |  38.812s   |    9.435s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/8    |  38.878s   |    9.456s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/9    |  38.889s   |    9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/10   |  38.884s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/11   |  38.859s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/12   |  14.414s  |     4.035s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/13   |  14.429s  |     4.024s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/14   |  14.47s   |     4.045s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/15   |  14.428s  |     4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/16   |  38.854s  |     9.454s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/17   |  38.861s  |     9.448s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/18   |  38.95s   |     9.461s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/19   |  38.826s  |     9.458s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/20   |  38.833s  |     9.48s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/21   |  38.888s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/22   |  38.868s  |     9.477s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/23   |  38.907s  |     9.485s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/24   |  14.426s  |     4.025s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/25   |  14.435s  |     4.051s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/26   |  14.472s  |     4.053s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/27   |  14.471s  |     4.058s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/28   |  38.879s  |     9.457s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/29   |  38.814s  |     9.445s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/30   |  38.853s  |     9.45s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/31   |  38.963s  |     9.458s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/32   |  38.924s  |     9.466s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/33   |  38.898s  |     9.482s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/34   |  38.951s  |     9.455s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/35   |  38.924s  |     9.459s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/36   |  14.461s  |     4.037s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/37   |  14.452s  |     4.032s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/38   |  14.43s   |     4.053s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/39   |  14.43s   |     4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/40   |  38.868s  |     9.473s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/41   |  38.925s  |     9.461s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/42   |  38.884s  |     9.452s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/43   |  38.925s  |     9.455s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/44   |  39.012s  |     9.476s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/45   |  38.915s  |     9.479s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/46   |  38.933s  |     9.457s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/47   |  38.936s  |     9.469s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/48   |  14.461s  |     4.041s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/49   |  14.468s  |     4.049s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/50   |  14.466s  |     4.046s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/51   |  14.479s  |     4.038s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/52   |  38.907s  |     9.473s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/53   |  38.914s  |     9.471s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/54   |  38.885s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/55   |  38.891s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/56   |  38.859s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/57   |  38.899s  |     9.475s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/58   |  38.936s  |     9.47s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/59   |  38.952s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/60   |  14.485s  |     4.036s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/61   |  14.419s  |     4.031s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/62   |  14.455s  |     4.035s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/63   |  14.488s  |     4.045s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/64   |  38.846s  |     9.462s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/65   |  38.923s  |     9.475s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/66   |  38.884s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/67   |  38.93s   |     9.441s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/68   |  38.9s    |     9.473s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/69   |  38.914s  |     9.477s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/70   |  38.961s  |     9.487s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/71   |  38.93s   |     9.489s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/72   |  14.432s  |     4.051s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/73   |  14.442s  |     4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/74   |  14.44s   |     4.036s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/75   |  14.456s  |     4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/76   |  38.891s  |     9.486s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/77   |  38.938s  |     9.444s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/78   |  38.919s  |     9.448s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/79   |  38.858s  |     9.455s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/80   |  38.878s  |     9.491s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/81   |  38.975s  |     9.478s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/82   |  38.945s  |     9.496s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/83   |  38.905s  |     9.477s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/84   |  14.433s  |     4.033s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/85   |  14.41s   |     4.038s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/86   |  14.478s  |     4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/87   |  14.473s  |     4.035s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/88   |  38.927s  |     9.49s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/89   |  38.938s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/90   |  38.913s  |     9.454s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/91   |  38.835s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/92   |  38.901s  |     9.494s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/93   |  38.864s  |     9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/94   |  38.839s  |     9.508s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/95   |  38.921s  |     9.487s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/96   |  14.445s  |     4.05s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/97   |  14.432s  |     4.041s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/98   |  14.472s  |     4.055s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/99   |  14.449s  |     4.039s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/100  |  38.87s  |      9.485s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/101  |  38.866s |      9.474s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/102  |  38.836s |      9.471s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/103  |  38.847s |      9.464s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/104  |  38.985s |      9.468s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/105  |  38.928s |      9.469s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/106  |  38.866s |      9.496s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/107  |  38.92s  |      9.48s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/108  |  14.452s |      4.046s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/109  |  14.448s |      4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/110  |  14.469s |      4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/111  |  14.457s |      4.052s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/112  |  38.895s |      9.474s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/113  |  38.901s |      9.453s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/114  |  38.882s |      9.467s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/115  |  38.861s |      9.446s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/116  |  38.945s |      9.498s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/117  |  38.94s  |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/118  |  38.881s |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/119  |  38.911s |      9.486s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/120  |  14.434s |      4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/121  |  14.455s |      4.061s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/122  |  14.464s |      4.067s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/123  |  14.488s |      4.058s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/124  |  38.932s |      9.499s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/125  |  38.868s |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/126  |  38.863s |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/127  |  38.899s |      9.467s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/128  |  38.948s |      9.511s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/129  |  38.915s |      9.502s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/130  |  38.932s |      9.508s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/131  |  38.914s |      9.491s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/132  |  14.42s  |      4.031s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/133  |  14.464s |      4.054s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/134  |  14.494s |      4.048s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/135  |  14.48s  |      4.046s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/136  |  38.911s |      9.49s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/137  |  38.91s  |      9.478s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/138  |  38.914s |      9.478s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/139  |  38.92s  |      9.475s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/140  |  38.883s |      9.508s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/141  |  38.935s |      9.5s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/142  |  38.848s |      9.494s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/143  |  38.932s |      9.502s|

    </details>

    ## Technical Details

    - Short-circuit expression comparison
    - Remove unused code
    - Caching expressions of kernel arguments to eliminate redundant
    regeneration.
    - Change `AssemblyKernelArgument` to a `class`
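
    The first and third items can be pictured with the generic sketch below.
    It makes no claim about the rocRoller types: `Expr`, `identical`, and
    `CachedArgument` are hypothetical stand-ins that only show the two
    techniques, cheap checks before a deep comparison and building an
    expression once and reusing it.

    ```cpp
    #include <cassert>
    #include <memory>
    #include <string>
    #include <utility>

    // HYPOTHETICAL expression node: a tiny binary tree.
    struct Expr
    {
        std::string op; // operator or leaf name
        std::shared_ptr<const Expr> lhs;
        std::shared_ptr<const Expr> rhs;
    };
    using ExprPtr = std::shared_ptr<const Expr>;

    // Structural equality with cheap checks first: pointer identity and the
    // node's own operator are tested before recursing into the subtrees.
    inline bool identical(const ExprPtr& a, const ExprPtr& b)
    {
        if(a == b)
            return true; // same node (or both null): skip the deep walk
        if(!a || !b || a->op != b->op)
            return false; // cheap mismatch: skip the deep walk
        return identical(a->lhs, b->lhs) && identical(a->rhs, b->rhs);
    }

    // Argument whose expression is built on first use and then reused.
    class CachedArgument
    {
    public:
        explicit CachedArgument(std::string name) : m_name(std::move(name)) {}

        const ExprPtr& expression() const
        {
            if(!m_cached)
            {
                ++m_builds; // count how often the expression is generated
                m_cached = std::make_shared<Expr>(Expr{m_name, nullptr, nullptr});
            }
            assert(m_cached != nullptr);
            return m_cached;
        }

        int builds() const { return m_builds; }

    private:
        std::string m_name;
        mutable ExprPtr m_cached;
        mutable int m_builds = 0;
    };
    ```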

    ## Test Plan

    No functional changes; covered by existing tests.

    ## Test Result

    See CI report

commit 66e22ac6c6c0c286325cabf2b1faa269ea640446
Author: hcman2 <[email protected]>
Date:   Fri Feb 6 10:55:08 2026 +0800

    [formocast] [tensilelite] enable tuning with formocast (#4043)

    ## Motivation
    Integration plan for Formocast and Origami.

    Steps:
    1. Push Formocast code to the origami subfolder. (we are here now)
    2. Submit tuning code calling the API of origami.
    3. Push and reserve origami prediction mode. Add APIs to pass sizemapping
    data via config_t. Use an environment variable to switch modes without
    affecting other code.
    4. Enable predictionThreshold with tox tests.
    5. Push Origami code with Formocast backend. This step will enable the
    bench with different modes.
    6. Refine Formocast and Origami to verify API usage and identify
    functions to move.

    This PR covers steps 2 and 4.

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Henry Ho <[email protected]>
    Co-authored-by: Peter Cheng <[email protected]>

commit 37a74ef54eaa1bb1df603db6ec8aff22f342bc71
Author: Illia Silin <[email protected]>
Date:   Thu Feb 5 17:06:57 2026 -0800

    [CK]  a bunch of CI fixes. (#4361)

    ## Motivation

    Fixing some of the CK CI issues

    ## Technical Details

    fixing paths to dockerfiles and scripts;
    moving codegen tests to separate stage (collides with main build since
    you must call cmake from same folder but different options);
    fixing a couple of clang compilation issues with staging compiler;

commit 808e9496d17be0826164dbb34457f54e157dd2bd
Author: Kerry Wang <[email protected]>
Date:   Thu Feb 5 17:00:01 2026 -0600

    refactor observers to have new runtime with context concept

commit 3c9beb38b8dba1301a961cc5dc3f44ca9d4185e3
Author: Eiden Yoshida <[email protected]>
Date:   Thu Feb 5 17:56:12 2026 -0500

    [CK] MICI: Fix git diff in selective_test_filter.py (#4352)

    ## Motivation

    - git diff needs access to reference repo

    ## Technical Details

    - mount reference repo path into docker for selective_test_filter.py to
    access

    ## Test Plan

    - tested in MICI

    ## Test Result

    - launch_tests.sh ran successfully

commit 1663ac026d46c3dd02edb73d2bfa7310c54695d5
Author: Torre Zuk <[email protected]>
Date:   Thu Feb 5 14:20:45 2026 -0700

    [rocBLAS] trsm doc & test;  trsv change noted in log (#4198)

    ## Motivation

    Tests trsm use of new trsv kernel for big batches

    Adds changelog note on trsv which used to call hipGetDevice and
    potentially hipSetDevice

commit d8bb9d2b9fe278d07fe63b395d87268c94e53fd8
Author: Jeffrey Novotny <[email protected]>
Date:   Thu Feb 5 14:54:48 2026 -0500

    [rocsolver] Doxygen API cleanup part 2 (#4330)

    ## Motivation

    <!-- Explain the purpose of this PR and the goals it aims to achieve.
    -->
    Continuing with rocSOLVER API/Doxygen copy edits and polishing

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->
    Edit Doxygen comments in header file.

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->
    Build locally.

    ## Test Result

    <!-- Briefly summarize test outcomes. -->
    NA

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 5aa1f1d4c189f779ea699be250fb1b284f3d6ac2
Author: Geo Min <[email protected]>
Date:   Thu Feb 5 11:01:53 2026 -0800

    [ci] Updating variable group-id for OSSCI (#4360)

    OSSCI migrated mi325s, so need a new groupID

    Sanity works here:
    https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659665907
    normal run works here:
    https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659791422

    I've dabbled with organization variables, however, this does not work
    for forks so for now, we will do the manual update

commit 6273d3b30b32d6e0856394d37a421153dceb33c1
Author: Dmitrii Polomin <[email protected]>
Date:   Thu Feb 5 19:53:45 2026 +0100

    [MIOpen] Ported solver test to gtest (#3713)

    ## Motivation

    Porting tests from CTest to GTest, in this case, `solver.cpp`

    ## Technical Details

    Pretty straightforward port, although I had to get creative in order to
    conform to `INSTANTIATE_TEST_SUITE_P` pattern and naming conventions

    ## Test Plan

    Running locally, using the CI launched by this PR

    ## Test Result

    See CI actions launched by this PR

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit ece0c48dd152856c3b443c12c58a89bc7d7c34e5
Author: Nathan Henderson <[email protected]>
Date:   Thu Feb 5 09:03:42 2026 -0700

    [rocroller] Use StreamKMode in hipBLASLt integration and client (#4028)

    ## Motivation

    RocRoller has a `StreamKMode` enum that is used internally and in the
    GEMM tests. However, the client and hipBLASLt integration still used
    boolean values to represent the StreamK state. This PR replaces the
    individual boolean flags (`--streamK`, `--streamKTwoTile`,
    `--streamKTwoTileDPFirst`) with a single `--streamK` string option that
    accepts one of the `StreamKMode` values (`None`, `Standard`, `TwoTile`,
    or `TwoTileDPFirst`).

    ## Technical Details

    - Update the rocRoller GEMM client to use `StreamKMode` enum instead of
    three separate booleans
    - Update the hipBLASLt rocRoller integration to match
    - Add `enumStrings<T>()` utility function in `Utils_impl.hpp` for CLI
    validation of enum values
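
    A sketch of how a single string option can be validated against the enum
    follows. The mode names come from the description above, but the name
    table and the `parse_stream_k_mode` and `to_string` helpers are
    hypothetical; they are not the `enumStrings<T>()` utility added by this
    PR.

    ```cpp
    #include <array>
    #include <cassert>
    #include <optional>
    #include <string_view>
    #include <utility>

    // Mode names as listed in the commit message.
    enum class StreamKMode
    {
        None,
        Standard,
        TwoTile,
        TwoTileDPFirst
    };

    // HYPOTHETICAL name table; the real client may derive this differently.
    inline constexpr std::array<std::pair<std::string_view, StreamKMode>, 4> kStreamKModes{{
        {"None", StreamKMode::None},
        {"Standard", StreamKMode::Standard},
        {"TwoTile", StreamKMode::TwoTile},
        {"TwoTileDPFirst", StreamKMode::TwoTileDPFirst},
    }};

    // Map a --streamK argument to a mode; std::nullopt means "reject it".
    inline std::optional<StreamKMode> parse_stream_k_mode(std::string_view text)
    {
        for(const auto& [name, mode] : kStreamKModes)
            if(name == text)
                return mode;
        return std::nullopt;
    }

    // Inverse mapping, useful for error messages and round-trip checks.
    inline std::string_view to_string(StreamKMode mode)
    {
        for(const auto& [name, value] : kStreamKModes)
            if(value == mode)
                return name;
        assert(false && "every StreamKMode must appear in kStreamKModes");
        return {};
    }
    ```

    Compared with three independent booleans, a single enum-valued option
    cannot express contradictory combinations, and an unknown string can be
    rejected at parse time.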

    ## Test Plan

    Update `test_gemm_client.py` YAML fixtures to use the new `streamK:
    None` format.

    ## Test Result

    Validated by the StreamK rrperf tests

commit 3b98c98a23e76075a6a1e4e580482a627e39d59b
Author: Jobbins <[email protected]>
Date:   Thu Feb 5 08:56:42 2026 -0700

    [composablekernel] fix failure status (#4351)

    ## Motivation

    Pipelines were failing on Math CI status check.

    ## Technical Details

    For the success case, we just changed the config in Jenkins to use a
    proper app token and no code changes were required. However, the failure
    case would not have worked as coded, so we needed to move that outside
    of the `rocmnode()` block.

    ## Test Plan

    I removed all of the CI in one of the commits to test quickly, and then
    added it back. Both the "success" and "failure" status messages were
    produced as expected.

commit 9bb7f5c31253643cd72363314c3d3ee02f723406
Author: Eiden Yoshida <[email protected]>
Date:   Thu Feb 5 10:55:44 2026 -0500

    [CK] MICI: Correct path for build trace script (#4349)

    ## Motivation

    - Corrects path to script due to superrepo migration
    - Forces all tests to run by default

    ## Technical Details

    - now in /projects/composablekernel

    ---------

    Co-authored-by: illsilin_amdeng <[email protected]>

commit 120f91dd211117e308b3713593ac7f061cc02c08
Author: bibek <[email protected]>
Date:   Thu Feb 5 09:47:16 2026 -0600

    [HIPDNN][DOC] Add TYPED_TEST guidance for multi-datatype tests (#4000)

    ## Motivation

    Doc update : add `TYPED_TEST` guidance for multi-datatype tests

    Update `.clinerules`, `.cursor/rules/testing.mdc`, and `docs/Testing.md`
    to recommend `TYPED_TEST` for tests covering `float`/`half`/`bfloat16`.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Samuel Reeder <[email protected]>

commit d26a7820b58f789d19efd6064d4c2c4f4fc72a95
Author: Eiden Yoshida <[email protected]>
Date:   Wed Feb 4 21:43:22 2026 -0500

    [CK] MICI: Use reference repo for checkout operations (#4336)

    ## Motivation

    - Maintain a reference repo on slave nodes that speeds up any
    clone/checkout operations

    ## Technical Details

    - clone a ref repo if it does not exist
    - update ref repo if it does exist
    - checkout after ref repo is updated
    - eliminates double clone

    ## Test Result

    - Initial checkouts succeeded

commit f2f187ab40738272232f571f58112697da405b1a
Author: Geo Min <[email protected]>
Date:   Wed Feb 4 15:43:38 2026 -0800

    [ci] Fixing rocm-libs race condition (#4192)

    Currently, there is a race condition that overwrites BLAS libraries
    during MIOpen/hipdnn builds. (error:
    https://github.com/ROCm/rocm-libraries/actions/runs/21228188053/job/61080555083)

    Tested locally:
    ```
    # With all three
    geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
    [{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_MIOPEN=ON -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel', 'project_to_test': 'miopen_plugin,miopen,hipdnn'}]

    # Only hipdnn
    geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
    [{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON', 'project_to_test': 'hipdnn,miopen_plugin'}]
    ```

    This fixes the error: if all libraries are run, they are combined into
    one job (saves resources + no overwriting)

commit f34aec25c434b3044b75481d70693af3bf0ade1e
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Wed Feb 4 18:25:31 2026 -0500

    [CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263)

    Implement per-page K/V quantization for paged attention:
      - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
      - Use exp2 shift trick to eliminate explicit P scaling overhead
      - Prefetch physical page offsets for the KV cache, overlapping with
        computations

    ## Checklist

    Please put an `x` into the boxes that apply. You can also fill these out
    after creating the PR. If you're not sure, please don't hesitate to ask.

    - [ ] I have added tests relevant to the introduced functionality, and
    the unit tests are passing locally
    - [ ] I have added the test to REGRESSION_TESTS list defined at the top
    of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
    than 30 seconds to run.
    - [ ] I have added inline documentation which enables the maintainers
    with understanding the motivation
    - [ ] I have removed the stale documentation which is no longer relevant
    after this pull request
    - [ ] (If this change is user-facing) I have added release notes which
    provide the end users with a brief summary of the improvement from this
    pull request
    - [ ] I have run `clang-format` on all changed files
    - [ ] Any dependent changes have been merged

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
    🧑‍💻 Originally authored by @Jeff-Huang

    ---------

    Co-authored-by: Jeff Huang <[email protected]>
    Co-authored-by: Illia Silin <[email protected]>

commit df32df51ea6a9ebfba9a459c77ed82c4877df22b
Author: Yiqian Liu <[email protected]>
Date:   Wed Feb 4 17:12:11 2026 -0600

    [rocRoller] [hipblaslt] Enable more workgroup tile sizes for pre-swizzled scale data (#4175)

    ## Motivation

    Pre-swizzle is an optimization that pre-swizzles the scale data to match
    the layout that the kernel expects. The purpose of this PR is to add more
    workgroup tile sizes that support pre-swizzled scale data.

    ## Technical Details

    1. Configure the workgroup tile size to 256 in the K dimension when the
    input data format is pre-swizzled.
    2. Filter out invalid workgroup tile sizes (i.e., the MN dimension is not
    a multiple of 32, or the MN dimension is 96).
    3. Configure the solution parameters for pre-swizzled input.
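
    The filtering rule in item 2 can be sketched as a small predicate. This is
    a minimal sketch under one reading of the rule (each of M and N must be a
    positive multiple of 32 and must not be 96, with K fixed at 256 for
    pre-swizzled input). `WorkgroupTile`, `isValidMN`,
    `isValidPreSwizzledTile`, and `filterTiles` are hypothetical names used
    for illustration, not rocRoller or hipBLASLt APIs.

    ```cpp
    #include <vector>

    // Hypothetical tile descriptor, for illustration only.
    struct WorkgroupTile
    {
        int m;
        int n;
        int k;
    };

    // Assumed reading of the rule: a dimension is usable when it is a
    // positive multiple of 32 and is not 96.
    constexpr bool isValidMN(int dim)
    {
        return dim > 0 && dim % 32 == 0 && dim != 96;
    }

    // Pre-swizzled scale data is assumed to require K == 256 (item 1).
    constexpr bool isValidPreSwizzledTile(const WorkgroupTile& t)
    {
        return t.k == 256 && isValidMN(t.m) && isValidMN(t.n);
    }

    // Keep only the candidate tiles that pass the predicate above.
    inline std::vector<WorkgroupTile> filterTiles(const std::vector<WorkgroupTile>& candidates)
    {
        std::vector<WorkgroupTile> valid;
        for(const auto& t : candidates)
            if(isValidPreSwizzledTile(t))
                valid.push_back(t);
        return valid;
    }
    ```

    Under these assumptions a 32x32x256 tile (the size used by the new client
    test) is accepted, while a 96x32x256 tile is rejected.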

    ## Test Plan

    1. Added a rocRoller client test that uses 32x32x256 workgroup tile.

    ## Test Result

    1. This PR should not change any kernel whose data is not
    pre-swizzled.
    2. All tests should pass with no performance changes.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Yiqian Liu <[email protected]>
    Co-authored-by: yiqialiu <[email protected]>

commit 87d1a8fa005ef2f75e48e5c9c4e70f8235236b03
Author: Samuel Reeder <[email protected]>
Date:   Wed Feb 4 16:10:01 2026 -0700

    Use `--latest-release` flag for installing rocm in clang-tidy (#4120)

    ## Motivation

    `--latest-release` was added in TheRock
    [2997](https://github.com/ROCm/TheRock/pull/2997) to grab latest nightly
    for the specified target. We can use this to keep the clang-tidy
    workflow somewhat up-to-date.

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit c5…
WorldofKerry added a commit to ROCm/rocm-libraries that referenced this pull request Feb 10, 2026
commit 8c40fb6cac48969d6237cccdcbbbad56b44ff0a3
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 20:27:33 2026 -0500

    more consistent skip text

commit 557e2764b3a001884a004f0a183a307c4fbc2bd2
Merge: 6bd6e49791 219f365e7b
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 18:38:48 2026 -0600

    Merge remote-tracking branch 'origin/develop' into users/kerrwang/lds-queue

commit 6bd6e497910e3ba681b22a47630bc5f0dedb16b8
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 18:38:31 2026 -0600

    fix format

commit 219f365e7bc40c9ce3f5c382228a7b2e14b90520
Author: James Sandham <[email protected]>
Date:   Mon Feb 9 19:08:04 2026 -0500

    [hipsparse] Match behaviour of csr2csr_compress from rocsparse (#4420)

    ## Motivation

    In the hipSPARSE test code host solution, we were incorrectly checking
    if a value satisfied:

    `testing_abs(csr_val_A[j]) > testing_real(tol) &&
    testing_abs(csr_val_A[j]) > std::numeric_limits<float>::min()`

    instead of the correct criteria:

    `testing_abs(csr_val_A[j]) > testing_real(tol)`
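
    A host-side sketch of the corrected criterion follows. It is an
    illustration only: `Csr` and `compressCsr` are hypothetical names, the
    matrix is assumed to be zero-based, and the real hipSPARSE test helpers
    (`testing_abs`, `testing_real`) are replaced by `std::fabs` on `float`.
    Because `std::numeric_limits<float>::min()` is the smallest positive
    normal value, the extra clause in the old check also dropped nonzero
    subnormal entries whenever `tol` was smaller than that value.

    ```cpp
    #include <cmath>
    #include <vector>

    // Hypothetical zero-based CSR container, for illustration only.
    struct Csr
    {
        std::vector<int>   row_ptr;
        std::vector<int>   col_ind;
        std::vector<float> val;
    };

    // Keep an entry only when |value| > tol (the single corrected criterion).
    inline Csr compressCsr(const Csr& a, float tol)
    {
        Csr c;
        c.row_ptr.push_back(0);
        const int rows = static_cast<int>(a.row_ptr.size()) - 1;
        for(int i = 0; i < rows; ++i)
        {
            for(int j = a.row_ptr[i]; j < a.row_ptr[i + 1]; ++j)
            {
                if(std::fabs(a.val[j]) > tol)
                {
                    c.col_ind.push_back(a.col_ind[j]);
                    c.val.push_back(a.val[j]);
                }
            }
            // Row i of the compressed matrix ends at the current entry count.
            c.row_ptr.push_back(static_cast<int>(c.val.size()));
        }
        return c;
    }
    ```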

commit 698d5d09184a24fde32ab7309fcd88410fc7ff8e
Author: amd-hsong <[email protected]>
Date:   Mon Feb 9 16:40:07 2026 -0700

    [rocprim] Fix a call to intrinsics in test_device_reduce_by_key (#4391)

    ## Motivation

    Fix a call to __clzll in test_device_reduce_by_key

    ## Technical Details

    There are a couple of issues in the call to __clzll:
    - the argument is cast to `long long`: it should be cast to `unsigned
    long long` instead
    - in rocprim there exists a wrapper for clz, so for better portability
    rocprim::clz should be used instead.
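
    A count-leading-zeros operation is defined on a bit pattern, so an
    unsigned operand type states the intent directly. The portable host-side
    sketch below shows the operation on an unsigned 64-bit value. `clz64` is
    a hypothetical helper for illustration; it is not `rocprim::clz`, whose
    signature is not shown in this commit.

    ```cpp
    #include <cstdint>

    // Hypothetical helper: counts the leading zero bits of a 64-bit pattern.
    // Returns 64 when x == 0.
    constexpr int clz64(std::uint64_t x)
    {
        int count = 0;
        for(std::uint64_t mask = std::uint64_t{1} << 63; mask != 0 && (x & mask) == 0; mask >>= 1)
            ++count;
        return count;
    }
    ```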

    ## Test Plan

    Run test_device_reduce_by_key to verify the test runs correctly.

    ## Test Result

    The test passes.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 557f5baa6d68bb5a8126d9730a8d48983778aac3
Author: Kerry Wang <[email protected]>
Date:   Mon Feb 9 16:53:41 2026 -0600

    skip on non-gfx950

commit 8b72bc8759d9c11dfcbf410182fa332152b97e69
Author: bnemanich <[email protected]>
Date:   Mon Feb 9 16:26:51 2026 -0500

    [hipBLASLt] Enable custom MXFP4 kernels (#4384)

    ## Motivation

    Allow hipBLASLt to call custom MX FP4 kernels for higher performance.

    ## Technical Details

    A single kernel was added in this PR. The kernel was originally from:
    https://github.com/ROCm/aiter/tree/main/hsa/gfx950/f4gemm.
    This kernel used a slightly different shuffled scaling layout than
    rocRoller. hipBLASLt will only support this new shuffled layout, plus
    the original non-shuffled layout. All rocRoller kernels will be disabled
    when using shuffled scales for now. Once rocRoller supports the new
    layout, they will be added back in.

    This PR also adds some new MX datatype generation patterns that were
    useful during debugging.

    New custom kernels can be added to the custom_kernels directory. They
    will also need to be added in the customer_kernels.cpp file that was
    added in this PR.

    ## Test Plan

    Check that performance improved when using MXFP4 GEMMs with shuffled
    scales.

    ## Test Result

    Performance improved by about 17%.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Andrew Whittle <[email protected]>
    Co-authored-by: Bryant Nelson <[email protected]>

commit 61f9f906dcc0a9d4f6c327fea713aebc6d4b0a1d
Author: Bartłomiej Kocot <[email protected]>
Date:   Mon Feb 9 22:08:57 2026 +0100

    [CK] CK Tile grouped convolution direct load (#4406)

    ## Motivation

    CK Tile grouped convolution forward direct load support.

    ## Technical Details

    Basic pipeline for direct load and new instances for forward for v1 and
    v4 pipelines.

    ## Test Plan

    test_grouped_convnd_fwd_tile

    ## Test Result

    CI pending

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
    AICK-130

commit b7f136734ad26314386ca2b4f5a99467804f1bb7
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Mon Feb 9 20:58:57 2026 +0000

    Enable group mode (varlen) kernel generation for PyTorch integration (#4292)

    ## Proposed changes

    This PR enables group mode (variable-length attention) kernel generation
    for PyTorch's CK SDPA backend.

    ## Checklist

    Please put an `x` into the boxes that apply. You can also fill these out
    after creating the PR. If you're not sure, please don't hesitate to ask.

    - [X] I have added tests relevant to the introduced functionality, and
    the unit tests are passing locally
    - [ ] I have added the test to REGRESSION_TESTS list defined at the top
    of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
    than 30 seconds to run.
    - [ ] I have added inline documentation which enables the maintainers
    with understanding the motivation
    - [ ] I have removed the stale documentation which is no longer relevant
    after this pull request
    - [ ] (If this change is user-facing) I have added release notes which
    provide the end users with a brief summary of the improvement from this
    pull request
    - [X] I have run `clang-format` on all changed files
    - [ ] Any dependent changes have been merged

    ## Discussion

    The change is minimal (single line deletion) but enables a significant
    feature: variable-length attention support for ROCm users via PyTorch's
    torch.nn.attention.varlen API.

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553)
    🧑‍💻 Originally authored by @chinmaydk99

    Co-authored-by: Chinmay_Kuchinad <[email protected]>

commit f48a5e63edb7102996b0b769e76114c0bbfd35cf
Author: Mihnea Chirila <[email protected]>
Date:   Mon Feb 9 14:55:12 2026 -0600

    [Tensilelite] Added MIArchVgpr support for Complex Datatypes. (#4332)

    ## Motivation

    Added MIArchVgpr support for Complex Datatypes.

    ## Technical Details

    Fixed AlphaTmpVgpr initialization, and rocisa register offset bug:

    - Updated condition to initialize AlphaTmpVgpr if MIArchVgpr parameter
    is enabled. Required to generate `MulMIOutAlphaToArch` code
    (https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L251)
    regardless of postGSU Accumulation scheme.
    - Fixed an underlying `Holder` struct bug: offsets passed as strings are
    now correctly passed on to `RegisterContainer`. Required to update the
    imaginary register for C/ZGEMM.
    (https://github.com/ROCm/rocm-libraries/blob/c20a85b6c458ef44c1f0e30c35b286a0395fb8fa/projects/hipblaslt/tensilelite/Tensile/KernelWriterModules.py#L288)

    ## Test Plan

    Tested for C & Z with MIArchVgpr: [0, 1] on gfx942 and gfx950

    ## Test Result

    Success

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 3de83b9b4035006b5ddd825df9404edc43ca9b39
Author: CMiservaAMD <[email protected]>
Date:   Mon Feb 9 13:42:54 2026 -0700

    [hipDNN] Add integration tests for frontend configuration knobs APIs.  (#4307)

    Add integration tests to verify correct operation of new hipDNN frontend API
    functions for managing engine config settings.

commit 2752a8a5105e11929b876ce0e343bcc73a9cf308
Author: DarylHawkinsAMD <[email protected]>
Date:   Mon Feb 9 13:08:30 2026 -0700

    [MIOpen] First set of kernels using CK Builder end to end (#4123)

commit e55f37bad667987f74989bc95e08f86603438963
Author: Mitchell Ousdahl <[email protected]>
Date:   Mon Feb 9 10:05:26 2026 -0800

    Modified test plugin rpaths (#4350)

    ## Motivation
    In order to successfully get hipDNN added to the python ROCm wheels, the
    RPATHs on Linux for the test plugins need to be updated to make them
    portable. We will leverage TheRock's existing RPATH update mechanism to
    do this.

    ## Technical Details
    - Update all test plugin RPATHs

    ## Test Plan
    - Build ROCm
    - Build the wheels
    - Use the "Test ROCm Wheel" workflow, which verifies that the test
    plugins can load and find their dependencies.

    ## Test Result
    - [ ] "Test ROCm Wheel" workflow succeeds

    ## Submission Checklist
    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit a7485411874b0650b31068364f6b1155d9890212
Author: Muhammad Osama <[email protected]>
Date:   Mon Feb 9 09:30:52 2026 -0800

    [Origami] Skip test-selector if torch not found. (#4359)

    ## Motivation

    Makes `torch` completely optional by skipping dependent tests if it is
    not found.

    ## Technical Details

    ```
    # Skip entire module if torch is not available (selector requires torch)
    torch = pytest.importorskip("torch", reason="torch is required for OrigamiMatmulSelector tests.")
    ```

    ## Test Plan

    Run tests using CI + TheRock build.

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 0c5cd629a94a454a350eb651b5921baeb1c82546
Author: Swati Rawat <[email protected]>
Date:   Mon Feb 9 22:51:13 2026 +0530

    Update Tensile CHANGELOG.md (#4164)

    ## Motivation

    <!-- Explain the purpose of this PR and the goals it aims to achieve.
    -->

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 331512e9e13e197d8d7fdf7b72f5b60eb63d7d1e
Author: Bartłomiej Kocot <[email protected]>
Date:   Mon Feb 9 16:36:52 2026 +0100

    [CK] Fix grouped conv fwd transform for merged groups (#4399)

    ## Motivation

    [CK] Fix grouped conv fwd transform for merged groups for 1d and 3d.

    ## Technical Details

    After the optimizations for 2D, the implementation for 1D and 3D is
    missing.

    ## Test Plan

    test_grouped_convnd_fwd

    ## Test Result

    pending CI

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 1c2927530e176c63cf814b44eb8147e89d2bcaf7
Author: Eiden Yoshida <[email protected]>
Date:   Mon Feb 9 10:23:47 2026 -0500

    [CK] MICI: Disable failure pattern checking (#4373)

    ## Motivation

    - ck mici jobs hanging at end, possibly at failure pattern checking

    ## Technical Details

    - Disable failure pattern checking to see if hanging goes away

    ## Test Plan

    - Observe behavior after merge

commit a3058d1dc0b3f176f56fbecd040c2fc48c7258ad
Author: COrruDXC <[email protected]>
Date:   Mon Feb 9 14:02:05 2026 +0100

    Reduce boost usage by replacing time calls (#3875)

    ## Motivation

    Reduce boost usage by replacing time calls.

    ## Technical Details

    Replace boost::posix_time related data types with the corresponding
    std::chrono data types.
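
    As an illustration of this kind of replacement, the sketch below measures
    elapsed time with `std::chrono` only. It is a minimal example and not the
    code changed by this PR: `timeIt` is a hypothetical helper, and which
    boost::posix_time types the PR actually touched is not stated here.

    ```cpp
    #include <chrono>

    // Hypothetical helper: runs f() once and returns the elapsed wall time.
    // std::chrono::steady_clock is used because it is monotonic, which suits
    // interval measurement.
    template <typename F>
    std::chrono::microseconds timeIt(F&& f)
    {
        const auto start = std::chrono::steady_clock::now();
        f();
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    }
    ```

    Code that needs calendar time rather than an interval would typically use
    `std::chrono::system_clock` instead.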

    ## Submission Checklist

    - [X] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 959bd9393ad9a578711334c40948ac1321e41c1f
Author: Yi-Yao (Alex), Wang <[email protected]>
Date:   Mon Feb 9 17:15:47 2026 +0800

    Update gfx942/gfx950 BBS/HHS/I8I8S SPB/SPA logic yaml (#4365)

    ## Motivation

    - Update BBS/HHS/I8I8S SPB/SPA logic yaml for gfx942/gfx950

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    - Run local hipsparselt-test
    - Run local tests for all matrix sizes using hipsparselt-bench

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: chiahlin <[email protected]>

commit ad03e58dabbf2bbc348c031a06ec73011d85d2c3
Author: Chuck Wu <[email protected]>
Date:   Mon Feb 9 13:04:38 2026 +0800

    [hipblaslt] Fix memory leaks & uninitialized value use (#4338)

    ## Motivation

    According to the
    [ROCM-1835](https://amd-hub.atlassian.net/browse/ROCM-1835?focusedCommentId=109304&sourceType=mention),
    there are some memory leaks and uses of uninitialized values during the
    gtest run.
    <img width="450" height="367" alt="image"
    src="https://github.com/user-attachments/assets/2345e1f8-6062-4a5a-b294-97042709b18e"
    />

    ## Technical Details

    1. Add the code to call the corresponding destroy functions for the data
    that has not been released yet.
    2. Add the default value to compute_input_typeA/B

    Flow (before this commit) ->
    a. hipblasLtMatmulDescCreate:
    - compute_input_typeA = ???
    - compute_input_typeB = ???

    b. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_A, HIP_R_16F, ...)
    - compute_input_typeA = HIP_R_16F
    - compute_input_typeB = ???
    - call _matmul_desc_determine_compute_type()
      - Read  compute_input_typeA & compute_input_typeB

    c. hipblasLtMatmulDescSetAttribute(COMPUTE_INPUT_TYPE_B, HIP_R_16F, ...)
    - compute_input_typeA = HIP_R_16F
    - compute_input_typeB = HIP_R_16F
    - call _matmul_desc_determine_compute_type()
      - Read  compute_input_typeA & compute_input_typeB
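
    The flow above can be sketched with default member initializers, so that
    the read in step b is well defined even though only one field has been
    set. This is a minimal sketch, not hipBLASLt code: `DataType`,
    `MatmulDesc`, `determineComputeType`, and the setters are hypothetical
    stand-ins, and the `Unset` default is an assumption (the actual default
    value chosen by the PR is not stated here).

    ```cpp
    // Hypothetical stand-in for the data type enum.
    enum class DataType
    {
        Unset,
        R_16F,
        R_32F
    };

    // Hypothetical descriptor: both fields have defaults, so reading either
    // one before it is explicitly set is well defined.
    struct MatmulDesc
    {
        DataType compute_input_typeA = DataType::Unset;
        DataType compute_input_typeB = DataType::Unset;
        bool     mixed_input         = false;
    };

    // Reads both fields, as in steps b and c of the flow above.
    inline void determineComputeType(MatmulDesc& d)
    {
        d.mixed_input = d.compute_input_typeA != DataType::Unset
                        && d.compute_input_typeB != DataType::Unset
                        && d.compute_input_typeA != d.compute_input_typeB;
    }

    inline void setComputeInputTypeA(MatmulDesc& d, DataType t)
    {
        d.compute_input_typeA = t;
        determineComputeType(d);
    }

    inline void setComputeInputTypeB(MatmulDesc& d, DataType t)
    {
        d.compute_input_typeB = t;
        determineComputeType(d);
    }
    ```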

    ## Test Plan

    1. Test command: `valgrind --leak-check=full ./hipblaslt-test
    --gtest_filter=_/aux_test.*`
    2. Before this commit:

    Uninitialized value being used
    <img width="691" height="81" alt="image"
    src="https://github.com/user-attachments/assets/22a897f1-c25e-4608-850e-0c6bcb5ad0a3"
    />
    Memory leaks
    <img width="708" height="78" alt="image"
    src="https://github.com/user-attachments/assets/02268893-a29a-4db4-95aa-c93385371d5a"
    />

    ## Test Result

    1. The Valgrind errors shown above no longer appear.
    2. gtest all pass on Navi3.
    <img width="952" height="92" alt="image"
    src="https://github.com/user-attachments/assets/6e1b9b01-afc4-4a51-8a9f-e0196fc8495a"
    />

    ## Submission Checklist

    - [X] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    [ROCM-1835]:
    https://amd-hub.atlassian.net/browse/ROCM-1835?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

commit b7de1e14cea70681a23cd1a136df42910c776e4a
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Mon Feb 9 11:54:54 2026 +0800

    [CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)

    ## Proposed changes

    gemm blockscale eightwarps support

    ## Checklist

    Please put an `x` into the boxes that apply. You can also fill these out
    after creating the PR. If you're not sure, please don't hesitate to ask.

    - [ ] I have added tests relevant to the introduced functionality, and
    the unit tests are passing locally
    - [ ] I have added the test to REGRESSION_TESTS list defined at the top
    of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
    than 30 seconds to run.
    - [ ] I have added inline documentation which enables the maintainers
    with understanding the motivation
    - [ ] I have removed the stale documentation which is no longer relevant
    after this pull request
    - [ ] (If this change is user-facing) I have added release notes which
    provide the end users with a brief summary of the improvement from this
    pull request
    - [x] I have run `clang-format` on all changed files
    - [x] Any dependent changes have been merged

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3650](https://github.com/ROCm/composable_kernel/pull/3650)
    🧑‍💻 Originally authored by @kensclin

    ---------

    Co-authored-by: KenSCLin <[email protected]>
    Co-authored-by: Ding, Yi <[email protected]>
    Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
    Co-authored-by: Thomas Ning <[email protected]>

commit 774cfc6410ed55802691fef19a34449182878be5
Author: Ethan <[email protected]>
Date:   Mon Feb 9 11:39:20 2026 +0800

    [hipblaslt] do some debug operations only in debug setting

    ## Motivation

    Single solution selection time has increased slightly.

    ## Technical Details

    Many "assign matchingTag" operations can be avoided when not in debug
    mode (printProperty), but I kept the Equal assignment since it was
    there before #2757.

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit ff3e9821bbe2d14b9874e61ebb518bbbce621ac3
Author: jakpiase <[email protected]>
Date:   Sun Feb 8 20:57:14 2026 +0100

    [CK_TILE] Add support and tests for V6 pipeline in conv fwd (#4357)

    Added support for the conv v6 pipeline in CK Tile's convolution forward
    kernel. The CK Tile v6 pipeline is equivalent to old CK's V5 pipeline
    and should be faster than other pipelines in some cases. This PR also
    adds tests to the profiler, which currently lives in the experimental
    directory, so regressions should now be easier to detect.

    ---------

    Co-authored-by: Illia Silin <[email protected]>
    Co-authored-by: subhajitdchow <[email protected]>

commit 591f50450241d6b1965f9f6ee3fe2526ef71ab8d
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Sun Feb 8 12:34:59 2026 +0100

    [CK] Add fwd conv group merging to v3 conv instances (#4273)

    ## Proposed changes

    Added conv group merging to the (universal) V3 fwd conv pipeline. The
    new instance improves fwd conv performance when the number of
    input/output channels per group is low.

    On MI300 (`gfx942`) we get

    | CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
    |:-----|:------:|------:|
    | grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 | 3.86035 | 8.36796 |
    | grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 | 10.1867 | 13.4677 |
    | grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 | 11.7875 | 16.3657 |
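
    For reference, the ratios implied by the three rows above are roughly
    2.17x, 1.32x, and 1.39x. The arithmetic is simply the new throughput
    divided by the baseline; `speedup` below is a hypothetical helper used
    only to show the calculation.

    ```cpp
    // Hypothetical helper: relative speedup from two TFLOPS measurements.
    constexpr double speedup(double baselineTflops, double newTflops)
    {
        return newTflops / baselineTflops;
    }

    // Values copied from the three table rows.
    constexpr double row1 = speedup(3.86035, 8.36796);
    constexpr double row2 = speedup(10.1867, 13.4677);
    constexpr double row3 = speedup(11.7875, 16.3657);
    ```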

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3675](https://github.com/ROCm/composable_kernel/pull/3675)
    🧑‍💻 Originally authored by @vpietila-amd

    ---------

    Co-authored-by: Ville Pietilä <>
    Co-authored-by: Ville Pietilä <[email protected]>
    Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
    Co-authored-by: Illia Silin <[email protected]>
    Co-authored-by: Bartlomiej Kocot <[email protected]>

commit cad7fa2c1849b0863ed52ef6cd47198e421d5b6e
Author: BrianHarrisonAMD <[email protected]>
Date:   Fri Feb 6 23:48:54 2026 -0700

    [hipDNN] Fix codecov target breaks (#4374)

    ## Motivation

    code_cov stage for hipDNN was breaking due to environment differences.
    Newer clang tooling flags false positives on added files.
    Environments that are missing spdlog, but have fmt present were causing
    issues due to mixed include expectations.

    ## Technical Details

    - Ignore false positives for lint
    - Only add fmt if spdlog was built with external FMT

    ## Test Plan

    - Code cov target builds successfully

    ## Test Result

    Waiting on CI

commit 91627789d86acc7dff4bf5eaafe3b774a7037f76
Author: Koji Nakajima <[email protected]>
Date:   Fri Feb 6 23:44:05 2026 -0700

    [hipblaslt] Fix memory access error with DtlPlusLdsBuf (#4303)

    ## Motivation

    Fix a memory access error with DtlPlusLdsBuf

    ## Technical Details

    - generate all GlobalRead Inc code before local read addr swap

    ## Test Plan

    Added a test case in dtl.yaml

    ## Test Result

    Confirmed that the new test fails before the change and runs without
    error after the change

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 8b5a98b48c007663765865d2e14247ef1f056b01
Author: Aaron St George <[email protected]>
Date:   Fri Feb 6 23:18:13 2026 -0600

    [hipDNN] Add `FUSILLI_PLUGIN` to `EngineNames.hpp` (#4362)

    ## Motivation

    Following the approach outlined in
    [hipdnn/docs/rfcs/0003_EngineIdDesign.md](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipdnn/docs/rfcs/0003_EngineIdDesign.md)
    this PR defines an engine ID for fusilli in
    `hipdnn/data_sdk/include/hipdnn_data_sdk/utilities/EngineNames.hpp`.

    ## Technical Details

    ID + tests defined.

    ## Test Plan

    Test defined in the PR.

    ## Test Result

    Tests pass.

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit e3a9b3f95d29ce626efc3d2134e4e924b1c591a9
Author: James Newling <[email protected]>
Date:   Fri Feb 6 21:12:55 2026 -0800

    [hipblaslt] Raise exception instead of segfaulting (#3995)

    ## Motivation

    Faster problem diagnosis on failure.

    ## Technical Details

    Throw exception if library is nullptr.

    ## Test

    Before:

    ```
    TensileLibrary.yaml:181:31: error: invalid boolean
        customMainLoopScheduling: 0
                                  ^
    [Lots of logging]

    Segmentation fault
    + ERR2=139
    + ERR=0
    ```

    After:

    ```
    TensileLibrary.yaml:181:31: error: invalid boolean
        customMainLoopScheduling: 0

    terminate called after throwing an instance of 'std::runtime_error'
      what():  Failed to load solution library

    + ERR2=134
    + ERR=0
    ```
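
    The change can be sketched as a null check that throws before the pointer
    is used. This is a minimal sketch, not hipBLASLt code: `SolutionLibrary`,
    `loadLibrary`, and `loadLibraryChecked` are hypothetical names, and only
    the exception type and message are taken from the log output above.

    ```cpp
    #include <memory>
    #include <stdexcept>

    // Hypothetical library type, for illustration only.
    struct SolutionLibrary
    {
        int solutionCount = 0;
    };

    // Hypothetical loader that returns nullptr when parsing fails.
    inline std::shared_ptr<SolutionLibrary> loadLibrary(bool parseOk)
    {
        return parseOk ? std::make_shared<SolutionLibrary>() : nullptr;
    }

    // Throw instead of letting a later dereference of nullptr crash.
    inline std::shared_ptr<SolutionLibrary> loadLibraryChecked(bool parseOk)
    {
        auto library = loadLibrary(parseOk);
        if(!library)
            throw std::runtime_error("Failed to load solution library");
        return library;
    }
    ```

    If the exception is left uncaught, the process terminates through
    `std::terminate` and reports the `what()` message, as in the "After" log.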

commit f48eaa54f7395aa8ce4980dcc6725fe38784f7e6
Author: CMiservaAMD <[email protected]>
Date:   Fri Feb 6 21:48:52 2026 -0700

    [hipDNN] Fix a couple log messages in test plugins broken by recent merge. (#4380)

    Include correct function name in test plugin log output.

commit 5df3343ecfae6b39201995d8178fe39e061e0c40
Author: Emily Martins <[email protected]>
Date:   Fri Feb 6 17:26:57 2026 -0700

    [CK_TILE] Fix MMA concepts compiler error (#4381)

    ## Motivation

    CK Tile is required to support certain older OSs; on these OSs, C++20
    is not fully supported. For ROCm 7.2, compiler errors occur on one of
    these older OSs. An example of this error is as follows:

    ```bash
    /composable_kernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp:34:28: error: expected concept name with optional arguments
       34 |     { MmaOp::kAMBlock } -> std::convertible_to<unsigned int>;
          |
    ```

    The goal of this PR is to resolve these compiler errors.

    ## Technical Details

    The existing guards around the mma concepts only check if the concepts
    language feature is supported, as follows:

    ```cpp
    #if defined(__cpp_concepts) && __cpp_concepts >= 201907L
    // ...
    template <typename CtrlFlags>
    concept CtrlFlagsGfx9I = requires(CtrlFlags ctrlFlags) {
        // Flag members for Gfx9 MFMA instructions
        { CtrlFlags::Cbsz } -> std::convertible_to<int>;
        { CtrlFlags::Abid } -> std::convertible_to<int>;
        { CtrlFlags::Blgp } -> std::convertible_to<int>;
    };

    #endif // defined(__cpp_concepts) && __cpp_concepts >= 201907L
    ```
    That said, in cases where functionality from the `<concepts>` header is
    used (e.g., `std::convertible_to`), this guard fails to check whether
    the `<concepts>` header is available.

    This change adds an additional check to the concepts that make use of
    functionality from the `<concepts>` header to ensure the header is
    available.
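
    One way to express such a combined guard is sketched below. It is an
    assumption, not the PR's actual code: the exact check added by the PR is
    not shown here. The sketch uses the standard library feature-test macro
    `__cpp_lib_concepts` alongside the language macro `__cpp_concepts`;
    `EXAMPLE_HAS_STD_CONCEPTS`, `HasCbsz`, and `ExampleFlags` are hypothetical
    names.

    ```cpp
    // <version> defines library feature-test macros such as
    // __cpp_lib_concepts; include it only when it exists.
    #if defined(__has_include)
    #if __has_include(<version>)
    #include <version>
    #endif
    #endif

    // Require both the language feature and the <concepts> library support.
    #if defined(__cpp_concepts) && __cpp_concepts >= 201907L \
        && defined(__cpp_lib_concepts) && __cpp_lib_concepts >= 201907L
    #include <concepts>
    #define EXAMPLE_HAS_STD_CONCEPTS 1

    template <typename CtrlFlags>
    concept HasCbsz = requires {
        { CtrlFlags::Cbsz } -> std::convertible_to<int>;
    };
    #else
    // Fallback: the concept is simply not defined on older toolchains.
    #define EXAMPLE_HAS_STD_CONCEPTS 0
    #endif

    // Hypothetical flags type used to exercise the concept when available.
    struct ExampleFlags
    {
        static constexpr int Cbsz = 0;
    };

    #if EXAMPLE_HAS_STD_CONCEPTS
    static_assert(HasCbsz<ExampleFlags>, "ExampleFlags should satisfy HasCbsz");
    #endif
    ```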

    ## Test Plan

    I tested the changes on the relevant docker for gfx90a, gfx950, and
    gfx942 and the compiler issue is not present.

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 06976b37a2f0353b80c71fb3d56bee92bb6b9bab
Author: Aviral Goel <[email protected]>
Date:   Sat Feb 7 04:14:28 2026 +0400

    Increase tolerance for FP16 GEMM tests to handle non-deterministic ro… (#4335)

    …unding

    Three tests were failing intermittently with small errors (0.01-1.5%)
    due to non-deterministic FP16 accumulation order from GPU thread
    scheduling:
    - test_ck_tile_batched_gemm
    - test_ck_tile_grouped_gemm_preshuffle
    - test_ck_tile_grouped_gemm_multi_d

    These tests use kbatch=1 (no split-K), so errors are from
    order-dependent rounding, not atomics. Increased tolerances from 1e-3 to
    2e-3 (0.2%) to account for FP16 precision limits while still catching
    real bugs.
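
    The effect can be illustrated on the host: floating-point addition is not
    associative, so summing the same values in a different order may round
    differently, and a relative tolerance absorbs that difference. This is a
    minimal sketch using `float` rather than FP16; `sumForward`, `sumReverse`,
    and `withinRelTol` are hypothetical helpers, and how the CK tests compute
    their error metric is not shown here.

    ```cpp
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Accumulate in float from the first element to the last.
    inline float sumForward(const std::vector<float>& v)
    {
        float acc = 0.0f;
        for(std::size_t i = 0; i < v.size(); ++i)
            acc += v[i];
        return acc;
    }

    // Accumulate in float from the last element to the first.
    inline float sumReverse(const std::vector<float>& v)
    {
        float acc = 0.0f;
        for(std::size_t i = v.size(); i > 0; --i)
            acc += v[i - 1];
        return acc;
    }

    // Relative comparison: |value - reference| <= rtol * |reference|.
    inline bool withinRelTol(double reference, double value, double rtol)
    {
        return std::fabs(value - reference) <= rtol * std::fabs(reference);
    }
    ```

    With this check, a 0.15% deviation fails at a tolerance of 1e-3 but passes
    at 2e-3.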

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    Co-authored-by: Illia Silin <[email protected]>

commit 07e9d561402c717946a1c08cfdce2681d5733335
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Fri Feb 6 16:10:23 2026 -0800

    [CK] add inter/intrawave scheduling concept doc (#4300)

    ## Proposed changes

    Adding information about inter/intrawave scheduling

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3660](https://github.com/ROCm/composable_kernel/pull/3660)
    🧑‍💻 Originally authored by @spolifroni-amd

    ---------

    Co-authored-by: spolifroni-amd <[email protected]>
    Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
    Co-authored-by: Illia Silin <[email protected]>
    Co-authored-by: illsilin_amdeng <[email protected]>

commit 4d773b636ca00996e971d55bcd0530f641837b42
Author: JonathanLichtnerAMD <[email protected]>
Date:   Fri Feb 6 17:09:55 2026 -0700

    Add .cline_storage to .gitignore (#4390)

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 738ffd7689ba0759f00c0e9430889b2ed995fa94
Author: Enrico Degregori <[email protected]>
Date:   Sat Feb 7 01:09:08 2026 +0100

    [CK] Workaround blockscale wp test failure (#4372)

    ## Motivation

    Workaround to fix blockscale wp test failure for pipeline v3

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 612bf0b710b399276916c222d8d4c5f9c34f9f62
Author: James Sandham <[email protected]>
Date:   Fri Feb 6 18:44:55 2026 -0500

    [rocsparse] Add bfloat16 and complex-types tests for code coverage (#4204)

    ## Motivation

    Add bfloat16 and complex-types tests for code coverage. Also renames the
    atomic_add tests to belong to pre_checkin so that they will be run as
    part of code coverage pre_checkin tests.

commit 287fbc900071d5f9f8df7efdf1cfd25d9c8ea338
Author: Kerry Wang <[email protected]>
Date:   Fri Feb 6 15:55:22 2026 -0600

    don't include waitcnts; renames

commit f52966a377bfd26725f35f103fbc7975cd9b4ec9
Author: Yiqian Liu <[email protected]>
Date:   Fri Feb 6 15:50:20 2026 -0600

    [rocRoller] Explicitly convert when typeAcc differs with typeD (#3977)

    ## Motivation

    This PR explicitly converts the data type when the accumulator type
    differs from the type of matrix D. The purpose of this change is to make
    the rocRoller client perform the same operation as hipBLASLt.

    ## Technical Details

    Added a convert operation when Accumulator is not the same type as
    matrix D.
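
    As a host-side analogy only (rocRoller emits this as an operation in
    generated kernel code, which is not shown here), the idea can be sketched
    as follows. `store_result` is a hypothetical name, not a rocRoller API.

    ```cpp
    #include <cassert>
    #include <type_traits>

    // Hypothetical epilogue: store an accumulator value into the output type D.
    // When the types differ, the conversion is written out explicitly instead
    // of relying on an implicit conversion at the assignment.
    template <typename D, typename Acc>
    D store_result(Acc acc)
    {
        if constexpr(std::is_same_v<D, Acc>)
            return acc; // same type: no convert step needed
        else
            return static_cast<D>(acc); // explicit convert step
    }

    // The result type is always D, whatever the accumulator type is.
    static_assert(std::is_same_v<decltype(store_result<float>(1.0)), float>,
                  "double accumulator stored as float");
    ```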

    ## Test Plan

    All the existing tests should pass.
    This change should not affect the performance.

    ## Test Result

    Existing tests passed.

    ---------

    Co-authored-by: yiqialiu <[email protected]>

commit 323a8d256e2409de54fa9dfa9523f4c50295c305
Author: Kerry Wang <[email protected]>
Date:   Fri Feb 6 15:28:17 2026 -0600

    minor quality improvements

commit 6c4a1fb6d0f2eff761cd95e690f3ef0090266367
Author: Ali Yazdani <[email protected]>
Date:   Fri Feb 6 14:24:53 2026 -0700

    [Origami] AutoWgm for NonTemporal Kernels. (#4218)

    AIGESOLSEL-71

    ## Motivation

    This PR enhances the Origami workgroup mapping (WGM) selection logic to
    support nontemporal kernels and improve automatic WGM value selection.
    Previously, nontemporal cases (NTA/NTB > 3) were excluded from automatic
    workgroup mapping optimizations, limiting potential performance.

    ## Technical Details

    1. Enabling NonTemporal support in AutoWGM with an enhanced logic
    2. Improved WGM Candidate Generation

    ## Test Plan

    CI, and locally ran performance tests.

    ## Test Result

    Performance benchmarks show uplifts coming from changes.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 50e39459b541f978edd9acf645792cf496c16bea
Author: Kerry Wang <[email protected]>
Date:   Fri Feb 6 15:21:16 2026 -0600

    add string-based observer test

commit 16b3b1840d61871c116c9ae80cf59324233377d5
Author: Torre Zuk <[email protected]>
Date:   Fri Feb 6 14:19:00 2026 -0700

    [rocBLAS] Users/torrezuk/rocm 1157 amd smi rocblas (#4353)

    ## Motivation

    Deprecated dependency rocm-smi replaced by amd-smi

    ## Test Plan

    Test frequency reporting using rocblas-bench with environment variable set

commit 45b616b1e6df1b1d3816a2f03a05a0f6ab754652
Author: Illia Silin <[email protected]>
Date:   Fri Feb 6 10:17:02 2026 -0800

    [CK] fix path for build filter (#4375)

    ## Motivation

    Fix the filter that determines whether CI builds are necessary.

    ## Technical Details

    A script inspects the file list returned by `git diff` and checks whether
    any source code was modified. If only documentation changed, it allows
    the builds to be skipped. We make sure we only look at changes in the
    projects/composablekernel/ folder.
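
    The decision logic can be sketched as below. This is an illustration under
    stated assumptions, not the actual CI script: `build_needed` and its
    helpers are hypothetical, and "documentation" is assumed to mean `*.md`
    and `*.rst` files or anything under a `docs/` directory of the project.

    ```cpp
    #include <cassert>
    #include <string>
    #include <vector>

    // True when `s` begins with `prefix`.
    inline bool starts_with(const std::string& s, const std::string& prefix)
    {
        return s.compare(0, prefix.size(), prefix) == 0;
    }

    // True when `s` ends with `suffix`.
    inline bool ends_with(const std::string& s, const std::string& suffix)
    {
        return s.size() >= suffix.size() &&
               s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
    }

    // Decide whether a build is needed from the paths that a command such as
    // `git diff --name-only` reports.
    inline bool build_needed(const std::vector<std::string>& changed_files)
    {
        const std::string project = "projects/composablekernel/";
        for(const auto& path : changed_files)
        {
            if(!starts_with(path, project))
                continue; // changes outside the project are ignored
            const bool is_doc = starts_with(path, project + "docs/") ||
                                ends_with(path, ".md") || ends_with(path, ".rst");
            if(!is_doc)
                return true; // a non-documentation change inside the project
        }
        return false; // nothing relevant changed, or documentation only
    }
    ```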

commit d8e2826bedff1183eaedeb6d6f5b2eeaa65dab7b
Author: Geo Min <[email protected]>
Date:   Fri Feb 6 09:59:29 2026 -0800

    [ci] Adding mi350 required group ID (#4378)

    After updating mi325 group-id, we are noticing errors for mi350.

    Tested here for mi350:
    https://github.com/ROCm/TheRock/actions/runs/21733399385/job/62692971370
    Tested here for mi325:
    https://github.com/ROCm/TheRock/actions/runs/21759203211/job/62778060417

    With both group IDs added, both runs work properly.

commit 78497b37bb4e5853b5da3feb96381c6b643556f7
Author: bibek <[email protected]>
Date:   Fri Feb 6 10:46:48 2026 -0600

    Fix intermittent kernel compilation failures in BnFwdTrainingSpatial (#4202)

    ## Motivation
    Fix kernel compilation failures in BnFwdTrainingSpatial caused by two
    related issues:

    1. Uninitialized local size variables producing random garbage values
    2. Missing compile-time guard for warp-reduction LDS arrays when
    workgroup < 64 threads

    ## To Reproduce

    ```bash
    rm -rf ~/.config/miopen/*.udb.txt ~/.cache/miopen/
    rm -rf /tmp/.config/miopen/ /tmp/.cache/miopen/
    MIOPEN_FIND_ENFORCE=SEARCH MIOPEN_LOG_LEVEL=5 ./bin/MIOpenDriver bnorm -n 1024 -c 64 -H 13 -W 13 -m 1 --forw 1 -s 1 -V 1
    ```

    ## Technical Details

    ### Bug 1: Uninitialized Variables (Host)

    Variables declared without initialization:

    ```cpp
    size_t xlocalsize, xgridsize;  // uninitialized
    ```

    For Variants 0/1/3, early returns skip initialization, leaving garbage
    values that propagate to kernel template parameters. Depending on stack
    memory state, errors include:

    - `error: array is too large (18446744073709545792 elements)`
    - `error: variable length array declaration cannot have 'static' storage
    duration`
    - `error: zero-length arrays are not permitted in HIP device code`

    ### Bug 2: Missing Compile-Time Guard (Kernel)

    The warp-reduction path divides LDS size by 64:

    ```cpp
    __shared__ FpAccumCType lcl_data_x[MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL / 64];
    ```

    When Variants 0/1/3 set workgroup size to 1×1×1, this computes `1/64 =
    0`, producing illegal zero-length arrays. Note that `if constexpr` only
    suppresses template instantiation, not parsing of ill-formed code like
    zero-length arrays.

    ## Fix

    ### Host side (`common_spatial.hpp`, `forward_spatial.cpp`)

    Initialize variables to safe defaults:

    ```cpp
    size_t xlocalsize = 1, xgridsize = 1;
    size_t ylocalsize = 1, ygridsize = 1;
    size_t zlocalsize = 1, zgridsize = 1;
    size_t nelements = 1;
    unsigned int ldsgcn = 0, ldsnogcn = 0;
    ```

    ### Kernel side (`MIOpenBatchNormFwdTrainSpatial.cpp`)

    Use C++17 constexpr ternary to ensure array size is always ≥ 1:

    ```cpp
    else
    {
        // C++17 idiomatic: ensure array size is never zero using constexpr ternary
        constexpr auto grp_final_total =
            MIO_BN_GRP0_FINAL * MIO_BN_GRP1_FINAL * MIO_BN_GRP2_FINAL;
        constexpr auto lds_gcn_array_size = grp_final_total >= 64 ? grp_final_total / 64 : 1;

        commitID = 64;
        __shared__ FpAccumCType lcl_data_x[lds_gcn_array_size];
        __shared__ FpAccumCType lcl_data_y[lds_gcn_array_size];
        miopen::reduction::gcn_reduce2(...);
    }
    ```

    __Why this works:__

    - `constexpr` ensures compile-time evaluation (zero runtime overhead)
    - When workgroup ≥ 64: array size = `grp_final_total / 64` (correct,
    same as before)
    - When workgroup < 64: array size = 1 (valid), but this `else` branch is
    never taken due to `if constexpr` guard
    - Dead code elimination removes the unused size-1 arrays from the binary
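
    The clamp can also be checked in isolation on the host. The following is a
    minimal sketch with hypothetical names (`lds_array_size`, `ReduceScratch`);
    it is not the kernel code, only the same compile-time arithmetic.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Number of per-wavefront partial results for `threads` threads, clamped
    // so that the value is always usable as an array bound.
    constexpr std::size_t lds_array_size(std::size_t threads)
    {
        return threads >= 64 ? threads / 64 : 1;
    }

    // Hypothetical stand-in for a kernel body parameterized on the workgroup
    // dimensions.
    template <std::size_t G0, std::size_t G1, std::size_t G2>
    struct ReduceScratch
    {
        static constexpr std::size_t size = lds_array_size(G0 * G1 * G2);
        float data[size]; // never zero-length, even for a 1x1x1 workgroup
    };

    static_assert(1 / 64 == 0, "the unclamped size for 1x1x1 would be zero");
    static_assert(ReduceScratch<1, 1, 1>::size == 1, "clamped to one element");
    static_assert(ReduceScratch<256, 1, 1>::size == 4, "unchanged at >= 64 threads");
    ```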

    ## Test Plan

    - Existing batchnorm training tests pass
    - Verified no zero-length array errors with deterministic workgroup
    sizes
    - Confirmed warp-reduction path only executes when workgroup ≥ 64
    threads

commit 8f8b97a40d36cb4095e929b0ef1b71ffea7ba170
Author: SreecharanGundaboluAMD <[email protected]>
Date:   Fri Feb 6 08:18:12 2026 -0800

    [miopen] upgrade clang-format (#4194)

    This PR updates the project's code formatting tooling to use
    `clang-format-18` instead of `clang-format-12` throughout the codebase,
    as part of the transition to TheRock for our CI.

    **Tooling and Configuration Updates:**

    * Updated all references to `clang-format-12` to `clang-format-18` in
    the pre-commit hook (`.githooks/pre-commit`), CMake configuration
    (`ClangCheck.cmake`), and Dockerfile (`Dockerfile`). The Dockerfile now
    also adds the appropriate LLVM 18 repository and keyring for
    installation.
    [[1]](diffhunk://#diff-1436c8126d575a7576d98d0bc8a8c6d27e8eb4e2d7241d61fe64c286c0d7365cL7-R7)
    [[2]](diffhunk://#diff-fc024f0d7573d33039081dab6b12f76f0f34c8e07e014552daa1bed9a276a548L9-R9)
    [[3]](diffhunk://#diff-32304f8a254e46fb8ff524cf4c488eb6013ab54a89ca62709cfb20ccf58976f9R54-R61)

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit b34fa62134425a611b18c05aae687f1bc09c7d87
Author: BrianHarrisonAMD <[email protected]>
Date:   Fri Feb 6 08:57:01 2026 -0700

    [hipDNN] Remove spdlog dependency for consumers of hipdnn (#4312)

    ## Motivation

    Draft of changes to remove spdlog and fmt dependencies from hipDNN
    frontend & consumer facing SDKs.

    Consumers of hipDNN can have conflicts as a result of these
    dependencies.

    Removing these extra dependencies, and relying on the C API logging
    methods from hipDNN backend will reduce friction, and make it easier for
    adoption of hipDNN.

    ## Technical Details

    - Remove spdlog and fmt from frontend + consumer SDKs.
    - Add new string stream style formatter that will forward to backend
    logging API callback
    - Note: since we are set at C++17 standards, we cannot use std::format.
    - Add Spdlog and fmt optional dependencies as opt in for plugin_sdk
    - This enables existing plugins to maintain logging style they have in
    place, and allows plugin authors to decide what style of logging to use.
    - Unify consumption of spdlog & fmt dependencies in CMake using
    unified methods
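
    A string-stream style formatter of the kind listed above might look like
    the sketch below. All names (`LogCallback`, `concat`, `log_message`) are
    hypothetical and are not the hipDNN API; the callback type only stands in
    for a C-style backend logging hook.

    ```cpp
    #include <cassert>
    #include <sstream>
    #include <string>
    #include <utility>

    // Hypothetical C-style logging callback.
    using LogCallback = void (*)(int severity, const char* message);

    // Stream every argument into one string with a C++17 fold expression,
    // so neither fmt nor std::format is required.
    template <typename... Args>
    std::string concat(Args&&... args)
    {
        std::ostringstream oss;
        (oss << ... << std::forward<Args>(args));
        return oss.str();
    }

    // Format on the caller's side, then hand a plain C string to the callback.
    template <typename... Args>
    void log_message(LogCallback callback, int severity, Args&&... args)
    {
        if(callback == nullptr)
            return; // logging disabled
        const std::string text = concat(std::forward<Args>(args)...);
        callback(severity, text.c_str());
    }
    ```

    A call such as `log_message(cb, 2, "tile=", 64, "x", 32)` would deliver
    the text `tile=64x32` to `cb`.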

    ## Test Plan

    - Ensure build and tests are working for all components
    - Ensure tests with logs enabled are working properly with expected
    format
    - Ensure build of samples is working, and logging format is correct

    ## Test Result

    Build, tests, and testing with logging enabled is working locally for
    hipDNN, providers, and samples

commit 0c37fdc37ba545b2ced5211b9f59c2381fc93753
Author: amd-chunxlin <[email protected]>
Date:   Fri Feb 6 09:44:39 2026 -0600

    [rocRoller] Address long StreamK test runtimes (#4095)

    ## Motivation

    Some StreamK tests take a long time to finish; this PR addresses the
    performance issue.

    <details>
    <summary>Comparison of test runtimes</summary>

    | Test name | Develop branch | This branch |
    | --- | --- | --- |
    | GPU_BasicGEMMStreamKWorkgroupMapping/0    |  14.412s   |    4.059s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/1    |  14.447s   |    4.018s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/2    |  14.452s   |    4.026s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/3    |  14.444s   |    4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/4    |  38.872s   |    9.446s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/5    |  38.828s   |    9.445s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/6    |  38.913s   |    9.446s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/7    |  38.812s   |    9.435s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/8    |  38.878s   |    9.456s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/9    |  38.889s   |    9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/10   |  38.884s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/11   |  38.859s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/12   |  14.414s  |     4.035s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/13   |  14.429s  |     4.024s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/14   |  14.47s   |     4.045s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/15   |  14.428s  |     4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/16   |  38.854s  |     9.454s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/17   |  38.861s  |     9.448s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/18   |  38.95s   |     9.461s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/19   |  38.826s  |     9.458s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/20   |  38.833s  |     9.48s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/21   |  38.888s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/22   |  38.868s  |     9.477s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/23   |  38.907s  |     9.485s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/24   |  14.426s  |     4.025s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/25   |  14.435s  |     4.051s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/26   |  14.472s  |     4.053s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/27   |  14.471s  |     4.058s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/28   |  38.879s  |     9.457s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/29   |  38.814s  |     9.445s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/30   |  38.853s  |     9.45s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/31   |  38.963s  |     9.458s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/32   |  38.924s  |     9.466s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/33   |  38.898s  |     9.482s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/34   |  38.951s  |     9.455s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/35   |  38.924s  |     9.459s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/36   |  14.461s  |     4.037s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/37   |  14.452s  |     4.032s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/38   |  14.43s   |     4.053s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/39   |  14.43s   |     4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/40   |  38.868s  |     9.473s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/41   |  38.925s  |     9.461s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/42   |  38.884s  |     9.452s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/43   |  38.925s  |     9.455s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/44   |  39.012s  |     9.476s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/45   |  38.915s  |     9.479s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/46   |  38.933s  |     9.457s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/47   |  38.936s  |     9.469s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/48   |  14.461s  |     4.041s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/49   |  14.468s  |     4.049s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/50   |  14.466s  |     4.046s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/51   |  14.479s  |     4.038s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/52   |  38.907s  |     9.473s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/53   |  38.914s  |     9.471s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/54   |  38.885s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/55   |  38.891s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/56   |  38.859s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/57   |  38.899s  |     9.475s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/58   |  38.936s  |     9.47s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/59   |  38.952s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/60   |  14.485s  |     4.036s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/61   |  14.419s  |     4.031s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/62   |  14.455s  |     4.035s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/63   |  14.488s  |     4.045s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/64   |  38.846s  |     9.462s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/65   |  38.923s  |     9.475s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/66   |  38.884s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/67   |  38.93s   |     9.441s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/68   |  38.9s    |     9.473s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/69   |  38.914s  |     9.477s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/70   |  38.961s  |     9.487s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/71   |  38.93s   |     9.489s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/72   |  14.432s  |     4.051s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/73   |  14.442s  |     4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/74   |  14.44s   |     4.036s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/75   |  14.456s  |     4.042s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/76   |  38.891s  |     9.486s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/77   |  38.938s  |     9.444s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/78   |  38.919s  |     9.448s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/79   |  38.858s  |     9.455s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/80   |  38.878s  |     9.491s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/81   |  38.975s  |     9.478s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/82   |  38.945s  |     9.496s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/83   |  38.905s  |     9.477s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/84   |  14.433s  |     4.033s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/85   |  14.41s   |     4.038s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/86   |  14.478s  |     4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/87   |  14.473s  |     4.035s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/88   |  38.927s  |     9.49s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/89   |  38.938s  |     9.472s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/90   |  38.913s  |     9.454s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/91   |  38.835s  |     9.463s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/92   |  38.901s  |     9.494s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/93   |  38.864s  |     9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/94   |  38.839s  |     9.508s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/95   |  38.921s  |     9.487s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/96   |  14.445s  |     4.05s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/97   |  14.432s  |     4.041s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/98   |  14.472s  |     4.055s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/99   |  14.449s  |     4.039s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/100  |  38.87s  |      9.485s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/101  |  38.866s |      9.474s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/102  |  38.836s |      9.471s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/103  |  38.847s |      9.464s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/104  |  38.985s |      9.468s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/105  |  38.928s |      9.469s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/106  |  38.866s |      9.496s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/107  |  38.92s  |      9.48s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/108  |  14.452s |      4.046s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/109  |  14.448s |      4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/110  |  14.469s |      4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/111  |  14.457s |      4.052s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/112  |  38.895s |      9.474s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/113  |  38.901s |      9.453s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/114  |  38.882s |      9.467s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/115  |  38.861s |      9.446s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/116  |  38.945s |      9.498s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/117  |  38.94s  |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/118  |  38.881s |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/119  |  38.911s |      9.486s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/120  |  14.434s |      4.044s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/121  |  14.455s |      4.061s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/122  |  14.464s |      4.067s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/123  |  14.488s |      4.058s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/124  |  38.932s |      9.499s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/125  |  38.868s |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/126  |  38.863s |      9.488s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/127  |  38.899s |      9.467s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/128  |  38.948s |      9.511s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/129  |  38.915s |      9.502s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/130  |  38.932s |      9.508s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/131  |  38.914s |      9.491s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/132  |  14.42s  |      4.031s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/133  |  14.464s |      4.054s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/134  |  14.494s |      4.048s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/135  |  14.48s  |      4.046s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/136  |  38.911s |      9.49s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/137  |  38.91s  |      9.478s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/138  |  38.914s |      9.478s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/139  |  38.92s  |      9.475s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/140  |  38.883s |      9.508s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/141  |  38.935s |      9.5s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/142  |  38.848s |      9.494s|
    | GPU_BasicGEMMStreamKWorkgroupMapping/143  |  38.932s |      9.502s|

    </details>

    ## Technical Details

    - Short-circuit expression comparison
    - Remove unused code
    - Caching expressions of kernel arguments to eliminate redundant
    regeneration.
    - Change `AssemblyKernelArgument` to a `class`

    ## Test Plan

    No functional changes; covered by existing tests.

    ## Test Result

    See CI report

commit 66e22ac6c6c0c286325cabf2b1faa269ea640446
Author: hcman2 <[email protected]>
Date:   Fri Feb 6 10:55:08 2026 +0800

    [formocast] [tensilelite] enable tuning with formocast (#4043)

    ## Motivation
    Integration plan for Formocast and Origami:

    1. Push Formocast code to the origami subfolder. (we are here now)
    2. Submit tuning code calling the API of origami.
    3. Push and reserve origami prediction mode. Add APIs to pass sizemapping
    data via config_t. Use an environment variable to switch modes without
    affecting other code.
    4. Enable predictionThreshold with tox tests.
    5. Push Origami code with Formocast backend. This step will enable the
    bench with different modes.
    6. Refine Formocast and Origami to verify API usage and identify
    functions to move.

    This PR covers steps 2 and 4.

    ## Technical Details

    <!-- Explain the changes along with any relevant GitHub links. -->

    ## Test Plan

    <!-- Explain any relevant testing done to verify this PR. -->

    ## Test Result

    <!-- Briefly summarize test outcomes. -->

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Henry Ho <[email protected]>
    Co-authored-by: Peter Cheng <[email protected]>

commit 37a74ef54eaa1bb1df603db6ec8aff22f342bc71
Author: Illia Silin <[email protected]>
Date:   Thu Feb 5 17:06:57 2026 -0800

    [CK]  a bunch of CI fixes. (#4361)

    ## Motivation

    Fixing some of the CK CI issues

    ## Technical Details

    fixing paths to dockerfiles and scripts;
    moving codegen tests to separate stage (collides with main build since
    you must call cmake from same folder but different options);
    fixing a couple of clang compilation issues with staging compiler;

commit 808e9496d17be0826164dbb34457f54e157dd2bd
Author: Kerry Wang <[email protected]>
Date:   Thu Feb 5 17:00:01 2026 -0600

    refactor observers to have new runtime with context concept

commit 3c9beb38b8dba1301a961cc5dc3f44ca9d4185e3
Author: Eiden Yoshida <[email protected]>
Date:   Thu Feb 5 17:56:12 2026 -0500

    [CK] MICI: Fix git diff in selective_test_filter.py (#4352)

    ## Motivation

    - git diff needs access to reference repo

    ## Technical Details

    - mount reference repo path into docker for selective_test_filter.py to
    access

    ## Test Plan

    - tested in MICI

    ## Test Result

    - launch_tests.sh ran successfully

commit 1663ac026d46c3dd02edb73d2bfa7310c54695d5
Author: Torre Zuk <[email protected]>
Date:   Thu Feb 5 14:20:45 2026 -0700

    [rocBLAS] trsm doc & test;  trsv change noted in log (#4198)

    ## Motivation

    Tests trsm use of new trsv kernel for big batches

    Adds a changelog note on trsv, which used to call hipGetDevice and
    potentially hipSetDevice

commit d8bb9d2b9fe278d07fe63b395d87268c94e53fd8
Author: Jeffrey Novotny <[email protected]>
Date:   Thu Feb 5 14:54:48 2026 -0500

    [rocsolver] Doxygen API cleanup part 2 (#4330)

    ## Motivation

    Continuing with rocSOLVER API/Doxygen copy edits and polishing

    ## Technical Details

    Edit Doxygen comments in header file.

    ## Test Plan

    Build locally.

    ## Test Result

    NA

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit 5aa1f1d4c189f779ea699be250fb1b284f3d6ac2
Author: Geo Min <[email protected]>
Date:   Thu Feb 5 11:01:53 2026 -0800

    [ci] Updating variable group-id for OSSCI (#4360)

    OSSCI migrated mi325s, so need a new groupID

    Sanity works here:
    https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659665907
    normal run works here:
    https://github.com/ROCm/TheRock/actions/runs/21723540679/job/62659791422

    I've experimented with organization variables; however, they do not work
    for forks, so for now we will do the manual update

commit 6273d3b30b32d6e0856394d37a421153dceb33c1
Author: Dmitrii Polomin <[email protected]>
Date:   Thu Feb 5 19:53:45 2026 +0100

    [MIOpen] Ported solver test to gtest (#3713)

    ## Motivation

    Porting tests from CTest to GTest, in this case, `solver.cpp`

    ## Technical Details

    Pretty straightforward port, although I had to get creative in order to
    conform to `INSTANTIATE_TEST_SUITE_P` pattern and naming conventions

    ## Test Plan

    Running locally, using the CI launched by this PR

    ## Test Result

    See CI actions launched by this PR

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit ece0c48dd152856c3b443c12c58a89bc7d7c34e5
Author: Nathan Henderson <[email protected]>
Date:   Thu Feb 5 09:03:42 2026 -0700

    [rocroller] Use StreamKMode in hipBLASLt integration and client (#4028)

    ## Motivation

    RocRoller has a `StreamKMode` enum that is used internally and in the
    GEMM tests. However, the client and hipBLASLt integration still used
    boolean values to represent the StreamK state. This PR replaces the
    individual boolean flags (`--streamK`, `--streamKTwoTile`,
    `--streamKTwoTileDPFirst`) with a single `--streamK` string option that
    accepts one of the `StreamKMode` values (`None`, `Standard`, `TwoTile`,
    or `TwoTileDPFirst`).

    ## Technical Details

    - Update the rocRoller GEMM client to use `StreamKMode` enum instead of
    three separate booleans
    - Update the hipBLASLt rocRoller integration to match
    - Add `enumStrings<T>()` utility function in `Utils_impl.hpp` for CLI
    validation of enum values
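
    A utility in this spirit could be sketched as follows. The mode names come
    from the description above; everything else is an assumption, including
    the `Count` sentinel, the `toString` overload, and `isValidStreamKMode`.
    The actual `enumStrings<T>()` in `Utils_impl.hpp` may be written
    differently.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <string>
    #include <vector>

    // `Count` is a hypothetical sentinel marking the end of the enumerators.
    enum class StreamKMode
    {
        None,
        Standard,
        TwoTile,
        TwoTileDPFirst,
        Count
    };

    inline std::string toString(StreamKMode mode)
    {
        switch(mode)
        {
        case StreamKMode::None: return "None";
        case StreamKMode::Standard: return "Standard";
        case StreamKMode::TwoTile: return "TwoTile";
        case StreamKMode::TwoTileDPFirst: return "TwoTileDPFirst";
        default: return "Invalid";
        }
    }

    // Collect the string form of every enumerator below T::Count.
    template <typename T>
    std::vector<std::string> enumStrings()
    {
        std::vector<std::string> names;
        for(int i = 0; i < static_cast<int>(T::Count); ++i)
            names.push_back(toString(static_cast<T>(i)));
        return names;
    }

    // Validate a CLI value such as `--streamK TwoTile` against the known names.
    inline bool isValidStreamKMode(const std::string& value)
    {
        const auto names = enumStrings<StreamKMode>();
        return std::find(names.begin(), names.end(), value) != names.end();
    }
    ```

    With a single string option, an old-style boolean value such as `true` is
    rejected at parse time instead of being silently accepted.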

    ## Test Plan

    Update `test_gemm_client.py` YAML fixtures to use the new `streamK:
    None` format.

    ## Test Result

    Validated by the StreamK rrperf tests

commit 3b98c98a23e76075a6a1e4e580482a627e39d59b
Author: Jobbins <[email protected]>
Date:   Thu Feb 5 08:56:42 2026 -0700

    [composablekernel] fix failure status (#4351)

    ## Motivation

    Pipelines were failing on Math CI status check.

    ## Technical Details

    For the success case, we just changed the config in Jenkins to use a
    proper app token and no code changes were required. However, the failure
    case would not have worked as coded, so we needed to move that outside
    of the `rocmnode()` block.

    ## Test Plan

    I removed all of the CI in one of the commits to test quickly, and then
    added it back. This produced both a "success" message and a "failure"
    message.

commit 9bb7f5c31253643cd72363314c3d3ee02f723406
Author: Eiden Yoshida <[email protected]>
Date:   Thu Feb 5 10:55:44 2026 -0500

    [CK] MICI: Correct path for build trace script (#4349)

    ## Motivation

    - Corrects path to script due to superrepo migration
    - Forces all tests to run by default

    ## Technical Details

    - now in /projects/composablekernel

    ---------

    Co-authored-by: illsilin_amdeng <[email protected]>

commit 120f91dd211117e308b3713593ac7f061cc02c08
Author: bibek <[email protected]>
Date:   Thu Feb 5 09:47:16 2026 -0600

    [HIPDNN][DOC] Add TYPED_TEST guidance for multi-datatype tests (#4000)

    ## Motivation

    Doc update : add `TYPED_TEST` guidance for multi-datatype tests

    Update `.clinerules`, `.cursor/rules/testing.mdc`, and `docs/Testing.md`
    to recommend `TYPED_TEST` for tests covering `float`/`half`/`bfloat16`.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Samuel Reeder <[email protected]>

commit d26a7820b58f789d19efd6064d4c2c4f4fc72a95
Author: Eiden Yoshida <[email protected]>
Date:   Wed Feb 4 21:43:22 2026 -0500

    [CK] MICI: Use reference repo for checkout operations (#4336)

    ## Motivation

    - Maintain a reference repo on slave nodes that speeds up any
    clone/checkout operations

    ## Technical Details

    - clone a ref repo if it does not exist
    - update ref repo if it does exist
    - checkout after ref repo is updated
    - eliminates double clone

    ## Test Result

    - Initial checkouts succeeded

commit f2f187ab40738272232f571f58112697da405b1a
Author: Geo Min <[email protected]>
Date:   Wed Feb 4 15:43:38 2026 -0800

    [ci] Fixing rocm-libs race condition (#4192)

    Currently, there is a race condition that overwrites BLAS libraries
    during MIOpen/hipdnn builds. (error:
    https://github.com/ROCm/rocm-libraries/actions/runs/21228188053/job/61080555083)

    Tested locally:
    ```
    # With all three
    geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
    [{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_MIOPEN=ON -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel', 'project_to_test': 'miopen_plugin,miopen,hipdnn'}]

    # Only hipdnn
    geom@geom:~/Code/rocm-libraries/.github/scripts$ PLATFORM="linux" python3 therock_matrix.py
    [{'cmake_options': '-DTHEROCK_USE_EXTERNAL_COMPOSABLE_KERNEL=ON -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_COMPOSABLE_KERNEL_SOURCE_DIR=../composable_kernel -DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON -DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON', 'project_to_test': 'hipdnn,miopen_plugin'}]
    ```

    This fixes the error: if all libraries are run, they will be combined
    (saves resources and avoids overwriting).
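The combining behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the logic in `therock_matrix.py`: the `PROJECT_FLAGS` mapping and the `combine_projects` function are hypothetical, and only the flag names themselves are taken from the output shown in the commit.

```python
# Hypothetical per-project flags; the real mapping is not shown in the commit.
PROJECT_FLAGS = {
    "miopen": [
        "-DTHEROCK_ENABLE_MIOPEN=ON",
        "-DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON",
    ],
    "hipdnn": [
        "-DTHEROCK_ENABLE_MIOPEN_PLUGIN=ON",
        "-DTHEROCK_ENABLE_COMPOSABLE_KERNEL=ON",
    ],
}


def combine_projects(projects):
    """Merge several projects into one matrix entry with de-duplicated flags."""
    flags = []
    for project in projects:
        for flag in PROJECT_FLAGS[project]:
            if flag not in flags:  # keep first occurrence, preserve order
                flags.append(flag)
    return [{
        "cmake_options": " ".join(flags),
        "project_to_test": ",".join(projects),
    }]
```

With a single entry there is only one build, so two jobs can no longer overwrite each other's libraries.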

commit f34aec25c434b3044b75481d70693af3bf0ade1e
Author: assistant-librarian[bot] <210906412+assistant-librarian[bot]@users.noreply.github.com>
Date:   Wed Feb 4 18:25:31 2026 -0500

    [CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263)

    Implement per-page K/V quantization for paged attention:
      - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
      - Use exp2 shift trick to eliminate explicit P scaling overhead
      - Prefetch the physical page offsets for the KV cache, overlapping
        with computation
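The commit does not spell out the exp2 shift trick. One plausible reading is that a scale factor on the softmax numerators P is folded into the exponent, using the identity c · 2^x = 2^(x + log2 c), so no separate multiply is needed. The sketch below illustrates only that identity in plain Python; the function names and the `p_scale` parameter are assumptions and do not come from the CK kernel.

```python
import math


def scaled_probs_multiply(scores, p_scale):
    """Baseline: base-2 softmax numerators followed by an explicit multiply."""
    m = max(scores)
    return [p_scale * 2.0 ** (s - m) for s in scores]


def scaled_probs_shift(scores, p_scale):
    """Shifted form: fold log2(p_scale) into the exponent instead."""
    # The shift is computed once per row and replaces the per-element multiply.
    shift = math.log2(p_scale) - max(scores)
    return [2.0 ** (s + shift) for s in scores]
```

Both functions return the same values up to floating-point rounding; the shifted form does one addition per element where the baseline does a subtraction and a multiplication.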

    ---
    🔁 Imported from
    [ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
    🧑‍💻 Originally authored by @Jeff-Huang

    ---------

    Co-authored-by: Jeff Huang <[email protected]>
    Co-authored-by: Illia Silin <[email protected]>

commit df32df51ea6a9ebfba9a459c77ed82c4877df22b
Author: Yiqian Liu <[email protected]>
Date:   Wed Feb 4 17:12:11 2026 -0600

    [rocRoller] [hipblaslt] Enable more workgroup tile sizes for pre-swizzled scale data (#4175)

    ## Motivation

    Pre-swizzling is an optimization that rearranges the scale data ahead
    of time to match the layout the kernel expects. The purpose of this PR
    is to add more workgroup tile sizes that support pre-swizzled scale data.

    ## Technical Details

    1. Set the workgroup tile size to 256 in the K dimension when the
    input data format is pre-swizzled.
    2. Filter out invalid workgroup tile sizes (i.e., the MN dimension is
    not a multiple of 32, or the MN dimension is 96).
    3. Configure the solution parameters for pre-swizzled input.
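The first two rules above can be sketched as a small filter. This is an illustration, not rocRoller or hipBLASLt code: the names `is_valid_mn` and `preswizzled_tiles` are hypothetical, and the sketch assumes the "MN dimension" rule applies to M and N individually.

```python
PRESWIZZLED_K = 256  # K tile size stated in the commit for pre-swizzled input


def is_valid_mn(size):
    """An M or N tile size is kept if it is a multiple of 32 and is not 96."""
    return size % 32 == 0 and size != 96


def preswizzled_tiles(mn_candidates):
    """Enumerate (M, N, K) workgroup tiles allowed for pre-swizzled scale data."""
    valid = [s for s in mn_candidates if is_valid_mn(s)]
    return [(m, n, PRESWIZZLED_K) for m in valid for n in valid]
```

For example, `preswizzled_tiles([16, 32, 64, 96, 128])` keeps 32, 64, and 128 for M and N, so the 32x32x256 tile used by the new client test is among the results.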

    ## Test Plan

    1. Added a rocRoller client test that uses 32x32x256 workgroup tile.

    ## Test Result

    1. This PR should not change any kernel whose data is not
    pre-swizzled.
    2. All tests should pass, with no performance changes.

    ## Submission Checklist

    - [ ] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

    ---------

    Co-authored-by: Yiqian Liu <[email protected]>
    Co-authored-by: yiqialiu <[email protected]>

commit 87d1a8fa005ef2f75e48e5c9c4e70f8235236b03
Author: Samuel Reeder <[email protected]>
Date:   Wed Feb 4 16:10:01 2026 -0700

    Use `--latest-release` flag for installing rocm in clang-tidy (#4120)

    ## Motivation

    `--latest-release` was added in TheRock
    [2997](https://github.com/ROCm/TheRock/pull/2997) to grab the latest
    nightly for the specified target. We can use this to keep the
    clang-tidy workflow reasonably up to date.

    ## Submission Checklist

    - [x] Look over the contributing guidelines at
    https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

commit c5…