
Add hipdnn convolution support #3049

Open
zjgarvey wants to merge 14 commits into ROCm:hipdnn_develop from zjgarvey:hipdnn_convolution_2

Conversation


@zjgarvey zjgarvey commented Mar 6, 2026

Implements forward and backward convolution (2D/3D) through the hipDNN frontend graph API, providing an alternative to the MIOpen backend on ROCm.

  • Shared utilities: Extract createTensorAttributes() into ATen/hipdnn/Utils.h for reuse across hipDNN ops (BatchNorm, Conv)
  • Graph-cached convolution: Forward (fprop), backward-data (dgrad), and backward-weight (wgrad) via hipDNN frontend graphs with a thread-local LRU cache (ParamsLRUCache<K,V>) to amortize graph->build() cost
  • Dispatch integration: New ConvBackend::Hipdnn and ConvBackend::HipdnnTranspose variants wired through backend selection, memory format selection, forward/backward switches, and Python enum exposure. hipDNN takes priority over MIOpen when torch.backends.hipdnn.enabled is True
  • Bias fusion: Forward conv fuses bias via a pointwise ADD node in the graph, avoiding a separate output.add_() call
  • Transposed convolution: Implemented as dgrad (forward) / fprop (backward-input) / wgrad (backward-weight), with bias applied separately (for now) since dgrad + pointwise graphs aren't supported by any hipDNN backends currently.
  • Grouped/depthwise support: hipDNN infers group count from tensor dimensions; explicit output dims are set on dgrad/wgrad graphs so grouping is resolved correctly
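The graph-cached path above hinges on the thread-local `ParamsLRUCache<K,V>`. As a rough sketch of how such a cache can be structured (the class name comes from the PR; the body below is an illustrative guess, not the actual implementation — it also folds in the `cache_limit == 0` means "unlimited" semantics fixed later in this PR):

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Illustrative LRU cache sketch, not the PR's ParamsLRUCache implementation.
// limit == 0 means "unlimited": entries are stored but never tracked or
// evicted, so no recency bookkeeping is paid for.
template <typename K, typename V>
class ParamsLRUCacheSketch {
 public:
  explicit ParamsLRUCacheSketch(std::size_t limit) : limit_(limit) {}

  V* find(const K& key) {
    auto it = map_.find(key);
    if (it == map_.end()) return nullptr;
    if (limit_ != 0) {
      // Move the key to the front of the recency list (most recently used).
      order_.splice(order_.begin(), order_, it->second.order_it);
    }
    return &it->second.value;
  }

  void insert(const K& key, V value) {
    if (auto* existing = find(key)) { *existing = std::move(value); return; }
    if (limit_ != 0 && map_.size() >= limit_) {
      // Evict the least-recently-used entry (back of the list).
      map_.erase(order_.back());
      order_.pop_back();
    }
    Entry e{std::move(value), {}};
    if (limit_ != 0) {
      order_.push_front(key);
      e.order_it = order_.begin();
    }
    map_.emplace(key, std::move(e));
  }

  std::size_t size() const { return map_.size(); }

 private:
  struct Entry {
    V value;
    typename std::list<K>::iterator order_it;
  };
  std::size_t limit_;
  std::list<K> order_;                  // front = most recently used
  std::unordered_map<K, Entry> map_;
};
```

Declaring one instance `thread_local` gives each thread its own cache with no locking, which is why the PR could drop `#include <mutex>`.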

@zjgarvey zjgarvey changed the title Hipdnn convolution 2 [wip] add hipdnn convolution support Mar 6, 2026

rocm-repo-management-api bot commented Mar 6, 2026

Jenkins build for 2cea5f50e97a741b501c0fce1667b713966152db commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api bot commented Mar 9, 2026

Jenkins build for 168eeb1b729889b93ad504773a9b8c3c5be09592 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


@zjgarvey zjgarvey left a comment


Self-review


rocm-repo-management-api bot commented Mar 11, 2026

Jenkins build for b0ca170b13923c1d321e2aed5f3b2daef7d5dd09 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

@zjgarvey zjgarvey changed the title [wip] add hipdnn convolution support Add hipdnn convolution support Mar 11, 2026
@zjgarvey zjgarvey marked this pull request as ready for review March 11, 2026 14:55

rocm-repo-management-api bot commented Mar 11, 2026

Jenkins build for b0ca170b13923c1d321e2aed5f3b2daef7d5dd09 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during Pytorch building:

[5315/8175] Building CXX object third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_avx512_core_f16_dw_conv_kernel.cpp.o
[5316/8175] Building CXX object third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/gemm/s8x8s32/jit_avx512_core_u8_copy_sum_an_kern_autogen.cpp.o
[5317/8175] Building CXX object third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/gemm/s8x8s32/jit_avx_u8_copy_bn_kern_autogen.cpp.o
[5318/8175] Building CXX object third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_generator.cpp.o
[5319/8175] Building CXX object third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_brgemm_transpose_utils.cpp.o
FAILED: third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_brgemm_transpose_utils.cpp.o 
/opt/cache/bin/sccache /opt/cache/bin/c++ -DDNNL_ENABLE_CPU_ISA_HINTS -DDNNL_ENABLE_ITT_TASKS -DDNNL_ENABLE_MAX_CPU_ISA -DDNNL_X64=1 -DIDEEP_USE_MKL -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_VERSION=70200 -DTORCH_HIP_VERSION=702 -DUSE_LAYERNORM_FAST_RECIPROCAL -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I/var/lib/jenkins/pytorch/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/third_party -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/src -isystem /opt/rocm-7.2.0/include -isystem /var/lib/jenkins/pytorch/build/third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/protobuf/src -isystem /opt/conda/envs/py_3.12/include -isystem /var/lib/jenkins/pytorch/third_party/XNNPACK/include -isystem /var/lib/jenkins/pytorch/third_party/ittapi/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/eigen -isystem /opt/rocm/include -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -fopenmp -fvisibility-inlines-hidden  -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -Wall -Wno-unknown-pragmas -Wundef -fvisibility=internal   -fPIC -Wformat -Wformat-security -D_FORTIFY_SOURCE=2 -fstack-protector-strong -fcf-protection=full  -Wmissing-field-initializers  -Wno-strict-overflow -Wno-maybe-uninitialized -Wno-stringop-overflow -Wno-array-bounds  -O3 -DNDEBUG -DNDEBUG -std=c++20 -fPIC -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -MD -MT third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_brgemm_transpose_utils.cpp.o -MF 
third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_brgemm_transpose_utils.cpp.o.d -o third_party/ideep/mkl-dnn/src/cpu/x64/CMakeFiles/dnnl_cpu_x64.dir/jit_brgemm_transpose_utils.cpp.o -c /var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/src/cpu/x64/jit_brgemm_transpose_utils.cpp
during RTL pass: cse_local
/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/src/cpu/x64/jit_brgemm_transpose_utils.cpp: In member function ‘void dnnl::impl::cpu::x64::jit_brgemm_copy_to_coarse_t::copy_row_blks(int)’:
/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/src/cpu/x64/jit_brgemm_transpose_utils.cpp:1146:1: internal compiler error: Segmentation fault
 1146 | }

// ---------------------------------------------------------------------------
// Cache key: captures everything that determines graph topology
// ---------------------------------------------------------------------------
constexpr int hipdnn_max_dim = 3;


MIOPEN_DIM_MAX = 5 is already defined and used:

static constexpr int MIOPEN_DIM_MAX = 5;

constexpr int MIOPEN_DIM_MAX = 5;

constexpr size_t MIOPEN_DIM_MAX = 5;



I'd like to leave this separate, since I don't want to use any of these named constants from other source files.

E.g., hipdnn might have other backends which support more dim combinations than miopen.

@zjgarvey

I think I need a rebase, one sec.

@zjgarvey zjgarvey force-pushed the hipdnn_convolution_2 branch from b0ca170 to 184c4c3 Compare March 11, 2026 21:23

rocm-repo-management-api bot commented Mar 11, 2026

Jenkins build for 184c4c3223e16aa7318a8d4f9d82bcad635636c0 commit finished as SUCCESS
Links: Pipeline Overview / Build artifacts


rocm-repo-management-api bot commented Mar 12, 2026

Jenkins build for a2cb42c0f7117d294a9cbd6ffad6ad27341fc523 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api bot commented Mar 12, 2026

Jenkins build for a2cb42c0f7117d294a9cbd6ffad6ad27341fc523 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


@mousdahl-amd mousdahl-amd left a comment


Nothing major jumped out at me.

if (!at::globalContext().userEnabledHipdnn()) return false;
if (!detail::getCUDAHooks().compiledWithHipDNN()) return false;
if (!input.is_cuda()) return false;
auto dtype = input.scalar_type();


These datatype / dimension checks are interesting to me. I understand wanting a fast path to skip if hipDNN doesn't support certain datatypes / tensor dimensions, but this could very easily change going forward. I wonder if there's a way we can check in with hipDNN to see if it's got applicable engines instead. I just want to avoid a maintenance headache if possible.

The problem with that approach is that it may be heavier-weight than you want.

zjgarvey and others added 9 commits March 17, 2026 10:34
Implement hipDNN-based forward and backward convolution (2D and 3D)
with a thread-local LRU graph cache to amortize the expensive
graph->build() cost. Supports contiguous and channels-last memory
formats, grouped/depthwise configurations, and transposed convolution.

The graph cache follows the cuDNN v8 pattern from Conv_v8.cpp with
configurable size via TORCH_HIPDNN_CONV_LRU_CACHE_LIMIT env var.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
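On the channels-last support mentioned above: PyTorch's `at::MemoryFormat::ChannelsLast` keeps NCHW dimension order but NHWC memory order, so the layout is conveyed entirely through strides. The helper below is illustrative (not from the PR) and shows the stride pattern a graph builder can hand to hipDNN so it handles the layout natively instead of forcing a contiguous NCHW copy:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative helper: given NCHW dimensions, compute the strides of a
// channels-last (NHWC) layout, reported per NCHW dim index as PyTorch does.
std::array<int64_t, 4> channels_last_strides(
    const std::array<int64_t, 4>& nchw) {
  const int64_t c = nchw[1], h = nchw[2], w = nchw[3];
  // Memory order is N, H, W, C; C is the fastest-varying dimension.
  return {h * w * c,  // stride of N
          1,          // stride of C
          w * c,      // stride of H
          c};         // stride of W
}
```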
Add Hipdnn and HipdnnTranspose to ConvBackend enum and wire them
through the full dispatch path: backend selection (use_hipdnn check
inserted before use_miopen), memory format selection, forward switch,
backward switch, and Python enum exposure.

hipDNN takes priority over MIOpen when torch.backends.hipdnn.enabled
is True. The hipdnn_conv_suggest_memory_format() supports NHWC/NDHWC
unconditionally since hipDNN handles strides natively.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The functions used SymIntArrayRef/SymInt parameter types but the
dispatch registration in native_functions.yaml maps to the non-symint
overload, which expects IntArrayRef/int64_t. This caused undefined
reference linker errors in libtorch_hip.so.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix cache limit env var: replace check_env() (returns bool) with
  get_env() + stoi() so custom limits are actually parsed
- Remove cudnn_enabled gate from use_hipdnn(); userEnabledHipdnn()
  already handles the enable/disable check independently
- Fuse bias into forward conv graph via pointwise(ADD), eliminating
  the separate output.add_(reshape_bias(...)) call
- Remove unused #include <mutex> (cache is thread-local)
- Unify namespace style to `namespace at::native`
- Add clarifying comment on implicit group count via tensor shapes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
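The env-var fix above comes down to reading the value rather than its presence: `check_env()` only reports whether the variable is set, so a custom limit was never parsed. A standalone stand-in for the corrected logic (the PR uses c10's `get_env()`; the function below uses `std::getenv` so it is self-contained, and the fallback behavior on a missing or non-numeric value is my assumption):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Illustrative stand-in for parsing TORCH_HIPDNN_CONV_LRU_CACHE_LIMIT:
// read the value, parse with std::stoi, fall back on absence or garbage.
int parse_cache_limit(const char* name, int fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr) return fallback;
  try {
    return std::stoi(raw);
  } catch (const std::exception&) {
    return fallback;  // non-numeric value: keep the default
  }
}
```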
Add test_hipdnn_conv.py with forward+backward tests for hipDNN conv.
Working: fp32 basic conv, dilation. Xfail: bias (plugin needs conv+bias+activ
3-node graph), bf16, grouped/strided backward, transposed conv.
Skip: depthwise (GPU fault).

Fix buildConvFpropGraph to set intermediate_data_type on graph and
compute_data_type on pointwise attributes, matching the hipDNN sample
pattern for fused conv+bias. Bias fusion still blocked by MIOpen
legacy plugin only supporting 3-node (conv+bias+activ) graphs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hipDNN infers group count from tensor dimensions. For wgrad, both graph
inputs (dy and x) have full channel counts, so without explicitly
setting the output weight shape, hipDNN cannot determine the group
count and defaults to groups=1. This causes out-of-bounds GPU memory
access for grouped/depthwise convolutions.

Set output dims on both dgrad and wgrad graphs using the known
input_size and weight_size respectively.
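The ambiguity described above is worth making concrete. Both wgrad inputs carry full channel counts (dy is [N, C_out, ...] and x is [N, C_in, ...]), while the group count appears only in the weight shape, which for a standard grouped conv is [C_out, C_in / groups, kH, kW]. So for any `groups` dividing C_in the inputs look identical, and defaulting to groups=1 over-sizes the weight. The helper is hypothetical, just to show the shape arithmetic:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: weight shape of a 2D grouped convolution.
// The group count is visible only here, not in the dy/x input shapes.
std::array<int64_t, 4> grouped_weight_shape(
    int64_t c_out, int64_t c_in, int64_t groups, int64_t kh, int64_t kw) {
  return {c_out, c_in / groups, kh, kw};
}
```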

Authored with Claude.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…loat16 numerics.

Signed-off-by: zjgarvey <zjgarvey@gmail.com>
Replace the custom HipdnnGraphCache with a shared ParamsLRUCache<K,V>
template, remove stored UIDs from the cached graph struct in favor of
enum constants with semantic aliases, and move contiguous() calls into
the hipDNN entry points so the dispatch site doesn't need them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include the graph output dimensions in HipdnnConvParams so that dgrad
graphs built for different output_padding values (e.g. transposed conv
with output_padding=1 vs 0) are cached separately.
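Why output dims must be part of the key: per PyTorch's documented ConvTranspose2d formula, each transposed-conv output dimension is `out = (in - 1)*stride - 2*pad + dilation*(k - 1) + output_padding + 1`, so `output_padding` changes only the output shape, not the input dims, weight dims, or conv attributes that would otherwise populate the key. A small illustrative helper:

```cpp
#include <cassert>
#include <cstdint>

// Transposed-convolution output size per dimension (PyTorch's documented
// ConvTranspose2d formula). output_padding affects only this value, so two
// graphs differing only in output_padding would collide in a cache key that
// omits output dims.
int64_t conv_transpose_out_dim(int64_t in, int64_t k, int64_t stride,
                               int64_t pad, int64_t dilation,
                               int64_t output_padding) {
  return (in - 1) * stride - 2 * pad + dilation * (k - 1) + output_padding + 1;
}
```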

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zjgarvey and others added 4 commits March 17, 2026 10:34
Wire bias through buildConvDgradGraph and runHipdnnConvDgrad so
dgrad+pointwise(ADD) fusion is ready when hipDNN backend plugins
support it. For now the transposed conv entry point still applies
bias separately since no plugin handles this pattern yet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix cache_limit == 0 to mean "unlimited" (no eviction, no LRU tracking)
instead of the previous behavior where entries were still added to
cache_order unnecessarily. Add TORCH_INTERNAL_ASSERT on eviction erase
to catch cache corruption early.

Plumb benchmark and deterministic flags through to the cache key so that
different flag combinations produce separate cache entries, preparing for
when HipDNN supports algorithm/engine selection based on these flags.

Authored with Claude.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
deterministic=true is a correctness contract that hipDNN cannot honor
yet (no engine-level determinism filtering), so raise an error rather
than silently producing non-deterministic results.

benchmark=true is a performance hint (algorithm search), safe to ignore
but users should know it has no effect — emit TORCH_WARN_ONCE.

Also removes both flags from the cache key since they do not affect
graph construction.
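The policy above can be sketched as follows. This is a standalone stand-in, not the PR's code: the real implementation would use PyTorch's `TORCH_CHECK` and `TORCH_WARN_ONCE` macros, and the struct and counter below are hypothetical:

```cpp
#include <cassert>
#include <stdexcept>

// Illustrative flag policy: deterministic is a correctness contract, so fail
// loudly rather than silently return non-deterministic results; benchmark is
// only a performance hint, so warn once that it is ignored and proceed.
struct HipdnnFlagPolicy {
  bool warned_benchmark = false;
  int warnings_emitted = 0;

  void check_flags(bool deterministic, bool benchmark) {
    if (deterministic) {
      // TORCH_CHECK analogue: refuse a contract we cannot honor.
      throw std::runtime_error(
          "hipDNN convolution does not support deterministic mode yet");
    }
    if (benchmark && !warned_benchmark) {
      warned_benchmark = true;  // TORCH_WARN_ONCE analogue
      ++warnings_emitted;
    }
  }
};
```

Since neither flag changes graph topology, dropping both from the cache key avoids building duplicate graphs for identical convolutions.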

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of applying bias separately after the dgrad call, pass it
through to runHipdnnConvDgrad so it can be fused when the backend
supports it.

Signed-off-by: Zach Garvey <zachary.garvey@amd.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
@zjgarvey zjgarvey force-pushed the hipdnn_convolution_2 branch from a2cb42c to cd74159 Compare March 17, 2026 17:40

rocm-repo-management-api bot commented Mar 17, 2026

Jenkins build for cd7415905e38d51aec57a82df23904df59fd1bb7 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Per-backend forward ops (hipdnn_convolution, hipdnn_convolution_transpose)
are invisible to the compiler -- AOT autograd captures aten.convolution as
an opaque node, never per-backend ops. Removing them from
native_functions.yaml and using dispatch stubs (the same mechanism backward
has used since Joel Schlosser's 2021 refactor) eliminates derivatives.yaml
entries, FC allowlist entries, HasDecomp entries, and trace_rules entries.

hipDNN is the first convolution backend to use dispatch stubs for forward,
proving the pattern works for a potential broader cleanup of cuDNN/MIOpen.

Assisted-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: zjgarvey <zjgarvey@gmail.com>

rocm-repo-management-api bot commented Mar 19, 2026

Jenkins build for 5e5cd8a61c1d5ee5381f87ee5a44f69e9e4a6620 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
