
UCP/CORE: Select one TL resource for memtype EP when multiple are available #11229

Open
yafshar wants to merge 2 commits into openucx:master from intel-staging:fix/memtype-ep-multi-tl-selection

Conversation


yafshar (Contributor) commented Mar 2, 2026

What

Fix an assertion failure in ucp_worker_mem_type_eps_create() when multiple TL resources support the same memory type.

Why

On platforms where one memory type is exposed by multiple transport resources (for example, Level Zero sub-devices on multi-tile GPUs), memtype EP creation can see multiple candidate lanes and fail with:

Assertion num_lanes == 1 failed

The memtype EP flow currently requires a single lane for staging operations.

How

In ucp_worker_mem_type_eps_create():

  • Detect when mem_access_tls contains more than one resource.
  • Select one deterministically using first-set-bit (lowest rsc_index via UCS_STATIC_BITMAP_FFS).
  • Reduce mem_access_tls to only the selected resource.
  • Continue with the existing single-lane memtype EP creation path.

This is a minimal core-side fix that preserves existing behavior and invariants.

Impact

  • Preserves the single-lane memtype EP requirement.
  • Prevents assertion failures when multiple TL resources are present for one memory type.
  • Applies generically to memtype EP selection (for example ZE/CUDA/ROCm cases), with ZE multi-tile as the motivating trigger.

When multiple transport devices support the same memory type (for
example, ZE sub-devices (tiles) on multi-tile GPUs),
ucp_worker_mem_type_eps_create() asserted on num_lanes == 1 and aborted.

Instead of failing, select one mem-access TL resource deterministically
by choosing the lowest rsc_index when multiple candidates are present.
This preserves the single-lane requirement for memtype endpoints while
allowing transports such as ZE to enumerate all devices.

The change is generic to memtype EP selection and applies when any
memory type (for example ZE, CUDA, or ROCm) exposes multiple TL
resources. ZE multi-tile configurations were the immediate trigger.

Fixes assertion failures on Intel Data Center GPU Max and similar
multi-tile platforms.
@yafshar yafshar marked this pull request as ready for review March 2, 2026 22:25

yosefe commented Mar 3, 2026

Seems this PR always limits memtype endpoints to 1 transport, which can prevent from both cuda_copy and gdrcopy being used for memory type copy


yafshar commented Mar 3, 2026

Seems this PR always limits memtype endpoints to 1 transport, which can prevent from both cuda_copy and gdrcopy being used for memory type copy

This is a regression fix: the assertion num_lanes == 1 was recently added in #10933 to enforce the existing architectural constraint that memtype EPs support only a single lane.

In src/ucp/core/ucp_ep.c:

        /* Mem type EP cannot have more than one lane */
        num_lanes = ucp_ep_num_lanes(worker->mem_type_ep[mem_type]);
        ucs_assertv_always(num_lanes == 1, "num_lanes=%u", num_lanes);

That PR meant that having multiple transports (e.g., multiple ZE tiles, or both cuda_copy and gdrcopy) would trigger this assertion and abort; this change allows UCX to continue by selecting the first available resource deterministically. Without this fix, multi-tile GPUs (Intel Max) or systems with redundant copy transports simply crash during worker creation. Supporting multiple simultaneous transports for memtype staging (multi-lane memtype EPs) would require a broader architectural refactor beyond the scope of this crash fix.


yafshar commented Mar 3, 2026

Seems this PR always limits memtype endpoints to 1 transport, which can prevent from both cuda_copy and gdrcopy being used for memory type copy

  • What happens without this PR:
    On multi-tile GPUs (8 ZE tiles) or systems with multiple copy transports (cuda_copy + gdrcopy), UCX crashes with the assertion failure during worker creation.

  • What happens with this PR:
    UCX selects the first available transport resource deterministically and continues successfully.

  • Regarding cuda_copy + gdrcopy:
    You're right that this picks only one (whichever appears first in the bitmap). However, even before the assertion was added, having multiple lanes would likely malfunction since the memtype EP infrastructure isn't designed to stripe staging operations across multiple transports. Selecting one deterministically at least provides functional behavior instead of a crash.
    I think the proper fix (supporting multiple transports or intelligent selection like preferring gdrcopy over cuda_copy) requires broader changes to allow num_lanes > 1 for memtype EPs or adding transport scoring/selection logic.

Please let me know if this is not correct or there is a better way.


rakhmets commented Mar 4, 2026

@shasson5 it seems the assert added in this PR #10933 is not quite right


yosefe commented Mar 9, 2026


@yafshar it's possible for both cuda_copy and gdrcopy to be used in the same connection, for example gdrcopy for small-buffer copies and cuda_copy for large-buffer copies


yafshar commented Mar 11, 2026

it's possible for both cuda_copy and gdrcopy to be used in the same connection, for example gdrcopy for small-buffer copies and cuda_copy for large-buffer copies

You're correct that ideally UCX could leverage both transports, using gdrcopy for small buffers (lower latency) and cuda_copy for large buffers (better bandwidth). However, the current memtype EP architecture fundamentally assumes a single transport lane for staging operations, as evidenced by the num_lanes == 1 assertion.
This is my understanding of the current code.

Without this PR, having both cuda_copy and gdrcopy present causes an immediate assertion failure and process abort during worker creation. So currently, it's impossible to use both simultaneously anyway. This PR fixes the crash by deterministically selecting one transport, restoring basic functionality.

To properly support your suggested optimization (using gdrcopy for small transfers and cuda_copy for large ones), we would need to:

  • Refactor memtype EPs to support multiple lanes (num_lanes > 1)
  • Add lane selection logic based on transfer size/heuristics
  • Ensure proper resource management across multiple staging transports

Do you want me to draft a follow-up for this, or do you already have a fix in progress?
