
UCP/CORE: Select one TL resource for memtype EP when multiple are available #11229

Open
yafshar wants to merge 2 commits into openucx:master from intel-staging:fix/memtype-ep-multi-tl-selection

Conversation


yafshar (Contributor) commented Mar 2, 2026

What

Fix an assertion failure in ucp_worker_mem_type_eps_create() when multiple TL resources support the same memory type.

Why

On platforms where one memory type is exposed by multiple transport resources (for example, Level Zero sub-devices on multi-tile GPUs), memtype EP creation can see multiple candidate lanes and fail with:

Assertion num_lanes == 1 failed

The memtype EP flow currently requires a single lane for staging operations.

How

In ucp_worker_mem_type_eps_create():

  • Detect when mem_access_tls contains more than one resource.
  • Select one deterministically using first-set-bit (lowest rsc_index via UCS_STATIC_BITMAP_FFS).
  • Reduce mem_access_tls to only the selected resource.
  • Continue with the existing single-lane memtype EP creation path.

This is a minimal core-side fix that preserves existing behavior and invariants.

Impact

  • Preserves the single-lane memtype EP requirement.
  • Prevents assertion failures when multiple TL resources are present for one memory type.
  • Applies generically to memtype EP selection (for example ZE/CUDA/ROCm cases), with ZE multi-tile as the motivating trigger.

When multiple transport devices support the same memory type (for
example, ZE sub-devices (tiles) on multi-tile GPUs),
ucp_worker_mem_type_eps_create() asserted on num_lanes == 1 and aborted.

Instead of failing, select one mem-access TL resource deterministically
by choosing the lowest rsc_index when multiple candidates are present.
This preserves the single-lane requirement for memtype endpoints while
allowing transports such as ZE to enumerate all devices.

The change is generic to memtype EP selection and applies when any
memory type (for example ZE, CUDA, or ROCm) exposes multiple TL
resources. ZE multi-tile configurations were the immediate trigger.

Fixes assertion failures on Intel Data Center GPU Max and similar
multi-tile platforms.
@yafshar yafshar marked this pull request as ready for review March 2, 2026 22:25

yosefe commented Mar 3, 2026

Seems this PR always limits memtype endpoints to 1 transport, which can prevent from both cuda_copy and gdrcopy being used for memory type copy


yafshar commented Mar 3, 2026

Seems this PR always limits memtype endpoints to 1 transport, which can prevent from both cuda_copy and gdrcopy being used for memory type copy

This is a regression fix: the assertion num_lanes == 1 was recently added in #10933 to enforce the existing architectural constraint that memtype EPs support only a single lane.

In src/ucp/core/ucp_ep.c:

        /* Mem type EP cannot have more than one lane */
        num_lanes = ucp_ep_num_lanes(worker->mem_type_ep[mem_type]);
        ucs_assertv_always(num_lanes == 1, "num_lanes=%u", num_lanes);

That PR meant that having multiple transports (e.g., multiple ZE tiles, or both cuda_copy and gdrcopy) would trigger this assertion and abort; this change allows UCX to continue by selecting the first available resource deterministically. Without this fix, multi-tile GPUs (Intel Max) or systems with redundant copy transports simply crash during worker creation. Supporting multiple simultaneous transports for memtype staging (multi-lane memtype EPs) would require a broader architectural refactor beyond the scope of this crash fix.


yafshar commented Mar 3, 2026

Seems this PR always limits memtype endpoints to 1 transport, which can prevent from both cuda_copy and gdrcopy being used for memory type copy

  • What happens without this PR:
    On multi-tile GPUs (8 ZE tiles) or systems with multiple copy transports (cuda_copy + gdrcopy), UCX crashes with the assertion failure during worker creation.

  • What happens with this PR:
    UCX selects the first available transport resource deterministically and continues successfully.

  • Regarding cuda_copy + gdrcopy:
    You're right that this picks only one (whichever appears first in the bitmap). However, even before the assertion was added, having multiple lanes would likely malfunction since the memtype EP infrastructure isn't designed to stripe staging operations across multiple transports. Selecting one deterministically at least provides functional behavior instead of a crash.
    I think the proper fix (supporting multiple transports or intelligent selection like preferring gdrcopy over cuda_copy) requires broader changes to allow num_lanes > 1 for memtype EPs or adding transport scoring/selection logic.

Please let me know if this is not correct or there is a better way.


rakhmets commented Mar 4, 2026

@shasson5 it seems the assert added in this PR #10933 is not quite right


yosefe commented Mar 9, 2026


@yafshar it's possible for both cuda_copy and gdrcopy to be used in the same connection, for example gdrcopy for small-buffer copies and cuda_copy for large-buffer copies


yafshar commented Mar 11, 2026

it's possible for both cuda_copy and gdrcopy to be used in the same connection, for example gdrcopy for small-buffer copies and cuda_copy for large-buffer copies

You're correct that ideally UCX could leverage both transports, using gdrcopy for small buffers (lower latency) and cuda_copy for large buffers (better bandwidth). However, the current memtype EP architecture fundamentally assumes a single transport lane for staging operations, as evidenced by the num_lanes == 1 assertion.
This is my understanding of the current code.

Without this PR, having both cuda_copy and gdrcopy present causes an immediate assertion failure and process abort during worker creation. So currently, it's impossible to use both simultaneously anyway. This PR fixes the crash by deterministically selecting one transport, restoring basic functionality.

To properly support your suggested optimization (using gdrcopy for small transfers and cuda_copy for large ones), we would need to:

  • Refactor memtype EPs to support multiple lanes (num_lanes > 1)
  • Add lane selection logic based on transfer size/heuristics
  • Ensure proper resource management across multiple staging transports

Do you want me to draft a follow-up for this, or do you already have a fix in progress?
