UCP/CORE: Select one TL resource for memtype EP when multiple are available #11229
yafshar wants to merge 2 commits into openucx:master from
Conversation
When multiple transport devices support the same memory type (for example, ZE sub-devices (tiles) on multi-tile GPUs), ucp_worker_mem_type_eps_create() asserted on num_lanes == 1 and aborted. Instead of failing, select one mem-access TL resource deterministically by choosing the lowest rsc_index when multiple candidates are present. This preserves the single-lane requirement for memtype endpoints while allowing transports such as ZE to enumerate all devices. The change is generic to memtype EP selection and applies when any memory type (for example ZE, CUDA, or ROCm) exposes multiple TL resources. ZE multi-tile configurations were the immediate trigger. Fixes assertion failures on Intel Data Center GPU Max and similar multi-tile platforms.
Seems this PR always limits memtype endpoints to one transport, which can prevent both cuda_copy and gdrcopy from being used for memory-type copies.
This is a regression fix. The assertion in src/ucp/core/ucp_ep.c:

```c
/* Mem type EP cannot have more than one lane */
num_lanes = ucp_ep_num_lanes(worker->mem_type_ep[mem_type]);
ucs_assertv_always(num_lanes == 1, "num_lanes=%u", num_lanes);
```

triggers and aborts whenever multiple transports are present (e.g., multiple ZE tiles, or both cuda_copy and gdrcopy). This change allows UCX to continue by deterministically selecting the first available resource. Without this fix, multi-tile GPUs (Intel Max) or systems with redundant copy transports simply crash during worker creation. Supporting multiple simultaneous transports for memtype staging (multi-lane memtype EPs) would require a broader architectural refactor beyond the scope of this crash fix.
Please let me know if this is not correct or if there is a better way.
@yafshar it's possible for both cuda_copy and gdr_copy to be used in the same connection, for example gdrcopy for small-buffer copies and cuda_copy for large-buffer copies.
You're correct that ideally UCX could leverage both transports: gdrcopy for small buffers (lower latency) and cuda_copy for large buffers (better bandwidth). However, the current memtype EP architecture fundamentally assumes a single transport lane for staging operations, as evidenced by the num_lanes == 1 assertion. Without this PR, having both cuda_copy and gdrcopy present causes an immediate assertion failure and process abort during worker creation, so currently it's impossible to use both simultaneously anyway. This PR fixes the crash by deterministically selecting one transport, restoring basic functionality. To properly support your suggested optimization (using gdrcopy for small transfers and cuda_copy for large ones), we would need to:
Do you want me to draft something for this, or do you already have a fix?
What
Fix an assertion failure in ucp_worker_mem_type_eps_create() when multiple TL resources support the same memory type.
Why
On platforms where one memory type is exposed by multiple transport resources (for example, Level Zero sub-devices on multi-tile GPUs), memtype EP creation can see multiple candidate lanes and fail the num_lanes == 1 assertion.
The memtype EP flow currently requires a single lane for staging operations.
How
In ucp_worker_mem_type_eps_create():
- Detect when mem_access_tls contains more than one resource.
- Select the lowest rsc_index (via UCS_STATIC_BITMAP_FFS).
- Reduce mem_access_tls to only the selected resource.

This is a minimal core-side fix that preserves existing behavior and invariants.
Impact