UCT/IB/UD: Fix skb use-after-free on simultaneous CREQ by sunnytiwari-e4 · Pull Request #11246 · openucx/ucx

sunnytiwari-e4 · 2026-03-10T09:54:52Z

Fix premature skb release and deadlock on simultaneous connection

What?

Removed the unsafe implicit ACK logic on simultaneous CREQs. The endpoints now safely rely on subsequent CREP packets to acknowledge initial connection requests.
Added explicit ACK_REQ scheduling when discarding a duplicate CREQ.
Restricted this explicit ACK_REQ to incomplete connections (!uct_ud_ep_is_connected(ep)). This handles the dropped CREP deadlock without risking ACK spam from stale endpoints on established connections.

Why?

When two UD endpoints perform a simultaneous connection establishment, both transmit a CREQ packet. Historically, the UD transport handled this collision by treating the incoming CREQ as an implicit ACK for the outgoing CREQ, immediately releasing the outgoing skb back to the memory pool.
Issue 1: Memory corruption
When the CREQ payload exceeds max_inline, the implicit ACK logic is unsafe. The skb is freed immediately while the NIC is still performing the async DMA read from that memory, causing silent data corruption.
Issue 2: Deadlock and OOM
Removing the implicit ACK exposed a deadlock if one side's CREP reply is dropped by the network. The peer will indefinitely resend its CREQ. The receiver drops these resent CREQs as duplicates but never sends an ACK. The sender stays stuck waiting for an ACK, stalling its transmit window, and piling up Active Messages until it exhausts the memory pool.

How?

tvegas1

any possibility to send ctl/creq as signaled and release its skb when receiving tx cqe (dealing only with releasing memory), and keep processing it as ack? apparently there are other possibly other cases involving an ack like:
#10423 (comment)

tvegas1 · 2026-03-11T12:19:44Z

src/uct/ib/ud/base/ud_ep.c

+         * We restrict this to incomplete connections to prevent side-effects 
+         * on established endpoints.
+         */
+        if (!uct_ud_ep_is_connected(ep)) {


seems we still mark ep as connected when receiving simultaneous creq at line 856 or after having received crep, so I wonder if this can be executed at all, when restricted to connected ep?

Thanks @tvegas1 for reviewing it!

any possibility to send creq as signaled and release its skb when receiving tx cqe, and keep processing it as ack?

I am thinking about the potential tx cqe polling overhead if we make every creq signaled, also we might quickly exhaust SQ/CQ slots during massive connection bursts. Second is UD needs the skb in the tx window for timeout/resends. If we free the skb right after the local tx cqe and the packet gets dropped on the wire, we lose the buffer before getting a real peer ack. We'd have to dynamically rebuild the creq payload from scratch on every timeout in that case.

seems we still mark ep as connected when receiving simultaneous creq at line 856 or after having received crep, so I wonder if this can be executed at all, when restricted to connected ep?

you are right, very first creq marks the ep as connected so it'll never get executed. This check was an "optimization" I added in the latest revision, but as you noted (and as the failing CI confirms), it's incorrect.
Does it make sense to unconditionally schedule ack here when dropping duplicate creq or crep as any duplicate creq or crep arriving at this point has a matching conn_sn which definitively means the peer is stuck waiting because our previously sent crep or ack was lost on the wire.

i think the case mentioned in the comment where send1_a, retransmit send1_b, and ack_a from send1_a releases send1_b while it is still being sent, could be also valid. maybe we would only free skb on tx cqe poll, not adding new signaled cqe but using tx cqe poll from existing signaled and covering previous ranges? tx window would be cleaned independently, but without actually releasing the skbs?

…nection When two endpoints execute a simultaneous connection establishment, both endpoints transmit a CREQ packet. Previously, the UD transport handled this collision by incorrectly treating the incoming CREQ as an implicit ACK for the outgoing CREQ, immediately releasing the outgoing skb back to the memory pool. This triggers a use-after-free bug when the CREQ payload exceeds `max_inline`, as the skb is freed while the NIC is still performing async DMA read from that memory, causing silent data corruption. Fix this by removing the implicit ACK logic on simultaneous CREQs. The endpoints now correctly rely on the subsequent CREP packets to acknowledge their initial connection requests. Additionally, handle the resulting deadlock when a CREP is dropped. If an endpoint receives a duplicate CREQ while still establishing a connection, it now schedules an explicit ACK_REQ before discarding it. This allows the stuck peer to free its unacknowledged CREQ.

sunnytiwari-e4 force-pushed the master branch 5 times, most recently from 91bf609 to 69e0bf6 Compare March 11, 2026 09:16

tvegas1 reviewed Mar 11, 2026

View reviewed changes

sunnytiwari-e4 force-pushed the master branch from 69e0bf6 to 8edd134 Compare March 11, 2026 15:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UCT/IB/UD: Fix skb use-after-free on simultaneous CREQ#11246

UCT/IB/UD: Fix skb use-after-free on simultaneous CREQ#11246
sunnytiwari-e4 wants to merge 1 commit intoopenucx:masterfrom
sunnytiwari-e4:master

sunnytiwari-e4 commented Mar 10, 2026 •

edited

Loading

Uh oh!

tvegas1 left a comment •

edited

Loading

Uh oh!

tvegas1 Mar 11, 2026

Uh oh!

sunnytiwari-e4 Mar 11, 2026

Uh oh!

tvegas1 Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sunnytiwari-e4 commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What?

Why?

How?

Uh oh!

tvegas1 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tvegas1 Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

sunnytiwari-e4 Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

tvegas1 Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sunnytiwari-e4 commented Mar 10, 2026 •

edited

Loading

tvegas1 left a comment •

edited

Loading