[SP] add SP deny list instead of allow #7887
kashif wants to merge 12 commits into deepspeedai:master from …
Conversation
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
tohtana left a comment
Hi @kashif,
Thank you for opening this PR! I think supporting HF hub kernels is a significant update.
Regarding the approach, we check if core_attn_implementation is in ALL_ATTENTION_FUNCTIONS, but HF hub kernels like kernels-community/flash-attn2 are not in the list. So HF hub kernels still won't be available with this fix.
We probably need to do the proper registration steps:
- Reject known-bad impls explicitly: eager, flex_attention, and probably paged|eager.
- If core_attn_implementation is an HF hub kernel string, call the HF registration path first (using lazy_import_flash_attention(…)).
- Then read core_attn_function = ALL_ATTENTION_FUNCTIONS[core_attn_implementation].
- Build uattn from that original function.
- Replace that key with uattn_wrapper.
Does it make sense to you?
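For concreteness, here is a minimal sketch of that flow, not the PR's actual code. Assumptions: ALL_ATTENTION_FUNCTIONS is importable from transformers.modeling_utils, lazy_import_flash_attention is the HF helper mentioned above (import path assumed), and make_uattn_wrapper is a hypothetical factory standing in for the Ulysses wrapper construction:

```python
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

DENY_LIST = ("eager", "flex_attention", "paged|eager")

def register_ulysses_attn(core_attn_implementation: str, make_uattn_wrapper):
    # 1. reject known-bad implementations explicitly
    if core_attn_implementation in DENY_LIST:
        raise ValueError(f"{core_attn_implementation} is not supported with Ulysses SP")

    # 2. HF hub kernel strings (e.g. "kernels-community/flash-attn2") are not
    #    pre-registered, so run the HF registration path first
    #    (import path of lazy_import_flash_attention is an assumption here)
    if core_attn_implementation not in ALL_ATTENTION_FUNCTIONS:
        from transformers.modeling_flash_attention_utils import lazy_import_flash_attention
        lazy_import_flash_attention(core_attn_implementation)

    # 3. read the original function, 4. build uattn from it,
    # 5. replace the key with the wrapper
    core_attn_function = ALL_ATTENTION_FUNCTIONS[core_attn_implementation]
    ALL_ATTENTION_FUNCTIONS[core_attn_implementation] = make_uattn_wrapper(core_attn_function)
```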
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Thanks @tohtana, I have tried to fix all the issues raised; could you kindly check again?
We actually don't know if flex_attention is bad, we just haven't tried it out. Do you have resources to try it out, Kashif? Same for the others on the list. That's why we started with an allow list rather than a deny list. The only reason eager is denied is that it requires a 4D attention_mask, which is a bad idea for long sequences.

BTW, SDPA is silently broken with packed samples - when there is no attn mask, it ignores pos ids and attends to the whole sequence instead. Expect bad results. Not sure how to flag that to users - probably we need to inspect pos ids, see if they reset at least once, and disallow sdpa then.
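A hedged sketch of the inspection Stas suggests, assuming position_ids of shape [batch, seq_len]; the function name is illustrative:

```python
import torch

def looks_packed(position_ids: torch.Tensor) -> bool:
    """True if any row's position ids reset, i.e. look like [0..N, 0..M, ...]."""
    # a reset shows up as a negative step between consecutive position ids
    diffs = position_ids[:, 1:] - position_ids[:, :-1]
    return bool((diffs < 0).any())
```

If this returns True and the user asked for sdpa with no attention mask, the adapter could then refuse to proceed.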
Hi @kashif, I also think Stas's comment makes sense. Can you try implementing such a validation?
Sure @tohtana, I can check.
To make things more exact: it's packed samples + pos ids + 4D.
Oh, Kashif, I'm being told …
I ran some experiments comparing flash_attention_2, sdpa, and flex_attention with SP=4 on Qwen3-4B (GQA: 32 Q heads / 8 KV heads).

Without SP (1 GPU baseline): flash_attention_2 and sdpa produce identical losses, confirming the backends are equivalent without SP.

With SP=4 (4 GPUs): sdpa and flex_attention match each other, but both diverge significantly from flash_attention_2.

@stas00 any ideas on what flash_attention_2 might be doing differently after the all-to-all that could explain this divergence?
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Ok @stas00, I now generate position_ids if missing from the batch, build a causal BlockMask for flex_attention, and do a one-time validation for packed samples + sdpa/eager. Now the outputs are matching: …
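For reference, a minimal sketch (not the PR's exact code) of building a document-causal BlockMask for flex_attention from packed position_ids, where each reset to 0 is taken to start a new document:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

def packed_causal_block_mask(position_ids: torch.Tensor):
    # position_ids: [batch, seq_len], e.g. [0,1,2, 0,1, 0,1,2,3] for 3 packed docs
    B, S = position_ids.shape
    # each position-id reset to 0 starts a new document
    doc_ids = torch.cumsum((position_ids == 0).long(), dim=-1)

    def mask_mod(b, h, q_idx, kv_idx):
        same_doc = doc_ids[b, q_idx] == doc_ids[b, kv_idx]
        return same_doc & (q_idx >= kv_idx)  # causal within each document

    # H=None broadcasts the mask over all heads
    return create_block_mask(mask_mod, B, None, S, S, device=str(position_ids.device))
```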
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Thank you for running those quality comparison experiments, Kashif.

I'm a bit unclear about your last "success" comment - what was missing to make FA2 match? Are you saying the mismatch was from missing position_ids? But we already said that SDPA (and now most likely FlexAttention) have trouble with no-attn-mask / yes-pos-id and will ignore packed samples. FA2, on the other hand, does the right thing here.

And it's great to hear Flex Attention works with Ulysses as well, so we could add it to the allow list.
```python
if has_packed_samples and self.core_attn_implementation in ("sdpa", "eager"):
    raise ValueError(
```
heh, I thought we were discussing that it's HF Transformers that has to do that, not Ulysses SP. It affects all users regardless of whether they use Ulysses or not. Unless HF Transformers disallows not providing attn-mask with sdpa/eager, which I don't think is the case.
Agree, removed from the DeepSpeed side.
```python
# looks like packed sequences [0,...,N, 0,...,N, ...]. flash_attention_2 handles
# this via flash_varlen_fn, but sdpa/flex_attention apply full causal masking
# across the resets, producing incorrect attention.
if "position_ids" not in batch:
```
I'm not sure about this. This might lead to a user getting the wrong behavior if they packed samples but forgot to supply pos ids. Should we simply assert if pos ids aren't there and not potentially create invalid pos ids?
I agree there needs to be a check and it's not there.
Yes, it would need to be in the TRL trainer: the collator should always provide position_ids when SP is enabled, so the adapter never needs to generate them. I can try to fix it there.
Thank you, Kashif.
And probably then add an assert on the SP side if pos ids aren't there?
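Something like this hedged sketch on the SP side (names hypothetical):

```python
def validate_sp_batch(batch: dict) -> None:
    # fail fast instead of generating potentially invalid position ids
    if "position_ids" not in batch:
        raise ValueError(
            "Ulysses SP requires position_ids in the batch; make sure the "
            "collator provides them (resetting at each packed-sample boundary)."
        )
```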
So, FA2 was the one producing correct results, while SDPA/flex were wrong. Here's what was happening: FA2 "accidentally" handles the packed sequences correctly via its varlen path, while SDPA with no attention mask applies full causal masking across the sample resets.

The fix: generate position_ids when they are missing from the batch, and build a causal BlockMask for flex_attention.

With this fix, all three backends match within numerical precision: …
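On the TRL/collator side, "provide position_ids" for packed samples amounts to something like this sketch, where seq_lengths is a hypothetical list of per-document lengths:

```python
import torch

def packed_position_ids(seq_lengths: list[int]) -> torch.Tensor:
    # e.g. [3, 2, 4] -> [[0,1,2, 0,1, 0,1,2,3]]: ids reset at every boundary
    return torch.cat([torch.arange(n) for n in seq_lengths]).unsqueeze(0)
```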
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
great explanations, Kashif - thank you!
Thank you, Kashif.
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
@stas00, regarding point 2, we added …
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Thank you very much, Kashif. Do you think all this amazing tooling you added should live here and not in HF Transformers?
Checking.
So some SP-specific things tied to the all-to-all make sense to be here...
Agree that in Transformers: …
On the TRL side: …
Thank you for the detailed summary, Kashif. I agree with everything, except:

I think it should assert. Warnings don't work, and allowing invalid training can be so costly to the user who missed the warning in the sea of warnings. I wonder how many people will discover their model has been mistrained when they had no clue that was the case, other than getting bad outcomes.
Please let us know when things are ready for the final review, Kashif.
This way one can register kernels-based flash-attn with SP as well.