Description
🐛 Bug in CTC forced alignment: softmax used instead of log_softmax
There is a bug in the CTC forced alignment implementation of the SenseVoice model: softmax is used where log_softmax is required, so the alignment is computed on probabilities instead of log probabilities and produces incorrect results.
To Reproduce
No runtime reproduction is needed; the bug is visible by inspecting the code below.
Code sample
branch: main @252eef8b8b29b603d10bc640bc4f0c3fe12c3604
Location: funasr/models/sense_voice/model.py, line 933
Current (incorrect) code:
logits_speech = self.ctc.softmax(encoder_out)[i, 4 : encoder_out_lens[i].item(), :]
pred = logits_speech.argmax(-1).cpu()
logits_speech[pred == self.blank_id, self.blank_id] = 0
align = ctc_forced_align(
    logits_speech.unsqueeze(0).float(),
    torch.Tensor(token_ids).unsqueeze(0).long().to(logits_speech.device),
    (encoder_out_lens[i] - 4).long(),
    torch.tensor(len(token_ids)).unsqueeze(0).long().to(logits_speech.device),
    ignore_id=self.ignore_id,
)
The issue: The ctc_forced_align function expects log probabilities, not regular probabilities.
Evidence from funasr/models/sense_voice/utils/ctc_alignment.py:
- Line 3: Parameter is named log_probs
- Line 12: Docstring states: "log_probs (Tensor): log probability of CTC emission output."
- Line 53: Uses log-space arithmetic:
best_score[:, padding_num:] = log_probs[:, t].gather(-1, _t_a_r_g_e_t_s_) + prev_max_value
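The addition on line 53 is only meaningful in log space: the probability of a path is a product of per-frame emission probabilities, and that product becomes a sum of log probabilities. A minimal pure-Python sketch (toy two-frame logits, no torch; all names here are illustrative, not from the FunASR codebase):

```python
import math

def softmax(xs):
    # Standard softmax: exponentiate (shifted for stability) and normalize.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def log_softmax(xs):
    # Numerically stable log softmax: x - logsumexp(x).
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

# Toy logits for two frames of a 3-symbol vocabulary.
logits_t0 = [2.0, 0.5, -1.0]
logits_t1 = [0.3, 1.7, -0.2]

# Probability of emitting symbol 0 at t=0 AND symbol 1 at t=1 is a product...
p_path = softmax(logits_t0)[0] * softmax(logits_t1)[1]
# ...which in log space is a sum -- exactly the "+" used on line 53.
logp_path = log_softmax(logits_t0)[0] + log_softmax(logits_t1)[1]
assert abs(math.log(p_path) - logp_path) < 1e-9

# Feeding raw softmax outputs into that "+" instead adds probabilities,
# which has no probabilistic meaning for a path score (it can even exceed 1):
wrong = softmax(logits_t0)[0] + softmax(logits_t1)[1]
print(p_path, logp_path, wrong)
```

So when `ctc_forced_align` receives softmax outputs, its accumulated "scores" are sums of probabilities rather than log-probabilities of paths, and the Viterbi-style comparison of path scores is no longer valid.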
Expected behavior
The code should use log_softmax instead of softmax:
logits_speech = self.ctc.log_softmax(encoder_out)[i, 4 : encoder_out_lens[i].item(), :]
This will provide log probabilities (range: -∞ to 0) as expected by the ctc_forced_align function, instead of regular probabilities (range: 0 to 1).
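A small pure-Python sketch of the range difference (toy logits, no torch). Note that because log is monotone, the `argmax` used to compute `pred` is unchanged by the fix:

```python
import math

logits = [1.2, -0.3, 3.1, 0.0]

# softmax: exponentiate and normalize -> values in (0, 1) summing to 1.
exps = [math.exp(x) for x in logits]
p = [e / sum(exps) for e in exps]

# log_softmax is the log of softmax -> values in (-inf, 0].
lp = [math.log(v) for v in p]

assert all(0.0 < v < 1.0 for v in p)
assert all(v <= 0.0 for v in lp)
# Monotonicity of log: the argmax is identical, so the
# `pred = logits_speech.argmax(-1)` line behaves the same after the fix.
assert max(range(len(p)), key=p.__getitem__) == max(range(len(lp)), key=lp.__getitem__)
```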
Environment
Not relevant: the bug is visible in the source code regardless of environment.