
Commit 6cf8a1f

Improves attention bias numerical stability
Replaces $\exp(A\cdot\mathrm{softplus}(\Delta V))$ with $A\cdot\mathrm{softplus}(\Delta V)$ to prevent overflow/NaNs in the attention bias and stabilize training/inference. Preserves tensor shape/dtype and adds a clarifying comment on the rationale.
1 parent 78bb93d commit 6cf8a1f

File tree

1 file changed: +2 −1 lines changed

examples/modeling/modeling_doge.py

Lines changed: 2 additions & 1 deletion
@@ -217,7 +217,8 @@ def forward(
         dt_states = self.dt_proj(
             value_states.transpose(1, 2).reshape(value_states.shape[0], value_states.shape[-2], -1)
         )
-        attn_bias = torch.exp(self.A * F.softplus(dt_states)).transpose(-1, -2).unsqueeze(-2).to(hidden_states.dtype)
+        # original formula is exp(A * softplus(delta V)), but for numerical stability, it is changed to A * softplus(delta V)
+        attn_bias = self.A * F.softplus(dt_states).transpose(-1, -2).unsqueeze(-2).to(hidden_states.dtype)
 
         attention_interface: Callable = flash_dynamic_mask_attention_forward

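To make the rationale concrete, here is a minimal standalone sketch of the overflow behaviour (assuming PyTorch; `A`, `dt_states`, and their magnitudes are hypothetical and not taken from the Doge module): exp(A * softplus(dt)) grows exponentially in dt, so moderately large projections already overflow to inf and then surface as NaNs, whereas A * softplus(dt) grows only linearly and stays finite.

# Minimal sketch of the numerical-stability rationale; not the model's actual code.
# Shapes and magnitudes are illustrative only.
import torch
import torch.nn.functional as F

A = torch.full((8,), 4.0)                    # hypothetical per-head scale (analogue of self.A)
dt_states = torch.randn(2, 128, 8) * 20.0    # hypothetical dt_proj output with large values

# Old form: exp(A * softplus(dt)) overflows once its argument exceeds ~88 in
# float32 (or ~11 in float16), so large dt values turn the bias into inf,
# which then propagates as NaN through the attention scores.
old_bias = torch.exp(A * F.softplus(dt_states)).to(torch.float16)
print("old form has inf:", torch.isinf(old_bias).any().item())   # True for these magnitudes

# New form: A * softplus(dt) grows only linearly in dt and stays finite,
# so the cast to the model dtype is safe.
new_bias = (A * F.softplus(dt_states)).to(torch.float16)
print("new form has inf:", torch.isinf(new_bias).any().item())   # False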