Skip to content

Conversation

@CZH-THU
Copy link

@CZH-THU CZH-THU commented Dec 22, 2025

If mask is action_mask[:,start:] = attention_mask[:, 1:][:,start:] and ends=start + action_mask[:, start:].sum(1)+1 then rewards[:,ends[j]-1] represent the last non-padding token's reward by predicting padding token which action_mask is 0,The reward_score should give the penultimate non-padding token to reward it's action —— predicting the last non-padding token.

Signed-off-by Zehao Chen <chenzh22@tsinghua.org.cn>
@CZH-THU CZH-THU requested a review from tjruwase as a code owner December 22, 2025 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant