
Conversation

@griwes (Contributor) commented Dec 18, 2025

Description

This PR adds {lower,upper}_bound device algorithms to both CUB and c.parallel.

In the case of CUB, the implementation is very straightforward and directly follows the current implementation in Thrust (which I have cleaned up as a drive-by change).
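
For reference, a minimal usage sketch of the CUB side is below. It is an assumption-heavy illustration: the entry point name follows the one requested in the linked issue (`cub::DeviceFind::LowerBound`), the two calls follow CUB's usual two-phase temporary-storage convention, and the argument order mirrors the existing `thrust::lower_bound` overloads; the exact signature added by this PR may differ.

```cuda
// Hypothetical usage sketch only: the entry point name, parameter order, and
// header are assumptions based on the linked issue and the existing
// thrust::lower_bound overloads, not necessarily the exact API in this PR.
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void lower_bound_example(const int* d_haystack, int num_items,   // sorted input
                         const int* d_needles, int num_needles,  // search values
                         int* d_indices)                         // one index per needle
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;

  // First call with a null temp-storage pointer: query the required size.
  cub::DeviceFind::LowerBound(d_temp_storage, temp_storage_bytes,
                              d_haystack, num_items,
                              d_needles, num_needles,
                              d_indices);

  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Second call: perform the vectorized search.
  cub::DeviceFind::LowerBound(d_temp_storage, temp_storage_bytes,
                              d_haystack, num_items,
                              d_needles, num_needles,
                              d_indices);

  cudaFree(d_temp_storage);
}
```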

In the case of c.parallel, because of how CUB's for_each passes in kernel arguments, repeating the slight madness of the current construction of the for operator for the for_each algorithm itself felt beyond annoying, but I still needed a kernel pointer. So, instead of reusing the kernels available in CUB, I adapted the static-block-size for_each kernel to accept all the necessary arguments as separate kernel arguments, construct the for operator expected by CUB inside the kernel, and then invoke the CUB for_each agent with that operator. It is a manually constructed kernel, but it reuses both the agent code and the binary search helper types from CUB.
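
To make the shape of that concrete, here is a heavily simplified sketch. It is not the code in the PR: every identifier is a placeholder, the dispatch to CUB's for_each agent is replaced by a plain grid-stride loop, and the binary search helper is inlined by hand.

```cuda
// Placeholder sketch of "pass raw arguments, rebuild the for operator inside
// the kernel, then run a for_each-style loop over the needles". The real
// kernel hands the operator to CUB's for_each agent and uses CUB's binary
// search helper types instead of the inline loop and lower bound shown here.
#include <cuda/std/cstddef>

template <unsigned BlockThreads,
          class HaystackIt, class NeedleIt, class OutputIt, class CompareOp>
__global__ void lower_bound_kernel_sketch(HaystackIt haystack,
                                          cuda::std::size_t haystack_size,
                                          NeedleIt needles,
                                          cuda::std::size_t num_needles,
                                          OutputIt out,
                                          CompareOp compare)
{
  // 1. Reconstruct, from the individual kernel arguments, the "for operator"
  //    that CUB's for_each machinery expects: a callable over a flat index.
  auto op = [=](cuda::std::size_t i) {
    // Stand-in for CUB's binary search helpers: a classic lower bound over
    // the sorted haystack.
    cuda::std::size_t lo = 0;
    cuda::std::size_t hi = haystack_size;
    while (lo < hi)
    {
      const cuda::std::size_t mid = lo + (hi - lo) / 2;
      if (compare(haystack[mid], needles[i]))
      {
        lo = mid + 1;
      }
      else
      {
        hi = mid;
      }
    }
    out[i] = lo;
  };

  // 2. In the PR, `op` is passed to the existing CUB for_each agent so its
  //    tiling and bounds handling are reused; a grid-stride loop stands in
  //    for that agent here.
  for (cuda::std::size_t i = cuda::std::size_t(blockIdx.x) * BlockThreads + threadIdx.x;
       i < num_needles;
       i += cuda::std::size_t(gridDim.x) * BlockThreads)
  {
    op(i);
  }
}
```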

Resolves #6695

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners December 18, 2025 06:13
@griwes griwes requested a review from elstehle December 18, 2025 06:13

Comment on lines +59 to +60
const unsigned int thread_count = 256;
const size_t items_per_block = 512;
Contributor:

Are these essentially hardcoded somewhere in CUB as well?

@griwes (Author) replied:

Yes. There are no tunings for `for_each`, and this just follows that.

I'm planning to start working on a warp-level binary search algorithm in the new year, and then build a device-wide one on top of that as a replacement for the current approach; we'll do actual tunings then.
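
For concreteness, constants like these typically turn into a launch configuration along the following lines (illustrative only, not the exact c.parallel code; `num_needles`, `search_kernel`, and `stream` are placeholders):

```cuda
// Illustrative only: how a fixed block size and tile size become a grid size.
const unsigned int thread_count   = 256;  // threads per block
const size_t items_per_block      = 512;  // items handled per block (2 per thread)
const size_t num_blocks = (num_needles + items_per_block - 1) / items_per_block;
// search_kernel<<<num_blocks, thread_count, 0, stream>>>(/* ... */);
```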

@shwina (Contributor) left a comment:

The C side looks good to me. Thanks!


@griwes griwes enabled auto-merge (squash) December 19, 2025 22:51

@github-actions

😬 CI Workflow Results

🟥 Finished in 9h 20m: Pass: 98%/143 | Total: 7d 15h | Max: 5h 51m | Hits: 71%/186811

See results here.



Development

Successfully merging this pull request may close these issues.

[FEA]: Add vectorized cub::DeviceFind::UpperBound and cub::DeviceFind::LowerBound algorithms
