
[FEATURE REQUEST] axpby CuTe kernel implementation #13

@LoserCheems

Description

Problem statement

The BLAS level-1 axpby kernel does not yet have a CuTe backend implementation in this project. In the README BLAS table, the CuTe column for axpby is empty, leaving cross-backend coverage incomplete for this fundamental fused vector update.

Without a CuTe axpby kernel:

  • users cannot see how $y = \alpha x + \beta y$ maps onto CuTe primitives and memory abstractions,
  • there is no CuTe baseline for performance comparison against PyTorch and Triton axpby,
  • higher-level CuTe-based examples lack a standard axpby building block.

Proposed solution

Implement an axpby kernel using CuTe (the tensor/layout library shipped with CUTLASS), matching the mathematical semantics of the Python reference and fitting the project’s backend conventions.

Concretely:

  • Add a CuTe-based axpby kernel in the appropriate CuTe backend directory (once established in the codebase).
  • Implement the operation $y = \alpha x + \beta y$ for 1D vectors with attention to memory access patterns and performance (a minimal sketch follows this list).
  • Mirror the public API structure of other backends so callers can dispatch to CuTe axpby in a uniform way.
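
A minimal sketch of what such a kernel could look like, assuming a grid-stride launch and a hypothetical name `axpby_cute_kernel` (the file layout and public API are still to be decided, per the implementation details below):

```cpp
// A minimal sketch, not the project's final API: the kernel name and launch
// scheme are assumptions. Requires the CUTLASS headers for CuTe.
#include <cute/tensor.hpp>

template <class T>
__global__ void axpby_cute_kernel(T alpha, T const* x, T beta, T* y, int n)
{
  using namespace cute;

  // View the raw device pointers as 1-D CuTe tensors over global memory;
  // the default layout for a 1-D shape is contiguous (stride 1).
  Tensor gX = make_tensor(make_gmem_ptr(x), make_shape(n));
  Tensor gY = make_tensor(make_gmem_ptr(y), make_shape(n));

  // Grid-stride loop: accesses stay coalesced and one launch covers any n.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    gY(i) = alpha * gX(i) + beta * gY(i);
  }
}
```

A tuned version would likely tile the vectors with CuTe's `local_tile`/`local_partition` and use vectorized copies, but the elementwise form above already pins down the semantics.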

Alternatives considered

Alternatives such as omitting CuTe axpby or relying on other backends for this operation would:

  • reduce the educational value of comparing CuTe with PyTorch and Triton on a fused vector update,
  • leave the CuTe column incomplete for axpby in the README,
  • limit CuTe’s role as a first-class backend in kernel-course.

Implementation details

  • Decide the precise file layout and build integration for CuTe kernels in the repository.
  • Implement the axpby kernel using CuTe primitives that are idiomatic for 1D vector operations.
  • Ensure numerical equivalence with the Python reference and consistency with other backend semantics.
  • Integrate with future tests and benchmarks for axpby (a host-side verification sketch follows this list).
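
As a stand-in for those future tests, a host-side check could launch the kernel sketch above and compare against a scalar reference; everything here (sizes, tolerance handling, the `axpby_cute_kernel` name and launch configuration) is an assumption for illustration:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
  const int n = 1 << 20;
  const float alpha = 2.0f, beta = 0.5f;

  std::vector<float> hx(n), hy(n), ref(n);
  for (int i = 0; i < n; ++i) { hx[i] = 0.001f * i; hy[i] = 1.0f - 0.001f * i; }

  // Scalar reference, mirroring the Python definition: y = alpha*x + beta*y.
  for (int i = 0; i < n; ++i) ref[i] = alpha * hx[i] + beta * hy[i];

  float *dx, *dy;
  cudaMalloc(&dx, n * sizeof(float));
  cudaMalloc(&dy, n * sizeof(float));
  cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  // Hypothetical launch configuration for the axpby_cute_kernel sketch above.
  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;
  axpby_cute_kernel<<<blocks, threads>>>(alpha, dx, beta, dy, n);
  cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

  // Elementwise max-error check against the reference.
  float max_err = 0.0f;
  for (int i = 0; i < n; ++i)
    max_err = std::max(max_err, std::abs(hy[i] - ref[i]));
  std::printf("max abs error: %g\n", max_err);

  cudaFree(dx);
  cudaFree(dy);
  return 0;
}
```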

Use case

The CuTe axpby kernel will:

  • illustrate how a fused BLAS-1 operation is realized in CuTe,
  • enable detailed performance comparisons and tuning across backends,
  • serve as a foundational building block for more complex CuTe-based kernels.

Related work

  • CuTe/CUTLASS examples of vector and fused operations.
  • BLAS axpby in other GPU math libraries.

Additional context

This issue complements the axpby Python, PyTorch, and Triton feature requests and helps complete the CuTe column in the README BLAS table.
