@yoyolicoris yoyolicoris commented May 4, 2025

import torch
import torchlpc
from numba import cuda

from timeit import timeit


def numba_scan(impulse, decay, initial_state):
    out = torch.empty_like(impulse)
    n_dims, n_steps = decay.shape
    torchlpc.recurrence.compute_linear_recurrence(
        cuda.as_cuda_array(decay.detach()),
        cuda.as_cuda_array(impulse.detach()),
        cuda.as_cuda_array(initial_state.detach()),
        cuda.as_cuda_array(out),
        n_dims,
        n_steps,
    )
    return out


batch_size = 64
samples = 2**19

x = torch.randn(batch_size, samples).cuda()
A = torch.rand(batch_size, samples).cuda() * 2 - 1
zi = torch.randn(batch_size).cuda()

t_torch = timeit(
    "torch.ops.torchlpc.scan(x, A, zi)",
    globals=globals(),
    number=100,
)
print(f"Torch time: {t_torch:.4f} seconds")

# Warm-up call, presumably so that Numba's one-time JIT compilation is not timed.
numba_scan(x, A, zi)

t_numba = timeit(
    "numba_scan(x, A, zi)",
    globals=globals(),
    number=100,
)

print(f"Numba time: {t_numba:.4f} seconds")

print(f"Torch is {t_numba / t_torch:.2f}x faster than Numba")

Results on a Linux machine with a 5060 Ti GPU:

Torch time: 0.4350 seconds
Numba time: 0.4457 seconds
Torch is 1.02x faster than Numba

It's roughly the same as before.
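
CUDA kernels are launched asynchronously, so a host-side timer can stop before the queued work has finished. Below is a minimal sketch of a timing helper that waits for the device before reading the clock. `timed` and its `sync` parameter are hypothetical names, not part of torchlpc or of the benchmark above; on a GPU one would pass `torch.cuda.synchronize` as `sync`.

```python
import time


def timed(fn, number=100, sync=lambda: None):
    """Return the total wall-clock seconds taken by `number` calls of `fn`.

    `sync` is called before the clock starts and again before it stops.
    Assumption: on a CUDA device this would be `torch.cuda.synchronize`;
    the default is a no-op so the helper also works on the CPU.
    """
    fn()    # warm-up call, excluded from the measurement
    sync()  # drain any pending work before starting the clock
    start = time.perf_counter()
    for _ in range(number):
        fn()
    sync()  # wait for queued work before stopping the clock
    return time.perf_counter() - start
```

With the tensors defined above, a call might look like `timed(lambda: torch.ops.torchlpc.scan(x, A, zi), sync=torch.cuda.synchronize)`. Whether this changes the reported numbers noticeably has not been measured here.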

@yoyolicoris yoyolicoris requested a review from Copilot May 4, 2025 09:17
Copilot AI left a comment

Pull Request Overview

This PR reintroduces the original CUDA scan implementation for linear RNN recurrence computation while also cleaning up and reorganizing the recurrence functions and extension API. Key changes include:

  • Splitting recurrence functionality into separate _cuda_recurrence and _cpu_recurrence functions.
  • Renaming and re-registering C++ extension functions from “scan_cpu”/“lpc_cpu” to “scan”/“lpc”.
  • Updating tests and setup configuration to support the new extension API and parameterizations.
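
For reference, the recurrence that the scan computes can be sketched sequentially in plain Python. This assumes the first-order form y[t] = decay[t] * y[t-1] + impulse[t] with y[-1] = initial_state, which is inferred from the argument names in the benchmark script rather than confirmed by the PR. `reference_scan` is a hypothetical helper, not the PR's implementation.

```python
def reference_scan(impulse, decay, initial_state):
    """Sequential reference for a single sequence.

    Assumed recurrence: y[t] = decay[t] * y[t-1] + impulse[t],
    starting from y[-1] = initial_state.
    """
    out = []
    prev = initial_state
    for a, x in zip(decay, impulse):
        prev = a * prev + x
        out.append(prev)
    return out


print(reference_scan([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], 5.0))  # [1.0, 3.0, 9.0]
```

A loop like this is useful mainly as a correctness check against the CPU and CUDA paths on short inputs; it is far too slow for the sequence lengths used in the benchmark.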

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

File                              Description
torchlpc/recurrence.py            Refactored recurrence functions with separate CPU/CUDA paths.
torchlpc/csrc/scan_cpu.cpp        Updated API registration and const/mutable data pointer usage.
torchlpc/csrc/cuda/LICENSE.txt    Added license text for CUDA code.
torchlpc/core.py                  Updated operator calls to reflect new API names.
torchlpc/__init__.py              Changed extension loading logic using try-except for _C import.
tests/test_grad.py                Added parameterization for complex data tests.
tests/test_extension.py           Updated tests to use new operator names and added device parameter.
setup.py                          Revised extension build configuration and source file discovery.
Files not reviewed (1)
  • torchlpc/csrc/cuda/linear_recurrence.cu: Language not supported
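
The try-except loading of the `_C` extension mentioned in the file summary follows a common pattern for optional compiled modules. Here is a minimal sketch of that pattern, assuming a hypothetical helper `load_optional`; the code actually merged in this PR is not shown in the conversation and may differ.

```python
import importlib
import warnings


def load_optional(module_name):
    """Try to import a compiled extension; return None if it is unavailable."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:  # also covers ModuleNotFoundError
        warnings.warn(f"Could not load {module_name}: {exc}; using fallback path.")
        return None
```

A package could then call `load_optional("torchlpc._C")` once at import time and choose between the compiled operators and a pure-Python fallback depending on whether the result is `None`.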

@yoyolicoris yoyolicoris merged commit c662e34 into main May 4, 2025
6 of 8 checks passed
@yoyolicoris yoyolicoris deleted the feat/cuda-scan branch May 4, 2025 10:31