Conversation

riteshbhirud (Contributor) commented Sep 6, 2025

This PR covers two things:

1. Fixes two bugs I noticed in the CPU-based non-Gaussian states:

- Incorrect normalization for squeezed cat states

Issue: The normalization calculation for squeezed cat states was using the coherent state overlap formula exp(-2|α|²) regardless of whether squeezing was applied. This is mathematically incorrect for squeezed states, where the overlap ⟨α,r,θ|-α,r,θ⟩ depends on both the displacement and squeeze parameters.
Fix: Added proper branch logic to compute normalization differently for coherent vs squeezed cases:

Coherent states: Use analytical formula exp(-2|α|²)
Squeezed states: Use numerical overlap calculation via _overlap() function

This ensures squeezed cat states have the correct quantum-mechanical normalization; a minimal sketch of the branching is shown after this list.

- Wrong function call in multi-mode odd cat states

Issue: The multi-mode catstate_odd() function was incorrectly calling catstate_even() in its loop, causing all multi-mode "odd" cat states to actually be even cat states.
Fix: Changed the function call to catstate_odd() to match the intended behavior.

Please correct me if I'm misunderstanding anything here.
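
A minimal sketch of the branching described above (illustrative only: cat_normalization and overlap_numeric are hypothetical names standing in for the package's internal _overlap()-based logic, whose actual signature may differ):

# Illustrative sketch, not the actual Gabs internals.
# `overlap_numeric` stands in for the package's _overlap() routine.
function cat_normalization(α::Complex, r::Real, θ::Real; overlap_numeric = nothing)
    if iszero(r)
        # coherent case: analytic overlap ⟨α|-α⟩ = exp(-2|α|²)
        return exp(-2 * abs2(α))
    else
        # squeezed case: the overlap depends on r and θ as well,
        # so it must be evaluated numerically
        return overlap_numeric(α, r, θ)
    end
end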

2. Introduces a CUDA extension and Gaussian GPU functionality (non-Gaussian support will be done in a separate PR):

Simple GPU Workflow

using Gabs, CUDA

basis = QuadPairBasis(1)
vac_gpu = vacuumstate(basis) |> gpu
coh_gpu = coherentstate(basis, 1.0 + 0.5im) |> gpu

disp_gpu = displace(basis, 0.5) |> gpu
result_gpu = disp_gpu * vac_gpu

x_points = CuArray(randn(Float32, 2, 1000))
w_values = wigner(coh_gpu, x_points)

Device Management

state_gpu = coherentstate(basis, 1.0) |> gpu
op_gpu = displace(basis, 0.5) |> gpu
lc_gpu = linear_combination |> gpu   # any GaussianLinearCombination built on the CPU

state_cpu = state_gpu |> cpu
op_cpu = op_gpu |> cpu

@assert device(state_gpu) == :gpu
@assert device(state_cpu) == :cpu

Precision Control

state_f32 = gpu(state, precision=Float32)
state_f64 = gpu(state, precision=Float64)

@assert eltype(state_f32.mean) == Float32
@assert eltype(state_f64.mean) == Float64

Automatic GPU Dispatch

When you use GPU arrays as inputs, operations automatically happen on GPU:

α_gpu = CuArray([1.0f0 + 0.5f0im])
auto_gpu_state = coherentstate(basis, α_gpu)
@assert device(auto_gpu_state) == :gpu

GPU State Creation

Basic Gaussian States

basis = QuadPairBasis(1)

vac_gpu = vacuumstate(basis) |> gpu
coh_gpu = coherentstate(basis, 1.0f0 + 0.5f0im) |> gpu  
sq_gpu = squeezedstate(basis, 0.3f0, Float32(π/4)) |> gpu
thermal_gpu = thermalstate(basis, 2.0f0) |> gpu

basis_multi = QuadPairBasis(3)
α_vec = [1.0f0, 0.5f0im, -0.3f0 + 0.2f0im]
coh_multi_gpu = coherentstate(basis_multi, α_vec) |> gpu

basis_2mode = QuadPairBasis(2)
epr_gpu = eprstate(basis_2mode, 0.3f0, Float32(π/4)) |> gpu

Typed GPU Constructors

For direct GPU creation with specific precision:

vac_gpu = vacuumstate(CuVector{Float32}, CuMatrix{Float32}, basis)
coh_gpu = coherentstate(CuVector{Float32}, CuMatrix{Float32}, basis, 1.0f0)
sq_gpu = squeezedstate(CuVector{Float64}, CuMatrix{Float64}, basis, 0.3, π/4)

GPU Operations

Gaussian Unitaries

basis = QuadPairBasis(1)

disp_gpu = displace(basis, 0.5f0 + 0.3f0im) |> gpu
squeeze_gpu = squeeze(basis, 0.3f0, Float32(π/4)) |> gpu
phase_gpu = phaseshift(basis, Float32(π/3)) |> gpu

basis_2mode = QuadPairBasis(2)
bs_gpu = beamsplitter(basis_2mode, 0.7f0) |> gpu
twosq_gpu = twosqueeze(basis_2mode, 0.2f0, Float32(π/6)) |> gpu

state_gpu = coherentstate(basis, 1.0f0) |> gpu
transformed_gpu = squeeze_gpu * state_gpu

Gaussian Channels

att_gpu = attenuator(basis, π/6, 2.0f0) |> gpu
amp_gpu = amplifier(basis, 0.2f0, 1.5f0) |> gpu

noisy_state = att_gpu * state_gpu
amplified_state = amp_gpu * state_gpu

Mixed Device Operations

Operations work seamlessly across CPU/GPU:

cpu_state = coherentstate(basis, 1.0)
gpu_op = displace(basis, 0.5) |> gpu
result = gpu_op * cpu_state

@assert device(result) == :gpu

Tensor Products and Partial Traces

state1_gpu = coherentstate(basis, 1.0f0) |> gpu
state2_gpu = squeezedstate(basis, 0.3f0, 0.0f0) |> gpu
tensor_gpu = tensor(state1_gpu, state2_gpu)

basis_3mode = QuadPairBasis(3)
big_state_gpu = coherentstate(basis_3mode, [1.0f0, 0.5f0, -0.3f0]) |> gpu
traced_gpu = ptrace(big_state_gpu, [1, 3])

GPU Linear Combinations

Creating Superposition States

basis = QuadPairBasis(1)

state1 = coherentstate(basis, 1.0f0) |> gpu
state2 = coherentstate(basis, -1.0f0) |> gpu
state3 = squeezedstate(basis, 0.2f0, 0.0f0) |> gpu

coeffs = [0.6f0, 0.3f0, 0.1f0]
superposition = GaussianLinearCombination(basis, coeffs, [state1, state2, state3])

@assert device(superposition) == :gpu
@assert superposition.coeffs isa Vector{Float32}
@assert device(superposition.states[1]) == :gpu

Linear Combination Arithmetic

lc1 = GaussianLinearCombination(basis, [0.7f0, 0.3f0], [state1, state2])
lc2 = GaussianLinearCombination(basis, [0.5f0], [state3])

sum_lc = lc1 + lc2
diff_lc = lc1 - lc2

scaled_lc = 2.0f0 * lc1
normalized_lc = lc1 / norm(lc1.coeffs)

neg_lc = -lc1

Normalization and Simplification

unnormalized_lc = GaussianLinearCombination(basis, [3.0f0, 4.0f0], [state1, state2])
normalize!(unnormalized_lc)
@assert norm(unnormalized_lc.coeffs) ≈ 1.0

Gabs.simplify!(unnormalized_lc)   # simplify the linear combination in place

Operations on Linear Combinations

squeeze_op = squeeze(basis, 0.2f0, 0.0f0) |> gpu
squeezed_superposition = squeeze_op * superposition

att_channel = attenuator(basis, π/4, 1.0f0) |> gpu
noisy_superposition = att_channel * superposition

@assert squeezed_superposition.coeffs ≈ superposition.coeffs

State Metrics

@assert purity(superposition) == 1.0
@assert entropy_vn(superposition) == 0.0

Wigner Functions on GPU

Single-Point Evaluation

state_gpu = coherentstate(basis, 1.0f0) |> gpu

x_point = [0.5f0, -0.3f0]
w_value = wigner(state_gpu, x_point)

xi_point = [0.2f0, 0.4f0]
chi_value = wignerchar(state_gpu, xi_point)

Batch Evaluation (High Performance)

n_points = 10000
x_grid = CuArray(randn(Float32, 2, n_points))
xi_grid = CuArray(randn(Float32, 2, n_points))

w_batch = wigner(state_gpu, x_grid)
chi_batch = wignerchar(state_gpu, xi_grid)

@assert length(w_batch) == n_points
@assert length(chi_batch) == n_points

Measured speedups:

  • 1,000 points: 40x speedup
  • 10,000 points: 300x speedup
  • 25,000 points: 1,120x speedup

Multi-Mode Wigner Functions

basis_2mode = QuadPairBasis(2)
state_2mode_gpu = coherentstate(basis_2mode, [1.0f0, 0.5f0]) |> gpu

x_2mode = CuArray(randn(Float32, 4, 1000))
w_2mode = wigner(state_2mode_gpu, x_2mode)

Multi-mode systems show excellent GPU performance:

  • 2-mode, 5,000 points: 162x speedup

Cross-Wigner Functions

state1_gpu = coherentstate(basis, 1.0f0) |> gpu
state2_gpu = squeezedstate(basis, 0.3f0, Float32(π/4)) |> gpu

x_point = [0.0f0, 0.0f0]
cross_w = cross_wigner(state1_gpu, state2_gpu, x_point)
cross_chi = cross_wignerchar(state1_gpu, state2_gpu, [0.1f0, 0.2f0])

@assert cross_wigner(state1_gpu, state2_gpu, x_point) ≈
        conj(cross_wigner(state2_gpu, state1_gpu, x_point))

Interference Wigner Functions

superposition = GaussianLinearCombination(basis, [0.6f0, 0.8f0], [state1_gpu, state2_gpu])

x_points = CuArray(randn(Float32, 2, 5000))
w_interference = wigner(superposition, x_points)
chi_interference = wignerchar(superposition, x_points)

Interference calculations show excellent GPU performance:

  • 1,000 points: 31x speedup
  • 10,000 points: 333x speedup
  • 50,000 points: 1,705x speedup

Batch Operations for Maximum GPU Advantage

# avoid this: evaluating points one at a time
for i in 1:1000
    w = wigner(state_gpu, x_points[:, i])
end

# instead, evaluate all points in one batched call:
w_all = wigner(state_gpu, x_points_gpu)

Measured Performance Scaling

Based on benchmark results:

Wigner function performance scaling:

  • 100 points: 4x speedup
  • 500 points: 27x speedup
  • 1,000 points: 40x speedup
  • 5,000 points: 241x speedup
  • 10,000 points: 305x speedup
  • 25,000 points: 1,407x speedup

Interference Wigner functions:

  • 1,000 points (2-state): 31x speedup
  • 10,000 points (2-state): 333x speedup
  • 50,000 points (2-state): 1,705x speedup

Multi-mode systems:

  • 2-mode, 1K points: 39x speedup
  • 3-mode, 500 points: 21x speedup

Measured Performance Gains:

  • Batch Wigner functions: 40x - 1,400x speedup (scales with problem size)
  • Interference calculations: 30x - 1,700x speedup for superposition states
  • Cross-Wigner functions: Up to 950x speedup for large batches
  • Multi-mode systems: 160x+ speedup for 2+ mode calculations
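
For reference, a rough sketch of how one batch-Wigner data point could be reproduced. This is an assumption about the setup (per-point CPU loop as baseline, BenchmarkTools for timing); the actual benchmark methodology behind the numbers above may differ, and results depend on hardware.

using Gabs, CUDA, BenchmarkTools

basis = QuadPairBasis(1)
state_cpu = coherentstate(basis, 1.0f0)
state_gpu = state_cpu |> gpu

n = 10_000
x_cpu = randn(Float32, 2, n)
x_gpu = CuArray(x_cpu)

# CPU baseline: evaluate the Wigner function point by point
t_cpu = @belapsed [wigner($state_cpu, $x_cpu[:, i]) for i in 1:$n]
# GPU: a single batched call over all points
t_gpu = @belapsed CUDA.@sync wigner($state_gpu, $x_gpu)
println("speedup ≈ ", round(t_cpu / t_gpu, digits = 1))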

riteshbhirud (Contributor, Author) commented:

For now I placed all the GPU tests and benchmarks (commented out for now) in the test_CUDAExt file, as I was not sure where to place them since it's an extension.

@riteshbhirud riteshbhirud changed the title Gaussian GPU support and CPU non_gaussian bug fix Gaussian GPU support Sep 6, 2025
@riteshbhirud riteshbhirud changed the title Gaussian GPU support CUDA and Gaussian GPU support Sep 6, 2025
codecov bot commented Sep 10, 2025

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
