Conversation

riteshbhirud (Contributor) commented Sep 6, 2025

This PR covers two things:

1. Fixes two bugs I noticed in the CPU-based non-Gaussian states:

- Incorrect normalization for squeezed cat states

Issue: The normalization calculation for squeezed cat states was using the coherent state overlap formula exp(-2|α|²) regardless of whether squeezing was applied. This is mathematically incorrect for squeezed states, where the overlap ⟨α,r,θ|-α,r,θ⟩ depends on both the displacement and squeeze parameters.
Fix: Added proper branch logic to compute normalization differently for coherent vs squeezed cases:

Coherent states: Use analytical formula exp(-2|α|²)
Squeezed states: Use numerical overlap calculation via _overlap() function

This ensures squeezed cat states have the correct quantum-mechanical normalization; a minimal sketch of the branching is shown after this list.

- Wrong function call in multi-mode odd cat states

Issue: The multi-mode catstate_odd() function was incorrectly calling catstate_even() in its loop, causing all multi-mode "odd" cat states to actually be even cat states.
Fix: Changed the function call to catstate_odd() to match the intended behavior.

Please correct me if I'm misunderstanding anything here.
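
A minimal sketch of the branching described above (illustrative only: cat_normalization and overlap_numeric are hypothetical names standing in for the package's internal _overlap()-based logic, whose actual signature may differ):

# Illustrative sketch, not the actual Gabs internals.
# `overlap_numeric` stands in for the package's _overlap() routine.
function cat_normalization(α::Complex, r::Real, θ::Real; overlap_numeric = nothing)
    if iszero(r)
        # coherent case: analytic overlap ⟨α|-α⟩ = exp(-2|α|²)
        return exp(-2 * abs2(α))
    else
        # squeezed case: the overlap depends on r and θ as well,
        # so it must be evaluated numerically
        return overlap_numeric(α, r, θ)
    end
end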

2. Introduces a CUDA extension and Gaussian GPU functionality (non-Gaussian support will be done in a separate PR):

Simple GPU Workflow

using Gabs, CUDA

basis = QuadPairBasis(1)
vac_gpu = vacuumstate(basis) |> gpu
coh_gpu = coherentstate(basis, 1.0 + 0.5im) |> gpu

disp_gpu = displace(basis, 0.5) |> gpu
result_gpu = disp_gpu * vac_gpu

x_points = CuArray(randn(Float32, 2, 1000))
w_values = wigner(coh_gpu, x_points)

Device Management

state_gpu = coherentstate(basis, 1.0) |> gpu
op_gpu = displace(basis, 0.5) |> gpu
lc_gpu = linear_combination |> gpu   # any GaussianLinearCombination built on the CPU

state_cpu = state_gpu |> cpu
op_cpu = op_gpu |> cpu

@assert device(state_gpu) == :gpu
@assert device(state_cpu) == :cpu

Precision Control

state_f32 = gpu(state, precision=Float32)
state_f64 = gpu(state, precision=Float64)

@assert eltype(state_f32.mean) == Float32
@assert eltype(state_f64.mean) == Float64

Automatic GPU Dispatch

When you use GPU arrays as inputs, operations automatically happen on GPU:

α_gpu = CuArray([1.0f0 + 0.5f0im])
auto_gpu_state = coherentstate(basis, α_gpu)
@assert device(auto_gpu_state) == :gpu

GPU State Creation

Basic Gaussian States

basis = QuadPairBasis(1)

vac_gpu = vacuumstate(basis) |> gpu
coh_gpu = coherentstate(basis, 1.0f0 + 0.5f0im) |> gpu  
sq_gpu = squeezedstate(basis, 0.3f0, Float32(π/4)) |> gpu
thermal_gpu = thermalstate(basis, 2.0f0) |> gpu

basis_multi = QuadPairBasis(3)
α_vec = [1.0f0, 0.5f0im, -0.3f0 + 0.2f0im]
coh_multi_gpu = coherentstate(basis_multi, α_vec) |> gpu

basis_2mode = QuadPairBasis(2)
epr_gpu = eprstate(basis_2mode, 0.3f0, Float32(π/4)) |> gpu

Typed GPU Constructors

For direct GPU creation with specific precision:

vac_gpu = vacuumstate(CuVector{Float32}, CuMatrix{Float32}, basis)
coh_gpu = coherentstate(CuVector{Float32}, CuMatrix{Float32}, basis, 1.0f0)
sq_gpu = squeezedstate(CuVector{Float64}, CuMatrix{Float64}, basis, 0.3, π/4)

GPU Operations

Gaussian Unitaries

basis = QuadPairBasis(1)

disp_gpu = displace(basis, 0.5f0 + 0.3f0im) |> gpu
squeeze_gpu = squeeze(basis, 0.3f0, Float32(π/4)) |> gpu
phase_gpu = phaseshift(basis, Float32(π/3)) |> gpu

basis_2mode = QuadPairBasis(2)
bs_gpu = beamsplitter(basis_2mode, 0.7f0) |> gpu
twosq_gpu = twosqueeze(basis_2mode, 0.2f0, Float32(π/6)) |> gpu

state_gpu = coherentstate(basis, 1.0f0) |> gpu
transformed_gpu = squeeze_gpu * state_gpu

Gaussian Channels

att_gpu = attenuator(basis, π/6, 2.0f0) |> gpu
amp_gpu = amplifier(basis, 0.2f0, 1.5f0) |> gpu

noisy_state = att_gpu * state_gpu
amplified_state = amp_gpu * state_gpu

Mixed Device Operations

Operations work seamlessly across CPU/GPU:

cpu_state = coherentstate(basis, 1.0)
gpu_op = displace(basis, 0.5) |> gpu
result = gpu_op * cpu_state

@assert device(result) == :gpu

Tensor Products and Partial Traces

state1_gpu = coherentstate(basis, 1.0f0) |> gpu
state2_gpu = squeezedstate(basis, 0.3f0, 0.0f0) |> gpu
tensor_gpu = tensor(state1_gpu, state2_gpu)

basis_3mode = QuadPairBasis(3)
big_state_gpu = coherentstate(basis_3mode, [1.0f0, 0.5f0, -0.3f0]) |> gpu
traced_gpu = ptrace(big_state_gpu, [1, 3])

GPU Linear Combinations

Creating Superposition States

basis = QuadPairBasis(1)

state1 = coherentstate(basis, 1.0f0) |> gpu
state2 = coherentstate(basis, -1.0f0) |> gpu
state3 = squeezedstate(basis, 0.2f0, 0.0f0) |> gpu

coeffs = [0.6f0, 0.3f0, 0.1f0]
superposition = GaussianLinearCombination(basis, coeffs, [state1, state2, state3])

@assert device(superposition) == :gpu
@assert superposition.coeffs isa Vector{Float32}
@assert device(superposition.states[1]) == :gpu

Linear Combination Arithmetic

lc1 = GaussianLinearCombination(basis, [0.7f0, 0.3f0], [state1, state2])
lc2 = GaussianLinearCombination(basis, [0.5f0], [state3])

sum_lc = lc1 + lc2
diff_lc = lc1 - lc2

scaled_lc = 2.0f0 * lc1
normalized_lc = lc1 / norm(lc1.coeffs)

neg_lc = -lc1

Normalization and Simplification

unnormalized_lc = GaussianLinearCombination(basis, [3.0f0, 4.0f0], [state1, state2])
normalize!(unnormalized_lc)
@assert norm(unnormalized_lc.coeffs) ≈ 1.0

Gabs.simplify!(unnormalized_lc)   # simplify the linear combination in place

Operations on Linear Combinations

squeeze_op = squeeze(basis, 0.2f0, 0.0f0) |> gpu
squeezed_superposition = squeeze_op * superposition

att_channel = attenuator(basis, π/4, 1.0f0) |> gpu
noisy_superposition = att_channel * superposition

@assert squeezed_superposition.coeffs ≈ superposition.coeffs

State Metrics

@assert purity(superposition) == 1.0
@assert entropy_vn(superposition) == 0.0

Wigner Functions on GPU

Single-Point Evaluation

state_gpu = coherentstate(basis, 1.0f0) |> gpu

x_point = [0.5f0, -0.3f0]
w_value = wigner(state_gpu, x_point)

xi_point = [0.2f0, 0.4f0]
chi_value = wignerchar(state_gpu, xi_point)

Batch Evaluation (High Performance)

n_points = 10000
x_grid = CuArray(randn(Float32, 2, n_points))
xi_grid = CuArray(randn(Float32, 2, n_points))

w_batch = wigner(state_gpu, x_grid)
chi_batch = wignerchar(state_gpu, xi_grid)

@assert length(w_batch) == n_points
@assert length(chi_batch) == n_points

Measured speedups:

  • 1,000 points: 40x speedup
  • 10,000 points: 300x speedup
  • 25,000 points: 1,120x speedup

Multi-Mode Wigner Functions

basis_2mode = QuadPairBasis(2)
state_2mode_gpu = coherentstate(basis_2mode, [1.0f0, 0.5f0]) |> gpu

x_2mode = CuArray(randn(Float32, 4, 1000))
w_2mode = wigner(state_2mode_gpu, x_2mode)

Multi-mode systems show excellent GPU performance:

  • 2-mode, 5,000 points: 162x speedup

Cross-Wigner Functions

state1_gpu = coherentstate(basis, 1.0f0) |> gpu
state2_gpu = squeezedstate(basis, 0.3f0, Float32(π/4)) |> gpu

x_point = [0.0f0, 0.0f0]
cross_w = cross_wigner(state1_gpu, state2_gpu, x_point)
cross_chi = cross_wignerchar(state1_gpu, state2_gpu, [0.1f0, 0.2f0])

@assert cross_wigner(state1_gpu, state2_gpu, x_point) ≈
        conj(cross_wigner(state2_gpu, state1_gpu, x_point))

Interference Wigner Functions

superposition = GaussianLinearCombination(basis, [0.6f0, 0.8f0], [state1_gpu, state2_gpu])

x_points = CuArray(randn(Float32, 2, 5000))
w_interference = wigner(superposition, x_points)
chi_interference = wignerchar(superposition, x_points)

Interference calculations show excellent GPU performance:

  • 1,000 points: 31x speedup
  • 10,000 points: 333x speedup
  • 50,000 points: 1,705x speedup

Batch Operations for Maximum GPU Advantage

# avoid this: evaluating points one at a time
for i in 1:1000
    w = wigner(state_gpu, x_points[:, i])
end

# instead, evaluate all points in one batched call:
w_all = wigner(state_gpu, x_points_gpu)

Measured Performance Scaling

Based on benchmark results:

Wigner function performance scaling:

  • 100 points: 4x speedup
  • 500 points: 27x speedup
  • 1,000 points: 40x speedup
  • 5,000 points: 241x speedup
  • 10,000 points: 305x speedup
  • 25,000 points: 1,407x speedup

Interference Wigner functions:

  • 1,000 points (2-state): 31x speedup
  • 10,000 points (2-state): 333x speedup
  • 50,000 points (2-state): 1,705x speedup

Multi-mode systems:

  • 2-mode, 1K points: 39x speedup
  • 3-mode, 500 points: 21x speedup

Measured Performance Gains:

  • Batch Wigner functions: 40x - 1,400x speedup (scales with problem size)
  • Interference calculations: 30x - 1,700x speedup for superposition states
  • Cross-Wigner functions: Up to 950x speedup for large batches
  • Multi-mode systems: 160x+ speedup for 2+ mode calculations
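
For reference, a rough sketch of how one batch-Wigner data point could be reproduced. This is an assumption about the setup (per-point CPU loop as baseline, BenchmarkTools for timing); the actual benchmark methodology behind the numbers above may differ, and results depend on hardware.

using Gabs, CUDA, BenchmarkTools

basis = QuadPairBasis(1)
state_cpu = coherentstate(basis, 1.0f0)
state_gpu = state_cpu |> gpu

n = 10_000
x_cpu = randn(Float32, 2, n)
x_gpu = CuArray(x_cpu)

# CPU baseline: evaluate the Wigner function point by point
t_cpu = @belapsed [wigner($state_cpu, $x_cpu[:, i]) for i in 1:$n]
# GPU: a single batched call over all points
t_gpu = @belapsed CUDA.@sync wigner($state_gpu, $x_gpu)
println("speedup ≈ ", round(t_cpu / t_gpu, digits = 1))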

riteshbhirud (Contributor, Author) commented:

For now I placed all the GPU tests and benchmarks (commented out for now) in the test_CUDAExt file, as I was not sure where to place them since it's an extension.

@riteshbhirud riteshbhirud changed the title Gaussian GPU support and CPU non_gaussian bug fix Gaussian GPU support Sep 6, 2025
@riteshbhirud riteshbhirud changed the title Gaussian GPU support CUDA and Gaussian GPU support Sep 6, 2025
codecov bot commented Sep 10, 2025

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
