bitonic sort sample #209
base: master
Conversation
Signed-off-by: CrabeExtra <abbasgaroosi7@gmail.com>
@Fletterio wanna try your hand at a review?
struct BitonicPushData
{
    uint64_t inputKeyAddress;
    uint64_t inputValueAddress;
    uint64_t outputKeyAddress;
    uint64_t outputValueAddress;
    uint32_t dataElementCount;
};
why are you not including common.hlsl and having duplicate code?
using DataPtr = bda::__ptr<uint32_t>;
using DataAccessor = BdaAccessor<uint32_t>;

groupshared uint32_t sharedKeys[ElementCount];
Good SoA instead of AoS, no bank conflicts
Your shared memory size is fixed, which means you shouldn't have a push constant dataElementCount.
Unless its purpose is to allow you to sort fewer elements than ElementCount, but then you'd still need to initialize all of the sharedKeys with the highest possible value so you don't end up with garbage getting into the sort.
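A minimal sketch of that padding, assuming the existing sharedKeys, ElementCount, WorkgroupSize, inputKeys and pushData names from this shader:

for (uint32_t i = threadId; i < ElementCount; i += WorkgroupSize)
{
    if (i < pushData.dataElementCount)
        inputKeys.get(i, sharedKeys[i]);
    else
        sharedKeys[i] = 0xFFFFFFFFu; // highest possible key, so padding sinks to the end of an ascending sort
}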
[numthreads(WorkgroupSize, 1, 1)]
[shader("compute")]
void main(uint32_t3 dispatchId : SV_DispatchThreadID, uint32_t3 localId : SV_GroupThreadID)
Use nbl::hlsl::glsl:: functions instead of the weird HLSL : semantics, see the other examples using gl_LocalInvocationIndex
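For example (a sketch; the include path is assumed to match what the other examples use):

#include "nbl/builtin/hlsl/glsl_compat/core.hlsl"

[numthreads(WorkgroupSize, 1, 1)]
[shader("compute")]
void main()
{
    const uint32_t threadId = nbl::hlsl::glsl::gl_LocalInvocationIndex();
    // ...
}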
for (uint32_t i = threadId; i < dataSize; i += WorkgroupSize)
{
    inputKeys.get(i, sharedKeys[i]);
    inputValues.get(i, sharedValues[i]);
Very nice, but why isn't this wrapped up into a statically polymorphic struct like the FFT and workgroup prefix sum, so we can actually reuse the code? You know, with shared memory accessors etc.
}

// Synchronize all threads after loading
GroupMemoryBarrierWithGroupSync();
Use the glsl or SPIR-V barrier, we dislike HLSL intrinsics
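E.g. (a sketch, assuming the glsl_compat header is already included):

// workgroup control + memory barrier, replaces GroupMemoryBarrierWithGroupSync()
nbl::hlsl::glsl::barrier();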
GroupMemoryBarrierWithGroupSync();

for (uint32_t stage = 0; stage < Log2ElementCount; stage++)
Where does Log2ElementCount come from? I have a feeling that you shouldn't have a data size push constant then.
groupshared uint32_t sharedKeys[ElementCount];
groupshared uint32_t sharedValues[ElementCount];

[numthreads(WorkgroupSize, 1, 1)]
Some static asserts about the workgroup size dividing the element count evenly and the element count being a power of two would be nice
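Something along these lines (a sketch; DXC's HLSL 2021 supports _Static_assert, adjust to whatever assert macro the codebase prefers):

_Static_assert((ElementCount & (ElementCount - 1)) == 0, "ElementCount must be a power of two");
_Static_assert(ElementCount % WorkgroupSize == 0, "WorkgroupSize must divide ElementCount evenly");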
13_BitonicSort/main.cpp (outdated)

auto limits = m_physicalDevice->getLimits();
const uint32_t max_shared_memory_size = limits.maxComputeSharedMemorySize;
const uint32_t max_workgroup_size = limits.maxComputeWorkGroupInvocations; // Get actual GPU limit
there's a different limit, maxOptimallyResidentWorkgroupInvocations, that tells you the optimal max size.
Reason: most Nvidia GPUs in the wild have space for 1536 invocations but report 1024 as the max workgroup size, so you get 33% under-utilization. If you use 512-sized workgroups, then 3 fit on the SM.
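A sketch of picking the size from that limit instead (std::bit_floor from <bit> just rounds down to a power of two):

#include <bit>
const uint32_t optimalInvocations = limits.maxOptimallyResidentWorkgroupInvocations;
const uint32_t workgroup_size = std::bit_floor(optimalInvocations); // e.g. 512 rather than 1024 on most NV GPUs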
13_BitonicSort/main.cpp (outdated)

auto limits = m_physicalDevice->getLimits();
const uint32_t max_shared_memory_size = limits.maxComputeSharedMemorySize;
const uint32_t max_workgroup_size = limits.maxComputeWorkGroupInvocations; // Get actual GPU limit
const uint32_t bytes_per_elements = sizeof(uint32_t) * 2; // 2 uint32_t per element (key and value)
you should get this with sizeof() of the structs in app_resources/common.hlsl
13_BitonicSort/main.cpp (outdated)

const uint32_t max_shared_memory_size = limits.maxComputeSharedMemorySize;
const uint32_t max_workgroup_size = limits.maxComputeWorkGroupInvocations; // Get actual GPU limit
const uint32_t bytes_per_elements = sizeof(uint32_t) * 2; // 2 uint32_t per element (key and value)
const uint32_t max_element_in_shared_memory = max_shared_memory_size / bytes_per_elements;
if you use all the shared memory, you won't have room for as many workgroups as possible.
Btw a good undocumented rule for occupancy is to make each invocation use no more than 32 bytes of shared memory.
Also all these calculations can be done with templates in HLSL/C++ (see the workgroup FFT and Prefix sum code)
you can take maxOptimallyResidentWorkgroupInvocations * 32 / (sizeof(Key)+sizeof(uint16_t)) (don't use 32 as a literal, name it something) and round it down to PoT
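In code the suggestion reads roughly like this (a sketch; SMemBytesPerInvocation is an illustrative name for the 32-byte rule of thumb above, and the uint16_t is the value-index type suggested further down):

constexpr uint32_t SMemBytesPerInvocation = 32u; // occupancy rule of thumb, don't leave it as a bare literal
const uint32_t invocations = limits.maxOptimallyResidentWorkgroupInvocations;
const uint32_t bytesPerElement = sizeof(uint32_t)/*key*/ + sizeof(uint16_t)/*value index*/;
const uint32_t element_count = std::bit_floor(invocations * SMemBytesPerInvocation / bytesPerElement); // round down to PoT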
13_BitonicSort/main.cpp (outdated)

const uint32_t log2_element_count = static_cast<uint32_t>(log2(element_count));

m_logger->log("GPU Limits:", ILogger::ELL_INFO);
m_logger->log(" Max Workgroup Size: " + std::to_string(max_workgroup_size), ILogger::ELL_INFO);
m_logger->log(" Max Shared Memory: " + std::to_string(max_shared_memory_size) + " bytes", ILogger::ELL_INFO);
m_logger->log(" Max elements in shared memory: " + std::to_string(max_element_in_shared_memory), ILogger::ELL_INFO);
m_logger->log(" Using element count (power of 2): " + std::to_string(element_count), ILogger::ELL_INFO);
nice logging
13_BitonicSort/main.cpp (outdated)

auto overrideSource = CHLSLCompiler::createOverridenCopy(
    source.get(), "#define ElementCount %d\n#define Log2ElementCount %d\n#define WorkgroupSize %d\n",
    element_count, log2_element_count, max_workgroup_size
);
the device limits are available in device_capability_traits.hlsl in the jit namespace, see the prefix sum example for how we use them.
The idea is that your HLSL code would have a templated version (which computes all the constants via templates) and an NBL_CONSTEXPR struct version you use on the C++ side, so that you can keep everything in sync with no need to make overridden shaders by slapping #defines into them.
13_BitonicSort/main.cpp (outdated)

if (!bitonicSortShader)
    return logFail("Failed to compile bitonic sort shader!");

const nbl::asset::SPushConstantRange pcRange = { .stageFlags = IShader::E_SHADER_STAGE::ESS_COMPUTE,.offset = 0,.size = sizeof(BitonicPushData) };

smart_refctd_ptr<IGPUPipelineLayout> layout;
smart_refctd_ptr<IGPUComputePipeline> bitonicSortPipeline;
{
    layout = m_device->createPipelineLayout({ &pcRange,1 });
    IGPUComputePipeline::SCreationParams params = {};
    params.layout = layout.get();
    params.shader.shader = bitonicSortShader.get();
    params.shader.entryPoint = "main";
    params.shader.entries = nullptr;
    if (!m_device->createComputePipelines(nullptr, { &params,1 }, &bitonicSortPipeline))
        return logFail("Failed to create compute pipeline!\n");
}

nbl::video::IDeviceMemoryAllocator::SAllocation allocation[4] = {};
smart_refctd_ptr<IGPUBuffer> buffers[4];

auto build_buffer = [this](
    smart_refctd_ptr<ILogicalDevice> m_device,
    nbl::video::IDeviceMemoryAllocator::SAllocation* allocation,
    smart_refctd_ptr<IGPUBuffer>& buffer,
    size_t buffer_size,
    const char* label
) -> void {
    IGPUBuffer::SCreationParams params;
    params.size = buffer_size;
    params.usage = IGPUBuffer::EUF_STORAGE_BUFFER_BIT | IGPUBuffer::EUF_SHADER_DEVICE_ADDRESS_BIT;
    buffer = m_device->createBuffer(std::move(params));
    if (!buffer)
        logFail("Failed to create GPU buffer of size %d!\n", buffer_size);

    buffer->setObjectDebugName(label);

    auto reqs = buffer->getMemoryReqs();
    reqs.memoryTypeBits &= m_physicalDevice->getHostVisibleMemoryTypeBits();

    *allocation = m_device->allocate(reqs, buffer.get(), IDeviceMemoryAllocation::EMAF_DEVICE_ADDRESS_BIT);
nice and clean
13_BitonicSort/main.cpp (outdated)

allocation[0].memory->map({0ull,allocation[0].memory->getAllocationSize()}, IDeviceMemoryAllocation::EMCAF_READ),
allocation[1].memory->map({0ull,allocation[1].memory->getAllocationSize()}, IDeviceMemoryAllocation::EMCAF_READ),
allocation[2].memory->map({0ull,allocation[2].memory->getAllocationSize()}, IDeviceMemoryAllocation::EMCAF_READ),
allocation[3].memory->map({0ull,allocation[3].memory->getAllocationSize()}, IDeviceMemoryAllocation::EMCAF_READ)
Vulkan doesn't care about the EMCAF flags, but they're kinda wrong here: I'm pretty sure you WRITE into some of these with the host, while the others you READ at the end.
13_BitonicSort/main.cpp (outdated)

    if (!allocation->isValid())
        logFail("Failed to allocate Device Memory compatible with our GPU Buffer!\n");

    assert(allocation->memory.get() == buffer->getBoundMemory().memory);
};

build_buffer(m_device, allocation, buffers[0], sizeof(uint32_t) * element_count, "Input Key Buffer");
build_buffer(m_device, allocation + 1, buffers[1], sizeof(uint32_t) * element_count, "Input Value Buffer");
build_buffer(m_device, allocation + 2, buffers[2], sizeof(uint32_t) * element_count, "Output Key Buffer");
build_buffer(m_device, allocation + 3, buffers[3], sizeof(uint32_t) * element_count, "Output Value Buffer");
one warning about calling logFail in a lambda, you won't actually quit the app cleanly, you'll continue going through the app
13_BitonicSort/main.cpp (outdated)

auto bufferData = new uint32_t * [2];
for (int i = 0; i < 2; ++i) {
    bufferData[i] = new uint32_t[element_count];
use a std::array<std::unique_ptr<uint32_t[]>,2>
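i.e. something like (note the array form of the unique_ptr so delete[] is used):

#include <array>
#include <memory>
std::array<std::unique_ptr<uint32_t[]>, 2> bufferData = {
    std::make_unique<uint32_t[]>(element_count),
    std::make_unique<uint32_t[]>(element_count)
};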
using DataAccessor = BdaAccessor<uint32_t>;

groupshared uint32_t sharedKeys[ElementCount];
groupshared uint32_t sharedValues[ElementCount];
it's cheaper to move the Key & an index to the Value around instead of the Value itself, because the Value might be quite fat (also at workgroup scope you only need a uint16_t index)
also because the value never has to get accessed until the end.
So you can initialize sharedValueIndex with an IOTA sequence (0,1,2,...) and not perform any value reads until the end, where you do

value_t tmp;
inputValues.get(sharedValueIndex[i], tmp);
inputValues.set(i, tmp);
hmm, actually you wouldn't initialize sharedValueIndex with IOTA but with the result of the subgroup::BitonicSort
DataAccessor inputKeys = DataAccessor::create(DataPtr::create(pushData.inputKeyAddress));
DataAccessor inputValues = DataAccessor::create(DataPtr::create(pushData.inputValueAddress));

for (uint32_t i = threadId; i < dataSize; i += WorkgroupSize)
{
    inputKeys.get(i, sharedKeys[i]);
    inputValues.get(i, sharedValues[i]);
}

// Synchronize all threads after loading
GroupMemoryBarrierWithGroupSync();

for (uint32_t stage = 0; stage < Log2ElementCount; stage++)
{
    for (uint32_t pass = 0; pass <= stage; pass++)
    {
        const uint32_t compareDistance = 1 << (stage - pass);

        for (uint32_t i = threadId; i < dataSize; i += WorkgroupSize)
        {
            const uint32_t partnerId = i ^ compareDistance;

            if (partnerId >= dataSize)
                continue;

            const uint32_t waveSize = WaveGetLaneCount();
            const uint32_t myWaveId = i / waveSize;
            const uint32_t partnerWaveId = partnerId / waveSize;
            const bool sameWave = (myWaveId == partnerWaveId);

            uint32_t myKey, myValue, partnerKey, partnerValue;
            [branch]
            if (sameWave && compareDistance < waveSize)
            {
                // WAVE INTRINSIC
                myKey = sharedKeys[i];
                myValue = sharedValues[i];

                const uint32_t partnerLane = partnerId % waveSize;
                partnerKey = WaveReadLaneAt(myKey, partnerLane);
                partnerValue = WaveReadLaneAt(myValue, partnerLane);
            }
            else
            {
                // SHARED MEM
                myKey = sharedKeys[i];
                myValue = sharedValues[i];
                partnerKey = sharedKeys[partnerId];
                partnerValue = sharedValues[partnerId];
            }

            const uint32_t sequenceSize = 1 << (stage + 1);
            const uint32_t sequenceIndex = i / sequenceSize;
            const bool sequenceAscending = (sequenceIndex % 2) == 0;
            const bool ascending = true;
            const bool finalDirection = sequenceAscending == ascending;

            const bool swap = (myKey > partnerKey) == finalDirection;

            // WORKGROUP COORDINATION: Only lower-indexed element writes both
            if (i < partnerId && swap)
            {
                sharedKeys[i] = partnerKey;
                sharedKeys[partnerId] = myKey;
                sharedValues[i] = partnerValue;
                sharedValues[partnerId] = myValue;
            }
        }

        GroupMemoryBarrierWithGroupSync();
    }
}

DataAccessor outputKeys = DataAccessor::create(DataPtr::create(pushData.outputKeyAddress));
DataAccessor outputValues = DataAccessor::create(DataPtr::create(pushData.outputValueAddress));
Glad it works, but now the real work begins.
This needs to be split into subgroup::BitonicSort<Config> and workgroup::BitonicSort<Config>, with the latter using the former.
See the FFT and PrefixSum/Scan (reduce, inclusive_scan) code for inspiration on how to lay this out. I think they don't use our weird C++20/HLSL concepts, but you should.
In the Config we need a SubgroupSize, KeyAccessor, ValueAccessor, Key Comparator (for a default see our hlsl::less binop), and obviously typedefs for value_t and key_t.
Only the workgroup config needs a ScratchKeyAccessor (what will mediate your smem accesses) and a ScratchValueAccessor.
The reason is that this lets us choose to sort values from/to BDA, SSBO, Smem, Images, and use various memory as scratch (different offsets of shared, reuse of shared mem allocations).
And most importantly, this lets us choose whether to load the Keys and move them around in scratch, or move their indices and load them from main memory every time (useful if the Key is super fat), without touching the bitonic sort code; the user would just provide an appropriate scratch-key accessor and comparator which both translate the argument to/from an index.
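A rough sketch of what such a Config could look like (names are illustrative, not a final API; the Comparator slot is where something like nbl::hlsl::less<Key> would go, and the constexpr macro is assumed to be the one used elsewhere in the hlsl headers):

template<uint16_t SubgroupSizeLog2, typename Key, typename Value,
         typename KeyAccessor, typename ValueAccessor, typename Comparator /* e.g. nbl::hlsl::less<Key> */>
struct BitonicSortConfig
{
    using key_t = Key;
    using value_t = Value;
    NBL_CONSTEXPR_STATIC_INLINE uint16_t SubgroupSize = uint16_t(1) << SubgroupSizeLog2;
};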
Items Per Invocation should also come from the Config.
Btw with a uint16_t value index and 1 invocation always processing a pair of items, this means a baseline shared memory consumption of 8 bytes per invocation if the key is also either 2 bytes or an index of a key.
Realistically the max virtual thread repetition is 4; in your case, with an actual 32-bit key getting used directly, it's 3, but with the optimal workgroup size of 512 on NV, to keep the multiple PoT, it becomes 2.
As you can see, the point at which one should consider indexing keys instead of using them outright is 14-byte keys.
btw when doing subgroup bitonics it's also possible to have multiple items per invocation (more than 2), so it's useful to bitonic sort in registers first (see the subgroup prefix sum and FFT doing stuff locally within an invocation first).
Btw barriers are expensive, so make sure that the repeated items sit at the most minor level of indexing, so that the subgroup has to access

Key keys[2<<ElementsPerInvocationLog2];

this way the last ElementsPerInvocationLog2 passes in a stage execute without any barriers (subgroup or workgroup).
for (uint32_t i = threadId; i < dataSize; i += WorkgroupSize)
{
cool use of virtual invocations
one slight nitpick, this won't [unroll] even were dataSize a compile-time constant.
To get a loop that unrolls you need to do

for (uint32_t baseIndex = 0; baseIndex < dataSize; baseIndex += WorkgroupSize)
{
    const uint32_t i = threadId + baseIndex;

compilers are dumb
void main(uint32_t3 dispatchId : SV_DispatchThreadID, uint32_t3 localId : SV_GroupThreadID)
{
    const uint32_t threadId = localId.x;
    const uint32_t dataSize = pushData.dataElementCount;
I'd make that a constant and part of the Config, because the workgroup size needs to be hardcoded, and as I've mentioned, the optimal amount of shared memory per invocation to use is <= 8 dwords (32 bytes)
if (partnerId >= dataSize)
    continue;

const uint32_t waveSize = WaveGetLaneCount();
that's a constant which should get hoisted out; btw if you make it a template uint supplied from the Config, the compiler will unroll certain loops (instead of counting on the SPIR-V optimizer in the driver to do it), that's why we do it in the FFT code
[branch]
if (sameWave && compareDistance < waveSize)
{
    // WAVE INTRINSIC
    myKey = sharedKeys[i];
    myValue = sharedValues[i];

    const uint32_t partnerLane = partnerId % waveSize;
    partnerKey = WaveReadLaneAt(myKey, partnerLane);
    partnerValue = WaveReadLaneAt(myValue, partnerLane);
}
else
{
    // SHARED MEM
    myKey = sharedKeys[i];
    myValue = sharedValues[i];
    partnerKey = sharedKeys[partnerId];
    partnerValue = sharedValues[partnerId];
}
the wholly subgroup stages (stage <= subgroupSizeLog2) need separation from the stages where the compareDistance can be big. That's what I want wrapped up in subgroup::BitonicSort
btw the workgroup stages would still use the subgroup one to finish off each stage, because compareDistance eventually decreases to subgroup-sized
if (partnerId >= dataSize)
    continue;
this causes a 50% efficiency drop when compareDistance < SubgroupSize because only half the wave is active; do your for loop from 0 to (dataSize>>1) (half as many iterations)
and fix up your index calculations by inserting a 0 bit within the i bitpattern for the lowerIndex and a 1 for the higher, see the FFT code for that getting done
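A sketch of that indexing (pairIdx and lowMask are illustrative names), looping over half as many virtual invocations and deriving both elements of the pair in one invocation:

for (uint32_t pairIdx = threadId; pairIdx < (dataSize >> 1); pairIdx += WorkgroupSize)
{
    const uint32_t lowMask = compareDistance - 1u;                          // bits below the inserted bit
    const uint32_t lo = ((pairIdx & ~lowMask) << 1) | (pairIdx & lowMask);  // insert a 0 at the compareDistance bit
    const uint32_t hi = lo | compareDistance;                               // insert a 1 instead -> the partner
    // compare-and-swap sharedKeys[lo] / sharedKeys[hi] here, both handled by this one invocation
}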
btw you might have some issues with making this work for a single subgroup, @Fletterio may have some ideas how to help you there (see his FFT blogpost, about trading halves of FFT with subgroup shuffles while only keeping one invocation active).
The main problem is that with a 32-sized subgroup you need to treat 64 items.
Ideally you'd want one invocation to get 2 keys and value indices, but keep them rearranged in such a way that always 32 invocations are active.
Meaning that with step sizes of 1,2,4,8,... subgroupSize/2 you need to cleverly "reuse" the "partner ID" invocation to do the work of the "would be" second subgroup.
const uint32_t myWaveId = i / waveSize;
const uint32_t partnerWaveId = partnerId / waveSize;
const bool sameWave = (myWaveId == partnerWaveId);
integer division is expensive, it would have been better to do just bool((i^partner)&uint16_t(-SubgroupSize)) or just compareDistance>SubgroupSize
const uint32_t partnerLane = partnerId % waveSize;
partnerKey = WaveReadLaneAt(myKey, partnerLane);
partnerValue = WaveReadLaneAt(myValue, partnerLane);
use our SPIR-V intrinsics from spirv or glsl namespace.
You may spot that there's a ShuffleXor which lets you simply read the lane at gl_SubgroupInvocationIndex^mask
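i.e. roughly (a sketch, assuming the glsl_compat subgroup shuffle header is included):

#include "nbl/builtin/hlsl/glsl_compat/subgroup_shuffle.hlsl"
// partner is the lane at gl_SubgroupInvocationIndex ^ compareDistance
partnerKey = nbl::hlsl::glsl::subgroupShuffleXor(myKey, compareDistance);
partnerValue = nbl::hlsl::glsl::subgroupShuffleXor(myValue, compareDistance);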
// WAVE INTRINSIC
myKey = sharedKeys[i];
myValue = sharedValues[i];
there's no need to read-write to shared during subgroup stages, you could keep the value in the register always
const uint32_t sequenceSize = 1 << (stage + 1);
const uint32_t sequenceIndex = i / sequenceSize;
const bool sequenceAscending = (sequenceIndex % 2) == 0;
const bool ascending = true;
const bool finalDirection = sequenceAscending == ascending;

const bool swap = (myKey > partnerKey) == finalDirection;
this can be simply expressed as bool(i & (0x1u<< (stage+2))), all that code is just checking for one bit being set in i
// WORKGROUP COORDINATION: Only lower-indexed element writes both
if (i < partnerId && swap)
{
    sharedKeys[i] = partnerKey;
    sharedKeys[partnerId] = myKey;
    sharedValues[i] = partnerValue;
    sharedValues[partnerId] = myValue;
}
the key is to not use 2 invocations for 2 inputs, but 1 for 2, throughout the function
13_BitonicSort/main.cpp (outdated)

buffer->setObjectDebugName(label);

auto reqs = buffer->getMemoryReqs();
reqs.memoryTypeBits &= m_physicalDevice->getHostVisibleMemoryTypeBits();
you wanna bitwise-and UpStreaming memory types as well, so you're running from VRAM and not RAM
13_BitonicSort/main.cpp (outdated)

cmdBuf->begin(IGPUCommandBuffer::USAGE::ONE_TIME_SUBMIT_BIT);
cmdBuf->beginDebugMarker("Bitonic Sort Single Dispatch", core::vectorSIMDf(0, 1, 0, 1));
cmdBuf->bindComputePipeline(bitonicSortPipeline.get());
cmdBuf->pushConstants(layout.get(), IShader::E_SHADER_STAGE::ESS_COMPUTE, 0u, sizeof(pc), &pc);
cmdBuf->dispatch(1, 1, 1);
cmdBuf->endDebugMarker();
cmdBuf->end();

{
    auto queue = getComputeQueue();

    IQueue::SSubmitInfo submit_infos[1];
    IQueue::SSubmitInfo::SCommandBufferInfo cmdBufs[] = {
        {
            .cmdbuf = cmdBuf.get()
        }
    };
    submit_infos[0].commandBuffers = cmdBufs;
    IQueue::SSubmitInfo::SSemaphoreInfo signals[] = {
        {
            .semaphore = progress.get(),
            .value = ++timeline,
            .stageMask = asset::PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT
        }
    };
    submit_infos[0].signalSemaphores = signals;

    m_api->startCapture();
    queue->submit(submit_infos);
    m_api->endCapture();
}

const ISemaphore::SWaitInfo wait_infos[] = { {
    .semaphore = progress.get(),
    .value = timeline
} };
m_device->blockForSemaphores(wait_infos);
add a separate benchmark mode that does more workgroup dispatches (but allocates the buffers as DeviceLocal without host readability) and runs them in a loop with a time elapsed query
If you're curious you can see if you're faster than https://github.com/AndrewBoessen/Bitonic-Merge-Sort
13_BitonicSort/main.cpp (outdated)

if (!allocation[0].memory->getMemoryPropertyFlags().hasFlags(IDeviceMemoryAllocation::EMPF_HOST_COHERENT_BIT))
    m_device->invalidateMappedMemoryRanges(1, &memory_range[0]);
if (!allocation[1].memory->getMemoryPropertyFlags().hasFlags(IDeviceMemoryAllocation::EMPF_HOST_COHERENT_BIT))
    m_device->invalidateMappedMemoryRanges(1, &memory_range[1]);
wrong, the bits you write must be flushed before you submit; the other buffers (the outputs) are the ones you invalidate
13_BitonicSort/main.cpp (outdated)

allocation[2].memory->unmap();
allocation[3].memory->unmap();

m_device->waitIdle();
you don't need to waitIdle because you've blocked on the submit semaphore
Fletterio left a comment
Just realized these were all pending, most are outdated, see if any still hold up
get_target_property(TYPE ${T} TYPE)
if(NOT ${TYPE} MATCHES INTERFACE)
    target_link_libraries(${T} PUBLIC ${NBL_EXAMPLES_API_TARGET})
    target_include_directories(${T} PUBLIC $<TARGET_PROPERTY:${NBL_EXAMPLES_API_TARGET},INCLUDE_DIRECTORIES>)
    set_target_properties(${T} PROPERTIES DISABLE_PRECOMPILE_HEADERS OFF)
    target_precompile_headers(${T} REUSE_FROM "${NBL_EXAMPLES_API_TARGET}")

    if(NBL_EMBED_BUILTIN_RESOURCES)
        LINK_BUILTIN_RESOURCES_TO_TARGET(${T} NblExtExamplesAPIBuiltinsSource)
        LINK_BUILTIN_RESOURCES_TO_TARGET(${T} NblExtExamplesAPIBuiltinsInclude)
        LINK_BUILTIN_RESOURCES_TO_TARGET(${T} NblExtExamplesAPIBuiltinsBuild)
    endif()
endif()
target_link_libraries(${T} PUBLIC ${NBL_EXAMPLES_API_TARGET})
target_include_directories(${T} PUBLIC $<TARGET_PROPERTY:${NBL_EXAMPLES_API_TARGET},INCLUDE_DIRECTORIES>)
target_precompile_headers(${T} REUSE_FROM "${NBL_EXAMPLES_API_TARGET}")
No clue about this, can you tell me if this is ok @AnastaZIuk ?
No, it's a weird reversal, probably a bad merge. @CrabExtra merge current master into this branch and make sure the diff of this file only shows add_subdirectory(13_BitonicSort)
struct BitonicPushData
{
    uint64_t inputKeyAddress;
    uint64_t inputValueAddress;
    uint64_t outputKeyAddress;
    uint64_t outputValueAddress;
    uint32_t dataElementCount;
};
Replace this with an #include "common.hlsl", otherwise you have the same struct two times
13_BitonicSort/main.cpp (outdated)

buffer->setObjectDebugName(label);

auto reqs = buffer->getMemoryReqs();
reqs.memoryTypeBits &= m_physicalDevice->getHostVisibleMemoryTypeBits();
This is fine, but this memory will be slow for device-side access. If you want to benchmark your algorithm this will cause your times to be bad. Prefer to use upstreaming buffer(s) for uploads, device local(s) to do your work on, then downstreaming for CPU readback. These are very neatly handled by the IUtilities class, see examples 5 and 11 to see how to use these staging buffers
13_BitonicSort/main.cpp (outdated)

m_device->waitIdle();

for (int i = 0; i < 2; ++i) {
    delete[] bufferData[i];
Use std::vectors, it's cleaner. You can call vector::reserve(element_count) on them to allocate memory only once
13_BitonicSort/main.cpp (outdated)

for (int i = 0; i < 2; ++i) {
    delete[] bufferData[i];
}
delete[] bufferData;
this one can instead be std::array<std::vector<uint32_t>, 2>
myValue = sharedValues[i];

const uint32_t partnerLane = partnerId % waveSize;
partnerKey = WaveReadLaneAt(myKey, partnerLane);
Or, since we know partnerId = i ^ compareDistance, you can use a subgroupShuffleXor (it's in glsl_compat/subgroup_shuffle.hlsl)
{
    const uint32_t partnerId = i ^ compareDistance;

    if (partnerId >= dataSize)
You can also try to prove that this never happens for power of two sized arrays
const bool swap = (myKey > partnerKey) == finalDirection;

// WORKGROUP COORDINATION: Only lower-indexed element writes both
This yields 4 write operations to shared memory. If you tweak the swap logic slightly (condition the key comparison based on whether i < partnerId), you can make each thread write (if a swap is necessary) to its own positions into the shared memory, reducing this to two write operations to shared memory (also, given how your threads access shared memory, there are no bank conflicts, so you're guaranteed this will result in only two write operations per thread)
    }
}

GroupMemoryBarrierWithGroupSync();
If compareDistance < waveSize, these barriers serve no purpose, you are overbarriering. In fact writing to shared memory at the end of every such iteration is also pointless.
The proper way to avoid this overbarriering is to branch behaviour based on whether compareDistance < waveSize or not. All steps with compareDistance < waveSize can be done in one go. Threads shuffle their elements around using subgroup intrinsics (shuffleXor, namely), once per every compareDistance value less than the starting one, and then write back to shared memory only once. This is what we do with the FFT, although I don't expect you to infer that from the code since it can be a bit obscure. @ me on discord if you want to figure out the way we handle this with the FFT, I can explain better there since I need to draw diagrams and write a bunch more
    outputKeys.set(i, sharedKeys[i]);
    outputValues.set(i, sharedValues[i]);
}
}
As a closing remark, it should be possible to run this algo on arrays bigger than the maximum shared memory allowed per workgroup, using what we call virtual threads
struct Accessor
{
    static Accessor create(const uint64_t address)
    {
        Accessor accessor;
        accessor.address = address;
        return accessor;
    }

    template <typename AccessType, typename IndexType>
    void get(const IndexType index, NBL_REF_ARG(AccessType) value)
    {
        value = vk::RawBufferLoad<AccessType>(address + index * sizeof(AccessType));
    }

    template <typename AccessType, typename IndexType>
    void set(const IndexType index, const AccessType value)
    {
        vk::RawBufferStore<AccessType>(address + index * sizeof(AccessType), value);
    }

    uint64_t address;
};
there are readymade BDA accessors you can use AFAIK
add_subdirectory(12_MeshLoaders)
#
#add_subdirectory(13_MaterialCompiler EXCLUDE_FROM_ALL)
add_subdirectory(12_MeshLoaders EXCLUDE_FROM_ALL)
you erroneously added EXCLUDE_FROM_ALL to example 12 and now it's omitted from CI
uint64_t deviceBufferAddress;
};

NBL_CONSTEXPR uint32_t WorkgroupSizeLog2 = 10; // 1024 threads (2^10)
512 is optimal residency on all GPUs
std::cout << "(" << key << "," << value << "), ";
if ((i + 1) % 20 == 0) {
    std::cout << "\n";
}
}
std::cout << "\nElement count: " << elementCount << "\n";

bool is_sorted = true;
int32_t error_index = -1;
for (uint32_t i = 1; i < elementCount; i++) {
    uint32_t prevKey = data[(i - 1) * 2];
    uint32_t currKey = data[i * 2];
    if (currKey < prevKey) {
        is_sorted = false;
        error_index = i;
        break;
    }
}

if (is_sorted) {
    std::cout << "Array is correctly sorted!\n";
}
else {
    std::cout << "Array is NOT sorted correctly!\n";
    std::cout << "Error at index " << error_index << ":\n";
    std::cout << " Previous key [" << (error_index - 1) << "] = " << data[(error_index - 1) * 2] << "\n";
    std::cout << " Current key [" << error_index << "] = " << data[error_index * 2] << "\n";
    std::cout << " (" << data[error_index * 2] << " < " << data[(error_index - 1) * 2] << " is WRONG!)\n";
please use m_logger instead of std::cout
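For instance (a sketch; the logger takes printf-style arguments after the log level, the same pattern logFail already uses in this example):

m_logger->log("Array is correctly sorted!", ILogger::ELL_INFO);
// on failure, report the offending pair through the logger as well
m_logger->log("Array is NOT sorted correctly! Error at index %d: key %u is followed by smaller key %u",
    ILogger::ELL_ERROR, error_index, data[(error_index - 1) * 2], data[error_index * 2]);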