2026-03-12
Moving the Renderer off the Main Thread
How we decoupled rendering from game logic in Raven using a dedicated render thread and buffered command queues.
Raven's renderer used to run entirely on the main thread: update logic, record draw calls, submit to the GPU, present, repeat. It was the obvious first implementation and it worked fine early on, but as we looked to grow the renderer, we could see it quickly becoming a problem. The main thread was constantly sitting idle waiting on the GPU, and there was no clean boundary between game logic and rendering code. Everything was tangled together.
We knew we wanted the two to run in parallel. While the GPU is chewing through frame N, the main thread should already be building frame N+1. Getting there meant designing a threading model we could actually reason about and enforce.
The Design
The split is straightforward in principle. The main thread handles game logic, scene updates, and recording a list of render commands. It never touches Vulkan directly. The render thread, owned entirely by IllumineRenderer, drains that command list, records Vulkan command buffers, and submits to the GPU.
The hard part is the handoff between them.
The Command Queue
The core primitive is RenderCommandQueue, a flat linear allocator over a 10 MiB buffer. Commands are stored as (function pointer, size, payload) tuples packed back-to-back with 8-byte alignment on the payload:
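void* RenderCommandQueue::Allocate(RenderCommandFn func, u32 size)
{
    // Header: the function pointer, then the payload size.
    *(RenderCommandFn*)m_Ptr = func;
    m_Ptr += sizeof(RenderCommandFn);
    *(u32*)m_Ptr = size;
    m_Ptr += sizeof(u32);
    // Round the write pointer up to the next 8-byte boundary for the payload.
    uptr aligned = (reinterpret_cast<uptr>(m_Ptr) + 7) & ~uptr(7);
    m_Ptr = reinterpret_cast<cw_byte*>(aligned);
    // Reserve the payload bytes and hand them back to the caller.
    void* payload = m_Ptr;
    m_Ptr += size;
    return payload;
}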
Execute() walks the buffer in order, calls each function with its payload, then resets the write pointer. No heap allocation per command, no virtual dispatch, cache-friendly iteration.
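To make the walk concrete, here is a minimal, self-contained sketch of a linear command queue with both Allocate() and Execute(). It is not Raven's code: LinearCommandQueue, RoundUp8, kHeaderSize, and RunQueueDemo are hypothetical names, the bounds check is an addition, and this version pads the whole record to 8 bytes (not only the payload) so the header reads stay aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>
#include <vector>

using CommandFn = void (*)(void*);

constexpr std::size_t RoundUp8(std::size_t n) { return (n + 7) & ~std::size_t(7); }

// Hypothetical stand-in for RenderCommandQueue. Record layout (an assumption):
// [function pointer][u32 size][padding to 8][payload][padding to 8]
class LinearCommandQueue {
public:
    explicit LinearCommandQueue(std::size_t capacity)
        : m_Buffer(new std::byte[capacity]), m_Capacity(capacity) {}

    // Reserves one record and returns a pointer to its payload,
    // or nullptr when the buffer is full.
    void* Allocate(CommandFn fn, std::uint32_t size) {
        const std::size_t need = kHeaderSize + RoundUp8(size);
        if (need > m_Capacity - m_Used) return nullptr;
        std::byte* record = m_Buffer.get() + m_Used;
        std::memcpy(record, &fn, sizeof(fn));
        std::memcpy(record + sizeof(fn), &size, sizeof(size));
        m_Used += need;
        return record + kHeaderSize;
    }

    // Walks the records in submission order, calls each function with its
    // payload, then resets the write offset so the buffer can be reused.
    void Execute() {
        std::size_t offset = 0;
        while (offset < m_Used) {
            std::byte* record = m_Buffer.get() + offset;
            CommandFn fn = nullptr;
            std::uint32_t size = 0;
            std::memcpy(&fn, record, sizeof(fn));
            std::memcpy(&size, record + sizeof(fn), sizeof(size));
            fn(record + kHeaderSize);
            offset += kHeaderSize + RoundUp8(size);
        }
        m_Used = 0;
    }

private:
    static constexpr std::size_t kHeaderSize =
        RoundUp8(sizeof(CommandFn) + sizeof(std::uint32_t));
    std::unique_ptr<std::byte[]> m_Buffer;
    std::size_t m_Capacity;
    std::size_t m_Used = 0;
};

std::vector<int> g_Executed;  // records the order in which commands ran

// Enqueues three commands, each carrying an int payload, then drains the queue.
std::vector<int> RunQueueDemo() {
    g_Executed.clear();
    LinearCommandQueue queue(1024);
    for (int value : {1, 2, 3}) {
        CommandFn record = [](void* data) {
            g_Executed.push_back(*static_cast<int*>(data));
        };
        void* payload = queue.Allocate(record, sizeof(int));
        assert(payload != nullptr);
        new (payload) int(value);
    }
    queue.Execute();
    return g_Executed;
}
```

Calling RunQueueDemo() returns the payloads in submission order. Working in offsets rather than raw pointers keeps the capacity check simple, at the cost of a few padding bytes per record.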
Commands get enqueued from the main thread via SubmitCmd, which placement-news the lambda directly into the queue buffer:
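template<typename Fn>
static void SubmitCmd(Fn&& fn)
{
    using Cmd = std::decay_t<Fn>;
    // Type-erased trampoline: run the stored lambda, then destroy it in place.
    auto execute = [](void* data) {
        Cmd* cmd = static_cast<Cmd*>(data);
        (*cmd)();
        cmd->~Cmd();
    };
    // Reserve space in the current write queue and construct the lambda there.
    void* storage = s_CommandQueues[s_WriteIndex]->Allocate(execute, sizeof(Cmd));
    new (storage) Cmd(std::forward<Fn>(fn));
}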
The lambda destructor runs immediately after execution on the render thread, so captured Ref<> handles are released on that side.
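The ownership handoff can be observed in isolation. The sketch below is a simplified illustration, not the engine's code: it uses std::shared_ptr as a stand-in for Ref<> and a hypothetical one-command Slot in place of the real queue. The names Slot, SubmitCmd (the two-argument form), and ReleaseDemo are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>

using CommandFn = void (*)(void*);

// Hypothetical one-command stand-in for the queue buffer.
struct Slot {
    alignas(std::max_align_t) std::byte storage[64];
    CommandFn execute = nullptr;
};

template <typename Fn>
void SubmitCmd(Slot& slot, Fn&& fn) {
    using Cmd = std::decay_t<Fn>;
    static_assert(sizeof(Cmd) <= sizeof(Slot::storage), "lambda too large for slot");
    static_assert(alignof(Cmd) <= alignof(std::max_align_t), "over-aligned lambda");
    // Trampoline: invoke the stored lambda, then run its destructor in place,
    // which releases everything it captured.
    slot.execute = [](void* data) {
        Cmd* cmd = static_cast<Cmd*>(data);
        (*cmd)();
        cmd->~Cmd();
    };
    new (slot.storage) Cmd(std::forward<Fn>(fn));
}

// Returns {use_count while the command is queued, use_count after it ran}.
std::pair<long, long> ReleaseDemo() {
    auto resource = std::make_shared<int>(42);  // stand-in for a Ref<> handle
    Slot slot;
    SubmitCmd(slot, [resource]() { (void)*resource; });
    const long whileQueued = resource.use_count();  // local handle + queued copy
    slot.execute(slot.storage);                     // the "render thread" side
    const long afterExecute = resource.use_count(); // queued copy destroyed
    return {whileQueued, afterExecute};
}
```

The count drops back only after the command has executed, so whichever thread drains the queue is the one that releases the captured handle.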
Frame Pacing
We maintain one RenderCommandQueue per frame in flight, configurable and clamped to the range [2, 4]. The main thread always writes to s_WriteIndex. Calling SwapQueues() hands the current write queue to the render thread and advances the index:
void IllumineRenderer::SwapQueues()
{
    s_QueueSemaphore->acquire(); // block if render thread is FramesInFlight behind
    const int prev = s_WriteIndex.load();
    s_WriteIndex = (prev + 1) % s_Config->FramesInFlight;
    {
        std::lock_guard lock(s_RenderMutex);
        s_RenderReadIndex = prev;
        s_HasWork = true;
    }
    s_RenderCV.notify_one();
}
Backpressure comes from a std::counting_semaphore initialized to FramesInFlight. The render thread releases a slot after each Execute(), and the main thread acquires one in SwapQueues(). If the render thread falls a full FramesInFlight frames behind, the main thread blocks. Right now the main thread finishes quickly since most other engine systems aren't built out yet, so we spend a fair amount of time waiting on that semaphore. That will change as more work moves onto it.
The render thread loop is about as simple as it gets:
while (s_RenderThreadRunning)
{
    int readIndex = 0;
    {
        // Sleep until the main thread hands over a queue or shutdown is requested.
        std::unique_lock lock(s_RenderMutex);
        s_RenderCV.wait(lock, []() {
            return s_HasWork.load() || !s_RenderThreadRunning.load();
        });
        readIndex = s_RenderReadIndex.load();
        s_HasWork = false;
    }
    s_CommandQueues[readIndex]->Execute();
    s_QueueSemaphore->release(); // free one slot for the main thread
    {
        std::lock_guard lock(s_MainMutex);
        s_RenderComplete = true;
    }
    s_MainCV.notify_one();
}
The Frame Loop
From the framework side, each frame looks like this:
void Framework::BeginRender()
{
    IllumineRenderer::SwapQueues(); // hand off previous frame, advance write index
    if (IllumineRenderer::ShouldRecreateSwapchain())
    {
        IllumineRenderer::FlushAndWait(); // fully drain the queue before touching swapchain
        swapchain.Recreate(...);
        IllumineRenderer::RecreateFramebuffers();
        IllumineRenderer::Recreated();
    }
    IllumineRenderer::GetResourceRegistry().ProcessReloadQueue();
    IllumineRenderer::BeginFrame(); // enqueues swapchain acquire + cmd begin
}
void Framework::EndRender()
{
    IllumineRenderer::EndFrame(); // enqueues cmd end + submit + present
    swapchain.Present();
}
SwapQueues() is a fire-and-forget kick: unless the semaphore has run out of slots, the main thread moves on immediately. FlushAndWait() is the expensive one, doing a full CPU queue drain plus vkDeviceWaitIdle. We only use it at explicit sync points like swapchain recreation or shader hot-reload, where we need everything idle before touching GPU resources.
Rules for SubmitCmd Lambdas
The thread boundary creates rules that are easy to violate by accident.
Do:
- Capture Ref<> handles by value. Refcounting is thread-safe.
- Resolve GetCommandBuffer() inside the lambda body, never before it.
- Snapshot main-thread state before submitting:
    Ref<MyType> instance = this;
    IllumineRenderer::SubmitCmd([instance]() mutable { /* use instance here */ });
Don't:
- Capture raw Vulkan handles resolved on the main thread. They may be stale by the time the lambda executes.
- Call SwapQueues() or FlushAndWait() from inside a lambda.
- Call GetCommandBuffer() from the main thread.
Violating any of these tends to produce a one-frame desync that looks like flickering or garbage geometry. It's annoying to track down.
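The stale-handle rule is easiest to see with a toy model. The sketch below is single-threaded and entirely hypothetical (FakeRenderer, CaptureResult, and StaleCaptureDemo are invented for the example, and an int stands in for a Vulkan handle); it only imitates the timing gap between recording a command and executing it.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical renderer whose "current command buffer" depends on the frame index.
struct FakeRenderer {
    int frameIndex = 0;
    int GetCommandBuffer() const { return 100 + frameIndex; }  // fake handle value
};

struct CaptureResult {
    int capturedEarly = 0;   // handle resolved at record time, then captured
    int resolvedLate = 0;    // handle resolved inside the lambda body
};

CaptureResult StaleCaptureDemo() {
    FakeRenderer renderer;
    std::vector<std::function<void()>> queue;  // stand-in for the command queue
    CaptureResult result;

    // Don't: resolve on the "main thread" and capture the value.
    const int early = renderer.GetCommandBuffer();
    queue.push_back([early, &result] { result.capturedEarly = early; });

    // Do: resolve inside the lambda, at execution time.
    queue.push_back([&renderer, &result] {
        result.resolvedLate = renderer.GetCommandBuffer();
    });

    renderer.frameIndex = 1;           // the renderer moves on before the queue runs
    for (auto& cmd : queue) cmd();     // the "render thread" drains the queue
    return result;
}
```

The first lambda ends up holding the handle for the previous frame, while the second sees the handle that is current when it actually runs. In a real engine the early capture would record into the wrong command buffer, which matches the one-frame desync described above.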
What's Still Broken
Any code that calls into RendererAPI directly instead of going through SubmitCmd will race with the render thread. Right now this is enforced only by convention and the header documentation. A wrapper that makes direct API access from the main thread a compile error would eliminate this class of bug, but we haven't built that yet.
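One possible shape for such a wrapper is a token type that only the queue can construct, with every render API function demanding it as a parameter. This is a sketch of the idea under stated assumptions, not a design we have committed to: RenderThreadToken, CommandQueue, RendererAPI::Draw, and TokenDemo are all hypothetical, and std::function is used for brevity where the real queue would use its linear allocator.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>
#include <utility>
#include <vector>

class CommandQueue;

// Proof that the caller is running inside CommandQueue::Execute().
// Only CommandQueue can create one, and it cannot be copied out of a command.
class RenderThreadToken {
    explicit RenderThreadToken(int) {}
    RenderThreadToken(const RenderThreadToken&) = delete;
    RenderThreadToken& operator=(const RenderThreadToken&) = delete;
    friend class CommandQueue;
};

namespace RendererAPI {
inline int g_DrawCalls = 0;
// Every API entry point takes the token, so it cannot be called without one.
inline void Draw(const RenderThreadToken&, int /*vertexCount*/) { ++g_DrawCalls; }
}  // namespace RendererAPI

class CommandQueue {
public:
    template <typename Fn>
    void Submit(Fn&& fn) { m_Commands.emplace_back(std::forward<Fn>(fn)); }

    // Runs on the render thread: the only place a token is ever created.
    void Execute() {
        RenderThreadToken token(0);
        for (auto& cmd : m_Commands) cmd(token);
        m_Commands.clear();
    }

private:
    std::vector<std::function<void(const RenderThreadToken&)>> m_Commands;
};

// Code outside CommandQueue cannot construct a token at all.
static_assert(!std::is_default_constructible_v<RenderThreadToken>);
static_assert(!std::is_constructible_v<RenderThreadToken, int>);

int TokenDemo() {
    RendererAPI::g_DrawCalls = 0;
    CommandQueue queue;
    queue.Submit([](const RenderThreadToken& t) { RendererAPI::Draw(t, 3); });
    queue.Submit([](const RenderThreadToken& t) { RendererAPI::Draw(t, 6); });
    // RendererAPI::Draw(RenderThreadToken(0), 3);  // would not compile: private constructor
    queue.Execute();
    return RendererAPI::g_DrawCalls;
}
```

The token does not check which thread is actually running. It only guarantees that API calls are reachable solely from code the queue invokes, which is the property the convention is trying to protect.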
There is another known trigger for bypassing the queue: starting the engine with the window minimized or unfocused. The main loop guards the entire render block behind IsMinimized(), so BeginRender() and SwapQueues() are never called for those frames. When the window regains focus, the write and read indices are out of step and the render thread picks up a queue that was never properly set up. We haven't tracked down the exact fix yet, but the guard is the obvious place to look.
Where It Lands
The main thread should no longer block waiting on the GPU under normal load, and logic and rendering overlap by one frame. The command queue gives us a hard boundary that makes it obvious what is and isn't a render operation. Adding a new draw call type is just writing a new SubmitCmd wrapper. The linear allocator means zero per-frame heap allocation on the hot path.