This doc builds directly on the atomics and memory-ordering material in multithreading.md §6. If those concepts aren't comfortable, start there — none of what follows will make sense without them.
Realistic warning. Production lock-free queues take experts months to get right and years to fully verify. The code in this doc is correct for the cases shown but you should not deploy hand-rolled lock-free containers without benchmarking, stress-testing, and reviewing memory-order proofs. Real codebases use libraries — see §10.
- 1. Lock-Free vs Wait-Free vs Blocking
- 1b. Why Not Just `std::queue` + a Mutex?
- 2. Producer/Consumer Taxonomy
- 3. Cache Lines and False Sharing
- 4. Single-Producer, Single-Consumer (SPSC) Ring Buffer
- 5. Single-Producer, Single-Consumer (SPSC) Unbounded Queue
- 6. Multiple-Producer, Single-Consumer (MPSC) Queue — Vyukov's Intrusive Queue
- 7. Single-Producer, Multiple-Consumer (SPMC) — Chase–Lev Work-Stealing Deque
- 8. Multiple-Producer, Multiple-Consumer (MPMC) — Vyukov's Bounded Queue
- 8b. Note on Unbounded Multiple-Producer, Multiple-Consumer (MPMC) Queues
- 9. The Reclamation Problem
- 10. Use a Library
- 11. When Not to Use Lock-Free
- 12. References
These terms describe progress guarantees, not "never uses a lock":
| Class | Guarantee | Example |
|---|---|---|
| Blocking | A stalled thread can prevent all others from making progress. | Mutex-based queue — if the holder is descheduled, no one else can push or pop. |
| Lock-free | At least one thread is always making progress, system-wide. Individual threads may starve. | CAS-loop counter, Treiber stack, Michael–Scott queue. |
| Wait-free | Every thread completes in a bounded number of steps regardless of other threads. | Single-producer ring buffer (per-side, no contention). |
Lock-free is the practical sweet spot. Wait-free is desirable but rare and usually slower in the common case.
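The boundary between the two non-blocking classes is easiest to see on a shared counter. A minimal sketch (function names are illustrative, not from any library):

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Wait-free on mainstream hardware: a single fetch_add completes in a
// bounded number of steps regardless of what other threads do.
void increment_wait_free() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Lock-free but not wait-free: some thread's CAS always succeeds
// (system-wide progress), but *this* thread can lose the race and
// retry an unbounded number of times under contention.
void increment_lock_free() {
    int old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_relaxed)) {
        // on failure, `old` is refreshed with the current value; retry
    }
}
```

Both produce the same result; only the progress guarantee differs.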
What lock-free buys you:
- Resilience to descheduling — preempting one thread doesn't freeze the system. Critical for real-time and low-latency code.
- Forward progress under contention — no priority inversion, no deadlock.
- Often better tail latency than mutex-based code, because no thread pays a sleep/wake cost.
What it costs you:
- Much harder to reason about and test. Wrong code may "work" for years before the right interleaving surfaces.
- More expensive in the uncontended case than a mutex (which is essentially free when uncontended).
- Memory reclamation is hard — you can't just `delete` a node another thread might still be reading. See §9.
Reasonable question. For most code the answer is "you should — that's exactly the right tool." This whole document only matters for the cases where blocking is unacceptable.
std::queue<T> is a container adaptor — by default a thin wrapper over std::deque<T>. It is not thread-safe. Calling push and pop from two threads concurrently is undefined behavior — corrupted memory, lost elements, or a crash.
To use it across threads you have to add the synchronization yourself:
#include <queue>
#include <mutex>
#include <condition_variable>
class MutexQueue {
std::queue<int> q_;
std::mutex m_;
std::condition_variable cv_;
public:
void push(int value) {
{
std::lock_guard lock(m_);
q_.push(value);
}
cv_.notify_one();
}
int pop() { // blocks until something is available
std::unique_lock lock(m_);
cv_.wait(lock, [&] { return !q_.empty(); });
int value = q_.front();
q_.pop();
return value;
}
};
```

This is correct, simple, and the right answer for most code. An uncontended Linux mutex costs around 20 ns; the condition_variable lets the consumer block instead of spinning. For a thread-pool task queue, a GUI event dispatcher, or a CRUD server's request queue — use this and stop reading.
The "blocking" part is what disqualifies std::queue + mutex for the cases this doc cares about:
- Realtime / audio / motor control — taking a kernel-backed mutex is forbidden on these threads. An audio callback that calls `lock()` and gets descheduled produces an audible glitch. A motor-control loop that misses its deadline can damage hardware.
- Priority inversion — a low-priority producer holding the mutex stalls a high-priority consumer waiting on it. Priority-inheritance mutexes help but aren't free, and aren't available on every platform.
- Tail latency under contention — under heavy load the mutex serializes everyone; threads pile up and worst-case latency explodes. Lock-free's "at least one thread always makes progress" guarantee bounds this.
- Descheduling — preempt the lock holder and every other thread that wants the queue is stuck until the OS reschedules it. Lock-free continues regardless.
Lock-free queues trade implementation difficulty (and a small constant-time penalty per op for the atomics) for never blocking another thread.
| You need... | Use |
|---|---|
| Single-threaded | std::queue directly |
| Multi-threaded, normal latency budget | std::queue + std::mutex (+ condition_variable) |
| Hard latency, one-producer/one-consumer | Lock-free SPSC (§4 / §5) |
| Hard latency, many producers and consumers | Vyukov MPMC (§8) or a library (§10) |
The rest of this document is for the bottom two rows. If your problem is in the top two, you already have your answer.
Most lock-free queues are specialized for a particular access pattern. Picking the right one matters — a more general queue (e.g. MPMC) is always slower than a specialized one (e.g. SPSC) on the same workload.
| Acronym | Producers | Consumers | Difficulty | Typical use |
|---|---|---|---|---|
| SPSC | 1 | 1 | Easy | Audio thread → render thread; logger; feed one worker. |
| SPMC | 1 | many | Medium | Broadcast a stream to N workers; work-stealing read side. |
| MPSC | many | 1 | Medium | Many threads log to one writer; event aggregation. |
| MPMC | many | many | Hard | General work queue; thread pool. |
Rule of thumb: identify the most-restricted model that fits your workload. SPSC is dramatically simpler and faster than MPMC.
A modern CPU moves memory in 64-byte chunks called cache lines. When two threads write to different variables that happen to share a line, the hardware coherence protocol shuffles that line back and forth between cores — even though the variables are logically independent. This is false sharing, and it can slow lock-free code down by an order of magnitude or worse.
The classic case in a queue:
struct BadQueue {
std::atomic<size_t> head; // producer writes
std::atomic<size_t> tail; // consumer writes — same cache line as head!
// ...
};
```

Producer and consumer bounce the same cache line between their cores on every push/pop.
The fix is to put each "hot" atomic on its own cache line:
struct GoodQueue {
alignas(std::hardware_destructive_interference_size) std::atomic<size_t> head;
alignas(std::hardware_destructive_interference_size) std::atomic<size_t> tail;
// ...
};
```

`std::hardware_destructive_interference_size` (C++17) is the implementation's recommended alignment to avoid false sharing — typically 64 or 128. If your toolchain doesn't define it, hard-code 64 (with a comment).
Pair with std::hardware_constructive_interference_size for the opposite problem: things you want on the same cache line because they're always accessed together.
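Because the constants are optional, a common pattern is a guarded fallback. This sketch uses the real `__cpp_lib_hardware_interference_size` feature-test macro; `kCacheLine` and `PaddedIndices` are illustrative names:

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Fall back to 64 where the standard library doesn't provide the constant.
#if defined(__cpp_lib_hardware_interference_size)
inline constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kCacheLine = 64;  // common line size; verify per target
#endif

struct PaddedIndices {
    alignas(kCacheLine) std::atomic<std::size_t> head{0};  // producer-owned
    alignas(kCacheLine) std::atomic<std::size_t> tail{0};  // consumer-owned
};

// Each member starts a fresh cache line, so the struct spans at least two.
static_assert(sizeof(PaddedIndices) >= 2 * kCacheLine);
```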
Validation. `perf c2c` (Linux) and Intel VTune both surface false-sharing hotspots. Don't trust intuition here.

Note for the rest of this doc. The queue implementations in §§4–8 omit the `alignas(...)` decorations on hot atomics to keep the algorithms readable. For production code, pad each producer-side and consumer-side counter onto its own cache line as shown above; the rule is the same regardless of the queue.
The simplest correct lock-free queue. One thread produces integers, another thread consumes them. They share a fixed-size circular buffer of int and two atomic indices. Wait-free on each side.
#include <atomic>
class BoundedSpscQueue {
static constexpr std::size_t Capacity = 8;
int buffer_[Capacity]{};
std::atomic<std::size_t> head_{0}; // producer writes here
std::atomic<std::size_t> tail_{0}; // consumer reads from here
public:
// Producer side
bool push(int value) {
std::size_t h = head_.load(std::memory_order_relaxed);
std::size_t next = (h + 1) % Capacity;
if (next == tail_.load(std::memory_order_acquire))
return false; // full
buffer_[h] = value;
head_.store(next, std::memory_order_release); // publishes buffer_[h]
return true;
}
// Consumer side
bool pop(int& out) {
std::size_t t = tail_.load(std::memory_order_relaxed);
if (t == head_.load(std::memory_order_acquire))
return false; // empty
out = buffer_[t];
tail_.store((t + 1) % Capacity, std::memory_order_release);
return true;
}
};
```

Usage:
BoundedSpscQueue q;
// producer thread
for (int i = 0; i < 1000; ++i) {
while (!q.push(i)) { /* spin until there's room */ }
}
// consumer thread
int value;
while (true) {
if (q.pop(value)) std::cout << value << '\n';
}
```

- Each index has exactly one writer. `head_` is only stored to by the producer, `tail_` only by the consumer. There's never a write/write race on either index, so no CAS is needed — a plain `store` is enough.
- One slot is left empty so empty/full are distinguishable. When `head_ == tail_` the queue is empty; when `(head_ + 1) % Capacity == tail_` it's full. Without that wasted slot, both states would look identical.
- Acquire/release pairing publishes the data. The producer writes `buffer_[h]` first, then `head_.store(release)`. The consumer's `head_.load(acquire)` synchronizes with that release: if the consumer sees the new `head_`, it's guaranteed to also see the write to `buffer_[h]`.
- Relaxed self-loads are safe. A thread reading its own counter never sees a stale value of its own writes — program order takes care of it. Only the other side's counter needs acquire.
- Wasting one slot. A `Capacity` of 8 holds at most 7 items. `%` is fine here, but production code uses a bitmask: if you make `Capacity` a power of two and replace `% Capacity` with `& (Capacity - 1)`, you skip the integer-division stall. Same algorithm, same correctness.
- Single producer, single consumer means exactly one of each. Calling `push()` from two threads concurrently is undefined behavior — the algorithm relies on each side having one writer.
- Production note: `head_` and `tail_` should each sit on their own cache line via `alignas(64)` to avoid the false sharing described in §3 — omitted here for clarity.
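The bitmask trick in isolation (`advance` is an illustrative helper, not part of the queue's interface):

```cpp
#include <cassert>
#include <cstddef>

// With a power-of-two Capacity, masking replaces the modulo's integer division.
constexpr std::size_t Capacity = 8;
static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");

constexpr std::size_t advance(std::size_t i) {
    return (i + 1) & (Capacity - 1);  // identical result to (i + 1) % Capacity
}

static_assert(advance(0) == 1);
static_assert(advance(7) == 0);  // wraps around
```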
For a battle-tested version of this pattern with the bitmask trick, an "index caching" optimization, and proper alignment, see Folly's ProducerConsumerQueue.
The ring buffer's only weakness is that Capacity is fixed. You might be tempted to fix that by growing the array when it fills up — but you can't, lock-free:
- The producer would have to allocate a new larger array, copy items over, and swap a pointer.
- The consumer might be reading from the old array at that exact moment.
- There's no safe way to free the old array without coordinating with the consumer — which means a lock.
So an unbounded lock-free queue can't be a single growable array. Instead, the producer allocates one tiny piece of storage per push and links it to the previous one. That's all a linked list is here — the smallest unit you can grow by. Each "node" stores one int and a pointer to the next node:
#include <atomic>
class UnboundedSpscQueue {
struct Node {
int value{};
std::atomic<Node*> next{nullptr};
};
std::atomic<Node*> head_; // producer side: points at the newest node
Node* tail_; // consumer side: points at the next node to read
public:
UnboundedSpscQueue() {
Node* dummy = new Node{}; // sentinel — start with one empty node so the
head_.store(dummy); // queue is never literally empty (simplifies pop)
tail_ = dummy;
}
// Producer side
void push(int value) {
Node* n = new Node{};
n->value = value;
Node* prev = head_.exchange(n, std::memory_order_acq_rel);
prev->next.store(n, std::memory_order_release); // link it on
}
// Consumer side
bool pop(int& out) {
Node* next = tail_->next.load(std::memory_order_acquire);
if (!next) return false; // empty
out = next->value;
Node* old = tail_;
tail_ = next;
delete old; // safe: only the consumer ever touches old nodes
return true;
}
};
```

Usage is identical to the bounded version — `q.push(i)` and `q.pop(value)` — except push never fails because the queue grows on demand.
- The producer never blocks. It just allocates a new node and atomically swings `head_` to point at it.
- The consumer walks the chain. `tail_->next` is the next int to read; if it's `nullptr`, nothing has been pushed since the last `pop`.
- Single consumer = trivial deletion. Only one thread ever touches "old" nodes, so we can `delete` immediately. Multi-consumer queues lose this property — see §9 for what they need instead.
- Sentinel/dummy node sidesteps the awkward "queue is exactly empty" edge case: `tail_` always points to a real node, and the question is just whether `tail_->next` is set yet.
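One cleanup detail the sketch glosses over: any nodes still linked at destruction time are leaked. A destructor fixes that — it runs only once both threads are done with the queue, so plain traversal is safe. A condensed copy of the class with that addition (same push/pop as above):

```cpp
#include <atomic>

class UnboundedSpscQueue {
    struct Node {
        int value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head_;  // producer side: newest node
    Node* tail_;               // consumer side: next node to read
public:
    UnboundedSpscQueue() {
        Node* dummy = new Node{};  // sentinel
        head_.store(dummy);
        tail_ = dummy;
    }
    ~UnboundedSpscQueue() {        // single-threaded by contract
        Node* n = tail_;           // oldest live node (the sentinel)
        while (n) {
            Node* next = n->next.load(std::memory_order_relaxed);
            delete n;
            n = next;
        }
    }
    void push(int value) {
        Node* n = new Node{};
        n->value = value;
        Node* prev = head_.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);
    }
    bool pop(int& out) {
        Node* next = tail_->next.load(std::memory_order_acquire);
        if (!next) return false;
        out = next->value;
        Node* old = tail_;
        tail_ = next;
        delete old;
        return true;
    }
};
```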
| Aspect | Bounded ring buffer (§4) | Unbounded linked list (§5) |
|---|---|---|
| Storage | one fixed array | one node per item |
| Allocation per push | none | new Node |
| Push can fail | yes (full) | never |
| Memory growth | capped | unbounded — runaway producers OOM you |
| Cache friendliness | excellent | poor (nodes scattered on the heap) |
Pick bounded unless you genuinely can't predict an upper bound. Even then, real implementations usually batch nodes into chunks (a node-pool, or a linked list of arrays) to claw back cache friendliness.
Multiple producers, single consumer. The canonical algorithm is Dmitry Vyukov's intrusive MPSC queue — used in actor systems and event loops where many threads send messages to one processor.
struct Node {
std::atomic<Node*> next{nullptr};
// ... payload ...
};
class MpscQueue {
std::atomic<Node*> head_; // producer side
Node* tail_; // consumer side
Node stub_; // sentinel, never deleted
public:
MpscQueue() : head_{&stub_}, tail_{&stub_} {}
void push(Node* n) {
n->next.store(nullptr, std::memory_order_relaxed);
Node* prev = head_.exchange(n, std::memory_order_acq_rel);
prev->next.store(n, std::memory_order_release);
}
Node* pop() { // single-consumer only
Node* tail = tail_;
Node* next = tail->next.load(std::memory_order_acquire);
if (tail == &stub_) {
if (!next) return nullptr; // empty
tail_ = next;
tail = next;
next = next->next.load(std::memory_order_acquire);
}
if (next) { tail_ = next; return tail; }
Node* head = head_.load(std::memory_order_acquire);
if (tail != head) return nullptr; // inconsistent — producer mid-push
push(&stub_);
next = tail->next.load(std::memory_order_acquire);
if (next) { tail_ = next; return tail; }
return nullptr;
}
};
```

Key properties:
- Producer side is wait-free — a single `exchange` + a single `store`, no loops.
- Consumer side is lock-free, not wait-free — under heavy contention, the consumer can observe a transient inconsistent state (a producer has done the `exchange` but not the `next.store` yet) and must retry.
- Intrusive — the queue doesn't allocate; the caller embeds `Node` in their payload.
- Stub node trick keeps the algorithm's edge cases simple. The stub is never freed.
This is genuinely tricky. Read Vyukov's original page before deploying anything based on it.
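To make the intrusive pattern concrete, here is a usage sketch. The queue is repeated verbatim so the snippet is self-contained; `Message` and its `payload` field are hypothetical caller-side names, and the example is single-threaded just to show the mechanics:

```cpp
#include <atomic>

struct Node {
    std::atomic<Node*> next{nullptr};
};

class MpscQueue {
    std::atomic<Node*> head_;
    Node* tail_;
    Node stub_;
public:
    MpscQueue() : head_{&stub_}, tail_{&stub_} {}
    void push(Node* n) {
        n->next.store(nullptr, std::memory_order_relaxed);
        Node* prev = head_.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);
    }
    Node* pop() {  // single-consumer only
        Node* tail = tail_;
        Node* next = tail->next.load(std::memory_order_acquire);
        if (tail == &stub_) {
            if (!next) return nullptr;
            tail_ = next;
            tail = next;
            next = next->next.load(std::memory_order_acquire);
        }
        if (next) { tail_ = next; return tail; }
        Node* head = head_.load(std::memory_order_acquire);
        if (tail != head) return nullptr;
        push(&stub_);
        next = tail->next.load(std::memory_order_acquire);
        if (next) { tail_ = next; return tail; }
        return nullptr;
    }
};

// Intrusive pattern: the payload type embeds Node, so push/pop never allocate.
struct Message : Node {
    int payload = 0;
};
```

Producers call `q.push(msg)` from any thread; the single consumer downcasts the popped `Node*` back to `Message*` and owns its lifetime from there.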
One producer, many consumers. The textbook algorithm is the Chase–Lev work-stealing deque — used in every modern work-stealing scheduler (Cilk, TBB, Go, Rust's Rayon, Java's ForkJoinPool).
The structure is asymmetric:
- The owner (producer) pushes and pops at the bottom. No CAS needed on its fast path.
- Thieves (consumers) steal at the top, contending with each other and occasionally the owner via CAS.
#include <atomic>
#include <array>
#include <cstdint>
template <typename T, std::size_t Size>
class ChaseLevDeque {
static_assert((Size & (Size - 1)) == 0, "Size must be a power of two");
std::atomic<int64_t> top_{0}; // thieves
std::atomic<int64_t> bottom_{0}; // owner
std::array<std::atomic<T>, Size> buf_;
static constexpr int64_t mask_ = Size - 1;
public:
// Owner only. Wait-free.
bool push(T v) {
const int64_t b = bottom_.load(std::memory_order_relaxed);
const int64_t t = top_.load(std::memory_order_acquire);
if (b - t >= static_cast<int64_t>(Size)) return false; // full
buf_[b & mask_].store(std::move(v), std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release); // publish slot before bottom
bottom_.store(b + 1, std::memory_order_relaxed);
return true;
}
// Owner only. Lock-free (one CAS in the contended last-element case).
bool pop(T& out) {
int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
bottom_.store(b, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst); // barrier vs steal
int64_t t = top_.load(std::memory_order_relaxed);
if (t > b) { // empty
bottom_.store(b + 1, std::memory_order_relaxed);
return false;
}
out = buf_[b & mask_].load(std::memory_order_relaxed);
if (t != b) return true; // not last — uncontested
// Last element: race the thieves for it.
bool won = top_.compare_exchange_strong(
t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
bottom_.store(b + 1, std::memory_order_relaxed);
return won;
}
// Any thief. Lock-free (one CAS).
bool steal(T& out) {
int64_t t = top_.load(std::memory_order_acquire);
std::atomic_thread_fence(std::memory_order_seq_cst);
int64_t b = bottom_.load(std::memory_order_acquire);
if (t >= b) return false; // empty
out = buf_[t & mask_].load(std::memory_order_relaxed);
return top_.compare_exchange_strong(
t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
}
};
```

Why this works:
- Owner's `push` is wait-free. It's the only writer of `bottom_`, and the order "store slot → fence → store bottom" guarantees that any thief seeing the new `bottom_` also sees the slot data.
- Owner's `pop` decrements `bottom_` first, then fences, then reads `top_`. The fence prevents the read of `top_` from being reordered before the write to `bottom_` — that's what makes the race against thieves correct.
- Thieves only update `top_`. Multiple thieves race via CAS; at most one wins.
- When the deque has exactly one element, owner-pop and thief-steal both want it. The owner's CAS on `top_` resolves the race: either it wins (and the thieves see an empty deque) or it loses (and the value goes to whichever thief CASed first).
Limitations of the version above:
- Bounded. Real Chase–Lev grows the underlying array dynamically; that adds a "current array" atomic pointer and a retire-old-array policy. The bounded variant is enough for most thread-pool workloads.
- `T` must fit in a `std::atomic` slot. For larger types, store `T*` and manage lifetime separately (or use the unbounded version with hazard pointers on the array).
Practical advice: if you don't need stealing semantics, a fan-out of N independent SPSC queues (one per consumer, dispatcher round-robins) is usually simpler and faster than a single SPMC queue.
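A sketch of that fan-out shape. `SpscRing` is a condensed copy of §4's ring buffer with the bitmask trick; `FanOut` and all its names are illustrative, and the assumption is that exactly one dispatcher thread calls `dispatch` while each consumer drains only its own ring:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

class SpscRing {
    static constexpr std::size_t Cap = 1024;  // power of two
    int buf_[Cap]{};
    std::atomic<std::size_t> head_{0}, tail_{0};
public:
    bool push(int v) {  // dispatcher only
        std::size_t h = head_.load(std::memory_order_relaxed);
        std::size_t next = (h + 1) & (Cap - 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;
        buf_[h] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(int& out) {  // owning consumer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return false;
        out = buf_[t];
        tail_.store((t + 1) & (Cap - 1), std::memory_order_release);
        return true;
    }
};

class FanOut {
    std::vector<SpscRing> rings_;  // one ring per consumer
    std::size_t next_ = 0;         // dispatcher-local: no atomics needed
public:
    explicit FanOut(std::size_t consumers) : rings_(consumers) {}
    bool dispatch(int v) {  // single dispatcher thread, round-robin
        bool ok = rings_[next_].push(v);
        next_ = (next_ + 1) % rings_.size();
        return ok;
    }
    SpscRing& queue_for(std::size_t consumer) { return rings_[consumer]; }
};
```

Each producer/consumer pair now has its own uncontended SPSC channel; the only shared state is within each ring.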
Many producers, many consumers. The well-known algorithm is Vyukov's bounded MPMC queue — the basis of moodycamel's ConcurrentQueue, rigtorp's MPMCQueue, and most other production implementations.
The trick: each cell has its own sequence number that tracks "what's the next operation expected on this cell?" Producers and consumers compete only via two CAS-able position counters; per-cell synchronization is done via the sequence numbers, which sidesteps ABA entirely.
#include <atomic>
#include <array>
#include <cstddef>
#include <cstdint>
template <typename T, std::size_t Size>
class MpmcBoundedQueue {
static_assert((Size & (Size - 1)) == 0, "Size must be a power of two");
static_assert(Size >= 2, "Size must be at least 2");
struct Cell {
std::atomic<std::size_t> seq;
T data;
};
std::array<Cell, Size> buf_;
std::atomic<std::size_t> enq_pos_{0};
std::atomic<std::size_t> deq_pos_{0};
static constexpr std::size_t mask_ = Size - 1;
public:
MpmcBoundedQueue() {
for (std::size_t i = 0; i < Size; ++i)
buf_[i].seq.store(i, std::memory_order_relaxed);
}
bool enqueue(T value) {
Cell* cell;
std::size_t pos = enq_pos_.load(std::memory_order_relaxed);
for (;;) {
cell = &buf_[pos & mask_];
std::size_t seq = cell->seq.load(std::memory_order_acquire);
std::intptr_t diff = (std::intptr_t)seq - (std::intptr_t)pos;
if (diff == 0) {
// Cell is ready for an enqueue at position `pos`. Claim it.
if (enq_pos_.compare_exchange_weak(
pos, pos + 1, std::memory_order_relaxed))
break;
} else if (diff < 0) {
return false; // full
} else {
// Another producer already claimed `pos`. Reload and retry.
pos = enq_pos_.load(std::memory_order_relaxed);
}
}
cell->data = std::move(value);
cell->seq.store(pos + 1, std::memory_order_release); // hand off to consumer
return true;
}
bool dequeue(T& out) {
Cell* cell;
std::size_t pos = deq_pos_.load(std::memory_order_relaxed);
for (;;) {
cell = &buf_[pos & mask_];
std::size_t seq = cell->seq.load(std::memory_order_acquire);
std::intptr_t diff = (std::intptr_t)seq - (std::intptr_t)(pos + 1);
if (diff == 0) {
// Cell holds an item produced for position `pos`. Claim it.
if (deq_pos_.compare_exchange_weak(
pos, pos + 1, std::memory_order_relaxed))
break;
} else if (diff < 0) {
return false; // empty
} else {
pos = deq_pos_.load(std::memory_order_relaxed);
}
}
out = std::move(cell->data);
cell->seq.store(pos + Size, std::memory_order_release); // ready for next lap
return true;
}
};
```

How the sequence numbers work:
- Cell `i` starts at `seq == i`. That means "ready to be enqueued at position `i`."
- A producer claims `pos` when `cell->seq == pos`. After writing the data, it stores `seq = pos + 1` — meaning "ready to be dequeued at position `pos`."
- A consumer claims `pos` when `cell->seq == pos + 1`. After reading, it stores `seq = pos + Size` — meaning "ready to be enqueued one full lap later."
The key insight: the sequence number is what advances, not the data pointer. There's no node to recycle, so no ABA window. Producers and consumers can intermix freely — no producer needs to know which consumer "owns" the cell next, only that the seq number says so.
Properties:
- Both sides are lock-free, not wait-free — the CAS on `enq_pos_` / `deq_pos_` can be lost to other producers/consumers.
- Bounded. `Size` must be a power of two. Choose it large enough that producers don't see "full" too often.
- Per-cell false sharing matters. For best performance, pad each `Cell` to a cache line:
struct alignas(64) Cell { ... };
```

Otherwise neighboring cells thrash the cache when multiple producers/consumers operate at adjacent positions.
This algorithm is fast and correct, but the proofs are dense. Don't modify it without reading Vyukov's original write-up and the formal correctness arguments. For unbounded MPMC (Michael–Scott + reclamation), see §9.
The classic unbounded MPMC algorithm is Michael–Scott, a linked-list queue with separate head and tail atomics and a "dummy node" trick. It's about 50 lines, but the safe-reclamation glue (you can't delete a node while another thread might still be traversing it) makes the complete implementation 200+ lines once you bolt on hazard pointers (§9.2) or epoch-based reclamation.
Unless you genuinely need unbounded growth, prefer Vyukov's bounded queue — it's faster, simpler, and avoids the reclamation problem entirely. If you do need unbounded, use a library (§10) rather than rolling your own.
The deepest difficulty in non-trivial lock-free data structures: when can you safely free a node?
Mutex-based code doesn't have this problem — only one thread is in the critical section, so when it removes a node it can free it immediately. Lock-free code has N threads simultaneously traversing the structure; some of them may still hold pointers to a node you want to free.
Already covered in multithreading.md §6.5 — CAS only checks the value, not whether the value has been recycled in between. Reclamation strategies prevent that recycling from being observable.
Each thread publishes "I am currently reading node X" by writing X to its own hazard pointer slot (a known atomic location). A thread that wants to free X first scans every other thread's hazard pointer slots; if any of them point at X, the free is deferred to a retire list and the scan retries later.
Thread A: reads node X, sets hazard[A] = X
Thread B: wants to delete X
scans all hazards
sees hazard[A] == X → puts X on retire list, doesn't delete
Thread A: finishes, sets hazard[A] = nullptr
Thread B: next scan, no hazard claims X → deletes X
Below is a minimal working implementation — small enough to read in one sitting, complete enough to use as the basis of a Treiber stack. Production hazard-pointer libraries are more involved (per-thread retire lists, batch reclamation, dynamic slot allocation) but the core idea is here.
#include <atomic>
#include <array>
#include <cstdlib>   // std::abort
#include <thread>
#include <vector>
constexpr std::size_t MaxThreads = 64;
constexpr std::size_t MaxRetiredSize = 2 * MaxThreads; // scan threshold
class HazardManager {
struct Slot {
std::atomic<std::thread::id> owner{};
std::atomic<void*> ptr{nullptr};
};
std::array<Slot, MaxThreads> slots_;
Slot& my_slot() {
thread_local Slot* cached = nullptr;
if (cached) return *cached;
std::thread::id me = std::this_thread::get_id();
std::thread::id none{};
for (auto& s : slots_) {
std::thread::id expected = none;
if (s.owner.compare_exchange_strong(expected, me)) {
cached = &s;
return s;
}
}
std::abort(); // exceeded MaxThreads — bump the constant
}
public:
// Mark `p` as in-use by this thread. Caller must re-validate `p` afterwards.
void protect(void* p) {
my_slot().ptr.store(p, std::memory_order_release);
}
void clear() {
my_slot().ptr.store(nullptr, std::memory_order_release);
}
bool is_hazardous(void* p) const {
for (auto& s : slots_)
if (s.ptr.load(std::memory_order_acquire) == p)
return true;
return false;
}
// Per-thread retire list. When it gets large, scan and free what's safe.
template <typename T>
void retire(T* p) {
thread_local std::vector<void*> retired;
retired.push_back(p);
if (retired.size() < MaxRetiredSize) return;
std::vector<void*> still_hazardous;
for (void* q : retired) {
if (is_hazardous(q)) still_hazardous.push_back(q);
else delete static_cast<T*>(q);
}
retired = std::move(still_hazardous);
}
};
```

The classic use case: a Treiber stack — the simplest lock-free data structure that needs reclamation.
template <typename T>
class TreiberStack {
struct Node {
T value;
Node* next;
};
std::atomic<Node*> head_{nullptr};
HazardManager hp_;
public:
void push(T value) {
Node* n = new Node{std::move(value), nullptr};
Node* h = head_.load(std::memory_order_relaxed);
do {
n->next = h;
} while (!head_.compare_exchange_weak(
h, n, std::memory_order_release, std::memory_order_relaxed));
}
bool pop(T& out) {
Node* h;
for (;;) {
h = head_.load(std::memory_order_acquire);
if (!h) return false;
hp_.protect(h); // publish hazard
if (head_.load(std::memory_order_acquire) != h)
continue; // h was popped between load & protect
// Now h is safely protected: any reclaimer will see our hazard.
if (head_.compare_exchange_weak(
h, h->next, std::memory_order_acquire, std::memory_order_relaxed))
break;
}
out = std::move(h->value);
hp_.clear();
hp_.retire(h); // freed eventually, when no hazards remain
return true;
}
};
```

The dance to remember:
1. Load `head_` into `h`.
2. Publish `h` as a hazard (release store).
3. Re-validate: load `head_` again. If it changed, our hazard could be on a node already retired — restart.
4. Now CAS to swing `head_` to `h->next`. If it succeeds, we own `h`.
5. Clear our hazard, then retire `h`. The deferred free in retire will skip any node still listed as someone's hazard.
Step 3 — the re-validation after publishing the hazard — is what makes the protocol correct. Without it, a reclaimer that scanned hazards before you published yours would happily delete the node you're about to use.
Strengths: bounded memory overhead per thread; works for any lock-free DS where you can identify "the pointer being followed."
Weaknesses: every read pays a hazard-pointer publish + re-validation; tricky to get the memory ordering right; you must enumerate every pointer that might be followed (multi-hop traversals need multiple hazard slots per thread).
C++26 is on track to ship std::hazard_pointer and friends. Until then: Folly's folly/synchronization/Hazptr.h is the production reference.
Read-Copy-Update: readers traverse the structure with no per-read overhead; writers don't free nodes immediately, instead waiting for a "grace period" — a point at which all readers that might have been reading the old node have finished.
A simplified user-space variant — epoch-based reclamation (EBR) — captures the same idea without the kernel infrastructure. Each thread enters a "read-side critical section" by recording the current global epoch; reclaimers retire nodes tagged with the epoch they were unlinked in; a node is safe to free once the global epoch has advanced past every active reader's recorded epoch.
#include <atomic>
#include <array>
#include <cstdint>   // uint64_t
#include <vector>
constexpr std::size_t MaxReaders = 64;
class EpochManager {
std::atomic<uint64_t> global_epoch_{1};
std::array<std::atomic<uint64_t>, MaxReaders> active_; // 0 = not in CS
std::size_t my_index() {
thread_local std::size_t idx = [] {
static std::atomic<std::size_t> next{0};
return next.fetch_add(1);
}();
return idx;
}
public:
EpochManager() { for (auto& a : active_) a.store(0, std::memory_order_relaxed); }
// Mark this thread as "reading at the current epoch."
uint64_t enter() {
uint64_t e = global_epoch_.load(std::memory_order_acquire);
active_[my_index()].store(e, std::memory_order_release);
std::atomic_thread_fence(std::memory_order_seq_cst);
return e;
}
void exit() {
active_[my_index()].store(0, std::memory_order_release);
}
// Retire a node. Free it when no reader is in an epoch ≤ retirement epoch.
template <typename T>
void retire(T* p) {
struct Retired { void* ptr; uint64_t epoch; void (*deleter)(void*); };
thread_local std::vector<Retired> bin;
uint64_t e = global_epoch_.fetch_add(1, std::memory_order_acq_rel);
bin.push_back({p, e, [](void* x) { delete static_cast<T*>(x); }});
// Scan: free everything whose retirement epoch is older than every active reader.
uint64_t safe = e;
for (auto& a : active_) {
uint64_t r = a.load(std::memory_order_acquire);
if (r != 0 && r < safe) safe = r;
}
std::vector<Retired> still_unsafe;
for (auto& r : bin) {
if (r.epoch < safe) r.deleter(r.ptr);
else still_unsafe.push_back(r);
}
bin = std::move(still_unsafe);
}
};
// Reader-side RAII guard:
class EpochGuard {
EpochManager& m_;
public:
explicit EpochGuard(EpochManager& m) : m_(m) { m_.enter(); }
~EpochGuard() { m_.exit(); }
EpochGuard(const EpochGuard&) = delete;
EpochGuard& operator=(const EpochGuard&) = delete;
};
```

Use it like this:
EpochManager ebr;
// Reader
{
EpochGuard guard(ebr);
Node* n = head.load(); // safe to follow `n` for the duration of `guard`
use(n);
} // exit() — n may now be reclaimed at some point
// Writer
Node* old = head.exchange(new_node);
ebr.retire(old); // freed when all readers have left their epoch
```

Strengths: reads pay one atomic store + one load — cheaper than hazard pointers' per-pointer publish + re-validation. Scales beautifully when reads dominate.
Weaknesses: a single stuck reader pins the global epoch, deferring all reclamation until it exits. Memory pressure can grow unboundedly if your read sections are long.
The full Linux RCU kernel implementation handles this with quiescent-state tracking, multiple grace-period algorithms, and tiered reclamation. User-space RCU (liburcu) ports much of it. For C++ projects, crossbeam (Rust) and cds::urcu (C++) are good references.
std::atomic<std::shared_ptr<T>> (C++20) sidesteps reclamation entirely: the shared_ptr's reference count holds nodes alive while readers hold them.
std::atomic<std::shared_ptr<Node>> head;
// reader
auto node = head.load(); // bumps refcount; node stays alive while we use it
// writer
auto fresh = std::make_shared<Node>(...);
head.store(fresh); // old head released; refcount drops when last reader exits
```

Strengths: simple; uses the type system to make reclamation automatic.
Weaknesses: every load is an atomic refcount bump (typically two atomic ops on x86, more on weak-memory machines); roughly an order of magnitude slower than hazard pointers in read-heavy workloads.
Worth it for low-frequency configuration-style data; not for hot paths.
For anything beyond an SPSC ring buffer, reach for a battle-tested implementation:
| Library | License | What it gives you |
|---|---|---|
| moodycamel::ConcurrentQueue | Simplified BSD | Header-only MPMC queue, very fast, well-tested. |
| moodycamel::ReaderWriterQueue | Simplified BSD | Header-only SPSC queue. |
| Folly (folly::MPMCQueue, folly::ProducerConsumerQueue) | Apache 2.0 | Industrial-grade, but a heavy dependency. |
| Boost.Lockfree | Boost | boost::lockfree::queue, spsc_queue, stack. |
| rigtorp::SPSCQueue / MPMCQueue | MIT | Small header-only implementations of well-known algorithms. |
| TBB concurrent_queue | Apache 2.0 | Part of Intel TBB / oneTBB. |
A 30-line hand-rolled queue almost always loses to these in throughput, latency, and (especially) correctness.
Lock-free is not always faster than a well-designed mutex queue. Consider mutex-based first if:
- Contention is rare. Uncontended mutexes are extremely cheap on Linux (futex). Lock-free pays atomic-op overhead on every operation.
- The critical section is large. Lock-free wants tiny operations; large ones force long retry loops.
- You want bounded latency above all. A mutex with priority inheritance can give you that more directly than CAS-loop fairness arguments.
- You don't have time to verify the code. "Pretty sure it works" is not enough for lock-free; bugs surface only under specific timing.
Profile first. If a mutex isn't the bottleneck, a lock-free queue won't make your system faster — it'll just make it harder to debug.
- Books
- C++ Concurrency in Action (Anthony Williams) — chapter 7 covers lock-free queues with full proofs.
- The Art of Multiprocessor Programming (Herlihy & Shavit) — the academic foundation.
- Papers
- Michael & Scott, Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms (1996).
- Maged Michael, Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (2004).
- Sites
- 1024cores (Dmitry Vyukov) — definitive practitioner reference for queue algorithms.
- Preshing on Programming — clear posts on memory ordering and lock-free.
- Code
- Related docs
- multithreading.md — atomics, memory ordering, ABA, threads.
- smart_pointers.md §7 — atomic shared_ptr.