Mitigating ABA for Compare-and-Swap: a Survey
Compare-and-Swap (CAS) is a central building block of many lock-free algorithms that synchronize values in shared data structures, such as dynamically resizable arrays. In Dynamic Binary Translation (DBT), CAS is also used to emulate load-exclusive/store-exclusive instructions on architectures that only offer CAS, such as x86.
A typical pattern for using CAS in lock-free synchronization algorithms is to read the shared data structure, modify it locally, and finally swap the updated data in, after ensuring with CAS that the shared data structure has not been modified in the meantime. Although CAS itself is an atomic operation, the remaining parts of the overall algorithm are not atomic. Without further scaffolding, the algorithm may be prone to the ABA problem.
In this post, I describe the ABA problem and survey solutions to mitigate it in the context of CAS. I aim to provide a high-level classification of mitigating solutions. As lock-free synchronization is an important part of many performance-critical applications, such as virtual machines and operating systems, there is a rich body of literature on specific implementations with specific performance trade-offs.
Compare-and-Swap
In essence, Compare-and-Swap (CAS) is an atomic operation which takes three arguments: a reference or memory location l, an expected value e, and a new value n. First, CAS compares the value at l with e. If the values match, the value at l is updated to n and the operation succeeds. Otherwise, the operation fails.
CAS is implemented in hardware on the x86 architecture as CMPXCHG. Programming languages provide implementations of CAS as compiler primitives, such as atomic_compare_exchange in C++ or compare_exchange on many data structures in the atomic module in Rust.
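As a minimal illustration of these semantics in C++ (the free-standing wrapper cas and its int payload are only illustrative):

```cpp
#include <atomic>

// CAS(l, e, n): if the value at l equals e, store n at l and report success;
// otherwise leave l unchanged and report failure.
bool cas(std::atomic<int>& l, int e, int n) {
    // compare_exchange_strong updates e with the observed value on failure;
    // we ignore that detail here to keep the plain CAS signature.
    return l.compare_exchange_strong(e, n);
}
```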
The ABA Problem
The ABA problem is inherent to CAS when used with multiple threads writing concurrently. In the following we look at an example in C++, even though the ABA problem is not specific to C++, programming languages, or even software.
Example: Consider the case of two threads, T1 and T2, incrementing a shared counter via a pointer counter. Each thread executes an increment routine inc() (sketched in code after the following list), which
- first creates a local copy o of counter,
- increments the value at counter and stores the incremented value at the address n,
- swaps the new address n for counter if counter still points to the old address o,
- and retries otherwise.
- Finally, the previous address o of the counter is freed.
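A minimal sketch of such an increment routine in C++ could look as follows; publishing the counter through a std::atomic pointer and allocating each value with new/delete are assumptions of the sketch, and the routine is deliberately ABA-prone, as discussed next.

```cpp
#include <atomic>

// Shared pointer to the current counter value; each successful update
// installs a newly allocated integer and frees the previous one.
std::atomic<int*> counter{new int{0}};

void inc() {
    for (;;) {
        int* o = counter.load();             // Load: local copy o of counter
        int* n = new int{*o + 1};            // Increment: new value at new address n
        // Compare-and-Swap: install n only if counter still points to o.
        if (counter.compare_exchange_strong(o, n)) {
            delete o;                        // Free the previous address o
            return;
        }
        delete n;                            // CAS failed: discard n and retry
    }
}
```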
The ABA problem occurs when a second thread T2 replaces counter from an address A with a new address B and back to A, while T1 moves from Load to Compare-and-Swap. Such a situation is witnessed by the following interleaving, in which T1 executes inc() once and T2 twice.
| T1 | T2 | counter | *counter |
|---|---|---|---|
| Load // o = A | Load // o = A | A | 0 |
| | Increment // n = B, *n = 1 | A | 0 |
| | CAS | B | 1 |
| | Free A | B | 1 |
| Increment // n = C, *n = 1 | | B | 1 |
| | Load // o = B | B | 1 |
| | Increment // n = A, *n = 2 | B | 1 |
| | CAS | A | 2 |
| CAS | | C | 🗲1🗲 |
The value of the counter should be 3, as it has been incremented once by T1 and twice by T2. However, due to the ABA problem, T1's CAS succeeds against the stale address A, overwriting T2's updates, and the final value of the counter is 1.
Mitigation by Deferred Memory Reclamation
In the above example, the ABA problem occurs because T2 is allowed to free A while it is still in use by T1. Once freed, A becomes available as an allocatable address for T2 to use in its second call to inc(). This observation gives rise to an obvious class of solutions to mitigate ABA: deferred memory reclamation.
Prominent examples are
- Garbage collection (GC)
- Hazard pointers (HP)
- Read-copy-update (RCU)
Garbage collection (GC) is a class of algorithms in itself, which includes for instance reference counting. GC algorithms have in common some overhead for deciding which addresses to reclaim. In the following we look at unmanaged approaches to deferred memory reclamation.
Hazard Pointers
Hazard pointers Michael04HazPtr are a general-purpose approach to deferred memory reclamation that works with many lock-free synchronization algorithms. Here we look at hazard pointers specifically in the context of CAS.
A hazard pointer is a marker indicating that a specific thread is about to start an operation which requires the data structure at the address the hazard pointer points to remain available. The marker does not mean that the address holds the latest version of the shared data structure; it must merely not be reclaimed while the hazard pointer points to it.
For each shared data structure, there is an array of hazard pointers, one for each thread potentially accessing the shared data structure. The array of hazard pointers is itself shared, however each thread will only write to its own hazard pointer in the array.
The whole array may be read by any thread modifying the shared data structure, so writes to the hazard pointers must be guarded by hardware fences to ensure that each thread sees the latest version of the array. Because hp[tid] needs to be set to the address just read from the shared pointer, reading the pointer and publishing the hazard pointer is not a single atomic step. To ensure that all threads see the latest version, implementations need to use a hardware fence to synchronize threads, as done for instance in the Folly library.
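Not Folly's actual code, but a minimal sketch of this publish-and-validate step; the hp array, the kMaxThreads bound and the function name protect are assumptions, and counter is the shared pointer from the running example.

```cpp
#include <atomic>

constexpr int kMaxThreads = 64;

// One hazard-pointer slot per thread; each thread writes only hp[tid].
std::atomic<int*> hp[kMaxThreads];

extern std::atomic<int*> counter;  // shared pointer from the inc() example

// Publish a hazard pointer for the current counter address of thread tid
// and validate that the address has not been replaced in the meantime.
int* protect(int tid) {
    int* o = counter.load(std::memory_order_acquire);
    for (;;) {
        hp[tid].store(o, std::memory_order_release);          // publish
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full memory fence
        int* again = counter.load(std::memory_order_acquire);
        if (again == o) {
            return o;  // counter still points to o; o is now safe from reclamation
        }
        o = again;     // counter changed; publish the new address and re-validate
    }
}
```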
The memory of o can be reclaimed once hp[tid] no longer points to a valid instance of the shared data structure, e.g. after being poisoned or set to null. The full memory fence in the above snippet ensures that all threads have a current view of the entry in the hazard pointer array.
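Continuing the sketch above, the reclamation check could look as follows; real implementations such as Folly's batch retired pointers and amortize the scan rather than checking on every retire.

```cpp
// Reclaim o only if no thread currently protects it with a hazard pointer.
// hp and kMaxThreads are the declarations from the previous sketch.
bool try_reclaim(int* o) {
    for (int t = 0; t < kMaxThreads; ++t) {
        if (hp[t].load(std::memory_order_acquire) == o) {
            return false;  // still protected; defer reclamation
        }
    }
    delete o;  // no hazard pointer references o, so it can be freed
    return true;
}
```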
The Folly library implements hazard pointers for C++; the haphazard crate does so for Rust.
Read-Copy Update
In contrast to hazard pointers, read-copy-update (RCU) McKenneyEtAl01RCU uses flags to indicate that a thread is about to modify the shared data structure.
The flags are arranged in a shared array, with one flag per thread. Each thread writes only to its own flag in the array; however, any thread may read the whole array. The release-acquire memory order of the fetch-and-add, the release fence before the CAS combined with acquire memory order in case of CAS failure, and release-acquire memory order in case of CAS success ensure that all threads have a current view of the array of flags. See the C++ standard for more details on the semantics of memory order for the atomic compare-exchange operation in C++.
The memory of o can be reclaimed once the RCU flag for each thread has been cleared. The release memory order of the fetch-and-add ensures that the writing thread has set its RCU flag. The release memory order of the CAS in the reclaiming thread ensures that the writing thread has performed the fetch-and-add on the same memory address for C, i.e. the generation of the shared data structure, before the memory is reclaimed. The latter also enables a single-pass check of the RCU flags in the reclaiming thread. The release fence is needed to avoid a fringe read-after-free bug when the compiler or hardware reorders the value increment over the CAS within the same thread.
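As a minimal illustration of the reclamation condition above (reclaim only once every thread's RCU flag has been observed clear), the following sketch uses sequentially consistent defaults instead of the fetch-and-add and fence protocol described in the text; the names rcu_flag, rcu_enter, rcu_exit and rcu_reclaim are assumptions.

```cpp
#include <atomic>

constexpr int kMaxThreads = 64;

// One RCU flag per thread; non-zero while the thread works on a snapshot
// of the shared data structure, cleared when it is done.
std::atomic<unsigned> rcu_flag[kMaxThreads];

void rcu_enter(int tid) { rcu_flag[tid].store(1); }  // before loading the shared pointer
void rcu_exit(int tid)  { rcu_flag[tid].store(0); }  // after the CAS (or after giving up)

// Reclaim o only after every thread's flag has been observed clear, i.e.
// no thread can still hold a snapshot taken before the pointer was replaced.
void rcu_reclaim(int* o) {
    for (int t = 0; t < kMaxThreads; ++t) {
        while (rcu_flag[t].load() != 0) {
            // spin until thread t leaves its critical section
        }
    }
    delete o;
}
```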
Largely thanks to Paul McKenney’s work, RCU is now a prevalent synchronization algorithm in the Linux kernel, where performance is critical. Deciding when memory may be reclaimed causes some overhead, which can be avoided by using preemptive periods, available at the privilege level of OS kernels.
Optimizing the use of preemptive periods, the Linux kernel implements RCU in three flavours, see Section 3.4 of McKenney19Ease. RCU-Sched relies on disabling read-side preemption, see McKenneyEtAl01RCU for details. RCU-bh solves a lock-up of RCU-Sched under load through an additional quiescent period on exit. Both RCU-Sched and RCU-bh disable read-side preemption, which is a problem for real-time workloads. Such workloads are supported by the third flavour, RCU-preempt, see GuniguntalaEtAll08RCURT.
The Folly library implements RCU for C++. The article What is RCU? discusses the use of RCU in the Linux kernel. Paul McKenney’s blog discusses the implementation of RCU in Rust, and gives more details on the distinction between user-space and kernel-space RCU. The rcu-clean crate implements a flavour of RCU in Rust.
Mitigation in Systems Architectures
As an alternative to using memory allocation as a token for synchronization, system architectures can implement wide pointers which carry additional information about the pointer to the shared data structure. The additional information can e.g. include version information, as in Tagged State References or ABA’, a “dirty” bit in the wide pointer, as implemented in CHERI, or a modification-exclusion bit as implemented in load-linked/store-conditional (LL/SC).
Tagged State References
Tagged State References mitigate ABA by wrapping the shared pointer with a tag, or version. Every time a thread modifies the pointer, it modifies the tag or increments the version. This way, even though the pointer may point to the same memory location, the version number will differ. Tagged state references are nicknamed ABA’ for that reason.
The main difficulty with Tagged State References is to modify the tag atomically with the pointer, which requires support from the system architecture.
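A minimal sketch of a tagged (wide) pointer in C++; whether the double-width compare-exchange is lock-free depends on hardware support (e.g. CMPXCHG16B on x86-64), and the struct layout and names are assumptions.

```cpp
#include <atomic>
#include <cstdint>

// A wide pointer: the address and a version tag, always updated together.
struct TaggedPtr {
    int*     ptr;
    uint64_t ver;
};

// Tagged version of the shared counter from the running example.
std::atomic<TaggedPtr> counter{TaggedPtr{new int{0}, 0}};

void inc() {
    TaggedPtr o = counter.load();
    for (;;) {
        int* n = new int{*o.ptr + 1};
        // The CAS compares address and version, so a reused address with a
        // bumped version no longer matches a stale snapshot.
        if (counter.compare_exchange_weak(o, TaggedPtr{n, o.ver + 1})) {
            delete o.ptr;
            return;
        }
        delete n;  // o was refreshed by the failed CAS; retry
    }
}
```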
The blog on IT Hare discusses the trade-off between the bit-length of the version tag and the risk of version overruns.
Alternative Instructions: Load-linked/Store-conditional
Load-linked/Store-conditional (LL/SC) is an alternative to CAS. In hardware, threads are represented by monitors (of memory locations or regions), and LL/SC is implemented using a pair of exclusive load and store instructions. On an exclusive load, an exclusion bit is set for the requesting monitor. The subsequent exclusive store only succeeds if the exclusion bit is still set. Any write to the same address, or respectively memory region, clears the exclusion bit. If the exclusive store fails, the instruction sequence is retried from the exclusive load.
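A minimal sketch of such a retry loop, written as AArch64 inline assembly in C++ (the function name llsc_increment is an assumption; on other architectures the sketch falls back to a plain atomic increment):

```cpp
#include <cstdint>

// Increment *addr using a load-exclusive/store-exclusive retry loop.
void llsc_increment(uint64_t* addr) {
#if defined(__aarch64__)
    uint64_t tmp;
    uint32_t failed;
    asm volatile(
        "1: ldxr  %0, [%2]       \n"  // load-exclusive: read value, mark exclusivity
        "   add   %0, %0, #1     \n"  // modify the local copy
        "   stxr  %w1, %0, [%2]  \n"  // store-exclusive: %w1 == 0 only on success
        "   cbnz  %w1, 1b        \n"  // exclusivity lost: retry from the load
        : "=&r"(tmp), "=&r"(failed)
        : "r"(addr)
        : "memory");
#else
    // Fallback for architectures without LL/SC exposed here: plain atomic add.
    __atomic_fetch_add(addr, 1, __ATOMIC_SEQ_CST);
#endif
}
```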
Several architectures implement LL/SC:
- Alpha: ldl_l/stl_c and ldq_l/stq_c
- PowerPC/Power ISA: lwarx/stwcx and ldarx/stdcx
- MIPS: ll/sc and lld/scd
- ARM: ldrex/strex (ARMv6, v7 and v8-M) and ldxr/stxr (ARMv8-A)
- RISC-V: lr/sc
- ARC: LLOCK/SCOND
Dynamic Binary Translation (DBT) systems, such as those used in virtual machine implementations, translate CAS to LL/SC instructions when executing e.g. x86 code on ARMv8 systems where CAS is not available as an instruction. See KristienEtAl20DBT for a detailed discussion.
See also the Wikipedia article on load-link/store-conditional for additional references on LL/SC.
Summary
As of the writing of this post, I am aware of two main classes of solutions to mitigate ABA for CAS: using memory allocation as a token to synchronise data structures and deferring memory reclamation, either with hazard pointers or RCU; or some form of wide pointer or descriptor object with version tags, dirty bits, or exclusion bits, as used in Tagged State References, CHERI, or implementations of CAS with load-exclusive/store-exclusive instructions, respectively.
Constraints from applications will dictate the trade-off between the solutions. Wide pointers and descriptor objects may require hardware support. In particular, approaches as taken in CHERI may enable safe and transparently performant solutions to lock-free synchronization. RCU still comes with some overhead, but preemption may limit that overhead to a constant factor. Hazard pointers appear to be the most flexible solution for user-space applications, but the full memory fence used when acquiring a hazard pointer comes with a high performance penalty. Note that on non-Linux systems, Folly already implements asymmetric fences, which may alleviate the impact on performance.
References
Michael: Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects, 2004
McKenney et al: Read-Copy Update, 2001
McKenney: A Critical RCU Safety Property is… Ease of Use!, 2019
Guniguntala et al: The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux, 2008
Dechev et al: Lock-Free Dynamically Resizable Arrays, 2006
Dechev et al: Understanding and Effectively Preventing the ABA Problem in Descriptor-based Lock-free Designs, 2010
Kristien et al: Fast and Correct Load-Link/Store-Conditional Instruction Handling in DBT Systems, 2020