Mitigating ABA for Compare-and-Swap: a Survey

September 6, 2024

Compare-and-Swap (CAS) is a central part of many implementations of lock-free algorithms synchronizing values in shared data structures, such as dynamically resizable arrays. CAS is also used in Dynamic Binary Translation (DBT) to emulate load-exclusive/store-exclusive instructions on architectures which only offer CAS, such as x86.

A typical pattern for using CAS in lock-free synchronization algorithms is to read the shared data-structure, modify it locally, and finally swap the updated data in, using CAS to ensure that the shared data-structure has not been modified in the meanwhile. Although CAS itself is an atomic operation, the remaining parts of the overall algorithm are not atomic. Without further scaffolding, the algorithm may be prone to the ABA problem.

In this post, I describe the ABA problem and survey solutions to mitigate it in the context of CAS. I aim to provide a high-level classification of mitigating solutions. As lock-free synchronization is an important part of many performance-critical applications, such as virtual machines and operating systems, there is a rich body of literature on specific implementations with specific performance trade-offs.

Compare-and-Swap

In essence, Compare-and-Swap (CAS) is an atomic operation which takes three arguments: a reference or memory location l, an expected value e, and a new value n. First, CAS compares the value at l with e. If the values match, the value at l is updated to n and the operation succeeds. Otherwise, the operation fails.

CAS is implemented in hardware on the x86 architecture as the CMPXCHG instruction. Programming languages provide CAS as compiler primitives, such as the atomic_compare_exchange family in C++ or compare_exchange on the atomic types in Rust's std::sync::atomic module.
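For illustration, here is a minimal, self-contained C++ example of these semantics, using the member function compare_exchange_strong; the variables mirror l, e, and n from above.

#include <atomic>
#include <cassert>

int main() {
  std::atomic<int> l{42};
  int e = 42;                               // expected value

  // Succeeds: the value at l equals e, so l is set to 7.
  bool ok = l.compare_exchange_strong(e, 7);
  assert(ok && l.load() == 7);

  // Fails: l no longer holds 42. As a C++-specific side effect,
  // e is overwritten with the current value of l (here 7).
  e = 42;
  ok = l.compare_exchange_strong(e, 9);
  assert(!ok && e == 7 && l.load() == 7);
}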

The ABA Problem

The ABA problem is inherent to CAS when multiple threads write concurrently. In the following we look at an example in C++, even though the ABA problem is not specific to C++, programming languages, or even software.

Example Consider the case of two threads, T1 and T2, incrementing a shared counter via a pointer counter.

std::atomic<int*> counter{new int(0)};

Each thread executes the following code

void inc() {
  int *n = new int(0);
  int *o;
  bool s = false;
  while(!s) {
    o = counter.load(std::memory_order_acquire); // Load
    *n = *o + 1; // Increment
    s = counter.compare_exchange_weak(o, n, std::memory_order_acq_rel, std::memory_order_relaxed);
    // Compare-and-swap
  }
  delete o; // Free old address
}

which

  • first creates a local copy o of the pointer counter,
  • increments the value at counter and stores the incremented value at the address n,
  • swaps the new address n into counter if counter still points to the old address o,
  • and retries otherwise.
  • Finally, the previous address o of the counter is freed.

The ABA problem occurs when a second thread T2 replaces counter from an address A with a new address B and back to A, while T1 moves from Load to Compare-and-swap. Such a situation is witnessed by the following interleaving, in which T1 executes inc() once, and T2 twice.

Figure: an interleaving of T1 and T2 exhibiting the ABA problem.

T1                            T2                            counter   *counter
Load // o = A                 Load // o = A                 A         0
                              Increment // n = B, *n = 1    A         0
                              CAS                           B         1
                              Free A                        B         1
Increment // n = C, *n = 1                                  B         1
                              Load // o = B                 B         1
                              Increment // n = A, *n = 2    B         1
                              CAS                           A         2
CAS                                                         C         🗲1🗲

The value of the counter should be 3, as it has been incremented once by T1 and twice by T2. However, due to the ABA problem, T1 misses the increments by T2, and the final value of the counter is 1. (Note that T1's Increment dereferences the already freed address A, which here still holds the stale value 0.)

Mitigation by Deferred Memory Reclamation

In the above example, the ABA problem occurs because T2 is allowed to free A while it is still in use by T1. Once freed, A becomes available as an allocatable address for T2 to reuse in the second call to inc(). This observation gives rise to an obvious class of solutions to mitigate ABA: deferred memory reclamation.

Prominent examples are

  • Garbage collection (GC)
  • Hazard pointers (HP)
  • Read-copy-update (RCU)

Garbage collection (GC) is a class of algorithms in itself, which includes for instance reference counting. GC algorithms have in common some overhead for deciding which addresses to reclaim. In the following we look at unmanaged approaches to deferred memory reclamation.
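Before turning to those, a brief sketch of the managed route: with C++20's std::atomic<std::shared_ptr<int>>, the counter example from above becomes immune to ABA through address reuse, because a node's address cannot be recycled while any thread still holds a reference. A minimal sketch, assuming a C++20 toolchain:

#include <atomic>
#include <memory>

std::atomic<std::shared_ptr<int>> counter{std::make_shared<int>(0)};

void inc() {
  std::shared_ptr<int> o = counter.load();   // keeps *o alive while we work
  auto n = std::make_shared<int>(*o + 1);
  while (!counter.compare_exchange_weak(o, n)) {
    *n = *o + 1;                             // o was refreshed on failure
  }
}  // the old node is freed only when its last reference drops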

Hazard Pointers

Hazard pointers Michael04HazPtr are a general-purpose approach to deferred memory reclamation, which works with several lock-free synchronization algorithms. Here we look at hazard pointers specifically in the context of CAS.

A hazard pointer is a marker indicating that a specific thread is about to start an operation that requires the data-structure to remain available at the address the hazard pointer points to. The marker does not mean that the address holds the latest version of the shared data-structure; it merely means that the address must not be reclaimed while the hazard pointer points to it.

For each shared data-structure, there is an array of hazard pointers, one per thread potentially accessing the shared data-structure. The array of hazard pointers is itself shared; however, each thread only writes to its own hazard pointer in the array.

hp[tid] = C.load(std::memory_order_relaxed); // Set hazard pointer
...                                          // Synchronize hp (see snippet below)
o = hp[tid];
v = *o;
*n = v + 1;                                  // Increment
std::atomic_thread_fence(std::memory_order_release); // release fence
C.compare_exchange_weak(o, n, std::memory_order_acq_rel, std::memory_order_acquire);
// Compare-and-swap

The whole array may be read by any thread modifying the shared data-structure, so writes to a hazard pointer must be guarded by hardware fences to ensure that each thread sees the latest version of the array of hazard pointers. Because setting hp[tid] to the current address and verifying that this address is still current cannot be done in a single atomic step, implementations need a hardware fence to synchronize threads, as in the following snippet, modeled after the hazard-pointer acquisition in the Folly library.

// Synchronize hp
do {
  p = C.load(std::memory_order_relaxed);                // read current address
  hp[tid] = p;                                          // publish hazard pointer
  std::atomic_thread_fence(std::memory_order_seq_cst);  // full memory fence
} while (p != C.load(std::memory_order_acquire));       // retry if C changed meanwhile

The memory of o can be reclaimed once no hazard pointer in the array points to it any more; a thread releases its claim by clearing hp[tid], e.g. by poisoning it or setting it to null. The full memory fence in the above snippet ensures that all threads have a current view of the entry in the hazard-pointer array.
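To illustrate the reclamation side, here is a minimal sketch. It assumes a fixed-size hazard array hp and a per-thread retire list; the names retire and scan_and_reclaim are illustrative and not part of any particular library API. Instead of deleting an address immediately, a thread retires it and periodically frees those retired addresses that no hazard pointer covers.

#include <atomic>
#include <vector>

constexpr int kMaxThreads = 64;            // assumed fixed upper bound on threads
std::atomic<int*> hp[kMaxThreads];         // shared hazard-pointer array

thread_local std::vector<int*> retired;    // addresses retired by this thread

void retire(int* p) {
  retired.push_back(p);                    // defer reclamation instead of delete
}

void scan_and_reclaim() {
  std::vector<int*> still_protected;
  for (int* p : retired) {
    bool hazardous = false;
    for (int t = 0; t < kMaxThreads; ++t) {
      if (hp[t].load(std::memory_order_acquire) == p) {
        hazardous = true;                  // some thread still uses p
        break;
      }
    }
    if (hazardous)
      still_protected.push_back(p);        // try again on the next scan
    else
      delete p;                            // safe: no hazard pointer covers p
  }
  retired.swap(still_protected);
}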

The Folly library implements hazard pointers for C++, the Haphazard crate for Rust.

Read-Copy Update

In contrast to hazard pointers, read-copy-update (RCU) McKenneyEtAl01RCU uses flags to indicate that a thread is currently operating on the shared data-structure.

int *o = C.fetch_add(0, std::memory_order_acq_rel);   // Read-copy
int v = *o;
*n = v + 1;                                           // Modify
std::atomic_thread_fence(std::memory_order_release);  // release fence
C.compare_exchange_weak(o, n, std::memory_order_acq_rel, std::memory_order_acquire);
// Update

The flags are arranged in a shared array, with one flag per thread. Each thread writes only to its own flag in the array; however, any thread may read the whole array. The release-acquire memory order of the fetch-and-add, the release fence before the CAS combined with acquire memory order in case of CAS failure, and the release-acquire memory order in case of CAS success ensure that all threads have a current view of the array of flags. See the C++ standard for more details on the semantics of memory order for the atomic compare-exchange operation in C++.

The memory of o can be reclaimed once the RCU flag of each thread has been cleared. The release memory order of the fetch-and-add ensures that the writing thread has set its RCU flag. The release memory order of the CAS in the reclaiming thread ensures that the writing thread has performed the fetch-and-add on the same memory address for C, i.e. the same generation of the shared data structure, before the memory is reclaimed. The latter also enables a single-pass check of the RCU flags in the reclaiming thread. The release fence is needed to avoid a fringe read-after-free bug where the compiler or hardware could reorder the value increment past the CAS within the same thread.
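A much simplified sketch of the flag protocol described above; the array rcu_flag and the function names are illustrative, and production implementations such as liburcu use per-thread epoch counters rather than plain flags:

#include <atomic>

constexpr int kMaxThreads = 64;
std::atomic<bool> rcu_flag[kMaxThreads];   // one flag per thread

void rcu_enter(int tid) {                  // set before Read-copy
  rcu_flag[tid].store(true, std::memory_order_seq_cst);
}

void rcu_exit(int tid) {                   // clear after Update
  rcu_flag[tid].store(false, std::memory_order_release);
}

// The reclaiming thread frees the old address only after it has observed
// every flag clear once, i.e. after every thread that could still hold
// the old address has left its critical section.
void wait_for_readers() {
  for (int t = 0; t < kMaxThreads; ++t)
    while (rcu_flag[t].load(std::memory_order_acquire))
      ;                                    // spin (or yield)
}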

Largely thanks to Paul McKenney's work, RCU is now a prevalent synchronization algorithm in the Linux kernel, where performance is critical. Deciding when memory can be reclaimed causes some overhead, which can be avoided by using preemptive periods, available at the privilege level of OS kernels.

Optimizing the use of preemptive periods, the Linux kernel implements RCU in three flavours, see Section 3.4 of McKenney19Ease. RCU-Sched relies on disabling read-side preemption, see McKenneyEtAl01RCU for details. RCU-bh solves a lock-up of RCU-Sched under load through an additional quiescent period on exit. Both RCU-Sched and RCU-bh disable read-side preemption, which is a problem for real-time workloads. Real-time workloads are supported by the third flavour, RCU-preempt, see GuniguntalaEtAll08RCURT.

The Folly library implements RCU for C++. What is RCU? has a discussion of the use of RCU in the Linux kernel. Paul McKenney's blog has a discussion of the implementation of RCU in Rust, and more details on the distinction between user-space and kernel-space RCU. The rcu-clean crate implements a flavour of RCU in Rust.

Mitigation in Systems Architectures

As an alternative to using memory allocation as a token for synchronization, system architectures can implement wide pointers which include more information about the pointer to the shared data structure. The additional information can include, for example, version information, as in Tagged State References or ABA'; a "dirty" bit in the wide pointer, as implemented in CHERI; or a modification-exclusion bit, as implemented for load-linked/store-conditional (LL/SC).

Tagged State References

Tagged State References mitigate ABA by wrapping the shared pointer with a tag, or version. Every time a thread modifies the pointer, it modifies the tag or increments the version. This way, even though the pointer may point to the same memory location, the version number will differ. Tagged state references are nicknamed ABA' for that reason.

The main difficulty with Tagged State References is to modify the tag atomically with the pointer, which requires support from the system architecture.
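As a sketch of the idea, the pointer and the version can be packed into one double-width word and swapped with a double-width CAS. This assumes hardware support such as cmpxchg16b on x86-64; without it, std::atomic falls back to a lock. The names below are illustrative.

#include <atomic>
#include <cstdint>

struct TaggedPtr {           // pointer plus version in one 16-byte word
  int*     ptr;
  uint64_t tag;
};

// Assume tagged_counter is initialized with a valid pointer elsewhere.
std::atomic<TaggedPtr> tagged_counter;

void inc_tagged(int* n) {    // n: freshly allocated target address
  TaggedPtr o = tagged_counter.load();
  TaggedPtr d;
  do {
    *n = *o.ptr + 1;
    d = {n, o.tag + 1};      // bump the version together with the pointer
  } while (!tagged_counter.compare_exchange_weak(o, d));
  // Even if an A-B-A sequence returns the pointer to o.ptr, the tag has
  // advanced, so a stale compare-exchange fails.
}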

The blog on IT Hare discusses the trade-off between the bit-length of the version tag and the risk of version overruns.

Alternative Instructions: Load-linked/Store-conditional

Load-linked/Store-conditional (LL/SC) is an alternative to CAS. In hardware, threads are represented by monitors (of memory locations or regions). LL/SC is implemented using a pair of exclusive load and store instructions. On exclusive load, an exclusion bit is set for the requesting monitor. The subsequent exclusive store only succeeds if the exclusion bit is still set. Any write to the same address, or respectively memory region, clears the exclusion bit. In case the exclusive store fails, the instruction sequence is retried from the exclusive load.
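As a concrete sketch, an atomic increment on AArch64 can be written with the exclusive pair ldaxr/stlxr using GCC/Clang inline assembly; this is illustrative and compiles only for AArch64 targets.

#include <cstdint>

// Increment *addr atomically; retries from the exclusive load whenever
// the exclusive store reports that the monitor lost its exclusion bit.
uint32_t llsc_increment(uint32_t* addr) {
  uint32_t value, status;
  do {
    asm volatile(
        "ldaxr %w0, [%2]      \n"  // exclusive load, arms the monitor
        "add   %w0, %w0, #1   \n"  // modify locally
        "stlxr %w1, %w0, [%2] \n"  // exclusive store, %w1 = 0 on success
        : "=&r"(value), "=&r"(status)
        : "r"(addr)
        : "memory");
  } while (status != 0);           // another write cleared the monitor: retry
  return value;
}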

Several architectures implement LL/SC:

  • Alpha: ldl_l/stl_c and ldq_l/stq_c
  • PowerPC/Power ISA: lwarx/stwcx and ldarx/stdcx
  • MIPS: ll/sc and lld/scd
  • ARM: ldrex/strex (ARMv6, v7 and v8-M), and ldxr/stxr (ARMv8-A)
  • RISC-V: lr/sc
  • ARC: LLOCK/SCOND

Dynamic Binary Translation (DBT) systems, such as those used in virtual machine implementations, translate CAS to LL/SC instructions when executing e.g. x86 code on ARMv8 systems, where CAS is not available as an instruction. See KristienEtAl20DBT for a detailed discussion.
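The following sketch illustrates one way such a lowering can look on AArch64; it is an illustrative translation in GCC/Clang inline assembly, not the actual scheme from KristienEtAl20DBT.

#include <cstdint>

// CAS emulated with an exclusive load/store pair. Returns true iff
// *addr held expected and was replaced by desired.
bool cas_via_llsc(uint32_t* addr, uint32_t expected, uint32_t desired) {
  uint32_t old, status;
  asm volatile(
      "1: ldaxr %w0, [%2]       \n"  // exclusive load, arms the monitor
      "   cmp   %w0, %w3        \n"
      "   b.ne  2f              \n"  // value differs: fail without storing
      "   stlxr %w1, %w4, [%2]  \n"  // exclusive store, %w1 = 0 on success
      "   cbnz  %w1, 1b         \n"  // monitor was cleared: retry
      "   b     3f              \n"
      "2: clrex                 \n"  // drop the exclusion bit on failure
      "3:                       \n"
      : "=&r"(old), "=&r"(status)
      : "r"(addr), "r"(expected), "r"(desired)
      : "cc", "memory");
  return old == expected;
}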

See also this Wikipedia article for additional references on LL/SC.

Summary

As of the writing of this post, I am aware of two main classes of solutions to mitigate ABA for CAS: using memory allocation as a token to synchronise data structures and deferring memory reclamation, either with hazard pointers or RCU; or using some form of wide pointer or descriptor object with version tags, dirty bits, or exclusion bits, as in Tagged State References, CHERI, or implementations of CAS with load-exclusive/store-exclusive instructions, respectively.

Constraints from applications will dictate the trade-off between the solutions. Wide pointers and descriptor objects may require hardware support. In particular, approaches as taken in CHERI may enable safe and transparently performant solutions to lock-free synchronization. RCU still comes with some overhead, but preemption may limit the overhead to a constant order of complexity. Hazard pointers appear to be the most flexible solution for user-space applications, but the full memory fence used when acquiring the hazard pointer comes with a high performance penalty. Note that on Linux systems, Folly already implements asymmetric fences, which may alleviate the impact on performance.

References

Michael: Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects, 2004

McKenney et al: Read-Copy Update, 2001

McKenney: A Critical RCU Safety Property is… Ease of Use!, 2019

Guniguntala et al: The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux, 2008

Dechev et al: Lock-Free Dynamically Resizable Arrays, 2006

Dechev et al: Understanding and Effectively Preventing the ABA Problem in Descriptor-based Lock-free Designs, 2010

Kristien et al: Fast and Correct Load-Link/Store-Conditional Instruction Handling in DBT Systems, 2020