7 Atomic Instructions
This chapter is currently being restructured. Its contents are normative, but the presentation might appear disjoint.
RISC-V provides several extensions that atomically read-modify-write memory to support synchronization between multiple RISC-V harts running in the same memory space. The two forms of atomic instruction provided are load-reserved/store-conditional instructions and atomic fetch-and-op memory instructions. Both types of atomic instruction support various memory consistency orderings including unordered, acquire, release, and sequentially consistent semantics. These instructions allow RISC-V to support the RCsc memory consistency model. [18]
After much debate, the language community and architecture community appear to have finally settled on release consistency as the standard memory consistency model and so the RISC-V atomic support is built around this model.
Specifying Ordering of Atomic Instructions
The base RISC-V ISA has a relaxed memory model, with the FENCE instruction used to impose additional ordering constraints. The address space is divided by the execution environment into memory and I/O domains, and the FENCE instruction provides options to order accesses to one or both of these two address domains.
To provide more efficient support for release consistency [18], each atomic instruction has two bits, aq and rl, used to specify additional memory ordering constraints as viewed by other RISC-V harts. The bits order accesses to one of the two address domains, memory or I/O, depending on which address domain the atomic instruction is accessing. No ordering constraint is implied to accesses to the other domain, and a FENCE instruction should be used to order across both domains.
If both bits are clear, no additional ordering constraints are imposed on the atomic memory operation. If only the aq bit is set, the atomic memory operation is treated as an acquire access, i.e., no following memory operations on this RISC-V hart can be observed to take place before the acquire memory operation. If only the rl bit is set, the atomic memory operation is treated as a release access, i.e., the release memory operation cannot be observed to take place before any earlier memory operations on this RISC-V hart. If both the aq and rl bits are set, the atomic memory operation is sequentially consistent and cannot be observed to happen before any earlier memory operations or after any later memory operations in the same RISC-V hart and to the same address domain.
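As an informal illustration of the four combinations described above, the following Python sketch maps an aq/rl pair to the ordering name used in this section and to the assembler suffix seen on mnemonics such as amoswap.w.aq. The function names are hypothetical and the sketch is not normative.

```python
def ordering_name(aq: bool, rl: bool) -> str:
    """Name the ordering implied by the aq and rl bits of an atomic instruction."""
    if aq and rl:
        return "sequentially consistent"
    if aq:
        return "acquire"
    if rl:
        return "release"
    return "unordered"


def mnemonic_suffix(aq: bool, rl: bool) -> str:
    """Assembler suffix for the same bit pair, e.g. '.aq' in amoswap.w.aq."""
    if aq and rl:
        return ".aqrl"
    if aq:
        return ".aq"
    if rl:
        return ".rl"
    return ""
```

For example, mnemonic_suffix(True, False) yields the suffix used by the acquire form of an AMO.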
7.1 A Extension for Atomic Instructions
The A extension comprises the Zalrsc and Zaamo extensions,
which are defined in the following sections.
7.2 "Zalrsc" Extension for Load-Reserved/Store-Conditional Instructions
Complex atomic memory operations on a single memory word or doubleword are performed with the load-reserved (LR) and store-conditional (SC) instructions. LR.W loads a word from the address in rs1, places the sign-extended value in rd, and registers a reservation set—a set of bytes that subsumes the bytes in the addressed word. SC.W conditionally writes a word in rs2 to the address in rs1: the SC.W succeeds only if the reservation is still valid and the reservation set contains the bytes being written. If the SC.W succeeds, the instruction writes the word in rs2 to memory, and it writes zero to rd. If the SC.W fails, the instruction does not write to memory, and it writes a nonzero value to rd. No SC.W instruction shall retire unless it passes memory permission checks, but it is UNSPECIFIED whether any side effects of implicit address translation and protection memory accesses (such as setting a page-table entry D bit) occur on a failed SC.W. For the purposes of memory protection, a failed SC.W may be treated like a store. Regardless of success or failure, executing an SC.W instruction invalidates any reservation held by this hart. LR.D and SC.D act analogously on doublewords and are only available on RV64. For RV64, LR.W and SC.W sign-extend the value placed in rd.
Both compare-and-swap (CAS) and LR/SC can be used to build lock-free data structures. After extensive discussion, we opted for LR/SC for several reasons: 1) CAS suffers from the ABA problem, which LR/SC avoids because it monitors all writes to the address rather than only checking for changes in the data value; 2) CAS would also require a new integer instruction format to support three source operands (address, compare value, swap value) as well as a different memory system message format, which would complicate microarchitectures; 3) Furthermore, to avoid the ABA problem, other systems provide a double-wide CAS (DW-CAS) to allow a counter to be tested and incremented along with a data word. This requires reading five registers and writing two in one instruction, and also a new larger memory system message type, further complicating implementations; 4) LR/SC provides a more efficient implementation of many primitives as it only requires one load as opposed to two with CAS (one load before the CAS instruction to obtain a value for speculative computation, then a second load as part of the CAS instruction to check if value is unchanged before updating).
The main disadvantage of LR/SC over CAS is livelock, which we avoid, under certain circumstances, with an architected guarantee of eventual forward progress as described below. Another concern is whether the influence of the current x86 architecture, with its DW-CAS, will complicate porting of synchronization libraries and other software that assumes DW-CAS is the basic machine primitive. A possible mitigating factor is the recent addition of transactional memory instructions to x86, which might cause a move away from DW-CAS.
More generally, a multi-word atomic primitive is desirable, but there is still considerable debate about what form this should take, and guaranteeing forward progress adds complexity to a system.
The failure code with value 1 encodes an unspecified failure. Other failure codes are reserved at this time. Portable software should only assume the failure code will be non-zero.
We reserve a failure code of 1 to mean "unspecified" so that simple implementations may return this value using the existing multiplexer required for the SLT/SLTU instructions. More specific failure codes might be defined in future versions or extensions to the ISA.
For LR and SC, the Zalrsc extension requires that the address held in rs1 be naturally aligned to the size of the operand (i.e., eight-byte aligned for doublewords and four-byte aligned for words). If the address is not naturally aligned, an address-misaligned exception or an access-fault exception will be generated. The access-fault exception can be generated for a memory access that would otherwise be able to complete except for the misalignment, if the misaligned access should not be emulated.
Emulating misaligned LR/SC sequences is impractical in most systems.
Misaligned LR/SC sequences also raise the possibility of accessing multiple reservation sets at once, which present definitions do not provide for.
An implementation can register an arbitrarily large reservation set on each LR, provided the reservation set includes all bytes of the addressed data word or doubleword. An SC can only pair with the most recent LR in program order. An SC may succeed only if no store from another hart to the reservation set can be observed to have occurred between the LR and the SC, and if there is no other SC between the LR and itself in program order. An SC may succeed only if no write from a device other than a hart to the bytes accessed by the LR instruction can be observed to have occurred between the LR and SC. Note this LR might have had a different effective address and data size, but reserved the SC’s address as part of the reservation set.
Following this model, in systems with memory translation, an SC is allowed to succeed if the earlier LR reserved the same location using an alias with a different virtual address, but is also allowed to fail if the virtual address is different.
To accommodate legacy devices and buses, writes from devices other than RISC-V harts are only required to invalidate reservations when they overlap the bytes accessed by the LR. These writes are not required to invalidate the reservation when they access other bytes in the reservation set.
The SC must fail if the address is not within the reservation set of the most recent LR in program order. The SC must fail if a store to the reservation set from another hart can be observed to occur between the LR and SC. The SC must fail if a write from some other device to the bytes accessed by the LR can be observed to occur between the LR and SC. (If such a device writes the reservation set but does not write the bytes accessed by the LR, the SC may or may not fail.) An SC must fail if there is another SC (to any address) between the LR and the SC in program order. The precise statement of the atomicity requirements for successful LR/SC sequences is defined by the Atomicity Axiom of the RVWMO memory model.
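The must-fail rules above can be exercised with a small Python model. This is a sketch under stated assumptions, not a description of any implementation: the class name ToyMemory, the 64-byte reservation-set size, and the choice that a device write to other bytes of the set leaves the reservation intact are all illustrative. Only word-sized accesses are modeled, and values are kept as unsigned 32-bit integers rather than sign-extended.

```python
class ToyMemory:
    """Toy model of word-sized LR/SC reservations shared by several harts."""

    def __init__(self, set_size=64):
        self.set_size = set_size      # assumed reservation-set size in bytes
        self.words = {}               # word-aligned address -> 32-bit value
        self.resv = {}                # hart id -> (set base, LR address)

    def lr_w(self, hart, addr):
        assert addr % 4 == 0, "LR.W needs a four-byte-aligned address"
        # A new reservation replaces any earlier one held by this hart.
        self.resv[hart] = (addr - addr % self.set_size, addr)
        return self.words.get(addr, 0)

    def sc_w(self, hart, addr, value):
        assert addr % 4 == 0, "SC.W needs a four-byte-aligned address"
        # Every SC, successful or not, invalidates this hart's reservation.
        held = self.resv.pop(hart, None)
        if held is None:
            return 1                  # no valid reservation: must fail
        base, _ = held
        if not (base <= addr < base + self.set_size):
            return 1                  # address outside the reservation set
        self._hart_store(hart, addr, value)
        return 0

    def store_w(self, hart, addr, value):
        """Ordinary word store by a hart."""
        self._hart_store(hart, addr, value)

    def device_write_w(self, addr, value):
        """Write by a non-hart device: only the LR-accessed bytes must invalidate."""
        self.words[addr] = value & 0xFFFFFFFF
        for h, (_, lr_addr) in list(self.resv.items()):
            if lr_addr == addr:
                del self.resv[h]

    def _hart_store(self, writer, addr, value):
        self.words[addr] = value & 0xFFFFFFFF
        # A store from another hart to the reservation set kills the reservation.
        for h, (base, _) in list(self.resv.items()):
            if h != writer and base <= addr < base + self.set_size:
                del self.resv[h]
```

In this model, an LR by hart 0 followed by a store from hart 1 anywhere in the same 64-byte set causes the next SC by hart 0 to return a nonzero code, while a second SC without an intervening LR always fails.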
The platform should provide a means to determine the size and shape of the reservation set.
A platform specification may constrain the size and shape of the reservation set.
A store-conditional instruction to a scratch word of memory should be used to forcibly invalidate any existing load reservation:
- during a preemptive context switch, and
- if necessary when changing virtual to physical address mappings, such as when migrating pages that might contain an active reservation.
The invalidation of a hart’s reservation when it executes an LR or SC implies that a hart can only hold one reservation at a time, and that an SC can only pair with the most recent LR, and LR with the next following SC, in program order. This is a restriction to the Atomicity Axiom of the RVWMO memory model that ensures software runs correctly on expected common implementations that operate in this manner.
An SC instruction can never be observed by another RISC-V hart before the LR instruction that established the reservation.
The LR/SC sequence
can be given acquire semantics by setting the aq bit on the LR
instruction. The LR/SC sequence can be given release semantics by
setting the rl bit on the SC instruction. Assuming
suitable mappings for other atomic operations, setting the
aq bit on the LR instruction, and setting the
rl bit on the SC instruction makes the LR/SC
sequence sequentially consistent in the C++ memory_order_seq_cst
sense. Such a sequence does not act as a fence for ordering ordinary
load and store instructions before and after the sequence. Specific
instruction mappings for other C++ atomic operations,
or stronger notions of "sequential consistency", may require both
bits to be set on either or both of the LR or SC instruction.
If neither bit is set on either LR or SC, the LR/SC sequence can be observed to occur before or after surrounding memory operations from the same RISC-V hart. This can be appropriate when the LR/SC sequence is used to implement a parallel reduction operation.
Software should not set the rl bit on an LR instruction unless the aq bit is also set, nor should software set the aq bit on an SC instruction unless the rl bit is also set. LR.rl and SC.aq instructions are not guaranteed to provide any stronger ordering than those with both bits clear, but may result in lower performance.
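The recommendation above amounts to a two-line check on the bit pair. The following Python sketch, with a hypothetical function name, flags the LR.rl and SC.aq encodings that software is advised to avoid; it says nothing about whether an implementation accepts them.

```python
def is_discouraged(op: str, aq: bool, rl: bool) -> bool:
    """Return True for LR with rl but not aq, and for SC with aq but not rl."""
    if op == "lr":
        return rl and not aq
    if op == "sc":
        return aq and not rl
    raise ValueError(f"expected 'lr' or 'sc', got {op!r}")
```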
# a0 holds address of memory location
# a1 holds expected value
# a2 holds desired value
# a0 holds return value, 0 if successful, !0 otherwise
cas:
    lr.w t0, (a0)        # Load original value.
    bne t0, a1, fail     # Doesn't match, so fail.
    sc.w t0, a2, (a0)    # Try to update.
    bnez t0, cas         # Retry if store-conditional failed.
    li a0, 0             # Set return to success.
    jr ra                # Return.
fail:
    li a0, 1             # Set return to failure.
    jr ra                # Return.
LR/SC can be used to construct lock-free data structures. An example using LR/SC to implement a compare-and-swap function is shown in the cas listing above. If inlined, compare-and-swap functionality need only take four instructions.
7.2.1 Eventual Success of Store-Conditional Instructions
The Zalrsc extension defines constrained LR/SC loops, which have the following properties:
- The loop comprises only an LR/SC sequence and code to retry the sequence in the case of failure, and must comprise at most 16 instructions placed sequentially in memory.
- An LR/SC sequence begins with an LR instruction and ends with an SC instruction. The dynamic code executed between the LR and SC instructions can only contain instructions from the base "I" instruction set, excluding loads, stores, backward jumps, taken backward branches, JALR, FENCE, and SYSTEM instructions. Compressed forms of the aforementioned "I" instructions in the C (hence Zca) and Zcb extensions are also permitted.
- The code to retry a failing LR/SC sequence can contain backwards jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same constraint as the code between the LR and SC.
- The LR and SC addresses must lie within a memory region with the LR/SC eventuality property. The execution environment is responsible for communicating which regions have this property.
- The SC must be to the same effective address and of the same data size as the latest LR executed by the same hart.
LR/SC sequences that do not lie within constrained LR/SC loops are unconstrained. Unconstrained LR/SC sequences might succeed on some attempts on some implementations, but might never succeed on other implementations.
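Some of the properties listed above are static and can be checked mechanically. The Python sketch below is a rough linter, assuming the loop is given as a list of real (non-pseudo, uncompressed) mnemonics in memory order; the names BASE_I_ALLOWED and looks_constrained are hypothetical. It cannot check branch direction, effective addresses, or whether the memory region has the LR/SC eventuality property, so a True result is necessary but not sufficient.

```python
# Base "I" instructions that remain once loads, stores, JALR, FENCE, and
# SYSTEM instructions are excluded (RV32I plus the RV64I word forms).
BASE_I_ALLOWED = {
    "lui", "auipc", "jal",
    "beq", "bne", "blt", "bge", "bltu", "bgeu",
    "addi", "slti", "sltiu", "xori", "ori", "andi", "slli", "srli", "srai",
    "add", "sub", "sll", "slt", "sltu", "xor", "srl", "sra", "or", "and",
    "addiw", "slliw", "srliw", "sraiw", "addw", "subw", "sllw", "srlw", "sraw",
}


def looks_constrained(loop):
    """Check the statically visible rules for a constrained LR/SC loop."""
    if not loop or len(loop) > 16:
        return False                      # at most 16 instructions
    if not loop[0].startswith("lr."):
        return False                      # sequence begins with an LR
    sc_at = [i for i, m in enumerate(loop) if m.startswith("sc.")]
    if len(sc_at) != 1:
        return False                      # exactly one SC ends the sequence
    if loop[0].split(".")[1] != loop[sc_at[0]].split(".")[1]:
        return False                      # LR and SC must use the same data size
    rest = [m for i, m in enumerate(loop) if i not in (0, sc_at[0])]
    return all(m in BASE_I_ALLOWED for m in rest)
```

With pseudo-instructions expanded, the compare-and-swap loop shown earlier becomes lr.w, bne, sc.w, bne and passes this check.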
We restricted the length of LR/SC loops to fit within 64 contiguous instruction bytes in the base ISA to avoid undue restrictions on instruction cache and TLB size and associativity. Similarly, we disallowed other loads and stores within the loops to avoid restrictions on data-cache associativity in simple implementations that track the reservation within a private cache. The restrictions on branches and jumps limit the time that can be spent in the sequence. Floating-point operations and integer multiply/divide were disallowed to simplify the operating system’s emulation of these instructions on implementations lacking appropriate hardware support.
Software is not forbidden from using unconstrained LR/SC sequences, but portable software must detect the case that the sequence repeatedly fails, then fall back to an alternate code sequence that does not rely on an unconstrained LR/SC sequence. Implementations are permitted to unconditionally fail any unconstrained LR/SC sequence.
If a hart H enters a constrained LR/SC loop, the execution environment must guarantee that one of the following events eventually occurs:
- H or some other hart executes a successful SC to the reservation set of the LR instruction in H's constrained LR/SC loop.
- Some other hart executes an unconditional store or AMO instruction to the reservation set of the LR instruction in H's constrained LR/SC loop, or some other device in the system writes to that reservation set.
- H executes a branch or jump that exits the constrained LR/SC loop.
- H traps.
Note that these definitions permit an implementation to fail an SC instruction occasionally for any reason, provided the aforementioned guarantee is not violated.
As a consequence of the eventuality guarantee, if some harts in an execution environment are executing constrained LR/SC loops, and no other harts or devices in the execution environment execute an unconditional store or AMO to that reservation set, then at least one hart will eventually exit its constrained LR/SC loop. By contrast, if other harts or devices continue to write to that reservation set, it is not guaranteed that any hart will exit its LR/SC loop.
Loads and load-reserved instructions do not by themselves impede the progress of other harts' LR/SC sequences. We note this constraint implies, among other things, that loads and load-reserved instructions executed by other harts (possibly within the same core) cannot impede LR/SC progress indefinitely. For example, cache evictions caused by another hart sharing the cache cannot impede LR/SC progress indefinitely. Typically, this implies reservations are tracked independently of evictions from any shared cache. Similarly, cache misses caused by speculative execution within a hart cannot impede LR/SC progress indefinitely.
These definitions admit the possibility that SC instructions may spuriously fail for implementation reasons, provided progress is eventually made.
One advantage of CAS is that it guarantees that some hart eventually makes progress, whereas an LR/SC atomic sequence could livelock indefinitely on some systems. To avoid this concern, we added an architectural guarantee of livelock freedom for certain LR/SC sequences.
Earlier versions of this specification imposed a stronger starvation-freedom guarantee. However, the weaker livelock-freedom guarantee is sufficient to implement the C11 and C++11 languages, and is substantially easier to provide in some microarchitectural styles.
7.3 Za128rs Extension for Reservation-Set Size, Version 1.0
The Za128rs extension requires that the reservation sets used by the instructions in the Zalrsc extension be contiguous, naturally aligned, and at most 128 bytes in size.
7.4 Za64rs Extension for Reservation-Set Size, Version 1.0
The Za64rs extension requires that the reservation sets used by the instructions in the Zalrsc extension be contiguous, naturally aligned, and at most 64 bytes in size.
The Za64rs extension implies the Za128rs extension.
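The implication follows directly from the size bound, as the following Python sketch shows. It assumes a reservation set is described by a base address and a size in bytes, and it treats "naturally aligned" as meaning a power-of-two size with the base a multiple of that size; the function names are hypothetical.

```python
def meets_limit(base: int, size: int, limit: int) -> bool:
    """True if the contiguous set [base, base + size) is naturally aligned
    and no larger than limit bytes."""
    power_of_two = size > 0 and size & (size - 1) == 0
    return power_of_two and base % size == 0 and size <= limit


def meets_za64rs(base: int, size: int) -> bool:
    return meets_limit(base, size, 64)


def meets_za128rs(base: int, size: int) -> bool:
    return meets_limit(base, size, 128)
```

Any set accepted by meets_za64rs is also accepted by meets_za128rs, while a naturally aligned 128-byte set satisfies only the latter.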
7.5 "Zawrs" Extension for Wait-on-Reservation-Set instructions, Version 1.01
The Zawrs extension defines a pair of instructions to be used in polling loops that allows a core to enter a low-power state and wait on a store to a memory location. Waiting for a memory location to be updated is a common pattern in many use cases such as:
- Contenders for a lock waiting for the lock variable to be updated.
- Consumers waiting on the tail of an empty queue for the producer to queue work/data. The producer may be code executing on a RISC-V hart, an accelerator device, or an external I/O agent.
- Code waiting on a flag to be set in memory indicative of an event occurring. For example, software on a RISC-V hart may wait on a "done" flag to be set in memory by an accelerator device indicating completion of a job previously submitted to the device.
Such use cases involve polling on memory locations, and such busy loops can be a
wasteful expenditure of energy. To mitigate the wasteful looping in such usages,
a WRS.NTO (WRS-with-no-timeout) instruction is provided. Instead of polling
for a store to a specific memory location, software registers a reservation set
that includes all the bytes of the memory location using the LR instruction.
Then a subsequent WRS.NTO instruction would cause the hart to temporarily
stall execution in a low-power state until a store occurs to the reservation set
or an interrupt is observed.
Sometimes the program waiting on a memory update may also need to carry out a
task at a future time or otherwise place an upper bound on the wait. To support
such use cases a second instruction WRS.STO (WRS-with-short-timeout) is
provided that works like WRS.NTO but bounds the stall duration to an
implementation-defined short timeout such that the stall is terminated on the
timeout if no other conditions have occurred to terminate the stall. The
program using this instruction may then determine if its deadline has been
reached.
The instructions in the Zawrs extension are only useful in conjunction with the LR instruction, which is provided by the Zalrsc component of the A extension.
7.5.1 Wait-on-Reservation-Set Instructions
The WRS.NTO and WRS.STO instructions cause the hart to temporarily stall
execution in a low-power state as long as the reservation set is valid and no
pending interrupts, even if disabled, are observed. For WRS.STO the stall
duration is bounded by an implementation-defined short timeout. These
instructions are available in all privilege modes.
Hart execution may be stalled while the following conditions are all satisfied:
- The reservation set is valid
- If WRS.STO, a "short" duration since start of stall has not elapsed
- No pending interrupt is observed (see the rules below)
While stalled, an implementation is permitted to occasionally terminate the stall and complete execution for any reason.
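The stall conditions can be written as a single predicate. The Python sketch below is illustrative only: the function and parameter names are hypothetical, and time is measured in arbitrary units because the timeout is implementation-defined. A True result means the stall is permitted to continue, not that it must.

```python
def may_remain_stalled(reservation_valid: bool,
                       interrupt_pending: bool,
                       is_sto: bool = False,
                       elapsed: int = 0,
                       short_timeout: int = 0) -> bool:
    """Return True while all conditions for continuing the stall hold."""
    if not reservation_valid:
        return False          # reservation set was invalidated
    if interrupt_pending:
        return False          # a pending interrupt was observed
    if is_sto and elapsed >= short_timeout:
        return False          # WRS.STO only: the short timeout has elapsed
    return True
```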
WRS.NTO and WRS.STO instructions follow the rules of the WFI instruction
for resuming execution on a pending interrupt.
When the TW (Timeout Wait) bit in mstatus is set and WRS.NTO is executed
in any privilege mode other than M mode, and it does not complete within an
implementation-specific bounded time limit, the WRS.NTO instruction will cause
an illegal-instruction exception.
When executing in VS or VU mode, if the VTW bit is set in hstatus, the
TW bit in mstatus is clear, and the WRS.NTO does not complete within an
implementation-specific bounded time limit, the WRS.NTO instruction will cause
a virtual-instruction exception.
Since the WRS.STO and WRS.NTO instructions can complete execution for
reasons other than stores to the reservation set, software will likely need
a means of looping until the required stores have occurred.
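The shape of such a loop can be sketched in Python, with the LR and the WRS instruction passed in as callables. The names are hypothetical and the waited-for condition (a nonzero value) is only an example; the point is that the condition is re-checked after every return from the wait.

```python
def wait_for_nonzero(load_reserved, wrs):
    """Loop until the watched location holds a nonzero value.

    load_reserved() models an LR: it reads the location and registers a
    reservation. wrs() models WRS.NTO or WRS.STO and may return early for
    any reason, so the value is always reloaded before deciding.
    """
    while True:
        value = load_reserved()
        if value != 0:
            return value
        wrs()
```

If the location reads 0, 0, then 5 on successive loads, the loop stalls twice and returns 5.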
The duration of a WRS.STO instruction’s timeout may vary significantly within
and among implementations. In typical implementations this duration should be
roughly in the range of 10 to 100 times an on-chip cache miss latency or a
cacheless access to main memory.
WRS.NTO, unlike WFI, is not specified to cause an illegal-instruction
exception if executed in U-mode when the governing TW bit is 0. WFI is
typically not expected to be used in U-mode and on many systems may promptly
cause an illegal-instruction exception if used at U-mode. Unlike WFI,
WRS.NTO is expected to be used by software in U-mode when waiting on
memory but without a deadline for that wait.
7.6 "Zaamo" Extension for Atomic Memory Operations
The atomic memory operation (AMO) instructions perform read-modify-write operations for multiprocessor synchronization and are encoded with an R-type instruction format. These AMO instructions atomically load a data value from the address in rs1, place the value into register rd, apply a binary operator to the loaded value and the original value in rs2, then store the result back to the original address in rs1. AMOs can either operate on doublewords (RV64 only) or words in memory. For RV64, 32-bit AMOs always sign-extend the value placed in rd, and ignore the upper 32 bits of the original value of rs2.
For AMOs, the Zaamo extension requires that the address held in rs1 be naturally aligned to the size of the operand (i.e., eight-byte aligned for doublewords and four-byte aligned for words). If the address is not naturally aligned, an address-misaligned exception or an access-fault exception will be generated. The access-fault exception can be generated for a memory access that would otherwise be able to complete except for the misalignment, if the misaligned access should not be emulated.
The misaligned atomicity granule PMA, defined in sec:misaligned-atomicity-granule, optionally relaxes this alignment requirement. If present, the misaligned atomicity granule PMA specifies the size of a misaligned atomicity granule, a power-of-two number of bytes. The misaligned atomicity granule PMA applies only to AMOs, loads and stores defined in the base ISAs, and loads and stores of no more than XLEN bits defined in the F, D, and Q extensions, and compressed encodings thereof. For an instruction in that set, if all accessed bytes lie within the same misaligned atomicity granule, the instruction will not raise an exception for reasons of address alignment, and the instruction will give rise to only one memory operation for the purposes of RVWMO—i.e., it will execute atomically.
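The condition that all accessed bytes lie within the same granule reduces to comparing the granule index of the first and last byte. A minimal Python sketch, with a hypothetical function name:

```python
def within_one_granule(addr: int, size: int, granule: int) -> bool:
    """True if bytes [addr, addr + size) all fall in one granule.

    granule is the misaligned atomicity granule size in bytes and must be
    a power of two.
    """
    if granule <= 0 or granule & (granule - 1):
        raise ValueError("granule size must be a power of two")
    return addr // granule == (addr + size - 1) // granule
```

With a 16-byte granule, a four-byte access at 0x1006 stays inside one granule, whereas the same access at 0x100E straddles two.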
The operations supported are swap, integer add, bitwise AND, bitwise OR,
bitwise XOR, and signed and unsigned integer maximum and minimum.
Without ordering constraints, these AMOs can be used to implement
parallel reduction operations, where typically the return value would be
discarded by writing to x0.
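The data-path behavior of the word-sized AMOs can be modeled in a few lines of Python. This is a single-threaded sketch under stated assumptions: memory is a dictionary of 32-bit words, the helper names are hypothetical, and the returned value models what RV64 places in rd (the sign-extended old word). Atomicity and ordering are not modeled.

```python
MASK32 = 0xFFFFFFFF


def sext32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


# Each entry combines the old memory word with the low 32 bits of rs2.
AMO_OPS = {
    "amoswap.w": lambda old, src: src,
    "amoadd.w":  lambda old, src: old + src,
    "amoand.w":  lambda old, src: old & src,
    "amoor.w":   lambda old, src: old | src,
    "amoxor.w":  lambda old, src: old ^ src,
    "amomax.w":  lambda old, src: max(sext32(old), sext32(src)),
    "amomin.w":  lambda old, src: min(sext32(old), sext32(src)),
    "amomaxu.w": lambda old, src: max(old, src),
    "amominu.w": lambda old, src: min(old, src),
}


def amo_w(mem: dict, addr: int, op: str, rs2: int) -> int:
    """Apply a word AMO to mem[addr] and return the sign-extended old value."""
    if addr % 4:
        raise ValueError("word AMO address must be four-byte aligned")
    old = mem.get(addr, 0) & MASK32
    mem[addr] = AMO_OPS[op](old, rs2 & MASK32) & MASK32
    return sext32(old)
```

The signed and unsigned variants differ only in how the two operands are compared: with 0xFFFFFFFF in memory and 1 in rs2, amomax.w stores 1 while amomaxu.w leaves 0xFFFFFFFF in place.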
We provided fetch-and-op style atomic primitives as they scale to highly
parallel systems better than LR/SC or CAS. A simple microarchitecture
can implement AMOs using the LR/SC primitives, provided the
implementation can guarantee the AMO eventually completes. More complex
implementations might also implement AMOs at memory controllers, and can
optimize away fetching the original value when the destination is x0.
The set of AMOs was chosen to support the C11/C++11 atomic memory operations efficiently, and also to support parallel reductions in memory. Another use of AMOs is to provide atomic updates to memory-mapped device registers (e.g., setting, clearing, or toggling bits) in the I/O space.
The Zaamo extension enables microcontroller class implementations to utilize atomic primitives from the AMO subset of the A extension. Typically such implementations do not have caches and thus may not be able to naturally support the LR/SC instructions provided by the Zalrsc extension.
To help implement multiprocessor synchronization, the AMOs optionally provide release consistency semantics. If the aq bit is set, then no later memory operations in this RISC-V hart can be observed to take place before the AMO. Conversely, if the rl bit is set, then other RISC-V harts will not observe the AMO before memory accesses preceding the AMO in this RISC-V hart. Setting both the aq and the rl bit on an AMO makes the sequence sequentially consistent, meaning that it cannot be reordered with earlier or later memory operations from the same hart.
The AMOs were designed to implement the C11 and C++11 memory models efficiently. Although the FENCE R, RW instruction suffices to implement the acquire operation and FENCE RW, W suffices to implement release, both imply additional unnecessary ordering as compared to AMOs with the corresponding aq or rl bit set.
An example code sequence for a critical section guarded by a test-and-test-and-set spinlock is shown in the listing below. Note the first AMO is marked aq to order the lock acquisition before the critical section, and the second AMO is marked rl to order the critical section before the lock relinquishment.
    li t0, 1                     # Initialize swap value.
again:
    lw t1, (a0)                  # Check if lock is held.
    bnez t1, again               # Retry if held.
    amoswap.w.aq t1, t0, (a0)    # Attempt to acquire lock.
    bnez t1, again               # Retry if held.
    # ...
    # Critical section.
    # ...
    amoswap.w.rl x0, x0, (a0)    # Release lock by storing 0.
We recommend the use of the AMO Swap idiom shown in the spinlock listing above for both lock acquire and release to simplify the implementation of speculative lock elision. [19]
The instructions in the "A" extension can be used to provide sequentially
consistent loads and stores, but this constrains hardware
reordering of memory accesses more than necessary.
A C++ sequentially consistent load can be implemented as
an LR with aq set. However, the LR/SC eventual
success guarantee may slow down concurrent loads from the same effective
address. A sequentially consistent store can be implemented as an AMOSWAP
that writes the old value to x0 and has rl set. However, the superfluous
load may impose ordering constraints that are unnecessary for this use case.
Specific compilation conventions may require both the aq and rl
bits to be set in either or both the LR and AMOSWAP instructions.
7.7 "Zalasr" Atomic Load-Acquire and Store-Release Instructions, Version 1.0
The Zalasr (Load-Acquire and Store-Release) extension provides load-acquire and store-release instructions in RISC-V.
These can be important for high performance designs by enabling finer-grained synchronisation than is possible with fences alone, by providing a unidirectional fence.
Load-acquire and store-release are widely used in language-level memory models:
both the Java and C++ memory models make use of acquire-release semantics, and C++'s atomic provides primitives that are meant to map directly to load-acquire and store-release instructions.
The Zalasr extension builds on the atomic support provided by the Zaamo (Atomic Memory Operations), Zalrsc (Load-Reserved and Store-Conditional), and Zabha (Byte and Halfword Atomic Memory Operations) extensions by providing additional atomic operations (although it can be implemented independently of them). All of the AMO operations in Zaamo (and Zabha) are read-modify-write operations that both load and store. The Zalrsc extension provides operations that are only loads or stores. However, since it is designed to perform an atomic operation on a single memory word or doubleword, the loads and stores are designed to be paired. The load-reserved implies that a future store-conditional will follow while store-conditional requires that there was a previous load-reserved without other intervening loads or stores. Therefore, the Zalrsc extension does not provide a general atomic and ordered load or store.
Zalasr fills this gap by offering truly standalone atomic and ordered loads and stores. The Zalasr instructions are atomic loads and stores that support ordering annotations. With the combination of Zaamo, Zabha, and Zalasr all C++ atomic operations can be supported with single instructions.
7.7.1 Load-Acquire and Store-Release Instructions
The Zalasr instructions always sign-extend the value placed in rd and ignore the upper bits of the value of rs2. The instructions in the Zalasr extension require that the address held in rs1 be naturally aligned to the size in bytes (2^width) of the operand. If the address is not naturally aligned, an address-misaligned exception or an access-fault exception will be generated. The access-fault exception can be generated for a memory access that would otherwise be able to complete except for the misalignment, if the misaligned access should not be emulated.
The misaligned atomicity granule PMA, defined in sec:misaligned-atomicity-granule, optionally relaxes this alignment requirement. If all accessed bytes lie within the same misaligned atomicity granule, the instruction will not raise an exception for reasons of address alignment, and the instruction will give rise to only one memory operation for the purposes of RVWMO—i.e., it will execute atomically.
7.7.2 Load Acquire
Synopsis The load-acquire instruction atomically loads a 2^width-byte value from the address in rs1 and places the sign-extended value into the register rd, subject to the ordering annotations specified in the instruction.
Mnemonic
lb.{aq,aqrl} rd, (rs1)
lh.{aq,aqrl} rd, (rs1)
lw.{aq,aqrl} rd, (rs1)
ld.{aq,aqrl} rd, (rs1)
Encoding
Description This instruction atomically loads 2^width bytes of memory from the address in rs1 and writes the result into rd. If the size in bits (2^(width+3)) is less than XLEN, the value is sign-extended to fill the destination register. This load must have the ordering annotation aq and may have ordering annotation rl encoded in the instruction. The instruction always has an "acquire-RCsc" annotation, and if the bit rl is set the instruction has a "release-RCsc" annotation.
The aq bit is mandatory because the two encodings that would be produced are not seen as useful at this time. The version with neither the aq nor the rl bit set would correspond to a load with no ordering annotations that was guaranteed to be performed atomically. This can be achieved with ordinary load instructions by suitably aligning pointers. The version with only the rl bit would correspond to load-release. Load-release has theoretical applications in seqlocks, but is not supported in language-level memory models and so is not included.
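The value semantics of the load (size, alignment, and sign extension) can be sketched in Python as follows. The function name is hypothetical, memory is assumed to be a little-endian byte string, and the acquire ordering itself is not modeled.

```python
def load_acquire_value(mem: bytes, addr: int, width: int, xlen: int = 64) -> int:
    """Return the sign-extended 2**width-byte value at addr.

    width is 0, 1, 2, or 3 for lb, lh, lw, and ld respectively.
    """
    size = 1 << width
    if size * 8 > xlen:
        raise ValueError("operand wider than XLEN")
    if addr % size:
        raise ValueError("address is not naturally aligned")
    return int.from_bytes(mem[addr:addr + size], "little", signed=True)
```

A byte 0x80 loaded with width 0 therefore appears in rd as -128, matching the sign extension described above.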
7.7.3 Store Release
Synopsis The store-release instruction atomically stores the 2^width-byte value from the low bits of register rs2 to the address in rs1, subject to the ordering annotations specified in the instruction.
Mnemonic
sb.{rl,aqrl} rs2, (rs1)
sh.{rl,aqrl} rs2, (rs1)
sw.{rl,aqrl} rs2, (rs1)
sd.{rl,aqrl} rs2, (rs1)
Encoding
Description This instruction atomically stores the low 2^width bytes of rs2 to memory at the address in rs1. This store must have the ordering annotation rl and may have the ordering annotation aq encoded in the instruction. The instruction always has a "release-RCsc" annotation, and if the aq bit is set the instruction also has an "acquire-RCsc" annotation.
The rl bit is mandatory because the two encodings that would be produced are not seen as useful at this time. The version with neither the aq nor the rl bit set would correspond to a store with no ordering annotations that was guaranteed to be performed atomically. This can be achieved with ordinary store instructions by suitably aligning pointers. The version with only the aq bit would correspond to store-acquire. Store-acquire has theoretical applications in seqlocks, but is not supported in language-level memory models and so is not included.
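The permitted aq/rl combinations for the two instruction families can be summarized in a short, non-normative Python sketch. The helper `ordering_suffix` and the labels "load" and "store" are hypothetical names chosen for illustration; the function only reflects which combinations have mnemonics in the lists above.

```python
def ordering_suffix(kind: str, aq: bool, rl: bool) -> str:
    """Return the mnemonic suffix for a load-acquire or store-release.

    kind is "load" or "store" (hypothetical labels used only in this sketch).
    Raises ValueError for the combinations that the text excludes.
    """
    if kind == "load":
        if not aq:
            raise ValueError("load-acquire requires the aq bit")
    elif kind == "store":
        if not rl:
            raise ValueError("store-release requires the rl bit")
    else:
        raise ValueError("kind must be 'load' or 'store'")
    if aq and rl:
        return "aqrl"
    return "aq" if aq else "rl"
```

Under these assumptions, `"lw." + ordering_suffix("load", True, False)` yields `lw.aq`, while requesting a load with only rl set is rejected, matching the excluded load-release form.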
7.8 "Zabha" Extension for Byte and Halfword Atomic Memory Operations, Version 1.0
The A-extension offers atomic memory operation (AMO) instructions for words,
doublewords, and quadwords (only for AMOCAS). The absence of atomic
operations for subword data types necessitates emulation strategies. For bitwise
operations, this emulation can be performed via word-sized bitwise AMO*
instructions. For non-bitwise operations, emulation is achievable using
word-sized LR/SC instructions.
Several limitations arise from this emulation approach:
- In systems with large-scale or Non-Uniform Memory Access (NUMA) configurations, emulation based on LR/SC introduces issues related to scalability and fairness, particularly under conditions of high contention.
- Emulation of narrower AMOs through wider AMO* instructions on non-idempotent IO memory regions may result in unintended side effects.
- Utilizing wider AMO* instructions for emulating narrower AMOs risks activating extraneous breakpoints or watchpoints.
- In the absence of native support for subword atomics, compilers often resort to inlining code sequences to provide the required emulation. This practice contributes to an increase in code size, with consequent impacts on system performance and memory utilization.
The Zabha extension addresses these limitations by adding support for byte and halfword atomic memory operations to the RISC-V Unprivileged ISA. The Zabha extension depends upon the Zaamo standard extension.
7.8.1 Byte and Halfword Atomic Memory Operation Instructions
The Zabha extension provides the AMO[ADD|AND|OR|XOR|SWAP|MIN[U]|MAX[U]].[B|H]
instructions. If the Zacas extension is also implemented, Zabha further provides
the AMOCAS.[B|H] instructions.
Byte and halfword AMOs always sign-extend the value placed in rd, and ignore
the upper XLEN-8 (byte) or XLEN-16 (halfword) bits of the original value in
rs2. The AMOCAS.[B|H] instructions similarly ignore the upper XLEN-8 or
XLEN-16 bits of the original value in rd.
Similar to the AMOs specified in the A extension, the Zabha extension mandates
that the address contained in the rs1 register must be naturally aligned to
the size of the operand. The same exception options as specified in the A
extension are applicable in cases where the address is not naturally aligned.
Similar to the AMOs specified in the A and Zacas extensions, the AMOs in the
Zabha extension optionally provide release consistency semantics, using the aq
and rl bits, to help implement multiprocessor synchronization.
Zabha omits byte and halfword support for LR and SC due to low utility.
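The operand handling of a byte AMO described above can be illustrated with a non-normative Python sketch of AMOADD.B. The function name `amoadd_b` and the use of a `bytearray` as memory are assumptions made for this example; atomicity and the aq/rl ordering bits are not modeled.

```python
def amoadd_b(mem: bytearray, addr: int, rs2_value: int, xlen: int = 64) -> int:
    """Model the data path of AMOADD.B and return the value written to rd.

    Only the low 8 bits of rs2_value participate; the old byte is
    sign-extended to XLEN bits before being returned.
    """
    old = mem[addr]
    mem[addr] = (old + (rs2_value & 0xFF)) & 0xFF   # upper bits of rs2 ignored
    signed = old - 0x100 if old & 0x80 else old     # sign-extend the old byte
    return signed & ((1 << xlen) - 1)
```

Note that the neighboring byte in memory is left untouched, which is one of the properties that emulation through a wider AMO cannot always guarantee on non-idempotent regions.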
7.9 "Zacas" Extension for Atomic Compare-and-Swap (CAS) Instructions, Version 1.0.0
Compare-and-Swap (CAS) provides an easy and typically faster way to perform thread synchronization operations when supported as a hardware instruction. CAS is typically used by lock-free and wait-free algorithms. This extension defines CAS instructions to operate on 32-bit, 64-bit, and 128-bit (RV64 only) data values. The Zacas extension depends upon the Zaamo extension.
7.9.1 Word/Doubleword/Quadword CAS (AMOCAS.W/D/Q) Instructions
For RV32, AMOCAS.W atomically loads a 32-bit data value from the address in rs1,
compares the loaded value to the 32-bit value held in rd, and if the comparison
is bitwise equal, then stores the 32-bit value held in rs2 to the original
address in rs1. The value loaded from memory is placed into register rd. The
operation performed by AMOCAS.W for RV32 is as follows:
temp = mem[X(rs1)]
if ( temp == X(rd) )
    mem[X(rs1)] = X(rs2)
X(rd) = temp
AMOCAS.D is similar to AMOCAS.W but operates on 64-bit data values.
For RV32, AMOCAS.D atomically loads a 64-bit data value from the address in
rs1, compares the loaded value to a 64-bit value held in a register pair
consisting of rd and rd+1, and if the comparison is bitwise equal, then
stores the 64-bit value held in the register pair rs2 and rs2+1 to the
original address in rs1. The value loaded from memory is placed into the
register pair rd and rd+1. The instruction requires the first register in
the pair to be even numbered; encodings with odd numbered registers specified
in rs2 and rd are reserved. When the first register of a source register
pair is x0, then both halves of the pair read as zero. When the first
register of a destination register pair is x0, then the entire register
result is discarded and neither destination register is written.
The operation performed by AMOCAS.D for RV32 is as follows:
temp0 = mem[X(rs1)+0]
temp1 = mem[X(rs1)+4]
comp0 = (rd == x0) ? 0 : X(rd)
comp1 = (rd == x0) ? 0 : X(rd+1)
swap0 = (rs2 == x0) ? 0 : X(rs2)
swap1 = (rs2 == x0) ? 0 : X(rs2+1)
if ( temp0 == comp0 ) && ( temp1 == comp1 )
    mem[X(rs1)+0] = swap0
    mem[X(rs1)+4] = swap1
endif
if ( rd != x0 )
    X(rd) = temp0
    X(rd+1) = temp1
endif
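The register-pair and x0 rules in the pseudocode above can be exercised with a non-normative Python model. The function `amocas_d_rv32`, the list-based register file, and the dictionary-based word memory are all assumptions of this sketch; it models only the data path, not atomicity or ordering.

```python
def amocas_d_rv32(regs, mem, rd, rs1, rs2):
    """Model AMOCAS.D on RV32.

    regs: list of 32 ints for x0..x31 (regs[0] is assumed to hold 0).
    mem:  dict mapping a word address to a 32-bit value.
    rd, rs1, rs2: register numbers.
    """
    if rd % 2 or rs2 % 2:
        raise ValueError("odd-numbered rd or rs2 is a reserved encoding")
    addr = regs[rs1]
    temp0, temp1 = mem[addr], mem[addr + 4]
    # A pair whose first register is x0 reads as zero in both halves.
    comp0, comp1 = (0, 0) if rd == 0 else (regs[rd], regs[rd + 1])
    swap0, swap1 = (0, 0) if rs2 == 0 else (regs[rs2], regs[rs2 + 1])
    if (temp0, temp1) == (comp0, comp1):
        mem[addr], mem[addr + 4] = swap0, swap1
    # A destination pair starting at x0 discards the whole result.
    if rd != 0:
        regs[rd], regs[rd + 1] = temp0, temp1
```

In this model, a failed comparison still updates the rd pair with the value read from memory, which is what allows a retry loop to reuse the result without reloading.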
For RV64, AMOCAS.W atomically loads a 32-bit data value from the address in
rs1, compares the loaded value to the lower 32 bits of the value held in rd,
and if the comparison is bitwise equal, then stores the lower 32 bits of the
value held in rs2 to the original address in rs1. The 32-bit value loaded
from memory is sign-extended and is placed into register rd. The operation
performed by AMOCAS.W for RV64 is as follows:
temp[31:0] = mem[X(rs1)]
if ( temp[31:0] == X(rd)[31:0] )
    mem[X(rs1)] = X(rs2)[31:0]
X(rd) = SignExtend(temp[31:0])
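As a non-normative illustration of the RV64 behavior above, the following Python sketch shows that only the lower 32 bits of rd and rs2 take part in the operation and that the loaded word is sign-extended into rd. The name `amocas_w_rv64` and the list/dictionary representations of registers and memory are assumptions of this example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def amocas_w_rv64(regs, mem, rd, rs1, rs2):
    """Model AMOCAS.W on RV64.

    regs: list of 32 ints for x0..x31 (regs[0] is assumed to hold 0).
    mem:  dict mapping a word address to a 32-bit value.
    """
    addr = regs[rs1]
    temp = mem[addr] & MASK32
    if temp == (regs[rd] & MASK32):        # upper 32 bits of rd are ignored
        mem[addr] = regs[rs2] & MASK32     # upper 32 bits of rs2 are ignored
    if rd != 0:                            # writes to x0 are discarded
        signed = temp - (1 << 32) if temp & 0x80000000 else temp
        regs[rd] = signed & MASK64
```

A consequence visible in this model is that a compare value with stale upper bits in rd can still match, since the comparison is confined to bits 31:0.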
For RV64, AMOCAS.D atomically loads a 64-bit data value from the address in
rs1, compares the loaded value to a 64-bit value held in rd, and if the
comparison is bitwise equal, then stores the 64-bit value held in rs2 to the
original address in rs1. The value loaded from memory is placed into register
rd. The operation performed by AMOCAS.D for RV64 is as follows:
temp = mem[X(rs1)]
if ( temp == X(rd) )
    mem[X(rs1)] = X(rs2)
X(rd) = temp
AMOCAS.Q (RV64 only) atomically loads a 128-bit data value from the address in
rs1, compares the loaded value to a 128-bit value held in a register pair
consisting of rd and rd+1, and if the comparison is bitwise equal, then
stores the 128-bit value held in the register pair rs2 and rs2+1 to the
original address in rs1. The value loaded from memory is placed into the
register pair rd and rd+1. The instruction requires the first register in
the pair to be even numbered; encodings with odd numbered registers specified in
rs2 and rd are reserved. When the first register of a source register pair
is x0, then both halves of the pair read as zero. When the first register of a
destination register pair is x0, then the entire register result is discarded
and neither destination register is written. The operation performed by
AMOCAS.Q is as follows:
temp0 = mem[X(rs1)+0]
temp1 = mem[X(rs1)+8]
comp0 = (rd == x0) ? 0 : X(rd)
comp1 = (rd == x0) ? 0 : X(rd+1)
swap0 = (rs2 == x0) ? 0 : X(rs2)
swap1 = (rs2 == x0) ? 0 : X(rs2+1)
if ( temp0 == comp0 ) && ( temp1 == comp1 )
    mem[X(rs1)+0] = swap0
    mem[X(rs1)+8] = swap1
endif
if ( rd != x0 )
    X(rd) = temp0
    X(rd+1) = temp1
endif
Some algorithms may load the previous data value of a memory location into the
register used as the compare data value source by a Zacas instruction. When
using a Zacas instruction that uses a register pair to source the compare value,
the two registers may be loaded using two individual loads. The two individual
loads may read an inconsistent pair of values but that is not an issue since the
AMOCAS operation itself uses an atomic load-pair from memory to obtain the
data value for its comparison.
The following example code sequence illustrates the use of AMOCAS.D in an RV32
implementation to atomically increment a 64-bit counter.
# a0 - address of the counter.
increment:
    lw   a2, (a0)         # Load current counter value using
    lw   a3, 4(a0)        # two individual loads.
retry:
    mv   a6, a2           # Save the low 32 bits of the current value.
    mv   a7, a3           # Save the high 32 bits of the current value.
    addi a4, a2, 1        # Increment the low 32 bits.
    sltu a1, a4, a2       # Determine if there is a carry out.
    add  a5, a3, a1       # Add the carry if any to high 32 bits.
    amocas.d.aqrl a2, a4, (a0)
    bne  a2, a6, retry    # If amocas.d failed then retry
    bne  a3, a7, retry    # using current values loaded by amocas.d.
    ret
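The control flow of the sequence above (two individual loads, carry propagation between the halves, and retry using the values returned by the failed compare-and-swap) can be traced with a non-normative Python model. The names `amocas_d` and `increment` and the `interfere` hook, which stands in for a write by another hart between the initial loads and the CAS, are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def amocas_d(mem, expected, desired):
    """Model a 64-bit CAS on mem = [low, high]; return the value read from memory."""
    old = (mem[0], mem[1])
    if old == expected:
        mem[0], mem[1] = desired
    return old

def increment(mem, interfere=None):
    """Increment the 64-bit counter held as two 32-bit halves in mem."""
    low, high = mem[0], mem[1]           # two individual loads
    if interfere is not None:
        interfere(mem)                   # simulated write by another hart
    while True:
        saved = (low, high)              # mv a6, a2 / mv a7, a3
        new_low = (low + 1) & MASK32     # addi a4, a2, 1
        carry = 1 if new_low < low else 0    # sltu a1, a4, a2
        new_high = (high + carry) & MASK32   # add a5, a3, a1
        low, high = amocas_d(mem, saved, (new_low, new_high))
        if (low, high) == saved:         # CAS succeeded
            return
```

When the simulated interference changes the counter, the first CAS fails and the loop recomputes the increment from the value the CAS returned, so no reload of memory is needed.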
Just as for AMOs in the A extension, AMOCAS.W/D/Q requires that the address
held in rs1 be naturally aligned to the size of the operand (i.e., 16-byte
aligned for quadwords, eight-byte aligned for doublewords, and four-byte
aligned for words). The same exception options apply if the address
is not naturally aligned.
Just as for AMOs in the A extension, the AMOCAS.W/D/Q instructions optionally
provide release consistency semantics, using the aq and rl bits, to help
implement multiprocessor synchronization. The memory operation performed by an
AMOCAS.W/D/Q, when successful, has acquire semantics if the aq bit is 1 and has
release semantics if the rl bit is 1. The memory operation performed by an
AMOCAS.W/D/Q, when not successful, has acquire semantics if the aq bit is 1 but
does not have release semantics, regardless of rl.
A FENCE instruction may be used to order the memory read access and, if
produced, the memory write access by an AMOCAS.W/D/Q instruction.
An unsuccessful AMOCAS.W/D/Q may either not perform a memory write or may
write back the old value loaded from memory. The memory write, if produced, does
not have release semantics, regardless of rl.
Irrespective of whether a write is actually performed, the instruction is
treated as an AMO for the purposes of the RVWMO PPO rules.
An AMOCAS.W/D/Q instruction always requires write permissions.
The following example code sequence illustrates the use of AMOCAS.Q to
implement the enqueue operation for a non-blocking concurrent queue using the
algorithm outlined in [20]. The algorithm atomically operates on a
pointer and its associated modification counter using the AMOCAS.Q instruction
to avoid the ABA problem.
# Enqueue operation of a non-blocking concurrent queue.
# Data structures used by the queue:
#   structure pointer_t {ptr: node_t *, count: uint64_t}
#   structure node_t {next: pointer_t, value: data type}
#   structure queue_t {Head: pointer_t, Tail: pointer_t}
# Inputs to the procedure:
#   a0 - address of Tail variable
#   a4 - address of a new node to insert at tail
enqueue:
    ld   a6, (a0)            # a6 = Tail.ptr
    ld   a7, 8(a0)           # a7 = Tail.count
    ld   a2, (a6)            # a2 = Tail.ptr->next.ptr
    ld   a3, 8(a6)           # a3 = Tail.ptr->next.count
    ld   t1, (a0)
    ld   t2, 8(a0)
    bne  a6, t1, enqueue     # Retry if Tail & next are not consistent
    bne  a7, t2, enqueue     # Retry if Tail & next are not consistent
    bne  a2, x0, move_tail   # Was tail pointing to the last node?
    mv   t1, a2              # Save Tail.ptr->next.ptr
    mv   t2, a3              # Save Tail.ptr->next.count
    addi a5, a3, 1           # Link the node at the end of the list
    amocas.q.aqrl a2, a4, (a6)
    bne  a2, t1, enqueue     # Retry if CAS failed
    bne  a3, t2, enqueue     # Retry if CAS failed
    addi a5, a7, 1           # Update Tail to the inserted node
    amocas.q.aqrl a6, a4, (a0)
    ret                      # Enqueue done
move_tail:                   # Tail was not pointing to the last node
    addi a3, a7, 1           # Try to swing Tail to the next node
    amocas.q.aqrl a6, a2, (a0)
    j    enqueue             # Retry
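The reason for pairing the pointer with a modification counter can be shown with a small, single-threaded Python sketch. It is not a model of the full queue algorithm: `cas_pair` and `aba_demo` are hypothetical names, pointers are represented by strings, and the interleaving of another hart is simulated by ordinary sequential calls.

```python
def cas_pair(cell, expected, desired):
    """Model a CAS on a (ptr, count) pair stored in cell[0]; return the old pair."""
    old = cell[0]
    if old == expected:
        cell[0] = desired
    return old

def aba_demo():
    tail = [("A", 0)]
    snapshot = tail[0]                       # this hart reads (A, 0)
    # Another hart changes the pointer away from A and back again,
    # incrementing the counter on every successful update.
    cas_pair(tail, ("A", 0), ("B", 1))
    cas_pair(tail, ("B", 1), ("A", 2))
    # This hart now attempts its update with the stale snapshot.
    observed = cas_pair(tail, snapshot, ("C", snapshot[1] + 1))
    return {
        "pointer_matches": observed[0] == snapshot[0],
        "cas_succeeded": observed == snapshot,
        "tail": tail[0],
    }
```

In this scenario the pointer alone compares equal to the stale snapshot, but the counter differs, so the paired compare-and-swap fails and the stale update is not applied.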
7.10 "Zama16b" Extension for 16-byte Misaligned Atomicity, Version 1.0
If the Zama16b extension is implemented, then the misaligned atomicity granule in main memory regions with both the cacheability and coherence PMAs is 16 bytes. Misaligned loads, stores, and AMOs to main memory regions that do not cross a naturally aligned 16-byte boundary are atomic.
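The boundary condition stated above can be expressed as a short, non-normative Python check. The function name `within_16_byte_granule` is hypothetical, and the sketch tests only whether an access stays inside one naturally aligned 16-byte block; it does not model the cacheability and coherence PMAs that the guarantee also depends on.

```python
def within_16_byte_granule(addr: int, size: int) -> bool:
    """Return True if [addr, addr + size) lies in one aligned 16-byte block."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_block = addr // 16
    last_block = (addr + size - 1) // 16
    return first_block == last_block
```

For example, an 8-byte access at 0x1008 stays within the block starting at 0x1000, whereas the same access at 0x100C spans two blocks and so falls outside what this check accepts.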