6 Scalar Integer Extensions
This chapter is currently being restructured. Its contents are normative, but the presentation might appear disjoint.
This chapter describes the scalar integer extensions. Most of these extensions are accordingly named with the prefix "Zi", with the exception of the integer multiplication and division extensions, which are named "M" or prefixed with "Zm".
6.1 "Zifencei" Extension for Instruction-Fetch Fence, Version 2.0
This chapter defines the "Zifencei" extension, which includes the FENCE.I instruction that provides explicit synchronization between writes to instruction memory and instruction fetches on the same hart. Currently, this instruction is the only standard mechanism to ensure that stores visible to a hart will also be visible to its instruction fetches.
We considered but did not include a "store instruction word" instruction as in [17]. JIT compilers may generate a large trace of instructions before a single FENCE.I, and amortize any instruction cache snooping/invalidation overhead by writing translated instructions to memory regions that are known not to reside in the I-cache.
The FENCE.I instruction was designed to support a wide variety of implementations. A simple implementation can flush the local instruction cache and the instruction pipeline when the FENCE.I is executed. A more complex implementation might snoop the instruction (data) cache on every data (instruction) cache miss, or use an inclusive unified private L2 cache to invalidate lines from the primary instruction cache when they are being written by a local store instruction. If instruction and data caches are kept coherent in this way, or if the memory system consists of only uncached RAMs, then just the fetch pipeline needs to be flushed at a FENCE.I.
The FENCE.I instruction was previously part of the base I instruction set. Two main issues motivated moving it out of the mandatory base, although at the time of writing it is still the only standard method for maintaining instruction-fetch coherence.
First, it has been recognized that on some systems, FENCE.I will be expensive to implement and alternate mechanisms are being discussed in the memory model task group. In particular, for designs that have an incoherent instruction cache and an incoherent data cache, or where the instruction cache refill does not snoop a coherent data cache, both caches must be completely flushed when a FENCE.I instruction is encountered. This problem is exacerbated when there are multiple levels of I and D cache in front of a unified cache or outer memory system.
Second, the instruction is not powerful enough to be made available at user level in a Unix-like operating system environment. FENCE.I synchronizes only the local hart, and the OS can reschedule the user hart to a different physical hart after the FENCE.I. This would require the OS to execute an additional FENCE.I as part of every context migration. For this reason, the standard Linux ABI has removed FENCE.I from user level and now requires a system call to maintain instruction-fetch coherence, which allows the OS to minimize the number of FENCE.I executions required on current systems and provides forward compatibility with future improved instruction-fetch coherence mechanisms.
Future approaches to instruction-fetch coherence under discussion include providing more restricted versions of FENCE.I that only target a given address specified in rs1, and/or allowing software to use an ABI that relies on machine-mode cache-maintenance operations.
The FENCE.I instruction is used to synchronize the instruction and data streams. RISC-V does not guarantee that stores to instruction memory will be made visible to instruction fetches on a RISC-V hart until that hart executes a FENCE.I instruction. A FENCE.I instruction ensures that a subsequent instruction fetch on a RISC-V hart will see any previous data stores already visible to the same RISC-V hart. FENCE.I does not ensure that other RISC-V harts' instruction fetches will observe the local hart’s stores in a multiprocessor system. To make a store to instruction memory visible to all RISC-V harts, the writing hart also has to execute a data FENCE before requesting that all remote RISC-V harts execute a FENCE.I.
A FENCE.I instruction orders all explicit memory accesses that precede the FENCE.I in program order before all instruction fetches that follow the FENCE.I in program order.
In the following litmus test, for example, the outcome a0=1, a1=0 on the consumer hart is forbidden, assuming little-endian RV32IC harts. Initially, flag = 0.

Producer hart:

    la t0, patch_me
    li t1, 0x4585
    sh t1, (t0)    # patch_me := c.li a1, 1
    fence w, w     # order the patch before the flag write
    la t0, flag
    li t1, 1
    sw t1, (t0)    # flag := 1

Consumer hart:

    la t2, flag
    lw a0, (t2)
    fence.i
patch_me:
    c.li a1, 0
Note that this example is only meant to illustrate the aforementioned ordering
property.
In a realistic producer-consumer code-generation scheme, the consumer would loop
until flag becomes 1 before executing the FENCE.I instruction.
An instruction fetch is always ordered before any explicit memory accesses that instruction gives rise to.
The unused fields in the FENCE.I instruction, funct12, rs1, and rd, are reserved for finer-grain fences in future extensions. For forward compatibility, base implementations shall ignore these fields, and standard software shall zero these fields.
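As a non-normative illustration, the canonical FENCE.I encoding, 0x0000100F, has all three of these fields zeroed; the following Python sketch decodes the word and checks that property:

```python
# Non-normative sketch: decode the canonical FENCE.I encoding and
# confirm that the funct12, rs1, and rd fields are all zero.
FENCE_I = 0x0000100F  # fence.i with rd=x0, rs1=x0, funct12=0

def fields(insn):
    return {
        "opcode":  insn & 0x7F,          # bits 6:0  (MISC-MEM = 0x0F)
        "rd":      (insn >> 7)  & 0x1F,  # bits 11:7
        "funct3":  (insn >> 12) & 0x7,   # bits 14:12 (FENCE.I = 0b001)
        "rs1":     (insn >> 15) & 0x1F,  # bits 19:15
        "funct12": (insn >> 20) & 0xFFF, # bits 31:20
    }

f = fields(FENCE_I)
assert f["opcode"] == 0x0F and f["funct3"] == 0b001
assert f["rd"] == 0 and f["rs1"] == 0 and f["funct12"] == 0
```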
Because FENCE.I only orders stores with a hart’s own instruction fetches, application code should only rely upon FENCE.I if the application thread will not be migrated to a different hart. The EEI can provide mechanisms for efficient multiprocessor instruction-stream synchronization.
6.2 "Zicsr" Extension for Control and Status Register (CSR) Instructions, Version 2.0
RISC-V defines a separate address space of 4096 Control and Status registers associated with each hart. This chapter defines the full set of CSR instructions that operate on these CSRs.
While CSRs are primarily used by the privileged architecture, there are several uses in unprivileged code including for counters and timers, and for floating-point status.
The counters and timers are no longer considered mandatory parts of the standard base ISAs, and so the CSR instructions required to access them have been moved out of rv32 into this separate chapter.
6.2.1 CSR Instructions
All CSR instructions atomically read-modify-write a single CSR, whose CSR specifier is encoded in the 12-bit csr field of the instruction held in bits 31-20. The immediate forms use a 5-bit zero-extended immediate encoded in the rs1 field.
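As a non-normative illustration of this layout, the word 0xC00022F3 encodes csrrs x5, 0xC00, x0 (the canonical expansion of rdcycle t0; cycle is CSR 0xC00). A small Python sketch extracts its fields:

```python
# Non-normative sketch: pull the CSR specifier and register fields out of
# a 32-bit CSR instruction word.  0xC00022F3 encodes "csrrs x5, 0xC00, x0",
# i.e. the canonical expansion of "rdcycle t0".
def decode_csr_insn(insn):
    return {
        "opcode": insn & 0x7F,           # SYSTEM major opcode = 0x73
        "rd":     (insn >> 7)  & 0x1F,
        "funct3": (insn >> 12) & 0x7,    # 0b010 = CSRRS
        "rs1":    (insn >> 15) & 0x1F,   # also holds uimm[4:0] for the *I forms
        "csr":    (insn >> 20) & 0xFFF,  # 12-bit CSR specifier, bits 31:20
    }

d = decode_csr_insn(0xC00022F3)
assert d == {"opcode": 0x73, "rd": 5, "funct3": 0b010, "rs1": 0, "csr": 0xC00}
```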
The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in
the CSRs and integer registers. CSRRW reads the old value of the CSR,
zero-extends the value to XLEN bits, then writes it to integer register
rd. The initial value in rs1 is written to the CSR. If rd=x0,
then the instruction shall not read the CSR and shall not cause any of
the side effects that might occur on a CSR read.
The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value of the CSR, zero-extends the value to XLEN bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be set in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be set in the CSR, if that CSR bit is writable.
The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the value of the CSR, zero-extends the value to XLEN bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be cleared in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be cleared in the CSR, if that CSR bit is writable.
Since CSRRS and CSRRC perform a read-modify-write operation, any bits that read as a different value to their underlying value may be modified by these instructions even if the corresponding bit is not set in rs1. For example, pmpaddrN[G-1] may have an underlying value of 1 but read as 0. Executing CSRRC or CSRRS to modify a different bit will cause 0 to be read from pmpaddrN[G-1] and then written back, updating the underlying value to 0.
For both CSRRS and CSRRC, if rs1=x0, then the instruction will not
write to the CSR at all, and so shall not cause any of the side effects
that might otherwise occur on a CSR write, nor raise illegal-instruction
exceptions on accesses to read-only CSRs. Both CSRRS and CSRRC always
read the addressed CSR and cause any read side effects regardless of
rs1 and rd fields.
Note that if rs1 specifies a register other than x0, and that register
holds a zero value, the instruction will not action any attendant per-field
side effects, but will action any side effects caused by writing to the entire
CSR.
A CSRRW with rs1=x0 will attempt to write zero to the destination CSR.
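The read/write and side-effect rules for the register-operand forms can be sketched with a non-normative Python model. The Hart class and its side-effect counters are illustrative assumptions, not part of the specification; note that the rs1=x0 special case depends on the register specifier, not on the register's value:

```python
# Non-normative sketch of the register-operand CSR instructions.
# The CSR is modeled as a plain integer cell; read/write side effects are
# modeled as counters so the rd=x0 / rs1=x0 special cases are visible.
class Hart:
    def __init__(self, csr=0):
        self.csr = csr
        self.reads = 0    # CSR read side effects actioned
        self.writes = 0   # CSR write side effects actioned

    def csrrw(self, rs1_val, rd_is_x0=False):
        old = None
        if not rd_is_x0:           # rd=x0: no read, no read side effects
            old = self.csr
            self.reads += 1
        self.csr = rs1_val         # CSRRW always writes (rs1=x0 writes zero)
        self.writes += 1
        return old

    def csrrs(self, rs1_val, rs1_is_x0=False):
        old = self.csr
        self.reads += 1            # always reads the CSR
        if not rs1_is_x0:          # a non-x0 register holding 0 still writes
            self.csr = old | rs1_val
            self.writes += 1
        return old

    def csrrc(self, rs1_val, rs1_is_x0=False):
        old = self.csr
        self.reads += 1
        if not rs1_is_x0:
            self.csr = old & ~rs1_val
            self.writes += 1
        return old

h = Hart(csr=0b1100)
assert h.csrrs(0b0011) == 0b1100 and h.csr == 0b1111   # set bits
assert h.csrrc(0b0101) == 0b1111 and h.csr == 0b1010   # clear bits
assert h.csrrs(0, rs1_is_x0=True) == 0b1010            # pure read (CSRR)
assert h.writes == 2                                   # the CSRR didn't write
```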
The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and
CSRRC respectively, except they update the CSR using an XLEN-bit value
obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field
encoded in the rs1 field instead of a value from an integer register.
For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then these
instructions will not write to the CSR, and shall not cause any of the
side effects that might otherwise occur on a CSR write, nor raise
illegal-instruction exceptions on accesses to read-only CSRs. For
CSRRWI, if rd=x0, then the instruction shall not read the CSR and
shall not cause any of the side effects that might occur on a CSR read.
Both CSRRSI and CSRRCI will always read the CSR and cause any read side
effects regardless of rd and rs1 fields.
| Register operand | | | | |
|---|---|---|---|---|
| Instruction | rd is x0 | rs1 is x0 | Reads CSR | Writes CSR |
| CSRRW | Yes | – | No | Yes |
| CSRRW | No | – | Yes | Yes |
| CSRRS/CSRRC | – | Yes | Yes | No |
| CSRRS/CSRRC | – | No | Yes | Yes |

| Immediate operand | | | | |
|---|---|---|---|---|
| Instruction | rd is x0 | uimm=0 | Reads CSR | Writes CSR |
| CSRRWI | Yes | – | No | Yes |
| CSRRWI | No | – | Yes | Yes |
| CSRRSI/CSRRCI | – | Yes | Yes | No |
| CSRRSI/CSRRCI | – | No | Yes | Yes |
The tables above (csrsideeffects) summarize the behavior of the CSR instructions with respect to whether they read and/or write the CSR.
In addition to side effects that occur as a consequence of reading or writing a CSR, individual fields within a CSR might have side effects when written. The CSRRW[I] instructions action side effects for all such fields within the written CSR. The CSRRS[I] and CSRRC[I] instructions only action side effects for fields for which the rs1 or uimm argument has at least one bit set corresponding to that field.
As of this writing, no standard CSRs have side effects on field writes. Hence, whether a standard CSR access has any side effects can be determined solely from the opcode.
Defining CSRs with side effects on field writes is not recommended.
For any event or consequence that occurs due to a CSR having a particular value, if a write to the CSR gives it that value, the resulting event or consequence is said to be an indirect effect of the write. Indirect effects of a CSR write are not considered by the RISC-V ISA to be side effects of that write.
An example of side effects for CSR accesses would be if reading from a specific CSR causes a light bulb to turn on, while writing an odd value to the same CSR causes the light to turn off. Assume writing an even value has no effect. In this case, both the read and write have side effects controlling whether the bulb is lit, as this condition is not determined solely from the CSR value. (Note that after writing an odd value to the CSR to turn off the light, then reading to turn the light on, writing again the same odd value causes the light to turn off again. Hence, on the last write, it is not a change in the CSR value that turns off the light.)
On the other hand, if a bulb is rigged to light whenever the value of a particular CSR is odd, then turning the light on and off is not considered a side effect of writing to the CSR but merely an indirect effect of such writes.
More concretely, the RISC-V privileged architecture defined in Volume II specifies that certain combinations of CSR values cause a trap to occur. When an explicit write to a CSR creates the conditions that trigger the trap, the trap is not considered a side effect of the write but merely an indirect effect.
Standard CSRs do not have any side effects on reads. Standard CSRs may have side effects on writes. Custom extensions might add CSRs for which accesses have side effects on either reads or writes.
Some CSRs, such as the instructions-retired counter, instret, may be
modified as side effects of instruction execution. In these cases, if a
CSR access instruction reads a CSR, it reads the value prior to the
execution of the instruction. If a CSR access instruction writes such a
CSR, the explicit write is done instead of the update from the side effect.
In particular, a value
written to instret by one instruction will be the value read by the
following instruction.
The assembler pseudoinstruction to read a CSR, CSRR rd, csr, is encoded as CSRRS rd, csr, x0. The assembler pseudoinstruction to write a CSR, CSRW csr, rs1, is encoded as CSRRW x0, csr, rs1, while CSRWI csr, uimm, is encoded as CSRRWI x0, csr, uimm.
Further assembler pseudoinstructions are defined to set and clear bits in the CSR when the old value is not required: CSRS/CSRC csr, rs1; CSRSI/CSRCI csr, uimm.
6.2.1.1 CSR Access Ordering
Each RISC-V hart normally observes its own CSR accesses, including its implicit CSR accesses, as performed in program order. In particular, unless specified otherwise, a CSR access is performed after the execution of any prior instructions in program order whose behavior modifies or is modified by the CSR state and before the execution of any subsequent instructions in program order whose behavior modifies or is modified by the CSR state. Furthermore, an explicit CSR read returns the CSR state before the execution of the instruction, while an explicit CSR write suppresses and overrides any implicit writes or modifications to the same CSR by the same instruction.
Likewise, any side effects from an explicit CSR access are normally observed to occur synchronously in program order. Unless specified otherwise, the full consequences of any such side effects are observable by the very next instruction, and no consequences may be observed out-of-order by preceding instructions. (Note the distinction made earlier between side effects and indirect effects of CSR writes.)
For the RVWMO memory consistency model (memorymodel), CSR accesses are weakly ordered by default, so other harts or devices may observe CSR accesses in an order different from program order. In addition, CSR accesses are not ordered with respect to explicit memory accesses, unless a CSR access modifies the execution behavior of the instruction that performs the explicit memory access or unless a CSR access and an explicit memory access are ordered by either the syntactic dependencies defined by the memory model or the ordering requirements defined in sec:memory-ordering-pmas. To enforce ordering in all other cases, software should execute a FENCE instruction between the relevant accesses. For the purposes of the FENCE instruction, CSR read accesses are classified as device input (I), and CSR write accesses are classified as device output (O).
Informally, the CSR space acts as a weakly ordered memory-mapped I/O region, as defined in sec:memory-ordering-pmas. As a result, the order of CSR accesses with respect to all other accesses is constrained by the same mechanisms that constrain the order of memory-mapped I/O accesses to such a region.
These CSR-ordering constraints are imposed to support ordering main
memory and memory-mapped I/O accesses with respect to CSR accesses that
are visible to, or affected by, devices or other harts. Examples include
the time, cycle, and mcycle CSRs, in addition to CSRs that reflect
pending interrupts, like mip and sip. Note that implicit reads of
such CSRs (e.g., taking an interrupt because of a change in mip) are
also ordered as device input.
Most CSRs (including, e.g., the fcsr) are not visible to other harts;
their accesses can be freely reordered in the global memory order with
respect to FENCE instructions without violating this specification.
The hardware platform may define that accesses to certain CSRs are strongly ordered, as defined in sec:memory-ordering-pmas. Accesses to strongly ordered CSRs have stronger ordering constraints with respect to accesses to both weakly ordered CSRs and accesses to memory-mapped I/O regions.
The rules for the reordering of CSR accesses in the global memory order should probably be moved to memorymodel concerning the RVWMO memory consistency model.
6.3 "Zicntr" Extension for Base Counters and Timers
RISC-V ISAs provide a set of up to thirty-two 64-bit performance
counters and timers that are accessible via unprivileged XLEN-bit
read-only CSR registers 0xC00–0xC1F (when XLEN=32, the upper 32 bits
are accessed via CSR registers 0xC80–0xC9F). These counters are
divided between the Zicntr and Zihpm extensions.
The Zicntr standard extension comprises the first three of these counters (CYCLE, TIME, and INSTRET), which have dedicated functions (cycle count, real-time clock, and instructions retired, respectively). The Zicntr extension depends on the Zicsr extension.
We recommend provision of these basic counters in implementations as they are essential for basic performance analysis, adaptive and dynamic optimization, and to allow an application to work with real-time streams. Additional counters in the separate Zihpm extension can help diagnose performance problems and these should be made accessible from user-level application code with low overhead.
Some execution environments might prohibit access to counters, for example, to impede timing side-channel attacks.
For base ISAs with XLEN≥64, CSR instructions can access
the full 64-bit CSRs directly. In particular, the RDCYCLE, RDTIME, and
RDINSTRET pseudoinstructions read the full 64 bits of the cycle,
time, and instret counters.
The counter pseudoinstructions are mapped to the read-only
csrrs rd, counter, x0 canonical form, but the other read-only CSR
instruction forms (based on CSRRC/CSRRSI/CSRRCI) are also legal ways to
read these CSRs.
For base ISAs with XLEN=32, the Zicntr extension enables the three 64-bit read-only counters to be accessed in 32-bit pieces. The RDCYCLE, RDTIME, and RDINSTRET pseudoinstructions provide the lower 32 bits, and the RDCYCLEH, RDTIMEH, and RDINSTRETH pseudoinstructions provide the upper 32 bits of the respective counters.
We required the counters be 64 bits wide, even when XLEN=32, as otherwise it is very difficult for software to determine if values have overflowed. The sample code given below shows how the full 64-bit width value can be safely read using the individual 32-bit width pseudoinstructions.
The RDCYCLE pseudoinstruction reads the low XLEN bits of the cycle
CSR which holds a count of the number of clock cycles executed by the
processor core on which the hart is running from an arbitrary start time
in the past. RDCYCLEH is only present when XLEN=32 and reads bits 63-32
of the same cycle counter. The underlying 64-bit counter should never
overflow in practice. The rate at which the cycle counter advances will
depend on the implementation and operating environment. The execution
environment should provide a means to determine the current rate
(cycles/second) at which the cycle counter is incrementing.
RDCYCLE is intended to return the number of cycles executed by the processor core, not the hart. Precisely defining what is a "core" is difficult given some implementation choices (e.g., AMD Bulldozer). Precisely defining what is a "clock cycle" is also difficult given the range of implementations (including software emulations), but the intent is that RDCYCLE is used for performance monitoring along with the other performance counters. In particular, where there is one hart/core, one would expect cycle-count/instructions-retired to measure CPI for a hart.
Cores don’t have to be exposed to software at all, and an implementer might choose to pretend multiple harts on one physical core are running on separate cores with one hart/core, and provide separate cycle counters for each hart. This might make sense in a simple barrel processor (e.g., CDC 6600 peripheral processors) where inter-hart timing interactions are non-existent or minimal.
Where there is more than one hart/core and dynamic multithreading, it is not generally possible to separate out cycles per hart (especially with SMT). It might be possible to define a separate performance counter that tried to capture the number of cycles a particular hart was running, but this definition would have to be very fuzzy to cover all the possible threading implementations. For example, should we only count cycles for which any instruction was issued to execution for this hart, and/or cycles any instruction retired, or include cycles this hart was occupying machine resources but couldn’t execute due to stalls while other harts went into execution? Likely, "all of the above" would be needed to have understandable performance stats. This complexity of defining a per-hart cycle count, and also the need in any case for a total per-core cycle count when tuning multithreaded code led to just standardizing the per-core cycle counter, which also happens to work well for the common single hart/core case.
Standardizing what happens during "sleep" is not practical given that what "sleep" means is not standardized across execution environments, but if the entire core is paused (entirely clock-gated or powered-down in deep sleep), then it is not executing clock cycles, and the cycle count shouldn’t be increasing per the spec. There are many details, e.g., whether clock cycles required to reset a processor after waking up from a power-down event should be counted, and these are considered execution-environment-specific details.
Even though there is no precise definition that works for all platforms, this is still a useful facility for most platforms, and an imprecise, common, "usually correct" standard here is better than no standard. The intent of RDCYCLE was primarily performance monitoring/tuning, and the specification was written with that goal in mind.
The RDTIME pseudoinstruction reads the low XLEN bits of the "time" CSR, which counts wall-clock real time that has passed from an arbitrary start time in the past. RDTIMEH is only present when XLEN=32 and reads bits 63-32 of the same real-time counter. The underlying 64-bit counter increments by one with each tick of the real-time clock, and, for realistic real-time clock frequencies, should never overflow in practice. The execution environment should provide a means of determining the period of a counter tick (seconds/tick). The period should be constant within a small error bound. The environment should provide a means to determine the accuracy of the clock (i.e., the maximum relative error between the nominal and actual real-time clock periods).
On some simple platforms, cycle count might represent a valid implementation of RDTIME, in which case RDTIME and RDCYCLE may return the same result.
It is difficult to provide a strict mandate on clock period given the wide variety of possible implementation platforms. The maximum error bound should be set based on the requirements of the platform.
The real-time clocks of all harts must be synchronized to within one tick of the real-time clock.
As with other architectural mandates, it suffices to appear "as if" harts are synchronized to within one tick of the real-time clock, i.e., software is unable to observe that there is a greater delta between the real-time clock values observed on two harts.
If, for example, the real-time clock increments at a frequency of 1 GHz, then all harts must appear to be synchronized to within 1 nsec. But it is also acceptable for this example implementation to only update the real-time clock at, say, a frequency of 100 MHz with increments of 10 ticks. As long as software cannot observe this seeming violation of the above synchronization requirement, and software always observes time across harts to be monotonically nondecreasing, then this implementation is compliant.
A platform spec may then, for example, specify an apparent real-time clock tick frequency (e.g. 1 GHz) and also a minimum update frequency (e.g. 100 MHz) at which updated time values are guaranteed to be observable by software. Software may read time more frequently, but it should only observe monotonically nondecreasing values and it should observe a new value at least once every 10 ns (corresponding to the 100 MHz update frequency in this example).
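For the assumed numbers in this example (1 GHz apparent tick frequency, 100 MHz update frequency), the observable behavior can be sketched in non-normative Python:

```python
# Non-normative sketch: a "time" value with a 1 GHz apparent tick frequency
# that is only updated every 10 ns, advancing 10 ticks at a time.  Software
# must still observe monotonically nondecreasing values, and a new value
# at least once every update period.
UPDATE_PERIOD_NS = 10   # 100 MHz update frequency
TICKS_PER_UPDATE = 10   # 1 GHz apparent tick frequency

def read_time(now_ns):
    # Value visible to software only changes on update boundaries.
    return (now_ns // UPDATE_PERIOD_NS) * TICKS_PER_UPDATE

samples = [read_time(t) for t in range(0, 50)]
# Monotonically nondecreasing ...
assert all(a <= b for a, b in zip(samples, samples[1:]))
# ... and a strictly newer value is observable within one update period.
assert all(read_time(t + UPDATE_PERIOD_NS) > read_time(t) for t in range(0, 40))
```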
The RDINSTRET pseudoinstruction reads the low XLEN bits of the
instret CSR, which counts the number of instructions retired by this
hart from some arbitrary start point in the past. RDINSTRETH is only
present when XLEN=32 and reads bits 63-32 of the same instruction
counter. The underlying 64-bit counter should never overflow in
practice.
Instructions that cause synchronous exceptions, including ECALL and
EBREAK, are not considered to retire and hence do not increment the
instret CSR.
The following code sequence will read a valid 64-bit cycle counter value
into x3:x2, even if the counter overflows its lower half between
reading its upper and lower halves.
again:
    rdcycleh x3
    rdcycle  x2
    rdcycleh x4
    bne x3, x4, again
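The same retry loop can be modeled in non-normative Python; the Counter class below is an illustrative stand-in for a free-running 64-bit counter read through two 32-bit halves:

```python
# Non-normative sketch of the read-high/read-low/read-high retry loop.
class Counter:
    """Illustrative stand-in: a free-running 64-bit counter that may
    advance (and carry into its upper half) between the 32-bit reads."""
    def __init__(self, value, step):
        self.value, self.step = value, step

    def read_low(self):            # models rdcycle
        self.value += self.step    # the counter keeps running between reads
        return self.value & 0xFFFFFFFF

    def read_high(self):           # models rdcycleh
        self.value += self.step
        return (self.value >> 32) & 0xFFFFFFFF

def read64(c):
    while True:
        hi = c.read_high()         # rdcycleh x3
        lo = c.read_low()          # rdcycle  x2
        if c.read_high() == hi:    # rdcycleh x4; bne x3, x4, again
            return (hi << 32) | lo

# Start just below a 32-bit carry so the first attempt must be retried.
c = Counter(value=0xFFFFFFF8, step=3)
assert read64(c) == 0x1_00000007
```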
6.4 "Zihpm" Extension for Hardware Performance Counters
The Zihpm extension comprises up to 29 additional unprivileged 64-bit hardware performance counters, hpmcounter3–hpmcounter31. When XLEN=32, the upper 32 bits of these performance counters are accessible via additional CSRs hpmcounter3h–hpmcounter31h. The Zihpm extension depends on the Zicsr extension.
In some applications, it is important to be able to read multiple counters at the same instant in time. When run under a multitasking environment, a user thread can suffer a context switch while attempting to read the counters. One solution is for the user thread to read the real-time counter before and after reading the other counters to determine if a context switch occurred in the middle of the sequence, in which case the reads can be retried. We considered adding output latches to allow a user thread to snapshot the counter values atomically, but this would increase the size of the user context, especially for implementations with a richer set of counters.
The implemented number and width of these additional counters, and the set of events they count, are platform-specific. Accessing an unimplemented counter may cause an illegal-instruction exception or may return a constant value. If the configuration used to select the events counted by a counter is misconfigured, the counter may return a constant value.
The execution environment should provide a means to determine the number and width of the implemented counters, and an interface to configure the events to be counted by each counter.
For execution environments implemented on RISC-V privileged platforms, the privileged architecture manual describes privileged CSRs controlling access by lower privileged modes to these counters, and to set the events to be counted.
Alternative execution environments (e.g., user-level-only software performance models) may provide alternative mechanisms to configure the events counted by the performance counters.
It would be useful to eventually standardize event settings to count ISA-level metrics, such as the number of floating-point instructions executed for example, and possibly a few common microarchitectural metrics, such as "L1 instruction cache misses".
6.5 "M" Extension for Integer Multiplication and Division, Version 2.0
This chapter describes the standard integer multiplication and division
instruction extension, which is named M and contains instructions
that multiply or divide values held in two integer registers.
We separate integer multiply and divide out from the base to simplify low-end implementations, or for applications where integer multiply and divide operations are either infrequent or better handled in attached accelerators.
6.5.1 Multiplication Operations
MUL performs an XLEN-bit×XLEN-bit multiplication of rs1 by rs2 and places the lower XLEN bits in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2×XLEN-bit product, for signed×signed, unsigned×unsigned, and signed rs1×unsigned rs2 multiplication, respectively.
If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.
MULHSU is used in multi-word signed multiplication to multiply the most-significant word of the multiplicand (which contains the sign bit) with the less-significant words of the multiplier (which are unsigned).
MULW is an RV64 instruction that multiplies the lower 32 bits of the source registers, placing the sign extension of the lower 32 bits of the result into the destination register.
In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit product, but signed arguments must be proper 32-bit signed values, whereas unsigned arguments must have their upper 32 bits clear. If the arguments are not known to be sign- or zero-extended, an alternative is to shift both arguments left by 32 bits, then use MULH[[S]U].
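The four multiplication variants can be modeled with a short non-normative Python sketch (XLEN=32 assumed; the helper names are illustrative):

```python
# Non-normative sketch of MUL/MULH/MULHU/MULHSU for XLEN=32.
XLEN = 32
MASK = (1 << XLEN) - 1

def to_signed(x):
    # Reinterpret an XLEN-bit pattern as a two's-complement integer.
    return x - (1 << XLEN) if x & (1 << (XLEN - 1)) else x

def mul(a, b):     return (to_signed(a) * to_signed(b)) & MASK   # low half
def mulh(a, b):    return ((to_signed(a) * to_signed(b)) >> XLEN) & MASK
def mulhu(a, b):   return ((a * b) >> XLEN) & MASK
def mulhsu(a, b):  return ((to_signed(a) * b) >> XLEN) & MASK    # signed rs1

a, b = 0xFFFFFFFF, 0x00000002      # a is -1 signed, 2**32-1 unsigned
assert mul(a, b)    == 0xFFFFFFFE  # the low half is the same either way
assert mulh(a, b)   == 0xFFFFFFFF  # (-1 * 2) >> 32 = -1
assert mulhu(a, b)  == 0x00000001  # ((2**32-1) * 2) >> 32 = 1
assert mulhsu(a, b) == 0xFFFFFFFF  # (-1 * 2) >> 32 = -1
```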
6.5.2 Division Operations
DIV and DIVU perform an XLEN bits by XLEN bits signed and unsigned
integer division of rs1 by rs2, rounding towards zero. REM and REMU
provide the remainder of the corresponding division operation. For REM,
the sign of a nonzero result equals the sign of the dividend.
For both signed and unsigned division, except in the case of overflow, it holds that dividend = divisor × quotient + remainder.
If both the quotient and remainder are required from the same division,
the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr,
rs1, rs2 (rdq cannot be the same as rs1 or rs2).
Microarchitectures can then fuse these into a single divide operation
instead of performing two separate divides.
DIVW and DIVUW are RV64 instructions that divide the lower 32 bits of
rs1 by the lower 32 bits of rs2, treating them as signed and
unsigned integers, placing the 32-bit quotient in rd,
sign-extended to 64 bits. REMW and REMUW are RV64 instructions that
provide the corresponding signed and unsigned remainder
operations. Both REMW and REMUW always sign-extend the 32-bit result
to 64 bits, including on a divide by zero.
The semantics for division by zero and division overflow are summarized in divby0. The quotient of division by zero has all bits set, and the remainder of division by zero equals the dividend. Signed division overflow occurs only when the most-negative integer is divided by −1. The quotient of a signed division with overflow is equal to the dividend, and the remainder is zero. Unsigned division overflow cannot occur.
| Condition | Dividend | Divisor | DIVU[W] | REMU[W] | DIV[W] | REM[W] |
|---|---|---|---|---|---|---|
| Division by zero | x | 0 | 2^L−1 | x | −1 | x |
| Overflow (signed only) | −2^(L−1) | −1 | – | – | −2^(L−1) | 0 |

Here, L is the width of the operation in bits: XLEN for DIV[U] and REM[U], or 32 for DIV[U]W and REM[U]W.
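These special cases, together with the identity dividend = divisor × quotient + remainder, can be captured in a non-normative Python sketch (XLEN=32 assumed):

```python
# Non-normative sketch of DIV/DIVU/REM/REMU semantics for XLEN=32,
# including the divide-by-zero and signed-overflow special cases.
XLEN = 32
MASK = (1 << XLEN) - 1
MIN_SIGNED = -(1 << (XLEN - 1))

def to_signed(x):
    # Reinterpret an XLEN-bit pattern as a two's-complement integer.
    return x - (1 << XLEN) if x & (1 << (XLEN - 1)) else x

def divu(a, b):
    return MASK if b == 0 else a // b    # quotient of x/0 has all bits set

def remu(a, b):
    return a if b == 0 else a % b        # remainder of x/0 is the dividend

def _signed_quotient(a, b):
    q = abs(a) // abs(b)                 # magnitude, rounding toward zero
    return -q if (a < 0) != (b < 0) else q

def div(a, b):
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return MASK                      # -1: all bits set
    if a == MIN_SIGNED and b == -1:
        return a & MASK                  # overflow: quotient = dividend
    return _signed_quotient(a, b) & MASK

def rem(a, b):
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return a & MASK                  # remainder = dividend
    if a == MIN_SIGNED and b == -1:
        return 0                         # overflow: remainder = 0
    return (a - b * _signed_quotient(a, b)) & MASK  # sign of the dividend

# dividend = divisor * quotient + remainder:  -7 = 2 * (-3) + (-1)
assert div(0xFFFFFFF9, 2) == 0xFFFFFFFD and rem(0xFFFFFFF9, 2) == 0xFFFFFFFF
assert div(0x80000000, 0xFFFFFFFF) == 0x80000000   # signed overflow
assert divu(5, 0) == 0xFFFFFFFF and remu(5, 0) == 5
```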
We considered raising exceptions on integer divide by zero, with these exceptions causing a trap in most execution environments. However, this would be the only arithmetic trap in the standard ISA (floating-point exceptions set flags and write default values, but do not cause traps) and would require language implementers to interact with the execution environment’s trap handlers for this case. Further, where language standards mandate that a divide-by-zero exception must cause an immediate control flow change, only a single branch instruction needs to be added to each divide operation, and this branch instruction can be inserted after the divide and should normally be very predictably not taken, adding little runtime overhead.
The value of all bits set is returned for both unsigned and signed divide by zero to simplify the divider circuitry. The value of all 1s is both the natural value to return for unsigned divide, representing the largest unsigned number, and also the natural result for simple unsigned divider implementations. Signed division is often implemented using an unsigned division circuit and specifying the same overflow result simplifies the hardware.
6.6 Zmmul Extension, Version 1.0
The Zmmul extension implements the multiplication subset of the M
extension. It adds all of the instructions defined in
m-mul, namely: MUL, MULH, MULHU,
MULHSU, and (for RV64 only) MULW. The encodings are identical to those
of the corresponding M-extension instructions. M implies Zmmul.
The Zmmul extension enables low-cost implementations that require
multiplication operations but not division. For many microcontroller
applications, division operations are too infrequent to justify the cost
of divider hardware. By contrast, multiplication operations are more
frequent, making the cost of multiplier hardware more justifiable.
Simple FPGA soft cores particularly benefit from eliminating division
but retaining multiplication, since many FPGAs provide hardwired
multipliers but require dividers to be implemented in soft logic.
6.7 "Zicond" Extension for Integer Conditional Operations, Version 1.0.0
The Zicond extension defines two R-type instructions that support branchless conditional operations.
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| ✓ | ✓ | czero.eqz rd, rs1, rs2 | insns-czero-eqz |
| ✓ | ✓ | czero.nez rd, rs1, rs2 | insns-czero-nez |
6.7.1 Instructions (in alphabetical order)
6.7.1.1 czero.eqz
Synopsis Moves zero to a register rd, if the condition rs2 is equal to zero, otherwise moves rs1 to rd.
Mnemonic czero.eqz rd, rs1, rs2
Encoding
Description
If rs2 contains the value zero, this instruction writes the value zero to rd. Otherwise, this instruction copies the contents of rs1 to rd.
This instruction carries a syntactic dependency from both rs1 and rs2 to rd.
Furthermore, if the Zkt extension is implemented, this instruction’s timing is independent of the data values in rs1 and rs2.
SAIL code
let condition = X(rs2);
result : xlenbits = if (condition == zeros()) then zeros()
else X(rs1);
X(rd) = result;
6.7.1.2 czero.nez
Synopsis Moves zero to a register rd, if the condition rs2 is nonzero, otherwise moves rs1 to rd.
Mnemonic czero.nez rd, rs1, rs2
Encoding
Description
If rs2 contains a nonzero value, this instruction writes the value zero to rd. Otherwise, this instruction copies the contents of rs1 to rd.
This instruction carries a syntactic dependency from both rs1 and rs2 to rd.
Furthermore, if the Zkt extension is implemented, this instruction’s timing is independent of the data values in rs1 and rs2.
SAIL code
let condition = X(rs2);
result : xlenbits = if (condition != zeros()) then zeros()
else X(rs1);
X(rd) = result;
6.7.2 Usage examples
The instructions from this extension can be used to construct sequences that perform conditional-arithmetic, conditional-bitwise-logical, and conditional-select operations.
6.7.2.1 Instruction sequences
| Operation | Instruction sequence | Length |
|---|---|---|
| rd = (rc == 0) ? (rs1 + rs2) : rs1 | czero.nez rd, rs2, rc; add rd, rs1, rd | 2 insns |
| rd = (rc != 0) ? (rs1 + rs2) : rs1 | czero.eqz rd, rs2, rc; add rd, rs1, rd | 2 insns |
| rd = (rc == 0) ? (rs1 - rs2) : rs1 | czero.nez rd, rs2, rc; sub rd, rs1, rd | 2 insns |
| rd = (rc != 0) ? (rs1 - rs2) : rs1 | czero.eqz rd, rs2, rc; sub rd, rs1, rd | 2 insns |
| rd = (rc == 0) ? (rs1 \| rs2) : rs1 | czero.nez rd, rs2, rc; or rd, rs1, rd | 2 insns |
| rd = (rc != 0) ? (rs1 \| rs2) : rs1 | czero.eqz rd, rs2, rc; or rd, rs1, rd | 2 insns |
| rd = (rc == 0) ? (rs1 ^ rs2) : rs1 | czero.nez rd, rs2, rc; xor rd, rs1, rd | 2 insns |
| rd = (rc != 0) ? (rs1 ^ rs2) : rs1 | czero.eqz rd, rs2, rc; xor rd, rs1, rd | 2 insns |
| rd = (rc == 0) ? (rs1 & rs2) : rs1 | and rd, rs1, rs2; czero.eqz rtmp, rs1, rc; or rd, rd, rtmp | 3 insns (requires 1 temporary) |
| rd = (rc != 0) ? (rs1 & rs2) : rs1 | and rd, rs1, rs2; czero.nez rtmp, rs1, rc; or rd, rd, rtmp | 3 insns (requires 1 temporary) |
| rd = (rc == 0) ? rs1 : rs2 | czero.nez rd, rs1, rc; czero.eqz rtmp, rs2, rc; add rd, rd, rtmp | 3 insns (requires 1 temporary) |
| rd = (rc != 0) ? rs1 : rs2 | czero.eqz rd, rs1, rc; czero.nez rtmp, rs2, rc; add rd, rd, rtmp | 3 insns (requires 1 temporary) |
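The sequences above work by zeroing one operand of the final arithmetic or logical operation. A C model of the two instructions and of the conditional-select row (an illustrative sketch; the function names are invented):

```c
#include <stdint.h>

/* czero.eqz: rd = (rs2 == 0) ? 0 : rs1 */
uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) { return rs2 == 0 ? 0 : rs1; }

/* czero.nez: rd = (rs2 != 0) ? 0 : rs1 */
uint64_t czero_nez(uint64_t rs1, uint64_t rs2) { return rs2 != 0 ? 0 : rs1; }

/* rd = (rc != 0) ? rs1 : rs2, as in the last table row:
   czero.eqz rd, rs1, rc; czero.nez rtmp, rs2, rc; add rd, rd, rtmp.
   Exactly one of rd/rtmp is zero, so the add acts as a select. */
uint64_t cond_select(uint64_t rc, uint64_t rs1, uint64_t rs2) {
    uint64_t rd   = czero_eqz(rs1, rc);
    uint64_t rtmp = czero_nez(rs2, rc);
    return rd + rtmp;
}
```

Because one of the two intermediate values is always zero, `add`, `or`, or `xor` would all work as the combining instruction in the select sequence.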
6.8 "Zilsd", "Zclsd" Extensions for Load/Store pair for RV32, Version 1.0
The Zilsd & Zclsd extensions provide load/store pair instructions for RV32, reusing the existing RV64 doubleword load/store instruction encodings.
Operands containing src for store instructions and dest for load instructions are held in aligned x-register pairs, i.e., register numbers must be even. Use of misaligned (odd-numbered) registers for these operands is reserved.
Regardless of endianness, the lower-numbered register holds the
low-order bits, and the higher-numbered register holds the high-order bits:
e.g., bits 31:0 of an operand in Zilsd might be held in register x14, with bits 63:32 of that operand held in x15.
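The pairing rule above can be illustrated with a small C sketch that splits a 64-bit operand into the two 32-bit halves occupying the even/odd pair (the helper name is hypothetical, not part of the extension):

```c
#include <stdint.h>

/* Splits a 64-bit operand across an even/odd register pair:
   the even (lower-numbered) register holds bits 31:0 and the odd
   (higher-numbered) register holds bits 63:32, regardless of the
   memory system's endianness. */
void split_pair(uint64_t value, uint32_t *even_reg, uint32_t *odd_reg) {
    *even_reg = (uint32_t)(value & 0xFFFFFFFFu); /* e.g., x14 */
    *odd_reg  = (uint32_t)(value >> 32);         /* e.g., x15 */
}
```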
6.8.1 Load/Store pair instructions (Zilsd)
The Zilsd extension adds the following RV32-only instructions:
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| yes | no | ld rd, offset(rs1) | insns-ld |
| yes | no | sd rs2, offset(rs1) | insns-sd |
As the access size is 64-bit, accesses are only considered naturally aligned for effective addresses that are a multiple of 8. In this case, these instructions are guaranteed to not raise an address-misaligned exception. Even if naturally aligned, the memory access might not be performed atomically.
If the effective address is a multiple of 4, then each word access is required to be performed atomically.
The following table summarizes the required behavior:
| Alignment of effective address | Word accesses guaranteed atomic? | Can raise address-misaligned trap? |
|---|---|---|
| multiple of 8 | yes | no |
| multiple of 4, not of 8 | yes | yes |
| otherwise | no | yes |
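The alignment cases can be written as a small classification function (an illustrative sketch; the type and function names are invented):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool words_atomic;         /* each 4-byte access guaranteed atomic */
    bool may_trap_misaligned;  /* address-misaligned trap possible */
} zilsd_access_t;

/* Classifies a Zilsd LD/SD effective address per the alignment table:
   8-byte aligned accesses never trap; 4-byte aligned accesses keep
   word atomicity but may trap; anything else guarantees neither. */
zilsd_access_t classify(uint64_t ea) {
    if (ea % 8 == 0) return (zilsd_access_t){ true,  false };
    if (ea % 4 == 0) return (zilsd_access_t){ true,  true  };
    return               (zilsd_access_t){ false, true  };
}
```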
To ensure resumable trap handling is possible for the load instructions, the base register must have its original value if a trap is taken. The other register in the pair can have been updated. This affects x2 for the stack pointer relative instruction and rs1 otherwise.
If an implementation performs a doubleword load access atomically and the register file implements write-back for even/odd register pairs, the mentioned atomicity requirements are inherently fulfilled. Otherwise, an implementation either needs to delay the write-back until the write can be performed atomically, or order sequential writes to the registers to ensure the requirement above is satisfied.
6.8.2 Compressed Load/Store pair instructions (Zclsd)
Zclsd depends on Zilsd and Zca. It has overlapping encodings with Zcf and is thus incompatible with Zcf.
Zclsd adds the following RV32-only instructions:
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| yes | no | c.ldsp rd, offset(sp) | insns-cldsp |
| yes | no | c.sdsp rs2, offset(sp) | insns-csdsp |
| yes | no | c.ld rd', offset(rs1') | insns-cld |
| yes | no | c.sd rs2', offset(rs1') | insns-csd |
6.8.3 Use of x0 as operand
LD instructions with destination x0 are processed as any other load,
but the result is discarded entirely and x1 is not written.
For C.LDSP, usage of x0 as the destination is reserved.
If x0 is used as the source of SD or C.SDSP, the entire 64-bit operand is zero; i.e., register x1 is not accessed.
C.LD and C.SD instructions can only use registers x8-x15.
6.8.4 Exception Handling
For the purposes of RVWMO and exception handling, LD and SD instructions are considered to be misaligned loads and stores, with one additional constraint: an LD or SD instruction whose effective address is a multiple of 4 gives rise to at most two 4-byte memory operations.
This definition permits LD and SD instructions to give rise to exactly one memory access, regardless of alignment. If instructions with a 4-byte-aligned effective address are decomposed into two 32-bit operations, there is no constraint on the order in which the operations are performed, and each operation is guaranteed to be atomic. These decomposed sequences are interruptible. Exceptions might occur on subsequent operations, making the effects of previous operations within the same instruction visible.
Software should make no assumptions about the number or order of accesses these instructions might give rise to, beyond the 4-byte constraint mentioned above. For example, an interrupted store might overwrite the same bytes upon return from the interrupt handler.
6.8.5 Instructions
6.8.5.1 ld
Synopsis Load doubleword to even/odd register pair, 32-bit encoding
Mnemonic ld rd, offset(rs1)
Encoding (RV32)
Description
Loads a 64-bit value into registers rd and rd+1.
The effective address is obtained by adding register rs1 to the
sign-extended 12-bit offset.
Included in: zilsd
6.8.5.2 sd
Synopsis Store doubleword from even/odd register pair, 32-bit encoding
Mnemonic sd rs2, offset(rs1)
Encoding (RV32)
Description
Stores a 64-bit value from registers rs2 and rs2+1.
The effective address is obtained by adding register rs1 to the
sign-extended 12-bit offset.
Included in: zilsd
6.8.5.3 c.ldsp
Synopsis Stack-pointer based load doubleword to even/odd register pair, 16-bit encoding
Mnemonic c.ldsp rd, offset(sp)
Encoding (RV32)
Description
Loads a stack-pointer-relative 64-bit value into registers rd and rd+1. It computes its effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to ld rd, offset(x2). C.LDSP is only valid when rd≠x0; the code points with rd=x0 are reserved.
Included in: zclsd
6.8.5.4 c.sdsp
Synopsis Stack-pointer based store doubleword from even/odd register pair, 16-bit encoding
Mnemonic c.sdsp rs2, offset(sp)
Encoding (RV32)
Description
Stores a stack-pointer-relative 64-bit value from registers rs2 and rs2+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to sd rs2, offset(x2).
Included in: zclsd
6.8.5.5 c.ld
Synopsis Load doubleword to even/odd register pair, 16-bit encoding
Mnemonic c.ld rd', offset(rs1')
Encoding (RV32)
Description
Loads a 64-bit value into registers rd' and rd'+1.
It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'.
Included in: zclsd
6.8.5.6 c.sd
Synopsis Store doubleword from even/odd register pair, 16-bit encoding
Mnemonic c.sd rs2', offset(rs1')
Encoding (RV32)
Description
Stores a 64-bit value from registers rs2' and rs2'+1.
It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'.
It expands to sd rs2', offset(rs1').
Included in: zclsd
6.9 Ziccif Extension for Instruction-Fetch Atomicity, Version 1.0
This extension was ratified alongside the RVA20U64 profile. This chapter supplies an operational definition for the extension and adds expository material.
If the Ziccif extension is implemented, main memory regions with both the
cacheability and coherence PMAs must support instruction fetch, and any
instruction fetches of naturally aligned power-of-2 sizes of at most
min(ILEN,XLEN) bits are atomic.
An implementation with the Ziccif extension fetches instructions in a manner equivalent to the following state machine.
1. Let M be the smallest power of 2 such that M ≥ min(ILEN, XLEN)/8. Let N be the pc modulo M. Atomically fetch M−N bytes from memory at address pc. Let T be the running total of bytes fetched, initially M−N.
2. If the T bytes fetched begin with a complete instruction of length L ≤ T, then execute that instruction, discard the remaining T−L bytes fetched, and go back to step 1, using the updated pc. Otherwise, atomically fetch M bytes from memory at address pc+T, increment T by M, and repeat step 2.
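The state machine above can be sketched as a C simulation. This is illustrative only: it assumes a flat byte-addressed memory array and a decode helper based on the standard RISC-V length encoding for 16- and 32-bit instructions; the function names are invented.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns the length in bytes of the instruction starting at p, using
   the standard encoding: low two bits == 11 means a 32-bit instruction,
   anything else a 16-bit one (longer formats are ignored here). */
static size_t insn_length(const uint8_t *p) {
    return ((p[0] & 0x3) == 0x3) ? 4 : 2;
}

/* Fetches one instruction at pc into buf, using atomic M-byte-aligned
   chunks as in the Ziccif state machine. Returns the instruction length;
   the remaining T-L bytes in buf are simply discarded. */
size_t fetch_insn(const uint8_t *mem, uint64_t pc, uint64_t M, uint8_t *buf) {
    uint64_t N = pc % M;
    size_t T = (size_t)(M - N);           /* step 1: fetch M-N bytes */
    for (size_t i = 0; i < T; i++)
        buf[i] = mem[pc + i];
    size_t L = insn_length(buf);
    while (L > T) {                       /* step 2: need more bytes */
        for (size_t i = 0; i < M; i++)
            buf[T + i] = mem[pc + T + i];
        T += (size_t)M;
        L = insn_length(buf);
    }
    return L;
}
```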
The instruction-fetch atomicity rule supports concurrent code modification. When a hart modifies instruction memory and either it or another hart executes the modified instructions without first having executed a FENCE.I, the modifying hart should adhere to the following rules to ensure predictable behavior:
- Modification stores must be single-copy atomic, hence must be naturally aligned.
- The modified instruction must not span an aligned M-byte boundary, unless it is replaced with a shorter unconditional control transfer (e.g., c.ebreak or c.j) that does not itself span an M-byte boundary.
- Modification stores must alter a complete instruction or complete instructions that do not collectively span an M-byte boundary, modulo the exception above that the first part of an instruction may be replaced with an unconditional control transfer instruction.
- Modifications must not combine smaller instructions into a larger instruction, but may convert a larger instruction to some number of smaller instructions.
- Modified instruction memory must have the coherence PMA.
Other well-defined code-modification strategies exist, but these rules provide a safe harbor.
Note that the software modifying the code need not know the value of M.
Because ILEN must be at least the width of the instruction being modified,
a lower bound on M can be inferred from the instruction’s width and XLEN.
Memory protection and executability PMAs are applied only to bytes that are not discarded by this algorithm.
For example, if M=8, N=0, and the PMP granularity is 4 bytes, then
it is valid to fetch a 4-byte instruction at pc, even if fetching from
pc + 4 would have been disallowed by PMP.
For simplicity, implementations are likely to choose a PMP granularity no
smaller than M.
6.10 Ziccrse Extension for Main Memory Reservability, Version 1.0
If the Ziccrse extension is implemented, then main memory regions with both the cacheability and coherence PMAs must support the RsrvEventual PMA.
6.11 Ziccamoa Extension for Main Memory Atomics, Version 1.0
If the Ziccamoa extension is implemented, then main memory regions with both the cacheability and coherence PMAs must support all atomics in the Zaamo extension.
6.12 Ziccamoc Extension for Main Memory Compare-and-Swap, Version 1.0
If the Ziccamoc extension is implemented, then main memory regions with both
the cacheability and coherence PMAs must provide AMOCASQ-level PMA
support.
6.13 Zicclsm Extension for Main Memory Misaligned Accesses, Version 1.0
If the Zicclsm extension is implemented, then misaligned loads and stores to main memory regions with both the cacheability and coherence PMAs must be supported.
This definition includes vector memory accesses.
It does not include any instructions in the various Za* extensions.
Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.
6.14 Zic64b Extension for 64-byte Cache Blocks, Version 1.0
If the Zic64b extension is implemented, then cache blocks must be 64 bytes in size, naturally aligned in the address space.
6.15 "Zimop" Extension for May-Be-Operations, Version 1.0
This chapter defines the "Zimop" extension, which introduces the concept of
instructions that may be operations (MOPs). MOPs are initially defined to
simply write zero to x[rd], but are designed to be redefined by later
extensions to perform some other action.
The Zimop extension defines an encoding space for 40 MOPs.
It is sometimes desirable to define instruction-set extensions whose
instructions, rather than raising illegal-instruction exceptions when the extension is
not implemented, take no useful action (beyond writing x[rd]).
For example, programs with control-flow integrity checks can
execute correctly on implementations without the corresponding extension,
provided the checks are simply ignored. Implementing these checks as MOPs
allows the same programs to run on implementations with or without the
corresponding extension.
Although similar in some respects to HINTs, MOPs cannot be encoded as HINTs, because unlike HINTs, MOPs are allowed to alter architectural state.
Because MOPs may be redefined by later extensions, standard software should not execute a MOP unless it is deliberately targeting an extension that has redefined that MOP.
The Zimop extension defines 32 MOP instructions named MOP.R.n, where
n is an integer between 0 and 31, inclusive.
Unless redefined by another extension, these instructions simply write 0 to
x[rd]. Their encoding allows future extensions to define them to read x[rs1],
as well as write x[rd].
The Zimop extension additionally defines 8 MOP instructions named
MOP.RR.n, where n is an integer between 0 and 7, inclusive.
Unless redefined by another extension, these instructions simply
write 0 to x[rd]. Their encoding allows future extensions to define them to
read x[rs1] and x[rs2], as well as write x[rd].
The recommended assembly syntax for MOP.R.n is MOP.R.n rd, rs1,
with any x-register specifier being valid for either argument. Similarly for
MOP.RR.n, the recommended syntax is MOP.RR.n rd, rs1, rs2.
The extension that redefines a MOP may define an alternate assembly mnemonic.
These MOPs are encoded in the SYSTEM major opcode in part because it is expected their behavior will be modulated by privileged CSR state.
These MOPs are defined to write zero to x[rd], rather than performing
no operation, to simplify instruction decoding and to allow testing the
presence of features by branching on the zeroness of the result.
The MOPs defined in the Zimop extension do not carry a syntactic dependency
from x[rs1] or x[rs2] to x[rd], though an extension that redefines the
MOP may impose such a requirement.
Not carrying a syntactic dependency relieves straightforward
implementations of reading x[rs1] and x[rs2].
6.15.1 "Zcmop" Compressed May-Be-Operations Extension, Version 1.0
This section defines the "Zcmop" extension, which defines eight 16-bit MOP
instructions named C.MOP.n, where n is an odd integer between 1 and
15, inclusive. C.MOP.n is encoded in the reserved encoding space
corresponding to C.LUI x_n, 0, as shown in norm:c-mop_enc.
Unlike the MOPs defined in the Zimop extension, the C.MOP.n instructions
are defined to not write any register.
Their encoding allows future extensions to define them to read register
x[n].
The Zcmop extension depends upon the Zca extension.
Very few suitable 16-bit encoding spaces exist. This space was chosen
because it already has unusual behavior with respect to the rd/rs1
field (it encodes c.addi16sp when the field contains x2) and is
therefore of lower value for most purposes.
| Mnemonic | Encoding | Redefinable to read register |
|---|---|---|
| C.MOP.1 | 0110000010000001 | x1 |
| C.MOP.3 | 0110000110000001 | x3 |
| C.MOP.5 | 0110001010000001 | x5 |
| C.MOP.7 | 0110001110000001 | x7 |
| C.MOP.9 | 0110010010000001 | x9 |
| C.MOP.11 | 0110010110000001 | x11 |
| C.MOP.13 | 0110011010000001 | x13 |
| C.MOP.15 | 0110011110000001 | x15 |
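The encodings in the table follow a regular pattern: funct3=011 in bits 15:13, bit 12 clear, the register number n in the rd/rs1 field (bits 11:7), bits 6:2 clear, and the quadrant-1 opcode 01 in bits 1:0. A C sketch reconstructing them (the helper name is hypothetical):

```c
#include <stdint.h>

/* Encodes C.MOP.n (n odd, 1..15) as the reserved C.LUI x_n, 0 code
   point: funct3=011 (bits 15:13), bit 12 = 0, n in bits 11:7,
   bits 6:2 = 0, and quadrant-1 opcode 01 (bits 1:0). */
uint16_t encode_cmop(unsigned n) {
    return (uint16_t)((0x3u << 13) | (n << 7) | 0x1u);
}
```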
The recommended assembly syntax for C.MOP.n is simply the nullary
C.MOP.n. The possibly accessed register is implicitly x_n.
The expectation is that each Zcmop instruction is equivalent to some
Zimop instruction, but the choice of expansion (if any) is left to the
extension that redefines the MOP.
Note that a Zcmop instruction that does not write a value can expand into a write
to x0.
6.16 Control-flow Integrity (CFI)
Control-flow Integrity (CFI) capabilities help defend against Return-Oriented
Programming (ROP) and Call/Jump-Oriented Programming (COP/JOP) style
control-flow subversion attacks. These attack methodologies use code sequences
in authorized modules, with at least one instruction in the sequence being a
control transfer instruction that depends on attacker-controlled data either in
the return stack or in memory used to obtain the target address for a call or
jump. Attackers stitch these sequences together by diverting the control flow
instructions (e.g., JALR, C.JR, C.JALR), from their original target
address to a new target via modification in the return stack or in the memory
used to obtain the jump/call target address.
RV32/RV64 provides two types of control transfer instructions - unconditional
jumps and conditional branches. Conditional branches encode an offset in the
immediate field of the instruction and are thus direct branches that are not
susceptible to control-flow subversion. Unconditional direct jumps using JAL
transfer control to a target that is in a +/- 1 MiB range from the current pc.
Unconditional indirect jumps using JALR obtain their branch target by
adding the sign-extended 12-bit immediate encoded in the instruction to the
rs1 register.
The RV32I/RV64I does not have a dedicated instruction for calling a procedure or
returning from a procedure. A JAL or JALR may be used to perform a procedure
call and JALR to return from a procedure. The RISC-V ABI however defines the
convention that a JAL/JALR where rd (i.e. the link register) is x1 or
x5 is a procedure call, and a JALR where rs1 is the conventional
link register (i.e. x1 or x5) is a return from procedure. The architecture
allows for using these hints and conventions to support return address
prediction (See rashints).
The RVC standard extension for compressed instructions provides unconditional
jump and conditional branch instructions. The C.J and C.JAL instructions
encode an offset in the immediate field of the instruction and thus are not
susceptible to control-flow subversion. The C.JR and C.JALR RVC instructions
perform an unconditional control transfer to the address in register rs1. The
C.JALR additionally writes the address of the instruction following the jump
(pc+2) to the link register x1 and is a procedure call. The C.JR is a
return from procedure if rs1 is a conventional link register (i.e. x1 or
x5); else it is an indirect jump.
The term call is used to refer to a JAL or JALR instruction with a link
register as destination, i.e., rd≠x0. Conventionally, the link register is
x1 or x5. A call using JAL or C.JAL is termed a direct call. A
C.JALR expands to JALR x1, 0(rs1) and is a call. A call using JALR or
C.JALR is termed an indirect-call.
The term return is used to refer to a JALR instruction with rd=x0 and
with rs1=x1 or rs1=x5. A C.JR instruction expands to
JALR x0, 0(rs1) and is a return if rs1=x1 or rs1=x5.
The term indirect-jump is used to refer to a JALR instruction with rd=x0
and where the rs1 is not x1 or x5 (i.e., not a return). A C.JR
instruction where rs1 is not x1 or x5 (i.e., not a return) is an
indirect-jump.
The Zicfiss and Zicfilp extensions build on these conventions and hints and provide backward-edge and forward-edge control flow integrity respectively.
The Unprivileged ISA for Zicfilp extension is specified in unpriv-forward and for the Unprivileged ISA for Zicfiss extension is specified in unpriv-backward. The Privileged ISA for these extensions is specified in the Privileged ISA specification.
6.16.1 Landing Pad (Zicfilp)
To enforce forward-edge control-flow integrity, the Zicfilp extension introduces
a landing pad (LPAD) instruction. The LPAD instruction must be placed at the
program locations that are valid targets of indirect jumps or calls. The LPAD
instruction (See LP_INST) is encoded using the AUIPC major opcode with
rd=x0.
Compilers emit a landing pad instruction as the first instruction of an address-taken function, as well as at any indirect jump targets. A landing pad instruction is not required in functions that are only reached using a direct call or direct jump.
The landing pad is designed to provide integrity to control transfers performed
using indirect calls and jumps, and this is referred to as forward-edge
protection. When the Zicfilp is active, the hart tracks an expected landing pad
(ELP) state that is updated by an indirect_call or indirect_jump to
require a landing pad instruction at the target of the branch. If the
instruction at the target is not a landing pad, then a software-check exception
is raised.
A landing pad may optionally be associated with a 20-bit label. With labeling enabled, the set of landing pads that can be reached from indirect call or jump sites can be defined using programming-language-based policies. Labeling of the landing pads enables software to achieve greater precision in pairing up indirect call/jump sites with valid targets. When labeling of landing pads is used, an indirect call or indirect jump site can specify the expected label of the landing pad and thereby constrain the set of landing pads that may be reached from each indirect call or indirect jump site in the program.
In the simplest form, a program can be built with a single label value to implement a coarse-grained version of forward-edge control-flow integrity. By constraining gadgets to be preceded by a landing pad instruction that marks the start of indirectly callable functions, the program can significantly reduce the available gadget space.

A second form of label generation may generate a signature, such as a MAC, using the prototype of the function. Programs that use this approach would further constrain the gadgets accessible from a call site to only indirectly callable functions that match the prototype of the called functions.

Another approach to label generation involves analyzing the control-flow graph (CFG) of the program, which can lead to even more stringent constraints on the set of reachable gadgets. Such programs may further use multiple labels per function, which means that if a function is called from two or more call sites, the function can be labeled as being reachable from each of the call sites. For instance, consider two call sites A and B, where A calls the functions X and Y, and B calls the functions Y and Z. In a single-label scheme, functions X, Y, and Z would need to be assigned the same label so that both call sites A and B can invoke the common function Y. This scheme would allow call site A to also call function Z and call site B to also call function X. However, if function Y were assigned two labels, one corresponding to call site A and the other to call site B, then Y could be invoked by both call sites, but X could only be invoked by call site A and Z only by call site B.

To support multiple labels, the compiler could generate a call-site-specific entry point for shared functions, with each entry point having its own landing pad instruction followed by a direct branch to the start of the function. This would allow the function to be labeled with multiple labels, each corresponding to a specific call site.
A portion of the label space may be dedicated to labeled landing pads that are only valid targets of an indirect jump (and not an indirect call).
The LPAD instruction uses the code points defined as HINTs for the AUIPC
opcode. When Zicfilp is not active at a privilege level or when the extension
is not implemented, the landing pad instruction executes as a no-op. A program
that is built with LPAD instructions can thus continue to operate correctly,
but without forward-edge control-flow integrity, on processors that do not
support the Zicfilp extension or if the Zicfilp extension is not active.
Compilers and linkers should provide an attribute flag to indicate if the program has been compiled with the Zicfilp extension and use that to determine if the Zicfilp extension should be activated. The dynamic loader should activate the use of Zicfilp extension for an application only if all executables (the application and the dependent dynamically linked libraries) used by that application use the Zicfilp extension.
When the Zicfilp extension is not active or not implemented, the hart does not require landing pad instructions at the targets of indirect calls/jumps, and the landing pad instructions revert to being no-ops. This allows a program compiled with landing pad instructions to operate correctly, but without forward-edge control-flow integrity.
The Zicfilp extension may be activated individually and independently for each privilege mode.
The Zicfilp extension depends on the Zicsr extension.
6.16.1.1 Landing Pad Enforcement
To enforce that the target of an indirect call or indirect jump must be a valid
landing pad instruction, the hart maintains an expected landing pad (ELP) state
to determine if a landing pad instruction is required at the target of an
indirect call or an indirect jump. The ELP state can be one of:
- 0 - NO_LP_EXPECTED
- 1 - LP_EXPECTED
The ELP state is initialized to NO_LP_EXPECTED by the hart upon reset.
The Zicfilp extension, when enabled, determines if an indirect call or an
indirect jump must land on a landing pad, as specified in IND_CALL_JMP. If
is_lp_expected is 1, then the hart updates the ELP to LP_EXPECTED.
is_lp_expected = ( (JALR || C.JR || C.JALR) &&
(rs1 != x1) && (rs1 != x5) && (rs1 != x7) ) ? 1 : 0;
An indirect branch using JALR, C.JALR, or C.JR with rs1 as x7 is
termed a software guarded branch. Such branches do not need to land on a
LPAD instruction and thus do not set ELP to LP_EXPECTED.
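The is_lp_expected condition, including the x7 software-guarded-branch carve-out described above, can be modeled in C (an illustrative sketch; the enum and function names are invented):

```c
#include <stdbool.h>

enum insn_kind { JALR, C_JR, C_JALR, OTHER };

/* Returns true when an indirect branch must land on an LPAD:
   only JALR/C.JR/C.JALR set the expectation, and only when rs1 is
   neither a link register (x1/x5) nor x7 (software guarded branch). */
bool is_lp_expected(enum insn_kind kind, unsigned rs1) {
    if (kind != JALR && kind != C_JR && kind != C_JALR)
        return false;
    return rs1 != 1 && rs1 != 5 && rs1 != 7;
}
```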
When the register source is a link register and the register destination is
x0, then it’s a return from a procedure and does not require a landing pad at
the target.
When the register source and register destination are both link registers, then
it is a semantically-direct-call. For example, the call offset
pseudoinstruction may expand to a two instruction sequence composed of a
lui ra, imm20 or a auipc ra, imm20 instruction followed by a
jalr ra, imm12(ra) instruction where ra is the link register (either x1 or
x5). Since the address of the procedure was not explicitly taken and the
computed address is not obtained from mutable memory, such semantically-direct
calls do not require a landing pad to be placed at the target. Compilers and
JITers must use the semantically-direct calls only if the rs1 was computed as
a PC-relative or an absolute offset to the symbol.
The tail offset pseudoinstruction used to tail call a far-away procedure may
also be expanded to a two instruction sequence composed of a lui x7, imm20 or
auipc x7, imm20 followed by a jalr x0, x7. Since the address of the
procedure was not explicitly taken and the computed address is not obtained from
mutable memory, such semantically-direct tail-calls do not require a landing pad
to be placed at the target.
Software guarded branches may also be used by compilers to generate code for constructs like switch-cases. When using the software guarded branches, the compiler is required to ensure it has full control on the possible jump targets (e.g., by obtaining the targets from a read-only table in memory and performing bounds checking on the index into the table, etc.).
The landing pad may be labeled. Zicfilp extension designates the register x7
for use as the landing pad label register. To support labeled landing pads, the
indirect call/jump sites establish an expected landing pad label (e.g., using
the LUI instruction) in the bits 31:12 of the x7 register. The LPAD
instruction is encoded with a 20-bit immediate value called the landing-pad-label
(LPL) that is matched to the expected landing pad label. When LPL is encoded
as zero, the LPAD instruction does not perform the label check and in programs
built with this single label mode of operation the indirect call/jump sites do
not need to establish an expected landing pad label value in x7.
When ELP is set to LP_EXPECTED, if the next instruction in the instruction
stream is not 4-byte aligned, or is not LPAD, or if the landing pad label
encoded in LPAD is not zero and does not match the expected landing pad label
in bits 31:12 of the x7 register, then a software-check exception (cause=18)
with xtval set to "landing pad fault (code=2)" is raised; otherwise, the ELP is
updated to NO_LP_EXPECTED.
The tracking of ELP and the requirement for a landing pad instruction
at the target of indirect call and jump enables a processor implementation to
significantly reduce or to prevent speculation to non-landing-pad instructions.
Constraining speculation using this technique, greatly reduces the gadget space
and increases the difficulty of using techniques such as branch-target-injection,
also known as Spectre variant 2, which use speculative execution to leak data
through side channels.
The LPAD instruction requires 4-byte alignment to prevent the concatenation of
two instructions A and B from accidentally forming an unintended landing pad in the
program. For example, consider a 32-bit instruction where the bytes 3 and 2 have
a pattern of ?017h (for example, the immediate fields of a LUI, AUIPC, or
a JAL instruction), followed by a 16-bit or a 32-bit instruction. When
patterns that can accidentally form a valid landing pad are detected, the
assembler or linker can force instruction A to be aligned to a 4-byte
boundary to force the unintended LPAD pattern to become misaligned, and thus
not a valid landing pad, or may use an alternate register allocation to prevent
the accidental landing pad.
6.16.1.2 Landing Pad Instruction
When Zicfilp is enabled, LPAD is the only instruction allowed to execute when
the ELP state is LP_EXPECTED. If Zicfilp is not enabled then the instruction
is a no-op. If Zicfilp is enabled, the LPAD instruction causes a
software-check exception with _x_tval set to "landing pad fault (code=2)" if
any of the following conditions are true:
- The pc is not 4-byte aligned and ELP is LP_EXPECTED.
- The ELP is LP_EXPECTED and the LPL is not zero and the LPL does not match the expected landing pad label in bits 31:12 of the x7 register.
If a software-check exception is not caused then the ELP is updated to
NO_LP_EXPECTED.
The operation of the LPAD instruction is as follows:
if (xLPE == 1 && ELP == LP_EXPECTED)
// If PC not 4-byte aligned then software-check exception
if pc[1:0] != 0
raise software-check exception
// If landing pad label not matched -> software-check exception
else if (inst.LPL != x7[31:12] && inst.LPL != 0)
raise software-check exception
else
ELP = NO_LP_EXPECTED
else
no-op
endif
6.16.2 Shadow Stack (Zicfiss)
The Zicfiss extension introduces a shadow stack to enforce backward-edge control-flow integrity. A shadow stack is a second stack used to store a shadow copy of the return address in the link register if it needs to be spilled.
The shadow stack is designed to provide integrity to control transfers performed using a return, where the return may be from a procedure invoked using an indirect call or a direct call, and this is referred to as backward-edge protection.
A program using backward-edge control-flow integrity has two stacks: a regular
stack and a shadow stack. The shadow stack is used to spill the link register,
if required, by non-leaf functions. An additional register, shadow-stack-pointer
(ssp), is introduced in the architecture to hold the address of the top of the
active shadow stack.
The shadow stack, similar to the regular stack, grows downwards, from
higher addresses to lower addresses. Each entry on the shadow stack is XLEN
wide and holds the link register value. The ssp points to the top of the
shadow stack, which is the address of the last element stored on the shadow
stack.
The shadow stack is architecturally protected from inadvertent corruptions and modifications, as detailed in the Privileged specification.
The Zicfiss extension provides instructions to store and load the link register to/from the shadow stack and to check the integrity of the return address. The extension provides instructions to support common stack maintenance operations such as stack unwinding and stack switching.
When Zicfiss is enabled, each function that needs to spill the link register, typically a non-leaf function, stores the link register value to the regular stack and a shadow copy of the link register value to the shadow stack when the function is entered (the prologue). When such a function returns (the epilogue), the function loads the link register from the regular stack and the shadow copy of the link register from the shadow stack. The two values are then compared. A mismatch of the two values is indicative of a subversion of the return address control variable and causes a software-check exception.
The Zicfiss instructions, except SSAMOSWAP.W/D, are encoded using a subset of
May-Be-Operation instructions defined by the Zimop and Zcmop extensions.
This subset of instructions revert to their Zimop/Zcmop defined behavior when
the Zicfiss extension is not implemented or if the extension has not been
activated. A program that is built with Zicfiss instructions can thus continue
to operate correctly, but without backward-edge control-flow integrity, on
processors that do not support the Zicfiss extension or if the Zicfiss extension
is not active. The Zicfiss extension may be activated for use individually and
independently for each privilege mode.
Compilers should flag each object file (for example, using flags in the ELF attributes) to indicate if the object file has been compiled with the Zicfiss instructions. The linker should flag (for example, using flags in the ELF attributes) the binary/executable generated by linking objects as being compiled with the Zicfiss instructions only if all the object files that are linked have the same Zicfiss attributes.
The dynamic loader should activate the use of Zicfiss extension for an application only if all executables (the application and the dependent dynamically-linked libraries) used by that application use the Zicfiss extension.
An application that has the Zicfiss extension active may request the dynamic loader at runtime to load a new dynamic shared object (using dlopen() for example). If the requested object does not have the Zicfiss attribute then the dynamic loader, based on its policy (e.g., established by the operating system or the administrator) configuration, could either deny the request or deactivate the Zicfiss extension for the application. It is strongly recommended that the policy enforces a strict security posture and denies the request.
The Zicfiss extension depends on the Zicsr, Zimop and Zaamo extensions. Furthermore,
if the Zcmop extension is implemented, the Zicfiss extension also provides the
C.SSPUSH and C.SSPOPCHK instructions. Moreover, use of Zicfiss in U-mode
requires S-mode to be implemented. Use of Zicfiss in M-mode is not supported.
6.16.2.1 Zicfiss Instructions Summary
The Zicfiss extension introduces the following instructions:
- Push to the shadow stack (See SS_PUSH)
  - SSPUSH x1 and SSPUSH x5 - encoded using MOP.RR.7
  - C.SSPUSH x1 - encoded using C.MOP.1
- Pop from the shadow stack (See SS_POP)
  - SSPOPCHK x1 and SSPOPCHK x5 - encoded using MOP.R.28
  - C.SSPOPCHK x5 - encoded using C.MOP.5
- Read the value of ssp into a register (See SSP_READ)
  - SSRDP - encoded using MOP.R.28
- Perform an atomic swap from a shadow stack location (See SSAMOSWAP)
  - SSAMOSWAP.W and SSAMOSWAP.D
Zicfiss does not use all encodings of MOP.RR.7 or MOP.R.28. When a
MOP.RR.7 or MOP.R.28 encoding is not used by the Zicfiss extension, the
corresponding instruction adheres to its Zimop-defined behavior, unless
redefined by another extension.
6.16.2.2 Shadow Stack Pointer (ssp)
The ssp CSR is an unprivileged read-write (URW) CSR that reads and writes
XLEN low order bits of the shadow stack pointer (ssp). The CSR address is
0x011. There is no high CSR defined as the ssp is always as wide as the XLEN
of the current privilege mode. Bits 1:0 of ssp are read-only zero. If neither
UXLEN nor SXLEN can ever be 32, then bit 2 is also read-only zero.
6.16.2.3 Zicfiss Instructions
6.16.2.4 Push to the Shadow Stack
A shadow stack push operation is defined as decrement of the ssp by XLEN/8
followed by a store of the value in the link register to memory at the new top
of the shadow stack.
Only x1 and x5 registers are supported as rs2 for SSPUSH. Zicfiss
provides a 16-bit version of the SSPUSH x1 instruction using the Zcmop
defined C.MOP.1 encoding. The C.SSPUSH x1 expands to SSPUSH x1.
The SSPUSH instruction and its compressed form C.SSPUSH can be used to push
a link register on the shadow stack. The SSPUSH and C.SSPUSH instructions
perform a store identically to the existing store instructions, with the
difference that the base is implicitly ssp and the width is implicitly XLEN.
The operation of the SSPUSH and C.SSPUSH instructions is as follows:
if (xSSE == 1)
mem[ssp - (XLEN/8)] = X(src) # Store src value to ssp - XLEN/8
ssp = ssp - (XLEN/8) # decrement ssp by XLEN/8
endif
The ssp is decremented by SSPUSH and C.SSPUSH only if the store to the
shadow stack completes successfully.
6.16.2.5 Pop from the Shadow Stack
A shadow stack pop operation is defined as an XLEN wide read from the
current top of the shadow stack followed by an increment of the ssp by
XLEN/8.
Only x1 and x5 registers are supported as rs1 for SSPOPCHK. Zicfiss
provides a 16-bit version of the SSPOPCHK x5 using the Zcmop defined C.MOP.5
encoding. The C.SSPOPCHK x5 expands to SSPOPCHK x5.
Programs with a shadow stack push the return address onto the regular stack as well as the shadow stack in the prologue of non-leaf functions. When returning from these non-leaf functions, such programs pop the link register from the regular stack and pop a shadow copy of the link register from the shadow stack. The two values are then compared. If the values do not match, it is indicative of a corruption of the return address variable on the regular stack.
The SSPOPCHK instruction, and its compressed form C.SSPOPCHK, can be used to
pop the shadow return address value from the shadow stack and check that the
value matches the contents of the link register, and if not cause a
software-check exception with _x_tval set to "shadow stack fault (code=3)".
While any register may be used as link register, conventionally the x1 or x5
registers are used. The shadow stack instructions are designed to be most
efficient when the x1 and x5 registers are used as the link register.
Return-address prediction stacks are a common feature of high-performance instruction-fetch units, but they require accurate detection of instructions used for procedure calls and returns to be effective. For RISC-V, hints as to the instructions' usage are encoded implicitly via the register numbers used. The return-address stack (RAS) actions to pop and/or push onto the RAS are specified in rashints.
Using x1 or x5 as the link register allows a program to benefit from the
return-address prediction stacks. Additionally, since the shadow stack
instructions are designed around the use of x1 or x5 as the link register,
using any other register as a link register would incur the cost of additional
register movements.
Compilers, when generating code with backward-edge CFI, must protect the link
register, e.g., x1 and/or x5, from arbitrary modification by not emitting
unsafe code sequences.
Storing the return address on both stacks preserves the call stack layout and the ABI, while also allowing for the detection of corruption of the return address on the regular stack. The prologue and epilogue of a non-leaf function that uses shadow stacks is as follows:
function_entry:
addi sp,sp,-8 # push link register x1
sd x1,(sp) # on regular stack
sspush x1 # push link register x1 on shadow stack
:
ld x1,(sp) # pop link register x1 from regular stack
addi sp,sp,8
sspopchk x1 # fault if x1 not equal to shadow
# return address
ret
This example illustrates the use of x1 register as the link register.
Alternatively, the x5 register may also be used as the link register.
A leaf function, a function that does not itself make function calls, does not need to spill the link register. Consequently, the return value may be held in the link register itself for the duration of the leaf function’s execution.
The SSPOPCHK and C.SSPOPCHK instructions perform a load identically to the
existing load instructions, with the difference that the base is implicitly
ssp and the width is implicitly XLEN.
The operation of the SSPOPCHK and C.SSPOPCHK instructions is as follows:
if (xSSE == 1)
temp = mem[ssp] # Load temp from address in ssp and
if temp != X(src) # Compare temp to value in src and
# cause a software-check exception
# if they are not bitwise equal.
# Only x1 and x5 may be used as src
raise software-check exception
else
ssp = ssp + (XLEN/8) # increment ssp by XLEN/8.
endif
endif
If the value loaded from the address in ssp does not match the value in rs1,
a software-check exception (cause=18) is raised with _x_tval set to "shadow
stack fault (code=3)". The software-check exception caused by SSPOPCHK/
C.SSPOPCHK is lower in priority than a load/store/AMO access-fault exception.
The ssp is incremented by SSPOPCHK and C.SSPOPCHK only if the load from
the shadow stack completes successfully and no software-check exception is
raised.
The use of the compressed instruction C.SSPUSH x1 to push on the shadow stack
is most efficient when the ABI uses x1 as the link register, as the link
register may then be pushed without needing a register-to-register move in the
function prologue. To use the compressed instruction C.SSPOPCHK x5, the
function should pop the return address from regular stack into the alternate
link register x5 and use the C.SSPOPCHK x5 to compare the return address to
the shadow copy stored on the shadow stack. The function then uses C.JR x5 to
jump to the return address.
function_entry:
c.addi sp,sp,-8 # push link register x1
c.sd x1,(sp) # on regular stack
c.sspush x1 # push link register x1 on shadow stack
:
c.ld x5,(sp) # pop link register x5 from regular stack
c.addi sp,sp,8
c.sspopchk x5 # fault if x5 not equal to shadow return address
c.jr x5
Store-to-load forwarding is a common technique employed by high-performance
processor implementations. Zicfiss implementations may prevent forwarding from
a non-shadow-stack store to the SSPOPCHK or the C.SSPOPCHK instructions. A
non-shadow-stack store causes a fault if performed to a page mapped as a shadow
stack; however, that determination may be delayed until the PTE has been
examined, and so the data from such stores might otherwise be transiently
forwarded to SSPOPCHK or to C.SSPOPCHK.
6.16.2.6 Read ssp into a Register
The SSRDP instruction is provided to move the contents of ssp to a destination
register.
Encoding rd as x0 is not supported for SSRDP.
The operation of the SSRDP instructions is as follows:
if (xSSE == 1)
X(dst) = ssp
else
X(dst) = 0
endif
The property of Zimop writing 0 to the rd when the extension using Zimop is
not implemented or not active may be used by software to determine whether the
Zicfiss extension is active. For example, functions that unwind shadow stacks
may skip over the unwind actions by dynamically detecting whether the Zicfiss
extension is active.
An example sequence such as the following may be used:
ssrdp t0 # mv ssp to t0
beqz t0, zicfiss_not_active # zero is not a valid shadow stack
# pointer by convention
# Zicfiss is active
:
:
zicfiss_not_active:
To assist with the use of such code sequences, operating systems and runtimes must not locate shadow stacks at address 0.
A common operation performed on stacks is to unwind them to support constructs
like setjmp/longjmp, C++ exception handling, etc. A program that uses shadow
stacks must unwind the shadow stack in addition to the stack used to store data.
The unwind function must verify that it does not accidentally unwind past the
bounds of the shadow stack. Shadow stacks are expected to be bounded on each end
using guard pages. A guard page for a stack is a page that is not accessible by
the process that owns the stack. To detect if the unwind occurs past the bounds
of the shadow stack, the unwind may be done in maximal increments of 4 KiB,
testing whether the ssp is still pointing to a shadow stack page or has
unwound into the guard page. The following examples illustrate the use of shadow
stack instructions to unwind a shadow stack. This example assumes that the
setjmp function itself does not push on to the shadow stack (being a leaf
function, it is not required to).
setjmp() {
:
:
// read and save the shadow stack pointer to jmp_buf
asm("ssrdp %0" : "=r"(cur_ssp):);
jmp_buf->saved_ssp = cur_ssp;
:
:
}
longjmp() {
:
// Read current shadow stack pointer and
// compute number of call frames to unwind
asm("ssrdp %0" : "=r"(cur_ssp):);
// Skip the unwind if backward-edge CFI not active
asm("beqz %0, back_cfi_not_active" : "=r"(cur_ssp):);
// Unwind the frames in a loop
while ( jmp_buf->saved_ssp > cur_ssp ) {
// advance by a maximum of 4K at a time to avoid
// unwinding past bounds of the shadow stack
cur_ssp = ( (jmp_buf->saved_ssp - cur_ssp) >= 4096 ) ?
(cur_ssp + 4096) : jmp_buf->saved_ssp;
asm("csrw ssp, %0" : : "r" (cur_ssp));
// Test if unwound past the shadow stack bounds
asm("sspush x5");
asm("sspopchk x5");
}
back_cfi_not_active:
:
}
6.16.2.7 Atomic Swap from a Shadow Stack Location
For RV32, SSAMOSWAP.W atomically loads a 32-bit data value from the shadow
stack address held in rs1, places the loaded value into register rd, and
stores the 32-bit value held in rs2 to the original address in rs1.
SSAMOSWAP.D (RV64 only) is similar to SSAMOSWAP.W but operates on 64-bit
data values.
if privilege_mode != M && menvcfg.SSE == 0
raise illegal-instruction exception
else if S-mode not implemented
raise illegal-instruction exception
else if privilege_mode == U && senvcfg.SSE == 0
raise illegal-instruction exception
else if privilege_mode == VS && henvcfg.SSE == 0
raise virtual-instruction exception
else if privilege_mode == VU && senvcfg.SSE == 0
raise virtual-instruction exception
else
X(rd) = mem[X(rs1)]
mem[X(rs1)] = X(rs2)
endif
For RV64, SSAMOSWAP.W atomically loads a 32-bit data value from the shadow
stack address held in rs1, sign-extends the loaded value and places it in
rd, and stores the lower 32 bits of the value held in rs2 to the original
address in rs1.
if privilege_mode != M && menvcfg.SSE == 0
raise illegal-instruction exception
else if S-mode not implemented
raise illegal-instruction exception
else if privilege_mode == U && senvcfg.SSE == 0
raise illegal-instruction exception
else if privilege_mode == VS && henvcfg.SSE == 0
raise virtual-instruction exception
else if privilege_mode == VU && senvcfg.SSE == 0
raise virtual-instruction exception
else
temp[31:0] = mem[X(rs1)]
X(rd) = SignExtend(temp[31:0])
mem[X(rs1)] = X(rs2)[31:0]
endif
Just as for AMOs in the A extension, SSAMOSWAP.W/D requires that the address
held in rs1 be naturally aligned to the size of the operand (i.e., eight-byte
aligned for doublewords, and four-byte aligned for words). The same
exception options apply if the address is not naturally aligned.
Just as for AMOs in the A extension, SSAMOSWAP.W/D optionally provides release
consistency semantics, using the aq and rl bits, to help implement
multiprocessor synchronization. An SSAMOSWAP.W/D operation has acquire
semantics if aq=1 and release semantics if rl=1.
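Assuming the aq/rl annotations follow the same assembly-mnemonic convention as the A extension (the suffix spelling and registers here are illustrative, not normative), a fully ordered swap might be written as:

    ssamoswap.d.aqrl t0, t1, (a0)   # t0 = old value at [a0]; [a0] = t1; acquire+release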
Stack switching is a common operation in user programs as well as supervisor programs. When a stack switch is performed the stack pointer of the currently active stack is saved into a context data structure and the new stack is made active by loading a new stack pointer from a context data structure.
When shadow stacks are active for a program, the program needs to additionally
switch the shadow stack pointer. If the pointer to the top of the deactivated
shadow stack is held in a context data structure, then it may be susceptible to
memory corruption vulnerabilities. To protect the pointer value, the program may
store it at the top of the deactivated shadow stack itself and thereby create a
checkpoint. A legal checkpoint is defined as one that holds a value of X,
where X is the address at which the checkpoint is positioned on the shadow
stack.
An example sequence to restore the shadow stack pointer from the new shadow stack and save the old shadow stack pointer on the old shadow stack is as follows:
# a0 holds a pointer to the top of the new shadow stack to switch to
stack_switch:
ssrdp ra
beqz ra, 2f # skip if Zicfiss not active
ssamoswap.d ra, x0, (a0) # ra=*[a0] and *[a0]=0
beq ra, a0, 1f # checkpoint value (ra) must equal a0
unimp # else crash
1: addi ra, ra, XLEN/8 # pop the checkpoint
csrrw ra, ssp, ra # swap ssp: ra=ssp, ssp=ra
addi ra, ra, -(XLEN/8) # checkpoint = "old ssp - XLEN/8"
ssamoswap.d x0, ra, (ra) # Save checkpoint at "old ssp - XLEN/8"
2:
This sequence uses the ra register. If the privilege mode at which this
sequence is executed can be interrupted, then the trap handler should save the
ra on the shadow stack itself. There it is guarded against tampering and
can be restored prior to returning from the trap.
When a new shadow stack is created by the supervisor, it needs to store a
checkpoint at the highest address on that stack. This enables the shadow stack
pointer to be switched using the process outlined in this note. The
SSAMOSWAP.W/D instruction can be used to store this checkpoint. When the old
value at the memory location operated on by SSAMOSWAP.W/D is not required,
rd can be set to x0.
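On RV64, storing such a checkpoint might be sketched as follows (the register choice is illustrative; on RV32, SSAMOSWAP.W would be used instead):

    # t0 = address of the highest XLEN/8-aligned entry of the new shadow stack
    ssamoswap.d x0, t0, (t0)   # store the checkpoint: the value stored equals its own address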
6.17 "Zihintntl" Extension for Non-Temporal Locality Hints, Version 1.0
The NTL instructions are HINTs that indicate that the explicit memory accesses of the immediately subsequent instruction (henceforth "target instruction") exhibit poor temporal locality of reference. The NTL instructions do not change architectural state, nor do they alter the architecturally visible effects of the target instruction. Four variants are provided:
The NTL.P1 instruction indicates that the target instruction does not exhibit temporal locality within the capacity of the innermost level of private cache in the memory hierarchy. NTL.P1 is encoded as ADD x0, x0, x2.
The NTL.PALL instruction indicates that the target instruction does not exhibit temporal locality within the capacity of any level of private cache in the memory hierarchy. NTL.PALL is encoded as ADD x0, x0, x3.
The NTL.S1 instruction indicates that the target instruction does not exhibit temporal locality within the capacity of the innermost level of shared cache in the memory hierarchy. NTL.S1 is encoded as ADD x0, x0, x4.
The NTL.ALL instruction indicates that the target instruction does not exhibit temporal locality within the capacity of any level of cache in the memory hierarchy. NTL.ALL is encoded as ADD x0, x0, x5.
The NTL instructions can be used to avoid cache pollution when streaming data or traversing large data structures, or to reduce latency in producer-consumer interactions.
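For example, a streaming copy loop that will not reuse its data might hint every access with NTL.ALL (the registers and 8-byte stride are illustrative; a2 is assumed to be a nonzero multiple of 8):

    # a0 = dst, a1 = src, a2 = byte count
    loop:
        ntl.all
        ld   t0, 0(a1)       # load exhibits no exploitable temporal locality
        ntl.all
        sd   t0, 0(a0)       # store exhibits no exploitable temporal locality
        addi a0, a0, 8
        addi a1, a1, 8
        addi a2, a2, -8
        bnez a2, loop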
A microarchitecture might use the NTL instructions to inform the cache replacement policy, or to decide which cache to allocate into, or to avoid cache allocation altogether. For example, NTL.P1 might indicate that an implementation should not allocate a line in a private L1 cache, but should allocate in L2 (whether private or shared). In another implementation, NTL.P1 might allocate the line in L1, but in the least-recently used state.
NTL.ALL will typically inform implementations not to allocate anywhere in the cache hierarchy. Programmers should use NTL.ALL for accesses that have no exploitable temporal locality.
Like any HINTs, these instructions may be freely ignored. Hence, although they are described in terms of cache-based memory hierarchies, they do not mandate the provision of caches.
Some implementations might respect these HINTs for some memory accesses but not others: e.g., implementations that implement LR/SC by acquiring a cache line in the exclusive state in L1 might ignore NTL instructions on LR and SC, but might respect NTL instructions for AMOs and regular loads and stores.
ntl-portable lists several software use cases and the recommended NTL variant that portable software—i.e., software not tuned for any specific implementation’s memory hierarchy—should use in each case.
| Scenario | Recommended NTL variant |
|---|---|
| Access to a working set between 64 KiB and 256 KiB in size | NTL.P1 |
| Access to a working set between 256 KiB and 1 MiB in size | NTL.PALL |
| Access to a working set greater than 1 MiB in size | NTL.S1 |
| Access with no exploitable temporal locality (e.g., streaming) | NTL.ALL |
| Access to a contended synchronization variable | NTL.PALL |
The working-set sizes listed in ntl-portable are not meant to constrain implementers' cache-sizing decisions. Cache sizes will obviously vary between implementations, and so software writers should only take these working-set sizes as rough guidelines.
ntl lists several sample memory hierarchies and recommends how each NTL variant maps onto each cache level. The table also recommends the NTL variant that implementation-tuned software should use to avoid allocating in a particular cache level. For example, for a system with a private L1 and a shared L2, it is recommended that NTL.P1 and NTL.PALL indicate that temporal locality cannot be exploited by the L1, and that NTL.S1 and NTL.ALL indicate that temporal locality cannot be exploited by the L2. Furthermore, software tuned for such a system should use NTL.P1 to indicate a lack of temporal locality exploitable by the L1, or should use NTL.ALL to indicate a lack of temporal locality exploitable by the L2.
If the C or Zca extension is provided, compressed variants of these HINTs are also provided: C.NTL.P1 is encoded as C.ADD x0, x2; C.NTL.PALL is encoded as C.ADD x0, x3; C.NTL.S1 is encoded as C.ADD x0, x4; and C.NTL.ALL is encoded as C.ADD x0, x5.
The NTL instructions affect all memory-access instructions except the cache-management instructions in the Zicbom extension.
As of this writing, there are no other exceptions to this rule, and so the NTL instructions affect all memory-access instructions defined in the base ISAs and the A, F, D, Q, C, and V standard extensions, as well as those defined within the hypervisor extension in hypervisor.
The NTL instructions can affect cache-management operations other than those in the Zicbom extension. For example, NTL.PALL followed by CBO.ZERO might indicate that the line should be allocated in L3 and zeroed, but not allocated in L1 or L2.
In the following table, the first four columns after the memory hierarchy give the actual cache level that each NTL variant maps to, and the last four columns give the NTL variant recommended for explicit management of each cache level.

| Memory hierarchy | P1 | PALL | S1 | ALL | L1 | L2 | L3 | L4/L5 |
|---|---|---|---|---|---|---|---|---|
| Common Scenarios | ||||||||
| No caches | none | |||||||
| Private L1 only | L1 | L1 | L1 | L1 | ALL | |||
| Private L1; shared L2 | L1 | L1 | L2 | L2 | P1 | ALL | ||
| Private L1; shared L2/L3 | L1 | L1 | L2 | L3 | P1 | S1 | ALL | |
| Private L1/L2 | L1 | L2 | L2 | L2 | P1 | ALL | ||
| Private L1/L2; shared L3 | L1 | L2 | L3 | L3 | P1 | PALL | ALL | |
| Private L1/L2; shared L3/L4 | L1 | L2 | L3 | L4 | P1 | PALL | S1 | ALL |
| Uncommon Scenarios | ||||||||
| Private L1/L2/L3; shared L4 | L1 | L3 | L4 | L4 | P1 | P1 | PALL | ALL |
| Private L1; shared L2/L3/L4 | L1 | L1 | L2 | L4 | P1 | S1 | ALL | ALL |
| Private L1/L2; shared L3/L4/L5 | L1 | L2 | L3 | L5 | P1 | PALL | S1 | ALL |
| Private L1/L2/L3; shared L4/L5 | L1 | L3 | L4 | L5 | P1 | P1 | PALL | ALL |
When an NTL instruction is applied to a prefetch hint in the Zicbop extension, it indicates that a cache line should be prefetched into a cache that is outer from the level specified by the NTL.
For example, in a system with a private L1 and shared L2, NTL.P1 followed by PREFETCH.R might prefetch into L2 with read intent.
To prefetch into the innermost level of cache, do not prefix the prefetch instruction with an NTL instruction.
In some systems, NTL.ALL followed by a prefetch instruction might prefetch into a cache or prefetch buffer internal to a memory controller.
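In assembly, the private-L1/shared-L2 case described above might be sketched as (the offset and base register are illustrative):

    ntl.p1               # hint: bypass the innermost private cache
    prefetch.r 0(a0)     # prefetch the block at [a0] into L2 with read intent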
Software is discouraged from following an NTL instruction with an instruction that does not explicitly access memory. Nonadherence to this recommendation might reduce performance but otherwise has no architecturally visible effect.
In the event that a trap is taken on the target instruction, implementations are discouraged from applying the NTL to the first instruction in the trap handler. Instead, implementations are recommended to ignore the HINT in this case.
If an interrupt occurs between the execution of an NTL instruction and its target instruction, execution will normally resume at the target instruction. That the NTL instruction is not re-executed does not change the semantics of the program.
Some implementations might prefer not to process the NTL instruction until the target instruction is seen (e.g., so that the NTL can be fused with the memory access it modifies). Such implementations might preferentially take the interrupt before the NTL, rather than between the NTL and the memory access.
Since the NTL instructions are encoded as ADDs, they can be used within LR/SC loops without voiding the forward-progress guarantee. But, since using other loads and stores within an LR/SC loop does void the forward-progress guarantee, the only reason to use an NTL within such a loop is to modify the LR or the SC.
6.18 "Zihintpause" Extension for Pause Hint, Version 2.0
The PAUSE instruction is a HINT that indicates the current hart’s rate of instruction retirement should be temporarily reduced or paused. The duration of its effect must be bounded and may be zero.
Software can use the PAUSE instruction to reduce energy consumption while executing spin-wait code sequences. Multithreaded cores might temporarily relinquish execution resources to other harts when PAUSE is executed. It is recommended that a PAUSE instruction generally be included in the code sequence for a spin-wait loop.
The duration of a PAUSE instruction’s effect may vary significantly within and among implementations. In typical implementations this duration should be much less than the time to perform a context switch, probably more on the rough order of an on-chip cache miss latency or a cacheless access to main memory.
A series of PAUSE instructions can be used to create a cumulative delay loosely proportional to the number of PAUSE instructions. In spin-wait loops in portable code, however, only one PAUSE instruction should be used before re-evaluating loop conditions, else the hart might stall longer than optimal on some implementations, degrading system performance.
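A minimal spin-wait loop using PAUSE might look like the following sketch (the flag location and registers are illustrative):

    # wait until the word at (a0) becomes nonzero
    spin:
        lw    t0, 0(a0)
        bnez  t0, done
        pause            # hint: reduce retirement rate while waiting
        j     spin
    done: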
PAUSE is encoded as a FENCE instruction with pred=W, succ=0, fm=0,
rd=x0, and rs1=x0.
PAUSE is encoded as a hint within the FENCE opcode because some implementations are expected to deliberately stall the PAUSE instruction until outstanding memory transactions have completed. Because the successor set is null, however, PAUSE does not mandate any particular memory ordering—hence, it truly is a HINT.
Like other FENCE instructions, PAUSE cannot be used within LR/SC sequences without voiding the forward-progress guarantee.
The choice of a predecessor set of W is arbitrary, since the successor set is null. Other HINTs similar to PAUSE might be encoded with other predecessor sets.
6.19 Cache Management Operations (CMOs)
6.19.1 Pseudocode for instruction semantics
The semantics of each instruction in the insns chapter is expressed in a SAIL-like syntax.
6.19.2 Introduction
Cache-management operation (or CMO) instructions perform operations on copies of data in the memory hierarchy. In general, CMO instructions operate on cached copies of data, but in some cases, a CMO instruction may operate on memory locations directly. Furthermore, CMO instructions are grouped by operation into the following classes:
- A management instruction manipulates cached copies of data with respect to a set of agents that can access the data
- A zero instruction zeros out a range of memory locations, potentially allocating cached copies of data in one or more caches
- A prefetch instruction indicates to hardware that data at a given memory location may be accessed in the near future, potentially allocating cached copies of data in one or more caches
This chapter introduces a base set of CMO ISA extensions that operate specifically on cache blocks or the memory locations corresponding to a cache block; these are known as cache-block operation (or CBO) instructions. Each of the above classes of instructions represents an extension in this specification:
- The Zicbom extension defines a set of cache-block management instructions: CBO.INVAL, CBO.CLEAN, and CBO.FLUSH
- The Zicboz extension defines a cache-block zero instruction: CBO.ZERO
- The Zicbop extension defines a set of cache-block prefetch instructions: PREFETCH.R, PREFETCH.W, and PREFETCH.I
The execution behavior of the above instructions is also modified by CSR state added by this specification.
The remainder of this chapter provides general background information on CMO instructions and describes each of the above ISA extensions.
The term CMO encompasses all operations on caches or resources related to caches. The term CBO represents a subset of CMOs that operate only on cache blocks. The first CMO extensions only define CBOs.
6.19.3 Background
This chapter provides information common to all CMO extensions.
6.19.3.1 Memory and Caches
A memory location is a physical resource in a system uniquely identified by a physical address. An agent is a logic block, such as a RISC-V hart, accelerator, I/O device, etc., that can access a given memory location.
A given agent may not be able to access all memory locations in a system, and two different agents may or may not be able to access the same set of memory locations.
A load operation (or store operation) is performed by an agent to consume (or modify) the data at a given memory location. Load and store operations are performed as a result of explicit memory accesses to that memory location. Additionally, a read transfer from memory fetches the data at the memory location, while a write transfer to memory updates the data at the memory location.
A cache is a structure that buffers copies of data to reduce average memory latency. Any number of caches may be interspersed between an agent and a memory location, and load and store operations from an agent may be satisfied by a cache instead of the memory location.
Load and store operations are decoupled from read and write transfers by caches. For example, a load operation may be satisfied by a cache without performing a read transfer from memory, or a store operation may be satisfied by a cache that first performs a read transfer from memory.
Caches organize copies of data into cache blocks, each of which represents a contiguous, naturally aligned power-of-two (or NAPOT) range of memory locations. A cache block is identified by any of the physical addresses corresponding to the underlying memory locations. The capacity and organization of a cache and the size of a cache block are both implementation-specific, and the execution environment provides software a means to discover information about the caches and cache blocks in a system. In the initial set of CMO extensions, the size of a cache block shall be uniform throughout the system.
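Because a cache block occupies a NAPOT range, any address within the block can be reduced to the block's base address with simple mask arithmetic. The sketch below assumes a hypothetical 64-byte block size; the name `CACHE_BLOCK_SIZE` and the constant are illustrative, not architectural:

```c
#include <stdint.h>

/* Hypothetical block size: real software must discover this value
 * through the execution environment, and the initial CMO extensions
 * require it to be uniform throughout the system. */
#define CACHE_BLOCK_SIZE 64u

/* Base address of the NAPOT cache block containing addr; any address
 * within the block identifies the same block. */
static uint64_t cache_block_base(uint64_t addr)
{
    return addr & ~((uint64_t)CACHE_BLOCK_SIZE - 1);
}
```

Software would obtain the actual block size from the execution environment's discovery mechanism rather than hard-coding it.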
In future CMO extensions, the requirement for a uniform cache block size may be relaxed.
Implementation techniques such as speculative execution or hardware prefetching may cause a given cache to allocate or deallocate a copy of a cache block at any time, provided the corresponding physical addresses are accessible according to the supported access type PMA and are cacheable according to the cacheability PMA. Allocating a copy of a cache block results in a read transfer from another cache or from memory, while deallocating a copy of a cache block may result in a write transfer to another cache or to memory depending on whether the data in the copy were modified by a store operation. Additional details are discussed in coherent-agents-caches.
6.19.3.2 Cache-Block Operations
A CBO instruction causes one or more operations to be performed on the cache blocks identified by the instruction. In general, a CBO instruction may identify one or more cache blocks; however, in the initial set of CMO extensions, CBO instructions identify a single cache block only.
A cache-block management instruction performs one of the following operations, relative to the copy of a given cache block allocated in a given cache:
- An invalidate operation deallocates the copy of the cache block
- A clean operation performs a write transfer to another cache or to memory if the data in the copy of the cache block have been modified by a store operation
- A flush operation atomically performs a clean operation followed by an invalidate operation
Additional details, including the actual operation performed by a given cache-block management instruction, are described in Zicbom.
A cache-block zero instruction performs a set of store operations that write zeros to the set of bytes corresponding to a cache block. Unless specified otherwise, the store operations generated by a cache-block zero instruction have the same general properties and behaviors that other store instructions in the architecture have. An implementation may or may not update the entire set of bytes atomically with a single store operation. Additional details are described in Zicboz.
A cache-block prefetch instruction is a HINT to the hardware that software expects to perform a particular type of memory access in the near future. Additional details are described in Zicbop.
6.19.4 Coherent Agents and Caches
For a given memory location, a set of coherent agents consists of the agents for which all of the following hold:
- Store operations from all agents in the set appear to be serialized with respect to each other
- Store operations from all agents in the set eventually appear to all other agents in the set
- A load operation from an agent in the set returns data from a store operation from an agent in the set (or from the initial data in memory)
The coherent agents within such a set shall access a given memory location with the same physical address and the same physical memory attributes; however, if the coherence PMA for a given agent indicates a given memory location is not coherent, that agent shall not be a member of a set of coherent agents with any other agent for that memory location and shall be the sole member of a set of coherent agents consisting of itself.
An agent that is a member of a set of coherent agents is said to be coherent with respect to the other agents in the set. Conversely, an agent that is not a member is said to be non-coherent with respect to the agents in the set.
Caches introduce the possibility that multiple copies of a given cache block may be present in a system at the same time. An implementation-specific mechanism keeps these copies coherent with respect to the load and store operations from the agents in the set of coherent agents. Additionally, if a coherent agent in the set executes a CBO instruction that specifies the cache block, the resulting operation shall apply to any and all of the copies in the caches that can be accessed by the load and store operations from the coherent agents.
An operation from a CBO instruction is defined to operate only on the copies of a cache block that are cached in the caches accessible by the explicit memory accesses performed by the set of coherent agents. This includes copies of a cache block in caches that are accessed only indirectly by load and store operations, e.g. coherent instruction caches.
The set of caches subject to the above mechanism form a set of coherent caches, and each coherent cache has the following behaviors, assuming all operations are performed by the agents in a set of coherent agents:
- A coherent cache is permitted to allocate and deallocate copies of a cache block and perform read and write transfers as described in memory-caches
- A coherent cache is permitted to perform a write transfer to memory provided that a store operation has modified the data in the cache block since the most recent invalidate, clean, or flush operation on the cache block
- At least one coherent cache is responsible for performing a write transfer to memory once a store operation has modified the data in the cache block until the next invalidate, clean, or flush operation on the cache block, after which no coherent cache is responsible (or permitted) to perform a write transfer to memory until the next store operation has modified the data in the cache block
- A coherent cache is required to perform a write transfer to memory if a store operation has modified the data in the cache block since the most recent invalidate, clean, or flush operation on the cache block and if the next clean or flush operation requires a write transfer to memory
The above restrictions ensure that a "clean" copy of a cache block, fetched by a read transfer from memory and unmodified by a store operation, cannot later overwrite the copy of the cache block in memory updated by a write transfer to memory from a non-coherent agent.
A non-coherent agent may initiate a cache-block operation that operates on the set of coherent caches accessed by a set of coherent agents. The mechanism to perform such an operation is implementation-specific.
6.19.4.1 Memory Ordering
6.19.4.1.1 Preserved Program Order
The preserved program order (abbreviated PPO) rules are defined by the RVWMO memory ordering model. How the operations resulting from CMO instructions fit into these rules is described below.
For cache-block management instructions, the resulting invalidate, clean, and flush operations behave as stores in the PPO rules subject to one additional overlapping address rule. Specifically, if a precedes b in program order, then a will precede b in the global memory order if:
- a is an invalidate, clean, or flush, b is a load, and a and b access overlapping memory addresses
The above rule ensures that a subsequent load in program order never appears in the global memory order before a preceding invalidate, clean, or flush operation to an overlapping address.
Additionally, invalidate, clean, and flush operations are classified as W or O (depending on the physical memory attributes for the corresponding physical addresses) for the purposes of predecessor and successor sets in FENCE instructions. These operations are also ordered by other instructions that order stores, e.g. FENCE.I and SFENCE.VMA.
For cache-block zero instructions, the resulting store operations behave as stores in the PPO rules and are ordered by other instructions that order stores.
Finally, for cache-block prefetch instructions, the resulting operations are not ordered by the PPO rules nor are they ordered by any other ordering instructions.
6.19.4.1.2 Load Values
An invalidate operation may change the set of values that can be returned by a load. In particular, an additional condition is added to the Load Value Axiom:
- If an invalidate operation i precedes a load r and operates on a byte x returned by r, and no store to x appears between i and r in program order or in the global memory order, then r returns any of the following values for x:
- If no clean or flush operations on x precede i in the global memory order, either the initial value of x or the value of any store to x that precedes i
- If no store to x precedes a clean or flush operation on x in the global memory order and if the clean or flush operation on x precedes i in the global memory order, either the initial value of x or the value of any store to x that precedes i
- If a store to x precedes a clean or flush operation on x in the global memory order and if the clean or flush operation on x precedes i in the global memory order, either the value of the latest store to x that precedes the latest clean or flush operation on x or the value of any store to x that both precedes i and succeeds the latest clean or flush operation on x that precedes i
- The value of any store to x by a non-coherent agent regardless of the above conditions
The first three bullets describe the possible load values at different points in the global memory order relative to clean or flush operations. The final bullet implies that the load value may be produced by a non-coherent agent at any time.
6.19.4.2 Traps
Execution of certain CMO instructions may result in traps due to CSR state, described in the csr_state section, or due to the address translation and protection mechanisms. The trapping behavior of CMO instructions is described in the following sections.
6.19.4.2.1 Illegal-Instruction and Virtual-Instruction Exceptions
Cache-block management instructions and cache-block zero instructions may raise illegal-instruction exceptions or virtual-instruction exceptions depending on the current privilege mode and the state of the CMO control registers described in the csr_state section.
Cache-block prefetch instructions raise neither illegal-instruction exceptions nor virtual-instruction exceptions.
6.19.4.2.2 Page-Fault, Guest-Page-Fault, and Access-Fault Exceptions
Similar to load and store instructions, CMO instructions are explicit memory access instructions that compute an effective address. The effective address is ultimately translated into a physical address based on the privilege mode and the enabled translation mechanisms, and the CMO extensions impose the following constraints on the physical addresses in a given cache block:
- The PMP access control bits shall be the same for all physical addresses in the cache block, and if write permission is granted by the PMP access control bits, read permission shall also be granted
- The PMAs shall be the same for all physical addresses in the cache block, and if write permission is granted by the supported access type PMAs, read permission shall also be granted
If the above constraints are not met, the behavior of a CBO instruction is UNSPECIFIED.
This specification assumes that the above constraints will typically be met for main memory regions and may be met for certain I/O regions.
The access size for CMO instructions is equal to the size of the cache block; however, in some cases that access can be decomposed into multiple memory operations. PMP checks are applied to each memory operation independently. For example, a 64-byte cbo.zero that spans two 32-byte PMP regions would succeed if it were decomposed into two 32-byte memory operations (and the PMP access control bits are the same in both regions), but if performed as a single 64-byte memory operation it would cause an access fault.
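As a rough model of the note above, the sketch below checks a decomposed cache-block access against a toy permission map; the 32-byte region size and the `region_allows_write` map are hypothetical stand-ins for real PMP state:

```c
#include <stdbool.h>
#include <stdint.h>

#define REGION_SIZE 32u  /* hypothetical PMP region granularity */

/* Toy permission map standing in for PMP state: writes permitted
 * only below address 0x80. */
static bool region_allows_write(uint64_t base)
{
    return base < 0x80;
}

/* A cache-block access of len bytes at base, decomposed into
 * REGION_SIZE-byte operations, succeeds only if every operation
 * passes its own PMP check. */
static bool decomposed_access_ok(uint64_t base, uint64_t len)
{
    for (uint64_t a = base; a < base + len; a += REGION_SIZE)
        if (!region_allows_write(a))
            return false;
    return true;
}
```

A single non-decomposed access would instead be checked against one region and fault whenever the block straddles a region boundary.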
The Zicboz extension introduces an additional supported access type PMA for cache-block zero instructions. Main memory regions are required to support accesses by cache-block zero instructions; however, I/O regions may specify whether accesses by cache-block zero instructions are supported.
A cache-block management instruction is permitted to access the specified cache block whenever a load instruction or store instruction is permitted to access the corresponding physical addresses. If neither a load instruction nor store instruction is permitted to access the physical addresses, but an instruction fetch is permitted to access the physical addresses, whether a cache-block management instruction is permitted to access the cache block is UNSPECIFIED. If access to the cache block is not permitted, a cache-block management instruction raises a store page-fault or store guest-page-fault exception if address translation does not permit any access or raises a store access-fault exception otherwise. During address translation, the instruction also checks the accessed bit and may either raise an exception or set the bit as required.
The interaction between cache-block management instructions and instruction fetches will be specified in a future extension.
As implied by omission, a cache-block management instruction does not check the dirty bit and neither raises an exception nor sets the bit.
A cache-block zero instruction is permitted to access the specified cache block whenever a store instruction is permitted to access the corresponding physical addresses and when the PMAs indicate that cache-block zero instructions are a supported access type. If access to the cache block is not permitted, a cache-block zero instruction raises a store page-fault or store guest-page-fault exception if address translation does not permit write access or raises a store access-fault exception otherwise. During address translation, the instruction also checks the accessed and dirty bits and may either raise an exception or set the bits as required.
A cache-block prefetch instruction is permitted to access the specified cache block whenever a load instruction, store instruction, or instruction fetch is permitted to access the corresponding physical addresses. If access to the cache block is not permitted, a cache-block prefetch instruction does not raise any exceptions and shall not access any caches or memory. During address translation, the instruction does not check the accessed and dirty bits and neither raises an exception nor sets the bits.
When a page-fault, guest-page-fault, or access-fault exception is taken, the relevant *tval CSR is written with the faulting effective address (i.e. the value of rs1).
Like a load or store instruction, a CMO instruction may or may not be permitted
to access a cache block based on the states of the MPRV, MPV, and MPP bits
in mstatus and the SUM and MXR bits in mstatus, sstatus, and
vsstatus.
This specification expects that implementations will process cache-block management instructions like store/AMO instructions, so store/AMO exceptions are appropriate for these instructions, regardless of the permissions required.
6.19.4.2.3 Address-Misaligned Exceptions
CMO instructions do not generate address-misaligned exceptions.
6.19.4.2.4 Breakpoint Exceptions and Debug Mode Entry
Unless otherwise defined by the debug architecture specification, the behavior of trigger modules with respect to CMO instructions is UNSPECIFIED.
For the Zicbom, Zicboz, and Zicbop extensions, this specification recommends the following common trigger module behaviors:
- Type 6 address match triggers, i.e. tdata1.type=6 and mcontrol6.select=0, should be supported
- Type 2 address/data match triggers, i.e. tdata1.type=2, should be unsupported
- The size of a memory access equals the size of the cache block accessed, and the compare values follow from the addresses of the NAPOT memory region corresponding to the cache block containing the effective address
- Unless an encoding for a cache block is added to the mcontrol6.size field, an address trigger should only match a memory access from a CBO instruction if mcontrol6.size=0
If the Zicbom extension is implemented, this specification recommends the following additional trigger module behaviors:
- Implementing address match triggers should be optional
- Type 6 data match triggers, i.e. tdata1.type=6 and mcontrol6.select=1, should be unsupported
- Memory accesses are considered to be stores, i.e. an address trigger matches only if mcontrol6.store=1
If the Zicboz extension is implemented, this specification recommends the following additional trigger module behaviors:
- Implementing address match triggers should be mandatory
- Type 6 data match triggers, i.e. tdata1.type=6 and mcontrol6.select=1, should be supported, and implementing these triggers should be optional
- Memory accesses are considered to be stores, i.e. an address trigger matches only if mcontrol6.store=1
If the Zicbop extension is implemented, this specification recommends the following additional trigger module behaviors:
- Implementing address match triggers should be optional
- Type 6 data match triggers, i.e. tdata1.type=6 and mcontrol6.select=1, should be unsupported
- Memory accesses may be considered to be loads or stores depending on the implementation, i.e. whether an address trigger matches on these instructions when mcontrol6.load=1 or mcontrol6.store=1 is implementation-specific
This specification also recommends that the behavior of trigger modules with respect to the Zicboz extension should be defined in version 1.0 of the debug architecture specification. The behavior of trigger modules with respect to the Zicbom and Zicbop extensions is expected to be defined in future extensions.
6.19.4.2.5 Hypervisor Extension
For the purposes of writing the mtinst or htinst register on a trap, the
following standard transformation is defined for cache-block management
instructions and cache-block zero instructions:
The operation field corresponds to the 12 most significant bits of the
trapping instruction.
As described in the hypervisor extension, a zero may be written into mtinst
or htinst instead of the standard transformation defined above.
6.19.4.3 Effects on Constrained LR/SC Loops
The following event is added to the list of events that satisfy the eventuality guarantee provided by constrained LR/SC loops, as defined in the A extension:
- Some other hart executes a cache-block management instruction or a cache-block zero instruction to the reservation set of the LR instruction in H's constrained LR/SC loop.
The above event has been added to accommodate cache coherence protocols that cannot distinguish between invalidations for stores and invalidations for cache-block management operations.
Aside from the above event, CMO instructions neither change the properties of constrained LR/SC loops nor modify the eventuality guarantee provided by them. For example, executing a CMO instruction may cause a constrained LR/SC loop on any hart to fail periodically or may cause an unconstrained LR/SC sequence on the same hart to fail always. Additionally, executing a cache-block prefetch instruction does not impact the eventuality guarantee provided by constrained LR/SC loops executed on any hart.
6.19.4.4 Software Discovery
The initial set of CMO extensions requires the following information to be discovered by software:
- The size of the cache block for management and prefetch instructions
- The size of the cache block for zero instructions
- CBIE support at each privilege level
Other general cache characteristics may also be specified in the discovery mechanism.
6.19.5 CSR controls for CMO instructions
The x{csrname} registers control CBO instruction execution based on the current privilege mode and the state of the appropriate CSRs, as detailed below.
A CBO.INVAL instruction executes or raises either an illegal-instruction
exception or a virtual-instruction exception based on the state of the
x{csrname}.CBIE fields:
// illegal-instruction exceptions
if (((priv_mode != M) && (m{csrname}.CBIE == 00)) ||
    ((priv_mode == U) && (s{csrname}.CBIE == 00)))
{
  <raise illegal-instruction exception>
}
// virtual-instruction exceptions
else if (((priv_mode == VS) && (h{csrname}.CBIE == 00)) ||
         ((priv_mode == VU) && ((h{csrname}.CBIE == 00) || (s{csrname}.CBIE == 00))))
{
  <raise virtual-instruction exception>
}
// execute instruction
else
{
  if (((priv_mode != M) && (m{csrname}.CBIE == 01)) ||
      ((priv_mode == U) && (s{csrname}.CBIE == 01)) ||
      ((priv_mode == VS) && (h{csrname}.CBIE == 01)) ||
      ((priv_mode == VU) && ((h{csrname}.CBIE == 01) || (s{csrname}.CBIE == 01))))
  {
    <execute CBO.INVAL and perform flush operation>
  }
  else
  {
    <execute CBO.INVAL and perform invalidate operation>
  }
}
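For illustration, the decision logic above can be rendered as an executable function. The enum values below are stand-ins chosen to mirror the 00/01/11 CBIE encodings; the function name and types are hypothetical, not part of the specification:

```c
/* Hypothetical encodings for illustration; the cbie_t values mirror
 * the 00 (illegal), 01 (flush), and 11 (invalidate) CBIE encodings. */
typedef enum { M, S, U, VS, VU } priv_t;
typedef enum { CBIE_ILLEGAL = 0, CBIE_FLUSH = 1, CBIE_INVAL = 3 } cbie_t;
typedef enum { ILLEGAL_EXC, VIRTUAL_EXC, DO_FLUSH, DO_INVAL } outcome_t;

/* Mirrors the x{csrname}.CBIE checks for CBO.INVAL; m, s, and h are
 * the CBIE fields of the machine, supervisor, and hypervisor CSRs. */
static outcome_t cbo_inval_outcome(priv_t p, cbie_t m, cbie_t s, cbie_t h)
{
    if ((p != M && m == CBIE_ILLEGAL) ||
        (p == U && s == CBIE_ILLEGAL))
        return ILLEGAL_EXC;
    if ((p == VS && h == CBIE_ILLEGAL) ||
        (p == VU && (h == CBIE_ILLEGAL || s == CBIE_ILLEGAL)))
        return VIRTUAL_EXC;
    if ((p != M && m == CBIE_FLUSH) ||
        (p == U && s == CBIE_FLUSH) ||
        (p == VS && h == CBIE_FLUSH) ||
        (p == VU && (h == CBIE_FLUSH || s == CBIE_FLUSH)))
        return DO_FLUSH;
    return DO_INVAL;
}
```

Note that M-mode always performs the invalidate operation, and a CBIE value of 01 at any applicable privilege level downgrades the operation to a flush.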
Until a modified cache block has updated memory, a CBO.INVAL instruction may
expose stale data values in memory if the CSRs are programmed to perform an
invalidate operation. This behavior may result in a security hole if lower
privileged level software performs an invalidate operation and accesses
sensitive information in memory.
To avoid such holes, higher privileged level software must perform either a
clean or flush operation on the cache block before permitting lower privileged
level software to perform an invalidate operation on the block. Alternatively,
higher privileged level software may program the CSRs so that CBO.INVAL
either traps or performs a flush operation in a lower privileged level.
A CBO.CLEAN or CBO.FLUSH instruction executes or raises an illegal-instruction
or virtual-instruction exception based on the state of the
x{csrname}.CBCFE bits:
// illegal-instruction exceptions
if (((priv_mode != M) && !m{csrname}.CBCFE) ||
    ((priv_mode == U) && !s{csrname}.CBCFE))
{
  <raise illegal-instruction exception>
}
// virtual-instruction exceptions
else if (((priv_mode == VS) && !h{csrname}.CBCFE) ||
         ((priv_mode == VU) && !(h{csrname}.CBCFE && s{csrname}.CBCFE)))
{
  <raise virtual-instruction exception>
}
// execute instruction
else
{
  <execute CBO.CLEAN or CBO.FLUSH>
}
Finally, a CBO.ZERO instruction executes or raises an illegal-instruction or
virtual-instruction exception based on the state of the x{csrname}.CBZE bits:
// illegal-instruction exceptions
if (((priv_mode != M) && !m{csrname}.CBZE) ||
    ((priv_mode == U) && !s{csrname}.CBZE))
{
  <raise illegal-instruction exception>
}
// virtual-instruction exceptions
else if (((priv_mode == VS) && !h{csrname}.CBZE) ||
         ((priv_mode == VU) && !(h{csrname}.CBZE && s{csrname}.CBZE)))
{
  <raise virtual-instruction exception>
}
// execute instruction
else
{
  <execute CBO.ZERO>
}
The CBIE/CBCFE/CBZE fields in each x{csrname} register do not affect the
read and write behavior of the same fields in the other x{csrname} registers.
Each x{csrname} register is WARL; however, software should determine the legal
values from the execution environment discovery mechanism.
6.19.6 Extensions
CMO instructions are defined in the following extensions:
6.19.6.1 Cache-Block Management Instructions
Cache-block management instructions enable software running on a set of coherent agents to communicate with a set of non-coherent agents by performing one of the following operations:
- An invalidate operation makes data from store operations performed by a set of non-coherent agents visible to the set of coherent agents at a point common to both sets by deallocating all copies of a cache block from the set of coherent caches up to that point
- A clean operation makes data from store operations performed by the set of coherent agents visible to a set of non-coherent agents at a point common to both sets by performing a write transfer of a copy of a cache block to that point provided a coherent agent performed a store operation that modified the data in the cache block since the previous invalidate, clean, or flush operation on the cache block
- A flush operation atomically performs a clean operation followed by an invalidate operation
In the Zicbom extension, the instructions operate to a point common to all agents in the system. In other words, an invalidate operation ensures that store operations from all non-coherent agents are made visible to the agents in the set of coherent agents, and a clean operation ensures that store operations from the coherent agents are made visible to all non-coherent agents.
The Zicbom extension does not prohibit agents that fall outside of the above architectural definition; however, software cannot rely on the defined cache operations to have the desired effects with respect to those agents.
Future extensions may define different sets of agents for the purposes of performance optimization.
These instructions operate on the cache block whose effective address is specified in rs1. The effective address is translated into a corresponding physical address by the appropriate translation mechanisms.
The following instructions comprise the Zicbom extension:
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| ✓ | ✓ | cbo.clean base | insns-cbo_clean |
| ✓ | ✓ | cbo.flush base | insns-cbo_flush |
| ✓ | ✓ | cbo.inval base | insns-cbo_inval |
Cache-block management instructions operate on the cache block irrespective of the PMA cacheability attribute and of any Page-Based Memory Type (PBMT) downgrade from cacheable to non-cacheable.
6.19.6.2 Cache-Block Zero Instructions
Cache-block zero instructions store zeros to the set of bytes corresponding to a cache block. An implementation may update the bytes in any order and with any granularity and atomicity, including individual bytes.
Cache-block zero instructions store zeros independently of whether data from the underlying memory locations are cacheable. In addition, this specification does not constrain how the bytes are written.
These instructions operate on the cache block, or the memory locations corresponding to the cache block, whose effective address is specified in rs1. The effective address is translated into a corresponding physical address by the appropriate translation mechanisms.
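The net effect on memory can be modeled as zeroing the whole NAPOT block containing the effective address. This sketch assumes a hypothetical 64-byte block size and only models the final memory contents, not an implementation's store granularity, ordering, or atomicity:

```c
#include <stdint.h>
#include <string.h>

#define CACHE_BLOCK_SIZE 64u  /* hypothetical, discovered by software */

/* Model of cbo.zero's effect on memory: zero the full NAPOT block
 * containing addr, wherever within the block addr points. memset
 * stands in only for the final contents; the actual stores may be
 * performed with any granularity and atomicity. */
static void model_cbo_zero(uint8_t *mem, uint64_t addr)
{
    uint64_t base = addr & ~((uint64_t)CACHE_BLOCK_SIZE - 1);
    memset(mem + base, 0, CACHE_BLOCK_SIZE);
}
```

Bytes outside the block containing the effective address are unaffected, even when the address itself is unaligned.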
The following instructions comprise the Zicboz extension:
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| ✓ | ✓ | cbo.zero base | insns-cbo_zero |
6.19.6.3 Cache-Block Prefetch Instructions
Cache-block prefetch instructions are HINTs to the hardware to indicate that software intends to perform a particular type of memory access in the near future. The types of memory accesses are instruction fetch, data read (i.e. load), and data write (i.e. store).
These instructions operate on the cache block whose effective address is the sum
of the base address specified in rs1 and the sign-extended offset encoded in
imm[11:0], where imm[4:0] shall equal 0b00000. The effective address is
translated into a corresponding physical address by the appropriate translation
mechanisms.
Cache-block prefetch instructions are encoded as ORI instructions with rd equal
to 0b00000; however, for the purposes of effective address calculation, this
field is also interpreted as imm[4:0] like a store instruction.
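Under the encoding described above, a prefetch instruction can be assembled by placing offset[11:5] in bits 31:25, a type code in bits 24:20, and rs1 in bits 19:15, with the ORI funct3 (0b110), rd=x0, and the OP-IMM opcode. The sketch below uses the type codes 0b00000 (prefetch.i), 0b00001 (prefetch.r), and 0b00011 (prefetch.w); verify these against the ratified opcode tables before relying on them:

```c
#include <assert.h>
#include <stdint.h>

/* Type codes carried in bits 24:20 (the rs2 field position). */
enum { PF_I = 0x00, PF_R = 0x01, PF_W = 0x03 };

/* Assemble a Zicbop prefetch: OP-IMM opcode (0b0010011), ORI funct3
 * (0b110), rd=x0, offset[11:5] in bits 31:25, type code in bits
 * 24:20. The offset must have its low 5 bits clear. */
static uint32_t encode_prefetch(uint32_t type, int32_t offset, uint32_t rs1)
{
    assert((offset & 0x1f) == 0);
    uint32_t imm11_5 = ((uint32_t)offset >> 5) & 0x7f;
    return (imm11_5 << 25) | (type << 20) | (rs1 << 15)
         | (0x6u << 12) | (0x0u << 7) | 0x13u;
}
```

For example, prefetch.r 0(a0) would assemble to 0x00156013 under these assumptions.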
The following instructions comprise the Zicbop extension:
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| ✓ | ✓ | prefetch.i offset(base) | insns-prefetch_i |
| ✓ | ✓ | prefetch.r offset(base) | insns-prefetch_r |
| ✓ | ✓ | prefetch.w offset(base) | insns-prefetch_w |
6.19.7 Instructions
6.19.7.1 cbo.clean
Synopsis Perform a clean operation on a cache block
Mnemonic cbo.clean offset(base)
Encoding
Description A cbo.clean instruction performs a clean operation on the cache block whose effective address is the base address specified in rs1. The offset operand may be omitted; otherwise, any expression that computes the offset shall evaluate to zero. The instruction operates on the set of coherent caches accessed by the agent executing the instruction.
When executing a cbo.clean instruction, an implementation may instead perform a flush operation, since the result of that operation is indistinguishable from the sequence of performing a clean operation just before deallocating all cached copies in the set of coherent caches.
6.19.7.2 cbo.flush
Synopsis Perform a flush operation on a cache block
Mnemonic cbo.flush offset(base)
Encoding
Description A cbo.flush instruction performs a flush operation on the cache block that contains the address specified in rs1. It is not required that rs1 is aligned to the size of a cache block. On faults, the faulting virtual address is considered to be the value in rs1, rather than the base address of the cache block. The instruction operates on the set of coherent caches accessed by the agent executing the instruction.
The assembly offset operand may be omitted; if it is present, any expression that computes the offset shall evaluate to zero.
6.19.7.3 cbo.inval
Synopsis Perform an invalidate operation on a cache block
Mnemonic cbo.inval offset(base)
Encoding
Description A cbo.inval instruction performs an invalidate operation on the cache block that contains the address specified in rs1. It is not required that rs1 is aligned to the size of a cache block. On faults, the faulting virtual address is considered to be the value in rs1, rather than the base address of the cache block. The instruction operates on the set of coherent caches accessed by the agent executing the instruction.
Depending on CSR programming, the instruction may perform a flush operation instead of an invalidate operation.
The assembly offset operand may be omitted; if it is present, any expression that computes the offset shall evaluate to zero.
When executing a cbo.inval instruction, an implementation may instead perform a flush operation, since the result of that operation is indistinguishable from the sequence of performing a write transfer to memory just before performing an invalidate operation.
6.19.7.4 cbo.zero
Synopsis Store zeros to the full set of bytes corresponding to a cache block
Mnemonic cbo.zero offset(base)
Encoding
Description A cbo.zero instruction performs stores of zeros to the full set of bytes corresponding to the cache block that contains the address specified in rs1. It is not required that rs1 is aligned to the size of a cache block. On faults, the faulting virtual address is considered to be the value in rs1, rather than the base address of the cache block. An implementation may or may not update the entire set of bytes atomically.
The assembly offset operand may be omitted; if present, any expression that computes the offset shall evaluate to zero.
6.19.7.5 prefetch.i
Synopsis Provide a HINT to hardware that a cache block is likely to be accessed by an instruction fetch in the near future
Mnemonic prefetch.i offset(base)
Encoding
Description A prefetch.i instruction indicates to hardware that the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00000, is likely to be accessed by an instruction fetch in the near future.
An implementation may opt to cache a copy of the cache block in a cache accessed by an instruction fetch in order to improve memory access latency, but this behavior is not required.
6.19.7.6 prefetch.r
Synopsis Provide a HINT to hardware that a cache block is likely to be accessed by a data read in the near future
Mnemonic prefetch.r offset(base)
Encoding
Description A prefetch.r instruction indicates to hardware that the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00000, is likely to be accessed by a data read (i.e. load) in the near future.
An implementation may opt to cache a copy of the cache block in a cache accessed by a data read in order to improve memory access latency, but this behavior is not required.
6.19.7.7 prefetch.w
Synopsis Provide a HINT to hardware that a cache block is likely to be accessed by a data write in the near future
Mnemonic prefetch.w offset(base)
Encoding
Description A prefetch.w instruction indicates to hardware that the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00000, is likely to be accessed by a data write (i.e. store) in the near future.
An implementation may opt to cache a copy of the cache block in a cache accessed by a data write in order to improve memory access latency, but this behavior is not required.