6 Scalar Integer Extensions

note

This chapter is currently being restructured. Its contents are normative, but the presentation might appear disjoint.

This chapter describes the scalar integer extensions. Most of these extensions are named with the prefix "Zi"; the exceptions are the integer multiplication and division extensions, which are named "M" or prefixed with "Zm".

6.1 "Zifencei" Extension for Instruction-Fetch Fence, Version 2.0

This chapter defines the "Zifencei" extension, which includes the FENCE.I instruction that provides explicit synchronization between writes to instruction memory and instruction fetches on the same hart. Currently, this instruction is the only standard mechanism to ensure that stores visible to a hart will also be visible to its instruction fetches.

note

We considered but did not include a "store instruction word" instruction as in [17]. JIT compilers may generate a large trace of instructions before a single FENCE.I, and amortize any instruction cache snooping/invalidation overhead by writing translated instructions to memory regions that are known not to reside in the I-cache.


note

The FENCE.I instruction was designed to support a wide variety of implementations. A simple implementation can flush the local instruction cache and the instruction pipeline when the FENCE.I is executed. A more complex implementation might snoop the instruction (data) cache on every data (instruction) cache miss, or use an inclusive unified private L2 cache to invalidate lines from the primary instruction cache when they are being written by a local store instruction. If instruction and data caches are kept coherent in this way, or if the memory system consists of only uncached RAMs, then just the fetch pipeline needs to be flushed at a FENCE.I.

The FENCE.I instruction was previously part of the base I instruction set. Two main issues motivated moving it out of the mandatory base, although at the time of writing it remains the only standard method for maintaining instruction-fetch coherence.

First, it has been recognized that on some systems, FENCE.I will be expensive to implement and alternate mechanisms are being discussed in the memory model task group. In particular, for designs that have an incoherent instruction cache and an incoherent data cache, or where the instruction cache refill does not snoop a coherent data cache, both caches must be completely flushed when a FENCE.I instruction is encountered. This problem is exacerbated when there are multiple levels of I and D cache in front of a unified cache or outer memory system.

Second, the instruction is not powerful enough to make available at user level in a Unix-like operating system environment. The FENCE.I only synchronizes the local hart, and the OS can reschedule the user hart to a different physical hart after the FENCE.I. This would require the OS to execute an additional FENCE.I as part of every context migration. For this reason, the standard Linux ABI has removed FENCE.I from user-level and now requires a system call to maintain instruction-fetch coherence, which allows the OS to minimize the number of FENCE.I executions required on current systems and provides forward-compatibility with future improved instruction-fetch coherence mechanisms.

Future approaches to instruction-fetch coherence under discussion include providing more restricted versions of FENCE.I that only target a given address specified in rs1, and/or allowing software to use an ABI that relies on machine-mode cache-maintenance operations.


The FENCE.I instruction is used to synchronize the instruction and data streams. RISC-V does not guarantee that stores to instruction memory will be made visible to instruction fetches on a RISC-V hart until that hart executes a FENCE.I instruction. A FENCE.I instruction ensures that a subsequent instruction fetch on a RISC-V hart will see any previous data stores already visible to the same RISC-V hart. FENCE.I does not ensure that other RISC-V harts' instruction fetches will observe the local hart’s stores in a multiprocessor system. To make a store to instruction memory visible to all RISC-V harts, the writing hart also has to execute a data FENCE before requesting that all remote RISC-V harts execute a FENCE.I.

A FENCE.I instruction orders all explicit memory accesses that precede the FENCE.I in program order before all instruction fetches that follow the FENCE.I in program order.
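As a concrete illustration of this store-then-synchronize pattern, hosted C toolchains expose it through the GCC/Clang builtin __builtin___clear_cache, which on RISC-V typically lowers to FENCE.I (or, under Linux, to the instruction-cache-flush system call discussed above). This is an illustrative sketch, not spec text; code_buf is only a stand-in for real executable memory:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative JIT-style sequence: write instructions as data, then make
 * the stores visible to this hart's instruction fetches.  A real JIT
 * would emit into executable memory obtained from the OS. */
static uint8_t code_buf[16];

void emit_and_sync(const uint8_t *insns, size_t len)
{
    memcpy(code_buf, insns, len);   /* stores to "instruction memory" */
    /* Synchronize instruction fetch with the stores above.  On RISC-V
     * this typically lowers to FENCE.I or an icache-flush system call;
     * on targets with coherent instruction caches it may be a no-op. */
    __builtin___clear_cache((char *)code_buf, (char *)code_buf + len);
}
```

A JIT would batch many such stores and issue a single synchronization at the end, amortizing its cost as the note below describes.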

note

In the following litmus test, for example, the outcome a0=1, a1=0 on the consumer hart is forbidden, assuming little-endian RV32IC harts:

Initially, flag = 0.

Producer hart:

la t0, patch_me
li t1, 0x4585
sh t1, (t0)    # patch_me := c.li a1, 1
fence w, w     # order flag write
la t0, flag
li t1, 1
sw t1, (t0)    # flag := 1

Consumer hart:

la t2, flag
lw a0, (t2)
fence.i
patch_me:
c.li a1, 0

Note that this example is only meant to illustrate the aforementioned ordering property. In a realistic producer-consumer code-generation scheme, the consumer would loop until flag becomes 1 before executing the FENCE.I instruction.

An instruction fetch is always ordered before any explicit memory accesses that instruction gives rise to.

The unused fields in the FENCE.I instruction, funct12, rs1, and rd, are reserved for finer-grain fences in future extensions. For forward compatibility, base implementations shall ignore these fields, and standard software shall zero these fields.

note

Because FENCE.I only orders stores with a hart’s own instruction fetches, application code should only rely upon FENCE.I if the application thread will not be migrated to a different hart. The EEI can provide mechanisms for efficient multiprocessor instruction-stream synchronization.

6.2 "Zicsr" Extension for Control and Status Register (CSR) Instructions, Version 2.0

RISC-V defines a separate address space of 4096 Control and Status registers associated with each hart. This chapter defines the full set of CSR instructions that operate on these CSRs.

note

While CSRs are primarily used by the privileged architecture, there are several uses in unprivileged code including for counters and timers, and for floating-point status.

The counters and timers are no longer considered mandatory parts of the standard base ISAs, and so the CSR instructions required to access them have been moved out of the base ISA chapters into this separate chapter.

6.2.1 CSR Instructions

All CSR instructions atomically read-modify-write a single CSR, whose CSR specifier is encoded in the 12-bit csr field of the instruction, held in bits 31-20. The immediate forms use a 5-bit zero-extended immediate encoded in the rs1 field.
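The field layout can be sketched in C. The decoder below and the example word used to exercise it are illustrative; 0x300312f3 encodes csrrw t0, mstatus, t1, with mstatus at CSR address 0x300:

```c
#include <stdint.h>

/* Field extraction for the SYSTEM-opcode CSR instruction format:
 * csr at bits 31:20, rs1/uimm at 19:15, funct3 at 14:12, rd at 11:7. */
static inline uint32_t csr_field(uint32_t insn) { return insn >> 20; }
static inline uint32_t rs1_field(uint32_t insn) { return (insn >> 15) & 0x1f; }
static inline uint32_t funct3_field(uint32_t insn) { return (insn >> 12) & 0x7; }
static inline uint32_t rd_field(uint32_t insn)  { return (insn >> 7) & 0x1f; }
```

For the immediate forms, rs1_field returns the 5-bit uimm directly, since it occupies the same bit positions.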


The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in the CSRs and integer registers. CSRRW reads the old value of the CSR, zero-extends the value to XLEN bits, then writes it to integer register rd. The initial value in rs1 is written to the CSR. If rd=x0, then the instruction shall not read the CSR and shall not cause any of the side effects that might occur on a CSR read.

The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value of the CSR, zero-extends the value to XLEN bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be set in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be set in the CSR, if that CSR bit is writable.

The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the value of the CSR, zero-extends the value to XLEN bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be cleared in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be cleared in the CSR, if that CSR bit is writable.

note

Since CSRRS and CSRRC perform a read-modify-write operation, any bits that read as a different value to their underlying value may be modified by these instructions even if the corresponding bit is not set in rs1. For example, pmpaddr_n_[G-1] may have an underlying value of 1 but read as 0. Executing CSRRC or CSRRS to modify a different bit will cause 0 to be read from pmpaddr_n_[G-1] and then written back, updating the underlying value to 0.

For both CSRRS and CSRRC, if rs1=x0, then the instruction will not write to the CSR at all, and so shall not cause any of the side effects that might otherwise occur on a CSR write, nor raise illegal-instruction exceptions on accesses to read-only CSRs. Both CSRRS and CSRRC always read the addressed CSR and cause any read side effects regardless of rs1 and rd fields. Note that if rs1 specifies a register other than x0, and that register holds a zero value, the instruction will not action any attendant per-field side effects, but will action any side effects caused by writing to the entire CSR.

A CSRRW with rs1=x0 will attempt to write zero to the destination CSR.

The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and CSRRC respectively, except they update the CSR using an XLEN-bit value obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field encoded in the rs1 field instead of a value from an integer register. For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then these instructions will not write to the CSR, and shall not cause any of the side effects that might otherwise occur on a CSR write, nor raise illegal-instruction exceptions on accesses to read-only CSRs. For CSRRWI, if rd=x0, then the instruction shall not read the CSR and shall not cause any of the side effects that might occur on a CSR read. Both CSRRSI and CSRRCI will always read the CSR and cause any read side effects regardless of rd and rs1 fields.

Register operand

Instruction      rd is x0   rs1 is x0   Reads CSR   Writes CSR
CSRRW            Yes        -           No          Yes
CSRRW            No         -           Yes         Yes
CSRRS/CSRRC      -          Yes         Yes         No
CSRRS/CSRRC      -          No          Yes         Yes

Immediate operand

Instruction      rd is x0   uimm=0      Reads CSR   Writes CSR
CSRRWI           Yes        -           No          Yes
CSRRWI           No         -           Yes         Yes
CSRRSI/CSRRCI    -          Yes         Yes         No
CSRRSI/CSRRCI    -          No          Yes         Yes

The tables above summarize the behavior of the CSR instructions with respect to whether they read and/or write the CSR.
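These read/write rules can be condensed into a small behavioral model. This is an illustrative sketch, not spec text: the CSR is a plain variable, and side effects are represented only by the returned read/write flags.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { CSRRW, CSRRS, CSRRC } csr_op;

/* Simplified model of one CSR access.  'src' is the rs1 value (or the
 * zero-extended uimm); 'src_is_zero_reg' means rs1 is x0 (or uimm=0);
 * 'rd_is_x0' means the destination register is x0.  Returns the value
 * that would be written to the CSR, and reports via *reads/*writes
 * whether the CSR is read/written, matching the tables above. */
uint64_t csr_access(csr_op op, uint64_t csr, uint64_t src,
                    bool src_is_zero_reg, bool rd_is_x0,
                    bool *reads, bool *writes)
{
    switch (op) {
    case CSRRW:
        *reads  = !rd_is_x0;          /* rd=x0 suppresses the read */
        *writes = true;               /* CSRRW always writes */
        return src;
    case CSRRS:
        *reads  = true;               /* always reads */
        *writes = !src_is_zero_reg;   /* rs1=x0 (or uimm=0) suppresses the write */
        return csr | src;
    case CSRRC:
        *reads  = true;
        *writes = !src_is_zero_reg;
        return csr & ~src;
    }
    return csr;
}
```

Note that a register other than x0 holding zero still counts as a write for CSRRS/CSRRC (src_is_zero_reg is false), even though the written value equals the one read, mirroring the rule stated above.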

In addition to side effects that occur as a consequence of reading or writing a CSR, individual fields within a CSR might have side effects when written. The CSRRW[I] instructions action side effects for all such fields within the written CSR. The CSRRS[I] and CSRRC[I] instructions only action side effects for fields for which the rs1 or uimm argument has at least one bit set corresponding to that field.

note

As of this writing, no standard CSRs have side effects on field writes. Hence, whether a standard CSR access has any side effects can be determined solely from the opcode.

Defining CSRs with side effects on field writes is not recommended.

For any event or consequence that occurs due to a CSR having a particular value, if a write to the CSR gives it that value, the resulting event or consequence is said to be an indirect effect of the write. Indirect effects of a CSR write are not considered by the RISC-V ISA to be side effects of that write.

note

An example of side effects for CSR accesses would be if reading from a specific CSR causes a light bulb to turn on, while writing an odd value to the same CSR causes the light to turn off. Assume writing an even value has no effect. In this case, both the read and write have side effects controlling whether the bulb is lit, as this condition is not determined solely from the CSR value. (Note that after writing an odd value to the CSR to turn off the light, then reading to turn the light on, writing again the same odd value causes the light to turn off again. Hence, on the last write, it is not a change in the CSR value that turns off the light.)

On the other hand, if a bulb is rigged to light whenever the value of a particular CSR is odd, then turning the light on and off is not considered a side effect of writing to the CSR but merely an indirect effect of such writes.

More concretely, the RISC-V privileged architecture defined in Volume II specifies that certain combinations of CSR values cause a trap to occur. When an explicit write to a CSR creates the conditions that trigger the trap, the trap is not considered a side effect of the write but merely an indirect effect.

Standard CSRs do not have any side effects on reads. Standard CSRs may have side effects on writes. Custom extensions might add CSRs for which accesses have side effects on either reads or writes.

Some CSRs, such as the instructions-retired counter, instret, may be modified as side effects of instruction execution. In these cases, if a CSR access instruction reads a CSR, it reads the value prior to the execution of the instruction. If a CSR access instruction writes such a CSR, the explicit write is done instead of the update from the side effect. In particular, a value written to instret by one instruction will be the value read by the following instruction.

The assembler pseudoinstruction to read a CSR, CSRR rd, csr, is encoded as CSRRS rd, csr, x0. The assembler pseudoinstruction to write a CSR, CSRW csr, rs1, is encoded as CSRRW x0, csr, rs1, while CSRWI csr, uimm, is encoded as CSRRWI x0, csr, uimm.

Further assembler pseudoinstructions are defined to set and clear bits in the CSR when the old value is not required: CSRS/CSRC csr, rs1; CSRSI/CSRCI csr, uimm.

6.2.1.1 CSR Access Ordering

Each RISC-V hart normally observes its own CSR accesses, including its implicit CSR accesses, as performed in program order. In particular, unless specified otherwise, a CSR access is performed after the execution of any prior instructions in program order whose behavior modifies or is modified by the CSR state and before the execution of any subsequent instructions in program order whose behavior modifies or is modified by the CSR state. Furthermore, an explicit CSR read returns the CSR state before the execution of the instruction, while an explicit CSR write suppresses and overrides any implicit writes or modifications to the same CSR by the same instruction.

Likewise, any side effects from an explicit CSR access are normally observed to occur synchronously in program order. Unless specified otherwise, the full consequences of any such side effects are observable by the very next instruction, and no consequences may be observed out-of-order by preceding instructions. (Note the distinction made earlier between side effects and indirect effects of CSR writes.)

For the RVWMO memory consistency model, CSR accesses are weakly ordered by default, so other harts or devices may observe CSR accesses in an order different from program order. In addition, CSR accesses are not ordered with respect to explicit memory accesses, unless a CSR access modifies the execution behavior of the instruction that performs the explicit memory access, or unless a CSR access and an explicit memory access are ordered by either the syntactic dependencies defined by the memory model or the ordering requirements defined for memory-ordering PMAs. To enforce ordering in all other cases, software should execute a FENCE instruction between the relevant accesses. For the purposes of the FENCE instruction, CSR read accesses are classified as device input (I), and CSR write accesses are classified as device output (O).

note

Informally, the CSR space acts as a weakly ordered memory-mapped I/O region, as defined by the memory-ordering PMAs. As a result, the order of CSR accesses with respect to all other accesses is constrained by the same mechanisms that constrain the order of memory-mapped I/O accesses to such a region.

These CSR-ordering constraints are imposed to support ordering main memory and memory-mapped I/O accesses with respect to CSR accesses that are visible to, or affected by, devices or other harts. Examples include the time, cycle, and mcycle CSRs, in addition to CSRs that reflect pending interrupts, like mip and sip. Note that implicit reads of such CSRs (e.g., taking an interrupt because of a change in mip) are also ordered as device input.

Most CSRs (including, e.g., the fcsr) are not visible to other harts; their accesses can be freely reordered in the global memory order with respect to FENCE instructions without violating this specification.

The hardware platform may define that accesses to certain CSRs are strongly ordered, as defined by the memory-ordering PMAs. Accesses to strongly ordered CSRs have stronger ordering constraints with respect to accesses both to weakly ordered CSRs and to memory-mapped I/O regions.

note

The rules for the reordering of CSR accesses in the global memory order should probably be moved to the chapter on the RVWMO memory consistency model.

6.3 "Zicntr" Extension for Base Counters and Timers

RISC-V ISAs provide a set of up to thirty-two 64-bit performance counters and timers that are accessible via unprivileged XLEN-bit read-only CSR registers 0xC00-0xC1F (when XLEN=32, the upper 32 bits are accessed via CSR registers 0xC80-0xC9F). These counters are divided between the Zicntr and Zihpm extensions.

The Zicntr standard extension comprises the first three of these counters (CYCLE, TIME, and INSTRET), which have dedicated functions (cycle count, real-time clock, and instructions retired, respectively). The Zicntr extension depends on the Zicsr extension.

note

We recommend provision of these basic counters in implementations as they are essential for basic performance analysis, adaptive and dynamic optimization, and to allow an application to work with real-time streams. Additional counters in the separate Zihpm extension can help diagnose performance problems and these should be made accessible from user-level application code with low overhead.

Some execution environments might prohibit access to counters, for example, to impede timing side-channel attacks.


For base ISAs with XLEN≥64, CSR instructions can access the full 64-bit CSRs directly. In particular, the RDCYCLE, RDTIME, and RDINSTRET pseudoinstructions read the full 64 bits of the cycle, time, and instret counters.

note

The counter pseudoinstructions are mapped to the read-only csrrs rd, counter, x0 canonical form, but the other read-only CSR instruction forms (based on CSRRC/CSRRSI/CSRRCI) are also legal ways to read these CSRs.

For base ISAs with XLEN=32, the Zicntr extension enables the three 64-bit read-only counters to be accessed in 32-bit pieces. The RDCYCLE, RDTIME, and RDINSTRET pseudoinstructions provide the lower 32 bits, and the RDCYCLEH, RDTIMEH, and RDINSTRETH pseudoinstructions provide the upper 32 bits of the respective counters.

note

We required the counters be 64 bits wide, even when XLEN=32, as otherwise it is very difficult for software to determine if values have overflowed. The sample code given below shows how the full 64-bit width value can be safely read using the individual 32-bit width pseudoinstructions.

The RDCYCLE pseudoinstruction reads the low XLEN bits of the cycle CSR which holds a count of the number of clock cycles executed by the processor core on which the hart is running from an arbitrary start time in the past. RDCYCLEH is only present when XLEN=32 and reads bits 63-32 of the same cycle counter. The underlying 64-bit counter should never overflow in practice. The rate at which the cycle counter advances will depend on the implementation and operating environment. The execution environment should provide a means to determine the current rate (cycles/second) at which the cycle counter is incrementing.

note

RDCYCLE is intended to return the number of cycles executed by the processor core, not the hart. Precisely defining what is a "core" is difficult given some implementation choices (e.g., AMD Bulldozer). Precisely defining what is a "clock cycle" is also difficult given the range of implementations (including software emulations), but the intent is that RDCYCLE is used for performance monitoring along with the other performance counters. In particular, where there is one hart/core, one would expect cycle-count/instructions-retired to measure CPI for a hart.

Cores don’t have to be exposed to software at all, and an implementer might choose to pretend multiple harts on one physical core are running on separate cores with one hart/core, and provide separate cycle counters for each hart. This might make sense in a simple barrel processor (e.g., CDC 6600 peripheral processors) where inter-hart timing interactions are non-existent or minimal.

Where there is more than one hart/core and dynamic multithreading, it is not generally possible to separate out cycles per hart (especially with SMT). It might be possible to define a separate performance counter that tried to capture the number of cycles a particular hart was running, but this definition would have to be very fuzzy to cover all the possible threading implementations. For example, should we only count cycles for which any instruction was issued to execution for this hart, and/or cycles any instruction retired, or include cycles this hart was occupying machine resources but couldn’t execute due to stalls while other harts went into execution? Likely, "all of the above" would be needed to have understandable performance stats. This complexity of defining a per-hart cycle count, and also the need in any case for a total per-core cycle count when tuning multithreaded code led to just standardizing the per-core cycle counter, which also happens to work well for the common single hart/core case.

Standardizing what happens during "sleep" is not practical given that what "sleep" means is not standardized across execution environments, but if the entire core is paused (entirely clock-gated or powered-down in deep sleep), then it is not executing clock cycles, and the cycle count shouldn’t be increasing per the spec. There are many details, e.g., whether clock cycles required to reset a processor after waking up from a power-down event should be counted, and these are considered execution-environment-specific details.

Even though there is no precise definition that works for all platforms, this is still a useful facility for most platforms, and an imprecise, common, "usually correct" standard here is better than no standard. The intent of RDCYCLE was primarily performance monitoring/tuning, and the specification was written with that goal in mind.

The RDTIME pseudoinstruction reads the low XLEN bits of the "time" CSR, which counts wall-clock real time that has passed from an arbitrary start time in the past. RDTIMEH is only present when XLEN=32 and reads bits 63-32 of the same real-time counter. The underlying 64-bit counter increments by one with each tick of the real-time clock, and, for realistic real-time clock frequencies, should never overflow in practice. The execution environment should provide a means of determining the period of a counter tick (seconds/tick). The period should be constant within a small error bound. The environment should provide a means to determine the accuracy of the clock (i.e., the maximum relative error between the nominal and actual real-time clock periods).

note

On some simple platforms, cycle count might represent a valid implementation of RDTIME, in which case RDTIME and RDCYCLE may return the same result.

It is difficult to provide a strict mandate on clock period given the wide variety of possible implementation platforms. The maximum error bound should be set based on the requirements of the platform.

The real-time clocks of all harts must be synchronized to within one tick of the real-time clock.

note

As with other architectural mandates, it suffices to appear "as if" harts are synchronized to within one tick of the real-time clock, i.e., software is unable to observe that there is a greater delta between the real-time clock values observed on two harts.

If, for example, the real-time clock increments at a frequency of 1 GHz, then all harts must appear to be synchronized to within 1 nsec. But it is also acceptable for this example implementation to only update the real-time clock at, say, a frequency of 100 MHz with increments of 10 ticks. As long as software cannot observe this seeming violation of the above synchronization requirement, and software always observes time across harts to be monotonically nondecreasing, then this implementation is compliant.

A platform spec may then, for example, specify an apparent real-time clock tick frequency (e.g. 1 GHz) and also a minimum update frequency (e.g. 100 MHz) at which updated time values are guaranteed to be observable by software. Software may read time more frequently, but it should only observe monotonically nondecreasing values and it should observe a new value at least once every 10 ns (corresponding to the 100 MHz update frequency in this example).

The RDINSTRET pseudoinstruction reads the low XLEN bits of the instret CSR, which counts the number of instructions retired by this hart from some arbitrary start point in the past. RDINSTRETH is only present when XLEN=32 and reads bits 63-32 of the same instruction counter. The underlying 64-bit counter should never overflow in practice.

note

Instructions that cause synchronous exceptions, including ECALL and EBREAK, are not considered to retire and hence do not increment the instret CSR.

The following code sequence will read a valid 64-bit cycle counter value into x3:x2, even if the counter overflows its lower half between reading its upper and lower halves.

again:
rdcycleh x3
rdcycle x2
rdcycleh x4
bne x3, x4, again
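The same sequence can be modeled in C. The rdcycle/rdcycleh helpers and the fake_counter variable below are stand-ins for the real CSR reads:

```c
#include <stdint.h>

/* Hypothetical 32-bit CSR reads backing a 64-bit counter; here they
 * sample a global that the environment would increment asynchronously. */
extern volatile uint64_t fake_counter;
static uint32_t rdcycle(void)  { return (uint32_t)fake_counter; }
static uint32_t rdcycleh(void) { return (uint32_t)(fake_counter >> 32); }

/* C analogue of the again:/bne loop: re-read the high half until it is
 * stable, guaranteeing a consistent 64-bit snapshot even if the low
 * half overflows between the two 32-bit reads. */
uint64_t read_cycle64(void)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = rdcycleh();
        lo  = rdcycle();
        hi2 = rdcycleh();
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}

/* A value just below a low-half overflow, for demonstration. */
volatile uint64_t fake_counter = 0x00000001FFFFFFFFull;
```

The retry fires only in the rare case that the low half wraps between the two high-half reads, so the expected cost is three CSR reads.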

6.4 "Zihpm" Extension for Hardware Performance Counters

The Zihpm extension comprises up to 29 additional unprivileged 64-bit hardware performance counters, hpmcounter3-hpmcounter31. When XLEN=32, the upper 32 bits of these performance counters are accessible via additional CSRs hpmcounter3h-hpmcounter31h. The Zihpm extension depends on the Zicsr extension.

note

In some applications, it is important to be able to read multiple counters at the same instant in time. When run under a multitasking environment, a user thread can suffer a context switch while attempting to read the counters. One solution is for the user thread to read the real-time counter before and after reading the other counters to determine if a context switch occurred in the middle of the sequence, in which case the reads can be retried. We considered adding output latches to allow a user thread to snapshot the counter values atomically, but this would increase the size of the user context, especially for implementations with a richer set of counters.
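A sketch of that retry scheme, with simulated counters standing in for the real CSRs and an assumed bound MAX_GAP on how many time ticks an uninterrupted read sequence can take:

```c
#include <stdint.h>

#define MAX_GAP 100  /* assumed bound (in ticks) on an uninterrupted sequence */

/* Stand-ins for RDTIME and two performance-counter reads; a context
 * switch in a real system would advance sim_time by far more than
 * MAX_GAP, forcing a retry. */
extern volatile uint64_t sim_time, sim_ctr_a, sim_ctr_b;
static uint64_t rdtime(void) { return sim_time; }

/* Read two counters as a consistent group by bracketing them with
 * RDTIME reads and retrying if too much time elapsed in between. */
void read_counters(uint64_t *a, uint64_t *b)
{
    uint64_t t0, t1;
    do {
        t0 = rdtime();
        *a = sim_ctr_a;
        *b = sim_ctr_b;
        t1 = rdtime();
    } while (t1 - t0 > MAX_GAP);
}

volatile uint64_t sim_time = 1000, sim_ctr_a = 42, sim_ctr_b = 7;
```

The appropriate MAX_GAP depends on the platform's tick period and worst-case uninterrupted execution time; it is purely an illustrative parameter here.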

The implemented number and width of these additional counters, and the set of events they count, are platform-specific. Accessing an unimplemented counter may cause an illegal-instruction exception or may return a constant value. A counter whose event selection is misconfigured may likewise return a constant value.

The execution environment should provide a means to determine the number and width of the implemented counters, and an interface to configure the events to be counted by each counter.

note

For execution environments implemented on RISC-V privileged platforms, the privileged architecture manual describes privileged CSRs controlling access by lower privileged modes to these counters, and to set the events to be counted.

Alternative execution environments (e.g., user-level-only software performance models) may provide alternative mechanisms to configure the events counted by the performance counters.

It would be useful to eventually standardize event settings to count ISA-level metrics, such as the number of floating-point instructions executed for example, and possibly a few common microarchitectural metrics, such as "L1 instruction cache misses".

6.5 "M" Extension for Integer Multiplication and Division, Version 2.0

This chapter describes the standard integer multiplication and division instruction extension, which is named M and contains instructions that multiply or divide values held in two integer registers.

note

We separate integer multiply and divide out from the base to simplify low-end implementations, or for applications where integer multiply and divide operations are either infrequent or better handled in attached accelerators.

6.5.1 Multiplication Operations


MUL performs an XLEN-bit×XLEN-bit multiplication of rs1 by rs2 and places the lower XLEN bits of the product in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2×XLEN-bit product, for signed×signed, unsigned×unsigned, and signed rs1×unsigned rs2 multiplication, respectively. If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in the same order, and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.
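For illustration, the three high-half multiplies can be modeled for XLEN=32 using a 64-bit intermediate (hardware returns these bits directly; the helper names are ours, and the signed cases assume the usual arithmetic right shift of negative values):

```c
#include <stdint.h>

/* XLEN=32 models of MULH, MULHU, and MULHSU: compute the full 64-bit
 * product, then return its upper 32 bits. */
int32_t  mulh(int32_t a, int32_t b)    { return (int32_t)(((int64_t)a * b) >> 32); }
uint32_t mulhu(uint32_t a, uint32_t b) { return (uint32_t)(((uint64_t)a * b) >> 32); }
/* Signed rs1 x unsigned rs2: b widens value-preservingly to int64_t. */
int32_t  mulhsu(int32_t a, uint32_t b) { return (int32_t)(((int64_t)a * b) >> 32); }
```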

note

MULHSU is used in multi-word signed multiplication to multiply the most-significant word of the multiplicand (which contains the sign bit) with the less-significant words of the multiplier (which are unsigned).

MULW is an RV64 instruction that multiplies the lower 32 bits of the source registers, placing the sign extension of the lower 32 bits of the result into the destination register.

note

In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit product, but signed arguments must be proper 32-bit signed values, whereas unsigned arguments must have their upper 32 bits clear. If the arguments are not known to be sign- or zero-extended, an alternative is to shift both arguments left by 32 bits, then use MULH[[S]U].

6.5.2 Division Operations


DIV and DIVU perform signed and unsigned XLEN-bit by XLEN-bit integer division of rs1 by rs2, rounding towards zero. REM and REMU provide the remainder of the corresponding division operation. For REM, the sign of a nonzero result equals the sign of the dividend.

note

For both signed and unsigned division, except in the case of overflow, it holds that dividend = divisor × quotient + remainder.

If both the quotient and remainder are required from the same division, the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr, rs1, rs2 (rdq cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single divide operation instead of performing two separate divides.

DIVW and DIVUW are RV64 instructions that divide the lower 32 bits of rs1 by the lower 32 bits of rs2, treating them as signed and unsigned integers, placing the 32-bit quotient in rd, sign-extended to 64 bits. REMW and REMUW are RV64 instructions that provide the corresponding signed and unsigned remainder operations. Both REMW and REMUW always sign-extend the 32-bit result to 64 bits, including on a divide by zero.

The semantics for division by zero and division overflow are summarized in the table below. The quotient of division by zero has all bits set, and the remainder of division by zero equals the dividend. Signed division overflow occurs only when the most-negative integer is divided by −1. The quotient of a signed division with overflow is equal to the dividend, and the remainder is zero. Unsigned division overflow cannot occur.

Condition                Dividend    Divisor   DIVU[W]   REMU[W]   DIV[W]     REM[W]
Division by zero         x           0         2^L - 1   x         -1         x
Overflow (signed only)   -2^(L-1)    -1        -         -         -2^(L-1)   0

(L is the width of the operation in bits: XLEN for DIV[U] and REM[U], or 32 for DIV[U]W and REM[U]W.)
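These special cases can be captured in a short C model for XLEN=32. The rv_* helper names are ours; C's / and % already round toward zero, so only the exceptional cases need explicit handling:

```c
#include <stdint.h>

/* XLEN=32 model of DIV/REM special cases: the quotient of a divide by
 * zero has all bits set and the remainder equals the dividend; signed
 * overflow (INT32_MIN / -1) yields the dividend and remainder zero. */
int32_t rv_div(int32_t a, int32_t b)
{
    if (b == 0) return -1;                    /* all bits set */
    if (a == INT32_MIN && b == -1) return a;  /* signed overflow */
    return a / b;                             /* rounds toward zero */
}

int32_t rv_rem(int32_t a, int32_t b)
{
    if (b == 0) return a;
    if (a == INT32_MIN && b == -1) return 0;
    return a % b;                             /* sign follows dividend */
}

uint32_t rv_divu(uint32_t a, uint32_t b) { return b ? a / b : UINT32_MAX; }
uint32_t rv_remu(uint32_t a, uint32_t b) { return b ? a % b : a; }
```

Note that in C, dividing INT32_MIN by -1 is undefined behavior, which is why the model checks for overflow before dividing; the ISA, by contrast, defines the result.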
note

We considered raising exceptions on integer divide by zero, with these exceptions causing a trap in most execution environments. However, this would be the only arithmetic trap in the standard ISA (floating-point exceptions set flags and write default values, but do not cause traps) and would require language implementers to interact with the execution environment’s trap handlers for this case. Further, where language standards mandate that a divide-by-zero exception must cause an immediate control flow change, only a single branch instruction needs to be added to each divide operation, and this branch instruction can be inserted after the divide and should normally be very predictably not taken, adding little runtime overhead.

The value of all bits set is returned for both unsigned and signed divide by zero to simplify the divider circuitry. The value of all 1s is both the natural value to return for unsigned divide, representing the largest unsigned number, and also the natural result for simple unsigned divider implementations. Signed division is often implemented using an unsigned division circuit and specifying the same overflow result simplifies the hardware.

6.6 Zmmul Extension, Version 1.0

The Zmmul extension implements the multiplication subset of the M extension. It adds all of the instructions defined in m-mul, namely: MUL, MULH, MULHU, MULHSU, and (for RV64 only) MULW. The encodings are identical to those of the corresponding M-extension instructions. M implies Zmmul.
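As a non-normative sketch of the high-half multiply semantics included in Zmmul (assuming XLEN=32; function names are illustrative):

```python
def mulh(rs1, rs2, xlen=32):
    """High XLEN bits of the 2*XLEN-bit signed*signed product (MULH)."""
    def to_signed(x):
        x &= (1 << xlen) - 1
        return x - (1 << xlen) if x >> (xlen - 1) else x
    # Python's arithmetic right shift of the full product yields the
    # upper half of the two's-complement 2*XLEN-bit result.
    return to_signed((rs1 * rs2) >> xlen)

def mulhu(rs1, rs2, xlen=32):
    """High XLEN bits of the unsigned*unsigned product (MULHU)."""
    mask = (1 << xlen) - 1
    return ((rs1 & mask) * (rs2 & mask)) >> xlen
```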

note

The Zmmul extension enables low-cost implementations that require multiplication operations but not division. For many microcontroller applications, division operations are too infrequent to justify the cost of divider hardware. By contrast, multiplication operations are more frequent, making the cost of multiplier hardware more justifiable. Simple FPGA soft cores particularly benefit from eliminating division but retaining multiplication, since many FPGAs provide hardwired multipliers but require dividers be implemented in soft logic.

6.7 "Zicond" Extension for Integer Conditional Operations, Version 1.0.0

The Zicond extension defines two R-type instructions that support branchless conditional operations.

RV32 | RV64 | Mnemonic | Instruction
yes | yes | czero.eqz rd, rs1, rs2 | insns-czero-eqz
yes | yes | czero.nez rd, rs1, rs2 | insns-czero-nez

6.7.1 Instructions (in alphabetical order)

6.7.1.1 czero.eqz

Synopsis Moves zero to register rd if the condition rs2 is equal to zero; otherwise moves rs1 to rd.

Mnemonic czero.eqz rd, rs1, rs2

Encoding

[encoding diagram]

Description

If rs2 contains the value zero, this instruction writes the value zero to rd. Otherwise, this instruction copies the contents of rs1 to rd.

This instruction carries a syntactic dependency from both rs1 and rs2 to rd.

Furthermore, if the Zkt extension is implemented, this instruction’s timing is independent of the data values in rs1 and rs2.

SAIL code

let condition = X(rs2);
result : xlenbits = if (condition == zeros()) then zeros()
                    else X(rs1);
X(rd) = result;

6.7.1.2 czero.nez

Synopsis Moves zero to register rd if the condition rs2 is nonzero; otherwise moves rs1 to rd.

Mnemonic czero.nez rd, rs1, rs2

Encoding

[encoding diagram]

Description

If rs2 contains a nonzero value, this instruction writes the value zero to rd. Otherwise, this instruction copies the contents of rs1 to rd.

This instruction carries a syntactic dependency from both rs1 and rs2 to rd.

Furthermore, if the Zkt extension is implemented, this instruction’s timing is independent of the data values in rs1 and rs2.

SAIL code

let condition = X(rs2);
result : xlenbits = if (condition != zeros()) then zeros()
                    else X(rs1);
X(rd) = result;

6.7.2 Usage examples

The instructions from this extension can be used to construct sequences that perform conditional-arithmetic, conditional-bitwise-logical, and conditional-select operations.

6.7.2.1 Instruction sequences

Operation → instruction sequence (length):

  • rd = (rc == 0) ? (rs1 + rs2) : rs1 → czero.nez rd, rs2, rc; add rd, rs1, rd (2 insns)
  • rd = (rc != 0) ? (rs1 + rs2) : rs1 → czero.eqz rd, rs2, rc; add rd, rs1, rd (2 insns)
  • rd = (rc == 0) ? (rs1 - rs2) : rs1 → czero.nez rd, rs2, rc; sub rd, rs1, rd (2 insns)
  • rd = (rc != 0) ? (rs1 - rs2) : rs1 → czero.eqz rd, rs2, rc; sub rd, rs1, rd (2 insns)
  • rd = (rc == 0) ? (rs1 | rs2) : rs1 → czero.nez rd, rs2, rc; or rd, rs1, rd (2 insns)
  • rd = (rc != 0) ? (rs1 | rs2) : rs1 → czero.eqz rd, rs2, rc; or rd, rs1, rd (2 insns)
  • rd = (rc == 0) ? (rs1 ^ rs2) : rs1 → czero.nez rd, rs2, rc; xor rd, rs1, rd (2 insns)
  • rd = (rc != 0) ? (rs1 ^ rs2) : rs1 → czero.eqz rd, rs2, rc; xor rd, rs1, rd (2 insns)
  • rd = (rc == 0) ? (rs1 & rs2) : rs1 → and rd, rs1, rs2; czero.eqz rtmp, rs1, rc; or rd, rd, rtmp (3 insns, requires 1 temporary)
  • rd = (rc != 0) ? (rs1 & rs2) : rs1 → and rd, rs1, rs2; czero.nez rtmp, rs1, rc; or rd, rd, rtmp (3 insns, requires 1 temporary)
  • rd = (rc == 0) ? rs1 : rs2 → czero.nez rd, rs1, rc; czero.eqz rtmp, rs2, rc; add rd, rd, rtmp (3 insns, requires 1 temporary)
  • rd = (rc != 0) ? rs1 : rs2 → czero.eqz rd, rs1, rc; czero.nez rtmp, rs2, rc; add rd, rd, rtmp (3 insns, requires 1 temporary)
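The conditional-select idiom can be modeled directly (an illustrative sketch; register values are plain integers, and the helper names mirror the instruction mnemonics):

```python
def czero_eqz(rs1, rs2):
    # czero.eqz: rd = (rs2 == 0) ? 0 : rs1
    return 0 if rs2 == 0 else rs1

def czero_nez(rs1, rs2):
    # czero.nez: rd = (rs2 != 0) ? 0 : rs1
    return 0 if rs2 != 0 else rs1

def cond_select(rc, rs1, rs2):
    """rd = (rc != 0) ? rs1 : rs2, via the three-instruction sequence."""
    rd = czero_eqz(rs1, rc)    # rs1 if rc != 0, else 0
    rtmp = czero_nez(rs2, rc)  # rs2 if rc == 0, else 0
    # Exactly one of rd and rtmp is zero, so add selects the other.
    return rd + rtmp
```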

6.8 "Zilsd", "Zclsd" Extensions for Load/Store pair for RV32, Version 1.0

The Zilsd & Zclsd extensions provide load/store pair instructions for RV32, reusing the existing RV64 doubleword load/store instruction encodings.

Operands containing src for store instructions and dest for load instructions are held in aligned x-register pairs, i.e., register numbers must be even. Use of misaligned (odd-numbered) registers for these operands is reserved.

Regardless of endianness, the lower-numbered register holds the low-order bits, and the higher-numbered register holds the high-order bits: e.g., bits 31:0 of an operand in Zilsd might be held in register x14, with bits 63:32 of that operand held in x15.
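The register pairing can be sketched as follows (illustrative model of the rule above, independent of memory endianness; function names are hypothetical):

```python
def split_pair(value64):
    """Even (lower-numbered) register holds bits 31:0;
    odd register holds bits 63:32."""
    lo = value64 & 0xFFFFFFFF           # e.g., x14
    hi = (value64 >> 32) & 0xFFFFFFFF   # e.g., x15
    return lo, hi

def join_pair(lo, hi):
    """Reassemble a 64-bit operand from the register pair."""
    return (hi << 32) | lo
```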

6.8.1 Load/Store pair instructions (Zilsd)

The Zilsd extension adds the following RV32-only instructions:

RV32 | RV64 | Mnemonic | Instruction
yes | no | ld rd, offset(rs1) | insns-ld
yes | no | sd rs2, offset(rs1) | insns-sd

As the access size is 64-bit, accesses are only considered naturally aligned for effective addresses that are a multiple of 8. In this case, these instructions are guaranteed to not raise an address-misaligned exception. Even if naturally aligned, the memory access might not be performed atomically.

If the effective address is a multiple of 4, then each word access is required to be performed atomically.

The following table summarizes the required behavior:

Alignment | Word accesses guaranteed atomic? | Can cause misaligned trap?
8 B | yes | no
4 B, not 8 B | yes | yes
else | no | yes
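The classification above can be sketched as a small helper (illustrative only; returns the two table columns for a given effective address):

```python
def zilsd_alignment(effective_address):
    """Return (word_accesses_atomic, may_trap_misaligned) for an
    LD/SD effective address, per the alignment rules above."""
    if effective_address % 8 == 0:
        return (True, False)   # naturally aligned for the 64-bit access
    if effective_address % 4 == 0:
        return (True, True)    # each word access atomic; trap still possible
    return (False, True)
```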

To ensure resumable trap handling is possible for the load instructions, the base register must have its original value if a trap is taken. The other register in the pair can have been updated. This affects x2 for the stack pointer relative instruction and rs1 otherwise.

note

If an implementation performs a doubleword load access atomically and the register file implements write-back for even/odd register pairs, the mentioned atomicity requirements are inherently fulfilled. Otherwise, an implementation either needs to delay the write-back until the write can be performed atomically, or order sequential writes to the registers to ensure the requirement above is satisfied.

6.8.2 Compressed Load/Store pair instructions (Zclsd)

Zclsd depends on Zilsd and Zca. It has overlapping encodings with Zcf and is thus incompatible with Zcf.

Zclsd adds the following RV32-only instructions:

RV32RV64MnemonicInstruction
yesnoc.ldsp rd, offset(sp)insns-cldsp
yesnoc.sdsp rs2, offset(sp)insns-csdsp
yesnoc.ld rd', offset(rs1')insns-cld
yesnoc.sd rs2', offset(rs1')insns-csd

6.8.3 Use of x0 as operand

LD instructions with destination x0 are processed as any other load, but the result is discarded entirely and x1 is not written. For C.LDSP, usage of x0 as the destination is reserved.

If using x0 as src of SD or C.SDSP, the entire 64-bit operand is zero — i.e., register x1 is not accessed.

C.LD and C.SD instructions can only access registers x8-x15.

6.8.4 Exception Handling

For the purposes of RVWMO and exception handling, LD and SD instructions are considered to be misaligned loads and stores, with one additional constraint: an LD or SD instruction whose effective address is a multiple of 4 gives rise to two 4-byte memory operations.

note

This definition permits LD and SD instructions to give rise to exactly one memory access, regardless of alignment. If instructions with a 4-byte-aligned effective address are decomposed into two 32-bit operations, there is no constraint on the order in which the operations are performed, and each operation is guaranteed to be atomic. These decomposed sequences are interruptible. Exceptions might occur on subsequent operations, making the effects of previous operations within the same instruction visible.

note

Software should make no assumptions about the number or order of accesses these instructions might give rise to, beyond the 4-byte constraint mentioned above. For example, an interrupted store might overwrite the same bytes upon return from the interrupt handler.

6.8.5 Instructions

6.8.5.1 ld

Synopsis Load doubleword to even/odd register pair, 32-bit encoding

Mnemonic ld rd, offset(rs1)

Encoding (RV32)

[encoding diagram]

Description

Loads a 64-bit value into registers rd and rd+1. The effective address is obtained by adding register rs1 to the sign-extended 12-bit offset.

Included in: zilsd

6.8.5.2 sd

Synopsis Store doubleword from even/odd register pair, 32-bit encoding

Mnemonic sd rs2, offset(rs1)

Encoding (RV32)

[encoding diagram]

Description

Stores a 64-bit value from registers rs2 and rs2+1. The effective address is obtained by adding register rs1 to the sign-extended 12-bit offset.

Included in: zilsd

6.8.5.3 c.ldsp

Synopsis Stack-pointer based load doubleword to even/odd register pair, 16-bit encoding

Mnemonic c.ldsp rd, offset(sp)

Encoding (RV32)

[encoding diagram]

Description

Loads a stack-pointer-relative 64-bit value into registers rd and rd+1. It computes its effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to ld rd, offset(x2). C.LDSP is only valid when rd≠x0; the code points with rd=x0 are reserved.

Included in: zclsd

6.8.5.4 c.sdsp

Synopsis Stack-pointer based store doubleword from even/odd register pair, 16-bit encoding

Mnemonic c.sdsp rs2, offset(sp)

Encoding (RV32)

[encoding diagram]

Description

Stores a stack-pointer-relative 64-bit value from registers rs2 and rs2+1. It computes its effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to sd rs2, offset(x2).

Included in: zclsd

6.8.5.5 c.ld

Synopsis Load doubleword to even/odd register pair, 16-bit encoding

Mnemonic c.ld rd', offset(rs1')

Encoding (RV32)

[encoding diagram]

Description

Loads a 64-bit value into registers rd' and rd'+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'.

Included in: zclsd

6.8.5.6 c.sd

Synopsis Store doubleword from even/odd register pair, 16-bit encoding

Mnemonic c.sd rs2', offset(rs1')

Encoding (RV32)

[encoding diagram]

Description

Stores a 64-bit value from registers rs2' and rs2'+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'. It expands to sd rs2', offset(rs1').

Included in: zclsd

6.9 Ziccif Extension for Instruction-Fetch Atomicity, Version 1.0

note

This extension was ratified alongside the RVA20U64 profile. This chapter supplies an operational definition for the extension and adds expository material.

If the Ziccif extension is implemented, main memory regions with both the cacheability and coherence PMAs must support instruction fetch, and any instruction fetches of naturally aligned power-of-2 sizes of at most min(ILEN,XLEN) bits are atomic.

An implementation with the Ziccif extension fetches instructions in a manner equivalent to the following state machine.

  1. Let M be the smallest power of 2 such that M ≥ min(ILEN,XLEN)/8. Let N be the pc modulo M. Atomically fetch M - N bytes from memory at address pc. Let T be the running total of bytes fetched, initially M - N.
  2. If the T bytes fetched begin with a complete instruction of length L ≤ T, then execute that instruction, discard the remaining T - L bytes fetched, and go back to step 1, using the updated pc. Otherwise, atomically fetch M bytes from memory at address pc + T, increment T by M, and repeat step 2.
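The state machine above might be modeled as follows (an illustrative sketch, not normative: mem is a byte-addressable image, the "atomic" fetches are simple slices, and instruction length is derived from the standard length encoding, where the two low bits equal to 0b11 indicate a 32-bit instruction and anything else a 16-bit one; longer encodings are ignored):

```python
def fetch_instruction(mem, pc, ilen=32, xlen=32):
    """Sketch of the Ziccif fetch algorithm; returns the instruction bytes."""
    M = 1
    while M < min(ilen, xlen) // 8:     # smallest power of 2 >= min(ILEN,XLEN)/8
        M *= 2
    N = pc % M
    buf = bytes(mem[pc:pc + (M - N)])   # step 1: atomic fetch of M - N bytes
    T = M - N
    while True:
        # Standard length encoding: low two bits != 0b11 means 16-bit.
        L = 2 if (buf[0] & 0b11) != 0b11 else 4
        if L <= T:
            return buf[:L]              # execute; remaining T - L bytes discarded
        buf += bytes(mem[pc + T:pc + T + M])  # step 2: fetch M more bytes
        T += M
```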
note

The instruction-fetch atomicity rule supports concurrent code modification. When a hart modifies instruction memory and either it or another hart executes the modified instructions without first having executed a FENCE.I, the modifying hart should adhere to the following rules to ensure predictable behavior:

  • Modification stores must be single-copy atomic, hence must be naturally aligned.
  • The modified instruction must not span an aligned M-byte boundary, unless it is replaced with a shorter unconditional control transfer (e.g., c.ebreak or c.j) that does not itself span an M-byte boundary.
  • Modification stores must alter a complete instruction or complete instructions that do not collectively span an M-byte boundary, modulo the exception above that the first part of an instruction may be replaced with an unconditional control transfer instruction.
  • Modifications must not combine smaller instructions into a larger instruction but may convert a larger instruction to some number of smaller instructions.
  • Modified instruction memory must have the coherence PMA.

Other well-defined code-modification strategies exist, but these rules provide a safe harbor.

Note that the software modifying the code need not know the value of M. Because ILEN must be at least the width of the instruction being modified, a lower bound on M can be inferred from the instruction’s width and XLEN.

Memory protection and executability PMAs are applied only to bytes that are not discarded by this algorithm.

note

For example, if M=8, N=0, and the PMP granularity is 4 bytes, then it is valid to fetch a 4-byte instruction at pc, even if fetching from pc + 4 would have been disallowed by PMP.

For simplicity, implementations are likely to choose a PMP granularity no smaller than M.

6.10 Ziccrse Extension for Main Memory Reservability, Version 1.0

If the Ziccrse extension is implemented, then main memory regions with both the cacheability and coherence PMAs must support the RsrvEventual PMA.

6.11 Ziccamoa Extension for Main Memory Atomics, Version 1.0

If the Ziccamoa extension is implemented, then main memory regions with both the cacheability and coherence PMAs must support all atomics in the Zaamo extension.

6.12 Ziccamoc Extension for Main Memory Compare-and-Swap, Version 1.0

If the Ziccamoc extension is implemented, then main memory regions with both the cacheability and coherence PMAs must provide AMOCASQ-level PMA support.

6.13 Zicclsm Extension for Main Memory Misaligned Accesses, Version 1.0

If the Zicclsm extension is implemented, then misaligned loads and stores to main memory regions with both the cacheability and coherence PMAs must be supported.

note

This definition includes vector memory accesses. It does not include any instructions in the various Za* extensions.

note

Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.

6.14 Zic64b Extension for 64-byte Cache Blocks, Version 1.0

If the Zic64b extension is implemented, then cache blocks must be 64 bytes in size, naturally aligned in the address space.

6.15 "Zimop" Extension for May-Be-Operations, Version 1.0

This chapter defines the "Zimop" extension, which introduces the concept of instructions that may be operations (MOPs). MOPs are initially defined to simply write zero to x[rd], but are designed to be redefined by later extensions to perform some other action. The Zimop extension defines an encoding space for 40 MOPs.

note

It is sometimes desirable to define instruction-set extensions whose instructions, rather than raising illegal-instruction exceptions when the extension is not implemented, take no useful action (beyond writing x[rd]). For example, programs with control-flow integrity checks can execute correctly on implementations without the corresponding extension, provided the checks are simply ignored. Implementing these checks as MOPs allows the same programs to run on implementations with or without the corresponding extension.

Although similar in some respects to HINTs, MOPs cannot be encoded as HINTs, because unlike HINTs, MOPs are allowed to alter architectural state.

Because MOPs may be redefined by later extensions, standard software should not execute a MOP unless it is deliberately targeting an extension that has redefined that MOP.

The Zimop extension defines 32 MOP instructions named MOP.R.n, where n is an integer between 0 and 31, inclusive. Unless redefined by another extension, these instructions simply write 0 to x[rd]. Their encoding allows future extensions to define them to read x[rs1], as well as write x[rd].

[MOP.R.n encoding diagram]

The Zimop extension additionally defines 8 MOP instructions named MOP.RR.n, where n is an integer between 0 and 7, inclusive. Unless redefined by another extension, these instructions simply write 0 to x[rd]. Their encoding allows future extensions to define them to read x[rs1] and x[rs2], as well as write x[rd].

[MOP.RR.n encoding diagram]

note

The recommended assembly syntax for MOP.R.n is MOP.R.n rd, rs1, with any x-register specifier being valid for either argument. Similarly for MOP.RR.n, the recommended syntax is MOP.RR.n rd, rs1, rs2. The extension that redefines a MOP may define an alternate assembly mnemonic.

note

These MOPs are encoded in the SYSTEM major opcode in part because it is expected their behavior will be modulated by privileged CSR state.

note

These MOPs are defined to write zero to x[rd], rather than performing no operation, to simplify instruction decoding and to allow testing the presence of features by branching on the zeroness of the result.
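A behavioral sketch of the baseline (not-redefined) MOP, including the feature-detection idiom mentioned above (a hypothetical dictionary-based register-file model, illustrative only):

```python
def mop_baseline(regs, rd):
    """Unless redefined, a Zimop MOP writes 0 to x[rd]; x0 stays zero."""
    if rd != 0:       # x0 is hardwired to zero and never written
        regs[rd] = 0

# Feature detection: a redefining extension may leave a nonzero result,
# so software can branch on the zeroness of x[rd] after the MOP.
regs = {i: 0xDEADBEEF for i in range(32)}
regs[0] = 0
mop_baseline(regs, 5)
feature_present = regs[5] != 0   # False on a baseline implementation
```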

The MOPs defined in the Zimop extension do not carry a syntactic dependency from x[rs1] or x[rs2] to x[rd], though an extension that redefines the MOP may impose such a requirement.

note

Not carrying a syntactic dependency relieves straightforward implementations of reading x[rs1] and x[rs2].

6.15.1 "Zcmop" Compressed May-Be-Operations Extension, Version 1.0

This section defines the "Zcmop" extension, which introduces eight 16-bit MOP instructions named C.MOP.n, where n is an odd integer between 1 and 15, inclusive. C.MOP.n is encoded in the reserved encoding space corresponding to C.LUI xn, 0, as shown in norm:c-mop_enc. Unlike the MOPs defined in the Zimop extension, the C.MOP.n instructions are defined to not write any register. Their encoding allows future extensions to define them to read register x[n].

The Zcmop extension depends upon the Zca extension.

[C.MOP.n encoding diagram]

note

Very few suitable 16-bit encoding spaces exist. This space was chosen because it already has unusual behavior with respect to the rd/rs1 field (it encodes c.addi16sp when the field contains x2) and is therefore of lower value for most purposes.

Mnemonic | Encoding | Redefinable to read register
C.MOP.1 | 0110000010000001 | x1
C.MOP.3 | 0110000110000001 | x3
C.MOP.5 | 0110001010000001 | x5
C.MOP.7 | 0110001110000001 | x7
C.MOP.9 | 0110010010000001 | x9
C.MOP.11 | 0110010110000001 | x11
C.MOP.13 | 0110011010000001 | x13
C.MOP.15 | 0110011110000001 | x15
note

The recommended assembly syntax for C.MOP.n is simply the nullary C.MOP.n. The possibly accessed register is implicitly xn.

note

The expectation is that each Zcmop instruction is equivalent to some Zimop instruction, but the choice of expansion (if any) is left to the extension that redefines the MOP. Note, a Zcmop instruction that does not write a value can expand into a write to x0.

6.16 Control-flow Integrity (CFI)

Control-flow Integrity (CFI) capabilities help defend against Return-Oriented Programming (ROP) and Call/Jump-Oriented Programming (COP/JOP) style control-flow subversion attacks. These attack methodologies use code sequences in authorized modules, with at least one instruction in the sequence being a control transfer instruction that depends on attacker-controlled data either in the return stack or in memory used to obtain the target address for a call or jump. Attackers stitch these sequences together by diverting the control flow instructions (e.g., JALR, C.JR, C.JALR), from their original target address to a new target via modification in the return stack or in the memory used to obtain the jump/call target address.

RV32/RV64 provides two types of control transfer instructions: unconditional jumps and conditional branches. Conditional branches encode an offset in the immediate field of the instruction and are thus direct branches that are not susceptible to control-flow subversion. Unconditional direct jumps using JAL transfer control to a target that is in a +/- 1 MiB range from the current pc. Unconditional indirect jumps using JALR obtain their branch target by adding the sign-extended 12-bit immediate encoded in the instruction to the rs1 register.

The RV32I/RV64I does not have a dedicated instruction for calling a procedure or returning from a procedure. A JAL or JALR may be used to perform a procedure call and JALR to return from a procedure. The RISC-V ABI however defines the convention that a JAL/JALR where rd (i.e. the link register) is x1 or x5 is a procedure call, and a JALR where rs1 is the conventional link register (i.e. x1 or x5) is a return from procedure. The architecture allows for using these hints and conventions to support return address prediction (See rashints).

The RVC standard extension for compressed instructions provides unconditional jump and conditional branch instructions. The C.J and C.JAL instructions encode an offset in the immediate field of the instruction and thus are not susceptible to control-flow subversion. The C.JR and C.JALR RVC instructions perform an unconditional control transfer to the address in register rs1. The C.JALR additionally writes the address of the instruction following the jump (pc+2) to the link register x1 and is a procedure call. The C.JR is a return from procedure if rs1 is a conventional link register (i.e. x1 or x5); else it is an indirect jump.

The term call is used to refer to a JAL or JALR instruction with a link register as destination, i.e., rd≠x0. Conventionally, the link register is x1 or x5. A call using JAL or C.JAL is termed a direct call. A C.JALR expands to JALR x1, 0(rs1) and is a call. A call using JALR or C.JALR is termed an indirect-call.

The term return is used to refer to a JALR instruction with rd=x0 and with rs1=x1 or rs1=x5. A C.JR instruction expands to JALR x0, 0(rs1) and is a return if rs1=x1 or rs1=x5.

The term indirect-jump is used to refer to a JALR instruction with rd=x0 and where the rs1 is not x1 or x5 (i.e., not a return). A C.JR instruction where rs1 is not x1 or x5 (i.e., not a return) is an indirect-jump.
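These conventions can be summarized in a small classifier (an illustrative sketch, not part of the specification; register specifiers are plain integers and x1/x5 are the conventional link registers):

```python
def classify_jalr(rd, rs1):
    """Classify a JALR instruction per the ABI conventions above."""
    link = {1, 5}                  # x1 and x5
    if rd in link:
        return "indirect-call"     # link-register destination
    if rd == 0 and rs1 in link:
        return "return"            # rd = x0, source is a link register
    if rd == 0:
        return "indirect-jump"     # rd = x0, non-link source
    return "other"                 # non-link, non-x0 destination
```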

The Zicfiss and Zicfilp extensions build on these conventions and hints and provide backward-edge and forward-edge control flow integrity respectively.

The Unprivileged ISA for the Zicfilp extension is specified in unpriv-forward, and that for the Zicfiss extension in unpriv-backward. The Privileged ISA for these extensions is specified in the Privileged ISA specification.

6.16.1 Landing Pad (Zicfilp)

To enforce forward-edge control-flow integrity, the Zicfilp extension introduces a landing pad (LPAD) instruction. The LPAD instruction must be placed at the program locations that are valid targets of indirect jumps or calls. The LPAD instruction (See LP_INST) is encoded using the AUIPC major opcode with rd=x0.

Compilers emit a landing pad instruction as the first instruction of an address-taken function, as well as at any indirect jump targets. A landing pad instruction is not required in functions that are only reached using a direct call or direct jump.

The landing pad is designed to provide integrity to control transfers performed using indirect calls and jumps, and this is referred to as forward-edge protection. When the Zicfilp is active, the hart tracks an expected landing pad (ELP) state that is updated by an indirect_call or indirect_jump to require a landing pad instruction at the target of the branch. If the instruction at the target is not a landing pad, then a software-check exception is raised.

A landing pad may optionally be associated with a 20-bit label. With labeling enabled, the set of landing pads that can be reached from each indirect call or jump site can be defined using programming-language-based policies. Labeling the landing pads enables software to achieve greater precision in pairing up indirect call/jump sites with valid targets. When labeling of landing pads is used, an indirect call or indirect jump site can specify the expected label of the landing pad and thereby constrain the set of landing pads that may be reached from each indirect call or indirect jump site in the program.

In the simplest form, a program can be built with a single label value to implement a coarse-grained version of forward-edge control-flow integrity. By constraining gadgets to be preceded by a landing pad instruction that marks the start of indirectly callable functions, the program can significantly reduce the available gadget space.

A second form of label generation may generate a signature, such as a MAC, using the prototype of the function. Programs that use this approach would further constrain the gadgets accessible from a call site to only indirectly callable functions that match the prototype of the called functions.

Another approach to label generation involves analyzing the control-flow graph (CFG) of the program, which can lead to even more stringent constraints on the set of reachable gadgets. Such programs may further use multiple labels per function, which means that if a function is called from two or more call sites, the function can be labeled as being reachable from each of the call sites. For instance, consider two call sites A and B, where A calls the functions X and Y, and B calls the functions Y and Z. In a single-label scheme, functions X, Y, and Z would need to be assigned the same label so that both call sites A and B can invoke the common function Y. This scheme would allow call site A to also call function Z and call site B to also call function X. However, if function Y were assigned two labels, one corresponding to call site A and the other to call site B, then Y could be invoked by both call sites, while X could only be invoked by call site A and Z only by call site B. To support multiple labels, the compiler could generate a call-site-specific entry point for shared functions, with each entry point having its own landing pad instruction followed by a direct branch to the start of the function. This would allow the function to be labeled with multiple labels, each corresponding to a specific call site.
A portion of the label space may be dedicated to labeled landing pads that are only valid targets of an indirect jump (and not an indirect call).

The LPAD instruction uses the code points defined as HINTs for the AUIPC opcode. When Zicfilp is not active at a privilege level or when the extension is not implemented, the landing pad instruction executes as a no-op. A program that is built with LPAD instructions can thus continue to operate correctly, but without forward-edge control-flow integrity, on processors that do not support the Zicfilp extension or if the Zicfilp extension is not active.

Compilers and linkers should provide an attribute flag to indicate if the program has been compiled with the Zicfilp extension and use that to determine if the Zicfilp extension should be activated. The dynamic loader should activate the use of Zicfilp extension for an application only if all executables (the application and the dependent dynamically linked libraries) used by that application use the Zicfilp extension.

When Zicfilp extension is not active or not implemented, the hart does not require landing pad instructions at the targets of indirect calls/jumps, and the landing instructions revert to being no-ops. This allows a program compiled with landing pad instructions to operate correctly but without forward-edge control-flow integrity.

The Zicfilp extensions may be activated for use individually and independently for each privilege mode.

The Zicfilp extension depends on the Zicsr extension.

6.16.1.1 Landing Pad Enforcement

To enforce that the target of an indirect call or indirect jump must be a valid landing pad instruction, the hart maintains an expected landing pad (ELP) state to determine if a landing pad instruction is required at the target of an indirect call or an indirect jump. The ELP state can be one of:

  • 0 - NO_LP_EXPECTED
  • 1 - LP_EXPECTED

The ELP state is initialized to NO_LP_EXPECTED by the hart upon reset.

The Zicfilp extension, when enabled, determines if an indirect call or an indirect jump must land on a landing pad, as specified in IND_CALL_JMP. If is_lp_expected is 1, then the hart updates the ELP to LP_EXPECTED.


is_lp_expected = ( (JALR || C.JR || C.JALR) &&
                   (rs1 != x1) && (rs1 != x5) && (rs1 != x7) ) ? 1 : 0;

An indirect branch using JALR, C.JALR, or C.JR with rs1 as x7 is termed a software guarded branch. Such branches do not need to land on a LPAD instruction and thus do not set ELP to LP_EXPECTED.

note

When the register source is a link register and the register destination is x0, then it’s a return from a procedure and does not require a landing pad at the target.

When the register source and register destination are both link registers, then it is a semantically-direct call. For example, the call offset pseudoinstruction may expand to a two-instruction sequence composed of a lui ra, imm20 or an auipc ra, imm20 instruction followed by a jalr ra, imm12(ra) instruction, where ra is the link register (either x1 or x5). Since the address of the procedure was not explicitly taken and the computed address is not obtained from mutable memory, such semantically-direct calls do not require a landing pad at the target. Compilers and JIT compilers must use semantically-direct calls only if rs1 was computed as a PC-relative or an absolute offset to the symbol.

The tail offset pseudoinstruction used to tail-call a far-away procedure may also be expanded to a two-instruction sequence composed of a lui x7, imm20 or auipc x7, imm20 followed by a jalr x0, x7. Since the address of the procedure was not explicitly taken and the computed address is not obtained from mutable memory, such semantically-direct tail calls do not require a landing pad to be placed at the target.

Software guarded branches may also be used by compilers to generate code for constructs like switch-cases. When using the software guarded branches, the compiler is required to ensure it has full control on the possible jump targets (e.g., by obtaining the targets from a read-only table in memory and performing bounds checking on the index into the table, etc.).

The landing pad may be labeled. Zicfilp extension designates the register x7 for use as the landing pad label register. To support labeled landing pads, the indirect call/jump sites establish an expected landing pad label (e.g., using the LUI instruction) in the bits 31:12 of the x7 register. The LPAD instruction is encoded with a 20-bit immediate value called the landing-pad-label (LPL) that is matched to the expected landing pad label. When LPL is encoded as zero, the LPAD instruction does not perform the label check and in programs built with this single label mode of operation the indirect call/jump sites do not need to establish an expected landing pad label value in x7.

When ELP is set to LP_EXPECTED, if the next instruction in the instruction stream is not 4-byte aligned, is not LPAD, or encodes a nonzero landing pad label that does not match the expected landing pad label in bits 31:12 of the x7 register, then a software-check exception (cause=18) with xtval set to "landing pad fault (code=2)" is raised; otherwise, ELP is updated to NO_LP_EXPECTED.

note

The tracking of ELP and the requirement for a landing pad instruction at the target of indirect call and jump enables a processor implementation to significantly reduce or to prevent speculation to non-landing-pad instructions. Constraining speculation using this technique, greatly reduces the gadget space and increases the difficulty of using techniques such as branch-target-injection, also known as Spectre variant 2, which use speculative execution to leak data through side channels.

The LPAD instruction requires 4-byte alignment to address the possibility of two instructions A and B accidentally concatenating to form an unintended landing pad in the program. For example, consider a 32-bit instruction where bytes 3 and 2 have a pattern of ?017h (for example, the immediate fields of a LUI, AUIPC, or a JAL instruction), followed by a 16-bit or a 32-bit instruction. When patterns that can accidentally form a valid landing pad are detected, the assembler or linker can force instruction A onto a 4-byte boundary so that the unintended LPAD pattern becomes misaligned, and thus not a valid landing pad, or may use an alternate register allocation to prevent the accidental landing pad.

6.16.1.2 Landing Pad Instruction

When Zicfilp is enabled, LPAD is the only instruction allowed to execute when the ELP state is LP_EXPECTED. If Zicfilp is not enabled, the LPAD instruction is a no-op. If Zicfilp is enabled, the LPAD instruction causes a software-check exception with _x_tval set to "landing pad fault (code=2)" if any of the following conditions are true:

  • The pc is not 4-byte aligned and ELP is LP_EXPECTED.
  • The ELP is LP_EXPECTED and the LPL is not zero and the LPL does not match the expected landing pad label in bits 31:12 of the x7 register.

If a software-check exception is not caused then the ELP is updated to NO_LP_EXPECTED.


The operation of the LPAD instruction is as follows:

if (xLPE == 1 && ELP == LP_EXPECTED)
    // If PC not 4-byte aligned then software-check exception
    if pc[1:0] != 0
        raise software-check exception
    // If landing pad label not matched -> software-check exception
    else if (inst.LPL != x7[31:12] && inst.LPL != 0)
        raise software-check exception
    else
        ELP = NO_LP_EXPECTED
else
    no-op
endif


6.16.2 Shadow Stack (Zicfiss)

The Zicfiss extension introduces a shadow stack to enforce backward-edge control-flow integrity. A shadow stack is a second stack used to store a shadow copy of the return address in the link register if it needs to be spilled.

The shadow stack is designed to provide integrity to control transfers performed using a return, where the return may be from a procedure invoked using an indirect call or a direct call; this is referred to as backward-edge protection.

A program using backward-edge control-flow integrity has two stacks: a regular stack and a shadow stack. The shadow stack is used to spill the link register, if required, by non-leaf functions. An additional register, shadow-stack-pointer (ssp), is introduced in the architecture to hold the address of the top of the active shadow stack.

The shadow stack, similar to the regular stack, grows downwards, from higher addresses to lower addresses. Each entry on the shadow stack is XLEN wide and holds the link register value. The ssp points to the top of the shadow stack, which is the address of the last element stored on the shadow stack.

The shadow stack is architecturally protected from inadvertent corruptions and modifications, as detailed in the Privileged specification.

The Zicfiss extension provides instructions to store and load the link register to/from the shadow stack and to check the integrity of the return address. The extension provides instructions to support common stack maintenance operations such as stack unwinding and stack switching.

When Zicfiss is enabled, each function that needs to spill the link register (typically a non-leaf function) stores the link register value to the regular stack and a shadow copy of the link register value to the shadow stack when the function is entered (the prologue). When such a function returns (the epilogue), it loads the link register from the regular stack and the shadow copy of the link register from the shadow stack. Then, the link register value from the regular stack and the shadow link register value from the shadow stack are compared. A mismatch of the two values is indicative of a subversion of the return address control variable and causes a software-check exception.

The Zicfiss instructions, except SSAMOSWAP.W/D, are encoded using a subset of May-Be-Operation instructions defined by the Zimop and Zcmop extensions. This subset of instructions reverts to its Zimop/Zcmop-defined behavior when the Zicfiss extension is not implemented or has not been activated. A program that is built with Zicfiss instructions can thus continue to operate correctly, but without backward-edge control-flow integrity, on processors that do not support the Zicfiss extension or if the Zicfiss extension is not active. The Zicfiss extension may be activated for use individually and independently for each privilege mode.

Compilers should flag each object file (for example, using flags in the ELF attributes) to indicate if the object file has been compiled with the Zicfiss instructions. The linker should flag (for example, using flags in the ELF attributes) the binary/executable generated by linking objects as being compiled with the Zicfiss instructions only if all the object files that are linked have the same Zicfiss attributes.

The dynamic loader should activate the use of Zicfiss extension for an application only if all executables (the application and the dependent dynamically-linked libraries) used by that application use the Zicfiss extension.

An application that has the Zicfiss extension active may request the dynamic loader at runtime to load a new dynamic shared object (using dlopen() for example). If the requested object does not have the Zicfiss attribute then the dynamic loader, based on its policy (e.g., established by the operating system or the administrator) configuration, could either deny the request or deactivate the Zicfiss extension for the application. It is strongly recommended that the policy enforces a strict security posture and denies the request.

The Zicfiss extension depends on the Zicsr, Zimop and Zaamo extensions. Furthermore, if the Zcmop extension is implemented, the Zicfiss extension also provides the C.SSPUSH and C.SSPOPCHK instructions. Moreover, use of Zicfiss in U-mode requires S-mode to be implemented. Use of Zicfiss in M-mode is not supported.

6.16.2.1 Zicfiss Instructions Summary

The Zicfiss extension introduces the following instructions:

  • Push to the shadow stack (See SS_PUSH)

    • SSPUSH x1 and SSPUSH x5 - encoded using MOP.RR.7
    • C.SSPUSH x1 - encoded using C.MOP.1
  • Pop from the shadow stack (See SS_POP)

    • SSPOPCHK x1 and SSPOPCHK x5 - encoded using MOP.R.28
    • C.SSPOPCHK x5 - encoded using C.MOP.5
  • Read the value of ssp into a register (See SSP_READ)

    • SSRDP - encoded using MOP.R.28
  • Perform an atomic swap from a shadow stack location (See SSAMOSWAP)

    • SSAMOSWAP.W and SSAMOSWAP.D

Zicfiss does not use all encodings of MOP.RR.7 or MOP.R.28. When a MOP.RR.7 or MOP.R.28 encoding is not used by the Zicfiss extension, the corresponding instruction adheres to its Zimop-defined behavior, unless redefined by another extension.

6.16.2.2 Shadow Stack Pointer (ssp)

The ssp CSR is an unprivileged read-write (URW) CSR that reads and writes the XLEN low-order bits of the shadow stack pointer (ssp). The CSR address is 0x011. No high-half CSR is defined, as ssp is always as wide as the XLEN of the current privilege mode. Bits 1:0 of ssp are read-only zero. If neither UXLEN nor SXLEN can ever be 32, then bit 2 is also read-only zero.

6.16.2.3 Zicfiss Instructions

6.16.2.4 Push to the Shadow Stack

A shadow stack push operation is defined as decrement of the ssp by XLEN/8 followed by a store of the value in the link register to memory at the new top of the shadow stack.


Only x1 and x5 registers are supported as rs2 for SSPUSH. Zicfiss provides a 16-bit version of the SSPUSH x1 instruction using the Zcmop defined C.MOP.1 encoding. The C.SSPUSH x1 expands to SSPUSH x1.

The SSPUSH instruction and its compressed form C.SSPUSH can be used to push a link register on the shadow stack. The SSPUSH and C.SSPUSH instructions perform a store identically to the existing store instructions, with the difference that the base is implicitly ssp and the width is implicitly XLEN.

The operation of the SSPUSH and C.SSPUSH instructions is as follows:

if (xSSE == 1)
    mem[ssp - (XLEN/8)] = X(src)   # Store src value to ssp - XLEN/8
    ssp = ssp - (XLEN/8)           # Decrement ssp by XLEN/8
endif

The ssp is decremented by SSPUSH and C.SSPUSH only if the store to the shadow stack completes successfully.

6.16.2.5 Pop from the Shadow Stack

A shadow stack pop operation is defined as an XLEN wide read from the current top of the shadow stack followed by an increment of the ssp by XLEN/8.


Only x1 and x5 registers are supported as rs1 for SSPOPCHK. Zicfiss provides a 16-bit version of the SSPOPCHK x5 using the Zcmop defined C.MOP.5 encoding. The C.SSPOPCHK x5 expands to SSPOPCHK x5.

Programs with a shadow stack push the return address onto the regular stack as well as the shadow stack in the prologue of non-leaf functions. When returning from these non-leaf functions, such programs pop the link register from the regular stack and pop a shadow copy of the link register from the shadow stack. The two values are then compared. If the values do not match, it is indicative of a corruption of the return address variable on the regular stack.

The SSPOPCHK instruction, and its compressed form C.SSPOPCHK, can be used to pop the shadow return address value from the shadow stack and check that the value matches the contents of the link register, and if not cause a software-check exception with _x_tval set to "shadow stack fault (code=3)".

While any register may be used as link register, conventionally the x1 or x5 registers are used. The shadow stack instructions are designed to be most efficient when the x1 and x5 registers are used as the link register.

note

Return-address prediction stacks are a common feature of high-performance instruction-fetch units, but they require accurate detection of instructions used for procedure calls and returns to be effective. For RISC-V, hints as to the instructions' usage are encoded implicitly via the register numbers used. The return-address stack (RAS) actions to pop and/or push onto the RAS are specified in rashints.

Using x1 or x5 as the link register allows a program to benefit from the return-address prediction stacks. Additionally, since the shadow stack instructions are designed around the use of x1 or x5 as the link register, using any other register as a link register would incur the cost of additional register movements.

Compilers, when generating code with backward-edge CFI, must protect the link register, e.g., x1 and/or x5, from arbitrary modification by not emitting unsafe code sequences.

note

Storing the return address on both stacks preserves the call stack layout and the ABI, while also allowing for the detection of corruption of the return address on the regular stack. The prologue and epilogue of a non-leaf function that uses shadow stacks is as follows:

function_entry:
    addi sp,sp,-8    # push link register x1
    sd x1,(sp)       # on regular stack
    sspush x1        # push link register x1 on shadow stack
    :
    ld x1,(sp)       # pop link register x1 from regular stack
    addi sp,sp,8
    sspopchk x1      # fault if x1 not equal to shadow
                     # return address
    ret

This example illustrates the use of x1 register as the link register. Alternatively, the x5 register may also be used as the link register.

A leaf function, a function that does not itself make function calls, does not need to spill the link register. Consequently, the return address may be held in the link register itself for the duration of the leaf function’s execution.

The C.SSPOPCHK, and SSPOPCHK instructions perform a load identically to the existing load instructions, with the difference that the base is implicitly ssp and the width is implicitly XLEN.

The operation of the SSPOPCHK and C.SSPOPCHK instructions is as follows:

if (xSSE == 1)
    temp = mem[ssp]          # Load temp from address in ssp and
    if temp != X(src)        # compare temp to value in src and
                             # cause a software-check exception
                             # if they are not bitwise equal.
                             # Only x1 and x5 may be used as src
        raise software-check exception
    else
        ssp = ssp + (XLEN/8) # Increment ssp by XLEN/8
    endif
endif

If the value loaded from the address in ssp does not match the value in rs1, a software-check exception (cause=18) is raised with _x_tval set to "shadow stack fault (code=3)". The software-check exception caused by SSPOPCHK/C.SSPOPCHK is lower in priority than a load/store/AMO access-fault exception.

The ssp is incremented by SSPOPCHK and C.SSPOPCHK only if the load from the shadow stack completes successfully and no software-check exception is raised.

note

The use of the compressed instruction C.SSPUSH x1 to push on the shadow stack is most efficient when the ABI uses x1 as the link register, as the link register may then be pushed without needing a register-to-register move in the function prologue. To use the compressed instruction C.SSPOPCHK x5, the function should pop the return address from regular stack into the alternate link register x5 and use the C.SSPOPCHK x5 to compare the return address to the shadow copy stored on the shadow stack. The function then uses C.JR x5 to jump to the return address.

function_entry:
    c.addi sp,sp,-8    # push link register x1
    c.sd x1,(sp)       # on regular stack
    c.sspush x1        # push link register x1 on shadow stack
    :
    c.ld x5,(sp)       # pop link register x5 from regular stack
    c.addi sp,sp,8
    c.sspopchk x5      # fault if x5 not equal to shadow return address
    c.jr x5
note

Store-to-load forwarding is a common technique employed by high-performance processor implementations. Zicfiss implementations may prevent forwarding from a non-shadow-stack store to the SSPOPCHK or the C.SSPOPCHK instructions. A non-shadow-stack store causes a fault if done to a page mapped as a shadow stack. However, such a determination may be delayed until the PTE has been examined, and thus data from such stores may be transiently forwarded to an SSPOPCHK or a C.SSPOPCHK.

6.16.2.6 Read ssp into a Register

The SSRDP instruction is provided to move the contents of ssp to a destination register.


Encoding rd as x0 is not supported for SSRDP.

The operation of the SSRDP instructions is as follows:

if (xSSE == 1)
    X(dst) = ssp
else
    X(dst) = 0
endif
note

The property of Zimop writing 0 to the rd when the extension using Zimop is not implemented or not active may be used to determine if the Zicfiss extension is active. For example, functions that unwind shadow stacks may skip over the unwind actions by dynamically detecting if the Zicfiss extension is active.

An example sequence such as the following may be used:

ssrdp t0                     # mv ssp to t0
beqz t0, zicfiss_not_active  # zero is not a valid shadow stack
                             # pointer by convention
# Zicfiss is active
:
:
zicfiss_not_active:
To assist with the use of such code sequences, operating systems and runtimes must not locate shadow stacks at address 0.

note

A common operation performed on stacks is to unwind them to support constructs like setjmp/longjmp, C++ exception handling, etc. A program that uses shadow stacks must unwind the shadow stack in addition to the stack used to store data. The unwind function must verify that it does not accidentally unwind past the bounds of the shadow stack. Shadow stacks are expected to be bounded on each end using guard pages. A guard page for a stack is a page that is not accessible by the process that owns the stack. To detect if the unwind occurs past the bounds of the shadow stack, the unwind may be done in maximal increments of 4 KiB, testing whether the ssp is still pointing to a shadow stack page or has unwound into the guard page. The following examples illustrate the use of shadow stack instructions to unwind a shadow stack. This example assumes that the setjmp function itself does not push on to the shadow stack (being a leaf function, it is not required to).

setjmp() {
    :
    :
    // read and save the shadow stack pointer to jmp_buf
    asm("ssrdp %0" : "=r"(cur_ssp):);
    jmp_buf->saved_ssp = cur_ssp;
    :
    :
}

longjmp() {
    :
    // Read current shadow stack pointer and
    // compute number of call frames to unwind
    asm("ssrdp %0" : "=r"(cur_ssp):);
    // Skip the unwind if backward-edge CFI not active
    asm("beqz %0, back_cfi_not_active" : : "r"(cur_ssp));
    // Unwind the frames in a loop
    while ( jmp_buf->saved_ssp > cur_ssp ) {
        // advance by a maximum of 4K at a time to avoid
        // unwinding past bounds of the shadow stack
        cur_ssp = ( (jmp_buf->saved_ssp - cur_ssp) >= 4096 ) ?
                  (cur_ssp + 4096) : jmp_buf->saved_ssp;
        asm("csrw ssp, %0" : : "r" (cur_ssp));
        // Test if unwound past the shadow stack bounds
        asm("sspush x5");
        asm("sspopchk x5");
    }
back_cfi_not_active:
    :
}

6.16.2.7 Atomic Swap from a Shadow Stack Location


For RV32, SSAMOSWAP.W atomically loads a 32-bit data value from the address of the shadow stack location held in rs1, places the loaded value into register rd, and stores the 32-bit value held in rs2 to the address in rs1. SSAMOSWAP.D (RV64 only) is similar to SSAMOSWAP.W but operates on 64-bit data values.

if privilege_mode != M && menvcfg.SSE == 0
    raise illegal-instruction exception
else if S-mode not implemented
    raise illegal-instruction exception
else if privilege_mode == U && senvcfg.SSE == 0
    raise illegal-instruction exception
else if privilege_mode == VS && henvcfg.SSE == 0
    raise virtual-instruction exception
else if privilege_mode == VU && senvcfg.SSE == 0
    raise virtual-instruction exception
else
    X(rd) = mem[X(rs1)]
    mem[X(rs1)] = X(rs2)
endif

For RV64, SSAMOSWAP.W atomically loads a 32-bit data value from the address of the shadow stack location held in rs1, sign-extends the loaded value and places it in rd, and stores the lower 32 bits of the value held in rs2 to the address in rs1.

if privilege_mode != M && menvcfg.SSE == 0
    raise illegal-instruction exception
else if S-mode not implemented
    raise illegal-instruction exception
else if privilege_mode == U && senvcfg.SSE == 0
    raise illegal-instruction exception
else if privilege_mode == VS && henvcfg.SSE == 0
    raise virtual-instruction exception
else if privilege_mode == VU && senvcfg.SSE == 0
    raise virtual-instruction exception
else
    temp[31:0] = mem[X(rs1)]
    X(rd) = SignExtend(temp[31:0])
    mem[X(rs1)] = X(rs2)[31:0]
endif

Just as for AMOs in the A extension, SSAMOSWAP.W/D requires that the address held in rs1 be naturally aligned to the size of the operand (i.e., eight-byte aligned for doublewords, and four-byte aligned for words). The same exception options apply if the address is not naturally aligned.

Just as for AMOs in the A extension, SSAMOSWAP.W/D optionally provides release consistency semantics, using the aq and rl bits, to help implement multiprocessor synchronization. An SSAMOSWAP.W/D operation has acquire semantics if aq=1 and release semantics if rl=1.

note

Stack switching is a common operation in user programs as well as supervisor programs. When a stack switch is performed the stack pointer of the currently active stack is saved into a context data structure and the new stack is made active by loading a new stack pointer from a context data structure.

When shadow stacks are active for a program, the program needs to additionally switch the shadow stack pointer. If the pointer to the top of the deactivated shadow stack is held in a context data structure, then it may be susceptible to memory corruption vulnerabilities. To protect the pointer value, the program may store it at the top of the deactivated shadow stack itself and thereby create a checkpoint. A legal checkpoint is defined as one that holds a value of X, where X is the address at which the checkpoint is positioned on the shadow stack.

note

An example sequence to restore the shadow stack pointer from the new shadow stack and save the old shadow stack pointer on the old shadow stack is as follows:

# a0 holds pointer to top of the new shadow stack to switch to
stack_switch:
    ssrdp ra
    beqz ra, 2f              # skip if Zicfiss not active
    ssamoswap.d ra, x0, (a0) # ra=*[a0] and *[a0]=0
    beq ra, a0, 1f           # loaded checkpoint must equal its own address
    unimp                    # else crash
1:  addi ra, ra, XLEN/8      # pop the checkpoint
    csrrw ra, ssp, ra        # swap ssp: ra=old ssp, ssp=new ssp
    addi ra, ra, -(XLEN/8)   # checkpoint = "old ssp - XLEN/8"
    ssamoswap.d x0, ra, (ra) # Save checkpoint at "old ssp - XLEN/8"
2:

This sequence uses the ra register. If the privilege mode at which this sequence is executed can be interrupted, then the trap handler should save the ra on the shadow stack itself. There it is guarded against tampering and can be restored prior to returning from the trap.

When a new shadow stack is created by the supervisor, it needs to store a checkpoint at the highest address on that stack. This enables the shadow stack pointer to be switched using the process outlined in this note. The SSAMOSWAP.W/D instruction can be used to store this checkpoint. When the old value at the memory location operated on by SSAMOSWAP.W/D is not required, rd can be set to x0.
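For example, a supervisor creating a new shadow stack might initialize the checkpoint as in the following sketch (register usage is illustrative; t0 is assumed to hold the address of the highest XLEN-wide entry of the new shadow stack, mapped as shadow stack memory):

    # t0 = address of the highest entry of the new shadow stack (illustrative)
    ssamoswap.d x0, t0, (t0)   # store a checkpoint whose value equals its own
                               # address; the prior value is discarded (rd=x0)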

6.17 "Zihintntl" Extension for Non-Temporal Locality Hints, Version 1.0

The NTL instructions are HINTs that indicate that the explicit memory accesses of the immediately subsequent instruction (henceforth "target instruction") exhibit poor temporal locality of reference. The NTL instructions do not change architectural state, nor do they alter the architecturally visible effects of the target instruction. Four variants are provided:

The NTL.P1 instruction indicates that the target instruction does not exhibit temporal locality within the capacity of the innermost level of private cache in the memory hierarchy. NTL.P1 is encoded as ADD x0, x0, x2.

The NTL.PALL instruction indicates that the target instruction does not exhibit temporal locality within the capacity of any level of private cache in the memory hierarchy. NTL.PALL is encoded as ADD x0, x0, x3.

The NTL.S1 instruction indicates that the target instruction does not exhibit temporal locality within the capacity of the innermost level of shared cache in the memory hierarchy. NTL.S1 is encoded as ADD x0, x0, x4.

The NTL.ALL instruction indicates that the target instruction does not exhibit temporal locality within the capacity of any level of cache in the memory hierarchy. NTL.ALL is encoded as ADD x0, x0, x5.
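For example, a streaming copy loop might hint every memory access as non-temporal by preceding it with NTL.ALL, as in the following sketch (register usage is illustrative; a1 and a2 are assumed to hold the source and destination pointers, and a3 the source end address):

loop:
    ntl.all
    ld t0, 0(a1)       # load hinted as non-temporal
    ntl.all
    sd t0, 0(a2)       # store hinted as non-temporal
    addi a1, a1, 8
    addi a2, a2, 8
    bltu a1, a3, loop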

note

The NTL instructions can be used to avoid cache pollution when streaming data or traversing large data structures, or to reduce latency in producer-consumer interactions.

A microarchitecture might use the NTL instructions to inform the cache replacement policy, or to decide which cache to allocate into, or to avoid cache allocation altogether. For example, NTL.P1 might indicate that an implementation should not allocate a line in a private L1 cache, but should allocate in L2 (whether private or shared). In another implementation, NTL.P1 might allocate the line in L1, but in the least-recently used state.

NTL.ALL will typically inform implementations not to allocate anywhere in the cache hierarchy. Programmers should use NTL.ALL for accesses that have no exploitable temporal locality.

Like any HINTs, these instructions may be freely ignored. Hence, although they are described in terms of cache-based memory hierarchies, they do not mandate the provision of caches.

Some implementations might respect these HINTs for some memory accesses but not others: e.g., implementations that implement LR/SC by acquiring a cache line in the exclusive state in L1 might ignore NTL instructions on LR and SC, but might respect NTL instructions for AMOs and regular loads and stores.

ntl-portable lists several software use cases and the recommended NTL variant that portable software—i.e., software not tuned for any specific implementation’s memory hierarchy—should use in each case.

Scenario                                                         Recommended NTL variant
Access to a working set between 64 KiB and 256 KiB in size       NTL.P1
Access to a working set between 256 KiB and 1 MiB in size        NTL.PALL
Access to a working set greater than 1 MiB in size               NTL.S1
Access with no exploitable temporal locality (e.g., streaming)   NTL.ALL
Access to a contended synchronization variable                   NTL.PALL
note

The working-set sizes listed in ntl-portable are not meant to constrain implementers' cache-sizing decisions. Cache sizes will obviously vary between implementations, and so software writers should only take these working-set sizes as rough guidelines.

ntl lists several sample memory hierarchies and recommends how each NTL variant maps onto each cache level. The table also recommends which NTL variant that implementation-tuned software should use to avoid allocating in a particular cache level. For example, for a system with a private L1 and a shared L2, it is recommended that NTL.P1 and NTL.PALL indicate that temporal locality cannot be exploited by the L1, and that NTL.S1 and NTL.ALL indicate that temporal locality cannot be exploited by the L2. Furthermore, software tuned for such a system should use NTL.P1 to indicate a lack of temporal locality exploitable by the L1, or should use NTL.ALL to indicate a lack of temporal locality exploitable by the L2.

If the C or Zca extension is provided, compressed variants of these HINTs are also provided: C.NTL.P1 is encoded as C.ADD x0, x2; C.NTL.PALL is encoded as C.ADD x0, x3; C.NTL.S1 is encoded as C.ADD x0, x4; and C.NTL.ALL is encoded as C.ADD x0, x5.

The NTL instructions affect all memory-access instructions except the cache-management instructions in the Zicbom extension.

note

As of this writing, there are no other exceptions to this rule, and so the NTL instructions affect all memory-access instructions defined in the base ISAs and the A, F, D, Q, C, and V standard extensions, as well as those defined within the hypervisor extension in hypervisor.

The NTL instructions can affect cache-management operations other than those in the Zicbom extension. For example, NTL.PALL followed by CBO.ZERO might indicate that the line should be allocated in L3 and zeroed, but not allocated in L1 or L2.

Memory hierarchy                 | variant to actual cache level | explicit cache management
                                 | P1   PALL  S1   ALL           | L1   L2    L3    L4/L5
Common Scenarios
No caches                        | --   --    --   --            | none
Private L1 only                  | L1   L1    L1   L1            | ALL
Private L1; shared L2            | L1   L1    L2   L2            | P1   ALL
Private L1; shared L2/L3         | L1   L1    L2   L3            | P1   S1    ALL
Private L1/L2                    | L1   L2    L2   L2            | P1   ALL
Private L1/L2; shared L3         | L1   L2    L3   L3            | P1   PALL  ALL
Private L1/L2; shared L3/L4      | L1   L2    L3   L4            | P1   PALL  S1    ALL
Uncommon Scenarios
Private L1/L2/L3; shared L4      | L1   L3    L4   L4            | P1   P1    PALL  ALL
Private L1; shared L2/L3/L4      | L1   L1    L2   L4            | P1   S1    ALL   ALL
Private L1/L2; shared L3/L4/L5   | L1   L2    L3   L5            | P1   PALL  S1    ALL
Private L1/L2/L3; shared L4/L5   | L1   L3    L4   L5            | P1   P1    PALL  ALL

When an NTL instruction is applied to a prefetch hint in the Zicbop extension, it indicates that a cache line should be prefetched into a cache that is outer from the level specified by the NTL.

note

For example, in a system with a private L1 and shared L2, NTL.P1 followed by PREFETCH.R might prefetch into L2 with read intent.
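Such a sequence might look like the following sketch (register usage is illustrative):

    ntl.p1
    prefetch.r 0(a0)    # in the example system, prefetch into L2 with read intent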

To prefetch into the innermost level of cache, do not prefix the prefetch instruction with an NTL instruction.

In some systems, NTL.ALL followed by a prefetch instruction might prefetch into a cache or prefetch buffer internal to a memory controller.

Software is discouraged from following an NTL instruction with an instruction that does not explicitly access memory. Nonadherence to this recommendation might reduce performance but otherwise has no architecturally visible effect.

In the event that a trap is taken on the target instruction, implementations are discouraged from applying the NTL to the first instruction in the trap handler. Instead, implementations are recommended to ignore the HINT in this case.

note

If an interrupt occurs between the execution of an NTL instruction and its target instruction, execution will normally resume at the target instruction. That the NTL instruction is not re-executed does not change the semantics of the program.

Some implementations might prefer not to process the NTL instruction until the target instruction is seen (e.g., so that the NTL can be fused with the memory access it modifies). Such implementations might preferentially take the interrupt before the NTL, rather than between the NTL and the memory access.

note

Since the NTL instructions are encoded as ADDs, they can be used within LR/SC loops without voiding the forward-progress guarantee. But, since using other loads and stores within an LR/SC loop does void the forward-progress guarantee, the only reason to use an NTL within such a loop is to modify the LR or the SC.

6.18 "Zihintpause" Extension for Pause Hint, Version 2.0

The PAUSE instruction is a HINT that indicates the current hart’s rate of instruction retirement should be temporarily reduced or paused. The duration of its effect must be bounded and may be zero.

note

Software can use the PAUSE instruction to reduce energy consumption while executing spin-wait code sequences. Multithreaded cores might temporarily relinquish execution resources to other harts when PAUSE is executed. It is recommended that a PAUSE instruction generally be included in the code sequence for a spin-wait loop.

The duration of a PAUSE instruction’s effect may vary significantly within and among implementations. In typical implementations this duration should be much less than the time to perform a context switch, probably more on the rough order of an on-chip cache miss latency or a cacheless access to main memory.

A series of PAUSE instructions can be used to create a cumulative delay loosely proportional to the number of PAUSE instructions. In spin-wait loops in portable code, however, only one PAUSE instruction should be used before re-evaluating loop conditions, else the hart might stall longer than optimal on some implementations, degrading system performance.
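A portable spin-wait loop using PAUSE might look like the following sketch (the flag polled through a0 is illustrative):

spin_wait:
    pause               # temporarily reduce the rate of instruction retirement
    lw t0, 0(a0)        # a0 points to the flag being awaited
    beqz t0, spin_wait  # re-evaluate the condition after each PAUSE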

PAUSE is encoded as a FENCE instruction with pred=W, succ=0, fm=0, rd=x0, and rs1=x0.

note

PAUSE is encoded as a hint within the FENCE opcode because some implementations are expected to deliberately stall the PAUSE instruction until outstanding memory transactions have completed. Because the successor set is null, however, PAUSE does not mandate any particular memory ordering—hence, it truly is a HINT.

Like other FENCE instructions, PAUSE cannot be used within LR/SC sequences without voiding the forward-progress guarantee.

The choice of a predecessor set of W is arbitrary, since the successor set is null. Other HINTs similar to PAUSE might be encoded with other predecessor sets.

6.19 Cache Management Operations (CMOs)

6.19.1 Pseudocode for instruction semantics

The semantics of each instruction in the insns chapter is expressed in a SAIL-like syntax.

6.19.2 Introduction

Cache-management operation (or CMO) instructions perform operations on copies of data in the memory hierarchy. In general, CMO instructions operate on cached copies of data, but in some cases, a CMO instruction may operate on memory locations directly. Furthermore, CMO instructions are grouped by operation into the following classes:

  • A management instruction manipulates cached copies of data with respect to a set of agents that can access the data
  • A zero instruction zeros out a range of memory locations, potentially allocating cached copies of data in one or more caches
  • A prefetch instruction indicates to hardware that data at a given memory location may be accessed in the near future, potentially allocating cached copies of data in one or more caches

This chapter introduces a base set of CMO ISA extensions that operate specifically on cache blocks or the memory locations corresponding to a cache block; these are known as cache-block operation (or CBO) instructions. Each of the above classes of instructions represents an extension in this specification:

  • The Zicbom extension defines a set of cache-block management instructions: CBO.INVAL, CBO.CLEAN, and CBO.FLUSH
  • The Zicboz extension defines a cache-block zero instruction: CBO.ZERO
  • The Zicbop extension defines a set of cache-block prefetch instructions: PREFETCH.R, PREFETCH.W, and PREFETCH.I

The execution behavior of the above instructions is also modified by CSR state added by this specification.

The remainder of this chapter provides general background information on CMO instructions and describes each of the above ISA extensions.

note

The term CMO encompasses all operations on caches or resources related to caches. The term CBO represents a subset of CMOs that operate only on cache blocks. The first CMO extensions only define CBOs.

6.19.3 Background

This chapter provides information common to all CMO extensions.

6.19.3.1 Memory and Caches

A memory location is a physical resource in a system uniquely identified by a physical address. An agent is a logic block, such as a RISC-V hart, accelerator, I/O device, etc., that can access a given memory location.

note

A given agent may not be able to access all memory locations in a system, and two different agents may or may not be able to access the same set of memory locations.

A load operation (or store operation) is performed by an agent to consume (or modify) the data at a given memory location. Load and store operations are performed as a result of explicit memory accesses to that memory location. Additionally, a read transfer from memory fetches the data at the memory location, while a write transfer to memory updates the data at the memory location.

A cache is a structure that buffers copies of data to reduce average memory latency. Any number of caches may be interspersed between an agent and a memory location, and load and store operations from an agent may be satisfied by a cache instead of the memory location.

note

Load and store operations are decoupled from read and write transfers by caches. For example, a load operation may be satisfied by a cache without performing a read transfer from memory, or a store operation may be satisfied by a cache that first performs a read transfer from memory.

Caches organize copies of data into cache blocks, each of which represents a contiguous, naturally aligned power-of-two (or NAPOT) range of memory locations. A cache block is identified by any of the physical addresses corresponding to the underlying memory locations. The capacity and organization of a cache and the size of a cache block are both implementation-specific, and the execution environment provides software a means to discover information about the caches and cache blocks in a system. In the initial set of CMO extensions, the size of a cache block shall be uniform throughout the system.
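Because cache blocks are NAPOT regions, the block containing a given address can be found by masking off the low-order address bits. The following non-normative sketch assumes a 64-byte block size; the actual size is implementation-specific and must be discovered through the execution environment:

```python
# Illustrative sketch: locate the NAPOT cache block containing an address.
# block_size is an assumption here (64 bytes); real systems discover it.
def cache_block_range(addr, block_size=64):
    assert block_size & (block_size - 1) == 0  # NAPOT: power-of-two size
    base = addr & ~(block_size - 1)            # naturally aligned base
    return base, base + block_size

# Any address inside the block identifies the same block.
assert cache_block_range(0x1234) == (0x1200, 0x1240)
```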

note

In future CMO extensions, the requirement for a uniform cache block size may be relaxed.

Implementation techniques such as speculative execution or hardware prefetching may cause a given cache to allocate or deallocate a copy of a cache block at any time, provided the corresponding physical addresses are accessible according to the supported access type PMA and are cacheable according to the cacheability PMA. Allocating a copy of a cache block results in a read transfer from another cache or from memory, while deallocating a copy of a cache block may result in a write transfer to another cache or to memory depending on whether the data in the copy were modified by a store operation. Additional details are discussed in coherent-agents-caches.

6.19.3.2 Cache-Block Operations

A CBO instruction causes one or more operations to be performed on the cache blocks identified by the instruction. In general, a CBO instruction may identify one or more cache blocks; however, in the initial set of CMO extensions, CBO instructions identify a single cache block only.

A cache-block management instruction performs one of the following operations, relative to the copy of a given cache block allocated in a given cache:

  • An invalidate operation deallocates the copy of the cache block
  • A clean operation performs a write transfer to another cache or to memory if the data in the copy of the cache block have been modified by a store operation
  • A flush operation atomically performs a clean operation followed by an invalidate operation

Additional details, including the actual operation performed by a given cache-block management instruction, are described in Zicbom.

A cache-block zero instruction performs a set of store operations that write zeros to the set of bytes corresponding to a cache block. Unless specified otherwise, the store operations generated by a cache-block zero instruction have the same general properties and behaviors that other store instructions in the architecture have. An implementation may or may not update the entire set of bytes atomically with a single store operation. Additional details are described in Zicboz.
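As a non-normative sketch, the net effect of a cache-block zero is to write zeros to every byte of the NAPOT block containing the given address (a 64-byte block and byte-granular stores are assumptions here):

```python
# Illustrative model of cache-block zero semantics: zero the full block
# containing addr. The order, granularity, and atomicity of the individual
# stores are unconstrained by the specification.
def cbo_zero(mem, addr, block_size=64):
    base = addr & ~(block_size - 1)
    for i in range(block_size):
        mem[base + i] = 0
```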

A cache-block prefetch instruction is a HINT to the hardware that software expects to perform a particular type of memory access in the near future. Additional details are described in Zicbop.

6.19.4 Coherent Agents and Caches

For a given memory location, a set of coherent agents consists of the agents for which all of the following hold:

  • Store operations from all agents in the set appear to be serialized with respect to each other
  • Store operations from all agents in the set eventually appear to all other agents in the set
  • A load operation from an agent in the set returns data from a store operation from an agent in the set (or from the initial data in memory)

The coherent agents within such a set shall access a given memory location with the same physical address and the same physical memory attributes; however, if the coherence PMA for a given agent indicates a given memory location is not coherent, that agent shall not be a member of a set of coherent agents with any other agent for that memory location and shall be the sole member of a set of coherent agents consisting of itself.

An agent that is a member of a set of coherent agents is said to be coherent with respect to the other agents in the set. On the other hand, an agent that is not a member is said to be non-coherent with respect to the agents in the set.

Caches introduce the possibility that multiple copies of a given cache block may be present in a system at the same time. An implementation-specific mechanism keeps these copies coherent with respect to the load and store operations from the agents in the set of coherent agents. Additionally, if a coherent agent in the set executes a CBO instruction that specifies the cache block, the resulting operation shall apply to any and all of the copies in the caches that can be accessed by the load and store operations from the coherent agents.

note

An operation from a CBO instruction is defined to operate only on the copies of a cache block that are cached in the caches accessible by the explicit memory accesses performed by the set of coherent agents. This includes copies of a cache block in caches that are accessed only indirectly by load and store operations, e.g. coherent instruction caches.

The set of caches subject to the above mechanism form a set of coherent caches, and each coherent cache has the following behaviors, assuming all operations are performed by the agents in a set of coherent agents:

  • A coherent cache is permitted to allocate and deallocate copies of a cache block and perform read and write transfers as described in memory-caches
  • A coherent cache is permitted to perform a write transfer to memory provided that a store operation has modified the data in the cache block since the most recent invalidate, clean, or flush operation on the cache block
  • At least one coherent cache is responsible for performing a write transfer to memory once a store operation has modified the data in the cache block until the next invalidate, clean, or flush operation on the cache block, after which no coherent cache is responsible (or permitted) to perform a write transfer to memory until the next store operation has modified the data in the cache block
  • A coherent cache is required to perform a write transfer to memory if a store operation has modified the data in the cache block since the most recent invalidate, clean, or flush operation on the cache block and if the next clean or flush operation requires a write transfer to memory
note

The above restrictions ensure that a "clean" copy of a cache block, fetched by a read transfer from memory and unmodified by a store operation, cannot later overwrite the copy of the cache block in memory updated by a write transfer to memory from a non-coherent agent.

A non-coherent agent may initiate a cache-block operation that operates on the set of coherent caches accessed by a set of coherent agents. The mechanism to perform such an operation is implementation-specific.

6.19.4.1 Memory Ordering

6.19.4.1.1 Preserved Program Order

The preserved program order (abbreviated PPO) rules are defined by the RVWMO memory ordering model. How the operations resulting from CMO instructions fit into these rules is described below.

For cache-block management instructions, the resulting invalidate, clean, and flush operations behave as stores in the PPO rules subject to one additional overlapping address rule. Specifically, if a precedes b in program order, then a will precede b in the global memory order if:

  • a is an invalidate, clean, or flush, b is a load, and a and b access overlapping memory addresses
note

The above rule ensures that a subsequent load in program order never appears in the global memory order before a preceding invalidate, clean, or flush operation to an overlapping address.

Additionally, invalidate, clean, and flush operations are classified as W or O (depending on the physical memory attributes for the corresponding physical addresses) for the purposes of predecessor and successor sets in FENCE instructions. These operations are not ordered by other instructions that order stores, e.g. FENCE.I and SFENCE.VMA.

For cache-block zero instructions, the resulting store operations behave as stores in the PPO rules and are ordered by other instructions that order stores.

Finally, for cache-block prefetch instructions, the resulting operations are not ordered by the PPO rules nor are they ordered by any other ordering instructions.

6.19.4.1.2 Load Values

An invalidate operation may change the set of values that can be returned by a load. In particular, an additional condition is added to the Load Value Axiom:

  • If an invalidate operation i precedes a load r and operates on a byte x returned by r, and no store to x appears between i and r in program order or in the global memory order, then r returns any of the following values for x:
  1. If no clean or flush operations on x precede i in the global memory order, either the initial value of x or the value of any store to x that precedes i
  2. If no store to x precedes a clean or flush operation on x in the global memory order and if the clean or flush operation on x precedes i in the global memory order, either the initial value of x or the value of any store to x that precedes i
  3. If a store to x precedes a clean or flush operation on x in the global memory order and if the clean or flush operation on x precedes i in the global memory order, either the value of the latest store to x that precedes the latest clean or flush operation on x or the value of any store to x that both precedes i and succeeds the latest clean or flush operation on x that precedes i
  4. The value of any store to x by a non-coherent agent regardless of the above conditions
note

The first three bullets describe the possible load values at different points in the global memory order relative to clean or flush operations. The final bullet implies that the load value may be produced by a non-coherent agent at any time.

6.19.4.2 Traps

Execution of certain CMO instructions may result in traps due to CSR state, described in the csr_state section, or due to the address translation and protection mechanisms. The trapping behavior of CMO instructions is described in the following sections.

6.19.4.2.1 Illegal-Instruction and Virtual-Instruction Exceptions

Cache-block management instructions and cache-block zero instructions may raise illegal-instruction exceptions or virtual-instruction exceptions depending on the current privilege mode and the state of the CMO control registers described in the csr_state section.

Cache-block prefetch instructions raise neither illegal-instruction exceptions nor virtual-instruction exceptions.

6.19.4.2.2 Page-Fault, Guest-Page-Fault, and Access-Fault Exceptions

Similar to load and store instructions, CMO instructions are explicit memory access instructions that compute an effective address. The effective address is ultimately translated into a physical address based on the privilege mode and the enabled translation mechanisms, and the CMO extensions impose the following constraints on the physical addresses in a given cache block:

  • The PMP access control bits shall be the same for all physical addresses in the cache block, and if write permission is granted by the PMP access control bits, read permission shall also be granted
  • The PMAs shall be the same for all physical addresses in the cache block, and if write permission is granted by the supported access type PMAs, read permission shall also be granted

If the above constraints are not met, the behavior of a CBO instruction is UNSPECIFIED.

note

This specification assumes that the above constraints will typically be met for main memory regions and may be met for certain I/O regions.

note

The access size for CMO instructions equals the size of the cache block; however, in some cases that access can be decomposed into multiple memory operations. PMP checks are applied to each memory operation independently. For example, a 64-byte cbo.zero that spans two 32-byte PMP regions would succeed if it were decomposed into two 32-byte memory operations (provided the PMP access control bits are the same in both regions), but if performed as a single 64-byte memory operation it would cause an access fault.
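The decomposition example in the note above can be sketched as follows. This is a non-normative model; the region representation and `pmp_check` helper are hypothetical, standing in for an implementation's actual PMP matching logic:

```python
# Hypothetical PMP model: regions are (base, size, perms) tuples. An access
# passes only if it lies entirely within one region granting the permission.
def pmp_check(regions, addr, size, perm):
    return any(base <= addr and addr + size <= base + rsize and perm in perms
               for base, rsize, perms in regions)

regions = [(0x1000, 32, "rw"), (0x1020, 32, "rw")]

# A single 64-byte operation spans both regions, so it faults ...
assert not pmp_check(regions, 0x1000, 64, "w")
# ... but the same cbo.zero decomposed into two 32-byte operations succeeds.
assert all(pmp_check(regions, a, 32, "w") for a in (0x1000, 0x1020))
```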

The Zicboz extension introduces an additional supported access type PMA for cache-block zero instructions. Main memory regions are required to support accesses by cache-block zero instructions; however, I/O regions may specify whether accesses by cache-block zero instructions are supported.

A cache-block management instruction is permitted to access the specified cache block whenever a load instruction or store instruction is permitted to access the corresponding physical addresses. If neither a load instruction nor store instruction is permitted to access the physical addresses, but an instruction fetch is permitted to access the physical addresses, whether a cache-block management instruction is permitted to access the cache block is UNSPECIFIED. If access to the cache block is not permitted, a cache-block management instruction raises a store page-fault or store guest-page-fault exception if address translation does not permit any access or raises a store access-fault exception otherwise. During address translation, the instruction also checks the accessed bit and may either raise an exception or set the bit as required.

note

The interaction between cache-block management instructions and instruction fetches will be specified in a future extension.

As implied by omission, a cache-block management instruction does not check the dirty bit and neither raises an exception nor sets the bit.

A cache-block zero instruction is permitted to access the specified cache block whenever a store instruction is permitted to access the corresponding physical addresses and when the PMAs indicate that cache-block zero instructions are a supported access type. If access to the cache block is not permitted, a cache-block zero instruction raises a store page-fault or store guest-page-fault exception if address translation does not permit write access or raises a store access-fault exception otherwise. During address translation, the instruction also checks the accessed and dirty bits and may either raise an exception or set the bits as required.

A cache-block prefetch instruction is permitted to access the specified cache block whenever a load instruction, store instruction, or instruction fetch is permitted to access the corresponding physical addresses. If access to the cache block is not permitted, a cache-block prefetch instruction does not raise any exceptions and shall not access any caches or memory. During address translation, the instruction does not check the accessed and dirty bits and neither raises an exception nor sets the bits.

When a page-fault, guest-page-fault, or access-fault exception is taken, the relevant *tval CSR is written with the faulting effective address (i.e. the value of rs1).

note

Like a load or store instruction, a CMO instruction may or may not be permitted to access a cache block based on the states of the MPRV, MPV, and MPP bits in mstatus and the SUM and MXR bits in mstatus, sstatus, and vsstatus.

This specification expects that implementations will process cache-block management instructions like store/AMO instructions, so store/AMO exceptions are appropriate for these instructions, regardless of the permissions required.

6.19.4.2.3 Address-Misaligned Exceptions

CMO instructions do not generate address-misaligned exceptions.

6.19.4.2.4 Breakpoint Exceptions and Debug Mode Entry

Unless otherwise defined by the debug architecture specification, the behavior of trigger modules with respect to CMO instructions is UNSPECIFIED.

note

For the Zicbom, Zicboz, and Zicbop extensions, this specification recommends the following common trigger module behaviors:

  • Type 6 address match triggers, i.e. tdata1.type=6 and mcontrol6.select=0, should be supported
  • Type 2 address/data match triggers, i.e. tdata1.type=2, should be unsupported
  • The size of a memory access equals the size of the cache block accessed, and the compare values follow from the addresses of the NAPOT memory region corresponding to the cache block containing the effective address
  • Unless an encoding for a cache block is added to the mcontrol6.size field, an address trigger should only match a memory access from a CBO instruction if mcontrol6.size=0

If the Zicbom extension is implemented, this specification recommends the following additional trigger module behaviors:

  • Implementing address match triggers should be optional
  • Type 6 data match triggers, i.e. tdata1.type=6 and mcontrol6.select=1, should be unsupported
  • Memory accesses are considered to be stores, i.e. an address trigger matches only if mcontrol6.store=1

If the Zicboz extension is implemented, this specification recommends the following additional trigger module behaviors:

  • Implementing address match triggers should be mandatory
  • Type 6 data match triggers, i.e. tdata1.type=6 and mcontrol6.select=1, should be supported, and implementing these triggers should be optional
  • Memory accesses are considered to be stores, i.e. an address trigger matches only if mcontrol6.store=1

If the Zicbop extension is implemented, this specification recommends the following additional trigger module behaviors:

  • Implementing address match triggers should be optional
  • Type 6 data match triggers, i.e. tdata1.type=6 and mcontrol6.select=1, should be unsupported
  • Memory accesses may be considered to be loads or stores depending on the implementation, i.e. whether an address trigger matches on these instructions when mcontrol6.load=1 or mcontrol6.store=1 is implementation-specific

This specification also recommends that the behavior of trigger modules with respect to the Zicboz extension should be defined in version 1.0 of the debug architecture specification. The behavior of trigger modules with respect to the Zicbom and Zicbop extensions is expected to be defined in future extensions.

6.19.4.2.5 Hypervisor Extension

For the purposes of writing the mtinst or htinst register on a trap, the following standard transformation is defined for cache-block management instructions and cache-block zero instructions:

[standard transformation encoding diagram omitted]

The operation field corresponds to the 12 most significant bits of the trapping instruction.

note

As described in the hypervisor extension, a zero may be written into mtinst or htinst instead of the standard transformation defined above.

6.19.4.3 Effects on Constrained LR/SC Loops

The following event is added to the list of events that satisfy the eventuality guarantee provided by constrained LR/SC loops, as defined in the A extension:

  • Some other hart executes a cache-block management instruction or a cache-block zero instruction to the reservation set of the LR instruction in H's constrained LR/SC loop.
note

The above event has been added to accommodate cache coherence protocols that cannot distinguish between invalidations for stores and invalidations for cache-block management operations.

Aside from the above event, CMO instructions neither change the properties of constrained LR/SC loops nor modify the eventuality guarantee provided by them. For example, executing a CMO instruction may cause a constrained LR/SC loop on any hart to fail periodically or may cause an unconstrained LR/SC sequence on the same hart to always fail. Additionally, executing a cache-block prefetch instruction does not impact the eventuality guarantee provided by constrained LR/SC loops executed on any hart.

6.19.4.4 Software Discovery

The initial set of CMO extensions requires the following information to be discovered by software:

  • The size of the cache block for management and prefetch instructions
  • The size of the cache block for zero instructions
  • CBIE support at each privilege level

Other general cache characteristics may also be specified in the discovery mechanism.

6.19.5 CSR controls for CMO instructions

The x{csrname} registers control CBO instruction execution based on the current privilege mode and the state of the appropriate CSRs, as detailed below.

A CBO.INVAL instruction executes or raises either an illegal-instruction exception or a virtual-instruction exception based on the state of the x{csrname}.CBIE fields:


// illegal-instruction exceptions
if (((priv_mode != M) && (m{csrname}.CBIE == 00)) ||
    ((priv_mode == U) && (s{csrname}.CBIE == 00)))
{
    <raise illegal-instruction exception>
}
// virtual-instruction exceptions
else if (((priv_mode == VS) && (h{csrname}.CBIE == 00)) ||
         ((priv_mode == VU) && ((h{csrname}.CBIE == 00) || (s{csrname}.CBIE == 00))))
{
    <raise virtual-instruction exception>
}
// execute instruction
else
{
    if (((priv_mode != M) && (m{csrname}.CBIE == 01)) ||
        ((priv_mode == U) && (s{csrname}.CBIE == 01)) ||
        ((priv_mode == VS) && (h{csrname}.CBIE == 01)) ||
        ((priv_mode == VU) && ((h{csrname}.CBIE == 01) || (s{csrname}.CBIE == 01))))
    {
        <execute CBO.INVAL and perform flush operation>
    }
    else
    {
        <execute CBO.INVAL and perform invalidate operation>
    }
}
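The CBO.INVAL decision procedure can be modeled as a runnable, non-normative sketch. The string privilege names and the function name are illustrative; m, s, and h stand for the 2-bit CBIE fields of the m{csrname}, s{csrname}, and h{csrname} registers:

```python
# Illustrative, non-normative model of the CBO.INVAL check.
# priv is one of "M", "S", "U", "VS", "VU"; m, s, h are 2-bit CBIE fields
# (assumed encodings: 0b00 = prohibited, 0b01 = executes as flush,
# 0b11 = executes as invalidate).
def cbo_inval_behavior(priv, m, s, h):
    if (priv != "M" and m == 0b00) or (priv == "U" and s == 0b00):
        return "illegal-instruction"
    if (priv == "VS" and h == 0b00) or (priv == "VU" and (h == 0b00 or s == 0b00)):
        return "virtual-instruction"
    if ((priv != "M" and m == 0b01) or (priv == "U" and s == 0b01)
            or (priv == "VS" and h == 0b01)
            or (priv == "VU" and (h == 0b01 or s == 0b01))):
        return "flush"
    return "invalidate"
```

Note that M-mode is never gated by the CSR fields, and that a field value of 0b01 downgrades the invalidate to a flush, closing the stale-data hole described in the note below the pseudocode.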


note

Until a modified cache block has updated memory, a CBO.INVAL instruction may expose stale data values in memory if the CSRs are programmed to perform an invalidate operation. This behavior may result in a security hole if lower privileged level software performs an invalidate operation and accesses sensitive information in memory.

To avoid such holes, higher privileged level software must perform either a clean or flush operation on the cache block before permitting lower privileged level software to perform an invalidate operation on the block. Alternatively, higher privileged level software may program the CSRs so that CBO.INVAL either traps or performs a flush operation in a lower privileged level.

A CBO.CLEAN or CBO.FLUSH instruction executes or raises an illegal-instruction or virtual-instruction exception based on the state of the x{csrname}.CBCFE bits:


// illegal-instruction exceptions
if (((priv_mode != M) && !m{csrname}.CBCFE) ||
    ((priv_mode == U) && !s{csrname}.CBCFE))
{
    <raise illegal-instruction exception>
}
// virtual-instruction exceptions
else if (((priv_mode == VS) && !h{csrname}.CBCFE) ||
         ((priv_mode == VU) && !(h{csrname}.CBCFE && s{csrname}.CBCFE)))
{
    <raise virtual-instruction exception>
}
// execute instruction
else
{
    <execute CBO.CLEAN or CBO.FLUSH>
}

Finally, a CBO.ZERO instruction executes or raises an illegal-instruction or virtual-instruction exception based on the state of the x{csrname}.CBZE bits:


// illegal-instruction exceptions
if (((priv_mode != M) && !m{csrname}.CBZE) ||
    ((priv_mode == U) && !s{csrname}.CBZE))
{
    <raise illegal-instruction exception>
}
// virtual-instruction exceptions
else if (((priv_mode == VS) && !h{csrname}.CBZE) ||
         ((priv_mode == VU) && !(h{csrname}.CBZE && s{csrname}.CBZE)))
{
    <raise virtual-instruction exception>
}
// execute instruction
else
{
    <execute CBO.ZERO>
}
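The CBZE check (and the structurally identical CBCFE check) reduces to a simple boolean condition, sketched here non-normatively; the privilege strings and function name are illustrative:

```python
# Illustrative model of the CBZE gate: m, s, h are the single CBZE bits of
# m{csrname}, s{csrname}, and h{csrname}. The CBCFE check is identical in form.
def cbo_zero_check(priv, m, s, h):
    if (priv != "M" and not m) or (priv == "U" and not s):
        return "illegal-instruction"
    if (priv == "VS" and not h) or (priv == "VU" and not (h and s)):
        return "virtual-instruction"
    return "execute"
```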

The CBIE/CBCFE/CBZE fields in each x{csrname} register do not affect the read and write behavior of the same fields in the other x{csrname} registers.

Each x{csrname} register is WARL; however, software should determine the legal values from the execution environment discovery mechanism.

6.19.6 Extensions

CMO instructions are defined in the following extensions:

6.19.6.1 Cache-Block Management Instructions

Cache-block management instructions enable software running on a set of coherent agents to communicate with a set of non-coherent agents by performing one of the following operations:

  • An invalidate operation makes data from store operations performed by a set of non-coherent agents visible to the set of coherent agents at a point common to both sets by deallocating all copies of a cache block from the set of coherent caches up to that point
  • A clean operation makes data from store operations performed by the set of coherent agents visible to a set of non-coherent agents at a point common to both sets by performing a write transfer of a copy of a cache block to that point provided a coherent agent performed a store operation that modified the data in the cache block since the previous invalidate, clean, or flush operation on the cache block
  • A flush operation atomically performs a clean operation followed by an invalidate operation

In the Zicbom extension, the instructions operate to a point common to all agents in the system. In other words, an invalidate operation ensures that store operations from all non-coherent agents are visible to agents in the set of coherent agents, and a clean operation ensures that store operations from coherent agents are visible to all non-coherent agents.

note

The Zicbom extension does not prohibit agents that fall outside of the above architectural definition; however, software cannot rely on the defined cache operations to have the desired effects with respect to those agents.

Future extensions may define different sets of agents for the purposes of performance optimization.

These instructions operate on the cache block whose effective address is specified in rs1. The effective address is translated into a corresponding physical address by the appropriate translation mechanisms.

The following instructions comprise the Zicbom extension:

RV32  RV64  Mnemonic          Instruction
 ✓     ✓    cbo.clean base    cbo.clean
 ✓     ✓    cbo.flush base    cbo.flush
 ✓     ✓    cbo.inval base    cbo.inval
note

Cache-block management instructions ignore cacheability attributes and operate on the cache block irrespective of the PMA cacheable attribute and any Page-Based Memory Type (PBMT) downgrade from cacheable to non-cacheable.

6.19.6.2 Cache-Block Zero Instructions

Cache-block zero instructions store zeros to the set of bytes corresponding to a cache block. An implementation may update the bytes in any order and with any granularity and atomicity, including individual bytes.

note

Cache-block zero instructions store zeros independently of whether data from the underlying memory locations are cacheable. In addition, this specification does not constrain how the bytes are written.

These instructions operate on the cache block, or the memory locations corresponding to the cache block, whose effective address is specified in rs1. The effective address is translated into a corresponding physical address by the appropriate translation mechanisms.

The following instructions comprise the Zicboz extension:

RV32  RV64  Mnemonic          Instruction
 ✓     ✓    cbo.zero base     cbo.zero

6.19.6.3 Cache-Block Prefetch Instructions

Cache-block prefetch instructions are HINTs to the hardware to indicate that software intends to perform a particular type of memory access in the near future. The types of memory accesses are instruction fetch, data read (i.e. load), and data write (i.e. store).

These instructions operate on the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] shall equal 0b00000. The effective address is translated into a corresponding physical address by the appropriate translation mechanisms.

note

Cache-block prefetch instructions are encoded as ORI instructions with rd equal to 0b00000; however, for the purposes of effective address calculation, this field is also interpreted as imm[4:0] like a store instruction.
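Putting the encoding rules together, a prefetch instruction can be assembled by placing the offset's upper bits in imm[11:5], a per-instruction code in the rs2-position field, and zeros in the rd field. The following sketch assumes the ratified Zicbop field layout (prefetch.i/r/w distinguished by 0b00000/0b00001/0b00011 in bits 24:20); the function name is illustrative:

```python
# Illustrative encoder for prefetch.{i,r,w} offset(base), assuming the
# ratified Zicbop layout: imm[11:5] in bits 31:25, the instruction code in
# bits 24:20, rs1 in 19:15, funct3=110 (ORI), rd=00000, opcode=0010011.
def encode_prefetch(kind, offset, rs1):
    ops = {"i": 0b00000, "r": 0b00001, "w": 0b00011}
    # imm[4:0] shall equal 0b00000, so the offset is a multiple of 32
    assert offset % 32 == 0 and -2048 <= offset < 2048
    imm11_5 = (offset >> 5) & 0x7F
    return (imm11_5 << 25) | (ops[kind] << 20) | (rs1 << 15) \
         | (0b110 << 12) | (0b00000 << 7) | 0b0010011

# prefetch.i 0(x1) is bit-identical to ori x0, x1, 0 — a HINT.
assert encode_prefetch("i", 0, 1) == 0x0000E013
```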

The following instructions comprise the Zicbop extension:

RV32  RV64  Mnemonic                  Instruction
 ✓     ✓    prefetch.i offset(base)   prefetch.i
 ✓     ✓    prefetch.r offset(base)   prefetch.r
 ✓     ✓    prefetch.w offset(base)   prefetch.w

6.19.7 Instructions

6.19.7.1 cbo.clean

Synopsis Perform a clean operation on a cache block

Mnemonic cbo.clean offset(base)

Encoding

[instruction encoding diagram omitted]

Description A cbo.clean instruction performs a clean operation on the cache block that contains the address specified in rs1. The offset operand may be omitted; otherwise, any expression that computes the offset shall evaluate to zero. The instruction operates on the set of coherent caches accessed by the agent executing the instruction.

note

When executing a cbo.clean instruction, an implementation may instead perform a flush operation, since the result of that operation is indistinguishable from the sequence of performing a clean operation just before deallocating all cached copies in the set of coherent caches.

6.19.7.2 cbo.flush

Synopsis Perform a flush operation on a cache block

Mnemonic cbo.flush offset(base)

Encoding

[instruction encoding diagram omitted]

Description A cbo.flush instruction performs a flush operation on the cache block that contains the address specified in rs1. It is not required that rs1 be aligned to the size of a cache block. On faults, the faulting virtual address is considered to be the value in rs1, rather than the base address of the cache block. The instruction operates on the set of coherent caches accessed by the agent executing the instruction.

The assembly offset operand may be omitted; if it is not, any expression that computes the offset shall evaluate to zero.

6.19.7.3 cbo.inval

Synopsis Perform an invalidate operation on a cache block

Mnemonic cbo.inval offset(base)

Encoding

[instruction encoding diagram omitted]

Description A cbo.inval instruction performs an invalidate operation on the cache block that contains the address specified in rs1. It is not required that rs1 is aligned to the size of a cache block. On faults, the faulting virtual address is considered to be the value in rs1, rather than the base address of the cache block. The instruction operates on the set of coherent caches accessed by the agent executing the instruction.

Depending on CSR programming, the instruction may perform a flush operation instead of an invalidate operation.

The assembly offset operand may be omitted; if it is not, any expression that computes the offset shall evaluate to zero.

note

When executing a cbo.inval instruction, an implementation may instead perform a flush operation, since the result of that operation is indistinguishable from the sequence of performing a write transfer to memory just before performing an invalidate operation.
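The relationships among the clean, flush, and invalidate operations can be illustrated with a small, non-normative model. The block size, class name, and write-back cache structure below are assumptions made purely for the sketch; the point is that a flush is indistinguishable from a clean followed by an invalidate, as the notes above describe.

```python
# Non-normative sketch of cbo.clean / cbo.flush / cbo.inval semantics
# over a single write-back cache and a backing memory. The 64-byte
# block size and the CacheModel structure are illustrative assumptions.

BLOCK = 64  # assumed cache block size in bytes

def block_base(addr: int) -> int:
    """Base address of the cache block that contains addr
    (rs1 need not be aligned to the block size)."""
    return addr & ~(BLOCK - 1)

class CacheModel:
    def __init__(self):
        self.mem = {}    # block base -> data committed to memory
        self.cache = {}  # block base -> dirty cached copy

    def store(self, addr, data):
        # Write-back cache: the store dirties a line without
        # immediately updating memory.
        self.cache[block_base(addr)] = data

    def clean(self, addr):
        # cbo.clean: write any dirty copy back, keep the cached copy.
        base = block_base(addr)
        if base in self.cache:
            self.mem[base] = self.cache[base]

    def inval(self, addr):
        # cbo.inval: deallocate the copy without writing it back.
        self.cache.pop(block_base(addr), None)

    def flush(self, addr):
        # cbo.flush behaves like a clean followed by an invalidate.
        self.clean(addr)
        self.inval(addr)
```

In this model, `flush` and the sequence `clean` then `inval` leave the cache and memory in identical states, which is why an implementation may substitute one for the other as the notes permit.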

6.19.7.4 cbo.zero

Synopsis Store zeros to the full set of bytes corresponding to a cache block

Mnemonic cbo.zero offset(base)

Encoding

(instruction encoding diagram not reproduced)

Description A cbo.zero instruction performs stores of zeros to the full set of bytes corresponding to the cache block that contains the address specified in rs1. It is not required that rs1 is aligned to the size of a cache block. On faults, the faulting virtual address is considered to be the value in rs1, rather than the base address of the cache block. An implementation may or may not update the entire set of bytes atomically.

The assembly offset operand may be omitted; if it is present, any expression that computes the offset shall evaluate to zero.
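The addressing behavior of cbo.zero can be sketched non-normatively as follows. The 64-byte block size and the flat `bytearray` memory are assumptions for illustration; the sketch shows how an unaligned rs1 value still selects the full containing cache block.

```python
# Non-normative sketch of cbo.zero: zero every byte of the cache block
# that contains the given address. The 64-byte block size is an
# assumption; rs1 need not be aligned to the block size.

BLOCK = 64  # assumed cache block size in bytes

def cbo_zero(memory: bytearray, addr: int) -> None:
    base = addr & ~(BLOCK - 1)                # base of the containing block
    memory[base:base + BLOCK] = bytes(BLOCK)  # zero the full block
```

Note that the model zeroes the block in one step, whereas a real implementation may or may not update the entire set of bytes atomically.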

6.19.7.5 prefetch.i

Synopsis Provide a HINT to hardware that a cache block is likely to be accessed by an instruction fetch in the near future

Mnemonic prefetch.i offset(base)

Encoding

(instruction encoding diagram not reproduced)

Description

A prefetch.i instruction indicates to hardware that the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00000, is likely to be accessed by an instruction fetch in the near future.
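The effective-address computation above is shared by prefetch.i, prefetch.r, and prefetch.w. Because imm[4:0] must equal 0b00000, the sign-extended 12-bit offset is restricted to multiples of 32 in the range [-2048, 2016]. A small, non-normative helper (the function name and RV64 wrap mask are assumptions for illustration) makes the constraint concrete:

```python
# Non-normative helper modeling the prefetch offset constraint:
# the sign-extended 12-bit offset has imm[4:0] == 0b00000, so it must
# be a multiple of 32 in [-2048, 2016]. The 64-bit wrap models RV64.

def prefetch_effective_address(base: int, offset: int) -> int:
    if offset % 32 != 0 or not (-2048 <= offset <= 2016):
        raise ValueError("offset must be a multiple of 32 in [-2048, 2016]")
    return (base + offset) & 0xFFFFFFFFFFFFFFFF  # modular RV64 address
```

An assembler enforces the same constraint when encoding the offset operand of these HINT instructions.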

note

An implementation may opt to cache a copy of the cache block in a cache accessed by an instruction fetch in order to improve memory access latency, but this behavior is not required.

6.19.7.6 prefetch.r

Synopsis Provide a HINT to hardware that a cache block is likely to be accessed by a data read in the near future

Mnemonic prefetch.r offset(base)

Encoding

(instruction encoding diagram not reproduced)

Description

A prefetch.r instruction indicates to hardware that the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00000, is likely to be accessed by a data read (i.e., a load) in the near future.

note

An implementation may opt to cache a copy of the cache block in a cache accessed by a data read in order to improve memory access latency, but this behavior is not required.

6.19.7.7 prefetch.w

Synopsis Provide a HINT to hardware that a cache block is likely to be accessed by a data write in the near future

Mnemonic prefetch.w offset(base)

Encoding

(instruction encoding diagram not reproduced)

Description

A prefetch.w instruction indicates to hardware that the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00000, is likely to be accessed by a data write (i.e., a store) in the near future.

note

An implementation may opt to cache a copy of the cache block in a cache accessed by a data write in order to improve memory access latency, but this behavior is not required.