7 Compressed Instructions

This chapter describes the RISC-V compressed instruction-set extensions, which reduce static and dynamic code size by adding short 16-bit instruction encodings for common operations. Typically, 50%-60% of the RISC-V instructions in a program can be replaced with compressed instructions, resulting in a 25%-30% code-size reduction.
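
The two percentage ranges above are linked by simple arithmetic: each compressed instruction is half the size of the 32-bit instruction it replaces. The following is a minimal sketch of that relationship, assuming every uncompressed instruction is 32 bits wide; the function name is illustrative and not part of the specification.

```python
def code_size_reduction(compressed_fraction: float) -> float:
    """Fractional code-size saving when `compressed_fraction` of the
    instructions shrink from 32 bits to 16 bits.

    Assumption: every instruction in the original program is 32 bits.
    """
    if not 0.0 <= compressed_fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    new_bits = compressed_fraction * 16.0 + (1.0 - compressed_fraction) * 32.0
    return 1.0 - new_bits / 32.0


for fraction in (0.5, 0.6):
    print(f"{fraction:.0%} compressed -> {code_size_reduction(fraction):.0%} smaller")
```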

The Zca extension forms the core of the compressed extensions; it provides compressed forms of integer loads, stores, branches, and computational instructions. The Zcf extension is an XLEN=32-only extension that adds single-precision floating-point loads and stores. The Zcd extension adds double-precision floating-point loads and stores. The C extension combines Zca, Zcf if XLEN=32 and the F extension is present, and Zcd if the D extension is present. Various additional Zc* extensions are also defined.
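
The composition rule for the C extension can be written as a small function. This is only a sketch of the rule as stated in the paragraph above; the function and parameter names are hypothetical.

```python
def c_extension_components(xlen: int, has_f: bool, has_d: bool) -> list[str]:
    """Return the Zc* extensions that C stands for on a given configuration."""
    components = ["Zca"]              # always part of C
    if xlen == 32 and has_f:
        components.append("Zcf")      # XLEN=32 only, and only with F
    if has_d:
        components.append("Zcd")      # whenever D is present
    return components


print(c_extension_components(32, has_f=True, has_d=True))
print(c_extension_components(64, has_f=True, has_d=True))
```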

7.1 `Zca` Extension for Integer Compressed Instructions

The compressed extensions use a simple compression scheme that offers shorter 16-bit versions of common 32-bit RISC-V instructions when:

the immediate or address offset is small, or
one of the registers is the zero register (x0), the ABI link register (x1), or the ABI stack pointer (x2), or
the destination register and the first source register are identical, or
the registers used are the 8 most popular ones.

The Zca extension is compatible with all other standard instruction extensions. The Zca extension allows 16-bit instructions to be freely intermixed with 32-bit instructions, with the latter now able to start on any 16-bit boundary, i.e., IALIGN=16. With the addition of the Zca extension, no instructions can raise instruction-address-misaligned exceptions.

note

Removing the 32-bit alignment constraint on the original 32-bit instructions allows significantly greater code density.

The compressed instruction encodings are mostly common across XLEN=32 and XLEN=64, but as shown in Zca Instruction listing, Quadrant 0, a few opcodes are used for different purposes depending on base ISA. For example, the XLEN=64 variant requires additional opcodes to compress loads and stores of 64-bit integer values, whereas the XLEN=32-only Zcf extension uses the same opcodes to compress loads and stores of single-precision floating-point values. If the C extension is implemented, the appropriate compressed floating-point load and store instructions must be provided whenever the relevant standard floating-point extension (F and/or D) is also implemented. In addition, the XLEN=32 variant includes a compressed jump and link instruction to compress short-range subroutine calls, where the same opcode is used to compress ADDIW for XLEN=64.

note

Double-precision loads and stores are a significant fraction of static and dynamic instructions, hence the motivation to provide the Zcd extension.

Single-precision loads and stores are not a significant source of static or dynamic compression for programs compiled for the LP64D and ILP32D calling conventions. However, for microcontrollers that only provide hardware single-precision floating-point units and whose programs use the ILP32F calling convention, the single-precision loads and stores are used at least as frequently as double-precision loads and stores are used in LP64D and ILP32D—hence the motivation to provide compressed support for these in the XLEN=32-only Zcf extension.

Short-range subroutine calls are more likely in small binaries for microcontrollers, hence the motivation to include C.JAL only in the XLEN=32 variant of Zca.

Although reusing opcodes for different purposes for different base ISAs adds some complexity to documentation, the impact on implementation complexity is small even for designs that support multiple base ISAs. The compressed floating-point load and store variants use the same instruction format with the same register specifiers as the wider integer loads and stores.

Zca was designed under the constraint that each Zca instruction expands into a single 32-bit instruction in the base ISA. Adopting this constraint has two main benefits:

Hardware designs can simply expand Zca instructions during decode, simplifying verification and minimizing modifications to existing microarchitectures.
Compilers can be unaware of the Zca extension and leave code compression to the assembler and linker, although a compression-aware compiler will generally be able to produce better results.

note

At the time we designed the Zca extension, we felt the multiple complexity reductions of a simple one-one mapping between Zca and base instructions far outweighed the potential gains of a slightly denser encoding that added additional instructions only supported in the Zca extension, or that allowed encoding of multiple base instructions in one Zca instruction.

Since then, additional extensions Zcmp and Zcmt have been defined that further reduce code size at the expense of these complexities.

It is important to note that the Zca extension is not designed to be a stand-alone ISA, and is meant to be used alongside a base ISA.

note

Variable-length instruction sets have long been used to improve code density. For example, the IBM Stretch [24], developed in the late 1950s, had an ISA with 32-bit and 64-bit instructions, where some of the 32-bit instructions were compressed versions of the full 64-bit instructions. Stretch also employed the concept of limiting the set of registers that were addressable in some of the shorter instruction formats, with short branch instructions that could only refer to one of the index registers. The later IBM 360 architecture [25] supported a simple variable-length instruction encoding with 16-bit, 32-bit, or 48-bit instruction formats.

In 1963, CDC introduced the Cray-designed CDC 6600 [26], a precursor to RISC architectures, that introduced a register-rich load-store architecture with instructions of two lengths, 15 bits and 30 bits. The later Cray-1 design used a very similar instruction format, with 16-bit and 32-bit instruction lengths.

The initial RISC ISAs from the 1980s all picked performance over code size, which was reasonable for a workstation environment, but not for embedded systems. Hence, both ARM and MIPS subsequently made versions of the ISAs that offered smaller code size by offering an alternative 16-bit wide instruction set instead of the standard 32-bit wide instructions. The compressed RISC ISAs reduced code size relative to their starting points by about 25-30%, yielding code that was significantly smaller than 80x86. This result surprised some, as their intuition was that the variable-length CISC ISA should be smaller than RISC ISAs that offered only 16-bit and 32-bit formats.

Since the original RISC ISAs did not leave sufficient opcode space free to include these unplanned compressed instructions, they were instead developed as complete new ISAs. This meant compilers needed different code generators for the separate compressed ISAs. The first compressed RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only a fixed 16-bit instruction size, which gave good reductions in static code size but caused an increase in dynamic instruction count, which led to lower performance compared to the original fixed-width 32-bit instruction size. This led to the development of a second generation of compressed RISC ISA designs with mixed 16-bit and 32-bit instruction lengths (e.g., ARM Thumb2, microMIPS, PowerPC VLE), so that performance was similar to pure 32-bit instructions but with significant code size savings. Unfortunately, these different generations of compressed ISAs are incompatible with each other and with the original uncompressed ISA, leading to significant complexity in documentation, implementations, and software tools support.

Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently support a compressed instruction format. It is surprising that the most popular 64-bit ISA for mobile platforms (ARM v8) does not include a compressed instruction format given that static code size and dynamic instruction fetch bandwidth are important metrics. Although static code size is not a major concern in larger systems, instruction fetch bandwidth can be a major bottleneck in servers running commercial workloads, which often have a large instruction working set.

Benefiting from 25 years of hindsight, RISC-V was designed to support compressed instructions from the outset, leaving enough opcode space for the compressed extensions to be added on top of the base ISA (along with many other extensions). The philosophy of Zca is to reduce code size for embedded applications and to improve performance and energy-efficiency for all applications due to fewer misses in the instruction cache. Waterman shows a 25%-30% reduction in static instruction bits, which reduces instruction cache misses by 20%-25%, or roughly the same performance impact as doubling the instruction cache size. [27]

7.1.1 Compressed Instruction Formats

Table 31 shows the nine compressed instruction formats. CR, CI, and CSS can use any of the 32 RVI registers, but CIW, CL, CS, CA, and CB are limited to just 8 of them. Table 32 lists these popular registers, which correspond to registers x8 to x15. Note that there is a separate version of load and store instructions that use the stack pointer as the base address register, since saving to and restoring from the stack are so prevalent, and that they use the CI and CSS formats to allow access to all 32 data registers. CIW supplies an 8-bit immediate for the ADDI4SPN instruction.

note

The RISC-V ABI was changed to make the frequently used registers map to registers x8-x15. This simplifies the decompression decoder by having a contiguous naturally aligned set of register numbers, and is also compatible with the RV32E and RV64E base ISAs, which only have 16 integer registers.

The formats were designed to keep bits for the two register source specifiers in the same place in all instructions, while the destination register field can move. When the full 5-bit destination register specifier is present, it is in the same place as in the 32-bit RISC-V encoding. Where immediates are sign-extended, the sign extension is always from bit 12. Immediate fields have been scrambled, as in the base specification, to reduce the number of immediate multiplexers required.

note

The immediate fields are scrambled in the instruction formats instead of in sequential order so that as many bits as possible are in the same position in every instruction, thereby simplifying implementations.

For many Zca instructions, zero-valued immediates are disallowed and x0 is not a valid 5-bit register specifier. These restrictions free up encoding space for other instructions requiring fewer operand bits.

Table 31. Compressed 16-bit Zca instruction formats

Format	Meaning	15 14 13	12	11 10	9 8 7	6 5	4 3 2	1 0
CR	Register	funct4		rd/rs1		rs2		op
CI	Immediate	funct3	imm	rd/rs1		imm		op
CSS	Stack-relative Store	funct3	imm			rs2		op
CIW	Wide Immediate	funct3	imm				rd′	op
CL	Load	funct3	imm		rs1′	imm	rd′	op
CS	Store	funct3	imm		rs1′	imm	rs2′	op
CA	Arithmetic	funct6			rd′/rs1′	funct2	rs2′	op
CB	Branch/Arithmetic	funct3	offset		rd′/rs1′	offset		op
CJ	Jump	funct3	jump target					op

Table 32. Registers specified by the three-bit rs1′, rs2′, and rd′ fields of the CIW, CL, CS, CA, and CB formats.

`Zca` Register Number	`000`	`001`	`010`	`011`	`100`	`101`	`110`	`111`
Integer Register Number	`x8`	`x9`	`x10`	`x11`	`x12`	`x13`	`x14`	`x15`
Integer Register ABI Name	`s0`	`s1`	`a0`	`a1`	`a2`	`a3`	`a4`	`a5`
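
Because the eight registers form a contiguous, naturally aligned block, the mapping in Table 32 reduces to adding or subtracting 8. A minimal sketch, with illustrative helper names:

```python
# ABI names of x8..x15, indexed by the three-bit register field (Table 32).
ABI_NAMES = ["s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5"]


def expand_reg(field3: int) -> int:
    """Map a three-bit rd'/rs1'/rs2' field to the full register number."""
    if not 0 <= field3 <= 7:
        raise ValueError("register field must fit in three bits")
    return 8 + field3


def compress_reg(regnum: int):
    """Return the three-bit field for x8..x15, or None for any other register."""
    return regnum - 8 if 8 <= regnum <= 15 else None


print(expand_reg(0b010), ABI_NAMES[0b010])
```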

7.1.2 Load and Store Instructions

To increase the reach of 16-bit instructions, data-transfer instructions use zero-extended immediates that are scaled by the size of the data in bytes: ×4 for words, ×8 for doublewords, and ×16 for quadwords.
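
The effect of scaling can be illustrated with a short sketch. The field widths used in the example calls (5 bits for register-based accesses, 6 bits for stack-pointer-based accesses) are taken from the standard encodings rather than from the text above, so treat them as assumptions; the helper names are hypothetical.

```python
def max_offset(field_bits: int, access_bytes: int) -> int:
    """Largest byte offset reachable with a zero-extended, scaled immediate."""
    return ((1 << field_bits) - 1) * access_bytes


def encode_scaled_offset(offset: int, field_bits: int, access_bytes: int):
    """Return the unsigned field value for a byte offset, or None if the
    offset is negative, misaligned, or out of reach."""
    if offset < 0 or offset % access_bytes != 0:
        return None
    field = offset // access_bytes
    return field if field < (1 << field_bits) else None


print(max_offset(5, 4))   # register-based word access (assumed 5-bit field)
print(max_offset(6, 4))   # stack-pointer-based word access (assumed 6-bit field)
print(max_offset(6, 8))   # stack-pointer-based doubleword access (assumed 6-bit field)
```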

Zca provides two variants of loads and stores. One uses the ABI stack pointer, x2, as the base address and can target any data register. The other can reference one of 8 base address registers and one of 8 data registers.

7.1.2.1 Stack-Pointer-Based Loads and Stores

These instructions use the CI format.

C.LWSP loads a 32-bit value from memory into register rd. It computes an effective address by adding the zero-extended offset, scaled by 4, to the stack pointer, x2. It expands to LW rd, offset(x2). C.LWSP is valid only when rd≠x0; the code points with rd=x0 are reserved.

C.LDSP is an XLEN=64-only instruction that loads a 64-bit value from memory into register rd. It computes its effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to LD rd, offset(x2). C.LDSP is valid only when rd≠x0; the code points with rd=x0 are reserved.
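
As an illustration of the one-to-one expansion, the sketch below turns a C.LWSP encoding into the corresponding 32-bit LW encoding. The positions of the offset bits within the CI format are taken from the standard encoding, which this text does not spell out, so they should be checked against the instruction listing; the function name is hypothetical.

```python
def expand_c_lwsp(inst: int) -> int:
    """Expand a 16-bit C.LWSP encoding to the encoding of LW rd, offset(x2)."""
    # funct3 (bits 15:13) must be 010 and op (bits 1:0) must be 10.
    if inst & 0xE003 != 0x4002:
        raise ValueError("not a C.LWSP encoding")
    rd = (inst >> 7) & 0x1F
    if rd == 0:
        raise ValueError("C.LWSP with rd=x0 is reserved")
    # Assumed field layout: inst[12]=offset[5], inst[6:4]=offset[4:2],
    # inst[3:2]=offset[7:6]. The offset is zero-extended and already scaled.
    offset = (
        (((inst >> 12) & 0x1) << 5)
        | (((inst >> 4) & 0x7) << 2)
        | (((inst >> 2) & 0x3) << 6)
    )
    # LW: imm[11:0] | rs1=x2 | funct3=010 | rd | opcode=0000011
    return (offset << 20) | (2 << 15) | (0b010 << 12) | (rd << 7) | 0b0000011


print(hex(expand_c_lwsp(0x40B2)))  # c.lwsp ra, 12(sp)
```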

These instructions use the CSS format.

C.SWSP stores a 32-bit value in register rs2 to memory. It computes an effective address by adding the zero-extended offset, scaled by 4, to the stack pointer, x2. It expands to SW rs2, offset(x2).

C.SDSP is an XLEN=64-only instruction that stores a 64-bit value in register rs2 to memory. It computes an effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to SD rs2, offset(x2).

note

Register save/restore code at function entry/exit represents a significant portion of static code size. The stack-pointer-based compressed loads and stores in Zca are effective at reducing the save/restore static code size by a factor of 2 while improving performance by reducing dynamic instruction bandwidth.

A common mechanism used in other ISAs to further reduce save/restore code size is load-multiple and store-multiple instructions. We considered adopting these for RISC-V but noted the following drawbacks to these instructions:

These instructions complicate processor implementations.
For virtual memory systems, some data accesses could be resident in physical memory and some could not, which requires a new restart mechanism for partially executed instructions.
Unlike the rest of the Zca instructions, there is no base ISA equivalent to Load Multiple and Store Multiple.
Unlike the rest of the Zca instructions, the compiler would have to be aware of these load-multiple and store-multiple instructions to both allocate registers in the expected order and also to schedule the loads and stores contiguously and in the proper order, to maximize the chances of them being detected and replaced by an assembler or linker with the equivalent load-multiple or store-multiple compressed instruction.
Simple microarchitectural implementations will constrain how other instructions can be scheduled around the load and store multiple instructions, leading to a potential performance loss.
The desire for sequential register allocation might conflict with the featured registers selected for the CIW, CL, CS, CA, and CB formats.

Furthermore, much of the gains can be realized in software by replacing prologue and epilogue code with subroutine calls to common prologue and epilogue code, a technique described in Section 5.6 of [28].

While our rationale for omitting load-multiple and store-multiple remains valid, the pressure to reduce code size is so great, even at the expense of performance and microarchitectural complexity, that the Zcmp extension has since been defined to compress most prologues and epilogues into two bytes apiece.

7.1.2.2 Register-Based Loads and Stores

These instructions use the CL format.

C.LW loads a 32-bit value from memory into register _rd′_. It computes an effective address by adding the zero-extended offset, scaled by 4, to the base address in register _rs1′_. It expands to LW rd′, offset(rs1′).

C.LD is an XLEN=64-only instruction that loads a 64-bit value from memory into register _rd′_. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register _rs1′_. It expands to LD rd′, offset(rs1′).
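
The register-based form differs from the stack-pointer form in that both register fields are three bits wide and select from x8 to x15. A sketch of expanding C.LW to LW follows; the offset bit positions within the CL format are an assumption taken from the standard encoding, and the function name is illustrative.

```python
def expand_c_lw(inst: int) -> int:
    """Expand a 16-bit C.LW encoding to the encoding of LW rd', offset(rs1')."""
    # funct3 (bits 15:13) must be 010 and op (bits 1:0) must be 00.
    if inst & 0xE003 != 0x4000:
        raise ValueError("not a C.LW encoding")
    rd = 8 + ((inst >> 2) & 0x7)    # rd'  selects x8..x15
    rs1 = 8 + ((inst >> 7) & 0x7)   # rs1' selects x8..x15
    # Assumed field layout: inst[12:10]=offset[5:3], inst[6]=offset[2],
    # inst[5]=offset[6].
    offset = (
        (((inst >> 10) & 0x7) << 3)
        | (((inst >> 6) & 0x1) << 2)
        | (((inst >> 5) & 0x1) << 6)
    )
    # LW: imm[11:0] | rs1 | funct3=010 | rd | opcode=0000011
    return (offset << 20) | (rs1 << 15) | (0b010 << 12) | (rd << 7) | 0b0000011


print(hex(expand_c_lw(0x4108)))  # c.lw a0, 0(a0)
print(hex(expand_c_lw(0x414C)))  # c.lw a1, 4(a0)
```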

These instructions use the CS format.

C.SW stores a 32-bit value in register _rs2′_ to memory. It computes an effective address by adding the zero-extended offset, scaled by 4, to the base address in register _rs1′_. It expands to SW rs2′, offset(rs1′).

C.SD is an XLEN=64-only instruction that stores a 64-bit value in register _rs2′_ to memory. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register _rs1′_. It expands to SD rs2′, offset(rs1′).

7.1.3 Control Transfer Instructions

Zca provides unconditional jump instructions and conditional branch instructions. As with base RVI instructions, the offsets of all Zca control transfer instructions are in multiples of 2 bytes.

These instructions use the CJ format.

C.J performs an unconditional control transfer. The offset is sign-extended and added to the pc to form the jump target address. C.J can therefore target a ±2 KiB range. It expands to JAL x0, offset.

C.JAL is an XLEN=32-only instruction that performs the same operation as C.J, but additionally writes the address of the instruction following the jump (pc+2) to the link register, x1. It expands to JAL x1, offset. (See XLEN=32-only rationale.)
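
The ±2 KiB range follows from a 12-bit signed offset whose least-significant bit is implicitly zero. The sketch below decodes that offset from a CJ-format instruction. The scrambled bit order used here (offset[11|4|9:8|10|6|7|3:1|5] in inst[12:2]) is taken from the standard encoding and is an assumption relative to this text; the function does not check funct3 or op.

```python
def cj_offset(inst: int) -> int:
    """Decode the sign-extended byte offset of a CJ-format instruction."""
    def bit(n: int) -> int:
        return (inst >> n) & 1

    offset = (
        (bit(12) << 11)                  # inst[12]    -> offset[11] (sign)
        | (bit(11) << 4)                 # inst[11]    -> offset[4]
        | (((inst >> 9) & 0x3) << 8)     # inst[10:9]  -> offset[9:8]
        | (bit(8) << 10)                 # inst[8]     -> offset[10]
        | (bit(7) << 6)                  # inst[7]     -> offset[6]
        | (bit(6) << 7)                  # inst[6]     -> offset[7]
        | (((inst >> 3) & 0x7) << 1)     # inst[5:3]   -> offset[3:1]
        | (bit(2) << 5)                  # inst[2]     -> offset[5]
    )
    # Sign-extend from offset bit 11.
    return offset - (1 << 12) if offset & (1 << 11) else offset


print(cj_offset(0xA001))  # c.j to itself
print(cj_offset(0xBFFD))  # c.j to the preceding 16-bit parcel
```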

These instructions use the CR format.

C.JR (jump register) performs an unconditional control transfer to the address in register rs1. It expands to JALR x0, 0(rs1). C.JR is valid only when rs1≠x0; the code point with rs1=x0 is reserved.

C.JALR (jump and link register) performs the same operation as C.JR, but additionally writes the address of the instruction following the jump (pc+2) to the link register, x1. It expands to JALR x1, 0(rs1). C.JALR is valid only when rs1≠x0; the code point with rs1=x0 corresponds to the C.EBREAK instruction.

note

Strictly speaking, C.JALR does not expand exactly to a base RVI instruction as the value added to the PC to form the link address is 2 rather than 4 as in the base ISA, but supporting both offsets of 2 and 4 bytes is only a very minor change to the base microarchitecture.

These instructions use the CB format.

C.BEQZ performs conditional control transfers. The offset is sign-extended and added to the pc to form the branch target address. It can therefore target a ±256 B range. C.BEQZ takes the branch if the value in register rs1′ is zero. It expands to BEQ rs1′, x0, offset.

C.BNEZ is defined analogously, but it takes the branch if rs1′ contains a nonzero value. It expands to BNE rs1′, x0, offset.
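
An assembler deciding whether a BEQ or BNE has a compressed form has to check the register, the comparison operand, and the offset. The following sketch captures those conditions; it assumes the ±256 B range means even byte offsets from -256 to 254, and the function name is hypothetical.

```python
def has_compressed_branch(rs1: int, rs2: int, offset: int) -> bool:
    """True if BEQ/BNE rs1, rs2, offset can be written as C.BEQZ/C.BNEZ."""
    return (
        8 <= rs1 <= 15              # rs1' must be one of x8..x15
        and rs2 == 0                # the comparison must be against x0
        and offset % 2 == 0         # offsets are multiples of 2 bytes
        and -256 <= offset <= 254   # 9-bit signed range (assumed bounds)
    )


print(has_compressed_branch(10, 0, -64))
print(has_compressed_branch(5, 0, 16))
```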

7.1.4 Integer Computational Instructions

Zca provides several instructions for integer arithmetic and constant generation.

7.1.4.1 Integer Constant-Generation Instructions

The two constant-generation instructions both use the CI instruction format and can target any integer register.

C.LI loads the sign-extended 6-bit immediate, imm, into register rd. It expands into ADDI rd, x0, imm. The C.LI code points with rd=x0 are HINTs.

C.LUI loads the non-zero 6-bit immediate field into bits 17–12 of the destination register, clears the bottom 12 bits, and sign-extends bit 17 into all higher bits of the destination. It expands into LUI rd, imm. C.LUI is valid only when rd≠x2, and when the immediate is not equal to zero. The code points with imm=0 are reserved. The code points with rd=x2 and imm≠0 correspond to the C.ADDI16SP instruction. The code points with rd=x0 and imm≠0 are HINTs.
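
The values produced by the two constant-generation instructions can be modeled directly from the descriptions above. This is a sketch that works on the logical 6-bit immediate rather than on the raw instruction bits; the helper names are illustrative.

```python
def sext6(imm6: int) -> int:
    """Sign-extend a 6-bit field to a Python integer."""
    imm6 &= 0x3F
    return imm6 - 64 if imm6 & 0x20 else imm6


def c_li_value(imm6: int, xlen: int = 32) -> int:
    """Register value written by C.LI, as an unsigned XLEN-bit number."""
    return sext6(imm6) & ((1 << xlen) - 1)


def c_lui_value(imm6: int, xlen: int = 32) -> int:
    """Register value written by C.LUI: the immediate lands in bits 17:12,
    the low 12 bits are zero, and bit 17 is sign-extended upward."""
    if imm6 & 0x3F == 0:
        raise ValueError("C.LUI with imm=0 is reserved")
    return (sext6(imm6) << 12) & ((1 << xlen) - 1)


print(hex(c_lui_value(0x01)))
print(hex(c_lui_value(0x20)))
print(hex(c_lui_value(0x20, xlen=64)))
```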

7.1.4.2 Integer Register-Immediate Operations

These integer register-immediate operations are encoded in the CI format and perform operations on an integer register and a 6-bit immediate.

C.ADDI adds the non-zero sign-extended 6-bit immediate to the value in register rd then writes the result to rd. It expands into ADDI rd, rd, imm. The code points with rd≠x0 and imm=0 are HINTs. The code points with rd=x0 encode the C.NOP instruction, of which the code points with imm≠0 are HINTs.

C.ADDIW is an XLEN=64-only instruction that performs the same computation but produces a 32-bit result, then sign-extends the result to 64 bits. It expands into ADDIW rd, rd, imm. The immediate can be zero for C.ADDIW, where this corresponds to SEXT.W rd. C.ADDIW is valid only when rd≠x0; the code points with rd=x0 are reserved.

C.ADDI16SP (add immediate to stack pointer) shares the opcode with C.LUI, but has a destination field of x2. C.ADDI16SP adds the non-zero sign-extended 6-bit immediate to the value in the stack pointer (sp=x2), where the immediate is scaled to represent multiples of 16 in the range [-512, 496]. C.ADDI16SP is used to adjust the stack pointer in procedure prologues and epilogues. It expands into ADDI x2, x2, nzimm[9:4]. C.ADDI16SP is valid only when nzimm≠0; the code point with nzimm=0 is reserved.

note

In the standard RISC-V calling convention, the stack pointer sp is always 16-byte aligned.

C.ADDI4SPN (add immediate to stack pointer, non-destructive) is a CIW-format instruction that adds a zero-extended non-zero immediate, scaled by 4, to the stack pointer, x2, and writes the result to rd′. This instruction is used to generate pointers to stack-allocated variables, and expands to ADDI rd′, x2, nzuimm[9:2]. C.ADDI4SPN is valid only when nzuimm≠0; the code points with nzuimm=0 are reserved.
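
The ranges of the two stack-pointer adjustments follow from their scaling: a signed 6-bit value times 16 for C.ADDI16SP, and an unsigned 8-bit value times 4 for C.ADDI4SPN. The sketch below checks whether a byte amount is encodable and returns the scaled immediate before it is scattered into the instruction's bit fields; the helper names are hypothetical.

```python
def addi16sp_imm(adjust: int):
    """Scaled 6-bit two's-complement immediate for C.ADDI16SP, or None."""
    if adjust == 0 or adjust % 16 != 0 or not -512 <= adjust <= 496:
        return None
    return (adjust // 16) & 0x3F


def addi4spn_imm(offset: int):
    """Scaled 8-bit unsigned immediate for C.ADDI4SPN, or None."""
    if offset <= 0 or offset % 4 != 0 or offset > 255 * 4:
        return None
    return offset // 4


print(addi16sp_imm(-64))   # typical prologue: sp -= 64
print(addi4spn_imm(24))    # pointer to a local at sp + 24
```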

C.SLLI is a CI-format instruction that performs a logical left shift of the value in register rd then writes the result to rd. The shift amount is encoded in the shamt field. It expands into SLLI rd, rd, shamt[5:0].

The C.SLLI code points with shamt=0 or with rd=x0 are HINTs.

For XLEN=32, shamt[5] must be zero; the code points with shamt[5]=1 are designated for custom extensions.

C.SRLI is a CB-format instruction that performs a logical right shift of the value in register rd′ then writes the result to rd′. The shift amount is encoded in the shamt field. It expands into SRLI rd′, rd′, shamt.

The C.SRLI code points with shamt=0 are HINTs.

For XLEN=32, shamt[5] must be zero; the code points with shamt[5]=1 are designated for custom extensions.

C.SRAI is defined analogously to C.SRLI, but instead performs an arithmetic right shift. It expands to SRAI rd′, rd′, shamt.

note

Left shifts are usually more frequent than right shifts, as left shifts are frequently used to scale address values. Right shifts have therefore been granted less encoding space and are placed in an encoding quadrant where all other immediates are sign-extended.

C.ANDI is a CB-format instruction that computes the bitwise AND of the value in register rd′ and the sign-extended 6-bit immediate, then writes the result to rd′. It expands to ANDI rd′, rd′, imm.

7.1.4.3 Integer Register-Register Operations

These instructions use the CR format.

C.MV copies the value in register rs2 into register rd. It expands into ADD rd, x0, rs2. C.MV is valid only when rs2≠x0; the code points with rs2=x0 correspond to the C.JR instruction. The code points with rs2≠x0 and rd=x0 are HINTs.

note

C.MV expands to a different instruction than the canonical MV pseudoinstruction, which instead uses ADDI. Implementations that handle MV specially, e.g. using register-renaming hardware, may find it more convenient to expand C.MV to MV instead of ADD, at slight additional hardware cost.

C.ADD adds the values in registers rd and rs2 and writes the result to register rd. It expands into ADD rd, rd, rs2. C.ADD is only valid when rs2≠x0; the code points with rs2=x0 correspond to the C.JALR and C.EBREAK instructions. The code points with rs2≠x0 and rd=x0 are HINTs.
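
C.JR, C.MV, C.JALR, C.ADD, and C.EBREAK share one major opcode and are told apart by which register fields are zero. The sketch below shows one way a decoder might separate them. It assumes that bit 12 (the low bit of funct4) distinguishes the C.JR/C.MV pair from the C.EBREAK/C.JALR/C.ADD group, which comes from the standard encoding rather than from the text; the function name is illustrative.

```python
def decode_cr_group(inst: int) -> str:
    """Classify an instruction in the J[AL]R/MV/ADD group of quadrant 2."""
    # funct3 (bits 15:13) must be 100 and op (bits 1:0) must be 10.
    if inst & 0xE003 != 0x8002:
        raise ValueError("not in the J[AL]R/MV/ADD group")
    bit12 = (inst >> 12) & 1
    rd_rs1 = (inst >> 7) & 0x1F
    rs2 = (inst >> 2) & 0x1F
    if bit12 == 0:
        if rs2 == 0:
            return "C.JR" if rd_rs1 != 0 else "reserved"
        return "C.MV" if rd_rs1 != 0 else "HINT"
    if rs2 == 0:
        return "C.JALR" if rd_rs1 != 0 else "C.EBREAK"
    return "C.ADD" if rd_rs1 != 0 else "HINT"


for parcel in (0x8082, 0x852E, 0x9002, 0x9082, 0x952E):
    print(hex(parcel), decode_cr_group(parcel))
```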

These instructions use the CA format.

C.AND computes the bitwise AND of the values in registers rd′ and rs2′, then writes the result to register rd′. It expands into AND rd′, rd′, rs2′.

C.OR computes the bitwise OR of the values in registers rd′ and rs2′, then writes the result to register rd′. It expands into OR rd′, rd′, rs2′.

C.XOR computes the bitwise XOR of the values in registers rd′ and rs2′, then writes the result to register rd′. It expands into XOR rd′, rd′, rs2′.

C.SUB subtracts the value in register rs2′ from the value in register rd′, then writes the result to register rd′. It expands into SUB rd′, rd′, rs2′.

C.ADDW is an XLEN=64-only instruction that adds the values in registers rd′ and rs2′, then sign-extends the lower 32 bits of the sum before writing the result to register rd′. It expands into ADDW rd′, rd′, rs2′.

C.SUBW is an XLEN=64-only instruction that subtracts the value in register rs2′ from the value in register rd′, then sign-extends the lower 32 bits of the difference before writing the result to register rd′. It expands into SUBW rd′, rd′, rs2′.

note

These six instructions do not provide large savings individually, but they do not occupy much encoding space and are straightforward to implement, and as a group they provide a worthwhile improvement in static and dynamic compression.

7.1.4.4 Defined Illegal Instruction

A 16-bit instruction with all bits zero is permanently reserved as an illegal instruction.

note

We reserve all-zero instructions to be illegal instructions to help trap attempts to execute zeroed or non-existent portions of the memory space. The all-zero value should not be redefined in any non-standard extension. Similarly, we reserve instructions with all bits set to 1 (corresponding to very long instructions in the RISC-V variable-length encoding scheme) as illegal to capture another common value seen in non-existent memory regions.

7.1.4.5 NOP Instruction

C.NOP is a CI-format instruction that does not change any user-visible state, except for advancing the pc and incrementing any applicable performance counters. C.NOP expands to NOP. The C.NOP code points with imm≠0 encode HINTs.

7.1.4.6 Breakpoint Instruction

Debuggers can use the C.EBREAK instruction, which expands to EBREAK, to cause control to be transferred back to the debugging environment. C.EBREAK shares the opcode with the C.ADD instruction, but with rd and rs2 both zero, and can therefore also use the CR format.

7.1.5 Usage of Compressed Instructions in LR/SC Sequences

On implementations that support the Zca extension, compressed forms of instructions permitted inside constrained LR/SC sequences, as described in Section 5.2.1, are also permitted inside constrained LR/SC sequences.

note

The implication is that any implementation that claims to support both the A and Zca extensions must ensure that LR/SC sequences containing valid Zca instructions will eventually complete.

7.1.6 HINT Instructions

A portion of the Zca encoding space is reserved for microarchitectural HINTs. Like the HINTs in the RV32I base ISA (see HINT Instructions), these instructions do not modify any architectural state, except for advancing the pc and any applicable performance counters. HINTs are executed as no-ops on implementations that ignore them.

Zca HINTs are encoded as computational instructions that do not modify the architectural state, either because rd=x0 (e.g. C.ADD x0, t0), or because rd is overwritten with a copy of itself (e.g. C.ADDI t0, 0).

note

This HINT encoding has been chosen so that simple implementations can ignore HINTs altogether, and instead execute a HINT as a regular computational instruction that happens not to mutate the architectural state.

Zca HINTs do not necessarily expand to their RVI HINT counterparts. For example, C.ADD x0, a0 might not encode the same HINT as ADD x0, x0, a0.

note

The primary reason to not require a Zca HINT to expand to an RVI HINT is that HINTs are unlikely to be compressible in the same manner as the underlying computational instruction. Also, decoupling the Zca and RVI HINT mappings allows the scarce Zca HINT space to be allocated to the most popular HINTs, and in particular, to HINTs that are amenable to macro-op fusion.

Table 33 lists all Zca HINT code points. For XLEN=32, 78% of the HINT space is reserved for standard HINTs. The remainder of the HINT space is designated for custom HINTs; no standard HINTs will ever be defined in this subspace.

Table 33. Zca HINT instructions.

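Instruction	Constraints	Code Points	Purpose
C.NOP	imm≠0	63	Designated for future standard use
C.ADDI	rd≠`x0`, imm=0	31	Designated for future standard use
C.LI	rd=`x0`	64	Designated for future standard use
C.LUI	rd=`x0`, imm≠0	63	Designated for future standard use
C.MV	rd=`x0`, rs2≠`x0`	31	Designated for future standard use
C.ADD	rd=`x0`, rs2≠`x0`, rs2≠`x2-x5`	27	Designated for future standard use
C.ADD	rd=`x0`, rs2=`x2-x5`	4	rs2=`x2`: C.NTL.P1; rs2=`x3`: C.NTL.PALL; rs2=`x4`: C.NTL.S1; rs2=`x5`: C.NTL.ALL
C.SLLI	rd=`x0` or imm=0	63 (RV32), 95 (RV64)	Designated for custom use
C.SRLI	imm=0	8	Designated for custom use
C.SRAI	imm=0	8	Designated for custom use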
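
The 78% figure quoted for XLEN=32 can be reproduced from the code-point counts in Table 33. The sketch below assumes that the C.NOP through C.ADD rows (including the four C.NTL code points) count as standard and that the C.SLLI, C.SRLI, and C.SRAI rows count as custom; the dictionary names are illustrative.

```python
# Code-point counts from Table 33, using the RV32 figure for C.SLLI.
STANDARD_HINTS = {
    "C.NOP": 63,
    "C.ADDI": 31,
    "C.LI": 64,
    "C.LUI": 63,
    "C.MV": 31,
    "C.ADD": 27,
    "C.ADD (C.NTL.*)": 4,
}
CUSTOM_HINTS_RV32 = {"C.SLLI": 63, "C.SRLI": 8, "C.SRAI": 8}

standard = sum(STANDARD_HINTS.values())
custom = sum(CUSTOM_HINTS_RV32.values())
share = standard / (standard + custom)

print(f"{standard} standard + {custom} custom code points, {share:.0%} standard")
```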

7.1.7 `Zca` Instruction Set Listings

Table 34 shows a map of the major opcodes for the compressed extensions. Each row of the table corresponds to one quadrant of the encoding space. The last quadrant, which has the two least-significant bits set, corresponds to instructions wider than 16 bits, including those in the base ISAs. Several instructions are only valid for certain operands; when invalid, they are marked either RES to indicate that the opcode is reserved for future standard extensions; Custom to indicate that the opcode is designated for custom extensions; or HINT to indicate that the opcode is reserved for microarchitectural hints (see Section 7.1.6).

Table 34. Zca opcode map.

inst[1:0] \ inst[15:13]	000	001	010	011	100	101	110	111	Base ISA
00	ADDI4SPN	FLD	LW	LD	Reserved	FSD	SW	SD	RV64
01	ADDI	ADDIW	LI	LUI/ADDI16SP	MISC-ALU	J	BEQZ	BNEZ	RV64
10	SLLI	FLDSP	LWSP	LDSP	J[AL]R/MV/ADD	FSDSP	SWSP	SDSP	RV64
11	>16b
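
A first-level decoder can be sketched from this map: the two low bits select the quadrant, and for 16-bit instructions the three high bits select the major opcode. The lookup table below copies the RV64 entries of Table 34; it assumes the column index is inst[15:13], as in the standard encoding, and the names are illustrative.

```python
# Major opcodes per quadrant, indexed by inst[15:13] (RV64 entries of Table 34).
OPCODE_MAP = {
    0b00: ["ADDI4SPN", "FLD", "LW", "LD", "Reserved", "FSD", "SW", "SD"],
    0b01: ["ADDI", "ADDIW", "LI", "LUI/ADDI16SP", "MISC-ALU", "J", "BEQZ", "BNEZ"],
    0b10: ["SLLI", "FLDSP", "LWSP", "LDSP", "J[AL]R/MV/ADD", "FSDSP", "SWSP", "SDSP"],
}


def major_opcode(parcel: int) -> str:
    """Classify the first 16-bit parcel of an instruction."""
    parcel &= 0xFFFF
    if parcel == 0:
        return "illegal"            # the all-zero encoding is the defined illegal instruction
    quadrant = parcel & 0b11
    if quadrant == 0b11:
        return ">16b"               # not a compressed instruction
    return OPCODE_MAP[quadrant][(parcel >> 13) & 0b111]


for parcel in (0x40B2, 0x4108, 0xA001, 0x8082, 0x2083, 0x0000):
    print(hex(parcel), major_opcode(parcel))
```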

Zca Instruction listing, Quadrant 0, Zca Instruction listing, Quadrant 1, and Zca Instruction listing, Quadrant 2 list the Zca instructions.

7.2 `Zcf` Extension for Single-Precision Floating-Point Compressed Instructions

The Zcf extension adds compressed single-precision floating-point load and store instructions. It is an XLEN=32-only extension.

7.2.1 Stack-Pointer-Based Loads and Stores

C.FLWSP is an RV32FC-only instruction that loads a single-precision floating-point value from memory into floating-point register rd. It computes its effective address by adding the zero-extended offset, scaled by 4, to the stack pointer, x2. It expands to FLW rd, offset(x2). C.FLWSP uses the CI format.

C.FSWSP is an RV32FC-only instruction that stores a single-precision floating-point value in floating-point register rs2 to memory. It computes an effective address by adding the zero-extended offset, scaled by 4, to the stack pointer, x2. It expands to FSW rs2, offset(x2).

7.2.2 Register-Based Loads and Stores

Compressed register-based floating-point loads and stores use the CL and CS formats, respectively, with the eight registers mapping to f8 to f15.

note

The standard RISC-V calling convention maps the most frequently used floating-point registers to registers f8 to f15, which allows the same register decompression decoding as for integer register numbers.

These instructions encode their data source or destination as described in the following table.

Table 35. Floating-point registers specified by the three-bit rd′ and rs2′ fields of the CL and CS formats.

`Zcf` Register Number	`000`	`001`	`010`	`011`	`100`	`101`	`110`	`111`
Floating-Point Register Number	`f8`	`f9`	`f10`	`f11`	`f12`	`f13`	`f14`	`f15`
Floating-Point Register ABI Name	`fs0`	`fs1`	`fa0`	`fa1`	`fa2`	`fa3`	`fa4`	`fa5`

C.FLW is an RV32FC-only instruction that loads a single-precision floating-point value from memory into floating-point register _rd′_. It computes an effective address by adding the zero-extended offset, scaled by 4, to the base address in register _rs1′_. It expands to FLW rd′, offset(rs1′).

C.FSW is an RV32FC-only instruction that stores a single-precision floating-point value in floating-point register _rs2′_ to memory. It computes an effective address by adding the zero-extended offset, scaled by 4, to the base address in register _rs1′_. It expands to FSW rs2′, offset(rs1′).

7.3 `Zcd` Extension for Double-Precision Floating-Point Compressed Instructions

The Zcd extension adds compressed double-precision floating-point load and store instructions.

note

Double-precision loads and stores represent a significant fraction of static instructions in programs compiled for the ILP32D and LP64D calling conventions, due in large part to callee-saved register spills and fills.

7.3.1 Stack-Pointer-Based Loads and Stores

C.FLDSP is an RV32DC/RV64DC-only instruction that loads a double-precision floating-point value from memory into floating-point register rd. It computes its effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to FLD rd, offset(x2).

C.FSDSP is an RV32DC/RV64DC-only instruction that stores a double-precision floating-point value in floating-point register rs2 to memory. It computes an effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to FSD rs2, offset(x2).

7.3.2 Register-Based Loads and Stores

These instructions encode their data source or destination as described in Table 35.

C.FLD is an RV32DC/RV64DC-only instruction that loads a double-precision floating-point value from memory into floating-point register _rd′_. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register _rs1′_. It expands to FLD rd′, offset(rs1′).

C.FSD is an RV32DC/RV64DC-only instruction that stores a double-precision floating-point value in floating-point register _rs2′_ to memory. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register _rs1′_. It expands to FSD rs2′, offset(rs1′).

7.4 `C` Extension for Compressed Instructions

This section describes the C extension, which incorporates the compressed instruction-set extensions designed for application- and server-class processors. The C extension substantially improves code density across a wide range of applications, thereby improving performance, area-efficiency, and energy-efficiency of processors with instruction caches. It excludes features that improve code density at the cost of performance.

The C extension depends upon the Zca extension.

If XLEN=32 and the F extension is present, the C extension additionally depends upon the Zcf extension.

If the D extension is present, the C extension additionally depends upon the Zcd extension.
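The three dependency rules above can be sketched as a small helper. The function name and its boolean parameters are hypothetical, and the sketch does not check that the requested combination of base extensions is itself legal.

```python
def c_extension_components(xlen: int, has_f: bool, has_d: bool) -> list[str]:
    """List the Zc* extensions that the C extension depends on."""
    components = ["Zca"]              # always required
    if xlen == 32 and has_f:
        components.append("Zcf")      # XLEN=32 only, and only with F
    if has_d:
        components.append("Zcd")      # any XLEN, only with D
    return components
```

For example, an RV64 configuration with F and D depends on Zca and Zcd but not on Zcf.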

7.5 `Zcb` Extension for Additional Compressed Instructions

The Zcb extension adds several compressed instructions which, like those in the Zca extension, expand into a single 32-bit instruction. The Zcb extension depends on the Zca extension.

As shown on the individual instruction pages, many of the instructions in Zcb depend upon another extension being implemented. For example, C.MUL is only implemented if M or Zmmul is implemented, and C.SEXT.B is only implemented if Zbb is implemented.

RV32	RV64	Mnemonic	Instruction
yes	yes	C.LBU rd', uimm(rs1')	Load unsigned byte, 16-bit encoding
yes	yes	C.LHU rd', uimm(rs1')	Load unsigned halfword, 16-bit encoding
yes	yes	C.LH rd', uimm(rs1')	Load signed halfword, 16-bit encoding
yes	yes	C.SB rs2', uimm(rs1')	Store byte, 16-bit encoding
yes	yes	C.SH rs2', uimm(rs1')	Store halfword, 16-bit encoding
yes	yes	C.ZEXT.B rsd'	Zero extend byte, 16-bit encoding
yes	yes	C.SEXT.B rsd'	Sign extend byte, 16-bit encoding
yes	yes	C.ZEXT.H rsd'	Zero extend halfword, 16-bit encoding
yes	yes	C.SEXT.H rsd'	Sign extend halfword, 16-bit encoding
	yes	C.ZEXT.W rsd'	Zero extend word, 16-bit encoding
yes	yes	C.NOT rsd'	Bitwise not, 16-bit encoding
yes	yes	C.MUL rsd', rs2'	Multiply, 16-bit encoding

7.5.1 C.LBU

Synopsis Load unsigned byte, 16-bit encoding

Mnemonic C.LBU rd', uimm(rs1')

Encoding (RV32, RV64):


The immediate offset is formed as follows:

  uimm[31:2] = 0;
  uimm[1]    = encoding[5];
  uimm[0]    = encoding[6];

Description This instruction loads a byte from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting byte is zero extended to XLEN bits and is written to rd'.

note

rd' and rs1' are from the standard 8-register set x8-x15.

Prerequisites None

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

X(rdc) = EXTZ(mem[X(rs1c)+EXTZ(uimm)][7..0]);
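The immediate formation and the load can be sketched as follows. This is an illustrative model only: the register file is a dictionary keyed by register name, memory is a dictionary of byte values keyed by address, and the function names are assumptions rather than anything defined by the specification.

```python
def c_lbu_uimm(encoding: int) -> int:
    """Assemble the 2-bit byte offset of C.LBU from its 16-bit encoding."""
    bit1 = (encoding >> 5) & 1    # uimm[1] = encoding[5]
    bit0 = (encoding >> 6) & 1    # uimm[0] = encoding[6]
    return (bit1 << 1) | bit0     # uimm[31:2] = 0

def c_lbu(regs: dict, mem: dict, rdc: str, rs1c: str, uimm: int) -> None:
    """Load one byte and zero-extend it into the destination register."""
    regs[rdc] = mem[regs[rs1c] + uimm] & 0xFF
```

Note that the two offset bits are swapped relative to their positions in the encoding.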

7.5.2 C.LHU

Synopsis Load unsigned halfword, 16-bit encoding

Mnemonic C.LHU rd', uimm(rs1')

Encoding (RV32, RV64):


The immediate offset is formed as follows:

  uimm[31:2] = 0;
  uimm[1]    = encoding[5];
  uimm[0]    = 0;

Description This instruction loads a halfword from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting halfword is zero extended to XLEN bits and is written to rd'.

note

rd' and rs1' are from the standard 8-register set x8-x15.

Prerequisites None

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

X(rdc) = EXTZ(load_mem[X(rs1c)+EXTZ(uimm)][15..0]);

7.5.3 C.LH

Synopsis Load signed halfword, 16-bit encoding

Mnemonic C.LH rd', uimm(rs1')

Encoding (RV32, RV64):


The immediate offset is formed as follows:

  uimm[31:2] = 0;
  uimm[1]    = encoding[5];
  uimm[0]    = 0;

Description This instruction loads a halfword from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting halfword is sign extended to XLEN bits and is written to rd'.

note

rd' and rs1' are from the standard 8-register set x8-x15.

Prerequisites None

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

X(rdc) = EXTS(load_mem[X(rs1c)+EXTZ(uimm)][15..0]);

7.5.4 C.SB

Synopsis Store byte, 16-bit encoding

Mnemonic C.SB rs2', uimm(rs1')

Encoding (RV32, RV64):


The immediate offset is formed as follows:

  uimm[31:2] = 0;
  uimm[1]    = encoding[5];
  uimm[0]    = encoding[6];

Description This instruction stores the least significant byte of rs2' to the memory address formed by adding rs1' to the zero extended immediate uimm.

note

rs1' and rs2' are from the standard 8-register set x8-x15.

Prerequisites None

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

mem[X(rs1c)+EXTZ(uimm)][7..0] = X(rs2c);

7.5.5 C.SH

Synopsis Store halfword, 16-bit encoding

Mnemonic C.SH rs2', uimm(rs1')

Encoding (RV32, RV64):


The immediate offset is formed as follows:

  uimm[31:2] = 0;
  uimm[1]    = encoding[5];
  uimm[0]    = 0;

Description This instruction stores the least significant halfword of rs2' to the memory address formed by adding rs1' to the zero extended immediate uimm.

note

rs1' and rs2' are from the standard 8-register set x8-x15.

Prerequisites None

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

mem[X(rs1c)+EXTZ(uimm)][15..0] = X(rs2c);

7.5.6 C.ZEXT.B

Synopsis Zero extend byte, 16-bit encoding

Mnemonic C.ZEXT.B rd'/rs1'

Encoding (RV32, RV64):


Description This instruction takes a single source/destination operand. It zero-extends the least-significant byte of the operand to XLEN bits by inserting zeros into all of the bits more significant than 7.

note

rd'/rs1' is from the standard 8-register set x8-x15.

Prerequisites None

32-bit equivalent:

andi rd'/rs1', rd'/rs1', 0xff

note

The SAIL module variable for rd'/rs1' is called rsdc.

Operation

X(rsdc) = EXTZ(X(rsdc)[7..0]);

7.5.7 C.SEXT.B

Synopsis Sign extend byte, 16-bit encoding

Mnemonic C.SEXT.B rd'/rs1'

Encoding (RV32, RV64):


Description This instruction takes a single source/destination operand. It sign-extends the least-significant byte in the operand to XLEN bits by copying the most-significant bit in the byte (i.e., bit 7) to all of the more-significant bits.

note

rd'/rs1' is from the standard 8-register set x8-x15.

Prerequisites Zbb is also required.

note

The SAIL module variable for rd'/rs1' is called rsdc.

Operation

X(rsdc) = EXTS(X(rsdc)[7..0]);

7.5.8 C.ZEXT.H

Synopsis Zero extend halfword, 16-bit encoding

Mnemonic C.ZEXT.H rd'/rs1'

Encoding (RV32, RV64):


Description This instruction takes a single source/destination operand. It zero-extends the least-significant halfword of the operand to XLEN bits by inserting zeros into all of the bits more significant than 15.

note

rd'/rs1' is from the standard 8-register set x8-x15.

Prerequisites Zbb is also required.

note

The SAIL module variable for rd'/rs1' is called rsdc.

Operation

X(rsdc) = EXTZ(X(rsdc)[15..0]);

7.5.9 C.SEXT.H

Synopsis Sign extend halfword, 16-bit encoding

Mnemonic C.SEXT.H rd'/rs1'

Encoding (RV32, RV64):


Description This instruction takes a single source/destination operand. It sign-extends the least-significant halfword in the operand to XLEN bits by copying the most-significant bit in the halfword (i.e., bit 15) to all of the more-significant bits.

note

rd'/rs1' is from the standard 8-register set x8-x15.

Prerequisites Zbb is also required.

note

The SAIL module variable for rd'/rs1' is called rsdc.

Operation

X(rsdc) = EXTS(X(rsdc)[15..0]);
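The EXTZ and EXTS operations used by the byte and halfword extension instructions above can be modelled on unsigned Python integers. This is a sketch under the assumption that register values are held as non-negative integers below 2^XLEN; the helper names extz and exts mirror the pseudocode but are otherwise illustrative.

```python
def extz(value: int, width: int) -> int:
    """Keep the low `width` bits; all more-significant bits become zero."""
    return value & ((1 << width) - 1)

def exts(value: int, width: int, xlen: int) -> int:
    """Copy bit width-1 of the low field into every more-significant bit."""
    low = value & ((1 << width) - 1)
    if low >> (width - 1):                      # sign bit of the field is set
        low |= ((1 << xlen) - 1) & ~((1 << width) - 1)
    return low
```

With width 8 these model C.ZEXT.B and C.SEXT.B, and with width 16 they model C.ZEXT.H and C.SEXT.H.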

7.5.10 C.ZEXT.W

Synopsis Zero extend word, 16-bit encoding

Mnemonic C.ZEXT.W rd'/rs1'

Encoding (RV64):


Description This instruction takes a single source/destination operand. It zero-extends the least-significant word of the operand to XLEN bits by inserting zeros into all of the bits more significant than 31.

note

rd'/rs1' is from the standard 8-register set x8-x15.

Prerequisites Zba is also required.

32-bit equivalent:

add.uw rd'/rs1', rd'/rs1', zero

note

The SAIL module variable for rd'/rs1' is called rsdc.

Operation

X(rsdc) = EXTZ(X(rsdc)[31..0]);

7.5.11 C.NOT

Synopsis Bitwise not, 16-bit encoding

Mnemonic C.NOT rd'/rs1'

Encoding (RV32, RV64):


Description This instruction takes the one’s complement of rd'/rs1' and writes the result to the same register.

note

rd'/rs1' is from the standard 8-register set x8-x15.

Prerequisites None

32-bit equivalent:

xori rd'/rs1', rd'/rs1', -1

note

The SAIL module variable for rd'/rs1' is called rsdc.

Operation

X(rsdc) = X(rsdc) XOR -1;

7.5.12 C.MUL

Synopsis Multiply, 16-bit encoding

Mnemonic C.MUL rsd', rs2'

Encoding (RV32, RV64):


Description This instruction multiplies XLEN bits of the source operands from rsd' and rs2' and writes the lowest XLEN bits of the result to rsd'.

note

rd'/rs1' and rs2' are from the standard 8-register set x8-x15.

Prerequisites M or Zmmul must be configured.

note

The SAIL module variable for rd'/rs1' is called rsdc, and for rs2' is called rs2c.

Operation

let result_wide = to_bits(2 * sizeof(xlen), signed(X(rsdc)) * signed(X(rs2c)));
X(rsdc) = result_wide[(sizeof(xlen) - 1) .. 0];
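A rough Python equivalent of the operation above is shown below. It assumes register values are stored as unsigned integers below 2^XLEN; the function name c_mul and the nested helper are illustrative.

```python
def c_mul(rsd: int, rs2: int, xlen: int) -> int:
    """Return the low XLEN bits of the signed product of two XLEN-bit values."""
    mask = (1 << xlen) - 1

    def to_signed(value: int) -> int:
        value &= mask
        return value - (1 << xlen) if value >> (xlen - 1) else value

    wide = to_signed(rsd) * to_signed(rs2)   # exact, arbitrary precision
    return wide & mask                        # truncate to XLEN bits
```

Because only the low XLEN bits are kept, interpreting the operands as unsigned would give the same result.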

7.6 `Zcmt` Extension for Compressed Table Jumps

The Zcmt extension adds table-jump instructions, which improve code density when procedures have many call sites. It also adds the jvt CSR. The jvt CSR requires a state enable if Smstateen is implemented. See jvt CSR, table jump base vector and control register for details.

The Zcmt extension conflicts with the Zcd extension.

note

Zcmt is primarily targeted at embedded class CPUs due to implementation complexity. Additionally, it is not compatible with RVA profiles.

The Zcmt extension depends on the Zca and Zicsr extensions.

RV32	RV64	Mnemonic	Instruction
yes	yes	CM.JT index	Jump via table
yes	yes	CM.JALT index	Jump and link via table

7.6.1 Table Jump Overview

CM.JT (Jump via table) and CM.JALT (Jump and link via table) are referred to as table jump.

Table jump uses a 256-entry XLEN wide table in instruction memory to contain function addresses. The table must be aligned to at least a 64-byte boundary.

Table entries follow the current data endianness. This is different from normal instruction fetch which is always little-endian.

CM.JT and CM.JALT encodings index the table, giving access to functions within the full XLEN wide address space.

This is used as a form of dictionary compression to reduce the code size of JAL / AUIPC+JALR / JR / AUIPC+JR instructions.

Table jump allows the linker to replace the following instruction sequences with a CM.JT or CM.JALT encoding, and an entry in the table:

32-bit J calls
32-bit JAL ra calls
64-bit AUIPC+JR calls to fixed locations
64-bit AUIPC+JALR ra calls to fixed locations
- The AUIPC+JR/JALR sequence is used because the offset from the PC is out of the ±1 MB range.

If a return address stack is implemented, then as CM.JALT is equivalent to JAL ra, it pushes to the stack.

7.6.2 `jvt`

The base of the table is held in the jvt CSR (see jvt CSR, table jump base vector and control register); each table entry is XLEN bits.

If the same function is called with and without linking then it must have two entries in the table. This is typically caused by the same function being called with and without tail calling.

7.6.3 Table Jump Fault handling

For a table jump instruction, the table entry that the instruction selects is considered an extension of the instruction itself. Hence, the execution of a table jump instruction involves two instruction fetches, the first to read the instruction (CM.JT/CM.JALT) and the second to read from the jump vector table (JVT). Both instruction fetches are implicit reads, and both require execute permission; read permission is irrelevant. It is recommended that the second fetch be ignored for hardware triggers and breakpoints.

Memory writes to the jump vector table require an instruction barrier (FENCE.I) to guarantee that they are visible to the instruction fetch.

Multiple contexts may have different jump vector tables. JVT may be switched between them without an instruction barrier if the tables have not been updated in memory since the last FENCE.I.

If an exception occurs on either instruction fetch, xEPC is set to the PC of the table jump instruction, xCAUSE is set as expected for the type of fault and xTVAL (if not set to zero) contains the fetch address which caused the fault.

7.6.4 `jvt` CSR

Synopsis Table jump base vector and control register

Address:

0x017

Permissions:

URW

Format (RV32):


Format (RV64):


Description The jvt register is an XLEN-bit WARL read/write register that holds the jump table configuration, consisting of the jump table base address (BASE) and the jump table mode (MODE).

If Zcmt is implemented then jvt must also be implemented, but can contain a read-only value. If jvt is writable, the set of values the register may hold can vary by implementation. The value in the BASE field must always be aligned on a 64-byte boundary. Note that the CSR contains only bits XLEN-1 through 6 of the address base. When computing jump-table accesses, the lower six bits of base are filled with zeroes to obtain an XLEN-bit jump-table base address jvt.⁠BASE that is always aligned on a 64-byte boundary.

jvt.⁠BASE is a virtual address, whenever virtual memory is enabled.

The memory pointed to by jvt.⁠BASE is treated as instruction memory for the purpose of executing table jump instructions, implying execute access permission.
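The split of the register into its two fields can be sketched as below. The text states that the CSR holds bits XLEN-1 through 6 of the base; the sketch additionally assumes that the six-bit MODE field occupies bits 5 through 0, which is consistent with that description but is not spelled out in the prose here. The function name split_jvt is illustrative.

```python
def split_jvt(jvt: int, xlen: int) -> tuple[int, int]:
    """Return (base address, mode) for a raw jvt CSR value."""
    mask = (1 << xlen) - 1
    mode = jvt & 0x3F                # assumed position: bits 5..0
    base = jvt & mask & ~0x3F        # low six bits filled with zeroes
    return base, mode
```

The returned base is always a multiple of 64, matching the alignment requirement.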

Table 36. jvt.⁠MODE definition.

`jvt.⁠MODE`	Comment
000000	Jump table mode
others	reserved for future standard use

jvt.⁠MODE is a WARL field, so can only be programmed to modes which are implemented. Therefore the discovery mechanism is to attempt to program different modes and read back the values to see which are available. Jump table mode must be implemented.

note

In the future, the RISC-V Unified Discovery method will report the available modes.

Architectural State:

The jvt CSR adds architectural state to the system software context (such as an OS process), and therefore must be saved and restored on context switches.

7.6.5 CM.JT

Synopsis jump via table

Mnemonic CM.JT index

Encoding (RV32, RV64):


note

This encoding decodes as CM.JT only when index<32; otherwise it decodes as CM.JALT (see Jump and link via table).

note

If jvt.⁠MODE = 0 (Jump Table Mode) then CM.JT behaves as specified here. If jvt.⁠MODE is a reserved value, then CM.JT is also reserved. In the future other defined values of jvt.⁠MODE may change the behaviour of CM.JT.

Assembly Syntax:

cm.jt index

Description CM.JT reads an entry from the jump vector table in memory and jumps to the address that was read.

For further information see Table Jump Overview.

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists.

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

# table_address is temporary internal state, it doesn't represent a real register
# InstMemory is byte indexed

switch(XLEN) {
  32:  table_address[XLEN-1:0] = jvt.base + (index<<2);
  64:  table_address[XLEN-1:0] = jvt.base + (index<<3);
}

//fetch from the jump table
pc = InstMemory[table_address][XLEN-1:0]&~0x1;  // Clear bit 0.
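The table lookup can be sketched in Python as follows. The model is deliberately simple and rests on assumptions: inst_memory is a hypothetical dictionary that maps a byte address directly to a whole XLEN-bit table entry, so endianness and permission checks are not modelled, and the function name is illustrative.

```python
def table_jump_target(jvt_base: int, index: int, xlen: int, inst_memory: dict) -> int:
    """Return the new pc for a table jump with the given table index."""
    if xlen not in (32, 64):
        raise ValueError("XLEN must be 32 or 64")
    if not 0 <= index < 256:
        raise ValueError("the table has 256 entries")
    mask = (1 << xlen) - 1
    entry_bytes = xlen // 8                    # index<<2 for RV32, index<<3 for RV64
    table_address = (jvt_base + index * entry_bytes) & mask
    return inst_memory[table_address] & mask & ~1   # clear bit 0
```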

7.6.6 CM.JALT

Synopsis jump via table with optional link

Mnemonic CM.JALT index

Encoding (RV32, RV64):


note

This encoding decodes as CM.JALT only when index>=32; otherwise it decodes as CM.JT (see Jump via table).

note

If jvt.⁠MODE = 0 (Jump Table Mode) then CM.JALT behaves as specified here. If jvt.⁠MODE is a reserved value, then CM.JALT is also reserved. In the future other defined values of jvt.⁠MODE may change the behaviour of CM.JALT.

Assembly Syntax:

cm.jalt index

Description CM.JALT reads an entry from the jump vector table in memory and jumps to the address that was read, linking to ra.

For further information see Table Jump Overview.

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists.

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

# table_address is temporary internal state, it doesn't represent a real register
# InstMemory is byte indexed

switch(XLEN) {
  32:  table_address[XLEN-1:0] = jvt.base + (index<<2);
  64:  table_address[XLEN-1:0] = jvt.base + (index<<3);
}

//fetch from the jump table

ra = pc+2;
pc = InstMemory[table_address][XLEN-1:0]&~0x1;  // Clear bit 0.
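The split between the two instructions on the shared encoding, and the link value written by CM.JALT, can be illustrated as below. The jump vector table is modelled as a plain Python list of already-fetched entries, which is an assumption made for brevity; the function name is hypothetical.

```python
def exec_table_jump(pc: int, index: int, table: list, xlen: int = 32):
    """Return (mnemonic, new pc, new ra or None) for a table jump at `pc`."""
    mask = (1 << xlen) - 1
    target = table[index] & mask & ~1          # clear bit 0 of the entry
    if index < 32:
        return "cm.jt", target, None           # no link register update
    return "cm.jalt", target, (pc + 2) & mask  # link to the next 16-bit slot
```

The link value is pc+2 because the table jump instruction itself is 16 bits long.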

7.7 `Zcmp` Extension for Compressed Prologues and Epilogues

The Zcmp extension adds instructions that substantially reduce the static code size of procedure prologues and epilogues. The instructions it adds are collectively referred to as PUSH/POP:

cm.push
cm.pop
cm.popret
cm.popretz

The term PUSH refers to CM.PUSH.

The term POP refers to CM.POP.

The term POPRET refers to CM.POPRET and CM.POPRETZ.

Common details for these instructions are in this section.

7.7.1 PUSH/POP functional overview

PUSH, POP, POPRET are used to reduce the size of function prologues and epilogues.

The PUSH instruction

adjusts the stack pointer to create the stack frame
pushes (stores) the registers specified in the register list to the stack frame

The POP instruction

pops (loads) the registers in the register list from the stack frame
adjusts the stack pointer to destroy the stack frame

The POPRET instructions

pop (load) the registers in the register list from the stack frame
CM.POPRETZ also moves zero into a0 as the return value
adjust the stack pointer to destroy the stack frame
execute a ret instruction to return from the function

7.7.2 Example usage

This example gives an illustration of the use of PUSH and POPRET.

The example uses the function processMarkers from the EMBench benchmark picojpeg, found in the file libpicojpeg.c on GitHub.

The prologue and epilogue compile with GCC10 to:

   0001098a <processMarkers>:
   1098a:       711d                    addi    sp,sp,-96 ;#cm.push(1)
   1098c:       c8ca                    sw      s2,80(sp) ;#cm.push(2)
   1098e:       c6ce                    sw      s3,76(sp) ;#cm.push(3)
   10990:       c4d2                    sw      s4,72(sp) ;#cm.push(4)
   10992:       ce86                    sw      ra,92(sp) ;#cm.push(5)
   10994:       cca2                    sw      s0,88(sp) ;#cm.push(6)
   10996:       caa6                    sw      s1,84(sp) ;#cm.push(7)
   10998:       c2d6                    sw      s5,68(sp) ;#cm.push(8)
   1099a:       c0da                    sw      s6,64(sp) ;#cm.push(9)
   1099c:       de5e                    sw      s7,60(sp) ;#cm.push(10)
   1099e:       dc62                    sw      s8,56(sp) ;#cm.push(11)
   109a0:       da66                    sw      s9,52(sp) ;#cm.push(12)
   109a2:       d86a                    sw      s10,48(sp);#cm.push(13)
   109a4:       d66e                    sw      s11,44(sp);#cm.push(14)
...
   109f4:       4501                    li      a0,0      ;#cm.popretz(1)
   109f6:       40f6                    lw      ra,92(sp) ;#cm.popretz(2)
   109f8:       4466                    lw      s0,88(sp) ;#cm.popretz(3)
   109fa:       44d6                    lw      s1,84(sp) ;#cm.popretz(4)
   109fc:       4946                    lw      s2,80(sp) ;#cm.popretz(5)
   109fe:       49b6                    lw      s3,76(sp) ;#cm.popretz(6)
   10a00:       4a26                    lw      s4,72(sp) ;#cm.popretz(7)
   10a02:       4a96                    lw      s5,68(sp) ;#cm.popretz(8)
   10a04:       4b06                    lw      s6,64(sp) ;#cm.popretz(9)
   10a06:       5bf2                    lw      s7,60(sp) ;#cm.popretz(10)
   10a08:       5c62                    lw      s8,56(sp) ;#cm.popretz(11)
   10a0a:       5cd2                    lw      s9,52(sp) ;#cm.popretz(12)
   10a0c:       5d42                    lw      s10,48(sp);#cm.popretz(13)
   10a0e:       5db2                    lw      s11,44(sp);#cm.popretz(14)
   10a10:       6125                    addi    sp,sp,96  ;#cm.popretz(15)
   10a12:       8082                    ret               ;#cm.popretz(16)

with the GCC option -msave-restore the output is the following:

0001080e <processMarkers>:
   1080e:       73a012ef                jal     t0,11f48 <__riscv_save_12>
   10812:       1101                    addi    sp,sp,-32
...
   10862:       4501                    li      a0,0
   10864:       6105                    addi    sp,sp,32
   10866:       71e0106f                j       11f84 <__riscv_restore_12>

with PUSH/POPRET this reduces to

0001080e <processMarkers>:
   1080e:       b8fa                    cm.push    {ra,s0-s11},-96
...
   10866:       bcfa                    cm.popretz {ra,s0-s11}, 96

The prologue and epilogue shrink from 60 bytes in the original code to 14 bytes with -msave-restore, and to 4 bytes with PUSH and POPRET. As well as reducing code size, PUSH and POPRET eliminate the branches to the millicode save/restore routines and so may also perform better.
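The three byte counts can be re-derived from the listings above by counting 16-bit and 32-bit encodings. The variable names in this short calculation are illustrative.

```python
C16, I32 = 2, 4   # bytes per 16-bit and per 32-bit instruction

# Original code: 14 prologue and 16 epilogue instructions, all 16-bit.
original = (14 + 16) * C16

# -msave-restore: jal + addi in the prologue; li + addi + j in the epilogue.
save_restore = (I32 + C16) + (C16 + C16 + I32)

# PUSH/POPRET: one 16-bit cm.push and one 16-bit cm.popretz.
push_popretz = C16 + C16
```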

note

The calls to <__riscv_save_0>/<__riscv_restore_0> become 64-bit when the target functions are out of the ±1 MB range, increasing the prologue/epilogue size to 22 bytes.

note

POP is typically used in tail-calling sequences where ret is not used to return to ra after destroying the stack frame.

7.7.2.1 Stack pointer adjustment handling

The instructions all automatically adjust the stack pointer by enough to cover the memory required for the registers being saved or restored. Additionally, the spimm field in the encoding allows the stack pointer to be adjusted in additional increments of 16 bytes. Only a small, restricted range is available in the encoding; if it is insufficient, a separate C.ADDI16SP can be used to increase the range.

7.7.2.2 Register list handling

There is no support for the {ra, s0-s10} register list without also adding s11. Therefore the {ra, s0-s11} register list must be used in this case.

7.7.3 PUSH/POP Fault handling

Correct execution requires that sp refers to idempotent memory (also see Non-idempotent memory handling), because the core must be able to handle traps detected during the sequence. The entire PUSH/POP sequence is re-executed after returning from the trap handler, and multiple traps are possible during the sequence.

If a trap occurs during the sequence, then xEPC is updated with the PC of the instruction, xTVAL (if not read-only-zero) is updated with the bad address if it was an access fault, and xCAUSE is updated with the type of trap.

note

It is implementation defined whether interrupts can also be taken during the sequence execution.

7.7.4 Software view of execution

7.7.4.1 Software view of the PUSH sequence

From a software perspective the PUSH sequence appears as:

A sequence of stores writing the bytes required by the pseudocode
- The bytes may be written in any order.
- The bytes may be grouped into larger accesses.
- Any of the bytes may be written multiple times.
A stack pointer adjustment

note

If an implementation allows interrupts during the sequence, and the interrupt handler uses sp to allocate stack memory, then any stores which were executed before the interrupt may be overwritten by the handler. This is safe because the memory is idempotent and the stores will be re-executed when execution resumes.

The stack pointer adjustment must be committed only when it is certain that the entire PUSH instruction will commit.

Stores may also return imprecise faults from the bus. It is platform defined whether the core implementation waits for the bus responses before continuing to the final stage of the sequence, or handles error responses after completing the PUSH instruction.

For example:

cm.push  {ra, s0-s5}, -64

Appears to software as:

# any bytes from sp-1 to sp-28 may be written multiple times before
# the instruction completes therefore these updates may be visible in
# the interrupt/exception handler below the stack pointer
sw  s5, -4(sp)
sw  s4, -8(sp)
sw  s3,-12(sp)
sw  s2,-16(sp)
sw  s1,-20(sp)
sw  s0,-24(sp)
sw  ra,-28(sp)

# this must only execute once, and will only execute after all stores
# completed without any precise faults, therefore this update is only
# visible in the interrupt/exception handler if cm.push has completed
addi sp, sp, -64
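The re-execution behaviour can be illustrated with a small RV32 model. Everything here is a sketch under stated assumptions: registers and word-addressed memory are Python dictionaries, the trap_after parameter is a hypothetical way of simulating a trap partway through the stores, and the stores are issued in the order shown above although the text permits any order.

```python
def cm_push(regs: dict, mem: dict, reg_list: list, stack_adj: int, trap_after=None) -> bool:
    """Model cm.push on RV32; return True only if the instruction commits."""
    addr = regs["sp"]
    for done, name in enumerate(reversed(reg_list)):
        if trap_after is not None and done == trap_after:
            return False                 # trap taken: sp has not been adjusted
        addr -= 4
        mem[addr] = regs[name]           # store below the current stack pointer
    regs["sp"] -= stack_adj              # committed exactly once, at the end
    return True

regs = {"sp": 0x1000, "ra": 0xAA, **{f"s{i}": i for i in range(6)}}
mem = {}
names = ["ra", "s0", "s1", "s2", "s3", "s4", "s5"]
first = cm_push(regs, mem, names, 64, trap_after=3)   # interrupted attempt
sp_after_trap = regs["sp"]
second = cm_push(regs, mem, names, 64)                # full re-execution
```

The interrupted attempt leaves some words written below sp but leaves sp itself untouched, so repeating the whole sequence produces the same final state as an uninterrupted execution.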

7.7.4.2 Software view of the POP/POPRET sequence

From a software perspective the POP/POPRET sequence appears as:

A sequence of loads reading the bytes required by the pseudocode.
- The bytes may be loaded in any order.
- The bytes may be grouped into larger accesses.
- Any of the bytes may be loaded multiple times.
A stack pointer adjustment
An optional LI a0, 0
An optional RET

If a trap occurs during the sequence, then any loads which were executed before the trap may update architectural state. The loads will be re-executed once the trap handler completes, so the values will be overwritten. Therefore it is permitted for an implementation to update some of the destination registers before taking a fault.

The optional LI a0, 0, the stack pointer adjustment, and the optional RET must be committed only when it is certain that the entire POP/POPRET instruction will commit.

For POPRET once the stack pointer adjustment has been committed the RET must execute.

For example:

cm.popretz {ra, s0-s3}, 32

Appears to software as:

# any or all of these load instructions may execute multiple times
# therefore these updates may be visible in the interrupt/exception handler
lw   s3, 28(sp)
lw   s2, 24(sp)
lw   s1, 20(sp)
lw   s0, 16(sp)
lw   ra, 12(sp)

# these must only execute once, and will only execute after all loads
# complete successfully; all instructions must execute atomically,
# therefore these updates are not visible in the interrupt/exception handler
li a0, 0
addi sp, sp, 32
ret
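A corresponding RV32 sketch of cm.popretz is given below, again using dictionaries for registers and word-addressed memory. The function name and the "pc" entry in the register dictionary are modelling assumptions; faults and re-execution are not modelled.

```python
def cm_popretz(regs: dict, mem: dict, reg_list: list, stack_adj: int) -> None:
    """Model cm.popretz on RV32: loads, then li a0,0 / addi sp / ret."""
    sp = regs["sp"]
    for n, name in enumerate(reversed(reg_list), start=1):
        regs[name] = mem[sp + stack_adj - 4 * n]   # highest register at the top
    regs["a0"] = 0                                  # li a0, 0
    regs["sp"] = sp + stack_adj                     # addi sp, sp, stack_adj
    regs["pc"] = regs["ra"] & ~1                    # ret, using the restored ra

regs = {"sp": 0x2000, "a0": 7, "pc": 0x100}
mem = {0x200C: 0x400, 0x2010: 10, 0x2014: 11, 0x2018: 12, 0x201C: 13}
cm_popretz(regs, mem, ["ra", "s0", "s1", "s2", "s3"], 32)
```

The memory image matches the example above: ra is read from 12(sp) and s3 from 28(sp).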

7.7.5 Non-idempotent memory handling

An implementation may have a requirement to issue a PUSH/POP instruction to non-idempotent memory.

If the core implementation does not support PUSH/POP to non-idempotent memories, the core may use an idempotency PMA to detect it and take a load (POP/POPRET) or store (PUSH) access-fault exception in order to avoid unpredictable results.

Software should only use these instructions on non-idempotent memory regions when software can tolerate the required memory accesses being issued repeatedly in the case that they cause exceptions.

7.7.6 Example RV32I PUSH/POP sequences

The examples included show the load/store series expansion and the stack adjustment. Examples of CM.POPRET and CM.POPRETZ are not included, as the difference in the expanded sequence from CM.POP is trivial in all cases.

7.7.6.1 CM.PUSH {ra, s0-s2}, -64

Encoding: rlist=7, spimm=3

expands to:

sw  s2,  -4(sp);
sw  s1,  -8(sp);
sw  s0, -12(sp);
sw  ra, -16(sp);
addi sp, sp, -64;

7.7.6.2 CM.PUSH {ra, s0-s11}, -112

Encoding: rlist=15, spimm=3

expands to:

sw  s11,  -4(sp);
sw  s10,  -8(sp);
sw  s9,  -12(sp);
sw  s8,  -16(sp);
sw  s7,  -20(sp);
sw  s6,  -24(sp);
sw  s5,  -28(sp);
sw  s4,  -32(sp);
sw  s3,  -36(sp);
sw  s2,  -40(sp);
sw  s1,  -44(sp);
sw  s0,  -48(sp);
sw  ra,  -52(sp);
addi sp, sp, -112;

7.7.6.3 CM.POP {ra}, 16

Encoding: rlist=4, spimm=0

expands to:

lw   ra, 12(sp);
addi sp, sp, 16;

7.7.6.4 CM.POP {ra, s0-s3}, 48

Encoding: rlist=8, spimm=1

expands to:

lw   s3, 44(sp);
lw   s2, 40(sp);
lw   s1, 36(sp);
lw   s0, 32(sp);
lw   ra, 28(sp);
addi sp, sp, 48;

7.7.6.5 CM.POP {ra, s0-s4}, 64

Encoding: rlist=9, spimm=2

expands to:

lw   s4, 60(sp);
lw   s3, 56(sp);
lw   s2, 52(sp);
lw   s1, 48(sp);
lw   s0, 44(sp);
lw   ra, 40(sp);
addi sp, sp, 64;

7.7.7 CM.PUSH

Synopsis Create stack frame: store ra and 0 to 12 saved registers to the stack frame, optionally allocate additional stack space.

Mnemonic CM.PUSH {reg_list}, -stack_adj

Encoding (RV32, RV64):


note

rlist values 0 to 3 are reserved for a future EABI variant called CM.PUSH.E

Assembly Syntax:

cm.push {reg_list},  -stack_adj
cm.push {xreg_list}, -stack_adj

The variables used in the assembly syntax are defined below.

RV32E:

switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32I, RV64:
switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  case  7: {reg_list="ra, s0-s2";  xreg_list="x1, x8-x9, x18";}
  case  8: {reg_list="ra, s0-s3";  xreg_list="x1, x8-x9, x18-x19";}
  case  9: {reg_list="ra, s0-s4";  xreg_list="x1, x8-x9, x18-x20";}
  case 10: {reg_list="ra, s0-s5";  xreg_list="x1, x8-x9, x18-x21";}
  case 11: {reg_list="ra, s0-s6";  xreg_list="x1, x8-x9, x18-x22";}
  case 12: {reg_list="ra, s0-s7";  xreg_list="x1, x8-x9, x18-x23";}
  case 13: {reg_list="ra, s0-s8";  xreg_list="x1, x8-x9, x18-x24";}
  case 14: {reg_list="ra, s0-s9";  xreg_list="x1, x8-x9, x18-x25";}
  //note - to include s10, s11 must also be included
  case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32E:

stack_adj_base = 16;
Valid values:
stack_adj      = [16|32|48|64];

RV32I:

switch (rlist) {
  case  4.. 7: stack_adj_base = 16;
  case  8..11: stack_adj_base = 32;
  case 12..14: stack_adj_base = 48;
  case     15: stack_adj_base = 64;
}

Valid values:
switch (rlist) {
  case  4.. 7: stack_adj = [16|32|48| 64];
  case  8..11: stack_adj = [32|48|64| 80];
  case 12..14: stack_adj = [48|64|80| 96];
  case     15: stack_adj = [64|80|96|112];
}

RV64:

switch (rlist) {
  case  4.. 5: stack_adj_base =  16;
  case  6.. 7: stack_adj_base =  32;
  case  8.. 9: stack_adj_base =  48;
  case 10..11: stack_adj_base =  64;
  case 12..13: stack_adj_base =  80;
  case     14: stack_adj_base =  96;
  case     15: stack_adj_base = 112;
}

Valid values:
switch (rlist) {
  case  4.. 5: stack_adj = [ 16| 32| 48| 64];
  case  6.. 7: stack_adj = [ 32| 48| 64| 80];
  case  8.. 9: stack_adj = [ 48| 64| 80| 96];
  case 10..11: stack_adj = [ 64| 80| 96|112];
  case 12..13: stack_adj = [ 80| 96|112|128];
  case     14: stack_adj = [ 96|112|128|144];
  case     15: stack_adj = [112|128|144|160];
}
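
The rlist switch statements above can be summarized as a small decoder. The following Python sketch is illustrative only; the names S_REGS and decode_rlist are hypothetical and do not appear in this specification.

```python
# Illustrative sketch; S_REGS and decode_rlist are hypothetical names.
S_REGS = [8, 9] + list(range(18, 28))  # s0-s11 as x-register numbers


def decode_rlist(rlist, rv32e=False):
    """Return the x-register numbers selected by rlist, ra (x1) first."""
    max_rlist = 6 if rv32e else 15
    if rlist < 4 or rlist > max_rlist:
        raise ValueError(f"rlist={rlist} is reserved")
    # rlist 4 selects ra only and each further step adds one s register,
    # except that rlist 15 adds both s10 and s11 (s10 cannot be listed alone).
    n_sregs = 12 if rlist == 15 else rlist - 4
    return [1] + S_REGS[:n_sregs]
```

For example, decode_rlist(7) yields x1, x8, x9, and x18, matching the entry for "ra, s0-s2".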

Description This instruction pushes (stores) the registers in reg_list to the memory below the stack pointer, and then creates the stack frame by decrementing the stack pointer by stack_adj, including any additional stack space requested by the value of spimm.

note

All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.

For further information see Zcmp.

Stack Adjustment Calculation:

stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.

spimm is the number of additional 16-byte address increments allocated for the stack frame.

The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.
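
The calculation above can be sketched in Python as follows. The function names are hypothetical, and spimm is assumed to range over 0 to 3, consistent with the four valid stack_adj values listed for each rlist group. The sketch derives stack_adj_base by rounding the register save area up to a multiple of 16 bytes, which reproduces the tables above.

```python
# Illustrative sketch; function names are hypothetical.
def stack_adj_base(rlist, xlen=32):
    """Smallest multiple of 16 bytes that covers the registers in the list."""
    if not 4 <= rlist <= 15:
        raise ValueError(f"rlist={rlist} is reserved")
    # ra plus (rlist - 4) s registers; rlist 15 covers ra and s0-s11.
    n_regs = 13 if rlist == 15 else rlist - 3
    reg_bytes = n_regs * (xlen // 8)
    return (reg_bytes + 15) // 16 * 16


def stack_adj(rlist, spimm, xlen=32):
    """Total stack adjustment in bytes: stack_adj_base + spimm * 16."""
    if not 0 <= spimm <= 3:  # assumption: four encodable values
        raise ValueError("spimm out of range")
    return stack_adj_base(rlist, xlen) + spimm * 16
```

For example, stack_adj(7, 3) gives 64 for RV32, the adjustment used by CM.PUSH {ra, s0-s2}, -64.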

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists

Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

if (XLEN==32) bytes=4; else bytes=8;

addr=sp-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1)  {
  //if register i is in xreg_list
  if (xreg_list[i]) {
    switch(bytes) {
      4:  asm("sw x[i], 0(addr)");
      8:  asm("sd x[i], 0(addr)");
    }
    addr-=bytes;
  }
}

The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

sp-=stack_adj;
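
The two pseudocode sections above can be modeled in Python as shown below. This is a sketch under simplifying assumptions: regs and mem are plain dictionaries (hypothetical names), memory holds one entry per stored register, and no exceptions or interrupts occur.

```python
# Illustrative sketch; cm_push, regs and mem are hypothetical names.
ORDER = (27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 9, 8, 1)


def cm_push(regs, mem, xreg_list, stack_adj, xlen=32):
    """Store the listed registers below sp, then decrement sp by stack_adj."""
    nbytes = 4 if xlen == 32 else 8
    addr = regs[2] - nbytes  # the first store goes just below the old sp
    for i in ORDER:
        if i in xreg_list:
            mem[addr] = regs[i]  # sw or sd, depending on XLEN
            addr -= nbytes
    # Final section: sp is updated only after all stores have completed.
    regs[2] -= stack_adj
```

With sp initially 0x1000 on RV32, cm_push(regs, mem, [1, 8, 9, 18], 64) places s2 at 0xFFC, s1 at 0xFF8, s0 at 0xFF4, and ra at 0xFF0, and leaves sp at 0xFC0.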

7.7.8 CM.POP

Synopsis Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame.

Mnemonic CM.POP {reg_list}, stack_adj

Encoding (RV32, RV64):

[Figure: CM.POP encoding diagram]

note

rlist values 0 to 3 are reserved for a future EABI variant called CM.POP.E

Assembly Syntax:

cm.pop {reg_list},  stack_adj
cm.pop {xreg_list}, stack_adj

The variables used in the assembly syntax are defined below.

RV32E:

switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32I, RV64:

switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  case  7: {reg_list="ra, s0-s2";  xreg_list="x1, x8-x9, x18";}
  case  8: {reg_list="ra, s0-s3";  xreg_list="x1, x8-x9, x18-x19";}
  case  9: {reg_list="ra, s0-s4";  xreg_list="x1, x8-x9, x18-x20";}
  case 10: {reg_list="ra, s0-s5";  xreg_list="x1, x8-x9, x18-x21";}
  case 11: {reg_list="ra, s0-s6";  xreg_list="x1, x8-x9, x18-x22";}
  case 12: {reg_list="ra, s0-s7";  xreg_list="x1, x8-x9, x18-x23";}
  case 13: {reg_list="ra, s0-s8";  xreg_list="x1, x8-x9, x18-x24";}
  case 14: {reg_list="ra, s0-s9";  xreg_list="x1, x8-x9, x18-x25";}
  //note - to include s10, s11 must also be included
  case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32E:

stack_adj_base = 16;
Valid values:
stack_adj      = [16|32|48|64];

RV32I:

switch (rlist) {
  case  4.. 7: stack_adj_base = 16;
  case  8..11: stack_adj_base = 32;
  case 12..14: stack_adj_base = 48;
  case     15: stack_adj_base = 64;
}

Valid values:
switch (rlist) {
  case  4.. 7: stack_adj = [16|32|48| 64];
  case  8..11: stack_adj = [32|48|64| 80];
  case 12..14: stack_adj = [48|64|80| 96];
  case     15: stack_adj = [64|80|96|112];
}

RV64:

switch (rlist) {
  case  4.. 5: stack_adj_base =  16;
  case  6.. 7: stack_adj_base =  32;
  case  8.. 9: stack_adj_base =  48;
  case 10..11: stack_adj_base =  64;
  case 12..13: stack_adj_base =  80;
  case     14: stack_adj_base =  96;
  case     15: stack_adj_base = 112;
}

Valid values:
switch (rlist) {
  case  4.. 5: stack_adj = [ 16| 32| 48| 64];
  case  6.. 7: stack_adj = [ 32| 48| 64| 80];
  case  8.. 9: stack_adj = [ 48| 64| 80| 96];
  case 10..11: stack_adj = [ 64| 80| 96|112];
  case 12..13: stack_adj = [ 80| 96|112|128];
  case     14: stack_adj = [ 96|112|128|144];
  case     15: stack_adj = [112|128|144|160];
}

Description This instruction pops (loads) the registers in reg_list from stack memory, and then adjusts the stack pointer by stack_adj.

note

All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.

For further information see Zcmp.

Stack Adjustment Calculation:

stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.

spimm is the number of additional 16-byte address increments allocated for the stack frame.

The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists

Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

if (XLEN==32) bytes=4; else bytes=8;

addr=sp+stack_adj-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1)  {
  //if register i is in xreg_list
  if (xreg_list[i]) {
    switch(bytes) {
      4:  asm("lw x[i], 0(addr)");
      8:  asm("ld x[i], 0(addr)");
    }
    addr-=bytes;
  }
}

The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

sp+=stack_adj;
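
Although no single 32-bit instruction is equivalent, the pseudocode above corresponds to a sequence of base-ISA loads followed by a stack-pointer adjustment. The Python sketch below generates that sequence as text; the names ABI and expand_cm_pop are hypothetical, and the output is intended only as a reading aid.

```python
# Illustrative sketch; ABI and expand_cm_pop are hypothetical names.
ORDER = (27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 9, 8, 1)
ABI = {1: "ra", 8: "s0", 9: "s1", **{18 + k: f"s{2 + k}" for k in range(10)}}


def expand_cm_pop(xreg_list, stack_adj, xlen=32):
    """Return the loads and the sp update that CM.POP performs, as text."""
    nbytes = 4 if xlen == 32 else 8
    load = "lw" if xlen == 32 else "ld"
    seq = []
    offset = stack_adj - nbytes  # addr = sp + stack_adj - bytes
    for i in ORDER:
        if i in xreg_list:
            seq.append(f"{load} {ABI[i]}, {offset}(sp)")
            offset -= nbytes
    seq.append(f"addi sp, sp, {stack_adj}")
    return seq
```

For CM.POP {ra, s0-s3}, 48 on RV32, the sketch produces loads of s3, s2, s1, s0, and ra from offsets 44, 40, 36, 32, and 28, followed by addi sp, sp, 48.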

7.7.9 CM.POPRETZ

Synopsis Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame, move zero into a0, return to ra.

Mnemonic CM.POPRETZ {reg_list}, stack_adj

Encoding (RV32, RV64):

[Figure: CM.POPRETZ encoding diagram]

note

rlist values 0 to 3 are reserved for a future EABI variant called CM.POPRETZ.E

Assembly Syntax:

cm.popretz {reg_list},  stack_adj
cm.popretz {xreg_list}, stack_adj

RV32E:

switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32I, RV64:

switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  case  7: {reg_list="ra, s0-s2";  xreg_list="x1, x8-x9, x18";}
  case  8: {reg_list="ra, s0-s3";  xreg_list="x1, x8-x9, x18-x19";}
  case  9: {reg_list="ra, s0-s4";  xreg_list="x1, x8-x9, x18-x20";}
  case 10: {reg_list="ra, s0-s5";  xreg_list="x1, x8-x9, x18-x21";}
  case 11: {reg_list="ra, s0-s6";  xreg_list="x1, x8-x9, x18-x22";}
  case 12: {reg_list="ra, s0-s7";  xreg_list="x1, x8-x9, x18-x23";}
  case 13: {reg_list="ra, s0-s8";  xreg_list="x1, x8-x9, x18-x24";}
  case 14: {reg_list="ra, s0-s9";  xreg_list="x1, x8-x9, x18-x25";}
  //note - to include s10, s11 must also be included
  case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32E:

stack_adj_base = 16;
Valid values:
stack_adj      = [16|32|48|64];

RV32I:

switch (rlist) {
  case  4.. 7: stack_adj_base = 16;
  case  8..11: stack_adj_base = 32;
  case 12..14: stack_adj_base = 48;
  case     15: stack_adj_base = 64;
}

Valid values:
switch (rlist) {
  case  4.. 7: stack_adj = [16|32|48| 64];
  case  8..11: stack_adj = [32|48|64| 80];
  case 12..14: stack_adj = [48|64|80| 96];
  case     15: stack_adj = [64|80|96|112];
}

RV64:

switch (rlist) {
  case  4.. 5: stack_adj_base =  16;
  case  6.. 7: stack_adj_base =  32;
  case  8.. 9: stack_adj_base =  48;
  case 10..11: stack_adj_base =  64;
  case 12..13: stack_adj_base =  80;
  case     14: stack_adj_base =  96;
  case     15: stack_adj_base = 112;
}

Valid values:
switch (rlist) {
  case  4.. 5: stack_adj = [ 16| 32| 48| 64];
  case  6.. 7: stack_adj = [ 32| 48| 64| 80];
  case  8.. 9: stack_adj = [ 48| 64| 80| 96];
  case 10..11: stack_adj = [ 64| 80| 96|112];
  case 12..13: stack_adj = [ 80| 96|112|128];
  case     14: stack_adj = [ 96|112|128|144];
  case     15: stack_adj = [112|128|144|160];
}

Description This instruction pops (loads) the registers in reg_list from stack memory, adjusts the stack pointer by stack_adj, moves zero into a0 and then returns to ra.

note

All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.

For further information see Zcmp.

Stack Adjustment Calculation:

stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.

spimm is the number of additional 16-byte address increments allocated for the stack frame.

The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists

Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

if (XLEN==32) bytes=4; else bytes=8;

addr=sp+stack_adj-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1)  {
  //if register i is in xreg_list
  if (xreg_list[i]) {
    switch(bytes) {
      4:  asm("lw x[i], 0(addr)");
      8:  asm("ld x[i], 0(addr)");
    }
    addr-=bytes;
  }
}

The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.

note

The LI a0, 0 could be executed more than once, but is included in the atomic section for convenience.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

asm("li a0, 0");
sp+=stack_adj;
asm("ret");
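
The split between the restartable load section and the atomic final section can be illustrated with a Python model in which a load may fault. This is a sketch only: LoadFault, cm_popretz, and the load callback are hypothetical names, and the model assumes a fault abandons the instruction before sp or a0 is written.

```python
# Illustrative sketch; LoadFault, cm_popretz and load are hypothetical names.
ORDER = (27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 9, 8, 1)


class LoadFault(Exception):
    """Stands in for any exception or interrupt taken during the loads."""


def cm_popretz(regs, load, xreg_list, stack_adj, xlen=32):
    """Model CM.POPRETZ; returns the address execution continues at."""
    nbytes = 4 if xlen == 32 else 8
    addr = regs[2] + stack_adj - nbytes
    # Restartable section: if load() raises, sp is still unchanged, so the
    # whole instruction can simply be executed again from the start.
    for i in ORDER:
        if i in xreg_list:
            regs[i] = load(addr)
            addr -= nbytes
    # Final section: reached only if every load completed.
    regs[10] = 0            # li a0, 0
    regs[2] += stack_adj    # deallocate the stack frame
    return regs[1] & ~1     # ret (jalr clears bit 0 of the target)
```

If a load faults, some registers in the list may already hold their restored values, but sp and a0 are untouched; re-executing the instruction after the fault is handled repeats the loads from the same addresses.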

7.7.10 CM.POPRET

Synopsis Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame, return to ra.

Mnemonic CM.POPRET {reg_list}, stack_adj

Encoding (RV32, RV64):

[Figure: CM.POPRET encoding diagram]

note

rlist values 0 to 3 are reserved for a future EABI variant called CM.POPRET.E

Assembly Syntax:

cm.popret {reg_list},  stack_adj
cm.popret {xreg_list}, stack_adj

The variables used in the assembly syntax are defined below.

RV32E:

switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32I, RV64:

switch (rlist){
  case  4: {reg_list="ra";         xreg_list="x1";}
  case  5: {reg_list="ra, s0";     xreg_list="x1, x8";}
  case  6: {reg_list="ra, s0-s1";  xreg_list="x1, x8-x9";}
  case  7: {reg_list="ra, s0-s2";  xreg_list="x1, x8-x9, x18";}
  case  8: {reg_list="ra, s0-s3";  xreg_list="x1, x8-x9, x18-x19";}
  case  9: {reg_list="ra, s0-s4";  xreg_list="x1, x8-x9, x18-x20";}
  case 10: {reg_list="ra, s0-s5";  xreg_list="x1, x8-x9, x18-x21";}
  case 11: {reg_list="ra, s0-s6";  xreg_list="x1, x8-x9, x18-x22";}
  case 12: {reg_list="ra, s0-s7";  xreg_list="x1, x8-x9, x18-x23";}
  case 13: {reg_list="ra, s0-s8";  xreg_list="x1, x8-x9, x18-x24";}
  case 14: {reg_list="ra, s0-s9";  xreg_list="x1, x8-x9, x18-x25";}
  //note - to include s10, s11 must also be included
  case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
  default: reserved();
}
stack_adj      = stack_adj_base + spimm * 16;

RV32E:

stack_adj_base = 16;
Valid values:
stack_adj      = [16|32|48|64];

RV32I:

switch (rlist) {
  case  4.. 7: stack_adj_base = 16;
  case  8..11: stack_adj_base = 32;
  case 12..14: stack_adj_base = 48;
  case     15: stack_adj_base = 64;
}

Valid values:
switch (rlist) {
  case  4.. 7: stack_adj = [16|32|48| 64];
  case  8..11: stack_adj = [32|48|64| 80];
  case 12..14: stack_adj = [48|64|80| 96];
  case     15: stack_adj = [64|80|96|112];
}

RV64:

switch (rlist) {
  case  4.. 5: stack_adj_base =  16;
  case  6.. 7: stack_adj_base =  32;
  case  8.. 9: stack_adj_base =  48;
  case 10..11: stack_adj_base =  64;
  case 12..13: stack_adj_base =  80;
  case     14: stack_adj_base =  96;
  case     15: stack_adj_base = 112;
}

Valid values:
switch (rlist) {
  case  4.. 5: stack_adj = [ 16| 32| 48| 64];
  case  6.. 7: stack_adj = [ 32| 48| 64| 80];
  case  8.. 9: stack_adj = [ 48| 64| 80| 96];
  case 10..11: stack_adj = [ 64| 80| 96|112];
  case 12..13: stack_adj = [ 80| 96|112|128];
  case     14: stack_adj = [ 96|112|128|144];
  case     15: stack_adj = [112|128|144|160];
}

Description This instruction pops (loads) the registers in reg_list from stack memory, adjusts the stack pointer by stack_adj and then returns to ra.

note

All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.

For further information see Zcmp.

Stack Adjustment Calculation:

stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.

spimm is the number of additional 16-byte address increments allocated for the stack frame.

The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists

Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

if (XLEN==32) bytes=4; else bytes=8;

addr=sp+stack_adj-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1)  {
  //if register i is in xreg_list
  if (xreg_list[i]) {
    switch(bytes) {
      4:  asm("lw x[i], 0(addr)");
      8:  asm("ld x[i], 0(addr)");
    }
    addr-=bytes;
  }
}

The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.

sp+=stack_adj;
asm("ret");

7.7.11 CM.MVSA01

Synopsis Move a0-a1 into two registers of s0-s7

Mnemonic CM.MVSA01 r1s', r2s'

Encoding (RV32, RV64):

[Figure: CM.MVSA01 encoding diagram]

note

For the encoding to be legal r1s' != r2s'.

Assembly Syntax:

cm.mvsa01 r1s', r2s'

Description This instruction moves a0 into r1s' and a1 into r2s'. r1s' and r2s' must be different. The execution is atomic, so it is not possible to observe state where only one of r1s' or r2s' has been updated.

The encoding uses sreg number specifiers instead of xreg number specifiers to save encoding space. The mapping between them is specified in the pseudocode below.

note

The s register mapping is taken from the UABI, and may not match the currently unratified EABI. CM.MVSA01.E may be included in the future.

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists.

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (RV32E && (r1sc>1 || r2sc>1)) {
  reserved();
}
xreg1 = {r1sc[2:1]>0,r1sc[2:1]==0,r1sc[2:0]};
xreg2 = {r2sc[2:1]>0,r2sc[2:1]==0,r2sc[2:0]};
X[xreg1] = X[10];
X[xreg2] = X[11];
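
The bit concatenation used for xreg1 and xreg2 maps the 3-bit sreg specifiers 0 to 7 onto x8, x9, and x18 to x23. The Python sketch below spells this out; sreg_to_xreg and cm_mvsa01 are hypothetical names, and regs is a dictionary standing in for the X register file.

```python
# Illustrative sketch; sreg_to_xreg and cm_mvsa01 are hypothetical names.
def sreg_to_xreg(sc, rv32e=False):
    """Map a 3-bit sreg specifier to its 5-bit x-register number."""
    if not 0 <= sc <= 7:
        raise ValueError("sreg specifier must fit in 3 bits")
    if rv32e and sc > 1:
        raise ValueError("reserved on RV32E")
    upper = sc >> 1  # sc[2:1]
    # {sc[2:1] > 0, sc[2:1] == 0, sc[2:0]} as a 5-bit concatenation
    return (int(upper > 0) << 4) | (int(upper == 0) << 3) | sc


def cm_mvsa01(regs, r1sc, r2sc):
    """Copy a0 and a1 into the two selected s registers."""
    if r1sc == r2sc:
        raise ValueError("r1s' and r2s' must be different")
    regs[sreg_to_xreg(r1sc)] = regs[10]
    regs[sreg_to_xreg(r2sc)] = regs[11]
```

Specifiers 0 and 1 select x8 and x9 (s0 and s1), and specifiers 2 to 7 select x18 to x23 (s2 to s7).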

7.7.12 CM.MVA01S

Synopsis Move two s0-s7 registers into a0-a1

Mnemonic CM.MVA01S r1s', r2s'

Encoding (RV32, RV64):

[Figure: CM.MVA01S encoding diagram]

Assembly Syntax:

cm.mva01s r1s', r2s'

Description This instruction moves r1s' into a0 and r2s' into a1. The execution is atomic, so it is not possible to observe state where only one of a0 or a1 has been updated.

The encoding uses sreg number specifiers instead of xreg number specifiers to save encoding space. The mapping between them is specified in the pseudocode below.

note

The s register mapping is taken from the UABI, and may not match the currently unratified EABI. CM.MVA01S.E may be included in the future.

Prerequisites None

32-bit equivalent:

No direct equivalent encoding exists.

Operation

//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (RV32E && (r1sc>1 || r2sc>1)) {
  reserved();
}
xreg1 = {r1sc[2:1]>0,r1sc[2:1]==0,r1sc[2:0]};
xreg2 = {r2sc[2:1]>0,r2sc[2:1]==0,r2sc[2:0]};
X[10] = X[xreg1];
X[11] = X[xreg2];

7.8 `Zce` Extension for Enhanced Instruction Compression

This section describes the Zce extension, which incorporates the compressed instruction-set extensions designed for microcontrollers. Unlike the C extension, the Zce extension includes extensions that trade performance for code density.

The Zce extension depends upon the Zca, Zcb, Zcmp, and Zcmt extensions.

If XLEN=32 and the F extension is present, the Zce extension additionally depends upon the Zcf extension.

7.9 `Zclsd` Extension for Compressed Load/Store Pair Instructions

The Zclsd extension provides compressed load/store pair instructions for RV32, reusing the existing RV64 doubleword load/store instruction encodings.

Zclsd depends on Zilsd and Zca. It has overlapping encodings with Zcf and is thus incompatible with Zcf.

7.9.1 Use of `x0` as operand

For C.LDSP, usage of x0 as the destination is reserved.

If x0 is used as the source of C.SDSP, the entire 64-bit operand is zero, i.e., register x1 is not accessed.

C.LD and C.SD instructions can only use x8-x15.

7.9.2 Exception Handling

For the purposes of RVWMO and exception handling, C.LD, C.LDSP, C.SD, and C.SDSP instructions are considered to be misaligned loads and stores, with one additional constraint: a C.LD, C.LDSP, C.SD, or C.SDSP instruction whose effective address is a multiple of 4 gives rise to two 4-byte memory operations.

Zclsd adds the following RV32-only instructions:

RV32	RV64	Mnemonic	Instruction
yes	no	C.LDSP rd, offset(sp)	Stack-pointer based load doubleword to register pair, 16-bit encoding
yes	no	C.SDSP rs2, offset(sp)	Stack-pointer based store doubleword from register pair, 16-bit encoding
yes	no	C.LD rd', offset(rs1')	Load doubleword to register pair, 16-bit encoding
yes	no	C.SD rs2', offset(rs1')	Store doubleword from register pair, 16-bit encoding

7.9.2.1 C.LDSP

Synopsis Stack-pointer based load doubleword to even/odd register pair, 16-bit encoding

Mnemonic C.LDSP rd, offset(sp)

Encoding (RV32)

[Figure: C.LDSP encoding diagram]

Description

Loads a stack-pointer relative 64-bit value into registers rd and rd+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to LD rd, offset(x2). C.LDSP is only valid when rd≠x0; the code points with rd=x0 are reserved.

Included in: Zclsd

7.9.2.2 C.SDSP

Synopsis Stack-pointer based store doubleword from even/odd register pair, 16-bit encoding

Mnemonic C.SDSP rs2, offset(sp)

Encoding (RV32)

[Figure: C.SDSP encoding diagram]

Description

Stores a stack-pointer relative 64-bit value from registers rs2 and rs2+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to SD rs2, offset(x2).

Included in: Zclsd

7.9.2.3 C.LD

Synopsis Load doubleword to even/odd register pair, 16-bit encoding

Mnemonic C.LD rd', offset(rs1')

Encoding (RV32)

[Figure: C.LD encoding diagram]

Description

Loads a 64-bit value into registers rd' and rd'+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'. It expands to LD rd', offset(rs1').

Included in: Zclsd

7.9.2.4 C.SD

Synopsis Store doubleword from even/odd register pair, 16-bit encoding

Mnemonic C.SD rs2', offset(rs1')

Encoding (RV32)

[Figure: C.SD encoding diagram]

Description

Stores a 64-bit value from registers rs2' and rs2'+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'. It expands to SD rs2', offset(rs1').

Included in: Zclsd
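
The register-pair behavior of these four instructions can be sketched in Python as follows. The sketch rests on assumptions taken from the Zilsd extension rather than from this section: the pair starts at an even register number, the lower-numbered register holds the low-order 32 bits, and memory is little-endian. The names store_pair and load_pair, and the word-addressed mem dictionary, are hypothetical.

```python
# Illustrative sketch; store_pair, load_pair, regs and mem are hypothetical.
def _check(reg, offset):
    if reg % 2:
        raise ValueError("register pair must start at an even register")
    if offset < 0 or offset % 8:
        raise ValueError("offset is zero-extended and a multiple of 8")


def store_pair(regs, mem, rs2, base, offset):
    """Model C.SD / C.SDSP on RV32 (base is x2 for C.SDSP)."""
    _check(rs2, offset)
    addr = (regs[base] + offset) & 0xFFFFFFFF
    if rs2 == 0:
        lo = hi = 0  # x0 as source: the whole operand is zero, x1 is not read
    else:
        lo, hi = regs[rs2], regs[rs2 + 1]
    mem[addr] = lo       # assumption: low word at the lower address
    mem[addr + 4] = hi


def load_pair(regs, mem, rd, base, offset):
    """Model C.LD / C.LDSP on RV32 (base is x2 for C.LDSP)."""
    if rd == 0:
        raise ValueError("rd=x0 is reserved")
    _check(rd, offset)
    addr = (regs[base] + offset) & 0xFFFFFFFF
    regs[rd], regs[rd + 1] = mem[addr], mem[addr + 4]
```

The model writes and reads memory as two 4-byte entries, mirroring the statement in Exception Handling that a 4-byte-aligned access gives rise to two 4-byte memory operations.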

7.10 `Zcmop` Extension for Compressed May-Be-Operations

This section defines the Zcmop extension, which defines eight 16-bit MOP instructions named C.MOP.N, where N is an odd integer between 1 and 15, inclusive. C.MOP.N is encoded in the reserved encoding space corresponding to C.LUI xN, 0, as shown in Table 37. Unlike the MOPs defined in the Zimop extension, the C.MOP.N instructions are defined to not write any register. Their encoding allows future extensions to define them to read register xN.

The Zcmop extension depends upon the Zca extension.

[Figure: C.MOP.N encoding diagram]

note

Very few suitable 16-bit encoding spaces exist. This space was chosen because it already has unusual behavior with respect to the rd/rs1 field—it encodes C.ADDI16SP when the field contains x2—and is therefore of lower value for most purposes.

Table 37. C.MOP.N instruction encoding.

Mnemonic	Encoding	Redefinable to read register
C.MOP.1	`0110000010000001`	`x1`
C.MOP.3	`0110000110000001`	`x3`
C.MOP.5	`0110001010000001`	`x5`
C.MOP.7	`0110001110000001`	`x7`
C.MOP.9	`0110010010000001`	`x9`
C.MOP.11	`0110010110000001`	`x11`
C.MOP.13	`0110011010000001`	`x13`
C.MOP.15	`0110011110000001`	`x15`
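
The encodings in Table 37 differ only in bits 11:7, which hold N in the position of the C.LUI rd field. A Python sketch that regenerates the table follows; c_mop_encoding is a hypothetical helper name, and the constant 0x6001 is read off the table rows (all bits other than 11:7).

```python
# Illustrative sketch; c_mop_encoding is a hypothetical helper name.
def c_mop_encoding(n):
    """Return the 16-bit encoding of C.MOP.n for odd n from 1 to 15."""
    if n not in range(1, 16, 2):
        raise ValueError("N must be an odd integer between 1 and 15")
    # 0x6001 carries the fixed bits; N occupies bits 11:7.
    return 0x6001 | (n << 7)


def c_mop_table():
    """Return (mnemonic, binary encoding, register) rows as in Table 37."""
    return [(f"C.MOP.{n}", format(c_mop_encoding(n), "016b"), f"x{n}")
            for n in range(1, 16, 2)]
```

For instance, c_mop_encoding(5) is 0x6281, which is the 0110001010000001 pattern listed for C.MOP.5.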

note

The recommended assembly syntax for C.MOP.N is simply the nullary C.MOP.N. The possibly accessed register is implicitly xN.

note

The expectation is that each Zcmop instruction is equivalent to some Zimop instruction, but the choice of expansion (if any) is left to the extension that redefines the MOP. Note, a Zcmop instruction that does not write a value can expand into a write to x0.
