7 Compressed Instructions
This chapter describes the RISC-V compressed instruction-set extensions, which reduce static and dynamic code size by adding short 16-bit instruction encodings for common operations. Typically, 50%-60% of the RISC-V instructions in a program can be replaced with compressed instructions, resulting in a 25%-30% code-size reduction.
The Zca extension forms the core of the compressed extensions;
it provides compressed forms of integer loads, stores, branches, and
computational instructions.
The Zcf extension is an XLEN=32-only extension that adds
single-precision floating-point loads and stores.
The Zcd extension adds double-precision floating-point loads and
stores.
The C extension comprises Zca, together with Zcf if XLEN=32 and the F
extension is present, and Zcd if the D extension is present.
Various additional Zc* extensions are also defined.
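The composition rule above can be written as a small predicate. The following Python sketch is illustrative only; the function name `c_extension_components` and its parameters are hypothetical and not part of the specification.

```python
def c_extension_components(xlen: int, has_f: bool, has_d: bool) -> list[str]:
    """Return the Zc* extensions that make up C for a given configuration."""
    parts = ["Zca"]                 # Zca is always part of C
    if xlen == 32 and has_f:
        parts.append("Zcf")         # Zcf exists only for XLEN=32
    if has_d:
        parts.append("Zcd")
    return parts


print(c_extension_components(32, True, True))   # ['Zca', 'Zcf', 'Zcd']
print(c_extension_components(64, True, True))   # ['Zca', 'Zcd']
```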
7.1 Zca Extension for Integer Compressed Instructions
The compressed extensions use a simple compression scheme that offers shorter 16-bit versions of common 32-bit RISC-V instructions when:
- the immediate or address offset is small, or
- one of the registers is the zero register (x0), the ABI link register (x1), or the ABI stack pointer (x2), or
- the destination register and the first source register are identical, or
- the registers used are the 8 most popular ones.
The Zca extension is compatible with all other standard instruction
extensions. The Zca extension allows 16-bit instructions to be freely
intermixed with 32-bit instructions, with the latter now able to start
on any 16-bit boundary, i.e., IALIGN=16. With the addition of the Zca
extension, no instructions can raise instruction-address-misaligned
exceptions.
Removing the 32-bit alignment constraint on the original 32-bit instructions allows significantly greater code density.
The compressed instruction encodings are mostly common across XLEN=32 and
XLEN=64, but as shown in Zca Instruction listing, Quadrant 0, a few opcodes are used for
different purposes depending on base ISA.
For example, the XLEN=64 variant requires additional opcodes to compress loads
and stores of 64-bit integer values, whereas the XLEN=32-only Zcf extension
uses the same opcodes to compress loads and stores of single-precision
floating-point values.
If the C extension is implemented, the
appropriate compressed floating-point load and store instructions must
be provided whenever the relevant standard floating-point extension (F
and/or D) is also implemented.
In addition, the XLEN=32 variant includes a compressed jump and link
instruction to compress short-range subroutine calls, where the same opcode is
used to compress ADDIW for XLEN=64.
Double-precision loads and stores are a significant fraction of static and
dynamic instructions, hence the motivation to provide the Zcd extension.
Single-precision loads and stores are not a significant source of static or
dynamic compression for programs compiled for the LP64D and ILP32D calling
conventions.
However, for microcontrollers that only provide hardware single-precision
floating-point units and whose programs use the ILP32F calling convention, the
single-precision loads and stores are used at least as frequently as
double-precision loads and stores are used in LP64D and ILP32D—hence the
motivation to provide compressed support for these in the XLEN=32-only Zcf
extension.
Short-range subroutine calls are more likely in small binaries for
microcontrollers, hence the motivation to include these in Zca.
Although reusing opcodes for different purposes for different base ISAs adds some complexity to documentation, the impact on implementation complexity is small even for designs that support multiple base ISAs. The compressed floating-point load and store variants use the same instruction format with the same register specifiers as the wider integer loads and stores.
Zca was designed under the constraint that each Zca instruction expands
into a single 32-bit instruction in the base ISA.
Adopting this constraint has two main benefits:
- Hardware designs can simply expand Zca instructions during decode, simplifying verification and minimizing modifications to existing microarchitectures.
- Compilers can be unaware of the Zca extension and leave code compression to the assembler and linker, although a compression-aware compiler will generally be able to produce better results.
At the time we designed the Zca extension, we felt the multiple complexity
reductions of a simple one-one mapping between Zca and base instructions far
outweighed the potential gains of a slightly denser encoding that added
additional instructions only supported in the Zca extension, or that allowed
encoding of multiple base instructions in one Zca instruction.
Since then, additional extensions Zcmp and Zcmt have been defined
that further reduce code size at the expense of these complexities.
It is important to note that the Zca extension is not designed to be a
stand-alone ISA, and is meant to be used alongside a base ISA.
Variable-length instruction sets have long been used to improve code density. For example, the IBM Stretch [24], developed in the late 1950s, had an ISA with 32-bit and 64-bit instructions, where some of the 32-bit instructions were compressed versions of the full 64-bit instructions. Stretch also employed the concept of limiting the set of registers that were addressable in some of the shorter instruction formats, with short branch instructions that could only refer to one of the index registers. The later IBM 360 architecture [25] supported a simple variable-length instruction encoding with 16-bit, 32-bit, or 48-bit instruction formats.
In 1963, CDC introduced the Cray-designed CDC 6600 [26], a precursor to RISC architectures, that introduced a register-rich load-store architecture with instructions of two lengths, 15-bits and 30-bits. The later Cray-1 design used a very similar instruction format, with 16-bit and 32-bit instruction lengths.
The initial RISC ISAs from the 1980s all picked performance over code size, which was reasonable for a workstation environment, but not for embedded systems. Hence, both ARM and MIPS subsequently made versions of the ISAs that offered smaller code size by offering an alternative 16-bit wide instruction set instead of the standard 32-bit wide instructions. The compressed RISC ISAs reduced code size relative to their starting points by about 25-30%, yielding code that was significantly smaller than 80x86. This result surprised some, as their intuition was that the variable-length CISC ISA should be smaller than RISC ISAs that offered only 16-bit and 32-bit formats.
Since the original RISC ISAs did not leave sufficient opcode space free to include these unplanned compressed instructions, they were instead developed as complete new ISAs. This meant compilers needed different code generators for the separate compressed ISAs. The first compressed RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only a fixed 16-bit instruction size, which gave good reductions in static code size but caused an increase in dynamic instruction count, which led to lower performance compared to the original fixed-width 32-bit instruction size. This led to the development of a second generation of compressed RISC ISA designs with mixed 16-bit and 32-bit instruction lengths (e.g., ARM Thumb2, microMIPS, PowerPC VLE), so that performance was similar to pure 32-bit instructions but with significant code size savings. Unfortunately, these different generations of compressed ISAs are incompatible with each other and with the original uncompressed ISA, leading to significant complexity in documentation, implementations, and software tools support.
Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently support a compressed instruction format. It is surprising that the most popular 64-bit ISA for mobile platforms (ARM v8) does not include a compressed instruction format given that static code size and dynamic instruction fetch bandwidth are important metrics. Although static code size is not a major concern in larger systems, instruction fetch bandwidth can be a major bottleneck in servers running commercial workloads, which often have a large instruction working set.
Benefiting from 25 years of hindsight, RISC-V was designed to support
compressed instructions from the outset, leaving enough opcode space for
the compressed extensions to be added on top of the base ISA (along with
many other extensions). The philosophy of Zca is to reduce code size for
embedded applications and to improve performance and energy-efficiency
for all applications due to fewer misses in the instruction cache.
Waterman shows a 25%-30% reduction in static instruction bits, which
reduces instruction cache misses by 20%-25%, or roughly the same
performance impact as doubling the instruction cache size. [27]
7.1.1 Compressed Instruction Formats
Table 28 shows the nine compressed instruction
formats. CR, CI, and CSS can use any of the 32 RVI registers, but CIW,
CL, CS, CA, and CB are limited to just 8 of them.
Table 29 lists these popular registers, which
correspond to registers x8 to x15. Note that there is a separate
version of load and store instructions that use the stack pointer as the
base address register, since saving to and restoring from the stack are
so prevalent, and that they use the CI and CSS formats to allow access
to all 32 data registers. CIW supplies an 8-bit immediate for the
ADDI4SPN instruction.
The RISC-V ABI was changed to make the frequently used registers map to
registers x8-x15. This simplifies the decompression decoder by
having a contiguous naturally aligned set of register numbers, and is
also compatible with the RV32E and RV64E base ISAs, which only have 16 integer
registers.
The formats were designed to keep bits for the two register source specifiers in the same place in all instructions, while the destination register field can move. When the full 5-bit destination register specifier is present, it is in the same place as in the 32-bit RISC-V encoding. Where immediates are sign-extended, the sign extension is always from bit 12. Immediate fields have been scrambled, as in the base specification, to reduce the number of immediate multiplexers required.
The immediate fields are scrambled in the instruction formats instead of in sequential order so that as many bits as possible are in the same position in every instruction, thereby simplifying implementations.
For many Zca instructions, zero-valued immediates are disallowed and
x0 is not a valid 5-bit register specifier. These restrictions free up
encoding space for other instructions requiring fewer operand bits.
Table 28. Compressed 16-bit Zca instruction formats
| Format | Meaning | Fields, most-significant first (bit ranges in inst[15:0]) |
|---|---|---|
| CR | Register | funct4 [15:12], rd/rs1 [11:7], rs2 [6:2], op [1:0] |
| CI | Immediate | funct3 [15:13], imm [12], rd/rs1 [11:7], imm [6:2], op [1:0] |
| CSS | Stack-relative Store | funct3 [15:13], imm [12:7], rs2 [6:2], op [1:0] |
| CIW | Wide Immediate | funct3 [15:13], imm [12:5], rd′ [4:2], op [1:0] |
| CL | Load | funct3 [15:13], imm [12:10], rs1′ [9:7], imm [6:5], rd′ [4:2], op [1:0] |
| CS | Store | funct3 [15:13], imm [12:10], rs1′ [9:7], imm [6:5], rs2′ [4:2], op [1:0] |
| CA | Arithmetic | funct6 [15:10], rd′/rs1′ [9:7], funct2 [6:5], rs2′ [4:2], op [1:0] |
| CB | Branch/Arithmetic | funct3 [15:13], offset [12:10], rd′/rs1′ [9:7], offset [6:2], op [1:0] |
| CJ | Jump | funct3 [15:13], jump target [12:2], op [1:0] |
Table 29. Registers specified by the three-bit rs1′, rs2′, and rd′ fields of the CIW, CL, CS, CA, and CB formats.
| Zca Register Number | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
| Integer Register Number | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 |
| Integer Register ABI Name | s0 | s1 | a0 | a1 | a2 | a3 | a4 | a5 |
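Because the eight registers are contiguous and naturally aligned, expanding a three-bit specifier to a full five-bit register number amounts to prepending the bits `01`. A minimal Python sketch of that mapping follows; the names `expand_reg` and `ABI_NAMES` are hypothetical.

```python
# ABI names of x8..x15, indexed by the three-bit Zca register number.
ABI_NAMES = ["s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5"]


def expand_reg(field: int) -> int:
    """Map a three-bit rs1'/rs2'/rd' field to an integer register number."""
    if not 0 <= field <= 0b111:
        raise ValueError("three-bit register field expected")
    return 0b01000 | field          # same as 8 + field


print(expand_reg(0b010), ABI_NAMES[0b010])   # 10 a0
```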
7.1.2 Load and Store Instructions
To increase the reach of 16-bit instructions, data-transfer instructions use zero-extended immediates that are scaled by the size of the data in bytes: ×4 for words, ×8 for doublewords, and ×16 for quadwords.
Zca provides two variants of loads and stores. One uses the ABI stack
pointer, x2, as the base address and can target any data register. The
other can reference one of 8 base address registers and one of 8 data
registers.
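As a rough illustration of how scaling extends reach, the sketch below computes the largest offset that a zero-extended immediate field can express for a given access size. The field widths used here (6 bits for the stack-pointer-based CI and CSS forms, 5 bits for the register-based CL and CS forms) are read off the format table; the helper name `max_offset` is hypothetical.

```python
def max_offset(field_bits: int, access_bytes: int) -> int:
    """Largest byte offset of a zero-extended field scaled by the access size."""
    return ((1 << field_bits) - 1) * access_bytes


reach = {
    "C.LW / C.SW (5 bits, x4)": max_offset(5, 4),
    "C.LD / C.SD (5 bits, x8)": max_offset(5, 8),
    "C.LWSP / C.SWSP (6 bits, x4)": max_offset(6, 4),
    "C.LDSP / C.SDSP (6 bits, x8)": max_offset(6, 8),
}
for name, limit in reach.items():
    print(f"{name}: offsets 0..{limit}")
```

Without scaling, the same 5-bit field would reach only 31 bytes, and most of its values would name misaligned addresses.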
7.1.2.1 Stack-Pointer-Based Loads and Stores
These instructions use the CI format.
C.LWSP loads a 32-bit value from memory into register rd. It computes
an effective address by adding the zero-extended offset, scaled by 4,
to the stack pointer, x2. It expands to LW rd, offset(x2).
C.LWSP is
valid only when rd≠x0; the code points with rd=x0 are reserved.
C.LDSP is an XLEN=64-only instruction that loads a 64-bit value
from memory into register rd. It computes its effective address by
adding the zero-extended offset, scaled by 8, to the stack pointer,
x2. It expands to LD rd, offset(x2).
C.LDSP is valid only when
rd≠x0; the code points with
rd=x0 are reserved.
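The expansion of C.LWSP can be sketched as a small decoder. This is an illustrative sketch rather than a reference implementation: the function name is hypothetical, and the placement of the offset bits (offset[5] in inst[12], offset[4:2] in inst[6:4], offset[7:6] in inst[3:2]) is taken from the Zca instruction listings, which are not reproduced in this section.

```python
def decode_c_lwsp(inst: int) -> str:
    """Expand a 16-bit C.LWSP encoding to its base-ISA LW form."""
    inst &= 0xFFFF
    if (inst & 0b11) != 0b10 or (inst >> 13) != 0b010:
        raise ValueError("not a C.LWSP encoding")
    rd = (inst >> 7) & 0x1F
    if rd == 0:
        raise ValueError("reserved: C.LWSP with rd=x0")
    offset = (
        (((inst >> 12) & 0b1) << 5)      # offset[5]
        | (((inst >> 4) & 0b111) << 2)   # offset[4:2]
        | (((inst >> 2) & 0b11) << 6)    # offset[7:6]
    )
    return f"lw x{rd}, {offset}(x2)"


print(decode_c_lwsp(0x40B2))   # lw x1, 12(x2)
```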
These instructions use the CSS format.
C.SWSP stores a 32-bit value in register rs2 to memory. It computes an
effective address by adding the zero-extended offset, scaled by 4, to
the stack pointer, x2. It expands to SW rs2, offset(x2).
C.SDSP is an XLEN=64-only instruction that stores a 64-bit value in
register rs2 to memory. It computes an effective address by adding the
zero-extended offset, scaled by 8, to the stack pointer, x2. It
expands to SD rs2, offset(x2).
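Going the other way, an assembler can emit C.SWSP whenever the offset is a multiple of 4 that fits the 6-bit scaled field. The following Python sketch shows one way to build the encoding; the function name is hypothetical, and the bit placement (offset[5:2] in inst[12:9], offset[7:6] in inst[8:7]) is taken from the Zca instruction listings rather than from this section.

```python
def encode_c_swsp(rs2: int, offset: int) -> int:
    """Encode SW rs2, offset(x2) as a 16-bit C.SWSP instruction."""
    if not 0 <= rs2 < 32:
        raise ValueError("rs2 must be an integer register number 0..31")
    if offset % 4 != 0 or not 0 <= offset <= 252:
        raise ValueError("offset must be a multiple of 4 in 0..252")
    return (
        (0b110 << 13)                      # funct3
        | (((offset >> 2) & 0xF) << 9)     # offset[5:2]
        | (((offset >> 6) & 0x3) << 7)     # offset[7:6]
        | (rs2 << 2)
        | 0b10                             # op (quadrant 2)
    )


print(hex(encode_c_swsp(1, 12)))   # 0xc606
```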
Register save/restore code at function entry/exit represents a
significant portion of static code size. The stack-pointer-based
compressed loads and stores in Zca are effective at reducing the
save/restore static code size by a factor of 2 while improving
performance by reducing dynamic instruction bandwidth.
A common mechanism used in other ISAs to further reduce save/restore code size is load-multiple and store-multiple instructions. We considered adopting these for RISC-V but noted the following drawbacks to these instructions:
- These instructions complicate processor implementations.
- For virtual memory systems, some data accesses could be resident in physical memory and some could not, which requires a new restart mechanism for partially executed instructions.
- Unlike the rest of the Zca instructions, there is no base ISA equivalent to Load Multiple and Store Multiple.
- Unlike the rest of the Zca instructions, the compiler would have to be aware of these load-multiple and store-multiple instructions to both allocate registers in the expected order and also to schedule the loads and stores contiguously and in the proper order, to maximize the chances of them being detected and replaced by an assembler or linker with the equivalent load-multiple or store-multiple compressed instruction.
- Simple microarchitectural implementations will constrain how other instructions can be scheduled around the load and store multiple instructions, leading to a potential performance loss.
- The desire for sequential register allocation might conflict with the featured registers selected for the CIW, CL, CS, CA, and CB formats.
Furthermore, much of the gains can be realized in software by replacing prologue and epilogue code with subroutine calls to common prologue and epilogue code, a technique described in Section 5.6 of [28].
While our rationale for omitting load-multiple and store-multiple remains
valid, the pressure to reduce code size is so great, even at the expense of
performance and microarchitectural complexity, that the Zcmp extension
has since been defined to compress most prologues and epilogues into two bytes
apiece.
7.1.2.2 Register-Based Loads and Stores
These instructions use the CL format.
C.LW loads a 32-bit value from memory into register rd′. It computes an effective address by adding the zero-extended offset, scaled by 4, to the base address in register rs1′. It expands to LW rd′, offset(rs1′).
C.LD is an XLEN=64-only instruction that loads a 64-bit value from memory into register rd′. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1′. It expands to LD rd′, offset(rs1′).
These instructions use the CS format.
C.SW stores a 32-bit value in register rs2′ to memory. It computes an effective address by adding the zero-extended offset, scaled by 4, to the base address in register rs1′. It expands to SW rs2′, offset(rs1′).
C.SD is an XLEN=64-only instruction that stores a 64-bit value in register rs2′ to memory. It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1′. It expands to SD rs2′, offset(rs1′).
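The register-based forms combine the three-bit register mapping with a 5-bit scaled offset. The sketch below expands C.LW as an example; the function name is hypothetical, and the offset bit placement (offset[5:3] in inst[12:10], offset[2] in inst[6], offset[6] in inst[5]) is taken from the Zca instruction listings, not from this section.

```python
def decode_c_lw(inst: int) -> str:
    """Expand a 16-bit C.LW encoding to its base-ISA LW form."""
    inst &= 0xFFFF
    if (inst & 0b11) != 0b00 or (inst >> 13) != 0b010:
        raise ValueError("not a C.LW encoding")
    rd = 8 + ((inst >> 2) & 0b111)     # rd'  selects x8..x15
    rs1 = 8 + ((inst >> 7) & 0b111)    # rs1' selects x8..x15
    offset = (
        (((inst >> 10) & 0b111) << 3)  # offset[5:3]
        | (((inst >> 6) & 0b1) << 2)   # offset[2]
        | (((inst >> 5) & 0b1) << 6)   # offset[6]
    )
    return f"lw x{rd}, {offset}(x{rs1})"


print(decode_c_lw(0x41C8))   # lw x10, 4(x11)
```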
7.1.3 Control Transfer Instructions
Zca provides unconditional jump instructions and conditional branch
instructions. As with base RVI instructions, the offsets of all
Zca control transfer instructions are in multiples of 2 bytes.
These instructions use the CJ format.
C.J performs an unconditional control transfer. The offset is
sign-extended and added to the pc to form the jump target address. C.J can
therefore target a ±2 KiB range. It expands to
JAL x0, offset.
C.JAL is an XLEN=32-only instruction that performs the same operation as
C.J, but additionally writes the address of the instruction following
the jump (pc+2) to the link register, x1. It expands to
JAL x1, offset.
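The ±2 KiB range follows from an 11-bit immediate field that holds offset[11:1], with offset[0] implicitly zero. The following Python sketch reassembles and sign-extends the CJ-format offset; the helper names are hypothetical, and the bit permutation (offset[11|4|9:8|10|6|7|3:1|5] in inst[12:2]) is taken from the Zca instruction listings rather than from this section.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_cj_offset(inst: int) -> int:
    """Return the signed byte offset encoded in a CJ-format instruction."""
    def bit(n: int) -> int:
        return (inst >> n) & 1

    raw = (
        (bit(12) << 11) | (bit(11) << 4) | (bit(10) << 9) | (bit(9) << 8)
        | (bit(8) << 10) | (bit(7) << 6) | (bit(6) << 7)
        | (bit(5) << 3) | (bit(4) << 2) | (bit(3) << 1) | (bit(2) << 5)
    )
    return sign_extend(raw, 12)


print(decode_cj_offset(0xB001), decode_cj_offset(0xAFFD))   # -2048 2046
```

The two encodings in the usage line are the extreme C.J offsets, so the reachable targets are pc-2048 through pc+2046 in steps of 2 bytes.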
These instructions use the CR format.
C.JR (jump register) performs an unconditional control transfer to the
address in register rs1. It expands to JALR x0, 0(rs1).
C.JR is
valid only when rs1≠x0; the code
point with rs1=x0 is reserved.
C.JALR (jump and link register) performs the same operation as C.JR, but
additionally writes the address of the instruction following the jump
(pc+2) to the link register, x1. It expands to
JALR x1, 0(rs1).
C.JALR is valid only when
rs1≠x0; the code point with
rs1=x0 corresponds to the C.EBREAK instruction.
Strictly speaking, C.JALR does not expand exactly to a base RVI instruction as the value added to the PC to form the link address is 2 rather than 4 as in the base ISA, but supporting both offsets of 2 and 4 bytes is only a very minor change to the base microarchitecture.
These instructions use the CB format.
C.BEQZ performs conditional control transfers. The offset is
sign-extended and added to the pc to form the branch target address.
It can therefore target a ±256 B range. C.BEQZ takes the
branch if the value in register rs1′ is zero. It
expands to BEQ rs1′, x0, offset.
C.BNEZ is defined analogously, but it takes the branch if rs1′ contains a nonzero value. It expands to BNE rs1′, x0, offset.
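A compressed branch carries an 8-bit field holding offset[8:1], which gives the ±256 B range once sign-extended. The sketch below expands C.BEQZ and C.BNEZ; the function names are hypothetical, and the bit permutation (offset[8|4:3] in inst[12:10], offset[7:6|2:1|5] in inst[6:2]) is taken from the Zca instruction listings, not from this section.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_cb_branch(inst: int) -> str:
    """Expand C.BEQZ / C.BNEZ to the equivalent BEQ / BNE against x0."""
    inst &= 0xFFFF
    funct3 = inst >> 13
    if (inst & 0b11) != 0b01 or funct3 not in (0b110, 0b111):
        raise ValueError("not a C.BEQZ or C.BNEZ encoding")

    def bit(n: int) -> int:
        return (inst >> n) & 1

    raw = (
        (bit(12) << 8) | (bit(11) << 4) | (bit(10) << 3)
        | (bit(6) << 7) | (bit(5) << 6)
        | (bit(4) << 2) | (bit(3) << 1) | (bit(2) << 5)
    )
    rs1 = 8 + ((inst >> 7) & 0b111)          # rs1' selects x8..x15
    mnemonic = "beq" if funct3 == 0b110 else "bne"
    return f"{mnemonic} x{rs1}, x0, {sign_extend(raw, 9)}"


print(decode_cb_branch(0xC119))   # beq x10, x0, 6
```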
7.1.4 Integer Computational Instructions
Zca provides several instructions for integer arithmetic and constant
generation.
7.1.4.1 Integer Constant-Generation Instructions
The two constant-generation instructions both use the CI instruction format and can target any integer register.
C.LI loads the sign-extended 6-bit immediate, imm, into register rd.
It expands into ADDI rd, x0, imm.
The C.LI code points with rd=x0 are HINTs.
C.LUI loads the non-zero 6-bit immediate field into bits 17–12 of the
destination register, clears the bottom 12 bits, and sign-extends bit 17
into all higher bits of the destination. It expands into
LUI rd, imm.
C.LUI is valid only when
rd≠x2,
and when the immediate is not equal to zero. The code points with
imm=0 are reserved.
The code points with rd=x2 and imm≠0 correspond to the
C.ADDI16SP instruction.
The code points with rd=x0 and imm≠0 are HINTs.
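The value produced by C.LUI can be computed directly from the 6-bit field: sign-extend it, then shift left by 12. A minimal Python sketch follows; the helper names and the `xlen` parameter are hypothetical, and the result is shown as an unsigned XLEN-bit register image.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def c_lui_value(imm6: int, xlen: int = 32) -> int:
    """Register value written by C.LUI for a 6-bit immediate field."""
    if not 0 <= imm6 < 64:
        raise ValueError("6-bit immediate field expected")
    if imm6 == 0:
        raise ValueError("reserved: C.LUI with imm=0")
    # Bits 17:12 come from the field; bit 17 is replicated into higher bits.
    return (sign_extend(imm6, 6) << 12) & ((1 << xlen) - 1)


print(hex(c_lui_value(0x01)))   # 0x1000
print(hex(c_lui_value(0x3F)))   # 0xfffff000
```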
7.1.4.2 Integer Register-Immediate Operations
These integer register-immediate operations are encoded in the CI format and perform operations on an integer register and a 6-bit immediate.
C.ADDI adds the non-zero sign-extended 6-bit immediate to the value in
register rd then writes the result to rd. It expands into
ADDI rd, rd, imm.
The code points with rd≠x0 and imm=0 are HINTs.
The code points with rd=x0 encode the C.NOP instruction, of
which the code points with imm≠0 are HINTs.
C.ADDIW is an XLEN=64-only instruction that performs the same
computation but produces a 32-bit result, then sign-extends the result to 64
bits. It expands into ADDIW rd, rd, imm. The immediate can be
zero for C.ADDIW, where this corresponds to SEXT.W rd.
C.ADDIW is
valid only when rd≠x0; the code points with
rd=x0 are reserved.
C.ADDI16SP (add immediate to stack pointer)
shares the opcode with C.LUI, but has a destination field of
x2. C.ADDI16SP adds the non-zero sign-extended 6-bit immediate to the
value in the stack pointer (sp=x2), where the immediate is scaled to
represent multiples of 16 in the range [-512, 496]. C.ADDI16SP is used to
adjust the stack pointer in procedure prologues and epilogues. It
expands into ADDI x2, x2, nzimm[9:4].
C.ADDI16SP is valid only when
nzimm≠0; the code point with nzimm=0 is reserved.
In the standard RISC-V calling convention, the stack pointer sp is
always 16-byte aligned.
C.ADDI4SPN (add immediate to stack pointer, non-destructive)
is a CIW-format instruction that adds a zero-extended
non-zero immediate, scaled by 4, to the stack pointer, x2, and writes
the result to rd′. This instruction is used to generate
pointers to stack-allocated variables, and expands to
ADDI rd′, x2, nzuimm[9:2].
C.ADDI4SPN is valid only when
nzuimm≠0; the code points with nzuimm=0 are
reserved.
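The immediate ranges of the two stack-pointer adjustment instructions follow from the field width and scale factor: a signed 6-bit field scaled by 16 for C.ADDI16SP, and an unsigned 8-bit field scaled by 4 for C.ADDI4SPN. The sketch below checks those bounds; the helper names are hypothetical.

```python
def scaled_signed_range(bits: int, scale: int) -> tuple[int, int]:
    """Inclusive range of a sign-extended field multiplied by scale."""
    return (-(1 << (bits - 1)) * scale, ((1 << (bits - 1)) - 1) * scale)


def scaled_unsigned_range(bits: int, scale: int) -> tuple[int, int]:
    """Inclusive range of a zero-extended field multiplied by scale."""
    return (0, ((1 << bits) - 1) * scale)


print(scaled_signed_range(6, 16))    # (-512, 496)  C.ADDI16SP
print(scaled_unsigned_range(8, 4))   # (0, 1020)    C.ADDI4SPN
```

Because zero is excluded for both instructions, the smallest usable C.ADDI4SPN immediate is 4 and the smallest-magnitude C.ADDI16SP adjustments are ±16.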
C.SLLI is a CI-format instruction that performs a logical left shift of the value in register rd then writes the result to rd. The shift amount is encoded in the shamt field. It expands into SLLI rd, rd, shamt[5:0].
The C.SLLI code points with shamt=0 or with rd=x0 are HINTs.
For XLEN=32, shamt[5] must be zero; the code points with shamt[5]=1 are designated for custom extensions.
C.SRLI is a CB-format instruction that performs a logical right shift of the value in register rd′ then writes the result to rd′. The shift amount is encoded in the shamt field. It expands into SRLI rd′, rd′, shamt.
The C.SRLI code points with shamt=0 are HINTs.
For XLEN=32, shamt[5] must be zero; the code points with shamt[5]=1 are designated for custom extensions.
C.SRAI is defined analogously to C.SRLI, but instead performs an arithmetic right shift. It expands to SRAI rd′, rd′, shamt.
Left shifts are usually more frequent than right shifts, as left shifts are frequently used to scale address values. Right shifts have therefore been granted less encoding space and are placed in an encoding quadrant where all other immediates are sign-extended.
C.ANDI is a CB-format instruction that computes the bitwise AND of the value in register rd′ and the sign-extended 6-bit immediate, then writes the result to rd′. It expands to ANDI rd′, rd′, imm.
7.1.4.3 Integer Register-Register Operations
These instructions use the CR format.
C.MV copies the value in register rs2 into register rd. It expands
into ADD rd, x0, rs2.
C.MV is valid only when
rs2≠x0; the code points with rs2=x0 correspond to the C.JR instruction.
The code points with rs2≠x0 and rd=x0 are HINTs.
C.MV expands to a different instruction than the canonical MV pseudoinstruction, which instead uses ADDI. Implementations that handle MV specially, e.g. using register-renaming hardware, may find it more convenient to expand C.MV to MV instead of ADD, at slight additional hardware cost.
C.ADD adds the values in registers rd and rs2 and writes the result
to register rd. It expands into ADD rd, rd, rs2.
C.ADD is only
valid when rs2≠x0; the code points with rs2=x0 correspond to the
C.JALR and C.EBREAK instructions.
The code points with rs2≠x0 and rd=x0 are HINTs.
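C.JR, C.MV, C.JALR, C.EBREAK, and C.ADD share one region of the CR format and are distinguished only by whether the register fields are zero. The following Python sketch classifies those code points as described above. It is illustrative only: the function name is hypothetical, and the funct4 values (1000 and 1001) and op=10 are taken from the Zca instruction listings rather than from this section.

```python
def classify_cr(inst: int) -> str:
    """Classify a quadrant-2 CR-format code point with funct4 = 100x."""
    inst &= 0xFFFF
    if (inst & 0b11) != 0b10 or (inst >> 13) != 0b100:
        raise ValueError("not a quadrant-2 CR-format encoding")
    low_funct4 = (inst >> 12) & 1
    rd = (inst >> 7) & 0x1F        # rd/rs1 field
    rs2 = (inst >> 2) & 0x1F
    if low_funct4 == 0:
        if rs2 == 0:
            return "reserved" if rd == 0 else f"C.JR: jalr x0, 0(x{rd})"
        return "HINT (C.MV, rd=x0)" if rd == 0 else f"C.MV: add x{rd}, x0, x{rs2}"
    if rs2 == 0:
        return "C.EBREAK: ebreak" if rd == 0 else f"C.JALR: jalr x1, 0(x{rd})"
    return "HINT (C.ADD, rd=x0)" if rd == 0 else f"C.ADD: add x{rd}, x{rd}, x{rs2}"


print(classify_cr(0x8082))   # C.JR: jalr x0, 0(x1)
print(classify_cr(0x852E))   # C.MV: add x10, x0, x11
```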
These instructions use the CA format.
C.AND computes the bitwise AND of the values in registers
rd′ and rs2′, then writes the result
to register rd′. It expands into
AND rd′, rd′, rs2′.
C.OR computes the bitwise OR of the values in registers
rd′ and rs2′, then writes the result
to register rd′. It expands into
OR rd′, rd′, rs2′.
C.XOR computes the bitwise XOR of the values in registers
rd′ and rs2′, then writes the result
to register rd′. It expands into
XOR rd′, rd′, rs2′.
C.SUB subtracts the value in register rs2′ from the value in register rd′, then writes the result to register rd′. It expands into SUB rd′, rd′, rs2′.
C.ADDW is an XLEN=64-only instruction that adds the values in registers rd′ and rs2′, then sign-extends the lower 32 bits of the sum before writing the result to register rd′. It expands into ADDW rd′, rd′, rs2′.
C.SUBW is an XLEN=64-only instruction that subtracts the value in register rs2′ from the value in register rd′, then sign-extends the lower 32 bits of the difference before writing the result to register rd′. It expands into SUBW rd′, rd′, rs2′.
These six instructions do not provide large savings individually, but they do not occupy much encoding space and are straightforward to implement, and as a group they provide a worthwhile improvement in static and dynamic compression.
7.1.4.4 Defined Illegal Instruction
A 16-bit instruction with all bits zero is permanently reserved as an illegal instruction.
We reserve all-zero instructions to be illegal instructions to help trap attempts to execute zero-ed or non-existent portions of the memory space. The all-zero value should not be redefined in any non-standard extension. Similarly, we reserve instructions with all bits set to 1 (corresponding to very long instructions in the RISC-V variable-length encoding scheme) as illegal to capture another common value seen in non-existent memory regions.
7.1.4.5 NOP Instruction
C.NOP is a CI-format instruction that does not change any user-visible
state, except for advancing the pc and incrementing any applicable
performance counters. C.NOP expands to NOP.
The C.NOP code points
with imm≠0 encode HINTs.
7.1.4.6 Breakpoint Instruction
Debuggers can use the C.EBREAK instruction, which expands to EBREAK,
to cause control to be transferred back to the debugging environment.
C.EBREAK shares the opcode with the C.ADD instruction, but with rd and
rs2 both zero, thus can also use the CR format.
7.1.5 Usage of Compressed Instructions in LR/SC Sequences
On implementations that support the Zca extension, compressed forms of
instructions permitted inside constrained LR/SC sequences, as
described in Section 5.2.1, are also permitted
inside constrained LR/SC sequences.
The implication is that any implementation that claims to support both
the A and Zca extensions must ensure that LR/SC sequences containing valid
Zca instructions will eventually complete.
7.1.6 HINT Instructions
A portion of the Zca encoding space is reserved for microarchitectural
HINTs. Like the HINTs in the RV32I base ISA (see
HINT Instructions), these instructions do not
modify any architectural state, except for advancing the pc and any
applicable performance counters. HINTs are executed as no-ops on
implementations that ignore them.
Zca HINTs are encoded as computational instructions that do not modify
the architectural state, either because rd=x0 (e.g.
C.ADD x0, t0), or because rd is overwritten with a copy of itself
(e.g. C.ADDI t0, 0).
This HINT encoding has been chosen so that simple implementations can ignore HINTs altogether, and instead execute a HINT as a regular computational instruction that happens not to mutate the architectural state.
Zca HINTs do not necessarily expand to their RVI HINT counterparts. For
example, C.ADD x0, a0 might not encode the same HINT as
ADD x0, x0, a0.
The primary reason to not require a Zca HINT to expand to an RVI HINT
is that HINTs are unlikely to be compressible in the same manner as the
underlying computational instruction. Also, decoupling the Zca and RVI
HINT mappings allows the scarce Zca HINT space to be allocated to the
most popular HINTs, and in particular, to HINTs that are amenable to
macro-op fusion.
Table 30 lists all Zca HINT code points. For XLEN=32, 78%
of the HINT space is reserved for standard HINTs. The remainder of the HINT space is designated for custom HINTs;
no standard HINTs will ever be defined in this subspace.
Table 30. Zca HINT instructions.
| Instruction | Constraints | Code Points | Purpose |
|---|---|---|---|
| C.NOP | imm≠0 | 63 | Designated for future standard use |
| C.ADDI | rd≠x0, imm=0 | 31 | Designated for future standard use |
| C.LI | rd=x0 | 64 | Designated for future standard use |
| C.LUI | rd=x0, imm≠0 | 63 | Designated for future standard use |
| C.MV | rd=x0, rs2≠x0 | 31 | Designated for future standard use |
| C.ADD | rd=x0, rs2≠x0, rs2≠x2-x5 | 27 | Designated for future standard use |
| C.ADD | rd=x0, rs2=x2-x5 | 4 | (rs2=x2) C.NTL.P1 (rs2=x3) C.NTL.PALL (rs2=x4) C.NTL.S1 (rs2=x5) C.NTL.ALL |
| C.SLLI | rd=x0 or imm=0 | 63 (RV32), 95 (RV64) | Designated for custom use |
| C.SRLI | imm=0 | 8 | Designated for custom use |
| C.SRAI | imm=0 | 8 | Designated for custom use |
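The 78% figure quoted for XLEN=32 can be reproduced from the code-point counts in Table 30. The short Python sketch below does the arithmetic; it assumes that the four C.NTL code points count as standard HINTs, and the dictionary names are hypothetical.

```python
# Code-point counts from Table 30 (XLEN=32 values where they differ).
standard = {
    "C.NOP": 63, "C.ADDI": 31, "C.LI": 64, "C.LUI": 63,
    "C.MV": 31, "C.ADD (unassigned)": 27, "C.ADD (C.NTL.*)": 4,
}
custom_rv32 = {"C.SLLI": 63, "C.SRLI": 8, "C.SRAI": 8}

standard_total = sum(standard.values())       # 283
custom_total = sum(custom_rv32.values())      # 79
fraction = standard_total / (standard_total + custom_total)

print(f"{standard_total} of {standard_total + custom_total} code points "
      f"({fraction:.0%}) are standard")       # 283 of 362 code points (78%) are standard
```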
7.1.7 Zca Instruction Set Listings
Table 31 shows a map of the major opcodes for the compressed extensions. Each row of the table corresponds to one quadrant of the encoding space. The last quadrant, which has the two least-significant bits set, corresponds to instructions wider than 16 bits, including those in the base ISAs. Several instructions are only valid for certain operands; when invalid, they are marked either RES to indicate that the opcode is reserved for future standard extensions; Custom to indicate that the opcode is designated for custom extensions; or HINT to indicate that the opcode is reserved for microarchitectural hints (see Section 7.1.6).
Table 31. Zca opcode map.
| inst[1:0] \ inst[15:13] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | Base ISA |
|---|---|---|---|---|---|---|---|---|---|
| 00 | ADDI4SPN | FLD | LW | LD | Reserved | FSD | SW | SD | RV64 |
| 01 | ADDI | ADDIW | LI | LUI/ADDI16SP | MISC-ALU | J | BEQZ | BNEZ | RV64 |
| 10 | SLLI | FLDSP | LWSP | LDSP | J[AL]R/MV/ADD | FSDSP | SWSP | SDSP | RV64 |
| 11 | >16b | | | | | | | | |
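A fetch unit can tell from the two least-significant bits of a 16-bit parcel whether it holds a complete compressed instruction or the start of a wider one. The following Python sketch shows that check together with the defined all-zero illegal instruction; the function names are hypothetical.

```python
def quadrant(parcel: int) -> int:
    """Return inst[1:0], the row of the opcode map."""
    return parcel & 0b11


def is_compressed(parcel: int) -> bool:
    """True if the parcel is a complete 16-bit instruction (quadrants 0-2)."""
    return quadrant(parcel) != 0b11


def is_defined_illegal(parcel: int) -> bool:
    """True for the all-zero 16-bit instruction, which is permanently illegal."""
    return (parcel & 0xFFFF) == 0


print(is_compressed(0x8082), is_compressed(0x0013))   # True False
```

In the usage line, 0x0013 is the low halfword of the 32-bit NOP (ADDI x0, x0, 0), so it falls in quadrant 3 and is not a complete instruction on its own.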
The Zca instructions are listed in three tables: Zca Instruction listing, Quadrant 0; Zca Instruction listing, Quadrant 1; and Zca Instruction listing, Quadrant 2.
7.2 Zcf Extension for Single-Precision Floating-Point Compressed Instructions
The Zcf extension adds compressed single-precision floating-point
load and store instructions.
It is an XLEN=32-only extension.
Single-precision loads and stores are not a significant source of static or dynamic compression for programs compiled for the LP64D and ILP32D calling conventions. However, for microcontrollers that only provide hardware single-precision floating-point units and whose programs use the ILP32F calling convention, the single-precision loads and stores are used at least as frequently as double-precision loads and stores are used in LP64D and ILP32D.
7.2.1 Stack-Pointer-Based Loads and Stores
C.FLWSP is an RV32FC-only instruction that loads a single-precision
floating-point value from memory into floating-point register rd. It
computes its effective address by adding the zero-extended offset,
scaled by 4, to the stack pointer, x2. It expands to
FLW rd, offset(x2).
C.FLWSP uses the CI format.
C.FSWSP is an RV32FC-only instruction that stores a single-precision
floating-point value in floating-point register rs2 to memory. It
computes an effective address by adding the zero-extended offset,
scaled by 4, to the stack pointer, x2. It expands to
FSW rs2, offset(x2).
7.2.2 Register-Based Loads and Stores
Compressed register-based floating-point loads and stores use the
CL and CS formats, respectively, with the eight registers mapping to f8 to f15.
The standard RISC-V calling convention maps the most frequently used
floating-point registers to registers f8 to f15, which allows the
same register decompression decoding as for integer register numbers.
These instructions encode their data source or destination as described in the following table.
Table 32. Registers specified by the three-bit rs1′, rs2′, and rd′ fields of the CIW, CL, CS, CA, and CB formats.
| Zcf Register Number | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
| Floating-Point Register Number | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 |
| Floating-Point Register ABI Name | fs0 | fs1 | fa0 | fa1 | fa2 | fa3 | fa4 | fa5 |
C.FLW is an RV32FC-only instruction that loads a single-precision
floating-point value from memory into floating-point register
_rd′_. It computes an effective address by adding the
zero-extended offset, scaled by 4, to the base address in register
_rs1′_. It expands to
FLW rd′, offset(rs1′).
C.FSW is an RV32FC-only instruction that stores a single-precision
floating-point value in floating-point register _rs2′_ to
memory. It computes an effective address by adding the zero-extended
offset, scaled by 4, to the base address in register
_rs1′_. It expands to
FSW rs2′, offset(rs1′).
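As a rough illustration of when an assembler could select the compressed form, the check below tests whether an FLW fits the C.FLW constraints described above. The function name is hypothetical, and the 5-bit unscaled immediate (offsets 0 to 124 in steps of 4) is an assumption carried over from the CL format of the base compressed encodings.

```python
def can_use_c_flw(rd: int, rs1: int, offset: int) -> bool:
    """Return True if FLW f<rd>, offset(x<rs1>) fits the C.FLW form.

    Assumes a 5-bit unscaled immediate, i.e. offsets 0..124, multiples of 4.
    """
    regs_ok = 8 <= rd <= 15 and 8 <= rs1 <= 15      # f8-f15 and x8-x15 only
    offset_ok = 0 <= offset <= 124 and offset % 4 == 0
    return regs_ok and offset_ok


print(can_use_c_flw(10, 8, 16))   # True
print(can_use_c_flw(10, 8, 128))  # False: offset out of range
```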
7.3 Zcd Extension for Double-Precision Floating-Point Compressed Instructions
The Zcd extension adds compressed double-precision floating-point
load and store instructions.
Double-precision loads and stores represent a significant fraction of static instructions in programs compiled for the ILP32D and LP64D calling conventions, due in large part to callee-saved register spills and fills.
7.3.1 Stack-Pointer-Based Loads and Stores
C.FLDSP is an RV32DC/RV64DC-only instruction that loads a
double-precision floating-point value from memory into floating-point
register rd. It computes its effective address by adding the
zero-extended offset, scaled by 8, to the stack pointer, x2. It
expands to FLD rd, offset(x2).
C.FSDSP is an RV32DC/RV64DC-only instruction that stores a
double-precision floating-point value in floating-point register rs2
to memory. It computes an effective address by adding the
zero-extended offset, scaled by 8, to the stack pointer, x2. It
expands to FSD rs2, offset(x2).
7.3.2 Register-Based Loads and Stores
These instructions encode their data source or destination as described in Table 32.
C.FLD is an RV32DC/RV64DC-only instruction that loads a double-precision
floating-point value from memory into floating-point register
_rd′_. It computes an effective address by adding the
zero-extended offset, scaled by 8, to the base address in register
_rs1′_. It expands to
FLD rd′, offset(rs1′).
C.FSD is an RV32DC/RV64DC-only instruction that stores a
double-precision floating-point value in floating-point register
_rs2′_ to memory. It computes an effective address by
adding the zero-extended offset, scaled by 8, to the base address in
register _rs1′_. It expands to
FSD rs2′, offset(rs1′).
7.4 C Extension for Compressed Instructions
This section describes the C extension, which incorporates the
compressed instruction-set extensions designed for application-
and server-class processors.
The C extension substantially improves code density across a wide
range of applications, thereby improving performance, area-efficiency,
and energy-efficiency of processors with instruction caches.
It excludes features that improve code density at the cost of performance.
The C extension depends upon the Zca extension.
If XLEN=32 and the F extension is present, the C extension additionally
depends upon the Zcf extension.
If the D extension is present, the C extension additionally depends
upon the Zcd extension.
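The dependency rules above can be written out as a small Python function. The name `c_extension_components` is hypothetical; the logic follows the three sentences directly.

```python
def c_extension_components(xlen: int, has_f: bool, has_d: bool) -> list[str]:
    """List the Zc* extensions that C depends upon for a given configuration."""
    deps = ["Zca"]                     # always required
    if xlen == 32 and has_f:
        deps.append("Zcf")             # XLEN=32 only, and only with F
    if has_d:
        deps.append("Zcd")             # whenever D is present
    return deps


print(c_extension_components(32, True, True))    # ['Zca', 'Zcf', 'Zcd']
print(c_extension_components(64, True, True))    # ['Zca', 'Zcd']
print(c_extension_components(32, False, False))  # ['Zca']
```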
7.5 Zcb Extension for Additional Compressed Instructions
The Zcb extension adds several compressed instructions which, like those
in the Zca extension, expand into a single 32-bit instruction.
The Zcb extension depends on the Zca extension.
As shown on the individual instruction pages, many of the instructions in
Zcb depend upon another extension being implemented.
For example, C.MUL is only implemented if M or Zmmul is
implemented, and C.SEXT.B is only implemented if Zbb is
implemented.
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| yes | yes | C.LBU rd', uimm(rs1') | Load unsigned byte, 16-bit encoding |
| yes | yes | C.LHU rd', uimm(rs1') | Load unsigned halfword, 16-bit encoding |
| yes | yes | C.LH rd', uimm(rs1') | Load signed halfword, 16-bit encoding |
| yes | yes | C.SB rs2', uimm(rs1') | Store byte, 16-bit encoding |
| yes | yes | C.SH rs2', uimm(rs1') | Store halfword, 16-bit encoding |
| yes | yes | C.ZEXT.B rsd' | Zero extend byte, 16-bit encoding |
| yes | yes | C.SEXT.B rsd' | Sign extend byte, 16-bit encoding |
| yes | yes | C.ZEXT.H rsd' | Zero extend halfword, 16-bit encoding |
| yes | yes | C.SEXT.H rsd' | Sign extend halfword, 16-bit encoding |
| | yes | C.ZEXT.W rsd' | Zero extend word, 16-bit encoding |
| yes | yes | C.NOT rsd' | Bitwise not, 16-bit encoding |
| yes | yes | C.MUL rsd', rs2' | Multiply, 16-bit encoding |
7.5.1 C.LBU
Synopsis Load unsigned byte, 16-bit encoding
Mnemonic C.LBU rd', uimm(rs1')
Encoding (RV32, RV64):
The immediate offset is formed as follows:
uimm[31:2] = 0;
uimm[1] = encoding[5];
uimm[0] = encoding[6];
Description This instruction loads a byte from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting byte is zero extended to XLEN bits and is written to rd'.
rd' and rs1' are from the standard 8-register set x8-x15.
Prerequisites None
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
X(rdc) = EXTZ(mem[X(rs1c)+EXTZ(uimm)][7..0]);
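The immediate formation for the byte accesses swaps the two encoding bits, which is easy to get wrong. The sketch below extracts the offset from a 16-bit encoding in Python; the helper names are hypothetical, and the halfword variant (used by the halfword loads and stores of this extension) is included for comparison.

```python
def zcb_byte_uimm(encoding: int) -> int:
    """Offset for C.LBU/C.SB: uimm[1] = encoding[5], uimm[0] = encoding[6]."""
    bit5 = (encoding >> 5) & 1
    bit6 = (encoding >> 6) & 1
    return (bit5 << 1) | bit6


def zcb_half_uimm(encoding: int) -> int:
    """Offset for C.LHU/C.LH/C.SH: uimm[1] = encoding[5], uimm[0] = 0."""
    return ((encoding >> 5) & 1) << 1


print(zcb_byte_uimm(0x40))  # 1: only encoding[6] set
print(zcb_byte_uimm(0x20))  # 2: only encoding[5] set
print(zcb_half_uimm(0x60))  # 2: encoding[6] does not contribute
```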
7.5.2 C.LHU
Synopsis Load unsigned halfword, 16-bit encoding
Mnemonic C.LHU rd', uimm(rs1')
Encoding (RV32, RV64):
The immediate offset is formed as follows:
uimm[31:2] = 0;
uimm[1] = encoding[5];
uimm[0] = 0;
Description This instruction loads a halfword from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting halfword is zero extended to XLEN bits and is written to rd'.
rd' and rs1' are from the standard 8-register set x8-x15.
Prerequisites None
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
X(rdc) = EXTZ(load_mem[X(rs1c)+EXTZ(uimm)][15..0]);
7.5.3 C.LH
Synopsis Load signed halfword, 16-bit encoding
Mnemonic C.LH rd', uimm(rs1')
Encoding (RV32, RV64):
The immediate offset is formed as follows:
uimm[31:2] = 0;
uimm[1] = encoding[5];
uimm[0] = 0;
Description This instruction loads a halfword from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting halfword is sign extended to XLEN bits and is written to rd'.
rd' and rs1' are from the standard 8-register set x8-x15.
Prerequisites None
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
X(rdc) = EXTS(load_mem[X(rs1c)+EXTZ(uimm)][15..0]);
7.5.4 C.SB
Synopsis Store byte, 16-bit encoding
Mnemonic C.SB rs2', uimm(rs1')
Encoding (RV32, RV64):
The immediate offset is formed as follows:
uimm[31:2] = 0;
uimm[1] = encoding[5];
uimm[0] = encoding[6];
Description This instruction stores the least significant byte of rs2' to the memory address formed by adding rs1' to the zero extended immediate uimm.
rs1' and rs2' are from the standard 8-register set x8-x15.
Prerequisites None
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
mem[X(rs1c)+EXTZ(uimm)][7..0] = X(rs2c)
7.5.5 C.SH
Synopsis Store halfword, 16-bit encoding
Mnemonic C.SH rs2', uimm(rs1')
Encoding (RV32, RV64):
The immediate offset is formed as follows:
uimm[31:2] = 0;
uimm[1] = encoding[5];
uimm[0] = 0;
Description This instruction stores the least significant halfword of rs2' to the memory address formed by adding rs1' to the zero extended immediate uimm.
rs1' and rs2' are from the standard 8-register set x8-x15.
Prerequisites None
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
mem[X(rs1c)+EXTZ(uimm)][15..0] = X(rs2c)
7.5.6 C.ZEXT.B
Synopsis Zero extend byte, 16-bit encoding
Mnemonic C.ZEXT.B rd'/rs1'
Encoding (RV32, RV64):
Description This instruction takes a single source/destination operand. It zero-extends the least-significant byte of the operand to XLEN bits by inserting zeros into all of the bits more significant than 7.
rd'/rs1' is from the standard 8-register set x8-x15.
Prerequisites None
32-bit equivalent:
andi rd'/rs1', rd'/rs1', 0xff
The SAIL module variable for rd'/rs1' is called rsdc.
Operation
X(rsdc) = EXTZ(X(rsdc)[7..0]);
7.5.7 C.SEXT.B
Synopsis Sign extend byte, 16-bit encoding
Mnemonic C.SEXT.B rd'/rs1'
Encoding (RV32, RV64):
Description This instruction takes a single source/destination operand. It sign-extends the least-significant byte in the operand to XLEN bits by copying the most-significant bit in the byte (i.e., bit 7) to all of the more-significant bits.
rd'/rs1' is from the standard 8-register set x8-x15.
Prerequisites
Zbb is also required.
The SAIL module variable for rd'/rs1' is called rsdc.
Operation
X(rsdc) = EXTS(X(rsdc)[7..0]);
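The sign extension performed here can be illustrated with plain integer arithmetic. This is a sketch in Python with a hypothetical function name; register values are modelled as unsigned XLEN-bit integers.

```python
def sext_b(value: int, xlen: int = 32) -> int:
    """Sign-extend the least-significant byte of value to xlen bits."""
    byte = value & 0xFF
    if byte & 0x80:                          # bit 7 set: fill upper bits with ones
        byte |= ((1 << xlen) - 1) & ~0xFF
    return byte


print(hex(sext_b(0x1280)))  # 0xffffff80
print(hex(sext_b(0x127F)))  # 0x7f
```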
7.5.8 C.ZEXT.H
Synopsis Zero extend halfword, 16-bit encoding
Mnemonic C.ZEXT.H rd'/rs1'
Encoding (RV32, RV64):
Description This instruction takes a single source/destination operand. It zero-extends the least-significant halfword of the operand to XLEN bits by inserting zeros into all of the bits more significant than 15.
rd'/rs1' is from the standard 8-register set x8-x15.
Prerequisites
Zbb is also required.
The SAIL module variable for rd'/rs1' is called rsdc.
Operation
X(rsdc) = EXTZ(X(rsdc)[15..0]);
7.5.9 C.SEXT.H
Synopsis Sign extend halfword, 16-bit encoding
Mnemonic C.SEXT.H rd'/rs1'
Encoding (RV32, RV64):
Description This instruction takes a single source/destination operand. It sign-extends the least-significant halfword in the operand to XLEN bits by copying the most-significant bit in the halfword (i.e., bit 15) to all of the more-significant bits.
rd'/rs1' is from the standard 8-register set x8-x15.
Prerequisites
Zbb is also required.
The SAIL module variable for rd'/rs1' is called rsdc.
Operation
X(rsdc) = EXTS(X(rsdc)[15..0]);
7.5.10 C.ZEXT.W
Synopsis Zero extend word, 16-bit encoding
Mnemonic C.ZEXT.W rd'/rs1'
Encoding (RV64):
Description This instruction takes a single source/destination operand. It zero-extends the least-significant word of the operand to XLEN bits by inserting zeros into all of the bits more significant than 31.
rd'/rs1' is from the standard 8-register set x8-x15.
Prerequisites
Zba is also required.
32-bit equivalent:
add.uw rd'/rs1', rd'/rs1', zero
The SAIL module variable for rd'/rs1' is called rsdc.
Operation
X(rsdc) = EXTZ(X(rsdc)[31..0]);
7.5.11 C.NOT
Synopsis Bitwise not, 16-bit encoding
Mnemonic C.NOT rd'/rs1'
Encoding (RV32, RV64):
Description This instruction takes the one’s complement of rd'/rs1' and writes the result to the same register.
rd'/rs1' is from the standard 8-register set x8-x15.
Prerequisites None
32-bit equivalent:
xori rd'/rs1', rd'/rs1', -1
The SAIL module variable for rd'/rs1' is called rsdc.
Operation
X(rsdc) = X(rsdc) XOR -1;
7.5.12 C.MUL
Synopsis Multiply, 16-bit encoding
Mnemonic C.MUL rsd', rs2'
Encoding (RV32, RV64):
Description This instruction multiplies XLEN bits of the source operands from rsd' and rs2' and writes the lowest XLEN bits of the result to rsd'.
rd'/rs1' and rs2' are from the standard 8-register set x8-x15.
Prerequisites
M or Zmmul must be configured.
The SAIL module variable for rd'/rs1' is called rsdc, and for rs2' is called rs2c.
Operation
let result_wide = to_bits(2 * sizeof(xlen), signed(X(rsdc)) * signed(X(rs2c)));
X(rsdc) = result_wide[(sizeof(xlen) - 1) .. 0];
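Because only the lowest XLEN bits of the product are kept, the result is the same whether the operands are treated as signed or unsigned. The following Python sketch (hypothetical function name, registers modelled as unsigned integers) shows the truncation.

```python
def c_mul(rsd: int, rs2: int, xlen: int = 32) -> int:
    """Return the low xlen bits of rsd * rs2."""
    mask = (1 << xlen) - 1
    return ((rsd & mask) * (rs2 & mask)) & mask


print(hex(c_mul(0xFFFFFFFF, 2)))     # 0xfffffffe, i.e. -1 * 2 = -2
print(hex(c_mul(0x10000, 0x10000)))  # 0x0, the product overflows 32 bits
```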
7.6 Zcmt Extension for Compressed Table Jumps
The Zcmt extension adds table-jump instructions, which improve code density
when procedures have many call sites.
It also adds the jvt CSR.
The jvt CSR requires a state enable if Smstateen is implemented. See
jvt CSR, table jump base vector and control register for details.
The Zcmt extension conflicts with the Zcd extension.
Zcmt is primarily targeted at embedded class CPUs due to implementation
complexity. Additionally, it is not compatible with RVA profiles.
The Zcmt extension depends on the Zca and Zicsr extensions.
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| yes | yes | CM.JT index | Jump via table |
| yes | yes | CM.JALT index | Jump and link via table |
7.6.1 Table Jump Overview
CM.JT (Jump via table) and CM.JALT (Jump and link via table) are referred to as table jump.
Table jump uses a 256-entry, XLEN-wide table in instruction memory to hold function addresses. The table must be aligned to at least 64 bytes.
Table entries follow the current data endianness. This is different from normal instruction fetch which is always little-endian.
CM.JT and CM.JALT encodings index the table, giving access to functions within the full XLEN wide address space.
This is used as a form of dictionary compression to reduce the code size of JAL / AUIPC+JALR / JR / AUIPC+JR instructions.
Table jump allows the linker to replace the following instruction sequences with a CM.JT or CM.JALT encoding, and an entry in the table:
- 32-bit J calls
- 32-bit JAL ra calls
- 64-bit AUIPC+JR calls to fixed locations
- 64-bit AUIPC+JALR ra calls to fixed locations
Note: the AUIPC+JR/JALR sequence is used when the offset from the PC is outside the ±1 MB range.
If a return address stack is implemented, then as CM.JALT is equivalent to JAL ra, it pushes to the stack.
7.6.2 jvt
The base of the table is held in the jvt CSR (see jvt CSR, table jump base vector and control register); each table entry is XLEN bits wide.
If the same function is called with and without linking then it must have two entries in the table. This is typically caused by the same function being called with and without tail calling.
7.6.3 Table Jump Fault handling
For a table jump instruction, the table entry that the instruction selects is considered an extension of the instruction itself. Hence, the execution of a table jump instruction involves two instruction fetches, the first to read the instruction (CM.JT/CM.JALT) and the second to read from the jump vector table (JVT). Both instruction fetches are implicit reads, and both require execute permission; read permission is irrelevant. It is recommended that the second fetch be ignored for hardware triggers and breakpoints.
Memory writes to the jump vector table require an instruction barrier (FENCE.I) to guarantee that they are visible to the instruction fetch.
Multiple contexts may have different jump vector tables. JVT may be switched between them without an instruction barrier if the tables have not been updated in memory since the last FENCE.I.
If an exception occurs on either instruction fetch, xEPC is set to the PC of the table jump instruction, xCAUSE is set as expected for the type of fault and xTVAL (if not set to zero) contains the fetch address which caused the fault.
7.6.4 jvt CSR
Synopsis Table jump base vector and control register
Address:
0x017
Permissions:
URW
Format (RV32):
Format (RV64):
Description
The jvt register is an XLEN-bit WARL read/write register that holds the jump table configuration, consisting of the jump table base address (BASE) and the jump table mode (MODE).
If Zcmt is implemented then jvt must also be implemented, but can contain a read-only value. If jvt is writable, the set of values the register may hold can vary by implementation. The value in the BASE field must always be aligned on a 64-byte boundary.
Note that the CSR contains only bits XLEN-1 through 6 of the address base. When computing jump-table accesses, the lower six bits of base are filled with zeroes to obtain an XLEN-bit jump-table base address jvt.BASE that is always aligned on a 64-byte boundary.
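Splitting a jvt value into its two fields can be sketched as below. This is an illustrative Python model with a hypothetical function name; it assumes MODE occupies bits 5:0 and BASE occupies bits XLEN-1:6, consistent with the six-bit MODE values and the 64-byte alignment described in this section.

```python
def split_jvt(jvt: int, xlen: int = 32) -> tuple[int, int]:
    """Return (base_address, mode) for a jvt CSR value.

    Assumes MODE is bits 5:0 and BASE is bits XLEN-1:6.
    """
    jvt &= (1 << xlen) - 1
    mode = jvt & 0x3F
    base_address = jvt & ~0x3F     # lower six bits filled with zeroes
    return base_address, mode


base, mode = split_jvt(0x80001041)
print(hex(base), mode)  # 0x80001040 1
```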
jvt.BASE is a virtual address, whenever virtual memory is enabled.
The memory pointed to by jvt.BASE is treated as instruction memory for the purpose of executing table jump instructions, implying execute access permission.
Table 33. jvt.MODE definition.
jvt.MODE | Comment |
|---|---|
| 000000 | Jump table mode |
| others | reserved for future standard use |
jvt.MODE is a WARL field, so can only be programmed to modes which are implemented. Therefore the discovery mechanism is to
attempt to program different modes and read back the values to see which are available. Jump table mode must be implemented.
In future, the RISC-V Unified Discovery method will report the available modes.
Architectural State:
The jvt CSR adds architectural state to the system software context (such as an OS process), and therefore must be saved and restored on context switches.
7.6.5 CM.JT
Synopsis jump via table
Mnemonic CM.JT index
Encoding (RV32, RV64):
For this encoding to decode as CM.JT, index<32, otherwise it decodes as CM.JALT, see Jump and link via table.
If jvt.MODE = 0 (Jump Table Mode) then CM.JT behaves as specified here. If jvt.MODE is a reserved value, then CM.JT is also reserved. In the future other defined values of jvt.MODE may change the behaviour of CM.JT.
Assembly Syntax:
cm.jt index
Description CM.JT reads an entry from the jump vector table in memory and jumps to the address that was read.
For further information see Table Jump Overview.
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists.
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
# table_address is temporary internal state, it doesn't represent a real register
# InstMemory is byte indexed
switch(XLEN) {
32: table_address[XLEN-1:0] = jvt.base + (index<<2);
64: table_address[XLEN-1:0] = jvt.base + (index<<3);
}
//fetch from the jump table
pc = InstMemory[table_address][XLEN-1:0]&~0x1; // Clear bit 0.
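The table lookup in the pseudocode above can be modelled in Python as follows. This is a sketch under stated assumptions: instruction memory is represented by a byte buffer indexed from zero, the function and parameter names are hypothetical, and the byte order is a parameter because table entries follow the current data endianness.

```python
def table_jump_target(mem, jvt_base: int, index: int, xlen: int = 32,
                      byteorder: str = "little") -> int:
    """Return the jump target read from the jump vector table."""
    entry_bytes = xlen // 8                          # 4 on RV32, 8 on RV64
    table_address = jvt_base + index * entry_bytes   # index<<2 or index<<3
    raw = mem[table_address:table_address + entry_bytes]
    entry = int.from_bytes(raw, byteorder)
    return entry & ~0x1                              # clear bit 0


# Hypothetical memory image: table at 0x40, entry 2 holds 0x00010991.
mem = bytearray(0x80)
mem[0x48:0x4C] = (0x00010991).to_bytes(4, "little")
print(hex(table_jump_target(mem, 0x40, 2)))  # 0x10990
```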
7.6.6 CM.JALT
Synopsis Jump and link via table
Mnemonic CM.JALT index
Encoding (RV32, RV64):
For this encoding to decode as CM.JALT, index>=32, otherwise it decodes as CM.JT, see Jump via table.
If jvt.MODE = 0 (Jump Table Mode) then CM.JALT behaves as specified here. If jvt.MODE is a reserved value, then CM.JALT is also reserved. In the future other defined values of jvt.MODE may change the behaviour of CM.JALT.
Assembly Syntax:
cm.jalt index
Description CM.JALT reads an entry from the jump vector table in memory and jumps to the address that was read, linking to ra.
For further information see Table Jump Overview.
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists.
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
# table_address is temporary internal state, it doesn't represent a real register
# InstMemory is byte indexed
switch(XLEN) {
32: table_address[XLEN-1:0] = jvt.base + (index<<2);
64: table_address[XLEN-1:0] = jvt.base + (index<<3);
}
//fetch from the jump table
ra = pc+2;
pc = InstMemory[table_address][XLEN-1:0]&~0x1; // Clear bit 0.
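The split between the two table-jump instructions depends only on the index, and only CM.JALT writes ra. The Python sketch below illustrates this; the function name is hypothetical, the index is assumed to be 8 bits wide to match the 256-entry table, and `target` stands for the table entry that has already been fetched.

```python
def execute_table_jump(index: int, pc: int, ra: int, target: int) -> tuple[str, int, int]:
    """Return (mnemonic, new_pc, new_ra) for a table-jump encoding."""
    if not 0 <= index <= 255:
        raise ValueError("index must select one of 256 table entries")
    new_pc = target & ~0x1            # clear bit 0 of the fetched entry
    if index < 32:
        return "cm.jt", new_pc, ra    # no link: ra is unchanged
    return "cm.jalt", new_pc, pc + 2  # link to the next 16-bit-aligned instruction


print(execute_table_jump(31, 0x100, 0xAAAA, 0x2001))  # ('cm.jt', 8192, 43690)
print(execute_table_jump(32, 0x100, 0xAAAA, 0x2001))  # ('cm.jalt', 8192, 258)
```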
7.7 Zcmp Extension for Compressed Prologues and Epilogues
The Zcmp extension adds instructions that substantially reduce the
static code size of procedure prologues and epilogues.
The instructions it adds are collectively referred to as PUSH/POP:
- The term PUSH refers to CM.PUSH.
- The term POP refers to CM.POP.
- The term POPRET refers to CM.POPRET and CM.POPRETZ.
Common details for these instructions are in this section.
7.7.1 PUSH/POP functional overview
PUSH, POP, POPRET are used to reduce the size of function prologues and epilogues.
- The PUSH instruction
  - adjusts the stack pointer to create the stack frame
  - pushes (stores) the registers specified in the register list to the stack frame
- The POP instruction
  - pops (loads) the registers in the register list from the stack frame
  - adjusts the stack pointer to destroy the stack frame
- The POPRET instructions
  - pop (load) the registers in the register list from the stack frame
  - CM.POPRETZ also moves zero into a0 as the return value
  - adjust the stack pointer to destroy the stack frame
  - execute a ret instruction to return from the function
7.7.2 Example usage
This example gives an illustration of the use of PUSH and POPRET.
It uses the function processMarkers from the Embench benchmark picojpeg, which is found in the file libpicojpeg.c on GitHub.
The prologue and epilogue compile with GCC10 to:
0001098a <processMarkers>:
1098a: 711d addi sp,sp,-96 ;#cm.push(1)
1098c: c8ca sw s2,80(sp) ;#cm.push(2)
1098e: c6ce sw s3,76(sp) ;#cm.push(3)
10990: c4d2 sw s4,72(sp) ;#cm.push(4)
10992: ce86 sw ra,92(sp) ;#cm.push(5)
10994: cca2 sw s0,88(sp) ;#cm.push(6)
10996: caa6 sw s1,84(sp) ;#cm.push(7)
10998: c2d6 sw s5,68(sp) ;#cm.push(8)
1099a: c0da sw s6,64(sp) ;#cm.push(9)
1099c: de5e sw s7,60(sp) ;#cm.push(10)
1099e: dc62 sw s8,56(sp) ;#cm.push(11)
109a0: da66 sw s9,52(sp) ;#cm.push(12)
109a2: d86a sw s10,48(sp);#cm.push(13)
109a4: d66e sw s11,44(sp);#cm.push(14)
...
109f4: 4501 li a0,0 ;#cm.popretz(1)
109f6: 40f6 lw ra,92(sp) ;#cm.popretz(2)
109f8: 4466 lw s0,88(sp) ;#cm.popretz(3)
109fa: 44d6 lw s1,84(sp) ;#cm.popretz(4)
109fc: 4946 lw s2,80(sp) ;#cm.popretz(5)
109fe: 49b6 lw s3,76(sp) ;#cm.popretz(6)
10a00: 4a26 lw s4,72(sp) ;#cm.popretz(7)
10a02: 4a96 lw s5,68(sp) ;#cm.popretz(8)
10a04: 4b06 lw s6,64(sp) ;#cm.popretz(9)
10a06: 5bf2 lw s7,60(sp) ;#cm.popretz(10)
10a08: 5c62 lw s8,56(sp) ;#cm.popretz(11)
10a0a: 5cd2 lw s9,52(sp) ;#cm.popretz(12)
10a0c: 5d42 lw s10,48(sp);#cm.popretz(13)
10a0e: 5db2 lw s11,44(sp);#cm.popretz(14)
10a10: 6125 addi sp,sp,96 ;#cm.popretz(15)
10a12: 8082 ret ;#cm.popretz(16)
With the GCC option -msave-restore, the output is the following:
0001080e <processMarkers>:
1080e: 73a012ef jal t0,11f48 <__riscv_save_12>
10812: 1101 addi sp,sp,-32
...
10862: 4501 li a0,0
10864: 6105 addi sp,sp,32
10866: 71e0106f j 11f84 <__riscv_restore_12>
With PUSH/POPRET this reduces to:
0001080e <processMarkers>:
1080e: b8fa cm.push {ra,s0-s11},-96
...
10866: bcfa cm.popretz {ra,s0-s11}, 96
The prologue and epilogue shrink from 60 bytes in the original code to 14 bytes with -msave-restore, and to 4 bytes with PUSH and POPRET. As well as reducing the code size, PUSH and POPRET eliminate the branches involved in calling the millicode save/restore routines, and so may also perform better.
The calls to <__riscv_save_0>/<__riscv_restore_0> become 64-bit when the target functions are out of the ±1 MB range, increasing the prologue/epilogue size to 22 bytes.
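The byte counts quoted for this example can be checked by adding up the instruction sizes in the three listings. The short Python calculation below does this; the variable names are illustrative only.

```python
# Instruction sizes in bytes, read off the listings in this example.
baseline = 14 * 2 + 16 * 2              # 14 prologue + 16 epilogue 16-bit instructions
save_restore = (4 + 2) + (2 + 2 + 4)    # jal + addi, then li + addi + j
push_popret = 2 + 2                     # cm.push + cm.popretz
far_calls = save_restore + 2 * 4        # the call and the jump each grow from 32 to 64 bits

print(baseline, save_restore, push_popret, far_calls)  # 60 14 4 22
```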
POP is typically used in tail-calling sequences where ret is not used to return to ra after destroying the stack frame.
7.7.2.1 Stack pointer adjustment handling
The instructions all automatically adjust the stack pointer by enough to cover the memory required for the registers being saved or restored. Additionally the spimm field in the encoding allows the stack pointer to be adjusted in additional increments of 16-bytes. There is only a small restricted range available in the encoding; if the range is insufficient then a separate C.ADDI16SP can be used to increase the range.
7.7.2.2 Register list handling
There is no support for the {ra, s0-s10} register list without also adding s11. Therefore the {ra, s0-s11} register list must be used in this case.
7.7.3 PUSH/POP Fault handling
Correct execution requires that sp refers to idempotent memory (also see Non-idempotent memory handling), because the core must be able to handle traps detected during the sequence. The entire PUSH/POP sequence is re-executed after returning from the trap handler, and multiple traps are possible during the sequence.
If a trap occurs during the sequence, then xEPC is updated with the PC of the instruction, xTVAL (if not read-only-zero) is updated with the bad address if it was an access fault, and xCAUSE is updated with the type of trap.
It is implementation defined whether interrupts can also be taken during the sequence execution.
7.7.4 Software view of execution
7.7.4.1 Software view of the PUSH sequence
From a software perspective the PUSH sequence appears as:
- A sequence of stores writing the bytes required by the pseudocode
  - The bytes may be written in any order.
  - The bytes may be grouped into larger accesses.
  - Any of the bytes may be written multiple times.
- A stack pointer adjustment
If an implementation allows interrupts during the sequence, and the interrupt handler uses sp to allocate stack memory, then any stores which were executed before the interrupt may be overwritten by the handler. This is safe because the memory is idempotent and the stores will be re-executed when execution resumes.
The stack pointer adjustment must be committed only when it is certain that the entire PUSH instruction will commit.
Stores may also return imprecise faults from the bus. It is platform defined whether the core implementation waits for the bus responses before continuing to the final stage of the sequence, or handles error responses after completing the PUSH instruction.
For example:
cm.push {ra, s0-s5}, -64
Appears to software as:
# any bytes from sp-1 to sp-28 may be written multiple times before
# the instruction completes therefore these updates may be visible in
# the interrupt/exception handler below the stack pointer
sw s5, -4(sp)
sw s4, -8(sp)
sw s3,-12(sp)
sw s2,-16(sp)
sw s1,-20(sp)
sw s0,-24(sp)
sw ra,-28(sp)
# this must only execute once, and will only execute after all stores
# completed without any precise faults, therefore this update is only
# visible in the interrupt/exception handler if cm.push has completed
addi sp, sp, -64
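The re-execution behaviour can be illustrated with a toy model in Python. This is a sketch, not a description of any implementation: memory is a dictionary, the fault is injected through a hypothetical `fault_at` parameter, and stores are issued in a fixed order even though any order is permitted.

```python
def cm_push_attempt(regs, sp, stack_adj, mem, fault_at=None, reg_bytes=4):
    """Model one execution attempt of cm.push.

    regs     : (name, value) pairs, highest-numbered register first
    mem      : dict mapping address -> value, updated in place
    fault_at : hypothetical address whose store raises a precise fault
    Returns (new_sp, completed).
    """
    addr = sp - reg_bytes
    for _name, value in regs:
        if addr == fault_at:
            return sp, False          # trap: earlier stores remain, sp unchanged
        mem[addr] = value
        addr -= reg_bytes
    return sp - stack_adj, True       # sp is adjusted once, after all stores


regs = [("s1", 0x51), ("s0", 0x50), ("ra", 0x1A)]
mem = {}
sp, done = cm_push_attempt(regs, 0x1000, 16, mem, fault_at=0xFF4)
# sp is still 0x1000 here; the stores of s1 and s0 are already visible in mem
sp, done = cm_push_attempt(regs, sp, 16, mem)   # re-executed after the handler
print(hex(sp), done)  # 0xff0 True
```

The first attempt leaves two stores below the unchanged stack pointer; the second attempt rewrites them and only then commits the adjustment.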
7.7.4.2 Software view of the POP/POPRET sequence
From a software perspective the POP/POPRET sequence appears as:
- A sequence of loads reading the bytes required by the pseudocode.
  - The bytes may be loaded in any order.
  - The bytes may be grouped into larger accesses.
  - Any of the bytes may be loaded multiple times.
- A stack pointer adjustment
- An optional LI a0, 0
- An optional RET
If a trap occurs during the sequence, then any loads which were executed before the trap may update architectural state. The loads will be re-executed once the trap handler completes, so the values will be overwritten. Therefore it is permitted for an implementation to update some of the destination registers before taking a fault.
The optional LI a0, 0, the stack pointer adjustment, and the optional RET must be committed only when it is certain that the entire POP/POPRET instruction will commit.
For POPRET, once the stack pointer adjustment has been committed, the RET must execute.
For example:
cm.popretz {ra, s0-s3}, 32;
Appears to software as:
# any or all of these load instructions may execute multiple times
# therefore these updates may be visible in the interrupt/exception handler
lw s3, 28(sp)
lw s2, 24(sp)
lw s1, 20(sp)
lw s0, 16(sp)
lw ra, 12(sp)
# these must only execute once, and will only execute after all loads
# complete successfully; all instructions must execute atomically,
# therefore these updates are not visible in the interrupt/exception handler
li a0, 0
addi sp, sp, 32
ret
7.7.5 Non-idempotent memory handling
An implementation may have a requirement to issue a PUSH/POP instruction to non-idempotent memory.
If the core implementation does not support PUSH/POP to non-idempotent memories, the core may use an idempotency PMA to detect it and take a load (POP/POPRET) or store (PUSH) access-fault exception in order to avoid unpredictable results.
Software should only use these instructions on non-idempotent memory regions when software can tolerate the required memory accesses being issued repeatedly in the case that they cause exceptions.
7.7.6 Example RV32I PUSH/POP sequences
The examples included show the load/store series expansion and the stack adjustment. Examples of CM.POPRET and CM.POPRETZ are not included, as the difference in the expanded sequence from CM.POP is trivial in all cases.
7.7.6.1 CM.PUSH {ra, s0-s2}, -64
Encoding: rlist=7, spimm=3
expands to:
sw s2, -4(sp);
sw s1, -8(sp);
sw s0, -12(sp);
sw ra, -16(sp);
addi sp, sp, -64;
7.7.6.2 CM.PUSH {ra, s0-s11}, -112
Encoding: rlist=15, spimm=3
expands to:
sw s11, -4(sp);
sw s10, -8(sp);
sw s9, -12(sp);
sw s8, -16(sp);
sw s7, -20(sp);
sw s6, -24(sp);
sw s5, -28(sp);
sw s4, -32(sp);
sw s3, -36(sp);
sw s2, -40(sp);
sw s1, -44(sp);
sw s0, -48(sp);
sw ra, -52(sp);
addi sp, sp, -112;
7.7.6.3 CM.POP {ra}, 16
Encoding: rlist=4, spimm=0
expands to:
lw ra, 12(sp);
addi sp, sp, 16;
7.7.6.4 CM.POP {ra, s0-s3}, 48
Encoding: rlist=8, spimm=1
expands to:
lw s3, 44(sp);
lw s2, 40(sp);
lw s1, 36(sp);
lw s0, 32(sp);
lw ra, 28(sp);
addi sp, sp, 48;
7.7.6.5 CM.POP {ra, s0-s4}, 64
Encoding: rlist=9, spimm=2
expands to:
lw s4, 60(sp);
lw s3, 56(sp);
lw s2, 52(sp);
lw s1, 48(sp);
lw s0, 44(sp);
lw ra, 40(sp);
addi sp, sp, 64;
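The expansions listed in this section follow a regular pattern, which the Python sketch below reproduces for RV32I. The function names are hypothetical, and the mapping from rlist to registers assumes the rule that rlist values 4 to 14 select ra plus rlist-4 saved registers while rlist 15 selects ra and s0-s11.

```python
SAVED = ["ra", "s0", "s1", "s2", "s3", "s4", "s5",
         "s6", "s7", "s8", "s9", "s10", "s11"]


def reg_list(rlist: int) -> list[str]:
    """Registers selected by rlist (values 0 to 3 are reserved)."""
    if not 4 <= rlist <= 15:
        raise ValueError("rlist must be in 4..15")
    count = 13 if rlist == 15 else rlist - 3   # rlist=15 adds both s10 and s11
    return SAVED[:count]


def expand_rv32(op: str, rlist: int, stack_adj: int) -> list[str]:
    """Expand cm.push/cm.pop into the equivalent RV32I sequence."""
    regs = reg_list(rlist)[::-1]               # highest-numbered register first
    if op == "push":
        lines = [f"sw {r}, {-4 * (i + 1)}(sp)" for i, r in enumerate(regs)]
        lines.append(f"addi sp, sp, {-stack_adj}")
    elif op == "pop":
        lines = [f"lw {r}, {stack_adj - 4 * (i + 1)}(sp)" for i, r in enumerate(regs)]
        lines.append(f"addi sp, sp, {stack_adj}")
    else:
        raise ValueError("op must be 'push' or 'pop'")
    return lines


print("\n".join(expand_rv32("pop", 8, 48)))
```

Calling `expand_rv32("push", 7, 64)` yields the same stores and stack adjustment as the first example in this section.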
7.7.7 CM.PUSH
Synopsis Create stack frame: store ra and 0 to 12 saved registers to the stack frame, optionally allocate additional stack space.
Mnemonic CM.PUSH {reg_list}, -stack_adj
Encoding (RV32, RV64):
rlist values 0 to 3 are reserved for a future EABI variant called CM.PUSH.E
Assembly Syntax:
cm.push {reg_list}, -stack_adj
cm.push {xreg_list}, -stack_adj
The variables used in the assembly syntax are defined below.
RV32E:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32I, RV64:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
//note - to include s10, s11 must also be included
case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32E:
stack_adj_base = 16;
Valid values:
stack_adj = [16|32|48|64];
RV32I:
switch (rlist) {
case 4.. 7: stack_adj_base = 16;
case 8..11: stack_adj_base = 32;
case 12..14: stack_adj_base = 48;
case 15: stack_adj_base = 64;
}
Valid values:
switch (rlist) {
case 4.. 7: stack_adj = [16|32|48| 64];
case 8..11: stack_adj = [32|48|64| 80];
case 12..14: stack_adj = [48|64|80| 96];
case 15: stack_adj = [64|80|96|112];
}
RV64:
switch (rlist) {
case 4.. 5: stack_adj_base = 16;
case 6.. 7: stack_adj_base = 32;
case 8.. 9: stack_adj_base = 48;
case 10..11: stack_adj_base = 64;
case 12..13: stack_adj_base = 80;
case 14: stack_adj_base = 96;
case 15: stack_adj_base = 112;
}
Valid values:
switch (rlist) {
case 4.. 5: stack_adj = [ 16| 32| 48| 64];
case 6.. 7: stack_adj = [ 32| 48| 64| 80];
case 8.. 9: stack_adj = [ 48| 64| 80| 96];
case 10..11: stack_adj = [ 64| 80| 96|112];
case 12..13: stack_adj = [ 80| 96|112|128];
case 14: stack_adj = [ 96|112|128|144];
case 15: stack_adj = [112|128|144|160];
}
Description This instruction pushes (stores) the registers in reg_list to the memory below the stack pointer, and then creates the stack frame by decrementing the stack pointer by stack_adj, including any additional stack space requested by the value of spimm.
All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.
For further information see Zcmp.
Stack Adjustment Calculation:
stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
spimm is the number of additional 16-byte address increments allocated for the stack frame.
The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.
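The stack adjustment calculation can be expressed compactly: round the bytes needed for the register list up to a multiple of 16, then add spimm times 16. The Python sketch below covers RV32I and RV64; the function names are hypothetical, and spimm is assumed to be a 2-bit field since four valid values are listed for each case.

```python
def stack_adj_base(rlist: int, xlen: int = 32) -> int:
    """Minimum frame size, in bytes, covering the registers in the list."""
    if not 4 <= rlist <= 15:
        raise ValueError("rlist must be in 4..15")
    num_regs = 13 if rlist == 15 else rlist - 3   # rlist=15 is ra plus s0-s11
    needed = num_regs * (xlen // 8)
    return (needed + 15) // 16 * 16               # round up to a multiple of 16


def stack_adj(rlist: int, spimm: int, xlen: int = 32) -> int:
    """Total stack adjustment: stack_adj_base + spimm * 16."""
    if not 0 <= spimm <= 3:
        raise ValueError("spimm is assumed to be a 2-bit field")
    return stack_adj_base(rlist, xlen) + spimm * 16


print(stack_adj(7, 3))        # 64, as in cm.push {ra, s0-s2}, -64
print(stack_adj(15, 0, 64))   # 112, the RV64 base for the full register list
```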
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists
Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (XLEN==32) bytes=4; else bytes=8;
addr=sp-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
//if register i is in xreg_list
if (xreg_list[i]) {
switch(bytes) {
4: asm("sw x[i], 0(addr)");
8: asm("sd x[i], 0(addr)");
}
addr-=bytes;
}
}
The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
sp-=stack_adj;
7.7.8 CM.POP
Synopsis Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame.
Mnemonic CM.POP {reg_list}, stack_adj
Encoding (RV32, RV64):
rlist values 0 to 3 are reserved for a future EABI variant called CM.POP.E
Assembly Syntax:
cm.pop {reg_list}, stack_adj
cm.pop {xreg_list}, stack_adj
The variables used in the assembly syntax are defined below.
RV32E:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32I, RV64:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
//note - to include s10, s11 must also be included
case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32E:
stack_adj_base = 16;
Valid values:
stack_adj = [16|32|48|64];
RV32I:
switch (rlist) {
case 4.. 7: stack_adj_base = 16;
case 8..11: stack_adj_base = 32;
case 12..14: stack_adj_base = 48;
case 15: stack_adj_base = 64;
}
Valid values:
switch (rlist) {
case 4.. 7: stack_adj = [16|32|48| 64];
case 8..11: stack_adj = [32|48|64| 80];
case 12..14: stack_adj = [48|64|80| 96];
case 15: stack_adj = [64|80|96|112];
}
RV64:
switch (rlist) {
case 4.. 5: stack_adj_base = 16;
case 6.. 7: stack_adj_base = 32;
case 8.. 9: stack_adj_base = 48;
case 10..11: stack_adj_base = 64;
case 12..13: stack_adj_base = 80;
case 14: stack_adj_base = 96;
case 15: stack_adj_base = 112;
}
Valid values:
switch (rlist) {
case 4.. 5: stack_adj = [ 16| 32| 48| 64];
case 6.. 7: stack_adj = [ 32| 48| 64| 80];
case 8.. 9: stack_adj = [ 48| 64| 80| 96];
case 10..11: stack_adj = [ 64| 80| 96|112];
case 12..13: stack_adj = [ 80| 96|112|128];
case 14: stack_adj = [ 96|112|128|144];
case 15: stack_adj = [112|128|144|160];
}
Description This instruction pops (loads) the registers in reg_list from stack memory, and then adjusts the stack pointer by stack_adj.
All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.
For further information see Zcmp.
Stack Adjustment Calculation:
stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
spimm is the number of additional 16-byte address increments allocated for the stack frame.
The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.
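As a cross-check of the tables above, the calculation can be sketched in Python. The function names `stack_adj_base` and `stack_adj` are chosen here for illustration; the register count per rlist value follows the reg_list definitions (rlist 4 selects ra only, each further value adds one s register, and rlist 15 adds both s10 and s11).

```python
def stack_adj_base(rlist, xlen=32):
    """Bytes needed to cover the registers selected by rlist, rounded up to 16."""
    if not 4 <= rlist <= 15:
        raise ValueError("rlist values 0 to 3 are reserved")
    num_regs = rlist - 3 if rlist < 15 else 13
    reg_bytes = 4 if xlen == 32 else 8
    return (num_regs * reg_bytes + 15) // 16 * 16

def stack_adj(rlist, spimm, xlen=32):
    """Total stack adjustment: base plus spimm additional 16-byte increments."""
    if not 0 <= spimm <= 3:
        raise ValueError("spimm selects one of four valid values")
    return stack_adj_base(rlist, xlen) + spimm * 16

# Example: rlist=15 on RV64 with spimm=3 gives the largest listed value.
largest = stack_adj(15, 3, xlen=64)
```

The sketch does not enforce the RV32E restriction to rlist values 4 to 6.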
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists
Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (XLEN==32) bytes=4; else bytes=8;
addr=sp+stack_adj-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
//if register i is in xreg_list
if (xreg_list[i]) {
switch(bytes) {
4: asm("lw x[i], 0(addr)");
8: asm("ld x[i], 0(addr)");
}
addr-=bytes;
}
}
The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
sp+=stack_adj;
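The load sequence and stack pointer update can be modelled in Python as follows. This is a sketch, not the specification: `cm_pop`, the `regs` dictionary (with `regs[2]` as `sp`), and the `mem` dictionary are assumptions made for the example. A 32-byte frame is used to show that the saved registers are read from the top of the frame.

```python
# Register numbers in the order visited by the pseudocode loop.
POP_ORDER = [27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 9, 8, 1]

def cm_pop(regs, mem, xreg_list, stack_adj, xlen=32):
    """Load the listed registers from the top of the frame, then increment sp."""
    nbytes = 4 if xlen == 32 else 8
    addr = regs[2] + stack_adj - nbytes
    for i in POP_ORDER:
        if i in xreg_list:
            regs[i] = mem[addr]  # lw/ld x[i], 0(addr)
            addr -= nbytes
    # The stack pointer is only updated after every load has completed.
    regs[2] += stack_adj

# Hypothetical state: pop {ra, s0-s1} from a 32-byte frame on RV32.
regs = {1: 0, 2: 0xFE0, 8: 0, 9: 0}
mem = {0xFFC: 0x2222, 0xFF8: 0x1111, 0xFF4: 0xAAAA}
cm_pop(regs, mem, {1, 8, 9}, 32)
```

After the call, ra holds 0xAAAA, s0 holds 0x1111, s1 holds 0x2222, and `sp` is 0x1000.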
7.7.9 CM.POPRETZ
Synopsis Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame, move zero into a0, return to ra.
Mnemonic CM.POPRETZ {reg_list}, stack_adj
Encoding (RV32, RV64):
rlist values 0 to 3 are reserved for a future EABI variant called CM.POPRETZ.E
Assembly Syntax:
cm.popretz {reg_list}, stack_adj
cm.popretz {xreg_list}, stack_adj
The variables used in the assembly syntax are defined below.
RV32E:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32I, RV64:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
//note - to include s10, s11 must also be included
case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32E:
stack_adj_base = 16;
Valid values:
stack_adj = [16|32|48|64];
RV32I:
switch (rlist) {
case 4.. 7: stack_adj_base = 16;
case 8..11: stack_adj_base = 32;
case 12..14: stack_adj_base = 48;
case 15: stack_adj_base = 64;
}
Valid values:
switch (rlist) {
case 4.. 7: stack_adj = [16|32|48| 64];
case 8..11: stack_adj = [32|48|64| 80];
case 12..14: stack_adj = [48|64|80| 96];
case 15: stack_adj = [64|80|96|112];
}
RV64:
switch (rlist) {
case 4.. 5: stack_adj_base = 16;
case 6.. 7: stack_adj_base = 32;
case 8.. 9: stack_adj_base = 48;
case 10..11: stack_adj_base = 64;
case 12..13: stack_adj_base = 80;
case 14: stack_adj_base = 96;
case 15: stack_adj_base = 112;
}
Valid values:
switch (rlist) {
case 4.. 5: stack_adj = [ 16| 32| 48| 64];
case 6.. 7: stack_adj = [ 32| 48| 64| 80];
case 8.. 9: stack_adj = [ 48| 64| 80| 96];
case 10..11: stack_adj = [ 64| 80| 96|112];
case 12..13: stack_adj = [ 80| 96|112|128];
case 14: stack_adj = [ 96|112|128|144];
case 15: stack_adj = [112|128|144|160];
}
Description This instruction pops (loads) the registers in reg_list from stack memory, adjusts the stack pointer by stack_adj, moves zero into a0 and then returns to ra.
All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.
For further information see Zcmp.
Stack Adjustment Calculation:
stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
spimm is the number of additional 16-byte address increments allocated for the stack frame.
The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists
Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (XLEN==32) bytes=4; else bytes=8;
addr=sp+stack_adj-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
//if register i is in xreg_list
if (xreg_list[i]) {
switch(bytes) {
4: asm("lw x[i], 0(addr)");
8: asm("ld x[i], 0(addr)");
}
addr-=bytes;
}
}
The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.
The LI a0, 0 could be executed more than once, but is included in the atomic section for convenience.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
asm("li a0, 0");
sp+=stack_adj;
asm("ret");
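One way to read the pseudocode is as a sequence of ordinary instructions. The Python sketch below generates such a sequence as text; the helper name `expand_popretz` is hypothetical, and the output is only an illustration of the architectural effect, since it does not capture the atomicity and restart behaviour described above.

```python
# Register numbers in the order visited by the pseudocode loop.
ORDER = [27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 9, 8, 1]

def expand_popretz(xreg_list, stack_adj, xlen=32):
    """Return an illustrative instruction sequence with the same effect."""
    nbytes = 4 if xlen == 32 else 8
    load = "lw" if xlen == 32 else "ld"
    offset = stack_adj - nbytes
    seq = []
    for i in ORDER:
        if i in xreg_list:
            seq.append(f"{load} x{i}, {offset}(sp)")
            offset -= nbytes
    # Final section: zero a0, deallocate the frame, return to ra.
    seq += ["li a0, 0", f"addi sp, sp, {stack_adj}", "ret"]
    return seq

# Example: cm.popretz {ra, s0-s1}, 16 on RV32.
sequence = expand_popretz({1, 8, 9}, 16)
```

For this example the sequence is `lw x9, 12(sp)`, `lw x8, 8(sp)`, `lw x1, 4(sp)`, `li a0, 0`, `addi sp, sp, 16`, `ret`.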
7.7.10 CM.POPRET
Synopsis Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame, return to ra.
Mnemonic CM.POPRET {reg_list}, stack_adj
Encoding (RV32, RV64):
rlist values 0 to 3 are reserved for a future EABI variant called CM.POPRET.E
Assembly Syntax:
cm.popret {reg_list}, stack_adj
cm.popret {xreg_list}, stack_adj
The variables used in the assembly syntax are defined below.
RV32E:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32I, RV64:
switch (rlist){
case 4: {reg_list="ra"; xreg_list="x1";}
case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
//note - to include s10, s11 must also be included
case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
default: reserved();
}
stack_adj = stack_adj_base + spimm * 16;
RV32E:
stack_adj_base = 16;
Valid values:
stack_adj = [16|32|48|64];
RV32I:
switch (rlist) {
case 4.. 7: stack_adj_base = 16;
case 8..11: stack_adj_base = 32;
case 12..14: stack_adj_base = 48;
case 15: stack_adj_base = 64;
}
Valid values:
switch (rlist) {
case 4.. 7: stack_adj = [16|32|48| 64];
case 8..11: stack_adj = [32|48|64| 80];
case 12..14: stack_adj = [48|64|80| 96];
case 15: stack_adj = [64|80|96|112];
}
RV64:
switch (rlist) {
case 4.. 5: stack_adj_base = 16;
case 6.. 7: stack_adj_base = 32;
case 8.. 9: stack_adj_base = 48;
case 10..11: stack_adj_base = 64;
case 12..13: stack_adj_base = 80;
case 14: stack_adj_base = 96;
case 15: stack_adj_base = 112;
}
Valid values:
switch (rlist) {
case 4.. 5: stack_adj = [ 16| 32| 48| 64];
case 6.. 7: stack_adj = [ 32| 48| 64| 80];
case 8.. 9: stack_adj = [ 48| 64| 80| 96];
case 10..11: stack_adj = [ 64| 80| 96|112];
case 12..13: stack_adj = [ 80| 96|112|128];
case 14: stack_adj = [ 96|112|128|144];
case 15: stack_adj = [112|128|144|160];
}
Description This instruction pops (loads) the registers in reg_list from stack memory, adjusts the stack pointer by stack_adj and then returns to ra.
All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.
For further information see Zcmp.
Stack Adjustment Calculation:
stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
spimm is the number of additional 16-byte address increments allocated for the stack frame.
The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists
Operation The first section of pseudocode may be executed multiple times before the instruction successfully completes.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (XLEN==32) bytes=4; else bytes=8;
addr=sp+stack_adj-bytes;
for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
//if register i is in xreg_list
if (xreg_list[i]) {
switch(bytes) {
4: asm("lw x[i], 0(addr)");
8: asm("ld x[i], 0(addr)");
}
addr-=bytes;
}
}
The final section of pseudocode executes atomically, and only executes if the section above completes without any exceptions or interrupts.
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
sp+=stack_adj;
asm("ret");
7.7.11 CM.MVSA01
Synopsis Move a0-a1 into two registers of s0-s7
Mnemonic CM.MVSA01 r1s', r2s'
Encoding (RV32, RV64):
For the encoding to be legal r1s' != r2s'.
Assembly Syntax:
cm.mvsa01 r1s', r2s'
Description This instruction moves a0 into r1s' and a1 into r2s'. r1s' and r2s' must be different. The execution is atomic, so it is not possible to observe state where only one of r1s' or r2s' has been updated.
The encoding uses sreg number specifiers instead of xreg number specifiers to save encoding space. The mapping between them is specified in the pseudocode below.
The s register mapping is taken from the UABI, and may not match the currently unratified EABI. CM.MVSA01.E may be included in the future.
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists.
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (RV32E && (r1sc>1 || r2sc>1)) {
reserved();
}
xreg1 = {r1sc[2:1]>0,r1sc[2:1]==0,r1sc[2:0]};
xreg2 = {r2sc[2:1]>0,r2sc[2:1]==0,r2sc[2:0]};
X[xreg1] = X[10];
X[xreg2] = X[11];
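The bit concatenation in the pseudocode maps the 3-bit sreg specifiers 0 to 7 onto x8, x9, and x18 to x23. A minimal Python sketch of that mapping and of the move itself follows; the names `sreg_to_xreg` and `cm_mvsa01` and the `regs` dictionary are assumptions made for the example, and the RV32E restriction is not modelled.

```python
def sreg_to_xreg(rsc):
    """Map a 3-bit sreg specifier to its x-register number."""
    if not 0 <= rsc <= 7:
        raise ValueError("sreg specifier is a 3-bit field")
    upper = rsc >> 1                 # rsc[2:1]
    bit4 = 1 if upper > 0 else 0     # set for s2-s7 (x18-x23)
    bit3 = 1 if upper == 0 else 0    # set for s0-s1 (x8-x9)
    return (bit4 << 4) | (bit3 << 3) | rsc

def cm_mvsa01(regs, r1sc, r2sc):
    """Copy a0 and a1 into the two selected s registers in one step."""
    if r1sc == r2sc:
        raise ValueError("r1s' and r2s' must be different")
    xreg1, xreg2 = sreg_to_xreg(r1sc), sreg_to_xreg(r2sc)
    regs[xreg1], regs[xreg2] = regs[10], regs[11]

mapping = [sreg_to_xreg(n) for n in range(8)]

# Hypothetical state: move a0 into s0 and a1 into s2.
regs = {10: 5, 11: 7, 8: 0, 18: 0}
cm_mvsa01(regs, 0, 2)
```

The resulting `mapping` is 8, 9, 18, 19, 20, 21, 22, 23, i.e. s0 through s7.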
7.7.12 CM.MVA01S
Synopsis Move two s0-s7 registers into a0-a1
Mnemonic CM.MVA01S r1s', r2s'
Encoding (RV32, RV64):
Assembly Syntax:
cm.mva01s r1s', r2s'
Description This instruction moves r1s' into a0 and r2s' into a1. The execution is atomic, so it is not possible to observe state where only one of a0 or a1 has been updated.
The encoding uses sreg number specifiers instead of xreg number specifiers to save encoding space. The mapping between them is specified in the pseudocode below.
The s register mapping is taken from the UABI, and may not match the currently unratified EABI. CM.MVA01S.E may be included in the future.
Prerequisites None
32-bit equivalent:
No direct equivalent encoding exists.
Operation
//This is not SAIL, it's pseudocode. The SAIL hasn't been written yet.
if (RV32E && (r1sc>1 || r2sc>1)) {
reserved();
}
xreg1 = {r1sc[2:1]>0,r1sc[2:1]==0,r1sc[2:0]};
xreg2 = {r2sc[2:1]>0,r2sc[2:1]==0,r2sc[2:0]};
X[10] = X[xreg1];
X[11] = X[xreg2];
7.8 Zce Extension for Enhanced Instruction Compression
This section describes the Zce extension, which incorporates the
compressed instruction-set extensions designed for microcontrollers.
Unlike the C extension, the Zce extension includes extensions
that trade performance for code density.
The Zce extension depends upon the Zca, Zcb,
Zcmp, and Zcmt extensions.
If XLEN=32 and the F extension is present, the Zce extension
additionally depends upon the Zcf extension.
7.9 Zclsd Extension for Compressed Load/Store Pair Instructions
The Zclsd extension provides compressed load/store pair instructions for
RV32, reusing the existing RV64 doubleword load/store instruction encodings.
Zclsd depends on Zilsd and Zca. It has overlapping encodings with Zcf and is thus incompatible with Zcf.
7.9.1 Use of x0 as operand
For C.LDSP, usage of x0 as the destination is reserved.
If x0 is used as the source of C.SDSP, the entire 64-bit operand is zero, i.e., register x1 is not accessed.
C.LD and C.SD instructions can only use x8-x15.
7.9.2 Exception Handling
For the purposes of RVWMO and exception handling, C.LD, C.LDSP, C.SD, and C.SDSP instructions are considered to be misaligned loads and stores, with one additional constraint: a C.LD, C.LDSP, C.SD, or C.SDSP instruction whose effective address is a multiple of 4 gives rise to two 4-byte memory operations.
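The 4-byte-aligned case can be illustrated with a small Python model of C.SDSP. Several details here are assumptions rather than statements from this section: the name `c_sdsp`, the word-granular `mem` dictionary, a little-endian layout with the lower-numbered register holding the low-order word, the rejection of odd-numbered first registers (inferred from "even/odd register pair"), and a byte `offset` that has already been scaled. Addresses that are not a multiple of 4 are not modelled.

```python
def c_sdsp(regs, mem, rs2, offset):
    """Model C.SDSP on RV32 as two 4-byte stores (assumed little-endian)."""
    if rs2 % 2:
        raise ValueError("expected an even-numbered first register")
    addr = regs[2] + offset
    if addr % 4:
        raise NotImplementedError("general misaligned case is not modelled")
    # With x0 as the source the whole 64-bit operand is zero; x1 is not read.
    low, high = (0, 0) if rs2 == 0 else (regs[rs2], regs[rs2 + 1])
    ops = [(addr, low), (addr + 4, high)]
    for word_addr, value in ops:
        mem[word_addr] = value
    return ops

# Hypothetical state: store the pair x10/x11, then store zero via x0.
regs = {1: 0xDEAD, 2: 0x1000, 10: 1, 11: 2}
mem = {}
pair_ops = c_sdsp(regs, mem, 10, 8)
zero_ops = c_sdsp(regs, mem, 0, 16)
```

Each call produces exactly two 4-byte operations, at the effective address and at the effective address plus 4.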
Zclsd adds the following RV32-only instructions:
| RV32 | RV64 | Mnemonic | Instruction |
|---|---|---|---|
| yes | no | C.LDSP rd, offset(sp) | Stack-pointer based load doubleword to register pair, 16-bit encoding |
| yes | no | C.SDSP rs2, offset(sp) | Stack-pointer based store doubleword from register pair, 16-bit encoding |
| yes | no | C.LD rd', offset(rs1') | Load doubleword to register pair, 16-bit encoding |
| yes | no | C.SD rs2', offset(rs1') | Store doubleword from register pair, 16-bit encoding |
7.9.2.1 C.LDSP
Synopsis Stack-pointer based load doubleword to even/odd register pair, 16-bit encoding
Mnemonic C.LDSP rd, offset(sp)
Encoding (RV32)
Description
Loads a stack-pointer relative 64-bit value into registers rd and rd+1. It computes its effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to LD rd, offset(x2). C.LDSP is only valid when rd≠x0; the code points with rd=x0 are reserved.
Included in: Zclsd
7.9.2.2 C.SDSP
Synopsis Stack-pointer based store doubleword from even/odd register pair, 16-bit encoding
Mnemonic C.SDSP rs2, offset(sp)
Encoding (RV32)
Description
Stores a stack-pointer relative 64-bit value from registers rs2 and rs2+1. It computes an effective address by adding the zero-extended offset, scaled by 8, to the stack pointer, x2. It expands to SD rs2, offset(x2).
Included in: Zclsd
7.9.2.3 C.LD
Synopsis Load doubleword to even/odd register pair, 16-bit encoding
Mnemonic C.LD rd', offset(rs1')
Encoding (RV32)
Description
Loads a 64-bit value into registers rd' and rd'+1.
It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'.
It expands to LD rd', offset(rs1').
Included in: Zclsd
7.9.2.4 C.SD
Synopsis Store doubleword from even/odd register pair, 16-bit encoding
Mnemonic C.SD rs2', offset(rs1')
Encoding (RV32)
Description
Stores a 64-bit value from registers rs2' and rs2'+1.
It computes an effective address by adding the zero-extended offset, scaled by 8, to the base address in register rs1'.
It expands to SD rs2', offset(rs1').
Included in: Zclsd
7.10 Zcmop Extension for Compressed May-Be-Operations
This section defines the Zcmop extension, which adds eight 16-bit MOP
instructions named C.MOP.N, where N is an odd integer between 1 and
15, inclusive. C.MOP.N is encoded in the reserved encoding space
corresponding to C.LUI xN, 0, as shown in Table 34.
Unlike the MOPs defined in the Zimop extension, the C.MOP.N instructions
are defined to not write any register.
Their encoding allows future extensions to define them to read register
xN.
The Zcmop extension depends upon the Zca extension.
Very few suitable 16-bit encoding spaces exist. This space was chosen
because it already has unusual behavior with respect to the rd/rs1
field—it encodes C.ADDI16SP when the field contains x2—and is
therefore of lower value for most purposes.
Table 34. C.MOP.N instruction encoding.
| Mnemonic | Encoding | Redefinable to read register |
|---|---|---|
| C.MOP.1 | 0110000010000001 | x1 |
| C.MOP.3 | 0110000110000001 | x3 |
| C.MOP.5 | 0110001010000001 | x5 |
| C.MOP.7 | 0110001110000001 | x7 |
| C.MOP.9 | 0110010010000001 | x9 |
| C.MOP.11 | 0110010110000001 | x11 |
| C.MOP.13 | 0110011010000001 | x13 |
| C.MOP.15 | 0110011110000001 | x15 |
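The encodings in Table 34 can be reproduced from the C.LUI layout with a zero immediate. The Python sketch below assumes the standard CI-format field positions (funct3 in bits 15:13, rd in bits 11:7, opcode in bits 1:0); the function name `c_mop_encoding` is chosen for illustration.

```python
def c_mop_encoding(n):
    """Return the 16-bit encoding of C.MOP.N as an integer."""
    if n not in range(1, 16, 2):
        raise ValueError("N must be an odd integer between 1 and 15")
    # funct3=011 | imm[17]=0 | rd=xN | imm[16:12]=00000 | op=01
    return (0b011 << 13) | (n << 7) | 0b01

# Binary strings for all eight instructions, in table order.
encodings = {n: format(c_mop_encoding(n), "016b") for n in range(1, 16, 2)}
```

For example, `encodings[1]` is `0110000010000001` and `encodings[15]` is `0110011110000001`, matching the first and last rows of the table.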
The recommended assembly syntax for C.MOP.N is simply the nullary
C.MOP.N. The possibly accessed register is implicitly xN.
The expectation is that each Zcmop instruction is equivalent to some
Zimop instruction, but the choice of expansion (if any) is left to the
extension that redefines the MOP.
Note that a Zcmop instruction that does not write a value can expand into a
write to x0.