8 Scalar Floating-Point Extensions

note

This chapter is currently being restructured. Its contents are normative, but the presentation might appear disjoint.

This chapter describes the scalar floating-point extensions. The F extension adds floating-point registers and instructions for computation on single-precision floating-point values. The D and Q extensions widen those registers to hold double- and quad-precision floating-point values, respectively, and add instructions for computation on those formats. Several additional extensions with the Zf and Zd prefixes provide additional computational instructions.

The Zfinx and Zdinx extensions add computational instructions analogous to those in the F and D extensions, but they instead operate on floating-point numbers in the x registers. These extensions, intended for lower-cost systems, are incompatible with the F and D extensions.

8.1 "F" Extension for Single-Precision Floating-Point, Version 2.2

This chapter describes the F standard extension for single-precision floating-point, which adds computational instructions compliant with the IEEE 754-2008 arithmetic standard's binary32 format and operations. The F extension depends on the "Zicsr" extension for control and status register access.

8.1.1 F Register State

The F extension adds 32 floating-point registers, f0-f31, each 32 bits wide, and a floating-point control and status register fcsr, which contains the operating mode and exception status of the floating-point unit. This additional state is shown in fprs. We use the term FLEN to describe the width of the floating-point registers in the RISC-V ISA, and FLEN=32 for the F single-precision floating-point extension. Most floating-point instructions operate on values in the floating-point register file. Floating-point load and store instructions transfer floating-point values between registers and memory. Instructions to transfer values to and from the integer register file are also provided.

note

We considered a unified register file for both integer and floating-point values as this simplifies software register allocation and calling conventions, and reduces total user state. However, a split organization increases the total number of registers accessible with a given instruction width, simplifies provision of enough register file ports for wide superscalar issue, supports decoupled floating-point-unit architectures, and simplifies use of internal floating-point encoding techniques. Compiler support and calling conventions for split register file architectures are well understood, and using dirty bits on floating-point register file state can reduce context-switch overhead.

FLEN-1		0
f0
f1
f2
f3
f4
f5
f6
f7
f8
f9
f10
f11
f12
f13
f14
f15
f16
f17
f18
f19
f20
f21
f22
f23
f24
f25
f26
f27
f28
f29
f30
f31
FLEN
31		0
fcsr
32

8.1.2 Floating-Point Control and Status Register

The floating-point control and status register, fcsr, is a RISC-V control and status register (CSR). It is a 32-bit read/write register that selects the dynamic rounding mode for floating-point arithmetic operations and holds the accrued exception flags, as shown in fcsr.

0e3ae3f2e775127c3d8949256becd44a

The fcsr register can be read and written with the FRCSR and FSCSR instructions, which are assembler pseudoinstructions built on the underlying CSR access instructions. FRCSR reads fcsr by copying it into integer register rd. FSCSR swaps the value in fcsr by copying the original value into integer register rd, and then writing a new value obtained from integer register rs1 into fcsr.

The fields within the fcsr can also be accessed individually through different CSR addresses, and separate assembler pseudoinstructions are defined for these accesses. The FRRM instruction reads the Rounding Mode field frm (fcsr bits 7—5) and copies it into the least-significant three bits of integer register rd, with zero in all other bits. FSRM swaps the value in frm by copying the original value into integer register rd, and then writing a new value obtained from the three least-significant bits of integer register rs1 into frm. FRFLAGS and FSFLAGS are defined analogously for the Accrued Exception Flags field fflags (fcsr bits 4—0).

Bits 31—8 of the fcsr are reserved for other standard extensions. If these extensions are not present, implementations shall ignore writes to these bits and supply a zero value when read. Standard software should preserve the contents of these bits.

Floating-point operations use either a static rounding mode encoded in the instruction, or a dynamic rounding mode held in frm. Rounding modes are encoded as shown in norm:dyn_round_enc. A value of 111 in the instruction’s rm field selects the dynamic rounding mode held in frm. The behavior of floating-point instructions that depend on rounding mode when executed with a reserved rounding mode is reserved, including both static reserved rounding modes (101-110) and dynamic reserved rounding modes (101-111). Some instructions, including widening conversions, have the rm field but are nevertheless mathematically unaffected by the rounding mode; software should set their rm field to RNE (000) but implementations must treat the rm field as usual (in particular, with regard to decoding legal vs. reserved encodings).

Rounding Mode	Mnemonic	Meaning
000	RNE	Round to Nearest, ties to Even
001	RTZ	Round towards Zero
010	RDN	Round Down (towards −∞)
011	RUP	Round Up (towards +∞)
100	RMM	Round to Nearest, ties to Max Magnitude
101		Reserved for future use.
110		Reserved for future use.
111	DYN	In instruction’s rm field, selects dynamic rounding mode; In Rounding Mode register, reserved.

note

The C99 language standard effectively mandates the provision of a dynamic rounding mode register. In typical implementations, writes to the dynamic rounding mode CSR state will serialize the pipeline. Static rounding modes are used to implement specialized arithmetic operations that often have to switch frequently between different rounding modes.

The ratified version of the F spec mandated that an illegal-instruction exception was raised when an instruction was executed with a reserved dynamic rounding mode. This has been weakened to reserved, which matches the behavior of static rounding-mode instructions. Raising an illegal-instruction exception is still valid behavior when encountering a reserved encoding, so implementations compatible with the ratified spec are compatible with the weakened spec.

The accrued exception flags indicate the exception conditions that have arisen on any floating-point arithmetic instruction since the field was last reset by software, as shown in bitdef. The base RISC-V ISA does not support generating a trap on the setting of a floating-point exception flag.

Flag Mnemonic	Flag Meaning
NV	Invalid Operation
DZ	Divide by Zero
OF	Overflow
UF	Underflow
NX	Inexact

note

As allowed by IEEE 754-2008, we do not support traps on floating-point exceptions in the F extension, but instead require explicit checks of the flags in software. We considered adding branches controlled directly by the contents of the floating-point accrued exception flags, but ultimately chose to omit these instructions to keep the ISA simple.

8.1.3 NaN Generation and Propagation

Except when otherwise stated, if the result of a floating-point operation is NaN, it is the canonical NaN. The canonical NaN has a positive sign and all significand bits clear except the MSB, a.k.a. the quiet bit. For single-precision floating-point, this corresponds to the pattern 0x7fc00000.

note

We considered propagating NaN payloads, as is recommended by IEEE 754-2008, but this decision would have increased hardware cost. Moreover, since this feature is optional in IEEE 754-2008, it cannot be used in portable code.

Implementers are free to provide a NaN payload propagation scheme as a nonstandard extension enabled by a nonstandard operating mode. However, the canonical NaN scheme described above must always be supported and should be the default mode.

note

We require implementations to return the standard-mandated default values in the case of exceptional conditions, without any further intervention on the part of user-level software (unlike the Alpha ISA floating-point trap barriers). We believe full hardware handling of exceptional cases will become more common, and so wish to avoid complicating the user-level ISA to optimize other approaches. Implementations can always trap to machine-mode software handlers to provide exceptional default values.

8.1.4 Subnormal Arithmetic

Operations on subnormal numbers are handled in accordance with IEEE 754-2008.

In the parlance of IEEE 754-2008, tininess is detected after rounding.

note

Detecting tininess after rounding results in fewer spurious underflow signals.

8.1.5 Single-Precision Load and Store Instructions

Floating-point loads and stores use the same base+offset addressing mode as the integer base ISAs, with a base address in register rs1 and a 12-bit signed byte offset. The FLW instruction loads a single-precision floating-point value from memory into floating-point register rd. FSW stores a single-precision value from floating-point register rs2 to memory.

4ea02f78a5dd4d389bad21ded30455ef

06b72a92b6663a55b0929ef99ac0e612

FLW and FSW are only guaranteed to execute atomically if the effective address is naturally aligned.

FLW and FSW do not modify the bits being transferred; in particular, the payloads of non-canonical NaNs are preserved.

As described in ldst, the execution environment defines whether misaligned floating-point loads and stores are handled invisibly or raise a contained or fatal trap.

8.1.6 Single-Precision Floating-Point Computational Instructions

Floating-point arithmetic instructions with one or two source operands use the R-type format with the OP-FP major opcode. FADD.S and FMUL.S perform single-precision floating-point addition and multiplication respectively, between rs1 and rs2. FSUB.S performs the single-precision floating-point subtraction of rs2 from rs1. FDIV.S performs the single-precision floating-point division of rs1 by rs2. FSQRT.S computes the square root of rs1. In each case, the result is written to rd.

The 2-bit floating-point format field fmt is encoded as shown in fmt. It is set to S (00) for all instructions in the F extension.

fmt field	Mnemonic	Meaning
00	S	32-bit single-precision
01	D	64-bit double-precision
10	H	16-bit half-precision
11	Q	128-bit quad-precision

All floating-point operations that perform rounding can select the rounding mode using the rm field with the encoding shown in norm:dyn_round_enc.

Floating-point minimum-number and maximum-number instructions FMIN.S and FMAX.S write, respectively, the smaller or larger of rs1 and rs2 to rd. For the purposes of these instructions only, the value −0.0 is considered to be less than the value +0.0. If both inputs are NaNs, the result is the canonical NaN. If only one operand is a NaN, the result is the non-NaN operand. Signaling NaN inputs set the invalid operation exception flag, even when the result is not NaN.

note

Note that in version 2.2 of the F extension, the FMIN.S and FMAX.S instructions were amended to implement the IEEE 754-2019 [21] minimumNumber and maximumNumber operations, rather than the IEEE 754-2008 [4] minNum and maxNum operations. These operations differ in their handling of signaling NaNs.

495cd3aea0c01ae3b314bd17535e4653

Floating-point fused multiply-add instructions require a new standard instruction format. R4-type instructions specify three source registers (rs1, rs2, and rs3) and a destination register (rd). This format is only used by the floating-point fused multiply-add instructions.

FMADD.S multiplies the values in rs1 and rs2, adds the value in rs3, and writes the final result to rd. FMADD.S computes (rs1×rs2)+rs3.

FMSUB.S multiplies the values in rs1 and rs2, subtracts the value in rs3, and writes the final result to rd. FMSUB.S computes (rs1×rs2)−rs3.

FNMSUB.S multiplies the values in rs1 and rs2, negates the product, adds the value in rs3, and writes the final result to rd. FNMSUB.S computes −(rs1×rs2)+rs3.

FNMADD.S multiplies the values in rs1 and rs2, negates the product, subtracts the value in rs3, and writes the final result to rd. FNMADD.S computes −(rs1×rs2)−rs3.

note

The FNMSUB and FNMADD instructions are counterintuitively named, owing to the naming of the corresponding instructions in MIPS-IV. The MIPS instructions were defined to negate the sum, rather than negating the product as the RISC-V instructions do, so the naming scheme was more rational at the time. The two definitions differ with respect to signed-zero results. The RISC-V definition matches the behavior of the x86 and ARM fused multiply-add instructions, but unfortunately the RISC-V FNMSUB and FNMADD instruction names are swapped as compared to x86, whereas the RISC-V FMSUB and FNMSUB instruction names are swapped as compared to ARM.

4317d06f9c2716ced103e10e0e694d34

note

The fused multiply-add (FMA) instructions consume a large part of the 32-bit instruction encoding space. Some alternatives considered were to restrict FMA to only use dynamic rounding modes, but static rounding modes are useful in code that exploits the lack of product rounding. Another alternative would have been to use rd to provide rs3, but this would require additional move instructions in some common sequences. The current design still leaves a large portion of the 32-bit encoding space open while avoiding having FMA be non-orthogonal.

The fused multiply-add instructions must set the invalid operation exception flag when the multiplicands are ∞ and zero, even when the addend is a quiet NaN.

note

IEEE 754-2008 permits, but does not require, raising the invalid exception for the operation ∞×0 + qNaN.

8.1.7 Single-Precision Floating-Point Conversion and Move Instructions

Floating-point-to-integer and integer-to-floating-point conversion instructions are encoded in the OP-FP major opcode space. FCVT.W.S or FCVT.L.S converts a floating-point number in floating-point register rs1 to a signed 32-bit or 64-bit integer, respectively, in integer register rd. FCVT.S.W or FCVT.S.L converts a 32-bit or 64-bit signed integer, respectively, in integer register rs1 into a floating-point number in floating-point register rd. FCVT.WU.S, FCVT.LU.S, FCVT.S.WU, and FCVT.S.LU variants convert to or from unsigned integer values. For XLEN>32, FCVT.W[U].S sign-extends the 32-bit result to the destination register width. FCVT.L[U].S and FCVT.S.L[U] are RV64-only instructions. If the rounded result is not representable in the destination format, it is clipped to the nearest value and the invalid flag is set. int_conv gives the range of valid inputs for FCVT.int.S and the behavior for invalid inputs.

All floating-point to integer and integer to floating-point conversion instructions round according to the rm field. A floating-point register can be initialized to floating-point positive zero using FCVT.S.W rd, x0, which will never set any exception flags.

	FCVT.W.S	FCVT.WU.S	FCVT.L.S	FCVT.LU.S
Minimum valid input (after rounding)	−2³¹	0	−2⁶³	0
Maximum valid input (after rounding)	2³¹−1	2³²−1	2⁶³−1	2⁶⁴−1
Output for out-of-range negative input	−2³¹	0	−2⁶³	0
Output for -∞	−2³¹	0	−2⁶³	0
Output for out-of-range positive input	2³¹−1	2³²−1	2⁶³−1	2⁶⁴−1
Output for +∞ or NaN	2³¹−1	2³²−1	2⁶³−1	2⁶⁴−1

All floating-point conversion instructions set the Inexact exception flag if the rounded result differs from the operand value and the Invalid exception flag is not set.

374c6bdaa68ca2e9ecb167a1e1c15335

Floating-point to floating-point sign-injection instructions, FSGNJ.S, FSGNJN.S, and FSGNJX.S, produce a result that takes all bits except the sign bit from rs1. For FSGNJ, the result’s sign bit is rs2's sign bit; for FSGNJN, the result’s sign bit is the opposite of rs2's sign bit; and for FSGNJX, the sign bit is the XOR of the sign bits of rs1 and rs2. Sign-injection instructions do not set floating-point exception flags, nor do they canonicalize NaNs. Note, FSGNJ.S rx, ry, ry moves ry to rx (assembler pseudoinstruction FMV.S rx, ry); FSGNJN.S rx, ry, ry moves the negation of ry to rx (assembler pseudoinstruction FNEG.S rx, ry); and FSGNJX.S rx, ry, ry moves the absolute value of ry to rx (assembler pseudoinstruction FABS.S rx, ry).

cdd324f73dbc2fad22f6c9076d8c3fa1

note

The sign-injection instructions provide floating-point MV, ABS, and NEG, as well as supporting a few other operations, including the IEEE 754-2008 copySign operation and sign manipulation in transcendental math function libraries. Although MV, ABS, and NEG only need a single register operand, whereas FSGNJ instructions need two, it is unlikely most microarchitectures would add optimizations to benefit from the reduced number of register reads for these relatively infrequent instructions. Even in this case, a microarchitecture can simply detect when both source registers are the same for FSGNJ instructions and only read a single copy.

Instructions are provided to move bit patterns between the floating-point and integer registers. FMV.X.W moves the single-precision value in floating-point register rs1 represented in the IEEE 754-2008 encoding to the lower 32 bits of integer register rd. The bits are not modified in the transfer, and in particular, the payloads of non-canonical NaNs are preserved. For RV64, the higher 32 bits of the destination register are filled with copies of the floating-point number’s sign bit.

FMV.W.X moves the single-precision value encoded in the IEEE 754-2008 encoding from the lower 32 bits of integer register rs1 to the floating-point register rd. The bits are not modified in the transfer, and in particular, the payloads of non-canonical NaNs are preserved.

note

The FMV.W.X and FMV.X.W instructions were previously called FMV.S.X and FMV.X.S. The use of W is more consistent with their semantics as an instruction that moves 32 bits without interpreting them. This became clearer after defining NaN-boxing. To avoid disturbing existing code, both the W and S versions will be supported by tools.

3187f42b06bfed50811f0220720278d8

note

The base floating-point ISA was defined so as to allow implementations to employ an internal recoding of the floating-point format in registers to simplify handling of subnormal values and possibly to reduce functional unit latency. To this end, the F extension avoids representing integer values in the floating-point registers by defining conversion and comparison operations that read and write the integer register file directly. This also removes many of the common cases where explicit moves between integer and floating-point registers are required, reducing instruction count and critical paths for common mixed-format code sequences.

8.1.8 Single-Precision Floating-Point Compare Instructions

Floating-point compare instructions (FEQ.S, FLT.S, FLE.S) perform the specified comparison between floating-point registers (rs1 = rs2, rs1 < rs2, rs1 ≤ rs2) writing 1 to the integer register rd if the condition holds, and 0 otherwise.

FLT.S and FLE.S perform what IEEE 754-2008 refers to as signaling comparisons: that is, they set the invalid operation exception flag if either input is NaN. FEQ.S performs a quiet comparison: it only sets the invalid operation exception flag if either input is a signaling NaN. For all three instructions, the result is 0 if either operand is NaN.

ad74a55766588031071812a9b975c2d9

note

The F extension provides a ≤ comparison, whereas the base ISAs provide a ≥ branch comparison. Because ≤ can be synthesized from ≥ and vice-versa, there is no performance implication to this inconsistency, but it is nevertheless an unfortunate incongruity in the ISA.

8.1.9 Single-Precision Floating-Point Classify Instruction

The FCLASS.S instruction examines the value in floating-point register rs1 and writes to integer register rd a 10-bit mask that indicates the class of the floating-point number. The format of the mask is described in fclass. The corresponding bit in rd will be set if the property is true and clear otherwise. All other bits in rd are cleared. Note that exactly one bit in rd will be set. FCLASS.S does not set the floating-point exception flags.

243616d8e9ac0b353a5c4a5ed7149725

rd bit	Meaning
0	rs1 is −∞.
1	rs1 is a negative normal number.
2	rs1 is a negative subnormal number.
3	rs1 is −0.
4	rs1 is +0.
5	rs1 is a positive subnormal number.
6	rs1 is a positive normal number.
7	rs1 is +∞.
8	rs1 is a signaling NaN.
9	rs1 is a quiet NaN.

8.2 "D" Extension for Double-Precision Floating-Point, Version 2.2

This chapter describes the D standard extension for double-precision floating-point, which adds computational instructions compliant with the IEEE 754-2008 arithmetic standard’s binary64 format and operations. The D extension depends on the F extension.

8.2.1 D Register State

The D extension widens the 32 floating-point registers, f0-f31, to 64 bits (FLEN=64 in fprs). The f registers can now hold either 32-bit or 64-bit floating-point values as described below in nanboxing.

note

FLEN can be 32, 64, or 128 depending on which of the F, D, and Q extensions are supported. There can be up to four different floating-point precisions supported, including H, F, D, and Q.

8.2.2 NaN Boxing of Narrower Values

When multiple floating-point precisions are supported, then valid values of narrower n-bit types, n<FLEN, are represented in the lower n bits of an FLEN-bit NaN value, in a process termed NaN-boxing. The upper bits of a valid NaN-boxed value must be all 1s. Valid NaN-boxed n-bit values therefore appear as negative quiet NaNs (qNaNs) when viewed as any wider m-bit value, n < m ≤ FLEN. Any operation that writes a narrower result to an 'f' register must write all 1s to the uppermost FLEN-n bits to yield a legal NaN-boxedvalue.

note

Software might not know the current type of data stored in a floating-point register but has to be able to save and restore the register values, hence the result of using wider operations to transfer narrower values has to be defined. A common case is for callee-saved registers, but a standard convention is also desirable for features including variadic functions, user-level threading libraries, virtual machine migration, and debugging.

Floating-point n-bit transfer operations move external values held in the IEEE 754-2008 formats into and out of the f registers, and comprise floating-point loads and stores (FL_n_/FS_n_) and floating-point move instructions (FMV.n.X/FMV.X.n). A narrower n-bit transfer, n<FLEN, into the f registers will create a valid NaN-boxed value. A narrower n-bit transfer out of the floating-point registers will transfer the lower n bits of the register ignoring the upper FLEN-n bits.

Apart from transfer operations described in the previous paragraph, all other floating-point operations on narrower n-bit operations, n<FLEN, check if the input operands are correctly NaN-boxed, i.e., all upper FLEN-n bits are 1. If so, the n least-significant bits of the input are used as the input value, otherwise the input value is treated as an n-bit canonical NaN.

note

Earlier versions of this extension did not define the behavior of feeding the results of narrower or wider operands into an operation, except to require that wider saves and restores would preserve the value of a narrower operand. The new definition removes this implementation-specific behavior, while still accommodating both non-recoded and recoded implementations of the floating-point unit. The new definition also helps catch software errors by propagating NaNs if values are used incorrectly.

Non-recoded implementations unpack and pack the operands to the IEEE 754-2008 format on the input and output of every floating-point operation. The NaN-boxing cost to a non-recoded implementation is primarily in checking if the upper bits of a narrower operation represent a legal NaN-boxed value, and in writing all 1s to the upper bits of a result.

Recoded implementations use a more convenient internal format to represent floating-point values, with an added exponent bit to allow all values to be held normalized. The cost to the recoded implementation is primarily the extra tagging needed to track the internal types and sign bits, but this can be done without adding new state bits by recoding NaNs internally in the exponent field. Small modifications are needed to the pipelines used to transfer values in and out of the recoded format, but the datapath and latency costs are minimal. The recoding process has to handle shifting of input subnormal values for wide operands in any case, and extracting the NaN-boxed value is a similar process to normalization except for skipping over leading-1 bits instead of skipping over leading-0 bits, allowing the datapath multiplexing to be shared.

8.2.3 Double-Precision Load and Store Instructions

The FLD instruction loads a double-precision floating-point value from memory into floating-point register rd. FSD stores a double-precision value from the floating-point registers to memory.

note

The double-precision value may be a NaN-boxed single-precision value.

fd94db7420655c8a6301de1219c4999d

efb9bcf07472034faaa97cb97d21eb4a

FLD and FSD are only guaranteed to execute atomically if the effective address is naturally aligned and XLEN≥64.

FLD and FSD do not modify the bits being transferred; in particular, the payloads of non-canonical NaNs are preserved.

8.2.4 Double-Precision Floating-Point Computational Instructions

The double-precision floating-point computational instructions are defined analogously to their single-precision counterparts, but operate on double-precision operands and produce double-precision results.

46250fe4ee75a86c6c62713554b09d84

a024d71dea475a8e0b02afbe0a59e058

8.2.5 Double-Precision Floating-Point Conversion and Move Instructions

Floating-point-to-integer and integer-to-floating-point conversion instructions are encoded in the OP-FP major opcode space. FCVT.W.D or FCVT.L.D converts a double-precision floating-point number in floating-point register rs1 to a signed 32-bit or 64-bit integer, respectively, in integer register rd. FCVT.D.W or FCVT.D.L converts a 32-bit or 64-bit signed integer, respectively, in integer register rs1 into a double-precision floating-point number in floating-point register rd. FCVT.WU.D, FCVT.LU.D, FCVT.D.WU, and FCVT.D.LU variants convert to or from unsigned integer values. For RV64, FCVT.W[U].D sign-extends the 32-bit result. FCVT.L[U].D and FCVT.D.L[U] are RV64-only instructions. The range of valid inputs for FCVT.int.D and the behavior for invalid inputs are the same as for FCVT.int.S.

All floating-point to integer and integer to floating-point conversion instructions round according to the rm field. Note FCVT.D.W[U] always produces an exact result and is unaffected by rounding mode.

508f74a3f75a499b7948bdf70f6a5ccd

The double-precision to single-precision and single-precision to double-precision conversion instructions, FCVT.S.D and FCVT.D.S, are encoded in the OP-FP major opcode space and both the source and destination are floating-point registers. The rs2 field encodes the datatype of the source, and the fmt field encodes the datatype of the destination. FCVT.S.D rounds according to the RM field; FCVT.D.S will never round.

2a25a38658def3d3d9ccd88af772be80

Floating-point to floating-point sign-injection instructions, FSGNJ.D, FSGNJN.D, and FSGNJX.D are defined analogously to the single-precision sign-injection instruction.

ede7088f46264741de8155abe97d9af7

For XLEN≥64 only, instructions are provided to move bit patterns between the floating-point and integer registers. FMV.X.D moves the double-precision value in floating-point register rs1 to a representation in the IEEE 754-2008 encoding in integer register rd. FMV.D.X moves the double-precision value encoded in the IEEE 754-2008 encoding from the integer register rs1 to the floating-point register rd.

FMV.X.D and FMV.D.X do not modify the bits being transferred; in particular, the payloads of non-canonical NaNs are preserved.

84923443407628dfc9a3db8f7d90070a

note

Early versions of the RISC-V ISA had additional instructions to allow RV32 systems to transfer between the upper and lower portions of a 64-bit floating-point register and an integer register. However, these would be the only instructions with partial register writes and would add complexity in implementations with recoded floating-point or register renaming, requiring a pipeline read-modify-write sequence. Scaling up to handling quad-precision for RV32 and RV64 would also require additional instructions if they were to follow this pattern. The ISA was defined to reduce the number of explicit int-float register moves, by having conversions and comparisons write results to the appropriate register file, so we expect the benefit of these instructions to be lower than for other ISAs.

We note that for systems that implement a 64-bit floating-point unit including fused multiply-add support and 64-bit floating-point loads and stores, the marginal hardware cost of moving from a 32-bit to a 64-bit integer datapath is low, and a software ABI supporting 32-bit wide address-space and pointers can be used to avoid growth of static data and dynamic memory traffic.

8.2.6 Double-Precision Floating-Point Compare Instructions

The double-precision floating-point compare instructions are defined analogously to their single-precision counterparts, but operate on double-precision operands.

9e23414972d5343372a0d3d76be263be

8.2.7 Double-Precision Floating-Point Classify Instruction

The double-precision floating-point classify instruction, FCLASS.D, is defined analogously to its single-precision counterpart, but operates on double-precision operands.

44e89660e5f69944aa8d3027f0ef5345

8.3 "Q" Extension for Quad-Precision Floating-Point, Version 2.2

This chapter describes the Q standard extension for quad-precision floating-point, which adds computational instructions compliant with the IEEE 754-2008 arithmetic standard’s binary128 format and operations. The Q extension depends on the D extension.

The floating-point registers are now extended to hold either a single, double, or quad-precision floating-point value (FLEN=128). The NaN-boxing scheme described in nanboxing is now extended recursively to allow a single-precision value to be NaN-boxed inside a double-precision value which is itself NaN-boxed inside a quad-precision value.

8.3.1 Quad-Precision Load and Store Instructions

New 128-bit variants of LOAD-FP and STORE-FP instructions are added, encoded with a new value for the funct3 width field.

630d03f6f5e2f9c9e190ef42c092c678

0415e5ba9a33880e15e6df4e5c583201

FLQ and FSQ are only guaranteed to execute atomically if the effective address is naturally aligned and XLEN=128.

FLQ and FSQ do not modify the bits being transferred; in particular, the payloads of non-canonical NaNs are preserved.

8.3.2 Quad-Precision Computational Instructions

A new supported format is added to the format field of most instructions, as shown in fpextfmt

fmt field	Mnemonic	Meaning
00	S	32-bit single-precision
01	D	64-bit double-precision
10	H	16-bit half-precision
11	Q	128-bit quad-precision

The quad-precision floating-point computational instructions are defined analogously to their double-precision counterparts, but operate on quad-precision operands and produce quad-precision results.

1bc1f7996b091abb476dc73abbd483e8

cacc5c5a2498b1675ab8fdd36f1b9b14

8.3.3 Quad-Precision Convert and Move Instructions

New floating-point-to-integer and integer-to-floating-point conversion instructions are added. These instructions are defined analogously to the double-precision-to-integer and integer-to-double-precision conversion instructions. FCVT.W.Q or FCVT.L.Q converts a quad-precision floating-point number to a signed 32-bit or 64-bit integer, respectively. FCVT.Q.W or FCVT.Q.L converts a 32-bit or 64-bit signed integer, respectively, into a quad-precision floating-point number. FCVT.WU.Q, FCVT.LU.Q, FCVT.Q.WU, and FCVT.Q.LU variants convert to or from unsigned integer values. FCVT.L[U].Q and FCVT.Q.L[U] are RV64-only instructions. Note FCVT.Q.L[U] always produces an exact result and is unaffected by rounding mode.

6f4e4e9d8b21bda03b65cd10bd0bf13a

New floating-point-to-floating-point conversion instructions are added. These instructions are defined analogously to the double-precision floating-point-to-floating-point conversion instructions. FCVT.S.Q or FCVT.Q.S converts a quad-precision floating-point number to a single-precision floating-point number, or vice-versa, respectively. FCVT.D.Q or FCVT.Q.D converts a quad-precision floating-point number to a double-precision floating-point number, or vice-versa, respectively.

7b20c9090517c15bf1e33a1ae7ad4c8e

Floating-point to floating-point sign-injection instructions, FSGNJ.Q, FSGNJN.Q, and FSGNJX.Q are defined analogously to the double-precision sign-injection instruction.

ac1c5babbc597773c5df7c5d0a697f3a

FMV.X.Q and FMV.Q.X instructions are not provided in RV32 or RV64, so quad-precision bit patterns must be moved to the integer registers via memory.

8.3.4 Quad-Precision Floating-Point Compare Instructions

The quad-precision floating-point compare instructions are defined analogously to their double-precision counterparts, but operate on quad-precision operands.

43fbe95c32f56118b42d182db7c03566

8.3.5 Quad-Precision Floating-Point Classify Instruction

The quad-precision floating-point classify instruction, FCLASS.Q, is defined analogously to its double-precision counterpart, but operates on quad-precision operands.

f88015fe5d1e9f716b78198eb2c5fd85

8.4 `Zfh` Extension for Half-Precision Floating-Point

This chapter describes the Zfh standard extension for half-precision floating-point, which adds computational instructions compliant with the IEEE 754-2008 arithmetic standard’s binary16 format and operations. The Zfh extension depends on the F extension. The NaN-boxing scheme described in nanboxing is extended to allow a half-precision value to be NaN-boxed inside a single-precision value (which may be recursively NaN-boxed inside a double- or quad-precision value when the D or Q extension is present).

note

This extension primarily provides instructions that consume half-precision operands and produce half-precision results. However, it is also common to compute on half-precision data using higher intermediate precision. Although this extension provides explicit conversion instructions that suffice to implement that pattern, future extensions might further accelerate such computation with additional instructions that implicitly widen their operands—e.g., half×half+single→single—or implicitly narrow their results—e.g., half+single→half.

8.4.1 Half-Precision Load and Store Instructions

New 16-bit variants of LOAD-FP and STORE-FP instructions are added, encoded with a new value for the funct3 width field.

8ee0e49cd2bb25af669c30d18a57673a

a02923b95069e20651ef9f4a81c8f672

FLH and FSH are only guaranteed to execute atomically if the effective address is naturally aligned.

FLH and FSH do not modify the bits being transferred; in particular, the payloads of non-canonical NaNs are preserved. FLH NaN-boxes the result written to rd, whereas FSH ignores all but the lower 16 bits in rs2.

8.4.2 Half-Precision Computational Instructions

A new supported format is added to the format field of most instructions, as shown in tab:fpextfmth.

fmt field	Mnemonic	Meaning
00	S	32-bit single-precision
01	D	64-bit double-precision
10	H	16-bit half-precision
11	Q	128-bit quad-precision

The half-precision floating-point computational instructions are defined analogously to their single-precision counterparts, but operate on half-precision operands and produce half-precision results.

a462b7ca19154455380e80e03dc42611

856c99a1aa6141218a6aa3677a73bf51

8.4.3 Half-Precision Conversion and Move Instructions

New floating-point-to-integer and integer-to-floating-point conversion instructions are added. These instructions are defined analogously to the single-precision-to-integer and integer-to-single-precision conversion instructions. FCVT.W.H or FCVT.L.H converts a half-precision floating-point number to a signed 32-bit or 64-bit integer, respectively. FCVT.H.W or FCVT.H.L converts a 32-bit or 64-bit signed integer, respectively, into a half-precision floating-point number. FCVT.WU.H, FCVT.LU.H, FCVT.H.WU, and FCVT.H.LU variants convert to or from unsigned integer values. FCVT.L[U].H and FCVT.H.L[U] are RV64-only instructions.

235e2a4a791a07a2b9cc200dc9810559

New floating-point-to-floating-point conversion instructions are added. These instructions are defined analogously to the double-precision floating-point-to-floating-point conversion instructions. FCVT.S.H or FCVT.H.S converts a half-precision floating-point number to a single-precision floating-point number, or vice-versa, respectively. If the D extension is present, FCVT.D.H or FCVT.H.D converts a half-precision floating-point number to a double-precision floating-point number, or vice-versa, respectively. If the Q extension is present, FCVT.Q.H or FCVT.H.Q converts a half-precision floating-point number to a quad-precision floating-point number, or vice-versa, respectively.

68f4d280a7d7bcf62f078855a5601f9e

Floating-point to floating-point sign-injection instructions, FSGNJ.H, FSGNJN.H, and FSGNJX.H are defined analogously to the single-precision sign-injection instruction.

293503a067764b561124ff5464a58acf

Instructions are provided to move bit patterns between the floating-point and integer registers. FMV.X.H moves the half-precision value in floating-point register rs1 to a representation in the IEEE 754-2008 encoding in integer register rd, filling the upper XLEN-16 bits with copies of the floating-point number’s sign bit.

FMV.H.X moves the half-precision value encoded in the IEEE 754-2008 encoding from the lower 16 bits of integer register rs1 to the floating-point register rd, NaN-boxing the result.

FMV.X.H and FMV.H.X do not modify the bits being transferred; in particular, the payloads of non-canonical NaNs are preserved.

155b6d8ec0a7c90e62f6f33cd8399ff3

8.4.4 Half-Precision Floating-Point Compare Instructions

The half-precision floating-point compare instructions are defined analogously to their single-precision counterparts, but operate on half-precision operands.

4faf8630ac54be0b7886f92729731e8a

8.4.5 Half-Precision Floating-Point Classify Instruction

The half-precision floating-point classify instruction, FCLASS.H, is defined analogously to its single-precision counterpart, but operates on half-precision operands.

461eb2e1c79c81629b42ed0b95616427

8.5 `Zfhmin` Standard Extension for Minimal Half-Precision Floating-Point

This section describes the Zfhmin standard extension, which provides minimal support for 16-bit half-precision binary floating-point instructions. The Zfhmin extension is a subset of the Zfh extension, consisting only of data transfer and conversion instructions. Like Zfh, the Zfhmin extension depends on the single-precision floating-point extension, F. The expectation is that Zfhmin software primarily uses the half-precision format for storage, performing most computation in higher precision.

The Zfhmin extension includes the following instructions from the Zfh extension: FLH, FSH, FMV.X.H, FMV.H.X, FCVT.S.H, and FCVT.H.S. If the D extension is present, the FCVT.D.H and FCVT.H.D instructions are also included. If the Q extension is present, the FCVT.Q.H and FCVT.H.Q instructions are additionally included.

note

Zfhmin does not include the FSGNJ.H instruction, because it suffices to instead use the FSGNJ.S instruction to move half-precision values between floating-point registers.

Half-precision addition, subtraction, multiplication, division, and square-root operations can be faithfully emulated by converting the half-precision operands to single-precision, performing the operation using single-precision arithmetic, then converting back to half-precision. [22] Performing half-precision fused multiply-addition using this method incurs a 1-ulp error on some inputs for the RNE and RMM rounding modes.

Conversion from 8- or 16-bit integers to half-precision can be emulated by first converting to single-precision, then converting to half-precision. Conversion from 32-bit integer can be emulated by first converting to double-precision. If the D extension is not present and a 1-ulp error under RNE or RMM is tolerable, 32-bit integers can be first converted to single-precision instead. The same remark applies to conversions from 64-bit integers without the Q extension.

8.6 "Zfa" Extension for Additional Floating-Point Instructions, Version 1.0

This chapter describes the Zfa standard extension, which adds instructions for immediate loads, IEEE 754-2019 minimum and maximum operations, round-to-integer operations, and quiet floating-point comparisons. For RV32D, the Zfa extension also adds instructions to transfer double-precision floating-point values to and from integer registers, and for RV64Q, it adds analogous instructions for quad-precision floating-point values. The Zfa extension depends on the F extension.

8.6.1 Load-Immediate Instructions

The FLI.S instruction loads one of 32 single-precision floating-point constants, encoded in the rs1 field, into floating-point register rd. The correspondence of rs1 field values and single-precision floating-point values is shown in tab:flis. FLI.S is encoded like FMV.W.X, but with rs2=1.

rs1	Value	Sign	Exponent	Significand
0	−1.0	`1`	`01111111`	`000…000`
1	Minimum positive normal	`0`	`00000001`	`000…000`
2	1.0 × 2⁻¹⁶	`0`	`01101111`	`000…000`
3	1.0 × 2⁻¹⁵	`0`	`01110000`	`000…000`
4	1.0 × 2⁻⁸	`0`	`01110111`	`000…000`
5	1.0 × 2⁻⁷	`0`	`01111000`	`000…000`
6	0.0625 (2⁻⁴)	`0`	`01111011`	`000…000`
7	0.125 (2⁻³)	`0`	`01111100`	`000…000`
8	0.25	`0`	`01111101`	`000…000`
9	0.3125	`0`	`01111101`	`010…000`
10	0.375	`0`	`01111101`	`100…000`
11	0.4375	`0`	`01111101`	`110…000`
12	0.5	`0`	`01111110`	`000…000`
13	0.625	`0`	`01111110`	`010…000`
14	0.75	`0`	`01111110`	`100…000`
15	0.875	`0`	`01111110`	`110…000`
16	1.0	`0`	`01111111`	`000…000`
17	1.25	`0`	`01111111`	`010…000`
18	1.5	`0`	`01111111`	`100…000`
19	1.75	`0`	`01111111`	`110…000`
20	2.0	`0`	`10000000`	`000…000`
21	2.5	`0`	`10000000`	`010…000`
22	3	`0`	`10000000`	`100…000`
23	4	`0`	`10000001`	`000…000`
24	8	`0`	`10000010`	`000…000`
25	16	`0`	`10000011`	`000…000`
26	128 (2⁷)	`0`	`10000110`	`000…000`
27	256 (2⁸)	`0`	`10000111`	`000…000`
28	2¹⁵	`0`	`10001110`	`000…000`
29	2¹⁶	`0`	`10001111`	`000…000`
30	+∞	`0`	`11111111`	`000…000`
31	Canonical NaN	`0`	`11111111`	`100…000`

note

The preferred assembly syntax for entries 1, 30, and 31 is min, inf, and nan, respectively. For entries 0 through 29 (including entry 1), the assembler will accept decimal constants in C-like syntax.

note

The set of 32 constants was chosen by examining floating-point libraries, including the C standard math library, and to optimize fixed-point to floating-point conversion.

Entries 8-22 follow a regular encoding pattern. No entry sets mantissa bits other than the two most significant ones.

If the D extension is implemented, FLI.D performs the analogous operation, but loads a double-precision value into floating-point register rd. Note that entry 1 (corresponding to the minimum positive normal value) has a numerically different value for double-precision than for single-precision. FLI.D is encoded like FLI.S, but with fmt=D.

If the Q extension is implemented, FLI.Q performs the analogous operation, but loads a quad-precision value into floating-point register rd. Note that entry 1 (corresponding to the minimum positive normal value) has a numerically different value for quad-precision. FLI.Q is encoded like FLI.S, but with fmt=Q.

If the Zfh or Zvfh extension is implemented, FLI.H performs the analogous operation, but loads a half-precision floating-point value into register rd. Note that entry 1 (corresponding to the minimum positive normal value) has a numerically different value for half-precision. Furthermore, since 2¹⁶ is not representable in half-precision floating-point, entry 29 in the table instead loads positive infinity—i.e., it is redundant with entry 30. FLI.H is encoded like FLI.S, but with fmt=H.

note

Additionally, since 2⁻¹⁶ and 2⁻¹⁵ are subnormal in half-precision, entry 1 is numerically greater than entries 2 and 3 for FLI.H.

The FLI.fmt instructions never set any floating-point exception flags.

8.6.2 Minimum and Maximum Instructions

The FMINM.S and FMAXM.S instructions are defined like the FMIN.S and FMAX.S instructions, except that if either input is NaN, the result is the canonical NaN.

If the D extension is implemented, FMINM.D and FMAXM.D instructions are analogously defined to operate on double-precision numbers.

If the Zfh extension is implemented, FMINM.H and FMAXM.H instructions are analogously defined to operate on half-precision numbers.

If the Q extension is implemented, FMINM.Q and FMAXM.Q instructions are analogously defined to operate on quad-precision numbers.

These instructions are encoded like their FMIN and FMAX counterparts, but with instruction bit 13 set to 1.

note

These instructions implement the IEEE 754-2019 minimum and maximum operations.

8.6.3 Round-to-Integer Instructions

The FROUND.S instruction rounds the single-precision floating-point number in floating-point register rs1 to an integer, according to the rounding mode specified in the instruction’s rm field. It then writes that integer, represented as a single-precision floating-point number, to floating-point register rd. Zero and infinite inputs are copied to rd unmodified. Signaling NaN inputs cause the invalid operation exception flag to be set; no other exception flags are set. FROUND.S is encoded like FCVT.S.D, but with rs2=4.

The FROUNDNX.S instruction is defined similarly, but it also sets the inexact exception flag if the input differs from the rounded result and is not NaN. FROUNDNX.S is encoded like FCVT.S.D, but with rs2=5.

If the D extension is implemented, FROUND.D and FROUNDNX.D instructions are analogously defined to operate on double-precision numbers. They are encoded like FCVT.D.S, but with rs2=4 and 5, respectively,

If the Zfh extension is implemented, FROUND.H and FROUNDNX.H instructions are analogously defined to operate on half-precision numbers. They are encoded like FCVT.H.S, but with rs2=4 and 5, respectively,

If the Q extension is implemented, FROUND.Q and FROUNDNX.Q instructions are analogously defined to operate on quad-precision numbers. They are encoded like FCVT.Q.S, but with rs2=4 and 5, respectively,

note

The FROUNDNX.fmt instructions implement the IEEE 754-2019 roundToIntegralExact operation, and the FROUND.fmt instructions implement the other operations in the roundToIntegral family.

8.6.4 Modular Convert-to-Integer Instruction

The FCVTMOD.W.D instruction is defined similarly to the FCVT.W.D instruction, with the following differences. FCVTMOD.W.D always rounds towards zero. Bits 31:0 are taken from the rounded, unbounded two’s complement result, then sign-extended to XLEN bits and written to integer register rd. ±∞ and NaN are converted to zero.

Floating-point exception flags are raised the same as they would be for FCVT.W.D with the same input operand.

This instruction is only provided if the D extension is implemented. It is encoded like FCVT.W.D, but with the rs2 field set to 8 and the rm field set to 1 (RTZ). Other rm values are reserved.

note

The assembly syntax requires the RTZ rounding mode to be explicitly specified, i.e., fcvtmod.w.d rd, rs1, rtz.

The FCVTMOD.W.D instruction was added principally to accelerate the processing of JavaScript Numbers. Numbers are double-precision values, but some operators implicitly truncate them to signed integers mod 2³².

8.6.5 Move Instructions

For RV32 only, if the D extension is implemented, the FMVH.X.D instruction moves bits 63:32 of floating-point register rs1 into integer register rd. It is encoded in the OP-FP major opcode with funct3=0, rs2=1, and funct7=1110001.

note

FMVH.X.D is used in conjunction with the existing FMV.X.W instruction to move a double-precision floating-point number to a pair of x-registers.

For RV32 only, if the D extension is implemented, the FMVP.D.X instruction moves a double-precision number from a pair of integer registers into a floating-point register. Integer registers rs1 and rs2 supply bits 31:0 and 63:32, respectively; the result is written to floating-point register rd. FMVP.D.X is encoded in the OP-FP major opcode with funct3=0 and funct7=1011001.

For RV64 only, if the Q extension is implemented, the FMVH.X.Q instruction moves bits 127:64 of floating-point register rs1 into integer register rd. It is encoded in the OP-FP major opcode with funct3=0, rs2=1, and funct7=1110011.

note

FMVH.X.Q is used in conjunction with the existing FMV.X.D instruction to move a quad-precision floating-point number to a pair of x-registers.

For RV64 only, if the Q extension is implemented, the FMVP.Q.X instruction moves a double-precision number from a pair of integer registers into a floating-point register. Integer registers rs1 and rs2 supply bits 63:0 and 127:64, respectively; the result is written to floating-point register rd. FMVP.Q.X is encoded in the OP-FP major opcode with funct3=0 and funct7=1011011.

8.6.6 Comparison Instructions

The FLEQ.S and FLTQ.S instructions are defined like the FLE.S and FLT.S instructions, except that quiet NaN inputs do not cause the invalid operation exception flag to be set.

If the D extension is implemented, FLEQ.D and FLTQ.D instructions are analogously defined to operate on double-precision numbers.

If the Zfh extension is implemented, FLEQ.H and FLTQ.H instructions are analogously defined to operate on half-precision numbers.

If the Q extension is implemented, FLEQ.Q and FLTQ.Q instructions are analogously defined to operate on quad-precision numbers.

These instructions are encoded like their FLE and FLT counterparts, but with instruction bit 14 set to 1.

note

We do not expect analogous comparison instructions will be added to the vector ISA, since they can be reasonably efficiently emulated using masking.

8.7 `Zfbfmin` Extension for Scalar BFloat16 Conversions

This extension provides the minimal set of instructions needed to enable scalar support of the BF16 format. It enables BF16 as an interchange format as it provides conversion between BF16 values and FP32 values.

This extension depends upon the single-precision floating-point extension F.

This extension includes six instructions: the FCVT.BF16.S and FCVT.S.BF16 instructions, defined below, and the FLH, FSH, FMV.X.H, and FMV.H.X instructions, defined in chap:zfh.

note

While conversion instructions tend to include all supported formats, in these extensions we only support conversion between BF16 and FP32 as we are targeting a special use case. These extensions are intended to support the case where BF16 values are used as reduced precision versions of FP32 values, where use of BF16 provides a two-fold advantage for storage, bandwidth, and computation. In this use case, the BF16 values are typically multiplied by each other and accumulated into FP32 sums. These sums are typically converted to BF16 and then used as subsequent inputs. The operations on the BF16 values can be performed on the CPU or a loosely coupled coprocessor.

Subsequent extensions might provide support for native BF16 arithmetic. Such extensions could add additional conversion instructions to allow all supported formats to be converted to and from BF16.

note

BF16 addition, subtraction, multiplication, division, and square-root operations can be faithfully emulated by converting the BF16 operands to single-precision, performing the operation using single-precision arithmetic, and then converting back to BF16. Performing BF16 fused multiply-addition using this method can produce results that differ by 1-ulp on some inputs for the RNE and RMM rounding modes.

Conversions between BF16 and formats larger than FP32 can be emulated. Exact widening conversions from BF16 can be synthesized by first converting to FP32 and then converting from FP32 to the target precision. Conversions narrowing to BF16 can be synthesized by first converting to FP32 through a series of halving steps and then converting from FP32 to BF16. As with the fused multiply-addition instruction described above, this method of converting values to BF16 can be off by 1-ulp on some inputs for the RNE and RMM rounding modes.

8.7.1 BF16 Number Format

BF16 bits

92c4bcec676bc7355771f336c9744a72

While BF16 (also known as BFloat16) is not an IEEE 754 standard format, it is a valid floating-point format as defined by IEEE 754-2008, with radix 2, number of significand digits 8, and maximum exponent 127.

BF16 computational instructions defined in this chapter support all IEEE 754-2008 features, including all rounding modes, subnormal inputs and outputs, overflow and underflow, and default exception handling. Tininess is detected after rounding.

The BF16 canonical NaN is 0x7fc0.

BF16 values are NaN-boxed when held in f registers, as described in nanboxing.

8.7.2 fcvt.bf16.s

Synopsis Convert FP32 value to a BF16 value

Mnemonic fcvt.bf16.s rd, rs1

Encoding

4867c24168f24befa5c11e84a6675e7f

note

While the mnemonic of this instruction is consistent with that of the other RISC-V floating-point convert instructions, a new encoding is used in bits 24:20.

BF16.S and H are used to signify that the source is FP32 and the destination is BF16.

Description

Narrowing convert FP32 value to a BF16 value. Round according to the RM field.

This instruction is similar to other narrowing floating-point-to-floating-point conversion instructions.

Exceptions: Overflow, Underflow, Inexact, Invalid

8.7.3 fcvt.s.bf16

Synopsis Convert BF16 value to an FP32 value

Mnemonic fcvt.s.bf16 rd, rs1

Encoding

802c234625f4357a6521ad4f4d9a0a1a

note

While the mnemonic of this instruction is consistent with that of the other RISC-V floating-point convert instructions, a new encoding is used in bits 24:20 to indicate that the source is BF16.

Description

Converts a BF16 value to an FP32 value. The conversion is exact.

This instruction is similar to other widening floating-point-to-floating-point conversion instructions.

note

If the input is normal or infinity, the BF16 encoded value is shifted to the left by 16 places and the least significant 16 bits are written with 0s.

The result is NaN-boxed by writing the most significant FLEN-32 bits with 1s.

Exceptions: Invalid

8.8 "Zfinx", "Zdinx", "Zhinx", "Zhinxmin" Extensions for Floating-Point in Integer Registers, Version 1.0

This chapter defines the "Zfinx" extension (pronounced "z-f-in-x") that provides instructions similar to those in the standard floating-point F extension for single-precision floating-point instructions but which operate on the x registers instead of the f registers. This chapter also defines the "Zdinx", "Zhinx", and "Zhinxmin" extensions that provide similar instructions for other floating-point precisions.

note

The F extension uses separate f registers for floating-point computation, to reduce register pressure and simplify the provision of register-file ports for wide superscalars. However, the additional 128 B of architectural state increases the minimal implementation cost. By eliminating the f registers, the Zfinx extension substantially reduces the cost of simple RISC-V implementations with floating-point instruction-set support. Zfinx also reduces context-switch cost.

In general, software that assumes the presence of the F extension is incompatible with software that assumes the presence of the Zfinx extension, and vice versa.

The Zfinx extension adds all of the instructions that the F extension adds, except for the transfer instructions FLW, FSW, FMV.W.X, FMV.X.W, C.FLW[SP], and C.FSW[SP].

note

Zfinx software uses integer loads and stores to transfer floating-point values from and to memory. Transfers between registers use either integer arithmetic or floating-point sign-injection instructions.

The Zfinx variants of these F-extension instructions have the same semantics, except that whenever such an instruction would have accessed an f register, it instead accesses the x register with the same number.

The Zfinx extension depends on the "Zicsr" extension for control and status register access.

8.8.1 Processing of Narrower Values

Floating-point operands of width w < XLEN bits occupy bits w-1:0 of an x register. Floating-point operations on w-bit operands ignore operand bits XLEN-1: w.

Floating-point operations that produce w < XLEN-bit results fill bits XLEN-1: w with copies of bit w-1 (the sign bit).

note

The NaN-boxing scheme employed in the f registers was designed to efficiently support recoded floating-point formats. Recoding is less practical for Zfinx, though, since the same registers hold both floating-point and integer operands. Hence, the need for NaN boxing is diminished.

Sign-extending 32-bit floating-point numbers when held in RV64 x registers is compatible with the existing RV64 calling conventions, which leave bits 63-32 undefined when passing a 32-bit floating point value in x registers. To keep the architecture more regular, we extend this pattern to 16-bit floating-point numbers in both RV32 and RV64.

8.8.2 Zdinx

The Zdinx extension provides analogous double-precision floating-point instructions. The Zdinx extension depends upon the Zfinx extension.

The Zdinx extension adds all of the instructions that the D extension adds, except for the transfer instructions FLD, FSD, FMV.D.X, FMV.X.D, C.FLD[SP], and C.FSD[SP].

The Zdinx variants of these D-extension instructions have the same semantics, except that whenever such an instruction would have accessed an f register, it instead accesses the x register with the same number.

8.8.3 Processing of Wider Values

Double-precision operands in RV32Zdinx are held in aligned x-register pairs, i.e., register numbers must be even. Use of misaligned (odd-numbered) registers for double-width floating-point operands is reserved.

Regardless of endianness, the lower-numbered register holds the low-order bits, and the higher-numbered register holds the high-order bits: e.g., bits 31:0 of a double-precision operand in RV32Zdinx might be held in register x14, with bits 63:32 of that operand held in x15.

When a double-width floating-point result is written to x0, the entire write takes no effect: e.g., for RV32Zdinx, writing a double-precision result to x0 does not cause x1 to be written.

When x0 is used as a double-width floating-point operand, the entire operand is zero—i.e., x1 is not accessed.

note

Load-pair and store-pair instructions are contained in a separate extension (see Section Extensions for Load/Store pair for RV32). In case this is not available, transferring double-precision operands in RV32Zdinx from or to memory requires two loads or stores. Register moves need only a single FSGNJ.D instruction, however.

8.8.4 Zhinx

The Zhinx extension provides analogous half-precision floating-point instructions. The Zhinx extension depends upon the Zfinx extension.

The Zhinx extension adds all of the instructions that the Zfh extension adds, except for the transfer instructions FLH, FSH, FMV.H.X, and FMV.X.H.

The Zhinx variants of these Zfh-extension instructions have the same semantics, except that whenever such an instruction would have accessed an f register, it instead accesses the x register with the same number.

8.8.5 Zhinxmin

The Zhinxmin extension provides minimal support for 16-bit half-precision floating-point instructions that operate on the x registers. The Zhinxmin extension depends upon the Zfinx extension.

The Zhinxmin extension includes the following instructions from the Zhinx extension: FCVT.S.H and FCVT.H.S. If the Zdinx extension is present, the FCVT.D.H and FCVT.H.D instructions are also included.

note

In the future, an RV64Zqinx quad-precision extension could be defined analogously to RV32Zdinx. An RV32Zqinx extension could also be defined but would require quad-register groups.

8.8.6 Privileged Architecture Implications

As described in sec:mstatus, the mstatus field FS is hardwired to 0 if the Zfinx extension is implemented, and FS no longer affects the trapping behavior of floating-point instructions or fcsr accesses.

The misa bits F, D, and Q are hardwired to 0 when the Zfinx extension is implemented.

note

A future discoverability mechanism might be used to probe the existence of the Zfinx, Zhinx, and Zdinx extensions.

8.1 "F" Extension for Single-Precision Floating-Point, Version 2.2​

8.1.1 F Register State​

8.1.2 Floating-Point Control and Status Register​

8.1.3 NaN Generation and Propagation​

8.1.4 Subnormal Arithmetic​

8.1.5 Single-Precision Load and Store Instructions​

8.1.6 Single-Precision Floating-Point Computational Instructions​

8.1.7 Single-Precision Floating-Point Conversion and Move Instructions​

8.1.8 Single-Precision Floating-Point Compare Instructions​

8.1.9 Single-Precision Floating-Point Classify Instruction​

8.2 "D" Extension for Double-Precision Floating-Point, Version 2.2​

8.2.1 D Register State​

8.2.2 NaN Boxing of Narrower Values​

8.2.3 Double-Precision Load and Store Instructions​

8.2.4 Double-Precision Floating-Point Computational Instructions​

8.2.5 Double-Precision Floating-Point Conversion and Move Instructions​

8.2.6 Double-Precision Floating-Point Compare Instructions​

8.2.7 Double-Precision Floating-Point Classify Instruction​

8.3 "Q" Extension for Quad-Precision Floating-Point, Version 2.2​

8.3.1 Quad-Precision Load and Store Instructions​

8.3.2 Quad-Precision Computational Instructions​

8.3.3 Quad-Precision Convert and Move Instructions​

8.3.4 Quad-Precision Floating-Point Compare Instructions​

8.3.5 Quad-Precision Floating-Point Classify Instruction​

8.4 Zfh Extension for Half-Precision Floating-Point​

8.4.1 Half-Precision Load and Store Instructions​

8.4.2 Half-Precision Computational Instructions​

8.4.3 Half-Precision Conversion and Move Instructions​

8.4.4 Half-Precision Floating-Point Compare Instructions​

8.4.5 Half-Precision Floating-Point Classify Instruction​

8.5 Zfhmin Standard Extension for Minimal Half-Precision Floating-Point​

8.6 "Zfa" Extension for Additional Floating-Point Instructions, Version 1.0​

8.6.1 Load-Immediate Instructions​

8.6.2 Minimum and Maximum Instructions​

8.6.3 Round-to-Integer Instructions​

8.6.4 Modular Convert-to-Integer Instruction​

8.6.5 Move Instructions​

8.6.6 Comparison Instructions​

8.7 Zfbfmin Extension for Scalar BFloat16 Conversions​

8.7.1 BF16 Number Format​

8.7.2 fcvt.bf16.s​

8.7.3 fcvt.s.bf16​

8.8 "Zfinx", "Zdinx", "Zhinx", "Zhinxmin" Extensions for Floating-Point in Integer Registers, Version 1.0​

8.8.1 Processing of Narrower Values​

8.8.2 Zdinx​

8.8.3 Processing of Wider Values​

8.8.4 Zhinx​

8.8.5 Zhinxmin​

8.8.6 Privileged Architecture Implications​

8.1 "F" Extension for Single-Precision Floating-Point, Version 2.2

8.1.1 F Register State

8.1.2 Floating-Point Control and Status Register

8.1.3 NaN Generation and Propagation

8.1.4 Subnormal Arithmetic

8.1.5 Single-Precision Load and Store Instructions

8.1.6 Single-Precision Floating-Point Computational Instructions

8.1.7 Single-Precision Floating-Point Conversion and Move Instructions

8.1.8 Single-Precision Floating-Point Compare Instructions

8.1.9 Single-Precision Floating-Point Classify Instruction

8.2 "D" Extension for Double-Precision Floating-Point, Version 2.2

8.2.1 D Register State

8.2.2 NaN Boxing of Narrower Values

8.2.3 Double-Precision Load and Store Instructions

8.2.4 Double-Precision Floating-Point Computational Instructions

8.2.5 Double-Precision Floating-Point Conversion and Move Instructions

8.2.6 Double-Precision Floating-Point Compare Instructions

8.2.7 Double-Precision Floating-Point Classify Instruction

8.3 "Q" Extension for Quad-Precision Floating-Point, Version 2.2

8.3.1 Quad-Precision Load and Store Instructions

8.3.2 Quad-Precision Computational Instructions

8.3.3 Quad-Precision Convert and Move Instructions

8.3.4 Quad-Precision Floating-Point Compare Instructions

8.3.5 Quad-Precision Floating-Point Classify Instruction

8.4 `Zfh` Extension for Half-Precision Floating-Point

8.4.1 Half-Precision Load and Store Instructions

8.4.2 Half-Precision Computational Instructions

8.4.3 Half-Precision Conversion and Move Instructions

8.4.4 Half-Precision Floating-Point Compare Instructions

8.4.5 Half-Precision Floating-Point Classify Instruction

8.5 `Zfhmin` Standard Extension for Minimal Half-Precision Floating-Point

8.6 "Zfa" Extension for Additional Floating-Point Instructions, Version 1.0

8.6.1 Load-Immediate Instructions

8.6.2 Minimum and Maximum Instructions

8.6.3 Round-to-Integer Instructions

8.6.4 Modular Convert-to-Integer Instruction

8.6.5 Move Instructions

8.6.6 Comparison Instructions

8.7 `Zfbfmin` Extension for Scalar BFloat16 Conversions

8.7.1 BF16 Number Format

8.7.2 fcvt.bf16.s

8.7.3 fcvt.s.bf16

8.8 "Zfinx", "Zdinx", "Zhinx", "Zhinxmin" Extensions for Floating-Point in Integer Registers, Version 1.0

8.8.1 Processing of Narrower Values

8.8.2 Zdinx

8.8.3 Processing of Wider Values

8.8.4 Zhinx

8.8.5 Zhinxmin

8.8.6 Privileged Architecture Implications