Fusion
TODO: classification & naming
- Indirect addressing modes
  - riscv: only [reg + imm12]
  - arm: pre-index, post-index
- Long immediate loads
  - riscv: 20bit + 12bit
- Operand + left/right shift
  - riscv
- High/low parts of multiply and divide results
  - riscv
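For the "20bit + 12bit" case, a minimal sketch of how a 32-bit constant splits across a `lui` + `addi` pair. The function names `split_imm32` and `join_imm32` are hypothetical, and the sketch assumes RV32-style 32-bit wraparound. The only subtlety is that `addi` sign-extends its 12-bit immediate, so the upper part has to be rounded up when bit 11 is set.

```python
def split_imm32(value):
    """Split a 32-bit constant into (hi20, lo12) for a lui + addi pair.

    Hypothetical helper: assumes 32-bit wraparound (RV32-style).
    """
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        # addi sign-extends its immediate, so treat the low part as negative
        lo12 -= 0x1000
    # compensate in the upper 20 bits for a negative low part
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def join_imm32(hi20, lo12):
    """What the pair computes: (hi20 << 12) + sign-extended lo12, mod 2**32."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF


hi, lo = split_imm32(0x12345FFF)
print(hex(hi), lo)
```

A fused implementation would effectively perform `join_imm32` in one step instead of two dependent operations.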
X86
https://stackoverflow.com/questions/56413517/what-is-instruction-fusion-in-contemporary-x86-processors
- Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop.
- Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline.
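The macro-fusion rule above can be sketched as a greedy pass over a toy instruction stream. This is only an illustration: `macro_fuse`, `FLAG_SETTERS` and `JCC` are hypothetical names, the `JCC` set is a small subset of the conditional jumps, and real cores apply further restrictions (operand forms, condition codes, alignment) that are not modelled here.

```python
FLAG_SETTERS = {"cmp", "test"}
# Assumption: a small subset of conditional-jump mnemonics, for illustration only
JCC = {"je", "jne", "jz", "jnz", "jl", "jle", "jg", "jge",
       "jb", "jbe", "ja", "jae", "js", "jns"}


def macro_fuse(instrs):
    """Merge each adjacent cmp/test + jcc pair into one fused entry."""
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt is not None
                and cur.split()[0] in FLAG_SETTERS
                and nxt.split()[0] in JCC):
            out.append((cur, nxt))   # one compare-and-branch uop
            i += 2
        else:
            out.append((cur,))       # left unfused
            i += 1
    return out


stream = ["mov eax, 1", "cmp eax, ebx", "jne loop", "test ecx, ecx", "ret"]
for entry in macro_fuse(stream):
    print(entry)
```

In this toy stream only `cmp` + `jne` fuses; the trailing `test` is followed by `ret`, so it stays on its own.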
ARM
2019.cortex_a77.pdf
Apart from the crypto instructions, all fused pairs are conditional branches (the NOP row aside).
aarch64
Head | Tail |
---|---|
CMP/CMN (immediate) | B.cond |
CMP/CMN (register) | B.cond |
TST (immediate) | B.cond |
TST (register) | B.cond |
BICS (register) | B.cond |
NOP | Any instruction |
aarch32 & aarch64
Head | Tail |
---|---|
AESE | AESMC |
AESD | AESIMC |
2020.neoverse_n2.pdf
aarch64
Head | Tail |
---|---|
CMP/CMN (immediate) | B.cond |
CMP/CMN (register) | B.cond |
CMP (immediate) | CSEL |
CMP (register) | CSEL |
CMP (immediate) | CSET |
CMP (register) | CSET |
TST (immediate) | B.cond |
TST (register) | B.cond |
BICS (register) | B.cond |
NOP | Any instruction |
aarch32 & aarch64
Head | Tail |
---|---|
AESE | AESMC |
AESD | AESIMC |
CMP/CMN (immediate) | B.cond |
CMP/CMN (register) | B.cond |
TST (immediate) | B.cond |
TST (register) | B.cond |
BICS (register) | B.cond |
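To compare the two cores, the AArch64 tables above can be encoded as sets of (head, tail) pairs and diffed. This is a sketch over the rows exactly as listed in these notes; the names `A77_AARCH64`, `N2_AARCH64` and `can_fuse` are hypothetical, and no operand or adjacency constraints from the optimization guides are modelled.

```python
A77_AARCH64 = {
    ("CMP/CMN (immediate)", "B.cond"),
    ("CMP/CMN (register)", "B.cond"),
    ("TST (immediate)", "B.cond"),
    ("TST (register)", "B.cond"),
    ("BICS (register)", "B.cond"),
    ("NOP", "Any instruction"),
}

# Neoverse N2 lists the same rows plus CMP + CSEL/CSET
N2_AARCH64 = A77_AARCH64 | {
    ("CMP (immediate)", "CSEL"),
    ("CMP (register)", "CSEL"),
    ("CMP (immediate)", "CSET"),
    ("CMP (register)", "CSET"),
}


def can_fuse(table, head, tail):
    """Table lookup; a NOP head fuses with any tail if the table has that row."""
    if head == "NOP" and ("NOP", "Any instruction") in table:
        return True
    return (head, tail) in table


# Rows present for N2 but not for A77
for head, tail in sorted(N2_AARCH64 - A77_AARCH64):
    print(head, "+", tail)
```

The difference consists only of the four CMP + CSEL/CSET rows, so in these tables N2 extends fusion beyond conditional branches to conditional select/set.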
RISC-V
2022.fuse_mem.singh.micro.0.md
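The fusion in this paper requires two GPR write ports (this includes the last two rows of the table). It also covers non-contiguous memory accesses and non-consecutive instructions. In my view, dual write ports are not very general-purpose.
Guaranteeing correctness requires a fair amount of extra logic; see Figure 7.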
Head | Tail |
---|---|
add rd, rs1, rs2 | ld rd, 0(rd) |
lui rd, imm[31:12] | addi rd, rd, imm[11:0] |
ld rd, imm(rs1) | add rs1, rs1, 8 |
auipc t, imm20 | jalr ra, imm12(t) |
slli rd, rs1, {1,2,3} | add rd, rd, rs2 |
mulh[[S]U] rdh, rs1, rs2 | mul rdl, rs1, rs2 |
slli rd, rs1, 32 | srli rd, rd, 29/30/31/32 |
div[U] rdq, rs1, rs2 | rem[U] rdr, rs1, rs2 |
lui rd, imm[31:12] | ld rd, imm[11:0](rd) |
auipc rd, symbol[31:12] | ld rd, symbol[11:0](rd) |
ld rd1, imm(rs1) | ld rd2, imm+8(rs1) |
st rs2, imm(rs1) | st rs3, imm+8(rs1) |
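Each row in the table is really a pattern plus register constraints. A minimal sketch of two of them (shift + add, and the load pair), assuming a hypothetical `Insn` tuple as the decoded form. The checks that the first load must not overwrite the base register and that the two destinations differ are my own assumptions for safety, not something stated in the table.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    """Hypothetical decoded instruction, for illustration only."""
    op: str
    rd: str
    rs1: str
    rs2: Optional[str] = None   # second source register, if any
    imm: Optional[int] = None   # shift amount or memory offset


def fuse_shift_add(head, tail):
    """slli rd, rs1, {1,2,3} ; add rd, rd, rs2  ->  rd = (rs1 << k) + rs2"""
    return (head.op == "slli" and head.imm in (1, 2, 3)
            and tail.op == "add"
            and tail.rd == head.rd      # same destination
            and tail.rs1 == head.rd)    # add consumes the shifted value


def fuse_load_pair(head, tail):
    """ld rd1, imm(rs1) ; ld rd2, imm+8(rs1)"""
    return (head.op == "ld" and tail.op == "ld"
            and head.rs1 == tail.rs1        # same base register
            and tail.imm == head.imm + 8    # adjacent doublewords
            and head.rd != head.rs1         # assumption: base not clobbered
            and head.rd != tail.rd)         # assumption: distinct destinations


print(fuse_shift_add(Insn("slli", "a0", "a1", imm=2),
                     Insn("add", "a0", "a0", "a2")))
print(fuse_load_pair(Insn("ld", "a0", "sp", imm=0),
                     Insn("ld", "a1", "sp", imm=8)))
```

The load pair is one of the cases that writes two destination registers, which is where the dual write port requirement comes from.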
2017.bt_fuse_riscv_x86.clark.carrv.0.pdf
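Binary translation RISC-V => x86-64 with static register mapping; compared with QEMU, it does not look very good.
It does N-to-1 fusion (macro-op fusion); my impression is that the performance gain is smaller than what micro-op fusion brings?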
Head | Middle | Tail |
---|---|---|
AUIPC r1, imm20 | ADDI r1, r1, imm12 | |
AUIPC r1, imm20 | JALR ra, imm12(r1) | |
AUIPC ra, imm20 | JALR ra, imm12(ra) | |
AUIPC r1, imm20 | LW r1, imm12(r1) | |
AUIPC r1, imm20 | LD r1, imm12(r1) | |
SLLI r1, r1, 32 | SRLI r1, r1, 32 | |
ADDIW r1, r1, imm12 | SLLI r1, r1, 32 | SRLI r1, r1, 32 |
SRLI r2, r1, imm12 | SLLI r3, r1, (64-imm12) | OR r2, r2, r3 |
SRLIW r2, r1, imm12 | SLLIW r3, r1, (32-imm12) | OR r2, r2, r3 |
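N-to-1 fusion works here because each sequence computes a single simple operation. A small sketch that models two of the 64-bit rows with Python integers and checks them against the operation they amount to (zero-extension of the low 32 bits, and rotate-right). The helper names are hypothetical, and the rotate check assumes 0 < n < 64.

```python
MASK64 = (1 << 64) - 1


def slli(x, n):
    """64-bit logical shift left."""
    return (x << n) & MASK64


def srli(x, n):
    """64-bit logical shift right."""
    return (x & MASK64) >> n


def zext32_idiom(x):
    # SLLI r1, r1, 32 ; SRLI r1, r1, 32
    return srli(slli(x, 32), 32)


def ror64_idiom(x, n):
    # SRLI r2, r1, n ; SLLI r3, r1, (64-n) ; OR r2, r2, r3   (0 < n < 64)
    return srli(x, n) | slli(x, 64 - n)


x = 0x0123456789ABCDEF
print(hex(zext32_idiom(x)))
print(hex(ror64_idiom(x, 8)))
```

Since the whole sequence collapses to one well-known operation, a translator that recognises the pattern can emit a single target instruction for it.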