Fusion

TODO: classification & naming

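  • Indirect addressing modes
    • riscv: only [reg + imm12]
    • arm: pre-index, post-index
  • Long immediate loads
    • riscv: 20bit + 12bit
  • Operand plus left/right shift
    • riscv
  • High and low parts of multiplication and division
    • riscv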
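
A worked example for the 20bit + 12bit split: because addi sign-extends its 12-bit immediate, the upper 20 bits have to be rounded up whenever bit 11 of the constant is set. The sketch below models only the low 32 bits of the result; the helper names (split_imm32, lui_addi) are made up for illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def split_imm32(value):
    """Split a 32-bit constant into (hi20, lo12) for a lui + addi pair."""
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    # addi sign-extends lo12, so add 0x800 before shifting: this rounds
    # the upper part up by one exactly when bit 11 is set.
    hi20 = ((value + 0x800) >> 12) & 0xFFFFF
    return hi20, lo12


def lui_addi(hi20, lo12):
    """Model the low 32 bits of: lui rd, hi20 ; addi rd, rd, lo12."""
    return ((hi20 << 12) + sign_extend(lo12, 12)) & 0xFFFFFFFF


if __name__ == "__main__":
    for const in (0x12345678, 0x12345FFF, 0xFFFFFFFF):
        hi20, lo12 = split_imm32(const)
        print(hex(const), hex(hi20), hex(lo12), hex(lui_addi(hi20, lo12)))
```

A fused lui + addi can treat the pair as one "load 32-bit constant" operation, which is why this pair appears in the RISC-V tables further down.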

X86

https://stackoverflow.com/questions/56413517/what-is-instruction-fusion-in-contemporary-x86-processors

  • Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop.
  • Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline.
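
A minimal sketch of macro-fusion as a peephole pass over a decoded instruction stream. It models only the two pairs named above (cmp/jcc and test/jcc); real cores impose further restrictions on operand forms and branch conditions that are not modeled here. The tuple representation and the function names are assumptions for illustration.

```python
FUSIBLE_HEADS = {"cmp", "test"}
# Jumps that start with "j" but are not flag-reading conditional branches.
NOT_JCC = {"jmp", "jcxz", "jecxz", "jrcxz"}


def is_jcc(mnemonic):
    """True for flag-based conditional jumps such as je, jne, jl."""
    return mnemonic.startswith("j") and mnemonic not in NOT_JCC


def macro_fuse(instrs):
    """Greedily merge adjacent cmp/test + jcc pairs into one fused entry.

    instrs is a list of (mnemonic, operands) tuples in program order.
    """
    out = []
    i = 0
    while i < len(instrs):
        head = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if nxt is not None and head[0] in FUSIBLE_HEADS and is_jcc(nxt[0]):
            out.append(("fused", head, nxt))  # one compare-and-branch uop
            i += 2
        else:
            out.append(("single", head))
            i += 1
    return out


if __name__ == "__main__":
    program = [
        ("mov", "eax, 1"),
        ("cmp", "eax, ebx"),
        ("jne", "loop"),
        ("test", "ecx, ecx"),
        ("jmp", "done"),
    ]
    for entry in macro_fuse(program):
        print(entry)
```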

ARM

2019.cortex_a77.pdf

Apart from the AES instructions (and the NOP entry), every fusible pair ends in a conditional branch.

aarch64

Head                | Tail
CMP/CMN (immediate) | B.cond
CMP/CMN (register)  | B.cond
TST (immediate)     | B.cond
TST (register)      | B.cond
BICS (register)     | B.cond
NOP                 | Any instruction

aarch32 & aarch64

Head | Tail
AESE | AESMC
AESD | AESIMC

2020.neoverse_n2.pdf

aarch64

Head                | Tail
CMP/CMN (immediate) | B.cond
CMP/CMN (register)  | B.cond
CMP (immediate)     | CSEL
CMP (register)      | CSEL
CMP (immediate)     | CSET
CMP (register)      | CSET
TST (immediate)     | B.cond
TST (register)      | B.cond
BICS (register)     | B.cond
NOP                 | Any instruction

aarch32 & aarch64

Head                | Tail
AESE                | AESMC
AESD                | AESIMC
CMP/CMN (immediate) | B.cond
CMP/CMN (register)  | B.cond
TST (immediate)     | B.cond
TST (register)      | B.cond
BICS (register)     | B.cond
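
The ARM tables above can be treated as a plain lookup from (head, tail) instruction classes to "fusible or not". The sketch below encodes only the AArch64-state rows, using the class labels exactly as they appear in the tables; the dictionary keys and the can_fuse helper are illustrative names, not anything taken from the Arm documents.

```python
BRANCH_HEADS = [
    "CMP/CMN (immediate)",
    "CMP/CMN (register)",
    "TST (immediate)",
    "TST (register)",
    "BICS (register)",
]
AES_PAIRS = {("AESE", "AESMC"), ("AESD", "AESIMC")}

# Pairs listed for Cortex-A77 in AArch64 state.
CORTEX_A77 = {(head, "B.cond") for head in BRANCH_HEADS} | AES_PAIRS

# Neoverse N2 lists the same pairs plus CMP followed by CSEL or CSET.
NEOVERSE_N2 = CORTEX_A77 | {
    (f"CMP ({kind})", tail)
    for kind in ("immediate", "register")
    for tail in ("CSEL", "CSET")
}

FUSION_TABLE = {"cortex_a77": CORTEX_A77, "neoverse_n2": NEOVERSE_N2}


def can_fuse(core, head, tail):
    """Return True if the tables list (head, tail) as fusible on `core`."""
    if head == "NOP":  # both tables list NOP + any instruction
        return True
    return (head, tail) in FUSION_TABLE[core]


if __name__ == "__main__":
    # Pairs that only the Neoverse N2 table lists.
    for pair in sorted(NEOVERSE_N2 - CORTEX_A77):
        print(pair)
```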

RISC-V

2022.fuse_mem.singh.micro.0.md

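The fusion in this paper requires dual GPR write ports. It covers the last two rows of the table below, as well as non-contiguous memory accesses and non-consecutive instructions. My impression is that dual write ports are not widely applicable.

Guaranteeing correctness requires a fair amount of additional logic; see Figure 7.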

Head                     | Tail
add rd, rs1, rs2         | ld rd, 0(rd)
lui rd, imm[31:12]       | addi rd, rd, imm[11:0]
ld rd, imm(rs1)          | add rs1, rs1, 8
auipc t, imm20           | jalr ra, imm12(t)
slli rd, rs1, {1,2,3}    | add rd, rd, rs2
mulh[[S]U] rdh, rs1, rs2 | mul rdl, rs1, rs2
slli rd, rs1, 32         | srli rd, rd, 29/30/31/32
div[U] rdq, rs1, rs2     | rem[U] rdr, rs1, rs
lui rd, imm[31:12]       | ld rd, imm[11:0](rd)
auipc rd, symbol[31:12]  | ld rd, symbol[11:0](rd)
ld rd1, imm(rs1)         | ld rd2, imm+8(rs1)
st rs2, imm(rs1)         | st rs3, imm+8(rs1)
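
To see why a pair such as slli rd, rs1, 32 / srli rd, rd, 29/30/31/32 is a reasonable fusion candidate, it helps to check what the pair computes as a whole: it zero-extends the low 32 bits of rs1 and shifts them left by 3, 2, 1 or 0. The sketch below models RV64 registers as Python integers masked to 64 bits; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def slli(x, shamt):
    """RV64 logical shift left, truncated to 64 bits."""
    return (x << shamt) & MASK64


def srli(x, shamt):
    """RV64 logical shift right on a 64-bit value."""
    return (x & MASK64) >> shamt


def two_step(rs1, srl_amount):
    """slli rd, rs1, 32 followed by srli rd, rd, srl_amount."""
    return srli(slli(rs1, 32), srl_amount)


def fused(rs1, srl_amount):
    """Single-operation view: zero-extend the low word, then shift left."""
    return (rs1 & 0xFFFFFFFF) << (32 - srl_amount)


if __name__ == "__main__":
    samples = [0, 1, 0x80000000, 0xFFFFFFFF, 0xDEADBEEFCAFEF00D, MASK64]
    for amount in (29, 30, 31, 32):
        ok = all(two_step(x, amount) == fused(x, amount) for x in samples)
        print(amount, ok)
```

With srl_amount = 32 the pair is a plain 32-bit zero-extension; the other three amounts give a zero-extended index scaled by 2, 4 or 8.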

2017.bt_fuse_riscv_x86.clark.carrv.0.pdf

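Binary translation from RISC-V to x86-64 with static register mapping. Compared with QEMU, it does not look very strong.

It performs N-to-1 fusion, i.e. macro-op fusion. My feeling is that the performance gain is smaller than what micro-op fusion would bring?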

Head                  | Middle                   | Tail
AUIPC r1, imm20       | -                        | ADDI r1, r1, imm12
AUIPC r1, imm20       | -                        | JALR ra, imm12(r1)
AUIPC ra, imm20       | -                        | JALR ra, imm12(ra)
AUIPC r1, imm20       | -                        | LW r1, imm12(r1)
AUIPC r1, imm20       | -                        | LD r1, imm12(r1)
SLLI r1, r1, 32       | -                        | SRLI r1, r1, 32
ADDIW r1, r1, imm12   | SLLI r1, r1, 32          | SRLI r1, r1, 32
SRLI r2, r1, imm12    | SLLI r3, r1, (64-imm12)  | OR r2, r2, r3
SRLIW r2, r1, imm12   | SLLIW r3, r1, (32-imm12) | OR r2, r2, r3
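
The SRLI / SLLI / OR row is a 64-bit rotate right spelled out in three instructions, which is what makes it a natural N-to-1 target when translating to x86-64. A small sketch that checks this, assuming shift amounts from 1 to 63 and using a string-based rotate as an independent reference (all names here are illustrative):

```python
MASK64 = (1 << 64) - 1


def three_step(r1, n):
    """SRLI r2, r1, n ; SLLI r3, r1, 64-n ; OR r2, r2, r3 (1 <= n <= 63)."""
    r2 = (r1 & MASK64) >> n
    r3 = (r1 << (64 - n)) & MASK64
    return r2 | r3


def ror64_reference(x, n):
    """Rotate right by n using the 64-character binary string of x."""
    bits = format(x & MASK64, "064b")
    # The low n bits (last n characters) wrap around to the top.
    return int(bits[-n:] + bits[:-n], 2)


if __name__ == "__main__":
    value = 0x0123456789ABCDEF
    print(hex(three_step(value, 8)))
    print(all(three_step(value, n) == ror64_reference(value, n)
              for n in range(1, 64)))
```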