mt-tcg.md - xieby1's notes

2020.5.9

QEMU wiki page: tcg-multithread

`vl.c`

qemu-system code framework

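To clarify how MTTCG starts up, I went through the qemu-system startup process and drew a code-framework diagram of it.

In the diagram above, the point where qemu_thread_create spawns a thread running qemu_tcg_cpu_thread_fn marks one MTTCG CPU starting to run.

I guess rr stands for round robin.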

`qemu_tcg_cpu_thread_fn`

Subtitle: code framework of MTTCG CPU thread startup

2020.5.20: for the BIOS loading process, see the blog post BIOS execution in QEMU: where it all starts – martin.uy. The code framework for loading the BIOS is as follows:

bios

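2020.5.20: in the qemu-system framework diagram, cpu_reset sets cs:eip of the BSP CPU (BootStrap Processor, the CPU responsible for executing the BIOS) to 0xf000:0xfff0, which is the BIOS entry address. I am not yet sure why cpu_reset is executed twice. For the CPU startup process, see Volume 3, Chapter 8.4 of the Intel 64 and IA-32 Architectures Software Developer's Manual.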
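As a small worked example of the address arithmetic behind 0xf000:0xfff0: plain real-mode arithmetic (selector shifted left by 4, plus offset) gives 0xFFFF0, while right after reset the hidden CS base documented in the Intel manual is 0xFFFF0000, so the first fetch goes to 0xFFFFFFF0. The function names below are made up for illustration and are not QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Ordinary real-mode address: the segment base is the selector << 4. */
static uint32_t toy_real_mode_linear(uint16_t cs_selector, uint16_t ip)
{
    return ((uint32_t)cs_selector << 4) + ip;
}

/* Address formed from an explicit (hidden) segment base plus an offset. */
static uint32_t toy_linear_from_base(uint32_t cs_base, uint32_t eip)
{
    assert(eip <= 0xffffu);   /* offsets are 16-bit in real mode */
    return cs_base + eip;
}
```

With these, `toy_real_mode_linear(0xf000, 0xfff0)` is 0xFFFF0 and `toy_linear_from_base(0xffff0000u, 0xfff0)` is 0xFFFFFFF0.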

From the qemu-system framework diagram, starting MTTCG only requires that qemu_thread_create call qemu_tcg_cpu_thread_fn.

2020.5.11

The run-time frameworks of multi-threaded TCG and single-threaded TCG are shown side by side below for comparison:

mttcg-cpu-formatted
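To make the single-threaded side of the comparison concrete, here is a minimal sketch of round-robin scheduling: one loop hands each vCPU a bounded slice of work in turn. Under MTTCG each vCPU would instead get its own host thread and run without waiting for its turn. All names (`ToyVCPU`, `toy_rr_run`, the `budget` parameter) are hypothetical and only model the idea; this is not the code of qemu_tcg_rr_cpu_thread_fn.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int remaining;   /* units of guest work left on this vCPU */
    int slices;      /* how many times the scheduler ran this vCPU */
} ToyVCPU;

/* Run one vCPU for at most `budget` units; return the units consumed. */
static int toy_exec_slice(ToyVCPU *cpu, int budget)
{
    int n = cpu->remaining < budget ? cpu->remaining : budget;
    cpu->remaining -= n;
    cpu->slices++;
    return n;
}

/* Single-threaded round robin over all vCPUs; returns the slice count. */
static int toy_rr_run(ToyVCPU *cpus, size_t ncpus, int budget)
{
    int total_slices = 0;
    int busy = 1;

    assert(budget > 0);
    while (busy) {
        busy = 0;
        for (size_t i = 0; i < ncpus; i++) {
            if (cpus[i].remaining > 0) {
                toy_exec_slice(&cpus[i], budget);
                total_slices++;
                busy = 1;
            }
        }
    }
    return total_slices;
}
```

With two vCPUs holding 5 and 2 units of work and a budget of 2, the loop needs 4 slices in total, and the second vCPU sits idle while the first finishes. That serialisation is what the multi-threaded variant removes.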

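Once tcg_cpu_exec starts executing, it calls cpu_exec, and from there the path is essentially the same as for user-mode instructions. The difference I have seen so far is that interrupt/exception handling differs slightly from user mode. The code framework of cpu_exec is shown below: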

linux-user-cpu-exec

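When an interrupt/exception occurs at run time and execution jumps out via siglongjmp, the exception/interrupt is then handled in cpu_handle_exception and cpu_handle_interrupt. (User-mode qemu handles it in cpu_loop.)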

2020.5.12

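According to docs/devel/multiple-iothreads.txt:

IOThreads exist to relieve the I/O bottleneck of the main loop when qemu runs on multi-core, multi-threaded processors;
qemu's global lock makes the vCPUs and the main loop execute certain qemu code serially (not in parallel), because for historical reasons much of the qemu code was not written with thread safety in mind;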

According to docs/devel/multi-thread-tcg.txt:

MTTCG design goal:

In the general case of running translated code there should be no inter-vCPU dependencies and all vCPUs should be able to run at full speed. Synchronisation will only be required while accessing internal shared data structures or when the emulated architecture requires a coherent representation of the emulated machine state.

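In other words, apart from the necessary synchronisation, vCPU threads should be able to run at full speed without depending on each other.

Data structures for which multi-threaded access has to be considered

CPUState.tb_jmp_cache is owned by each CPU, but when a tb has to be invalidated, do_tb_phys_invalidate must look for it on every CPU, and a tb_flush executed on one CPU flushes the caches of all CPUs;
tb_ctx.htable is shared by all CPUs; a lookup that misses in tb_jmp_cache falls back to this table;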
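The two-level lookup and the cross-CPU invalidation described above can be sketched as follows. This is a simplified model under my own assumptions: the sizes, the hash, the linear scan standing in for the shared hash table, and all `toy_`/`Toy` names are hypothetical, and the locking or atomics a real multi-threaded version needs are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_JMP_CACHE_SIZE 16   /* hypothetical size, power of two */
#define TOY_MAX_TBS 64
#define TOY_NCPUS 2

typedef struct { uint64_t pc; int valid; } ToyTB;
typedef struct { ToyTB *jmp_cache[TOY_JMP_CACHE_SIZE]; } ToyCPU;

static ToyTB toy_shared_tbs[TOY_MAX_TBS];   /* stand-in for tb_ctx.htable */
static size_t toy_shared_count;
static ToyCPU toy_cpus[TOY_NCPUS];

static unsigned toy_jmp_hash(uint64_t pc)
{
    return (unsigned)(pc & (TOY_JMP_CACHE_SIZE - 1));
}

static ToyTB *toy_shared_insert(uint64_t pc)
{
    assert(toy_shared_count < TOY_MAX_TBS);
    ToyTB *tb = &toy_shared_tbs[toy_shared_count++];
    tb->pc = pc;
    tb->valid = 1;
    return tb;
}

/* Slow path: search the table shared by all CPUs. */
static ToyTB *toy_shared_lookup(uint64_t pc)
{
    for (size_t i = 0; i < toy_shared_count; i++) {
        if (toy_shared_tbs[i].valid && toy_shared_tbs[i].pc == pc) {
            return &toy_shared_tbs[i];
        }
    }
    return NULL;
}

/* Fast path: per-CPU cache first, then fall back to the shared table. */
static ToyTB *toy_tb_lookup(ToyCPU *cpu, uint64_t pc)
{
    unsigned h = toy_jmp_hash(pc);
    ToyTB *tb = cpu->jmp_cache[h];

    if (tb && tb->valid && tb->pc == pc) {
        return tb;
    }
    tb = toy_shared_lookup(pc);
    if (tb) {
        cpu->jmp_cache[h] = tb;   /* remember it for next time */
    }
    return tb;
}

/* Invalidation has to visit the private cache of every CPU. */
static void toy_tb_invalidate(ToyTB *tb)
{
    unsigned h = toy_jmp_hash(tb->pc);

    tb->valid = 0;
    for (size_t i = 0; i < TOY_NCPUS; i++) {
        if (toy_cpus[i].jmp_cache[h] == tb) {
            toy_cpus[i].jmp_cache[h] = NULL;
        }
    }
}
```

The per-CPU cache is cheap to read because only its owner looks things up in it, but that is exactly why invalidation is the awkward case: one CPU ends up writing into another CPU's private structure.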

2020.5.22

Concurrent TLB reads and writes

Safety is ensured with async_run_on_cpu @cputlb.c
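The idea, as I understand it, is that a vCPU never writes another vCPU's TLB directly; it queues a work item, and the owning vCPU runs it itself at a safe point. Below is a single-threaded sketch of that deferral. The `toy_` names, the fixed-size queue and the tiny TLB are all assumptions for illustration; the real async_run_on_cpu has a different signature and protects its queue with a lock, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WORK 8
#define TOY_TLB_SIZE 4

typedef struct ToyCPU ToyCPU;
typedef void (*toy_work_fn)(ToyCPU *cpu, unsigned long arg);

typedef struct { toy_work_fn fn; unsigned long arg; } ToyWorkItem;

struct ToyCPU {
    unsigned long tlb[TOY_TLB_SIZE];   /* cached page addresses, 0 = empty */
    ToyWorkItem queue[TOY_MAX_WORK];
    size_t queued;
};

/* Called by any vCPU: only records the request, touches no TLB state. */
static void toy_async_run_on_cpu(ToyCPU *target, toy_work_fn fn,
                                 unsigned long arg)
{
    assert(target->queued < TOY_MAX_WORK);
    target->queue[target->queued].fn = fn;
    target->queue[target->queued].arg = arg;
    target->queued++;
}

/* Called by the owning vCPU between blocks: runs everything queued. */
static size_t toy_process_queued_work(ToyCPU *self)
{
    size_t n = self->queued;

    for (size_t i = 0; i < n; i++) {
        self->queue[i].fn(self, self->queue[i].arg);
    }
    self->queued = 0;
    return n;
}

/* The deferred work itself: drop one page from this vCPU's own TLB. */
static void toy_tlb_flush_page_work(ToyCPU *cpu, unsigned long page)
{
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        if (cpu->tlb[i] == page) {
            cpu->tlb[i] = 0;
        }
    }
}
```

Until the owner drains its queue the stale entry is still visible, so code that needs the flush to have completed before continuing needs a synchronous variant rather than this fire-and-forget form.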

The call backtrace of this function is in check.md.

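tcg_out_qemu_ld/st @ tcg-target.inc.c // attaches a label to each ld/st; after a whole tb has been translated, the TLB-miss code is appended and the jump targets at these labels are patched.

The label processing is invoked from tcg_gen_code @tcg.c: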

tcg_out_ldst_finalize
- tcg_out_qemu_ld/st_slow_path @tcg_ldst.inc.c
  - the generated host code calls qemu_ld/st_helpers @tcg-target.inc.c at run time; a helper is selected according to the operand type, for example
    - helper_ret_ldub_mmu @cputlb.c
      - full_ldub_mmu
        - load/store_helper
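The label-then-patch scheme can be sketched with a toy instruction buffer: each guest ld/st emits a fast path with a branch whose destination is not yet known, a label records where that branch sits, and once the block is complete the slow paths are appended and the branches are patched. Everything here (`ToyTB`, the opcodes, `toy_out_qemu_ldst`, `toy_ldst_finalize`) is a hypothetical model, not the TCG backend code.

```c
#include <assert.h>

enum { TOY_OP_NOP, TOY_OP_BRANCH_ON_MISS, TOY_OP_FAST_LDST,
       TOY_OP_SLOW_HELPER, TOY_OP_JUMP };

#define TOY_MAX_CODE 64
#define TOY_MAX_LABELS 16

typedef struct { int op; int target; } ToyInsn;   /* target: code index */
typedef struct { int branch_at; int resume_at; } ToyLabel;

typedef struct {
    ToyInsn code[TOY_MAX_CODE];
    int ncode;
    ToyLabel labels[TOY_MAX_LABELS];
    int nlabels;
} ToyTB;

/* Append one instruction and return its index. */
static int toy_emit(ToyTB *tb, int op, int target)
{
    assert(tb->ncode < TOY_MAX_CODE);
    tb->code[tb->ncode].op = op;
    tb->code[tb->ncode].target = target;
    return tb->ncode++;
}

/* Fast path of a guest ld/st; the miss branch is left unresolved (-1). */
static void toy_out_qemu_ldst(ToyTB *tb)
{
    assert(tb->nlabels < TOY_MAX_LABELS);
    int br = toy_emit(tb, TOY_OP_BRANCH_ON_MISS, -1);
    toy_emit(tb, TOY_OP_FAST_LDST, 0);
    tb->labels[tb->nlabels].branch_at = br;
    tb->labels[tb->nlabels].resume_at = tb->ncode;  /* just after fast path */
    tb->nlabels++;
}

/* After the whole block: append slow paths and patch the branches. */
static void toy_ldst_finalize(ToyTB *tb)
{
    for (int i = 0; i < tb->nlabels; i++) {
        int slow = toy_emit(tb, TOY_OP_SLOW_HELPER, 0);
        toy_emit(tb, TOY_OP_JUMP, tb->labels[i].resume_at);
        tb->code[tb->labels[i].branch_at].target = slow;
    }
}
```

Keeping the slow paths at the end of the block means the common TLB-hit case runs straight through without jumping over helper-call code.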

On a TLB miss, tlb_fill is called.

tlb_fill
- x86_cpu_tlb_fill @excp_helper.c
  - handle_mmu_fault
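    - tlb_set_page_with_attrs @cputlb.c // Called from TCG-generated code, which is under an RCU read-side critical section. /// the page table has been walked, using the GVA to find the GPA
      - tlb_set_page_with_attrs /// establishes the GPA-to-HVA mapping; the stored addend is relative to the guest virtual address, i.e. GVA + CPUTLBEntry.addend = HVA
        - address_space_translate_for_iotlb /// looks up the HVA in the memory region; I did not read the details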
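The hit/miss logic and the role of the addend can be sketched as below. This is a toy model: the direct-mapped table, the page size, the identity GVA-to-GPA "page walk" and all `toy_` names are assumptions for illustration. Each entry stores an addend chosen so that adding it to the guest virtual address yields the host address.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_BITS 12
#define TOY_PAGE_MASK (~(((uint64_t)1 << TOY_PAGE_BITS) - 1))
#define TOY_TLB_ENTRIES 256
#define TOY_TLB_INVALID UINT64_MAX   /* never equals a page-aligned address */

typedef struct {
    uint64_t page;      /* guest virtual page address used as the tag */
    uintptr_t addend;   /* host address minus guest virtual address */
} ToyTLBEntry;

static ToyTLBEntry toy_tlb[TOY_TLB_ENTRIES];
static unsigned toy_fills;   /* counts how often the slow path ran */

static unsigned toy_tlb_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> TOY_PAGE_BITS) & (TOY_TLB_ENTRIES - 1));
}

static void toy_tlb_init(void)
{
    for (unsigned i = 0; i < TOY_TLB_ENTRIES; i++) {
        toy_tlb[i].page = TOY_TLB_INVALID;
    }
    toy_fills = 0;
}

/* Stand-in for tlb_fill: the "page walk" is the identity mapping here. */
static void toy_tlb_fill(uint64_t vaddr, uint8_t *guest_ram, uint64_t ram_size)
{
    uint64_t page = vaddr & TOY_PAGE_MASK;
    ToyTLBEntry *e = &toy_tlb[toy_tlb_index(vaddr)];

    assert(page < ram_size);
    e->page = page;
    e->addend = (uintptr_t)(guest_ram + page) - (uintptr_t)page;
    toy_fills++;
}

/* Byte load: compare the tag, fill on a miss, then add the addend. */
static uint8_t toy_ldub(uint64_t vaddr, uint8_t *guest_ram, uint64_t ram_size)
{
    ToyTLBEntry *e = &toy_tlb[toy_tlb_index(vaddr)];

    if (e->page != (vaddr & TOY_PAGE_MASK)) {
        toy_tlb_fill(vaddr, guest_ram, ram_size);
    }
    return *(uint8_t *)((uintptr_t)vaddr + e->addend);
}
```

Storing an addend instead of a host pointer lets the hit path be a compare plus one addition, with no separate offset-within-page computation.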

2020.5.28

Instruction prefix: LOCK

Quoting the framework for translating a sequence of instructions into TCG from my qemu linux-user notes:

translate_insn-formatted

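Before the switch on the instruction opcode there is an earlier switch that handles prefixes, which I had not paid attention to before. The prefixes variable is a local variable of disas_insn. When the instruction being translated is bts, the case below is executed. Note 1: the example uses word size and little endian, i.e. the w_le suffix. Note 2: the i386 bts instruction actually calls l_le.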

translate_lock_bts-formatted
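A prefix pass of this kind can be sketched as a loop that consumes legacy prefix bytes and accumulates flags before the opcode switch is reached. The prefix byte values are the standard x86 encodings. The flag names and values, and the function itself, are my own for illustration and only loosely modelled on the prefixes variable mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits collected while scanning prefixes. */
#define TOY_PREFIX_REPZ  0x01
#define TOY_PREFIX_REPNZ 0x02
#define TOY_PREFIX_LOCK  0x04
#define TOY_PREFIX_DATA  0x08
#define TOY_PREFIX_ADR   0x10
#define TOY_PREFIX_SEG   0x20

/* Consume legacy prefixes; return the offset of the first opcode byte. */
static size_t toy_decode_prefixes(const uint8_t *code, size_t len,
                                  int *prefixes)
{
    size_t i = 0;

    assert(code != NULL && prefixes != NULL);
    *prefixes = 0;
    while (i < len) {
        switch (code[i]) {
        case 0xf3: *prefixes |= TOY_PREFIX_REPZ;  break;
        case 0xf2: *prefixes |= TOY_PREFIX_REPNZ; break;
        case 0xf0: *prefixes |= TOY_PREFIX_LOCK;  break;
        case 0x66: *prefixes |= TOY_PREFIX_DATA;  break;
        case 0x67: *prefixes |= TOY_PREFIX_ADR;   break;
        case 0x26: case 0x2e: case 0x36:
        case 0x3e: case 0x64: case 0x65:
            *prefixes |= TOY_PREFIX_SEG;          /* segment override */
            break;
        default:
            return i;                             /* first non-prefix byte */
        }
        i++;
    }
    return i;
}
```

For the byte sequence `f0 0f ab 03` (lock bts %eax,(%ebx)) this returns offset 1 with only the LOCK flag set, so the later opcode switch can check that flag and choose the atomic path.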

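The helper function is called here in the same way as the helper functions that handle system calls. helper_atomic_fetch_orw_le is called while the host code is running, to carry out an atomic operation. A rough outline of how the function name is assembled by macro pasting, together with the code framework, is shown below: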

helper_atomic_fetch_or-formatted

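The programming documentation for gcc's built-in atomic functions is in the built-in atomic operations chapter of the documentation on the gcc website.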
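A minimal sketch of both ideas together: a macro pastes a size suffix onto a function name, and the body is a single gcc `__atomic_fetch_or` builtin. The `toy_` names are hypothetical, the endianness handling implied by the `_le` suffix is ignored (a little-endian host is assumed), and this is not the macro machinery QEMU uses.

```c
#include <assert.h>
#include <stdint.h>

/* Generate toy_atomic_fetch_or<SUFFIX> for one operand type. */
#define TOY_GEN_ATOMIC_FETCH_OR(SUFFIX, TYPE)                          \
    static TYPE toy_atomic_fetch_or##SUFFIX(TYPE *haddr, TYPE val)     \
    {                                                                  \
        /* Atomically OR val into *haddr and return the old value. */  \
        return __atomic_fetch_or(haddr, val, __ATOMIC_SEQ_CST);        \
    }

TOY_GEN_ATOMIC_FETCH_OR(w, uint16_t)
TOY_GEN_ATOMIC_FETCH_OR(l, uint32_t)

/*
 * lock bts on a 32-bit operand: set the bit atomically and return its
 * previous value, which the guest would see in the carry flag.
 */
static int toy_lock_bts32(uint32_t *haddr, unsigned bit)
{
    assert(bit < 32);
    uint32_t mask = (uint32_t)1 << bit;
    uint32_t old = toy_atomic_fetch_orl(haddr, mask);
    return (old & mask) != 0;
}
```

Fetch-or fits bts because the instruction needs the old bit value and the updated memory from one indivisible step, and the builtin returns exactly the pre-update value.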

2020.5.28

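The next step is to see whether helper_atomic_fetch_or can be used directly. It does use a TCG-related type, TCGMemOpIdx, but that is just two immediates packed together: TCGMemOp and Idx. For TCGMemOp I can write a simple decoder myself, and Idx is already used in the helper_softmmu_load/store functions written by 牛根 (Niu Gen). The open question is why Niu Gen rewrote helper_softmmu_load/store instead of reusing qemu's version directly. Is there a pitfall? Perhaps it is because the MMU operations are all functions with the _mmu suffix, there are too many of them, and gathering them into one function is more convenient?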
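A simple decoder of this kind could look like the sketch below. The layout used here (memop in the upper bits, a 4-bit MMU index in the low bits, and the low two bits of the memop holding log2 of the access size) is how I remember QEMU's make_memop_idx/get_memop/get_mmuidx of that period, so treat it as an assumption to check against the tree in use. The `toy_` names are my own.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyMemOpIdx;

/* Pack a memop and an MMU index into one immediate (assumed layout). */
static ToyMemOpIdx toy_make_memop_idx(unsigned memop, unsigned mmu_idx)
{
    assert(mmu_idx <= 15);          /* the index must fit in 4 bits */
    return (memop << 4) | mmu_idx;
}

static unsigned toy_get_memop(ToyMemOpIdx oi)
{
    return oi >> 4;
}

static unsigned toy_get_mmuidx(ToyMemOpIdx oi)
{
    return oi & 15;
}

/* Access size in bytes, assuming the low two memop bits are log2(size). */
static unsigned toy_memop_size_bytes(unsigned memop)
{
    return 1u << (memop & 3);
}
```

For example, packing memop 2 with MMU index 1 gives 0x21, which decodes back to memop 2 (a 4-byte access under the assumed layout) and index 1.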