LLBT Survey

LLBT: An LLVM-based Static Binary Translator@2012

ABSTRACT

Keywords: Static binary translation, compiler, retargeting, intermediate representation
INTRODUCTION

Why do the code discovery and code location problems exist? And why does a DBT solve both of them so easily?

The paper answers:
- Code discovery
  
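  In an ELF file, data and instructions are mixed together. With a variable-length instruction set, once data is decoded as an instruction, the instructions that follow will most likely all be decoded wrongly. With a fixed-length, aligned instruction set, static translation still translates data as if it were code, but this causes no problem at execution time.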
- Code location
  
  The target address of an indirect branch must be mapped correctly to the address of the corresponding translated code.
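What distinguishes LLBT's solution to the code location problem is that it needs no interpreter or emulator to handle indirect branches. In the paper's words: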

LLBT effectively solves the code location problem. It does not need an interpreter or emulator for handling indirect branches.
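
The code discovery problem described above can be illustrated with a small sketch. The three-opcode variable-length instruction set below is entirely hypothetical (it is not ARM, and it is not how LLBT decodes anything); it only shows how a linear sweep loses synchronization once it decodes a data byte as an opcode, while following control flow from the entry point does not.

```python
# Toy variable-length ISA, hypothetical and for illustration only.
# opcode -> instruction length in bytes
LENGTHS = {0x01: 1,   # NOP
           0x02: 2,   # LOADI imm8
           0x03: 2}   # JMP addr8
JMP = 0x03


def linear_sweep(code):
    """Decode byte after byte, treating everything as instructions."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += LENGTHS.get(code[pc], 1)  # unknown opcodes count as 1 byte
    return starts


def follow_control_flow(code, entry=0):
    """Decode only what is reachable from the entry point."""
    starts, seen, pc = [], set(), entry
    while pc < len(code) and pc not in seen:
        seen.add(pc)
        starts.append(pc)
        opcode = code[pc]
        if opcode == JMP:
            pc = code[pc + 1]            # jump to the absolute target
        else:
            pc += LENGTHS.get(opcode, 1)
    return starts


# 0: JMP 3 | 2: one data byte that looks like LOADI | 3: LOADI 7 | 5: NOP
PROGRAM = bytes([0x03, 0x03, 0x02, 0x02, 0x07, 0x01])

print(linear_sweep(PROGRAM))         # instruction starts found by the sweep
print(follow_control_flow(PROGRAM))  # instruction starts actually executed
```

The sweep treats the data byte at offset 2 as a two-byte LOADI, so it swallows the first byte of the real instruction at offset 3 and reports a bogus instruction at offset 4. Following control flow works here only because the jump target is a constant; an indirect branch would not reveal its target statically, which is where the code location problem comes in.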
RELATED WORK
1. Static binary translators
2. Dynamic binary translators
3. LLVM
LLBT OVERVIEW
IMPLEMENTATION
1. Code Discovery
  
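  The binary may use both the ARM and the Thumb instruction sets. Ideally the instruction set of a piece of code can be determined; when it cannot, the code is translated both as ARM and as Thumb.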
2. Register Mapping
3. Instruction Translation
4. Handling Indirect Branches
  
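  Build a table that maps source instruction addresses to target instruction addresses. Whenever an indirect branch is taken, the source address is looked up in this table to find the address of the target instruction. A naive table of this kind therefore takes up a lot of space.
  
  The paper still uses a table: the possible branch target addresses are collected into the mapping table. Is there a way to do without the table and instead compute the possible indirect branch targets by analyzing program behavior? Can that be proven possible or shown to be impossible? For example, what about a jump to the memory address given by the current time, or a jump to a value entered by the user? Do such cases disprove it?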
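
A minimal sketch of the address mapping table idea follows. All addresses, the `bb_...` label format, and the function names are made up for illustration; the paper's actual table layout is not reproduced here. The sketch contrasts a naive table with one entry per source instruction against a table restricted to possible indirect branch targets.

```python
def build_full_map(instr_addrs):
    """Naive table: one entry for every source instruction address."""
    return {addr: f"bb_{addr:x}" for addr in instr_addrs}


def build_pruned_map(instr_addrs, possible_targets):
    """Smaller table: keep only addresses that may be indirect branch targets."""
    return {addr: f"bb_{addr:x}" for addr in instr_addrs
            if addr in possible_targets}


def dispatch_indirect(table, source_target):
    """Map a source address computed at run time to translated code."""
    try:
        return table[source_target]
    except KeyError:
        raise LookupError(
            f"no translated code for source address {source_target:#x}")


# Hypothetical program: 4-byte instructions from 0x8000 up to 0x803c.
INSTR_ADDRS = list(range(0x8000, 0x8040, 4))
# Hypothetical result of some static analysis (e.g. function entries).
POSSIBLE_TARGETS = {0x8000, 0x8010, 0x8024}

full_map = build_full_map(INSTR_ADDRS)
pruned_map = build_pruned_map(INSTR_ADDRS, POSSIBLE_TARGETS)

print(len(full_map), len(pruned_map))
print(dispatch_indirect(pruned_map, 0x8010))
```

The pruned table is only safe if the set of possible targets is complete. A branch to an address derived from user input or the clock, as asked above, is exactly the case where the lookup could raise.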
5. Jump Table Recovery
6. PC-relative Data Inlining
7. Helper Function Replacement
  
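  The source binary has to call external library functions (called helper functions here), for example when the source machine has no floating-point unit, even though the intermediate representation can express what these functions do directly.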
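
To make helper function replacement concrete, here is a rough sketch that emits LLVM IR text from Python. The helper names are the ARM EABI soft-float routines; the register value names (`%r0`, `%r1`, `%r0.new`), the function `translate_call`, and the exact instruction sequence are assumptions made for illustration, not LLBT's actual output.

```python
# ARM EABI soft-float helpers and the LLVM IR instruction that
# expresses the same operation on single-precision values.
HELPER_TO_IR_OP = {
    "__aeabi_fadd": "fadd",
    "__aeabi_fsub": "fsub",
    "__aeabi_fmul": "fmul",
    "__aeabi_fdiv": "fdiv",
}


def translate_call(callee, lhs="%r0", rhs="%r1", dst="%r0.new"):
    """Return LLVM IR lines for a two-argument call to `callee`.

    Known helpers become a native floating-point instruction;
    anything else stays an ordinary call (hypothetical signature).
    """
    op = HELPER_TO_IR_OP.get(callee)
    if op is None:
        return [f"{dst} = call i32 @{callee}(i32 {lhs}, i32 {rhs})"]
    return [
        # The helper receives float bits in integer registers.
        f"%a = bitcast i32 {lhs} to float",
        f"%b = bitcast i32 {rhs} to float",
        f"%res = {op} float %a, %b",
        f"{dst} = bitcast float %res to i32",
    ]


for line in translate_call("__aeabi_fadd"):
    print(line)
```

Once the call is replaced by a plain `fadd`, a target with hardware floating point can execute it directly and the LLVM optimizer can see through the operation instead of treating it as an opaque call.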
EXPERIMENTAL RESULTS

Benchmark: EEMBC and CoreMark
1. SBT vs. DBT
2. Startup Time
3. Address Mapping Table
CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES

A Retargetable Static Binary Translator for the ARM Architecture@2014

INTRODUCTION
LLBT(AN LLVM-BASED STATIC BINARY TRANSLATOR)
1. Overview
2. Register Mapping
3. Instruction Translation
4. Handling Indirect Branches
5. Helper Function Replacement
6. NEW ! Debugging Support and Verification
  
  What is LLVM metadata?
CODE DISCOVERY FOR ARM/THUMB INTERWORKING BINARIES
1. ARM/Thumb Region Identification
2. Discrimination between Data and Instructions
  1. PC-relative Data
  2. Jump Tables
DETAILED ! EXPERIMENTAL RESULTS
1. Binary Translation vs. Native Compilation
2. SBT vs. DBT
3. LLVM Optimization Analysis
  
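  **The three caveats listed under Instruction Translation do not seem very different from how a virtual machine translates code, and the result here supports that impression to some extent.** Code that has not gone through LLVM optimization runs slightly slower than QEMU. So my guess is that the automatic validation tool works on the unoptimized IR!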
4. Start-up Time
5. Space Overhead from Address Mapping Tables
6. Code Size Measurement and Memory Overhead
7. Translation Time
RELATED WORK
1. Static Binary Translators
2. Dynamic Binary Translators
3. Static Binary Translators for Embedded Systems
4. Dynamic Binary Translators for Embedded Systems
CONCLUSIONS

Automatic Validation for Static Binary Translation@2013

Introduction
Background
1. Overview of QEMU
2. Overview of LLBT
Challenges of Automatic Validation
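- The heap and the stack are placed at different locations.
- Registers have to be checked. If a register holds an address, the two values may differ, so the content at that address must be validated instead, and that content may itself be an address.
- My idea: build an executor for LLVM IR and validate it against the original program, leaving the re-layout of the heap and stack to LLVM opt.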
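
The register-checking challenge above can be sketched as a recursive comparison. Everything here is a simplification I am assuming for illustration: memory is a dictionary, a value counts as an address exactly when it is a key of that dictionary, and the recursion depth is capped. The validator in the paper is not claimed to work this way.

```python
def equivalent(val_a, mem_a, val_b, mem_b, depth=4):
    """Compare two values, following them through memory if they are addresses."""
    is_ptr_a = val_a in mem_a  # assumption: dictionary keys are the valid addresses
    is_ptr_b = val_b in mem_b
    if is_ptr_a != is_ptr_b:
        return False           # one side is an address, the other is not
    if not is_ptr_a:
        return val_a == val_b  # plain data must match exactly
    if depth == 0:
        return True            # assumption: stop following long pointer chains
    # Addresses may differ; what they point to must be equivalent.
    return equivalent(mem_a[val_a], mem_a, mem_b[val_b], mem_b, depth - 1)


def mismatched_registers(regs_a, mem_a, regs_b, mem_b):
    """Names of registers whose contents are not equivalent."""
    return [name for name in regs_a
            if not equivalent(regs_a[name], mem_a, regs_b[name], mem_b)]


# Hypothetical snapshots: same data, but the stack lives at different addresses.
GUEST_REGS = {"r0": 42, "sp": 0x7000}
GUEST_MEM = {0x7000: 0x7004, 0x7004: 9}
HOST_REGS = {"r0": 42, "sp": 0x9000}
HOST_MEM = {0x9000: 0x9004, 0x9004: 9}

print(mismatched_registers(GUEST_REGS, GUEST_MEM, HOST_REGS, HOST_MEM))
```

Telling addresses apart from plain integers is the hard part in practice, and this toy sidesteps it. That may be why the paper's first design step is to allocate identical virtual memory for both runs.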
Design and Implementation
1. Allocating identical virtual memory
2. Performance of validation
3. Coarse instruction
4. Quick validation
Experimental evaluation
1. Bugs in LLBT discovered by the validator
2. The number of times instrumentation code is executed
3. Execution time
Conclusion

Introduction to LLVM

A Quick Introduction to Classical Compiler Design
Existing Language Implementations
LLVM's Code Representation: LLVM IR
1. Writing an LLVM IR Optimization
LLVM's Implementation of Three-Phase Design
1. LLVM IR is a Complete Code Representation
2. LLVM is a Collection of Libraries
Design of the Retargetable LLVM Code Generator
1. LLVM Target Description Files
Interesting Capabilities Provided by a Modular Design
Retrospective and Future Directions

Introduction to LLVM IR

The primary source is the LLVM Language Reference Manual, whose introduction states that the following three forms are equivalent:

In-memory compiler IR
on-disk bitcode representation
human readable assembly language

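The manual itself describes the third form, its representation and notation.

Side note: the IR in the LLBT architecture diagram refers to LLBT's own IR, which both the 2012 and the 2014 papers confirm ("internal IR" is in bold):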

An ARM input binary is disassembled to an assembly file, and then an IR converter translates these ARM assembly instructions into LLBT’s **internal IR**. Some analysis and optimization passes, such as identifying PC-relative data and recovering jump tables, will be performed by LLBT on its **internal IR** before generating the corresponding LLVM instructions.

It is the LLVM assembly in the LLBT architecture diagram that corresponds to LLVM IR, as the LLVM introduction in the previous section confirms:

Beyond being implemented as a language, LLVM IR is actually defined in three isomorphic forms: the textual format above, an in-memory data structure inspected and modified by optimizations themselves, and an efficient and dense on-disk binary "bitcode" format. The LLVM Project also provides tools to convert the on-disk format from text to binary: llvm-as assembles the textual .ll file into a .bc file containing the bitcode goop and llvm-dis turns a .bc file into a .ll file.

clang -emit-llvm -S produces textual LLVM IR (a .ll file); clang -emit-llvm -c produces LLVM bitcode (a .bc file).

xieby1's notes