简介

这篇文章提出了一个二进制分析框架,并实现了许多现有的分析技术。通过将这些技术系统化地实现,可以让其他研究人员直接利用并开发新的技术。此外,在统一框架中实现这些技术可以更直接地进行比较,并确定各自的优缺点。

自动化二进制分析

为了保持程序分析的可行性,往往需要在可重现性和语义理解两个方面需要进行权衡:

为了得到可重现的输入或者语义理解的能力,就需要对分析技术进行权衡。例如,高可重现性往往和低覆盖率相关,因为要想生成可重现的输入必须知道如何到达任何想要分析的代码,那么它将不能分析尽可能多的代码。另一方面,如果不能通过重现来验证漏洞,那么会产生高误报率(即并不存在漏洞)。在缺乏可重现性的情况下,这些误报必须通过启发式的方法进行过滤,反过来又会引入高漏报率。同样的,为了实现语义理解,必须存储和处理大量的数据。例如,具有语义理解能力的动态分析必须保存下程序分支的条件,而具有语义理解能力的静态分析需要适时地调整数据域。但由于系统资源有限,在分析中必须做出取舍。

下面是一个例子,可以对不同分析技术的能力有个简单的认识:

对于静态分析,它可能会将全部 3 个 memcpy 调用都标记为漏洞(即使 16 行的调用其实是安全的),因为静态分析没有足够的信息来确定漏洞是否真的会发生。另外,静态分析可以得到漏洞的地点,但不能得到触发漏洞的输入。对于动态分析(例如 fuzzing),它通过制造输入来触发漏洞,所以通常有很大可能会漏掉需要精确输入的漏洞,例如第 10 行的漏洞。动态符号执行能够检测出第 10 行的错误并通过约束求解得到输入,也能判断出第 16 行没有漏洞,但是它很可能会漏掉第 30 行,因为有多个潜在的路径不会触发该漏洞。另外,在符号执行进行到循环时,可能存在路径爆炸的问题。

静态漏洞挖掘

Static analyses can be split into two paradigms: those that model program properties as graphs and those that model the data itself.

控制流图恢复

CFG recovery is implemented as a recursive algorithm that disassembles and analyzes a basic block, identifies its possible exits and adds them to the CFG, and then repeats the analysis recursively until no new exits are identified.

CFG recovery has one fundamental challenge: indirect jumps. Specifically, indirect jumps fall into several categories:

The goal of CFG recovery is to resolve the targets of as many of these indirect jumps as possible, in order to create a CFG. Depending on how well jump targets are resolved, the CFG recovery analysis has two properties:

值集分析

At a high level, VSA attempts to identify a tight over-approximation of the program state at any given point in the program. This can be used to understand the possible targets of indirect jumps or the possible targets of memory write operations.

动态漏洞挖掘

Dynamic techniques here are split into two main categories: concrete and symbolic execution.

动态具体执行

The most relevant application of dynamic concrete execution to vulnerability discovery is fuzzing.

动态符号执行

Dynamic symbolic execution executes a program in an emulated environment with an abstract domain of symbolic variables. They track the state of registers and memory throughout program execution and the constraints on those variables. Whenever a conditional branch is reached, execution forks and follows both paths, saving the branch condition as a constraint on the path in which the branch was taken and the inverse of the branch condition as a constraint on the path in which the branch was not taken.

angr 分析引擎

设计目标

子模块:Intermediate Representation

We leveraged libVEX, the IR lifter of the Valgrind project. libVEX produces an IR, called VEX, that is specifically designed for program analysis. We used PyVEX to expose the VEX IR to Python.

子模块:Binary Loading

The task of loading an application binary into the analysis system is handled by a module called CLE. CLE abstracts over different binary formats to handle loading a given binary and any libraries that it depends on, resolving dynamic symbols, performing relocations, and properly initializing the program state.

子模块:Program State Representation/Modification

The SimuVEX module is responsible for representing the program state. The state, named SimState in SimuVEX terms, is implemented as a collection of state plugins, which are controlled by state options specified by the user or analysis when the state is created.

子模块:Data Model

Claripy abstracts all values to an internal representation of an expression that tracks all operations in which it is used. These expressions are represented as “expression trees” with values being the leaf nodes and operations being non-leaf nodes.

At any point, an expression can be translated into data domains provided by Claripy’s backends. User-facing operations, such as interpreting the constructs provided by the backends into Python primitives are provided by frontends. A frontend augments a backend with additional functionality of varying complexity.

子模块:Full-Program Analysis

Project is the analyst-facing part of angr, which provides complete analyses, such as dynamic symbolic execution and controlflow graph recovery.

实现:数据流图恢复

假设

angr’s CFGAccurate makes several assumptions about binaries to optimize the run time of the algorithm.

迭代生成 CFG

Throughout CFG recovery, CFGAccurate maintains a list of indirect jumps, Lj, whose jump targets have not been resolved. When the analysis identifies such a jump, it is added to Lj. After each iterative technique terminates, CFGAccurate triggers the next one in the list. This next technique may resolve jumps in Lj, may add new unresolved jumps to Lj, and may add basic blocks and edges to the CFG C. CFGAccurate terminates when a run of all techniques results in no change to Lj or C, as that means that no further indirect jumps can be resolved with any available analysis.

The goal of the fast CFG generation algorithm is to generate a graph, with high code coverage, that identifies at least the location and content of functions in the binary.

实现:值集分析

Value-Set Analysis (VSA) is a static analysis technique that combines numeric analysis and pointer analysis for binary programs. It uses an abstract domain, called the Value-Set Abstract domain, for approximating possible values that registers or abstract locations may hold at each program point.

The main interface that angr provides into a full-program VSA analysis is the Value Flow Graph. The VFG is an enhanced CFG that includes the program state representing the VSA fix-point at each program location.

实现:动态符号执行

The dynamic symbolic execution module of our analysis platform is mainly based on the techniques described in Mayhem. Our implementation follows the same memory model and path prioritization techniques.

We use Claripy’s interface into Z3 to populate the symbolic memory model (specifically, SimSymbolicMemory) provided by SimuVEX. Individual execution paths through a program are managed by Path objects, provided by angr, which track the actions taken by paths, the path predicates, and various other path-specific information. Groups of these paths are managed by angr’s PathGroup functionality, which provides an interface for managing the splitting, merging, and filtering of paths during dynamic symbolic execution.

angr has built-in support for Veritesting, implementing it as a Veritesting analysis and exposing transparent support for it with an option passed to PathGroup objects.

实现:非约束的符号执行

We implemented under-constrained symbolic execution (UCSE), as proposed in UC-KLEE, and dubbed it UC-angr. UCSE is a dynamic symbolic execution technique where execution is performed on each function separately.

We made two changes to the technique described in UCSE:

实现:符号辅助的 fuzzing

Our implementation of symbolic-assisted fuzzing, called Driller, uses the AFL fuzzer as its foundation and angr as its symbolic tracer.

实现:崩溃重现

We implemented the approach proposed by Replayer to recover missing relationships between input values and output values.

We can define the problem of replaying a crashing input as the search for an input specification is to bring a program from an initial state s to a crash state q. Our implementation symbolically executes the path from sa to qa, using the input ia. It records all constraints that are generated while executing P. Given the constraints, the execution path, the program P, and the new initial state sb, we can symbolically execute P with an unconstrained symbolic input, following the previously recorded execution path until the new crash state qb is reached. At this point, the input constraints on the input and output can be analyzed, and relationships between them can be recovered. This relationship data is used to generate the input specification is, allowing the crashing input to be replayed.

实现:利用生成

we generate exploits by performing concolic execution on crashing program inputs using angr. We drive concolic execution forward, forcing it to follow the same path as a dynamic trace gathered by concretely executing the crashing input applied to the program. Concolic execution is stopped at the point where the program crashed, and we inspect the symbolic state to determine the cause of the crash and measure exploitability. By counting the number of symbolic bits in certain registers, we can triage a crash into a number of categories such as frame pointer overwrite, instruction pointer overwrite, or arbitrary write, among others.

实现:利用强化

To harden exploits against modern mitigation techniques, we implemented a ROP chain compiler based on the ideas in Q.

比较评估