An Intro to Compilers

How to Speak to Computers, Pre-Siri

没有siri的那些年,我们如何实现人机对话?

A compiler is just a program that translates other programs. Traditional compilers translate source code into executable machine code that your computer understands. (Some compilers translate source code into another programming language. These compilers are called source-to-source translators or transpilers.) LLVM is a widely used compiler project, consisting of many modular compiler tools.
编译器不过就是一个翻译其它程序的程序。传统的编译器将源代码转换成计算机可理解的可执行的机器代码。(一些编译器将源代码转换为另一种编程语言,这些编译器被称为源到源转换器或转译器)。 LLVM是一个广泛使用的编译器项目,包括多个模块化的编译器工具。

Traditional compiler design comprises three parts:
传统的编译器设计包括三个部分:

  • The Frontend translates source code into an intermediate representation (IR)*. clang is LLVM’s frontend for the C family of languages.
  • 前端将源代码转换成一种中间表示(IR)。 clang 是LLVM项目中C类语言的前端工具。
  • The Optimizer analyzes the IR and translates it into a more efficient form. opt is the LLVM optimizer tool.
  • 优化器解析IR并将其转换成一种更高效的形式。 opt是LLVM项目的优化器工具。
  • The Backend generates machine code by mapping the IR to the target hardware instruction set. llc is the LLVM backend tool.
  • 后端通过将IR映射到目标硬件指令集上来生成机器代码。 llc是LLVM项目的后端工具。

* LLVM IR is a low-level language that is similar to assembly. However, it abstracts away hardware-specific information.

LLVM IR 是一种类似汇编的低级语言。但是,它不针对特定的硬件信息编程。

Hello, Compiler

你好,编译器

Below is a simple C program that prints “Hello, Compiler!” to stdout. The C syntax is human-readable, but my computer wouldn’t know what to do with it. I’m going to walk through the three compilation phases to make this program machine-executable.

下面是一个简单的打印“Hello,Compiler”字符串的C语言程序。虽然程序员可以读懂C语言语法,但是计算机却看的一脸懵逼。接下来我要过一遍编译的三个阶段,以便将以下程序转换成机器可执行的程序。

// compile_me.c
// Wave to the compiler. The world can wait.

#include <stdio.h>

int main() {
  printf("Hello, Compiler!\n");
  return 0;
}

The Frontend

前端

As I mentioned above, clang is LLVM’s frontend for the C family of languages. Clang consists of a C preprocessor, lexer, parser, semantic analyzer, and IR generator.
前文讲到,clang 是 LLVM C 类语言的前端工具。Clang 由一个 C 预处理器、词法分析器(lexer)、解析器、语义分析器和中间表示生成器组成。

  • The C Preprocessor modifies the source code before beginning the translation to IR. The preprocessor handles including external files, like #include <stdio.h> above. It will replace that line with the entire contents of the stdio.h C standard library file, which will include the declaration of the printf function.
    See the output of the preprocessor step by running:
  • C 预处理器在源代码转换成IR 之前对其进行修改。预处理器会将外部文件包含进来,比如上面的#include <stdio.h>。它会用 C 标准库文件 stdio.h 的所有代码替换 #include <stdio.h> 这一行,stdio.h 头文件包含了 printf 函数的声明。通过执行以下命令观察预处理器的输出:
clang -E compile_me.c -o preprocessed.i
  • The Lexer (or scanner or tokenizer) converts a string of characters to a string of words. Each word, or token, is assigned to one of five syntactic categories: punctuation, keyword, identifier, literal, or comment.
    Tokenization of compile_me.c
  • 词法分析器(Lexer,也叫 scanner 或 tokenizer)将一串字符转换成一串词。每个词或符号,按其属性被分配到对应的句法类别:标点符号、关键词、标识符、常量或注释。

compile_me.c的词法分析:

  • The Parser determines whether or not the stream of words consists of valid sentences in the source language. After analyzing the grammar of the token stream, it outputs an abstract syntax tree (AST). Nodes in a Clang AST represent declarations, statements, and types.
    The AST of compile_me.c
  • 解析器判定由词法分析器生成的一串词是否包含源语言中的有效语句。在分析完词的语法以后,解析器输出了一个抽象语法树(AST)。Clang AST 中的节点分别表示声明与类型(declarations, statement均为声明变量或函数用,细节不作展开)

compile_me.c 的 AST:

  • The Semantic Analyzer traverses the AST, determining if code sentences have valid meaning. This phase checks for type errors. If the main function in compile_me.c returned "zero" instead of 0, the semantic analyzer would throw an error because "zero" is not of type int.
  • 语义分析器遍历 AST,判定语句的涵义是否有效。这个阶段会检查类型错误。如果 compile_me.c 中的 main 函数返回了 "zero" 而不是 0, 语义分析器就会抛出一个错误,因为 "zero" 不是 int 类型。
  • The IR Generator translates the AST to IR.
    Run the clang frontend on compile_me.c to generate LLVM IR:
  • IR生成器将 AST 转换为 IR。

在compile_me.c上运行clang前端,生成LLVM IR:

clang -S -emit-llvm -o llvm_ir.ll compile_me.c

The main function in llvm_ir.ll
llvm_ir.ll中的main函数:

; llvm_ir.ll

@.str = private unnamed_addr constant [18 x i8] c"Hello, Compiler!\0A\00", align 1

define i32 @main() {
  %1 = alloca i32, align 4 ; <- memory allocated on the stack
  store i32 0, i32* %1, align 4
  %2 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([18 x i8], [18 x i8]* @.str, i32 0, i32 0))
  ret i32 0
}

declare i32 @printf(i8*, ...)

 

The Optimizer

优化器

The job of the optimizer is to improve code efficiency based on its understanding of the program’s runtime behavior. The optimizer takes IR as input and produces improved IR as output. LLVM’s optimizer tool, opt, will optimize for processor speed with the flag -O2 (capital o, two) and for size with the flag -Os (capital o, s).

优化器的任务是基于对程序运行时行为的理解,提升代码的效率。优化器的输入为 IR,输出为优化后的 IR。LLVM 的优化器工具opt将使用 -O2 (大写字母 o,数字2)标记优化处理器速度,使用-Os (大写字母 o,s)标记优化生成目标的大小。

Take a look at the difference between the LLVM IR code our frontend generated above and the result of running:
看一下优化器优化之前的 LLVM IR 代码和优化后的代码:

opt -O2 -S llvm_ir.ll -o optimized.ll

The main function in optimized.ll
optimized.ll的main函数:

; optimized.ll

@str = private unnamed_addr constant [17 x i8] c"Hello, Compiler!\00"

define i32 @main() {
  %puts = tail call i32 @puts(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @str, i64 0, i64 0))
  ret i32 0
}

declare i32 @puts(i8* nocapture readonly)

In the optimized version, main doesn’t allocate memory on the stack, since it doesn’t use any memory. The optimized code also calls puts instead of printf because none of printf’s formatting functionality was used.
Of course, the optimizer does more than just know when to use puts in lieu of printf. The optimizer also unrolls loops and inlines the results of simple calculations. Consider the program below, which adds two integers and prints the result.

优化后,main 函数没有在栈上分配内存,因为它没有使用任何内存。优化后的代码调用了 puts函数而不是 printf函数,因为它没有使用 printf 函数的任何格式化功能。当然了,优化器不仅仅知道什么时候该用 puts 代替 printf。 优化器也会展开循环,内联简单计算的结果。思考以下代码,它将两个数加起来并打印结果:

// add.c
#include <stdio.h>

int main() {
  int a = 5, b = 10, c = a + b;
  printf("%i + %i = %i\n", a, b, c);
}

Here is the unoptimized LLVM IR:
未优化的LLVM IR:

@.str = private unnamed_addr constant [14 x i8] c"%i + %i = %i\0A\00", align 1

define i32 @main() {
  %1 = alloca i32, align 4 ; <- allocate stack space for var a
  %2 = alloca i32, align 4 ; <- allocate stack space for var b
  %3 = alloca i32, align 4 ; <- allocate stack space for var c
  store i32 5, i32* %1, align 4  ; <- store 5 at memory location %1
  store i32 10, i32* %2, align 4 ; <- store 10 at memory location %2
  %4 = load i32, i32* %1, align 4 ; <- load the value at memory address %1 into register %4
  %5 = load i32, i32* %2, align 4 ; <- load the value at memory address %2 into register %5
  %6 = add nsw i32 %4, %5 ; <- add the values in registers %4 and %5. put the result in register %6
  store i32 %6, i32* %3, align 4 ; <- put the value of register %6 into memory address %3
  %7 = load i32, i32* %1, align 4 ; <- load the value at memory address %1 into register %7
  %8 = load i32, i32* %2, align 4 ; <- load the value at memory address %2 into register %8
  %9 = load i32, i32* %3, align 4 ; <- load the value at memory address %3 into register %9
  %10 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str, i32 0, i32 0), i32 %7, i32 %8, i32 %9)
  ret i32 0
}

declare i32 @printf(i8*, ...)

Here is the optimized LLVM IR:
优化后的LLVM IR:

@.str = private unnamed_addr constant [14 x i8] c"%i + %i = %i\0A\00", align 1

define i32 @main() {
  %1 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str, i64 0, i64 0), i32 5, i32 10, i32 15)
  ret i32 0
}

declare i32 @printf(i8* nocapture readonly, ...)

Our optimized main function is essentially lines 17 and 18 of the unoptimized version, with the variable values inlined. opt calculated the addition because all of the variables were constant. Pretty cool, huh?

优化后的 main 函数实际上就是在未优化版本的 17 和 18 行将变量进行内联。opt 对加法进行运算,因为所有的变量都是常量。很酷吧?

The Backend

后端

LLVM’s backend tool is llc. It generates machine code from LLVM IR input in three phases:
LLVM 的后端工具是 llc。它经历了三个阶段,最终把 LLVM IR 输入转化生成机器代码:

  • Instruction selection is the mapping of IR instructions to the instruction-set of the target machine. This step uses an infinite namespace of virtual registers.
  • 指令选取(instruction selection)是从 IR 指令到目标机器指令集的映射。这一步使用了虚拟寄存器一个无限的命名空间。
  • Register allocation is the mapping of virtual registers to actual registers on your target architecture. My CPU has an x86 architecture, which is limited to 16 registers. However, the compiler will use as few registers as possible.
  • 寄存器分配(register allocation)是从虚拟寄存器到目标架构真实寄存器的映射。我的 CPU 是 x86 架构的,也就是说只能使用 16 个寄存器。但是,编译器会尽可能少地使用寄存器。
  • Instruction scheduling is the reordering of operations to reflect the target machine’s performance constraints.
  • 指令调度(instruction scheduling)是对操作的重新安排,它反映了目标机器上的性能限制。

Running this command will produce some machine code!
执行以下命令将生成部分机器代码!

llc -o compiled-assembly.s optimized.ll

 

_main:
    pushq    %rbp
    movq    %rsp, %rbp
    leaq    L_str(%rip), %rdi
    callq    _puts
    xorl    %eax, %eax
    popq    %rbp
    retq
L_str:
    .asciz    "Hello, Compiler!"

This program is x86 assembly language, which is the human readable syntax for the language my computer speaks. Someone finally understands me
这是一个 x86 汇编语言程序,是计算机和程序员共通的语言。看似晦涩,但肯定有人懂我。

Resources
相关资源

  1. Engineering a compiler
  2. Getting Started with LLVM Core Libraries
Share this to:

发表评论

电子邮件地址不会被公开。 必填项已用*标注