Releasing the Python GIL with pybind11 - barriery's World

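Because of the GIL (Global Interpreter Lock), multithreaded Python code often ends up with one core struggling while the others look on, and for CPU-bound workloads Python threads are close to unusable. This post shows one way to release the GIL.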

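Python's GIL

For historical reasons (work on Python began in 1989, when programs ran on single-core machines), the CPython interpreter has a global lock, the Global Interpreter Lock (GIL). All threads in a process share this lock, so only one thread executes Python bytecode at any given moment.

The GIL leaves Python's multithreading largely ineffective for CPU-bound work.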
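The effect is easy to reproduce without any extension module. Below is a minimal sketch, assuming a standard CPython build with the GIL enabled; the names `count_down`, `run_sequential`, and `run_threaded` are illustrative and do not come from the rest of this post. On such a build the two timings are expected to be close to each other, because the threads take turns instead of running side by side.

```python
import threading
import time


def count_down(n):
    """CPU-bound work written in pure Python (no I/O, no C extension)."""
    while n > 0:
        n -= 1


def run_sequential(jobs, n):
    """Run the work `jobs` times on the calling thread; return wall seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(jobs, n):
    """Run the work on `jobs` threads at once; return wall seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(jobs)]
    start = time.perf_counter()
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000  # arbitrary size, chosen only to make the timing visible
    print(f"sequential x4: {run_sequential(4, n):.2f}s")
    print(f"4 threads:     {run_threaded(4, n):.2f}s")
```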

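Ways to work around the GIL

Use multiple processes instead of multiple threads
Use asynchronous programming (for I/O-bound workloads)
Write the performance-critical parts as an extension in another language (for example, a C++ extension built with pybind11)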
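For comparison, the first option might look like the sketch below. It uses the standard library's `concurrent.futures.ProcessPoolExecutor`; the task `sum_of_squares` and the helper `run_in_processes` are made-up stand-ins, not code from this post. Each worker process has its own interpreter and its own GIL, so the tasks can run on separate cores, at the cost of pickling arguments and results between processes.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(n):
    """CPU-bound stand-in task: sum of i*i for i in range(n)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_processes(inputs, workers):
    """Evaluate sum_of_squares for each input in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sum_of_squares, inputs))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this file.
    print(run_in_processes([1_000_000] * 4, workers=4))
```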

The rest of this post covers the last approach.

pybind11

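pybind11 offers a simple way to expose C++ code (C++11 or later) to Python. It is widely used in deep learning frameworks such as TensorFlow and PaddlePaddle.

See the official documentation for installation instructions.

When building the shared library, you may run into the following linker error: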

ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

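According to the referenced issue, the cause is that the Python library is not on the linker's search path. Adding -undefined dynamic_lookup to the compile command (or -Wl,-undefined,dynamic_lookup when passing it through to the linker) fixes it.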

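Building a module with pybind11

Below is a simple module written in C++. Building it with pybind11 produces a shared library file. Remember to turn off -O3 optimization: the loop body has no observable effect, so an optimizing compiler may remove the loops entirely and the benchmark would measure nothing.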

#include <pybind11/pybind11.h>
namespace py = pybind11;

// CPU-bound busy loop: num^3 iterations of throwaway arithmetic.
void loop(int num) {
  for (int i = 0; i < num; ++i) {
    for (int j = 0; j < num; ++j) {
      for (int k = 0; k < num; ++k) {
        double x = 1.0 * i * j * k;
      }
    }
  }
}

PYBIND11_MODULE(test_pybind, m) {
  m.def("loop", &loop);
}
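One way to put the build command together is sketched below. It mirrors the manual build line from the pybind11 documentation, with -O0 in place of -O3 and the macOS flag discussed earlier. The helper `build_command`, the source file name `test_pybind.cpp`, and the include path in the usage note are assumptions made for illustration; the pybind11 include directory can be obtained from `python -m pybind11 --includes` or `pybind11.get_include()`.

```python
import sys
import sysconfig


def build_command(source, module_name, pybind11_include, platform=sys.platform):
    """Assemble a compiler command line for a pybind11 extension module."""
    cmd = [
        "c++", "-O0", "-Wall", "-shared", "-std=c++11", "-fPIC",
        "-I" + sysconfig.get_paths()["include"],  # directory containing Python.h
        "-I" + pybind11_include,                  # directory containing pybind11/
    ]
    if platform == "darwin":
        # macOS: leave Python symbols unresolved until the module is imported.
        cmd += ["-undefined", "dynamic_lookup"]
    # The output name must carry the interpreter's extension suffix,
    # otherwise `import` will not find the module.
    ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")
    cmd += [source, "-o", module_name + ext_suffix]
    return cmd
```

Calling `build_command("test_pybind.cpp", "test_pybind", "/path/to/pybind11/include")` returns a list that can be handed to `subprocess.run`, or joined with spaces and pasted into a shell.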

On the Python side the module can be imported directly and called like this:

import test_pybind

test_pybind.loop(1000)

Releasing the GIL with pybind11

The following Python script is used to test multithreaded performance:

import threading
import sys
import test_pybind

if len(sys.argv) != 2:
    print('usage: python multi-thread.py thread_num')
    sys.exit(1)
thread_num = int(sys.argv[1])
threads = []
for i in range(thread_num):
    th = threading.Thread(target=test_pybind.loop, args=(1000, ))
    th.start()
    threads.append(th)

for i in range(thread_num):
    threads[i].join()

Code with GIL

A quick test of the loop function defined above, with one thread and with four:

» time python multi-thread.py 1
python multi-thread.py 1  3.16s user 0.05s system 99% cpu 3.214 total
» time python multi-thread.py 4
python multi-thread.py 4  12.57s user 0.09s system 99% cpu 12.699 total

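With four threads the program takes roughly four times as long as with one, and CPU utilization stays at about 100% throughout. That is the GIL at work.

By default the GIL stays held for the entire time a thread is inside the C++ function, so the other threads can only wait.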

Release GIL

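pybind11 provides the classes py::gil_scoped_release and py::gil_scoped_acquire to release and reacquire the GIL, respectively, inside the body of a C++ function. This lets multiple Python threads run C++ code in parallel. Code that runs while the GIL is released must not touch Python objects.

In the common case, the shorthand call_guard policy py::call_guard<py::gil_scoped_release>() can be used instead.

The loop bindings in the module can therefore be changed as follows: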

#include <pybind11/pybind11.h>
namespace py = pybind11;

void loop(int num) {
  for (int i = 0; i < num; ++i) {
    for (int j = 0; j < num; ++j) {
      for (int k = 0; k < num; ++k) {
        double x = 1.0 * i * j * k;
      }
    }
  }
}

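PYBIND11_MODULE(test_pybind, m) {
  // Default binding: the GIL stays held for the whole call.
  m.def("loop", &loop)
   // Same function, but the GIL is released while it runs.
   .def("loop_without_gil", &loop,
        py::call_guard<py::gil_scoped_release>());
}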

On the Python side, change the script to call loop_without_gil. Here is how loop_without_gil performs with one thread and with four:

» time python multi-thread.py 1
python multi-thread.py 1  3.19s user 0.04s system 99% cpu 3.245 total
» time python multi-thread.py 4
python multi-thread.py 4  12.53s user 0.07s system 390% cpu 3.225 total

The one-thread and four-thread runs take about the same wall-clock time, and CPU utilization in the four-thread run is close to 400%.
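The "cpu" column in the timing output can be checked by hand: it is the user plus system time divided by the wall-clock total. The short sketch below redoes that arithmetic for the four runs in this post. The helper `cpu_percent` is an illustrative name, and truncating to an integer is an assumption that happens to reproduce the printed figures.

```python
def cpu_percent(user, system, total):
    """CPU utilisation as a percentage: (user + system) / total, truncated."""
    return int((user + system) / total * 100)


# (user, system, total) in seconds, copied from the four runs in this post.
runs = {
    "loop, 1 thread": (3.16, 0.05, 3.214),
    "loop, 4 threads": (12.57, 0.09, 12.699),
    "loop_without_gil, 1 thread": (3.19, 0.04, 3.245),
    "loop_without_gil, 4 threads": (12.53, 0.07, 3.225),
}

for name, (user, system, total) in runs.items():
    print(f"{name}: {cpu_percent(user, system, total)}% cpu")

# Wall-clock speedup of the four-thread run once the GIL is released.
speedup = 12.699 / 3.225
print(f"speedup: {speedup:.2f}x")
```

The four-thread run does the same amount of CPU work in both cases (about 12.6 seconds). Releasing the GIL only changes how much of that work overlaps in time.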