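Performance Tips

This is a short guide to features present in Numba that can help with obtaining the best performance from code. Two examples are used; both are entirely contrived and exist purely for pedagogical reasons to motivate discussion. The first is the computation of the trigonometric identity cos(x)^2 + sin(x)^2, the second is a simple element-wise square root of a vector with a reduction over summation. All performance numbers are indicative only and, unless otherwise stated, were taken from running on an Intel i7-4790 CPU (4 hardware threads) with an input of np.arange(1.e7).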

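Note

A reasonably effective approach to achieving high performance code is to profile the code running with real data and use that to guide performance tuning. The information presented here is to demonstrate features, not to act as canonical guidance!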

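NoPython mode

The default mode in which Numba's @jit decorator operates is nopython mode. This mode is the most restrictive about what can be compiled, but it produces faster executable code.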

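Note

Historically (prior to version 0.59.0) the default compilation mode was a fall-back mode: the compiler would try to compile in nopython mode and, if that failed, fall back to object mode. You are likely to see @jit(nopython=True), or its alias @njit, in existing code and documentation, because this was the recommended best practice for forcing nopython mode. Since Numba 0.59.0 this is no longer necessary, as nopython mode is the default for @jit.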

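Loops

Whilst NumPy has developed a strong idiom around the use of vector operations, Numba is perfectly happy with loops too. For users familiar with C or Fortran, writing Python in this style will work fine in Numba (after all, LLVM gets a lot of use in compiling C lineage languages). For example: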

import numpy as np
from numba import njit

@njit
def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

@njit
def ident_loops(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

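The two functions above run at almost identical speeds when decorated with @njit. Without the decorator, the vectorized function is roughly 40 times faster than the loop version (0.581 s versus 25.2 s).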

Function Name	@njit	Execution time
`ident_np`	No	0.581 s
`ident_np`	Yes	0.659 s
`ident_loops`	No	25.2 s
`ident_loops`	Yes	0.670 s
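
Timings like those in the table can be gathered with a small helper. The sketch below is one possible approach, not the harness used to produce the figures above: `best_time` is a hypothetical helper name, and it assumes that the first call to a Numba-compiled function includes JIT compilation time, so one untimed warm-up call is made before measuring.

```python
import time

def best_time(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls.

    Hypothetical helper for illustration; not part of Numba.
    """
    # Warm-up call: for a function decorated with @njit this triggers
    # compilation, which should not count towards the measured run time.
    func(*args)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, best_time(ident_loops, np.arange(1.e7)) would time the compiled loop version with the input size used in this guide.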

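A case for Object mode: LoopLifting

Some functions may be incompatible with the restrictive nopython mode but contain compatible loops. You can allow Numba to attempt to compile these loops in nopython mode by setting @jit(forceobj=True). The incompatible sections of code will run in object mode.

Whilst the use of looplifting in object mode can provide some performance increase, compiling functions entirely in nopython mode is key to achieving optimal performance.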

Fastmath

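In certain classes of applications strict IEEE 754 compliance is less important. As a result it is possible to relax some numerical rigour with a view to gaining additional performance. The way to achieve this behaviour in Numba is through the use of the fastmath keyword argument: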

@njit(fastmath=False)
def do_sum(A):
    acc = 0.
    # without fastmath, this loop must accumulate in strict order
    for x in A:
        acc += np.sqrt(x)
    return acc

@njit(fastmath=True)
def do_sum_fast(A):
    acc = 0.
    # with fastmath, the reduction can be vectorized as floating point
    # reassociation is permitted.
    for x in A:
        acc += np.sqrt(x)
    return acc

Function Name	Execution time
`do_sum`	35.2 ms
`do_sum_fast`	17.8 ms

In some cases you may wish to opt in to only a subset of the possible fast-math optimizations. This can be done by supplying a set of LLVM fast-math flags to fastmath:

def add_assoc(x, y):
    return (x - y) + y

print(njit(fastmath=False)(add_assoc)(0, np.inf)) # nan
print(njit(fastmath=True) (add_assoc)(0, np.inf)) # 0.0
print(njit(fastmath={'reassoc', 'nsz'})(add_assoc)(0, np.inf)) # 0.0
print(njit(fastmath={'reassoc'})       (add_assoc)(0, np.inf)) # nan
print(njit(fastmath={'nsz'})           (add_assoc)(0, np.inf)) # nan

Parallel=True

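If code contains operations that are parallelisable (and supported), Numba can compile a version that will run in parallel on multiple native threads (no GIL!). This parallelisation is performed automatically and is enabled by simply adding the parallel keyword argument: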

@njit(parallel=True)
def ident_parallel(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

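The execution time is as follows:

Function Name	Execution time
`ident_parallel`	112 ms

With parallel=True, the function executes about 5x faster than the NumPy equivalent and about 6x faster than the standard @njit version.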

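Numba parallel execution also has support for explicit parallel loop declaration similar to that in OpenMP. To indicate that a loop should be executed in parallel the numba.prange function should be used. This function behaves like Python's range, and if parallel=True is not set it acts simply as an alias of range. Loops induced with prange can be used for embarrassingly parallel computation and also for reductions.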

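Revisiting the reduce over sum example, assuming it is safe for the sum to be accumulated out of order, the loop over the n elements can be parallelised through the use of prange. Further, the fastmath=True keyword argument can be added without concern in this case, as the assumption that out of order execution is valid has already been made through the use of parallel=True (each thread computes a partial sum).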

from numba import prange

@njit(parallel=True)
def do_sum_parallel(A):
    # each thread can accumulate its own partial sum, and then a cross
    # thread reduction is performed to obtain the result to return
    n = len(A)
    acc = 0.
    for i in prange(n):
        acc += np.sqrt(A[i])
    return acc

@njit(parallel=True, fastmath=True)
def do_sum_parallel_fast(A):
    n = len(A)
    acc = 0.
    for i in prange(n):
        acc += np.sqrt(A[i])
    return acc

Execution times are as follows; fastmath again improves performance.

Function Name	Execution time
`do_sum_parallel`	9.81 ms
`do_sum_parallel_fast`	5.37 ms

Intel SVML

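Intel provides a short vector math library (SVML) that contains a large number of optimised transcendental functions available for use as compiler intrinsics. If the intel-cmplr-lib-rt package is present in the environment (or the SVML libraries are simply locatable!) then Numba automatically configures the LLVM back end to use the SVML intrinsic functions wherever possible. SVML provides both high and low accuracy versions of each intrinsic, and the version that is used is determined by the fastmath keyword. The default is to use the high accuracy version, which is accurate to within 1 ULP; if fastmath is set to True, the lower accuracy versions of the intrinsics are used instead (answers to within 4 ULP).

First obtain SVML, using conda for example: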

conda install intel-cmplr-lib-rt

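Note

The SVML library was previously provided through the icc_rt conda package. The icc_rt package has since become a meta-package, and as of version 2021.1.1 it has intel-cmplr-lib-rt amongst other packages as a dependency. Installing the recommended intel-cmplr-lib-rt package directly results in fewer installed packages.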

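Rerunning the identity function example ident_np from above with various combinations of options to @njit, with and without SVML, yields the following performance results (input size np.arange(1.e8)). For reference, with just NumPy the function executed in 5.84 s: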

`@njit` kwargs	SVML	Execution time
`None`	No	5.95 s
`None`	Yes	2.26 s
`fastmath=True`	No	5.97 s
`fastmath=True`	Yes	1.8 s
`parallel=True`	No	1.36 s
`parallel=True`	Yes	0.624 s
`parallel=True, fastmath=True`	No	1.32 s
`parallel=True, fastmath=True`	Yes	0.576 s

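It is evident that SVML significantly increases the performance of this function. When SVML is not present the impact of fastmath is negligible (5.95 s versus 5.97 s, and 1.36 s versus 1.32 s with parallel=True). This is expected, as there is nothing in the original function that would benefit from relaxing numerical strictness.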

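Linear algebra

Numba supports most of numpy.linalg in nopython mode. The internal implementation relies on LAPACK and BLAS libraries to do the numerical work, and it obtains the bindings for the necessary functions from SciPy. Therefore, to achieve good performance in numpy.linalg functions with Numba it is necessary to use a SciPy built against a well optimised LAPACK/BLAS library. In the case of the Anaconda distribution, SciPy is built against Intel's MKL, which is highly optimised, and as a result Numba makes use of this performance.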