Notes on Numba Runtime

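The Numba Runtime (NRT) provides the language runtime for the nopython mode (NPM) Python subset. NRT is a standalone C library with a Python binding, which lets NPM runtime features be used without the GIL. Currently, the only language feature implemented in NRT is memory management.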

Memory Management

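NRT implements memory management for NPM code. It uses atomic reference counting for thread-safe, deterministic memory management. For each allocation, NRT maintains a separate MemInfo structure that stores information about that allocation.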
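The idea of a reference-counted allocation record can be sketched in a few lines of Python. This is only a toy model: the class name ToyMemInfo, its methods, and the use of a lock in place of atomic instructions are all assumptions made for illustration, not Numba's implementation.

```python
import threading


class ToyMemInfo:
    """Toy stand-in for an NRT MemInfo record (hypothetical, not Numba's API)."""

    def __init__(self, data, dtor):
        self.data = data
        self._dtor = dtor
        self._refcount = 1               # creating the record yields one reference
        self._lock = threading.Lock()    # stands in for atomic increment/decrement

    @property
    def refcount(self):
        return self._refcount

    def acquire(self):
        with self._lock:
            self._refcount += 1

    def release(self):
        with self._lock:
            self._refcount -= 1
            dead = self._refcount == 0
        if dead:
            # Deterministic: the destructor runs exactly when the count hits zero.
            self._dtor(self.data)
            self.data = None


freed = []
mi = ToyMemInfo(bytearray(16), dtor=lambda buf: freed.append(len(buf)))
mi.acquire()   # refcount is now 2
mi.release()   # refcount is now 1, the data is still alive
mi.release()   # refcount is now 0, the destructor runs
```

The destructor is called as soon as the last reference is released, which is what makes this style of memory management deterministic, in contrast to a tracing garbage collector.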

Cooperating with CPython

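For NRT to cooperate with CPython, the NRT Python binding provides adaptors for converting Python objects that export a memory region. When such an object is used as an argument to an NPM function, a new MemInfo is created and it acquires a reference to the Python object. When an NPM value is returned to the Python interpreter, the associated MemInfo (if any) is checked. If the MemInfo references a Python object, the underlying Python object is released and returned. Otherwise, the MemInfo is wrapped in a Python object and returned. Additional processing may be required depending on the type.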

The current implementation supports NumPy arrays and any type that exports a buffer.

Compiler-side Cooperation

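NRT reference counting requires the compiler to emit incref/decref operations according to usage. When the reference count drops to zero, the compiler must call the destructor routine in NRT.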

Optimizations

The compiler is allowed to emit incref/decref operations naively. It relies on an optimization pass to remove redundant reference-counting operations.

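A new optimization pass, implemented in version 0.52.0, removes reference-counting operations that fall into the following four categories of control-flow structure: per basic block, diamond, fanout, and fanout+raise. See the documentation for NUMBA_LLVM_REFPRUNE_FLAGS for their descriptions.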
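As a configuration sketch, the subpasses can be selected through environment variables set before Numba is imported. The flag spellings per_bb and fanout below are taken from my reading of the environment variable reference and should be treated as an assumption; check that reference for the exact accepted values.

```python
import os

# Numba reads its configuration from the environment when it is imported,
# so these must be set before the first `import numba`.
os.environ["NUMBA_LLVM_REFPRUNE_PASS"] = "1"               # use the new pass
os.environ["NUMBA_LLVM_REFPRUNE_FLAGS"] = "per_bb,fanout"  # assumed flag spellings
```

Restricting the subpasses in this way can help when bisecting a suspected reference-count pruning problem, because each category can be switched on separately.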

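The old optimization pass runs at the block level to avoid control-flow analysis. It depends on the LLVM function optimization passes to simplify the control flow, perform stack-to-register promotion, and simplify instructions. It works by matching and removing incref and decref pairs within each block. The old pass can be enabled by setting NUMBA_LLVM_REFPRUNE_PASS to 0.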
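The block-level matching can be illustrated with a toy model. The instruction format below (a list of (opcode, operand) tuples) and the function name prune_block are hypothetical; the real pass works on LLVM IR and handles more cases than this sketch does.

```python
def prune_block(instructions):
    """Drop incref/decref pairs on the same value inside one basic block.

    `instructions` is a list of (opcode, operand) tuples. This toy format is
    an assumption made for illustration; it is not LLVM IR.
    """
    removed = set()
    pending = {}  # operand -> indices of increfs that are not yet paired
    for i, (op, val) in enumerate(instructions):
        if op == "incref":
            pending.setdefault(val, []).append(i)
        elif op == "decref" and pending.get(val):
            # Pair this decref with the most recent unpaired incref.
            removed.add(pending[val].pop())
            removed.add(i)
    return [ins for i, ins in enumerate(instructions) if i not in removed]


block = [
    ("incref", "a"),
    ("call", "f"),
    ("decref", "a"),   # pairs with the incref on "a" above
    ("decref", "b"),   # no earlier incref in this block, so it is kept
    ("incref", "b"),
    ("incref", "c"),   # never decref'ed in this block, so it is kept
]
pruned = prune_block(block)
```

Only an incref followed later by a decref is paired. A decref that comes first is left alone, since it may be the operation that drops the count to zero.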

Important assumptions

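Both the old (pre-0.52.0) and the new (0.52.0 and later) optimization passes assume that the only function that can consume a reference is NRT_decref. It is important that no other function consumes references. Because the passes operate on LLVM IR, "function" here means any callee in an LLVM call instruction.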

To summarize, no function exposed to the reference-count optimization passes may consume a counted reference unless it does so through NRT_decref.

Quirks of the old optimization pass

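Because the pre-0.52.0 reference-count optimization pass requires the LLVM function optimization passes, it operates on the LLVM IR as text. The optimized IR is then materialized again as a new in-memory LLVM bitcode object.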

Debugging Leaks

To debug reference leaks in NRT MemInfo, each MemInfo Python object has a .refcount attribute for inspection. To get the MemInfo from an ndarray allocated by NRT, use the .base attribute.

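To debug memory leaks in NRT, numba.core.runtime.rtsys defines .get_allocation_stats(). It returns a named tuple containing the number of allocations and deallocations since the start of the program. Checking that the allocation and deallocation counters match is the simplest way to tell whether the NRT is leaking.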
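Because the counters run from the start of the program, it is often more practical to compare two snapshots taken around the code under test. The sketch below uses a stand-in named tuple so that it runs without Numba; the field names alloc, free, mi_alloc, and mi_free are an assumption about the shape of the real result, and leaked_since is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the named tuple returned by get_allocation_stats().
# The field names are an assumption made for this sketch.
Stats = namedtuple("Stats", ["alloc", "free", "mi_alloc", "mi_free"])


def leaked_since(before, after):
    """Return (raw allocations, MemInfos) still outstanding between snapshots."""
    new_alloc = after.alloc - before.alloc
    new_free = after.free - before.free
    new_mi_alloc = after.mi_alloc - before.mi_alloc
    new_mi_free = after.mi_free - before.mi_free
    return new_alloc - new_free, new_mi_alloc - new_mi_free


# Five allocations were made between the snapshots but only four were freed.
before = Stats(alloc=10, free=10, mi_alloc=10, mi_free=10)
after = Stats(alloc=15, free=14, mi_alloc=15, mi_free=14)
outstanding = leaked_since(before, after)
```

With Numba, the two snapshots would come from calling .get_allocation_stats() before and after the code under test. Arrays that are still referenced from Python keep their allocations alive, so they should be deleted before the second snapshot is taken.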

Debugging Leaks in C

The start of numba/core/runtime/nrt.h has these lines:

/* Debugging facilities - enabled at compile-time */
/* #undef NDEBUG */
#if 0
#   define NRT_Debug(X) X
#else
#   define NRT_Debug(X) if (0) { X; }
#endif

Undefining NDEBUG (uncomment the #undef NDEBUG line) enables the assertion checks in NRT.

Enabling NRT_Debug (replace #if 0 with #if 1) turns on debug printing inside NRT.

Recursion Support

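When a pair of mutually recursive functions is compiled, one of the functions will contain unresolved symbol references, because the compiler handles one function at a time. Before LLVM generates machine code, memory is allocated for each unresolved symbol and initialized to the address of the unresolved symbol abort function (nrt_unresolved_abort). These symbols are tracked and resolved as new functions are compiled. If a bug prevents these symbols from being resolved, the abort function is called and raises a RuntimeError exception.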

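The unresolved symbol abort function is defined in NRT with a zero-argument signature. The caller can safely call it with an arbitrary number of arguments. Therefore, it is safe to use as a stand-in for the intended callee.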
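The placeholder mechanism can be modelled in plain Python. Everything in this sketch (SymbolTable, unresolved_abort, and the is_even/is_odd pair) is hypothetical and only mimics the behaviour described above; the real mechanism patches symbol addresses in generated machine code.

```python
def unresolved_abort(*args):
    """Placeholder callee: accepts any number of arguments, then fails."""
    raise RuntimeError("call to unresolved symbol")


class SymbolTable:
    """Toy table of callable slots that start out as the abort placeholder."""

    def __init__(self):
        self._slots = {}

    def declare(self, name):
        # An unresolved symbol initially points at the abort function.
        self._slots.setdefault(name, unresolved_abort)

    def define(self, name, func):
        # Resolving a symbol overwrites the placeholder.
        self._slots[name] = func

    def call(self, name, *args):
        return self._slots[name](*args)

    def unresolved(self):
        return sorted(n for n, f in self._slots.items() if f is unresolved_abort)


table = SymbolTable()


def is_even(n):
    return True if n == 0 else table.call("is_odd", n - 1)


def is_odd(n):
    return False if n == 0 else table.call("is_even", n - 1)


table.declare("is_odd")            # is_even is "compiled" first
table.define("is_even", is_even)
try:
    table.call("is_even", 3)       # reaches the placeholder and aborts
except RuntimeError as exc:
    early_error = str(exc)
table.define("is_odd", is_odd)     # the second function resolves the symbol
result = table.call("is_even", 10)
```

Because the placeholder ignores its arguments, it can occupy the slot of a callee with any signature until the real function is available.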

Using the NRT from C code

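Externally compiled C code should use the NRT_api_functions struct as a function table to access the NRT API. The struct is defined in numba/core/runtime/nrt_external.h. Users can call the utility function numba.extending.include_path() to determine the include directory for the C headers provided by Numba.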

numba/core/runtime/nrt_external.h

#ifndef NUMBA_NRT_EXTERNAL_H_
#define NUMBA_NRT_EXTERNAL_H_

#include <stdlib.h>

typedef struct MemInfo NRT_MemInfo;

typedef void NRT_managed_dtor(void *data);

typedef void *(*NRT_external_malloc_func)(size_t size, void *opaque_data);
typedef void *(*NRT_external_realloc_func)(void *ptr, size_t new_size, void *opaque_data);
typedef void (*NRT_external_free_func)(void *ptr, void *opaque_data);

struct ExternalMemAllocator {
    NRT_external_malloc_func malloc;
    NRT_external_realloc_func realloc;
    NRT_external_free_func free;
    void *opaque_data;
};

typedef struct ExternalMemAllocator NRT_ExternalAllocator;

typedef struct {
    /* Methods to create MemInfos.

    MemInfos are like smart pointers for objects that are managed by the Numba.
    */

    /* Allocate memory

    *nbytes* is the number of bytes to be allocated

    Returning a new reference.
    */
    NRT_MemInfo* (*allocate)(size_t nbytes);
    /* Allocates memory using an external allocator but still using Numba's MemInfo.
     *
     * NOTE: An externally provided allocator must behave the same way as C99
     *       stdlib.h's "malloc" function with respect to return value
     *       (including the behaviour that occurs when requesting an allocation
     *        of size 0 bytes).
     */
    NRT_MemInfo* (*allocate_external)(size_t nbytes, NRT_ExternalAllocator *allocator);

    /* Convert externally allocated memory into a MemInfo.

    *data* is the memory pointer
    *dtor* is the deallocator of the memory
    */
    NRT_MemInfo* (*manage_memory)(void *data, NRT_managed_dtor dtor);

    /* Acquire a reference */
    void (*acquire)(NRT_MemInfo* mi);

    /* Release a reference */
    void (*release)(NRT_MemInfo* mi);

    /* Get MemInfo data pointer */
    void* (*get_data)(NRT_MemInfo* mi);

} NRT_api_functions;



#endif /* NUMBA_NRT_EXTERNAL_H_ */

Inside Numba-compiled code, the numba.core.unsafe.nrt.NRT_get_api() intrinsic can be used to obtain a pointer to the NRT_api_functions.

Here is an example that uses nrt_external.h:

#include <stdio.h>
#include "numba/core/runtime/nrt_external.h"

void my_dtor(void *ptr) {
    free(ptr);
}

NRT_MemInfo* my_allocate(NRT_api_functions *nrt) {
    /* heap allocate some memory */
    void * data = malloc(10);
    /* wrap the allocated memory; yield a new reference */
    NRT_MemInfo *mi = nrt->manage_memory(data, my_dtor);
    /* acquire reference */
    nrt->acquire(mi);
    /* release reference */
    nrt->release(mi);
    return mi;
}

It is important to ensure that the NRT is initialized before use; calling numba.core.runtime.nrt.rtsys.initialize(context) from Python has the desired effect. Similarly, the code snippet:

from numba.core.registry import cpu_target # Get the CPU target singleton
cpu_target.target_context # Access the target_context property to initialize

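achieves the same effect specifically for Numba's CPU target (the default). Failure to initialize the NRT will result in access violations, because the function pointers for various internal atomic operations will be missing from the NRT_MemSys struct.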

Future Plan

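The plan for NRT is to build a standalone shared library that can be linked to Numba-compiled code, for use both inside and outside the Python interpreter. To achieve this, some refactoring is needed: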

Numba NPM code references statically compiled code in "helperlib.c". These functions should be moved to NRT.