从 Python 内核调用外部函数

CUDA 内置目标弃用通知

Numba 内置的 CUDA 目标已弃用，后续开发已移至 NVIDIA numba-cuda 包。请参阅内置 CUDA 目标弃用和维护状态。

Python 内核可以调用用其他语言编写的设备函数。CUDA C/C++、PTX 和二进制对象（cubin、fat binary 等）直接受支持；其他语言的源代码必须首先编译为 PTX。Python 内核调用外部设备函数的组成部分包括：

外部语言中的设备函数实现（例如 CUDA C）。
Python 中设备函数的声明。
链接并调用外部函数的内核。

设备函数 ABI

Numba 用于调用设备函数的 ABI 在 C/C++ 中定义了以下原型：

extern "C"
__device__ int
function(
  T* return_value,
  ...
);

原型的组成部分如下：

extern "C" 用于防止名称改编 (name-mangling)，以便在 Python 中轻松声明函数。它可以被移除，但届时在 Python 中声明函数时必须使用改编后的名称。
__device__ 用于将函数定义为设备函数。
返回值始终为 int 类型，用于指示是否发生了 Python 异常。由于外部函数中不会发生 Python 异常，因此被调用者应始终将其设置为 0。
第一个参数是指向 T 类型返回值的指针，该值在本地地址空间 1 中分配并由调用者传入。如果函数返回值，则被指向的值应由被调用者设置以存储返回值。
后续参数应与从 Python 内核传递给函数的参数的类型和顺序匹配。

用其他语言编写的函数必须编译成符合此原型规范的 PTX。

一个接受两个浮点数并返回一个浮点数的函数将具有以下原型：

extern "C"
__device__ int
mul_f32_f32(
  float* return_value,
  float x,
  float y
);

备注

1: 必须注意确保对返回值的任何操作都适用于本地地址空间中的数据。某些操作，例如原子操作，不能在本地地址空间中的数据上执行。

Python 中的声明

要在 Python 中声明外部设备函数，请使用 declare_device()

numba.cuda.declare_device(name, sig)

声明外部函数的签名。返回一个描述符，可用于从 Python 内核调用该函数。

参数

name (str) – 外部函数的名称。
sig – 函数的 Numba 签名。

返回的描述符名称无需与外部函数的名称匹配。例如，当

mul = cuda.declare_device('mul_f32_f32', 'float32(float32, float32)')

被声明时，在内核内调用 mul(a, b) 将在编译后的代码中转换为对 mul_f32_f32(a, b) 的调用。

传递指针

Numba 的调用约定要求为数组参数传递多个值。这些值包括数据指针以及形状、步幅和其他信息。这与大多数 C/C++ 函数的预期不兼容，后者通常只期望一个指向数据的指针。为了使 C 设备代码和 Python 内核之间的调用约定保持一致，需要使用 C 指针类型声明数组参数。

例如，一个具有以下原型的函数：

numba/cuda/tests/doc_examples/ffi/functions.cu

extern "C"
__device__ int
sum_reduce(
  float* return_value,
  float* array,
  int n
);

将被声明为如下所示：

来自 numba/cuda/tests/doc_examples/test_ffi.py 中的 test_ex_from_buffer

signature = 'float32(CPointer(float32), int32)'
sum_reduce = cuda.declare_device('sum_reduce', signature)

要获取指向数组数据的指针以传递给外部函数，请使用 cffi.FFI 实例的 from_buffer() 方法。例如，使用 sum_reduce 函数的内核可以定义为：

来自 numba/cuda/tests/doc_examples/test_ffi.py 中的 test_ex_from_buffer

import cffi
ffi = cffi.FFI()

@cuda.jit(link=[functions_cu])
def reduction_caller(result, array):
    array_ptr = ffi.from_buffer(array)
    result[()] = sum_reduce(array_ptr, len(array))

其中 result 和 array 都是 float32 数据的数组。

链接和调用函数

@cuda.jit 装饰器的 link 关键字参数接受一个文件名称列表，可以通过绝对路径或相对于当前工作目录的路径指定。名称以 .cu 结尾的文件将使用 NVIDIA Runtime Compiler (NVRTC) 编译并作为 PTX 链接到内核中；其他文件将直接传递给 CUDA Linker。

例如，以下内核调用了上面声明的 mul() 函数，其实现 mul_f32_f32() 位于名为 functions.cu 的文件中：

@cuda.jit(link=['functions.cu'])
def multiply_vectors(r, x, y):
    i = cuda.grid(1)

    if i < len(r):
        r[i] = mul(x[i], y[i])

C/C++ 支持

通过使用 NVRTC 提供对 CUDA C/C++ 代码编译和链接的支持，但需考虑以下事项：

仅在使用 NVIDIA Bindings 时可用。请参阅 NUMBA_CUDA_USE_NVIDIA_BINDING。
已安装的 NVIDIA CUDA Bindings 版本必须有相应的 NVRTC 库可用。
CUDA include 路径在 Linux 上默认假定为 /usr/local/cuda/include，在 Windows 上为 $env:CUDA_PATH\include。可以使用环境变量 NUMBA_CUDA_INCLUDE_PATH 进行修改。
CUDA include 目录将在 include 路径上提供给 NVRTC；不支持额外的 include。

完整示例

此示例演示了调用用 CUDA C 编写的外部函数来乘以两个数组中的数字对。

外部函数编写如下：

numba/cuda/tests/doc_examples/ffi/functions.cu

// Foreign function example: multiplication of a pair of floats

extern "C" __device__ int
mul_f32_f32(
  float* return_value,
  float x,
  float y)
{
  // Compute result and store in caller-provided slot
  *return_value = x * y;

  // Signal that no Python exception occurred
  return 0;
}

Python 代码和内核如下：

来自 numba/cuda/tests/doc_examples/test_ffi.py 中的 test_ex_linking_cu

from numba import cuda
import numpy as np
import os

# Declaration of the foreign function
mul = cuda.declare_device('mul_f32_f32', 'float32(float32, float32)')

# Path to the source containing the foreign function
# (here assumed to be in a subdirectory called "ffi")
basedir = os.path.dirname(os.path.abspath(__file__))
functions_cu = os.path.join(basedir, 'ffi', 'functions.cu')

# Kernel that links in functions.cu and calls mul
@cuda.jit(link=[functions_cu])
def multiply_vectors(r, x, y):
    i = cuda.grid(1)

    if i < len(r):
        r[i] = mul(x[i], y[i])

# Generate random data
N = 32
np.random.seed(1)
x = np.random.rand(N).astype(np.float32)
y = np.random.rand(N).astype(np.float32)
r = np.zeros_like(x)

# Run the kernel
multiply_vectors[1, 32](r, x, y)

# Sanity check - ensure the results match those expected
np.testing.assert_array_equal(r, x * y)

注意

上述示例是最小的，旨在说明外部函数调用——由于网格小且外部函数的工作负载轻，预计其性能不会特别高。