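Examples

CUDA built-in target deprecation notice

The CUDA target built in to Numba is deprecated, with further development moved to the NVIDIA numba-cuda package. Please see Built-in CUDA target deprecation and maintenance status.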

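Vector Addition

This example uses Numba to create on-device arrays and a vector addition kernel; it is a warmup for learning how to write GPU kernels using Numba. We'll begin with some required imports:

from test_ex_vecadd in numba/cuda/tests/doc_examples/test_vecadd.py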

import numpy as np
from numba import cuda

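The following function is the kernel. Note that it is defined in terms of Python variables with unspecified types. When the kernel is launched, Numba examines the types of the arguments passed at runtime and generates a CUDA kernel specialized for them.

Note that Numba kernels do not return values and must write any output into arrays passed in as parameters (this is similar to the requirement that CUDA C/C++ kernels have a void return type). Here we pass in c for the results to be written into.

from test_ex_vecadd in numba/cuda/tests/doc_examples/test_vecadd.py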

@cuda.jit
def f(a, b, c):
    # like threadIdx.x + (blockIdx.x * blockDim.x)
    tid = cuda.grid(1)
    size = len(c)

    if tid < size:
        c[tid] = a[tid] + b[tid]
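
To make the indexing and the bounds guard concrete, the launch can be imitated on the host in plain Python. This is only a sketch: emulate_vecadd is a hypothetical helper that is not part of Numba, and it runs the "threads" one after another rather than in parallel.

```python
def emulate_vecadd(a, b, nblocks, nthreads):
    """Host-side imitation of the kernel above (hypothetical helper)."""
    c = [None] * len(a)
    for block_idx in range(nblocks):
        for thread_idx in range(nthreads):
            # Same value that cuda.grid(1) yields on the device:
            # threadIdx.x + blockIdx.x * blockDim.x
            tid = thread_idx + block_idx * nthreads
            # The guard skips threads that fall past the end of the data
            if tid < len(c):
                c[tid] = a[tid] + b[tid]
    return c


# 2 blocks of 4 threads give 8 threads for 5 elements; 3 threads do nothing
result = emulate_vecadd([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 2, 4)
print(result)
```

Without the guard, the three surplus threads in this run would index past the end of the lists, which is the host-side analogue of an out-of-bounds access on the device.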

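cuda.to_device() can be used to create device-side copies of arrays, and cuda.device_array_like() creates an uninitialized array of the same shape and type as an existing array. Here we transfer two vectors and create an empty vector to hold our results:

from test_ex_vecadd in numba/cuda/tests/doc_examples/test_vecadd.py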

N = 100000
a = cuda.to_device(np.random.random(N))
b = cuda.to_device(np.random.random(N))
c = cuda.device_array_like(a)

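A call to forall() generates an appropriate launch configuration with a 1D grid (see Kernel invocation) for a given data size, and is often the simplest way of launching a kernel:

from test_ex_vecadd in numba/cuda/tests/doc_examples/test_vecadd.py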

f.forall(len(a))(a, b, c)
print(c.copy_to_host())

This prints:

[0.73548323 1.32061059 0.12582968 ... 1.25925809 1.49335059 1.59315414]

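One can also configure the grid manually using the subscripting syntax. The following example launches a grid with sufficient threads to operate on every vector element:

from test_ex_vecadd in numba/cuda/tests/doc_examples/test_vecadd.py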

# Enough threads per block for several warps per block
nthreads = 256
# Enough blocks to cover the entire vector depending on its length
nblocks = (len(a) // nthreads) + 1
f[nblocks, nthreads](a, b, c)
print(c.copy_to_host())

This also prints:

[0.73548323 1.32061059 0.12582968 ... 1.25925809 1.49335059 1.59315414]
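
The block count in the manual launch is plain integer arithmetic, so it can be checked on the host. The sketch below compares the expression used above with a ceiling division; blocks_to_cover is a hypothetical helper name, not a Numba function.

```python
def blocks_to_cover(n, nthreads):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + nthreads - 1) // nthreads  # integer ceiling division


n, nthreads = 100000, 256

# For this vector length both expressions agree
print(blocks_to_cover(n, nthreads), (n // nthreads) + 1)

# When the length is an exact multiple of the block size, the
# expression used in the example launches one extra block
print(blocks_to_cover(512, nthreads), (512 // nthreads) + 1)
```

The extra block is harmless here because the kernel's tid < size guard makes its threads do nothing; it only costs a few idle threads.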

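1D Heat Equation

This example solves Laplace's equation in one dimension for a certain set of initial conditions and boundary conditions. A full discussion of Laplace's equation is out of scope for this documentation, but it will suffice to say that it describes how heat propagates through an object over time. It works by discretizing the problem in two ways:

The domain is partitioned into a mesh of points that each have an individual temperature.
Time is partitioned into discrete intervals that are advanced forward sequentially.

Then, the following assumption is applied: the temperature of a point after some interval has passed is some weighted average of the temperature of the points that are directly adjacent to it. Intuitively, if all the points in the domain are very hot and a single point in the middle is very cold, as time passes the hot points will cause the cold one to heat up and the cold point will cause the surrounding hot pieces to cool slightly. Simply put, the heat spreads throughout the object.

We can implement this simulation using a Numba kernel. Let's start simple by assuming we have a one dimensional object which we'll represent with an array of values. The position of the element in the array is the position of a point within the object, and the value of the element represents the temperature.

from test_ex_laplace in numba/cuda/tests/doc_examples/test_laplace.py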

import numpy as np
from numba import cuda

Here, we do some initial setup. Let's make one point in the center of the object very hot.

from test_ex_laplace in numba/cuda/tests/doc_examples/test_laplace.py

# Use an odd problem size.
# This is so there can be an element truly in the "middle" for symmetry.
size = 1001
data = np.zeros(size)

# Middle element is made very hot
data[500] = 10000
buf_0 = cuda.to_device(data)

# This extra array is used for synchronization purposes
buf_1 = cuda.device_array_like(buf_0)

niter = 10000

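In the initial state of the problem, the temperature is zero everywhere except for a single spike of 10000 at the middle element.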

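In our kernel, each thread is responsible for managing the temperature update of a single element, in a loop over the desired number of timesteps. The kernel is below. Note the use of cooperative group synchronization and the use of two buffers swapped at each iteration to avoid race conditions. See numba.cuda.cg.this_grid() for details.

from test_ex_laplace in numba/cuda/tests/doc_examples/test_laplace.py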

@cuda.jit
def solve_heat_equation(buf_0, buf_1, timesteps, k):
    i = cuda.grid(1)

    # Don't continue if our index is outside the domain
    if i >= len(buf_0):
        return

    # Prepare to do a grid-wide synchronization later
    grid = cuda.cg.this_grid()

    for step in range(timesteps):
        # Select the buffer from the previous timestep
        if (step % 2) == 0:
            data = buf_0
            next_data = buf_1
        else:
            data = buf_1
            next_data = buf_0

        # Get the current temperature associated with this point
        curr_temp = data[i]

        # Apply formula from finite difference equation
        if i == 0:
            # Left wall is held at T = 0
            next_temp = curr_temp + k * (data[i + 1] - (2 * curr_temp))
        elif i == len(data) - 1:
            # Right wall is held at T = 0
            next_temp = curr_temp + k * (data[i - 1] - (2 * curr_temp))
        else:
            # Interior points are a weighted average of their neighbors
            next_temp = curr_temp + k * (
                data[i - 1] - (2 * curr_temp) + data[i + 1]
            )

        # Write new value to the next buffer
        next_data[i] = next_temp

        # Wait for every thread to write before moving on
        grid.sync()

Calling the kernel:

from test_ex_laplace in numba/cuda/tests/doc_examples/test_laplace.py

solve_heat_equation.forall(len(data))(
    buf_0, buf_1, niter, 0.25
)

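Plotting the final data shows an arc that is highest where the object was hot initially and gradually slopes down to zero towards the edges, where the temperature is fixed at zero. In the limit of infinite time, the arc will flatten out completely.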
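
The update rule used by the kernel can be cross-checked on the host. The following pure-Python sketch applies the same finite-difference formula with walls held at zero; heat_step and solve_heat_equation_host are hypothetical names, and the small problem size is chosen only so that it runs quickly.

```python
def heat_step(data, k):
    """Advance every point by one timestep; both walls are held at T = 0."""
    n = len(data)
    nxt = [0.0] * n
    for i in range(n):
        left = data[i - 1] if i > 0 else 0.0
        right = data[i + 1] if i < n - 1 else 0.0
        nxt[i] = data[i] + k * (left - 2 * data[i] + right)
    return nxt


def solve_heat_equation_host(data, timesteps, k):
    cur = list(data)
    for _ in range(timesteps):
        # Each step returns a fresh list, which plays the role of the
        # second buffer that the kernel swaps in on every iteration
        cur = heat_step(cur, k)
    return cur


print(heat_step([0.0, 0.0, 100.0, 0.0, 0.0], 0.25))
# [0.0, 25.0, 50.0, 25.0, 0.0]

size = 11
init = [0.0] * size
init[size // 2] = 10000.0
final = solve_heat_equation_host(init, 50, 0.25)
```

Writing into a new list on each step is what keeps the sketch free of the read-after-write hazard that the two device buffers and grid.sync() guard against in the kernel.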

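Shared Memory Reduction

Numba exposes many CUDA features, including shared memory. To demonstrate shared memory, let's reimplement a famous CUDA solution for summing a vector which works by "folding" the data up using a successively smaller number of threads.

Note that this is a fairly naive implementation, and there are more efficient ways of implementing reductions using Numba; see Monte Carlo Integration for an example.

from test_ex_reduction in numba/cuda/tests/doc_examples/test_reduction.py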

import numpy as np
from numba import cuda
from numba.types import int32

Let's create some one dimensional data that we'll use to demonstrate the kernel itself:

from test_ex_reduction in numba/cuda/tests/doc_examples/test_reduction.py

# generate data
a = cuda.to_device(np.arange(1024))
nelem = len(a)

Here is a version of the kernel implemented using Numba:

from test_ex_reduction in numba/cuda/tests/doc_examples/test_reduction.py

@cuda.jit
def array_sum(data):
    tid = cuda.threadIdx.x
    size = len(data)
    if tid < size:
        i = cuda.grid(1)

        # Declare an array in shared memory
        shr = cuda.shared.array(nelem, int32)
        shr[tid] = data[i]

        # Ensure writes to shared memory are visible
        # to all threads before reducing
        cuda.syncthreads()

        s = 1
        while s < cuda.blockDim.x:
            if tid % (2 * s) == 0:
                # Stride by `s` and add
                shr[tid] += shr[tid + s]
            s *= 2
            cuda.syncthreads()

        # After the loop, the zeroth element contains the sum
        if tid == 0:
            data[tid] = shr[tid]

We can run the kernel and verify that the same result is obtained by summing the data on the host, as follows:

from test_ex_reduction in numba/cuda/tests/doc_examples/test_reduction.py

array_sum[1, nelem](a)
print(a[0])                  # 523776
print(sum(np.arange(1024)))  # 523776

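This algorithm can be greatly improved upon by redesigning the inner loop to use sequential memory accesses, and even further by using strategies that keep more threads active and working, since in this example most threads quickly become idle.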
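
The difference between the two access patterns can be seen in a host-side imitation. This is a sketch under stated assumptions: fold_interleaved mirrors the loop in the kernel above, fold_sequential is one common sequential-addressing variant, both names are hypothetical, and both assume the input length is a power of two.

```python
def fold_interleaved(values):
    """Same stride pattern as the kernel: active threads are spread out."""
    shr = list(values)
    n = len(shr)
    s = 1
    while s < n:
        for tid in range(n):
            if tid % (2 * s) == 0:
                shr[tid] += shr[tid + s]
        s *= 2
    return shr[0]


def fold_sequential(values):
    """Sequential addressing: the active threads are always 0..s-1."""
    shr = list(values)
    s = len(shr) // 2
    while s > 0:
        for tid in range(s):
            shr[tid] += shr[tid + s]
        s //= 2
    return shr[0]


print(fold_interleaved(range(1024)))  # 523776
print(fold_sequential(range(1024)))   # 523776
```

In the second variant the working threads stay contiguous, so on a device whole warps retire together instead of each warp keeping a few scattered threads busy.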

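Dividing Click Data into Sessions

A common problem in business analytics is that of grouping the activity of users of an online platform into sessions, called "sessionization". The idea is that users generally traverse through a website and perform various actions (clicking something, filling out a form, etc.) in discrete groups. Perhaps a customer spends some time shopping for an item in the morning and then again at night. Often the business is interested in treating these periods as separate interactions with their service, and this creates the problem of programmatically splitting up activity in some agreed-upon way.

Here we'll illustrate how to write a Numba kernel to solve this problem. We'll start with data containing two fields: let user_id represent a unique ID corresponding to an individual customer, and let action_time be a time that some unknown action was taken on the service. Right now, we'll assume there's only one type of action, so all there is to know is when it happened.

Our goal will be to create a new column called session_id, which contains a label corresponding to a unique session. We'll define the boundary between sessions as when there has been at least one hour between clicks.

from test_ex_sessionize in numba/cuda/tests/doc_examples/test_sessionize.py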

import numpy as np
from numba import cuda

# Set the timeout to one hour
session_timeout = np.int64(np.timedelta64("3600", "s"))

Here is a solution using Numba:

from test_ex_sessionize in numba/cuda/tests/doc_examples/test_sessionize.py

@cuda.jit
def sessionize(user_id, timestamp, results):
    gid = cuda.grid(1)
    size = len(user_id)

    if gid >= size:
        return

    # Determine session boundaries
    is_first_datapoint = gid == 0
    if not is_first_datapoint:
        new_user = user_id[gid] != user_id[gid - 1]
        timed_out = (
            timestamp[gid] - timestamp[gid - 1] > session_timeout
        )
        is_sess_boundary = new_user or timed_out
    else:
        is_sess_boundary = True

    # Determine session labels
    if is_sess_boundary:
        # This thread marks the start of a session
        results[gid] = gid

        # Make sure all session boundaries are written
        # before populating the session id
        grid = cuda.cg.this_grid()
        grid.sync()

        look_ahead = 1
        # Check elements 'forward' of this one
        # until a new session boundary is found
        while results[gid + look_ahead] == 0:
            results[gid + look_ahead] = gid
            look_ahead += 1
            # Avoid out-of-bounds accesses by the last thread
            if gid + look_ahead == size - 1:
                results[gid + look_ahead] = gid
                break

Let's generate some data and try out the kernel:

from test_ex_sessionize in numba/cuda/tests/doc_examples/test_sessionize.py

# Generate data
ids = cuda.to_device(
    np.array(
        [
            1, 1, 1, 1, 1, 1,
            2, 2, 2,
            3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
            4, 4, 4, 4, 4, 4, 4, 4, 4,
        ]
    )
)
sec = cuda.to_device(
    np.array(
        [
            1, 2, 3, 5000, 5001, 5002, 1,
            2, 3, 1, 2, 5000, 5001, 10000,
            10001, 10002, 10003, 15000, 150001,
            1, 5000, 50001, 15000, 20000,
            25000, 25001, 25002, 25003,
        ],
        dtype="datetime64[ns]",
    ).astype(
        "int64"
    )  # Cast to int64 for compatibility
)
# Create a vector to hold the results
results = cuda.to_device(np.zeros(len(ids)))

Run on this data, the kernel separates the first three datapoints of the first user ID from the last three, and a similar pattern is seen throughout.
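
The labels the kernel is meant to produce can be derived with a sequential host-side reference. This sketch applies the same boundary rule (a new user, or a gap larger than the timeout) and labels each session with the index of its first datapoint; sessionize_host is a hypothetical helper, and the timeout of 3600 matches the raw integer comparison performed in the kernel.

```python
def sessionize_host(user_id, timestamp, session_timeout):
    """Label every datapoint with the index that starts its session."""
    labels = []
    current = 0
    for i in range(len(user_id)):
        if i == 0:
            boundary = True
        else:
            new_user = user_id[i] != user_id[i - 1]
            timed_out = timestamp[i] - timestamp[i - 1] > session_timeout
            boundary = new_user or timed_out
        if boundary:
            current = i
        labels.append(current)
    return labels


# Same values as the device arrays above, kept as plain lists
ids_host = [1] * 6 + [2] * 3 + [3] * 10 + [4] * 9
sec_host = [
    1, 2, 3, 5000, 5001, 5002, 1,
    2, 3, 1, 2, 5000, 5001, 10000,
    10001, 10002, 10003, 15000, 150001,
    1, 5000, 50001, 15000, 20000,
    25000, 25001, 25002, 25003,
]
labels = sessionize_host(ids_host, sec_host, 3600)
print(labels[:6])  # [0, 0, 0, 3, 3, 3]
```

The first six labels show the split described above: the first user's clicks at 1, 2 and 3 share label 0, and the clicks at 5000, 5001 and 5002 share label 3.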

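JIT Function CPU-GPU Compatibility

This example demonstrates how numba.jit can be used to JIT-compile a function for the CPU while also making it available for use inside CUDA kernels. This can be very useful for users who are migrating workflows from CPU to GPU, as they can directly reuse potential business logic with fewer code changes.

Here is an example function:

from test_ex_cpu_gpu_compat in numba/cuda/tests/doc_examples/test_cpu_gpu_compat.py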

from math import pi
import numba
from numba import cuda

@numba.jit
def business_logic(x, y, z):
    return 4 * z * (2 * x - (4 * y) / 2 * pi)

The function business_logic can be run standalone in compiled form on the CPU:

from test_ex_cpu_gpu_compat in numba/cuda/tests/doc_examples/test_cpu_gpu_compat.py

print(business_logic(1, 2, 3))  # -126.79644737231007

It can also be directly reused threadwise inside a GPU kernel. For example, one may generate some vectors to represent x, y, and z:

from test_ex_cpu_gpu_compat in numba/cuda/tests/doc_examples/test_cpu_gpu_compat.py

X = cuda.to_device([1, 10, 234])
Y = cuda.to_device([2, 2, 4014])
Z = cuda.to_device([3, 14, 2211])
results = cuda.to_device([0.0, 0.0, 0.0])

And a Numba kernel referencing the decorated function:

from test_ex_cpu_gpu_compat in numba/cuda/tests/doc_examples/test_cpu_gpu_compat.py

@cuda.jit
def f(res, xarr, yarr, zarr):
    tid = cuda.grid(1)
    if tid < len(xarr):
        # The function decorated with numba.jit may be directly reused
        res[tid] = business_logic(xarr[tid], yarr[tid], zarr[tid])

This kernel can be invoked in the normal way:

from test_ex_cpu_gpu_compat in numba/cuda/tests/doc_examples/test_cpu_gpu_compat.py

f.forall(len(X))(results, X, Y, Z)
# Copy the device array back to the host so its values are printed
print(results.copy_to_host().tolist())
# [-126.79644737231007, 416.28324559588634, -218912930.2987788]
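
The values quoted in the comment can be reproduced on the host with the undecorated formula. This sketch uses plain Python only; business_logic_host is a hypothetical copy of the function body without the numba.jit decorator.

```python
from math import pi


def business_logic_host(x, y, z):
    # Note the precedence: (4 * y) / 2 * pi is ((4 * y) / 2) * pi
    return 4 * z * (2 * x - (4 * y) / 2 * pi)


X_host = [1, 10, 234]
Y_host = [2, 2, 4014]
Z_host = [3, 14, 2211]

# One call per element, which is what each GPU thread does in the kernel
expected = [
    business_logic_host(x, y, z)
    for x, y, z in zip(X_host, Y_host, Z_host)
]
print(expected)
```

Comparing this list with the kernel output is a cheap way to confirm that moving the function onto the GPU has not changed its results beyond floating-point rounding.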

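Monte Carlo Integration

This example shows how to use Numba to approximate the value of a definite integral by rapidly generating random numbers on the GPU. A detailed description of the mathematical mechanics of Monte Carlo integration is out of the scope of the example, but it can briefly be described as an averaging process where the area under the curve is approximated by taking the average of many rectangles formed by its function values.

In addition, this example shows how to perform reductions in Numba using the cuda.reduce() API.

from test_ex_montecarlo in numba/cuda/tests/doc_examples/test_montecarlo.py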

import numba
import numpy as np
from numba import cuda
from numba.cuda.random import (
    create_xoroshiro128p_states,
    xoroshiro128p_uniform_float32,
)

Let's create a variable to control the number of samples drawn:

from test_ex_montecarlo in numba/cuda/tests/doc_examples/test_montecarlo.py

# number of samples, higher will lead to a more accurate answer
nsamps = 1000000

The following kernel implements the main integration routine:

from test_ex_montecarlo in numba/cuda/tests/doc_examples/test_montecarlo.py

@cuda.jit
def mc_integrator_kernel(out, rng_states, lower_lim, upper_lim):
    """
    kernel to draw random samples and evaluate the function to
    be integrated at those sample values
    """
    size = len(out)

    gid = cuda.grid(1)
    if gid < size:
        # draw a sample between 0 and 1 on this thread
        samp = xoroshiro128p_uniform_float32(rng_states, gid)

        # normalize this sample to the limit range
        samp = samp * (upper_lim - lower_lim) + lower_lim

        # evaluate the function to be
        # integrated at the normalized
        # value of the sample
        y = func(samp)
        out[gid] = y

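This convenience function calls the kernel and performs some preprocessing and post-processing steps. Note the use of Numba's reduction API to sum the array and compute the final result:

from test_ex_montecarlo in numba/cuda/tests/doc_examples/test_montecarlo.py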

@cuda.reduce
def sum_reduce(a, b):
    return a + b

def mc_integrate(lower_lim, upper_lim, nsamps):
    """
    approximate the definite integral of `func` from
    `lower_lim` to `upper_lim`
    """
    out = cuda.to_device(np.zeros(nsamps, dtype="float32"))
    rng_states = create_xoroshiro128p_states(nsamps, seed=42)

    # jit the function for use in CUDA kernels

    mc_integrator_kernel.forall(nsamps)(
        out, rng_states, lower_lim, upper_lim
    )
    # normalization factor to convert
    # to the average: (b - a)/(N - 1)
    factor = (upper_lim - lower_lim) / (nsamps - 1)

    return sum_reduce(out) * factor

We can now use mc_integrate to compute the definite integral of this function between two limits:

from test_ex_montecarlo in numba/cuda/tests/doc_examples/test_montecarlo.py

# define a function to integrate
@numba.jit
def func(x):
    return 1.0 / x

mc_integrate(1, 2, nsamps)  # array(0.6929643, dtype=float32)
mc_integrate(2, 3, nsamps)  # array(0.4054021, dtype=float32)
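
The averaging process itself does not depend on the GPU, so it can be sketched with the standard library alone. In the sketch below, mc_integrate_host is a hypothetical helper, the generator is Python's random module rather than xoroshiro128p, and the sum is divided by the plain sample count, whereas the function above divides by nsamps - 1.

```python
import math
import random


def mc_integrate_host(func, lower_lim, upper_lim, nsamps, seed=42):
    """Estimate the integral of func as (b - a) times the sample mean."""
    rng = random.Random(seed)
    width = upper_lim - lower_lim
    total = 0.0
    for _ in range(nsamps):
        # Draw a sample in [0, 1) and rescale it to the limit range
        samp = rng.random() * width + lower_lim
        total += func(samp)
    return total * width / nsamps


estimate = mc_integrate_host(lambda x: 1.0 / x, 1, 2, 100_000)
exact = math.log(2)  # closed form of the integral of 1/x from 1 to 2
print(estimate, exact)
```

The estimate should land close to ln 2, about 0.6931, which is also the value the GPU result for the limits 1 and 2 approximates.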

Matrix multiplication

First, import the modules needed for this example:

from test_ex_matmul in numba/cuda/tests/doc_examples/test_matmul.py

from numba import cuda, float32
import numpy as np
import math

Here is a naive implementation of matrix multiplication using a CUDA kernel:

from test_ex_matmul in numba/cuda/tests/doc_examples/test_matmul.py

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

An example usage of this function is as follows:

from test_ex_matmul in numba/cuda/tests/doc_examples/test_matmul.py

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h @ y_h)

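This implementation is straightforward and intuitive but performs poorly, because the same matrix elements will be loaded multiple times from device memory, which is slow (some devices may have transparent data caches, but they may not be large enough to hold the entire inputs at once).

It will be faster if we use a blocked algorithm to reduce accesses to the device memory. CUDA provides a fast shared memory for threads in a block to cooperatively compute on a task. The following implements a faster version of the square matrix multiplication using shared memory:

from test_ex_matmul in numba/cuda/tests/doc_examples/test_matmul.py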

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
# TPB should not be larger than 32 in this example
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    """
    Perform matrix multiplication of C = A * B using CUDA shared memory.

    Reference: https://stackoverflow.com/a/64198479/13697228 by @RobertCrovella
    """
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

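Because the shared memory is a limited resource, the code preloads a small block at a time from the input arrays. Then, it calls syncthreads() to wait until all threads have finished preloading before doing the computation on the shared memory. It synchronizes again after the computation to ensure all threads have finished with the data in shared memory before overwriting it in the next loop iteration.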
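
The preload-then-accumulate structure can be imitated on the host to see why zero-padded tiles still give the right answer for sizes that are not a multiple of the tile width. This is a pure-Python sketch: tiled_matmul is a hypothetical helper operating on lists of lists, and the small local lists sA and sB only stand in for the shared memory arrays.

```python
def tiled_matmul(A, B, tile):
    """Multiply A (n x k) by B (k x m) one tile at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    ntiles = (k + tile - 1) // tile  # tiles along the shared dimension
    for row0 in range(0, n, tile):        # one "block" per output tile
        for col0 in range(0, m, tile):
            for t in range(ntiles):
                # "Preload" a tile of A and a tile of B, padding with zeros
                sA = [[A[r][c] if r < n and c < k else 0.0
                       for c in range(t * tile, (t + 1) * tile)]
                      for r in range(row0, row0 + tile)]
                sB = [[B[r][c] if r < k and c < m else 0.0
                       for c in range(col0, col0 + tile)]
                      for r in range(t * tile, (t + 1) * tile)]
                # Accumulate the partial dot products for this tile
                for ty in range(tile):
                    for tx in range(tile):
                        if row0 + ty < n and col0 + tx < m:
                            C[row0 + ty][col0 + tx] += sum(
                                sA[ty][j] * sB[j][tx] for j in range(tile)
                            )
    return C


A = [[float(23 * i + j) for j in range(23)] for i in range(5)]
B = [[1.0] * 7 for _ in range(23)]
C = tiled_matmul(A, B, 16)
print([row[0] for row in C])  # [253.0, 782.0, 1311.0, 1840.0, 2369.0]
```

On the host the tiles are built and consumed in order, so nothing needs synchronizing; on the device the two syncthreads() calls are what enforce that same ordering across the threads of a block.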

An example usage of the fast_matmul function is as follows:

from test_ex_matmul in numba/cuda/tests/doc_examples/test_matmul.py

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h @ y_h)

This passes a CUDA memcheck test, which can help with debugging. Running the code above produces the following output:

$ python fast_matmul.py
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]

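Note

For high performance matrix multiplication in CUDA, see also the CuPy implementation.

The approach outlined here generalizes to non-square matrix multiplication by adjusting the blockspergrid variable. Here is an example usage:

from test_ex_matmul in numba/cuda/tests/doc_examples/test_matmul.py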

x_h = np.arange(115).reshape([5, 23])
y_h = np.ones([23, 7])
z_h = np.zeros([5, 7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0], y_h.shape[0])
grid_x_max = max(x_h.shape[1], y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h @ y_h)

and the corresponding output:

$ python nonsquare_matmul.py
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
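
The grid sizing for the non-square case is again plain arithmetic. The sketch below wraps the expressions from the example in a hypothetical helper, fast_matmul_grid, and shows why sizing the grid from the output shape alone would likely not be enough: fast_matmul loops over gridDim.x tiles of the shared dimension, so the x extent of the grid also has to cover A.shape[1].

```python
import math


def fast_matmul_grid(a_shape, b_shape, tpb):
    """Blocks per grid (x, y), computed as in the example above."""
    grid_y_max = max(a_shape[0], b_shape[0])
    grid_x_max = max(a_shape[1], b_shape[1])
    return (math.ceil(grid_x_max / tpb), math.ceil(grid_y_max / tpb))


TPB = 16
print(fast_matmul_grid((5, 23), (23, 7), TPB))  # (2, 2)
print(fast_matmul_grid((4, 4), (4, 4), TPB))    # (1, 1)

# Sizing x from the 7 output columns alone gives a single block, and
# one tile of 16 does not span the shared dimension of 23
blocks_from_output = math.ceil(7 / TPB)
print(blocks_from_output * TPB >= 23)  # False
```

Taking the maximum over both operand shapes launches some blocks that write nothing, but the bounds checks in the kernel make those blocks harmless.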

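Calling a NumPy UFunc

UFuncs supported in the CUDA target (see NumPy support) can be called inside kernels, but the output array must be passed in as a positional argument. The following example demonstrates a call to np.sin() inside a kernel following this rule:

from test_ex_cuda_ufunc_call in numba/cuda/tests/doc_examples/test_ufunc.py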

import numpy as np
from numba import cuda

# A kernel calling a ufunc (sin, in this case)
@cuda.jit
def f(r, x):
    # Compute sin(x) with result written to r
    np.sin(x, r)

# Declare input and output arrays
x = np.arange(10, dtype=np.float32) - 5
r = np.zeros_like(x)

# Launch kernel that calls the ufunc
f[1, 1](r, x)

# A quick sanity check demonstrating equality of the sine computed by
# the sin ufunc inside the kernel, and NumPy's sin ufunc
np.testing.assert_allclose(r, np.sin(x))