Welcome back to the "A Trip Through the Graphics Pipeline" series. This is the last part – the series is long enough already, though I may write more GPU-related posts later. So far we have toured all the regular parts of the graphics pipeline, down to different levels of detail. That leaves out one major new feature introduced in DX11: Compute Shaders (CS), which is the topic this time around.

1. Execution Environment

Throughout this series, the emphasis has been on overall dataflow at the architectural level, not on shader execution (which is explained well elsewhere). For the stages so far, that meant focusing on the inputs piped into and the outputs produced by each stage; the way the internals work was usually dictated by the shape of the data. Compute Shaders are different – they run by themselves, not as part of the graphics pipeline, so the surface area of their data interface is much smaller.

In fact, on the input side, a CS has no buffers for input data at all. Aside from API state such as bound Constant Buffers and resources, the only input Compute Shaders get is their thread index. There is tremendous potential for confusion here, so keep the most important point in mind: a "thread" is the atomic unit of dispatch in the CS environment, and it is a substantially different beast from the OS-provided threads you probably associate with the term. CS threads have their own identity and registers, but they do not have their own program counter (instruction pointer) or stack, nor are they scheduled individually.

In fact, "threads" in CS take the place that individual vertices had during Vertex Shading, or individual pixels during Pixel Shading. And they get treated the same way: assemble a bunch of them (usually somewhere between 16 and 64) into a "Warp" or "Wavefront" and let them run the same code in lockstep. CS threads don't get scheduled – Warps and Wavefronts do (I'll stick with "Warp" for the rest of this article; mentally substitute "Wavefront" for AMD). To hide latency, we don't switch to a different "thread" (in CS parlance), but to a different Warp, i.e. a different bundle of threads. Individual threads inside a Warp cannot take branches on their own: if at least one thread in such a bundle wants to execute a certain piece of code, all threads in the bundle process it – even if most of them end up throwing the results away. In short, CS "threads" are more like SIMD lanes than like the threads you see elsewhere in programming; keep that in mind.

That explains the "thread" and "warp" levels. Above that is the "thread group" level, which deals with – who would have thought? – groups of threads. The size of a thread group is specified during shader compilation. In DX11, a thread group can contain anywhere between 1 and 1024 threads, and the group size is specified not as a single number but as a 3-tuple giving thread counts in x, y and z. This numbering scheme is mostly for the convenience of shader code that addresses 2D or 3D resources, though it also allows for traversal optimizations. At the macro level, CS execution is dispatched in multiples of thread groups; thread group IDs in D3D11 again use 3D IDs, same as thread IDs, and for pretty much the same reasons. Thread IDs – which can be passed in various forms, depending on what the shader prefers – are the only input to Compute Shaders that is not the same for all threads; quite different from the other shader types we've seen so far.
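To make the thread / thread-group indexing concrete, here is a minimal sketch in CUDA (used throughout this post as a stand-in, since D3D11 HLSL syntax isn't the topic here): CUDA blocks play the role of thread groups, and blockIdx / threadIdx correspond to the 3D group ID and group-local thread ID. The kernel and buffer names are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Each thread derives a unique global index from its group (block) ID and its
// thread-within-group ID -- the thread index is the only per-thread input.
__global__ void fill_ids(unsigned int *out, unsigned int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = y * width + x;
}

int main()
{
    const unsigned int width = 256, height = 256;
    unsigned int *buf = nullptr;
    cudaMalloc((void **)&buf, width * height * sizeof(unsigned int));

    dim3 groupSize(16, 16, 1);                    // 3-tuple group size: 256 threads per group
    dim3 groupCount(width / 16, height / 16, 1);  // the dispatch: a 2D grid of groups
    fill_ids<<<groupCount, groupSize>>>(buf, width);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```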

2. Thread Groups

The above description makes it sound like thread groups are a fairly arbitrary middle level in this hierarchy. However, there is one important missing piece that makes thread groups very special indeed: Thread Group Shared Memory (TGSM). On DX11-level hardware, Compute Shaders have access to 32KB of TGSM, which is basically a scratchpad for communication between threads in the same group. This is the primary (and fastest) way for different CS threads to communicate.

So how is this implemented in hardware? Quite simply: all threads (well, Warps really) within a thread group get executed by the same shader unit. The shader unit then simply has at least 32KB (usually a bit more) of local memory. And because all grouped threads share the same shader unit (and hence the same set of ALUs etc.), there is no need for complicated arbitration or synchronization mechanisms for shared-memory access: only one Warp can access the memory in any given cycle, because only one Warp gets to issue instructions in any cycle! Of course, this process will usually be pipelined, but that does not change the basic invariant: per shader unit, we have exactly one piece of TGSM; accessing TGSM may take multiple pipeline stages, but the actual reads from (or writes to) TGSM happen within a single pipeline stage, and the memory accesses during that cycle all come from the same Warp.
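As a concrete (if simplified) illustration, here is a CUDA sketch of using a TGSM-style scratchpad: CUDA's __shared__ memory is the analog of TGSM and lives in the shader unit's local memory. The kernel assumes it is launched with 256 threads per group; the reversal itself is just a toy example.

```cuda
// Reverse each 256-element tile of `data`, staging values through on-chip
// shared memory (the CUDA counterpart of TGSM). Assumes 256 threads per group.
__global__ void reverse_tile(float *data)
{
    __shared__ float tile[256];                  // lives in the shader unit's local memory

    unsigned int local  = threadIdx.x;           // index within the group
    unsigned int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];                  // every thread writes one slot
    __syncthreads();                             // wait until the whole group has written
    data[global] = tile[255 - local];            // read a slot written by another thread
}
```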

However, this is not yet enough for actual shared-memory communication. The problem is simple: the invariant above guarantees that there is only one set of accesses to TGSM per cycle, even without any interlocks to prevent concurrent access. That's nice, since it keeps the hardware simpler and faster. It does not, however, guarantee that memory accesses happen in any particular order from the perspective of the shader program, since Warps can be scheduled more or less randomly; it all depends on who is runnable (i.e. not waiting for a memory access or texture read to complete) at a given point in time. Somewhat more subtly, precisely because the whole process is pipelined, writes to TGSM may take a few cycles to become "visible" to reads; this happens when the actual read and write operations occur in different pipeline stages (or different phases of the same stage). So we still need some kind of synchronization mechanism. Enter barriers. There are different types of barriers, but they are composed of just three fundamental components (a short CUDA-flavored sketch follows the list):

  • Group Synchronization. A group synchronization barrier forces all threads inside the current group to reach the barrier before any of them may proceed past it. Once a Warp reaches such a barrier, it is flagged as non-runnable, just as if it were waiting for a memory or texture access to complete. Once the last Warp reaches the barrier, the remaining Warps are reactivated. This all happens at the Warp scheduling level; it adds extra scheduling constraints, which may cause stalls, but no atomic memory transactions or anything of that sort are needed. Other than the utilization lost at the micro level, this is a fairly cheap operation.
  • Group Memory Barriers. Since all threads in a group run on the same shader unit, this basically amounts to a pipeline flush, to ensure that all pending shared-memory operations have completed. There is no need to synchronize with resources outside the current shader unit, which means it is again fairly cheap.
  • Device Memory Barriers. This blocks all threads in a group until all memory accesses – direct or indirect (e.g. via texture samples) – have completed. As explained earlier in this series, memory accesses and texture samples on GPUs have long latencies – think more than 600, often over 1000 cycles – so this kind of barrier really hurts.
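Here is a rough CUDA analog of the three components above. The mapping is approximate (D3D11's HLSL barrier intrinsics combine these ingredients differently), and the kernel assumes 128 threads per group:

```cuda
__global__ void barrier_components(float *device_buf)
{
    __shared__ float scratch[128];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = device_buf[i];

    // 1. Group synchronization: every thread (really, every Warp) in the group
    //    must arrive here before any of them may continue.
    __syncthreads();

    // 2. Group memory barrier: make pending shared-memory writes visible to the
    //    rest of the group, without also synchronizing execution.
    __threadfence_block();

    device_buf[i] = scratch[threadIdx.x ^ 1];    // exchange with a neighbor

    // 3. Device memory barrier: order our global-memory write with respect to
    //    the rest of the device -- the expensive one.
    __threadfence();
}
```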

DX11 offers different types of barriers that combine several of the above components into one atomic unit; the semantics should be obvious.

3. Unordered Access Views

We have now dealt with CS inputs and learned a bit about CS execution. But where do we put our output data? The answer goes by the unwieldy name "Unordered Access View", or UAV for short. A UAV seems somewhat similar to a render target in Pixel Shaders (and UAVs can in fact be used in addition to render targets in Pixel Shaders), but there are some very important semantic differences:

  • Most importantly – and as the name suggests – access to UAVs is "unordered", in the sense that the API does not guarantee accesses become visible in any particular order. When rendering primitives, quads are guaranteed to be Z-tested, blended and written back in API order (as discussed in detail in part 9 of this series), or at least to produce the same results as if they were – which takes substantial effort. UAVs make no such effort: UAV accesses happen immediately as they are encountered in the shader, which may be very different from API order. They are not completely unordered, though; while there is no guaranteed order of operations within an API call, the API and driver still collaborate to make sure that perceived sequential ordering is preserved across API calls. Thus, if you have a complex Compute Shader (or Pixel Shader) writing to a UAV, immediately followed by a second (simpler) CS that reads from the same underlying resource, the second CS will see the finished results, never partially-written output.
  • UAVs support random access. A Pixel Shader can only write to one location per render target – its corresponding pixel. The same Pixel Shader can write to arbitrary locations in any UAV it has bound.
  • UAVs support atomic operations. In the classic pixel pipeline there is no need for them; we guarantee there are never any collisions anyway. But with the free-form execution provided by UAVs, different threads might try to access the same piece of memory at the same time, so we need synchronization mechanisms to deal with that. (A small sketch illustrating both this point and the previous one follows the list.)
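A small CUDA-flavored sketch of the last two points together, with made-up buffer names: each thread scatters into a data-dependent location (random access), and concurrent updates to the same location are resolved with an atomic. In D3D11 this would be an RW resource plus an InterlockedAdd; the CUDA version is just the closest analog.

```cuda
// Build a 256-bin histogram: each thread scatters into a data-dependent bin
// (random access), and concurrent increments are resolved with an atomic.
__global__ void histogram(const unsigned char *in, unsigned int *bins,
                          unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);             // many threads may hit the same bin
}
```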

So from a "CPU programmer's" point of view, UAVs correspond to regular RAM in a shared-memory multiprocessing system: they are windows into memory. More interesting is the issue of atomic operations; this is one area where current GPUs diverge considerably from CPU designs.

4. Atomics

In current CPUs, most of the magic for shared-memory processing is handled by the memory hierarchy (i.e. the caches). To write to a piece of memory, the active core must first assert exclusive ownership of the corresponding cache line. This is accomplished using what is called a "cache coherency protocol", usually MESI and its descendants. The details are tangential to this article; what matters is that, because writing to memory entails acquiring exclusive ownership, there is never a risk of two cores simultaneously trying to write to the same location. In such a model, atomic operations can be implemented by holding exclusive ownership for the duration of the operation; if we held exclusive ownership the whole time, no one else can have been trying to write to the same location while we were performing the atomic operation. Again, the actual details get hairy pretty fast (especially once things like paging, interrupts and exceptions get involved), but the 30,000-feet view suffices for this article.

In this type of model, atomic operations are performed using the regular core ALUs and load/store units, and most of the "interesting" work happens in the caches. The advantage is that atomic operations are (more or less) regular memory accesses, albeit with some extra requirements. There are a couple of problems, though. Most importantly, the standard implementation of cache coherency, "snooping", requires all agents in the protocol to talk to each other, which has serious scalability issues. There are ways around this restriction (mainly so-called directory-based coherency protocols), but they add complexity and latency to memory accesses. The other issue is that all locks and memory transactions really happen at the cache-line level: if two unrelated but frequently updated variables share the same cache line, that line can end up "ping-ponging" between multiple cores, causing lots of coherency transactions (and the associated slowdown). This problem is called "false sharing". Software can avoid it by making sure unrelated fields don't fall into the same cache line; but on GPUs, neither the cache line size nor the memory layout during execution is known to or controlled by the application, so this problem would be more serious.
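For reference, this is the kind of fix CPU-side software applies; the 64-byte cache-line size is an assumption (typical on x86, but not universal), and the snippet is plain host-side C++:

```cuda
#include <atomic>

// Two counters that are frequently updated by different CPU cores. Without
// padding they would likely share a cache line and "ping-pong" between cores.
struct PaddedCounter {
    alignas(64) std::atomic<unsigned long long> value;   // one (assumed 64-byte) line each
};

PaddedCounter counters[2];   // counters[0] and counters[1] no longer interfere
```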

Current GPUs avoid this problem by structuring their memory hierarchy differently. Instead of handling atomic operations inside the shader units (which again raises the "who owns which memory" issue), there are dedicated atomic units that talk directly to a shared lowest-level cache. There is only one such cache, so the issue of coherency does not come up: either the cache line is present in the cache (which means it is current), or it isn't (which means the copy in memory is current). An atomic operation consists of first bringing the respective memory location into the cache (if it isn't there already), then performing the required read-modify-write operation directly on the cache contents using a dedicated integer ALU on the atomic unit. While an atomic unit is busy on a memory location, all other accesses to that location will stall. Since there are multiple atomic units, it is necessary to make sure they never try to access the same memory location at the same time; one easy way to accomplish this is to have each atomic unit "own" a certain set of addresses (statically – not dynamically, as with cache-line ownership). This is done by computing the index of the responsible atomic unit as some hash function of the memory address to be accessed. (Note that I can't confirm this is how current GPUs actually do it; I've found little detail on how the atomic units work in the official docs.)

If a shader unit wants to perform an atomic operation on a given memory address, it first needs to determine which atomic unit is responsible, wait until that unit is ready to accept new commands, and then submit the operation (and potentially wait until it is finished, if the result of the atomic operation is needed). The atomic unit might only process one command at a time, or it might have a small FIFO of outstanding requests; and of course there are all kinds of allocation and queuing details to get right so that atomic-operation processing is reasonably fair and shader units always make progress. Again, I won't go into further detail here.

One final remark: outstanding atomic operations of course count as "device memory" accesses, same as memory/texture reads and UAV writes. Shader units need to keep track of their outstanding atomics and make sure they have completed when a device memory barrier is hit.

5. Structured buffers and append/consume buffers

Unless I've missed something, these two buffer types are the last CS-related features I haven't talked about yet. And, from a hardware perspective, there really isn't that much to talk about. Structured buffers are more of a hint to the driver-internal shader compiler than anything else: they tell the driver a bit about how they will be used – namely, that they consist of elements with a fixed stride which are likely to be accessed together – but in the end they still compile down to regular memory accesses. Declaring a buffer as structured may bias the driver's decisions about its position and layout in memory, but it does not add any fundamentally new functionality to the model.

Append/consume buffers are similar; they could be implemented using the existing atomic instructions. In fact, they kind of are, except that the append/consume pointers are not at an explicit location in the resource; they are side-band data outside the resource that is accessed using special atomic instructions. (And, similarly to structured buffers, the fact that their usage is declared as append/consume allows the driver to pick their location in memory appropriately.)
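As a rough illustration of the "atomics plus a side-band counter" idea, here is a CUDA-flavored sketch of an append operation. In D3D11 the counter lives in hidden side-band storage managed by the driver; here it is simply a separate allocation, and all names are made up:

```cuda
// Minimal "append buffer": a data array plus a separate counter that plays the
// role of the hidden append pointer.
struct AppendBuffer {
    float        *items;   // the visible resource contents
    unsigned int *count;   // side-band append pointer (a separate allocation)
};

__device__ void append(AppendBuffer buf, float value)
{
    unsigned int slot = atomicAdd(buf.count, 1u);   // reserve a slot atomically
    buf.items[slot] = value;                        // then write the payload there
}

__global__ void append_odd(AppendBuffer buf, const int *in, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (in[i] & 1))
        append(buf, (float)in[i]);
}
```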

6. Wrap-up

And... that's it. No more previews of the next part – this series is done :), though that doesn't mean I'm done with it. I have some restructuring and partial rewriting to do; these blog posts are raw and unproofed, and I intend to go over them and turn them into a single document. In the meantime, I'll be writing about other things here. I'll try to incorporate the feedback I've received so far – if there are any other questions, corrections or comments, now is the time to tell me! I don't want to nail down an ETA for the final cleaned-up version of this series, but I'll try to get it done well before the end of the year. We'll see. Until then, thanks for reading!
