C++ and CUDA Notes

Published 2023-07-10 09:32:40 · Author: 阿坦

The relationship between grid, block, thread, warp and SM, SP in CUDA

Reposted from: CUDA中grid、block、thread、warp与SM、SP的关系_block sm_AplusX的博客-CSDN博客

First, an overview of these concepts. SM (Streaming Multiprocessor) and SP (Streaming Processor) are hardware-level concepts, and one SM contains multiple SPs. A thread is a single thread of execution; multiple threads make up a thread block (block), and multiple blocks in turn make up a thread grid (grid).

Now, how a kernel function executes. A kernel launch has one grid, the grid contains a number of blocks, and each block is a group of threads. Threads within the same block can communicate and synchronize through shared memory, while threads in different blocks cannot communicate with each other.

During actual execution, a CUDA device works at block granularity, handing blocks one by one to SMs for computation; the threads within a block are in turn grouped into warps for execution. The warp size on current CUDA hardware is 32, meaning 32 threads are grouped into one warp that executes together. Threads in the same warp execute the same instruction, only on different data.

Warp grouping is basically done automatically by the SM, in consecutive order. For example, a block of 128 threads is split into four warps: threads 0-31 form warp 1, threads 32-63 warp 2, threads 64-95 warp 3, and threads 96-127 warp 4. If the number of threads in a block is not a multiple of 32, the leftover threads still form a warp of their own; with 66 threads, for instance, there are three warps: 0-31, 32-63, and 64-65. Since the last warp holds only two threads, the compute capacity of the other 30 lanes is effectively wasted when it runs. This is something to watch for when choosing the number of threads per block!
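As a quick check of the grouping rule above, this small host-side sketch (my own addition, using the warp size of 32 stated in the text) computes how many warps a block occupies and how many lanes of its last warp sit idle:

#include <cstdio>

int main() {
    const int warpSize = 32;  // warp width of current CUDA hardware
    for (int threadsPerBlock : {128, 66}) {
        int warps = (threadsPerBlock + warpSize - 1) / warpSize;  // round up to whole warps
        int idleLanes = warps * warpSize - threadsPerBlock;       // wasted lanes in the last warp
        printf("%d threads -> %d warps, %d idle lanes\n",
               threadsPerBlock, warps, idleLanes);
    }
    return 0;  // prints: 128 -> 4 warps, 0 idle; 66 -> 3 warps, 30 idle
}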

An SM distributes warps according to the number of SPs it has, but it will not necessarily run all of a warp's instructions to completion in one go: when the currently running warp has to wait (for example, a global-memory access takes quite a while), the SM switches to another warp and keeps computing, thereby avoiding time wasted on stalls. In theory, the most efficient situation is to have enough warps resident on the SM to switch between, so that the case of "all warps waiting" never arises during execution; when all warps are waiting, the SM has nothing left to do.

In fact the warp is also the smallest unit each SM executes in CUDA; if the GPU has 16 SMs, the number of threads genuinely executing at any instant is 32*16. However, since CUDA hides per-thread latency and stalls through warp switching to achieve massive parallelism, the term "active threads" is used for the number of threads one SM can keep in flight at the same time. On the block side, one SM can host multiple thread blocks; once all threads of a resident block have finished, it picks up another unprocessed block. Suppose there are 16 SMs, 64 blocks, and each SM can hold three blocks at a time: execution starts with the device processing 48 blocks at once, and the remaining 16 blocks enter an SM as earlier blocks complete, until all blocks are done.

Under the CUDA architecture, the smallest unit the GPU chip executes is the thread.
Several threads form a thread block (block). Threads within one block can access the same shared memory, allowing fast synchronization and communication.
The number of threads a block can contain is limited. Blocks that execute the same program form a grid. Threads in different blocks cannot access the same shared memory, so they cannot communicate or synchronize directly.
Different grids can execute different programs (kernels).
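A minimal launch sketch of this grid/block/thread hierarchy (the kernel name scale and the sizes here are illustrative, not from the original post):

#include "cuda_runtime.h"

__global__ void scale(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x = 0;
    cudaMalloc((void**)&d_x, n * sizeof(float));

    dim3 block(256);              // 256 threads per block = 8 full warps
    dim3 grid(n / block.x);       // n is an exact multiple of 256 here
    scale<<<grid, block>>>(d_x);  // one grid -> many blocks -> many threads

    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}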

For example:

1: Suppose an SM has 8 SPs. When the SM executes a warp, there are 32 threads, and those 32 threads run on the 8 SPs over 4 passes, rotating 8 at a time. Strictly speaking they do not all execute at the same instant; the rotation only hides latency. But since the software layer abstracts this away, we can say they execute simultaneously.
2: When an SM has more SPs, for example GP100 with 64 SPs per SM, the threads are not necessarily spread evenly across them; check the official documentation for the specific architecture. In the usual case it is still 8 SPs running 4 passes, which means that when your workload runs 32 threads, an SM with 64 SPs still effectively uses 8 SPs, the same as an SM with only 8 SPs. So an SM with 64 SPs can run at most 8 warps in parallel, i.e. 8 × 32 = 256 threads, which under normal circumstances already saturates it; when an architecture can reduce latency further, the abstraction allows up to 1024 threads.

 

Computing CUDA thread IDs

Reposted from: CUDA学习-计算实际线程ID_cuda中线程id_UUUUFUUUU的博客-CSDN博客

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
 
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
 
using namespace std;
 
//thread 1D
__global__ void testThread1(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = b[i] - a[i];
}
//thread 2D
__global__ void testThread2(int *c, const int *a, const int *b)
{
    int i = threadIdx.x + threadIdx.y*blockDim.x;
    c[i] = b[i] - a[i];
}
 
//thread 3D
__global__ void testThread3(int *c, const int *a, const int *b)
{
    int i = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    c[i] = b[i] - a[i];
}
 
//block 1D
__global__ void testBlock1(int *c, const int *a, const int *b)
{
    int i = blockIdx.x;
    c[i] = b[i] - a[i];
}
 
//block 2D
__global__ void testBlock2(int *c, const int *a, const int *b)
{
    int i = blockIdx.x + blockIdx.y*gridDim.x;
    c[i] = b[i] - a[i];
}
 
//block 3D
__global__ void testBlock3(int *c, const int *a, const int *b)
{
    int i = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    c[i] = b[i] - a[i];
}
 
//block-thread 1D-1D
__global__ void testBlockThread1(int *c, const int *a, const int *b)
{
    int i = threadIdx.x + blockDim.x*blockIdx.x;
    c[i] = b[i] - a[i];
}
 
//block-thread 1D-2D
__global__ void testBlockThread2(int *c, const int *a, const int *b)
{
    int threadId_2D = threadIdx.x + threadIdx.y*blockDim.x;
    int i = threadId_2D+ (blockDim.x*blockDim.y)*blockIdx.x;
    c[i] = b[i] - a[i];
}
 
//block-thread 1D-3D
__global__ void testBlockThread3(int *c, const int *a, const int *b)
{
    int threadId_3D = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    int i = threadId_3D + (blockDim.x*blockDim.y*blockDim.z)*blockIdx.x;
    c[i] = b[i] - a[i];
}
 
//block-thread 2D-1D
__global__ void testBlockThread4(int *c, const int *a, const int *b)
{
    int blockId_2D = blockIdx.x + blockIdx.y*gridDim.x;
    int i = threadIdx.x + blockDim.x*blockId_2D;
    c[i] = b[i] - a[i];
}
 
//block-thread 3D-1D
__global__ void testBlockThread5(int *c, const int *a, const int *b)
{
    int blockId_3D = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    int i = threadIdx.x + blockDim.x*blockId_3D;
    c[i] = b[i] - a[i];
}
 
//block-thread 2D-2D
__global__ void testBlockThread6(int *c, const int *a, const int *b)
{
    int threadId_2D = threadIdx.x + threadIdx.y*blockDim.x;
    int blockId_2D = blockIdx.x + blockIdx.y*gridDim.x;
    int i = threadId_2D + (blockDim.x*blockDim.y)*blockId_2D;
    c[i] = b[i] - a[i];
}
 
//block-thread 2D-3D
__global__ void testBlockThread7(int *c, const int *a, const int *b)
{
    int threadId_3D = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    int blockId_2D = blockIdx.x + blockIdx.y*gridDim.x;
    int i = threadId_3D + (blockDim.x*blockDim.y*blockDim.z)*blockId_2D;
    c[i] = b[i] - a[i];
}
 
//block-thread 3D-2D
__global__ void testBlockThread8(int *c, const int *a, const int *b)
{
    int threadId_2D = threadIdx.x + threadIdx.y*blockDim.x;
    int blockId_3D = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    int i = threadId_2D + (blockDim.x*blockDim.y)*blockId_3D;
    c[i] = b[i] - a[i];
}
 
//block-thread 3D-3D
__global__ void testBlockThread9(int *c, const int *a, const int *b)
{
    int threadId_3D = threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y;
    int blockId_3D = blockIdx.x + blockIdx.y*gridDim.x + blockIdx.z*gridDim.x*gridDim.y;
    int i = threadId_3D + (blockDim.x*blockDim.y*blockDim.z)*blockId_3D;
    c[i] = b[i] - a[i];
}
 
void addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
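    // Note: error checking of the CUDA API calls is omitted throughout this sample for brevity.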
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
 
    cudaSetDevice(0);
 
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));
 
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
 
 
    testThread1<<<1, size>>>(dev_c, dev_a, dev_b);
 
    //uint3 s;s.x = size/5;s.y = 5;s.z = 1;
    //testThread2 <<<1,s>>>(dev_c, dev_a, dev_b);
 
    //uint3 s; s.x = size / 10; s.y = 5; s.z = 2;
    //testThread3<<<1, s >>>(dev_c, dev_a, dev_b);
 
    //testBlock1<<<size,1 >>>(dev_c, dev_a, dev_b);
 
    //uint3 s; s.x = size / 5; s.y = 5; s.z = 1;
    //testBlock2<<<s, 1 >>>(dev_c, dev_a, dev_b);
 
    //uint3 s; s.x = size / 10; s.y = 5; s.z = 2;
    //testBlock3<<<s, 1 >>>(dev_c, dev_a, dev_b);
 
    //testBlockThread1<<<size/10, 10>>>(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = size / 100; s1.y = 1; s1.z = 1;
    //uint3 s2; s2.x = 10; s2.y = 10; s2.z = 1;
    //testBlockThread2 << <s1, s2 >> >(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = size / 100; s1.y = 1; s1.z = 1;
    //uint3 s2; s2.x = 10; s2.y = 5; s2.z = 2;
    //testBlockThread3 << <s1, s2 >> >(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = 10; s1.y = 10; s1.z = 1;
    //uint3 s2; s2.x = size / 100; s2.y = 1; s2.z = 1;
    //testBlockThread4 << <s1, s2 >> >(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = 10; s1.y = 5; s1.z = 2;
    //uint3 s2; s2.x = size / 100; s2.y = 1; s2.z = 1;
    //testBlockThread5 << <s1, s2 >> >(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = size / 100; s1.y = 10; s1.z = 1;
    //uint3 s2; s2.x = 5; s2.y = 2; s2.z = 1;
    //testBlockThread6 << <s1, s2 >> >(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = size / 100; s1.y = 5; s1.z = 1;
    //uint3 s2; s2.x = 5; s2.y = 2; s2.z = 2;
    //testBlockThread7 << <s1, s2 >> >(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = 5; s1.y = 2; s1.z = 2;
    //uint3 s2; s2.x = size / 100; s2.y = 5; s2.z = 1;
    //testBlockThread8 <<<s1, s2 >>>(dev_c, dev_a, dev_b);
 
    //uint3 s1; s1.x = 5; s1.y = 2; s1.z = 2;
    //uint3 s2; s2.x = size / 200; s2.y = 5; s2.z = 2;
    //testBlockThread9<<<s1, s2 >>>(dev_c, dev_a, dev_b);
 
    cudaMemcpy(c, dev_c, size*sizeof(int), cudaMemcpyDeviceToHost);
 
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
 
    cudaGetLastError();
 
}
int main()
{
    const int n = 1000;
 
    int *a = new int[n];
    int *b = new int[n];
    int *c = new int[n];
    int *cc = new int[n];
 
    for (int i = 0; i < n; i++)
    {
        a[i] = rand() % 100;
        b[i] = rand() % 100;
        c[i] = b[i] - a[i];
    }
 
    addWithCuda(cc, a, b, n);
    for (int i = 0; i < n; i++)
        printf("%d %d\n", c[i], cc[i]);
 
    delete[] a;
    delete[] b;
    delete[] c;
    delete[] cc; 
 
    return 0;
}
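All nine block/thread variants above follow a single pattern: first linearize the thread index within its block (x + y·Dx + z·Dx·Dy over blockDim), then linearize the block index within the grid the same way over gridDim, and combine the two. A general helper in the same spirit (a summary sketch of mine, not part of the reposted code):

// Valid for any grid/block dimensionality: unused dimensions are simply 1.
__device__ int globalThreadId()
{
    int threadId = threadIdx.x + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x
                 + blockIdx.z * gridDim.x * gridDim.y;
    return threadId + blockId * (blockDim.x * blockDim.y * blockDim.z);
}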

 

A beginner's CUDA example

Reposted from: CUDA入门示例1:两个整型数组相加 - 知乎 (zhihu.com)

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
 
#include <stdio.h>

 /***************************************
  * Declare addWithCuda, the host-side entry point for invoking the GPU computation; its return type is cudaError_t.
  * cudaError_t is an enum type commonly returned by CUDA-related functions, used to detect the various kinds of
  * errors that can occur while a function executes; it is defined in the driver_types.h header. On success it is 0.
***************************************/
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

/***************************************
 * Declare the addKernel kernel. The __global__ qualifier tells the compiler that this function runs on the GPU,
   so it is compiled by NVCC.
   *
    __global__ : runs on the device, must be launched from the host, and must return void
    __device__ : runs on the device and can only be called from device code
    __host__   : runs on the host and can only be called from host code; it can be omitted. Combined with
                 __device__, the function is compiled for both host and device.
   *
 * threadIdx.x gives the ID of the thread executing the current work item; the number of threads is fixed
   at launch time with the execution-configuration syntax "<<<>>>".
   *
     The GPU uses this index to locate the thread resource that performs the computation.
     A thread is the basic computational unit of the GPU; threads compute independently and can be indexed in 1, 2, or 3 dimensions.
     A thread block (block) is a group of threads; it contains many independent threads and can be 1-, 2-, or 3-dimensional.
     A grid is a group of blocks; it contains many independent blocks and can be 1- or 2-dimensional (3-dimensional on devices of compute capability 2.0 and later).
     Specifying grid --> block --> thread pins down a concrete thread resource.
   *
 * The kernel is executed simultaneously by many GPU threads; the threads do not communicate and are mutually
   independent. Exactly which threads execute it is determined by the execution configuration "<<<>>>".
 * Concretely, <<<num, size>>> allocates num thread blocks with size threads each; num and size can be 1-, 2-, or 3-dimensional.
 * The parameters inside "<<<>>>" are not arguments passed to the device code; they tell the host runtime how to launch the kernel.
 * The i below can be read as the thread number: the kernel runs on many threads at once, each seeing its own i.
***************************************/
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

 /***************************************
 *  Main function
 ***************************************/
int main()
{
    // Initialize inputs
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };
 
    // Launch the parallel computation
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
 
    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);
 
    // cudaDeviceReset must be called before exiting so that profiling and tracing tools capture the complete trace.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
 
    return 0;
}

/***************************************
 * Helper function that performs the host-side work around the kernel launch
***************************************/
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;
 
    // Select the GPU; on a multi-GPU system a specific device number can be chosen
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers (two inputs, one output)
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
 
    // Copy the input vectors from host memory into the GPU buffers
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
 
    // Launch the kernel on the GPU with one thread per element
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
 
    // Check for errors at launch; returns the most recent runtime error
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
 
    // cudaDeviceSynchronize waits for the kernel to finish and returns
    // any errors encountered during the launch: it blocks until every
    // thread has completed its own work before execution continues.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
 
    // Copy the result from the GPU buffer back into host memory
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }
 
Error:
    // Free the device buffers
       cudaFree(dev_c);
       cudaFree(dev_a);
       cudaFree(dev_b);
 
    return cudaStatus;
}
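The repeated if (cudaStatus != cudaSuccess) { ... goto Error; } blocks above are often folded into a checking macro. A common sketch (my own addition, not from the reposted article):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wraps a CUDA runtime call; prints the error string and aborts on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&dev_c, size * sizeof(int)));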

 

Multithreading in C++

#include <iostream>
#include <thread>
#include <vector>

using namespace std;

// Parallel worker: multiply the elements of data in [start, end) by 2
void parallel_function(int start, int end, vector<int>& data) {
    for (int i = start; i < end; i++) {
        data[i] = data[i] * 2;
    }
}

int main() {
    vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int num_threads = 4;
    int chunk_size = data.size() / num_threads;

    vector<thread> threads;
    // Create the threads: split the vector into num_threads chunks, one chunk per thread
    for (int i = 0; i < num_threads; i++) {
        int start = i * chunk_size;
        int end = (i == num_threads - 1) ? data.size() : (i + 1) * chunk_size;
        threads.push_back(thread(parallel_function, start, end, ref(data)));
    }

    // Wait for all threads to finish
    for (auto& t : threads) {
        t.join();
    }

    // Print the processed vector
    for (size_t i = 0; i < data.size(); i++) {
        cout << data[i] << " ";
    }
    cout << endl;

    return 0;
}
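The same chunked pattern can also be written with std::async, which returns futures and removes the manual join loop (a sketch assuming C++11 or later; parallel_apply is a hypothetical name):

#include <future>
#include <vector>

void parallel_apply(std::vector<int>& data, int num_threads) {
    std::vector<std::future<void>> futures;
    int chunk_size = static_cast<int>(data.size()) / num_threads;
    for (int i = 0; i < num_threads; i++) {
        int start = i * chunk_size;
        int end = (i == num_threads - 1) ? static_cast<int>(data.size())
                                         : (i + 1) * chunk_size;
        // std::launch::async forces a real thread instead of deferred execution.
        futures.push_back(std::async(std::launch::async, [start, end, &data] {
            for (int j = start; j < end; j++) data[j] *= 2;
        }));
    }
    for (auto& f : futures) f.get();  // wait for every chunk to finish
}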

 

Measuring code run time in C++

#include <chrono>

using namespace std::chrono;

int main(int argc, char* argv[]) {

    auto start = high_resolution_clock::now();

    // ... code to time ...

    auto duration = duration_cast<microseconds>(high_resolution_clock::now() - start);
    long long time = duration.count(); // elapsed time in microseconds (us)
    return 0;

}
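For repeated measurements the same idea can be wrapped in a small RAII timer that reports when the scope exits (a sketch of my own, assuming C++11 or later; ScopeTimer is a hypothetical name):

#include <chrono>
#include <cstdio>

// Prints the elapsed time of the enclosing scope when it is destroyed.
struct ScopeTimer {
    const char* label;
    std::chrono::high_resolution_clock::time_point start =
        std::chrono::high_resolution_clock::now();
    explicit ScopeTimer(const char* l) : label(l) {}
    ~ScopeTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::high_resolution_clock::now() - start).count();
        printf("%s: %lld us\n", label, static_cast<long long>(us));
    }
};

// Usage:
// {
//     ScopeTimer t("block under test");
//     // ... code to time ...
// }   // prints "block under test: N us" here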