Professional CUDA C Programming: 1.3 - A Review of CUDA Fundamentals

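This post mainly collects the notes I started writing many years ago (in 2013) while learning CUDA. I have not written CUDA code for a long time, and now that LLM inference requires me to relearn CUDA programming, it seems that what you skip early on you eventually have to pay back.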

1.CUDA arrays
Explanation: CUDA arrays are allocated with cudaMallocArray() or cudaMalloc3DArray() and released with cudaFreeArray().

2.OpenGL/DirectX Interoperability
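Explanation: OpenGL frame buffers and DirectX vertex buffers can be mapped into the address space that CUDA operates on, so that CUDA can read and write the data they hold. Interoperating between OpenGL and CUDA mainly involves registering and unregistering buffer objects, and mapping and unmapping them, as follows:
(1) cudaGLRegisterBufferObject(): registers a buffer object.
(2) cudaGLUnregisterBufferObject(): unregisters a buffer object.
(3) cudaGLMapBufferObject(): maps a buffer object. Once it is mapped, CUDA can read and write the buffer object through the device memory address that the call returns.
(4) cudaGLUnmapBufferObject(): unmaps a buffer object.
These cudaGL* calls are deprecated in current toolkits in favor of the cudaGraphics* interoperability API.
Interoperating between Direct3D and CUDA mainly involves setting the Direct3D device, registering resources, mapping resources, querying information about the mapped resources, unmapping, and unregistering. Taking Direct3D 9 as an example: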
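(1) cudaD3D9SetDirect3DDevice(): sets the Direct3D device.
(2) cudaD3D9RegisterResource(): registers a resource.
(3) cudaD3D9MapResources(): maps resources.
(4) cudaD3D9ResourceGetMappedPointer(): gets the CUDA device memory address of a mapped resource.
(5) cudaD3D9ResourceGetMappedSize(): gets the size of a mapped resource.
(6) cudaD3D9ResourceGetMappedPitch(): gets the pitch of a mapped resource.
(7) cudaD3D9UnmapResources(): unmaps resources.
(8) cudaD3D9UnregisterResource(): unregisters a resource.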

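3.CUDA software environment
Explanation:
(1) NVIDIA Jetson TK1: a GPU-based embedded development board from NVIDIA.
(2) NVRTC (NVIDIA Runtime Compilation): a runtime compilation library for CUDA C++.
(3) cuSolver: a high-level package built on the cuBLAS and cuSPARSE libraries.
(4) ptxas: the PTX assembler.
(5) cuobjdump: a tool that dumps CUDA object files.
(6) nvidia-smi: the NVIDIA System Management Interface.
(7) CUDA Binary Utilities: cuobjdump, nvdisasm, nvprune.
(8) CUDA-MEMCHECK: a standalone memory-checking utility shipped with the CUDA toolkit.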

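4.cudaGetLastError and cudaGetErrorString
Explanation:
(1) cudaError_t cudaGetLastError(void): returns the most recent error produced by a runtime call in the same host thread and resets it to cudaSuccess.
(2) const char* cudaGetErrorString(cudaError_t error): returns the description string for an error code.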
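The two calls are usually combined into a small checking macro. The following is a minimal sketch, not the macro from the CUDA samples: CUDA_CHECK and the fill kernel are hypothetical names introduced here for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA toolkit): abort with a
// readable message when a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Illustrative kernel: writes the same value into every element.
__global__ void fill(int *data, int value, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = value;
}

int main()
{
    const int n = 1024;
    int *d_data = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_data, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(d_data, 7, n);
    // A kernel launch returns no status, so query the launch error explicitly.
    CUDA_CHECK(cudaGetLastError());
    // Errors raised while the kernel runs surface at the next synchronization.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_data));
    printf("all CUDA calls succeeded\n");
    return 0;
}
```

Checking cudaGetLastError() right after the launch and again after a synchronization separates launch-configuration errors from errors that occur while the kernel executes.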

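5.Zero-copy memory
Explanation: A CUDA kernel can access host memory directly, without first copying it to the GPU, as follows:
(1) Allocate mapped host memory: cudaHostAlloc((void**)&host_data_to_device, size_in_bytes, cudaHostAllocMapped);
(2) Get the device-side pointer: cudaHostGetDevicePointer(&dev_host_data_to_device, host_data_to_device, 0);
Note: Zero-copy memory suits compute-intensive programs that read and write the data only a few times.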
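The two steps above can be sketched end to end as follows. This is an illustrative example under stated assumptions: the scale kernel and the zeroCopyScale helper are hypothetical names, and error handling is reduced to returning -1.0f.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each access below goes to host memory over the bus, not to device DRAM.
__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

// Hypothetical helper: fills a mapped host buffer with 1.0f, scales it on the
// GPU through its device alias, and returns element 0 (-1.0f on failure).
float zeroCopyScale(int n, float factor)
{
    float *host_data = NULL;
    float *dev_alias = NULL;
    size_t size_in_bytes = n * sizeof(float);

    if (cudaHostAlloc((void **)&host_data, size_in_bytes,
                      cudaHostAllocMapped) != cudaSuccess)
        return -1.0f;
    for (int i = 0; i < n; ++i) host_data[i] = 1.0f;

    if (cudaHostGetDevicePointer((void **)&dev_alias, host_data, 0) != cudaSuccess) {
        cudaFreeHost(host_data);
        return -1.0f;
    }

    scale<<<(n + 255) / 256, 256>>>(dev_alias, factor, n);
    // The kernel wrote straight into host memory; wait before reading it.
    cudaError_t err = cudaDeviceSynchronize();
    float first = (err == cudaSuccess) ? host_data[0] : -1.0f;

    cudaFreeHost(host_data);
    return first;
}

int main()
{
    // Older toolkits require this flag before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    printf("first element after scaling: %.1f\n", zeroCopyScale(1 << 16, 2.0f));
    return 0;
}
```

No cudaMemcpy appears anywhere: the only synchronization point is cudaDeviceSynchronize(), which is what makes the kernel's writes safe to read on the host.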

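6.static functions
Explanation: A static function (a function with internal linkage) is visible only in the file in which it is declared and cannot be called from other files.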

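7.cudaError_t and checkCudaErrors
Explanation: cudaError_t is the status type returned by runtime calls, and checkCudaErrors is a checking macro from the helper headers of the CUDA samples, which are included as follows:
(1) #include <helper_cuda.h>
(2) #include <helper_functions.h>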

8.texture<type, dimension, readtype> texreference;[4]
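Explanation:
(1) type: int, uchar, float, and so on.
(2) dimension: 1, 2, or 3.
(3) readtype: cudaReadModeNormalizedFloat (normalized) or cudaReadModeElementType (the default).
Note: Texture references are a legacy API. They were deprecated in CUDA 11 and removed in CUDA 12, where texture objects replace them.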

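9.Binding texture memory[5]
Explanation: Texture memory can be bound in two ways: to a one-dimensional region of linear memory allocated with cudaMalloc() and filled with cudaMemcpy(), or to a two- or three-dimensional CUDA array allocated with cudaMallocArray() and filled with cudaMemcpyToArray().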

10.cudaCreateChannelDesc
Explanation: __host__ cudaChannelFormatDesc cudaCreateChannelDesc(int x, int y, int z, int w, cudaChannelFormatKind f): returns a channel descriptor using the specified format.
Note: cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

11.cudaBindTextureToArray
Explanation: __host__ cudaError_t cudaBindTextureToArray(const textureReference* texref, cudaArray_const_t array, const cudaChannelFormatDesc* desc): binds an array to a texture.
Note: texref is the texture to bind; array is the memory array on the device; desc is the channel format.

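12.CUDA synchronization functions
Explanation:
(1) cudaDeviceSynchronize(): blocks the calling CPU thread until the GPU has finished all outstanding CUDA work, including kernels and data copies.
(2) cudaThreadSynchronize(): essentially the same as cudaDeviceSynchronize(); it is the older, deprecated name.
(3) cudaStreamSynchronize(): takes a stream handle and blocks the CPU until the GPU has finished the CUDA work in that stream, regardless of whether work in other streams has finished.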

13.cudaGetLastError
Explanation: __host__ __device__ cudaError_t cudaGetLastError(void): returns the last error from a runtime call.

14.The PGM image format
Explanation: PGM stands for Portable Gray Map; it is one of the simplest standard formats for grayscale images.

15.CUDA architectures
Explanation: Tesla, Fermi, Kepler, Maxwell, Pascal, and Volta.

16.tex2D
Explanation: Reading texture memory from within a kernel is called a texture fetch. Data in texture memory is read with tex2D().

17.simpleTexture.cu code walkthrough
Explanation:

#include <iostream>
#include <cuda_runtime.h>
#include <helper_image.h>
using namespace std;
 
const char *imagePath = "./data/lena_bw.pgm";
const char *outputFilename = "./data/lena_bw_out.pgm";
const float angle = 0.5f;
// Texture reference for 2D float texture
texture<float, 2, cudaReadModeElementType> tex;
 
// @param outputData  output data in global memory
__global__ void transformKernel(float *outputData, int width, int height, float theta)
{   
 // calculate normalized texture coordinates
 unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
 unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    
 float u = (float)x - (float)width / 2;
 float v = (float)y - (float)height / 2;
 
 u /= (float)width;
 v /= (float)height;
 
 float tu = u*cosf(theta) - v*sinf(theta);
 float tv = v*cosf(theta) + u*sinf(theta);
    
 // read from texture and write to global memory
 outputData[y * width + x] = tex2D(tex, tu + 0.5f, tv + 0.5f);
}

int main(int argc, char **argv)
{   
 // Load the image first so that width and height hold its real dimensions
 float *hData = NULL;
 unsigned int width = 0, height = 0;
 sdkLoadPGM(imagePath, &hData, &width, &height);
 unsigned int size = width * height * sizeof(float);
 
 // Allocate device memory for result
 float *dData = NULL;
 cudaMalloc((void **)&dData, size);
 
 // Allocate array and copy image data
 cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
 cudaArray *cuArray;
 cudaMallocArray(&cuArray, &channelDesc, width, height);
 cudaMemcpyToArray(cuArray, 0, 0, hData, size, cudaMemcpyHostToDevice);
 
 // Set texture parameters
 tex.addressMode[0] = cudaAddressModeWrap;
 tex.addressMode[1] = cudaAddressModeWrap;
 tex.filterMode = cudaFilterModeLinear;
 tex.normalized = true;
 
 // Bind the array to the texture
 cudaBindTextureToArray(tex, cuArray, channelDesc);
 
 dim3 dimBlock(8, 8, 1);
 dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
 
 // Execute the kernel
 transformKernel<<<dimGrid, dimBlock, 0>>>(dData, width, height, angle);
    
 // Allocate mem for the result on host side
 float *hOutputData = (float *)malloc(size);
 // copy result from device to host
 cudaMemcpy(hOutputData, dData, size, cudaMemcpyDeviceToHost);
 sdkSavePGM(outputFilename, hOutputData, width, height);
 
 // Release device and host resources
 cudaFree(dData);
 cudaFreeArray(cuArray);
 free(hOutputData);
 free(hData);
 return 0;
}

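18.texture<Type, Dim, ReadMode> texRef;
Explanation:
(1) Type: the data type.
(2) Dim: the dimensionality of the texture reference.
(3) ReadMode: cudaReadModeNormalizedFloat or cudaReadModeElementType.
Note: A texture reference can only be declared as a static global variable and cannot be passed as a function argument.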

19.struct cudaChannelFormatDesc { int x, y, z, w; enum cudaChannelFormatKind f;};
Explanation:
(1) The members x, y, z, and w give the number of bits in each component of a texture element. For example, a texture that holds a single float component has x equal to 32 and the other members equal to 0.
(2) The member f, of the enum type cudaChannelFormatKind, gives the data type: signed integer (cudaChannelFormatKindSigned), unsigned integer (cudaChannelFormatKindUnsigned), or floating point (cudaChannelFormatKindFloat).

20.struct cudaChannelFormatDesc cudaCreateChannelDesc(int x,int y,int z,int w,enum cudaChannelFormatKind f);
Explanation: The function cudaCreateChannelDesc creates a cudaChannelFormatDesc structure.

21.cudaError_t cudaMallocArray(struct cudaArray** array, const struct cudaChannelFormatDesc* desc, size_t width, size_t height);
Explanation: Allocates a CUDA array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA array in *array.

22.cudaError_t cudaMemcpyToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t count, enum cudaMemcpyKind kind);
Explanation: Copies count bytes from the memory area pointed to by src to the CUDA array dstArray, starting at the upper-left corner (dstX, dstY). kind gives the direction of the copy and is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.

23.struct textureReference {int normalized; enum cudaTextureFilterMode filterMode; enum cudaTextureAddressMode addressMode[2]; struct cudaChannelFormatDesc channelDesc;}[1]
Explanation:
(1) normalized: whether texture coordinates are normalized.
(2) filterMode: the filtering mode, cudaFilterModePoint or cudaFilterModeLinear.
(3) addressMode: the addressing mode, cudaAddressModeClamp or cudaAddressModeWrap.

24.cudaError_t cudaBindTextureToArray(const struct textureReference* texRef, const struct cudaArray* array, const struct cudaChannelFormatDesc* desc);
Explanation: Binds the CUDA array array to the texture reference texRef. desc describes how the memory is interpreted when fetching from the texture. Any memory previously bound to texRef is unbound.

25.Texture fetches from CUDA arrays
Explanation:
(1) template<class Type, enum cudaTextureReadMode readMode> Type tex1D(texture<Type, 1, readMode> texRef, float x);
(2) template<class Type, enum cudaTextureReadMode readMode> Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
Note: tex1D and tex2D use the texture coordinates x and y to fetch from the region of the CUDA array bound to the texture reference texRef.
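Because texture references no longer compile with CUDA 12, the sketch below performs the same kind of fetch with the texture object API instead. It is a swapped-in technique, not the method these notes describe: readTexture and textureRoundTrip are hypothetical names, and the example uses point filtering with unnormalized coordinates so that each fetch returns exactly one texel.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

__global__ void readTexture(cudaTextureObject_t tex, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // Unnormalized coordinates: adding 0.5f addresses the centre of texel (x, y).
    out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

// Hypothetical helper: copies an image into a CUDA array, reads it back
// through a texture object, and reports whether the data survived intact.
bool textureRoundTrip(int width, int height)
{
    std::vector<float> h_in(width * height), h_out(width * height, -1.0f);
    for (int i = 0; i < width * height; ++i) h_in[i] = (float)i;

    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t array = NULL;
    if (cudaMallocArray(&array, &desc, width, height) != cudaSuccess) return false;
    cudaMemcpy2DToArray(array, 0, 0, h_in.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc;              // what the texture reads from
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc;               // how the texture is sampled
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    float *d_out = NULL;
    cudaMalloc((void **)&d_out, width * height * sizeof(float));
    dim3 block(8, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    readTexture<<<grid, block>>>(tex, d_out, width, height);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, width * height * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(d_out);
    return err == cudaSuccess && h_in == h_out;
}

int main()
{
    printf("round trip %s\n", textureRoundTrip(64, 64) ? "matched" : "failed");
    return 0;
}
```

Unlike a texture reference, the texture object is an ordinary value passed to the kernel as an argument, so it does not have to be a static global variable.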

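26.CPU/GPU concurrency
Explanation: CPU/GPU concurrency is the CPU's ability to keep working after it has submitted requests to the GPU. Its most important use is to hide the overhead of the work submitted to the GPU.

27.cudaError_t cudaThreadSynchronize(void);
Explanation: Blocks until all requested work on the device has completed. cudaThreadSynchronize() returns an error if one of the tasks failed.

28.cudaError_t cudaThreadExit(void);
Explanation: Releases all runtime-related resources associated with the calling host thread. Any subsequent API call reinitializes the runtime.

29.cudaError_t cudaStreamCreate(cudaStream_t* stream);
Explanation: Creates a stream.

30.cudaError_t cudaStreamQuery(cudaStream_t stream);
Explanation: Returns cudaSuccess if all operations in the stream have completed, and cudaErrorNotReady otherwise.

31.cudaError_t cudaStreamSynchronize(cudaStream_t stream);
Explanation: Blocks until the device has completed all operations in the stream.

32.cudaError_t cudaStreamDestroy(cudaStream_t stream);
Explanation: Destroys a stream.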

33.cudaError_t cudaEventCreate(cudaEvent_t* event);
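Explanation: Creates an event.

34.cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
Explanation: Records an event. If stream is non-zero, the event is recorded once all preceding operations in that stream have completed; otherwise it is recorded once all preceding operations in the CUDA context have completed. Because the call is asynchronous, cudaEventQuery() and/or cudaEventSynchronize() must be used to determine when the event has actually been recorded.

35.cudaError_t cudaEventQuery(cudaEvent_t event);
Explanation: Returns cudaSuccess if the event has actually been recorded, and cudaErrorNotReady otherwise.

36.cudaError_t cudaEventSynchronize(cudaEvent_t event);
Explanation: Blocks until the event has actually been recorded. If cudaEventRecord() has not been called on this event, the function returns cudaErrorInvalidValue.

37.cudaError_t cudaEventDestroy(cudaEvent_t event);
Explanation: Destroys an event.

38.cudaError_t cudaEventElapsedTime(float* time, cudaEvent_t start, cudaEvent_t end);
Explanation: Computes the time elapsed between two events, in milliseconds. If either event has not been recorded, the function returns cudaErrorInvalidValue.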

39.cudaError_t cudaGetDeviceCount(int* count);
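Explanation: Returns in *count the number of devices with compute capability greater than or equal to 1.0.

40.cudaError_t cudaSetDevice(int dev);
Explanation: Sets dev as the device on which the active host thread executes device code.

41.cudaError_t cudaGetDevice(int* dev);
Explanation: Returns in *dev the device on which the active host thread executes device code.

42.cudaError_t cudaGetDeviceProperties(struct cudaDeviceProp* prop, int dev);
Explanation: Returns the properties of device dev in *prop.

43.cudaError_t cudaChooseDevice(int* dev, const struct cudaDeviceProp* prop);
Explanation: Returns in *dev the device whose properties best match *prop.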
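The five device-management calls above fit together roughly as in the following sketch. The listDevices helper and the requested compute capability of 3.0 are assumptions made purely for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: prints one line per CUDA device and returns how many
// devices were found (0 if the runtime reports an error).
int listDevices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("device %d: %s, compute capability %d.%d, %zu MiB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem >> 20);
    }
    return count;
}

int main()
{
    int count = listDevices();
    if (count == 0) {
        printf("no CUDA device available\n");
        return 0;
    }

    // Describe the device we would like; unset fields stay zero.
    cudaDeviceProp wanted;
    memset(&wanted, 0, sizeof(wanted));
    wanted.major = 3;   // assumed requirement, for illustration only
    wanted.minor = 0;

    int best = 0;
    if (cudaChooseDevice(&best, &wanted) == cudaSuccess) {
        cudaSetDevice(best);          // make it current for this host thread
        int current = -1;
        cudaGetDevice(&current);      // read the selection back
        printf("selected device %d\n", current);
    }
    return 0;
}
```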

44.CUDA Runtime API
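Explanation:
(1) The low-level API (cuda_runtime_api.h) is a C-style interface and does not require compiling with nvcc.
(2) The high-level API (cuda_runtime.h) is a C++-style interface built on top of the low-level API. It can be used directly from C++ code and compiled with any C++ compiler. The high-level API also contains some CUDA-specific wrappers, and those do require nvcc.
Note: The CUDA Runtime API and the CUDA Driver API provide functions for device management, thread management, stream management, event management, memory management, texture reference management, execution control, OpenGL interoperability, Direct3D interoperability, error handling, and so on.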

45.__noinline__
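Explanation: In the early toolkits these notes are based on, __device__ functions were inlined by default; current compilers inline them when they judge it worthwhile. The __noinline__ qualifier hints to the compiler that a function should not be inlined.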

46.#pragma unroll
Explanation: By default the compiler unrolls small loops with a known trip count. #pragma unroll controls how many times a given loop is unrolled.
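A minimal sketch of the directive in use follows. The movingSum kernel, the TAPS constant, and the movingSumFirst helper are hypothetical names chosen for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TAPS 4  // window length, known at compile time

__global__ void movingSum(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx + TAPS > n) return;   // the window would run past the end

    float acc = 0.0f;
    // Constant trip count: ask for the loop to be fully unrolled.
    #pragma unroll
    for (int k = 0; k < TAPS; ++k)
        acc += in[idx + k];
    out[idx] = acc;
}

// Hypothetical helper: runs the kernel on n ones (n >= TAPS is assumed) and
// returns the first output element.
float movingSumFirst(int n)
{
    std::vector<float> h_in(n, 1.0f);
    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));

    movingSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    float first = -1.0f;
    cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return first;
}

int main()
{
    printf("sum of the first window: %.1f\n", movingSumFirst(1024));
    return 0;
}
```

Writing `#pragma unroll 2` instead would request an unroll factor of two, and `#pragma unroll 1` asks the compiler not to unroll the loop at all.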

47.CUDA built-in vector types
Explanation: char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4.

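48.CUDA type conversion functions
Explanation:
(1) int __float2int_[rn,rz,ru,rd](float);: converts a float argument to an integer using the specified rounding mode.
(2) unsigned int __float2uint_[rn,rz,ru,rd](float);: converts a float argument to an unsigned integer using the specified rounding mode.
(3) float __int2float_[rn,rz,ru,rd](int);: converts an integer argument to a float using the specified rounding mode.
(4) float __uint2float_[rn,rz,ru,rd](unsigned int);: converts an unsigned integer argument to a float using the specified rounding mode.
Note: rn rounds to the nearest value, with ties going to even; rz rounds toward zero; ru rounds up (toward positive infinity); rd rounds down (toward negative infinity).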
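The difference between the four rounding modes is easiest to see on a value that lies exactly halfway between two integers. The sketch below applies each mode to 2.5 and -2.5; the roundingModes kernel and the roundAllModes helper are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Applies the four float-to-int intrinsics to the same input.
__global__ void roundingModes(float x, int *out)
{
    out[0] = __float2int_rn(x);  // nearest, ties to even
    out[1] = __float2int_rz(x);  // toward zero
    out[2] = __float2int_ru(x);  // toward positive infinity
    out[3] = __float2int_rd(x);  // toward negative infinity
}

// Hypothetical helper: fills out[0..3] with the rn, rz, ru, rd results for x.
bool roundAllModes(float x, int out[4])
{
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, 4 * sizeof(int)) != cudaSuccess) return false;
    roundingModes<<<1, 1>>>(x, d_out);
    cudaError_t err = cudaMemcpy(out, d_out, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main()
{
    const float inputs[2] = { 2.5f, -2.5f };
    for (int i = 0; i < 2; ++i) {
        int r[4];
        if (!roundAllModes(inputs[i], r)) return 1;
        //  2.5f -> rn=2  rz=2  ru=3  rd=2
        // -2.5f -> rn=-2 rz=-2 ru=-2 rd=-3
        printf("%5.1f -> rn=%d rz=%d ru=%d rd=%d\n",
               inputs[i], r[0], r[1], r[2], r[3]);
    }
    return 0;
}
```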

49.asyncAPI.cu code walkthrough
Explanation:

#include <stdio.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h> 
 
__global__ void increment_kernel(int *g_data, int inc_value)
{
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 g_data[idx] = g_data[idx] + inc_value;
}

int main(int argc, char *argv[])
{   
 int n = 16 * 1024 * 1024;
 int nbytes = n * sizeof(int);
 int value = 26;
 
 // allocate host memory
 int *a = 0;
 cudaMallocHost((void **)&a, nbytes);
 memset(a, 0, nbytes);
 
 // allocate device memory
 int *d_a = 0;
 cudaMalloc((void **)&d_a, nbytes);
 cudaMemset(d_a, 255, nbytes);
 
 // set kernel launch configuration
 dim3 threads = dim3(512, 1);
 dim3 blocks = dim3(n / threads.x, 1);
 
 // create cuda event handles
 cudaEvent_t start, stop;
 cudaEventCreate(&start);
 cudaEventCreate(&stop);
 
 // define time
 StopWatchInterface *timer = NULL;
 sdkCreateTimer(&timer);
 sdkResetTimer(&timer);
 
 cudaDeviceSynchronize();
 float gpu_time = 0.0f;
 
 // asynchronously issue work to the GPU (all to stream 0)
 sdkStartTimer(&timer);
 cudaEventRecord(start, 0);
 cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
 increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
 cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
 cudaEventRecord(stop, 0);
 sdkStopTimer(&timer);
 
 // have CPU do some work while waiting for stage 1 to finish
 unsigned long int counter = 0;
 while (cudaEventQuery(stop) == cudaErrorNotReady)
 {
  counter++;
 }
 
 cudaEventElapsedTime(&gpu_time, start, stop);
 
 // print the cpu and gpu times
 printf("time spent executing by the GPU: %.2f\n", gpu_time);
 printf("time spent by CPU in CUDA calls: %.2f\n", sdkGetTimerValue(&timer));
 printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);
 
 // release resources
 cudaEventDestroy(start);
 cudaEventDestroy(stop);
 cudaFreeHost(a);
 cudaFree(d_a);
 
 cudaDeviceReset();
}

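Explanation:
(1) The event management API is mainly used to record GPU state, so that the CPU can determine whether the GPU has finished by querying a CUDA event.
(2) Common functions that execute asynchronously with respect to the host include kernel launches; memory copy functions with the Async suffix; device-to-device memory copies; and memory-set functions such as cudaMemset(), cudaMemset2D(), and cudaMemset3D().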

50.Creating and initializing streams
Explanation:

cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
for (int i = 0; i < nstreams; i++)
{
 checkCudaErrors(cudaStreamCreate(&(streams[i])));
}

51.simpleStreams.cu code walkthrough
Explanation:

#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <string.h>   // memset
#include <cuda_runtime.h>

__global__ void init_array(int *g_data, int *factor, int num_iterations)
{
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 for (int i = 0; i<num_iterations; i++) { g_data[idx] += *factor;}
}
 
int main(int argc, char **argv)
{
 int nstreams = 4;               
 int nreps = 10;                 
 int n = 16 * 1024 * 1024;       
 int nbytes = n * sizeof(int);   
 dim3 threads, blocks;           
 float elapsed_time, time_memcpy, time_kernel;   
 int niterations = 5;    
 
 // allocate host memory
 int c = 5;                      
 int *h_a = 0;                  
 cudaMallocHost((void**)&h_a, nbytes);
 
 // allocate device memory
 int *d_a = 0, *d_c = 0;            
 cudaMalloc((void **)&d_a, nbytes);
 cudaMalloc((void **)&d_c, sizeof(int));
 cudaMemcpy(d_c, &c, sizeof(int), cudaMemcpyHostToDevice);
 
 // allocate and initialize an array of stream handles
 cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
 for (int i = 0; i < nstreams; i++)
 { cudaStreamCreate(&(streams[i]));}
 
 // create CUDA event handles
 cudaEvent_t start_event, stop_event;
 cudaEventCreate(&start_event);
 cudaEventCreate(&stop_event);
 
 // time memcopy from device
 cudaEventRecord(start_event, 0);
 cudaMemcpyAsync(h_a, d_a, nbytes, cudaMemcpyDeviceToHost, streams[0]);
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);   
 cudaEventElapsedTime(&time_memcpy, start_event, stop_event);
 printf("memcopy:\t%.2f\n", time_memcpy);
 
 // time kernel
 threads = dim3(512, 1);
 blocks = dim3(n / threads.x, 1);
 cudaEventRecord(start_event, 0);
 init_array<<<blocks, threads, 0, streams[0]>>>(d_a, d_c, niterations);
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);
 cudaEventElapsedTime(&time_kernel, start_event, stop_event);
 printf("kernel:\t\t%.2f\n", time_kernel);
    
 // time non-streamed execution for reference
 threads = dim3(512, 1);
 blocks = dim3(n / threads.x, 1);
 cudaEventRecord(start_event, 0);
 for (int k = 0; k < nreps; k++)
 {
  init_array<<<blocks, threads>>>(d_a, d_c, niterations);
  cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);
 }
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);
 cudaEventElapsedTime(&elapsed_time, start_event, stop_event);
 printf("non-streamed:\t%.2f\n", elapsed_time / nreps);
 
 // time execution with nstreams streams
 threads = dim3(512, 1);
 blocks = dim3(n / (nstreams*threads.x), 1);
 memset(h_a, 255, nbytes);     
 cudaMemset(d_a, 0, nbytes); 
 cudaEventRecord(start_event, 0);
 for (int k = 0; k < nreps; k++)
 {
  // asynchronously launch nstreams kernels, one per stream
  for (int i = 0; i < nstreams; i++)
  {
   init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations);
  }
  // asynchronously issue nstreams memcopies, one per stream
  for (int i = 0; i < nstreams; i++)
  {
   cudaMemcpyAsync(h_a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]);
  }
 }
 cudaEventRecord(stop_event, 0);
 cudaEventSynchronize(stop_event);
 cudaEventElapsedTime(&elapsed_time, start_event, stop_event);
 printf("%d streams:\t%.2f\n", nstreams, elapsed_time / nreps);
    
 // release resources
 for (int i = 0; i < nstreams; i++) { cudaStreamDestroy(streams[i]); }
 free(streams);
 cudaEventDestroy(start_event);
 cudaEventDestroy(stop_event);
 cudaFreeHost(h_a);   // h_a came from cudaMallocHost, so it needs cudaFreeHost
 cudaFree(d_a);
 cudaFree(d_c);
 
 cudaDeviceReset();
}

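Explanation: simpleStreams.cu creates streams and events and then times, in turn, a memory copy, a kernel launched in a single stream, kernel execution without streams, and the whole workload spread over nstreams streams. To make the timings more accurate, the last two measurements run nreps repetitions and report the average.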

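52.Streams in CUDA
Explanation: Within a given stream, operations execute in order, but operations in different streams may execute out of order with respect to one another, and possibly in parallel. A stream is defined by creating a cudaStream_t object and passing it as an argument when launching kernels and issuing memory copies. Operations given the same stream argument belong to the same stream, and operations given different arguments belong to different streams.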

53.Tegra
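Explanation: Tegra is a brand of general-purpose processors based on the ARM architecture, introduced by NVIDIA (that is, a CPU, which NVIDIA calls a "Computer on a chip"). It gives portable devices high performance at low power.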

54.Installing CUDA 10.1 on Ubuntu 16.04
Explanation:
(1) Disable the nouveau driver

lsmod | grep nouveau
sudo vim /etc/modprobe.d/blacklist.conf
# append the following two lines to blacklist.conf:
blacklist nouveau
options nouveau modeset=0
sudo update-initramfs -u
sudo reboot
# after the reboot, this should print nothing
lsmod | grep nouveau

(2) Install CUDA from the runfile in text console mode

sudo service lightdm stop
sudo sh cuda_10.1.168_418.67_linux.run --no-opengl-libs
# to uninstall later, if needed:
sudo /usr/local/cuda-10.1/bin/cuda-uninstaller
sudo /usr/bin/nvidia-uninstall
# check that the NVIDIA device nodes exist
ls /dev/nvidia*

(3) Set the environment variables in /etc/profile

sudo vim /etc/profile
# append the following two lines to /etc/profile:
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# verify the driver and compiler versions
cat /proc/driver/nvidia/version
nvcc -V

(4) Build the samples shipped with CUDA

cd /home/xxx/NVIDIA_CUDA-10.1_Samples
make
cd /home/xxx/NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/release
./deviceQuery
./bandwidthTest

(5) Install cuDNN

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/

(6) Check that cuDNN was installed correctly

~/Downloads$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
 
#include "driver_types.h"

Note: CUDA and cuDNN are now installed.

55.CMakeLists.txt
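Explanation:

[1]MESSAGE(STATUS "Project: SERVER"): prints a message
[2]SET(CMAKE_BUILD_TYPE DEBUG): sets the build type, Debug or Release
[3]SET(CMAKE_C_FLAGS_DEBUG "-g -Wall"): sets the compiler flags used for Debug builds
[4]CMAKE_C_FLAGS_DEBUG: the C compiler flags for Debug builds
[5]CMAKE_CXX_FLAGS_DEBUG: the C++ compiler flags for Debug builds
[6]-g: tells the compiler to generate debugging information when compiling
[7]-Wall: enables the common set of warning messages
[8]ADD_SUBDIRECTORY(): adds a subdirectory to the build
[9]SET(*.cpp): sets a variable that stands for all the source files
[10]INCLUDE_DIRECTORIES: the directories that contain the header files
[11]LINK_DIRECTORIES: the directories that contain the library files
[12]ADD_LIBRARY: builds a library (a static library by default)
[13]TARGET_LINK_LIBRARIES: the libraries that a target depends on
[14]SET_TARGET_PROPERTIES: sets target properties, such as the output path of the generated executable
[15]add_executable: defines the executable target to build
[16]find_package(): locates a dependency package
[17]FILE(GLOB EXTENDED doc/**): collects all the files under the doc directory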

56.CUDA API
Explanation: There are two APIs, the Runtime API and the Driver API, and each has its own range of uses.

57.PTX code
Explanation: Parallel Thread eXecution (PTX) code is an intermediate form of compiled GPU code that can be compiled again into native GPU microcode.

58.Clang
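Explanation: Clang is a lightweight compiler for C, C++, and Objective-C. Its source code was released under a BSD-style license. Clang supports generic lambda expressions, simplified handling of return types, and better handling of the constexpr keyword.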

59.clang-tidy
Explanation: clang-tidy is a Clang-based static code analysis framework that supports C++, C, and Objective-C.

References:
[1] CUDA Programming: http://www.cnblogs.com/stewart/archive/2013/01/05/2846860.html
[2] Classic Reference Books for Learning DirectX: http://blog.csdn.net/kuangfengwu/article/details/7344009
[3] Advanced Applications of Digital Image Processing: Implementations in MATLAB and CUDA
[4] Texture Memory in CUDA: http://blog.csdn.net/qq_25716575/article/details/52444686
[5] Using CUDA Textures: http://preston2006.blog.sohu.com/253531751.html
[6] Characteristics and Use of CUDA Texture Memory: http://blog.csdn.net/darkstorm2111203/article/details/4294012
[7] CUDA Compute Capability Lookup Table: https://blog.csdn.net/allyli0022/article/details/54628987