ATS-GPU-BASE Programmer's guide =============================== .. highlight:: c License Agreement ----------------- .. include:: ../../LICENSE Introduction ------------ The ATS-GPU SDK provides a framework to allow real-time processing of data from AlazarTech PCIe digitizers on a CUDA-enabled GPU. This programmer's guide covers the use of ATS-GPU-BASE. ATS-GPU-BASE internally calls ATS-CUDA, which is a wrapper library for simple CUDA calls. ATS-CUDA is described in more detail later in this guide in the section `ATS-CUDA`_. This document assumes that the reader is familiar with ATS-SDK, the standard interface for programming AlazarTech digitizers. Having a copy of the ATS-SDK manual available can be helpful, since many references to ATSApi functions are done here. The latest version of the ATS-SDK manual can be downloaded free of charge from `AlazarTech's website `_. In addition, expertise in CUDA programming is assumed. This is particularly important for users wishing to use ATS-GPU-BASE, because this task involves CUDA programming. It is also essential for programmers to have in-depth knowledge of GPU architecture and parallel programming. .. image:: res/ATS-GPU_3-5_BlockDiagram_v3.png :align: center Prerequisites ------------- System requirements ~~~~~~~~~~~~~~~~~~~ This software requires a PC with a CUDA-enabled GPU, and sufficient CPU resources to supply data to the GPU at the desired data acquisition rate. It was tested with GeForce GTX Titan X (Maxwell), GeForce GTX980 and Quadro P5000. DDR4 memory and a modern chipset (X99, X299) will greatly improve transfer speed and overall performance. Supported operating systems 64-bit Windows and 64-bit Linux operating systems are supported. Please verify that your Linux distribution is `supported by NVIDIA `_ , which supplies the CUDA toolkit required to use ATS-GPU. Compiler support CMake is required to build C/C++ code. CMake files are provided. On Linux, a C++11 compiler is required to build the library. On older Red Hat distributions, a devtoolset can be obtained to use a more recent version of gcc that supports C++11. NVCC is required to compile the example code, this compiler is included with CUDA toolkit. CUDA driver requirements In order to use ATS-GPU, you must install the appropriate driver for your CUDA-enabled GPU. Drivers can be downloaded at https://www.nvidia.com/Download/index.aspx. .. note:: Under Windows operating systems, dynamic link libraries releated to ATS-GPU-BASE are installed by default in %WINDIR%\System32. For applications to link approripately to them, %WINDIR%\System32 must be added to the Windows PATH Environment Variable. Programming experience ~~~~~~~~~~~~~~~~~~~~~~ Users who wish to use ATS-GPU-BASE to create high-performance custom kernels must have expertise in CUDA programming. It is also essential for programmers to have in-depth knowledge of GPU architecture and parallel programming. ATS-GPU-BASE ------------ ATS-GPU-BASE is designed to provide highly efficient code to transfer data from an ATS PCIe digitizer to a CUDA-enabled GPU for processing. This transfer is done using multiple DMA transactions. The user application, which includes custom CUDA kernels, can then access data buffers on the GPU. The user is then responsible to perform data processing and copy data back to the CPU if required. A code example is provided as an example of a user application that performs very simple signal processing (data inversion). Usage ~~~~~ .. image:: res/ATS-GPU_DataFlow_Block-Diagram20170516_v2.png :align: center ATS-GPU-BASE offers several functions that behave similarly to ATSApi functions. Please refer to the ATS-SDK guide for more details about these APIs. Obtaining a board handle and configuring the board (sampling rate, trigger, input channels, etc.) is performed directly using functions from the ATS-SDK. By convention, the code samples define a ``ConfigureBoard()`` function that handles all these tasks. .. code-block:: c++ if (!ConfigureBoard(boardHandle)) { // Error handling } During the lifetime of an application, multiple acquisitions can take place. If the board configuration parameters do not change, it is not necessary to call ``ConfigureBoard()`` again. The next step is to select the CUDA-enabled GPU to use for the data transfer. This call is optional. If you only have one CUDA capable GPU on your computer, you can skip it. .. code-block:: c++ rc = ATS_GPU_SetCUDAComputeDevice(boardHandle, deviceIndex); // Error handling We must then setup parameters of the acquisition to GPU. This function replaces the call to :cpp:func:`AlazarBeforeAsyncRead()` in normal programs. Parameters were kept as close as possible to those of :cpp:func:`AlazarBeforeAsyncRead()` to ease transition between standard acquisitions and ATS-GPU acquisitions. To maximize performance, sample interleave should be enabled with ``ADMA_INTERLEAVE_SAMPLES``. .. code-block:: c++ rc = ATS_GPU_Setup(boardHandle, channelSelect, transferOffset, transferLength, recordsPerBuffer, recordsPerAcquisition, autoDMAFlags, ATSGPUFlags); // Error handling We then allocate memory on the GPU for data to be transferred to, and we post those buffers to the board. For this purpose, we use :cpp:func:`ATS_GPU_AllocBuffer`. This function allocates a buffer on the GPU and sets up all the intermediary state necessary for ATS-GPU to successfully transfer data. Please note that if you would like to send data back from the GPU to your computer's RAM after having processed it, you will need to allocate memory independently of the AlazarTech APIs. .. code-block:: c++ for (size_t i = 0; i < buffers_to_allocate; i++) { buffers[i] = ATS_GPU_AllocBuffer(boardHandle, bytesPerBuffer); rc = ATS_GPU_PostBuffer(boardHandle, buffers[i], bytesPerBuffer); // Error handling } We can then start the acquisition. The board will directly start acquiring data, assuming it receives triggers, and data transfer to posted GPU buffers will also start immediately. .. code-block:: c++ rc = ATS_GPU_StartCapture(HANDLE boardHandle); // Error handling Once acquisition is started, :cpp:func:`ATS_GPU_GetBuffer` must be called as often as possible to retrieve a buffer containing data already copied on the GPU. This buffer can then be processed by your custom kernel on the GPU. When a buffer is done being used (either data has been copied to a different buffer or processing is complete), the buffer needs to be posted back to the board. .. code-block:: c++ for (size_t i; i < buffers_per_acquisition; i++) { rc = ATS_GPU_GetBuffer(boardHandle, buffers[i], timeout_ms, nullptr); // TODO: Error handling // TODO: Process buffer. This is where you can call your own processing // function that launches the GPU kernels, such as ProcessBuffer() // in the code samples. ProcessBuffer(buffers[i], bytesPerBuffer); rc = ATS_GPU_PostBuffer(boardHandle, buffer, bytesPerBuffer); } When acquisition is complete, :cpp:func:`ATS_GPU_AbortCapture` must be called. Buffers allocated with :cpp:func:`ATS_GPU_AllocBuffer` should then be freed with :cpp:func:`ATS_GPU_FreeBuffer`. .. code-block:: c++ RETURN_CODE ATS_GPU_AbortCapture(HANDLE boardHandle); for (size_t i = 0; i < number_of_buffers; i++) { rc = ATS_GPU_FreeBuffer(boardHandle, buffers[i]); // Error handling } Here is an example of what the function to process data on the GPU can look like. Since this contains code that is executed on the GPU, it needs to be located in a file with a ``.cu`` extension: .. code-block:: none extern ā€œCā€__global__ void ProcessBuffer(void* buffer, bytesPerBuffer) { int idx = blockDim.x * blockIdx.x + threadIdx.x; // TODO: Do processing here } Bool ProcessBuffer(void* buffer, U32 bytesPerBuffer) { // Launch ProcessBuffer CUDA kernel ProcessBuffer<<>>(buffer, bytesPerBuffer); // Copy result from GPU memory to CPU memory cudaMemcpy(resultBuffer,buffer,bytesPerBuffer); } Performance guidelines ~~~~~~~~~~~~~~~~~~~~~~ While GPU solutions are highly customizable and can reach very high processing speeds, care must be taken to preserve performance. The provided libraries use streams to maximise concurrency and hide latency associated with data transfers. The processing functions are optimized to provide the best performance and modifying them can result in a loss of performance. Refer to the `CUDA best practices guide `_ for more information on how to improve performance. .. warning:: When developing CUDA code, it is very important to check memory accesses with a dedicated tool, as bad memory accesses will not necessarily trigger an error but will lead to bad behavior and can cause a crash. The CUDA toolkit provides the necessary memory checking utilities. Because data is DMA'd from ATS board to host memory then to GPU memory, speed of host computer memory will influence performance. DDR4 memory and a modern chipset (X99, X299, etc.) will greatly improve transfer speed and overall performance. Benchmarks ~~~~~~~~~~ Performance benchmarks using the optional OCT signal processing library and NVIDIA GeForce GTX Titan X (Maxwell) GPU on an ASUS X99 Deluxe motherboard with an Intel i9-7900X 3.3 GHz CPU, and 2133 MHz DDR4 memory (32 GB RAM): +------------------------------+----------------+ | PCIe Link Speed | Transfer Rate | +==============================+================+ | Gen 3x8: ATS9373, ATS9371 | Up to 6.9 GB/s | +------------------------------+----------------+ | Gen 2x8: ATS9360, ATS9416 | Up to 3.5 GB/s | +------------------------------+----------------+ | Gen 2x4: ATS9352 | | | Gen 1x8: ATS9870, ATS9350, | Up to 1.6 GB/s | | ATS9351, ATS9625, | | | ATS9626, ATS9440 | | +------------------------------+----------------+ | Gen 1x4: ATS9462 | Up to 720 MB/s | +------------------------------+----------------+ | Gen 1x1: ATS9146, ATS9130, | Up to 200 MB/s | | ATS9120 | | +------------------------------+----------------+ API Reference ~~~~~~~~~~~~~ .. note:: Errors from ATS-GPU-BASE will be logged in ATS_GPU.log. Relevant information about the error will be logged here and can be useful for debugging. For Windows users log file is located in %TEMP%. For Linux users log file is located in /tmp/. .. doxygenfunction:: ATS_GPU_Setup .. raw:: latex \clearpage .. .. doxygenfunction:: ATS_GPU_SetProcessingPipeline See `ATS-GPU-HPC`_ for more information on types of kernels that can be launched on the GPU. .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_AllocBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_PostBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_GetBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_AbortCapture .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_FreeBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_GenerateCPUBoxcarFunction .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_GetVersion .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_QueryCUDADeviceCount .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_QueryCUDADeviceName .. raw:: latex \clearpage .. doxygenfunction:: ATS_GPU_SetCUDAComputeDevice .. raw:: latex \clearpage .. doxygenenum:: ATS_GPU_SETUP_FLAG .. raw:: latex \clearpage .. doxygenstruct:: InputRange :members: .. raw:: latex \clearpage .. ATS-GPU-HPC ~~~~~~~~~~~ ATS-GPU-HPC is designed to allow users to create their custom processing pipeline with simple structure declarations. These structures contain all the information necessary for the library to perform the desired kernel. ATS-GPU-HPC is designed to be used by passing a list of KERNEL structures to ATS_GPU_SetProcessingPipeline from the ATS-GPU-BASE library. Some processing kernels require the installation of additional libraries in order to run. Kernel description ^^^^^^^^^^^^^^^^^^ .. doxygenenum:: _KERNEL_TYPE .. doxygenstruct:: _KERNEL :members: .. doxygenstruct:: _KERNEL_DATA :members: Processing kernels ^^^^^^^^^^^^^^^^^^ .. doxygenstruct:: _PRE_FFT_KERNEL :members: .. doxygenstruct:: _FFT_KERNEL :members: .. doxygenstruct:: _POST_FFT_KERNEL :members: .. doxygenstruct:: _OCT_KERNEL :members: .. doxygenstruct:: _OUTPUT_KERNEL :members: ATS-CUDA -------- The ATS-CUDA SDK provides a framework to allow users to perform simple manipulations on CUDA-enabled GPUs. ATS-CUDA is designed to be used with ATS-GPU-BASE, but can also be used independently. This section of the programmer's guide covers the use of ATS-CUDA. As with ATS-GPU-BASE, using ATS-CUDA requires expertise in CUDA programming because this involves writing custom CUDA kernels. It is also essential for programmers to have in-depth knowledge of GPU architecture and parallel programming. API Reference ~~~~~~~~~~~~~ .. note:: Errors from ATS-CUDA-BASE will be logged in ATS_GPU.log. Relevant information about the error will be logged here and can be useful for debugging. For Windows users log file is located in %TEMP%. For Linux users log file is located in /tmp/. .. doxygenenum:: ATS_CUDA_Input_DataType .. raw:: latex \clearpage .. doxygenenum:: ALAZAR_PACKING .. raw:: latex \clearpage .. doxygenstruct:: UNPACK_DEINTERLEAVE_OPTIONS :members: .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_AllocCPUBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_AllocGPUBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_AverageRecords .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_BaseProcessBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_ConvertToVolts .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_CopyDeviceToHost .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_CopyHostToDevice .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_CreateStream .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_DestroyStream .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_FreeCPUBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_FreeGPUBuffer .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_GetVersion .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_GetComputeCapability .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_MultiplyRecords .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_QueryDeviceCount .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_QueryDeviceName .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_SetComputeDevice .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_StreamSynchronize .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_StreamQuery .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_SeparateDataFromNPTFooters .. raw:: latex \clearpage .. doxygenfunction:: ATS_CUDA_ExtractNPTFooters