ATS-GPU-BASE Programmer's guide
===============================

.. highlight:: c


License Agreement
-----------------

.. include:: ../../LICENSE

Introduction
------------

The ATS-GPU SDK provides a framework to allow real-time processing of data from
AlazarTech PCIe digitizers on a CUDA-enabled GPU. This programmer's guide
covers the use of ATS-GPU-BASE.

ATS-GPU-BASE internally calls ATS-CUDA, which is a wrapper library for simple CUDA
calls. ATS-CUDA is described in more detail later in this guide in the section
`ATS-CUDA`_.

This document assumes that the reader is familiar with ATS-SDK, the standard
interface for programming AlazarTech digitizers. Having a copy of the ATS-SDK
manual available can be helpful, since many references to ATSApi functions are
done here. The latest version of the ATS-SDK manual can be downloaded free of
charge from `AlazarTech's website <https://www.alazartech.com>`_.

In addition, expertise in CUDA programming is assumed. This is particularly
important for users wishing to use ATS-GPU-BASE, because this task involves CUDA
programming.

It is also essential for programmers to have in-depth knowledge of GPU
architecture and parallel programming.

.. image:: res/ATS-GPU_3-5_BlockDiagram_v3.png
   :align: center

Prerequisites
-------------

System requirements
~~~~~~~~~~~~~~~~~~~

This software requires a PC with a CUDA-enabled GPU, and sufficient CPU resources 
to supply data to the GPU at the desired data acquisition rate. It was tested 
with GeForce GTX Titan X (Maxwell), GeForce GTX980 and Quadro P5000. DDR4 memory 
and a modern chipset (X99, X299) will greatly improve transfer speed and overall 
performance.

Supported operating systems
  64-bit Windows and 64-bit Linux operating systems are supported. Please verify 
  that your Linux distribution is `supported by NVIDIA
  <http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements>`_
  , which supplies the CUDA toolkit required to use ATS-GPU.

Compiler support
  CMake is required to build C/C++ code. CMake files are provided.
  On Linux, a C++11 compiler is required to build the library. On older Red Hat
  distributions, a devtoolset can be obtained to use a more recent version of
  gcc that supports C++11. NVCC is required to compile the example code, this
  compiler is included with CUDA toolkit.

CUDA driver requirements
  In order to use ATS-GPU, you must install the appropriate driver for your CUDA-enabled  
  GPU. Drivers can be downloaded at https://www.nvidia.com/Download/index.aspx.

.. note::
   
   Under Windows operating systems, dynamic link libraries releated to
   ATS-GPU-BASE are installed by default in %WINDIR%\System32. For applications
   to link approripately to them, %WINDIR%\System32 must be added to the Windows
   PATH Environment Variable.

Programming experience
~~~~~~~~~~~~~~~~~~~~~~

Users who wish to use ATS-GPU-BASE to create high-performance custom kernels 
must have expertise in CUDA programming.

It is also essential for programmers to have in-depth knowledge of GPU
architecture and parallel programming.

ATS-GPU-BASE
------------

ATS-GPU-BASE is designed to provide highly efficient code to transfer data from
an ATS PCIe digitizer to a CUDA-enabled GPU for processing. This transfer is 
done using multiple DMA transactions. The user application, which includes 
custom CUDA kernels, can then access data buffers on the GPU. The user is then 
responsible to perform data processing and copy data back to the CPU if required. 
A code example is provided as an example of a user application that performs 
very simple signal processing (data inversion).

Usage
~~~~~

.. image:: res/ATS-GPU_DataFlow_Block-Diagram20170516_v2.png
   :align: center

ATS-GPU-BASE offers several functions that behave similarly to ATSApi functions.
Please refer to the ATS-SDK guide for more details about these APIs. Obtaining a
board handle and configuring the board (sampling rate, trigger, input channels,
etc.) is performed directly using functions from the ATS-SDK. By convention, the
code samples define a ``ConfigureBoard()`` function that handles all these
tasks.

.. code-block:: c++

    if (!ConfigureBoard(boardHandle)) {
        // Error handling
    }

During the lifetime of an application, multiple acquisitions can take place. If
the board configuration parameters do not change, it is not necessary to call
``ConfigureBoard()`` again.

The next step is to select the CUDA-enabled GPU to use for the data transfer.
This call is optional. If you only have one CUDA capable GPU on your computer,
you can skip it.

.. code-block:: c++

    rc = ATS_GPU_SetCUDAComputeDevice(boardHandle, deviceIndex);
    // Error handling

We must then setup parameters of the acquisition to GPU. This function replaces
the call to :cpp:func:`AlazarBeforeAsyncRead()` in normal programs. Parameters
were kept as close as possible to those of :cpp:func:`AlazarBeforeAsyncRead()`
to ease transition between standard acquisitions and ATS-GPU acquisitions. To
maximize performance, sample interleave should be enabled with
``ADMA_INTERLEAVE_SAMPLES``.

.. code-block:: c++

    rc = ATS_GPU_Setup(boardHandle, channelSelect, transferOffset,
                       transferLength, recordsPerBuffer, recordsPerAcquisition,
                       autoDMAFlags, ATSGPUFlags);
    // Error handling

We then allocate memory on the GPU for data to be transferred to, and we post
those buffers to the board. For this purpose, we use
:cpp:func:`ATS_GPU_AllocBuffer`. This function allocates a buffer on the GPU
and sets up all the intermediary state necessary for ATS-GPU to successfully
transfer data. Please note that if you would like to send data back from the GPU to
your computer's RAM after having processed it, you will need to allocate memory
independently of the AlazarTech APIs.

.. code-block:: c++

    for (size_t i = 0; i < buffers_to_allocate; i++)
    {
        buffers[i] = ATS_GPU_AllocBuffer(boardHandle, bytesPerBuffer);

        rc =  ATS_GPU_PostBuffer(boardHandle,
                                 buffers[i],
                                 bytesPerBuffer);
        // Error handling
    }

We can then start the acquisition. The board will directly start acquiring data,
assuming it receives triggers, and data transfer to posted GPU buffers will also
start immediately.

.. code-block:: c++

    rc =  ATS_GPU_StartCapture(HANDLE boardHandle);
    // Error handling


Once acquisition is started, :cpp:func:`ATS_GPU_GetBuffer` must be called as
often as possible to retrieve a buffer containing data already copied on the
GPU. This buffer can then be processed by your custom kernel on the GPU. When a
buffer is done being used (either data has been copied to a different buffer or
processing is complete), the buffer needs to be posted back to the board.

.. code-block:: c++

    for (size_t i; i < buffers_per_acquisition; i++)
    {
        rc =  ATS_GPU_GetBuffer(boardHandle,
                                buffers[i],
                                timeout_ms,
                                nullptr);

        // TODO: Error handling

        // TODO: Process buffer. This is where you can call your own processing
        //       function that launches the GPU kernels, such as ProcessBuffer()
        //       in the code samples.
        ProcessBuffer(buffers[i], bytesPerBuffer);

        rc = ATS_GPU_PostBuffer(boardHandle, buffer, bytesPerBuffer);
    }

When acquisition is complete, :cpp:func:`ATS_GPU_AbortCapture` must be called.
Buffers allocated with :cpp:func:`ATS_GPU_AllocBuffer` should then be freed with
:cpp:func:`ATS_GPU_FreeBuffer`.


.. code-block:: c++

    RETURN_CODE ATS_GPU_AbortCapture(HANDLE boardHandle);

    for (size_t i = 0; i < number_of_buffers; i++)
    {
       rc = ATS_GPU_FreeBuffer(boardHandle, buffers[i]);
       // Error handling
    }

Here is an example of what the function to process data on the GPU can look
like. Since this contains code that is executed on the GPU, it needs to be
located in a file with a ``.cu`` extension:


.. code-block:: none

    extern “C”__global__ void ProcessBuffer(void* buffer, bytesPerBuffer)
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;

        // TODO: Do processing here
    }

    Bool ProcessBuffer(void* buffer, U32 bytesPerBuffer)
    {
        // Launch ProcessBuffer CUDA kernel
        ProcessBuffer<<<threadsPerBlock, BlocksPerGrid>>>(buffer, bytesPerBuffer);

        // Copy result from GPU memory to CPU memory
        cudaMemcpy(resultBuffer,buffer,bytesPerBuffer);
    }

Performance guidelines
~~~~~~~~~~~~~~~~~~~~~~

While GPU solutions are highly customizable and can reach very high processing
speeds, care must be taken to preserve performance. The provided libraries use
streams to maximise concurrency and hide latency associated with data transfers.
The processing functions are optimized to provide the best performance and
modifying them can result in a loss of performance. Refer to the `CUDA best
practices guide
<http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz4cGxP6eNR>`_ for
more information on how to improve performance.

.. warning::

   When developing CUDA code, it is very important to check memory accesses with
   a dedicated tool, as bad memory accesses will not necessarily trigger an
   error but will lead to bad behavior and can cause a crash. The CUDA toolkit
   provides the necessary memory checking utilities.

Because data is DMA'd from ATS board to host memory then to GPU memory, speed of
host computer memory will influence performance. DDR4 memory and a modern
chipset (X99, X299, etc.) will greatly improve transfer speed and overall
performance.

Benchmarks
~~~~~~~~~~

Performance benchmarks using the optional OCT signal processing library and
NVIDIA GeForce GTX Titan X (Maxwell) GPU on an ASUS X99 Deluxe motherboard with
an Intel i9-7900X 3.3 GHz CPU, and 2133 MHz DDR4 memory (32 GB RAM):

+------------------------------+----------------+
|       PCIe Link Speed        | Transfer Rate  |
+==============================+================+
| Gen 3x8: ATS9373, ATS9371    | Up to 6.9 GB/s |
+------------------------------+----------------+
| Gen 2x8: ATS9360, ATS9416    | Up to 3.5 GB/s |
+------------------------------+----------------+
| Gen 2x4: ATS9352             |                |
| Gen 1x8: ATS9870, ATS9350,   | Up to 1.6 GB/s |
| ATS9351, ATS9625,            |                |
| ATS9626, ATS9440             |                |
+------------------------------+----------------+
| Gen 1x4: ATS9462             | Up to 720 MB/s |
+------------------------------+----------------+
| Gen 1x1: ATS9146, ATS9130,   | Up to 200 MB/s |
| ATS9120                      |                |
+------------------------------+----------------+

API Reference
~~~~~~~~~~~~~

.. note::
   
   Errors from ATS-GPU-BASE will be logged in ATS_GPU.log. Relevant
   information about the error will be logged here and can be useful for
   debugging. For Windows users log file is located in %TEMP%. For
   Linux users log file is located in /tmp/.

.. doxygenfunction:: ATS_GPU_Setup

.. raw:: latex

    \clearpage
.. 
    .. doxygenfunction:: ATS_GPU_SetProcessingPipeline

    See `ATS-GPU-HPC`_ for more information on types of kernels that can be launched 
    on the GPU.

    .. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_AllocBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_PostBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_GetBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_AbortCapture

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_FreeBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_GenerateCPUBoxcarFunction

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_GetVersion

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_QueryCUDADeviceCount

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_QueryCUDADeviceName

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_GPU_SetCUDAComputeDevice

.. raw:: latex

    \clearpage

.. doxygenenum:: ATS_GPU_SETUP_FLAG

.. raw:: latex

    \clearpage

.. doxygenstruct:: InputRange
  :members:

.. raw:: latex

    \clearpage

..
  ATS-GPU-HPC
  ~~~~~~~~~~~

  ATS-GPU-HPC is designed to allow users to create their custom processing pipeline
  with simple structure declarations. These structures contain all the information necessary
  for the library to perform the desired kernel. 
  
  ATS-GPU-HPC is designed to be used by passing a list of KERNEL structures to
  ATS_GPU_SetProcessingPipeline from the ATS-GPU-BASE library. Some processing
  kernels require the installation of additional libraries in order to run.  
  
  Kernel description
  ^^^^^^^^^^^^^^^^^^
  
  .. doxygenenum:: _KERNEL_TYPE
  
  .. doxygenstruct:: _KERNEL
     :members:
  
  .. doxygenstruct:: _KERNEL_DATA
     :members:
  
  Processing kernels
  ^^^^^^^^^^^^^^^^^^
  
  .. doxygenstruct:: _PRE_FFT_KERNEL
     :members:
  
  .. doxygenstruct:: _FFT_KERNEL
     :members:
  
  .. doxygenstruct:: _POST_FFT_KERNEL
     :members:
  
  .. doxygenstruct:: _OCT_KERNEL
     :members:
  
  .. doxygenstruct:: _OUTPUT_KERNEL
     :members:

ATS-CUDA
--------

The ATS-CUDA SDK provides a framework to allow users to perform simple 
manipulations on CUDA-enabled GPUs. ATS-CUDA is designed to be used with 
ATS-GPU-BASE, but can also be used independently. This section of the 
programmer's guide covers the use of ATS-CUDA.


As with ATS-GPU-BASE, using ATS-CUDA requires expertise in CUDA programming 
because this involves writing custom CUDA kernels.

It is also essential for programmers to have in-depth knowledge of GPU
architecture and parallel programming.

API Reference
~~~~~~~~~~~~~

.. note::
   
   Errors from ATS-CUDA-BASE will be logged in ATS_GPU.log. Relevant
   information about the error will be logged here and can be useful for
   debugging. For Windows users log file is located in %TEMP%. For
   Linux users log file is located in /tmp/.

.. doxygenenum:: ATS_CUDA_Input_DataType

.. raw:: latex

    \clearpage

.. doxygenenum:: ALAZAR_PACKING

.. raw:: latex

    \clearpage

.. doxygenstruct:: UNPACK_DEINTERLEAVE_OPTIONS
  :members:

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_AllocCPUBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_AllocGPUBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_AverageRecords

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_BaseProcessBuffer


.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_ConvertToVolts

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_CopyDeviceToHost

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_CopyHostToDevice

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_CreateStream

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_DestroyStream

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_FreeCPUBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_FreeGPUBuffer

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_GetVersion

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_GetComputeCapability

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_MultiplyRecords

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_QueryDeviceCount

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_QueryDeviceName

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_SetComputeDevice

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_StreamSynchronize

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_StreamQuery

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_SeparateDataFromNPTFooters

.. raw:: latex

    \clearpage

.. doxygenfunction:: ATS_CUDA_ExtractNPTFooters