csplatti/cuda-programming

cuda-programming

Overview

This repository contains various algorithms implemented using CUDA for parallel processing. My goal is to explore and experiment with parallelization techniques in CUDA across different types of algorithms. Each algorithm will be added as an independent module, and the repository will be updated with new implementations over time.

I am by no means an expert in CUDA (or C++), and I do not claim that my solutions are optimal or well written. If you see room for improvement, feel free to message me or open a pull request implementing the fix with a detailed explanation. In either case, I will fix the issue and credit you in a comment!

Prerequisites

  • Hardware & Software Requirements
    • Latest version of CUDA (12.9 at the time of writing)
    • NVIDIA GPU with CUDA support (I use an RTX 2000 Ada, which has compute capability 8.9)
  • Dependencies
    • Python Libraries (For runtime plotting notebooks): matplotlib

Installation

  1. Fork this repository
  2. Clone the repository
    git clone https://github.com/{Your GitHub Username}/cuda-programming
  3. Navigate to the local repository
    cd ./cuda-programming
  4. Navigate to any of the folders, e.g. hello_cuda
    cd ./hello_cuda
  5. Compile with NVCC
    nvcc -o your_cuda your_cuda.cu

Usage

Run the binary with input from a file. Currently, input to all of my programs is simply an integer denoting the size of the randomly generated input array to the given algorithm.

cat input.txt | ./your_cuda

Input can also be given via user input by running

./your_cuda

The program will then wait for you to enter a number on a new line.

Running dataPlot.ipynb

  1. Open Terminal
  2. Navigate to the directory where you cloned this repository
  3. Create your own Python virtual environment in the root directory
    python -m venv desired_env_name
  4. Activate the environment
    source ./desired_env_name/bin/activate
  5. Install the Python dependencies
    pip install -r requirements.txt
  6. Once you are finished, deactivate the Python virtual environment
    deactivate

Table of Contents

  1. Vector Addition
  2. Parallel Reduction

Vector Addition

Code

The serial algorithm I wrote to compare to a parallel implementation used a simple for loop.

void addVectorsSerial(int *a, int *b, int *c, int vectorLength) {
    for (int i = 0; i < vectorLength; i++) {
        c[i] = a[i] + b[i];
    }
}

The CUDA kernel itself was quite simple, requiring only pointers to the input and output arrays and the size of the arrays.

__global__ void addVectorsParallel(int *a, int *b, int *c, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}

The first line of the kernel computes each thread's global index. If that index falls within the array, the thread performs the addition. I couldn't find a way to avoid this bounds check without giving up the property that the input arrays can be of arbitrary size. Reach out to me or make a pull request with an explanation if such a solution exists!
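To see why the bounds check matters, the launch can be emulated on the CPU. The sketch below is plain C++ rather than CUDA; `emulateAddVectors` and its parameter names are hypothetical and only mirror the kernel's index arithmetic.

```cpp
#include <vector>

// CPU emulation of a launch: every (block, thread) pair computes the same
// global index the kernel would, then applies the same bounds check.
// Returns how many emulated threads fell outside the array and did nothing.
int emulateAddVectors(const std::vector<int> &a, const std::vector<int> &b,
                      std::vector<int> &c, int numBlocks, int threadsPerBlock) {
    const int n = static_cast<int>(a.size());
    int idleThreads = 0;
    for (int block = 0; block < numBlocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            const int i = thread + block * threadsPerBlock;
            if (i < n) {
                c[i] = a[i] + b[i];
            } else {
                ++idleThreads;  // without the check, this thread would access memory past the end
            }
        }
    }
    return idleThreads;
}
```

With 5 elements and 2 blocks of 4 threads, 8 threads are emulated and 3 of them land outside the array, which is the case the `if (i < N)` check guards against.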

The call to the CUDA kernel was as follows:

addVectorsParallel<<<(N + 1023) / 1024, 1024>>>(pa, pb, pOut, N);

This launches blocks of 1024 threads each: one block for every full 1024 elements, plus one more if there is a remainder.
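The block count for a launch like this is commonly computed with integer ceiling division, so that an extra block is added only when there is a remainder. A minimal sketch in plain C++, where `blocksFor` is a hypothetical helper name:

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= n.
// Assumes n >= 0 and threadsPerBlock > 0.
// For n == 0 this returns 0, and a launch with zero blocks is not valid,
// so a caller would need to skip the launch in that case.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, `blocksFor(1025, 1024)` is 2, while `blocksFor(2048, 1024)` is also 2, whereas `2048 / 1024 + 1` would launch a third block whose threads all fail the bounds check.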

Parallel Reduction (Sum)

What is it?

Say you have a list of integers

    int nums[] = {1, 2, 3, 4, 5, 6, 7};

For many problems, it can be useful to reduce such a list of numbers down to a single value (their sum, product, max, min, etc.). In the case of the array above, the reduced sum would be equal to $28$, the reduced maximum would be equal to $7$, the reduced product would be equal to $7! = 5040$, etc.

I chose to implement a sum reduction. The choice was arbitrary, and the code could easily be changed to use any other associative binary operation.
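To illustrate how little changes between operations, here is a minimal serial sketch in C++ built on `std::accumulate`. The name `reduceWith` is hypothetical, and this is not the repository's implementation.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Serial reduction parameterized by the binary operation.
// `init` should be a neutral starting value for `op`
// (0 for addition, 1 for multiplication, the first element for max).
template <typename BinaryOp>
int reduceWith(const std::vector<int> &nums, int init, BinaryOp op) {
    return std::accumulate(nums.begin(), nums.end(), init, op);
}
```

For the array above, `reduceWith(nums, 0, std::plus<int>{})` gives 28, `reduceWith(nums, 1, std::multiplies<int>{})` gives 5040, and `reduceWith(nums, nums[0], [](int x, int y) { return std::max(x, y); })` gives 7.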

My First Implementation

My first attempt at parallelization computed the sum correctly, but took orders of magnitude longer than the simple serial sum. I cannot say exactly what caused this poor performance, but I am fairly confident it came from my choice to make the kernel recursive, which is generally not advisable for code that runs on a GPU.

Second Attempt Using CUDA Libraries

Having struggled with writing the parallel reduction from scratch, I decided to take a step back and learn some more CUDA. I started with NVIDIA's CUDA Made Easy course from the NVIDIA Accelerated Computing Hub, which opens by teaching CUDA's Thrust library. Thrust provides a variety of standard prewritten functions (reduce, transform, etc.) that can be used and combined to implement algorithms in parallel without writing them in raw CUDA from scratch.

Code

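#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/universal_vector.h>

// Sums on the CPU. The initial value 0 is an int, so the sum accumulates in int.
long serialSum(const thrust::universal_vector<int>& nums) {
    return thrust::reduce(thrust::host, nums.begin(), nums.end(), 0, thrust::plus<int>{});
}

// Same call on the GPU: only the execution policy changes.
long parallelSum(const thrust::universal_vector<int>& nums) {
    return thrust::reduce(thrust::device, nums.begin(), nums.end(), 0, thrust::plus<int>{});
}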

Using Thrust for parallel reduction was straightforward due to the existence of thrust::reduce. The only difference between serialSum and parallelSum is the first argument: thrust::host is used when the function is to run on the CPU, and thrust::device is used when it is to run on the GPU.

Next Steps

The next step in this project is to implement a parallel reduction in custom CUDA, this time without the recursion I used in my first attempt. This will not only help optimize the algorithm but also deepen my understanding of parallel reduction and memory management in CUDA.
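One common non-recursive approach is an iterative pairwise (tree) reduction, where each pass folds the upper half of the remaining values onto the lower half. The sketch below runs on the CPU and only illustrates the access pattern; it is not the planned kernel, and `pairwiseSum` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of an iterative (non-recursive) pairwise sum reduction.
// Each pass folds the upper half of the active range onto the lower half.
// The inner-loop iterations are independent of one another, so on a GPU
// each one could be handled by a separate thread, with a synchronization
// between passes.
long long pairwiseSum(const std::vector<int> &nums) {
    std::vector<long long> data(nums.begin(), nums.end());
    for (std::size_t active = data.size(); active > 1; active = (active + 1) / 2) {
        const std::size_t half = (active + 1) / 2;  // rounds up for odd sizes
        for (std::size_t i = 0; i + half < active; ++i) {
            data[i] += data[i + half];
        }
    }
    return data.empty() ? 0 : data[0];
}
```

For `{1, 2, 3, 4, 5, 6, 7}` the passes leave 4, then 2, then 1 active values, and the final value is 28. The number of passes grows with the logarithm of the input size, which is what makes the scheme attractive for parallel hardware.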

About

My personal project of learning gpu programming
