This repository contains various algorithms implemented using CUDA for parallel processing. My goal is to explore and experiment with parallelization techniques in CUDA across different types of algorithms. Each algorithm will be added as an independent module, and the repository will be updated with new implementations over time.
I am by no means an expert in CUDA (or C++), and do not claim that my solutions are optimal or well written. If you see any room for improvement feel free to message me or make a pull request implementing the fix with a detailed explanation. Upon receiving either I will fix the issue and give you credit in a comment!
- Hardware & Software Requirements
- Latest version of CUDA (as I am writing this it is 12.9)
- NVIDIA GPU with CUDA capability (I use an RTX 2000 Ada, which has compute capability 8.9)
- Dependencies
- Python Libraries (For runtime plotting notebooks): matplotlib
- Fork this repository
- Clone the repository
    git clone https://github.com/{Your GitHub Username}/cuda-programming
- Navigate to the local repository
    cd ./cuda-programming
- Navigate to any of the folders, e.g. hello_cuda
    cd ./hello_cuda
- Compile with NVCC
    nvcc -o your_cuda your_cuda.cu
Run the binary with input from a file. Currently, the input to each of my programs is a single integer giving the size of the randomly generated input array for the algorithm.
    cat input.txt | ./your_cuda
Input can also be given interactively by running
    ./your_cuda
The program will then wait for you to enter a number on a new line.
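As a rough illustration of the input convention described above, the sketch below reads one positive integer from a stream and builds a random array of that size. It is not the code used in this repository: `readArraySize`, `makeRandomArray`, the seed parameter, and the `[0, maxValue]` value range are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <random>
#include <sstream>   // std::istringstream, handy for trying this without stdin
#include <stdexcept>
#include <vector>

// Hypothetical helper: read the array size (one positive integer) from a stream.
// Passing std::cin covers both `cat input.txt | ./your_cuda` and typed input.
std::size_t readArraySize(std::istream& in) {
    long long n = 0;
    if (!(in >> n) || n <= 0) {
        throw std::invalid_argument("expected a positive integer array size");
    }
    return static_cast<std::size_t>(n);
}

// Hypothetical helper: build n pseudo-random ints in [0, maxValue].
// A fixed seed makes runs repeatable, which helps when comparing timings.
std::vector<int> makeRandomArray(std::size_t n, unsigned seed, int maxValue = 100) {
    assert(maxValue >= 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, maxValue);
    std::vector<int> values(n);
    for (int& v : values) {
        v = dist(rng);
    }
    return values;
}
```

Taking a `std::istream&` rather than reading `std::cin` directly means the same function works for piped files, typed input, and in-memory strings.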
- Open Terminal
- Navigate to the directory where you cloned this repository
- Create your own python virtual environment in the root directory
    python -m venv desired_env_name
- Activate the environment by running
    source ./desired_env_name/bin/activate
- Install python dependencies by running
    pip install -r requirements.txt
- Once you are finished, deactivate the python virtual environment by running
    deactivate
The serial algorithm I wrote to compare to a parallel implementation used a simple for loop.
void addVectorsSerial(int *a, int *b, int *c, int vectorLength) {
for (int i = 0; i < vectorLength; i++) {
c[i] = a[i] + b[i];
}
}
The CUDA kernel itself was quite simple, requiring only pointers to the input and output arrays and the size of the arrays.
__global__ void addVectorsParallel(int *a, int *b, int *c, int N) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
c[i] = a[i] + b[i];
}
}
The first line of the kernel computes the thread's global index. If that index falls within the array, the thread performs the addition. I couldn't find a way to avoid this check without giving up the property that the input array can be of arbitrary size. Reach out to me or make a pull request with an explanation if such a solution exists!
The call to the CUDA kernel was as follows:
addVectorsParallel<<<N / 1024 + 1, 1024>>>(pa, pb, pOut, N);
One block was launched for each full multiple of 1024 elements, plus one more for any remainder, with 1024 threads per block.
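The grid-size arithmetic can be checked on the host in plain C++. The sketch below compares the "divide, then add one" form with ceiling division. The helper names `blocksFor` and `blocksPlusOne` are made up for this example. Both forms cover every element. They differ only when the element count is an exact multiple of the block size, where the plus-one form launches one extra block whose threads all fail the bounds check.

```cpp
#include <cassert>

// Ceiling division: the smallest block count with blocks * threadsPerBlock >= n.
// Assumes n + threadsPerBlock - 1 does not overflow int.
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// The "divide, then add one" form, shown for comparison.
int blocksPlusOne(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return n / threadsPerBlock + 1;
}
```

With a helper like this, the launch could be written as `addVectorsParallel<<<blocksFor(N, 1024), 1024>>>(pa, pb, pOut, N);`. The bounds check inside the kernel is still needed whenever `N` is not a multiple of 1024.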
Say you have a list of integers
int nums[] = {1, 2, 3, 4, 5, 6, 7};
For many problems, it can be useful to reduce such a list of numbers down to a single value (their sum, product, max, min, etc.). In the case of the array above, the reduced sum would be equal to
1 + 2 + 3 + 4 + 5 + 6 + 7 = 28
I specifically wrote a reduction sum, the choice of which was arbitrary and could be easily changed to any other binary operation.
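To make the "any other binary operation" point concrete, here is a small serial sketch in standard C++ in which the operation is a parameter. It uses `std::accumulate` from the standard library rather than anything from this repository, and the names `reduceWith`, `sumOf`, `productOf`, and `maxOf` are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Fold a list down to one value with any binary operation.
// The type of `init` is also the accumulator type, so passing 0L sums in long.
template <typename T, typename Op>
T reduceWith(const std::vector<int>& nums, T init, Op op) {
    return std::accumulate(nums.begin(), nums.end(), init, op);
}

long sumOf(const std::vector<int>& nums) {
    return reduceWith(nums, 0L, std::plus<long>{});
}

long productOf(const std::vector<int>& nums) {
    return reduceWith(nums, 1L, std::multiplies<long>{});
}

// Max has no neutral starting value for ints here, so start from the first element.
int maxOf(const std::vector<int>& nums) {
    assert(!nums.empty());
    return reduceWith(nums, nums.front(), [](int a, int b) { return std::max(a, b); });
}
```

For the array `{1, 2, 3, 4, 5, 6, 7}` these give 28, 5040, and 7. A parallel version regroups the operands, so the operation has to be associative, as addition, multiplication, max, and min all are.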
My first attempt at parallelization computed the sum correctly, but took orders of magnitude longer than the simple serial sum. I cannot say exactly what caused this poor performance, but I am fairly confident that it had to do with the choice I made to make the kernel run recursively (not advisable for code that will run on a GPU).
Having struggled with writing the parallel reduction from scratch, I decided to take a step back and learn some more CUDA. I began with NVIDIA's CUDA Made Easy course from the NVIDIA Accelerated Computing Hub, which began by teaching CUDA's Thrust library. The Thrust library provides a variety of standard pre-written functions (reduce, transform, etc.) which can be used and combined to implement various algorithms in parallel without having to write them in raw CUDA completely from scratch.
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/universal_vector.h>
// The initial value 0L makes the accumulator (and the result) a long rather than an int.
long serialSum(const thrust::universal_vector<int>& nums) {
    return thrust::reduce(thrust::host, nums.begin(), nums.end(), 0L, thrust::plus<long>{});
}
long parallelSum(const thrust::universal_vector<int>& nums) {
    return thrust::reduce(thrust::device, nums.begin(), nums.end(), 0L, thrust::plus<long>{});
}
Using Thrust for parallel reduction was straightforward due to the existence of thrust::reduce. The only difference between serialSum and parallelSum is in the first argument; thrust::host is used when the function is to be run on the CPU, and thrust::device is used when the function is to be run on the GPU.
The next step in this project is to implement a parallel reduction in custom CUDA, this time without the recursion used in my first attempt. This will not only help optimize the algorithm but also deepen my understanding of parallel reduction and memory management in CUDA.
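One common way to avoid recursion is an iterative pairwise (tree) reduction: on each pass, the first half of the active range absorbs the second half, until a single value remains. The sketch below simulates that access pattern on the CPU in plain C++. It is not CUDA code and not the implementation planned for this repository, and `treeReduceSum` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of an iterative (non-recursive) tree reduction.
long treeReduceSum(const std::vector<int>& input) {
    if (input.empty()) {
        return 0;
    }
    // Work on a copy widened to long so partial sums have room to grow.
    std::vector<long> partial(input.begin(), input.end());
    std::size_t active = partial.size();
    while (active > 1) {
        std::size_t half = (active + 1) / 2;  // rounds up, so odd sizes work
        // Each iteration writes an index below `half` and reads one at or above it,
        // so the iterations are independent of one another within a pass.
        for (std::size_t i = 0; i + half < active; ++i) {
            partial[i] += partial[i + half];
        }
        active = half;
    }
    return partial[0];
}
```

For the seven-element array from earlier, the passes shrink the active range from 7 to 4 to 2 to 1 and leave 28 in the first slot. In a kernel, each iteration of the inner loop would typically be handled by one thread, with a synchronization between passes (for example `__syncthreads()` within a block).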

