What is CUDA and How to Write Code?
Gift University, Gujranwala
CUDA (Compute Unified Device Architecture) is a platform created by NVIDIA that allows software to use the GPU for general purpose processing.
Unlocks the power of the graphics card for math, science, and AI, not just gaming.
Think of your CPU as a brilliant Math Professor (smart, but works alone).
Think of CUDA as hiring 1,000 Students (less experienced, but they work together to finish the job much faster).
CPU ("The Race Car"): Few powerful cores. Optimized for serial processing (doing one thing at a time very quickly).
Extremely fast for one person, but can't move 50 people at once.
GPU ("The Bus"): Thousands of smaller cores. Optimized for parallel processing (doing many things at once).
Slower top speed than a race car, but transports 50 people simultaneously.
Understanding how CUDA organizes work using a Construction Site analogy.
Thread: "The Worker"
One worker laying a single brick. The smallest unit of execution.
Block: "The Team"
A group of workers building one wall together. Threads in a block can share memory.
Grid: "The Site"
The entire construction site. A collection of all blocks working on the full problem.
Host Memory (CPU): System RAM. Where your main program starts.
Device Memory (GPU): Video RAM (VRAM). Where the heavy lifting happens.
The GPU cannot access CPU memory directly. Moving data is the slowest part.
"Processing on the GPU is instant, but getting data there is like shipping a package. You want to ship a full truckload (large data), not just one envelope at a time."
Think of it like a Chef's Workflow:
1. Get the bowls ready (Reserve GPU Memory)
2. Pour ingredients into the bowls (Send Data CPU → GPU)
3. Turn on the mixer (Execute Kernel on GPU)
4. Pour the batter back into the pan (Send Results GPU → CPU)
5. Wash the bowls (Free GPU Memory)
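The chef's steps map one-to-one onto CUDA API calls. A minimal sketch, assuming a placeholder kernel named `process` that doubles each value in an array of `n` floats:

```cuda
// Placeholder kernel for illustration: doubles every element.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// The five-step workflow, one API call per step.
void run(float *h_data, int n) {
    float *d_data;
    int size = n * sizeof(float);
    cudaMalloc(&d_data, size);                                 // 1. Reserve GPU memory
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // 2. Send data CPU → GPU
    process<<<(n + 255) / 256, 256>>>(d_data, n);              // 3. Execute kernel on GPU
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // 4. Send results GPU → CPU
    cudaFree(d_data);                                          // 5. Free GPU memory
}
```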
__global__ void addArrays(...) {
// Calculate unique ID
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}

__global__: "Hey Compiler, this function is special. It runs on the GPU and is called from the CPU."
blockIdx.x * blockDim.x: "Which team (block) am I in, and how big is each team?"
+ threadIdx.x: "Which worker number am I inside my team?"
i = Global ID
Calculating 'i' gives every thread a unique ID badge so it knows exactly which number in the array to process.
Just like malloc() in standard C, we use cudaMalloc() to reserve memory on the GPU.
int *d_a, *d_b, *d_c; // 'd' stands for Device
int size = n * sizeof(int);
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);

We move the numbers from system RAM to the graphics card.
// From Host (CPU) to Device (GPU)
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

The triple angle brackets <<< >>> tell the GPU how many blocks and threads to use.
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
addArrays<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

Putting it all together, here is the complete program:

#include <stdio.h>
__global__ void add(int *a, int *b, int *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) c[i] = a[i] + b[i];
}
int main() {
int n = 10;
int size = n * sizeof(int);
int h_a[10] = {1,2,3,4,5,6,7,8,9,10}, h_b[10] = {1,1,1,1,1,1,1,1,1,1}, h_c[10];
int *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size); cudaMalloc(&d_b, size); cudaMalloc(&d_c, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
add<<<1, n>>>(d_a, d_b, d_c, n);
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
for(int i=0; i<n; i++) printf("%d + %d = %d\n", h_a[i], h_b[i], h_c[i]);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

CUDA unlocks the massive power of GPUs for everyday tasks.
Contact:
Danish Ali, Abdullah Azam
Ameer Hamza Bajwa, Muhammad Qasim
Thank You for your time!