What is an accelerator?

Accelerators - Any non-CPU piece of hardware that can carry out work

There are two major examples of this:

  • Graphics Processing Unit (GPU) - A specialised processor designed to accelerate graphics processing
  • Field Programmable Gate Array (FPGA) - A piece of silicon with programmable logic gates

How do we use an accelerator?

Kernel - The name for a portion of code that is sent to an accelerator

Offloading - The term for sending work intended for the CPU to an accelerator

What frameworks will we cover?

  • OpenMP
  • OpenCL
  • OpenACC
  • CUDA

Types of framework

Programmatic

  • Uses specific function calls to give a high amount of control over the behaviour
  • This is typically non-portable
  • Prescriptive - User explicitly specifies actions to be taken by the compiler

Directive

  • Uses compiler directives (pragmas) to describe how the parallelism should be done
  • Portable
  • Descriptive - The user ‘describes’ and guides the compiler, but the compiler itself makes the final decision

OpenACC

To address OpenCL’s issues, OpenACC was designed around three major pillars:

  1. Simplicity - OpenCL can be quite tricky to learn; it requires several steps to get a kernel to run on an accelerator:

    • Declare variables for the input
    • Declare variables for the GPU buffer
    • Create a reference to the accelerator
    • Create a command queue for that accelerator
    • Initialise the buffer
    • Carry out the work
    • Free the memory
  2. Power - OpenCL has poor support for crypto and machine learning

    • Applications developed using OpenACC achieve significant speed increases, especially considering the small amount of code required
  3. Portability - There was no dominant manufacturer so supporting a variety of accelerators was important

    A directive just gives the compiler general instructions, so compiler developers have much more freedom to decide exactly how best to implement it

    • Nvidia has pledged support for all of its boards
    • AMD has provided support so far, but refuses to guarantee this will continue
    • Intel has said it will add support in the future

#pragma acc kernels - Automatically distribute loops across accelerator threads

Example of expressing parallelism


#pragma acc kernels
{
  for (int i = 0; i < N; i++) {
    x[i] = i;
    y[i] = i * i;
  }

  for (int i = 0; i < N; i++) {
    z[i] = x[i] * y[i];
  }
}

#pragma acc parallel - Explicitly state that this portion of code should be parallelised. Works for loops or segments of code

Example


#pragma acc parallel
{
  for (int i = 0; i < N; i++) {
    x[i] = i;
    y[i] = i * i;
  }
}

#pragma acc loop - Explicitly state that this loop should be work-shared across threads, in a similar way to OpenMP

Example



#pragma acc parallel
{
  #pragma acc loop
  for (int i = 0; i < N; i++) {
    x[i] = i;
    y[i] = i * i;
  }
}

#pragma acc data - This sets up data persistence on the accelerator: the data remains there regardless of compute regions being created and destroyed


#pragma acc data
{
  #pragma acc parallel
  {
    for (int i = 0; i < N; i++) {
      x[i] = i;
      y[i] = i * i;
    }
  }
}

copy - Specify what data must be copied to the accelerator at the start of the region and copied back to the host at the end


#pragma acc data copy(x, y)
#pragma acc parallel
{
  for (int i = 0; i < N; i++) {
    x[i] = i;
    y[i] = i * i;
  }
}

copyout - This tells the compiler the data should be copied back to the host at the end of the region; if it isn’t also specified as copied in, the variable is simply created on the accelerator

create - This creates a local variable on the accelerator only, which is good for temporary variables
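A minimal sketch combining these data clauses (assuming x, z and tmp are fixed-size arrays of length N, and that tmp is scratch space only ever needed on the accelerator):


#pragma acc data copy(x) copyout(z) create(tmp)
{
  #pragma acc kernels
  {
    for (int i = 0; i < N; i++) {
      tmp[i] = x[i] * x[i];    // tmp exists only on the accelerator
    }

    for (int i = 0; i < N; i++) {
      z[i] = tmp[i] + x[i];    // z is copied back to the host when the data region ends
    }
  }
}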

OpenCL

Computational intensity - The ratio of mathematical operations to unique memory operations.

Maths is quite fast: a floating-point multiplication takes about 3 clock cycles. Memory is quite slow: a load from RAM takes around 100 clock cycles.

Operation                    Maths ops : memory ops    Cycles (maths : memory)
a[i] = b[i] + c[i]           1 : 3                     3 : 300
a[i] = b[i] + c[i] * d[i]    2 : 4                     6 : 400
a[i]++                       1 : 2                     3 : 200
a[i] += a[i] * a[i]          2 : 1                     6 : 100

OpenCL structure

  1. Setup
    • Get the devices and platform
    • Create a context
    • Create a command queue
  2. Compilation
    • Create a program
    • Compile it
    • Create the kernels
  3. Runtime
    • Create memory objects
    • Copy data to the GPU
    • Set the kernel arguments
    • Execute the kernels
    • Copy the data back from the GPU
    • Wait for the commands to finish
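A minimal host-side sketch of these three phases (illustrative only: error checking is omitted, and the kernel name square and the variable names are assumptions, not from the lecture):


#include <CL/cl.h>
#include <stdio.h>

// Illustrative kernel source: squares each element in place
const char *src =
  "__kernel void square(__global float *a) {"
  "  int i = get_global_id(0);"
  "  a[i] = a[i] * a[i];"
  "}";

int main() {
  float data[64];
  for (int i = 0; i < 64; i++) data[i] = (float)i;

  // Setup: platform, device, context, command queue
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

  // Compilation: create a program, compile it, create the kernel
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
  clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
  cl_kernel kernel = clCreateKernel(prog, "square", NULL);

  // Runtime: memory objects, copy in, set arguments, execute, copy back, wait
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(data), NULL, NULL);
  clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
  size_t global = 64;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
  clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
  clFinish(queue);

  printf("data[5] = %f\n", data[5]);
  return 0;
}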

How do we use GPUs?

Use of dimensions

Work item - A thread associated with a particular element of the computation

Global dimensions - How many dimensions does the data have

Local dimensions - A smaller, workable chunk of the global dimensions (a work-group)
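As an illustration (queue and kernel are assumed to come from a setup like the sketch above), a 2D problem of 1024×1024 elements might be launched in 16×16 chunks:


// Hypothetical launch: 1024x1024 global work items, split into 16x16 work-groups
size_t global[2] = {1024, 1024};   // global dimensions: the shape of the whole data set
size_t local[2]  = {16, 16};       // local dimensions: one workable chunk of it
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);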

Why should we focus on GPUs?

  • The power of GPUs is growing at a much faster rate than CPUs

  • The speed of improvement is impressive; it is likely that languages will start adding dedicated GPU support by default

  • There are two major factors stopping widespread FPGA adoption

    1. The reconfiguring time is still very high and variable (1 hour - 30 hours)
    2. The size of the programmable arrays is still quite limited
  • Field Programmable Gate Arrays (FPGA)

  • Configurable Logic Blocks (CLB)

  • AI Accelerator (AIA)

  • Cryptographic Accelerator (CA)

CUDA

CUDA is likely to be the language you will program in if you go into heavy HPC work. So why might somebody choose it over OpenCL?

  • There are specific read and write libraries that are very efficient
  • There is a software caching mode that handles data transfer
  • There is a wide variety of libraries
  • The development tools are incredibly well done

GPU Computing Using CUDA

What is a GPU?

  • Special type of co-processor to accelerate graphics computation
  • Contains many low-powered cores
  • Has a special memory architecture

GPGPU

  • Use of GPUs to perform things other than graphical computations
    • Bioinformatics
    • Cryptography
    • Neural Networks
    • AI / Image Processing
  • First occurred as “abuse/misuse” of graphical APIs
  • BrookGPU was an early and influential framework

CUDA Basics

  • Architecture

    • Host - CPU
    • Device - GPU
    • Program and data are passed along PCIe
    • K multiprocessors; global memory is divided into memory blocks of size b
  • Memory Model

    • Host and device can both access global memory (unified memory)
    • Shared memory is accessed only by threads on that MP (multiprocessor)
      • Private per-thread memory is allocated in shared memory
      • Features of …
  • Programming Model

    • A kernel is a program written for a GPU. We write it in CUDA C and it gives the instructions for one thread; we then run many threads on the GPU at the same time
    • Each thread follows the same instructions - they execute in lockstep
    • Threads are organised into Grids, Blocks and Warps

Grids, Blocks and Warps

  • The Grid of threads represents all threads launched on the GPU
  • The Grid is divided into Blocks
    • You specify the size
    • Blocks are resident on one SM (block size is a multiple of the warp size)
    • Threads within a block can communicate via shared memory
    • Threads in a block
  • Blocks are divided into Warps
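A sketch of how a single thread locates its element when the grid and blocks are two-dimensional (the kernel name scale and the matrix layout are illustrative, not from the lecture):


// Each thread computes one element's (row, col) from its block and thread coordinates
__global__ void scale(float *m, int width, int height, float factor) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < height && col < width) {
    m[row * width + col] *= factor;
  }
}

// Host side: 16x16 threads per block, enough blocks to cover the whole matrix
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale<<<grid, block>>>(device_m, width, height, 2.0f);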

Accessing Memory

  • We access memory in CUDA C like we would do in a C array

    • However, we should make accesses coalesced
    • To use a value in global memory, it must be fetched from global and placed in shared memory
    • This causes a whole block to be fetched from global into shared - potentially expensive
  • When we access shared memory, we can access b distinct banks in parallel

    • Otherwise this causes a bank conflict and the accesses will serialise
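A sketch of the coalescing difference (both kernels are illustrative): in the first, consecutive threads read consecutive addresses, so the accesses coalesce; in the second, each thread strides through memory, which typically needs many more memory transactions:


__global__ void copy_coalesced(float *dst, const float *src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];          // thread i touches element i: coalesced
}

__global__ void copy_strided(float *dst, const float *src, int n, int stride) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  if (i < n) dst[i] = src[i];          // neighbouring threads are stride elements apart
}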

Sample Programs to Demonstrate CUDA

Example:


// Kernel definition
// CUDA Kernel Device code

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < numElements) {
    C[i] = A[i] + B[i] + 0.0f;
  }
}
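A host-side sketch of launching this kernel (d_A, d_B and d_C are assumed device pointers already filled with the inputs; threadsPerBlock and blocksPerGrid are illustrative names). The grid size is rounded up so that every element gets a thread, which is why the kernel needs its bounds check:


int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
cudaDeviceSynchronize();   // wait for the kernel to finish before using the result on the host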

CUDA Recipe

CUDA Recipe - The name for the standard sequence of steps used to port serial code to a CUDA kernel

Process of CUDA Recipe

  1. Convert the code into a serial kernel

    • The separated method could be run on another computer
      int *calculate(int *a, int *b, int *c) {
        for (int i = 0; i < 40; i++) {
          c[i] = a[i] + b[i];
        }
        return c;
      }
      int main() {
        int *a = (int *)malloc(40 * sizeof(int));
        int *b = (int *)malloc(40 * sizeof(int));
        int *c = (int *)malloc(40 * sizeof(int));
        // Read a file to assign a and b values
        c = calculate(a, b, c);
        free(a);
        free(b);
        free(c);
      }
    
  2. Make it a CUDA kernel by adding "__global__"

    • Adding __global__ above a function turns it into a CUDA kernel; a kernel must return void, so the result is written through the c pointer rather than returned
     __global__
     void calculate(int *a, int *b, int *c) {
       for (int i = 0; i < 40; i++) {
         c[i] = a[i] + b[i];
       }
     }
     int main() {
       int *a = (int *)malloc(40 * sizeof(int));
       int *b = (int *)malloc(40 * sizeof(int));
       int *c = (int *)malloc(40 * sizeof(int));
       // Read a file to assign a and b values
       calculate(a, b, c);   // not yet a real kernel launch; the launch syntax is added in step 7
       free(a);
       free(b);
       free(c);
     }
    
  3. Replace loops in the CUDA kernel (thread id indexing)

    • Each thread now alters only a single element, chosen by its thread id, so the loop is no longer needed
      __global__
      void calculate(int *a, int *b, int *c) {
        int thread_id = (blockIdx.x * blockDim.x) + threadIdx.x;
        c[thread_id] = a[thread_id] + b[thread_id];
      }
      int main() {
        int *a = (int *)malloc(40 * sizeof(int));
        int *b = (int *)malloc(40 * sizeof(int));
        int *c = (int *)malloc(40 * sizeof(int));
        // Read a file to assign a and b values
        calculate(a, b, c);   // not yet a real kernel launch; the launch syntax is added in step 7
        free(a);
        free(b);
        free(c);
      }
    
  4. Handle problem sizes smaller than the thread number

    • Adding the condition seems fairly obvious
    • Debugging parallel code can be quite tricky to do
      __global__
      void calculate(int *a, int *b, int *c) {
        int thread_id = (blockIdx.x * blockDim.x) + threadIdx.x;
        if (thread_id < 40)
          c[thread_id] = a[thread_id] + b[thread_id];
      }
    
    
    
  5. Allocate device memory

    
    __global__
    void calculate(int *a, int *b, int *c) {
        int thread_id = (blockIdx.x * blockDim.x) + threadIdx.x;
        if (thread_id < 40)
            c[thread_id] = a[thread_id] + b[thread_id];
    }
    int main() {
        int *device_a, *device_b, *device_c;
        int *a = (int *)malloc(40 * sizeof(int));
        int *b = (int *)malloc(40 * sizeof(int));
        int *c = (int *)malloc(40 * sizeof(int));
        // set up device memory
        cudaMalloc(&device_a, 40 * sizeof(int));
        cudaMalloc(&device_b, 40 * sizeof(int));
        cudaMalloc(&device_c, 40 * sizeof(int));
        // Read a file to assign a and b values
        calculate(a, b, c);   // not yet a real kernel launch; the launch syntax is added in step 7
        free(a);
        free(b);
        free(c);
    }
    
    
  6. Copy data from host to device

    __global__
    void calculate(int *a, int *b, int *c) {
        int thread_id = (blockIdx.x * blockDim.x) + threadIdx.x;
        if (thread_id < 40)
            c[thread_id] = a[thread_id] + b[thread_id];
    }

    int main() {
       int *device_a, *device_b, *device_c;
       int *a = (int *)malloc(40 * sizeof(int));
       int *b = (int *)malloc(40 * sizeof(int));
       int *c = (int *)malloc(40 * sizeof(int));
       // set up device memory
       cudaMalloc(&device_a, 40 * sizeof(int));
       cudaMalloc(&device_b, 40 * sizeof(int));
       cudaMalloc(&device_c, 40 * sizeof(int));
       // Read a file to assign a and b values
       // copy to device
       cudaMemcpy(device_a, a, 40 * sizeof(int), cudaMemcpyHostToDevice);
       cudaMemcpy(device_b, b, 40 * sizeof(int), cudaMemcpyHostToDevice);
       calculate(a, b, c);   // not yet a real kernel launch; the launch syntax is added in the next step
       free(a);
       free(b);
       free(c);
    }
    
    
  7. Call the CUDA kernel

    • Adding <<<x, y>>> between the function name and parameters sends it to the GPU
    • x - Number of blocks to use
    • y - Number of threads per block
       __global__
       void calculate(int *a, int *b, int *c) {
           int thread_id = (blockIdx.x * blockDim.x) + threadIdx.x;
           if (thread_id < 40)
               c[thread_id] = a[thread_id] + b[thread_id];
       }
       int main() {
         int *device_a, *device_b, *device_c;
         int *a = (int *)malloc(40 * sizeof(int));
         int *b = (int *)malloc(40 * sizeof(int));
         int *c = (int *)malloc(40 * sizeof(int));
         // set up device memory
         cudaMalloc(&device_a, 40 * sizeof(int));
         cudaMalloc(&device_b, 40 * sizeof(int));
         cudaMalloc(&device_c, 40 * sizeof(int));
         // Read a file to assign a and b values
         // copy to device
         cudaMemcpy(device_a, a, 40 * sizeof(int), cudaMemcpyHostToDevice);
         cudaMemcpy(device_b, b, 40 * sizeof(int), cudaMemcpyHostToDevice);
         // launch 5 blocks of 8 threads (40 threads) on the device pointers
         calculate<<<5, 8>>>(device_a, device_b, device_c);
         free(a);
         free(b);
         free(c);
       }
    
  8. Copy data from device to host

     int main() {
       int *device_a, *device_b, *device_c;
       int *a = (int *)malloc(40 * sizeof(int));
       int *b = (int *)malloc(40 * sizeof(int));
       int *c = (int *)malloc(40 * sizeof(int));
       // set up device memory
       cudaMalloc(&device_a, 40 * sizeof(int));
       cudaMalloc(&device_b, 40 * sizeof(int));
       cudaMalloc(&device_c, 40 * sizeof(int));
       // Read a file to assign a and b values
       // copy to device
       cudaMemcpy(device_a, a, 40 * sizeof(int), cudaMemcpyHostToDevice);
       cudaMemcpy(device_b, b, 40 * sizeof(int), cudaMemcpyHostToDevice);
       calculate<<<5, 8>>>(device_a, device_b, device_c);
       // copy results back
       cudaMemcpy(c, device_c, 40 * sizeof(int), cudaMemcpyDeviceToHost);
       // free device and host memory
       cudaFree(device_a);
       cudaFree(device_b);
       cudaFree(device_c);
       free(a);
       free(b);
       free(c);
     }