We can achieve a pretty good speed-up on normal hardware
Thread based parallelism
Targeted at shared memory
Uses work sharing and tasks
Process | Thread |
---|---|
A basic unit of work for the operating system | Part of a program that can be run independently of other portions |
Big overheads in creating/destroying/context switching | Small overheads in comparison |
Isolated from other processes | Shares memory with other threads in the same process, e.g. open files |
 | Shares the heap |
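To make the shared-memory picture concrete, here is a minimal sketch of a parallel region; the thread count of 4 is just an illustrative choice, and compiling needs an OpenMP flag such as gcc -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Every thread in the team runs this block; they share the process's
       heap and globals, but each thread gets its own stack. */
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();   /* private: lives on this thread's stack */
        printf("Hello from thread %d of %d\n", id, omp_get_num_threads());
    }
    return 0;
}
```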
expression | description |
---|---|
#pragma omp parallel | Creates a team of threads that all execute the following block |
#pragma omp for | If the loop iterations are independent you can use data parallelism; the iterations are split among the team's threads |
#pragma omp parallel shared(i) | Shared access across all threads |
#pragma omp parallel private(i) | Each thread gets its own (uninitialised) copy of the variable |
#pragma omp parallel firstprivate(i) | Each thread's private copy starts with the value the variable had before the region |
#pragma omp parallel for lastprivate(i) | The value from the last iteration is copied back out after the loop |
#pragma omp parallel for default(shared) | Variables are shared between threads unless stated otherwise |
#pragma omp parallel for reduction(+:sum) | Gives each thread a private partial result and combines them with the given operation (see the sketch after this table) |
#pragma omp parallel for schedule(static/dynamic, chunk) | Controls how the loop iterations are handed out to the threads |
#pragma omp parallel for collapse(n) | Merges n nested loops into a single iteration space to share out |
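As a concrete illustration of the data-sharing clauses in the table, here is a minimal parallel sum sketch; the array size and its contents are made up for the example.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* arbitrary size for the example */

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    /* 'a' and 'sum' are shared; the loop index is automatically private.
       reduction(+:sum) gives each thread a private partial sum and adds
       the partial sums together at the end of the loop. */
    #pragma omp parallel for default(shared) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   /* expect 1000000.0 */
    return 0;
}
```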
Scheduling and locking
If we have four threads and use static scheduling we get this; notice the split isn't even
If we use dynamic scheduling, then once a thread has finished its chunk of work it will take another job (sketched below)
schedule(guided, chunk) - Like dynamic, but the chunk size starts large and shrinks towards the given minimum as the iterations run out
schedule(runtime) - You can pass the schedule as an environment variable using OMP_SCHEDULE
schedule(auto) - Leaves the scheduling decision to the compiler/runtime, don't use this one
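A sketch of how the schedule clause looks in practice; the dummy work function and the chunk size of 100 are assumptions made for illustration.

```c
#include <omp.h>
#include <stdio.h>

/* Dummy work whose cost grows with i, so static chunks would be uneven. */
static double work(int i) {
    double x = 0.0;
    for (int j = 0; j < i; j++) x += j * 0.5;
    return x;
}

int main(void) {
    double total = 0.0;
    /* dynamic: a thread grabs the next chunk of 100 iterations as soon as it
       finishes its current one, balancing the uneven work above. Swap in
       schedule(static), schedule(guided, 100) or schedule(runtime) to compare. */
    #pragma omp parallel for schedule(dynamic, 100) reduction(+:total)
    for (int i = 0; i < 10000; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}
```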
omp_lock_t - This is the data type of a lock variable (the sketch after this list puts the routines together)
omp_init_lock - This initialises the lock variable and sets its value to 'unlocked', meaning any thread reaching that point doesn't have to stop
omp_set_lock - Once a thread passes this point, all other threads reaching this point have to wait until it unlocks
omp_unset_lock - Unlocks the variable lock
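A minimal sketch putting the lock routines above together around a shared counter (omp_destroy_lock, not listed above, frees the lock afterwards); for something this simple a reduction or atomic would normally be the better tool.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_lock_t lock;
    omp_init_lock(&lock);              /* lock starts out 'unlocked' */

    int counter = 0;
    #pragma omp parallel
    {
        for (int i = 0; i < 1000; i++) {
            omp_set_lock(&lock);       /* other threads wait here */
            counter++;                 /* only one thread in here at a time */
            omp_unset_lock(&lock);     /* let the next thread in */
        }
    }

    omp_destroy_lock(&lock);           /* release the lock's resources */
    printf("counter = %d\n", counter); /* 1000 * number of threads */
    return 0;
}
```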
We can use #pragma omp critical, which creates a lock for that portion of the code
Only one thread at a time can run that portion of code
A lower-overhead solution is #pragma omp atomic
It has hardware support, but it’s far less flexible than #pragma omp critical
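A short sketch contrasting the two; the per-iteration score formula is made up purely to give the threads something to race over.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int hits = 0;
    double best = -1.0;
    int best_i = -1;

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        double score = (i % 977) / 977.0;   /* made-up per-iteration result */

        /* A single memory update: atomic is enough and much cheaper. */
        #pragma omp atomic
        hits++;

        /* Several statements that must stay consistent together: critical. */
        #pragma omp critical
        {
            if (score > best) {
                best = score;
                best_i = i;
            }
        }
    }

    printf("hits=%d best=%f at i=%d\n", hits, best, best_i);
    return 0;
}
```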
#pragma omp atomic write
- This gives a statement exclusive write access to a variable, while still allowing the variable to be read
#pragma omp atomic read
- This temporarily locks a variable so it can only be read from, which avoids reading intermediate values
#pragma omp atomic update
- This protects a read-modify-write such as x++ or x += val, so updates from different threads can't interleave
#pragma omp atomic capture
- This atomically updates a variable and captures its value in the same step, similar to a small critical region, but it needs hardware support. If the support doesn't exist then it defaults to a critical region, which has some additional overhead
#pragma omp section
- Defines a block of work that only needs to be carried out by a single thread (used inside #pragma omp parallel sections, sketched below)
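A minimal sketch of sections; the two 'load' functions are hypothetical stand-ins for independent jobs such as loading two files.

```c
#include <omp.h>
#include <stdio.h>

/* Hypothetical independent jobs, stand-ins for e.g. loading two files. */
static void load_mesh(void)    { printf("mesh loaded by thread %d\n", omp_get_thread_num()); }
static void load_texture(void) { printf("texture loaded by thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        load_mesh();       /* carried out by exactly one thread */

        #pragma omp section
        load_texture();    /* carried out by one (possibly different) thread */
    }
    return 0;
}
```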
I've mentioned before that there are multiple ways to time things in C. The best method while using OpenMP is omp_get_wtime(); you don't have to worry about clock cycles etc.
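A sketch of timing a region this way; omp_get_wtime() returns wall-clock seconds as a double, so a simple difference is all that's needed. The loop body is an arbitrary workload for the example.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    double start = omp_get_wtime();          /* wall-clock time in seconds */

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 50000000; i++)
        sum += i * 0.5;                      /* arbitrary work to time */

    double elapsed = omp_get_wtime() - start;
    printf("sum=%f took %f s\n", sum, elapsed);
    return 0;
}
```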
Low-level locks don't always cause deadlocks; it's just easier to program them poorly
We need four things for a deadlock to occur: mutual exclusion, hold and wait, no preemption, and circular wait
#pragma omp barrier
- That's it, you don't need any code underneath it; every thread waits at the barrier until all of them have reached it
#pragma omp single
- Runs the block on a single thread and has an implicit barrier at the end, so the other threads carry on when it's done. Really handy if you need to do something that isn't thread safe, like file loading.
Using the master thread (#pragma omp master) is good if you just need one thread to execute something but don't need the implicit barrier that single gives you
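A sketch putting single, barrier and the master thread side by side; the 'config' variable is a hypothetical stand-in for something loaded once, like a file.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int config = 0;   /* shared; stand-in for data loaded from a file */

    #pragma omp parallel
    {
        /* One thread does the non-thread-safe setup; the implicit barrier at
           the end of single means the others wait until it has finished. */
        #pragma omp single
        config = 42;

        printf("thread %d sees config=%d\n", omp_get_thread_num(), config);

        #pragma omp barrier   /* everyone waits until everyone is here */

        /* Only the master thread runs this, and there is no implied barrier,
           so the other threads don't wait for it. */
        #pragma omp master
        printf("master (thread %d) reporting\n", omp_get_thread_num());
    }
    return 0;
}
```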
Non-Uniform Memory Access: Each CPU core has its own L1 and L2 caches, but shares an L3 cache
Spatial locality - if memory is accessed then its neighbours are likely to be accessed next (move data as a block, not as it's needed)
Temporal locality - if a variable is used, it is likely to be needed again soon (keep it in the cache)
An array may not be stored in one contiguous region of memory; it could be split across several different areas
The array, which we access as if it's all in one memory location, is actually split across the CPUs' local memories
The hardware has no idea what we are doing, so each core caches its own copy of the cache line holding sum_local, and the copies keep invalidating each other (false sharing)
We can add extra, unused elements to the array so that each thread's element sits on its own cache line; this is called padding
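A sketch of that padding trick for per-thread partial sums; the 64-byte cache-line size and the cap of 16 threads are assumptions made for the example.

```c
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 16
#define CACHE_LINE  64   /* assumed cache-line size in bytes */

/* Each partial sum gets a whole cache line to itself, so cores stop
   invalidating each other's lines on every update (no false sharing). */
struct padded_sum {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];
};

int main(void) {
    struct padded_sum sum_local[MAX_THREADS] = {0};

    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < 10000000; i++)
            sum_local[id].value += 1.0;
    }

    double total = 0.0;
    for (int t = 0; t < MAX_THREADS; t++)
        total += sum_local[t].value;
    printf("total = %f\n", total);   /* 4 threads * 10000000 */
    return 0;
}
```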