We can achieve a pretty good speed-up on normal hardware
Thread based parallelism
Targeted at shared memory
Uses work sharing and tasks
Process | Thread |
---|---|
A basic unit of work for the operating system | Part of a program that can be run independently of other portions |
Big overheads in creating/destroying/context switching | Small overheads in comparison |
Isolated from other processes | Shares memory with other threads in the same process, e.g. open files |
 | Shares the heap |
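To make the shared-memory picture concrete, here is a minimal sketch of a parallel region; the thread count of 4 is just an illustrative choice, and compiling needs an OpenMP flag such as gcc -fopenmp.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Every thread in the team runs this block; they share the process's
       heap and globals, but each thread gets its own stack. */
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();   /* private: lives on this thread's stack */
        printf("Hello from thread %d of %d\n", id, omp_get_num_threads());
    }
    return 0;
}
```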
expression | description |
---|---|
#pragma omp parallel | Creates a team of threads that all execute the following block |
#pragma omp for | If the loop iterations are independent you can use data parallelism; the iterations are split among the team's threads |
#pragma omp parallel shared(i) | Shared access across all threads |
#pragma omp parallel private(i) | Each thread gets its own (uninitialised) copy of the variable |
#pragma omp parallel firstprivate(i) | Each thread's private copy starts with the value the variable had before the region |
#pragma omp parallel for lastprivate(i) | The value from the last iteration is copied back out after the loop |
#pragma omp parallel for default(shared) | Variables are shared between threads unless stated otherwise |
#pragma omp parallel for reduction(+:sum) | Gives each thread a private partial result and combines them with the given operation (see the sketch after this table) |
#pragma omp parallel for schedule(static/dynamic, chunk) | Controls how the loop iterations are handed out to the threads |
#pragma omp parallel for collapse(n) | Merges n nested loops into a single iteration space to share out |
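As a concrete illustration of the data-sharing clauses in the table, here is a minimal parallel sum sketch; the array size and its contents are made up for the example.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* arbitrary size for the example */

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    /* 'a' and 'sum' are shared; the loop index is automatically private.
       reduction(+:sum) gives each thread a private partial sum and adds
       the partial sums together at the end of the loop. */
    #pragma omp parallel for default(shared) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   /* expect 1000000.0 */
    return 0;
}
```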
Scheduling and locking
If we have four threads and use static scheduling we get this; notice the split isn't even
If we use dynamic scheduling, then once a thread has finished its chunk of work it will take another job (sketched below)
schedule(guided, chunk) - Like dynamic, but the chunk size starts large and shrinks towards the given minimum as the iterations run out
schedule(runtime) - You can pass the schedule as an environment variable using OMP_SCHEDULE
schedule(auto) - Leaves the scheduling decision to the compiler/runtime, don't use this one
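A sketch of how the schedule clause looks in practice; the dummy work function and the chunk size of 100 are assumptions made for illustration.

```c
#include <omp.h>
#include <stdio.h>

/* Dummy work whose cost grows with i, so static chunks would be uneven. */
static double work(int i) {
    double x = 0.0;
    for (int j = 0; j < i; j++) x += j * 0.5;
    return x;
}

int main(void) {
    double total = 0.0;
    /* dynamic: a thread grabs the next chunk of 100 iterations as soon as it
       finishes its current one, balancing the uneven work above. Swap in
       schedule(static), schedule(guided, 100) or schedule(runtime) to compare. */
    #pragma omp parallel for schedule(dynamic, 100) reduction(+:total)
    for (int i = 0; i < 10000; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}
```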
omp_lock_t - This is the data type of a lock variable (the sketch after this list puts the routines together)
omp_init_lock - This initialises the lock variable and sets its value to 'unlocked', meaning any thread reaching that point doesn't have to stop
omp_set_lock - Once a thread passes this point, all other threads reaching this point have to wait until it unlocks
omp_unset_lock - Unlocks the variable lock
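A minimal sketch putting the lock routines above together around a shared counter (omp_destroy_lock, not listed above, frees the lock afterwards); for something this simple a reduction or atomic would normally be the better tool.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_lock_t lock;
    omp_init_lock(&lock);              /* lock starts out 'unlocked' */

    int counter = 0;
    #pragma omp parallel
    {
        for (int i = 0; i < 1000; i++) {
            omp_set_lock(&lock);       /* other threads wait here */
            counter++;                 /* only one thread in here at a time */
            omp_unset_lock(&lock);     /* let the next thread in */
        }
    }

    omp_destroy_lock(&lock);           /* release the lock's resources */
    printf("counter = %d\n", counter); /* 1000 * number of threads */
    return 0;
}
```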
We can use #pragma omp critical, which creates a lock for that portion of the code
Only one thread at a time can run that portion of code
A lower-overhead solution is #pragma omp atomic
It has hardware support, but it’s far less flexible than #pragma omp critical
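A short sketch contrasting the two; the per-iteration score formula is made up purely to give the threads something to race over.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int hits = 0;
    double best = -1.0;
    int best_i = -1;

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        double score = (i % 977) / 977.0;   /* made-up per-iteration result */

        /* A single memory update: atomic is enough and much cheaper. */
        #pragma omp atomic
        hits++;

        /* Several statements that must stay consistent together: critical. */
        #pragma omp critical
        {
            if (score > best) {
                best = score;
                best_i = i;
            }
        }
    }

    printf("hits=%d best=%f at i=%d\n", hits, best, best_i);
    return 0;
}
```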
#pragma omp atomic write
- This gives a statement exclusive write access to a variable, while still allowing the variable to be read
#pragma omp atomic read
- This temporarily locks a variable so it can only be read from, which avoids reading intermediate values
#pragma omp atomic update
- This protects a read-modify-write such as x++ or x += val, so updates from different threads can't interleave
#pragma omp atomic capture
- This atomically updates a variable and captures its value in the same step, similar to a small critical region, but it needs hardware support. If the support doesn't exist then it defaults to a critical region, which has some additional overhead
#pragma omp section
- Defines a block of work that only needs to be carried out by a single thread (used inside #pragma omp parallel sections, sketched below)
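A minimal sketch of sections; the two 'load' functions are hypothetical stand-ins for independent jobs such as loading two files.

```c
#include <omp.h>
#include <stdio.h>

/* Hypothetical independent jobs, stand-ins for e.g. loading two files. */
static void load_mesh(void)    { printf("mesh loaded by thread %d\n", omp_get_thread_num()); }
static void load_texture(void) { printf("texture loaded by thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        load_mesh();       /* carried out by exactly one thread */

        #pragma omp section
        load_texture();    /* carried out by one (possibly different) thread */
    }
    return 0;
}
```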
I've mentioned before that there are multiple ways to time things in C. The best method while using OpenMP is omp_get_wtime(); you don't have to worry about clock cycles etc.
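A sketch of timing a region this way; omp_get_wtime() returns wall-clock seconds as a double, so a simple difference is all that's needed. The loop body is an arbitrary workload for the example.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    double start = omp_get_wtime();          /* wall-clock time in seconds */

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 50000000; i++)
        sum += i * 0.5;                      /* arbitrary work to time */

    double elapsed = omp_get_wtime() - start;
    printf("sum=%f took %f s\n", sum, elapsed);
    return 0;
}
```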
Low-level locks don't always cause deadlocks; it's just easier to program them poorly
We need four things for a deadlock to occur: mutual exclusion, hold and wait, no preemption, and circular wait
#pragma omp barrier
- That's it, you don't need any code underneath it; every thread waits at the barrier until all of them have reached it
#pragma omp single
- Runs the block on a single thread and has an implicit barrier at the end, so the other threads carry on when it's done. Really handy if you need to do something that isn't thread safe, like file loading.
Using the master thread (#pragma omp master) is good if you just need one thread to execute something but don't need the implicit barrier that single gives you
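A sketch putting single, barrier and the master thread side by side; the 'config' variable is a hypothetical stand-in for something loaded once, like a file.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int config = 0;   /* shared; stand-in for data loaded from a file */

    #pragma omp parallel
    {
        /* One thread does the non-thread-safe setup; the implicit barrier at
           the end of single means the others wait until it has finished. */
        #pragma omp single
        config = 42;

        printf("thread %d sees config=%d\n", omp_get_thread_num(), config);

        #pragma omp barrier   /* everyone waits until everyone is here */

        /* Only the master thread runs this, and there is no implied barrier,
           so the other threads don't wait for it. */
        #pragma omp master
        printf("master (thread %d) reporting\n", omp_get_thread_num());
    }
    return 0;
}
```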
Non-Uniform Memory Access: Each CPU core has its own L1 and L2 caches, but shares an L3 cache
Spatial locality - if memory is accessed then its neighbours are likely to be accessed next (move data as a block, not as it's needed)
Temporal locality - if a variable is used, it is likely to be needed again soon (keep it in the cache)
An array may not be stored in one contiguous region of memory; it could be split across several different areas
The array, which we access as if it's all in one memory location, is actually split across the CPUs' local memories
The hardware has no idea what we are doing, so each core caches its own copy of the cache line holding sum_local, and the copies keep invalidating each other (false sharing)
We can add extra, unused elements to the array so that each thread's element sits on its own cache line; this is called padding
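A sketch of that padding trick for per-thread partial sums; the 64-byte cache-line size and the cap of 16 threads are assumptions made for the example.

```c
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 16
#define CACHE_LINE  64   /* assumed cache-line size in bytes */

/* Each partial sum gets a whole cache line to itself, so cores stop
   invalidating each other's lines on every update (no false sharing). */
struct padded_sum {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];
};

int main(void) {
    struct padded_sum sum_local[MAX_THREADS] = {0};

    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < 10000000; i++)
            sum_local[id].value += 1.0;
    }

    double total = 0.0;
    for (int t = 0; t < MAX_THREADS; t++)
        total += sum_local[t].value;
    printf("total = %f\n", total);   /* 4 threads * 10000000 */
    return 0;
}
```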