Parallelism: Two or more processes that execute simultaneously and do not share resources
Concurrency: Two or more processes that execute simultaneously and share at least one resource
Core: A processor built into a larger unit that can independently perform computational tasks
Thread: A software concept; a single sequential flow of control in a program.
Node: What is normally thought of as 'one computer': one motherboard. OpenMP is used to program within a node (minimal example below).
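A minimal sketch of OpenMP programming within a node (the thread count is whatever the runtime picks; compile with -fopenmp on GCC/Clang):

```cpp
// Each thread in the team prints its id; the team lives within one node.
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel  // fork a team of threads on this node
    {
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```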
Parallel Programming in OpenMP by Rohit Chandra et al.
Parallel Programming with MPI by Peter Pacheco
Using MPI: Portable Parallel Programming with the Message-Passing Interface by William Gropp, Ewing Lusk, and Anthony Skjellum
Work should be distributed as evenly as possible
Load imbalance - If one process is working for longer than the others, then the others will have to wait
Too much communication - Network is the bottleneck while computation waits
Resource contention - Two or more processes are asking for the same resource
Overheads - There is a computational cost for starting or stopping threads
Granularity - The size of chunks assigned to each processing element can be either too small or too large (see the sketch after this list)
Changing thread tasks - Changing what a thread should be doing changes its presence in the scheduler, which can trigger context switching
Number of threads - If there are too many threads, too much time is spent waiting between actions and it becomes inefficient
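A minimal sketch of the granularity trade-off using OpenMP's schedule clause (the chunk size of 64 is an assumed, illustrative value):

```cpp
// Smaller chunks balance uneven work better but raise scheduling overhead;
// larger chunks do the opposite.
#include <cmath>
#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += std::sqrt(static_cast<double>(i));  // stand-in for real work
    }
    std::printf("sum = %f\n", sum);
    return 0;
}
```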
Your code is likely not exploiting spatio-temporal locality
Caches are part of the die, and are therefore specifically designed for speed
Non-uniform memory access (NUMA) - Some memory is quicker to reach than other memory; typically anything not shared has a dedicated bus
L1 - Level 1 cache is small but very fast; the smaller the cache, the quicker the access. Specific to each core.
L2 - Level 2 cache is larger than L1, but slightly slower. Specific to each core.
L3 - Level 3 cache is shared between two or more cores, and is significantly larger than L2 cache. It is also therefore slower.
Spatial locality - if memory is accessed, then its neighbours are likely to be accessed next (move data as a block, not as it's needed)
Temporal locality - if a variable is used, it is likely to be needed again soon (keep it in the cache); see the loop-order sketch below
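A minimal sketch of spatial locality: with C++'s row-major array layout, the inner loop below walks consecutive addresses, while swapping the loops would stride through memory:

```cpp
// Row-major layout: a[i*N + j] and a[i*N + j + 1] are adjacent in memory,
// so the inner loop over j reads consecutive addresses (cache-friendly).
#include <cstdio>
#include <vector>

int main() {
    const int N = 1024;
    std::vector<double> a(N * N, 1.0);
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += a[i * N + j];
    // Swapping the loops (i inner) strides N doubles per access: much slower.
    std::printf("sum = %f\n", sum);
    return 0;
}
```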
```cpp
// Timing skeleton (e.g., for benchmarking a bitonic sort)
#include <iostream>
#include <ctime>
using namespace std;

int main() {
    clock_t begin = clock();
    // some code
    clock_t end = clock();
    // clock() counts CPU time, not wall-clock time
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    cout << time_spent << endl;
    return 0;
}
```
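Note that clock() measures CPU time summed across threads, not elapsed wall-clock time, so for parallel code omp_get_wtime is usually the better choice; a minimal sketch:

```cpp
// Wall-clock timing with OpenMP (compile with -fopenmp on GCC/Clang)
#include <cstdio>
#include <omp.h>

int main() {
    double begin = omp_get_wtime();
    // some code (possibly parallel)
    double end = omp_get_wtime();
    std::printf("elapsed: %f s\n", end - begin);  // wall-clock seconds
    return 0;
}
```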
Single-precision floating-point - Represents floating-point numbers in computers using a total of 32 bits. Can also be called FP32. In C this is float (checked below).
Double-precision floating-point - Represents floating-point numbers using 64 bits. Can be called FP64. In C this is double.
Flop - Abbreviation for floating-point operation. A common unit of measurement of processor speed is flop/s. You usually have to specify single or double precision.
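A quick size check in C++ (the byte counts hold on mainstream platforms):

```cpp
// 32 bits = 4 bytes for float, 64 bits = 8 bytes for double
#include <cstdio>

int main() {
    std::printf("float:  %zu bytes\n", sizeof(float));
    std::printf("double: %zu bytes\n", sizeof(double));
    return 0;
}
```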
R_peak is a metric for the upper-bound performance of a system:
R_peak = 2 * w_vec * r_clock * n_core * n_sock
Add/multiply - For most processors, both are done in 1 cycle. Hence the 2.
w_vec - Vector width (how many FP32 values fit in a vector register)
r_clock - Clock speed (the @2.00 GHz bit)
n_core - Cores per socket
n_sock - Number of sockets per node
FP32 width = bit width / 32
FP64 width = bit width / 64
R_peak units: flop/s, usually reported as Gflop/s. A worked example follows.
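A worked example under assumed specs: a 2-socket node, 8 cores per socket, 2.00 GHz, 256-bit vector units (so 256 / 32 = 8 FP32 lanes):

```cpp
// All specs below are assumed, for illustration only.
#include <cstdio>

int main() {
    const double ops_per_cycle = 2.0;    // add + multiply each cycle
    const double w_vec   = 256.0 / 32.0; // 256-bit vectors -> 8 FP32 lanes
    const double r_clock = 2.00e9;       // 2.00 GHz in Hz
    const double n_core  = 8.0;          // cores per socket
    const double n_sock  = 2.0;          // sockets per node
    const double r_peak = ops_per_cycle * w_vec * r_clock * n_core * n_sock;
    std::printf("R_peak = %.0f Gflop/s\n", r_peak / 1e9);  // prints 512
    return 0;
}
```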
Decomposition - How do you break a problem down into parts that can be executed in parallel?
| Keyword | Explanation |
|---|---|
| Data parallelism | Splitting the data into discrete chunks and parallelising processing |
| Task parallelism | Giving various tasks to different threads etc., provided they are independent (see the sketch after this table) |
| Domain decomposition | Split the problem into equal pieces |
| Pipelining | A factory-like approach where each process is different |
| Mixed solution | A combination of different approaches |
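A minimal task-parallelism sketch using OpenMP sections (the task bodies are hypothetical stand-ins):

```cpp
// Each section is an independent task handed to a different thread.
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel sections
    {
        #pragma omp section
        { std::printf("task A on thread %d\n", omp_get_thread_num()); }
        #pragma omp section
        { std::printf("task B on thread %d\n", omp_get_thread_num()); }
    }
    return 0;
}
```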
| Factor | Explanation |
|---|---|
| t_1 | Time on 1 core |
| t_p | Time on p cores |
| s_p | Speedup: s_p = t_1 / t_p |
Strong scaling - Keep the problem size fixed, vary the number of cores and nodes. Ideally there should be a linear reduction in time.
Weak scaling - Vary the problem size, vary the number of cores and nodes proportionally with the problem size. Ideally there should be a flat line.
t_p = α * t_1 + ((1 - α) * t_1) / p
s_p = 1 / (α + ((1 - α) / p))
where α is the serial (non-parallelisable) fraction. As the number of processors approaches ∞, the speedup tends to 1 / α.
A program takes 180 seconds to run on a single core, and 100 seconds on 2 cores. What is the (theoretical) maximum speedup?
s_p = t_1 / t_p = 180 / 100 = 1.8 and s_p = 1 / (α + ((1 - α) / p))
Setting 1.8 = 1 / (α + (1 - α) / 2) gives α + (1 - α) / 2 = 1 / 1.8 = 5/9, so α = 1/9
Maximum speedup = 1 / α = 9x
Gustafson's law (weak scaling): s_p = α + p(1 - α) = 0.25 + 50 * (1 - 0.25) = 0.25 + (50 * 0.75) = 37.75
Amdahl's law (strong scaling), same α and p: s_p = 1 / (α + ((1 - α) / p)) = 1 / (0.25 + (0.75 / 50)) = 3.77
Gustafson again with α = 0.1, p = 10: s_p = α + p(1 - α) = 0.1 + 10(1 - 0.1) = 9.1
Both models are computed in the sketch below.
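A minimal sketch putting the two speedup models side by side (values match the worked examples above):

```cpp
#include <cstdio>

// Amdahl (strong scaling): fixed problem size, alpha = serial fraction
double amdahl(double alpha, double p) { return 1.0 / (alpha + (1.0 - alpha) / p); }
// Gustafson (weak scaling): problem size grows with p
double gustafson(double alpha, double p) { return alpha + p * (1.0 - alpha); }

int main() {
    std::printf("Amdahl    (alpha=0.25, p=50): %.2f\n", amdahl(0.25, 50));    // 3.77
    std::printf("Gustafson (alpha=0.25, p=50): %.2f\n", gustafson(0.25, 50)); // 37.75
    return 0;
}
```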
Deadlock - When two or more tasks wait for each other, e.g. people meet in a corridor and cannot pass, so they wait for each other to move.
Livelock - When the state that resolves one deadlock causes another deadlock, e.g. people meet in the corridor and cannot pass, so they both move to the same side of the corridor and still cannot pass, then both swap to the other side.
Race condition - Occurs when a program's behaviour changes depending on the sequence or timing of events outside the program's control, e.g. thread scheduling (a minimal demonstration follows).
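A minimal race-condition sketch using two C++ threads (compile with -pthread on GCC/Clang); the unsynchronised counter makes the outcome timing-dependent:

```cpp
#include <cstdio>
#include <thread>

int counter = 0;  // shared, unprotected

void bump() {
    for (int i = 0; i < 100000; ++i) ++counter;  // unsynchronised read-modify-write
}

int main() {
    std::thread t1(bump), t2(bump);
    t1.join();
    t2.join();
    // The result depends on how the scheduler interleaves the two threads,
    // so it often comes out below the expected value.
    std::printf("counter = %d (expected 200000)\n", counter);
    return 0;
}
```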