François Trahay
1965 - 2005
⟹ Increased processor performance
Since 2005
At each stage, several circuits are used
→ One instruction is executed at each cycle
⟹ several instructions executed simultaneously!
Limitations of superscalar execution:
There must be no dependency between instructions executed simultaneously.
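The effect of dependencies can be seen at the C level. In this minimal sketch (function names are hypothetical), the first sum forms a single dependency chain: each addition needs the previous result. The second uses two independent accumulators that a superscalar core may process in the same cycle. The compiler may still reorder or vectorize either loop, so this illustrates the idea rather than guaranteeing a speedup.

```c
#include <stddef.h>

/* One dependency chain: every "acc +=" needs the previous value of acc,
   so consecutive additions cannot execute simultaneously. */
long sum_one_chain(const long *t, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += t[i];
    return acc;
}

/* Two independent chains: acc0 and acc1 do not depend on each other,
   so a superscalar core may execute both additions in the same cycle. */
long sum_two_chains(const long *t, size_t n) {
    long acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += t[i];
        acc1 += t[i + 1];
    }
    if (i < n)              /* odd length: one element left */
        acc0 += t[i];
    return acc0 + acc1;
}
```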
Example of non-parallelizable instructions
cmp a, 7 ; a > 7 ?
ble L1
mov c, b ; b = c
br L2
L1: mov d, b ; b = d
L2: ...
⟹ waste of time: until the comparison is resolved, the processor does not know which instruction comes after ble
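The same choice written in C, as a sketch with hypothetical function names. The first version mirrors the cmp / ble / mov sequence. The second computes both candidates and selects one with a bit mask, so there is no jump to wait on. Compilers often produce a similar branch-free sequence (e.g. a conditional move) on their own, but that is not guaranteed.

```c
/* Branching version: mirrors cmp / ble / mov above.
   The next instruction depends on the outcome of the comparison. */
int select_branch(int a, int c, int d) {
    int b;
    if (a > 7)
        b = c;
    else
        b = d;
    return b;
}

/* Branch-free version: the comparison result builds a mask
   (all ones if a > 7, all zeros otherwise) that selects c or d. */
int select_mask(int a, int c, int d) {
    int mask = -(a > 7);
    return (c & mask) | (d & ~mask);
}
```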
0x12 loop:
...
0x50 inc eax
0x54 cmpl eax, 10000
0x5A jl loop
0x5C end_loop:
...
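One common way to avoid waiting on the jl at 0x5A is dynamic branch prediction. This sketch assumes a 2-bit saturating counter per branch, a classic scheme that the slides do not name. The predictor, its initial state and the function names are all hypothetical. Replaying the loop's branch outcomes shows that the predictor misses only on the first iteration and on the loop exit.

```c
#include <stdbool.h>

/* Hypothetical 2-bit saturating counter:
   states 0,1 predict "not taken", states 2,3 predict "taken". */
typedef struct { unsigned state; } predictor;

static bool predict(const predictor *p) { return p->state >= 2; }

static void update(predictor *p, bool taken) {
    if (taken && p->state < 3) p->state++;
    else if (!taken && p->state > 0) p->state--;
}

/* Replays the outcomes of "jl loop" for the loop above, with eax
   starting at 0 and the bound n instead of 10000, and counts mispredictions. */
int count_mispredictions(int n) {
    predictor p = { 1 };            /* assumed start: "weakly not taken" */
    int misses = 0;
    int eax = 0;
    bool taken;
    do {
        eax++;                      /* inc eax */
        taken = eax < n;            /* cmpl eax, n ; jl loop */
        if (predict(&p) != taken)
            misses++;
        update(&p, taken);
    } while (taken);
    return misses;
}
```

For n = 10000 the branch is executed 10000 times and mispredicted twice.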
Example: image processing, scientific computing
Using vector instructions (MMX, SSE, AVX, …)
Instructions specific to a processor type
Process the same operation on multiple data at once
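A minimal sketch of a vector addition with SSE intrinsics, assuming an x86 compiler that defines __SSE__ (GCC and Clang do on x86-64). Each _mm_add_ps adds 4 floats with one instruction. The function name is hypothetical, and a scalar loop handles the remaining elements, or the whole array on other processors.

```c
#include <stddef.h>
#ifdef __SSE__
#include <xmmintrin.h>   /* SSE intrinsics, x86 only */
#endif

/* out[i] = x[i] + y[i] */
void add_arrays(float *out, const float *x, const float *y, size_t n) {
    size_t i = 0;
#ifdef __SSE__
    /* Vector loop: one instruction adds 4 floats (128-bit registers). */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned load of x[i..i+3] */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
#endif
    /* Scalar loop: leftover elements, or everything without SSE. */
    for (; i < n; i++)
        out[i] = x[i] + y[i];
}
```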
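Problem with superscalar / vector processors: a single thread rarely has enough independent instructions to keep all the execution units busy
→ Simultaneous Multi-Threading (SMT, or Hyper-Threading): several threads share the execution units of one core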
Limited scalability of SMT:
the dispatcher is shared
the FPU is shared
→ Duplicate all the circuits (multi-core processors)
→ Non-Uniform Memory Access (NUMA)
→ Mainly used for small caches (e.g. the TLB)
→ Direct access to the cache line
Warning: risk of collision
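Example (address split into tag | line index | offset):
0x12345 | 67 | 8
and
0xbff72 | 67 | 8
→ both addresses map to the same cache line (index 0x67)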
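The collision can be computed from the address bits. This sketch assumes the example splits a 32-bit address into a 20-bit tag, an 8-bit line index (256 lines) and a 4-bit offset (16-byte lines); these sizes and the helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical direct-mapped cache geometry matching the example:
   16-byte lines (4 offset bits), 256 lines (8 index bits), 20-bit tag. */
#define OFFSET_BITS 4
#define INDEX_BITS  8

static uint32_t cache_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}
static uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
static uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

/* Collision: same line index but different tags, so each access
   evicts the other address from the cache. */
int collide(uint32_t a, uint32_t b) {
    return cache_index(a) == cache_index(b) && cache_tag(a) != cache_tag(b);
}
```

With this geometry, 0x12345678 and 0xbff72678 both land on line 0x67 with different tags.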
→ K-way associative cache (in French: Cache associatif K-voies)
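A minimal sketch of a 2-way (K = 2) set-associative lookup, reusing the hypothetical geometry of the collision example (256 sets, 16-byte lines). Each set holds 2 lines and the least recently used one is evicted (LRU is an assumption; the slides do not specify a replacement policy). The two colliding addresses can now stay in the cache together.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical 2-way set-associative cache: 256 sets, 16-byte lines.
   The "1 - w" LRU trick below only works for WAYS == 2. */
#define WAYS 2
#define SETS 256

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    int      lru;              /* index of the least recently used way */
} cache_set;

static cache_set cache[SETS];  /* zero-initialized: all lines invalid */

/* Returns true on a hit; on a miss, loads the line and returns false. */
bool cache_access(uint32_t addr) {
    uint32_t set = (addr >> 4) % SETS;   /* line index selects the set */
    uint32_t tag = addr >> 12;
    cache_set *s = &cache[set];
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = 1 - w;              /* the other way becomes LRU */
            return true;
        }
    }
    int victim = s->lru;                 /* evict the LRU way */
    s->valid[victim] = true;
    s->tag[victim] = tag;
    s->lru = 1 - victim;
    return false;
}
```

Accessing 0x12345678, 0xbff72678, 0x12345678, 0xbff72678 gives two misses then two hits, whereas a direct-mapped cache would miss every time.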
What if 2 threads access the same cache line?
Concurrent read: replication in local caches
Concurrent write: need to invalidate data in other caches
Cache snooping: the writing cache broadcasts a message that invalidates the copies held by the other caches
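A toy model of write-invalidate coherence for a single cache line, as a sketch only. The core count, variable names and the write-through policy are simplifying assumptions. Real protocols (e.g. MESI) track more states and usually write back lazily. Reads replicate the line in each local cache, and a write invalidates every other copy.

```c
#include <stdbool.h>

/* Hypothetical model: NCORES private caches, one shared cache line. */
#define NCORES 4

static bool valid[NCORES];   /* does core i hold a valid copy? */
static int  copy[NCORES];    /* value of the line in core i's cache */
static int  memory_value;    /* value of the line in main memory */

/* Concurrent read: on a miss, the line is replicated in the local cache. */
int cache_read(int core) {
    if (!valid[core]) {
        copy[core] = memory_value;
        valid[core] = true;
    }
    return copy[core];
}

/* Write: the other caches snoop the invalidation and drop their copy. */
void cache_write(int core, int value) {
    for (int i = 0; i < NCORES; i++)
        if (i != core)
            valid[i] = false;
    copy[core] = value;
    valid[core] = true;
    memory_value = value;    /* write-through, to keep the sketch short */
}
```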