Vector-vector addition

Hi,

I have a basic question…not sure whether this is the right place to post it.
Why might the following be true?

The forum isn’t allowing me to keep the URL/image…please add http:// in front…

i41.tinypic.com/24dkne0.png

I think it’s related to memory…but I’m not sure how to explain it in detail.

Thanks.
Jennifer

In this simple case, memory bandwidth is most probably the limiting factor.

Let’s say, for example, you have memory with a peak bandwidth of 12 GB/s and an Intel i7 processor clocked at 2.40 GHz.

For each addition, the memory has to transfer 3 (single-precision) float values (2 reads and 1 write), that is, 12 bytes.
Memory thus limits the rate of additions to 12 GB/s / 12 bytes = 1 Goperation/s.

On the CPU side, let’s suppose that the for loop has been unrolled so that we can neglect index incrementing and conditional branching of the for loop.
The only instructions for an add operation are two loads from memory to registers, one addition and a store from register to memory. Supposing that memory accesses are done to and from the L1 cache and neglecting latency, a full operation needs:

  • 1st load: 1 cycle for the instruction + 1 cycle for memory throughput
  • 2nd load: idem
  • floating-point addition: 2 cycles
  • store: 1 cycle for the instruction + 1 cycle for memory throughput

So a full add operation takes 8 cycles, and the maximum rate is 2.40 GHz / 8 cycles = 0.3 Goperation/s.

(Of course, this is a very coarse estimation but it is enough to highlight the origin of the limitation)

So if four threads are computing concurrently, they will attempt 1.2 Goperations/s, which already exceeds the number of operations the memory can sustain (1 Gop/s). The load/store instructions will stall, waiting for data to come from (or go to) memory.

[QUOTE=utnapishtim;29588]In this simple case, memory bandwidth is most probably the limiting factor. […] The load/store instructions will stall, waiting for data from/to memory.[/QUOTE]

Thanks…this is very helpful…
As I am a beginner in this kind of computing, I was thinking of the following…please advise…

“This might be related to caching…in the beginning the processor doesn’t have any information about the computation…then, after a few computations, the previous results are stored in the cache, and the cache is reused for future/parallel calculations, which decreases the time needed to retrieve the output…
Use of cache memory for recalculation and parallel computation decreases the time the processor takes to recalculate if the output is already available. So, once it goes beyond 10, the time didn’t continue to decrease.”

Not sure whether both of us are coming to the same conclusion…

Thanks…for your reply!!!

Modern CPUs have prefetchers that can detect sequential accesses like these and efficiently fill the cache before the computations take place.
The problem here is that the computations are so simple and fast that the memory cannot fill the cache fast enough to sustain the rate at which FP computations are done.

Cache can improve access latency but it won’t improve RAM throughput.

[QUOTE=utnapishtim;29598]Modern CPUs have prefetchers that can detect sequential accesses like here and efficiently fill the cache before computations take place. […] Cache can improve access latency but it won’t improve RAM throughput.[/QUOTE]

Thanks…you are extremely helpful and I appreciate it…