Memory allocation and time collection

Hi,

Which is better: working through all iterations in private memory and copying the result to global memory only at the end of my algorithm, or writing intermediate results to global memory on each iteration?

When you measure time, do you take into account the whole time, including compilation and execution of the kernel, or only the kernel’s run?
Another question: my workload is simple at first, so it runs fast on both the CPU and GPU. Because of that, I put the kernel call and clFinish(queue) inside a loop from 1 to 10000 to collect timings. Is that correct when running my algorithm on the GPU?

Many thanks,

Luiz Drumond.

I am interested in a similar thing. I achieved a significant speedup with the use of constant memory, for a start. I am now looking at splitting my task into data chunks the size of the shared/local memory so as to maximise the work of each work-item: copy from global to shared/local, then do the computation there. (Can you write directly to either of these?)
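Something like this is the pattern I mean — a rough, untested sketch where the kernel name, the tile size, and the placeholder computation are all made up:

```c
// Rough sketch of the global -> local staging pattern (untested).
// TILE is a placeholder and must match the work-group size on the host.
#define TILE 256

__kernel void stage_and_compute(__global const float *in,
                                __global float *out)
{
    __local float tile[TILE];
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // Each work-item copies one element from global into local memory.
    tile[lid] = in[gid];

    // Wait until the whole work-group has finished copying.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Compute using tile[] instead of repeated global reads.
    out[gid] = tile[lid] * 2.0f;   // placeholder computation
}
```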

I’m also going to look at the use of texture maps to see how that affects speed.

So in a similar vein, is it faster to copy into shared/local before running the computation? I guess I will find out soon :stuck_out_tongue:

When you find out, please let us know.

If I discover something, I will post it here.

Many thanks,

Luiz.

Private memory is just the same as global memory if it isn’t in a register. It goes into a register if it fits and, in the case of an array, only if the indexing is fixed (or can be calculated at compile time).

Anyway, this is pretty fundamental computer architecture: yes, where you can, always use the fastest memory, then the next, and so on. That means registers, then shared (local) memory, then private/global memory.

Also, you cannot communicate between work-groups via global memory; that requires another kernel invocation.
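To make the first question concrete, here’s a rough, untested sketch of what I mean — accumulate in a private variable (which should end up in a register) and write to global memory once at the end. The kernel name and the per-iteration work are made up:

```c
__kernel void accumulate(__global const float *in,
                         __global float *out,
                         const int iterations)
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;                 // private scalar, likely a register

    for (int i = 0; i < iterations; ++i)
        acc += in[gid] * (float)i;    // placeholder work; no global writes

    out[gid] = acc;                   // single global write at the end
}
```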

[quote]When you measure time, do you take into account the whole time, including compilation and execution of the kernel, or only the kernel’s run?[/quote]

This is entirely up to you, and it depends on your application. If using a profiler, I only time the kernel run, since that is what it gives you — it is also consistent. If using gettimeofday(), you can obviously time everything.
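For example, kernel-only timing with the event profiling API looks roughly like this (untested fragment; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that queue, kernel, and global_size already exist):

```c
cl_event evt;
cl_ulong t_start, t_end;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

// Timestamps are in nanoseconds.
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);

printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);
clReleaseEvent(evt);
```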

[quote]Another question: my workload is simple at first, so it runs fast on both the CPU and GPU. Because of that, I put the kernel call and clFinish(queue) inside a loop from 1 to 10000 to collect timings. Is that correct when running my algorithm on the GPU?[/quote]

Sorry, your English isn’t very clear here. But yes, if you’re timing on the host you need to call clFinish() before you measure the time; otherwise you only measure the time to enqueue the jobs. Don’t put clFinish() inside the loop unless you’re only interested in that section.
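Roughly what I’d do on the host (untested fragment; assumes queue, kernel, and global_size already exist):

```c
#include <sys/time.h>

struct timeval t0, t1;
const int N = 10000;

// Warm-up launch so one-off start-up costs don't skew the numbers.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clFinish(queue);

gettimeofday(&t0, NULL);
for (int i = 0; i < N; ++i)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clFinish(queue);                      // wait for all N launches to complete
gettimeofday(&t1, NULL);

double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) * 1e-3;
printf("average per launch: %.4f ms\n", ms / N);
```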

OK, so I’m slightly confused now. Private memory is only accessible within a work-item, i.e. you can’t reach it from other work-items? So how does OpenCL know that a variable can be put into private memory? And when you say shared memory, do you mean local memory, or are we getting muddled between CUDA and OpenCL terminology?

In my use case, caching isn’t of much use since I only use each piece of data once.

[quote=“homemade-jam”]

OK, so I’m slightly confused now. Private memory is only accessible within a work-item, i.e. you can’t reach it from other work-items? So how does OpenCL know that a variable can be put into private memory? And when you say shared memory, do you mean local memory, or are we getting muddled between CUDA and OpenCL terminology?

In my use case, caching isn’t of much use since I only use each piece of data once.[/quote]
Ah yeah, sorry, shared == local. Apart from picking the term up from NVIDIA’s ‘ported’ OpenCL docs, it is the only memory shared amongst work-items.

Private memory is private, yes (oddly enough…). If a variable fits, and can be, it goes into a register. The compiler knows a variable can go into private memory because private is the default address-space qualifier for variable declarations inside a kernel; local, global, and constant are the other qualifiers.
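For example (a made-up kernel just to show the qualifiers):

```c
__kernel void qualifiers(__global float *g,    // global: visible to all work-items
                         __local  float *l)    // local: shared within one work-group
{
    // No qualifier on a kernel-body variable means __private by default;
    // a scalar like this will normally live in a register.
    float p = g[get_global_id(0)];

    l[get_local_id(0)] = p;
    barrier(CLK_LOCAL_MEM_FENCE);

    g[get_global_id(0)] = l[get_local_id(0)] + p;
}
```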

Well, you implied you were looping over the data in your original question.

But like I said, it depends on the problem. I’m sure you’ll end up having more than a single problem to solve, though, so you will eventually come across a use for it. Local memory can be used for more than a cache too, for example to re-arrange scattered global memory requests into more efficient lookups, or in particular to communicate partial results amongst work-items (for reductions, as in the sketch below). But if you are only reading data once, and it’s only being read sequentially (i.e. fully coalesced), then local memory just adds overhead.
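For the reduction case, something like this rough, untested sketch (names are made up, and it assumes the work-group size is a power of two):

```c
__kernel void reduce_sum(__global const float *in,
                         __global float *partial,   // one result per work-group
                         __local  float *scratch)   // sized to the work-group
{
    size_t lid  = get_local_id(0);
    size_t size = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work-items each step,
    // communicating partial sums through local memory.
    for (size_t s = size / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One work-item per group writes out the group's partial sum.
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```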