Which is faster: private or local memory?

Hi OpenCL community,

I would like your opinion on which of the following two approaches is faster.

Context: image filtering (convolution).

load from constant global variable (image vector) into private memory
process data in private memory
write result to global memory

or

load from constant global variable into local memory
barrier to synchronize local memory across the work-group
process data from local memory
write result to global memory

I know that loading from global memory should be much slower, and that in the first option I am loading the same data over and over in every work-item, but the processing is done in private memory, which is much faster. On the other hand, I don’t know whether waiting at the barrier will hurt my performance, and I also don’t know the rough ratio between the read/write speeds of global and local memory.

I would appreciate it if anyone could answer.

Thanks

LC

The second option is what I call “manual caching”. When the accelerator has no cache, the second option runs faster. When the GPU has enough cache, it won’t really matter. On the CPU, the second version ran slower when I last tested it, but that was a while ago, so I’m not sure it still holds.

In most cases I have found other reasons for using local memory to be more important.
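
To make the two access patterns concrete, here is a rough sketch in plain C that emulates them on the host for a 1D filter. It is not OpenCL kernel code and says nothing about timing; it only shows the indexing. The names `convolve_direct` and `convolve_tiled`, the constants `RADIUS` and `GROUP`, and the clamped border handling are all assumptions made for illustration. In a real kernel the `tile` array would be `__local` memory and the commented spot is where `barrier(CLK_LOCAL_MEM_FENCE)` would go.

```c
#include <stddef.h>

#define RADIUS 1                    /* filter half-width (assumed) */
#define TAPS   (2 * RADIUS + 1)     /* number of filter coefficients */
#define GROUP  4                    /* emulated work-group size (assumed) */
#define TILE   (GROUP + 2 * RADIUS) /* work-group tile plus halo on both sides */

/* Clamp an index to [0, n-1], i.e. replicate the edge samples (assumed policy). */
static size_t clamp_index(long i, size_t n) {
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* Option 1: every "work-item" reads its whole neighbourhood from global memory. */
void convolve_direct(const float *in, float *out, size_t n, const float *filter) {
    for (size_t gid = 0; gid < n; ++gid) {
        float acc = 0.0f; /* private accumulator */
        for (int k = 0; k < TAPS; ++k)
            acc += filter[k] * in[clamp_index((long)gid + k - RADIUS, n)];
        out[gid] = acc;
    }
}

/* Option 2: each "work-group" stages its tile plus halo once, then reads from it. */
void convolve_tiled(const float *in, float *out, size_t n, const float *filter) {
    float tile[TILE]; /* stands in for __local memory */
    for (size_t base = 0; base < n; base += GROUP) {
        /* Load phase: tile[t] holds global element base + t - RADIUS. */
        for (size_t t = 0; t < (size_t)TILE; ++t)
            tile[t] = in[clamp_index((long)base + (long)t - RADIUS, n)];
        /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here. */
        for (size_t lid = 0; lid < (size_t)GROUP && base + lid < n; ++lid) {
            float acc = 0.0f;
            for (int k = 0; k < TAPS; ++k)
                acc += filter[k] * tile[lid + k];
            out[base + lid] = acc;
        }
    }
}
```

With these numbers, option 1 issues `GROUP * TAPS` = 12 global reads per work-group, while option 2 issues `TILE` = 6 and serves the rest from the tile. Whether that saving shows up as speed depends on the hardware cache, as said above.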

Thanks VincentH

LC