Hi OpenCL community,

I'd like your opinion on which of the following two approaches is faster.

Context: image filtering (convolution).

First approach, private memory only:
1. load from the constant global buffer (the image vector) into private memory
2. process the data in private memory
3. write the result to global memory
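
To make the first approach concrete, here is a rough, untested sketch of what such a kernel might look like. The kernel name `conv_private`, the `RADIUS` and `FSIZE` macros, the row-major float layout, the clamp-to-edge border handling and the filter living in `__constant` memory are all assumptions for illustration, not the actual code.

```c
// Sketch only: assumes a row-major float image, a square
// (2*RADIUS+1) x (2*RADIUS+1) filter and clamp-to-edge borders.
// All names here are hypothetical.
#define RADIUS 1
#define FSIZE (2 * RADIUS + 1)

__kernel void conv_private(__global const float *src,
                           __constant float *filt,
                           __global float *dst,
                           const int width,
                           const int height)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    if (x >= width || y >= height) return;

    // Step 1: each work item copies its own neighbourhood from global
    // memory into a private array. Neighbouring work items re-read
    // overlapping pixels.
    float win[FSIZE * FSIZE];
    for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            const int sx = clamp(x + dx, 0, width - 1);
            const int sy = clamp(y + dy, 0, height - 1);
            win[(dy + RADIUS) * FSIZE + (dx + RADIUS)] = src[sy * width + sx];
        }
    }

    // Step 2: process entirely in private memory.
    // The filter is applied without flipping; flip it on the host
    // if a strict convolution is needed.
    float acc = 0.0f;
    for (int i = 0; i < FSIZE * FSIZE; ++i)
        acc += win[i] * filt[i];

    // Step 3: write the result to global memory.
    dst[y * width + x] = acc;
}
```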


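Second approach, staging through local memory:
1. load from the constant global buffer into local memory
2. barrier, so the whole work group waits until local memory is filled
3. process the data from local memory
4. write the result to global memory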
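
The local-memory approach could be sketched roughly as below. This is an untested illustration: the kernel name `conv_local`, the macros `RADIUS`, `FSIZE`, `TILE` and `LDIM`, the fixed 16x16 work-group size, a zero global offset, the row-major float layout and the clamp-to-edge borders are all assumptions, not the actual code.

```c
// Sketch only: assumes a TILE x TILE work group, zero global offset,
// a row-major float image and clamp-to-edge borders.
// All names here are hypothetical.
#define RADIUS 1
#define FSIZE (2 * RADIUS + 1)
#define TILE 16                     // assumed work-group edge length
#define LDIM (TILE + 2 * RADIUS)    // tile edge including the halo

__kernel __attribute__((reqd_work_group_size(TILE, TILE, 1)))
void conv_local(__global const float *src,
                __constant float *filt,
                __global float *dst,
                const int width,
                const int height)
{
    __local float tile[LDIM * LDIM];

    const int lx = get_local_id(0);
    const int ly = get_local_id(1);
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    // Image coordinates of the tile's top-left corner, halo included.
    const int x0 = get_group_id(0) * TILE - RADIUS;
    const int y0 = get_group_id(1) * TILE - RADIUS;

    // Step 1: the work group cooperatively loads the tile plus halo.
    // Each global pixel is read once per work group, not once per work item.
    for (int i = ly * TILE + lx; i < LDIM * LDIM; i += TILE * TILE) {
        const int sx = clamp(x0 + i % LDIM, 0, width - 1);
        const int sy = clamp(y0 + i / LDIM, 0, height - 1);
        tile[i] = src[sy * width + sx];
    }

    // Step 2: every work item in the group must reach the barrier,
    // so the bounds check comes after it, not before.
    barrier(CLK_LOCAL_MEM_FENCE);

    if (x >= width || y >= height) return;

    // Step 3: process from local memory (same filter indexing as a
    // direct, unflipped weighted sum over the neighbourhood).
    float acc = 0.0f;
    for (int dy = 0; dy < FSIZE; ++dy)
        for (int dx = 0; dx < FSIZE; ++dx)
            acc += tile[(ly + dy) * LDIM + (lx + dx)] * filt[dy * FSIZE + dx];

    // Step 4: write the result to global memory.
    dst[y * width + x] = acc;
}
```

In this sketch the host would have to enqueue the kernel with a local size of 16x16 and a global size rounded up to a multiple of 16 in each dimension.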

I know that in the first approach loading from global memory should be much slower, and that every work item loads the same data over and over, but the processing itself is done in private memory, which is much faster. On the other hand, I don't know how much waiting at the barrier affects performance in the second approach, and I don't know the rough ratio between the read/write speeds of global and local memory.

I would appreciate any answers.