Is this caused by memory contention?

I suspect that I am running into memory contention problems in the setup below. Do you agree? If so, is there anything I can do about it?

I send two large arrays to the GPU (in the form of read-only buffers), and each kernel instance computes an output value by performing a large number of lookups in a sub-area of each input array. I have run the program on an 8-core CPU and on a 240-core GPU, but the CPU is still marginally faster than the GPU. However, if I run an experiment in which I still provide the two large arrays as input but replace the array-lookup code with some purely local computation (no lookups in the arrays), the GPU is much faster than the CPU, as it should be.
So, doesn't this look like a memory contention problem, given that the only difference (as far as I can see) is the numerous array lookups? If so, can I deal with the contention in some way?

The arrays are transferred like this:
bs1_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(bs1).astype(numpy.int32))
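
For context, here is a stripped-down, self-contained sketch of the setup (the kernel body, names, sizes, and access pattern are placeholders for illustration, not my real code):

import numpy
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# Two large read-only inputs (sizes are placeholders).
bs1 = numpy.random.randint(0, 1000, size=1 << 20).astype(numpy.int32)
bs2 = numpy.random.randint(0, 1000, size=1 << 20).astype(numpy.int32)

bs1_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=bs1)
bs2_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=bs2)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=bs1.nbytes)

# Each work-item does many scattered reads in a sub-area of both inputs.
prg = cl.Program(ctx, """
__kernel void lookup(__global const int *bs1,
                     __global const int *bs2,
                     __global int *out)
{
    int gid = get_global_id(0);
    int acc = 0;
    for (int i = 0; i < 256; i++) {
        int idx = (gid + i * 37) & ((1 << 20) - 1);
        acc += bs1[idx] + bs2[idx];
    }
    out[gid] = acc;
}
""").build()

prg.lookup(queue, bs1.shape, None, bs1_buf, bs2_buf, out_buf)

out = numpy.empty_like(bs1)
cl.enqueue_copy(queue, out, out_buf)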

Where is the lookup table located? It should be in __local or __constant memory: use __local if the work-items fill the table themselves inside the kernel, otherwise use __constant (more space, and it can be passed in by the caller).
Lookup tables are bandwidth-limited (each lookup is a memory access, so throughput is bounded by memory bandwidth rather than compute).

OK, very interesting! How do I specify that the arrays should be located in “constant” memory? Or in “local” or “private” memory?

If you use

__kernel void func(__constant int* cInts, __local int* lInts)

the parameters are placed in constant and local memory, respectively. For the __local parameter the host passes only a size in bytes (clSetKernelArg with a NULL pointer in C, or cl.LocalMemory in PyOpenCL); that size defines how much local memory the kernel gets, and no data is copied for it.
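
From PyOpenCL that might look like the following (a minimal sketch; the kernel, names, and the local-memory size are made up for illustration):

import numpy
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

table = numpy.arange(256, dtype=numpy.int32)
table_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=table)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=table.nbytes)

prg = cl.Program(ctx, """
__kernel void func(__constant int *cInts,   // lives in constant memory
                   __local int *lInts,      // scratch space in local memory
                   __global int *out)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // Stage this work-group's slice of the table into local memory,
    // one element per work-item.
    lInts[lid] = cInts[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = lInts[(lid + 1) % get_local_size(0)];
}
""").build()

# The __local argument is passed as a size only: cl.LocalMemory(nbytes).
local_scratch = cl.LocalMemory(64 * numpy.dtype(numpy.int32).itemsize)
prg.func(queue, table.shape, (64,), table_buf, local_scratch, out_buf)

Note that a __constant buffer is created on the host exactly like any other read-only buffer; only the kernel-side qualifier changes. __private is the default address space for ordinary variables declared inside the kernel, so you get it without any qualifier.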