Moving data from __global to __private

Since we cannot use memcpy in OpenCL, I am wondering if there
is a similar function available that can be used to copy chunks of
data from __global to __private (or to __local) inside a kernel.

For example, say I wish to copy 10 elements from global memory to
__private memory (per thread). I do not wish to write a loop like:


for (int i=0; i<n_elements.....
... 

How is this generally achieved in OpenCL?

The purpose is to get a list of data into each thread. I am making a raytracer
where I need to grab a list of surface data contained within each grid cell
(or tree node if I use that).

You have to do the copy for each work-item in the kernel code. There is a provided function, async_work_group_copy, which copies from global to local memory using all the work-items in a work-group, but there is no provided function for copying to private memory.

Your example is roughly what you need to do. At the start of the kernel you copy in the data you need from global memory. Remember that if you are copying to memory that will be shared across the work-group (local), you need to insert a barrier after the copy, e.g. barrier(CLK_LOCAL_MEM_FENCE), to ensure that all work-items have finished writing before any of them tries to read it.
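A minimal sketch of the per-work-item copy described above, assuming each work-item needs CELL_SIZE consecutive floats. The names load_cell, sum_cells, CELL_SIZE and fake_gid are hypothetical, and the #ifndef shim is only an assumption that lets the same file compile as plain C for host-side testing; an OpenCL compiler defines __OPENCL_VERSION__ and skips it.

```c
/* Host-side shim (assumption, for testing only). Skipped by OpenCL compilers. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid = 0;   /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

#define CELL_SIZE 10   /* hypothetical: floats needed per work-item */

/* Copy CELL_SIZE floats from global memory into a private array. */
void load_cell(__global const float *src, float *dst)
{
    for (int i = 0; i < CELL_SIZE; i++)
        dst[i] = src[i];
}

__kernel void sum_cells(__global const float *cells, __global float *out)
{
    int gid = (int)get_global_id(0);
    float cell[CELL_SIZE];   /* private by default, no qualifier needed */

    load_cell(cells + gid * CELL_SIZE, cell);
    /* If the destination were a __local array shared by the work-group,
       a barrier(CLK_LOCAL_MEM_FENCE) would go here, before any reads. */

    float sum = 0.0f;
    for (int i = 0; i < CELL_SIZE; i++)
        sum += cell[i];
    out[gid] = sum;
}
```

The copy only pays off if the private copy is reused several times; for a single pass over the data, reading straight from the global pointer does the same work.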

I was afraid of that…

Will the included async copy to local memory be significantly faster than
loop-copying it to private? It will be difficult to implement because:

I am writing Monte Carlo forward raytracing software,
where each work-item is one independent ray traced through
a geometry. The geometry is currently split into a grid (I may use
kd-trees later, depending on what happens), and each time a ray/photon
enters a new grid cell it must check whether there are surfaces inside this
cell, or it must intersect one of the bounding planes.

Copying each element float by float takes a lot of time
(I assume this is due to global memory access times).
I think I was able to reduce the time spent by grabbing
them as float4s from image object memory, but I am not certain.

I assume image memory objects are the same as CUDA textures,
which this guy here recommends?
http://bouliiii.blogspot.com/2008/08/re … a-100.html

The speed of the async copy global to local will depend entirely on the implementation (e.g., Cell could use a DMA engine, but I don’t think there are DMA engines for this on most GPUs) so I doubt it will be a win over a for-loop for you.

You should take a look at the vload functions, which can do optimized vector loads of data. This will get you the best performance if your data is aligned to a vector size.
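As a rough sketch of how vload4 could be used here, assume each grid entry is stored as 8 consecutive floats (two float4s). GridEntry, load_entry and that layout are hypothetical; vload4(offset, p) itself is a standard OpenCL C built-in that reads four floats starting at p + 4 * offset. The #ifndef shim is an assumption for host-side testing only and is skipped by an OpenCL compiler.

```c
/* Host-side shim (assumption, for testing only). Skipped by OpenCL compilers. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __global
typedef struct { float x, y, z, w; } float4;
static float4 vload4(size_t offset, const float *p)
{
    float4 v = { p[4 * offset], p[4 * offset + 1],
                 p[4 * offset + 2], p[4 * offset + 3] };
    return v;
}
#endif

/* Hypothetical layout: one grid entry = two float4s = 8 floats. */
typedef struct { float4 a; float4 b; } GridEntry;

/* Load grid entry number `index` with two vector loads. */
GridEntry load_entry(__global const float *grid, int index)
{
    GridEntry e;
    e.a = vload4(2 * index,     grid);   /* floats 8*index   .. 8*index+3 */
    e.b = vload4(2 * index + 1, grid);   /* floats 8*index+4 .. 8*index+7 */
    return e;
}
```

Inside a kernel this would be called as, for example, GridEntry e = load_entry(nodes, cell_index); where cell_index is whatever cell the ray is currently in.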

It sounds like there is no way to get around the copying of each float from what you are saying. Copying them into private/local memory is only a win if you have reuse. Note that private memory on most GPUs today is just registers, so there’s no real benefit over just keeping them around in your kernel as variables. Copying to local has a win because multiple work-items can share them so you can get more reuse.

Image memory objects are textures, so on GPUs that have texture caches (e.g., all of them) you will get the benefits of caching, which can be far faster than buffer accesses if you have good spatial locality.

The other big issue is memory access coalescing. I know that on Nvidia GPUs this can make an order-of-magnitude difference in your memory bandwidth. Devices before the GTX 280 could only coalesce accesses from a work-group that were sequential (e.g., each work-item accesses the next item). The GTX 280 is more flexible, so it should do better. However, if each work-item is doing its own random accesses, you will get very little coalescing, so you will never be able to get close to the maximum bandwidth. This may just be a problem with mapping the algorithm to the hardware. Pre-loading data into local memory can help if you can predict what data you are likely to use and it fits.

Unfortunately, I do not see any way to predict what data a work-item will
require, since it is a ray with a random direction and also a spread in origin.

From what I can see, this randomness means there is no way to manually
pre-cache data into local memory…

So I will try to compare the approach of using vload-functions vs
loading from textures/image objects to see if I can find a speed
advantage in either method.

One more thing: I am using a GTX260 to do most of the work,
while the final program will run on a system with multiple Tesla 1060 cards.
(Funnily enough, so far my program has run at identical speeds
on the Tesla and the GTX260, with even a slight advantage to the GTX260.)

Results

I let my kernel run some test code and timed the execution.
These results are on my GTX260. I don't have the Tesla cards
here at home (they're at my workplace), but I think I ran a similar
test on them.

On the GTX260 I had inside my kernel:

	__private float aX = 0;
	__private int TR = 100;

	// One by One access: four scalar reads from global memory
	for (int i=0; i<TR; i++) {
		__private float4 R = {	nodes[0],
								nodes[1],
								nodes[2],
								nodes[3]	};
		// Vector components are read with .x/.y/.z/.w; R[0] is not portable OpenCL C
		aX += R.x + R.y + R.z + R.w;
	}

	// Vector access: one vload4 from global memory
	for (int i=0; i<TR; i++) {
		__private float4 R = vload4(0, nodes);
		aX += R.x + R.y + R.z + R.w;
	}

	// Image access: one texel read through the image (texture) path
	for (int i=0; i<TR; i++) {
		__private int2 coord = { 0, 0 };
		__private float4 R = read_imagef( src_image, samplerA, coord );
		aX += R.x + R.y + R.z + R.w;
	}

	// Assign some arbitrary data to test read back
	for (int i=0; i < MAX_HITS; i++) {
		energies[thread_id*MAX_HITS + i] = aX;
	}

Reference: the kernel without any of the tests took about 3-5 ms to execute.

Test 1 (one-by-one access): about 95-100 ms for a global work size of 1,000,000

Test 2 (vload4): about 120-125 ms for a global work size of 1,000,000

Test 3 (image objects): about 20-25 ms for a global work size of 1,000,000

So I guess it was a good idea to stick with image objects then :)

Image objects are cached, so any spatial locality will get you a big boost in performance. Note that variables are by default private so you don’t have to put __private (or just “private”) in any of those cases.

I’d also be a bit careful here. It looks like you are reading the same data everywhere. That will mean that the first image read will load the cache and every single read thereafter will hit. The other ones are not cached so they will have to do the read each time. This appears to cover the case where you have extremely good locality. If your real code is accessing all over the place you will see far less benefit from the image access.
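One way to make the timing less flattering to the cache is to have every work-item read a different element on every iteration. The sketch below does this for the vload4 variant; bench_vload_spread, n_vec4, fake_gid and the multipliers 7 and 13 are hypothetical choices made for illustration, not something from the original test. The #ifndef shim is an assumption for host-side testing only and is skipped by an OpenCL compiler.

```c
/* Host-side shim (assumption, for testing only). Skipped by OpenCL compilers. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
typedef struct { float x, y, z, w; } float4;
static float4 vload4(size_t offset, const float *p)
{
    float4 v = { p[4 * offset], p[4 * offset + 1],
                 p[4 * offset + 2], p[4 * offset + 3] };
    return v;
}
static size_t fake_gid = 0;   /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

#define TR 100   /* reads per work-item, as in the test above */

/* n_vec4 (hypothetical parameter): number of float4 elements in nodes. */
__kernel void bench_vload_spread(__global const float *nodes,
                                 __global float *energies,
                                 int n_vec4)
{
    int gid = (int)get_global_id(0);
    float aX = 0.0f;

    for (int i = 0; i < TR; i++) {
        /* Spread the reads: the element depends on both the work-item
           and the iteration, so neighbours do not all hit one address. */
        int idx = (gid * 7 + i * 13) % n_vec4;
        float4 R = vload4(idx, nodes);
        aX += R.x + R.y + R.z + R.w;
    }
    energies[gid] = aX;
}
```

The same index could be turned into an int2 coordinate for the read_imagef test, so that both paths see a comparable access pattern.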

Thanks. That is good to know.
The important part is just that it won't be significantly slower than reading it element by element.

At least the grid, which is only about 20x10x20, should be cached to some degree, as
it is quite sparse. (Each entry in the grid holds 2 float4s.)

Right now I just let every work item read the same data to test performance.
How much data do you think could be cached?