Help With Optimizing My First Kernel

I have written a kernel that uses Newton's law of gravitation to calculate new velocities for the bodies in a simple physics simulator of mine. It works well, and I have added some linear interpolation to smooth the transitions between gravity update passes. However, because I am very new to OpenCL, I am afraid I am making some large mistake that results in lower performance. I have never worked with C, and I am in fact writing this project in Java using LWJGL. I would really appreciate it if anyone could look at my kernel and give me some quick tips on how to improve it.
On startup, I fill the positions array with random points and run the init kernel. After that, I call the gravity kernel twice per second. Because I know when the gravity kernel was last run and when it will be run next, I can calculate a linear interpolation value to pass to the lerp kernel, which is called every frame. Right now everything works as intended; I am just looking for some help on how to make the kernel run faster.
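The interpolation value described above can be sketched on the host side. This is an illustrative plain-C sketch, not code from the post; the function name and parameters are my own, and the same arithmetic would live in the Java host code:

```c
/* Hypothetical sketch: given the time of the last gravity pass and the
 * fixed interval between passes, compute the interpolation factor that
 * the lerp kernel receives each frame. Clamped to [0, 1] so the points
 * never overshoot the most recent positions. */
static float lerp_factor(double now, double last_update, double interval)
{
    double t = (now - last_update) / interval; /* 0.0 right after an update, 1.0 just before the next */
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    return (float)t;
}
```

With updates every 0.5 s, a frame rendered 0.25 s after the last gravity pass would get a factor of 0.5, i.e. halfway between prePos and postPos.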

My Kernel:

//This is the "main" kernel. It calculates the new velocities for each object

//prePos  - The coordinate that was calculated last time.
//It is used with postPos to linearly interpolate between those points to create the illusion of smoothness
//postPos - The most up to date positions
//velocities - The list of velocities for all the objects
//colors - Each object's color
//size - The total number of objects
//add - A value to use to determine how long the resulting velocity vector should be
//mass - The mass of each object (they all have the same mass for now)

kernel void gravity(global float4* prePos, global float4* postPos, global float4* velocities, global float4* colors, const int size, const float add, const float mass) {
	const int itemId = get_global_id(0); //Get this thread's ID
	if(itemId < size) { //If we are within the objects that we want to update
		float4 pos = postPos[itemId]; //Retrieve the position
		float4 vel = velocities[itemId]; //Retrieve the velocity
				
		float4 otherPos, deltaPos;
		float gravity, dist;
		for(int i = 0; i < size; i++) { //Loop through every other object that exists...
			if(i != itemId) { // If the other object isn't this object...
				otherPos = postPos[i]; //Get the other object's position
				deltaPos = (float4) (otherPos.x - pos.x, otherPos.y - pos.y, otherPos.z - pos.z, 0.0f);//Calculate the delta between this thread's object and the other object
				dist = sqrt(pow(deltaPos.x, 2.0f) + pow(deltaPos.y, 2.0f) + pow(deltaPos.z, 2.0f)); //Get the distance between them
				gravity = mass / (dist * dist); //Newton's gravitational formula
				
				deltaPos /= dist; //Normalise the vector
				
				deltaPos *= gravity; //Factor in gravity
				
				vel += deltaPos; //Add the newly generated vector that represents the gravity between this thread's object and the other object to this thread's velocity
			}
		}
		prePos[itemId] = pos; //Set the previous position to the one without the new calculation
		postPos[itemId] = pos + vel * add; // Set the new position to the old one plus the velocity
		velocities[itemId] = vel; //Set the velocity
	}
}

//This kernel calculates the values for the positions array using prePos, postPos, and a linear interpolation value. 
//OpenGL renders the positions array as points, so the points in the positions array can't jump around; that would be very noticeable and would not look nice or realistic
kernel void lerp(global float4* positions, global float4* prePos, global float4* postPos, const float value) {
	const int itemId = get_global_id(0);
	positions[itemId] = prePos[itemId] + value * (postPos[itemId] - prePos[itemId]);
	
}
//This kernel sets up the values for prePos and postPos so that calling the gravity kernel will work as intended
kernel void init(global float4* positions, global float4* prePos, global float4* postPos) {
	const int itemId = get_global_id(0);
	float4 pos = positions[itemId];
	prePos[itemId] = pos;
	postPos[itemId] = pos;
}

Step one: profile. AMD has CodeXL, Intel has its own tooling. In the case of NVIDIA, well, you’re out of luck. You need to determine whether you’re compute bound or bandwidth bound. If compute bound, use native_sqrt and native_pow. Otherwise, you may store your position data as half4 (it’d still be better to use floats for computation, but using halves would also help in the compute-bound scenario if you have a Polaris or Pascal GPU).

These are pretty much all the micro-optimizations I can think of. The next step would be to add some sort of space partitioning. Split your space into N cubes, assign each body a number based on the cube it falls into, and sort the array using those numbers as keys. This would allow you, for each body, to account only for items in the same cube or in adjacent cubes. The others are just too far away and barely change the picture.
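The cube-index idea can be sketched in plain C. This is an illustrative sketch with names of my own choosing, not code from the thread: map a position to a cell key on an N x N x N grid, and bodies sorted by this key end up grouped with their spatial neighbours.

```c
/* Hypothetical sketch: map a position to a cell index on an n x n x n grid
 * spanning [min, max) on each axis. Sorting the bodies by this key groups
 * spatial neighbours together, so each body only needs to check its own
 * cell and the adjacent ones. */
static int cell_key(float x, float y, float z,
                    float min, float max, int n)
{
    float inv = (float)n / (max - min);   /* cells per unit of distance */
    int cx = (int)((x - min) * inv);
    int cy = (int)((y - min) * inv);
    int cz = (int)((z - min) * inv);
    /* clamp, in case a body sits exactly on the upper bound */
    if (cx >= n) cx = n - 1;
    if (cy >= n) cy = n - 1;
    if (cz >= n) cz = n - 1;
    return (cx * n + cy) * n + cz;        /* flatten to a single sort key */
}
```

A radix or bitonic sort on these keys (both map well to GPUs) would then give you the grouped layout the reply describes.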

[QUOTE=Salabar;41876]Step one: profile. AMD has CodeXL, Intel has its own tooling. In the case of NVIDIA, well, you’re out of luck. You need to determine whether you’re compute bound or bandwidth bound. If compute bound, use native_sqrt and native_pow. Otherwise, you may store your position data as half4 (it’d still be better to use floats for computation, but using halves would also help in the compute-bound scenario if you have a Polaris or Pascal GPU).

These are pretty much all the micro-optimizations I can think of. The next step would be to add some sort of space partitioning. Split your space into N cubes, assign each body a number based on the cube it falls into, and sort the array using those numbers as keys. This would allow you, for each body, to account only for items in the same cube or in adjacent cubes. The others are just too far away and barely change the picture.[/QUOTE]

Technically, you can still profile OpenCL on NVIDIA.

James Price from the University of Bristol outlined a way to profile OpenCL using NVVP.

Otherwise, Nsight from NVIDIA can still profile OpenCL, with the restriction of requiring the use of an IDE (Visual Studio or Eclipse).

I would provide the links to both of these but unfortunately, as a new user I cannot.

I thought they were going to deprecate a tool used in that hack. Does it still work? Regardless, they seem to be returning to OpenCL now, so we can finally hope for official full support.

I haven’t tried James Price’s method, but AFAIK it should still work.

Nsight still currently works and is fully supported. I currently have it running in Visual Studio 2013.

dist = sqrt(pow(deltaPos.x, 2.0f) + pow(deltaPos.y, 2.0f) + pow(deltaPos.z, 2.0f)); //Get the distance between them

here pow is a very general and slow function. To square something, just multiply it by itself: deltaPos.z * deltaPos.z is a lot faster.

gravity = mass / (dist * dist); //Newton's gravitational formula

here you are taking the square of a square root. You could have used the expression inside the sqrt directly instead.

These two adjustments should give you at least 2x performance.

Also, you are scanning through all objects via the cache. If your GPU’s local memory beats the cache for this access pattern, you could load 64 objects into local memory, compute gravity between them, then load another 64 objects while keeping the first group, compute the interactions between the two groups, and continue 64 by 64 until all objects have been scanned. This increases the re-use of data staged in local memory and should give more performance.
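The blocked traversal described above can be sketched in plain C. This is only an illustration of the loop structure (the tile size and function are my own): in the real OpenCL kernel each tile would be staged into __local memory cooperatively by the work-group, with barriers around the load.

```c
#define TILE 64

/* Plain-C sketch of the tile-by-tile traversal. The inner loop stands in
 * for the per-pair gravity work; here it just accumulates values so the
 * access pattern is visible. */
static float sum_in_tiles(const float *values, int size)
{
    float total = 0.0f;
    for (int base = 0; base < size; base += TILE) {
        int end = base + TILE < size ? base + TILE : size;
        /* In OpenCL: copy values[base..end) into __local storage here,
         * then barrier(CLK_LOCAL_MEM_FENCE) before reading it. */
        for (int i = base; i < end; i++)
            total += values[i];
        /* barrier again before the next tile overwrites the local buffer */
    }
    return total;
}
```

Each tile is fetched from global memory once per work-group instead of once per work-item, which is where the bandwidth saving comes from.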