Calculating Array Values on GPU

I wanted to pick some of your brains.

I want to create and populate an array (say 20x20, though in actuality it is much larger than this) directly on my GPU.
I have two input values: the start value and the multiplier.

I’m looking for something like this, where each element is the previous value multiplied by the multiplier, i.e. element i of the flattened array is a * b^(i+1):
[a*b, a*b^2, a*b^3, …, a*b^20]
[a*b^21, …

I know that this isn’t a great fit for GPUs, but I’m trying to see whether creating the array on the GPU would be faster than calculating it CPU-side and transferring it over.
The array would then be passed on to the next kernel in the queue and never transferred back to the CPU.

Does anyone know of a faster/more efficient way of creating this array besides just looping through the whole thing?

Any advice or suggestions would be greatly appreciated!

Using binary exponentiation in each work-item should work OK: at most ~30 iterations for a 32-bit integer power.

// Computes a^n with O(log n) multiplications (binary exponentiation).
int binpow (int a, int n) {
	int res = 1;
	while (n) {
		if (n & 1)      // current bit of n is set:
			res *= a;   // fold this power of a into the result
		a *= a;         // square a for the next bit
		n >>= 1;
	}
	return res;
}
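
Then each work-item computes its own element independently, with no serial loop over the array. A quick sketch using binpow above (kernel and argument names are illustrative):

__kernel void fill_powers(__global int* out, int a, int b) {
	int i = get_global_id(0);        // one work-item per element
	out[i] = a * binpow(b, i + 1);   // element i = a * b^(i+1), independent of its neighbours
}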

Do mind that this sequence grows (or shrinks) extremely fast, so many of the values will degrade into either 0 or +/-INF.

By the way, depending on your use case, it can actually be beneficial to store a and b in a CL_MEM_READ_ONLY buffer and compute your a * b^n on the fly. It’s not design advice or anything, just a fun fact you may want to be aware of.

My only comment regarding the binary exponentiation is that I will have to do a lot more than 30 iterations, which may cause issues.
That being said, it may still work in my situation.

To give more background, I have two complex numbers, and they are the “a” & “b” that I used as an example earlier.
The real and imaginary values of a & b should always be in [-1, 1]. (They are the cosines and sines of a radian value, so they have magnitude 1 and their powers stay bounded rather than blowing up.)

That’s why it still may be of use to me, even at larger iteration counts.
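
I assume the same binary-exponentiation idea carries over to the complex case by swapping in a complex multiply. A rough sketch, assuming the values are packed into float2 as (real, imag); cmul and cbinpow are made-up names:

// Complex multiply for float2 packed as (real, imag).
float2 cmul(float2 p, float2 q) {
	return (float2)(p.x * q.x - p.y * q.y, p.x * q.y + p.y * q.x);
}

// Binary exponentiation of a complex number: a^n in O(log n) multiplies.
float2 cbinpow(float2 a, int n) {
	float2 res = (float2)(1.0f, 0.0f);   // complex one
	while (n) {
		if (n & 1)
			res = cmul(res, a);
		a = cmul(a, a);
		n >>= 1;
	}
	return res;
}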

Would you please give more insight into what you meant by your last statement?
Are you suggesting storing a & b in a global buffer? Or moving them to a local one and calculating on the fly?
I kind of get what you’re saying, but I’m not fully sure.

Maybe additional background would help with this as well, haha.
I have a couple of arrays that are fed into a different kernel and combined through mathematical operations.
One set of values from those arrays is then multiplied by a value in the array created from a & b.
This is repeated with a new set of values for each element in the array.

A simple example:
X is a [3x3], Y is a [3x3], and AB is a [3x3], and matching elements are multiplied across.
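
Per element it would be something like this (a rough sketch; names and the exact operations are made up, and cmul is the complex multiply from my sketch above):

__kernel void combine(__global const float2* X,
                      __global const float2* Y,
                      __global const float2* AB,
                      __global float2* out) {
	int i = get_global_id(0);
	out[i] = cmul(cmul(X[i], Y[i]), AB[i]);   // matching elements multiplied across
}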

Which is why I think calculating on the fly might be a little more challenging.

It’s quite simple. GPUs have very high-latency memory and a very small cache shared between a huge number of threads. It often happens that even complicated computations are actually cheaper than simple memory reads. But as they say, premature optimisation is the root of all evil, something something.

Are you suggesting storing a & b in a global buffer? Or moving them to a local one and calculating on the fly?


// "var" is a placeholder for your element type (e.g. float2 for a complex value).
__kernel void a(__global const var* inp) {
	var cached_inp = *inp;   // single read; the value then lives in a register
}

This should put your a and b into a scalar register of a wavefront/warp, so no operations involving them will touch RAM.
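
For your case specifically, it could look something like this (a sketch, not a drop-in: the buffer layout is an assumption, and cmul/cbinpow are from the earlier sketch):

__kernel void use_ab(__global const float2* ab_buf,   // assumed: ab_buf[0] = a, ab_buf[1] = b
                     __global float2* out) {
	float2 a = ab_buf[0];   // read once; kept in registers afterwards
	float2 b = ab_buf[1];
	int i = get_global_id(0);
	out[i] = cmul(a, cbinpow(b, i + 1));   // a * b^(i+1) computed on the fly, never stored
}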

Which is why I think calculating on the fly might be a little more challenging.

Try the simple way first and profile. If you find out you’re memory-bound, my advice should help.

Alright, I’ll give it a shot and will probably get back to you in a week or so.

Thanks for all your help though!