I have a large 1-D array of size >= 2^32.

I am using OpenCL on the GPU and/or CPU to test a system that applies a simple function to each element of this enormous array (for simplicity, let's assume the function is XOR with 0xFF).
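
To make the per-element operation concrete, here is a minimal plain-C sketch of what each work-item would compute. It assumes 8-bit elements, which is only an assumption since the element type is not fixed above, and the names `transform` and `apply_transform` are purely illustrative. The loop index stands in for the global work-item ID. This is a host-side reference, not the actual kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-element transform: XOR with 0xFF.
   Assumption: elements are 8-bit values. */
static uint8_t transform(uint8_t x) {
    return (uint8_t)(x ^ 0xFF);
}

/* Host-side reference: applies the transform to every element in place.
   Each loop iteration corresponds to one work-item with index gid. */
void apply_transform(uint8_t *data, size_t n) {
    for (size_t gid = 0; gid < n; ++gid) {
        data[gid] = transform(data[gid]);
    }
}
```

A more complex transform would only change the body of `transform`; the indexing stays the same.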

I have a device whose maximum workgroup size is 1024. (I am writing code that can also use a CPU OpenCL implementation if one is available, so the GPU vs. CPU choice is immaterial.)

Currently I divide the array into blocks of 1024 * 4 (2^12) elements and then make 2^20 kernel calls, one per block.

This obviously is not efficient.
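
The launch count follows from simple arithmetic, sketched below in plain C. The helper names `launches_needed` and `launch_offset` are hypothetical and exist only for illustration. 64-bit integers are used because 2^32 does not fit in a 32-bit counter.

```c
#include <assert.h>
#include <stdint.h>

/* Number of kernel launches needed to cover total_elems elements when
   each launch handles block_elems elements. Rounds up so that a final
   partial block still gets its own launch. */
uint64_t launches_needed(uint64_t total_elems, uint64_t block_elems) {
    assert(block_elems > 0);
    return total_elems / block_elems + (total_elems % block_elems != 0);
}

/* Element offset at which launch number launch_idx begins. */
uint64_t launch_offset(uint64_t launch_idx, uint64_t block_elems) {
    return launch_idx * block_elems;
}
```

With 2^32 elements and blocks of 2^12, `launches_needed(1ULL << 32, 1ULL << 12)` gives 2^20, which matches the number of kernel calls described above.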

I have also tried using a workgroup size of 2^32, but that is very slow too and makes the system run hot. I have tried some other combinations as well, but I am looking for a more general method that still works when the transformation function is not a simple XOR but a complex one involving multiple arithmetic operations.

How can I solve this problem with fewer kernel calls, streaming the array to the compute device without having to wait for each kernel to complete?

Assume that my kernel is just running a custom transform function on each element of the 1-D array.