Find max element in a huge array in kernel

Hi guys. Can anyone give an example of code that finds the maximum element of an array in OpenCL?

struct Item
{
    cl_float3 color = {0.0f, 0.0f, 0.0f};
    // more data below
};

clCreateBuffer(context, CL_MEM_READ_WRITE, over100millions * sizeof(Item), nullptr, &error);

__kernel void findMaxColor(__global Item* in_items, __global Item* out_maxElement)
Now I want to find the max element on the GPU, using a criterion like this:

float3 maxElement = (float3)(0.0f, 0.0f, 0.0f);
if(it.x > maxElement.x || it.y > maxElement.y || it.z > maxElement.z)
    maxElement = it;
and so on.

So, how do I compute maxElement and write the value to out_maxElement?
P.S.: If possible, the code should be very fast. Can it use local memory?
Any help is appreciated.

For this kind of data reduction task, the typical plan is this:

1. Compute the max of each workgroup, moving the inputs into local memory at the beginning and using a binary tree for maximal concurrency.
2. Put the max of each workgroup into another array in global memory.
3. Recursively invoke the same kernel on the output until you only have one element left.
4. This element is your global maximum.
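The steps above can be sketched on the CPU to make the data flow concrete. This is only a C++ model of the scheme, not an OpenCL kernel: the inner `lid` loop stands in for the work-items of one workgroup, the `local` vector stands in for local memory, and each call to `reducePass` stands in for one kernel launch. The names `score`, `reducePass` and `findMax`, the simplified `Item`, and the group size of 256 are all hypothetical. The ordering (largest color component wins) is also an assumption: the any-component test from the question can rank two items each above the other, for example (1,0,0) and (0,1,0), so the result of a parallel reduction with it would depend on evaluation order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's Item; only the color is modeled.
struct Item {
    float x, y, z;
};

// Assumed ordering: rank an item by its largest color component.
// Any criterion works as long as it is a consistent ordering.
inline float score(const Item& it) {
    return std::max(it.x, std::max(it.y, it.z));
}

// Models one kernel launch: every group of `groupSize` inputs yields one output.
// `groupSize` must be a power of two and at least 2; `in` must be non-empty.
std::vector<Item> reducePass(const std::vector<Item>& in, std::size_t groupSize) {
    const std::size_t groups = (in.size() + groupSize - 1) / groupSize;
    std::vector<Item> out(groups);
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t base = g * groupSize;
        const std::size_t count = std::min(groupSize, in.size() - base);
        // "Local memory": copy the group's slice, padding a short tail
        // with a repeat of a valid element so it cannot change the max.
        std::vector<Item> local(groupSize, in[base]);
        std::copy(in.begin() + base, in.begin() + base + count, local.begin());
        // Binary tree: the active range halves at every step.
        for (std::size_t stride = groupSize / 2; stride > 0; stride /= 2) {
            // On a GPU these iterations are separate work-items running in parallel.
            for (std::size_t lid = 0; lid < stride; ++lid) {
                if (score(local[lid + stride]) > score(local[lid])) {
                    local[lid] = local[lid + stride];
                }
            }
            // A kernel would need barrier(CLK_LOCAL_MEM_FENCE) here.
        }
        out[g] = local[0];  // one result per workgroup, written to "global memory"
    }
    return out;
}

// Repeats the pass on its own output until a single element is left.
// `items` must be non-empty.
Item findMax(std::vector<Item> items, std::size_t groupSize = 256) {
    while (items.size() > 1) {
        items = reducePass(items, groupSize);
    }
    return items[0];
}
```

With 100 million inputs and groups of 256, this takes four passes (roughly 390,625, then 1,526, then 6, then 1 element), so the later launches are tiny compared with the first.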
Here’s a good resource on how to optimize the performance of a different kind of reduction, summing instead of computing the max: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf. It’s NVIDIA- and CUDA-based, but the general concepts map quite well to equivalent OpenCL constructs.

In general, reduction is memory-bound: each element is read once and only compared, so your performance target should be to saturate the memory bandwidth of your GPU.
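As a rough illustration of what that target means, here is a lower bound on the runtime of the first pass. The bandwidth of 200 GB/s is purely an assumed figure, not a measurement of any particular card. N is taken as the roughly 100 million items from the question, and s = 16 bytes is the size of a single cl_float3; the real Item holds more data, so if the kernel has to stream whole Items, s and the bound grow accordingly.

```latex
\[
t_{\min} \approx \frac{N \cdot s}{B}
         = \frac{10^{8} \times 16\ \text{B}}{200 \times 10^{9}\ \text{B/s}}
         = 8\ \text{ms}
\]
```

A kernel that runs much slower than this kind of estimate for your own hardware still has room for optimization.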

Thanks, I will try something like this ;)