Emulating vector insert/delete in a kernel - is this safe?

Say I have an array with elements of some type which I pass to the kernel as a constant source buffer. And I also have a destination buffer.

Based on some condition on the value of the elements (which is not known in advance), I want to do one of the following:
i) copy from source to destination,
ii) compute several elements from one source and save to destination or
iii) disregard source element.

Now, I’ve come up with a way that seems to work:


__kernel void myKernel(__global const struct MyStruct *src, __global const int *n_src,
                       __global struct MyStruct *dst, __global int *ind)
{
    int ids = get_global_id(0);
    
    if(ids < *n_src)
    {
        if(/* determine if src[ids] should be divided */)
        {
            int m = (*ind)++;   // claim two slots in dst by bumping the shared counter
            int n = (*ind)++;
            
            // compute and assign values to dst[m] and dst[n]
        }
        else if(/* determine if src[ids] element should be preserved */)
        {
            int m = (*ind)++;
            dst[m] = src[ids];
        }
    }
}

But I wonder if this is safe? Assuming the destination buffer is at least twice as large as the source, might I still run the risk of writing to the same destination element several times?

Also, will the operations on *ind be a serious bottleneck? Is there any way to avoid this?

No, you have no guarantee that it will always work: the increments of *ind are plain read-modify-write operations, so two work items can read the same counter value and end up writing to the same destination element.
The best way I could think of to get the same result is using two kernels.
Notice the temp and temp2 buffers.
You have to run kernel2 after kernel1.

kernel1 determines how much output each item in src will produce.
kernel2 computes the index in dst at which each work item will start writing,
then populates the destination buffer.


__kernel void myKernel1(__global const struct MyStruct *src, __global const int *n_src,
                        __global struct MyStruct *dst, __global int *temp)
{
    int ids = get_global_id(0);

    /* Each work item inspects one element of src, decides how many elements
       it will emit into the destination buffer, and records that count in temp. */

    if(ids < *n_src)
    {
        if(/* determine if src[ids] should be divided */)
        {
            temp[ids] = 2;
        }
        else if(/* determine if src[ids] element should be preserved */)
        {
            temp[ids] = 1;
        }
        else
        {
            temp[ids] = 0;
        }
    }
}

__kernel void myKernel2(__global const struct MyStruct *src, __global const int *n_src,
                        __global struct MyStruct *dst, __global int *temp, __global int *temp2)
{
    int ids = get_global_id(0);

    /* Non-parallelizable part: an exclusive prefix sum over temp.
       temp2[i] ends up holding the index in dst where work item i starts writing.
       NOTE: OpenCL has no barrier across work-groups, so the reads below are only
       guaranteed to see these writes if this scan runs in its own kernel
       (or on the host) before the write phase. */

    if(ids == 0)
    {
        int index = 0;
        for(int i = 0; i < *n_src; i++)
        {
            temp2[i] = index;
            index += temp[i];
        }
    }
    /* End of non-parallelizable part */

    if(ids < *n_src)
    {
        if(temp[ids] == 2)
        {
            int m = temp2[ids];
            int n = temp2[ids] + 1;

            // compute and assign values to dst[m] and dst[n]
        }
        else if(temp[ids] == 1)
        {
            int m = temp2[ids];
            dst[m] = src[ids];
        }
    }
}

Ok, thanks for the reply. And I suppose the non-parallelizable part might as well be done on the CPU between the two kernel executions.

Yes, but running the non-parallelizable part on the CPU requires copying temp and temp2 between host and device, which may well be slower than running it directly on the GPU.