Data Parallelism but with Unknown Iteration Count

Is it possible to use an NDRange to parallelize a strlen() function for a dynamically allocated character array? This unknown size doesn’t seem to fit with the buffer model in OpenCL. What happens if I try to create a buffer longer than host_ptr? How would it know, and when? What if I use CL_MEM_USE_HOST_PTR? What happens if inside the kernel it accesses beyond the bounds of host_ptr? Has anyone tried this before in OpenCL?

Let’s see if I got that right.

You want to use OpenCL to determine the length of a null-terminated string in parallel? (Or a logically equivalent computation)

And in addition to that you want to prevent OpenCL from ever accessing any of the bytes that come after the null character?

If you had a reasonable upper bound on the length of the string you could use a prefix sum to find out the length of the string in a data parallel fashion. However, a prefix sum will access all data within the range given by that upper bound.

Hi David,

You got it spot on. Thank you for expressing it much more succinctly. Here is the kernel I had in mind, assuming the included atomic function is supported.

// Initially set n = UINT_MAX, the worst case scenerio if strlen is null terminated
__kernel void strlen(__global const char * restrict str, __global uint *n)
{
size_t global_id = get_global_id(0);

mem_fence(CL_MEM_FENCE_WRITE);
if (*(str + global_id) == ‘?0‘ && global_id < *n)
    atom_min(n, global_id);

}

The issue that I‘m thinking about is if this is to go into a library, e.g., cl_string.h, we may not know or be given the true upperbound to str. So there may be a buffer overflow in the sense that we may try to read past the last allocated element in str.

So my main question I reckon is can we create a buffer larger than the host_str using CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, and in that case what happens if we reference past the end of host_str in the buffer?

I‘m sorry, this line should read
write_mem_fence(CLK_GLOBAL_MEM_FENCE);

The issue that I‘m thinking about is if this is to go into a library, e.g., cl_string.h, we may not know or be given the true upperbound to str. So there may be a buffer overflow in the sense that we may try to read past the last allocated element in str.

So my main question I reckon is can we create a buffer larger than the host_str using CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, and in that case what happens if we reference past the end of host_str in the buffer?

I’ll explain it to the best of my knowledge (which is not much). When you call clCreateBuffer() with such arguments, the function will take the range of CPU virtual addresses given by hostptr and the length of the buffer (“size”) and try to pin it. By pinning I really mean two things: the actual pinning, which prevents the OS from swapping the pages out to disk, but also mapping that host virtual address range onto the GPU’s virtual address space.

Let’s say that with clCreateBuffer() we are trying to map 10MB worth of memory. If that range of host virtual address space is mapped to physical memory on the device, the call should succeed doing the stuff I described above and the kernel you showed will not page fault.

However, if the virtual address range given by hostptr+size is not fully mapped to physical memory, at the time clCreateBuffer() is called (or later if the OpenCL implementation is deferring memory allocations) then the call will fail… more or less cleanly depending on how carefully written is that particular OpenCL implementation.

In summary: don’t do that :slight_smile: