Only read specific chunks of a buffer

Hello everyone!
I’m currently trying to improve the “culling” algorithm of an existing program (called Celestia).

Right now I’m trying to increase the performance of the code I have so far.
The thing that really keeps bugging me is that I can’t seem to find a way to read only a specific part of a memory object. This would accelerate my algorithm by a great amount.
Is it possible to read only specific parts of a buffer?
If I try to call enqueueReadBuffer() with the “size” argument smaller than the buffer it is reading from, I get a CL_INVALID_VALUE error.
(By the way: I use the C++ bindings for OpenCL.)
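Roughly, this is the call that fails for me (simplified; ResultStruct and the other names are stand-ins for my real code):

```cpp
// resultBuffer was created at start-up with the full size,
// i.e. sizeof(ResultStruct) * numberOfStars.
std::vector<ResultStruct> results(visibleCount);
cl_int err = queue.enqueueReadBuffer(
    resultBuffer,
    CL_TRUE,                              // blocking read
    0,                                    // offset in bytes
    sizeof(ResultStruct) * visibleCount,  // smaller than the buffer
    &results[0]);
// err comes back as CL_INVALID_VALUE
```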

Here is an explanation of what I’m trying to do:
I have a list of stars, and one of my kernels checks their attributes against the render settings and the viewer position to figure out whether they are visible or not.
This kernel also calculates a few important variables for each visible star that are needed for later rendering.

Here’s the problem: star catalogs can get pretty huge (easily in the range of several 100,000 to a few million entries), so reading the buffer object with the results (about 32 bytes per star) takes a massive amount of time. Because I don’t know in advance how many stars will be visible, I have to allocate memory to store results for all stars, as a kind of worst-case scenario. Afterwards I call a new kernel, which puts all non-zero results (i.e. the visible stars) at the beginning of a new list. Again, for the worst case, that list needs to be able to hold results for every star. Therefore the buffer object is initialized at program start-up with a size of resultStructure_size * numberOfStars.
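Roughly, that second kernel does something like this (simplified sketch; the struct layout and names are placeholders, and it assumes OpenCL 1.1’s atomic_inc):

```c
// Placeholder 32-byte result record; the real layout differs.
typedef struct { float4 posAndSize; float4 colorAndMag; } ResultStruct;

// One work-item per star: visible entries are appended to the
// compacted list through an atomically incremented counter
// (so the output order is not preserved).
__kernel void compactVisible(__global const ResultStruct* results,
                             __global const int* visibleFlags,
                             __global ResultStruct* compacted,
                             __global int* visibleCount,
                             const int numStars)
{
    int i = get_global_id(0);
    if (i >= numStars)
        return;

    if (visibleFlags[i] != 0)  // non-zero == visible
    {
        int slot = atomic_inc(visibleCount);
        compacted[slot] = results[i];
    }
}
```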

After the first kernel is done, I also read another list which tells me which stars are visible and thus how many stars need to be rendered. So in theory I would only need the first X entries of the buffer object containing the results, but OpenCL won’t let me :frowning:

The way I see it, I would have to allocate a new buffer object every frame, which I dislike. I would much rather allocate too much memory at the start of the program than have to create a perfectly fitting buffer object (including passing it to the device) every frame :confused:

I can suggest a few things to try…

Instead of enqueueReadBuffer, give clEnqueueMapBuffer a go. This maps a buffer into the host memory space, and your host processor then only needs to read the bytes you actually need. This may still involve copying data from the OpenCL device’s memory space into host-accessible memory, however, so you will want to do as much work as you can on the device before sending any of it back to the host. For a non-CPU device, you may want to use the buffer allocation flags (CL_MEM_ALLOC_HOST_PTR, for instance) to place the final output buffer in host memory, and then have your final kernel only write to this output buffer.
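With the C++ bindings, mapping just the region you need might look like this (untested sketch; the names are made up, and it assumes resultBuffer was created with CL_MEM_ALLOC_HOST_PTR so the map can avoid a copy on some implementations):

```cpp
// Map only the first visibleCount entries instead of reading back
// the whole buffer.
void* mapped = queue.enqueueMapBuffer(
    resultBuffer,
    CL_TRUE,                               // blocking map
    CL_MAP_READ,
    0,                                     // offset in bytes
    sizeof(ResultStruct) * visibleCount);  // only the region you need

ResultStruct* stars = static_cast<ResultStruct*>(mapped);
// ... use stars[0] .. stars[visibleCount - 1] on the host ...

queue.enqueueUnmapMemObject(resultBuffer, mapped);
```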

If your device is on the other side of a PCIe bus from your host CPU, you want to minimize how much data you move back and forth across it. So put as much of your code into kernels as possible and avoid having the host involved at multiple points.

Furthermore, writing is often much faster than reading because the device doesn’t have to wait for the operation to complete before proceeding (particularly across something like a PCIe bus). So if you can, keep all of your data in device memory until the last buffer, place that last buffer in host memory, and have the device only write into it (and only the bytes it needs, instead of the whole buffer). This output buffer should also be written contiguously; don’t skip bytes or try to fill in particular fields of an output struct, and try to write in order if possible. For devices with local memory, you might try filling in a local block and then transferring it with the async local-to-global copy built-in functions (these can be a bit tricky to get right, though).
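The local-to-global copy looks roughly like this (sketch only; the per-star work is a placeholder):

```c
#define BLOCK 64  // assumed work-group size

__kernel void writeResults(__global const float* input,
                           __global float* output)
{
    __local float block[BLOCK];
    int lid = get_local_id(0);

    // Stand-in for the real per-star computation.
    block[lid] = input[get_global_id(0)] * 2.0f;

    // The whole local block must be filled before the copy starts.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Every work-item in the group issues the copy with identical
    // arguments, as the built-in requires; the block streams out
    // contiguously to global memory.
    event_t e = async_work_group_copy(&output[get_group_id(0) * BLOCK],
                                      block, BLOCK, 0);
    wait_group_events(1, &e);
}
```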

Also, since you’re running successive ‘frames’ of a simulation, you may also consider double-buffering: allocate two instances of each shared buffer you need, and alternate between them on successive frames. This lets your host and device processors operate in parallel more effectively by avoiding dependencies on a single buffer.
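In outline (untested sketch; cullKernel, ResultStruct, bufBytes, render() and so on are placeholders for your own code):

```cpp
// Two device buffers and two host arrays, alternated every frame, so
// the device can fill one set while the host still works on the other.
cl::Buffer devBuf[2] = {
    cl::Buffer(context, CL_MEM_WRITE_ONLY, bufBytes),
    cl::Buffer(context, CL_MEM_WRITE_ONLY, bufBytes)
};
ResultStruct* hostBuf[2] = { new ResultStruct[numStars],
                             new ResultStruct[numStars] };
cl::Event readDone[2];

for (int frame = 0; running; ++frame)
{
    int cur  = frame % 2;
    int prev = 1 - cur;

    cullKernel.setArg(0, devBuf[cur]);
    queue.enqueueNDRangeKernel(cullKernel, cl::NullRange,
                               cl::NDRange(numStars), cl::NullRange);

    // Non-blocking read of this frame's results.
    queue.enqueueReadBuffer(devBuf[cur], CL_FALSE, 0, bufBytes,
                            hostBuf[cur], NULL, &readDone[cur]);

    if (frame > 0)
    {
        readDone[prev].wait();   // usually already finished
        render(hostBuf[prev]);   // the host consumes old results while
    }                            // the device is busy with the new frame
}
```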

Thank you very much for that reply!
Gave me a bunch of good ideas :smiley:

Let’s see what I can put into practice.

Thanks again!

This answer is unrelated to OpenCL; Andrew already did a great job.

Here’s the problem: star catalogs can get pretty huge (easily in the range of several 100,000 to a few million entries), so reading the buffer object with the results (about 32 bytes per star) takes a massive amount of time. Because I don’t know in advance how many stars will be visible, I have to allocate memory to store results for all stars, as a kind of worst-case scenario.

Maybe I misunderstood, but this sounds like you have the star catalog in an unordered array, scan through all of it, and determine which stars intersect the current view frustum, which takes O(n) time. Have you considered using a data structure to accelerate these range searches? Something like a k-d tree, for example.
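To sketch the idea (simplified to an axis-aligned box query over 3-d points, not Celestia’s actual structures):

```cpp
#include <vector>

struct KdNode {
    float   pos[3];   // the star stored at this node (also the split point)
    int     axis;     // which coordinate (0, 1 or 2) this node splits on
    KdNode* left;     // subtree with pos[axis] <  this node's pos[axis]
    KdNode* right;    // subtree with pos[axis] >= this node's pos[axis]
};

// Collect all stars inside the box [lo, hi]. Subtrees that cannot
// overlap the box are skipped entirely, which is where the win over
// a full linear scan comes from.
void rangeSearch(const KdNode* n, const float lo[3], const float hi[3],
                 std::vector<const KdNode*>& out)
{
    if (!n) return;

    bool inside = true;
    for (int k = 0; k < 3; ++k)
        if (n->pos[k] < lo[k] || n->pos[k] > hi[k]) inside = false;
    if (inside) out.push_back(n);

    if (lo[n->axis] <  n->pos[n->axis]) rangeSearch(n->left,  lo, hi, out);
    if (hi[n->axis] >= n->pos[n->axis]) rangeSearch(n->right, lo, hi, out);
}
```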

Hi!
Sorry for the late reply, I was away for a few days.

The program’s standard culling algorithm uses a tree and loops through it recursively. I’m not sure how I can traverse the tree iteratively (since I can’t use recursive functions in OpenCL).

Well, I can think of one solution, but I fear it’s plain awful…
I would use an array as large as the number of nodes and store the visibility information in it. Each work-item would then check whether its direct ancestor is visible or not…
I don’t know. Sounds to me like I could get a bunch of bank conflicts…?

I’ll go and experiment a bit.

If you have something else in mind, I’d really like to hear it.
Or if you know about some good papers/articles, please post links :slight_smile:

I’m not sure how I can traverse the tree iteratively

The standard way of removing recursion is the same in any language: create a stack (the data structure, not the place in memory) of work to do, and wherever you would call recursively, instead push the arguments onto the work stack. This is one of the first results I got on Google.

It’s usually quite trivial to do.
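Applied to a culling kernel, it might look something like this (sketch; the node layout and the visibility test are placeholders):

```c
#define MAX_DEPTH 64  // plenty for a balanced tree over millions of stars

typedef struct {
    float4 center;   // bounding-sphere center of this node
    float  radius;   // bounding-sphere radius
    int    left;     // child indices into the node array, -1 if none
    int    right;
} Node;

// Placeholder test; the real one would check against the frustum planes.
int nodeVisible(Node n, float4 viewerPos)
{
    return distance(n.center, viewerPos) < n.radius + 1000.0f;
}

// One work-item walks the whole tree here, just to show the pattern:
// an explicit stack of node indices replaces the recursive calls.
__kernel void cullTree(__global const Node* nodes,
                       __global int* visible,
                       const float4 viewerPos)
{
    int stack[MAX_DEPTH];
    int top = 0;
    stack[top++] = 0;                 // push the root

    while (top > 0)
    {
        int n = stack[--top];         // pop

        if (!nodeVisible(nodes[n], viewerPos))
            continue;                 // whole subtree culled

        visible[n] = 1;
        if (nodes[n].left  >= 0) stack[top++] = nodes[n].left;
        if (nodes[n].right >= 0) stack[top++] = nodes[n].right;
    }
}
```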

Another possibility that I’ve encountered when doing culling is to eliminate the culling tree and just brute-force cull all objects in a linear list. How well this works depends on your data and hardware, but my experience on a couple of different pieces of hardware was that the win from doing it fully vectorized by far outweighed the benefit of having a culling tree. It’s also possible to switch from a tree to a simpler structure, like a two-level one where the top level quickly identifies batches of objects to be processed, and the batches are large enough to keep your parallel hardware fully utilized (hundreds or thousands of entries).
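The brute-force pass is about as simple as kernels get (sketch; one work-item per star, using the standard point-versus-frustum-plane test):

```c
// Each plane is packed as a float4: xyz = normal, w = distance term.
__kernel void bruteForceCull(__global const float4* starPos,
                             __constant float4* frustumPlanes,  // 6 planes
                             __global int* visible,
                             const int numStars)
{
    int i = get_global_id(0);
    if (i >= numStars)
        return;

    int inside = 1;
    for (int p = 0; p < 6; ++p)
    {
        float4 plane = frustumPlanes[p];
        if (dot(plane.xyz, starPos[i].xyz) + plane.w < 0.0f)
        {
            inside = 0;   // outside at least one plane
            break;
        }
    }
    visible[i] = inside;
}
```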