OpenCL and plain C++ implementations of the same algorithm

Hi,

I am writing a software library that has some of its algorithms implemented in two different ways: an OpenCL implementation and a plain C++ implementation. The problem I am facing, among many others, is the following.

Suppose the algorithm I am writing (call it A) takes an RGB image as its input parameter and does the following computations:

  1. RGB -> grayscale -> binary image
  2. component labeling

The second step detects all connected components in the binary image. The plain C++ algorithm returns an array of ConnectedComponent structs:

struct Pixel
{
    int x, y;
};

struct ConnectedComponent
{
    int contour_len;
    Pixel *contour;
};

So, if A finds N connected components, it allocates an array of type ConnectedComponent of length N and, for every element, it allocates memory for the contour pointer.
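To make the plain C++ allocation scheme concrete, here is a minimal sketch of what that two-level allocation could look like (the function names are illustrative, not part of the library):

```cpp
#include <cstddef>

struct Pixel
{
    int x, y;
};

struct ConnectedComponent
{
    int contour_len;
    Pixel *contour;
};

// Hypothetical helper: allocate N components, each with its own
// separately allocated contour buffer, as the plain C++ path would.
ConnectedComponent *alloc_components(int N, const int *contour_lens)
{
    ConnectedComponent *ccs = new ConnectedComponent[N];
    for (int i = 0; i < N; ++i) {
        ccs[i].contour_len = contour_lens[i];
        ccs[i].contour = new Pixel[contour_lens[i]];
    }
    return ccs;
}

// Matching cleanup: every contour must be freed before the array itself.
void free_components(ConnectedComponent *ccs, int N)
{
    for (int i = 0; i < N; ++i)
        delete[] ccs[i].contour;
    delete[] ccs;
}
```

Note that this gives N + 1 separate heap allocations, which is exactly what makes it awkward to mirror on the OpenCL side.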

This is not a good solution for OpenCL: the Pixel* pointer has to be avoided, since a host pointer stored inside a buffer is meaningless to device code. I see two possible solutions here:

Suppose N is the number of connected components and M is the sum of the contour_len
fields. Then one could allocate a single linear buffer like:

ConnectedComponent *ccs = (ConnectedComponent *) new char[N * sizeof(ConnectedComponent) + M * sizeof(Pixel)];

You then need to set up the contour pointers adequately, using each one not as a pointer but as an offset relative to ccs. However, this design is not convenient when you find one connected component at a time and want to append it to the list of components found so far…
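A sketch of that offset-based packed layout might look like the following. Here the struct stores a byte offset instead of a Pixel*, so the whole blob can be copied to or from an OpenCL buffer unchanged (names and layout are just one possible choice):

```cpp
#include <cstddef>

struct Pixel
{
    int x, y;
};

// Packed variant: 'contour_offset' is a byte offset from the start of the
// blob rather than a raw host pointer, so the layout survives a copy to
// device memory. (Hypothetical sketch, not library code.)
struct PackedComponent
{
    int contour_len;
    std::size_t contour_offset;
};

// Allocate one blob holding N headers followed by M pixels, and set up
// the offsets so component i's pixels start right after component i-1's.
char *pack_components(int N, const int *contour_lens)
{
    std::size_t M = 0;
    for (int i = 0; i < N; ++i)
        M += contour_lens[i];

    char *blob = new char[N * sizeof(PackedComponent) + M * sizeof(Pixel)];
    PackedComponent *ccs = reinterpret_cast<PackedComponent *>(blob);

    std::size_t off = N * sizeof(PackedComponent);  // pixels start after headers
    for (int i = 0; i < N; ++i) {
        ccs[i].contour_len = contour_lens[i];
        ccs[i].contour_offset = off;
        off += contour_lens[i] * sizeof(Pixel);
    }
    return blob;
}

// Resolve an offset back to a Pixel* on the host side.
Pixel *contour_of(char *blob, int i)
{
    PackedComponent *ccs = reinterpret_cast<PackedComponent *>(blob);
    return reinterpret_cast<Pixel *>(blob + ccs[i].contour_offset);
}
```

The downside is exactly the one mentioned above: the total M must be known before the blob is allocated, which forces either a counting pre-pass or a grow-and-repack strategy when components are discovered incrementally.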

Alternatively, define the ConnectedComponent structure as follows:

struct ConnectedComponent
{
    IMemoryObject *contour;
};

where the IMemoryObject interface encapsulates a cl_mem object (or a simple alternative
implementation when OpenCL is not used).
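One way to flesh out that interface is a small abstract handle with a host-memory backend; the OpenCL backend would wrap a cl_mem behind the same methods. Everything below is an illustrative guess at what such an interface could look like, not an existing API:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the IMemoryObject idea: an opaque handle over a
// device or host buffer. Method names are illustrative.
struct IMemoryObject
{
    virtual ~IMemoryObject() {}
    virtual std::size_t size() const = 0;
    // Copy to/from host memory; an OpenCL implementation would call
    // clEnqueueReadBuffer / clEnqueueWriteBuffer here instead.
    virtual void read(void *dst, std::size_t bytes) const = 0;
    virtual void write(const void *src, std::size_t bytes) = 0;
};

// Plain C++ backend: just a heap-allocated block.
class HostMemoryObject : public IMemoryObject
{
    char *data_;
    std::size_t size_;
public:
    explicit HostMemoryObject(std::size_t n) : data_(new char[n]), size_(n) {}
    ~HostMemoryObject() { delete[] data_; }
    std::size_t size() const override { return size_; }
    void read(void *dst, std::size_t bytes) const override { std::memcpy(dst, data_, bytes); }
    void write(const void *src, std::size_t bytes) override { std::memcpy(data_, src, bytes); }
};

// The OpenCL backend would implement the same interface around a cl_mem
// (omitted so this sketch builds without the OpenCL headers):
//
// class CLMemoryObject : public IMemoryObject { cl_mem buf_; ... };

struct ConnectedComponent
{
    int contour_len;
    IMemoryObject *contour;   // backend-agnostic contour storage
};
```

With this design, algorithm A can hand back ConnectedComponent values without the caller knowing which backend produced them; the cost is a virtual call and, for small contours, one heap allocation per component.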

How would you address the design in this situation?

And what if I want ConnectedComponent to be a class?
What I am saying is that I want to use algorithm A without caring whether the implementation actually is OpenCL or plain C++. But how should I handle the input/output parameters of algorithm A? One thing to keep in mind is the following:

Suppose you have algorithms A, B, C and you want to execute them in sequence, using the results of A as input parameters of B and the results of B as input parameters of C. I want to do this while minimizing device<->host memory transfers.
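One way to approach that chaining requirement is to make each algorithm consume and produce opaque buffer handles, so intermediate results stay where they are (on the device, for the OpenCL backend) and only the final result is read back. A minimal host-only sketch of the idea, with purely illustrative names:

```cpp
#include <cstddef>

// Stand-in for an IMemoryObject-style handle; with the OpenCL backend this
// would hold a cl_mem and the bytes would never touch the host mid-pipeline.
struct Buffer
{
    char *data;
    std::size_t size;
};

typedef Buffer (*Algorithm)(const Buffer &in);

// Example stage for illustration only: adds 1 to every byte.
Buffer add_one(const Buffer &in)
{
    Buffer out = { new char[in.size], in.size };
    for (std::size_t i = 0; i < in.size; ++i)
        out.data[i] = in.data[i] + 1;
    return out;
}

// Run a chain of algorithms A, B, C, ...; intermediate buffers are freed
// as soon as the next stage has consumed them, and only the caller decides
// when (and whether) to read the final result back to the host.
Buffer run_pipeline(const Buffer &input, Algorithm *algos, int n)
{
    Buffer cur = input;
    for (int i = 0; i < n; ++i) {
        Buffer next = algos[i](cur);
        if (i > 0)
            delete[] cur.data;   // free intermediate result (not the caller's input)
        cur = next;
    }
    return cur;
}
```

The key design point is that run_pipeline never inspects the buffer contents, so swapping in an OpenCL backend changes only what Buffer holds, not the chaining logic.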

Any ideas/suggestions?