Huge circular buffer beyond CL_DEVICE_MAX_MEM_ALLOC_SIZE

Hi

I am porting code to the GPU that uses a huge circular buffer (typically 3 GB) organized as rows (typically 64 KB each).
In the CPU code it is implemented as an array of pointers, each pointing to a row of data.
When we need to rotate the buffer, we only rotate the small pointer array.
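In C, the CPU scheme is essentially the following (a simplified sketch; NROWS and ROW_SIZE are illustrative placeholders, not my real sizes):

#include <string.h>

#define NROWS    16
#define ROW_SIZE 16384

static float *rows[NROWS];            /* one pointer per row */

void init_rows(float *block)          /* 'block' holds all NROWS rows */
{
    for (int i = 0; i < NROWS; i++)
        rows[i] = block + (size_t)i * ROW_SIZE;
}

/* Rotate by k rows: only the small pointer array moves, never the data. */
void rotate_rows(int k)
{
    float *tmp[NROWS];
    for (int i = 0; i < NROWS; i++)
        tmp[i] = rows[(i + k) % NROWS];
    memcpy(rows, tmp, sizeof rows);
}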

In the CPU case, the data itself is allocated as one block (of 3 GB), but on a GPU I must allocate the data in 4 buffers (due to the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit).

If it were a single block, I could have used integers to encode the row permutation instead of pointers,
e.g. accessing buf0[row[i]*65536+j] instead of buf[i][j], with initially buf[i]=buf0+i*65536.

but if the data is allocated as 4 blocks (say float *buf0, *buf1, *buf2 and *buf3), is it possible to have a kernel initialize an array to buf0, buf0+16384, buf0+2*16384, …, buf1, buf1+16384, … and so on, and use this array of pointers in successive kernels?
(i.e. can we mix pointers to different cl_mem objects (of the same type, however) within the same array, AND are the cl_mem object addresses constant between successive kernel calls, or does the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit mean that the address spaces are completely separate, with no real pointers, only pointers within one given cl_mem object?)

At least, is a construct like

__global float *p;
switch (row[i]/12288)
{
case 0:
    p = buf0 + 16384*row[i];
    break;
case 1:
    p = buf1 + 16384*(row[i]-12288);
    break;
case 2:
    p = buf2 + 16384*(row[i]-12288*2);
    break;
case 3:
    p = buf3 + 16384*(row[i]-12288*3);
    break;
}

and then working on the row of data in p[0…16383], valid?

Is there any way to store pointers such as p in an array that could be accessed by a later kernel as a __constant array from which the pointers can be retrieved?

I have trouble understanding what you are trying to do.

As a general rule, I strongly recommend not storing pointers inside global memory. Memory objects move around and your pointers will become invalid. Let’s give an example:


__kernel void foo(__global int *bar, __global int *fubar)
{
    // This is a bad, bad idea. fubar could be in a different address in the future.
    bar[0] = (int)fubar;

    // This is also an equally bad idea for the same reason
    bar[0] = (int)&bar[10];
}

You will have to use offsets from the beginning of the buffer instead. For example:


__kernel void foo(__global int *bar, __global int *fubar)
{
    // This way bar[0] contains a reference to bar[10]
    bar[0] = 10;

    // This is like dereferencing a pointer stored in bar[0].
    int value = bar[bar[0]];
}

If the problem is that your data will be stored in multiple buffer objects, you could simply use the most significant bits of the offset to indicate which of the memory objects the offset refers to. Something like this:


__kernel void foo(__global uint* a0, __global uint* a1, __global uint* a2, __global uint* a3)
{
    // Put them all together into an array of pointers
    __global uint* a[4];
    a[0] = a0;
    a[1] = a1;
    a[2] = a2;
    a[3] = a3;

    // Use the two most significant bits of the offset to indicate which of
    // a0, a1, a2 or a3 the offset refers to.
    // For example, this is how you store a reference to a2[10] inside a1[2]
    a[1][2] = (2u<<30) | 10;

    // Another example: this is a reference to a3[100]
    uint myRef = (3u<<30) | 100;

    // This is how you read back that reference:
    uint value = a[myRef>>30][myRef&0x3FFFFFFF];
}

Obviously this would be much easier to code if you add auxiliary macros like these. This is the same example as above, using the macros:


#define BUFFER_ID_BITS 2
#define OFFSET_MASK ((1u << (32-BUFFER_ID_BITS)) - 1)
#define REF(buffer, offset) (((uint)(buffer) << (32-BUFFER_ID_BITS)) | (offset))
#define DEREF(a, ref) (a)[(ref) >> (32-BUFFER_ID_BITS)][(ref) & OFFSET_MASK]

__kernel void foo(__global uint* a0, __global uint* a1, __global uint* a2, __global uint* a3)
{
    // Put them all together into an array of pointers
    __global uint* a[4];
    a[0] = a0;
    a[1] = a1;
    a[2] = a2;
    a[3] = a3;

    // Use the two most significant bits of the offset to indicate which of
    // a0, a1, a2 or a3 the offset refers to.
    // For example, this is how you store a reference to a2[10] inside a1[2]
    a[1][2] = REF(2, 10);

    // Another example: this is a reference to a3[100]
    uint myRef = REF(3,100);

    // This is how you read back that reference:
    uint value = DEREF(a, myRef);
}

Warning: I haven’t built or tested the code above. There could be syntax errors or whatever. The general idea is correct and that’s what matters :)

The code can be generalized for 32-bit or 64-bit devices if necessary.
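For instance, a 64-bit variant of the macros could look like this (again untested, just a sketch; references become ulong with 62 offset bits):

#define BUFFER_ID_BITS 2
#define OFFSET_MASK (((ulong)1 << (64-BUFFER_ID_BITS)) - 1)
#define REF(buffer, offset) (((ulong)(buffer) << (64-BUFFER_ID_BITS)) | (ulong)(offset))
#define DEREF(a, ref) (a)[(ref) >> (64-BUFFER_ID_BITS)][(ref) & OFFSET_MASK]

// For example, REF(3, 100) is a reference to a3[100], as before.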

I hope that helps. Let me know if you have any questions.

Thank you for your fast answer!

I think you understood my problem. I have a few questions though:

1st:
The first lines
a[0]=a0;
a[1]=a1;
a[2]=a2;
a[3]=a3;
must be present at the beginning of every kernel that uses a[], because between successive kernel launches the addresses of a0, a1, a2 and a3 may change?

2nd:
Assume for simplification that each of the 4 blocks contains 4 “rows” of data. Is it possible, following your example, to replace these 4 lines with the 16 lines
__global uint *a[16];
a[0]=a0;
a[1]=a0+0x40000000;
a[2]=a0+0x80000000;
a[3]=a0+0xc0000000;
a[4]=a1;
a[5]=a1+0x40000000;
…
a[15]=a3+0xc0000000;
and access entry column_index (with column_index between 0 and 0x3fffffff) of row number row_index (with row_index between 0 and 15) by simply
a[row_index][column_index] = <value> ;

3rd:
instead of “__global uint *a[…”, which will cause all threads to write to the same global location, should I rather use “__private uint *a[…”, which will use 4 (or 16 for my variant in question #2) registers but cause no memory conflicts? (Or does the OpenCL compiler detect that the first 4 lines do the same job for all threads and skip them for all but the first thread of an NDRange?)

4th:
the idea I had was to use a very short kernel that initialises “__global uint *a[…” in storage allocated by clCreateBuffer, and then pass the corresponding array as “__constant uint *a[…” (I know it is forbidden to pass an argument of pointer-array type, but it is possible to cheat by casting to a pointer when using it) to the kernel doing the effective computation. This would not waste 4 or 16 registers per thread to store exactly the same pointers for all threads, and as I understand it, constant memory accesses are efficiently cached on a GPU?

Following your advice, I can use two index arrays (one for the block index and one for the offset within the block), computed by a small kernel called beforehand:
__global uchar ahigh[16];
__global uint alow[16];
ahigh[0]=ahigh[1]=ahigh[2]=ahigh[3]=0;
ahigh[4]=ahigh[5]=ahigh[6]=ahigh[7]=1;
… ahigh[15]=3;
alow[0]=alow[4]=alow[8]=alow[12]=0x0;
alow[1]=alow[5]=alow[9]=alow[13]=0x40000000;
… alow[15]=0xc0000000;

and in the computation kernel I access the data through:
#define AREF(row,col) (a)[ahigh[(row)]][alow[(row)]+(col)]
… __global uint* a0, __global uint* a1, __global uint* a2, __global uint* a3, __constant uchar *ahigh, __constant uint *alow, …
a[0]=a0;
a[1]=a1;
a[2]=a2;
a[3]=a3;

AREF(row_index,column_index) = <my value> ;
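Spelled out as a self-contained kernel, this would be roughly (a sketch; the kernel name and the work partitioning are placeholders):

__kernel void compute(__global uint* a0, __global uint* a1,
                      __global uint* a2, __global uint* a3,
                      __constant uchar *ahigh, __constant uint *alow)
{
    __global uint* a[4];
    a[0] = a0;
    a[1] = a1;
    a[2] = a2;
    a[3] = a3;

    // placeholder partitioning: one row per work-group, one column per work-item
    uint row_index    = get_group_id(0);
    uint column_index = get_local_id(0);

    AREF(row_index, column_index) = 42;  // <my value>
}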

This should work; however, it will require twice as many accesses to the texture memory and one more addition than the variant proposed in question #2, and it requires initialisation in another kernel to get the array into __constant space, because it is (obviously) read-only!

What is your opinion?

Yes, perhaps I need to explain the point of all this:

in fact I process my data by overlapping rows; for example, with my 16 rows:

I run the computation with rows 0 to 15 of my stream,

then I rotate rows 13, 14, 15 to positions 0, 1, 2, fill rows 3 to 15 with fresh input data, and run the computation on the 16 rows,

then again I rotate rows 13 to 15 to positions 0, 1, 2, fill rows 3 to 15 with new fresh data, and so on (I have typically 80 GB of signal to process; the true number of rows is around 1000, the block size is typically 3 GB and the overlap around 1 GB).

Instead of truly moving some 600 MB (in my 16-row example) between a3 and a0 (with possible complications if the overlap is above 1/4 of the data size), I can rotate the index arrays ahigh[] and alow[] instead (16 + 64 bytes only)!
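In host code the rotation then amounts to something like this (a sketch; 'shift' is the number of rows to rotate by, and the updated tables are re-uploaded afterwards, e.g. with clEnqueueWriteBuffer):

/* Rotate the index tables by 'shift' rows instead of moving the data. */
void rotate_tables(unsigned char ahigh[16], unsigned int alow[16], int shift)
{
    unsigned char h[16];
    unsigned int  l[16];
    for (int i = 0; i < 16; i++) {
        h[i] = ahigh[(i + shift) % 16];
        l[i] = alow[(i + shift) % 16];
    }
    for (int i = 0; i < 16; i++) {
        ahigh[i] = h[i];
        alow[i]  = l[i];
    }
}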

1st:
The first lines
a[0]=a0;
a[1]=a1;
a[2]=a2;
a[3]=a3;
must be present at the beginning of every kernel that uses a[], because between successive kernel launches the addresses of a0, a1, a2 and a3 may change?

That’s right.

2nd:
Assume for simplification that each of the 4 blocks contains 4 “rows” of data. Is it possible, following your example, to replace these 4 lines with the 16 lines
__global uint *a[16];
a[0]=a0;
a[1]=a0+0x40000000;
a[2]=a0+0x80000000;
a[3]=a0+0xc0000000;
a[4]=a1;
a[5]=a1+0x40000000;
…
a[15]=a3+0xc0000000;
and access entry column_index (with column_index between 0 and 0x3fffffff) of row number row_index (with row_index between 0 and 15) by simply
a[row_index][column_index] = <value> ;

If there are 16 rows, then BUFFER_ID_BITS will be 4 instead of 2. Therefore, column_index will take values between 0 and 0x0FFFFFFF instead of 0x3FFFFFFF.

Other than that, yes, that would work.
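By the way, the 16-entry table can also be built with a loop instead of 16 assignments; something like this (same caveat as before: an untested sketch):

__kernel void foo(__global uint* a0, __global uint* a1,
                  __global uint* a2, __global uint* a3)
{
    __global uint* blk[4];
    blk[0] = a0;
    blk[1] = a1;
    blk[2] = a2;
    blk[3] = a3;

    // 4 rows of 0x40000000 uints per block, as in your example
    __global uint* a[16];
    for (int i = 0; i < 16; i++)
        a[i] = blk[i / 4] + (size_t)(i % 4) * 0x40000000;

    // now a[row_index][column_index] works as you describe
}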

3rd:
instead of “__global uint *a[…”, which will cause all threads to write to the same global location, should I rather use “__private uint *a[…”, which will use 4 (or 16 for my variant in question #2) registers but cause no memory conflicts? (Or does the OpenCL compiler detect that the first 4 lines do the same job for all threads and skip them for all but the first thread of an NDRange?)

You have misunderstood the declaration of variable “a”. “__global uint* a[N]” means “an array of N pointers stored in private memory and pointing to global memory”. Variable ‘a’ is not shared, it’s allocated separately by each work-item.

4th:
the idea I had was to use a very short kernel that initialises “__global uint *a[…” in storage allocated by clCreateBuffer, and then pass the corresponding array as “__constant uint *a[…” (I know it is forbidden to pass an argument of pointer-array type, but it is possible to cheat by casting to a pointer when using it) to the kernel doing the effective computation. This would not waste 4 or 16 registers per thread to store exactly the same pointers for all threads, and as I understand it, constant memory accesses are efficiently cached on a GPU?

That code would not be portable. Buffer objects can move between kernel invocations. Never assume that pointers to global memory are immutable.

Okay, thanks for the 3rd point… __private float* would not have worked and I would have spent days figuring out why… (it is the archetype of a horrible pitfall for beginners)

I will stick to your advice (the macro definition with the 4-pointer array in registers and the 4 blocks as arguments) and test two variants: one with the row addresses (block and offset within block) in texture memory, and one with the addresses computed on the fly (because, after all, my row permutation is simply a circular permutation, which is fast to compute if a power-of-two row count is made compulsory).

I have the intuition that computing the address on the fly (even with a small “if then else” in the non-power-of-two case) is faster than two constant memory accesses.
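For the power-of-two case, the on-the-fly computation is only a couple of instructions, something like this (a sketch with placeholder sizes; 'head' is the current origin of the circular permutation):

#define NROWS          16      // assumed power of two
#define ROWS_PER_BLOCK 4       // assumed power of two
#define ROW_SIZE       16384

// Map a logical row to (block index, offset within block).
inline uint2 row_address(uint row_logical, uint head)
{
    uint phys = (row_logical + head) & (NROWS - 1);   // circular permutation
    return (uint2)(phys / ROWS_PER_BLOCK,
                   (phys & (ROWS_PER_BLOCK - 1)) * ROW_SIZE);
}

// usage: uint2 r = row_address(row, head); then a[r.x][r.y + col] = value;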

I shall post a comment when the code is operational. Again, thanks a lot for your kind help :)

Okay, the code is operational (except for a few rare operating modes and one auxiliary output), and computing the address on the fly is definitely the fastest (better still, for the row-processing batches, a pointer to just the part of the buffer that is actually used can be passed as an argument to the kernel).
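(For the record, one way to pass just the used part of a buffer from the host side is a sub-buffer, available since OpenCL 1.1; a sketch with placeholder origin and size, noting that the region must respect CL_DEVICE_MEM_BASE_ADDR_ALIGN:)

cl_buffer_region region = { origin_bytes, size_bytes };  /* placeholders */
cl_int err;
cl_mem sub = clCreateSubBuffer(buf, CL_MEM_READ_WRITE,
                               CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &sub);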

Thanks a lot again for your help.

I’m glad it worked well in the end! :)