passing an array of char* to kernel function

naroqueen · October 14, 2010, 8:08am

hello,
I’m an OpenCL beginner and not a C++ professional. I’ve read the oclVectorAdd example in the OpenCL samples and saw that the arguments of the kernel function were float * (float pointers representing 3 arrays of floats). I have an idea for a GPU program in which I want to create a simple kernel function, but this time I want to deal with an array of strings. Since there are no strings in OpenCL C, I’m trying to pass 2 arrays of char* as arguments to the kernel function. Just to test the idea, I wrote this code:


__kernel void normalizer(__global char * documentStrings[], __global char * documentStringsOut[])
{ int i = get_global_id(0);
int strLen=sizeof(documentStrings[i])/sizeof(char);
for(int j=0;j<strLen;j++){
    char ch = documentStrings[i][j]; documentStrings[i][j]=ch++;}
documentStringsOut[i]=documentStrings[i];
}

But when I call clBuildProgram I get this error: “invalid address space for array argument to __kernel function”

can anybody help me?!

david.garcia · October 14, 2010, 8:34am

The reason you are seeing an error is because you are declaring “documentStrings” and “documentStringsOut” as arrays of pointers to char (i.e. “char *foo[]”). Arrays of pointers are not allowed as kernel arguments.

There are ways to accomplish the sort of thing you are trying to do but they are rather complicated for a beginner.

If I were you I would try to implement the code first in plain C (with the restriction above about arrays of pointers not allowed) and only once you have the code working then think about porting it to OpenCL.

andrew.brownsword · October 14, 2010, 8:59am

I agree with David’s suggestion to try this in plain C first; that will simplify your development work substantially. I will give you an additional hint as to how I would approach this, though: since you can’t use pointers, the usual fallback is to use indices. Consider having a large buffer/array of chars, and instead of pointing at each string, have an index into the array to the start of your string.

Once you have the indices and the array in your kernel, you can switch to pointers if you prefer, but working with indices can have advantages.
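
A minimal sketch, in plain C, of the index-based layout described above: all strings are copied into one contiguous char buffer, and a second array holds the start index of each string. The names `packed_strings_t`, `pack_strings` and `lowercase_one` are hypothetical, and the lowercase step is only a placeholder for whatever per-character work the kernel is meant to do.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Flattened form of an array of strings (hypothetical type). */
typedef struct {
    char   *chars;    /* all strings back to back, each NUL-terminated */
    size_t *offsets;  /* offsets[i] = index of the first char of string i */
    size_t  count;    /* number of strings */
    size_t  total;    /* total bytes stored in chars */
} packed_strings_t;

/* Copies the strings into one buffer. Returns 0 on success, -1 on failure. */
static int pack_strings(const char *const *strings, size_t count,
                        packed_strings_t *out)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += strlen(strings[i]) + 1;   /* +1 for the terminator */

    out->chars = malloc(total ? total : 1);
    out->offsets = malloc((count ? count : 1) * sizeof *out->offsets);
    if (!out->chars || !out->offsets) {
        free(out->chars);
        free(out->offsets);
        return -1;
    }

    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(strings[i]) + 1;
        out->offsets[i] = pos;              /* where string i starts */
        memcpy(out->chars + pos, strings[i], len);
        pos += len;
    }
    out->count = count;
    out->total = total;
    return 0;
}

/* What one work-item would do: touch string i using indices only.
 * Placeholder operation: convert ASCII upper case to lower case. */
static void lowercase_one(char *chars, const size_t *offsets, size_t i)
{
    for (size_t j = offsets[i]; chars[j] != '\0'; j++) {
        if (chars[j] >= 'A' && chars[j] <= 'Z')
            chars[j] = (char)(chars[j] - 'A' + 'a');
    }
}
```

In an OpenCL port, `chars` and `offsets` would presumably become two separate buffers passed as kernel arguments, and the offsets would more likely be stored as a fixed-width type such as `cl_uint`, since `size_t` can differ in width between host and device.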

david.garcia · October 14, 2010, 12:59pm

Thanks, Andrew. I hesitated on whether to give that advice or not.

Once you have the indices and the array in your kernel, you can switch to pointers if you prefer, but working with indices can have advantages.

Keep in mind that in OpenCL you can’t work with pointers in the device’s address space. I would stick to indices for good.

naroqueen · October 15, 2010, 3:06am

Thanks a lot, Mr. David & Mr. Andrew, for your help! I got your idea and I’ll try implementing it, but I tried another idea too (it might be less practical than yours, and most probably I’ll go with yours, but I just want to understand).
The other idea I thought of was to wrap the char* in a structure and then pass a pointer to this structure to the kernel function. The code was as follows:


typedef struct
{ char* c;
}string_t;

__kernel void normalizer(__global string_t* documentStrings, __global string_t* documentStringsOut)
{ int i = get_global_id(0);
int strLen=sizeof(documentStrings[i].c)/sizeof(char);
for(int j=0;j<strLen;j++){
    char ch = documentStrings[i].c[j]; documentStrings[i].c[j]=ch++;}
documentStringsOut[i].c=documentStrings[i].c;
}

It ran without errors, but it behaved strangely! The returned array (documentStringsOut) had only one valid element, the first string, and the rest were “Bad Ptr”!
I’d like to understand why. I ran it on an array of 5 strings, and in the clEnqueueNDRangeKernel call I set global_work_size to 5 and work_dim to 1…

I’m sorry for my long message but I really want to learn and understand… thanks very much for your help

david.garcia · October 15, 2010, 6:52am

typedef struct
{ char* c;
}string_t;

That’s an illegal struct declaration in OpenCL and the compiler should have produced a compilation error.

You should also take a look at this line:

int strLen=sizeof(documentStrings[i].c)/sizeof(char);

sizeof(documentStrings[i].c) doesn’t mean “the length of this string”. It means “the size of a pointer to char”, which is typically 4 or 8 bytes depending on your platform.

If you want to compute the length of a null-terminated string you need to use the function strlen() from the C standard library, which unfortunately doesn’t exist in OpenCL.
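
Since strlen() is not available, the length has to be counted by hand. A minimal sketch in plain C, assuming the string is null-terminated; `string_length` is a hypothetical name, and in an OpenCL C kernel the parameter would additionally carry an address space qualifier such as `__global`.

```c
#include <stddef.h>

/* Hand-written replacement for strlen(): counts the characters
 * that come before the terminating '\0'. */
static size_t string_length(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

For a `const char *p` pointing at a 20-character string, `string_length(p)` gives 20, while `sizeof(p)` is just `sizeof(char *)` regardless of what `p` points to.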

Spend some time practicing your C first and it will save you a lot of frustration when you try OpenCL.

HolyGeneralK · October 15, 2010, 7:20am

This statement is so incredibly true that it no longer is funny to me. I’m a grad student in aerospace engineering (not a native programmer), and started with Fortran, and now am working a lot with C++ and Fortran. There’s a lot of “hand-holding” in these languages that doesn’t exist in C, so when I write my OpenCL programs, I end up having to tweak a good many little C++ things that slip into my C kernels.

I am getting better…david.garcia is a wonderful asset on these forums!

andrew.brownsword · October 15, 2010, 10:58am

An important thing to realize about pointers in OpenCL is that they differ between devices. If you are running on the CPU device that happens to be the same underlying hardware as the host, then you may happen to find that the pointers are identical… but this is a very bad assumption to rely on, and it will break badly on any other devices (and isn’t guaranteed by the spec to ever work, even if your host is the exact same hardware as your device).

OpenCL devices may have independent address spaces from the host (and each other). There is no assurance that the physical bytes in a buffer when accessed from the host are the same as the physical bytes in the same buffer when accessed from any other device, or that the physical bytes will be the same ones holding the buffer over time. Even if they do happen to be the same physical bytes, they could be mapped to different addresses. When you pass a cl_mem object to a kernel, the system is responsible for converting that object handle into the device’s native pointers. This may include physically copying the data between memory spaces (for example, from host memory to a GPU’s VRAM). Since the bits that make up pointers look just like the bits that make up any other data, the OpenCL runtime has no idea what bits are pointers that need to be remapped… except when they are kernel arguments. Thus you cannot put pointers into your buffers and expect them to work from anywhere else (even a different work-group or kernel on the same device).

Even beyond having different addresses, you aren’t even assured that pointers are the same number of bits. If your host is running a 64-bit OS then its pointers are going to be 64-bit if your application is 64-bit. Most GPUs, however, are not 64-bit devices and thus will use 32-bit pointers… or smaller. The different address spaces of a device (global, local, constant, private) can also vary in size, and pointers could potentially be only 16 bits or even smaller for really small memory pools.

It is also important to understand the memory model that OpenCL provides – it is a “relaxed consistency memory model” and this means that there are very carefully defined rules about when memory needs to be made consistent between devices, kernels, work-groups, work-items … and these rules are defined to allow as low a degree of synchronization and coupling as possible, thus maximizing opportunity for concurrency. This means you need to be careful about having multiple pieces of code touching the same buffer at the same time… even if it seems to work on your particular machine when you write the code, any number of factors can cause your assumptions to break (changing hardware, changing OS, changing drivers, and even just changes to the way your own application works). This affects things like when writes to a buffer from one piece of code become visible to reads from the buffer in a different work-item, kernel, device, etc. Or what the result is when multiple pieces of code write to the same location “simultaneously” (and I put that in quotes because “simultaneous” is a very hard thing to define and harder to be assured of).

So this all boils down to: do not make assumptions about pointers and do not pass them between devices. The CL memory objects were put into the spec to abstract these details from us. Respect and understand that abstraction.
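
One way to follow this advice is to store fixed-width indices in buffers instead of pointers. The sketch below is plain C and only illustrative: `string_ref_t` and `string_ref_at` are hypothetical names, and the assumption is that each string lives somewhere inside a shared char buffer.

```c
#include <stdint.h>

/* Hypothetical record describing one string inside a shared char buffer.
 * Fixed-width integers have the same size and meaning everywhere,
 * unlike a char* whose size and value are specific to one address space. */
typedef struct {
    uint32_t offset;  /* index of the first character in the buffer */
    uint32_t length;  /* number of characters, excluding any terminator */
} string_ref_t;

/* Returns the j-th character of the string described by ref. */
static char string_ref_at(const char *buffer, string_ref_t ref, uint32_t j)
{
    return buffer[ref.offset + j];
}
```

Compared with the earlier struct that wrapped a char*, a record like this stays valid when the buffer is copied or mapped at a different address, because an offset is relative to the start of the buffer rather than to any particular address space.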

david.garcia · October 15, 2010, 11:52am

Amen to everything Andrew just said. He put it very eloquently.

naroqueen · October 18, 2010, 3:21am

I don’t know how to thank you, Andrew, for the detailed explanation… this made things much clearer… by the way, I tried implementing your idea of the character buffer and working with indices instead of pointers… it’s now working…
Also, I guess I’ll stick to your advice, David, of practicing C first… and I’d appreciate any advice on how to improve myself in C and OpenCL…
Thanks again.

andrew.brownsword · October 18, 2010, 9:30am

Glad I could help.

There are lots of good books on C – I’m sure you can find one with good reviews on Amazon. Even if you’re a C expert though, I would still recommend this general implementation pattern:

- implement in C
- test/debug
- minimal conversion to OpenCL C
- test/debug on CPU device where you can debug and/or printf
- test on GPU device
- take advantage of OpenCL C extensions and built-in functions to optimize incrementally
- test/debug on CPU
- test on GPU device
- profile and use this to inform next incremental optimization

It isn’t too hard to build your application so that your enqueuing of kernels can be easily replaced by normal host function calls. In fact, as an extra incremental step, on most CPU devices you can enqueue your host function as a native kernel so that it still executes as part of your task graph.

It gets a bit more challenging when you start into data parallelism, i.e. using multiple work-items. Even there though you can usually start by creating it as 1 work-item that you enqueue from a loop. For this the OpenCL 1.1 feature of being able to provide work-item id offsets is very useful (in 1.0 you have to pass in and use the offsets manually).
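
The thread does not spell out what “1 work-item that you enqueue from a loop” looks like, so here is one possible reading as a plain C sketch. The kernel body is an ordinary function that receives its work-item id as a parameter, and a host loop plays the role of the enqueue. `scale_kernel` and `run_serial` are hypothetical names, and the scaling operation is just a placeholder.

```c
#include <stddef.h>

/* Kernel body written as a plain C function. The work-item id is an
 * explicit parameter here; in OpenCL C it would come from get_global_id(0). */
static void scale_kernel(size_t gid, const float *in, float *out, float factor)
{
    out[gid] = in[gid] * factor;
}

/* Host-side stand-in for running the kernel over [offset, offset + count):
 * one "work-item" per loop iteration, executed serially. */
static void run_serial(size_t offset, size_t count,
                       const float *in, float *out, float factor)
{
    for (size_t gid = offset; gid < offset + count; gid++)
        scale_kernel(gid, in, out, factor);
}
```

With a real OpenCL 1.1 runtime, the loop body would presumably become a clEnqueueNDRangeKernel call with a global work size of 1 and a different global_work_offset on each iteration; in 1.0 the offset would have to be passed as an ordinary kernel argument instead.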

The basic philosophy is to always have something working, and to make small changes at a time so the problems are easier to figure out. It’s important to have an idea of where you’re trying to go (i.e. asynchronous optimized data-parallel tasks in a graph that runs parallel with your host code) so that you’re taking steps in the right direction to get there, but small steps are much more manageable and they avoid the rather depressing feeling that nothing ever seems to be working. Using a version control system is extremely useful so that you keep a history of each incremental step of your process, along with an informative comment on each version that you put in your repository.

naroqueen · October 24, 2010, 5:55am

Thanks very much for the valuable advice

naroqueen · December 11, 2010, 11:16am

test/debug on CPU device where you can debug and/or printf

Hi Andrew I know it has been a long time since I sent my last message in this thread but I have a small question related to your last message that I’d be grateful if you could answer…

I need to test my OpenCL code on the CPU but I can’t get it to work :s
Here’s my code, but it always prints “invalid context!” and the returned context address is always zero…
I’m not sure if there’s something wrong in my code or if I have to follow another procedure to run OpenCL on the CPU!

cl_platform_id *platforms=new cl_platform_id();
	cl_uint *num_platforms =new cl_uint();			
	err=(cl_int *)clGetPlatformIDs(3,platforms, num_platforms);

	
	cl_context_properties  properties[]= { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0], 0};
	hContext = clCreateContextFromType(properties, CL_DEVICE_TYPE_CPU,
		0, 0, err);
	if(err==CL_SUCCESS)
		cout<<"context created successfully!"<<endl;
	if(err==(cl_int *)CL_DEVICE_NOT_FOUND)
		cout<<"no device of this type was found!";


	size_t nContextDescriptorSize;
	clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
		0, 0, &nContextDescriptorSize);
	// description (the last parameter is output)
	cl_device_id * aDevices = (cl_device_id *) malloc(nContextDescriptorSize);
	err=(cl_int * )clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
		nContextDescriptorSize, aDevices, 0);
	if(err==CL_SUCCESS)
		cout<<"GetContextInfo succeeded!"<<endl;
	if(err==(cl_int *) CL_INVALID_CONTEXT)
		cout<<"invalid context!"<<endl;

andrew.brownsword · December 11, 2010, 11:37am

I don’t know if this is related to the problem you’re having, but you’re doing something strange with your error codes. Why have you declared err as a (cl_int*)? It should just be a cl_int. The call to clCreateContextFromType should have a last parameter of “&err”.

I’m guessing you’re actually getting a CL_DEVICE_NOT_FOUND, but not interpreting the error code correctly. Some implementations don’t have a CPU device, unfortunately.
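
To illustrate the point about error codes, here is a small plain C sketch of the two reporting styles involved. The functions are hypothetical stand-ins, not the OpenCL API: `query_count` returns its status directly (the way clGetPlatformIDs does), while `create_thing` returns an object and writes its status through an out-parameter (the way clCreateContextFromType does with its last argument).

```c
#include <stddef.h>

/* Hypothetical stand-ins, NOT the OpenCL API. */
typedef int status_t;
enum { STATUS_OK = 0, STATUS_DEVICE_NOT_FOUND = -1 };

/* Style 1: the status is the return value. */
static status_t query_count(unsigned *count_out)
{
    *count_out = 1;
    return STATUS_OK;
}

/* Style 2: the object is the return value and the status is written
 * through errcode_ret, which may be NULL if the caller does not care. */
static void *create_thing(int device_available, status_t *errcode_ret)
{
    static int dummy;   /* placeholder object to hand back */
    if (!device_available) {
        if (errcode_ret) *errcode_ret = STATUS_DEVICE_NOT_FOUND;
        return NULL;
    }
    if (errcode_ret) *errcode_ret = STATUS_OK;
    return &dummy;
}
```

In both styles the caller declares a plain `status_t err;` (a plain `cl_int err;` in the OpenCL case), assigns it directly in the first style, passes `&err` in the second, and then compares `err` against the status constants without any casts.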

naroqueen · December 14, 2010, 2:48pm

Some implementations don’t have a CPU device, unfortunately.

Thanks very much for the reply, but if it is not supported, how can I debug my OpenCL code on the CPU? Is there any way to do this with my NVIDIA driver, or is the only way to install the ATI driver, which enables running OpenCL code on the CPU?

Also, I have a small question, if you please: how can I run 1 work-item in a loop?

Even there though you can usually start by creating it as 1 work-item that you enqueue from a loop.

Sorry for my many questions, and thanks a lot for your cooperation!

andrew.brownsword · December 16, 2010, 8:26am

Yes, you can try installing the AMD and now Intel implementations. The committee has defined an ICD model that all vendors are adhering to, so you can have one system with multiple implementations installed, and programs can use the platform APIs (such as clGetPlatformIDs) to choose between them at runtime.