clCreateContextFromType and clCreateCommandQueue consume 1 GB of host memory per device

Because NVIDIA does not support parallel kernel execution across multiple queues inside a single context, I had to create multiple threads, with one context per thread. Now a new issue has emerged: the following two lines inside the for loop became extremely slow and memory-hungry:

    cl_context_properties cps[3]={CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0};
    cl_context_properties* cprops=(platform==NULL)?NULL:cps;
    cl_command_queue_properties prop = CL_QUEUE_PROFILING_ENABLE;

    for(i=0;i<workdev;i++){
         /* create a separate context and a profiling-enabled command queue for each device */
         OCL_ASSERT(((mcxcontext[i]=clCreateContextFromType(cprops,CL_DEVICE_TYPE_ALL,NULL,NULL,&status),status)));
         OCL_ASSERT(((mcxqueue[i]=clCreateCommandQueue(mcxcontext[i],devices[i],prop,&status),status)));
         ...
    }

The full code can be browsed at mcxcl/mcx_host.cpp at nvidiaomp · fangq/mcxcl · GitHub.
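For readers unfamiliar with the macro: OCL_ASSERT wraps an OpenCL status check, and the inner comma expression (call, status) evaluates the API call and then yields the status code for the macro to test. A hypothetical equivalent (a sketch only, not the actual mcxcl definition) would look roughly like this:

    /* Hypothetical sketch of an OCL_ASSERT-style check (not the actual mcxcl
       macro): the argument is the comma expression shown above, so it
       evaluates to the OpenCL status code, which is compared to CL_SUCCESS. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    #define OCL_ASSERT(x)  do { \
            cl_int ocl_assert_err_ = (x); \
            if (ocl_assert_err_ != CL_SUCCESS) { \
                fprintf(stderr, "OpenCL error %d at %s:%d\n", \
                        (int)ocl_assert_err_, __FILE__, __LINE__); \
                exit(EXIT_FAILURE); \
            } \
        } while(0)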

I found that, for every device, clCreateContextFromType consumes about 500-600 MB of host memory, and clCreateCommandQueue consumes another 300-400 MB. So, for an empty context and queue, the host consumes about 1 GB and spends about 2-3 seconds per device just to create these objects. This feels really strange. I don't know if this is really necessary; why does an empty queue/context consume so much memory?
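To isolate this overhead from the rest of mcxcl, a standalone reproducer along the following lines can be used (a sketch only, not part of mcxcl; it mirrors the two calls above, adds timing, and pauses at the end so the resident memory can be inspected with top). Error handling is reduced to printing the last status code.

    /* Standalone sketch (not part of mcxcl): one context and one queue per GPU,
       mirroring the loop above. Build with: gcc repro.c -lOpenCL  (add -lrt on
       older glibc for clock_gettime). */
    #define CL_USE_DEPRECATED_OPENCL_1_2_APIS
    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <stdio.h>
    #include <time.h>

    static double now(void){
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC,&ts);
        return ts.tv_sec+ts.tv_nsec*1e-9;
    }

    int main(void){
        cl_platform_id platform;
        cl_device_id devices[16];
        cl_uint devcount=0;
        cl_int status=CL_SUCCESS;
        cl_context ctx[16];
        cl_command_queue queue[16];

        clGetPlatformIDs(1,&platform,NULL);
        clGetDeviceIDs(platform,CL_DEVICE_TYPE_GPU,16,devices,&devcount);

        cl_context_properties cps[3]={CL_CONTEXT_PLATFORM,(cl_context_properties)platform,0};

        for(cl_uint i=0;i<devcount;i++){
            double t0=now();
            ctx[i]=clCreateContextFromType(cps,CL_DEVICE_TYPE_ALL,NULL,NULL,&status);
            double t1=now();
            queue[i]=clCreateCommandQueue(ctx[i],devices[i],CL_QUEUE_PROFILING_ENABLE,&status);
            double t2=now();
            printf("device %u: context %.2f s, queue %.2f s (status=%d)\n",i,t1-t0,t2-t1,status);
        }
        getchar();  /* pause here; check the process memory with top */
        return 0;
    }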

This is not the end of it: if I ask my CL kernel to run on more than 8 devices (my server has 11 GTX 1080Ti GPUs), the memory usage grows to around 8 GB and then the program crashes at the above two lines of code. My host has 16 GB of memory, so I am not sure why it crashes (I was able to use all 11 devices when running the CUDA version).

I am wondering if anyone can offer an explanation for this behavior. Should this be expected? Is there a way I can reduce this overhead? I tried changing CL_QUEUE_PROFILING_ENABLE to CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, but it did not help.
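For reference, the variant I tried was simply swapping the queue-properties flag in the loop above, roughly like this (sketch of the change only):

    /* the change I tried: request an out-of-order queue instead of a profiling queue */
    cl_command_queue_properties prop = CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
    OCL_ASSERT(((mcxqueue[i]=clCreateCommandQueue(mcxcontext[i],devices[i],prop,&status),status)));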

thanks

If anyone wants to reproduce this, you can use the following commands:

    git clone https://github.com/fangq/mcxcl.git
    cd mcxcl
    git checkout nvidiaomp
    cd src
    make clean all
    cd ../example/benchmark
    ./run_benchmark1.sh -G 1

The above command launches the kernel on the 1st CL device (use mcxcl/bin/mcxcl -L to list the devices). If you have multiple GPUs, you can pass a mask of 0s and 1s to -G (for example, -G 1101 will use the 1st, 2nd, and 4th GPU). You can run top in the meantime to watch the memory allocation.

Can anyone comment on this?

Such behavior is quite unusual and really limits scalability; I can't believe it was designed this way.