Access physical ID of the core

Hello,

I’m trying to get something to work, but I’ve run out of ideas, so I figured I would ask here.

I have a kernel with a large global size (usually around 5 million).
Each thread can require up to 1 MB of global memory (the exact size is not known in advance).

So I figured… OK, I have 6 GB and I can run 2880 cores in parallel, more than enough, right?
My idea is to create one big buffer (well, actually two, because of the max buffer size limitation…)
Each thread would point to its own area of global memory (with coalescing and so on, but you get the idea…)
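
Roughly what I have in mind on the host side (just a sketch: context creation and error checking are omitted, and the sizes/names are placeholders, not a definitive implementation):

[code=c++]
// Sketch: assumes `context` already exists and that the device's
// CL_DEVICE_MAX_MEM_ALLOC_SIZE allows buffers of this size.
#define SLICE_SIZE  (1 << 20)   /* 1 MB of working memory per thread */
#define NUM_SLICES  1440        /* half of the 2880 slices per buffer */

cl_int err;
cl_mem working_a = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  (size_t)NUM_SLICES * SLICE_SIZE, NULL, &err);
cl_mem working_b = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  (size_t)NUM_SLICES * SLICE_SIZE, NULL, &err);
[/code]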

My problem is: how do I know which thread is running in the kernel, so that I can point it to the right memory area?
I did find the cl_arm_core_id extension, but that only identifies the compute unit, not the actual thread being used, and it doesn’t seem to be available on all GPUs, since it’s an extension.

One option is to set work_group_size = nb_cores / nb_compute_units and compute the offset as arm_get_core_id() * work_group_size + get_global_id(0) % work_group_size.
But maybe that group size is not optimal, and the portability issue still exists.
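
In kernel code that would look something like this (a minimal sketch, assuming the cl_arm_core_id extension is available; work_group_size here is just an illustrative parameter):

[code=c++]
#pragma OPENCL EXTENSION cl_arm_core_id : enable

__kernel void task(__global char *working_alloc, int work_group_size)
{
    // Slot of this work item within its compute unit: 0 .. work_group_size - 1.
    int slot = get_global_id(0) % work_group_size;
    // One 1 MB slice per (compute unit, slot) pair.
    size_t offset = ((size_t)arm_get_core_id() * work_group_size + slot) * 0x100000;
    __global char *thread_mem = working_alloc + offset;
    // ... work on thread_mem ...
}
[/code]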

I could also enqueue many kernels of global size 2880, and there I obviously know where to point with the global ID.
But won’t the 5 million / 2880 ≈ 1700 kernel calls lead to a lot of overhead?
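
For that option I’d do something like this on the host (again just a sketch; `queue` and `kernel` are assumed to be set up already, and the last chunk would need clamping):

[code=c++]
size_t total = 5000000;
size_t chunk = 2880;
for (size_t start = 0; start < total; start += chunk) {
    size_t offset[1] = { start };
    size_t gsize[1]  = { chunk };
    // Inside the kernel, get_global_id(0) - get_global_offset(0)
    // then selects the 1 MB slice.
    clEnqueueNDRangeKernel(queue, kernel, 1, offset, gsize,
                           NULL, 0, NULL, NULL);
}
clFinish(queue);
[/code]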

Any ideas on how to do this properly are very welcome!

Julien

On GPUs, thread allocation is not deterministic and depends very much on runtime scheduling. Even the location of the first workgroup is unknown: if the GPU scheduler is advanced (as on NVIDIA and AMD hardware), it may schedule the workgroup on the first available compute unit.

I’m trying to better understand the problem you’re describing above, and the requirements. Assuming you’re talking about NVIDIA GPUs, the “core” you’re referring to (marketing FUD, by the way) is actually a lane inside a warp that is running on some SMX. So what are you really looking for? The SMX? The warp ID? Also, why is that needed?

[QUOTE=OferRosenberg;30473]…So what are you really looking for? The SMX? The warp ID? Also, why is that needed?[/QUOTE]

To me it sounds like the problem is this: Julien has a set of computations that each need up to 1 MB of memory on the GPU per work unit, and because of this size they are forced to use global memory. So the question is: if they allocate 1 MB * 2880 of global memory as the ‘working’ memory for the computations, how do you index it per thread?

Since each thread can take a variable amount of time/memory to complete its task, you can’t index by global thread ID (assuming one thread per task, period), since threads 0-10 might still be working when threads 2880-2890 get queued. If threads were tied to a core, that could be used as an ID of sorts, since no two concurrent threads could have the same core ID (this isn’t actually how it works, right? So it’s a moot point).

To Julien: what’s stopping you from just queuing 2880 work items (or more, since you will most likely need more to get good occupancy) and doing something like this?

[code=c++]
__kernel void task(__global char *working_alloc, __global float *results, const int total_tasks)
{
    int idx = get_global_id(0);
    int n_threads = get_global_size(0);
    // Each work item owns a fixed 1 MB (0x100000 byte) slice of the working buffer.
    __global float *thread_mem = (__global float *)(working_alloc + (size_t)idx * 0x100000);

    // Grid-stride loop: this work item handles tasks idx, idx + n_threads, ...
    for (int i = idx; i < total_tasks; i += n_threads)
    {
        thread_mem[/* offset or whatever */] = /* some calculation */;
        // do things
        results[i] = /* one result per task */;
    }
}
[/code]
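
That way the 5 million tasks get folded into a fixed pool of work items, each reusing its own 1 MB slice, so you get a single launch instead of ~1700. On the host you’d then size the working buffer by the number of work items you launch, not by the number of tasks (a sketch under the same assumptions; `context`, `queue`, `kernel` and `results` are presumed to already exist):

[code=c++]
cl_int err;
size_t n_threads = 2880;  /* or more for occupancy, if memory allows */
int total_tasks = 5000000;
cl_mem working = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                n_threads * 0x100000, NULL, &err);
/* `results` must hold total_tasks floats, since the kernel writes results[i]. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &working);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &results);
clSetKernelArg(kernel, 2, sizeof(int), &total_tasks);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n_threads, NULL, 0, NULL, NULL);
[/code]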