Type: Posts; User: Tugrul

  1. Memory banks and memory channels are interleaved...

    Memory banks and memory channels are interleaved for addressing. Otherwise accesses to the first n bytes would be serialized, giving bad performance until one starts using addresses from n+1 onward.

    Because of interleaving, it...
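
    The interleaving idea above can be sketched in plain C. This is only an illustration: the bank count, the chunk size, and both function names are assumptions, not values from any real memory controller.

    ```c
    #include <stdint.h>

    /* Hypothetical layout: 8 banks, each serving 64-byte chunks in turn. */
    enum { NUM_BANKS = 8, CHUNK_BYTES = 64 };

    /* Interleaved mapping: consecutive chunks rotate through the banks,
       so a linear walk over memory keeps all banks busy. */
    static unsigned bank_interleaved(uint64_t addr) {
        return (unsigned)((addr / CHUNK_BYTES) % NUM_BANKS);
    }

    /* Non-interleaved mapping: each bank owns one contiguous region,
       so the first bytes_per_bank bytes all land on bank 0. */
    static unsigned bank_contiguous(uint64_t addr, uint64_t bytes_per_bank) {
        return (unsigned)(addr / bytes_per_bank);
    }
    ```

    With these assumed numbers, addresses 0 and 64 fall on different banks when interleaved, but on the same bank when each bank owns a contiguous 1024-byte region.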
  2. Checking address requests of neighbor workitems...

    Checking address requests of neighbor workitems must be easier than checking future address requests of a single workitem, since the latter would need extra processing. Also getting all future data for a workitem...
  3. Replies: 1, Views: 120

    You must have given 256 as offset parameter too....

    You must have given 256 as the offset parameter too. That adds that number to all workitems' global id values. If there are not multiple GPUs, you may not need that value to be anything other than zero.

    cl_int...
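
    The effect of the offset can be emulated in plain C for a 1-D launch. This is a sketch, not host code: fill_global_ids is a hypothetical helper, and it only mimics how a global work offset shifts the value each workitem sees as its global id.

    ```c
    #include <stddef.h>

    /* Emulates a 1-D launch of `count` workitems: each one sees
       global id = global_offset + its index within the range. */
    static void fill_global_ids(size_t global_offset, size_t count, size_t *ids_out) {
        for (size_t i = 0; i < count; ++i) {
            ids_out[i] = global_offset + i;
        }
    }
    ```

    With an offset of 256 the first workitem sees id 256 instead of 0, so a kernel that indexes a buffer by global id would skip the first 256 elements.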
  4. Replies: 1, Views: 216

    if (xIndex >= width || yIndex >= height) {...

    if (xIndex >= width || yIndex >= height)
    {
        return;
    }

    This part makes some threads of a group quit early, but

    barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE );

    needs to be hit by all...
  5. Replies: 1, Views: 390

    Also it could be a queue-size issue on rx550's...

    Also it could be a queue-size issue on rx550's implementation. Maybe enqueueing nearly 64k child kernels is not a good design? How can I make a ray-tracer then? Each ray will refract+reflected...
  6. Replies: 1, Views: 390

    How can I wait for on-device queue?

    I'm learning OpenCL 2.0 and am stuck on synchronizing child kernels with parent kernels in a simple dynamic parallelism algorithm.

    When it's just incrementing a single value, it seems to be...
  7. You can use 2 integers. 1 for integer part(easy...

    You can use 2 integers: one for the integer part (an easy single operation) and one for the floating part (bitwise interpretation).

    Then when you're done with these, convert these 2 integers A and B to floats as A.0...
  8. Replies: 14, Views: 1,241

    Thank you very much. I use in-order type for all...

    Thank you very much. I use the in-order type for all queues, so it's OK to sync on a single queue, provided that before that sync point, events ensure the other queues are done computing, if I understand...
  9. Replies: 14, Views: 1,241

    Then a synchronization point synchronizes for all...

    Then does a synchronization point synchronize all queues and everything else? (I mean that "moment" of the state of all buffers and all kernels.)

    If I don't use events, other queues may have still not...
  10. Replies: 14, Views: 1,241

    Also I've been using this function to sync for...

    Also I've been using this function to sync for many queues without problem:



    __declspec(dllexport)
    void waitN(OpenClCommandQueue ** hCommandQueueArray, OpenClCommandQueue *...
  11. Replies: 14, Views: 1,241

    Either it is read only accessed by any address or...

    Either it is read-only and accessed at any address, or it is read/write only within a narrow range per device (the same range for the host-device copy too).

    Like this:

    gpu1 read + write element 1

    gpu2...
  12. Replies: 14, Views: 1,241

    There is already multi gpu support but working on...

    There is already multi-GPU support, but each device works on a different region of the host array. All buffers are duplicated per device because they are in distinct contexts.

    They don't overlap writes. They...
  13. Replies: 14, Views: 1,241

    Thank you, do you mean should I lift the bug...

    Thank you. Do you mean I should lift the bug warning I've written here https://github.com/tugrul512bit/Cekirdekler/wiki/Bugs and here https://github.com/tugrul512bit/Cekirdekler/wiki/Pipelining ?
    ...
  14. Replies: 14, Views: 1,241

    I've forgotten another thing to write in first...

    I forgot to mention one more thing in the first question: the N kernels were actually the same cl::kernel instance, running on the same buffer (parameter) with different offset+range. Should I convert it...
  15. Replies: 14, Views: 1,241

    I have another question: If - at first,...

    I have another question:

    If

    - at first, command-queue-1 flows to command-queue-2 with an event
    - then command-queue-2 synchronizes with something like a "wait barrier" or "clFinish()"
    ...
  16. Replies: 14, Views: 1,241

    I forgot to mention: by the "overlapping", I mean...

    I forgot to mention: by the "overlapping", I mean "time", not data. Just a note for newcomers like me.

    Maybe using N contexts can achieve the same pipelining but AMD hardware wouldn't let me create...
  17. Replies: 14, Views: 1,241

    Is this undefined behavior?

    I have a single kernel (embarrassingly parallel data access such as Ai = Ai + 5) and a single buffer for a read+write operation.

    To shorten total latency, I did these:

    - Break kernel into 64...
  18. Then do it in local space first, for all...

    Then do it in local space first, for all workgroup threads, then synchronize (or update atomically) in global space.
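
    A plain C11 sketch of that pattern, assuming the operation is a sum (the post does not say which operation it is). The workgroups are emulated sequentially, a private variable stands in for local memory, and both function names plus the group_size parameter are hypothetical.

    ```c
    #include <stdatomic.h>
    #include <stddef.h>

    /* One emulated workgroup: accumulate its slice privately (standing in
       for local memory), then publish with a single atomic add. */
    static void group_sum(const int *data, size_t begin, size_t end,
                          atomic_int *global_total) {
        int local_sum = 0;
        for (size_t i = begin; i < end; ++i) {
            local_sum += data[i];
        }
        atomic_fetch_add(global_total, local_sum);
    }

    /* Runs every emulated workgroup in turn; group_size must be nonzero. */
    static int sum_by_groups(const int *data, size_t n, size_t group_size) {
        atomic_int total;
        atomic_init(&total, 0);
        for (size_t begin = 0; begin < n; begin += group_size) {
            size_t end = (n - begin < group_size) ? n : begin + group_size;
            group_sum(data, begin, end, &total);
        }
        return atomic_load(&total);
    }
    ```

    The point of the design is that the shared total receives one atomic update per group instead of one per element.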
  19. Maybe you can turn this into an "add" version: ...

    Maybe you can turn this into an "add" version:



    //Function to perform the atomic max
    inline void AtomicMax(volatile __global float *source, const float operand) {
    union {
    ...
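
    The quoted AtomicMax is cut off above, so the following does not reproduce it. It is a separate sketch in plain C11 (not OpenCL C) of what an "add" version could look like, assuming the usual compare-and-swap loop over the float's 32-bit pattern. All three function names are hypothetical.

    ```c
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    /* Stores a float as its bit pattern in an atomic 32-bit cell. */
    static void store_float(_Atomic uint32_t *cell, float value) {
        uint32_t raw;
        memcpy(&raw, &value, sizeof raw);
        atomic_store(cell, raw);
    }

    /* Reads the cell back and reinterprets its bits as a float. */
    static float load_float(_Atomic uint32_t *cell) {
        uint32_t raw = atomic_load(cell);
        float value;
        memcpy(&value, &raw, sizeof value);
        return value;
    }

    /* Atomic float add: compute the new value from a snapshot, then
       commit it only if the cell still holds that snapshot. */
    static void atomic_add_float(_Atomic uint32_t *cell, float operand) {
        uint32_t expected = atomic_load(cell);
        for (;;) {
            float current;
            memcpy(&current, &expected, sizeof current);
            float sum = current + operand;
            uint32_t desired;
            memcpy(&desired, &sum, sizeof desired);
            /* On failure, expected is refreshed with the latest bits. */
            if (atomic_compare_exchange_weak(cell, &expected, desired)) {
                return;
            }
        }
    }
    ```

    For example, storing 1.5f and then adding 2.25f leaves 3.75f in the cell.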
  20. Replies: 5, Views: 942

    dist = sqrt(pow(deltaPos) + pow(deltaPos) +...

    dist = sqrt(pow(deltaPos) + pow(deltaPos) + pow(deltaPos.z, 2.0f)); //Get the distance between them

    Here pow is a very general and slow function. To just square things, multiply them by themselves....
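
    A small plain C sketch of squaring by multiplication. Vec3 is a hypothetical stand-in for the float3 type in the quoted kernel, and distance_squared is an illustrative name.

    ```c
    /* Hypothetical stand-in for OpenCL's float3. */
    typedef struct { float x, y, z; } Vec3;

    /* Squared distance using plain multiplies instead of pow(v, 2.0f). */
    static float distance_squared(Vec3 a, Vec3 b) {
        float dx = a.x - b.x;
        float dy = a.y - b.y;
        float dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }
    ```

    Take the square root of the result only where the actual distance is needed; comparisons against a cutoff can use the squared value directly.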
  21. Replies: 5, Views: 769

    I have an open source project that you can try...

    I have an open source project where you can try driver-controlled pipelining and event-controlled pipelining for separable kernels (can upload+download+compute at the same time for all stages, per...
  22. Copying Command Queue With Everything in it

    I have a program that does heavy work on the host side and enqueues a lot of kernels (such as 50 kernels for a reduction). It adds too much latency because of that host-side sluggishness, but device side...
  23. Replies: 4, Views: 4,030

    Re: The support of out-of-order mode ?

    Nvidia Fermi GPUs have out-of-order execution at the thread level

    http://www.nvidia.com/content/PDF/fermi ... epaper.pdf

    If I can simulate 2000 molecules with my Pentium/Centrino 2 GHz, what can do...
Results 1 to 23 of 23