Reduction within one work item

Hi,

How would more experienced devs approach this problem?

I want to compute a financial indicator called “Ichimoku” on the GPU.

The actual problem boils down to:

  • you have a price series array, let's say an array of 10,000 doubles, indexed 0 to 9,999

Calculating Ichimoku basically involves performing the following task 2-3 times with different widths, plus a few minor challenges. Each calculation is independent of the previous / next one, so the outer loop is perfectly parallel. The inner loop is a min/max reduction over the X previous values:

perfectly parallel outer loop:

  • run the inner loop (kernel) for each array value, independently of the previous / next value

inner loop:

(int) argument X = 26

calculating the result of array index I for width X:

  • find the low of index I to index (I - X) = LOW
  • find the high of index I to index (I - X) = HIGH
  • result for I = (LOW + HIGH) / 2.0

so for X = 26 and array_index = 100

  • find the low of array[100] to array[100-26] (inclusive)

  • find the high of array[100] to array[100-26] (inclusive)

  • global result[100] = (low + high) / 2.0

  • of course, only calculate results for index values > X
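To pin down exactly what I mean, here is a minimal plain C sketch of the calculation (no GPU code yet). The function names window_midpoint and midpoint_series are made up for illustration, and I am assuming the window runs from index I - X to index I, both ends included. The body of window_midpoint is roughly what a single work item would do in the simple approach.

```c
#include <assert.h>
#include <stddef.h>

/* Midpoint of the lowest and highest price in the window [i - x, i].
   Assumption: both ends of the window are included.
   The caller must make sure that i >= x. */
double window_midpoint(const double *price, size_t i, size_t x)
{
    assert(i >= x);

    double low = price[i];
    double high = price[i];

    /* Sequential min/max scan over the x values before index i. */
    for (size_t k = i - x; k < i; ++k) {
        if (price[k] < low)  low = price[k];
        if (price[k] > high) high = price[k];
    }
    return (low + high) / 2.0;
}

/* Outer loop: every index is independent of its neighbours, so on a GPU
   each iteration could become one work item. Following the rule above,
   only indices greater than x are computed; the rest are left untouched. */
void midpoint_series(const double *price, double *result, size_t n, size_t x)
{
    for (size_t i = x + 1; i < n; ++i)
        result[i] = window_midpoint(price, i, x);
}
```

For X = 26 this does 26 comparisons for the low and 26 for the high per output value, which is the sequential part I would like to parallelize.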

I could simply write a kernel that is launched with one work item per array element and computes the high/low sequentially inside the kernel. I would gain over a traditional CPU implementation because the kernel runs for every array value in parallel, but the main workload, the inner loop, would still be sequential.

How could I do a min/max reduction within the kernel? Launch array_size * X work items and keep track of which work items are supposed to take part in the local-memory min/max reduction at a given stage and do nothing at the later stages?
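To illustrate the staged scheme I have in mind, here is a rough sketch that emulates one work group in plain C. This is not real OpenCL code: GROUP_SIZE and group_midpoint are hypothetical names, the two local arrays stand in for local memory, and each inner `for` over lid stands in for the work items of one group running side by side. I am again assuming the window [I - X, I], both ends included.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical work-group size: a power of two that is >= x + 1. */
#define GROUP_SIZE 32

/* Emulates one work group that computes the result for index i. */
double group_midpoint(const double *price, size_t i, size_t x)
{
    assert(i >= x && x + 1 <= GROUP_SIZE);

    double lo[GROUP_SIZE];  /* stands in for local memory */
    double hi[GROUP_SIZE];

    /* Load stage: lane lid reads one window element. Surplus lanes load
       neutral values so they can never win a min or a max. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        int in_window = (lid <= x);
        lo[lid] = in_window ? price[i - x + lid] : DBL_MAX;
        hi[lid] = in_window ? price[i - x + lid] : -DBL_MAX;
    }

    /* Reduction stages: the number of active lanes halves each time and
       the remaining lanes do nothing. On a GPU a barrier would be needed
       between stages; here the stages simply run one after another. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid) {
            if (lo[lid + stride] < lo[lid]) lo[lid] = lo[lid + stride];
            if (hi[lid + stride] > hi[lid]) hi[lid] = hi[lid + stride];
        }
    }

    /* Lane 0 now holds the low and the high of the whole window. */
    return (lo[0] + hi[0]) / 2.0;
}
```

With 32 lanes the low and high for X = 26 are found in 5 stages instead of 26 sequential steps, but most lanes sit idle in the later stages and every output index needs its own group. I am not sure this beats the simple one-work-item-per-index kernel for such a small window.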

Help is very much appreciated.