PyOpencl, operation along a given axis like cumsum(array, axis=0)

Hello,

I am new to Opencl, I use to code using numpy and I am dealing with stack of images. At the end I have 3D array of float.
I made my first customized kernel and it works like a charm and it’s so fast, I love it!

Now I am converting other algorithm and many of them has operation over a given axis.
For instance I am stuck with cumsum along a given axis : I have a 3D set of data (time,row,col), I would like to compute like in numpy : numpy.cumsum(data, axis=0) to perform for every pixel of my 3D stack a cumsum along time

In the Pyopencl documentation : Parallel Algorithms - pyopencl 2022.2.4 documentation, there is an example :

knl = InclusiveScanKernel(context, np.int32, "a+b")

n = 2**20-2**18+5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()

This code works for 1D input data, but I do not know how to input more than 1D data. I’d like to have 3D data as input and perform operation along a given axis.
Is the good method to have two loops in python and enqueue task?

I hope someone could help me and teach the right “opencl” approach for such operation
Thanks,
Jérôme