If I replace the kernel with an empty one, it executes without issue on the GPU. In fact, if I comment out the double loop, i.e., change the code to:

```cpp
// for (size_t j = 0; j < M; ++j) {
//     size_t m = 1;
//     for (size_t i = 0; i < dim; ++i) {
//         size_t ki = (j / m) % n;
//         m *= n;
//     }
// }
```

but leave everything else the same, then it also executes without issue.

It is this simple double loop that is causing some strange memory leak. Incidentally, there is nothing special about the base n = 10: the loop was embedded in an algorithm for which I used several different values (n = 13, 9, 7, and 5). This is the key index arithmetic needed to compute parameters for a tensor-product approximation, so I can't really get rid of it. What I find most puzzling is that it executes fine on the CPU -- even with lots of other floating-point math going on inside the inner loop!