Question about Kernel Argument Speed

Would it be faster to declare variables in my kernel and copy the passed arguments into them if I’m going to be using them a lot?

Or is using the arguments directly just as fast?

For example:

If I pass a float as an argument and it is located in memory, would using it directly be just as fast as declaring a local variable in the kernel and assigning the argument’s value to it?

I guess the real question is: do stream processors have local cache memory or register memory, and if they do, do kernels use it?

My 8800GTS is supposed to get 200+ gigaflops and I’m getting about 1.6, lol. I know I won’t get anywhere near 200, since my algorithm does much more than just floating-point operations, but 1.6 compared to 200 suggests my kernels could be sped up quite a bit.

The first step of performance tuning in any language is measuring where time is being spent.

You mention you are using an NVIDIA platform. Why not give Visual Profiler a look? (The page seems to be down, maybe due to AWS’ downtime.)
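If the profiler isn’t reachable, a rough number can also be had in code. Here is a minimal sketch in C, assuming you time the kernel with OpenCL event profiling: a command queue created with CL_QUEUE_PROFILING_ENABLE, then clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, which report device timestamps in nanoseconds. The helper names and the hand-counted number of floating-point operations are my own assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: elapsed milliseconds between two device timestamps
   given in nanoseconds (the unit OpenCL event profiling reports). */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}

/* Hypothetical helper: achieved GFLOP/s for a kernel launch.
   flop_count is an assumption: you have to count by hand how many
   floating-point operations your kernel performs in total. */
double achieved_gflops(double flop_count, uint64_t start_ns, uint64_t end_ns)
{
    double seconds = (double)(end_ns - start_ns) / 1e9;
    if (seconds <= 0.0) {
        return 0.0; /* no usable measurement */
    }
    return flop_count / seconds / 1e9;
}
```

Timing each kernel separately this way should at least show which one dominates before you start changing how arguments are used.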

It’s also a good idea to read some general guides on how to write OpenCL code, such as NVIDIA’s OpenCL programming guide or AMD’s OpenCL programming guide.

As for the other questions, I can’t quite make sense of them. Try rephrasing them in terms of “kernel arguments”, “kernel scope variables”, “global memory”, “local memory”, etc.
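To pin those terms down, here is a minimal sketch of where each one shows up. The kernel and its name "scale" are hypothetical, not taken from your code, and the source is kept in a C string the way host code usually hands it to clCreateProgramWithSource.

```c
/* OpenCL C source for a hypothetical kernel named "scale". */
const char *const scale_kernel_src =
    "__kernel void scale(__global const float *in,  // kernel argument: pointer into global memory\n"
    "                    __global float *out,       // kernel argument: pointer into global memory\n"
    "                    const float factor,        // kernel argument: scalar, passed by value\n"
    "                    __local float *tile)       // kernel argument: pointer into local memory\n"
    "{\n"
    "    const size_t gid = get_global_id(0);       // kernel-scope variable (private)\n"
    "    const size_t lid = get_local_id(0);        // kernel-scope variable (private)\n"
    "    const float x = in[gid];                   // one global read, kept in a private variable\n"
    "    tile[lid] = x * factor;                    // write to local memory shared by the work-group\n"
    "    out[gid] = tile[lid];                      // each work-item reads back only its own slot\n"
    "}\n";
```

In OpenCL C, non-pointer kernel arguments such as factor and unqualified variables declared inside the kernel are both in the private address space, so copying a scalar argument into another variable should not be needed. Whether a private value actually ends up in a register is the compiler’s decision. The tile argument is only there to show the __local qualifier; on the host its size would be set with clSetKernelArg, passing a byte count and a NULL pointer.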