Apologies if this is in the wrong place (first post).

I'm new to OpenCL and have been playing with the attached kernel. It is run several times and the timings averaged.
With no code in the area of interest, it runs in approximately 0.3356 seconds.
When some extra code (a few instructions that increase the number of arithmetic instructions) is inserted in the area of interest, the run time drops to ~0.1 seconds.
At first I thought it might be allowing coalesced writes to memory, but when I put a fence there the run time returned to the original value.
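
A minimal sketch of the kind of averaging described above, assuming each run yields a start and an end timestamp in nanoseconds (for example from clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, on a command queue created with profiling enabled). The helper name average_seconds and its parameter names are hypothetical and are not the actual host code used for the numbers in this post.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mean duration, in seconds, of n kernel runs.
 * start_ns[i] and end_ns[i] are assumed to be the start and end
 * timestamps of run i in nanoseconds. Returns 0.0 when n is 0. */
double average_seconds(const uint64_t *start_ns, const uint64_t *end_ns, size_t n)
{
    if (n == 0) {
        return 0.0;
    }

    double total_ns = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* Per-run duration in nanoseconds. */
        total_ns += (double)(end_ns[i] - start_ns[i]);
    }

    /* Convert the mean from nanoseconds to seconds. */
    return (total_ns / (double)n) * 1e-9;
}
```

Timing only the kernel event, rather than wall-clock time around the enqueue call, keeps program build and data transfer time out of the average.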

Other info:
OpenCL 1.2, ATI Radeon 5450
Any insight would be appreciated.
Code :
static const char* qLearning =
    "#define NumberOfStates 50000\n" \
    "#define NumberOfActions 100\n" \
    "typedef struct q\n" \
    "{\n" \
    "    float value[NumberOfStates];\n" \
    "} q;\n" \
    "typedef struct actionSelection\n" \
    "{\n" \
    "    int id[NumberOfActions];\n" \
    "    float reward[NumberOfActions];\n" \
    "} actionSelection;\n" \
    "__kernel void qLearning\n" \
    "  (const float alpha, const float gamma, __global q *q, __global actionSelection *transition)\n" \
    "{\n" \
    "    __local uint i;\n" \
    "    i = get_global_id(0);\n" \
    "    __local float current;\n" \
    "    __local float old;\n" \
    "    old = -2;\n" \
    "    if (i < NumberOfStates)\n" \
    "    {\n" \
    "        float bestReward = transition[i].reward[0];\n" \
    "        uint best = transition[i].id[0];\n" \
    "        __local int a;\n" \
***********Area of interest**********