Information concerning memory access

Hello,
I’ve a problem of wrong code and/or memory access. I’m kind of stuck optimizing a program for OpenCL.
My program is about Dijsktra algorithm: I’ve written 3 kernels so far, one for each step -> initialization, min search and update distances.
I’ve manage to never re-read the memory for the update part so that doesn’t loose time but I’m stuck with the find min part.
My kernel is as follow:

__kernel void find_min(__global int* distanceVector,
    __global int* usedSummitVector, const int size,
    __global int* min, __global int* lMin)
{
    int k = get_global_id(0);
    int gap = get_global_size(0);

    lMin[k] = 0;

    int startIndex = k * gap;
    int endIndex = startIndex + gap;

    if (endIndex > size) endIndex = size;

    int i = startIndex;

    for (; i < endIndex; i++)
    {
        if (distanceVector[lMin[k]] <= 0 && usedSummitVector[i] == 0
            || usedSummitVector[lMin[k]])
            lMin[k] = i;
        else
        {
            if (distanceVector[i] > 0
                && distanceVector[i] < distanceVector[lMin[k]]
                && usedSummitVector[i] == 0)
            {
                lMin[k] = i;
            }
        }
    }

    barrier(CLK_GLOBAL_MEM_FENCE);

    if (k == 0)
    {
        *min = 0;

        int i = 0;
        for (; i < gap; i++)
        {
            if (distanceVector[*min] <= 0 && usedSummitVector[lMin[i]] == 0
                || usedSummitVector[*min])
                *min = lMin[i];
            else
            {
                if (distanceVector[lMin[i]] > 0
                    && distanceVector[lMin[i]] < distanceVector[*min]
                    && usedSummitVector[lMin[i]] == 0)
                {
                    *min = lMin[i];
                }
            }
        }
    }

    barrier(CLK_GLOBAL_MEM_FENCE);
}

With this it’s working but each time I run the kernel I need to either call clEnqueueReadBuffer or clFinish and that’s taking like 10ms just for that and I don’t understand why.
If anybody could tell give me a hint on where to search for this problem that’s would be very helpfull.
Thank you in advance,
Benjamin.

It looks like you are trying to get work-item 0 to sum across the results of all work-items. The barrier you have in your code will only work across work-items in the same work-group, so unless you only have one work-group this code is invalid for OpenCL. (You’re only allowed to synchronize between work-items in a work-group, because there is no guarantee on the scheduler.) If you need to do a reduction (which is what this looks like) and you can’t fit into one work-group, you have to do it as multiple kernel executions.

I’m not sure why you need a clFinish, though.

Thank you for the reply,

I tought of that so I put a somme like that:

err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1, NULL,
                                     &global, &global, 0, NULL, NULL);

And if I remember well that still didn’t work. My point is I can’t let my program waste 10ms into memory sync in each loop because it growth with the scale so my sequential program is always better than my multi-core version. I really don’t understand why putting the command up there the barrier function didn’t worked at all.

Thank in advance for any help or hint.
Benjamin.

If you set your global and local sizes to be the same you will only execute one work-group on the device and will only use a small fraction (1/4 to 1/16th on a discrete GPU) of the processors. That will dramatically hurt your performance, besides which you’ll be limited to a maximum of 512 for your global size.

I’m afraid I’m not really following what you are saying. Why do you need to read the data back at all? (I’m assuming you mean a clEnqueueRead when you say “read back”.) You should be able to just enqueue the kernel executions and let each one run right after each other. Or perhaps I’m not understanding the algorithm correctly.

Thank for the info about the workgroup size so how can I choose manually a work group size or how can I sync work group in my function ?
My problem, is I think, linked directly with the sync of the work groups what I do now to have the good result is to do a CLFinish or a CLEnqueueReadBuffer on the min part of my alogorithm to pass it to the update kernel. But, now, this operation takes like 3-5ms sec and this time is growing in size with my data size. What I want to know is to insure that the data of the min is good so I can directly pass it to the update kernel without the result with the CLEnqueueReadBuffer function.

I hope this is a little clearer and thank you,
Benjamin.

Benjamin, I’m still not quite following you.

It seems like you are trying to find the minimum value across a large amount of data. This is effectively a reduction operation which requires synchronization between elements. If that’s the case, then you will need to either fit the whole reduction in one work-group (bad for performance) or split it across multiple kernel executions such that each work-group’s results are guaranteed to be done between them. That would look like this:

enqueueKernel(first_reduction_step) -> generates partial min results
enqueueKernel(second_reduction_step) -> combines partial min results
enqueueKernel(third_reduction_step) -> combines partial min results further
etc.

You do not need to finish or read back the data between them (as the kernel will finish before the next one starts on current GPUs or in-order-queues).

For more info on this, try doing a google search for “reduction synchronization opencl”.

Ok thank you so much for the hint I’m going to try it.
And sorry for not being real clear that’s hard to explain exactly what I mean whitout showing all my code.
So thank you again.

Hello again, I’ve tried to implement the kernel in a different way and run it in a different way too.
It’s really close to the result but for some reason I can see, some values of the partial min are overriden and I don’t know why.
Would you mind poiting out to me where is my mistake ?
Here is the kernel:


__kernel void find_min(__global int* distanceVector,
    __global int* usedSummitVector, const int size,
    __global int* oldMin, __global int* lMin, const int firstPass,
    const int lmax)
{
    int k = get_global_id(0);

    int i = k * 2;

    if (firstPass == 1) {
        lMin[k] = i;

        if (i + 1 < size)
        {
            if (distanceVector[lMin[k]] <= 0 || usedSummitVector[lMin[k]] != 0)
                lMin[k] = i + 1;
            else
            {
                if (distanceVector[i + 1] > 0 
                    && distanceVector[i + 1] < distanceVector[lMin[k]]
                    && usedSummitVector[i + 1] == 0)
                {
                    lMin[k] = i + 1;
                }
            }
        }
    }
    else
    {
        lMin[k] = oldMin[k];

        if (i + 1 < lmax)
        {
            if (distanceVector[lMin[k]] <= 0 || usedSummitVector[lMin[k]] != 0)
                lMin[k] = oldMin[k + 1];
            else
            {
                if (distanceVector[oldMin[k + 1]] > 0
                    && distanceVector[oldMin[k + 1]] < distanceVector[lMin[k]]
                    && usedSummitVector[oldMin[k + 1]] == 0)
                {
                    lMin[k] = oldMin[k + 1];
                }
            }
        }
    }
}

And here is how I use it:


int firstPass = 1;

        int lsize = size / 2;
        int lmax = lsize;
        if (size % 2 != 0)
            lsize++;

        while (lsize > 2)
        {
            size_t global = lsize;

            // Set the arguments to our compute kernel
            err = 0;
            err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                                 &this->distanceVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                                  &this->usedSummitVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                                  &size);
            err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                                  &this->pFindMinMem);
            err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                                  &this->findMinMem);
            err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                                  &firstPass);
            err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                                  &lmax);

            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to set kernel arguments
");

                exit(EXIT_FAILURE);
            }

            // Execute the kernel over the entire range of the data set
            err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                         NULL,
                                         &global, NULL, 0, NULL, NULL);

            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to execute the kernel
");

                exit(EXIT_FAILURE);
            }

            this->pFindMinMem = this->findMinMem;

            lmax = lsize;
            
            int oldSize = lsize;
            lsize = oldSize / 2;
            if (oldSize % 2 != 0)
                lsize++;

            if (lsize % 2 != 0)
                lmax--;

            if (firstPass == 1)
                firstPass = 0;
        }

        err = 0;
        err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                             &this->distanceVectorMem);
        err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                              &this->usedSummitVectorMem);
        err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                              &size);
        err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                              &this->pFindMinMem);
        err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                              &this->findMinMem);
        err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                              &firstPass);
        err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                              &lmax);

        size_t global = 1;
        err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                     NULL,
                                     &global, NULL, 0, NULL, NULL);

        if (err != CL_SUCCESS)
        {
            printf("Error: Failed to execute the kernel
");

            exit(EXIT_FAILURE);
        }

        clFinish(this->queue);

I’m sure there is a problem somewhere something I don’t see because I still need to use a clFinish for the result to work. I use the result of the kernel directly like this:


err |= clSetKernelArg(this->updateDistancesKernel, 3, sizeof(cl_mem),
                              &this->findMinMem);

In the next kernel.
I’ve searched over internet and found out an exemple of reduction but I’ve not managed to implement it the way I need it for my min. Because my min kernel does not work on values but on indices of values.

Thank you for the help in advance,
Benjamin.

It sounds like you’ve got a problem with a data race between work-items. I don’t fully understand your code, so I would suggest drawing out a picture and making sure there is no way two work-items can be writing to the same location, or one reading the results another has written.

Hello again,

I managed to modify my code so that now I’m sure that anything override anything else. But still I have problem with the memory object and I really need an explaination here. I think I’m completly mistaken in how to make a good use of the cl_mem objects.

Here is the code of the creation of my mem objects:


this->distanceVectorMem = clCreateBuffer(context,
            CL_MEM_READ_WRITE, sizeof(int) * summitCount,
            NULL, NULL);

        if (!this->distanceVectorMem)
        {
            cout << "Error: Failed to create distanceVectorMem" << endl;

            exit(EXIT_FAILURE);
        }

        this->usedSummitVectorMem = clCreateBuffer(context,
            CL_MEM_READ_WRITE, sizeof(int) * summitCount,
            NULL, NULL);

        if (!this->usedSummitVectorMem)
        {
            cout << "Error: Failed to create usedSummitVectorMem" << endl;

            exit(EXIT_FAILURE);
        }

Now I run it this way:


// Set the arguments to our compute kernel
        err = 0;
        err = clSetKernelArg(this->initKernel, 0, sizeof(int), &summitIndex);
        err |= clSetKernelArg(this->initKernel, 1, sizeof(cl_mem),
                              &this->distanceVectorMem);
        err |= clSetKernelArg(this->initKernel, 2, sizeof(cl_mem),
                              &this->usedSummitVectorMem);
        err |= clSetKernelArg(this->initKernel, 3, sizeof(int),
                              &size);
        err |= clSetKernelArg(this->initKernel, 4, sizeof(int),
                              &startIndex);
        err |= clSetKernelArg(this->initKernel, 5, sizeof(int),
                              &count);
        err |= clSetKernelArg(this->initKernel, 6, sizeof(cl_mem),
                              &this->adjacencyListMem);
        err |= clSetKernelArg(this->initKernel, 7, sizeof(cl_mem),
                              &this->adjacencyListDistancesMem);

        if (err != CL_SUCCESS)
        {
            printf("Error: Failed to set kernel arguments
");

            exit(EXIT_FAILURE);
        }

        size_t global = size;
        if (size % 2 != 0) global++;

        // Execute the kernel over the entire range of the data set
        err = clEnqueueNDRangeKernel(this->queue, this->initKernel, 1, NULL,
                                     &global, NULL, 0, NULL, NULL);

        if (err != CL_SUCCESS)
        {
            printf("Error: Failed to execute the kernel
");

            exit(EXIT_FAILURE);
        }

With this kernel code:


__kernel void init_int(const int summitIndex, __global int* distanceVector,
    __global int* usedSummitVector,
    const int size, const int startIndex, const int count,
    __global const int* adjacencyList,
    __global const int* adjacencyListDistances)
{
    int i = get_global_id(0);

    printf("entering in init (%d) 
", i);

    if (i < size)
    {
        distanceVector[i] = -1;
        usedSummitVector[i] = 0;
    }

    distanceVector[summitIndex] = 0;
    usedSummitVector[summitIndex] = 1;

    if (i == 0)
    {
        for (int k = startIndex + 1; k < startIndex + 1 + count; k++)
        {
            distanceVector[adjacencyList[k] - 1] = adjacencyListDistances[k];
            printf("distanceVector[%d] = %d
", adjacencyList[k] - 1, distanceVector[adjacencyList[k] - 1]);
        }
    }

    printf("exiting in init (%d) 
", i);
}

The trace of printf result is actually good something like that :
entering in init (0)
distanceVector[1] = 10
distanceVector[3] = 4
distanceVector[5] = 10
exiting in init (0)

After that I directly use distanceVector and usedSummitVector in my find min function:


err = 0;
            err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                                 &this->distanceVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                                  &this->usedSummitVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                                  &size);
            err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                                  &this->pFindMinMem);
            err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                                  &this->findMinMem);
            err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                                  &firstPass);
            err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                                  &lmax);

            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to set kernel arguments
");

                exit(EXIT_FAILURE);
            }

            // Execute the kernel over the entire range of the data set
            err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                         NULL,
                                         &global, NULL, 0, NULL, NULL);

            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to execute the kernel
");

                exit(EXIT_FAILURE);
            }

without doing anything between the kernels executions. Here is the find min kernel:


__kernel void find_min(__global int* distanceVector,
    __global int* usedSummitVector, const int size,
    __global int* oldMin, __global int* lMin, const int firstPass,
    const int lmax)
{
    int k = get_global_id(0);

    printf("entering in find min (%d) 
", k);

    int i = k * 2;

    if (firstPass == 1) {
        int index = i;

        printf("select %d as index in %d with size %d
", i, k, size);

        if (index + 1 < size)
        {
            printf("select %d as index + 1 in %d with size %d
", i + 1, k, size);

            printf("distanceVector[%d] = %d
", index, distanceVector[index]);
            printf("usedSummitVector[%d] = %d
", index + 1, usedSummitVector[index + 1]);
            printf("usedSummitVector[%d] = %d
", index, usedSummitVector[index]);

            if (distanceVector[index] <= 0 && usedSummitVector[index + 1] == 0
                || usedSummitVector[index] != 0)
            {
                index = i + 1;
            }
            else
            {
                printf("distanceVector[%d] = %d
", index + 1, distanceVector[index + 1]);
                printf("distanceVector[%d] = %d
", index, distanceVector[index]);
                printf("usedSummitVector[%d] = %d
", index + 1, usedSummitVector[index + 1]);

                if (distanceVector[index + 1] > 0
                    && distanceVector[index + 1] < distanceVector[index]
                    && usedSummitVector[index + 1] == 0)
                {
                    index = i + 1;
                }
            }
        }

        lMin[k] = index;
    }
    else
    {

        int index = oldMin[i];

        printf("distanceVector[%d] = %d
", index, distanceVector[index]);
        printf("usedSummitVector[%d] = %d
", index, usedSummitVector[index]);

        if (i + 1 < lmax)
        {
            int index2 = oldMin[i + 1];

            printf("distanceVector[%d] = %d
", index2, distanceVector[index2]);
            printf("usedSummitVector[%d] = %d
", index2, usedSummitVector[index2]);

            if (distanceVector[index] <= 0
                && usedSummitVector[index2] == 0
                || usedSummitVector[index] != 0)
                index = index2;
            else
            {
                if (distanceVector[index2] > 0
                    && distanceVector[index2] < distanceVector[index]
                    && usedSummitVector[index2] == 0)
                {
                    index = index2;
                }
            }
        }

        lMin[k] = index;
    }

    printf("found %d as min in %d first pass ? %d
", lMin[k], k, firstPass);

    printf("exiting find min (%d) 
", k);
}

And the printf trace tells me here:
distanceVector[1] = -1

Why on the earth the memory isn’t committed between the two kernels execution ? What am I missing to do or what am I doing wrong here ?

Thank you so much in advance for the answer,
Benjamin.

I think work-item 1 is executing and setting distanceVector[i==1] = -1;
Try putting a clFinish() in after your first kernel execution.
If that fixes it, then that suggests that whatever driver you are using (are you using ATI’s CPU driver?) is running the kernels at the same time. You can either request an in-order queue, or use events to make them dependent on each other to avoid this.

BTW – don’t check if your cl_mem you get back from clCreate is null to see if it failed. You should check the error return.

Thank you very much I’m gonna try your advice.

Hi,

I finally managed to get everything working but I’m facing a strange problem of poor performances that seems to happen randomly.
When I scale my problem to a size like 100000 vertices sometimes my kernels takes barely 1000 ms to finish its execution and I don’t know why. I don’t use any ClFinish or read operation just enqueuing kernels and getting the final result at the end.
What do you think causes this ? Now my code is as follow for the kernel:


__kernel void find_min(__global int* distanceVector,
    __global int* usedSummitVector, const int size,
    __global int* oldMin, __global int* lMin, const int firstPass,
    const int lmax)
{
    int k = get_global_id(0);

    int i = k * 2;

    if (firstPass == 1) {
        int index = i;

        if (index + 1 < size)
        {
            if (distanceVector[index] <= 0 && usedSummitVector[index + 1] == 0
                || usedSummitVector[index] != 0)
            {
                index = i + 1;
            }
            else
            {
                if (distanceVector[index + 1] > 0
                    && distanceVector[index + 1] < distanceVector[index]
                    && usedSummitVector[index + 1] == 0)
                {
                    index = i + 1;
                }
            }
        }

        lMin[k] = index;
    }
    else
    {

        int index = oldMin[i];

        if (i + 1 < lmax)
        {
            int index2 = oldMin[i + 1];

            if (distanceVector[index] <= 0
                && usedSummitVector[index2] == 0
                || usedSummitVector[index] != 0)
                index = index2;
            else
            {
                if (distanceVector[index2] > 0
                    && distanceVector[index2] < distanceVector[index]
                    && usedSummitVector[index2] == 0)
                {
                    index = index2;
                }
            }
        }

        lMin[k] = index;
    }
}

and here is the use of the kernel:


int firstPass = 1;

        int lsize = size / 2;
        int lmax = lsize;
        if (size % 2 != 0)
            lsize++;

        while (lsize > 2)
        {
            size_t global = lsize;

            // Set the arguments to our compute kernel
            err = 0;
            err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                                 &this->distanceVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                                  &this->usedSummitVectorMem);
            err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                                  &size);
            err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                                  &this->pFindMinMem);
            err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                                  &this->findMinMem);
            err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                                  &firstPass);
            err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                                  &lmax);

            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to set kernel arguments
");

                exit(EXIT_FAILURE);
            }

            // Execute the kernel over the entire range of the data set
            err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                         NULL,
                                         &global, NULL, 0, NULL, NULL);

            if (err != CL_SUCCESS)
            {
                printf("Error: Failed to execute the kernel
");

                exit(EXIT_FAILURE);
            }

            this->pFindMinMem = this->findMinMem;
            
            lmax = lsize;
            
            int oldSize = lsize;
            lsize = oldSize / 2;
            if (oldSize % 2 != 0)
                lsize++;

            if (firstPass == 1)
                firstPass = 0;
        }

        err = 0;
        err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                             &this->distanceVectorMem);
        err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                              &this->usedSummitVectorMem);
        err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                              &size);
        err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                              &this->pFindMinMem);
        err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                              &this->findMinMem);
        err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                              &firstPass);
        err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                              &lmax);

        size_t global = lsize;
        err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                     NULL,
                                     &global, NULL, 0, NULL, NULL);

        if (err != CL_SUCCESS)
        {
            printf("Error: Failed to execute the kernel
");

            exit(EXIT_FAILURE);
        }

        this->pFindMinMem = this->findMinMem;

        err = 0;
        err = clSetKernelArg(this->findMinKernel, 0, sizeof(cl_mem),
                             &this->distanceVectorMem);
        err |= clSetKernelArg(this->findMinKernel, 1, sizeof(cl_mem),
                              &this->usedSummitVectorMem);
        err |= clSetKernelArg(this->findMinKernel, 2, sizeof(int),
                              &size);
        err |= clSetKernelArg(this->findMinKernel, 3, sizeof(cl_mem),
                              &this->pFindMinMem);
        err |= clSetKernelArg(this->findMinKernel, 4, sizeof(cl_mem),
                              &this->findMinMem);
        err |= clSetKernelArg(this->findMinKernel, 5, sizeof(int),
                              &firstPass);
        err |= clSetKernelArg(this->findMinKernel, 6, sizeof(int),
                              &lmax);

        global = 1;
        err = clEnqueueNDRangeKernel(this->queue, this->findMinKernel, 1,
                                     NULL,
                                     &global, NULL, 0, NULL, NULL);

        if (err != CL_SUCCESS)
        {
            printf("Error: Failed to execute the kernel
");

            exit(EXIT_FAILURE);
        }

Do I need to use local cache to speed up things or is it linked to some kind of synchronization in the queue ?
I’m running opencl on mac book pro (9600MGT). And it seems that when I use CPU instead of GPU this problem did not appear (but if it’s a synchronization problem I think it’s normal no ?).

Thank in advance for the help.

I want to add that I tested it on CPU device and everything was running really quick despite the last step that took a large amount of time (like 4 times the time needed to run it on a single processor).
Where does it come from ?

The trace is something like that:
UpdateDistance time : 0
Find time : 772
UpdateDistance time : 0
Find time : 1
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 1
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 1
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 1
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 1
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 1
Find time : 0
UpdateDistance time : 0
Find time : 0
UpdateDistance time : 0
Find time : 810
UpdateDistance time : 0

As you can see everything runs in no time except some runs. Those are taking so much time that they are killing performances of the rest of the app.