
Thread: Code speedup

  1. #1
    Junior Member
    Join Date
    Nov 2012

    Code speedup

Hi, I don't know if I landed on the right forum, but I would like to ask if there is a way to speed up my code some more. My code is written with JOCL, but since there isn't much of a community for that, I came here. The code works on a large array of pixels from a set of images: c[] contains (x images * 300 width * 300 height) values, i.e. a one-dimensional array holding all the pixels of the different images. The purpose of the code is to take the sum of the intensities PER IMAGE. This means that if c[] contains (100 * 300 * 300) values, the output should be 100 values (100 sums). This is my code:
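For reference, the per-image summation described above can be sketched sequentially in plain Java (the tiny 2x2 images and the method name are illustrative assumptions, standing in for the 300x300 case):

```java
public class PerImageSum {

    // Sum pixel intensities per image over a flat array laid out as
    // [image0 pixels..., image1 pixels..., ...].
    static long[] sumPerImage(int[] pixels, int pixelsPerImage) {
        int numImages = pixels.length / pixelsPerImage;
        long[] sums = new long[numImages];
        for (int i = 0; i < numImages; i++) {
            long s = 0;
            for (int j = i * pixelsPerImage; j < (i + 1) * pixelsPerImage; j++) {
                s += pixels[j];
            }
            sums[i] = s;
        }
        return sums;
    }

    public static void main(String[] args) {
        // Two tiny 2x2 "images" instead of 300x300, purely for illustration.
        int[] pixels = {1, 2, 3, 4, 10, 20, 30, 40};
        long[] sums = sumPerImage(pixels, 4);
        System.out.println(sums[0] + " " + sums[1]); // prints "10 100"
    }
}
```

This is the baseline the OpenCL version is competing against; for a memory-bound reduction like this, the sequential loop is already close to memory bandwidth on the CPU.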

    Code :
    package PAR;

    /*
     * JOCL - Java bindings for OpenCL
     * Copyright 2009 Marco Hutter - [url][/url]
     */

    import IMAGE_IO.ImageReader;
    import IMAGE_IO.Input_Folder;

    import static org.jocl.CL.*;
    import org.jocl.*;

    /**
     * A small JOCL sample.
     */
    public class IPPARA {

        /**
         * The source code of the OpenCL program to execute
         */
        private static String programSource =
                "__kernel void "
                + "sampleKernel(__global uint *a,"
                + "             __global uint *c)"
                + "{"
                + "    __private uint intensity_core = 0;"
                + "    uint i = get_global_id(0);"
                + "    for (uint j = i * 90000; j < (i + 1) * 90000; j++) {"
                + "        intensity_core += a[j];"
                + "    }"
                + "    c[i] = intensity_core;"
                + "}";

        /**
         * The entry point of this sample
         *
         * @param args Not used
         */
        public static void main(String args[]) {
            long numBytes[] = new long[1];

            ImageReader imagereader = new ImageReader();
            int srcArrayA[] = imagereader.readImages();
            int size[] = new int[1];
            size[0] = srcArrayA.length;

            long before = System.nanoTime();

            int dstArray[] = new int[size[0] / 90000];
            Pointer srcA =;
            Pointer dst =;

            // Obtain the platform IDs and initialize the context properties
            System.out.println("Obtaining platform...");
            cl_platform_id platforms[] = new cl_platform_id[1];
            clGetPlatformIDs(platforms.length, platforms, null);
            cl_context_properties contextProperties = new cl_context_properties();
            contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

            // Create an OpenCL context on a GPU device
            cl_context context = clCreateContextFromType(
                    contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);
            if (context == null) {
                // If no context for a GPU device could be created,
                // try to create one for a CPU device.
                context = clCreateContextFromType(
                        contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);
                if (context == null) {
                    System.out.println("Unable to create a context");
                    return;
                }
            }

            // Enable exceptions and subsequently omit error checks in this sample
            CL.setExceptionsEnabled(true);

            // Get the list of devices associated with the context
            clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);

            // Obtain the cl_device_id for the first device
            int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
            cl_device_id devices[] = new cl_device_id[numDevices];
            clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
          , null);

            // Create a command-queue
            cl_command_queue commandQueue =
                    clCreateCommandQueue(context, devices[0], 0, null);

            // Allocate the memory objects for the input- and output data
            cl_mem memObjects[] = new cl_mem[2];
            memObjects[0] = clCreateBuffer(context,
                    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                    Sizeof.cl_uint * srcArrayA.length, srcA, null);
            memObjects[1] = clCreateBuffer(context,
                    CL_MEM_WRITE_ONLY,
                    Sizeof.cl_uint * (srcArrayA.length / 90000), null, null);

            // Create the program from the source code
            cl_program program = clCreateProgramWithSource(context,
                    1, new String[]{programSource}, null, null);

            // Build the program
            clBuildProgram(program, 0, null, null, null, null);

            // Create the kernel
            cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

            // Set the arguments for the kernel
            clSetKernelArg(kernel, 0,
                    Sizeof.cl_mem,[0]));
            clSetKernelArg(kernel, 1,
                    Sizeof.cl_mem,[1]));

            // Set the work-item dimensions
            long local_work_size[] = new long[]{1};
            long global_work_size[] = new long[]{(srcArrayA.length / 90000) * local_work_size[0]};

            // Execute the kernel
            clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                    global_work_size, local_work_size, 0, null, null);

            // Read the output data
            clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
                    (srcArrayA.length / 90000) * Sizeof.cl_uint, dst, 0, null, null);

            // Release kernel, program, and memory objects
            clReleaseMemObject(memObjects[0]);
            clReleaseMemObject(memObjects[1]);
            clReleaseKernel(kernel);
            clReleaseProgram(program);
            clReleaseCommandQueue(commandQueue);
            clReleaseContext(context);

            long after = System.nanoTime();
            System.out.println("Time: " + (after - before) / 1e9);
        }
    }

    At the moment the sequential code and the code run by JOCL in parallel on the CPU take almost the same time, though the parallel version is a bit slower. Running it on the GPU is a lot slower.

    So my question is, is there a way to speed up this code some more ?

    My specs
    Graphics AMD Radeon HD 6490M 256 MB
    Processor Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz


  2. #2
    Senior Member
    Join Date
    Dec 2011

    Re: Code speedup

    What are the timing values? How are you measuring them?

    The OpenCL code has the additional overhead of writing the image buffer and reading back the result buffer which could make it slower than the straight CPU code.
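To see why the measurement itself matters: wrapping System.nanoTime() around the whole sequence (buffer creation, kernel, read-back), as the posted code does, times the transfers together with the compute. A minimal host-side timing sketch in plain Java (the work() method is a stand-in for the kernel; no OpenCL involved) shows the pattern of warming up first and timing only the phase of interest:

```java
public class TimingSketch {

    // Stand-in workload: sums an array, like the kernel's reduction.
    static long work(int[] data) {
        long s = 0;
        for (int v : data) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);

        // Warm-up runs so the JIT has compiled the hot loop before measuring.
        for (int i = 0; i < 5; i++) {
            work(data);
        }

        // Time only the phase of interest.
        long t0 = System.nanoTime();
        long sum = work(data);
        long t1 = System.nanoTime();
        System.out.println("sum=" + sum + " seconds=" + (t1 - t0) / 1e9);
    }
}
```

In the OpenCL case the analogous split is: time the clEnqueueWriteBuffer/read-back separately from the kernel (after a clFinish), so you can tell whether transfers or the kernel dominate.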

    Probably the memory access pattern is what is slowing down the GPU. The CPU code is accessing sequential memory elements, which is ideal for the fetch and pre-fetch hardware. The GPU code is (in parallel) accessing strided elements, which is not ideal for coalesced access. Memory access patterns have a huge effect on kernel execution times, and a poor pattern can be orders of magnitude slower than a good one.

    For maximum performance, you need to re-do the algorithm so memory access can be coalesced. I recommend changing to something that is a 2D algorithm and defines the work group size. The fast-moving dimension 0 would do whole columns of data and the parallel processors would be doing adjacent rows (thereby producing coalesced accesses). The slower-moving dimension 1 would be the image index.
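A CPU-side sketch of the access pattern being proposed (the tiny W and H, the method name, and the serial final reduction are illustrative assumptions; on the device the x loop would be parallel work-items in dimension 0 and the partial sums would be combined in local memory):

```java
public class CoalescedSketch {

    static final int W = 4, H = 4; // tiny stand-in for 300x300

    // Simulates the proposed 2D layout: dimension 1 picks the image,
    // dimension 0 runs W work-items side by side, each accumulating one
    // column. In every row, adjacent work-items touch adjacent addresses,
    // which is the coalesced pattern the GPU memory system wants.
    static long[] sumPerImageCoalesced(int[] pixels, int numImages) {
        long[] sums = new long[numImages];
        for (int img = 0; img < numImages; img++) {   // global id, dim 1
            long[] partial = new long[W];
            for (int y = 0; y < H; y++) {
                for (int x = 0; x < W; x++) {         // global id, dim 0
                    partial[x] += pixels[img * W * H + y * W + x];
                }
            }
            for (int x = 0; x < W; x++) {             // final reduction
                sums[img] += partial[x];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] pixels = new int[2 * W * H];
        java.util.Arrays.fill(pixels, 1);
        long[] sums = sumPerImageCoalesced(pixels, 2);
        System.out.println(sums[0] + " " + sums[1]); // prints "16 16"
    }
}
```

Contrast this with the posted kernel, where work-item i reads a[i*90000], work-item i+1 reads a[(i+1)*90000], and so on: neighbouring work-items are 90000 elements apart on every access, so each read becomes its own memory transaction.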

