Why does my program run significantly faster on my CPU device than on my GPU device?

Hey all, I'm new to the forums and also pretty new to OpenCL. I learned a little bit in college, and by "a little bit" I mean my graphics professor covered GPGPU and OpenCL for a single day (the rest of the class focused on shaders, OpenGL, and so on).

Anyway, I took an example program and changed it to run the computations I want. However, my program runs significantly faster on the CPU than on my GPU, and I am trying to understand why.

About my program:

My program takes in one input float array and fills two output arrays. The single-threaded version takes three arguments: <samplesPerTrace> <tracesIn> <tracesOut>. The input array's size is samplesPerTrace * tracesIn * sizeof(float), and each output array's size is samplesPerTrace * tracesOut * sizeof(float). The OpenCL version here takes 5 arguments: <samplesPerTrace> <tracesIn> <tracesOut> <#oflocalthreads> <whichdevicetypetouse>. If whichdevicetypetouse == 0, then CL_DEVICE_TYPE_ALL is used; 1 = CL_DEVICE_TYPE_GPU, 2 = CL_DEVICE_TYPE_CPU. With 0, the program picks the fastest device it can find, which it thinks is always the GPU, even though that clearly isn't true right now.

My test cases have been using the parameters 25000 2500 250, because that is on average the size of the arrays that I will be using (perhaps a little above average).

Here is the kernel source code that OpenCL builds and runs (shown here without the C string quoting; in the full program below it lives in a string constant):

__kernel void sumAllCL(__global const float prestackTraces[],
  __global float stackTracesOut[],
  __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,
  const unsigned int samplesPerTrace) {

  unsigned int k = get_global_id(0); // Thread ID

  unsigned int kTimesIn = k * nTracesIn; // Store repeat ints
  unsigned int kTimesSamples = k * samplesPerTrace;

  for (int j = 0; j < ? ; j++) {       // "?" is replaced with a fixed number at run time
    int jTimesSamplesPT = j * samplesPerTrace;
    for (int i = 0; i < # ; i++) {     // "#" is replaced with a fixed number at run time
      int valueIndex = i + jTimesSamplesPT;
      float value = prestackTraces[valueIndex];
      stackTracesOut[i + kTimesSamples] += value;
      powerTracesOut[i + kTimesSamples] += (value * value);
    }
  }
}

Note that the loop bounds (the ? and # placeholders) are replaced at run time with fixed numbers; I do this because I thought it would help the compiler unroll the loops and thus increase performance.
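A common alternative to splicing digits into the kernel string is to pass the bounds as compile-time defines when building the program. This is only a sketch with made-up macro names (N_TRACES_IN and SAMPLES_PER_TRACE are not identifiers from my program), and it assumes the kernel is rewritten to use them:

// Sketch: pass the loop bounds as build options instead of editing the source string.
char buildOptions[128];
snprintf(buildOptions, sizeof(buildOptions),
         "-D N_TRACES_IN=%u -D SAMPLES_PER_TRACE=%u", nTracesIn, samplesPerTrace);
cl_int buildCode = clBuildProgram(program, 0, NULL, buildOptions, NULL, NULL);
// The kernel can then write: for (int j = 0; j < N_TRACES_IN; j++) { ... }

The OpenCL compiler still sees constant loop bounds, so any unrolling it could do with the spliced-in digits it should also be able to do here.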

With the above parameters (25000 2500 250 ~10 <1 or 2>), it takes my CPU about 0.6 seconds to complete the program and my GPU about 40 seconds. That's a big difference. FYI, I have been messing around with the 4th parameter to see which value runs fastest, which is what the ~10 means.

My Devices:

My graphics card is an MSI Radeon R9 390X 8GB, which OpenCL reports as "Hawaii". When I have OpenCL print out information about both of my devices, this is what I get:

[VAR]
OpenCL Platform 0: AMD Accelerated Parallel Processing
----- OpenCL Device # 0: Hawaii-----
Gflops: 47.520000
Max Clock Frequency: 1080
Max Compute Units: 44
Max Work Group Size: 256
MEMORY…
Total Memory of Device: 8.000G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.999G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 16.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS…
Device Vendor: Advanced Micro Devices, Inc.
Device Version: OpenCL 2.0 AMD-APP (2117.13)
Driver Version: 2117.13 (VM)
Device OpenCL Version: OpenCL C 2.0
----- OpenCL Device # 1: Intel® Core™ i7-6700K CPU @ 4.00GHz-----
Gflops: 32.064000
Max Clock Frequency: 4008
Max Compute Units: 8
Max Work Group Size: 1024
MEMORY…
Total Memory of Device: 15.967G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.1028G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 32.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS…
Device Vendor: GenuineIntel
Device Version: OpenCL 1.2 AMD-APP (2117.13)
Driver Version: 2117.13 (sse2,avx)
Device OpenCL Version: OpenCL C 1.2[/VAR]

Source Code:

Here is the source code of my entire program. I'm not trying to say "hey, can you look through my entire program to see what is wrong"; rather, I just wanted to provide a minimal, complete, (hopefully) verifiable example that you can compile and run if useful. It's a standalone file; I couldn't really cut much out of it because, since I'm just getting started, everything is relevant.



#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <CL/opencl.h>
#include <unistd.h>
#include <string.h>
#include <sstream>
#include <math.h>
#include <stdarg.h>

const bool _VERBOSE = true;

// Variables used in OpenCL
static cl_device_id _deviceID;
static cl_context _context;
static cl_command_queue _queue;

const int MAX_LOOP_SIZE = 99999999; // Based on how many digits a number inserted into the program can have
// Note: the '?' and '#' placeholders in M_AND_S_OPENCL_SOURCE_TEXT are located at run time and
// overwritten with the actual loop bounds. Set _FIND_INSERT_LOCATIONS to true to print their
// positions and the modified source.
const bool _FIND_INSERT_LOCATIONS = false;
const char* M_AND_S_OPENCL_SOURCE_TEXT =
  "__kernel void sumAllCL(__global const float prestackTraces[],\n"
  "  __global float stackTracesOut[],\n"
  "  __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,\n"
  "  const unsigned int samplesPerTrace) {\n"
  "\n"
  "  unsigned int k = get_global_id(0); // Thread ID\n"
  "\n"
  "  unsigned int kTimesIn = k * nTracesIn; // Store repeat ints\n"
  "  unsigned int kTimesSamples = k * samplesPerTrace;\n"
  "\n"
  "  for (int j = 0; j < ?       ; j++) { // placeholder, overwritten with nTracesIn at run time\n"
  "\n"
  "    int jTimesSamplesPT = j * samplesPerTrace;\n"
  "\n"
  "    for (int i = 0; i < #       ; i++) { // placeholder, overwritten with samplesPerTrace at run time\n"
  "\n"
  "      int valueIndex = i + jTimesSamplesPT;\n"
  "      float value = prestackTraces[valueIndex];\n"
  "\n"
  "      stackTracesOut[i + kTimesSamples] += value;\n"
  "      powerTracesOut[i + kTimesSamples] += (value * value);\n"
  "    }\n"
  "  }\n"
  "}\n";

void sumAllCLPerThread(const float prestackTraces[],
   const unsigned int endIndex, float stackTracesOut[],
   float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,
   const unsigned int samplesPerTrace, const unsigned int thisThread) {

  int k = thisThread; // Outer loop = thread ID

  int k_timesSPT = k * samplesPerTrace;

  // printf("j goes to %d, i goes to %d\n", nTracesIn, endIndex);

  for (unsigned int j = 0; j < nTracesIn; j++) {

    int jTimesSamplesPT = j * samplesPerTrace;

    for (unsigned int i = 0; i < endIndex; i++) {

      int valueIndex = i + jTimesSamplesPT;
      float value = prestackTraces[valueIndex];

      stackTracesOut[i + k_timesSPT] += value;
      powerTracesOut[i + k_timesSPT] += (value * value);
    }
  }
}

// Helper function for the byte converter
const char* byteConverterHelper(long num, const char* suffix, int power) {

  double div1024 = pow(1024.0, (double)power);
  unsigned long insertNum = (unsigned long)(((double)num) / ((double)div1024));

  std::stringstream strs;
  strs << insertNum;
  if (power > 0) {

    strs << ".";
    unsigned long remainder = num % ((unsigned long)div1024); // Get the remainder in base 1024
    remainder = (unsigned long)((double)remainder * (1000.0 / 1024.0)); // Convert base 1024 to base 1000
    remainder /= (unsigned long)pow(1000.0, (double)(power - 1)); // Get the first three digits

    if (remainder < 10L) strs << "00";
    else if (remainder < 100L) strs << "0";
    strs << remainder;
  }
  strs << suffix;

  // Duplicate the string before returning: returning strs.str().c_str() directly would hand back
  // a pointer into a temporary that is destroyed when this function returns.
  return strdup(strs.str().c_str());
}

/*
 * Convert a number into bytes in the form ###.### with the correct suffix (such as G for gigabytes).
 */
char* byteConverter(unsigned long num) {

  //int lengthOfString;
  char* string = new char[30];

  if (num < (1024L)) { // Bytes

    const char* tempS = byteConverterHelper(num, " bytes", 0);
    int tempLength = strlen(tempS);
    int x = 0;
    for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
    string[x] = '\0';

  } else if (num < (1024L * 1024L)) { // Kilobytes

    const char* tempS = byteConverterHelper(num, "K", 1);
    int tempLength = strlen(tempS);
    int x = 0;
    for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
    string[x] = '\0';

  } else if (num < (1024L * 1024L * 1024L)) { // Megabytes

    const char* tempS = byteConverterHelper(num, "M", 2);
    int tempLength = strlen(tempS);
    int x = 0;
    for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
    string[x] = '\0';

  } else if (num < (1024L * 1024L * 1024L * 1024L)) { // Giga bytes

    const char* tempS = byteConverterHelper(num, "G", 3);
    int tempLength = strlen(tempS);
    int x = 0;
    for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
    string[x] = '\0';

  } else { // Terabytes

    const char* tempS = byteConverterHelper(num, "T", 4);
    int tempLength = strlen(tempS);
    int x = 0;
    for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
    string[x] = '\0';

  }

  return string;
}

/**
 *  Used to keep track of how long it takes to execute this.
 */
// static double t0 = 0.0;
double GetTime() {

  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec + (1e-6 * tv.tv_usec);
}

/*
 *  Print message to stderr and exit
 */
void Fatal(const char* format, ...) {

  va_list args;
  va_start(args, format);
  vfprintf(stderr, format, args);
  va_end(args);
  exit(1);
}

/*
 * For when setting a kernel parameter fails.
 */
void FatalSetArgs(const char* type, cl_int errorCode) {

  const char* reason;
  if (errorCode == CL_INVALID_KERNEL) reason = "CL_INVALID_KERNEL";
  else if (errorCode == CL_INVALID_ARG_INDEX) reason = "CL_INVALID_ARG_INDEX";
  else if (errorCode == CL_INVALID_ARG_VALUE) reason = "CL_INVALID_ARG_VALUE";
  else if (errorCode == CL_INVALID_MEM_OBJECT) reason = "CL_INVALID_MEM_OBJECT";
  else if (errorCode == CL_INVALID_SAMPLER) reason = "CL_INVALID_SAMPLER";
  else if (errorCode == CL_INVALID_ARG_SIZE) reason = "CL_INVALID_ARG_SIZE";
  else reason = "Unknown Error";

  Fatal("Cannot set kernel parameter %s: %s\n", type, reason);
}

/*
 *  Fatal during buffer creation.
 */
void FatalBufferCreation(const char* type, cl_int errorCode) {

  if (errorCode == CL_INVALID_CONTEXT) Fatal("Error when creating buffer for %s:\n"
    "Context Not Valid.\n", type);
  if (errorCode == CL_INVALID_VALUE) Fatal("Error when creating buffer for %s:\n"
    "Flags Not Valid.\n", type);
  if (errorCode == CL_INVALID_BUFFER_SIZE) Fatal("Error when creating buffer for %s:\n"
    "Invalid Buffer Size: size is 0 or is greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE.\n", type);
  if (errorCode == CL_INVALID_HOST_PTR) Fatal("Error when creating buffer for %s:\n"
    "host_ptr is NULL and CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR are set "
    "in flags, or host_ptr is not NULL but CL_MEM_COPY_HOST_PTR or "
    "CL_MEM_USE_HOST_PTR are not set in flags.\n", type);
  if (errorCode == CL_MEM_OBJECT_ALLOCATION_FAILURE) Fatal("Error when creating buffer for %s:\n"
    "Failure to allocate memory for buffer object.\n", type);
  if (errorCode == CL_OUT_OF_HOST_MEMORY) Fatal("Error when creating buffer for %s:\n"
    "Failure to allocate resources required by the OpenCL implementation on "
    "the host.\n", type);
}

/*
 *  OpenCL notify callback (echo to stderr)
 */
void Notify(const char* errinfo, const void* private_info, size_t cb, void* user_data) {

  fprintf(stderr, "%s\n", errinfo);
}

/*
 * Prints the given int (numToInsert) at location inside chars.
 */
void PrintIntInStr(char* chars, int location, int numToInsert) {

  std::stringstream strs;
  strs << numToInsert;
  std::string temp_str = strs.str();
  char const* numToChars = temp_str.c_str();

  int numberLength = strlen(numToChars);

  int w;
  for (w = 0; w < numberLength; w++) {
    chars[location + w] = numToChars[w];
  }
}

/*
 * Initialize fastest OpenCL device.
 */
int InitOpenCL(int verbose, cl_int deviceType) {

  cl_uint Nplat;
  cl_int  err;
  char name[1024];
  int  MaxGflops = -1;

  cl_platform_id winnerPlatform = 0;

  // Reset (TODO)
  _deviceID = NULL;
  _context = NULL;
  _queue = NULL;

  // Get platforms
  cl_platform_id platforms[4];
  if (clGetPlatformIDs(4, platforms, &Nplat)) Fatal("Cannot get number of OpenCL platforms\n");
  else if (Nplat < 1) Fatal("No OpenCL platforms found\n");

  // Loop over platforms
  for (unsigned int platform = 0; platform < Nplat; platform++) {

    if (clGetPlatformInfo(platforms[platform], CL_PLATFORM_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL platform name\n");
    if (verbose) printf("OpenCL Platform %d: %s\n", platform, name);

    // Get device IDs of the requested type
    cl_uint Ndev;
    cl_device_id id[4];
    if (clGetDeviceIDs(platforms[platform], deviceType, 4, id, &Ndev))
      Fatal("Cannot get number of OpenCL devices: %d\n", platform);
    else if (Ndev < 1) Fatal("No OpenCL devices found.\n");

    // Find the fastest device
    for (unsigned int devId = 0; devId < Ndev; devId++) {

      // Print information about the device
      cl_uint compUnits, freq, cacheLineSize;
      cl_ulong memSize, maxAlloc, localMemSize, globalCacheSize;
      size_t maxWorkGrps;
      char deviceVendor[50];
      char deviceVersion[50];
      char driverVersion[50];
      char deviceOpenCLVersion[50];

      // Computing Power...
      if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compUnits), &compUnits, NULL)) Fatal("Cannot get OpenCL device units\n");
      if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, NULL)) Fatal("Cannot get OpenCL device frequency\n");
      if (clGetDeviceInfo(id[devId], CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
      if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGrps), &maxWorkGrps, NULL)) Fatal("Cannot get OpenCL max work group size\n");
      // Memory...
      if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(memSize), &memSize, NULL)) Fatal("Cannot get OpenCL memory size.\n");
      if (clGetDeviceInfo(id[devId], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL)) localMemSize = 0;
      if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, NULL)) Fatal("Cannot get OpenCL max allocation size.\n");
      if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(globalCacheSize), &globalCacheSize, NULL)) globalCacheSize = 0;
      if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, sizeof(cacheLineSize), &cacheLineSize, NULL)) cacheLineSize = 0;
      // Versions...
      clGetDeviceInfo(id[devId], CL_DEVICE_VENDOR, sizeof(deviceVendor), deviceVendor, NULL);
      clGetDeviceInfo(id[devId], CL_DEVICE_VERSION, sizeof(deviceVersion), deviceVersion, NULL);
      clGetDeviceInfo(id[devId], CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL);
      clGetDeviceInfo(id[devId], CL_DEVICE_OPENCL_C_VERSION, sizeof(deviceOpenCLVersion), deviceOpenCLVersion, NULL);

      int Gflops = compUnits * freq;

      if (verbose) printf(" ----- OpenCL Device # %d: %s-----\n"
        "Gflops: %f\n"
        "Max Clock Frequency: %d\n"
        "Max Compute Units: %d\n"
        "Max Work Group Size: %zu\n"
        "   MEMORY...\n"
        "Total Memory of Device: %s   (CL_DEVICE_GLOBAL_MEM_SIZE)\n"
        "Local Memory of Device: %s   (CL_DEVICE_LOCAL_MEM_SIZE)\n"
        "Max Memory Object Allocation: %s   (CL_DEVICE_MAX_MEM_ALLOC_SIZE)\n"
        "Cache Size: %s   (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)\n"
        "Cacheline Size: %s   (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)\n"
        "   VERSIONS...\n"
        "Device Vendor: %s\n"
        "Device Version: %s\n"
        "Driver Version: %s\n"
        "Device OpenCL Version: %s\n",
        devId,
        name,
        (1e-3 * Gflops),
        freq,
        compUnits,
        maxWorkGrps,
        byteConverter((unsigned long)memSize),
        byteConverter((unsigned long)localMemSize),
        byteConverter((unsigned long)maxAlloc),
        byteConverter((unsigned long)globalCacheSize),
        byteConverter((unsigned long)cacheLineSize),
        deviceVendor,
        deviceVersion,
        driverVersion,
        deviceOpenCLVersion);

      if(Gflops > MaxGflops)
      {
        _deviceID = id[devId];
        MaxGflops = Gflops;

        winnerPlatform = platforms[platform];
      }
    }
  }

  // Print fastest device info (TODO: don't get name twice)
  if (clGetDeviceInfo(_deviceID, CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
  printf("\nSelected Fastest OpenCL Device: %s (#%lu)\n", name, (unsigned long)_deviceID);

  // Check thread count
  size_t mwgs;
  if (clGetDeviceInfo(_deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(mwgs), &mwgs, NULL))
    Fatal("Cannot get OpenCL max work group size\n");

  // Create OpenCL context for fastest device
  cl_context_properties cps[3] =
  {
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)winnerPlatform,
    (cl_context_properties)0
  };
  _context = clCreateContextFromType(cps, deviceType, NULL, NULL, &err);
  if (!_context || err) Fatal("Cannot create OpenCL Context\n");

  // Properties for create command queue; currently nothing
  // cl_command_queue_properties *propers;
  cl_command_queue_properties prop = 0;
  //prop |= CL_QUEUE_PROFILING_ENABLE;
  //prop |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
  // propers = ∝

  // Create OpenCL command queue for the fastest device
  _queue = clCreateCommandQueueWithProperties(_context, _deviceID, &prop, &err);
  // _queue = clCreateCommandQueue(_context, _deviceID, &prop, &err);
  if (!_queue || err) {
    if (err == CL_INVALID_CONTEXT) Fatal("Cannot create OpenCL command queue: CL_INVALID_CONTEXT\n");
    else if (err == CL_INVALID_DEVICE) Fatal("Cannot create OpenCL command queue: CL_INVALID_DEVICE\n");
    else if (err == CL_INVALID_VALUE) Fatal("Cannot create OpenCL command queue: CL_INVALID_VALUE\n");
    else if (err == CL_INVALID_QUEUE_PROPERTIES) Fatal("Cannot create OpenCL command queue: CL_INVALID_QUEUE_PROPERTIES\n");
    else if (err == CL_OUT_OF_RESOURCES) Fatal("Cannot create OpenCL command queue: CL_OUT_OF_RESOURCES\n");
    else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create OpenCL command queue: CL_OUT_OF_HOST_MEMORY\n");
    else if (!_queue) Fatal("Cannot create OpenCL command queue: !queue\n");
    else Fatal("Cannot create OpenCL command queue: unknown error\n");
  }

  if (_VERBOSE) printf("Init complete.\n");

  return mwgs;
}

/*
 * Modify the source text to fit this run.
 */
char* ModifySourceText(unsigned int nTracesIn, unsigned int samplesPerT) {

  size_t sourceSize = strlen(M_AND_S_OPENCL_SOURCE_TEXT) + 1;
  char* moveStackSourceCode = new char[sourceSize];
  strncpy(moveStackSourceCode, M_AND_S_OPENCL_SOURCE_TEXT, sourceSize);
  moveStackSourceCode[sourceSize - 1] = '\0';

  // Locate the '?' and '#' placeholders in the kernel source (first occurrence of each)
  size_t insertLocation1 = 0;
  size_t insertLocation2 = 0;
  bool found1 = false, found2 = false;
  for (size_t z = 0; z < sourceSize; z++) {
    if (!found1 && (moveStackSourceCode[z] == '?')) { insertLocation1 = z; found1 = true; }
    if (!found2 && (moveStackSourceCode[z] == '#')) { insertLocation2 = z; found2 = true; }
  }
  if (_FIND_INSERT_LOCATIONS) {
    printf("Found ? at position %zu\n", insertLocation1);
    printf("Found # at position %zu\n", insertLocation2);
  }

  // Insert the digits that the for loops go to inside of the source
  PrintIntInStr(moveStackSourceCode, (int)insertLocation1, nTracesIn);
  PrintIntInStr(moveStackSourceCode, (int)insertLocation2, samplesPerT);

  // Print the modified source code if asked to
  if (_FIND_INSERT_LOCATIONS) {
    printf("\nGPU Source Code:\n");
    printf("%s\n", moveStackSourceCode);
  }

  return moveStackSourceCode;
}

/*
 * Wait for event and then release it.
 */
static void WaitForEventAndRelease(cl_event *event) {

  printf("WaitForEventAndRelease()\n");

  cl_int status = CL_SUCCESS;

  status = clWaitForEvents(1, event);
  if (status) Fatal("clWaitForEvents Failed with Error Code %d\n", status);

  printf("About to release event...\n");

  status = clReleaseEvent(*event);
  if (status) Fatal("clReleaseEvent Failed with Error Code %d\n", status);
}


// Runs the program via open CL
static double RunOpenCL(float prestackTracesArray[], float stackTracesOut1DArray[], float powerTracesOut1DArray[],
  unsigned int nTracesOut, unsigned int nTracesIn, unsigned int samplesPerT,
  size_t inXsamples, size_t outXsamples,
  unsigned int localThreadCount)
{

  cl_int err;

  // Get the source code
  char* modifiedGpuSource = ModifySourceText(nTracesIn, samplesPerT);

  // Allocate device memory
  // CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR | CL_MEM_USE_PERSISTENT_MEM_AMD (?)
  // Input...
  cl_mem prestackTracesCL = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    inXsamples * sizeof(cl_float), prestackTracesArray, &err);
  if (err) FatalBufferCreation("Prestack traces", err);
  // Output... TODO: How do we know that the output is zeroed out?
  cl_mem stackTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
    outXsamples * sizeof(cl_float), NULL, &err);
  if (err) FatalBufferCreation("Stack traces", err);
  cl_mem powerTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
    outXsamples * sizeof(cl_float), NULL, &err);
  if (err) FatalBufferCreation("Power traces", err);

  // Compile the source code
  char* gpuSourceText[1];
  gpuSourceText[0] = modifiedGpuSource;
  size_t sourceLength[1];
  sourceLength[0] = strlen(modifiedGpuSource);
  cl_program moveoutAndStackCLProgram = clCreateProgramWithSource(_context, 1, (const char**)gpuSourceText,
    (const size_t*)sourceLength, &err);
  if (err != CL_SUCCESS) {
    if (err == CL_INVALID_CONTEXT) Fatal("Cannot create program: CL_INVALID_CONTEXT\n");
    else if (err == CL_INVALID_VALUE) Fatal("Cannot create program: CL_INVALID_VALUE\n");
    else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create program: CL_OUT_OF_HOST_MEMORY\n");
    else Fatal("Cannot create program: %d\n", err);
  }

  // Build the program
  cl_int buildCode = clBuildProgram(moveoutAndStackCLProgram, 0, NULL, NULL, NULL, NULL);
  if (buildCode != CL_SUCCESS) {
    // Attempt to get compile errors
    char log[1048576];
    if (clGetProgramBuildInfo(moveoutAndStackCLProgram, _deviceID, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL)) {
      log[0] = '\0'; // Failed to get the log file
    }

    if (buildCode == CL_INVALID_PROGRAM) Fatal("Cannot build program: CL_INVALID_PROGRAM\n%s", log);
    else if (buildCode == CL_INVALID_VALUE) Fatal("Cannot build program: CL_INVALID_VALUE\n%s", log);
    else if (buildCode == CL_INVALID_DEVICE) Fatal("Cannot build program: CL_INVALID_DEVICE\n%s", log);
    else if (buildCode == CL_INVALID_BINARY) Fatal("Cannot build program: CL_INVALID_BINARY\n%s", log);
    else if (buildCode == CL_INVALID_BUILD_OPTIONS) Fatal("Cannot build program: CL_INVALID_BUILD_OPTIONS\n%s", log);
    else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log);
    else if (buildCode == CL_COMPILER_NOT_AVAILABLE) Fatal("Cannot build program: CL_COMPILER_NOT_AVAILABLE\n%s", log);
    else if (buildCode == CL_BUILD_PROGRAM_FAILURE) Fatal("Cannot build program: CL_BUILD_PROGRAM_FAILURE\n%s", log);
    else if (buildCode == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot build program: CL_OUT_OF_HOST_MEMORY\n%s", log);
    else Fatal("Cannot build program: %d\n%s", buildCode, log);
  }

  // Compile the source code & build the kernel
  cl_kernel kernel = clCreateKernel(moveoutAndStackCLProgram, "sumAllCL", &err);
  if (err) {
    if (err == CL_INVALID_PROGRAM) Fatal("Cannot create kernel: CL_INVALID_PROGRAM\n");
    else if (err == CL_INVALID_PROGRAM_EXECUTABLE) Fatal("Cannot create kernel: CL_INVALID_PROGRAM_EXECUTABLE\n");
    else if (err == CL_INVALID_KERNEL_NAME) Fatal("Cannot create kernel: CL_INVALID_KERNEL_NAME\n");
    else if (err == CL_INVALID_KERNEL_DEFINITION) Fatal("Cannot create kernel: CL_INVALID_KERNEL_DEFINITION\n");
    else if (err == CL_INVALID_VALUE) Fatal("Cannot create kernel: CL_INVALID_VALUE\n");
    else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create kernel: CL_OUT_OF_HOST_MEMORY\n");
    else Fatal("Cannot create kernel: %d\n", err);
  }

  // Set program parameters
  cl_int returnValArgSet;
  returnValArgSet = clSetKernelArg(kernel, 0, sizeof(cl_mem), &prestackTracesCL);
  if (returnValArgSet != CL_SUCCESS) FatalSetArgs("prestackTracesCL", returnValArgSet);
  returnValArgSet = clSetKernelArg(kernel, 1, sizeof(cl_mem), &stackTracesOutCL);
  if (returnValArgSet != CL_SUCCESS) FatalSetArgs("stackTracesOutCL", returnValArgSet);
  returnValArgSet = clSetKernelArg(kernel, 2, sizeof(cl_mem), &powerTracesOutCL);
  if (returnValArgSet != CL_SUCCESS) FatalSetArgs("powerTracesOutCL", returnValArgSet);
  returnValArgSet = clSetKernelArg(kernel, 3, sizeof(unsigned int), &nTracesOut);
  if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesOut", returnValArgSet);
  returnValArgSet = clSetKernelArg(kernel, 4, sizeof(unsigned int), &nTracesIn);
  if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesIn", returnValArgSet);
  returnValArgSet = clSetKernelArg(kernel, 5, sizeof(unsigned int), &samplesPerT);
  if (returnValArgSet != CL_SUCCESS) FatalSetArgs("samplesPerT", returnValArgSet);

  // TODO: verbose
  printf("About to run Kernel...\n");

  // Start timer TODO: move?
  double runTime = GetTime();

  // Run the kernel (& also set the number of threads)
  cl_event runEvent;
  size_t Global[1] = { nTracesOut };
  size_t Local[1]  = { localThreadCount };
  if (localThreadCount > 0) err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, Local, 0, NULL, &runEvent);
  else err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, NULL, 0, NULL, &runEvent);
  if (err) {
    if (err == CL_INVALID_PROGRAM_EXECUTABLE) {
      Fatal("Cannot run Kernel: No successfully built program executable available.\n");
    } else if (err == CL_INVALID_COMMAND_QUEUE) {
      Fatal("Cannot run Kernel: command_queue is not a valid command-queue.\n");
    } else if (err == CL_INVALID_KERNEL) {
      Fatal("Cannot run Kernel: Kernel is not a valid kernel object.\n");
    } else if (err == CL_INVALID_CONTEXT) {
      Fatal("Cannot run Kernel: Context associated with command_queue and kernel is not the same, or "
        "the context associated with command_queue and events in event_wait_list are not the same.\n");
    } else if (err == CL_INVALID_KERNEL_ARGS) {
      Fatal("Cannot run Kernel: Kernel argument values have not been specified.\n");
    } else if (err == CL_INVALID_WORK_DIMENSION) {
      Fatal("Cannot run Kernel: work_dim is not a valid value (must be between 1 and 3).\n");
    } else if (err == CL_INVALID_WORK_GROUP_SIZE) {
      Fatal("Cannot run Kernel: local_work_size is specified and the number of work-items specified by global_work_size "
        "is not evenly divisible by the work-group size given by local_work_size, or does not match the "
        "work-group size specified for the kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) "
        "qualifier in the program source.\n");
    } else if (err == CL_INVALID_WORK_ITEM_SIZE) {
      Fatal("Cannot run Kernel: The number of work-items specified in any of local_work_size[0], ... "
        "local_work_size[work_dim - 1] is greater than the corresponding values specified "
        "by CL_DEVICE_MAX_WORK_ITEM_SIZES[0], ... CL_DEVICE_MAX_WORK_ITEM_SIZES[work_dim - 1].\n");
    } else if (err == CL_INVALID_GLOBAL_OFFSET) {
      Fatal("Cannot run Kernel: global_work_offset is not NULL.\n");
    } else if (err == CL_OUT_OF_RESOURCES) {
      Fatal("Cannot run Kernel: CL_OUT_OF_RESOURCES.\n");
    } else if (err == CL_MEM_OBJECT_ALLOCATION_FAILURE) {
      Fatal("Cannot run Kernel: Failure to allocate memory for the data store associated with image or buffer "
        "objects specified as arguments to the kernel.\n");
    } else if (err == CL_INVALID_EVENT_WAIT_LIST) {
      Fatal("Cannot run Kernel: event_wait_list is NULL and num_events_in_wait_list > 0, or event_wait_list "
        "is not NULL and num_events_in_wait_list is 0, or event objects in event_wait_list "
        "are not valid events.\n");
    } else if (err == CL_OUT_OF_HOST_MEMORY) {
      Fatal("Cannot run Kernel: Failure to allocate resources required by the OpenCL implementation on the host.\n");
    } else {
      Fatal("Cannot run Kernel: Unknown Error. (clEnqueueNDRangeKernel)\n");
    }
  }

  // Flush the program & wait for the program to finish executing
  if (clFlush(_queue)) printf("Flush Fail (Run)");
  WaitForEventAndRelease(&runEvent);

  // Copy the end result back to CPU memory side
  if (clEnqueueReadBuffer(_queue, stackTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), stackTracesOut1DArray, 0, NULL, NULL))
    Fatal("Cannot copy stackTracesOutCL from device to host\n");
  if (clEnqueueReadBuffer(_queue, powerTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), powerTracesOut1DArray, 0, NULL, NULL))
    Fatal("Cannot copy powerTracesOutCL from device to host\n");

  // Release kernel and program
  if (clReleaseKernel(kernel)) Fatal("Cannot release kernel\n");
  if (clReleaseProgram(moveoutAndStackCLProgram)) Fatal("Cannot release program\n");

  // Free device memory
  clReleaseMemObject(prestackTracesCL);
  clReleaseMemObject(stackTracesOutCL);
  clReleaseMemObject(powerTracesOutCL);

  // Release the context and queue
  clReleaseCommandQueue(_queue);
  clReleaseContext(_context);

  // Return the time recorded just before the kernel was enqueued (the caller computes the elapsed time)
  return runTime;
}
// end of CL functions

// Returns a float 0.0 - 1.0, inclusive
float RandomFloat() {
  return static_cast <float> (rand()) / static_cast <float>(RAND_MAX);
}

// Fill in the prestack traces array
void RandomFillArray(float* fillArray, unsigned int length) {

  srand(time(NULL)); // Give a "random" seed

  for (unsigned int r = 0; r < length; r++) {
    fillArray[r] = RandomFloat() * 1000.0f;
  }
}

// Runs the program
double RunProg(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut,
  unsigned int localThreadCount, unsigned int deviceType) {

  // Stores sizes of the various arrays
  size_t tracesInxSample = nTracesIn * samplesPerTrace;
  size_t tracesOutxSample = nTracesOut * samplesPerTrace;

  // Allocate arrays
  float* prestackTraces1D = (float*)malloc(tracesInxSample * sizeof(float));
  float* stackTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
  float* powerTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
  // float* stackTracesOut1Dcpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
  // float* powerTracesOut1Dcpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out

  // Count how much memory all of this is
  if (_VERBOSE)
  {
    // Make sure it is consistent with above allocation
    unsigned long allocatedMemory = 0;
    allocatedMemory += tracesInxSample * sizeof(float);
    allocatedMemory += tracesOutxSample * sizeof(float);
    allocatedMemory += tracesOutxSample * sizeof(float);

    printf("TOTAL MEMORY ALLOCATED: %s\n", byteConverter(allocatedMemory));
    printf("Input Array Sizes: %s\n", byteConverter((unsigned long)(tracesInxSample * sizeof(float))));
    printf("Output Array Sizes: %s\n", byteConverter((unsigned long)(tracesOutxSample * sizeof(float))));
  }

  // Fill in array with randoms
  RandomFillArray(prestackTraces1D, (unsigned int)tracesInxSample);

  // Init OpenCL using the desired device type
  double preInitTime = GetTime();
  int maxWorkGroupSize;
  if (deviceType == 0) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_ALL);
  else if (deviceType == 1) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_GPU);
  else maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_CPU);
  printf("Max work group size for the device is: %d\n", maxWorkGroupSize);

  // --- ACTUAL TEST ---
  // Run OpenCL
  double startTime = GetTime();
  double runTime = RunOpenCL(prestackTraces1D, stackTracesOut1Dgpu, powerTracesOut1Dgpu, // arrays
    nTracesOut, nTracesIn, samplesPerTrace, // ints
    tracesInxSample, tracesOutxSample,
    localThreadCount); // samples

  // Display run time
  double endTime = GetTime();
  printf("Elapsed Time:      %f secs\n", (endTime - runTime));
  printf("                   %f secs (Before Function Call)\n", (endTime - startTime));
  printf("                   %f secs (Including Init)\n\n", (endTime - preInitTime));

  // Free the 1D arrays
  free(powerTracesOut1Dgpu);
  free(stackTracesOut1Dgpu);
  free(prestackTraces1D);

  return (endTime - startTime);
}

void CheckFaster(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut,
  unsigned int localThreadCount, unsigned int deviceType, double* fasterThan, unsigned int* fastestLocalCount) {

  double speed = RunProg(samplesPerTrace, nTracesIn, nTracesOut, localThreadCount, deviceType);

  // Keep the lowest elapsed time seen so far (0.0 means no result has been recorded yet)
  if ((*fasterThan == 0.0) || (speed < *fasterThan)) {
    *fasterThan = speed;
    *fastestLocalCount = localThreadCount;
  }
}

// Main
int main(int argc, char* argv[]) {

  if (!((argc == 6) || (argc == 8) || (argc == 10))) {
    Fatal("Incorrect # of Arguments, need 5: SamplesPerTrace TracesIn TracesOut LocalThreadCount DeviceType\n"
      "or 7: SamplesPerTrace TracesIn TracesOut SamplesPerTraceIncrementAmount\n"
      "TracesInIncrementAmount TracesOutIncrementAmount IncrementCount\n");
  }

  unsigned int samplesPerTrace = atoi(argv[1]);
  unsigned int nTracesIn = atoi(argv[2]);
  unsigned int nTracesOut = atoi(argv[3]);

  unsigned int localThreadCount = atoi(argv[4]);
  unsigned int deviceType = atoi(argv[5]);

  if (argc == 6) RunProg(samplesPerTrace, nTracesIn, nTracesOut, localThreadCount, deviceType);
  else if (argc == 8) { // Try multiple iterations, increasing the first three variables each time

    unsigned int samplesPerTraceIncrementAmount = atoi(argv[4]);
    unsigned int tracesInIncrementAmount = atoi(argv[5]);
    unsigned int tracesOutIncrementAmount = atoi(argv[6]);

    unsigned int incrementCount = atoi(argv[7]);

    for (unsigned int x = 0; x <= incrementCount; x++) {

      unsigned int nextSPT = samplesPerTrace + (x * samplesPerTraceIncrementAmount);
      unsigned int nextTracesIn = nTracesIn + (x * tracesInIncrementAmount);
      unsigned int nextTracesOut = nTracesOut + (x * tracesOutIncrementAmount);

      printf("\nNEXT RUN: %u %u %u\n\n", nextSPT, nextTracesIn, nextTracesOut);
      RunProg(nextSPT, nextTracesIn, nextTracesOut, localThreadCount, deviceType);
    }
  } else if (argc == 10) { // Work-in-progress, currently hard coded to several test cases

    double fastestGPU = 0.0;
    double* fastestGPUp;
    fastestGPUp = &fastestGPU;
    double fastestCPU = 0.0;
    double* fastestCPUp;
    fastestCPUp = &fastestCPU;
    unsigned int bestLocalGPU = 0;
    unsigned int* bestLocalGPUp;
    bestLocalGPUp = &bestLocalGPU;
    unsigned int bestLocalCPU = 0;
    unsigned int* bestLocalCPUp;
    bestLocalCPUp = &bestLocalCPU;

    // GPU test
    CheckFaster(25000, 2500, 250, 50, 1,  fastestGPUp, bestLocalGPUp);
    CheckFaster(25000, 2500, 250, 25, 1,  fastestGPUp, bestLocalGPUp);
    CheckFaster(25000, 2500, 250, 10, 1,  fastestGPUp, bestLocalGPUp);
    CheckFaster(25000, 2500, 250, 5, 1,   fastestGPUp, bestLocalGPUp);
    CheckFaster(25000, 2500, 250, 2, 1,   fastestGPUp, bestLocalGPUp);
    CheckFaster(25000, 2500, 250, 1, 1,   fastestGPUp, bestLocalGPUp);

    // CPU test
    CheckFaster(25000, 2500, 250, 50, 2,  fastestCPUp, bestLocalCPUp);
    CheckFaster(25000, 2500, 250, 25, 2,  fastestCPUp, bestLocalCPUp);
    CheckFaster(25000, 2500, 250, 10, 2,  fastestCPUp, bestLocalCPUp);
    CheckFaster(25000, 2500, 250, 5, 2,   fastestCPUp, bestLocalCPUp);
    CheckFaster(25000, 2500, 250, 2, 2,   fastestCPUp, bestLocalCPUp);
    CheckFaster(25000, 2500, 250, 1, 2,   fastestCPUp, bestLocalCPUp);

    printf("\n\n");
    printf("Fastest GPU: %f with local thread count %u\n", fastestGPU, bestLocalGPU);
    printf("Fastest CPU: %f with local thread count %u\n", fastestCPU, bestLocalCPU);

  }

  return 0;

}


My (almost certainly false) speculation

My first thought as to why it's running so much slower on my GPU than on my CPU was that maybe it's because I am transferring so much data over to the graphics card before anything runs. Perhaps a better implementation would split the workload into multiple runs, so that the kernel can be executing while more data is being transferred over (I presume that's a thing). However, now that I think about it, this is almost certainly false, because as I said I wrote this program based on an example, that example did matrix multiplication, and it runs much, much faster on the GPU than on my CPU. I don't really know what the difference is.
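For what it's worth, the overlap idea would look roughly like the sketch below. The names (copyQueue, computeQueue, inBuf, numChunks, and so on) are illustrative and not from my program, and whether any real overlap happens also depends on the driver and hardware:

// Very rough sketch: double-buffer the input across two cl_mem objects and use a copy queue
// plus a compute queue, with events enforcing copy-before-compute and compute-before-reuse.
// Event releases are omitted for brevity.
cl_event copied[2] = { NULL, NULL };
cl_event ran[2]    = { NULL, NULL };
for (int c = 0; c < numChunks; ++c) {
  int slot = c % 2;
  // Before overwriting this slot, wait for the kernel that last used it
  clEnqueueWriteBuffer(copyQueue, inBuf[slot], CL_FALSE, 0, chunkBytes,
                       hostInput + (size_t)c * chunkElems,
                       ran[slot] ? 1 : 0, ran[slot] ? &ran[slot] : NULL, &copied[slot]);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuf[slot]);
  clEnqueueNDRangeKernel(computeQueue, kernel, 1, NULL, &chunkGlobal, NULL,
                         1, &copied[slot], &ran[slot]);
}
clFinish(copyQueue);
clFinish(computeQueue);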

If desired, I can also post the code for that program.


And excuse my lack of knowledge; you've got to start somewhere. I hope you all can help; meanwhile, I'm going to go watch some videos on OpenCL to learn everything that I didn't learn in that one class.

Hey guys, just an update: I tried different compiler optimizations, I tried turning the problem into a two-dimensional problem instead of a 1D one, and I tried unrolling the loops with #pragma unroll. All of those except the first actually made it slower.

I also watched the Developer Central OpenCL videos; I learned a lot, but I still don't know what's wrong.

Your GPU has 44 Compute Units, each one able to handle work groups of up to 256 work items.

Homework:

Think about what happens when you make a call such as: CheckFaster(25000, 2500, 250, 1, 1, fastestGPUp, bestLocalGPUp). How are the work items distributed?
Is it an efficient use of the GPU? How can you improve the situation? (Hint: read about the notion of occupancy.)

Welp, I did my homework, and I have been way under-utilizing my graphics card, that's for sure. Thanks for pointing that out, utnapishtim. I completely misunderstood what OpenCL meant by the max work-group size.

However, even with many more global threads and a work-group size near 250, there is still a problem. I think it has something to do with caching, because each ALU on the graphics card needs memory from very different parts of the large arrays. I'll get back to you all tomorrow.
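For concreteness, here is the back-of-the-envelope check behind that realization, using the numbers the device reported above (a standalone snippet, not part of the program):

#include <cstdio>

int main() {
  // With the test parameters 25000 2500 250, the kernel was launched with a global size of 250,
  // i.e. at most 250 work items in flight, on a device that can hold roughly 44 * 256 at once.
  unsigned long globalSize   = 250;  // nTracesOut in the test runs
  unsigned long computeUnits = 44;   // CL_DEVICE_MAX_COMPUTE_UNITS reported above
  unsigned long perGroup     = 256;  // CL_DEVICE_MAX_WORK_GROUP_SIZE reported above
  std::printf("work items launched: %lu, rough device capacity: %lu\n",
              globalSize, computeUnits * perGroup); // prints 250 vs 11264
  return 0;
}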

Quick question: if I use CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the command queue, can multiple kernels be running at the same time? Or does it simply mean that kernels will still execute one at a time, but just out of order? If it is the 2nd option, how would I go about creating a program that runs multiple OpenCL kernels on my GPU simultaneously? By creating multiple command queues?

However, even with many more global threads and a work-group size near 250, there is still a problem

That is nothing for a GPU. As a rule of thumb, you want the global size to be at least the number of cores * 10 (there is a chance you will be compute or bandwidth bound before that, so it may hurt performance, but it is too early to concern yourself with that).
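One way to act on that rule of thumb for this particular kernel is to launch one work item per (sample, output trace) pair instead of one per output trace. This is just a sketch that reuses the argument names from the program above; the kernel itself would also have to change, so don't treat it as drop-in code:

// Sketch: a 2D launch gives 25000 x 250 = 6.25 million work items instead of 250.
size_t Global2D[2] = { (size_t)samplesPerT, (size_t)nTracesOut };
size_t Local2D[2]  = { 250, 1 };  // 250 divides 25000 evenly and fits under the 256 max work-group size
err = clEnqueueNDRangeKernel(_queue, kernel, 2, NULL, Global2D, Local2D, 0, NULL, &runEvent);
// The kernel would then use i = get_global_id(0) and k = get_global_id(1) and keep only the j loop.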

When creating the command queue, can multiple kernels be running at the same time?

This. You must make sure two concurrent kernels will never write into the same memory object, using events for mutual exclusion.
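For reference, the general pattern looks something like the sketch below (kernelA/kernelB/kernelC are illustrative, and it assumes the device actually accepts an out-of-order queue, which, as the next post shows, this one does not):

// Sketch: an out-of-order queue, with an event used only to serialize the two kernels
// that touch the same memory object.
cl_queue_properties props[] = {
  CL_QUEUE_PROPERTIES, (cl_queue_properties)CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
};
cl_command_queue oooQueue = clCreateCommandQueueWithProperties(_context, _deviceID, props, &err);

cl_event aDone;
clEnqueueNDRangeKernel(oooQueue, kernelA, 1, NULL, &globalA, NULL, 0, NULL, &aDone);
// kernelB is independent of kernelA, so the runtime may run them concurrently:
clEnqueueNDRangeKernel(oooQueue, kernelB, 1, NULL, &globalB, NULL, 0, NULL, NULL);
// kernelC writes the same buffer as kernelA, so make it wait on aDone:
clEnqueueNDRangeKernel(oooQueue, kernelC, 1, NULL, &globalC, NULL, 1, &aDone, NULL);
clFinish(oooQueue);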

Thanks for the reply Salabar,

Unfortunately, it turns out that using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE gives me CL_INVALID_QUEUE_PROPERTIES, which according to the OpenCL documentation means that

“if values specified in properties are valid but are not supported by the device.”

So my device does not support CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE. That's a big bummer. :(

My first thought was "well, what if I make two queues and try running two different kernels on the two different queues on the same device at the same time (also using different programs and kernels)?" It seems that they are still executed one after the other despite being in different queues. Is there any way around this? Why would my graphics card not support multiple programs running on it at once? It must, for how else would I be able to do something such as, say, have two video games running side by side at the same time? Or any two programs that need the GPU at the same time? I am, after all, able to run something that uses my GPU while my OpenCL programs execute.

Or can I simply change something about my graphics card such that it will support CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE?

Would creating these different queues in different contexts make the difference? Even if the contexts are practically the same (containing the same device, and only one device)?

Please note that you can permute the two loops in your kernel to replace O(n^2) read/writes to stackTracesOut and powerTracesOut by O(n) writes.
Also use two temporary variables to compute the sums, and store them into stackTracesOut and powerTracesOut once the sums are computed. Don’t expect the compiler to optimize that for you.

Given the nature of your computation, you cannot avoid O(n^2) reads from prestackTraces though.
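Roughly, that suggestion applied to the kernel above looks like the following sketch. The upper-case names stand in for whatever constant bounds get pasted in at run time (or passed as build options); this is not the exact final code:

__kernel void sumAllCL(__global const float prestackTraces[],
  __global float stackTracesOut[],
  __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,
  const unsigned int samplesPerTrace) {

  unsigned int k = get_global_id(0);            // output trace handled by this work item
  unsigned int kTimesSamples = k * samplesPerTrace;

  // Outer loop over samples, inner loop over input traces: each output element is accumulated
  // in private variables and written to global memory exactly once.
  for (int i = 0; i < SAMPLES_PER_TRACE; i++) {
    float stackSum = 0.0f;
    float powerSum = 0.0f;
    for (int j = 0; j < N_TRACES_IN; j++) {
      float value = prestackTraces[i + j * samplesPerTrace];
      stackSum += value;
      powerSum += value * value;
    }
    stackTracesOut[i + kTimesSamples] = stackSum;
    powerTracesOut[i + kTimesSamples] = powerSum;
  }
}

Since each output element is now written exactly once, the kernel also no longer depends on the output buffers being zeroed beforehand (the TODO in RunOpenCL above).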

[QUOTE=utnapishtim;41267]Please note that you can permute the two loops in your kernel to replace O(n^2) read/writes to stackTracesOut and powerTracesOut by O(n) writes.
Also use two temporary variables to compute the sums, and store them into stackTracesOut and powerTracesOut once the sums are computed. Don’t expect the compiler to optimize that for you.

Given the nature of your computation, you cannot avoid O(n^2) reads from prestackTraces though.[/QUOTE]

I did so and this helped tremendously. Thank you.