Hey all, I’m new to the forums, also pretty new to OpenCL. I learned a little bit in college, and by a little bit I mean my graphics professor taught us about GPGPU and OpenCL for a single day (whereas the rest of the class was focused on shaders and OpenGL and so on).
Anyways, I took an example program and changed it to work with the computations that I want it to run. However my program runs significantly faster on the CPU than my GPU, and I am trying to understand why.
About my program:
My program takes in one input float array and has two output arrays. Under single threaded circumstances it has three arguments <samplesPerTrace> <tracesIn> <tracesOut>. The input array’s size is: samplesPerTracetracesInsizeof(float), and the output array’s size are: samplesPerTracetracesOutsizeof(float). However this program takes in 5 arguments <samplesPerTrace> <tracesIn> <tracesOut> <#oflocalthreads> <whichdevicetypetouse>. If whichdevicetypetouse == 0, then CL_DEVICE_TYPE_ALL is used, 1 = CL_DEVICE_TYPE_GPU, 2 = CL_DEVICE_TYPE_CPU. If 0 is used the program uses the fastest device it can find, which it thinks is always the GPU despite that currently being false.
My test cases have been using the parameters 25000 2500 250, because that is on average the size of the arrays that I will be using (perhaps a little above average).
Here is the source code that OpenCL builds and runs on the kernel;
const char* M_AND_S_OPENCL_SOURCE_TEXT =
__kernel void sumAllCL(__global const float prestackTraces[],
__global float stackTracesOut[],
__global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,
const unsigned int samplesPerTrace) {
unsigned int k = get_global_id(0); // Thread ID
unsigned int kTimesIn = k * nTracesIn; // Store repeat ints
unsigned int kTimesSamples = k * samplesPerTrace;
for (int j = 0; j < ; j++) {
int jTimesSamplesPT = j * samplesPerTrace;
for (int i = 0; i < ; i++) {
int valueIndex = i + jTimesSamplesPT;
float value = prestackTraces[valueIndex];
stackTracesOut[i + kTimesSamples] += value;
powerTracesOut[i + kTimesSamples] += (value * value);
}
}
}
Note that the conditionals in the for loops are replaced at run time with fixed numbers, I do this because I thought it would help the compiler unroll the loops and thus increase performance.
With the above parameters stated (25000 2500 250 ~10 <1 or 2>) it takes my CPU about 0.6 seconds to complete the program and my GPU about 40 seconds to complete. That’s a bigger difference. Fyi, I have been messing around with the 4th parameters to see which value runs faster which is what is meant by the ~10.
My Devices:
My graphics card is a MSI Radeon R9 390X 8GB, given the name Hawaii. When I have OpenCL print out information about both of my devices this is what I get:
[VAR]
OpenCL Platform 0: AMD Accelerated Parallel Processing
----- OpenCL Device # 0: Hawaii-----
Gflops: 47.520000
Max Clock Frequency: 1080
Max Compute Units: 44
Max Work Group Size: 256
MEMORY…
Total Memory of Device: 8.000G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.999G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 16.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS…
Device Vendor: Advanced Micro Devices, Inc.
Device Version: OpenCL 2.0 AMD-APP (2117.13)
Driver Version: 2117.13 (VM)
Device OpenCL Version: OpenCL C 2.0
----- OpenCL Device # 1: Intel® Core™ i7-6700K CPU 4.00GHz-----
Gflops: 32.064000
Max Clock Frequency: 4008
Max Compute Units: 8
Max Work Group Size: 1024
MEMORY…
Total Memory of Device: 15.967G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.1028G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 32.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS…
Device Vendor: GenuineIntel
Device Version: OpenCL 1.2 AMD-APP (2117.13)
Driver Version: 2117.13 (sse2,avx)
Device OpenCL Version: OpenCL C 1.2[/VAR]
Source Code:
Here is the source code of my entire program. I’m not trying to say “hey can you look through my entire program to see what is wrong”, but rather I just wanted to provide a minimal complete (hopefully) verifiable example that you can compile and run if necessary/useful. It’s a standalone file, I couldn’t really cut much out of it because since I’m just getting started everything is relevant.
const char* M_AND_S_OPENCL_SOURCE_TEXT =
"__kernel void sumAllCL(__global const float prestackTraces[],
"
" __global float stackTracesOut[],
"
" __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,
"
" const unsigned int samplesPerTrace) {
"
"
"
" unsigned int k = get_global_id(0);
" // Thread ID
"
"
" unsigned int kTimesIn = k * nTracesIn;
" // Store repeat ints
" unsigned int kTimesSamples = k * samplesPerTrace;
"
"
"
" for (int j = 0; j < ? ; j++) {
" // ? position to be replaced (nTracesOut)"
"
"
" int jTimesSamplesPT = j * samplesPerTrace;
"
"
"
" for (int i = 0; i < # ; i++) {
" // # position to be replaced ()
"
"
" int valueIndex = i + jTimesSamplesPT;
"
" float value = prestackTraces[valueIndex];
"
"
"
" stackTracesOut[i + kTimesSamples] += value;
"
" powerTracesOut[i + kTimesSamples] += (value * value);
"
"
"
" }
"
" }
"
"}
";
Note that the ? and the # are replaced at run time with fixed numbers, I do this because I thought it would help the compiler unroll the rl
With the above parameters stated (25000 2500 250 ~10 <1 or 2>) it takes my CPU about 0.6 seconds to complete the program and my GPU about 40 seconds to complete. That’s a bigger difference. Fyi, I have been messing around with the 4th parameters to see which value runs faster which is what is meant by the ~10.
My Devices:
My graphics card is a MSI Radeon R9 390X 8GB, given the name Hawaii. When I have OpenCL print out information about both of my devices this is what I get:
[VAR]
OpenCL Platform 0: AMD Accelerated Parallel Processing
----- OpenCL Device # 0: Hawaii-----
Gflops: 47.520000
Max Clock Frequency: 1080
Max Compute Units: 44
Max Work Group Size: 256
MEMORY…
Total Memory of Device: 8.000G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.999G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 16.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS…
Device Vendor: Advanced Micro Devices, Inc.
Device Version: OpenCL 2.0 AMD-APP (2117.13)
Driver Version: 2117.13 (VM)
Device OpenCL Version: OpenCL C 2.0
----- OpenCL Device # 1: Intel® Core™ i7-6700K CPU ? 4.00GHz-----
Gflops: 32.064000
Max Clock Frequency: 4008
Max Compute Units: 8
Max Work Group Size: 1024
MEMORY…
Total Memory of Device: 15.967G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.1028G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 32.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS…
Device Vendor: GenuineIntel
Device Version: OpenCL 1.2 AMD-APP (2117.13)
Driver Version: 2117.13 (sse2,avx)
Device OpenCL Version: OpenCL C 1.2[/VAR]
Source Code:
Here is the source code of my entire program. I’m not trying to say “hey can you look through my entire program to see what is wrong”, but rather I just wanted to provide a minimal complete (hopefully) verifiable example that you can compile and run if necessary/useful. It’s a standalone file, I couldn’t really cut much out of it because since I’m just getting started everything is relevant.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <CL/opencl.h>
#include <unistd.h>
#include <string.h>
#include <sstream>
#include <math.h>
#include <stdarg.h>
const bool _VERBOSE = true;
// Variables used in OpenCL
static cl_device_id _deviceID;
static cl_context _context;
static cl_command_queue _queue;
const int MAX_LOOP_SIZE = 99999999; // Based off how many digits a number inserted into the program can have
// Note: If you change M_AND_S_OPENCL_SOURCE_TEXT, you must also change INSERT_LOCATION_1 & INSERT_LOCATION_2. You can
// set _FIND_INSERT_LOCATIONS to true to print out the new locations.
const bool _FIND_INSERT_LOCATIONS = false;
const int INSERT_LOCATION_1 = 388;
const int INSERT_LOCATION_2 = 478;
const char* M_AND_S_OPENCL_SOURCE_TEXT =
"__kernel void sumAllCL(__global const float prestackTraces[],
"
" __global float stackTracesOut[],
"
" __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,
"
" const unsigned int samplesPerTrace) {
"
"
"
" unsigned int k = get_global_id(0);
" // Thread ID
"
"
" unsigned int kTimesIn = k * nTracesIn;
" // Store repeat ints
" unsigned int kTimesSamples = k * samplesPerTrace;
"
"
"
" for (int j = 0; j < ; j++) {
" // ? position to be replaced (nTracesOut)"
"
"
" int jTimesSamplesPT = j * samplesPerTrace;
"
"
"
" for (int i = 0; i < ; i++) {
" // # position to be replaced ()
"
"
" int valueIndex = i + jTimesSamplesPT;
"
" float value = prestackTraces[valueIndex];
"
"
"
" stackTracesOut[i + kTimesSamples] += value;
"
" powerTracesOut[i + kTimesSamples] += (value * value);
"
"
"
" }
"
" }
"
"}
";
void sumAllCLPerThread(const float prestackTraces[],
const unsigned int endIndex, float stackTracesOut[],
float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,
const unsigned int samplesPerTrace, const unsigned int thisThread) {
int k = thisThread; // Outer loop = thread ID
int k_timesSPT = k * samplesPerTrace;
// printf("j goes to %d, i goes to %d
", nTracesIn, endIndex);
for (unsigned int j = 0; j < nTracesIn; j++) {
int jTimesSamplesPT = j * samplesPerTrace;
for (unsigned int i = 0; i < endIndex; i++) {
int valueIndex = i + jTimesSamplesPT;
float value = prestackTraces[valueIndex];
stackTracesOut[i + k_timesSPT] += value;
powerTracesOut[i + k_timesSPT] += (value * value);
}
}
}
// Helper function for the byte converter
const char* byteConverterHelper(long num, const char* suffix, int power) {
double div1024 = pow(1024.0, (double)power);
unsigned long insertNum = (unsigned long)(((double)num) / ((double)div1024));
std::stringstream strs;
strs << insertNum;
if (power > 0) {
strs << ".";
unsigned long remainder = num % ((unsigned long)div1024); // Get the remainder in vase 1024
remainder = (unsigned long)((double)remainder * (1000.0 / 1024.0)); // Convert base 1024 to base 1000
remainder /= (unsigned long)pow(1000.0, (double)(power - 1)); // Get the first three digits
if (remainder < 10L) strs << "00";
else if (remainder < 100L) strs << "0";
strs << remainder;
}
strs << suffix;
strs << "\0";
return strs.str().c_str();
}
/*
* Convert a number into byes in the from ###.### with the correct postfix (such as G for gigabytes).
*/
char* byteConverter(unsigned long num) {
//int lengthOfString;
char* string = new char[30];
if (num < (1024L)) { // Bytes
const char* tempS = byteConverterHelper(num, " bytes", 0);
int tempLength = strlen(tempS);
int x = 0;
for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
string[x] = '\0';
} else if (num < (1024L * 1024L)) { // Mega bytes
const char* tempS = byteConverterHelper(num, "K", 1);
int tempLength = strlen(tempS);
int x = 0;
for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
string[x] = '\0';
} else if (num < (1024L * 1024L * 1024L)) { // Kilo bytes
const char* tempS = byteConverterHelper(num, "M", 2);
int tempLength = strlen(tempS);
int x = 0;
for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
string[x] = '\0';
} else if (num < (1024L * 1024L * 1024L * 1024L)) { // Giga bytes
const char* tempS = byteConverterHelper(num, "G", 3);
int tempLength = strlen(tempS);
int x = 0;
for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
string[x] = '\0';
} else { // Terra bytes
const char* tempS = byteConverterHelper(num, "T", 4);
int tempLength = strlen(tempS);
int x = 0;
for (x = 0; (x < 29) && (x < tempLength); x++) string[x] = tempS[x];
string[x] = '\0';
}
return string;
}
/**
* Used to keep track of how long it takes to execute this.
*/
// static double t0 = 0.0;
double GetTime() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + (1e-6 * tv.tv_usec);
}
/*
* Print message to stderr and exit
*/
void Fatal(const char* format, ...) {
va_list args;
va_start(args, format);
vfprintf(stderr, format, args);
va_end(args);
exit(1);
}
/*
* For when kernel parameters are failed to be set.
*/
void FatalSetArgs(const char* type, cl_int errorCode) {
const char* errorTypes0 = "Cannot set kernel parameter: CL_INVALID_KERNEL
";
const char* errorTypes1 = "Cannot set kernel parameter: CL_INVALID_ARG_INDEX
";
const char* errorTypes2 = "Cannot set kernel parameter: CL_INVALID_ARG_VALUE
";
const char* errorTypes3 = "Cannot set kernel parameter: CL_INVALID_MEM_OBJECT
";
const char* errorTypes4 = "Cannot set kernel parameter: CL_INVALID_SAMPLER
";
const char* errorTypes5 = "Cannot set kernel parameter: CL_INVALID_ARG_SIZE
";
const char* errorTypes6 = "Unknown Error???
";
if (errorCode == CL_INVALID_KERNEL) {
Fatal("Error settings args for %s:
", errorTypes0);
} else if (errorCode == CL_INVALID_ARG_INDEX) {
Fatal("Error settings args for %s:
", errorTypes1);
} else if (errorCode == CL_INVALID_ARG_VALUE) {
Fatal("Error settings args for %s:
", errorTypes2);
} else if (errorCode == CL_INVALID_MEM_OBJECT) {
Fatal("Error settings args for %s:
", errorTypes3);
} else if (errorCode == CL_INVALID_SAMPLER) {
Fatal("Error settings args for %s:
", errorTypes4);
} else if (errorCode == CL_INVALID_ARG_SIZE) {
Fatal("Error settings args for %s:
", errorTypes5);
} else {
Fatal("Error settings args for %s:
", errorTypes6);
}
}
/*
* Fatal during buffer creation.
*/
void FatalBufferCreation(const char* type, cl_int errorCode) {
if (errorCode == CL_INVALID_CONTEXT) Fatal("Error when creating buffer for %s:
"
"Context Not Valid.
", type);
if (errorCode == CL_INVALID_VALUE) Fatal("Error when creating buffer for %s:
"
"Flags Not Valid.
", type);
if (errorCode == CL_INVALID_BUFFER_SIZE) Fatal("Error when creating buffer for %s:
"
"Invalid Buffer Size: size is 0 or is greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE.
", type);
if (errorCode == CL_INVALID_HOST_PTR) Fatal("Error when creating buffer for %s:
"
"if host_ptr is NULL and CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR are set "
"in flags or if host_ptr is not NULL but CL_MEM_COPY_HOST_PTR or "
"CL_MEM_USE_HOST_PTR are not set in flags.
", type);
if (errorCode == CL_MEM_OBJECT_ALLOCATION_FAILURE) Fatal("Error when creating buffer for %s:
"
"Failure to allocate memory for buffer object.
", type);
if (errorCode == CL_OUT_OF_HOST_MEMORY) Fatal("Error when creating buffer for %s:
"
"failure to allocate resources required by the OpenCL implementation on "
"the host.
", type);
}
/*
* OpenCL notify callback (echo to stderr)
*/
void Notify(const char* errinfo, const void* private_info, size_t cb, void* user_data) {
fprintf(stderr, "%s
", errinfo);
}
/*
* Prints the given int (numToInsert) at location inside chars.
*/
void PrintIntInStr(char* chars, int location, int numToInsert) {
std::stringstream strs;
strs << numToInsert;
std::string temp_str = strs.str();
char const* numToChars = temp_str.c_str();
int numberLength = strlen(numToChars);
int w;
for (w = 0; w < numberLength; w++) {
chars[location + w] = numToChars[w];
}
}
/*
* Initialize fastest OpenCL device.
*/
int InitOpenCL(int verbose, cl_int deviceType) {
cl_uint Nplat;
cl_int err;
char name[1024];
int MaxGflops = -1;
cl_platform_id winnerPlatform = 0;
// Reset (TODO)
_deviceID = NULL;
_context = NULL;
_queue = NULL;
// Get platforms
cl_platform_id platforms[4];
if (clGetPlatformIDs(4, platforms, &Nplat)) Fatal("Cannot get number of OpenCL platforms
");
else if (Nplat<1) Fatal("No OpenCL platforms found
");
// Loop over platforms
for (unsigned int platform = 0; platform < Nplat; platform++) {
if (clGetPlatformInfo(platforms[platform], CL_PLATFORM_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL platform name
");
if (verbose) printf("OpenCL Platform %d: %s
", platform, name);
// Get GPU device IDs
cl_uint Ndev;
cl_device_id id[4];
if (clGetDeviceIDs(platforms[platform], deviceType, 4, id, &Ndev))
Fatal("Cannot get number of OpenCL devices: %d
", platform);
else if (Ndev < 1) Fatal("No OpenCL devices found.
");
// Find the fastest device
for (unsigned int devId = 0; devId < Ndev; devId++) {
// Print informatio about the device
cl_uint compUnits, freq, cacheLineSize;
cl_ulong memSize, maxAlloc, localMemSize, globalCacheSize;
size_t maxWorkGrps;
char deviceVendor[50];
char deviceVersion[50];
char driverVersion[50];
char deviceOpenCLVersion[50];
// Computing Power...
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compUnits), &compUnits, NULL)) Fatal("Cannot get OpenCL device units
");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, NULL)) Fatal("Cannot get OpenCL device frequency
");
if (clGetDeviceInfo(id[devId], CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name
");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGrps), &maxWorkGrps, NULL)) Fatal("Cannot get OpenCL max work group size
");
// Memory...
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(memSize), &memSize, NULL)) Fatal("Cannot get OpenCL memory size.
");
if (clGetDeviceInfo(id[devId], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL)) localMemSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, NULL)) Fatal("Cannot get OpenCL memory size.
");
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(globalCacheSize), &globalCacheSize, NULL)) globalCacheSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, sizeof(cacheLineSize), &cacheLineSize, NULL)) cacheLineSize = 0;
// Versions...
clGetDeviceInfo(id[devId], CL_DEVICE_VENDOR, sizeof(deviceVendor), deviceVendor, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_VERSION, sizeof(deviceVersion), deviceVersion, NULL);
clGetDeviceInfo(id[devId], CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_OPENCL_C_VERSION, sizeof(deviceOpenCLVersion), deviceOpenCLVersion, NULL);
int Gflops = compUnits * freq;
if (verbose) printf(" ----- OpenCL Device # %d: %s-----
"
"Gflops: %f
"
"Max Clock Frequency: %d
"
"Max Compute Units: %d
"
"Max Work Group Size: %zu
"
" MEMORY...
"
"Total Memory of Device: %s (CL_DEVICE_GLOBAL_MEM_SIZE)
"
"Local Memory of Device: %s (CL_DEVICE_LOCAL_MEM_SIZE)
"
"Max Memory Object Allocation: %s (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
"
"Cache Size: %s (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
"
"Cacheline Size: %s (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
"
" VERSIONS...
"
"Device Vendor: %s
"
"Device Version: %s
"
"Driver Version: %s
"
"Device OpenCL Version: %s
",
devId,
name,
(1e-3 * Gflops),
freq,
compUnits,
maxWorkGrps,
byteConverter((unsigned long)memSize),
byteConverter((unsigned long)localMemSize),
byteConverter((unsigned long)maxAlloc),
byteConverter((unsigned long)globalCacheSize),
byteConverter((unsigned long)cacheLineSize),
deviceVendor,
deviceVersion,
driverVersion,
deviceOpenCLVersion);
if(Gflops > MaxGflops)
{
_deviceID = id[devId];
MaxGflops = Gflops;
winnerPlatform = platforms[platform];
}
}
}
// Print fastest device info (TODO: don't get name twice)
if (clGetDeviceInfo(_deviceID, CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name
");
printf("
Selected Fastest Open CL Device: %s (#%lu)
", name, (unsigned long)_deviceID);
// Check thread count
size_t mwgs;
if (clGetDeviceInfo(_deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(mwgs), &mwgs, NULL))
Fatal("Cannot get OpenCL max work group size
");
// Create OpenCL context for fastest device
cl_context_properties cps[3] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)winnerPlatform,
(cl_context_properties)0
};
_context = clCreateContextFromType(cps, deviceType, NULL, NULL, &err);
if (!_context || err) Fatal("Cannot create OpenCL Context
");
// Properties for create command queue; currently nothing
// cl_command_queue_properties *propers;
cl_command_queue_properties prop = 0;
//prop |= CL_QUEUE_PROFILING_ENABLE;
//prop |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
// propers = ∝
_queue = clCreateCommandQueueWithProperties(_context, _deviceID, &prop, &err); // Create OpenCL command queue for fastest device
// _queue = clCreateCommandQueue(_context, _deviceID, &prop, &err);
if (!_queue || err) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create OpenCL command cue: CL_INVALID_CONTEXT
");
else if (err == CL_INVALID_DEVICE) Fatal("Cannot create OpenCL command cue: CL_INVALID_DEVICE
");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create OpenCL command cue: CL_INVALID_VALUE
");
else if (err == CL_INVALID_QUEUE_PROPERTIES) Fatal("Cannot create OpenCL command cue: CL_INVALID_QUEUE_PROPERTIES
");
else if (err == CL_OUT_OF_RESOURCES) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_RESOURCES
");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_HOST_MEMORY
");
else if (!_queue) Fatal("Cannot create OpenCL command cue: !queue
");
else Fatal("Cannot create OpenCL command cue: ?????
");
}
if (_VERBOSE) printf("Init complete.
");
return mwgs;
}
/*
* Modify the source text to fit this run.
*/
char* ModifySourceText(unsigned int nTracesIn, unsigned int samplesPerT) {
size_t sourceSize = strlen(M_AND_S_OPENCL_SOURCE_TEXT) + 1;
char* moveStackSourceCode = new char[sourceSize];
strncpy(moveStackSourceCode, M_AND_S_OPENCL_SOURCE_TEXT, sourceSize);
moveStackSourceCode[sourceSize] = '\0';
// Print out the locations of the characters where we should insert other text if asked to do so
if (_FIND_INSERT_LOCATIONS) {
size_t z;
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '?') {
printf("Found ? at position %zu
", z);
break;
}
}
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '#') {
printf("Found # at position %zu
", z);
break;
}
}
}
// Insert the digit that for loops go to inside of the source
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_1, nTracesIn);
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_2, samplesPerT);
// Print the modified source code if verbose
if (_FIND_INSERT_LOCATIONS) {
printf("
GPU Source Code:
");
printf("%s
", moveStackSourceCode);
}
return moveStackSourceCode;
}
/*
* Wait for event and then release it.
*/
static void WaitForEventAndRelease(cl_event *event) {
printf("WaitForEventAndRelease()
");
cl_int status = CL_SUCCESS;
status = clWaitForEvents(1, event);
if (status) Fatal("clWaitForEvents Failed with Error Code");
printf("About to release event...
");
status = clReleaseEvent(*event);
if (status) Fatal("clReleaseEvent Failed with Error Code");
}
// Runs the program via open CL
static double RunOpenCL(float prestackTracesArray[], float stackTracesOut1DArray[], float powerTracesOut1DArray[],
unsigned int nTracesOut, unsigned int nTracesIn, unsigned int samplesPerT,
size_t inXsamples, size_t outXsamples,
unsigned int localThreadCount)
{
cl_int err;
// Get the source code
char* modifiedGpuSource = ModifySourceText(nTracesIn, samplesPerT);
// Allocate device memory
// CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR | CL_MEM_USE_PERSISTENT_MEM_AMD (?)
// Input...
cl_mem prestackTracesCL = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
inXsamples * sizeof(cl_float), prestackTracesArray, &err);
if (err) FatalBufferCreation("Prestack traces", err);
// Output... TODO: How do we know that the output is zeroed out?
cl_mem stackTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Stack traces", err);
cl_mem powerTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Power traces", err);
// Compile the source code
char* gpuSourceText[1];
gpuSourceText[0] = modifiedGpuSource;
size_t sourceLength[1];
sourceLength[0] = strlen(modifiedGpuSource);
cl_program moveoutAndStackCLProgram = clCreateProgramWithSource(_context, 1, (const char**)gpuSourceText,
(const size_t*)sourceLength, &err);
if (err != CL_SUCCESS) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create program: CL_INVALID_CONTEXT
");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create program: CL_INVALID_VALUE
");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create program: CL_OUT_OF_HOST_MEMORY
");
else Fatal("Cannot create program_S %d
", err);
}
// Build the program
cl_int buildCode = clBuildProgram(moveoutAndStackCLProgram, 0, NULL, NULL, NULL, NULL);
if (buildCode != CL_SUCCESS) {
// Attempt to get compile errors
char log[1048576];
if (clGetProgramBuildInfo(moveoutAndStackCLProgram, _deviceID, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL)) {
log[0] = '\0'; // Failed to get the log file
}
if (buildCode == CL_INVALID_PROGRAM) Fatal("Cannot build program: CL_INVALID_PROGRAM
%s", log);
else if (buildCode == CL_INVALID_VALUE) Fatal("Cannot build program: CL_INVALID_VALUE
%s", log);
else if (buildCode == CL_INVALID_DEVICE) Fatal("Cannot build program: CL_INVALID_DEVICE
%s", log);
else if (buildCode == CL_INVALID_BINARY) Fatal("Cannot build program: CL_INVALID_BINARY
%s", log);
else if (buildCode == CL_INVALID_BUILD_OPTIONS) Fatal("Cannot build program: CL_INVALID_BUILD
_OPTIONS
%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION
%s", log);
else if (buildCode == CL_COMPILER_NOT_AVAILABLE) Fatal("Cannot build program: CL_COMPILER_NOT_AVAILABLE
%s", log);
else if (buildCode == CL_BUILD_PROGRAM_FAILURE) Fatal("Cannot build program: CL_BUILD_PROGRAM_FAILURE
%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION
%s", log);
else if (buildCode == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot build program: CL_OUT_OF_HOST_MEMORY
%s", log);
else Fatal("Cannot build program: %d
%s", buildCode, log);
}
// Compile the source code & build the kernel
cl_kernel kernel = clCreateKernel(moveoutAndStackCLProgram, "sumAllCL", &err);
if (err) {
if (err == CL_INVALID_PROGRAM) Fatal("Cannot create kernel: CL_INVALID_PROGRAM
");
else if (err == CL_INVALID_PROGRAM_EXECUTABLE) Fatal("Cannot create kernel: CL_INVALID_PROGRAM_EXECUTABLE
");
else if (err == CL_INVALID_KERNEL_NAME) Fatal("Cannot create kernel: CL_INVALID_KERNEL_NAME
");
else if (err == CL_INVALID_KERNEL_DEFINITION) Fatal("Cannot create kernel: CL_INVALID_KERNEL_DEFINITION
");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create kernel: CL_INVALID_VALUE
");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create kernel: CL_OUT_OF_HOST_MEMOR
");
else Fatal("Cannot create kernel: %d
", err);
}
// Set program parameters
cl_int returnValArgSet;
returnValArgSet = clSetKernelArg(kernel, 0, sizeof(cl_mem), &prestackTracesCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("prestackTracesCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 1, sizeof(cl_mem), &stackTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("stackTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 2, sizeof(cl_mem), &powerTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("powerTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 3, sizeof(unsigned int), &nTracesOut);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesOut", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 4, sizeof(unsigned int), &nTracesIn);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesIn", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 5, sizeof(unsigned int), &samplesPerT);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("samplesPerT", returnValArgSet);
// TODO: verbose
printf("About to run Kernel...
");
// Start timer TODO: move?
double runTime = GetTime();
// Run the kernel (& also set the number of threads)
cl_event runEvent;
size_t Global[1] = { nTracesOut };
size_t Local[1] = { localThreadCount };
if (localThreadCount > 0) err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, Local, 0, NULL, &runEvent);
else err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, NULL, 0, NULL, &runEvent);
if (err) {
if (err == CL_INVALID_PROGRAM_EXECUTABLE) {
Fatal("Cannot run Kernel: No successfully built program executable available.
");
} else if (err == CL_INVALID_COMMAND_QUEUE) {
Fatal("Cannot run Kernel: Command_queue is not a valid command-queue.
");
} else if (err == CL_INVALID_KERNEL) {
Fatal("Cannot run Kernel: Kernel is not a valid kernel object.
");
} else if (err == CL_INVALID_CONTEXT) {
Fatal("Cannot run Kernel: Context associated with command_queue and kernel is not the same or if "
"the context associated with command_queue and events in event_wait_list are not the same.
");
} else if (err == CL_INVALID_KERNEL_ARGS) {
Fatal("Cannot run Kernel: Kernel argument values have not been specified.
");
} else if (err == CL_INVALID_WORK_DIMENSION) {
Fatal("Cannot run Kernel: work_dim is not a valid value (must be between 1 and 3).
");
} else if (err == CL_INVALID_WORK_GROUP_SIZE) {
Fatal("Cannot run Kernel: local_work_size is specified and number of work-items specified by global_work_size "
"is not evenly divisable by size of work-group given by local_work_size or does not match the "
"work-group size specified for kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) "
"qualifier in program source.
");
} else if (err == CL_INVALID_WORK_ITEM_SIZE) {
Fatal("Cannot run Kernel: If the number of work-items specified in any of local_work_size[0], ... "
"local_work_size[work_dim - 1] is greater than the corresponding values specified "
"by CL_DEVICE_MAX_WORK_ITEM_SIZES[0], .... CL_DEVICE_MAX_WORK_ITEM_SIZES[work_dim - 1]. .
");
} else if (err == CL_INVALID_GLOBAL_OFFSET) {
Fatal("Cannot run Kernel: Global_work_offset is not NULL.
");
} else if (err == CL_OUT_OF_RESOURCES) {
Fatal("Cannot run Kernel: CL_OUT_OF_RESOURCES.
");
} else if (err == CL_MEM_OBJECT_ALLOCATION_FAILURE) {
Fatal("Cannot run Kernel: Failure to allocate memory for data store associated with image or buffer "
"objects specified as arguments to kernel.
");
} else if (err == CL_INVALID_EVENT_WAIT_LIST) {
Fatal("Cannot run Kernel: event_wait_list is NULL and num_events_in_wait_list > 0, or event_wait_list "
"is not NULL and num_events_in_wait_list is 0, or if event objects in event_wait_list "
"are not valid events..
");
} else if (err == CL_OUT_OF_HOST_MEMORY) {
Fatal("Cannot run Kernel: Failure to allocate resources required by the OpenCL implementation on the host.
");
} else {
Fatal("Cannot run Kernel: Unknown Error. (clEnqueueNDRangeKernel)");
}
}
// Flush the program & wait for the program to finish executing
if (clFlush(_queue)) printf("Flush Fail (Run)");
WaitForEventAndRelease(&runEvent);
// Copy the end result back to CPU memory side
if (clEnqueueReadBuffer(_queue, stackTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), stackTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy stackTracesOutCL from device to host
");
if (clEnqueueReadBuffer(_queue, powerTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), powerTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy powerTracesOutCL from device to host
");
// Release kernel and program
if (clReleaseKernel(kernel)) Fatal("Cannot release kernel
");
if (clReleaseProgram(moveoutAndStackCLProgram)) Fatal("Cannot release program
");
// Free device memory
clReleaseMemObject(prestackTracesCL);
clReleaseMemObject(stackTracesOutCL);
clReleaseMemObject(powerTracesOutCL);
// Release the context and queue
clReleaseCommandQueue(_queue);
clReleaseContext(_context);
// Return the time it took to run this program
return runTime;
}
// end of CL functions
// Returns a float 0.0 - 1.0, inclusive
float RandomFloat() {
return static_cast <float> (rand()) / static_cast <float>(RAND_MAX);
}
// Fill in the prestack traces array
void RandomFillArray(float* fillArray, unsigned int length) {
srand(time(NULL)); // Give a "random" seed
for (unsigned int r = 0; r < length; r++) {
fillArray[r] = RandomFloat() * 1000.0f;
}
}
// Runs the program
double RunProg(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut,
unsigned int localThreadCount, unsigned int deviceType) {
// Stores sizes of the various arrays
size_t tracesInxSample = nTracesIn * samplesPerTrace;
size_t tracesOutxSample = nTracesOut * samplesPerTrace;
// Allocate arrays
float* prestackTraces1D = (float*)malloc(tracesInxSample * sizeof(float));
float* stackTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
float* powerTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
// float* stackTracesOut1Dcpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
// float* powerTracesOut1Dcpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
// Count how much memory all of this is
if (_VERBOSE)
{
// Make sure it is consistent with above allocation
unsigned long allocatedMemory = 0;
allocatedMemory += tracesInxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);
printf("TOTAL MEMORY ALLOCATED: %s
", byteConverter(allocatedMemory));
printf("Input Array Sizes: %s
", byteConverter((unsigned int)(tracesInxSample * sizeof(float))));
printf("Output Array Sizes: %s
", byteConverter((unsigned int)(tracesOutxSample * sizeof(float))));
}
// Fill in array with randoms
RandomFillArray(prestackTraces1D, (unsigned int)tracesInxSample);
// Init OpenCL using the desired device type
double preInitTime = GetTime();
int maxWorkGroupSize;
if (deviceType == 0) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_ALL);
else if (deviceType == 1) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_GPU);
else maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_CPU);
printf("Max work size for the device is: %d
", maxWorkGroupSize);
// --- ACTUAL TEST ---
// Run OpenCL
double startTime = GetTime();
double runTime = RunOpenCL(prestackTraces1D, stackTracesOut1Dgpu, powerTracesOut1Dgpu, // arrays
nTracesOut, nTracesIn, samplesPerTrace, // ints
tracesInxSample, tracesOutxSample,
localThreadCount); // samples
// Display run time
double endTime = GetTime();
printf("Elapsed Time: %fsecs
", (endTime - runTime));
printf(" %fsecs (Before Function Call)
", (endTime - startTime));
printf(" %fsecs (Including Init)
", (endTime - preInitTime));
// Free the 1D arrays
free(powerTracesOut1Dgpu);
free(stackTracesOut1Dgpu);
free(prestackTraces1D);
return (endTime - startTime);
}
void CheckFaster(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut,
unsigned int localThreadCount, unsigned int deviceType, double* fasterThan, unsigned int* fastestLocalCount) {
double speed = RunProg(samplesPerTrace, nTracesIn, nTracesOut, localThreadCount, deviceType);
if (speed > *fasterThan) {
*fasterThan = speed;
*fastestLocalCount = localThreadCount;
}
}
// Main
int main(int argc, char* argv[]) {
if (!((argc == 6) || (argc == 8) || (argc == 10))) {
Fatal("Incorrect # of Arguments, need 3: SamplesPerTrace TracesIn TracesOut
"
"or 8: SamplesPerTrace TracesIn TracesOut SamplesPerTraceIncrementAmount
"
"TracesInIncrementAmount TracesOutIncrementAmount IncrementCount
");
}
unsigned int samplesPerTrace = atoi(argv[1]);
unsigned int nTracesIn = atoi(argv[2]);
unsigned int nTracesOut = atoi(argv[3]);
unsigned int localThreadCount = atoi(argv[4]);
unsigned int deviceType = atoi(argv[5]);
if (argc == 6) RunProg(samplesPerTrace, nTracesIn, nTracesOut, localThreadCount, deviceType);
else if (argc == 8) { // Try mutliple iterations; increasing the first three variables each time
unsigned int samplesPerTraceIncrementAmount = atoi(argv[4]);
unsigned int tracesInIncrementAmount = atoi(argv[5]);
unsigned int tracesOutIncrementAmount = atoi(argv[6]);
unsigned int incrementCount = atoi(argv[7]);
for (unsigned int x = 0; x <= incrementCount; x++) {
unsigned int nextSPT = samplesPerTrace + (x * samplesPerTraceIncrementAmount);
unsigned int nextTracesIn = nTracesIn + (x * tracesInIncrementAmount);
unsigned int nextTracesOut = nTracesOut + (x * tracesOutIncrementAmount);
printf("
NEXT RUN: %u %u %u
", nextSPT, nextTracesIn, nextTracesOut);
RunProg(nextSPT, nextTracesIn, nextTracesOut, localThreadCount, deviceType);
}
} else if (argc == 10) { // Work-in-progress, currently hard coded to several test cases
double fastestGPU = 0.0;
double* fastestGPUp;
fastestGPUp = &fastestGPU;
double fastestCPU = 0.0;
double* fastestCPUp;
fastestCPUp = &fastestCPU;
unsigned int bestLocalGPU = 0;
unsigned int* bestLocalGPUp;
bestLocalGPUp = &bestLocalGPU;
unsigned int bestLocalCPU = 0;
unsigned int* bestLocalCPUp;
bestLocalCPUp = &bestLocalCPU;
// GPU test
CheckFaster(25000, 2500, 250, 50, 1, fastestGPUp, bestLocalGPUp);
CheckFaster(25000, 2500, 250, 25, 1, fastestGPUp, bestLocalGPUp);
CheckFaster(25000, 2500, 250, 10, 1, fastestGPUp, bestLocalGPUp);
CheckFaster(25000, 2500, 250, 5, 1, fastestGPUp, bestLocalGPUp);
CheckFaster(25000, 2500, 250, 2, 1, fastestGPUp, bestLocalGPUp);
CheckFaster(25000, 2500, 250, 1, 1, fastestGPUp, bestLocalGPUp);
// CPU test
CheckFaster(25000, 2500, 250, 50, 2, fastestCPUp, bestLocalCPUp);
CheckFaster(25000, 2500, 250, 25, 2, fastestCPUp, bestLocalCPUp);
CheckFaster(25000, 2500, 250, 10, 2, fastestCPUp, bestLocalCPUp);
CheckFaster(25000, 2500, 250, 5, 2, fastestCPUp, bestLocalCPUp);
CheckFaster(25000, 2500, 250, 2, 2, fastestCPUp, bestLocalCPUp);
CheckFaster(25000, 2500, 250, 1, 2, fastestCPUp, bestLocalCPUp);
printf("
");
printf("Fastest GPU: %f with local thread count %u
", fastestGPU, bestLocalGPU);
printf("Fastest CPU: %f with local thread count %u
", fastestCPU, bestLocalCPU);
}
return 0;
}
Note: I had to change characters in my code to other characters because otherwise this forum wouldn’t accept them… I changed those characters to ?. I hope it still compiles.
My (almost certainly false) speculation
My first thought as to why it’s running so much slower on my GPU than my CPU is that maybe it’s because I am busing so much data over the graphics card before anything runs. Perhaps a better implementation would involve splitting the workload in multiple runs, so that code can be executing while more data is being bused over (I presume that’s a thing). However now that I think about it this is almost certainly false, because as I said I wrote this program based on an example, and that example did matrix multiplication, and that example runs much much faster on the GPU than my CPU. I don’t really know what the difference is.
If desired, I can also post the code for that program.