Access computing devices over an IP-based network with OpenCL

Hello,

I'd like to announce my project, CLara, which provides access to OpenCL platforms and devices over the network.

It consists of three components: a provider, an agent, and a consumer. The provider is a network client program that makes the local computing devices available using the locally installed OpenCL library of a hardware vendor. It connects to the agent, a kind of proxy server that handles all requests from so-called consumers. The consumer is your own OpenCL program, linked against a library that handles the network communication transparently; the application doesn't know that the computing device is on a remote host.

The development of the framework is at a very early stage, but it is already working. The concept is proven. The implementation of the OpenCL API is almost complete. Simple programs are working, but the framework is not fault tolerant yet. There is still a lot to do.

Things you should know:

  • CLara uses SCTP/IP as its communication protocol
  • you can use existing OpenCL applications without recompilation
  • there's a simple load-balancing and resource allocation mechanism
  • CLara is currently provided under the terms of the GPL
  • the development was initiated at the Technical University of Berlin, Germany

http://www.sourceforge.net/projects/clara/

I'm eagerly awaiting your questions, comments, feedback, bug reports, and so on.

Best regards
Björn

Does this provide access to remote devices as a new platform from within the OpenCL API?

Currently there are two modes implemented:

  1. direct platform access - the platforms of each host are visible to the application. For example, if you have 20 hosts, each providing four devices, then clGetPlatformIDs returns 20 platforms, and clGetDeviceIDs returns four devices for each platform.

  2. load-balanced access - there is always only one platform presented to the application, but each instance of the application will actually access a different platform, i.e. a call to clGetPlatformIDs returns one platform and the framework locks this platform for exclusive use by the application. If no platform is left, then clGetPlatformIDs either returns zero platforms or blocks the application until a platform is available. The termination of an application process releases the platform back to the pool of available platforms.
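The semantics of the load-balanced mode (exclusive locking, returning zero platforms or blocking until one is free, release on termination) can be sketched as a condition-variable-guarded pool. This is a hypothetical illustration of the behaviour described above, not CLara's actual code; the `platform_pool` type and all function names are invented for the example.

```c
#include <pthread.h>

/* Hypothetical sketch of the agent's load-balanced allocation:
 * a pool of platform slots, each locked exclusively by one consumer.
 * Not CLara's actual implementation. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  freed;
    int available;  /* platforms not currently locked by a consumer */
} platform_pool;

void pool_init(platform_pool *p, int total)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->freed, NULL);
    p->available = total;
}

/* Blocking variant: wait until a platform is free, then lock it. */
void pool_acquire(platform_pool *p)
{
    pthread_mutex_lock(&p->lock);
    while (p->available == 0)
        pthread_cond_wait(&p->freed, &p->lock);
    p->available--;
    pthread_mutex_unlock(&p->lock);
}

/* Non-blocking variant: returns 1 on success, 0 if no platform is
 * free (the "clGetPlatformIDs returns zero platforms" behaviour). */
int pool_try_acquire(platform_pool *p)
{
    int ok;
    pthread_mutex_lock(&p->lock);
    ok = p->available > 0;
    if (ok)
        p->available--;
    pthread_mutex_unlock(&p->lock);
    return ok;
}

/* Called when a consumer process terminates: the platform goes back
 * to the pool and one waiting consumer is woken up. */
void pool_release(platform_pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->available++;
    pthread_cond_signal(&p->freed);
    pthread_mutex_unlock(&p->lock);
}
```

A consumer would call the blocking variant from inside its clGetPlatformIDs wrapper and the release on process exit.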

The access mode that you suggest, i.e. providing all devices within a single platform, is not easily possible in an efficient manner. Think about the implications: if you have one platform with 80 devices spread across 20 hosts, then the application should be able to create a single context with these 80 devices using clCreateContext. That's still possible. The problem is that memory objects exist per context and all devices within a context can access that memory. That means you have to keep the memory synchronized between 20 hosts while it is altered by 80 devices, which is a huge amount of work.

There are many possible ways to deal with this problem, but in the end you have to ask yourself “Is it worth it?” or even “Is it necessary?” - I think it is not, because I'm a follower of the good ol' Unix paradigm “keep it simple, stupid”. The primary goal of this project is to let OpenCL developers simply access platforms and devices in the network without doing the networking themselves. Everything else can be built on top of my work if it is really necessary.

I hope I answered your question. :-)

Best regards
Björn

That is very cool. I talked with some friends about doing something similar a year or so ago, but we never got around to it.

This is a very cool project. I hope it gains traction, as our customers are already looking at small clusters of multi-GPU machines. A couple of comments:

  • Please don't use the GPL! You're making it unusable for a whole segment of developers (including myself). Proprietary software cannot be written against GPL libraries (even if I'm just linking in your library through the standard OpenCL interface). The LGPL is only slightly better; I still would not be able to statically link your library. BSD or equivalent would be best.
  • How is access to remote machines handled? A service on the remote machine? Or can a user with ssh access simply access the remote machine?
  • How do you differentiate between host memory and device memory on remote machines? For example, I may want to pre-load large amounts of data into main memory on the remote host since the individual devices can’t handle the amount of data (think distributed database servers).

Good luck!

Every line of source code that I have written is licensed under the terms of a 3-clause BSD license. The reason for providing the project as a whole under the terms of the GPL is that the LZO library and the red-black tree implementation require the use of the GPL. My plan is to move towards a GPL-free, i.e. BSD-licensed, implementation. The red-black tree implementation is not necessary for the library part of CLara, but that part still links to the LZO library. I'm thinking of linking the library dynamically at runtime (using dlopen and friends), where my code behaves the same but won't be annexed by Richard M. Stallman. At the moment a solution would be to build a library without LZO compression, which makes the code GPL-free, but this causes the memory to be transferred uncompressed (i.e. very slowly).

  • How is access to remote machines handled? A service on the remote machine? Or can a user with ssh access simply access the remote machine?
    The remote machine (the one with the GPGPU hardware) runs a network client that connects to a server providing the computing capabilities. Secure connections are not possible right now because SSH does not allow tunneling SCTP packets, as far as I know. I'm aware of the consequences and will take that into account for future decisions.
  • How do you differentiate between host memory and device memory on remote machines? For example, I may want to pre-load large amounts of data into main memory on the remote host since the individual devices can’t handle the amount of data (think distributed database servers).
    Currently you can't explicitly request that the remote machine's memory be used. The OpenCL specification doesn't differentiate between local and remote host memory because it doesn't cover cluster features. Maybe I can abuse the CL_MEM_USE_HOST_PTR flag of the clCreateBuffer function, but I'm not sure. I suspect that I need to extend the API to provide such a feature.

Best regards
Björn

Addendum regarding the GPL issue:

Companies that are seriously interested in using free code can't really complain about the GPL license. That's an old discussion. If they want to do what they want, then they have to pay for it. Consider that I have put approximately 300-400 hours of work (~15,000 US dollars) into this project so far. I'm comparatively accommodating, since I want to provide the code under the terms of the BSDL. Companies that want to motivate me to move faster towards a GPL-free library can honor my work by using PayPal.

Best regards
Björn

[quote]At the moment a solution would be to build a library without LZO compression, which makes the code GPL-free, but this causes the memory to be transferred uncompressed (i.e. very slowly).[/quote]

Consider using zlib (http://www.zlib.net/); we've used it for years for on-the-fly compression.

[quote]Currently you can't explicitly request that the remote machine's memory be used. The OpenCL specification doesn't differentiate between local and remote host memory because it doesn't cover cluster features. Maybe I can abuse the CL_MEM_USE_HOST_PTR flag of the clCreateBuffer function, but I'm not sure. I suspect that I need to extend the API to provide such a feature.[/quote]

Provided I could always see the CPU on the remote machine as another device, I could use the memory attached to it and then use clEnqueueCopyBuffer to move data between main memory (CPU memory) and the devices.
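The staging idea above only needs standard API calls. A hypothetical sketch, assuming a context `ctx` that contains both the remote CPU and a remote GPU device and a command queue `cpu_queue` on the CPU device (where a buffer physically lives is implementation-defined in OpenCL, so this shows only the call sequence; error handling omitted):

```c
#include <CL/cl.h>

/* Hypothetical: pre-load data into a buffer intended for the remote
 * host's main memory, then copy it device-to-device to the GPU,
 * without touching the consumer's host again. */
cl_mem stage_to_gpu(cl_context ctx, cl_command_queue cpu_queue,
                    const void *data, size_t size)
{
    cl_int err;

    /* Staging buffer, intended to live in the remote host's RAM. */
    cl_mem staging = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size,
                                    NULL, &err);
    /* Working buffer for the GPU device. */
    cl_mem device  = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size,
                                    NULL, &err);

    /* Pre-load the data once (blocking write)... */
    clEnqueueWriteBuffer(cpu_queue, staging, CL_TRUE, 0, size, data,
                         0, NULL, NULL);
    /* ...then move it between buffers without another host round trip. */
    clEnqueueCopyBuffer(cpu_queue, staging, device, 0, 0, size,
                        0, NULL, NULL);
    clFinish(cpu_queue);

    clReleaseMemObject(staging);
    return device;
}
```

For the distributed-database case, the same pattern would copy sub-ranges of the staging buffer to individual devices on demand via the offset arguments of clEnqueueCopyBuffer.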

It may be worth providing it additionally. LZO has unbeatable real-time characteristics, i.e. high throughput and low latency, which matter more here than the compression ratio.

That's a very good idea, thank you. I will seriously take it into account.

Best regards
Björn