PyOpenCL in main!

Yesterday (2014-05-26) my sponsor Piotr Ożarowski uploaded a new version of PyOpenCL to Debian. Usually I can upload new versions of packages to Debian myself, as I am a Debian Maintainer. But this time it was a very special upload: it closed bug 723132, which asked to move PyOpenCL from contrib to main. Because Debian now contains free OpenCL implementations, Beignet and Mesa, one can run OpenCL programs using FLOSS code.

Moving a package from contrib to main meant that PyOpenCL had to be removed from contrib and uploaded anew to main. Thanks to Piotr for sponsoring it, and to the FTP masters for accepting it from NEW and dealing with all the removal and re-adding of the package.

There is still work to do. Rebecca Palmer is working on allowing all OpenCL implementations to be installed at the same time, which should lead to more experimentation and easier work with OpenCL, but requires changes to many of the OpenCL-related packages. I’m also thinking about moving PyOpenCL to use pybuild, but this needs to wait until I have more free time.

Let’s hope that having PyOpenCL in main will allow more people to find and use it.

Aparapi

Some time ago I read about Aparapi (the name comes from A PARallel API), a Java library for running code on the GPU which was open-sourced in 2011. There are videos and a webinar by Gary Frost (the main contributor, according to the SVN log) from 2011-12-22 describing it on the main page. There is also a blog about it. Aparapi allows threads to run either on the CPU (via a Java Thread Pool) or on the GPU (using OpenCL).

The Aparapi programming model is similar to what we already know from Java; we inherit from the class Kernel and override the method run(), just like we would create subclasses of Thread and override run() for ordinary threads. We write code in Java, which means that e.g. if our Kernel subclass is an inner class, it can access objects available in the enclosing class. Because of OpenCL limitations (for example, no full object orientation and no recursive functions), Aparapi kernels cannot use inheritance or method overloading.

Aparapi presents a concept similar to the one behind ORM (Object-Relational Mapping): it tries to avoid the “internal” language (SQL or OpenCL, respectively) at the price of hiding some details and potentially sacrificing performance.

It comes with a few examples (Mandelbrot, n-body, Game of Life – which could gain from using local memory) showing its possible usages.

I look at Aparapi as someone who uses, contributes to, and is used to PyOpenCL, so I am more interested in its implementation than in its API.

Usage

To run the simplest kernel, doubling an array on the GPU, we can just write (the input and output arrays must be visible to the anonymous class, e.g. final locals or fields):

final int size = 1024;
final float[] input = new float[size];    // filled with data elsewhere
final float[] output = new float[size];

new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        output[i] = 2.0f * input[i];
    }
}.execute(size);

When the GPU cannot be used (for example there is no OpenCL-capable GPU, or there are problems compiling the kernel), a Java thread pool is used instead.

We can use special methods inside the Kernel class to access OpenCL information:

  • getGlobalId()
  • getGroupId()
  • getLocalId()
  • getPassId()

As with ordinary OpenCL programs, the host code is responsible for managing kernels, copying data, initialising arrays, etc.

There are some limitations stemming from OpenCL itself. Aparapi supports only one-dimensional arrays, so we need to simulate multi-dimensional arrays by calculating indices ourselves (see the sketch below). Support for doubles depends on the OpenCL implementation and the GPU, which means that we need to check it at runtime.
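For example, a flat buffer can stand in for a two-dimensional array; here is a minimal sketch (the buffer name and dimensions are made up for illustration):

final int width = 64, height = 64;
final float[] grid = new float[width * height];

new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        int x = gid % width;             // column
        int y = gid / width;             // row
        grid[y * width + x] = x + y;     // address the flat buffer as if it were grid[y][x]
    }
}.execute(width * height);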

There is no support for static methods, method overloading, or recursion, which is caused by the lack of support for them in OpenCL 1.x. Of course OpenCL code does not throw exceptions when run on the GPU; this means that the same code may behave differently on the CPU and the GPU. The GPU will not throw ArrayIndexOutOfBoundsException or ArithmeticException, while the CPU will.

There is a page with guidelines on writing Kernel classes.

There is also a special case of the execute method:

kernel.execute(size, count)

which loops count times over the kernel. Inside such a kernel we can call getPassId() to get the index of the current pass, e.g. to prepare data only on the first iteration, as in the sketch below.
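A hypothetical multi-pass kernel might look like this (a minimal sketch; the buffer, its contents and the pass count are invented, not taken from Aparapi’s examples):

final int size = 1024;
final float[] state = new float[size];

Kernel kernel = new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        if (getPassId() == 0) {
            state[i] = i;               // prepare data only on the first pass
        }
        state[i] = state[i] * 0.5f;     // work performed on every pass
    }
};
kernel.execute(size, 10);               // loop the kernel 10 times over 'size' threads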

Implementation

Some details of converting Java to OpenCL are described on the wiki page.

When trying to run a Kernel, Aparapi checks whether this is the first usage, and if not, whether the code has already been compiled. If it has, the OpenCL source generated by Aparapi is reused. If previous compilation failed, Aparapi uses the Java Thread Pool and runs the code on the CPU, as there must have been trouble in previous runs. If this is the first run of the kernel, Aparapi tries to compile it; if it succeeds, the kernel runs on the GPU, otherwise it falls back to the CPU.

In case of any trouble, Aparapi uses the CPU and the Java Thread Pool to run computations.

When using the Java Thread Pool, Aparapi creates a pool of threads, one thread per CPU core. Then it clones the Kernel object so each thread has its own copy, to avoid cross-thread access. Each thread calls the run() method globalSize/threadCount times. Aparapi updates special variables (like globalId) after each run() execution, and waits for all threads to finish. A conceptual sketch of this dispatch is shown below.
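The sketch below is my reading of that behaviour, written without Aparapi classes; it is not the actual Aparapi source (real Aparapi clones the Kernel per thread and updates its internal globalId field rather than passing an index):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JtpSketch {
    // Stand-in for Kernel.run(); real Aparapi reads getGlobalId() instead of a parameter.
    interface Body { void run(int globalId); }

    static void execute(Body body, int globalSize) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        int perThread = globalSize / threads;          // assumes globalSize is divisible by threads
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int start = t * perThread;
            pool.submit(() -> {
                // Each worker runs its share of the global range sequentially;
                // real Aparapi gives each thread its own clone of the Kernel.
                for (int id = start; id < start + perThread; id++) {
                    body.run(id);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);    // wait for all threads to finish
    }
}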

Aparapi classes

There are four sets of classes. One set consists of classes responsible for communication with OpenCL and managing OpenCL objects; it contains JNI classes and their partners on the Java side. The situation is similar to PyOpenCL, where there are also C++ classes and their Python wrappers. Then there are the exceptions. There are the special classes – the Kernel class (from which the programmer inherits) and the Config class. And finally there are the classes responsible for analysing Java bytecode and generating OpenCL kernels.

The Config class is responsible for runtime configuration, allowing some features to be enabled or disabled.

Exceptions:

  • AparapiException – base class, all other exceptions inherit from it
  • ClassParseException – unsupported Java code was encountered during code analysis; ClassParseException contains details of the encountered error. The ClassParseException.TYPE enum contains details of which forbidden construct was found in the code.
  • CodeGenException
  • DeprecatedException
  • RangeException

OpenCL mapping classes:

  • Range – partner for JNIRange which manages work group sizes and their dimensionality
  • KernelArg – responsible for passing arguments to the kernels
  • JNIContext – responsible for running the kernel; manages only one device, can have many queues, but can deal with only one kernel
  • KernelRunner – checks hardware capabilities, can execute Java and OpenCL code
  • ProfileInfo – stores raw OpenCL performance-related information, like the start and end times of execution, status, and Java labels. It is returned by KernelRunner.

JNIContext is also responsible for managing the memory containing data passed to the kernel. It pins such memory to avoid it being collected or moved by the Garbage Collector. It also responds to changes in non-primitive objects, regenerating cl_mem buffers if it detects changes. Its longest method is responsible for running the kernel. It also contains some debugging capabilities, like dumping memory as floats or integers.

The UnsafeWrapper class is responsible for atomic operations, which is important when running multithreaded code in a heterogeneous environment. It is a wrapper around sun.misc.Unsafe and seems to be used to avoid compiler warnings during reflection. What’s interesting is that all exceptions in this class are marked with “TODO: Auto-generated catch block”.

The Kernel class is intended to be inherited by the programmer. It loads JNI, manages the Range, wraps Math methods, and provides memory barriers. Its execute() method accepts the global size, i.e. the number of threads to run. For now it is not possible to run threads in a 2D or 3D array, or to govern the local block size. The execute() method is blocking; it returns only when all threads have finished. On the first call it determines the execution mode (OpenCL or JTP).

The Kernel.EXECUTION_MODE enum allows us to control which mode is used to run our kernels:

  • CPU – OpenCL using CPU
  • GPU – OpenCL using GPU
  • JTP – Java Thread Pool (which usually means that there were problems with OpenCL)
  • SEQ – sequential loop on the CPU, very slow, used only for debugging

We can get the current execution mode by calling kernel.getExecutionMode(), or set it with kernel.setExecutionMode(mode) before calling execute(). For example:
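Here is a trivial snippet forcing the thread-pool fallback for debugging (it assumes an existing kernel object and a size variable):

kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);
kernel.execute(size);
System.out.println("Executed in mode: " + kernel.getExecutionMode());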

Java code analysis

To generate OpenCL code, Aparapi loads the class file, finds the run() method and all methods called by it (using reflection), then converts the bytecode to an expression tree. When analysing bytecode, Aparapi constructs a list of all accessed fields (using the getReferencedFields() method) and their access mode (read or write), converts scalars into kernel arguments, and converts primitive arrays into OpenCL buffers.

Bytecode consists of variable-length instructions, and the Java Virtual Machine is a stack-based machine, which means that when analysing bytecode we need to track the state of the stack. We also need to determine the types of arguments from the bytecode.

The main analysis algorithm creates a list of operations in Program Counter order, creating one object per bytecode instruction. All instructions are categorised based on their effect on the stack. In subsequent analysis, when an instruction consumes a value from the stack, the producing instruction is marked as a child of the consumer. After adding some hacks for bytecodes like dup and dup2, Aparapi ends up with a list of operation trees, ready to be used to generate OpenCL code.

There are classes intended for presenting bytecode as a hierarchy, just as a compiler would do, to help with generating kernels:

  • MethodModel checks whether a method uses doubles or is recursive, checks for getters and setters, and creates the list of instructions. It is a long class, containing Instruction transformers
  • InstructionSet describes bytecodes and types. Contains information on how each bytecode changes the stack (what it pops and pushes)
  • InstructionPattern matches bytecodes against Instructions
  • Instruction base class describing a bytecode; contains links to the previous and next instructions, its children, all branching targets and sources, etc. KernelWriter calls the writeInstruction() method for each object of this class.
  • ExpressionList list of instructions in a particular expression; used to build the list of expressions that are then transformed into OpenCL; a linked list with some ability to fold instructions
  • BranchSet represents a list of bytecodes from one branch of a conditional instruction
  • ClassModel
  • ByteBuffer buffer for accessing bytes in a *.class file, used to parse the ClassFile structure; used by ByteReader
  • ByteReader provides access to the *.class file; used mostly by ClassModel and MethodModel

There is an entire set of classes describing the hierarchy of bytecode instructions:

  • ArrayAccess
    • AccessArrayElement
    • AssignToArrayElement
  • Branch
    • ConditionalBranch
    • Switch
    • UnconditionalBranch – FakeGoto
  • CloneInstruction
  • CompositeInstruction
    • ArbitraryScope
    • EmptyLoop
    • ForEclipse, ForSun
    • IfElse
    • If
    • While
  • DUP, DUP_X1, DUP_X2, DUP2, DUP2_X1, DUP2_X2
  • Field
  • IncrementInstruction
  • OperatorInstruction
    • Binary
    • Unary
  • Return

And classes for generating code:

  • KernelWriter translates Java methods (getGlobal*) into OpenCL function calls. Converts Java types to OpenCL types. Also writes all needed pragmas: for fp64, atomics, etc.
  • InstructionTransformer used by MethodModel, abstract, its only children are created on the fly
  • BlockWriter writes instructions, which it gets as Expressions or InstructionSets

There are methods (supporting many types, like int, byte, long, double, float) helping with debugging and tuning:

  • get(array) copies the array back from the GPU
  • put(array) copies the array to the GPU

During code generation the dupX bytecode (duplicating X items on the stack) is replaced by repeating the instructions that generate those X items.

Performance

There are two compilations at runtime when using Aparapi: one is the analysis of bytecode and generation of OpenCL source (done by Aparapi), the other is the compilation of OpenCL into a binary understood by the GPU, performed by the driver.

There are some methods helping with profiling:

  • getConversionTime() returns the time it took to translate the bytecode into OpenCL code
  • getExecutionTime() returns the execution time of the last run()
  • getAccumulatedExecutionTime() returns the time it took to execute all calls to run() from the beginning.
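A minimal example of reading those numbers after a run (assuming an existing kernel object and a size variable):

kernel.execute(size);
System.out.println("bytecode to OpenCL conversion: " + kernel.getConversionTime());
System.out.println("last execution:                " + kernel.getExecutionTime());
System.out.println("accumulated execution:         " + kernel.getAccumulatedExecutionTime());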

All the usual performance tips apply to Aparapi. The ideal candidates for speed-up are operations on large arrays of primitives where the order of processing does not matter: large amounts of data and large computations with little data transfer (or transfer overlapped by computations). Avoid inter-element dependencies, as they break parallelism. As usual we need to avoid branching, as instructions in threads should be executed in lockstep.

It is possible to use local or constant memory in kernels generated by Aparapi. To do so we need to append _$constant$ or _$local$ to the variable name. Of course when using local memory we need to put barriers into kernels, which means that we risk deadlocks. It is again a case of a leaky abstraction, as we cannot pretend that we are writing plain Java code; we need to know the details of kernel execution.
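Below is a hedged sketch of the naming convention (the buffer size and the computation are invented; in real code the local buffer size has to match the work-group size used for execution):

final int groupSize = 64;                           // assumed work-group size
final float[] in = new float[1024];
final float[] out = new float[1024];
final float[] tile_$local$ = new float[groupSize];  // the suffix makes it __local in the generated OpenCL

Kernel kernel = new Kernel() {
    @Override public void run() {
        int lid = getLocalId();
        tile_$local$[lid] = in[getGlobalId()];      // stage data in local memory
        localBarrier();                             // every thread in the group must reach the barrier
        out[getGlobalId()] = tile_$local$[(lid + 1) % groupSize];
    }
};
kernel.execute(in.length);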

It is also possible to see the source of the generated OpenCL kernels, by adding the option -Dcom.amd.aparapi.enableShowGeneratedOpenCL=true to the JVM arguments.

Memory copying

Each execution of the kernel means copying all the data used in the kernel. Aparapi assumes that Java code (code outside of run()) can change anything, so it copies buffers to and from the device. We can avoid this performance-killing behaviour by demanding explicit copies, by calling:

kernel.setExplicit(true);

Then we are responsible for copying data, in the following manner:

kernel.put(block);
kernel.execute(size);
kernel.get(block);

If we are running the same computations over and over again, we can put a loop inside run(), or call execute(size, count) to let Aparapi manage the looping.
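With explicit mode both approaches can be combined so that the data stays on the device between iterations (a sketch with hypothetical buffer names):

kernel.setExplicit(true);
kernel.put(input);                      // copy input to the device once
for (int pass = 0; pass < passes; pass++) {
    kernel.execute(size);               // no implicit copies between iterations
}
kernel.get(output);                     // copy the result back once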

Summary

Aparapi is an interesting library. It can serve as an entry point to GPU computing, by allowing existing Java code to be translated to OpenCL to determine whether it is even possible to run it on the GPU.

Aparapi is still being developed; there is initial support for lambdas (in a separate branch), and the ability to choose the device to run code on, although this seems unfinished (according to the wiki). There are unit tests covering OpenCL code generation for those who want to experiment with it.

There is also a proposal for adding extensions to Aparapi; not OpenCL extensions, but the ability to attach real OpenCL code. If it succeeds, Aparapi will not differ much from PyOpenCL.

The main problem with Aparapi is the limit of one kernel and one run() method per class, which is the result of trying to model Kernel after the Thread class. There are some hacks to deal with it, but they are just hacks.

According to the documentation, reduction cannot be implemented in Aparapi. It is present in PyOpenCL, but implemented using more than one kernel, so it might be a problem with the model. There is also no support for volatile memory, which might also be a problem, e.g. for reduction (there was a problem with volatile memory and reduction in PyCUDA).

Another limitation is the inability to use multidimensional Java arrays. There are proposals for multidimensional ranges of execution (2D, 3D, etc.) and some work has been done, but nothing is finished yet. There is also no support for sharing data between OpenCL and OpenGL.

Aparapi does not support vector execution (SIMD), which could be used to speed up execution on the CPU; this might be the result of the lack of such support in Java, as described by Jonathan Parri et al. in the article “Returning control to the programmer: SIMD intrinsics for virtual machines”, CACM 2011/04 (sorry, paywalled).

I’m wondering how OpenCL 2.0, which allows for blocks and calling kernels from inside other kernels, will change support for advanced Java features in Aparapi.

I want to finish with a quote from the authors of Aparapi:

As a testament to how well we emulate OpenCL in JTP mode, this will also deadlock your kernel in JTP mode 😉 so be careful.

PyOpenCL and PyCUDA in Debian Wheezy

The latest versions of the PyOpenCL and PyCUDA packages (2012.1-1 in both cases) have reached testing; thanks to Piotr Ożarowski for sponsoring. Wheezy is frozen now, so there will be no new versions of PyOpenCL or PyCUDA in the new stable; the only allowed changes are bug fixes. No serious errors have been detected in my packages, so it looks like there is no need for a 2012.1-2.

All my packages have been going through some changes recently. Both PyCUDA and PyOpenCL were removed from testing during the Boost 1.49 transition, as they depended on the old Boost 1.46. Neither PyOpenCL nor PyCUDA could be rebuilt automatically, as they depend on non-free packages, but this might change in the case of PyOpenCL.

I have created a Python 3 package for pytools. This allowed for the creation of a PyOpenCL Python 3 package; it required extracting the headers into a separate package to allow installation of both the Python 2 and Python 3 versions. Unfortunately I forgot to add some metadata (Replaces and Conflicts fields), which was noticed by Andreas Beckman in bug #674000. I have fixed it. At the same time Patrick Matthai uploaded the AMD 12-4 drivers with OpenCL 1.2 support and Andreas Beckman uploaded new OpenCL 1.2 headers. This allowed for compiling PyOpenCL with OpenCL 1.2 support. Unfortunately NVIDIA does not support OpenCL 1.2, so I had to force a dependency on the AMD libraries.

Bug #673992 was initially created by Vedran Miletic as a request for clarification of this situation, but it soon became clear that using the AMD ICD loader with NVIDIA libraries fails in some cases. Using mixed libraries worked on all machines I had access to, but failed on the hardware of other PyOpenCL users. I do not know whether this was caused by new NVIDIA hardware (GF114) or by a new AMD APU, but PyOpenCL was crashing during initialisation. Fortunately Vincent Danjean uploaded the ocl-icd-libopencl1 package at about that time. ocl-icd-libopencl1 is a free ICD loader which seems to give better results with heterogeneous libraries. Now PyOpenCL can be built with only free software. It can also run with only free software, but it will not be useful, as it will not find any ICDs. But there is work on a free ICD, so let’s hope that this will be solved soon. When it happens, I’ll move PyOpenCL to the main section of Debian. If you want to know more, Vincent Danjean wrote an email describing the situation of OpenCL in Debian.

Currently PyOpenCL depends on any package providing libopencl1:

  • ocl-icd-libopencl1
  • amd-libopencl1
  • nvidia-libopencl1

It will not work reliably with the current nvidia-libopencl1 though, as PyOpenCL requires OpenCL 1.2 – and nvidia-libopencl1 provides only OpenCL 1.1, as was noticed in bug #682435. I have decided to leave the relaxed dependency (instead of forcing PyOpenCL to require ocl-icd-libopencl1) to allow for experimentation with different hardware and software configurations, and to be ready for new OpenCL implementations which might be uploaded to Debian in the future. But if you have problems, please try the free ocl-icd-libopencl1 before filing a new bug report.

There is still a small problem with PyOpenCL though. When using ocl-icd as the ICD loader and running tests on NVIDIA hardware, it crashes on the Image tests. I was not able to test it thoroughly and I am not sure what the root cause of the problem is.

PyCUDA is Python 3 ready, but I have not created a Python 3 package because there are still problems with compiling kernels on Python 3. The current package has already been split into python-pycuda and python-pycuda-headers, so adding Python 3 support is now only a matter of enabling the python3-pycuda package.

As for Ubuntu, it looks like PyOpenCL 2012.1-1 has already been uploaded for Quantal Quetzal. It looks like the changes I made for Debian were also beneficial for Ubuntu. PyCUDA is seen by Ubuntu, but only as a foreign (Debian) package, not as a part of Ubuntu. I think it will be problematic to push PyCUDA into Ubuntu, as it depends on the NVIDIA CUDA toolkit, which is not packaged for Ubuntu. If someone reading this knows how the Ubuntu process works, how to contact the people responsible for the release, or the maintainers of the NVIDIA or AMD drivers, please let me know. I am open to proposals for changes which will not break the Debian package and will make the life of Ubuntu maintainers easier.

New PyOpenCL in Debian

Debian now has a new version of the PyOpenCL package (2011.2); thanks to Piotr for sponsoring. I have changed the dependencies to allow the use of non-NVIDIA OpenCL libraries, closing #628702. There is still an open Ubuntu bug, but Debian-Ubuntu package synchronisation should deal with it.

Popularity of GPGPU-related packages

While dealing with #628702 I looked at the metadata of different packages, including package popularity gathered with popcon. The data used in the following analysis is from 2011-12-05. I looked at the following packages:

PyOpenCL is installed on 189 machines and used on 62 of them. PyCUDA is installed on 13 machines and used on 7. As for NVIDIA GPGPU, libcuda1 is installed on 1110 machines and used on 131; nvidia-libopencl1 is installed on 822 machines and used on 134. The AMD OpenCL libraries (amd-libopencl1) are installed on 33 machines and used on 9.

The AMD OpenCL libraries are not very popular because they were only uploaded into Debian with the last version of the Catalyst drivers, two weeks ago. Many more people have installed the GPGPU packages (libcuda1 and libopencl1) than are using them. Data for the NVIDIA packages might be skewed in favour of libcuda1: nvidia-opencl-icd depends on libcuda1 and libnvidia-compiler, which means that everyone who wants to use the NVIDIA OpenCL libraries must install libcuda1 for OpenCL to work.

Although more people have installed libcuda1 (1110) than libopencl1 (822), a similar number of people are using those packages (131 vs. 134). This might be caused by the package dependencies described in the previous paragraph. About half of the people who are using libopencl1 are doing so through PyOpenCL – 134 active users of libopencl1 vs. 62 active users of python-pyopencl, which depends on libopencl1.

There is definitely interest in both OpenCL and CUDA in Debian, although until recently we had only the NVIDIA OpenCL libraries available in the repositories. It will be interesting to see how introducing the AMD OpenCL libraries, and a PyOpenCL that can use any OpenCL provider, will change the popularity of OpenCL and CUDA.

Hardware

To test OpenCL on AMD I have been using a Foxconn nT A3500. It is a small machine with the AMD Fusion architecture. It has a two-core CPU and an ATI 6310 GPU integrated in the same chip. I have not yet tested its performance. It is interesting that Foxconn is producing hardware under their own name. It looks like they have learned how to make good hardware doing outsourcing, and are now competing on the global market (with quite nice products, IMO). That’s a feature of outsourcing, I guess.