PyOpenCL in main!

Yesterday (2014-05-26) my sponsor Piotr Ożarowski uploaded a new version of PyOpenCL to Debian. Usually I can upload new versions of packages to Debian myself, as I am a Debian Maintainer, but this time it was a very special upload: it closed bug 723132, which asked for moving PyOpenCL from contrib to main. Because Debian contains free OpenCL implementations, Beignet and Mesa, one can run OpenCL programs using FLOSS code.

Moving a package from contrib to main meant that PyOpenCL had to be removed from contrib and uploaded anew to main. Thanks to Piotr for sponsoring it, and to the FTP masters for accepting it from NEW and dealing with all the removing and re-adding of the package.

There is still work to do. Rebecca Palmer is working on allowing all OpenCL implementations to be installed at the same time, which should lead to more experimentation and easier work with OpenCL, but requires changes to many of the OpenCL-related packages. I’m also thinking about moving PyOpenCL to use pybuild, but this needs to wait till I have more free time.

Let’s hope that having PyOpenCL in main will allow more people to find and use it.


Aparapi

Some time ago I read about Aparapi (the name comes from A PARallel API), a Java library for running code on the GPU which was open-sourced in 2011. There are videos and a webinar by Gary Frost (the main contributor according to the SVN log) from 2011-12-22 describing it on the main page. There is also a blog about it. Aparapi allows for running threads either on the CPU (via a Java Thread Pool) or on the GPU (using OpenCL).

Aparapi’s programming model is similar to what we already know from Java: we inherit from the class Kernel and override the method run(), just like we would create subclasses of Thread and override run() for ordinary threads. We write code in Java, which means that e.g. if our Kernel subclass is an inner class, it can access objects available in the enclosing class. Because of OpenCL limitations (for example, no full object orientation and no recursive functions), Aparapi kernel classes cannot use inheritance or method overloading.

Aparapi presents a concept similar to the one behind ORM (Object-Relational Mapping): it tries to avoid the “internal” language (SQL there, OpenCL here) at the price of hiding some details and potentially sacrificing performance.

It has a few examples (Mandelbrot, N-body, Game of Life – which could gain from using local memory) that show its possible usages.

I look at Aparapi as someone who uses, contributes to, and is used to PyOpenCL, and I am more interested in its implementation than in its API.

Usage

To run the simplest code doubling an array on the GPU we can just write:

new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        output[i] = 2.0f * input[i];  // assuming input and output are final float arrays in scope
    }
}.execute(size);

When the GPU cannot be used (for example, we do not have an OpenCL-capable GPU, or there are problems with compiling the kernel), the Java thread pool is used.

We can use special methods inside the Kernel class to access OpenCL information:

  • getGlobalId()
  • getGroupId()
  • getLocalId()
  • getPassId()

As with ordinary OpenCL programs, the host code is responsible for managing kernels, copying data, initializing arrays, etc.

There are some limitations coming from OpenCL limitations. Aparapi supports only one-dimensional arrays, so we need to simulate multi-dimensional arrays by calculating indices ourselves. Support for doubles depends on OpenCL and GPU support, which means that we need to check for it at runtime.
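For example, a 2D matrix can be kept in a flat array in row-major order. A minimal sketch (the array name and dimensions are made up for illustration; the import reflects the com.amd.aparapi package of that time):

import com.amd.aparapi.Kernel;

final int width = 1024, height = 768;
final float[] matrix = new float[width * height];

Kernel kernel = new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        int row = gid / width;               // recover 2D indices from the flat index
        int col = gid % width;
        matrix[row * width + col] = row + 0.5f * col;
    }
};
kernel.execute(width * height);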

There is no support for static methods, method overloading, or recursion, which is caused by the lack of support for them in OpenCL 1.x. Of course OpenCL code does not throw exceptions when run on the GPU; this means that the same code may behave differently on the CPU and on the GPU. The GPU will not throw ArrayIndexOutOfBoundsException or ArithmeticException, while the CPU will.
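A hedged illustration of this difference (the array name and sizes are mine):

final float[] data = new float[16];
Kernel k = new Kernel() {
    @Override public void run() {
        data[getGlobalId()] = 1.0f;  // with 32 work-items this throws
                                     // ArrayIndexOutOfBoundsException in JTP mode,
                                     // but fails silently on the GPU
    }
};
k.execute(32);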

There is a page with guidelines on writing Kernel classes.

There is also a special variant of the execute method:

kernel.execute(size, count)

which loops count times over the kernel. Inside such a kernel we can call getPassId() to get the index of the current pass in the loop, e.g. to prepare data on the first iteration.
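A minimal sketch of using the pass index (data, increment, size, and count are assumptions for illustration):

final float[] data = new float[size];
final float[] increment = new float[size];

Kernel kernel = new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        if (getPassId() == 0) {
            data[i] = 0.0f;          // initialise only on the first pass
        }
        data[i] += increment[i];     // accumulate on every pass
    }
};
kernel.execute(size, count);         // run count passes over size work-items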

Implementation

Some details of converting Java to OpenCL are described on the wiki page.

When asked to run a Kernel, Aparapi checks whether this is its first use; if not, it checks whether the code was already compiled. If it was, the OpenCL source generated by Aparapi is used. If it was not successfully compiled, Aparapi uses the Java Thread Pool and runs the code on the CPU, as there must have been some trouble in previous runs. If this is the first run of the kernel, Aparapi tries to compile it; if it succeeds, the kernel runs on the GPU, otherwise it uses the CPU.

In case of any trouble, Aparapi falls back to the CPU and the Java Thread Pool to run the computations.

When using the Java Thread Pool, Aparapi creates a pool of threads, one thread per CPU core. It then clones the Kernel object so that each thread has its own copy, to avoid cross-thread access. Each thread calls the run() method globalSize/threadCount times. Aparapi updates the special variables (like globalId) after each run() execution, and waits for all threads to finish.
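A simplified, self-contained sketch of this dispatch idea (not Aparapi’s actual code; SliceKernel stands in for the cloned Kernel, with the globalId update folded into the loop):

import java.util.concurrent.*;

class JtpSketch {
    interface SliceKernel { void run(int globalId); }   // stand-in for Kernel.run()

    static void execute(SliceKernel kernel, int globalSize) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        int slice = globalSize / threads;               // assume it divides evenly
        for (int t = 0; t < threads; t++) {
            final int start = t * slice;
            pool.submit(() -> {
                for (int id = start; id < start + slice; id++) {
                    kernel.run(id);                     // Aparapi instead updates globalId
                }                                       // on a per-thread Kernel clone
                done.countDown();
            });
        }
        done.await();                                   // wait for all threads to finish
        pool.shutdown();
    }
}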

Aparapi classes

There are four sets of classes. The first set consists of classes responsible for communication with OpenCL and for managing OpenCL objects; it contains JNI classes and their partners on the Java side. This is a situation similar to PyOpenCL, where there are also C++ classes and their Python wrappers. The second set consists of exceptions. The third contains the special classes – the Kernel class (from which the programmer inherits) and the Config class. And finally there are classes responsible for analysing Java bytecode and generating OpenCL kernels.

Class Config is responsible for runtime configuration, allowing some features to be enabled or disabled.

Exceptions:

  • AparapiException – base class, all other exceptions inherit from it
  • ClassParseException – unsupported Java code was encountered during code analysis; ClassParseException contains details of the encountered error. The ClassParseException.TYPE enum specifies which forbidden construct was found in the code.
  • CodeGenException
  • DeprecatedException
  • RangeException

OpenCL mapping classes:

  • Range – partner of JNIRange; manages work group sizes and their dimensionality
  • KernelArg – responsible for passing arguments to the kernels
  • JNIContext – responsible for running the kernel; manages only one device, can have many queues, but can deal with only one kernel
  • KernelRunner – checks hardware capabilities; can execute Java and OpenCL code
  • ProfileInfo – stores raw OpenCL performance-related information, like start and end times of execution, status, and Java labels; it is returned by KernelRunner

JNIContext is also responsible for managing the memory containing data passed to the kernel. It pins such memory to avoid it being collected or moved by the Garbage Collector. It also responds to changes in non-primitive objects, regenerating cl_mem buffers if it detects any. Its longest method is responsible for running the kernel. It also contains some debugging capabilities, like dumping memory as floats or integers.

The UnsafeWrapper class is responsible for atomic operations, which is important when running multithreaded code in a heterogeneous environment. It is a wrapper around sun.misc.Unsafe and seems to exist to avoid compiler warnings during reflection. What’s interesting is that all exception handlers in this class are marked with “TODO: Auto-generated catch block”.

Class Kernel is intended to be inherited from by the programmer. It loads JNI, manages Range, wraps Math methods, and provides memory barriers. Its execute() method accepts the global size, i.e. the number of threads to run. For now it is not possible to run threads in a 2D or 3D range, or to govern the local block size. The execute() method is blocking; it returns only when all threads have finished. On the first call it determines the execution mode (OpenCL or JTP).

The Kernel.EXECUTION_MODE enum allows for controlling which mode is used to run our kernels:

  • CPU – OpenCL using CPU
  • GPU – OpenCL using GPU
  • JTP – Java Thread Pool (which usually means that there were problems with OpenCL)
  • SEQ – sequential loop on the CPU, very slow, used only for debugging

We can get the current execution mode by calling kernel.getExecutionMode(), or set it with kernel.setExecutionMode(mode) before calling execute().
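A short sketch of forcing a mode, e.g. to compare CPU and GPU results (MyKernel and size are made up):

Kernel kernel = new MyKernel();
kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);  // force the Java Thread Pool
kernel.execute(size);
System.out.println("Ran in mode: " + kernel.getExecutionMode());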

Java code analysis

To generate OpenCL code, Aparapi loads the class file, finds the run() method and all methods called by it (using reflection), then converts the bytecode to an expression tree. When analysing the bytecode, Aparapi constructs the list of all accessed fields (using the getReferencedFields() method) along with their access mode (read or write), converts scalars to kernel arguments, and converts primitive arrays into OpenCL buffers.

Bytecode consists of variable-length instructions, and the Java Virtual Machine is a stack-based machine, which means that when analysing bytecode we need to keep track of the state of the stack. We also need to determine the types of arguments from the bytecode.

The main algorithm of the analysis creates a list of operations in Program Counter order, creating one object per bytecode instruction. All instructions are categorised based on their effect on the stack. In the subsequent analysis, when an instruction consumes a value from the stack, the previous instruction is marked as a child of the consumer. After adding some hacks for bytecodes like dup and dup2, Aparapi produces a list of operation trees, ready to be used to generate OpenCL code.

There are classes intended for presenting bytecode as a hierarchy, just as a compiler would do, to help with generating kernels:

  • MethodModel – checks whether a method uses doubles or is recursive, checks for getters and setters, and creates the list of instructions. It is a long class, containing the instruction transformers
  • InstructionSet – describes bytecodes and types; contains information on how each bytecode changes the stack (what it pops and pushes)
  • InstructionPattern – matches bytecodes against Instructions
  • Instruction – base class describing a bytecode; contains links to the previous and next instructions, its children, all branching targets and sources, etc. KernelWriter calls the writeInstruction() method for each object of this class
  • ExpressionList – a list of instructions in a particular expression; used to build the list of expressions to be transformed into OpenCL; a linked list with some ability to fold instructions
  • BranchSet – represents a list of bytecodes from one branch of a conditional instruction
  • ClassModel
  • ByteBuffer – a buffer for accessing bytes in a *.class file, used to parse the ClassFile structure; used by ByteReader
  • ByteReader – provides access to the *.class file, used mostly by ClassModel and MethodModel

There is an entire set of classes describing the hierarchy of bytecode instructions:

  • ArrayAccess
    • AccessArrayElement
    • AssignToArrayElement
  • Branch
    • ConditionalBranch
    • Switch
    • UnconditionalBranch – FakeGoto
  • CloneInstruction
  • CompositeInstruction
    • ArbitraryScope
    • EmptyLoop
    • ForEclipse, ForSun
    • IfElse
    • If
    • While
  • DUP, DUP_X1, DUP_X2, DUP2, DUP2_X1, DUP2_X2
  • Field
  • IncrementInstruction
  • OperatorInstruction
    • Binary
    • Unary
  • Return

And classes for generating code:

  • KernelWriter – translates Java methods (getGlobal*) into OpenCL function calls and converts Java types to OpenCL types; also writes all needed pragmas: for fp64, atomics, etc.
  • InstructionTransformer – used by MethodModel; abstract, its only children are created on the fly
  • BlockWriter – writes instructions, which it gets as Expressions or InstructionSets

There are methods (supporting many types, like int, byte, long, double, float) helping with debugging and tuning:

  • get(array) – returns the buffer from the GPU
  • put(array) – puts the array onto the GPU

During code generation, the dupX bytecodes (duplicating X items on the stack) are replaced by repeating the instructions that generate those X items.

Performance

There are two compilations at runtime when using Aparapi: one is the analysis of bytecode and generation of OpenCL source (performed by Aparapi), the other is the compilation of OpenCL into a binary understood by the GPU, performed by the driver.

There are some methods helping with profiling (see the sketch after the list):

  • getConversionTime() returns the time it took to convert bytecode into OpenCL code
  • getExecutionTime() returns the execution time of the last run()
  • getAccumulatedExecutionTime() returns the time it took to execute all calls to run() from the beginning
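A quick sketch of reading these counters (kernel and size as in the earlier examples; I have not checked the units here):

kernel.execute(size);
System.out.println("conversion:  " + kernel.getConversionTime());
System.out.println("last run:    " + kernel.getExecutionTime());
kernel.execute(size);
System.out.println("accumulated: " + kernel.getAccumulatedExecutionTime());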

All the usual OpenCL performance tips apply to Aparapi. The ideal candidates for speedup are order-independent operations on large arrays of primitives: a large amount of data and heavy computation, with only a small amount of data transfer (or transfer overlapped by computation). Avoid inter-element dependencies, as they break parallelism. As usual we need to avoid branching, as the instructions in threads should be executed in lockstep.

There is a possibility to use local or constant memory in the kernels generated by Aparapi. To do so we need to append _$constant$ or _$local$ to the variable name. Of course when using local memory we need to put barriers into kernels, which means that we risk deadlocks. It is again a case of a leaky abstraction: we cannot pretend that we are writing plain Java code, and we need to know the details of kernel execution.
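A minimal sketch of the naming convention (the array names and sizes are mine, and I assume the work-group size happens to match GROUP_SIZE; localBarrier() is the barrier provided by Kernel):

final int GROUP_SIZE = 64;
final float[] data = new float[4096];
final float[] result = new float[4096];
final float[] tile_$local$ = new float[GROUP_SIZE];   // placed in OpenCL local memory

Kernel kernel = new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        int lid = getLocalId();
        tile_$local$[lid] = data[gid];                // stage a tile in local memory
        localBarrier();                               // wait for the whole work group
        result[gid] = tile_$local$[(lid + 1) % GROUP_SIZE];
    }
};
kernel.execute(4096);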

There is also a possibility to see the source of the generated OpenCL kernels, by adding the option -Dcom.amd.aparapi.enableShowGeneratedOpenCL=true to the JVM arguments.

Memory copying

Each execution of the kernel means copying all the data used by the kernel. Aparapi assumes that Java code (code outside of run()) can change anything, so it copies buffers both to and from the device. We can avoid this behaviour (which kills performance) by demanding explicit copying, by calling:

kernel.setExplicit(true);

Then we are responsible for copying the data, in the following manner:

kernel.put(block);
kernel.execute(size);
kernel.get(block);

If we are running the same computations over and over again, we can put a loop inside run(), or call execute(size, count) to let Aparapi manage the looping.

Summary

Aparapi is an interesting library. It can serve as an entry point to GPU computing, by allowing one to translate existing Java code to OpenCL to determine whether it is even possible to run that code on a GPU.

Aparapi is still being developed; there is initial support for lambdas (in a separate branch), and the ability to choose the device to run code on, although this does not seem finished yet (according to the wiki). There are unit tests testing OpenCL code generation for those who want to experiment with it.

There is also a proposal to add extensions to Aparapi; not OpenCL extensions, but the ability to attach real OpenCL code. If it succeeds, Aparapi will not differ much from PyOpenCL.

The main problem of Aparapi is the limit of one kernel and one run() method per class, which is the result of modelling Kernel after the Thread class. There are some hacks to deal with it, but they are just hacks.

According to the documentation, reduction cannot be implemented in Aparapi. It is present in PyOpenCL, but implemented using more than one kernel, so it might be a problem with the model. There is also no support for volatile memory, which might also be a problem, e.g. for reduction (there was a problem with volatile memory and reduction in PyCUDA).

Another limitation is the inability to use multidimensional Java arrays. There are proposals for multidimensional ranges of execution (2D, 3D, etc.) and some work has been done, but nothing is finished yet. There is also no support for sharing data between OpenCL and OpenGL.

Aparapi does not support vector execution (SIMD), which could be used to speed up execution on the CPU; this might be the result of the lack of such support in Java, as described by Jonathan Parri et al. in the article “Returning control to the programmer: SIMD intrinsics for virtual machines”, CACM 2011/04 (sorry, paywalled).

I’m wondering how OpenCL 2.0, which allows for blocks and for calling kernels from inside other kernels, will change the support for advanced Java features in Aparapi.

I want to finish with a quote from the authors of Aparapi:

As a testament to how well we emulate OpenCL in JTP mode, this will also deadlock your kernel in JTP mode 😉 so be careful.