Aparapi

Some time ago I read about Aparapi (the name comes from A PARallel API), a Java library for running code on the GPU, which was open-sourced in 2011. There are videos and a webinar by Gary Frost (the main contributor, according to the SVN log) from 2011-12-22 describing it on the main page. There is also a blog about it. Aparapi allows for running threads either on the CPU (via a Java Thread Pool) or on the GPU (using OpenCL).

The Aparapi programming model is similar to what we already know from Java: we inherit from the Kernel class and override its run() method, just like we would subclass Thread and override run() for ordinary threads. We write code in Java, which means that e.g. if our Kernel class is an inner class it can access objects available in the enclosing class. Because of OpenCL limitations (for example, no full object orientation and no recursive methods), code inside an Aparapi kernel cannot use Java classes relying on inheritance or method overloading.

Aparapi follows a concept similar to the one behind ORM (Object-Relational Mapping): it tries to avoid an “internal” language (SQL there, OpenCL here) at the price of hiding some details and potentially sacrificing performance.

It comes with a few examples (Mandelbrot, n-body, Game of Life – the last of which could gain from using local memory) which show its possible usages.

I look at Aparapi as someone who uses, contributes to, and is used to PyOpenCL, and I am more interested in its implementation than in its API.

Usage

To run the simplest code, doubling an array on the GPU, we can just write:

final float[] input = new float[size];
final float[] output = new float[size];

new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        output[i] = 2.0f * input[i];
    }
}.execute(size);

When the GPU cannot be used (for example, we do not have an OpenCL-capable GPU, or there are problems with compiling the kernel), a Java thread pool is used instead.

We can use special methods inside the Kernel class to access OpenCL information:

  • getGlobalId()
  • getGroupId()
  • getLocalId()
  • getPassId()

As with ordinary OpenCL programs, the host code is responsible for managing kernels, copying data, initializing arrays, etc.

Some limitations come from OpenCL limitations. Aparapi supports only one-dimensional arrays, so we need to simulate multi-dimensional arrays by calculating indices ourselves. Support for doubles depends on OpenCL and GPU support – which means that we need to check for it at runtime.
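
For example, a minimal sketch of indexing a flat array as a matrix (the width, height, and data names are mine, not Aparapi's):

final int width = 1024, height = 768;
final float[] data = new float[width * height];

new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        int row = gid / width;                // simulated data[row][col]
        int col = gid % width;
        data[row * width + col] = row + col;  // fill the "2D" array
    }
}.execute(width * height);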

There is no support for static methods, method overloading, or recursion, which is caused by the lack of support for them in OpenCL 1.x. Of course, OpenCL code does not throw exceptions when run on the GPU; this means that running the same code on the CPU and on the GPU may have different effects: the GPU will not throw ArrayIndexOutOfBoundsException or ArithmeticException, while the CPU will.

There is a page with guidelines on writing Kernel classes.

There is also a special form of the execute method:

kernel.execute(size, count)

which loops count times over the kernel. Inside such a kernel we can call getPassId() to get the index of the current pass, e.g. to prepare data on the first iteration.
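
A sketch of a multi-pass kernel (the state array and the update formula are made up for illustration; size and count are placeholders):

final float[] state = new float[size];

new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        if (getPassId() == 0) {
            state[i] = i;                   // prepare data on the first pass
        }
        state[i] = 0.5f * state[i] + 1.0f;  // update on every pass
    }
}.execute(size, count);                     // run the kernel count times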

Implementation

Some details of converting Java to OpenCL are described on a wiki page.

When asked to run a Kernel, Aparapi checks whether this is its first use, and if not, whether the code was already compiled. If it was compiled, the OpenCL source generated by Aparapi is used. If it was not successfully compiled, Aparapi uses a Java Thread Pool and runs the code on the CPU, as there must have been some trouble in previous runs. If this is the first run of the kernel, Aparapi tries to compile it; if that succeeds, it runs on the GPU, otherwise on the CPU.

In case of any trouble, Aparapi uses the CPU and the Java Thread Pool to run the computations.

When using the Java Thread Pool, Aparapi creates a pool of threads, one thread per CPU core. Then it clones the Kernel object so that each thread has its own copy, to avoid cross-thread access. Each thread calls the run() method globalSize/threadCount times. Aparapi updates the special variables (like globalId) after each run() execution, and waits for all threads to finish.

Aparapi classes

There are four sets of classes. The first set consists of classes responsible for communication with OpenCL and for managing OpenCL objects; it contains the JNI classes and their partners on the Java side. The situation is similar to PyOpenCL, where there are also C++ classes and their Python wrappers. The second set consists of exceptions. Then there are the special classes – the Kernel class (from which the programmer inherits) and the Config class. And finally there are the classes responsible for analysing Java bytecode and generating OpenCL kernels.

The Config class is responsible for runtime configuration, allowing some features to be enabled or disabled.

Exceptions:

  • AparapiException – base class, all other exceptions inherit from it
  • ClassParseException – unsupported Java code was encountered during code analysis; ClassParseException contains details of the encountered error. The ClassParseException.TYPE enum indicates which forbidden construct was found in the code.
  • CodeGenException
  • DeprecatedException
  • RangeException

OpenCL mapping classes:

  • Range – partner for JNIRange which manages work group sizes and their dimensionality
  • KernelArg – responsible for passing arguments to the kernels
  • JNIContext – responsible for running kernel, manages only one device, can have many queues, but can deal only with one kernel
  • KernelRunner – checks hardware capabilities, can execute Java and OpenCL code
  • ProfileInfo – stores raw OpenCL performance-related information, like the start and end times of execution, status, and Java labels. It is returned by KernelRunner.

JNIContext is also responsible for managing the memory containing data passed to the kernel. It pins such memory to avoid it being collected or moved by the Garbage Collector. It also responds to changes in non-primitive objects, regenerating cl_mem buffers if it detects any changes. Its longest method is responsible for running the kernel. It also contains some debugging capabilities, like dumping memory as floats or integers.

The UnsafeWrapper class is responsible for atomic operations, which is important when running multithreaded code in a heterogeneous environment. It is a wrapper around sun.misc.Unsafe for atomic operations and seems to be used to avoid compiler warnings during reflection. What’s interesting is that all exceptions in this class are marked with “TODO: Auto-generated catch block”.

The Kernel class is intended to be subclassed by the programmer. It loads JNI, manages Range, wraps Math methods, and provides memory barriers. Its execute() method accepts the global size, i.e. the number of threads to run. For now it is not possible to run threads in a 2D or 3D range, or to govern the local block size. The execute() method is blocking; it returns only when all threads have finished. On the first call it determines the execution mode (OpenCL or JTP).

The Kernel.EXECUTION_MODE enum allows for controlling which mode is used to run our kernels:

  • CPU – OpenCL using CPU
  • GPU – OpenCL using GPU
  • JTP – Java Thread Pool (which usually means that there were problems with OpenCL)
  • SEQ – sequential loop on the CPU, very slow, used only for debugging

We can get the current execution mode by calling kernel.getExecutionMode(), or set it with kernel.setExecutionMode(mode) before calling execute().
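
For example (MyKernel is a hypothetical Kernel subclass):

Kernel kernel = new MyKernel();
kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);  // force the Java Thread Pool
kernel.execute(size);
System.out.println("Ran in mode: " + kernel.getExecutionMode());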

Java code analysis

To generate OpenCL code, Aparapi loads the class file, finds the run() method and all methods called by it (using reflection), then converts the bytecode into an expression tree. When analysing bytecode, Aparapi constructs the list of all accessed fields (using the getReferencedFields() method) and their access mode (read or write), converts scalars into kernel arguments, and converts primitive arrays into OpenCL buffers.

Bytecode consists of variable-length instructions, and the Java Virtual Machine is a stack-based machine, which means that when analysing bytecode we need to keep track of the state of the stack. We also need to infer the types of arguments from the bytecode.

The main algorithm of the analysis creates a list of operations in Program Counter order, creating one object per bytecode instruction. All instructions are categorised based on their effect on the stack. In the subsequent analysis, when an instruction consumes a value from the stack, the instruction that produced it is marked as a child of the consumer. After adding some hacks for bytecodes like dup and dup2, Aparapi ends up with a list of operation trees, ready to be used to generate OpenCL code.
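
To make the idea concrete, here is a minimal, self-contained sketch (hypothetical helper classes, not Aparapi's real ones) of how a stack-consuming instruction adopts the producers of its operands as children:

import java.util.ArrayDeque;
import java.util.Deque;

// Each node represents one bytecode instruction and its operand subtrees.
class Op {
    final String name; final Op[] children;
    Op(String name, Op... children) { this.name = name; this.children = children; }
    @Override public String toString() {
        if (children.length == 0) return name;
        StringBuilder sb = new StringBuilder(name + "(");
        for (int i = 0; i < children.length; i++) sb.append(i == 0 ? "" : ", ").append(children[i]);
        return sb.append(")").toString();
    }
}

public class TreeSketch {
    public static void main(String[] args) {
        Deque<Op> stack = new ArrayDeque<>();
        // Bytecode for "2.0f * input[i]": push a constant, push an array load, multiply.
        stack.push(new Op("ldc 2.0f"));
        stack.push(new Op("faload input[i]"));
        Op right = stack.pop(), left = stack.pop();  // fmul consumes two stack values...
        stack.push(new Op("fmul", left, right));     // ...and adopts their producers as children
        System.out.println(stack.pop());             // prints: fmul(ldc 2.0f, faload input[i])
    }
}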

There are classes intended for representing bytecode as a hierarchy, just as a compiler would, to help with generating kernels:

  • MethodModel checks whether a method uses doubles or is recursive, checks for getters and setters, and creates the list of instructions. It is a long class, containing Instruction transformers
  • InstructionSet describes bytecodes and types. It contains information on how each bytecode changes the stack (what it pops and pushes)
  • InstructionPattern matches bytecodes against Instructions
  • Instruction is the base class describing a bytecode; it contains links to the previous and next instructions, its children, all branching targets and sources, etc. KernelWriter calls the writeInstruction() method for each object of this class.
  • ExpressionList is the list of instructions in a particular expression; used to build the list of expressions that are then transformed into OpenCL; a linked list with some ability to fold instructions
  • BranchSet represents a list of bytecodes from one branch of a conditional instruction
  • ClassModel
  • ByteBuffer buffer for accessing bytes in *.class, used to parse ClassFile structure, used by ByteReader
  • ByteReader provides access to *.class file, used mostly by ClassModel and MethodModel

There is an entire set of classes describing the hierarchy of bytecode instructions:

  • ArrayAccess
    • AccessArrayElement
    • AssignToArrayElement
  • Branch
    • ConditionalBranch
    • Switch
    • UnconditionalBranch – FakeGoto
  • CloneInstruction
  • CompositeInstruction
    • ArbitraryScope
    • EmptyLoop
    • ForEclipse, ForSun
    • IfElse
    • If
    • While
  • DUP, DUP_X1, DUP_X2, DUP2, DUP2_X1, DUP2_X2
  • Field
  • IncrementInstruction
  • OperatorInstruction
    • Binary
    • Unary
  • Return

And classes for generating code:

  • KernelWriter translates Java methods (getGlobal*) into OpenCL function calls and converts Java types to OpenCL types. It also writes all the needed pragmas: for fp64, atomics, etc.
  • InstructionTransformer used by MethodModel, abstract, its only children are created on the fly
  • BlockWriter writes instructions, which it gets as Expressions or InstructionSets

There are methods (supporting many types, like int, byte, long, double, and float) that help with debugging and tuning:

  • get(array) fetches the buffer from the GPU
  • put(array) sends the array to the GPU

During code generation, the dupX bytecodes (which duplicate X items on the stack) are replaced by repeating the instructions that generate those X items.

Performance

There are two compilations at runtime when using Aparapi: one is the analysis of bytecode and generation of OpenCL source (done by Aparapi); the other is the compilation of OpenCL into a binary understood by the GPU, performed by the driver.

There are some methods helping with profiling:

  • getConversionTime() returns the time it took to translate bytecode into OpenCL code
  • getExecutionTime() returns the execution time of the last run()
  • getAccumulatedExecutionTime() returns the time it took to execute all calls to run() from the beginning
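
A typical use (I believe the returned times are in milliseconds, but check the Javadoc):

kernel.execute(size);
System.out.println("bytecode to OpenCL: " + kernel.getConversionTime());
System.out.println("last execution:     " + kernel.getExecutionTime());
System.out.println("all executions:     " + kernel.getAccumulatedExecutionTime());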

All the usual GPU performance tips apply to Aparapi. The ideal candidates for speedup are order-independent operations on large arrays of primitives: large amounts of data and heavy computation, with little data transfer (or with transfer hidden by computation). Avoid inter-element dependencies, as they break parallelism. As usual, we need to avoid branching, as instructions in threads should execute in lockstep.

There is a possibility to use local or constant memory in kernels generated by Aparapi. To do so we need to attach _$constant$ or _$local$ to the variable name. Of course, when using local memory we need to put barriers into our kernels, which means that we risk deadlocks. It is again a case of leaky abstraction: we cannot pretend that we are writing plain Java code, and we need to know the details of kernel execution.
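
A sketch of a kernel using local memory (the class, buffers, and sizes are mine; the _$local$ suffix and localBarrier() come from Aparapi):

class TileKernel extends Kernel {
    final int[] input, output;
    final int[] tile_$local$;     // placed in OpenCL local memory

    TileKernel(int[] input, int[] output, int groupSize) {
        this.input = input;
        this.output = output;
        this.tile_$local$ = new int[groupSize];
    }

    @Override public void run() {
        int lid = getLocalId();
        tile_$local$[lid] = input[getGlobalId()];   // each work item loads one element
        localBarrier();                             // wait for the whole work group
        output[getGlobalId()] = tile_$local$[lid];  // now neighbours' data is visible too
    }
}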

There is also a possibility to see the source of the generated OpenCL kernels, by passing the option -Dcom.amd.aparapi.enableShowGeneratedOpenCL=true to the JVM.

Memory copying

Each execution of the kernel means copying all the data used by the kernel. Aparapi assumes that Java code (code outside of run()) can change anything, so it copies buffers to and from the device. We can avoid this performance-killing behaviour by demanding explicit copies, by calling:

kernel.setExplicit(true);

Then we are responsible for copying the data, in the following manner:

kernel.put(block);
kernel.execute(size);
kernel.get(block);

If we run the same computations over and over again, we can put a loop inside run(), or call execute(size, count) to let Aparapi manage the looping.
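
Combining both ideas, a sketch of an iterative computation with explicit transfers (block, size, and iterations are placeholders):

kernel.setExplicit(true);
kernel.put(block);                 // copy the input to the device once
kernel.execute(size, iterations);  // Aparapi loops on the device, no per-pass copies
kernel.get(block);                 // copy the result back once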

Summary

Aparapi is an interesting library. It can serve as an entry point to GPU computing, by allowing existing Java code to be translated to OpenCL to determine whether it is even possible to run that code on a GPU.

Aparapi is still being developed; there is initial support for lambdas (in a separate branch), and the ability to choose the device on which to run the code, although this does not seem finished yet (according to the wiki). There are unit tests covering OpenCL code generation for those who want to experiment with it.

There is also a proposal for adding extensions to Aparapi; not OpenCL extensions, but the ability to attach real OpenCL code. If it succeeds, Aparapi will not differ much from PyOpenCL.

The main problem of Aparapi is the limit of one kernel and one run() method per class, which is the result of modelling Kernel after the Thread class. There are some hacks to deal with it, but they are just hacks.

According to the documentation, reduction cannot be implemented in Aparapi. It is present in PyOpenCL, but implemented using more than one kernel, so this might be a problem with the model. There is also no support for volatile memory, which might also be a problem, e.g. for reduction (there was a problem with volatile memory and reduction in PyCUDA).

Another limitation is the inability to use multidimensional Java arrays. There are proposals for multidimensional ranges of execution (2D, 3D, etc.) and some work has been done, but nothing is finished yet. There is also no support for sharing data between OpenCL and OpenGL.

Aparapi does not support vector execution (SIMD), which could be used to speed up execution on the CPU; this might be the result of the lack of such support in Java, as described by Jonathan Parri et al. in the article “Returning control to the programmer: SIMD intrinsics for virtual machines”, CACM 2011/04 (sorry, paywalled).

I’m wondering how OpenCL 2.0, which allows for blocks and calling kernels from inside other kernels, will change support for advanced Java features in Aparapi.

I want to finish with a quote from the authors of Aparapi:

As a testament to how well we emulate OpenCL in JTP mode, this will also deadlock your kernel in JTP mode 😉 so be careful.


PyOpenCL, PyCUDA, and Python 3

Last week, new versions of the Debian packages for PyOpenCL and PyCUDA reached Debian unstable. Support for Python 3 is the largest change in the new PyCUDA.

Both PyOpenCL and PyCUDA now support Python 3 and contain compiled modules for Python 3.2 and 3.3. Both provide *-dbg packages for easier debugging. Because of the addition of the debug packages and of Python 3 support for PyCUDA, all packages had to go through the Debian NEW queue.

The uploaded packages do not block the Python 3.3 transition, because they are built against Python 3.3 and provide support for it. They will need to be rebuilt during the Boost 1.53 transition – and I hope to upload new versions at the same time.

AWS Summit in Berlin

We had two holidays in Poland at the beginning of May: Labour Day on the 1st of May and Constitution Day on the 3rd of May. Most people used this time to visit family, have barbecues, and so on; I decided to take the 2nd of May off and go to Berlin for the Amazon Web Services Summit.

The AWS Summit was held in the Berliner Congress Center at Alexanderplatz. This is the same place that hosted the Chaos Communication Congress for many years, so it brought back some memories.

Again there was a queue at the entrance, although it was shorter than before the Congress. On the other hand there was no Heart of Gold, nor blinking lights in the window. BCC looked professional. Also, there were security guards checking our bags at the entrance. I wonder what they were looking for…

Inside there were again some similarities, inevitable at an event with over three thousand people (the organizers have not published official attendance numbers; I am estimating based on how crowded the lecture rooms were): there were long queues for the WC, queues for food, people eating everywhere (on the stairs, etc.). Because this was a computer event, most people were not very social, eating on their own, not trying to make contact; this changed after the closing event, when beer was provided 😉

As the AWS Summit was a professional event, not a hacker congress, there were differences. The food court looked empty, even boring, without blinkenlights: it was yet another place to have your lunch. Instead of Engel there were hostesses; the good part was that they were nicely dressed (not underdressed like the ones at Confitura 2012).

I would never have thought I would say this, but I somehow missed Nick Farr shouting from the stage to raise a hand if someone had a free seat; there was less feeling of community, less eagerness to make room for fellow hackers to sit down and listen to the lecture. But talking with people during the breaks and after the closing was as interesting as at other conferences.

Oh, the lecture rooms… Just like during CCC there were problems with more people wanting to listen to a lecture than could fit into the room. People were waiting in queues, and some were not let in to a few of the interesting talks due to lack of space. Someone (I do not know who – it was too crowded) joked that “they should just instantiate another room when a talk is in such high demand”. Yeah, this shows that clouds cannot solve the limitations of the physical world. The difference from CCC was that there was no streaming of talks, so those of us who failed to get a seat did not have a chance to watch them.

The organizers wanted to count attendance. They did not use Sputniks; instead, our badges had barcodes printed on them and a poor hostess had to scan everyone entering the room. It was not foolproof – e.g. I entered the lecture room early (during the lunch break) to get a place to sit, so I was not counted as attending that talk.

Most of the talks were about technical details of AWS.  I will just mention a few interesting thoughts from the keynotes.  Werner Vogels, CTO of amazon.com, mentioned something along the lines of “just as Human Resources employs people when they are needed and reduces them during lower demand, you can do the same with your computing capabilities”.  I do not want to be treated as a commodity (or a resource to be managed, for that matter), and I repeat after The Prisoner: “I am not a number!”.  I believe that treating people, employees, as commodities is part of the problem with the economy today.  This is especially ironic when said on the 2nd of May, the day after May 1st, International Workers’ Day.

On the other hand, Nikolai Longolius, CEO of Schnee von morgen Web TV, made me feel old. He used the phrase: “we started with the cloud in 2006, so we are grandfathers”.  Other speakers were also using phrases like “this is the old way of computing, used in the 1990s or 2000s”. Hey, I know that in computing time flows faster, but it might be a good idea to stop from time to time and check whether the past offers us some important lessons.

In summary, I’m glad I attended the Summit. I learned a lot, and talking with the people responsible, for example, for Glacier helped me understand it better and fix some of my scripts. I met some interesting people at the Summit.  It also helped me see the Congress from a different perspective and changed my expectations of OHM 2013. I am waiting for it impatiently, as it is only 3 months from now!

A.M. Turing award lectures

I’ve just watched two lectures given by laureates of the ACM Turing Award. The first was given by Barbara Liskov in 2009 and the second by Chuck Thacker in 2010. Both lectures contain many interesting topics, and I do not want to merely summarize them, as that would be a disservice to the presenters. Instead I’ll focus on just one aspect of computer science present in both.

They both talk about past experiences of developing computer science. Liskov describes how she was involved in the implementation of the CLU programming language. She describes how the situation regarding programming languages looked in the 1960s and 1970s. There were many different programming languages and they offered different choices. For example, there were many approaches to exception handling. One approach was termination (the one known today) and another was resumption: after handling an exception, code could order a return to the procedure which raised the exception. One could also use FAILURE exceptions, which were somewhat similar to today’s Java runtime exceptions, but one could change any exception into a failure, putting the original exception in as an argument of the failure (similar to today’s exception wrapping). There was also a special Guardian module responsible for catching uncaught exceptions, which seems similar to the approach known from virtual machines; but each unit (module) could have its own Guardian, so exceptions were confined inside modules. She describes the implementation of iterators, which seem similar to Python’s generators with yield; even the way iterators were implemented seems similar to how generators were implemented in Python. First there were just generators (PEP 255), and then they were extended to allow for coroutines (PEP 342). Liskov stopped before implementing coroutines. Python is going even further, with subgenerators (PEP 380) and generators for asynchronous programming (the currently discussed PEP 3156). Liskov said that CLU was “way ahead of its time” – and it is true. Only today do we see implementations of its concepts.

Another concept described by Liskov which is now implemented in current programming languages is collections with WHERE clauses. It is similar to generics with constraints, e.g. in C#, with restrictions posed on the parameters. In CLU the parameter had to implement some methods (a concept similar to duck typing); in C# one requires that the argument implements some interface. It feels strange to see a concept from the 1970s re-discovered and implemented only in 2006.

But it becomes clearer after watching Thacker’s lecture. He notes that “Computer science is a very forgetful discipline”. Thacker talks about all the walls we are facing today – the memory wall, the power wall, the complexity wall. His entire lecture is about history and its influence on today. We (computer science people) made some choices back then and now we live with those choices. This can be seen when looking, for example, at BIOS – only now can we see the migration to UEFI, and not without many problems (see Matthew Garrett’s work). Thacker uses interrupts as an example of such a legacy limitation. Interrupts made sense in single-core systems but now complicate things and make no sense with multiple cores. It can also be seen when looking at computer languages today – I do not know of a language which implements the resumption exceptions mentioned by Liskov. Thacker wonders how we would make those choices today, given current knowledge and technology.

Many of the choices Thacker mentions were made as the result of scarcity. He mentions the problems of shared memory and message passing. Shared memory was easier to implement, which was probably the reason it was chosen over message passing – but now it poses coherency problems on multi-core chips. A well-known example is virtual memory, which was the result of small amounts of RAM and the necessity to spill RAM to disk. Now we have plenty of memory, so swap is not needed (e.g. Android does not use swap; on the other hand the Nokia N900 had swap implemented on flash). The need to virtualise memory to be able to find a large contiguous chunk of RAM can be solved by a Garbage Collector… So in theory we could resign from some parts of the current memory layout in hardware and operating systems. I do not agree with Thacker that we should also resign from the protection given by virtual memory; he mentions that this should be solved by using safe languages (i.e. not C), but we also need to deal with rogue programs and multi-user environments – so I think having the protection offered by virtual memory is a Good Thing(TM).

Another problem, which we are experiencing only today, is related to threads and locking. In the past, systems were not large enough to show the problems with locking – i.e. the problems with composing locks and the inability to compose smaller systems. Thacker does not believe in Transactional Memory. He does not like transactions because they are speculative and we do not have much experience with them. We use them in databases, but not at the smaller scale of multiple threads at the CPU level.

Thacker notes that the 1950s and 60s were an age of experimentation, the 70s and 80s were a period of consolidation and warfare (only a few of the existing solutions survived – he was talking about CPUs, but the same is true for programming languages), and the 90s were about Instruction Level Parallelism. But in most programs there is not much ILP that can be extracted automatically. From my experience with GPGPU programming, one needs to work on a program to get performance gains, and not much can be done by the compiler alone, without programmer intervention. Also, there are not many applications which can use many cores; one does not need dozens of cores to watch video or write email.

The problem is that we have learned to live with these limitations, many of them work well enough, and the cost of changing to something better is too high. We might need to rethink all those decisions; not necessarily change them – but decide again, in today’s situation and with current knowledge.

The problem with changing solutions as widely used as interrupts is that the cost comes now and the profits only in the long term. The need for many players to agree does not help. At the same time this might be a good field for experimentation using virtual machines or various free software solutions. One can experiment with existing code on new architectures – just recompile the programs or implement a virtual machine on the new architecture. There is also the question of how much control to give programmers. Too much, and programs will be hard to port. Too little, and they will have problems getting performance. For example, the new Exynos 5 CPU offers 4 fast cores and 4 slow cores; the system chooses which cores to use depending on the load. But what if I, the developer, want to use 1 fast core and 1 slow core, or some other combination? I will not have this ability from the Android level – Dalvik operates at a higher level than that.

I agree with Thacker that we live in very interesting times for Computer Science. It looks like we need to rethink some of the basics of our systems – and maybe repeat such a process every time we get new hardware (for example GPGPU, heterogeneous chips, and so on). The problem is that we, as a discipline, have forgotten many of the possibilities discovered and abandoned in the past – and those possibilities might be more relevant today. Python’s example, however, shows that good ideas survive and get implemented, even if it takes many years.

On mobile phones in Poland

This post is different from my usual posts about programming, GPUs, etc., so you can skip it if you are not interested in mobile plans.

Recently I decided to change my phone. I had a Nokia N900, which was a good and promising phone. I was using it less and less as a smartphone, though – its browser had problems dealing with web pages and there were not many applications for it. Even computer-related conferences publish applications for Android and iPhone and none for the N900. My ties with the Maemo community are very weak now – I have not logged into my Maemo accounts for months. So I decided to go with the crowd and buy an Android device. This meant some changes, and two recent posts by Russell Coker, one about international calls and another about the changing format of SIM cards, struck a chord, so I decided to write this post.

I had to get a new SIM card because my new device has a microSIM slot. Unfortunately only one mobile company in Poland changes them for free – with the others you need to pay for a new SIM, up to 50 PLN + VAT (about 15 EUR). So I decided to try to switch mobile providers – I would pay less when signing a new contract than I do as an existing customer (yes, I also think this is a stupid policy: keeping an existing customer vs. acquiring a new one…). One can keep one’s phone number while changing mobile providers in Poland, so I was less hesitant to try a new company. I am not a very social person, so I do not need many “free” minutes – but I wanted a large Internet quota. I visited the sales representatives of all the mobile providers and told them:

I do not need many minutes – 60 per month is enough. I do want a large internet package though – something like 500MB to 1GB. Oh, and if you have some plan with more minutes which can be used to call internationally, I’ll gladly take it.

I was saying this in Polish, my native language, and all the sales people were also Polish – so there should have been no language barrier. There was. I was getting responses like:

We have a wonderful plan for you. You’ll get 200 minutes, and because you are moving your number to our network you’ll get 30% more minutes. Internet – oh, you need to buy an additional internet package, which contains 200MB. As for international calls, we do not have anything like that, so you will be paying the maximum rate per minute allowed by the EU. Are you ready to sign?

And all of that for 2-3 times more than I was paying. There was one plan with international minutes, but it was very expensive and only for companies; I would need to buy the phone for a company, not for personal use. So I decided to stay with my current provider. As I was renewing my contract, I even got a new microSIM free of charge.

It seems that Polish mobile providers are still living in the past, thinking that all customers want just one thing: more and more minutes. Even though some companies are now parts of international networks (we have Orange and T-Mobile), a potential customer rarely sees the advantages of dealing with an international company. I used Orange when I was in Switzerland. The plan I had included 1GB of internet access and minutes to the EU, USA, and Canada. (Funny fact: calls to Poland did not use the minutes included in the plan and I had to pay for them additionally – so it seems that Poland is not part of the EU according to Orange Switzerland.) There are some signs of change though: Orange Poland recently started offering plans with included minutes to the EU and USA – but only for landlines.

In summary, answering Russell Coker’s questions:

  1. The new formats of SIM cards allow mobile providers to charge customers for changing their SIMs or to force them to renew contracts.
  2. People do not call internationally because many plans do not offer cheap international calls. People who have many international contacts tend to use VoIP or similar solutions, avoiding paying the telecoms.

GPGPU presentations in 2012

During the last three months I have given three presentations.

On September 16, during PyConPL 2012, I gave the presentation “Asynchronous and event-driven PyOpenCL programming”. I showed how to use events and queues to asynchronously call OpenCL code from Python, and how this can help to use PyOpenCL in everyday programs, not only for scientific purposes.

On October 21 I gave the presentation “PyOpenCL – unleash your GPU with the help of Python” at PyCon Ukraine 2012 in Kyiv. I started with a short introduction to OpenCL and PyOpenCL, and again tried to convince the audience that GPGPU can be used in ordinary programs, especially with the help of PyOpenCL and its high-level features like reduction or parallel prefix scan.

The last of the presentations was given on November 12 at PyWaw 18. It was not a programming-related presentation – I talked about PyCon Ukraine: my remarks, how it went, and so on.

During and after my talks at PyConPL and PyCon Ukraine I got questions related to GPU programming. Listeners were asking about debugging and profiling of code. I also got some questions about performance differences between OpenCL and CUDA. One very interesting question was about the existence of a library of kernels (either for CUDA or for OpenCL) with the most common functions and computations. PyOpenCL contains some features (like the mentioned reduction or prefix sum) but I have not heard of a CPAN-for-GPGPU. It might be a good concept though.

In summary, there was some interest in GPGPU, but in the time since my presentations I have not seen many new discussions on the PyCUDA or PyOpenCL mailing lists. This means that either I am not such a good speaker ( 🙂 ) or GPGPU is still considered a niche topic.