Using the cloud to rebuild Debian

Making sure that all packages are of the necessary quality is hard work. That’s why there is a freeze before releasing a new Debian version, to make sure that there are no known release-critical bugs.

There is a “Collab QA” team whose role is “sharing results from QA tests (archive rebuilds, piuparts runs, and other static checks)”. The team is also present on Alioth, where the source code for its various tools is hosted.

As noted in the Collab QA description, one of the team’s responsibilities is rebuilding packages in the Debian archive. Large rebuilds are needed to test a new version of a compiler (e.g. during the transition from GCC 4.7 to 4.8) or when considering building packages using LLVM-based compilers. Most packages are built during uploading, but some QA and tests require rebuilding large parts of the archive.

This requires large computational power. Thanks to Amazon’s support, Debian can access EC2 to run some of these tasks, as noted by Lucas Nussbaum in his “bits from the DPL – November 2013”.

Lucas Nussbaum (the current Debian Project Leader) has written some scripts to rebuild and test packages on Amazon EC2. The scripts can be downloaded from the git repository or cloned using the git protocol. They started as a tool for rebuilding the entire archive, and now they are also used to test different versions of compilers (different GCC versions) and compiling packages using Clang. Currently Lucas is not actively developing the code; David Suarez and Sylvestre Ledru have taken over that role.

The scripts are written in Ruby. I am not proficient in Ruby, so please forgive any mistakes and misunderstandings.

Full rebuild

Rebuilding is managed using one master node, which runs continuously. While the master node controls the slave nodes, it is not responsible for starting and stopping them. The user is responsible for starting slave nodes (usually from their own machine) and for sending the list of them to the master node. The default setup, described in the README, uses 50 m1.medium nodes and 10 m2.xlarge nodes. The smaller nodes are used to compile small packages; the larger nodes are used to compile huge packages needing a lot of memory, like LibreOffice, X.org, etc.

Each slave node has one or more slots for dealing with tasks; this means that it is possible to run tests in parallel, e.g. compile more than one package at the same time.

The user is supposed to use the AWS-CLI tools or other means to manage slave nodes. AWS-CLI is not yet part of Debian, although Taniguchi Takagi wants to package and upload it to Debian (Bug #733211). For the time being you can download the AWS CLI source code from GitHub.

Spot instances are used to save on costs. This is possible because compiling packages (especially for tests, not as a step during uploading a package to the Debian archive) is not time critical. It is also idempotent (we can compile a package as many times as we want) and deals well with being stopped when a spot instance is no longer available.

All data sent between nodes is encoded using JSON. Using JSON allows for sending arrays and dictionaries, which means that it is easy to send structures describing a package, rebuild options, logs, parameters, results, etc.

There is no communication between the user machine and the master node. The user is supposed to SSH to the master node, clone the repository with the scripts, and run the scripts from inside this repository. The master node communicates with the slave nodes using SSH and SCP; it sends the necessary scripts to the slave nodes and then runs them.

The usual workflow is described in the README:

  1. Request spot instances
  2. Wait for them to start
  3. Connect to the master node
  4. Prepare the job description (list all packages to test)
  5. Run the master script, passing the list of packages and the list of nodes as arguments
  6. Wait for all tasks to finish
  7. Download the result logs
  8. Stop all slave instances

The JSON contains information about the packages to compile. Each package is described using the following fields (a hypothetical example follows the list):

  • type – Whether to test package compilation or installation (instest).
  • package – Name of package to test.
  • dist – Debian distribution to test on.
  • esttime – Estimated time for performing the test, used when scheduling tasks.
  • logfile – Name of file to write log to.
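
For example, a single task entry might look like this (a purely illustrative sketch; the field values, including the type name used for a rebuild, are made up):

{
  "type": "rebuild",
  "package": "hello",
  "dist": "sid",
  "esttime": 120,
  "logfile": "hello_sid.log"
}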

The repository contains many scripts; their names usually convey their jobs. Scripts containing instest in their names are intended to test installation or upgrade of packages.

  • clean – removes all logs and JSON files describing tasks and slave nodes
  • create-instest-chroots – creates a chroot, debootstraps it, copies basic configuration, updates the system, copies maintscripts; works with sid, squeeze, and wheezy
  • generate-tasks-* – scripts for generating JSON files describing tasks for the master to distribute to slave nodes
  • generate-tasks-instest – reads all packages from a local repository and sets them up for installation testing
  • generate-tasks-rebuild – reads the list of packages from the Ultimate Debian Database, excluding some, and creates the task list; allows for limiting packages based on their build time; uses an unstable chroot
  • generate-tasks-rebuild-jessie – script for building Jessie packages, using a Jessie chroot
  • generate-tasks-rebuild-wheezy – script for building Wheezy packages, using a Wheezy chroot
  • gzip-logs
  • instest – tests installation
  • masternode – script run on the master node, sending all tasks to the slaves
  • merge-tasks – merges JSON files with descriptions of tasks
  • process-task – main script run on a slave node
  • setup-ganglia – installs the Ganglia monitor on a slave node, to monitor its health
  • update – updates chroots to the newest versions

masternode

It accepts files containing the list of packages to test and the list of slave nodes as command line arguments.

It connects to each slave node and uploads the necessary scripts (instest, process-task) to it.

For each node it creates as many threads as there are slots; each thread opens one SSH and one SCP connection. Each thread then takes one task from the task queue and calls execute_one_task to process it. Success is logged if the task succeeds; otherwise the task is added to a retry queue. When there are no tasks left in the main queue, the number of available slots on the slave node is decreased, and the thread (except for the last one) ends.

The last thread for each node is responsible for dealing with failed tasks from the retry queue. It again loops over all available tasks, this time from the retry queue, and calls execute_one_task for each of them. This time each task runs alone on the node, so problems caused by concurrent compilation (e.g. compiling PyCUDA and PyOpenCL with hardening options on a machine with less than 4 GB of memory is problematic) should be avoided. If a task fails again it is not retried, only logged. A sketch of this pattern follows.
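
The queue-with-retry pattern is easy to get wrong, so here is a minimal sketch of it. The real scripts are written in Ruby; this sketch uses Java, and all names (QueueWithRetry, executeOneTask, the queue contents) are hypothetical stand-ins, not the actual code:

import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueWithRetry {
    static final ConcurrentLinkedQueue<String> mainQueue = new ConcurrentLinkedQueue<>();
    static final ConcurrentLinkedQueue<String> retryQueue = new ConcurrentLinkedQueue<>();

    // Stand-in for execute_one_task: upload the JSON task description,
    // run process-task over SSH, download the log, report success.
    static boolean executeOneTask(String task) {
        return Math.random() > 0.1; // simulate occasional failures
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 100; i++) mainQueue.add("package-" + i);

        int slots = 4; // one thread per slot on the slave node
        Thread[] workers = new Thread[slots];
        for (int s = 0; s < slots; s++) {
            workers[s] = new Thread(() -> {
                String task;
                while ((task = mainQueue.poll()) != null) {
                    if (!executeOneTask(task)) retryQueue.add(task); // retry later
                }
            });
            workers[s].start();
        }
        for (Thread w : workers) w.join(); // parallel slots have drained the main queue

        // The "last thread": retry failed tasks one at a time, alone on the node
        String task;
        while ((task = retryQueue.poll()) != null) {
            if (!executeOneTask(task)) System.err.println("failed twice: " + task);
        }
    }
}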

The script creates one additional thread which periodically (every minute) checks whether there are any tasks left in the main and retry queues.

The script ends when all threads finish.

execute_one_task is a simple function. It encodes the task description into JSON and uploads the JSON to the slave node. Then it executes process-task on the slave node and downloads the log. It can also download the built package from the slave node and upload it to an archive using the reprepro script. The function returns whether the test succeeded.

process-task

It is the script run on the slave node for each task. It reads the JSON file with the description of the task, passed as a command line argument. If the master node wants to test installation, it runs instest and exits. Otherwise it proceeds with testing the package build.

The script can accept options governing the package build process. For example it sets DEB_BUILD_OPTIONS=parallel=10 when we want to test parallel building. It can also accept versions of compilers and libraries to use during compilation. The script sets up repositories and package priorities to ensure that the proper versions of build dependencies are used. Then the script calls sbuild to build the package, and checks whether the estimate of the time needed to perform the test was correct.

instest

It is used to test installation and upgrade of packages. It uses a chroot to install packages into.

It accepts the chroot location and the package to test as command line arguments. It cleans the chroots and checks whether the package is already installed. The script tests installation in various circumstances: it installs only dependencies or build dependencies, installs the package alone, installs the package and all packages recommended by it, and installs the package and all packages recommended and suggested by it. The script can also test upgrading the package, to check for problems caused by the upgrade.

There are some workarounds for MySQL and PostgreSQL; it looks like there are some problems with the post-inst scripts (which try to connect to the newly installed database) in those packages, so testing must take such failures into consideration.

Summary

Using the cloud helps with running many tests in a short time. Such tests can serve as a QA tool and as an aid for experimentation. Building packages in a controlled environment, one which can easily be recreated and shut down, helps ensure that packages are of good quality. At the same time, the ability to run many tests and to prepare different environments can help with experimentation, e.g. testing different compilers, configuration options, and so on.

Thanks to Amazon and to James Bromberger for providing grants allowing Debian to use AWS and EC2 to perform such tests.

Aparapi

Some time ago I read about Aparapi (the name comes from A PARallel API), a Java library for running code on the GPU, which was open-sourced in 2011. The main page links to videos and a webinar by Gary Frost (the main contributor, according to the SVN log) from 2011-12-22, describing it. There is also a blog about it. Aparapi allows for running threads either on the CPU (via a Java Thread Pool) or on the GPU (using OpenCL).

The Aparapi programming model is similar to what we already know from Java: we inherit from the Kernel class and override its run() method, just like we would subclass Thread and override run() for ordinary threads. We write code in Java, which means that e.g. if our Kernel class is an inner class it can access objects available in the enclosing class. Because of OpenCL limitations (for example, no full object orientation and no recursive methods), Aparapi kernel classes cannot use Java features like inheritance or method overloading.

Aparapi presents a concept similar to the one behind ORM (Object-Relational Mapping): it tries to avoid an “internal” language (SQL there, OpenCL here) at the price of hiding some details and potentially sacrificing performance.

It comes with a few examples (Mandelbrot, n-body, Game of Life – which could gain from using local memory) showing its possible usages.

I look at Aparapi as someone who uses, contributes to, and is used to PyOpenCL, and I am more interested in its implementation than in its API.

Usage

To run the simplest code, doubling an array on the GPU, we can just write:

new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        output[i] = 2.0f * input[i];
    }
}.execute(size);

When there is no ability to use the GPU (for example we do not have an OpenCL-capable GPU, or there are problems with compiling the kernel), the Java thread pool is used instead.

We can use special methods inside the Kernel class to access OpenCL information:

  • getGlobalId()
  • getGroupId()
  • getLocalId()
  • getPassId()

As with ordinary OpenCL programs, the host code is responsible for managing kernels, copying data, initializing arrays, etc.

There are some limitations stemming from OpenCL’s own limitations. Aparapi supports only one-dimensional arrays, so we need to simulate multi-dimensional arrays by calculating indices ourselves, as in the sketch below. Support for doubles depends on OpenCL and GPU support, which means that we need to check for it at runtime.
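
For example, to work on a conceptually two-dimensional grid we have to flatten it by hand (a sketch; the array, its contents, and the sizes are made up):

final int width = 512, height = 512;
final float[] grid = new float[width * height]; // flattened 2D array
new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        int x = gid % width; // column, recovered by hand
        int y = gid / width; // row
        grid[y * width + x] = x + y; // would be grid[y][x] in a real 2D array
    }
}.execute(width * height);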

There is no support for static methods, method overloading, or recursion, which is caused by the lack of support for them in OpenCL 1.x. Of course OpenCL code does not throw exceptions when run on the GPU; this means that running code on the CPU and on the GPU can have different effects: the GPU will not throw ArrayIndexOutOfBoundsException or ArithmeticException, while the CPU will.

There is a page with guidelines on writing Kernel classes.

There is also a special case of the execute method:

kernel.execute(size, count)

which loops count times over the kernel. Inside such a kernel we can call getPassId() to get the index of the current pass, e.g. to prepare data on the first iteration.
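
A sketch of such a multi-pass kernel (the array and the pass count are made up; getPassId() is one of the methods listed earlier):

final float[] data = new float[size];
new Kernel() {
    @Override public void run() {
        int i = getGlobalId();
        if (getPassId() == 0) {
            data[i] = 0.0f; // prepare data on the first iteration
        }
        data[i] += 1.0f; // executed in every pass
    }
}.execute(size, 10); // loop the kernel 10 times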

Implementation

Some details of converting Java to OpenCL are described on the wiki page.

When trying to run a Kernel, Aparapi checks whether it is the first usage, and if not, whether the code was already compiled. If it was compiled, the OpenCL source generated by Aparapi is used. If it was not successfully compiled, Aparapi uses the Java Thread Pool and runs the code on the CPU, as there must have been some trouble in previous runs. If this is the first run of the kernel, Aparapi tries to compile it; if that succeeds, it runs on the GPU, otherwise it uses the CPU.

In case of any trouble, Aparapi falls back to the CPU and the Java Thread Pool to run computations.

When using the Java Thread Pool, Aparapi creates a pool of threads, one thread per CPU core. Then it clones the Kernel object so that each thread has its own copy, to avoid cross-thread access. Each thread calls the run() method globalSize/threadCount times. Aparapi updates the special variables (like globalId) after each run() execution, and waits for all threads to finish.

Aparapi classes

There are four sets of classes. One set consists of classes responsible for communicating with OpenCL and managing OpenCL objects; it contains the JNI classes and their partners on the Java side. This is similar to the situation in PyOpenCL, where there are also C++ classes and their Python wrappers. There are exception classes. There are two special classes: the Kernel class (from which the programmer inherits) and the Config class. And finally there are classes responsible for analysing Java bytecode and generating OpenCL kernels.

The Config class is responsible for runtime configuration, allowing some features to be enabled or disabled.

Exceptions:

  • AparapiException – base class; all other exceptions inherit from it
  • ClassParseException – unsupported Java code was encountered during code analysis; ClassParseException contains details of the encountered error, and its ClassParseException.TYPE enum identifies which forbidden construct was found in the code
  • CodeGenException
  • DeprecatedException
  • RangeException

OpenCL mapping classes:

  • Range – partner for JNIRange; manages work group sizes and their dimensionality
  • KernelArg – responsible for passing arguments to the kernels
  • JNIContext – responsible for running the kernel; manages only one device, can have many queues, but can deal with only one kernel
  • KernelRunner – checks hardware capabilities; can execute Java and OpenCL code
  • ProfileInfo – stores raw OpenCL performance-related information, like start and end time of execution, status, and Java labels; it is returned by KernelRunner

JNIContext is also responsible for managing the memory containing data passed to the kernel. It pins such memory to avoid it being collected or moved by the Garbage Collector. It also responds to changes in non-primitive objects, regenerating cl_mem buffers if it detects changes. Its longest method is responsible for running the kernel. It also contains some debugging capabilities, like dumping memory as floats or integers.

The UnsafeWrapper class is responsible for atomic operations, which are important when running multithreaded code in a heterogeneous environment. It is a wrapper around sun.misc.Unsafe and seems to be used to avoid compiler warnings during reflection. What’s interesting is that all exceptions in this class are marked with “TODO: Auto-generated catch block”.

The Kernel class is intended to be inherited from by the programmer. It loads the JNI library, manages the Range, wraps Math methods, and provides memory barriers. Its execute() method accepts the global size, i.e. the number of threads to run. For now it is not possible to run threads in a 2D or 3D array, or to control the local block size. The execute() method is blocking; it returns only when all threads have finished. On the first call it determines the execution mode (OpenCL or JTP).

The Kernel.EXECUTION_MODE enum allows for controlling which mode is used to run our kernels:

  • CPU – OpenCL using CPU
  • GPU – OpenCL using GPU
  • JTP – Java Thread Pool (which usually means that there were problems with OpenCL)
  • SEQ – sequential loop on the CPU, very slow, used only for debugging

We can get the current execution mode by calling kernel.getExecutionMode(), or set it by calling kernel.setExecutionMode(mode) before running execute().
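
For example, to force the Java Thread Pool (e.g. when debugging without a GPU; kernel here stands for an instance of our Kernel subclass):

kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP); // skip OpenCL entirely
kernel.execute(size);
System.out.println("Executed in mode: " + kernel.getExecutionMode());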

Java code analysis

To generate OpenCL code, Aparapi loads the class file, finds the run() method and all methods called by it (using reflection), then converts the bytecode into an expression tree. When analysing the bytecode, Aparapi constructs a list of all accessed fields (using the getReferencedFields() method) together with their access mode (read or write), converts scalars into kernel arguments, and converts primitive arrays into OpenCL buffers.

Bytecode consists of variable-length instructions, and the Java Virtual Machine is a stack-based virtual machine, which means that when analysing bytecode we need to keep track of the state of the stack. We also need to determine the types of arguments from the bytecode.

The main algorithm of the analysis creates a list of operations in program counter order, creating one object per bytecode instruction. All instructions are categorised based on their effect on the stack. In the subsequent analysis, when an instruction consumes a value from the stack, the previous instruction is marked as a child of the consumer. After adding some hacks for bytecodes like dup and dup2, Aparapi ends up with a list of operation trees, ready to be used to generate OpenCL code.

There are classes intended for presenting bytecode as a hierarchy,
just as a compiler would, to help with generating kernels:

  • MethodModel – checks whether the method uses doubles or is recursive, checks for getters and setters, and creates the list of instructions. It is a long class, containing instruction transformers.
  • InstructionSet – describes bytecodes and types; contains information on how each bytecode changes the stack (what it pops and pushes)
  • InstructionPattern – matches bytecodes against Instructions
  • Instruction – base class describing a bytecode; contains links to the previous and next instructions, its children, all branching targets and sources, etc. KernelWriter calls the writeInstruction() method for each object of this class.
  • ExpressionList – list of instructions in a particular expression; used to build the list of expressions to be then transformed into OpenCL; a linked list with some ability to fold instructions
  • BranchSet – represents a list of bytecodes from one branch of a conditional instruction
  • ClassModel
  • ByteBuffer – buffer for accessing bytes in a *.class file, used to parse the ClassFile structure; used by ByteReader
  • ByteReader – provides access to the *.class file; used mostly by ClassModel and MethodModel

There is an entire set of classes describing the hierarchy of bytecode instructions:

  • ArrayAccess
    • AccessArrayElement
    • AssignToArrayElement
  • Branch
    • ConditionalBranch
    • Switch
    • UnconditionalBranch – FakeGoto
  • CloneInstruction
  • CompositeInstruction
    • ArbitraryScope
    • EmptyLoop
    • ForEclipse, ForSun
    • IfElse
    • If
    • While
  • DUP, DUP_X1, DUP_X2, DUP2, DUP2_X1, DUP2_X2
  • Field
  • IncrementInstruction
  • OperatorInstruction
    • Binary
    • Unary
  • Return

And classes for generating code:

  • KernelWriter – translates Java methods (getGlobal*) into OpenCL function calls; converts Java types to OpenCL types; also writes all needed pragmas: for fp64, atomics, etc.
  • InstructionTransformer – used by MethodModel; abstract, its only children are created on the fly
  • BlockWriter – writes instructions, which it gets as Expressions or InstructionSets

There are methods (supporting many types, like int, byte, long, double, float) that help with debugging and tuning:

  • get(array) – gets the buffer back from the GPU
  • put(array) – puts the array onto the GPU

During code generation, the dupX bytecodes (which repeat X items on the stack) are replaced by repeating the instructions that generate those X items.

Performance

There are two compilations at runtime when using Aparapi: one is the analysis of bytecode and the generation of OpenCL source (done by Aparapi), the other is the compilation of OpenCL into a binary understood by the GPU (performed by the driver).

There are some methods that help with profiling:

  • getConversionTime() – returns the time it took to translate bytecode into OpenCL code
  • getExecutionTime() – returns the execution time of the last run()
  • getAccumulatedExecutionTime() – returns the time it took to execute all calls to run() from the beginning

All the usual performance tips apply to Aparapi. The ideal candidates for speed-up are operations on large arrays of primitives where the order of processing does not matter: a large amount of data and heavy computation, with little data transfer (or transfer overlapped by computation). Avoid inter-element dependencies, as they break parallelism. As usual we need to avoid branching, as the threads should execute instructions in lockstep.

There is a possibility to use local or constant memory in kernels generated by Aparapi. To do so, we append _$constant$ or _$local$ to the variable name. Of course, when using local memory we need to put barriers into kernels, which means that we risk deadlocks. It is again a case of a leaky abstraction: we cannot pretend that we are just writing Java code, and we need to know the details of kernel execution.
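
A sketch of the naming convention (the names and sizes are made up, and I am assuming a version of Aparapi whose Range allows specifying the local size explicitly via Range.create(size, 64)):

final int size = 1024;
final float[] input = new float[size];
final float[] output = new float[size];
final float[] tile_$local$ = new float[64]; // the suffix makes this __local in the generated OpenCL
new Kernel() {
    @Override public void run() {
        int lid = getLocalId();
        tile_$local$[lid] = input[getGlobalId()];
        localBarrier(); // wait for the whole work group; a misplaced barrier can deadlock
        output[getGlobalId()] = tile_$local$[(lid + 1) % 64];
    }
}.execute(Range.create(size, 64)); // local size must match the local buffer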

There is also a possibility to see the source of the generated OpenCL kernels, by adding the option -Dcom.amd.aparapi.enableShowGeneratedOpenCL=true to the JVM arguments.

Memory copying

Each execution of the kernel means copying all the data used in the kernel. Aparapi assumes that Java code (code outside of run()) can change anything, so it copies buffers to and from the device. We can avoid this behaviour (which kills performance) by demanding explicit copying, by calling:

kernel.setExplicit(true);

Then we are responsible for copying data, in the following manner:

kernel.put(block);
kernel.execute(size);
kernel.get(block);

If we are running the same computations over and over again, we can put a loop inside run(), or call execute(size, count) to let Aparapi manage the looping.

Summary

Aparapi is an interesting library. It can serve as an entry point to GPU computing, by allowing us to transfer existing Java code to OpenCL to determine whether it is even possible to run that code on the GPU.

Aparapi is still being developed; there is initial support for lambdas (in a separate branch), and the ability to choose the device to run code on, although this seems unfinished (according to the wiki). There are unit tests covering OpenCL code generation for those who want to experiment with it.

There is also a proposal to add extensions to Aparapi; not OpenCL extensions, but the ability to attach real OpenCL code. If it succeeds, Aparapi will not differ much from PyOpenCL.

The main problem of Aparapi is the limit of one kernel and one run() method per class, which is the result of modelling Kernel after the Thread class. There are some hacks to deal with it, but they are just hacks.

According to the documentation, reduction cannot be implemented in Aparapi. It is present in PyOpenCL, but implemented using more than one kernel, so it might be a problem with the model. There is also no support for volatile memory, which might also be a problem, e.g. for reduction (there was a problem with volatile memory and reduction in PyCUDA).

Another limitation is the inability to use multidimensional Java arrays. There are proposals to have multidimensional ranges of execution (2D, 3D, etc.) and some work has been done, but nothing is finished yet. There is also no support for sharing data between OpenCL and OpenGL.

Aparapi does not support vector execution (SIMD), which could be used to speed up execution on the CPU; this might be the result of the lack of such support in Java, as described by Jonathan Parri et al. in the article “Returning control to the programmer: SIMD intrinsics for virtual machines”, CACM 2011/04 (sorry, paywalled).

I’m wondering how OpenCL 2.0, which allows for blocks and for calling kernels from inside other kernels, will change support for advanced Java features in Aparapi.

I want to finish with a quote from the authors of Aparapi:

As a testament to how well we emulate OpenCL in JTP mode, this will also deadlock your kernel in JTP mode 😉 so be careful.