Making sure that all packages are of the necessary quality is hard work. That is why there is a freeze before each new Debian release, to make sure that there are no known release-critical bugs.
There is a “Collab QA” team whose role is “sharing results from QA tests (archive rebuilds, piuparts runs, and other static checks)”. It is also present on Alioth, where the source code for its various tools is hosted.
As noted in the Collab QA description, one of the team’s responsibilities is rebuilding packages in the Debian archive. Large rebuilds are needed to test a new version of a compiler (e.g. during the transition from GCC 4.7 to 4.8) or when considering building packages with LLVM-based compilers. Most packages are built during upload, but some QA and tests require rebuilding large parts of the archive.
This requires large computational power. Thanks to Amazon’s support, Debian can use EC2 to run some of these tasks, as noted by Lucas Nussbaum in his “bits from DPL – November 2013”.
Lucas Nussbaum (the current Debian Project Leader) wrote scripts to rebuild and test packages on Amazon EC2. The scripts can be downloaded from the git repository or cloned using the git protocol. They started as a tool for rebuilding the entire archive, and are now also used to test different versions of compilers (different GCC versions) and to compile packages with clang. Currently Lucas is not actively developing the code; David Suarez and Sylvestre Ledru have taken over that role.
The scripts are written in Ruby. I am not proficient in Ruby, so please forgive any mistakes and misunderstandings.
Rebuilding is managed by one master node, which always runs. While the master node controls the slave nodes, it is not responsible for starting and stopping them. The user is responsible for starting the slave nodes (usually from their own machine) and for sending the list of them to the master node. The default setup, described in the README, uses 50 m1.medium nodes and 10 m2.xlarge nodes. The smaller nodes are used to compile small packages; the larger ones compile huge packages that need a lot of memory, like LibreOffice, x.org, etc.
Each slave node has one or more slots for handling tasks; this means that it is possible to run tests in parallel, e.g. compile more than one package at the same time.
The user is supposed to use the AWS-CLI tools, or other means, to manage the slave nodes. AWS-CLI is not yet part of Debian, although Takaki Taniguchi wants to package it and upload it to Debian (Bug #733211). For the time being you can download the AWS CLI source code from GitHub.
Spot instances are used to reduce costs. This is possible because compiling packages (especially for tests, as opposed to the build step when uploading a package to the Debian archive) is not time-critical. It is also idempotent (we can compile a package as many times as we want), and the workflow copes well with being interrupted when a spot instance is no longer available.
All data sent between nodes is encoded as JSON. Using JSON allows for sending arrays and dictionaries, which means it is easy to send structures describing a package, rebuild options, logs, parameters, results, etc.
There is no communication between the user’s machine and the master node. The user is supposed to SSH to the master node, clone the repository with the scripts, and run the scripts from inside this repository. The master node communicates with the slave nodes using SSH and SCP; it sends the necessary scripts to the slave nodes and then runs them.
The usual workflow is described in the README:
- Request spot instances
- Wait for their start
- Connect to master node
- Prepare job description (list all packages to test)
- Run master script passing list of packages and list of nodes as arguments
- Wait for all tasks to finish
- Download result logs
- Stop all slave instances
The JSON contains information about the packages to compile. Each package is
described using the following fields:
- type – whether to test package compilation or installation (instest).
- package – name of the package to test.
- dist – Debian distribution to test on.
- esttime – estimated time needed to perform the test, used when scheduling builds.
- logfile – name of the file to write the log to.
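As a rough illustration, a single task built from the fields above might be encoded like this (all the values are invented; the real generate-tasks-* scripts decide them):

```ruby
require 'json'

# Hypothetical task description using the fields listed above;
# every value here is made up for illustration.
task = {
  'type'    => 'rebuild',       # or 'instest' for installation tests
  'package' => 'hello',
  'dist'    => 'sid',
  'esttime' => 120,             # assumed unit: seconds
  'logfile' => 'hello_sid.log'
}
puts JSON.generate(task)
```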
The repository contains many scripts; their names usually convey their jobs.
Scripts containing instest in their names are intended to test the
installation or upgrade of packages.
- clean – removes all logs and JSON files describing tasks and slave nodes.
- create-instest-chroots – creates a chroot, debootstraps it, copies basic configuration, updates the system, and copies maintscripts; works with sid, squeeze, and wheezy.
- generate-tasks-* – scripts for generating the JSON files describing tasks for the master to distribute to slave nodes.
- generate-tasks-instest – reads all packages from the local repository and sets them up for installation testing.
- generate-tasks-rebuild – reads the list of packages from the Ultimate Debian Database, excluding some, and creates the task list. Allows limiting packages based on their build time. Uses an unstable chroot.
- generate-tasks-rebuild-jessie – script for building Jessie packages, using a Jessie chroot.
- generate-tasks-rebuild-wheezy – script for building Wheezy packages, using a Wheezy chroot.
- instest – tests installation.
- masternode – script run on the master node, distributing all tasks to the slaves.
- merge-tasks – merges JSON files with task descriptions.
- process-task – the main script run on a slave node.
- setup-ganglia – installs the Ganglia monitor on a slave node, to monitor its health.
- update – updates the chroot to the newest package versions.
The masternode script accepts two files as command-line arguments: the list of packages to test and the list of slave nodes.
It connects to each slave node and uploads the necessary scripts (instest and process-task) to it.
For each node it creates as many threads as there are slots; each thread opens one SSH and one SCP connection. Each thread then takes one task from the task queue and calls execute_one_task to process it. Success is logged if the task succeeds; otherwise the task is added to the retry queue. When there are no tasks left in the main queue, the number of available slots on the slave node is decreased and the thread (except for the last one) ends.
The last thread for each node is responsible for dealing with failed tasks from the retry queue. It again loops over all available tasks, this time from the retry queue, and calls execute_one_task for each of them. This time each task runs alone on the node, so problems caused by concurrent compilation (e.g. compiling PyCUDA and PyOpenCL with hardening options on a machine with less than 4 GB of memory is problematic) should be avoided. If a task fails again, it is not retried but only logged.
The script creates one additional thread which periodically (every minute) checks whether there are any tasks left in the main and retry queues.
The script ends when all threads finish.
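A minimal, self-contained sketch of this scheduling scheme (not the actual masternode code: execute_one_task is replaced by a fake, and the serial retry pass is done after joining the workers rather than in the last per-node thread):

```ruby
# Sketch of the main-queue/retry-queue scheme: worker threads, one per
# slot, drain a shared queue; failed tasks are replayed serially.
main_queue  = Queue.new
retry_queue = Queue.new
%w[pkg-a pkg-b pkg-c pkg-d].each { |p| main_queue << p }

attempts = Hash.new(0)
results  = {}
mutex    = Mutex.new

# Fake stand-in for execute_one_task: 'pkg-b' fails on its first attempt,
# simulating a build broken by concurrent compilation.
execute_one_task = lambda do |task|
  mutex.synchronize do
    attempts[task] += 1
    !(task == 'pkg-b' && attempts[task] == 1)
  end
end

slots = 2  # slots on one (simulated) slave node
workers = slots.times.map do
  Thread.new do
    loop do
      task = begin
        main_queue.pop(true)   # non-blocking; raises ThreadError when empty
      rescue ThreadError
        break                  # no tasks left: this slot is done
      end
      if execute_one_task.call(task)
        mutex.synchronize { results[task] = :ok }
      else
        retry_queue << task    # will be retried serially later
      end
    end
  end
end
workers.each(&:join)

# Serial retry pass: each failed task runs alone; a second failure
# is only logged, not retried.
until retry_queue.empty?
  task = retry_queue.pop(true)
  results[task] = execute_one_task.call(task) ? :ok : :failed
end

results.sort.each { |t, r| puts "#{t}: #{r}" }
```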
execute_one_task is a simple function. It encodes the task description into JSON and uploads the JSON to the slave node. Then it executes process_task on the slave node and downloads the log. It can also download the built package from the slave node and upload it to an archive using the reprepro script. The function returns whether the test succeeded or not.
process-task is the script run on a slave node for each task. It reads the JSON file with the task description, passed as a command-line argument. If the master node wants to test installation, it runs instest and exits. Otherwise it proceeds with testing the package build.
The script can accept options governing the package build process. For example, it sets DEB_BUILD_OPTIONS=parallel=10 when we want to test parallel builds. It can also accept the versions of compilers and libraries to use during compilation. The script sets up repositories and package priorities to ensure that the proper versions of the build dependencies are used. Then it calls sbuild to build the package and checks whether the estimate of the time needed to perform the test was correct.
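A minimal sketch of how such a script might assemble the environment and the sbuild invocation from a task description (DEB_BUILD_OPTIONS and the sbuild --dist flag are real; how process-task actually combines them is an assumption):

```ruby
# Hypothetical helper: derive the build environment and sbuild command
# line from a task hash. Not the real process-task logic.
def build_command(task)
  env = {}
  # Request a parallel build when the task asks for one.
  env['DEB_BUILD_OPTIONS'] = "parallel=#{task['parallel']}" if task['parallel']
  [env, ['sbuild', '--dist', task['dist'], task['package']]]
end

env, cmd = build_command('package' => 'hello', 'dist' => 'sid', 'parallel' => 10)
# env holds the parallel-build request; cmd is the sbuild call for 'hello' on sid.
```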
instest is used to test the installation and upgrade of a package. It uses a chroot to install the package into.
It accepts the chroot location and the package to test as command-line arguments. It cleans the chroots and checks whether the package is already installed. The script tests installation in various circumstances: it installs only the dependencies or build dependencies, installs the package, installs the package and all packages recommended by it, and installs the package with all packages recommended and suggested by it. The script can also test upgrading the package, to check whether the upgrade causes any problems.
There are some workarounds for MySQL and PostgreSQL; it looks like there are problems with the post-inst scripts (which try to connect to the newly installed database) in those packages, so testing must take such failures into consideration.
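The installation scenarios above could be sketched as a small matrix of apt-get invocations (the flags are real apt-get options, but how instest actually drives apt is an assumption):

```ruby
# Hypothetical scenario matrix for installation testing; instest's real
# command lines may differ.
def install_commands(pkg)
  {
    'plain'           => "apt-get install --no-install-recommends #{pkg}",
    'with-recommends' => "apt-get install --install-recommends #{pkg}",
    'with-suggests'   => "apt-get install --install-recommends --install-suggests #{pkg}"
  }
end

cmds = install_commands('hello')
```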
Using the cloud helps with running many tests in a short time. Such tests can serve as a QA tool and as an aid for experimentation. Building packages in a controlled environment, one which can easily be recreated and shut down, helps ensure that packages are of good quality. At the same time, the ability to run many tests and to prepare different environments supports experimentation, e.g. testing different compilers, configuration options, and so on.
Thanks to Amazon and to James Bromberger for providing the grants that allow Debian to use AWS and EC2 to perform such tests.