PyOpenCL in main!

Yesterday (2014-05-26) my sponsor Piotr Ożarowski uploaded a new version of PyOpenCL to Debian. Usually I can upload new versions of packages myself, as I am a Debian Maintainer, but this time it was a very special upload: it closed bug 723132, which asked for PyOpenCL to be moved from contrib to main. Because Debian contains free OpenCL implementations, Beignet and Mesa, one can run OpenCL programs using FLOSS code.

Moving a package from contrib to main meant that PyOpenCL had to be removed from contrib and uploaded anew to main. Thanks to Piotr for sponsoring it, and to the FTP masters for accepting it from NEW and dealing with all the removing and re-adding of the package.

There is still work to do. Rebecca Palmer is working on allowing all OpenCL implementations to be installed at the same time, which should lead to more experimentation and easier work with OpenCL, but requires changes to many of the OpenCL-related packages. I’m also thinking about moving PyOpenCL to pybuild, but that needs to wait until I have more free time.

Let’s hope that having PyOpenCL in main will allow more people to find and use it.


New PyOpenCL in Debian

Debian now has a new version of the PyOpenCL package (2011.2); thanks to Piotr for sponsoring it. I have changed the dependencies to allow using non-NVIDIA OpenCL libraries, closing #628702. There is still an open Ubuntu bug, but Debian-Ubuntu package synchronisation should deal with it.

Popularity of GPGPU-related packages

While dealing with #628702 I looked at the metadata of various packages, including their popularity as gathered by popcon. The data used in the following analysis is from 2011-12-05. I looked at the following packages:

PyOpenCL is installed on 189 machines and used on 62 of them. PyCUDA is installed on 13 machines, used on 7. As for NVIDIA GPGPU, libcuda1 is installed on 1110 machines and used on 131; nvidia-libopencl1 is installed on 822 machines and used on 134. The AMD OpenCL libraries (amd-libopencl1) are installed on 33 machines, used on 9.

The AMD OpenCL libraries are not very popular because they were only uploaded into Debian with the latest version of the Catalyst drivers, two weeks ago. Many more people have installed the GPGPU packages (libcuda1 and libopencl1) than are actually using them. Data for the NVIDIA packages might be skewed in favour of libcuda1: nvidia-opencl-icd depends on libcuda1 and libnvidia-compiler, which means that everyone who wants to use the NVIDIA OpenCL libraries must install libcuda1 for OpenCL to work.

Although more people have installed libcuda1 (1110) than libopencl1 (822), a similar number of people are using those packages (131 vs. 134). This might be caused by the package dependencies described in the previous paragraph. About half of the people using libopencl1 are doing so through PyOpenCL – 134 active users of libopencl1 versus 62 active users of python-pyopencl, which depends on libopencl1.

There is definitely interest in both OpenCL and CUDA in Debian, although until recently only the NVIDIA OpenCL libraries were available in the repositories. It will be interesting to see how the introduction of the AMD OpenCL libraries, and of a PyOpenCL that can use any OpenCL provider, will change the popularity of OpenCL and CUDA.

Hardware

To test OpenCL on AMD I have been using a Foxconn nT-A3500. It is a small machine built around the AMD Fusion architecture, with a dual-core CPU and an ATI Radeon 6310 integrated on the same chip. I have not yet tested its performance. It is interesting that Foxconn is producing hardware under its own name. It looks like they have learned how to make good hardware by doing outsourcing work, and are now competing on the global market (with quite nice products, IMO). That’s a feature of outsourcing, I guess.

Performance of copying only envelope

As promised in my previous post, I have performed some tests with different methods of copying data from Palabos to Sailfish. The lattice Boltzmann method is based on local interactions, so to work correctly we need to transfer the incoming forces (the outermost layer of our subdomain) before each simulation step, and the outgoing interactions (the next-to-outermost layer of the subdomain) after the step. This means that while the simulation is running it suffices to copy the envelope instead of the full subdomain: when copying from host to device we copy the outermost layer of the subdomain, and when copying from device to host we copy the penultimate layer.
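
To make the envelope concrete, here is a small NumPy sketch (illustrative size and names only, not Sailfish code) of which slices of a single population array form the two layers described above:

import numpy

N = 8                                            # illustrative subdomain edge
a = numpy.zeros((N, N, N), dtype=numpy.float32)  # one population of the subdomain

# outermost layer: copied host -> device before a simulation step
outer = [a[0, :, :], a[-1, :, :],
         a[:, 0, :], a[:, -1, :],
         a[:, :, 0], a[:, :, -1]]

# next-to-outermost (penultimate) layer: copied device -> host after a step
penultimate = [a[1, 1:-1, 1:-1], a[-2, 1:-1, 1:-1],
               a[1:-1, 1, 1:-1], a[1:-1, -2, 1:-1],
               a[1:-1, 1:-1, 1], a[1:-1, 1:-1, -2]]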

The configuration is the same as in the previous tests: 64-bit Debian Sid, NVIDIA drivers 275.28, and Python 2.7.2, running on:

  • Asus eeePC 1201N with Atom 330 (2 cores with HT@1.6GHz), 2GB of RAM, NVIDIA ION (GeForce 9400M) with 256MB RAM
  • desktop with Intel E5200 (2 cores@2.4GHz), 4GB of RAM, NVIDIA GeForce 460 with 1GB RAM

I have decided to perform 3 tests:

  • full “smart” copy using memcpy3D, with different memory layouts on host and device (the last case from the previous test)
  • “smart” envelope copy, trying to join as many memcpy3D calls into one as possible
  • “naive” envelope copy, copying the envelope in 6 steps (one for each cube wall), repeated 19 times to copy the entire population

ION performance

ION copy performance chart

Time of copying envelope of 3D array on ION [ms]
Domain size | memcpy3D without host padding: GPU, CPU | optimised envelope copy: GPU, CPU | naive envelope copy: GPU, CPU
12 2.374 2.433 8.376 8.433 14.665 14.723
16 2.309 2.365 10.290 10.346 15.611 15.666
20 3.528 3.585 12.485 12.541 17.207 17.263
24 4.518 4.803 15.105 15.163 19.531 19.587
28 6.327 6.383 17.952 18.008 21.886 21.944
32 5.112 5.170 19.886 19.944 24.365 24.422
36 6.369 6.426 22.022 22.078 27.347 27.405
40 9.123 9.182 24.303 24.359 30.540 30.597
44 13.329 13.388 26.447 26.504 33.828 33.884
48 9.067 9.128 28.570 28.628 36.744 36.802
52 12.995 13.051 31.232 31.289 39.863 39.921
56 18.545 18.602 33.703 33.761 44.114 44.170
60 26.061 26.119 37.189 37.246 48.258 48.314
64 15.670 15.727 39.954 40.010 51.328 51.385
68 24.649 24.705 43.390 43.447 56.469 56.524
72 33.577 33.635 47.897 47.955 61.682 61.740
76 44.124 44.181 51.850 51.909 66.904 66.962
80 27.927 27.983 55.985 56.041 71.405 71.462
84 42.142 42.199 60.319 60.418 78.399 78.456
88 55.455 55.512 65.800 65.856 86.667 86.723
92 70.149 70.207 71.429 71.486 92.220 92.276

The results for the ION (GeForce 9400M) are interesting. Smart envelope copy is faster than the naive one by 5 to 10 ms, which is probably caused by the overhead of calling memcpy3D more times in the latter case. At the same time, the envelope copy is slower than copying the entire subdomain. Only for the 92x92x92 domain is smart envelope copy as fast as (or as slow as) the full copy – and for a 96x96x96 domain it would again be slower (recall from the previous post that copying is fastest for domains whose size is divisible by 16). I am not sure what causes this slowness: the age of the card (the GeForce 9400M is a Tesla-architecture CUDA GPU), or the fact that it is an integrated GPU which shares RAM with the CPU. This might mean that for old devices it does not make sense to optimise copying, as envelope copy would only win for domains that do not fit into GPU memory anyway – rendering the entire optimisation pointless.

GeForce 460 performance

GeForce 460 copy performance chart

Time of copying envelope of 3D array on GeForce 460 [ms]
Domain size | memcpy3D without host padding: GPU, CPU | optimised envelope copy: GPU, CPU | naive envelope copy: GPU, CPU
12 0.643 0.663 1.324 1.343 2.008 2.027
16 0.543 0.561 1.758 1.778 2.369 2.426
20 1.320 1.339 2.343 2.361 2.883 2.902
24 1.913 1.933 2.718 2.747 3.453 3.472
28 3.349 3.367 3.342 3.361 4.138 4.157
32 1.868 1.886 3.843 3.863 4.641 4.661
36 4.096 4.134 4.406 4.426 5.517 5.537
40 6.112 6.131 4.979 4.999 6.410 6.428
44 9.014 9.034 5.819 5.839 7.447 7.466
48 4.933 4.953 6.461 6.481 8.113 8.132
52 10.080 10.100 7.357 7.378 9.491 9.510
56 13.882 13.903 8.344 8.364 10.725 10.745
60 18.616 18.637 9.479 9.497 12.230 12.248
64 11.484 11.504 10.670 10.688 13.144 13.163
68 19.885 19.905 11.643 11.663 14.868 14.888
72 26.244 26.264 12.857 12.876 16.550 16.570
76 33.263 33.283 14.496 14.517 18.471 18.491
80 22.082 22.102 16.018 16.037 19.735 19.754
84 34.263 34.283 17.330 17.351 22.013 22.033
88 43.567 43.586 19.222 19.242 24.199 24.249
92 53.220 53.241 21.446 21.466 26.398 26.417
96 37.322 37.342 23.568 23.589 26.989 27.009
100 54.475 54.495 25.014 25.034 28.639 28.657
104 67.171 67.198 27.209 27.230 30.250 30.270
108 80.673 80.695 29.748 29.769 32.992 33.011
112 58.523 58.544 32.149 32.285 34.782 34.803
116 80.705 80.727 34.213 34.234 37.963 37.983
120 96.921 96.945 36.714 36.735 40.828 40.848
124 114.409 114.432 39.748 39.769 44.335 44.357
128 88.675 88.698 42.330 42.351 46.419 46.442
132 114.462 114.487 44.497 44.518 49.184 49.205
136 133.911 133.935 47.742 47.763 51.318 51.340
140 155.449 155.473 51.191 51.214 55.047 55.070
144 123.982 124.006 54.835 54.857 58.005 58.027
148 155.113 155.140 58.077 58.097 62.523 62.547
152 180.289 180.317 61.871 61.893 67.255 67.280
156 206.152 206.179 66.315 66.337 72.369 72.392
160 170.891 170.917 69.996 70.019 74.349 74.372
164 204.927 204.956 74.031 74.054 78.243 78.267
168 234.912 234.940 78.354 78.379 83.180 83.204
172 267.724 267.754 83.445 83.469 88.934 88.957
176 226.915 226.943 88.477 88.501 94.031 94.085
180 265.012 265.042 93.267 93.291 99.729 99.754
184 299.342 299.372 99.016 99.043 103.424 103.449
188 338.534 338.567 104.384 104.410 109.178 109.204
192 296.850 296.881 109.775 109.800 120.001 120.026
196 339.310 339.342 115.620 115.675 121.675 121.699
200 377.733 377.766 121.387 121.413 129.160 129.186
204 422.621 422.656 128.546 128.573 134.787 134.813
208 375.804 375.837 134.658 134.684 138.811 138.867
212 426.756 426.792 141.159 141.185 147.007 147.032
216 469.584 469.620 148.531 148.557 154.895 154.921
220 522.136 522.174 166.426 166.454 165.330 165.363

For the GeForce 460, smart envelope copy is faster than copying the full subdomain for domains larger than 64x64x64. The curve of the time needed to copy the full domain resembles the function x³, while the curve for copying the envelope resembles the much flatter x². This is consistent with intuition: when copying the envelope, the number of copied values grows as the square of the domain size, while for the full subdomain copy it grows as the cube. Smart envelope copy is again a little faster than the naive envelope copy.
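
A quick sanity check of that scaling, counting the values in a single population array of edge N (a sketch):

N = 64
full = N**3                   # values moved by a full subdomain copy
envelope = N**3 - (N - 2)**3  # values in a one-node-thick envelope
print("{0} vs {1}".format(full, envelope))  # 262144 vs 23816 - roughly an 11x difference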

The most noticeable difference between the GeForce 460 and the ION is the long time it takes to copy the envelope on the latter GPU. The ION is an integrated GPU, which means that it uses the same memory as the CPU; that memory is probably not optimised for GPU needs. The GeForce has separate memory which is optimised for GPU usage. Also, as reported by deviceQuery from the CUDA SDK:

        • ION:
          • Concurrent copy and execution: No with 0 copy engine(s)
          • Integrated GPU sharing Host Memory: Yes
          • Support host page-locked memory mapping: Yes
        • GeForce 460:
          • Concurrent copy and execution: Yes with 1 copy engine(s)
          • Integrated GPU sharing Host Memory: No
          • Support host page-locked memory mapping: Yes

The GeForce has a dedicated copy engine for speeding up copy operations, while the ION does not.
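
The same information can be queried from Python with PyCUDA’s device attributes; a small sketch (the attribute names follow the CUDA driver API, and availability may depend on the PyCUDA/CUDA version):

import pycuda.driver as drv
import pycuda.autoinit

dev = pycuda.autoinit.device
for label, attr in [
        ("Concurrent copy and execution", drv.device_attribute.GPU_OVERLAP),
        ("Integrated GPU sharing host memory", drv.device_attribute.INTEGRATED),
        ("Can map host page-locked memory", drv.device_attribute.CAN_MAP_HOST_MEMORY)]:
    print("{0}: {1}".format(label, dev.get_attribute(attr)))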

In summary, computations performed on new-generation CUDA cards based on Fermi chips can benefit from optimising data transfer. Older devices have a more limited amount of available memory and copying data to them takes longer, so the time needed to create sophisticated data transfer schemes is better spent on other parts of the program.

Testing script

#! /usr/bin/python

import sys
import math
import time
import argparse

import numpy

import pycuda
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray

import pycuda.autoinit

def copyPlane(copy, stream, srcX, dstX, srcY, dstY, srcZ, dstZ, width, height, depth):
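    # Set up the shared Memcpy3D descriptor for a single plane and execute it (optionally on a stream).
    # X offsets and width are given in bytes; Y/Z offsets, height and depth in elements.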
    copy.src_x_in_bytes = srcX
    copy.dst_x_in_bytes = dstX
    copy.src_y = srcY
    copy.dst_y = dstY
    copy.src_z = srcZ
    copy.dst_z = dstZ
    copy.width_in_bytes = width
    copy.height = height
    copy.depth = depth
    if stream:
        copy(stream)
    else:
        copy()

parser = argparse.ArgumentParser(description="Test speed of memory copy")

parser.add_argument("-d", "--domain", dest="domainSize", type=int,
    default=18, help="Size of the domain to copy (default: 18)")
parser.add_argument("-t", "--block", dest="blockSize", type=int,
    default=64, help="Size of the block of threads to copy (default: 64)")
parser.add_argument("-b", "--basis", dest="basis", type=int,
    default=19, help="Size of the block of threads to copy (default: 19)")

parser.add_argument("--direction", dest="copyDirection",
    action="store", default='htod', choices=['htod', 'dtoh', 'both'],
    help="Copy direction (default: htod)")

parser.add_argument("--envelope_method", dest="envelopeCopyMethod",
    action="store", default='everything', choices=['naive', 'smart'],
    help="Copy direction (default: naive)")

args = parser.parse_args()

stream = None

floatSize = 4
floatType = numpy.float32
strideX = int(math.ceil(float(args.domainSize)/args.blockSize))*args.blockSize*floatSize
strides = (args.domainSize*strideX, strideX, floatSize)
strideZ = args.domainSize*args.domainSize*strideX

gpudata = pycuda.driver.mem_alloc(strideZ*args.basis)

a3d = pycuda.gpuarray.GPUArray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType, strides=strides, gpudata=gpudata)
a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType)
c3d = pycuda.driver.Memcpy3D()

startD = pycuda.driver.Event()
endD = pycuda.driver.Event()
startH = time.time()
endH = None

startD.record()
c3d.src_pitch = args.domainSize*floatSize
c3d.dst_pitch = strideX
c3d.src_height = args.domainSize
c3d.dst_height = args.domainSize

if args.envelopeCopyMethod == 'smart':
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
# XY
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 0, 0,
            (args.domainSize-2)*floatSize, args.domainSize-2, 1)
        for i in range(1, args.basis):
            copyPlane(c3d, stream, floatSize, floatSize, 1, 1, i*args.domainSize-1, i*args.domainSize-1,
                (args.domainSize-2)*floatSize, args.domainSize-2, 2)
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, args.domainSize*args.basis-1, args.domainSize*args.basis-1,
            (args.domainSize-2)*floatSize, args.domainSize-2, 1)
# XZ
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize*floatSize, 1, args.domainSize*args.basis)
        copyPlane(c3d, stream, 0, 0, args.domainSize-1, args.domainSize-1, 0, 0,
            args.domainSize*floatSize, 1, args.domainSize*args.basis)
# YZ
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            floatSize, args.domainSize, args.domainSize*args.basis)
        copyPlane(c3d, stream, (args.domainSize-1)*floatSize, (args.domainSize-1)*floatSize, 0, 0, 0, 0,
            floatSize, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
# XY
        copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, 0, 0,
            (args.domainSize-4)*floatSize, args.domainSize-4, 1)
        for i in range(1, args.basis):
            copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, i*args.domainSize-1, i*args.domainSize-1,
                (args.domainSize-4)*floatSize, args.domainSize-4, 2)
        copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, args.domainSize*args.basis-1, args.domainSize*args.basis-1,
            (args.domainSize-4)*floatSize, args.domainSize-4, 1)
# XZ
        copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
            (args.domainSize-2)*floatSize, 1, args.domainSize*args.basis-2)
        copyPlane(c3d, stream, 1, 1, args.domainSize-2, args.domainSize-2, 1, 1,
            (args.domainSize-2)*floatSize, 1, args.domainSize*args.basis-2)
# YZ
        copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
            floatSize, args.domainSize-2, args.domainSize*args.basis-2)
        copyPlane(c3d, stream, (args.domainSize-2)*floatSize, (args.domainSize-2)*floatSize, 1, 1, 1, 1,
            floatSize, args.domainSize-2, args.domainSize*args.basis-2)
elif args.envelopeCopyMethod == 'naive':
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        for i in range(args.basis):
            c3d.set_src_host(a3h[i*args.domainSize:(i+1)*args.domainSize, :, :])
            c3d.set_dst_device(int(a3d.gpudata)+i*args.domainSize*args.domainSize*args.domainSize)
# XY
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                args.domainSize*floatSize, args.domainSize, 1)
            copyPlane(c3d, stream, 0, 0, 0, 0, args.domainSize-1, args.domainSize-1,
                args.domainSize*floatSize, args.domainSize, 1)
# XZ
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                args.domainSize*floatSize, 1, args.domainSize)
            copyPlane(c3d, stream, 0, 0, args.domainSize-1, args.domainSize-1, 0, 0,
                args.domainSize*floatSize, 1, args.domainSize)
# YZ
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                floatSize, args.domainSize, args.domainSize)
            copyPlane(c3d, stream, (args.domainSize-1)*floatSize, (args.domainSize-1)*floatSize,
                0, 0, 0, 0, floatSize, args.domainSize, args.domainSize)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        for i in range(args.basis):
            c3d.set_src_device(int(a3d.gpudata)+i*args.domainSize*args.domainSize*args.domainSize)
            c3d.set_dst_host(a3h[i*args.domainSize:(i+1)*args.domainSize, :, :])
# XY
            copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                (args.domainSize-2)*floatSize, args.domainSize, 1)
            copyPlane(c3d, stream, 1, 1, 1, 1, args.domainSize-2, args.domainSize-2,
                (args.domainSize-2)*floatSize, args.domainSize-2, 1)
# XZ
            copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                (args.domainSize-2)*floatSize, 1, args.domainSize-2)
            copyPlane(c3d, stream, 1, 1, args.domainSize-2, args.domainSize-2, 1, 1,
                (args.domainSize-2)*floatSize, 1, args.domainSize-2)
# YZ
            copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                floatSize, args.domainSize-2, args.domainSize-2)
            copyPlane(c3d, stream, (args.domainSize-2)*floatSize, (args.domainSize-2)*floatSize,
                1, 1, 1, 1, floatSize, args.domainSize-2, args.domainSize-2)

endD.record()
endD.synchronize()
endH = time.time()
print("{0:.3f} {1:.3f}".format(endD.time_since(startD), 1000*(endH-startH)))

Errata for CUDA Memory Copy Performance

Unfortunately, yesterday’s post contains a grave error. When copying data, memcpy3D requires passing the number of bytes to transfer from each row, not the number of elements. When I was optimising smart copy performance I took this into consideration in some calls to memcpy3D, but not in others. Below are the correct results and their analysis.
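
In terms of the script, the fix boils down to what gets assigned to width_in_bytes (a sketch, with floatSize equal to 4 for float32):

import pycuda.driver
import pycuda.autoinit

copy = pycuda.driver.Memcpy3D()
domainSize, floatSize = 64, 4
# incorrect: passes the number of elements per row (what some calls in yesterday's script did)
# copy.width_in_bytes = domainSize
# correct: passes the number of bytes per row
copy.width_in_bytes = domainSize * floatSize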

Performance of copy on ION

ION copy performance chart

Time of copying entire 3D array on ION [ms]
Domain size | memcpy: GPU, CPU | memcpy3D: GPU, CPU | memcpy3D copying only values: GPU, CPU | memcpy3D without host padding: GPU, CPU
10 0.612 0.934 1.743 1.802 1.767 1.825 1.694 1.753
15 1.154 1.212 2.704 2.763 3.353 3.412 3.220 3.279
20 1.654 1.711 3.479 3.536 3.654 3.940 3.458 3.515
25 2.525 2.584 4.407 4.463 5.376 5.434 5.156 5.266
30 3.176 3.236 5.465 5.522 8.295 8.353 7.742 7.801
35 4.322 4.379 6.576 6.632 7.015 7.072 6.455 6.514
40 5.474 5.532 7.929 7.986 10.001 10.059 9.228 9.287
45 6.591 6.652 9.038 9.102 15.741 15.797 15.294 15.353
50 7.995 8.053 10.530 10.588 11.585 11.642 10.990 11.047
55 9.691 9.750 12.162 12.219 19.234 19.292 19.580 19.637
60 11.440 11.497 14.112 14.168 26.663 26.721 25.798 25.855
65 26.175 26.234 29.251 29.309 26.349 26.406 21.239 21.297
70 30.518 30.576 33.738 33.796 36.621 36.681 30.729 30.788
75 35.036 35.095 38.358 38.417 51.366 51.424 46.280 46.338
80 39.420 39.479 43.354 43.410 33.808 33.868 28.315 28.375
85 44.819 44.880 49.000 49.059 55.915 55.976 50.566 50.626
90 50.324 50.383 54.530 54.589 72.578 72.637 66.190 66.249

Because of the observed changes in the copy time for the memcpy3D smart copy, I have decided to also test copying domains whose sizes are divisible by 4. Below are the chart and table containing the results of those tests. Previously I did not test power-of-two domain sizes because the Sailfish (CUDA) domain size is controlled by Palabos. I was more interested in trends than in exact values, but if there is a large performance difference between a 60x60x60 and a 64x64x64 domain, it might make sense to force only some of the possible domain sizes to be computed on the GPU.

ION copy performance chart

Time of copying entire 3D array on ION [ms]
Domain size | memcpy: GPU, CPU | memcpy3D: GPU, CPU | memcpy3D copying only values: GPU, CPU | memcpy3D without host padding: GPU, CPU
12 0.812 0.868 2.087 2.146 2.309 2.367 2.173 2.229
16 1.289 1.346 2.893 2.948 2.459 2.515 2.214 2.271
20 1.639 1.695 3.475 3.531 4.032 4.092 3.471 3.531
24 2.196 2.254 4.222 4.282 4.930 4.988 4.600 4.659
28 2.759 2.816 4.966 5.023 6.712 6.768 6.095 6.380
32 3.734 3.792 5.948 6.006 5.201 5.259 4.982 5.042
36 4.439 4.497 6.798 6.855 7.036 7.093 6.321 6.378
40 5.325 5.385 7.778 7.835 9.802 9.861 9.171 9.228
44 6.394 6.452 8.703 8.760 13.877 13.933 13.034 13.092
48 7.424 7.484 9.931 9.987 8.947 9.003 8.949 9.008
52 8.641 8.701 11.226 11.286 13.853 13.909 13.028 13.086
56 10.020 10.079 12.543 12.602 19.571 19.627 18.742 18.799
60 11.416 11.499 14.131 14.188 26.724 26.783 25.784 25.840
64 12.992 13.050 15.882 15.940 15.943 15.998 15.915 15.973
68 28.681 28.741 32.315 32.375 31.796 31.855 24.704 24.761
72 32.265 32.322 35.709 35.767 41.205 41.263 34.071 34.130
76 35.676 35.735 39.521 39.579 52.227 52.286 44.459 44.516
80 39.525 39.585 43.617 43.675 33.766 33.824 28.207 28.266
84 43.384 43.443 47.657 47.714 50.252 50.309 42.078 42.149
88 47.531 47.590 52.083 52.141 63.892 63.952 55.614 55.672
92 52.028 52.088 57.211 57.268 80.168 80.225 71.291 71.349

For the full copy the performance has not changed, but the smart copy behaves differently. There is a clearly visible pattern (especially for domain sizes divisible by 4) in the time it takes to copy the subdomain. The lowest times are for domain sizes divisible by 16: 32x32x32, 48x48x48, 64x64x64, 80x80x80. The copy time then grows until the domain size reaches 16N-4 (e.g. 44, 60, 76), and falls again at the next multiple of 16.

Performance of copy on GeForce 460

GeForce 460 copy performance chart

Time of copying entire 3D array on GeForce 460 [ms]
Domain size | memcpy: GPU, CPU | memcpy3D: GPU, CPU | memcpy3D copying only values: GPU, CPU | memcpy3D without host padding: GPU, CPU
10 0.410 0.432 0.579 0.597 0.489 0.510 0.445 0.463
15 0.840 0.859 1.083 1.101 1.153 1.171 1.093 1.114
20 1.324 1.342 1.466 1.485 1.395 1.414 1.328 1.346
25 1.930 1.949 2.251 2.271 2.438 2.457 2.349 2.369
30 3.876 3.902 3.117 3.138 4.139 4.159 4.092 4.154
35 3.548 3.567 3.672 3.692 3.894 3.912 3.860 3.889
40 4.465 4.482 5.508 5.529 6.721 6.741 6.066 6.086
45 5.636 5.658 5.759 5.777 10.261 10.282 9.919 9.938
50 7.156 7.175 7.243 7.263 9.045 9.066 9.032 9.053
55 8.386 8.406 8.821 8.862 13.406 13.425 14.677 14.698
60 10.122 10.141 10.523 10.545 18.931 18.951 19.866 19.895
65 23.016 23.036 23.736 23.761 17.597 17.618 17.003 17.024
70 26.552 26.572 27.247 27.268 23.212 23.236 23.341 23.366
75 31.211 31.233 31.886 31.907 32.082 32.106 33.479 33.500
80 35.820 35.840 35.643 35.665 24.675 24.698 21.872 21.890
85 40.238 40.263 39.731 39.753 38.612 38.633 40.134 40.154
90 44.809 44.831 44.978 44.998 48.372 48.394 49.591 49.614
95 50.882 50.907 49.887 49.914 65.114 65.137 66.627 66.650
100 55.894 55.917 56.961 57.019 57.187 57.209 56.676 56.698
105 64.301 64.327 66.465 66.519 73.844 73.900 76.339 76.362
110 69.854 69.880 67.134 67.157 89.754 89.778 107.916 107.972
115 79.202 79.231 79.291 79.381 82.054 82.077 90.558 90.583
120 87.267 87.298 85.412 85.438 103.439 103.463 116.363 116.426
125 92.227 92.281 89.846 89.869 131.798 131.866 133.805 133.836
130 143.031 143.057 142.833 142.889 113.838 113.896 119.259 119.284
135 157.465 157.490 156.511 156.541 141.232 141.258 143.574 143.598
140 177.982 178.017 183.117 183.145 160.857 160.889 162.139 162.168
145 173.023 173.056 177.246 177.272 154.549 154.579 153.971 153.997
150 185.642 185.671 188.587 188.619 170.216 170.271 172.422 172.448
155 198.834 198.870 202.212 202.240 212.219 212.278 213.878 213.906
160 213.445 213.475 213.257 213.289 181.400 181.430 179.200 179.228
165 233.645 233.674 231.348 231.377 224.925 224.954 229.401 229.433
170 240.339 240.373 241.443 241.476 258.366 258.428 258.161 258.219
175 255.133 255.192 252.854 252.916 310.002 310.038 315.292 315.323
180 275.525 275.555 281.269 281.330 277.271 277.301 278.145 278.178
185 283.135 283.171 285.218 285.279 333.883 333.913 334.866 334.897
190 298.274 298.311 297.961 297.991 371.361 371.399 373.259 373.320
195 420.039 420.077 415.976 416.011 359.394 359.426 353.314 353.422
200 439.519 439.556 439.015 439.056 393.671 393.708 391.862 391.896
205 465.656 465.693 470.123 470.160 467.429 467.465 467.979 468.053
210 489.099 489.144 481.797 481.835 435.453 435.487 423.905 423.948
215 516.464 516.502 511.549 511.588 486.948 486.986 491.866 491.903
220 537.333 537.495 528.705 528.774 543.285 543.328 534.583 534.621

As previously, I have decided to also analyse domain sizes divisible by 4.

GeForce 460 copy performance chart

Time of copying entire 3D array on GeForce 460 [ms]
Domain size | memcpy: GPU, CPU | memcpy3D: GPU, CPU | memcpy3D copying only values: GPU, CPU | memcpy3D without host padding: GPU, CPU
12 0.565 0.584 0.762 0.781 0.682 0.700 0.638 0.657
16 0.892 0.911 1.146 1.165 0.647 0.666 0.543 0.561
20 1.317 1.336 1.474 1.492 1.397 1.416 1.364 1.382
24 1.804 1.823 1.956 1.974 1.981 1.999 1.943 1.962
28 2.329 2.347 2.480 2.498 3.398 3.415 3.297 3.315
32 2.962 2.980 3.146 3.166 2.073 2.091 1.838 1.855
36 3.767 3.784 3.853 3.872 4.051 4.070 4.023 4.042
40 4.522 4.540 4.644 4.694 6.143 6.162 6.139 6.156
44 5.534 5.628 5.596 5.616 9.002 9.021 9.014 9.032
48 6.400 6.420 6.478 6.496 5.137 5.156 4.912 4.930
52 7.471 7.489 7.456 7.474 9.940 9.958 9.849 9.867
56 8.640 8.659 8.657 8.675 13.864 13.882 13.829 13.848
60 9.933 9.951 10.058 10.077 18.686 18.705 18.614 18.633
64 11.191 11.209 11.456 11.474 11.413 11.432 11.454 11.473
68 24.843 24.862 25.094 25.113 19.206 19.226 19.774 19.792
72 28.012 28.031 28.264 28.282 25.232 25.251 26.093 26.112
76 31.000 31.020 31.352 31.371 32.162 32.181 33.003 33.023
80 34.561 34.581 34.563 34.582 23.263 23.281 21.749 21.769
84 37.958 37.977 38.136 38.157 33.513 33.532 34.219 34.238
88 41.732 41.752 42.181 42.202 42.511 42.529 43.483 43.503
92 45.269 45.288 45.753 45.773 52.521 52.541 53.341 53.361
96 49.748 49.769 49.343 49.363 38.892 38.911 37.362 37.381
100 53.916 53.937 53.985 54.006 54.421 54.442 54.530 54.549
104 57.963 57.984 58.106 58.127 68.072 68.092 66.333 66.353
108 62.377 62.399 62.681 62.701 79.832 79.852 80.333 80.354
112 67.131 67.152 67.641 67.664 60.153 60.174 58.579 58.599
116 72.275 72.297 71.823 71.844 80.010 80.030 80.734 80.754
120 77.130 77.151 76.991 77.013 99.275 99.297 97.347 97.368
124 82.201 82.222 82.540 82.562 116.753 116.775 114.670 114.693
128 87.523 87.544 88.606 88.628 87.967 87.989 87.779 87.801
132 140.171 140.194 139.516 139.539 112.754 112.776 114.028 114.050
136 147.762 147.787 148.465 148.491 131.506 131.528 133.523 133.547
140 157.080 157.106 156.612 156.638 153.446 153.471 155.699 155.724
144 166.334 166.390 166.700 166.725 129.608 129.630 124.698 124.721
148 176.337 176.363 174.642 174.667 154.533 154.556 155.071 155.094
152 186.635 186.662 186.065 186.094 178.441 178.466 180.504 180.529
156 199.970 199.997 194.241 194.267 205.473 205.499 206.427 206.453
160 204.636 204.662 208.479 208.512 174.436 174.461 171.110 171.134
164 216.665 216.693 215.878 215.905 205.541 205.566 205.244 205.272
168 227.487 227.514 225.826 225.853 234.673 234.701 235.387 235.414
172 236.822 236.849 239.309 239.337 265.910 265.939 267.028 267.057
176 249.419 249.447 247.784 247.811 230.288 230.316 226.439 226.465
180 260.284 260.313 259.921 259.950 265.468 265.495 265.742 265.769
184 270.572 270.603 271.259 271.287 300.409 300.439 300.370 300.399
188 285.403 285.433 283.713 283.742 337.666 337.697 337.687 337.718
192 295.388 295.418 295.069 295.098 296.256 296.286 294.651 294.680
196 411.596 411.630 409.348 409.382 348.688 348.719 339.280 339.312
200 428.020 428.055 426.115 426.149 383.760 383.792 378.659 378.692
204 457.611 457.647 443.511 443.547 426.987 427.021 421.823 421.857
208 464.745 464.783 463.364 463.399 384.384 384.417 373.665 373.698
212 481.896 481.934 477.560 477.596 435.707 435.742 426.547 426.582
216 499.794 499.831 498.332 498.368 478.811 478.847 468.109 468.143
220 517.785 517.823 515.649 515.687 526.951 526.988 522.219 522.256

Just like in the case of the ION, we can observe the pattern in which the smallest copy time occurs for domain sizes divisible by 16; the time then grows, falling again at the next multiple of 16. This is not clearly shown on the first chart, but is very clearly visible on the chart showing the times for domains divisible by 4.

Corrected script

#! /usr/bin/python

import sys
import math
import time
import argparse

import numpy

import pycuda
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray

import pycuda.autoinit

def copyPlane(copy, stream, srcX, dstX, srcY, dstY, srcZ, dstZ, width, height, depth):
    copy.src_x_in_bytes = srcX
    copy.dst_x_in_bytes = dstX
    copy.src_y = srcY
    copy.dst_y = dstY
    copy.src_z = srcZ
    copy.dst_z = dstZ
    copy.width_in_bytes = width
    copy.height = height
    copy.depth = depth
    if stream:
        copy(stream)
    else:
        copy()

parser = argparse.ArgumentParser(description="Test speed of memory copy")

parser.add_argument("-d", "--domain", dest="domainSize", type=int,
    default=18, help="Size of the domain to copy (default: 18)")
parser.add_argument("-t", "--block", dest="blockSize", type=int,
    default=64, help="Size of the block of threads to copy (default: 64)")
parser.add_argument("-b", "--basis", dest="basis", type=int,
    default=19, help="Size of the block of threads to copy (default: 19)")

parser.add_argument("--direction", dest="copyDirection",
    action="store", default='htod', choices=['htod', 'dtoh', 'both'],
    help="Copy direction (default: htod)")

parser.add_argument("--full_method", dest="fullCopyMethod",
    action="store", default='memcpy3D', choices=['memcpy', 'memcpy3D', 'memcpy3Dvalues', 'memcpy3Dnopadding'],
    help="Copy direction (default: memcpy3D)")

args = parser.parse_args()

stream = None

floatSize = 4
floatType = numpy.float32
strideX = int(math.ceil(float(args.domainSize)/args.blockSize))*args.blockSize*floatSize
strides = (args.domainSize*strideX, strideX, floatSize)
strideZ = args.domainSize*args.domainSize*strideX

gpudata = pycuda.driver.mem_alloc(strideZ*args.basis)

a3d = pycuda.gpuarray.GPUArray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType, strides=strides, gpudata=gpudata)
if args.fullCopyMethod == 'memcpy3Dnopadding':
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, args.domainSize),
        dtype=floatType)
else:
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, strideX/floatSize),
        dtype=floatType, strides=strides)
c3d = pycuda.driver.Memcpy3D()

startD = pycuda.driver.Event()
endD = pycuda.driver.Event()
startH = time.time()
endH = None

startD.record()
if args.fullCopyMethod == 'memcpy3Dnopadding':
    c3d.src_pitch = args.domainSize*floatSize
else:
    c3d.src_pitch = strideX
c3d.dst_pitch = strideX
c3d.src_height = args.domainSize
c3d.dst_height = args.domainSize

if args.fullCopyMethod == 'memcpy':
    if args.copyDirection in {'htod', 'both'}:
        pycuda.driver.memcpy_htod(a3d.gpudata, a3h)
    if args.copyDirection in {'dtoh', 'both'}:
        pycuda.driver.memcpy_dtoh(a3h, a3d.gpudata)
elif args.fullCopyMethod == 'memcpy3D':
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
elif args.fullCopyMethod in {'memcpy3Dvalues', 'memcpy3Dnopadding'}:
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize*floatSize, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        if args.fullCopyMethod == 'memcpy3Dnopadding':
            c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize*floatSize, args.domainSize, args.domainSize*args.basis)

endD.record()
endD.synchronize()
endH = time.time()
print("{0:.3f} {1:.3f}".format(endD.time_since(startD), 1000*(endH-startH)))

Summary

There is no single best method of copying 3D arrays. If we restrict ourselves to domains with sizes divisible by 16, the best solution is the “smart” memcpy3D, but for other sizes memcpy3D is sometimes slower and sometimes faster than raw memcpy. I intend to use only memcpy3D, as it makes the code simpler than juggling three different functions (memcpy_htod, memcpy_dtoh, memcpy3D).

As for my mistake, the lack of symmetry in memcpy3D (X requires bytes, Y and Z the number of items) makes it harder to experiment with, for example when transposing a 3D array. I guess this API design was chosen to expose some internal details of the memcpy3D implementation, and such low-level access may be important for achieving good performance. This only shows the need for testing your code. If someone is interested in the problems of good API design, ACM Queue has an interesting article by M. Henning, “API Design Matters”.

CUDA Memory Copy Performance

The code in this post contains an error which made some of the results incorrect. For an explanation of the problem and for the corrected results, please see the errata post above.

To allow cooperation between Palabos and Sailfish we need to copy populations between the two systems. In the D3Q19 model, subdomains are 3-dimensional arrays with each node consisting of 19 elements. This post presents my findings on the performance of copying data between host and device. I have performed tests on two machines:

  • Asus eeePC 1201N with Atom 330 (2 cores with HT@1.6GHz), 2GB of RAM, NVIDIA ION (GeForce 9400M) with 256MB RAM
  • desktop with Intel E5200 (2 cores@2.4GHz), 4GB of RAM, NVIDIA GeForce 460 with 1GB RAM

Both systems ran 64-bit Debian unstable with NVIDIA drivers 275.28. Tests were performed using the float32 data type, and all times are presented in milliseconds. The first system was able to store a domain of size 90x90x90 on the GPU; the second was able to store domains up to 220x220x220.

The main problem with transferring data between Sailfish and Palabos is the different memory layout. Palabos takes the CPU cache into consideration and uses a rather compact layout in which the entire population of a node (all 19 values) is stored together. Sailfish, as it runs code on the GPU, must take into consideration the way different threads access memory (coalescing, memory bank conflicts, etc.), so it stores the populations differently: it stores the 3D array of element 0, then the 3D array of element 1, and so on. Basically, one can treat the Sailfish memory layout as 19 separate 3D arrays. Additionally, Sailfish pads the X index to ensure coalesced memory access by threads, which means that each row has a size that is a multiple of 64 (the thread block size in Sailfish).
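
As a small illustration of the difference, here is a sketch of the two layouts as NumPy arrays (the shapes and the padding rule only follow the description above; this is not code taken from either library):

import math
import numpy

N, Q, block = 60, 19, 64                      # subdomain edge, D3Q19 populations, Sailfish block size
padded = int(math.ceil(float(N) / block)) * block

# Palabos-like layout: all 19 values of a node stored together
palabos_like = numpy.zeros((N, N, N, Q), dtype=numpy.float32)

# Sailfish-like layout: 19 separate 3D arrays, rows padded to a multiple of 64
sailfish_like = numpy.zeros((Q, N, N, padded), dtype=numpy.float32)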

As a part of initialisation, both libraries must be brought to the same state. This means that we need to transfer the entire subdomain from Palabos (i.e. CPU, host) to Sailfish (i.e. CUDA, device). There are four ways we can transfer this data between host and device:

  • using raw memcpy_htod and memcpy_dtoh
  • using memcpy3D and copying everything (including padding for rows)
  • using memcpy3D and copying only meaningful values, leaving out padding
  • using memcpy3D, copying only values, and not having any padding in the host array

Later in the text I call the first two methods “full copy” and the last two “smart copy”.
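
The script at the end of this post (see “Script used for testing”) selects between these variants with a --full_method option; assuming it is saved as copy_full.py (the name is only an example), the four methods can be run as:

python copy_full.py --domain 90 --direction both --full_method memcpy
python copy_full.py --domain 90 --direction both --full_method memcpy3D
python copy_full.py --domain 90 --direction both --full_method memcpy3Dvalues
python copy_full.py --domain 90 --direction both --full_method memcpy3Dnopadding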

Performance of copy on ION

ION copy performance chart

Time of copying entire 3D array on ION [ms]
Domain size | memcpy: GPU, CPU | memcpy3D: GPU, CPU | memcpy3D copying only values: GPU, CPU | memcpy3D without host padding: GPU, CPU
10 0.618 0.675 1.697 1.751 1.767 1.822 1.498 1.552
15 1.150 1.203 2.865 2.920 2.517 2.572 2.318 2.400
20 1.701 1.756 3.481 3.537 3.958 4.013 3.488 3.543
25 2.349 2.403 4.249 4.531 5.449 5.503 5.200 5.280
30 3.221 3.275 5.436 5.491 7.074 7.128 6.682 6.737
35 4.143 4.201 6.603 6.657 8.319 8.373 8.035 8.090
40 5.318 5.372 7.821 7.876 10.632 10.687 10.208 10.266
45 6.586 6.642 8.988 9.042 13.215 13.270 12.945 13.002
50 8.122 8.180 10.554 10.623 18.320 18.375 17.838 17.892
55 10.288 10.359 12.132 12.237 20.673 20.728 20.725 20.780
60 11.362 11.419 14.124 14.177 26.266 26.322 26.012 26.069
65 26.192 26.247 29.050 29.105 21.509 21.564 16.110 16.167
70 30.202 30.257 33.486 33.541 24.306 24.362 18.579 18.693
75 34.702 34.757 38.029 38.084 31.293 31.350 25.306 25.362
80 39.427 39.485 43.448 43.503 34.182 34.239 27.527 27.581
85 44.442 44.498 48.965 49.019 42.682 42.737 37.505 37.564
90 50.047 50.104 54.688 54.744 51.569 51.624 46.578 46.635

Up to a domain size of 60x60x60, copying everything (either with memcpy or memcpy3D) is the performance winner; raw memcpy offers the shortest time for copying the entire subdomain. But once the domain grows to 65x65x65, copying only the relevant data (smart copy) with memcpy3D clearly becomes the better choice over copying everything including the padding bytes. The last copy method (“smart” copy with padding on the device and no padding on the host) offers the best performance. In my opinion this is because the lack of padding on the host avoids wasting CPU cache on all those unused bytes. The Atom 330 has two cores, each with 512 kB of cache. A 65x65x65 domain of 32-bit floats is slightly larger than the 1 MB of available cache, and the lack of padding allows the host array to stay in cache longer – offering better performance than the smart copy with host padding. Of course there might be other reasons – I do not have access to the implementation details of memcpy3D, so I cannot offer insight into its behaviour.

The large drop in performance of the full copy methods when going from a 60x60x60 domain to 65x65x65 can also be explained by padding. As noted earlier, Sailfish pads each row so its size is divisible by 64. This means that for the 60x60x60 domain we are copying a 64x60x60 array, while for the 65x65x65 domain we are copying an array which is over two times larger – 128x65x65. But while this explains the large increase in full copy time, I still do not understand and cannot explain the drop in smart copy time: it drops between 60 and 65, and only reaches the time it took to copy the 60x60x60 domain again at 75x75x75.
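
For concreteness, these are the per-population element counts behind that comparison (a quick sketch):

print(64 * 60 * 60)    # 60^3 domain is stored as 64x60x60  -> 230400 values
print(128 * 65 * 65)   # 65^3 domain is stored as 128x65x65 -> 540800 values (~2.3x more)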

Performance of copy on GeForce 460

GeForce 460 copy performance chart

Time of copying entire 3D array on GeForce 460 [ms]
Domain size | memcpy: GPU, CPU | memcpy3D: GPU, CPU | memcpy3D copying only values: GPU, CPU | memcpy3D without host padding: GPU, CPU
10 0.410 0.430 0.572 0.590 0.353 0.371 0.319 0.336
15 0.843 0.862 1.076 1.095 0.616 0.635 0.549 0.567
20 1.322 1.340 1.470 1.490 1.086 1.106 0.981 0.999
25 1.932 1.951 2.179 2.197 1.535 1.553 1.389 1.408
30 2.686 2.705 2.852 2.870 2.197 2.215 2.008 2.027
35 3.517 3.535 3.652 3.671 3.373 3.391 3.134 3.153
40 4.493 4.512 4.654 4.673 4.154 4.172 3.899 3.918
45 5.650 5.669 5.818 5.835 5.986 6.004 5.845 5.864
50 6.901 6.919 6.932 6.951 8.397 8.415 8.041 8.061
55 8.366 8.385 8.349 8.368 9.956 9.975 9.921 9.939
60 9.889 9.908 10.086 10.104 13.088 13.106 12.906 12.925
65 22.850 22.869 22.891 22.911 9.190 9.208 7.863 7.882
70 26.467 26.484 26.421 26.440 10.511 10.529 9.654 9.673
75 30.310 30.329 30.287 30.306 13.352 13.371 11.851 11.869
80 34.683 34.703 34.836 34.856 15.128 15.146 14.577 14.595
85 38.842 38.862 38.915 38.934 20.079 20.097 18.685 18.704
90 43.827 43.848 43.711 43.731 25.259 25.277 24.924 24.944
95 48.314 48.335 48.120 48.141 28.258 28.278 27.766 27.786
100 53.858 53.878 53.582 53.603 34.990 35.010 34.504 34.523
105 59.075 59.095 58.923 58.942 43.180 43.201 43.113 43.133
110 64.690 64.711 64.635 64.655 47.070 47.088 46.632 46.651
115 71.115 71.135 70.978 70.998 56.152 56.171 56.918 56.939
120 77.243 77.264 77.542 77.564 61.187 61.207 60.831 60.851
125 83.958 83.980 83.105 83.127 72.124 72.146 73.920 73.941
130 136.309 136.334 135.974 135.997 50.504 50.525 45.929 45.949
135 146.751 146.777 145.515 145.538 54.448 54.468 50.977 50.998
140 158.134 158.177 156.634 156.657 61.193 61.213 57.211 57.233
145 168.521 168.547 167.841 167.867 71.299 71.321 70.364 70.385
150 180.432 180.457 179.934 179.960 76.502 76.523 74.500 74.522
155 192.819 192.845 191.539 191.566 89.529 89.551 90.365 90.388
160 205.790 205.817 205.412 205.438 94.870 94.893 94.428 94.450
165 218.756 218.784 217.885 217.911 111.780 111.803 112.584 112.607
170 231.733 231.761 232.367 232.396 128.750 128.775 129.171 129.195
175 246.083 246.111 245.708 245.737 136.877 136.901 139.651 139.676
180 260.130 260.158 259.842 259.870 158.555 158.580 157.252 157.277
185 274.028 274.058 274.744 274.772 181.361 181.385 185.707 185.735
190 291.114 291.146 288.064 288.092 191.480 191.505 192.982 193.009
195 406.808 406.842 403.830 403.863 148.540 148.564 135.534 135.559
200 424.325 424.367 425.941 425.976 156.040 156.066 142.504 142.528
205 450.420 450.457 447.015 447.050 171.625 171.651 159.442 159.468
210 473.551 473.588 466.182 466.218 186.861 186.888 177.049 177.075
215 496.014 496.052 490.306 490.343 195.255 195.283 189.415 189.443
220 517.620 517.658 516.187 516.225 221.179 221.209 213.509 213.566

For the GeForce 460 the situation is similar. There is again an increase in the time needed to perform the full copy after reaching 65x65x65, and a drop in the time needed to copy only the meaningful data for domains between 60x60x60 and 75x75x75. Because the GeForce has more memory, we can test copy times for larger domains, and we observe a similarly rapid change (an increase for the full copy, a drop for the “smart” copy) when passing a domain size of 128x128x128, and again when passing 192x192x192. I assume the same happens when the domain size reaches 256x256x256, 320x320x320, and so on, but I do not have the hardware to test it.

Summary

For larger domains there is a clear winner: use memcpy3D, copy only the meaningful data, and use different memory layouts on host and device. For smaller domains this solution takes longer than calling memcpy_dtoh or memcpy_htod, but for such small domains it does not make sense to run the simulation on the GPU at all – the overhead of copying data, running kernels, and copying results back to the host would make it a performance loss. Choosing memcpy3D also means that the glue code joining Sailfish and Palabos can be simpler; the same memcpy3D can be used to copy the envelope as well as entire subdomains.

Script used for testing

The script requires Python 2.7, as it uses the argparse module and the new syntax for creating sets.

#! /usr/bin/python

import sys
import math
import time
import argparse

import numpy

import pycuda
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray

import pycuda.autoinit

def copyPlane(copy, stream, srcX, dstX, srcY, dstY, srcZ, dstZ, width, height, depth):
    copy.src_x_in_bytes = srcX
    copy.dst_x_in_bytes = dstX
    copy.src_y = srcY
    copy.dst_y = dstY
    copy.src_z = srcZ
    copy.dst_z = dstZ
    copy.width_in_bytes = width
    copy.height = height
    copy.depth = depth
    if stream:
        copy(stream)
    else:
        copy()

parser = argparse.ArgumentParser(description="Test speed of memory copy")

parser.add_argument("-d", "--domain", dest="domainSize", type=int,
    default=18, help="Size of the domain to copy (default: 18)")
parser.add_argument("-t", "--block", dest="blockSize", type=int,
    default=64, help="Size of the block of threads to copy (default: 64)")
parser.add_argument("-b", "--basis", dest="basis", type=int,
    default=19, help="Size of the block of threads to copy (default: 19)")

parser.add_argument("--direction", dest="copyDirection",
    action="store", default='htod', choices=['htod', 'dtoh', 'both'],
    help="Copy direction (default: htod)")

parser.add_argument("--full_method", dest="fullCopyMethod",
    action="store", default='memcpy3D', choices=['memcpy', 'memcpy3D', 'memcpy3Dvalues', 'memcpy3Dnopadding'],
    help="Copy direction (default: memcpy3D)")

args = parser.parse_args()

stream = None

floatSize = 4
floatType = numpy.float32
strideX = int(math.ceil(float(args.domainSize)/args.blockSize))*args.blockSize*floatSize
strides = (args.domainSize*strideX, strideX, floatSize)
strideZ = args.domainSize*args.domainSize*strideX

gpudata = pycuda.driver.mem_alloc(strideZ*args.basis)

a3d = pycuda.gpuarray.GPUArray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType, strides=strides, gpudata=gpudata)
if args.fullCopyMethod == 'memcpy3Dnopadding':
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, args.domainSize),
        dtype=floatType)
else:
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, strideX/floatSize),
        dtype=floatType, strides=strides)
c3d = pycuda.driver.Memcpy3D()

startD = pycuda.driver.Event()
endD = pycuda.driver.Event()
startH = time.time()
endH = None

startD.record()
if args.fullCopyMethod == 'memcpy3Dnopadding':
    c3d.src_pitch = args.domainSize*floatSize
else:
    c3d.src_pitch = strideX
c3d.dst_pitch = strideX
c3d.src_height = args.domainSize
c3d.dst_height = args.domainSize

if args.fullCopyMethod == 'memcpy':
    if args.copyDirection in {'htod', 'both'}:
        pycuda.driver.memcpy_htod(a3d.gpudata, a3h)
    if args.copyDirection in {'dtoh', 'both'}:
        pycuda.driver.memcpy_dtoh(a3h, a3d.gpudata)
elif args.fullCopyMethod == 'memcpy3D':
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
elif args.fullCopyMethod in {'memcpy3Dvalues', 'memcpy3Dnopadding'}:
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        if args.fullCopyMethod == 'memcpy3Dnopadding':
            c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize, args.domainSize, args.domainSize*args.basis)

endD.record()
endD.synchronize()
endH = time.time()
print("{0:.3f} {1:.3f}".format(endD.time_since(startD), 1000*(endH-startH)))

I had originally intended to also present the performance of copying only the envelope, but this post is already getting long, so I decided to focus on copying full domains. I hope to post the results of further performance analysis later this week.