Performance of copying only envelope

As promised in my previous post, I have run some tests with different methods of copying data from Palabos to Sailfish. The lattice Boltzmann method is based on local interactions, so for the simulation to work correctly we need to transfer the incoming data (the outermost layer of our subdomain) before the simulation step, and the outgoing data (the next-to-outermost layer of the subdomain) after the simulation step. This means that while the simulation is running it suffices to copy the envelope instead of the full subdomain: when copying from host to device we need to copy the outermost layer of the subdomain, and when copying from device to host we need to copy the penultimate layer.
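To make the two layers concrete, here is a minimal pure-Python sketch. It is not taken from Palabos or Sailfish; the function name `layer_nodes` and its `depth` parameter are hypothetical. Depth 0 selects the outermost layer (copied host to device) and depth 1 the penultimate layer (copied device to host) of a cubic subdomain.

```python
from itertools import product


def layer_nodes(n, depth):
    """Return the nodes lying exactly `depth` nodes below the surface of an n x n x n cube."""
    lo, hi = depth, n - 1 - depth
    nodes = []
    for x, y, z in product(range(lo, hi + 1), repeat=3):
        # A node is in the layer if at least one coordinate sits on the
        # bounding box of the cube shrunk by `depth` nodes on every side.
        if lo in (x, y, z) or hi in (x, y, z):
            nodes.append((x, y, z))
    return nodes


if __name__ == "__main__":
    n = 6
    print("outermost layer:  ", len(layer_nodes(n, 0)), "nodes")  # copied before the step
    print("penultimate layer:", len(layer_nodes(n, 1)), "nodes")  # copied after the step
```

For a 6x6x6 subdomain the outermost layer is the cube minus its 4x4x4 interior, and the penultimate layer is that interior minus its own 2x2x2 core.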

The configuration is the same as in previous tests: 64-bit Debian Sid, NVIDIA drivers 275.28, Python 2.7.2 on:

  • Asus eeePC 1201N with Atom 330 (2 cores with HT@1.6GHz), 2GB of RAM, NVIDIA ION (GeForce 9400M) with 256MB RAM
  • desktop with Intel E5200 (2 cores@2.4GHz), 4GB of RAM, NVIDIA GeForce 460 with 1GB RAM

I have decided to perform 3 tests:

  • full “smart” copy using memcpy3D, with a different memory layout on host and device (the last case from the previous test)
  • “smart” envelope copy, joining as many memcpy3D calls into one as possible
  • “naive” envelope copy, copying the envelope in 6 steps (one for each cube wall), repeated 19 times to copy all the populations
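The difference between the two envelope variants is mostly the number of memcpy3D calls. The sketch below counts them for one copy direction, following the structure of the testing script at the end of this post, where the populations are stacked along the z axis. The function names are hypothetical and the code only does arithmetic; it does not touch the GPU.

```python
def naive_calls(basis):
    """Naive variant: six wall copies for every population."""
    return 6 * basis


def smart_calls(basis):
    """Joined variant, counted as in the testing script.

    XY walls: one call for the very first plane, one for the very last plane,
    and one depth-2 call for each of the (basis - 1) boundaries between
    consecutive populations. XZ and YZ walls: two calls each, every call
    spanning all populations at once.
    """
    xy = 2 + (basis - 1)
    xz = 2
    yz = 2
    return xy + xz + yz


if __name__ == "__main__":
    basis = 19  # number of populations, the script's default
    print("naive:", naive_calls(basis), "calls")
    print("smart:", smart_calls(basis), "calls")
```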

ION performance

ION copy performance chart

Time of copying envelope of 3D array on ION [ms]
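Columns: domain size, then GPU and CPU time for each method in turn (memcpy3D without host padding, optimised envelope copy, naive envelope copy)
Size | full GPU, full CPU | optimised GPU, optimised CPU | naive GPU, naive CPU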
12 2.374 2.433 8.376 8.433 14.665 14.723
16 2.309 2.365 10.290 10.346 15.611 15.666
20 3.528 3.585 12.485 12.541 17.207 17.263
24 4.518 4.803 15.105 15.163 19.531 19.587
28 6.327 6.383 17.952 18.008 21.886 21.944
32 5.112 5.170 19.886 19.944 24.365 24.422
36 6.369 6.426 22.022 22.078 27.347 27.405
40 9.123 9.182 24.303 24.359 30.540 30.597
44 13.329 13.388 26.447 26.504 33.828 33.884
48 9.067 9.128 28.570 28.628 36.744 36.802
52 12.995 13.051 31.232 31.289 39.863 39.921
56 18.545 18.602 33.703 33.761 44.114 44.170
60 26.061 26.119 37.189 37.246 48.258 48.314
64 15.670 15.727 39.954 40.010 51.328 51.385
68 24.649 24.705 43.390 43.447 56.469 56.524
72 33.577 33.635 47.897 47.955 61.682 61.740
76 44.124 44.181 51.850 51.909 66.904 66.962
80 27.927 27.983 55.985 56.041 71.405 71.462
84 42.142 42.199 60.319 60.418 78.399 78.456
88 55.455 55.512 65.800 65.856 86.667 86.723
92 70.149 70.207 71.429 71.486 92.220 92.276

Results for the ION (GeForce 9400M) are interesting. Smart envelope copy is faster than naive by roughly 5 to 20 ms, which is probably caused by the overhead of calling memcpy3D more times in the latter case. At the same time, the envelope copy is slower than copying the entire subdomain. Only for the 92x92x92 domain is smart envelope copy as fast as (or as slow as) the full copy, but for 96x96x96 it will again be slower, because the full copy is fastest for domains whose size is divisible by 16 (see the previous post). I am not sure what causes this slowness: the age of the card (the GeForce 9400M is a Tesla-based CUDA GPU), or the fact that it is an integrated GPU which shares RAM with the CPU. This result might mean that for old devices it does not make sense to optimise copying: envelope copy would become faster only for domains that do not fit into GPU memory, which renders the entire optimisation pointless.

GeForce 460 performance

GeForce 460 copy performance chart

Time of copying envelope of 3D array on GeForce 460 [ms]
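Columns: domain size, then GPU and CPU time for each method in turn (memcpy3D without host padding, optimised envelope copy, naive envelope copy)
Size | full GPU, full CPU | optimised GPU, optimised CPU | naive GPU, naive CPU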
12 0.643 0.663 1.324 1.343 2.008 2.027
16 0.543 0.561 1.758 1.778 2.369 2.426
20 1.320 1.339 2.343 2.361 2.883 2.902
24 1.913 1.933 2.718 2.747 3.453 3.472
28 3.349 3.367 3.342 3.361 4.138 4.157
32 1.868 1.886 3.843 3.863 4.641 4.661
36 4.096 4.134 4.406 4.426 5.517 5.537
40 6.112 6.131 4.979 4.999 6.410 6.428
44 9.014 9.034 5.819 5.839 7.447 7.466
48 4.933 4.953 6.461 6.481 8.113 8.132
52 10.080 10.100 7.357 7.378 9.491 9.510
56 13.882 13.903 8.344 8.364 10.725 10.745
60 18.616 18.637 9.479 9.497 12.230 12.248
64 11.484 11.504 10.670 10.688 13.144 13.163
68 19.885 19.905 11.643 11.663 14.868 14.888
72 26.244 26.264 12.857 12.876 16.550 16.570
76 33.263 33.283 14.496 14.517 18.471 18.491
80 22.082 22.102 16.018 16.037 19.735 19.754
84 34.263 34.283 17.330 17.351 22.013 22.033
88 43.567 43.586 19.222 19.242 24.199 24.249
92 53.220 53.241 21.446 21.466 26.398 26.417
96 37.322 37.342 23.568 23.589 26.989 27.009
100 54.475 54.495 25.014 25.034 28.639 28.657
104 67.171 67.198 27.209 27.230 30.250 30.270
108 80.673 80.695 29.748 29.769 32.992 33.011
112 58.523 58.544 32.149 32.285 34.782 34.803
116 80.705 80.727 34.213 34.234 37.963 37.983
120 96.921 96.945 36.714 36.735 40.828 40.848
124 114.409 114.432 39.748 39.769 44.335 44.357
128 88.675 88.698 42.330 42.351 46.419 46.442
132 114.462 114.487 44.497 44.518 49.184 49.205
136 133.911 133.935 47.742 47.763 51.318 51.340
140 155.449 155.473 51.191 51.214 55.047 55.070
144 123.982 124.006 54.835 54.857 58.005 58.027
148 155.113 155.140 58.077 58.097 62.523 62.547
152 180.289 180.317 61.871 61.893 67.255 67.280
156 206.152 206.179 66.315 66.337 72.369 72.392
160 170.891 170.917 69.996 70.019 74.349 74.372
164 204.927 204.956 74.031 74.054 78.243 78.267
168 234.912 234.940 78.354 78.379 83.180 83.204
172 267.724 267.754 83.445 83.469 88.934 88.957
176 226.915 226.943 88.477 88.501 94.031 94.085
180 265.012 265.042 93.267 93.291 99.729 99.754
184 299.342 299.372 99.016 99.043 103.424 103.449
188 338.534 338.567 104.384 104.410 109.178 109.204
192 296.850 296.881 109.775 109.800 120.001 120.026
196 339.310 339.342 115.620 115.675 121.675 121.699
200 377.733 377.766 121.387 121.413 129.160 129.186
204 422.621 422.656 128.546 128.573 134.787 134.813
208 375.804 375.837 134.658 134.684 138.811 138.867
212 426.756 426.792 141.159 141.185 147.007 147.032
216 469.584 469.620 148.531 148.557 154.895 154.921
220 522.136 522.174 166.426 166.454 165.330 165.363

For the GeForce 460, smart envelope copy is faster than copying the full subdomain for every tested domain of 64x64x64 or larger. The chart of the time needed to copy the full domain resembles the function x³, while the chart of the time needed to copy the envelope resembles a flattened x². This agrees with our intuition: when copying the envelope, the number of copied values grows as the square of the domain size, and for the full subdomain copy it grows as the cube of the domain size. Smart envelope copy is again a little faster than naive envelope copy (except for the largest domain, 220, where the two are practically equal).
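The square-versus-cube argument can be checked with a few lines of arithmetic. This is a small sketch with hypothetical function names; it counts values only, assuming a one-node-thick envelope and 19 populations, and says nothing about actual transfer times.

```python
def full_values(n, basis=19):
    """Number of values in a full copy of an n x n x n subdomain."""
    return basis * n ** 3


def envelope_values(n, basis=19):
    """Number of values in a one-node-thick envelope of the subdomain."""
    if n < 2:
        return full_values(n, basis)  # the whole domain is its own envelope
    # Cube minus its interior: n^3 - (n - 2)^3 = 6n^2 - 12n + 8
    return basis * (n ** 3 - (n - 2) ** 3)


if __name__ == "__main__":
    for n in (16, 64, 128, 220):
        full = full_values(n)
        env = envelope_values(n)
        print("{0:4d} {1:12d} {2:10d} {3:6.1f}%".format(n, full, env, 100.0 * env / full))
```

The last column shows the share of the domain that the envelope represents; it shrinks roughly like 6/n as the domain grows.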

The most noticeable difference between the GeForce 460 and the ION is how long it takes to copy the envelope on the latter. The ION is an integrated GPU, which means that it uses the same memory as the CPU; that memory is probably not optimised for GPU needs. The GeForce has separate memory which is optimised for GPU usage. Also, as reported by deviceQuery from the CUDA SDK:

  • ION:
    • Concurrent copy and execution: No with 0 copy engine(s)
    • Integrated GPU sharing Host Memory: Yes
    • Support host page-locked memory mapping: Yes
  • GeForce 460:
    • Concurrent copy and execution: Yes with 1 copy engine(s)
    • Integrated GPU sharing Host Memory: No
    • Support host page-locked memory mapping: Yes

The GeForce 460 has a dedicated copy engine for speeding up copy operations, while the ION does not have one.

In summary, computations performed on new-generation CUDA cards, based on Fermi chips, can benefit from optimising data transfer. Old devices have less memory available and it takes longer to copy data to them, so the time needed to create sophisticated data transfer schemes is better spent on other parts of the program.
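Since the argument depends on which domains fit into device memory, here is a rough way to estimate the device allocation. It is a sketch that reuses the row-padding formula from the testing script below (rows padded to a multiple of the block size); the function name is hypothetical, and it ignores whatever else the driver or the simulation allocates.

```python
import math


def padded_footprint_bytes(n, basis=19, block=64, float_size=4):
    """Bytes allocated on the device for `basis` populations of an n^3 domain.

    Each row of n floats is padded to a multiple of `block` elements, the same
    way strideX is computed in the testing script.
    """
    row_bytes = int(math.ceil(float(n) / block)) * block * float_size
    return basis * n * n * row_bytes


if __name__ == "__main__":
    for n in (92, 220):  # the largest domains in the two tables above
        mib = padded_footprint_bytes(n) / 2.0 ** 20
        print("{0:4d} {1:8.1f} MiB".format(n, mib))
```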

Testing script

#! /usr/bin/python

import sys
import math
import time
import argparse

import numpy

import pycuda
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray

import pycuda.autoinit

def copyPlane(copy, stream, srcX, dstX, srcY, dstY, srcZ, dstZ, width, height, depth):
    copy.src_x_in_bytes = srcX
    copy.dst_x_in_bytes = dstX
    copy.src_y = srcY
    copy.dst_y = dstY
    copy.src_z = srcZ
    copy.dst_z = dstZ
    copy.width_in_bytes = width
    copy.height = height
    copy.depth = depth
    if stream:
        copy(stream)
    else:
        copy()

parser = argparse.ArgumentParser(description="Test speed of memory copy")

parser.add_argument("-d", "--domain", dest="domainSize", type=int,
    default=18, help="Size of the domain to copy (default: 18)")
parser.add_argument("-t", "--block", dest="blockSize", type=int,
    default=64, help="Size of the block of threads to copy (default: 64)")
parser.add_argument("-b", "--basis", dest="basis", type=int,
    default=19, help="Number of populations in the lattice basis (default: 19)")

parser.add_argument("--direction", dest="copyDirection",
    action="store", default='htod', choices=['htod', 'dtoh', 'both'],
    help="Copy direction (default: htod)")

# The default must be one of the choices, otherwise nothing is copied below
parser.add_argument("--envelope_method", dest="envelopeCopyMethod",
    action="store", default='naive', choices=['naive', 'smart'],
    help="Envelope copy method (default: naive)")

args = parser.parse_args()

stream = None

floatSize = 4
floatType = numpy.float32
strideX = int(math.ceil(float(args.domainSize)/args.blockSize))*args.blockSize*floatSize
strides = (args.domainSize*strideX, strideX, floatSize)
strideZ = args.domainSize*args.domainSize*strideX

gpudata = pycuda.driver.mem_alloc(strideZ*args.basis)

a3d = pycuda.gpuarray.GPUArray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType, strides=strides, gpudata=gpudata)
a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType)
c3d = pycuda.driver.Memcpy3D()

startD = pycuda.driver.Event()
endD = pycuda.driver.Event()
startH = time.time()
endH = None

startD.record()
c3d.src_pitch = args.domainSize*floatSize
c3d.dst_pitch = strideX
c3d.src_height = args.domainSize
c3d.dst_height = args.domainSize

if args.envelopeCopyMethod == 'smart':
    if args.copyDirection in {'htod', 'both'}:
        # Host to device: outermost layer of the subdomain
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        # XY walls (interior only; the edges are covered by the XZ and YZ copies).
        # The last plane of one population and the first plane of the next one
        # are adjacent along z, so each such pair is joined into one depth-2 copy.
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 0, 0,
            (args.domainSize-2)*floatSize, args.domainSize-2, 1)
        for i in range(1, args.basis):
            copyPlane(c3d, stream, floatSize, floatSize, 1, 1, i*args.domainSize-1, i*args.domainSize-1,
                (args.domainSize-2)*floatSize, args.domainSize-2, 2)
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, args.domainSize*args.basis-1, args.domainSize*args.basis-1,
            (args.domainSize-2)*floatSize, args.domainSize-2, 1)
        # XZ walls, each call spanning all populations
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize*floatSize, 1, args.domainSize*args.basis)
        copyPlane(c3d, stream, 0, 0, args.domainSize-1, args.domainSize-1, 0, 0,
            args.domainSize*floatSize, 1, args.domainSize*args.basis)
        # YZ walls, each call spanning all populations
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            floatSize, args.domainSize, args.domainSize*args.basis)
        copyPlane(c3d, stream, (args.domainSize-1)*floatSize, (args.domainSize-1)*floatSize, 0, 0, 0, 0,
            floatSize, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        # Device to host: penultimate layer of the subdomain
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        # XY walls; the z offsets are the same as in the htod case, only the
        # copied rectangle is one node smaller on every side
        copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, 0, 0,
            (args.domainSize-4)*floatSize, args.domainSize-4, 1)
        for i in range(1, args.basis):
            copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, i*args.domainSize-1, i*args.domainSize-1,
                (args.domainSize-4)*floatSize, args.domainSize-4, 2)
        copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, args.domainSize*args.basis-1, args.domainSize*args.basis-1,
            (args.domainSize-4)*floatSize, args.domainSize-4, 1)
        # XZ walls (x offsets are in bytes, hence floatSize rather than 1)
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 1, 1,
            (args.domainSize-2)*floatSize, 1, args.domainSize*args.basis-2)
        copyPlane(c3d, stream, floatSize, floatSize, args.domainSize-2, args.domainSize-2, 1, 1,
            (args.domainSize-2)*floatSize, 1, args.domainSize*args.basis-2)
        # YZ walls
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 1, 1,
            floatSize, args.domainSize-2, args.domainSize*args.basis-2)
        copyPlane(c3d, stream, (args.domainSize-2)*floatSize, (args.domainSize-2)*floatSize, 1, 1, 1, 1,
            floatSize, args.domainSize-2, args.domainSize*args.basis-2)
elif args.envelopeCopyMethod == 'naive':
    if args.copyDirection in {'htod', 'both'}:
        # Host to device: six walls of the outermost layer, one population at a time
        for i in range(args.basis):
            c3d.set_src_host(a3h[i*args.domainSize:(i+1)*args.domainSize, :, :])
            # Byte offset of population i in the padded device array
            c3d.set_dst_device(int(a3d.gpudata)+i*strideZ)
            # XY walls
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                args.domainSize*floatSize, args.domainSize, 1)
            copyPlane(c3d, stream, 0, 0, 0, 0, args.domainSize-1, args.domainSize-1,
                args.domainSize*floatSize, args.domainSize, 1)
            # XZ walls
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                args.domainSize*floatSize, 1, args.domainSize)
            copyPlane(c3d, stream, 0, 0, args.domainSize-1, args.domainSize-1, 0, 0,
                args.domainSize*floatSize, 1, args.domainSize)
            # YZ walls
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                floatSize, args.domainSize, args.domainSize)
            copyPlane(c3d, stream, (args.domainSize-1)*floatSize, (args.domainSize-1)*floatSize,
                0, 0, 0, 0, floatSize, args.domainSize, args.domainSize)
    if args.copyDirection in {'dtoh', 'both'}:
        # Device to host: six walls of the penultimate layer, one population at a time
        c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        for i in range(args.basis):
            c3d.set_src_device(int(a3d.gpudata)+i*strideZ)
            c3d.set_dst_host(a3h[i*args.domainSize:(i+1)*args.domainSize, :, :])
            # XY walls (x offsets are in bytes, hence floatSize rather than 1)
            copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 1, 1,
                (args.domainSize-2)*floatSize, args.domainSize-2, 1)
            copyPlane(c3d, stream, floatSize, floatSize, 1, 1, args.domainSize-2, args.domainSize-2,
                (args.domainSize-2)*floatSize, args.domainSize-2, 1)
            # XZ walls
            copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 1, 1,
                (args.domainSize-2)*floatSize, 1, args.domainSize-2)
            copyPlane(c3d, stream, floatSize, floatSize, args.domainSize-2, args.domainSize-2, 1, 1,
                (args.domainSize-2)*floatSize, 1, args.domainSize-2)
            # YZ walls
            copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 1, 1,
                floatSize, args.domainSize-2, args.domainSize-2)
            copyPlane(c3d, stream, (args.domainSize-2)*floatSize, (args.domainSize-2)*floatSize,
                1, 1, 1, 1, floatSize, args.domainSize-2, args.domainSize-2)

endD.record()
endD.synchronize()
endH = time.time()
print("{0:.3f} {1:.3f}".format(endD.time_since(startD), 1000*(endH-startH)))