As promised in my previous post, I have performed some tests with different methods of copying data from Palabos to Sailfish. The lattice Boltzmann method is based on local interactions, so for the coupling to work correctly we need to transfer the incoming interactions (the outermost layer of our subdomain) before the simulation step and the outgoing interactions (the next-to-outermost layer of the subdomain) after it. This means that while the simulation is running it suffices to copy the envelope instead of the full subdomain: when copying from host to device we copy the outermost layer of the subdomain, and when copying from device to host we copy the penultimate layer.
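To make the envelope concrete, here is a minimal NumPy sketch (purely illustrative, not part of the benchmark script below) of the two layers of a single n×n×n population: the outermost walls that go host-to-device before a step, and the next-to-outermost walls that come back device-to-host afterwards.

import numpy

def envelope_layers(pop):
    # The six outermost walls: sent host-to-device before the simulation step.
    outermost = [pop[0], pop[-1],
                 pop[:, 0], pop[:, -1],
                 pop[:, :, 0], pop[:, :, -1]]
    # The six next-to-outermost walls: read back device-to-host after the step.
    inner = pop[1:-1, 1:-1, 1:-1]
    penultimate = [inner[0], inner[-1],
                   inner[:, 0], inner[:, -1],
                   inner[:, :, 0], inner[:, :, -1]]
    return outermost, penultimate

pop = numpy.zeros((16, 16, 16), dtype=numpy.float32)
outer_walls, penultimate_walls = envelope_layers(pop)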
The configuration is the same as in the previous tests: 64-bit Debian Sid, NVIDIA drivers 275.28, Python 2.7.2, running on:
- Asus eeePC 1201N with Atom 330 (2 cores with HT@1.6GHz), 2GB of RAM, NVIDIA ION (GeForce 9400M) with 256MB RAM
- desktop with Intel E5200 (2 cores@2.4GHz), 4GB of RAM, NVIDIA GeForce 460 with 1GB RAM
I have decided to perform three tests (a rough count of the memcpy3D calls each one issues is sketched after this list):
- full “smart” copy using memcpy3D, with a different memory layout on the host and on the device (the last case from the previous test)
- “smart” envelope copy, joining as many memcpy3D calls into one as possible
- “naive” envelope copy, copying the envelope in 6 steps (one for each cube wall), repeated 19 times to copy the entire set of populations
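As a rough sketch (assuming a D3Q19 lattice, i.e. 19 populations, and following the structure of the script at the end of this post), the number of memcpy3D calls issued per transfer direction differs considerably between the methods:

basis = 19                                   # number of populations (D3Q19)
full_copy_calls = 1                          # one memcpy3D for the whole subdomain
naive_envelope_calls = 6 * basis             # six cube walls per population = 114
smart_envelope_calls = (basis + 1) + 2 + 2   # joined XY walls, plus XZ and YZ walls = 24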
ION performance

Time of copying the envelope of a 3D array on ION [ms] (GPU: CUDA event timing, CPU: wall-clock timing)

Domain size | memcpy3D without host padding (GPU / CPU) | optimised envelope copy (GPU / CPU) | naive envelope copy (GPU / CPU)
12 | 2.374 / 2.433 | 8.376 / 8.433 | 14.665 / 14.723
16 | 2.309 / 2.365 | 10.290 / 10.346 | 15.611 / 15.666
20 | 3.528 / 3.585 | 12.485 / 12.541 | 17.207 / 17.263
24 | 4.518 / 4.803 | 15.105 / 15.163 | 19.531 / 19.587
28 | 6.327 / 6.383 | 17.952 / 18.008 | 21.886 / 21.944
32 | 5.112 / 5.170 | 19.886 / 19.944 | 24.365 / 24.422
36 | 6.369 / 6.426 | 22.022 / 22.078 | 27.347 / 27.405
40 | 9.123 / 9.182 | 24.303 / 24.359 | 30.540 / 30.597
44 | 13.329 / 13.388 | 26.447 / 26.504 | 33.828 / 33.884
48 | 9.067 / 9.128 | 28.570 / 28.628 | 36.744 / 36.802
52 | 12.995 / 13.051 | 31.232 / 31.289 | 39.863 / 39.921
56 | 18.545 / 18.602 | 33.703 / 33.761 | 44.114 / 44.170
60 | 26.061 / 26.119 | 37.189 / 37.246 | 48.258 / 48.314
64 | 15.670 / 15.727 | 39.954 / 40.010 | 51.328 / 51.385
68 | 24.649 / 24.705 | 43.390 / 43.447 | 56.469 / 56.524
72 | 33.577 / 33.635 | 47.897 / 47.955 | 61.682 / 61.740
76 | 44.124 / 44.181 | 51.850 / 51.909 | 66.904 / 66.962
80 | 27.927 / 27.983 | 55.985 / 56.041 | 71.405 / 71.462
84 | 42.142 / 42.199 | 60.319 / 60.418 | 78.399 / 78.456
88 | 55.455 / 55.512 | 65.800 / 65.856 | 86.667 / 86.723
92 | 70.149 / 70.207 | 71.429 / 71.486 | 92.220 / 92.276
The results for the ION (GeForce 9400M) are interesting. The smart envelope copy is faster than the naive one by 5 to 10 ms, which is probably caused by the overhead of the larger number of memcpy3D calls in the latter case. At the same time, both envelope copies are slower than copying the entire subdomain. Only for the 92x92x92 domain is the smart envelope copy as fast (or as slow) as the full copy, and for 96x96x96 the full copy would again win, since the previous post showed that it is fastest for domains whose size is divisible by 16. I am not sure what causes this slowness: the age of the card (the GeForce 9400M is a Tesla-architecture CUDA GPU), or the fact that it is an integrated GPU sharing RAM with the CPU. This might mean that for old devices it does not make sense to optimise copying: the envelope copy would only be faster for domains that do not fit into GPU memory anyway, rendering the entire optimisation pointless.
GeForce 460 performance

Time of copying the envelope of a 3D array on GeForce 460 [ms] (GPU: CUDA event timing, CPU: wall-clock timing)

Domain size | memcpy3D without host padding (GPU / CPU) | optimised envelope copy (GPU / CPU) | naive envelope copy (GPU / CPU)
12 | 0.643 / 0.663 | 1.324 / 1.343 | 2.008 / 2.027
16 | 0.543 / 0.561 | 1.758 / 1.778 | 2.369 / 2.426
20 | 1.320 / 1.339 | 2.343 / 2.361 | 2.883 / 2.902
24 | 1.913 / 1.933 | 2.718 / 2.747 | 3.453 / 3.472
28 | 3.349 / 3.367 | 3.342 / 3.361 | 4.138 / 4.157
32 | 1.868 / 1.886 | 3.843 / 3.863 | 4.641 / 4.661
36 | 4.096 / 4.134 | 4.406 / 4.426 | 5.517 / 5.537
40 | 6.112 / 6.131 | 4.979 / 4.999 | 6.410 / 6.428
44 | 9.014 / 9.034 | 5.819 / 5.839 | 7.447 / 7.466
48 | 4.933 / 4.953 | 6.461 / 6.481 | 8.113 / 8.132
52 | 10.080 / 10.100 | 7.357 / 7.378 | 9.491 / 9.510
56 | 13.882 / 13.903 | 8.344 / 8.364 | 10.725 / 10.745
60 | 18.616 / 18.637 | 9.479 / 9.497 | 12.230 / 12.248
64 | 11.484 / 11.504 | 10.670 / 10.688 | 13.144 / 13.163
68 | 19.885 / 19.905 | 11.643 / 11.663 | 14.868 / 14.888
72 | 26.244 / 26.264 | 12.857 / 12.876 | 16.550 / 16.570
76 | 33.263 / 33.283 | 14.496 / 14.517 | 18.471 / 18.491
80 | 22.082 / 22.102 | 16.018 / 16.037 | 19.735 / 19.754
84 | 34.263 / 34.283 | 17.330 / 17.351 | 22.013 / 22.033
88 | 43.567 / 43.586 | 19.222 / 19.242 | 24.199 / 24.249
92 | 53.220 / 53.241 | 21.446 / 21.466 | 26.398 / 26.417
96 | 37.322 / 37.342 | 23.568 / 23.589 | 26.989 / 27.009
100 | 54.475 / 54.495 | 25.014 / 25.034 | 28.639 / 28.657
104 | 67.171 / 67.198 | 27.209 / 27.230 | 30.250 / 30.270
108 | 80.673 / 80.695 | 29.748 / 29.769 | 32.992 / 33.011
112 | 58.523 / 58.544 | 32.149 / 32.285 | 34.782 / 34.803
116 | 80.705 / 80.727 | 34.213 / 34.234 | 37.963 / 37.983
120 | 96.921 / 96.945 | 36.714 / 36.735 | 40.828 / 40.848
124 | 114.409 / 114.432 | 39.748 / 39.769 | 44.335 / 44.357
128 | 88.675 / 88.698 | 42.330 / 42.351 | 46.419 / 46.442
132 | 114.462 / 114.487 | 44.497 / 44.518 | 49.184 / 49.205
136 | 133.911 / 133.935 | 47.742 / 47.763 | 51.318 / 51.340
140 | 155.449 / 155.473 | 51.191 / 51.214 | 55.047 / 55.070
144 | 123.982 / 124.006 | 54.835 / 54.857 | 58.005 / 58.027
148 | 155.113 / 155.140 | 58.077 / 58.097 | 62.523 / 62.547
152 | 180.289 / 180.317 | 61.871 / 61.893 | 67.255 / 67.280
156 | 206.152 / 206.179 | 66.315 / 66.337 | 72.369 / 72.392
160 | 170.891 / 170.917 | 69.996 / 70.019 | 74.349 / 74.372
164 | 204.927 / 204.956 | 74.031 / 74.054 | 78.243 / 78.267
168 | 234.912 / 234.940 | 78.354 / 78.379 | 83.180 / 83.204
172 | 267.724 / 267.754 | 83.445 / 83.469 | 88.934 / 88.957
176 | 226.915 / 226.943 | 88.477 / 88.501 | 94.031 / 94.085
180 | 265.012 / 265.042 | 93.267 / 93.291 | 99.729 / 99.754
184 | 299.342 / 299.372 | 99.016 / 99.043 | 103.424 / 103.449
188 | 338.534 / 338.567 | 104.384 / 104.410 | 109.178 / 109.204
192 | 296.850 / 296.881 | 109.775 / 109.800 | 120.001 / 120.026
196 | 339.310 / 339.342 | 115.620 / 115.675 | 121.675 / 121.699
200 | 377.733 / 377.766 | 121.387 / 121.413 | 129.160 / 129.186
204 | 422.621 / 422.656 | 128.546 / 128.573 | 134.787 / 134.813
208 | 375.804 / 375.837 | 134.658 / 134.684 | 138.811 / 138.867
212 | 426.756 / 426.792 | 141.159 / 141.185 | 147.007 / 147.032
216 | 469.584 / 469.620 | 148.531 / 148.557 | 154.895 / 154.921
220 | 522.136 / 522.174 | 166.426 / 166.454 | 165.330 / 165.363
For the GeForce 460 the smart envelope copy is faster than copying the full subdomain for domains larger than 64x64x64. The time needed to copy the full domain grows roughly like x³, while the time needed to copy the envelope follows the much flatter x². This is consistent with intuition: when copying the envelope the number of copied values grows as the square of the domain size, while for the full subdomain copy it grows as the cube. The smart envelope copy is again a little faster than the naive one.
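To put numbers on that intuition, a quick comparison of the amount of data moved per transfer (assuming 19 populations of single-precision floats and ignoring padding):

basis, float_size = 19, 4
for n in (64, 128, 220):
    full_mb = basis * n**3 * float_size / 1024.0**2
    envelope_mb = basis * (n**3 - (n - 2)**3) * float_size / 1024.0**2
    # The full copy grows like n^3, the envelope copy only like roughly 6*n^2.
    print("n={0:3d}  full: {1:7.1f} MB  envelope: {2:5.1f} MB".format(n, full_mb, envelope_mb))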
The most noticeable difference between the GeForce 460 and the ION is the long time it takes to copy the envelope on the latter. The ION is an integrated GPU, which means it uses the same memory as the CPU; that memory is probably not optimised for GPU access patterns, whereas the GeForce has separate memory optimised for GPU usage. Also, as reported by deviceQuery from the CUDA SDK:
- ION:
- Concurrent copy and execution: No with 0 copy engine(s)
- Integrated GPU sharing Host Memory: Yes
- Support host page-locked memory mapping: Yes
- GeForce 460:
- Concurrent copy and execution: Yes with 1 copy engine(s)
- Integrated GPU sharing Host Memory: No
- Support host page-locked memory mapping: Yes
The GeForce has a dedicated copy engine that speeds up copy operations (and lets them overlap with kernel execution), while the ION has no such engine.
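The same capabilities can also be queried directly from Python; here is a small sketch using PyCUDA device attributes (assuming a PyCUDA build that exposes GPU_OVERLAP, INTEGRATED and CAN_MAP_HOST_MEMORY, which correspond to the standard CUDA device attributes):

import pycuda.driver as cuda
import pycuda.autoinit

dev = pycuda.autoinit.device
attr = cuda.device_attribute
print("Device: {0}".format(dev.name()))
print("Concurrent copy and execution: {0}".format(bool(dev.get_attribute(attr.GPU_OVERLAP))))
print("Integrated GPU sharing Host Memory: {0}".format(bool(dev.get_attribute(attr.INTEGRATED))))
print("Support host page-locked memory mapping: {0}".format(bool(dev.get_attribute(attr.CAN_MAP_HOST_MEMORY))))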
In summary, computations performed on the new generation of CUDA cards based on Fermi chips can benefit from optimising data transfer. Older devices have a more limited amount of memory and copying data to them takes longer, so the time needed to create sophisticated data-transfer schemes is better spent on other parts of the program.
Testing script
#! /usr/bin/python

import sys
import math
import time
import argparse

import numpy

import pycuda
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray
import pycuda.autoinit


# Configure and launch a single Memcpy3D transfer of one rectangular slab.
def copyPlane(copy, stream, srcX, dstX, srcY, dstY, srcZ, dstZ, width, height, depth):
    copy.src_x_in_bytes = srcX
    copy.dst_x_in_bytes = dstX
    copy.src_y = srcY
    copy.dst_y = dstY
    copy.src_z = srcZ
    copy.dst_z = dstZ
    copy.width_in_bytes = width
    copy.height = height
    copy.depth = depth
    if stream:
        copy(stream)
    else:
        copy()


parser = argparse.ArgumentParser(description="Test speed of memory copy")
parser.add_argument("-d", "--domain", dest="domainSize", type=int,
                    default=18, help="Size of the domain to copy (default: 18)")
parser.add_argument("-t", "--block", dest="blockSize", type=int,
                    default=64, help="Size of the block of threads to copy (default: 64)")
parser.add_argument("-b", "--basis", dest="basis", type=int,
                    default=19, help="Number of populations to copy (default: 19)")
parser.add_argument("--direction", dest="copyDirection",
                    action="store", default='htod', choices=['htod', 'dtoh', 'both'],
                    help="Copy direction (default: htod)")
parser.add_argument("--envelope_method", dest="envelopeCopyMethod",
                    action="store", default='everything', choices=['naive', 'smart'],
                    help="Envelope copy method (default: naive)")
args = parser.parse_args()

stream = None
floatSize = 4
floatType = numpy.float32

# Device rows are padded to a multiple of the thread block size; the host array is dense.
strideX = int(math.ceil(float(args.domainSize)/args.blockSize))*args.blockSize*floatSize
strides = (args.domainSize*strideX, strideX, floatSize)
strideZ = args.domainSize*args.domainSize*strideX

gpudata = pycuda.driver.mem_alloc(strideZ*args.basis)
a3d = pycuda.gpuarray.GPUArray((args.basis*args.domainSize, args.domainSize, args.domainSize),
                               dtype=floatType, strides=strides, gpudata=gpudata)
a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, args.domainSize),
                    dtype=floatType)

c3d = pycuda.driver.Memcpy3D()

startD = pycuda.driver.Event()
endD = pycuda.driver.Event()
startH = time.time()
endH = None
startD.record()

c3d.src_pitch = args.domainSize*floatSize
c3d.dst_pitch = strideX
c3d.src_height = args.domainSize
c3d.dst_height = args.domainSize

# Note: with the default --envelope_method value ('everything') neither branch below is taken.
if args.envelopeCopyMethod == 'smart':
    # XY walls of adjacent populations are joined into single depth-2 copies;
    # the XZ and YZ walls span all populations in one call each.
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        # XY
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, 0, 0,
                  (args.domainSize-2)*floatSize, args.domainSize-2, 1)
        for i in range(1, args.basis):
            copyPlane(c3d, stream, floatSize, floatSize, 1, 1, i*args.domainSize-1, i*args.domainSize-1,
                      (args.domainSize-2)*floatSize, args.domainSize-2, 2)
        copyPlane(c3d, stream, floatSize, floatSize, 1, 1, args.domainSize*args.basis-1, args.domainSize*args.basis-1,
                  (args.domainSize-2)*floatSize, args.domainSize-2, 1)
        # XZ
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                  args.domainSize*floatSize, 1, args.domainSize*args.basis)
        copyPlane(c3d, stream, 0, 0, args.domainSize-1, args.domainSize-1, 0, 0,
                  args.domainSize*floatSize, 1, args.domainSize*args.basis)
        # YZ
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                  floatSize, args.domainSize, args.domainSize*args.basis)
        copyPlane(c3d, stream, (args.domainSize-1)*floatSize, (args.domainSize-1)*floatSize, 0, 0, 0, 0,
                  floatSize, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        # XY
        copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, 0, 0,
                  (args.domainSize-4)*floatSize, args.domainSize-4, 1)
        for i in range(1, args.basis):
            copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, i*args.domainSize-1, i*args.domainSize-1,
                      (args.domainSize-4)*floatSize, args.domainSize-4, 2)
        copyPlane(c3d, stream, floatSize*2, floatSize*2, 2, 2, args.domainSize*args.basis-1, args.domainSize*args.basis-1,
                  (args.domainSize-4)*floatSize, args.domainSize-4, 1)
        # XZ
        copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                  (args.domainSize-2)*floatSize, 1, args.domainSize*args.basis-2)
        copyPlane(c3d, stream, 1, 1, args.domainSize-2, args.domainSize-2, 1, 1,
                  (args.domainSize-2)*floatSize, 1, args.domainSize*args.basis-2)
        # YZ
        copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                  floatSize, args.domainSize-2, args.domainSize*args.basis-2)
        copyPlane(c3d, stream, (args.domainSize-2)*floatSize, (args.domainSize-2)*floatSize, 1, 1, 1, 1,
                  floatSize, args.domainSize-2, args.domainSize*args.basis-2)
elif args.envelopeCopyMethod == 'naive':
    # Six separate wall copies per population, repeated for all populations.
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        for i in range(args.basis):
            c3d.set_src_host(a3h[i*args.domainSize:(i+1)*args.domainSize, :, :])
            c3d.set_dst_device(int(a3d.gpudata)+i*args.domainSize*args.domainSize*args.domainSize)
            # XY
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                      args.domainSize*floatSize, args.domainSize, 1)
            copyPlane(c3d, stream, 0, 0, 0, 0, args.domainSize-1, args.domainSize-1,
                      args.domainSize*floatSize, args.domainSize, 1)
            # XZ
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                      args.domainSize*floatSize, 1, args.domainSize)
            copyPlane(c3d, stream, 0, 0, args.domainSize-1, args.domainSize-1, 0, 0,
                      args.domainSize*floatSize, 1, args.domainSize)
            # YZ
            copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
                      floatSize, args.domainSize, args.domainSize)
            copyPlane(c3d, stream, (args.domainSize-1)*floatSize, (args.domainSize-1)*floatSize,
                      0, 0, 0, 0, floatSize, args.domainSize, args.domainSize)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        for i in range(args.basis):
            c3d.set_src_device(int(a3d.gpudata)+i*args.domainSize*args.domainSize*args.domainSize)
            c3d.set_dst_host(a3h[i*args.domainSize:(i+1)*args.domainSize, :, :])
            # XY
            copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                      (args.domainSize-2)*floatSize, args.domainSize, 1)
            copyPlane(c3d, stream, 1, 1, 1, 1, args.domainSize-2, args.domainSize-2,
                      (args.domainSize-2)*floatSize, args.domainSize-2, 1)
            # XZ
            copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                      (args.domainSize-2)*floatSize, 1, args.domainSize-2)
            copyPlane(c3d, stream, 1, 1, args.domainSize-2, args.domainSize-2, 1, 1,
                      (args.domainSize-2)*floatSize, 1, args.domainSize-2)
            # YZ
            copyPlane(c3d, stream, 1, 1, 1, 1, 1, 1,
                      floatSize, args.domainSize-2, args.domainSize-2)
            copyPlane(c3d, stream, (args.domainSize-2)*floatSize, (args.domainSize-2)*floatSize,
                      1, 1, 1, 1, floatSize, args.domainSize-2, args.domainSize-2)

endD.record()
endD.synchronize()
endH = time.time()

# GPU time from CUDA events and CPU (wall-clock) time, both in milliseconds.
print("{0:.3f} {1:.3f}".format(endD.time_since(startD), 1000*(endH-startH)))