Errata for CUDA Memory Copy Performance

Unfortunately, yesterday’s post contains a grave error. When copying data, Memcpy3D expects the width of each row in bytes, not in elements. While optimising the smart copy I took this into account in some of the Memcpy3D calls, but not in the others. Below are the corrected results and their analysis.
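To make the error concrete, here is a minimal illustration (plain NumPy, no GPU needed) of the two values; the `width_in_bytes` field of a `pycuda.driver.Memcpy3D` descriptor expects the second one:

```python
import numpy as np

domain_size = 60
dtype = np.float32

# Wrong: the number of elements in one row of the subdomain
width_in_elements = domain_size                          # 60

# Right: Memcpy3D's width_in_bytes field expects bytes
width_in_bytes = domain_size * np.dtype(dtype).itemsize  # 60 * 4 = 240

print(width_in_elements, width_in_bytes)
```

Passing the element count silently copies only a quarter of each float32 row, which is exactly the kind of mistake the corrected results below fix.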

Performance of copy on ION

ION copy performance chart

Time of copying entire 3D array on ION [ms]
Domain size | memcpy (GPU / CPU) | memcpy3D (GPU / CPU) | memcpy3D, values only (GPU / CPU) | memcpy3D, no host padding (GPU / CPU)
10 0.612 0.934 1.743 1.802 1.767 1.825 1.694 1.753
15 1.154 1.212 2.704 2.763 3.353 3.412 3.220 3.279
20 1.654 1.711 3.479 3.536 3.654 3.940 3.458 3.515
25 2.525 2.584 4.407 4.463 5.376 5.434 5.156 5.266
30 3.176 3.236 5.465 5.522 8.295 8.353 7.742 7.801
35 4.322 4.379 6.576 6.632 7.015 7.072 6.455 6.514
40 5.474 5.532 7.929 7.986 10.001 10.059 9.228 9.287
45 6.591 6.652 9.038 9.102 15.741 15.797 15.294 15.353
50 7.995 8.053 10.530 10.588 11.585 11.642 10.990 11.047
55 9.691 9.750 12.162 12.219 19.234 19.292 19.580 19.637
60 11.440 11.497 14.112 14.168 26.663 26.721 25.798 25.855
65 26.175 26.234 29.251 29.309 26.349 26.406 21.239 21.297
70 30.518 30.576 33.738 33.796 36.621 36.681 30.729 30.788
75 35.036 35.095 38.358 38.417 51.366 51.424 46.280 46.338
80 39.420 39.479 43.354 43.410 33.808 33.868 28.315 28.375
85 44.819 44.880 49.000 49.059 55.915 55.976 50.566 50.626
90 50.324 50.383 54.530 54.589 72.578 72.637 66.190 66.249

Because of the observed changes in the smart-copy times for memcpy3D, I decided to also test domains whose sizes are divisible by 4. Below are the chart and table with the results of those tests. I had not tested power-of-two domain sizes before, because in Sailfish (CUDA) the domain size is controlled by Palabos, and I was more interested in trends than in exact values. However, if there is a large performance difference between a 60x60x60 and a 64x64x64 domain, it might make sense to allow only some of the possible domain sizes to be computed on the GPU.

ION copy performance chart

Time of copying entire 3D array on ION [ms]
Domain size | memcpy (GPU / CPU) | memcpy3D (GPU / CPU) | memcpy3D, values only (GPU / CPU) | memcpy3D, no host padding (GPU / CPU)
12 0.812 0.868 2.087 2.146 2.309 2.367 2.173 2.229
16 1.289 1.346 2.893 2.948 2.459 2.515 2.214 2.271
20 1.639 1.695 3.475 3.531 4.032 4.092 3.471 3.531
24 2.196 2.254 4.222 4.282 4.930 4.988 4.600 4.659
28 2.759 2.816 4.966 5.023 6.712 6.768 6.095 6.380
32 3.734 3.792 5.948 6.006 5.201 5.259 4.982 5.042
36 4.439 4.497 6.798 6.855 7.036 7.093 6.321 6.378
40 5.325 5.385 7.778 7.835 9.802 9.861 9.171 9.228
44 6.394 6.452 8.703 8.760 13.877 13.933 13.034 13.092
48 7.424 7.484 9.931 9.987 8.947 9.003 8.949 9.008
52 8.641 8.701 11.226 11.286 13.853 13.909 13.028 13.086
56 10.020 10.079 12.543 12.602 19.571 19.627 18.742 18.799
60 11.416 11.499 14.131 14.188 26.724 26.783 25.784 25.840
64 12.992 13.050 15.882 15.940 15.943 15.998 15.915 15.973
68 28.681 28.741 32.315 32.375 31.796 31.855 24.704 24.761
72 32.265 32.322 35.709 35.767 41.205 41.263 34.071 34.130
76 35.676 35.735 39.521 39.579 52.227 52.286 44.459 44.516
80 39.525 39.585 43.617 43.675 33.766 33.824 28.207 28.266
84 43.384 43.443 47.657 47.714 50.252 50.309 42.078 42.149
88 47.531 47.590 52.083 52.141 63.892 63.952 55.614 55.672
92 52.028 52.088 57.211 57.268 80.168 80.225 71.291 71.349

For the full copy, performance has not changed, but the smart copy behaves differently. There is a clearly visible pattern (especially for domain sizes divisible by 4) in the time it takes to copy the subdomain. The time is lowest for domain sizes divisible by 16: 32x32x32, 48x48x48, 64x64x64, 80x80x80. It then grows until the domain size reaches 16N-4 (e.g. 44, 60, 76), and falls again at the next multiple of 16.
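One possible explanation (my assumption here, not something the measurements prove) is alignment: a row of float32 values spans a multiple of 64 bytes exactly when the domain size is divisible by 16, which matches the fastest cases above:

```python
# Row width in bytes for a float32 row of each domain size, and its
# remainder modulo 64 (a common alignment/transaction size on these GPUs).
float_size = 4
for domain in (44, 48, 60, 64, 76, 80):
    row_bytes = domain * float_size
    print(domain, row_bytes, row_bytes % 64)
# Sizes divisible by 16 give 64-byte-aligned rows (remainder 0);
# the slow 16N-4 sizes give the worst remainder (48).
```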

Performance of copy on GeForce 460

GeForce 460 copy performance chart

Time of copying entire 3D array on GeForce 460 [ms]
Domain size | memcpy (GPU / CPU) | memcpy3D (GPU / CPU) | memcpy3D, values only (GPU / CPU) | memcpy3D, no host padding (GPU / CPU)
10 0.410 0.432 0.579 0.597 0.489 0.510 0.445 0.463
15 0.840 0.859 1.083 1.101 1.153 1.171 1.093 1.114
20 1.324 1.342 1.466 1.485 1.395 1.414 1.328 1.346
25 1.930 1.949 2.251 2.271 2.438 2.457 2.349 2.369
30 3.876 3.902 3.117 3.138 4.139 4.159 4.092 4.154
35 3.548 3.567 3.672 3.692 3.894 3.912 3.860 3.889
40 4.465 4.482 5.508 5.529 6.721 6.741 6.066 6.086
45 5.636 5.658 5.759 5.777 10.261 10.282 9.919 9.938
50 7.156 7.175 7.243 7.263 9.045 9.066 9.032 9.053
55 8.386 8.406 8.821 8.862 13.406 13.425 14.677 14.698
60 10.122 10.141 10.523 10.545 18.931 18.951 19.866 19.895
65 23.016 23.036 23.736 23.761 17.597 17.618 17.003 17.024
70 26.552 26.572 27.247 27.268 23.212 23.236 23.341 23.366
75 31.211 31.233 31.886 31.907 32.082 32.106 33.479 33.500
80 35.820 35.840 35.643 35.665 24.675 24.698 21.872 21.890
85 40.238 40.263 39.731 39.753 38.612 38.633 40.134 40.154
90 44.809 44.831 44.978 44.998 48.372 48.394 49.591 49.614
95 50.882 50.907 49.887 49.914 65.114 65.137 66.627 66.650
100 55.894 55.917 56.961 57.019 57.187 57.209 56.676 56.698
105 64.301 64.327 66.465 66.519 73.844 73.900 76.339 76.362
110 69.854 69.880 67.134 67.157 89.754 89.778 107.916 107.972
115 79.202 79.231 79.291 79.381 82.054 82.077 90.558 90.583
120 87.267 87.298 85.412 85.438 103.439 103.463 116.363 116.426
125 92.227 92.281 89.846 89.869 131.798 131.866 133.805 133.836
130 143.031 143.057 142.833 142.889 113.838 113.896 119.259 119.284
135 157.465 157.490 156.511 156.541 141.232 141.258 143.574 143.598
140 177.982 178.017 183.117 183.145 160.857 160.889 162.139 162.168
145 173.023 173.056 177.246 177.272 154.549 154.579 153.971 153.997
150 185.642 185.671 188.587 188.619 170.216 170.271 172.422 172.448
155 198.834 198.870 202.212 202.240 212.219 212.278 213.878 213.906
160 213.445 213.475 213.257 213.289 181.400 181.430 179.200 179.228
165 233.645 233.674 231.348 231.377 224.925 224.954 229.401 229.433
170 240.339 240.373 241.443 241.476 258.366 258.428 258.161 258.219
175 255.133 255.192 252.854 252.916 310.002 310.038 315.292 315.323
180 275.525 275.555 281.269 281.330 277.271 277.301 278.145 278.178
185 283.135 283.171 285.218 285.279 333.883 333.913 334.866 334.897
190 298.274 298.311 297.961 297.991 371.361 371.399 373.259 373.320
195 420.039 420.077 415.976 416.011 359.394 359.426 353.314 353.422
200 439.519 439.556 439.015 439.056 393.671 393.708 391.862 391.896
205 465.656 465.693 470.123 470.160 467.429 467.465 467.979 468.053
210 489.099 489.144 481.797 481.835 435.453 435.487 423.905 423.948
215 516.464 516.502 511.549 511.588 486.948 486.986 491.866 491.903
220 537.333 537.495 528.705 528.774 543.285 543.328 534.583 534.621

As before, I decided to analyse domain sizes divisible by 4.

GeForce 460 copy performance chart

Time of copying entire 3D array on GeForce 460 [ms]
Domain size | memcpy (GPU / CPU) | memcpy3D (GPU / CPU) | memcpy3D, values only (GPU / CPU) | memcpy3D, no host padding (GPU / CPU)
12 0.565 0.584 0.762 0.781 0.682 0.700 0.638 0.657
16 0.892 0.911 1.146 1.165 0.647 0.666 0.543 0.561
20 1.317 1.336 1.474 1.492 1.397 1.416 1.364 1.382
24 1.804 1.823 1.956 1.974 1.981 1.999 1.943 1.962
28 2.329 2.347 2.480 2.498 3.398 3.415 3.297 3.315
32 2.962 2.980 3.146 3.166 2.073 2.091 1.838 1.855
36 3.767 3.784 3.853 3.872 4.051 4.070 4.023 4.042
40 4.522 4.540 4.644 4.694 6.143 6.162 6.139 6.156
44 5.534 5.628 5.596 5.616 9.002 9.021 9.014 9.032
48 6.400 6.420 6.478 6.496 5.137 5.156 4.912 4.930
52 7.471 7.489 7.456 7.474 9.940 9.958 9.849 9.867
56 8.640 8.659 8.657 8.675 13.864 13.882 13.829 13.848
60 9.933 9.951 10.058 10.077 18.686 18.705 18.614 18.633
64 11.191 11.209 11.456 11.474 11.413 11.432 11.454 11.473
68 24.843 24.862 25.094 25.113 19.206 19.226 19.774 19.792
72 28.012 28.031 28.264 28.282 25.232 25.251 26.093 26.112
76 31.000 31.020 31.352 31.371 32.162 32.181 33.003 33.023
80 34.561 34.581 34.563 34.582 23.263 23.281 21.749 21.769
84 37.958 37.977 38.136 38.157 33.513 33.532 34.219 34.238
88 41.732 41.752 42.181 42.202 42.511 42.529 43.483 43.503
92 45.269 45.288 45.753 45.773 52.521 52.541 53.341 53.361
96 49.748 49.769 49.343 49.363 38.892 38.911 37.362 37.381
100 53.916 53.937 53.985 54.006 54.421 54.442 54.530 54.549
104 57.963 57.984 58.106 58.127 68.072 68.092 66.333 66.353
108 62.377 62.399 62.681 62.701 79.832 79.852 80.333 80.354
112 67.131 67.152 67.641 67.664 60.153 60.174 58.579 58.599
116 72.275 72.297 71.823 71.844 80.010 80.030 80.734 80.754
120 77.130 77.151 76.991 77.013 99.275 99.297 97.347 97.368
124 82.201 82.222 82.540 82.562 116.753 116.775 114.670 114.693
128 87.523 87.544 88.606 88.628 87.967 87.989 87.779 87.801
132 140.171 140.194 139.516 139.539 112.754 112.776 114.028 114.050
136 147.762 147.787 148.465 148.491 131.506 131.528 133.523 133.547
140 157.080 157.106 156.612 156.638 153.446 153.471 155.699 155.724
144 166.334 166.390 166.700 166.725 129.608 129.630 124.698 124.721
148 176.337 176.363 174.642 174.667 154.533 154.556 155.071 155.094
152 186.635 186.662 186.065 186.094 178.441 178.466 180.504 180.529
156 199.970 199.997 194.241 194.267 205.473 205.499 206.427 206.453
160 204.636 204.662 208.479 208.512 174.436 174.461 171.110 171.134
164 216.665 216.693 215.878 215.905 205.541 205.566 205.244 205.272
168 227.487 227.514 225.826 225.853 234.673 234.701 235.387 235.414
172 236.822 236.849 239.309 239.337 265.910 265.939 267.028 267.057
176 249.419 249.447 247.784 247.811 230.288 230.316 226.439 226.465
180 260.284 260.313 259.921 259.950 265.468 265.495 265.742 265.769
184 270.572 270.603 271.259 271.287 300.409 300.439 300.370 300.399
188 285.403 285.433 283.713 283.742 337.666 337.697 337.687 337.718
192 295.388 295.418 295.069 295.098 296.256 296.286 294.651 294.680
196 411.596 411.630 409.348 409.382 348.688 348.719 339.280 339.312
200 428.020 428.055 426.115 426.149 383.760 383.792 378.659 378.692
204 457.611 457.647 443.511 443.547 426.987 427.021 421.823 421.857
208 464.745 464.783 463.364 463.399 384.384 384.417 373.665 373.698
212 481.896 481.934 477.560 477.596 435.707 435.742 426.547 426.582
216 499.794 499.831 498.332 498.368 478.811 478.847 468.109 468.143
220 517.785 517.823 515.649 515.687 526.951 526.988 522.219 522.256

Just as in the ION case, we can observe the same pattern: the smallest copy time occurs for domain sizes divisible by 16, after which it grows, falling again at the next multiple of 16. This is not clearly visible on the first chart, but stands out on the chart of domains divisible by 4.

Corrected script

#! /usr/bin/python

import sys
import math
import time
import argparse

import numpy

import pycuda
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray

import pycuda.autoinit

def copyPlane(copy, stream, srcX, dstX, srcY, dstY, srcZ, dstZ, width, height, depth):
    copy.src_x_in_bytes = srcX
    copy.dst_x_in_bytes = dstX
    copy.src_y = srcY
    copy.dst_y = dstY
    copy.src_z = srcZ
    copy.dst_z = dstZ
    copy.width_in_bytes = width
    copy.height = height
    copy.depth = depth
    if stream:
        copy(stream)
    else:
        copy()

parser = argparse.ArgumentParser(description="Test speed of memory copy")

parser.add_argument("-d", "--domain", dest="domainSize", type=int,
    default=18, help="Size of the domain to copy (default: 18)")
parser.add_argument("-t", "--block", dest="blockSize", type=int,
    default=64, help="Size of the block of threads to copy (default: 64)")
parser.add_argument("-b", "--basis", dest="basis", type=int,
    default=19, help="Number of values stored per node (default: 19)")

parser.add_argument("--direction", dest="copyDirection",
    action="store", default='htod', choices=['htod', 'dtoh', 'both'],
    help="Copy direction (default: htod)")

parser.add_argument("--full_method", dest="fullCopyMethod",
    action="store", default='memcpy3D', choices=['memcpy', 'memcpy3D', 'memcpy3Dvalues', 'memcpy3Dnopadding'],
    help="Full copy method (default: memcpy3D)")

args = parser.parse_args()

stream = None

floatSize = 4
floatType = numpy.float32
strideX = int(math.ceil(float(args.domainSize)/args.blockSize))*args.blockSize*floatSize
strides = (args.domainSize*strideX, strideX, floatSize)
strideZ = args.domainSize*args.domainSize*strideX

gpudata = pycuda.driver.mem_alloc(strideZ*args.basis)

a3d = pycuda.gpuarray.GPUArray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType, strides=strides, gpudata=gpudata)
if args.fullCopyMethod == 'memcpy3Dnopadding':
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, args.domainSize),
        dtype=floatType)
else:
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, strideX//floatSize),
        dtype=floatType, strides=strides)
c3d = pycuda.driver.Memcpy3D()

startD = pycuda.driver.Event()
endD = pycuda.driver.Event()
startH = time.time()
endH = None

startD.record()
if args.fullCopyMethod == 'memcpy3Dnopadding':
    c3d.src_pitch = args.domainSize*floatSize
else:
    c3d.src_pitch = strideX
c3d.dst_pitch = strideX
c3d.src_height = args.domainSize
c3d.dst_height = args.domainSize

if args.fullCopyMethod == 'memcpy':
    if args.copyDirection in {'htod', 'both'}:
        pycuda.driver.memcpy_htod(a3d.gpudata, a3h)
    if args.copyDirection in {'dtoh', 'both'}:
        pycuda.driver.memcpy_dtoh(a3h, a3d.gpudata)
elif args.fullCopyMethod == 'memcpy3D':
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
elif args.fullCopyMethod in {'memcpy3Dvalues', 'memcpy3Dnopadding'}:
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize*floatSize, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        if args.fullCopyMethod == 'memcpy3Dnopadding':
            c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize*floatSize, args.domainSize, args.domainSize*args.basis)

endD.record()
endD.synchronize()
endH = time.time()
print("{0:.3f} {1:.3f}".format(endD.time_since(startD), 1000*(endH-startH)))
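The script prints two numbers per run: the GPU-event time and the host wall-clock time, both in milliseconds. A small driver like the sketch below can sweep the domain sizes used in the tables (the `copy_benchmark.py` filename and the `run_benchmark` helper are my own hypothetical additions, not part of the original script):

```python
import subprocess
import sys

def parse_timings(output):
    # The benchmark prints "<device_ms> <host_ms>" on one line
    device_ms, host_ms = (float(v) for v in output.split())
    return device_ms, host_ms

def run_benchmark(domain, method="memcpy3D", direction="htod",
                  script="copy_benchmark.py"):
    # Hypothetical driver; assumes the script above was saved as copy_benchmark.py
    cmd = [sys.executable, script, "-d", str(domain),
           "--full_method", method, "--direction", direction]
    return parse_timings(subprocess.check_output(cmd).decode())

# Example (needs a CUDA device): device_ms, host_ms = run_benchmark(60)
print(parse_timings("11.440 11.497"))
```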

Summary

There is no single best method of copying 3D arrays. If we restrict ourselves to domains with sizes divisible by 16, the best solution is the “smart” memcpy3D, but for other sizes memcpy3D is sometimes slower and sometimes faster than a raw memcpy. I intend to use only memcpy3D, since it keeps the code simpler than juggling three different functions (memcpy_htod, memcpy_dtoh, memcpy3D).
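A sketch of what that unification could look like: with a single Memcpy3D descriptor, switching between the full copy and the values-only copy is just a different `width_in_bytes`. The helper below only computes the parameters (it is my illustration, not code from the post), mirroring the stride arithmetic of the script above:

```python
import math

def copy_params(domain, basis=19, block=64, itemsize=4, values_only=False):
    # Parameters for a pycuda.driver.Memcpy3D call, following the
    # stride computation used in the benchmark script.
    pitch = int(math.ceil(domain / block)) * block * itemsize
    width = domain * itemsize if values_only else pitch
    return {"src_pitch": pitch, "dst_pitch": pitch,
            "width_in_bytes": width, "height": domain,
            "depth": domain * basis}

print(copy_params(60))                    # full copy: width == pitch == 256
print(copy_params(60, values_only=True))  # values only: width == 60*4 == 240
```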

As for my mistake, the lack of symmetry in Memcpy3D (X requires bytes, Y and Z numbers of elements) makes it harder to experiment and, for example, to transpose a 3D array. I guess this API design was chosen to expose some internal details of the memcpy3D implementation, and such low-level access may be important for achieving good performance. It only underlines the need for testing your code. If someone is interested in the problems of good API design, ACM Queue has an interesting article by Michi Henning, “API Design Matters”.
