CUDA Memory Copy Performance

The code in this post contains an error which caused some of the results to be incorrect. For an explanation of the problem and for the corrected results please see this post.

To allow cooperation between Palabos and Sailfish we need to copy populations between the two systems. In the D3Q19 model a subdomain is a 3-dimensional array in which each node consists of 19 elements. This post presents my findings on the performance of copying such data between host and device. I performed tests on two machines:

  • Asus eeePC 1201N with Atom 330 (2 cores with HT@1.6GHz), 2GB of RAM, NVIDIA ION (GeForce 9400M) with 256MB RAM
  • desktop with Intel E5200 (2 cores@2.4GHz), 4GB of RAM, NVIDIA GeForce 460 with 1GB RAM

Both systems were running 64-bit Debian unstable with NVIDIA drivers 275.28. Tests were performed using the float32 data type, and all times are given in milliseconds. The first machine was able to store a domain of size 90x90x90 on the GPU; the second could store domains up to 220x220x220.
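
In the tables below, each copy method has two time columns: "GPU" is measured with CUDA events and "CPU" with host wall-clock time, following the pattern below (a stripped-down version of the timing code from the full script at the end of the post):

import time
import pycuda.autoinit
import pycuda.driver

start_event = pycuda.driver.Event()
end_event = pycuda.driver.Event()

start_host = time.time()
start_event.record()
# ... the host<->device copy under test goes here ...
end_event.record()
end_event.synchronize()

gpu_ms = end_event.time_since(start_event)   # "GPU" column
cpu_ms = 1000 * (time.time() - start_host)   # "CPU" column
print("{0:.3f} {1:.3f}".format(gpu_ms, cpu_ms))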

The main problem with transferring data between Sailfish and Palabos is their different memory layouts. Palabos is tuned for the CPU cache and uses a rather compact layout in which the entire population of a node (all 19 values) is stored together. Sailfish, since it runs its code on the GPU, must take into account the way different threads access memory (coalescing, memory bank conflicts, etc.), so it stores populations differently: first the 3D array of element 0, then the 3D array of element 1, and so on. Basically one can treat the Sailfish memory layout as 19 separate 3D arrays. Additionally, Sailfish pads the X index to ensure coalesced memory access by threads, which means that each row has a size that is a multiple of 64 (the thread block size in Sailfish).
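
As a rough illustration, the two layouts can be sketched with index functions like the ones below (the names and the exact ordering of the y/z axes are mine; this is only meant to show the array-of-structures vs. padded structure-of-arrays difference, not the libraries' internal formulas):

import math

DOMAIN = 65   # edge length of a cubic subdomain
BASIS = 19    # populations per node in D3Q19
BLOCK = 64    # Sailfish thread block size used for padding

def palabos_index(x, y, z, q):
    # compact, cache-friendly layout: all 19 populations of a node stored together
    return ((z * DOMAIN + y) * DOMAIN + x) * BASIS + q

def sailfish_index(x, y, z, q):
    # 19 separate 3D arrays, with each row padded to a multiple of BLOCK
    stride_x = int(math.ceil(float(DOMAIN) / BLOCK)) * BLOCK
    return ((q * DOMAIN + z) * DOMAIN + y) * stride_x + x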

As part of the initialisation both libraries must be brought to the same state, which means we need to transfer the entire subdomain from Palabos (i.e. CPU, host) to Sailfish (i.e. CUDA, device). There are four ways we can transfer this data between host and device:

  • using raw memcpy_htod and memcpy_dtoh
  • using memcpy3D and copying everything (including padding for rows)
  • using memcpy3D and copying only meaningful values, leaving out padding
  • using memcpy3D, copying only values, and not having any padding in the host array

In the rest of the text I call the first two methods “full copy” and the last two methods “smart copy”.
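
The memcpy3D variants differ only in how the pycuda.driver.Memcpy3D transfer is parameterised. Below is a minimal host-to-device sketch (the helper name and its arguments are mine; the full test script at the end of the post is what actually produced the numbers):

import pycuda.autoinit
import pycuda.driver as cuda

def make_htod_copy(host_array, device_ptr, domain, basis, stride_x_bytes,
                   host_padded=True, copy_padding=True):
    # Configure a host-to-device 3D copy of a padded structure-of-arrays subdomain.
    float_size = 4
    copy = cuda.Memcpy3D()
    copy.set_src_host(host_array)
    copy.set_dst_device(device_ptr)
    # Pitch of a host row: padded like the device, or tightly packed.
    copy.src_pitch = stride_x_bytes if host_padded else domain * float_size
    copy.dst_pitch = stride_x_bytes              # device rows are always padded
    copy.src_height = domain
    copy.dst_height = domain
    # Full copy moves whole padded rows; "smart" copy only the meaningful bytes.
    copy.width_in_bytes = stride_x_bytes if copy_padding else domain * float_size
    copy.height = domain
    copy.depth = domain * basis                  # 19 stacked 3D arrays
    return copy                                  # calling the result, copy(), runs the transfer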

Performance of copy on ION

ION copy performance chart

Time of copying the entire 3D array on ION [ms]
Domain size | memcpy (GPU, CPU) | memcpy3D (GPU, CPU) | memcpy3D, values only (GPU, CPU) | memcpy3D, no host padding (GPU, CPU)
10 0.618 0.675 1.697 1.751 1.767 1.822 1.498 1.552
15 1.150 1.203 2.865 2.920 2.517 2.572 2.318 2.400
20 1.701 1.756 3.481 3.537 3.958 4.013 3.488 3.543
25 2.349 2.403 4.249 4.531 5.449 5.503 5.200 5.280
30 3.221 3.275 5.436 5.491 7.074 7.128 6.682 6.737
35 4.143 4.201 6.603 6.657 8.319 8.373 8.035 8.090
40 5.318 5.372 7.821 7.876 10.632 10.687 10.208 10.266
45 6.586 6.642 8.988 9.042 13.215 13.270 12.945 13.002
50 8.122 8.180 10.554 10.623 18.320 18.375 17.838 17.892
55 10.288 10.359 12.132 12.237 20.673 20.728 20.725 20.780
60 11.362 11.419 14.124 14.177 26.266 26.322 26.012 26.069
65 26.192 26.247 29.050 29.105 21.509 21.564 16.110 16.167
70 30.202 30.257 33.486 33.541 24.306 24.362 18.579 18.693
75 34.702 34.757 38.029 38.084 31.293 31.350 25.306 25.362
80 39.427 39.485 43.448 43.503 34.182 34.239 27.527 27.581
85 44.442 44.498 48.965 49.019 42.682 42.737 37.505 37.564
90 50.047 50.104 54.688 54.744 51.569 51.624 46.578 46.635

Up to a domain size of 60x60x60, copying everything (either with memcpy or memcpy3D) is the faster option, with raw memcpy giving the shortest times. But once the domain grows to 65x65x65, copying only the relevant data (smart copy) using memcpy3D clearly becomes the better choice over copying everything including the padding bytes. The last method (“smart” copy with padding on the device and no padding on the host) offers the best performance. In my opinion this is because the lack of padding on the host avoids wasting CPU cache on unused bytes. The Atom 330 has two cores, each with 512 kB of cache; a 65x65x65 domain of 32-bit floats is slightly larger than the 1 MB of available cache, and removing the padding lets the host array stay in cache longer, giving better performance than the smart copy with host padding. Of course there may be other reasons; I do not have access to the implementation details of memcpy3D, so I cannot offer insight into its behaviour.
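
To make the cache arithmetic explicit (considering just one 3D array of float32 values and ignoring everything else competing for the cache):

domain = 65
float_size = 4                   # bytes per float32 value
cache_bytes = 2 * 512 * 1024     # two Atom 330 cores with 512 kB of cache each

array_bytes = domain ** 3 * float_size
print(array_bytes, cache_bytes)  # 1098500 vs 1048576: the 65x65x65 array no longer fits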

The large drop in performance of the full-copy methods when going from a 60x60x60 domain to 65x65x65 can also be explained by padding. As noted earlier, Sailfish pads each row to a size divisible by 64. This means that for the 60x60x60 domain we are copying a 64x60x60 array, while for the 65x65x65 domain we are copying an array that is more than twice as large: 128x65x65. But while this explains the large increase in full-copy time, I still do not understand and cannot explain the drop in smart-copy time. It drops between 60 and 65, and only around 75x75x75 does it climb back to the time it took to copy the 60x60x60 domain.
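
The padded row length is computed the same way as strideX in the test script at the end of the post; a minimal sketch:

import math

BLOCK = 64  # Sailfish thread block size

def padded_x(domain_size):
    # X dimension after padding to a multiple of the thread block size
    return int(math.ceil(float(domain_size) / BLOCK)) * BLOCK

print(padded_x(60), padded_x(65))   # 64 vs 128: the padded row doubles between 60 and 65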

Performance of copy on GeForce 460

GeForce 460 copy performance chart

Time of copying the entire 3D array on GeForce 460 [ms]
Domain size | memcpy (GPU, CPU) | memcpy3D (GPU, CPU) | memcpy3D, values only (GPU, CPU) | memcpy3D, no host padding (GPU, CPU)
10 0.410 0.430 0.572 0.590 0.353 0.371 0.319 0.336
15 0.843 0.862 1.076 1.095 0.616 0.635 0.549 0.567
20 1.322 1.340 1.470 1.490 1.086 1.106 0.981 0.999
25 1.932 1.951 2.179 2.197 1.535 1.553 1.389 1.408
30 2.686 2.705 2.852 2.870 2.197 2.215 2.008 2.027
35 3.517 3.535 3.652 3.671 3.373 3.391 3.134 3.153
40 4.493 4.512 4.654 4.673 4.154 4.172 3.899 3.918
45 5.650 5.669 5.818 5.835 5.986 6.004 5.845 5.864
50 6.901 6.919 6.932 6.951 8.397 8.415 8.041 8.061
55 8.366 8.385 8.349 8.368 9.956 9.975 9.921 9.939
60 9.889 9.908 10.086 10.104 13.088 13.106 12.906 12.925
65 22.850 22.869 22.891 22.911 9.190 9.208 7.863 7.882
70 26.467 26.484 26.421 26.440 10.511 10.529 9.654 9.673
75 30.310 30.329 30.287 30.306 13.352 13.371 11.851 11.869
80 34.683 34.703 34.836 34.856 15.128 15.146 14.577 14.595
85 38.842 38.862 38.915 38.934 20.079 20.097 18.685 18.704
90 43.827 43.848 43.711 43.731 25.259 25.277 24.924 24.944
95 48.314 48.335 48.120 48.141 28.258 28.278 27.766 27.786
100 53.858 53.878 53.582 53.603 34.990 35.010 34.504 34.523
105 59.075 59.095 58.923 58.942 43.180 43.201 43.113 43.133
110 64.690 64.711 64.635 64.655 47.070 47.088 46.632 46.651
115 71.115 71.135 70.978 70.998 56.152 56.171 56.918 56.939
120 77.243 77.264 77.542 77.564 61.187 61.207 60.831 60.851
125 83.958 83.980 83.105 83.127 72.124 72.146 73.920 73.941
130 136.309 136.334 135.974 135.997 50.504 50.525 45.929 45.949
135 146.751 146.777 145.515 145.538 54.448 54.468 50.977 50.998
140 158.134 158.177 156.634 156.657 61.193 61.213 57.211 57.233
145 168.521 168.547 167.841 167.867 71.299 71.321 70.364 70.385
150 180.432 180.457 179.934 179.960 76.502 76.523 74.500 74.522
155 192.819 192.845 191.539 191.566 89.529 89.551 90.365 90.388
160 205.790 205.817 205.412 205.438 94.870 94.893 94.428 94.450
165 218.756 218.784 217.885 217.911 111.780 111.803 112.584 112.607
170 231.733 231.761 232.367 232.396 128.750 128.775 129.171 129.195
175 246.083 246.111 245.708 245.737 136.877 136.901 139.651 139.676
180 260.130 260.158 259.842 259.870 158.555 158.580 157.252 157.277
185 274.028 274.058 274.744 274.772 181.361 181.385 185.707 185.735
190 291.114 291.146 288.064 288.092 191.480 191.505 192.982 193.009
195 406.808 406.842 403.830 403.863 148.540 148.564 135.534 135.559
200 424.325 424.367 425.941 425.976 156.040 156.066 142.504 142.528
205 450.420 450.457 447.015 447.050 171.625 171.651 159.442 159.468
210 473.551 473.588 466.182 466.218 186.861 186.888 177.049 177.075
215 496.014 496.052 490.306 490.343 195.255 195.283 189.415 189.443
220 517.620 517.658 516.187 516.225 221.179 221.209 213.509 213.566

For the GeForce 460 the situation is similar. There is again an increase in the time needed to perform a full copy after reaching 65x65x65, and a drop in the time needed to copy only the meaningful data for domains between 60x60x60 and 75x75x75. Because the GeForce has more memory, we can test copy times for larger domains. We can observe a similarly rapid change in copy time (an increase for the full copy, a drop for the “smart” copy) when the domain size passes 128x128x128, and again when it passes 192x192x192. I assume the same behaviour would be observed when the domain size reaches 256x256x256, 320x320x320, and so on, but I do not have the hardware to test it.
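
A rough sketch of how much data a full copy actually moves near those boundaries (19 populations of float32 with rows padded to a multiple of 64; the helper name is mine):

import math

def full_copy_bytes(d, basis=19, block=64, float_size=4):
    # bytes moved by a full copy: padded rows, all rows and slices, all populations
    padded_x = int(math.ceil(float(d) / block)) * block
    return padded_x * d * d * basis * float_size

for d in (125, 130, 190, 195):
    print(d, round(full_copy_bytes(d) / 2.0 ** 20, 1))   # MiB; the jumps at 130 and 195 stand out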

Summary

For larger domains there is a clear winner: use memcpy3D, copy only the meaningful data, and use different memory layouts on the host and the device. For smaller domains this solution takes longer than calling memcpy_dtoh or memcpy_htod, but for such small domains it does not make sense to run the simulation on the GPU anyway: the overhead of copying data, running kernels, and copying results back to the host would turn it into a performance loss. Choosing memcpy3D also means that the glue code joining Sailfish and Palabos can be simpler; the same memcpy3D call can be used to copy the envelope as well as entire subdomains.

Script used for testing

The script requires Python 2.7, as it uses the argparse module and the new set literal syntax.

#! /usr/bin/python

import sys
import math
import time
import argparse

import numpy

import pycuda
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray

import pycuda.autoinit

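# Helper: configure the offsets and extents of a 3D transfer and execute it,
# optionally on a stream. X offsets and width are given in bytes
# (src_x_in_bytes / dst_x_in_bytes / width_in_bytes); height and depth count rows and slices.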
def copyPlane(copy, stream, srcX, dstX, srcY, dstY, srcZ, dstZ, width, height, depth):
    copy.src_x_in_bytes = srcX
    copy.dst_x_in_bytes = dstX
    copy.src_y = srcY
    copy.dst_y = dstY
    copy.src_z = srcZ
    copy.dst_z = dstZ
    copy.width_in_bytes = width
    copy.height = height
    copy.depth = depth
    if stream:
        copy(stream)
    else:
        copy()

parser = argparse.ArgumentParser(description="Test speed of memory copy")

parser.add_argument("-d", "--domain", dest="domainSize", type=int,
    default=18, help="Size of the domain to copy (default: 18)")
parser.add_argument("-t", "--block", dest="blockSize", type=int,
    default=64, help="Size of the block of threads to copy (default: 64)")
parser.add_argument("-b", "--basis", dest="basis", type=int,
    default=19, help="Size of the block of threads to copy (default: 19)")

parser.add_argument("--direction", dest="copyDirection",
    action="store", default='htod', choices=['htod', 'dtoh', 'both'],
    help="Copy direction (default: htod)")

parser.add_argument("--full_method", dest="fullCopyMethod",
    action="store", default='memcpy3D', choices=['memcpy', 'memcpy3D', 'memcpy3Dvalues', 'memcpy3Dnopadding'],
    help="Copy direction (default: memcpy3D)")

args = parser.parse_args()

stream = None

floatSize = 4            # size of float32 in bytes
floatType = numpy.float32
# Row length in bytes after padding the X dimension to a multiple of the block size
strideX = int(math.ceil(float(args.domainSize)/args.blockSize))*args.blockSize*floatSize
strides = (args.domainSize*strideX, strideX, floatSize)
strideZ = args.domainSize*args.domainSize*strideX   # bytes in one padded 3D array (one population)

gpudata = pycuda.driver.mem_alloc(strideZ*args.basis)

a3d = pycuda.gpuarray.GPUArray((args.basis*args.domainSize, args.domainSize, args.domainSize),
    dtype=floatType, strides=strides, gpudata=gpudata)
if args.fullCopyMethod == 'memcpy3Dnopadding':
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, args.domainSize),
        dtype=floatType)
else:
    a3h = numpy.ndarray((args.basis*args.domainSize, args.domainSize, strideX/floatSize),
        dtype=floatType, strides=strides)
c3d = pycuda.driver.Memcpy3D()

startD = pycuda.driver.Event()
endD = pycuda.driver.Event()
startH = time.time()
endH = None

startD.record()
if args.fullCopyMethod == 'memcpy3Dnopadding':
    # The host array is tightly packed, so its pitch is just one row of values.
    c3d.src_pitch = args.domainSize*floatSize
else:
    c3d.src_pitch = strideX
c3d.dst_pitch = strideX
c3d.src_height = args.domainSize
c3d.dst_height = args.domainSize

if args.fullCopyMethod == 'memcpy':
    if args.copyDirection in {'htod', 'both'}:
        pycuda.driver.memcpy_htod(a3d.gpudata, a3h)
    if args.copyDirection in {'dtoh', 'both'}:
        pycuda.driver.memcpy_dtoh(a3h, a3d.gpudata)
elif args.fullCopyMethod == 'memcpy3D':
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            strideX, args.domainSize, args.domainSize*args.basis)
elif args.fullCopyMethod in {'memcpy3Dvalues', 'memcpy3Dnopadding'}:
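    # NOTE: copyPlane's width argument ends up in Memcpy3D.width_in_bytes, which
    # expects a byte count; passing args.domainSize (an element count) copies only
    # a quarter of each row of float32 values. This appears to be the error
    # mentioned at the top of this post.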
    if args.copyDirection in {'htod', 'both'}:
        c3d.set_src_host(a3h)
        c3d.set_dst_device(a3d.gpudata)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize, args.domainSize, args.domainSize*args.basis)
    if args.copyDirection in {'dtoh', 'both'}:
        if args.fullCopyMethod == 'memcpy3Dnopadding':
            c3d.src_pitch, c3d.dst_pitch = c3d.dst_pitch, c3d.src_pitch
        c3d.set_src_device(a3d.gpudata)
        c3d.set_dst_host(a3h)
        copyPlane(c3d, stream, 0, 0, 0, 0, 0, 0,
            args.domainSize, args.domainSize, args.domainSize*args.basis)

endD.record()
endD.synchronize()
endH = time.time()
print("{0:.3f} {1:.3f}".format(endD.time_since(startD), 1000*(endH-startH)))

I had originally intended to also present the performance of copying only the envelope, but this post is already getting long, so I decided to focus on copying full domains. I hope to post the results of further performance analysis later this week.

