CUDA FAQ

This is an unofficial CUDA FAQ.

  1. How can the runtime of a kernel be measured?
  2. How can grid-wide synchronization or communication be achieved?
  3. What is local memory?

How can the runtime of a kernel be measured?

Kernel calls are asynchronous. That means after calling a kernel like this

mykernel<<<512, 128, 64>>>(param1, param2);

the kernel will be started and the call will return almost immediately; the CPU does not wait for the kernel to finish. Therefore, to time the kernel properly, you have to wait for it explicitly. This can be done with the cudaThreadSynchronize() call, which blocks until the device has completed all previously requested tasks. A timed kernel call could look like this:

unsigned int timer = 0;
cutCreateTimer(&timer);
cudaThreadSynchronize();               // make sure the GPU is idle
cutStartTimer(timer);
mykernel<<<512, 128, 64>>>(param1, param2);
cudaThreadSynchronize();               // wait for the kernel to finish
cutStopTimer(timer);
float time = cutGetTimerValue(timer);  // elapsed time in milliseconds

Note that there are two cudaThreadSynchronize() calls. The call before cutStartTimer() ensures that all previously requested tasks have finished and the GPU is idle. Then the kernel is started, and right before the timer is stopped, the second cudaThreadSynchronize() call ensures that the kernel has finished.

This is of course only a very basic example. To take meaningful performance measurements, other things have to be considered. The very first kernel call, for example, might take more time than subsequent calls because it is accompanied by GPU initialization overhead, so it is recommended to discard the first runtime. In general it is a good idea to launch a kernel several times and take the average or the shortest time to determine its performance. Especially when kernels are very short, it makes sense to launch them, say, 1000 times to get useful results.
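
As a hedged sketch of this approach, the same measurement can also be made with CUDA events instead of the cutil timer helpers; mykernel and its launch configuration are taken from the example above, and the warm-up launch and averaging follow the advice just given:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

mykernel<<<512, 128, 64>>>(param1, param2);     // warm-up launch, result discarded

cudaEventRecord(start, 0);
for (int i = 0; i < 1000; ++i)
    mykernel<<<512, 128, 64>>>(param1, param2);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                     // wait until all launches have finished

float totalTime = 0.0f;
cudaEventElapsedTime(&totalTime, start, stop);  // elapsed time in milliseconds
float avgTime = totalTime / 1000.0f;            // average runtime per launch

cudaEventDestroy(start);
cudaEventDestroy(stop);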

How can grid-wide synchronization or communication be achieved?

There are a few problems with grid-wide synchronization. First, for many problems a solution that requires grid-wide synchronization or communication is not the most efficient one: the synchronization is very expensive in terms of kernel runtime, and there are methods better suited to parallel streaming architectures that apply to a wide variety of problems. Second, grid-wide synchronization requires some form of mutual exclusion or some means of controlled grid-wide communication, which boils down to atomic reads and writes from and to global memory. These, however, are only available on devices of compute capability 1.1 or higher. Appendix C of the CUDA Programming Guide lists all the available functions:

atomicAdd()
atomicSub()
atomicExch()
atomicMin()
atomicMax()
atomicInc()
atomicDec()
atomicCAS()
atomicAnd()
atomicOr()
atomicXor()
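
As a minimal sketch of what these functions enable (the kernel name and counter are made up for illustration), every thread in the grid can safely increment a single counter in global memory:

__global__ void countThreads(unsigned int *counter)
{
    // The read-modify-write is performed atomically by the hardware,
    // so no increments are lost, even across different blocks.
    atomicAdd(counter, 1);
}

Because the update is atomic across the whole grid, such a counter can serve as a simple building block for grid-wide communication.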

However, for some problems, people have reported being able to communicate grid-wide even on compute capability 1.0 devices.

The basic solution for grid-wide synchronization is to split the two parts that have to run sequentially into two kernels and launch them one after another: the second kernel does not start before the first has completely finished, which provides an implicit grid-wide barrier.
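
A hedged sketch of this pattern, with hypothetical kernel names and parameters:

// Phase 1: every block writes its partial result to global memory.
phase1<<<gridSize, blockSize>>>(d_input, d_partial);

// This launch starts only after phase1 has finished across the whole
// grid, so d_partial is guaranteed to be complete here.
phase2<<<1, blockSize>>>(d_partial, d_result);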

What is local memory?

Local memory is part of the device memory and is ‘local’ to each individual thread, that is, it is accessible only by the thread it belongs to. Local memory is not cached, so access to it is slow. The compiler sometimes places variables in local memory, especially large arrays, because registers cannot be indexed: an array that is accessed with an index not known at compile time cannot live in registers.
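
A minimal sketch (the kernel and array size are made up): a per-thread array indexed with a value only known at runtime cannot be kept in registers, so the compiler spills it to local memory. Compiling with nvcc --ptxas-options=-v reports the local memory (lmem) usage per thread.

__global__ void spillExample(float *out, int idx)
{
    // buf is private to each thread; because it is indexed dynamically
    // and registers cannot be indexed, it ends up in local memory.
    float buf[64];
    for (int i = 0; i < 64; ++i)
        buf[i] = i * 2.0f;
    out[threadIdx.x] = buf[idx];   // runtime index forces the spill
}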


