Mandelbrot Using C++ AMP

It is time to start taking advantage of the computing power of GPUs…

A while ago I wrote an article about how to use the Microsoft Parallel Patterns Library (PPL) to render the Mandelbrot fractal using multiple CPU cores. That article can be found here.

This new article will make the Mandelbrot renderer multiple times faster by using a new Microsoft technology called C++ AMP: Accelerated Massive Parallelism, introduced in the Visual C++ 11 Preview.

The code in the previous article showed each line of the fractal immediately after it was calculated. For this article, this is changed. The Mandelbrot image will be rendered completely off-screen, and only shown when the entire image has been calculated. This is to reduce overhead of displaying the fractal line-by-line, especially with the C++ AMP version which will be so fast that this overhead could become pretty substantial.

I will also switch to single precision floating point numbers, because the GPU and GPU driver combination on my Windows 7 machine does not support double precision floating point numbers. The C++ AMP technology itself supports both single- and double precision floating point arithmetic. However, whether or not double precision arithmetic works depends on your specific GPU hardware and your specific drivers from your GPU vendor. A side effect of using single precision arithmetic in the Mandelbrot renderer is that the image will get blocky at big zoom levels. At the end of this article, a piece of code is shown how to check if an accelerator in your system supports double precision arithmetic or not. You could use that to decide at runtime whether to use a single precision or double precision implementation. This is left as an exercise for the reader.

Single-Threaded Implementation

All implementations, the single-threaded, the PPL, and the C++ AMP Mandelbrot version, use the following setup:

const int halfHeight = int(floor(m_nBuffHeight/2.0));
const int halfWidth = int(floor(m_nBuffWidth/2.0));
const int maxiter = 1024;
const float escapeValue = 4.0f;
float zoomLevel = float(m_zoomLevel);
float view_i = float(m_view_i);
float view_r = float(m_view_r);
float invLogOf2 = 1 / log(2.0f);
if (m_buffers[m_nRenderToBufferIndex].empty())
    return;
unsigned __int32* pBuffer = &(m_buffers[m_nRenderToBufferIndex][0]);

Here is the single-threaded implementation from my previous article, but updated to use single precision floating point arithmetic, and to render one Mandelbrot image to the buffer before displaying it:

for (int y = -halfHeight; y < halfHeight; ++y)
{
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x)
    {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter)
        {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue)
            {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }

        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
}

Multi-Threaded Implementation (PPL)

Parallelizing this implementation using the Microsoft Parallel Patterns Library (PPL) is shown in my previous article. Again, this implementation has been updated to use single precision arithmetic and to render one whole frame to the buffer before displaying it. The updated code looks as follows:

parallel_for(-halfHeight, halfHeight, 1, [&](int y)
{
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x)
    {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter)
        {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue)
            {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }

        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
});

C++ AMP Implementation

C++ AMP is a new technology being developed by Microsoft for C++ to allow you to easily write code that takes advantage of the computing power of accelerators, such as GPUs. For now, C++ AMP requires DirectX 11 GPUs. C++ AMP abstracts the accelerators for you. When using C++ AMP, you don’t need to worry about what kind of accelerator is in the system. For example, there is no need to write one implementation for an NVidia GPU and another for an AMD GPU. You just write 1 implementation, and C++ AMP handles the rest. If you have multiple GPUs in your system, C++ AMP can use them all at the same time, and they don’t even need to be of the same vendor. For example, if you have a system with both an AMD and an NVidia GPU in it, C++ AMP can use them both together.

Microsoft has the intention to make C++ AMP an open specification, so that other compiler vendors could also add support for it.

For this simple Mandelbrot renderer, once you have a PPL implementation, it’s almost trivial to transform it to C++ AMP. Without further ado, here is the C++ AMP version, with a detailed explanation after the code:

try
{
    array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);
    parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d)
    {
        // Formula: zi = z^2 + z0
        int x = idx[1] - halfWidth;
        int y = idx[0] - halfHeight;
        float Z0_i = view_i + y * zoomLevel;
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter)
        {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue)
            {
                // We escaped
                res = iter + 1 - fast_log(fast_log(fast_sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }

        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        a[idx] = result;
    });
    a.synchronize();
}
catch (const Concurrency::runtime_exception& ex)
{
    MessageBoxA(NULL, ex.what(), "Error", MB_ICONERROR);
}

Everything of C++ AMP is in the concurrency namespace. Certain Direct3D interoperability functionality, such as fast_log(), is in the concurrency::direct3d namespace.

The first thing the C++ AMP implementation does is defining a 2D array_view over the buffer. The array_view class creates a multi-dimensional view over a user supplied buffer. The data of this buffer is copied to and from GPU memory on demand by C++ AMP. The basic use of array_view is:

array_view<type, dim>

where type is the type of data in the buffer and dim is the dimensionality of your data. The dimensionality can be anything you want; it can be 1D, 2D, 3D, 4D… When you define your array_view like this, it will be a read and write view over your buffer, which means that initial data is copied from system memory to GPU memory, and the end result is copied from GPU memory back to system memory. If your GPU kernel is only writing to the buffer, the array_view can be defined as a write-only view over your buffer as follows:

array_view<writeonly<type>, dim>

This will optimize memory transfers. No initial data is copied from system memory to the GPU memory; data is only copied after finishing the calculations, from GPU memory to system memory.
The opposite is also possible. You can create a read-only array_view as follows:

array_view<const type, dim>

This will only copy initial data from system memory to GPU memory, and will not copy it back from GPU memory to system memory.

The above Mandelbrot implementation only writes to the buffer, so the array_view is defined as write-only.

Another option is to use the array class instead of array_view. The array class allocates a buffer, while array_view creates a view over a user allocated buffer.

After the C++ AMP Mandelbrot implementation has created the array_view, it uses concurrency::parallel_for_each().

The first parameter to the parallel_for_each() is a compute domain, called a grid. If you have an array_view, you can simply use its grid property as shown in the above implementation. For this Mandelbrot renderer, we have a two dimensional array_view, so the grid will also be 2D. On each position of the grid, a computation will be performed.

The second parameter to parallel_for_each() is a lambda expression that specifies restrict(direct3d) to say that the code in the lambda expression should be executed on your GPU, instead of your CPU. The restrict(direct3d) attribute is a compile time check; the compiler will check the code to see whether the code will be able to run on GPUs. For example, if you call a system call in your restrict(direct3d) GPU kernel, the code will not work on the GPU, and compilation will fail. In other words, if you use restrict(direct3d) and it compiles without errors, you are not violating any GPU restrictions and the code is valid to be executed on GPUs. Any function that you call from inside a GPU kernel should also have the restrict(direct3d) specifier. There are a number of restrictions on the code in a restrict(direct3d) function. MSDN lists them as follows:

  • The function can call only functions that have the restrict(direct3d) clause.
  • The function must be inlinable.
  • The function can declare only int, unsigned int, float, and double variables, and classes and structures that contain only these types.
  • Lambda functions cannot capture by reference and cannot capture pointers.
  • References and single-indirection pointers are supported only as local variables and function arguments.
  • The following are not allowed:
    • Recursion.
    • Variables declared with the volatile keyword.
    • Virtual functions.
    • Pointers to functions.
    • Pointers to member functions.
    • Pointers in structures.
    • Pointers to pointers.
    • goto statements.
    • Labeled statements.
    • try, catch, or throw statements.
    • Global variables.
    • Static variables. Use tile_static Keyword instead.
    • dynamic_cast casts.
    • The typeid operator.
    • asm declarations.
    • Varargs.

The lambda expression accepts only one parameter. This parameter is an index into the compute domain (= grid). For the 2D Mandelbrot case, this index is a 2D index denoted as:

index<2> idx

You can access each dimensionality of this index using array notation, for this 2D index you can use idx[0] and idx[1]. This gives you the position in the grid. An added bonus of using an index is that the code looks cleaner. In the C++ AMP version of the Mandelbrot renderer there is no ugly buffer index calculation. Compare the C++ AMP version:

a[idx] = result;

with the PPL version and single-threaded version:

pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;

The lambda expression should capture all variables it needs by value, except concurrency::array objects, which have to be captured by reference. MSDN also states that the following restrictions apply on the object types that can be captured:

  • No pointers or references (exception: array)
  • No char, short, or long long types
  • No bool types
  • Structs or classes that contain supported types are allowed
  • No objects that contain virtual functions or virtual bases
  • Structs or classes must be PODs (more formally, they must be copyable by blitting)

parallel_for_each() will throw exceptions in case of errors. Possible reasons are:

  • Failure to create the shader, which is the code that runs on the accelerator
  • Failure to create buffers
  • Invalid grid passed
  • Mismatched accelerators

The body of the lambda expression supplied to the parallel_for_each() (= kernel) runs asynchronously with the CPU code following the parallel_for_each() call, which means that your CPU code after the parallel_for_each() call continues to execute in parallel with the code inside the body of the parallel_for_each() lambda, until a synchronization point is reached. If your CPU code at a certain point observes the result of the parallel_for_each() kernel, by inspecting the array_view in this example, at that point C++ AMP synchronizes and blocks until the GPU kernel is finished. A synchronize is also forced when the array_view object goes out of scope. However, when the array_view goes out of scope, the synchronize() method is called from within the array_view destructor and thus is not allowed to throw any exceptions. In that case, GPU and CPU exceptions that happen during the synchronization, including during transferring results from GPU memory back to system memory, will be lost. For this reason, it is recommended to manually call synchronize() at a suitable point in time, and properly catch any exceptions.

C++ AMP also includes a function called copy_async() which copies data from a source to a destination. The function returns a C++11 std::future which you can check whenever you want to see whether the copy has finished or not.

To get even more performance out of your accelerators, you can investigate the tiled model, which works up to three dimensions. However, this article will not go deeper in on this more advanced feature. You can find an example in this blog post: tile_static, tile_barrier, and tiled matrix multiplication with C++ AMP.

You can use the accelerator and accelerator_view classes to retrieve information from installed accelerators in a system; and use the get_accelerators() function to get a vector of accelerators. This introductory article on C++ AMP does not go deeper in on the functionality of these classes. Only a small example is given.

You can get a hold of the default accelerator in a system by using the default constructor of the accelerator class. Once you have this, you can query it for information using the supports() method and the accelerator_restriction enumeration. For example, the following code checks whether the default accelerator supports double precision calculations or not:

accelerator acc;
if (acc.supports(accelerator_restriction::double_precision))
    cout << "Your C++ AMP default GPU supports double precision arithmetic." << endl;
else
    cout << "Your C++ AMP default GPU does not support double precision arithmetic." << endl;

Results

On my test machine (Intel Core i7-2600k, 16GB RAM, AMD Radeon HD 6800) I get the following averages in milliseconds for rendering one frame at the default zoom level at a resolution of 1916×951:

  • Single threaded version: 1167 ms
  • PPL version: 194 ms
  • C++ AMP version: 31 ms

The Intel Core i7-2600k has 4 physical cores. With hyperthreading enabled, this results in 8 cores seen by the operating system. When using the single threaded version, there is only 1 of these cores being utilized. Using the PPL version, all 8 cores are fully loaded. The C++ AMP version uses almost no CPU power at all; it only uses the GPU cores. On my Radeon HD 6800, it executes around 220 threads.

When you are running Windows 8 Preview with Visual C++ 11 Preview, you can use GPU debugging and put breakpoints in your GPU code; see Debugging a C++ AMP Application. At the moment, GPU debugging does not work on Windows 7.

Sample Application

This article comes with a sample application. You need the Visual Studio 11 Developer Preview to compile it. Unfortunately, I am not able to include a pre-compiled binary because that binary would depend on the Visual C++ 11 Preview redistributable which is not yet available.

The sample application has a toolbar with the following ‘programmers-art’ buttons:

  • R: Render one frame using the selected method
  • : Zoom out on the Mandelbrot fractal
  • +: Zoom in on the Mandelbrot fractal
  • S: Select the single-threaded implementation
  • PPL: Select the PPL implementation
  • AMP: Select the C++ AMP implementation
  • B: Benchmark: Render a number of frames and display the average in the window title bar.

After rendering a frame, the title bar shows how long it took to render and at what resolution. You can use the left mouse button to move around in the Mandelbrot fractal, and the mouse wheel to zoom in and out. Note however that this interactive moving and zooming is only useful with the C++ AMP implementation, because the single-threaded and PPL versions are too slow for interactivity.

Conclusion

This article is an introduction to, and demonstration of the power of C++ AMP. Check out the official documentation at C++ AMP: Accelerated Massive Parallelism for more information. Also check out Daniel Moth’s blog, containing a wealth of information on C++ AMP.

Disclaimer: Keep in mind that this article is written using the Developer Preview of Visual C++ 11. Certain aspects might change in the final version of Visual C++ 11.

Resources

Downloads

Share

Leave a Comment

Name: (Required)

E-mail: (Required)

Website:

Comment: