
DPC++


Offloading Computation

The sections below explain how computation can be offloaded to heterogeneous devices using kernel functions and the different approaches to expressing parallelism.

Types of Kernel Invocations

A kernel function can be invoked using one of the following methods:

  • Single Task
  • Basic Parallel Kernel
  • ND-Range Kernel
  • Hierarchical Kernel

Single Task

The single_task method is used to invoke a kernel function once on the target device.

The code below uses the single_task method to double each value of an array. The array is initialized to the values 0 1 2 3; the kernel function iterates through the array with a for-loop and doubles each value, and the result is printed on the host.

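A minimal sketch of such a kernel is shown below. It assumes a default-constructed queue and a buffer wrapping a small std::vector, so the setup may differ from the embedded example on this page.

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

int main() {
  constexpr int N = 4;
  std::vector<int> data{0, 1, 2, 3};

  queue q;
  {
    buffer<int, 1> buf(data.data(), range<1>(N));
    q.submit([&](handler &h) {
      auto A = buf.get_access<access::mode::read_write>(h);
      // single_task executes the kernel body exactly once on the device
      h.single_task([=]() {
        for (int i = 0; i < N; i++)
          A[i] *= 2;
      });
    });
  } // buffer goes out of scope: wait for the kernel and copy data back to the host

  for (int i = 0; i < N; i++)
    std::cout << data[i] << " ";  // prints: 0 2 4 6
  std::cout << "\n";
  return 0;
}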

Basic Parallel Kernel

The parallel_for method is used to invoke a kernel function on each iteration of the task. The parallel_for method specifies the global range, and the kernel function is invoked for every work-item, which is identified using the id or item class.

The code below uses the parallel_for method to double each value of an array. The array is initialized to the values 0 1 2 3, and the kernel function is invoked once for each item to double its value.

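A sketch of the parallel_for form, assuming the same queue q and four-element buffer buf as in the single_task sketch above:

q.submit([&](handler &h) {
  auto A = buf.get_access<access::mode::read_write>(h);
  // One work-item is launched for each index i in the range [0, N)
  h.parallel_for(range<1>(N), [=](id<1> i) {
    A[i] *= 2;
  });
});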

ND-Range Kernel

The ND-Range kernel allows programmers who are used to programming in OpenCL to continue using similar concepts and features for offloading execution.

ND-Range kernels are similar to basic parallel kernels in that the kernel function is invoked on each iteration of the task, but they additionally allow execution to be mapped to hardware resources for better performance optimization. For example, a GPU consists of many compute units, each with its own local memory. Grouping executions so that each group executes on a single compute unit can help optimize occupancy on the hardware. Such a group is called a work-group, and an individual instance of the kernel function is called a work-item. Each work-group can also use the local memory associated with its compute unit for faster access, rather than repeatedly accessing global memory.

DPC++ also introduces the concept of sub-groups: a subset of work-items within a work-group that may map to vector hardware. Work-items within a sub-group execute simultaneously and may help achieve higher performance. You can learn more about using local memory and sub-groups in future lessons.

An ND-Range kernel is invoked using the parallel_for method by specifying the global size of the iteration space and the work-group size with the nd_range class; the nd_item class is used to get the index and other properties of the work-item.

h.parallel_for(nd_range<2>{global_size, work_group_size}, [=](nd_item<2> item){

});

The code below uses the parallel_for method to double each value of the array. The array size is 256 and the work-group size is set to 64, so that each group of 64 work-items executes on the same compute unit in the hardware. We will learn more about the different ways to optimize computation using ND-Range kernels in future lessons.

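A sketch of the ND-Range form is shown below. It assumes a command-group handler h and a read-write accessor A over a 256-element integer buffer; these names are illustrative and may differ from the embedded example.

constexpr size_t N = 256;  // global size
constexpr size_t B = 64;   // work-group size
h.parallel_for(nd_range<1>(range<1>(N), range<1>(B)), [=](nd_item<1> item) {
  auto i = item.get_global_id(0);  // global index of this work-item
  A[i] *= 2;
});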

Hierarchical Kernel

The hierarchical kernel is an alternative to the ND-Range kernel for expressing parallelism. It provides a more hierarchical syntax, with an outer loop that executes for each work-group and an inner loop that executes for each work-item.

Hierarchical kernels aim to provide a more structured way to express parallel kernels, with all the optimization capabilities of ND-Range kernels, such as the use of local memory. This alternative may suit programmers who are not familiar with OpenCL programming.

The parallel_for_work_group method is used to invoke a kernel function for every work-group, and the inner parallel_for_work_item method is invoked for every work-item in the work-group.

h.parallel_for_work_group(num_groups, work_group_size, [=](group<1> g) {
    g.parallel_for_work_item([&](h_item<1> item) {

    });
});

The code below uses the parallel_for_work_group and parallel_for_work_item methods to double each value of the array. The array size is 256 and the work-group size is set to 64, so that each group of 64 work-items executes on the same compute unit in the hardware. We will learn more about the different ways to optimize computation using hierarchical kernels in future lessons.

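A sketch of the hierarchical form under the same assumptions (a handler h and a read-write accessor A over a 256-element integer buffer):

constexpr size_t N = 256;  // global size
constexpr size_t B = 64;   // work-group size
h.parallel_for_work_group(range<1>(N / B), range<1>(B), [=](group<1> g) {
  // this body runs once per work-group
  g.parallel_for_work_item([&](h_item<1> item) {
    // this body runs once per work-item in the group
    auto i = item.get_global_id(0);
    A[i] *= 2;
  });
});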

Simplifications in DPC++

All of the above methods of expressing parallelism are defined in the SYCL specification. DPC++ simplifies the language by removing some of the verbosity of programming with these methods.

Some of these simplifications are making the kernel lambda name optional and simplifying one-dimensional range declarations, as shown below:

Kernel invocation in SYCL:
h.parallel_for<class kernelA>(range<1>(N), [=](id<1> i){ ... });
Simplified Kernel invocation in DPC++:
h.parallel_for(N, [=](auto i){ ... });

We have learned about using buffers and USM for memory management. DPC++ also simplifies kernel invocation when using USM: q.parallel_for can be called directly, removing the need to wrap the kernel invocation in q.submit.

Kernel invocation for buffers:
q.submit([&] (handler &h){
    auto A = buf.get_access<access::mode::read_write>(h);
    h.parallel_for(range<1>(N), [=](id<1> i){
        A[i] *= 2;
    });
});
Kernel invocation for USM:
q.parallel_for(range<1>(N), [=] (id<1> i){
    data[i] *= 2;
});
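For completeness, a self-contained sketch of the USM form might look like the following. It assumes a device that supports shared USM allocations, and the names q, data, and N are illustrative.

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  constexpr size_t N = 4;
  queue q;
  int *data = malloc_shared<int>(N, q);        // USM allocation visible to host and device
  for (size_t i = 0; i < N; i++) data[i] = i;  // initialize on the host

  // No q.submit wrapper and no accessors: the kernel reads and writes the USM pointer directly
  q.parallel_for(range<1>(N), [=](id<1> i) {
    data[i] *= 2;
  }).wait();  // wait for the kernel to complete before reading results

  for (size_t i = 0; i < N; i++) std::cout << data[i] << " ";  // prints: 0 2 4 6
  std::cout << "\n";
  free(data, q);
  return 0;
}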

Resources

Data Parallel C++ Reference

Data Parallel C++ Specification
