Hello, codeforces!
Do you have modern GPU and CPU and want to calculate something with them, but you don't know how to do it? Dont be upset, because you can read this blog and run your first GPU program on your AMD/Intel/NVIDIA GPU. I will describe installation process and then we will dive into the C++ code. As an example, we will calculate a lot of mazes for Bug Game in parallel. For doing this in simple way, we will use library Boost.Compute, which is a C++ STL-style wrapper over OpenCL library. ...
Installation for MacOS (Apple M1 Pro)
As win said, OpenCL
and g++
is pre-installed on MacOS
. Everything that you have to do is install Boost
with command brew install boost
. After it you will be able to compile your program with command g++ -std=c++17 -O3 -I/opt/homebrew/Cellar/boost/1.81.0_1/include -framework OpenCL -o main main.cpp
Installation for Windows
We need to install GCC, Boost and OpenCL. Let's do it with cygwin. Just download it by above link and run installation process. Select www.cygwin.mirror.constant.com
as a mirror!
Cygwin installation
Select GCC
On the step of choosing packages, select View: Full
, type Search: gcc
and select next packages:
- colorgcc
- gcc-core
- gcc-g++
- libgcc1
Select Boost
Then type Search: boost
and select all of the packages.
Select OpenCL
Then type Search: OpenCL
and select all of the packages: libOpenCL-devel
and libOpenCL1
.
Finish installation
After selecting all of the packages, click Next
twice and wait while installation process is not finished.
First example: print list of devices and default device
We are ready to write out first example. Lets print list of available devices. Here is the C++ code:
#include <iostream>
#include <iomanip>
#include <boost/compute/core.hpp>
namespace compute = boost::compute;
std::ostream &operator<<(std::ostream &os, const compute::device & device)
{
return os << device.name() << " (platform: " << device.platform().name() << ")";
}
int main()
{
// cycle over all of the devices:
std::cout << "List of available devices:\n";
for (int i = 0; auto device : compute::system::devices())
std::cout << std::setw(4) << i++ << ":\t" << device << std::endl;
// get the default device
compute::device device = compute::system::default_device();
std::cout << "\nDefault device:\n\t" << device << std::endl;
return 0;
}
This code will print list of available devices for using Boost.Compute
and default device. I saved this code in directory C:\Users\dkozyrev\Desktop\gpu-examples
as file printDevices.cpp
.
Open cygwin launcher (not installer!) and complete these steps for compilation and running:
cd "C:\Users\dkozyrev\Desktop\gpu-examples"
— going into our directory;g++ -Ofast -std=c++17 printDevices.cpp -o printDevices -lOpenCL
— compilation and linking;./printDevices
— running executable.
Please, pay attention, that on MacOS
the compilation line is different: g++ -std=c++17 -O3 -I/opt/homebrew/Cellar/boost/1.81.0_1/include -framework OpenCL -o main main.cpp
You will see the result and will not see your GPU in the list of available devices. Now we are going to install GPU drivers.
Install GPU Drivers
Intel GPU
Google intel opencl sdk
and you will see this website. Download and install Intel® SDK for OpenCL™ Applications
.
NVIDIA GPU
Google nvidia cuda toolkit
. You will see this website. Just download and install!
AMD GPU
Probably, you need to download and install AMD Accelerated Parallel Processing (APP) SDK 3.0
. I don't know exactly. Please, see the answer on stackoverflow here.
Copying GPU Drivers to cygwin
When installation will be finished, you will see the file C:\Windows\System32\OpenCL.dll
. Seems that that the path is same for any GPU (Intel, AMD, NVIDIA). Open cygwin launcher (not installer!) and complete next:
cp "/cygdrive/c/Windows/System32/OpenCl.dll" "/cygdrive/c/cygwin64/bin/"
cp "/cygdrive/c/Windows/System32/OpenCl.dll" "/cygdrive/c/cygwin64/bin/cygOpenCL-1.dll"
echo "cygOpenCL-1.dll" > "/etc/OpenCL/vendors/gpu.icd"
./printDevices
— rerun first example (printing all of the available devices)
You will see, that GPU is available now! I see my Intel(R) Iris(R) Xe Graphics
in the list of all devices.
OK, we are ready to calculate something!
Boost.Compute calculation scheme
GPU can process a lot of data simultaneously. This is something like processing $$$100.000$$$ independent testcases. There is no profit to run GPU for only one testcase. So, we will use next calculation scheme:
- Create vector of independent testcases on host (CPU)
- Copy this vector from host (from CPU) to device (to GPU)
- Process all of the testcases in parallel
- Copy vector of results from device (from GPU) to host (to CPU)
- Process vector of results on CPU
Lets start to solve Bug Game on GPU and we will see how it is implemented!
Solving Bug Game on GPU
Please, read problem statement firstly:
Now you know how bug moves and what is the maze. Suppose that you are building maze in your code. Some genetic or randomized algorithms, or brute force need to calculate the answer for whole population (whole set of mazes) independently. This step is costly.
We will create a vector of valid mazes (where the exit is reachable), send this vector to GPU, wait for finish of calculations, copy results and process them.
In this blog we are interested in looking on the performance, so, we will read maze with 1.488.650
steps from the input file, copy this maze 1.000.000
times and calculate this million of mazes on GPU.
CPU part of program
Reading of the maze from the input file:
std::ifstream fin(inputFileName);
if (!fin)
{
std::cout << "Can't open the file '" << inputFileName << "'" << std::endl;
return 0;
}
std::vector<std::vector<bool>> f(21, std::vector<bool>(31, 1));
for (int i = 0; i < 21; i++)
for (int j = 0; j < 31; j++)
{
char ch; fin >> ch;
f[i][j] = (ch == '#');
}
Then we need to create a struct for keeping number of visits for each cell. When bug's moving finished, the answer is sum of number of visits for each cell, so we need just to keep the array with number of visits.
Let's create the struct Field
: ~~~~~ struct Field { unsigned nVis[21 * 31]; }; ~~~~~
We are using 1D array instead of 2D array and data type unsigned
(unsigned integer 32-bit data type). Copying from vector of vectors to our new structure:
#define pos2int(r, c) ((r) * 31 + (c))
#define WALL (UINT_MAX)
#define EMPTY (0)
Field setField(const std::vector<std::vector<bool>> & f)
{
Field a;
for (int r = 0; r < 21; r++)
for (int c = 0; c < 31; c++)
a.nVis[pos2int(r,c)] = (f[r][c] ? WALL : EMPTY);
return a;
}
Also we need a function for accumulating the answer:
long long accumulateTheAnswer(const Field &f)
{
long long res = 0;
for (int i = 0; i < 21 * 31; i++)
if (f.nVis[i] != WALL)
res += f.nVis[i];
return res;
}
Now we need to write a function which will implement bug's moving on CPU:
__attribute__((noinline))
void bugMovingCPU(Field &a)
{
int pos = 1 * 31 + 1, dir = +31;
while (pos != 618)
{
++a.nVis[pos];
unsigned val = a.nVis[pos + dir];
{ unsigned next = a.nVis[pos + 31]; if (next < val) dir = +31, val = next; }
{ unsigned next = a.nVis[pos + 1]; if (next < val) dir = +1, val = next; }
{ unsigned next = a.nVis[pos - 31]; if (next < val) dir = -31, val = next; }
{ unsigned next = a.nVis[pos - 1]; if (next < val) dir = -1, val = next; }
pos += dir;
}
}
This algorithm starts moving from position $$$\left(1,1\right)=1\times 31 + 1$$$ and while the exit is not reached (pos is not equal to $$$\left(19,29\right)=19 \times 31 + 29 = 618$$$), we are calculating minimal value among adjacent cells in order DOWN-RIGHT-UP-LEFT and making step in minimal direction.
I added attribute noinline: this keyword will prohibit in-lining of this function and performance test will not be affected by GCC optimizations (otherwise GCC may decide to calculate the answer once and copy the answer n
times).
Yeah, we are starting with GPU part now.
GPU part of program
Firstly, GPU (OpenCL library) should be familiar with our struct Field
. It can be done with next macro:
BOOST_COMPUTE_ADAPT_STRUCT(Field, Field, (nVis))
You need to write the name of structure twice and then you need to list out all members of structure. In this example we have only one member nVis
.
We need to include a new header file #include <boost/compute/types/struct.hpp>
for working with structures.
Secondly, you need to write the function which will calculate the maze on GPU:
BOOST_COMPUTE_FUNCTION(Field, bugMovingGPU, (Field a),
{
int pos = 1 * 31 + 1, dir = +31;
while (pos != 618)
{
++a.nVis[pos];
unsigned val = a.nVis[pos + dir];
{ unsigned next = a.nVis[pos + 31]; if (next < val) dir = +31, val = next; }
{ unsigned next = a.nVis[pos + 1]; if (next < val) dir = +1, val = next; }
{ unsigned next = a.nVis[pos - 31]; if (next < val) dir = -31, val = next; }
{ unsigned next = a.nVis[pos - 1]; if (next < val) dir = -1, val = next; }
pos += dir;
}
return a;
});
We will use the same code, as we used above. Parameters of macro BOOST_COMPUTE_FUNCTION
: 1. Field
— return value; 2. bugMovingGPU
— name of a new function; 3. (Field a)
— the list of arguments of this function; 4. The body of this function.
After reading of field, in the function main
we need to detect GPU device, create a context for this device and create a command queue for sending jobs. We will use default device.
compute::device device = compute::system::default_device();
compute::context context(device);
compute::command_queue queue(context, device);
An alternative: you can iterate over all of the devices and pick one which you want to use or you can use all of the GPUs (if you have a lot of them), but, I think, that you will need to create different queues and contexts.
Lets create and fill the vector of mazes on CPU (the host's vector):
std::vector<Field> host_vector(nFields, setField(f));
Create vector of mazes on GPU (the device's vector) and copy from CPU to GPU:
// create a vector on the device
compute::vector<Field> device_vector(host_vector.size(), context);
// transfer data from the host to the device
compute::copy(host_vector.begin(), host_vector.end(), device_vector.begin(), queue);
We need to use data structures and algorithms from Boost.Compute
library. This is why we are creating compute::vector
(second argument is the device's context) and using compute::copy
algorithm (4-th argument is the created device's command queue). Don't forget #include <boost/compute/container/vector.hpp>
Boost.Compute
will perform copying of the data from CPU to GPU. If you need an asynchronous copying (you may want to do something useful on CPU while copying of data), you can use compute::copy_async
like in this example from Boost.Compute manual.
Now we are ready to calculate the answer on GPU for each field. We can use function transform
:
// calculate the answer for each field on GPU:
compute::transform(
device_vector.begin(), device_vector.end(), // input range
device_vector.begin(), // begin of output range
bugMovingGPU, queue // the lambda function which will be applied to each maze, and device's queue
);
This function is an equivalent to C++ STL std::transform
. See description on cppreference. You can use your lambda function as $$$4$$$-th parameter. This function will be applied to each element from the input range. The result will be written to output range. Don't forget #include <boost/compute/algorithm/transform.hpp>
.
Probably, you can use compute::for_each
too, but I was not able to do this, because GPU can't work with references to data (Field &f
) or I don't know how to do it...
And then, you can copy the results from device to host:
compute::copy(device_vector.begin(), device_vector.end(), host_vector.begin(), queue);
Then we will need to measure runtime. I'm using the library std::chrono
for it. You can see details in the source code.
Fixing the bug in Boost.Compute 1.66 header by yourself
When you will try to compile the program, you will get a compilation error: ~~~~~ In file included from /usr/include/boost/compute/detail/meta_kernel.hpp:37, from /usr/include/boost/compute/types/struct.hpp:26, from gpu-example.cpp:6: /usr/include/boost/compute/memory/svm_ptr.hpp: In member function ‘boost::compute::context& boost::compute::svm_ptr::get_context() const’: /usr/include/boost/compute/memory/svm_ptr.hpp:131:16: error: binding reference of type ‘boost::compute::context&’ to ‘const boost::compute::context’ discards qualifiers 131 | return m_context; ~~~~~
This is because an author of Boost.Compute
forgot about const
.
You need to fix this error by yourself or download a fixed file svm_ptr.hpp
and replace it. Please, download mine svm_ptr.hpp file.
Suppose, that you saved this file as C:\Users\dkozyrev\Desktop\gpu-examples\svm_ptr.hpp
. Open cygwin launched and complete the command cp "C:\Users\dkozyrev\Desktop\gpu-examples\svm_ptr.hpp" /usr/include/boost/compute/memory/svm_ptr.hpp
.
Compilation and running
Take this C++ file and save it as gpu-example.cpp
. Also save the 1.488.650 maze as 1.488.650.txt
.
Windows compilation
Open cygwin launcher and complete next:
cd "C:\Users\dkozyrev\Desktop\gpu-examples"
g++ -Ofast -std=c++17 -DBOOST_COMPUTE_DEBUG_KERNEL_COMPILATION gpu-example.cpp -o gpu-example -lOpenCL
The macro BOOST_COMPUTE_DEBUG_KERNEL_COMPILATION
is needed to print the compilation log from the GPU device.
MacOS Apple M1 Pro compilation:
cd "your directory"
g++ -std=c++17 -O3 -DBOOST_COMPUTE_DEBUG_KERNEL_COMPILATION -I/opt/homebrew/Cellar/boost/1.81.0_1/include -framework OpenCL -o gpu-example gpu-example.cpp
Running
Firstly, you can run $$$1$$$ field calculation with command ./gpu-example 1 "1.488.650.txt"
. If everything is OK, you can try to run from $$$1024$$$ to $$$10000-1000000$$$ fields.