
Commit 113771c

Improve documentation for 0.3.0 release
1 parent e444051 commit 113771c

7 files changed: +94 -75 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ default).
 Simply run `make install` (or equivalent, depending on build system) to copy
 all relevant header files and libraries to the `CMAKE_INSTALL_PREFIX`. This
 includes a CMake [package configuration file](https://cmake.org/cmake/help/latest/manual/cmake-packages.7.html#package-configuration-file)
-which is placed inside the `lib/cmake` directory. You can then use
+which is placed inside the `lib/cmake/Celerity` directory. You can then use
 `find_package(Celerity CONFIG)` to include Celerity into your CMake project.
 Once included, you can use the `add_celerity_to_target(TARGET target SOURCES source1 source2...)`
 function to set up the required dependencies for a target (no need to link manually).

docs/getting-started.md

Lines changed: 55 additions & 24 deletions
@@ -4,32 +4,63 @@ title: Getting Started
 sidebar_label: Getting Started
 ---
 
-Celerity allows you to write highly parallel applications that can be run on
-a cluster of accelerator nodes. It focuses on providing a way of scaling
-applications to many nodes without having to be an expert in cluster
-programming. In fact, the Celerity API does not make it apparent that a program is
-(potentially) running on many nodes at all: There is no notion of _ranks_;
-partitioning of work and data is taken care of transparently behind the scenes.
-This lets you focus on your actual work, without having to concern yourself with the
-complexities of modern distributed memory cluster programming.
-
-While ease of use is one of Celerity's main goals, simplicity can only go so
-far without sacrificing considerable performance. Proficiency in modern C++
-as well as at least a rough understanding of how accelerator (GPU) programming
-differs to parallel CPU programming is required to make efficient use of Celerity.
-Lastly, you will require a good understanding of the algorithms and techniques you
-intend to implement using Celerity in order for the runtime system to be able
-to run it on a cluster both correctly and in an efficient manner.
+Celerity is a high-level C++17 API and runtime environment that aims to bring
+the power and ease of use of [SYCL](https://www.khronos.org/sycl/) to
+distributed-memory accelerator clusters.
+
+> If you want to get your hands dirty right away, move on to the
+> [Installation](installation.md) guide.
+
+## Transparently Scale Your Applications
+
+Celerity allows you to write highly parallel applications that can be run on a
+cluster of accelerator nodes. It focuses on providing a way of scaling
+applications to many nodes **without having to be an expert in cluster
+programming**. In fact, the Celerity API does not make it apparent that a
+program is (potentially) running on many nodes at all: There is no notion of
+_ranks_; **partitioning of work and data is taken care of transparently behind the
+scenes**. This lets you focus on your actual work, without having to concern
+yourself with the complexities of modern distributed memory cluster programming.
+
+### A Word of Caution
+
+While ease of use is one of Celerity's main goals, simplicity can only go so far
+without sacrificing considerable performance. **Proficiency in modern C++ as
+well as at least a rough understanding of how accelerator (GPU) programming
+differs from parallel CPU programming is required** to make efficient use of
+Celerity. Additionally, you will require a good understanding of the algorithms
+and techniques you intend to implement using Celerity in order for the runtime
+system to be able to run it on a cluster both correctly and in an efficient
+manner.
+
+## Built on a Strong Foundation
 
 Celerity is built on top of [SYCL](https://www.khronos.org/sycl/), an
-open-standard high-level C++ embedded domain specific language for
-programming accelerators. SYCL provides a great API that hits a sweet spot
-between expressiveness and power as well as ease of use, making it the
-perfect starting point for Celerity: We set out to find the minimal set of
-extensions required to bring the SYCL API to distributed memory clusters -
-thus making it relatively easy to migrate an existing SYCL application to
-Celerity. If you don't have any experience with SYCL, don't worry, as we will
-introduce the most important concepts along the way.
+open-standard high-level C++ embedded domain specific language for programming
+accelerators. SYCL provides a great API that hits a sweet spot between
+expressiveness and power as well as ease of use, making it the perfect starting
+point for Celerity: We set out to find the **minimal set of API extensions**
+required to bring the SYCL API to distributed memory clusters - thus making it
+relatively **easy to migrate an existing SYCL application to Celerity**. If you
+don't have any experience with SYCL, don't worry, as we will introduce the most
+important concepts along the way.
+
+### Points of Divergence
+
+While it is one of Celerity's [Core Principles](core-principles.md) to stick as
+closely to SYCL as possible, **the Celerity API is neither a super- nor subset
+of the SYCL API**: On one hand certain SYCL features, such as SYCL 2020's
+unified shared memory (USM), are inherently unsuitable for distributed memory
+execution and can therefore not be supported by Celerity. On the other hand,
+certain high-performance computing (HPC) features such as [Collective Host
+Tasks](host-tasks.md#experimental-collective-host-tasks) are required to make
+Celerity work at scale.
+
+> Starting with Celerity 0.3.0, most supported SYCL features are made available
+> through the `celerity::` namespace, in addition to the `sycl::` namespace.
+> When in doubt, we recommend sticking to the former.
+
+## Further Reading
 
 If this piqued your interest and you would like to try it for yourself, check
 out the [Installation](installation.md) section on how to build and install
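
A minimal sketch of what such a Celerity program looks like, to make the "no ranks, transparent partitioning" point above concrete. It is assembled from the accessor and `parallel_for` syntax appearing in the tutorial diff further down; the `<celerity.h>` header name and the `celerity::read_write` tag are assumptions and may differ slightly from the released 0.3.0 API.

```cpp
#include <celerity.h>
#include <vector>

int main() {
    std::vector<float> host_data(1024, 1.f);

    // The distributed queue is the only entry point: user code never deals with
    // MPI ranks; Celerity decides how to split the work across nodes.
    celerity::distr_queue queue;
    celerity::buffer<float, 1> buf(host_data.data(), celerity::range<1>(1024));

    queue.submit([=](celerity::handler& cgh) {
        // The one_to_one range mapper declares which buffer elements each kernel
        // chunk touches, so the runtime can derive data transfers automatically.
        celerity::accessor acc{buf, cgh, celerity::access::one_to_one{}, celerity::read_write};
        cgh.parallel_for<class scale>(celerity::range<1>(1024), [=](celerity::item<1> item) {
            acc[item] *= 2.f;
        });
    });
}
```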

docs/host-tasks.md

Lines changed: 2 additions & 6 deletions
@@ -38,10 +38,6 @@ q.submit([=](celerity::handler &cgh) {
 });
 ```
 
-> **Compatibility note:** Master-node host tasks replace the old _master-access tasks_ from Celerity 0.1. In addition to
-> the different syntax, master-access tasks were executed on the main thread, not in a thread pool. When porting an
-> existing Celerity program, be aware of the changed lifetime and synchronization requirements.
-
 ## Distributed Host Tasks
 
 If a computation involving host code is to be distributed across a cluster, Celerity can split the iteration space
@@ -51,7 +47,7 @@ accordingly. Such a distributed host task is created by passing a global size to
 cgh.host_task(global_size, [](celerity::partition<Dims>) { ... });
 cgh.host_task(global_size, global_offset, [](celerity::partition<Dims>) { ... });
 ```
-
+
 Instead of the per-item kernel invocation of `handler::parallel_for` that is useful for accelerator
 computations, the host kernel will receive _partitions_ of the iteration space. They describe the iteration sub-space
 this node receives:
@@ -133,7 +129,7 @@ accesses, they can now be executed concurrently. For this purpose, each kernel r
 collective group. The prior example without explicit mentions of a collective group implicitly binds to
 `celerity::experimental::default_collective_group`.
 
-### Buffer Access form a Collective Host Task
+### Buffer Access from a Collective Host Task
 
 Collective host tasks are special in that they receive an implicit one-dimensional iteration space that just identifies
 the participating nodes. To access buffers in a meaningful way, these node indices must be translated to buffer regions.
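
To illustrate the distributed host task signature shown in the hunk above, here is a brief sketch that prints the iteration sub-space each node receives. It assumes a `queue` is in scope, `<cstdio>` is included, and that `celerity::partition` exposes its sub-space via `get_subrange()` with `offset` and `range` members; treat the exact member names as assumptions rather than a definitive reference.

```cpp
queue.submit([=](celerity::handler& cgh) {
    // Distributed host task: the 1D iteration space of size 256 is split across
    // the participating nodes, and each node receives its own partition.
    cgh.host_task(celerity::range<1>(256), [](celerity::partition<1> part) {
        const auto sr = part.get_subrange();  // which part of [0, 256) this node got
        std::printf("processing items [%zu, %zu)\n", sr.offset[0], sr.offset[0] + sr.range[0]);
    });
});
```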

docs/installation.md

Lines changed: 14 additions & 17 deletions
@@ -18,34 +18,35 @@ represents the de-facto standard in HPC nowadays.
 
 ## Picking a SYCL Implementation
 
-Celerity currently supports two different SYCL implementations. If you're
+Celerity currently supports three different SYCL implementations. If you're
 simply giving Celerity a try, the choice does not matter all that much. For
 more advanced use cases or specific hardware setups it might however make
 sense to prefer one over the other.
 
 ### hipSYCL
 
 [hipSYCL](https://github.com/illuhad/hipsycl) is an open source SYCL
-implementation based on AMD HIP. While not fully spec-conformant (especially
-regarding its OpenCL interoperability, which is fundamentally incompatible
-with its design), hipSYCL is a great choice when directly targeting Nvidia
-CUDA and AMD ROCm platforms.
+implementation focused on leveraging existing toolchains such as CUDA or HIP,
+making it a great choice when directly targeting Nvidia CUDA and AMD ROCm
+platforms.
 
-> hipSYCL is currently only available on Linux.
+> hipSYCL is currently available on Linux and has experimental/partial support
+> for OSX and Windows.
 
 ### ComputeCpp
 
-Codeplay's ComputeCpp is a fully SYCL 1.2.1 spec-conformant proprietary implementation. Binary
-distributions can be downloaded
-from [Codeplay's website](https://www.codeplay.com/products/computesuite/computecpp).
+ComputeCpp is a proprietary SYCL implementation by Codeplay. Binary
+distributions can be downloaded from [Codeplay's
+website](https://developer.codeplay.com/home/).
 
 > ComputeCpp is available for both Linux and Windows.
 
 ### DPC++
 
-Intel's LLVM fork [DPC++](https://github.com/intel/llvm) brings SYCL to the latest Intel CPU and GPU
-hardware and also, experimentally, to CUDA devices. Celerity will automatically detect
-when `CMAKE_CXX_COMPILER` points to a DPC++ Clang.
+Intel's LLVM fork [DPC++](https://github.com/intel/llvm) brings SYCL to the
+latest Intel CPU and GPU hardware and also, experimentally, to CUDA and HIP
+devices. Celerity will automatically detect when `CMAKE_CXX_COMPILER` points to
+a DPC++ Clang.
 
 To launch kernels on Intel GPUs, you will also need to install a recent version of the
 [Intel Compute Runtime](https://github.com/intel/compute-runtime/releases) (failing to do so will
@@ -69,7 +70,7 @@ platform. Here are a couple of examples:
 <!--hipSYCL + Ninja -->
 
 ```
-cmake -G Ninja .. -DCMAKE_PREFIX_PATH="<path-to-hipsycl-install>/lib/cmake" -DHIPSYCL_PLATFORM=cuda -DHIPSYCL_GPU_ARCH=sm_52 -DCMAKE_INSTALL_PREFIX="<install-path>" -DCMAKE_BUILD_TYPE=Release
+cmake -G Ninja .. -DCMAKE_PREFIX_PATH="<path-to-hipsycl-install>" -DHIPSYCL_TARGETS="cuda:sm_52" -DCMAKE_INSTALL_PREFIX="<install-path>" -DCMAKE_BUILD_TYPE=Release
 ```
 
 <!--ComputeCpp + Unix Makefiles-->
@@ -94,10 +95,6 @@ be required if you installed SYCL in a non-standard location. See the [CMake
 documentation](https://cmake.org/documentation/) as well as the documentation
 for your SYCL implementation for more information on the other parameters.
 
-> We currently recommend using the [Ninja build
-> system](https://ninja-build.org/) for building hipSYCL-based projects due
-> to some issues with dependency tracking that CMake has with Unix Makefiles.
-
 Celerity comes with several example applications that are built by default.
 If you don't want to build examples, provide `-DCELERITY_BUILD_EXAMPLES=0` as
 an additional parameter to your CMake configuration call.

docs/range-mappers.md

Lines changed: 0 additions & 6 deletions
@@ -46,12 +46,6 @@ queue.submit([=](celerity::handler& cgh) {
 });
 ```
 
-> **NOTE**: In Celerity 0.1, range mappers were only used for compute kernels.
-> For master-node tasks (then called master-access tasks), explicit buffer ranges
-> were passed to `buffer::get_access`. These APIs have been unified and range mappers
-> are now required in all cases. In master node tasks, the `all` and `fixed`
-> mappers provide equivalent functionality to explicit ranges.
-
 ### Getting an Intuition
 
 A useful way of thinking about kernel chunks is as a collection of individual
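
Since this file is all about range mappers, a short sketch may help reviewers: a range mapper is simply a callable that maps a chunk of the kernel's iteration space to the buffer subrange that chunk needs. The hand-written "transposed" mapper below is hypothetical and assumes `celerity::chunk` and `celerity::subrange` expose `offset` and `range` members, mirroring how the built-in mappers are described.

```cpp
// Custom range mapper: each kernel chunk reads the transposed region of a 2D buffer.
// Celerity invokes this for every chunk it creates when splitting the kernel and uses
// the returned subrange to derive the required data transfers.
struct transposed {
    celerity::subrange<2> operator()(celerity::chunk<2> chnk) const {
        return {
            {chnk.offset[1], chnk.offset[0]},  // swap row/column offset
            {chnk.range[1], chnk.range[0]}     // swap extents
        };
    }
};

// Hypothetical usage inside a command group:
// celerity::accessor r_a{buf_a, cgh, transposed{}, celerity::read_only};
```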

docs/reductions.md

Lines changed: 2 additions & 2 deletions
@@ -36,14 +36,14 @@ auto rd = celerity::reduction(buf, cgh, parity, 0u /* explicit identity */,
 
 ## Limitations
 
-### Only scalar reductions
+### Only Scalar Reductions
 
 Currently, the SYCL standard only mandates scalar reductions, i.e. reductions that produce a single scalar value.
 While that is useful for synchronization work like terminating a loop on a stopping criterion, it is not enough for
 other common operations like histogram construction. Since Celerity delegates to SYCL for intra-node reductions,
 higher-dimensional reduction outputs will only become available once SYCL supports them.
 
-### No broad support across SYCL implementations
+### No Broad Support Across SYCL Implementations
 
 Only hipSYCL provides a complete implementation of SYCL 2020 reduction variables at the moment, but
 requires [a patch](https://github.com/illuhad/hipSYCL/pull/578). Installing this version of hipSYCL will
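
For context on the `celerity::reduction(...)` call in the hunk header above, a hedged sketch of a complete scalar sum reduction follows. The argument order mirrors that snippet; `celerity::plus` and the `initialize_to_identity` property name are assumptions borrowed from SYCL 2020, not a confirmed Celerity API.

```cpp
celerity::buffer<int, 1> sum_buf(celerity::range<1>(1));

queue.submit([=](celerity::handler& cgh) {
    // Scalar reduction: every work item contributes to a single int result.
    auto sum_r = celerity::reduction(sum_buf, cgh, celerity::plus<int>(), 0 /* explicit identity */,
                                     celerity::property::reduction::initialize_to_identity{});
    cgh.parallel_for<class count_items>(celerity::range<1>(1024), sum_r,
                                        [=](celerity::item<1> item, auto& sum) {
        sum += 1;  // the reducer combines contributions within and across nodes
    });
});
```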

docs/tutorial.md

Lines changed: 20 additions & 19 deletions
@@ -23,7 +23,7 @@ is to set up a CMake project. For this, create a new folder for your project
 and in it create a file `CMakeLists.txt` with the following contents:
 
 ```cmake
-cmake_minimum_required(VERSION 3.5.1)
+cmake_minimum_required(VERSION 3.13)
 project(celerity_edge_detection)
 
 find_package(Celerity CONFIG REQUIRED)
@@ -152,7 +152,7 @@ out the kernel code. Replace the TODO with the following code:
 ```cpp
 int sum = r_input[{item[0] + 1, item[1]}] + r_input[{item[0] - 1, item[1]}]
     + r_input[{item[0], item[1] + 1}] + r_input[{item[0], item[1] - 1}];
-dw_edge[item] = 255 - std::max(0, sum - (4 * r_input[item]));
+w_edge[item] = 255 - std::max(0, sum - (4 * r_input[item]));
 ```
 
 This kernel computes a [discrete Laplace
@@ -171,29 +171,28 @@ before the kernel function with the following:
 
 ```cpp
 celerity::accessor r_input{input_buf, cgh, celerity::access::neighborhood{1, 1}, celerity::read_only};
-celerity::accessor dw_edge{edge_buf, cgh, celerity::access::one_to_one{}, celerity::write_only, celerity::no_init};
+celerity::accessor w_edge{edge_buf, cgh, celerity::access::one_to_one{}, celerity::write_only, celerity::no_init};
 ```
 
 If you have worked with SYCL before, these buffer accessors will look
-familiar to you. The template parameter is called the **access mode** and
-declares the type of access we inted to make on each buffer: We want to
-`read` from our `input_buf`, and want to write to our `edge_buf`. While
-there is a `write` access mode, we do not care at all about preserving any of
-the previous contents of `edge_buf`, which is why we choose to discard them
-and use the `discard_write` access mode.
+familiar to you. Accessors tie kernels to the data they operate on by declaring
+the type of access that we want to perform: We want to _read_ from our
+`input_buf`, and want to _write_ to our `edge_buf`. Additionally, we do not care
+at all about preserving any of the previous contents of `edge_buf`, which is why
+we choose to discard them by also passing the `celerity::no_init` property.
 
 So far everything works exactly as it would in a SYCL application. However,
-there is an additional parameter passed into the `accessor`
-constructor that is not present in its SYCL counterpart. In fact, this parameter
-represents one of Celerity's most important API additions: While access modes
-tell the runtime system how a kernel intends to access a buffer, it does not
-include any information about _where_ a kernel will access said buffer. In
-order for Celerity to be able to split a single kernel execution across
+there is an additional parameter passed into the `accessor` constructor that is
+not present in its SYCL counterpart. In fact, this parameter represents one of
+Celerity's most important API additions: While access modes (such as `read` and
+`write`) tell the runtime system how a kernel intends to access a buffer, they
+do not convey any information about _where_ a kernel will access said buffer.
+In order for Celerity to be able to split a single kernel execution across
 potentially many different worker nodes, it needs to know how each of those
 **kernel chunks** will interact with the input and output buffers of a kernel
 -- i.e., which node requires which parts of the input, and produces which
-parts of the output. This is where Celerity's so-called **range mappers**
-come into play.
+parts of the output. This is where Celerity's so-called **range mappers** come
+into play.
 
 Let us first discuss the range mapper for `edge_buf`, as it represents the
 simpler of the two cases. Looking at the kernel function, you can see that
@@ -221,8 +220,11 @@ surrounding the current work item.
 
 Lastly, there are two more things of note for the call to `parallel_for`: The
 first is the **kernel name**. Just like in SYCL, each kernel function in
-Celerity has to have a unique name in the form of a template type parameter.
+Celerity may have a unique name in the form of a template type parameter.
 Here we chose `MyEdgeDetectionKernel`, but this can be anything you like.
+
+> Kernel names used to be mandatory in SYCL 1.2.1 but have since become optional.
+
 Finally, the first two parameters to the `parallel_for` function tell
 Celerity how many individual GPU threads (or work items) we want to execute.
 In our case we want to execute one thread for each pixel of our image, except
@@ -249,7 +251,6 @@ Just like the _compute tasks_ we created above by calling
 handler by calling `celerity::handler::host_task`. Add the following code at the end of
 your `main()` function:
 
-
 ```cpp
 queue.submit([=](celerity::handler& cgh) {
   celerity::accessor out{edge_buf, cgh, celerity::access::all{}, celerity::read_only_host_task};
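
To show where this last hunk is headed, here is a hedged sketch of how the remainder of this host task might look: the result is read back through the `all` range mapper and written out on a single node. `celerity::on_master_node`, the raw-byte output via `<fstream>`, and the `image_width`/`image_height` variables are assumptions for illustration; the actual tutorial presumably writes a proper image file.

```cpp
queue.submit([=](celerity::handler& cgh) {
    celerity::accessor out{edge_buf, cgh, celerity::access::all{}, celerity::read_only_host_task};
    // Master-node host task: runs once on a single node after the kernel results
    // have been gathered; the `all` range mapper requests the entire buffer.
    cgh.host_task(celerity::on_master_node, [=]() {
        std::ofstream file("edges.raw", std::ios::binary);
        for (size_t y = 0; y < image_height; ++y) {
            for (size_t x = 0; x < image_width; ++x) {
                file.put(static_cast<char>(out[{y, x}]));  // dump grayscale values as raw bytes
            }
        }
    });
});
```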

0 commit comments
