
Commit 113771c

Improve documentation for 0.3.0 release
1 parent e444051 commit 113771c

7 files changed: +94 -75 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ default).
 Simply run `make install` (or equivalent, depending on build system) to copy
 all relevant header files and libraries to the `CMAKE_INSTALL_PREFIX`. This
 includes a CMake [package configuration file](https://cmake.org/cmake/help/latest/manual/cmake-packages.7.html#package-configuration-file)
-which is placed inside the `lib/cmake` directory. You can then use
+which is placed inside the `lib/cmake/Celerity` directory. You can then use
 `find_package(Celerity CONFIG)` to include Celerity into your CMake project.
 Once included, you can use the `add_celerity_to_target(TARGET target SOURCES source1 source2...)`
 function to set up the required dependencies for a target (no need to link manually).

docs/getting-started.md

Lines changed: 55 additions & 24 deletions
@@ -4,32 +4,63 @@ title: Getting Started
 sidebar_label: Getting Started
 ---
 
-Celerity allows you to write highly parallel applications that can be run on
-a cluster of accelerator nodes. It focuses on providing a way of scaling
-applications to many nodes without having to be an expert in cluster
-programming. In fact, the Celerity API does not make it apparent that a program is
-(potentially) running on many nodes at all: There is no notion of _ranks_;
-partitioning of work and data is taken care of transparently behind the scenes.
-This lets you focus on your actual work, without having to concern yourself with the
-complexities of modern distributed memory cluster programming.
-
-While ease of use is one of Celerity's main goals, simplicity can only go so
-far without sacrificing considerable performance. Proficiency in modern C++
-as well as at least a rough understanding of how accelerator (GPU) programming
-differs to parallel CPU programming is required to make efficient use of Celerity.
-Lastly, you will require a good understanding of the algorithms and techniques you
-intend to implement using Celerity in order for the runtime system to be able
-to run it on a cluster both correctly and in an efficient manner.
+Celerity is a high-level C++17 API and runtime environment that aims to bring
+the power and ease of use of [SYCL](https://www.khronos.org/sycl/) to
+distributed-memory accelerator clusters.
+
+> If you want to get your hands dirty right away, move on to the
+> [Installation](installation.md) guide.
+
+## Transparently Scale Your Applications
+
+Celerity allows you to write highly parallel applications that can be run on a
+cluster of accelerator nodes. It focuses on providing a way of scaling
+applications to many nodes **without having to be an expert in cluster
+programming**. In fact, the Celerity API does not make it apparent that a
+program is (potentially) running on many nodes at all: There is no notion of
+_ranks_; **partitioning of work and data is taken care of transparently behind the
+scenes**. This lets you focus on your actual work, without having to concern
+yourself with the complexities of modern distributed memory cluster programming.
+
+### A Word of Caution
+
+While ease of use is one of Celerity's main goals, simplicity can only go so far
+without sacrificing considerable performance. **Proficiency in modern C++ as
+well as at least a rough understanding of how accelerator (GPU) programming
+differs from parallel CPU programming is required** to make efficient use of
+Celerity. Additionally, you will require a good understanding of the algorithms
+and techniques you intend to implement using Celerity in order for the runtime
+system to be able to run it on a cluster both correctly and in an efficient
+manner.
+
+## Built on a Strong Foundation
 
 Celerity is built on top of [SYCL](https://www.khronos.org/sycl/), an
-open-standard high-level C++ embedded domain specific language for
-programming accelerators. SYCL provides a great API that hits a sweet spot
-between expressiveness and power as well as ease of use, making it the
-perfect starting point for Celerity: We set out to find the minimal set of
-extensions required to bring the SYCL API to distributed memory clusters -
-thus making it relatively easy to migrate an existing SYCL application to
-Celerity. If you don't have any experience with SYCL, don't worry, as we will
-introduce the most important concepts along the way.
+open-standard high-level C++ embedded domain specific language for programming
+accelerators. SYCL provides a great API that hits a sweet spot between
+expressiveness and power as well as ease of use, making it the perfect starting
+point for Celerity: We set out to find the **minimal set of API extensions**
+required to bring the SYCL API to distributed memory clusters - thus making it
+relatively **easy to migrate an existing SYCL application to Celerity**. If you
+don't have any experience with SYCL, don't worry, as we will introduce the most
+important concepts along the way.
+
+### Points of Divergence
+
+While it is one of Celerity's [Core Principles](core-principles.md) to stick as
+closely to SYCL as possible, **the Celerity API is neither a super- nor subset
+of the SYCL API**: On one hand certain SYCL features, such as SYCL 2020's
+unified shared memory (USM), are inherently unsuitable for distributed memory
+execution and can therefore not be supported by Celerity. On the other hand,
+certain high-performance computing (HPC) features such as [Collective Host
+Tasks](host-tasks.md#experimental-collective-host-tasks) are required to make
+Celerity work at scale.
+
+> Starting with Celerity 0.3.0, most supported SYCL features are made available
+> through the `celerity::` namespace, in addition to the `sycl::` namespace.
+> When in doubt, we recommend sticking to the former.
+
+## Further Reading
 
 If this piqued your interest and you would like to try it for yourself, check
 out the [Installation](installation.md) section on how to build and install
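
A minimal sketch of what such a Celerity program looks like, to make the "no ranks, transparent partitioning" point above concrete. It is assembled from the accessor and `parallel_for` syntax appearing in the tutorial diff further down; the `<celerity.h>` header name and the `celerity::read_write` tag are assumptions and may differ slightly from the released 0.3.0 API.

```cpp
#include <celerity.h>
#include <vector>

int main() {
    std::vector<float> host_data(1024, 1.f);

    // The distributed queue is the only entry point: user code never deals with
    // MPI ranks; Celerity decides how to split the work across nodes.
    celerity::distr_queue queue;
    celerity::buffer<float, 1> buf(host_data.data(), celerity::range<1>(1024));

    queue.submit([=](celerity::handler& cgh) {
        // The one_to_one range mapper declares which buffer elements each kernel
        // chunk touches, so the runtime can derive data transfers automatically.
        celerity::accessor acc{buf, cgh, celerity::access::one_to_one{}, celerity::read_write};
        cgh.parallel_for<class scale>(celerity::range<1>(1024), [=](celerity::item<1> item) {
            acc[item] *= 2.f;
        });
    });
}
```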

docs/host-tasks.md

Lines changed: 2 additions & 6 deletions
@@ -38,10 +38,6 @@ q.submit([=](celerity::handler &cgh) {
 });
 ```
 
-> **Compatibility note:** Master-node host tasks replace the old _master-access tasks_ from Celerity 0.1. In addition to
-> the different syntax, master-access tasks were executed on the main thread, not in a thread pool. When porting an
-> existing Celerity program, be aware of the changed lifetime and synchronization requirements.
-
 ## Distributed Host Tasks
 
 If a computation involving host code is to be distributed across a cluster, Celerity can split the iteration space
@@ -51,7 +47,7 @@ accordingly. Such a distributed host task is created by passing a global size to
 cgh.host_task(global_size, [](celerity::partition<Dims>) { ... });
 cgh.host_task(global_size, global_offset, [](celerity::partition<Dims>) { ... });
 ```
-
+
 Instead of the per-item kernel invocation of `handler::parallel_for` that is useful for accelerator
 computations, the host kernel will receive _partitions_ of the iteration space. They describe the iteration sub-space
 this node receives:
@@ -133,7 +129,7 @@ accesses, they can now be executed concurrently. For this purpose, each kernel r
 collective group. The prior example without explicit mentions of a collective group implicitly binds to
 `celerity::experimental::default_collective_group`.
 
-### Buffer Access form a Collective Host Task
+### Buffer Access from a Collective Host Task
 
 Collective host tasks are special in that they receive an implicit one-dimensional iteration space that just identifies
 the participating nodes. To access buffers in a meaningful way, these node indices must be translated to buffer regions.
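
To illustrate the distributed host task signature shown in the hunk above, here is a brief sketch that prints the iteration sub-space each node receives. It assumes a `queue` is in scope, `<cstdio>` is included, and that `celerity::partition` exposes its sub-space via `get_subrange()` with `offset` and `range` members; treat the exact member names as assumptions rather than a definitive reference.

```cpp
queue.submit([=](celerity::handler& cgh) {
    // Distributed host task: the 1D iteration space of size 256 is split across
    // the participating nodes, and each node receives its own partition.
    cgh.host_task(celerity::range<1>(256), [](celerity::partition<1> part) {
        const auto sr = part.get_subrange();  // which part of [0, 256) this node got
        std::printf("processing items [%zu, %zu)\n", sr.offset[0], sr.offset[0] + sr.range[0]);
    });
});
```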

docs/installation.md

Lines changed: 14 additions & 17 deletions
@@ -18,34 +18,35 @@ represents the de-facto standard in HPC nowadays.
 
 ## Picking a SYCL Implementation
 
-Celerity currently supports two different SYCL implementations. If you're
+Celerity currently supports three different SYCL implementations. If you're
 simply giving Celerity a try, the choice does not matter all that much. For
 more advanced use cases or specific hardware setups it might however make
 sense to prefer one over the other.
 
 ### hipSYCL
 
 [hipSYCL](https://github.com/illuhad/hipsycl) is an open source SYCL
-implementation based on AMD HIP. While not fully spec-conformant (especially
-regarding its OpenCL interoperability, which is fundamentally incompatible
-with its design), hipSYCL is a great choice when directly targeting Nvidia
-CUDA and AMD ROCm platforms.
+implementation focused on leveraging existing toolchains such as CUDA or HIP,
+making it a great choice when directly targeting Nvidia CUDA and AMD ROCm
+platforms.
 
-> hipSYCL is currently only available on Linux.
+> hipSYCL is currently available on Linux and has experimental/partial support
+> for OSX and Windows.
 
 ### ComputeCpp
 
-Codeplay's ComputeCpp is a fully SYCL 1.2.1 spec-conformant proprietary implementation. Binary
-distributions can be downloaded
-from [Codeplay's website](https://www.codeplay.com/products/computesuite/computecpp).
+ComputeCpp is a proprietary SYCL implementation by Codeplay. Binary
+distributions can be downloaded from [Codeplay's
+website](https://developer.codeplay.com/home/).
 
 > ComputeCpp is available for both Linux and Windows.
 
 ### DPC++
 
-Intel's LLVM fork [DPC++](https://github.com/intel/llvm) brings SYCL to the latest Intel CPU and GPU
-hardware and also, experimentally, to CUDA devices. Celerity will automatically detect
-when `CMAKE_CXX_COMPILER` points to a DPC++ Clang.
+Intel's LLVM fork [DPC++](https://github.com/intel/llvm) brings SYCL to the
+latest Intel CPU and GPU hardware and also, experimentally, to CUDA and HIP
+devices. Celerity will automatically detect when `CMAKE_CXX_COMPILER` points to
+a DPC++ Clang.
 
 To launch kernels on Intel GPUs, you will also need to install a recent version of the
 [Intel Compute Runtime](https://github.com/intel/compute-runtime/releases) (failing to do so will
@@ -69,7 +70,7 @@ platform. Here are a couple of examples:
 <!--hipSYCL + Ninja -->
 
 ```
-cmake -G Ninja .. -DCMAKE_PREFIX_PATH="<path-to-hipsycl-install>/lib/cmake" -DHIPSYCL_PLATFORM=cuda -DHIPSYCL_GPU_ARCH=sm_52 -DCMAKE_INSTALL_PREFIX="<install-path>" -DCMAKE_BUILD_TYPE=Release
+cmake -G Ninja .. -DCMAKE_PREFIX_PATH="<path-to-hipsycl-install>" -DHIPSYCL_TARGETS="cuda:sm_52" -DCMAKE_INSTALL_PREFIX="<install-path>" -DCMAKE_BUILD_TYPE=Release
 ```
 
 <!--ComputeCpp + Unix Makefiles-->
@@ -94,10 +95,6 @@ be required if you installed SYCL in a non-standard location. See the [CMake
 documentation](https://cmake.org/documentation/) as well as the documentation
 for your SYCL implementation for more information on the other parameters.
 
-> We currently recommend using the [Ninja build
-> system](https://ninja-build.org/) for building hipSYCL-based projects due
-> to some issues with dependency tracking that CMake has with Unix Makefiles.
-
 Celerity comes with several example applications that are built by default.
 If you don't want to build examples, provide `-DCELERITY_BUILD_EXAMPLES=0` as
 an additional parameter to your CMake configuration call.

docs/range-mappers.md

Lines changed: 0 additions & 6 deletions
@@ -46,12 +46,6 @@ queue.submit([=](celerity::handler& cgh) {
 });
 ```
 
-> **NOTE**: In Celerity 0.1, range mappers were only used for compute kernels.
-> For master-node tasks (then called master-access tasks), explicit buffer ranges
-> were passed to `buffer::get_access`. These APIs have been unified and range mappers
-> are now required in all cases. In master node tasks, the `all` and `fixed`
-> mappers provide equivalent functionality to explicit ranges.
-
 ### Getting an Intuition
 
 A useful way of thinking about kernel chunks is as a collection of individual
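
Since this file is all about range mappers, a short sketch may help reviewers: a range mapper is simply a callable that maps a chunk of the kernel's iteration space to the buffer subrange that chunk needs. The hand-written "transposed" mapper below is hypothetical and assumes `celerity::chunk` and `celerity::subrange` expose `offset` and `range` members, mirroring how the built-in mappers are described.

```cpp
// Custom range mapper: each kernel chunk reads the transposed region of a 2D buffer.
// Celerity invokes this for every chunk it creates when splitting the kernel and uses
// the returned subrange to derive the required data transfers.
struct transposed {
    celerity::subrange<2> operator()(celerity::chunk<2> chnk) const {
        return {
            {chnk.offset[1], chnk.offset[0]},  // swap row/column offset
            {chnk.range[1], chnk.range[0]}     // swap extents
        };
    }
};

// Hypothetical usage inside a command group:
// celerity::accessor r_a{buf_a, cgh, transposed{}, celerity::read_only};
```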

docs/reductions.md

Lines changed: 2 additions & 2 deletions
@@ -36,14 +36,14 @@ auto rd = celerity::reduction(buf, cgh, parity, 0u /* explicit identity */,
 
 ## Limitations
 
-### Only scalar reductions
+### Only Scalar Reductions
 
 Currently, the SYCL standard only mandates scalar reductions, i.e. reductions that produce a single scalar value.
 While that is useful for synchronization work like terminating a loop on a stopping criterion, it is not enough for
 other common operations like histogram construction. Since Celerity delegates to SYCL for intra-node reductions,
 higher-dimensional reduction outputs will only become available once SYCL supports them.
 
-### No broad support across SYCL implementations
+### No Broad Support Across SYCL Implementations
 
 Only hipSYCL provides a complete implementation of SYCL 2020 reduction variables at the moment, but
 requires [a patch](https://github.com/illuhad/hipSYCL/pull/578). Installing this version of hipSYCL will
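
For context on the `celerity::reduction(...)` call in the hunk header above, a hedged sketch of a complete scalar sum reduction follows. The argument order mirrors that snippet; `celerity::plus` and the `initialize_to_identity` property name are assumptions borrowed from SYCL 2020, not a confirmed Celerity API.

```cpp
celerity::buffer<int, 1> sum_buf(celerity::range<1>(1));

queue.submit([=](celerity::handler& cgh) {
    // Scalar reduction: every work item contributes to a single int result.
    auto sum_r = celerity::reduction(sum_buf, cgh, celerity::plus<int>(), 0 /* explicit identity */,
                                     celerity::property::reduction::initialize_to_identity{});
    cgh.parallel_for<class count_items>(celerity::range<1>(1024), sum_r,
                                        [=](celerity::item<1> item, auto& sum) {
        sum += 1;  // the reducer combines contributions within and across nodes
    });
});
```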

docs/tutorial.md

Lines changed: 20 additions & 19 deletions
@@ -23,7 +23,7 @@ is to set up a CMake project. For this, create a new folder for your project
 and in it create a file `CMakeLists.txt` with the following contents:
 
 ```cmake
-cmake_minimum_required(VERSION 3.5.1)
+cmake_minimum_required(VERSION 3.13)
 project(celerity_edge_detection)
 
 find_package(Celerity CONFIG REQUIRED)
@@ -152,7 +152,7 @@ out the kernel code. Replace the TODO with the following code:
 ```cpp
 int sum = r_input[{item[0] + 1, item[1]}] + r_input[{item[0] - 1, item[1]}]
     + r_input[{item[0], item[1] + 1}] + r_input[{item[0], item[1] - 1}];
-dw_edge[item] = 255 - std::max(0, sum - (4 * r_input[item]));
+w_edge[item] = 255 - std::max(0, sum - (4 * r_input[item]));
 ```
 
 This kernel computes a [discrete Laplace
@@ -171,29 +171,28 @@ before the kernel function with the following:
 
 ```cpp
 celerity::accessor r_input{input_buf, cgh, celerity::access::neighborhood{1, 1}, celerity::read_only};
-celerity::accessor dw_edge{edge_buf, cgh, celerity::access::one_to_one{}, celerity::write_only, celerity::no_init};
+celerity::accessor w_edge{edge_buf, cgh, celerity::access::one_to_one{}, celerity::write_only, celerity::no_init};
 ```
 
 If you have worked with SYCL before, these buffer accessors will look
-familiar to you. The template parameter is called the **access mode** and
-declares the type of access we inted to make on each buffer: We want to
-`read` from our `input_buf`, and want to write to our `edge_buf`. While
-there is a `write` access mode, we do not care at all about preserving any of
-the previous contents of `edge_buf`, which is why we choose to discard them
-and use the `discard_write` access mode.
+familiar to you. Accessors tie kernels to the data they operate on by declaring
+the type of access that we want to perform: We want to _read_ from our
+`input_buf`, and want to _write_ to our `edge_buf`. Additionally, we do not care
+at all about preserving any of the previous contents of `edge_buf`, which is why
+we choose to discard them by also passing the `celerity::no_init` property.
 
 So far everything works exactly as it would in a SYCL application. However,
-there is an additional parameter passed into the `accessor`
-constructor that is not present in its SYCL counterpart. In fact, this parameter
-represents one of Celerity's most important API additions: While access modes
-tell the runtime system how a kernel intends to access a buffer, it does not
-include any information about _where_ a kernel will access said buffer. In
-order for Celerity to be able to split a single kernel execution across
+there is an additional parameter passed into the `accessor` constructor that is
+not present in its SYCL counterpart. In fact, this parameter represents one of
+Celerity's most important API additions: While access modes (such as `read` and
+`write`) tell the runtime system how a kernel intends to access a buffer, they
+do not convey any information about _where_ a kernel will access said buffer.
+In order for Celerity to be able to split a single kernel execution across
 potentially many different worker nodes, it needs to know how each of those
 **kernel chunks** will interact with the input and output buffers of a kernel
 -- i.e., which node requires which parts of the input, and produces which
-parts of the output. This is where Celerity's so-called **range mappers**
-come into play.
+parts of the output. This is where Celerity's so-called **range mappers** come
+into play.
 
 Let us first discuss the range mapper for `edge_buf`, as it represents the
 simpler of the two cases. Looking at the kernel function, you can see that
@@ -221,8 +220,11 @@ surrounding the current work item.
 
 Lastly, there are two more things of note for the call to `parallel_for`: The
 first is the **kernel name**. Just like in SYCL, each kernel function in
-Celerity has to have a unique name in the form of a template type parameter.
+Celerity may have a unique name in the form of a template type parameter.
 Here we chose `MyEdgeDetectionKernel`, but this can be anything you like.
+
+> Kernel names used to be mandatory in SYCL 1.2.1 but have since become optional.
+
 Finally, the first two parameters to the `parallel_for` function tell
 Celerity how many individual GPU threads (or work items) we want to execute.
 In our case we want to execute one thread for each pixel of our image, except
@@ -249,7 +251,6 @@ Just like the _compute tasks_ we created above by calling
 handler by calling `celerity::handler::host_task`. Add the following code at the end of
 your `main()` function:
 
-
 ```cpp
 queue.submit([=](celerity::handler& cgh) {
   celerity::accessor out{edge_buf, cgh, celerity::access::all{}, celerity::read_only_host_task};
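
To show where this last hunk is headed, here is a hedged sketch of how the remainder of this host task might look: the result is read back through the `all` range mapper and written out on a single node. `celerity::on_master_node`, the raw-byte output via `<fstream>`, and the `image_width`/`image_height` variables are assumptions for illustration; the actual tutorial presumably writes a proper image file.

```cpp
queue.submit([=](celerity::handler& cgh) {
    celerity::accessor out{edge_buf, cgh, celerity::access::all{}, celerity::read_only_host_task};
    // Master-node host task: runs once on a single node after the kernel results
    // have been gathered; the `all` range mapper requests the entire buffer.
    cgh.host_task(celerity::on_master_node, [=]() {
        std::ofstream file("edges.raw", std::ios::binary);
        for (size_t y = 0; y < image_height; ++y) {
            for (size_t x = 0; x < image_width; ++x) {
                file.put(static_cast<char>(out[{y, x}]));  // dump grayscale values as raw bytes
            }
        }
    });
});
```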

0 commit comments
