v0.4.0 - Delightful Daikon

@psalz released this 13 Jul 17:46
· 281 commits to master since this release

We are back with a major release that touches all aspects of Celerity, bringing considerable improvements to its APIs, usability and performance.

Thanks to everybody who contributed to this release: @almightyvats @BlackMark29A @facuMH @fknorr @PeterTh @psalz!

HIGHLIGHTS

  • Celerity 0.4.0 uses a fully distributed scheduling model, replacing the old master-worker approach. This reduces the scheduling complexity of applications with all-to-all communication from O(N^2) to O(N), solving a central scaling bottleneck for many Celerity applications (#186).
  • Objects shared between multiple host tasks, such as file handles for I/O operations, can now be explicitly managed by the runtime through a new experimental declarative API: A host_object encapsulates arbitrary host-side objects, while side_effects are used to read and/or mutate them, analogously to buffer and accessor. Embracing this new pattern will guarantee correct lifetimes and synchronization around these objects (#68).
  • The new experimental fence API allows accessing buffer and host-object data from the main thread without manual synchronization and reimagines SYCL's host accessors in a way that is more compatible with Celerity's asynchronous execution model (#151).
  • The new CMake option CELERITY_ACCESSOR_BOUNDARY_CHECK enables runtime detection of out-of-bounds buffer accesses inside device kernels, catching errors such as incorrectly specified range mappers at the cost of some runtime overhead. This check is enabled by default for debug builds of Celerity (#178).
  • Celerity now expects buffers (and the new host-objects) to be captured by reference into command group functions, where it previously required by-value captures. This is in accordance with SYCL 2020 and removes one common source of user errors (#173).
  • Last but not least, several significant performance improvements make Celerity even more competitive for real-world HPC applications (#100, #111, #112, #115, #133, #137, #138, #145, #184).
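
To make the CELERITY_ACCESSOR_BOUNDARY_CHECK highlight above more concrete, here is a minimal, self-contained C++ sketch of the kind of error such a check catches. It does not use Celerity and is not how the runtime implements the check; `checked_view` and `run_buggy_kernel` are hypothetical names introduced only for illustration. The view may touch just the sub-range that was declared for it (as a range mapper would), and it records the first index that falls outside that range instead of writing to it.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for an accessor that may only touch [offset, offset + count).
struct checked_view {
    std::vector<int>& data;
    std::size_t offset;
    std::size_t count;
    std::optional<std::size_t> first_violation; // first out-of-bounds index seen, if any
    int sink = 0;                               // receives rejected writes

    int& operator[](std::size_t idx) {
        if (idx < offset || idx >= offset + count) {
            if (!first_violation) first_violation = idx;
            return sink; // redirect the access instead of touching foreign data
        }
        return data[idx];
    }
};

// A "kernel" with an off-by-one bug: it declares [offset, offset + count)
// but writes to idx + 1, so its last iteration leaves the declared range.
inline std::optional<std::size_t> run_buggy_kernel(std::vector<int>& buf,
                                                   std::size_t offset,
                                                   std::size_t count) {
    checked_view view{buf, offset, count, std::nullopt};
    for (std::size_t i = offset; i < offset + count; ++i) {
        view[i + 1] = 1;
    }
    return view.first_violation;
}
```

Calling `run_buggy_kernel(buf, 2, 4)` on an 8-element buffer reports index 6 as the first violation and leaves `buf[6]` untouched, which is the sort of diagnostic that makes a mis-specified range easy to locate.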

Changelog

We recommend using the following SYCL versions with this release:

  • DPC++: 61e51015 or newer
  • hipSYCL: 24980221 or newer

See our platform support guide for a complete list of all officially supported configurations.

Added

  • Introduce new experimental host_object and side_effect APIs to express non-buffer dependencies between host tasks (#68, 7a5326a)
  • Add new CELERITY_GRAPH_PRINT_MAX_VERTS config option (#80, d3dd722)
  • Name threads for better debugging (#98, 25d769d, #131, ff5fbec)
  • Add support for passing device selectors to distr_queue constructor (#113, 556b6f2)
  • Add new CELERITY_DRY_RUN_NODES environment variable to simulate the scheduling of an application on a large number of nodes (without execution or data transfers) (#125, 299ebbf)
  • Add ability to name buffers for debugging (#132, 1076522)
  • Introduce experimental fence API for accessing buffer and host-object data from the main thread (#151, 6b803f8)
  • Introduce backend system for vendor-specific code paths (#162, 750f32a)
  • Add CELERITY_USE_MIMALLOC CMake configuration option to use the mimalloc allocator (enabled by default) (#170, 234e3d2)
  • Support 0-dimensional buffers, accessors and kernels (#163, 0685d94)
  • Introduce new diagnostics utility for detecting erroneous reference captures into kernel functions, as well as unused accessors (#173, ff7ed02)
  • Introduce CELERITY_ACCESSOR_BOUNDARY_CHECK CMake option to detect out-of-bounds buffer accesses inside device kernels (enabled by default for debug builds) (#178, 2c738c8)
  • Print more helpful error message when buffer allocations exceed available device memory (#179, 79f97c2)

Changed

  • Update spdlog to 1.9.2 (#80, a178828)
  • Overhaul logging mechanism (#80, 1b19bfc)
  • Improve graph dependency tracking performance (#100, c9dab18)
  • Improve task lookup performance (#112, 5139256)
  • Introduce epochs as a mechanism for in-graph synchronization (#86, 61dd07e)
  • Miscellaneous performance improvements (#115, 9a099d2, #137, b0254fd, #138, 02258c0, #145, f0b53ce)
  • Improve scheduler performance by reducing lock contention (#111, 4547b5f)
  • Improve graph generation and printing performance (#133, 8122798)
  • Use libenvpp to validate all CELERITY_* environment variables (#158, b2ced9b)
  • Use native ("USM") pointers instead of SYCL buffers for backing buffer allocations (#162, 44497b3)
  • Implement range and id types instead of aliasing SYCL types (#163, 0685d94)
  • Disallow in-source builds (#176, 0a96d15)
  • Lift restrictions on reductions for DPC++ (#175, efff21b)
  • Remove multi-pass mechanism to allow reference capture of buffers and host-objects into command group functions, in alignment with the SYCL 2020 API (#173, 0a743c7)
  • Drastically improve performance of buffer data location tracking (#184, adff79e)
  • Switch to distributed scheduling model (#186, 0970bff)

Deprecated

  • Passing sycl::device to distr_queue constructor (use a device selector instead) (#113, 556b6f2)
  • Capturing buffers and host objects by value into command group functions (capture by reference instead) (#173, 0a743c7)
  • allow_by_ref is no longer required to capture references into command group functions (#173, 0a743c7)

Removed

  • Removed support for ComputeCpp (discontinued) (#167, 68367dd)
  • Removed deprecated host_memory_layout (use buffer_allocation_window instead) (#187, f5e6510)
  • Removed deprecated kernel dimension template parameter on one_to_one, fixed and all range mappers (#187, 40a12a4)
  • Kernels can no longer receive sycl::item (use celerity::item instead); this was already broken in 0.3.2 (#163, 67ccacc)

Fixed

  • Improve performance for buffer transfers on IBM Spectrum MPI (#114, c60527f)
  • Increase size limit on individual buffer transfer operations from 2 GiB to 128 GiB (#153, 972682f)
  • Fix race between creating collective groups and submitting host tasks (#152, 0a4fca5)
  • Align read-accessor operator[] with SYCL 2020 spec by returning const-reference instead of value (#156, 5011ded)
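
The 2 GiB and 128 GiB figures in the transfer size entry above can be related with a little arithmetic. Classic MPI point-to-point calls take the element count as an `int`, so a transfer counted in single bytes tops out just under 2 GiB. The release notes do not say how the limit was raised; the 64-byte unit in the sketch below is purely an assumption, chosen because it reproduces the 128 GiB figure, and `max_transfer_bytes` is a hypothetical helper.

```cpp
#include <cstdint>
#include <limits>

// Largest byte count expressible when the element count is a signed 32-bit
// integer and each element is `unit_bytes` wide.
constexpr std::uint64_t max_transfer_bytes(std::uint64_t unit_bytes) {
    return static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max()) * unit_bytes;
}

constexpr std::uint64_t GiB = std::uint64_t{1} << 30;

// Counting in single bytes: just under 2 GiB.
static_assert(max_transfer_bytes(1) < 2 * GiB, "byte-sized units stay below 2 GiB");
// Counting in 64-byte units (assumed, see above): just under 128 GiB.
static_assert(max_transfer_bytes(64) < 128 * GiB, "64-byte units stay below 128 GiB");
static_assert(max_transfer_bytes(64) > 127 * GiB, "64-byte units exceed 127 GiB");
```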

Internal