From 5f8c6fb9295be77c57e995a07a5a419e1e3200fe Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Fri, 22 Aug 2025 10:12:37 +0200 Subject: [PATCH 01/18] refactor terminology and structure in documentation --- .../clause_4_terms_and_definitions.adoc | 11 ++-- .../sections/clause_7_unified_data_model.adoc | 58 +++++++++---------- .../sections/clause_9_zarr_encoding_core.adoc | 7 +-- .../clause_9_zarr_encoding_overviews.adoc | 17 +++--- 4 files changed, 45 insertions(+), 48 deletions(-) diff --git a/standard/template/sections/clause_4_terms_and_definitions.adoc b/standard/template/sections/clause_4_terms_and_definitions.adoc index 007f320..fa81a97 100644 --- a/standard/template/sections/clause_4_terms_and_definitions.adoc +++ b/standard/template/sections/clause_4_terms_and_definitions.adoc @@ -2,6 +2,9 @@ === Terms and definitions +The GeoZarr specification inherits https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[concepts and terminology from the Zarr core specification]. +The following terms add GeoZarr-specific meaning to the existing Zarr terminology. + ==== array A multidimensional, regularly spaced collection of values (e.g., raster data or gridded measurements), typically indexed by dimensions such as time, latitude, longitude, or spectral band. @@ -22,17 +25,17 @@ An array containing the primary geospatial or scientific measurements of interes An index axis along which arrays are organised. Dimensions provide a naming and ordering scheme for accessing data in multidimensional arrays (e.g., `time`, `x`, `y`, `band`). -==== group +==== dataset -A container for datasets, variables, dimensions, and metadata in Zarr. Groups may be nested to represent a logical hierarchy (e.g., for resolutions or collections). +A group that contains one or more data variables along with their associated coordinate variables, having a consistent relationship between these components. 
A dataset represents a coherent set of related data arrays and follows the unified data model. ==== metadata Structured information describing the content, context, and semantics of datasets, variables, and attributes. GeoZarr metadata includes CF attributes, geotransform definitions, and links to STAC metadata where applicable. -==== multiscale dataset +==== multiscale group -A dataset that includes multiple representations of the same data variable at varying spatial resolutions. Each resolution level is associated with a tile matrix from an OGC Tile Matrix Set. +A group that contains 2 or more child groups representing the same data at different resolutions, where each child group is a <>. The multiscale group includes metadata describing the relationship between resolution levels. ==== tile matrix set diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 8af7598..64c073a 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -87,11 +87,11 @@ To enable discovery of resources within the hierarchical structure of the data m A STAC extension consists of embedding or referencing STAC Collection and Item metadata within the data model: -* Each dataset resource MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object. +* Each store resource MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object. * STAC properties such as `datetime`, `bbox`, and `eo:bands` MAY be included in the metadata to enable spatial, temporal, and spectral filtering. * The structure is compatible with external STAC APIs and metadata harvesting systems. -STAC integration is non-intrusive and modular. 
It does not impose changes on the internal organisation of datasets and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities. +STAC integration is non-intrusive and modular. It does not impose changes on the internal organisation of the store and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities. ==== Modularity and Interoperability @@ -101,22 +101,22 @@ Each extension point is specified independently. Implementations may advertise s === Unified Model Structure -This clause defines the structural organisation of datasets conforming to the unified data model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. +This clause defines the structural organisation of stores conforming to the unified data model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. The model represents datasets as abstract compositions of dimensions, coordinate variables, data variables, and associated metadata. This abstraction ensures that applications and services can reason about the content and semantics of a dataset without reliance on storage layout or specific serialisation. -==== Dataset Structure +==== Store Structure -A dataset conforming to the Unified Data Model (UDM) is structured as a hierarchy rooted at a top-level dataset entity. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections. 
+A store conforming to the Unified Data Model (UDM) is structured as a hierarchy rooted at a top-level group. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections. -Each dataset node comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions: +Each <> comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions: - **Dimensions** – Named, integer-valued axes defining the extent of data variables. Examples include `time`, `x`, `y`, and `band`. - **Coordinate Variables** – Arrays that supply coordinate values along dimensions, providing spatial, temporal, or contextual referencing. These may be scalar or higher-dimensional, depending on the referencing scheme. - **Data Variables** – Multidimensional arrays representing physical measurements or derived products. Defined over one or more dimensions, these variables are associated with coordinate variables and annotated with metadata. - **Attributes** – Key-value pairs attached to variables or dataset components. Attributes convey semantic information such as units, standard names, and geospatial metadata. -The hierarchy is implemented through **groups**, which function as containers for variables, dimensions, and metadata. Groups may define local context while inheriting attributes from parent nodes. This supports the logical subdivision of datasets by theme, resolution, or processing stage, and enhances the clarity and reusability of complex geospatial structures. +A Zarr hierarchy is a tree structure, where each node in the tree is either a group or an array. Group nodes may have children but array nodes may not. This supports the logical subdivision by theme, resolution, or processing stage, and enhances the clarity and reusability of complex geospatial structures. 
 The diagram below represents the structural layer of the unified data model, derived from the Unidata Common Data Model, which serves as the foundational framework for supporting all overlaying model layer. @@ -129,7 +129,7 @@ The diagram below represents the structural layer of the unified data model, der .... @startuml CDM_DAL_Object_Model -class Dataset { +class Store { + String location + open() + close() @@ -137,10 +137,9 @@ class Dataset { class Group { + String name - + List subgroups - + List variables - + List dimensions - + List attributes +} + +class Dataset { } class Dimension { @@ -152,9 +151,6 @@ class Dimension { class Variable { + String name - + DataType dataType - + List shape - + List attributes + read() } @@ -169,19 +165,20 @@ class Attribute { + List values } -Dataset --> Group : rootGroup -Group --> Group : contains > -Group --> Variable : contains > -Group --> Dimension : defines > -Group --> Attribute : has > -Variable --> Dimension : uses > -Variable --> DataType : has > -Variable --> Attribute : has > +Store "1" --> "*" Group : rootGroup +Group "1" --> "*" Group : contains +Dataset -up-|> Group +Dataset --> "*" Variable : contains +Dataset --> "*" Dimension : defines +Group --> "*" Attribute : has +Variable --> "*" Dimension : uses +Variable --> "1" DataType : has +Variable --> "*" Attribute : has @enduml .... //endif::never-shown[] -Note that, conceptually, node within this hierarchy might be treated as a self-contained dataset. +Note that, conceptually, any node within this hierarchy might be treated as a self-contained store. ==== Coordinate Referencing @@ -196,7 +193,7 @@ The model accommodates both standard CF-compatible definitions and extended refe Metadata may be declared at various levels within the model structure: -- **Global Metadata** – Attributes describing the dataset as a whole, including elements such as `title`, `summary`, and `license`. 
+- **Global Metadata** – Attributes describing the store as a whole, including elements such as `title`, `summary`, and `license`. - **Variable Metadata** – Attributes associated with individual data or coordinate variables, conveying descriptive or semantic information. - **Extension Metadata** – Structured metadata linked to optional model extensions (e.g., multiscale tiling, catalogue references, geotransform properties). @@ -218,15 +215,15 @@ Overviews enable: ===== Conceptual Structure -An *Overviews* construct is defined as a *hierarchical set of multiscale representations* of one or more data variables. It comprises the following components: +A <> contains child groups representing the data at different resolutions, where each child group is a <> following the unified data model. It comprises the following components: [horizontal] -*Base Variable*:: The original, highest-resolution variable to which the overview hierarchy is anchored. It is defined using the standard `DataVariable` structure in the model. -*Overview Levels*:: A sequence of variables representing the same logical quantity as the base variable, but sampled at coarser spatial resolutions. +*Base Dataset*:: The original, highest-resolution dataset to which the multiscale hierarchy is anchored. +*Zoom Level Datasets*:: A sequence of datasets representing the same data as the base dataset, but sampled at coarser spatial resolutions. *Zoom Level Identifier*:: A unique identifier associated with each level, ordered from finest (e.g. `"0"`) to coarsest resolution (e.g. `"N"`). *Tile Grid Definition*:: A mapping that associates each zoom level with a spatial tiling layout, defined in alignment with a `TileMatrixSet`. -*Spatial Alignment*:: Each overview variable MUST be spatially aligned with the base variable using a consistent coordinate reference system and compatible axis orientation. 
-*Resampling Method*:: A declared method indicating the technique used to derive coarser levels from the base variable (e.g. `nearest`, `average`, `cubic`). +*Spatial Alignment*:: Each zoom-level dataset MUST be spatially aligned with the base dataset using a consistent coordinate reference system and compatible axis orientation. +*Resampling Method*:: A declared method indicating the technique used to derive coarser levels from the base dataset (e.g. `nearest`, `average`, `cubic`). ===== Model Components @@ -351,4 +348,3 @@ The unified data model facilitates interoperability with tools and libraries acr - *Cloud-native infrastructure*: support for parallel access, chunked storage, and hierarchical grouping compatible with object storage. Tooling support is expected to grow via standard-conformant implementations, easing adoption across domains and infrastructures. - diff --git a/standard/template/sections/clause_9_zarr_encoding_core.adoc b/standard/template/sections/clause_9_zarr_encoding_core.adoc index a2d6a2e..eedb689 100644 --- a/standard/template/sections/clause_9_zarr_encoding_core.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_core.adoc @@ -1,7 +1,7 @@ === Hierarchical Structure -A dataset conforming to the unified data model is represented as a hierarchical structure of groups, variables (arrays), dimensions, and metadata. The dataset is rooted in a *top-level group*, which may contain: +A store conforming to the unified data model is structured as a hierarchy of groups, variables (arrays), dimensions, and metadata. Following Zarr conventions, this hierarchy is rooted in a group, which may contain: - Arrays representing coordinate or data variables - Child groups for modular organisation, including logical sub-collections or resolution levels @@ -14,7 +14,7 @@ Each group adheres to a consistent structure, allowing recursive composition. 
Th |=== |Model Element |Zarr v2 Encoding |Zarr v3 Encoding -|Root Dataset | Directory with `.zgroup` and `.zattrs` | Directory with `zarr.json`, with `node_type: group` +|Root Group | Directory with `.zgroup` and `.zattrs` | Directory with `zarr.json`, with `node_type: group` |Child Group | Subdirectory with `.zgroup` and `.zattrs` | Subdirectory with `zarr.json`, with `node_type: group` @@ -115,7 +115,7 @@ Example: === Global Metadata -Metadata associated with the dataset as a whole is stored at the root group level. +Metadata associated with the store is stored at the root group level. [cols="1,2,2"] @@ -157,4 +157,3 @@ In all cases: - Attribute names are case-sensitive and encoded as UTF-8 strings - Values shall conform to JSON-compatible types (string, number, boolean, array) - diff --git a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc index b20092e..abf6832 100644 --- a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc @@ -1,30 +1,30 @@ === Encoding of Multiscale Overviews in Zarr -This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr-based datasets conforming to the unified data model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. +This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr stores conforming to the unified data model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. -Multiscale datasets are composed of a set of Zarr groups representing multiple zoom levels. Each level stores coarser-resolution resampled versions of the original data variables. 
+A multiscale group contains child groups, where each child group is a <> representing a zoom level that stores a coarser-resolution resampled version of the original data variables. ==== Hierarchical Layout -Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group. Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution. +Each zoom level SHALL be represented as a child group, identified by the Tile Matrix identifier (e.g., `"0"`, `"1"`, `"2"`). These child groups SHALL be organized hierarchically under a common multiscale group and each SHALL be a <> containing the complete set of variables (arrays) corresponding to that resolution. All zoom-level datasets MUST maintain consistent structure. [cols="1,2,2"] |=== |Structure |Zarr v2 |Zarr v3 -|Zoom level groups | Subdirectories with `.zgroup` and `.zattrs` | Subdirectories with `zarr.json`, `node_type: group` +|Zoom level datasets | Subdirectories with `.zgroup` and `.zattrs` | Subdirectories with `zarr.json`, `node_type: group` -|Variables at each level | Zarr arrays (`.zarray`, `.zattrs`) in each group | Zarr arrays (`zarr.json`, `node_type: array`) in each group +|Variables at each level | Arrays (`.zarray`, `.zattrs`) in each dataset | Arrays (`zarr.json`, `node_type: array`) in each dataset -|Global metadata | `multiscales` defined in parent `.zattrs` | `multiscales` defined in parent group `zarr.json` under `attributes` +|Multiscale metadata | `multiscales` defined in multiscale group `.zattrs` | `multiscales` defined in multiscale group `zarr.json` under `attributes` |=== -Each multiscale group MUST define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512. 
+Each zoom-level dataset MUST define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512. ==== Metadata Encoding -Multiscale metadata SHALL be defined using a `multiscales` attribute located in the parent group of the zoom levels. This attribute SHALL be a JSON object with the following members: +Multiscale metadata SHALL be defined using a `multiscales` attribute located in the multiscale group. This attribute SHALL be a JSON object with the following members: - `tile_matrix_set` – Identifier, URI, or inline JSON object compliant with OGC TileMatrixSet v2 - `resampling_method` – One of the standard string values (e.g., `"nearest"`, `"average"`) @@ -98,4 +98,3 @@ The `resampling_method` MUST indicate the method used for downsampling across zo `nearest`, `average`, `bilinear`, `cubic`, `cubic_spline`, `lanczos`, `mode`, `max`, `min`, `med`, `sum`, `q1`, `q3`, `rms`, `gauss` The same method MUST apply across all levels. - From bb05f3fe597083de3bb1d7caf7e4f65a5ac67808 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Tue, 2 Sep 2025 23:08:16 +0200 Subject: [PATCH 02/18] Enhance documentation clarity and detail in the GeoZarr Unified Data Model, addressing semantic constructs and use cases for geospatial data workflows. 
--- .../sections/clause_0_front_material.adoc | 8 +++---- .../template/sections/clause_1_scope.adoc | 24 +++++++++++++++++-- .../sections/clause_7_unified_data_model.adoc | 2 +- 3 files changed, 27 insertions(+), 7 deletions(-) diff --git a/standard/template/sections/clause_0_front_material.adoc b/standard/template/sections/clause_0_front_material.adoc index b9f7975..a9f2ba6 100644 --- a/standard/template/sections/clause_0_front_material.adoc +++ b/standard/template/sections/clause_0_front_material.adoc @@ -11,11 +11,11 @@ This Standard has been developed in collaboration with contributors from Earth o [abstract] == Abstract -The GeoZarr Unified Data Model and Encoding Standard specifies a conceptual and implementation framework for representing multidimensional, geospatial datasets using the Zarr format. This Standard builds upon the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, and introduces interoperable constructs for tiling, georeferencing, and metadata integration. +Zarr provides efficient chunked storage for n-dimensional arrays but does not provide the semantic constructs required for geospatial and scientific data workflows. The GeoZarr Unified Data Model and Encoding Standard addresses this gap by adding essential concepts—coordinate systems, grid mappings, temporal semantics, and CF-compliant metadata—on top of Zarr's storage foundation. -The model defines core elements—dimensions, coordinate variables, data variables, attributes—and optional extensions for multi-resolution overviews, affine geotransforms, and STAC metadata. Encoding guidance is provided for Zarr Version 2 and Zarr Version 3, including chunking, group hierarchy, and metadata conventions. 
+The Standard builds upon proven concepts from the Common Data Model (CDM) and Climate and Forecast (CF) Conventions to define core elements—dimensions, coordinate variables, data variables, and attributes—along with extensions for multi-resolution overviews, affine geotransforms, and STAC metadata. This layered approach ensures applications can work with semantically rich geospatial data while leveraging Zarr's cloud-optimized storage capabilities. -GeoZarr aims to bridge scientific and geospatial communities by enabling round-trip transformations with formats such as NetCDF and GeoTIFF, and supporting compatibility with tools in the scientific Python and geospatial ecosystems. This Standard enables scalable, standards-compliant, and semantically rich data structures for cloud-native Earth observation applications. +By providing a standardized framework for geospatial semantics, GeoZarr enables scientific and geospatial applications to fully utilize cloud-native storage architectures while maintaining the rich metadata and coordinate referencing required for Earth observation workflows. The result is a modern, scalable approach to storing and accessing geospatial data that meets the needs of both data providers and consumers. == Submitters @@ -29,4 +29,4 @@ All questions regarding this submission should be directed to the editor or the |Brianna Pagán _(editor)_ | DevSeed |Ryan Abernathey| EarthMover | TBD | TBD -|=== \ No newline at end of file +|=== diff --git a/standard/template/sections/clause_1_scope.adoc b/standard/template/sections/clause_1_scope.adoc index 93a5d91..1275ba4 100644 --- a/standard/template/sections/clause_1_scope.adoc +++ b/standard/template/sections/clause_1_scope.adoc @@ -2,6 +2,26 @@ The GeoZarr Unified Data Model and Encoding Standard defines a conceptual and implementation framework for representing and encoding geospatial and scientific datasets using the Zarr format. 
The scope of this Standard includes the definition of a format-agnostic unified data model, the specification of its encoding into Zarr Version 2 and Version 3, and the establishment of extension points to support interoperability with external metadata and tiling standards. -This Standard addresses the needs of Earth observation, environmental monitoring, and geospatial analysis applications that require efficient, scalable access to multidimensional datasets. It enables the harmonisation of existing data models, such as the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, with operational encoding formats suitable for cloud-native storage and analysis. +These capabilities are necessary because Zarr does not provide semantic constructs for geospatial data interpretation. Applications need to understand not just array shapes and values, but coordinate meanings, projection parameters, and scientific metadata. GeoZarr fills this gap without compromising Zarr's performance characteristics. -Typical use cases include the storage, transformation, discovery, and processing of raster and gridded data, data cubes with temporal or vertical dimensions, and catalogue-enabled datasets integrated with metadata standards such as STAC and OGC Tile Matrix Sets. +=== Why GeoZarr Exists + +Zarr, by design, is a low-level container for storing n-dimensional arrays and metadata. 
While this simplicity is a strength for performance and interoperability, it means Zarr lacks higher-level concepts that geospatial applications require: + +* *Coordinate Systems:* No native way to associate spatial or temporal meaning with array dimensions +* *Grid Mappings:* No standard mechanism for projection and coordinate reference system metadata +* *Semantic Metadata:* No conventions for units, standard names, or scientific attributes +* *Variable Relationships:* No formal distinction between coordinate variables and data variables + +These concepts are essential for geospatial workflows but must be layered on top of Zarr's array storage. GeoZarr provides this semantic layer through proven standards (Common Data Model and CF conventions) while preserving Zarr's cloud-native advantages. + +=== Use Cases and Applications + +This Standard addresses the needs of Earth observation, environmental monitoring, and geospatial analysis applications that require efficient, scalable access to multidimensional datasets. It enables the harmonisation of existing data models with operational encoding formats suitable for cloud-native storage and analysis. 
+ +Typical use cases include: +* Storage and processing of raster and gridded data +* Management of data cubes with temporal or vertical dimensions +* Integration with catalogue systems through standardized metadata +* Multi-resolution tiling for efficient visualization and analysis +* Cloud-optimized access to large geospatial datasets diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 64c073a..ff740ae 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -21,7 +21,7 @@ This clause specifies the logical composition of the unified model, the external === Foundational Model and Standards Reuse -The unified data model described in this Standard is derived from established community specifications to maximise interoperability and to enable the reuse of mature tools and practices. The model is grounded in the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, which together provide a robust framework for representing scientific and geospatial datasets. +GeoZarr adopts established data model concepts because Zarr itself provides only array storage without semantic interpretation. The Unidata Common Data Model (CDM) provides the conceptual framework for understanding dimensions, variables, and attributes, while CF Conventions provide standardized metadata semantics. This reuse ensures compatibility with existing scientific software while avoiding reinvention of proven concepts. 
==== Common Data Model (CDM) From 561edd94e1e18d0b93268d0c06efe3cf4deb84c4 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Wed, 3 Sep 2025 14:33:07 +0200 Subject: [PATCH 03/18] Refine descriptions of multiscale groups in documentation for clarity and completeness --- standard/template/sections/clause_4_terms_and_definitions.adoc | 2 +- .../template/sections/clause_9_zarr_encoding_overviews.adoc | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/standard/template/sections/clause_4_terms_and_definitions.adoc b/standard/template/sections/clause_4_terms_and_definitions.adoc index fa81a97..2ba8d1e 100644 --- a/standard/template/sections/clause_4_terms_and_definitions.adoc +++ b/standard/template/sections/clause_4_terms_and_definitions.adoc @@ -35,7 +35,7 @@ Structured information describing the content, context, and semantics of dataset ==== multiscale group -A group that contains 2 or more child groups representing the same data at different resolutions, where each child group is a <>. The multiscale group includes metadata describing the relationship between resolution levels. +A group that contains child groups representing the same data at different resolutions, where each child group is a <>. The multiscale group includes metadata describing the relationship between resolution levels. A multiscale group can be initialized with a single dataset and expanded with additional resolution levels over time. ==== tile matrix set diff --git a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc index abf6832..bd3a69f 100644 --- a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc @@ -3,7 +3,7 @@ This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr stores conforming to the unified data model. 
The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. -A multiscale group contains child groups, where each child group is a <> representing a zoom level that stores a coarser-resolution resampled version of the original data variables. +A <> contains one or more child groups, where each child group is a <> representing a zoom level of the data. Additional resolution levels can be added over time, with each new level storing a coarser-resolution resampled version of the original data variables. ==== Hierarchical Layout From f99d7427f4e3a64eafbbe4e0684f23709eb04e72 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Wed, 3 Sep 2025 20:15:38 +0200 Subject: [PATCH 04/18] Capitalize "Unified Data Model" for consistency across documentation sections --- .../clause_4_terms_and_definitions.adoc | 6 ++-- .../sections/clause_7_unified_data_model.adoc | 28 +++++++++---------- .../clause_9_zarr_encoding_overviews.adoc | 2 +- 3 files changed, 18 insertions(+), 18 deletions(-) diff --git a/standard/template/sections/clause_4_terms_and_definitions.adoc b/standard/template/sections/clause_4_terms_and_definitions.adoc index 2ba8d1e..4aae161 100644 --- a/standard/template/sections/clause_4_terms_and_definitions.adoc +++ b/standard/template/sections/clause_4_terms_and_definitions.adoc @@ -27,7 +27,7 @@ An index axis along which arrays are organised. Dimensions provide a naming and ==== dataset -A group that contains one or more data variables along with their associated coordinate variables, having a consistent relationship between these components. A dataset represents a coherent set of related data arrays and follows the unified data model. +A group that contains one or more data variables along with their associated coordinate variables, having a consistent relationship between these components. A dataset represents a coherent set of related data arrays and follows the Unified Data Model. 
 ==== metadata @@ -45,9 +45,9 @@ A spatial tiling scheme defined by a hierarchy of zoom levels and consistent gri An affine transformation used to convert between grid coordinates and geospatial coordinates, typically defined using the GDAL GeoTransform convention. -==== unified data model (UDM) +==== Unified Data Model (UDM) -A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. +A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. The Unified Data Model provides a standardized framework for expressing spatial relationships, coordinate systems, and scientific metadata. === Abbreviated Terms diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index ff740ae..48cdaa9 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -4,9 +4,9 @@ === Scope and Purpose -This Standard defines a unified data model (UDM) that provides a conceptual framework for representing geospatial and scientific data in Zarr. The purpose of this model is to support standards-based interoperability across Earth observation systems and analytical environments, while preserving compatibility with existing data models and software ecosystems.. +This Standard defines the Unified Data Model (UDM) that provides a conceptual framework for representing geospatial and scientific data in Zarr. The purpose of this model is to support standards-based interoperability across Earth observation systems and analytical environments, while preserving compatibility with existing data models and software ecosystems. 
-The unified data model incorporates and extends the following established specifications and community standards: +The Unified Data Model incorporates and extends the following established specifications and community standards: - **Unidata Common Data Model (CDM)** – Provides the foundational resource structure for scientific datasets, encompassing dimensions, coordinate systems, variables, and associated metadata elements. - **CF (Climate and Forecast) Conventions** – Defines a widely adopted metadata profile for describing spatiotemporal semantics in CDM-based datasets. @@ -15,9 +15,9 @@ The unified data model incorporates and extends the following established specif - **GDAL geotransform metadata**, used to express affine transformations and interpolation characteristics. - **SpatioTemporal Asset Catalog (STAC)** metadata elements for resource discovery and cataloguing (Collection and Item constructs). -The unified model is format-agnostic and describes the abstract structure of resources independently of the physical encoding. It does not redefine the semantics of the CDM or CF conventions, but introduces integration and extension points required to support tiled multiscale data, geospatial referencing, and metadata for discovery. +The Unified Data Model is format-agnostic and describes the abstract structure of resources independently of the physical encoding. It does not redefine the semantics of the CDM or CF conventions, but introduces integration and extension points required to support tiled multiscale data, geospatial referencing, and metadata for discovery. -This clause specifies the logical composition of the unified model, the external standards it leverages, and the conformance points that facilitate harmonised implementation within the GeoZarr framework. 
+This clause specifies the logical composition of the Unified Data Model, the external standards it leverages, and the conformance points that facilitate harmonised implementation within the GeoZarr framework. === Foundational Model and Standards Reuse @@ -25,7 +25,7 @@ GeoZarr adopts established data model concepts because Zarr itself provides only ==== Common Data Model (CDM) -The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the unified model: +The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the Unified Data Model: - **Dimensions** – Integer-valued, named axes that define the extents of data variables. - **Coordinate Variables** – Variables that supply coordinate values along dimensions, establishing spatial or temporal context. @@ -33,7 +33,7 @@ The CDM defines a generalised schema for representing array-based scientific dat - **Attributes** – Key-value metadata elements used to describe variables and datasets semantically. - **Groups** – Optional hierarchical containers enabling logical organisation of resources and metadata. -The unified data model adopts these CDM components without modification excluding the user-defined types. Semantic interpretation remains consistent with the original CDM specification. GeoZarr structures are mapped to CDM constructs to ensure compatibility and clarity. +The Unified Data Model adopts these CDM components without modification excluding the user-defined types. Semantic interpretation remains consistent with the original CDM specification. GeoZarr structures are mapped to CDM constructs to ensure compatibility and clarity. 
==== CF Conventions @@ -44,7 +44,7 @@ The CF Conventions specify standardised metadata attributes and practices to des - Physical units - Standard variable naming -The unified data model supports CF-compliant metadata, including attributes such as `standard_name`, `units`, and `grid_mapping`. The unified data model does not prescribe CF compliance but enables it through permissive design. Partial adoption of CF attributes is supported, and non-compliant datasets may selectively adopt CF metadata as needed. +The Unified Data Model supports CF-compliant metadata, including attributes such as `standard_name`, `units`, and `grid_mapping`. The Unified Data Model does not prescribe CF compliance but enables it through permissive design. Partial adoption of CF attributes is supported, and non-compliant datasets may selectively adopt CF metadata as needed. ==== Standards-Based Extensions @@ -58,7 +58,7 @@ These extensions are integrated in a modular fashion and do not alter the core s === Model Extension Points -The unified data model specifies a series of optional, standards-aligned extension points to support functionality beyond the base CDM and CF constructs. These extensions enhance applicability to Earth observation and spatial analysis use cases without imposing additional mandatory requirements. +The Unified Data Model specifies a series of optional, standards-aligned extension points to support functionality beyond the base CDM and CF constructs. These extensions enhance applicability to Earth observation and spatial analysis use cases without imposing additional mandatory requirements. Each extension is defined as an independent module. Implementation of any given extension does not necessitate support for others. @@ -101,7 +101,7 @@ Each extension point is specified independently. Implementations may advertise s === Unified Model Structure -This clause defines the structural organisation of stores conforming to the unified data model (UDM). 
It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. +This clause defines the structural organisation of stores conforming to the Unified Data Model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. The model represents datasets as abstract compositions of dimensions, coordinate variables, data variables, and associated metadata. This abstraction ensures that applications and services can reason about the content and semantics of a dataset without reliance on storage layout or specific serialisation. @@ -118,7 +118,7 @@ Each <> comprises the following core components, aligned A Zarr hierarchy is a tree structure, where each node in the tree is either a group or an array. Group nodes may have children but array nodes may not. This supports the logical subdivision by theme, resolution, or processing stage, and enhances the clarity and reusability of complex geospatial structures. -The diagram below represents the structural layer of the unified data model, derived from the Unidata Common Data Model, which serves as the foundational framework for supporting all overlaying model layer. +The diagram below represents the structural layer of the Unified Data Model, derived from the Unidata Common Data Model, which serves as the foundational framework for supporting all overlying model layers.
//image::udm-core.png[] @@ -215,7 +215,7 @@ Overviews enable: ===== Conceptual Structure -A <> contains child groups representing the data at different resolutions, where each child group is a <> following the unified data model. It comprises the following components: +A <> contains child groups representing the data at different resolutions, where each child group is a <> following the Unified Data Model. It comprises the following components: [horizontal] *Base Dataset*:: The original, highest-resolution dataset to which the multiscale hierarchy is anchored. @@ -227,7 +227,7 @@ A <> contains child groups representing ===== Model Components -The *Overviews* construct is represented in the unified data model using the following logical elements: +The *Overviews* construct is represented in the Unified Data Model using the following logical elements: [cols="1,3"] |=== @@ -307,7 +307,7 @@ This extensibility framework supports both minimum-viable use and high-fidelity === Interoperability Considerations -Interoperability is a core objective of the GeoZarr unified data model. The model is designed to bridge diverse Earth observation and scientific data ecosystems by enabling structural and semantic compatibility with established formats and standards, while providing a forward-looking foundation for scalable, cloud-native workflows. +Interoperability is a core objective of the GeoZarr Unified Data Model. The model is designed to bridge diverse Earth observation and scientific data ecosystems by enabling structural and semantic compatibility with established formats and standards, while providing a forward-looking foundation for scalable, cloud-native workflows. This section outlines the principles and mechanisms supporting interoperability across formats, tools, and communities. 
@@ -341,7 +341,7 @@ This approach enables seamless integration into modern data catalogues and platf ==== Tool and Ecosystem Support -The unified data model facilitates interoperability with tools and libraries across the following domains: +The Unified Data Model facilitates interoperability with tools and libraries across the following domains: - *Scientific computing*: NetCDF-based libraries (e.g., xarray, netCDF4), Zarr-compatible clients. - *Geospatial processing*: GDAL, rasterio, QGIS (via Zarr driver extensions or translations). diff --git a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc index bd3a69f..e91920e 100644 --- a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc @@ -1,7 +1,7 @@ === Encoding of Multiscale Overviews in Zarr -This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr stores conforming to the unified data model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. +This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr stores conforming to the Unified Data Model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. A <> contains one or more child groups, where each child group is a <> representing a zoom level of the data. Additional resolution levels can be added over time, with each new level storing a coarser-resolution resampled version of the original data variables. 
From b8c988b28c3bb842325d003587c8376e020c7c49 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Wed, 3 Sep 2025 20:16:14 +0200 Subject: [PATCH 05/18] Update section title to "Unified Data Model Structure" for consistency in documentation --- standard/template/sections/clause_7_unified_data_model.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 48cdaa9..0f3cf4b 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -99,7 +99,7 @@ STAC integration is non-intrusive and modular. It does not impose changes on the Each extension point is specified independently. Implementations may advertise support for one or more extensions by declaring conformance to corresponding extension modules. This modularity facilitates incremental adoption, promotes reuse, and enhances interoperability across varied implementation environments. -=== Unified Model Structure +=== Unified Data Model Structure This clause defines the structural organisation of stores conforming to the Unified Data Model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. From 8cac80c42d3a1356323ee04a81b998389f3e37b4 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Thu, 4 Sep 2025 15:24:04 +0200 Subject: [PATCH 06/18] Enhance documentation clarity by defining relationships to Zarr core concepts, refining terminology, and ensuring consistent references to hierarchies and stores across multiple sections. 
--- .../template/sections/clause_1_scope.adoc | 4 ++++ .../clause_4_terms_and_definitions.adoc | 4 ++++ .../sections/clause_7_unified_data_model.adoc | 23 ++++++++++++------- .../sections/clause_9_zarr_encoding_core.adoc | 4 ++-- .../clause_9_zarr_encoding_overviews.adoc | 2 +- 5 files changed, 26 insertions(+), 11 deletions(-) diff --git a/standard/template/sections/clause_1_scope.adoc b/standard/template/sections/clause_1_scope.adoc index 1275ba4..3aa9b72 100644 --- a/standard/template/sections/clause_1_scope.adoc +++ b/standard/template/sections/clause_1_scope.adoc @@ -15,6 +15,10 @@ Zarr, by design, is a low-level container for storing n-dimensional arrays and m These concepts are essential for geospatial workflows but must be layered on top of Zarr's array storage. GeoZarr provides this semantic layer through proven standards (Common Data Model and CF conventions) while preserving Zarr's cloud-native advantages. +=== Relationship to Zarr Core Concepts + +GeoZarr builds upon Zarr's foundational concepts of <> and <>. A Zarr store provides the storage and retrieval interface (e.g., filesystem, cloud object storage), while a hierarchy defines the logical tree structure of groups and arrays within that store. GeoZarr specifies how to organize and structure hierarchies to support geospatial semantics, without modifying the underlying store interface. + === Use Cases and Applications This Standard addresses the needs of Earth observation, environmental monitoring, and geospatial analysis applications that require efficient, scalable access to multidimensional datasets. It enables the harmonisation of existing data models with operational encoding formats suitable for cloud-native storage and analysis. 
diff --git a/standard/template/sections/clause_4_terms_and_definitions.adoc b/standard/template/sections/clause_4_terms_and_definitions.adoc index 4aae161..cee4bbd 100644 --- a/standard/template/sections/clause_4_terms_and_definitions.adoc +++ b/standard/template/sections/clause_4_terms_and_definitions.adoc @@ -37,6 +37,10 @@ Structured information describing the content, context, and semantics of dataset A group that contains child groups representing the same data at different resolutions, where each child group is a <>. The multiscale group includes metadata describing the relationship between resolution levels. A multiscale group can be initialized with a single dataset and expanded with additional resolution levels over time. +==== store + +A system that provides storage and retrieval operations for Zarr hierarchies, as defined in the https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#stores[Zarr core specification]. A store implements the abstract store interface and can be backed by various storage technologies such as filesystems, cloud object storage, or databases. GeoZarr hierarchies are stored within and accessed through Zarr stores. + ==== tile matrix set A spatial tiling scheme defined by a hierarchy of zoom levels and consistent grid parameters (e.g., scale, CRS). Tile Matrix Sets enable spatial indexing and tiling of gridded data. diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 0f3cf4b..0afc80f 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -4,7 +4,9 @@ === Scope and Purpose -This Standard defines the Unified Data Model (UDM) that provides a conceptual framework for representing geospatial and scientific data in Zarr. 
The purpose of this model is to support standards-based interoperability across Earth observation systems and analytical environments, while preserving compatibility with existing data models and software ecosystems.. +This Standard defines the Unified Data Model (UDM) that provides a conceptual framework for representing geospatial and scientific data in Zarr. The purpose of this model is to support standards-based interoperability across Earth observation systems and analytical environments, while preserving compatibility with existing data models and software ecosystems. + +The Unified Data Model operates within the Zarr framework, where a <> provides the storage and retrieval interface, and a hierarchy defines the logical organization of groups and arrays within that store. GeoZarr hierarchies are stored in and accessed through Zarr stores, which can be implemented using various storage technologies such as filesystems, cloud object storage, or databases. The Unified Data Model incorporates and extends the following established specifications and community standards: @@ -87,11 +89,11 @@ To enable discovery of resources within the hierarchical structure of the data m A STAC extension consists of embedding or referencing STAC Collection and Item metadata within the data model: -* Each store resource MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object. +* Each hierarchy MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object. * STAC properties such as `datetime`, `bbox`, and `eo:bands` MAY be included in the metadata to enable spatial, temporal, and spectral filtering. * The structure is compatible with external STAC APIs and metadata harvesting systems. -STAC integration is non-intrusive and modular. It does not impose changes on the internal organisation of the store and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities. 
+STAC integration is non-intrusive and modular. It does not impose changes on the internal organisation of the hierarchy and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities. ==== Modularity and Interoperability @@ -101,13 +103,13 @@ Each extension point is specified independently. Implementations may advertise s === Unified Data Model Structure -This clause defines the structural organisation of stores conforming to the Unified Data Model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. +This clause defines the structural organisation of Zarr hierarchies conforming to the Unified Data Model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. The model represents datasets as abstract compositions of dimensions, coordinate variables, data variables, and associated metadata. This abstraction ensures that applications and services can reason about the content and semantics of a dataset without reliance on storage layout or specific serialisation. -==== Store Structure +==== Hierarchy Structure -A store conforming to the Unified Data Model (UDM) is structured as a hierarchy rooted at a top-level group. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections. +A Zarr hierarchy conforming to the Unified Data Model (UDM) is structured as a tree rooted at a top-level group. 
This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections. Each <> comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions: @@ -135,6 +137,10 @@ class Store { + close() } +class Hierarchy { + + String name +} + class Group { + String name } @@ -165,7 +171,8 @@ class Attribute { + List values } -Store "1" --> "*" Group : rootGroup +Store "1" --> "*" Hierarchy : implements +Hierarchy "1" *-- "1" Group : has root Group "1" --> "*" Group : contains Dataset -up-|> Group Dataset --> "*" Variable : contains @@ -193,7 +200,7 @@ The model accommodates both standard CF-compatible definitions and extended refe Metadata may be declared at various levels within the model structure: -- **Global Metadata** – Attributes describing the store as a whole, including elements such as `title`, `summary`, and `license`. +- **Global Metadata** – Attributes describing the hierarchy as a whole, including elements such as `title`, `summary`, and `license`. - **Variable Metadata** – Attributes associated with individual data or coordinate variables, conveying descriptive or semantic information. - **Extension Metadata** – Structured metadata linked to optional model extensions (e.g., multiscale tiling, catalogue references, geotransform properties). diff --git a/standard/template/sections/clause_9_zarr_encoding_core.adoc b/standard/template/sections/clause_9_zarr_encoding_core.adoc index eedb689..8a1972c 100644 --- a/standard/template/sections/clause_9_zarr_encoding_core.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_core.adoc @@ -1,7 +1,7 @@ === Hierarchical Structure -A store conforming to the unified data model is structured as a hierarchy of groups, variables (arrays), dimensions, and metadata. 
Following Zarr conventions, this hierarchy is rooted in a group, which may contain: +A hierarchy conforming to the Unified Data Model is structured as a tree of groups, variables (arrays), dimensions, and metadata. Following Zarr conventions, this hierarchy is rooted in a group, which may contain: - Arrays representing coordinate or data variables - Child groups for modular organisation, including logical sub-collections or resolution levels @@ -115,7 +115,7 @@ Example: === Global Metadata -Metadata associated with the store is stored at the root group level. +Metadata associated with the hierarchy is stored at the root group level. [cols="1,2,2"] diff --git a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc index e91920e..c367b6d 100644 --- a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc @@ -1,7 +1,7 @@ === Encoding of Multiscale Overviews in Zarr -This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr stores conforming to the Unified Data Model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. +This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr hierarchies conforming to the Unified Data Model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. A <> contains one or more child groups, where each child group is a <> representing a zoom level of the data. Additional resolution levels can be added over time, with each new level storing a coarser-resolution resampled version of the original data variables. 
From 08caa63294f58a8b73260d168641acf11d6be272 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Thu, 4 Sep 2025 15:32:10 +0200 Subject: [PATCH 07/18] Refine Unified Data Model description to clarify adaptations for Zarr's type system and ensure compatibility with CDM semantics. --- standard/template/sections/clause_7_unified_data_model.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 0afc80f..f5937a9 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -35,7 +35,7 @@ The CDM defines a generalised schema for representing array-based scientific dat - **Attributes** – Key-value metadata elements used to describe variables and datasets semantically. - **Groups** – Optional hierarchical containers enabling logical organisation of resources and metadata. -The Unified Data Model adopts these CDM components without modification excluding the user-defined types. Semantic interpretation remains consistent with the original CDM specification. GeoZarr structures are mapped to CDM constructs to ensure compatibility and clarity. +The Unified Data Model adopts these CDM components with adaptations for Zarr's type system. While the conceptual structure remains consistent with the original CDM specification, attribute types are mapped to Zarr's JSON-compatible type system. GeoZarr structures preserve CDM semantics while conforming to Zarr's encoding constraints. 
==== CF Conventions From 4db26fbeda15ec215f2066fb758fb4949cbce134 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Thu, 4 Sep 2025 16:07:07 +0200 Subject: [PATCH 08/18] Fix relationship notation for Store and Hierarchy in Unified Data Model diagram --- standard/template/sections/clause_7_unified_data_model.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index f5937a9..bb61701 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -171,7 +171,7 @@ class Attribute { + List values } -Store "1" --> "*" Hierarchy : implements +Store "1" ..|> "*" Hierarchy : implements Hierarchy "1" *-- "1" Group : has root Group "1" --> "*" Group : contains Dataset -up-|> Group From 1500a6d4ec1fe87e43eafbeaa73f0263ead4d1ce Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Thu, 4 Sep 2025 16:25:16 +0200 Subject: [PATCH 09/18] Refine Unified Data Model relationships by correcting notation and enhancing clarity in class diagram --- .../sections/clause_7_unified_data_model.adoc | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index bb61701..a1317ac 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -171,16 +171,16 @@ class Attribute { + List values } -Store "1" ..|> "*" Hierarchy : implements +Store "1" ..|> Hierarchy : implements Hierarchy "1" *-- "1" Group : has root -Group "1" --> "*" Group : contains -Dataset -up-|> Group +Group --* Group : is part of +Dataset -up-|> Group : is a Dataset --> "*" Variable : contains -Dataset --> "*" Dimension : defines -Group --> "*" Attribute : has -Variable --> "*" Dimension 
: uses -Variable --> "1" DataType : has -Variable --> "*" Attribute : has +Dimension --* "*" Dataset : is shared in +Group *-- "*" Attribute +Dimension --o "*" Variable : defines the shape of +Variable --> "1" DataType +Variable *-- "*" Attribute @enduml .... //endif::never-shown[] From 201ae3ef6b0a85b7a250609c41a684ea21bc22dc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Fri, 10 Oct 2025 11:10:14 +0200 Subject: [PATCH 10/18] rename to GeoZarr standard --- standard/template/sections/clause_6_informative_text.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/standard/template/sections/clause_6_informative_text.adoc b/standard/template/sections/clause_6_informative_text.adoc index 591f4f1..b3c951b 100644 --- a/standard/template/sections/clause_6_informative_text.adoc +++ b/standard/template/sections/clause_6_informative_text.adoc @@ -1,7 +1,7 @@ [[overview]] == Overview -The GeoZarr Unified Data Model and Encoding Standard defines a conceptual and implementation framework for representing multidimensional geospatial data using the Zarr format. Developed under the guidance of the OGC GeoZarr Standards Working Group (SWG), the Standard establishes conventions for encoding scientific and Earth observation datasets in a way that promotes scalability, interoperability, and compatibility with cloud-native infrastructure. +The GeoZarr Standard defines a conceptual and implementation framework for representing multidimensional geospatial data using the Zarr format. Developed under the guidance of the OGC GeoZarr Standards Working Group (SWG), the Standard establishes conventions for encoding scientific and Earth observation datasets in a way that promotes scalability, interoperability, and compatibility with cloud-native infrastructure. GeoZarr is built on widely adopted community standards, including the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions.
It introduces additional extensions and structural constructs to support multi-resolution tiling, geospatial referencing, and catalogue-enabled metadata integration (e.g., STAC). From e6dae7ec4c95e63a748ff178064ef8028d0131e3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Fri, 10 Oct 2025 11:52:56 +0200 Subject: [PATCH 11/18] =?UTF-8?q?Simplified=20the=20specification=20to=20f?= =?UTF-8?q?ocus=20on=20the=20agreed=20intent=20of=20GeoZarr=E2=80=94defini?= =?UTF-8?q?ng=20the=20encoding=20of=20the=20Common=20Data=20Model=20(CDM)?= =?UTF-8?q?=20in=20Zarr=20and=20introducing=20extensions=20for=20multiscal?= =?UTF-8?q?e=20overviews=20and=20affine=20transformations.=20Other=20aspec?= =?UTF-8?q?ts=20and=20features=20are=20deferred=20for=20future=20evaluatio?= =?UTF-8?q?n=20to=20support=20timely=20completion=20of=20the=20initial=20s?= =?UTF-8?q?pecification.?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../sections/clause_0_front_material.adoc | 10 ++----- .../template/sections/clause_1_scope.adoc | 4 +-- .../sections/clause_2_conformance.adoc | 2 ++ .../sections/clause_6_informative_text.adoc | 28 ++++--------------- 4 files changed, 13 insertions(+), 31 deletions(-) diff --git a/standard/template/sections/clause_0_front_material.adoc b/standard/template/sections/clause_0_front_material.adoc index a9f2ba6..59f3afb 100644 --- a/standard/template/sections/clause_0_front_material.adoc +++ b/standard/template/sections/clause_0_front_material.adoc @@ -1,19 +1,15 @@ .Preface -The GeoZarr Unified Data Model and Encoding Standard defines a layered, standards-based framework for representing and encoding geospatial and scientific datasets in the Zarr format. 
It integrates foundational specifications such as the Unidata Common Data Model (CDM), the CF Conventions, and selected OGC and community standards to enable semantic, structural, and operational interoperability across Earth observation platforms and geospatial ecosystems. - -This Standard introduces a unified model that harmonises metadata structures, array-based data representations, coordinate referencing, and multiscale tiling semantics. It provides a coherent framework that facilitates encoding into Zarr v2 and v3, supporting scalable, cloud-native workflows. - -The purpose of this document is to provide implementation guidance and normative structure for consistent, interoperable adoption of GeoZarr across tools, platforms, and services. This work extends prior standardisation efforts within the OGC, including OGC API – Tiles, the Tile Matrix Set Standard, and EO metadata conventions, and anticipates integration with catalogue systems such as STAC. +The GeoZarr Standard defines a layered, standards-based framework for representing and encoding geospatial and scientific datasets in the Zarr format. The purpose of this document is to provide implementation guidance and normative structure for consistent, interoperable adoption of GeoZarr across tools, platforms, and services. This work extends prior standardisation efforts within the OGC, including OGC API – Tiles, the Tile Matrix Set Standard, and EO metadata conventions, and anticipates integration with catalogue systems such as STAC. This Standard has been developed in collaboration with contributors from Earth observation, climate science, geospatial analysis, and cloud-native geodata infrastructure communities. Future work may extend this model to additional storage formats, API services, and semantic layers. [abstract] == Abstract -Zarr provides efficient chunked storage for n-dimensional arrays but do not provide with the semantic constructs required for geospatial and scientific data workflows. 
The GeoZarr Unified Data Model and Encoding Standard addresses this gap by adding essential concepts—coordinate systems, grid mappings, temporal semantics, and CF-compliant metadata—on top of Zarr's storage foundation. +Zarr provides efficient chunked storage for n-dimensional arrays but do not provide with the semantic constructs required for geospatial and scientific data workflows. -The Standard builds upon proven concepts from the Common Data Model (CDM) and Climate and Forecast (CF) Conventions to define core elements—dimensions, coordinate variables, data variables, and attributes—along with extensions for multi-resolution overviews, affine geotransforms, and STAC metadata. This layered approach ensures applications can work with semantically rich geospatial data while leveraging Zarr's cloud-optimized storage capabilities. +GeoZarr defines an abstract data model and a set of conventions for representing geospatial and scientific datasets in the Zarr format. It builds on the Unidata Common Data Model (CDM), which provides the conceptual structure for organising variables, groups, coordinates, and metadata. GeoZarr specifies how these CDM concepts are encoded in Zarr, standardising practices already used by libraries such as xarray and nczarr. It also extends the CDM with features for geospatial workflows, including multiscale overviews and affine transformations, while remaining compatible with community metadata standards such as CF, GeoTIFF, and GDAL. By providing a standardized framework for geospatial semantics, GeoZarr enables scientific and geospatial applications to fully utilize cloud-native storage architectures while maintaining the rich metadata and coordinate referencing required for Earth observation workflows. The result is a modern, scalable approach to storing and accessing geospatial data that meets the needs of both data providers and consumers. 
diff --git a/standard/template/sections/clause_1_scope.adoc b/standard/template/sections/clause_1_scope.adoc index 3aa9b72..028e2c4 100644 --- a/standard/template/sections/clause_1_scope.adoc +++ b/standard/template/sections/clause_1_scope.adoc @@ -1,8 +1,8 @@ == Scope -The GeoZarr Unified Data Model and Encoding Standard defines a conceptual and implementation framework for representing and encoding geospatial and scientific datasets using the Zarr format. The scope of this Standard includes the definition of a format-agnostic unified data model, the specification of its encoding into Zarr Version 2 and Version 3, and the establishment of extension points to support interoperability with external metadata and tiling standards. +The GeoZarr Standard defines a conceptual and implementation framework for representing and encoding geospatial and scientific datasets using the Zarr format. The scope of this Standard includes the definition of a format-agnostic data model, the specification of its encoding into Zarr Version 2 and Version 3, and a set of extensions to support affine transformations and overviews. -These capabilities are necessary because Zarr does not provide semantic constructs for geospatial data interpretation. Applications need to understand not just array shapes and values, but coordinate meanings, projection parameters, and scientific metadata. GeoZarr fills this gap without compromising Zarr's performance characteristics. +These capabilities are necessary for geospatial data because Zarr does not provide semantic constructs for geospatial data interpretation. Applications need to understand not just array shapes and values, but coordinate meanings, projection parameters, and scientific metadata. GeoZarr fills this gap without compromising Zarr's performance characteristics. 
=== Why GeoZarr Exists diff --git a/standard/template/sections/clause_2_conformance.adoc b/standard/template/sections/clause_2_conformance.adoc index 52e1a7c..fd32b8f 100644 --- a/standard/template/sections/clause_2_conformance.adoc +++ b/standard/template/sections/clause_2_conformance.adoc @@ -1,5 +1,7 @@ == Conformance +> WARNING: This section should be ignored and requirements classes should be designed and summarized here once the specification is completed. + The GeoZarr Unified Data Model is structured around a modular set of requirements classes. These classes define the conformance criteria for datasets and implementations adopting the GeoZarr specification. Each class provides a distinct set of structural or semantic expectations, facilitating interoperability across a broad spectrum of geospatial and scientific use cases. The *Core* requirements class defines the minimal compliance necessary to claim conformance with the GeoZarr Unified Data Model. It is intentionally open and permissive, supporting incremental adoption and broad compatibility with existing Zarr tools and data models based on the Unidata Common Data Model (CDM). diff --git a/standard/template/sections/clause_6_informative_text.adoc b/standard/template/sections/clause_6_informative_text.adoc index b3c951b..1653072 100644 --- a/standard/template/sections/clause_6_informative_text.adoc +++ b/standard/template/sections/clause_6_informative_text.adoc @@ -1,30 +1,14 @@ [[overview]] == Overview -The GeoZarr Standard defines a conceptual and implementation framework for representing multidimensional geospatial data using the Zarr format. Developed under the guidance of the OGC GeoZarr Standards Working Group (SWG), the Standard establishes conventions for encoding scientific and Earth observation datasets in a way that promotes scalability, interoperability, and compatibility with cloud-native infrastructure. 
+The **GeoZarr Standard** defines an **abstract data model** and a set of **conventions** for representing and describing geospatial and scientific datasets using the **Zarr** format. -GeoZarr is built on widely adopted community standards, including the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions. It introduces additional extensions and structural constructs to support multi-resolution tiling, geospatial referencing, and catalogue-enabled metadata integration (e.g., STAC). +Zarr provides efficient, chunked storage for n-dimensional arrays but does not include the semantic constructs required for geospatial and scientific data workflows. The **Unidata Common Data Model (CDM)** addresses this gap by introducing essential concepts that structure information through **variables**, **groups**, **coordinates**, and **metadata**. This abstract data model provides the semantic framework that enables structured interpretation of array-based data on top of Zarr’s storage foundation. -This Standard provides both: +The **primary objective** of GeoZarr is to specify how the **CDM** is encoded within Zarr. GeoZarr provides normative rules for encoding these CDM concepts in Zarr and thereby standardises the encoding practices already adopted by CDM-compatible libraries such as **xarray** and **nczarr**, promoting consistent interpretation and interoperability across tools and platforms. -* **Core requirements**, which define minimal compliance to represent array-based datasets using CDM constructs in Zarr, supporting open and permissive adoption across use cases. -* **Modular extension classes**, which define additional capabilities such as time series support, affine geotransform referencing, multi-resolution overviews, and projection coordinates, in line with OGC and community practices. 
+By defining an **abstract model** based on the **CDM** and a corresponding **encoding for Zarr**, GeoZarr establishes an explicit relationship between **the conceptual structure of the data** and **its physical storage representation**. Zarr defines how data are stored and accessed as chunked, hierarchical arrays, while GeoZarr specifies how this stored structure represents the scientific and geospatial meaning of the dataset. -These modular components enable GeoZarr to serve a wide range of applications—from basic EO data storage to high-performance, cloud-native visualisation and analytics workflows. - -=== Encodings - -GeoZarr supports encoding in both Zarr Version 2 and Zarr Version 3. Each version defines how arrays, groups, and metadata are stored within a directory-based structure. All metadata is encoded in JSON-compatible formats, ensuring both human readability and machine interoperability. - -Encoding guidelines include: - -* Hierarchical grouping of datasets via Zarr groups. -* Dimension indexing and binding via dimension metadata. -* Attribute-based metadata compliant with CF conventions. -* Multi-resolution overviews aligned with OGC Tile Matrix Sets. -* Optional integration of STAC metadata for discovery and cataloguing. - -JSON is the primary format for metadata, attributes, and structural declarations. Implementations are encouraged to support standardised naming conventions, EPSG code references, and structured metadata to facilitate search, validation, and transformation across platforms. - -GeoZarr does not prescribe a single interface for data access. Instead, it enables **serverless and cloud-native** data access strategies by aligning its model with chunked, parallelisable storage patterns that are optimised for use in object stores and analytical environments. +As a **secondary objective**, GeoZarr extends the **CDM base layer** with additional capabilities required for geospatial and cloud-native applications.
These extensions include **multiscale overviews**, which enable the representation of data at multiple levels of detail, and **affine transformations**, which define the spatial relationship between data coordinates and real-world locations. All extensions remain fully aligned with the CDM framework. +The **CDM** base layer also provides a **generic framework** capable of hosting metadata from a wide range of community standards. GeoZarr encourages the use of the **Climate and Forecast (CF) Conventions**, which are themselves defined around the CDM model, without imposing them as mandatory. This flexibility also supports metadata from other domain-specific standards such as **GeoTIFF**, **GDAL**, and similar geospatial conventions. From 10c637dee0c570664242e09927c90932b8ce8fb2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Fri, 10 Oct 2025 16:56:38 +0200 Subject: [PATCH 12/18] The Terms and Definitions section should include only common terms used throughout the document. It should exclude new concepts that are formally defined as part of the standard itself. However, it may include adapted or tailored definitions of commonly used terms to reflect their specific meaning within the context of the document. --- .../clause_4_terms_and_definitions.adoc | 31 ++++++++++++------- 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/standard/template/sections/clause_4_terms_and_definitions.adoc b/standard/template/sections/clause_4_terms_and_definitions.adoc index cee4bbd..1a1905b 100644 --- a/standard/template/sections/clause_4_terms_and_definitions.adoc +++ b/standard/template/sections/clause_4_terms_and_definitions.adoc @@ -2,8 +2,17 @@ === Terms and definitions -GeoZarr specification inherits https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[concepts and terminology from the Zarr core specification]. 
-The following terms adds Geozarr specificity to the existing Zarr terminology +GeoZarr specification inherits the terms from the following sources: + +* https://docs.unidata.ucar.edu/netcdf-java/5.2/userguide/common_data_model_overview.html#data-access-layer-object-model[Unidata Common Data Model] + +* https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[Zarr concepts and terminology]. + + +==== affine transformation + +An affine transformation is a geometric mapping that preserves points, straight lines, and parallelism. It combines linear transformations (such as rotation, scaling, reflection, or shear) with translation. + ==== array @@ -11,12 +20,16 @@ A multidimensional, regularly spaced collection of values (e.g., raster data or ==== chunk -A sub-array representing a partition of a larger array, used to optimise data access and storage. In Zarr, data is stored and accessed as a collection of independently compressed chunks. +A sub-array representing a partition of a larger array, used to optimize data access and storage. In Zarr, data is stored and accessed as a collection of independently compressed chunks. ==== coordinate variable A one-dimensional array whose values define the coordinate system for a dimension of one or more data variables. Typical examples include latitude, longitude, time, or vertical levels. +==== data model + +A data model is an *abstract*, conceptual framework that defines how data is structured, organized, and interpreted, independent of any particular storage medium or implementation. In contrast, a file format represents a concrete realization of this model, defining how the data is stored on disk. + ==== data variable An array containing the primary geospatial or scientific measurements of interest (e.g., temperature, reflectance). Data variables are defined over one or more dimensions and associated with attributes. @@ -27,15 +40,15 @@ An index axis along which arrays are organised. 
Dimensions provide a naming and ==== dataset -A group that contains one or more data variables along with their associated coordinate variables, having a consistent relationship between these components. A dataset represents a coherent set of related data arrays and follows the Unified Data Model. +While this term is overloaded and avoided in this document, a dataset usually represent a self-contained group of variables within a hierarchical data structure. They often share one or more dimensions and represent the unit that can be opened by a data access library (e.g., xarray) ==== metadata Structured information describing the content, context, and semantics of datasets, variables, and attributes. GeoZarr metadata includes CF attributes, geotransform definitions, and links to STAC metadata where applicable. -==== multiscale group +==== overview -A group that contains child groups representing the same data at different resolutions, where each child group is a <>. The multiscale group includes metadata describing the relationship between resolution levels. A multiscale group can be initialized with a single dataset and expanded with additional resolution levels over time. +A downscaled representation of a variable that facilitates rapid data display and efficient zooming. Overviews provide lower-resolution versions of the original data, enabling quick visualization and access without reading the full-resolution array. Multiple overview levels may be generated to support progressive rendering across different scales. ==== store @@ -45,13 +58,7 @@ A system that provides storage and retrieval operations for Zarr hierarchies, as A spatial tiling scheme defined by a hierarchy of zoom levels and consistent grid parameters (e.g., scale, CRS). Tile Matrix Sets enable spatial indexing and tiling of gridded data. 
-==== transform - -An affine transformation used to convert between grid coordinates and geospatial coordinates, typically defined using the GDAL GeoTransform convention. - -==== Unified Data Model (UDM) -A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. The Unified Data Model provides a standardized framework for expressing spatial relationships, coordinate systems, and scientific metadata. === Abbreviated Terms From d10da7546289579ee17e2013deb09cf34f548a7c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Fri, 10 Oct 2025 17:00:54 +0200 Subject: [PATCH 13/18] Stressed out the core aspects of GeoZarr in the abstract --- standard/template/sections/clause_0_front_material.adoc | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/standard/template/sections/clause_0_front_material.adoc b/standard/template/sections/clause_0_front_material.adoc index 59f3afb..d6ba687 100644 --- a/standard/template/sections/clause_0_front_material.adoc +++ b/standard/template/sections/clause_0_front_material.adoc @@ -9,7 +9,11 @@ This Standard has been developed in collaboration with contributors from Earth o Zarr provides efficient chunked storage for n-dimensional arrays but do not provide with the semantic constructs required for geospatial and scientific data workflows. -GeoZarr defines an abstract data model and a set of conventions for representing geospatial and scientific datasets in the Zarr format. It builds on the Unidata Common Data Model (CDM), which provides the conceptual structure for organising variables, groups, coordinates, and metadata. GeoZarr specifies how these CDM concepts are encoded in Zarr, standardising practices already used by libraries such as xarray and nczarr. 
It also extends the CDM with features for geospatial workflows, including multiscale overviews and affine transformations, while remaining compatible with community metadata standards such as CF, GeoTIFF, and GDAL. +GeoZarr defines an abstract data model and a set of conventions for representing geospatial and scientific datasets in the Zarr format: + +- Bridges the Unidata Common Data Model (CDM) and the Zarr format by defining how the semantic constructs of the CDM are represented within Zarr’s storage model. +- Supports community metadata standards such as CF, GeoTIFF, and GDAL. +- Extends the CDM for geospatial use through multiscale overviews and affine transformations. By providing a standardized framework for geospatial semantics, GeoZarr enables scientific and geospatial applications to fully utilize cloud-native storage architectures while maintaining the rich metadata and coordinate referencing required for Earth observation workflows. The result is a modern, scalable approach to storing and accessing geospatial data that meets the needs of both data providers and consumers. From 4c94aae5b338072ca55e6e82f646b11f2d626443 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Wed, 15 Oct 2025 12:10:30 +0200 Subject: [PATCH 14/18] refined dataset, and variable group. --- .../template/sections/clause_4_terms_and_definitions.adoc | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/standard/template/sections/clause_4_terms_and_definitions.adoc b/standard/template/sections/clause_4_terms_and_definitions.adoc index 1a1905b..47d75eb 100644 --- a/standard/template/sections/clause_4_terms_and_definitions.adoc +++ b/standard/template/sections/clause_4_terms_and_definitions.adoc @@ -40,7 +40,7 @@ An index axis along which arrays are organised.
Dimensions provide a naming and ==== dataset -While this term is overloaded and avoided in this document, a dataset usually represent a self-contained group of variables within a hierarchical data structure. They often share one or more dimensions and represent the unit that can be opened by a data access library (e.g., xarray) +*Avoid using:* this term is overloaded and avoided in this document. A dataset usually represents a self-contained group of variables within a hierarchical data structure. These variables often share one or more dimensions, and the group is the unit that can be opened by a data access library (see <<variable-group>>). ==== metadata @@ -58,6 +58,11 @@ A system that provides storage and retrieval operations for Zarr hierarchies, as A spatial tiling scheme defined by a hierarchy of zoom levels and consistent grid parameters (e.g., scale, CRS). Tile Matrix Sets enable spatial indexing and tiling of gridded data. +[[variable-group]] +==== variable group + +A variable group is a container that includes a coherent collection of variables sharing the same dimensional structure and coordinate system (and may contain additional variables or subgroups). It is conceptually equivalent to an xarray Dataset.
+ === Abbreviated Terms From 73bf15749e1d49431efe2f9f9d5b772eeac6fdf0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Wed, 15 Oct 2025 17:42:13 +0200 Subject: [PATCH 15/18] GeoZArr data model refactoring --- .../sections/clause_7_unified_data_model.adoc | 210 ++++++++++++++---- 1 file changed, 172 insertions(+), 38 deletions(-) diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index a1317ac..868394e 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -1,5 +1,169 @@ [obligation==informative] +[[data-model]] +== GeoZarr Data Model + +=== Scope and Purpose + +The GeoZarr Data Model defines the abstract structure for representing geospatial and scientific gridded data within the GeoZarr framework. + +The GeoZarr Data Model serves the following purposes: + +* to **clarify the role of the Common Data Model (CDM)** as the base structural layer (recognising its suitability for Zarr as demonstrated by implementations such as *xarray*, *GDAL*, and *nczarr*); +* to **define how higher-level metadata conventions**—including the *Climate and Forecast (CF) Conventions* and other geospatial standards—are supported or partially supported within the CDM framework; +* to **extend the CDM** with additional geospatial capabilities not yet covered by the CDM or CF conventions, including *affine transformations* and *multiscale overviews*; + +=== Conceptual Basis + +The GeoZarr Data Model adopts and extends the **Unidata Common Data Model (CDM)** as its conceptual foundation. +The CDM defines a hierarchy of *groups*, *variables*, *dimensions*, and *attributes* that together describe the logical organisation of scientific data. + +GeoZarr reuses this structure directly and introduces targeted extensions required for geospatial and cloud-native applications. 
+These extensions are aligned with existing community standards and ensure continuity with established data models and tools. + +The model *integrates* the following conceptual layers: + +* **CDM Core Concepts** – Variables, dimensions, coordinates, and attributes providing the abstract representation of scientific data. +* **CF Conventions** – Optional but recommended layer with metadata rules describing physical meaning, coordinate systems, and units, enabling semantic interoperability. +* **GeoZarr Extensions** – Constructs for: + + ** *Affine transformations* defining spatial reference through linear mapping between array indices and real-world coordinates; + ** *Overviews* enabling multiscale representations and efficient visualization of large datasets. + +=== Format Independence + +Although the model was defined to support encoding in **Zarr**, it remains **format-agnostic** at the conceptual level. +Implementations may serialise the same GeoZarr Data Model structure into other compatible encodings, such as NetCDF or alternative object-based formats, provided they preserve the semantics and conformance requirements defined herein. + +This separation between **conceptual model** and **physical encoding** ensures that GeoZarr can evolve alongside emerging storage technologies while maintaining interoperability with existing CDM- and CF-based infrastructures. + +=== Conceptual Layers + +The GeoZarr Data Model is organised into three conceptual layers: + +1. the **Common Data Model (CDM)** – structural foundation; +2. the **CF Conventions** – semantic metadata layer; +3. the **GeoZarr Extensions** – additional geospatial capabilities. + +==== Common Data Model (CDM) + +The **Unidata Common Data Model (CDM)** defines the logical structure of scientific datasets through a hierarchy of **Groups**, **Variables**, **Dimensions**, and **Attributes**. 
+It provides the foundation upon which GeoZarr and many existing libraries (such as *xarray*, *GDAL*, and *nczarr*) operate. + +[plantuml, target="cdm_structure_overview", format=svg] +---- +@startuml +class Group { +} + +class Variable { +} + +class Dimension { +} + +class Attribute { +} + +class DataArray { +} + +class CoordinateVariable { +} + +class Subgroup { +} + +Group "1" *-- "0..*" Variable +Group "1" *-- "0..*" Dimension +Group "1" *-- "0..*" Attribute +Group "1" *-- "0..*" Subgroup +Variable "1" *-- "1" DataArray +Variable <|-- CoordinateVariable + +@enduml +---- + + +* A **Group** is a container that may include variables, dimensions, attributes, and subgroups. +* A **Variable** represents a multidimensional array associated with one or more dimensions and attributes. +* A **Dimension** defines an index axis used to organise data within variables. +* An **Attribute** holds descriptive metadata for groups or variables. +* A **Coordinate Variable** supplies coordinate values along a dimension, establishing spatial or temporal context. +* A **Data Array** represents observed or simulated phenomena, associated with dimensions and coordinate variables. + +This structure enables consistent representation of scientific data independently of storage format, providing the base semantic framework for all GeoZarr encodings. + +==== CF Conventions + +The **Climate and Forecast (CF) Conventions** build upon the CDM by defining a controlled vocabulary and metadata rules for expressing the *physical meaning* of variables and coordinates. +Because CF is explicitly based on CDM, GeoZarr can adopt CF semantics directly.
+ +Key CF concepts supported within the GeoZarr Data Model include: + +* **Coordinate and auxiliary coordinate variables** describing spatial and temporal axes; +* **Standard names**, **units**, and **cell methods** providing scientific context; +* **Grid mapping variables** defining coordinate reference systems and projection parameters; +* **Attributes** conveying data provenance and conventions compliance. + +GeoZarr does not alter CF semantics but clarifies their representation in CDM-structured hierarchies, enabling partial or full CF compliance depending on dataset complexity. + +[plantuml, target="cf_cdm_overview", format=svg] +---- +@startuml +left to right direction + +package "Common Data Model (CDM)" { + class Group + class Variable + class Dimension + class Attribute + + Group "1" *-- "0..*" Variable + Group "1" *-- "0..*" Dimension + Group "1" *-- "0..*" Attribute + Variable "1" *-- "1..*" Dimension + Variable "1" *-- "0..*" Attribute +} + +' Force CF Conventions package to be vertical +package "CF Conventions" { + together { + class CoordinateVariable + class AuxiliaryCoordinateVariable + } + + Variable <|-- CoordinateVariable + Variable <|-- AuxiliaryCoordinateVariable + +} + +@enduml + +---- + + + +==== GeoZarr Extensions + +GeoZarr extends the CDM with additional geospatial constructs required for cloud-native applications: + +* **Affine transformations** — define the mapping between array indices and real-world coordinates using linear coefficients. + This enables compact georeferencing for regularly gridded data. +* **Multiscale overviews** — represent downsampled versions of variables for efficient visualisation and scalable access. + Overviews are structured as subordinate variable groups sharing the same coordinate system. +* **Variable groups** — introduce a logical grouping of variables with identical dimensionality and coordinate context, supporting compound geospatial datasets (e.g., multi-band imagery, time series). 
+ +All extensions remain aligned with the CDM hierarchy and are encoded using the same core constructs (groups, variables, and attributes). +Together, they provide the minimal geospatial extensions necessary for efficient, standards-based representation of Earth observation and scientific data in cloud environments. + +// include::clause_7_part_overviews.adoc[] + + +==== +WARNING - *To be removed* +==== + == Unified Data Model === Scope and Purpose @@ -109,6 +273,10 @@ The model represents datasets as abstract compositions of dimensions, coordinate ==== Hierarchy Structure +==== +To be reviewed +==== + A Zarr hierarchy conforming to the Unified Data Model (UDM) is structured as a tree rooted at a top-level group. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections. Each <> comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions: @@ -275,45 +443,15 @@ The *Overviews* construct is applicable to any gridded data variable with at lea The structure may be extended for N-dimensional datasets in future revisions, provided that two spatial axes can be unambiguously identified. -=== Conformance and Extensibility - -The GeoZarr data model is designed with an open conformance approach to support a wide range of use cases and implementation contexts. Its core model is permissive, allowing partial implementations, while optional extensions and compliance profiles can define stricter requirements for interoperability. - -==== Core Conformance - -- Datasets conforming to the core model must: -* Represent data using CDM-compatible constructs (dimensions, variables, attributes). -* Follow attribute conventions where applicable. -* Be parsable as valid Zarr with structured metadata following this specification. -- CF compliance is not mandatory but is recommended for semantic interoperability. 
-==== Extension Conformance - -- Implementations may optionally support one or more extension modules: -* Multi-resolution overviews (Tile Matrix Set) -* GeoTransform metadata (GDAL) -* STAC metadata integration - -- Each extension defines its own requirement class with validation rules and expected metadata structures. - -- Tools may advertise which extensions they support and validate datasets accordingly. - -==== Conformance Classes - -- Conformance Classes may be defined to specify required components and extensions for specific application domains (e.g., visualisation clients, EO archives, catalogue indexing). -- Conformance Classes enable selective validation without constraining the general model. - -==== Extensibility Principles - -- All extensions must preserve compatibility with the core model and avoid redefining existing CDM or CF semantics. -- New extensions should be documented with clear identifiers, schemas, and conformance criteria. -- The model encourages interoperability by allowing tools to interpret unknown extensions without failure. - -This extensibility framework supports both minimum-viable use and high-fidelity metadata integration, enabling incremental adoption across the geospatial and scientific data communities. === Interoperability Considerations +==== +TBD: this section should be consolidated or removed +==== + Interoperability is a core objective of the GeoZarr Unified Data Model. The model is designed to bridge diverse Earth observation and scientific data ecosystems by enabling structural and semantic compatibility with established formats and standards, while providing a forward-looking foundation for scalable, cloud-native workflows. This section outlines the principles and mechanisms supporting interoperability across formats, tools, and communities. 
@@ -334,11 +472,7 @@ The data model is explicitly aligned with foundational standards including the U These mappings facilitate round-trip transformations and enable toolchains that consume or produce multiple formats without reengineering semantic models. -==== Semantic Interoperability - -Semantic interoperability is supported through adherence to CF conventions, use of standardised attribute names (e.g., `standard_name`, `units`), and alignment with metadata vocabularies used in other ecosystems (e.g., STAC, EPSG codes, ISO 19115 keywords). -The model does not prescribe specific vocabularies beyond CF but encourages reuse and recognition of widely accepted descriptors to promote cross-domain understanding. ==== Metadata and Discovery Integration From ed0b3866e608399aab4826bdc4f57676133f19bb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Fri, 17 Oct 2025 12:26:19 +0200 Subject: [PATCH 16/18] adapted CF description to unidata feedback (mail) --- .../template/sections/clause_7_unified_data_model.adoc | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 868394e..0bb8c5a 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -24,7 +24,7 @@ These extensions are aligned with existing community standards and ensure contin The model *integrates* the following conceptual layers: * **CDM Core Concepts** – Variables, dimensions, coordinates, and attributes providing the abstract representation of scientific data. -* **CF Conventions** – Optional but recommended layer with metadata rules describing physical meaning, coordinate systems, and units, enabling semantic interoperability. 
+* **CF Conventions** – A domain-specific, encoding-independent data model that builds upon the same conceptual principles as CDM and provides detailed semantics for geophysical variables; * **GeoZarr Extensions** – Constructs for: ** *Affine transformations* defining spatial reference through linear mapping between array indices and real-world coordinates; @@ -45,8 +45,10 @@ The GeoZarr Data Model is organised into three conceptual layers: 2. the **CF Conventions** – semantic metadata layer; 3. the **GeoZarr Extensions** – additional geospatial capabilities. + ==== Common Data Model (CDM) + The **Unidata Common Data Model (CDM)** defines the logical structure of scientific datasets through a hierarchy of **Groups**, **Variables**, **Dimensions**, and **Attributes**. It provides the foundation upon which GeoZarr and many existing libraries (such as *xarray*, *GDAL*, and *nczarr*) operate. @@ -96,8 +98,9 @@ This structure enables consistent representation of scientific data independentl ==== CF Conventions -The **Climate and Forecast (CF) Conventions** build upon the CDM by defining a controlled vocabulary and metadata rules for expressing the *physical meaning* of variables and coordinates. -Because CF is explicitly based on CDM, GeoZarr can adopt CF semantics directly. +The **Climate and Forecast (CF) Conventions** define a data model and metadata vocabulary for describing the *physical meaning* of variables and coordinates in geoscientific datasets. + +The CF data model is **encoding independent** and not formally derived from the CDM, but the two are **conceptually compatible**. Because CF conventions evolved from *netCDF* practices (which themselves align with CDM principles), CF constructs can be represented naturally within a CDM structure. 
Key CF concepts supported within the GeoZarr Data Model include: From d53e08a740f38087bc2989bb7701aa3fbb61e411 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Fri, 17 Oct 2025 14:46:18 +0200 Subject: [PATCH 17/18] restrict to the minimum of what we do for the first version, to be extended if new topic are added. --- .../sections/clause_7_unified_data_model.adoc | 415 +++--------------- 1 file changed, 54 insertions(+), 361 deletions(-) diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 0bb8c5a..d369d62 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -9,24 +9,17 @@ The GeoZarr Data Model defines the abstract structure for representing geospatia The GeoZarr Data Model serves the following purposes: -* to **clarify the role of the Common Data Model (CDM)** as the base structural layer (recognising its suitability for Zarr as demonstrated by implementations such as *xarray*, *GDAL*, and *nczarr*); -* to **define how higher-level metadata conventions**—including the *Climate and Forecast (CF) Conventions* and other geospatial standards—are supported or partially supported within the CDM framework; -* to **extend the CDM** with additional geospatial capabilities not yet covered by the CDM or CF conventions, including *affine transformations* and *multiscale overviews*; +* to **clarify the role of the Common Data Model (CDM)** as the structural foundation (recognising its suitability for Zarr as demonstrated by *xarray*, *GDAL*, and *nczarr*); +* to **extend the CDM** with additional geospatial capabilities required for cloud-native and tiled data representations, including *affine transformations* and *multiscale overviews*; +* to **ensure compatibility** with established data models and conventions such as *netCDF*, *CF*, *GDAL*, and *GeoTIFF*. 
+ === Conceptual Basis The GeoZarr Data Model adopts and extends the **Unidata Common Data Model (CDM)** as its conceptual foundation. The CDM defines a hierarchy of *groups*, *variables*, *dimensions*, and *attributes* that together describe the logical organisation of scientific data. -GeoZarr reuses this structure directly and introduces targeted extensions required for geospatial and cloud-native applications. -These extensions are aligned with existing community standards and ensure continuity with established data models and tools. - -The model *integrates* the following conceptual layers: - -* **CDM Core Concepts** – Variables, dimensions, coordinates, and attributes providing the abstract representation of scientific data. -* **CF Conventions** – A domain-specific, encoding-independent data model that builds upon the same conceptual principles as CDM and provides detailed semantics for geophysical variables; -* **GeoZarr Extensions** – Constructs for: - +GeoZarr reuses these constructs and introduces an additional layer of **GeoZarr Extensions** that provide explicit geospatial semantics and support for cloud-native scalability: ** *Affine transformations* defining spatial reference through linear mapping between array indices and real-world coordinates; ** *Overviews* enabling multiscale representations and efficient visualization of large datasets. @@ -39,16 +32,14 @@ This separation between **conceptual model** and **physical encoding** ensures t === Conceptual Layers -The GeoZarr Data Model is organised into three conceptual layers: +The GeoZarr Data Model is organised into two conceptual layers: -1. the **Common Data Model (CDM)** – structural foundation; -2. the **CF Conventions** – semantic metadata layer; -3. the **GeoZarr Extensions** – additional geospatial capabilities. +1. the **Common Data Model (CDM)** – structural foundation for multidimensional data; +2. 
the **GeoZarr Extensions** – additional constructs for geospatial semantics and multiscale representations. ==== Common Data Model (CDM) - The **Unidata Common Data Model (CDM)** defines the logical structure of scientific datasets through a hierarchy of **Groups**, **Variables**, **Dimensions**, and **Attributes**. It provides the foundation upon which GeoZarr and many existing libraries (such as *xarray*, *GDAL*, and *nczarr*) operate. @@ -96,387 +87,89 @@ Variable <|-- CoordinateVariable This structure enables consistent representation of scientific data independently of storage format, providing the base semantic framework for all GeoZarr encodings. -==== CF Conventions -The **Climate and Forecast (CF) Conventions** define a data model and metadata vocabulary for describing the *physical meaning* of variables and coordinates in geoscientific datasets. +==== GeoZarr Extensions -The CF data model is **encoding independent** and not formally derived from the CDM, but the two are **conceptually compatible**. Because CF conventions evolved from *netCDF* practices (which themselves align with CDM principles), CF constructs can be represented naturally within a CDM structure. +GeoZarr extends the CDM with additional geospatial constructs required for cloud-native applications: -Key CF concepts supported within the GeoZarr Data Model include: +* **Affine transformations** — define the mapping between array indices and real-world coordinates using linear coefficients. + This enables compact georeferencing for regularly gridded data. +* **Multiscale overviews** — represent downsampled versions of variables for efficient visualisation and scalable access. + Overviews are structured as subordinate variable groups sharing the same coordinate system. 
-* **Coordinate and auxiliary coordinate variables** describing spatial and temporal axes; -* **Standard names**, **units**, and **cell methods** providing scientific context; -* **Grid mapping variables** defining coordinate reference systems and projection parameters; -* **Attributes** conveying data provenance and conventions compliance. +All extensions remain aligned with the CDM hierarchy and are encoded using the same core constructs (groups, variables, and attributes). +Together, they provide the minimal geospatial extensions necessary for efficient, standards-based representation of Earth observation and scientific data in cloud environments. -GeoZarr does not alter CF semantics but clarifies their representation in CDM-structured hierarchies, enabling partial or full CF compliance depending on dataset complexity. -[plantuml, target="cf_cdm_overview", format=svg] +[plantuml, target="geozarr_extension_overview", format=svg] ---- @startuml -left to right direction +skinparam classAttributeIconSize 0 +skinparam linetype ortho +skinparam packageStyle rectangle +skinparam backgroundColor #FFFFFF +title GeoZarr Extensions – Geospatial Enhancements package "Common Data Model (CDM)" { class Group class Variable class Dimension class Attribute - - Group "1" *-- "0..*" Variable - Group "1" *-- "0..*" Dimension - Group "1" *-- "0..*" Attribute - Variable "1" *-- "1..*" Dimension - Variable "1" *-- "0..*" Attribute } -' Force CF Conventions package to be vertical -package "CF Conventions" { - together { - class CoordinateVariable - class AuxiliaryCoordinateVariable - } - - Variable <|-- CoordinateVariable - Variable <|-- AuxiliaryCoordinateVariable - +package "GeoZarr Extensions" { + class AffineTransform + class Overview } -@enduml +Variable --> AffineTransform : georeference +Group --> Overview : provides +Variable --> Overview : provides +@enduml ---- - -==== GeoZarr Extensions - -GeoZarr extends the CDM with additional geospatial constructs required for 
cloud-native applications: - -* **Affine transformations** — define the mapping between array indices and real-world coordinates using linear coefficients. - This enables compact georeferencing for regularly gridded data. -* **Multiscale overviews** — represent downsampled versions of variables for efficient visualisation and scalable access. - Overviews are structured as subordinate variable groups sharing the same coordinate system. -* **Variable groups** — introduce a logical grouping of variables with identical dimensionality and coordinate context, supporting compound geospatial datasets (e.g., multi-band imagery, time series). - -All extensions remain aligned with the CDM hierarchy and are encoded using the same core constructs (groups, variables, and attributes). -Together, they provide the minimal geospatial extensions necessary for efficient, standards-based representation of Earth observation and scientific data in cloud environments. - // include::clause_7_part_overviews.adoc[] -==== -WARNING - *To be removed* -==== +=== Interoperability with Other Frameworks -== Unified Data Model - -=== Scope and Purpose +The Common Data Model (CDM), with its flexible hierarchy of groups, variables, dimensions, and attributes allows direct representation of metadata constructs used across multiple scientific and geospatial standards. -This Standard defines the Unified Data Model (UDM) that provides a conceptual framework for representing geospatial and scientific data in Zarr. The purpose of this model is to support standards-based interoperability across Earth observation systems and analytical environments, while preserving compatibility with existing data models and software ecosystems. +This design enables the CDM—and therefore GeoZarr—to act as a *host model* for conventions and metadata originating from other frameworks, while preserving their semantics within a unified structure. 
-The Unified Data Model operates within the Zarr framework, where a <> provides the storage and retrieval interface, and a hierarchy defines the logical organization of groups and arrays within that store. GeoZarr hierarchies are stored in and accessed through Zarr stores, which can be implemented using various storage technologies such as filesystems, cloud object storage, or databases. +* **netCDF and the Enhanced Data Model** – +The netCDF Enhanced Data Model and the CDM share common origins and are conceptually aligned. +Both organise data into variables, dimensions, and attributes. +As a result, most netCDF datasets can be represented as CDM hierarchies without loss of structure or metadata. +Conversely, GeoZarr datasets that follow the CDM pattern can be serialised as valid netCDF encodings. -The Unified Data Model incorporates and extends the following established specifications and community standards: +* **CF Conventions** – +The CF data model is encoding independent but conceptually compatible with the CDM. +CF metadata constructs—such as coordinate and auxiliary coordinate variables, standard names, units, and grid mappings—map directly onto CDM variables and attributes. +This allows GeoZarr datasets to incorporate CF semantics naturally, achieving partial or full CF compliance without modifying the underlying data model. -- **Unidata Common Data Model (CDM)** – Provides the foundational resource structure for scientific datasets, encompassing dimensions, coordinate systems, variables, and associated metadata elements. -- **CF (Climate and Forecast) Conventions** – Defines a widely adopted metadata profile for describing spatiotemporal semantics in CDM-based datasets. -- **Selected constructs from related Standards and practices**, including: - - The **OGC Tile Matrix Set Standard**, which enables multi-resolution representations of gridded data. - - **GDAL geotransform metadata**, used to express affine transformations and interpolation characteristics. 
- - **SpatioTemporal Asset Catalog (STAC)** metadata elements for resource discovery and cataloguing (Collection and Item constructs). +* **GDAL Metadata and Geotransform** – +GDAL expresses georeferencing through affine transformation coefficients and projection information. +These map directly to GeoZarr extension attributes (for affine transforms and CRS) stored as CDM attributes. +GDAL domain metadata can likewise be represented as CDM attributes within groups or variables, maintaining equivalence between GDAL and GeoZarr geospatial metadata. -The Unified Data Model is format-agnostic and describes the abstract structure of resources independently of the physical encoding. It does not redefine the semantics of the CDM or CF conventions, but introduces integration and extension points required to support tiled multiscale data, geospatial referencing, and metadata for discovery. +* **GeoTIFF Tags and Metadata** – +GeoTIFF georeferencing information, including coordinate reference system definitions, tie points, and pixel scale, correspond closely to the affine transform and CRS constructs in the GeoZarr Extensions. +These elements can be represented as attributes within CDM-compliant groups and variables, ensuring semantic consistency between file-based and cloud-native representations. -This clause specifies the logical composition of the Unified Data Model, the external standards it leverages, and the conformance points that facilitate harmonised implementation within the GeoZarr framework. - -=== Foundational Model and Standards Reuse - -GeoZarr adopts established data model concepts because Zarr itself provides only array storage without semantic interpretation. The Unidata Common Data Model (CDM) provides the conceptual framework for understanding dimensions, variables, and attributes, while CF Conventions provide standardized metadata semantics. This reuse ensures compatibility with existing scientific software while avoiding reinvention of proven concepts. 
- -==== Common Data Model (CDM) +Through these mappings, the CDM acts as a **common semantic framework** that integrates metadata from diverse geospatial standards. +This interoperability ensures that GeoZarr can serve as both a native storage model and a bridge between existing ecosystems such as netCDF/CF, GDAL, and GeoTIFF. -The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the Unified Data Model: - -- **Dimensions** – Integer-valued, named axes that define the extents of data variables. -- **Coordinate Variables** – Variables that supply coordinate values along dimensions, establishing spatial or temporal context. -- **Data Variables** – Multidimensional arrays representing observed or simulated phenomena, associated with dimensions and coordinate variables. -- **Attributes** – Key-value metadata elements used to describe variables and datasets semantically. -- **Groups** – Optional hierarchical containers enabling logical organisation of resources and metadata. - -The Unified Data Model adopts these CDM components with adaptations for Zarr's type system. While the conceptual structure remains consistent with the original CDM specification, attribute types are mapped to Zarr's JSON-compatible type system. GeoZarr structures preserve CDM semantics while conforming to Zarr's encoding constraints. - -==== CF Conventions - -The CF Conventions specify standardised metadata attributes and practices to describe spatiotemporal context within CDM-compliant datasets. These conventions support consistent interpretation of: - -- Coordinate systems -- Grid mappings -- Physical units -- Standard variable naming - -The Unified Data Model supports CF-compliant metadata, including attributes such as `standard_name`, `units`, and `grid_mapping`. The Unified Data Model does not prescribe CF compliance but enables it through permissive design. 
Partial adoption of CF attributes is supported, and non-compliant datasets may selectively adopt CF metadata as needed. - -==== Standards-Based Extensions - -To support additional capabilities, the model defines optional extension points referencing external OGC and community standards: - -- **OGC Tile Matrix Set** – Facilitates the definition of multiscale grid hierarchies for raster overviews. -- **GDAL Geotransform** – Enables geospatial referencing through affine transformations and optional interpolation specifications. -- **STAC Metadata (Collection and Item)** – Provides linkage to SpatioTemporal Asset Catalogs for resource discovery and indexing. - -These extensions are integrated in a modular fashion and do not alter the core semantics of the CDM or CF structures. Implementations may selectively adopt these extensions based on their application requirements. - -=== Model Extension Points - -The Unified Data Model specifies a series of optional, standards-aligned extension points to support functionality beyond the base CDM and CF constructs. These extensions enhance applicability to Earth observation and spatial analysis use cases without imposing additional mandatory requirements. - -Each extension is defined as an independent module. Implementation of any given extension does not necessitate support for others. - -==== Multi-Resolution Overviews (OGC Tile Matrix Set) - -Support for multi-resolution imagery is enabled via integration with the OGC Tile Matrix Set Standard: - -- Tile matrix sets define spatial tiling schemes with consistent resolutions and coordinate reference systems across zoom levels. -- Overviews may be represented as separate Zarr arrays or groups, each aligned to a specific tile matrix level. -- Metadata includes identifiers for tile matrices, spatial resolution, and spatial alignment. - -This approach aligns with the OGC API – Tiles and enables efficient access to large gridded datasets. 
- -==== GeoTransform Metadata (GDAL Interpolation and Affine Transform) - -Geospatial referencing can be further refined through the inclusion of metadata consistent with GDAL conventions: - -- Affine transformation is specified via the `GeoTransform` attribute or equivalent structures. -- Interpolation methods may be declared to indicate sampling behaviour or sub-pixel alignment strategies. - -This extension augments CF grid mappings by providing precise control over grid placement and coordinate transformations. - -==== STAC Collection and Item Integration - -To enable discovery of resources within the hierarchical structure of the data model, this Standard supports the inclusion of STAC metadata elements at appropriate locations within the group hierarchy. - -A STAC extension consists of embedding or referencing STAC Collection and Item metadata within the data model: - -* Each hierarchy MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object. -* STAC properties such as `datetime`, `bbox`, and `eo:bands` MAY be included in the metadata to enable spatial, temporal, and spectral filtering. -* The structure is compatible with external STAC APIs and metadata harvesting systems. - -STAC integration is non-intrusive and modular. It does not impose changes on the internal organisation of the hierarchy and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities. - - -==== Modularity and Interoperability - -Each extension point is specified independently. Implementations may advertise support for one or more extensions by declaring conformance to corresponding extension modules. This modularity facilitates incremental adoption, promotes reuse, and enhances interoperability across varied implementation environments. - - -=== Unified Data Model Structure - -This clause defines the structural organisation of Zarr hierarchies conforming to the Unified Data Model (UDM). 
It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. - -The model represents datasets as abstract compositions of dimensions, coordinate variables, data variables, and associated metadata. This abstraction ensures that applications and services can reason about the content and semantics of a dataset without reliance on storage layout or specific serialisation. - -==== Hierarchy Structure - -==== -To be reviewed +[NOTE] ==== +GeoZarr does **not** define the mappings to CDM for metadata from existing conventions or formats such as CF, netCDF, GDAL, or GeoTIFF. +These mappings are already established and maintained by widely used libraries and implementations, including *xarray*, *netCDF-Java*, *GDAL*, etc. The role of GeoZarr is to provide a **data model and encoding framework**, not to redefine or replicate existing translation logic between metadata standards. -A Zarr hierarchy conforming to the Unified Data Model (UDM) is structured as a tree rooted at a top-level group. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections. - -Each <> comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions: - -- **Dimensions** – Named, integer-valued axes defining the extent of data variables. Examples include `time`, `x`, `y`, and `band`. -- **Coordinate Variables** – Arrays that supply coordinate values along dimensions, providing spatial, temporal, or contextual referencing. These may be scalar or higher-dimensional, depending on the referencing scheme. -- **Data Variables** – Multidimensional arrays representing physical measurements or derived products. 
Defined over one or more dimensions, these variables are associated with coordinate variables and annotated with metadata. -- **Attributes** – Key-value pairs attached to variables or dataset components. Attributes convey semantic information such as units, standard names, and geospatial metadata. - -A Zarr hierarchy is a tree structure, where each node in the tree is either a group or an array. Group nodes may have children but array nodes may not. This supports the logical subdivision by theme, resolution, or processing stage, and enhances the clarity and reusability of complex geospatial structures. - -The diagram below represents the structural layer of the Unified Data Model, derived from the Unidata Common Data Model, which serves as the foundational framework for supporting all overlaying model layer. - -//image::udm-core.png[] - -//ifdef::never-shown[] -//Note: Hide until plantuml is supported -.Conformance-class model -[plantuml, cdm_model, svg, opts="debug"] -.... -@startuml CDM_DAL_Object_Model - -class Store { - + String location - + open() - + close() -} - -class Hierarchy { - + String name -} - -class Group { - + String name -} - -class Dataset { -} - -class Dimension { - + String name - + int length - + boolean isUnlimited - + boolean isShared -} - -class Variable { - + String name - + read() -} - -class DataType { - + String name - <> -} - -class Attribute { - + String name - + String type - + List values -} - -Store "1" ..|> Hierarchy : implements -Hierarchy "1" *-- "1" Group : has root -Group --* Group : is part of -Dataset -up-|> Group : is a -Dataset --> "*" Variable : contains -Dimension --* "*" Dataset : is shared in -Group *-- "*" Attribute -Dimension --o "*" Variable : define the shape of -Variable --> "1" DataType -Variable *-- "*" Attribute -@enduml -.... -//endif::never-shown[] - -Note that, conceptually, node within this hierarchy might be treated as a self-contained store. 
- -==== Coordinate Referencing - -Coordinate systems are defined using: - -- **CF Conventions** – Including attributes such as `standard_name`, `units`, `axis`, and `grid_mapping` to express spatiotemporal semantics and coordinate system properties. -- **Affine Transformation Extensions** – Optional support for georeferencing via affine transforms and interpolation metadata (e.g., as defined in GDAL practices), providing enhanced flexibility for irregular grids and grid-aligned imagery. - -The model accommodates both standard CF-compatible definitions and extended referencing mechanisms to support use cases that span scientific analysis and geospatial mapping. - -==== Metadata Integration - -Metadata may be declared at various levels within the model structure: - -- **Global Metadata** – Attributes describing the hierarchy as a whole, including elements such as `title`, `summary`, and `license`. -- **Variable Metadata** – Attributes associated with individual data or coordinate variables, conveying descriptive or semantic information. -- **Extension Metadata** – Structured metadata linked to optional model extensions (e.g., multiscale tiling, catalogue references, geotransform properties). - -All metadata follows harmonised naming and semantics consistent with the CDM and CF standards, enabling machine and human interpretability while supporting metadata exchange across diverse systems. - -==== Overviews - -The *Overviews* construct defines a formal, interoperable abstraction for multiscale gridded data. It ensures structural consistency across zoom levels and provides a semantic model for integration with tiled representations such as GeoTIFF overviews, OGC API – Tiles, and STAC Tiled Assets. - -===== Purpose - -The *Overviews* construct provides a general mechanism for associating a single logical data variable with a collection of resampled representations, referred to as *zoom levels*. 
Each zoom level holds a reduced-resolution version of the original variable, with progressively decreasing spatial resolution from the base (highest detail) to the coarsest level. - -Overviews enable: - -- Fast access to summary representations for visualisation -- Progressive transmission and downsampling -- Multi-resolution analytics and adaptive processing - -===== Conceptual Structure - -A <> contains child groups representing the data at different resolutions, where each child group is a <> following the Unified Data Model. It comprises the following components: - -[horizontal] -*Base Dataset*:: The original, highest-resolution dataset to which the multiscale hierarchy is anchored. -*Zoom Level Datasets*:: A sequence of datasets representing the same data as the base dataset, but sampled at coarser spatial resolutions. -*Zoom Level Identifier*:: A unique identifier associated with each level, ordered from finest (e.g. `"0"`) to coarsest resolution (e.g. `"N"`). -*Tile Grid Definition*:: A mapping that associates each zoom level with a spatial tiling layout, defined in alignment with a `TileMatrixSet`. -*Spatial Alignment*:: Each zoom-level dataset MUST be spatially aligned with the base dataset using a consistent coordinate reference system and compatible axis orientation. -*Resampling Method*:: A declared method indicating the technique used to derive coarser levels from the base dataset (e.g. `nearest`, `average`, `cubic`). - -===== Model Components - -The *Overviews* construct is represented in the Unified Data Model using the following logical elements: - -[cols="1,3"] -|=== -|Element |Definition - -|`OverviewSet` | A logical grouping of variables at multiple zoom levels associated with a single base variable. - -|`OverviewLevel` | A single resampled variable at a specific resolution, identified by a zoom level string. - -|`TileMatrixSetRef` | A reference to the tile grid specification applied across all overview levels. 
May refer to a well-known identifier, a URI, or an inline object. - -|`TileMatrixLimits` | (Optional) Constraints on the tile coverage per zoom level. - -|`resampling_method` | A string indicating the uniform method used to downsample data across all levels. -|=== - -All overview levels MUST preserve: - -- The data variable’s semantic identity (`standard_name`, `units`, etc.) -- The coordinate reference system -- The axis order and dimension semantics - -Only the resolution and extent (through tiling and shape) may differ across levels. - -===== Relationship to Tile Matrix Set - -The *Overviews* construct is structurally aligned with the OGC Tile Matrix Set concept. Each zoom level is mapped to a `TileMatrix`, and the chunk layout for the corresponding data variable SHALL match the tile grid’s `tileWidth` and `tileHeight`. - -The `OverviewSet` MAY constrain tile matrix limits using `TileMatrixSetLimits`, which restrict tile indices to actual data coverage, consistent with the spatial extent of the overview variable. - -===== Usage Context - -The *Overviews* construct is applicable to any gridded data variable with at least two spatial dimensions. It is primarily designed for: - -- Raster imagery (e.g. reflectance, temperature) -- Data cubes with spatial slices (e.g. time-series of spatial grids) -- Multi-band products with consistent spatial structure across levels - -The structure may be extended for N-dimensional datasets in future revisions, provided that two spatial axes can be unambiguously identified. - - - - -=== Interoperability Considerations - -==== -TBD: this section should be consolidated or removed +Accordingly, the **GeoZarr encoding specification** will only prescribe additional rules where a specific encoding behaviour in **Zarr** is required for interoperability or conformance. ==== -Interoperability is a core objective of the GeoZarr Unified Data Model. 
The model is designed to bridge diverse Earth observation and scientific data ecosystems by enabling structural and semantic compatibility with established formats and standards, while providing a forward-looking foundation for scalable, cloud-native workflows. - -This section outlines the principles and mechanisms supporting interoperability across formats, tools, and communities. - -==== Format Mapping and Alignment - -The data model is explicitly aligned with foundational standards including the Unidata Common Data Model (CDM), the CF Conventions, and established practices in formats such as NetCDF and GeoTIFF. Where applicable, GeoZarr datasets may be derived from or transformed into these formats using consistent mappings. - -- *NetCDF (classic and enhanced models)*: -* GeoZarr shares a common conceptual structure with NetCDF via CDM. -* Variables, dimensions, coordinate systems, and attributes follow directly mappable patterns. -* Metadata expressed in CF conventions in NetCDF can be preserved in GeoZarr without loss of fidelity. - -- *GeoTIFF*: -* Raster-based datasets in GeoZarr can map to GeoTIFF by interpreting spatial referencing (via CF or GeoTransform) and band structures. -* Overviews aligned to OGC Tile Matrix Sets may correspond to TIFF image pyramids. -* Projection metadata and resolution information can be mapped via standard tags. - -These mappings facilitate round-trip transformations and enable toolchains that consume or produce multiple formats without reengineering semantic models. - - - ==== Metadata and Discovery Integration STAC compatibility enables integration with catalogue services for discovery and indexing. Datasets can expose STAC-compliant metadata alongside core metadata, supporting federated search and filtering via STAC APIs. 
From 4f68cb6680bb6769a898069c57faa3c2908a3f15 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Christophe=20No=C3=ABl?= Date: Fri, 17 Oct 2025 16:34:15 +0200 Subject: [PATCH 18/18] adapted encodings to CDM model --- standard/template/geozarr-spec.adoc | 12 +- .../sections/clause_9_zarr_encoding.adoc | 9 +- .../sections/clause_9_zarr_encoding_core.adoc | 171 +++++++++++------- 3 files changed, 116 insertions(+), 76 deletions(-) diff --git a/standard/template/geozarr-spec.adoc b/standard/template/geozarr-spec.adoc index 5e4c5d5..b8cfec5 100644 --- a/standard/template/geozarr-spec.adoc +++ b/standard/template/geozarr-spec.adoc @@ -45,23 +45,23 @@ include::sections/clause_6_informative_text.adoc[] include::sections/clause_7_unified_data_model.adoc[] -include::sections/clause_8_conformance.adoc[] +// Discarded: include::sections/clause_8_conformance.adoc[] include::sections/clause_9_zarr_encoding.adoc[] -include::sections/clause_10_geotiff_encoding.adoc[] +// include::sections/clause_10_geotiff_encoding.adoc[] //// add or remove annexes after "A" as necessary //// -include::sections/annex-a.adoc[] +//include::sections/annex-a.adoc[] -include::sections/annex-n.adoc[] +// include::sections/annex-n.adoc[] //// Revision History should be the last annex before the Bibliography Bibliography should be the last annex //// -include::sections/annex-history.adoc[] +// include::sections/annex-history.adoc[] -include::sections/annex-bibliography.adoc[] +//include::sections/annex-bibliography.adoc[] diff --git a/standard/template/sections/clause_9_zarr_encoding.adoc b/standard/template/sections/clause_9_zarr_encoding.adoc index d62edec..3de6155 100644 --- a/standard/template/sections/clause_9_zarr_encoding.adoc +++ b/standard/template/sections/clause_9_zarr_encoding.adoc @@ -1,9 +1,12 @@ -== Unified Data Model Encoding for Zarr +== Encodings for Zarr -This clause defines the encoding of the unified data model into the Zarr format. The encoding supports both Zarr Version 2 and Zarr Version 3. 
+This clause defines the normative mapping between the **GeoZarr Data Model** and the **Zarr storage format**. +It specifies how the structural elements of the **Common Data Model (CDM)**, namely groups, variables, dimensions, and attributes, are encoded in **Zarr v2** and **Zarr v3**, and identifies additional constraints introduced by GeoZarr. -TIP: This is a very preliminary draft. The content is primarily for demonstrating the purpose of the proposed sections. +GeoZarr’s encoding rules are limited to cases where explicit guidance is required for interoperability. +GeoZarr does **not** redefine how CF, GDAL, or other metadata conventions map to CDM constructs — these mappings are already implemented in community libraries such as *xarray*, *GDAL*, and *netCDF-Java*. +The GeoZarr encoding rules therefore focus on **CDM structure and semantics**, with additional subsections specifying any **Zarr-specific requirements** for supported metadata conventions. include::clause_9_zarr_encoding_core.adoc[] diff --git a/standard/template/sections/clause_9_zarr_encoding_core.adoc b/standard/template/sections/clause_9_zarr_encoding_core.adoc index 8a1972c..44683f5 100644 --- a/standard/template/sections/clause_9_zarr_encoding_core.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_core.adoc @@ -1,39 +1,36 @@ -=== Hierarchical Structure +=== Common Data Model Encodings -A hierarchy conforming to the Unified Data Model is structured as a tree of groups, variables (arrays), dimensions, and metadata. Following Zarr conventions, this hierarchy is rooted in a group, which may contain: - -- Arrays representing coordinate or data variables -- Child groups for modular organisation, including logical sub-collections or resolution levels -- Metadata attributes at group and array levels - -Each group adheres to a consistent structure, allowing recursive composition. This reflects the CDM's use of *groups* and is supported by both Zarr v2 and v3 with differing implementations.
+==== Hierarchical Structure +A GeoZarr hierarchy follows the CDM model of a tree of **groups**, **variables** (arrays), **dimensions**, and **attributes**. +Each Zarr store contains a single root group and an arbitrary number of child groups and arrays, organised recursively. [cols="1,2,2"] |=== -|Model Element |Zarr v2 Encoding |Zarr v3 Encoding - -|Root Group | Directory with `.zgroup` and `.zattrs` | Directory with `zarr.json`, with `node_type: group` - -|Child Group | Subdirectory with `.zgroup` and `.zattrs` | Subdirectory with `zarr.json`, with `node_type: group` - -|Array | Subdirectory with `.zarray` and `.zattrs` | Subdirectory with `zarr.json`, with `node_type: array` +|CDM Element |Zarr v2 Encoding |Zarr v3 Encoding -|Attributes | `.zattrs` file | `attributes` field in `zarr.json` +|Group | Directory with `.zgroup` and `.zattrs` | Directory containing `zarr.json` with `"node_type": "group"` +|Variable (Array) | Directory with `.zarray` and `.zattrs` | Directory containing `zarr.json` with `"node_type": "array"` +|Attributes | `.zattrs` file (JSON object) | `attributes` field in `zarr.json` |=== -Zarr v3 requires `zarr_format: 3` and stores all metadata (including user-defined attributes) in the `zarr.json` document. Each node includes a `node_type` field: either `"group"` or `"array"`. +Zarr v3 nodes must declare `"zarr_format": 3` and include `"node_type"` set to either `"group"` or `"array"`. +All user-defined metadata, including GeoZarr attributes, shall be placed within the `attributes` field. -=== Dimensions +==== Dimensions -Dimensions define the axes along which variables are indexed. +Dimensions define the index axes for variables. -- In Zarr v2, dimensions are inferred from array shape and declared in `_ARRAY_DIMENSIONS` within `.zattrs`. -- In Zarr v3, dimensions are stored using the `dimension_names` field in `zarr.json`. 
+[cols="1,2,2"] +|=== +|Aspect |Zarr v2 |Zarr v3 -Example for a 2D array with dimension names `["lat", "lon"]`: +|Declaration | `_ARRAY_DIMENSIONS` attribute in `.zattrs` | `dimension_names` field in `zarr.json` +|Scope | Implicit, per array | Explicit, per array; names are globally unique within a group hierarchy +|=== +Example (Zarr v3 array with two dimensions): [source,json] ---- { @@ -45,24 +42,31 @@ Example for a 2D array with dimension names `["lat", "lon"]`: } ---- -=== Coordinate Variables +**Shared Dimensions:** + +Zarr does not define dimension entities as standalone objects. +To preserve CDM semantics, GeoZarr requires that dimension names be **unique within each group hierarchy** and reused consistently across variables that share the same axis. -Coordinate variables (excluding GeoTransform Coordinates) define the geospatial or temporal context of data. They are represented as named arrays with metadata attributes. +As in **netCDF-4**, where groups can define their own local dimensions, GeoZarr allows dimensions to be scoped within groups. +When a dimension defined in one group is shared by variables located in descendant groups, implementations may indicate this relationship by prefixing the dimension name with a slash (e.g., `"/time"`). +In this context, the leading slash signifies that the dimension is defined in an **ancestor group**—not necessarily the root of the hierarchy—and should be interpreted as a shared axis accessible to all subordinate groups. -Coordinate variables are represented as named 1D arrays aligned with corresponding dimensions. + +==== Coordinate Variables + +Coordinate variables define the spatial, temporal, or other contextual axes for data variables. +They are stored as one-dimensional arrays associated with their corresponding dimensions. 
[cols="1,2,2"] |=== -|Feature |Zarr v2 |Zarr v3 +|Aspect |Zarr v2 |Zarr v3 |Storage | Zarr array with `.zarray`, `.zattrs` | Zarr array with `zarr.json` - |Dimension Binding | `_ARRAY_DIMENSIONS` in `.zattrs` | `dimension_names` in `zarr.json` - -|CF Metadata | `standard_name`, `units`, `axis` in `.zattrs` | Under `attributes` in `zarr.json` +|Metadata | CF-style attributes (e.g., `standard_name`, `units`, `axis`) | Same under `attributes` |=== -Example `zarr.json` for a coordinate array: +Example (Zarr v3 coordinate array): [source,json] ---- { @@ -71,12 +75,6 @@ Example `zarr.json` for a coordinate array: "shape": [180], "dimension_names": ["lat"], "data_type": "float32", - "chunk_grid": { - "name": "regular", - "configuration": { - "chunk_shape": [180] - } - }, "attributes": { "standard_name": "latitude", "units": "degrees_north", @@ -85,51 +83,53 @@ Example `zarr.json` for a coordinate array: } ---- +Coordinate variables may also reference *grid mapping* variables for coordinate reference systems, as defined in the CF conventions. -=== Data Variables +==== Data Variables -Data variables represent measured or derived quantities. They are stored as multidimensional arrays with metadata attributes. +Data variables represent primary measurements or derived quantities. +They are encoded as multidimensional arrays linked to one or more dimensions and accompanied by descriptive metadata. [cols="1,2,2"] |=== -|Feature |Zarr v2 |Zarr v3 - -|Storage | Multidimensional array with `.zarray` and `.zattrs` | Same structure; v3 supports additional chunk storage formats +|Aspect |Zarr v2 |Zarr v3 -|Dimension Association | `_ARRAY_DIMENSIONS` attribute | Same as v2 - -|CF Metadata | `standard_name`, `units`, `long_name`, `_FillValue`, etc. 
| Same as v2; v3 may support typed attributes +|Storage | Directory containing `.zarray` and `.zattrs` | Directory containing `zarr.json` with `"node_type": "array"` +|Dimension Binding | `_ARRAY_DIMENSIONS` attribute | `dimension_names` field +|Metadata | Attributes such as `standard_name`, `units`, `long_name`, `_FillValue`, `scale_factor`, `add_offset` | Same, with typed attributes permitted in v3 |=== Example: [source,json] ---- { - "_ARRAY_DIMENSIONS": ["time", "lat", "lon"], - "standard_name": "air_temperature", - "units": "K", - "long_name": "Surface air temperature", - "_FillValue": -9999.0 + "zarr_format": 3, + "node_type": "array", + "shape": [12, 180, 360], + "dimension_names": ["time", "lat", "lon"], + "attributes": { + "standard_name": "air_temperature", + "units": "K", + "long_name": "Surface air temperature", + "_FillValue": -9999.0 + } } ---- -=== Global Metadata - -Metadata associated with the hierarchy is stored at the root group level. +==== Global and Group Metadata +Metadata applying to the entire hierarchy or subgroup is stored at the group level. [cols="1,2,2"] |=== -|Field |Zarr v2 |Zarr v3 - -|Location | `.zattrs` file of root `.zgroup` | `attributes` field in root `zarr.json` +|Aspect |Zarr v2 |Zarr v3 -|Group Identification | `.zgroup` file | `node_type: group` in `zarr.json` - -|CF Conformance | `Conventions` attribute (e.g., `CF-1.10`) | Same, under `attributes` +|Location | `.zattrs` in root group | `attributes` field in root `zarr.json` +|Identification | `.zgroup` file | `"node_type": "group"` +|Conventions | `Conventions` attribute (e.g., `CF-1.10`) | Same under `attributes` |=== -Example Zarr v3 root `zarr.json`: +Example: [source,json] ---- { @@ -144,16 +144,53 @@ Example Zarr v3 root `zarr.json`: } ---- +==== Variable and Attribute Metadata + +All metadata attributes for groups, coordinate variables, and data variables should follow established community naming and typing conventions. 
+GeoZarr encourages CF-compliant naming where applicable but does not require it. + +Attributes shall: +- use UTF-8–encoded names; +- have JSON-compatible values (string, number, boolean, or array); +- remain consistent across group hierarchies. + +Typical attributes include: + +* CF: `standard_name`, `units`, `axis`, `grid_mapping` +* Generic: `_FillValue`, `scale_factor`, `add_offset`, `long_name`, `missing_value` +* GDAL-compatible: `spatial_ref`, `GeoTransform`, `AREA_OR_POINT` + +Structured metadata values, such as JSON or XML content, may be included directly as objects rather than as serialised text. Implementations are encouraged to **store such metadata in deserialised form** (as native JSON objects) whenever possible, ensuring that attributes remain machine-readable and conform to JSON type rules. + +If serialised representations (e.g., XML strings or JSON text blocks) are used, they shall be valid UTF-8 strings and clearly identified by attribute naming or context. + + +==== CDM Encoding Notes and Special Cases + +* **Shared Dimensions** – +To emulate the CDM concept of shared dimensions, GeoZarr requires that identical dimension names across arrays refer to the same logical axis. +Libraries implementing GeoZarr should preserve this relationship explicitly in their in-memory representations. + +* **Unlimited Dimensions** – +Zarr’s chunked structure inherently supports extensible dimensions. +A dimension can be declared unlimited by allowing its corresponding array dimension to grow dynamically (e.g., time). +The use of `"resizeable": true` (Zarr v3) or dynamic chunk append operations is recommended. + +* **Nested Groups and Subgroups** – +Zarr v3 groups may nest recursively. +Each subgroup represents a CDM group and may hold its own variables and attributes. +This structure supports logical organisation such as multiple collections, products, or resolution levels. 
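The shared-dimension rule in the notes above lends itself to a mechanical check. The following is a minimal Python sketch, assuming that the Zarr v3 `zarr.json` documents of the arrays in one group hierarchy have already been parsed into dictionaries, and assuming that "same logical axis" implies equal length (the text does not state this explicitly). The function name `check_shared_dimensions` and the sample metadata are hypothetical.

```python
from typing import Dict, List, Tuple


def check_shared_dimensions(arrays: Dict[str, dict]) -> List[str]:
    """Return one message per dimension name that appears with two sizes.

    `arrays` maps an array path to its parsed Zarr v3 metadata, of which
    only `shape` and `dimension_names` are read.
    """
    seen: Dict[str, Tuple[int, str]] = {}  # dimension name -> (size, first array path)
    conflicts: List[str] = []
    for path, meta in arrays.items():
        names = meta.get("dimension_names") or []
        for name, size in zip(names, meta["shape"]):
            if name is None:  # Zarr v3 permits unnamed (null) dimensions
                continue
            if name not in seen:
                seen[name] = (size, path)
            elif seen[name][0] != size:
                conflicts.append(
                    f"{name}: {seen[name][1]} has size {seen[name][0]}, "
                    f"{path} has size {size}"
                )
    return conflicts


# Hypothetical metadata for three arrays in one group.
arrays = {
    "lat": {"shape": [180], "dimension_names": ["lat"]},
    "temperature": {"shape": [12, 180, 360], "dimension_names": ["time", "lat", "lon"]},
    "pressure": {"shape": [12, 90, 360], "dimension_names": ["time", "lat", "lon"]},
}
conflicts = check_shared_dimensions(arrays)
print(conflicts)
```

The sketch treats all arrays as belonging to a single scope; handling group-scoped dimensions or the leading-slash notation would need the group path of each array as additional input.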
+ +==== Metadata Integration for CF, GDAL, and GeoTIFF -=== Variables Metadata +While the GeoZarr Data Model provides the structure for metadata storage, **GeoZarr does not redefine how CF, GDAL, or GeoTIFF metadata are mapped into this structure**. +These mappings are well established in community libraries (e.g., *xarray*, *netCDF-Java*, *GDAL*). -All metadata attributes (for groups, coordinates variables and data variables) are recommended to conform to CF naming and typing conventions. Supported attributes include: +GeoZarr encoding rules therefore only specify **when a specific Zarr encoding requirement applies**, such as: -- `standard_name`, `units`, `axis`, `grid_mapping` (CF) -- `_FillValue`, `scale_factor`, `add_offset` -- `long_name`, `missing_value` +* use of `attributes` fields in `zarr.json` for CF or GDAL metadata; +* preservation of key metadata names (`grid_mapping`, `spatial_ref`, `GeoTransform`); +* ensuring metadata values remain valid JSON types. -In all cases: +Implementations may rely on existing libraries to populate or interpret such metadata consistently. -- Attribute names are case-sensitive and encoded as UTF-8 strings -- Values shall conform to JSON-compatible types (string, number, boolean, array)
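The requirement that metadata values remain valid JSON types can be illustrated with a short check performed before attributes are written. This is a minimal Python sketch using only the standard library; the helper names `is_json_compatible` and `group_metadata` and the sample attributes are hypothetical, and nested objects are accepted on the assumption that deserialised structured metadata is allowed.

```python
import json
import math


def is_json_compatible(value) -> bool:
    """Check that a value is a string, finite number, boolean, null, array or object."""
    if value is None or isinstance(value, (str, bool, int)):
        return True
    if isinstance(value, float):
        return math.isfinite(value)  # NaN and Infinity are not valid JSON numbers
    if isinstance(value, (list, tuple)):
        return all(is_json_compatible(v) for v in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_json_compatible(v) for k, v in value.items())
    return False


def group_metadata(attributes: dict) -> str:
    """Serialise a Zarr v3 group document carrying the given attributes."""
    bad = [name for name, value in attributes.items() if not is_json_compatible(value)]
    if bad:
        raise ValueError(f"attributes with non-JSON values: {bad}")
    doc = {"zarr_format": 3, "node_type": "group", "attributes": attributes}
    return json.dumps(doc, indent=2, allow_nan=False)


# Hypothetical group attributes.
attrs = {"Conventions": "CF-1.10", "title": "example"}
document = group_metadata(attrs)
print(document)
```

Rejecting non-finite floats is a design choice of this sketch: Python's `json` module would otherwise emit `NaN`, which strict JSON parsers do not accept.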