# Databricks Labs Data Generator (`dbldatagen`)

<!-- Top bar will be removed from PyPi packaged versions -->
<!-- Dont remove: exclude package -->
[Documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) |
[Release Notes](CHANGELOG.md) |
[Examples](examples) |
[Tutorial](tutorial)
<!-- Dont remove: end exclude package -->

[![build](https://github.com/databrickslabs/dbldatagen/workflows/build/badge.svg?branch=master)](https://github.com/databrickslabs/dbldatagen/actions?query=workflow%3Abuild+branch%3Amaster)
[![PyPi downloads](https://img.shields.io/pypi/dm/dbldatagen?label=PyPi%20Downloads)](https://pypistats.org/packages/dbldatagen)
[![lines of code](https://tokei.rs/b1/github/databrickslabs/dbldatagen)](https://github.com/databrickslabs/dbldatagen)

<!--
[](https://lgtm.com/projects/g/databrickslabs/dbldatagen/context:python)
[](https://hanadigital.github.io/grev/?user=databrickslabs&repo=dbldatagen)
-->

## Project Description
The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
other uses.

It operates by defining a data generation specification in code that controls
how the synthetic data is generated.
The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.

It has no dependencies on any libraries that are not already installed in the Databricks
runtime, and you can use it from Scala, R or other languages by defining
a view over the generated data.
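For example, to make the generated data available to SQL, Scala, or R code running in the same Spark session, you can register the generated data frame as a temporary view. A minimal sketch, assuming an active SparkSession named `spark` and a generated data frame `df`; the view name is illustrative:

```python
# Register the generated data frame as a temporary view so that
# SQL, Scala, or R cells in the same Spark session can query it.
df.createOrReplaceTempView("synthetic_data")

# Query the synthetic data via SQL from any language binding.
spark.sql("SELECT COUNT(*) FROM synthetic_data").show()
```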

### Feature Summary
It supports:
* Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture,
merge and join scenarios with consistency between primary and foreign keys
* Generating synthetic data for all of the
Spark SQL supported primitive types as a Spark data frame which may be persisted,
saved to external storage or
used in other computations
* Generating ranges of dates, timestamps, and numeric values
* Generating discrete values - both numeric and text
* Generating values at random and based on the values of other fields
(either based on the `hash` of the underlying values or the values themselves) - see the sketch after this list
* Ability to specify a distribution for random data generation
* Generating arrays of values for ML-style feature arrays
* Applying weights to the occurrence of values
* Generating values to conform to a schema or independent of an existing schema
* Use of SQL expressions in synthetic data generation
* Plugin mechanism to allow use of 3rd party libraries such as Faker
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
* Generating synthetic data generation code from existing schema or data (experimental)
* Pydantic-based specification API for type-safe data generation (experimental)
* Use of standard datasets for quick generation of synthetic data
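As an illustration of field-dependent generation, the following minimal sketch derives one column deterministically from another via `baseColumn`; the column and generator names are illustrative, and an active SparkSession named `spark` is assumed:

```python
from pyspark.sql.types import IntegerType, StringType

import dbldatagen as dg

df = (dg.DataGenerator(spark, name="dependent_fields_example", rows=1000, partitions=4)
      # pick device ids at random from the range 1 .. 100
      .withColumn("device_id", IntegerType(), minValue=1, maxValue=100, random=True)
      # derive a text code from the device id, so the same id always yields
      # the same code - giving consistent primary/foreign key style data
      .withColumn("device_code", StringType(), prefix="device", baseColumn="device_id")
      .build())

df.show(5)
```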

Details of these features can be found in the
[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).

## Documentation

Please refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for
details of use and many examples.

Release notes and details of the latest changes for this specific release
can be found in the [Release Notes](CHANGELOG.md).

## Installation

Within a Databricks notebook, invoke the following in a notebook cell:

```
%pip install dbldatagen
```

The pip install command can be invoked within a Databricks notebook, a Delta Live Tables pipeline,
and even works on the Databricks Community Edition.

The [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html)
in the documentation contain details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs Data Generator framework can be used with PySpark 3.4.1 and Python 3.10.12 or later. These are
compatible with the Databricks runtime 13.3 LTS and later releases. This version also provides Unity Catalog
compatibility.

For details of library compatibility for a specific Databricks Spark release, see the Databricks
release notes:

- https://docs.databricks.com/release-notes/runtime/releases.html

When using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments with
runtimes prior to release 13.2, the Data Generator requires the use of the `Single User` or `No Isolation Shared`
access modes. This is because some needed features (for example, use of 3rd party libraries and
use of Python UDFs) are not available in `Shared` mode in those releases.
Depending on settings, the `Custom` access mode may be supported for those releases.

The use of the Unity Catalog `Shared` access mode is supported from Databricks runtime release 13.2
onwards.

*This version of the data generator uses Databricks runtime 13.3 LTS as the minimum supported
version and alleviates these issues.*

See the following documentation for more information:

- https://docs.databricks.com/data-governance/unity-catalog/compute.html

## Using the Data Generator
To use the data generator, install the library using the `%pip install` method or install the Python wheel directly
in your environment.

Once the library has been installed, you can use it to generate a data frame composed of synthetic data.

The easiest way to get started is to use one of the standard datasets, which can be further customized
for your use case.
```python
import dbldatagen as dg

# generate 1,000,000 rows of synthetic user data from the standard "basic/user" dataset
df = dg.Datasets(spark, "basic/user").get(rows=1_000_000).build()
num_rows = df.count()
```

You can also define fully custom data sets using the `DataGenerator` class.

For example, the following code defines a fully custom data generation specification:

```python
from pyspark.sql.types import FloatType, IntegerType, StringType

import dbldatagen as dg

column_count = 10
data_rows = 1000 * 1000
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                            partitions=4)
           .withIdOutput()
           # generate multiple numeric columns r_0 .. r_9 from the same SQL expression
           .withColumn("r", FloatType(),
                       expr="floor(rand() * 350) * (86400 + 3600)",
                       numColumns=column_count)
           .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
           .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'])
           .withColumn("code4", StringType(), values=['a', 'b', 'c'],
                       random=True)
           # weighted selection: 'a' is generated 9x more often than 'b' or 'c'
           .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                       random=True, weights=[9, 1, 1])
           )

df = df_spec.build()
num_rows = df.count()
```
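The result of `build()` is an ordinary Spark data frame, so it can be persisted with standard Spark APIs. A brief sketch, assuming a Databricks environment with Delta as the default table format; the table name is illustrative:

```python
# Save the generated data as a table so it can be reused
# across benchmarks, tests, or demo notebooks.
df.write.format("delta").mode("overwrite").saveAsTable("synthetic_test_data")
```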

Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
examples.

The GitHub repository also contains further examples in the [examples](examples) directory.

## Spark and Databricks Runtime Compatibility
The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
older LTS versions at least from 13.3 LTS and later. It also aims to be compatible with Delta Live Tables runtimes,
including `current` and `preview`.

While we don't specifically drop support for older runtimes, changes in PySpark APIs or
APIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` may cause issues with older
runtimes.

By design, installing `dbldatagen` does not install releases of dependent packages in order
to preserve the curated set of packages pre-installed in any Databricks runtime environment.

When building on local environments, run `make dev` to install required dependencies.

## Project Support
Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on this GitHub repository.
They will be reviewed as time permits, but there are no formal SLAs for support.