
Commit f6c9a69

marking the spec module experimental

1 parent a0ce13b commit f6c9a69

File tree

6 files changed: +80 -62 lines changed


CHANGELOG.md

Lines changed: 17 additions & 16 deletions

@@ -5,7 +5,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 
 ### unreleased
 
-#### Fixed
+#### Fixed
 * Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime
 * Refactored `DataAnalyzer` and `BasicStockTickerProvider` to comply with ANSI SQL standards
 * Removed internal modification of `SparkSession`
@@ -23,6 +23,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 #### Added
 * Added support for serialization to/from JSON format
 * Added Ruff and mypy tooling
+* Pydantic-based specification API (Experimental)
 
 
 ### Version 0.4.0 Hotfix 2
@@ -59,7 +60,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 * Updated docs for complex data types / JSON to correct code examples
 * Updated license file in public docs
 
-#### Fixed
+#### Fixed
 * Fixed scenario where `DataAnalyzer` is used on dataframe containing a column named `summary`
 
 ### Version 0.3.6
@@ -90,14 +91,14 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 ### Version 0.3.4 Post 2
 
 ### Fixed
-* Fix for use of values in columns of type array, map and struct
+* Fix for use of values in columns of type array, map and struct
 * Fix for generation of arrays via `numFeatures` and `structType` attributes when `numFeatures` has value of 1
 
 
 ### Version 0.3.4 Post 1
 
 ### Fixed
-* Fix for use and configuration of root logger
+* Fix for use and configuration of root logger
 
 ### Acknowledgements
 Thanks to Marvin Schenkel for the contribution
@@ -120,7 +121,7 @@ Thanks to Marvin Schenkel for the contribution
 
 #### Changed
 * Fixed use of logger in _version.py and in spark_singleton.py
-* Fixed template issues
+* Fixed template issues
 * Document reformatting and updates, related code comment changes
 
 ### Fixed
@@ -133,19 +134,19 @@ Thanks to Marvin Schenkel for the contribution
 ### Version 0.3.2
 
 #### Changed
-* Adjusted column build phase separation (i.e. which select statement is used to build columns) so that a
+* Adjusted column build phase separation (i.e. which select statement is used to build columns) so that a
   column with a SQL expression can refer to previously created columns without use of a `baseColumn` attribute
 * Changed build labelling to comply with PEP440
 
-#### Fixed
+#### Fixed
 * Fixed compatibility of build with older versions of runtime that rely on `pyparsing` version 2.4.7
 
-#### Added
+#### Added
 * Parsing of SQL expressions to determine column dependencies
 
 #### Notes
 * The enhancements to build ordering do not change the actual order of column building -
-  but adjust which phase columns are built in
+  but adjust which phase columns are built in
 
 
 ### Version 0.3.1
@@ -154,11 +155,11 @@ Thanks to Marvin Schenkel for the contribution
 * Refactoring of template text generation for better performance via vectorized implementation
 * Additional migration of tests to use of `pytest`
 
-#### Fixed
+#### Fixed
 * Added type parsing support for binary and constructs such as `nvarchar(10)`
-* Fixed error occurring when schema contains map, array or struct.
+* Fixed error occurring when schema contains map, array or struct.
 
-#### Added
+#### Added
 * Ability to change name of seed column to custom name (defaults to `id`)
 * Added type parsing support for structs, maps and arrays and combinations of the above
 
@@ -207,14 +208,14 @@ See the contents of the file `python/require.txt` to see the Python package dependencies
 The code for the Databricks Data Generator has the following dependencies
 
 * Requires Databricks runtime 9.1 LTS or later
-* Requires Spark 3.1.2 or later
+* Requires Spark 3.1.2 or later
 * Requires Python 3.8.10 or later
 
-While the data generator framework does not require all libraries used by the runtimes, where a library from
+While the data generator framework does not require all libraries used by the runtimes, where a library from
 the Databricks runtime is used, it will use the version found in the Databricks runtime for 9.1 LTS or later.
 You can use older versions of the Databricks Labs Data Generator by referring to that explicit version.
 
-The recommended method to install the package is to use `pip install` in your notebook to install the package from
+The recommended method to install the package is to use `pip install` in your notebook to install the package from
 PyPi
 
 For example:
@@ -227,7 +228,7 @@ To use an older DB runtime version in your notebook, you can use the following command
 %pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
 ```
 
-See the [Databricks runtime release notes](https://docs.databricks.com/release-notes/runtime/releases.html)
+See the [Databricks runtime release notes](https://docs.databricks.com/release-notes/runtime/releases.html)
 for the full list of dependencies used by the Databricks runtime.
 
 This can be found at: https://docs.databricks.com/release-notes/runtime/releases.html

README.md

Lines changed: 46 additions & 45 deletions

@@ -1,11 +1,11 @@
-# Databricks Labs Data Generator (`dbldatagen`)
+# Databricks Labs Data Generator (`dbldatagen`)
 
 <!-- Top bar will be removed from PyPi packaged versions -->
 <!-- Dont remove: exclude package -->
 [Documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) |
 [Release Notes](CHANGELOG.md) |
 [Examples](examples) |
-[Tutorial](tutorial)
+[Tutorial](tutorial)
 <!-- Dont remove: end exclude package -->
 
 [![build](https://github.com/databrickslabs/dbldatagen/workflows/build/badge.svg?branch=master)](https://github.com/databrickslabs/dbldatagen/actions?query=workflow%3Abuild+branch%3Amaster)
@@ -14,53 +14,54 @@
 [![PyPi downloads](https://img.shields.io/pypi/dm/dbldatagen?label=PyPi%20Downloads)](https://pypistats.org/packages/dbldatagen)
 [![lines of code](https://tokei.rs/b1/github/databrickslabs/dbldatagen)](https://github.com/databrickslabs/dbldatagen)
 
-<!--
+<!--
 [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/databrickslabs/dbldatagen.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/databrickslabs/dbldatagen/context:python)
 [![downloads](https://img.shields.io/github/downloads/databrickslabs/dbldatagen/total.svg)](https://hanadigital.github.io/grev/?user=databrickslabs&repo=dbldatagen)
 -->
 
 ## Project Description
-The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
-environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
+The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
+environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
 other uses.
 
-It operates by defining a data generation specification in code that controls
+It operates by defining a data generation specification in code that controls
 how the synthetic data is generated.
 The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.
 
-It has no dependencies on any libraries that are not already installed in the Databricks
+It has no dependencies on any libraries that are not already installed in the Databricks
 runtime, and you can use it from Scala, R or other languages by defining
 a view over the generated data.
 
 ### Feature Summary
 It supports:
-* Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
-* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture,
+* Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
+* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture,
 merge and join scenarios with consistency between primary and foreign keys
-* Generating synthetic data for all of the
-Spark SQL supported primitive types as a Spark data frame which may be persisted,
-saved to external storage or
+* Generating synthetic data for all of the
+Spark SQL supported primitive types as a Spark data frame which may be persisted,
+saved to external storage or
 used in other computations
 * Generating ranges of dates, timestamps, and numeric values
 * Generation of discrete values - both numeric and text
-* Generation of values at random and based on the values of other fields
+* Generation of values at random and based on the values of other fields
 (either based on the `hash` of the underlying values or the values themselves)
-* Ability to specify a distribution for random data generation
+* Ability to specify a distribution for random data generation
 * Generating arrays of values for ML-style feature arrays
 * Applying weights to the occurrence of values
 * Generating values to conform to a schema or independent of an existing schema
 * Use of SQL expressions in synthetic data generation
 * Plugin mechanism to allow use of 3rd party libraries such as Faker
 * Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
 * Generate synthetic data generation code from existing schema or data (experimental)
+* Pydantic-based specification API for type-safe data generation (experimental)
 * Use of standard datasets for quick generation of synthetic data
 
 Details of these features can be found in the
-[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
+[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
 
 ## Documentation
 
-Please refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for
+Please refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for
 details of use and many examples.
 
 Release notes and details of the latest changes for this specific release
@@ -76,40 +77,40 @@ Within a Databricks notebook, invoke the following in a notebook cell
 %pip install dbldatagen
 ```
 
-The Pip install command can be invoked within a Databricks notebook, a Delta Live Tables pipeline
+The Pip install command can be invoked within a Databricks notebook, a Delta Live Tables pipeline
 and even works on the Databricks community edition.
 
-The documentation [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html)
+The documentation [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html)
 contains details of installation using alternative mechanisms.
 
-## Compatibility
-The Databricks Labs Data Generator framework can be used with Pyspark 3.4.1 and Python 3.10.12 or later. These are
+## Compatibility
+The Databricks Labs Data Generator framework can be used with Pyspark 3.4.1 and Python 3.10.12 or later. These are
 compatible with the Databricks runtime 13.3 LTS and later releases. This version also provides Unity Catalog
 compatibility.
 
-For full library compatibility for a specific Databricks Spark release, see the Databricks
+For full library compatibility for a specific Databricks Spark release, see the Databricks
 release notes for library compatibility
 
 - https://docs.databricks.com/release-notes/runtime/releases.html
 
-In older releases, when using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments,
-the Data Generator requires the use of `Single User` or `No Isolation Shared` access modes when using Databricks
-runtimes prior to release 13.2. This is because some needed features are not available in `Shared`
-mode (for example, use of 3rd party libraries, use of Python UDFs) in these releases.
+In older releases, when using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments,
+the Data Generator requires the use of `Single User` or `No Isolation Shared` access modes when using Databricks
+runtimes prior to release 13.2. This is because some needed features are not available in `Shared`
+mode (for example, use of 3rd party libraries, use of Python UDFs) in these releases.
 Depending on settings, the `Custom` access mode may be supported for those releases.
 
 The use of Unity Catalog `Shared` access mode is supported in Databricks runtimes from Databricks runtime release 13.2
-onwards.
+onwards.
 
-*This version of the data generator uses the Databricks runtime 13.3 LTS as the minimum supported
+*This version of the data generator uses the Databricks runtime 13.3 LTS as the minimum supported
 version and alleviates these issues.*
 
 See the following documentation for more information:
 
 - https://docs.databricks.com/data-governance/unity-catalog/compute.html
 
 ## Using the Data Generator
-To use the data generator, install the library using the `%pip install` method or install the Python wheel directly
+To use the data generator, install the library using the `%pip install` method or install the Python wheel directly
 in your environment.
 
 Once the library has been installed, you can use it to generate a data frame composed of synthetic data.
@@ -120,7 +121,7 @@ for your use case.
 ```buildoutcfg
 import dbldatagen as dg
 df = dg.Datasets(spark, "basic/user").get(rows=1000_000).build()
-num_rows=df.count()
+num_rows=df.count()
 ```
 
 You can also define fully custom data sets using the `DataGenerator` class.
@@ -135,48 +136,48 @@ data_rows = 1000 * 1000
 df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                             partitions=4)
            .withIdOutput()
-           .withColumn("r", FloatType(),
+           .withColumn("r", FloatType(),
                        expr="floor(rand() * 350) * (86400 + 3600)",
                        numColumns=column_count)
            .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
            .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
            .withColumn("code3", StringType(), values=['a', 'b', 'c'])
-           .withColumn("code4", StringType(), values=['a', 'b', 'c'],
+           .withColumn("code4", StringType(), values=['a', 'b', 'c'],
                        random=True)
-           .withColumn("code5", StringType(), values=['a', 'b', 'c'],
+           .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                        random=True, weights=[9, 1, 1])
-
+
            )
-
+
 df = df_spec.build()
-num_rows=df.count()
+num_rows=df.count()
 ```
-Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
-examples.
+Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
+examples.
 
 The GitHub repository also contains further examples in the examples directory.
 
 ## Spark and Databricks Runtime Compatibility
-The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
-older LTS versions at least from 13.3 LTS and later. It also aims to be compatible with Delta Live Tables runtimes,
-including `current` and `preview`.
+The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
+older LTS versions at least from 13.3 LTS and later. It also aims to be compatible with Delta Live Tables runtimes,
+including `current` and `preview`.
 
 While we don't specifically drop support for older runtimes, changes in Pyspark APIs or
 APIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` may cause issues with older
-runtimes.
+runtimes.
 
-By design, installing `dbldatagen` does not install releases of dependent packages, in order
+By design, installing `dbldatagen` does not install releases of dependent packages, in order
 to preserve the curated set of packages pre-installed in any Databricks runtime environment.
 
 When building on local environments, run `make dev` to install required dependencies.
 
 ## Project Support
 Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
-are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
-(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
+are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
+(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
 relating to any issues arising from the use of these projects.
 
-Any issues discovered through the use of this project should be filed as issues on the GitHub Repo.
+Any issues discovered through the use of this project should be filed as issues on the GitHub Repo.
 They will be reviewed as time permits, but there are no formal SLAs for support.
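Note: the `DataGenerator` hunk above is only a fragment; the imports and the `column_count` variable it uses are defined earlier in the README, outside the diff context. A self-contained version of that example, with a local SparkSession added as an assumption for running outside Databricks, looks like this:

```python
import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, IntegerType, StringType

# On Databricks, the `spark` session is pre-configured; this local builder is
# an assumption for running the example elsewhere.
spark = SparkSession.builder.appName("dbldatagen-readme-example").getOrCreate()

column_count = 10            # defined earlier in the README, outside this hunk
data_rows = 1000 * 1000

df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                            partitions=4)
           .withIdOutput()
           .withColumn("r", FloatType(),
                       expr="floor(rand() * 350) * (86400 + 3600)",
                       numColumns=column_count)
           .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
           .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'])
           .withColumn("code4", StringType(), values=['a', 'b', 'c'],
                       random=True)
           .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                       random=True, weights=[9, 1, 1])
           )

df = df_spec.build()
num_rows = df.count()
print(num_rows)
```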

dbldatagen/spec/__init__.py

Lines changed: 5 additions & 1 deletion

@@ -1,7 +1,11 @@
-"""Pydantic-based specification API for dbldatagen.
+"""Pydantic-based specification API for dbldatagen (Experimental).
 
 This module provides Pydantic models and specifications for defining data generation
 in a type-safe, declarative way.
+
+.. warning::
+    Experimental - This API is experimental and both APIs and generated code
+    are liable to change in future versions.
 """
 
 from typing import Any

dbldatagen/spec/column_spec.py

Lines changed: 3 additions & 0 deletions

@@ -54,6 +54,9 @@ class ColumnDefinition(BaseModel):
         "auto" (infer behavior), "hash" (hash the base column values),
         "values" (use base column values directly)
 
+    .. warning::
+        Experimental - This API is subject to change in future versions
+
     .. note::
         Primary columns have special constraints:
         - Must have a type defined
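Based on the docstring fragment above, a `ColumnDefinition` captures a column's type, primary-key status, and base-column behavior. A hypothetical sketch of constructing one follows; every field name here (`name`, `type`, `primary`, `base_column`, `base_column_behavior`) is a guess, since the hunk shows only part of the docstring:

```python
from dbldatagen.spec import ColumnDefinition  # re-export assumed from __init__.py

# All field names below are hypothetical; the hunk shows only part of the docstring.
customer_id = ColumnDefinition(
    name="customer_id",
    type="long",        # per the note above, primary columns must have a type defined
    primary=True,
)

derived = ColumnDefinition(
    name="customer_code",
    type="string",
    base_column="customer_id",          # hypothetical field name
    base_column_behavior="hash",        # one of "auto", "hash", "values" per the docstring
)
```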

dbldatagen/spec/generator_spec.py

Lines changed: 6 additions & 0 deletions

@@ -30,6 +30,9 @@ class TableDefinition(BaseModel):
     :param columns: List of ColumnDefinition objects specifying the columns to generate
         in this table. At least one column must be specified
 
+    .. warning::
+        Experimental - This API is subject to change in future versions
+
     .. note::
         Setting an appropriate number of partitions can significantly impact generation performance.
         As a rule of thumb, use 2-4 partitions per CPU core available in your Spark cluster
@@ -64,6 +67,9 @@ class DatagenSpec(BaseModel):
     :param intended_for_databricks: Flag indicating if this spec is designed for Databricks.
         May be automatically inferred based on configuration
 
+    .. warning::
+        Experimental - This API is subject to change in future versions
+
     .. note::
         Call the validate() method before using this spec to ensure configuration is correct
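The docstrings above confirm that `TableDefinition` takes a `columns` list and that `DatagenSpec` exposes `intended_for_databricks` and a `validate()` method; the remaining field names in this sketch are guesses. Because these are Pydantic models, JSON round-tripping comes along with standard Pydantic methods (shown here assuming Pydantic v2), which lines up with the CHANGELOG's "serialization to/from JSON format" entry:

```python
from dbldatagen.spec import ColumnDefinition, TableDefinition, DatagenSpec  # re-exports assumed

table = TableDefinition(
    name="customers",      # hypothetical field name
    partitions=8,          # docstring suggests 2-4 partitions per CPU core
    columns=[              # documented: at least one ColumnDefinition required
        ColumnDefinition(name="id", type="long", primary=True),  # hypothetical fields
    ],
)

spec = DatagenSpec(
    tables=[table],                 # hypothetical field name
    intended_for_databricks=False,  # documented parameter
)
spec.validate()  # documented: call validate() before using the spec

# Standard Pydantic v2 serialization; not confirmed as part of the spec module's own API.
payload = spec.model_dump_json()
restored = DatagenSpec.model_validate_json(payload)
```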

dbldatagen/spec/generator_spec_impl.py

Lines changed: 3 additions & 0 deletions

@@ -36,6 +36,9 @@ class Generator:
     :param spark: Active SparkSession to use for data generation
     :param app_name: Application name used in logging and tracking. Defaults to "DataGen_ClassBased"
 
+    .. warning::
+        Experimental - This API is subject to change in future versions
+
     .. note::
         The Generator requires an active SparkSession. On Databricks, you can use the pre-configured
         `spark` variable. For local development, create a SparkSession first
