# Databricks Labs Data Generator (`dbldatagen`)

<!-- Top bar will be removed from PyPi packaged versions -->
<!-- Dont remove: exclude package -->
[Documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) |
[Release Notes](CHANGELOG.md) |
[Examples](examples) |
[Tutorial](tutorial)
<!-- Dont remove: end exclude package -->

[![build](https://github.com/databrickslabs/dbldatagen/workflows/build/badge.svg?branch=master)](https://github.com/databrickslabs/dbldatagen/actions?query=workflow%3Abuild+branch%3Amaster)
[![PyPi downloads](https://img.shields.io/pypi/dm/dbldatagen?label=PyPi%20Downloads)](https://pypistats.org/packages/dbldatagen)
[![lines of code](https://tokei.rs/b1/github/databrickslabs/dbldatagen)](https://github.com/databrickslabs/dbldatagen)

<!--
[](https://lgtm.com/projects/g/databrickslabs/dbldatagen/context:python)
[](https://hanadigital.github.io/grev/?user=databrickslabs&repo=dbldatagen)
-->

## Project Description
The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
other uses.

It operates by defining a data generation specification in code that controls
how the synthetic data is generated.
The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.

It has no dependencies on any libraries that are not already installed in the Databricks
runtime, and you can use it from Scala, R or other languages by defining
a view over the generated data.
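For example, to make the generated data available to SQL, Scala, or R code running in the same Spark session, you can register the generated data frame as a temporary view. A minimal sketch, assuming an active SparkSession named `spark` and a generated data frame `df`; the view name is illustrative:

```python
# Register the generated data frame as a temporary view so that
# SQL, Scala, or R cells in the same Spark session can query it.
df.createOrReplaceTempView("synthetic_data")

# Query the synthetic data via SQL from any language binding.
spark.sql("SELECT COUNT(*) FROM synthetic_data").show()
```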

### Feature Summary
It supports:
* Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture,
merge and join scenarios with consistency between primary and foreign keys
* Generating synthetic data for all of the
Spark SQL supported primitive types as a Spark data frame which may be persisted,
saved to external storage or
used in other computations
* Generating ranges of dates, timestamps, and numeric values
* Generating discrete values - both numeric and text
* Generating values at random and based on the values of other fields
(either based on the `hash` of the underlying values or the values themselves) - see the sketch after this list
* Ability to specify a distribution for random data generation
* Generating arrays of values for ML-style feature arrays
* Applying weights to the occurrence of values
* Generating values to conform to a schema or independent of an existing schema
* Use of SQL expressions in synthetic data generation
* Plugin mechanism to allow use of 3rd party libraries such as Faker
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
* Generating synthetic data generation code from existing schema or data (experimental)
* Pydantic-based specification API for type-safe data generation (experimental)
* Use of standard datasets for quick generation of synthetic data
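As an illustration of field-dependent generation, the following minimal sketch derives one column deterministically from another via `baseColumn`; the column and generator names are illustrative, and an active SparkSession named `spark` is assumed:

```python
from pyspark.sql.types import IntegerType, StringType

import dbldatagen as dg

df = (dg.DataGenerator(spark, name="dependent_fields_example", rows=1000, partitions=4)
      # pick device ids at random from the range 1 .. 100
      .withColumn("device_id", IntegerType(), minValue=1, maxValue=100, random=True)
      # derive a text code from the device id, so the same id always yields
      # the same code - giving consistent primary/foreign key style data
      .withColumn("device_code", StringType(), prefix="device", baseColumn="device_id")
      .build())

df.show(5)
```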

Details of these features can be found in the
[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).

## Documentation

Please refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for
details of use and many examples.

Release notes and details of the latest changes for this specific release
can be found in the [Release Notes](CHANGELOG.md).

## Installation

Within a Databricks notebook, invoke the following in a notebook cell:

```
%pip install dbldatagen
```

The pip install command can be invoked within a Databricks notebook, a Delta Live Tables pipeline,
and even works on the Databricks Community Edition.

The [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html)
in the documentation contain details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs Data Generator framework can be used with PySpark 3.4.1 and Python 3.10.12 or later. These are
compatible with the Databricks runtime 13.3 LTS and later releases. This version also provides Unity Catalog
compatibility.

For details of library compatibility for a specific Databricks Spark release, see the Databricks
release notes:

- https://docs.databricks.com/release-notes/runtime/releases.html

When using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments with
runtimes prior to release 13.2, the Data Generator requires the use of the `Single User` or `No Isolation Shared`
access modes. This is because some needed features (for example, use of 3rd party libraries and
use of Python UDFs) are not available in `Shared` mode in those releases.
Depending on settings, the `Custom` access mode may be supported for those releases.

The use of the Unity Catalog `Shared` access mode is supported from Databricks runtime release 13.2
onwards.

*This version of the data generator uses Databricks runtime 13.3 LTS as the minimum supported
version and alleviates these issues.*

See the following documentation for more information:

- https://docs.databricks.com/data-governance/unity-catalog/compute.html

## Using the Data Generator
To use the data generator, install the library using the `%pip install` method or install the Python wheel directly
in your environment.

Once the library has been installed, you can use it to generate a data frame composed of synthetic data.

The easiest way to get started is to use one of the standard datasets, which can be further customized
for your use case.
```python
import dbldatagen as dg

# generate 1,000,000 rows of synthetic user data from the standard "basic/user" dataset
df = dg.Datasets(spark, "basic/user").get(rows=1_000_000).build()
num_rows = df.count()
```

You can also define fully custom data sets using the `DataGenerator` class.

For example, the following code defines a fully custom data generation specification:

```python
from pyspark.sql.types import FloatType, IntegerType, StringType

import dbldatagen as dg

column_count = 10
data_rows = 1000 * 1000
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                            partitions=4)
           .withIdOutput()
           # generate multiple numeric columns r_0 .. r_9 from the same SQL expression
           .withColumn("r", FloatType(),
                       expr="floor(rand() * 350) * (86400 + 3600)",
                       numColumns=column_count)
           .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
           .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'])
           .withColumn("code4", StringType(), values=['a', 'b', 'c'],
                       random=True)
           # weighted selection: 'a' is generated 9x more often than 'b' or 'c'
           .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                       random=True, weights=[9, 1, 1])
           )

df = df_spec.build()
num_rows = df.count()
```
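The result of `build()` is an ordinary Spark data frame, so it can be persisted with standard Spark APIs. A brief sketch, assuming a Databricks environment with Delta as the default table format; the table name is illustrative:

```python
# Save the generated data as a table so it can be reused
# across benchmarks, tests, or demo notebooks.
df.write.format("delta").mode("overwrite").saveAsTable("synthetic_test_data")
```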

Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
examples.

The GitHub repository also contains further examples in the [examples](examples) directory.

## Spark and Databricks Runtime Compatibility
The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
older LTS versions at least from 13.3 LTS and later. It also aims to be compatible with Delta Live Tables runtimes,
including `current` and `preview`.

While we don't specifically drop support for older runtimes, changes in PySpark APIs or
APIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` may cause issues with older
runtimes.

By design, installing `dbldatagen` does not install releases of dependent packages in order
to preserve the curated set of packages pre-installed in any Databricks runtime environment.

When building on local environments, run `make dev` to install required dependencies.

## Project Support
Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on this GitHub repository.
They will be reviewed as time permits, but there are no formal SLAs for support.