236 changes: 236 additions & 0 deletions docs/source/generating_json_data.rst
@@ -0,0 +1,236 @@
.. Test Data Generator documentation master file, created by
sphinx-quickstart on Sun Jun 21 10:54:30 2020.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

Generating JSON and structured column data
==========================================

This section explores generating JSON and structured column data. By structured columns,
we mean columns that are some combination of `struct`, `array` and `map` of other types.

Generating JSON data
--------------------
There are several methods for generating JSON data:

- Generate a dataframe and save it as JSON, which writes the full data set as JSON
- Generate JSON valued fields using SQL functions such as `named_struct` and `to_json`

Writing dataframe as JSON data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following example illustrates the basic technique for generating JSON data from a dataframe.

.. code-block:: python

from pyspark.sql.types import LongType, IntegerType, StringType

import dbldatagen as dg


country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
17]

manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

# number of unique device ids to generate (an illustrative value; adjust as needed)
device_population = 100000

testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=1000000,
partitions=8,
randomSeedMethod='hash_fieldname')
.withIdOutput()
# we'll use a hash of the base field to generate the ids to
# avoid a simple incrementing sequence
.withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
uniqueValues=device_population, omit=True, baseColumnType="hash")

# note for format strings, we must use "%lx" not "%x" as the
# underlying value is a long
.withColumn("device_id", StringType(), format="0x%013x",
baseColumn="internal_device_id")

# the device / user attributes will be the same for the same device id
# so let's use the internal device id as the base column for these attributes
.withColumn("country", StringType(), values=country_codes,
weights=country_weights,
baseColumn="internal_device_id")
.withColumn("manufacturer", StringType(), values=manufacturers,
baseColumn="internal_device_id")

# use omit = True if you don't want a column to appear in the final output
# but just want to use it as part of the generation of another column
.withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
baseColumnType="hash")
.withColumn("model_ser", IntegerType(), minValue=1, maxValue=11,
baseColumn="device_id",
baseColumnType="hash", omit=True)

.withColumn("event_type", StringType(),
values=["activation", "deactivation", "plan change",
"telecoms activity", "internet activity", "device error"],
random=True)
.withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
interval="1 minute", random=True)

)

dfTestData = testDataSpec.build()

dfTestData.write.format("json").mode("overwrite").save("/tmp/jsonData1")

In the most basic form, you can simply save the dataframe to storage in JSON format.
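
If you want to verify the generated output, you can read the saved files back with the standard Spark JSON
reader. The following is a minimal sketch, assuming the `/tmp/jsonData1` path used in the example above.

.. code-block:: python

    # read the generated JSON back and inspect the inferred schema and a few rows
    dfJson = spark.read.format("json").load("/tmp/jsonData1")
    dfJson.printSchema()
    dfJson.show(5, truncate=False)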

Use of nested structures in data generation specifications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When we save a dataframe containing complex column types such as `map`, `struct` and `array`, these will be
converted to equivalent constructs in JSON.

So how do we go about creating these?

We can use a struct valued column to hold the nested structure data and write the results out as JSON.

Struct, array, and map valued columns can be created by adding a column of the appropriate type and using the `expr`
attribute to assemble the complex column.

Note that in the current release, the `expr` attribute will override other column data generation rules.
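
For instance, the same `expr` mechanism can be used to assemble `array` or `map` valued columns from other
generated columns. The sketch below is illustrative only; the generator and column names are hypothetical
and are not part of the device data example that follows.

.. code-block:: python

    import dbldatagen as dg
    from pyspark.sql.types import IntegerType, StringType, ArrayType, MapType

    df_spec = (dg.DataGenerator(spark, name="structured_sketch", rows=1000, partitions=4)
               .withColumn("v1", IntegerType(), minValue=1, maxValue=100, random=True, omit=True)
               .withColumn("v2", IntegerType(), minValue=1, maxValue=100, random=True, omit=True)

               # array valued column assembled from v1 and v2 with a SQL expression
               .withColumn("pair", ArrayType(IntegerType()), expr="array(v1, v2)",
                           baseColumn=["v1", "v2"])

               # map valued column assembled from v1 and v2 with a SQL expression
               .withColumn("attributes", MapType(StringType(), IntegerType()),
                           expr="map('v1', v1, 'v2', v2)",
                           baseColumn=["v1", "v2"])
               )

    df = df_spec.build()

The larger device data example below uses the same approach to build struct valued columns.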

.. code-block:: python

from pyspark.sql.types import LongType, FloatType, IntegerType, StringType, DoubleType, BooleanType, ShortType, \
TimestampType, DateType, DecimalType, ByteType, BinaryType, ArrayType, MapType, StructType, StructField

import dbldatagen as dg


country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
17]

manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

# number of unique device ids to generate (an illustrative value; adjust as needed)
device_population = 100000

testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=1000000,
partitions=8,
randomSeedMethod='hash_fieldname')
.withIdOutput()
# we'll use a hash of the base field to generate the ids to
# avoid a simple incrementing sequence
.withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
uniqueValues=device_population, omit=True, baseColumnType="hash")

# note for format strings, we must use "%lx" not "%x" as the
# underlying value is a long
.withColumn("device_id", StringType(), format="0x%013x",
baseColumn="internal_device_id")

# the device / user attributes will be the same for the same device id
# so let's use the internal device id as the base column for these attributes
.withColumn("country", StringType(), values=country_codes,
weights=country_weights,
baseColumn="internal_device_id")

.withColumn("manufacturer", StringType(), values=manufacturers,
baseColumn="internal_device_id", omit=True)
.withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
baseColumnType="hash", omit=True)
.withColumn("manufacturer_info", StructType([StructField('line',StringType()), StructField('manufacturer', StringType())]),
expr="named_struct('line', line, 'manufacturer', manufacturer)",
baseColumn=['manufacturer', 'line'])


.withColumn("model_ser", IntegerType(), minValue=1, maxValue=11,
baseColumn="device_id",
baseColumnType="hash", omit=True)

.withColumn("event_type", StringType(),
values=["activation", "deactivation", "plan change",
"telecoms activity", "internet activity", "device error"],
random=True, omit=True)
.withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
interval="1 minute", random=True, omit=True)

.withColumn("event_info", StructType([StructField('event_type',StringType()), StructField('event_ts', TimestampType())]),
expr="named_struct('event_type', event_type, 'event_ts', event_ts)",
baseColumn=['event_type', 'event_ts'])
)

dfTestData = testDataSpec.build()
dfTestData.write.format("json").mode("overwrite").save("/tmp/jsonData2")
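
Because `manufacturer_info` and `event_info` are genuine struct columns, their fields can also be addressed
with dot notation before (or instead of) writing the data out, for example:

.. code-block:: python

    # inspect the nested struct fields with dot notation
    dfTestData.select("device_id",
                      "manufacturer_info.manufacturer", "manufacturer_info.line",
                      "event_info.event_type", "event_info.event_ts").show(5, truncate=False)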

Generating JSON valued fields
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

JSON valued fields can be generated as fields of `string` type and assembled using a combination of Spark SQL
functions such as `named_struct` and `to_json`.

.. code-block:: python

from pyspark.sql.types import LongType, FloatType, IntegerType, StringType, DoubleType, BooleanType, ShortType, \
TimestampType, DateType, DecimalType, ByteType, BinaryType, ArrayType, MapType, StructType, StructField

import dbldatagen as dg


country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
17]

manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

# number of unique device ids to generate (an illustrative value; adjust as needed)
device_population = 100000

testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=1000000,
partitions=8,
randomSeedMethod='hash_fieldname')
.withIdOutput()
# we'll use a hash of the base field to generate the ids to
# avoid a simple incrementing sequence
.withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
uniqueValues=device_population, omit=True, baseColumnType="hash")

# note for format strings, we must use "%lx" not "%x" as the
# underlying value is a long
.withColumn("device_id", StringType(), format="0x%013x",
baseColumn="internal_device_id")

# the device / user attributes will be the same for the same device id
# so let's use the internal device id as the base column for these attributes
.withColumn("country", StringType(), values=country_codes,
weights=country_weights,
baseColumn="internal_device_id")

.withColumn("manufacturer", StringType(), values=manufacturers,
baseColumn="internal_device_id", omit=True)
.withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
baseColumnType="hash", omit=True)
.withColumn("manufacturer_info", "string",
expr="to_json(named_struct('line', line, 'manufacturer', manufacturer))",
baseColumn=['manufacturer', 'line'])


.withColumn("model_ser", IntegerType(), minValue=1, maxValue=11,
baseColumn="device_id",
baseColumnType="hash", omit=True)

.withColumn("event_type", StringType(),
values=["activation", "deactivation", "plan change",
"telecoms activity", "internet activity", "device error"],
random=True, omit=True)
.withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
interval="1 minute", random=True, omit=True)

.withColumn("event_info", "string",
expr="to_json(named_struct('event_type', event_type, 'event_ts', event_ts))",
baseColumn=['event_type', 'event_ts'])
)

dfTestData = testDataSpec.build()

#dfTestData.write.format("json").mode("overwrite").save("/tmp/jsonData2")
display(dfTestData)
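
If downstream consumers need the structured form back, the JSON valued string fields can be parsed with
`from_json`. The following is a minimal sketch, assuming the column names produced by the example above.

.. code-block:: python

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # schema matching the JSON generated for the event_info column above
    event_schema = StructType([StructField("event_type", StringType()),
                               StructField("event_ts", TimestampType())])

    dfParsed = dfTestData.withColumn("event_info_struct",
                                     from_json(col("event_info"), event_schema))

    dfParsed.select("device_id", "event_info_struct.event_type",
                    "event_info_struct.event_ts").show(5, truncate=False)
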
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -32,6 +32,7 @@ As it is installable via `%pip install`, it can also be incorporated in environm
Options for column specification <options_and_features>
Generating repeatable data <repeatable_data_generation>
Using streaming data <using_streaming_data>
Generating JSON and structured column data <generating_json_data>
Generating Change Data Capture (CDC) data<generating_cdc_data>
Using multiple tables <multi_table_data>
Extending text generation <extending_text_generation>
63 changes: 55 additions & 8 deletions docs/source/options_and_features.rst
@@ -12,27 +12,60 @@ Options for column specification
The following table lists some of the common options that can be applied with the ``withColumn`` and ``withColumnSpec``
methods.

.. table:: Column creation options

================ ==============================
Parameter Usage
================ ==============================
minValue         Minimum value for range of generated value. As an alternative, use ``dataRange``.

maxValue         Maximum value for range of generated value. As an alternative, use ``dataRange``.

step Step to use for range of generated value.

As an alternative, you may use the `dataRange` parameter

random If `True`, will generate random values for column value. Defaults to `False`

randomSeedMethod Determines how seed will be used.

If set to the value 'fixed', will use fixed random seed.

If set to 'hash_fieldname', it will use a hash of the field name as the random seed
for a specific column.

baseColumn Either the string name of the base column, or a list of columns to use to control data generation.

values List of discrete values for the column.

Discrete values can be numeric, dates, timestamps, strings, etc.

weights List of discrete weights for the column. Controls spread of values

percentNulls Percentage of nulls to generate for column.

Fraction representing percentage between 0.0 and 1.0

uniqueValues Number of distinct unique values for the column. Use as alternative to data range.

begin Beginning of range for date and timestamp fields.

end End of range for date and timestamp fields.

interval Interval of range for date and timestamp fields.

dataRange An instance of an `NRange` or `DateRange` object.

This can be used in place of ``minValue``, etc.

template Template controlling text generation

omit If True, omit column from final output.

Use when column is only needed to compute other columns.

expr SQL expression to control data generation

================ ==============================


@@ -44,12 +77,26 @@
For more information, see :data:`~dbldatagen.daterange.DateRange`
or :data:`~dbldatagen.daterange.NRange`.
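
For example, a numeric range can be supplied with the `dataRange` parameter in place of separate `minValue`,
`maxValue` and `step` values. The following is a minimal sketch; the generator and column names are
illustrative only.

.. code-block:: python

    import dbldatagen as dg
    from pyspark.sql.types import IntegerType

    df_spec = (dg.DataGenerator(spark, name="range_sketch", rows=1000, partitions=4)
               # equivalent to specifying minValue=0, maxValue=100, step=5
               .withColumn("score", IntegerType(), dataRange=dg.NRange(0, 100, 5), random=True)
               )

    df = df_spec.build()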

Using custom SQL to control data generation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The `expr` attribute can be used to specify an arbitrary Spark SQL expression to control how the data is
generated for a column. If the body of the SQL references other columns, you will need to ensure that
those columns are created first.

By default, the columns are created in the order specified.

However, you can control the order of column creation using the `baseColumn` attribute.
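
For example, in the following sketch the `full_name` column is computed from `first_name` and `last_name`
with a SQL expression, and the `baseColumn` attribute ensures those columns are generated first. The
generator and column names are illustrative only.

.. code-block:: python

    import dbldatagen as dg
    from pyspark.sql.types import StringType

    df_spec = (dg.DataGenerator(spark, name="expr_sketch", rows=1000, partitions=4)
               .withColumn("first_name", StringType(), values=["alice", "bob", "carol"], random=True)
               .withColumn("last_name", StringType(), values=["smith", "jones", "lee"], random=True)

               # computed from the two columns above; baseColumn controls generation order
               .withColumn("full_name", StringType(),
                           expr="concat(first_name, ' ', last_name)",
                           baseColumn=["first_name", "last_name"])
               )

    df = df_spec.build()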

More Details
^^^^^^^^^^^^

The full set of options for column specification which may be used with the ``withColumn``, ``withColumnSpec``,
and ``withColumnSpecs`` methods can be found at:

* :data:`~dbldatagen.column_spec_options.ColumnSpecOptions`


Generating views automatically
------------------------------
