Commit c1ee9cb

Feature repeatable text generation (#132)
* updates for test speed improvements
* updated tests
* updated tests
* updated tests
* updated tests
* reverted pytest changes - separate feature
* reverted pytest changes - separate feature
* reverted pytest changes - separate feature
* reverted pytest changes - separate feature
* changed partitioning to run more efficiently on github runner
* changed partitioning to run more efficiently on github runner
* changed partitioning to run more efficiently on github runner
* changed partitioning to run more efficiently on github runner
* changed partitioning to run more efficiently on github runner
* use as query name for spark instance
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* Updates to template text generation for better performance and repeatable text generation
* Updates to template text generation for better performance and repeatable text generation
* additional test coverage
* updated tests for ILText generation
* updated tests for ILText generation
* change to test potential break in build process
* added explicit python version setup to build
1 parent efd6253 commit c1ee9cb

7 files changed, with 796 additions and 255 deletions.

dbldatagen/text_generators.py

Lines changed: 344 additions & 67 deletions
Large diffs are not rendered by default.

docs/source/repeatable_data_generation.rst

Lines changed: 13 additions & 2 deletions
@@ -36,7 +36,7 @@ In addition, two columns in two different tables produced with the same generati
 the same values allowing for creation of multiple tables with referential integrity for use in joins.
 
 .. note::
-   The key exception to repeatability is where the data set contains the timestamp or date of when the
+   The exception to repeatability is where the data set contains the timestamp or date of when the
    data is written. In these cases, runs from a later date will have different values.
 
 This is why we stress generating date or timestamp ranges with a specific ``begin``, ``end`` and ``interval``
@@ -59,6 +59,11 @@ All columns will use the same random seed unless the random seed method is speci
 the seed is overridden at the column level. In the case of the use of the 'hash_fieldname' generation method,
 it will use a hash value of the field name so that each column has a different seed.
 
+.. note::
+   The text generators for templates and ILText always generate random data irrespective of the base column.
+   That means, that these will produce repeatable data from run to run if a random seed is used - but not produce the
+   same values for the same value of the base column.
+
 True random Data
 ^^^^^^^^^^^^^^^^
 To generate true random values, the random seed of -1 must be specified, either at the data spec level or at the
@@ -68,13 +73,19 @@ In this case,
 there is no guarantees of data repeatability - but you can constrain the data generated to specific ranges to use as
 foreign keys for data in other tables.
 
-If columns are not marked random - they will produce a repeatable set of data. For most columns, as the columns
+If columns are not marked random - they will produce a repeatable set of data (with the exception of ILText, Template
+generation and third party library integration). For most columns, as the columns
 are produced by a deterministic transformation on the corresponding base columns, the data will always be repeatable.
 
 For columns generated using an inherently random process such as those produced with the template generation, ILText
 and text data generator plugins, the random process will be seeded with a constant value unless the corresponding
 column specification is marked as ``random``.
 
+.. note::
+   Again this means data will be repeatable run to run but not for a specific
+   value of the base column. For some 3rd party libraries such as `Faker` there is no integration of the random seeding
+   capabilities at present so data will not be repeatable run to run.
+
 If a random seed is provided, either as an argument to the DataGenerator instance specification,
 or as option on the column specification, the random seed will be applied to fields when random data generation is used.

tests/test_data_generation_plugins.py

Lines changed: 18 additions & 37 deletions
@@ -1,4 +1,4 @@
-import unittest
+import pytest
 
 import pandas as pd
 import numpy as np
@@ -9,30 +9,30 @@
 spark = dg.SparkSingleton.getLocalInstance("basic tests")
 
 
-class TestTextGenerationPlugins(unittest.TestCase):
+class TestTextGenerationPlugins:
     row_count = 15000
     column_count = 10
 
-    def test_plugins(self):
+    @pytest.mark.parametrize("dataRows", [1000, 10000, 100000])
+    def test_plugins(self, dataRows):
         partitions_requested = 4
-        data_rows = 100 * 1000
 
         def initPluginContext(context):
             context.prefix = "testing"
 
         text_generator = (lambda context, v: context.prefix + str(v))
 
-        pluginDataspec = (dg.DataGenerator(spark, rows=data_rows, partitions=partitions_requested)
+        pluginDataspec = (dg.DataGenerator(spark, rows=dataRows, partitions=partitions_requested)
                           .withColumn("text", text=PyfuncText(text_generator, init=initPluginContext))
                           )
         dfPlugin = pluginDataspec.build()
 
-        self.assertTrue(dfPlugin.count() == data_rows)
+        assert dfPlugin.count() == dataRows
 
         dfCheck = dfPlugin.where("text like 'testing%'")
         new_count = dfCheck.count()
 
-        self.assertTrue(new_count == data_rows)
+        assert new_count == dataRows
 
     def test_plugin_clone(self):
         partitions_requested = 4
@@ -51,7 +51,7 @@ def initPluginContext(context):
         dfCheck = dfPlugin.where("text like 'testing%'")
         new_count = dfCheck.count()
 
-        self.assertTrue(new_count == data_rows)
+        assert new_count == data_rows
 
         # now check the clone
 
@@ -61,7 +61,7 @@ def initPluginContext(context):
         dfCheck2 = dfPlugin2.where("text like 'testing%'")
         new_count2 = dfCheck2.count()
 
-        self.assertTrue(new_count2 == data_rows)
+        assert new_count2 == data_rows
 
     def test_plugins_extended_syntax(self):
         """ test property syntax"""
@@ -85,12 +85,12 @@ def initPluginContext(context):
                           )
         dfPlugin = pluginDataspec.build()
 
-        self.assertTrue(dfPlugin.count() == data_rows)
+        assert dfPlugin.count() == data_rows
 
         dfCheck = dfPlugin.where("text like 'testing1'")
         new_count = dfCheck.count()
 
-        self.assertTrue(new_count == data_rows)
+        assert new_count == data_rows
 
     def test_plugins_extended_syntax2(self):
         """ test arg passing"""
@@ -115,12 +115,12 @@ def initPluginContext(context):
                           )
         dfPlugin = pluginDataspec.build()
 
-        self.assertTrue(dfPlugin.count() == data_rows)
+        assert dfPlugin.count() == data_rows
 
         dfCheck = dfPlugin.where("text like 'testing1'")
         new_count = dfCheck.count()
 
-        self.assertTrue(new_count == data_rows)
+        assert new_count == data_rows
 
     def test_plugins_extended_syntax3(self):
         partitions_requested = 4
@@ -143,12 +143,12 @@ def initPluginContext(context):
                           )
         dfPlugin = pluginDataspec.build()
 
-        self.assertTrue(dfPlugin.count() == data_rows)
+        assert dfPlugin.count() == data_rows
 
         dfCheck = dfPlugin.where("text like 'testing1again'")
         new_count = dfCheck.count()
 
-        self.assertTrue(new_count == data_rows)
+        assert new_count == data_rows
 
     def test_plugins_extended_syntax4(self):
         """ Test syntax extensions """
@@ -175,7 +175,7 @@ def initPluginContext(context):
         output = list(textGen.pandasGenerateText(inputSeries))
 
         for x in output:
-            self.assertEqual(x, "testing1again")
+            assert x == "testing1again"
 
     def test_plugins_faker_integration(self):
         """ test faker integration with mock objects"""
@@ -203,7 +203,7 @@ def test_plugins_faker_integration(self):
         dfFaker2 = fakerDataspec2.build()
         output = dfFaker2.select("name").collect()
         for x in output:
-            self.assertTrue(x["name"].startswith("<MagicMock"))
+            assert x["name"].startswith("<MagicMock")
 
     def test_plugins_faker_integration2(self):
         """ test faker integration with mock objects"""
@@ -231,24 +231,5 @@ def test_plugins_faker_integration2(self):
         dfFaker2 = fakerDataspec2.build()
         output = dfFaker2.select("name").collect()
         for x in output:
-            self.assertTrue(x["name"].startswith("<MagicMock"))
-
-
-# run the tests
-# if __name__ == '__main__':
-# print("Trying to run tests")
-# unittest.main(argv=['first-arg-is-ignored'],verbosity=2,exit=False)
-
-# def runTests(suites):
-# suite = unittest.TestSuite()
-# result = unittest.TestResult()
-# for testSuite in suites:
-# suite.addTest(unittest.makeSuite(testSuite))
-# runner = unittest.TextTestRunner()
-# print(runner.run(suite))
-
-
-# runTests([TestBasicOperation])
+            assert x["name"].startswith("<MagicMock")
 
-if __name__ == '__main__':
-    unittest.main()

tests/test_distributions.py

Lines changed: 3 additions & 0 deletions
@@ -34,6 +34,9 @@ def setUpClass(cls):
         )
         cls.testdata_generator.build().cache().createOrReplaceTempView("testdata")
 
+        # change to test build process
+        print("inside setupClass")
+
     @classmethod
     def unique_timestamp_seconds(cls):
         return (datetime.datetime.utcnow() - datetime.datetime.fromtimestamp(0)).total_seconds()

tests/test_pandas_integration.py

Lines changed: 1 addition & 15 deletions
@@ -98,18 +98,4 @@ def test_numpy2(self):
 
         self.assertGreater(np.sum(data), 0)
 
-# run the tests
-# if __name__ == '__main__':
-# print("Trying to run tests")
-# unittest.main(argv=['first-arg-is-ignored'],verbosity=2,exit=False)
-
-# def runTests(suites):
-# suite = unittest.TestSuite()
-# result = unittest.TestResult()
-# for testSuite in suites:
-# suite.addTest(unittest.makeSuite(testSuite))
-# runner = unittest.TextTestRunner()
-# print(runner.run(suite))
-
-
-# runTests([TestBasicOperation])
+#
