-
-
Notifications
You must be signed in to change notification settings - Fork 261
Populate make_classification_df date functionality with random dates #851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 4 commits
4b55f9a
424fca4
b740b27
24461fb
3a7c9d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -381,10 +381,11 @@ def make_classification( | |
| return X, y | ||
|
|
||
|
|
||
| def random_date(start, end): | ||
| def random_date(start, end, random_state=None): | ||
| rng_random_date = dask_ml.utils.check_random_state(random_state) | ||
| delta = end - start | ||
| int_delta = (delta.days * 24 * 60 * 60) + delta.seconds | ||
| random_second = np.random.randint(int_delta) | ||
| random_second = rng_random_date.randint(int_delta).compute().item() | ||
| return start + timedelta(seconds=random_second) | ||
|
|
||
|
|
||
|
|
@@ -430,6 +431,13 @@ def make_classification_df( | |
| The output values. | ||
|
|
||
| """ | ||
| if ( | ||
| random_state is not None | ||
| or not isinstance(random_state, np.random.RandomState) | ||
| or not isinstance(random_state, int) | ||
| ): | ||
| random_state = None | ||
|
||
|
|
||
| X_array, y_array = make_classification( | ||
| n_samples=n_samples, | ||
| flip_y=(1 - predictability), | ||
|
|
@@ -451,8 +459,13 @@ def make_classification_df( | |
| [ | ||
| X_df, | ||
| dd.from_array( | ||
| np.array([random_date(*dates)] * len(X_df)), | ||
| chunksize=chunks, | ||
| np.array( | ||
| [ | ||
| random_date(*dates, random_state + i) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens when
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, I will have to raise a ValueError exception there, on it.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not use this code? rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
...
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The code above will produce the same random number since the seed(
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My main point: I think [ins] In [193]: def random_dates(random_state):
...: return random_state.randint(100)
...:
[ins] In [194]: rng = np.random.RandomState(42)
[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can maybe just check if random_state is one of the accepted values like this and accordingly proceed -
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That runs counter to the use of If
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Scikit-learn's It takes those values and produces the correct random seed generator.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @stsievert the accepted types of |
||
| for i in range(len(X_df)) | ||
| ] | ||
| ), | ||
| chunksize=n_samples, | ||
| columns=["date"], | ||
| ), | ||
| ], | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -73,10 +73,22 @@ def test_make_classification_df(): | |
| dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
| ) | ||
|
|
||
| X_df1, y_series1 = dask_ml.datasets.make_classification_df( | ||
| n_samples=100, | ||
| n_features=5, | ||
| random_state=123, | ||
| chunks=100, | ||
| dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
| ) | ||
| check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute()) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good, this checks the random state. Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The same I think
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh yeah, |
||
|
|
||
| assert X_df is not None | ||
| assert y_series is not None | ||
| assert "date" in X_df.columns | ||
| assert len(X_df.columns) == 6 | ||
| assert len(X_df) == 100 | ||
| assert len(y_series) == 100 | ||
| assert isinstance(y_series, dask.dataframe.core.Series) | ||
| assert check_randomness.size == 1 | ||
| assert check_randomness[0] is True | ||
| assert np.unique(X_df["date"]).size >= 2 | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
That way the
.compute()can be avoided (especially relevant on repeated calls.).