Commit 67d44cf

cosmetic changes
Julien Roussel authored and committed
1 parent 6a5c5bc commit 67d44cf

5 files changed: 26 additions, 24 deletions
docs/analysis.rst

Lines changed: 13 additions & 13 deletions
@@ -1,22 +1,22 @@

 Analysis
 ========
-The analysis section gives a better understanding of the holes in a dataset.
+This section gives a better understanding of the holes in a dataset.

 1. General approach
 -------------------

 As described in section :ref:`hole_generator`, there are 3 main types of missing data mechanism: MCAR, MAR and MNAR.
-The analysis brick provides tools to charaterize the type of holes.
+The analysis module provides tools to characterize the type of holes.

-The MNAR case is the trickiest, the user must first consider whether or not his missing data mechanism is MNAR. In the meantime, we make the assumption that the missing-data mechanism is ignorable (ie is not MNAR). If the MNAR missing data mechanism is suspected, please see this article :ref:`An approach to test for MNAR [1]<Noonan-article>`.
+The MNAR case is the trickiest, the user must first consider whether their missing data mechanism is MNAR. In the meantime, we make assume that the missing-data mechanism is ignorable (ie., it is not MNAR). If an MNAR mechanism is suspected, please see this article :ref:`An approach to test for MNAR [1]<Noonan-article>` for relevant actions.

 Then Qolmat proposes a test to determine whether the missing data mechanism is MCAR or MAR.

-2. How to use the results ?
----------------------------
+2. How to use the results
+-------------------------

-At the end of the MCAR test, it can then be assumed whether the missing data mechanism is MCAR or not. This could be used for several things :
+At the end of the MCAR test, it can then be assumed whether the missing data mechanism is MCAR or not. This serves three differents purposes:

 a. Diagnosis
 ^^^^^^^^^^^^
@@ -27,30 +27,30 @@ The test result can then be used for continuous data quality management.
 b. Estimation
 ^^^^^^^^^^^^^

-Some estimation methods are not suitable for the MAR case. For example, dropingn the nans introduces bias into the estimator, it is necessary to have validated that the missing-data mechanism is MCAR.
+Some estimation methods are not suitable for the MAR case. For example, dropping the nans introduces bias into the estimator, it is necessary to have validated that the missing-data mechanism is MCAR.

 c. Imputation
 ^^^^^^^^^^^^^

-Qolmat allows model selection imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified hole generator. It seems natural to create these masks according to the same missing-data mechanism as dtermined by the test. Here's the documentation on using Qolmat for imputation model selection. : `here <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.
+Qolmat allows model selection imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator. It seems natural to create these masks according to the same missing-data mechanism as determined by the test. Here is the documentation on using Qolmat for imputation `model selection <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.

 3. The MCAR Tests
 -----------------

-There exist several statistical tests to determine if the missing data mechanism is MCAR or MAR. Most tests are based on the notion of missing pattern.
-A missing pattern, also called pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with 2 columns, the possible patterns are : (0, 0), (1, 0), (0, 1), (1, 1). The value 1 indicates that the value in the column is missing.
+There are several statistical tests to determine if the missing data mechanism is MCAR or MAR. Most tests are based on the notion of missing pattern.
+A missing pattern, also called a pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with two columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1 indicates that the value in the column is missing.

 The MCAR missing-data mechanism means that there is independence between the presence of holes and the observed values. In other words, the data distribution is the same for all patterns.

 a. Little's Test
 ^^^^^^^^^^^^^^^^

-The best-known MCAR test is the :ref:`Little [2]<Little-article>` test. Keep in mind that the Little's test is designed to test the homogeneity of means accross the missing patterns and won't be efficient to detect the heterogeneity of covariance accross missing patterns.
+The best-known MCAR test is the :ref:`Little [2]<Little-article>` test, and it has been implemented in :class:`LittleTest`. Keep in mind that the Little's test is designed to test the homogeneity of means across the missing patterns and won't be efficient to detect the heterogeneity of covariance accross missing patterns.

 b. PKLM Test
 ^^^^^^^^^^^^

-The :ref:`PKLM [2]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data.
+The :ref:`PKLM [2]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data. It is not implemented yet in Qolmat.

 References
 ----------
@@ -61,7 +61,7 @@ References

 .. _Little-article:

-[2] Little. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.
+[2] Little, R. J. A. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.

 .. _PKLM-article:
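Note on the documentation text in this file: the notion of missing pattern from section 3 can be illustrated with plain pandas. This is only a sketch. The toy DataFrame and the variable names are assumptions made for illustration and are not part of the commit. It follows the documentation's convention that 1 marks a missing value, and groups rows by the mask columns in the same spirit as the `groupby` call in `LittleTest.test`.

```python
import numpy as np
import pandas as pd

# Toy dataset with two columns (made up for illustration).
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan, 5.0],
        "b": [1.0, 2.0, np.nan, np.nan, 5.0],
    }
)

# 1 marks a missing value, 0 an observed one (the convention used in the docs).
mask = df.isna().astype(int)

# Group the rows by pattern and count how many rows share each pattern.
pattern_counts = {
    tuple(int(v) for v in pattern): len(rows)
    for pattern, rows in mask.groupby(mask.columns.tolist())
}
print(pattern_counts)
```

Under MCAR, the rows falling in each of these groups would follow the same data distribution, which is what the tests described above compare.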

docs/index.rst

Lines changed: 1 addition & 1 deletion
@@ -31,4 +31,4 @@
 :caption: ANALYSIS

 analysis
-examples/tutorials/plot_tuto_mcar_test
+examples/tutorials/plot_tuto_mcar
Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@

 In this tutorial, we show how to test the MCAR case using the Little's test.
 """
+
 # %%
 # First import some libraries
 from matplotlib import pyplot as plt

qolmat/analysis/holes_characterization.py

Lines changed: 10 additions & 9 deletions
@@ -8,7 +8,7 @@
 from qolmat.imputations.imputers import ImputerEM


-class MCARTest(ABC):
+class McarTest(ABC):
     """
     Astract class for MCAR tests.
     """
@@ -18,11 +18,11 @@ def test(self, df: pd.DataFrame) -> float:
         pass


-class LittleTest(MCARTest):
+class LittleTest(McarTest):
     """
-    This class implements the Little's test. The Little's test is designed to detect the
-    heterogeneity accross the missing patterns. The null hypothesis is "The missing data mechanism
-    is MCAR". Be aware that this test won't detect the heterogeneity of covariance.
+    This class implements the Little's test, which is designed to detect the heterogeneity accross
+    the missing patterns. The null hypothesis is "The missing data mechanism is MCAR". The
+    shortcoming of this test is that it won't detect the heterogeneity of covariance.

     References
     ----------
@@ -67,15 +67,16 @@ def test(self, df: pd.DataFrame) -> float:
             The p-value of the test.
         """
         imputer = self.imputer or ImputerEM(random_state=self.random_state)
-        fitted_imputer = imputer._fit_element(df)
+        imputer = imputer._fit_element(df)

         d0 = 0
         n_rows, n_cols = df.shape
         degree_f = -n_cols
-        ml_means = fitted_imputer.means
-        ml_cov = n_rows / (n_rows - 1) * fitted_imputer.cov
+        ml_means = imputer.means
+        ml_cov = n_rows / (n_rows - 1) * imputer.cov

         # Iterate over the patterns
+
         df_nan = df.notna()
         for tup_pattern, df_nan_pattern in df_nan.groupby(df_nan.columns.tolist()):
             n_rows_pattern, _ = df_nan_pattern.shape
@@ -89,4 +90,4 @@ def test(self, df: pd.DataFrame) -> float:
             d0 += n_rows_pattern * np.dot(np.dot(diff_means, inv_sigma_pattern), diff_means.T)
             degree_f += tup_pattern.count(True)

-        return 1 - chi2.cdf(d0, degree_f)
+        return 1 - float(chi2.cdf(d0, degree_f))
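
Note on the last hunk: the only change is the `float(...)` wrapper around `chi2.cdf`. With scalar inputs, SciPy typically returns a NumPy scalar, so the cast makes the method return a built-in `float`, in line with its `-> float` annotation. The sketch below reproduces just that final step. The values of `d0` and `degree_f` are made up for illustration and do not come from any real dataset.

```python
from scipy.stats import chi2

# Hypothetical test statistic and degrees of freedom (illustrative values only).
d0 = 3.0
degree_f = 2

# Same expression as the new return statement in LittleTest.test.
p_value = 1 - float(chi2.cdf(d0, degree_f))

# chi2.sf gives the same upper-tail probability directly.
p_value_sf = float(chi2.sf(d0, degree_f))

print(type(p_value).__name__, p_value, p_value_sf)
```

A small p-value would lead to rejecting the null hypothesis stated in the class docstring, "The missing data mechanism is MCAR".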

tests/analysis/test_holes_characterization.py

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ def mar_hc_df() -> pd.DataFrame:
     quantile_95 = norm.ppf(0.975)
     df = pd.DataFrame(matrix, columns=["Column_1", "Column_2"])
     df_nan = df.copy()
-    df_nan.loc[abs(df_nan["Column_1"]) > quantile_95, "Column_2"] = np.nan
+    df_nan.loc[df_nan["Column_1"].abs() > quantile_95, "Column_2"] = np.nan

     df_mask = df_nan.isna()
     return df.mask(df_mask)
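
Note on this fixture: it builds a MAR dataset by masking `Column_2` wherever `Column_1` is extreme, and the commit only swaps the built-in `abs()` for the `Series.abs()` method. The sketch below is a standalone approximation, not the test file itself. The random data, the seed and the sample size are assumptions, since the hunk does not show how `matrix` is built. Note that `norm.ppf(0.975)` is the 97.5% quantile (about 1.96), so the two-sided condition masks roughly 5% of the rows.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Hypothetical data: 200 draws from two independent standard normals.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["Column_1", "Column_2"])

# Two-sided 95% threshold, as in the fixture.
threshold = norm.ppf(0.975)

# Missingness in Column_2 depends only on the observed Column_1 (MAR).
df_nan = df.copy()
df_nan.loc[df_nan["Column_1"].abs() > threshold, "Column_2"] = np.nan

# The built-in abs() and Series.abs() select the same rows.
same_rows = np.array_equal(
    (abs(df["Column_1"]) > threshold).to_numpy(),
    (df["Column_1"].abs() > threshold).to_numpy(),
)
print(same_rows, int(df_nan["Column_2"].isna().sum()))
```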
