Commit 6a5c5bc

📝 Changes following to the latest PR. Create an MCARTest abstract class, use pd in the tutorials and change the documentation.
1 parent 4fa0378 commit 6a5c5bc

File tree

13 files changed

+313
-219
lines changed

docs/analysis.rst

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,68 @@
1+
2+
Analysis
3+
========
4+
The analysis section gives a better understanding of the holes in a dataset.
5+
6+
1. General approach
7+
-------------------
8+
9+
As described in section :ref:`hole_generator`, there are 3 main types of missing data mechanism: MCAR, MAR and MNAR.
10+
The analysis brick provides tools to characterize the type of holes.
11+
12+
The MNAR case is the trickiest: the user must first consider whether the missing-data mechanism is MNAR. In the meantime, we assume that the missing-data mechanism is ignorable (i.e. not MNAR). If an MNAR mechanism is suspected, please see the article :ref:`An approach to test for MNAR [1]<Noonan-article>`.
13+
14+
Then Qolmat proposes a test to determine whether the missing data mechanism is MCAR or MAR.
15+
16+
2. How to use the results ?
17+
---------------------------
18+
19+
Once the MCAR test has been run, we can decide whether to assume that the missing-data mechanism is MCAR. This conclusion can be used in several ways:
20+
21+
a. Diagnosis
22+
^^^^^^^^^^^^
23+
24+
If the result of the MCAR test is "The MCAR hypothesis is rejected", we can then investigate over which range of values holes are more frequent.
25+
The test result can then be used for continuous data quality management.
26+
27+
b. Estimation
28+
^^^^^^^^^^^^^
29+
30+
Some estimation methods are not suitable for the MAR case. For example, dropping the NaNs introduces bias into the estimator unless the missing-data mechanism is MCAR, so this assumption must be validated first.
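
The bias mentioned above can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables ``x`` and ``y``, the correlation of 0.8, the 20% masking rate and the 0.5 threshold are assumptions chosen for illustration, and nothing here uses the Qolmat API.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Two correlated standard normal variables; the true mean of y is 0.
x = rng.standard_normal(n)
y = 0.8 * x + 0.6 * rng.standard_normal(n)

# MCAR: y is hidden for a random 20% of the rows, independently of the data.
y_mcar = np.where(rng.random(n) < 0.2, np.nan, y)

# MAR: y is hidden whenever the always-observed x exceeds 0.5 (arbitrary threshold).
y_mar = np.where(x > 0.5, np.nan, y)

# "Dropping the NaNs" amounts to averaging the observed values only.
mean_mcar = np.nanmean(y_mcar)
mean_mar = np.nanmean(y_mar)

print(f"complete-case mean under MCAR: {mean_mcar:.3f}")
print(f"complete-case mean under MAR:  {mean_mar:.3f}")
```

Under MCAR the complete-case mean stays close to the true value 0, whereas under MAR it is pulled towards negative values, because the rows with large ``x`` (and hence large ``y``) are the ones that were dropped.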
31+
32+
c. Imputation
33+
^^^^^^^^^^^^^
34+
35+
Qolmat supports model selection among imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator. It seems natural to create these masks according to the same missing-data mechanism as the one determined by the test. The documentation on using Qolmat for imputation model selection is available `here <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.
36+
37+
3. The MCAR Tests
38+
-----------------
39+
40+
There exist several statistical tests to determine if the missing data mechanism is MCAR or MAR. Most tests are based on the notion of missing pattern.
41+
A missing pattern, also called pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with 2 columns, the possible patterns are : (0, 0), (1, 0), (0, 1), (1, 1). The value 1 indicates that the value in the column is missing.
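
As a minimal sketch of this definition, the patterns of a small two-column table can be listed with pandas. The toy values and the column names ``Column_1`` and ``Column_2`` are assumptions made for the example.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy dataset with two columns; NaN marks a hole.
df = pd.DataFrame(
    {
        "Column_1": [1.0, np.nan, 3.0, np.nan, 5.0],
        "Column_2": [0.5, 1.5, np.nan, np.nan, 2.5],
    }
)

# Encode each row as a tuple: 1 means "missing", 0 means "observed".
patterns = [tuple(int(flag) for flag in row) for row in df.isna().to_numpy()]

# Count how many rows fall into each pattern.
pattern_counts = Counter(patterns)
for pattern, count in sorted(pattern_counts.items()):
    print(pattern, count)
```

Here the pattern (0, 0) appears twice and each of the three other patterns appears once.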
42+
43+
The MCAR missing-data mechanism means that there is independence between the presence of holes and the observed values. In other words, the data distribution is the same for all patterns.
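
To make this statement concrete, the following sketch compares the mean of an always-observed column between the two patterns, first with MCAR holes and then with MAR holes. The data are synthetic and the masking rules (a 20% uniform rate, a threshold of 1.0) are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame(rng.standard_normal((n, 2)), columns=["Column_1", "Column_2"])

# Holes are only created in Column_2, so there are two patterns:
# (0, 0) when Column_2 is observed and (0, 1) when it is missing.
missing_mcar = rng.random(n) < 0.2                     # independent of the data
missing_mar = (df["Column_1"] > 1.0).to_numpy()        # driven by the observed Column_1


def mean_gap(missing):
    """Difference in the mean of Column_1 between pattern (0, 1) and pattern (0, 0)."""
    column = df["Column_1"].to_numpy()
    return column[missing].mean() - column[~missing].mean()


gap_mcar = mean_gap(missing_mcar)
gap_mar = mean_gap(missing_mar)
print(f"gap under MCAR: {gap_mcar:.3f}")
print(f"gap under MAR:  {gap_mar:.3f}")
```

With MCAR holes the two patterns share the same distribution of ``Column_1``, so the gap is close to zero. With MAR holes the gap is large, which is the kind of signal the MCAR tests look for.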
44+
45+
a. Little's Test
46+
^^^^^^^^^^^^^^^^
47+
48+
The best-known MCAR test is the :ref:`Little [2]<Little-article>` test. Keep in mind that Little's test is designed to test the homogeneity of means across the missing patterns and is not effective at detecting heterogeneity of covariance across missing patterns.
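
The idea of comparing pattern means can be sketched as follows. This is a simplified illustration and not Qolmat's implementation: Little's test relies on maximum-likelihood (EM) estimates of the global mean and covariance, whereas this sketch substitutes available-case estimates, and the function name ``little_statistic_sketch`` is hypothetical. For each pattern, the mean of the observed columns is compared with the global mean through a Mahalanobis-type distance, and the sum is compared with a chi-squared distribution whose degrees of freedom are the total number of observed columns over all patterns minus the number of columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_statistic_sketch(df):
    """Return (statistic, dof, p_value) for a simplified Little-type test."""
    values = df.to_numpy(dtype=float)
    mask = np.isnan(values)
    # Assumption: available-case estimates stand in for the EM estimates.
    mu = np.nanmean(values, axis=0)
    sigma = df.cov().to_numpy()

    statistic = 0.0
    dof = -values.shape[1]
    for pattern in np.unique(mask.astype(int), axis=0):
        observed = pattern == 0
        if not observed.any():
            continue  # fully missing rows carry no information
        rows = (mask == (pattern == 1)).all(axis=1)
        # Distance between the pattern mean and the global mean.
        diff = values[rows][:, observed].mean(axis=0) - mu[observed]
        sigma_obs = sigma[np.ix_(observed, observed)]
        statistic += rows.sum() * (diff @ np.linalg.solve(sigma_obs, diff))
        dof += int(observed.sum())
    return statistic, dof, chi2.sf(statistic, dof)


# Synthetic MAR example: Column_2 is hidden when Column_1 is large.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((2000, 2)), columns=["Column_1", "Column_2"])
df_mar = df.copy()
df_mar.loc[df_mar["Column_1"] > 1.0, "Column_2"] = np.nan

statistic, dof, p_value = little_statistic_sketch(df_mar)
print(statistic, dof, p_value)
```

Because the statistic only looks at pattern means, a mechanism that leaves the means unchanged but alters the spread of a column across patterns would not be picked up by it.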
49+
50+
b. PKLM Test
51+
^^^^^^^^^^^^
52+
53+
The :ref:`PKLM [3]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data.
54+
55+
References
56+
----------
57+
58+
.. _Noonan-article:
59+
60+
[1] Noonan, Jack, et al. `An integrated approach to test for missing not at random. <https://arxiv.org/abs/2208.07813>`_ arXiv preprint arXiv:2208.07813 (2022).
61+
62+
.. _Little-article:
63+
64+
[2] Little. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.
65+
66+
.. _PKLM-article:
67+
68+
[3] Spohn, Meta-Lina, et al. `PKLM: A flexible MCAR test using Classification. <https://arxiv.org/abs/2109.10150>`_ arXiv preprint arXiv:2109.10150 (2021).

docs/api.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -83,7 +83,7 @@ RPCA engine
8383
:template: class.rst
8484

8585
imputations.rpca.rpca_pcp.RPCAPCP
86-
imputations.rpca.rpca_noisy.RPCANoisy
86+
imputations.rpca.rpca_noisy.RpcaNoisy
8787

8888

8989
EM engine

docs/audit.rst

Lines changed: 0 additions & 3 deletions
This file was deleted.

docs/imputers.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implement
3838

3939
**Noisy RPCA** [2, 3, 4]
4040

41-
The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
41+
The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
4242

4343
.. math::
4444
\text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p

docs/index.rst

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,5 @@
11
.. include:: ../README.rst
22

3-
.. toctree::
4-
:maxdepth: 2
5-
:hidden:
6-
:caption: AUDIT
7-
8-
audit
9-
examples/tutorials/plot_tuto_mcar_test
10-
113
.. toctree::
124
:maxdepth: 2
135
:hidden:
@@ -32,3 +24,11 @@
3224
:caption: API
3325

3426
api
27+
28+
.. toctree::
29+
:maxdepth: 2
30+
:hidden:
31+
:caption: ANALYSIS
32+
33+
analysis
34+
examples/tutorials/plot_tuto_mcar_test

examples/RPCA.md

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -199,7 +199,6 @@ plt.show()
199199

200200
```python
201201
%%time
202-
# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
203202
rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
204203
M, A = rpca_noisy.decompose(D, Omega)
205204
# imputed = X

examples/tutorials/plot_tuto_hole_generator.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -282,7 +282,7 @@ def plot_cdf(
282282

283283

284284
# %%
285-
# d. Grouped Hole Generator
285+
# e. Grouped Hole Generator
286286
# ***************************************************************
287287
# The holes are generated according to the groups defined by the user.
288288
# This method is implemented in the

examples/tutorials/plot_tuto_mcar_test.py

Lines changed: 80 additions & 80 deletions
Original file line numberDiff line numberDiff line change
@@ -3,154 +3,154 @@
33
Tutorial for testing the MCAR case
44
============================================
55
6-
In this tutorial, we show how to use the mcar test class and its methods.
7-
8-
Keep in my mind that, at this moment, the mcar tests only handle tabular data.
6+
In this tutorial, we show how to test the MCAR case using Little's test.
97
"""
108
# %%
119
# First import some libraries
1210
from matplotlib import pyplot as plt
13-
import random
1411

1512
import numpy as np
1613
import pandas as pd
14+
from scipy.stats import norm
15+
16+
from qolmat.analysis.holes_characterization import LittleTest
17+
from qolmat.benchmark.missing_patterns import UniformHoleGenerator
1718

18-
from qolmat.audit.holes_characterization import MCARTest
19+
plt.rcParams.update({"font.size": 12})
1920

2021
# %%
2122
# 1. The Little's test
2223
# ---------------------------------------------------------------
23-
# How to use the Little's test ?
24-
# ==============================
25-
# When we deal with missing data in our dataset it's interesting to know the nature of these holes.
26-
# There exist three types of holes : MCAR, MAR and MNAR.
27-
# (see the: `Rubin's missing mechanism classification
28-
# <https://qolmat.readthedocs.io/en/latest/explanation.html>`_)
24+
# First, we need to introduce the concept of missing pattern. A missing pattern, also called
25+
# pattern, is the structure of observed and missing values in a data set. For example, for a
26+
# dataset with 2 columns, the possible patterns are : (0, 0), (1, 0), (0, 1), (1, 1). The value 1
27+
# (0) indicates that the value in the column is missing (observed).
2928
#
30-
# The simplest case to test is the MCAR case. The most famous MCAR statistical test is the
31-
# `Little's test <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_.
32-
# Keep in mind that the Little's test is designed to test the homogeneity of means between the
33-
# missing patterns and won't be efficient to detect the heterogeneity of covariance between missing
34-
# patterns.
29+
# The null hypothesis, H0, is : "The means of observations within each pattern are similar."
30+
# Against the alternative hypothesis, H1 : "The means of the observed variables can vary across the
31+
# patterns."
3532
#
36-
# The null hypothesis, H0, is : "The data are MCAR". Against,
37-
# The alternative hypothesis : " The data are not MCAR, the means of the observed variables can
38-
# vary across the patterns"
33+
# If H0 is not rejected, we can assume that the missing data mechanism is MCAR. On the contrary,
34+
# if H0 is rejected, we can assume that the missing data mechanism is MAR.
3935
#
40-
# We choose to use the classic threshold, equal to 5%. If the test pval is below this threshold,
36+
# We choose to use the classic threshold, equal to 5%. If the test p_value is below this threshold,
4137
# we reject the null hypothesis.
4238
#
4339
# This notebook shows how the Little's test performs and its limitations.
4440

45-
np.random.seed(11)
46-
47-
mcartest = MCARTest(method="little")
41+
mcartest = LittleTest()
4842

4943
# %%
50-
# Case 1 : Normal iid feature with MCAR holes
51-
# ===========================================
44+
# Case 1 : Normal iid features with MCAR holes
45+
# ============================================
5246

47+
np.random.seed(42)
5348
matrix = np.random.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
54-
matrix.ravel()[np.random.choice(matrix.size, size=20, replace=False)] = np.nan
55-
matrix_masked = matrix[np.argwhere(np.isnan(matrix))]
56-
df_1 = pd.DataFrame(matrix)
49+
df = pd.DataFrame(data=matrix, columns=["Column_1", "Column_2"])
50+
51+
hole_gen = UniformHoleGenerator(n_splits=1, random_state=42, subset=["Column_2"], ratio_masked=0.2)
52+
df_mask = hole_gen.generate_mask(df)
53+
df_unmasked = ~df_mask
54+
df_unmasked["Column_1"] = False
5755

58-
plt_1 = plt.scatter(matrix[:, 0], matrix[:, 1])
59-
plt_2 = plt.scatter(matrix_masked[:, 0], matrix_masked[:, 1])
56+
df_observed = df.mask(df_mask).dropna()
57+
df_hidden = df.mask(df_unmasked).dropna(subset="Column_2")
58+
59+
plt_1 = plt.scatter(df_observed.iloc[:, 0], df_observed.iloc[:, 1], label="Observed values")
60+
plt_2 = plt.scatter(df_hidden.iloc[:, 0], df_hidden.iloc[:, 1], label="Missing values")
6061

6162
plt.legend(
62-
(plt_1, plt_2),
63-
("observed_values", "masked_values"),
64-
scatterpoints=1,
6563
loc="lower left",
66-
ncol=1,
6764
fontsize=8,
6865
)
69-
7066
plt.title("Case 1 : MCAR missingness mechanism")
71-
plt.xlabel("x values (all observed)")
72-
plt.ylabel("y values (with missing ones)")
73-
7467
plt.show()
7568

7669
# %%
7770

78-
mcartest.test(df_1)
71+
mcartest.test(df.mask(df_mask))
7972
# %%
80-
# The p-value is quite high, therefore we don't reject H_0.
73+
# The p-value is quite high, therefore we don't reject H0.
8174
# We can then suppose that our missingness mechanism is MCAR.
8275

8376
# %%
84-
# Case 2 : Normal iid feature with MAR holes
85-
# ==========================================
86-
np.random.seed(11)
77+
# Case 2 : Normal iid features with MAR holes
78+
# ===========================================
79+
np.random.seed(42)
80+
quantile_95 = norm.ppf(0.975)
8781

8882
matrix = np.random.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
89-
threshold = random.uniform(0, 1)
90-
matrix[np.argwhere(matrix[:, 0] >= 1.96), 1] = np.nan
91-
matrix_masked = matrix[np.argwhere(np.isnan(matrix))]
92-
df_2 = pd.DataFrame(matrix)
83+
df = pd.DataFrame(matrix, columns=["Column_1", "Column_2"])
84+
df_nan = df.copy()
85+
df_nan.loc[df_nan["Column_1"] > quantile_95, "Column_2"] = np.nan
86+
87+
df_mask = df_nan.isna()
88+
df_unmasked = ~df_mask
89+
df_unmasked["Column_1"] = False
90+
91+
df_observed = df.mask(df_mask).dropna()
92+
df_hidden = df.mask(df_unmasked).dropna(subset="Column_2")
9393

94-
plt_1 = plt.scatter(matrix[:, 0], matrix[:, 1])
95-
plt_2 = plt.scatter(matrix_masked[:, 0], matrix_masked[:, 1])
94+
plt_1 = plt.scatter(df_observed.iloc[:, 0], df_observed.iloc[:, 1], label="Observed values")
95+
plt_2 = plt.scatter(df_hidden.iloc[:, 0], df_hidden.iloc[:, 1], label="Missing values")
9696

9797
plt.legend(
98-
(plt_1, plt_2),
99-
("observed_values", "masked_vlues"),
100-
scatterpoints=1,
10198
loc="lower left",
102-
ncol=1,
10399
fontsize=8,
104100
)
105-
106101
plt.title("Case 2 : MAR missingness mechanism")
107-
plt.xlabel("x values (all observed)")
108-
plt.ylabel("y values (with missing ones)")
109-
110102
plt.show()
111103

112104
# %%
113105

114-
mcartest.test(df_2)
106+
mcartest.test(df.mask(df_mask))
115107
# %%
116108
# The p-value is lower than the classic threshold (5%).
117-
# H_0 is then rejected and we can suppose that our missingness mechanism is MAR.
109+
# H0 is then rejected and we can suppose that our missingness mechanism is MAR.
118110

119111
# %%
120-
# Case 3 : Normal iid feature MAR holes
121-
# =====================================
122-
# The specific case is design to emphasize the Little's test limits. In the case, we generate holes
123-
# when the value of the first feature is high. This missingness mechanism is clearly MAR but the
124-
# means between missing patterns is not statistically different.
112+
# Case 3 : Normal iid features with MAR holes
113+
# ===========================================
114+
# This specific case is designed to emphasize the limits of Little's test. Here, we generate
115+
# holes when the absolute value of the first feature is high. This missingness mechanism is clearly
116+
# MAR but the means across missing patterns are not statistically different.
125117

126-
np.random.seed(11)
118+
np.random.seed(42)
127119

128120
matrix = np.random.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
129-
matrix[np.argwhere(abs(matrix[:, 0]) >= 1.96), 1] = np.nan
130-
matrix_masked = matrix[np.argwhere(np.isnan(matrix))]
131-
df_3 = pd.DataFrame(matrix)
121+
df = pd.DataFrame(matrix, columns=["Column_1", "Column_2"])
122+
df_nan = df.copy()
123+
df_nan.loc[abs(df_nan["Column_1"]) > quantile_95, "Column_2"] = np.nan
132124

133-
plt_1 = plt.scatter(matrix[:, 0], matrix[:, 1])
134-
plt_2 = plt.scatter(matrix_masked[:, 0], matrix_masked[:, 1])
125+
df_mask = df_nan.isna()
126+
df_unmasked = ~df_mask
127+
df_unmasked["Column_1"] = False
128+
129+
df_observed = df.mask(df_mask).dropna()
130+
df_hidden = df.mask(df_unmasked).dropna(subset="Column_2")
131+
132+
plt_1 = plt.scatter(df_observed.iloc[:, 0], df_observed.iloc[:, 1], label="Observed values")
133+
plt_2 = plt.scatter(df_hidden.iloc[:, 0], df_hidden.iloc[:, 1], label="Missing values")
135134

136135
plt.legend(
137-
(plt_1, plt_2),
138-
("observed_values", "masked_values"),
139-
scatterpoints=1,
140136
loc="lower left",
141-
ncol=1,
142137
fontsize=8,
143138
)
144-
145139
plt.title("Case 3 : MAR missingness mechanism undetected by the Little's test")
146-
plt.xlabel("x values (all observed)")
147-
plt.ylabel("y values (with missing ones)")
148-
149140
plt.show()
150141

151142
# %%
152143

153-
mcartest.test(df_3)
144+
mcartest.test(df.mask(df_mask))
154145
# %%
155146
# The p-value is higher than the classic threshold (5%).
156-
# H_0 is not rejected whereas the missingness mechanism is clearly MAR.
147+
# H0 is not rejected whereas the missingness mechanism is clearly MAR.
148+
149+
# %%
150+
# Limitations
151+
# -----------
152+
# In this tutorial, we can see that Little's test fails to detect covariance heterogeneity between
153+
# patterns.
154+
#
155+
# There exist other limitations. The Little's test only handles quantitative data. And finally, the
156+
# MCAR tests can only handle tabular data (without correlation in time).
