scikit-learn-contrib
diff --git a/docs/analysis.rst b/docs/analysis.rst
Lines changed: 68 additions & 0 deletions
diff --git a/docs/api.rst b/docs/api.rst
Lines changed: 1 addition & 1 deletion
diff --git a/docs/audit.rst b/docs/audit.rst
Lines changed: 0 additions & 3 deletions
diff --git a/docs/imputers.rst b/docs/imputers.rst
Lines changed: 1 addition & 1 deletion
diff --git a/docs/index.rst b/docs/index.rst
Lines changed: 8 additions & 8 deletions
diff --git a/examples/RPCA.md b/examples/RPCA.md
Lines changed: 0 additions & 1 deletion
diff --git a/examples/tutorials/plot_tuto_hole_generator.py b/examples/tutorials/plot_tuto_hole_generator.py
Lines changed: 1 addition & 1 deletion
diff --git a/examples/tutorials/plot_tuto_mcar_test.py b/examples/tutorials/plot_tuto_mcar_test.py
Lines changed: 80 additions & 80 deletions
@@ -0,0 +1,68 @@
+
+Analysis
+========
+The analysis section gives a better understanding of the holes in a dataset.
+
+1. General approach
+-------------------
+
+As described in section :ref:`hole_generator`, there are three main types of missing-data mechanisms: MCAR, MAR and MNAR.
+The analysis brick provides tools to characterize the type of holes.
+
+The MNAR case is the trickiest: the user must first consider whether or not the missing-data mechanism is MNAR. In the meantime, we assume that the missing-data mechanism is ignorable (i.e. it is not MNAR). If an MNAR mechanism is suspected, please see the article :ref:`An approach to test for MNAR [1]<Noonan-article>`.
+
+Qolmat then proposes a test to determine whether the missing-data mechanism is MCAR or MAR.
+
+2. How to use the results?
+--------------------------
+
+Once the MCAR test has been run, we can decide whether or not to assume that the missing-data mechanism is MCAR. This result can be used in several ways:
+
+a. Diagnosis
+^^^^^^^^^^^^
+
+If the result of the MCAR test is "The MCAR hypothesis is rejected", we can then investigate over which ranges of values holes are more frequent.
+The test result can then be used for continuous data quality management.
+
+b. Estimation
+^^^^^^^^^^^^^
+
+Some estimation methods are not suitable for the MAR case. For example, dropping the NaNs introduces bias into the estimator unless the missing-data mechanism is MCAR, so this assumption should be validated first.
+
+c. Imputation
+^^^^^^^^^^^^^
+
+Qolmat supports model selection among imputation algorithms. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified hole generator. It seems natural to create these masks according to the same missing-data mechanism as the one determined by the test. The documentation on using Qolmat for imputation model selection is available `here <https://qolmat.readthedocs.io/en/latest/#:~:text=How%20does%20Qolmat%20work%20%3F>`_.
+
+3. The MCAR Tests
+-----------------
+
+Several statistical tests exist to determine whether the missing-data mechanism is MCAR or MAR. Most of them are based on the notion of missing pattern.
+A missing pattern, also called pattern, is the structure of observed and missing values in a dataset. For example, for a dataset with 2 columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1 (0) indicates that the value in the column is missing (observed).
+
+The MCAR missing-data mechanism means that there is independence between the presence of holes and the observed values. In other words, the data distribution is the same for all patterns.
+
+a. Little's Test
+^^^^^^^^^^^^^^^^
+
+The best-known MCAR test is the :ref:`Little [2]<Little-article>` test. Keep in mind that Little's test is designed to test the homogeneity of means across the missing patterns and will not be effective at detecting heterogeneity of covariance across missing patterns.
+
+b. PKLM Test
+^^^^^^^^^^^^
+
+The :ref:`PKLM [3]<PKLM-article>` (Projected Kullback-Leibler MCAR) test compares the distributions of different missing patterns on random projections in the variable space of the data. This recent test applies to mixed-type data.
+
+References
+----------
+
+.. _Noonan-article:
+
+[1] Noonan, Jack, et al. `An integrated approach to test for missing not at random. <https://arxiv.org/abs/2208.07813>`_ arXiv preprint arXiv:2208.07813 (2022).
+
+.. _Little-article:
+
+[2] Little. `A Test of Missing Completely at Random for Multivariate Data with Missing Values. <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_ Journal of the American Statistical Association, Volume 83, 1988 - Issue 404.
+
+.. _PKLM-article:
+
+[3] Spohn, Meta-Lina, et al. `PKLM: A flexible MCAR test using Classification. <https://arxiv.org/abs/2109.10150>`_ arXiv preprint arXiv:2109.10150 (2021).
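
A minimal sketch of the missing-pattern notion used in section 3 of `docs/analysis.rst`. It is not taken from Qolmat: the helper `missing_patterns` and the toy rows are hypothetical, only the Python standard library is used, and `None` stands for a hole.

```python
from collections import Counter


def missing_patterns(rows):
    """Count the missing pattern of each row: 1 = missing, 0 = observed."""
    return Counter(tuple(int(value is None) for value in row) for row in rows)


# Hypothetical toy dataset with 2 columns; None marks a hole.
rows = [
    (0.3, 1.2),
    (None, 0.7),
    (1.5, None),
    (0.1, None),
    (None, None),
]

patterns = missing_patterns(rows)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

With two columns, at most the four patterns listed in the text can appear. An MCAR test then asks whether the observed values are distributed the same way within each of these groups of rows.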
@@ -83,7 +83,7 @@ RPCA engine
     :template: class.rst
 
     imputations.rpca.rpca_pcp.RPCAPCP
-    imputations.rpca.rpca_noisy.RPCANoisy
+    imputations.rpca.rpca_noisy.RpcaNoisy
 
 
 EM engine
 
@@ -38,7 +38,7 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implement
 
 **Noisy RPCA** [2, 3, 4]
 
-The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or  :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
 
 .. math::
    \text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p
 
@@ -1,13 +1,5 @@
 .. include:: ../README.rst
 
-.. toctree::
-   :maxdepth: 2
-   :hidden:
-   :caption: AUDIT
-
-   audit
-   examples/tutorials/plot_tuto_mcar_test
-
 .. toctree::
    :maxdepth: 2
    :hidden:
@@ -32,3 +24,11 @@
    :caption: API
 
    api
+
+.. toctree::
+   :maxdepth: 2
+   :hidden:
+   :caption: ANALYSIS
+
+   analysis
+   examples/tutorials/plot_tuto_mcar_test
@@ -199,7 +199,6 @@ plt.show()
 
 ```python
 %%time
-# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
 rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
 M, A = rpca_noisy.decompose(D, Omega)
 # imputed = X
 
@@ -282,7 +282,7 @@ def plot_cdf(
 
 
 # %%
-# d. Grouped Hole Generator
+# e. Grouped Hole Generator
 # ***************************************************************
 # The holes are generated according to the groups defined by the user.
 # This metohd is implemented in the
 
@@ -3,154 +3,154 @@
 Tutorial for testing the MCAR case
 ============================================
 
-In this tutorial, we show how to use the mcar test class and its methods.
-
-Keep in my mind that, at this moment, the mcar tests only handle tabular data.
+In this tutorial, we show how to test the MCAR case using Little's test.
 """
 # %%
 # First import some libraries
 from matplotlib import pyplot as plt
-import random
 
 import numpy as np
 import pandas as pd
+from scipy.stats import norm
+
+from qolmat.analysis.holes_characterization import LittleTest
+from qolmat.benchmark.missing_patterns import UniformHoleGenerator
 
-from qolmat.audit.holes_characterization import MCARTest
+plt.rcParams.update({"font.size": 12})
 
 # %%
 # 1. The Little's test
 # ---------------------------------------------------------------
-# How to use the Little's test ?
-# ==============================
-# When we deal with missing data in our dataset it's interesting to know the nature of these holes.
-# There exist three types of holes : MCAR, MAR and MNAR.
-# (see the: `Rubin's missing mechanism classification
-# <https://qolmat.readthedocs.io/en/latest/explanation.html>`_)
+# First, we need to introduce the concept of missing pattern. A missing pattern, also called
+# pattern, is the structure of observed and missing values in a dataset. For example, for a
+# dataset with 2 columns, the possible patterns are: (0, 0), (1, 0), (0, 1), (1, 1). The value 1
+# (0) indicates that the value in the column is missing (observed).
 #
-# The simplest case to test is the MCAR case. The most famous MCAR statistical test is the
-# `Little's test <https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722>`_.
-# Keep in mind that the Little's test is designed to test the homogeneity of means between the
-# missing patterns and won't be efficient to detect the heterogeneity of covariance between missing
-# patterns.
+# The null hypothesis, H0, is: "The means of the observations within each pattern are similar."
+# The alternative hypothesis, H1, is: "The means of the observed variables can vary across the
+# patterns."
 #
-# The null hypothesis, H0, is : "The data are MCAR". Against,
-# The alternative hypothesis : " The data are not MCAR, the means of the observed variables can
-# vary across the patterns"
+# If H0 is not rejected, we can assume that the missing data mechanism is MCAR. On the contrary,
+# if H0 is rejected, we can assume that it is not MCAR, i.e. MAR if MNAR has been ruled out.
 #
-# We choose to use the classic threshold, equal to 5%. If the test pval is below this threshold,
+# We choose to use the classic threshold, equal to 5%. If the test p-value is below this threshold,
 # we reject the null hypothesis.
 #
 # This notebook shows how the Little's test performs and its limitations.
 
-np.random.seed(11)
-
-mcartest = MCARTest(method="little")
+mcartest = LittleTest()
 
 # %%
-# Case 1 : Normal iid feature with MCAR holes
-# ===========================================
+# Case 1 : Normal iid features with MCAR holes
+# ============================================
 
+np.random.seed(42)
 matrix = np.random.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
-matrix.ravel()[np.random.choice(matrix.size, size=20, replace=False)] = np.nan
-matrix_masked = matrix[np.argwhere(np.isnan(matrix))]
-df_1 = pd.DataFrame(matrix)
+df = pd.DataFrame(data=matrix, columns=["Column_1", "Column_2"])
+
+hole_gen = UniformHoleGenerator(n_splits=1, random_state=42, subset=["Column_2"], ratio_masked=0.2)
+df_mask = hole_gen.generate_mask(df)
+df_unmasked = ~df_mask
+df_unmasked["Column_1"] = False
 
-plt_1 = plt.scatter(matrix[:, 0], matrix[:, 1])
-plt_2 = plt.scatter(matrix_masked[:, 0], matrix_masked[:, 1])
+df_observed = df.mask(df_mask).dropna()
+df_hidden = df.mask(df_unmasked).dropna(subset=["Column_2"])
+
+plt_1 = plt.scatter(df_observed.iloc[:, 0], df_observed.iloc[:, 1], label="Observed values")
+plt_2 = plt.scatter(df_hidden.iloc[:, 0], df_hidden.iloc[:, 1], label="Missing values")
 
 plt.legend(
-    (plt_1, plt_2),
-    ("observed_values", "masked_values"),
-    scatterpoints=1,
     loc="lower left",
-    ncol=1,
     fontsize=8,
 )
-
 plt.title("Case 1 : MCAR missingness mechanism")
-plt.xlabel("x values (all observed)")
-plt.ylabel("y values (with missing ones)")
-
 plt.show()
 
 # %%
 
-mcartest.test(df_1)
+mcartest.test(df.mask(df_mask))
 # %%
-# The p-value is quite high, therefore we don't reject H_0.
+# The p-value is quite high, therefore we don't reject H0.
 # We can then suppose that our missingness mechanism is MCAR.
 
 # %%
-# Case 2 : Normal iid feature with MAR holes
-# ==========================================
-np.random.seed(11)
+# Case 2 : Normal iid features with MAR holes
+# ===========================================
+np.random.seed(42)
+quantile_95 = norm.ppf(0.975)  # 97.5% quantile of N(0, 1), about 1.96 (two-sided 95% bound)
 
 matrix = np.random.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
-threshold = random.uniform(0, 1)
-matrix[np.argwhere(matrix[:, 0] >= 1.96), 1] = np.nan
-matrix_masked = matrix[np.argwhere(np.isnan(matrix))]
-df_2 = pd.DataFrame(matrix)
+df = pd.DataFrame(matrix, columns=["Column_1", "Column_2"])
+df_nan = df.copy()
+df_nan.loc[df_nan["Column_1"] > quantile_95, "Column_2"] = np.nan
+
+df_mask = df_nan.isna()
+df_unmasked = ~df_mask
+df_unmasked["Column_1"] = False
+
+df_observed = df.mask(df_mask).dropna()
+df_hidden = df.mask(df_unmasked).dropna(subset=["Column_2"])
 
-plt_1 = plt.scatter(matrix[:, 0], matrix[:, 1])
-plt_2 = plt.scatter(matrix_masked[:, 0], matrix_masked[:, 1])
+plt_1 = plt.scatter(df_observed.iloc[:, 0], df_observed.iloc[:, 1], label="Observed values")
+plt_2 = plt.scatter(df_hidden.iloc[:, 0], df_hidden.iloc[:, 1], label="Missing values")
 
 plt.legend(
-    (plt_1, plt_2),
-    ("observed_values", "masked_vlues"),
-    scatterpoints=1,
     loc="lower left",
-    ncol=1,
     fontsize=8,
 )
-
 plt.title("Case 2 : MAR missingness mechanism")
-plt.xlabel("x values (all observed)")
-plt.ylabel("y values (with missing ones)")
-
 plt.show()
 
 # %%
 
-mcartest.test(df_2)
+mcartest.test(df.mask(df_mask))
 # %%
 # The p-value is lower than the classic threshold (5%).
-# H_0 is then rejected and we can suppose that our missingness mechanism is MAR.
+# H0 is then rejected and we can suppose that our missingness mechanism is MAR.
 
 # %%
-# Case 3 : Normal iid feature MAR holes
-# =====================================
-# The specific case is design to emphasize the Little's test limits. In the case, we generate holes
-# when the value of the first feature is high. This missingness mechanism is clearly MAR but the
-# means between missing patterns is not statistically different.
+# Case 3 : Normal iid features with MAR holes
+# ===========================================
+# This specific case is designed to emphasize the limits of Little's test. In this case, we
+# generate holes when the absolute value of the first feature is high. This missingness mechanism
+# is clearly MAR but the means of the missing patterns are not statistically different.
 
-np.random.seed(11)
+np.random.seed(42)
 
 matrix = np.random.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
-matrix[np.argwhere(abs(matrix[:, 0]) >= 1.96), 1] = np.nan
-matrix_masked = matrix[np.argwhere(np.isnan(matrix))]
-df_3 = pd.DataFrame(matrix)
+df = pd.DataFrame(matrix, columns=["Column_1", "Column_2"])
+df_nan = df.copy()
+df_nan.loc[abs(df_nan["Column_1"]) > quantile_95, "Column_2"] = np.nan
 
-plt_1 = plt.scatter(matrix[:, 0], matrix[:, 1])
-plt_2 = plt.scatter(matrix_masked[:, 0], matrix_masked[:, 1])
+df_mask = df_nan.isna()
+df_unmasked = ~df_mask
+df_unmasked["Column_1"] = False
+
+df_observed = df.mask(df_mask).dropna()
+df_hidden = df.mask(df_unmasked).dropna(subset=["Column_2"])
+
+plt_1 = plt.scatter(df_observed.iloc[:, 0], df_observed.iloc[:, 1], label="Observed values")
+plt_2 = plt.scatter(df_hidden.iloc[:, 0], df_hidden.iloc[:, 1], label="Missing values")
 
 plt.legend(
-    (plt_1, plt_2),
-    ("observed_values", "masked_values"),
-    scatterpoints=1,
     loc="lower left",
-    ncol=1,
     fontsize=8,
 )
-
 plt.title("Case 3 : MAR missingness mechanism undetected by the Little's test")
-plt.xlabel("x values (all observed)")
-plt.ylabel("y values (with missing ones)")
-
 plt.show()
 
 # %%
 
-mcartest.test(df_3)
+mcartest.test(df.mask(df_mask))
 # %%
 # The p-value is higher than the classic threshold (5%).
-# H_0 is not rejected whereas the missingness mechanism is clearly MAR.
+# H0 is not rejected whereas the missingness mechanism is clearly MAR.
+
+# %%
+# Limitations
+# -----------
+# In this tutorial, we can see that Little's test fails to detect covariance heterogeneity between
+# patterns.
+#
+# There are other limitations. Little's test only handles quantitative data. Finally, the
+# MCAR tests can only handle tabular data (without correlation in time).
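
The limitation shown in Case 3 can be sketched without Qolmat. The snippet below is an illustrative assumption, not the tutorial's code: it uses only the standard library, a hypothetical helper `mean_where_masked`, and 5000 draws instead of 200 so that sampling noise stays small. It compares the mean of the first feature over the rows whose second feature would be masked, for the one-sided rule of Case 2 and the two-sided rule of Case 3.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(42)
threshold = NormalDist().inv_cdf(0.975)  # about 1.96, same bound as in the tutorial

# First feature only: both masking rules depend on it alone (MAR).
column_1 = [rng.gauss(0.0, 1.0) for _ in range(5000)]


def mean_where_masked(values, rule):
    """Mean of the first feature over the rows that `rule` would mask."""
    return mean(v for v in values if rule(v))


one_sided = mean_where_masked(column_1, lambda v: v > threshold)  # Case 2 rule
two_sided = mean_where_masked(column_1, lambda v: abs(v) > threshold)  # Case 3 rule

print(f"overall mean           : {mean(column_1):+.3f}")
print(f"masked rows, one-sided : {one_sided:+.3f}")
print(f"masked rows, two-sided : {two_sided:+.3f}")
```

The one-sided rule pushes the mean of the masked rows above the threshold, which a test on means can pick up. The two-sided rule keeps that mean close to the overall mean and changes only the spread, so a mean-based test such as Little's has little to detect.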