README.rst (4 additions, 4 deletions)
@@ -47,7 +47,7 @@ Qolmat can be installed in different ways:
 .. code:: sh

     $ pip install qolmat # installation via `pip`
-    $ pip install qolmat[tensorflow] # if you need tensorflow
+    $ pip install qolmat[pytorch] # if you need ImputerDiffusion relying on pytorch
     $ pip install git+https://github.com/Quantmetry/qolmat # or directly from the github repository

 ⚡️ Quickstart
@@ -105,8 +105,8 @@ The full documentation can be found `on this link <https://qolmat.readthedocs.io
 **How does Qolmat work?**

-Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
-1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_,
+|Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
+1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_.
 2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user specified `performance metrics <explanation.html#metrics>`_.
 3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
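
For orientation, here is a minimal sketch of what these three steps look like in code. It is modeled on Qolmat's quickstart; the imputer classes, the ``UniformHoleGenerator`` parameters, and the exact ``Comparator`` signature are assumptions based on that quickstart and may differ between versions.

.. code:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers

    # Candidate imputers to compare.
    dict_imputers = {
        "mean": imputers.ImputerMean(),
        "interpolate": imputers.ImputerInterpolation(method="linear"),
        "regressor": imputers.ImputerRegressor(estimator=LinearRegression()),
    }

    # Step 1: the hole generator masks observed values on each of the K folds.
    generator_holes = missing_patterns.UniformHoleGenerator(n_splits=4, ratio_masked=0.1)

    # Steps 2 and 3: impute, score each fold, then pool the metrics per imputer.
    comparison = comparator.Comparator(
        dict_imputers,
        ["col1", "col2"],  # hypothetical columns to impute
        generator_holes=generator_holes,
        metrics=["mae", "wmape", "kl_columnwise"],
    )

    df_data = pd.read_csv("data.csv")  # hypothetical dataset with missing values
    results = comparison.compare(df_data)
    print(results)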
@@ -117,7 +117,7 @@ This is very similar in spirit to the `cross_val_score <https://scikit-learn.org
 **Imputation methods**

-The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
+The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic). For further details regarding the distinction between single and multiple imputation, you can refer to the `Imputation article <https://en.wikipedia.org/wiki/Imputation_(statistics)>`_ on Wikipedia.
docs/explanation.rst (7 additions, 4 deletions)
@@ -99,7 +99,7 @@ We compute the associated complete dataset :math:`\hat{X}^{(k)}` for the partial
 -----------------

 Evaluating the imputers requires generating holes that are representative of the holes at hand.
-The missingness mechanisms have been classified by Rubin [1] into MCAR, MAR and MNAR.
+The missingness mechanisms have been classified by :ref:`Rubin [1]<rubin-article>` into MCAR, MAR and MNAR.

 Suppose we have :math:`X_{obs}`, a subset of a complete data model :math:`X = (X_{obs}, X_{mis})`, which is not fully observable (:math:`X_{mis}` is the missing part).
 We define the matrix :math:`M` such that :math:`M_{ij}=1` if :math:`X_{ij}` is missing, and 0 otherwise, and we assume the distribution of :math:`M` is parametrised by :math:`\psi`.
@@ -108,14 +108,14 @@ The observations are said to be Missing Completely at Random (MCAR) if the proba
-Finally, the observations are said to be Missing Not at Random (MNAR) in all other cases, i.e. if P(M | X_{obs}, X_{mis}, \psi) does not simplify.
+Finally, the observations are said to be Missing Not at Random (MNAR) in all other cases, i.e. if :math:`P(M | X_{obs}, X_{mis}, \psi)` does not simplify.

 Qolmat allows generating new missing values on an existing dataset, but only in the MCAR case.
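
To make the MCAR case concrete, here is a small self-contained sketch in plain NumPy (not Qolmat's API; the function name and masking ratio are illustrative). The mask :math:`M` is drawn independently of both the observed and the missing values, which is exactly the MCAR assumption.

.. code:: python

    import numpy as np

    def mcar_mask(X: np.ndarray, ratio_masked: float = 0.1, seed: int = 0) -> np.ndarray:
        """Return a boolean mask M with M[i, j] = True where X[i, j] gets masked.

        MCAR: every entry is masked with the same probability, independently
        of the values of X, so P(M | X_obs, X_mis, psi) = P(M | psi).
        """
        rng = np.random.default_rng(seed)
        return rng.uniform(size=X.shape) < ratio_masked

    X = np.arange(12, dtype=float).reshape(4, 3)
    M = mcar_mask(X, ratio_masked=0.25)
    X_holes = np.where(M, np.nan, X)  # masked entries become NaN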
@@ -140,4 +140,7 @@ Qolmat can be used to search for hyperparameters in imputation functions. Let sa
 References
 ----------
-[1] Rubin, Donald B. `Inference and missing data. <https://www.math.wsu.edu/faculty/xchen/stat115/lectureNotes3/Rubin%20Inference%20and%20Missing%20Data.pdf>`_ Biometrika 63.3 (1976): 581-592.
+
+.. _rubin-article:
+
+[1] Rubin, Donald B. `Inference and missing data. <https://www.math.wsu.edu/faculty/xchen/stat115/lectureNotes3/Rubin%20Inference%20and%20Missing%20Data.pdf>`_ Biometrika 63.3 (1976): 581-592.
docs/imputers.rst (2 additions, 2 deletions)
@@ -98,14 +98,14 @@ Two parametric distributions are implemented:
 9. TabDDPM
 -----------

-:class:`qolmat.diffusions.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [7] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:
+:class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [7] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:

 * The forward process perturbs the observed data into noise until all of the original data structure is lost. The perturbation is done over a series of steps. Let :math:`X_{obs}` be the observed data and :math:`T` the number of steps over which noise :math:`\epsilon\sim\mathcal{N}(0,I)` is added to the observed data. Then :math:`X_{obs}^t = \sqrt{\bar{\alpha}_t} \times X_{obs} + \sqrt{1-\bar{\alpha}_t} \times\epsilon`, where :math:`\bar{\alpha}_t` controls the amount of noise, as sketched in code below.
 * The reverse process removes noise and reconstructs the observed data. At each step :math:`t`, we train an autoencoder :math:`\epsilon_\theta` based on ResNet [9] to predict the added noise :math:`\epsilon_t` from the rest of the observed data. The objective function is the error between the noise added in the forward process and the noise predicted by :math:`\epsilon_\theta`.

 In the training phase, we use the self-supervised learning method of [8] to train on incomplete data. In detail, our model randomly masks a part of the observed data and computes the loss on these masked data. In the inference phase, (1) missing data are replaced by Gaussian noise :math:`\epsilon\sim\mathcal{N}(0,I)`, and (2) at each noise step from :math:`T` down to 0, our model denoises the missing data based on :math:`\epsilon_\theta`.
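
As an illustration of the forward process only, here is a minimal NumPy sketch of the closed form :math:`X_{obs}^t = \sqrt{\bar{\alpha}_t} \times X_{obs} + \sqrt{1-\bar{\alpha}_t} \times\epsilon`. The linear beta schedule and its constants are common DDPM defaults, not necessarily the values used inside TabDDPM.

.. code:: python

    import numpy as np

    # A common DDPM choice: linear beta schedule, with alpha_bar_t the
    # cumulative product of (1 - beta_s) for s <= t.
    T = 1000
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)

    def forward_noise(x_obs: np.ndarray, t: int, rng: np.random.Generator) -> np.ndarray:
        """Jump straight to step t: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
        eps = rng.standard_normal(x_obs.shape)
        return np.sqrt(alpha_bar[t]) * x_obs + np.sqrt(1.0 - alpha_bar[t]) * eps

    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((8, 3))              # toy batch of observed rows
    x_mid = forward_noise(x0, t=T // 2, rng=rng)  # partly noised
    x_end = forward_noise(x0, t=T - 1, rng=rng)   # close to pure Gaussian noise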
-In the case of time-series data, we also propose :class:`qolmat.diffusions.TabDDPMTS` (built on top of :class:`qolmat.diffusions.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed by using sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded by using a transformer-based architecture [8].
+In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed by using sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded by using a transformer-based architecture [8].
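
The sliding-window pre-processing can be pictured with a short sketch; the window size and stride below are arbitrary illustrations, not TsDDPM's actual parameters.

.. code:: python

    import numpy as np

    def sliding_windows(X: np.ndarray, size: int, stride: int = 1) -> np.ndarray:
        """Cut a (n_steps, n_features) series into overlapping (size, n_features) partitions."""
        starts = range(0, X.shape[0] - size + 1, stride)
        return np.stack([X[s:s + size] for s in starts])

    X = np.arange(20, dtype=float).reshape(10, 2)   # 10 time steps, 2 features
    windows = sliding_windows(X, size=4, stride=2)  # shape: (4, 4, 2)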