HISTORY.rst (+9 lines changed: 9 additions & 0 deletions)

@@ -2,6 +2,15 @@
 History
 =======

+
+0.1.4 (2024-04-**)
+------------------
+
+* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
+* File preprocessing.py added with new classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer, providing tools to manage mixed-type data
+* Tutorial plot_tuto_categorical showcasing mixed-type imputation
[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland, (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
docs/imputers.rst (+10 −9 lines changed: 10 additions & 9 deletions)

@@ -3,16 +3,16 @@ Imputers

 All imputers can be found in the ``qolmat.imputations`` folder.

-1. mean/median/shuffle
-----------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerMean`, :class:`~qolmat.imputations.imputers.ImputerMedian` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/shuffle)
+-------------------------------
+Imputes the missing values using the mean/median along each column, or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerSimple` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.

 2. LOCF
 -------
 Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.

-3. interpolation (on residuals)
--------------------------------
+3. Time interpolation and TSA decomposition
+-------------------------------------------
 Imputes missing values using interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_, column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with a clear seasonal decomposition, we can interpolate the residuals instead of the raw data: series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then re-seasonalised, again column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.

@@ -28,7 +28,7 @@ Two cases are considered.

 **RPCA via Principal Component Pursuit (PCP)** [1, 12]

-The class :class:`RPCAPCP` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
+The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}`, where :math:`\mathbf{M}` is low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
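For intuition, the PCP decomposition can be computed with a short augmented-Lagrangian loop alternating singular value thresholding on :math:`\mathbf{M}` and entrywise soft thresholding on :math:`\mathbf{A}`. A minimal numpy sketch, not the qolmat implementation; the defaults for :math:`\mu` and :math:`\lambda` follow the common PCP heuristics:

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    """Entrywise soft thresholding: prox of tau * l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca_pcp(D, n_iter=200, lam=None, mu=None):
    """Decompose D into low-rank M plus sparse A via an ALM loop."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else m * n / (4.0 * np.abs(D).sum())
    M = np.zeros_like(D)
    A = np.zeros_like(D)
    Y = np.zeros_like(D)  # dual variable enforcing D = M + A
    for _ in range(n_iter):
        M = svt(D - A + Y / mu, 1.0 / mu)
        A = soft(D - M + Y / mu, lam / mu)
        Y = Y + mu * (D - M - A)
    return M, A

# Rank-2 matrix corrupted by a few large spikes
rng = np.random.default_rng(0)
L = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 20))
S = np.zeros((20, 20))
S[3, 4], S[15, 7] = 10.0, -8.0
M, A = rpca_pcp(L + S)
```

After convergence, :math:`\mathbf{M}` recovers the low-rank part and :math:`\mathbf{A}` concentrates on the spikes.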
@@ -38,7 +38,7 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implement

 **Noisy RPCA** [2, 3, 4]

-The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert\mathbf{MH_k} \Vert_p` is either :math:`\Vert\mathbf{MH_k} \Vert_1` or :math:`\Vert\mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by :math:`\eta_k` and :math:`H_k`. Writing :math:`\Vert\mathbf{MH_k} \Vert_p` for either :math:`\Vert\mathbf{MH_k} \Vert_1` or :math:`\Vert\mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
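The time-consistency term can be read as penalising differences between columns of :math:`\mathbf{M}` at lag :math:`k`: one common construction of :math:`H_k` makes :math:`\mathbf{MH_k}` collect the differences :math:`M_{:,t} - M_{:,t+k}`. A small sketch of such an operator; the exact definition used by :class:`RpcaNoisy` may differ:

```python
import numpy as np

def lag_matrix(n, k):
    """H_k with columns e_t - e_{t+k}: M @ H_k stacks M[:, t] - M[:, t + k]."""
    H = np.zeros((n, n - k))
    for t in range(n - k):
        H[t, t] = 1.0
        H[t + k, t] = -1.0
    return H

M = np.arange(12.0).reshape(3, 4)  # rows = features, columns = time steps
H1 = lag_matrix(4, 1)
# Each column of M @ H1 is the difference between consecutive time steps,
# so penalising its norm discourages abrupt temporal changes in M
D1 = M @ H1
```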
@@ -91,6 +91,7 @@ We estimate the distribution parameter :math:`\theta` by likelihood maximization

 Once the parameter :math:`\theta^*` has been estimated, the final data imputation can be done in two different ways, depending on the value of the argument `method`:

 * `mle`: Returns the maximum likelihood estimator
+
 .. math::
     X^* = \mathrm{argmax}_X L(X, \theta^*)

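Under a multivariate Gaussian fit :math:`\theta^* = (\mu, \Sigma)`, this argmax has a closed form: the conditional mean of the missing entries given the observed ones. A numpy sketch; the helper name is illustrative, not part of the qolmat API:

```python
import numpy as np

def impute_mle_gaussian(x, mu, cov):
    """Fill NaNs in x with the Gaussian likelihood maximiser, i.e. the
    conditional mean of the missing entries given the observed ones."""
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    x_imp = x.copy()
    # E[x_miss | x_obs] = mu_m + Sigma_mo Sigma_oo^{-1} (x_obs - mu_o)
    cov_mo = cov[np.ix_(miss, obs)]
    cov_oo = cov[np.ix_(obs, obs)]
    x_imp[miss] = mu[miss] + cov_mo @ np.linalg.solve(cov_oo, x[obs] - mu[obs])
    return x_imp

mu = np.zeros(2)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
out = impute_mle_gaussian([1.0, np.nan], mu, cov)
```

With correlation 0.9 and an observed value of 1.0, the missing coordinate is pulled to 0.9 rather than to the marginal mean 0.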
@@ -115,8 +116,8 @@ In training phase, we use the self-supervised learning method of [9] to train in

 In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. The dataset is pre-processed using a sliding-window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [9].

-References
-----------
+References (Imputers)
+---------------------

 [1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.
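The sliding-window pre-processing mentioned for :class:`TsDDPM` can be sketched in a few lines; the window size here is an arbitrary assumption for illustration:

```python
import numpy as np

def sliding_windows(X, size):
    """Partition a (T, d) series into overlapping (size, d) windows."""
    return np.stack([X[t:t + size] for t in range(X.shape[0] - size + 1)])

X = np.arange(20.0).reshape(10, 2)  # T=10 time steps, d=2 features
W = sliding_windows(X, size=4)      # (7, 4, 2): one partition per start index
```

Each partition gives the model a local temporal context, so the noise prediction at a time step can attend to the preceding steps in the same window.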