
Commit 7aff16f

Julien Roussel authored and committed
documentation updated
1 parent 3b7d4a9 commit 7aff16f

File tree

18 files changed: +1708 -569 lines changed

README.rst

Lines changed: 2 additions & 0 deletions

@@ -232,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 (`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)
 
+[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022). In: International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland. (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
+
 📝 License
 ==========
 
docs/api.rst

Lines changed: 38 additions & 21 deletions

@@ -4,8 +4,8 @@ Qolmat API
 
 .. currentmodule:: qolmat
 
-Imputers
-=========
+Imputers API
+============
 
 .. autosummary::
     :toctree: generated/
@@ -15,10 +15,8 @@ Imputers
     imputations.imputers.ImputerKNN
     imputations.imputers.ImputerInterpolation
     imputations.imputers.ImputerLOCF
-    imputations.imputers.ImputerMedian
-    imputations.imputers.ImputerMean
+    imputations.imputers.ImputerSimple
     imputations.imputers.ImputerMICE
-    imputations.imputers.ImputerMode
     imputations.imputers.ImputerNOCB
     imputations.imputers.ImputerOracle
     imputations.imputers.ImputerRegressor
@@ -28,17 +26,17 @@ Imputers
     imputations.imputers.ImputerSoftImpute
     imputations.imputers.ImputerShuffle
 
-Comparator
-===========
+Comparator API
+==============
 
 .. autosummary::
     :toctree: generated/
     :template: class.rst
 
     benchmark.comparator.Comparator
 
-Missing Patterns
-================
+Missing Patterns API
+====================
 
 .. autosummary::
     :toctree: generated/
@@ -51,8 +49,8 @@ Missing Patterns
     benchmark.missing_patterns.GroupedHoleGenerator
 
 
-Metrics
-=======
+Metrics API
+===========
 
 .. autosummary::
     :toctree: generated/
@@ -63,6 +61,7 @@ Metrics
     benchmark.metrics.mean_absolute_error
     benchmark.metrics.mean_absolute_percentage_error
     benchmark.metrics.weighted_mean_absolute_percentage_error
+    benchmark.metrics.accuracy
     benchmark.metrics.dist_wasserstein
     benchmark.metrics.kl_divergence
     benchmark.metrics.kolmogorov_smirnov_test
@@ -75,19 +74,19 @@ Metrics
     benchmark.metrics.pattern_based_weighted_mean_metric
 
 
-RPCA engine
-================
+RPCA engine API
+===============
 
 .. autosummary::
     :toctree: generated/
     :template: class.rst
 
-    imputations.rpca.rpca_pcp.RPCAPCP
-    imputations.rpca.rpca_noisy.RPCANoisy
+    imputations.rpca.rpca_pcp.RpcaPcp
+    imputations.rpca.rpca_noisy.RpcaNoisy
 
 
-EM engine
-================
+Expectation-Maximization engine API
+===================================
 
 .. autosummary::
     :toctree: generated/
@@ -96,8 +95,8 @@ EM engine
     imputations.em_sampler.MultiNormalEM
     imputations.em_sampler.VARpEM
 
-Diffusion engine
-================
+Diffusion Model engine API
+==========================
 
 .. autosummary::
     :toctree: generated/
@@ -107,9 +106,27 @@ Diffusion engine
     imputations.diffusions.ddpms.TabDDPM
     imputations.diffusions.ddpms.TsDDPM
 
+Preprocessing API
+=================
+
+.. autosummary::
+    :toctree: generated/
+    :template: class.rst
+
+    imputations.preprocessing.MixteHGBM
+    imputations.preprocessing.BinTransformer
+    imputations.preprocessing.OneHotEncoderProjector
+    imputations.preprocessing.WrapperTransformer
+
+.. autosummary::
+    :toctree: generated/
+    :template: function.rst
+
+    imputations.preprocessing.make_pipeline_mixte_preprocessing
+    imputations.preprocessing.make_robust_MixteHGB
 
-Utils
-================
+Utils API
+=========
 
 .. autosummary::
     :toctree: generated/
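
As a rough orientation for the new ``imputations.preprocessing`` helpers listed above, here is a hedged sketch of how they might be wired together. Only the names come from this commit; the argument-free constructors, the scikit-learn-style ``fit_transform`` interface, and the toy DataFrame are assumptions, not taken from the library's documentation.

```python
# Hedged sketch for the new qolmat.imputations.preprocessing helpers.
# Only the names below appear in this commit; the call signatures (no arguments)
# and the scikit-learn-style fit_transform interface are assumptions.
import numpy as np
import pandas as pd
from qolmat.imputations import preprocessing

# Toy mixed-type frame with missing values (invented for illustration).
df = pd.DataFrame(
    {
        "temperature": [12.1, np.nan, 14.3, 15.0],
        "city": ["Paris", "Lyon", None, "Paris"],
    }
)

# Assumed to return a scikit-learn compatible preprocessing pipeline
# handling mixed numerical/categorical columns.
pipe = preprocessing.make_pipeline_mixte_preprocessing()
X = pipe.fit_transform(df)

# Assumed to return a robust, HistGradientBoosting-based estimator for mixed data,
# which could presumably be plugged into ImputerRegressor as its underlying model.
model = preprocessing.make_robust_MixteHGB()
```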

docs/imputers.rst

Lines changed: 10 additions & 9 deletions

@@ -3,16 +3,16 @@ Imputers
 
 All imputers can be found in the ``qolmat.imputations`` folder.
 
-1. mean/median/shuffle
-----------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerMean`, :class:`~qolmat.imputations.imputers.ImputerMedian` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/shuffle)
+-------------------------------
+Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerSimple` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
 
 2. LOCF
 -------
 Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
 
-3. interpolation (on residuals)
--------------------------------
+3. Time interpolation and TSA decomposition
+-------------------------------------------
 Imputes missing values using the interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_, column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When the data are temporal with a clear seasonal decomposition, we can interpolate on the residuals instead of interpolating the raw data directly. Series are de-seasonalised with `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then the series are re-seasonalised. This is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
 
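To make the renamed interface concrete, here is a small sketch. The constructor calls mirror those visible in examples/benchmark.md and examples/tutorials/plot_tuto_benchmark_TS.py further down in this commit, while the toy DataFrame and the ``fit_transform`` calls are illustrative assumptions.

```python
# Sketch based on the constructor calls shown later in this commit
# (examples/benchmark.md, examples/tutorials/plot_tuto_benchmark_TS.py);
# the toy data and the fit_transform calls are assumptions for illustration.
import numpy as np
import pandas as pd
from qolmat.imputations import imputers

df = pd.DataFrame(
    {
        "station": ["A", "A", "A", "B", "B", "B"],
        "TEMP": [12.0, np.nan, 14.0, 20.0, 21.0, np.nan],
    }
).set_index("station")

# ImputerSimple replaces the former ImputerMean/ImputerMedian/ImputerMode;
# the statistic is selected through the `strategy` argument.
imputer_median = imputers.ImputerSimple(groups=("station",), strategy="median")
imputer_interpol = imputers.ImputerInterpolation(groups=("station",), method="linear")

df_median = imputer_median.fit_transform(df)      # per-station median imputation (assumed API)
df_interpol = imputer_interpol.fit_transform(df)  # per-station linear interpolation (assumed API)
```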

@@ -28,7 +28,7 @@ Two cases are considered.
 
 **RPCA via Principal Component Pursuit (PCP)** [1, 12]
 
-The class :class:`RPCAPCP` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
+The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
 
 .. math::
     \text{min}_{\mathbf{M} \in \mathbb{R}^{m \times n}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert P_\Omega(\mathbf{D-M}) \Vert_1
@@ -38,7 +38,7 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implement
 
 **Noisy RPCA** [2, 3, 4]
 
-The class :class:`RPCANoisy` implements a recommended, improved version, which relies on the decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalisation for time series, parametrised by :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended, improved version, which relies on the decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalisation for time series, parametrised by :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
 
 .. math::
     \text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p
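For a concrete feel of this decomposition, here is a minimal sketch following the renamed API exactly as it is called in examples/RPCA.md later in this commit; only the synthetic data matrix and observation mask are invented.

```python
# Minimal sketch of the noisy RPCA decomposition D = M + A (+ noise), following the
# call pattern in examples/RPCA.md of this commit; D and Omega are synthetic here.
import numpy as np
from qolmat.imputations.rpca.rpca_noisy import RpcaNoisy

rng = np.random.default_rng(42)
D = rng.normal(size=(100, 10))            # observed data matrix (pure noise, for illustration)
Omega = rng.uniform(size=D.shape) > 0.2   # boolean mask of observed entries

rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
M, A = rpca_noisy.decompose(D, Omega)     # M: low-rank part, A: sparse anomalies
```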
@@ -91,6 +91,7 @@ We estimate the distribution parameter :math:`\theta` by likelihood maximization
 Once the parameter :math:`\theta^*` has been estimated, the final data imputation can be done in two different ways, depending on the value of the argument `method`:
 
 * `mle`: Returns the maximum likelihood estimator
+
 .. math::
     X^* = \mathrm{argmax}_X L(X, \theta^*)
 
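As a hedged illustration of the `mle` option: the class path comes from the API section of this commit and the `method` argument from the paragraph above, but the ``fit_transform`` interface and the NaN encoding of missing values are assumptions.

```python
# Hedged sketch: choosing the maximum-likelihood imputation on the EM sampler.
# The class path and method="mle" come from this documentation; the fit_transform
# interface and the NaN encoding of missing values are assumptions.
import numpy as np
from qolmat.imputations.em_sampler import MultiNormalEM

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=200)
X[rng.uniform(size=X.shape) < 0.1] = np.nan  # hide 10% of the entries

em = MultiNormalEM(method="mle")  # X* = argmax_X L(X, theta*) once theta* is estimated
X_imputed = em.fit_transform(X)   # assumed scikit-learn-style interface
```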
@@ -115,8 +116,8 @@ In training phase, we use the self-supervised learning method of [9] to train in
 
 In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. The dataset is pre-processed with a sliding-window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [9].
 
-References
-----------
+References (Imputers)
+---------------------
 
 [1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.
 
docs/index.rst

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@
 
    imputers
    examples/tutorials/plot_tuto_benchmark_TS
+   examples/tutorials/plot_tuto_categorical
    examples/tutorials/plot_tuto_diffusion_models
 
 .. toctree::

examples/RPCA.md

Lines changed: 1 addition & 1 deletion

@@ -199,7 +199,7 @@ plt.show()
 
 ```python
 %%time
-# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
+# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
 rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
 M, A = rpca_noisy.decompose(D, Omega)
 # imputed = X

examples/benchmark.md

Lines changed: 3 additions & 3 deletions

@@ -122,9 +122,9 @@ ratio_masked = 0.1
 ```python tags=[]
 dict_config_opti = {}
 
-imputer_mean = imputers.ImputerMean(groups=("station",))
-imputer_median = imputers.ImputerMedian(groups=("station",))
-imputer_mode = imputers.ImputerMode(groups=("station",))
+imputer_mean = imputers.ImputerSimple(groups=("station",), strategy="mean")
+imputer_median = imputers.ImputerSimple(groups=("station",), strategy="median")
+imputer_mode = imputers.ImputerSimple(groups=("station",), strategy="most_frequent")
 imputer_locf = imputers.ImputerLOCF(groups=("station",))
 imputer_nocb = imputers.ImputerNOCB(groups=("station",))
 imputer_interpol = imputers.ImputerInterpolation(groups=("station",), method="linear")

examples/tutorials/plot_tuto_benchmark_TS.py

Lines changed: 2 additions & 2 deletions

@@ -61,7 +61,7 @@
 plt.show()
 
 # %%
-# 2. Imputation methods
+# 2. Time series imputation methods
 # ---------------------------------------------------------------
 # All presented methods are group-wise: here each station is imputed independently.
 # For example ImputerMean computes the mean of each variable in each station and uses
@@ -78,7 +78,7 @@
 
 ratio_masked = 0.1
 
-imputer_median = imputers.ImputerMedian(groups=("station",))
+imputer_median = imputers.ImputerSimple(groups=("station",), strategy="median")
 imputer_interpol = imputers.ImputerInterpolation(groups=("station",), method="linear")
 imputer_residuals = imputers.ImputerResiduals(
     groups=("station",),
