
Commit 9ff8217

Merge pull request #130 from scikit-learn-contrib/regressor_cat
Regressor cat
2 parents 6c6cbf3 + c2b6cfc commit 9ff8217

28 files changed (+2410, -473 lines)

.flake8

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [flake8]
-exclude = .git,__pycache__,.vscode,tests
+exclude = .git,__pycache__,.vscode
 max-line-length=99
 ignore=E302,E305,W503,E203,E731,E402,E266,E712,F401,F821
 indent-size = 4

HISTORY.rst

Lines changed: 9 additions & 0 deletions
@@ -2,6 +2,15 @@
 History
 =======
 
+0.1.4 (2024-04-**)
+------------------
+
+* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
+* File preprocessing.py added with new classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer, providing tools to manage mixed-type data
+* Tutorial plot_tuto_categorical showcasing mixed-type imputation
+* Titanic dataset added
+* accuracy metric implemented
+
 0.1.3 (2024-03-07)
 ------------------
 
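For users upgrading from 0.1.3, here is a minimal before/after sketch of the imputer merge described in this changelog entry. It assumes ImputerSimple keeps the scikit-learn-style fit_transform interface of the classes it replaces; its exact constructor arguments are not shown in this diff and may differ.

```python
import numpy as np
import pandas as pd

from qolmat.imputations import imputers

# Toy frame with holes in both columns.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

# Before 0.1.4 (classes removed by this commit):
# imputer = imputers.ImputerMean()   # or ImputerMedian(), ImputerMode()

# From 0.1.4: one merged class (constructor arguments assumed, not taken from the diff).
imputer = imputers.ImputerSimple()
df_imputed = imputer.fit_transform(df)
print(df_imputed)
```
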
README.rst

Lines changed: 2 additions & 0 deletions
@@ -232,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 (`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)
 
+[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland. (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
+
 📝 License
 ==========
 
docs/api.rst

Lines changed: 38 additions & 21 deletions
@@ -4,8 +4,8 @@ Qolmat API
 
 .. currentmodule:: qolmat
 
-Imputers
-=========
+Imputers API
+============
 
 .. autosummary::
     :toctree: generated/
@@ -15,10 +15,8 @@ Imputers
     imputations.imputers.ImputerKNN
     imputations.imputers.ImputerInterpolation
     imputations.imputers.ImputerLOCF
-    imputations.imputers.ImputerMedian
-    imputations.imputers.ImputerMean
+    imputations.imputers.ImputerSimple
     imputations.imputers.ImputerMICE
-    imputations.imputers.ImputerMode
     imputations.imputers.ImputerNOCB
     imputations.imputers.ImputerOracle
     imputations.imputers.ImputerRegressor
@@ -28,17 +26,17 @@ Imputers
     imputations.imputers.ImputerSoftImpute
     imputations.imputers.ImputerShuffle
 
-Comparator
-===========
+Comparator API
+==============
 
 .. autosummary::
     :toctree: generated/
     :template: class.rst
 
     benchmark.comparator.Comparator
 
-Missing Patterns
-================
+Missing Patterns API
+====================
 
 .. autosummary::
     :toctree: generated/
@@ -51,8 +49,8 @@ Missing Patterns
     benchmark.missing_patterns.GroupedHoleGenerator
 
 
-Metrics
-=======
+Metrics API
+===========
 
 .. autosummary::
     :toctree: generated/
@@ -63,6 +61,7 @@ Metrics
     benchmark.metrics.mean_absolute_error
     benchmark.metrics.mean_absolute_percentage_error
     benchmark.metrics.weighted_mean_absolute_percentage_error
+    benchmark.metrics.accuracy
     benchmark.metrics.dist_wasserstein
     benchmark.metrics.kl_divergence
     benchmark.metrics.kolmogorov_smirnov_test
@@ -75,19 +74,19 @@ Metrics
     benchmark.metrics.pattern_based_weighted_mean_metric
 
 
-RPCA engine
-================
+RPCA engine API
+===============
 
 .. autosummary::
     :toctree: generated/
     :template: class.rst
 
-    imputations.rpca.rpca_pcp.RPCAPCP
-    imputations.rpca.rpca_noisy.RPCANoisy
+    imputations.rpca.rpca_pcp.RpcaPcp
+    imputations.rpca.rpca_noisy.RpcaNoisy
 
 
-EM engine
-================
+Expectation-Maximization engine API
+===================================
 
 .. autosummary::
     :toctree: generated/
@@ -96,8 +95,8 @@ EM engine
     imputations.em_sampler.MultiNormalEM
     imputations.em_sampler.VARpEM
 
-Diffusion engine
-================
+Diffusion Model engine API
+==========================
 
 .. autosummary::
     :toctree: generated/
@@ -107,9 +106,27 @@ Diffusion engine
     imputations.diffusions.ddpms.TabDDPM
     imputations.diffusions.ddpms.TsDDPM
 
+Preprocessing API
+=================
+
+.. autosummary::
+    :toctree: generated/
+    :template: class.rst
+
+    imputations.preprocessing.MixteHGBM
+    imputations.preprocessing.BinTransformer
+    imputations.preprocessing.OneHotEncoderProjector
+    imputations.preprocessing.WrapperTransformer
+
+.. autosummary::
+    :toctree: generated/
+    :template: function.rst
+
+    imputations.preprocessing.make_pipeline_mixte_preprocessing
+    imputations.preprocessing.make_robust_MixteHGB
 
-Utils
-================
+Utils API
+=========
 
 .. autosummary::
     :toctree: generated/
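The new Preprocessing API above targets mixed numerical/categorical data. The following is a hedged sketch of how the pieces might be combined, based only on the names listed in this diff: the zero-argument call to make_robust_MixteHGB and the `estimator` keyword on ImputerRegressor are assumptions, not confirmed by the commit.

```python
import pandas as pd

from qolmat.imputations.imputers import ImputerRegressor
from qolmat.imputations.preprocessing import make_robust_MixteHGB

# Mixed-type frame: a numerical and a categorical column, both with holes.
df = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 29.0],
        "sex": ["male", "female", None, "female"],
    }
)

# make_robust_MixteHGB is assumed to return a scikit-learn compatible estimator
# (MixteHGBM wrapped with the mixed-type preprocessing pipeline) that can be
# plugged into the regression-based imputer.
estimator = make_robust_MixteHGB()
imputer = ImputerRegressor(estimator=estimator)
df_imputed = imputer.fit_transform(df)
print(df_imputed)
```
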
docs/imputers.rst

Lines changed: 10 additions & 9 deletions
@@ -3,16 +3,16 @@ Imputers
 
 All imputers can be found in the ``qolmat.imputations`` folder.
 
-1. mean/median/shuffle
-----------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerMean`, :class:`~qolmat.imputations.imputers.ImputerMedian` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/shuffle)
+-------------------------------
+Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerSimple` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
 
 2. LOCF
 -------
 Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
 
-3. interpolation (on residuals)
--------------------------------
+3. Time interpolation and TSA decomposition
+-------------------------------------------
 Imputes missing using some interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolate the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
 
 
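The three strategies renamed above map onto plain pandas operations applied column by column; a minimal sketch using only pandas (the seasonal-residual variant used by ImputerResiduals is omitted for brevity):

```python
import numpy as np
import pandas as pd

# Hourly series with a few missing values.
index = pd.date_range("2024-01-01", periods=8, freq="h")
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 7.0, 8.0], index=index)

s_mean = s.fillna(s.mean())              # 1. Simple: column-wise mean
s_locf = s.ffill()                       # 2. LOCF: last observation carried forward
s_interp = s.interpolate(method="time")  # 3. interpolation, as delegated to pd.Series.interpolate

print(pd.DataFrame({"mean": s_mean, "locf": s_locf, "interp": s_interp}))
```
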
@@ -28,7 +28,7 @@ Two cases are considered.
 
 **RPCA via Principal Component Pursuit (PCP)** [1, 12]
 
-The class :class:`RPCAPCP` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
+The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
 
 .. math::
     \text{min}_{\mathbf{M} \in \mathbb{R}^{m \times n}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert P_\Omega(\mathbf{D-M}) \Vert_1
@@ -38,7 +38,7 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implement
 
 **Noisy RPCA** [2, 3, 4]
 
-The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
 
 .. math::
     \text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p
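A brief usage sketch of the renamed RpcaNoisy class, mirroring the constructor and decompose call that appear in the examples/RPCA.md change further down in this commit; the synthetic D and observation mask Omega are illustrative, and how unobserved entries of D should be encoded is an assumption here.

```python
import numpy as np

from qolmat.imputations.rpca.rpca_noisy import RpcaNoisy

rng = np.random.default_rng(42)

# Rank-2 signal corrupted by noise; Omega marks the observed entries.
D = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(100, 10))
Omega = rng.uniform(size=D.shape) > 0.2

# Same call pattern as in examples/RPCA.md below.
rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
M, A = rpca_noisy.decompose(D, Omega)  # low-rank part M, sparse anomalies A
```
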
@@ -91,6 +91,7 @@ We estimate the distribution parameter :math:`\theta` by likelihood maximization
 Once the parameter :math:`\theta^*` has been estimated the final data imputation can be done in two different ways, depending on the value of the argument `method`:
 
 * `mle`: Returns the maximum likelihood estimator
+
 .. math::
     X^* = \mathrm{argmax}_X L(X, \theta^*)
 
@@ -115,8 +116,8 @@ In training phase, we use the self-supervised learning method of [9] to train in
 
 In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed by using sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded by using a transformer-based architecture [9].
 
-References
-----------
+References (Imputers)
+---------------------
 
 [1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.
 
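The TsDDPM description above relies on a sliding-window partition of the dataset; below is a small, library-agnostic sketch of that preprocessing step (window size and stride are illustrative values, not qolmat defaults):

```python
import numpy as np

def sliding_windows(x: np.ndarray, size: int, stride: int = 1) -> np.ndarray:
    """Partition a (n_steps, n_features) array into overlapping windows."""
    starts = range(0, x.shape[0] - size + 1, stride)
    return np.stack([x[s : s + size] for s in starts])

x = np.arange(20, dtype=float).reshape(10, 2)   # 10 time steps, 2 features
windows = sliding_windows(x, size=4, stride=2)  # -> shape (4, 4, 2)
print(windows.shape)
```
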
docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@
 
    imputers
    examples/tutorials/plot_tuto_benchmark_TS
+   examples/tutorials/plot_tuto_categorical
    examples/tutorials/plot_tuto_diffusion_models
 
 .. toctree::

environment.dev.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ dependencies:
   - python=3.8
   - pip=23.0.1
   - scipy=1.10.1
-  - scikit-learn=1.2.2
+  - scikit-learn=1.3.2
   - sphinx=4.3.2
   - sphinx-gallery=0.10.1
   - sphinx_rtd_theme=1.0.0

examples/RPCA.md

Lines changed: 1 addition & 1 deletion
@@ -199,7 +199,7 @@ plt.show()
 
 ```python
 %%time
-# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
+# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
 rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
 M, A = rpca_noisy.decompose(D, Omega)
 # imputed = X
