Commit 0b579e3

Julien Roussel authored and committed
Merge branch 'dev' of https://github.com/Quantmetry/qolmat into dev
2 parents 05417ca + 59c25cd commit 0b579e3

12 files changed, with 462 additions and 328 deletions

HISTORY.rst

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ History
 * Tutorial plot_tuto_categorical showcasing mixed type imputation
 * Titanic dataset added
 * accuracy metric implemented
+* metrics.py rationalized, and split with algebra.py

 0.1.3 (2024-03-07)
 ------------------

docs/imputers.rst

Lines changed: 17 additions & 13 deletions
@@ -3,24 +3,28 @@ Imputers

 All imputers can be found in the ``qolmat.imputations`` folder.

-1. Simple (mean/median/shuffle)
--------------------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerSimple` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/mode)
+----------------------------
+Imputes the missing values using a basic simple statistics: the mode (most frequent value) for the categorical columns, and the mean,median or mode (depending on the user parameter) for the numerical columns. See :class:`~qolmat.imputations.imputers.ImputerSimple`.

-2. LOCF
+2. Shuffle
+----------
+Imputes the missing values using a random value sampled in the same column. See :class:`~qolmat.imputations.imputers.ImputerShuffle`.
+
+3. LOCF
 -------
-Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
+Imputes the missing values using the last observation carried forward. See :class:`~qolmat.imputations.imputers.ImputerLOCF`.

-3. Time interpolation and TSA decomposition
+4. Time interpolation and TSA decomposition
 -------------------------------------------
-Imputes missing using some interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolate the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
+Imputes missing using some interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolate the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See :class:`~qolmat.imputations.imputers.ImputerResiduals`.


-4. MICE
+5. MICE
 -------
 Multiple Imputation by Chained Equation: multiple imputations based on ICE. It uses `IterativeImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer>`_. See the :class:`~qolmat.imputations.imputers.ImputerMICE` class.

-5. RPCA
+6. RPCA
 -------
 Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of PCA which allows to work with a data matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` containing missing values and grossly corrupted observations. We consider here the imputation task alone, but these methods can also tackle anomaly correction.

@@ -46,7 +50,7 @@ The class :class:`RpcaNoisy` implements an recommanded improved version, which r
 with :math:`\mathbf{E} = \mathbf{D} - \mathbf{M} - \mathbf{A}`.
 See the :class:`~qolmat.imputations.imputers.ImputerRpcaNoisy` class for implementation details.

-6. SoftImpute
+7. SoftImpute
 -------------
 SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11]. It is a faster alternative to RPCA, although it is much less robust due to the quadratic penalization. Given a matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` with observed entries indexed by the set :math:`\Omega`, this algorithm solves the following problem:

@@ -56,11 +60,11 @@ SoftImpute is an iterative method for matrix completion that uses nuclear-norm r
 The imputed values are then given by the matrix :math:`M=LQ` on the unobserved data.
 See the :class:`~qolmat.imputations.imputers.ImputerSoftImpute` class for implementation details.

-7. KNN
+8. KNN
 ------
 K-nearest neighbors, based on `KNNImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html>`_. See the :class:`~qolmat.imputations.imputers.ImputerKNN` class.

-8. EM sampler
+9. EM sampler
 -------------
 Imputes missing values via EM algorithm [5], and more precisely via MCEM algorithm [6]. See the :class:`~qolmat.imputations.imputers.ImputerEM` class.
 Suppose the data :math:`\mathbf{X}` has a density :math:`p_\theta` parametrized by some parameter :math:`\theta`. The EM algorithm allows to draw samples from this distribution by alternating between the expectation and maximization steps.
@@ -104,7 +108,7 @@ Two parametric distributions are implemented:
 * :class:`~qolmat.imputations.em_sampler.VARpEM`: [7]: :math:`\mathbf{X} \in \mathbb{R}^{n \times d} \sim VAR_p(\nu, B_1, ..., B_p)` is generated by a VAR(p) process such that :math:`X_t = \nu + B_1 X_{t-1} + ... + B_p X_{t-p} + u_t` where :math:`\nu \in \mathbb{R}^d` is a vector of intercept terms, the :math:`B_i \in \mathbb{R}^{d \times d}` are the lags coefficient matrices and :math:`u_t` is white noise nonsingular covariance matrix :math:`\Sigma_u \mathbb{R}^{d \times d}`, so that :math:`\theta = (\nu, B_1, ..., B_p, \Sigma_u)`.


-9. TabDDPM
+10. TabDDPM
 -----------

 :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [8] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:
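
For readers who want a concrete picture of the first three strategies in the renumbered list (simple statistics, shuffle, LOCF), here is a rough sketch in plain pandas and NumPy. It does not call qolmat and is not the library's implementation. The helper names `impute_simple`, `impute_shuffle` and `impute_locf`, and the toy frame `df_demo`, are hypothetical and only mirror the behaviour described in the documentation text of this diff.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Hypothetical helper: mean/median/mode for numerical columns, mode otherwise."""
    out = df.copy()
    for col in out.columns:
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if numeric and strategy in ("mean", "median"):
            fill = getattr(out[col], strategy)()
        else:
            modes = out[col].mode(dropna=True)
            if modes.empty:
                continue  # nothing observed in this column, leave it untouched
            fill = modes.iloc[0]
        out[col] = out[col].fillna(fill)
    return out


def impute_shuffle(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: fill each gap with a random observed value of the same column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        observed = out.loc[~mask, col].to_numpy()
        if mask.any() and observed.size > 0:
            out.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return out


def impute_locf(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: last observation carried forward (leading gaps stay missing)."""
    return df.ffill()


# Toy data, made up for illustration only.
df_demo = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0, np.nan], "c": ["a", None, "a", "b"]}
)
df_mean = impute_simple(df_demo, strategy="mean")
df_shuffled = impute_shuffle(df_demo)
df_locf = impute_locf(df_demo)
```

On this toy frame the numerical column is filled with its mean and the categorical column with its most frequent value, which matches the split between column types that the new "Simple" section describes.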

examples/benchmark.md

Lines changed: 5 additions & 20 deletions
@@ -16,9 +16,6 @@ jupyter:
 **This notebook aims to present the Qolmat repo through an example of a multivariate time series.
 In Qolmat, a few data imputation methods are implemented as well as a way to evaluate their performance.**

-```python
-
-```

 First, import some useful librairies

@@ -36,26 +33,18 @@ from IPython.display import Image
 import pandas as pd
 from datetime import datetime
 import numpy as np
-import scipy
 import hyperopt as ho
-from hyperopt.pyll.base import Apply as hoApply
 np.random.seed(1234)
-import pprint
 from matplotlib import pyplot as plt
-import matplotlib.image as mpimg
 import matplotlib.ticker as plticker

 tab10 = plt.get_cmap("tab10")
 plt.rcParams.update({'font.size': 18})

-from typing import Optional

 from sklearn.linear_model import LinearRegression
-from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, HistGradientBoostingRegressor

-
-import sys
-from qolmat.benchmark import comparator, missing_patterns, hyperparameters
+from qolmat.benchmark import comparator, missing_patterns
 from qolmat.imputations import imputers
 from qolmat.utils import data, utils, plot

@@ -240,12 +229,8 @@ dfs_imputed = {name: imp.fit_transform(df_plot) for name, imp in dict_imputers.i
 ```

 ```python tags=[]
-dfs_imputed["VAR_max"].groupby("station").min()
-```
-
-```python tags=[]
-# station = df_plot.index.get_level_values("station")[0]
-station = "Huairou"
+station = df_plot.index.get_level_values("station")[0]
+# station = "Huairou"
 df_station = df_plot.loc[station]
 dfs_imputed_station = {name: df_plot.loc[station] for name, df_plot in dfs_imputed.items()}
 ```
@@ -362,7 +347,7 @@ comparison = comparator.Comparator(
 )
 ```

-```python jupyter={"outputs_hidden": true} tags=[]
+```python tags=[]
 generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=3, groups=('station',), subset=cols_to_impute, ratio_masked=ratio_masked)

 comparison = comparator.Comparator(
@@ -393,7 +378,7 @@ plt.show()
 df_plot = df_data[cols_to_impute]
 ```

-```python jupyter={"outputs_hidden": true} tags=[]
+```python tags=[]
 dfs_imputed = {name: imp.fit_transform(df_plot) for name, imp in dict_imputers.items()}
 ```
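
One change in this file replaces the hard-coded station name with the first value of the `station` index level. The sketch below reproduces that lookup on a small made-up frame. The station name `StationA`, the dates and the column `TEMP` are illustrative assumptions, not the benchmark data; only `get_level_values` and `.loc` are taken from the notebook code.

```python
import pandas as pd

# Illustrative two-level index (station, date); values are made up.
index = pd.MultiIndex.from_product(
    [["StationA", "Huairou"], pd.date_range("2013-03-01", periods=3, freq="D")],
    names=["station", "date"],
)
df_plot = pd.DataFrame({"TEMP": [1.0, None, 3.0, 4.0, 5.0, None]}, index=index)

# Same expression as in the notebook: take the station of the first row.
station = df_plot.index.get_level_values("station")[0]

# Selecting on the first level drops it and leaves a frame indexed by date.
df_station = df_plot.loc[station]
```

Picking the station from the index keeps the notebook runnable on any dataset that has a `station` level, whereas the hard-coded `"Huairou"` only works when that station is present.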

examples/tutorials/plot_tuto_categorical.py

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@
 # - manage categorical features though one hot encoding
 # - manage missing features (native to the HistGradientBoosting)

-pipestimator = preprocessing.make_robust_MixteHGB(allow_new=False)
+pipestimator = preprocessing.make_robust_MixteHGB(avoid_new=True)
 imputer_hgb = ImputerRegressor(estimator=pipestimator, handler_nan="none")
 imputer_wrap_hgb = preprocessing.WrapperTransformer(imputer_hgb, bt)
6363
