Changes from all commits
80 commits
All commits by jsoref, Nov 16, 2025:
e5cdd9a  spelling: , and
531a128  spelling: ; otherwise,
6790cad  spelling: a
5843aa3  spelling: access
3620bc9  spelling: across
3906516  spelling: additional
184d0fc  spelling: address
1842187  spelling: alternative
084a365  spelling: an
e955c4d  spelling: approaches
de1f225  spelling: are
2842132  spelling: array-like
45909a3  spelling: at
293f0f7  spelling: augmented
998fe9e  spelling: between
75e111c  spelling: bias
eb8f22b  spelling: building
4d4d47e  spelling: class
23db951  spelling: columns
c129699  spelling: compute
82a15ea  spelling: conditional
544dcb8  spelling: conditionally
a45cb6d  spelling: conditioning
d613f6a  spelling: conjugate
83672b9  spelling: consistent
fe72ca2  spelling: criteria
6d2c67c  spelling: dataframes
fbb2048  spelling: dataset
1a4522b  spelling: datetime
419f792  spelling: default
e907399  spelling: dictionary
b58d27b  spelling: different
9b669c8  spelling: distribution
ea57562  spelling: element-wise
5894c63  spelling: estimated
79009af  spelling: explanation
0cb129c  spelling: function
da9ee52  spelling: globally
7995000  spelling: hyperparameters
09c5b5c  spelling: id
33efcac  spelling: implementation
e03c1a9  spelling: imputation
7847127  spelling: independent
dac6805  spelling: kullback
b209d8f  spelling: libraries
fd85f13  spelling: matrix
b7192d5  spelling: method
3f06262  spelling: multi
181697c  spelling: original
469733d  spelling: percentage
7aa1049  spelling: performance
6b0e5ba  spelling: perturbation
1c6d12a  spelling: perturbed
afb87da  spelling: practice
4d13296  spelling: pressure
e418a1e  spelling: pretreated
e608b94  spelling: probability
7ffc8e6  spelling: recommended
e7e2083  spelling: refactor
9cdd5af  spelling: reproducibility
89b3300  spelling: results
4a5e402  spelling: returned
a8bb53b  spelling: returns
68062e5  spelling: seasonal
54654d4  spelling: series
67d6ac4  spelling: shrunk
dc81648  spelling: split
b97baa7  spelling: stopping
a3192c3  spelling: supported
0bc1c81  spelling: temporal
7fcd3cc  spelling: the
9e480d5  spelling: transformers
8789a6a  spelling: transition
323e454  spelling: tutorial
a452c93  spelling: update
cbafb26  spelling: useful
c517964  spelling: variables
050f8d2  spelling: whether or not
5d881eb  spelling: while
2939549  link: scikit-learn API
2 changes: 1 addition & 1 deletion CONTRIBUTING.rst
@@ -46,7 +46,7 @@ Documenting your change
-----------------------

If you're adding a class or a function, then you'll need to add a docstring with a doctest. We follow the `numpy docstring convention <https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html>`_, so please do too.
Any estimator should follow the [scikit-learn API](https://scikit-learn.org/stable/developers/develop.html), so please follow these guidelines.
Any estimator should follow the `scikit-learn API <https://scikit-learn.org/stable/developers/develop.html>`_, so please follow these guidelines.
Comment from the PR author (jsoref):
I saw this while I was creating the PR -- the old notation is Markdown, but this file is RST.


Updating changelog
------------------
10 changes: 5 additions & 5 deletions HISTORY.rst
@@ -5,7 +5,7 @@ History
0.1.10 (2024-??-??)
------------------
* Long EM and RPCA operations wrapped with tqdm progress bars
* Readme code sample updated, and results table made consistant
* Readme code sample updated, and results table made consistent

0.1.9 (2024-08-29)
------------------
@@ -41,7 +41,7 @@ History
* RPCA algorithms now start with a normalizing scaler
* The EM algorithms now include a gradient projection step to be more robust to colinearity
* The EM algorithm based on the Gaussian model is now initialized using a robust estimation of the covariance matrix
* A bug in the EM algorithm has been patched: the normalizing matrix gamma was creating a sampling biais
* A bug in the EM algorithm has been patched: the normalizing matrix gamma was creating a sampling bias
* Speed up of the EM algorithm likelihood maximization, using the conjugate gradient method
* The ImputeRegressor class now handles the nans by `row` by default
* The metric `frechet` was not correctly called and has been patched
@@ -67,9 +67,9 @@ History
-------------------

* VAR(p) EM sampler implemented, founding on a VAR(p) modelization such as the one described in `Lütkepohl (2005) New Introduction to Multiple Time Series Analysis`
* EM and RPCA matrices transposed in the low-level impelmentation, however the API remains unchanged
* EM and RPCA matrices transposed in the low-level implementation, however the API remains unchanged
* Sparse matrices introduced in the RPCA implementation so as to speed up the execution
* Implementation of SoftImpute, which provides a fast but less robust alterantive to RPCA
* Implementation of SoftImpute, which provides a fast but less robust alternative to RPCA
* Implementation of TabDDPM and TsDDPM, which are diffusion-based models for tabular data and time-series data, based on Denoising Diffusion Probabilistic Models. Their implementations follow the work of Tashiro et al., (2021) and Kotelnikov et al., (2023).
* ImputerDiffusion is an imputer-wrapper of these two models TabDDPM and TsDDPM.
* Docstrings and tests improved for the EM sampler
@@ -100,7 +100,7 @@ been changed into tuple attributes so that all are not immutable
0.0.13 (2023-06-07)
-------------------

* Refacto cross validation
* Refactor cross validation
* Fix Readme
* Add test utils.plot

4 changes: 2 additions & 2 deletions docs/analysis.rst
@@ -16,7 +16,7 @@ Then Qolmat proposes two tests to determine whether the missing data mechanism i
2. How to use the results
-------------------------

At the end of the MCAR test, it can then be assumed whether the missing data mechanism is MCAR or not. This serves three differents purposes:
At the end of the MCAR test, it can then be assumed whether or not the missing data mechanism is MCAR. This serves three different purposes:

a. Diagnosis
^^^^^^^^^^^^
@@ -45,7 +45,7 @@ The MCAR missing-data mechanism means that there is independence between the pre
a. Little's Test
^^^^^^^^^^^^^^^^

The best-known MCAR test is the :ref:`Little [1]<Little-article>` test, and it has been implemented in :class:`LittleTest`. Keep in mind that the Little's test is designed to test the homogeneity of means across the missing patterns and won't be efficient to detect the heterogeneity of covariance accross missing patterns.
The best-known MCAR test is the :ref:`Little [1]<Little-article>` test, and it has been implemented in :class:`LittleTest`. Keep in mind that the Little's test is designed to test the homogeneity of means across the missing patterns and won't be efficient to detect the heterogeneity of covariance across missing patterns.
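A minimal, hedged sketch of how this test might be applied; the import path, constructor arguments, and `test` method below are assumptions about the Qolmat API, used only to illustrate the workflow.

```python
# Hedged sketch: import path, constructor arguments and the `test` method are
# assumed, not verified against the current Qolmat API.
import numpy as np
import pandas as pd
from qolmat.analysis.holes_characterization import LittleTest  # path assumed

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df.loc[df.sample(frac=0.2, random_state=0).index, "b"] = np.nan  # MCAR holes

mcar_test = LittleTest(random_state=42)  # arguments assumed
p_value = mcar_test.test(df)             # small p-value -> reject the MCAR hypothesis
print(p_value)
```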

b. PKLM Test
^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/explanation.rst
@@ -117,7 +117,7 @@ The observations are said to be Missing at Random (MAR) if the probability of an

Finally, the observations are said to be Missing Not at Random (MNAR) in all other cases, i.e. if :math:`P(M | X_{obs}, X_{mis}, \psi)` does not simplify.

Qolmat allows to generate new missing values on a an existing dataset, but only in the MCAR case.
Qolmat allows to generate new missing values on an existing dataset, but only in the MCAR case.

Here are the different classes to generate missing data. We recommend the last 3 for time series.

10 changes: 5 additions & 5 deletions docs/imputers.rst
@@ -42,7 +42,7 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implement

**Noisy RPCA** [2, 3, 4]

The class :class:`RpcaNoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following

.. math::
\text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p
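To make the two penalties concrete, here is a small self-contained NumPy sketch (not Qolmat's implementation) of the proximal operators they induce: singular-value thresholding for the nuclear norm on M and element-wise soft-thresholding for the l1 norm on A. The data and thresholds are toy values.

```python
import numpy as np

def soft_threshold(x: np.ndarray, thr: float) -> np.ndarray:
    """Element-wise soft-thresholding, the proximal operator of thr * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def svd_threshold(x: np.ndarray, thr: float) -> np.ndarray:
    """Singular-value thresholding, the proximal operator of thr * ||.||_* (nuclear norm)."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, thr)) @ vt

# Toy data: a low-rank signal plus a few large sparse anomalies.
rng = np.random.default_rng(0)
d = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 20))
d[rng.integers(0, 50, 10), rng.integers(0, 20, 10)] += 10.0
m_hat = svd_threshold(d, thr=1.0)            # low-rank part M
a_hat = soft_threshold(d - m_hat, thr=0.5)   # sparse anomaly part A
```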
@@ -71,15 +71,15 @@ Suppose the data :math:`\mathbf{X}` has a density :math:`p_\theta` parametrized

**Expectation**

Draw samples of :math:`\mathbf{X}` assuming a fixed :math:`\theta`, conditionnaly on the values of :math:`\mathbf{X}_\mathrm{obs}`. This is done by MCMC using a projected Langevin algorithm.
Draw samples of :math:`\mathbf{X}` assuming a fixed :math:`\theta`, conditionally on the values of :math:`\mathbf{X}_\mathrm{obs}`. This is done by MCMC using a projected Langevin algorithm.
This process is characterized by a time step :math:`h`. Given an initial station :math:`X_0`, one can update the state at iteration *t* as

.. math::
\widetilde X_n = X_{n-1} + \Gamma \nabla L_X(X_{n-1}, \theta_n) (X_{n-1} - \mu) h + (2 h \Gamma)^{1/2} Z_n,

where :math:`Z_n` is a vector of independant standard normal random variables and :math:`L` is the log-likelihood.
where :math:`Z_n` is a vector of independent standard normal random variables and :math:`L` is the log-likelihood.
The sampled distribution tends to the target one in the limit :math:`h \rightarrow 0` and the number of iterations :math:`n \rightarrow \infty`.
Sampling from the conditionnal distribution :math:`p(\mathbf{X}_{mis} \vert \mathbf{X}_{obs} ; \theta^{(n)})` (see MCEM [6]) is achieved by projecting the samples at each step.
Sampling from the conditional distribution :math:`p(\mathbf{X}_{mis} \vert \mathbf{X}_{obs} ; \theta^{(n)})` (see MCEM [6]) is achieved by projecting the samples at each step.

.. math::
X_n = Proj_{obs} \left( \widetilde X_n \right),
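A minimal NumPy sketch of one such projected update, assuming the matrix Gamma is the identity and the caller supplies the log-likelihood gradient; this is an illustration only, not Qolmat's EM sampler.

```python
import numpy as np

def projected_langevin_step(x, x_obs, mask_obs, grad_log_lik, h=1e-2, rng=None):
    """Drift along the log-likelihood gradient, add Gaussian noise,
    then project by resetting the observed entries to their known values."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(x.shape)
    x_tilde = x + h * grad_log_lik(x) + np.sqrt(2.0 * h) * z
    x_tilde[mask_obs] = x_obs[mask_obs]  # Proj_obs: observed entries stay fixed
    return x_tilde

# Toy usage with a standard normal model, for which grad log p(x) = -x.
x_obs = np.array([[1.0, np.nan], [0.5, 2.0]])
mask_obs = ~np.isnan(x_obs)
x = np.where(mask_obs, x_obs, 0.0)  # initialize the missing entries
for _ in range(200):
    x = projected_langevin_step(x, x_obs, mask_obs, lambda v: -v)
```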
@@ -113,7 +113,7 @@ Two parametric distributions are implemented:

:class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [8] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:

* Forward process perturbs observed data to noise until all the original data structures are lost. The pertubation is done over a series of steps. Let :math:`X_{obs}` be observed data, :math:`T` be the number of steps that noises :math:`\epsilon \sim N(0,I)` are added into the observed data. Therefore, :math:`X_{obs}^t = \bar{\alpha}_t \times X_{obs} + \sqrt{1-\bar{\alpha}_t} \times \epsilon` where :math:`\bar{\alpha}_t` controls the right amount of noise.
* Forward process perturbs observed data to noise until all the original data structures are lost. The perturbation is done over a series of steps. Let :math:`X_{obs}` be observed data, :math:`T` be the number of steps that noises :math:`\epsilon \sim N(0,I)` are added into the observed data. Therefore, :math:`X_{obs}^t = \bar{\alpha}_t \times X_{obs} + \sqrt{1-\bar{\alpha}_t} \times \epsilon` where :math:`\bar{\alpha}_t` controls the right amount of noise.
* Reverse process removes noise and reconstructs the observed data. At each step :math:`t`, we train an autoencoder :math:`\epsilon_\theta` based on ResNet [10] to predict the added noise :math:`\epsilon_t` based on the rest of the observed data. The objective function is the error between the noise added in the forward process and the noise predicted by :math:`\epsilon_\theta`.

In training phase, we use the self-supervised learning method of [9] to train incomplete data. In detail, our model randomly masks a part of observed data and computes loss from these masked data. Moving on to the inference phase, (1) missing data are replaced by Gaussian noises :math:`\epsilon \sim N(0,I)`, (2) at each noise step from :math:`T` to 0, our model denoises these missing data based on :math:`\epsilon_\theta`.
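As a rough illustration of the forward process, here is a toy NumPy version of a single noising step, written with the standard DDPM parametrization (square root applied to the cumulative alpha-bar); the noise-schedule values are made up for the example and are not Qolmat's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(size=(8, 4))          # a small batch of observed rows
betas = np.linspace(1e-4, 0.02, 100)     # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative product, one value per step

t = 50
eps = rng.standard_normal(x_obs.shape)
x_noisy = np.sqrt(alpha_bar[t]) * x_obs + np.sqrt(1.0 - alpha_bar[t]) * eps
```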
12 changes: 6 additions & 6 deletions examples/benchmark.md
@@ -17,7 +17,7 @@ jupyter:
In Qolmat, a few data imputation methods are implemented as well as a way to evaluate their performance.**


First, import some useful librairies
First, import some useful libraries

```python tags=[]
import warnings
@@ -54,7 +54,7 @@ from qolmat.utils import data, utils, plot


The dataset `Beijing` is the Beijing Multi-Site Air-Quality Data Set. It consists in hourly air pollutants data from 12 chinese nationally-controlled air-quality monitoring sites and is available at https://archive.ics.uci.edu/ml/machine-learning-databases/00501/.
This dataset only contains numerical vairables.
This dataset only contains numerical variables.

```python tags=[]
df_data = data.get_data_corrupted("Beijing", ratio_masked=.2, mean_size=120)
@@ -98,11 +98,11 @@ plt.show()
This part is devoted to the imputation methods. The idea is to try different algorithms and compare them.

<u>**Methods**</u>:
All presented methods are group-wise: here each station is imputed independently. For example ImputerMean computes the mean of each variable in each station and uses the result for imputation; ImputerInterpolation interpolates termporal signals corresponding to each variable on each station.
All presented methods are group-wise: here each station is imputed independently. For example ImputerMean computes the mean of each variable in each station and uses the result for imputation; ImputerInterpolation interpolates temporal signals corresponding to each variable on each station.
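For instance, a group-wise imputer could be set up along the following lines; the class names come from the text above, while the `groups` and `method` keyword arguments are assumptions about how per-station imputation is requested.

```python
# Hedged sketch: `groups` and `method` keyword names are assumed.
from qolmat.imputations import imputers

dict_imputers = {
    "mean": imputers.ImputerMean(groups=("station",)),
    "interpolation": imputers.ImputerInterpolation(method="linear", groups=("station",)),
}
```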

<u>**Hyperparameters' search**</u>:
Some methods require hyperparameters. The user can directly specify them, or rather determine them through an optimization step using the `search_params` dictionary. The keys are the imputation method's name and the values are a dictionary specifying the minimum, maximum or list of categories and type of values (Integer, Real, Category or a dictionary indexed by the variable names) to search.
In pratice, we rely on a cross validation to find the best hyperparams values minimizing an error reconstruction.
In practice, we rely on a cross validation to find the best hyperparams values minimizing an error reconstruction.
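Following that description, a search space might look like the sketch below; the imputer names, parameter names, key names and bounds are purely illustrative.

```python
# Illustrative only: keys are imputer names, values describe the search space
# (minimum, maximum and type) for each hyperparameter, as described above.
search_params = {
    "RPCA": {
        "tau": {"min": 0.5, "max": 5.0, "type": "Real"},
        "lam": {"min": 0.1, "max": 1.0, "type": "Real"},
    },
    "KNN": {
        "n_neighbors": {"min": 2, "max": 10, "type": "Integer"},
    },
}
```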

```python tags=[]
ratio_masked = 0.1
@@ -476,7 +476,7 @@ plt.show()


We first check the covariance. We simply plot one variable versus one another.
One observes the methods provide similar visual resuls: it's difficult to compare them based on this criterion.
One observes the methods provide similar visual results: it's difficult to compare them based on this criterion.

```python
fig = plt.figure(figsize=(6 * n_imputers, 6 * n_columns))
@@ -494,7 +494,7 @@ plt.show()
## Auto-correlation


We are now interested in the auto-correlation function (ACF). As seen before, time series display seaonal patterns.
We are now interested in the auto-correlation function (ACF). As seen before, time series display seasonal patterns.
[Autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation) is the correlation of a signal with a delayed copy of itself as a function of delay. It measures the similarity between observations of a random variable as a function of the time lag between them. The objective is to have an ACF to be similar between the original dataset and the imputed one.

```python
12 changes: 6 additions & 6 deletions examples/tutorials/plot_tuto_benchmark_TS.py
@@ -41,7 +41,7 @@
# For the purpose of this notebook,
# we corrupt the data, with the ``qolmat.utils.data.add_holes`` function
# on three variables: "TEMP", "PRES" and "WSPM"
# and the imputation methods will have acces to two additional features:
# and the imputation methods will have access to two additional features:
# "DEWP" and "RAIN".

df_data = data.get_data("Beijing")
@@ -51,7 +51,7 @@
df = data.add_holes(df_data, ratio_masked=0.15, mean_size=50)
df[["DEWP", "RAIN"]] = df_data[["DEWP", "RAIN"]]
# %%
# Let's take a look a one station, for instance "Aotizhongxin"
# Let's take a look at one station, for instance "Aotizhongxin"

station = "Aotizhongxin"
fig, ax = plt.subplots(len(cols_to_impute), 1, figsize=(13, 8))
@@ -68,7 +68,7 @@
# ---------------------------------------------------------------
# All presented methods are group-wise: here each station is imputed independently.
# For example ImputerMean computes the mean of each variable in each station and uses
# the result for imputation; ImputerInterpolation interpolates termporal
# the result for imputation; ImputerInterpolation interpolates temporal
# signals corresponding to each variable on each station.
# We consider five imputation methods:
# ``median`` for a baseline imputation;
@@ -181,10 +181,10 @@

# %%
# We can also check the covariance. We simply plot one variable versus one another.
# One observes the methods provide similar visual resuls: it's difficult to compare
# One observes the methods provide similar visual results: it's difficult to compare
# them based on this criterion, except the median imputation that greatly differs.
# Black points and ellipses are original datafames
# whiel colored ones are imputed dataframes.
# Black points and ellipses are original dataframes
# while colored ones are imputed dataframes.

n_columns = len(dfs_imputed_station)
fig = plt.figure(figsize=(10, 10))
6 changes: 3 additions & 3 deletions examples/tutorials/plot_tuto_categorical.py
@@ -57,7 +57,7 @@
# %%
# The third approach uses ImputerRegressor which imputes iteratively each column using the other
# ones. The function make_robust_MixteHGB provides an underlying model able to:
# - adress both numerical targets (regression) and categorical targets (classification)
# - address both numerical targets (regression) and categorical targets (classification)
# - manage categorical features though one hot encoding
# - manage missing features (native to the HistGradientBoosting)
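A hedged sketch of how these pieces might be wired together; the import paths and keyword arguments are assumptions rather than the verified API, and `df` stands for the mixed-type DataFrame loaded elsewhere in the tutorial.

```python
# Assumed import paths and arguments; illustrates the iterative, column-by-column
# imputation with the HGB-based model described above.
from qolmat.imputations.imputers import ImputerRegressor
from qolmat.imputations.preprocessing import make_robust_MixteHGB  # path assumed

model = make_robust_MixteHGB()
imputer_hgb = ImputerRegressor(estimator=model)  # keyword name assumed
df_imputed = imputer_hgb.fit_transform(df)       # df: mixed-type DataFrame from the tutorial
```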

@@ -68,7 +68,7 @@
# %%
# 3. Mixed type model selection
# ---------------------------------------------------------------
# Let us now compare these three aproaches by measuring their ability to impute uniformly
# Let us now compare these three approaches by measuring their ability to impute uniformly
# distributed holes.

dict_imputers = {
@@ -101,5 +101,5 @@
results.loc["rmse"].style.highlight_min(color="lightgreen", axis=1)

# %%
# The HGB imputation methods globaly reaches a better accuracy on the categorical data.
# The HGB imputation methods globally reaches a better accuracy on the categorical data.
results.loc["accuracy"].style.highlight_max(color="lightgreen", axis=1)
10 changes: 5 additions & 5 deletions examples/tutorials/plot_tuto_diffusion_models.py
@@ -54,12 +54,12 @@
#
# * ``cols_imputed``: list of columns that need to be imputed. Recall that we train the model on
# incomplete data by using the self-supervised learning method. We can set which columns to be
# masked during training. Its defaut value is ``None``.
# masked during training. Its default value is ``None``.
#
# * ``epochs`` : a number of iterations, its defaut value ``epochs=10``. In practice, we should
# * ``epochs`` : a number of iterations, its default value ``epochs=10``. In practice, we should
# set a larger number of epochs e.g., ``epochs=100``.
#
# * ``batch_size`` : a size of batch, its defaut value ``batch_size=100``.
# * ``batch_size`` : a size of batch, its default value ``batch_size=100``.
#
# The following hyperparams are for validation:
#
@@ -198,11 +198,11 @@
#
# For TsDDPM, we have two options for splitting data:
#
# * ``is_rolling=False`` (default value): the data is splited by using
# * ``is_rolling=False`` (default value): the data is split by using
# pandas.DataFrame.resample(rule=freq_str). There is no duplication of row between chunks,
# leading a smaller number of chunks than the number of rows in the original data.
#
# * ``is_rolling=True``: the data is splited by using pandas.DataFrame.rolling(window=freq_str).
# * ``is_rolling=True``: the data is split by using pandas.DataFrame.rolling(window=freq_str).
# The number of chunks is also the number of rows in the original data.
# Note that setting ``is_rolling=True`` always produces better quality of imputations
# but requires a longer training/inference time.
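To see why `is_rolling=True` is slower, here is a small self-contained comparison of the two chunking strategies named above on a toy hourly series; `"1D"` is just an example value for `freq_str`.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2024-01-01", periods=72, freq="h")
df = pd.DataFrame({"x": np.arange(72.0)}, index=index)

# is_rolling=False: non-overlapping chunks, one per resampling bin (3 days here).
chunks_resample = [group for _, group in df.resample("1D")]

# is_rolling=True: overlapping windows, one per row (72 here), hence more chunks
# and a longer training/inference time.
chunks_rolling = [window for window in df.rolling("1D")]

print(len(chunks_resample), len(chunks_rolling))  # 3 72
```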