Commit 11fe4f6

Merge branch dev into angoho_docs_diffusion
2 parents 0c1806a + 3c50162 commit 11fe4f6

31 files changed

+444222
-413
lines changed

.github/workflows/test.yml

Lines changed: 5 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -27,17 +27,19 @@ jobs:
2727
with:
2828
python-version: ${{matrix.python-version}}
2929
environment-file: environment.ci.yml
30-
channels: default, conda-forge
3130
- name: Lint with flake8
3231
run: |
3332
conda install flake8
3433
flake8
3534
- name: Test with pytest
3635
run: |
3736
conda install pytest
38-
pytest
39-
echo you should uncomment pytest and delete this line
37+
make coverage
4038
- name: typing with mypy
4139
run: |
4240
mypy qolmat
4341
echo you should uncomment mypy qolmat and delete this line
42+
- name: Upload coverage reports to Codecov
43+
uses: codecov/codecov-action@v3
44+
env:
45+
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}

.readthedocs.yml

Lines changed: 3 additions & 2 deletions
@@ -1,10 +1,11 @@
11
version: 2
22

33
build:
4-
image: latest
4+
os: "ubuntu-22.04"
5+
tools:
6+
python: "mambaforge-22.9"
57

68
python:
7-
version: 3.8
89
install:
910
- method: pip
1011
path: .

AUTHORS.rst

Lines changed: 7 additions & 9 deletions
@@ -2,22 +2,20 @@
22
Credits
33
=======
44

5-
Development Lead
5+
Development Team
66
----------------
77

88
* Julien Roussel <jroussel@quantmetry.com>
9-
10-
Maintainers
11-
------------
12-
13-
* Mikail Duran <mduran@quantmetry.com>
149
* Anh Khoa Ngo Ho <angoho@quantmetry.com>
10+
* Charles-Henri Prat <chprat@quantmetry.com>
1511
* Guillaume Saës <gsaes@quantmetry.com>
1612

17-
Contributors
18-
------------
13+
Past Contributors
14+
-----------------
1915

2016
* Hong-Lan Botterman
17+
* Nicolas Brunel
2118
* Firas Dakhli
19+
* Mikaïl Duran
2220
* Rima Hajou
23-
* Vianey Taquet
21+
* Thomas Morzadec

CONTRIBUTING.rst

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ You can create a virtual environment via `conda`:
3232
$ conda env create -f environment.dev.yml
3333
$ conda activate env_qolmat_dev
3434
35-
If you need to use tensorflow, enter the command:
35+
If you need to use pytorch, enter the command:
3636

3737
.. code:: sh
3838

HISTORY.rst

Lines changed: 12 additions & 2 deletions
@@ -2,13 +2,23 @@
22
History
33
=======
44

5-
0.0.16 (2023-??-??)
5+
0.1.1 (2023-??-??)
6+
-------------------
7+
8+
* Hotfix reference to tensorflow in the documentation, when it should be pytorch
9+
10+
0.1.0 (2023-10-11)
611
-------------------
712

813
* VAR(p) EM sampler implemented, based on a VAR(p) model such as the one described in `Lütkepohl (2005) New Introduction to Multiple Time Series Analysis`
914
* EM and RPCA matrices transposed in the low-level implementation; the API remains unchanged
10-
* Sparse matrices introduced in the RPCA impletation so as to speed up the execution
15+
* Sparse matrices introduced in the RPCA implementation so as to speed up the execution
16+
* Implementation of SoftImpute, which provides a fast but less robust alternative to RPCA
17+
* Implementation of TabDDPM and TsDDPM, which are diffusion-based models for tabular data and time-series data, based on Denoising Diffusion Probabilistic Models. Their implementations follow the work of Tashiro et al. (2021) and Kotelnikov et al. (2023).
18+
* ImputerDiffusion is an imputer wrapper around the two models TabDDPM and TsDDPM.
1119
* Docstrings and tests improved for the EM sampler
20+
* Fix ImputerPytorch
21+
* Update Benchmark Deep Learning
1222

1323
0.0.15 (2023-08-03)
1424
-------------------

Makefile

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
coverage:
2+
pytest --cov-branch --cov=qolmat --cov-report=xml
3+
4+
doctest:
5+
pytest --doctest-modules --pyargs qolmat
6+
7+
doc:
8+
make html -C docs
9+
10+
clean:
11+
rm -rf .mypy_cache .pytest_cache .coverage*
12+
rm -rf **__pycache__
13+
make clean -C docs

README.rst

Lines changed: 68 additions & 93 deletions
@@ -1,6 +1,6 @@
11
.. -*- mode: rst -*-
22
3-
|GitHubActions|_ |ReadTheDocs|_ |License|_ |PythonVersion|_ |PyPi|_ |Release|_ |Commits|_
3+
|GitHubActions|_ |ReadTheDocs|_ |License|_ |PythonVersion|_ |PyPi|_ |Release|_ |Commits|_ |Codecov|_
44

55
.. |GitHubActions| image:: https://github.com/Quantmetry/qolmat/actions/workflows/test.yml/badge.svg
66
.. _GitHubActions: https://github.com/Quantmetry/qolmat/actions
@@ -23,6 +23,9 @@
2323
.. |Commits| image:: https://img.shields.io/github/commits-since/Quantmetry/qolmat/latest/main
2424
.. _Commits: https://github.com/Quantmetry/qolmat/commits/main
2525

26+
.. |Codecov| image:: https://codecov.io/gh/quantmetry/qolmat/branch/master/graph/badge.svg
27+
.. _Codecov: https://codecov.io/gh/quantmetry/qolmat
28+
2629
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png
2730
:align: center
2831

@@ -44,7 +47,7 @@ Qolmat can be installed in different ways:
4447
.. code:: sh
4548
4649
$ pip install qolmat # installation via `pip`
47-
$ pip install qolmat[tensorflow] # if you need tensforflow
50+
$ pip install qolmat[pytorch] # if you need pytorch
4851
$ pip install git+https://github.com/Quantmetry/qolmat # or directly from the github repository
4952
5053
⚡️ Quickstart
@@ -64,146 +67,122 @@ With just these few lines of code, you can see how easy it is to
6467
6568
from qolmat.benchmark import comparator, missing_patterns
6669
from qolmat.imputations import imputers
67-
from qolmat.utils.data import add_holes
70+
from qolmat.utils import data
71+
72+
# load and prepare csv data
6873
69-
# create time series with missing values
70-
np.random.seed(42)
71-
t = np.linspace(0,1,1000)
72-
y = np.cos(2*np.pi*t*10)+np.random.randn(1000)/2
73-
df = pd.DataFrame({'y': y}, index=pd.Series(t, name='index'))
74-
df_with_nan = add_holes(df, ratio_masked=0.1, mean_size=20)
74+
df_data = data.get_data("Beijing")
75+
columns = ["TEMP", "PRES", "WSPM"]
76+
df_data = df_data[columns]
77+
df_with_nan = data.add_holes(df_data, ratio_masked=0.2, mean_size=120)
7578
7679
# impute and compare
77-
imputer_mean = imputers.ImputerMean()
78-
imputer_interpol = imputers.ImputerInterpolation(method="linear")
79-
imputer_var1 = imputers.ImputerEM(model="VAR", method="mle", max_iter_em=100, n_iter_ou=15, dt=1e-3, p=1)
80+
imputer_mean = imputers.ImputerMean(groups=("station",))
81+
imputer_interpol = imputers.ImputerInterpolation(method="linear", groups=("station",))
82+
imputer_var1 = imputers.ImputerEM(model="VAR", groups=("station",), method="mle", max_iter_em=50, n_iter_ou=15, dt=1e-3, p=1)
8083
dict_imputers = {
81-
"mean": imputer_mean,
82-
"interpolation": imputer_interpol,
83-
"var1": imputer_var1
84-
}
84+
"mean": imputer_mean,
85+
"interpolation": imputer_interpol,
86+
"VAR(1) process": imputer_var1
87+
}
8588
generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
8689
comparison = comparator.Comparator(
87-
dict_imputers,
88-
['y'],
89-
generator_holes = generator_holes,
90-
metrics = ["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
91-
)
90+
dict_imputers,
91+
columns,
92+
generator_holes = generator_holes,
93+
metrics = ["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
94+
)
9295
results = comparison.compare(df_with_nan)
93-
results.style.highlight_min(color="lime", axis=1)
96+
results.style.highlight_min(color="lightsteelblue", axis=1)
9497
9598
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_tabular_comparison.png
9699
:align: center
97100

98-
.. code-block:: python
99-
100-
import matplotlib.pyplot as plt
101-
# visualise
102-
dfs_imputed = {name: imp.fit_transform(df_with_nan) for name, imp in dict_imputers.items()}
103-
plt.figure(figsize=(13,3))
104-
for (name, df_imputed), color in zip(dfs_imputed.items(), ["tab:green", "tab:blue", "tab:red"]):
105-
plt.plot(df_imputed, ".", c=color, label=name)
106-
plt.plot(df_with_nan, ".", c="k", label="original")
107-
plt.legend()
108-
plt.grid()
109-
plt.ylabel("values")
110-
plt.show()
111-
112-
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_imputation_plot.png
113-
:align: center
114-
115-
116101
📘 Documentation
117102
================
118103

119104
The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.
120105

121106
**How does Qolmat work?**
122107

123-
Qolmat simplifies the selection process of a data imputation algorithm. It does so by comparing of various methods based on different evaluation metrics.
124-
It is compatible with scikit-learn.
125-
Evaluation and comparison are based on the standard approach to select some observations, set their status to missing, and compare
126-
their imputation with their true values.
108+
Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
109+
1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified `hole generator <explanation.html#hole-generator>`_.
110+
2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user-specified `performance metrics <explanation.html#metrics>`_.
111+
3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
127112

128-
More specifically, from the initial dataframe with missing value, we generate additional missing values (N samples).
129-
On each sample, different imputation models are tested and reconstruction errors are computed on these artificially missing entries. Then the errors of each imputation model are averaged and we eventually obtained a unique error score per model. This procedure allows the comparison of different models on the same dataset.
113+
This is very similar in spirit to the `cross_val_score <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>`_ function for scikit-learn.
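
The three steps above can be sketched with plain NumPy and pandas. This is only an illustration under simplifying assumptions, not Qolmat's implementation: the helper ``compare_imputers`` is hypothetical, holes are drawn uniformly at random rather than by a Qolmat hole generator, and the only metric is the mean absolute error.

```python
import numpy as np
import pandas as pd


def compare_imputers(df, imputers, n_folds=4, ratio_masked=0.1, seed=0):
    """Hypothetical helper: MAE of each imputer on masked entries, averaged over folds."""
    rng = np.random.default_rng(seed)
    observed = df.notna().to_numpy()
    scores = {name: [] for name in imputers}
    for _ in range(n_folds):
        # Step 1: hide a random subset of the observed entries.
        mask = observed & (rng.random(df.shape) < ratio_masked)
        df_masked = df.mask(mask)
        for name, impute in imputers.items():
            # Step 2: fill missing and masked values, then score on the masked entries only.
            df_imputed = impute(df_masked)
            errors = (df_imputed - df).abs().to_numpy()[mask]
            scores[name].append(errors.mean())
    # Step 3: pool the per-fold metrics into a single value per imputer.
    return pd.Series({name: float(np.mean(vals)) for name, vals in scores.items()})


# Toy data: a noisy cosine, similar to the synthetic series of the previous quickstart.
rng = np.random.default_rng(42)
t = np.linspace(0, 1, 500)
df = pd.DataFrame({"y": np.cos(2 * np.pi * 10 * t) + rng.normal(scale=0.1, size=t.size)})

imputers = {
    "mean": lambda d: d.fillna(d.mean()),
    "interpolation": lambda d: d.interpolate(method="linear", limit_direction="both"),
}
results = compare_imputers(df, imputers)
print(results)
```

On a smooth signal like this one, linear interpolation should score a lower error than the column mean, which is the kind of ranking the comparator is meant to surface.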
130114

131115
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/schema_qolmat.png
132116
:align: center
133117

134118
**Imputation methods**
135119

136-
The following table contains the available imputation methods:
120+
The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
137121

138122
.. list-table::
139-
:widths: 25 70 15 15 20
123+
:widths: 25 70 15 15
140124
:header-rows: 1
141125

142126
* - Method
143127
- Description
144-
- Tabular
145-
- Time series
146-
- Minimised criterion
128+
- Tabular or Time series
129+
- Single or Multiple
147130
* - mean
148131
- Imputes the missing values using the mean along each column
149-
- yes
150-
- no
151-
- point
132+
- tabular
133+
- single
152134
* - median
153135
- Imputes the missing values using the median along each column
154-
- yes
155-
- no
156-
- point
136+
- tabular
137+
- single
157138
* - LOCF
158139
- Imputes missing entries by carrying the last observation forward for each column
159-
- yes
160-
- yes
161-
- point
140+
- time series
141+
- single
162142
* - shuffle
163143
- Imputes missing entries with a random value drawn from each column
164-
- yes
165-
- no
166-
- point
144+
- tabular
145+
- multiple
167146
* - interpolation
168147
- Imputes missing values using one of the interpolation strategies supported by pd.Series.interpolate
169-
- yes
170-
- yes
171-
- point
148+
- time series
149+
- single
172150
* - impute on residuals
173151
- The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised
174-
- no
175-
- yes
176-
- point
152+
- time series
153+
- single
177154
* - MICE
178155
- Multiple Imputation by Chained Equations
179-
- yes
180-
- no
181-
- point
156+
- tabular
157+
- both
182158
* - RPCA
183159
- Robust Principal Component Analysis
184-
- yes
185-
- yes
186-
- point
160+
- both
161+
- single
187162
* - SoftImpute
188163
- Iterative method for matrix completion that uses nuclear-norm regularization
189-
- yes
190-
- no
191-
- point
164+
- tabular
165+
- single
192166
* - KNN
193167
- K-nearest neighbors
194-
- yes
195-
- no
196-
- point
168+
- tabular
169+
- single
197170
* - EM sampler
198171
- Imputes missing values via EM algorithm
199-
- yes
200-
- yes
201-
- point/distribution
172+
- both
173+
- both
174+
* - MLP
175+
- Imputer based on a Multi-Layer Perceptron model
176+
- both
177+
- both
178+
* - Autoencoder
179+
- Imputer based on an autoencoder model with a variational method
180+
- both
181+
- both
202182
* - TabDDPM
203183
- Imputer based on Denoising Diffusion Probabilistic Models
204-
- yes
205-
- yes
206-
- distribution
184+
- both
185+
- both
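
The single versus multiple distinction drawn above the table can be illustrated on a toy series. The sketch below is an assumption-laden illustration and does not use Qolmat's imputers: it contrasts a deterministic fill (the observed mean) with a stochastic fill (values resampled from the observed entries, in the spirit of the ``shuffle`` row) and compares the resulting standard deviations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=5.0, scale=2.0, size=1000))
# Remove roughly 30% of the values at random.
s_missing = s.mask(rng.random(s.size) < 0.3)

# Deterministic fill: every hole receives the same value, the observed mean.
s_mean = s_missing.fillna(s_missing.mean())

# Stochastic fill: every hole receives a value drawn from the observed ones.
observed = s_missing.dropna().to_numpy()
holes = s_missing.isna()
s_shuffle = s_missing.copy()
s_shuffle[holes] = rng.choice(observed, size=int(holes.sum()))

print(f"std of observed values:       {s_missing.std():.2f}")
print(f"std after mean imputation:    {s_mean.std():.2f}")
print(f"std after shuffle imputation: {s_shuffle.std():.2f}")
```

Mean imputation shrinks the spread of the column, while resampling keeps it close to that of the observed values. This is why distribution-oriented metrics and pointwise metrics can rank imputers differently.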
207186

208187

209188

@@ -230,8 +209,6 @@ Qolmat has been developed by Quantmetry.
230209
🔍 References
231210
==============
232211

233-
Qolmat methods belong to the field of conformal inference.
234-
235212
[1] Candès, Emmanuel J., et al. “Robust principal component analysis?.”
236213
Journal of the ACM (JACM) 58.3 (2011): 1-37,
237214
(`pdf <https://arxiv.org/abs/0912.3599>`__)
@@ -242,15 +219,13 @@ Journal of advanced transportation 2018 (2018).
242219
(`pdf <https://www.hindawi.com/journals/jat/2018/7191549/>`__)
243220

244221
[3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in
245-
robust PCA: Noise, outliers, and missing data.” arXiv preprint
246-
arXiv:2001.05484 (2020), (`pdf <https://arxiv.org/abs/2001.05484>`__)
222+
robust PCA: Noise, outliers, and missing data.” Annals of statistics, 49(5), 2948 (2021), (`pdf <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9491514/pdf/nihms-1782570.pdf>`__)
247223

248224
[4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of
249225
Selected Topics in Signal Processing 10.4 (2016): 740-756.
250226
(`pdf <https://arxiv.org/abs/1507.08173>`__)
251227

252-
[5] Jiashi Feng, et al. “Online robust pca via stochastic opti-
253-
mization.“ Advances in neural information processing systems, 26, 2013.
228+
[5] Jiashi Feng, et al. “Online robust PCA via stochastic optimization.” Advances in neural information processing systems, 26, 2013.
254229
(`pdf <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.721.7506&rep=rep1&type=pdf>`__)
255230

256231
[6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.

docs/conf.py

Lines changed: 5 additions & 5 deletions
@@ -56,12 +56,12 @@
5656
from distutils.version import LooseVersion
5757

5858
# pngmath / imgmath compatibility layer for different sphinx versions
59-
import sphinx
59+
# import sphinx
6060

61-
if LooseVersion(sphinx.__version__) < LooseVersion("1.4"):
62-
extensions.append("sphinx.ext.pngmath")
63-
else:
64-
extensions.append("sphinx.ext.imgmath")
61+
# if LooseVersion(sphinx.__version__) < LooseVersion("1.4"):
62+
# extensions.append("sphinx.ext.pngmath")
63+
# else:
64+
# extensions.append("sphinx.ext.imgmath")
6565

6666
autodoc_default_flags = ["members", "inherited-members"]
6767
