Commit 6b58c02

Merge pull request #89 from Quantmetry/dev
Dev
2 parents 5ee0f34 + f0c26ab commit 6b58c02

25 files changed: 471 additions, 103 deletions

.github/workflows/test.yml

Lines changed: 5 additions & 7 deletions
@@ -1,20 +1,22 @@
-name: Unit test Qolmat
+name: Unit tests

 on:
   push:
     branches:
       - dev
       - main
   pull_request:
+    types: [opened, synchronize, reopened, ready_for_review]
   workflow_dispatch:

 jobs:
   build-linux:
+    if: github.event.pull_request.draft == false
     runs-on: ${{matrix.os}}
     strategy:
       matrix:
         os: [ubuntu-latest, windows-latest]
-        python-version: [3.8, 3.9]
+        python-version: ['3.8', '3.9', '3.10', '3.11']
     defaults:
       run:
         shell: bash -l {0}
@@ -27,16 +29,12 @@ jobs:
       with:
         python-version: ${{matrix.python-version}}
         environment-file: environment.ci.yml
-        channels: default, conda-forge
     - name: Lint with flake8
       run: |
-        conda install flake8
         flake8
     - name: Test with pytest
       run: |
-        conda install pytest
-        pytest
-        echo you should uncomment pytest and delete this line
+        make coverage
     - name: typing with mypy
       run: |
         mypy qolmat

.github/workflows/test_quick.yml

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+name: Unit tests fast
+
+on:
+  push:
+    branches-ignore:
+      - dev
+      - main
+  workflow_dispatch:
+
+jobs:
+  basic-testing:
+    runs-on: ${{matrix.os}}
+    strategy:
+      matrix:
+        os: [ubuntu-latest]
+        python-version: [3.8]
+    defaults:
+      run:
+        shell: bash -l {0}
+
+    steps:
+    - name: Git clone
+      uses: actions/checkout@v3
+
+    # See caching environments
+    # https://github.com/conda-incubator/setup-miniconda#caching-environments
+    - name: Setup Mambaforge
+      uses: conda-incubator/setup-miniconda@v2
+      with:
+        miniforge-variant: Mambaforge
+        miniforge-version: latest
+        activate-environment: env_qolmat_ci
+        use-mamba: true
+
+    - name: Get Date
+      id: get-date
+      run: echo "today=$(/bin/date -u '+%Y%m%d')" >> $GITHUB_OUTPUT
+
+    - name: Cache Conda env
+      uses: actions/cache@v2
+      with:
+        path: ${{ env.CONDA }}/envs
+        key:
+          conda-${{ runner.os }}--${{ runner.arch }}--${{
+          steps.get-date.outputs.today }}-${{
+          hashFiles('environment.ci.yml') }}-${{ env.CACHE_NUMBER
+          }}
+      env:
+        # Increase this value to reset cache if environment.ci.yml has not changed
+        CACHE_NUMBER: 0
+      id: cache
+
+    - name: Update environment
+      run: mamba env update -n env_qolmat_ci -f environment.ci.yml
+      if: steps.cache.outputs.cache-hit != 'true'
+
+    - name: Lint with flake8
+      run: |
+        flake8
+    - name: Test with pytest
+      run: |
+        make coverage
+    - name: Test docstrings
+      run: make doctest
+    - name: typing with mypy
+      run: |
+        mypy qolmat
+        echo you should uncomment mypy qolmat and delete this line

.readthedocs.yml

Lines changed: 5 additions & 2 deletions
@@ -1,13 +1,16 @@
 version: 2

 build:
-  image: latest
+  os: "ubuntu-22.04"
+  tools:
+    python: "mambaforge-22.9"

 python:
-  version: 3.8
   install:
     - method: pip
       path: .
+      extra_requirements:
+        - pytorch

 conda:
   environment: environment.doc.yml

CONTRIBUTING.rst

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ You can create a virtual environment via `conda`:
     $ conda env create -f environment.dev.yml
     $ conda activate env_qolmat_dev

-If you need to use tensorflow, enter the command:
+If you need to use pytorch, enter the command:

 .. code:: sh


HISTORY.rst

Lines changed: 5 additions & 0 deletions
@@ -2,6 +2,11 @@
 History
 =======

+0.1.1 (2023-??-??)
+-------------------
+
+* Hotfix reference to tensorflow in the documentation, when it should be pytorch
+
 0.1.0 (2023-10-11)
 -------------------


Makefile

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+coverage:
+	pytest --cov-branch --cov=qolmat --cov-report=xml
+
+doctest:
+	pytest --doctest-modules --pyargs qolmat
+
+doc:
+	make html -C docs
+
+clean:
+	rm -rf .mypy_cache .pytest_cache .coverage*
+	rm -rf **__pycache__
+	make clean -C docs

README.rst

Lines changed: 4 additions & 4 deletions
@@ -47,7 +47,7 @@ Qolmat can be installed in different ways:
 .. code:: sh

     $ pip install qolmat  # installation via `pip`
-    $ pip install qolmat[tensorflow]  # if you need tensorflow
+    $ pip install qolmat[pytorch]  # if you need ImputerDiffusion relying on pytorch
     $ pip install git+https://github.com/Quantmetry/qolmat  # or directly from the github repository

 ⚡️ Quickstart
@@ -105,8 +105,8 @@ The full documentation can be found `on this link <https://qolmat.readthedocs.io

 **How does Qolmat work ?**

-Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
-1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_,
+| Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
+1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_.
 2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user specified `performance metrics <explanation.html#metrics>`_.
 3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
@@ -117,7 +117,7 @@ This is very similar in spirit to the `cross_val_score <https://scikit-learn.org

 **Imputation methods**

-The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
+The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic). For further details regarding the distinction between single and multiple imputation, you can refer to the `Imputation article <https://en.wikipedia.org/wiki/Imputation_(statistics)>`_ on Wikipedia.

 .. list-table::
    :widths: 25 70 15 15
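
The three-step procedure above maps onto the benchmark API this PR exercises in CI. Below is a minimal sketch following the repository's Quickstart; the imputer classes, `EmpiricalHoleGenerator`, the `Comparator` signature, and the metric names are taken from the README at the time of this release and should be treated as assumptions on other versions.

.. code:: python

    import numpy as np
    import pandas as pd

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers

    # Toy dataframe with roughly 10% of values missing
    rng = np.random.default_rng(42)
    df_data = pd.DataFrame(rng.normal(size=(500, 3)), columns=["TEMP", "PRES", "WSPM"])
    df_data = df_data.mask(rng.random(df_data.shape) < 0.1)

    # Imputers to compare (scikit-learn compatible fit/transform interface)
    dict_imputers = {
        "mean": imputers.ImputerMean(),
        "interpolation": imputers.ImputerInterpolation(method="linear"),
    }

    # Step 1: the hole generator masks observed values on each of the K folds
    generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)

    # Steps 2 and 3: impute each fold with each method, score it, pool over folds
    comparison = comparator.Comparator(
        dict_imputers,
        list(df_data.columns),
        generator_holes=generator_holes,
        metrics=["mae", "wmape"],
    )
    results = comparison.compare(df_data)  # one pooled score per (metric, column) and imputer
    print(results)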

docs/api.rst

Lines changed: 21 additions & 0 deletions
@@ -93,3 +93,24 @@ EM engine

     imputations.em_sampler.MultiNormalEM
     imputations.em_sampler.VARpEM
+
+Diffusion engine
+================
+
+.. autosummary::
+    :toctree: generated/
+    :template: class.rst
+
+    imputations.imputers_pytorch.ImputerDiffusion
+    imputations.diffusions.ddpms.TabDDPM
+    imputations.diffusions.ddpms.TsDDPM
+
+
+Utils
+================
+
+.. autosummary::
+    :toctree: generated/
+    :template: function.rst
+
+    utils.data.add_holes

docs/explanation.rst

Lines changed: 7 additions & 4 deletions
@@ -99,7 +99,7 @@ We compute the associated complete dataset :math:`\hat{X}^{(k)}` for the partial
 -----------------

 Evaluating the imputers requires generating holes that are representative of the holes at hand.
-The missingness mechanisms have been classified by Rubin [1] into MCAR, MAR and MNAR.
+The missingness mechanisms have been classified by :ref:`Rubin [1]<rubin-article>` into MCAR, MAR and MNAR.

 Suppose we have :math:`X_{obs}`, a subset of a complete data model :math:`X = (X_{obs}, X_{mis})`, which is not fully observable (:math:`X_{mis}` is the missing part).
 We define the matrix :math:`M` such that :math:`M_{ij}=1` if :math:`X_{ij}` is missing, and 0 otherwise, and we assume the distribution of :math:`M` is parametrised by :math:`\psi`.
@@ -108,14 +108,14 @@ The observations are said to be Missing Completely at Random (MCAR) if the proba
 Formally,

 .. math::
-    P(M | X_{obs}, X_{mis}, \psi) = P(M, \psi), \quad \forall \psi.
+    P(M | X_{obs}, X_{mis}, \psi) = P(M | \psi), \quad \forall \psi.

 The observations are said to be Missing at Random (MAR) if the probability of an observation to be missing only depends on the observed values. Formally,

 .. math::
     P(M | X_{obs}, X_{mis}, \psi) = P(M | X_{obs}, \psi), \quad \forall \psi, X_{mis}.

-Finally, the observations are said to be Missing Not at Random (MNAR) in all other cases, i.e. if P(M | X_{obs}, X_{mis}, \psi) does not simplify.
+Finally, the observations are said to be Missing Not at Random (MNAR) in all other cases, i.e. if :math:`P(M | X_{obs}, X_{mis}, \psi)` does not simplify.

 Qolmat allows generating new missing values on an existing dataset, but only in the MCAR case.

@@ -140,4 +140,7 @@ Qolmat can be used to search for hyperparameters in imputation functions. Let sa

 References
 ----------
-[1] Rubin, Donald B. `Inference and missing data. <https://www.math.wsu.edu/faculty/xchen/stat115/lectureNotes3/Rubin%20Inference%20and%20Missing%20Data.pdf>`_ Biometrika 63.3 (1976): 581-592.
+
+.. _rubin-article:
+
+[1] Rubin, Donald B. `Inference and missing data. <https://www.math.wsu.edu/faculty/xchen/stat115/lectureNotes3/Rubin%20Inference%20and%20Missing%20Data.pdf>`_ Biometrika 63.3 (1976): 581-592.
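
To make the MCAR setting above concrete, here is a minimal sketch of hole generation with Qolmat's benchmark module; `UniformHoleGenerator` and its `split` method returning a list of boolean masks follow `qolmat.benchmark.missing_patterns` at this release, and are assumptions otherwise.

.. code:: python

    import numpy as np
    import pandas as pd

    from qolmat.benchmark import missing_patterns

    # Complete toy dataset
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x", "y"])

    # MCAR masking: every entry is hidden with the same probability,
    # independently of X_obs and X_mis, i.e. P(M | X_obs, X_mis, psi) = P(M | psi)
    generator = missing_patterns.UniformHoleGenerator(n_splits=2, ratio_masked=0.1)
    for df_mask in generator.split(df):
        df_incomplete = df.where(~df_mask)   # NaN where the mask is True
        print(df_incomplete.isna().mean())   # close to 10% holes per column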

docs/imputers.rst

Lines changed: 2 additions & 2 deletions
@@ -98,14 +98,14 @@ Two parametric distributions are implemented:
 9. TabDDPM
 -----------

-:class:`qolmat.diffusions.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [7] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:
+:class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [7] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:

 * The forward process perturbs the observed data with noise until all the original data structure is lost. The perturbation is done over a series of steps. Let :math:`X_{obs}` be the observed data and :math:`T` be the number of steps over which noise :math:`\epsilon \sim \mathcal{N}(0,I)` is added to the observed data, so that :math:`X_{obs}^t = \sqrt{\bar{\alpha}_t} \times X_{obs} + \sqrt{1-\bar{\alpha}_t} \times \epsilon`, where :math:`\bar{\alpha}_t` controls the amount of noise.
 * The reverse process removes noise and reconstructs the observed data. At each step :math:`t`, we train an autoencoder :math:`\epsilon_\theta` based on ResNet [9] to predict the added noise :math:`\epsilon_t` based on the rest of the observed data. The objective function is the error between the noise added in the forward process and the noise predicted by :math:`\epsilon_\theta`.

 In the training phase, we use the self-supervised learning method of [8] to train on incomplete data. In detail, our model randomly masks a part of the observed data and computes the loss on these masked data. Moving on to the inference phase, (1) missing data are replaced by Gaussian noise :math:`\epsilon \sim \mathcal{N}(0,I)`, and (2) at each noise step from :math:`T` to 0, our model denoises these missing data based on :math:`\epsilon_\theta`.

-In the case of time-series data, we also propose :class:`qolmat.diffusions.TabDDPMTS` (built on top of :class:`qolmat.diffusions.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed using a sliding-window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [8].
+In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed using a sliding-window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [8].

 References
 ----------
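
The diffusion classes referenced here are exposed behind a scikit-learn style imputer. A minimal sketch, assuming the constructor parameters shown in the `ImputerDiffusion` and `TabDDPM` docstrings added in this PR (`num_sampling`, `epochs`, `batch_size` are assumptions on other versions); the `pytorch` extra must be installed.

.. code:: python

    import numpy as np
    import pandas as pd

    from qolmat.imputations.imputers_pytorch import ImputerDiffusion
    from qolmat.imputations.diffusions.ddpms import TabDDPM

    # Toy tabular data with roughly 15% missing entries
    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
    df = df.mask(rng.random(df.shape) < 0.15)

    # TabDDPM is trained with the self-supervised masking described above;
    # at inference, the missing cells start from Gaussian noise and are
    # denoised over the T steps with the learned network
    imputer = ImputerDiffusion(model=TabDDPM(num_sampling=5), epochs=100, batch_size=100)
    df_imputed = imputer.fit_transform(df)
    print(df_imputed.isna().sum())  # all holes filled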
