scikit-learn-contrib
diff --git a/‎.github/workflows/test.yml‎
Lines changed: 5 additions & 3 deletions b/‎.github/workflows/test.yml‎
Lines changed: 5 additions & 3 deletions
diff --git a/‎.readthedocs.yml‎
Lines changed: 3 additions & 2 deletions b/‎.readthedocs.yml‎
Lines changed: 3 additions & 2 deletions
diff --git a/‎AUTHORS.rst‎
Lines changed: 7 additions & 9 deletions b/‎AUTHORS.rst‎
Lines changed: 7 additions & 9 deletions
diff --git a/‎CONTRIBUTING.rst‎
Lines changed: 1 addition & 1 deletion b/‎CONTRIBUTING.rst‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎HISTORY.rst‎
Lines changed: 12 additions & 2 deletions b/‎HISTORY.rst‎
Lines changed: 12 additions & 2 deletions
diff --git a/‎Makefile‎
Lines changed: 13 additions & 0 deletions b/‎Makefile‎
Lines changed: 13 additions & 0 deletions
diff --git a/‎README.rst‎
Lines changed: 68 additions & 93 deletions b/‎README.rst‎
Lines changed: 68 additions & 93 deletions
diff --git a/‎docs/conf.py‎
Lines changed: 5 additions & 5 deletions b/‎docs/conf.py‎
Lines changed: 5 additions & 5 deletions
@@ -27,17 +27,19 @@ jobs:
         with:
           python-version: ${{matrix.python-version}}
           environment-file: environment.ci.yml
-          channels: default, conda-forge
       - name: Lint with flake8
         run: |
           conda install flake8
           flake8
       - name: Test with pytest
         run: |
           conda install pytest
-          pytest
-          echo you should uncomment pytest and delete this line
+          make coverage
       - name: typing with mypy
         run: |
           mypy qolmat
           echo you should uncomment mypy qolmat and delete this line
+      - name: Upload coverage reports to Codecov
+        uses: codecov/codecov-action@v3
+        env:
+          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
@@ -1,10 +1,11 @@
 version: 2
 
 build:
-  image: latest
+  os: "ubuntu-22.04"
+  tools:
+    python: "mambaforge-22.9"
 
 python:
-  version: 3.8
   install:
     - method: pip
       path: .
 
@@ -2,22 +2,20 @@
 Credits
 =======
 
-Development Lead
+Development Team
 ----------------
 
 * Julien Roussel <jroussel@quantmetry.com>
-
-Maintainers
-------------
-
-* Mikail Duran <mduran@quantmetry.com>
 * Anh Khoa Ngo Ho <angoho@quantmetry.com>
+* Charles-Henri Prat <chprat@quantmetry.com>
 * Guillaume Saës <gsaes@quantmetry.com>
 
-Contributors
-------------
+Past Contributors
+-----------------
 
 * Hong-Lan Botterman
+* Nicolas Brunel
 * Firas Dakhli
+* Mikaïl Duran
 * Rima Hajou
-* Vianey Taquet
+* Thomas Morzadec
@@ -32,7 +32,7 @@ You can create a virtual environment via `conda`:
     $ conda env create -f environment.dev.yml
     $ conda activate env_qolmat_dev
 
-If you need to use tensorflow, enter the command:
+If you need to use pytorch, enter the command:
 
 .. code:: sh
 
 
@@ -2,13 +2,23 @@
 History
 =======
 
-0.0.16 (2023-??-??)
+0.1.1 (2023-??-??)
+-------------------
+
+* Hotfix reference to tensorflow in the documentation, when it should be pytorch
+
+0.1.0 (2023-10-11)
 -------------------
 
 * VAR(p) EM sampler implemented, founding on a VAR(p) modelization such as the one described in `Lütkepohl (2005) New Introduction to Multiple Time Series Analysis`
 * EM and RPCA matrices transposed in the low-level impelmentation, however the API remains unchanged
-* Sparse matrices introduced in the RPCA impletation so as to speed up the execution
+* Sparse matrices introduced in the RPCA implementation so as to speed up the execution
+* Implementation of SoftImpute, which provides a fast but less robust alterantive to RPCA
+* Implementation of TabDDPM and TsDDPM, which are diffusion-based models for tabular data and time-series data, based on Denoising Diffusion Probabilistic Models. Their implementations follow the work of Tashiro et al., (2021) and Kotelnikov et al., (2023).
+* ImputerDiffusion is an imputer-wrapper of these two models TabDDPM and TsDDPM.
 * Docstrings and tests improved for the EM sampler
+* Fix ImputerPytorch
+* Update Benchmark Deep Learning
 
 0.0.15 (2023-08-03)
 -------------------
 
@@ -0,0 +1,13 @@
+coverage:
+	pytest --cov-branch --cov=qolmat --cov-report=xml
+
+doctest:
+	pytest --doctest-modules --pyargs qolmat
+
+doc:
+	make html -C docs
+
+clean:
+	rm -rf .mypy_cache .pytest_cache .coverage*
+	rm -rf **__pycache__
+	make clean -C docs
@@ -1,6 +1,6 @@
 .. -*- mode: rst -*-
 
-|GitHubActions|_ |ReadTheDocs|_ |License|_ |PythonVersion|_ |PyPi|_ |Release|_ |Commits|_
+|GitHubActions|_ |ReadTheDocs|_ |License|_ |PythonVersion|_ |PyPi|_ |Release|_ |Commits|_ |Codecov|_
 
 .. |GitHubActions| image:: https://github.com/Quantmetry/qolmat/actions/workflows/test.yml/badge.svg
 .. _GitHubActions: https://github.com/Quantmetry/qolmat/actions
@@ -23,6 +23,9 @@
 .. |Commits| image:: https://img.shields.io/github/commits-since/Quantmetry/qolmat/latest/main
 .. _Commits: https://github.com/Quantmetry/qolmat/commits/main
 
+.. |Codecov| image:: https://codecov.io/gh/quantmetry/qolmat/branch/master/graph/badge.svg
+.. _Codecov: https://codecov.io/gh/quantmetry/qolmat
+
 .. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png
     :align: center
 
@@ -44,7 +47,7 @@ Qolmat can be installed in different ways:
 .. code:: sh
 
     $ pip install qolmat  # installation via `pip`
-    $ pip install qolmat[tensorflow] # if you need tensforflow
+    $ pip install qolmat[pytorch] # if you need pytorch
     $ pip install git+https://github.com/Quantmetry/qolmat  # or directly from the github repository
 
 ⚡️ Quickstart
@@ -64,146 +67,122 @@ With just these few lines of code, you can see how easy it is to
 
   from qolmat.benchmark import comparator, missing_patterns
   from qolmat.imputations import imputers
-  from qolmat.utils.data import add_holes
+  from qolmat.utils import data
+
+  # load and prepare csv data
 
-  # create time series with missing values
-  np.random.seed(42)
-  t = np.linspace(0,1,1000)
-  y = np.cos(2*np.pi*t*10)+np.random.randn(1000)/2
-  df = pd.DataFrame({'y': y}, index=pd.Series(t, name='index'))
-  df_with_nan = add_holes(df, ratio_masked=0.1, mean_size=20)
+  df_data = data.get_data("Beijing")
+  columns = ["TEMP", "PRES", "WSPM"]
+  df_data = df_data[columns]
+  df_with_nan = data.add_holes(df_data, ratio_masked=0.2, mean_size=120)
 
   # impute and compare
-  imputer_mean = imputers.ImputerMean()
-  imputer_interpol = imputers.ImputerInterpolation(method="linear")
-  imputer_var1 = imputers.ImputerEM(model="VAR", method="mle", max_iter_em=100, n_iter_ou=15, dt=1e-3, p=1)
+  imputer_mean = imputers.ImputerMean(groups=("station",))
+  imputer_interpol = imputers.ImputerInterpolation(method="linear", groups=("station",))
+  imputer_var1 = imputers.ImputerEM(model="VAR", groups=("station",), method="mle", max_iter_em=50, n_iter_ou=15, dt=1e-3, p=1)
   dict_imputers = {
-          "mean": imputer_mean,
-          "interpolation": imputer_interpol,
-          "var1": imputer_var1
-      }
+        "mean": imputer_mean,
+        "interpolation": imputer_interpol,
+        "VAR(1) process": imputer_var1
+    }
   generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
   comparison = comparator.Comparator(
-          dict_imputers,
-          ['y'],
-          generator_holes = generator_holes,
-          metrics = ["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
-      )
+        dict_imputers,
+        columns,
+        generator_holes = generator_holes,
+        metrics = ["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
+    )
   results = comparison.compare(df_with_nan)
-  results.style.highlight_min(color="lime", axis=1)
+  results.style.highlight_min(color="lightsteelblue", axis=1)
 
 .. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_tabular_comparison.png
     :align: center
 
-.. code-block:: python
-
-  import matplotlib.pyplot as plt
-  # visualise
-  dfs_imputed = {name: imp.fit_transform(df_with_nan) for name, imp in dict_imputers.items()}
-  plt.figure(figsize=(13,3))
-  for (name, df_imputed), color in zip(dfs_imputed.items(), ["tab:green", "tab:blue", "tab:red"]):
-      plt.plot(df_imputed, ".", c=color, label=name)
-  plt.plot(df_with_nan, ".", c="k", label="original")
-  plt.legend()
-  plt.grid()
-  plt.ylabel("values")
-  plt.show()
-
-.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_imputation_plot.png
-    :align: center
-
-
 📘 Documentation
 ================
 
 The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.
 
 **How does Qolmat work ?**
 
-Qolmat simplifies the selection process of a data imputation algorithm. It does so by comparing of various methods based on different evaluation metrics.
-It is compatible with scikit-learn.
-Evaluation and comparison are based on the standard approach to select some observations, set their status to missing, and compare
-their imputation with their true values.
+Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:
+1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified `hole generator <explanation.html#hole-generator>`_,
+2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user specified `performance metrics <explanation.html#metrics>`_.
+3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
 
-More specifically, from the initial dataframe with missing value, we generate additional missing values (N samples).
-On each sample, different imputation models are tested and reconstruction errors are computed on these artificially missing entries. Then the errors of each imputation model are averaged and we eventually obtained a unique error score per model. This procedure allows the comparison of different models on the same dataset.
+This is very similar in spirit to the `cross_val_score <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>`_ function for scikit-learn.
 
 .. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/schema_qolmat.png
     :align: center
 
 **Imputation methods**
 
-The following table contains the available imputation methods:
+The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
 
 .. list-table::
-   :widths: 25 70 15 15 20
+   :widths: 25 70 15 15
    :header-rows: 1
 
    * - Method
      - Description
-     - Tabular
-     - Time series
-     - Minimised criterion
+     - Tabular or Time series
+     - Single or Multiple
    * - mean
      - Imputes the missing values using the mean along each column
-     - yes
-     - no
-     - point
+     - tabular
+     - single
    * - median
      - Imputes the missing values using the median along each column
-     - yes
-     - no
-     - point
+     - tabular
+     - single
    * - LOCF
      - Imputes missing entries by carrying the last observation forward for each columns
-     - yes
-     - yes
-     - point
+     - time series
+     - single
    * - shuffle
      - Imputes missing entries with the random value of each column
-     - yes
-     - no
-     - point
+     - tabular
+     - multiple
    * - interpolation
      - Imputes missing using some interpolation strategies supported by pd.Series.interpolate
-     - yes
-     - yes
-     - point
+     - time series
+     - single
    * - impute on residuals
      - The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised
-     - no
-     - yes
-     - point
+     - time series
+     - single
    * - MICE
      - Multiple Imputation by Chained Equation
-     - yes
-     - no
-     - point
+     - tabular
+     - both
    * - RPCA
      - Robust Principal Component Analysis
-     - yes
-     - yes
-     - point
+     - both
+     - single
    * - SoftImpute
      - Iterative method for matrix completion that uses nuclear-norm regularization
-     - yes
-     - no
-     - point
+     - tabular
+     - single
    * - KNN
      - K-nearest kneighbors
-     - yes
-     - no
-     - point
+     - tabular
+     - single
    * - EM sampler
      - Imputes missing values via EM algorithm
-     - yes
-     - yes
-     - point/distribution
+     - both
+     - both
+   * - MLP
+     - Imputer based Multi-Layers Perceptron Model
+     - both
+     - both
+   * - Autoencoder
+     - Imputer based Autoencoder Model with Variationel method
+     - both
+     - both
    * - TabDDPM
      - Imputer based on Denoising Diffusion Probabilistic Models
-     - yes
-     - yes
-     - distribution
+     - both
+     - both
 
 
 
@@ -230,8 +209,6 @@ Qolmat has been developed by Quantmetry.
 🔍  References
 ==============
 
-Qolmat methods belong to the field of conformal inference.
-
 [1] Candès, Emmanuel J., et al. “Robust principal component analysis?.”
 Journal of the ACM (JACM) 58.3 (2011): 1-37,
 (`pdf <https://arxiv.org/abs/0912.3599>`__)
@@ -242,15 +219,13 @@ Journal of advanced transportation 2018 (2018).
 (`pdf <https://www.hindawi.com/journals/jat/2018/7191549/>`__)
 
 [3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in
-robust PCA: Noise, outliers, and missing data.” arXiv preprint
-arXiv:2001.05484 (2020), (`pdf <https://arxiv.org/abs/2001.05484>`__)
+robust PCA: Noise, outliers, and missing data.” Annals of statistics, 49(5), 2948 (2021), (`pdf <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9491514/pdf/nihms-1782570.pdf>`__)
 
 [4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of
 Selected Topics in Signal Processing 10.4 (2016): 740-756.
 (`pdf <https://arxiv.org/abs/1507.08173>`__)
 
-[5] Jiashi Feng, et al. “Online robust pca via stochastic opti-
-mization.“ Advances in neural information processing systems, 26, 2013.
+[5] Jiashi Feng, et al. “Online robust pca via stochastic optimization.“ Advances in neural information processing systems, 26, 2013.
 (`pdf <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.721.7506&rep=rep1&type=pdf>`__)
 
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 
@@ -56,12 +56,12 @@
 from distutils.version import LooseVersion
 
 # pngmath / imgmath compatibility layer for different sphinx versions
-import sphinx
+# import sphinx
 
-if LooseVersion(sphinx.__version__) < LooseVersion("1.4"):
-    extensions.append("sphinx.ext.pngmath")
-else:
-    extensions.append("sphinx.ext.imgmath")
+# if LooseVersion(sphinx.__version__) < LooseVersion("1.4"):
+#     extensions.append("sphinx.ext.pngmath")
+# else:
+#     extensions.append("sphinx.ext.imgmath")
 
 autodoc_default_flags = ["members", "inherited-members"]