HISTORY.rst: 12 additions & 2 deletions
@@ -2,13 +2,23 @@
History
=======

- 0.0.16 (2023-??-??)
+ 0.1.1 (2023-??-??)
+ -------------------
+
+ * Hotfix reference to tensorflow in the documentation, where it should be pytorch
+
+ 0.1.0 (2023-10-11)
-------------------

* VAR(p) EM sampler implemented, based on a VAR(p) model such as the one described in `Lütkepohl (2005) New Introduction to Multiple Time Series Analysis`
* EM and RPCA matrices transposed in the low-level implementation; however, the API remains unchanged
- * Sparse matrices introduced in the RPCA impletation so as to speed up the execution
+ * Sparse matrices introduced in the RPCA implementation so as to speed up the execution
+ * Implementation of SoftImpute, which provides a fast but less robust alternative to RPCA
+ * Implementation of TabDDPM and TsDDPM, which are diffusion-based models for tabular data and time-series data, based on Denoising Diffusion Probabilistic Models. Their implementations follow the work of Tashiro et al. (2021) and Kotelnikov et al. (2023).
+ * ImputerDiffusion is an imputer wrapper around these two models, TabDDPM and TsDDPM.
* Docstrings and tests improved for the EM sampler
The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.

**How does Qolmat work?**

- Qolmat simplifies the selection of a data imputation algorithm by comparing various methods based on different evaluation metrics.
- It is compatible with scikit-learn.
- Evaluation and comparison are based on the standard approach of selecting some observations, setting their status to missing, and comparing
- their imputation with their true values.
+ Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing the three steps pictured below:
+ 1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified `hole generator <explanation.html#hole-generator>`_,
+ 2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user-specified `performance metrics <explanation.html#metrics>`_.
+ 3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.

- More specifically, from the initial dataframe with missing values, we generate additional missing values (N samples).
- On each sample, different imputation models are tested and reconstruction errors are computed on these artificially missing entries. The errors of each imputation model are then averaged, eventually yielding a single error score per model. This procedure allows the comparison of different models on the same dataset.
+ This is very similar in spirit to the `cross_val_score <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>`_ function from scikit-learn.
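
As an illustration of these three steps, here is a minimal benchmarking sketch. It assumes the benchmark API described in the documentation (``Comparator``, a hole generator from ``qolmat.benchmark.missing_patterns``, imputers from ``qolmat.imputations.imputers``); the exact class names, arguments, and metric identifiers are assumptions and may differ between versions.

.. code-block:: python

    # Illustrative sketch only: class names, arguments and metric identifiers
    # are assumptions based on the Qolmat documentation and may vary by version.
    import numpy as np
    import pandas as pd

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers

    # Toy dataframe with roughly 10% missing values
    rng = np.random.default_rng(42)
    df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
    df = df.mask(rng.uniform(size=df.shape) < 0.1)

    # Imputation methods to compare (step 2)
    dict_imputers = {
        "median": imputers.ImputerMedian(),
        "rpca": imputers.ImputerRPCA(),
    }

    # Hole generator that masks observed values on each of the K folds (step 1)
    generator_holes = missing_patterns.UniformHoleGenerator(n_splits=3, ratio_masked=0.1)

    # Metrics are computed per fold and per imputer, then pooled (steps 2 and 3)
    comparison = comparator.Comparator(
        dict_imputers,
        df.columns.tolist(),
        generator_holes=generator_holes,
        metrics=["mae", "wmape"],
    )
    results = comparison.compare(df)
    print(results)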
- The following table contains the available imputation methods:
+ The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).

.. list-table::
-    :widths: 25 70 15 15 20
+    :widths: 25 70 15 15
   :header-rows: 1

   * - Method
     - Description
-      - Tabular
-      - Time series
-      - Minimised criterion
+      - Tabular or Time series
+      - Single or Multiple
   * - mean
     - Imputes the missing values using the mean along each column
-      - yes
-      - no
-      - point
+      - tabular
+      - single
   * - median
     - Imputes the missing values using the median along each column
-      - yes
-      - no
-      - point
+      - tabular
+      - single
   * - LOCF
     - Imputes missing entries by carrying the last observation forward for each column
-      - yes
-      - yes
-      - point
+      - time series
+      - single
   * - shuffle
     - Imputes missing entries with random values drawn from each column
-      - yes
-      - no
-      - point
+      - tabular
+      - multiple
   * - interpolation
     - Imputes missing values using interpolation strategies supported by pd.Series.interpolate
-      - yes
-      - yes
-      - point
+      - time series
+      - single
   * - impute on residuals
     - The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised
-      - no
-      - yes
-      - point
+      - time series
+      - single
   * - MICE
     - Multiple Imputation by Chained Equations
-      - yes
-      - no
-      - point
+      - tabular
+      - both
   * - RPCA
     - Robust Principal Component Analysis
-      - yes
-      - yes
-      - point
+      - both
+      - single
   * - SoftImpute
     - Iterative method for matrix completion that uses nuclear-norm regularization
-      - yes
-      - no
-      - point
+      - tabular
+      - single
   * - KNN
     - K-nearest neighbors
-      - yes
-      - no
-      - point
+      - tabular
+      - single
   * - EM sampler
     - Imputes missing values via an EM algorithm
-      - yes
-      - yes
-      - point/distribution
+      - both
+      - both
+   * - MLP
+     - Imputer based on a Multi-Layer Perceptron model
+     - both
+     - both
+   * - Autoencoder
+     - Imputer based on a Variational Autoencoder model
+     - both
+     - both
   * - TabDDPM
     - Imputer based on Denoising Diffusion Probabilistic Models
-      - yes
-      - yes
-      - distribution
+      - both
+      - both
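
Each method in the table is exposed as a scikit-learn compatible imputer that can also be used on its own, outside of the benchmark. The snippet below is a minimal sketch; the class name and the default constructor arguments are assumptions taken from the current documentation and may differ between versions.

.. code-block:: python

    # Minimal sketch: the class name and default arguments are assumptions and
    # may differ between Qolmat versions.
    import numpy as np
    import pandas as pd

    from qolmat.imputations import imputers

    df = pd.DataFrame(
        {
            "temperature": [12.1, np.nan, 13.4, 14.0, np.nan, 15.2],
            "load": [1.0, 1.2, np.nan, 1.4, 1.5, np.nan],
        }
    )

    # "EM sampler" row of the table: handles both tabular and time-series data
    imputer = imputers.ImputerEM()
    df_imputed = imputer.fit_transform(df)
    print(df_imputed)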
@@ -230,8 +209,6 @@ Qolmat has been developed by Quantmetry.
🔍 References
==============

- Qolmat methods belong to the field of conformal inference.
-
[1] Candès, Emmanuel J., et al. “Robust principal component analysis?.”
Journal of the ACM (JACM) 58.3 (2011): 1-37,
(`pdf <https://arxiv.org/abs/0912.3599>`__)
@@ -242,15 +219,13 @@ Journal of advanced transportation 2018 (2018).