
Commit 1c20597

grammarly
1 parent 0b0c536 commit 1c20597

16 files changed: +120 −114 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 /.quarto/
 dev/artifacts/upload/
+dev/artifacts/download/

paper/paper.pdf

2.07 KB
Binary file not shown.

paper/paper.tex

Lines changed: 56 additions & 56 deletions
Large diffs are not rendered by default.

paper/sections/conclusion.rmd

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 # Concluding Remarks {#conclusion}

-This work has revisited and extended some of the most general and defining concepts underlying the literature on Counterfactual Explanations and, in particular, Algorithmic Recourse. We demonstrate that long-held beliefs as to what defines optimality in AR, may not always be suitable. Specifically, we run experiments that simulate the application of recourse in practice using various state-of-the-art counterfactual generators and find that all of them induce substantial domain and model shifts. We argue that these shifts should be considered as an expected external cost of individual recourse and call for a paradigm shift from individual to collective recourse in these types of situations. By proposing an adapted counterfactual search objective that incorporates this cost, we make that paradigm shift explicit. We show that this modified objective lends itself to mitigation strategies that can be used to effectively decrease the magnitude of induced domain and model shifts. Through our work we hope to inspire future research on this important topic. To this end we have open-sourced all of our code along with a Julia package: [`AlgorithmicRecourseDynamics.jl`](https://anonymous.4open.science/r/AlgorithmicRecourseDynamics/README.md). Future researchers should find it relatively easy to replicate, modify and extend the simulation experiments presented here and apply to their own custom counterfactual generators.
+This work has revisited and extended some of the most general and defining concepts underlying the literature on Counterfactual Explanations and, in particular, Algorithmic Recourse. We demonstrate that long-held beliefs as to what defines optimality in AR may not always be suitable. Specifically, we run experiments that simulate the application of recourse in practice using various state-of-the-art counterfactual generators and find that all of them induce substantial domain and model shifts. We argue that these shifts should be considered as an expected external cost of individual recourse and call for a paradigm shift from individual to collective recourse in these types of situations. By proposing an adapted counterfactual search objective that incorporates this cost, we make that paradigm shift explicit. We show that this modified objective lends itself to mitigation strategies that can be used to effectively decrease the magnitude of induced domain and model shifts. Through our work, we hope to inspire future research on this important topic. To this end, we have open-sourced all of our code along with a Julia package: [`AlgorithmicRecourseDynamics.jl`](https://anonymous.4open.science/r/AlgorithmicRecourseDynamics/README.md). Future researchers should find it relatively easy to replicate, modify and extend the simulation experiments presented here and apply them to their own custom counterfactual generators.

paper/sections/discussion.rmd

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 # Discussion {#discussion}

-Our results in Section \@ref(empirical-2) indicate that state-of-the-art approaches to Algorithmic Recourse induce substantial domain and model shift, if implemented at scale in practice. These induced shifts can and should be considered as an (expected) external cost of individual recourse. While they do not affect the individual directly as long as we look at the individual in isolation, they can been seen to affect the broader group of stakeholders in automated data-driven decision-making. We have seen, for example, that out-of-sample model performance generally deteriorates in our simulation experiments. In practice, this can be seen as a cost to model owners, that is the group of stakeholders using the model as decision-making tool. As we have set out in Example \@ref(exm:student) of our introduction, these model owners may be unwilling to carry that cost, and hence can be expected to stop offering recourse to individuals altogether. This in turn is costly to those individuals that would otherwise derive utility from being offered recourse.
+Our results in Section \@ref(empirical-2) indicate that state-of-the-art approaches to Algorithmic Recourse induce substantial domain and model shift if implemented at scale in practice. These induced shifts can and should be considered as an (expected) external cost of individual recourse. While they do not affect the individual directly as long as we look at the individual in isolation, they can be seen to affect the broader group of stakeholders in automated data-driven decision-making. We have seen, for example, that out-of-sample model performance generally deteriorates in our simulation experiments. In practice, this can be seen as a cost to model owners, that is, the group of stakeholders using the model as a decision-making tool. As we have set out in Example \@ref(exm:student) of our introduction, these model owners may be unwilling to carry that cost, and hence can be expected to stop offering recourse to individuals altogether. This in turn is costly to those individuals who would otherwise derive utility from being offered recourse.

-So, where does this leave us? We would argue that the expected external costs of individual recourse should be shared by all stakeholders. The most straightforward way to achieve this is to introduce a penalty for external costs in the counterfactual search objective function, as we have set out in Equation \@ref(eq:collective). This will on average lead to more costly counterfactual outcomes, but may help to avoid extreme scenarios, in which minimal-cost recourse is reserved to a tiny minority of individuals. We have shown various types of shift-mitigating strategies that can be used to this end. Since all of these strategies can be seen simply as specific adaption of Equation \@ref(eq:collective), they can be applied to any of the various counterfactual generators studied here.
+So, where does this leave us? We would argue that the expected external costs of individual recourse should be shared by all stakeholders. The most straightforward way to achieve this is to introduce a penalty for external costs in the counterfactual search objective function, as we have set out in Equation \@ref(eq:collective). This will on average lead to more costly counterfactual outcomes, but may help to avoid extreme scenarios in which minimal-cost recourse is reserved for a tiny minority of individuals. We have shown various types of shift-mitigating strategies that can be used to this end. Since all of these strategies can be seen simply as specific adaptations of Equation \@ref(eq:collective), they can be applied to any of the various counterfactual generators studied here.
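
Editor's note: the paper's Equation \@ref(eq:collective) is not reproduced in this diff. As a rough, hedged sketch of the kind of objective the paragraph above describes, one could augment the standard counterfactual search objective (validity plus cost to the individual) with a penalty on the expected external cost. The penalty weights $\lambda_1$, $\lambda_2$ and the symbol $\operatorname{extcost}$ below are illustrative assumptions, not the paper's actual notation:

```latex
% Hedged sketch only: the actual Equation (eq:collective) is not shown in this
% diff, so the penalty term and symbol names here are assumptions.
\begin{equation}
  x^\prime = \arg\min_{x^\prime}
    \underbrace{\ell\big(M(x^\prime), y^*\big)}_{\text{validity}}
    + \lambda_1 \underbrace{\operatorname{cost}(x, x^\prime)}_{\text{cost to the individual}}
    + \lambda_2 \underbrace{\operatorname{extcost}(x^\prime)}_{\text{expected external cost}}
\end{equation}
```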

paper/sections/empirical.rmd

Lines changed: 5 additions & 5 deletions
@@ -1,12 +1,12 @@
 # Experiment Setup {#empirical}

-This section presents the exact ingredients and parameter choices describing the simulation experiments we ran to produce the findings presented in the next section (\@ref(empirical-2)). For convenience, we use Algorithm \ref{algo-experiment} as a template to guide us through this section. A few high-level details upfront: each experiment is run for a total of $T=50$ rounds, where in each round we provide recourse to five percent of all individuals in the non-target class, so $B_t=0.05 * N_t^{\mathcal{D}_0}$^[As mentioned in the previous section, we end up providing recourse to a total of $\approx50\%$ by the end of round $T=50$.]. All classifiers and generative models are retrained for 10 epochs in each round $t$ of the experiment. Rather than retraining models from scratch, we initialize all parameters at their previous levels ($t-1$) and compute backpropagate for 10 epochs using the new training data as inputs into the existing model. Evaluation metrics are computed and stored every 10 rounds. To account for noise, each individual experiment is repeated five times.^[In the current implementation, we use the train-test split each time to only account for stochasticity associated with randomly selecting individuals for recourse. An interesting alternative may be to also perform data splitting each time, thereby adding an additional layer of randomness.]
+This section presents the exact ingredients and parameter choices describing the simulation experiments we ran to produce the findings presented in the next section (\@ref(empirical-2)). For convenience, we use Algorithm \ref{algo-experiment} as a template to guide us through this section. A few high-level details upfront: each experiment is run for a total of $T=50$ rounds, where in each round we provide recourse to five per cent of all individuals in the non-target class, so $B_t=0.05 * N_t^{\mathcal{D}_0}$^[As mentioned in the previous section, we end up providing recourse to a total of $\approx50\%$ by the end of round $T=50$.]. All classifiers and generative models are retrained for 10 epochs in each round $t$ of the experiment. Rather than retraining models from scratch, we initialize all parameters at their previous levels ($t-1$) and backpropagate for 10 epochs using the new training data as inputs into the existing model. Evaluation metrics are computed and stored every 10 rounds. To account for noise, each individual experiment is repeated five times.^[In the current implementation, we use the same train-test split each time to only account for stochasticity associated with randomly selecting individuals for recourse. An interesting alternative may be to also perform data splitting each time, thereby adding an additional layer of randomness.]

 ## $M$---Classifiers and Generative Models {#empirical-classifiers}

 For each dataset and generator, we look at three different types of classifiers, all of them built and trained using `Flux.jl` [@innes2018fashionable]: firstly, a simple linear classifier---**Logistic Regression**---implemented as a single linear layer with sigmoid activation; secondly, a multilayer perceptron (**MLP**); and finally, a **Deep Ensemble** composed of five MLPs following @lakshminarayanan2016simple that serves as our only probabilistic classifier. We have chosen to work with deep ensembles both for their simplicity and effectiveness at modelling predictive uncertainty. They are also the model of choice in @schut2021generating. The network architectures are kept simple (top half of Table \@ref(tab:architecture)), since we are only marginally concerned with achieving good initial classifier performance.

-The Latent Space generator relies on a separate generative model. Following the authors of both REVISE and CLUE we use Variational Autoencoders (**VAE**) for this purpose. As with the classifiers, we deliberately choose to work with fairly simple architectures (bottom half of Table \@ref(tab:architecture)). More expressive generative models generally also lead to more meaningful counterfactuals produced by Latent Space generators. But in our view this should simply be considered as a vulnerability of counterfactual generators that rely on surrogate models to learn what realistic representations of the underlying data.
+The Latent Space generator relies on a separate generative model. Following the authors of both REVISE and CLUE, we use Variational Autoencoders (**VAE**) for this purpose. As with the classifiers, we deliberately choose to work with fairly simple architectures (bottom half of Table \@ref(tab:architecture)). More expressive generative models generally also lead to more meaningful counterfactuals produced by Latent Space generators. But in our view, this should simply be considered as a vulnerability of counterfactual generators that rely on surrogate models to learn realistic representations of the underlying data.

 ```{r architecture}
 tab <- data.frame(
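
Editor's note: to make the warm-start retraining described in the first changed paragraph of this file concrete, below is a minimal `Flux.jl` sketch of one way the experiment loop could look. It is not the actual `AlgorithmicRecourseDynamics.jl` implementation (which is not part of this diff); the recourse and evaluation steps are placeholder comments, and the toy data, architecture and helper names (`generate_recourse!`, `evaluate!`) are assumptions for illustration only.

```julia
# Hedged sketch of the simulation loop described above -- not the actual
# AlgorithmicRecourseDynamics.jl code. Toy data and helper names are assumed.
using Flux

X = rand(Float32, 2, 1_000)                    # toy features (2 × N)
y = reshape(Float32.(rand(0:1, 1_000)), 1, :)  # toy binary labels (1 × N)
data_loader = Flux.DataLoader((X, y); batchsize = 50)

# Simple MLP classifier; the architectures actually used are listed in Table (tab:architecture).
model = Chain(Dense(2 => 32, relu), Dense(32 => 1))
opt_state = Flux.setup(Adam(), model)
loss(m, xb, yb) = Flux.logitbinarycrossentropy(m(xb), yb)

T, n_epochs = 50, 10

for t in 1:T
    # 1) Provide recourse to 5% of the remaining non-target individuals and
    #    update the training data accordingly -- placeholder for the actual step.
    # generate_recourse!(X, y, model; budget = 0.05)

    # 2) Warm-start retraining: keep the parameters from round t-1 and simply
    #    continue training on the updated data for another 10 epochs.
    for epoch in 1:n_epochs
        for (xb, yb) in data_loader
            grads = Flux.gradient(m -> loss(m, xb, yb), model)
            Flux.update!(opt_state, model, grads[1])
        end
    end

    # 3) Compute and store evaluation metrics every 10 rounds -- placeholder.
    # t % 10 == 0 && evaluate!(results, model, X, y)
end
```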
@@ -55,13 +55,13 @@ We use four synthetic binary classification datasets consisting of 1000 samples
 knitr::include_graphics("www/synthetic_data.png")
 ```

-Ex-ante we expect to see that by construction, Wachter will create a new cluster of counterfactual instances in the proximity of the initial decision boundary as we saw in Figure \@ref(fig:poc). Thus, the choice of a black-box model may have an impact on the paths of the recourse. For generators that use latent space search (REVISE @joshi2019realistic, CLUE @antoran2020getting) or rely on (and have access to) probabilistic models (CLUE @antoran2020getting, Greedy @schut2021generating) we expect that counterfactuals will end up in regions of the target domain that are densely populated by training samples. Of course, this expectation hinges on how effective said probabilistic models are at capturing predictive uncertainty. Finally, we expect to see the counterfactuals generated by DiCE to be uniformly spread around the feature space inside the target class^[As we mentioned earlier, the diversity constraint used by DiCE is only effective for when at least two counterfactuals are being generated. We have therefore decided to always generate 5 counterfactuals for each generator and randomly pick one of them.]. In summary, we expect that the endogenous shifts induced by Wachter outsize those of all other generators, since Wachter is not explicitly concerned with generating what we have defined as meaningful counterfactuals.
+Ex-ante, we expect that, by construction, Wachter will create a new cluster of counterfactual instances in the proximity of the initial decision boundary, as we saw in Figure \@ref(fig:poc). Thus, the choice of a black-box model may have an impact on the resulting recourse paths. For generators that use latent space search (REVISE @joshi2019realistic, CLUE @antoran2020getting) or rely on (and have access to) probabilistic models (CLUE @antoran2020getting, Greedy @schut2021generating), we expect that counterfactuals will end up in regions of the target domain that are densely populated by training samples. Of course, this expectation hinges on how effective said probabilistic models are at capturing predictive uncertainty. Finally, we expect the counterfactuals generated by DiCE to be uniformly spread around the feature space inside the target class^[As we mentioned earlier, the diversity constraint used by DiCE is only effective when at least two counterfactuals are being generated. We have therefore decided to always generate 5 counterfactuals for each generator and randomly pick one of them.]. In summary, we expect the endogenous shifts induced by Wachter to exceed those of all other generators, since Wachter is not explicitly concerned with generating what we have defined as meaningful counterfactuals.

 ### Real-world data

-We use three different real-world datasets from the Finance and Economics domain, all of which are tabular and can be used for binary classification. Firstly, we use the **Give Me Some Credit** dataset which was open-sourced on Kaggle for the task to predict whether a borrower is likely to experience financial difficulties in the next two years [@kaggle2011give], originally consisting of 250,000 instances with 11 numerical attributes. Secondly, we use the **UCI defaultCredit** dataset [@yeh2009comparisons], a benchmark dataset that can be used to train binary classifiers to predict the binary outcome variable whether credit card clients default on their payment. In its raw form it consists of 23 explanatory variables: 4 categorical features relating to demographic attributes^[These have been omitted from the analysis. See Section \@ref(limit-data) for details.] and 19 continuous features largely relating to individuals' payment histories and amount of credit outstanding. Both datasets have been used in the literature on AR before (see for example @pawelczyk2021carla, @joshi2019realistic and @ustun2019actionable), presumably because they constitute real-world classification tasks involving individuals that compete for access to credit.
+We use three different real-world datasets from the Finance and Economics domain, all of which are tabular and can be used for binary classification. Firstly, we use the **Give Me Some Credit** dataset, which was open-sourced on Kaggle for the task of predicting whether a borrower is likely to experience financial difficulties in the next two years [@kaggle2011give], originally consisting of 250,000 instances with 11 numerical attributes. Secondly, we use the **UCI defaultCredit** dataset [@yeh2009comparisons], a benchmark dataset that can be used to train binary classifiers to predict the binary outcome variable of whether credit card clients default on their payment. In its raw form, it consists of 23 explanatory variables: 4 categorical features relating to demographic attributes^[These have been omitted from the analysis. See Section \@ref(limit-data) for details.] and 19 continuous features largely relating to individuals' payment histories and amount of credit outstanding. Both datasets have been used in the literature on AR before (see for example @pawelczyk2021carla, @joshi2019realistic and @ustun2019actionable), presumably because they constitute real-world classification tasks involving individuals who compete for access to credit.

-As a third dataset we include the **California Housing** dataset derived from the 1990 U.S. census and sourced through scikit-learn [@pedregosa2011scikitlearn, @pace1997sparse]. It consists of 8 continuous features that can be used to predict the median house price for California districts. The continuous outcome variable is binarized as $\tilde{y}=\mathbb{I}_{y>\text{median}(Y)}$ indicating whether or not the median house price of a given district is above or below the median of all districts. While we have not seen this dataset used in the previous literature on AR, others have used the Boston Housing dataset in a similar fashion @schut2021generating. While we initially also conducted experiments on that dataset, we eventually discarded this dataset due to surrounding ethical concerns [@carlisle2019racist].
+As a third dataset, we include the **California Housing** dataset derived from the 1990 U.S. census and sourced through scikit-learn [@pedregosa2011scikitlearn; @pace1997sparse]. It consists of 8 continuous features that can be used to predict the median house price for California districts. The continuous outcome variable is binarized as $\tilde{y}=\mathbb{I}_{y>\text{median}(Y)}$, indicating whether the median house price of a given district is above the median of all districts. While we have not seen this dataset used in the previous literature on AR, others have used the Boston Housing dataset in a similar fashion [@schut2021generating]. While we initially also conducted experiments on that dataset, we eventually discarded it due to the surrounding ethical concerns [@carlisle2019racist].

 Since the simulations involve generating counterfactuals for a significant proportion of the entire sample of individuals, we have randomly undersampled each dataset to yield balanced subsamples consisting of 5,000 individuals each. We have also standardized all explanatory features since our chosen classifiers are sensitive to scale.
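
Editor's note: as a minimal illustration of the preprocessing steps described in the last two paragraphs above (median-threshold binarization of the California Housing target, feature standardization, and balanced undersampling to 5,000 individuals), the following sketch runs on toy stand-in data. It is not the project's actual data pipeline, and the array shapes are assumptions.

```julia
# Hedged sketch of the preprocessing described above, on toy stand-in data.
using Random, Statistics

X = rand(Float32, 8, 20_000)   # stand-in for the 8 continuous features (d × N)
y = rand(Float32, 20_000)      # stand-in for median house prices

# Binarize the outcome at the median: ỹ = 1{y > median(Y)}.
y_bin = y .> median(y)

# Standardize each feature, since the classifiers are sensitive to scale.
X_std = (X .- mean(X; dims = 2)) ./ std(X; dims = 2)

# Balanced undersampling to 5,000 individuals (2,500 per class).
idx_pos = shuffle(findall(y_bin))[1:2_500]
idx_neg = shuffle(findall(.!y_bin))[1:2_500]
idx = shuffle(vcat(idx_pos, idx_neg))
X_sub, y_sub = X_std[:, idx], y_bin[idx]
```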
