
Commit 0e02a93

Merge remote-tracking branch 'origin/master' into develop
2 parents: ea06082 + addecb3

30 files changed: +547 / -362 lines changed

.devcontainer/devcontainer.json

Lines changed: 17 additions & 0 deletions (new file)

```jsonc
{
    "name": "beyond-jupyter",
    "build": {
        "dockerfile": "../Dockerfile",
        "context": ".." // Set context to project root
    },
    "settings": {
        "terminal.integrated.shell.linux": "/bin/bash",
        "python.pythonPath": "/opt/conda/envs/pop/bin/python"
    },
    "extensions": [
        "ms-python.python"
    ],
    "forwardPorts": [],
    "postCreateCommand": "echo 'Container ready!'",
    "remoteUser": "root"
}
```

Dockerfile

Lines changed: 21 additions & 0 deletions (new file)

```dockerfile
FROM condaforge/mambaforge as builder

WORKDIR /beyond-jupyter
RUN apt-get update && apt-get install -y gcc python3-dev

# Copy the environment file
COPY environment.yml environment.yml
RUN mamba env create -f environment.yml

FROM condaforge/mambaforge

WORKDIR /beyond-jupyter

# Copy environment from the builder stage

# Set up the shell to activate the Conda environment by default
ARG CONDA_ENV=pop

COPY --from=builder /opt/conda/envs/$CONDA_ENV /opt/conda/envs/$CONDA_ENV
ENV PATH /opt/conda/envs/$CONDA_ENV/bin:$PATH
SHELL ["conda", "run", "-n", "$CONDA_ENV", "/bin/bash", "-c"]
```

anti-patterns/external-config/README.md

Lines changed: 7 additions & 8 deletions
```diff
@@ -15,7 +15,7 @@ This is particularly important if code is executed in different environments and
 In such cases, it wouldn't be practical to have to create differently parametrised versions of the software;
 it makes much more sense to make the software access exchangeable configuration instead.
 
-The valid use case for external configuration is thus a case where the configuration changes dynamically, i.e. different configurations are applied in order to adapt, in particular, to external conditions.
+Valid use cases for external configuration are thus cases where the configuration changes dynamically, i.e. different configurations are applied in order to adapt, in particular, to external conditions.
 
 ## Static External Configuration Is Questionable
 
@@ -111,14 +111,13 @@ Under the conditions described above, high-level code has the following advantag
 Configuration has none of these advantages, so the key question to ask is: Is the use of configuration necessary to achieve an elegant solution? In other words: Would not using configuration result in suboptimal design?
 Because of the downsides of configuration when compared to specifications in the programming language, we should only ever use configuration if the answer to these questions is a very clear "yes".
 
-### Examples
+### Example: Tianshou High-Level Experiment
 
-#### Example 1: Tianshou High-Level Experiment
-
-Consider this example from the Tianshou reinforcement learning library, which heavily makes use of the flexibility of the Python interpreter:
+Consider this example from the Tianshou reinforcement learning library, which heavily makes use of the flexibility of the Python language:
 It uses the builder pattern to flexibly configure an experiment; individual arguments can use subtype polymorphism to achieve vastly different behaviour.
-The *entire* code snippet is high-level code which defines the configuration of an experiment; extracting the constants from it would *not* suffice to define it; we need the entire definition to maintain the full level of flexibility.
-(Note, however, that even if it was the case that the constant literals sufficed and we could move them to an external configuration, we still would not gain anything by doing so.)
+The *entire* code snippet is high-level code which defines the configuration of an experiment.
+Extracting the constants from it would *not* suffice as configuration, as this would fail to maintain the same level of flexibility.
+(Note, however, that even if it was the case that the constant literals sufficed and we could move them to external configuration, we still would not gain anything by doing so!)
 
 
 ```python
@@ -162,7 +161,7 @@ experiment = (
 As a simple example, consider how we could, using external configuration, handle variations of this experiment where
 
 * `with_model_factory_default` shall not be called at all (because we don't want to specify non-default hidden sizes)
-* call `with_model_factory(MyModelFactory())` instead of `with_model_factory_default`.
+* `with_model_factory(MyModelFactory())` shall be called instead of `with_model_factory_default`.
 
 The solutions to both problems would not be entirely trivial to handle with configuration and would necessitate a differentiation of the various cases in the code that interprets the configuration.
 When using a Python script, however, the changes would be completely straightforward and clean.
```
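The Tianshou snippet referenced in this hunk is not included in the diff. To make the discussion concrete, here is a schematic sketch of the pattern being described: a fluent builder whose arguments are polymorphic factories. Apart from `with_model_factory` and `with_model_factory_default`, which appear in the text, all names are illustrative and do not reflect Tianshou's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Sequence


class ModelFactory(ABC):
    """Base type for model factories; subtypes can produce vastly different models."""

    @abstractmethod
    def create_model(self) -> object:
        pass


class DefaultModelFactory(ModelFactory):
    """Stand-in for a library-provided default factory, parametrised by hidden layer sizes."""

    def __init__(self, hidden_sizes: Sequence[int]):
        self.hidden_sizes = tuple(hidden_sizes)

    def create_model(self) -> object:
        return {"arch": "default", "hidden_sizes": self.hidden_sizes}


class MyModelFactory(ModelFactory):
    """A user-defined factory, exploiting subtype polymorphism."""

    def create_model(self) -> object:
        return {"arch": "custom"}


@dataclass
class Experiment:
    model_factory: Optional[ModelFactory] = None


class ExperimentBuilder:
    def __init__(self) -> None:
        self._model_factory: Optional[ModelFactory] = None

    def with_model_factory(self, factory: ModelFactory) -> "ExperimentBuilder":
        self._model_factory = factory
        return self  # fluent interface: calls can be chained

    def with_model_factory_default(self, hidden_sizes: Sequence[int]) -> "ExperimentBuilder":
        return self.with_model_factory(DefaultModelFactory(hidden_sizes))

    def build(self) -> Experiment:
        return Experiment(model_factory=self._model_factory)


# The two variations discussed in the text are trivial to express:
experiment_a = ExperimentBuilder().build()  # simply omit the factory call
experiment_b = ExperimentBuilder().with_model_factory(MyModelFactory()).build()
```

Note how omitting or swapping the factory call requires no case differentiation anywhere else; this is precisely the flexibility that a constant-only external configuration cannot express.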

oop-essentials/03-general-principles/README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -85,14 +85,14 @@ The core values of XP are:
 * **Communication**: XP stresses the need for communication between developers, for knowledge transfer and alignment.
 * **Simplicity** (= YAGNI)
 * **Feedback**: Periodically reflecting on past performance can help to identify areas for improvement, both in terms of code and the development process being applied.
-* **Courage**: Courage is required in order to raise issues that impede the development process, e.g organisational issues or even issues pertaining to the general direction the product is headed in.
+* **Courage**: Courage is required in order to raise issues that impede the development process, e.g. organisational issues or even issues pertaining to the general direction the product is headed in.
 * **Respect**: Mutual respect is required in order to foster communication and to provide and accept feedback.
 
 XP furthermore defines a set of practices, including
 * **Simple Design**: Build software to a simple but always adequate design.
 * **Pair Programming**: Two developers directly collaborate to produce a single piece of code. This immediately creates knowledge transfer (as more than one person is familiar with every piece of code) and eliminates the need for additional reviews.
 * **Refactoring**: Constantly refactoring the code to retain the quality of simple, adequate code that has no technical debt.
-* **Test-Driven Development**: Thinking about how to functionality can be tested from the very beginng.
+* **Test-Driven Development**: Thinking about how functionality can be tested from the very beginning can improve design and avoid errors.
 * **Continuous Integration**: Constantly integrating changes that improve the software product and applying automated tests helps to keep quality standards high.
 * **Collective Code Ownership**: Every piece of code can be immediately maintained by at least two developers. All code gets the benefit of many people’s attention, which increases code quality and reduces defects.
```
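Since Test-Driven Development is the one practice above that directly shapes code, a minimal sketch of its rhythm may help: write a failing test first, then the simplest implementation that makes it pass (two files shown in one block for brevity; all names are illustrative, not taken from this repository).

```python
# test_popularity.py -- written first, before the implementation exists
from popularity import is_popular


def test_is_popular_threshold():
    assert is_popular(50)
    assert not is_popular(49)


# popularity.py -- the simplest implementation that makes the test pass
THRESHOLD_POPULAR = 50


def is_popular(popularity: int) -> bool:
    return popularity >= THRESHOLD_POPULAR
```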

oop-essentials/05-ide-features/README.md

Lines changed: 7 additions & 1 deletion
```diff
@@ -203,7 +203,13 @@ Auto-completion helps save time by not requiring the developer to type out the n
 | JetBrains | Ctrl+Space+Space |
 | VSCode | just start typing |
 
-In VSCode, auto-import will not happen by default; it has to be enabled in settings.
+In VSCode, auto-import will not happen by default; it has to be enabled in settings:
+Open your settings.json (Ctrl+Shift+P and search for "User Settings") and add these lines:
+```
+"python.analysis.indexing": true,
+"python.analysis.autoImportCompletions": true,
+```
+Then, whenever you use an auto-completion in VSCode, it will also add the import.
 
 #### Type-Based Auto-Completion
 
```

refactoring-journey/step02-dataset-representation/run_classifier_evaluation.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -62,6 +62,8 @@ def load_data_frame(self) -> pd.DataFrame:
         :return: the full data frame for this dataset (including the class column)
         """
         df = pd.read_csv(config.csv_data_path()).dropna()
+        if self.drop_zero_popularity:
+            df = df[df[COL_POPULARITY] > 0]
         if self.num_samples is not None:
             df = df.sample(self.num_samples, random_state=self.random_seed)
         df[COL_GEN_POPULARITY_CLASS] = df[COL_POPULARITY].apply(lambda x: CLASS_POPULAR if x >= self.threshold_popular else CLASS_UNPOPULAR)
```
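Note that the hunk references `self.drop_zero_popularity` without showing its definition, which presumably happens in a constructor change elsewhere in this commit. As a rough sketch, the corresponding parameter might be declared as follows; the class name and default values are assumptions, not taken from the diff.

```python
class Dataset:  # hypothetical class name; the enclosing class is not visible in this hunk
    def __init__(self, num_samples: int | None = None, drop_zero_popularity: bool = False,
            threshold_popular: int = 50, random_seed: int = 42):
        """
        :param drop_zero_popularity: whether to exclude songs with zero popularity from the data
        """
        self.num_samples = num_samples
        self.drop_zero_popularity = drop_zero_popularity
        self.threshold_popular = threshold_popular
        self.random_seed = random_seed
```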

refactoring-journey/step03-refactoring/songpop/data.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -56,6 +56,8 @@ def load_data_frame(self) -> pd.DataFrame:
         :return: the full data frame for this dataset (including the class column)
         """
         df = pd.read_csv(config.csv_data_path()).dropna()
+        if self.drop_zero_popularity:
+            df = df[df[COL_POPULARITY] > 0]
         if self.num_samples is not None:
             df = df.sample(self.num_samples, random_state=self.random_seed)
         df[COL_GEN_POPULARITY_CLASS] = df[COL_POPULARITY].apply(lambda x: CLASS_POPULAR if x >= self.threshold_popular else CLASS_UNPOPULAR)
```

refactoring-journey/step04-model-specific-pipelines/README.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -66,8 +66,9 @@ which implement the aforementioned protocol. Every modification, such as
 requires the definition of a new pipeline; achieving modularity is not trivial.
 The combinatorial complexity of the manual definition of
 a pipeline per feature/pre-processing combination explodes very quickly.
+For ways of extending the pipeline-based approach, see [additional material](sklearn_pipelines_extended.ipynb).
 
-In other words, only using pipeline objects is too "low-level".
+Essentially, only using pipeline objects is too "low-level".
 We want a higher level of abstraction, which enables us to only provide a **declaration** of what we would like
 to do and the corresponding pipeline to achieve this shall be composed automatically, i.e. we would like to declare,
 for each model,
```
refactoring-journey/step04-model-specific-pipelines/sklearn_pipelines_extended.ipynb

Lines changed: 164 additions & 0 deletions (new file; notebook cell contents rendered below)

# Taking Things Further with scikit-learn Pipelines

The pipelines we presented in this step are extremely simple, as they apply the same transformations to all features. Consequently, the models were limited to the set of features to which these transformations could be applied.
In order to support a greater set of features in our models, we would need to apply different transformations to different features.

In this notebook, we shall briefly explore ways of doing this.

## Differentiating between Categorical and Numerical Features

As a first step, let us add support for categorical features, which we shall encode using one-hot encoding, alongside numerical features, to which we shall apply standard scaling.

We use an indicator function `is_categorical` which allows us to differentiate the two types of features.

```python
from dataclasses import dataclass
from enum import Enum

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MaxAbsScaler

from songpop.data import COLS_MUSICAL_CATEGORIES


def is_categorical(feature: str):
    return feature in COLS_MUSICAL_CATEGORIES


def create_random_forest_pipeline(features: list[str]):
    return Pipeline([
        ('preprocess', ColumnTransformer([
            ("cat", OneHotEncoder(), [feature for feature in features if is_categorical(feature)]),
            ("num", StandardScaler(), [feature for feature in features if not is_categorical(feature)])])),
        ('classifier', RandomForestClassifier())])
```

## Adding Support for Different Scaling Transformations of Numerical Features

In practice, it is, however, not usually reasonable to apply the same scaling transformation to all numerical features. How could we address this?

Frequently, the way in which a feature shall be transformed is inherent to the feature semantics, and upon having analyzed the nature of a feature, the choice of transformation becomes clear. Therefore, what is needed is really an explicit representation of a feature, which includes information on how to transform it. A very naive attempt at this could look like this:

```python
class FeatureTransformationType(Enum):
    NONE = 0
    ONE_HOT_ENCODING = 1
    STANDARD_SCALER = 2
    MAX_ABS_SCALER = 3


@dataclass
class Feature:
    col_name: str
    feature_transformation_type: FeatureTransformationType


def create_random_forest_pipeline(features: list[Feature]):
    features_none = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.NONE]
    features_one_hot = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.ONE_HOT_ENCODING]
    features_num_std = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.STANDARD_SCALER]
    features_num_abs = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.MAX_ABS_SCALER]
    return Pipeline([
        ('preprocess', ColumnTransformer([
            ("id", "passthrough", features_none),
            ("one_hot", OneHotEncoder(), features_one_hot),
            ("num_std", StandardScaler(), features_num_std),
            ("num_abs", MaxAbsScaler(), features_num_abs)])),
        ('classifier', RandomForestClassifier())])
```

A more sophisticated approach would involve a representation of each feature that is itself a transformer. This adds flexibility and allows for a more fine-grained control over the transformations applied to each feature.

In the following, we will, however, use the concepts of the library sensAI instead. sensAI builds upon scikit-learn concepts, using strictly object-oriented design and a higher level of abstraction (see subsequent steps in the journey).
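For orientation, here is a brief usage sketch of the `Feature`-based variant above; the column names are illustrative placeholders, not necessarily columns of the actual dataset.

```python
# Declare, per feature, how it shall be transformed (column names are hypothetical)
features = [
    Feature("genre", FeatureTransformationType.ONE_HOT_ENCODING),
    Feature("tempo", FeatureTransformationType.STANDARD_SCALER),
    Feature("loudness", FeatureTransformationType.MAX_ABS_SCALER),
    Feature("mode", FeatureTransformationType.NONE),
]

pipeline = create_random_forest_pipeline(features)
# pipeline.fit(X, y)  # X: DataFrame containing the declared columns; y: class labels
```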

refactoring-journey/step04-model-specific-pipelines/songpop/data.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -56,6 +56,8 @@ def load_data_frame(self) -> pd.DataFrame:
         :return: the full data frame for this dataset (including the class column)
         """
         df = pd.read_csv(config.csv_data_path()).dropna()
+        if self.drop_zero_popularity:
+            df = df[df[COL_POPULARITY] > 0]
         if self.num_samples is not None:
             df = df.sample(self.num_samples, random_state=self.random_seed)
         df[COL_GEN_POPULARITY_CLASS] = df[COL_POPULARITY].apply(lambda x: CLASS_POPULAR if x >= self.threshold_popular else CLASS_UNPOPULAR)
```
