|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "source": [ |
| 6 | + "# Taking Things Further with scikit-learn Pipelines" |
| 7 | + ], |
| 8 | + "metadata": { |
| 9 | + "collapsed": false |
| 10 | + }, |
| 11 | + "id": "d89ff99bad9f1f3a" |
| 12 | + }, |
| 13 | + { |
| 14 | + "cell_type": "markdown", |
| 15 | + "source": [ |
| 16 | + "The pipelines we presented in this step are extremely simple, as they apply the same transformations to all features. Consequently, the models were limited to the set of features to which these transformations could be applied.\n", |
| 17 | + "In order to support a greater set of features in our models, we would need to apply different transformations to different features.\n", |
| 18 | + "\n", |
| 19 | + "In this notebook, we shall briefly explore ways of doing this." |
| 20 | + ], |
| 21 | + "metadata": { |
| 22 | + "collapsed": false |
| 23 | + }, |
| 24 | + "id": "598c79eb2e75854a" |
| 25 | + }, |
| 26 | + { |
| 27 | + "cell_type": "markdown", |
| 28 | + "source": [ |
| 29 | + "## Differentiating between Categorical and Numerical Features\n", |
| 30 | + "\n", |
| 31 | + "As a first step, let us add support for categorical features, which we shall encode using one-hot encoding, alongside numerical features, to which we shall apply standard scaling.\n", |
| 32 | + "\n", |
| 33 | + "We use an indicator function `is_categorical` which allows us to differentiate the two types of features." |
| 34 | + ], |
| 35 | + "metadata": { |
| 36 | + "collapsed": false |
| 37 | + }, |
| 38 | + "id": "3f0b51661f915c3e" |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "outputs": [], |
| 43 | + "source": [ |
| 44 | + "from dataclasses import dataclass\n", |
| 45 | + "from enum import Enum\n", |
| 46 | + "\n", |
| 47 | + "from sklearn.compose import ColumnTransformer\n", |
| 48 | + "from sklearn.ensemble import RandomForestClassifier\n", |
| 49 | + "from sklearn.pipeline import Pipeline\n", |
| 50 | + "from sklearn.preprocessing import OneHotEncoder, StandardScaler, MaxAbsScaler\n", |
| 51 | + "\n", |
| 52 | + "from songpop.data import COLS_MUSICAL_CATEGORIES\n", |
| 53 | + "\n", |
| 54 | + "\n", |
| 55 | + "def is_categorical(feature: str):\n", |
| 56 | + " return feature in COLS_MUSICAL_CATEGORIES \n", |
| 57 | + "\n", |
| 58 | + "\n", |
| 59 | + "def create_random_forest_pipeline(features: list[str]):\n", |
| 60 | + " return Pipeline([\n", |
| 61 | + " ('preprocess', ColumnTransformer([\n", |
| 62 | + " (\"cat\", OneHotEncoder(), [feature for feature in features if is_categorical(feature)]),\n", |
| 63 | + " (\"num\", StandardScaler(), [feature for feature in features if not is_categorical(feature)])])),\n", |
| 64 | + " ('classifier', RandomForestClassifier())])" |
| 65 | + ], |
| 66 | + "metadata": { |
| 67 | + "collapsed": false, |
| 68 | + "ExecuteTime": { |
| 69 | + "end_time": "2024-06-19T12:52:54.435043100Z", |
| 70 | + "start_time": "2024-06-19T12:52:51.455805300Z" |
| 71 | + } |
| 72 | + }, |
| 73 | + "id": "8d352e9ef6fc5638", |
| 74 | + "execution_count": 1 |
| 75 | + }, |
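| 76 | +  {
| 77 | +   "cell_type": "markdown",
| 78 | +   "source": [
| 79 | +    "As a quick illustration, let us construct such a pipeline for a small, hypothetical set of feature columns. The column names below are made up for this sketch; we assume that `genre` is listed in `COLS_MUSICAL_CATEGORIES`, while `tempo` and `loudness` are numerical."
| 80 | +   ],
| 81 | +   "metadata": {
| 82 | +    "collapsed": false
| 83 | +   },
| 84 | +   "id": "f3a1c2d4e5b69701"
| 85 | +  },
| 86 | +  {
| 87 | +   "cell_type": "code",
| 88 | +   "outputs": [],
| 89 | +   "source": [
| 90 | +    "# Hypothetical usage sketch: the column names are illustrative only.\n",
| 91 | +    "# We assume that \"genre\" appears in COLS_MUSICAL_CATEGORIES and will thus be\n",
| 92 | +    "# one-hot encoded, while \"tempo\" and \"loudness\" will be standard-scaled.\n",
| 93 | +    "pipeline = create_random_forest_pipeline([\"genre\", \"tempo\", \"loudness\"])\n",
| 94 | +    "pipeline"
| 95 | +   ],
| 96 | +   "metadata": {
| 97 | +    "collapsed": false
| 98 | +   },
| 99 | +   "id": "a7c1e2b3d4f59602",
| 100 | +   "execution_count": null
| 101 | +  },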
| 76 | + { |
| 77 | + "cell_type": "markdown", |
| 78 | + "source": [ |
| 79 | + "## Adding Support for Different Scaling Transformations of Numerical Features\n", |
| 80 | + "\n", |
| 81 | + "In practice, it is, however, not usually reasonable to apply the same scaling transformation to all numerical features. How could we address this?\n", |
| 82 | + "\n", |
| 83 | + "Frequently, the way in which a feature shall be transformed is inherent to the feature semantics, and upon having analyzed the nature of a feature, the choice of transformation becomes clear. Therefore, what is needed is really an explicit representation of a feature, which includes information on how to transform it. A very naive attempt at this could look like this: " |
| 84 | + ], |
| 85 | + "metadata": { |
| 86 | + "collapsed": false |
| 87 | + }, |
| 88 | + "id": "24130c901bf2f5e9" |
| 89 | + }, |
| 90 | + { |
| 91 | + "cell_type": "code", |
| 92 | + "outputs": [], |
| 93 | + "source": [ |
| 94 | + "class FeatureTransformationType(Enum):\n", |
| 95 | + " NONE = 0\n", |
| 96 | + " ONE_HOT_ENCODING = 1\n", |
| 97 | + " STANDARD_SCALER = 2\n", |
| 98 | + " MAX_ABS_SCALER = 3\n", |
| 99 | + "\n", |
| 100 | + "\n", |
| 101 | + "@dataclass\n", |
| 102 | + "class Feature:\n", |
| 103 | + " col_name: str\n", |
| 104 | + " feature_transformation_type: FeatureTransformationType\n", |
| 105 | + "\n", |
| 106 | + "\n", |
| 107 | + "def create_random_forest_pipeline(features: list[Feature]):\n", |
| 108 | + " features_none = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.NONE]\n", |
| 109 | + " features_one_hot = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.ONE_HOT_ENCODING]\n", |
| 110 | + " features_num_std = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.STANDARD_SCALER]\n", |
| 111 | + " features_num_abs = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.MAX_ABS_SCALER]\n", |
| 112 | + " return Pipeline([\n", |
| 113 | + " ('preprocess', ColumnTransformer([\n", |
| 114 | + " (\"id\", \"passthrough\", features_none),\n", |
| 115 | + " (\"one_hot\", OneHotEncoder(), features_one_hot),\n", |
| 116 | + " (\"num_std\", StandardScaler(), features_num_std),\n", |
| 117 | + " (\"num_abs\", MaxAbsScaler(), features_num_abs)])),\n", |
| 118 | + " ('classifier', RandomForestClassifier())])" |
| 119 | + ], |
| 120 | + "metadata": { |
| 121 | + "collapsed": false, |
| 122 | + "ExecuteTime": { |
| 123 | + "end_time": "2024-06-19T12:52:54.469264400Z", |
| 124 | + "start_time": "2024-06-19T12:52:54.444572600Z" |
| 125 | + } |
| 126 | + }, |
| 127 | + "id": "9c42e4b55b32ec7e", |
| 128 | + "execution_count": 2 |
| 129 | + }, |
| 130 | + { |
| 131 | + "cell_type": "markdown", |
| 132 | + "source": [ |
| 133 | + "A more sophisticated approach would involve a representation of each feature that is itself a transformer. This adds flexibility and allows for a more fine-grained control over the transformations applied to each feature.\n", |
| 134 | + "\n", |
| 135 | + "In the following, we will, however, use the concepts of the library sensAI instead. sensAI builds upon scikit-learn concepts, using strictly object-oriented design and a higher level of abstraction (see subsequent steps in the journey). " |
| 136 | + ], |
| 137 | + "metadata": { |
| 138 | + "collapsed": false |
| 139 | + }, |
| 140 | + "id": "b70dba814dcf5da0" |
| 141 | +  },
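| 142 | +  {
| 143 | +   "cell_type": "markdown",
| 144 | +   "source": [
| 145 | +    "The following is merely an illustrative sketch of the feature-as-transformer idea in plain scikit-learn terms (it is not sensAI's actual API): each feature object carries the transformer that is to be applied to its column."
| 146 | +   ],
| 147 | +   "metadata": {
| 148 | +    "collapsed": false
| 149 | +   },
| 150 | +   "id": "b8c9d0e1f2a39b05"
| 151 | +  },
| 152 | +  {
| 153 | +   "cell_type": "code",
| 154 | +   "outputs": [],
| 155 | +   "source": [
| 156 | +    "from typing import Union\n",
| 157 | +    "\n",
| 158 | +    "from sklearn.base import TransformerMixin\n",
| 159 | +    "\n",
| 160 | +    "\n",
| 161 | +    "# Illustrative sketch only (not sensAI's API): a feature that knows its own transformer\n",
| 162 | +    "@dataclass\n",
| 163 | +    "class TransformedFeature:\n",
| 164 | +    "    col_name: str\n",
| 165 | +    "    transformer: Union[TransformerMixin, str] = \"passthrough\"  # transformer instance or \"passthrough\"\n",
| 166 | +    "\n",
| 167 | +    "\n",
| 168 | +    "def create_pipeline_with_transformed_features(features: list[TransformedFeature]):\n",
| 169 | +    "    # one ColumnTransformer entry per feature, named after its column\n",
| 170 | +    "    return Pipeline([\n",
| 171 | +    "        ('preprocess', ColumnTransformer(\n",
| 172 | +    "            [(f.col_name, f.transformer, [f.col_name]) for f in features])),\n",
| 173 | +    "        ('classifier', RandomForestClassifier())])\n",
| 174 | +    "\n",
| 175 | +    "\n",
| 176 | +    "pipeline = create_pipeline_with_transformed_features([\n",
| 177 | +    "    TransformedFeature(\"genre\", OneHotEncoder()),\n",
| 178 | +    "    TransformedFeature(\"tempo\", StandardScaler()),\n",
| 179 | +    "    TransformedFeature(\"loudness\", MaxAbsScaler()),\n",
| 180 | +    "    TransformedFeature(\"explicit\")])\n",
| 181 | +    "pipeline"
| 182 | +   ],
| 183 | +   "metadata": {
| 184 | +    "collapsed": false
| 185 | +   },
| 186 | +   "id": "d0e1f2a3b4c59c06",
| 187 | +   "execution_count": null
| 188 | +  }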
| 142 | + ], |
| 143 | + "metadata": { |
| 144 | + "kernelspec": { |
| 145 | + "display_name": "Python 3", |
| 146 | + "language": "python", |
| 147 | + "name": "python3" |
| 148 | + }, |
| 149 | + "language_info": { |
| 150 | + "codemirror_mode": { |
| 151 | + "name": "ipython", |
| 152 | + "version": 2 |
| 153 | + }, |
| 154 | + "file_extension": ".py", |
| 155 | + "mimetype": "text/x-python", |
| 156 | + "name": "python", |
| 157 | + "nbconvert_exporter": "python", |
| 158 | + "pygments_lexer": "ipython2", |
| 159 | + "version": "2.7.6" |
| 160 | + } |
| 161 | + }, |
| 162 | + "nbformat": 4, |
| 163 | + "nbformat_minor": 5 |
| 164 | +} |