|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "source": [ |
| 6 | + "# Taking Things Further with scikit-learn Pipelines" |
| 7 | + ], |
| 8 | + "metadata": { |
| 9 | + "collapsed": false |
| 10 | + }, |
| 11 | + "id": "d89ff99bad9f1f3a" |
| 12 | + }, |
| 13 | + { |
| 14 | + "cell_type": "markdown", |
| 15 | + "source": [ |
| 16 | + "The pipelines we presented in this step are extremely simple, as they apply the same transformations to all features. Consequently, the models were limited to the set of features to which these transformations could be applied.\n", |
| 17 | + "In order to support a greater set of features in our models, we would need to apply different transformations to different features.\n", |
| 18 | + "\n", |
| 19 | + "In this notebook, we shall briefly explore ways of doing this." |
| 20 | + ], |
| 21 | + "metadata": { |
| 22 | + "collapsed": false |
| 23 | + }, |
| 24 | + "id": "598c79eb2e75854a" |
| 25 | + }, |
| 26 | + { |
| 27 | + "cell_type": "markdown", |
| 28 | + "source": [ |
| 29 | + "## Differentiating between Categorical and Numerical Features\n", |
| 30 | + "\n", |
| 31 | + "As a first step, let us add support for categorical features, which we shall encode using one-hot encoding, alongside numerical features, to which we shall apply standard scaling.\n", |
| 32 | + "\n", |
| 33 | + "We use an indicator function `is_categorical` which allows us to differentiate the two types of features." |
| 34 | + ], |
| 35 | + "metadata": { |
| 36 | + "collapsed": false |
| 37 | + }, |
| 38 | + "id": "3f0b51661f915c3e" |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "outputs": [], |
| 43 | + "source": [ |
| 44 | + "from dataclasses import dataclass\n", |
| 45 | + "from enum import Enum\n", |
| 46 | + "\n", |
| 47 | + "from sklearn.compose import ColumnTransformer\n", |
| 48 | + "from sklearn.ensemble import RandomForestClassifier\n", |
| 49 | + "from sklearn.pipeline import Pipeline\n", |
| 50 | + "from sklearn.preprocessing import OneHotEncoder, StandardScaler, MaxAbsScaler\n", |
| 51 | + "\n", |
| 52 | + "from songpop.data import COLS_MUSICAL_CATEGORIES\n", |
| 53 | + "\n", |
| 54 | + "\n", |
| 55 | + "def is_categorical(feature: str):\n", |
| 56 | + " return feature in COLS_MUSICAL_CATEGORIES \n", |
| 57 | + "\n", |
| 58 | + "\n", |
| 59 | + "def create_random_forest_pipeline(features: list[str]):\n", |
| 60 | + " return Pipeline([\n", |
| 61 | + " ('preprocess', ColumnTransformer([\n", |
| 62 | + " (\"cat\", OneHotEncoder(), [feature for feature in features if is_categorical(feature)]),\n", |
| 63 | + " (\"num\", StandardScaler(), [feature for feature in features if not is_categorical(feature)])])),\n", |
| 64 | + " ('classifier', RandomForestClassifier())])" |
| 65 | + ], |
| 66 | + "metadata": { |
| 67 | + "collapsed": false, |
| 68 | + "ExecuteTime": { |
| 69 | + "end_time": "2024-06-19T12:52:54.435043100Z", |
| 70 | + "start_time": "2024-06-19T12:52:51.455805300Z" |
| 71 | + } |
| 72 | + }, |
| 73 | + "id": "8d352e9ef6fc5638", |
| 74 | + "execution_count": 1 |
| 75 | + }, |
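| 76 | +  {
| 77 | +   "cell_type": "markdown",
| 78 | +   "source": [
| 79 | +    "As a quick illustration, let us construct such a pipeline for a small, hypothetical set of feature columns. The column names below are made up for this sketch; we assume that `genre` is listed in `COLS_MUSICAL_CATEGORIES`, while `tempo` and `loudness` are numerical."
| 80 | +   ],
| 81 | +   "metadata": {
| 82 | +    "collapsed": false
| 83 | +   },
| 84 | +   "id": "f3a1c2d4e5b69701"
| 85 | +  },
| 86 | +  {
| 87 | +   "cell_type": "code",
| 88 | +   "outputs": [],
| 89 | +   "source": [
| 90 | +    "# Hypothetical usage sketch: the column names are illustrative only.\n",
| 91 | +    "# We assume that \"genre\" appears in COLS_MUSICAL_CATEGORIES and will thus be\n",
| 92 | +    "# one-hot encoded, while \"tempo\" and \"loudness\" will be standard-scaled.\n",
| 93 | +    "pipeline = create_random_forest_pipeline([\"genre\", \"tempo\", \"loudness\"])\n",
| 94 | +    "pipeline"
| 95 | +   ],
| 96 | +   "metadata": {
| 97 | +    "collapsed": false
| 98 | +   },
| 99 | +   "id": "a7c1e2b3d4f59602",
| 100 | +   "execution_count": null
| 101 | +  },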
| 76 | + { |
| 77 | + "cell_type": "markdown", |
| 78 | + "source": [ |
| 79 | + "## Adding Support for Different Scaling Transformations of Numerical Features\n", |
| 80 | + "\n", |
| 81 | + "In practice, it is, however, not usually reasonable to apply the same scaling transformation to all numerical features. How could we address this?\n", |
| 82 | + "\n", |
| 83 | + "Frequently, the way in which a feature shall be transformed is inherent to the feature semantics, and upon having analyzed the nature of a feature, the choice of transformation becomes clear. Therefore, what is needed is really an explicit representation of a feature, which includes information on how to transform it. A very naive attempt at this could look like this: " |
| 84 | + ], |
| 85 | + "metadata": { |
| 86 | + "collapsed": false |
| 87 | + }, |
| 88 | + "id": "24130c901bf2f5e9" |
| 89 | + }, |
| 90 | + { |
| 91 | + "cell_type": "code", |
| 92 | + "outputs": [], |
| 93 | + "source": [ |
| 94 | + "class FeatureTransformationType(Enum):\n", |
| 95 | + " NONE = 0\n", |
| 96 | + " ONE_HOT_ENCODING = 1\n", |
| 97 | + " STANDARD_SCALER = 2\n", |
| 98 | + " MAX_ABS_SCALER = 3\n", |
| 99 | + "\n", |
| 100 | + "\n", |
| 101 | + "@dataclass\n", |
| 102 | + "class Feature:\n", |
| 103 | + " col_name: str\n", |
| 104 | + " feature_transformation_type: FeatureTransformationType\n", |
| 105 | + "\n", |
| 106 | + "\n", |
| 107 | + "def create_random_forest_pipeline(features: list[Feature]):\n", |
| 108 | + " features_none = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.NONE]\n", |
| 109 | + " features_one_hot = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.ONE_HOT_ENCODING]\n", |
| 110 | + " features_num_std = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.STANDARD_SCALER]\n", |
| 111 | + " features_num_abs = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.MAX_ABS_SCALER]\n", |
| 112 | + " return Pipeline([\n", |
| 113 | + " ('preprocess', ColumnTransformer([\n", |
| 114 | + " (\"id\", \"passthrough\", features_none),\n", |
| 115 | + " (\"one_hot\", OneHotEncoder(), features_one_hot),\n", |
| 116 | + " (\"num_std\", StandardScaler(), features_num_std),\n", |
| 117 | + " (\"num_abs\", MaxAbsScaler(), features_num_abs)])),\n", |
| 118 | + " ('classifier', RandomForestClassifier())])" |
| 119 | + ], |
| 120 | + "metadata": { |
| 121 | + "collapsed": false, |
| 122 | + "ExecuteTime": { |
| 123 | + "end_time": "2024-06-19T12:52:54.469264400Z", |
| 124 | + "start_time": "2024-06-19T12:52:54.444572600Z" |
| 125 | + } |
| 126 | + }, |
| 127 | + "id": "9c42e4b55b32ec7e", |
| 128 | + "execution_count": 2 |
| 129 | + }, |
| 130 | + { |
| 131 | + "cell_type": "markdown", |
| 132 | + "source": [ |
| 133 | + "A more sophisticated approach would involve a representation of each feature that is itself a transformer. This adds flexibility and allows for a more fine-grained control over the transformations applied to each feature.\n", |
| 134 | + "\n", |
| 135 | + "In the following, we will, however, use the concepts of the library sensAI instead. sensAI builds upon scikit-learn concepts, using strictly object-oriented design and a higher level of abstraction (see subsequent steps in the journey). " |
| 136 | + ], |
| 137 | + "metadata": { |
| 138 | + "collapsed": false |
| 139 | + }, |
| 140 | + "id": "b70dba814dcf5da0" |
| 141 | +  },
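| 142 | +  {
| 143 | +   "cell_type": "markdown",
| 144 | +   "source": [
| 145 | +    "The following is merely an illustrative sketch of the feature-as-transformer idea in plain scikit-learn terms (it is not sensAI's actual API): each feature object carries the transformer that is to be applied to its column."
| 146 | +   ],
| 147 | +   "metadata": {
| 148 | +    "collapsed": false
| 149 | +   },
| 150 | +   "id": "b8c9d0e1f2a39b05"
| 151 | +  },
| 152 | +  {
| 153 | +   "cell_type": "code",
| 154 | +   "outputs": [],
| 155 | +   "source": [
| 156 | +    "from typing import Union\n",
| 157 | +    "\n",
| 158 | +    "from sklearn.base import TransformerMixin\n",
| 159 | +    "\n",
| 160 | +    "\n",
| 161 | +    "# Illustrative sketch only (not sensAI's API): a feature that knows its own transformer\n",
| 162 | +    "@dataclass\n",
| 163 | +    "class TransformedFeature:\n",
| 164 | +    "    col_name: str\n",
| 165 | +    "    transformer: Union[TransformerMixin, str] = \"passthrough\"  # transformer instance or \"passthrough\"\n",
| 166 | +    "\n",
| 167 | +    "\n",
| 168 | +    "def create_pipeline_with_transformed_features(features: list[TransformedFeature]):\n",
| 169 | +    "    # one ColumnTransformer entry per feature, named after its column\n",
| 170 | +    "    return Pipeline([\n",
| 171 | +    "        ('preprocess', ColumnTransformer(\n",
| 172 | +    "            [(f.col_name, f.transformer, [f.col_name]) for f in features])),\n",
| 173 | +    "        ('classifier', RandomForestClassifier())])\n",
| 174 | +    "\n",
| 175 | +    "\n",
| 176 | +    "pipeline = create_pipeline_with_transformed_features([\n",
| 177 | +    "    TransformedFeature(\"genre\", OneHotEncoder()),\n",
| 178 | +    "    TransformedFeature(\"tempo\", StandardScaler()),\n",
| 179 | +    "    TransformedFeature(\"loudness\", MaxAbsScaler()),\n",
| 180 | +    "    TransformedFeature(\"explicit\")])\n",
| 181 | +    "pipeline"
| 182 | +   ],
| 183 | +   "metadata": {
| 184 | +    "collapsed": false
| 185 | +   },
| 186 | +   "id": "d0e1f2a3b4c59c06",
| 187 | +   "execution_count": null
| 188 | +  }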
| 142 | + ], |
| 143 | + "metadata": { |
| 144 | + "kernelspec": { |
| 145 | + "display_name": "Python 3", |
| 146 | + "language": "python", |
| 147 | + "name": "python3" |
| 148 | + }, |
| 149 | + "language_info": { |
| 150 | + "codemirror_mode": { |
| 151 | + "name": "ipython", |
| 152 | + "version": 2 |
| 153 | + }, |
| 154 | + "file_extension": ".py", |
| 155 | + "mimetype": "text/x-python", |
| 156 | + "name": "python", |
| 157 | + "nbconvert_exporter": "python", |
| 158 | + "pygments_lexer": "ipython2", |
| 159 | + "version": "2.7.6" |
| 160 | + } |
| 161 | + }, |
| 162 | + "nbformat": 4, |
| 163 | + "nbformat_minor": 5 |
| 164 | +} |