# One monolithic function

Meet the One Monolithic Function, also known as the Swiss Army Knife Function or the God Function—the Jack-of-all-trades and master of, well, none.
These functions usually come with deceptively simple and generic names like *run*, *train*, or *main*. Such a function is often the result of directly converting a Jupyter notebook into a script, similar to what we did in our [refactoring journey](../../refactoring-journey/step01-python-script/run_classifier_evaluation.py).

This function is like that one person who insists on doing everything themselves—from cooking dinner to fixing the plumbing—except in the coding world. But just like our multitasking friend who forgets to turn off the stove while unclogging the sink, this approach can quickly become a recipe for disaster.

By trying to do everything in one place, this monolithic function ends up being an unmaintainable tangle of responsibilities. It mixes high-level decisions like "What file format am I dealing with?" with low-level tasks like "Let's calculate the mean to fill in missing values," all in the same breath. It's a classic case of not knowing when to delegate, resulting in code that’s harder to read, harder to debug, and way harder to extend.

Take a look at the following code and try to understand what it does without reading it line by line:

```python
import pandas as pd
import numpy as np
import json
import logging
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def main(file_path: str) -> float:
    if file_path.endswith('.csv'):
        data = pd.read_csv(file_path)
    elif file_path.endswith('.json'):
        with open(file_path, 'r') as file:
            data_dict = json.load(file)
        data = pd.DataFrame(data_dict)
    else:
        raise ValueError("Unsupported file format!")

    logging.info(f"Data loaded from {file_path} with {len(data)} rows and {len(data.columns)} columns.")

    if 'target' not in data.columns:
        raise ValueError("Target column is missing in the dataset!")

    for column in data.select_dtypes(include=[np.number]).columns:
        if data[column].isnull().sum() > 0:
            mean_value = data[column].mean()
            data[column] = data[column].fillna(mean_value)

    for column in data.select_dtypes(include=[np.number]).columns:
        max_value = data[column].max()
        min_value = data[column].min()
        data[column] = (data[column] - min_value) / (max_value - min_value)

    data['feature_interaction'] = data['feature1'] * data['feature2'] * np.log1p(data['feature3'])

    X = data.drop('target', axis=1)
    y = data['target']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
```
Sure, if you painstakingly read it line by line, you’ll eventually arrive at this thrilling revelation about what the function does:

1. Loads the data by handling different file formats.
2. Cleans the data by filling in missing values.
3. Normalizes the data through scaling.
4. Performs feature engineering by creating interaction terms.
5. Trains a machine learning model.
6. Evaluates model performance.

While this function does manage to accomplish several tasks, it suffers from poor readability. The mixture of different responsibilities within a single function makes it difficult to follow what’s happening at a glance.

Additionally, the function is hard to test, modify, and extend. Because it handles everything from data loading to model evaluation, testing individual parts of the process in isolation is nearly impossible.
Any change to one part of the process can potentially impact the others, making the function fragile and prone to errors when updates are needed.

A first entry point for refactoring this function could be to apply the [Single Level of Abstraction Principle (SLAP)](../../oop-essentials/03-general-principles/README.md#slap-single-level-of-abstraction-principle). By ensuring that each function operates at a single level of abstraction, you can begin to separate the high-level orchestration from the low-level details. The result could look like this:

```python
import logging
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


def main(path: str) -> float:
    data = load_data(path)
    data = fill_missing_values(data)
    data = normalize_features(data)
    data = engineer_features(data)
    X_train, X_test, y_train, y_test = split_data(data)
    model = train_model(X_train, y_train)
    accuracy = evaluate_model(model, X_test, y_test)
    return accuracy


def load_data(file_path: str) -> pd.DataFrame:
    if file_path.endswith('.csv'):
        data = pd.read_csv(file_path)
    elif file_path.endswith('.json'):
        with open(file_path, 'r') as file:
            data_dict = json.load(file)
        data = pd.DataFrame(data_dict)
    else:
        raise ValueError("Unsupported file format!")
    logging.info(f"Data loaded from {file_path} with {len(data)} rows and {len(data.columns)} columns.")
    return data


def fill_missing_values(data: pd.DataFrame) -> pd.DataFrame:
    for column in data.select_dtypes(include=[np.number]).columns:
        if data[column].isnull().sum() > 0:
            mean_value = data[column].mean()
            data[column] = data[column].fillna(mean_value)
    return data


def normalize_features(data: pd.DataFrame) -> pd.DataFrame:
    for column in data.select_dtypes(include=[np.number]).columns:
        max_value = data[column].max()
        min_value = data[column].min()
        data[column] = (data[column] - min_value) / (max_value - min_value)
    return data


def engineer_features(data: pd.DataFrame) -> pd.DataFrame:
    data['feature_interaction'] = data['feature1'] * data['feature2'] * np.log1p(data['feature3'])
    return data


def split_data(data: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    y = data['target']
    X = data.drop('target', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test


def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> DecisionTreeClassifier:
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model


def evaluate_model(model: DecisionTreeClassifier, X_test: pd.DataFrame, y_test: pd.Series) -> float:
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
```
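
The pipeline is invoked exactly as before. A minimal usage sketch, assuming a hypothetical file `training_data.csv` that contains the numeric columns `feature1`, `feature2`, `feature3`, and `target`:

```python
# Hypothetical invocation: 'training_data.csv' is an assumed file name with
# numeric columns feature1, feature2, feature3 and a target column.
logging.basicConfig(level=logging.INFO)
accuracy = main("training_data.csv")
print(f"Test accuracy: {accuracy:.3f}")
```
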
Simply by extracting low-level functions, each with a single, isolated task, and calling them from the high-level function *main*, we have already gained:

1. Improved Readability: The main function now reads more like a summary of the overall process, with each low-level function clearly named to describe its specific task. This makes it easier for developers to understand the code at a glance.

2. Enhanced Testability: Isolated functions are easier to unit test.
   You can test each low-level function individually to ensure it performs its task correctly, leading to more reliable and robust code (see the test sketch after this list).

3. Increased Reusability: Low-level functions that perform specific tasks can often be reused in different parts of the codebase or in future projects, reducing the need to write redundant code.
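
To make the testability gain concrete, here is a minimal sketch of a unit test for one of the extracted functions, written for pytest (an assumed choice of test framework; `pipeline` is a hypothetical module name for the code above):

```python
import numpy as np
import pandas as pd

from pipeline import fill_missing_values  # hypothetical module name


def test_fill_missing_values_replaces_nan_with_column_mean():
    # Small in-memory fixture: no file I/O, no model training, just one task.
    data = pd.DataFrame({"feature1": [1.0, np.nan, 3.0]})
    result = fill_missing_values(data)
    # The NaN is replaced by the mean of the remaining values: (1.0 + 3.0) / 2 = 2.0.
    assert result["feature1"].tolist() == [1.0, 2.0, 3.0]
```

Such a test was simply not possible while the imputation logic was buried inside a function that also insisted on reading files and training models.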

Nevertheless, this should only be considered a first step toward improving the code. While extracting low-level functions helps provide more clarity about what’s happening, the code still lacks a coherent software design and remains fragile and inflexible. To achieve a truly robust and maintainable solution, further refactoring is necessary.
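
As one illustration of what such a step might look like (a sketch of one possible direction, not necessarily the one the refactoring journey takes): `load_data` still hard-codes the supported file formats, so supporting a new one means editing its if/elif chain. A small registry of reader functions would let formats be added without touching the dispatch logic:

```python
from typing import Callable


def _read_json(path: str) -> pd.DataFrame:
    with open(path, 'r') as file:
        return pd.DataFrame(json.load(file))


# Registry mapping file extensions to reader functions; supporting a new
# format only requires a new entry here, not a change to load_data itself.
READERS: dict[str, Callable[[str], pd.DataFrame]] = {
    '.csv': pd.read_csv,
    '.json': _read_json,
}


def load_data(file_path: str) -> pd.DataFrame:
    for extension, reader in READERS.items():
        if file_path.endswith(extension):
            return reader(file_path)
    raise ValueError("Unsupported file format!")
```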

We highly encourage you to follow our [refactoring journey](../../refactoring-journey/README.md) to explore a more structured and well-designed approach. This will not only enhance the code’s flexibility but also ensure it’s better suited to handle future changes and extensions.