add AutoML Sweepable API

LittleLittleCloud · LittleLittleCloud · commit 6215ff074128 · 2022-09-23T12:29:50.000-07:00
diff --git a/machine-learning/05 - AutoML Sweepable API.ipynb b/machine-learning/05 - AutoML Sweepable API.ipynb
@@ -0,0 +1,277 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## AutoML Sweepable API\n",
+    "\n",
+    "This notebook shows how to use the `Sweepable` API to fully customize the pipeline and search space in your AutoML task. In this notebook, you will learn how to\n",
+    "- use `Auto().CreateSweepableEstimator` to create a `SweepableEstimator`.\n",
+    "- create a `SweepablePipeline` with multiple trainer candidates.\n",
+    "- use the built-in `SweepableEstimator` candidates to simplify your work."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Install NuGet packages and add using statements"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "// Use the nightly-build package feed for the preview packages\n",
+    "#i \"nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json\"\n",
+    "#r \"nuget: Plotly.NET.Interactive, 3.0.2\"\n",
+    "#r \"nuget: Plotly.NET.CSharp, 0.0.1\"\n",
+    "#r \"nuget: Microsoft.ML.AutoML, 0.20.0-preview.22470.1\"\n",
+    "#r \"nuget: Microsoft.Data.Analysis, 0.20.0-preview.22470.1\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;\n",
+    "using Microsoft.Data.Analysis;\n",
+    "using System;\n",
+    "using System.IO;\n",
+    "using Microsoft.ML;\n",
+    "using Microsoft.ML.AutoML;\n",
+    "using Microsoft.ML.AutoML.CodeGen;\n",
+    "using Microsoft.ML.Trainers.LightGbm;\n",
+    "using Microsoft.ML.Data;\n",
+    "using Plotly.NET;\n",
+    "using Microsoft.ML.Transforms.TimeSeries;\n",
+    "using Microsoft.ML.SearchSpace;\n",
+    "using System.Diagnostics;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    }
+   },
+   "source": [
+    "#### Use `Auto().CreateSweepableEstimator` to create `SweepableEstimator`\n",
+    "\n",
+    "A `SweepableEstimator` is simply a regular estimator paired with a `SearchSpace`. The following code shows how to create a sweepable `LightGbm` estimator and a sweepable `SDCA` estimator.\n",
+    "\n",
+    "For simplicity, the built-in search spaces for `LightGbm` and `SDCA` are used, but you can customize the search space however you want. For more details on how to do that, see [Parameter and SearchSpace](./Parameter%20and%20SearchSpace.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var context = new MLContext();\n",
+    "var lgbmSearchSpace = new SearchSpace<LgbmOption>();\n",
+    "var sweepableLgbm = context.Auto().CreateSweepableEstimator((context, param) => {\n",
+    "    var option = new LightGbmRegressionTrainer.Options()\n",
+    "    {\n",
+    "        NumberOfLeaves = param.NumberOfLeaves,\n",
+    "        NumberOfIterations = param.NumberOfTrees,\n",
+    "        MinimumExampleCountPerLeaf = param.MinimumExampleCountPerLeaf,\n",
+    "        LearningRate = param.LearningRate,\n",
+    "        LabelColumnName = \"Label\",\n",
+    "        FeatureColumnName = \"Features\",\n",
+    "        Booster = new GradientBooster.Options()\n",
+    "        {\n",
+    "            SubsampleFraction = param.SubsampleFraction,\n",
+    "            FeatureFraction = param.FeatureFraction,\n",
+    "            L1Regularization = param.L1Regularization,\n",
+    "            L2Regularization = param.L2Regularization,\n",
+    "        },\n",
+    "        MaximumBinCountPerFeature = param.MaximumBinCountPerFeature,\n",
+    "    };\n",
+    "\n",
+    "    return context.Regression.Trainers.LightGbm(option);\n",
+    "}, lgbmSearchSpace);\n",
+    "\n",
+    "var sdcaSearchSpace = new SearchSpace<SdcaOption>();\n",
+    "var sweepableSdca = context.Auto().CreateSweepableEstimator((context, param) => {\n",
+    "    return context.Regression.Trainers.Sdca(\"Label\", \"Features\", l1Regularization: param.L1Regularization, l2Regularization: param.L2Regularization);\n",
+    "}, sdcaSearchSpace);"
+   ]
+  },
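+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The customization mentioned above can be sketched as follows. This is a minimal sketch, assuming that `SearchSpace<T>` can be indexed by option name and that `UniformSingleOption(min, max, logBase, defaultValue)` from `Microsoft.ML.SearchSpace.Option` is available in the installed preview package. The `custom*` names and the range bounds are hypothetical values chosen for illustration, not recommendations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "using Microsoft.ML.SearchSpace.Option;\n",
+    "\n",
+    "// Start from the built-in SDCA search space, then override two ranges.\n",
+    "// Assumption: the bounds and defaults below are illustrative only.\n",
+    "var customSdcaSearchSpace = new SearchSpace<SdcaOption>();\n",
+    "customSdcaSearchSpace[\"L1Regularization\"] = new UniformSingleOption(0.01f, 2.0f, false, 0.01f);\n",
+    "customSdcaSearchSpace[\"L2Regularization\"] = new UniformSingleOption(0.01f, 2.0f, false, 0.01f);\n",
+    "\n",
+    "// Build a sweepable SDCA estimator that samples from the customized search space.\n",
+    "var customSweepableSdca = context.Auto().CreateSweepableEstimator((ctx, param) => {\n",
+    "    return ctx.Regression.Trainers.Sdca(\"Label\", \"Features\", l1Regularization: param.L1Regularization, l2Regularization: param.L2Regularization);\n",
+    "}, customSdcaSearchSpace);"
+   ]
+  },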
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Create `SweepablePipeline` with multiple trainer candidates.\n",
+    "\n",
+    "`SweepablePipeline` lets you supply multiple estimators as candidates for a given stage. During AutoML sweeping, these candidates are evaluated separately and the one with the best metric is picked. Note that a candidate doesn't have to be a trainer: it can be a trainer, a transformer, or even another `SweepablePipeline`, as long as all candidates share the same input and output schema.\n",
+    "\n",
+    "The following code shows how to create a `SweepablePipeline` from the `sweepableSdca` and `sweepableLgbm` estimators created above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var sweepablePipeline = context.Transforms.Concatenate(\"Features\", \"X1\", \"X2\")\n",
+    "                            .Append(sweepableSdca, sweepableLgbm);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Configure `AutoMLExperiment` using `sweepablePipeline`\n",
+    "In the next step, we train `sweepablePipeline` on a generated non-linear dataset using `AutoMLExperiment`, which sweeps over both `sdca` and `lightGbm` within their configured search spaces. Because `sdca` is a linear model and the label is the product of the two features plus noise, the winning model should be `lightGbm`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var rand = new Random(0);\n",
+    "var context = new MLContext(seed: 1);\n",
+    "var x1 = Enumerable.Range(0, 1000).Select(_x => rand.NextSingle() * 100).ToArray();\n",
+    "var x2 = x1.Select(_x => rand.NextSingle() * 100).ToArray();\n",
+    "var y = Enumerable.Zip(x1, x2).Select(_x => _x.Second * _x.First + (rand.NextSingle() - 0.5f) * 10).ToArray();\n",
+    "var df = new DataFrame();\n",
+    "df[\"X1\"] = DataFrameColumn.Create(\"X1\", x1);\n",
+    "df[\"X2\"] = DataFrameColumn.Create(\"X2\", x2);\n",
+    "df[\"Label\"] = DataFrameColumn.Create(\"Label\", y);\n",
+    "var trainTestSplit = context.Data.TrainTestSplit(df);\n",
+    "df.Head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var monitor = new NotebookMonitor(sweepablePipeline);\n",
+    "var experiment = context.Auto().CreateExperiment();\n",
+    "experiment.SetDataset(df, 5)\n",
+    "          .SetPipeline(sweepablePipeline)\n",
+    "          .SetTrainingTimeInSeconds(50)\n",
+    "          .SetRegressionMetric(RegressionMetric.RootMeanSquaredError)\n",
+    "          .SetMonitor(monitor);\n",
+    "\n",
+    "// Configure the visualizer\n",
+    "monitor.SetUpdate(monitor.Display());\n",
+    "\n",
+    "var res = await experiment.RunAsync();\n",
+    "\n",
+    "// Check the type of the last transformer in the chain, which should come from LightGbm\n",
+    "(res.Model as TransformerChain<ITransformer>).Last().GetType()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Use built-in sweepable estimators\n",
+    "\n",
+    "`AutoML` provides built-in sweepable estimator candidates for binary classification, multiclass classification, and regression. For those scenarios, you can use these candidates directly instead of creating a `SweepableEstimator` from scratch."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var regressionTrainerCandidates = context.Auto().Regression();\n",
+    "var binaryClassificationTrainerCandidates = context.Auto().BinaryClassification();\n",
+    "var multiclassClassificationTrainerCandidates = context.Auto().MultiClassification();"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### See also\n",
+    "- [Training and AutoML](./03-Training%20and%20AutoML.ipynb)\n",
+    "- [Regression with Taxi Dataset](./E2E-Regression%20with%20Taxi%20Dataset.ipynb)\n",
+    "- [Classification with Iris Dataset](./E2E-Classification%20with%20Iris%20Dataset.ipynb)\n",
+    "- [Kaggle with Titanic Dataset](./REF-Kaggle%20with%20Titanic%20Dataset.ipynb)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".NET (C#)",
+   "language": "C#",
+   "name": ".net-csharp"
+  },
+  "language_info": {
+   "file_extension": ".cs",
+   "mimetype": "text/x-csharp",
+   "name": "C#",
+   "pygments_lexer": "csharp",
+   "version": "9.0"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}