
Commit 74ca846

docs: add new span classifier training tutorial and reorganize deep-learning docs
1 parent 46248bf commit 74ca846


71 files changed: +739 −39 lines changed

changelog.md

Lines changed: 1 addition & 0 deletions

@@ -9,6 +9,7 @@
 - Parquet writer now has a `pyarrow_write_kwargs` to pass to [pyarrow.dataset.write_dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow-dataset-write-dataset)
 - LinearSchedule (mostly used for LR scheduling) now allows a `end_value` parameter to configure if the learning rate should decay to zero or another value.
 - New `eds.explode` pipe that splits one document into multiple documents, one per span yielded by its `span_getter` parameter, each new document containing exactly that single span.
+- New `Training a span classifier` tutorial, and reorganized deep-learning docs

 ## Fixed

docs/assets/overrides/main.html

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 {% extends "base.html" %}

 {% block announce %}
-Check out the new <a href="/tutorials/training">Model Training tutorial</a> !
+Check out the new <a href="/tutorials/training-span-classifier">span classifier training tutorial</a> !
 {% endblock %}

docs/training/training-api.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Training API

Under the hood, EDS-NLP uses PyTorch to train and run deep-learning models. EDS-NLP acts as a sidekick to PyTorch, providing a set of tools to perform preprocessing, composition and evaluation. The trainable [`TorchComponents`][edsnlp.core.torch_component.TorchComponent] are actually PyTorch modules with a few extra methods to handle feature preprocessing and postprocessing. Therefore, EDS-NLP is fully compatible with the PyTorch ecosystem.
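For instance, here is a minimal sketch of this compatibility; the component and its arguments are illustrative choices, not requirements:

```python
import torch

import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(  # a trainable NER component; arguments are illustrative
        mode="joint",
        target_span_getter="gold_spans",
        embedding=eds.transformer(model="prajjwal1/bert-tiny", window=128, stride=96),
    ),
    name="ner",
)

# Trainable components are regular PyTorch modules,
# so the usual PyTorch tooling applies to them
print(isinstance(nlp.pipes.ner, torch.nn.Module))  # True
print(sum(p.numel() for p in nlp.pipes.ner.parameters()))
```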
To build and train a deep learning model, you can either build a training script from scratch (check out the [*Make a training script*](/tutorials/make-a-training-script) tutorial), or use the provided training API. The training API is designed to be flexible and can handle various types of models, including Named Entity Recognition (NER) models, span classifiers, and more. However, if you need more control over the training process, consider writing your own training script.

EDS-NLP supports training models either from the command line or from a Python script or notebook, and switching between the two is relatively straightforward thanks to the use of [Confit](https://aphp.github.io/confit/).
??? note "A word about Confit"

    EDS-NLP makes heavy use of [Confit](https://aphp.github.io/confit/), a configuration library that allows you to call functions from Python or the CLI, and validate and optionally cast their arguments.

    The EDS-NLP function described on this page is the `train` function of the `edsnlp.train` module. When passing a dict to a type-hinted argument (either from a `config.yml` file, or by calling the function in Python), Confit will instantiate the correct class with the arguments provided in the dict. For instance, we pass a dict to the `train_data` parameter, which is actually type-hinted as a `TrainingData`: this dict will be used as keyword arguments to instantiate this `TrainingData` object. You can also instantiate a `TrainingData` object directly and pass it to the function.

    You can also tell Confit specifically which class you want to instantiate by using the `@register_name = "name_of_the_registered_class"` key and value in a dict or config section. We make heavy use of this mechanism to build pipeline architectures.
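For instance, the two calls below are equivalent ways of passing the training data (a minimal sketch: other required arguments are elided, and the data path is illustrative):

```python
import edsnlp
from edsnlp.training import TrainingData, train

nlp = ...  # a pipeline with at least one trainable component
docs = edsnlp.data.read_standoff("dataset/train")  # illustrative path

# 1. Pass a plain dict: Confit validates it and turns it into a TrainingData
nlp = train(nlp=nlp, train_data={"data": docs, "batch_size": "2000 words"})

# 2. Or instantiate the TrainingData object yourself and pass it directly
nlp = train(nlp=nlp, train_data=TrainingData(data=docs, batch_size="2000 words"))
```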
## How it works

To train a model with EDS-NLP, you need the following ingredients (a minimal sketch putting them together follows the list):
- **Pipeline**: a [pipeline][edsnlp.core.pipeline.Pipeline] with at least one trainable component. Components that share parameters or that must be updated together are trained in the same phase.

- **Training streams**: one or more streams of documents, each wrapped in a `TrainingData` object. Each of these specifies how to shuffle the stream, how to batch it with a stat expression such as `2000 words` or `16 spans`, whether to split batches into sub-batches for gradient accumulation, and which components it feeds.

- **Validation streams**: optional streams of documents used for periodic evaluation.

- **Scorer**: a [scorer][edsnlp.training.trainer.GenericScorer] that defines the metrics to compute on the validation set. By default, it reports speed and uses autocast during scoring unless disabled.

- **Optimizer**: an [optimizer][edsnlp.training.optimizer.ScheduledOptimizer]. Defaults to AdamW with linear warmup and two parameter groups: one for the transformer with a learning rate of 5e-5, and one for the rest of the model with a learning rate of 3e-4.

- **A bunch of hyperparameters**: finally, the function expects various hyperparameters (most of them set to sensible defaults), such as `max_steps`, `seed`, `validation_interval`, `checkpoint_interval`, `grad_max_norm`, and more.
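Here is the sketch announced above, putting these ingredients together from Python. Everything in it (component choice, paths, batch expression, hyperparameter values) is an illustrative assumption, and the optimizer and scorer are left to the defaults described in the list:

```python
import edsnlp
import edsnlp.pipes as eds
from edsnlp.training import TrainingData, train

# Ingredient 1: a pipeline with a trainable component
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.ner_crf(
        mode="joint",
        target_span_getter="gold_spans",
        embedding=eds.transformer(model="prajjwal1/bert-tiny", window=128, stride=96),
    ),
    name="ner",
)

# Ingredients 2 and 3: training and validation streams (paths are illustrative)
train_docs = edsnlp.data.read_standoff("dataset/train")
val_docs = edsnlp.data.read_standoff("dataset/val")

# Remaining ingredients: default scorer and optimizer, a few hyperparameters
nlp = train(
    nlp=nlp,
    train_data=TrainingData(
        data=train_docs,
        batch_size="2000 words",  # stat expression used to build batches
        shuffle="dataset",        # reshuffle the whole dataset between epochs
        pipe_names=["ner"],       # components fed by this stream
    ),
    val_data=val_docs,
    max_steps=2000,
    validation_interval=200,
    grad_max_norm=1.0,
    output_dir="artifacts",
)
```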
The training then proceeds in several steps:
**Setup**
The function prepares the device with [Accelerate](https://huggingface.co/docs/accelerate/index), creates the output folders, materializes the validation set from the user-provided stream, and runs a post-initialization pass on the training data when requested. This `post_init` step lets the pipeline inspect the data before training, for instance to adjust the number of classification heads to the labels encountered. Finally, the optimizer is instantiated.
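Assuming a stream of annotated `train_docs` as in the sketch above, this post-initialization pass boils down to something like:

```python
# Let trainable components infer their label space from the data
# before the optimizer is created (illustrative sketch)
nlp.post_init(train_docs)
```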
**Phases**
Training runs **by phases**. A phase groups components that should be optimized together because they share parameters (think for instance of a BERT encoder shared between multiple models). During a phase, losses are computed for each of these "active" components at each step, and only their parameters are updated.
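For instance, in the sketch below, the two components share the same transformer instance, so they would be trained together in a single phase. As everywhere on this page, the exact arguments are illustrative assumptions:

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
# One transformer instance, reused by both components
shared = eds.transformer(model="prajjwal1/bert-tiny", window=128, stride=96)
nlp.add_pipe(
    eds.ner_crf(mode="joint", target_span_getter="gold_spans", embedding=shared),
    name="ner",
)
nlp.add_pipe(
    eds.span_classifier(  # arguments are illustrative
        embedding=eds.span_pooler(embedding=shared),
        span_getter="ents",
        attributes={"_.negation": [True, False]},
    ),
    name="qualifier",
)
```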
**Data preparation**
Each `TrainingData` object turns its stream of documents into device-ready batches. It optionally shuffles the stream, preprocesses the documents for the active components, builds stat-aware batches (for instance, limiting the number of tokens per batch), optionally splits batches into sub-batches for gradient accumulation, then converts everything into device-ready tensors. This can happen in parallel with the actual deep-learning work.
**Optimization**
For every training step, the function draws one batch from each training stream (in case there are more than one) and synchronizes statistics across processes (in case we're doing multi-GPU training) to keep supports and losses consistent. It runs forward passes for the phase components; when several components reuse the same intermediate features, a cache avoids recomputation. Gradients are accumulated over sub-batches.
**Gradient safety**
Gradients are always clipped to `grad_max_norm`. Optionally, the function tracks an exponential moving mean and variance of the gradient norm. If a spike is detected, the update can be clipped to the running mean, clipped to a threshold, or skipped entirely, depending on `grad_dev_policy`. This protects training from rare extreme updates.
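The following is a generic PyTorch sketch of the idea, not EDS-NLP's actual implementation: track an exponential moving mean and variance of the gradient norm, flag spikes, and let a policy decide what to do with them:

```python
import torch

def handle_gradient_spike(parameters, max_norm, state, beta=0.99, n_sigmas=4.0):
    """Clip gradients and flag abnormally large gradient norms.

    Generic illustration of the mechanism described above; the real
    policy options are controlled by the `grad_dev_policy` parameter.
    """
    norm = float(torch.nn.utils.clip_grad_norm_(parameters, max_norm))
    if "mean" in state:
        std = (state["var"] + 1e-12) ** 0.5
        is_spike = abs(norm - state["mean"]) > n_sigmas * std
        # update the running moments
        delta = norm - state["mean"]
        state["mean"] += (1 - beta) * delta
        state["var"] = beta * (state["var"] + (1 - beta) * delta * delta)
        return is_spike  # caller may re-clip to state["mean"] or skip the step
    state["mean"], state["var"] = norm, 0.0
    return False
```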
**Validation and logging**
At regular intervals, the scorer evaluates the pipeline on the validation documents. It isolates each task by copying docs and disabling unrelated pipes to avoid leakage, and it reports throughput and metrics for NER and span attribute classifiers, plus any custom metrics.
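A scorer configuration might look like the sketch below; the metric class and its parameters are assumptions for the example, so check the metrics documentation for the actual names:

```python
from edsnlp.metrics.ner import NerExactMetric  # assumed metric class
from edsnlp.training import GenericScorer

scorer = GenericScorer(
    speed=True,  # also report throughput during scoring
    ner=NerExactMetric(span_getter="gold_spans"),
)
```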
**Checkpoints and output**
The model is saved on schedule and at the end in `output_dir/model-last`, unless saving is disabled.
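Afterwards, the saved model can be reloaded like any other EDS-NLP pipeline (assuming the default `output_dir` shown above):

```python
import edsnlp

nlp = edsnlp.load("artifacts/model-last")
doc = nlp("Text to annotate.")
```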
## Tutorials and examples

--8<-- "docs/tutorials/index.md:deep-learning-tutorials"
## Parameters of `edsnlp.train` {: #edsnlp.training.trainer.train }

Here are the parameters you can pass to the `train` function:

::: edsnlp.training.trainer.train
    options:
        heading_level: 4
        only_parameters: no-header
        skip_parameters: []
        show_source: false
        show_toc: false

docs/tutorials/index.md

Lines changed: 26 additions & 7 deletions

@@ -2,7 +2,9 @@

 We provide step-by-step guides to get you started. We cover the following use-cases:

-<!-- --8<-- [start:tutorials] -->
+### Base tutorials
+
+<!-- --8<-- [start:classic-tutorials] -->

 === card {: href=/tutorials/spacy101 }

@@ -83,21 +85,35 @@ We provide step-by-step guides to get you started. We cover the following use-ca
 ---
 Quickly visualize the results of your pipeline as annotations or tables.

+### Deep learning tutorials
+
+We also provide tutorials on how to train deep-learning models with EDS-NLP. These tutorials cover the training API, hyperparameter tuning, and more.
+
+<!-- --8<-- [start:deep-learning-tutorials] -->
+
 === card {: href=/tutorials/make-a-training-script }

 :fontawesome-solid-flask:
-**Deep learning tutorial**
+**Writing a training script**
+
+---
+Learn how EDS-NLP handles training deep neural networks, and how to write your own training script.
+
+=== card {: href=/tutorials/training-ner }
+
+:fontawesome-solid-highlighter:
+**Training a NER model**

 ---
-Learn how EDS-NLP handles training deep-neural networks.
+Learn how to quickly train a NER model with `edsnlp.train`.

-=== card {: href=/tutorials/training }
+=== card {: href=/tutorials/training-span-classifier }

-:fontawesome-solid-brain:
-**Training API**
+:fontawesome-solid-circle-check:
+**Training a Span Classifier model**

 ---
-Learn how to quicky train a deep-learning model with `edsnlp.train`.
+Learn how to quickly train a biopsy date classifier model with `edsnlp.train`.

 === card {: href=/tutorials/tuning }

@@ -108,4 +124,7 @@ We provide step-by-step guides to get you started. We cover the following use-ca
 Learn how to tune hyperparameters of a model with `edsnlp.tune`.


+<!-- --8<-- [end:deep-learning-tutorials] -->
+
+
 <!-- --8<-- [end:tutorials] -->

docs/tutorials/make-a-training-script.md

Lines changed: 3 additions & 3 deletions

@@ -1,8 +1,8 @@
-# Deep-learning tutorial
+# Writing a training script

 In this tutorial, we'll see how we can write our own deep learning model training script with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.

-If you do not care about the details and just want to train a model, we suggest you to use the [training API](/tutorials/training) and move on to the next tutorial.
+If you do not care about the details and just want to train a model, we suggest that you use the [training API](/training/training-api) and move on to the [next tutorial](/tutorials/training-ner).

 !!! warning "Hardware requirements"

@@ -440,7 +440,7 @@ python train.py --config config.cfg --nlp.components.ner.embedding.embedding.tra

 ## Going further

-EDS-NLP also provides a generic training script that follows the same structure as the one we just wrote. You can learn more about in the [next Training API tutorial](/tutorials/training).
+EDS-NLP also provides a generic training script that follows the same structure as the one we just wrote. You can learn more about it in the [next tutorial, on training a NER model with the EDS-NLP training API](/tutorials/training-ner).

 This tutorial gave you a glimpse of the training API of EDS-NLP. To build a custom trainable component, you can refer to the [TorchComponent][edsnlp.core.torch_component.TorchComponent] class or look up the implementation of [some of the trainable components on GitHub](https://github.com/aphp/edsnlp/tree/master/edsnlp/pipes/trainable).
Lines changed: 8 additions & 24 deletions

@@ -1,12 +1,12 @@
-# Training API {: #edsnlp.training.trainer.train }
+# Training a NER model

 In this tutorial, we'll see how we can quickly train a deep learning model with EDS-NLP using the `edsnlp.train` function.

 !!! warning "Hardware requirements"

-    Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like [Google Colab](https://colab.research.google.com/), [Kaggle](https://www.kaggle.com/), [Paperspace](https://www.paperspace.com/) or [Vast.ai](https://vast.ai/).
+    Training modern deep-learning models is compute-intensive. A GPU with **≥ 16 GB VRAM** is recommended. Training on CPU is possible but much slower. On macOS, PyTorch's MPS backend may not support all operations and you'll likely hit `NotImplementedError` messages: in this case, fall back to CPU using the `cpu=True` option.

-    If you need a high level of control over the training procedure, we suggest you read the previous ["Deep learning tutorial"](./make-a-training-script.md) to understand how to build a training loop from scratch with EDS-NLP.
+    This tutorial uses EDS-NLP's command-line interface, `python -m edsnlp.train`. If you need fine-grained control over the loop, consider [**writing your own training script**](./make-a-training-script.md).

 ## Creating a project

@@ -66,13 +66,7 @@ uv pip install -e ".[dev]" -p $(uv python find)

 EDS-NLP supports training models either [from the command line](#from-the-command-line) or [from a Python script or notebook](#from-a-script-or-a-notebook), and switching between the two is straightforward thanks to the use of [Confit](https://aphp.github.io/confit/).

-??? note "A word about Confit"
-
-    EDS-NLP makes heavy use of [Confit](https://aphp.github.io/confit/), a configuration library that allows you call functions from Python or the CLI, and validate and optionally cast their arguments.
-
-    The EDS-NLP function used in this script is the `train` function of the `edsnlp.train` module. When passing a dict to a type-hinted argument (either from a `config.yml` file, or by calling the function in Python), Confit will instantiate the correct class with the arguments provided in the dict. For instance, we pass a dict to the `val_data` parameter, which is actually type hinted as a `SampleGenerator`: this dict will actually be used as keyword arguments to instantiate this `SampleGenerator` object. You can also instantiate a `SampleGenerator` object directly and pass it to the function.
-
-    You can also tell Confit specifically which class you want to instantiate by using the `@register_name = "name_of_the_registered_class"` key and value in a dict or config section. We make a heavy use of this mechanism to build pipeline architectures.
+Visit the [`edsnlp.train` documentation][edsnlp.training.trainer.train] for a list of all the available options.

 === "From the command line"

@@ -170,7 +164,7 @@ EDS-NLP supports training models either [from the command line](#from-the-comman
 - '@factory': eds.standoff_dict2doc
 span_setter: 'gold_spans'

-logger:
+loggers:
 - '@loggers': csv
 - '@loggers': rich
 fields:

@@ -206,7 +200,7 @@ EDS-NLP supports training models either [from the command line](#from-the-comman
 grad_max_norm: 1.0
 scorer: ${ scorer }
 optimizer: ${ optimizer }
-logger: ${ logger }
+logger: ${ loggers }
 # Do preprocessing in parallel on 1 worker
 num_workers: 1
 # Enable on Mac OS X or if you don't want to use available GPUs

@@ -297,7 +291,7 @@ EDS-NLP supports training models either [from the command line](#from-the-comman
 )

 #
-logger = [
+loggers = [
 CSVLogger(),
 RichLogger(
 fields={

@@ -328,7 +322,7 @@ EDS-NLP supports training models either [from the command line](#from-the-comman
 optimizer=optimizer,
 grad_max_norm=1.0,
 output_dir="artifacts",
-loggers
+logger=loggers,
 # Do preprocessing in parallel on 1 worker
 num_workers=1,
 # Enable on Mac OS X or if you don't want to use available GPUs

@@ -349,16 +343,6 @@ cfg = confit.Config.from_disk(
 nlp = train(**cfg["train"])
 ```

-Here are the parameters you can pass to the `train` function:
-
-::: edsnlp.training.trainer.train
-options:
-heading_level: 4
-only_parameters: true
-skip_parameters: []
-show_source: false
-show_toc: false
-
 ## Use the model

 You can now load the model and use it to process some text:
