
Commit 2c632c4

Docs and utils (#14)

updates the documentation and adds some utils

Signed-off-by: peter szemraj <peterszemraj@gmail.com>

1 parent 675b957

File tree: 5 files changed, +314 −38 lines

CHANGELOG.md

Lines changed: 34 additions & 0 deletions

### Changelog

All notable changes to this project will be documented in this file. Dates are displayed in UTC.

Generated by [`auto-changelog`](https://github.com/CookPete/auto-changelog).

#### [v0.1.3](https://github.com/pszemraj/annotated-mpnet/compare/v0.1.2...v0.1.3)

> 26 March 2025

- Improve training config [`#11`](https://github.com/pszemraj/annotated-mpnet/pull/11)
- pad vocab for CUDA [`#10`](https://github.com/pszemraj/annotated-mpnet/pull/10)
- Cleanup extra files [`#9`](https://github.com/pszemraj/annotated-mpnet/pull/9)

#### [v0.1.2](https://github.com/pszemraj/annotated-mpnet/compare/v0.1.1...v0.1.2)

> 5 March 2025

- Save load fix [`#8`](https://github.com/pszemraj/annotated-mpnet/pull/8)
- Speedups [`#4`](https://github.com/pszemraj/annotated-mpnet/pull/4)

#### v0.1.1

> 25 February 2025

- Streaming dataset [`#3`](https://github.com/pszemraj/annotated-mpnet/pull/3)
- Dtype fix [`#2`](https://github.com/pszemraj/annotated-mpnet/pull/2)
- Fixes and updates [`#1`](https://github.com/pszemraj/annotated-mpnet/pull/1)
- Create main.yml [`#3`](https://github.com/pszemraj/annotated-mpnet/pull/3)
- Fix issue with assigning relative attention bias tensor to wrong layer in the HF MPNet model [`#2`](https://github.com/pszemraj/annotated-mpnet/pull/2)
- Initial commit pushing all code and README [`#1`](https://github.com/pszemraj/annotated-mpnet/pull/1)
- Added 3rd party licensing and a Pipfile to make installation as easy as possible [`af5adfb`](https://github.com/pszemraj/annotated-mpnet/commit/af5adfbed1d10c326a97308552c97988dcbbd90f)
- Initial commit [`070c6d1`](https://github.com/pszemraj/annotated-mpnet/commit/070c6d176c1192bce5cb94f712db3b25423bdf05)
- Changed formatting of 3rd party license file slightly [`daa7c51`](https://github.com/pszemraj/annotated-mpnet/commit/daa7c5157c4035303ed80c7c5bb1633d7ab69749)

README.md

Lines changed: 196 additions & 35 deletions

# Annotated MPNet

`annotated-mpnet` provides a lightweight, heavily annotated, and standalone PyTorch implementation for pretraining MPNet models. This project aims to demystify the MPNet pretraining process, which was originally part of the larger `fairseq` codebase, making it more accessible for research and custom pretraining.

## Table of Contents

- [Annotated MPNet](#annotated-mpnet)
  - [Table of Contents](#table-of-contents)
  - [About the Project](#about-the-project)
  - [Key Features](#key-features)
  - [Installation](#installation)
    - [Requirements](#requirements)
  - [Usage](#usage)
    - [Pretraining MPNet](#pretraining-mpnet)
    - [Porting Checkpoint to Hugging Face](#porting-checkpoint-to-hugging-face)
  - [Model Architecture](#model-architecture)
  - [Project Structure](#project-structure)
  - [Changelog](#changelog)
  - [Contributing](#contributing)
  - [License](#license)
  - [Acknowledgements](#acknowledgements)

## About the Project

MPNet ([Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/pdf/2004.09297.pdf), by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu) is a powerful pretraining method that has seen a great deal of success in NLP fine-tuning tasks such as information retrieval and learning-to-rank. However, its original pretraining code is embedded within the `fairseq` library, which can be complex to navigate and adapt, and Hugging Face's MPNet can only be used for fine-tuning, since it does not properly encode two-stream attention. `annotated-mpnet` addresses this by:

- Providing a clean, raw PyTorch implementation of MPNet pretraining.
- Offering extensive annotations and comments throughout the codebase to improve understanding.
- Enabling pretraining without the full `fairseq` dependency, facilitating use on various hardware setups.

**This repository** is a fork and update of the [original by Yext](https://github.com/yext/annotated-mpnet).

## Key Features

- **Standalone PyTorch Implementation**: No `fairseq` dependency required for pretraining.
- **Heavily Annotated Code**: Detailed comments explain the model architecture and training process.
- **Flexible Data Handling**: Supports pretraining with Hugging Face streaming datasets or local text files.
- **Hugging Face Compatibility**: Includes a tool to convert pretrained checkpoints to the Hugging Face `MPNetForMaskedLM` format for easy fine-tuning.
- **Integrated Logging**: Supports TensorBoard and Weights & Biases for experiment tracking.

## Installation

Install as an editable package:

```bash
git clone https://github.com/pszemraj/annotated-mpnet.git
cd annotated-mpnet
pip install -e .
```
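
As an optional sanity check after the editable install, the short sketch below verifies that the package and the console scripts referenced later in this README are importable and on your `PATH`. It is illustrative only; the package name `annotated_mpnet` is assumed from the project layout shown further down.

```python
# optional_install_check.py -- minimal post-install sanity check (illustrative)
import importlib.util
import shutil

# Package name assumed from the annotated_mpnet/ directory in this repository.
if importlib.util.find_spec("annotated_mpnet") is None:
    raise SystemExit("annotated_mpnet is not importable; re-run `pip install -e .`")

# Console scripts used in the Usage section below.
for cli in ("pretrain-mpnet", "convert-to-hf"):
    print(f"{cli}: {shutil.which(cli) or 'not found on PATH'}")
```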

> [!NOTE]
> Pretraining MPNet is computationally intensive and requires a CUDA-enabled GPU. The training script will exit if CUDA is not available.

### Requirements

- Python 3.x
- PyTorch (version >= 2.6.0; CUDA is required for training)
- Hugging Face `transformers` and `datasets`
- `wandb` (for Weights & Biases logging, optional)
- `rich` (for enhanced console logging)
- `numpy`
- `cython`
- `tensorboard` (for logging, optional)

See `setup.py` for a full list of dependencies.

## Usage

### Pretraining MPNet

The primary script for pretraining is `pretrain-mpnet`. You can see all available arguments by running `pretrain-mpnet -h`. Pretraining can be done either from a Hugging Face dataset via streaming or from a local directory of text files.

**1. Using a Hugging Face Dataset (Streaming):**

This method streams data directly from the Hugging Face Hub. Validation and test sets are created by taking initial samples from the training stream.

```bash
pretrain-mpnet \
    --dataset-name "gair-prox/DCLM-pro" \
    --text-field "text" \
    --tokenizer-name "microsoft/mpnet-base" \
    --max-tokens 512 \
    --encoder-layers 12 \
    --encoder-embed-dim 768 \
    --encoder-ffn-dim 3072 \
    --encoder-attention-heads 12 \
    --batch-size 16 \
    --update-freq 8 \
    --lr 6e-4 \
    --warmup-updates 1000 \
    --total-updates 100000 \
    --checkpoint-dir "./checkpoints/my_mpnet_run" \
    --tensorboard-log-dir "./logs/my_mpnet_run" \
    --wandb --wandb-project "annotated-mpnet-experiments" \
    --save_steps 2500
```

Key arguments for streaming (a sketch of the equivalent `datasets` usage follows this list):

- `--dataset-name`: Name of the dataset on the Hugging Face Hub.
- `--text-field`: The column in the dataset containing the text (default: "text").
- `--buffer-size`: Size of the shuffling buffer for streaming (default: 10000).
- `--eval-samples`: Number of samples to take for validation/test sets from the stream (default: 500).
- `--min-text-length`: Minimum length of text samples to consider (default: 64).
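
For orientation, here is a rough sketch of how these streaming flags map onto the Hugging Face `datasets` API. This is illustrative only and is not the repository's code (the actual logic lives in `annotated_mpnet/data/`); the dataset name, text field, and default values are taken from the example above.

```python
# Illustrative sketch of the streaming setup implied by the flags above.
from datasets import load_dataset

stream = load_dataset("gair-prox/DCLM-pro", split="train", streaming=True)

# --buffer-size: approximate shuffling through a fixed-size buffer
stream = stream.shuffle(seed=12345, buffer_size=10_000)

# --min-text-length: drop overly short samples from the --text-field column
stream = stream.filter(lambda ex: len(ex["text"]) >= 64)

# --eval-samples: peel off the first N samples for validation/test,
# then skip them so they are not seen during training
eval_split = stream.take(500)
train_split = stream.skip(500)

for example in train_split.take(3):
    print(example["text"][:80])
```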

**2. Using Local Text Files:**

Provide a directory of training files (one document/sentence per line is typical) and paths to single validation and test files.

```bash
pretrain-mpnet \
    --train-dir "/path/to/your/train_data_directory/" \
    --valid-file "/path/to/your/validation_data.txt" \
    --test-file "/path/to/your/test_data.txt" \
    --tokenizer-name "microsoft/mpnet-base" \
    --max-tokens 512 \
    --batch-size 16 \
    --update-freq 8 \
    --lr 6e-4 \
    --warmup-updates 1000 \
    --total-updates 100000 \
    --checkpoint-dir "./checkpoints/my_local_mpnet_run" \
    --tensorboard-log-dir "./logs/my_local_mpnet_run" \
    --save_steps 2500
```

**Key Pretraining Arguments (Common to both methods):**

- `--tokenizer-name`: Hugging Face tokenizer to use (default: `microsoft/mpnet-base`).
- `--max-tokens`: Maximum sequence length (default: 512). Also sets `--max-positions` if not specified.
- Model architecture:
  - `--encoder-layers` (default: 12)
  - `--encoder-embed-dim` (default: 768)
  - `--encoder-ffn-dim` (default: 3072)
  - `--encoder-attention-heads` (default: 12)
- Training parameters:
  - `--batch-size`: Per-GPU batch size (default: 16).
  - `--update-freq`: Gradient accumulation steps to simulate larger batch sizes (default: 8). Effective batch size = `batch-size * update-freq * num_gpus` (see the sketch after this list).
  - `--lr`: Peak learning rate (default: 6e-4), reached at `--warmup-updates` and then decayed.
  - `--warmup-updates`: Number of steps for LR warmup (default: 10% of `--total-updates`).
  - `--total-updates`: Total number of training updates (default: 10000).
- Logging and saving:
  - `--checkpoint-dir`: Directory to save model checkpoints (default: `./checkpoints`).
  - `--tensorboard-log-dir`: Directory for TensorBoard logs. If unset, logs to console.
  - `--save_steps`: Save a checkpoint every N steps (default: -1, i.e. only best and final).
  - `--wandb`: Enable Weights & Biases logging.
  - `--wandb-project`, `--wandb-name`: W&B project and run name.
- `--compile`: Use `torch.compile()` for the model (experimental, default: False).
- `--seed`: Random seed for reproducibility (default: 12345).
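
To make the batch-size and learning-rate arguments concrete, the sketch below computes the effective batch size and builds a generic warmup-then-decay schedule with stock PyTorch. It is only an illustration of the behaviour described above, not the repository's scheduler (which lives in `annotated_mpnet/scheduler/` and may use a different decay policy).

```python
# Illustrative only: effective batch size and a generic warmup/decay schedule.
import torch
from torch.optim.lr_scheduler import LambdaLR

batch_size, update_freq, num_gpus = 16, 8, 1
effective_batch = batch_size * update_freq * num_gpus  # 128 sequences per update
print(f"effective batch size: {effective_batch}")

peak_lr, warmup_updates, total_updates = 6e-4, 1_000, 100_000
model = torch.nn.Linear(768, 768)  # stand-in for the real encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_scale(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to zero (for illustration)."""
    if step < warmup_updates:
        return step / max(1, warmup_updates)
    remaining = total_updates - step
    return max(0.0, remaining / max(1, total_updates - warmup_updates))

scheduler = LambdaLR(optimizer, lr_lambda=lr_scale)
# Call scheduler.step() once per optimizer update during training.
```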

The script validates the tokenizer. For optimal performance with the default `whole_word_mask=True` in the data collator, a WordPiece-compatible tokenizer is expected.
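
If you are unsure whether a given tokenizer is WordPiece-compatible, a quick check along these lines can help. This is a generic `transformers` snippet, not part of the repository; it assumes a fast tokenizer, since only fast tokenizers expose `backend_tokenizer`.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/mpnet-base")

if tok.is_fast:
    # The underlying `tokenizers` model object, e.g. WordPiece, BPE, or Unigram.
    print(type(tok.backend_tokenizer.model).__name__)
print("mask token:", tok.mask_token, tok.mask_token_id)
```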

### Porting Checkpoint to Hugging Face

After pretraining, convert your checkpoint to the Hugging Face `MPNetForMaskedLM` format using the `convert-to-hf` script. This allows you to load and use your model within the Hugging Face ecosystem.

```bash
convert-to-hf \
    --mpnet-checkpoint-path "./checkpoints/my_mpnet_run/best_checkpoint.pt" \
    --hf-model-folder-path "./my_hf_mpnet_model/"
```

- By default, this script will also save the tokenizer used during pretraining (if its name was stored in the checkpoint args). Use `--no-save-tokenizer` to disable this.
- The output directory (`./my_hf_mpnet_model/`) will contain `pytorch_model.bin`, `config.json`, and tokenizer files (e.g., `tokenizer.json`, `vocab.txt`).
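
Once converted, the folder can be loaded like any other Hugging Face model. A minimal sketch, assuming the output path from the example above and that the tokenizer was saved alongside the weights:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_dir = "./my_hf_mpnet_model/"  # folder produced by convert-to-hf above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)

# Quick smoke test: fill a masked token with the pretrained encoder.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill(f"The capital of France is {tokenizer.mask_token}.")[:3]:
    print(pred["token_str"], round(pred["score"], 4))
```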

## Model Architecture

This repository implements MPNet, which utilizes a **Masked and Permuted Pre-training** objective. The architecture is based on the Transformer model.

- **`MPNetForPretraining`**: This is the main model class defined in `annotated_mpnet/modeling/mpnet_for_pretraining.py`. It encapsulates the encoder and the language modeling head.
- **`SentenceEncoder`**: The core of the model, this is a stack of Transformer encoder layers. It is responsible for generating contextualized representations of the input tokens. Found in `annotated_mpnet/transformer_modules/sentence_encoder.py`.
- **`SentenceEncoderLayer`**: Each layer within the `SentenceEncoder`. It primarily consists of:
  - **`RelativeMultiHeadAttention`**: A multi-head self-attention mechanism that incorporates relative positional information, crucial for MPNet. Defined in `annotated_mpnet/transformer_modules/rel_multihead_attention.py`.
  - Position-wise feed-forward networks (FFN).
  - Layer normalization.
- **Positional Embeddings**: The model uses positional embeddings to provide sequence order information. This implementation supports:
  - `LearnedPositionalEmbedding`: positional embeddings learned during training.
  - `SinusoidalPositionalEmbedding`: fixed positional embeddings based on sine and cosine functions.

  The choice is configurable via `pretrain_mpnet.py` arguments. These modules are found in `annotated_mpnet/transformer_modules/`.
- **Two-Stream Self-Attention**: A key innovation of MPNet. While not a separate module, this mechanism is implemented within the `MPNetForPretraining` forward pass. It allows the model to predict original tokens from a permuted version of the input by using two streams of information (content and query), enabling it to learn bidirectional context without the predicted tokens "seeing themselves" in the non-permuted context.
- **`MPNetLMHead`**: A language modeling head placed on top of the `SentenceEncoder`'s output. It projects the contextual embeddings to the vocabulary space to predict the masked tokens. Defined in `annotated_mpnet/modeling/mpnet_for_pretraining.py`.
- **Normalization Strategy**: The `--normalize-before` flag (default: `False` in `SentenceEncoder`, `True` for `encoder_normalize_before` in `MPNetForPretraining`) controls whether layer normalization is applied before or after sublayer operations (attention and FFN), following common Transformer variations.

The pretraining objective involves predicting original tokens based on a permuted sequence where a subset of tokens has been masked. The permutation helps in learning richer contextual representations compared to standard Masked Language Modeling (MLM).
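
The toy sketch below shows the permute-and-mask idea in isolation: permute the positions, keep the head of the permuted order visible, and ask the model to recover the tokens in the tail. It is only an illustration of the objective described above; the repository's data collator and Cython-accelerated permutation utilities implement the full MPNet scheme, including the two-stream inputs.

```python
# Toy illustration of permuted masking (not the repository's collator).
import torch

torch.manual_seed(0)

tokens = torch.tensor([101, 2054, 2003, 1037, 7328, 999, 102])  # made-up token ids
mask_id = 103                                                   # made-up mask id
pred_ratio = 0.3                                                # fraction of tokens to predict

# 1) Sample a random permutation of the positions.
perm = torch.randperm(tokens.numel())

# 2) The tail of the permuted order becomes the predicted set.
n_pred = max(1, int(tokens.numel() * pred_ratio))
pred_positions = perm[-n_pred:]

# 3) Content input: original tokens, with the predicted positions masked out.
content_input = tokens.clone()
content_input[pred_positions] = mask_id

# 4) Targets: the original tokens at the predicted positions.
targets = tokens[pred_positions]

print("permutation:    ", perm.tolist())
print("predicted slots:", pred_positions.tolist())
print("content input:  ", content_input.tolist())
print("targets:        ", targets.tolist())
```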

## Project Structure

```text
annotated-mpnet/
├── annotated_mpnet/          # Core library code
│   ├── data/                 # Data loading, collation, (HF) streaming dataset
│   ├── modeling/             # MPNetForPretraining model definition
│   ├── scheduler/            # Learning rate scheduler
│   ├── tracking/             # Metrics tracking (AverageMeter)
│   ├── transformer_modules/  # Core Transformer building blocks (attention, layers, embeddings)
│   └── utils/                # Utility functions, including Cython-accelerated permutation
├── cli_tools/                # Command-line interface scripts
│   ├── pretrain_mpnet.py
│   └── convert_pretrained_mpnet_to_hf_model.py
├── tests/                    # Unit tests
├── checkpoints/              # Default directory for saved model checkpoints
├── LICENSE-3RD-PARTY.txt     # Licenses for third-party dependencies
├── README.md                 # This file
├── CHANGELOG.md              # Record of changes
└── setup.py                  # Package setup script
```

## Changelog

All notable changes to this project are documented in [CHANGELOG.md](CHANGELOG.md). The latest version is v0.1.4.

## Contributing

Contributions are welcome! Please consider the following:

- **Reporting Issues**: Use GitHub Issues to report bugs or suggest new features.
- **Pull Requests**: For code contributions, please open a pull request with a clear description of your changes.
- **Running Tests**: Ensure tests pass. You can run tests using:

  ```bash
  python -m unittest discover tests
  ```

## License

The licenses for third-party libraries used in this project are detailed in [LICENSE-3RD-PARTY.txt](LICENSE-3RD-PARTY.txt). The original MPNet code by Microsoft is licensed under the MIT License. Licensing for contributions made within this `annotated-mpnet` repository is determined by its maintainers; refer to any license file provided at the root of the repository.

Note that the detailed line-by-line license information is from the original repo and has not been updated in this fork.

## Acknowledgements

- This work is heavily based on the original MPNet paper and implementation by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu from Microsoft.
- The core Transformer module structures are adapted from the `fairseq` library.
