
Commit 2c632c4

Docs and utils (#14)

updates the documentation and adds some utils

Signed-off-by: peter szemraj <peterszemraj@gmail.com>

1 parent 675b957

File tree: 5 files changed, +314 −38 lines

CHANGELOG.md

Lines changed: 34 additions & 0 deletions

### Changelog

All notable changes to this project will be documented in this file. Dates are displayed in UTC.

Generated by [`auto-changelog`](https://github.com/CookPete/auto-changelog).

#### [v0.1.3](https://github.com/pszemraj/annotated-mpnet/compare/v0.1.2...v0.1.3)

> 26 March 2025

- Improve training config [`#11`](https://github.com/pszemraj/annotated-mpnet/pull/11)
- pad vocab for CUDA [`#10`](https://github.com/pszemraj/annotated-mpnet/pull/10)
- Cleanup extra files [`#9`](https://github.com/pszemraj/annotated-mpnet/pull/9)

#### [v0.1.2](https://github.com/pszemraj/annotated-mpnet/compare/v0.1.1...v0.1.2)

> 5 March 2025

- Save load fix [`#8`](https://github.com/pszemraj/annotated-mpnet/pull/8)
- Speedups [`#4`](https://github.com/pszemraj/annotated-mpnet/pull/4)

#### v0.1.1

> 25 February 2025

- Streaming dataset [`#3`](https://github.com/pszemraj/annotated-mpnet/pull/3)
- Dtype fix [`#2`](https://github.com/pszemraj/annotated-mpnet/pull/2)
- Fixes and updates [`#1`](https://github.com/pszemraj/annotated-mpnet/pull/1)
- Create main.yml [`#3`](https://github.com/pszemraj/annotated-mpnet/pull/3)
- Fix issue with assigning relative attention bias tensor to wrong layer in the HF MPNet model [`#2`](https://github.com/pszemraj/annotated-mpnet/pull/2)
- Initial commit pushing all code and README [`#1`](https://github.com/pszemraj/annotated-mpnet/pull/1)
- Added 3rd party licensing and a Pipfile to make installation as easy as possible [`af5adfb`](https://github.com/pszemraj/annotated-mpnet/commit/af5adfbed1d10c326a97308552c97988dcbbd90f)
- Initial commit [`070c6d1`](https://github.com/pszemraj/annotated-mpnet/commit/070c6d176c1192bce5cb94f712db3b25423bdf05)
- Changed formatting of 3rd party license file slightly [`daa7c51`](https://github.com/pszemraj/annotated-mpnet/commit/daa7c5157c4035303ed80c7c5bb1633d7ab69749)

README.md

Lines changed: 196 additions & 35 deletions

# Annotated MPNet

`annotated-mpnet` provides a lightweight, heavily annotated, and standalone PyTorch implementation for pretraining MPNet models. This project aims to demystify the MPNet pretraining process, which was originally part of the larger `fairseq` codebase, making it more accessible for research and custom pretraining.

## Table of Contents

- [Annotated MPNet](#annotated-mpnet)
  - [Table of Contents](#table-of-contents)
  - [About the Project](#about-the-project)
  - [Key Features](#key-features)
  - [Installation](#installation)
    - [Requirements](#requirements)
  - [Usage](#usage)
    - [Pretraining MPNet](#pretraining-mpnet)
    - [Porting Checkpoint to Hugging Face](#porting-checkpoint-to-hugging-face)
  - [Model Architecture](#model-architecture)
  - [Project Structure](#project-structure)
  - [Changelog](#changelog)
  - [Contributing](#contributing)
  - [License](#license)
  - [Acknowledgements](#acknowledgements)

## About the Project

MPNet ([Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/pdf/2004.09297.pdf), by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu) is a powerful pretraining method that has seen a great deal of success in NLP fine-tuning tasks such as information retrieval and learning-to-rank. However, its original pretraining code is embedded within the `fairseq` library, which can be complex to navigate and adapt, and Hugging Face's MPNet can only be used for fine-tuning, since it does not properly encode two-stream attention. `annotated-mpnet` addresses this by:

- Providing a clean, raw PyTorch implementation of MPNet pretraining.
- Offering extensive annotations and comments throughout the codebase to improve understanding.
- Enabling pretraining without the full `fairseq` dependency, facilitating use on various hardware setups.

**This repository** is a fork and update of the [original by Yext](https://github.com/yext/annotated-mpnet).

## Key Features

- **Standalone PyTorch Implementation**: No `fairseq` dependency required for pretraining.
- **Heavily Annotated Code**: Detailed comments explain the model architecture and training process.
- **Flexible Data Handling**: Supports pretraining with Hugging Face streaming datasets or local text files.
- **Hugging Face Compatibility**: Includes a tool to convert pretrained checkpoints to the Hugging Face `MPNetForMaskedLM` format for easy fine-tuning.
- **Integrated Logging**: Supports TensorBoard and Weights & Biases for experiment tracking.

## Installation

Install as an editable package:

```bash
git clone https://github.com/pszemraj/annotated-mpnet.git
cd annotated-mpnet
pip install -e .
```
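
As an optional sanity check after the editable install, the short sketch below verifies that the package and the console scripts referenced later in this README are importable and on your `PATH`. It is illustrative only; the package name `annotated_mpnet` is assumed from the project layout shown further down.

```python
# optional_install_check.py -- minimal post-install sanity check (illustrative)
import importlib.util
import shutil

# Package name assumed from the annotated_mpnet/ directory in this repository.
if importlib.util.find_spec("annotated_mpnet") is None:
    raise SystemExit("annotated_mpnet is not importable; re-run `pip install -e .`")

# Console scripts used in the Usage section below.
for cli in ("pretrain-mpnet", "convert-to-hf"):
    print(f"{cli}: {shutil.which(cli) or 'not found on PATH'}")
```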

> [!NOTE]
> Pretraining MPNet is computationally intensive and requires a CUDA-enabled GPU. The training script will exit if CUDA is not available.

### Requirements

- Python 3.x
- PyTorch (version >= 2.6.0; CUDA is required for training)
- Hugging Face `transformers` and `datasets`
- `wandb` (for Weights & Biases logging, optional)
- `rich` (for enhanced console logging)
- `numpy`
- `cython`
- `tensorboard` (for logging, optional)

See `setup.py` for a full list of dependencies.

## Usage

### Pretraining MPNet

The primary script for pretraining is `pretrain-mpnet`. You can see all available arguments by running `pretrain-mpnet -h`. Pretraining can be done either from a Hugging Face dataset via streaming or from a local directory of text files.

**1. Using a Hugging Face Dataset (Streaming):**

This method streams data directly from the Hugging Face Hub. Validation and test sets are created by taking initial samples from the training stream.

```bash
pretrain-mpnet \
    --dataset-name "gair-prox/DCLM-pro" \
    --text-field "text" \
    --tokenizer-name "microsoft/mpnet-base" \
    --max-tokens 512 \
    --encoder-layers 12 \
    --encoder-embed-dim 768 \
    --encoder-ffn-dim 3072 \
    --encoder-attention-heads 12 \
    --batch-size 16 \
    --update-freq 8 \
    --lr 6e-4 \
    --warmup-updates 1000 \
    --total-updates 100000 \
    --checkpoint-dir "./checkpoints/my_mpnet_run" \
    --tensorboard-log-dir "./logs/my_mpnet_run" \
    --wandb --wandb-project "annotated-mpnet-experiments" \
    --save_steps 2500
```

Key arguments for streaming (a sketch of the equivalent `datasets` usage follows this list):

- `--dataset-name`: Name of the dataset on the Hugging Face Hub.
- `--text-field`: The column in the dataset containing the text (default: "text").
- `--buffer-size`: Size of the shuffling buffer for streaming (default: 10000).
- `--eval-samples`: Number of samples to take for validation/test sets from the stream (default: 500).
- `--min-text-length`: Minimum length of text samples to consider (default: 64).
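
For orientation, here is a rough sketch of how these streaming flags map onto the Hugging Face `datasets` API. This is illustrative only and is not the repository's code (the actual logic lives in `annotated_mpnet/data/`); the dataset name, text field, and default values are taken from the example above.

```python
# Illustrative sketch of the streaming setup implied by the flags above.
from datasets import load_dataset

stream = load_dataset("gair-prox/DCLM-pro", split="train", streaming=True)

# --buffer-size: approximate shuffling through a fixed-size buffer
stream = stream.shuffle(seed=12345, buffer_size=10_000)

# --min-text-length: drop overly short samples from the --text-field column
stream = stream.filter(lambda ex: len(ex["text"]) >= 64)

# --eval-samples: peel off the first N samples for validation/test,
# then skip them so they are not seen during training
eval_split = stream.take(500)
train_split = stream.skip(500)

for example in train_split.take(3):
    print(example["text"][:80])
```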

**2. Using Local Text Files:**

Provide a directory of training files (one document/sentence per line is typical) and paths to single validation and test files.

```bash
pretrain-mpnet \
    --train-dir "/path/to/your/train_data_directory/" \
    --valid-file "/path/to/your/validation_data.txt" \
    --test-file "/path/to/your/test_data.txt" \
    --tokenizer-name "microsoft/mpnet-base" \
    --max-tokens 512 \
    --batch-size 16 \
    --update-freq 8 \
    --lr 6e-4 \
    --warmup-updates 1000 \
    --total-updates 100000 \
    --checkpoint-dir "./checkpoints/my_local_mpnet_run" \
    --tensorboard-log-dir "./logs/my_local_mpnet_run" \
    --save_steps 2500
```

**Key Pretraining Arguments (Common to both methods):**

- `--tokenizer-name`: Hugging Face tokenizer to use (default: `microsoft/mpnet-base`).
- `--max-tokens`: Maximum sequence length (default: 512). Also sets `--max-positions` if not specified.
- Model architecture:
  - `--encoder-layers` (default: 12)
  - `--encoder-embed-dim` (default: 768)
  - `--encoder-ffn-dim` (default: 3072)
  - `--encoder-attention-heads` (default: 12)
- Training parameters:
  - `--batch-size`: Per-GPU batch size (default: 16).
  - `--update-freq`: Gradient accumulation steps to simulate larger batch sizes (default: 8). Effective batch size = `batch-size * update-freq * num_gpus` (see the sketch after this list).
  - `--lr`: Peak learning rate (default: 6e-4), reached at `--warmup-updates` and then decayed.
  - `--warmup-updates`: Number of steps for LR warmup (default: 10% of `--total-updates`).
  - `--total-updates`: Total number of training updates (default: 10000).
- Logging and saving:
  - `--checkpoint-dir`: Directory to save model checkpoints (default: `./checkpoints`).
  - `--tensorboard-log-dir`: Directory for TensorBoard logs. If unset, logs to console.
  - `--save_steps`: Save a checkpoint every N steps (default: -1, i.e. only best and final).
  - `--wandb`: Enable Weights & Biases logging.
  - `--wandb-project`, `--wandb-name`: W&B project and run name.
- `--compile`: Use `torch.compile()` for the model (experimental, default: False).
- `--seed`: Random seed for reproducibility (default: 12345).
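
To make the batch-size and learning-rate arguments concrete, the sketch below computes the effective batch size and builds a generic warmup-then-decay schedule with stock PyTorch. It is only an illustration of the behaviour described above, not the repository's scheduler (which lives in `annotated_mpnet/scheduler/` and may use a different decay policy).

```python
# Illustrative only: effective batch size and a generic warmup/decay schedule.
import torch
from torch.optim.lr_scheduler import LambdaLR

batch_size, update_freq, num_gpus = 16, 8, 1
effective_batch = batch_size * update_freq * num_gpus  # 128 sequences per update
print(f"effective batch size: {effective_batch}")

peak_lr, warmup_updates, total_updates = 6e-4, 1_000, 100_000
model = torch.nn.Linear(768, 768)  # stand-in for the real encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_scale(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to zero (for illustration)."""
    if step < warmup_updates:
        return step / max(1, warmup_updates)
    remaining = total_updates - step
    return max(0.0, remaining / max(1, total_updates - warmup_updates))

scheduler = LambdaLR(optimizer, lr_lambda=lr_scale)
# Call scheduler.step() once per optimizer update during training.
```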

The script validates the tokenizer. For optimal performance with the default `whole_word_mask=True` in the data collator, a WordPiece-compatible tokenizer is expected.
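
If you are unsure whether a given tokenizer is WordPiece-compatible, a quick check along these lines can help. This is a generic `transformers` snippet, not part of the repository; it assumes a fast tokenizer, since only fast tokenizers expose `backend_tokenizer`.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/mpnet-base")

if tok.is_fast:
    # The underlying `tokenizers` model object, e.g. WordPiece, BPE, or Unigram.
    print(type(tok.backend_tokenizer.model).__name__)
print("mask token:", tok.mask_token, tok.mask_token_id)
```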

### Porting Checkpoint to Hugging Face

After pretraining, convert your checkpoint to the Hugging Face `MPNetForMaskedLM` format using the `convert-to-hf` script. This allows you to load and use your model within the Hugging Face ecosystem.

```bash
convert-to-hf \
    --mpnet-checkpoint-path "./checkpoints/my_mpnet_run/best_checkpoint.pt" \
    --hf-model-folder-path "./my_hf_mpnet_model/"
```

- By default, this script will also save the tokenizer used during pretraining (if its name was stored in the checkpoint args). Use `--no-save-tokenizer` to disable this.
- The output directory (`./my_hf_mpnet_model/`) will contain `pytorch_model.bin`, `config.json`, and tokenizer files (e.g., `tokenizer.json`, `vocab.txt`).
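
Once converted, the folder can be loaded like any other Hugging Face model. A minimal sketch, assuming the output path from the example above and that the tokenizer was saved alongside the weights:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_dir = "./my_hf_mpnet_model/"  # folder produced by convert-to-hf above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)

# Quick smoke test: fill a masked token with the pretrained encoder.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill(f"The capital of France is {tokenizer.mask_token}.")[:3]:
    print(pred["token_str"], round(pred["score"], 4))
```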

## Model Architecture

This repository implements MPNet, which utilizes a **Masked and Permuted Pre-training** objective. The architecture is based on the Transformer model.

- **`MPNetForPretraining`**: This is the main model class defined in `annotated_mpnet/modeling/mpnet_for_pretraining.py`. It encapsulates the encoder and the language modeling head.
- **`SentenceEncoder`**: The core of the model, this is a stack of Transformer encoder layers. It is responsible for generating contextualized representations of the input tokens. Found in `annotated_mpnet/transformer_modules/sentence_encoder.py`.
- **`SentenceEncoderLayer`**: Each layer within the `SentenceEncoder`. It primarily consists of:
  - **`RelativeMultiHeadAttention`**: A multi-head self-attention mechanism that incorporates relative positional information, crucial for MPNet. Defined in `annotated_mpnet/transformer_modules/rel_multihead_attention.py`.
  - Position-wise feed-forward networks (FFN).
  - Layer normalization.
- **Positional Embeddings**: The model uses positional embeddings to provide sequence order information. This implementation supports:
  - `LearnedPositionalEmbedding`: positional embeddings learned during training.
  - `SinusoidalPositionalEmbedding`: fixed positional embeddings based on sine and cosine functions.

  The choice is configurable via `pretrain_mpnet.py` arguments. These modules are found in `annotated_mpnet/transformer_modules/`.
- **Two-Stream Self-Attention**: A key innovation of MPNet. While not a separate module, this mechanism is implemented within the `MPNetForPretraining` forward pass. It allows the model to predict original tokens from a permuted version of the input by using two streams of information (content and query), enabling it to learn bidirectional context without the predicted tokens "seeing themselves" in the non-permuted context.
- **`MPNetLMHead`**: A language modeling head placed on top of the `SentenceEncoder`'s output. It projects the contextual embeddings to the vocabulary space to predict the masked tokens. Defined in `annotated_mpnet/modeling/mpnet_for_pretraining.py`.
- **Normalization Strategy**: The `--normalize-before` flag (default: `False` in `SentenceEncoder`, `True` for `encoder_normalize_before` in `MPNetForPretraining`) controls whether layer normalization is applied before or after sublayer operations (attention and FFN), following common Transformer variations.

The pretraining objective involves predicting original tokens based on a permuted sequence where a subset of tokens has been masked. The permutation helps in learning richer contextual representations compared to standard Masked Language Modeling (MLM).
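
The toy sketch below shows the permute-and-mask idea in isolation: permute the positions, keep the head of the permuted order visible, and ask the model to recover the tokens in the tail. It is only an illustration of the objective described above; the repository's data collator and Cython-accelerated permutation utilities implement the full MPNet scheme, including the two-stream inputs.

```python
# Toy illustration of permuted masking (not the repository's collator).
import torch

torch.manual_seed(0)

tokens = torch.tensor([101, 2054, 2003, 1037, 7328, 999, 102])  # made-up token ids
mask_id = 103                                                   # made-up mask id
pred_ratio = 0.3                                                # fraction of tokens to predict

# 1) Sample a random permutation of the positions.
perm = torch.randperm(tokens.numel())

# 2) The tail of the permuted order becomes the predicted set.
n_pred = max(1, int(tokens.numel() * pred_ratio))
pred_positions = perm[-n_pred:]

# 3) Content input: original tokens, with the predicted positions masked out.
content_input = tokens.clone()
content_input[pred_positions] = mask_id

# 4) Targets: the original tokens at the predicted positions.
targets = tokens[pred_positions]

print("permutation:    ", perm.tolist())
print("predicted slots:", pred_positions.tolist())
print("content input:  ", content_input.tolist())
print("targets:        ", targets.tolist())
```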

## Project Structure

```text
annotated-mpnet/
├── annotated_mpnet/          # Core library code
│   ├── data/                 # Data loading, collation, (HF) streaming dataset
│   ├── modeling/             # MPNetForPretraining model definition
│   ├── scheduler/            # Learning rate scheduler
│   ├── tracking/             # Metrics tracking (AverageMeter)
│   ├── transformer_modules/  # Core Transformer building blocks (attention, layers, embeddings)
│   └── utils/                # Utility functions, including Cython-accelerated permutation
├── cli_tools/                # Command-line interface scripts
│   ├── pretrain_mpnet.py
│   └── convert_pretrained_mpnet_to_hf_model.py
├── tests/                    # Unit tests
├── checkpoints/              # Default directory for saved model checkpoints
├── LICENSE-3RD-PARTY.txt     # Licenses for third-party dependencies
├── README.md                 # This file
├── CHANGELOG.md              # Record of changes
└── setup.py                  # Package setup script
```

## Changelog

All notable changes to this project are documented in [CHANGELOG.md](CHANGELOG.md). The latest version is v0.1.4.

## Contributing

Contributions are welcome! Please consider the following:

- **Reporting Issues**: Use GitHub Issues to report bugs or suggest new features.
- **Pull Requests**: For code contributions, please open a pull request with a clear description of your changes.
- **Running Tests**: Ensure tests pass. You can run tests using:

  ```bash
  python -m unittest discover tests
  ```

## License

The licenses for third-party libraries used in this project are detailed in [LICENSE-3RD-PARTY.txt](LICENSE-3RD-PARTY.txt). The original MPNet code by Microsoft is licensed under the MIT License. Licensing for contributions made within this `annotated-mpnet` repository is determined by its maintainers; refer to any license file provided at the root of the repository.

Note that the detailed line-by-line license information is from the original repo and has not been updated in this fork.

## Acknowledgements

- This work is heavily based on the original MPNet paper and implementation by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu from Microsoft.
- The core Transformer module structures are adapted from the `fairseq` library.
