- Implementing a Decoder-Only Transformer: The goal of this assignment is to develop a decoder-only transformer language model from scratch (a minimal block sketch follows this list).
- Training and Inference Enhancements: Beam Search Decoding, KV Caching, Gradient Accumulation, and Gradient Checkpointing (a KV-caching sketch also follows below).
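To make the first goal concrete, here is a minimal sketch of a single decoder block in PyTorch. It leans on nn.MultiheadAttention for brevity (the assignment itself implements attention from scratch), and the class and argument names are illustrative rather than taken from the repository. The width of 296 follows the parameter breakdown further down: 300 does not split evenly across 8 heads, so the 300-d FastText embeddings are projected to 296.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative decoder-only block (post-norm, dropout omitted)."""
    def __init__(self, d_model=296, num_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: True marks the positions a query must NOT attend to,
        # i.e. everything to the right of the current token.
        T = x.size(1)
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        x = self.norm1(x + attn_out)      # residual + norm around attention
        x = self.norm2(x + self.ff(x))    # residual + norm around feed-forward
        return x
```

The full model stacks num_layers = 3 of these blocks between the embedding/projection layers and the output linear layer counted below.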
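Of the enhancements listed above, KV caching changes the inference loop the most: once the prompt has been processed, each decoding step feeds only the newest token and reuses the cached keys and values. The sketch below assumes a hypothetical model interface model(ids, past_kv=...) that returns logits together with the updated per-layer cache; the repository's actual interface may differ.

```python
import torch

@torch.no_grad()
def greedy_decode_with_cache(model, input_ids, max_new_tokens):
    """Greedy decoding with a KV cache (model interface is assumed, see above)."""
    past_kv = None              # per-layer (K, V) tensors, filled on the first step
    generated = input_ids
    step_input = input_ids      # the full prompt is processed only once
    for _ in range(max_new_tokens):
        logits, past_kv = model(step_input, past_kv=past_kv)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        step_input = next_token  # the cache makes re-feeding the prefix unnecessary
    return generated
```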
- vocab_size = 10000
- d_model = 300
- num_layers = 3
- num_heads = 8
- d_ff = 1024
- max_seq_length = 64
- batch_size = 32
- learning_rate = 3e-4
- num_epochs = 3
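The same hyperparameters, collected into a single config object for reference; the field names are illustrative and may not match those used in transformer_model.py.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Values copied from the list above; names are illustrative.
    vocab_size: int = 10000
    d_model: int = 300          # FastText embedding size; projected to 296 inside the model
    num_layers: int = 3
    num_heads: int = 8
    d_ff: int = 1024
    max_seq_length: int = 64
    batch_size: int = 32
    learning_rate: float = 3e-4
    num_epochs: int = 3
```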
Input Embedding:
- Embedding matrix: 10000 × 300 = 3,000,000
- Projection layer (300 → 296, so the model width divides evenly across the 8 attention heads): 300 × 296 + 296 = 89,096
- Input embedding total: 3,089,096

Per block:
- Multi-head Attention:
  - Q, K, V projections: 3 × (296 × 296 + 296) = 263,736
  - Output projection: 296 × 296 + 296 = 87,912
  - Attention total: 263,736 + 87,912 = 351,648
- Feed Forward:
  - First linear: 296 × 1024 + 1024 = 304,128
  - Second linear: 1024 × 296 + 296 = 303,400
  - FF total: 304,128 + 303,400 = 607,528
- Layer Norms (2 per block): 2 × (296 + 296) = 1,184
- Per-block total: 351,648 + 607,528 + 1,184 = 960,360
- 3 blocks total: 3 × 960,360 = 2,881,080

Output Layers:
- Final LayerNorm: 296 + 296 = 592
- Output linear: 296 × 10000 + 10000 = 2,970,000
- Output layers total: 592 + 2,970,000 = 2,970,592

Input Embedding: 3,089,096
Transformer Blocks: 2,881,080
Output Layers: 2,970,592
────────────────────────────────
TOTAL: 8,940,768 parameters
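The total can be re-derived with a short script (bias terms are included throughout, matching the breakdown above):

```python
def count_params(vocab_size=10000, emb_dim=300, d_model=296,
                 num_layers=3, d_ff=1024):
    """Recompute the parameter breakdown above; every linear layer has a bias."""
    embedding = vocab_size * emb_dim                          # 3,000,000
    projection = emb_dim * d_model + d_model                  # 89,096
    attention = 4 * (d_model * d_model + d_model)             # Q, K, V + output: 351,648
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # 607,528
    layer_norms = 2 * 2 * d_model                             # 1,184
    per_block = attention + feed_forward + layer_norms        # 960,360
    final_norm = 2 * d_model                                  # 592
    output_linear = d_model * vocab_size + vocab_size         # 2,970,000
    return (embedding + projection
            + num_layers * per_block
            + final_norm + output_linear)

print(count_params())  # 8,940,768
```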
# Clone the repository and install dependencies
$ git clone https://github.com/lohar-animesh-27112001/ELL881-advance_LLM-assignment.git
$ cd ELL881-advance_LLM-assignment
$ pip install -r requirements.txt

# Part I: run fasttext_model.py so that layers/cc.en.300.bin is available,
# copy it next to transformer_model.py, then train
$ cd part-i/layers
$ python fasttext_model.py
$ cd ..
$ cp layers/cc.en.300.bin .
$ python transformer_model.py

# Part II: reuse the same cc.en.300.bin for the FastText-embedding variant
$ cd ..
$ cp part-i/cc.en.300.bin part-ii/transformer_model-with_fasttext_embeddings/
$ cd part-ii/transformer_model-with_fasttext_embeddings/
$ python transformer_model.py