Implementation of the HOPE architecture based on:
- Nested Learning: The Illusion of Deep Learning Architectures
- Titans: Learning to Memorize at Test Time
- MIRAS: Memory Is All You Need
HOPE combines core components from the Nested Learning and Titans papers, plus the MIRAS unified framework:
- Self-Modifying Titans: Memory attention with delta rule updates (Eq. 28-29)
- Continuum Memory System (CMS): Multi-frequency FFN chain (Eq. 30-31)
- MIRAS Framework: Unified memory system with configurable attentional bias and retention gates
The self-modifying memory is updated with a delta rule:

```
M_{t+1} = M_t - M_t * k_t * k_t^T - eta * (M_t * k_t - v_t) * k_t^T
```

Where:
- `M_t * k_t * k_t^T` is the forgetting term: it removes the old association stored for key `k_t`
- `eta * (M_t * k_t - v_t) * k_t^T` is the learning term: a gradient-descent step on the L2 loss `0.5 * ||M_t k_t - v_t||^2`
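A minimal PyTorch sketch of this update, purely to illustrate the equation (the repository's version lives in `src/layers/associative_memory.py`; the tensor shapes here are assumptions):

```python
import torch

def delta_rule_update(M, k, v, eta=0.1):
    """One step of the delta-rule update above.

    M: (d_v, d_k) memory matrix, k: (d_k,) key, v: (d_v,) value.
    Assumes k is (approximately) unit-norm.
    """
    pred = M @ k                            # current recall: M_t k_t
    forget = torch.outer(pred, k)           # M_t k_t k_t^T: erase the old association for k_t
    learn = eta * torch.outer(pred - v, k)  # gradient step on 0.5 * ||M_t k_t - v_t||^2
    return M - forget - learn

# Store a value under a key, then recall it.
d_k, d_v = 8, 8
M = torch.zeros(d_v, d_k)
k = torch.nn.functional.normalize(torch.randn(d_k), dim=0)
v = torch.randn(d_v)
M = delta_rule_update(M, k, v, eta=1.0)
print(torch.allclose(M @ k, v, atol=1e-5))  # True: recall returns v
```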
The MIRAS framework unifies sequence models through four design choices:
| Choice | Options | Description |
|---|---|---|
| Memory Architecture | Vector, Matrix, MLP | How memory is structured |
| Attentional Bias | L2, Lp, Huber, KL | Internal memory objective |
| Retention Gate | L2, Lq, KL, Elastic Net | How to retain past state |
| Learning Algorithm | GD, GD+Momentum, Newton | How to update memory |
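To make the "Attentional Bias" column concrete, the helper functions below (illustrative only, not the repository's `src/layers/attentional_bias.py`) show the losses these options refer to, applied to the memory's prediction for a key against the target value:

```python
import torch
import torch.nn.functional as F

def l2_bias(pred, value):
    # L2 attentional bias: the classic squared-error memory objective.
    return 0.5 * (pred - value).pow(2).sum()

def lp_bias(pred, value, p=1.5):
    # Lp attentional bias with 1 < p < 2 (cf. Moneta below): dampens the
    # influence of large residuals relative to L2.
    return (pred - value).abs().pow(p).sum()

def huber_bias(pred, value, delta=1.0):
    # Huber attentional bias (cf. Yaad below): quadratic for small errors,
    # linear for large ones, so outlier values give bounded gradients.
    return F.huber_loss(pred, value, reduction="sum", delta=delta)
```

The retention gate plays the complementary role: it controls how strongly the updated memory is pulled back toward its previous state.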
Three pre-configured MIRAS models:
| Model | Attentional Bias | Retention Gate | Use Case |
|---|---|---|---|
| Moneta | Lp (p in (1,2)) | Lq (q in (1,2)) | Robust to key collisions |
| Yaad | Huber Loss | L2 | Robust to outlier values |
| Memora | L2 | KL-divergence | Soft thresholding |
Install with uv (recommended):

```bash
uv sync
```

or with pip:

```bash
pip install torch
```

Quick start:

```python
import torch

from src.config import HopeSmallConfig
from src.model import HopeForCausalLM
config = HopeSmallConfig(vocab_size=32000)
model = HopeForCausalLM(config)

# Dummy batch for illustration
input_ids = torch.randint(0, 32000, (1, 128))
labels = input_ids.clone()

# Forward pass
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs["loss"]
```

To carry memory state across batches:

```python
from src.model import Hope
model = Hope(config)
memory_states = None
for batch in dataloader:
    logits, memory_states = model(
        batch["input_ids"],
        memory_states=memory_states,
        return_memory=True,
    )
```

Using the MIRAS memory layers directly:

```python
from src.layers import Moneta, Yaad, Memora, MirasMemory
# Pre-configured models
moneta = Moneta(dim=512, num_heads=8, p=1.5, q=1.5)
yaad = Yaad(dim=512, num_heads=8, huber_delta=1.0)
memora = Memora(dim=512, num_heads=8, kl_temperature=1.0)
# Custom configuration
memory = MirasMemory(
    dim_key=64, dim_value=64,
    attentional_bias="huber",       # l2, lp, huber, kl, dot_product
    retention_gate="elastic_net",   # l2, lq, kl, elastic_net, bregman
    learning_rate=0.1,
    retention_strength=0.1,
)
```

Text generation:

```python
generated = model.generate(
    prompt,
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
)
```

Model sizes:

| Size | Parameters | dim | layers | heads |
|---|---|---|---|---|
| Small | ~125M | 512 | 8 | 8 |
| Base | ~350M | 768 | 12 | 12 |
| Large | ~760M | 1024 | 24 | 16 |
| XL | ~1.3B | 2048 | 24 | 32 |
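To sanity-check a configuration against the table, you can count parameters directly. This quick check reuses the classes from the quick start; exact totals depend on the configuration defaults:

```python
from src.config import HopeSmallConfig
from src.model import HopeForCausalLM

config = HopeSmallConfig(vocab_size=32000)
model = HopeForCausalLM(config)

# Count trainable parameters; should land near the Small row above.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params / 1e6:.0f}M parameters")
```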
Training:

```bash
uv run python train.py --model_size small --batch_size 8 --learning_rate 1e-4
```

Options:
- `--optimizer`: adamw, adam_delta, sgd_delta, deep_momentum, muon
- `--lr_scheduler`: cosine, linear, constant
- `--dtype`: float32, float16, bfloat16
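If you prefer a manual loop over `train.py`, a minimal sketch looks like the following. It uses `torch.optim.AdamW` rather than the repository's deep optimizers in `src/optimizers.py`, and the dummy dataloader, batch shapes, and the dict-returning forward pass follow the quick start above; treat the specifics as assumptions:

```python
import torch

from src.config import HopeSmallConfig
from src.model import HopeForCausalLM

config = HopeSmallConfig(vocab_size=32000)
model = HopeForCausalLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy data for illustration: random token batches of shape (batch, seq_len).
dataloader = [
    {"input_ids": torch.randint(0, 32000, (2, 128)),
     "labels": torch.randint(0, 32000, (2, 128))}
    for _ in range(4)
]

model.train()
for batch in dataloader:
    optimizer.zero_grad()
    outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])
    outputs["loss"].backward()  # forward returns a dict with "loss", as in the quick start
    optimizer.step()
```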
Run the tests:

```bash
uv run python test_hope.py
```

Run the example:

```bash
uv run python example.py
```

Project structure:

```
src/
    __init__.py
    config.py                  # Model configurations
    model.py                   # Main Hope model
    optimizers.py              # Deep optimizers (DMGD, Muon, etc.)
    modules/
        __init__.py
        titans.py              # Self-Modifying Titans (MAC, MAG, MAL)
        continuum_memory.py    # CMS and variants
        hope_block.py          # Combined HOPE block
    layers/
        __init__.py
        associative_memory.py  # Delta rule memory
        neural_memory.py       # MLP-based neural memory
        attentional_bias.py    # MIRAS attentional bias (L2, Lp, Huber, KL)
        retention_gates.py     # MIRAS retention gates (L2, Lq, KL, Elastic Net)
        miras_memory.py        # MIRAS models (Moneta, Yaad, Memora)
```
References:

- Nested Learning: The Illusion of Deep Learning Architectures
- Titans: Learning to Memorize at Test Time
- MIRAS: Memory Is All You Need
MIT License