| Model Basics | |
| --- | --- |
| Model Name | Echo (based on google/flan-t5-base) |
| Architecture | Transformer encoder–decoder (T5) |
| Encoder / Decoder Layers | 12 each |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Vocabulary | 32 k-piece SentencePiece |
| Parameter Count | ~250 million (flan-t5-base) |
| Context Window | 512 tokens (≈ 350–400 words) |
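A quick sketch (assuming the `transformers` library is installed) for confirming the architecture figures above against the upstream `google/flan-t5-base` config and tokenizer; the fine-tuned Echo checkpoint itself is not needed for this check.

```python
# Sketch: read the architecture figures from the base checkpoint's config.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

print(config.num_layers, config.num_decoder_layers)  # encoder / decoder layers: 12 / 12
print(config.d_model)                                 # hidden size: 768
print(config.num_heads)                               # attention heads: 12
print(config.vocab_size)                              # ~32k SentencePiece pieces
print(tokenizer.model_max_length)                     # context window: 512 tokens
```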
| Training | |
| --- | --- |
| Pretraining | Flan-T5 (instruction-tuned) |
| Fine-tuning Dataset | trl-lib/tldr (116,722 train, 6,447 valid) |
| Additional Task Data | UCL-DARK prefs held-out test (92,858) |
| Training Hyperparameters | lr = 3e-5, 3 epochs, batch size 8 (grad. accumulation 2 ⇒ effective 16), bf16, logging every 100 steps |
| Total Train Steps | ~21,885 |
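A minimal fine-tuning sketch matching the hyperparameters above. The `prompt`/`completion` column names, the 512/128 input/target lengths, and the `echo-tldr` output directory are assumptions for illustration, not confirmed details of the actual training script.

```python
# Fine-tuning sketch: flan-t5-base on trl-lib/tldr with the hyperparameters above.
# 116,722 examples / effective batch 16 ≈ 7,295 steps per epoch, ~21,885 over 3 epochs.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

base = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)
dataset = load_dataset("trl-lib/tldr")

def preprocess(batch):
    # Column names are assumed; adjust if the dataset schema differs.
    inputs = tokenizer(batch["prompt"], max_length=512, truncation=True)
    targets = tokenizer(text_target=batch["completion"], max_length=128, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="echo-tldr",             # hypothetical output directory
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,      # effective batch size 16
    bf16=True,
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```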
| Performance & Memory | |
| --- | --- |
| Checkpoint Size (float32) | ~1 GB |
| BF16 In-Memory Footprint | ~0.5 GB |
| FP16 In-Memory Footprint | ~0.5 GB |
| Float32 In-Memory Footprint | ~1 GB |
| CPU Inference Requirements | ≥ 4 GB RAM (float32); ≥ 2 GB (bf16/int8) |
| Disk Requirement | ≥ 1 GB to store saved_model/ files |
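The footprint figures above follow from parameter count × bytes per element (activations and framework overhead come on top). A quick way to reproduce the arithmetic:

```python
# Back-of-the-envelope memory estimate: weights only, ignoring activations
# and Python/torch overhead.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")

for dtype, bytes_per_param in [("float32", 4), ("bf16 / fp16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{n_params * bytes_per_param / 1024**3:.2f} GiB")
```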
| Inference | |
| --- | --- |
| Inference Latency | ~0.5–1.0 s per 512-token call on CPU; ~0.1–0.2 s on a single A100 |
| Tokenizer Load Time | ~200 ms on CPU |
| Model Load Time | ~5–10 s (fp16) or ~15–30 s (float32) from local disk |
| Quantization Options | 8-bit via bitsandbytes for ~2× speedup & ~50% memory reduction vs. fp16 (≈75% vs. float32) |
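An illustrative CPU inference call with timing. The prompt template and `max_new_tokens=64` are placeholders rather than Echo's actual settings, and the base model name stands in for the fine-tuned checkpoint path. Note that 8-bit loading with bitsandbytes normally requires a CUDA-capable GPU, so it does not apply to the CPU-only deployment described below.

```python
# Inference timing sketch on CPU (float32).
import time
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "google/flan-t5-base"    # placeholder; use the Echo checkpoint in practice
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
model.eval()

prompt = "Summarize: " + "a long Reddit post goes here ..."   # placeholder input
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(f"latency: {time.perf_counter() - start:.2f} s")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Hypothetical 8-bit variant (needs a CUDA GPU and the bitsandbytes package):
# model = AutoModelForSeq2SeqLM.from_pretrained(model_dir, load_in_8bit=True, device_map="auto")
```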
| Deployment | |
| --- | --- |
| Docker Base Image | python:3.10-slim + torch + transformers + bitsandbytes |
| CUDA Support | No (CPU-only deployment in jam.py) |
| Low-CPU-Mem Usage | low_cpu_mem_usage=True, device_map="cpu" enabled |
| Lambda Configuration | ≥ 6 GB RAM, ≥ 2 GB ephemeral storage, 15 min timeout recommended |
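A hedged sketch of a Lambda-style handler wired up with the low-memory CPU settings listed above; the model path and the handler shape are hypothetical and may not match the actual structure of jam.py.

```python
# Deployment sketch: load once at module import so warm invocations reuse the model.
# low_cpu_mem_usage / device_map require the accelerate package alongside transformers.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_DIR = "/opt/echo/saved_model"   # hypothetical path baked into the container image

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_DIR,
    low_cpu_mem_usage=True,    # avoid materialising a second full copy of the weights
    device_map="cpu",
    torch_dtype=torch.float32,
)
model.eval()

def handler(event, context):
    """AWS-Lambda-style entry point: {"text": ...} in, {"summary": ...} out."""
    text = event.get("text", "")
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    return {"summary": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```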