| Model Basics | |
| --- | --- |
| Model Name | Echo (based on google/flan-t5-base) |
| Architecture | Transformer encoder–decoder (T5) |
| Encoder / Decoder Layers | 12 each |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Vocabulary | 32k-piece SentencePiece |
| Parameter Count | ~250 million (see config check below) |
| Context Window | 512 tokens (≈ 2,000–2,500 characters) |
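
These architecture figures can be read off the base checkpoint's config. A quick check, assuming google/flan-t5-base is fetched from the Hugging Face Hub:

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("google/flan-t5-base")
print(config.num_layers, config.num_decoder_layers)  # 12 12
print(config.d_model)                                # 768
print(config.num_heads)                              # 12
print(config.vocab_size)                             # 32128 (~32k SentencePiece pieces)

# Counting parameters loads the full model (~1 GB of RAM in float32);
# expect roughly 250 M for flan-t5-base.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
print(sum(p.numel() for p in model.parameters()))
```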
| Training | |
| --- | --- |
| Pretraining | Flan-T5 (instruction-tuned) |
| Fine-tuning Dataset | trl-lib/tldr (116,722 train, 6,447 validation) |
| Additional Task Data | UCL-DARK prefs held-out test split (92,858 examples) |
| Training Hyperparameters | lr = 3e-5, 3 epochs, batch size 8 (gradient accumulation 2 ⇒ effective 16), bf16, logging every 100 steps (see sketch below) |
| Total Train Steps | ~21,885 (116,722 / 16 ≈ 7,295 steps per epoch × 3 epochs) |
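
The hyperparameter row maps directly onto Hugging Face `Seq2SeqTrainingArguments`. A minimal sketch of that configuration (the output path is illustrative, and the trainer/dataset wiring is omitted):

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the table: lr 3e-5, 3 epochs, per-device batch 8,
# gradient accumulation 2 (effective batch 16), bf16, logging every 100 steps.
training_args = Seq2SeqTrainingArguments(
    output_dir="echo-flan-t5-base-tldr",  # illustrative output path
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    bf16=True,
    logging_steps=100,
)

# Step-count check: 116,722 train examples / 16 effective batch size
# ≈ 7,295 optimizer steps per epoch, ≈ 21,885 over 3 epochs.
```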
| Performance & Memory | |
| --- | --- |
| Checkpoint Size (float32) | ~1 GB (see the arithmetic check below) |
| BF16 In-Memory Footprint | ~0.5 GB |
| FP16 In-Memory Footprint | ~0.5 GB |
| Float32 In-Memory Footprint | ~1 GB |
| CPU Inference Requirements | ≥ 4 GB RAM (float32); ≥ 2 GB (bf16/int8) |
| Disk Requirement | ≥ 1 GB free to store saved_model/ files |
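
The footprint rows are essentially parameter count × bytes per parameter. A quick back-of-the-envelope check, assuming ~250 M parameters for flan-t5-base and ignoring activation and runtime overhead:

```python
# Back-of-the-envelope weight memory: parameters × bytes per parameter.
# Activations, generation buffers, and framework overhead come on top.
PARAMS = 250_000_000  # approximate flan-t5-base parameter count

for name, bytes_per_param in [("float32", 4), ("bf16/fp16", 2), ("int8", 1)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{name:>9}: ~{gib:.2f} GiB of weights")

# float32  : ~0.93 GiB  -> the ~1 GB checkpoint / in-memory figures
# bf16/fp16: ~0.47 GiB  -> the ~0.5 GB figures
# int8     : ~0.23 GiB
```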
| Inference | |
| --- | --- |
| Inference Latency | ~0.5–1.0 s per 512-token call on CPU; ~0.1–0.2 s on a single A100 |
| Tokenizer Load Time | ~200 ms on CPU |
| Model Load Time | ~5–10 s (fp16) or ~15–30 s (float32) from local disk |
| Quantization Options | 8-bit via bitsandbytes for ~2× speedup and ~50% memory reduction vs. fp16 (see sketch below) |
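
For the 8-bit option, loading goes through the bitsandbytes integration in transformers. A minimal sketch, assuming a CUDA device is available for the quantized weights (bitsandbytes 8-bit loading targets GPUs) and a hypothetical local checkpoint directory `./echo`:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

MODEL_DIR = "./echo"  # hypothetical path to the fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

# 8-bit weights via bitsandbytes: roughly half the memory of fp16.
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_DIR,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

inputs = tokenizer("Summarize: ...", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```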
| Deployment | |
| --- | --- |
| Docker Base Image | python:3.10-slim + torch + transformers + bitsandbytes |
| CUDA Support | No (CPU-only deployment in jam.py) |
| Low-CPU-Mem Usage | low_cpu_mem_usage=True, device_map="cpu" enabled (see sketch below) |
| Lambda Configuration | ≥ 6 GB RAM, ≥ 2 GB ephemeral storage, 15 min timeout recommended |
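
A sketch of the CPU-only load path referenced in the Low-CPU-Mem Usage row; the checkpoint path and generation call are illustrative, not taken from jam.py (note that device_map requires the accelerate package):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_DIR = "./echo"  # hypothetical local checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

# low_cpu_mem_usage streams weights instead of building a second full copy
# in RAM; device_map="cpu" (via accelerate) pins everything to the CPU.
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_DIR,
    low_cpu_mem_usage=True,
    device_map="cpu",
)
model.eval()

inputs = tokenizer("Summarize: ...", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```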