Echo Model Specifications

Model Basics

| Specification | Details |
| --- | --- |
| Model Name | Echo (based on `google/flan-t5-base`) |
| Architecture | Transformer encoder–decoder (T5) |
| Encoder / Decoder Layers | 12 each |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Vocabulary | 32k SentencePiece pieces (32,128) |
| Parameter Count | ~248 million |
| Context Window | 512 tokens (≈ 350–400 English words) |
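
The architecture rows above can be cross-checked against the published config. A minimal sketch, assuming the `transformers` library is installed and the Hugging Face Hub is reachable:

```python
# Verify the spec-table figures against the published google/flan-t5-base config.
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("google/flan-t5-base")
print(config.num_layers, config.num_decoder_layers)  # encoder / decoder layers: 12 12
print(config.d_model)                                # hidden size: 768
print(config.num_heads)                              # attention heads: 12
print(config.vocab_size)                             # SentencePiece vocabulary: 32128

# Counting parameters confirms the ~248M figure for the base checkpoint.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```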

Training

| Specification | Details |
| --- | --- |
| Pretraining | Flan-T5 (instruction-tuned) |
| Fine-tuning Dataset | `trl-lib/tldr` (116,722 train / 6,447 validation examples) |
| Additional Task Data | UCL-DARK preferences held-out test split (92,858 examples) |
| Training Hyperparameters | learning rate 3e-5, 3 epochs, batch size 8 with gradient accumulation 2 (effective batch 16), bf16, logging every 100 steps |
| Total Training Steps | ~21,885 (≈ 7,295 optimizer steps/epoch × 3 epochs) |
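
A sketch of the training configuration implied by the hyperparameter row above; the `output_dir` and dataset handling are illustrative assumptions, not taken from the actual training script:

```python
from datasets import load_dataset
from transformers import Seq2SeqTrainingArguments

dataset = load_dataset("trl-lib/tldr")  # 116,722 train / 6,447 validation rows

args = Seq2SeqTrainingArguments(
    output_dir="echo-flan-t5-base",   # hypothetical output directory
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,    # effective batch size 16
    bf16=True,
    logging_steps=100,
)

# Step-count check: 116,722 examples / 16 effective batch ≈ 7,295 steps per
# epoch, so 3 epochs ≈ 21,885 optimizer steps, matching the table.
steps_per_epoch = 116_722 // 16
print(steps_per_epoch * 3)  # 21885
```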

Performance & Memory

| Specification | Details |
| --- | --- |
| Checkpoint Size (float32) | ~1 GB |
| BF16 In-Memory Footprint | ~0.5 GB |
| FP16 In-Memory Footprint | ~0.5 GB |
| Float32 In-Memory Footprint | ~1 GB |
| CPU Inference Requirements | ≥ 4 GB RAM (float32); ≥ 2 GB (bf16/int8) |
| Disk Requirement | ≥ 1 GB to store `saved_model/` files |
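
The footprint figures follow directly from the parameter count: weights-only memory is parameters × bytes per element, with activations and framework overhead adding headroom in practice. A back-of-the-envelope check, assuming the ~248M parameters of flan-t5-base:

```python
# Weights-only memory per dtype; runtime overhead explains the larger
# RAM recommendations in the table above.
N_PARAMS = 248e6

for dtype, nbytes in [("float32", 4), ("bf16/fp16", 2), ("int8", 1)]:
    print(f"{dtype}: {N_PARAMS * nbytes / 2**30:.2f} GiB")
# float32 ≈ 0.92 GiB, bf16/fp16 ≈ 0.46 GiB, int8 ≈ 0.23 GiB
```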

Inference

| Specification | Details |
| --- | --- |
| Inference Latency | ~0.5–1.0 s per 512-token call on CPU; ~0.1–0.2 s on a single A100 |
| Tokenizer Load Time | ~200 ms on CPU |
| Model Load Time | ~5–10 s (fp16) or ~15–30 s (float32) from local disk |
| Quantization Options | 8-bit via bitsandbytes for ~2× speedup and ~50% memory reduction vs. fp16 (typically requires a CUDA GPU) |
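
A rough way to reproduce the load-time and latency figures above on your own hardware; the checkpoint name and prompt are illustrative (swap in the fine-tuned Echo weights):

```python
import time
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

t0 = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
print(f"tokenizer load: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
print(f"model load: {time.perf_counter() - t0:.2f}s")

# Fill the 512-token context window with a dummy prompt and time one call.
inputs = tokenizer("summarize: " + "some long post " * 100,
                   return_tensors="pt", truncation=True, max_length=512)
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=64)
print(f"inference: {time.perf_counter() - t0:.2f}s")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```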

Deployment

| Specification | Details |
| --- | --- |
| Docker Base Image | `python:3.10-slim` + torch + transformers + bitsandbytes |
| CUDA Support | No (CPU-only deployment in `jam.py`) |
| Low-CPU-Memory Loading | `low_cpu_mem_usage=True`, `device_map="cpu"` enabled |
| Lambda Configuration | ≥ 6 GB RAM, ≥ 2 GB ephemeral storage, 15 min timeout recommended |
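
A minimal sketch of the CPU-only loading pattern the table describes; the exact `jam.py` internals are not shown here, so the call below is an assumption based on the flags listed above (`device_map` requires the `accelerate` package):

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",        # or the fine-tuned Echo checkpoint directory
    low_cpu_mem_usage=True,       # stream weights in to avoid a 2x RAM peak at load
    device_map="cpu",             # pin all modules to CPU, matching the CUDA-free image
    torch_dtype=torch.bfloat16,   # halves the resident footprint vs. float32
)
```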