Benchmarks
Independent benchmark results for Cilow's context engine, measured on LongMemEval with real embeddings and a GPT-4o-mini judge.
LongMemEval results
LongMemEval is a benchmark for long-term memory evaluation. It tests a system's ability to recall, reason over, and update information across long conversation histories.
| Category | Score |
|---|---|
| Single-session attribution (SSA) | 100% (20/20) |
| Single-session preference (SSP) | 100% (20/20) |
| Single-session update (SSU) | 100% (20/20) |
| Knowledge update (KU) | 90% (18/20) |
| Multi-session (MS) | 90% (18/20) |
| Temporal reasoning (TR) | 85% (17/20) |
| Overall | 94.17% (113/120) |
Information extraction: 100% (60/60) — perfect recall across all attribution, preference, and update categories.
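The overall figure follows directly from the per-category tallies in the table. A short illustrative script (not the benchmark harness itself) shows the roll-up:

```python
# Per-category (correct, total) tallies from the table above.
results = {
    "SSA": (20, 20),
    "SSP": (20, 20),
    "SSU": (20, 20),
    "KU":  (18, 20),
    "MS":  (18, 20),
    "TR":  (17, 20),
}

correct = sum(c for c, _ in results.values())
total = sum(t for _, t in results.values())
print(f"Overall: {correct / total:.2%} ({correct}/{total})")  # Overall: 94.17% (113/120)
```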
Token efficiency
Fewer tokens in the context window = lower cost, lower latency, and less noise for the model to reason over. Flooding a model with loosely relevant text degrades reasoning.
Cilow assembles a minimal working set for each inference call rather than retrieving and dumping all available context. The result is a smaller, higher-signal context window.
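One simple way to picture working-set assembly is budget-constrained selection: rank candidate chunks by relevance, then keep only what fits a token budget. This is a minimal sketch under that assumption; the `Chunk` type, scores, and selection logic are illustrative, not Cilow's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float   # relevance score from retrieval/reranking (hypothetical)
    tokens: int    # token count of the chunk

def assemble_working_set(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit the token budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected

# Example: with a 100-token budget, the low-signal chunk is dropped.
candidates = [
    Chunk("user prefers dark mode", 0.92, 30),
    Chunk("unrelated small talk", 0.15, 60),
    Chunk("user moved to Berlin in May", 0.88, 40),
]
working_set = assemble_working_set(candidates, budget=100)
print([c.text for c in working_set])
```

The point of the sketch: the context window ends up smaller and higher-signal than dumping every retrieved chunk into the prompt.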
Token usage is reduced relative to naive context passing, measured across representative workloads.
Latency
Measured end-to-end with OpenAI GPT-4o-mini and real embeddings, not stubs. The numbers cover the full retrieval-to-assembly pipeline: retrieval, reranking, and working-set assembly are all included, not vector search alone.
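End-to-end measurement of this kind can be sketched with per-stage wall-clock timing. The stage bodies below are placeholders standing in for the real pipeline calls; only the timing pattern is the point.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Placeholder stages; in a real run these would call the pipeline.
with timed("retrieval"):
    candidates = ["chunk-a", "chunk-b"]   # e.g. vector search
with timed("reranking"):
    candidates.sort()                     # e.g. cross-encoder rerank
with timed("assembly"):
    context = "\n".join(candidates)       # e.g. working-set assembly

total = sum(timings.values())
for stage, secs in timings.items():
    print(f"{stage}: {secs * 1000:.2f} ms")
print(f"end-to-end: {total * 1000:.2f} ms")
```

Summing stage timings like this is what "full retrieval-to-assembly pipeline" means here: no stage is excluded from the reported latency.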
Methodology
Numbers are only useful if the methodology is transparent. Here is exactly how this benchmark was run.
- Sample: LongMemEval stratified 120-question sample (seed=42).
- Categories: SSA, SSP, SSU, KU, MS, TR, 20 questions each.
- Model: GPT-4o-mini for both answer synthesis and judging.
- Embeddings: a real OpenAI text-embedding model, not stubs.
Each question tests the system's ability to recall and reason over information from a long conversation history.
Want to understand the architecture behind these results? → Architecture