Benchmarks

Independent benchmark results for Cilow's context engine, measured on LongMemEval with real embeddings and a GPT-4o-mini judge.

94.17% LongMemEval accuracy (113/120)
Up to 70% token reduction vs. naive context passing
226ms P50 latency, end-to-end

LongMemEval results

LongMemEval is a benchmark for long-term memory evaluation. It tests a system's ability to recall, reason over, and update information across long conversation histories.

Category                            Score
Single-session attribution (SSA)    100% (20/20)
Single-session preference (SSP)     100% (20/20)
Single-session update (SSU)         100% (20/20)
Knowledge update (KU)               90% (18/20)
Multi-session (MS)                  90% (18/20)
Temporal reasoning (TR)             85% (17/20)
Overall                             94.17% (113/120)
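The overall figure is simply the sum of per-category correct answers over the total question count. A quick check, with the counts taken from the table above:

```python
# Per-category (correct, total) counts from the LongMemEval table above.
scores = {
    "SSA": (20, 20),
    "SSP": (20, 20),
    "SSU": (20, 20),
    "KU": (18, 20),
    "MS": (18, 20),
    "TR": (17, 20),
}

correct = sum(c for c, _ in scores.values())
total = sum(t for _, t in scores.values())
overall = 100 * correct / total
print(f"{correct}/{total} = {overall:.2f}%")  # 113/120 = 94.17%
```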

Information extraction: 100% (60/60) — perfect recall across all attribution, preference, and update categories.

Token efficiency

Fewer tokens in the context window mean lower cost, lower latency, and less noise for the model to reason over. Flooding a model with loosely relevant text degrades reasoning.

Cilow assembles a minimal working set for each inference call rather than retrieving and dumping all available context. The result is a smaller, higher-signal context window.
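One simple way to picture working-set assembly is budgeted selection: keep the highest-relevance snippets until a token budget is exhausted, rather than concatenating everything retrieved. This is a hypothetical sketch, not Cilow's actual planner; the scores, token counts, and snippet texts are illustrative:

```python
# Hypothetical sketch of budgeted working-set assembly: take the
# highest-scoring snippets until a token budget is exhausted, instead
# of dumping every retrieved snippet into the prompt.
def assemble_working_set(snippets, budget_tokens):
    """snippets: list of (relevance_score, token_count, text) tuples."""
    chosen, used = [], 0
    for score, tokens, text in sorted(snippets, reverse=True):
        if used + tokens <= budget_tokens:
            chosen.append(text)
            used += tokens
    return chosen, used

snippets = [
    (0.91, 120, "user prefers metric units"),
    (0.85, 300, "flight booked for May 12"),
    (0.40, 900, "loosely related small talk"),
    (0.30, 500, "old, superseded address"),
]
context, used = assemble_working_set(snippets, budget_tokens=600)
print(context, used)  # keeps the two high-signal snippets, 420 tokens
```

Naive context passing would spend all 1,820 tokens here; the budgeted set spends 420 and drops only low-signal material.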

Up to 70% reduction in tokens vs. naive context passing, measured across representative workloads.

The query planner builds the minimal working set needed for the current call instead of retrieving all related context and dumping it into the prompt. Smaller context windows reduce cost per call and lower the cognitive load on the model, both of which improve accuracy.
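The headline number is just (naive - assembled) / naive. The token counts below are illustrative, not figures from the benchmark run:

```python
# Token-reduction arithmetic with illustrative counts.
naive_tokens = 12_000      # dump all retrieved context into the prompt
assembled_tokens = 3_600   # budgeted working set for the same call
reduction = 100 * (naive_tokens - assembled_tokens) / naive_tokens
print(f"{reduction:.0f}% fewer tokens")  # 70% fewer tokens
```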

Latency

Measured end-to-end with OpenAI GPT-4o-mini and real embeddings — not stubs. Includes retrieval, ranking, and context assembly.

P50 latency: 226ms
P95 latency: 269ms

These numbers cover the full pipeline: retrieval, reranking, and working-set assembly are all included, not just vector search alone.
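P50 and P95 are percentiles over the per-call latency samples. A minimal nearest-rank computation, using an illustrative sample rather than the benchmark's raw data:

```python
import math

# Nearest-rank percentile over measured end-to-end latencies (ms).
# The sample values below are illustrative, not the benchmark's raw data.
def percentile(samples, p):
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

latencies_ms = [198, 203, 211, 219, 226, 231, 240, 244, 252, 269]
print(percentile(latencies_ms, 50))  # 226
print(percentile(latencies_ms, 95))  # 269
```

P95 matters because tail latency, not the median, is what users feel on slow calls.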

Methodology

Numbers are only useful if the methodology is transparent. Here is exactly how this benchmark was run.

Benchmark: LongMemEval stratified 120-question sample (seed=42).
Categories: SSA, SSP, SSU, KU, MS, TR (20 questions each).
Model: GPT-4o-mini for both synthesis and judge.
Embeddings: real OpenAI text-embedding model, not stubs.
Test scope: each question tests the system's ability to recall and reason over information from a long conversation history.
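A stratified sample with a fixed seed means the same 20 questions per category are drawn on every run, so the 120-question subset is reproducible. A hypothetical sketch of that sampling step (the question pool here is synthetic):

```python
import random

# Hypothetical sketch of the stratified sample: 20 questions drawn per
# category with a fixed seed so the 120-question subset is reproducible.
def stratified_sample(questions_by_category, per_category=20, seed=42):
    rng = random.Random(seed)
    sample = []
    for category in sorted(questions_by_category):
        sample.extend(rng.sample(questions_by_category[category], per_category))
    return sample

pool = {cat: [f"{cat}-q{i}" for i in range(50)]
        for cat in ["SSA", "SSP", "SSU", "KU", "MS", "TR"]}
subset = stratified_sample(pool)
print(len(subset))  # 120
```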

Want to understand the architecture behind these results? → Architecture