Step 5 — Score precision and recall

For each question and each arm:

recall@k = 1 if expected_item_id is in the top-k retrieved chunks, else 0.

precision@k = (relevant chunks in top-k) / k.

citation_accuracy (GDF arm) = 1 if the model's cited id matches expected_item_id, else 0.
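The three per-question scores above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the retrieved_ids list (ranked ids of the retrieved chunks, best first) and the relevant_ids set are assumptions made for the example, and it assumes each retrieved chunk carries the id of the item it came from.

```python
from typing import Optional, Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], expected_item_id: str, k: int) -> int:
    """Return 1 if the expected id appears in the top-k retrieved ids, else 0."""
    return int(expected_item_id in retrieved_ids[:k])


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return (relevant ids in the top-k) / k.

    The denominator is always k, even when fewer than k chunks were retrieved.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / k


def citation_accuracy(cited_id: Optional[str], expected_item_id: str) -> int:
    """Return 1 if the id cited by the model equals the expected id, else 0."""
    return int(cited_id == expected_item_id)


# Hypothetical ids, for illustration only.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
print(recall_at_k(retrieved, "c9", k=5))             # 1
print(precision_at_k(retrieved, {"c9", "c4"}, k=5))  # 0.4
print(citation_accuracy("c9", "c9"))                 # 1
```

Dividing by k rather than by the number of chunks actually returned keeps precision@k comparable across questions, at the cost of penalising a retriever that returns fewer than k chunks.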

Aggregate:

mean_recall@k          = average over questions
mean_precision@k       = average over questions
mean_citation_accuracy = average over questions (GDF arm)
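A sketch of the aggregation, assuming each scored question is stored as a dict with the keys "arm", "recall", "precision" and an optional "citation_accuracy". These key names and the aggregate function are hypothetical, chosen only for this example. Arms with no citation score get None instead of a mean.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows):
    """Average the per-question scores separately for each arm."""
    by_arm = defaultdict(list)
    for row in rows:
        by_arm[row["arm"]].append(row)

    summary = {}
    for arm, arm_rows in by_arm.items():
        # Only questions that actually have a citation score count here.
        cited = [r["citation_accuracy"] for r in arm_rows
                 if r.get("citation_accuracy") is not None]
        summary[arm] = {
            "mean_recall": mean([r["recall"] for r in arm_rows]),
            "mean_precision": mean([r["precision"] for r in arm_rows]),
            "mean_citation_accuracy": mean(cited) if cited else None,
        }
    return summary


# Made-up scores for two questions per arm, for illustration only.
rows = [
    {"arm": "gdf", "recall": 1, "precision": 0.4, "citation_accuracy": 1},
    {"arm": "gdf", "recall": 0, "precision": 0.0, "citation_accuracy": 0},
    {"arm": "plaintext", "recall": 1, "precision": 0.2},
    {"arm": "plaintext", "recall": 1, "precision": 0.2},
]
summary = aggregate(rows)
print(summary["gdf"]["mean_recall"])  # 0.5
```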

Report in a simple table:

Arm           recall@5   precision@5   citation_accuracy
GDF (jsonl)
plaintext
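If the aggregated scores live in a dict keyed by arm, a plain-text table like the one above can be printed with a small helper. This is a sketch under assumptions: format_table and the key names are hypothetical, and the numbers in the usage example are placeholders, not measured results.

```python
def format_table(summary, k=5):
    """Render per-arm mean scores as an aligned plain-text table."""
    def cell(value):
        # Missing scores are shown as "n/a" rather than as 0.
        return "n/a" if value is None else f"{value:.2f}"

    headers = ["Arm", f"recall@{k}", f"precision@{k}", "citation_accuracy"]
    widths = [14, 10, 13, 17]
    lines = ["  ".join(h.ljust(w) for h, w in zip(headers, widths)).rstrip()]
    for arm, scores in summary.items():
        cells = [
            arm,
            cell(scores["mean_recall"]),
            cell(scores["mean_precision"]),
            cell(scores.get("mean_citation_accuracy")),
        ]
        lines.append("  ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip())
    return "\n".join(lines)


# Placeholder values only; substitute the means from your own run.
example = {
    "GDF (jsonl)": {"mean_recall": 0.8, "mean_precision": 0.36,
                    "mean_citation_accuracy": 0.9},
    "plaintext": {"mean_recall": 0.7, "mean_precision": 0.3,
                  "mean_citation_accuracy": None},
}
print(format_table(example))
```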

Interpretation:

  • GDF wins on citation_accuracy → structure helps attribution and audit.
  • GDF wins on recall@k → chunk boundaries or metadata help retrieval.
  • No difference → structure may not matter for this corpus size/shape; try more pages or harder cross-section questions.

Share the results with your client as measured evidence, not just an opinion about the spec.