For each question and each arm:
recall@k = 1 if expected_item_id is in the top-k retrieved chunks, else 0.
precision@k = (relevant chunks in top-k) / k.
citation_accuracy (GDF arm) = 1 if the model's cited id matches expected_item_id, else 0.
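
The per-question metrics above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the retrieved chunks are given as a ranked list of item ids, a chunk counts as relevant when its id equals `expected_item_id`, and the function names are hypothetical rather than part of any existing harness.

```python
from typing import Optional, Sequence


def recall_at_k(retrieved_ids: Sequence[str], expected_item_id: str, k: int) -> int:
    """Return 1 if expected_item_id appears among the top-k retrieved ids, else 0."""
    return int(expected_item_id in retrieved_ids[:k])


def precision_at_k(retrieved_ids: Sequence[str], expected_item_id: str, k: int) -> float:
    """Return (relevant chunks in top-k) / k.

    Assumption: a chunk is relevant when its item id equals expected_item_id.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = sum(1 for item_id in retrieved_ids[:k] if item_id == expected_item_id)
    return relevant / k


def citation_accuracy(cited_id: Optional[str], expected_item_id: str) -> int:
    """Return 1 if the model's cited id matches expected_item_id, else 0."""
    return int(cited_id == expected_item_id)
```

Note that precision divides by k even when fewer than k chunks are returned, which matches the definition above; if your retriever can return short lists, you may want to decide explicitly whether that is the behavior you intend.
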
Aggregate:
mean_recall@k = average over questions
mean_precision@k = average over questions
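
Aggregation is a plain mean per metric, computed separately for each arm. The sketch below assumes each question's scores are stored as a dictionary keyed by metric name; the `aggregate` helper and the sample values are hypothetical and purely illustrative, not measured results.

```python
from statistics import mean
from typing import Dict, List


def aggregate(per_question: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all questions for one arm."""
    if not per_question:
        raise ValueError("no questions to aggregate")
    metrics = per_question[0].keys()
    return {name: mean(row[name] for row in per_question) for name in metrics}


# Hypothetical per-question scores for one arm (illustrative values only).
scores = [
    {"recall@5": 1, "precision@5": 0.2},
    {"recall@5": 0, "precision@5": 0.0},
]
print(aggregate(scores))
```

Run it once per arm and copy the resulting means into the corresponding table row.
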
Report in a simple table:
| Arm | recall@5 | precision@5 | citation_accuracy |
|---|---|---|---|
| GDF (jsonl) | | | |
| plaintext | | | |
Interpretation:
Share the results with your client as measured evidence, not just an opinion about the spec.