Define metrics before you run. Suggested set (from the eval kit):
| Metric | Definition |
|---|---|
| recall@k | Gold expected_item_id appears in top-k retrieved chunks |
| precision@k | Share of retrieved chunks that are relevant |
| citation_accuracy | Final answer cites the correct item_id (GDF arm only) |
| answer_accuracy | Final answer matches gold (human or LLM judge) |
Arms (same host, same slug):
| Arm | Fetch URL |
|---|---|
| Structured (GDF) | https://guides.co/g/{slug}/gdf or ?format=jsonl |
| Plaintext baseline | https://guides.co/g/{slug}/gdf?format=plaintext |
Use jsonl when your retriever indexes one record per page - each line includes item_id, title, and body.