What to measure

Define metrics before you run. Suggested set (from the eval kit):

Metric Definition
recall@k Gold expected_item_id appears in top-k retrieved chunks
precision@k Share of retrieved chunks that are relevant
citation_accuracy Final answer cites the correct item_id (GDF arm only)
answer_accuracy Final answer matches gold (human or LLM judge)

Arms (same host, same slug):

Arm Fetch URL
Structured (GDF) https://guides.co/g/{slug}/gdf or ?format=jsonl
Plaintext baseline https://guides.co/g/{slug}/gdf?format=plaintext

Use jsonl when your retriever indexes one record per page - each line includes item_id, title, and body.