Why run this benchmark

GDF keeps a small required core and lets unknown fields pass through, so the format can survive vendor and version changes without breaking readers. But structure only matters if it helps agents find the right chunk.

This benchmark answers:

When our agent retrieves from GDF versus the same content as plain text, do precision and recall improve, and by how much?

Hypothesis: GDF arms should score higher on citation accuracy (correct item_id) and often on recall@k because pages, breadcrumbs, and jsonl chunks align with gold labels.
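
To make the two metrics in the hypothesis concrete, here is a minimal sketch of how they could be scored. It is not the benchmark's actual scorer: the function names, the choice to treat citation accuracy as the fraction of cited item_ids found in the gold set, and the example ids are all assumptions for illustration.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold item_ids that appear in the top-k retrieved ids."""
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)


def citation_accuracy(cited: Sequence[str], gold: Set[str]) -> float:
    """Fraction of cited item_ids that are in the gold set (assumed definition)."""
    if not cited:
        return 0.0
    return sum(1 for item_id in cited if item_id in gold) / len(cited)


# Hypothetical ids, for illustration only.
gold_ids = {"doc1#p3", "doc1#p4"}
retrieved_ids = ["doc1#p3", "doc2#p1", "doc1#p4"]
cited_ids = ["doc1#p3", "doc2#p1"]

r_at_2 = recall_at_k(retrieved_ids, gold_ids, k=2)
cite_acc = citation_accuracy(cited_ids, gold_ids)
```

Running the same functions over the GDF arm and the plain-text arm with identical gold labels gives directly comparable numbers per query.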

A null result is valid: if plain text matches GDF, you may not need structure for that corpus shape, and that is still worth knowing.