3. Run a GDF vs plaintext benchmark

This section turns the GDF spec into a measured experiment: the same corpus exported in two formats, one set of gold questions, and precision/recall computed for each format. It is designed for teams that feed structured knowledge into agents and want to know whether stable item_ids and chunk boundaries actually improve retrieval.
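As a rough illustration of the scoring step, the sketch below computes set-based precision and recall per gold question and macro-averages them for each export format. Everything in it is an assumption made for illustration: the function names, the question ids, the item ids, and the idea that retrieved plaintext chunks have already been mapped back to the same item_ids used in the gold set. It is not part of the GDF spec, and the sample data is invented, not a measured result.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(retrieved: List[str], gold: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for a single question."""
    unique = set(retrieved)          # ignore duplicate hits on the same item
    hits = len(unique & gold)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def macro_average(
    run: Dict[str, List[str]], gold: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Average precision and recall over every gold question.

    A question missing from the run counts as an empty retrieval.
    """
    scores = [precision_recall(run.get(q, []), ids) for q, ids in gold.items()]
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


# Hypothetical gold set: question id -> item_ids that answer it.
gold = {"q1": {"a", "b"}, "q2": {"c"}}

# Hypothetical retrieval results, one run per export format.
gdf_run = {"q1": ["a", "b"], "q2": ["c", "d"]}
plaintext_run = {"q1": ["a", "x"], "q2": ["y"]}

for name, run in (("gdf", gdf_run), ("plaintext", plaintext_run)):
    p, r = macro_average(run, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Macro-averaging weights every gold question equally, which keeps a few questions with many relevant items from dominating the comparison. Micro-averaging over all retrieved items is an equally reasonable choice if that weighting is preferred.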