Phase 1: Experiment hygiene and threats-to-validity checks #83

Merged

erikinkinen merged 3 commits from 1-experiment-hygiene-and-threats-to-validity-checks-39 into main

2026-03-06 15:50:54 +01:00

erikinkinen commented

2026-03-06 15:45:01 +01:00

Owner

Summary

Implements Phase 1 experiment-hygiene and threats-to-validity safeguards from #39.

Closes #39.

What Changed

Enforced strict strategy-cohort comparability in the figure pipeline.
Added deterministic multi-seed 95% Student-t confidence intervals.
Added sweep preflight sanity checks and long-run gating in aes sweep.
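
The long-run gating can be illustrated with a minimal sketch. It is written in Python for brevity and is not the C++ code in cli/src/main.cpp. The default of 1000 and exit code 2 come from this PR; the function name `preflight`, the `PreflightResult` type, the message wording, and the choice to apply the gate in preflight-only mode as well are assumptions made for illustration.

```python
from dataclasses import dataclass, field

EXIT_OK = 0
EXIT_TOO_MANY_RUNS = 2  # exit code stated in this PR for exceeding --max-runs


@dataclass
class PreflightResult:
    """Hypothetical result type; not part of the real CLI."""
    exit_code: int
    proceed: bool  # True if batch execution should start
    messages: list = field(default_factory=list)


def preflight(run_count, max_runs=1000, allow_large_runs=False,
              preflight_only=False):
    """Gate a sweep after config expansion and before batch execution."""
    messages = [f"preflight: {run_count} expanded runs (max-runs {max_runs})"]

    # Hard gate: too many runs fails unless the override flag is given.
    # Assumption: the gate also applies when only preflight was requested.
    if run_count > max_runs and not allow_large_runs:
        messages.append(
            "error: run count exceeds --max-runs; "
            "pass --allow-large-runs to override"
        )
        return PreflightResult(EXIT_TOO_MANY_RUNS, False, messages)

    # --preflight-only stops after the summary without executing anything.
    if preflight_only:
        return PreflightResult(EXIT_OK, False, messages)

    return PreflightResult(EXIT_OK, True, messages)
```

In this sketch a sweep of exactly 1000 runs proceeds, while 1001 runs fails with exit code 2 unless the override is given.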

Details

Plot pipeline (tools/figures/phase1/plot_phase1_figures.py)
  Added hard-fail comparability validation before multi-strategy plots.
  Comparability keys now follow plotted cohorts:
    workload,seed for cost/completeness/scatter comparisons.
    depth,seed or fanout,seed for sensitivity lines.
    attempt_index,seed for post-revoke hot-path.
  Comparability metadata is persisted in tie-group sidecars.
  Added deterministic CI stats per bucket: n, mean, stddev_sample, sem, ci_low, ci_high.
  Rendered CI whiskers on strategy bars and CI bands on line plots.
  Scatter plots remain per-run points (no CI aggregation).
Sweep CLI (cli/src/main.cpp)
  Added --preflight-only.
  Added --max-runs <u64> with default 1000.
  Added --allow-large-runs.
  Preflight now always runs after config expansion and before batch execution.
  Sweep fails with exit code 2 when the run count exceeds --max-runs, unless --allow-large-runs is passed.
  Preflight prints a summary and non-fatal warning hints for low diversity.
Docs (docs/phase1.md)
  Documented preflight flags and behavior.
  Documented strict comparability semantics.
  Documented deterministic 95% t-interval CI behavior.
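
The hard-fail comparability validation described for the plot pipeline can be sketched as follows. This is an illustration under stated assumptions, not the code in plot_phase1_figures.py: the `check_comparable` function, the `ComparabilityError` class, the `strategy` column name, and the strategy names in the usage example are all hypothetical. Only the key choices (such as workload,seed) come from this PR.

```python
from collections import defaultdict


class ComparabilityError(ValueError):
    """Hypothetical error type raised when strategy cohorts differ."""


def check_comparable(rows, key_fields, strategy_field="strategy"):
    """Require every strategy to cover exactly the same cohort keys.

    rows: iterable of dicts, one per run (column names are assumptions).
    key_fields: e.g. ("workload", "seed") or ("depth", "seed").
    Returns the sorted shared keys, or raises ComparabilityError.
    """
    cohorts = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        cohorts[row[strategy_field]].add(key)

    if not cohorts:
        return []

    # Compare every strategy against the first one in sorted order so
    # the error message is stable across invocations.
    names = sorted(cohorts)
    reference = cohorts[names[0]]
    for name in names[1:]:
        if cohorts[name] != reference:
            missing = sorted(reference - cohorts[name])
            extra = sorted(cohorts[name] - reference)
            raise ComparabilityError(
                f"strategy {name!r} is not comparable with {names[0]!r} "
                f"on {key_fields}: missing={missing}, extra={extra}"
            )
    return sorted(reference)


if __name__ == "__main__":
    runs = [
        {"strategy": "eager", "workload": "w1", "seed": 1},
        {"strategy": "eager", "workload": "w1", "seed": 2},
        {"strategy": "lazy", "workload": "w1", "seed": 1},
    ]
    try:
        check_comparable(runs, ("workload", "seed"))
    except ComparabilityError as err:
        print(err)  # names the strategy and the missing (workload, seed) key
```

Listing the missing and extra keys in the message is one way to make the failure actionable: the reader can see which runs to add or drop.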

Commits

Phase 1: Enforce comparable strategy cohorts (#39)
Phase 1: Add multi-seed confidence intervals (#39)
Phase 1: Add sweep preflight sanity checks (#39)

Validation

ctest --test-dir _build --output-on-failure -R "aes_phase1_figures_tests|aes_phase1_figures_smoke_test|aes_cli_sweep_tests|aes_cli_simulate_tests|aes_metrics_runner_tests|aes_revocation_outcome_metrics_tests"
ctest --test-dir _build --output-on-failure -R "aes_event_log_reader_tests|aes_event_log_replay_tests|aes_revocation_strategy_tests|aes_strategy_equivalence_tests|aes_invalid_event_determinism_tests"

All passed.

Notes for Reviewers

This PR intentionally makes comparability strict. Mismatched cohorts that were previously tolerated now fail fast with actionable errors.
CI computation is deterministic and does not use bootstrap/resampling.
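
For reviewers who want to check the interval arithmetic by hand, here is a minimal sketch of a deterministic 95% Student-t interval over per-seed values. It is not the pipeline's implementation. The field names match the per-bucket stats listed in this PR, but the `ci_stats` function, the small hard-coded table of two-sided critical values, and the decision to reject buckets with fewer than two values are assumptions.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, indexed by degrees of freedom.
# Only a small table is included here; a real pipeline would need more.
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def ci_stats(values):
    """Return n, mean, stddev_sample, sem, ci_low, ci_high for one bucket."""
    n = len(values)
    if n < 2:
        # Assumption: a single seed has no sample stddev, so reject it.
        raise ValueError("need at least two values for a t-interval")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no critical value tabulated for df={df}")

    mean = statistics.fmean(values)
    stddev_sample = statistics.stdev(values)  # n - 1 denominator
    sem = stddev_sample / math.sqrt(n)
    half_width = T_CRIT_95[df] * sem
    return {
        "n": n,
        "mean": mean,
        "stddev_sample": stddev_sample,
        "sem": sem,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


if __name__ == "__main__":
    # Three hypothetical per-seed measurements for one bucket.
    print(ci_stats([10.0, 12.0, 14.0]))
```

For the three example values the mean is 12, the sample standard deviation is 2, and the interval is approximately 7.03 to 16.97. No random numbers are involved, so repeated calls on the same inputs give identical results.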


erikinkinen added this to the Phase 1 milestone

2026-03-06 15:45:01 +01:00

erikinkinen added the phase-1, invariant, tests labels

2026-03-06 15:45:01 +01:00

erikinkinen self-assigned this

2026-03-06 15:45:01 +01:00

erikinkinen added 3 commits

2026-03-06 15:45:01 +01:00