Phase 1: Experiment hygiene and threats-to-validity checks #83

Merged
erikinkinen merged 3 commits from 1-experiment-hygiene-and-threats-to-validity-checks-39 into main 2026-03-06 15:50:54 +01:00

Summary

Implements Phase 1 experiment-hygiene and threats-to-validity safeguards from #39.

Closes #39.

What Changed

  • Enforced strict strategy-cohort comparability in the figure pipeline.
  • Added deterministic multi-seed 95% Student-t confidence intervals.
  • Added sweep preflight sanity checks and long-run gating in aes sweep.
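Strict cohort comparability, the first change listed above, amounts to requiring that every strategy in a plot was run on the same set of cohort keys. The following is a minimal sketch of such a check, not the code in `plot_phase1_figures.py`: the function name, the row layout (dicts with a `strategy` entry), and the error type are all hypothetical, and only the `workload,seed` key comes from this PR.

```python
from collections import defaultdict


class ComparabilityError(ValueError):
    """Raised when strategies were not run on identical cohorts (hypothetical type)."""


def validate_comparable(rows, key_fields=("workload", "seed")):
    """Hard-fail unless every strategy covers the same set of cohort keys.

    `rows` is an iterable of dicts holding a "strategy" entry plus the key
    fields. Returns the shared, sorted list of cohort keys on success.
    """
    keys_by_strategy = defaultdict(set)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        keys_by_strategy[row["strategy"]].add(key)

    if not keys_by_strategy:
        raise ComparabilityError("no rows to compare")

    # Any key seen for one strategy must be present for all of them.
    all_keys = set().union(*keys_by_strategy.values())
    problems = []
    for strategy in sorted(keys_by_strategy):
        missing = sorted(all_keys - keys_by_strategy[strategy])
        if missing:
            problems.append(
                f"{strategy} is missing {','.join(key_fields)} = {missing}"
            )
    if problems:
        raise ComparabilityError("strategy cohorts differ: " + "; ".join(problems))
    return sorted(all_keys)
```

Collecting every mismatch before raising, rather than stopping at the first one, keeps the error actionable: a single failure names each strategy and the cohorts it lacks.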

Details

  • Plot pipeline (tools/figures/phase1/plot_phase1_figures.py)
      • Added hard-fail comparability validation before multi-strategy plots.
      • Comparability keys now follow plotted cohorts:
          • workload,seed for cost/completeness/scatter comparisons.
          • depth,seed or fanout,seed for sensitivity lines.
          • attempt_index,seed for post-revoke hot-path.
      • Comparability metadata is persisted in tie-group sidecars.
      • Added deterministic CI stats per bucket: n, mean, stddev_sample, sem, ci_low, ci_high.
      • Rendered CI whiskers on strategy bars and CI bands on line plots.
      • Scatter plots remain per-run points (no CI aggregation).
  • Sweep CLI (cli/src/main.cpp)
      • Added --preflight-only.
      • Added --max-runs <u64> with default 1000.
      • Added --allow-large-runs.
      • Preflight now always runs after config expansion and before batch execution.
      • Sweep fails with exit code 2 when the run count exceeds --max-runs, unless --allow-large-runs is set.
      • Preflight prints a summary and non-fatal warning hints for low diversity.
  • Docs (docs/phase1.md)
      • Documented preflight flags and behavior.
      • Documented strict comparability semantics.
      • Documented deterministic 95% t-interval CI behavior.
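The long-run gating described for the sweep CLI can be modelled in a few lines. This is an illustrative Python sketch of the decision logic only, not the C++ in `cli/src/main.cpp`: the function name, message wording, and the "fewer than two distinct seeds" diversity heuristic are assumptions, while the default of 1000, exit code 2, and the override flag come from this PR.

```python
DEFAULT_MAX_RUNS = 1000   # default of --max-runs, per this PR
EXIT_OK = 0
EXIT_TOO_MANY_RUNS = 2    # exit code stated in this PR


def preflight(run_count, max_runs=DEFAULT_MAX_RUNS, allow_large_runs=False,
              distinct_seeds=None):
    """Return (exit_code, messages) for an expanded sweep of `run_count` runs.

    Warnings never change the exit code; only exceeding `max_runs`
    without the override does.
    """
    messages = [f"preflight: {run_count} run(s) planned, limit {max_runs}"]

    # Assumed heuristic: the PR only says "low diversity" hints are non-fatal.
    if distinct_seeds is not None and distinct_seeds < 2:
        messages.append("warning: fewer than 2 distinct seeds")

    if run_count > max_runs and not allow_large_runs:
        messages.append(
            f"error: {run_count} runs exceeds limit {max_runs}; "
            "raise --max-runs or pass --allow-large-runs"
        )
        return EXIT_TOO_MANY_RUNS, messages
    return EXIT_OK, messages
```

Under these assumptions, `preflight(1001)` is rejected with exit code 2, whereas `preflight(1001, allow_large_runs=True)` proceeds.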

Commits

  • Phase 1: Enforce comparable strategy cohorts (#39)
  • Phase 1: Add multi-seed confidence intervals (#39)
  • Phase 1: Add sweep preflight sanity checks (#39)

Validation

  • ctest --test-dir _build --output-on-failure -R "aes_phase1_figures_tests|aes_phase1_figures_smoke_test|aes_cli_sweep_tests|aes_cli_simulate_tests|aes_metrics_runner_tests|aes_revocation_outcome_metrics_tests"
  • ctest --test-dir _build --output-on-failure -R "aes_event_log_reader_tests|aes_event_log_replay_tests|aes_revocation_strategy_tests|aes_strategy_equivalence_tests|aes_invalid_event_determinism_tests"

All passed.

Notes for Reviewers

  • This PR intentionally makes comparability strict. Previously tolerated mismatched cohorts will now fail fast with actionable errors.
  • CI computation is deterministic and does not use bootstrap/resampling.
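For reviewers who want to sanity-check the interval math, a 95% Student-t interval over per-seed values is mean ± t·s/√n, where s is the sample standard deviation and t is the two-sided critical value for n − 1 degrees of freedom. The sketch below is an assumption about the shape of the computation, not the pipeline's code: the function name is hypothetical, the critical values are the standard table entries for 1 to 10 degrees of freedom only, and rejecting buckets with fewer than two values is a simplification. The output field names follow the per-bucket stats listed in this PR.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# A fixed table keeps the result deterministic and dependency-free.
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def ci_stats(values):
    """Compute n, mean, stddev_sample, sem, ci_low, ci_high for one bucket."""
    n = len(values)
    if n < 2:
        # Simplifying assumption: an interval needs at least two seeds.
        raise ValueError("need at least 2 values for a confidence interval")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no tabulated critical value for df={df} in this sketch")

    mean = statistics.fmean(values)
    stddev_sample = statistics.stdev(values)   # divides by n - 1
    sem = stddev_sample / math.sqrt(n)         # standard error of the mean
    half_width = T_CRIT_95[df] * sem
    return {
        "n": n,
        "mean": mean,
        "stddev_sample": stddev_sample,
        "sem": sem,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }
```

Because nothing here is sampled, the same inputs always yield the same interval, which is the property the note above contrasts with bootstrap or resampling approaches.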
erikinkinen added this to the Phase 1 milestone 2026-03-06 15:45:01 +01:00
Phase 1: Add sweep preflight sanity checks (#39)
All checks were successful
ci / smoke (push) Successful in 1m26s
clang-format / check-format (push) Successful in 10s
markdownlint / markdown-lint (push) Successful in 10s
ci / smoke (pull_request) Successful in 1m26s
clang-format / check-format (pull_request) Successful in 10s
markdownlint / markdown-lint (pull_request) Successful in 10s
c2798743d5
Reference
erikinkinen/AES!83