Question
We're running local evaluation via docker compose -f docker/docker-compose.yaml up with the default sample_config.yaml. We'd like to understand how to make local testing more representative of the official portal evaluation.
Specifically:
- Does the official portal evaluation use randomized task board poses for each submission? The docs mention randomization, but it's unclear whether each submission sees a fresh random config or a fixed (but secret) one.
- Are the randomization ranges for the official eval the same as those documented in task_board_limits in sample_config.yaml? (NIC translation: [-0.0215, 0.0234]m, SC translation: [-0.06, 0.055]m, etc.)
- Is the task board yaw fully randomized (0-360°) or constrained to a range where insertion is kinematically feasible?
- Is there a recommended way to test locally with varied configurations? For example, should we rebuild the eval image with modified configs, or is there a launch parameter approach that works with the pre-built ghcr.io/intrinsic-dev/aic/aic_eval image?
We've noticed that policies perform very differently on the fixed sample_config.yaml vs slightly varied board poses, and want to ensure our local testing is representative before using limited daily submissions.
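For concreteness, here is a minimal sketch of the kind of variation we have in mind for local testing. It only draws offsets within the two ranges quoted above. The dictionary keys, the function names, and the choice of a uniform distribution are our own assumptions for illustration; they are not the actual schema of sample_config.yaml or the sampling used by the official eval.

```python
import random

# Ranges quoted from task_board_limits in sample_config.yaml, in metres.
# The key names are illustrative placeholders, NOT the real config schema.
LIMITS_M = {
    "nic_translation": (-0.0215, 0.0234),
    "sc_translation": (-0.06, 0.055),
}


def sample_offsets(rng, limits=LIMITS_M):
    """Draw one offset per entry, uniformly within its [low, high] range.

    Uniform sampling is an assumption; the official distribution is unknown to us.
    """
    return {name: rng.uniform(low, high) for name, (low, high) in limits.items()}


def sample_trials(n, seed=0):
    """Return n reproducible sets of offsets for a given seed."""
    rng = random.Random(seed)
    return [sample_offsets(rng) for _ in range(n)]


if __name__ == "__main__":
    for i, offsets in enumerate(sample_trials(5, seed=42)):
        print(i, {name: round(value, 4) for name, value in offsets.items()})
```

Each sampled set would then have to be written into a local config by hand or by script, which is the step we are unsure how to do against the pre-built image.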
Thanks for any guidance!