Skip to content

Multi-location inference per orchestrator pod#473

Open
akashmjn wants to merge 6 commits into
mainfrom
claude/investigate-issue-469-tVlXW
Open

Multi-location inference per orchestrator pod#473
akashmjn wants to merge 6 commits into
mainfrom
claude/investigate-issue-469-tVlXW

Conversation

@akashmjn
Copy link
Copy Markdown
Collaborator

@akashmjn akashmjn commented Mar 27, 2026

Summary

This PR fixes #469. Allows processing >4 locations per VM node without unnecessary memory overhead.

  • Refactors LiveInferenceOrchestrator.py to support multiple hydrophone locations per container, reducing idle time from ~80% per loop iteration
  • New config format: shared orchestrator_config + location_config list with per-location hls_hydrophone_id and optional model_config_overrides
  • Model loaded once; per-location overrides applied at inference time via detect_srkw_from_file()
  • Splits run_loop() into run_date_range() (single-location) and run_live_hls() (multi-location with sequential processing per iteration)
  • Per-location [location_id] prefixes in logs for observability with try/except error isolation so one failing location doesn't block others
  • All 13 test configs + deploy configmaps converted to new format
  • New consolidated inference-pod-0-configmap.yaml with all 8 locations in a single config

Test plan

  • Existing test_orchestrator.py tests pass (DateRangeHLS positive/negative/fail cases)
  • test_livehls_smoke passes with single-location LiveHLS config
  • test_livehls_multilocation_smoke passes — verifies both [rpi_orcasound_lab] and [rpi_bush_point] prefixes appear in output
  • Config parsing unit checks pass (all 14 configs, override isolation, hls_hydrophone_id defaulting)

claude added 4 commits March 27, 2026 18:34
…tion format

Introduces the new config parsing layer that expects:
- orchestrator_config: shared settings (model, stream type, azure, etc.)
- location_config: list of per-location entries with hls_hydrophone_id
  and optional model_config_overrides

load_model() no longer applies overrides at load time — per-location
overrides are resolved at inference time via resolve_location_model_config().

Part of #469.

https://claude.ai/code/session_01Dio4ds5gm21JjhpjiWiW8y
… support

- run_live_hls(): processes multiple locations sequentially within each
  time-aligned iteration. Pre-builds per-location HLS clients and model
  configs. Each location is wrapped in try/except for error isolation.
- run_date_range(): extracts existing DateRangeHLS logic unchanged
  (single-location only).
- _process_segment() now takes explicit hls_hydrophone_id and log_prefix
  params for per-location observability (e.g. "[rpi_orcasound_lab] ...").
- __main__ updated to use parse_config() and dispatch to the appropriate
  run function based on hls_stream_type.

Part of #469.

https://claude.ai/code/session_01Dio4ds5gm21JjhpjiWiW8y
… format

- All test orch_configs (DateRangeHLS, LiveHLS, Fail, Negative, Positive,
  LocalDebug) converted to nest shared settings under orchestrator_config
  and per-location settings under location_config list.
- New multi-location LiveHLS test config: LiveHLS_MultiLocation.yml
  (orcasound_lab + bush_point).
- New consolidated deploy configmap: inference-pod-0-configmap.yaml
  with all 8 locations in a single config (replaces per-location configmaps).

Part of #469.

https://claude.ai/code/session_01Dio4ds5gm21JjhpjiWiW8y
Adds test_livehls_multilocation_smoke() that runs the orchestrator with
LiveHLS_MultiLocation.yml (2 locations: orcasound_lab + bush_point) and
verifies both location prefixes appear in log output.

Part of #469.

https://claude.ai/code/session_01Dio4ds5gm21JjhpjiWiW8y
@akashmjn akashmjn requested a review from dthaler March 27, 2026 19:15
resolve_location_model_config() now applies two layers of overrides:
1. orchestrator_config.model_config_overrides (shared defaults)
2. location_config[].model_config_overrides (per-location, takes precedence)

This simplifies configs — common values like max_batch_size and
pred_global_threshold go in orchestrator_config, with only location-specific
deviations in location_config (e.g. north_sjc threshold=0.95).

Updated deploy configmap and test configs accordingly.

Part of #469.

https://claude.ai/code/session_01Dio4ds5gm21JjhpjiWiW8y
Comment thread InferenceSystem/deploy/inference-pod-0-configmap.yaml
@akashmjn akashmjn marked this pull request as ready for review March 27, 2026 19:43
@dthaler dthaler requested a review from Copilot March 30, 2026 15:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the Python InferenceSystem live orchestrator to process multiple hydrophone locations sequentially within a single pod/loop iteration, aiming to reduce idle time and per-location memory overhead (fixes #469).

Changes:

  • Introduces a new YAML config schema: shared orchestrator_config + location_config list, with optional per-location model_config_overrides.
  • Splits the orchestrator execution into run_live_hls() (multi-location) and run_date_range() (single-location), and adds per-location log prefixes.
  • Updates test orchestrator configs to the new schema and adds a multi-location LiveHLS smoke test; adds a new consolidated Kubernetes ConfigMap.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
InferenceSystem/src/LiveInferenceOrchestrator.py Adds new config parsing + multi-location LiveHLS loop and per-location override resolution.
InferenceSystem/tests/test_orchestrator.py Adds a multi-location LiveHLS smoke test asserting per-location log prefixes.
InferenceSystem/tests/orch_configs/Positive/DateRangeHLS_SunsetBay.yml Migrates DateRange config to new orchestrator_config/location_config format.
InferenceSystem/tests/orch_configs/Positive/DateRangeHLS_PortTownsend.yml Migrates DateRange config to new format.
InferenceSystem/tests/orch_configs/Positive/DateRangeHLS_OrcasoundLab.yml Migrates DateRange config to new format.
InferenceSystem/tests/orch_configs/Positive/DateRangeHLS_NorthSJC.yml Migrates DateRange config to new format.
InferenceSystem/tests/orch_configs/Positive/DateRangeHLS_MastCenter.yml Migrates DateRange config to new format.
InferenceSystem/tests/orch_configs/Positive/DateRangeHLS_BushPoint.yml Migrates DateRange config to new format.
InferenceSystem/tests/orch_configs/Positive/DateRangeHLS_AndrewsBay.yml Migrates DateRange config to new format.
InferenceSystem/tests/orch_configs/Negative/DateRangeHLS_PointRobinson.yml Migrates DateRange config to new format.
InferenceSystem/tests/orch_configs/LocalDebug/DateRangeHLS_SunsetBay_AzureUploadTest.yml Migrates local debug DateRange config to new format.
InferenceSystem/tests/orch_configs/LiveHLS/LiveHLS_OrcasoundLab.yml Migrates LiveHLS config to new format (single location).
InferenceSystem/tests/orch_configs/LiveHLS/LiveHLS_MultiLocation.yml Adds a new multi-location LiveHLS test config with per-location override.
InferenceSystem/tests/orch_configs/Fail/DateRangeHLS_NoAudio2.yml Migrates fail-case DateRange config to new format.
InferenceSystem/tests/orch_configs/Fail/DateRangeHLS_NoAudio.yml Migrates fail-case DateRange config to new format.
InferenceSystem/tests/orch_configs/Fail/DateRangeHLS_IncompleteMinute.yml Migrates fail-case DateRange config to new format.
InferenceSystem/deploy/inference-pod-0-configmap.yaml Adds a consolidated ConfigMap containing multiple locations under the new schema.



def test_livehls_multilocation_smoke():
"""Smoke test: multi-location LiveHLS processes both locations per iteration."""
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring says the multi-location smoke test processes “both locations”, but the test config asserts three location prefixes (orcasound_lab, bush_point, north_sjc). Update the docstring to match the actual expectation to avoid confusion when maintaining the test.

Suggested change
"""Smoke test: multi-location LiveHLS processes both locations per iteration."""
"""Smoke test: multi-location LiveHLS processes all configured locations (3) per iteration."""

Copilot uses AI. Check for mistakes.
Comment on lines +244 to +246
orch_config = raw_config["orchestrator_config"]
locations = raw_config["location_config"]
for loc in locations:
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_config() unconditionally indexes raw_config["orchestrator_config"] / ["location_config"], so any existing single-location configs (e.g., the current Kubernetes hydrophone-configs ConfigMaps under InferenceSystem/deploy/*-configmap.yaml) will raise a KeyError and crash the pod. Consider either (a) updating the existing deploy manifests/ConfigMaps to the new nested format, or (b) making parse_config backward-compatible (detect old flat format and wrap into a single-entry location_config), and in all cases raise a clear ValueError when required keys are missing.

Copilot uses AI. Check for mistakes.
max_live_iterations=args.max_live_iterations,
)

elif hls_stream_type == "DateRangeHLS":
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the DateRangeHLS path, the code silently picks locations[0] and ignores any additional entries in location_config. It would be safer to validate that exactly one location is provided for DateRangeHLS (and raise a helpful error if not), or explicitly iterate all locations if multi-location DateRangeHLS is intended.

Suggested change
elif hls_stream_type == "DateRangeHLS":
elif hls_stream_type == "DateRangeHLS":
# DateRangeHLS currently supports exactly one location; validate configuration
if len(locations) != 1:
location_ids = [loc.get("location_id") for loc in locations]
raise ValueError(
f"DateRangeHLS requires exactly one location, but got "
f"{len(locations)} (location_ids={location_ids}). "
"Please update your orchestrator configuration."
)

Copilot uses AI. Check for mistakes.
Comment on lines +222 to +249
def parse_config(raw_config):
"""Parse YAML config into (orch_config, locations).

Expects the new multi-location format::

orchestrator_config:
model_id: ...
model_config_path: ...
...
location_config:
- location_id: "rpi_orcasound_lab"
hls_hydrophone_id: "rpi_orcasound_lab"
model_config_overrides: { ... }

Returns
-------
orch_config : dict
Shared orchestrator settings (from ``orchestrator_config`` key).
locations : list[dict]
Per-location dicts. Each has at least ``location_id`` and
``hls_hydrophone_id``, plus optional ``model_config_overrides``.
"""
orch_config = raw_config["orchestrator_config"]
locations = raw_config["location_config"]
for loc in locations:
if "hls_hydrophone_id" not in loc:
loc["hls_hydrophone_id"] = loc["location_id"]
return orch_config, locations
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new config parsing/override behavior (parse_config defaulting of hls_hydrophone_id + resolve_location_model_config precedence) isn’t directly unit-tested. Adding focused tests for: old-vs-new config format handling, hls_hydrophone_id defaulting, and override precedence/isolation would make this refactor safer and help catch future config regressions.

Copilot generated this review using guidance from repository custom instructions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] Further reduce inference costs with multiple locations per inference orch container

3 participants