This implementation follows defensive programming principles with retry logic, state verification, and graceful degradation instead of relying on fixed delays.
- Verify State, Don't Assume - Always check actual state rather than waiting arbitrary amounts of time
- Retry with Backoff - Failed operations retry with exponential backoff
- Graceful Degradation - Services continue operating even when dependencies are temporarily unavailable
- Fail Loudly - Errors are logged with context for debugging
- Race Condition Awareness - All inter-service dependencies handle timing issues
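The retry-with-backoff principle can be sketched as a small shell helper. This is illustrative only; the function name and defaults are not part of the actual scripts:

```shell
# Hypothetical retry helper; doubles the delay after each failed attempt.
retry_with_backoff() {
    local max_attempts=$1
    shift
    local delay=1 attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))        # exponential backoff: 1s, 2s, 4s, ...
        attempt=$((attempt + 1))
    done
}
```

For example, `retry_with_backoff 5 docker start tidal_connect` would retry a failed start up to five times with growing pauses.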
Purpose: Manages the Docker container lifecycle with defensive mDNS handling.
Flow:
Start → Ensure Avahi Running → Wait for mDNS Clear → Start Container → Verify Healthy
Stop → Stop Container (10s timeout) → Verify Stopped
Defensive Features:
- `wait-for-avahi.sh`: Polls Avahi until actually running (not just started)
- `wait-for-mdns-clear.sh`: Checks if the old mDNS registration has cleared (with fallback timeout)
- `wait-for-container.sh`: Verifies the container is not just running, but healthy
Why No Fixed Sleeps:
- ❌ Old: `sleep 5` blindly waits; might be too short or too long
- ✅ New: Poll until the actual state is achieved, with a timeout for safety
Three Levels of Readiness:
- Stopped: Container not in `docker ps` and state != `running`
- Running: Container exists and Docker reports `State.Running = true`
- Healthy: Container running AND the `tidal_connect_application` process is alive
Usage:
```shell
./wait-for-container.sh <name> <max_wait_sec> <check_interval> <stopped|running|healthy>
```
Why This Matters:
- Container might be "running" but app crashed
- mDNS registration happens after app starts
- Need to verify actual application health, not just container state
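The checks above map onto a generic polling loop. The sketch below is a pattern illustration, not the actual script, and the `docker` commands used for the "healthy" check are assumptions:

```shell
# Poll a check command until it succeeds or a timeout expires.
poll_until() {
    # poll_until <max_wait_sec> <interval_sec> <check-command...>
    local max_wait=$1 interval=$2 elapsed=0
    shift 2
    until "$@"; do
        if [ "$elapsed" -ge "$max_wait" ]; then
            return 1                 # timeout: fail loudly instead of hanging
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Hypothetical "healthy" check: container running AND app process alive.
container_healthy() {
    [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ] &&
        docker exec "$1" pgrep tidal_connect_application >/dev/null 2>&1
}

# Usage sketch: poll_until 30 1 container_healthy tidal_connect
```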
Problem: mDNS announcements have ~120s TTL. Rapid restarts cause self-collision.
Solution:
```text
# Try to actively verify clearance
if avahi-browse available:
    poll until the name disappears from mDNS
    wait for 2 consecutive "clear" checks
else:
    fall back to a safe minimum delay (5s)
```
Defensive Features:
- Active verification when possible
- Fallback to safe default if tools unavailable
- Timeout prevents infinite hangs
- Reads FRIENDLY_NAME from config automatically
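A shell sketch of that clearance check. The `_tidal._tcp` service type and the exact `avahi-browse` invocation are assumptions; the real script reads `FRIENDLY_NAME` from config:

```shell
# Does our name still appear in mDNS? (service type assumed)
name_in_mdns() {
    avahi-browse -t -p _tidal._tcp 2>/dev/null | grep -q "$FRIENDLY_NAME"
}

wait_for_clear() {
    # wait_for_clear <max_wait_sec> <interval_sec>
    local max_wait=$1 interval=$2 clear_count=0 elapsed=0
    if ! command -v avahi-browse >/dev/null 2>&1; then
        sleep 5                  # fallback: safe minimum delay, can't verify
        return 0
    fi
    while [ "$elapsed" -lt "$max_wait" ]; do
        if name_in_mdns; then
            clear_count=0        # still registered: reset the streak
        else
            clear_count=$((clear_count + 1))
            [ "$clear_count" -ge 2 ] && return 0   # two consecutive clears
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0                     # timeout reached: proceed rather than hang
}
```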
Purpose: Monitors for errors and performs intelligent restart with full verification.
Error Detection:
```text
check_for_errors() {
    # Only scan recent logs (last CHECK_INTERVAL seconds) for:
    - Token expired
    - Connection errors (excluding normal EOFs)
    - Container down
}
```
Restart Logic:
1. Check cooldown (prevent restart loops)
2. Stop service
3. Wait for actual stop (not just command completion)
4. Clean up stale containers
5. Start service
6. Wait for healthy state (verify app running)
7. Verify no immediate mDNS collision
8. Restart volume bridge
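The cooldown guard in step 1 can be sketched as follows. The state-file path and function names are hypothetical; the real watchdog may track this differently:

```shell
RESTART_COOLDOWN=60
LAST_RESTART_FILE="/tmp/tidal-last-restart"   # hypothetical state file

record_restart() {
    date +%s > "$LAST_RESTART_FILE"
}

cooldown_active() {
    # Succeeds (0) if the last restart was under RESTART_COOLDOWN seconds ago.
    [ -f "$LAST_RESTART_FILE" ] || return 1
    local last now
    last=$(cat "$LAST_RESTART_FILE")
    now=$(date +%s)
    [ $((now - last)) -lt "$RESTART_COOLDOWN" ]
}
```

The watchdog would consult `cooldown_active` before step 2 and skip the restart cycle while it succeeds, preventing restart loops.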
Key Improvements:
- ❌ Old: `systemctl restart` (no control over timing)
- ✅ New: `stop` + verify stopped + `start` + verify healthy
- ❌ Old: `sleep 3`, then check
- ✅ New: Poll with timeout until health confirmed
Defensive Features:
- `wait_for_service_stopped()`: Polls both systemctl and Docker state
- `wait_for_service_started()`: Verifies systemctl active, container running, AND app process alive
- Post-restart collision check
- Detailed logging for debugging
Purpose: Monitors speaker controller and syncs volume/metadata.
Challenge: Container restarts while bridge is running.
Solution:
```text
is_container_ready() {
    container running AND speaker_controller_application process alive
}
```
Main loop:
```text
if ! is_container_ready:
    track consecutive errors
    after N errors: wait_for_container() with retry
    resume when available
```
Defensive Features:
- Detects container unavailability immediately
- Exponential backoff (errors trigger longer waits)
- Automatic recovery when container returns
- Doesn't crash/restart, just waits
- Logs connection status changes
Why This Matters:
- Volume bridge must survive container restarts
- systemd `Restart=on-failure` only helps if the service crashes
- Better to detect unavailability and wait than to crash
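The error tolerance in the main loop can be sketched as a counter around the readiness check. Names other than `is_container_ready` are hypothetical:

```shell
MAX_CONSECUTIVE_ERRORS=5
errors=0

handle_check() {
    # Returns 0 when ready, 1 on a tolerated error, and 2 when the error
    # cap is hit and the caller should fall back to wait_for_container().
    if is_container_ready; then
        errors=0
        return 0
    fi
    errors=$((errors + 1))
    if [ "$errors" -ge "$MAX_CONSECUTIVE_ERRORS" ]; then
        errors=0
        return 2
    fi
    return 1
}
```

The key design choice: the bridge never exits on error; it only escalates from "tolerate" to "wait for recovery".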
Flow:
Watchdog detects "token expired" in logs
→ Cooldown check (don't restart too often)
→ Stop service + verify stopped
→ Clear stale containers
→ Start service + verify healthy (up to 45s)
→ Check for mDNS collision
→ Restart volume bridge
Fallback: If restart fails, watchdog logs error and waits for next check cycle.
Problem: User runs systemctl restart multiple times quickly.
Protection:
- systemctl enforces service dependency order
- `wait-for-mdns-clear.sh` delays start until safe
- `TimeoutStartSec=45` prevents a premature systemd timeout
- Watchdog cooldown prevents interference
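These protections live mostly in the unit file. A sketch of the relevant directives; the script path and dependency unit names are assumptions, only the timeout values are quoted from this document:

```ini
# tidal.service (illustrative fragment)
[Unit]
After=docker.service avahi-daemon.service
Requires=docker.service

[Service]
ExecStartPre=/usr/local/bin/wait-for-mdns-clear.sh
TimeoutStartSec=45
TimeoutStopSec=20
```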
Flow:
Container temporarily loses network
→ Watchdog sees connection errors
→ Cooldown active? Wait
→ Else: Restart with full verification
→ Volume bridge detects container down
→ Waits for recovery
→ Resumes when healthy
Flow:
Docker restarts, all containers stop
→ Watchdog detects container down
→ Attempts restart
→ Might fail if Docker still initializing
→ Watchdog retries after CHECK_INTERVAL
→ Eventually succeeds when Docker ready
→ Volume bridge survives, reconnects automatically
Flow:
Avahi crashes
→ TIDAL Connect loses mDNS registration
→ Device disappears from app
→ Watchdog might not detect (depends on logs)
→ Next restart will start Avahi if needed
→ Manual intervention: systemctl restart avahi-daemon && systemctl restart tidal.service
Problem: Service restart before old mDNS TTL expires.
Solutions Applied:
- Don't restart Avahi (preserve mDNS state)
- Wait for mDNS clearance before starting
- Verify no collision after start
- Watchdog cooldown prevents rapid restarts
Problem: Service considers itself "up" but app isn't running.
Solutions Applied:
- `ExecStartPost` waits for "healthy", not just "running"
- Check the actual process (`pgrep tidal_connect_application`)
- Double-check after 2s to catch immediate crashes
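The double-check can be factored into a tiny helper (the helper name is illustrative):

```shell
confirmed_alive() {
    # Run the supplied health check, wait 2s, then run it again so that
    # processes that crash right after startup are still caught.
    "$@" || return 1
    sleep 2
    "$@"
}

# e.g. from ExecStartPost (container and process names as in this doc):
#   confirmed_alive docker exec tidal_connect pgrep tidal_connect_application
```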
Problem: Bridge starts, container not ready yet.
Solutions Applied:
- `After=tidal.service` and `Requires=tidal.service` (systemd ordering)
- Bridge runs `wait_for_container()` on startup
- Retries if the container goes away
Problem: User restarts service, watchdog also triggers restart.
Solutions Applied:
- 60-second cooldown in watchdog
- Watchdog checks service state before acting
- Only restarts if actually failed/crashed
Problem: `systemctl stop` returns, but the container is still cleaning up.
Solutions Applied:
- `ExecStopPost` waits for the actual stop
- Watchdog `wait_for_service_stopped()` polls state
- Force cleanup if graceful stop fails
```shell
# tidal.service
TimeoutStartSec=45          # Container might take time to pull image/start
TimeoutStopSec=20           # 10s for docker-compose down + 10s for verification

# tidal-watchdog.sh
CHECK_INTERVAL=30           # How often to check logs (balance responsiveness vs CPU)
RESTART_COOLDOWN=60         # Minimum time between restarts (prevent loops)

# wait-for-container.sh
MAX_WAIT=30                 # Fail after 30s if not ready
CHECK_INTERVAL=1            # Check every second

# wait-for-mdns-clear.sh
MAX_WAIT=15                 # Most mDNS caches clear within 15s
CHECK_INTERVAL=2            # Check every 2s

# volume-bridge.sh
MAX_CONSECUTIVE_ERRORS=5    # Tolerate 5 errors before a long wait
# wait_for_container: 60 attempts × 2s = 2 minutes max
```

Faster Restarts (aggressive):
```shell
# Reduce waits (risk: more collisions)
MAX_WAIT=10           # in wait-for-mdns-clear.sh
RESTART_COOLDOWN=30   # in the watchdog
```

More Reliable (conservative):
```shell
# Increase waits (trade-off: slower recovery)
MAX_WAIT=20           # in wait-for-mdns-clear.sh
RESTART_COOLDOWN=90   # in the watchdog
```

Watchdog:
```shell
# Add to tidal-watchdog.sh
set -x  # Print all commands
```

Service:
```shell
journalctl -u tidal.service -f
journalctl -u tidal-watchdog.service -f
journalctl -u tidal-volume-bridge.service -f
```

```shell
# Check all states
./check-tidal-status.sh

# Container health
./wait-for-container.sh tidal_connect 5 1 healthy && echo "Healthy" || echo "Not healthy"

# mDNS status
avahi-browse -t _tidal._tcp

# Service dependencies
systemctl list-dependencies tidal.service
```

Container never becomes healthy:
```shell
docker logs tidal_connect --tail 50
docker exec tidal_connect ps aux
```

mDNS never clears:
```shell
avahi-browse -a | grep hifiberry
# If stuck, restart: systemctl restart avahi-daemon
```

Watchdog not restarting:
```shell
tail -50 /var/log/tidal-watchdog.log
# Check cooldown timing
```

- Health Endpoint: Have tidal_connect expose an HTTP health endpoint
- Metrics: Export timing metrics for monitoring
- Adaptive Cooldown: Increase cooldown after multiple failures
- mDNS Goodbye: Ensure container sends goodbye on stop
- Preemptive Token Refresh: Refresh token before expiry
- Cold start (service never run before)
- Normal restart (`systemctl restart`)
- Rapid restarts (5x within 30s)
- Token expiration (wait 1 hour)
- Docker daemon restart
- Avahi daemon restart
- Network disconnect/reconnect
- Container crash (kill -9 tidal_connect_application)
- Volume bridge survives container restart
- No mDNS collisions after restart