webhooks: add backup.failed, backup.restore.failed, config.apply.failed, server.crashed #14

Merged
seslly merged 1 commit into shipstuff:main from justintout:feat/failure-webhook-events on May 4, 2026

@justintout
Contributor

Closes #5.

What this adds

Four failure events on the webhook stream + a clean operator-stop vs unexpected-exit split:

| Event | Fires from | Payload |
| --- | --- | --- |
| backup.failed | trigger_auto_backup, POST /api/backups, POST /api/backups/upload | source (auto/manual/imported), reason, plus contextual fields (pinned, filename, trigger) |
| backup.restore.failed | POST /api/backups/{id}/restore, POST /api/game-backups/{ts}/restore | backupId, reason |
| config.apply.failed | POST /api/config/apply (mod-stop or mod-apply path) | stage (stop-for-mods / apply-mods), reason |
| server.crashed | EventDetector on the online→offline edge when no operator stop was issued recently | same shape as server.offline |
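
For readers wiring up a consumer, here is a rough sketch of what the failure payloads in the table might look like. The field names `source`, `reason`, `pinned`, and `stage` come from the table; the `event` and `timestamp` envelope keys and the `build_failure_payload` helper are assumptions for illustration and are not taken from server.py.

```python
import json
import time


def build_failure_payload(event: str, reason: str, **context) -> dict:
    """Assemble an illustrative failure-event payload (hypothetical helper)."""
    # "event" and "timestamp" are assumed envelope keys, not confirmed by the PR.
    payload = {"event": event, "timestamp": int(time.time()), "reason": reason}
    # Contextual fields (source, filename, backupId, stage, ...) ride alongside;
    # fields that are not known at the failure site are simply left out.
    payload.update({k: v for k, v in context.items() if v is not None})
    return payload


backup_failed = build_failure_payload(
    "backup.failed",
    "backup dir is read-only",
    source="manual",
    pinned=False,
    filename=None,  # unknown at failure time, so it is dropped
)
config_failed = build_failure_payload(
    "config.apply.failed",
    "mod apply step failed",
    stage="apply-mods",
)

print(json.dumps(backup_failed, indent=2))
print(json.dumps(config_failed, indent=2))
```

Dropping `None` values keeps each payload limited to the fields that were actually known where the failure happened.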

server.offline keeps firing — but only for operator-initiated stops/restarts within _OPERATOR_STOP_GRACE_SECONDS (30 s) of the API call. Anything else is server.crashed.

Defaults

Per the discussion on #5 — would rather be verbose than miss a failing backup — the new failure events ship in the default WINDROSE_WEBHOOK_EVENTS. Operators who already set their own value see no change.

- server.online,server.offline,player.join,player.leave
+ server.online,server.offline,server.crashed,
+ player.join,player.leave,
+ backup.failed,backup.restore.failed,config.apply.failed
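
A minimal sketch of how the new default could be resolved against an operator override. The variable name `WINDROSE_WEBHOOK_EVENTS` and the event list come from this PR; the `enabled_events` helper and the comma-splitting behaviour are assumptions, since the actual parsing in server.py is not shown here.

```python
import os

# New default from this PR, joined into a single comma-separated string.
DEFAULT_EVENTS = (
    "server.online,server.offline,server.crashed,"
    "player.join,player.leave,"
    "backup.failed,backup.restore.failed,config.apply.failed"
)


def enabled_events(environ=os.environ) -> set:
    """Return the set of enabled event names (hypothetical helper)."""
    # An explicit operator value wins; the default only applies when unset.
    raw = environ.get("WINDROSE_WEBHOOK_EVENTS", DEFAULT_EVENTS)
    return {name.strip() for name in raw.split(",") if name.strip()}


# Unset: the failure events are on by default.
print(sorted(enabled_events({})))
# Operator override: unchanged by this PR, no failure events unless listed.
print(sorted(enabled_events({"WINDROSE_WEBHOOK_EVENTS": "server.online, player.join"})))
```

Under this reading, operators who want the failure events alongside a custom list would need to add them to their own value.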

Mechanism for server.crashed

A module-level float _LAST_OPERATOR_STOP_AT is stamped at the top of _api_server_stop and _api_server_restart. EventDetector reads it on the offline transition; if time.time() - _LAST_OPERATOR_STOP_AT < 30s, fire server.offline, else server.crashed. Lock-free on purpose — the worst race is one missed crash classification across two consecutive 15 s polls, which is fine.
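
The classification described above can be sketched roughly as follows. `_LAST_OPERATOR_STOP_AT` and `_OPERATOR_STOP_GRACE_SECONDS` are the names used in this PR; `mark_operator_stop` and `classify_offline_edge` are hypothetical stand-ins for the stamping done in the stop/restart handlers and the check done in EventDetector, and the `now` parameter exists only to make the sketch testable.

```python
import time

_OPERATOR_STOP_GRACE_SECONDS = 30
# Module-level float, read and written without a lock (as described in the PR).
_LAST_OPERATOR_STOP_AT = 0.0


def mark_operator_stop(now=None):
    """Stamp the time of an operator-initiated stop or restart (hypothetical name)."""
    global _LAST_OPERATOR_STOP_AT
    _LAST_OPERATOR_STOP_AT = time.time() if now is None else now


def classify_offline_edge(now=None):
    """Pick the event to fire on an online->offline transition (hypothetical name)."""
    now = time.time() if now is None else now
    if now - _LAST_OPERATOR_STOP_AT < _OPERATOR_STOP_GRACE_SECONDS:
        return "server.offline"  # operator asked for this recently
    return "server.crashed"  # nobody asked; treat as unexpected exit


# Operator hits stop at t=1000; the poller sees the server go offline at t=1010.
mark_operator_stop(now=1000.0)
print(classify_offline_edge(now=1010.0))
# A later offline edge at t=1031 falls outside the 30 s grace window.
print(classify_offline_edge(now=1031.0))
```

Because the stamp starts at 0.0, any offline edge seen before an operator stop has ever been issued is classified as a crash.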

Diff size

server.py +59 / −12, README.md +14 / −9. ~75 lines.

Test plan

  • python3 -c "import ast; ast.parse(open('server.py').read())" — syntax clean.
  • python3 tests/test_auto_backup.py and python3 tests/test_server_control.py — both pass.
  • Local sanity: triggered a manual backup with the backup dir read-only → got backup.failed on Discord with the expected payload.
  • Triggered /api/server/restart from the UI → saw server.online after the restart, no server.crashed.
  • kill'd the game from outside the UI → saw server.crashed, no server.offline.

Happy to add unit tests if you'd like — the existing tests/ doesn't cover the webhook path today, so I held off to keep this diff small. Easy follow-up.

…ed, server.crashed

Today the webhook stream is success-only. Backup failures (auto, manual,
imported), restore failures, and config-apply failures hit stderr but
never reach the operator's Discord/generic webhook. server.offline also
fires for operator-initiated stops, so a real crash is indistinguishable
from a `docker compose stop` you just ran yourself.

This adds four failure events plus a clean offline-vs-crash split:

  backup.failed           — auto / manual POST /api/backups / POST /api/backups/upload
                            payload: source ("auto"|"manual"|"imported"), reason, ...
  backup.restore.failed   — POST /api/backups/{id}/restore + game-backup restore
                            payload: backupId, reason
  config.apply.failed     — POST /api/config/apply (mod stop / mod apply paths)
                            payload: stage, reason
  server.crashed          — distinct from server.offline. EventDetector tracks
                            the timestamp of the most recent /api/server/{stop,
                            restart} call; on the online->offline edge, it fires
                            server.offline if the last operator stop was within
                            _OPERATOR_STOP_GRACE_SECONDS (30 s), else
                            server.crashed.

Per maintainer guidance in shipstuff#5, the failure events join the
WINDROSE_WEBHOOK_EVENTS default - better to be a bit verbose than to miss
a failed backup. Existing operators who set their own
WINDROSE_WEBHOOK_EVENTS see no change.

Discord embed colors and payload builder updated for the new events.
README event table expanded with the new rows + the offline/crashed
distinction.

Closes shipstuff#5
@seslly seslly self-requested a review May 4, 2026 15:52
Contributor

@seslly seslly left a comment

thanks for the pr, everything looks good, i'll merge and cut a new release

@seslly seslly merged commit 4d05718 into shipstuff:main May 4, 2026
11 checks passed
Development

Successfully merging this pull request may close these issues.

webhook events: add backup.failed / server.crashed so notifications cover failures, not just successes

2 participants