webhooks: add backup.failed, backup.restore.failed, config.apply.failed, server.crashed #14
Merged
Today the webhook stream is success-only. Backup failures (auto, manual,
imported), restore failures, and config-apply failures hit stderr but
never reach the operator's Discord/generic webhook. server.offline also
fires for operator-initiated stops, so a real crash is indistinguishable
from a `docker compose stop` you just ran yourself.
This adds four failure events plus a clean offline-vs-crash split:
  backup.failed - auto / manual POST /api/backups / POST /api/backups/upload
      payload: source ("auto"|"manual"|"imported"), reason, ...
  backup.restore.failed - POST /api/backups/{id}/restore + game-backup restore
      payload: backupId, reason
  config.apply.failed - POST /api/config/apply (mod stop / mod apply paths)
      payload: stage, reason
  server.crashed - distinct from server.offline. EventDetector tracks the
      timestamp of the most recent /api/server/{stop,restart} call; on the
      online->offline edge, it fires server.offline if the last operator
      stop was within _OPERATOR_STOP_GRACE_SECONDS (30 s), else
      server.crashed.
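
The offline-vs-crash decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from server.py: `note_operator_stop` and `classify_offline_edge` are hypothetical helper names, and the `now` parameter exists only to make the sketch testable without sleeping.

```python
import time

# Grace window from the commit message; the name mirrors the PR's constant.
_OPERATOR_STOP_GRACE_SECONDS = 30.0

# Wall-clock time of the most recent operator stop/restart (0.0 = never).
_LAST_OPERATOR_STOP_AT = 0.0


def note_operator_stop(now=None):
    """Hypothetical helper: record that a stop/restart API call just arrived."""
    global _LAST_OPERATOR_STOP_AT
    _LAST_OPERATOR_STOP_AT = time.time() if now is None else now


def classify_offline_edge(now=None):
    """Hypothetical helper: choose the event for an online->offline edge."""
    now = time.time() if now is None else now
    # A recent operator stop explains the edge; anything else is a crash.
    if now - _LAST_OPERATOR_STOP_AT < _OPERATOR_STOP_GRACE_SECONDS:
        return "server.offline"
    return "server.crashed"
```

With no operator stop recorded, any offline edge is classified as server.crashed, which matches the intent that only a recent stop or restart call suppresses the crash event.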
Per maintainer guidance in shipstuff#5, the failure events join the
WINDROSE_WEBHOOK_EVENTS default - better to be a bit verbose than to miss
a failed backup. Existing operators who set their own
WINDROSE_WEBHOOK_EVENTS see no change.
Discord embed colors and payload builder updated for the new events.
README event table expanded with the new rows + the offline/crashed
distinction.
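
For readers unfamiliar with Discord webhooks, the sketch below shows one plausible shape for a payload builder covering the new events. The `embeds`, `title`, `color`, and `fields` keys are part of Discord's webhook JSON format; the color values, the fallback color, and the `build_discord_body` name are assumptions and do not reflect the palette this PR actually uses.

```python
import json

# Hypothetical colors (24-bit RGB integers); the PR's real palette is not shown.
_ASSUMED_COLORS = {
    "backup.failed": 0xE74C3C,
    "backup.restore.failed": 0xE74C3C,
    "config.apply.failed": 0xE67E22,
    "server.crashed": 0x992D22,
}
_ASSUMED_FALLBACK_COLOR = 0x95A5A6


def build_discord_body(event, payload):
    """Hypothetical helper: wrap an event and its payload in a single embed."""
    fields = [
        {"name": str(key), "value": str(value), "inline": True}
        for key, value in payload.items()
    ]
    embed = {
        "title": event,
        "color": _ASSUMED_COLORS.get(event, _ASSUMED_FALLBACK_COLOR),
        "fields": fields,
    }
    return {"embeds": [embed]}


if __name__ == "__main__":
    body = build_discord_body("backup.failed", {"source": "auto", "reason": "disk full"})
    print(json.dumps(body, indent=2))
```

The returned dictionary would be sent as the JSON body of a POST to the webhook URL; the transport is omitted here.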
Closes shipstuff#5
seslly approved these changes on May 4, 2026
seslly (Contributor) left a comment:
thanks for the pr, everything looks good, i'll merge and cut a new release
Closes #5.

What this adds

Four failure events on the webhook stream, plus a clean split between operator stops and unexpected exits:

| Event | Fired from | Payload |
|---|---|---|
| `backup.failed` | `trigger_auto_backup`, `POST /api/backups`, `POST /api/backups/upload` | `source` (`auto`/`manual`/`imported`), `reason`, plus contextual fields (`pinned`, `filename`, `trigger`) |
| `backup.restore.failed` | `POST /api/backups/{id}/restore`, `POST /api/game-backups/{ts}/restore` | `backupId`, `reason` |
| `config.apply.failed` | `POST /api/config/apply` (mod-stop or mod-apply path) | `stage` (`stop-for-mods`/`apply-mods`), `reason` |
| `server.crashed` | `EventDetector` on the online→offline edge when no operator stop was issued recently | `server.offline` |

`server.offline` keeps firing, but only for operator-initiated stops/restarts within `_OPERATOR_STOP_GRACE_SECONDS` (30 s) of the API call. Anything else is `server.crashed`.

Defaults

Per the discussion on #5 (would rather be verbose than miss a failing backup), the new failure events ship in the default `WINDROSE_WEBHOOK_EVENTS`. Operators who already set their own value see no change.

Mechanism for `server.crashed`

A module-level float `_LAST_OPERATOR_STOP_AT` is stamped at the top of `_api_server_stop` and `_api_server_restart`. `EventDetector` reads it on the offline transition; if `time.time() - _LAST_OPERATOR_STOP_AT < 30s`, fire `server.offline`, else `server.crashed`. Lock-free on purpose: the worst race is one missed crash classification across two consecutive 15 s polls, which is fine.

Diff size

`server.py` +59 / −12, `README.md` +14 / −9. ~75 lines.

Test plan

- `python3 -c "import ast; ast.parse(open('server.py').read())"`: syntax clean.
- `python3 tests/test_auto_backup.py` and `python3 tests/test_server_control.py`: both pass.
- `backup.failed` on Discord with the expected payload.
- `/api/server/restart` from the UI → saw `server.online` after the restart, no `server.crashed`.
- `kill`'d the game from outside the UI → saw `server.crashed`, no `server.offline`.

Happy to add unit tests if you'd like. The existing `tests/` doesn't cover the webhook path today, so I held off to keep this diff small. Easy follow-up.