Operations
Daily operating checks for the review service: health, queue, logs, metrics, dashboards, and context services.
Health endpoints
- /health
- Liveness. Use for simple process checks.
- /ready
- Readiness. Use for orchestration because it waits for DB and migrations.
- /metrics
- Prometheus metrics for queues, jobs, HTTP requests, uptime, and AI usage.
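For scripted checks, the readiness endpoint can be polled until it answers. The following is a minimal sketch, assuming the service listens on localhost:8787 as in the commands on this page. The helper name wait_ready and the 60-second default are illustrative and not part of the project.

```shell
# Poll a readiness URL until it answers with a non-error status or the timeout expires.
# wait_ready is a hypothetical helper; the URL and timeout defaults are assumptions.
wait_ready() {
  url=${1:-http://localhost:8787/ready}
  timeout_s=${2:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx; -sS hides progress but keeps errors.
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    sleep 1
    # The counter only tracks sleep time, so the real wait is approximate.
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

Usage: `wait_ready http://localhost:8787/ready 120 && echo ready`. The timeout counts sleep intervals only, so slow responses extend the real wait.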
Useful commands
docker compose ps
docker compose logs -f gittensory
curl http://localhost:8787/ready
curl http://localhost:8787/metrics
Important log events
selfhost_listening
selfhost_migrations_applied
selfhost_ai_provider
selfhost_ai_review_plan
selfhost_embed_provider
selfhost_vectorize
selfhost_job_dead
selfhost_cron_error
review_context_fetch_failed
selfhost_webhook_enqueue_failed
selfhost_webhook_enqueue_binding_missing
Observability profile
The observability profile starts Prometheus, Alertmanager, Loki, Promtail, and Grafana with dashboards for infra, review activity, and AI usage.
Postgres installs also expose database internals through the bundled Postgres exporter: connection pressure, lock waits, long transactions, deadlocks, database/table growth, dead tuples, autovacuum activity, and backup freshness. Backup freshness appears when the backup profile is active.
When OpenTelemetry and Sentry are enabled, job audit logs and Sentry events include trace_id/span_id fields so an operator can jump from a failed job or issue to the matching trace in Grafana or Tempo.
docker compose --profile postgres --profile observability up -d
docker compose --profile postgres --profile observability --profile backup up -d
Alerting: required for a 24/7 deployment
Alertmanager ships with a valid but silent default: every alert routes to a name-only receiver that discards it, so docker compose --profile observability up -d always starts clean even before you've configured anywhere to send notifications. This is intentional — the shipped config can't bake in a Slack/Discord/email destination that works for everyone — but it means nothing pages anyone until you edit alertmanager/alertmanager.yml yourself. Treat this as a required step, not an optional one, for any deployment you expect to run unattended.
The fastest verified path: create a Discord channel webhook (channel settings → Integrations → Webhooks → New Webhook), then uncomment the discord receiver block in alertmanager/alertmanager.yml and point the root route at it. Slack, email, and a generic webhook receiver (for PagerDuty or a custom handler) are also ready to uncomment in the same file.
Until you do, alerts are still visible without any extra setup: open Grafana and check the Alerts row on the main dashboard, which lists every currently firing alert directly from Prometheus, independent of Alertmanager routing. Use this as your fallback check if you haven't wired up push notifications yet; it surfaces the same condition that the "Dead jobs stay at zero" routine check below watches for.
Dead-lettered jobs also get one automatic revival attempt every 30 minutes (QUEUE_DEAD_LETTER_REVIVE_INTERVAL_MS), as long as the job hasn't already been revived more than a small, bounded number of extra times (QUEUE_DEAD_LETTER_AUTO_RETRY_MAX_EXTRA_ATTEMPTS, default 3) — so a job that died from a bug that's since been fixed and redeployed recovers on its own within the next cycle, without needing direct database access. A job that keeps failing the same way eventually exhausts this budget and stays dead, which is exactly what the alert above is watching for.
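Once a real receiver is configured in alertmanager/alertmanager.yml, one way to confirm that notifications actually arrive is to push a synthetic alert through Alertmanager's v2 API. This is a sketch under assumptions: the helper names and the ALERTMANAGER_URL variable are hypothetical, and it assumes Alertmanager's standard port 9093 is reachable from where you run it, which this page does not state.

```shell
# Build a minimal Alertmanager v2 payload for a synthetic test alert.
# test_alert_payload is a hypothetical helper; the label values are placeholders.
test_alert_payload() {
  name=${1:-SelfhostTestAlert}
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"manual routing test"}}]\n' "$name"
}

# Post the payload to Alertmanager. ALERTMANAGER_URL is an assumed variable;
# the default relies on port 9093 being reachable, adjust it if it is not.
send_test_alert() {
  am_url=${ALERTMANAGER_URL:-http://localhost:9093}
  test_alert_payload "$1" | curl -fsS -X POST -H 'Content-Type: application/json' \
    --data-binary @- "$am_url/api/v2/alerts"
}
```

Usage: `send_test_alert RoutingCheck`, then watch the configured channel. If nothing arrives, the alert is probably still routed to the silent default receiver.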
Two different Discord/Slack integrations
Don't confuse these — they're unrelated features that happen to share the same two chat platforms:
- Alertmanager → Discord/Slack (infra alerts)
- Covered above. System/stack health: dead jobs, queue backlog, Postgres pressure, and similar operational alerts, routed by alertmanager/alertmanager.yml.
- DISCORD_WEBHOOK_URL / SLACK_WEBHOOK_URL (per-PR outcomes)
- A .env-configured webhook the review engine itself posts to whenever it publishes a review outcome (merged, closed, manual hold) on any repo you review — a product notification, not an infra alert.
DISCORD_WEBHOOK_URL is a global fallback Discord channel for any repo without its own webhook. DISCORD_REPO_WEBHOOKS is a per-repo override — a JSON map of owner/repo to a webhook URL — for routing different repos' notifications to different channels. Both are unset (no Discord notifications) by default.
DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/...
DISCORD_REPO_WEBHOOKS={"owner/repoA":"https://discord.com/api/webhooks/...","owner/repoB":"https://..."}
SLACK_WEBHOOK_URL posts the same per-action events (merged/closed/manual) as a Block Kit section to one Slack channel. Unlike Discord, there is no per-repo map today: every repo shares this one webhook. Unset means no Slack notifications.
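Before relying on either webhook for review outcomes, it can help to confirm the URL itself works. The sketch below posts a plain test message to whichever of the two global webhooks is set. The helper name webhook_smoke_test is hypothetical; the message is sent in each platform's basic webhook format, not the format the review engine uses.

```shell
# Send a plain test message to whichever per-PR outcome webhooks are configured.
# webhook_smoke_test is a hypothetical helper, not a project script.
# The message must not contain double quotes, since it is embedded in JSON as-is.
webhook_smoke_test() {
  msg=${1:-"webhook test"}
  sent=0
  if [ -n "${DISCORD_WEBHOOK_URL:-}" ]; then
    # Discord webhooks accept a JSON body with a "content" field.
    curl -fsS -H 'Content-Type: application/json' \
      -d "{\"content\":\"$msg\"}" "$DISCORD_WEBHOOK_URL" >/dev/null && sent=$((sent + 1))
  fi
  if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    # Slack incoming webhooks accept a JSON body with a "text" field.
    curl -fsS -H 'Content-Type: application/json' \
      -d "{\"text\":\"$msg\"}" "$SLACK_WEBHOOK_URL" >/dev/null && sent=$((sent + 1))
  fi
  # Print how many webhooks accepted the message.
  echo "$sent"
}
```

With both variables exported from .env, `webhook_smoke_test "selfhost check"` prints the number of webhooks that accepted the message.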
Resource profiles
Measured rows below come from a real production instance running the full profile set (qdrant + redis + observability + backup + postgres + ollama) at steady state: docker stats and docker system df snapshots, not a lab benchmark. Estimated rows are reasoned from that same baseline plus each service's declared deploy.resources.limits and image size in docker-compose.yml; they have not been measured directly and could be off, especially for CPU under real load. Treat estimates as a starting point for capacity planning, not a guarantee.
| Profile | CPU (steady state) | Memory (steady state) | Basis |
|---|---|---|---|
| Minimal: app + redis only (no profile flags) | ~3% of one core | ~400–600MiB | Estimated: app + redis measured in isolation from the full-profile snapshot (app 2.6% CPU / 365MiB; redis is idle-light and its 512MiB limit is never approached in the full-profile run either). |
| + --profile postgres | +14% of one core (highest single-service CPU consumer) | +~200MiB | Measured: 14.24% CPU / 196MiB of its 2GiB limit. Comfortable headroom on memory, but the largest CPU line item in the whole stack. |
| + --profile qdrant | Low single-digit % | Well under its 2GiB limit | Measured (part of the full-profile snapshot's "everything else" low-CPU, under-limit group). Grows with RAG corpus size; expect this to climb on installs with many indexed repos. |
| + --profile observability | Low single-digit % per service, except Grafana/Tempo below | Grafana ~305MiB (60% of 512MiB); Tempo ~209MiB (20% of 1GiB); Prometheus/Loki/Alertmanager/Promtail/otel-collector each well under their limits | Measured. Grafana is the closest any service comes to its ceiling in production. Worth watching if you add many custom dashboards or panels, but not currently a problem (40% headroom remains). |
| + --profile ollama | Near-zero idle; spikes hard during inference | Model-dependent, up to its 8GiB limit | Estimated. Not part of the live production profile mix (that instance uses AI_PROVIDER=codex, not Ollama). The 8GiB default limit is sized for a single loaded 7–8B quantized model per the compose comment, not measured against a running model. Idle Ollama with no model pulled is cheap; a loaded model can legitimately approach the limit, which is why it has the largest default ceiling in the file. |
| + --profile backup | Near-zero except during runs | Low, bursts during dump/restore | Measured as part of the full-profile snapshot (no dedicated resource limit is set for backup/backup-exporter; both are short-lived or idle-polling processes, not sustained consumers). |
| + --profile runners | Unbounded by default; can starve the app under CI load | Unbounded by default | Estimated, and explicitly a known risk, not a guess about typical usage: the runner service ships with no CPU/memory limit at all. Production experience already documented in docker-compose.override.yml.example found 3 uncapped runner containers starving the app for CPU on an 8-vCPU box under real CI load. See that file for the cpu_shares/cpus mitigation before co-locating runners with the review stack. |
| Full profile set (qdrant + redis + observability + backup + postgres + ollama, no active inference, no runners) | Postgres (~14%) dominates; everything else low single-digit % | No service near its limit except Grafana (~60%) | Measured, in full, on a real production instance. |
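To compare your own host against the measured rows, you can take the same kind of snapshot and turn a memory reading into a percentage of its limit. This is a sketch; resource_snapshot and pct_of_limit are hypothetical helper names, and both arguments to pct_of_limit are assumed to be in the same unit.

```shell
# One-shot snapshot of per-container CPU and memory plus Docker disk usage.
# Run on the Docker host. resource_snapshot is a hypothetical helper.
resource_snapshot() {
  docker stats --no-stream --format '{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}'
  docker system df
}

# Percentage of a limit in use, rounded to the nearest whole percent.
# pct_of_limit is a hypothetical helper; pass used and limit in the same unit.
pct_of_limit() {
  awk -v used="$1" -v limit="$2" 'BEGIN { printf "%.0f\n", used / limit * 100 }'
}
```

For example, `pct_of_limit 305 512` prints 60, matching the Grafana row, and `pct_of_limit 209 1024` prints 20, matching Tempo.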
Disk
Measured on the same production instance: 48GB of 151GB used on the host root volume (32%) at steady state. docker system df breakdown:
- Images
- 22.59GB total, 19.24GB (85%) reclaimable via prune.
- Volumes
- 20.57GB total, 5.4GB (26%) reclaimable — this is real application state (databases, vector index, backups), so most of it is never pruned.
- Build cache
- 6.39GB total, 3.55GB (56%) reclaimable.
The reclaimable image and build-cache space here is expected steady state, not a leak — this instance runs scripts/deploy-selfhost-prebuilt.sh, which rebuilds the image from the current git checkout on every deploy and intentionally keeps prior layers around in the build cache for faster rebuilds. The gittensory-docker-safe-prune systemd timer (below) already runs daily against this exact instance and reclaims it on a schedule, so this is not a number to chase down manually.
When a compose default might need to change
Every deploy.resources.limits.memory in docker-compose.yml is operator-overridable via .env (see the *_MEM_LIMIT variables in .env.example). Against the measured full-profile data above, none of the current defaults look miscalibrated enough to change: nothing sits consistently near its limit in a way that risks an OOM kill under normal load (Grafana's ~60% is the closest and still has real headroom), and nothing is so oversized relative to plausible usage that it should be lowered — including Ollama's comparatively large 8GiB ceiling, which is sized for holding one quantized model in memory, not idle overhead. The one real gap is --profile runners, which ships with no limit at all; that is a known, documented tradeoff (see the table above and docker-compose.override.yml.example) rather than an oversight, since the right ceiling depends entirely on the host's core count and how many runner replicas you run.
Capacity planning: how much disk for N repos at M PRs/month
The 151GB host above is one measured point, not a formula. It says nothing about how disk use grows as you register more repos or review more pull requests — for that you have to reason about which tables and volumes actually grow with activity, versus which are fixed overhead. Treat every number below as an order-of-magnitude estimate to plan around, not a guarantee.
- review_audit (fixed overhead per PR, unbounded)
- Roughly 2 rows per PR — one finalized gate decision plus one realized merge/close outcome — each a few small text columns (well under 1KB/row). It has no retention policy in src/db/retention.ts, so it grows forever. Don't trust a blanket MB-per-thousand-PRs estimate here; measure your own instance's actual growth with pg_total_relation_size('review_audit') (or the equivalent SQLite page count) after a known number of PRs, then extrapolate from that.
- webhook_events (fixed overhead per delivery, unbounded)
- One row per inbound GitHub webhook delivery — every push, comment, check-run update, and review event, not just PR opens — so it accrues considerably faster than review_audit for the same PR volume (commonly 5-15x, depending on how chatty a repo's CI and review activity are). Also absent from RETENTION_POLICY, so it also grows without bound. Still small per row; the growth to watch is row count over months, not any single row's size.
- audit_events (bounded — 90-day retention)
- One row per privileged/security-relevant action (recordAuditEvent in src/db/repositories.ts), pruned automatically: RETENTION_POLICY in src/db/retention.ts keeps 90 days and the prune-retention job runs daily at 03:00 UTC (src/index.ts), so this table's steady-state size is capped regardless of how long the instance has been running — it will not be a long-term capacity driver the way the two tables above are.
- Postgres/SQLite backup dumps (scales with live DB size x retained copies)
- scripts/backup.sh keeps the newest BACKUP_RETAIN copies per target (default 7 — see the backup and scaling doc's retention section), so total backup-volume usage is roughly (live database size) x (retained count), independent of repo count except through the database-size term. A growing, unpruned review_audit/webhook_events pair feeds directly into this multiplier: whatever they add to the live database, the backup volume carries N times over.
Putting it together: for a small install (a handful of repos, tens of PRs/month), all of this is noise against the ~20GB of fixed Docker/image/volume overhead measured above — you will not notice review_audit or webhook_events growth for a long time. The estimate gets real at higher volume: an install running hundreds of PRs/month across dozens of repos, left unattended for a year or more, is a plausible case where the unbounded tables above (and the backups that multiply them) become the dominant long-term disk driver rather than Docker images and build cache. There is no first-party tool yet to prune review_audit or webhook_events — if you operate at that scale, monitor their row counts directly (SELECT count(*) FROM review_audit, SELECT count(*) FROM webhook_events) rather than assuming steady state.
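The reasoning above reduces to a few multiplications, sketched here so you can plug in your own volumes. The helper names are hypothetical, and the multipliers are the rough figures from this section (about 2 review_audit rows per PR, webhook_events at 5 to 15 times that, backups at live size times BACKUP_RETAIN), so the output is an order-of-magnitude estimate only.

```shell
# Rough yearly row-count estimates for the two unbounded tables.
# All helper names are hypothetical; the multipliers are the approximate
# figures quoted in this section, not measured constants.
review_audit_rows_per_year() {
  prs_per_month=$1
  echo $((prs_per_month * 12 * 2))   # ~2 rows per PR
}

webhook_event_rows_per_year() {
  prs_per_month=$1
  multiplier=$2                      # pick a value in the 5-15 range for your repos
  echo $(( $(review_audit_rows_per_year "$prs_per_month") * multiplier ))
}

# Backup volume estimate: live database size times retained copies
# (BACKUP_RETAIN, default 7 per this section).
backup_volume_mb() {
  live_db_mb=$1
  retain=${2:-7}
  echo $((live_db_mb * retain))
}
```

At 300 PRs/month this gives 7200 review_audit rows per year and between 36000 and 108000 webhook_events rows per year; a 500MB live database with the default retention implies roughly 3500MB of backups. Row counts still need your own measured bytes per row to become a disk figure.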
Docker resource hygiene
Every service in docker-compose.yml caps its own container logs (10MB × 3 rotated files) out of the box, so log growth alone won't fill your disk. Unused Docker images and build cache are a separate, larger disk-growth vector on a host that rebuilds or pulls images repeatedly over months — Docker does not reclaim either automatically.
Install the provided host-level timer to reclaim both on a schedule (anything unused for less than 7 days is left alone, so a recent deploy is never at risk):
sudo cp systemd/gittensory-docker-prune.service.example /etc/systemd/system/gittensory-docker-prune.service
sudo cp systemd/gittensory-docker-prune.timer.example /etc/systemd/system/gittensory-docker-prune.timer
sudo $EDITOR /etc/systemd/system/gittensory-docker-prune.service # set WorkingDirectory / ExecStart to your path
sudo systemctl daemon-reload
sudo systemctl enable --now gittensory-docker-prune.timer
Run it manually at any time, with docker system df before and after to see what it reclaimed: sh scripts/selfhost-docker-prune.sh.
This should always prune containers, images, and build cache — never volumes. Pruning a volume deletes real application state (the database, backups, vector index, or a runner's registration and job data), not disposable build output, so it is never part of routine cleanup unless you intentionally want to delete that state.
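To make the rule concrete, here is a sketch of a prune that follows it. This is not the shipped scripts/selfhost-docker-prune.sh, whose contents are not shown on this page; safe_prune is a hypothetical name. Docker's until filter keys on creation time, so it only approximates the "unused for less than 7 days" window described above.

```shell
# Sketch of a prune that never touches volumes. NOT the shipped
# scripts/selfhost-docker-prune.sh; it only illustrates the rule above.
# 168h = 7 days. Docker's "until" filter is based on creation time.
safe_prune() {
  keep=${1:-168h}
  docker container prune -f --filter "until=$keep"
  docker image prune -a -f --filter "until=$keep"
  docker builder prune -a -f --filter "until=$keep"
  # Deliberately absent: any prune command that targets named data stores.
}
```

If you adapt this, keep the same property: no command in the routine path should be able to remove application state.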
Self-hosted runner temp storage
If you run --profile runners, keep every runner job's scratch/temp writes on the mounted runner-work volume, never the container's plain /tmp. A container's own /tmp lives in Docker's overlay/containerd snapshot storage — a CI job that writes high-volume temp data there (language toolchain caches, build artifacts, ad hoc mktemp calls) grows the host's Docker root storage directly, not the volume, so it is invisible to volume-scoped cleanup and can fill the disk out from under the whole stack. The shipped runner service points TMPDIR, TMP, and TEMP at /tmp/runner/tmp (a subdirectory of the mounted runner-work volume) and keeps RUNNER_WORKDIR at /tmp/runner on the same volume. A one-shot runner-tmp-init service creates that directory on the volume (and makes it world-writable, matching real /tmp permissions) before the runner container starts, so this works out of the box on a fresh volume with no manual steps.
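A CI job can guard against a misconfigured temp path with a small check at the start of the workflow. The sketch below is a path-prefix test only: assert_tmp_on_runner_volume is a hypothetical helper, and it does not verify that /tmp/runner is actually a mounted volume.

```shell
# Fail fast when a temp directory is not under the runner-work mount point.
# assert_tmp_on_runner_volume is a hypothetical guard; /tmp/runner is the
# mount point described above. This checks the path prefix, not the mount.
assert_tmp_on_runner_volume() {
  dir=${1:-${TMPDIR:-/tmp}}
  case "$dir" in
    /tmp/runner/*) return 0 ;;
    *) echo "temp dir $dir is outside /tmp/runner" >&2; return 1 ;;
  esac
}
```

Called with no argument it checks the current TMPDIR, so a job step can run `assert_tmp_on_runner_volume || exit 1` before any heavy temp writes.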
Adding a second or third runner service in docker-compose.override.yml for higher CI throughput? Each one needs its own runner-work-style volume, its own init step, and the same temp env — YAML anchors don't cross separate compose files, so repeat the extension block in your override file:
x-runner-tmp-env: &runner-tmp-env
  TMPDIR: /tmp/runner/tmp
  TMP: /tmp/runner/tmp
  TEMP: /tmp/runner/tmp
services:
  runner-2-tmp-init:
    image: alpine:3.20
    profiles: ["runners"]
    volumes:
      - runner-work-2:/tmp/runner
    command: ["sh", "-c", "mkdir -p /tmp/runner/tmp && chmod 1777 /tmp/runner/tmp"]
  runner-2:
    image: myoung34/github-runner:ubuntu-jammy
    profiles: ["runners"]
    depends_on:
      runner-2-tmp-init:
        condition: service_completed_successfully
    environment:
      <<: *runner-tmp-env
      RUNNER_NAME: gittensory-runner-2
      RUNNER_SCOPE: ${RUNNER_SCOPE:-repo}
      REPO_URL: ${RUNNER_REPO_URL:-}
      RUNNER_TOKEN: ${RUNNER_TOKEN:-}
      RUNNER_WORKDIR: /tmp/runner
    volumes:
      - runner-work-2:/tmp/runner
volumes:
  runner-work-2:
Sentry server name
SENTRY_SERVER_NAME sets a clean, human name for this instance in Sentry (for example gittensory-us-east). Unset defaults to the OS hostname — never the public-origin URL. Set this explicitly if you run more than one instance and want to tell their Sentry events apart at a glance instead of matching container hostnames.
Sentry tracing
Leave SENTRY_TRACES_SAMPLE_RATE unset or blank to disable trace export, or set a positive sample rate such as 0.05 to send sampled review spans to Sentry. The custom OpenTelemetry provider installs Sentry hooks for review-stage spans carrying repo, PR, operation, outcome, and hashed installation tags.
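A typo in the sample rate is easy to miss, so a pre-deploy check on the value can help. The sketch below accepts a blank value (tracing off) or a number greater than 0 and at most 1. The upper bound is Sentry's usual sample-rate range; how this application treats out-of-range values is not documented here, and valid_traces_sample_rate is a hypothetical helper.

```shell
# Accept a blank value (trace export disabled) or a number in (0, 1].
# valid_traces_sample_rate is a hypothetical helper; the (0, 1] range is an
# assumption based on Sentry's usual sample-rate convention.
valid_traces_sample_rate() {
  rate=$1
  [ -z "$rate" ] && return 0
  awk -v r="$rate" 'BEGIN { exit !(r ~ /^[0-9]*\.?[0-9]+$/ && r + 0 > 0 && r + 0 <= 1) }'
}
```

For example, `valid_traces_sample_rate "$SENTRY_TRACES_SAMPLE_RATE" || echo "check SENTRY_TRACES_SAMPLE_RATE"` flags values such as 5 or 0,05 before a restart.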
Sentry cron monitors
When SENTRY_DSN is set, the self-host runtime emits Sentry monitor check-ins for the recurring loops where silent stoppage matters most. Leaving SENTRY_DSN unset keeps monitor reporting off.
- scheduled loop
- The two-minute maintenance tick that fans out sweeps, backfills, and refresh jobs.
- Orb export
- The hourly outcome export loop used by brokered self-host deployments.
- Orb relay drain
- The pull-mode relay loop for installations that receive events outbound from Orb.
- Orb relay register
- The recurring retry loop that (re-)registers this instance with the relay broker.
- Queue dead-letter revive
- The 30-minute (by default) sweep that retries dead-lettered jobs still under the auto-retry ceiling.
A missed monitor means the process may still be alive but the recurring work is not checking in on schedule. Pair the monitor with queue depth, dead-job counts, and the structured error log for the same subsystem.
Routine checks
- Queue pending count is not growing without processing.
- Dead jobs stay at zero or are investigated promptly.
- Webhook deliveries are recent and have 2xx responses, with no enqueue failures.
- AI usage matches expected review volume and model/effort choices.
- REES and RAG failures are visible and bounded.
- Postgres connections, lock waits, slow transactions, dead tuples, and table growth are stable.
- Backups are recent and restore-tested.
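Part of this routine can be scripted by scanning recent logs for the failure-type events listed under Important log events. The sketch below counts matching lines from a log stream on stdin. Which events count as failures is an assumption based on their names, and count_failure_events is a hypothetical helper.

```shell
# Count log lines that contain any failure-type event named on this page.
# Reads the log stream from stdin. count_failure_events is a hypothetical
# helper; the event selection is inferred from the event names.
count_failure_events() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # function's exit status clean while still printing the count.
  grep -E -c 'selfhost_job_dead|selfhost_cron_error|review_context_fetch_failed|selfhost_webhook_enqueue_failed|selfhost_webhook_enqueue_binding_missing' || true
}
```

Usage: `docker compose logs --since 24h gittensory | count_failure_events`. A non-zero count is a prompt to read the matching lines, not a verdict on its own.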
Updating and rolling back
Both update paths below only ever restart the gittensory app service (--no-deps) — they never touch other compose-profile services or their state (Postgres, Redis, Qdrant, and Grafana's own grafana-data volume), and they never touch .env keys other than the one they persist for next time. That means .env, the gittensory-config/ mount, every data volume — including the app's own /data volume where Codex/Claude Code auth material lives — and any docker-compose.override.yml are preserved automatically across an update. You don't need to back those up or re-supply them just to run either script, and you only need to recreate a profile service yourself if you're deliberately upgrading that service (its own image tag in docker-compose.yml, or a Postgres/Redis/Qdrant major-version bump) rather than the app.
Path 1: pull a published image
scripts/deploy-selfhost-image.sh pulls a tag or digest, restarts only the gittensory service, waits for it to report healthy via docker inspect's health status (configurable timeout, default 180s), and then persists the resolved image reference back to GITTENSORY_IMAGE in .env so the next plain invocation reuses it.
# Re-pull whatever GITTENSORY_IMAGE already resolves to (safe no-op restart if the tag is unchanged
# and nothing new was pushed under it)
./scripts/deploy-selfhost-image.sh
# Pin an exact release tag or content digest
./scripts/deploy-selfhost-image.sh ghcr.io/jsonbored/gittensory-selfhost:orb-v0.1.0
GITTENSORY_IMAGE=ghcr.io/jsonbored/gittensory-selfhost@sha256:... ./scripts/deploy-selfhost-image.sh
The pull always runs with --policy always, so re-running the script against an unchanged tag is safe: if the registry has nothing new, it just restarts the same image and the health-check wait passes immediately.
Path 2: build from the current git checkout
scripts/deploy-selfhost-prebuilt.sh is for a source-based deploy (this is how GITTENSORY_VERSION ends up as a short git SHA instead of an image tag). It builds the bundle inside a Dockerized Node container — the host itself never needs Node or npm installed — then restarts only the gittensory service the same way as the image path.
git pull
./scripts/deploy-selfhost-prebuilt.sh
SENTRY_RELEASE defaults to gittensory-selfhost@<short git SHA of the current HEAD> unless you override it, so each deploy from a new commit gets a distinct release id automatically. When SENTRY_AUTH_TOKEN, SENTRY_ORG, and SENTRY_PROJECT are all configured, the script also injects and uploads Sentry source maps for that release before restarting the service (set SELFHOST_SKIP_SENTRY_UPLOAD=1 to skip this even when those three are present).
Rollback: no dedicated command today
There is no rollback script. Rolling back means re-running one of the two scripts above pointed at an older target:
- Image-based: re-run deploy-selfhost-image.sh with the prior tag or digest (docker inspect on the running container, or your own deploy log, has the digest you were on before the update).
- Source-based: git checkout the prior commit, then re-run deploy-selfhost-prebuilt.sh.
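Because the image-based rollback depends on knowing what you were running before, it helps to record that reference just before each update. This is a sketch of such a deploy log: record_current_image, rollback_target, and the deploy-history.log file name are all hypothetical and not part of the project.

```shell
# Append the image reference of the running app container to a local deploy
# log. Run this right before an update so the last line is the rollback target.
# record_current_image and deploy-history.log are hypothetical names.
record_current_image() {
  log=${1:-deploy-history.log}
  container=$(docker compose ps -q gittensory)
  image=$(docker inspect --format '{{.Config.Image}}' "$container")
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$image" >> "$log"
}

# Print the most recently recorded image reference (second field of the
# last log line).
rollback_target() {
  log=${1:-deploy-history.log}
  tail -n 1 "$log" | cut -d' ' -f2
}
```

A rollback then becomes `./scripts/deploy-selfhost-image.sh "$(rollback_target)"`. A recorded tag can be re-pushed upstream, so a digest reference is the safer thing to have in the log.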
scripts/check-migrations.mjs only enforces a contiguous, non-colliding numbering, not a reverse path. If a migration has already run forward against the live database, rolling back the app code is not safe in general: older code can break against a newer schema (a dropped/renamed column, a NOT NULL column it never writes, a changed constraint), even though the migration itself succeeded. Before rolling back across a migration boundary, check whether everything the newer migration(s) did is purely additive (new nullable column, new table, new index) and, specifically, whether the code you're rolling back to actually still runs against that schema. Additive is usually fine; anything the old code can't tolerate is not. Take a fresh backup first regardless (see Backup and scaling), and if in doubt, restore that backup to a scratch database and boot the older code against it before doing the same on the live instance.
Before and after any update
Before updating:
- Source-based deploys: git status is clean (no uncommitted local changes the build would silently pick up or drop).
- A current, verified backup exists if the update includes schema changes (see Backup and scaling).
After updating, work through the same checks as any other health pass — see Health endpoints and Useful commands above: confirm /ready returns 200, docker compose ps shows the service healthy, and tail recent logs for startup errors or an unexpected absence of selfhost_listening / selfhost_migrations_applied.
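The log part of that check can be automated. The sketch below reads a log stream from stdin and reports whether both startup events are present; startup_events_present is a hypothetical helper, and it assumes both events are logged on every normal start, as the paragraph above implies.

```shell
# Check a log stream (stdin) for the two startup events named above.
# startup_events_present is a hypothetical helper. It returns non-zero and
# names the first missing event on stderr.
startup_events_present() {
  logs=$(cat)
  for event in selfhost_listening selfhost_migrations_applied; do
    printf '%s\n' "$logs" | grep -q "$event" || { echo "missing: $event" >&2; return 1; }
  done
}
```

Usage: `docker compose logs --since 10m gittensory | startup_events_present && echo "startup looks normal"`.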
Neither /health nor /ready reports a version, so confirm the deployed release directly — GITTENSORY_IMAGE or SENTRY_RELEASE in .env records what the deploy script just resolved, and docker inspect confirms what the running container actually has:
grep -E '^(GITTENSORY_IMAGE|GITTENSORY_VERSION|SENTRY_RELEASE)=' .env
docker inspect --format '{{.Config.Image}}' "$(docker compose ps -q gittensory)"
If an operating check fails, go to Self-host troubleshooting.
Uninstalling and decommissioning
Tearing an instance down cleanly touches four independent things: the GitHub App installation, the data volumes, brokered-mode enrollment, and control-panel access. None of this is scripted today — do each step deliberately, in this order, and decide what to keep before you delete anything.
1. Revoke the GitHub App installation
Uninstalling stops GitHub from sending any further webhook events and immediately revokes the App's installation tokens — nothing on the self-host side needs to be told; there is no installation deleted webhook handler to run first. From the repo or org: Settings → Integrations → GitHub Apps → your App → Uninstall. Do this before stopping the container so you are not left with a dangling install pointed at a dead webhook URL.
If you only want to pause reviews without losing the App's configuration (permissions, webhook URL, private key), suspend the installation instead of uninstalling it — GitHub stops delivering events to a suspended install but keeps everything else intact for a later resume.
2. Decide what happens to the data volumes
Stopping the container does not delete anything — docker compose stop or docker compose down (without -v) leaves every named volume (gittensory-data, gittensory-pg, qdrant-data, gittensory-backups, grafana-data, and the rest declared in docker-compose.yml) on disk, along with the ./gittensory-config host directory (a bind mount, not a named volume, so it is never affected by -v either way). Pick one:
- Keep (pause, don't decommission)
- docker compose stop. Volumes and .env stay as-is; restarting later resumes with the same data. Use this if you might come back.
- Export, then delete
- Run the backup profile one last time (docker compose --profile backup up -d, then confirm with verify-backup.sh — see Backup and scaling) and copy the resulting archive off-host before removing anything.
- Delete everything
- docker compose down -v removes every named volume permanently — the review database, vector index, Grafana dashboards state, and any local backup archives in gittensory-backups go with it. This does not touch ./gittensory-config (delete that host directory yourself if it should go too).
docker compose down -v permanently destroys review history, settings, and the vector index with no recovery path; the volumes are the only copy. See Backup and scaling before running it on an instance you care about.
3. Deregister from the Orb broker (brokered mode only)
If this instance runs in brokered mode (ORB_ENROLLMENT_SECRET is set — see GitHub App and Orb), be aware there is no self-service revocation endpoint today — the "Minimum broker safeguards" checklist on that page lists a revocation path as a prerequisite for a public brokered rollout that has not shipped yet. An enrollment record (orb_enrollments) lives in gittensory's own central database, not your container, and nothing in this codebase writes a revoked_at value to it outside of tests. Practical steps until that exists:
- Uninstalling the GitHub App (step 1) stops new webhook traffic and installation-token issuance from reaching your instance in practice, even though the enrollment row itself stays marked enrolled centrally.
- Stop the container and let ORB_ENROLLMENT_SECRET go with it. With nothing polling or listening, the secret is inert even if it still resolves to a valid enrollment.
- If the secret may have leaked or you want it invalidated outright rather than just orphaned, treat this the same as any other suspected credential compromise: contact the Orb operator to have the enrollment revoked centrally, since there is no in-product way to do it yourself yet.
4. Remove ADMIN_GITHUB_LOGINS access
ADMIN_GITHUB_LOGINS is read fresh from the environment on every control-panel request (isAuthorizedGitHubSessionLogin in src/auth/security.ts) — it is never cached at startup or baked into an issued session. To remove someone's operator access, delete their login from the comma/whitespace-separated list in .env and restart the gittensory service so the process picks up the new value:
$EDITOR .env # remove the login from ADMIN_GITHUB_LOGINS
docker compose up -d --no-deps gittensory
This takes effect on their very next control-panel request after the restart. No signed-in session is grandfathered in, because authorization is re-checked against the current allowlist every time, not read from the session itself. If you are decommissioning the whole instance rather than removing one operator, this step is moot once the container is stopped.
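If you manage the allowlist from a script rather than an editor, the list edit itself is simple string handling. The sketch below removes one login from a comma/whitespace-separated value and prints the result comma-separated. remove_login is a hypothetical helper, and the case-insensitive match is an assumption based on GitHub logins being case-insensitive; it does not write to .env.

```shell
# Remove one login from a comma/whitespace-separated allowlist string and
# print the remaining logins comma-separated.
# remove_login is a hypothetical helper; the case-insensitive comparison is
# an assumption, not documented application behavior.
remove_login() {
  list=$1
  target=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  out=""
  for login in $(printf '%s' "$list" | tr ',' ' '); do
    lower=$(printf '%s' "$login" | tr '[:upper:]' '[:lower:]')
    [ "$lower" = "$target" ] && continue
    out="${out:+$out,}$login"
  done
  printf '%s\n' "$out"
}
```

The printed value is what you would place after ADMIN_GITHUB_LOGINS= in .env before the restart.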