Convoy exports metrics about the state of events received and sent via Prometheus.

Enabling Metrics

Metrics are not enabled by default. To turn them on you need all of the following:
  • Enable the prometheus feature flag (CONVOY_ENABLE_FEATURE_FLAG=prometheus or --enable-feature-flag=prometheus).
  • Set the metrics backend to Prometheus: metrics.metrics_backend: "prometheus" in convoy.json, CONVOY_METRICS_BACKEND=prometheus in the environment, or --metrics-backend=prometheus on the CLI.
  • When configuring via convoy.json or environment variables, also set metrics.enabled: true / CONVOY_METRICS_ENABLED=true. Pure CLI use can skip this: passing --metrics-backend=prometheus together with the prometheus feature flag enables the merged metrics configuration loaded for server and agent.
  • Ensure your license allows Prometheus export; the /metrics handler enforces this entitlement.
Either of the two code blocks below will work.
enabling convoy metrics using flags
convoy agent --metrics-backend=prometheus --enable-feature-flag=prometheus
enabling convoy metrics using env vars
export CONVOY_METRICS_ENABLED=true
export CONVOY_METRICS_BACKEND=prometheus
export CONVOY_ENABLE_FEATURE_FLAG=prometheus
convoy agent
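The same environment variables carry over to container deployments. A minimal docker-compose sketch — the service name, image tag, and port mapping here are assumptions to adapt to your own setup:
enabling convoy metrics in docker-compose
services:
  agent:
    image: getconvoy/convoy:latest # pin to the version you actually run
    command: ["agent"]
    environment:
      CONVOY_METRICS_ENABLED: "true"
      CONVOY_METRICS_BACKEND: "prometheus"
      CONVOY_ENABLE_FEATURE_FLAG: "prometheus"
    ports:
      - "5008:5008" # data-plane HTTP, serves GET /metrics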
Scrape GET /metrics on each process you run. Typical split deployments run convoy server (control plane API, default HTTP port 5005) and convoy agent (data plane: ingest, queue consumers, and data-plane HTTP including /metrics, default agent_port 5008). For example, docker-compose.dev.yml maps the web service to port 5005 and the agent to port 5008. Both processes can register the shared Prometheus registry when Redis and Postgres are available. Export still requires a license that allows Prometheus metrics; the handler enforces this.

Example scrape configuration

Point Prometheus at each Convoy process you care about (replace host, port, and labels). Metrics path is always /metrics.
prometheus.yml fragment
scrape_configs:
  - job_name: convoy-server
    static_configs:
      - targets: ["convoy-server:5005"]
        labels:
          role: server
  - job_name: convoy-agent
    static_configs:
      - targets: ["convoy-agent:5008"]
        labels:
          role: agent
Use the HTTP ports your deployment actually binds: server.http.port for convoy server (often 5005) and server.http.agent_port / AGENT_PORT for convoy agent (often 5008, matching the dev compose layout).

Example PromQL queries

Illustrative only—adjust label selectors to match your deployment.
# Ingest rate (events/s) summed over all projects/sources
sum(rate(convoy_ingest_total[5m]))

# Ingest errors share of total (ratio)
sum(rate(convoy_ingest_error[5m])) / sum(rate(convoy_ingest_total[5m]))

# Approximate p95 end-to-end latency (seconds) — requires histogram buckets on your scrape
histogram_quantile(0.95, sum(rate(convoy_end_to_end_latency_bucket[5m])) by (le))

# Max backlog age (seconds) seen across series (Postgres-backed gauge)
max(convoy_event_queue_backlog_seconds)
Histogram series use Prometheus’ usual _bucket / _sum / _count suffixes for convoy_end_to_end_latency.
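If several dashboards reuse these queries, Prometheus recording rules keep the PromQL in one reviewable place. A sketch — the rule names follow the usual level:metric:operation convention and are our own choice, not Convoy metric names:
recording rules fragment
groups:
  - name: convoy-recording
    interval: 1m
    rules:
      - record: convoy:ingest_error_ratio:rate5m
        expr: sum(rate(convoy_ingest_error[5m])) / sum(rate(convoy_ingest_total[5m]))
      - record: convoy:end_to_end_latency_seconds:p95_5m
        expr: histogram_quantile(0.95, sum(rate(convoy_end_to_end_latency_bucket[5m])) by (le))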

Grafana and dashboards

Use Grafana (or any Prometheus-backed UI) to track whether observability keeps pace with the product and stays trustworthy.

Dashboard and query hygiene
  • Version and review: Treat dashboard JSON like code—store it in git when possible, review changes in PRs, and note which Convoy version each dashboard targets. When upgrading Convoy, re-check panels that use concrete metric or label names (see tables below and internal/pkg/metrics in the Convoy repo).
  • Metric contract: Prefer recording rules or documented PromQL snippets next to dashboards so renames or label changes surface during review, not only in production.
  • Environment parity: If staging and production differ (extra labels, fewer scrapes, or different scrape intervals), document that on the dashboard or in your internal runbook so comparisons stay honest.
Data quality checks
  • Scrape health: Alert on Prometheus up for each job that scrapes Convoy, on scrape_samples_scraped collapsing to zero for critical jobs, and on rule evaluation errors if you use recording rules.
  • Series sanity: Compare convoy_ingest_total rates to traffic you expect; sudden flatlines or orders-of-magnitude drift often indicate scrape, network, or process issues—not just low traffic.
  • Cardinality: High-cardinality labels (many unique endpoint, source, or project values) increase cost; use Grafana’s Explore or Prometheus TSDB status to spot exploding label sets after config or feature changes.
  • Postgres-backed gauges: Queue depth metrics refresh on metrics.prometheus_metrics.sample_time; stale values can reflect sampling interval, DB load, or query timeouts (CONVOY_METRICS_QUERY_TIMEOUT / query_timeout), not only empty queues.
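The scrape-health checks above translate into alerting rules along these lines. A sketch — the job names match the example scrape configuration earlier, and the for: durations are placeholders:
alerting rules fragment (scrape health)
groups:
  - name: convoy-scrape-health
    rules:
      - alert: ConvoyTargetDown
        expr: up{job=~"convoy-(server|agent)"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape {{ $labels.job }} ({{ $labels.instance }})"
      - alert: ConvoyScrapeEmpty
        expr: scrape_samples_scraped{job=~"convoy-(server|agent)"} == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Scrape of {{ $labels.job }} returned zero samples"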
Alerts and runbooks
  • Noise vs signal: Pair alerts on error rates or backlog with minimum traffic thresholds (for example, a for: duration, or a separate guard on ingest rate) to avoid flapping on idle environments.
  • Runbook links: Add a runbook URL or on-call note to each alert annotation (Grafana Alerting, Alertmanager annotations.runbook_url, etc.) describing first steps: check Convoy process health, Redis, Postgres, recent deploys, and the relevant section of this metrics doc.
  • License-gated metrics: If /metrics returns minimal or empty output after a license change, verify entitlements before chasing infrastructure issues.
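Putting the noise-vs-signal and runbook advice together, a backlog alert might look like this. The threshold, minimum ingest rate, and runbook URL are all placeholders to tune for your environment:
alerting rule fragment (backlog with traffic guard)
groups:
  - name: convoy-backlog
    rules:
      - alert: ConvoyDeliveryBacklogGrowing
        # Require real traffic so idle environments don't flap.
        expr: |
          max(convoy_event_delivery_queue_backlog_seconds) > 300
          and sum(rate(convoy_ingest_total[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Oldest pending delivery is older than 5 minutes"
          runbook_url: https://runbooks.example.com/convoy/delivery-backlog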
For product usage analytics (Mixpanel), see Telemetry—that pipeline is separate from Prometheus.

Ingest counters and end-to-end latency

These are registered from internal/pkg/metrics/data_plane.go when Prometheus is enabled and the license allows export. Labels: project and source on ingest counters; project and endpoint on the histogram.
Name | Type | Description
convoy_ingest_total | Counter | Total number of events ingested
convoy_ingest_success | Counter | Total number of events successfully ingested and consumed
convoy_ingest_error | Counter | Total number of errors during event ingestion
convoy_end_to_end_latency | Histogram | Total time (in seconds) an event spends in Convoy (recorded per delivery)
The code also defines a convoy_ingest_latency histogram (per project); your build may or may not register it on /metrics—confirm by scraping.
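Broken down by their labels, the counters above support per-tenant views. Illustrative PromQL — adjust ranges and selectors to your deployment:
# Ingest success ratio per project
sum by (project) (rate(convoy_ingest_success[5m])) / sum by (project) (rate(convoy_ingest_total[5m]))

# Busiest sources by ingest rate
topk(5, sum by (project, source) (rate(convoy_ingest_total[5m])))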

Queue depth and backlog (Redis and Postgres)

These come from custom collectors, not from data_plane.go. When metrics are enabled, RegisterQueueMetrics attaches the Redis queue and Postgres implementations to the same registry, so they appear alongside the series above on /metrics for that process. In server + agent deployments, queue and ingest series are normally observed on the agent scrape target (data plane); the control server exposes its own /metrics for whatever it registers. Postgres-backed values are refreshed on a sample interval (metrics.prometheus_metrics.sample_time). Depending on version and schema, queries may use materialized views or live SQL—see the server release notes if you upgrade.

Redis (Asynq) queues

Name | Type | Labels | Description
convoy_event_queue_scheduled_total | Gauge | status | Tasks waiting on the create-event queue (queue size minus completed/archived)
convoy_event_workflow_queue_match_subscriptions_total | Gauge | status | Tasks waiting on the workflow queue used when matching subscriptions

Postgres (events and deliveries)

Name | Type | Labels | Description
convoy_event_queue_total | Gauge | project, source, status | Counts derived from events (or materialized views when present)
convoy_event_queue_backlog_seconds | Gauge | project, source | Age in seconds of the oldest pending work for that project/source
convoy_event_delivery_queue_total | Gauge | project, project_name, endpoint, status, event_type, source, organisation_id, organisation_name | Tasks in the delivery pipeline per endpoint and dimensions
convoy_event_delivery_queue_backlog_seconds | Gauge | project, endpoint, source | Oldest pending delivery backlog per endpoint (seconds)
convoy_event_delivery_attempts_total | Gauge | project, endpoint, status, http_status_code | Delivery attempts grouped by outcome and HTTP status
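Because these are gauges refreshed on the sample interval rather than monotonically increasing counters, query them directly instead of wrapping them in rate(). Illustrative queries — label values such as status and http_status_code depend on your Convoy version, so check a raw scrape first:
# Endpoints with the oldest pending delivery backlog
topk(5, max by (endpoint) (convoy_event_delivery_queue_backlog_seconds))

# Pending work per project and source
sum by (project, source) (convoy_event_queue_total)

# Share of delivery attempts that returned a 5xx status
sum by (endpoint) (convoy_event_delivery_attempts_total{http_status_code=~"5.."}) / sum by (endpoint) (convoy_event_delivery_attempts_total)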

Tracing

Convoy can emit application traces (separate from product telemetry in Mixpanel). Configure the tracer under tracer in convoy.json (see Configuration) or use the environment variables below—they map to TracerConfiguration in the Convoy server config.
  • Provider: CONVOY_TRACER_PROVIDER = otel | sentry | datadog (CLI: --tracer-type).
  • OpenTelemetry: CONVOY_OTEL_COLLECTOR_URL (collector gRPC URL), CONVOY_OTEL_SAMPLE_RATE, CONVOY_OTEL_INSECURE_SKIP_VERIFY, optional CONVOY_OTEL_AUTH_HEADER_NAME / CONVOY_OTEL_AUTH_HEADER_VALUE (same values as JSON tracer.otel.otel_auth.header_name / header_value).
  • Sentry: CONVOY_SENTRY_DSN, CONVOY_SENTRY_SAMPLE_RATE, CONVOY_SENTRY_ENVIRONMENT, CONVOY_SENTRY_DEBUG.
  • Datadog: CONVOY_DATADOG_AGENT_URL (requires Datadog tracing entitlement on the license).
OpenTelemetry via JSON (equivalent env vars above):
tracer otel fragment
{
  "tracer": {
    "type": "otel",
    "otel": {
      "collector_url": "otel-collector:4317",
      "sample_rate": 0.1,
      "insecure_skip_verify": false,
      "otel_auth": {
        "header_name": "",
        "header_value": ""
      }
    }
  }
}
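On the receiving side, a minimal OpenTelemetry Collector configuration that accepts the gRPC traffic sent to collector_url above could look like this — the debug exporter is a stand-in; swap in the exporter for your tracing backend:
otel-collector config fragment
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug:
    verbosity: basic
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]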
Sampling is controlled by sample_rate / CONVOY_OTEL_SAMPLE_RATE; not every code path may emit spans at every request. Span names emitted from the data plane (agent) today include (non-exhaustive): event.creation.success, event.creation.error, dynamic.event.creation.success, dynamic.event.creation.error, dynamic.event.subscription.matching.error, and meta_event_delivery. New releases may add or rename spans—confirm in your trace backend.
[!WARNING] Feature flags in Convoy were reimplemented on a per-feature basis.
The following flags/configs are no longer valid:
  • --feature-flag=experimental
  • export CONVOY_FEATURE_FLAG=1