Development version. This documentation is for unreleased features on the main branch.

Metrics Cardinality Reference

Cardinality reference and metricRelabelings examples for the Coraza operator.

Overview

Budget target: fewer than 5,000 operator-side series for a 50-engine cluster. Gauges with the _info suffix follow the info-metric pattern (constant value 1, identity labels only); numeric observations are exposed as separate gauges. Gauges never use the _total suffix, which Prometheus convention reserves for counters.

Control-Plane Metrics (operator /metrics on :8443)

Cache Server (existing — `internal/rulesets/cache/metrics.go`)

Metric	Type	Labels	Worst-case series/cluster	Notes
`coraza_cache_server_requests_total`	counter	`handler`, `method`, `code`	~6	`handler` ∈ {`rules`, `latest`}
`coraza_cache_server_request_duration_seconds`	histogram	`handler`, `method`, `code`	~6 × 14 = ~84 (11 buckets + +Inf + `_sum` + `_count`)
`coraza_cache_server_in_flight_requests`	gauge	`handler`	~2
`coraza_cache_size_bytes`	gauge	(none)	1
`coraza_cache_instances`	gauge	(none)	1
`coraza_cache_total_entries`	gauge	(none)	1
`coraza_cache_config_max_size_bytes`	gauge	(none)	1
`coraza_cache_gc_pruned_entries_total`	counter	`reason`	2	`reason` ∈ {`age`, `size`}
`coraza_cache_gc_size_limit_exceeded_total`	counter	(none)	1
`coraza_cache_server_auth_failures_total`	counter	(none)	1	Authentication failures on cache server

Controller Observability (PR #397 — `internal/controller/metrics.go`)

Note: The metrics in this section are defined in internal/controller/metrics.go (merged via PR #397).

Metric	Type	Labels	Worst-case series/cluster	Notes
`coraza_engine_info`	gauge=1	`namespace`, `name`, `target_name`, `target_type`, `driver_type`, `ruleset_name`, `failure_policy`	1 per Engine	Info pattern
`coraza_engine_condition`	gauge	`namespace`, `name`, `condition`	4 per Engine (Ready/Progressing/Degraded/Accepted)
`coraza_engines`	gauge	`namespace`	1 per namespace
`coraza_ruleset_info`	gauge=1	`namespace`, `name`	1 per RuleSet	Info pattern
`coraza_ruleset_condition`	gauge	`namespace`, `name`, `condition`	3 per RuleSet (Ready/Progressing/Degraded)
`coraza_rulesets`	gauge	`namespace`	1 per namespace
`coraza_ruleset_sources`	gauge	`namespace`, `name`	1 per RuleSet
`coraza_ruleset_data_files`	gauge	`namespace`, `name`	1 per RuleSet
`coraza_rulesource_info`	gauge=1	`namespace`, `name`	1 per RuleSource	Info pattern
`coraza_rulesource_condition`	gauge	`namespace`, `name`, `condition`	2 per RuleSource (Ready/Degraded)
`coraza_rulesources`	gauge	`namespace`	1 per namespace
`coraza_ruledata_info`	gauge=1	`namespace`, `name`	1 per referenced RuleData	Only RuleDatas referenced by a RuleSet
`coraza_ruledata_condition`	gauge	`namespace`, `name`, `condition`	1 per referenced RuleData (Ready)
`coraza_ruledatas`	gauge	`namespace`	1 per namespace	Refreshed on RuleSet reconcile and RuleData watch events
`coraza_rulesource_validations_total`	counter	`namespace`, `outcome`	3 per namespace (valid/invalid/skipped)	Incremented once per status transition, not per informer resync
`coraza_ruleset_validations_total`	counter	`namespace`, `outcome`	2 per namespace (valid/invalid)	`valid` recorded after successful cache; `invalid` on degrade transition
`coraza_rulesource_validation_duration_seconds`	histogram	`namespace`, `outcome`	2 × 13 per namespace (valid/invalid)	`skipped` increments the counter only; 11 buckets + sum + count
`coraza_ruleset_validation_duration_seconds`	histogram	`namespace`, `outcome`	2 × 13 per namespace	PMFromFile + Coraza parse only; excludes status patches and cache
`coraza_cache_set_duration_seconds`	histogram	`namespace`	1 × 10 per namespace	Recorded once per cache transition; excludes resync and unchanged content

Controller-Runtime Built-ins (not `coraza_` prefixed)

These metrics are emitted automatically by controller-runtime and are not configurable by this chart. They appear in the same scrape job as the coraza_ metrics.

Metric	Notes
`controller_runtime_reconcile_total{controller, result}`	~4 series per controller
`controller_runtime_reconcile_errors_total{controller}`	~2 series
`workqueue_depth{name}`	~2 series

controller-runtime emits many more standard metrics than the ones listed here. Refer to the controller-runtime documentation for the full list.

Worked Example: 50-engine cluster

50 Engines × (1 info + 4 conditions)                              =  250 Engine series
50 Engines grouped in 5 namespaces: 5 namespace aggregates         =    5 series
50 RuleSets × (1 info + 3 conditions + 1 sources + 1 data_files)  =  300 RuleSet series
50 RuleSets in 5 namespaces: 5 namespace aggregates                =    5 series
200 RuleSources × (1 info + 2 conditions)                          =  600 RuleSource series
200 RuleSources in 5 namespaces: 5 namespace aggregates            =    5 series
50 referenced RuleData × (1 info + 1 condition)                    =  100 RuleData series
50 RuleData in 5 namespaces: 5 namespace aggregates                =    5 series
Validation counters: 5 ns × (3 rulesource + 2 ruleset outcomes)    =   25 series
Validation histograms: 5 ns × (2×13 rulesource + 2×13 ruleset + 10 cache) =  310 series
Cache server: ~99 series (incl. auth failures counter)
                                                                   ─────────────────
Total coraza_* operator series:                                   ~1704

Well within the 5,000 series budget.
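The arithmetic above can be reproduced with a short script, which makes it easier to re-run the estimate for a different cluster shape. This is a minimal Python sketch, not part of the operator: the function name `operator_series` and the constants are hypothetical, and the per-resource figures are copied from the tables above (including the approximate ~99 cache-server series).

```python
# Hypothetical helper that mirrors the worked example; not part of the operator.
HISTOGRAM_SERIES = 13      # 11 buckets + _sum + _count, per the table above
CACHE_SET_SERIES = 10      # per-namespace figure for coraza_cache_set_duration_seconds
CACHE_SERVER_SERIES = 99   # approximate cache-server total used in the example


def operator_series(engines, rulesets, rulesources, ruledatas, namespaces):
    """Estimate coraza_* operator series from resource and namespace counts."""
    per_resource = (
        engines * (1 + 4)               # info + 4 conditions
        + rulesets * (1 + 3 + 1 + 1)    # info + 3 conditions + sources + data_files
        + rulesources * (1 + 2)         # info + 2 conditions
        + ruledatas * (1 + 1)           # info + 1 condition
    )
    # One aggregate gauge per namespace for each of the four resource kinds.
    aggregates = 4 * namespaces
    # Validation counters: 3 rulesource outcomes + 2 ruleset outcomes.
    counters = namespaces * (3 + 2)
    # Two validation histograms with 2 outcomes each, plus the cache-set histogram.
    histograms = namespaces * (2 * HISTOGRAM_SERIES + 2 * HISTOGRAM_SERIES + CACHE_SET_SERIES)
    return per_resource + aggregates + counters + histograms + CACHE_SERVER_SERIES


total = operator_series(engines=50, rulesets=50, rulesources=200, ruledatas=50, namespaces=5)
print(total)
```

Changing the arguments gives a quick first estimate for other cluster sizes; the result is only as accurate as the per-namespace and cache-server approximations it inherits.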

What NOT to use as label values

Rule IDs on the operator side

The operator never sees per-rule decisions — those live in Envoy after the WAF driver processes traffic. Adding rule_id labels on the operator side would require data-plane scraping (see Data-Plane Metrics below). Rule-level cardinality explodes quickly: CRS alone contains thousands of rules.

IP addresses

IP addresses have effectively unbounded cardinality and run counter to Prometheus label guidance. Never use client IPs, source IPs, or any IP-derived value as a label.

Numeric counts as label values

Use separate gauge metrics rather than encoding counts in label values. For example, coraza_ruleset_sources{namespace="...", name="..."} carrying the numeric count is correct. A label like source_count="5" on a parent metric is not — each distinct count creates a new time series and the label conveys no structural identity.

Raw error messages

Use error type or reason codes, never free-form error strings. Free-form strings have effectively unbounded cardinality (every unique message is a new series) and are difficult to query reliably.
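One way to keep such a label bounded is to map every error onto a small fixed set of reason codes before it reaches the metric. The sketch below is illustrative Python rather than the operator's Go code; the reason names and the `error_reason` helper are assumptions made for the example.

```python
# Hypothetical reason codes; the operator's real set may differ.
REASONS = ("timeout", "permission_denied", "not_found", "invalid_input", "unknown")


def error_reason(exc):
    """Collapse an exception into one of a fixed set of label-safe reason codes."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, PermissionError):
        return "permission_denied"
    if isinstance(exc, FileNotFoundError):
        return "not_found"
    if isinstance(exc, ValueError):
        return "invalid_input"
    # Anything unrecognised shares one bucket instead of creating new series.
    return "unknown"


# Two different messages collapse to the same label value.
first = error_reason(FileNotFoundError("open /rules/a.conf: no such file"))
second = error_reason(FileNotFoundError("open /rules/b.conf: no such file"))
print(first, second)
```

However many distinct messages occur, the label can take at most `len(REASONS)` values, so the series count stays fixed; the full message belongs in logs or events.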

Data-Plane Metrics (coraza_waf_* — emitted by WAF driver in Envoy)

Data-plane metrics are emitted directly from the Coraza WASM driver running inside Envoy sidecars. They are NOT scraped from the operator.

For the full specification of data-plane metric names, labels, and cardinality constraints (including the top-N rule limit that bounds per-rule series), see the driver metrics contract.

Key differences from operator metrics:

Property	Operator metrics	Data-plane metrics
Source	operator `/metrics` on `:8443`	Envoy admin API or PodMonitor
Scrape target	ServiceMonitor on operator pod	PodMonitor on Gateway pods
Per-rule detail	No — operator never sees rule decisions	Yes — bounded by top-N limit
Worst-case (10-engine cluster)	~585 series	~7,000 series

ServiceMonitor label handling

The Helm chart sets honorLabels: true on the operator ServiceMonitor so that the namespace label on coraza_* metrics reflects the CR namespace, not the operator pod namespace. With honorLabels set to false, Prometheus keeps the scrape target's namespace and renames the conflicting label from the metric to exported_namespace.
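The conflict handling can be modelled in a few lines. The following Python sketch is a simplified model of Prometheus scrape-time label merging, not its actual implementation; `merge_labels` and the label values are hypothetical. It assumes the documented Prometheus behaviour that, without honorLabels, a conflicting scraped label is kept under an `exported_` prefix.

```python
def merge_labels(scraped, target, honor_labels):
    """Simplified model of how scraped labels and target labels are combined."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped and honor_labels:
            # honorLabels: true -> the value exposed by the operator wins.
            continue
        if name in scraped:
            # honorLabels: false -> keep the scraped value under exported_<name>.
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged


# Hypothetical label sets for one coraza_engine_condition series.
scraped = {"namespace": "team-a", "name": "my-engine"}
target = {"namespace": "coraza-system", "job": "coraza-operator"}

honored = merge_labels(scraped, target, honor_labels=True)
default = merge_labels(scraped, target, honor_labels=False)
print(honored["namespace"], default["namespace"], default["exported_namespace"])
```

With `honor_labels=True` the series keeps `namespace="team-a"`, which is what per-tenant dashboards and alerts normally expect.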

honorLabels also preserves any job or instance labels emitted by the operator. Those labels are reserved for scrape-target identity; if the operator ever emitted them, aggregation would silently break. The chart therefore applies a built-in metricRelabelings rule to drop job and instance before user-supplied metrics.serviceMonitor.metricRelabelings are applied.

Contract: operator metrics must not emit job or instance labels.

Reducing Cardinality with metricRelabelings

The Helm chart exposes metrics.serviceMonitor.metricRelabelings to drop or transform series before they are stored in your TSDB. The following examples can be pasted into values.yaml.

Example 1 — drop all info metrics in large multi-tenant clusters (keep only conditions + counts):

metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: 'coraza_(engine|ruleset|rulesource|ruledata)_info'
        action: drop

This reduces series by 1 per Engine, 1 per RuleSet, 1 per RuleSource, and 1 per referenced RuleData. Useful when you have many short-lived resources and do not need the identity labels captured in info metrics.

Example 2 — keep only coraza_ prefixed metrics (drop controller-runtime built-ins from this scrape job):

metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: 'coraza_.*'
        action: keep

This is useful when controller-runtime built-ins are already collected by a cluster-wide scrape job and you want to avoid duplication.

Example 3 — drop high-cardinality cache histogram (keep only counter + gauge):

metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: 'coraza_cache_server_request_duration_seconds.*'
        action: drop

The histogram generates ~84 series (6 label combinations × 14 series each: 11 explicit buckets + 1 +Inf bucket + _count + _sum). Dropping it reduces operator-side series by ~84 while retaining the counter for request rate calculations.
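Before rolling out a rule, it can help to check which metric names it would keep or drop. The sketch below is a rough Python model of a single keep/drop rule on `__name__`; `apply_name_rule` and the sample metric list are made up for illustration. It relies on Prometheus anchoring relabel regexes at both ends, and it uses Python's `re` module, whose syntax agrees with RE2 for simple patterns like these but is not identical in general.

```python
import re

# Illustrative subset of metric names from this page.
SAMPLE_NAMES = [
    "coraza_engine_info",
    "coraza_engine_condition",
    "coraza_cache_server_requests_total",
    "coraza_cache_server_request_duration_seconds_bucket",
    "controller_runtime_reconcile_total",
    "workqueue_depth",
]


def apply_name_rule(names, regex, action):
    """Return the names that survive one keep/drop rule on __name__."""
    if action not in ("keep", "drop"):
        raise ValueError("only keep and drop are modelled here")
    pattern = re.compile(regex)
    survivors = []
    for name in names:
        # fullmatch mimics the implicit ^...$ anchoring of relabel regexes.
        matched = pattern.fullmatch(name) is not None
        if matched == (action == "keep"):
            survivors.append(name)
    return survivors


example_1 = apply_name_rule(SAMPLE_NAMES, r"coraza_(engine|ruleset|rulesource|ruledata)_info", "drop")
example_2 = apply_name_rule(SAMPLE_NAMES, r"coraza_.*", "keep")
example_3 = apply_name_rule(SAMPLE_NAMES, r"coraza_cache_server_request_duration_seconds.*", "drop")
print(example_1, example_2, example_3, sep="\n")
```

Because of the anchoring, the trailing `.*` in Example 3 is what lets the rule match the `_bucket`, `_sum`, and `_count` series; without it only the bare metric name would match.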