Metrics Cardinality Reference

Cardinality reference and metricRelabelings examples for the Coraza operator.

Overview

Budget target: fewer than 5,000 operator-side series for a 50-engine cluster. Gauges with the _info suffix follow the info-metric pattern: the value is always 1 and the labels carry identity only. Numeric observations are exposed as separate gauges. Never use the _total suffix on gauges; Prometheus convention reserves it for counters.

Control-Plane Metrics (operator /metrics on :8443)

Cache Server (existing — internal/rulesets/cache/metrics.go)

| Metric | Type | Labels | Worst-case series/cluster | Notes |
| --- | --- | --- | --- | --- |
| coraza_cache_server_requests_total | counter | handler, method, code | ~6 | handler ∈ {rules, latest} |
| coraza_cache_server_request_duration_seconds | histogram | handler, method, code | ~6 × 14 = ~84 | 12 buckets (11 explicit + +Inf) plus _sum and _count |
| coraza_cache_server_in_flight_requests | gauge | handler | ~2 | |
| coraza_cache_size_bytes | gauge | (none) | 1 | |
| coraza_cache_instances | gauge | (none) | 1 | |
| coraza_cache_total_entries | gauge | (none) | 1 | |
| coraza_cache_config_max_size_bytes | gauge | (none) | 1 | |
| coraza_cache_gc_pruned_entries_total | counter | reason | 2 | reason ∈ {age, size} |
| coraza_cache_gc_size_limit_exceeded_total | counter | (none) | 1 | |
| coraza_cache_server_auth_failures_total | counter | (none) | 1 | Authentication failures on cache server |

Controller Observability (PR #397 — internal/controller/metrics.go)

Note: The metrics in this section are defined in internal/controller/metrics.go (merged via PR #397).

| Metric | Type | Labels | Worst-case series/cluster | Notes |
| --- | --- | --- | --- | --- |
| coraza_engine_info | gauge=1 | namespace, name, target_name, target_type, driver_type, ruleset_name, failure_policy | 1 per Engine | Info pattern |
| coraza_engine_condition | gauge | namespace, name, condition | 4 per Engine (Ready/Progressing/Degraded/Accepted) | |
| coraza_engines | gauge | namespace | 1 per namespace | |
| coraza_ruleset_info | gauge=1 | namespace, name | 1 per RuleSet | Info pattern |
| coraza_ruleset_condition | gauge | namespace, name, condition | 3 per RuleSet (Ready/Progressing/Degraded) | |
| coraza_rulesets | gauge | namespace | 1 per namespace | |
| coraza_ruleset_sources | gauge | namespace, name | 1 per RuleSet | |
| coraza_ruleset_data_files | gauge | namespace, name | 1 per RuleSet | |
| coraza_rulesource_info | gauge=1 | namespace, name | 1 per RuleSource | Info pattern |
| coraza_rulesource_condition | gauge | namespace, name, condition | 2 per RuleSource (Ready/Degraded) | |
| coraza_rulesources | gauge | namespace | 1 per namespace | |
| coraza_ruledata_info | gauge=1 | namespace, name | 1 per referenced RuleData | Only RuleDatas referenced by a RuleSet |
| coraza_ruledata_condition | gauge | namespace, name, condition | 1 per referenced RuleData (Ready) | |
| coraza_ruledatas | gauge | namespace | 1 per namespace | Refreshed on RuleSet reconcile and RuleData watch events |
| coraza_rulesource_validations_total | counter | namespace, outcome | 3 per namespace (valid/invalid/skipped) | Incremented once per status transition, not per informer resync |
| coraza_ruleset_validations_total | counter | namespace, outcome | 2 per namespace (valid/invalid) | valid recorded after successful cache; invalid on degrade transition |
| coraza_rulesource_validation_duration_seconds | histogram | namespace, outcome | 2 × 13 per namespace (valid/invalid) | skipped increments the counter only; 11 buckets + sum + count |
| coraza_ruleset_validation_duration_seconds | histogram | namespace, outcome | 2 × 13 per namespace | PMFromFile + Coraza parse only; excludes status patches and cache |
| coraza_cache_set_duration_seconds | histogram | namespace | 1 × 10 per namespace | Recorded once per cache transition; excludes resync and unchanged content |

Controller-Runtime Built-ins (not coraza_ prefixed)

These metrics are emitted automatically by controller-runtime and are not configurable by this chart. They appear in the same scrape job as the coraza_ metrics.

| Metric | Notes |
| --- | --- |
| controller_runtime_reconcile_total{controller, result} | ~4 series per controller |
| controller_runtime_reconcile_errors_total{controller} | ~2 series |
| workqueue_depth{name} | ~2 series |

The table lists a representative subset; controller-runtime emits many more standard metrics. Refer to the controller-runtime documentation for the full list.

Worked Example: 50-engine cluster

50 Engines × (1 info + 4 conditions)                              =  250 Engine series
50 Engines grouped in 5 namespaces: 5 namespace aggregates         =    5 series
50 RuleSets × (1 info + 3 conditions + 1 sources + 1 data_files)  =  300 RuleSet series
50 RuleSets in 5 namespaces: 5 namespace aggregates                =    5 series
200 RuleSources × (1 info + 2 conditions)                          =  600 RuleSource series
200 RuleSources in 5 namespaces: 5 namespace aggregates            =    5 series
50 referenced RuleData × (1 info + 1 condition)                    =  100 RuleData series
50 RuleData in 5 namespaces: 5 namespace aggregates                =    5 series
Validation counters: 5 ns × (3 rulesource + 2 ruleset outcomes)    =   25 series
Validation histograms: 5 ns × (2×13 rulesource + 2×13 ruleset + 10 cache) =  310 series
Cache server: 6 + 84 + 2 + 4 + 2 + 1 + 1 (incl. auth failures)    =  100 series
                                                                   ─────────────────
Total coraza_* operator series:                                   ~1705

Well within the 5,000 series budget.
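
If the budget ever becomes tight, histogram buckets are the cheapest group to trim in the tally above. The following is a sketch, assuming the validation histograms expose the usual Prometheus _bucket, _sum, and _count series. It drops only the _bucket series, so average durations remain computable from _sum and _count, while histogram_quantile queries on these metrics stop working. With 11 buckets per histogram, this would remove roughly 5 namespaces × 2 outcomes × 11 buckets × 2 metrics = 220 series in the 50-engine example.

```yaml
# values.yaml (sketch): drop validation histogram buckets, keep _sum and _count
metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        # The regex is fully anchored, so only the _bucket series match.
        regex: 'coraza_(rulesource|ruleset)_validation_duration_seconds_bucket'
        action: drop
```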

What NOT to use as label values

Rule IDs on the operator side

The operator never sees per-rule decisions — those live in Envoy after the WAF driver processes traffic. Adding rule_id labels on the operator side would require data-plane scraping (see Data-Plane Metrics below). Rule-level cardinality explodes quickly: CRS alone contains thousands of rules.

IP addresses

IP addresses have unbounded cardinality and violate standard Prometheus cardinality guidance. Never use client IPs, source IPs, or any IP-derived value as a label.

Numeric counts as label values

Use separate gauge metrics rather than encoding counts in label values. For example, coraza_ruleset_sources{namespace="...", name="..."} carrying the numeric count is correct. A label like source_count="5" on a parent metric is not — each distinct count creates a new time series and the label conveys no structural identity.

Raw error messages

Use error type or reason codes, never free-form error strings. Free-form strings have effectively unbounded cardinality (every unique message is a new series) and are difficult to query reliably.

Data-Plane Metrics (coraza_waf_* — emitted by WAF driver in Envoy)

Data-plane metrics are emitted directly from the Coraza WASM driver running inside Envoy sidecars. They are NOT scraped from the operator.

For the full specification of data-plane metric names, labels, and cardinality constraints (including the top-N rule limit that bounds per-rule series), see the driver metrics contract.

Key differences from operator metrics:

| Property | Operator metrics | Data-plane metrics |
| --- | --- | --- |
| Source | operator /metrics on :8443 | Envoy admin API or PodMonitor |
| Scrape target | ServiceMonitor on operator pod | PodMonitor on Gateway pods |
| Per-rule detail | No: operator never sees rule decisions | Yes: bounded by top-N limit |
| Worst-case (10-engine cluster) | ~585 series | ~7,000 series |

ServiceMonitor label handling

The Helm chart enables honorLabels: true on the operator ServiceMonitor so the namespace label on coraza_* metrics reflects the CR namespace, not the operator pod namespace. When honorLabels is false, Prometheus resolves the label conflict in favor of the scrape target: namespace carries the target's namespace and the value emitted by the operator is moved to exported_namespace.
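
For orientation, the fragment below sketches where this setting sits on a Prometheus Operator ServiceMonitor. The resource name, selector labels, and port name are hypothetical placeholders rather than the chart's actual rendered values, and the TLS and authorization settings needed for :8443 are omitted.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: coraza-operator-metrics                  # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: coraza-operator    # hypothetical selector
  endpoints:
    - port: metrics                              # hypothetical port name for :8443
      honorLabels: true                          # scraped namespace label wins over the target's
```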

honorLabels also preserves any job or instance labels emitted by the operator. Those labels are reserved for scrape-target identity; if the operator ever emitted them, aggregation would silently break. The chart therefore applies a built-in metricRelabelings rule to drop job and instance before user-supplied metrics.serviceMonitor.metricRelabelings are applied.

Contract: operator metrics must not emit job or instance labels.

Reducing Cardinality with metricRelabelings

The Helm chart exposes metrics.serviceMonitor.metricRelabelings to drop or transform series before they are stored in your TSDB. The following examples can be pasted into values.yaml.

Example 1 — drop all info metrics in large multi-tenant clusters (keep only conditions + counts):

metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: 'coraza_(engine|ruleset|rulesource|ruledata)_info'
        action: drop

This reduces series by 1 per Engine, 1 per RuleSet, 1 per RuleSource, and 1 per referenced RuleData. Useful when you have many short-lived resources and do not need the identity labels captured in info metrics.
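
A narrower variant, as a sketch: instead of dropping whole metrics, match on the metric name and a label value together. Prometheus joins the sourceLabels values with ';' by default and anchors the regex at both ends. The example assumes the condition label value is exactly the condition type name (for example Progressing), and the choice of Progressing is only an illustration; pick whichever condition your alerts and dashboards do not use.

```yaml
metrics:
  serviceMonitor:
    metricRelabelings:
      # Joined value looks like "coraza_engine_condition;Progressing"
      - sourceLabels: [__name__, condition]
        regex: 'coraza_(engine|ruleset)_condition;Progressing'
        action: drop
```

This would remove 1 series per Engine and 1 per RuleSet while leaving the remaining conditions untouched.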

Example 2 — keep only coraza_ prefixed metrics (drop controller-runtime built-ins from this scrape job):

metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: 'coraza_.*'
        action: keep

This is useful when controller-runtime built-ins are already collected by a cluster-wide scrape job and you want to avoid duplication.
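
If a few controller-runtime series are still wanted from this scrape job, the keep rule can be widened with alternation. The sketch below uses only the built-in metric names listed earlier in this document; adjust the list to what your dashboards actually query.

```yaml
metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        # Keep all coraza_ metrics plus a short allowlist of built-ins.
        regex: 'coraza_.*|controller_runtime_reconcile_(errors_)?total|workqueue_depth'
        action: keep
```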

Example 3 — drop high-cardinality cache histogram (keep only counter + gauge):

metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: 'coraza_cache_server_request_duration_seconds.*'
        action: drop

The histogram generates ~84 series (6 label combinations × 14 series each: 11 explicit buckets + 1 +Inf bucket + _count + _sum). Dropping it reduces operator-side series by ~84 while retaining the requests counter for request-rate calculations.
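
A less drastic alternative, sketched under the assumption that only latency quantiles are expendable: drop just the _bucket series and keep _sum and _count. Average latency stays available from the ratio of the _sum and _count rates, but histogram_quantile no longer works for this metric. This removes about 6 × 12 = 72 of the ~84 series.

```yaml
metrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        # No trailing '.*': _sum and _count do not match and are kept.
        regex: 'coraza_cache_server_request_duration_seconds_bucket'
        action: drop
```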