Always-firing alert¶
Class: always-firing-alert
Severity (typical): Critical
Category: alert_hygiene
Finding ID prefix: alert_hygiene/always_firing
What it means¶
An always-firing alert is an alerting rule whose threshold is met across essentially the entire lookback window. The on-call gets a page, the page clears, the on-call gets a page, the page clears - on a loop, because the underlying signal never actually goes back below the line.
This is alert noise at its worst. Always-firing alerts train on-call humans to ignore a specific alertname; that habit then bleeds into ignoring nearby alerts that do mean something. They also corrupt incident timelines: a postmortem will see the alert firing continuously through the incident and struggle to identify when the real signal started.
The cause is almost always one of: a threshold that was set against a
load-test environment and never re-tuned for production; a metric whose units
or normalization changed since the rule was written; or a rule that was
flipped during refactoring (> should have become <).
How remetric detects it¶
The alert-hygiene analyzer (internal/analyzers/alerthygiene) runs the same
pipeline as the never-firing class:
- Fetch alerting rules from
/api/v1/rules(preferring--vmalertwhen set). - For each rule, query
ALERTS{alertname="<name>"}viaquery_rangeover the lookback (default 168h, step 1h). - Sum samples where
alertstate="firing".
The classifier compares the firing-sample sum to total steps in the window (ceil(lookback / step)):
| Tier | Trigger |
|---|---|
| Critical (always-firing) | firing ratio >= 0.95 |
| (none) | 0 < ratio < 0.95 |
| Medium (never-firing) | ratio == 0 |
The 0.95 threshold is the alwaysFiringThreshold constant in the analyzer.
The finding's title includes the exact ratio and step count
(Alert X firing 97.3% of window (163/168 steps)) so you can see how close
to permanent the rule is.
How to fix it¶
The analyzer emits a rule-change snippet that points at the offending rule:
# In <file>, group <group>: review threshold - alert is permanently firing
- alert: <name>
expr: <original expression>
Walk the expression in a graph view (/graph?g0.expr=<expr>&g0.range_input=1w)
to see what the input series actually looks like. The fix is usually one of:
- Re-tune the threshold. Compute current production p99 / p95 / p50 over a representative window and pick a threshold that fires on the tail you actually care about, not on baseline.
- Flip the comparison. If
up{job="x"} > 0is firing 100% of the time, you almost certainly meantup{job="x"} == 0. - Add a
for:clause. A rule with nofor:fires on every sample over the line. Addingfor: 5mkeeps brief spikes from generating a page (and surfaces real sustained breaches). - Delete the rule if the signal it was meant to catch no longer matters.
False positives¶
You may want to ignore this class when:
- An alert is intentionally always-firing (e.g. a heartbeat-style "the rule
engine is alive" alert that you route to silence). Move these to silences
rather than alert rules, but if you can't, ignore via
--ignore-alert. - A rule was just deployed during an active incident and the lookback hasn't
caught up. Re-run with a shorter
--lookbackto confirm.
Suppress via --ignore-alert <regex>. See the
README's "Ignoring findings" section.
Related¶
- Never-firing alert - the opposite failure mode: an alert loaded but never tripped.