Replace the current single Slack block with a routed alert system: three templates, two channels, OpenSearch-backed dedup with atomic first-detection. Repeated incidents collapse into a daily 09:00 CET digest. Cap-exceedance becomes its own incident — no silent collapse.
Three message templates: new-functional, repeated-functional-digest, technical-failure. Two channels: #bi-alerts for things that wake engineers, #bi-triage for things that need classification first. Dedup state in OpenSearch (cncqa_bi_incidents-YYYY-MM).
Out of scope: Slack API / bot tokens (v1.2), auto-close (v1.2), PagerDuty (v1.2+), incident UI in Grafana (v1.2).
ui-mismatch-suppressed no longer silent; unclassified moved to 4h + digest.

| Verdict | Channel | Template | Dedup |
|---|---|---|---|
| tracking-broken (new) | #bi-alerts | new-functional | 24h |
| tracking-broken (repeat) | #bi-alerts | → digest | 24h |
| param-mismatch | #bi-alerts | new-functional | 24h |
| duplicate | #bi-alerts | new-functional | 24h |
| under-counted | #bi-alerts | new-functional | 24h |
| ui-mismatch-suppressed | #bi-triage | technical-failure | 4h |
| rule-not-applicable | (no Slack) | n/a | n/a |
| robot-broken | #bi-triage | technical-failure | 4h |
| site-broken | #bi-triage | technical-failure | 4h |
| timeout | #bi-triage | technical-failure | 4h |
| unclassified | #bi-triage | technical-failure | 4h + digest |

rule-not-applicable is the only silent verdict — by definition the rule says "I don't apply", so there's no event miss to report.
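A minimal sketch of how the routing table could be represented and looked up. The `Route` shape, the `ROUTES` map, and `routeFor` are hypothetical names used for illustration; the actual schema of data/slack-routing.json is not shown in this document, so the inline map simply mirrors the table.

```typescript
// Hypothetical shape: the real schema of data/slack-routing.json is not shown in this document.
type Channel = "#bi-alerts" | "#bi-triage";
type Template = "new-functional" | "technical-failure";

interface Route {
  channel: Channel;
  template: Template;
  dedupHours: number;
  alwaysInDigest?: boolean; // only set for unclassified
}

const functional = (): Route => ({ channel: "#bi-alerts", template: "new-functional", dedupHours: 24 });
const technical = (): Route => ({ channel: "#bi-triage", template: "technical-failure", dedupHours: 4 });

// Mirrors the routing table. rule-not-applicable has no entry because it is silent.
// In this sketch, whether a detection is "new" or a "repeat" is decided by dedup state, not here.
const ROUTES = new Map<string, Route>([
  ["tracking-broken", functional()],
  ["param-mismatch", functional()],
  ["duplicate", functional()],
  ["under-counted", functional()],
  ["ui-mismatch-suppressed", technical()],
  ["robot-broken", technical()],
  ["site-broken", technical()],
  ["timeout", technical()],
  ["unclassified", { ...technical(), alwaysInDigest: true }],
]);

// Returns the route for a verdict, or null when nothing should be posted to Slack.
function routeFor(verdict: string): Route | null {
  return ROUTES.get(verdict) ?? null;
}
```

Returning `null` for unknown or silent verdicts keeps the "no Slack" case explicit at the call site instead of relying on a default channel.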
baseKey = `${propertyKey}|${pageType}|${logicalEvent}|${source}|${failureKind}|${viewport}`
specific = failureKind === "param-mismatch" ? `|param=${paramName}`
: failureKind === "duplicate" ? `|rule=${ruleId}`
: ""
dedupKey = sha1(baseKey + specific)
URL intentionally excluded — same site-broken issue across URLs is one incident; sample URL recorded separately.
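The key derivation above could be written as a small function, sketched here under a few assumptions: the `IncidentFields` interface is a hypothetical container for the named fields, the hash is hex-encoded (the source only says sha1), and a missing `paramName` or `ruleId` falls back to an empty string.

```typescript
import { createHash } from "node:crypto";

// Hypothetical container for the fields named in the key definition.
interface IncidentFields {
  propertyKey: string;
  pageType: string;
  logicalEvent: string;
  source: string;
  failureKind: string;
  viewport: string;
  paramName?: string; // only used for param-mismatch
  ruleId?: string;    // only used for duplicate
}

// Builds the deterministic dedup key. The URL is deliberately not an input.
function dedupKey(f: IncidentFields): string {
  const baseKey = [f.propertyKey, f.pageType, f.logicalEvent, f.source, f.failureKind, f.viewport].join("|");
  const specific =
    f.failureKind === "param-mismatch" ? `|param=${f.paramName ?? ""}`
    : f.failureKind === "duplicate" ? `|rule=${f.ruleId ?? ""}`
    : "";
  // Assumption: hex encoding of the SHA-1 digest.
  return createHash("sha1").update(baseKey + specific).digest("hex");
}
```

With this shape, two param-mismatch defects on the same logical event yield different keys when their `paramName` differs, while the same defect seen on different URLs yields the same key.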
PUT /_create/<dedupKey> + 409-on-conflict.

Codex blocker: the original GET-then-upsert is racy under overlapping scheduled runs, because two jobs can both see "first detection" and both alert.
Resolution — exactly-once first-alert:
- _id = dedupKey (deterministic).
- PUT /cncqa_bi_incidents-YYYY-MM/_create/<dedupKey> with new-incident document. Atomic create.
- On 409 conflict, POST /_update/<dedupKey> with a script that increments occurrenceCount, sets lastSeenAt, sets lastSampleUrl. Use retry_on_conflict=5.

SLACK_WH_URL, not SLACK_WEBHOOK_URL.

Codex caught the env name mismatch. New vars match existing convention: SLACK_WH_URL_BI_ALERTS, SLACK_WH_URL_BI_TRIAGE. Old SLACK_WH_URL is fallback.
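The exactly-once first-alert flow (atomic create, then scripted update on conflict) can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `HttpFn` is a hypothetical transport abstraction, `recordDetection` and the `firstSeenAt` field are invented for the example, and the script is plain Painless against the document fields named above.

```typescript
// Hypothetical transport abstraction so the sketch does not depend on a client library.
interface HttpResult { status: number }
type HttpFn = (method: "PUT" | "POST", path: string, body: unknown) => Promise<HttpResult>;

interface Detection { dedupKey: string; seenAt: string; sampleUrl: string }

// Returns "first" when this call created the incident (the caller should alert),
// or "repeat" when the incident already existed (the caller should stay quiet).
async function recordDetection(http: HttpFn, index: string, d: Detection): Promise<"first" | "repeat"> {
  const id = encodeURIComponent(d.dedupKey);

  // Atomic create: succeeds with 201 for exactly one caller, 409 for everyone else.
  const created = await http("PUT", `/${index}/_create/${id}`, {
    firstSeenAt: d.seenAt, // assumption: not named in the source document
    lastSeenAt: d.seenAt,
    lastSampleUrl: d.sampleUrl,
    occurrenceCount: 1,
  });
  if (created.status === 201) return "first";
  if (created.status !== 409) throw new Error(`unexpected status ${created.status} from _create`);

  // Repeat: bump the counters with a scripted update that tolerates concurrent writers.
  const updated = await http("POST", `/${index}/_update/${id}?retry_on_conflict=5`, {
    script: {
      lang: "painless",
      source:
        "ctx._source.occurrenceCount += 1; " +
        "ctx._source.lastSeenAt = params.seenAt; " +
        "ctx._source.lastSampleUrl = params.sampleUrl;",
      params: { seenAt: d.seenAt, sampleUrl: d.sampleUrl },
    },
  });
  if (updated.status !== 200) throw new Error(`unexpected status ${updated.status} from _update`);
  return "repeat";
}
```

Because the decision is made by the status code of the create call rather than by a prior read, two overlapping runs cannot both conclude that they saw the first detection.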
New postToSlack helper with retry — 429 honors Retry-After; 5xx exponential backoff (max 2 retries); 4xx other than 429 logs and drops.
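A rough sketch of the retry policy described above. The `Send` and `Sleep` types are hypothetical injection points (so the policy can be exercised without a network), and `retryAfterSeconds` stands for the parsed Retry-After header. One assumption is labelled in the code: the two-retry limit is applied to 429 as well as 5xx, which the source does not state explicitly.

```typescript
// Hypothetical abstractions: the real helper presumably wraps an HTTP POST to the webhook URL.
interface SlackResponse { status: number; retryAfterSeconds?: number }
type Send = (payload: string) => Promise<SlackResponse>;
type Sleep = (ms: number) => Promise<void>;

const MAX_RETRIES = 2; // assumption: the same limit is used for 429 and 5xx

// Returns true when Slack accepted the message, false when it was logged and dropped.
async function postToSlack(send: Send, payload: string, sleep: Sleep, baseDelayMs = 1000): Promise<boolean> {
  for (let attempt = 0; ; attempt++) {
    const res = await send(payload);
    if (res.status >= 200 && res.status < 300) return true;

    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= MAX_RETRIES) {
      // Other 4xx responses, or retries exhausted: log and drop without failing the run.
      console.error(`Slack post dropped: status=${res.status} attempts=${attempt + 1}`);
      return false;
    }

    // 429 honors Retry-After when present; otherwise back off exponentially.
    const delayMs =
      res.status === 429 && res.retryAfterSeconds !== undefined
        ? res.retryAfterSeconds * 1000
        : baseDelayMs * Math.pow(2, attempt);
    await sleep(delayMs);
  }
}
```

Returning a boolean instead of throwing matches the requirement that a webhook failure must not fail the run.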
new-functional matches §8 meeting-notes mock verbatim.

- new-functional: site / pageType / device / URL / issueType / logicalEvent / sourceStatus (DL/GA4/Gem) / expected / actual / uiEvidence / likelyMeaning / classification chip.
- technical-failure: site / device / URL / failureKind / errorSummary / uiEvidenceSummary / Grafana link.
- repeated-functional-digest: table of top 20 by occurrence, with Grafana link.

GitLab schedules can miss runs with no built-in retry. New behavior:
- Watermark stored in cncqa_bi_digest_state/_doc/last-successful-digest with runAt.
- Digest window is [watermark, now], not [now-24h, now].
- GitLab job retry: 2 with 10-min delay.

Codex flag: keeping legacy sendSlackSummary with its anomalies/aborted block alongside Phase 11 routed alerts would double-report. sendSlackSummary now consults EVENT_RULES_V2: when it is true, the summary is totals-only.
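For the digest itself, the watermark-based window ([watermark, now] rather than [now-24h, now]) could be computed roughly as below. `digestWindow` and `DigestWindow` are hypothetical names, the 36h stale threshold is taken from the review notes in this document, and the 24h fallback for a missing watermark is an assumption for the very first run.

```typescript
interface DigestWindow { from: Date; to: Date; stale: boolean }

const STALE_AFTER_MS = 36 * 60 * 60 * 1000; // warn when the last successful digest is at least 36h old
const FALLBACK_MS = 24 * 60 * 60 * 1000;    // assumption: used only when no watermark exists yet

// watermark is the runAt of the last successful digest, or null if none was ever recorded.
function digestWindow(watermark: Date | null, now: Date): DigestWindow {
  const from = watermark ?? new Date(now.getTime() - FALLBACK_MS);
  const stale = now.getTime() - from.getTime() >= STALE_AFTER_MS;
  return { from, to: now, stale };
}
```

If a scheduled run is missed, the next run's window starts at the old watermark, so repeats from the skipped day are still covered; the caller would write `runAt = now` to the watermark doc only after the digest has been posted successfully.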
Per-channel rate cap: 20 messages per run per channel. If exceeded:
"+N more" line.cap-exceeded incident is emitted in #bi-triage with channel / runId / droppedCount / top-3 failureKinds by count.cncqa_bi_cap_exceeded-YYYY-MM) for Grafana trending.tracking-broken verdict produces a message in #bi-alerts matching the §8 meeting-notes mock exactly (snapshot test).dedupKey, produces zero new Slack messages but increments occurrenceCount via PUT /_create → 409 → scripted update.[watermark, now], or skips silently if zero repeats.robot-broken and unclassified route to #bi-triage, never #bi-alerts.unclassified dedup window is 4h and always included in digest.#bi-alerts doesn't fail the run; logged at error level; retries on 429 / 5xx.#bi-triage + OpenSearch metric doc.data/slack-routing.json is canonical; editing it changes routing without a code change.Two concurrent runs would both see "first detection" and both alert. Replaced with atomic PUT /_create/<dedupKey> + 409-conflict handling + scripted retry_on_conflict update.
ui-mismatch-suppressed wrongly silent: meeting notes treat UI/event mismatch as alert-worthy. Routed to #bi-triage so an analyst can decide whether the suppression heuristic was wrong.
unclassified at 1h was too loud: during rule rollout an unclassified burst would page repeatedly. Moved to 4h + always in digest.
Two different param-mismatch defects on the same logical event would collapse. Added conditional paramName / ruleId suffix.
Existing env is SLACK_WH_URL. New vars renamed. Added 429-Retry-After + 5xx-backoff helper.
Added watermark doc, GitLab retry: 2, ≥36h stale warning.
Switched legacy sendSlackSummary to totals-only when EVENT_RULES_V2=true.
Cap-exceeded now emits its own incident + OpenSearch metric doc for Grafana trending.
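The per-channel cap and the cap-exceeded incident could be sketched as below. `applyRateCap`, `OutgoingMessage`, and `CapExceeded` are hypothetical names; the incident fields (channel, runId, droppedCount, top-3 failureKinds) follow the description in this document. One assumption: the top-3 failure kinds are counted over the dropped messages only, which the source does not specify.

```typescript
interface OutgoingMessage { channel: string; failureKind: string; text: string }
interface CapExceeded { channel: string; runId: string; droppedCount: number; topFailureKinds: string[] }

const CAP_PER_CHANNEL = 20;

// Splits a run's messages into those to send and one cap-exceeded incident per overflowing channel.
function applyRateCap(
  runId: string,
  messages: OutgoingMessage[],
  cap = CAP_PER_CHANNEL,
): { toSend: OutgoingMessage[]; incidents: CapExceeded[] } {
  const sentPerChannel = new Map<string, number>();
  const droppedPerChannel = new Map<string, OutgoingMessage[]>();
  const toSend: OutgoingMessage[] = [];

  for (const m of messages) {
    const sent = sentPerChannel.get(m.channel) ?? 0;
    if (sent < cap) {
      toSend.push(m);
      sentPerChannel.set(m.channel, sent + 1);
    } else {
      const dropped = droppedPerChannel.get(m.channel) ?? [];
      dropped.push(m);
      droppedPerChannel.set(m.channel, dropped);
    }
  }

  const incidents: CapExceeded[] = [];
  droppedPerChannel.forEach((dropped, channel) => {
    // Assumption: failure kinds are ranked by how often they appear among dropped messages.
    const counts = new Map<string, number>();
    for (const m of dropped) counts.set(m.failureKind, (counts.get(m.failureKind) ?? 0) + 1);
    const topFailureKinds = Array.from(counts.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, 3)
      .map(([kind]) => kind);
    incidents.push({ channel, runId, droppedCount: dropped.length, topFailureKinds });
  });

  return { toSend, incidents };
}
```

The caller would post each returned incident to #bi-triage and write the matching metric doc, so an overflow is always visible rather than silently collapsed.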