Phase 11 · v1.1 · Shipped 2026-05-26

Slack routing, deduplication, digest.

Replace the current single Slack block with a routed alert system: three templates, two channels, OpenSearch-backed dedup with atomic first-detection. Repeated incidents collapse into a daily 09:00 CET digest. Cap-exceedance becomes its own incident — no silent collapse.

Gathered: 2026-05-26 · Mode: Output layer rewrite · Depends on: Phase 10
Phase boundary

What Slack consumers see and don't see.

Three message templates: new-functional, repeated-functional-digest, technical-failure. Two channels: #bi-alerts for things that wake engineers, #bi-triage for things that need classification first. Dedup state in OpenSearch (cncqa_bi_incidents-YYYY-MM).

Out of scope: Slack API / bot tokens (v1.2), auto-close (v1.2), PagerDuty (v1.2+), incident UI in Grafana (v1.2).

Decisions

Eight design questions — Q3 was the codex blocker.

Q1 · Routing matrix · revised

ui-mismatch-suppressed no longer silent; unclassified moved to 4h + digest.

Verdict                   Channel      Template            Dedup
tracking-broken (new)     #bi-alerts   new-functional      24h
tracking-broken (repeat)  #bi-alerts   → digest            24h
param-mismatch            #bi-alerts   new-functional      24h
duplicate                 #bi-alerts   new-functional      24h
under-counted             #bi-alerts   new-functional      24h
ui-mismatch-suppressed    #bi-triage   technical-failure   4h
rule-not-applicable       (no Slack)
robot-broken              #bi-triage   technical-failure   4h
site-broken               #bi-triage   technical-failure   4h
timeout                   #bi-triage   technical-failure   4h
unclassified              #bi-triage   technical-failure   4h + digest

rule-not-applicable is the only silent verdict — by definition the rule says "I don't apply", so there's no event miss to report.
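
As a rough illustration of how the matrix could be expressed in code, here is a minimal TypeScript sketch. The `Route` shape, the `ROUTES` constant, and `routeVerdict` are hypothetical names; the actual schema of data/slack-routing.json is not shown in this document, and throwing on an unknown verdict is an assumption. Repeat handling (the "→ digest" row) is left to the dedup store rather than the router.

```typescript
type Channel = "#bi-alerts" | "#bi-triage";
type Template = "new-functional" | "technical-failure";

// Hypothetical row shape; the real slack-routing.json schema may differ.
interface Route {
  channel: Channel;
  template: Template;
  dedupHours: number;
  alwaysInDigest?: boolean;
}

// null marks a verdict that never reaches Slack.
const ROUTES: Record<string, Route | null> = {
  "tracking-broken": { channel: "#bi-alerts", template: "new-functional", dedupHours: 24 },
  "param-mismatch": { channel: "#bi-alerts", template: "new-functional", dedupHours: 24 },
  "duplicate": { channel: "#bi-alerts", template: "new-functional", dedupHours: 24 },
  "under-counted": { channel: "#bi-alerts", template: "new-functional", dedupHours: 24 },
  "ui-mismatch-suppressed": { channel: "#bi-triage", template: "technical-failure", dedupHours: 4 },
  "rule-not-applicable": null,
  "robot-broken": { channel: "#bi-triage", template: "technical-failure", dedupHours: 4 },
  "site-broken": { channel: "#bi-triage", template: "technical-failure", dedupHours: 4 },
  "timeout": { channel: "#bi-triage", template: "technical-failure", dedupHours: 4 },
  "unclassified": { channel: "#bi-triage", template: "technical-failure", dedupHours: 4, alwaysInDigest: true },
};

// Returns the route for a verdict, or null for the silent verdict.
// Assumption: an unconfigured verdict is a configuration error and throws.
function routeVerdict(verdict: string): Route | null {
  if (!Object.prototype.hasOwnProperty.call(ROUTES, verdict)) {
    throw new Error(`No route configured for verdict "${verdict}"`);
  }
  return ROUTES[verdict];
}
```

Keeping the silent verdict as an explicit `null` entry, rather than omitting it, lets a typo in a verdict name fail loudly instead of being mistaken for "no Slack".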

Q2 · Dedup key · revised

Conditional suffix per failure kind so distinct defects don't collapse.

baseKey  = `${propertyKey}|${pageType}|${logicalEvent}|${source}|${failureKind}|${viewport}`
specific = failureKind === "param-mismatch" ? `|param=${paramName}`
         : failureKind === "duplicate"      ? `|rule=${ruleId}`
         : ""
dedupKey = sha1(baseKey + specific)

URL intentionally excluded — same site-broken issue across URLs is one incident; sample URL recorded separately.
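
The key construction above could be made runnable roughly as follows. This is a sketch only: the `Finding` interface and the `computeDedupKey` name are hypothetical, and SHA-1 is taken from Node's built-in `node:crypto` module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape; field names follow the key definition above.
interface Finding {
  propertyKey: string;
  pageType: string;
  logicalEvent: string;
  source: string;
  failureKind: string;
  viewport: string;
  paramName?: string; // only meaningful for param-mismatch
  ruleId?: string;    // only meaningful for duplicate
  sampleUrl?: string; // recorded separately, never part of the key
}

function computeDedupKey(f: Finding): string {
  const baseKey = [f.propertyKey, f.pageType, f.logicalEvent, f.source, f.failureKind, f.viewport].join("|");
  // Conditional suffix so distinct defects of the same kind stay distinct.
  const specific =
    f.failureKind === "param-mismatch" ? `|param=${f.paramName}`
    : f.failureKind === "duplicate" ? `|rule=${f.ruleId}`
    : "";
  return createHash("sha1").update(baseKey + specific).digest("hex");
}
```

With this shape, two param-mismatch findings that differ only in `paramName` hash to different keys, while two site-broken findings that differ only in `sampleUrl` hash to the same key.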

Q3 · OpenSearch dedup · BLOCKER FIXED

Atomic PUT /_create/<dedupKey> + 409-on-conflict.

Codex blocker: the original GET-then-upsert is racy under overlapping scheduled runs — two jobs both see "first detection" and both alert.

Resolution — exactly-once first-alert:

  1. _id = dedupKey (deterministic).
  2. First attempt: PUT /cncqa_bi_incidents-YYYY-MM/_create/<dedupKey> with new-incident document. Atomic create.
  3. On 409: treat as repeat (no alert), then POST /cncqa_bi_incidents-YYYY-MM/_update/<dedupKey>?retry_on_conflict=5 with a script that increments occurrenceCount and sets lastSeenAt and lastSampleUrl.
  4. On 201 (create succeeded): emit Slack alert. Only the one run that won the create posts; concurrent runs see 409.
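
The create-then-update flow in the steps above might look like the sketch below. The HTTP transport is injected so the logic can be exercised without a cluster; `HttpRequest`, `Sighting`, `recordSighting`, and the `firstSeenAt` field are hypothetical names, not the project's actual incidentStore API. The request paths and the `retry_on_conflict` parameter follow the OpenSearch document APIs.

```typescript
// Hypothetical transport: performs one OpenSearch request and reports the HTTP status.
type HttpRequest = (method: "PUT" | "POST", path: string, body: unknown) => Promise<{ status: number }>;

interface Sighting {
  dedupKey: string;
  sampleUrl: string;
  seenAt: string; // ISO timestamp
}

// Returns "new" when this call won the atomic create (caller should alert),
// or "repeat" when the incident already existed (caller stays quiet).
async function recordSighting(request: HttpRequest, index: string, s: Sighting): Promise<"new" | "repeat"> {
  const created = await request("PUT", `/${index}/_create/${s.dedupKey}`, {
    firstSeenAt: s.seenAt, // assumption: not named in the design notes
    lastSeenAt: s.seenAt,
    lastSampleUrl: s.sampleUrl,
    occurrenceCount: 1,
  });
  if (created.status === 201) return "new";
  if (created.status !== 409) {
    throw new Error(`Unexpected status ${created.status} from _create`);
  }
  // 409: another run created the document first, so only bump the counters.
  const updated = await request("POST", `/${index}/_update/${s.dedupKey}?retry_on_conflict=5`, {
    script: {
      lang: "painless",
      source:
        "ctx._source.occurrenceCount += 1; ctx._source.lastSeenAt = params.now; ctx._source.lastSampleUrl = params.url",
      params: { now: s.seenAt, url: s.sampleUrl },
    },
  });
  if (updated.status !== 200) {
    throw new Error(`Unexpected status ${updated.status} from _update`);
  }
  return "repeat";
}
```

A caller would post to Slack only when the result is "new"; any number of overlapping runs can call this with the same key and at most one of them receives "new".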

Q4 · Webhooks · env naming + retry/backoff

Existing env is SLACK_WH_URL, not SLACK_WEBHOOK_URL.

Codex caught the env name mismatch. New vars match existing convention: SLACK_WH_URL_BI_ALERTS, SLACK_WH_URL_BI_TRIAGE. Old SLACK_WH_URL is fallback.

New postToSlack helper with retry — 429 honors Retry-After; 5xx exponential backoff (max 2 retries); 4xx other than 429 logs and drops.
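
A retry helper along these lines could satisfy the rules just described. It is a sketch under assumptions: the `PostDeps` injection shape, the 1-second base delay, and applying the two-retry cap to 429 as well as 5xx are choices made here, not details confirmed by the notes. Thrown network errors are not caught in this version.

```typescript
// Hypothetical dependency bundle so the helper can be tested without network or timers.
interface PostDeps {
  fetchFn: (
    url: string,
    init: { method: string; headers: Record<string, string>; body: string },
  ) => Promise<{ status: number; headers: { get(name: string): string | null } }>;
  sleep: (ms: number) => Promise<void>;
  log: (message: string) => void;
}

const MAX_RETRIES = 2;      // from the notes (stated for 5xx; assumed here for 429 too)
const BASE_DELAY_MS = 1000; // assumption

// Resolves true when Slack accepted the message, false when it was dropped.
// Never throws on an HTTP error status, so a webhook failure cannot fail the run.
async function postToSlack(url: string, payload: unknown, deps: PostDeps): Promise<boolean> {
  for (let attempt = 0; ; attempt++) {
    const res = await deps.fetchFn(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
    if (res.status >= 200 && res.status < 300) return true;

    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= MAX_RETRIES) {
      deps.log(`Slack webhook returned ${res.status}; dropping after ${attempt + 1} attempt(s)`);
      return false;
    }
    // 429 honors Retry-After (seconds) when present; otherwise back off exponentially.
    const retryAfterSec = Number(res.headers.get("retry-after"));
    const delayMs =
      res.status === 429 && retryAfterSec > 0 ? retryAfterSec * 1000 : BASE_DELAY_MS * 2 ** attempt;
    await deps.sleep(delayMs);
  }
}
```

In production the `fetchFn` slot would presumably be filled by the runtime's `fetch` and `sleep` by a `setTimeout` wrapper; in tests both can be replaced by fakes that record calls.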

Q5 · Template content

new-functional matches §8 meeting-notes mock verbatim.

  • new-functional — site / pageType / device / URL / issueType / logicalEvent / sourceStatus (DL/GA4/Gem) / expected / actual / uiEvidence / likelyMeaning / classification chip.
  • technical-failure — site / device / URL / failureKind / errorSummary / uiEvidenceSummary / Grafana link.
  • repeated-functional-digest — table of top 20 by occurrence, with Grafana link.

Q6 · Digest cron · catch-up watermark

Watermark doc + GitLab retry + ≥36h stale warning.

GitLab schedules can miss runs with no built-in retry. New behavior:

  • Persist cncqa_bi_digest_state/_doc/last-successful-digest with runAt.
  • Window = [watermark, now], not [now-24h, now].
  • Watermark ≥36h old → log warning, still post (catches up).
  • Pipeline retry: 2 with 10-min delay.
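
The window rule in the list above can be sketched as a small pure function. `digestWindow` and `DigestWindow` are hypothetical names, and the 24-hour fallback when no watermark exists yet is an assumption for the first run; the notes do not say what happens in that case.

```typescript
const STALE_AFTER_MS = 36 * 60 * 60 * 1000;
const FIRST_RUN_LOOKBACK_MS = 24 * 60 * 60 * 1000; // assumption: used only when no watermark exists

interface DigestWindow {
  from: Date;
  to: Date;
  stale: boolean; // true when the caller should log the stale-watermark warning
}

// Builds the [watermark, now] window and flags a watermark that is 36h old or more.
function digestWindow(watermark: Date | null, now: Date): DigestWindow {
  const from = watermark ?? new Date(now.getTime() - FIRST_RUN_LOOKBACK_MS);
  const stale = now.getTime() - from.getTime() >= STALE_AFTER_MS;
  return { from, to: now, stale };
}
```

The caller would still post the digest when `stale` is true, and would presumably advance the watermark document only after the post succeeds, so a failed run is picked up by the next window.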

Q7 · Legacy summary · totals-only when EV2 on

Avoid duplicate incident reporting in Slack.

Codex flag: keeping legacy sendSlackSummary with its anomalies/aborted block alongside Phase 11 routed alerts would double-report. sendSlackSummary now consults EVENT_RULES_V2:

  • flag off → existing behavior (totals + anomalies + aborted block).
  • flag on → totals-only (pass/fail counts + Grafana link).

Q8 · Throttling · cap-exceeded becomes its own incident

No silent collapse.

Per-channel rate cap: 20 messages per run per channel. If exceeded:

  • Remaining alerts collapse into "+N more" line.
  • AND a dedicated cap-exceeded incident is emitted in #bi-triage with channel / runId / droppedCount / top-3 failureKinds by count.
  • The cap-exceeded count is also written as a metric document to OpenSearch (cncqa_bi_cap_exceeded-YYYY-MM) for Grafana trending.
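
One way to implement the cap described above is to split each channel's alerts into a sendable prefix and an overflow record. The sketch below uses hypothetical names (`Alert`, `Overflow`, `applyRateCap`) and assumes the top-3 failureKinds are counted among the dropped alerts only; the notes do not say whether the count covers dropped or all alerts.

```typescript
const CAP_PER_CHANNEL = 20;

interface Alert {
  channel: string;
  failureKind: string;
  text: string;
}

// One record per channel that went over the cap; feeds the "+N more" line,
// the cap-exceeded incident in #bi-triage, and the OpenSearch metric document.
interface Overflow {
  channel: string;
  runId: string;
  droppedCount: number;
  topFailureKinds: string[];
  moreLine: string;
}

function applyRateCap(alerts: Alert[], runId: string, cap: number = CAP_PER_CHANNEL): { toSend: Alert[]; overflow: Overflow[] } {
  const byChannel = new Map<string, Alert[]>();
  for (const a of alerts) {
    const list = byChannel.get(a.channel) ?? [];
    list.push(a);
    byChannel.set(a.channel, list);
  }

  const toSend: Alert[] = [];
  const overflow: Overflow[] = [];
  byChannel.forEach((list, channel) => {
    list.slice(0, cap).forEach((a) => toSend.push(a));
    const dropped = list.slice(cap);
    if (dropped.length === 0) return;

    // Count failure kinds among the dropped alerts (assumption, see lead-in).
    const counts = new Map<string, number>();
    for (const d of dropped) counts.set(d.failureKind, (counts.get(d.failureKind) ?? 0) + 1);
    const topFailureKinds = Array.from(counts.entries())
      .sort((x, y) => y[1] - x[1])
      .slice(0, 3)
      .map((entry) => entry[0]);

    overflow.push({ channel, runId, droppedCount: dropped.length, topFailureKinds, moreLine: `+${dropped.length} more` });
  });
  return { toSend, overflow };
}
```

Returning the overflow as data rather than posting from inside the function keeps the cap logic testable and lets the caller decide how to emit the #bi-triage incident and the metric document.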

Success criteria

Eight, including snapshot test for the §8 meeting-notes mock.

  1. A tracking-broken verdict produces a message in #bi-alerts matching the §8 meeting-notes mock exactly (snapshot test).
  2. Same verdict, repeated within 24h with same dedupKey, produces zero new Slack messages but increments occurrenceCount via PUT /_create → 409 → scripted update.
  3. Daily 09:00 CET cron produces a digest summarizing all repeats from [watermark, now], or skips silently if zero repeats.
  4. robot-broken and unclassified route to #bi-triage, never #bi-alerts.
  5. The unclassified dedup window is 4h, and unclassified incidents are always included in the digest.
  6. Webhook failure on #bi-alerts doesn't fail the run; logged at error level; retries on 429 / 5xx.
  7. Per-run per-channel rate cap of 20 enforced; cap-exceeded emits its own incident in #bi-triage + OpenSearch metric doc.
  8. data/slack-routing.json is canonical; editing it changes routing without a code change.

Files

Templates, router, store, digest, routing JSON.

new  src/lib/slackTemplates.ts         Block Kit builders for 3 templates
new  src/lib/incidentStore.ts          _create + scripted update + retry
new  src/lib/slackRouter.ts            verdict → (channel, template)
mod  src/lib/slack.ts                  router call + retry helper + EV2-aware summary
new  scripts/slack-digest.ts, incident-close.ts
new  data/slack-routing.json           BI-editable routing matrix
mod  package.json                      3 scripts
new  tests/slack-templates.test.js (snapshot), slack-router.test.js, incident-store.test.js
new  pipelines/pipeline-digest-daily.yml
mod  .env_example                      2 new webhook vars
new  docs/slack-routing.md             BI-facing routing reference

Review

One blocker + seven concerns, all integrated.

BLOCKER · racy GET-then-upsert

CODEX · INTEGRATED

Two concurrent runs would both see "first detection" and both alert. Replaced with atomic PUT /_create/<dedupKey> + 409-conflict handling + scripted retry_on_conflict update.

ui-mismatch-suppressed wrongly silent

CODEX · INTEGRATED

Meeting notes treat UI/event mismatch as alert-worthy. Routed to #bi-triage so an analyst can decide whether the suppression heuristic was wrong.

unclassified at 1h was too loud

CODEX · INTEGRATED

During rule rollout an unclassified burst would page repeatedly. Moved to 4h + always in digest.

Dedup key missed distinct defects

CODEX · INTEGRATED

Two different param-mismatch defects on the same logical event would collapse. Added conditional paramName / ruleId suffix.

Env naming + retry

CODEX · INTEGRATED

Existing env is SLACK_WH_URL. New vars renamed. Added 429-Retry-After + 5xx-backoff helper.

Digest had no catch-up or retry

CODEX · INTEGRATED

Added watermark doc, GitLab retry: 2, ≥36h stale warning.

Legacy summary would double-report

CODEX · INTEGRATED

Switched legacy sendSlackSummary to totals-only when EVENT_RULES_V2=true.

Silent rate-cap collapse hid Phase 10 bursts

CODEX · INTEGRATED

Cap-exceeded now emits its own incident + OpenSearch metric doc for Grafana trending.
