Phase 08 · v1.1 · Shipped 2026-05-26

Site & page-type inventory.

Crawl-assisted catalog of every CNC property the BI tool should monitor — the 12 existing per-domain configs plus subdomains currently outside coverage (tv.blesk.cz, isport.blesk.cz, prozeny.blesk.cz, video properties). For each (property × pageType) cell: example URL, viewport coverage, and an advisory tracking-presence probe.

Gathered: 2026-05-26 · Mode: Discovery + artifact · Depends on: none (first in v1.1) · Parallel with: Phase 12

Phase boundary

What ships, what doesn't.

Two artifacts: a machine-readable data/site-inventory.json and a BI-facing docs/inventory/2026-05-26.html. The crawler is read-only against production — no clicks, no consent, no login. Tracking-presence probes are advisory only; Phase 10's runtime evidence is the authoritative signal.

Out of scope: writing event rules (Phase 9), adding per-domain configs for newly-discovered subdomains (separate PRs after Phase 9), any login or premium probe (deferred to v1.2), any runtime change to the orchestrator or robots.

Decisions

Seven design questions, each with a recommended answer.

Q1 · Discovery method

Hybrid seed + anchor harvest + manual classification.

Seed list = existing 12 hosts from data/config-per-domain/*.json + a hand-curated subdomain seed (tv.blesk.cz, isport.blesk.cz, prozeny.blesk.cz, video.auto.cz, …).

Anchor harvest: Playwright opens each seed root, collects the hosts of every a[href] link, keeps only hosts under same-organization domains, and emits them as candidate subdomains.

Manual classification — the BI lead marks each candidate in-scope / out-of-scope / unsure before it lands in the inventory. Discovered subdomains never auto-add.
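The filtering step of the anchor harvest can be sketched as a pure function, kept separate from Playwright so it can be unit-tested. This is a minimal sketch under stated assumptions: the function name candidateHosts and the ORG_DOMAINS list are hypothetical and not taken from the repo, and matching on a domain suffix is one possible reading of "same-org".

```typescript
// Hypothetical list of organization-owned registrable domains (an assumption for illustration).
const ORG_DOMAINS = ["blesk.cz", "auto.cz"];

// Reduce harvested hrefs to candidate hosts that sit under an org domain
// and are not already covered by an existing per-domain config.
function candidateHosts(
  hrefs: string[],
  baseUrl: string,
  knownHosts: Set<string>,
  orgDomains: string[] = ORG_DOMAINS,
): string[] {
  const found = new Set<string>();
  for (const href of hrefs) {
    let host: string;
    try {
      // Resolve relative links against the seed root.
      const url = new URL(href, baseUrl);
      if (url.protocol !== "http:" && url.protocol !== "https:") continue;
      host = url.hostname.toLowerCase();
    } catch {
      continue; // skip malformed hrefs
    }
    const sameOrg = orgDomains.some((d) => host === d || host.endsWith("." + d));
    if (sameOrg && !knownHosts.has(host)) found.add(host);
  }
  return Array.from(found).sort();
}
```

The output is only a candidate list; each host still goes through manual classification before it reaches the inventory.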

Q2 · Subdomain treatment

Separate sites with their own per-domain config.

tv.blesk.cz and blesk.cz share zero selectors and have different tracking implementations (TV likely has additional video lifecycle events). The alternative, a variant-mode approach that treats each subdomain as a mode of its parent site, would create N×M conditionals inside every robot.

The cost is one extra config per subdomain — the same cost the existing 12 already pay.

Phase 8 produces inventory property keys only. Per-domain config PRs are separate work, post-Phase 9.

Q3 · Page-type taxonomy · revised

13-column taxonomy (was 11; Codex added two).

homepage · category · category_paginated · article_standard ·
article_multipart · article_premium_locked · article_premium_unlocked ·
article_with_gallery · article_with_vplayer · gallery_standalone ·
video_standalone · paywall · login

category_paginated and article_multipart are explicit because Phase 9 must derive rules for the existing page_next/page_prev and articlePart_* flows (see data/event-mapping.json:3–4,9–10 and per-domain nextpage/part URLs).

Each cell is one of: null (doesn't exist), {urls,viewports,probe} (populated), {status:"unknown",reason} (couldn't probe), {status:"deferred",reason} (intentionally not probed).
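The taxonomy and the four cell shapes map naturally onto a TypeScript union. The sketch below is illustrative only: the field types for urls and viewports (string arrays) and the helper name describeCell are assumptions, since the plan names the fields but not their types.

```typescript
// The 13 page types listed above.
const PAGE_TYPES = [
  "homepage", "category", "category_paginated", "article_standard",
  "article_multipart", "article_premium_locked", "article_premium_unlocked",
  "article_with_gallery", "article_with_vplayer", "gallery_standalone",
  "video_standalone", "paywall", "login",
] as const;
type PageType = (typeof PAGE_TYPES)[number];

type Probe = { hasDataLayer: boolean; hasGA4: boolean; hasGemius: boolean };

// Field types below are assumed; the plan only names urls, viewports, and probe.
type PopulatedCell = { urls: string[]; viewports: string[]; probe: Probe };
type UnknownCell = { status: "unknown"; reason: string };   // couldn't probe
type DeferredCell = { status: "deferred"; reason: string }; // intentionally not probed
type Cell = null | PopulatedCell | UnknownCell | DeferredCell;

type PageTypes = Partial<Record<PageType, Cell>>;

// Collapse a cell to a label so reports can tell "doesn't exist" from "not probed".
function describeCell(cell: Cell): "absent" | "populated" | "unknown" | "deferred" {
  if (cell === null) return "absent";
  if ("status" in cell) return cell.status;
  return "populated";
}
```

Keeping null distinct from the two status shapes is what lets a reader of the inventory see whether a page type is missing on the property or simply was not checked.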

Q4 · Tracking-presence probe

Five-second networkidle + three boolean checks.

  • window.dataLayer exists and has ≥1 entry → hasDataLayer: true
  • network request to *google-analytics.com* with collect in path → hasGA4: true
  • network request to a Gemius beacon host → hasGemius: true

These values are advisory, not validation — Phase 10's UI evidence is authoritative at runtime.
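The three checks reduce to a small function over data the crawler has already collected. This is a sketch, not the crawler's actual code: the names ProbeInput and evaluateProbe are hypothetical, and the Gemius host suffix hit.gemius.pl is an assumption that should be confirmed against real traffic from the monitored sites.

```typescript
interface ProbeInput {
  dataLayerLength: number | null; // null when window.dataLayer is absent or not an array
  requestUrls: string[];          // request URLs observed during the networkidle window
}

interface ProbeResult {
  hasDataLayer: boolean;
  hasGA4: boolean;
  hasGemius: boolean;
}

// Assumed Gemius beacon host suffix; verify before relying on it.
const GEMIUS_HOST_SUFFIX = "hit.gemius.pl";

function hostAndPath(raw: string): { host: string; path: string } | null {
  try {
    const u = new URL(raw);
    return { host: u.hostname.toLowerCase(), path: u.pathname };
  } catch {
    return null; // ignore anything that is not an absolute URL
  }
}

function evaluateProbe(input: ProbeInput): ProbeResult {
  const parsed = input.requestUrls
    .map((raw) => hostAndPath(raw))
    .filter((p): p is { host: string; path: string } => p !== null);
  return {
    hasDataLayer: input.dataLayerLength !== null && input.dataLayerLength >= 1,
    hasGA4: parsed.some(
      (p) => p.host.includes("google-analytics.com") && p.path.includes("collect"),
    ),
    hasGemius: parsed.some(
      (p) => p.host === GEMIUS_HOST_SUFFIX || p.host.endsWith("." + GEMIUS_HOST_SUFFIX),
    ),
  };
}
```

In a Playwright run, requestUrls could be gathered with a page.on("request", ...) listener registered before navigation, and dataLayerLength read with page.evaluate once the load state settles.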

Q5 · Output format

JSON in repo, HTML in docs/inventory/<date>.html.

{
  "generatedAt": "2026-05-26T12:00:00Z",
  "properties": [
    {
      "key": "blesk",
      "name": "Blesk",
      "host": "www.blesk.cz",
      "scope": "in-scope",
      "parentKey": null,
      "pageTypes": { /* … */ }
    }
  ]
}
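A schema check for this shape might look like the sketch below. It is illustrative only: the function name validateInventory is hypothetical, the allowed scope values are taken from the manual-classification step in Q1, and treating parentKey as a reference to another property's key is an assumption about its meaning.

```typescript
type Scope = "in-scope" | "out-of-scope" | "unsure";

interface PropertyEntry {
  key: string;
  name: string;
  host: string;
  scope: Scope;
  parentKey: string | null;
  pageTypes: Record<string, unknown>;
}

interface SiteInventory {
  generatedAt: string;
  properties: PropertyEntry[];
}

const SCOPES: readonly string[] = ["in-scope", "out-of-scope", "unsure"];

// Returns a list of human-readable problems; an empty list means the document passed.
function validateInventory(doc: unknown): string[] {
  if (typeof doc !== "object" || doc === null) return ["inventory must be an object"];
  const errors: string[] = [];
  const inv = doc as Partial<SiteInventory>;
  if (typeof inv.generatedAt !== "string" || Number.isNaN(Date.parse(inv.generatedAt))) {
    errors.push("generatedAt must be a parseable timestamp");
  }
  if (!Array.isArray(inv.properties)) return errors.concat("properties must be an array");

  const seen = new Set<string>();
  const parents: Array<[string, string]> = [];
  for (const p of inv.properties) {
    if (typeof p !== "object" || p === null) { errors.push("property must be an object"); continue; }
    if (typeof p.key !== "string" || p.key === "") { errors.push("key must be a non-empty string"); continue; }
    if (seen.has(p.key)) errors.push(`${p.key}: duplicate key`);
    seen.add(p.key);
    if (typeof p.host !== "string" || p.host === "") errors.push(`${p.key}: host missing`);
    if (SCOPES.indexOf(p.scope) === -1) errors.push(`${p.key}: invalid scope`);
    if (typeof p.pageTypes !== "object" || p.pageTypes === null) errors.push(`${p.key}: pageTypes must be an object`);
    if (p.parentKey !== null && p.parentKey !== undefined) parents.push([p.key, p.parentKey]);
  }
  // Assumed semantics: a non-null parentKey must name another property in the file.
  for (const [key, parent] of parents) {
    if (!seen.has(parent)) errors.push(`${key}: unknown parentKey ${parent}`);
  }
  return errors;
}
```

Returning every problem at once, rather than throwing on the first, keeps a generated file reviewable in a single pass.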
Q6 · Premium URL coverage

Defer to v1.2 — unauthenticated probe only.

Premium-unlocked articles need a logged-in premium session to render. Mixing login into discovery would couple two failure modes, so v1.1 instead marks article_premium_unlocked as {status:"unknown",reason:"requires-premium-session"}.

Q7 · Output destination & deploy

JSON versioned, HTML deployed via staged Wrangler snippet.

JSON lands in the repo. HTML deploys to cnc-bi-events.pages.dev/inventory/<date>. The README's staged-deploy snippet now creates the inventory/ subdirectory and copies the HTML in.

Success criteria

Evidence-based, not quota-based.

Codex flagged that the original "≥18 properties / ≥4 page types" criteria were quotas without grounding. Replaced with classification completeness.

  1. Every candidate the operator saw is classified. All 12 existing properties plus every discovered subdomain candidate are either populated or explicitly marked out-of-scope/deferred with a reason. No silent drops.
  2. Narrow properties allowed. Video-only subdomains may populate fewer than 4 cells. Empty cells are null (doesn't exist) or {status:"deferred",reason} (not probed).
  3. Every URL returns HTTP 200 in the probe run. Non-200 URLs are excluded with a logged warning.
  4. docs/inventory/<date>.html deploys to cnc-bi-events.pages.dev/inventory/<date> and matches the existing visual system.
  5. README staged-deploy snippet creates inventory/ subdir before copying (mkdir -p "$STAGE/inventory").
  6. tsconfig.json includes scripts/ so npm run check typechecks the new crawler. Today tsconfig.json:14 excludes it.
  7. npm run check + npm run test:logic pass with no regression.
  8. Probe semantics are advisory only. A hasGA4:false from the probe is documented as "unauthenticated, no consent" — Phase 9/10 rules don't trust probe values.
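
Criterion 1 ("no silent drops") is mechanical enough to check in code. The following is a minimal sketch, assuming the crawler keeps a list of every candidate host it surfaced and a map of classification outcomes; the names Outcome and findSilentDrops are hypothetical.

```typescript
type Outcome =
  | { kind: "populated" }
  | { kind: "out-of-scope"; reason: string }
  | { kind: "deferred"; reason: string };

// Returns every candidate that was either never classified
// or was excluded without a stated reason.
function findSilentDrops(candidates: string[], outcomes: Map<string, Outcome>): string[] {
  return candidates.filter((host) => {
    const outcome = outcomes.get(host);
    if (outcome === undefined) return true;          // never classified
    if (outcome.kind === "populated") return false;  // has inventory cells
    return outcome.reason.trim() === "";             // excluded, but no reason given
  });
}
```

An empty result would be the evidence for criterion 1; any host it returns needs a decision before the inventory ships.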
Files

What lands in the repo.

new  scripts/crawl-inventory.ts
new  data/site-inventory.json (generated; reviewable in PR)
new  docs/inventory/2026-05-26.html
new  tests/site-inventory.test.js (schema validation only)
mod  README.md (staged-deploy snippet adds inventory/ subdir)
mod  tsconfig.json (include scripts/)
mod  .planning/ROADMAP.md, STATE.md

Open questions

Resolve before execution.

  • Q6 — confirm unauthenticated only for v1.1.
  • Subdomain seed source must be a required input, not optional (Codex). Pick one of DNS export, sitemap, or hand-curated list before the crawl; anchor harvest alone will miss unlinked subdomains.
  • Should the probe respect robots.txt? (Recommend yes; override flag for exceptions.)
  • Crawl politeness: rate limit between probes? (Recommend 1 req/sec/host.)
  • Sketch per-property selector packs for Phase 10's UI evidence as a crawl side-output, or defer entirely?
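
If the 1 req/sec/host recommendation is adopted, the pacing logic could be as small as the sketch below. It is only an illustration of the recommendation, not a decided design: the class name HostRateLimiter is hypothetical, and the caller is assumed to pass in the current time and to sleep for the returned delay before issuing the request.

```typescript
class HostRateLimiter {
  // Earliest time (ms) at which each host may be contacted again.
  private nextSlot = new Map<string, number>();

  constructor(private readonly minIntervalMs: number = 1000) {}

  // Reserves the next slot for the host and returns how long the caller should wait (ms).
  reserve(host: string, nowMs: number): number {
    const slot = Math.max(nowMs, this.nextSlot.get(host) ?? 0);
    this.nextSlot.set(host, slot + this.minIntervalMs);
    return slot - nowMs;
  }
}
```

Tracking slots per host means probes against different properties never wait on each other, while repeated probes of one host are spaced at least one interval apart.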
Review

Codex critique & resolution.

Quotas → evidence-based criteria

CODEX · INTEGRATED

Original "≥18 properties" and "≥4 page types per property" were quotas without grounding. Replaced with classification completeness; narrow properties allowed; cell shape extended with unknown/deferred to distinguish "doesn't exist" from "not probed".

Missing existing page types

CODEX · INTEGRATED

The 11-column taxonomy missed category_paginated and article_multipart — both are current first-class flows. Without them Phase 9 can't derive rules for page_next/page_prev/articlePart_*. Taxonomy widened to 13 columns.

Verification & deploy mechanics

CODEX · INTEGRATED

tsconfig.json:14 excludes scripts/, so npm run check would silently skip the new crawler. README staged-deploy is flat-file only; inventory/<date>.html needs mkdir -p first. Both are surfaced as explicit success criteria.
