Scrubbing Rules
How we scrub the public activity feed · Version 2.1 · Effective April 22, 2026
The 10,000-client forcing function
This policy is designed to be correct on the day WatchLocal crosses 10,000 paying clients. Not the day we cross 3. Not the day we cross 50. The day we cross 10,000.
That constraint shapes every decision below. An earlier draft of this policy hardcoded a 51-entry trade list, a 25-entry action list, and a below_threshold: true flag that was meant to flip off later. All three would have become maintenance treadmills — at 10,000 clients we'd be amending the trade list monthly, the action list weekly, and either shipping threshold-flag code into production or scrambling to disable it. We do the work once, at the scale we intend to operate at.
The current design has three properties that survive 10,000 clients:
- Closed axes are shaped by the work, not by the world. The 8 work-type categories don't grow; they describe what marketing work is. Surface (platform) and trade (client industry) are open strings because the world supplies those and we don't get to decide when a new platform launches a feature or when a new client signs up as "pool & spa service."
- The client owns their own label. Trade names live in the client's own brand profile. When a new client joins as "pool & spa service," no edit to this policy is required. No schema change. The client owns it.
- K-anonymity is continuous, not thresholded. No row appears in the feed unless at least k distinct clients contribute to that (category, surface, trade, city) bucket right now, in the live window. There is no flag to flip off. When the cohort grows, more rows appear. When it shrinks, rows disappear. The rule is the same at N=3 and at N=10,000.
If a future amendment adds any field that re-introduces a curated list, a threshold flag, or a client-specific carve-out, that amendment fails the discipline check and should be rejected.
1. Purpose
Define exactly which fields from raw agent logs may appear in the public agent-activity feed, which must be stripped, and which must be generalized. This is also the written policy WatchLocal points at when a client or prospect asks "how do you protect my business's data when you publish agent activity?"
Operating principle: the feed must prove the machine works without identifying any single client. If a reader can combine what we publish with any public data source to deanonymize a specific business, we have failed.
2. Scope
In scope: every record in the public JSON feed served from our infrastructure. The feed has two independent blocks: the per-client tasks array (governed by sections 3–8) and the platform by_category counts (governed by section 9).
Out of scope: internal agent logs, internal dashboards, client-facing monthly visibility reports. Those can contain client names and details — they are not public.
Never in feed: any record whose source is a prospecting action (outreach SMS or email to non-clients). The feed only surfaces actions taken on behalf of paying clients. Prospecting activity is a different category and belongs elsewhere, or nowhere public.
3. The five published fields — and nothing else
Every row in the public feed has exactly these five fields. The aggregator produces rows that look like this and only like this.
| Field | Type | Example |
|---|---|---|
| category | closed enum (8 values, see section 4) | listing_optimization |
| surface | free string, ≤ 80 chars | Google Business Profile |
| trade | free string, ≤ 60 chars | HVAC contractor |
| city_state | free string, ≤ 60 chars (format: City, ST) | Celina, TX |
| ts | ISO-8601 UTC, hour-rounded | 2026-04-22T14:00:00Z |
Plus two top-level metadata fields in the feed envelope (not per-row): generated_at (hour-rounded) and total/by-category counts.
Anything not listed above is stripped. The aggregator does not pass through an agent field, a business field, a phone field, or any free-text body. If a field isn't in this table, it doesn't appear.
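A minimal sketch of the whitelist projection described above, assuming raw records arrive as plain dictionaries. The names `PUBLISHED_FIELDS` and `project_row` are hypothetical, and the extra raw keys in the example (`agent`, `phone`, `body`) are illustrative only. Per-field validation and the k-anonymity gate are separate steps that would run on the projected row.

```python
from typing import Optional

# Hypothetical sketch: only these five keys may ever reach the public feed.
PUBLISHED_FIELDS = ("category", "surface", "trade", "city_state", "ts")


def project_row(raw: dict) -> Optional[dict]:
    """Keep only the five published fields; return None if any is missing.

    The projection is a whitelist, so unknown keys are discarded by default
    rather than needing to be named in a blocklist.
    """
    if any(field not in raw for field in PUBLISHED_FIELDS):
        return None  # incomplete record: the caller drops it and logs a violation
    return {field: raw[field] for field in PUBLISHED_FIELDS}


raw_record = {
    "category": "listing_optimization",
    "surface": "Google Business Profile",
    "trade": "HVAC contractor",
    "city_state": "Celina, TX",
    "ts": "2026-04-22T14:00:00Z",
    "agent": "illustrative-agent-id",  # stripped
    "phone": "illustrative-phone",     # stripped
    "body": "illustrative free text",  # stripped
}
row = project_row(raw_record)
```

Building the output from a whitelist means a new field added to the raw logs stays out of the feed until someone deliberately adds it to the table.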
4. The 8 per-client categories (closed enum)
This is the one hardcoded axis in the per-client feed. It describes the kinds of work the agents do. It's closed because the shape of SEO and local-marketing work is closed — new platforms come and go, new client trades come and go, but the fundamental operations fall into these eight buckets.
If a raw action doesn't fit any of the 8 categories, the record is dropped and logged to the violations log (section 10). We do not invent a 9th category on the fly. Amendments to this list require written approval, a policy version bump, and a discipline check — and are reviewed skeptically, because the whole point of 8 is that the list doesn't grow.
5. surface and trade — free strings, client-owned
5a. surface
The platform or channel on which the work was performed. Examples the aggregator emits today: Google Business Profile, Yelp, Nextdoor, Facebook Business, Instagram, Bing Places, Apple Maps, BBB, Angi, HomeAdvisor, Thumbtack, client website, client sitemap, schema.org.
No enum. When a platform ships a new feature and we start using it, the string just appears in the feed. When a new supported platform goes live, no schema change is needed. The aggregator enforces a maximum length of 80 characters as a sanity cap — beyond that, it's almost certainly a copy-paste of a description and gets dropped to violations.
5b. trade
The client's industry label, as the client wants it displayed. The client owns this value. It lives in the client's brand profile:
When a client is onboarded, display_trade is set once. The aggregator reads it and emits it verbatim. If the client later says "show us as 'heating & cooling' instead," they edit their brand profile — no code change, no policy amendment, no coordination with shared infrastructure.
Sanity caps: maximum length 60 characters. If the string is empty or malformed, the record is dropped to violations.
5c. Why free strings work at 10,000 clients
A single closed enum for trades is a maintenance treadmill that breaks under load: every new client onboarded in a category we haven't seen before would require a policy edit, a schema edit, and a deploy. Free strings plus client-owned labels move the responsibility to where it belongs (the client) and remove the central bottleneck.
The risk of free strings is that a client could type something revealing ("Joe's One-Truck Plumbing" instead of "Plumber"). Three controls mitigate this:
- display_trade is labeled on the onboarding form: "Industry or trade only. Do NOT include your business name, your city, or your phone number."
- The aggregator runs a regex check before emitting: reject any trade value containing digits, @, http, or more than 3 consecutive capitalized words, all proxies for "this is a business name, not a trade."
- The continuous k-anonymity rule (section 6) guarantees no row appears unless at least k distinct clients share the exact (category, surface, trade, city) bucket. A one-off bad label can't leak through because it won't cluster.
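The length cap and regex control can be sketched as follows. This is one possible reading of the rule, not the aggregator's actual patterns: `trade_violation`, `_FORBIDDEN`, and `_CAPITALIZED_RUN` are hypothetical names, "more than 3 consecutive capitalized words" is interpreted as four or more whitespace-separated words that each start with an uppercase letter, and hyphenated words count as one word. The returned strings reuse the violation codes from section 10.

```python
import re
from typing import Optional

MAX_TRADE_LEN = 60  # sanity cap from section 5b

# Digits, "@", or "http" anywhere in the value (assumed case-insensitive).
_FORBIDDEN = re.compile(r"\d|@|http", re.IGNORECASE)

# Four or more capitalized words in a row, i.e. "more than 3 consecutive".
_CAPITALIZED_RUN = re.compile(r"(?:\b[A-Z][\w'&-]*\s+){3}[A-Z][\w'&-]*")


def trade_violation(trade: str) -> Optional[str]:
    """Return a violation code for a bad trade label, or None if it passes."""
    value = trade.strip()
    if not value:
        return "trade_empty"
    if len(value) > MAX_TRADE_LEN:
        return "trade_too_long"
    if _FORBIDDEN.search(value) or _CAPITALIZED_RUN.search(value):
        return "trade_regex_fail"
    return None
```

A label that passes this check is still subject to the k-anonymity gate, so the regex only needs to catch the obvious cases.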
6. Continuous k-anonymity
Rule: before emitting any row, the aggregator groups the last 30 days of scrubbed activity records by the tuple (category, surface, trade, city_state). Only buckets with at least k = 3 distinct client slugs contributing are emitted. The aggregator publishes the latest timestamped record from each qualifying bucket.
Generalization is not automatic. Earlier drafts quietly bumped Pediatric dentist to Dentist to force a match. The current design does not — if a bucket doesn't meet k, the row is simply not published. Generalization is the client's own choice via display_trade (a solo pediatric-dentist client can set display_trade to "Dentist" if they want broader coverage; that's their call, not the aggregator's).
The k parameter. k = 3 at launch. If an analysis at N=500 or N=5,000 clients shows k=3 is too loose (specific buckets deanonymize under re-identification attacks), we raise k globally. We do not add per-category carve-outs.
City widening is NOT performed by the aggregator. Earlier drafts had a fallback that dropped city and published state-only rows. This was removed. If (category, surface, trade, city_state) doesn't meet k, nothing is emitted for that bucket. State-level aggregation would create deanonymization risk when combined with external data (e.g., the only Texas HVAC contractor doing schema markup is one identifiable client). Simpler and safer: don't publish unless the full tuple meets k.
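The rule in this section can be sketched as a single grouping pass. This is a minimal illustration under stated assumptions: `emit_rows` and the internal `client_slug` key are hypothetical names, and each record's `ts` is assumed to be a timezone-aware UTC datetime that has already been rounded to the hour.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

K = 3                        # launch value of k from this section
WINDOW = timedelta(days=30)  # live window from this section


def emit_rows(records, now=None, k=K):
    """Return one public row per bucket with at least k distinct clients.

    Each record is a dict holding the five published fields plus an
    internal 'client_slug' key (hypothetical name), which is used only
    for counting and never copied into the output.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - WINDOW
    slugs_by_bucket = defaultdict(set)
    latest_ts = {}
    for rec in records:
        if rec["ts"] < cutoff:
            continue  # outside the live window, so it cannot support a bucket
        bucket = (rec["category"], rec["surface"], rec["trade"], rec["city_state"])
        slugs_by_bucket[bucket].add(rec["client_slug"])
        if bucket not in latest_ts or rec["ts"] > latest_ts[bucket]:
            latest_ts[bucket] = rec["ts"]

    rows = []
    for bucket, slugs in slugs_by_bucket.items():
        if len(slugs) < k:
            continue  # no widening and no generalization: the bucket is skipped
        category, surface, trade, city_state = bucket
        rows.append({
            "category": category,
            "surface": surface,
            "trade": trade,
            "city_state": city_state,
            "ts": latest_ts[bucket].strftime("%Y-%m-%dT%H:%M:%SZ"),
        })
    return rows
```

Because distinct clients are counted with a set, one client contributing many records to a bucket still counts once, and a bucket that falls below k simply produces no row.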
7. Timestamp handling
- Round to the hour: 2026-04-22T14:23:07Z becomes 2026-04-22T14:00:00Z. Drop minutes and seconds before the record enters the feed.
- Do not expose more than 30 rolling days of history in the public feed, even if older data exists internally.
- generated_at at the feed envelope level is also hour-rounded.
- Exact second and minute timestamps are a re-identification surface (cross-referenced with a client's own social-post timestamps); the hour-round breaks that.
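A short sketch of the hour rounding, assuming raw timestamps are ISO-8601 strings that carry a UTC designator or offset. The function name `round_to_hour` is hypothetical. Following the example above, "round" is read as truncation toward the start of the hour.

```python
from datetime import datetime, timezone


def round_to_hour(raw_ts: str) -> str:
    """Truncate an ISO-8601 timestamp to the start of its UTC hour."""
    # Replace a trailing "Z" so older Python versions can parse the string.
    parsed = datetime.fromisoformat(raw_ts.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry a UTC designator or offset")
    floored = parsed.astimezone(timezone.utc).replace(
        minute=0, second=0, microsecond=0
    )
    return floored.strftime("%Y-%m-%dT%H:%M:%SZ")


print(round_to_hour("2026-04-22T14:23:07Z"))  # prints 2026-04-22T14:00:00Z
```

A string that fails to parse raises `ValueError`, which a caller could map to the ts_unparseable violation code.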
8. Twelve concrete before/after examples
Raw = what lives in internal logs. Scrubbed = what the aggregator emits into the public JSON feed (assuming k-anonymity passes; the row is dropped entirely if it doesn't).
- citation_building: NAP listing created for a plumber
- review_management: review request SMS to an existing customer (not prospecting)
- listing_optimization: GBP service-area update
- content_publishing: Nextdoor post with free-text body stripped
- schema_markup: LocalBusiness markup added to client site
- technical_seo: sitemap refresh
- reputation_response: GBP review response (bodies stripped)
- seasonal_content: weather-driven HVAC post
- out_of_scope_prospecting
- listing_optimization: Apple Maps update where display_trade is unusual
  … display_trade to "Landscape contractor" to widen their cohort.
- schema_markup: FAQ schema
- review_management: review request email batch
9. The platform activity block (counts only)
The tasks array covers individual marketing actions taken on behalf of paying clients, scrubbed to 5 fields and gated by continuous k-anonymity. That design is correct at 10,000 clients and at 3 clients, but at 3 clients zero rows qualify, so the per-client feed correctly shows empty. Meanwhile the system is still doing real work every day (audits, market research, technical scans, listing surveys) across the entire public web. Section 9 exists so that work can be published honestly, in a way that carries zero re-identification risk and therefore needs no anonymity math.
9a. Scope
In scope for the platform block: aggregate counts of work performed by our agents that is not tied to any specific paying client — public-web audits, market-ranking reports, directory surveys, technical-health scans, content scheduled for public distribution, platform-infrastructure and cost events.
Out of scope for the platform block (same as section 2):
- Any outreach event to a specific recipient (SMS or email to a prospect). The individual send is out-of-scope; the aggregate count of outreach sends is also out-of-scope because that number signals commercial intent and is competitively sensitive.
- Any per-business row, even without a name. If the shape of a row would allow cross-reference with public data to identify who was scraped or audited, it doesn't go in.
- Anything that would require a per-client display_trade or city_state lookup to render; those belong in the per-client tasks array, not here.
9b. What gets published — counts only
The platform block in the feed is a flat structure:
No timestamps per action, no surfaces, no cities, no trades, no business identifiers, no row array. This is the sufficient-and-necessary condition for no re-identification risk. A bare integer carries no identification signal regardless of scale.
9c. The 7 platform categories (closed enum)
Parallel to the 8-value per-client enum but independent from it. Shaped by the work the fleet does at platform scale, not by trade or platform.
A new category requires written approval, a policy version bump, and a discipline check. Same discipline as section 4: the shape does not grow with the world.
9d. What sources feed these counts
The aggregator counts records produced by our agents over the last rolling 30 days, broken down by category:
| Category | Source (abstract) |
|---|---|
| business_audits | Multi-dimension audit run outputs (one record per completed audit) |
| market_intelligence | Market-gap and competitive-ranking analyses (one record per completed analysis) |
| content_publishing | Scheduled and published post records (one record per post) |
| technical_scans | Security-alert and technical-health scan outputs (one record per scan run) |
| ranking_reports | Local-ranking report outputs (one record per report) |
| listing_scrapes | Directory-survey run outputs (survey runs only, NOT any individual outreach SMS or email that may follow, which remain out of scope per section 2) |
| ops_events | Platform operations and cost event records |
If a source category produces zero records in the window, its count is 0 and the feed reflects that honestly. No source list, no client list, and no thresholds are hardcoded — as coverage expands, counts grow automatically.
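The counting step can be sketched as below, assuming the aggregator has already reduced each in-window source record to its category label. The names `PLATFORM_CATEGORIES` and `platform_counts` are hypothetical; the seven labels come from the table above. Labels outside the closed list are ignored here, whereas the real aggregator would presumably also log them under platform_out_of_scope.

```python
from collections import Counter

# The seven platform categories from the table in this section.
PLATFORM_CATEGORIES = (
    "business_audits",
    "market_intelligence",
    "content_publishing",
    "technical_scans",
    "ranking_reports",
    "listing_scrapes",
    "ops_events",
)


def platform_counts(labels):
    """Count in-window records per platform category.

    Every category appears in the result, with 0 when nothing was recorded.
    Only bare integers leave this function: no timestamps, no identifiers.
    """
    seen = Counter(label for label in labels if label in PLATFORM_CATEGORIES)
    return {category: seen[category] for category in PLATFORM_CATEGORIES}


counts = platform_counts(
    ["business_audits", "business_audits", "technical_scans", "unknown_source"]
)
```

Emitting an explicit 0 for an empty category keeps the block's shape constant, so consumers never have to distinguish "missing" from "zero".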
9e. Discipline check
The platform block does not introduce a curated list, a threshold flag, or a client-specific carve-out:
- Closed enum of 7 platform categories: shaped by work types, same rationale as the 8-value per-client enum. Does not grow with scale.
- No threshold flag: there is no below-N fallback, no per-category toggle. If a source is empty, its count is 0 and the feed reflects that.
- No client carve-out: this section explicitly publishes nothing per-client. The only client-related block in the feed remains the tasks array, governed by sections 3–8 unchanged.
10. Violation handling
The aggregator must never silently pass through anything that fails a rule above. On failure:
- Drop the record from the feed.
- Write a line to an internal violations log containing the raw record plus the rule that was violated. Violation codes:
  - category_not_in_enum: raw action didn't map to any of the 8 categories
  - surface_empty or surface_too_long
  - trade_empty, trade_too_long, or trade_regex_fail (looks like a business name, not a trade)
  - city_state_missing or city_state_malformed
  - ts_unparseable
  - k_anonymity_fail: bucket had fewer than k distinct clients (not an error, just a filter; logged at debug level, not violation level)
  - out_of_scope_prospecting: record was a prospecting action
  - platform_out_of_scope: source was not in the section-9d whitelist
  - platform_outreach_leak: source appeared to be an outreach send (recipient phone, email, or SMS body present)
- Never auto-expand the category enums — a new category requires a human-approved amendment to this document.
The violation log is reviewed at the end of each internal review cycle. If the non-debug violation rate exceeds 1% of records for two consecutive cycles, the aggregator auto-disables publishing and alerts the team.
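The auto-disable condition can be sketched as a check over per-cycle tallies. This is an illustration only: `should_disable_publishing` is a hypothetical name, each cycle is assumed to be summarized as a pair of (non-debug violations, total records), and "exceeds 1%" is read as strictly greater than 1%.

```python
VIOLATION_RATE_LIMIT = 0.01  # 1% of records, from the rule above


def should_disable_publishing(cycles):
    """Return True when the two most recent cycles both exceed the limit.

    cycles: list of (non_debug_violations, total_records) tuples, oldest
    first. k_anonymity_fail entries are debug-level and are assumed to be
    excluded from the violation tally before this function is called.
    """
    if len(cycles) < 2:
        return False  # not enough history for "two consecutive cycles"
    return all(
        total > 0 and violations / total > VIOLATION_RATE_LIMIT
        for violations, total in cycles[-2:]
    )
```

For example, tallies of `[(2, 100), (3, 100)]` would trip the check, while `[(2, 100), (1, 100)]` would not, because the second cycle sits at exactly 1%.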
11. Amendment process
Changes to this document require:
- Written founder approval (commit reviewed in a dedicated PR).
- A change-log entry explaining the why.
- A corresponding schema update if field names change.
- A discipline check: does the change introduce a curated list, a threshold flag, or a client-specific carve-out? If yes, the amendment is rejected by default. Strong cause required.
The aggregator reads the current version of this document's policy constants at runtime via the feed's JSON schema, which is the enforceable artifact, and logs which schema version it was built against.
12. What changed from the earlier draft
| Concern | Earlier draft | Current (v2.1) |
|---|---|---|
| Work-type axis | action — 25-entry closed enum | category — 8-entry closed enum |
| Platform axis | platform — implicit closed list | surface — free string, ≤ 80 chars |
| Industry axis | trade — 51-entry closed enum maintained centrally | trade — free string, ≤ 60 chars, owned by client |
| Small-cohort handling | Below-threshold flag, expected to flip later | Continuous k-anonymity — row simply doesn't appear unless k ≥ 3 |
| Generalization | Aggregator auto-bumped Pediatric dentist to Dentist | Client chooses their own display_trade |
| City widening | Dentist, TX fallback when (trade, city) cohort too small | Removed — state-only rows too re-identifiable |
| New client or platform | Policy amendment + schema amendment + deploy | No coordination — strings just appear |
| Agent identity in feed | Agent identifier published | Dropped — redundant with category, and reduces fingerprinting surface |
Contact
Questions about this policy, or want receipts for any specific claim? Email the founder at [email protected].