Engineering · 6 min read

Brand voice cloning — the privacy story

How SuperPost learns the way you write without ever exposing your voice to another customer's account, another customer's model, or our shared training data.

By Vuk · Co-founder

The single most-asked question we get on customer calls, by a wide margin, is some variant of: "if I let you read my old posts to learn my voice, where does that voice live, and what stops you from using it on another account?"

The fear is reasonable. The state of the art for brand-voice cloning in most marketing tools is "send everything to GPT and hope." Hope that the model doesn't memorise your phrasing. Hope that the next account doesn't accidentally inherit your jokes. Hope that your competitor's content, three months from now, doesn't sound suspiciously like yours.

We're not going to ask you to hope. We engineered around the problem. This post is the architecture.

What "brand voice cloning" actually means

When a developer signs up for SuperPost, we offer to ingest a corpus of their existing content — past blog posts, past tweets, past LinkedIn posts, past changelog entries. We use that corpus to do two things:

  1. Build a retrieval index keyed to that workspace and only that workspace, that the planner uses to ground new content in the user's existing phrasing, vocabulary, and rhetorical moves.
  2. Tune a small adapter layer on top of the base model that biases generations toward the tonal patterns of that corpus — sentence length, hedge-vs-assert ratios, the use of em-dashes, whether you start sentences with "And."
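
To make "tonal patterns" concrete, here is a minimal sketch of the kind of surface statistics the second item names. It is an illustration only, not the adapter's training code: the `ToneProfile` type, the `tone_profile` function, and the tiny `HEDGE_WORDS` list are all hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass

# Hypothetical hedge vocabulary; a real list would be far longer.
HEDGE_WORDS = {"maybe", "perhaps", "probably", "might", "possibly", "roughly"}


@dataclass
class ToneProfile:
    avg_sentence_words: float       # mean words per sentence
    hedge_ratio: float              # share of sentences containing a hedge word
    em_dashes_per_sentence: float   # em-dash count divided by sentence count
    starts_with_and_ratio: float    # share of sentences that open with "And"


def tone_profile(text: str) -> ToneProfile:
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ToneProfile(0.0, 0.0, 0.0, 0.0)

    n = len(sentences)
    word_lists = [re.findall(r"[A-Za-z']+", s.lower()) for s in sentences]
    total_words = sum(len(words) for words in word_lists)
    hedged = sum(1 for words in word_lists if HEDGE_WORDS & set(words))
    and_starts = sum(1 for words in word_lists if words and words[0] == "and")

    return ToneProfile(
        avg_sentence_words=total_words / n,
        hedge_ratio=hedged / n,
        em_dashes_per_sentence=text.count("\u2014") / n,
        starts_with_and_ratio=and_starts / n,
    )
```

Calling `tone_profile` on a workspace's own corpus yields a small vector of numbers that describes how that workspace writes; nothing in it is shared across workspaces.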

We do not fine-tune the base model on your content. We do not contribute your content to a shared pool of training data. We do not, under any circumstance, allow another customer's model to retrieve from your index.

The four tenant-isolation rules

We maintain four hard rules in the brand-voice subsystem. Every line of code in this part of the product is built around enforcing them.

Rule 1: One tenant, one index, one namespace.

Brand voice retrieval lives in a per-workspace vector index in Postgres (we use pgvector, sharded by workspace ID). Every query is parameterised by workspace, and we have a request-level guard that fails closed if the workspace ID is missing or doesn't match the authenticated session. There is no cross-workspace retrieval API. There is no admin override. The fastest path from one customer's voice to another customer's content goes through a deliberate database export and re-import — i.e., a manual operator action, not an in-flight bug.
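
A fail-closed guard in front of a workspace-parameterised query might look roughly like the sketch below. The names are assumptions for illustration (`Session`, `require_workspace`, `build_voice_query`, and the `voice_chunks` table with its columns are all hypothetical); only the `<=>` cosine-distance operator comes from pgvector itself. The function returns the SQL and parameters rather than executing them, so it runs without a database.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


class WorkspaceGuardError(PermissionError):
    """Raised when a retrieval request cannot be tied to the caller's workspace."""


@dataclass(frozen=True)
class Session:
    workspace_id: Optional[str]  # populated by the auth layer (hypothetical)


def require_workspace(session: Session, requested_workspace_id: Optional[str]) -> str:
    # Fail closed: a missing or mismatched ID is an error, never a default.
    if not session.workspace_id or not requested_workspace_id:
        raise WorkspaceGuardError("workspace ID missing")
    if session.workspace_id != requested_workspace_id:
        raise WorkspaceGuardError("workspace ID does not match session")
    return requested_workspace_id


def build_voice_query(
    session: Session,
    requested_workspace_id: Optional[str],
    embedding: Sequence[float],
    limit: int = 8,
) -> Tuple[str, tuple]:
    workspace_id = require_workspace(session, requested_workspace_id)
    # The workspace filter is a bound parameter, never string-interpolated.
    # Table and column names are assumed for this sketch.
    sql = (
        "SELECT doc_id, body FROM voice_chunks "
        "WHERE workspace_id = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    vector_literal = "[" + ",".join(str(float(x)) for x in embedding) + "]"
    return sql, (workspace_id, vector_literal, limit)
```

The point of the design is that there is no code path that builds the query without first passing through `require_workspace`, so a missing ID produces an exception instead of an unfiltered scan.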

Rule 2: No fine-tuning on customer content. Ever.

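We leave the base model's weights untouched. The "personalisation" you see comes from retrieval, prompting, and the small per-workspace adapter described earlier, none of which modifies the shared base model. This means: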

  • Your voice stays in your index, retrievable only when generating for your workspace.
  • We never ship a model update that has learned anything specific about your business.
  • If you delete your workspace, your voice is genuinely gone — no residual capacity in a fine-tuned model that we'd have to roll back to scrub.

We considered fine-tuning. It would, in narrow benchmarks, produce slightly better outputs. It would also create a permanent leak surface (the customer's voice bleeding into the foundation model's behaviour on unrelated calls), and it would make data deletion more like data laundering than data deletion. We decided the privacy posture mattered more than the few-percent quality gain. Three years in, we're confident that was correct.

Rule 3: Retrieval is logged, scoped, and replayable.

Every retrieval call writes a row: which workspace queried, which workspace's data was returned, the query embedding, the documents retrieved, and the timestamp. Those two workspace IDs must match — we have a runtime check, a database constraint, and a daily auditor job that scans for any mismatched row and pages oncall. As of this writing, the mismatch counter has read zero for 218 days running.
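
The three layers described above (runtime check, database constraint, auditor query) can be sketched in a few lines. This is an assumption-laden illustration, not the production schema: it uses SQLite from the Python standard library so it runs anywhere, the table and column names are hypothetical, and the query-embedding column is omitted for brevity.

```python
import sqlite3
import time
from typing import Sequence

# Hypothetical schema. The CHECK constraint is the database-level backstop.
SCHEMA = """
CREATE TABLE retrieval_log (
    id                 INTEGER PRIMARY KEY,
    querying_workspace TEXT NOT NULL,
    source_workspace   TEXT NOT NULL,
    doc_ids            TEXT NOT NULL,
    created_at         REAL NOT NULL,
    CHECK (querying_workspace = source_workspace)
)
"""


def log_retrieval(
    conn: sqlite3.Connection,
    querying_workspace: str,
    source_workspace: str,
    doc_ids: Sequence[str],
) -> None:
    # Runtime check: refuse before the row ever reaches the database.
    if querying_workspace != source_workspace:
        raise ValueError("cross-workspace retrieval refused")
    conn.execute(
        "INSERT INTO retrieval_log "
        "(querying_workspace, source_workspace, doc_ids, created_at) "
        "VALUES (?, ?, ?, ?)",
        (querying_workspace, source_workspace, ",".join(doc_ids), time.time()),
    )


def count_mismatches(conn: sqlite3.Connection) -> int:
    # Auditor query: any non-zero result would be the signal to page oncall.
    row = conn.execute(
        "SELECT COUNT(*) FROM retrieval_log "
        "WHERE querying_workspace <> source_workspace"
    ).fetchone()
    return row[0]
```

The layers are redundant on purpose: if a code change bypasses `log_retrieval`, the constraint still rejects the row, and if the constraint were ever dropped, the auditor query would still surface it.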

This is the kind of thing you mostly want as a forensic capability — a guarantee that if something ever leaked, you'd be able to prove the bounds of the leak. We hope to never use it for that. We use it weekly for debugging and capacity planning.

Rule 4: The model never sees content it shouldn't see in a single inference call.

When we prompt the model to generate, the only customer content in the prompt is the retrieval result for the current workspace. There's no shared system prompt that has examples from multiple customers. There's no "few-shot example pool" that mixes data across accounts. Every inference call is hermetic: one workspace's voice, one workspace's content, one rendered output.
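
One way to picture a hermetic call is a prompt builder that refuses any snippet tagged with a different workspace. The sketch below is hypothetical throughout (`VoiceSnippet`, `build_prompt`, and the wording of `SYSTEM_PROMPT` are invented for illustration); it shows the shape of the rule, not the real prompt.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VoiceSnippet:
    workspace_id: str  # the workspace whose index returned this snippet
    text: str


# Contains instructions only: no customer text, no cross-account examples.
SYSTEM_PROMPT = "Write in the voice shown by the reference snippets."


def build_prompt(workspace_id: str, snippets: Iterable[VoiceSnippet], task: str) -> str:
    snippets = list(snippets)
    # Reject the whole call if even one snippet belongs to another workspace.
    foreign = [s for s in snippets if s.workspace_id != workspace_id]
    if foreign:
        raise ValueError(f"{len(foreign)} snippet(s) from another workspace")
    refs = "\n".join(f"- {s.text}" for s in snippets)
    return f"{SYSTEM_PROMPT}\n\nReference snippets:\n{refs}\n\nTask: {task}"
```

Because the system prompt is a constant with no examples in it, the only customer text that can reach the model is whatever passed the workspace check on that call.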

The temptation, when building these systems, is to share a few high-quality examples across all generations because it improves quality. We tried it. The quality lift was small. The risk surface was unacceptable. We removed the shared examples and built per-workspace example libraries instead. Quality dropped 6%. We earned it back over the next three sprints with better retrieval.

What happens at deletion

When you delete your workspace from your settings, three things happen, in this order:

  1. The brand voice index for that workspace is dropped. Vector rows, document metadata, the embedding cache — all gone.
  2. The retrieval log is anonymised. We retain the fact that retrieval calls happened (for billing and audit), but the workspace ID is replaced with a one-way hash.
  3. The publishing artefact archive is purged after a 30-day grace period (in case you want to undelete; after 30 days it's irrecoverable).
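
The ordering above can be sketched against an in-memory store. Everything named here is an assumption made for illustration: the `Store` layout, the `delete_workspace` and `anonymise` functions, and in particular the use of salted SHA-256, since the post only says "a one-way hash".

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=30)


@dataclass
class Store:
    voice_index: dict = field(default_factory=dict)       # workspace_id -> documents
    retrieval_log: list = field(default_factory=list)     # rows with a "workspace" key
    archive_purge_at: dict = field(default_factory=dict)  # workspace_id -> purge time


def anonymise(workspace_id: str, salt: bytes) -> str:
    # Salted so that a short, guessable ID cannot be recovered by brute force.
    return hashlib.sha256(salt + workspace_id.encode("utf-8")).hexdigest()


def delete_workspace(store: Store, workspace_id: str, salt: bytes, now: datetime) -> datetime:
    # Step 1: drop the brand voice index for this workspace.
    store.voice_index.pop(workspace_id, None)

    # Step 2: keep the log rows, but replace the workspace ID with a hash.
    token = anonymise(workspace_id, salt)
    for row in store.retrieval_log:
        if row["workspace"] == workspace_id:
            row["workspace"] = token

    # Step 3: schedule the archive purge for the end of the grace period.
    purge_at = now + GRACE_PERIOD
    store.archive_purge_at[workspace_id] = purge_at
    return purge_at
```

Running the index drop first means the voice data is unreachable immediately, even though the archive stays recoverable until the returned purge time.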

We publish a deletion timestamp and audit hash that you can verify. We're working on getting that flow into the SOC 2 control set this quarter, with our compliance terminal.

What we don't do — and why we're saying so

We don't:

  • Sell your content to data brokers, AI training pools, or any third party. We don't even have a "data partnerships" team.
  • Use your content to improve the base model. (We're not the model vendor — Anthropic is — and we've never had a discussion with them about contributing customer data.)
  • Look at your content as a team unless you've explicitly opened a support ticket and asked us to. Even then, access is gated to a named individual and logged.
  • Route brand voice prompts through third-party analytics or prompt-management services. Some tools do. We don't, because that introduces a third party to the conversation that we can't audit.

The reason we're enumerating these is that "we don't" claims are only as valuable as their specificity. "We respect your privacy" is meaningless. "We don't sell your data, here's what selling would mean and why we don't do it" is something you can hold us to.

Why bother going this hard?

The honest answer is that the customers we want to serve — small developer teams shipping serious products — are exactly the customers who will read the privacy posture before they sign up. They've been burned by tools that quietly used their content to train shared models. They've watched competitors' phrasing show up in their feeds. They're skeptical, correctly. The only way to earn that audience is to engineer the boundaries before we ship, not after we get caught.

We also believe — and this is partly conviction, partly product taste — that brand voice is the most personal thing a founder gives us when they hand over their account. We're holding it carefully because if we don't, they shouldn't trust us with the rest.

See the controls

If this resonates and you want to dig in, the pricing page covers what's in each plan, and the legal section has the formal data-handling commitments. The fastest way to get a guided tour through the privacy controls in the product itself is a 15-minute demo. We'll log into a sandbox workspace and walk through the deletion flow live.
