How Salary Confidential approaches the goal of k-anonymity while keeping respondents safe
k-anonymity is a family of data anonymization approaches built around a single idea: ensuring that each record in a released dataset is indistinguishable from at least k − 1 others along identifying dimensions, while preserving analytical usefulness. It is most often applied to sensitive data collected in non-anonymous contexts — such as medical or administrative records gathered by hospitals or governments — that researchers later want to reuse safely.
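To make the definition concrete, here is a minimal Python sketch of a k-anonymity check. It is an illustration only, not code from Salary Confidential; the function names, the field names, and the sample records are all hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    values for every quasi-identifier (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True when every record matches at least k - 1 others
    on the chosen quasi-identifiers."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical released records; "diagnosis" is the sensitive value,
# "age_band" and "region" are treated as quasi-identifiers.
records = [
    {"age_band": "30-39", "region": "North", "diagnosis": "A"},
    {"age_band": "30-39", "region": "North", "diagnosis": "B"},
    {"age_band": "40-49", "region": "South", "diagnosis": "A"},
]

# The third record is unique on (age_band, region), so the check fails for k = 2.
print(is_k_anonymous(records, ["age_band", "region"], k=2))
```

The check only counts matching records; which attributes count as quasi-identifiers is a judgment call that the code cannot make for you.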
Why we did not adopt classical k-anonymity algorithms, but learned useful things in the process
Many of the problems Salary Confidential cares about — in particular, preventing the recognition of a specific individual in a dataset — are also central to k-anonymity. We therefore explored several classical approaches, especially since there are well-documented algorithms and libraries available. Ultimately, however, these techniques proved poorly suited to our use case.
First, much of classical k-anonymity focuses on removing or generalizing quasi-identifiers — data elements with high identifying power. While this logic can apply in principle, we found that designing surveys to limit identifying information upfront is both simpler and more effective than retrofitting mitigations after data collection.
Second, classical k-anonymity relies on forming groups of indistinguishable records by progressively coarsening attributes until every record shares its values with at least k − 1 others. This approach assumes sufficiently large datasets. In our case, peer groups are intentionally small and purpose-built, and aggressive generalization would quickly make the data too vague to be useful.
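The effect of generalization on a small group can be sketched in a few lines of Python. This is an illustrative toy, not a classical k-anonymity algorithm and not our code: the helper names, the candidate band widths, and the sample values are assumptions made up for the example.

```python
from collections import Counter


def coarsen(value, width):
    """Map a number to the lower bound of its band of the given width."""
    return (value // width) * width


def min_band_width(values, k, widths):
    """Return the first width in `widths` at which every occupied band
    holds at least k values, or None if no candidate width is enough."""
    if not values:
        return None
    for width in widths:
        bands = Counter(coarsen(v, width) for v in values)
        if min(bands.values()) >= k:
            return width
    return None


# Hypothetical peer group of five, described by years of experience.
years_experience = [3, 7, 12, 18, 24]

# Even k = 2 forces a single 40-year band that covers the whole group.
print(min_band_width(years_experience, k=2, widths=[1, 5, 10, 20, 40]))  # 40
```

With only five spread-out values, the attribute has to be coarsened until it says almost nothing, which is the failure mode described above.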
Designing privacy in the data-gathering process itself
Unlike many anonymization use cases, we do not start with a large, pre-existing dataset that must later be made safe for reuse. Our datasets are purpose-created during a poll, and we control both the questions being asked and how much the platform ever learns about respondents — which is intentionally very little.
This allows us to move from post-hoc anonymization to privacy by construction. Rather than adopting k-anonymity algorithms directly, we draw on the insights of that research — particularly the problems it identifies and the constraints it exposes when protecting individuals in large datasets.
We preserve the precision of what matters most — compensation — and introduce carefully controlled ambiguity in contextual attributes when needed. This ambiguity is not arbitrary: it is designed to reduce the risk of re-identification while preserving the overall structure and interpretability of the data.
At the same time, we recognize that keeping track of a specific respondent context can be important for interpreting results, whether that context is educational background, prior experience at a given company, gender, or something else entirely. We don't allow freeform questions such as "what is your gender", so no individual response is ever tagged with such an answer. Instead, a requester can define a survey peer group for a specific purpose: for example, a peer group to which only female respondents were invited, so that every response collected through it is known to come from a female respondent.
The end result is that while we do not apply classical k-anonymity algorithms, we preserve a similar property: characteristics are only observable when they are shared across a group of at least four respondents. Rather than making records indistinguishable, we ensure that context is never exposed at the individual level, but only when it is safely shared.
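The group-level rule can be sketched as follows. This is a simplified Python illustration under stated assumptions, not a description of the platform's internals: the data layout (a mapping from a requester-defined context label to the responses collected through that peer group) and the names are hypothetical; only the threshold of four comes from the text.

```python
MIN_GROUP_SIZE = 4  # threshold stated in the text above


def observable_context(peer_groups, min_group=MIN_GROUP_SIZE):
    """Return {context_label: respondent_count} for the peer groups that are
    large enough for their shared context to be shown at all."""
    return {
        label: len(responses)
        for label, responses in peer_groups.items()
        if len(responses) >= min_group
    }


# Hypothetical peer groups; context lives on the group, never on a response.
peer_groups = {
    "female respondents": [120_000, 135_000, 128_000, 142_000],
    "former employees of one company": [110_000, 125_000],
}

# Only the group of four qualifies; the group of two stays unlabeled.
print(observable_context(peer_groups))  # {'female respondents': 4}
```

Note that no individual response carries a context field in this sketch: the label exists only as a property of the group.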
Using (or not) perturbation or jittering of the data
Numerical compensation data is valuable precisely because of its precision. Many anonymization and disclosure-control techniques rely on perturbation or jittering — modifying values within an acceptable range to make inference or re-identification harder.
In Salary Confidential, we do not apply these techniques to compensation itself.
Compensation values are always stored and shown as reported. Preserving their precision is essential to the purpose of the product.
However, not all data carries the same risk.
Certain contextual attributes — such as organization size — can become identifying in small, targeted samples. In those cases, we may transform how that context is represented.
This can include:
- using approximate categories instead of exact values
- allowing categories to have soft boundaries
- or suppressing the attribute entirely when it cannot be safely shown
These transformations are applied to context, not to compensation.
This allows us to preserve the precision of the data that matters most, while reducing the risk of re-identification from attributes that are more likely to act as quasi-identifiers.
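One way to picture these context transformations is the Python sketch below. It is an assumption-heavy illustration rather than our implementation: the band labels and limits are invented, "soft boundaries" are modeled as overlapping bands with a random pick inside the overlap, and suppression is triggered by a hypothetical minimum group size.

```python
import random

# Hypothetical organization-size bands as (label, low, high), inclusive.
# Neighboring bands overlap on purpose: that overlap is the "soft boundary".
SIZE_BANDS = [
    ("small", 1, 60),
    ("mid-sized", 40, 600),
    ("large", 400, None),
]


def represent_org_size(headcount, group_size, min_group=4, rng=None):
    """Return an approximate size label, or None when the attribute is
    suppressed. Compensation is never passed through this function."""
    if group_size < min_group:
        return None  # assumed rule: too few respondents to show context safely
    candidates = [
        label
        for label, low, high in SIZE_BANDS
        if headcount >= low and (high is None or headcount <= high)
    ]
    if not candidates:
        return None  # value outside every band: suppress rather than guess
    rng = rng or random.Random()
    # In an overlap, either neighboring label may be chosen.
    return rng.choice(candidates)


print(represent_org_size(10, group_size=5))    # small
print(represent_org_size(5000, group_size=5))  # large
print(represent_org_size(50, group_size=3))    # None (suppressed)
```

A headcount of 50 in a large enough group could come back as either "small" or "mid-sized", so the label alone does not pin down which side of a threshold an organization sits on.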
Using non-joinable disclosure in how survey outreach message templates are crafted
We also apply lessons from k-anonymity research, and from work on relational confidentiality constraints, to the design of outreach messages.
Respondents in Salary Confidential are not just contributors; they are also beneficiaries of the resulting report. For someone to decide whether to participate, they need a clear sense of what the survey is about and how closely the peer group relates to them. This creates a tension between clarity and privacy.
The most transparent — and least safe — approach would be something like:
“Here are the titles and companies of the five people we’re inviting.”
In small or well-defined professional circles, this would make individuals trivially identifiable. Did we try to go down this path and then apply various algorithmic ways to “lattice” titles into something more blurred? We did. Did it work? Reader, it did not.
Instead, we use a principle we call non-joinable marginal disclosure:
We ask requesters for precise titles in one question and precise companies in another, but never for title–company pairs. This gives potential respondents a concrete, textured understanding of the peer group, while preventing straightforward linkage to specific individuals. To blur linkage further, we randomize the order of one of the lists relative to the order in which the requester provided it.
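The idea can be sketched in Python as follows. This is a minimal illustration, not the platform's code: the function name and the sample titles and companies are hypothetical, and shuffling the company list (rather than the title list) is an arbitrary choice for the example.

```python
import random


def build_marginal_lists(titles, companies, rng=None):
    """Return (titles, shuffled_companies) for an outreach message.

    The two lists are collected separately and never stored as pairs;
    one of them is shuffled so positions do not line up either.
    """
    rng = rng or random.Random()
    shuffled = list(companies)  # copy, so the requester's input is untouched
    rng.shuffle(shuffled)
    return list(titles), shuffled


# Hypothetical requester input, answered in two separate questions.
titles = ["Staff Engineer", "Engineering Manager", "Principal Designer"]
companies = ["Acme", "Globex", "Initech"]

shown_titles, shown_companies = build_marginal_lists(titles, companies)
print("Titles in this peer group:", ", ".join(shown_titles))
print("Companies in this peer group:", ", ".join(shown_companies))
```

A shuffle can occasionally reproduce the original order by chance, so the protection in this sketch comes mainly from never collecting the pairs; the shuffle only removes any residual ordering signal.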