Why Censuses Should Be (Slightly) Inaccurate

Since the founding of the country, the US government has been conducting a census every 10 years. The results of this survey are used to calculate the number of seats each state gets in the House of Representatives, to distribute federal funds to states, etc. [1] In addition to estimating the populations of states, the census aims to find out the number of people living in each household, as well as the age, sex, race and relationship status of each individual in that household [2]. Such data can be used by the government to, for example, monitor compliance with anti-discrimination laws, but also, if made accessible to the public, can be of great use to academics, NGOs, and others. Of course, given the sensitive nature of these data, we do not want to compromise the privacy of the individuals who are disclosing the information. So how much about the census can we reveal so that it provides useful insight into the population and, at the same time, keeps the individuals’ data confidential?

Aggregation

One of the most obvious ways to publish data about a population without revealing the identities of the individuals within it, is to aggregate those data. Saying that the median age of the American population is 38.2 years, tells you virtually nothing about how old Morgan Freeman or Laura Dern is. Unfortunately, that level of aggregation may not be particularly useful.

To understand local populations, more fine-grained data may be preferred, but with that comes higher risk of privacy loss. If aggregated data about very few people are published, there may be a way to reconstruct the characteristics of those individuals. We will imagine a scenario where an intern1 at a city council decides to upload aggregated block-by-block age data (by demographic) to city’s public website without reviewing each block individually. We will explore data of one of these hypothetical blocks—one where 8 people live:

Aggregation of data

Some of you may notice right away how careless the release of such data is. We can already deduce that a 45-year-old white male lives in the block because there is only one white male living in the block according to the aggregated data, and we have age data for that group. As bad as it already is, there is a good chance that we might even be able to determine whether that guy is single or married, thus reconstructing all of his traits that were aggregated.

Reconstruction Attacks

In a recent paper, computer scientists at US Census Bureau laid out how bad actors may utilize the processing power of modern computers to infer something about the individuals from their summaries—a process known as data reconstruction [1]. In one of the more popular methods involving SAT2 solvers, one could describe this task almost like an algebra problem. We can attempt to reconstruct summary tables by introducing a bunch of variables and then set up equations (constraints) which we would try to solve. For example, we might describe the statistics above using equations such as $$\frac{A_1 + A_2 + \cdots + A_8}{8} = 39$$ where $A_i$ is the age of person #$i$.

I applied data reconstruction principles described in [1] to our example. Modern SAT solvers allowed me to determine with 100% certainty that the 45-year-old white male is married. Not only that, I was able to reconstruct all characteristics of all 8 people with 100% certainty. That is, from the universe of all possible combinations of the eight individuals and their traits, there was only one that fit all the data in the table, and the solver found it:

Reconstruction of data

The Best Defense

What is the best way to avoid such disastrous scenarios?

Suppression

One of the techniques to avoid compromising the privacy when publishing summaries about a small number of people is called cell suppression. For example, if a statistic is based on one or two people, a choice could be made to not publish it. When applied to our example, it would mean that we would have to suppress the data for single adults, black females, black males and white males. Of course, if we publish data for white females and females overall, we can easily deduce the number (2) and average age (32) of black females. Thus, an argument could be made to also suppress the data for white females. However, for the sake of simplicity and the fact that a lot of facts could be deduced this way (just using more steps), I will keep the data for white females in place. With aforementioned fields suppressed, there are now 9 possible configurations that could produce such statistics:

Reconstruction of data with even more data suppressed

The fact that now there is more than one plausible set of people that would fit the summary table, at least in theory, ensures more privacy—the attacker will be less certain which set is the real one. However, if we investigate closely, we see that the situation has not improved by a lot. All 9 possibilities are very similar—mostly just the youngest and the oldest individuals have slightly different characteristics from set to set. This means that the attacker could determine the characteristics of some people with very high certainty. We could try suppressing even more data but that may be problematic because

  1. ensuring high level of privacy (i.e. having many plausible combinations) severely limits how many pieces of data can be published [1],
  2. even with relatively small data sets, it might be computationally infeasible to determine whether the data that are published could be used to identify any of the individuals [1],
  3. there is no guarantee that the attacker will not have in their possession additional data (gathered from other sources) that, when combined with the redacted statistics, would compromise the privacy of everyone whose personal information was aggregated.

Noise

If we choose to publish more informative statistics, one thing we could do to preserve privacy is… to alter the data! If we add noise—say, a random integer between $-3$ and $3$—to your age, an attacker will be less likely to identify you because your age is not well defined anymore. But doesn’t this make the published statistics less accurate? Technically yes, but not by as much as one might expect. Remember that we are not publishing data about individuals, but rather about groups of them—if the noise is applied in a careful way, it will tend to cancel out, i.e. it is unlikely that the ages of all the residents in a particular block will be altered in the same direction. And the more aggregation we apply, the less negative impact the noise will have: city-level data will be more accurate than block-level data, state-level data will be more accurate than city-level data, and so on.

Even more effective is adding noise to the tabulated data directly, instead of the individual entries. Not only do we require significantly less noise in that case [4], but it is also much more difficult to reconstruct the database, i.e. there is a very large number of plausible combinations and they are much more diverse3:

Reconstruction of noisy data

For a long time, it was difficult to quantify the effect of noise when publishing statistical results, but in 2006 a framework, called differential privacy, was developed [5]. This system formalizes the treatment of trade-off between privacy loss and accuracy. That is, differential privacy shows to what extent a certain amount of noise increases privacy and decreases accuracy of published statistics, as well as how to inject the noise in the most efficient way. Importantly for censuses, this framework gives relative guarantees about the privacy loss resulting from public release of statistics [4]. Even if an attacker has additional information, such as consumer data from a credit bureau, using differential privacy in a careful way guarantees that publishing statistics will not increase the privacy loss of the people whose noisy data are included [4].

Given the power of differential privacy, it will, for the first time, be used in the 2020 US national census. Although noise was being injected in previous censuses (in a form of swapping, for example [1]), differential privacy will allow the Census Bureau to decide how much noise to inject to both 1) ensure sufficient privacy, and 2) make sure that statistics are accurate enough. Additionally, it makes it possible to prioritize—certain characteristics might be treated as more private without disproportionately affecting their accuracy [4]. Finally, differential privacy increases transparency—the Bureau can release their implementation of the method and that will not affect the privacy of the respondents [4].

The New Normal

Differential privacy is the most mathematically rigorous way of dealing with sensitive data. And it should be applied to much more than just the censuses. Whenever we are handling private information—whether sharing results of clinical trials or training neural networks on sensitive data—we have an obligation to ensure the privacy of individuals whose data are being used. At the same time, I am not optimistic that everyone will accept the use of this counter-intuitive concept of tweaking real-world data in such important procedures as national censuses. Unfortunately, there is probably no other way.

Code

The code that I used for data reconstruction examples can be found here.

References

[1] Garfinkel, S., Abowd, J. M., & Martindale, C. (2018). Understanding database reconstruction attacks on public data. Queue, 16(5), 28-53.

[2] United States Census 2020. Questions Asked on the Form. Retrieved November 10, 2020, from https://2020census.gov/en/about-questions.html

[3] Dinur, I., & Nissim, K. (2003, June). Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (pp. 202-210).

[4] Polich, K. [host], Garfinkel, S. [guest] (2020). Differential Privacy at the US Census. Data Skeptic, differential-privacy-at-the-us-census.mp3

[5] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006, March). Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference (pp. 265-284). Springer, Berlin, Heidelberg.


  1. because we can blame them for everything ↩︎
  2. bolean SATisfiability problem ↩︎
  3. I made up the tables on the left just to illustrate the point. In reality, the possiblities would depend on the nature of applied noise. ↩︎