Preserving Data Privacy using Differential Privacy

Jun 30, 2021

By Sowmya Ganapathi Krishnan and Siddharth Shah

Differential Privacy (DP)

An algorithm is said to be differentially private if, by looking at its output, one cannot tell whether any particular individual’s data was included in the original dataset or not.

The mechanism differential privacy uses to protect privacy is to purposefully add carefully calibrated noise to the data (deliberate errors, in other words).

This is a win-win for everybody whose intentions are good, and a puzzling picture for adversaries who want to misuse such statistical data: even if it were possible to recover data about an individual, there would be no way to know whether that information was meaningful or nonsensical.
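
To make this concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query (the step counts, query and epsilon value below are made up for illustration): the noise scale is calibrated to the query’s sensitivity divided by a privacy parameter epsilon.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with the Laplace mechanism.

    A count changes by at most 1 when one person is added or removed,
    so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical lifestyle survey: daily step counts of participants.
steps = [4200, 8100, 12500, 6700, 9300, 3100]

# "How many participants walked more than 5,000 steps?", answered with epsilon = 0.5.
print(laplace_count(steps, lambda s: s > 5000, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means more accurate answers but a weaker guarantee.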

Why is it important?

A doctor wants to conduct a health study so that she can establish a correlation between an individual’s lifestyle and a chronic condition. The most challenging part is convincing the general public to participate in the study, as they have concerns about privacy and about how their data is collected and processed.

At this point a data scientist comes in and makes a claim that convinces the general public to take part in the survey: she will be employing a privacy-preserving technique used by the likes of Apple, Google, and Uber.

Looking at both perspectives in the above scenario, both parties want to do something good and learn something about a set of individuals through generalization, that is, by analysing aggregated data. It is in these situations that differential privacy shines the most.

“You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.”

- Cynthia Dwork in The Algorithmic Foundations of Differential Privacy

Differential Privacy aims to protect our data against the two most common exposures of a dataset that is collected to derive statistical information.

Differentiated Attack

In the situation above, we talked about collecting users’ lifestyle data. Let’s say it includes many attributes, among them the number of steps walked by each participant. Later, we remove all the PII and release data containing just the number of steps and the locality name.

However, let’s say someone viewing the released data wants to know the number of steps walked by one particular person. To do this, the adversary surveys the locality and obtains the step counts of every person except the one they are interested in. Knowing how much everyone else has walked, the adversary can subtract those counts from the locality’s total (or simply eliminate every record that matches someone they surveyed) and determine how much the person in question has walked.

This type of attack is known as a differentiated attack. It is very difficult to protect against, and it is exactly what differential privacy aims to defend against.
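
A toy sketch of the subtraction variant of this attack (all numbers are invented): with an exact total, the adversary recovers the target’s step count precisely; with a differentially private total, the subtraction yields only the target’s count buried in noise.

```python
import numpy as np

# Hypothetical locality: the last entry belongs to the person the adversary targets.
steps = [4200, 8100, 12500, 6700, 9300]
known_to_adversary = steps[:-1]               # gathered by surveying everyone else

# Exact released total: subtraction exposes the target completely.
exact_total = sum(steps)
print(exact_total - sum(known_to_adversary))   # 9300, fully exposed

# Differentially private total: Laplace noise scaled to one person's maximum contribution.
epsilon, sensitivity = 0.5, 15000              # assume daily steps are capped at 15,000
noisy_total = exact_total + np.random.laplace(scale=sensitivity / epsilon)
print(noisy_total - sum(known_to_adversary))   # the target's count is hidden in the noise

# With only five people the noisy total is very imprecise; real releases aggregate far
# more participants, so the noise becomes small relative to the overall signal.
```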

Linkage Attack

On October 2, 2006, Netflix announced a $1 million prize for improving its movie recommendation algorithm and released an anonymised dataset of movie ratings from nearly 500,000 subscribers, asserting that all personally identifiable information (PII) had been removed.

One year later, Arvind Narayanan and Vitaly Shmatikov, two researchers at the University of Texas at Austin, showed that removing PII is not adequate to protect data privacy. They used the Internet Movie Database (IMDb) as an external source of auxiliary information to re-identify anonymous Netflix subscribers.

Netflix Linkage Attack (Image Source)
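
As a toy illustration (all names, titles and ratings below are invented), a linkage attack is essentially a join on quasi-identifiers, attributes that appear both in the “anonymised” dataset and in a public one:

```python
# "Anonymised" ratings: PII removed, but movie, rating and date remain.
anonymous_ratings = [
    {"subscriber_id": 1743, "movie": "The Big Lebowski", "rating": 5, "date": "2005-03-12"},
    {"subscriber_id": 1743, "movie": "Fargo", "rating": 4, "date": "2005-03-14"},
]

# Public IMDb-style reviews posted under real names.
public_reviews = [
    {"name": "Jane Doe", "movie": "The Big Lebowski", "rating": 5, "date": "2005-03-12"},
    {"name": "Jane Doe", "movie": "Fargo", "rating": 4, "date": "2005-03-14"},
]

# Join on the quasi-identifiers (movie, rating, date) to re-identify subscribers.
for a in anonymous_ratings:
    for p in public_reviews:
        if (a["movie"], a["rating"], a["date"]) == (p["movie"], p["rating"], p["date"]):
            print(f"subscriber {a['subscriber_id']} is likely {p['name']}")
```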

How does it work?

At a very high level, Differential Privacy is achieved by introducing noise into the data before publishing the results. This can be done at two different points in the data collection and analysis pipeline.

Local Differential Privacy

When users do not trust the data collector, the noise is introduced at the source itself, even before the data is aggregated. This is useful when you are just starting to collect data and want to make sure its privacy is protected right from the beginning.

The differential privacy technology used by Apple is rooted in the idea that statistical noise that is slightly biased can mask a user’s individual data before it is sent to the cloud. If many people are submitting the same data, the noise that has been added can average out over large numbers of data points, and Apple can see meaningful information emerge.

How Apple collects data using Differential Privacy — by adding noise at user’s device
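
One of the simplest local mechanisms is randomized response; the sketch below is not Apple’s actual algorithm, just the general idea: each device randomises its own answer before reporting, and the aggregator debiases the noisy reports to recover an accurate population-level estimate.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report the truth with probability 1/2; otherwise report a fair coin flip."""
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

# Hypothetical population: 30% truly have the sensitive attribute.
population = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(x) for x in population]

# Debias: observed rate = 0.5 * true_rate + 0.25, so invert that relation.
observed = sum(reports) / len(reports)
estimated_true_rate = 2 * (observed - 0.25)
print(round(estimated_true_rate, 3))  # close to 0.3, yet no single report can be trusted
```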

Global Differential Privacy

When users trust the data collector but not the querier, the noise is introduced by the aggregator before the results are sent out. This is also useful when data has already been collected and we want to run analytics on top of it in a privacy-preserving way.

At Uber, when someone tries to get statistical information out of existing user data, the queries are passed through a system called Chorus, which runs them against the original, noise-free database but makes sure that the query results are returned in a differentially private manner.

How Uber shares data collected from users with other third parties — by adding noise at the query layer
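
The sketch below is not Uber’s Chorus, just the general pattern of global differential privacy: the raw data stays with a trusted aggregator, and every query answer is perturbed before it leaves. The class, record fields and epsilon value are made up for illustration.

```python
import numpy as np

class PrivateQueryLayer:
    """Trusted aggregator: holds raw data, releases only noisy aggregates."""

    def __init__(self, records, epsilon):
        self._records = records   # raw, noise-free data never leaves this class
        self._epsilon = epsilon

    def count(self, predicate):
        true_answer = sum(1 for r in self._records if predicate(r))
        return true_answer + np.random.laplace(scale=1.0 / self._epsilon)

    def bounded_sum(self, key, lower, upper):
        # Clamp each contribution so one record can change the sum by at most (upper - lower).
        true_answer = sum(min(max(r[key], lower), upper) for r in self._records)
        sensitivity = upper - lower
        return true_answer + np.random.laplace(scale=sensitivity / self._epsilon)

# Hypothetical trip records; an analyst only ever sees the noisy answers.
trips = [{"city": "SF", "fare": 18.0}, {"city": "NYC", "fare": 32.5}, {"city": "SF", "fare": 11.0}]
layer = PrivateQueryLayer(trips, epsilon=1.0)
print(layer.count(lambda r: r["city"] == "SF"))
print(layer.bounded_sum("fare", lower=0.0, upper=50.0))
```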

Applications

Differential Privacy is widely used by major tech companies at large scale. Some of its applications include:

  1. Privacy-preserving statistics and recommender systems
  2. Differentially Private Machine Learning
  3. Secure analysis of customer behaviour
  4. Secure sharing of demographic data

Explore Differential Privacy Further

Many of the tech giants have open-sourced their differential privacy implementations, and the following libraries are available for different use cases:

  1. Google’s Differential Privacy
    This library provides building-block implementations in C++, Java and Go
  2. Google’s Tensorflow Privacy
    This can be used in conjunction with TensorFlow to perform differentially private machine learning
  3. IBM’s differential-privacy-library
    Written in Python and available through pip, this library can be used by data scientists as a starting point in their differential privacy journey (see the sketch after this list)
  4. Facebook’s Opacus
    This library comes in handy when you are training models with PyTorch and want to perform differentially private machine learning
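
As a quick taste of the IBM library mentioned above, here is a sketch based on diffprivlib’s documented quick-start (parameter names may differ slightly between versions, so treat this as illustrative rather than definitive):

```python
# pip install diffprivlib
import numpy as np
from diffprivlib import tools
from diffprivlib.mechanisms import Laplace

ages = np.array([23, 45, 31, 52, 38, 29, 61])

# Differentially private mean; the bounds clamp each value so the sensitivity is known.
print(tools.mean(ages, epsilon=1.0, bounds=(18, 80)))

# Or apply the Laplace mechanism directly to a single numeric answer.
mech = Laplace(epsilon=0.5, sensitivity=1.0)
print(mech.randomise(float(len(ages))))  # a noisy count of the records
```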

References

  1. The Algorithmic Foundations of Differential Privacy by Cynthia Dwork and Aaron Roth
  2. An overview of Differential Privacy by Harvard can be found here.
  3. Arvind Narayanan and Vitaly Shmatikov’s research paper describing the methodology behind the Netflix linkage attack
  4. How Apple collects data using Differential Privacy, and further in-depth reading can be done here.
  5. How Uber shares data collected from users with other third parties
  6. OpenMined is also a very good place to keep up-to-date with everything related to DP
