Data is the fuel that keeps the tech world running, and its economic value is often compared to that of oil. There is no doubt that companies are focused on collecting vast quantities of data. Facebook, one of the tech titans, has gained notoriety for collecting data without user consent. Governments, too, are entering the data collection field. A prime example is India's Aadhaar programme, which was launched in 2009 and given statutory backing by the Aadhaar Act in 2016.
Collecting data at this scale inevitably poses security risks. There have been data leaks in which users' personally identifiable information was exposed. Differential privacy gives companies a way to collect and share data without exposing personal information. For instance, take a survey that asks 100 television viewers whether they watch Netflix or Prime. The survey treats the answers as a data set rather than as individual responses, so anonymity is preserved, at least on paper. Instead of analysing each individual, we get an overall figure, such as 70 out of 100 people watching Netflix. The data set does not reveal the identity of the 70 who watch Netflix or of the remaining 30 who do not.
Differential privacy relies on randomised algorithms to protect the individuals in a data set. In simple terms, it is a more robust and mathematically rigorous definition of data privacy. Cynthia Dwork is widely credited as one of the originators of the concept.
In The Algorithmic Foundations of Differential Privacy, Cynthia Dwork and Aaron Roth write: “Differential privacy describes a promise, made by a data holder, or curator, to a data subject [that] you will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.”
Need for differential privacy
Data privacy is paramount. Companies collect and analyse big data to understand human behaviour and motivations, so that they can judge consumer trends and market their products accordingly. The scope of big data is tremendous and the market for it is growing rapidly, but so are the risks.
In recent years, micro-data, that is, information about individuals, has increasingly become public. Micro-data captures the most private and varied aspects of a person's life, which makes it especially vulnerable to misuse.
Protection from linkage attacks
In 2006, Netflix published a dataset covering about 500,000 users to support the Netflix Prize data mining contest. Before release, the company removed identifying details and randomised some of the data. However, researchers were able to demonstrate that even data anonymised in this way can be re-identified.
For sparse datasets such as Netflix's, an attacker needs only a little background knowledge about a person to single out their record. The researchers cross-referenced the Netflix dataset with public reviews on IMDb and were able to identify individual users. This result underlines the need for differential privacy.
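The idea behind such a linkage attack can be illustrated with a toy sketch. Everything here is hypothetical: the names, film titles, ratings, the `overlap` and `link` helpers, and the `min_matches` threshold are invented for illustration and are not the method the researchers actually used.

```python
# Toy linkage attack. All names, titles and ratings below are made up.
anonymised_ratings = {
    "user_001": {"Film A": 5, "Film B": 2, "Film C": 4},
    "user_002": {"Film A": 1, "Film D": 5},
    "user_003": {"Film B": 2, "Film C": 4, "Film E": 3},
}

# Public reviews posted under real names (hypothetical auxiliary data).
public_reviews = {
    "Priya": {"Film A": 5, "Film C": 4},
    "Rahul": {"Film D": 5},
}


def overlap(private, public):
    """Count titles that carry the same rating in both records."""
    return sum(1 for title, rating in public.items() if private.get(title) == rating)


def link(public_reviews, anonymised_ratings, min_matches=2):
    """Map each public name to the one anonymised ID that matches best, if any."""
    links = {}
    for name, reviews in public_reviews.items():
        scores = {uid: overlap(r, reviews) for uid, r in anonymised_ratings.items()}
        best = max(scores, key=scores.get)
        top = scores[best]
        # Only accept a link when the best match is unique and strong enough.
        unique = sum(1 for s in scores.values() if s == top) == 1
        if top >= min_matches and unique:
            links[name] = best
    return links


linked = link(public_reviews, anonymised_ratings)
print(linked)
```

In this toy data, two matching ratings are enough to tie one public name to one "anonymous" ID, which is why sparse data sets are so exposed: few people share the same small set of ratings.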
Conclusion irrespective of individual
Differential privacy also addresses the paradox of ‘learning nothing about the individual while learning useful information about a population’ by making the conclusion nearly independent of any one individual. In other words, it matters very little whether an individual opts out of the study or stays in; the findings will be essentially the same. Under differential privacy, each possible output is almost equally likely whether or not a given participant's data is included.
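This guarantee is usually stated formally as follows. The notation is the standard one rather than anything defined in this article: a randomised mechanism \mathcal{M} is \varepsilon-differentially private if, for every pair of data sets D and D' that differ in one individual's record and every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

The smaller the privacy parameter \varepsilon, the closer the two probabilities must be, and the less any single participant can influence what is published.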
Sometimes the queries themselves are the problem, even when each one looks harmless. Suppose an attacker wants to know whether a person, ‘A’, has a certain condition or habit and asks two questions:
- How many people in the dataset have the condition or habit?
- How many people in the dataset, other than A, have the condition or habit?
Each answer is an aggregate, but subtracting the second from the first reveals A's status: a difference of one means A has the condition, and a difference of zero means A does not. Thus A's condition can be deduced easily unless differential privacy is applied.
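The two queries can be sketched in a few lines. The records and the `count_with_condition` helper are hypothetical and serve only to show how two exact aggregate answers leak one person's record.

```python
# Hypothetical records: (name, has_condition). None of this is real data.
records = [
    ("A", True),
    ("B", False),
    ("C", True),
    ("D", False),
    ("E", True),
]


def count_with_condition(rows):
    """Exact count query: how many rows have the condition?"""
    return sum(1 for _, has_condition in rows if has_condition)


# Query 1: everyone in the dataset.
q1 = count_with_condition(records)
# Query 2: everyone except the person named "A".
q2 = count_with_condition([row for row in records if row[0] != "A"])

# The difference between the two aggregate answers is A's individual record.
a_has_condition = (q1 - q2) == 1
print(q1, q2, a_has_condition)
```

Neither query mentions A's condition directly, yet the pair of exact answers discloses it. Noisy answers would blur the difference between the two counts.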
How does it work?
Let us consider a situation where we want to find out who watches Netflix or Prime. In a data set of 100 people, an attacker wants to learn about Sunny's viewing habits. The attacker already knows that 70 of the 100 people watch Netflix. By obtaining background information about the other 99 people, the attacker learns that 30 of them watch Prime and 69 watch Netflix. Sunny, the 100th member, must therefore watch Netflix, since that is the only way the total reaches 70.
Differential privacy protects against such attacks by adding random ‘noise’ to the results. The noise is a random value drawn from a carefully chosen probability distribution. In the above situation, instead of publishing the exact 70/30 split, we might publish a slightly perturbed one such as 69/31. This makes it hard to pin down Sunny individually, while the overall ratio remains nearly the same.
The amount of noise is calibrated to how much a single person can change the answer, which makes the approach particularly useful for data about confidential habits. Two common choices are the Laplace mechanism and the Gaussian mechanism, which draw noise from the Laplace and Gaussian distributions respectively.
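As a rough illustration, the Netflix count from the example could be released with Laplace noise as sketched below. This is a minimal sketch, not a production implementation: the function names, the seed and the value of `epsilon` are assumptions chosen for the example, and real deployments need more care (for instance around floating-point behaviour).

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential samples with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
true_netflix_viewers = 70
epsilon = 0.5  # assumed privacy parameter; smaller values mean more noise

released = noisy_count(true_netflix_viewers, epsilon, rng)
print(round(released, 1))

# The noise is centred on zero, so released values scatter around the true count.
samples = [noisy_count(true_netflix_viewers, epsilon, rng) for _ in range(20000)]
mean_release = sum(samples) / len(samples)
print(round(mean_release, 1))
```

The second part is only a sanity check on the noise distribution. In practice each release spends part of the privacy budget, so the same statistic would not be published thousands of times.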
Challenges with differential privacy
- Differential privacy works well only when the aggregated data set is large. On small data sets, the noise can swamp the signal, so the technique is of little use.
- The technique is less helpful when individual contributions to a sum are very unequal. For example, in an aggregate of incomes, including or excluding a single individual can change the result considerably. The reason is that incomes are unevenly distributed: the top 20 percent earn far more than the remaining 80 percent. Therefore, excluding any member of the top 20 percent will noticeably affect the result.
- In queries involving a series of private questions, more noise has to be added to each answer to keep identities hidden. The more queries there are, the more noise is needed, which makes it harder to derive anything useful from the data.
- Under the guise of differential privacy, companies may feel free to collect even more data from users.
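The first and third challenges can be made concrete with a small calculation. The sketch assumes Laplace noise on counting queries and an even split of a total privacy budget across queries (basic sequential composition); the function name and the budget value are illustrative assumptions.

```python
def per_query_noise_scale(total_epsilon, num_queries, sensitivity=1.0):
    """Laplace noise scale per query when a total budget is split evenly.

    Under basic sequential composition, k queries answered with
    epsilon / k each consume a total budget of epsilon, so each
    query's noise scale is sensitivity * k / epsilon.
    """
    return sensitivity * num_queries / total_epsilon


total_epsilon = 1.0  # assumed overall privacy budget

# More queries on the same data means more noise per answer.
scales = {k: per_query_noise_scale(total_epsilon, k) for k in (1, 10, 100)}
print(scales)

# The same noise scale is a far larger fraction of a small data set.
scale_for_one_query = scales[1]
noise_fraction = {n: scale_for_one_query / n for n in (100, 1_000_000)}
print(noise_fraction)
```

With a fixed budget, the noise per answer grows in step with the number of queries, and noise that is negligible for a million records is a visible share of a data set of one hundred.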
Large companies such as Apple and Google already apply differential privacy to protect user data. Software companies like Privitar use the method as well. Differential privacy has also found applications in cloud security.
Differential privacy has been gaining momentum in recent years and is likely to continue on that upward trajectory for the foreseeable future.