How to Protect Anonymity in Machine Learning

The General Data Protection Regulation (GDPR) just went into effect in the European Union. The law is designed to protect individual privacy by requiring explicit permission for data collection and enforcing strict data usage policies. Companies, particularly those that employ machine learning, have complained about the onerous regulation, claiming that their existing security practices already protect individual privacy. The problem is that some of these practices, like relying on subject anonymity, have been shown to fail. Fortunately, an emerging technique called Differential Privacy aims to protect anonymity in machine learning.

TELEGRID recently completed a review by an Institutional Review Board (IRB) for a test involving biometrics. An IRB is an ethics board that approves and monitors research involving human subjects – think of drug trials for the Food and Drug Administration. In fact, when our Government customer first asked us to go before an IRB, I replied, “Why? I am not putting shampoo in anyone’s eyes.” However, since we were collecting biometrics from human subjects, we were required to explain to the IRB how we intended to protect the subjects’ information from data leakage.

Aside from the mountain of paperwork, the process was fairly painless, and we were deemed exempt based on the type of data we were collecting, the level of security we maintain at our offices, and our ability to maintain subject anonymity. While I agree that the type of data and our cybersecurity controls are a reasonable basis for exemption, past studies have raised doubts about the ability to protect anonymity in machine learning.

For instance, in 2006, researchers from the University of Texas at Austin were able to identify Netflix users by matching a database of anonymous users’ movie preferences against users who had publicly posted movie ratings on IMDb. Anonymity was challenged again in 2013, when a Harvard professor identified 40% of a sample of anonymous participants in the Personal Genome Project. While both studies relied on a secondary dataset, which may not always be available, they did show that it is possible to identify subjects in anonymized databases.

To protect anonymity in machine learning, researchers have been working on a new technique called Differential Privacy. Differential Privacy allows machine learning algorithms to arrive at essentially the same conclusion whether or not any one subject is included in the input data set. To explain it, we will use the classic example of a pollster asking subjects which political party they voted for. If the pollster collected other data that can be cross-referenced against a public database, it is possible to identify a subject and their voting history. To institute Differential Privacy, we would instead ask each subject to flip a coin and, based on the result, either tell the truth or lie about how they voted. Using statistics, it is possible to extract the ‘noise’ of the coin flips at the aggregate level.
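To make the coin-flip idea concrete, here is a minimal sketch of randomized response in Python. It is an illustration, not production code, and it makes one assumption beyond the description above: each subject answers truthfully with probability 0.75 (equivalent to flipping two fair coins and lying only when both land tails), because a single fair coin that decides truth-or-lie would leave no signal to recover.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the subject's true answer with probability p_truth, otherwise lie.

    p_truth = 0.75 corresponds to flipping two fair coins and lying only when
    both land tails; no single reported answer can be taken at face value.
    """
    if random.random() < p_truth:
        return true_answer      # the coins say: tell the truth
    return not true_answer      # the coins say: lie

# Example: a subject who actually voted for Party A (encoded here as True)
reported = randomized_response(True)
print(reported)  # True or False -- the subject can plausibly deny either answer
```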

In short, Differential Privacy is the controlled injection of noise into a data sample to provide a subject with the ability to plausibly deny that they gave a specific response.
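Because the noise is injected at a known rate, it can be averaged out across many subjects even though no single answer can be trusted. The sketch below is again purely illustrative: the 75% truth probability, the 30% true ‘yes’ rate, and the 100,000 simulated subjects are assumptions for the demonstration, not figures from any real study.

```python
import random

P_TRUTH = 0.75  # assumed probability that a subject answers truthfully

def estimate_true_rate(reports: list, p_truth: float = P_TRUTH) -> float:
    """Estimate the population 'yes' rate from randomized responses.

    If the true rate is pi, the expected reported rate is
        lambda = p_truth * pi + (1 - p_truth) * (1 - pi),
    so solving for pi gives (lambda - (1 - p_truth)) / (2 * p_truth - 1).
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

# Simulation: 100,000 subjects, 30% of whom truly answer 'yes'; each reports
# the truth with probability P_TRUTH and lies otherwise.
truth = [random.random() < 0.30 for _ in range(100_000)]
reports = [t if random.random() < P_TRUTH else not t for t in truth]
print(round(estimate_true_rate(reports), 3))  # close to 0.30 despite the noise
```

The pollster recovers an accurate aggregate answer, while any individual subject retains plausible deniability about their own response.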

Differential Privacy is still in its infancy and requires larger data sets to overcome the injected noise, but it is currently the most promising option we have to protect anonymity in machine learning. If you would like to learn more about Differential Privacy, I would suggest starting with this episode of the podcast This Week in Machine Learning & AI.

Eric Sharret is Vice President of Business Development at TELEGRID.  TELEGRID has unique expertise in secure authentication, PKI, Multi-Factor Authentication, and secure embedded systems.


Disclaimer: The opinions expressed here do not represent those of TELEGRID Technologies, Inc.  The Company will not be held liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use.  All information is provided on an as-is basis.