Tech Firms Can Easily Identify You Using Anonymized Data On Yourself
Chitanis - Aug 27, 2019
Researchers have shown that even if your personal information has been anonymized, modern techniques can still identify you.
Just by living in the modern world, you hand over a lot of personal information to services and institutions. Many of them promise to keep your data private and secure, but in practice they often share your anonymized data with third parties, either for profit or for research. New research, however, shows that anonymized data isn't so anonymous.
Recently, researchers at Imperial College London published a paper titled "Estimating the success of re-identifications in incomplete datasets using generative models," showing that the techniques currently used to anonymize datasets are insufficient. Before sharing a dataset, companies delete directly identifying information (names, email addresses, etc.). But even with those identifiers removed, it is often possible to cross-reference the remaining attributes and re-identify the people behind the records with high accuracy.
The researchers analyzed 210 datasets drawn from five sources, including US government data covering more than 11 million individuals. According to the study, using a machine learning model on datasets containing 15 demographic attributes (gender, date of birth, age, marital status, ZIP code, etc.), the researchers could re-identify up to 99.98% of people in an anonymized dataset. The findings, the researchers say, propose and validate "a statistical model to quantify the likelihood for a re-identification attempt to be successful, even if the disclosed dataset is heavily incomplete."
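The intuition behind the result can be illustrated with a small sketch. The following Python snippet (a toy simulation with made-up attributes, not the researchers' model) counts how many people in a synthetic population have a combination of quasi-identifiers that is unique to them. Even with only four coarse attributes, a large share of records become unique; with 15 attributes, as in the study, nearly everyone does.

```python
import random

random.seed(0)

# Toy population: each person is a tuple of quasi-identifier values.
# (gender, birth_year, zip_prefix, marital_status) -- hypothetical attributes.
def make_person():
    return (
        random.choice(["M", "F"]),
        random.randint(1950, 2000),
        random.randint(900, 961),          # coarse 3-digit ZIP prefix
        random.choice(["single", "married", "divorced"]),
    )

population = [make_person() for _ in range(10_000)]

def unique_fraction(records, attr_indices):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = {}
    for r in records:
        key = tuple(r[i] for i in attr_indices)
        counts[key] = counts.get(key, 0) + 1
    return sum(1 for r in records
               if counts[tuple(r[i] for i in attr_indices)] == 1) / len(records)

# Uniqueness grows sharply as more attributes are combined.
for k in range(1, 5):
    print(f"{k} attribute(s): {unique_fraction(population, range(k)):.3f} unique")
```

Gender alone identifies no one, but gender plus birth year, ZIP prefix, and marital status already makes a majority of this toy population unique, which is why adding attributes up to 15 drives re-identification toward certainty.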
The study offers a hypothetical: a health insurance company releases a dataset of 1,000 anonymous customers, 1% of its total customers in California. The dataset includes each customer's ZIP code, gender, date of birth, and breast cancer diagnosis. One of these individuals' boss notices a record for a man with the same date of birth and ZIP code as his employee and concludes, based on the dataset, that the employee has breast cancer and that his stage IV treatment did not succeed. The insurer, however, can argue that even an exact match on these attributes could correspond to someone else among the tens of thousands of people it insures, since the released sample is incomplete.
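The boss's inference in this hypothetical amounts to a simple filter over the released records. The sketch below uses entirely made-up records and field names to show the mechanics: if only one record survives the filter, the "anonymous" entry is effectively identified, and the insurer's defense rests on how many unreleased customers might also match.

```python
# Hypothetical released records: (zip_code, gender, dob, diagnosis).
# All values are invented for illustration.
released = [
    ("94110", "M", "1975-01-05", "stage IV breast cancer"),
    ("94110", "F", "1982-03-12", "none"),
    ("90210", "M", "1975-01-05", "none"),
]

def matches(records, zip_code, gender, dob):
    """Return every released record consistent with the known attributes."""
    return [r for r in records if r[:3] == (zip_code, gender, dob)]

# The boss knows the employee's ZIP code, gender, and date of birth.
hits = matches(released, "94110", "M", "1975-01-05")
```

The paper's contribution is precisely to quantify the insurer's defense: given how unusual the matching attribute combination is, how likely is it that the single hit really is the employee rather than one of the unreleased customers?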
Luc Rocher, one of the paper's authors and a researcher at Université Catholique de Louvain, said: "While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog."
Yves-Alexandre de Montjoye, the paper's senior author, characterized these attributes as "pretty standard information for companies to ask for."
The study's hypothetical is not pure fiction. In June, a patient at the University of Chicago Medicine sued both Google and the private research university for sharing his personal data without his permission. The medical center supposedly de-identified the dataset, yet it still provided Google with records of patients' vital signs, height, weight, diagnoses, medical procedures, medications, and date stamps. The complaint not only highlighted the privacy hole in sharing patient data without consent, but also pointed out that even when data is anonymized, powerful tech corporations can use their tools to reverse-engineer it and identify individuals.
Many companies are now collecting datasets that contain enough information to identify someone, and the fact that the researchers could re-identify users from only 15 attributes shows that we need to reevaluate what constitutes an ethically anonymized dataset.
“Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete,” Mr. de Montjoye said. “Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for.”
According to the researchers, policymakers have a responsibility to set better standards for anonymization techniques, so that sharing datasets does not become an invasion of privacy. "The goal of anonymization is so we can use data to benefit society," said Mr. de Montjoye. "This is extremely important but should not and does not have to happen at the expense of people's privacy."