The world needs more data.

And people want stricter privacy guarantees when it comes to the collection, use, and dissemination of their personal details.

The problem is that getting useful datasets means less privacy, but guaranteeing privacy means either less data, or stripping datasets to the point they are no longer useful.

How can we increase both data utility and privacy?

At the risk of simplifying a complex problem: synthetic data is the solution. Synthetic data provides useful datasets to train AI without relying on people’s personal and private information.

Let’s take a step back. First, why do we need more data? To train AI[1]. AI makes up our today and our tomorrow: we are already leveraging AI towards a future of self-driving cars, robot surgeons and virtual assistants. Machine learning and deep learning, as subsets of AI, make up the new programming paradigm, where engineers ask how a computer can automatically learn its own rules for performing a task just by looking at data. With machine learning, humans input data along with the answers expected from that data, and the computer figures out the rules that connect the two (this is the AI, so to speak). The resulting model can then be deployed on new data to produce original answers. Bottom line: in general, the more data a model can train on, the better the model will perform.
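To make the "data plus answers in, rules out" idea concrete, here is a deliberately tiny sketch. The data, the function names and the one-number "rule" are all made up for illustration; real models learn far richer rules, but the shape of the process is the same.

```python
# Toy illustration: the "rule" is a single cut-off value learned from examples.
# All names and numbers here are hypothetical.

def learn_threshold(inputs, answers):
    """Return the cut-off that classifies the most training examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(inputs)):
        # Count how many examples the rule "x >= candidate" gets right.
        correct = sum((x >= candidate) == y for x, y in zip(inputs, answers))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def predict(threshold, x):
    """Apply the learned rule to a new, unseen input."""
    return x >= threshold


# Humans supply the data and the expected answers...
inputs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
answers = [False, False, False, True, True, True]

# ...and the computer works out the rule.
threshold = learn_threshold(inputs, answers)
```

Calling `predict(threshold, 6.5)` then applies the learned rule to data the model has never seen. With only six examples the rule is crude, which is the point: more (and better) examples give the computer more to go on.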

So, to push technological development, we need more data. But not just any data – we need quality data. A model will only be as learned as the data on which it is trained.[2]

Which leads us to our second question: where do we get the data now? Today, the norm is to use real datasets. Walk down any street in San Francisco and you’re all but guaranteed to see at least one car outfitted with sensors and cameras, gathering data to train its autonomous vehicle brethren. Also par for the course is data scraped off the internet. That old picture you uploaded to that website you built in third grade? Publicly available, so yeah, that’s fair game.

There are many problems with using real data to train AI. Besides the more technical problems (e.g., the necessary labelling/annotating of data is a tedious and imprecise manual exercise that falls short of the detail and richness we need to meet the increasingly complex tasks we demand from our AI), using real data to train models is rife with privacy risks (especially now with the rise of comprehensive privacy regimes like the GDPR in Europe and the CCPA in California). To counteract these risks, real data must undergo a de-identification process, which, as mentioned above, reduces the utility of the dataset.

De-identification, sometimes referred to as anonymization, strips a dataset of personal identifiers. What is anonymized, and how, matters: if the data elements used to identify an individual are removed (i.e., anonymized) from a dataset, the remaining data becomes non-personal information, and privacy and data protection laws generally do not apply. But the dataset is now less rich and has less information on which an AI can train.
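As a rough sketch of what stripping identifiers looks like in practice, the snippet below drops a hand-picked set of fields from each record. The records, field names and the choice of which fields count as identifiers are all assumptions made for this example; real de-identification standards are considerably more involved.

```python
# Hypothetical list of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email"}


def de_identify(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with the identifying fields removed."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Made-up records for illustration only.
records = [
    {"name": "Ada Example", "email": "ada@example.com",
     "zip": "94107", "birth_year": 1985, "diagnosis": "asthma"},
    {"name": "Ben Example", "email": "ben@example.com",
     "zip": "94110", "birth_year": 1990, "diagnosis": "diabetes"},
]

deidentified = de_identify(records)
```

Each record goes from five fields to three. That is the trade-off in miniature: the names are gone, and so is some of what a model could have learned from.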

Further, while there is a regulatory distinction between de-identified/anonymized information and pseudonymized data (the legal term for data that can still be linked back to individuals with the help of additional information), the truth of the matter is that all anonymized data is subject to reversal. The only real bar is the state of technology at a given point in time. Anonymized data today becomes pseudonymized data tomorrow as AI becomes better at re-identifying data points. In the future, algorithms will likely be capable of linking seemingly innocuous data points to construct very intimate profiles of us.
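Linking does not even require sophisticated AI. The sketch below joins a "de-identified" dataset to a public one on two innocuous-looking fields. Both datasets, the field names and the `link` helper are hypothetical, invented purely to show the mechanics.

```python
def link(deidentified, public, keys=("zip", "birth_year")):
    """Re-attach names where the shared fields point to exactly one person."""
    # Index the public records by the fields both datasets have in common.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the record is re-identified
            matches.append({**record, "name": candidates[0]["name"]})
    return matches


# Made-up "anonymized" records: names removed, other fields kept.
deidentified = [
    {"zip": "94107", "birth_year": 1985, "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1990, "diagnosis": "diabetes"},
]

# Made-up public list (think of any openly available directory).
public = [
    {"name": "Ada Example", "zip": "94107", "birth_year": 1985},
    {"name": "Ben Example", "zip": "94110", "birth_year": 1990},
    {"name": "Cy Example", "zip": "94110", "birth_year": 1990},
]

reidentified = link(deidentified, public)
```

Here the first record matches exactly one person and gets its name back, while the second stays ambiguous only because two people happen to share its zip code and birth year. Add one more shared field and that ambiguity could disappear too.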

And thus our third question: where can we get data that is useful and not inevitably subject to re-identification? Enter synthetic data.

Synthetic data is useful: it is computer-generated and thus inherently boasts pixel-perfect labels and annotations. It also has the potential to cover all edge cases, using ML techniques to augment real distributions.
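Why are the labels pixel-perfect? Because the program that draws the data also knows exactly what it drew. Here is a minimal sketch, assuming a toy "image" that is just a grid of 0s and 1s containing one rectangle; the function name, grid size and label format are illustrative choices, not any particular tool's output.

```python
import random


def make_sample(rng, size=8):
    """Draw one filled rectangle on a blank grid and return it with its exact label."""
    width = rng.randint(2, 4)
    height = rng.randint(2, 4)
    x0 = rng.randint(0, size - width)
    y0 = rng.randint(0, size - height)

    image = [[0] * size for _ in range(size)]
    for y in range(y0, y0 + height):
        for x in range(x0, x0 + width):
            image[y][x] = 1

    # No human annotator needed: the label is known because we placed the shape.
    label = {"x": x0, "y": y0, "width": width, "height": height}
    return image, label


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [make_sample(rng) for _ in range(100)]
```

Need more small rectangles, or more in the corners? Change the ranges and generate again. Real synthetic data pipelines are vastly more sophisticated, but the labels come for free in the same way.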

Synthetic data also erases privacy concerns. With real data, we can snooze the consequences: try to strip it (generalize and suppress it) to the point where, today, we can no longer identify the discrete real data points within the set. But this is a temporary band-aid. Synthetic data is fake data: it contains no personal identifiers that could be susceptible to re-identification down the road. Synthetic data guarantees privacy by changing the paradigm and getting rid of any need to use real data.
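For the curious, "generalize and suppress" on real data looks roughly like the sketch below: coarsen the fields that could single someone out, then drop any record that is still too distinctive. The records, the masking rules and the group-size cut-off `k` are assumptions chosen for illustration.

```python
from collections import Counter


def generalize(record):
    """Coarsen the fields that could single someone out."""
    return {
        "zip": record["zip"][:3] + "**",                 # keep only the zip prefix
        "birth_decade": record["birth_year"] // 10 * 10,  # 1985 becomes 1980
        "diagnosis": record["diagnosis"],
    }


def suppress_small_groups(records, k=2):
    """Drop records whose (zip, birth_decade) combination is shared by fewer than k people."""
    def group_key(record):
        return (record["zip"], record["birth_decade"])

    counts = Counter(group_key(r) for r in records)
    return [r for r in records if counts[group_key(r)] >= k]


# Made-up records for illustration only.
records = [
    {"zip": "94107", "birth_year": 1985, "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1988, "diagnosis": "diabetes"},
    {"zip": "94703", "birth_year": 1972, "diagnosis": "migraine"},
]

released = suppress_small_groups([generalize(r) for r in records], k=2)
```

Three records go in and two blurrier ones come out: the third is dropped because nobody else shares its generalized zip code and decade. Less detail and fewer rows is exactly the utility cost described earlier.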

So yes, this is a simplification of a complex problem, but synthetic data may be how we strike the balance between privacy and utility.

With synthetic data, we can have our cake and eat it too: more precise, accurate, and complex AI (which necessitates detailed data), and guaranteed privacy.

Thoughts, questions, comments? Reach out, we’re always ready to talk synthetic data.

[1]   In the words of François Chollet, AI & deep learning researcher and developer of Keras: “A concise definition of the field of [AI] would be as follows: the effort to automate intellectual tasks normally performed by humans. AI is a general field that encompasses machine learning and deep learning…” See Chollet, F. Deep Learning with Python. Manning Publications (2017).

[2]   Moreover, models are notoriously ‘stupid’: they will find the path of least resistance and follow that until taught differently. AI models are creatures of statistics: they will output the statistics of the dataset they were trained on. A convolutional neural network (CNN) visualizer is a nifty way to see this for yourself. Because of this reality, anyone training AI needs to be very careful and cognizant of the limits and biases inherent in each dataset.

