If your work leverages data of any kind, you’ve almost certainly received a pedantic reminder that the word ‘data’ is plural. Its singular form is ‘datum.’

Simple enough, in theory: “The data are under review.”

But recently, while working on our style guide to achieve consistency across our product and communications, we realized that our team was unclear on how ‘data,’ as a plural, reconciles with the word that turns it into a collection: ‘set.’

In short, is it data set or dataset? Because, as we realized, we’ve been using both.

It seems we’re not the only ones. While the Wikipedia page for data set features the phrase as two words, it includes a parenthetical instance of dataset, suggesting that it’s a common and acceptable alternative.

Google Books Ngram Viewer suggests that while ‘data set’ was indeed more common until recently, ‘dataset’ took the lead in 2013. That is, unless you flip the setting from “English 2019” to “American English 2019,” in which case ‘data set’ remains slightly more common.

A screenshot of Google Ngram Viewer illustrating the debate of data set or dataset.

Google Trends confirms that, in the United States at least, ‘data set’ is a more common search term than ‘dataset,’ and has remained so since overtaking it in 2010. Toggle the search filter to “Worldwide,” however, and you see that ‘dataset’ was the more common search query globally until very recently.

The Merriam-Webster Dictionary prefers to ignore the debate altogether: neither ‘data set’ nor ‘dataset’ appears, not even as an example in its entry on ‘data.’

Meanwhile, Dictionary.com provides results exclusively for ‘data set.’ It also redirects you to that entry should you type ‘dataset.’ The only dictionary we were able to find that includes an entry for ‘dataset’ was Macmillan Dictionary, but that publication includes a separate entry for ‘data set’ with a distinct definition. That is troubling, considering that the consensus otherwise appears to be that the terms may be used interchangeably.

The AP Stylebook not only comes up short (it lists ‘databank’ and ‘database’ as one word, but ‘data processing’ as two), it also muddies the waters on whether ‘data’ should be treated as plural: “The word typically takes singular verbs and pronouns when writing for general audiences and in data journalism contexts: The data is sound. In scientific and academic writing, plural verbs and pronouns are preferred.” Bold, not at all confusing distinction!

Generally, at this point we’d turn to the wisdom of the crowd. Fortunately for us, machine learning researcher Randy Olson beat us to the punch. In 2017, he tweeted a poll on the question. Respondents preferred ‘dataset’ as one word by a 3:1 ratio.

 

Where does this leave us? We don’t know. Is it possible we’re overthinking it? Most definitely. It’s been three years since Olson’s Twitter poll, and we don’t seem to be any closer to a broad consensus. So, we’ll leave it up to you. Of the following, which do you prefer?

Accelerate your model development with trusted and reliable synthetic data sets from Zumo Labs. Book a demo today.

Accelerate your model development with trusted and reliable synthetic datasets from Zumo Labs. Book a demo today.

Let me know at michael@zumolabs.ai.

