Data is not made. Data is born as the result of a measurement process. Taking measurements (in conjunction with a measurement theory) creates data. But then, what should we call, in contrast, the results of simulations, the output of theoretical models? Some might object that this is not an interesting question in the first place, just pointless (and rather nitpicky) semantics. I must disagree. The question is of critical importance, as “simulated data” is a contradiction in terms. Data concerns the state of the real world, the world of things. The outputs of a simulation firmly belong to the realm of ideas. It is critical not to confuse the two, lest we make ill-informed statements about the real world based solely on what we observed in models or simulations. This matters, as the financial crisis of 2008 impressively showed, when it became apparent that the if in “risk is only tamed if the real world risk fits the risk model” is indeed a big (and uncertain) if. Entire fields can be built on the confusion between models that work for idealized systems and the real world they are trying to account for, as Ricardo and the rise of economics show. That doesn’t mean one should attempt this.
With that in mind, what should we call “simulated data”? The term contradicts itself, and it bugs me that the rise of data science makes me qualify actual data with the term “real”. “Predictions” would be needlessly imperialist, as many (if not most) models deal solely with postdiction (there is nothing inherently wrong with this). I tried to make “sata” happen, but that has not caught on (yet).
PS: Lest you think that I’m needlessly pedantic and that this is a distinction without a difference: it really does matter. Simulations, modeling and “simulated data” (sata) all have a role in the scientific process. But they are no substitute for data. In an age of sophisticated modeling, what one uses as a training (and test) set matters a great deal. Actual, real data, well recorded, is best. To wit: http://www.pnas.org/content/early/2016/06/27/1602413113.long