Photographer: Joey Roberts
Imagine you, a researcher, are given a bag of money, unlimited time and personnel. What research would you do? Stephan Smeekes, data scientist at the School of Business and Economics, would like to investigate why computer programmes often find ‘nonsensical connections’ in big data. This wouldn't require large investments; it would just take a lot of time.
Did you know that people who eat a lot of cheese run a greater risk of dying by getting entangled in their bedding? Or that the older Miss America is that year, the more murders are committed in the USA with steam or hot objects? Or that the chance of drowning after a fall into a swimming pool is greater in years in which many films with Nicolas Cage have been released?
Ridiculous of course. “Researchers put those variables into a programme as a joke,” says Smeekes. “Just by coincidence, more people drowned after falling into a swimming pool in the years in which a lot of Nicolas Cage films came out. If that pattern repeats in roughly the same way for a few years in a row, statistical programmes will report strong correlations when you analyse the variables.” The fact that these are ‘nonsensical correlations’ is not always obvious, especially when the amounts of data are enormous.
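The mechanism Smeekes describes is easy to see in miniature. A sketch, using invented numbers (not real cheese or mortality figures): two unrelated yearly series that both happen to drift upward will score a correlation close to 1.

```python
# Illustrative only: the data below are invented. The point is that any two
# series that both trend upward over the same years will show a strong
# Pearson correlation, whether or not there is any real connection.

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

cheese_kg = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24]                  # invented
bedding_deaths = [100, 104, 109, 111, 118, 120, 126, 129, 133, 139]   # invented

r = pearson(cheese_kg, bedding_deaths)
print(f"correlation: {r:.2f}")  # close to 1, yet the link is nonsense
```

A statistical programme fed these two columns would flag a near-perfect relationship; nothing in the number itself reveals that the pairing is absurd.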
Correct correlations are not only of great interest to science but also to businesses, because they make it possible to map out the exact characteristics and wishes of (potential) clients. With this, companies can target their advertising strategically, Smeekes explains. The same goes for insurance companies. “They already want to know your age and postal code; young people and inhabitants of busy cities pay higher premiums because the risk of damage is greater.” Big data can reveal down to the smallest detail which characteristics form a risk: film preferences, professions, or maybe something totally unexpected, such as the colour of your hair?
But for that, the correlations have to be correct, and you can only know that if the big data has been analysed in the proper way. “The classic (SPSS) methods used by statisticians are no longer suitable for the large amounts of data that are now available, so increasingly complex computer techniques are being applied. These so-called ‘machine learning’ techniques – a type of artificial intelligence – were initially developed to recognise pictures and handwriting in huge amounts of data: for example, to pick out photographs of cats from thousands of pictures.”
They can also be used to find patterns in economic data, but this often goes wrong. “The algorithms make correlations that they shouldn't make. For the years 1971 to 1990, for example, a clear relation was found between child mortality in Egypt, the gross income of American farmers, and the amount of money in Honduras. I would like to see how I could adapt the methods and algorithms in such a way that the programme itself can learn and discover whether something is nonsense or not. To do so, I would have to write out and analyse the mathematics behind the algorithm. For that I would need a great amount of time.”
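The failure mode behind examples like the Egypt–Honduras one is known in econometrics as spurious regression: two series that each wander about like random walks will, purely by chance, often look strongly related. A minimal simulation, independent of Smeekes' own examples, makes the effect visible by comparing random walks with plain white noise:

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def random_walk(n, rng):
    """Cumulative sum of independent standard-normal steps."""
    walk, level = [], 0.0
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        walk.append(level)
    return walk

rng = random.Random(42)
walk_r, noise_r = [], []
for _ in range(300):
    # Two independent random walks: no relation whatsoever, by construction.
    walk_r.append(abs(pearson(random_walk(100, rng), random_walk(100, rng))))
    # Two independent white-noise series of the same length, for comparison.
    u = [rng.gauss(0.0, 1.0) for _ in range(100)]
    v = [rng.gauss(0.0, 1.0) for _ in range(100)]
    noise_r.append(abs(pearson(u, v)))

print(f"mean |r|, independent random walks: {statistics.mean(walk_r):.2f}")
print(f"mean |r|, independent white noise:  {statistics.mean(noise_r):.2f}")
```

For the noise series the average correlation stays near zero, as it should; for the random walks it is several times larger, even though both pairs are equally unrelated. Economic series (incomes, price levels, mortality rates) typically behave like the walks, which is why off-the-shelf algorithms keep "discovering" such relations.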
The unreliability of the results these analyses now produce is dangerous, says Smeekes. “If computer analyses were to show that gender had an influence on the number of claims received by insurance companies, they would, aside from the ethical aspects, have to be very sure before adapting their premiums. The same applies to the police when they draw up criminal profiles on the basis of large amounts of data.”
Determining the reliability of these results is therefore the second part of his dream research. “I want to be able to say: ‘We can say with X per cent certainty that the results from the analysis correspond with reality’. If that percentage is very high, you can include the smallest and craziest details in, for example, economic forecasts.”