Tuesday, 19 July 2016

Big Data Headaches

Data driven decision making has proven to be key for organisational performance improvements. This stimulates organisations to gather data, analyse it and use decision support models to improve their decision making speed and quality. With the rapid decline in cost of both storage and computing power, there are nearly no limitations to what you can store or analyse. As a result organisations have started building data lakes and invested in big data analytics platforms to store and analyse as much data as possible. This is especially true in the consumer goods and services sector where big data technology can been transformative as it enables a very granular analysis of human activity (up to the personal level). With these granular insights companies can personalise their offerings, potentially increasing revenue by selling additional products or services. This allows for new business models to emerge and is changing the way of doing business completely. As the potential of all this data is huge, many organisations are investing in big data technology expecting plug and play inference to support their decision making. The big data practice however is something different and is full of rude awakenings and headaches.

That big data technology can create value is proven by the fact that companies like Google, Facebook and Amazon exist and do well. Surveys from Gartner and IDC show that the number of companies adopting big data technology is increasing fast. Many of them want to use this technology to improve their business and start using it in an exploratory manner. When asked about the results they get from their analysis many of them respond that they experience difficulty in getting results due to data issues, others report difficulty getting insights that go beyond preaching to the choir. Some of them even report disappointment as their outcomes turn out to be wrong when put into practice. Many times the lack of experienced analytical talent is mentioned as a reason for this, but there is more to it. Although big data has the potential to be transformative, it also comes with fundamental challenges which when not acknowledged can cause unrealistic expectations and disappointing results. Some of these challenges are even unsolvable at this time.

Even if there is a lot of data, it can’t be used properly

To illustrate some of these fundamental challenges, let’s take an example of an online retailer. The retailer has data on its customers and uses it to identify generic customer preferences. Based on the identified preferences offers are generated and customers targeted. The retailer wants to increase revenue and starts to collect more data on the individual customer level. The retailer wants to use the additional data to create personalised offerings (the right product, at the right time, for the right customer, at the right price) and to make predictions about future preferences (so the retailer can restructure its product portfolio continuously). In order to do so the retailer needs to find out what the preferences of its customers are and the drivers of their buying behaviour. This requires constructing and testing hypotheses based on the customer attributes gathered. In the old situation the number of available attributes (like address, gender, past transactions) was small. Therefore only a small number of hypothesis (for example “women living in a certain part of the city are inclined to buy a specific brand of white wine”) can be tested to cover all possible combinations. However with the increase in the number of attributes, the number of combinations of attributes that are to be investigated increases exponentially. If in the old situation the retailer had 10 attributes per customer, a total of 1024 (=210) possible combinations needed to be evaluated. However when the number of attributes increases to say 500 (which in practice is still quite small), the number of possible combinations of attributes increases to 3.27 10150  (=2500) This exponential growth causes computational issues as it becomes impossible to test all possible hypotheses even with the fastest available computers. The practical way around this is to significantly reduce the number attributes taken into account. This will leave much of the data unused and many possible combinations of attributes untested, therefore reducing the potential to improve. This might also cause much of the big data analysis results to be too obvious.

The larger the data set, the stronger the noise

There is another problem with analysing large amounts of data. With the increase in the size of the data set, all kinds of patterns will be found but most of them are going to be just noise. Recent research has provided proof that as data sets grow larger they have to contain arbitrary correlations. These correlations appear due to the size, not the nature, of the data, which indicates that most of the correlations will be spurious. Without proper practical testing of the findings, this could cause you to act upon a phantom correlation. Testing all the detected patterns in practice is impossible as the number of detected correlations will increase exponentially with the data set size. So even though you have more data available you’re worse of as too much information behaves like very little information. Besides the increase of arbitrary correlations in big data sets, testing the huge number of possible hypotheses is also going to be a problem. To illustrate, using a significance level of 0.05, testing 50 hypothesis on the same data will give at least one significant result with a 92% chance.

P(at least one significant result) = 1 − P(no significant results) = 1 − (1 − 0.05)50 ≈ 92%

This implies that we will find an increasing number of statistical significant results due to chance alone. As a result the number of False Positives will rise, potentially causing you to act upon phantom findings. Note that this is not only a big data issue, but a small data issue as well. In the above example we already need to test 1024 hypotheses with 10 attributes.

Data driven decision making has nothing to do with the size of your data

So, should the above challenges stop you from adopting data driven decision making? No, but be aware that it requires more than just some hardware and a lot of data. Sure, with a lot of data and enough computing power significant patterns will be detected even if you can’t identify all the patterns that are in the data. However, not many of these patterns will be of any interest as spurious patterns will vastly outnumber the meaningful ones.  Therefore, with the increase in size of the available data also the skill level for analysing the data needs to grow. In my opinion data and technology (even a lot of it) is no substitute for brains. The smart way to deal with big data is to extract and analyze key information embedded in “mountains of data” and to ignore most of it. You could say that you first need to trim down the haystack to better locate where the needle is. What remains are collections of small amounts of data that can be analysed much better. This approach will prevent you from getting a big headache from your big data initiatives and will improve both speed and quality of drive data driven decision within your organisation.
Post a Comment