Sunday, 5 November 2017

Averaging Risk is Risking the Average


Cartoon: http://www.danzigercartoons.com/
To assure public safety, regulatory agencies like the ACM in the Netherlands or Ofgem in the UK monitor the performance of gas and electricity grid operators in areas like cost, safety and the quality of their networks. The regulator compares the performance of the grid operators and decides on incentives to stimulate improvements in these areas. A difficulty with these comparisons is that grid operators use different definitions and/or methodologies to calculate performance, which complicates a like-for-like comparison of, for example, asset health, criticality or risk across the grid operators. In the UK this has led to a new concept for risk calculations: monetised risk. In calculating monetised risk, not only the probability of failure of the asset is used; the probability of the consequence of a failure and its financial impact are also taken into account. The question is whether this new method delivers more insightful risk estimates that allow for a better comparison among grid operators. And will it support fair risk trading among asset groups or the development of improved risk mitigation strategies?

The cost-risk trade-off that grid operators need to make is complex. Costly risk-reducing adjustments to the grid need to be weighed against the rise in the cost of operating the network and therefore the rates consumers pay for using the grid. To make the trade-off, an estimate of the probability of failure of an asset is required. In most cases, specific analytical models are developed to estimate these probabilities. Using pipeline characteristics like material type and age, together with data on the environment the pipeline is in (e.g. soil type, temperature and humidity), pipeline-specific failure rate models can be created. Results from inspections of the pipeline can be used to further calibrate the model. Due to the increased analytics maturity of grid operators, these models are becoming more common. Grid operators are also starting to incorporate these failure rate models in the creation of their maintenance plans.
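As an illustration of what such a failure rate model could look like, here is a minimal sketch that fits a logistic regression on synthetic pipeline attributes. The attribute names, the data and the use of scikit-learn are my own assumptions for the example, not the method any particular grid operator uses.

```python
# Minimal sketch of a pipeline failure-probability model (assumptions: synthetic
# data, illustrative attribute names, logistic regression via scikit-learn).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
pipes = pd.DataFrame({
    "age_years": rng.uniform(0, 60, n),
    "is_cast_iron": rng.integers(0, 2, n),        # 1 = cast iron, 0 = PE/steel
    "soil_corrosivity": rng.uniform(0, 1, n),     # 0 = benign, 1 = aggressive
})
# Synthetic ground truth: older cast-iron pipes in corrosive soil fail more often.
logit = -5 + 0.05 * pipes["age_years"] + 1.2 * pipes["is_cast_iron"] + 1.5 * pipes["soil_corrosivity"]
pipes["failed"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pipes[["age_years", "is_cast_iron", "soil_corrosivity"]].to_numpy()
model = LogisticRegression().fit(X, pipes["failed"])
# Estimated failure probability for a 45-year-old cast-iron pipe in corrosive soil.
print(model.predict_proba([[45, 1, 0.8]])[0, 1])
```

Inspection results could then be folded in by recalibrating the model on the inspected subset, in line with the calibration step described above.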

Averaging the Risk


As you can probably imagine, there are many ways to construct failure rate models. This makes it difficult for a regulator to compare reported asset conditions across grid operators, as these estimates could be based on different assumptions and modelling techniques. That is why, in the UK at least, the four major gas distribution networks (GDNs) agreed to standardise the approach. In short, the method can be described as follows.
  1. Identify the failure modes of each asset category/sub group in the asset base and estimate the probability of failure for each identified failure mode. 
  2. For each failure mode the consequences of the failure are identified, including the probability of the consequence occurring.
  3.  For each consequence the monetary impact is estimated. 
  4. By summing over all failure modes and consequences, a probability-weighted estimate of monetised risk for an asset category/sub group is calculated. Summing over all asset categories/sub groups then gives the total level of monetised risk for the grid (see the sketch below).
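The calculation itself is a straightforward weighted sum. The sketch below shows it for a made-up asset group with two failure modes; the figures and the structure of the input are illustrative assumptions, not numbers from the official methodology.

```python
# Monetised risk as a probability-weighted sum over failure modes and consequences.
# All figures below are made up for illustration.
failure_modes = [
    {   # failure mode 1: corrosion leak
        "p_failure": 0.02,
        "consequences": [
            {"p_consequence": 0.9, "impact_eur": 50_000},     # repair and gas loss
            {"p_consequence": 0.1, "impact_eur": 2_000_000},  # ignition incident
        ],
    },
    {   # failure mode 2: third-party damage
        "p_failure": 0.005,
        "consequences": [
            {"p_consequence": 1.0, "impact_eur": 250_000},
        ],
    },
]

monetised_risk = sum(
    fm["p_failure"] * c["p_consequence"] * c["impact_eur"]
    for fm in failure_modes
    for c in fm["consequences"]
)
print(f"Monetised risk for this asset group: EUR {monetised_risk:,.0f} per year")
# Summing this figure over all asset groups gives the grid-wide monetised risk.
```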


This new standardised way of calculating risks makes the performance evaluation much easier and allows for a more in-depth comparison. See the official documentation for more details on the method.

An interesting part of this new way of reporting risk is the explicit and standardised way of modelling asset failure, the consequence of asset failure and the cost of the consequence. This is similar to how a consolidated financial statement of a firm is created; you could therefore interpret it as a consolidated risk statement. But can risks of individual assets or asset groups be aggregated in the described way and still provide a meaningful estimate of the total actual risk? The approach described above sums the estimated (or probability-weighted average) risk for each asset category/sub group, so it's an estimate of the average risk for the complete asset base. However, risk management is not about looking at the average risk; it's about extreme values. Those who have read Sam Savage's The Flaw of Averages or Nassim Taleb's The Black Swan know what I'm talking about.

Risking the Average


Risks are characterised by extreme outcomes, not averages. To be able to analyse extreme values, a probability distribution of the outcome you're interested in is required. Averaging reduces the distribution of all possible outcomes to a point estimate, hiding the spread and likelihood of all possible outcomes. Averaging risks also ignores the dependence between the identified modes of failure or consequence. To illustrate, let's assume that we have 5 pipelines, each with a probability of failure of 20%. There is only one consequence (probability = 1) with a monetary impact of 1,000,000. The monetised risk per pipeline then becomes 200,000 (= 0.20 × 1,000,000), and for the total grid it is 1,000,000. If we take dependence between the failures into account, there is a 20% probability of all pipes failing when the failures are fully correlated, but only a 0.032% (= 0.2^5) chance of all pipes failing when they are fully independent. The probability-weighted impact of that worst case (all five pipes failing, costing 5,000,000) then ranges from 1,000,000 in the fully correlated case to 1,600 in the fully independent case. That's quite a range, and it isn't visible in the monetised risk approach.

Regulators must assess risk in many different areas. Banking has been top of mind in recent years, but industries like pharma and utilities have also received a lot of attention. How a regulator decides to measure and assess risk is very important. If risks are underestimated, this can harm society (a banking crisis, deaths due to the approval of unsafe drugs, or an increase in injuries due to pipeline failures). If risks are overestimated, costly mitigation might be imposed, again burdening society with high costs. The above example shows that the monetised risk approach is insufficient, as it estimates risk with averages, whereas in risk mitigation the extreme values are much more important. What, then, is a better way of aggregating these uncertainties and risks than just averaging them?

Monte Carlo Simulation


The best way to understand the financial impact of asset failure is to construct a probability distribution of all possible outcomes using Monte Carlo simulation, and to make the trade-off between costs and risk based on that distribution. Monte Carlo simulation has proven its value in many industries and in this case provides exactly what we need. Using the free tools of Sam Savage's probabilitymanagement.org, the above hypothetical example of 5 pipelines can be modelled and the distribution of financial impact analysed. In just a few minutes the cumulative distribution (CDF) of the financial impact for the 5-pipeline case, shown below, can be created. Remember that the monetised risk calculation resulted in a risk level equal to the average, 1,000,000.
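For readers who prefer code to spreadsheets, the sketch below reruns the same five-pipeline experiment as a plain Monte Carlo simulation in Python instead of the probabilitymanagement.org tools. The common-cause mechanism used to switch between independent and fully correlated failures is my own simplifying assumption, so the exact percentages it prints need not match the figures read from the graph below.

```python
# Monte Carlo sketch of the 5-pipeline example: failure probability 20% per pipe,
# impact 1,000,000 per failed pipe. "rho" switches between fully independent (0.0)
# and fully correlated (1.0) failures via a simple common-cause mechanism (my assumption).
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_pipes, p_fail, impact = 100_000, 5, 0.20, 1_000_000

def simulate_total_impact(rho: float) -> np.ndarray:
    common = rng.random(n_sims) < p_fail              # shared failure cause
    own = rng.random((n_sims, n_pipes)) < p_fail      # pipe-specific failure cause
    use_common = rng.random((n_sims, n_pipes)) < rho  # mix the two causes
    failures = np.where(use_common, common[:, None], own)
    return failures.sum(axis=1) * impact

for rho in (0.0, 1.0):
    total = simulate_total_impact(rho)
    print(f"rho={rho}: mean={total.mean():,.0f}, "
          f"P(impact > 1,000,000)={np.mean(total > 1_000_000):.0%}, "
          f"95th percentile={np.percentile(total, 95):,.0f}")
```

Whatever the exact dependence structure, the simulation gives you the whole distribution, so the probability of exceeding the monetised risk figure and any percentile of interest come out directly.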

From the graph it immediately follows that P(Financial Impact <= Monetised Risk) ≈ 33%, which implies that P(Financial Impact > Monetised Risk) ≈ 67%. So there is roughly a two-in-three chance that the financial impact of pipe failures will be higher than the calculated monetised risk. We are therefore taking a serious risk by using the averaged asset risks. Given the objective of better comparing grid operator performance and enabling risk trading between asset groups, the monetised risk method is too simple, I would say. By averaging the risks, the distribution of financial impact is rolled up into one number, leaving you no clue what the actual distribution looks like (see also Sam Savage's The Flaw of Averages). A better way would be to set an acceptable risk threshold (say 95%) and use the estimated CDF to determine the corresponding financial impact.

This approach would also allow for a better comparison of grid operators by creating a cumulative distribution for each of them and plotting them together in one graph (see the example below). In a similar way risk mitigations can be evaluated and comparisons made between different asset groups, allowing for better-informed risk trading.

Standardising the way in which asset failures and the consequences of failures are estimated and monetised is definitely a good step towards a comparable way of measuring risk. But risks should not be averaged in the way the monetised risk approach suggests. There are better ways, which provide insight into the whole distribution of risk. Given the available tools and computing power, there is no reason not to use them. It will improve our insight into the risks we face and help us find the best mitigation strategies for reducing public risk.

Friday, 27 January 2017

Want to get value from Data Science? Keep it simple and focussed!

What is the latest data science success story you have read? The one from Walmart? Maybe a fascinating result from a Kaggle competition? I'm always interested in these stories, wanting to understand what was achieved, why it was important and what the drivers for success were. Although the buzz around the potential of data science is very strong, the number of stories on impactful practical applications of data science is still not very large. The Harvard Business Review recently published an article explaining why organisations are not getting value from their data science initiatives. Although there are many more reasons than mentioned in the article, one key reason for many initiatives to fail is a disconnect between the business goals and the data science efforts. The article also states that data scientists focus on fine-tuning their models instead of taking on new business questions, slowing down the speed at which business problems are analysed and solved.

Seduced by inflated promises, organisations have started to mine their data with state-of-the-art algorithms, expecting it to be turned into gold instantly. This expectation that technology will act as a philosopher's stone makes data science comparable to alchemy: it looks like science, but it isn't. Most of the algorithms fail to deliver value as they can't explain why things are happening, nor provide actionable insights or guidance for influencing the phenomena being investigated. To illustrate, take the London riots in 2011. Since the 2009 G20 summit, the UK police had been gathering and analysing a lot of social media data, but they were still not able to prevent the 2011 riots from happening, nor to track and arrest the rioters. Did the police have too little data or a lack of computing or algorithmic power? No, millions had been spent. Despite all the available technology the police were unable to make sense of it all. I see other organisations struggle with the same problem of trying to make sense of their data. Although I'm a strong proponent of using data and mathematics (and as such data science) for answering business questions, I do believe that technology alone can never be sufficient to provide an answer. Neither can the amount, diversity or speed of the data.


Inference vs Prediction

Let's investigate the disconnect between the business goals and the data science efforts mentioned in the HBR article. Many of today's data science initiatives result in predictive models. In a B2C context these models are used to predict whether you're going to click on an ad, buy a suggested product, churn, commit fraud or default on a loan. Although a lot of effort goes into creating highly accurate predictions, the question is whether these predictions really create business value. Most organisations need a way to influence the phenomenon being predicted, not the prediction itself; that is what allows them to decide on the appropriate actions to take. Therefore, understanding what makes you click, buy, churn, default or commit fraud is the real objective. Understanding what influences human behaviour requires a different approach than creating predictions: it requires inference. Inference is a statistical, hypothesis-driven approach to modelling that focusses on understanding the causality of a relationship. Computer science, the core of most data science methods, focusses on finding the best model to fit the data and doesn't focus on understanding why. Inferential models provide the decision maker with guidance on how to influence customer behaviour, and thus value can be created. This might better explain the disconnect between business goals and analytics efforts reported in the HBR article. For example, knowing that a call positively influences customer experience and prevents churn for a specific type of customer gives the decision maker the opportunity to plan such a call. Prediction models can't provide these insights; they will only provide the expected number of churners or who is most likely to churn. How to react to these predictions is left to the decision maker.
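To make the contrast concrete, the minimal sketch below fits an interpretable logistic regression on synthetic churn data and reads off the coefficient of a retention call, which is the kind of inferential question a decision maker cares about. The data, variable names and effect sizes are all invented for illustration, and a proper causal claim would of course need an experimental or quasi-experimental design, not just a regression.

```python
# Inference vs prediction on synthetic churn data (all numbers invented).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
tenure = rng.uniform(0, 10, n)            # years as customer
got_call = rng.integers(0, 2, n)          # 1 = received a retention call
# Synthetic truth: a call lowers the churn odds, and so does long tenure.
logit = 0.5 - 0.8 * got_call - 0.15 * tenure
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = sm.add_constant(np.column_stack([got_call, tenure]))
fit = sm.Logit(churn, X).fit(disp=False)
print(fit.params)      # the got_call coefficient estimates how a call shifts churn odds
print(fit.conf_int())  # confidence intervals quantify the uncertainty of that estimate
```

A black-box churn model could score the same customers more accurately, but it would not tell the decision maker whether planning the call is worth it.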

Keep it simple!

A second reason for failure mentioned in the HBR article is that data scientists put a lot of effort into improving the predictive accuracy of their models instead of taking on new business questions. The reason given for this behaviour is the huge effort needed to get the data ready for analysis and modelling. A consequence of this tendency is that it increases model complexity. Is this complexity really required? From a user's perspective, complex models are more difficult to understand and therefore also more difficult to adopt, trust and use. For easy acceptance and deployment, it is better to have understandable models; sometimes this is even a legal requirement, for example in credit scoring. A best practice I apply in my work as a consultant is to balance the model accuracy against the accuracy required for the decision to be made, the analytics maturity of the decision maker and the accuracy of the data. This also applies to data science projects. For example, targeting the receivers of your next marketing campaign requires less accuracy than having a self-driving car find its way to its destination. Also, you can't make predictions that are more accurate than your data. Most data are uncertain, biased, incomplete and contain errors, and when you have a lot of data this only gets worse. This will negatively influence the quality and applicability of any model based on that data. In addition, research shows that the added value of more complex methods is marginal compared to what can be achieved with simple methods. Simple models already catch most of the signal in the data, enough in most practical situations to base a decision on. So, instead of creating a very complex and highly accurate model, it is better to test various simple ones. They will capture the essence of what is in the data and speed up the analysis. From a business perspective, this is exactly what you should ask your data scientists to do: come up with simple models fast and, if the decision requires it, use the insights from these simple models to direct the construction of more advanced ones.
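A quick way to act on this advice is to always benchmark a simple model against a more complex one before investing in further tuning. The sketch below does exactly that on a synthetic classification dataset using scikit-learn cross-validation; the dataset and the two model choices are illustrative assumptions, and whether any remaining gap is worth the extra complexity is a judgement call that depends on the decision at hand.

```python
# Benchmark a simple model against a more complex one before tuning further.
# Synthetic data and model choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5, random_state=1)

candidates = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("gradient boosting", GradientBoostingClassifier()),
]
for name, model in candidates:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f} (+/- {auc.std():.3f})")
# If the simpler model is close enough for the decision at hand, ship it first.
```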

The question “How to get value from your data science initiative?” has no simple answer. There are many reasons why data science projects succeed or fail; the HBR article mentions only a few. I'm confident that the above considerations and recommendations will increase the chances of your next data science initiative being successful. I can't promise you gold, however; I'm no alchemist.

Thursday, 13 October 2016

The Error in Predictive Analytics

For more predictions see: http://xkcd.com/887/
We are all well aware of the predictive analytical capabilities of companies like Netflix, Amazon and Google. Netflix predicts the next film you are going to watch. Amazon shortens delivery times by predicting what you are going to buy next, and Google even lets you use its algorithms to build your own prediction models. Following the predictive successes of Netflix, Google and Amazon, companies in telecom, finance, insurance and retail have started to use predictive analytical models and developed the analytical capabilities to improve their business. Predictive analytics can be applied to a wide range of business questions and has been a key technique in search, advertising and recommendations. Many of today's applications of predictive analytics are in the commercial arena, focusing on predicting customer behaviour. First steps in other businesses are being taken: organisations in healthcare, industry and utilities are investigating what value predictive analytics can bring. In taking these first steps much can be learned from the experience the front-running industries have in building and using predictive analytical models. However, care must be taken, as the context in which predictive analytics has been used so far is quite different from the new application areas, especially when it comes to the impact of prediction errors.

Leveraging the data

It goes without saying that the success of Amazon comes from its recommendation engine, besides its infinite shelf space. The same holds for Netflix. According to McKinsey, 35 percent of what consumers purchase on Amazon and 75 percent of what they watch on Netflix comes from algorithmic product recommendations. Recommendation engines work well because there is a lot of data available on customers, products and transactions, especially online. This abundance of data is why there are so many predictive analytics initiatives in sales & marketing. The main objective of these initiatives is to predict customer behaviour: which customer is likely to churn or to buy a specific product or service, which ads will be clicked on, or what marketing channel to use to reach a certain type of customer. In these types of applications predictive models are created using either statistical techniques (like regression, probit or logit) or machine learning techniques (like random forests or deep learning). With the insights gained from using these predictive models many organisations have been able to increase their revenues.

Predictions always contain errors!

Predictive analytics has many applications; the examples mentioned above are just the tip of the iceberg. Many of them will add value, but it remains important to stress that the outcome of a prediction model will always contain an error. Decision makers need to know how big that error is. To illustrate: in using historic data to predict the future you assume that the future will have the same dynamics as the past, an assumption which history has proven to be dangerous. The 2008 financial crisis is proof of that. Even though there is no shortage of data nowadays, there will be factors that influence the phenomenon you're predicting (like churn) that are not included in your data. Also, the data itself will contain errors, as measurements always include some kind of error. Last but not least, models are always an abstraction of reality and can't contain every detail, so something is always left out. All of this will impact the accuracy and precision of your predictive model. Decision makers should be aware of these errors and the impact they may have on their decisions.

When statistical techniques are used to build a predictive model, the model error can be estimated; it is usually provided in the form of confidence intervals. Any statistical package will provide them, helping you assess the model quality and its prediction errors. In the past few years other techniques have become popular for building predictive models, for example algorithms like deep learning and random forests. Although these techniques are powerful and able to provide accurate predictive models, they are typically unable to provide confidence intervals (or error bars) for their predictions. So there is no way of telling how accurate or precise the predictions are. In marketing and sales this may be less of an issue; the consequence might be that you call the wrong people or show an ad to the wrong audience. The consequences can however be more severe. You might remember the offensive auto-tagging by Flickr, labelling images of people with tags like “ape” or “animal”, or the racial bias in predictive policing algorithms.


Where is the error bar?

The point that I would like to make is that when adopting predictive modelling, be sure to have a way of estimating the error in your predictions, in terms of both accuracy and precision. In statistics this is common practice and helps improve models and decision making. Models constructed with machine learning techniques usually only provide point estimates (for example, the probability of churn for a customer is some percentage), which provides little insight into the accuracy or precision of the prediction. When using machine learning it is possible to construct error estimates (see for example the research of Michael I. Jordan), but it is not common practice yet. Many analytical practitioners are not even aware of the possibility. Especially now that predictive modelling is being used in environments where errors can have a large impact, this should be top of mind for both the analytics professional and the decision maker. Just imagine your doctor concluding that your liver needs to be taken out because his predictive model estimates a high probability of a very nasty disease. Wouldn't your first question be how certain he or she is about that prediction? So, my advice to decision makers: only use outcomes of predictive models if accuracy and precision measures are provided. If they are not there, ask for them. Without them, a decision based on these predictions comes close to a blind leap of faith.
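One pragmatic way to put an error bar on a machine learning prediction is the bootstrap: refit the model on resampled data and look at the spread of the resulting predictions. The sketch below does this for a random forest on synthetic data; the dataset, the number of bootstrap rounds and the use of scikit-learn are my own assumptions, and the interval it produces reflects model variability across resamples rather than a formal statistical confidence interval.

```python
# Bootstrap sketch: a rough error bar around a single ML prediction.
# Synthetic data; the interval captures model variability across resamples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X, y = make_classification(n_samples=2000, n_features=10, random_state=3)
new_customer = X[0:1]                 # the case we want a prediction for

preds = []
for _ in range(100):                  # 100 bootstrap resamples (assumption)
    idx = rng.integers(0, len(X), len(X))
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[idx], y[idx])
    preds.append(model.predict_proba(new_customer)[0, 1])

preds = np.array(preds)
print(f"point estimate: {preds.mean():.2f}, "
      f"90% bootstrap interval: [{np.percentile(preds, 5):.2f}, {np.percentile(preds, 95):.2f}]")
```

A wide interval is exactly the signal a decision maker should ask for before acting on the point estimate.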

Wednesday, 3 August 2016

Airport Security, can more be done with less?


One of the main news items of the past few days is the increased level of security at Amsterdam Schiphol Airport and the additional delays it has caused travellers, both incoming and outgoing. Extra security checks are being conducted on the roads around the airport, and additional checks are also being performed inside the airport. Security checks have been increased after the authorities received reports of a possible threat. We are in the peak of the holiday season, when around 170,000 passengers per day arrive, depart or transfer at Schiphol Airport. With these numbers of people the authorities understandably want to do their utmost to keep us safe, as always. This intensified security puts the military police (MP) and security officers under stress, however, as more needs to be done with the same number of people. It will be difficult for them to keep up the increased number of checks for long; additional resources will be required, for example from the military. The question is, does security really improve with these additional checks, or could a more differentiated approach offer more security (lower risk) with less effort?

How has airport security evolved?

If I take a plane to my holiday destination, I need to take off my coat, my shoes and my belt, get my laptop and other electronic equipment out of my bag, separate the chargers and batteries, hand in my excess liquids, empty my pockets and step through a security scanner. This takes time, and with increasing numbers of passengers waiting times will increase. We all know these measures are necessary to keep us safe, but it is not a very enjoyable start to a trip abroad. These measures have been adopted to prevent past attacks from happening again and have resulted in the current rule-based system of security checks. Over the years the number of security measures has increased enormously, see for example the timeline on the TSA website, making security a resource-heavy activity that can't be continued in the same way in the near future. A smarter way is needed.

Risk Based Screening

At present most airports apply the same security measures to all passengers: a one-size-fits-all approach. This means that low-risk passengers are subject to the same checks as high-risk passengers, which implies that changes to the security checks can have an enormous impact on resource requirements. Introducing an additional one-minute check by a security officer for all passengers at Schiphol requires roughly 354 additional security officers to check 170,000 passengers (170,000 minutes of extra checking per day, divided by a working shift of about 480 minutes). A smarter way would be to apply different measures to different passenger types: high-risk measures to high-risk passengers and low-risk measures to low-risk passengers. This risk-based approach is at the foundation of SURE! (Smart Unpredictable Risk Based Entry), a concept introduced by the NCTV (the National Coordinator for Security and Counterterrorism). Consider this: what is more threatening, a non-threat passenger with banned items (pocket knife, water bottle) or a threat passenger with bad intentions (and no banned items)? I guess you will agree that the latter is the more threatening one, and this is exactly what risk-based screening focusses on. A key component of risk-based security is deciding what security measures to apply to which passenger, taking into account that attackers will adapt their plans when additional security measures are installed.

Operations Research helps safeguard us

The concept of risk-based screening makes sense, as scarce resources like security officers, MPs and scanners are utilised better. In the one-size-fits-all approach a lot of these resources are used to screen low-risk passengers, and as a consequence fewer resources are available for detecting high-risk passengers. Still, even with risk-based screening trade-offs must be made, as resources will remain scarce. Decisions also need to be made in an uncertain and continuously changing environment, with little, false or no information. Sound familiar? This is exactly the area where Operations Research shines. Decision making under uncertainty can be supported by, for example, simulation, Bayesian belief networks, Markov decision processes and control theory models. Using game-theoretic concepts the behaviour of attackers can be modelled and incorporated, leading to the identification of new and robust countermeasures. Queueing theory and waiting line models can be used to analyse various security check configurations (for example centralised versus decentralised, and yes, centralised is better!) including the required staffing. This will help airports develop efficient and effective security checks, limiting the impact on passengers while achieving the highest possible risk reduction. These are but a few examples of where OR can help; there are many more.
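As a small illustration of the queueing argument, the sketch below uses the standard M/M/c (Erlang C) formulas to compare one centralised checkpoint with c lanes against c decentralised single-lane checkpoints that each receive an equal share of the passengers. The arrival and service rates are invented for illustration, and real checkpoints are of course not perfectly Markovian.

```python
# Compare centralised (one M/M/c queue) vs decentralised (c separate M/M/1 queues)
# security checks using Erlang C. Arrival/service rates are illustrative assumptions.
from math import factorial

def erlang_c_wait(lam: float, mu: float, c: int) -> float:
    """Mean waiting time in queue for an M/M/c system (same time unit as the rates)."""
    a = lam / mu                       # offered load
    rho = a / c                        # utilisation, must be < 1
    p_wait = (a**c / (factorial(c) * (1 - rho))) / (
        sum(a**k / factorial(k) for k in range(c)) + a**c / (factorial(c) * (1 - rho))
    )
    return p_wait / (c * mu - lam)

lam = 5.0   # passengers arriving per minute
mu = 0.6    # passengers served per lane per minute
c = 10      # number of lanes

centralised = erlang_c_wait(lam, mu, c)          # one shared queue feeding c lanes
decentralised = erlang_c_wait(lam / c, mu, 1)    # c separate single-lane queues
print(f"centralised:   {centralised:.2f} min average wait")
print(f"decentralised: {decentralised:.2f} min average wait")
```

With the same total capacity, the pooled (centralised) queue produces a far shorter average wait, which is the intuition behind Schiphol's move towards centralised security.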


Some of the concepts of risk-based security checks resulting from the SURE! programme have already been put into practice. Schiphol is working towards centralised security and recently opened the security checkpoint of the future for passengers travelling within Europe. It's good to know that the decision making rigour comes from Operations Research, resulting in effective, efficient and passenger-friendly security checks.

Thursday, 21 July 2016

Towards Prescriptive Asset Maintenance

Every utility deploys capital assets to serve its customers. During the asset life cycle an asset manager must repeatedly make complex decisions with the objective of minimising asset life cycle cost while maintaining high availability and reliability of the assets and networks. Avoiding unexpected outages, managing risk and maintaining assets before failure are critical goals for improving customer satisfaction. To better manage asset and network performance, utilities are starting to adopt a data driven approach. With analytics they expect to lower asset life cycle cost while maintaining high availability and reliability of their networks. Using actual performance data, asset condition models are created which provide insight into asset deterioration over time and into the driving factors of deterioration. With these insights, forecasts can be made of future asset and network performance. These models are useful, but they lack the ability to effectively support the asset manager in designing a robust and cost-effective maintenance strategy.

Asset condition models allow for the ranking of assets based on their expected time to failure. Within utilities it is common practice to use this ranking to decide which assets to maintain: starting with the assets with the shortest time to failure, assets are selected for maintenance until the available maintenance budget is exhausted. This prioritisation approach ensures that the assets most prone to failure are selected for maintenance, but it will not deliver the maintenance strategy with the highest overall reduction of risk. The approach also can't effectively handle constraints beyond the budget constraint, for example constraints on manpower availability, precedence constraints on maintenance projects, or required materials and equipment. Therefore a better way to determine a maintenance strategy is required, one that takes all these decision dimensions into account. More advanced analytical methods, like mathematical optimisation (= prescriptive analytics), provide the asset manager with the required decision support.

In finding the best maintenance strategy the asset manager could, instead of making a ranking, list all possible subsets of maintenance projects that are within budget and calculate the total risk reduction of each subset. The best subset of projects to select would be the subset with the highest overall risk reduction (or any other measure). This way of selecting projects also allows additional constraints to be taken into account, like required manpower, required equipment or spare parts, and time-dependent budget limits: subsets that do not fulfil these requirements are simply left out. Subsets could also be constructed in such a manner that mandatory maintenance projects are always included. With a small number of projects this way of selecting projects would be feasible: 10 projects lead to 1,024 (= 2^10) possible subsets. But with large numbers it is not: a set of 100 potential projects leads to about 1.27 × 10^30 (= 2^100) possible subsets, which would take far too much time, if it were possible at all, to construct and evaluate them all. This is exactly where mathematical optimisation proves its value, because it allows you to implicitly construct and evaluate all feasible subsets of projects, fulfilling not only the budget constraint but any other constraint that needs to be included. Selecting the best subset is achieved by using an objective function which expresses how you value each subset. Using mathematical optimisation assures that the best possible solution will be found. Mathematical optimisation has proven its value many times in many industries, including utilities, and in disciplines like maintenance. MidWest ISO, for example, uses optimisation techniques to continuously balance energy production with energy consumption, including the distribution of electricity in their networks. Other asset-heavy industries like petrochemicals use optimisation modelling to identify cost-effective, reliable and safe maintenance strategies.
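The sketch below shows what such a selection model could look like as a small binary programme: maximise total risk reduction subject to a budget and a crew-hours limit, with one project marked as mandatory. The project data are invented and the use of the open-source PuLP solver is my own choice, not necessarily what any utility uses.

```python
# Maintenance project selection as a 0/1 optimisation problem (illustrative data).
# Maximise risk reduction subject to budget and crew-hour constraints.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

projects = {
    # name: (risk_reduction, cost_keur, crew_hours, mandatory)
    "replace_valve_A":   (120, 300, 200, False),
    "recoat_pipe_B":     ( 80, 150, 120, False),
    "rebuild_station_C": (200, 500, 400, True),
    "inspect_feeder_D":  ( 40,  60,  50, False),
    "replace_cable_E":   (150, 350, 250, False),
}
budget, crew_capacity = 900, 650

model = LpProblem("maintenance_selection", LpMaximize)
x = {p: LpVariable(f"select_{p}", cat=LpBinary) for p in projects}

model += lpSum(projects[p][0] * x[p] for p in projects)                   # objective
model += lpSum(projects[p][1] * x[p] for p in projects) <= budget         # budget limit
model += lpSum(projects[p][2] * x[p] for p in projects) <= crew_capacity  # crew hours
for p, (_, _, _, mandatory) in projects.items():
    if mandatory:
        model += x[p] == 1                                                # must be done

model.solve()
selected = [p for p in projects if value(x[p]) > 0.5]
print("selected projects:", selected)
print("total risk reduction:", value(model.objective))
```

The solver implicitly evaluates all feasible subsets, so exactly the same formulation scales from these five illustrative projects to hundreds of real ones.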



In improving their asset maintenance strategies, utilities' best next step is to adopt mathematical optimisation. It allows them to leverage the insights from their asset condition models and turn those insights into value-adding maintenance decisions. Compared to their current rule-based selection of maintenance projects, in which they can only evaluate a limited number of alternatives, they can improve significantly, as mathematical optimisation lets them evaluate trillions (possibly all) of the alternative maintenance strategies within seconds. Although “rules of thumb”, “politics” and “intuition” will always provide a solution that is “good”, mathematical optimisation assures that The Best solution will be found.

Tuesday, 19 July 2016

Big Data Headaches

Image: http://tinyurl.com/jeyjtna
Data driven decision making has proven to be key to organisational performance improvements. This stimulates organisations to gather data, analyse it and use decision support models to improve their decision making speed and quality. With the rapid decline in the cost of both storage and computing power, there are hardly any limitations on what you can store or analyse. As a result organisations have started building data lakes and have invested in big data analytics platforms to store and analyse as much data as possible. This is especially true in the consumer goods and services sector, where big data technology can be transformative as it enables a very granular analysis of human activity (down to the personal level). With these granular insights companies can personalise their offerings, potentially increasing revenue by selling additional products or services. This allows new business models to emerge and is changing the way of doing business completely. As the potential of all this data is huge, many organisations are investing in big data technology expecting plug-and-play inference to support their decision making. Big data practice, however, is something different and is full of rude awakenings and headaches.

That big data technology can create value is proven by the fact that companies like Google, Facebook and Amazon exist and do well. Surveys from Gartner and IDC show that the number of companies adopting big data technology is increasing fast. Many of them want to use this technology to improve their business and start using it in an exploratory manner. When asked about the results they get from their analyses, many respond that they experience difficulty in getting results due to data issues; others report difficulty getting insights that go beyond preaching to the choir. Some even report disappointment as their outcomes turn out to be wrong when put into practice. The lack of experienced analytical talent is often mentioned as a reason for this, but there is more to it. Although big data has the potential to be transformative, it also comes with fundamental challenges which, when not acknowledged, can cause unrealistic expectations and disappointing results. Some of these challenges are even unsolvable at this time.

Even if there is a lot of data, it can’t be used properly

To illustrate some of these fundamental challenges, let's take the example of an online retailer. The retailer has data on its customers and uses it to identify generic customer preferences. Based on the identified preferences, offers are generated and customers targeted. The retailer wants to increase revenue and starts to collect more data at the individual customer level. The retailer wants to use the additional data to create personalised offerings (the right product, at the right time, for the right customer, at the right price) and to make predictions about future preferences (so the retailer can restructure its product portfolio continuously). In order to do so the retailer needs to find out what the preferences of its customers are and what drives their buying behaviour. This requires constructing and testing hypotheses based on the customer attributes gathered. In the old situation the number of available attributes (like address, gender, past transactions) was small, so only a small number of hypotheses (for example “women living in a certain part of the city are inclined to buy a specific brand of white wine”) needed to be tested to cover all possible combinations. However, as the number of attributes increases, the number of combinations of attributes to be investigated grows exponentially. If in the old situation the retailer had 10 attributes per customer, a total of 1,024 (= 2^10) possible combinations needed to be evaluated. When the number of attributes increases to, say, 500 (which in practice is still quite small), the number of possible combinations of attributes increases to 3.27 × 10^150 (= 2^500). This exponential growth causes computational issues, as it becomes impossible to test all possible hypotheses even with the fastest available computers. The practical way around this is to significantly reduce the number of attributes taken into account. This leaves much of the data unused and many possible combinations of attributes untested, reducing the potential to improve. It might also cause many big data analysis results to be too obvious.

The larger the data set, the stronger the noise

There is another problem with analysing large amounts of data. As the size of the data set increases, all kinds of patterns will be found, but most of them are going to be just noise. Recent research has provided proof that as data sets grow larger, they have to contain arbitrary correlations. These correlations appear due to the size, not the nature, of the data, which indicates that most of the correlations will be spurious. Without proper practical testing of the findings, this could cause you to act upon a phantom correlation. Testing all the detected patterns in practice is impossible, as the number of detected correlations increases exponentially with the data set size. So even though you have more data available, you're worse off, as too much information behaves like very little information. Besides the increase of arbitrary correlations in big data sets, testing the huge number of possible hypotheses is also going to be a problem. To illustrate: using a significance level of 0.05, testing 50 hypotheses on the same data will give at least one significant result with roughly 92% probability.

P(at least one significant result) = 1 − P(no significant results) = 1 − (1 − 0.05)^50 ≈ 92%

This implies that we will find an increasing number of statistically significant results due to chance alone. As a result the number of false positives will rise, potentially causing you to act upon phantom findings. Note that this is not only a big data issue, but a small data issue as well: in the above example we already needed to test 1,024 hypotheses with just 10 attributes.
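The sketch below makes this concrete by computing the family-wise error rate for a growing number of tests and showing how a simple Bonferroni correction keeps it in check; it is a generic illustration of the multiple testing problem, not part of the research referred to above.

```python
# Family-wise error rate when testing m independent hypotheses at alpha = 0.05,
# with and without a Bonferroni correction (generic illustration).
alpha = 0.05
for m in (1, 10, 50, 500):
    p_any_false_positive = 1 - (1 - alpha) ** m
    p_with_bonferroni = 1 - (1 - alpha / m) ** m   # test each hypothesis at alpha/m instead
    print(f"{m:4d} tests: P(>=1 false positive) = {p_any_false_positive:.0%}, "
          f"with Bonferroni = {p_with_bonferroni:.0%}")
```

Corrections like Bonferroni keep the false positive rate under control, but only at the price of statistical power, which is another reason to be selective about which hypotheses you test in the first place.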

Data driven decision making has nothing to do with the size of your data


So, should the above challenges stop you from adopting data driven decision making? No, but be aware that it requires more than just some hardware and a lot of data. Sure, with a lot of data and enough computing power significant patterns will be detected, even if you can't identify all the patterns that are in the data. However, not many of these patterns will be of any interest, as spurious patterns will vastly outnumber the meaningful ones. Therefore, as the size of the available data increases, the skill level for analysing the data needs to grow with it. In my opinion data and technology (even a lot of it) are no substitute for brains. The smart way to deal with big data is to extract and analyse the key information embedded in the “mountains of data” and to ignore most of the rest. You could say that you first need to trim down the haystack to better locate where the needle is. What remains are collections of small amounts of data that can be analysed much more effectively. This approach will prevent you from getting a big headache from your big data initiatives and will improve both the speed and quality of data driven decision making within your organisation.

Friday, 29 April 2016

Is Analytics losing its competitive edge?

Since Tom Davenport wrote his ground-breaking HBR article Competing on Analytics in 2006, a lot has changed in how we think about data and analytics and their impact on decision making. In the past 10 years the amount of data has gone sky high due to new technological developments like the Internet of Things. Data storage costs have also plummeted, so we no longer need to choose whether or not to store the data. Analytics technology has become readily available: open source platforms like KNIME and R have lowered the adoption thresholds, providing access to state-of-the-art analytical methods for everyone. To monitor the impact of these developments on the way organisations use data and analytics, MIT Sloan Management Review sends out a survey on a regular basis. They recently published their latest findings in Beyond the Hype: The hard work behind analytics success. One of the key findings is that analytics seems to be losing its competitive edge.

Analytics has become table stakes

Comparing their survey results over several years, MIT Sloan reports a decrease over the past two years in the number of organisations that gained a competitive advantage from using analytics. An interesting finding, especially now that organisations seem set to leverage the investments they have made in (big) data platforms, visualisation and analytics software. An obvious explanation for this decline is that more organisations are using analytics in their decision making, which lowers the competitive advantage it brings. In other words, analytics has become table stakes. The use of analytics in decision making has become a required capability for some organisations to stay competitive, for example in the hospitality and airline industries. All companies in those industries use analytics extensively to come up with the best offer for their customers; without the extensive use of analytics they would not be able to compete. There are, however, more reasons for the reported decline in competitive advantage.

Step by step 

In the MIT Sloan report, several of the reported reasons for having difficulty gaining a competitive edge with analytics are related to organisational focus and culture. The survey results show that this is due to a lack of senior sponsorship, and that senior management doesn't use analytics in its strategic decision making. As a consequence there are only localised initiatives that have little impact. I see this happen in a lot of organisations. Many managers see value in using analytics in decision making but have difficulty convincing senior management to support them. There can be many reasons for that. It could be that senior management simply doesn't know what to expect from analytics and therefore avoids investing time and money in an activity with an uncertain outcome. It could also be that the outcomes of analytics models are so counterintuitive that senior management simply can't believe them. There are several ways to change this and benefit from analytics beyond local initiatives. Key is to take a step-by-step approach, starting with the current way of decision making and gradually introducing analytics to improve it: simple steps with measurable impact. That way senior management can familiarise itself with what analytics can do and gain confidence in its outcomes. It can take some time, but each step will be an improvement and will grow the analytical competitiveness of the organisation.

Investing in People

One other main reason from the survey for having difficulty gaining an edge with analytics is that organisations don't know how to use the analytics insights. An important cause is that analytics projects are not well embedded in a business context. Driven by the ambition to use data and analytics in decision making, organisations rush into doing analytics projects without taking enough time to make sure the project addresses an important enough business issue and has clear objectives, scope and an implementation plan. As a result, the insights from the analytics project kick in an open door, are too far removed from what the business needs, or it's unclear what to do with the outcomes.
Another reason I come across often is that analytics projects are started from the technology perspective: “We have bought analytics software, now what can we do with it?”. It should be the other way around: the required analytics software comes after understanding the business issue and the conditions under which it needs to be solved. Analytics is therefore more than buying software or hardware; people need to be trained to recognise business issues that can be solved with analytics and to choose the appropriate analytical methods and tools. The training will also result in a common understanding of the value of analytics for the organisation, which in turn will help change the current way of decision making into one that incorporates analytics insights.

So, has analytics become less competitive? The picture I get from the above is that most organisations have difficulty changing to a new and more analytical way of working. Many organisations are just starting to use analytics; the MIT Sloan survey confirms this, given the significant increase in first-time users (the Analytically Challenged organisations). These organisations have high expectations of what they will get from analytics, but they will need to go through organisational changes and changes in the way decisions are made before the benefits of using analytics become visible. Following a Satir-like change curve, this will at first cause a dip in productivity, which in my opinion explains the lower expectations these organisations have of the competitive gain from using analytics. But this will change over time and end in a new and improved productivity level. As with any new capability or technology, you first need to learn how to walk, then run, and then jump.