
Understanding the Black Swan in AI: Managing Risks in Big Data

In 2008, author and former Wall Street trader Nassim Taleb had the biggest ‘I told you so’ moment of his career. For many years Taleb had been warning anyone who would listen (and many who would not) that the tools being used to measure risk in the financial markets were invalid and were going to lead to trouble. He published a book called The Black Swan before Lehman Brothers collapsed, the event that precipitated the financial crisis. Taleb uses the anecdote of the black swan to explain why he believes financial modelling lacks the tools to account for rare events. The anecdote goes, “No amount of observations of white swans can allow the inference that all swans are white, but the observation of a single black swan is sufficient to refute that conclusion.” When we think about Taleb’s warnings in the context of Big Data and AI, we recognize that firms must avoid being fooled by the assumption that their big data stores contain all possible instances and outcomes. By avoiding overconfidence in their data, management can defend their business from the negative impacts of a black swan.

To help understand the Black Swan Problem and how it may affect business, let’s distinguish between two types of datasets. An example of the first type is the height of adult males. When we look at a dataset of this type, we see that most men are about 5’10”. There are quite a few men who are slightly above this height and quite a few who are slightly below. We also see a few outliers who are above 6’5” and a few who are below 5’3”. There are even fewer outside of this range. Ultimately, the distribution of data points in this case is captured within a small range, and although we can expect outliers, there is a limit to how tall, or how short, an adult male can be.

The above dataset matches what statisticians call the Normal Distribution. The Normal Distribution allows us to model the height of males accurately and even captures the probability of rare outliers with high accuracy. The problem Taleb has with this type of distribution is its incorrect application to datasets that are not normal. To illustrate this, let’s look at a second dataset: the returns of the S&P 500.
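As a rough illustration of how the Normal Distribution assigns probabilities to outliers, the sketch below uses Python's standard-library statistics.NormalDist. The mean of 70 inches comes from the 5’10” figure above. The standard deviation of 3 inches is an assumption chosen for illustration, not a measured value.

```python
from statistics import NormalDist

# Assumed parameters: mean 70 in (5'10"); the 3 in standard deviation
# is an illustrative guess, not a figure from the article.
heights = NormalDist(mu=70.0, sigma=3.0)

# Share of men taller than 77 in (6'5") and shorter than 63 in (5'3").
p_above_77 = 1.0 - heights.cdf(77.0)
p_below_63 = heights.cdf(63.0)

print(f"P(height > 77 in) = {p_above_77:.4f}")
print(f"P(height < 63 in) = {p_below_63:.4f}")
```

Under these assumed parameters, each tail holds roughly 1% of the population: rare, but well within what the model expects.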

What we find when we look at this dataset is that, for the most part, it looks very similar to that of the height of males. The returns go up a little on some days and down a little on others. The difference is that there are a small number of very bad days: days on which the S&P 500 falls so dramatically that the Normal Distribution predicts such a drop would happen only once in several million years. This would be like finding an adult male who is 1 foot tall. By ignoring the fact that different datasets have different levels of variation, and different possibilities of extreme outliers, companies can open themselves up to greater risks than they anticipate.
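The ‘once in several million years’ figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes daily returns are normal with a mean of 0% and a standard deviation of 1%, and that a year has 252 trading days. These inputs and the function name years_between_drops are illustrative assumptions, not numbers taken from the discussion above.

```python
import math

def years_between_drops(drop_pct, mean_pct=0.0, sd_pct=1.0, trading_days=252):
    """Expected years between one-day falls of at least drop_pct,
    IF daily returns were normally distributed (hypothetical inputs)."""
    # Number of standard deviations the fall sits below the mean.
    k = (drop_pct + mean_pct) / sd_pct
    # Lower-tail probability of the normal; erfc stays accurate far out in the tail.
    p_daily = 0.5 * math.erfc(k / math.sqrt(2.0))
    return 1.0 / (p_daily * trading_days)

for drop in (3.0, 4.0, 5.0, 6.0):
    years = years_between_drops(drop)
    print(f"fall of {drop:.0f}% or worse: about once every {years:,.1f} years")
```

With these assumed inputs, a one-day fall of 6% comes out at roughly one event every four million years. The S&P 500 has in fact fallen by more than that on several days, including 19 October 1987 and during the 2008 crisis, which is the mismatch Taleb points to.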

This concept has not yet been fully explored in the realm of Big Data and AI, but it is important that managers have a strong grasp of the Black Swan Problem when they begin to rely on AI algorithms for predictions and actions. Management must ask whether the dataset is of the ‘height-type’ or the ‘financial-type’. Making this distinction protects their company from taking blind risks based on data they believe to be representative of the entire population. As companies move through their AI transformation, they must be on the lookout for the Black Swan. Only those who fail to look will be surprised when they find one.
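One rough way to start asking the ‘height-type’ or ‘financial-type’ question is to measure how heavy a dataset's tails are. The sketch below computes excess kurtosis, which is close to 0 for normal data and much larger for fat-tailed data. Both datasets here are simulated: the Student-t sample with 3 degrees of freedom is only a stand-in for fat-tailed returns, and the helper names are hypothetical.

```python
import random
import statistics

def excess_kurtosis(values):
    """Sample excess kurtosis: near 0 for normal data, large for fat tails."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0

def student_t(rng, df):
    """One Student-t draw: a normal draw divided by sqrt(chi-square / df)."""
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return rng.gauss(0.0, 1.0) / (chi2 / df) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
height_type = [rng.gauss(70.0, 3.0) for _ in range(20_000)]     # simulated heights
financial_type = [student_t(rng, 3) for _ in range(20_000)]     # simulated fat tails

k_height = excess_kurtosis(height_type)
k_financial = excess_kurtosis(financial_type)
print(f"height-type excess kurtosis:    {k_height:.2f}")
print(f"financial-type excess kurtosis: {k_financial:.2f}")
```

A statistic like this is only a first check. A dataset that has not yet recorded its extreme event can still look normal, which is exactly the trap the black swan anecdote describes.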