Higher interest on your loan because of your birth or heritage. A low final exam grade because of a poor background. Excessively expensive car insurance because of your neighbourhood. Unjustified spying on people. A hateful AI. Extra surveillance in certain neighbourhoods.
Just because an algorithm said so, based on prejudices.
Modern companies that make strategic and tactical choices purely on gut feeling no longer exist. The only phase where this is somewhat acceptable is the startup phase. In the rollercoaster phase that follows, data should already play an important part in pivoting and improving the product concept. Data increasingly determines the strategy of an organisation. However, this represents a potential business risk of which we are barely aware, and for which we take almost no mitigating measures.
This is part one of a series about biased data.
The problem: bias is everywhere
The problems and risks have to do with prejudice. These prejudices, hereinafter referred to as "bias", are ubiquitous in every society. Biases, if handled carelessly, can end up in data-driven models. And we do handle them carelessly.
Because bias is difficult to detect, it is translated, directly or indirectly, into business strategy. Operational choices are likewise influenced by biased and one-sided sources.
As a result, prejudices are reinforced, after which they are fed back as data for the next analysis: an infinite loop is created. The lens through which the organisation sees the world becomes clouded. The longer this loop continues, the cloudier it gets, and the greater the business risk.
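This feedback loop can be sketched in a few lines. The scenario and every number below are hypothetical: a lender whose approval rates per neighbourhood feed back into next round's "historical data", with a small amplification factor standing in for the organisation over-trusting the data it generated itself.

```python
import random

random.seed(42)

def clamp(x):
    return max(0.0, min(1.0, x))

# Hypothetical starting approval rates per neighbourhood: "A" starts
# slightly above the 50% midpoint, "B" slightly below.
approval_rate = {"A": 0.60, "B": 0.40}
history = [dict(approval_rate)]

for _ in range(10):
    for hood, rate in approval_rate.items():
        # Decisions this round follow the historical rate ...
        observed = sum(random.random() < rate for _ in range(1000)) / 1000
        # ... and next round's "historical rate" is the observed rate,
        # nudged 5% further from the midpoint: approvals breed approvals,
        # rejections breed rejections.
        approval_rate[hood] = clamp(observed + 0.05 * (observed - 0.5))
    history.append(dict(approval_rate))

gap_start = history[0]["A"] - history[0]["B"]
gap_end = history[-1]["A"] - history[-1]["B"]
print(f"gap at start: {gap_start:.2f}, gap after 10 rounds: {gap_end:.2f}")
```

An initially modest gap between the two neighbourhoods widens with every iteration, even though nothing about the applicants themselves ever changed.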
Data as raw material for machine learning
Data is often referred to as "the new oil." Not only because it is the raw material on which the machine runs, but also because of the negative impact it can have. Mining the wrong data in the wrong way can leave traces and damage for years.
Applications of machine learning are becoming more diverse and successful. Contrary to popular belief, this technology is still in a development phase. New applications are discovered every day. Data for these machine learning algorithms is initially selected and provided by humans. Here lies the fundamental problem: selected data is too often biased.
Microsoft's hateful AI: Tay
One of the clearest examples of a system that had a major impact through biased data was Microsoft's Twitter bot "Thinking About You", shortened to "Tay."
Tay was modelled on a 19-year-old American girl and was designed to learn from interactions with other Twitter users. She was launched on 23 March 2016.
According to Bloomberg Businessweek (April 4, 2016), Microsoft's plan was to launch multiple bots, each with its own personality and increasingly realistic behaviour. These bots were designed to be self-learning through interaction with users; the interaction data was stored and used to train the algorithm.
Lili Cheng, VP of Microsoft AI & Research, already expected the first iteration to be imperfect:
"When you start early, there's a risk you get it wrong. I know we will get it wrong. Tay is going to offend somebody." - Lili Cheng, Bloomberg, 2016
Within 16 hours, Tay had offended not just "somebody" but millions of people. Microsoft took her offline after a flood of hate tweets and foul language. In a matter of hours, Tay had gone from "super human" to hardcore neo-Nazi territory, posting several racist and Hitler-adoring messages.
Tay had been influenced by direct messages from users in the first hours of her existence: posts full of hate and bias. Tay derived her worldview from these messages, the only data available to her. She is a textbook example of how machine learning can interpret the world through a specific bias lens. All her training data consisted of extreme right-wing content, which became "normal" to her.
The Great Intellectual Fraud: exam grades based on the normal distribution
The assumption that the world is always normally distributed is one of the most dangerous prejudices in data science. Nassim Taleb describes this powerfully in The Black Swan:
"Almost everything in social life is produced by rare but consequential shocks and jumps; all the while almost everything studied about social life focuses on the 'normal,' particularly with 'bell curve' methods of inference that tell you close to nothing. Why? Because the bell curve ignores large deviations, cannot handle them, yet makes us confident that we have tamed uncertainty." - Nassim Nicholas Taleb
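Taleb's point can be illustrated with a small simulation. The sketch below draws a heavy-tailed sample and asks how often a "3-sigma" deviation actually occurs, compared with what a fitted normal distribution would predict. The Pareto shape parameter and sample size are arbitrary choices; any sufficiently heavy tail makes the same point.

```python
import random
import statistics

random.seed(0)

# A heavy-tailed (Pareto, alpha = 2.5) sample standing in for real-world
# social or financial data.
data = [random.paretovariate(2.5) for _ in range(100_000)]
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

# How often does the data exceed mu + 3*sigma, versus how often a normal
# distribution says it should?
threshold = mu + 3 * sigma
observed_tail = sum(x > threshold for x in data) / len(data)
normal_tail = 0.00135  # P(X > mu + 3*sigma) for any normal distribution

print(f"observed: {observed_tail:.4f}, normal predicts: {normal_tail:.4f}")
```

The "rare" 3-sigma event turns out to be several times more frequent than the bell curve promises: exactly the kind of large deviation the normal assumption ignores.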
During the COVID pandemic, large parts of Europe, including the United Kingdom, were unable to hold final exams. To ensure students still received a grade, an algorithm was developed. The normal distribution was applied, which meant that at least one child in every class had to receive an unsatisfactory mark, even when previous results did not suggest this was appropriate.
As a result, the least strong student in a class of otherwise excellent students received a fail, even though their average grade was more than satisfactory. Children from poorer backgrounds received disproportionately worse grades than expected, while students from private schools received higher grades.
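The mechanism can be sketched as a toy re-grading rule. This is not the actual UK algorithm; it only illustrates the failure mode: grades are handed out by rank within the class to match a fixed target distribution whose bottom band is always a fail. All numbers are hypothetical, assuming a pass mark of 5.5 on a 10-point scale.

```python
# A uniformly strong class: every student's own average is a clear pass.
class_averages = [9.1, 8.8, 8.5, 8.2, 7.9, 7.6]
# Fixed target distribution, imposed by rank: the bottom band is a fail.
target_bands = [9.0, 8.0, 7.0, 6.5, 6.0, 4.5]

PASS_MARK = 5.5

# Re-assign grades purely by rank within the class.
ranked = sorted(class_averages, reverse=True)
assigned = dict(zip(ranked, target_bands))

weakest = min(class_averages)
print(f"average {weakest} -> assigned {assigned[weakest]}")
# The weakest student is forced into the 4.5 band, a fail,
# despite a clearly passing 7.6 average.
```

The student's own performance never enters the final step; only their rank and the imposed distribution do.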
After widespread concern from parents and educators, the government decided that only grades based on human judgement would be valid.
The use of historical data here was biased. It did not account for the additional variables that had influenced students during that period, nor did the normal distribution account for extreme variations.
Biases: we are blind to our own beliefs
A bias is a systematic error, favourable or unfavourable, conscious or unconscious, in the thinking process, mainly based on general human assumptions. The result is data that doesn't accurately reflect reality, yet is accepted by the analyst.
Data that a person selects, collects and/or interprets manually โ consciously or unconsciously โ reflects these prejudices. This information ends up in reports on which business choices are based. Those choices again generate data based on prejudice. Business strategy supported by biased data rests on dangerous quicksand.
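A minimal example of how manually selected data carries such an error is selection bias. In the toy sketch below, every number is made up: a company estimates customer satisfaction from reviews alone, while dissatisfied customers are assumed to be four times as likely to leave a review.

```python
import random
import statistics

random.seed(1)

# True satisfaction scores of the whole customer base (hypothetical:
# normally distributed around 7.0 on a 10-point scale).
population = [random.gauss(7.0, 1.5) for _ in range(10_000)]

# Only some customers review; dissatisfied ones (score below 6) are
# assumed four times as likely to do so.
reviews = [s for s in population
           if random.random() < (0.8 if s < 6.0 else 0.2)]

true_mean = statistics.fmean(population)
review_mean = statistics.fmean(reviews)
print(f"true mean: {true_mean:.2f}, review-based estimate: {review_mean:.2f}")
```

The review-based estimate lands systematically below the true average, and any strategy built on it inherits that error without anyone having falsified a single data point.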
Researchers are generally well acquainted with the various biases and how they can negatively affect a study. But the data and business intelligence domain is growing fast, and the demand for staff far outstrips the supply, so more and more employees are trained internally. This increases unfamiliarity with biases, and with it the business risk.
Every data analyst, data engineer and data scientist should be familiar with the most common biases that can be traced back to data. This should be a standard component in training and education, but it isn't.
You will learn how to build the most beautiful interactive Power BI dashboard and how to set up the most dynamic Azure lakehouse infrastructure. A genuinely critical look at data quality, free from bias, is hardly taught; it's not sexy enough, at least until the mistakes cost millions.
What's next
We often hear only about the beautiful developments in AI. But there are plenty of articles and stories about biased data and its consequences for businesses and society.
This business risk still receives too little attention until a scandal breaks, money is lost, or the business strategy turns out to be wrong.
Part 2 covers the different types of data bias, mitigating measures, and a process to prevent this risk.