After a busy day Anton and Amber are leaving the office. The sliding doors open smoothly. A wall of heat meets them. It is 17:03 on a warm Thursday evening in July. The temperature is dropping, but it is still 31 degrees Celsius.
"This heat. It still feels like Spain. I assumed the temperature would drop faster." Anton sighs, frowning his dark eyebrows. Amber looks at Anton with a smile. She can see on his face that he also didn't expect this temperature at this time of day.
They walk to the train station together, talking about the tropical weather and the hectic day of meetings, with its never-ending discussions about the right Data Strategy. What is "right" anyway?
As they approach the zebra crossing, Anton steps out without paying attention to the traffic. "Watch out!" Amber suddenly pulls Anton back towards her. A BMW misses him by a few centimetres.
"BMW drivers โ they are careless idiots!", says Amber, shocked and irritated.
This is part two of a series on biased data. Part one covered the problem of biased data in everyday life and business.
Nothing new under the sun: 1996
We deal with biases every day: continuously, mostly unnoticed and unconsciously. To assume that our beliefs do not flow into IT systems, and from those systems into the data they generate, is erroneous and very naive.
Back in 1996, Friedman and Nissenbaum wrote a paper on biases in computer systems. At the time they argued that bias can lead to discrimination against individuals and groups, and warned that biased systems can lead to a "system of injustice." Friedman and Nissenbaum were right.
They identified three types of bias:
1. Existing Bias
Existing biases operate at two levels: society-level and individual-level. They are ingrained in individuals or in society and arise outside of technological systems. However, they can end up in a database in different ways, often unconsciously, through the implicit prejudices of a developer or data professional. Assumptions about origin and background can end up in an algorithm without anyone noticing.
2. Technical Bias
These arise from limitations of technology or bad programming. Examples:
- A maximum number of rows per screen, causing the first page to get the most clicks; nobody goes beyond page one of Google search results
- Loan applications processed in alphabetical order instead of by time of submission
- Errors in randomisation, so that certain groups always have a disadvantage
- Digital proctoring software not trained to identify people with certain skin tones
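The loan-application bullet above can be sketched in a few lines of Python (the names and submission order are invented): a review queue sorted alphabetically quietly favours applicants whose names sort early, while sorting by submission time treats everyone equally.

```python
# Hypothetical loan applicants: (name, submission_order).
applicants = [("Zara", 1), ("Anna", 2), ("Ben", 3), ("Yusuf", 4)]

# Technical bias: processing alphabetically means Anna is reviewed
# before Zara, even though Zara applied first.
biased_queue = sorted(applicants, key=lambda a: a[0])

# Neutral alternative: process in order of submission.
fair_queue = sorted(applicants, key=lambda a: a[1])

print([name for name, _ in biased_queue])  # ['Anna', 'Ben', 'Yusuf', 'Zara']
print([name for name, _ in fair_queue])    # ['Zara', 'Anna', 'Ben', 'Yusuf']
```

Note that the bug is not in `sorted` itself; it is in choosing a sort key that has nothing to do with fairness.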
3. Emerging Bias
Where existing and technical bias can sometimes be detected in advance, emerging bias cannot. It arises from a mismatch between what the data is intended for and the context of the user. This often emerges after a system has been delivered, or following a cultural shift.
A Sinterklaas story that was socially accepted ten years ago is not today. An algorithm developed for the Asian market is implemented in Europe, where it generates biased results because of different cultural habits. How often do "Siri" or "Alexa" react inappropriately? Different dialects and languages are difficult to predict.
Six biases every data professional should know
Type 3 bias is hard to detect and often arises after deployment. But there are several type 1 and 2 biases that a data professional can actively look for.
1. Sampling Bias
This occurs when certain data is more likely to end up in the database than others. The sample is not representative of reality.
Everyday example: An election survey whose respondents do not reflect the total population. If it is published in the media, it can affect the outcome of the election through availability bias.
Data example: A satisfaction score based only on customers who actually bought the product or service. You miss everyone who said "no", yet that is critical information.
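A minimal sketch of that data example, with invented numbers: surveying only buyers paints a far rosier picture than surveying everyone you pitched.

```python
# Made-up prospect data: (bought, satisfaction score 1-10). Non-buyer
# scores come from a hypothetical follow-up survey of lost deals.
prospects = [
    (True, 9), (True, 8), (True, 9),     # happy buyers
    (False, 3), (False, 2), (False, 4),  # prospects who said "no"
]

# Biased sample: only customers who actually bought respond.
buyers = [score for bought, score in prospects if bought]
print(round(sum(buyers) / len(buyers), 1))      # 8.7 -- looks great

# Representative sample: everyone you pitched to.
everyone = [score for _, score in prospects]
print(round(sum(everyone) / len(everyone), 1))  # 5.8 -- a very different picture
```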
2. Overgeneralisation Bias
Generalising takes less energy for our brains, which is useful in many situations. But generalisations are often wrong.
The Roman satirist Juvenal wrote around AD 100 that the "perfect woman" resembled a "nigroque simillima cygno", a black swan, a bird then believed not to exist. For centuries, the black swan was a synonym for something that did not exist, until Dutch explorer Willem de Vlamingh reached Western Australia in 1697. The black swan does exist.
Everyday example: People called Simon don't like BMWs. Blonde women are pretty but aren't very smart. I was recently in Berlin; Germans are very grumpy.
Data example: Using data generated from the Asian market and applying it directly to the European market. Implementing an algorithm built for a large medical research hospital at a small local clinic.
3. Implicit Bias
Stereotyping. Making assumptions based on personal experiences and beliefs. Deeply woven into culture and education. First defined by psychologists Mahzarin Banaji and Anthony Greenwald.
Book tip: Blindspot: Hidden Biases of Good People
Everyday example: Research shows that assumptions about women and mathematics are not necessarily accurate. But by passing these assumptions from generation to generation, fewer women study science, fewer women are hired in exact professions, and fewer role models exist. A self-fulfilling prophecy is created.
Data example: Amazon worked on an algorithm to automate the HR process and prevent discrimination by humans. To train it, employee CVs were used as data, and around 50,000 key terms were classified. The result: women were considered less suitable to work at Amazon. CVs containing the word "woman" received fewer points than those with "man". Because fewer women worked at Amazon, the system learned that a woman didn't fit the culture.
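The mechanism can be illustrated with a deliberately tiny, made-up sketch (this is not Amazon's actual model): a naive scorer trained on a male-skewed hiring history simply replays that history, so a single gendered word changes the score.

```python
from collections import Counter

# Hypothetical historical CVs of hired employees, heavily male-skewed,
# mirroring the imbalance described above.
hired_cvs = [
    "men's chess club captain software engineer",
    "software engineer men's rowing team",
    "software engineer backend python",
]
rejected_cvs = [
    "women's chess club captain software engineer",
]

hired_terms = Counter(" ".join(hired_cvs).split())
rejected_terms = Counter(" ".join(rejected_cvs).split())

def score(cv: str) -> int:
    # Naive scoring: terms common among hires add points, terms seen
    # among rejections subtract them. The model just replays history.
    return sum(hired_terms[t] - rejected_terms[t] for t in cv.split())

# Two otherwise identical CVs, differing in one word:
print(score("software engineer men's chess club captain"))    # 6
print(score("software engineer women's chess club captain"))  # 3
```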
4. Confirmation Bias
Many organisations determine a strategy and then look for data to support it. The danger is that only data and views that confirm the strategy are collected. Data that doesn't fit is ignored, left out or, worse, manipulated.
"Kill your darlings" is difficult at managerial level. Nothing feels better than confirmation that your strategy is the right one.
Book tip: Adam Grant, Think Again: The Power of Knowing What You Don't Know
Everyday example: A dog attacked a child. Eyewitness A, who thinks dogs are dangerous, sees a wild dog attacking a kid. Eyewitness B, who loves dogs, sees a dog defending itself against an unruly kid. Neither account is truly reliable.
Data example: Ignoring data that shows different information than you would like to see. Selecting only positive data sources. When this ends up in algorithms, there is a strong chance of an "echo chamber" or "filter bubble": more and more misinformation is created, which reinforces existing beliefs.
5. Automation Bias
Assuming that automated outcomes are better than human judgement.
Everyday example: Aircraft flight automation was designed to relieve pilots of repetitive actions so they could maintain an overview. In 2014, Casner and Schooler investigated the effects of automation in aircraft. Their research showed that as more automation was introduced, pilots became less engaged with actually flying. When automation was suddenly disabled during simulation flights, pilots kept the aircraft in the air, but with far less control. Too much reliance had been placed on the automated process.
Data example: Letting an analytic tool select a machine learning algorithm based on a data pool. Trusting the automated choice without any "black box" control of the outcome. Using the result for strategic decisions because the computer said so.
6. Availability Bias
The tendency to give more weight and importance to examples or thoughts that come first to mind. This bias has enormous impact on how we view the world.
Everyday example: Starting a new project? It should be done agile; everybody is doing it! A colleague sneezes. That must be COVID. Pimples? Monkeypox.
Data example: Believing only new data is relevant. Not ensuring that your data has the right depth and breadth. If your data pool gets archived to reduce storage costs, creating a correct model or dashboard that represents a true picture becomes impossible.
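A small sketch with invented monthly figures: once older data is archived away, the same average calculation tells a very different, and misleading, story.

```python
# Hypothetical monthly sales, January to December. The business is
# seasonal: strong in winter, weak in summer.
monthly_sales = [120, 115, 90, 70, 60, 55, 50, 65, 80, 100, 110, 125]

# After archiving to cut storage costs, only the last quarter remains.
recent_only = monthly_sales[-3:]
full_year = monthly_sales

print(round(sum(recent_only) / len(recent_only), 1))  # 111.7 -- "sales are booming"
print(round(sum(full_year) / len(full_year), 1))      # 86.7 -- the seasonal reality
```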
What's next
I promised to describe mitigation measures and a process in this article. I have not kept that promise; the subject deserves more space. That will be the core subject of part 3.
Oh, and to all the BMW drivers out there: I actually love BMWs. I promise to drive an electric BMW after my Tesla.