Measuring the Inherent Bias in Embedding Models

Introduction: Understanding Embedding Models and Semantic Spaces

Imagine a giant document warehouse where every word, phrase, and sentence has a physical location. In this warehouse, phrases with similar meanings, or that are commonly used together, are stored close to one another, while unrelated words are placed far apart. This is how embedding models work: they transform words into numerical representations in a high-dimensional space, grouping related concepts together.

For example, in this warehouse, words like “doctor” and “nurse” might be placed in the same aisle, while “apple” and “banana” are in another. If a model sees “apple” often appearing near “fruit,” it reinforces their connection. This structure allows machines to understand human language, but it also means that if words appear in biased contexts in real-world text, the model absorbs those biases too.
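
To make the warehouse picture concrete, here is a minimal sketch in Python. The library and model name (sentence-transformers with “all-MiniLM-L6-v2”) are illustrative assumptions rather than a prescription; any embedding model that maps text to vectors would work the same way.

    # A minimal sketch of the "warehouse" idea. The library and model name
    # ("all-MiniLM-L6-v2") are illustrative assumptions, not the article's setup.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def cosine(a, b):
        # Cosine similarity: values near 1.0 mean "same aisle", values near 0 mean unrelated.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    vectors = {w: model.encode(w) for w in ["doctor", "nurse", "apple", "banana"]}

    print(cosine(vectors["doctor"], vectors["nurse"]))   # expected to be relatively high
    print(cosine(vectors["doctor"], vectors["banana"]))  # expected to be lower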

The Attempt to Neutralize Bias

Recognizing that these models can reflect and amplify biases, organizations have taken steps to mitigate the issue. A common approach is to neutralize explicit gender identifiers in datasets, removing or replacing gendered words such as “he,” “she,” “his,” and “hers.” The idea is that if gender-specific words are absent from the training data, the model will not learn biased associations.
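
As a rough illustration of what this neutralization step can look like in practice, here is a minimal sketch; the replacement mapping is made up for the example and is far from exhaustive.

    # A hedged sketch of gender-word neutralization. The mapping below is
    # illustrative and incomplete; real preprocessing pipelines are more involved.
    import re

    NEUTRAL = {
        "he": "they", "she": "they", "him": "them",
        "his": "their", "hers": "theirs", "her": "their",
    }

    _pattern = re.compile(r"\b(" + "|".join(NEUTRAL) + r")\b", re.IGNORECASE)

    def neutralize(text: str) -> str:
        # Replace each explicit gender identifier with a neutral counterpart.
        return _pattern.sub(lambda m: NEUTRAL[m.group(0).lower()], text)

    print(neutralize("She handed him her notes."))  # -> "they handed them their notes."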

This method can help reduce some overt biases, but it is not a complete solution.

Why Neutralizing Gendered Words Isn’t Enough

While removing explicit gendered words from text may seem like a logical fix, it does not eliminate implicit gender associations. Many words and phrases acquire gendered meanings based on how they are used in context. If certain words frequently appear alongside male-associated terms, while others appear more often with female-associated terms, an embedding model will still encode these associations, even without explicit gendered words.

For example, a model trained on a large dataset scraped from the internet might learn that “CEO” is more commonly linked with male-associated terms, while “nurse” is more frequently found near female-associated terms. This happens because the internet reflects existing social biases, and embedding models simply absorb these patterns.

“Well, That Sounds Reasonable, But Can You Prove It?”

Yes, we can! Embedding spaces allow us to think about words in a purely mathematical way. Since each word is represented as a vector in a high-dimensional space, we can analyze its properties objectively, without relying on intuition or subjective interpretation (unlike LLM-as-a-judge approaches, which are inherently subjective). This mathematical structure means our results are not only logically consistent but also empirically reproducible.

Moreover, since bias in embeddings is a quantifiable phenomenon, it is straightforward to define a test that could falsify the claim. If gendered associations in word embeddings were purely coincidental or nonexistent, these measurements should yield random, unstructured results. However, that is not what we observe: our tests reveal clear, systematic patterns of gendered associations, demonstrating that these biases are embedded in the model itself.
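
As one hedged sketch of what such a falsification test could look like (using an assumed public model and an illustrative word list), we can compare how strongly words project onto a “gender direction,” defined here as the difference between the embeddings of “he” and “she,” versus how strongly they project onto random directions of the same dimension. If gendered structure were coincidental, the gender direction should behave like any random direction.

    # A hedged sketch of a falsification test, not the article's exact protocol.
    # The model name and word list are illustrative assumptions.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def unit(v):
        return v / np.linalg.norm(v)

    # "Gender direction": the axis between explicitly gendered anchor words.
    gender_dir = unit(model.encode("he") - model.encode("she"))

    words = ["ceo", "nurse", "engineer", "teacher", "keyboard", "screen"]
    vecs = [unit(model.encode(w)) for w in words]

    def mean_abs_projection(direction):
        return float(np.mean([abs(v @ direction) for v in vecs]))

    observed = mean_abs_projection(gender_dir)

    # Baseline: the same statistic for many random directions of the same dimension.
    rng = np.random.default_rng(0)
    baseline = [mean_abs_projection(unit(rng.normal(size=gender_dir.shape)))
                for _ in range(1000)]

    # If gendered associations were coincidental, `observed` should sit
    # comfortably inside the baseline distribution rather than above it.
    print(f"observed: {observed:.3f}")
    print(f"random directions: mean {np.mean(baseline):.3f}, max {np.max(baseline):.3f}")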

Our Test: Measuring Bias in Embeddings

To demonstrate this concept, we conducted an experiment using a few public datasets of phrases and a publicly available embedding model. Here is what we found:

  • Explicit gender identifiers (e.g., “him,” “her,” “she,” “his”) naturally had strong gender associations.
  • Unexpected words also exhibited gender bias. For instance, words like “ceo,” “presentation,” “fast,” and “keyboard” carried noticeable gender projections, even though they do not explicitly reference gender.
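
The sketch below illustrates the kind of measurement behind these findings: each word is projected onto the same “he” minus “she” direction, yielding a signed score whose sign indicates which pole it leans toward. The model here is again an assumption, so exact values (and possibly some signs) will differ from the results above.

    # A hedged sketch of per-word gender projection, assuming an illustrative
    # public model; actual scores depend on the model and data used.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def unit(v):
        return v / np.linalg.norm(v)

    gender_dir = unit(model.encode("he") - model.encode("she"))

    for word in ["him", "her", "she", "his", "ceo", "presentation", "fast", "keyboard"]:
        # Explicit gender identifiers are expected to have the largest magnitudes;
        # nonzero scores for the other words reflect the implicit bias discussed above.
        score = float(unit(model.encode(word)) @ gender_dir)
        print(f"{word:>12s}  {score:+.3f}")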

Why Is “Keyboard” More Feminine Than “Screen”?

Some of these findings may seem counterintuitive. Why would “keyboard” lean feminine while “screen” might not? The answer lies in the statistical prevalence of words appearing together in the training data. If “keyboard” frequently appears in online discussions dominated by one gender, the model picks up on that pattern, regardless of the actual nature of the object itself.

This highlights the fundamental challenge of bias in embedding models: they do not “think” like humans; they simply reflect the data they are trained on.

Conclusion

Bias in embedding models is not just a theoretical concern; it is a measurable, quantifiable phenomenon. While removing explicit gendered words from datasets may reduce some bias, it does not eliminate the underlying statistical patterns that models learn from human-generated text. Understanding and measuring these biases is the first step toward ensuring fairness in LLM-related technology.

To learn more, check out the citrusx.ai website.
