How our data encodes systematic racism

One day GPT-2, an earlier version of the automated language generation model developed by the research organization OpenAI, started talking to me openly about “the rights of whites.” Given simple prompts like “a white man is” or “a black woman is,” the text generated by the model would launch into discussions of “white Aryan nations” and “foreign and non-white invaders.”

Not only did these rants include horrific slurs like “bitch,” “nigger,” “slit,” and “slanteye,” but the generated text embodied a specific American white nationalist rhetoric, describing “demographic threats” and veering into anti-Semitic asides against “Jews” and “Communists.”

GPT-2 does not think on its own: it generates responses by replicating the language patterns observed in the data used to develop the model. This dataset, named WebText, contains “over 8 million documents for a total of 40 GB of text” gathered from hyperlinks. These links were themselves selected from the most upvoted posts on the social media website Reddit, as “a heuristic indicator for whether other users found the link interesting, educational, or just funny.”

However, Reddit users, including those who post and upvote, are known to include white supremacists. For years, the platform was rife with racist comments and allowed links to content expressing racist ideology. And although there are practical options available to curb this behavior on the platform, the first serious attempts to take action, by then-CEO Ellen Pao in 2015, were poorly received by the community and led to harassment and backlash.

Whether it’s wayward cops or wayward users, technologists choose to allow this particular oppressive worldview to solidify into data sets and define the nature of the models we develop. OpenAI itself has recognized the limitations of sourcing data from Reddit, noting that “many malicious groups use these discussion forums to organize.” Yet the organization also continues to use the Reddit-derived dataset, even in later versions of its language model. The dangerously flawed nature of the data sources is effectively dismissed for the sake of convenience, despite the consequences. Malicious intent is not required for this to happen, although a certain thoughtless passivity and neglect is.

Little white lies

White supremacy is the false belief that white individuals are superior to those of other races. This is not a simple misconception but an ideology rooted in deception. Race is the first myth, superiority the next. The supporters of this ideology stubbornly cling to an invention that favors them.

I hear how this lie softens language from a “war on drugs” to an “opioid epidemic,” and blames “mental health” or “video games” for the actions of white attackers even as it attributes “laziness” and “criminality” to non-white victims. I notice how it erases those who look like me, and I watch it play out in an endless parade of pale faces that I can’t seem to escape – in the movies, on magazine covers, and at awards shows.

Datasets so specifically constructed in and for white spaces represent a constructed reality, not a natural one.

This shadow follows my every move, an uncomfortable shiver on the back of my neck. When I hear “murder,” I don’t just see the police officer with his knee on a throat or the misguided vigilante with a gun by his side – it’s the economy that strangles us, the disease that weakens us, and the government that silences us.

Tell me, what’s the difference between overpolicing in minority neighborhoods and the bias of the algorithm that sent officers there? What is the difference between a segregated school system and a discriminatory scoring algorithm? Between a doctor who doesn’t listen and an algorithm that denies you a hospital bed? There is no systematic racism separate from our algorithmic contributions, from the hidden web of algorithmic deployments that routinely collapse on those who are already most vulnerable.

Resist technological determinism

Technology is not independent of us; it’s created by us, and we have full control over it. Data is not just arbitrarily “political” – there are specific toxic and misinformed politics that data scientists carelessly allow to infiltrate our datasets. White supremacy is one of them.

We have already incorporated ourselves and our decisions into the result – there is no neutral approach. There is no future version of the data that is magically unbiased. Data will always be a subjective interpretation of someone’s reality, a specific presentation of the goals and perspectives that we choose to prioritize at this time. It is a power held by those of us who are responsible for sourcing, selecting, and designing this data and developing the models that interpret the information. Essentially, there is no exchange of “fairness” for “accuracy” – that is a mythical sacrifice, an excuse not to admit our role in defining performance to the exclusion of others in the first place.


