The truth, the whole truth and nothing but the truth
The truth
[Truth is the property of being in accord with fact or reality. In everyday language, truth is typically ascribed to things that aim to represent reality or otherwise correspond to it, such as beliefs, propositions, and declarative sentences | Wikipedia (link)]
ChatGPT (link) is the talk of the town. Some people are in total awe of its capabilities and possibilities; others wonder about its impact on creativity and art¹, or on our education systems.
This blog is about something else: the truth.
Artificial Intelligence (AI) answers questions by applying models. These models are trained by feeding them with data. That is where I believe things get tricky. I know AI has come a long way in recent years and that the data sets used to train the models are huge. Flagrant failures like predicting the next Miss Universe based on data biased towards white contestants (link) are less likely to happen today, but how can we be sure the models are not biased, and are instead based on generally accepted facts or reality (i.e. the truth)?
The truth and what ChatGPT makes of it
The creators of ChatGPT describe its limitations (link). But these disclaimers are probably not read by the vast majority of ChatGPT users (just like most people blindly accept Terms & Conditions). Most users of ChatGPT will use the answers that are generated by ChatGPT as-is and believe these answers are the truth.
Some possible issues arise:
- The data used to train the model was collected up to 2021. This means that everything that happened in 2022 and later is not in the current model. Major events like the war in Ukraine are not part of the current ChatGPT model.
- The model was trained using data obtained from books, web texts, Wikipedia, articles and other pieces of writing on the internet (link). I work for a software company and if I ask ChatGPT how my company’s product compares to that of a competitor, I get an answer that comes pretty close to our marketing material. Would that mean that the more marketing budget I have, the more favourable the ChatGPT answer would be? Would that give propaganda a louder voice?
- There is a lot of fake news and disinformation on the internet. The creators of ChatGPT employed human AI trainers to validate the model. Can we be sure they have filtered out fake news and disinformation? As I write this blog, ChatGPT is temporarily unavailable, but I am curious to see what it says about what happened in Armenia between 1895 and 1923. As a user of ChatGPT you can provide feedback on the answers it gives. Would a large group of people providing feedback be able to change the answer to what happened in Armenia (i.e. genocide or no genocide)?
- Based on the current model, people (and bots?) will create more and more content and publish it to the internet. The next time ChatGPT is trained, some of that content may end up in the model, reinforcing the slightly biased knowledge over and over again, eventually diluting the truth.
Lost civilizations & Recency bias
Throughout the course of history, great civilizations have been built and disappeared. Sometimes it took centuries to reinvent what these ancient civilizations had already discovered.
Will tools like ChatGPT suffer from recency bias (link)? Will they give greater importance to the most recent events and, by doing so, cause our civilization to lose more ancient knowledge? Will future generations have to reinvent things we know today, just because ChatGPT has made us forget them?
Human Civilization Body of Knowledge
Humanity creates and uses incredible amounts of data. It has been estimated that 94 zettabytes of data were consumed in 2022, growing to 118 zettabytes in 2023 (link).
It will be impossible to verify all of it. If we rely more and more on AI to make sense of all this data, how do we filter out fake news and how do we avoid recency bias?
Maybe we should reset the ChatGPT model: start from scratch, using only material that is generally accepted to be accurate and truthful. Not easy, as truth is in the eye of the beholder. Essential nonetheless.
Then, after the reset, only add data that is generally accepted to be accurate and truthful. Maybe we need a Blockchain of Truth. Maybe we need a “Civilization Body of Knowledge duty”: require qualified citizens to occasionally serve on a committee that maintains our Civilization Body of Knowledge.
¹ The image for this article was generated by DALL·E 2 using the phrase “robots swearing the truth the whole truth and nothing but the truth”.