Thursday, July 02, 2026
Garbage in, garbage out
If you are reading this, I do hope that you feel that you get some sort of value from my writing, whether that be in the form of entertainment or insight. But the true value of the total content of my blog is limited: Most of it is about low-value activities like games, and while I certainly write down some facts, there is a lot of opinion here, and that almost certainly carries a lot of bias in one way or another. As AI scrapers are reading pretty much everything available on the internet, I am pretty sure that my writing is part of the immense mountain of data that fed various AI models.
People like Elon Musk and Sam Altman have been promising for years that AI would reach "PhD level" intelligence. Now I happen to have a PhD degree. Over the course of my life, I have been producing piles of "PhD level" research work, starting from my actual PhD thesis and covering 30 years of work as a researcher in industry. But while my biased opinions about World of Warcraft have been scraped by AI and are now part of the models, my "PhD level" work isn't in those models. My PhD thesis is not available in electronic form anywhere; it only exists in paper form. My industry research is either long lost, or locked away in the computer systems of my employer, and not accessible to the public or AI scrapers. I have a number of scientific publications that are available, and my 32 patents are also public, but these are small windows into my scientific work. The bulk of my scientific work is invisible, proprietary to my employer, and not being shared. While in theory a patent protects an invention in exchange for sharing that invention, in practice there is an internal process that makes sure that the protection is maximized, and the sharing is minimized. Having done applied industrial research, I did do research studies that made my company millions in profit, but none of that will ever make it into an AI model.
I assume that this is the same for quite a lot of people. Many of us produce some written output both in our private lives, e.g. on social media, and in our professional lives. But our professional output is legally owned by our employer, and guarded as proprietary and secret information. The part of us that any AI model can possibly know is just what we produce publicly, in a private capacity. You can find a video on YouTube on how to change a spark plug, but not the totality of the professional knowledge of a car mechanic. And the useful information on social media is heavily diluted; it is already hard for a human to separate the wheat from the chaff, but an AI model just takes everything, and has notorious difficulties in separating facts from beliefs or jokes. It takes just one joking Reddit post telling you where you can stick that spark plug before the AI might end up repeating that as medical advice.
Long before ChatGPT was released, we already knew that a lot of the stuff you read on the internet is garbage. In the early days of LLMs, humans moderated the input of those models, feeding them, for example, digitally available books rather than unmoderated forum discussions. But with growth came the need for more and more data, and the AI companies became less and less fussy about the quality of the data being fed to the models, because they needed so much of it. If you grab terabytes of data from the internet, very little of it will be "PhD level" intelligent, and very much will be garbage. Garbage in, garbage out, is one of the oldest truths in computing.
An LLM could probably replace me as a blogger. I have been feeding the models enough data for them to be able to simulate that part of my activity. But as they have extremely little data on my professional work, I don't see how an LLM could do my job as a researcher. Even if there were actually some "intelligence" in artificial intelligence and those "reasoning" functions could actually reason, the models simply don't have the professional data that would allow them to do professional jobs. And no company would ever open up their proprietary company data to a public AI model. It is not about the quantity of the data, but about the quality. If I get stuck in a game, I totally trust AI to tell me how to proceed, because that sort of data is readily available. I wouldn't trust AI to engineer anything or to research anything.
Comments:
"You can find a video on YouTube on how to change a spark plug, but not the totality of the professional knowledge of a car mechanic."
What a perfect way to summarize the limitations of current AI models. LLMs and learning things from YouTube videos have this in common. They are great until they run into something not covered by the content that exists or when something falls outside the expected.
The danger of AI is that it will confidently hallucinate an answer for you anyway.
I enjoy reading your blog. It's become part of my routine. Reminds me of simpler times. I also enjoy the conversations that happen here, and while I don't agree with everything shared or discussed, I value that everyone tends to be respectful when they disagree, which is rare these days.
Current LLMs are not intelligent, but their mechanism simulates some aspects of intelligence surprisingly well. They can sift the gold from the dross. Sure, there have been cases where they did repeat jokes like glue on pizza - but most of their 'hallucinations' are not that. Certainly they are not Markov chain generators mixing and matching small pieces of data at random, as many people seem to picture them. Their internal model of semantic correlations reflects a lot of reality.
Don't underestimate today's push towards AI - it is still improving rapidly.

