Hacker News — vinext + Cloudflare Workers

▲GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2 (arrowtsx.dev)

202 points by oshrimpton 19 hours ago | 62 comments

stalfie 34 minutes ago [-]

One thing I wonder about hallucinations, is that it seems on the surface that it is an easy problem for RLVR to target. Since you're already generating enormous amounts of reasoning traces which are verified by correct answers, just have "don't know" as an option as a valid answer, and on problems where none of the thousands of reasoning traces led to a correct answer, just promote the traces that led to the "don't know" answer as training data. Essentially teaching the model that "I don't know" is a valid answer.

Sam Altman himself had a blog post about this a while ago that seemed to suggest this thought, so I guess it's obvious to everyone. But if that is so I assume it's just not as easy in practice.
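To make the proposal above concrete, here is a minimal sketch of the trace-selection rule it describes. It is not any lab's actual RLVR pipeline: the `Trace` type, the `ABSTAIN` sentinel, and the exact-match check against a verified answer are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical sentinel for an abstaining answer (an assumption for this sketch).
ABSTAIN = "I don't know"


@dataclass
class Trace:
    reasoning: str  # the sampled reasoning text
    answer: str     # the final answer extracted from the trace


def select_traces_to_promote(traces: list[Trace], verified_answer: str) -> list[Trace]:
    """Choose which sampled traces to reinforce for a single problem.

    If any trace reached the verified answer, promote those traces.
    If none did, promote the traces that abstained instead.
    """
    correct = [t for t in traces if t.answer == verified_answer]
    if correct:
        return correct
    return [t for t in traces if t.answer == ABSTAIN]


samples = [
    Trace("tries long division", "42"),
    Trace("gives up after checking units", ABSTAIN),
    Trace("guesses", "41"),
]

# One trace is correct, so only that trace is promoted.
solvable = select_traces_to_promote(samples, verified_answer="42")

# No trace is correct, so the abstaining trace is promoted.
unsolved = select_traces_to_promote(samples, verified_answer="7")
```

Under this rule a problem with at least one correct trace never rewards abstention, which is one place the "not as easy in practice" caveat might show up: the model could still learn to abstain on problems it solves only occasionally.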

cyanydeez 27 minutes ago [-]

the problem is the null answer will stop the "markov" chain.

so, thats all.

amelius 25 minutes ago [-]

But if an LLM says "I don't know" should you pay for the tokens?

guerrilla 12 minutes ago [-]

Why not? It did the work. Why should you expect it to be omniscient?

We can rank them based on how much they know and people will gravitate towards those that do know more.

It's a market after all.

aesthesia 11 hours ago [-]

Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.

I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro with twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28%, but way below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses), you get 19% for Opus and 21% for GLM-5.2.

So yes, all else equal larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.
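The 19% and 21% figures in the comment above can be reproduced with a short sketch. It assumes accuracy is the share of all questions answered correctly and that the hallucination rate is measured only over the remaining (not-correct) questions; the function name is hypothetical.

```python
def absolute_hallucination_rate(accuracy: float, hallucination_rate: float) -> float:
    """Share of ALL responses that are hallucinated.

    Assumes `hallucination_rate` is conditional on the model not answering
    correctly, so it is scaled by the fraction of not-correct responses.
    """
    return (1.0 - accuracy) * hallucination_rate


# Figures quoted in the comment above.
opus = absolute_hallucination_rate(accuracy=0.47, hallucination_rate=0.36)
glm = absolute_hallucination_rate(accuracy=0.25, hallucination_rate=0.28)

print(f"Opus 4.8: {opus:.0%}, GLM-5.2: {glm:.0%}")  # Opus 4.8: 19%, GLM-5.2: 21%
```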

in-silico 9 hours ago [-]

Additionally, maybe it's easier for a model to realize that it doesn't know the answer when the question is easier.

If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the questions where verification or hallucination detection are the most difficult

gymbeaux 5 hours ago [-]

Those numbers are abysmal. Should we really be using LLMs to write our code? I have a theory: LLMs can spit out code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time. An enterprise app developed entirely by LLM-happy devs might end up virtually unmaintainable.

I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend, and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.

andybak 1 hours ago [-]

I can't help but feel that people continually underestimate how bad human written code becomes over time. The exception is probably single-person passion projects or open source projects that maintain quality governance over time.

I strongly suspect most closed source code developed under commercial or internal pressure is pretty awful after a few years of development.

All LLM code has to do is suck less than existing code. And that's presuming the code quality doesn't improve as the models, the harnesses and our ways of working with them improve.

O5vYtytb 2 minutes ago [-]

I've been sent code from vendors that didn't even compile, long before llms were a thing. Most shops that aren't primarily software have really really terrible software.

embedding-shape 32 minutes ago [-]

Sucky human-written code is still based on human understanding, which can change over time, be readjusted or solidified. People implement something wrong once, then update their perspective, then in the future do it right.

LLMs don't have this benefit. You forget to add the correction to the system prompt, and the LLM will repeat the same mistake over and over, and worse than that, their mistakes aren't based on their understanding, they're basically random guesses.

Humans, even bad coders, still seem to have some sort of architecture in mind, even if it's spaghetti, whereas LLMs (obviously) don't think more than a few steps ahead, and never about the full scope of what they're contributing to, and on purpose too, because you want the context to be as small as possible when you work with LLMs.

With LLMs you need to tread carefully between "What does the LLM need to know?" and "Can I skip passing this to the LLM this time?" while with a human you can more or less dump everything you're sitting on, let them sift through it, and they'll mostly make it out OK.

xzenor 20 minutes ago [-]

And where do you think the LLM learned coding from?

But anyway, let the LLM verify the code to give advice on improvements but don't let it write code unverified. That's my opinion on it anyway.

rienbdj 11 minutes ago [-]

I have a theory that LLM generated code in a highly modular style (simple data, pure functions) will be easier to “recover” by a human team when the LLM gets muddled. So Haskell, basically.

xvinci 2 hours ago [-]

Not my observation. If you never look at the code and don't have basic guardrails in place (linters, architecture tests, some guidelines for best practices) - probably.

But as soon as you do minimal reviews and high-level corrections, applications turn out just fine.

Can there be bugs? Sure. That's the price of not reading or understanding every line. How many of these you tolerate should depend on the criticality of your software (reviewing, understanding, and testing everything 100%, like you were used to when you wrote it yourself, will kill most if not all of your gained speed).

But I never got the impression of unmaintainability or unfixable bugs.

Actually the other way around: a really good cleanup pass, architectural change, or bugfix is seldom more than a few prompts and 2 hours away, provided your overall base is decent and you actually gave a fuck from the start.

VBprogrammer 2 hours ago [-]

> Can there be bugs? Sure. That's the price of not reading or understanding every line.

I've yet to come across a human developer whose output would meet this standard, despite writing every line.

In fact, having an LLM review our code is catching quite a few bugs before it reaches QA.

ben_w 1 hours ago [-]

Indeed, though I find the distribution is different.

The humans may skip unit tests and need reminding; the AI always writes unit tests once it's in AGENTS.md or whatever, but my experience* was that 5-10% of the time the LLM's attempt at a "test" would, instead of executing the code and examining the results, open the source code as a text file and run a regex to find/exclude certain substrings.

* At the start of this year, because Anthropic and OpenAI were both offering free trials. IDK how much things have changed since then, some things change fast in this domain, other things don't.

baq 45 minutes ago [-]

I’ve been piloting LLMs for the past six months non stop and we’re at the point where formally verified models generated as an intermediate step between spec and code are very good value.

Riding the exponential means you have to update priors more often.

realaleris149 2 hours ago [-]

Take a look at a sufficiently old random internal repo which was not written with LLMs and compare.

My observation is that they are equally bad and hard to maintain or even more so than the new ones.

One thing I’ve noticed is that the LLM-assisted ones have a lot more comments, which is nice but takes more time to read.

realusername 4 hours ago [-]

> code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time

They clearly are only assistants for the moment, you can use them to do work ... but only if you could do the said work yourself alone in the first place.

ben_w 1 hours ago [-]

I would say "only if you can review said work yourself alone", rather than "do".

I'm an experienced developer, but I don't count myself as a web dev or a python dev; I can review the web and python stuff I get out of the AI (sometimes I need to ask the AI follow-up questions so I can find official documentation for what it did), but I can't write it.

Foobar8568 4 hours ago [-]

Have you worked with enterprise apps? The ones I have used for decades are hot garbage.

IsTom 3 hours ago [-]

Now imagine decades of LLM code. Extrapolating the rate of increase of LoC, the source code ain't gonna fit on hard drives anymore.

sudosysgen 6 hours ago [-]

This is missing a common failure mode, which is information past the knowledge cutoff. If you need info past that time they'll fail no matter how big or small the model is, so the hallucination rate can matter independently of the knowledge base. If all use-cases had a uniform risk of falling out of support, this would be a valid argument, but since it's often the case that a datapoint is guaranteed to fall out of support, the absolute ability to recognize that is crucial.

reinitctxoffset 6 hours ago [-]

Hallucination should be called "failure to ground".

Something about the cost model of US near frontier has the cattle prod out whenever a model is uncertain but thrashes on whether to search. Search flinch is roughly all hallucination.

I don't even wait for the model's turn, if there's a man page or Hoogle hit, stuff the last prefix cache cut point. You come out ahead.

grayhatter 8 hours ago [-]

> Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.

Do you have a cite for this?

If a human makes up some bullshit lie, I wouldn't accuse them of making it up only if they actually knew the correct answer. If you don't know, the only correct answer is I don't know. Any other answer is made up bullshit. Why is it only a hallucination if and only if the LLM contains the answer? If you make something up it's still wrong. It shouldn't matter if you could give the correct answer. You didn't, and instead invented some bullshit instead?

Follow up question, how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.

edit:

> and it's not totally clear that this is the main metric that's worth tracking.

I don't know, the rate at which some model is willing to make up something feels useful. If the argument I see repeated on HN so much is that it's impossible to completely get rid of hallucinations; being able to choose a model that's less likely to invent some lie seems like a positive trait, no?

Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination iff the model doesn't know the answer feels strictly, infinitely less useful than an exact error rate. What percentage of emitted tokens are misleading would be useful for me. Anyone know any group that's attempted to quantify the global error rate?

jpalomaki 32 minutes ago [-]

As a human I also give wrong answers even if I know the right one. Sometimes I also give answers even when I don’t really know them.

When pushed, I then start thinking and realise my mistake. System 1 vs 2?

aesthesia 6 hours ago [-]

This isn't quite the point. When comparing two different models' hallucination rates, the denominator is different. The evaluation works more or less like this: for each question, the model has the option to answer or abstain, so there are three possible outcomes: the model answers and gets it right, the model answers and gets it wrong (hallucination), or the model abstains. The hallucination rate is (model answers wrong) / (model answers wrong or abstains). So if a model A has 50 correct answers, 20 incorrect answers, and 30 abstentions, its hallucination rate is 40%, while a model with 20 correct answers, 20 incorrect answers, and 60 abstentions has a hallucination rate of 25%, even though it hallucinated exactly the same number of times. This is why hallucination rate is incomplete as a metric: it says nothing about the accuracy rate.
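The worked example above can be checked with a small sketch. The formula follows the definition given in the comment; the function name and the returned keys are hypothetical.

```python
def score(correct: int, incorrect: int, abstained: int) -> dict[str, float]:
    """Compute accuracy and hallucination rate from outcome counts.

    Hallucination rate, as defined in the comment above:
        incorrect / (incorrect + abstained)
    """
    total = correct + incorrect + abstained
    not_correct = incorrect + abstained
    return {
        "accuracy": correct / total if total else 0.0,
        "hallucination_rate": incorrect / not_correct if not_correct else 0.0,
    }


model_a = score(correct=50, incorrect=20, abstained=30)
model_b = score(correct=20, incorrect=20, abstained=60)

# Both models hallucinated 20 times, but the denominators differ (50 vs. 80).
print(model_a)  # {'accuracy': 0.5, 'hallucination_rate': 0.4}
print(model_b)  # {'accuracy': 0.2, 'hallucination_rate': 0.25}
```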

sgc 7 hours ago [-]

Since models just output the most probable tokens and you can never accuse them of doing anything other than making it all up, I would like to see these tests run with a prompt that attempts to mitigate hallucination and finishes with something like: "Telling me that you don't have the relevant information or that the task is impossible is extremely useful to me and a valid answer", and see how much that changes the scoring - as well as the usefulness of the answers. There are so many skills like context7 that can be tweaked to improve these results as well.

In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.

grayhatter 6 hours ago [-]

> In other words, you shouldn't choose the model that hallucinates the least without detailed prompting

"You're prompting it wrong" is quickly becoming the new "you're holding it wrong".

It's wild how willing software engineers are to blame the user when the actual problem is their own defective design.

Ideally we all, as an industry, will stop accepting this as a reasonable excuse for the demonstrated incompetence.

luuundonjk 3 hours ago [-]

there is a difference between a human knowingly bullshitting and being confident because he misremembers something

master-lincoln 1 hours ago [-]

there is a difference in their intent, but not necessarily in the effect.

taffydavid 3 hours ago [-]

> For the non technical, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.

I'm already hallucinating about how this could work and it involves catapults

m3h 3 hours ago [-]

Or we could simply hallucinate that the packages are there at the three houses.

Hallucinations all the way down...

boofus 44 minutes ago [-]

Nobody said the 3 houses needed to be on separate properties. Just throw the 3 packages from the moving truck at the one address where all 3 live.

Being an LLM is easy!

sigmoid10 2 hours ago [-]

In the end it's just Boltzmann brains.

https://en.wikipedia.org/wiki/Boltzmann_brain

Lionga 2 hours ago [-]

Tell the delivery driver "Make no mistakes" and it should work I heard.

solid_fuel 11 hours ago [-]

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.

Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!

Everyone has been acting like the best models only hallucinate in edge cases, but even the best performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.

That said, I think the title on the blog - "Bigger models are not the way" is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the (absurd) idea of ever increasing scaling from these models.

oshrimpton 6 hours ago [-]

Agreed on the title, my bad! But yeah, I've had some truly terrible experiences using these "frontier" models in coding agents especially, where they just fabricate facts about codebases.

wiether 1 hours ago [-]

Purely anecdotal, but when OpenAI removed Codex-5.3 from the ChatGPT sub and forced me to move to GPT-5.5, the result was far worse than what I was enjoying with Codex.

And, of course, it was burning 10 times more tokens for this output.

fvv 48 minutes ago [-]

I have the opposite experience: with codex 5.3 I had to use 5.2 to design and 5.3-codex to execute, while 5.4 was better at both, and 5.5 (all used at xhigh) is even better.

oshrimpton 1 hours ago [-]

Yeah they are 100% in the wrong for removing the fine tuned codex models. It makes sense why they wouldn't want to allocate so many resources towards fine tuning but still the enshittification of GPT models is real

embedding-shape 29 minutes ago [-]

Huh, the fine-tuned "codex" variants always seemed like "quick specific edit" prototypes that weren't meant for real use. They worked OK when you were very specific, but besides that, nowhere close to GPT5.X and the other "real" models.

wiether 20 minutes ago [-]

Since Codex-5.3 came out it was my daily driver for everything: quick scripting, greenfield projects, new features on old projects...

Idk if it was the harness (OpenCode), my AGENT or my prompts, but I was getting exactly what I wanted, and quickly.

With GPT-5.5 it tries to play smart, takes much more time and is often stuck on basic stuff that DeepSeek solves in one shot.

embedding-shape 17 minutes ago [-]

> With GPT-5.5 it tries to play smart, takes much more time and is often stuck on basic stuff that DeepSeek solves in one shot.

Do you have any session logs or similar that show this? Never once, since I started using the codex TUI when it became available, have GPT models gotten stuck on something another model breezes through. I quite literally run every prompt through multiple providers, so this would be very visible very quickly for me.

I remember trying every -codex variant of the models and could never get them to be productive for tasks taking longer than 5-10 minutes, compared to GPT 5.5, which quite literally worked through the night (with the /goal feature) and actually had something valuable and useful in the end this morning that wasn't exploding in LOC and complexity. I don't think any of the -codex variants would have been able to do this at all, based on how they worked when I last used them.

frankohn 3 hours ago [-]

I think hallucination rates are not a matter of model size but depend on the training of the model. Models have been trained on a huge corpus of material that overwhelmingly has well-formed questions and well-formulated, correct answers. This is typically the case with books, where the material is highly curated by experts in the field. In a book you never see a question which admits no answer, with the book just reasoning and explaining why and how the question has no answer. Nor will you see a good question with the book candidly explaining that it doesn't know the answer, because the way book material is curated, the author will omit the questions for which they have no answers.

In addition, I think that during RLHF the labs have a bias for interesting questions that admit a solution and under-represent the "bad" questions that admit no good answer. They probably also put less effort into RLHF on questions where the model should admit it doesn't know.

As humans we have been trained all our lives, in the real world, by being confronted with questions whose answer we don't know right away, and we have learned to very quickly assess that we don't know or that we are not sure about the answer.

Another thing we have and LLMs do not is fear. We have an amygdala in our brain, separate from the logical thinking part, that can raise a signal of fear so that we become much more careful about what we say. On the other hand, an LLM has no fear organ like the amygdala and just learns to respond based on the patterns in its training corpus. It never "fears" looking bad or being fired because it gave a wrong answer, so it can merrily give perfectly wrong answers.

So, we see hallucination rates can be improved with training, but currently the labs are not optimizing for that because there is a high-stakes race to get the most intelligent and capable model.

Alternatively, I can see creating a separate amygdala-like organ for an LLM. That organ may asynchronously fire a signal, based on the user prompt and the LLM's thinking trace, to inject a fear signal into the LLM's reasoning so that it can steer its answer toward something safer.

oshrimpton 3 hours ago [-]

I'd definitely agree that it isn't directly model size, but there is the fact that a larger model in terms of parameter count needs a large amount of training data to not overfit or underfit. So I think this race to the top of "max training data size" has kind of led to unintentional overfitting, not catastrophically, but enough to trigger this perceived omniscience within the model

leobg 2 hours ago [-]

Skinner would say it is not so much about emotions like fear or greed, but about consequences.

frankohn 2 hours ago [-]

Yes, that's when we are mindful: we see the fear arise in our mind, but we don't directly act out of it; we understand it and reason about our options and the consequences.

However the fear has to arise in the first place, to raise the alert.

xlii 2 hours ago [-]

My anecdotal experience differs (though I hold my ground that LLM evaluations are highly subjective and benchmarks are just as useful for LLMs as they are for dating website users).

GLM 5.2 tends to stray way more than and 5.1. It also subtly hallucinates things: it morphs requirements and draws unfounded conclusions. This output is not something I've experienced in any model I've seen so far.

In coding it's especially annoying because it steers the whole request. E.g. I give the instruction "make me a Rust-WASM-Canvas app" and GLM 5.2 goes like "Oh, the user surely doesn't mean that. I'd better build a Dioxus app instead".

LaurensBER 2 hours ago [-]

GLM 5.2 is great but it heavily deteriorates once the context window gets past 200k tokens.

I've had more success with creating a plan first and then implementing it in (short-lived) sub-agents.

Ironically good software architecture patterns (small functions, single responsibility) heavily impact the performance of these models as well. They do surprisingly well in well-architected codebases.

They do very poorly in anything that's a mess where Opus and GPT 5.5 still get reasonable performance.

oshrimpton 2 hours ago [-]

Yeah the benchmark for sure isn't perfect and without super rigid prompting it is far too easy for it to get off course. 28% hallucination rate isn't nothing either

Naveja 16 minutes ago [-]

loving glm 5.2 personally

raincole 2 hours ago [-]

> meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer.

From how they measure it, a model that simply answers "I don't know." to any prompt would be the one that hallucinates the least. So it's not surprising at all that a smaller model can perform better.
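A toy grader illustrates the degenerate case described above. This is a sketch, not the AA-Omniscience harness: the questions, the `ABSTAIN` string, and exact-match grading are assumptions made for illustration.

```python
# Hypothetical abstention string and gold answers (assumptions for this sketch).
ABSTAIN = "I don't know."
GOLD = ["Paris", "4", "H2O"]


def grade(responses: list[str], gold: list[str]) -> tuple[float, float]:
    """Return (accuracy, hallucination_rate) under exact-match grading."""
    correct = sum(r == g for r, g in zip(responses, gold))
    abstained = sum(r == ABSTAIN for r in responses)
    incorrect = len(gold) - correct - abstained
    not_correct = incorrect + abstained
    accuracy = correct / len(gold)
    rate = incorrect / not_correct if not_correct else 0.0
    return accuracy, rate


# A model that abstains on everything: useless, but never "hallucinates".
always_abstain = grade([ABSTAIN] * len(GOLD), GOLD)

# A model that always answers and gets one of three questions right.
always_guess = grade(["Paris", "5", "CO2"], GOLD)

print(always_abstain)  # (0.0, 0.0)
print(always_guess)
```

The always-abstaining model gets the best possible hallucination rate with zero accuracy, which is why the rate is hard to interpret without the accuracy figure next to it.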

EbNar 3 hours ago [-]

The fact that a huge number of parameters may lead to worse hallucinations is something I didn't think of. Would this somewhat imply that DeepSeek V4 Flash should be less prone to these issues?

oshrimpton 3 hours ago [-]

Surprisingly not! It is the biggest hallucinator on the AA Omniscience Index, just 2pp away from V4 Pro. I think this is partially due to the fact that Flash was trained on >32T tokens just like Pro despite being almost 10x smaller - it seems somewhat likely it was overfit.

cwillu 9 hours ago [-]

Please don't editorialize titles unless the original title is misleading.

nextaccountic 8 hours ago [-]

>GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.

What about using two models, with a smaller model used for this kind of negative reasoning?

bastawhiz 8 hours ago [-]

Now you need a third model to decide if the two other models disagree

spwa4 2 hours ago [-]

Why is everyone expecting LLMs to be like the Star Trek computer? I wonder if anyone's ever measured what the hallucination rate of a human is.

master-lincoln 54 minutes ago [-]

Yeah, it has been looked at, e.g. in [0]. They separate that from lying, but I think for the LLM context it should be included. To me the difference is that humans do not bullshit at the same rate, and I can find out over time who tends to bullshit more and exclude that person's info from my pool.

> Why is everyone expecting LLMs to be like the Star Trek computer?

Because they are often marketed as magic AIs, not as mere language models.

[0] https://bpspsychub.onlinelibrary.wiley.com/doi/10.1111/bjso....

bravetraveler 1 hours ago [-]

Marketing, essentially

oshrimpton 2 hours ago [-]

I would be so curious to find a comprehensive benchmark on this, humans do have an unfortunate ahem Dunning-Kruger effect ahem tendency to do this

abracadobre 2 hours ago [-]

This is where I asked GPT 5.5

"they say u hallucinate 3x more than GLM 5.2, whats your comeback to this? do i need to dump u? $article"
