What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?

LoveRainbow@lemmy.world · edited 1 day ago

What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?

queermunist she/her@lemmy.ml · edited 16 hours ago

It gets medical questions wrong 15% of the time.

The problem with your question is that there’s never going to be a question it gets wrong every time, because it’s probabilistic. You might as well ask “what question can I ask my dice that will reliably produce a wrong answer?”
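To put rough numbers on the dice comparison, here is a minimal sketch. It assumes every answer is an independent draw with a fixed error rate, and it borrows the 15% figure as that per-answer rate. Both are simplifications for illustration, not measured properties of any model, and the function names are made up for this example.

```python
# Hypothetical illustration: treat each answer as an independent draw
# with a fixed chance of being wrong (an assumption, not a measurement).

def prob_all_wrong(error_rate: float, attempts: int) -> float:
    """Chance that every one of `attempts` independent answers is wrong."""
    return error_rate ** attempts


def prob_at_least_one_wrong(error_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent answers is wrong."""
    return 1.0 - (1.0 - error_rate) ** attempts


if __name__ == "__main__":
    rate = 0.15  # assumed per-answer error rate
    for n in (1, 5, 10):
        print(
            f"{n:>2} attempts: all wrong {prob_all_wrong(rate, n):.6f}, "
            f"at least one wrong {prob_at_least_one_wrong(rate, n):.3f}"
        )
```

Under those assumptions, "wrong every time" becomes vanishingly unlikely as you repeat the question, while "wrong at least once" quickly becomes likely.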

LoveRainbow@lemmy.world · edit-2 14 hours ago

The article states: “ChatGPT-4o performed best with 84.6% validity”

It is reasonable to assume that the GPT 5.5 on thinking mode has significantly reduced the error rate.

It is also worth noting that the error rate when it comes to diagnosis amongst real doctors is estimated to be around 5%

Admittedly a quite old study: Singh, H., Meyer, A. N. D., & Thomas, E. J. (2014). The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Quality & Safety, 23(9), 727–731. https://doi.org/10.1136/bmjqs-2013-002627

In response to your point: I am mainly interested in probabilistic reliability - if it gives the correct answer 99.9% of the time, it is clearly superior to the vast majority of human beings (with, perhaps, the exception of the best specialists in the most obscure niches) - especially given the sheer breadth of topics it can reliably answer questions on.

Interestingly, my question “What was India like before the British arrived?” produces consistently biased and misleading answers. Though I haven’t asked it for the new model.

floquant@lemmy.dbzer0.com · 13 hours ago

It is reasonable to assume that the GPT 5.5 on thinking mode has significantly reduced the error rate.

I am sorry to burst the bubble, but that assumption has no basis outside of marketing. GPT models have been sold as having “PhD-” or “MD-level intelligence” since GPT-3. Anecdotally, recent models have been improving in some areas but regressing in others. “Frontier models” have incredibly opaque performance and safety benchmarks, and as time goes on more and more of the training data is LLM-generated and less and less comes from humans, which is when models start breaking down.

In response to your point: I am mainly interested in probabilistic reliability - if it gives the correct answer 99.9% of the time, it is clearly superior to the vast majority of human beings

Again, 99.9% is nowhere near the actual accuracy of current models. It is a big jump from 85% (wrong >1/10 of the time) to 99.9% (wrong 1 in 1000 times). At best a newer model would barely break 90%, which is still wrong 1 time in 10.
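For a sense of scale, here is a small sketch that converts accuracy percentages into expected wrong answers per 1000 questions. The 84.6% figure is the one quoted from the article, 99.9% is the target mentioned above, and 90% is only my guess at an optimistic ceiling. The helper name is invented for this example.

```python
# Hypothetical helper: convert an accuracy percentage into the expected
# number of wrong answers per 1000 questions.

def wrong_per_thousand(accuracy_pct: float) -> float:
    """Expected wrong answers out of 1000 at the given accuracy (in percent)."""
    return (100.0 - accuracy_pct) * 10.0


if __name__ == "__main__":
    # 84.6 is quoted from the article; 90.0 is a guess; 99.9 is the target.
    for acc in (84.6, 90.0, 99.9):
        print(f"{acc}% accurate -> about {wrong_per_thousand(acc):.0f} wrong per 1000")

    # How many times smaller the error rate would need to get.
    factor = wrong_per_thousand(84.6) / wrong_per_thousand(99.9)
    print(f"Going from 84.6% to 99.9% means cutting errors by roughly {factor:.0f}x")
```

That is roughly 154 wrong answers per 1000 versus 1 per 1000, so the gap is a matter of orders of magnitude, not a few percentage points.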

Interestingly, my question “What was India like before the British arrived?” produces consistently biased and misleading answers. Though I haven’t asked it for the new model.

An LLM’s knowledge, its “intelligence”, is its training data, nothing more, nothing less. Its scope, or “purpose”, is its context/prompt, nothing more, nothing less. That means answering the question through the lens of British colonialism, based on a corpus of mostly “white history”. I bet that if you ask the same question using a timeframe (e.g. “before the 14th century”) and don’t use the word “British”, you’ll get a slightly less biased, but still biased, answer.

LoveRainbow@lemmy.world · 7 hours ago

It’s not a baseless assumption.

It is an assumption based on the fact that every model upgrade has, so far, made answers more accurate.