LLMs approach expert-level clinical knowledge and reasoning in ophthalmology (ft.com)
78 points by marban 14 days ago | 102 comments




> FRCOphth Part 2 questions were sourced from a textbook for doctors preparing to take the examination [17]. This textbook is not freely available on the internet, making the possibility of its content being included in LLMs’ training datasets unlikely [1].

I can't believe they're serious. They didn't even write any new questions?


I’m sure they’ll be surprised how many books aren’t freely available on the internet that are in common training sets.

It would have taken them a few minutes to learn about the current lawsuits around book piracy.

I agree with you. That they didn’t try asking novel questions or even look into what training data is used makes this paper bunk.


Funny thing is that this textbook can be easily found on LibGen. I do not know a lot about LLM datasets, but they probably include books from these shadow libraries, right?

Nice catch. GPT-3 at least was trained on “Books2” which is very widely suspected of containing all of Libgen and Zlibrary. If the questions are in Libgen, this whole paper is invalid.

Considering that textbooks are probably the single highest quality source of training data for LLMs, I would be very surprised if OpenAI wasn't buying and scanning textbooks for their own training data (including books that aren't even in Books2).

It’s highly unlikely that they will spend hundreds of thousands of dollars on buying their own copies when it’s not even remotely clear that doing so will be enough to answer the copyright violation cases they are facing.

Not just that but wouldn't the references for the textbook be mostly research papers that would be freely available?

I didn't even bother checking Libgen, because if it wasn't in Libgen, it'd be in some other dataset.

"Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study" (2024) https://journals.plos.org/digitalhealth/article?id=10.1371/j...

> [...] We trialled GPT-3.5 and GPT-4 on 347 ophthalmology questions before GPT-3.5, GPT-4, PaLM 2, LLaMA, expert ophthalmologists, and doctors in training were trialled on a mock examination of 87 questions. Performance was analysed with respect to question subject and type (first order recall and higher order reasoning). Masked ophthalmologists graded the accuracy, relevance, and overall preference of GPT-3.5 and GPT-4 responses to the same questions. The performance of GPT-4 (69%) was superior to GPT-3.5 (48%), LLaMA (32%), and PaLM 2 (56%). GPT-4 compared favourably with expert ophthalmologists (median 76%, range 64–90%), ophthalmology trainees (median 59%, range 57–63%), and unspecialised junior doctors (median 43%, range 41–44%). Low agreement between LLMs and doctors reflected idiosyncratic differences in knowledge and reasoning with overall consistency across subjects and types (p>0.05). All ophthalmologists preferred GPT-4 responses over GPT-3.5 and rated the accuracy and relevance of GPT-4 as higher (p<0.05). LLMs are approaching expert-level knowledge and reasoning skills in ophthalmology. In view of the comparable or superior performance to trainee-grade ophthalmologists and unspecialised junior doctors, state-of-the-art LLMs such as GPT-4 may provide useful medical advice and assistance where access to expert ophthalmologists is limited. Clinical benchmarks provide useful assays of LLM capabilities in healthcare before clinical trials can be designed and conducted.



My wife had gone through half a dozen doctors or more over a decade, giving them the same description of her issues. They never correctly diagnosed the issue. We typed her descriptions into chatgpt and it suggested multiple sclerosis. It was right. It must be what other folks feel when they talk to tech support, but I've never, not once, been impressed by a doctor and have found them lacking in nearly every interaction.

The best doctors are the ones who listen carefully to the patient before dispensing their expert opinion. Very few do.

Perhaps that’s just because they’re jaded from having to deal with the public all day long…


HN is going to select for a very small subset of the population, so unfortunately people's experiences have to be interpreted in that light.

A massive chunk of my caseload is the direct result of people not listening to basic instructions. Between people not listening and people who don't understand shockingly basic fundamentals of health, you become jaded.

Add to that the active disincentives to doctors really digging into cases, and we end up where we are. It feels like much of my knowledge goes unused and wasted.

As an aside, I have messed around with chat gpt 4 for how well it can mimic what I know. My personal experience is that I can see how it would impress non-medically trained people, but to my eyes it has a long way to go. It struggles to accurately rank differentials. Also once you press deeper into decision making it gets really scary. It was consistently presenting treatment plans that would have killed patients if followed. I would love more than anything to have AI cut down some of our work but I was really disappointed with how it currently performs. Hopefully improvements keep coming.


> I can see how it would impress non-medically trained people, but to my eyes it has a long way to go

I think this is spot on. It's easy (with this kind of technology and amount of data) to superficially mimic expertise. It looks impressive at first, but when scrutinised by a real expert it becomes clear how superficial it really is.


> people not listening to basic instructions.

Unless the basic instructions are "put down the fork, you piece of shit fat fuck", you aren't doing your job.

Presumably, you know that ~80% of medical costs are due to chronic illness, and you also know that ~80% of chronic illness is attributed to obesity, correct?

Unless you are fat-shaming 64% of your patients, you are committing malpractice.

Unless you are agitating for the hospital cafeteria to remove the Hostess cakes and sodas, you are committing malpractice.

But that's not what you people do, is it? You prescribe metformin, statins, beta-blockers, and write angry posts on hackernews that they're not compliant with their medication schedule. You dance around the root cause of 64% of your patient visits because if you addressed the root cause, your billables would decline.

Your entire profession is predicated on keeping your own patients sick, and prescribing workarounds to the root cause of 64% of them. And the remaining 36% of them, who are sick through no fault of their own, are waiting weeks just to see you, because you refuse to do the right thing and tell your patients that their wives are going to fuck the poolboy because they're too fat to see their own dick.

"Patch Adams" has destroyed the medical industry. Your ridiculous concern for "bedside manner", and being nice and polite, is killing Americans in droves, and you guys are more than happy to scoop up the profits in drawing out the process as long as possible, but not fixing it.


Funny enough the people who don't listen talk just like you do.

If you had some legitimate point in your rant I would engage with it. But seeing as you posted a string of delusions on the level of "they put microchips in the vaccines", it's clear there is no knowledge worth wasting on you. I wish you extraordinary luck in society.


It's certainly unsurprising that a representative of the profession that misdiagnosed my mother's hematoma as a tumor - while she was bleeding internally at a rate of 1 unit per hour - would lack in both reading comprehension and math skills.

I wish you luck, too. Keep murdering Americans at a rate of 240,000 annually due to your own mistakes while considering yourselves the good guys. NBC medical dramas will continue to carry water for you, so your public image will be forever untarnished as the Great American Heroes you were destined to be.


Ah the root cause is found. If you don't want your seething anger to continue to consume you into adulthood it would be best for you to seek extensive therapy.

It's unfortunate how well-adjusted adults on HN seem to be increasingly outnumbered by angry children from reddit.


You broke the site guidelines more than once in this thread.

Flamewar comments will get you banned here, regardless of how wrong someone else is or you feel they are.

If you'd please review https://news.ycombinator.com/newsguidelines.html and not do this again, we'd appreciate it.


I wasn't aware that medical school cited fMRI studies showing that emotional response immediately halts in the human brain after 18. The rot goes deeper than I had ever imagined. Johns Hopkins ought to be investigated.

This may be shocking for you, but it is indeed possible for a person to experience justifiable anger in their forties. Particularly when said person's personal dataset of your profession's gross incompetence causing direct medical catastrophe to loved ones is not n=1, but n>10.

I'm heartened to know that when members of your profession legally get away with murdering their patients due to their own "mistakes", standard practice is to console the survivors by calling them "angry children". That has to be satisfying to psychopaths, but not quite as satisfying as raping the deceased's estate for everything you can.

One can only imagine how nice it must be to work in a profession in which success is rewarded with precisely the same compensation as failure.

Patrick Bateman definitely chose the wrong career.


You broke the site guidelines more than once in this thread.

Flamewar comments will get you banned here, regardless of how wrong someone else is or you feel they are.

If you'd please review https://news.ycombinator.com/newsguidelines.html and not do this again, we'd appreciate it.


Sorry, dang.

I know the guidelines, it's just so hard to not get baited sometimes. I'll do better.


>> This may be shocking for you, but it is indeed possible for a person to experience justifiable anger in their forties.

Justifiable does not make it a healthy obsession.

Justifiable resentments destroy people.


I agree with you - listening is very important.

I worked in an ophthalmology practice when I was quite young many years ago. The head of the practice often had over 50 patients a day (one of my responsibilities was pulling the folders for all daily patients from the wall-o'-folders; this was in the age everything was on paper).

50 patients over 8 hours gives you less than 10 minutes per patient. I suspect a nontrivial part of the problem is being rushed.


Regulation could help solve that. You shouldn’t see so many patients in a day unless you also double your working hours.

I’m surprised this was many years ago. Now this pressure exists because medical school is so expensive and people need to pay off their loans. Want to see improvements in care? Make medical school free / cheap.


It's also extremely long. Assuming a 3-year residency, the average doctor doesn't even start practicing until 31.[1] So you've not just got a house's worth of debt without a house, you've also lost a decade to training.

I'd love improvements in care, but nobody is going to fix this problem. It's too lucrative for everyone else involved.

[1] https://averagedoctor.com/the-average-age-of-doctors-graduat...


Figure out how they are paid and that will tell you much. In Alberta they get paid by the visit and many are using ten minute appointments as the norm. Often takes six weeks for me to get an appointment with my family doc.

It goes further than this. As an employee, a doctor is pressured into this kind of schedule. They don't like it any more than you do.

The way med school works, by definition, is that it filters for very conformist, hard-working individuals, then fills them up with the most common symptoms, diseases and treatments and turns them into human dictionaries. There are of course many exceptions to that rule, but if you look at medical textbooks, that is the structure most of them follow. Humans competing over who is the best dictionary is a terrible way to solve this problem, although it might have been helpful historically.

I recently had a retinal detachment, not entirely sure why; no doctor could tell me anything useful besides "you must have been stressed". I got it lasered despite my better judgement, because I knew that fixating something in the eye somewhat arbitrarily must cause problems, and it left me with blurriness and aniseikonia (a condition where one eye sees things smaller).

I don't need an LLM for that, just the right google queries do the trick. This is by far not the only issue I have with doctors, but my experience with doctors has been vastly different between exceptional doctors and people with absolutely no interest in any kind of in-depth root cause analysis, and the vast majority of them have been in the latter category.


> I don't need an LLM for that, just the right google queries do the trick.

LLMs are clearly superior at presenting the information and have tangible room for improvement, whereas Google has regressed over the last decade and is getting more brittle by the day.

Open technologies (LLMs) are going to offer much more robust and reproducible solutions to such questions as "here are my symptoms, what's wrong with me?" than Google.

You can nitpick my statements all you want and I will probably agree with you, but the overarching takeaway should be that LLMs are a much better solution, especially for laypeople who cannot use Google effectively.


LLMs are excellent at presenting more common diagnoses first. With Google you can easily find some rare and scary disease that fits your symptoms, that trips up a lot of casual Googlers.

Sometimes when you look at studies, those rare diagnoses turn out to be really common, just not part of the standard curriculum. In those cases LLMs perform more like a normal doctor, giving you all the normal junk a doctor would say.

What google needs is a "search journals" toggle on its normal search, not related to scholar.google.com since those results are also different.


You’re unlikely to find anything useful because everything has been SEOed up af. Useful results, if not already in the SEOed sites, are buried pages deep into the search. Content explosion did not start with generative AI, so search quality has been deteriorating for quite a while.

In my experience (just having used it the other day to look at a blood test) ChatGPT is quick to tell a healthy person they are sick.

This doesn’t surprise me.

I find most doctors are terrible at debugging root cause.

They’re simply working through a protocol of most common symptoms/cause and if your ailment doesn’t fit these narrow requirements, you’re screwed. Your 10 minute consult has ended and they’re not thinking about the problem anymore.


The most horrifying thing has been being an adult and seeing how terrible medical doctors are at diagnosis (of my kids).

If there are no objective measurements, do your own research and get second opinions. Maybe this works because I come from a science background.

Heck, today I'm getting closer and closer to trusting ChatGPT-4 over a doctor. My wife has used it to help diagnose patients and 2 of them had life-changing success. (Okay, hold up before my keyboard warrior pals think we are using ChatGPT-4 like a magic 8-ball, closing our eyes and showing the patient the response. No, it doesn't work like that.)


I had blood pressure issues for 25 years.

“Lose more weight.” “Less sodium.” “More medications.” (Four to control my BP at its peak.)

Four cardiologists. Three nephrologists. All have WebMD level recommendations at best.

I do my own research. I see hyperaldosteronism as a possibility. Cardiologist didn’t think so. But reluctantly gives an endocrinologist referral.

Yes, I have it. Had my left (overactive) adrenal gland removed because of it.

Now, BP is normalized with significantly less medication.

Your health is in your hands. Do your own research.

No one else will save you, including the doctors.


That’s great for you! However, what a sad state of affairs that it took effort on your part to uncover your own diagnosis like that. I wonder why. There are a few levels to it, like 1) why didn't the doctor even think of it, and 2) why was he so hesitant to give a referral. I try not to be cynical and give doctors the benefit of the doubt.

> 1) why didn't the doctor even think of it

This is such a strange question. Of course they thought of it.

Imagine you're a software developer, and you're not paid on the basis of software quality, responsiveness, features, or any other standard metric. You're paid on number of bugs closed, even if the bugs are only closed temporarily.

Your incentive structure is such that if you write a kludge for a bug, close it, and there is a regression on that same bug in 3 years, you get paid twice for the same bug.

Now you know what it's like to be a physician. Maybe when you were younger, you idealized making great quality software. But as of now, you'll make way more money putting band-aids on bugs and your lifestyle already requires a $20k monthly nut for your instagram "model" girlfriend's travel requirements, so you do what you have to do.

This is the world of medicine. You don't care about the patient, you probably know what the correct diagnosis is, but this other diagnosis will make sure that the Eli Lilly pharma rep will top you off in the handicap stall at your next "meeting" if you hit those prescription quotas, and your patient will keep up your billables, so that's what you choose to do. Also, the last thing you want to do is refer your patients to that asshole gland removal surgeon, because he has a better handicap than you do at the Doctor's Country Club, so fuck him.

You will NEVER go wrong assuming the worst of physicians.

Keep your BMI on the low side, exercise regularly, and avoid known carcinogens. If you're lucky, this approach will delay you from ever having to interact with these psychopaths until your sixties.


I wonder if these things are somewhat caused by systems that need referrals.

Btw., you still needed someone to remove your gland, I take it, so at that point someone else actually did their job, no?


The surgeon did her job regarding the adrenal removal. Agree.

But even with my consultation with her, I bring up the problematic adrenal and the possibility of it being a pheochromocytoma.

She says "For that, I would have to see a scan about 10 years before to see if there's any growth."

To which I responded, "You mean like my PET scan from 2008 of the same region mentioned in the first paragraph of your report?"

She barely read my report.

Sloppiness.

Parallels with the software industry - I think doctors could benefit from being paired with a complementary doctor to act as a team and to check each other's work.

I think they simultaneously have a god complex for which their assessments are rarely checked. They need social pressure to probe further.


Would it have mattered prior to surgery if it was a different type of tumor (and rare at that)? Otherwise, I could see why that isn't a discussion worth having anyway.

It may not have mattered.

A response of "A cancerous pheochromocytoma is incredibly rare. I don't see any other signs of cancer, so I didn't think to analyze those scans."

That would be a completely reasonable response.

But a deer-in-headlights, "I had no idea" type of look, is sad.

She is a gland expert.

Probably getting paid $500/hour for the consultation.

The least she could do is know the bare minimum of my history outlined in the report.

That is not an unreasonable standard.


Did you have a pheochromocytoma or just a hormonally active adrenal adenoma?

Pheos don’t typically make aldosterone. They tend to make epinephrine and derivatives. Different cell lineage; different endocrine responsibility.


Benign adrenal adenoma.

From the pathology report:

"Representative sections predominantly consist of a normal adrenal gland with intermixed adrenocortical tissue and medulla. A distinct nodular area is present with prominent foamy-type clear cytoplasm reminiscent of normal adrenocortical tissue. No significant cytologic atypia, necrosis, or increased mitotic activity is present. These findings are consistent with an adrenocortical adenoma. Note: this area appears to be limited to the adrenal gland although some adrenocortical tissue is present in the adipose tissue outside the capsule that morphologically appears dissimilar to the nodule and likely represents normal/benign tissue. Clinical correlation recommended."


Tough case. Adrenal adenomas are common incidental findings. ~15% are hormonally active, which means the vast majority are not.

As a radiologist, I sometimes wonder about whether I make too many recommendations to referring doctors (consider endocrine evaluation for a potentially hormonally active adrenal nodule).

A FREQUENT attack on us as a specialty is that we "find too many incidentals" (see attacks on mammography, breast cancer screening, other sorts of screening, ad nauseam).

Perhaps I'll keep doing the adrenal nodule recommendation, although I usually only make the recommendation if it's 1cm or larger.


I could see that if you're evaluating just the imaging, that's a hard call.

Are you provided these details as well?

* Hypertension 20+ years

* Resistant hypertension - four medications with one being a diuretic.

* Early onset hypertension (high school)

* Low potassium

Coupled with the history, hyperaldosteronism seems much more probable.

There are a ton of edge cases/conditions to keep in one's head. I'm sure that's a problem in all domains, definitely medicine.

I wish it could be a multidisciplinary team decision. But then it would become an issue of reaching consensus. And probably too expensive.


Sometimes we have the clinical context, usually if practicing in a large hospital system with an integrated EMR. It's not usually so neatly summarized though; maybe if we are lucky we can quickly glance through relevant notes at the time of scan interpretation.

However, healthcare in the US is very fragmented. Many patients seek cheaper imaging at freestanding imaging centers. Those places often don't have the same HIT integrations to have similar medical context.

And in those settings, I only know what's on the images and maybe 200-300 characters in the "reason for study" box.

This is not to say I think everyone should get scanned at expensive sites; more an indictment on how annoying the current EMR situation is.


I agree; the self-imposed doctor shortage has created God Doctors who think they have every diagnosis.

We need more doctors, but the cartel limits it.


Common things are common, so going through common things first doesn't strike me as wrong. If someone suffers from something really exotic then I'd argue that is what specialists are for.

The 10 minute window is more an external/financial constraint, I'd say.


I saw multiple cardiologists in Lincoln, NE, and San Francisco, CA.

Multiple nephrologists throughout the 25 years.

These are the "specialists" I was referred to from my general physician.

I had to work side-by-side with them to manage my condition.

I had lost significant weight in the past, but it never reduced my blood pressure medication intake.

But there was nothing special about these doctors.

They simply had no answers beyond the typical.

The equivalent for software would be that a bug report comes in and the engineer responds with "have you tried rebooting the servers?" Just superficial, unhelpful recommendations.


Why do you think your diagnosis was so difficult to get to (given that it wasn't something very exotic if I understand it correctly - not an expert of any sort there but just doing a quick search)?

Hyperaldosteronism is a disease of the adrenal gland, so its diagnosis would fall under an endocrinologist.

Most people think of cardiology when thinking about blood pressure. General physicians included. That's natural. Heartbeat -> blood pressure.

Less think of a nephrologist. But they still get recommended by general physicians. The kidneys play a big role in BP regulation.

I never had an endocrinologist recommended. Never.

But aldosterone is produced in the adrenals. It regulates water/salt retention.

I had been experiencing edema for many months after stopping spironolactone. I thought the weight gain was from me getting fatter. But it was water retention. My ankles were swollen. This should have been the first red flag.

I was also responsive to spironolactone when I started it again. I lost 15 lbs immediately (lost the water retention). My BP dropped significantly. That should have been another indicator that it could be aldosterone-related.

I don't think it's on the cardiologist/nephrologist's radar. They have different subsystem concerns entirely.


That's how I approach analyzing a software issue as well. Start with the basics, is the application even running, is it actually reading the file you think it's reading etc.

So often it's not something weird.


One of the main issues doctors face today is that everyone is really fat, and being fat leads to every diagnosis under the sun.

People don’t like being told that being fat is causing their problem (when it almost definitely is), and this is creating a lot of distrust in doctors.

I mean, you see these examples of it being more than just overweight, but that's the minority, and everyone being fat unfortunately contributes to increasing the rate of these misdiagnoses.


I wonder how we get better physician ratings?

Right now, all online reviews are superficial at best.

"He listened to my questions."

"She made my family feel at ease."

"The front office followed up for my next appointment."

I don't care how kind or rude the physician is.

I want them to get my diagnosis correct.

I'd like to see data on each physician - each patient's five- and ten-year follow-up data. Reveal to me if the diagnosis was correct.

I understand the messiness of this data gathering. I know some patients may not want to reveal such sensitive data. Doctors would probably game the scoring system. There may be confounding conditions as well.

But this would be my dream data scenario.


This is exactly what a doctor told me during my health check up as well.

He casually launched into a monologue on how dumb doctors are. I didn't really ask, but I wonder if he does this monologue for every patient he has.

It seemed like for years he had developed this theatrical and interactive comedy piece for all the patients he was doing the check up on.


They are also getting paid by the drugs they prescribe, not by the diseases they cure. Incentives guide actions of many people. Not saying their actions are dictated only by incentives and nothing else, but still, a strong influence there.

“What this work shows is that the knowledge and reasoning ability of these large language models in an eye health context is now almost indistinguishable from experts,” said Arun Thirunavukarasu, the lead author of a paper on the findings published in PLOS Digital Health journal.

FTFA.


What this work actually shows is that a bunch of scientists are ignorant about how LLMs work, but are rushing to publish papers about them anyway. It is ridiculous for Thirunavukarasu to draw this conclusion from GPT-4's performance on a written ophthalmology exam.

From the good folks at AI Snake Oil[1]

> Memorization is a spectrum. Even if a language model hasn’t seen an exact problem on a training set, it has inevitably seen examples that are pretty close, simply because of the size of the training corpus. That means it can get away with a much shallower level of reasoning....In some real-world tasks, shallow reasoning may be sufficient, but not always. The world is constantly changing, so if a bot is asked to analyze the legal consequences of a new technology or a new judicial decision, it doesn’t have much to draw upon. In short, as Emily Bender points out, tests designed for humans lack construct validity when applied to bots.

> On top of this, professional exams, especially the bar exam, notoriously overemphasize subject-matter knowledge and underemphasize real-world skills, which are far harder to measure in a standardized, computer-administered way. In other words, not only do these exams emphasize the wrong thing, they overemphasize precisely the thing that language models are good at.

Also[2]:

> Undoubtedly, AI and LLMs will transform every facet of what we do, from research and writing to graphic design and medical diagnosis. However, its current success in passing standardized test after standardized test is an indictment of what and how we train our doctors, our lawyers, and our students in general. ChatGPT passed an examination that rewards memorizing the components of a system rather than analyzing how it works, how it fails, how it was created, how it is maintained. Its success demonstrates some of the shortcomings in how we train and evaluate medical students. Critical thinking requires appreciation that ground truths in medicine continually shift, and more importantly, an understanding how and why they shift. Perhaps the most important lesson from the success of LLMs in passing examinations such as the USMLE is that now is the time to rethink how we train and evaluate our students.

[1] https://www.aisnakeoil.com/p/gpt-4-and-professional-benchmar...

[2] https://journals.plos.org/digitalhealth/article?id=10.1371/j...


That might be the thing that makes me most optimistic about AI.

Not because they’re super useful. They are, if and only if you use them right, which is a skill few people seem to have.

But because they're illuminating flaws in how we train our students, and acting as a forcing function to _make_ the universities and schools fix that. There's no longer any choice!


Software could "learn" multiplication by simply creating a lookup table for every value to some reasonable degree. A test isn't going to have somebody multiplying 100 digit numbers, or even 10 digit numbers. Every single number up to 5 digits could be done with just 10 billion entries. This doesn't mean that a multiplication test is just testing memorization, or that math is just memory. It simply means that machine-learning learning/application and human learning/application have relatively little in common, in spite of the constant and generally awkward attempts to try to anthropomorphize machine processes.
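To make the lookup-table point concrete, here's a tiny Python sketch (my own illustration, not the parent's; the name lookup_multiply is made up). It uses 2-digit operands so the table actually fits in memory; the 5-digit case described above would need roughly 10^10 entries.

    # "Learn" multiplication as pure memorization: precompute every product.
    LIMIT = 100  # 2-digit operands; 100_000 would give the ~10^10-entry table

    table = {(a, b): a * b for a in range(LIMIT) for b in range(LIMIT)}

    def lookup_multiply(a, b):
        # No arithmetic at query time -- the answer is recalled, not computed.
        return table[(a, b)]

    assert lookup_multiply(42, 17) == 714

It passes any test within its range without "knowing" anything about multiplication, which is the point about memorization versus understanding.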


What you're describing aren't AI at all, there's no intelligence there.


And the cloud is water droplets, not a datacenter. Apple is a fruit, not a product company. OpenAI isn't incredibly open, and so on. There was a time when it was worth dying on the hill that AI is really just a product of ML, or whatever, but that ship has long sailed. It's not really intelligent (most likely), but the term is stuck now. Time to move on; the horse is long dead.

What term are we supposed to use then with regards to concerns over actual AI? It sure feels like the term artificial intelligence was repurposed, and bastardized, to be nothing more than ML and make any concerns over an AI sound ridiculous.

I'm not concerned over LLMs personally, though I do have serious concerns how we'll handle it if/when we develop an actual artificial intelligence. I can't really share those concerns clearly at all if the term AI has been used to make these discussions effectively meaningless.

Neither clouds nor Apple are topics of debate. Concerns over AI have been raised for decades and largely went unanswered, leaving us with tech getting closer and closer to it and no one willing or able to have any meaningful discussions about it at scale. OpenAI has an explicit goal, for example, of creating an AGI. Maybe AGI is the new term for AI, though I disagree with their definitional metric of economic value, which again leaves us with someone trying to purposely build an artificial intelligence without us first deciding the basics, like whether an AI would have rights or whether turning it off would be tantamount to murder.


FTFA?


From The F**ing Article, presumably.


I've been assuming, on HN, it stands for "the featured article".


It’s ok, you can swear here.


It doesn't hurt anybody!

Fuck, fuckity, fuck fuck fuck.


Lots of people publish papers to get cited and be first in the field, without any rigorous research behind it, to try to net them a good career.

Recently, the best thing to do is mostly to ignore this; soon enough, the things that are true will emerge without academia.

The media is the worst; it tends to look for whatever research is click-baity. It's a real waste of time to follow the FT and others.


So... it doesn't work


So it doesn't match?


If smart people made language, yeah, you'd be right. But most people speaking English don't have the capacity, so no, "all but" means "completely", for no logical reason.


No, in this case "all but" means "as good as, or worse than, the bottom 2/5 of doctors tested on a written exam"; see table 1.


I know you are being funny, but I would just say this title is an incorrect use of the idiom. I'm assuming the intent is to have the words “model matches doctors” as clickbait. The “all but” idiom means “very nearly”, not “completely”.


It means very nearly, but I also hate how it means that, because as a foreigner it just sounds as if somebody "did all but the work they were assigned", which to me means they went in a completely different direction or slacked off.

I am very bullish for GPT-5. It's going to surpass even the best diagnosticians we have right now. And we will all have access for $20 a month. Crazy and exciting times.

Do you think hallucinations will be solved with GPT-5? If so, that would be an amazing breakthrough. If not, it still won't be suitable for medical advice.

It will certainly decrease. Also, there are multiple ways to deal with hallucinations. You can sample GPT-4 not once, but 10, 100, 1000 times. The chance of it hallucinating the same thing every time asymptotically approaches 0. It all depends on how much money you are willing to invest in getting the right opinion, which in the field of medicine can be quite a lot.

> You can sample GPT-4 not once, but 10, 100, 1000 times.

Is there a study on improved outcomes based on simple repetitions of GPT-4? I would be very interested in that study. I don't think GPT hallucinations are like human hallucinations, where if you ask someone after a temporary hallucination they might get it right another 9 times, but I could be wrong. That would be an interesting result.


This is called self-consistency: sample multiple times and then select the answer that appears the most. You can read the AlphaCode paper, which uses a similar method. I think we can do even better, though. Over 1000 runs, it's likely that the correct answer appears at least once. Instead of selecting based on the majority, we could use the LLM to judge each answer individually. Since it's easier to verify than to generate, I think this method would work very well. I don't know if any lab has tried this, though.
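A rough sketch of the majority-vote version in Python (my own illustration; query_llm is a placeholder for whatever sampling API you actually use, not a real library call):

    from collections import Counter

    def query_llm(prompt):
        # Placeholder: call your model here with temperature > 0 so that
        # repeated samples can disagree.
        raise NotImplementedError

    def self_consistent_answer(prompt, n_samples=10):
        # Sample several times and keep the most frequent answer
        # (the self-consistency / majority-vote idea described above).
        answers = [query_llm(prompt) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]

The verify-instead-of-vote variant suggested above would swap the Counter step for a second model call that grades each sampled answer.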

This is a solid point.

When humans have vague or incomplete knowledge on a specific topic, the very LAST thing they do is fill in the gaps with their best guesses. Humans are built with no ego, and when they don't have a high degree of confidence in their answer, they always admit to it, and they NEVER fill in knowledge gaps with low-confidence information. Hallucination is a problem that ONLY exists with these evil machine models, not the flawless Vulcan Logic Devices inside our heads.

What these LLM fetishists don't understand is that this hallucination problem is a Very Big Deal, and we should always rely on the infallible human brain, which has never made assumptions, lied, or made shit up.

Whenever you're looking at the third leading cause of death in the United States (preventable medical errors), seeing that a quarter million Americans are murdered annually, with a significant proportion of those errors being gross misdiagnosis, it's Super Important to remember that an LLM that hallucinates will NEVER be suitable for medical advice, because the human brain is a Magical Jesus Box that has never made similar mistakes.

Sometimes I like to watch World War 2 documentaries, and I tear up at the realization that nearly half a million American soldiers lost their lives in four years to those evil Japanese and Nazi scum. War is Hell.

But what is DEFINITELY NOT HELL, is exactly the same number of American fatalities at the hands of human doctors with their perfect, infallible, flawless-in-every-way human brains in only half the time, which is what we have now.

Imagine how bad it would be if we allowed these LLMs to hallucinate medical advice! People might die! It would be horrible!


Medical software and devices are held to a higher standard of correctness. And because that correctness can be empirically measured, why shouldn't it be? At least humans are currently capable of saying "I don't know" and until GPTs are too, I don't think they are suitable for medical advice. But I did enjoy reading the satire, thank you!

There are reasons to believe we've reached a plateau with GPT-4. Check out Gary Marcus's Substack.

Unless he has access to GPT-5 his opinion is worthless.

Please practice what you preach and delete your comment about how great GPT-5 will be.

I am just extrapolating on a trend that started in 2018. I think I am well justified in my opinion, but feel free to ignore the trend and listen to substack fools.

Ask the model to provide an image of its answers written on a chalkboard. Or, ask it to provide you an image of pathological eye conditions and look at the text labels.


I do not trust doctors in general (unless they are scientists); and I also think that clinical knowledge is not that hard in ophthalmology…

Well it is not going anywhere near my eyes.

behind paywall


But please don't post complaints about paywalls. https://news.ycombinator.com/newsfaq.html


How do I read the article?



On questions, not actually examining eyes, though.


It also scored 0% on ethics :D

(Was just one question though, 8/10 humans got that one right)


Maybe it enabled modern MBA mode for that question ;)


On that note, legacy computer vision models have exceeded humans at identifying pathologies from images of retinas, etc., for probably decades now.

doctors are by and large terrible

trained to condescend and ignore / infantilize patients, generally thinking highly of themselves, never providing explanations for anything, thinking they're better than everyone, I can't wait for them to get humbled


Doctors cannot be automated soon enough, and nothing of value will be lost.


