Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question-answering datasets spanning professional medicine, research and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes, including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate the Pathways Language Model [1] (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measu
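To make concrete what "accuracy on a MultiMedQA multiple-choice dataset" refers to, the sketch below shows one plausible way such scoring could be implemented. It is an illustrative assumption only: the item format and the query_model helper are hypothetical placeholders, not the paper's actual evaluation code or prompting setup.

```python
# Minimal sketch of multiple-choice accuracy scoring, in the spirit of the
# MultiMedQA evaluation described above. The item schema and query_model
# callable are hypothetical, not part of any published MultiMedQA tooling.
from typing import Callable, Dict, List


def score_multiple_choice(
    items: List[Dict],                   # each item: {"question", "options", "answer"}
    query_model: Callable[[str], str],   # returns the model's chosen option label
) -> float:
    """Return simple accuracy: the fraction of items answered correctly."""
    correct = 0
    for item in items:
        # Render the question and lettered options as a single prompt string.
        prompt = item["question"] + "\n" + "\n".join(
            f"({label}) {text}" for label, text in item["options"].items()
        )
        prediction = query_model(prompt).strip()
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items) if items else 0.0


# Toy usage with a single item and a stub "model" that always answers "B".
toy_items = [{
    "question": "Which vitamin deficiency causes scurvy?",
    "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D"},
    "answer": "B",
}]
print(score_multiple_choice(toy_items, lambda prompt: "B"))  # prints 1.0
```

In practice, the prompting strategies mentioned in the abstract (for example few-shot or chain-of-thought prompting) would change how the prompt string is built, but the accuracy computation itself stays this simple: one predicted option label compared against the keyed answer per question.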