FHIR Workbench

Compare Large Language Models on four benchmarks for FHIR (Fast Healthcare Interoperability Resources): FHIR-QA, FHIR-RESTQA, FHIR-ResourceID, and Note2FHIR.

Showing 16 of 16 models

Rank	Model	Size	FHIR-QA	FHIR-RESTQA	FHIR-ResourceID	Note2FHIR	Avg
#1🥇	GPT-4o	Closed	94.0%	92.7%	99.9%	34.7%	80.3%
#2🥈	Gemini-2-Flash	Closed	94.0%	90.0%	96.9%	34.0%	78.7%
#3🥉	Gemini-1.5-Pro	Closed	93.3%	91.3%	93.7%	34.3%	78.2%
#4	Deepseek-v3	671B	94.0%	94.0%	91.4%	32.2%	77.9%
#5	Qwen/Qwen2.5-Coder-32B-Instruct	32B	90.0%	91.3%	88.8%	33.5%	75.9%
#6	mistralai/Mistral-Small-24B-Instruct-2501	24B	88.7%	92.0%	88.6%	34.0%	75.8%
#7	Gemini-1.5-Flash	Closed	92.0%	90.7%	92.0%	24.1%	74.7%
#8	GPT-4o-mini	Closed	95.3%	94.0%	92.1%	16.3%	74.4%
#9	Qwen/Qwen2.5-Coder-7B-Instruct	7B	95.3%	87.3%	89.2%	21.2%	73.3%
#10	GPT-4.5-preview	Closed	90.7%	92.0%	N/A	36.3%	73.0%
#11	microsoft/phi-4	14B	88.7%	89.3%	82.9%	29.0%	72.5%
#12	meta-llama/Llama-3.1-8B-Instruct	8B	85.3%	88.0%	82.0%	20.8%	69.0%
#13	allenai/Llama-3.1-Tulu-3-8B	8B	83.3%	85.3%	73.2%	20.3%	65.5%
#14	google/gemma-2-9b-it	9B	57.3%	82.0%	95.2%	6.3%	60.2%
#15	BioMistral/BioMistral-7B-DARE	7B	85.3%	84.0%	58.1%	7.6%	58.8%
#16	allenai/OLMo-2-1124-7B-Instruct	7B	82.0%	74.0%	61.1%	3.8%	55.2%
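
The Avg column appears to be the unweighted mean of the benchmark scores a model has, with N/A cells skipped: GPT-4.5-preview's 73.0% matches the mean of its three reported scores. This is inferred from the numbers in the table, not stated by the leaderboard. A minimal Python sketch under that assumption (the function name `average_score` is hypothetical):

```python
from typing import Optional, Sequence


def average_score(scores: Sequence[Optional[float]]) -> float:
    """Mean of the scores that are present; None stands for an N/A cell.

    Assumption: the leaderboard skips missing benchmarks rather than
    counting them as zero. This is inferred from the table, not documented.
    """
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("no scores available")
    return sum(available) / len(available)


# (FHIR-QA, FHIR-RESTQA, FHIR-ResourceID, Note2FHIR), copied from the table.
rows = {
    "GPT-4o": (94.0, 92.7, 99.9, 34.7),
    "GPT-4.5-preview": (90.7, 92.0, None, 36.3),
}

for model, scores in rows.items():
    print(f"{model}: {average_score(scores):.1f}%")
# GPT-4o: 80.3%
# GPT-4.5-preview: 73.0%
```

Because the table shows scores rounded to one decimal place, an average recomputed this way can differ from the published Avg by 0.1 for some rows.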

Submit Your Model

Have a FHIR-capable model you want included in the leaderboard? Provide its Hugging Face repository URL, and we'll evaluate it.